
Sasha Goldshtein is a Senior Consultant for Sela Group, an Israeli company specializing in training, consulting and outsourcing to local and international customers. Sasha's work is divided across these three primary disciplines. He consults for clients on architecture, development, debugging and performance issues; he actively develops code using the latest bits of technology from Microsoft; and he conducts training classes on a variety of topics, from Windows Internals to .NET Performance. You can read more about Sasha's work and his latest ventures at his blog: http://blogs.microsoft.co.il/blogs/sasha. Sasha writes from Herzliya, Israel. Sasha is a DZone MVB, is not an employee of DZone, and has posted 202 posts at DZone.

Visual Studio 2012 C++ Auto-Parallelizer

06.19.2012

As you might have gathered from the few reports on the Web and the initial list of new features in Visual Studio 2012, the new C++ compiler can automatically vectorize loop bodies (a feature I’ve already covered) and can also automatically parallelize loops across multiple threads.

Here’s an example. Consider the classic prime number calculation loop, designed to count the number of primes in a given range:

__declspec(noinline) bool is_prime(int n) {
    for (int x = 2; x < n; ++x) {
        if (n % x == 0 && n != x) return false;
    }
    return true;
}

LONG count = 0;
for (int i = 3; i < N; ++i) {
  if (is_prime(i)) {
    ++count;
  }
}
printf("Count = %ld\n", count);

This is a classic, ripe candidate for parallelization, although we need to be a little careful with the shared count variable. With N = 100000 the loop completes in ~1600ms on my desktop; perhaps the compiler can make it faster automatically.

We go ahead and enable the /Qpar switch in the project properties. This allows the C++ compiler to perform automatic parallelization, but the compiler sometimes still needs an explicit hint about which loops might benefit from it.


This hint is given in the form of a #pragma, which also indicates how many threads you recommend the runtime use:

#pragma loop(hint_parallel(4))
for (int i = 3; i < N; ++i) {
  if (is_prime(i)) {
    ++count;
  }
}

This still takes ~1600ms on my machine, and no parallelization is visible. What’s wrong? The shared variable, of course. The compiler notices that it would be unsafe to parallelize the loop body and refrains from doing it. Changing the loop to…

#pragma loop(hint_parallel(4))
for (int i = 3; i < N; ++i) {
  if (is_prime(i)) {
    InterlockedIncrement(&count);
  }
}

…suddenly works, and brings the time down to ~450ms. Here are a representative call stack and the four threads, showing that the underlying engine is the same as in OpenMP (whose #pragma omp directives were introduced in Visual Studio 2005!):

>Debug.ListCallStack
Index  Function
--------------------------------------------------------------------------------
*1      ParallelizingCompilerCpp.exe!wmain$par$1()
2      vcomp110.dll!_vcomp::C2VectParallelRegion::serialCallback(_vcomp::C2VectParallelRegion * c2pr, int)
3      vcomp110.dll!_vcomp::C2VectParallelRegion::parallelCallback_Guided(_vcomp::C2VectParallelRegion * c2pr=0x002af840)
4      vcomp110.dll!_vcomp::fork_helper_wrapper(void (...) *)
5      vcomp110.dll!_vcomp::ParallelRegion::HandlerThreadFunc(void * context=0x002af7dc, unsigned long index=0x00000000)
6      vcomp110.dll!InvokeThreadTeam(_THREAD_TEAM * ptm=0x002dddd8, void (void *, unsigned long) * pvContext=0x002af7dc, void *)
7      vcomp110.dll!_vcomp_fork(int if_test=0x00000001, int arg_count=0x00000001, void (...) * funclet=0x0f941d54, ...)
8      vcomp110.dll!_vcomp::C2VectParallelRegion::Execute()
9      vcomp110.dll!C2VectParallel(int start=0x00000003, int end=0x000186a0, int stride=0x00000001, int inclusive=0x00000000, unsigned int numChunks=0x00000004, int schedule=0x00000003, void (int, int, ...) * func=0x012018d0, int argcnt, ...)
10     Demo.exe!wmain(int argc=0x00000001, wchar_t * * argv=0x002dc178)
11     Demo.exe!__tmainCRTStartup()
12     kernel32.dll!@BaseThreadInitThunk@12()
13     ntdll.dll!___RtlUserThreadStart@8()
14     ntdll.dll!__RtlUserThreadStart@8()

>Debug.ListThreads
Index Id     Name                           Location
--------------------------------------------------------------------------------
*1     2340   Main Thread                    wmain$par$1
2     8084   vcomp110.dll!_vcomp::PersistentThreadFunc _RtlUserThreadStart@8
3     2444   vcomp110.dll!_vcomp::PersistentThreadFunc @RtlpAllocateHeap@24
4     6764   vcomp110.dll!_vcomp::PersistentThreadFunc _RtlUserThreadStart@8 
>

The documentation is now much better than it was in the Beta, and you can find more details online about the /Qpar compiler switch and the parallelization #pragmas.

Published at DZone with permission of Sasha Goldshtein, author and DZone MVB. (source)
