{"id":55892,"date":"2009-06-06T16:27:00","date_gmt":"2009-06-06T16:27:00","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/pfxteam\/2009\/06\/06\/achieving-speedups-with-small-parallel-loop-bodies\/"},"modified":"2009-06-06T16:27:00","modified_gmt":"2009-06-06T16:27:00","slug":"achieving-speedups-with-small-parallel-loop-bodies","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/achieving-speedups-with-small-parallel-loop-bodies\/","title":{"rendered":"Achieving Speedups with Small Parallel Loop Bodies"},"content":{"rendered":"<p>The Parallel class represents a significant advancement in parallelizing managed loops.&nbsp; For many common scenarios, it just works, resulting in terrific speedups.&nbsp; However, while ideally Parallel.For could be all things to all people, such things rarely work out, and we&rsquo;ve had to prioritize certain scenarios over others.<\/p>\n<p>One area Parallel.For may fall a bit short is in attempts to use it with very small loop bodies, such as:<\/p>\n<blockquote>\n<p>int [] array = new int[100000000]; <br>Parallel.For(0, array.Length, i=&gt; <br>{ <br>&nbsp;&nbsp;&nbsp; array[i] = i*i*i; <br>});<\/p>\n<\/blockquote>\n<p>Such an operation should be readily parallelizable; after all, every iteration is completely independent of every other iteration, and we&rsquo;re dealing with a big data parallel problem.&nbsp; The devil is in the details, however.&nbsp; To handle typical scenarios where the time it takes to complete each iteration ends up being non-uniform, Parallel.For&rsquo;s implementation takes a lot of care to load balance across all threads participating in the loop, and that load balancing comes at a small performance cost.&nbsp; This cost is typically trumped by the performance benefits that come from doing the load balancing; however, when the body of the loop is as tiny as it is in the example above, even small overheads add up.&nbsp; Another overhead that also contributes is the 
delegate invocation required to invoke the loop body.&nbsp; It can be easy to forget when looking at a Parallel.For call that Parallel.For is really just a method, accepting as a parameter a delegate to be invoked for every iteration.&nbsp; That invocation isn&rsquo;t free, and in the above case may even be more expensive than the body of the loop itself.<\/p>\n<p>Fear not, however, as there exist ways to still achieve good speedups on such cases.&nbsp; One way is based on creating larger chunks of work for Parallel to operate on: as the chunk size increases, the overhead costs start to pale in comparison, and speedups are realized.&nbsp; <\/p>\n<p>Consider a new ForRange method you could implement:<\/p>\n<blockquote>\n<p>public static ParallelLoopResult ForRange( <br>&nbsp;&nbsp;&nbsp; int fromInclusive, int toExclusive, Action&lt;int, int&gt; body);<\/p>\n<\/blockquote>\n<p>Unlike For, which invokes the body once per iteration, ForRange will invoke the body with a start and end of a range.&nbsp; Thus, given an initial sequential loop like the following:<\/p>\n<blockquote>\n<p>for(int i=0; i&lt;N; i++) <br>{ <br>&nbsp;&nbsp;&nbsp; DoWork(i); <br>}<\/p>\n<\/blockquote>\n<p>with For it would be parallelized by replacing the for loop with a Parallel.For:<\/p>\n<blockquote>\n<p>Parallel.For(0, N, i=&gt; <br>{ <br>&nbsp;&nbsp;&nbsp; DoWork(i); <br>});<\/p>\n<\/blockquote>\n<p>and with ForRange, it would be parallelized by <em>wrapping<\/em> the for loop with a ForRange:<\/p>\n<blockquote>\n<p>ForRange(0, N, (from,to) =&gt; <br>{ <br>&nbsp;&nbsp;&nbsp; for(int i=from; i&lt;to; i++) <br>&nbsp;&nbsp;&nbsp; { <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DoWork(i); <br>&nbsp;&nbsp;&nbsp; } <br>});<\/p>\n<\/blockquote>\n<p>There are several ways we can now implement ForRange.&nbsp; The first is simply by doing a little math.&nbsp; We can calculate the boundaries of each range and use a Parallel.For to run the user-supplied body action for each range, 
e.g.<\/p>\n<blockquote>\n<p>public static ParallelLoopResult ForRange( <br>&nbsp;&nbsp;&nbsp; int fromInclusive, int toExclusive, Action&lt;int, int&gt; body) <br>{ <br>&nbsp;&nbsp;&nbsp; int numberOfRanges = Environment.ProcessorCount; <br>&nbsp;&nbsp;&nbsp; int range = toExclusive - fromInclusive; <br>&nbsp;&nbsp;&nbsp; int stride = range \/ numberOfRanges; <br>&nbsp;&nbsp;&nbsp; if (range &lt;= 0) numberOfRanges = 0; <br>&nbsp;&nbsp;&nbsp; return Parallel.For(0, numberOfRanges, i =&gt; { <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int start = fromInclusive + i * stride; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int end = (i == numberOfRanges - 1) ? toExclusive : start + stride; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; body(start, end); <br>&nbsp;&nbsp;&nbsp; }); <br>}<\/p>\n<\/blockquote>\n<p>Another way is actually by using Parallel.ForEach under the covers.&nbsp; Rather than doing the math as was done above, we can write an iterator in C# that yields the ranges, and then Parallel.ForEach over those ranges, e.g.<\/p>\n<blockquote>\n<p>public static ParallelLoopResult ForRange( <br>&nbsp;&nbsp;&nbsp; int fromInclusive, int toExclusive, Action&lt;int, int&gt; body) <br>{ <br>&nbsp;&nbsp;&nbsp; int rangeSize = (toExclusive - fromInclusive) \/ Environment.ProcessorCount; <br>&nbsp;&nbsp;&nbsp; if (rangeSize == 0) rangeSize = 1; <br>&nbsp;&nbsp;&nbsp; return Parallel.ForEach( <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CreateRanges(fromInclusive, toExclusive, rangeSize), range =&gt; <br>&nbsp;&nbsp;&nbsp; { <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; body(range.Item1, range.Item2); <br>&nbsp;&nbsp;&nbsp; }); <br>}<\/p>\n<\/blockquote>\n<blockquote>\n<p>private static IEnumerable&lt;Tuple&lt;int,int&gt;&gt; CreateRanges( <br>&nbsp;&nbsp;&nbsp; int fromInclusive, int toExclusive, int rangeSize) <br>{ <br>&nbsp;&nbsp;&nbsp; \/\/ Enumerate all of the ranges <br>&nbsp;&nbsp;&nbsp; for (int i = fromInclusive; i &lt; toExclusive; i += rangeSize) 
<br>&nbsp;&nbsp;&nbsp; { <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int from = i, to = i + rangeSize; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (to &gt; toExclusive) to = toExclusive; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; yield return Tuple.Create(from, to); <br>&nbsp;&nbsp;&nbsp; } <br>}<\/p>\n<\/blockquote>\n<p>(You can download an implementation of ForRange as part of the Beta 1 samples at <a title=\"https:\/\/code.msdn.microsoft.com\/ParExtSamples\" href=\"https:\/\/code.msdn.microsoft.com\/ParExtSamples\">https:\/\/code.msdn.microsoft.com\/ParExtSamples<\/a>.)<\/p>\n<p>In general, we expect the design and implementation of Parallel.For will be right for the vast majority of scenarios.&nbsp; However, solutions like those above can be used to accommodate cases that don&rsquo;t quite fit the mold.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Parallel class represents a significant advancement in parallelizing managed loops.&nbsp; For many common scenarios, it just works, resulting in terrific speedups.&nbsp; However, while ideally Parallel.For could be all things to all people, such things rarely work out, and we&rsquo;ve had to prioritize certain scenarios over others. 
One area Parallel.For may fall a bit short [&hellip;]<\/p>\n","protected":false},"author":360,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[7908],"tags":[7907,7911,7909,7913,7912],"class_list":["post-55892","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-pfxteam","tag-net-4","tag-code-samples","tag-parallel-extensions","tag-parallelism-blockers","tag-task-parallel-library"],"acf":[],"blog_post_summary":"<p>The Parallel class represents a significant advancement in parallelizing managed loops.&nbsp; For many common scenarios, it just works, resulting in terrific speedups.&nbsp; However, while ideally Parallel.For could be all things to all people, such things rarely work out, and we&rsquo;ve had to prioritize certain scenarios over others. One area Parallel.For may fall a bit short [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/55892","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/360"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=55892"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/55892\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=55892"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/
wp\/v2\/categories?post=55892"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=55892"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}