{"id":56010,"date":"2012-03-14T09:59:51","date_gmt":"2012-03-14T09:59:51","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/pfxteam\/2012\/03\/14\/is-it-ok-to-use-nested-parallel-for-loops\/"},"modified":"2012-03-14T09:59:51","modified_gmt":"2012-03-14T09:59:51","slug":"is-it-ok-to-use-nested-parallel-for-loops","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/dotnet\/is-it-ok-to-use-nested-parallel-for-loops\/","title":{"rendered":"Is it ok to use nested Parallel.For loops?"},"content":{"rendered":"<p>Every now and then, I get this question: \u201cis it ok to use nested Parallel.For loops?\u201d The short answer is \u201cyes.\u201d\u00a0 As is often the case, the longer answer is, well, longer.<\/p>\n<p>Typically when folks ask this question, they\u2019re concerned about one of two things.\u00a0 First, they\u2019re concerned that each nested loop will assume it \u201cowns the machine\u201d and will thus try to use all of the cores for itself, e.g. if their code was running on an eight-core machine, they\u2019re concerned that the outer loop would run eight of the nested loops in parallel, each of which would use eight threads, such that there would be 64 threads all churning on the inner loop.\u00a0 Second, they\u2019re concerned that the outer loop\u2019s threads will end up blocking waiting for the inner parallel loops to finish.<\/p>\n<p>Let me start by waving my hands a bit and saying that, for the most part, you don\u2019t need to worry about such things, in that Parallel.For was designed to accommodate this usage pattern.\u00a0 To understand why, it helps to have a high-level understanding for how Parallel.For is implemented.\u00a0 Parallel.For doesn\u2019t just queue MaxDegreeOfParallelism tasks and block waiting for them all to complete; that would be a viable implementation if we could assume that the parallel loop is the only thing doing work on the box, but we can\u2019t assume that, in part because of the question that 
spawned this blog post.\u00a0 Instead, Parallel.For begins by creating just one task.\u00a0 When that task is executed, it\u2019ll first queue a replica of itself, and will then enlist in the processing of the loop; at this point, it\u2019s the only task processing the loop.\u00a0 The loop will be processed serially until the underlying scheduler decides to spare a thread to process the queued replica.\u00a0 At that point, the replica task will be executed: it\u2019ll first queue a replica of itself, and will then enlist in the processing of the loop.\u00a0 Starting to see a pattern?\u00a0 In effect, Parallel.For makes sure there\u2019s always an additional task hanging around that will join into the loop\u2019s processing if another thread becomes available.\u00a0 If no additional threads become available, when the loop\u2019s processing has completed, that additional task will be canceled.\u00a0 Further, the first task the loop creates is actually run synchronously (using Task.RunSynchronously).<\/p>\n<p>What can we infer from this hand-wavy description of how Parallel.For achieves parallelism?\u00a0 Because the first task the loop creates is run synchronously, and because the loop only spreads itself out to additional threads when the underlying scheduler gives it more threads, if the underlying scheduler has no threads to share (because they\u2019re all occupied processing, say, other iterations of the outer parallel loop), the Parallel.For will be content to complete on just the thread that invoked it.\u00a0 Going back to the original question, if I have a nested set of Parallel.For loops, the outer loop can spread itself out across available threads, and the inner loops can mostly complete on just the thread used by the outer loop to invoke the inner loop.<\/p>\n<p>We can see this with a small code example:<\/p>\n<blockquote><p><span style=\"font-family: Consolas\">using System;\nusing System.Threading;\nusing System.Threading.Tasks; <\/span><\/p>\n<p><span 
style=\"font-family: Consolas\">class Program\n{\n\u00a0\u00a0\u00a0\u00a0static void Main()\n\u00a0\u00a0\u00a0\u00a0{\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0const int OUTER_ITERS = 2000;\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0const int INNER_ITERS = 2000; <\/span><\/p>\n<p><span style=\"font-family: Consolas\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Parallel.For(0, OUTER_ITERS, i =&gt;\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0{\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0int innerConcurrencyLevel = 0;\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0int innerConcurrencyLevelSum = 0;\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Parallel.For(0, INNER_ITERS, j =&gt;\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0{\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Interlocked.Increment(ref innerConcurrencyLevel);\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for (int spin = 0; spin &lt; 50000; spin++);\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Interlocked.Add(ref innerConcurrencyLevelSum, Volatile.Read(ref innerConcurrencyLevel));\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0for (int spin = 0; spin &lt; 50000; spin++);\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Interlocked.Decrement(ref innerConcurrencyLevel);\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0});\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Console.Write(&quot;{0}, &quot;, Math.Round(innerConcurrencyLevelSum \/ (double)INNER_ITERS));\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0});\n\u00a0\u00a0\u00a0\u00a0}\n}<\/span><\/p><\/blockquote>\n<p>Here I have nested Parallel.For loops.\u00a0 Each iteration of the inner loop tracks how many of that loop\u2019s iterations are running concurrently, and once the inner loop completes, its average concurrency level is written to the console.\u00a0 Here\u2019s what I see when I run this on my quad-core:<\/p>\n<blockquote><p><span style=\"font-family: Consolas\">2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2,\n1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2\n, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2,\n1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1\n, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n\u2026 \/\/ and on and on<\/span><\/p><\/blockquote>\n<p>Note that while there are some \u201c2\u201ds sprinkled in there (in particular towards the beginning as processing was getting going), for the most part each inner loop was running with a concurrency level close to 1.<\/p>\n<p>We can see this as well by using the Concurrency Visualizer in Visual Studio 11 Beta.\u00a0 I tweaked the above example to remove the concurrency-level tracking code (since that introduces unnecessary I\/O and synchronization for the purposes of this experiment), and here\u2019s what I see:<\/p>\n<blockquote><p><a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2012\/03\/0285.image_thumb_1731A000-1.png\"><img decoding=\"async\" style=\"padding-left: 0px;padding-right: 0px;padding-top: 0px;border: 0px\" title=\"image\" src=\"https:\/\/devblogs.microsoft.com\/dotnet\/wp-content\/uploads\/sites\/10\/2012\/03\/0285.image_thumb_1731A000-1.png\" alt=\"image\" width=\"644\" height=\"160\" border=\"0\" \/><\/a><\/p><\/blockquote>\n<p>For the purposes of this screenshot, I hid some things the Concurrency Visualizer was showing me (e.g. 
the GPU and disk I\/O channels) so that the important results here really pop.\u00a0 The key things to see are that 1) only five threads are being used in the processing of this loop (not 16 as we might expect if, on my quad-core, each inner parallel loop were utilizing four threads), and 2) there\u2019s no red here, meaning there\u2019s little unnecessary blocking happening.<\/p>\n<p>Now, I\u2019m not saying that there\u2019s no way to mess with this logic; I\u2019m sure you could come up with a case.\u00a0 I\u2019m also not saying that I advocate using nested parallel loops; if it\u2019s possible to easily transform your code into one that uses a single parallel loop, that\u2019s likely going to result in better performance, as you\u2019ll be minimizing the number of parallel loops you invoke (each of which has some overhead) and you\u2019ll be avoiding any potential problems that could arise from corner cases.\u00a0 What I am saying is that we designed Parallel.For (and Parallel.ForEach, of course) to behave decently well in these cases, so that if you need to use nested parallel loops, you should feel comfortable in doing so.\u00a0 This is particularly important from a composability perspective.\u00a0 If you implement a method in a library that uses a parallel loop, a consumer of your library should be able to invoke your method from within a parallel loop without having to think hard about how your library method was implemented.\u00a0 This is also important in non-nested cases, where multiple independent parallel loops may simply be running at the same time.\u00a0 For example, with ASP.NET, we want to make sure the system behaves well if the processing of a request involves a parallel loop; many requests may be processed in parallel, and the parallel loop for any given request should not dominate the system and starve the processing of other requests.<\/p>\n<p>I hope that helps.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Every now and then, I get this question: 
\u201cis it ok to use nested Parallel.For loops?\u201d The short answer is \u201cyes.\u201d\u00a0 As is often the case, the longer answer is, well, longer. Typically when folks ask this question, they\u2019re concerned about one of two things.\u00a0 First, they\u2019re concerned that each nested loop will assume it [&hellip;]<\/p>\n","protected":false},"author":360,"featured_media":58792,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[7908],"tags":[7907,7909,7912,7914],"class_list":["post-56010","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-pfxteam","tag-net-4","tag-parallel-extensions","tag-task-parallel-library","tag-threadpool"],"acf":[],"blog_post_summary":"<p>Every now and then, I get this question: \u201cis it ok to use nested Parallel.For loops?\u201d The short answer is \u201cyes.\u201d\u00a0 As is often the case, the longer answer is, well, longer. 
Typically when folks ask this question, they\u2019re concerned about one of two things.\u00a0 First, they\u2019re concerned that each nested loop will assume it [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/56010","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/users\/360"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/comments?post=56010"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/posts\/56010\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media\/58792"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/media?parent=56010"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/categories?post=56010"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/dotnet\/wp-json\/wp\/v2\/tags?post=56010"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}