Write FAST C# Code - Task.WhenAll vs Parallel.ForEachAsync in DotNet
July 20, 2023
• 5,825 views
You've found yourself trying to optimize your algorithm, and the only way you can see to squeeze out more performance is to run things in parallel. We can use a parallel foreach, or we can look at Task.WhenAll in C#! So, do you go with Task.WhenAll or do you leverage Parallel.ForEachAsync? Let's use BenchmarkDotNet and have the benchmarks speak for themselves.
For more videos on programming with detailed examples, check this out:
https://www.youtube.com/playlist?list=PLzATctVhnsghN6XlmOvRzwh4JSp...
Transcript
All right, we all want to write code that goes fast. Oftentimes we're sitting there looking through our algorithms, trying to figure out what we can possibly do to tune them, and we resort to running things in parallel. Of course, if we can run things in parallel and do that work at the same time, we can trim down the overall amount of time that processing is going to take. But if we have a quick look at the results that come up on the Internet when we start searching for this kind of thing, we see a lot of things that look similar but also different suggestions, and at some point you'll probably come across the idea of Task.WhenAll versus Parallel.ForEach, or Parallel.ForEachAsync. When you start reading through some of this stuff, you'll see a lot of conflicting answers on Stack Overflow and across the internet, and today we're going to compare some benchmarks that pit these against each other.

Now, I have a unique situation, because I was trying to optimize some code in my ASP.NET Core application and ran into a problem where I wanted to run things in parallel. I found that I had to switch between Task.WhenAll and Parallel.ForEachAsync, and I'll explain what was going on at the end of this video, but ultimately I had to move between these two different implementations, even though they both run things in parallel, so that my application would end up working. What's going to be very interesting is that the solution I had to rely on is actually at odds with what we see in the benchmarks
we run today, so stay tuned to the end to see exactly why I was having issues in my Parallel.ForEachAsync code versus my Task.WhenAll code. Now let's jump over to Visual Studio, start looking at some BenchmarkDotNet code, and walk through our different scenarios.

All right, I'm here in Visual Studio with BenchmarkDotNet installed, which is a popular NuGet package for benchmarking. We have two benchmark classes to look at, because when we compare Task.WhenAll with Parallel.ForEachAsync we want to simulate two different types of work. In a lot of the threads you'll read about these two implementations, you'll hear the suggestion of IO-bound versus CPU-bound: whether your tasks are waiting on things to complete that aren't the actual processor on your computer, versus a task that is running and busy doing computations. So in these two examples we're going to simulate IO and simulate CPU work to compare them.

To start with the IO example, let's go through what we have here. We have two different parameters that get configured; this is really just to give us a spread over what we'll be exercising, so that we have a range where we can start really small and scale things up to run in parallel. How we simulate IO operations is actually not by doing IO at all, because it might be unreliable to keep running these tests against my actual disk, with whatever other disk activity might be happening. So instead of doing real IO work, we're just going to simulate it. The idea I had is this: when we're waiting on IO, the thread we're on is not doing any work, it's simply waiting for that IO to complete. So I figured that if we put a Task.Delay in, we can simulate how long we're waiting for that IO operation to take.

The first benchmark here is the Task.WhenAll implementation: I'm just starting up a bunch of tasks, then calling WhenAll and awaiting it with all of those tasks inside. The other benchmark is Parallel.ForEachAsync, and the syntax is very similar: same idea, going over a range up to this collection count, which is one of our parameters, with a cancellation token passed in, and that's really the only difference. In both cases they delay for the same amount of time, which is also configured by a parameter. Next, let's jump over to the other benchmark class, where we simulate CPU work. It's very similar, with another parameter to scale up the amount of work we do, and the first benchmark there is again Task.WhenAll.
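Before moving on to the CPU benchmarks, here's a minimal sketch of what the two simulated-IO benchmarks just described might look like. The class, member, and parameter values are my guesses, not the exact code from the video:

```csharp
using System.Linq;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

// Sketch of the simulated-IO benchmark class (names/values are guesses).
public class SimulatedIoBenchmarks
{
    [Params(1, 10, 100)]
    public int CollectionCount;

    [Params(1, 10, 100, 1000)]
    public int DelayMilliseconds;

    [Benchmark]
    public async Task TaskWhenAll()
    {
        // Start every "IO" task up front, then await them all together.
        var tasks = Enumerable
            .Range(0, CollectionCount)
            .Select(_ => Task.Delay(DelayMilliseconds));
        await Task.WhenAll(tasks);
    }

    [Benchmark]
    public async Task ParallelForEachAsync()
    {
        // Same simulated IO, but driven by Parallel.ForEachAsync's scheduler.
        await Parallel.ForEachAsync(
            Enumerable.Range(0, CollectionCount),
            async (_, cancellationToken) =>
                await Task.Delay(DelayMilliseconds, cancellationToken));
    }
}
```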
For my example of CPU work, I'm just going to be calculating random numbers. We don't do anything with them, but we keep the thread busy with that computation. If we go to Parallel.ForEachAsync, again it's very similar; the syntax of how we set it up looks a little different, but the body is very much the same.

I'll show you one more thing at the end of this, because when I was putting this video together, I had run these benchmarks elsewhere and found some interesting results that weren't quite adding up, and by the time I tried to combine everything I couldn't reproduce what I saw. But I did notice that if I changed the syntax a little and used a Task.Run, I got some interesting results when comparing that Parallel.ForEachAsync with the one before. They do look very similar, and the bodies are very much the same between the two, but the difference is that one has this await Task.Run, while the one before it doesn't and instead returns ValueTask.CompletedTask.

Now that you've seen the setup, let's jump into the actual benchmark results and start analyzing what we see there. We'll start by looking at the simulated-IO benchmarks, where we just have the Task.WhenAll and Parallel.ForEachAsync implementations.
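As an aside, the simulated-CPU benchmarks described above, including the Task.Run variant, might look roughly like this sketch. Again, names and parameter values are guesses rather than the video's exact code:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

// Sketch of the simulated-CPU benchmark class (names/values are guesses).
public class SimulatedCpuBenchmarks
{
    [Params(1, 10, 100)]
    public int CollectionCount;

    [Params(1_000, 10_000, 100_000)]
    public int CpuWorkIterations;

    private void DoCpuWork()
    {
        // Keep the thread busy by computing random numbers we never use.
        var random = new Random();
        for (var i = 0; i < CpuWorkIterations; i++)
        {
            _ = random.Next();
        }
    }

    [Benchmark]
    public async Task TaskWhenAll()
    {
        var tasks = Enumerable
            .Range(0, CollectionCount)
            .Select(_ => Task.Run(DoCpuWork));
        await Task.WhenAll(tasks);
    }

    [Benchmark]
    public async Task ParallelForEachAsync()
    {
        // The body runs synchronously and returns a completed ValueTask.
        await Parallel.ForEachAsync(
            Enumerable.Range(0, CollectionCount),
            (_, _) =>
            {
                DoCpuWork();
                return ValueTask.CompletedTask;
            });
    }

    [Benchmark]
    public async Task ParallelForEachAsyncTaskRun()
    {
        // Variant from the video: wrap the body's work in a Task.Run.
        await Parallel.ForEachAsync(
            Enumerable.Range(0, CollectionCount),
            async (_, _) => await Task.Run(DoCpuWork));
    }
}
```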
From top to bottom, the table goes from the most basic cases, with the fewest iterations and the shortest simulated delays, down to the heavier, longer-running cases, so we can see the whole spectrum from very little work to significantly more work. In particular we're going to focus on run time, which means looking at the Mean column.

Starting from the top, in the simplest case: if you think about the loop we want to run, it's just one iteration, delaying by one millisecond. Comparing the two implementations here, they are extremely similar. A little lower, still with one iteration of the loop but a longer simulated delay, it's the same story. Interestingly, if you compare all of these numbers for Task.WhenAll and Parallel.ForEachAsync when we only have one iteration of the loop, they are almost exactly the same the whole way through: Task.WhenAll is a little slower at first, then a tiny bit faster, then a tiny bit faster again, and on the last entry a little bit faster once more. All in all these are very comparable; there's not much difference when we have a single element and we're simulating different lengths of IO delay.

If we go up to 10 iterations of the loop, the middle section, again these are very comparable all the way up through a thousand milliseconds of delay. In fact, not only are they comparable to each other, they're comparable to the first set of tests: a collection count of 10 and a collection count of 1 produce extremely similar numbers. We even see that the 10-count run was slightly faster than the 1-count run, which is kind of interesting in my opinion; we're probably just seeing a tiny bit of setup overhead, plus a little variance in the simulated IO delay per run.

If we go up another order of magnitude on the collection count, things start to get more interesting. This is the first time we see a significant difference between the two numbers: Parallel.ForEachAsync is significantly slower than Task.WhenAll with a very small simulated IO delay but a hundred elements in the list we're trying to run in parallel. Going down to a 10 millisecond delay, the numbers are very similar to the 1 millisecond case. If we hypothesize why this is happening, it's likely because Parallel.ForEachAsync uses a scheduler, while Task.WhenAll just relies on the normal scheduling of those tasks; there's no dedicated scheduler running them in batches, which means Parallel.ForEachAsync restricts how much it will do in parallel. We see a more significant jump with a bit more delay: a little further down we're at about 100 milliseconds versus almost 550 milliseconds, so the difference starts to get more exaggerated. And in the last section, with a one second delay (a thousand milliseconds), we're dealing with about a four second difference.

To me that's a good indicator that we're seeing this difference in scheduling. With Parallel.ForEachAsync there's only so much it will allow in parallel by default, while Task.WhenAll will try running all of them at once; if there were no core available to run a thread it would pause, but because we're essentially just putting these tasks to sleep with Task.Delay, it's able to start all of them and wait on all of them. I thought this was pretty interesting: if we're doing IO-heavy work in parallel, which is what this is supposed to simulate, Task.WhenAll comes out ahead in terms of being faster. It's very comparable when we only have a few things to run in parallel, but as we scale well beyond the number of cores in the machine, Task.WhenAll starts to pull ahead. Looking at the inflection point: I have 14 cores in my machine, so at or below a count of 14, Parallel.ForEachAsync and Task.WhenAll should be very comparable; when we start to go beyond that, that's when
Parallel.ForEachAsync starts capping out how many things it will try to run in parallel, due to how it schedules.

Next I want to look at the benchmarks for the simulated CPU work. I showed you three benchmarks in the code, but I want to focus on the first two, which I think are really important and give a direct comparison to what we just looked at for the simulated IO. The layout is very similar: this column is the collection count we're trying to run in parallel, and instead of a delay in milliseconds to simulate IO, we're indicating the amount of CPU work we want to do. This is a little contrived in terms of what the CPU work iterations directly translate to, but I just want to demonstrate the kind of impact we see on the results when we scale things up by a factor of 10.

Let's look at the results, again focusing on the Mean column. In the first example, with a collection count of one all the way down, essentially nothing is running in parallel, because there's only a single item in the collection. The first four lines show it's very comparable between Task.WhenAll and Parallel.ForEachAsync. We do see an interesting little drop, and I'm curious whether it's an outlier, because Parallel.ForEachAsync was significantly faster than Task.WhenAll there; I personally think it was a bit of a blip, maybe something going on on my machine at the time, because I don't have an explanation for why it would be so much faster. Going down to the next part, where we increase the CPU work iterations by another factor of 10, it's totally flipped around: Parallel.ForEachAsync comes out slower than Task.WhenAll. So the line with a collection count of one and 100,000 CPU work iterations I think is an outlier; I just don't really trust that piece of data, to be honest.

If we scale the collection count up by 10, and recall I only have 14 cores on my machine to work with, Task.WhenAll is already becoming a little slower than Parallel.ForEachAsync. We can see that from the 1,000 count up through the 10,000 count, where Parallel.ForEachAsync takes about half the time, and it pulls even further ahead as we go up another order of magnitude in CPU work iterations; it's a lot more dramatic when we go up another factor of 10, at slightly less than one-fifth of the time of Task.WhenAll. That's pretty interesting. If we scale the collection count by another factor of 10, which again pushes us past that 14-core boundary, and look at the mean, Parallel.ForEachAsync is definitely pulling way further ahead of Task.WhenAll, and that continues down through the rest of the benchmarks. In fact the run time is scaling by almost exactly a factor of 10 for both implementations through this last result set; it's just that Task.WhenAll was already about five times the run time of Parallel.ForEachAsync in this group, and both scale by approximately a factor of 10.

So far our results indicate that Task.WhenAll is superior when we're doing IO-bound work, and Parallel.ForEachAsync seems superior when we're dealing with CPU-bound work.
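To see the scheduling difference behind these results directly, here's a small standalone sketch (my own illustration, not code from the video) that tracks the peak number of in-flight delays. Parallel.ForEachAsync defaults its degree of parallelism to Environment.ProcessorCount, while Task.WhenAll just awaits tasks that have all been started up front:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrencyPeek
{
    // Runs `run` (which should kick off 100 tracked delays) and returns the
    // peak number of delays that were in flight at the same time.
    public static async Task<int> MeasureAsync(Func<Func<Task>, Task> run)
    {
        int current = 0, peak = 0;

        async Task TrackedDelayAsync()
        {
            var now = Interlocked.Increment(ref current);
            // Record a new maximum if we just exceeded the old one.
            int snapshot;
            while (now > (snapshot = Volatile.Read(ref peak))
                   && Interlocked.CompareExchange(ref peak, now, snapshot) != snapshot) { }
            await Task.Delay(100);
            Interlocked.Decrement(ref current);
        }

        await run(TrackedDelayAsync);
        return peak;
    }

    public static async Task Main()
    {
        // Task.WhenAll: all 100 delays are started before any completes.
        var whenAllPeak = await MeasureAsync(op =>
            Task.WhenAll(Enumerable.Range(0, 100).Select(_ => op())));

        // Parallel.ForEachAsync: capped at Environment.ProcessorCount by default.
        var forEachPeak = await MeasureAsync(op =>
            Parallel.ForEachAsync(Enumerable.Range(0, 100),
                async (_, _) => await op()));

        Console.WriteLine($"Task.WhenAll peak:          {whenAllPeak}");
        Console.WriteLine($"Parallel.ForEachAsync peak: {forEachPeak} (cores: {Environment.ProcessorCount})");
    }
}
```

On a typical machine the first peak is close to 100 and the second is at most the core count, which matches the IO-bound results above.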
Let's go check out that last comparison I had. It's almost identical to what we just looked at, except there's another benchmark, Parallel.ForEachAsync with Task.Run, which, if you recall from the code, just wraps the work from the Parallel.ForEachAsync body in a Task.Run. In the very first case, with a collection count of one, nothing is running in parallel because there's only one item, and we can see that the Task.Run variant takes about twice the time of plain Parallel.ForEachAsync, so there's a bit of overhead with the Task.Run. What's really interesting, though, is that it actually becomes faster than plain Parallel.ForEachAsync in the next little range, and scrolling down a little further, while all of these are similar, it always seems to be that the Task.Run variant is a little faster. In this first group we also see Task.WhenAll coming out a little ahead of Parallel.ForEachAsync, but they are pretty comparable, and keep in mind we're talking about microseconds here.

Going up another factor of 10, like we saw before, Parallel.ForEachAsync starts to pull ahead, and the Task.Run variant is comparable, a little slower than plain Parallel.ForEachAsync, which might be expected given the overhead of starting a task. But in the lower sections the Task.Run variant is actually faster than plain Parallel.ForEachAsync again, and in the final grouping, up another order of magnitude, plain Parallel.ForEachAsync is very comparable, a little faster than the Task.Run variant at first, before the Task.Run variant pulls a little ahead in the last two groups. I don't yet have an explanation for this, but I think it's interesting that while I would imagine Task.Run has extra overhead, it seems fairly consistent at these higher ranges of CPU work iterations that the Task.Run variant finishes a little faster. Just to highlight that: you can see it here where my cursor is, a little faster with lower iterations, and we see it here as well; then it breaks down at this point, where based on the other results we'd expect it to come out further ahead, but it doesn't. So while these are all very comparable and very fast, it's still interesting to me that we see a bit of an advantage with the Task.Run variant in this particular case.

All right, those are some pretty interesting results. To quickly summarize: we saw that Parallel.ForEachAsync was pulling ahead on CPU-bound work once we went over the number of cores in my machine, and Task.WhenAll seemed to
pull ahead when the work was IO-bound, or simulated-IO-bound in this case. I think that's because the scheduler for Parallel.ForEachAsync is doing its job and applying some throttling, while there is no throttling with Task.WhenAll: when the work isn't CPU-bound, it can blast out all of those tasks and let the OS schedule the threads, because they're really just waiting on the IO.

That brings us back to my particular situation, where my solution actually goes at odds with what these benchmarks suggest. In my ASP.NET Core application I was trying to do a bunch of IO work in parallel: some process on my server had to interact with the SQL database and run a lot of queries in parallel. What was happening was that, using Task.WhenAll, my server would totally crash. Given what we just learned about the performance characteristics of Task.WhenAll versus Parallel.ForEachAsync, why wouldn't it be beneficial for me to use Task.WhenAll, when we clearly just saw it was the best way to get performance out of IO-bound work like a database? Well, this goes back to throttling. Interestingly enough, it very well might have been the case that Task.WhenAll would be the most beneficial thing for me to use; however, because there was no throttling, I had a ton of SQL connections opening up in parallel and hitting the limits of the database. Of course I could change the connection limit and work through things like that, but that seemed like a bit of a hack, because there was no actual scheduling to manage how much work I was doing in parallel. The easy solution was to go to Parallel.ForEachAsync and use some of that automatic scheduling that's built in; all of the code still runs super fast compared to running sequentially, but I get that scheduling built in for me.

If I were trying to squeeze out all the performance I possibly could, is there an opportunity to use Task.WhenAll and make it more efficient by writing some custom throttling? For sure. Could I have changed my code to hold open a database connection and then used Task.WhenAll so I don't hit the connection limit? Sure, that could be a solution as well. But for me, the easiest solution that still gave me all of the performance characteristics I needed was to switch to Parallel.ForEachAsync. So that was a bit of a twist compared to the benchmarks we looked at, and I thought I would share it because it's a real-world situation that helped me a lot.

I hope you found this interesting. If you have comments about how the benchmarks were put together, or different thoughts about how we could simulate the IO or the CPU work, I'd love to hear from you in the comments; I'd love to learn from you if you have different thoughts on these optimizations as well. Thank you so much for watching, I hope you found this insightful, and we'll see you next time.
Frequently Asked Questions
What is the main difference between Task.WhenAll and Parallel.ForEachAsync?
The main difference lies in how they handle parallelism and scheduling. Task.WhenAll simply awaits a set of tasks that have already been started, with no built-in throttling, which can lead to resource exhaustion if too many tasks are initiated at once. Parallel.ForEachAsync, on the other hand, includes a scheduler that limits the number of concurrent operations, making it more suitable for scenarios where you want to bound the load on a resource.
In what scenarios should I use Task.WhenAll versus Parallel.ForEachAsync?
You should use Task.WhenAll for I/O-bound operations where you want to maximize throughput without worrying about resource limits, as it can handle many tasks simultaneously. However, for CPU-bound operations, especially when you're exceeding the number of available cores, Parallel.ForEachAsync is often more efficient because it automatically throttles the number of concurrent tasks to avoid overwhelming the system.
Why did I experience issues with Task.WhenAll in my ASP.NET Core application?
The issues arose because Task.WhenAll does not provide throttling, which led to too many SQL connections being opened in parallel, hitting the database limits. In my case, switching to Parallel.ForEachAsync allowed for built-in scheduling, which helped manage the load on the database while still providing good performance.
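The two throttling approaches implied here could be sketched as follows: capping Parallel.ForEachAsync explicitly via ParallelOptions, or keeping Task.WhenAll but gating concurrency with a SemaphoreSlim. The `queryAsync` delegate and the cap of 8 are hypothetical stand-ins, not code from the video:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottlingSketch
{
    // Option 1: let Parallel.ForEachAsync throttle, with an explicit cap.
    public static Task RunQueriesForEachAsync(
        IEnumerable<int> ids, Func<int, CancellationToken, ValueTask> queryAsync)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };
        return Parallel.ForEachAsync(ids, options, queryAsync);
    }

    // Option 2: keep Task.WhenAll, but gate concurrency with a SemaphoreSlim
    // so only a bounded number of queries are in flight at once.
    public static async Task RunQueriesWhenAllAsync(
        IEnumerable<int> ids, Func<int, Task> queryAsync)
    {
        using var gate = new SemaphoreSlim(8); // at most 8 queries in flight
        var tasks = ids.Select(async id =>
        {
            await gate.WaitAsync();
            try { await queryAsync(id); }
            finally { gate.Release(); }
        }).ToList(); // materialize so every task is started

        await Task.WhenAll(tasks);
    }
}
```

Either option keeps the number of open database connections bounded; the first leans on the built-in scheduler, the second keeps the "start everything, await together" shape of Task.WhenAll.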
These FAQs were generated by AI from the video transcript.