Hidden Dangers of Iterators & Collections in C#

Name: Hidden Dangers of Iterators & Collections in C#
Uploaded: 2023-02-20T11:00:02.0000000+00:00
Duration: 13 min 43 s

• 1,087 views

Enumerables and iterators are a confusing topic for many people, even those who aren't new to C#. In this video, I walk through code examples based on patterns I've observed in industry over the last decade. These examples have many solutions (beyond what I mention in the video), but I wanted to discuss them because they illustrate real-world misunderstandings of iterators and collections.

Here's the companion blog post for this video:
https://www.devleader.ca/2023/02/22/beware-of-these-iterator-and-c...

Transcript

Today is going to be all about iterators and collections and the different pitfalls that we come across. This is all going to be based on real-world examples. I'd love to hear from you in the comments: what real-world examples have you seen where using iterators or collections caused problems in your production code, or in the projects you work on outside of work? What we're going to be diving into today is some examples of using collections and how that translates into performance impact. We're going to follow up by looking at how an iterator can try to solve that, and then at how we might be able to clean that up. All right, let's go look at the code.

All right, I'm here in Visual Studio. I have quite a large comment here; I'm just showing you this because it's going to be checked in on the public GitHub repository. I will have a link in the description so you can go check all this out and play with these examples. This is based on some of my real-world experience, but the examples themselves aren't lifted out of production code or anything like that.

The first thing we're going to look at is what I would call probably the more common case: you'd write a function that goes and talks to a database. Again, I'm simulating all this, but you're going to return an actual materialized list of the data that you're sending back from the database. On the surface, nothing seems wrong with this. To simulate what a real database might look like, I have a Thread.Sleep here at the beginning. Five seconds is definitely exaggerated, but I wanted it to have a visible impact in the console as we step through these examples. I'm also going to be printing to the console so we can see some timestamps when this completes. And because we're not looking at an iterator to start with, we're actually just
going to be creating the collection of results. We're going to simulate pulling back a hundred thousand results, putting them into a collection, and then returning them. I have seen code like this in production many, many times, where all of the results of your database query get jammed into a collection, whether it's a list or something else, and then after the function is finished we return that whole collection back. This little bit of code here is, again, really just there to simulate, so when I press play on this we'll see the impact of it. What follows is really just some information so we can see how long stuff is taking, and then we can go look at the memory consumption.

We can see we're getting data from the database using the list. It has finished, as it says, connecting, getting data from the database, and it's starting to stream it back. This is all contrived and made up right now, but this part here is basically going to wait 100 milliseconds, or rather 100 iterations, before it sleeps for a millisecond, and then we're going to get the results back. We can see between these two lines in the console that we're waiting five seconds; that corresponds with this initial delay connecting to the database, the simulation part. From there we go from 21 to 37, so if my mental math is correct, that's about 16 seconds to actually get all of the data back into a collection. So that's a full, what, 21 seconds just to pull a hundred thousand records back. Obviously I'm making up the actual delays, so please don't anchor yourself too much to that.

The other thing that I want to call out is that we actually had a memory increase of 10 megabytes. We had a lot of memory allocated just to be able to do two things: check if we have data, and count the data. Just to show you where those two things are in the code again, they're right here: we have this database results list, and then we're asking for Any and Count. What's going on here? How
does an iterator actually avoid this? Because that's a lot of memory to allocate just to be able to do these two things, right? Just to know whether there is data, and just to know the count of the data. For those of you who have built software that connects out to databases and done this before, you might stop and say, "Well, why are you using a dumb query like that? Why would you not just go write a purpose-built query that does a has-data check, or a query that actually does the counting on the database side and returns it?" Yes, you can absolutely do that. My point with this example is that in real code I have absolutely seen this kind of thing come up, and this is one of the instances people run into where using full collections like this will cause them some headaches.

So how do iterators help here? Well, let's try it out in the next part of this example. We have a similarly structured function, and you'll notice that I have a yield return in here. Having the yield return makes this function an iterator, unlike the one before, which put everything into a list and just returned the list without a yield return. We still have the same amount of sleep, and this part is still the same as well. You can pause the video and go back and check it, or check the source online if you don't believe me, but everything here is structured the same except this yield return. I have a couple more print lines here just to demonstrate some interesting things that are happening.

We're now familiar with iterators and the fact that they are more like a function pointer, so you can tell this line on 83 will essentially complete instantly, because we're not actually performing the iteration at this point in time. It's only when we start to materialize the iterator that we pay a performance impact, because that's when it's actually doing the work and not just pointing at the function.
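The difference between the two functions described so far can be sketched roughly as follows. This is not the code from the video or its repository; the class name, method names, and the `Log` list are hypothetical, and the simulated delays are left out so the sketch stays small.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: names are illustrative, not taken from the video's repository.
public static class DeferredExecutionSketch
{
    // Records what the simulated "database" did, in order.
    public static readonly List<string> Log = new List<string>();

    // Eager version: runs to completion and returns a fully built list.
    public static List<int> GetDataAsList(int rowCount)
    {
        Log.Add("list: connected");
        var results = new List<int>();
        for (int i = 0; i < rowCount; i++)
        {
            results.Add(i);
        }
        return results;
    }

    // Iterator version: because of yield return, none of this body runs
    // until something starts enumerating the returned sequence.
    public static IEnumerable<int> GetDataAsIterator(int rowCount)
    {
        Log.Add("iterator: connected");
        for (int i = 0; i < rowCount; i++)
        {
            yield return i;
        }
    }
}
```

Calling `GetDataAsIterator(3)` and assigning the result to a variable leaves `Log` empty; the "connected" entry only appears once a `foreach` or a LINQ method starts pulling items.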
Now we'll run this. I will zoom in. If we check out the timestamps here, we can see that between the first two lines there's essentially no time, and that's because, as we know, an iterator assigned to a variable is just a function pointer; it's not actually executing that function. We only paid the performance impact of the iterator, the simulated five-second database connection, when we tried to see if we had data, and then we pay the full price to go count all of the data. So, has data: we take all of the results from this and assign them to this variable, database results iterator, right here, and then we call Any and Count on it. Any and Count are LINQ methods that will force some enumeration over that result set. When we call Any, like I said, we pay that five-second penalty, and then as soon as it sees that it has at least one result, it stops right away. Cool, so a five-second penalty on that. Then to actually go count, we pay that five-second penalty again and we pay the rest of the penalty to read all the items back. We're going to come back to that in a sec: why did that happen twice? That's something interesting.

The other interesting thing is that if we look at the amount of memory we used, it is significantly smaller. If I do some rough byte math here, we're talking 25 kilobytes versus the 10 megabytes, and this is because we haven't actually taken those results and stored the whole result set in memory. We didn't need to; we only needed to go one at a time. So that's very interesting.

So why did we end up paying a performance hit twice by making two actual database connections? We can see here "DB now sending back results" and then "DB now sending back results" again. Something's weird about this. Why did that happen twice? It comes back to Any and Count. OK, so why does this happen? Well, it stems from the
fact that an iterator is more like a function pointer than an actual materialized collection. That's a really important point. So how do we go fix this? Well, let's go look at the next little bit of code.

All right, so we're still using the same function to iterate; however, we're going to do something a little bit different. On our iterator, we're going to call ToList to materialize the whole result set, and then we can call Any and Count on the result. So what's going to happen when we do that? Some of you that are keen will already know the impact of this. There's the first "DB now sending back results", so that's one database connection so far, and our goal here is to avoid two. It looks like we're still paying that performance impact of having to go get the data set. So wait a second: we just paid the full performance impact, and we're right back to allocating 10 megabytes. What the heck is going on here? Why would anyone go do this, then? Well, let's go see what we can actually change to make this better.

This is a demonstration of two things in a row where using an iterator can lead to behavior that you might not have been expecting, because if you end up calling ToList on it and it's a large result set, you're back at square one, materializing the whole thing. Not what you want. So let's be smarter about what we're doing here.

All right, so what does this code do now? Because this is a little bit of a contrived example, we don't actually ever need the full result set to know if we have data and to know the count of that data. Sure, we have to enumerate the full result set to actually count it, but we don't need to hold all of those values in memory; we never use them in this case. So what we could do instead is actually count: we can use LINQ to count the full set first, just store the count, and then we know whether we have data and we know the count right after. So we can get away
with doing just one iterator call the whole way through the result set. OK, great. Now let's pay special attention to the timestamps that we have here. We can see that we start off, like in basically every case, paying that five-second penalty for sending back data from the database. Then we pay the full penalty of getting all of the data, that full time penalty, from 33 to 48, so 15 seconds. Now you'll notice that checking has data is basically instant, and checking the count is basically instant. And when we check the memory increase, we can see that it's not the 10 megabytes anymore; it's still the 25K. So in this particular example, we could use an iterator to get us the answers we wanted without allocating everything.

OK, so yes, these are slightly contrived examples, admittedly, but these are examples of things that I have seen in production code. Just to call out some things here that I think are important, and I mentioned this earlier: yes, you could, and potentially should, go write different database queries to get you those results, and you could do that without having to use iterators at all. You could write a query that checks to see if data exists. You could write a whole second query to perform a count on the database side and get one result back, which is the count. Yes, you could do that. My point here is that literally in production, over many years, I have seen this type of thing come up where people don't, and they don't because they think that they have access to it at their fingertips using these methods, and they don't realize the performance hit that they're paying.

The other part is that when we have iterators and we're talking about flexibility, you still run into a similar set of problems, except it's more about the performance impact on timing than the memory allocation. That's because people aren't totally aware that iterators act like function pointers and not like materialized collections. So when
you start calling things like Any or Count, different types of things, on your unmaterialized iterators, you end up performing the iteration block again, and depending on what's happening under the hood, that could be very expensive.

All right, so just a quick recap for today: we looked at some contrived examples that reflect real-world usage of iterators and collections, and some of the pitfalls that people run into. My question to you is, first of all, did that make sense? Have you seen this kind of thing come up in production code at work, or have you been encountering things like this in your own hobby code? I'd love to hear in the comments, or if you have different thoughts about different challenges you run into, let's hear about them below as well. Thank you for watching. If you found this interesting, please give the video a thumbs up, subscribe to the channel, and feel free to share this with other people you know who are having challenges with iterators and collections in C#. Thanks, and we'll see you next time.
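As a rough illustration of the double enumeration and the count-once fix discussed in the transcript, here is a minimal sketch. It is not the code from the video; `CountOnceSketch`, `GetData`, and the `Connections` counter are hypothetical names, and the counter stands in for the simulated database connection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: names are illustrative, not taken from the video's repository.
public static class CountOnceSketch
{
    // How many times the iterator body has started (one per simulated connection).
    public static int Connections;

    public static IEnumerable<int> GetData(int rowCount)
    {
        Connections++; // runs each time a new enumeration starts
        for (int i = 0; i < rowCount; i++)
        {
            yield return i;
        }
    }

    // Any() and Count() each enumerate the sequence, so the body starts twice.
    public static (bool HasData, int Count) AnyThenCount(IEnumerable<int> rows)
    {
        bool hasData = rows.Any();
        int count = rows.Count();
        return (hasData, count);
    }

    // One pass: count first, then derive "has data" from the stored count.
    public static (bool HasData, int Count) CountOnce(IEnumerable<int> rows)
    {
        int count = rows.Count();
        return (count > 0, count);
    }
}
```

With `Connections` reset to zero before each call, passing `GetData(5)` to `AnyThenCount` leaves the counter at 2, while `CountOnce` leaves it at 1. Neither version stores the rows themselves.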

Frequently Asked Questions

What are the main pitfalls of using collections in C# as discussed in the video?

In the video, I highlight that one of the main pitfalls of using collections is the performance impact and memory allocation that can occur when materializing large datasets. Specifically, when you pull all results from a database into a collection, it can lead to significant memory usage and longer execution times.

How do iterators help in managing performance when dealing with large datasets?

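Iterators help by allowing you to process data one item at a time instead of loading everything into memory at once. A check like Any can stop after the first item, and while Count still has to enumerate every item, neither needs to hold the whole dataset in memory. The catch, as shown in the video, is that each call that enumerates an unmaterialized iterator runs the iterator body again, so calling Any and then Count pays the connection cost twice.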
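The one-item-at-a-time behavior can be made visible by counting how many rows the iterator actually produces. This is a minimal sketch with hypothetical names (`ShortCircuitSketch`, `RowsProduced`), not code from the video.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: names are illustrative, not taken from the video's repository.
public static class ShortCircuitSketch
{
    // Number of rows the iterator has handed out so far.
    public static int RowsProduced;

    public static IEnumerable<int> GetData(int rowCount)
    {
        for (int i = 0; i < rowCount; i++)
        {
            RowsProduced++;
            yield return i; // execution pauses here until the caller asks for more
        }
    }

    // Any() only needs the first item, so it stops pulling after one row.
    public static bool HasData(int rowCount)
    {
        return GetData(rowCount).Any();
    }

    // Count() has to walk every row, but keeps only a running total.
    public static int CountRows(int rowCount)
    {
        return GetData(rowCount).Count();
    }
}
```

After resetting `RowsProduced` to zero, `HasData(1000)` leaves it at 1, whereas `CountRows(1000)` leaves it at 1000.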

Why might using 'ToList()' on an iterator lead to performance issues?

Using 'ToList()' on an iterator can lead to performance issues because it forces the entire result set to be materialized in memory, just like returning a collection in the first place. It does avoid enumerating the iterator a second time, but you lose the memory efficiency that the iterator offered, which matters when the dataset is large.
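A small sketch of that trade-off, again with hypothetical names (`ToListSketch`, `MaterializeThenQuery`) rather than the video's actual code: materializing with `ToList()` runs the iterator once, and later questions are answered from the list, at the cost of holding every row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: names are illustrative, not taken from the video's repository.
public static class ToListSketch
{
    // How many times the iterator body has started.
    public static int Connections;

    public static IEnumerable<int> GetData(int rowCount)
    {
        Connections++;
        for (int i = 0; i < rowCount; i++)
        {
            yield return i;
        }
    }

    public static (bool HasData, int Count, int ConnectionsUsed) MaterializeThenQuery(int rowCount)
    {
        Connections = 0;
        // Single enumeration, but every row is now held in memory.
        List<int> rows = GetData(rowCount).ToList();
        bool hasData = rows.Any(); // answered from the list; the iterator is not re-run
        int count = rows.Count;    // List<T>.Count is a stored property
        return (hasData, count, Connections);
    }
}
```

For example, `MaterializeThenQuery(5)` reports data, a count of 5, and a single use of the iterator.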

These FAQs were generated by AI from the video transcript.