SURPRISE Benchmarks - IEnumerable vs Paging (SQL & Entity Framework)
January 3, 2024
• 863 views
In this video, we set out to pit IEnumerables & Iterators against a paging approach. We'll be looking at SQL and Entity Framework core as we compare benchmarks in CSharp.
BenchmarkDotNet comes to our rescue to help us make sense of the results... but... the results are surprising once again!
View Transcript
I was totally wrong well sort of but this changes everything once again in this video we're going to look at benchmarks that compare iterators fully materializing collections as well as paging I'm going to be walking us through some of the Benchmark code so that we can understand it a little bit better and then going through all of the results in more detail to see what stands out this isn't the first time that I've done a video on benchmarking some results like this and to be honest once again I'm quite surprised by the results and no this isn't clickbait I'm going to be explaining the things that I was EXP expecting and totally wasn't expecting and that's okay for me to be incorrect about what I was assuming just a quick reminder before we dive in to check out that pin comment for a link to
my free weekly software engineering newsletter all right I'm going to start off the explanation by giving you a little bit of a disclaimer about some of the benchmarks that I'm looking at here the reason that I have my Benchmark set up in this particular way which I'll explain in more detail is that I'm trying to get a feel for what things were looking like in production code bases that I was dealing with this all stems from the fact that I was dealing with situations where people were fully materializing collections once upon a time then we moved heavily towards iterators and then more recently I've taken a paging approach so I wanted to be able to highlight some of the different ways that I saw people using fully materialized collections or iterators and it doesn't necessarily mean that I think that they were using the right
things but part of this is to try showing that we want to be able to minimize some of the impact when people are using things incorrectly so one of the particular cases we'll look at is counting records and I was seeing that people were basically querying entire data sets to be able to count records we know that this isn't the right thing to do we should write dedicated queries for this if that's what you're interested in but I wanted to be able to show the impact of doing this sort of wrong way if you're using fully materialized collections iterators paging and other options that we have so some of the things that the benchmarks covered that I was interested in include things like fully materializing very large data sets into memory as I mentioned being able to count results and that's either using link or
again fully materialized data sets and in particular when we talk about link doing things like calling any on a result set when really we don't need a full result set or even the type of query that we were using we just wanted to check if there was anything and that happened to be the data that we had access to and before going over to visual studio I just wanted to give you a high Lev view of what the code is going to look like so that when I scan through it you have some of these things in mind already we're going to be looking at result sets of different sizes because I think that's important when we're benchmarking things to see if the performance and the memory usage starts to change over time as we are doing different sizes so we're going to be looking
at result sets that are a th000 items 100,000 items and 10 million items some of the things that we'll look at in particular are using the any link method being able to do different types of counting when we're not directly calling a query that is strictly responsible for counting and then fully materializing entire result sets in terms of some of the implementations we're going to look at I am going to be using Entity framework core and that's going to be backed by an inmemory SQL light database the SQL light database being inmemory was just for Simplicity I didn't want to go set up an entire mySQL database to go connect out to so just something simple like an inmemory SQL a option seem to fit and I'm also going to be comparing doing raw SQL queries with Entity framework core and that's because I still
to this day do a lot of manual SQL queries and I'm not really using a Entity framework or Dapper or anything like that so part of that was for me to try to get some exposure to it make sure that I'm understanding some of the nuances there and opening up my own eyes to what I could be doing and I know you want to see some results so let's go check out some code super quick and then we're going to go do the analysis all right so I'm going to be using benchmark.com benchmarks so you can see right at the top this is just a really lightweight setup to be able to get I'm using the Benchmark switcher and then that way I'm able to go run the benchmarks that I'm interested in from here we just have a simple record that we're going to
be dealing with it just has an ID and a username so very very lightweight and the sample DB context just has users as the data set that we're interested in now before I get into the individual benchmarks what I did was created this Benchmark fixture which is just a reusable class to set up some data and then access the data in the different ways that we're interested in so when I go to do setup I'm creating a sqlite connection um we're using this because I want to also have access to the connection to do raw SQL queries and down here you can see that I'm setting up Entity framework by getting the context created and leveraging that same SQL light connection that we just created above and then in order to have these test scenarios work for the benchmarks I'm seeding the database with a
number of entries and I mentioned that this is going to range from 1,000 up to 10 million depending on the scenario so we save those out and this part here is important for me it's not included in the benchmarking time but I wanted to guarantee that I had the right number of results in the data set and yes this seems kind of silly but I just wanted to pause for a moment because I wanted to send this type of information out to my newsletter audience about a week ago and the issue was that I was creating these benchmarks and when I got the results they just weren't right and I don't mean not right and that they were disproving what I had to say I mean they were doing that but at the same time the actual results that were coming out didn't make any
sense when I was materializing a collection of 10 million items it was saying that it wasn't taking any memory that's not right it should take some amount of memory to store any type of results so for me this was an issue where I was clearly doing something wrong with my Benchmark setup I remember talking to my wife about this and being frustrated and saying I don't know what I'm going to do and she said well why don't you tell them the truth about about the results and I said well I'm trying to but the results just aren't right I need to figure out what I'm doing wrong so this was just a quick sanity check and I wanted to share with you that I was getting frustrated with collecting these Benchmark results but ultimately I think we figured things out this is just a quick
sanity check left in place and that's really going to be the end of seating our data and making sure that we're good to go for our benchmarks so the rest of the methods that are in this class are just the reusable calls that we can do to access the data so in order to get a fully materialized result set from Entity framework I'm just calling the users property and then calling to list and that's going to give us the entire set the next part is going to be using an iterator approach so the only real difference between these two you can see that I'm not doing to list on it and I'm just returning up the I enumerable and I'm going to pause again because this is really interesting it's kind of like rubber ducking when I'm going to make this video but I just
clued into something here and this is a really important part about iterators and I enumerables and the whole whole reason that I started making this video series my goal here if you looked at the name of the method was that it was the iterator approach for Entity framework core I'm returning an I enumerable I don't anywhere in that method have yield return I have no idea if this is truly an iterator and that's a huge Miss opportunity for me to ensure that I'm doing the right thing here so I need to go back after all of this and go address some things that are going to come out of this video I'm hoping that if you have some comments about how some of stuff is set up then I can go address those as well but the point here is that this is supposed to
be an iterator i in fact cannot guarantee that because I don't know what it's doing under the hood for Entity framework core so this one might not actually be an iterator so we might see some interesting results for this one that don't line up with iterator usage below we have getting users that are paged through Entity framework core so you can see that I'm using Skip and take when using Entity framework this should get pushed out into the database so Skip and take otherwise would be working on a result set in memory so if you are doing this kind of thing and it's not translating your link to SQL what would happen is you would get the full data set and then in memory you would skip over the amount specified by opset and then you would go do take on that result set in
memory but because this should be doing link to sequel this offset and Page size should become part of the query and done at the database level next next we get into some of the raw SQL queries so you can see that I'm just doing get star from users and then doing paging in an offset these values get passed in here as parameters and then what I'm doing is putting this information into a list and you can see that the return type is an I readon list because in my previous videos I mentioned that I like using I readon list as a return type and then this variation right below is very similar but because this return type is now a list everything else in here is the same but this return type as a list is because there are supposed to be some optimizations for
doing for each loops on lists and arrays and if I scroll back up when we're dealing with an i readon list I cannot guarantee that the compiler knows that it's backed by a list or an array so we're hoping to see if there's some difference here with using a list in terms of performance this next one here is also very similar you can see that the paging is taken out of the SQL query and then instead of putting the information into a list I am now using an iterator and this one is for Real an iterator because we have yield return as well as I enumerable these are the two ingredients that we need for an iterator and last but not least is getting all of the result sets through raw SQL queries and putting them into a list so you can see here that
I have select star from users no paging and then I'm putting everything into the list so it's very similar to one of the ones that we saw earlier without the paging involved at all and next I'm going to go through some of the the benchmarks but instead of showing you every single one of these because there's a lot and they follow the same pattern I'm just going to show you one of the examples of these to save some time so we're going to go through the any Benchmark example and you can see that I'm doing the data sizes that I mentioned earlier we call this fixture to get set up on the global setup which runs before the set of all benchmarks so that means that as this parameter changes we'll go run this another time next we get into the individual Benchmark runs and
because I pulled that code out to be usable we've already looked at most of what it's doing so that's why I said I don't want to waste all of your time going through all of the examples of these but we can see that when we're doing any for a fully materialized NTI framework core all that I'm doing is calling that method here and I'm just doing count equals not zero so this is a slight optimization because if we're calling any on a list using link it should in theory be able to short circuit for us behind the scenes because it knows it's a list and that that way it has the count property available to us so any in the link method does do this behind the scenes I'm just taking a shortcut same thing with this fully materialized raw SQL one just doing count
so it's not the link method but that's because we have access to this count property as soon as we get into the iterator situations we can see that for both of these I'm doing raw SQL Entity framework core but I just calling any on here and that's because we don't have a count accessible to us keep in mind that this is supposed to be replicating what some people would be doing in production code bases that I saw next we're going to look at a paging approach so if we just wanted to check if there was anything available we could go pull back one record and ask if there's anything there again you'd probably want to go write a dedicated query for this or you don't pull back any records just a Boolean if the count is greater than zero we'll do the same thing with
a page raw SQL query instead of Entity framework core again these are different because the paging approach with raw SQL has an I readon list and the Entity framework core one is still just using an innumerable and finally the last one here is only different from the one above it because we are dealing with a list versus an I readon list you can't see that from how this code is written here but when we looked at the implementations that was the difference and I won't go through all of them in detail but the count benchmarks are almost the exact same you can see that they're just set up in a very similar way but they're doing count or count with the link method instead and as I mentioned I have the fully materialized benchmarks as well so we're going to look at the results of
those so that's all there we already looked at the code behind that it's going to be calling it's just that the Benchmark code is sort of a oneliner that calls into that and now what you're probably interested in is the results of those benchmarks so let me pull those up they're going to be in paint because I had to go adjust the screen a little bit to fit it all so I didn't go doctor the results they're interesting I didn't have to do any doctoring to make them interesting but I needed to fit them on the screen properly so I'm going to have them in paint that's why it's a little bit weird let's go check it out all right there are a ton of numbers on screen but I have them highlighted for us already to call out the ones that are outstanding the
first data set in this magenta color up at the top here is for 1,000 records it looks like using the iterator approach with raw SQL is not only the fastest but has the smallest memory footprint that we can see over here and the worst offender is going to be materializing the full thing with raw SQL so that's the slowest and if if we check out the fully materialized Entity framework core that has the biggest memory footprint otherwise the fully materialized raw SQL one is also pretty terrible in comparison and I think that that checks out this is kind of what I would expect because calling any with an iterator is nice and lightweight we don't have to go materialize the entire data set that's why I expect the allocated memory to be quite low and the run time to be pretty quick as well because
we don't have to get a lot of data interestingly though if we look at the Entity Framework work implementations the memory Footprints a little bit bigger comparatively so this one here is over 5 kilobytes this one here for Entity framework is almost 3,000 kilobytes and then of course we saw this heavy hitter here so comparatively they are higher than doing the other implementations we can see that when we're doing page with the raw SQL query the one that was returning an I readon list happens to be the fastest although it's extremely comparable to returning as a list this is a little bit weird in my opinion because I would have expected that the one would have been faster but they're very very similar so this could just be related to error these two approaches also have the smallest memory footprint with just over 1,000 bytes
now the worst offender is counting with a raw SQL query so that's right here and we can see that the worst Defender for a memory footprint is again a fully materialized Entity framework core set just a quick look at the run times for all of these though they're all pretty bad when you're doing a fully materialized collection or an iterator and the reason that an iterator is also pretty crappy for performance here is because when we have to call the count link method on an iterator we have to go resolve that entire data set in order to be able to count so again nothing too surprising here for me and now when we look at the fully materialized data sets we can see that the fastest implementation is a fully materialized Entity framework core data set however if we're looking at the worst case which
is the fully materialized raw SQL one we can see that they're all roughly in the same range this is not too outrageous when we start comparing them they're all pretty close now the paged raw SQL implementation is also very very similar to the fully materialized Entity framework core but all in all I don't see a huge swing in the different performance here if we start looking at the memory footprint however we can see that the nty framework core implementations of things are a lot heavier so they all are going to resolve the full materialized data set that's why these are higher in memory but it's interesting that the Entity framework ones are across the board higher memory usage than the others but the results sets are going to become a little bit more interesting as we press forward so far what we've seen is that
Entity framework core basically across the board has a higher memory footprint and something else and this one's not very surprising is that the slowest implementations of things seem to be when we have to fully materialize a data set especially even when we shouldn't have to and and the other thing in this isn't really surprising is that the slowest implementations of things happen to be when we're fully materializing a data set and that's either because the implementation we're asking for explicitly is materializing a full data set or because we're using link to do something like counting over an iterator and we shouldn't have had to go fully materialize a data set but we are anyway so these are really in my opinion sort of just misuses of Link or the SQL queries that we're running that one's not totally shocking let's go check out the other
bench marks I'll start to gloss over the things that are largely the same so when we're talking about the performance of fully materialized stuff or the memory footprint of Entity framework that's not going to be totally interesting because that's going to be a common pattern we start to see what starts to become more interesting is that the performance of Entity framework starts to pull ahead in some cases so I want to scroll through that and look for some of those examples we can see that here for the any and count implementations in this section that the difference between what we were seeing before and what we're seeing now is very very similar right the performance of any iterator compared to fully materializing is of course going to pull ahead and when we start to look at the count it's the same type of thing however
when we start to get these fully materialized data sets I feel like we're starting to see a little bit of a difference in terms of the numbers and this might just be because I'm noticing them more when they're blown up to a higher magnitude but let's have a look the fully materialized iterator using raw SQL is what I would expect this is is one of the reasons why originally I tried getting people to move away from using fully materialized data sets in the first place because they were doing things like fully materializing things they shouldn't and then trying to do operations on tons of Records in memory again the right thing would have been to change the query to get what they need however this is just something that I observed in production but we can see that when we're looking at Entity framework here
it starts to pull ahead consistently so we can see that the fastest implementation here is just under 28,000 microc but if we go up to the other Entity framework records we see 28,000 and 28,000 these are consistently faster than the other implementations we have here including when I'm doing raw SQL so I thought that was a really interesting thing to point out that they're notably faster here and again if we look at the memory footprint consistently the ntity framework stuff is a lot worse for the memory footprint if we if we scroll down a little bit lower to the 10 million record size I think we start to see the same type of pattern showing up with Entity framework really pulling ahead again any encount not totally interesting it's very similar to what we saw before but these fully materialized data sets just become really
interesting in terms of the performance of nity framework again we can see that if we're looking at the slowest the fully materialized raw SQL one is by far the worst and it's a similar thing if we're looking at the iterator for raw SQL because once we get that iterator and then try to pull it all into memory it's going to be slow as well so these two things pretty bad the paging implementation for these two things is also kind of bad because it's effectively doing the same type of thing it's just changing the query marginally under the hood with the paging parameters but the paging parameters are the entire data set anyway so what's really interesting to me is that Entity framework again pulls ahead we can see right here and right above in the green is the fastest and again up here at the
top for the fully materialized data set using Entity framework it's faster than every single implementation that's not using Entity framework so very cool and again if we look at the memory footprint all of the Entity framework stuff is significantly worse than the others so here and here as well and the red is of course the worst which is n framework with the fully materialized data set so this is really cool to me because I wasn't expecting to have any interesting results from Entity framework I just included them because like I said in the beginning of this video I don't really use n framework I'm writing my own SQL queries and I figured I might as well throw it into the mix to learn about something one of the things that I set out to prove in this video because I was pretty confident about it
is that the performance of iterators was going to be slower than a fully materialized collection of course if we're doing something like calling any when we shouldn't have to fully materialize a data set an iterator is going to be faster I wanted that to show through but what I wasn't expecting is that anyti framework even with an iterator was going to outpace a fully materialized collection doing a raw SQL query totally unexpected to me but the other thing that's interesting is that the memory footprint of doing so is a lot greater with Entity framework now I need to admit and I kind of did at the beginning that because I don't use Entity framework there's going to be stuff about this that I'm not optimizing properly or assumptions that I'm making that if I get your feedback you might be able to say well you
set up this Benchmark wrong and it's not really a fair comparison so please feel free to share that in the comments and I'm happy to go back and redo some of these and share the results and at this point you're probably saying well Nick you missed the most important Benchmark you set out to go look at paging you don't have any paging here you're getting the fully materialized data set is the only page that's a really big page but don't worry I did in fact go get these benchmarks I mentioned earlier that when I was explaining to my wife that all this stuff was busted and I couldn't send out my newsletter as a result when I finally got this to work and get real data coming back that wasn't totally clobbered I didn't include the paging and had to go back after I ran
all the benchmarks to go at it I think I was too eager to see it working but I do have those results and we'll check them out next and now with these results results I was totally wrong well sort of but this changes everything once again all right so this is a paging Benchmark that's pulling back 100 records for the different full data sets that we have so this is a smaller set because I'm only looking at methods that are going to make sense for pulling back pages of things looking at the 1,000 item data set pulling back 100 Pages the fastest implementation is a paging approach with raw SQL it's very comparable to the others so this group right here is all very very similar and then the worst is of course doing a fully materialized raw SQL data set it's also comparable to
getting the Entity framework result set fully materialized if we look at the memory usage the iterator approach for raw SQL happen to have the lowest footprint and then the others for paging with an i readon list and a list are almost the exact same for memory footprint so very very small and not surprisingly using Entity framework is orders of magnitude more than what the others are overall though the performance very very comparable when we're dealing with 1,000 records when we start to get into 10,000 records we can see that the iterator is pulling ahead for the performance using Entity framework core and the page results are almost the same performance but Entity framework with an iterator did manage to pull ahead these numbers are very very similar but it's kind of disproving my theory that having a paged implementation was going to be faster the
memory footprint that we're seeing on the right here is almost the exact same that we saw above so nothing really surprising on the memory side and in fact for the next section it's going to be the same thing so we skip over that but the most surprising thing that I saw in this data set was looking at these numbers here when we're pulling back dealing with a very large set of data in the database and only pulling back a 100 records we can see that the fastest implementation had nothing to do with my new paging approach that I was taking in my code in fact it's just using an iter with raw SQL this is literally what I just moved away from in my code base to try doing these two things down here the reality is that these numbers are very very similar so
I'm not too upset that I moved away I'm just totally shocked that I thought the iterator was going to be much slower for pulling these pieces of data back now of course I'm not changing the size of the page and having only 100 items come back in my opinion makes it even more surprising that the iterator is performing even better better and that's because there's overhead for setting up the iterator in the first place so personally I would expect for a small number of items like 100 that the iterator should have more overhead for performance but what we're seeing here is not only does it not have more overhead it's going faster okay so what have I learned from doing all of this the first and I think this is the biggest takeaway for you is that doing performance analysis of things in isolation does
not necessarily Translate to the same performance gains in practice when the queries are set up comparatively iterators should be slower than putting stuff into a list or an array but there's clearly some nuances here in how the code is set up the other takeaway for me is that doing a fully materialized collection as the only way to get data sucks and this one's not a shock to me this is the motivation behind all of this that I've been talking about if that's the only way that you're accessing data I think you really need to go back and revisit your data access patterns because if you're always pulling pulling back all of the data and then trying to work on it in memory there's probably some more optimal ways that you can do that the next takeaway for me is enity framework it looks like across
the board it's significantly worse with memory but it happened to be pulling ahead a lot in terms of performance now one of the things that I learned as I was making this video and I included it earlier was that I had this moment where I said maybe the iterator approach that I put in for Entity framework isn't truly an iterator I can't tell because it's just an I inumerable return type and I can't tell if it's yield returning anything under the hood I would have expected that it is but maybe it's not and if that's the case I need to go back and force this to be a for loop with a yield return so that it forces it to be an iterator regardless though Entity framework was using up more memory but it happened to be a lot faster in situations where I just
wasn't expecting it to be so I have a lot still to learn with Entity framework core and that's going to bring me to my last Point here about some of the tech choices that I made so yeah this was all with SQL light I'm not even using SQL light in my production system so sqlite was just a way for me to set this stuff up make it nice quick and easy for the benchmarks but if we're talking back to my first point about doing things in practice I should be going to Benchmark my real application to see how I can tune it this was more of a generalized approach so sqlite maybe has some different performance characteristics from my SQL but I was trying to focus more on the implementation of what we're doing once we're in memory versus the querying part itself so that
is just another difference that I want to call out and it's not necessarily an Apples to Apples comparison with what I have in production so I do have a couple of more benchmarks to go rerun to see if anything else stands out with the Entity framework stuff and I need to be using Entity framework more in my project so the next vertical slice project that I put together I'm going to be trying to use Entity framework and if you want to see more about vertical slice architectures you can watch this video next thanks and I'll see you next time
Frequently Asked Questions
What are the main approaches compared in the benchmarks?
In the benchmarks, I compared three main approaches: fully materializing collections, using iterators, and implementing paging. Each approach has its own performance characteristics, and I aimed to highlight the differences in memory usage and execution time.
Why is fully materializing collections considered a bad practice?
Fully materializing collections can be inefficient because it loads all data into memory, which can lead to high memory usage and slower performance, especially with large datasets. Instead, I recommend using approaches like iterators or paging to minimize the impact on memory and improve performance.
What surprising results did you find regarding Entity Framework in your benchmarks?
I was surprised to find that Entity Framework, despite having a higher memory footprint, often outperformed raw SQL queries in terms of execution speed. This was unexpected for me, as I typically don't use Entity Framework, and it challenged my assumptions about its performance.
These FAQs were generated by AI from the video transcript.