I Made A Big Mistake - Only 3x Regex Performance Gains in C#
April 15, 2024
In my original video, I wanted to showcase some performance benchmarks for regular expressions in C#. It turns out that shortly after posting it, a viewer pointed out a critical flaw in my benchmarking code!
This is my redemption video where I explain what the viewer correctly pointed out about C# Regex methods and how the MatchCollection class works in C#. These benchmarks should be more accurate -- but let me know if you spot anything funky!
Transcript
All right, first things first: I set out to get you some C# regular expression benchmarks and, well, I screwed up. I put a video together, wrote an article, posted it all, and some of the initial feedback said, "Wait a minute, this doesn't look quite right."

Hi, my name is Nick Cosentino and I'm a Principal Software Engineering Manager at Microsoft. In this video I'm going to be talking about my last video, which unfortunately I totally screwed up. Well, not totally, but some of the information and assumptions I was using to calculate my benchmarks for regular expressions were just wrong. I was very lucky to have a viewer comment very early on after the video went up and say, "Wait a sec, this doesn't look quite right. These times look a little too fast given the amount of data you're running a regular expression over." They suggested that maybe the MatchCollection for regular expressions is lazy. In this video I'm going to walk through what actually happened with those benchmarks, we'll see just how lazy regular expressions are in C#, and I'll attempt to redeem myself with a new set of benchmarks. It's not going to be an 800x performance improvement, but it will still be pretty significant.

Let's head over to Visual Studio and look at the Regex and MatchCollection classes we have in C#. On my screen I have the updated benchmark code, which is very similar to what I had before. The first thing I want to do is look at the Regex class, and the MatchCollection in particular. For context, what was being suggested is that once we call the Regex.Matches method, what actually happens is... nothing. We're just creating a MatchCollection, a bit like how an iterator is lazy: it doesn't actually perform any of the work until we ask for the results. We're going to go see whether that's the case or not.

If I press F12 to jump into Matches, you can see that the return type is MatchCollection. A quick note: in the top right of the screen you can see this is the Source Link view of the Regex.Match.cs file, and in the top left you can see we're in the System.Text.RegularExpressions namespace and DLL. This is not my code; it's built in. Pressing F12 once more takes us into MatchCollection, and I want to talk about how we can tell whether or not this is lazy. To repeat what I mean by lazy: we're able to create a MatchCollection without it actually having collected all of the matches. My benchmarks were assuming that by the time you've constructed a MatchCollection, it has already evaluated all of the matches it needed to. The first giveaway is that the MatchCollection constructor doesn't take in a list of matches of any kind; all it does is take in the Regex object, the input string, and the start index. That's a pretty good indicator that it's lazy. It doesn't necessarily prove it, because the constructor could still go do the evaluation itself, but if we look at the code from lines 27 through 37: nope, none of the code in there actually goes through and processes the matches. So in fact we create the MatchCollection without having done any matching at that point. This basically obsoletes and disproves the benchmarks I was showing in my previous video, so first of all, I wanted to apologize for that; I did not know that MatchCollection was lazy.

Let's explore a little more to see how these things get evaluated, and whether we're dealing with an iterator or some other kind of lazy behavior. Before we look at Count, let me jump back to the new benchmarks: I figured I'd use something that forces evaluation, and you can see that on all of these I'm now calling .Count.
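A quick way to convince yourself of this laziness is to time the Matches call against a large input. This is a minimal sketch with my own hypothetical pattern and a synthetic input string, not the benchmark code from the video; the timing comments are illustrative of the lazy behavior described above:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

// Build a large synthetic input (illustrative only).
string input = string.Concat(Enumerable.Repeat("running walked talking ", 500_000));

var sw = Stopwatch.StartNew();

// No matching work happens here: only a MatchCollection object is allocated.
MatchCollection matches = Regex.Matches(input, @"\w+(ing|ed)\b");
Console.WriteLine($"Matches() returned in {sw.ElapsedMilliseconds} ms"); // near-instant

sw.Restart();
int count = matches.Count; // forces every match in the input to be found
Console.WriteLine($"Count = {count} in {sw.ElapsedMilliseconds} ms");    // the real work
```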
So, .Count: if you look at the highlight at the end of each line in this benchmark, there's a .Count on all of them, and I'm returning an integer. Back in MatchCollection, why am I using Count? If we look at the Count property, we can see EnsureInitialized, which is kind of interesting. What does that do? If we follow it through, it will try to get a match at the max value if it hasn't yet set a flag called "done". So what does GetMatch do? Inside here is where all of the magic happens: it has a loop that essentially goes through all of the matches and forces the evaluation for us.

I wanted to show you this because in the previous benchmarks, yep, I was using a MatchCollection, and because it's lazily evaluated, it wasn't doing any work beyond creating a new object reference. Now that I'm using .Count in the current benchmarks, it forces an evaluation of all of them. I did not want to create a new iterator by calling Any() or the LINQ Count() method (the method call, instead of this property), because there's a bit of overhead for creating an iterator, and I didn't know whether MatchCollection was going to be an iterator in and of itself. It turns out it's not; it just has this lazy-loading behavior. It's not exactly like an iterator, which uses the yield return pattern with an IEnumerable return type, and you can see we're not doing that anywhere here: there is no iterator syntax, no yield return, no IEnumerable return type. So MatchCollection is lazy, but MatchCollection itself is not an iterator; it just has the ability to be enumerated.

Back in these benchmarks, we're now using Count, not Any() or the LINQ Count() method, which would create an iterator over the matches and step through them. I don't need to do that, because the Count property we just looked at will in fact enumerate all of those results. Okay, so this should give us pretty good benchmark results now, because just like originally, we are doing things like creating the Regex instance.
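The property-versus-method distinction above can be sketched like this (hypothetical input; note that depending on the runtime, LINQ's Count() may short-circuit for collection types, but the Count property sidesteps the question entirely):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

MatchCollection matches = Regex.Matches("testing tested rest", @"\w+(ing|ed)\b");

// The Count *property* goes through MatchCollection's internal
// "ensure initialized" path, looping until every match has been found.
int viaProperty = matches.Count;

// The LINQ Count() *method* is the extension-method route; on some
// runtimes it enumerates via an allocated iterator, on others it can
// defer to the property. Same answer either way.
int viaLinq = matches.Count();

Console.WriteLine($"{viaProperty} == {viaLinq}"); // 2 == 2 ("testing", "tested")
```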
One version uses the RegexOptions.Compiled flag and one does not, but both create that new instance every time, while the other scenarios cache the regular expression. I'm not going to spend all of the time going through the benchmark setup, because the only difference between this video and the last one is this .Count. That said, if you want more information about the different scenarios we're covering, I'd highly recommend checking out that previous video. Its results are not accurate (I'll cover that in just a moment), but the setup, and why we're looking at these different scenarios, is still valuable: the whole point of the original video was to say, "Look, there are a bunch of different ways we can seemingly do the same thing; what's the difference when we actually go do it?"
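For reference, the scenarios described above can be sketched roughly like this with BenchmarkDotNet. The pattern, file name, and method names are my own illustration, not the code from the video; the key detail is that every benchmark ends in .Count so the lazy MatchCollection is actually evaluated:

```csharp
using System.IO;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

public class RegexBenchmarks
{
    private const string Pattern = @"\w+(ing|ed)\b";                 // illustrative pattern
    private readonly string _input = File.ReadAllText("ebook.txt");  // illustrative input

    // Cached instances, constructed once for the lifetime of the class.
    private static readonly Regex Cached = new(Pattern);
    private static readonly Regex CachedCompiled = new(Pattern, RegexOptions.Compiled);

    [Benchmark(Baseline = true)]
    public int StaticMethod() => Regex.Matches(_input, Pattern).Count;

    [Benchmark]
    public int NewInstanceEveryTime() => new Regex(Pattern).Matches(_input).Count;

    [Benchmark]
    public int NewCompiledEveryTime() =>
        new Regex(Pattern, RegexOptions.Compiled).Matches(_input).Count;

    [Benchmark]
    public int CachedInstance() => Cached.Matches(_input).Count;

    [Benchmark]
    public int CachedCompiledInstance() => CachedCompiled.Matches(_input).Count;
}
```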
On my screen right now is the previous set of benchmark results; the results we're about to look at are in the same order, but I wanted to show you the Ratio column. What I wanted to call out in the last video is this: look at the "new" and "new compiled" rows (rows 2 and 3) and go to the Ratio column. It's pretty common that I see people creating regular expression objects every time they want to use them, and what I wanted to call out is that if you're doing that and you switch to just using the static method, you'd seemingly get a 100x performance improvement. Pretty dramatic. And if you're the kind of person who did some research, saw the Compiled flag, and thought, "Oh man, we should totally use that for performance gains," but you were compiling every time you wanted to run the Regex, then switching to just the static method looked like over an 800x performance improvement. These are two common ways I see people misusing regular expressions, and I was trying to say, "Look how much faster it could be." But that's just not the reality, because that's not what those benchmarks were covering: they were only creating the Regex object; they were not evaluating the regular expression. That's why those benchmarks are invalid. They do still have some value; it's just not the value I was trying to capture in the previous video. Now I'm going to jump over to the other benchmarks, from the code I just showed you, which should be accurate.
Let's walk through them. We have the same baseline at the top, the static method call on the Regex class, calling Matches; that's a ratio of 1.00 because it's our baseline, and I just wanted to start with something. If we look at the example where we create a new regular expression instance every time we want to do the matching, it's apparently a little bit faster than the static method call. Whether that's a slight rounding error or they really are a little different, this is what the benchmarks came out to be. It's also worth mentioning, if you haven't watched the previous video, that what I'm doing is looking for words ending in "ing" and "ed" in a published ebook of roughly 2,200 to 2,600 lines of text, so there are plenty of English words in there that end in "ing" and "ed". The important thing to keep in mind as we go through these benchmarks is the time to create the Regex object versus the time it takes to do the matching. The previous benchmarks weren't doing any matching, like I thought they were; these ones are. So it's a little bit faster to create a new instance every time than to use the static method.

A quick scan shows the static method is apparently the worst performer in this list, and creating a new instance every time is a little bit faster. But there's already a huge performance improvement from using the Compiled flag: the third row down has a ratio of 0.64. That's not quite half the time, but it's a lot faster, and if you think about why, it's interesting: we're creating a new Regex object and compiling it every single time. You'd think there's a lot of overhead in doing that, and we'll see there definitely is some, but it's not as much overhead as the speed gain you get from using the compiled Regex, so it comes out much faster overall.

Now, forgetting the Compiled flag again: instead of making a new Regex object every time we want to match, if we look at this cached row, it's just a tiny bit faster (again, potentially just a rounding error) than making a new one every time. What I gather from this is that the overhead of creating the Regex object is actually quite low compared to the time it takes to match the pattern against that body of text. And again, keep in mind that body of text is 2,200 to 2,600 lines; if you were dealing with very short input strings, the discrepancy here might be a lot bigger.

One of the biggest performance improvements we'll see so far is the cached-and-compiled case: we compile the Regex just one time, and that brings us down to 0.32 on the ratio, about a 3x improvement in performance. So if you were wondering, "If compiled Regex is so performant, does it really take that long to compile it?" Well, yeah, because if you look at the earlier example at 0.64, sure, that's faster than the static method, right?
That's definitely the case, but if you only had to compile it once, you cut that time in half again in these benchmarks. So it's definitely worth keeping a cached version when you compile a Regex; if you're thinking about optimizations to include, that's something you'll want to consider.

Now for the last set of benchmarks. If you did watch the previous video, I said that the generated regular expressions we get by using the attribute, those compiled regular expressions, should be cached automatically for us, so there should be basically no difference from the cached versions here, and they also use compilation built in. The documentation explicitly says both of these things: you shouldn't need to pass a Compiled flag, because it does that for you, and you shouldn't need to cache it yourself, because it does that for you as well. I did notice a tiny difference in the previous benchmarks, but what we see in these ones is that those were basically rounding errors anyway. Same kind of thing here: they're all at roughly a 0.33 ratio in this case, all approximately the same, and they come out to the same performance as the cached-compiled one, though that could very well be down to the particular regular expression I used. I highly recommend you go through the Microsoft documentation on just how powerful these generated regular expressions can be, because they truly do generate the code for you ahead of time. You can debug it, so you can literally step through that code, and it even has doc comments built up for you, which in my opinion is just super cool.
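The source-generator approach described above looks roughly like this (the pattern and names are my own illustration): the [GeneratedRegex] attribute on a partial method causes the matching code to be generated at compile time, and per the documentation you don't need RegexOptions.Compiled or your own static caching.

```csharp
using System.Text.RegularExpressions;

public static partial class WordEndings
{
    // The source generator fills in the body of this partial method at
    // compile time; the Regex instance it returns is cached for you.
    [GeneratedRegex(@"\w+(ing|ed)\b")]
    public static partial Regex IngOrEd();
}

// Usage: the same cached, ahead-of-time-generated instance every call.
// int count = WordEndings.IngOrEd().Matches(text).Count;
```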
The big takeaway here is about these options at the bottom: regular expressions are expensive to create, so it's beneficial to cache them, and using the Compiled flag, or the attribute that does that compilation ahead of time for you, is going to be what you want if you're after performance.

Okay, so at this point I should be in the clear, right? There should be nothing else that could possibly go wrong with these benchmarks. I've gone ahead and proven that the MatchCollection class is lazy; we know that, and I made sure that by calling .Count everything gets evaluated, so we should be comparing the time it takes to construct the Regex object plus the time it takes to perform the evaluation and find all of those matches.
It's got to be bulletproof, right? Well, not quite. There's one more thing that, fortunately, I caught (or at least looked at) before publishing my new article and recording this video. There's one thing I want to jump back to in the code where we could still possibly screw up these benchmarks. Fortunately I remembered this from my digital forensics days, when I recall there was some weird behavior with regular expressions once we started to have too many of them. What could that possibly be? It's hidden behind the scenes a little bit, but if we go to the static Matches method and step into its definition, we have something very interesting on line 179: RegexCache.GetOrAdd with the pattern. Can you see what the problem might be? I'm using the exact same pattern across all of the regular expressions. So what would that mean if I'm trying to gauge the performance of creating a Regex plus the match time on top of it? Well, if it's already cached, we're not paying the performance penalty of creating that Regex. I was almost all the way through writing my article, feeling very confident that I'd finally gotten this right, and then I saw this code and went, "Oh crap, do I have to do this all over again?"

Here's what I did. If we go back to Program.cs and scroll to the top, there's a line of code I commented out so I didn't spoil this entire video, and it uses this Regex.CacheSize property. We can set the cache size to zero, and it will turn off caching for us. If I step in here, you can see there's code so that if you set this to zero (and we go a little bit deeper), you can technically clear the cache by setting it to zero, which is kind of interesting. There's this s_cacheList list that gets used for caching, and what I ended up doing, just to prove we're not caching anything, is a little bit tricky to explain: I'll put a breakpoint here, press play, and debug this, and we'll look at how I can examine in Visual Studio whether or not we cached anything. Okay, my breakpoint is hit after the cache size was set to zero.
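The cache-disabling step can be sketched like this (illustrative pattern; note that Regex.CacheSize only affects the static convenience methods, not Regex instances you construct yourself):

```csharp
using System.Text.RegularExpressions;

// Setting CacheSize to zero clears the internal cache and stops the
// static methods from caching patterns from here on, so every static
// call pays the full cost of building the pattern again.
Regex.CacheSize = 0;

// Each of these static calls now constructs the pattern from scratch
// instead of finding a cached entry via RegexCache.GetOrAdd.
bool first = Regex.IsMatch("running", @"\w+ing\b");
bool second = Regex.IsMatch("walking", @"\w+ing\b");
```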
What I didn't show while debugging this for you is that in my test, I created a new Regex basically right after setting the cache size to zero, because that's how these benchmarks are going to work: I set the cache size to zero, then my benchmarks run, creating regular expressions. I'm not including that line; I'd already started recording and figured I'd just say it out loud. The interesting thing is over in the Watch window (I do have videos on debug watches, by the way, if you're not familiar with how to use them, so check those out). I've added a watch here, and to explain what's going on: I'm getting the type of RegexCache, then asking for its field called s_cacheList. This is all reflection; technically we can't see this thing from outside the class, but we can use reflection to look at it. So if you remember all those people telling you, "Reflection is bad, never use it": yes, there are a lot of places you shouldn't use reflection, but here's a really awesome use case for it. From there, we ask for the static, non-public field called s_cacheList, then ask to get its value (passing null, because it's static), and if I press this little icon to evaluate, we get a list that's empty. Now, in this case I hadn't created a regular expression yet, but like I was saying, when I was trying to prove this to myself, even after creating a regular expression after setting the cache size to zero, it does in fact not cache it. That means the behavior I want is right here.

The reason I had to do this in a watch is that I was having a very difficult time getting access to the RegexCache type, and for some reason the watch is able to see it. It's an internal class, if you didn't notice: when I go back to this code, RegexCache is marked internal, so we should not be able to see it from outside the assembly. However, we totally can when we put it in a watch, and that's how I was able to get the type. When I tried writing it in code, I didn't find an opportunity to pass in the binding flags for non-public members, so I did it in a watch. I just wanted to explain that I could prove to myself that setting this to zero will not cache.
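For what it's worth, the same inspection can be attempted from code with the reflection APIs rather than a debugger watch. This is a sketch under the assumption that the internal type and field names match the current .NET source; since RegexCache and s_cacheList are internal implementation details, this can break between runtime versions:

```csharp
using System;
using System.Collections;
using System.Reflection;
using System.Text.RegularExpressions;

Regex.CacheSize = 0;
Regex.IsMatch("testing", @"\w+ing\b"); // would normally add a cache entry

// Locate the internal RegexCache type by its full name via the Regex assembly.
Type? cacheType = typeof(Regex).Assembly
    .GetType("System.Text.RegularExpressions.RegexCache");

// Grab the private static s_cacheList field.
FieldInfo? listField = cacheType?.GetField(
    "s_cacheList", BindingFlags.Static | BindingFlags.NonPublic);

// Passing null because the field is static; with CacheSize = 0 the
// list should come back empty.
var cacheList = (ICollection?)listField?.GetValue(null);
Console.WriteLine(cacheList?.Count);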
But how does that affect the results? Well, fortunately I covered this in my blog article, because I said I didn't want to screw it up one more time. The reality is that the caching didn't really have any effect at all: the benchmark results came out almost identical in the second run. I think one of the numbers was off by about 0.01 in one case, but the results are basically identical, so the caching of the regular expressions, in this set of benchmarks at least, was not contributing to a difference in the performance characteristics.

With all of that said, we can see there are some performance optimizations to be had with regular expressions in C#. As I said earlier: cache the Regex you're making; compile it if you can; and in the best-case scenario, use the new attribute for regular expressions that lets the compiler do that ahead of time for you. That will, in theory, give you the best results.

Something else I wanted to call out, which was the whole point of doing this video and the original one in the first place: it was not to tell you "here's the best way to make regular expressions." What I wanted to show you is that if you're curious about something, you should be able to benchmark it and explore it. For me this was a very interesting turn of events: when I tried to put this information out, I was completely wrong, but from the first video I learned about the attribute for regular expressions and what Microsoft is able to do behind the scenes.
That was a really cool learning, and I'm very glad I put that video together. But what was even cooler was having someone give me constructive criticism and call out an error without doing it in a way that made me feel bad. It's a little embarrassing, of course (I think it would be for anyone), but it's a great opportunity for me to show you that it's okay to mess up. I'm still making videos; I didn't stop; it's not like my YouTube career ended. And I'm still learning stuff: after 10-plus years of using C# with regular expressions, I had no idea MatchCollection was lazy. That was completely new to me, which I thought was super cool, and in going through this I not only learned that, but I also got to re-explore the regular expression cache.

So, some takeaways: keep in mind that it's okay to be wrong, and it's great to experiment and learn things. And if you notice someone's made an error, there's a way to tell them without being rude about it; we're all just trying to learn and help each other. Thank you so much for watching. I hope you can forgive me for my mess-up in the previous video, and I hope this one was still interesting too. Thanks, and I'll see you next time!
Frequently Asked Questions
What mistake did you make in your previous video about regex benchmarks?
In my previous video, I mistakenly assumed that the match collection for regular expressions was not lazy, which led to inaccurate benchmark results. I didn't realize that creating a match collection didn't actually evaluate the matches until I accessed them.
How did you correct the benchmarks in this video?
I corrected the benchmarks by using the 'Count' property on the match collection, which forces the evaluation of all matches. This way, I ensured that the benchmarks accurately reflected the performance of the regex operations.
What are some key takeaways regarding performance optimizations for regex in C#?
Some key takeaways include caching your regex instances and using the compiled flag for better performance. Additionally, utilizing attributes that allow the compiler to optimize regex ahead of time can yield significant performance improvements.

These FAQs were generated by AI from the video transcript.
