OVER 800x IMPROVEMENT?! Benchmarking Regular Expressions in C#

Uploaded: 2024-04-12T12:30:35.0000000+00:00
Duration: 15 min 29 s

• 514 views

If you check out these benchmarks, you'll see a scenario that you might be doing in your very own code and you could be getting an 800x performance boost. And if you think there's no way you have a Regex in C# that will speed up by 800x, then you might see another benchmark where you could get a 100x gain!

In this video, I walk through the various ways that you can construct a regular expression in C# and the different performance characteristics of each. While this isn't an exhaustive benchmarking collection over a wide variety of C# regexes, hopefully, this serves as some inspiration to benchmark your code!

PLEASE NOTE: There is a critical flaw in the benchmark code as pointed out by a viewer (and THANK YOU for doing so in a very constructive and helpful way). You can read about the correction here:
https://www.devleader.ca/2024/04/12/csharp-regular-expression-benchmarks-how-to-avoid-my-mistakes/

Transcript

More recently when I've been programming, if I find multiple ways to do something very similar, I've been getting very curious about the performance characteristics of either way to do it. When it comes to regular expressions in C#, I know that regex can be pretty darn slow, and I wanted to make sure that if I was trying to work with regex, I could work with the most optimal way of doing it if I really needed to use it.

Hi, my name is Nick Cosentino and I'm a principal software engineering manager at Microsoft. I spent a lot of time before Microsoft working in digital forensics, and a big part of that was making sure that we could match on particular sequences of characters or bytes. That meant that every time we had to go touch a regex, because that was the best way to go match a pattern, my heart sank a little bit because I knew it was going to be a performance hit. Now, we've had some awesome improvements to .NET over the years, and I haven't been in digital forensics for a bit, but I still find myself using regex. So in this video, I wanted to walk through some benchmarks using BenchmarkDotNet. [...] for a link to my free weekly newsletter and my courses on Dometrain.

Let's go check out the benchmark code. As I mentioned, I am using BenchmarkDotNet in this assembly, and we really only have this one benchmarks class here. There's a whole bunch going on and I'm going to kind of walk through it bit by bit, but we're going to look at a handful of different ways that we can go create and run regular expressions.

Before I go into the details of this, I just wanted to mention that the text that we're going to be matching against is from the Project Gutenberg ebook of "The Adventure of a Black Coat". I just downloaded this from the internet. You can see if I scroll through this pretty quick, there's a whole lot of text here: there are over 2,000 lines, 2,266 lines of text that we can go match against. So I wanted to have something that was pretty substantial, and just to kind of illustrate it to you, in this setup
I'm going to load it all at once, so it's not like I'm reading in this file per benchmark. That time is going to be omitted from the benchmarks. There's a handful of different benchmarks that I want to go run when using a regex, so I'm going to go collapse these and then we're going to step through them one by one.

The first one that we're going to look at is going to be the baseline that we're using for all of these, and it's going to be using the static class with the static method for Matches. We're going to take the source text and look for the pattern. I should have mentioned that the pattern we're going to be using is words that end in "ing" or end in "ed". No real specific reason; these are kind of natural endings that show up periodically in words, and that way we're not going to try to match something that doesn't exist at all, or match something like the letter "a" and have a ton of matches come up. I wanted to have something that could be representative across that body of text.

So we're going to use the static class and static method as the baseline. The next most common way that I see regexes get used is creating a new Regex every time you want to go run it, and then using Matches on that as well. You can see that we also still have the regex pattern and the source text passed in here.

Another variation of this comes up a lot, because if you're kind of reading into regex, you'll have heard that using the Compiled flag can really speed up the performance of your regular expression. One of the traps here is that if you're creating a new one and putting in the Compiled flag every time, you're technically paying the performance penalty of compiling it each time you go to use it. If you're using it once, it's maybe not so bad, but if that's the case, do you need to compile it? I don't know; you probably want to go benchmark that and find out. But we're going to compare and see the overhead of doing
that compared to just newing it up, compared to the static method and static class.

Now, similarly to what we've seen already, I'm going to cache the regex that we're creating. I'll show you this with a compiled one as well; we'll check these both at the same time. In the setup code that I have at the very top here, all that I'm doing is creating those regular expression objects once, and then we can reuse them for each run. This way, if there is any performance overhead for creating those Regex objects, we're not going to pay it each time. We're just going to cache it and do it once.

Finally, at the bottom here, you'll see that I have these generated variations. They're going to be very similar in terms of doing a cached variation and using the Compiled flag, and this is sort of the matrix of all of them. But what is this generated regex? What's the difference here? Let me scroll back up to the top and we can see where these are created, and I'll explain in a little bit of detail what's going on.

On line 24 you can see where I am creating this GetGeneratedRegex method, and you'll notice that it's marked as partial and static, which means that the class itself also has to be marked as partial. These are going to be source-generated regular expressions. I believe it was introduced in .NET 7; I might have the version wrong, but it's relatively recently that we got these attributes that you can put onto these. In fact, Visual Studio will often give you an option over these, so if I hover over, you can see "convert to GeneratedRegex attribute". So this is a feature in Visual Studio that suggests you might want to switch over to this pattern.

They have really good documentation online in MSDN that explains this, and I'm not going to go into the details. I just wanted to show you very briefly that if I press F12, you can see the generated code for this, and they did this on
purpose because they want you to be able to debug your regular expressions if you need to, so they include things like comments. That's a really important part, they said. There's a lot of super cool stuff that can happen when they do these and have the source generation go alongside it, so I do recommend that if you're using regular expressions extremely heavily, you check out the MSDN information for these generated regexes with the source generators. Super cool. It's totally above my head, because I use regex a lot but not at this level of detail.

However, I thought that this was important, because if this is the direction that they're pushing us in, why? There's got to be some type of performance impact here, so I wanted to check that out. I'm trying to be transparent about this: I don't know if the pattern that I'm using here is going to be able to show us enough variation across these, but this is something that I think you want to explore if you're trying to optimize and work with different regular expressions.

The thing that I do know is that when you're using these GeneratedRegex attributes, it should by default give you Compiled as a flag. You'll see I've actually created one that includes it specifically. The documentation does say that this gets ignored; I did this on purpose to prove it to us. These two should functionally be identical if the Compiled flag is ignored, and that's because the regex itself, like I said, when it's doing this source generation, is going to basically use the same concept of compiled automatically for us. So you should not have to include it, but there are certainly more optimizations to be had when doing this. Again, I'll say it one more time: check out the MSDN documentation for all of the details, because there's a ton of work put into this for you to be able to go investigate.

One more point that I'll mention is you'll notice when I hover over this, the
tooltip text that pops up actually explains to you (this is so awesome) what your regular expression is supposed to be doing. If you typed up a regex and you want to make sure that it's doing what you expect, having this explanation written out in English is so powerful, because if you work with regex, you know that they can get complicated super fast.

Okay, so the benchmarks that I have, I just wanted to illustrate here. Let's go down a little bit lower and I'll expand all of these. I mentioned that there should be no difference between the compiled variations and not, so these two should be equivalent to these two, because the documentation does say that the Compiled flag gets ignored because it's basically done for us. The thing that I wanted to call out as well: you'll notice that these are method calls, so I have to call GetGeneratedRegex to be able to access that source-generated code. The documentation does also say that it caches that for us, so once you do this, you still get access to that code and it doesn't have to go recreate anything, which is awesome. So in theory, this should be identical to this in terms of performance, because if behind the scenes this is doing caching for us, it should be identical. However, I just wanted to see for myself by running this if there was overhead to doing this method call. With all of that said, I do expect all of these to be very much the same set of results when it comes to performance.

Like with all of my benchmarking videos, I'm not going to sit here and make you wait for them to all finish. I've done that ahead of time, so let's check out the results.

Okay, there's a lot of variability and a lot of similarity when we start scanning this. If you're not used to BenchmarkDotNet, the columns that we're interested in are going to be this Mean column, which is going to be the average run time that it took. We're interested in the Ratio, because this is going to give us a
comparison against the baseline, which was the static class with the static method. Then there's also memory allocation, so we have this Allocated column right here, the second last, and also a ratio to compare against the baseline.

With that said, starting with our baseline, the static class and static method right at the top: that's just over 12 ns. In my opinion, that's actually way faster than I ever thought these regexes would have run against that text. I don't know why; I just assumed it was going to take significantly longer. This was doing Matches, plural, so it did go look for all of the matches with "ing" and "ed" for the words, and there should be a whole bunch. 12 ns is our baseline, but it's going to get more interesting as we start comparing the other results.

If you were to new up a Regex object every time you wanted to go use it, it is 100 times slower if you look at the Ratio column: so, say, 101.72, so literally 101 times slower. There's a lot of overhead to go create these Regex objects. You can see even for the memory allocated, it's two orders of magnitude more memory. Creating a new Regex object every time is just something that you don't want to do when you're in a hot path of your code that has to do a lot of regex matching. You should not do this. If anything, just switch right away to using the static method on the static class; you'll get a performance gain that's about 100 times. It's a huge impact.

Now, we talked about this Compiled flag being an optimization, and truly having it compiled ahead of time is an optimization, so you do want to do that. But you don't want to create a new Regex object with a Compiled flag in it every time you go to use it. We saw that it was 100 times slower to make a new Regex object, but it's 10 times slower than that when you put the Compiled flag in. If you look at the third line here, it's 870x the baseline. It took 10 microseconds, so literally it's three orders of
magnitude more. It's ridiculous. You don't want to be using this Compiled flag every single time you go to create and use a regular expression. You lose all of the performance benefits of doing that because you're paying this overhead of compiling it each time.

This is where the evidence is going to start to show itself. For the cached and cached compiled variations, you can see here on the fourth line down that the ratio is 0.79. So just by caching it, no Compiled flag, just by making the Regex object once, we do get a performance gain over using the static method: the runtime is about 80%. So there is a boost to doing that. When we add the Compiled flag in on the cached one, we can see that it is a little bit faster. In my opinion, this could just be rounding errors, because this is already super fast and we're getting down to the nitty-gritty here. It's worth doing more benchmarking, and maybe running this over larger bodies of text or maybe even different patterns to match, to see if there's a variation. But it does look like it's a little bit faster to use the Compiled flag and cache. So again, both of these two benchmarks in particular illustrate that the previous two are not things you want to do. You don't want to keep making new Regex instances. That's the takeaway from this video. Please don't do it.

The last four that we see in these benchmark results are using the source-generated regular expressions. These are all extremely close in terms of the runtime. If we look at the Ratio column, we saw 0.79 for the last two at the upper end, and the upper end of these ones is 0.82. They're all right around the same territory for the runtime that they have. I did seem to notice that there is a little bit of a performance gain when doing the cached version, so not having that method call, even though the documentation says that that information's cached already so you can
safely call it. Across two benchmarks it does look a little bit faster. Again, it could just be rounding errors, and it's pretty insignificant at this scale. It might be worth doing it again across bigger bodies of text, more loops, all sorts of things that we could try out to see if there's a bit of a gain there. But ultimately, if we look at the last six results here, these look across the board to be faster than the baseline. We can see about a 0.77 up to 0.82 ratio against the baseline. These are all faster, but the two that we looked at near the beginning are definitely slower. Again, the takeaway is that you don't want to make a new Regex each time, and especially not put in that Compiled flag.

I mentioned at the beginning of this video that I've been getting really curious about the performance characteristics when there are multiple ways to do basically the same thing. If I can pick and choose which way to go, I might as well try to pick the way that's going to be faster, especially if it's on the hot path of my code. If it weren't for doing that investigation, I wouldn't have learned a little bit more about the source generators with those special attributes for regular expressions. In fact, I have a lot more to go learn now that I've created this video and started just scratching the surface of that. I urge you to do the same thing. I think you should be curious about the code you're working with, especially when it comes to benchmarking things, because there's a lot that we can learn. And while benchmarks like this can be interesting and point you in a direction, you will want to do profiling and benchmarking on your own code, because your scenarios will look different than mine. If you want to learn about how to benchmark your own code, you can go watch this video next. Thanks, and I'll see you next time.
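
The non-generated variations described in the transcript can be sketched roughly as below. This is not the code from the video: the class name, method names, and the pattern itself are assumptions (the video only describes the pattern as words ending in "ing" or "ed"). Because `Matches` fills its `MatchCollection` lazily, each method reads `Count` so that the matching work actually runs.

```csharp
using System.Text.RegularExpressions;

public static class RegexVariants
{
    // Assumed pattern: whole words ending in "ing" or "ed".
    // It also matches short words such as "red" or "king".
    public const string Pattern = @"\b\w+(?:ing|ed)\b";

    // Built once and reused, like the cached benchmarks.
    private static readonly Regex Cached = new Regex(Pattern);
    private static readonly Regex CachedCompiled =
        new Regex(Pattern, RegexOptions.Compiled);

    // Baseline: the static Regex.Matches method.
    public static int StaticMethod(string text) =>
        Regex.Matches(text, Pattern).Count;

    // A new Regex object on every call.
    public static int NewEachTime(string text) =>
        new Regex(Pattern).Matches(text).Count;

    // A new Regex object with the Compiled flag on every call,
    // so the compilation cost is paid on every call too.
    public static int NewCompiledEachTime(string text) =>
        new Regex(Pattern, RegexOptions.Compiled).Matches(text).Count;

    // Reuse the instance that was created once.
    public static int UseCached(string text) =>
        Cached.Matches(text).Count;

    // Reuse the compiled instance that was created once.
    public static int UseCachedCompiled(string text) =>
        CachedCompiled.Matches(text).Count;
}
```

All five methods return the same count for the same input; only where and how often the `Regex` object is built differs, which is the part the benchmarks compare.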

Frequently Asked Questions

What are the main takeaways from the benchmarks on regular expressions in C#?

The main takeaways are that you should avoid creating new regex objects every time you need to use them, especially with the compiled flag. In the video's results, creating a new regex per call ran about 100x slower than the baseline static method, and creating a new compiled regex per call ran about 870x slower. Using the static method or caching regex objects avoids that cost.

How does using the compiled flag affect the performance of regular expressions?

Using the compiled flag can speed up regex performance, but if you create a new regex object with the compiled flag every time, it actually incurs a performance penalty. It's better to cache the regex object instead.
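
To get a rough feel for that construction cost on your own machine, a minimal sketch like the following times how long it takes to build a `Regex` repeatedly with a given set of options. The class and method names here are hypothetical, and a `Stopwatch` loop is only a quick check, not a replacement for a proper BenchmarkDotNet run.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ConstructionCost
{
    // Returns the elapsed time for constructing `iterations` Regex
    // instances from the same pattern with the given options.
    public static TimeSpan TimeConstruction(
        string pattern, RegexOptions options, int iterations)
    {
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            // A new instance is built on every pass, so with
            // RegexOptions.Compiled the compilation is repeated.
            var regex = new Regex(pattern, options);
            GC.KeepAlive(regex);
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Calling `ConstructionCost.TimeConstruction(pattern, RegexOptions.None, 1000)` and then the same call with `RegexOptions.Compiled` gives two durations to compare; the absolute numbers will vary by machine and .NET version.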

What should I do if I want to optimize my regex usage in my code?

I recommend caching your regex objects and avoiding creating new instances unnecessarily. Additionally, consider using source-generated regular expressions, as they can provide performance benefits and are easier to debug.
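
A source-generated regular expression might look like the sketch below. It assumes .NET 7 or later, where `GeneratedRegexAttribute` is available; the class name, method names, and pattern are illustrative and not taken from the video's code.

```csharp
using System.Text.RegularExpressions;

// The containing class must be partial so the source generator
// can supply the body of the partial method below.
public static partial class GeneratedRegexExample
{
    // Assumed pattern: whole words ending in "ing" or "ed".
    [GeneratedRegex(@"\b\w+(?:ing|ed)\b")]
    private static partial Regex WordsEndingInIngOrEd();

    // Reading Count forces the lazily populated MatchCollection
    // to be fully evaluated.
    public static int CountMatches(string text) =>
        WordsEndingInIngOrEd().Matches(text).Count;
}
```

In Visual Studio, navigating to the definition of the partial method shows the generated, commented matching code, which is the debugging benefit mentioned above.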

These FAQs were generated by AI from the video transcript.