BrandGhost

ASP.NET Service OVERFLOWING With Errors! - Debug & Fix!

I've been on a bad streak of breaking things in my ASP.NET Core web API. Unfortunately, I was rushing and not paying attention to the proper bug fix, nor was I running my C# tests as I should have been. The result was repeated breakages. Join me in this video where I squash the final bug! Have you subscribed to my weekly newsletter yet? A 5-minute read every weekend, right to your inbox, so you can start your weekend learning off strong: https://subscribe.devleader.ca Check out all of my cours...
Transcript
So I logged into the cloud console, and at this point my production server is spewing logs like crazy and I have no idea what to do. In this video I'm going to guide you through the debugging process and the fixes for an issue I had with my application, Meal Coach.

Originally I was trying to commit a fix for a customer, and unfortunately that resulted in me breaking something else. And because I didn't seem to learn my lesson, when I was trying to make a fix for that, things went wrong yet again. After I verified the fix by running it locally and interacting with production data, I pushed it up to run in the cloud. A little bit later I tried logging into the application to use it myself, and everything was down. So I logged into the cloud console, and at this point my production server was spewing logs like crazy and I had no idea what to do.

Now, before we dive into the details, just a quick reminder to check out the pinned comment to subscribe to my free weekly newsletter. With that said, let's dive in.

At the rate my server was logging errors, and given that I wasn't getting any responsiveness when I tried to access my website, I figured the first thing I needed to do, instinctively, was roll back to the last known good. You might be familiar with deploying things to the cloud, and if not: a lot of the time, the first thing we're supposed to do is get back to the last known good state, or LKG. The reason we do this is that if we were confident everything was working before, and we commit a change and notice things break, then putting things back to that last state should let us recover and be back in a good spot. I think instinctively a lot of us rush to go fix things, and in fact, if you watch the last two videos, which I'll link up here if you haven't seen them already, they'll give you a lot more context: I was not really listening to the instincts that I know from work
and from my experience. I was falling into this trap of "let me just forward-fix, let me patch it up, let me push," instead of going, "hold on, let me go back to the last known good state, pause, collect myself, and then figure out how to move forward properly." But this was the second time that I broke production, and this was significantly worse than the last time, which would have only affected the password reset path; this time my entire application was down. There's something to be said about "fool me once, shame on you; fool me twice..." you get it.

So step number one was immediately trying to get back to last known good. I went onto my continuous delivery pipeline and made sure that I could deploy the last version that was working before everything blew up. And before I move on to the next part, if I can impart anything upon you from this part of the video so far, it's to go back to the last known good as the first line of defense. Like I said, it's a really bad habit that we rush into things and try to forward-fix. I manage two deployment teams at Microsoft, and the very first thing we say to people as soon as something's going bad is: how can we fast-revert? How can we get back to a last known good? As much as possible, we try to reduce the opportunity for people to forward-fix, because we see it time and time again: someone will try to fix something and then there's a cascading effect of things going wrong, just like what was happening to me. I didn't take my own advice, so let me impart it on you, please.

So now that things were back to a state where the application was up and running, and the logs weren't being totally blasted away in the cloud, I wanted to make sure that I could go fix that first. There is no reason my server should be spamming the same log over and over when something goes wrong. So let me show you that in the code first. All right, so the log messages coming from
my server were in fact coming from this class called CheckInReminderNotifierSweeper. This is something that I run periodically, and it attempts to send reminders out to clients based on different things. So it's a little job that runs and triggers for me every 6 hours, which you can see right here on lines 28 to 30. And yes, I know there are job scheduling frameworks; to be honest, I'll probably move over to something like Quartz or Hangfire, but I haven't spent enough time with them yet, and this pattern had been working for me, right up until this moment when things weren't going so well.

When everything is good and well, this bit of code doesn't really have any issues the way it's laid out. But the cause of my log spamming is directly inside this method, in the code on the screen right now. If we read carefully, thinking about exceptional cases and how I have this try-catch here, we can probably figure out pretty quickly what's going on. If my check-in reminder notifier is running and I have a bug somewhere in here, it's going to get to this catch block and try to log an exception, which is what we were seeing in the cloud as this thing was busted. Once that finishes logging, which should be pretty quick, it's going to go right back up to the top of the loop and try the exact same thing again. There is no wait time in between: this bit of code only waits if things are working well, which for me they had been, so I never noticed this bug. What I need to do is move this code down here, and that fixes all of the log spamming. When something goes wrong, we still get down to here and wait for that 6-hour period; and if anything is stopping this, the cancellation token is going to end up stopping it, we'd go back up to the top of the while loop, the condition would evaluate to false, and we'd end up leaving the method.
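To make the shape of this bug concrete, here is a minimal sketch of that kind of periodic sweeper loop with the fix applied. The class name, the delegate, and the interval handling are my own simplified stand-ins, not the actual Meal Coach code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Simplified sketch of a periodic "sweeper" job loop. The key point is
// WHERE the wait lives: outside the try/catch, so a failure still waits
// out the interval instead of immediately retrying and spamming the logs.
public sealed class CheckInReminderSweeper
{
    private readonly Func<CancellationToken, Task> _sendReminders;
    private readonly TimeSpan _interval;

    public int ErrorCount { get; private set; }

    public CheckInReminderSweeper(Func<CancellationToken, Task> sendReminders, TimeSpan interval)
    {
        _sendReminders = sendReminders;
        _interval = interval;
    }

    public async Task RunAsync(CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            try
            {
                await _sendReminders(cancellationToken);
            }
            catch (Exception ex)
            {
                // Log the failure once per iteration.
                ErrorCount++;
                Console.WriteLine($"Sweeper failed: {ex.Message}");
            }

            // The fix: this delay used to sit INSIDE the try, so any exception
            // skipped it and the loop retried immediately. Out here it always runs.
            try
            {
                await Task.Delay(_interval, cancellationToken);
            }
            catch (TaskCanceledException)
            {
                // Cancellation requested mid-wait; the while condition will exit.
            }
        }
    }
}
```

With the buggy layout, a persistent failure logs as fast as the loop can spin; with the delay outside the try-catch, you get at most one error per interval.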
So simply moving that wait outside of the try-catch is what was going to save my log files from blowing up like crazy the next time something goes wrong inside of here, which, if I'm a little more careful, hopefully doesn't happen again; but I can't ever say never.

Now, at least the nice part was that the log in the cloud was completely filled with exactly what the problem was, and it turned out to be a SQL error. This was a little weird to me, because I had verified the SQL I was running: I had run it locally to reset the user's password, made sure everything worked when I ran that little tool that I wrote, and I was able to prove it all locally. So what was going wrong in this case? Well, it's probably a little easier to talk with my hands, so before I jump over to the code, let me explain. I have this filtering system that I've built, and it translates things into SQL, and this has been working really well. But what I had done was have one code path that assumed a table was joined in, and another code path that assumed it wasn't. The reality is that this filter I was working with, the one used to find the user based on an email, needed to be responsible for joining in the table itself; it can't assume the table is already joined. In the one code path, the one I was running locally where it was working, the table was already joined; in the other code path, the one production was sitting on, that table was not joined. So there was a really simple thing to do here: go to where that filter is translated into SQL and make sure that I'm putting a join in. It's that straightforward, so I'll show you that quickly.

All right, so on my screen right now I have the bit of code that is responsible for taking a filter object, and that filter object knows how to filter on the particular metadata I'm looking for. This is what I was patching up in the last video. Now, what I needed to do here was include
this join that you see on lines 94 to 96. That's because, for the query that comes in, I cannot guarantee that the user metadata table is already joined: this is operating on user objects, that metadata is extra data, and if I'm filtering on that extra data, I need to make sure that I join the table in. Thankfully, the logs were very obvious in telling me where this join was missing.

But now I needed to figure out something else: where's the gap? What I had done before was test a fix locally. I tested it by running it manually, ensuring the code path was exercised, and then I ran a set of tests locally that I thought were going to cover that change. And they did, partially. But because this gets really complicated when you're putting a bunch of filters together, with the different code paths that can access them, that's not sufficient. It's good, but it's not enough.

This leads me to learning number two: I should have just run the full suite of my tests. I'm currently in the middle of migrating my continuous integration pipeline, and because there aren't a ton of other people adding code to this, I know that if I just run the tests locally I should get a really good signal, because right now they're not going to run in the continuous integration pipeline. That's the goal, right? You run them in the pipeline, they fail the build, and I wouldn't even be allowed to commit my code or push it anywhere. But I didn't listen to my own advice: if I wasn't going to run them in the pipeline, then I should have run them locally, and I shouldn't have been selective about what I ran. I should have run all of it. I know that if I run the full suite of tests, something is going to catch this, and the reason I know that is because some of the tests that spin up are going to run the sweeper code that we just looked at and fixed.
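Going back to the join fix itself for a moment, here is a rough sketch of the idea: the metadata filter adds its own join rather than assuming an earlier code path already did. The types and names here are illustrative stand-ins, not the real filtering library, and a real translator would use query parameters rather than string interpolation:

```csharp
using System;
using System.Collections.Generic;

// Value-equal join description (records compare by value).
public sealed record Join(string Table, string Alias);

// Tiny illustrative query builder: just joins plus WHERE clauses.
public sealed class SqlQueryBuilder
{
    public List<Join> Joins { get; } = new();
    public List<string> WhereClauses { get; } = new();

    // Add the join only if an identical one isn't already present.
    public void EnsureJoin(Join join)
    {
        if (!Joins.Contains(join))
        {
            Joins.Add(join);
        }
    }
}

public sealed class UserMetadataFilter
{
    public string Key { get; init; } = "";
    public string Value { get; init; } = "";

    public void ApplyTo(SqlQueryBuilder query)
    {
        // The fix: never assume the metadata table was joined by a caller.
        query.EnsureJoin(new Join("user_metadata", "um"));

        // Illustration only; real code must parameterize these values.
        query.WhereClauses.Add($"um.[key] = '{Key}' AND um.[value] = '{Value}'");
    }
}
```

Because the filter is now self-sufficient, both code paths (the one that already joined the table and the one that didn't) produce a valid query.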
Those tests run the same thing that would have called into that code, hit the missing join, and spammed the logs; I would have seen the problem if I had just run the other tests.

Now, something else to call out is that I don't have a ton of really granular test coverage on this part of the filtering library, and the reason I haven't written it yet is that it's time-consuming. I have tests at a higher, functional level, integrating more pieces, that would exercise this, and like I said, if I had run them all at this point, I would have caught the bug, so that would have been good enough. I'm not saying I should stop there; I totally should invest more time into writing more specific tests for this, which would let me catch things at a more granular level. That's important; I just haven't done it. But the main takeaway for this part is to run the whole suite, because sometimes we make assumptions about what our tests are covering. Even when you have unit tests that you think cover every line of code, it doesn't necessarily mean all of your assumptions about the dependencies you're mocking out are going to hold. And even with functional tests, I made sure I was covering a bunch of stuff (there's a lot of line coverage), but that doesn't necessarily mean that all of the different scenarios and variations that could exist are getting exercised and asserted on.

In this particular case, though, I already knew that I had tests that would exercise this code and this fix, because the previous change that I put in added more tests to cover a bit of a gap, and the tests that already existed would, like I said, have caught this bug before it hit production. That means if I run all of the tests, I should get a green light if everything's good. So that's the next thing I needed to do. And given my streak so far with messing up here, this being the second time that I'm fixing
something in production, what are your thoughts on how lucky I got? Did the tests pass or fail with the fix that I just showed you? Well, if you said that everything was green and lit up properly, thank you for having faith in me, but unfortunately it was not. It wasn't the right fix, and let me jump back to the code to explain why.

So the fix was partially right. This part here was necessary to include, but it introduced another problem. Thankfully, this time I did run the whole test suite to see what was up, and if I hadn't done that, I would have broken production yet again. Having this join was necessary, but it caused another problem, which we can go check out once I explain what the error message was. If you don't have any guesses: this fix was adding another join, but I told you earlier in this video that I already had code adding this join in elsewhere. That's why, in the previous video, I was able to use this filter without having this join here, and it worked properly on the one code path. So now I had a fix for the code path that was missing the join, but I had a new issue, because the table was getting joined in twice.

Before rushing into a fix for this, I wanted to think about the design of my SQL queries here. So I'm going to jump into this inner-join method and explain why I was okay with this fix. The point of this inner join is that it collects the joins so that I can operate on them, and at the end of using this query interface (this implementation here), I'm able to translate all of these objects into a query that I can go execute. My opinion was that if I was already doing a join and the same information was being included, I should just ignore the duplicate, and that's what this logic was added to do. I'll explain it briefly. What I do is, given a query (and this is an extension method, if you're not totally familiar with this syntax; that's what the this keyword does, and the static keyword
does here), I'm able to add a method onto my query interface. So, given a query, I take in a new join, and to make sure that it's okay to add, I iterate over the joins that already exist. If the table is the exact same, then I just don't worry about it. And to clarify, this isn't just a table name: it's a record type, and it checks both the table and the alias that I'm using in the join. If we go a little lower: if it's not equal, I check whether we're using two aliases that are the exact same, because that's when we're going to have an issue. Technically, I could join the same table in multiple times with different aliases; I'm not exactly sure off the top of my head why we might want that, but it's not erroneous to do. Having the same alias, though, is an error, so I did need to add this in here to make it a little more explicit when things are going wrong. That means I can catch the problem before running anything at the database level: as soon as I'm constructing the query, I can see exactly when I'm adding a duplicate join that has a different alias. If it's entirely duplicated, I don't really care; ignoring it is just to be safe.

With that fix in place, and again given the lesson I've been learning so far: go run all of the tests, not just the one that was failing, and make sure that everything's passing. Thankfully, when I did that, I had a green light across everything. Finally, I was done. I made sure that everything I was changing, and everything that was going to exercise that code, had a green light. The reason I have that much more confidence is that a lot of these other tests that I'm running now, the ones I wasn't running originally for some of the earlier fixes, actually start up the application. There are functional tests that literally spin up the app and connect to a database in a container, and I can interact with it.
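The duplicate-join guard described above might look something like the following sketch. ISqlQuery, JoinClause, and AddJoin are illustrative names standing in for the video's query types, and I'm assuming a record type so identical table-and-alias pairs compare equal:

```csharp
using System;
using System.Collections.Generic;

// Record type: two JoinClause values with the same Table and Alias are equal.
public sealed record JoinClause(string Table, string Alias);

public interface ISqlQuery
{
    IList<JoinClause> Joins { get; }
}

public sealed class SqlQuery : ISqlQuery
{
    public IList<JoinClause> Joins { get; } = new List<JoinClause>();
}

public static class SqlQueryExtensions
{
    // Extension method: 'this' lets callers write query.AddJoin(...).
    public static void AddJoin(this ISqlQuery query, JoinClause join)
    {
        foreach (var existing in query.Joins)
        {
            // Exact duplicate (same table AND alias): silently ignore it;
            // the query already contains this information.
            if (existing == join)
            {
                return;
            }

            // Same alias on a different table is a genuine conflict, so fail
            // fast at query-construction time rather than at the database.
            if (string.Equals(existing.Alias, join.Alias, StringComparison.OrdinalIgnoreCase))
            {
                throw new InvalidOperationException(
                    $"Alias '{join.Alias}' is already used by table '{existing.Table}'.");
            }
        }

        query.Joins.Add(join);
    }
}
```

The design choice here mirrors the one in the video: duplicate joins carrying identical information are harmless and get de-duplicated, while an alias collision is surfaced as an error before the SQL ever runs.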
So it is in fact launching the app, not just calling the backend code. I have other tests that call the backend code directly, and some that hit it from the API level, which means the server has to be spun up for them. So in my opinion, the only thing that could go wrong at this point, between me pushing up this code and what happens in production, is if something's wrong with the things being deployed, and I didn't touch any of that. I have a high degree of confidence at this point: if I can spin up the app locally and run the code paths that were touched, things should be okay.

If there was anything nice to take away from this, I think it was kind of cool that I didn't have to go write new tests. I don't mind writing new tests; I like having testable code and I like adding tests. But I didn't have to do that this time, because I already had the test coverage; I was just being an idiot by not running all the tests originally.

So if you found this video interesting, along with the ones that preceded it telling the story of all the pain I just had, you can check out this video next for the post-incident review. Thanks, and I'll see you next time.

Frequently Asked Questions

What should I do if my production server is logging errors and not responding?

The first thing I recommend is to roll back to the last known good state of your application. This is a common practice when things go wrong, as it allows you to recover to a point where everything was functioning properly. After that, take your time to analyze the logs and identify the root cause of the issue.

How can I prevent my logs from being spammed with error messages?

To prevent log spamming, review your error handling logic. In my case, I had a loop whose wait time sat inside a try-catch, so when an exception was thrown, the loop skipped the wait and immediately retried, logging the same error over and over. By moving the wait outside of the try-catch so it always runs, even after a failure, I reduced the spam to at most one logged error per interval.

What is the importance of running the full suite of tests before deploying changes?

Running the full suite of tests is crucial because it helps catch issues that might not be covered by selective tests. In my experience, I learned that even if I think I have sufficient test coverage, there can be gaps. Running all tests ensures that any potential issues are identified before they reach production, preventing further complications.

These FAQs were generated by AI from the video transcript.