BrandGhost

The BEST Way To Learn From Your Development FAILS

In this video, I'm going to show you the best way to learn from your development failures. By using post-incident review and analysis, you'll be able to identify the causes of your failures and learn from them so you don't make the same mistakes again. If you're looking to improve your development process, this is the video for you.
Transcript
...and unfortunately, each time I broke production, it got worse. Yeah, I messed up big time. In this video I'm going to discuss the postmortem process for dealing with incidents, and I'm going to do it by walking you through a recent scenario where I was having some issues with my own ASP.NET Core app. I was trying to fix a bug, pushed that to production, and ended up breaking it twice, and unfortunately each time I broke production it got worse. But I figured this would be an awesome opportunity to talk to all of you about the postmortem process and what we can learn when we have incidents. Before I dive in, just a quick reminder to check that pinned comment so you can get access to my free weekly newsletter.

So what is a postmortem? A postmortem is something we do after we have an incident, and more commonly we're referring to these as post-incident reviews, or PIRs. That's because the terminology of "postmortem" is a little bit dark, and I think we're trying to move to terminology that's a little nicer to deal with. It's the same thing with a lot of military analogies going out of style: we've been using a lot of these terms in software engineering for a long time, so there's a bit of a movement to get away from them and land on better terminology we can all align with.

When we're doing a post-incident review, what are we doing? What's the goal, and what's involved? It's a process where you get together and have a discussion with the stakeholders that are involved, and your goal is for it to be completely blameless. You don't want to be focused on blaming anyone for anything; you're trying to get to the root cause of the issue. So you do some analysis to figure out what went wrong, the different points of failure, and the opportunities for improvement. Ultimately, the takeaway is that we want to find repairs or fixes we can go implement, review different processes or learnings, and ideally not repeat the mistake in the future. In software engineering there are tons of these opportunities for feedback cycles where we can learn and improve. It's the same idea as a sprint retrospective, depending on how your team is set up: you look at how you did, whether you landed your commitments or not, and then make adjustments to see if you can do better next time. I'm going to dive into each of these sections in a bit more detail. We'll talk about being blameless, we'll talk about how you can do an analysis to figure out what went wrong, and the third part we'll look at is the actions we can take. And I'm going to relate these all back to the recent issue I had with my own application.

Now, when we talk about being blameless, what I want you to think about is that in software engineering organizations we're generally working in teams. In really big orgs you have multiple teams responsible for different parts of the entire development process: the build process, the deployment process, maybe testing. There are tons of people involved, even if you might be the one person writing the code for the particular change that ended up breaking production. Each of those parts has a stakeholder involved in helping protect production and making sure your changes are safe. It starts with you, but it certainly doesn't end with you as an individual. My situation is a little unique because this was totally me; I'm doing this on my own. So I'm trying to be a little more compassionate and understanding with myself, and not beat myself up over this, because the reality is that won't help. It's the same thing if you're working in teams: you don't want to pile on top of someone just to say "you screwed up, you're an idiot, never write code again." That's not going to help anyone. You want to come up with reasons why things went wrong so you can go make them better. I need to remind myself, as I'm reflecting on this, that yeah, I messed up, and that's okay, because I'm going to make things better going forward. Try to keep that attitude when you're going into post-incident reviews. It doesn't matter if someone pushed up code that broke production and caused some headaches for your team; the reality is that people aren't doing this intentionally. They're not trying to break things to make you upset. It's an accident, we're all human, and that means we can find ways to improve and get better as we go forward, and we can do that together.

As soon as we start getting into the blame game, people get defensive. If you've done a post-incident review, or seen one, where people start blaming, the conversation doesn't go very far. People get defensive, fingers get pointed, and ultimately you just have a room of people who are upset with each other and not really driving to results. So in order to get past that defensiveness, we want to go in blame-free. Even if one person wrote the code, the reality is you had a team of people who should have reviewed it, you have a build system that should have been running your automated tests, and the people reviewing the code, along with the original developer, should have ensured there was good test coverage. And if you're deploying to a live service, there should be some mechanism that allows you to flight changes, roll them out safely, and revert them quickly if something goes wrong. There are lots of people involved in the whole process.

Before I move on from this point, I want to call out an interesting idea that some people have, which I personally don't agree with: the concept of a "wall of shame." It kind of flies in the face of being blame-free. The idea is that they want people to feel the heat, the tension and discomfort, from doing something wrong, and that's usually the case when there are systems in place that are meant to prevent this type of issue and people are avoiding or circumventing them. Someone goes "yeah, I broke it because I didn't use X," and people respond "look, we want to make sure everyone uses X," whatever that happens to be, say the testing infrastructure that was built, and there are people who consistently aren't using it. We don't want them to just shrug it off with "whatever, maybe I'll think about it next time"; we want them to feel the discomfort of breaking the rules. I get the idea behind that, but I don't really like it, and the reason is that if people are avoiding doing something, either there's an education problem where they haven't been taught to use it properly, or there's something about that solution that isn't optimal: it's slow, or it's complex. If you just take the wall-of-shame approach, I feel like you're not really looking into why people aren't using the thing you want them to use. That's just my take, though: I like going the blame-free road entirely; the wall-of-shame thing is not really my jam.

Okay, now that we've established it's supposed to be blame-free, we want to get into the analysis, and this is often where people use the concept called the "five whys." The idea is that you ask five whys, asking more specific questions as you go. It's not a hard and fast rule, but generally, once you've asked and answered about five questions, you should have a good idea of the root cause. The reason we do this is that asking more and more specific questions ideally helps us find the weak spots we can go improve. You don't have to use a five-whys process; you can come up with your own analysis techniques. But I've seen five whys used in a handful of different places, I've used it at a startup, I've used it at Microsoft, so it's a pretty common technique, and you can find plenty of information about it elsewhere online if you want more details on how to carry it out.

I figured I'd walk you through the five whys I came up with for the recent issue I just had. The first why I'd ask is: why didn't I fix the first bug? For context, if you haven't watched the original video, I'll link it right up here; you can check that out and come back. I didn't fix the first bug because I was rushing. I fixed the symptom of what I was looking at, and I didn't end up creating any regression coverage on an end-to-end scenario that would have demonstrated the bug being present and then being fixed. I had a really narrow lens: I said "I see where the issue is, based specifically on the stack trace," and I fixed and tested for that specific thing. But because it was a hyper-focused test, and I didn't add coverage that went from the API endpoint all the way through, there was a gap that I missed. If I had put tests over that, I think I would have caught this issue pretty early on, or at least prevented some of the downstream damage, because everything else came after that as I kept trying to fix things and continued to rush.

The second why I have for myself is: why did the second fix I introduced break things? I think this one's pretty obvious: I didn't run the regression suite after the first set of changes I made. I added new tests, but the tests that already existed would have exercised the code and caught this bug before I pushed it to production. So the answer, in my opinion, is simply that I didn't run the regression suite.

If you see the pattern here, you might guess my next question for myself, number three: why didn't I run the regression suite? I had three different answers I wanted to include. If I'm being honest, just like in the first point, I was rushing; I was trying to get this fixed up and not really taking my time. The next part is that I'm in the middle of migrating between two different pipeline systems, and the regression suite is not running in the pipeline. Normally I run it manually, and because I'm not worried about other developers tripping over things right now, I can just stay in the practice of running it myself, but I didn't do that this time because I was rushing. The other part is that the test suite I have is pretty resource-intensive: I have a ton of tests, the test runner does a lot of work in parallel, and as a result a lot of containers try to spin up, and it's a bit messy. So in my haste I skipped everything and busted it.

One of my next, more specific questions, number four of the five whys: why have I not migrated that pipeline? That's what the last issue was, so why haven't I done it? The reason is that I'm running into some issues in a specific environment, specific to some Linux VMs, and that's the reality of it. It's not really a good excuse; it's just the current state. I only have so much time to either be developing or working on things on the side, so that's something that has taken a bit of a back seat for now. But the reality is there's probably more work that could be done here, which we'll touch on in a moment when we get into the repair items.

The final why I have for myself isn't more specific than the last one, but kind of goes back to the third why: why is the test suite slow to run locally? I touched on it a little already, but when I run it locally there are some issues with how the containers spin up for the functional tests. Because I have thousands of tests running in parallel, my options are to run them in serial, which takes forever, or run them in parallel, in which case I need to find a better way to scope down the resources being used. So I understand why it's happening, but I haven't come up with a good solution for it yet.

If I were to briefly summarize, I think the root cause is that I just haven't invested properly in my continuous integration system to run my test suite. If that had been in place, the reality is I wouldn't have done as much damage. The other point I need to make is that I didn't fix the issue in the first place because I was rushing, and I think if I had written an end-to-end test up front, that would have prevented everything from happening. But barring that, and accepting that there are always going to be issues like this that escape, if my continuous integration system had been working and running my tests, that would have basically covered me for the rest of my five whys.

That brings us to the repair items. When you're going through your five whys, check through all of the whys you've listed and see if you have ideas for things you could improve. If you're working in teams, it's a good opportunity for other teams to have repair items as well. For example, maybe as you're answering your five whys you say something like "we don't have testing infrastructure that supports functional tests": maybe you can run unit tests, but there's nothing that allows you to run containers in your build system. That might be something another team owns; if there's a build and infrastructure team, perhaps they could take a repair item to say "look, we'll build out that functionality," and that way you and other teams have access to run tests that need containers. I'm just making this up as an example for you to think through, but it means that your team can have repair items, and other teams can as well.

Repair items come in all different shapes and sizes. It could be that another team had a bug, or you had multiple bugs in your situation that caused these issues, so there are bug fixes that have to land, for you or for other teams. It could be, like I mentioned earlier, that there's functionality missing, and if there are infrastructure or other teams that can add features, you can build on top of those; they become more of an infrastructure-foundation type of change that enables better support going forward. But sometimes repair items are things like education and building culture around different processes. To give you a quick example from real life, I've been in situations where there are alerts, but they're too noisy, so if they don't meet a certain severity, no one looks at them. The reality is that if we had been looking at them, we would have noticed the issue. So we could change the severity, or we could change the culture around how we look at alerts to make sure they get enough visibility. There could also be situations where one team came up with a solution that other teams could leverage for the exact type of issue that was encountered, but they didn't do a good job of articulating it or sharing it out to a broader audience, so people aren't even aware they have it to take advantage of. That could totally be an educational or cultural shift the organization has to make to get on board with it.

And that brings us to the conclusion of looking at post-incident reviews. If you're not doing these already, I highly recommend you do, because it's a really awesome opportunity to have a feedback loop. Remember to be blameless as much as possible; do an analysis, whether you leverage the five whys or some other technique, to get to a root cause; and from there, come up with repair items, things you can improve upon, which could be anything from fixes to platform changes to learning opportunities. And remember that we're human, and the software we're building isn't perfect; it's never going to be perfect. Finding ways to embrace the imperfection, be nimble, evolve, and learn is really beneficial, and if you want to hear more about that, you can check out this video next. Thanks, and I'll see you next time.
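The first why above comes down to missing end-to-end regression coverage: a test that exercises the full request path, not just the function a stack trace points at. As a minimal sketch of that idea (the real app is ASP.NET Core and isn't shown here, so a tiny Python stand-in handler plays the role of the endpoint under test; the `/api/health` route and its JSON body are made up for illustration):

```python
# Sketch of an end-to-end regression test: start the app (here a
# stand-in HTTP handler) and drive a request through the whole path,
# asserting on the response the client actually sees.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class StubApp(BaseHTTPRequestHandler):
    """Stand-in for the real API; a real test would target the actual app."""

    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep test output quiet

def test_endpoint_end_to_end():
    # Bind port 0 so the OS picks a free port for the test run.
    server = HTTPServer(("127.0.0.1", 0), StubApp)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        with urlopen(f"http://127.0.0.1:{port}/api/health") as resp:
            # Assertions cover the full path: status code and payload.
            assert resp.status == 200
            assert json.loads(resp.read()) == {"status": "ok"}
    finally:
        server.shutdown()

test_endpoint_end_to_end()
```

A test shaped like this would have demonstrated the original bug being present and then fixed, exactly the regression coverage the first why found missing.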

Frequently Asked Questions

What is a postmortem process in software development?

A postmortem process, now often referred to as a post-incident review (PIR), is conducted after an incident occurs in software development. It's a blameless discussion with stakeholders aimed at analyzing what went wrong, identifying root causes, and finding opportunities for improvement to prevent similar issues in the future.

Why is it important to approach postmortem reviews in a blameless manner?

It's crucial to approach postmortem reviews in a blameless manner because blaming individuals can lead to defensiveness and hinder open communication. By focusing on understanding the root causes and collective responsibility, we create a more constructive environment that encourages learning and improvement.

How can I effectively analyze incidents using the five whys technique?

To effectively analyze incidents using the five whys technique, start by asking 'why' about the initial problem and continue to ask 'why' for each subsequent answer. This helps drill down to the root cause of the issue. It's a flexible approach, so feel free to adapt it to your specific situation to uncover deeper insights.
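The drill-down described above can be sketched as a simple recorded chain, where each question interrogates the previous answer and the final answer is the candidate root cause to attach repair items to. The questions and answers below are a hypothetical example, not taken from any specific incident:

```python
# A five-whys chain modeled as (question, answer) pairs.
# Each question drills into the previous answer. Example data is hypothetical.
five_whys = [
    ("Why did the deploy break production?",
     "A regression slipped through."),
    ("Why did the regression slip through?",
     "The regression suite wasn't run before the push."),
    ("Why wasn't the regression suite run?",
     "It isn't wired into the CI pipeline yet."),
    ("Why isn't it in the CI pipeline?",
     "The pipeline migration is unfinished."),
    ("Why is the migration unfinished?",
     "Build-environment issues haven't been prioritized."),
]

def root_cause(chain):
    # The last answer in the chain is the working root cause.
    return chain[-1][1]

print(root_cause(five_whys))  # → "Build-environment issues haven't been prioritized."
```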

These FAQs were generated by AI from the video transcript.