Safe Deployment Practices - Principal Software Engineering Manager AMA

Name: Safe Deployment Practices - Principal Software Engineering Manager AMA
Uploaded: 2024-07-23T17:34:10.0000000+00:00
Duration: 1 h 10 min 50 s

July 23, 2024

Yet another post about CrowdStrike...

Actually this is about Safe Deployment Practices and different things you can consider for rolling out changes safely.

I don't want to pretend to be an expert on the CrowdStrike outage. I don't know exactly what happened, and as I write this, I haven't seen whether even CrowdStrike has identified a root cause.

So instead of pretending to know things that I don't, I'd rather talk to you about some things that I do know:

Safe Deployment Practices.

This is an AMA livestream! Come with your questions about programming, software engineering, career progression, etc... Happy to help share my experiences and insights!

Today we focus on:
- My newsletter points regarding Safe Deployment Practices
- Jumping into articles/posts from LinkedIn & Reddit
- Answering YOUR questions

Instagram's usually the last one to go, and that seems pretty good. Cool, I think we're live. We're going to see how long I go on this one tonight. I do try to aim for about two hours, so from 9:00 p.m. to 11:00 p.m. my time, which is Pacific, so we'll see. I'm going to be going over my previous newsletter article, like I typically do, and I'm going to drop that into the chat. If you are on a different platform... well, on TikTok I'm not even sure if I can send links, but anyway, here you go. On Instagram I don't think it's going to work. But if you want to see what I'm going to be referring to, it will be that newsletter article.

So, big shock to everyone, right? It's going to be about CrowdStrike, but not specifically. I don't want to focus on CrowdStrike; I just want to use this as an opportunity to talk about safe deployment, so that's going to be the topic today.

The reason I might cut it a little bit shorter than the two-hour mark is that tomorrow I am flying out to Texas. I got invited to speak at a smaller meetup in Texas on some .NET topics, so I'm going to be focusing on plugin architecture tomorrow in Dallas. It's going to be recorded, as far as I know. I don't know if I'm going to get my hands on the live recorded version (that's weird to say), but if not, it's a presentation that I can give online in a different form as well. I'll see what I can get out of that, and if not, I'll just record it and put it on YouTube, speaking to the camera like this. It's going to be pretty high level. Because it's a live talk, it's not going to dive into the lines of code you're going to want to write and stuff; I think it's going to land very well with a live audience. So it'll be some high-level things, and then if I do
it for YouTube, I'll have a variation of it, but that will be coming. The reason I'm mentioning this is that the flight leaves at 7 in the morning, so I basically have to leave here at 4:30 in the morning, which is only a few hours from now. So I might cut it a little short, but we'll see how it goes.

Okay, with that said, I'm assuming everyone has heard of the big outage that happened. It's kind of interesting, right? When this kind of thing happens, maybe not specifically a global outage like this, but when there's something big in the news, you've probably seen it a million times: now everyone's suddenly an expert. We see this all the time. There are a lot of people talking about something, it's all over the place, and suddenly all these people are like, "Well, I'm an expert on this," and they're not. And that's okay. I want to come right out and say: they've released that there was a null pointer exception, but I haven't seen anything beyond that. The reason I don't want to focus specifically on the CrowdStrike incident is that I haven't seen the post-incident review. Sure, null pointer exception, okay. Anyone who's written and deployed code before has had a bug at some point. I don't care if it's a null pointer exception, an off-by-one error, or whatever it happens to be. There's more than one thing that goes wrong in a chain of events like this, so to leave it at "null pointer exception" and stop talking about it doesn't really feel conclusive. I haven't seen anything that's come out of that post-incident review, and that means I am not able to speak specifically about this incident. It doesn't make sense; I'm just speculating if I do that, and I don't want to do that.

But I do want to talk about safe deployment, and I feel like I'm in a pretty good position to talk about this. I have recently switched teams at Microsoft, but prior to
the team that I'm currently on, which does proxying... I manage a couple of different feature crews. One of them is responsible for the Substrate firewall, we have caching technology, and I mentioned the proxying, so we're responsible for routing all of the traffic that is flowing through Substrate, which is pretty cool. But prior to being on this team, I was on the deployment team in Substrate. Deployment is a bigger umbrella within Substrate, so my manager oversaw all of deployment, but I had one of the feature crews and sort of a secondary deployment team as well. I did that for about three years.

In terms of expertise, I don't like calling myself an expert in many things. Having done deployment for three years as a manager, and not as someone developing the deployment infrastructure, I still don't want to call myself an expert. I think that would be an injustice to the people who were working on this full-time, the software engineers who had been doing it for years before me. But I do have experience in it, and at scale. So I wanted to talk about some of the things that we do, the best practices, the patterns and things like that that we follow. And again, the goal here is not to say, "See, CrowdStrike should have done this." I don't know what they did or did not do. So I'm just highlighting, as I go through this talk, that these are going to be things to try and keep in mind.

If you've seen my other content, and you probably have if you're watching this, you'll notice that I don't like to just say, "Here is the one right way to do things." In fact, I think that anyone who's saying there's only one right way is probably wrong. Maybe not fully wrong; they may be partially right. But saying there's only one right way doesn't really land well with me. I think there are always several ways to do things. So when I go through these different
deployment practices, these are generalizations, things that I think you can take, think about, and apply to your own circumstances. You won't hear me prescribing "do exactly this," because I don't know your circumstances, and they're going to be wildly different from scenario to scenario. So that's my little precursor to all of this. Hope that makes sense.

Chat should be active, so at any point feel free to drop messages in the chat. I'm just double-checking: I can see TikTok, I can see Instagram. I haven't seen anyone send messages in chat yet, but they should be coming in from LinkedIn, Facebook, Twitter or X. I do see that most of the audience is from X, which is kind of interesting. So feel free to jump into the chat and I will try to answer. Because it's live, if you have questions or comments, I don't care if they're not related to what I'm talking about; I'd much rather just answer the questions that you have live. Otherwise I would just go record a YouTube video and not do this at 9:00 at night on a Monday. Sound good? Cool.

So this is going to be from my newsletter, like I mentioned. To start things off: what happened? Well, we had this big outage. It's kind of interesting. For a lot of people in the media, it's very easy to point the finger at Microsoft and say, "Well, it's Microsoft." And it's like, well, not quite. It affected Microsoft systems, but that does not mean that it was necessarily caused by Microsoft. I do think that Microsoft has done a great job trying to help and offer support. Technically they didn't have to do that, but I think it's really good that they have the means to do so, and obviously they want people who are using their products to be able to get back up and running as fast as possible. It's quite impactful. Someone in the chat might know the number of machines across the world that were affected, but like
you know, pretty incredible. I've heard some people saying, and I believe this, that there has potentially never been an outage like this in history, so that's a pretty significant thing to think about. I don't think that anything in my personal life was affected. I'm going to be flying tomorrow, right, and I think I have a Delta flight on the way back. Maybe some of the delays from that are still carrying over, because my flight on the way back did get moved, which is kind of interesting. But otherwise I felt like I was pretty isolated from this stuff. There are tons of infrastructure and systems that were affected, though, and unfortunately there were people who lost their lives and things like that. So it's not a trivial thing. This is pretty incredible, especially the scale.

So we had this issue, right? CrowdStrike was able to deploy a change, effectively a null pointer exception. Like I said, we're not going to spend all the time on that, but the first thing that I would love to see come out of this post-incident review is: how do we go from a null pointer exception to every machine (and I'm assuming it's every machine in CrowdStrike's sort of control) just having this problem now? Did someone just press the button and instantly all the machines got it? I don't know what happened in between there, but what I'm going to be talking about are some of the best practices and patterns that we try to lean into, at least in Substrate, to be able to fill that gap.

Question in the chat: "Should I do deployment on a Friday? I've heard that if you have good practices and are confident you should go ahead and do this." I am of the camp, personally, that a Friday is absolutely no different than any other day of the week. Now, I have seen some good points come up online about this in terms
of framing, so we have to talk about the product that you're deploying. Maybe I'll step back for a second. I think, personally, when we're talking about deployment technologies, the technology should allow you to deploy and roll back at the speed of light. As fast as you could possibly do it, the technology should allow it, forward and back. That does not mean that you should go do it; it means that you have the means to do it. So we decouple these things. Deployment tech can go forward and back as fast as possible. You want a new change? It's out there instantly across the world, like that. Again, not that you should, but the technology should support it. And to go backwards, to get back to a last known good state, I think you should have the same ability, as fast as possible. In reality, I think you do want to be able to go back to last known good truly as fast as possible, but I like thinking about this from the technology perspective: it should be able to go as fast as possible. Now, in terms of how we do this in practice, just because the technology can go as fast as possible doesn't mean you should.

To give you a metaphor here, let's think about fast cars. Let's pick a Lamborghini. Lamborghinis are fun; everyone loves a Lamborghini. A Lamborghini can go really fast. It means that if you are driving on the highway and someone slams on their brakes in front of you, or you have to maneuver or something, you should be able to put your foot down and the car should be able to respond and do its thing to get you to where you need to be. Kind of instinctually, you can dodge things and move around. Maybe if you're not a good driver that's not the case, but you have the ability with the technology in your hands. That doesn't mean that just because you're on the highway you should say, "Oh, I'm on the highway" (unless you're on the Autobahn), put your foot down, and
say, "Well, I'm in a Lamborghini, I'm going to go 300 km an hour." Just because the technology can does not mean that you should, but it means that you have the ability to do so when you need to. That's what I want to call out.

Now, when it comes to deploying on Fridays: again, do you need to deploy on a Friday? If you're deploying to a set of customers, think about the release cadence that they're expecting, because that's a big part of it. Depending on what you're deploying, you may not want to do that. To give you another example, I used to make desktop software, and you'd have to go download it from the website. If we were to release it on a Friday night, odds are probably no one's going to download it anyway. Minimal people would. So is there even an incentive for us to do it? Just a thought.

A lot of the software that's built these days, including by a lot of the people watching this, is going to be something like a web service, something that's rolled out behind the scenes. These changes are going out, and you as a user of a website or some application or service might not even know that it's happening. I think that's probably the majority of the situations people are thinking about. But I just want to call out that you may have situations where, because of customer expectations, you're not deploying on weekends, certain days of the week, or holidays, just because of the use case. So, something to think about.

Now, all that aside, one of the biggest arguments I hear for "don't deploy on a Friday" is: well, if it breaks, now you're spending the weekend. You've got to come in on the weekend and fix this stuff, so why would you do that to yourselves? Why not just deploy on some other day of the week, so that you don't have to go waste a weekend? And I just think, like,
this is kind of a silly argument, because I don't know about you, but I'm also busy during the week. So you don't want to deploy on a Friday because it's not safe and you don't want to come in on the weekend, but you'll do it on a Monday? Okay. Maybe someone's done some analysis and statistically deploying on a Friday just happens to carry a higher likelihood that you're going to have a problem. I kind of doubt it, but let's assume there's an equal likelihood of something going wrong. So if you do it on a Monday instead and something goes wrong, now what? You're going to spend your Tuesday, maybe your Wednesday, fixing things, and you go, "But at least I didn't waste my weekend." And I'm just thinking, man, I've still got work to do. So I'm just going to fix the issues on Tuesday and Wednesday and go, "Thankfully I didn't waste my weekend"? What about all the work I still had to do anyway? Or the fact that maybe you're going to have to wake up super early on a Tuesday, at 4 in the morning, to go help fix something, and stay up super late on Tuesday to fix it. But at least you didn't waste your weekend. I just think it's not a great argument. I don't know, personally.

The reason I like saying "deploy on a Friday" is that I think it makes you ask good questions. If I say to someone, "Do you feel like your team could deploy on a Friday?" (again, not whether your customers expect it or whatever; we're parking that), "do you feel confident that you could deploy on a Friday?", you'll have a lot of people who say no. Then I would say, "Well, what makes you feel confident you can deploy on a Tuesday? Why is it different?" And I would start to analyze that. If you watch my other content, I talk about testing a lot,
and it's not a sexy topic in software development. Everyone's kind of like, "We just want to code stuff. We don't like tests. We know they're important, but shut up about it already." But this is why: tests give you confidence, and you can deploy things in ways that will also give you confidence. So I like asking people: if you don't feel comfortable deploying on a Friday (oops, there's my mic), why, truly? It's not meant to make you feel bad, like I think I'm superior to you because I do want to deploy on a Friday. Not at all. It's a question that I want to make you dig into the systems and stuff you have in place.

You might say, "Well, we can't possibly do that, because the deployment could get stuck partway." I'm just making things up, right? "We could be partially deployed and be in this weird situation, and someone's going to have to come in." And I would say, "Well, that sounds like a weird problem to have. How often does that happen?" "Oh, half the time we're dealing with this kind of thing." And I go, "Well, maybe you should invest in some of your deployment technology to make sure this doesn't happen." Or maybe your deployment tech's great, and you go, "Well, we don't feel confident doing it on a Friday because we're going to have to come in on the weekend" (bad argument), "and it's just because of our tests. We have really good unit tests, but we're always catching these weird things at deployment time." Then I would say, "Well, what if you had tests that covered those scenarios?" "Well, our infrastructure doesn't support it. We can only write these unit tests." And I go, "Well, maybe you should think about investing in some of the infrastructure around this kind of thing."

So, Sergio on LinkedIn, I don't know if you want to go through this thought exercise, but maybe,
if you're thinking, "Can we deploy on a Friday? Do we feel good about that?", ask what truly holds you back from doing it. Again, it's not a right or wrong thing. This is just meant to make people think; it's really like doing a root cause analysis of why you feel that it's not okay. The point of saying all of this is that I like being able to deploy on a Friday. That does not mean I always can. I can tell you right now, I have some things where, if someone said, "Well, Nick, you said you like to deploy on a Friday, go deploy this thing," I would be like, "Absolutely not." The point is that it's a goal state that I want to be in. I want to be able to say I am comfortable deploying this thing any day of the week. I don't care if it's Christmas, if it's New Year's, if it's Thanksgiving, or if it's my birthday the next day and I don't want to ruin it. I should be able to do it whenever and feel confident, and if I don't, what needs to be adjusted to get there? So, Sergio, I hope that helps frame up some thoughts around that. What I'm going to be talking about in here is partly testing and the rest is deployment, which of course is the topic, but hopefully this will give you some ideas for things to think about.

The first part of my article, after introducing CrowdStrike, was really: okay, the obvious one is tests. I use a phrase in here that comes up a lot. I don't know when people started using it more and more, but I feel like not a week goes by where I don't hear people saying it: shift left. This is the phrase for pushing things earlier and earlier so we have more insight sooner. So things like testing, and being able to know that things are functional, and being able to do that as early as possible. One extreme that I like to think of that's really cool: if you're a .NET developer, and I'm sure there are
similar things in other tool sets, but in .NET we have Roslyn, and we can have Roslyn analyzers. With Roslyn analyzers, what we're able to do is basically write rules that prevent you from doing certain things. They kind of force you into the pit of success, and I think that's super cool because, depending on the scenario, it's shifting as far left as possible. As the code you're typing is coming out of your fingertips, an analyzer can be like, "Nope, don't do that, because we've seen patterns like this lead to bad things." The next thing would be unit tests or something else that you can go run right in your IDE, right in your build, and get responsive feedback. So that's some of the framing around that: shifting left just gives you that feedback earlier and earlier.

Now, there are types of tests that I would say don't fit well with shifting them as far left as possible. Depending on your experience level and how much you like writing tests, there are all sorts of different kinds of tests. Unit tests, I would say, shift left as far as possible. They should be very quick, people should be able to run them in their IDE, and you should have that feedback basically immediately. But you might have other types of tests. I'll give you an example. When I was working at Magnet Forensics, before Microsoft, we would have tests that were essentially what we would call a scan. We would scan hard drives for digital forensic information and do these comparisons. A scan, depending on the size of the drive and the type of search we were doing, could take hours. So sure, you could go run it from your IDE, but even if we shifted it left as far as possible, it still might not be a suitable thing to run. It's not really realistic to say, "Hey, you changed a line of code, go
run this thing for 6 hours and wait until you get the feedback before typing another line." But in general, you want to shift things left so that we have this feedback. You've probably seen charts online, they float around periodically, that show the cost of fixing a bug. If you can fix it really early, there's very little cost. Even if you're investing in analyzers and things like that, it's little cost, proportionally. But the further away bugs are from the developer's fingertips when you have to start fixing them, the higher that cost gets. And the CrowdStrike situation is the most extreme: one bug (and I'm kind of breaking my rule here; I don't know what else happened in the middle). We see a null pointer exception, so there's the one bug, but the cost of that bug being found all the way at the end is absolutely extreme. If there was something that could have prevented it and caught it, maybe that would have been very cheap up front. So, just a spectrum to think through.

In the chat: "What are the chances this update was on purpose?" I guess anything's possible. Personally, I don't like speculating on this kind of stuff, because we could have this conversation about anything, right? Was someone motivated to do it? Sure, someone could be, but like I said, you could go down this rabbit hole for any topic you want, and I don't know. I would much rather spend my time just thinking about... like, I can't go back and fix this, not that it's necessarily my responsibility to do so, and I can't do anything about something that happened in the past. So the most that I can do, for influence at least, is have a stream like this or make content that shares information about how you could prevent something like this in your own work,
because I think we have control over that at least, and I'd rather talk about things that we have some control over. So sorry, I don't really know when it comes to speculation and stuff like that.

"I only ask because the last eight days there have been videos on execution against attacking computer kernels, then it happens." Yeah, I don't know, to be honest. It could be anything, right? I don't know the answer to that, unfortunately. And you're welcome. I wish I had a better answer for you.

One of the takeaways under the testing section of my article was to call out that you do have a variety of testing styles you can use. I mentioned unit tests; we have things like functional or integration tests; you could have performance tests. There are all sorts of tests you could have, and I like talking about all these different kinds because if you take the lens very narrowly, like "we're just going to write unit tests," you're missing out on a ton. And if you say, "Well, unit tests suck, they're brittle, and functional tests are the best," I go, okay, great, they might have better coverage, they might be the best bang for the buck because they're somewhere in the middle, but if you're not writing unit tests, I can guarantee you you're missing out on some opportunities. So instead of anchoring to one type of test and saying this is the best and the only thing we're doing, I would just say: ask yourself honestly, where do you need confidence, and what's the best tool to give you that confidence? You'll find that there is a spectrum of different styles of tests to pull from. So, just something to think about. I highly encourage you to go through that exercise, and if you're unable to add tests of a particular type because your infrastructure doesn't support it, or the code's written in a way
that doesn't allow it, have conversations. If you're an IC, have conversations with your team, your product owner, your manager about: could we do better than this? If we're constantly feeling like we're stuck in this pattern of fixing these silly bugs that keep coming up, and we don't have confidence over this area, what could we do better? So, something to think about.

The next takeaway is probably a pretty obvious one, but I think it's worth mentioning. This will depend a lot on the size of your company, or the life cycle of the product or service you're working on, but you're going to want to hook up automation for your tests. I'm kind of laughing because it's obvious for probably most people. Having some tests is better than zero tests, even if you have to go manually run them. But if you're relying on manually running them, the point I want to get across is that we're all human, and there will be times where you forget, or you do something weird with the run, because maybe it's not perfectly repeatable when a human does it. There are always going to be these situations. So I would say take the human element out of it: hook it up to automation and run your suite of tests. I would recommend getting to that early and removing the human error part.

And then the final part on testing, because the main point was just the different types of tests, is: revisit this stuff. You might have something in place today, and you'll say, "Hey, we unit test a lot of stuff" (I'm just making this up, right?), "we unit test a lot of stuff and we feel pretty good about it. We've been going pretty strong. We've had a couple of hiccups here and there, but we go back to the unit tests, check them out, oh, there's a gap, we go add one. We're feeling good, we feel confident." And I go, great, you know,
if you and the team are feeling confident about your deployments and about making changes to the software, excellent. That's the state you want to be in. But if you just put your blinders on and say, "Well, this works right now, that's all we're going to do, and we're not going to talk about it anymore," the risk becomes that if you're not evolving your strategies as your product and service evolve, you will stagnate. You'll be in a position where you're going, "Oh crap, do I have to play catch-up on this now? Can we keep writing those styles of tests? Are they going to be okay? Is there more that we could be doing here?" One of my philosophies in software development in general is to be continuously improving across every angle that you can, and that includes looking at your deployment and your tests and things like that and saying: are we still doing a good job here? Could we do better? Where are the gaps? So constantly revisit stuff.

Okay, we're going to start talking about deployment now, because so far we've been dancing around deployment a little bit. To start this part off, I did mention this already, and I'll repeat it in case you just joined the stream: when I talk about deployment, I want to decouple how the technology is able to do things from how you leverage the technology. To repeat this from a little bit earlier, because I think it's worth repeating: I think you want to strive for deployment technology that can go as fast as possible. If you were to hypothetically press the deploy button, the technology should be able to go at light speed and have all of the changes that you want deployed everywhere instantly. Obviously that's an exaggeration, it's not really realistic, but the technology should be capable of it. And going the other way, if you were like, "Oh crap, that's not good," you would want light-speed rollbacks to last
known good state. That would be ideal from a technology perspective. So all of the things that you can be doing to move the technology in that direction, I think, are great. There's going to be a point of diminishing returns where you're like, "It's fast enough, we don't have to truly go the speed of light." Now, that is coupled with this next part, which is: just because you can go that fast does not mean you should. That's the whole point of this section in my article, which is "slow and steady."

I'm exaggerating a lot when I say, "Hey, you press the button and it's all at once, everywhere." If we think about CrowdStrike in this example (again, I don't know the exact details), that's exactly what it looks like on the surface, though. It's kind of like there's a bug in the software, someone didn't notice, they pressed the button, and it's literally everywhere all at once. So the technology could go very fast, but they also let it go very fast.

CJ Styles, when you say "no validation on completion," can you complete that thought so I can respond to it? I'm not exactly sure how to answer that.

So you want to be able to go fast; that doesn't mean that you should. In fact, I think there's a lot of benefit in going slower. You do have to think about your use case. I want to give you a couple of different examples to think through here. You're rolling out a feature on your team, and depending on how many machines it's going to, you might have things like A/B testing and stuff like that. But regardless of all these things, if you were to press the deploy button and it just went everywhere, you're now in this position where you're like, "Okay, how do I monitor all this stuff to make sure it's good? Do I need to be ready to press the go-backwards button as fast as possible?" It's kind
of it's not a great spot to be in but if you can start going like incrementally more and more and taking your like take your time to do it it lets you go do the analysis to see if things are working as you expect so when I think about deploying the the difference between that and testing is like I mean there's a little bit of relation there but I would I like to think about it as validation right so the difference I want to draw here is that we can run tests on repeat and they are repeatable by like hopefully by definition they're repeatable when we're rolling things out like rolling stuff out does not give you like regression coverage over the new change that you're introducing so it's not a replacement you don't say hey we roll things out slowly therefore we don't need tests right they solve different problems there is some overlap but they solve different problems so going slow allows you to be careful about how far you're going how fast you're going and then looking at the results of doing it now the other thing when we think about I was saying hey you want your technology to be able to go as fast as possible forward and backwards the reality is like it it probably doesn't go light speed so you do want to try and optimize for going back to last known good more than you want to optimize going forward faster and that's because generally forward fixes are pretty brittle okay so when you are deploying things and you are going slow if there's an issue what you don't do most of the time is go oh no worries I just looked at the code I can see the bug fix I'll just let me just go deploy one more change on top of that and like you know what that we know the technology goes fast right we just let's like ramp up the speed a little bit we'll kind of catch up and then keep going um this is a it's a pretty I'm not saying you can never do this because I don't I don't like saying uh never and always but it's a pretty risky pattern and the reason it's a risky pattern is because 
usually you're under pressure trying to solve bugs like this and then you're going to go even faster to try and get it out there and what you usually see is this kind of like landsliding effect where you're like oh crap like that one didn't work either um what else what else and you go Rush another fix um instead and ideally you're able to do this most of the time is you just revert to the last known good State and if you're able to do that very fast what's nice about that is you take this state of panic like oh crap everything's broken my features busted or I took down some other service or um you know bad things are happening you take that Panic away and you go back to the last known good State as fast as you can now you can go deep breath right okay so what happened like what went wrong there let's go take the time to go like fix this properly and go roll out the change because you're back in this last known good State you feel safe to be able to go forward so um I highly recommend doing that um so that's one of the things I called out my newsletter is just like you know reverting is uh I don't I don't say always but almost always like the best thing that you can do right gives you that time to go make the change um more safely so we're going to talk in the next section that I'll talk about is like signaling monitoring and alerting but uh it's a little bit out of order but I want to bring this up um about like why roll things out slower because again if you don't have a lot of experience doing this um or or maybe you're in an environment where you can go faster the thing that I want you to think about is like is the situations and the and and we're going to talk talk about the signals that you have if you have really good signals for what's called green lighting you might feel very good about moving faster because if you can get those signals fast your confidence level is high it comes back to confidence right if you need to do red lighting and other things like 
that you might need to wait longer or wait for these signals to to know like oh crap like the bad thing happened like um so your situations and your your mileage will vary right so these are just things that you want to think about in different environments um so okay back in the chat when sending out updates to computer do they not send a ping confirmation or handshake to say yes update completed yeah this is it's an implementation detail right it's hard to know I don't know for the crowd strike situation this is why I'm trying not to comment on it like too specifically because it's like it's almost like I'm speculating and if I have to speculate I need to be very transparent that I'm speculating and not like from you know as an engineer like I know the and I'm not an engineer uh I'm not a professional engineer I've been programming for 21 years and I went to school for it but by law I am not a professional so it is speculating for me to say this but uh I would imagine that when they're doing deployments and things like that they do have some type of feedback mechanism so again speculating what I what I would gather from something like this being so widespread and so uh like so impactful is that even if they have this Con let's let's kind of do like um benefit of the doubt right let's assume because again if we're going to speculate I want to kind of walk through a happy path or like a best intentions kind of thing and prove that even with best intentions things can kind of go wrong so let's assume that crowd strike does have some type of roll out where they're like hey we're going to try it on some machines it looks good we'll go a little further okay in that type of scenario they might have red and green lighting where they can get signals back from machines so as CJ Styles is saying in the chat like do they send a ping or some other confirmation about these things going like probably right if if you were to think like if someone were building software that does 
this type of roll out and we need to be able to have this feedback you would speculate that they have some type of feedback mechanism so I think that would make sense so let's assume that part's working right when you have to do these root cause analyses you have to go dig into like what part of the system didn't go as expected where you can go make improvements but let's assume um let's assume that they have feedback like that the fact that all of this happened like in a very fast way makes me either think their feedback part was broken but I almost think that if they did have some slow release thing I feel like the gating system to roll out to larger and larger Scopes basically like hey we tested it here on these few machines and then you know a bigger set of machines over here and this bigger set now let's go do the world I feel like again speculating but I feel like um this is so funny Instagram I'm literally streaming on Instagram and they put up a banner that's like your accounts being automated we think like dude it's a live stream I'm literally talking to a camera um anyway I think I feel like the gating system must have been broken because I want to assume they have something to control the roll out so how would it go everywhere all at once so fast well that's how I think that that happened so um I'm gonna try to get back to my my live stream here okay cool it's still running it's still running so yeah um yeah see I I hope that kind of answers like my my thoughts at least on that but you know it's total speculation uh so Canary releases I think this technique can help yeah so uh Canary and a coal mine right you want to be able to do like a bit of a scream test so you put it somewhere and see like is there any noise um there's AB testing there's all sorts of things um from my and you can do this in many different ways you can mix and match these things you could do it for certain types of rollouts um in substrate we combine doing like uh slow controlled 
rollouts with the ability to go back fast, and we also do feature flagging on top of all of this. So a really cool way that you can approach things, and it's a generalization, so you might be listening to this and thinking, "Well, Nick, what about this scenario? This obviously won't work," and I get it, I'm just trying to make some generalizations. But you could think about deploying a feature, okay? I'm just going to make up some timelines. Your change gets built, all your tests pass, and from the build you press the button to deploy to the world. Let's say we're going to take two weeks to do that, and we'll gradually do more and more scopes: we'll start very small and kind of spread it across the world over those two weeks. What you could do during that time, depending on how much control you want, is also layer a feature flag onto this. So depending on what you're building, you might say, "I have two weeks for this thing to touch every machine, but what I'm also going to do is, by default, my feature, my change, is going to be turned off." It's off, right? And that means that, theoretically, you could wait the full two weeks, have it deployed to every machine in the world, and you should, potentially, notice literally no difference. Then with your flighting on top of that, what you're able to do is start controlling your own scopes of the machines you want to see get turned on. So these are decoupled things, but you can't turn the flight on somewhere if the change has never even been deployed there, so you do have this base where you need to start deploying across all the machines, and your flight could chase it. You could absolutely do that. You could turn your flight on by default, but I think a safer way to do it, if we're thinking about safe deployment practices, is to keep your feature off by default. You do have this period of time where the deployment is going to
be spreading the change out, literally the binaries being landed on machines, and then from there you can chase the deployment at the cadence that you want. But the deployment is setting the rate at which you're able to go; you can't go faster than the deployment. So in the example that I'm giving you here with this two-week thing, if you're one week into it, and I said two weeks to get the whole way, you can't say, "Okay, I've seen enough, just turn the flight on for all the machines." You can't, because your change hasn't even made it that far. So there are some things to think about there, and I hope that makes sense. You could turn the flight on once it's at, and I'm just making this up, say you're one week in, you're very happy with it, and we're at 50% of the machines because it's going linearly. You might say, "Okay, turn the flight on fully." It's not going to take effect everywhere instantly, but it will mean that as the other machines are deployed after that, your service will be coming back online with the flight turned on automatically. But these are two systems you can layer on top of each other, and like I said, you can mix this with A/B testing and things like that, so there are lots of different things that you can mix in. But the main point that I wanted to talk about here was going slow. It doesn't have to be dreadfully slow, but slow enough that you have confidence in being able to see what's going on. So that's something to think about for your situation. Okay, next part, because I think this is important to layer on with the going-slow part, because you might say, "Well, Nick, how do I know how fast or slow to go?" I want to talk about signaling, monitoring, and alerting, and I'm going to explain my working definitions here. That way, if you want to disagree about what these words mean, that's totally cool. I'm not here to tell you this is the exact
definition. I want to define these three words in a way that I can talk about them, and if you would like to use your own words for these things, totally cool. I'm not here to be a dictionary or thesaurus, so it's just to get us on the same page. So signaling is going to refer to the signals that we want to look for, so different things that you might measure. A signal could be your CPU usage, it could be your hard disk space, it could be memory usage, it could be whether or not something is running at a given point in time. You could check for all sorts of things, right? Did we make sure that we had those 10 lines available in the configuration file? I'm just making stuff up, but the signal is this base thing that you're using as a measurement: some piece of data that's interesting to you in your situation. I think things like CPU and memory are a pretty common type of signal. Availability could be something. So to give you an example, if you wanted to ping or do a health check on a service, the part where you're pinging and measuring whether the service is online, the act of doing that is the signal that you're trying to measure. And that's going to bring us a nice little segue into the next one, which is monitoring. So when do you do this? How often do you do it? What are you trending for the signal? Take the ping example. If you're thinking about monitoring over a signal, which is the health check, and I'm not insinuating an answer here, so I want to be careful about how I say it, but do you care if you go to do a ping the one time and the service is down? What if you did this over 10 minutes and you pinged 100 times, and one time it didn't work, right in the beginning,
would you say, "Oh, our monitoring is seeing something not good"? Would that be the case? Or would it be some threshold, like we pinged 100 times and 50 of them weren't good? You might say, "This is the trend that we want to look for." You could do the same thing for CPU and memory. So I just want you to be thinking about it like this: the signal is the thing that you're measuring, and monitoring is the period of time, or the view of this data, that you want to set. And the final thing that we're going to layer on to that is the alerting part. So again, we use the pinging example for a health check. You might say, "Hey, we didn't get 50% health on our availability checks. That was the bar we set, and at that bar we should alert." So what does it mean to alert? Well, this is going to mean all sorts of different things depending on where you are. Take a personal project that I'm working on, something called BrandGhost. If BrandGhost doesn't do the right thing, I get an email, it goes right to my inbox, and hopefully I wake up. So it's a system that's not as robust as, say, Substrate or Office 365. If something goes wrong on some of those monitors, you have entire teams of human beings that will get an alert. It gets fired to their inbox, it goes to an automated system, and depending on the severity of it, which is a whole other thing, it might call them. There are all sorts of things that can happen with alerts, but you need to think about how you combine these things. I mentioned the word severity, and I think that's a good one to think about too, because you might have some signals with some monitors where you say, this looks suspicious, we should probably put this on someone's radar and they should check it out. We don't think that the signal with our monitor is strong enough where we're like, you know what, this is bad,
we've got to stop, but it might be that this seems suspicious at least, and we should go look into it, because it might be a sign that something bad is going to happen later. So you might have these different severity levels or different types of alerting that you mix in based on that severity. The takeaway that I wrote in my newsletter article was just: understand how these three different things, whatever you would like to call them, fit into your current system. If you don't have any signals, you should probably think, okay, what does a good state look like? After a deployment is done, what would make me feel good? What would make me feel bad? And try to think about those things. Embarrassingly, I've had a couple of situations, and maybe it's worth talking about these. I'll use BrandGhost as the example. I made a YouTube video recently where I showed how this happened. I did a deployment, and the thing that I was trying to check was about running these jobs that would happen overnight. So I have a schedule that runs, and I know that the data that I have to operate on is not perfect, as in it's not perfectly sanitized, so I speculated there could be an issue when the schedule is running. In the video I showed how I put this try/catch inside of this loop, and I was trying to be safe, so I had it encompass the entire inner body of the loop. And it did throw an exception, so thank you, try/catch, you saved me. But what didn't work so hot for me was that the delay before the next iteration was also inside the try/catch, so I had this busy loop that basically sat there spamming Azure as fast as it could, and it ate through my entire budget overnight. Now, the budget was not super high, this is just some Microsoft credits I was going through, so it was okay. But the point that I'm trying to make here is, what would be a good thing for me at
this point? A good signal would be CPU usage. I've burned myself a couple of times now where this has happened. When I was doing Meal Coach, so this is another one, I tried having a nutrition platform as a service a bit over two years ago now. It was a failed sort of startup idea, but I had issues with database connections being held open, so that might not have been a good deployment signal. I'm just trying to think through if there are some good examples from that. Maybe not a good deployment signal. This logging one, in the example I just gave you from BrandGhost, that's not a good example in that context. But after I put that YouTube video out, and I wish I was making this up, I'm just being transparent, I pushed another one. So that YouTube video was made a couple of weeks after the issue happened. After I released that YouTube video, I literally pushed up a breaking logger change. I configured the logger wrong, and on application startup it was throwing an exception because it was failing trying to set up the log. So then it used a logger to try and log the error, which also threw an exception, and it sat there in an infinite loop, almost like recursion, and that was at app startup. Totally not good. So I think for me a good signal at this point for BrandGhost would be: you should probably put something there that's going to monitor CPU, and if you see it spike, probably not good, revert back to last known good. So that's my little story there, and that would be something that I want as a signal. On the monitoring front, if I had to ask how long after a deployment, once the app is up and running, the CPU can stay elevated, I don't know the threshold. I'd have to go look at a chart. But I might say, hey look, if you're over x% CPU even for, I don't know, 30 seconds or something, and it might not even need that long, probably not good: go back to last known good and send
Nick an email like that kind of covers the signaling monitoring and alerting for for brand ghost for for one signal and one monitor so something to think about but you can go through that exercise for your own systems and see how that fits um I have to sneeze it's creeping up on me don't do it Nick okay and I think that was mostly it then kind of coming back I finished the article off with deploying on Friday so um depending on when you join the stream one of the first questions from Sergio in the chat was I'll just kind of read it back out he said should I do a deployment on a Friday I've heard that if you have good practices and are confident you should go ahead and do it and my response was absolutely so I finished the article off by basically saying the exact same thing right it's like if you if you don't have the confidence it's not a matter of hey you should feel bad Nick is trying to call me out and make me feel bad for not deploying on a Friday um not the case at all because I did mention earlier I literally have some projects I would say you're you're definitely not touching that deploy button on a Friday but that's because the goal state is to be able to deploy on a Friday the goal state is to be able to have the confidence that you can because it shouldn't matter what day of the week unless you literally have statistical evidence that indicates that a Friday is somehow you know more likely to break things I would say it shouldn't matter and then I would say if you do have stats on it maybe you should figure out why that's the case I would be very curious to know so the argument and I mentioned this earlier I'm kind of repeating myself cuz I think it's a it's a critical point the argument that you hear a lot of for not deploying on a Friday is I don't want to come in and fix it on a Saturday right it's the weekend but then I just say like I don't want to if you're going to deploy on a Monday instead because somehow that's safer like I don't want to fix 
issues on a Tuesday wake up early 5 in the 4 in the morning 5 in the morning you're getting paged or it's your business and you happen to wake up because your phone's going off or something and it's like Tuesday morning at 4 in the morning and then you're fixing this bug until you know Tuesday night like 5: in the morning like or whatever I guess that's the next morning like why is Tuesday now okay it like it shouldn't be there should be no difference in the days of the week so if you find that you are not confident to deploy on a Friday it's not Nick says I should feel bad about myself the conversation is really well why why do you feel that you can't Deploy on a Friday what is it about your system how you have your technology set up that does not give you confidence and then start chipping away at that and to bring it kind of Full Circle to one of the earlier points too like if if you don't ever have a need to deploy on a Friday because your customers don't requ Quire it or whatever else like it maybe it's not conducive for your customers to get updates on a Friday maybe you never need to optimize for it but I would always be curious like you know pick a different day of the week well we don't like deploying on Wednesdays I think probably I think probably before Microsoft I'm trying to think did we have I feel like we had other days where people didn't like uh like committing code or something um or we've even seen I'm sure other people have seen things like this where we're going to do like a a freeze code freeze we have a release coming up we got a code freeze and it's like why like I get it on the surface you're you're just acknowledging if we need a code freeze we don't have confidence in what's happening so I get it like I'm not I want to I want to acknowledge it like I understand why people do it it's you don't have confidence in it but I I sort of question like why do you stop there why do you go this is normal like that should not be an okay thing I 
understand if you need to do it because it's a stopgap. But if you're like, "Oh, thankfully we made it through this code freeze and it saved us," my personal reflection wouldn't be, "And that's exactly why we love code freezes, we should do more code freezes." I think you want to say, "Now that that code freeze is done and it saved us, what can we possibly start improving to never have to do that again?" Freezing your ability to deliver code, to me, is... I don't want to say it's ridiculous, because I don't think that's fair. I already acknowledged I can see why you would do it. But being complacent with that being okay, that feels very bad to me. Pick a cadence. You're going to do quarterly releases, I'm just making something up. So that's four times a year you're going to do a release, which means you have three-month release cycles, because 3 * 4 is 12, and you're going to do a code freeze for two weeks. Proportionally, how much time is that? It's pretty interesting. So call it 12 weeks, and you're going to do a code freeze for two of those weeks, so whatever two out of 12 is [about 17%], that percent of the time you're just not committing code. What are you doing? I just don't know why people are complacent with that. Like I said, I could see you saying, "We need to do this because we've seen that stuff just breaks and it goes terribly, so we need to freeze the code." Okay, do it. Do the right thing for your context. But where is the reflection that comes after this, to say, "You know what, it feels really bad that we lose, like, 20ish percent of our time just not doing stuff, because we have to freeze the code, because we keep breaking it"? Maybe change something about what you're doing to get that back. So anyway, those are some thoughts on code freezes, but that's primarily what I wanted to share about
safe deployment practices. Like I said, I'm not prescribing how you go do it, but to briefly recap: we talked about testing being essential, and the fact that you want to have a variety of tests, not just one test to rule them all, as if there's a silver bullet for it. You want to be able to run tests early and automated. This is going to help you with regression coverage, so you do want something like this. This is different from slower deployments. Having a deployment technology that allows you to control your rollout is very important. You can layer feature flagging on top of this. These are two things that exist separately that you can leverage together to have more control over your rollouts. And then the other thing that we layer on to those slower, controlled rollouts is your signaling, monitoring, and alerting, so that you get sufficient time as things are rolling out to get the information about whether or not you can proceed. So you could combine things like green lighting, to be able to say, "Hey, we checked these things off and you get the green light to move ahead," with something like red lighting, to say, "Hey, we measured something that is no bueno. Do not go. Don't pass go, don't collect $200. We need to revert this thing." So you can mix and match all of those, and that's primarily it, so that you can go through this exercise of figuring out why you're not comfortable deploying on a Friday. And that is safe deployment in a nutshell, so I hope that's helpful. Doing a time check, we're about an hour in. I mentioned at the beginning of the stream I may cut this one short because I have a flight not too long from now. I have to be up in a couple of hours to go drive to the Seattle airport, which I'm now regretting because it's so early in the morning. I might wrap it up here, but I'm going to do a pause to see if there are any questions in the chat. While I'm waiting to do that, I guess I'm not pausing
but I will be checking the chat, so feel free to type whatever you want. I'm going to do a little switch over here. So again, if you haven't seen it, this is the newsletter. This is Dev Leader Weekly. I put out a newsletter article once a week, and we're on the 53rd issue. There are about 6,000 people that subscribe to this, which is pretty cool. It's taken just over a year to build that audience, but it's really cool to be able to have something that thousands of people read. So you can check out the format if you're curious and you've never seen it. I do put an exclusive article right at the top, so I'm scrolling through it just so you can see what it looks like. Got this, and it keeps going. I don't like to just write a two-second article. Don't get me wrong, I'm not trying to knock anyone's newsletter. I've had people literally tell me to just write small, one-minute articles, and I'm like, no, man, that's just not who I am, so I try to write more. And then I'll usually plug a course or something, because I'm trying to share other things that you might find valuable. And then the entire second half of the newsletter is generally recaps from the week, so you can catch up on streams you might have missed, or YouTube videos. As I'm building BrandGhost, I had to stop writing blog articles, just because that was the one thing I don't have time for as I'm coding more. But when I get more time back, after BrandGhost is a little bit more feature rich, I guess, and I'm spending less time coding on it... I was up to six blog articles a week on top of all this stuff, so you would see them listed off in this section. So there's that. If you guys are interested in learning C#, I do have courses on Dometrain for C#, so these are the three I have. As of this morning, I am working on two more Dometrain courses, so I'm pretty excited about that. I still haven't
asked Nick chaps if I'm allowed to reveal what the topics are I have a feeling the answer is no um but uh one of them for sure is something that I talk about all the time so there's a there's a hint so um that one's pretty exciting I've been working on that one for a few weeks now and this second one is basically just signed uh signed the work for it uh this morning so more Dome tra courses coming very excited about that I don't see any other questions that have come into the chat though so folks I may wrap it up right there um I do appreciate it thanks for joining um it's my Monday night I don't know what time it is where you're at uh tomorrow's stream in the morning will be canceled because of my flight so for reference if you thought this was fun I do these Monday night 9:00 at night in PST for about 2 hours and then I try to sleep for a little bit and then I uh code live in the morning at 700 a.m. so I do that every Tuesday except this Tuesday so apologies for that got to ship things around for be able to to fly out to Dallas um oh here's a you got a you got really good timing on Instagram how do I stay motivated with this growing world that's a really deep question um honestly I think at the end of the day you have to find things that you're passionate about and it probably sounds like a really generic fluffy answer but I think that's how you do it because if you're finding external motivators that only last so long so don't get me wrong things like uh money like money money is a is a good thing you need to be able to support yourself support your family right but at some point you just toss more money into things and it's like kind kind of why right so you want to find things that you're passionate about because we only have so much time here it's kind of a grim way to look at it but we only have so much time in the world so find the things that you love to do right spend spend your time doing things you love with people you love so um I I mentioned this on 
stream before and it's always hard to show my tattoo how do I do this um especially the inside how do I turn my arm there I think I finally figured it out but you can see there's an hourglass on my forearm and it's upside down so the hourglass is Flowing the wrong way but The Hourglass shows that I have it's basically filled with money in the top and the money in the top of the hourglass goes into a fiery abyss and it's a reminder literally tattooed onto my arm that is don't just do things to make money for no reason and it's because you're going to die as Grim as it sounds you're going to die and you're not going to do anything with it and there are countless examples of people like on their deathbed that were rich and they were like I regret not doing the other things in life I was just focused on money so I do think money is important it is uh important up to a certain point for everyone you want to make sure you can get to a point where you're not concerned about it right beyond that it starts to not have like a good return on your time so if you're able to do the things that you love I I think that will help you feel motivated um it does mean like I should remind people especially if if your comment is specifically about software development and software engineering it's not easy it's not easy and I'm not I'm saying there aren't things you know other industries that aren't easy there's lots of things that aren't easy but like you you need to find some enjoyment in it and Tik Tok not disconnect me at the end of my my stream here what the heck Tik Tok just getting into it I was just signing off actually but um any anyway whatever Tik tok's Tik tok's done so when you're able to spend time on things that you're passionate about I think that helps you do need to acknowledge that things like software engineering software development or really you know anything that you want to be able to get good at and make a career out of it these things take a lot of time and 
effort and that's okay right it's it's okay for things to be challenging the thing that we have to remember is that we have to work through this stuff and when it is hard you have to keep going okay like what's what is my goal here and I will admit the motivation isn't going to be sky-high every day it's just the reality of it there's going to be days where it sucks and you're questioning why am I doing this but that's truly when you need to remind yourself you have to dig in have some other ways to motivate yourself and kind of get that little spark back and there might be days where you can't right and I would still recommend trying to go through the motions of keeping that momentum if it's learning how to code or practicing other things find some way to keep going because you're going to have days that are kind of crappy and you know the longer you stay at it the more crappy days will show up but hopefully they're heavily outpaced by the good days and the more good days you have I'm hoping the more opportunity you have to align those good good days with things you're truly passionate about so um I'll end the stream there so I hope that helps a little bit and I'm sorry if that's uh you know not a very uh actionable answer for you in this moment but I wish you success and I hope that you can you know find that motivation where you can and do awesome things so thanks everyone I will

Frequently Asked Questions

What are some best practices for safe deployment in software engineering?

In my experience, best practices for safe deployment include implementing comprehensive testing strategies, utilizing automated testing, and ensuring that your deployment technology allows for quick rollbacks. It's also important to monitor your deployments closely and have clear signaling and alerting mechanisms in place to catch any issues early.
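
As a rough illustration of the rollback and monitoring points in this answer, here is a minimal Python sketch of a staged rollout that checks a health signal after each stage and rolls everything back on the first failure. This is only one way it could look, not a method described in the video: the stage names and the deploy, is_healthy, and rollback callables are hypothetical placeholders for whatever your deployment tooling and alerting actually provide.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    succeeded: bool
    completed_stages: List[str]
    failed_stage: Optional[str] = None


def staged_rollout(
    stages: List[str],
    deploy: Callable[[str], None],      # hypothetical: pushes the change to one stage
    is_healthy: Callable[[str], bool],  # hypothetical: reads your monitoring signal
    rollback: Callable[[str], None],    # hypothetical: reverts one stage
) -> RolloutResult:
    """Deploy stage by stage; on the first unhealthy stage, roll back and stop."""
    completed: List[str] = []
    for stage in stages:
        deploy(stage)
        if not is_healthy(stage):
            # Revert the failing stage first, then earlier stages, newest first.
            for done in reversed(completed + [stage]):
                rollback(done)
            return RolloutResult(False, completed, stage)
        completed.append(stage)
    return RolloutResult(True, completed, None)
```

For example, calling staged_rollout(["canary", "region-1", "region-2"], ...) with a health check that fails on "region-1" never touches "region-2", and reverts "region-1" and then "canary". Starting with a small stage limits how many users see a bad change, and the automatic rollback keeps recovery from depending on someone noticing an alert.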

Is it safe to deploy software on a Friday?

I believe that deploying on a Friday can be just as safe as any other day, provided you have confidence in your deployment process. If your team is well-prepared and has good practices in place, there shouldn't be a significant difference between deploying on a Friday versus any other day. The key is to assess your own confidence and readiness.

How can I improve my team's confidence in deploying software?

To improve your team's confidence in deployments, I recommend conducting a thorough analysis of your current processes. Identify any areas where your team feels uncertain and work on addressing those gaps. This could involve enhancing your testing coverage, automating more of your processes, or improving your monitoring and alerting systems. The goal is to create an environment where your team feels secure in their ability to deploy at any time.

These FAQs were generated by AI from the video transcript.