This tutorial provides some general guidance for having Claude Code (or your favorite LLM) write tests that fail for you. And why would we want that? So it can fix the code and then prove the tests pass afterward!
Transcript:
One of the challenges we have building with AI is that the LLMs sometimes come back to us very confidently about something they've done that is still very wrong. So in this video, I want to walk you through having LLMs build tests for us in a way that should give us a little more confidence in their output. We're going to ask Claude to build tests that assert the current broken behavior, fix the issue, and then update the tests so they pass. So let's jump over to Visual Studio and quickly look at a scenario where I had to tackle this exact use case. While I was building my Needler framework for dependency injection, I noticed a small issue
when I was building a sample application. If we have a quick look at the code here, we can see that on line 86 I'm passing in this IConfiguration, and what should happen is that this configuration gets added onto our dependency container. If you don't understand what that means, that's totally fine; this was a bug I encountered while building a sample application. I needed to guarantee that the configuration that gets passed in gets put onto our container here, so that later on we can resolve it from the dependency container. I knew what needed to be done, and I actually know how to write the test for this, but I figured it would be great to have AI do my work for me. So, let's jump over to Claude Code and see the prompt. I generally find myself writing
very verbose prompts now, because I have Whisper and I use a foot pedal, so I can just talk into my microphone to write these prompts. It's way nicer for me than having to sit here and type everything out. I'm using Claude Flow on top of Claude Code. If you're not familiar with Claude Flow, it's an agentic swarm framework built on top of Claude Code. I have a video, if you haven't seen it yet, you can check it out up here, that shows how to set it up on a Windows machine using WSL. But you can take this same approach with plain Claude Code; this is just my flavor of doing it, my preferred way. So I'm running Claude Code here, but through Claude Flow, using what's called the SPARC framework, and that's going to give it a little bit more
guidance in terms of structuring what it's going to do. But let's pay attention to what's happening in the prompt. Of course, you can tailor this to your liking, but I gave it some context. I said: I have this Build method on a class called ServiceProviderBuilder that takes in this IConfiguration, which is supposed to be usable for building plugins, but it's not getting registered onto the dependency injection container. First, I asked it to create a test that proves the current broken behavior exists: once we call Build, the configuration should not be resolvable from the container, and the test should assert exactly that, so it passes against the bug. Then I asked it to
fix the behavior so that we do register that configuration, and to run the same test to prove it now fails afterwards. So that test needs to go from green, asserting the bad behavior, to red once the behavior is fixed, and only then should it correct the test so that it passes again. I told it this is like a test-driven approach to prove it's actually fixing the problem, and that I need proof of it in the console before it can consider itself finished. The reason I have these pieces in there: I want a test that asserts the existing bad behavior, so we get a green light on it even though the behavior is bad, and then when the behavior gets fixed, that test should fail. Now, the reason I'm doing this is that if you've worked
with AI writing tests for you, it's not that uncommon for it to very confidently write a test whose name says it's checking a given behavior when it just doesn't. I've literally had AI write tests over top of mocks, not exercising any real code at all, and then report, "Hey, I've done what you asked for." So I need it to go from testing the bad behavior, to breaking once the behavior is fixed, and then we can change the test to assert that things now work. That should do most of it. But I also told it that I want to see proof in the console, and the reason I asked for that is that I've often found myself working with these tools and, again, it says something at the end like "I've done what you asked for," and then
I go to build the code and it doesn't even build. So how could you have written these tests and run them if the code doesn't even compile? I try to make sure I ask for some type of proof. When I've done benchmarking, for example, I've asked it to provide the benchmark results, because I've seen it say things like "it's now 30% faster," and when that sounded suspicious and I asked about it, it admitted it made that up. So I do want to ask for proof. Let's go ahead and run this prompt and try to see what it's thinking through as it starts spitting out results. In this case, this is running Claude Flow like I mentioned, so we have the Claude console launching. This is probably going to scroll
a lot in just a moment, but it's setting up with the more complex prompt to give it some of that SPARC behavior I was talking about. It starts by giving us a to-do list: analyze the codebase; create the failing test; run the test to confirm it demonstrates the issue; fix the code, after which that test should fail; run the test again to prove it now fails; update the test for the expected behavior; run the final test to prove it passes; and show proof. This is all exactly what I asked for. Something to call out, if you're not familiar with this: having some sort of to-do list or checklist the LLM can work through helps significantly, because with more complex tasks it loses a bit of the context of what it's
working on, and if you give it something that's a little too long-running or too complex, by the end it thinks it's done when it hasn't really done what you asked for in the beginning. So a to-do list definitely helps keep it on track. Now it sees the issue. That's great, so the analysis is done. This is probably going to take a moment and chew up a bunch of tokens, but I'll try to call out things as it's going through when they're interesting. So far it has taken a very long time just to get the first running test. Now it's going to fix the actual underlying issue, which, again, took way longer than I thought. I have run this before and it did it much faster, so I'm very interested to see what
it comes up with, but we'll see what it's doing. I can see the fix on the screen right now, and it's actually what I expect the fix to be, so maybe not so bad so far; it's just taking longer. I saw it struggle a lot with dependencies, but it was trying to see what other parts of the code were doing, and I think that's the right approach for what it's doing right now. I know the code is flying by, but I can tell that the way it's trying to test this is going to be very misleading. It's trying to do a little too much, over-complicating the proof that it's registering this configuration. There we go, it's deleting that code now, so I'm curious to see if it will come up with a
lighter-weight approach. The working version of this, from running it before filming this video, came up with something quite elegant and very simple, so I'm curious to see if we'll get there. Now it says it has to run the test to prove that it fails, which is what I asked for. It seems to think it has fixed the issue, so now the test should fail, and then hopefully it can go correct the test. Let's see: there is a failing test. So far, that's a good sign, because even though it's failing, we expected it to fail at this point. So, so far so good. Okay, it's finished. And I guess doing a time check: I'm 17 minutes into filming this video. So, was it more effective to have AI fix this one-liner and add a test? Absolutely not.
Especially given that I already have the prompt from running this before. But what's nice is that we can see it updated the to-do list, gives me its test-driven approach, talks about what changes it actually made, and then gives me some verification. What I thought was very interesting is that when I was watching it output the final part, it actually seemed to have written a script that writes to the console. It over-complicated what it's doing: I just wanted it to give me proof that it fixed the issue, but it wrote some script to print stuff to the console, which is kind of odd, but that's okay. If you haven't seen my previous video, you can check it out up here; it talks about how to validate and walk through tests after AI has finished what
it's doing. We're going to spend just a little time checking out what it actually delivered, but that one is a more thorough walkthrough of exploring the changes. If we jump in here, we should expect to see this, which I think is a little odd. It actually went and did something more complicated than I would have expected. If we undo this, I had very simple changes in mind to fix this. So it seems to have fixed it, apparently, but it made something a lot more complicated than it needed to, and it left in this weird comment. Let's go check out the tests. The first thing that makes me a little nervous is that I see red squigglies. That could just be a Visual Studio thing. So,
let me try building this again. If this doesn't build, I will be displeased, but we'll see. So, three succeeded. Great, it was just a stale cache thing going on. These tests look a lot more convoluted than I would expect. To briefly explain: what it's trying to do is just a little too much. Let me close this. The test is named after the Build method, but it's not following the same naming convention I have on my other tests; I could have been more explicit about that. And again, it's leaving comments everywhere. In my other video, I give some examples of cleaning up these comments, because it seems to always write arrange/act/assert comments, and we don't need those in the code, especially because when it leaves comments like
these, they're transient in nature. It says "after fixing, the options should be available in the container." After fixing what? We only have that context because we're literally filming this video right now, so it's very weird to leave these transient, point-in-time comments in here. And then again: "after fixing, the options should be available on the container." But this is what's really bizarre. This one here (again, you might not understand if you're not familiar with the package I'm creating, and that's totally fine) says "test with a custom plugin to verify the options are not accessible within plugins." That has nothing to do with what I asked about. So again, I would probably go delete this. Given that it wrote something this convoluted for this test, I'm actually not sure I fully trust some of
the other tests that it wrote. In my opinion, this would be a failed outcome of this experiment, and I was fortunate that the first time I ran this, literally with the exact same prompt, it did almost exactly what I wanted. Friendly reminder: if you're using AI to write tests for you, even when we prompt it to test the bad behavior and get a green light, fix the issue and get a red light on that exact same test, and then fix the test to get a green light, even following that, we can still get poor outcomes. So overall, I would probably delete this, and this, and this entire test. I would maybe keep these two tests, but honestly, at this point I wouldn't even go
back to Claude and say, "Hey, no, go correct it, do it again." I would probably give up on this example, because I feel it's already done too much and over-complicated things. So, unfortunately, not how I was hoping this one would turn out, but I think it's still valuable to walk through these types of things so you can see this kind of experience, because a lot of the videos you'll come across on the internet are everyone 100x-ing their development workflows, and I think it's good to throw out some real examples too. So, not the outcome we wanted, but hopefully you still thought this approach was helpful and you can try leveraging it in your own workflows to get some better tests than these. Thanks so much for watching. I'll see you in the next
one.
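Distilling the transcript into something reusable, a prompt for this approach might take the following shape. This is a paraphrased template, not the exact prompt from the video; the angle-bracket placeholders are for you to fill in:

```text
I have a method <Method> on <Class> that takes <X>, but <expected behavior>
is not happening.

1. Write a test that asserts the CURRENT (broken) behavior, run it, and show
   that it passes.
2. Fix the code so that <expected behavior> holds. Run the same test and show
   that it now fails.
3. Update the test to assert the corrected behavior, run it, and show that it
   passes.

At each step, include the raw test-runner output in the console as proof.
Do not report success without that output.
```

The middle step is the important one: a test that never goes red after the fix was probably never testing the bug in the first place.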