BrandGhost

How To Use Claude Code To Write Tests For Your Code - Is It Worth It?!

This tutorial provides instructions on guiding Claude Code to write tests for your code base. I use Claude Flow on top of Claude Code to orchestrate an agentic swarm to tackle it. But how did it do?
In a previous video, someone asked me what the token count was. It uses a lot of tokens. I'm constantly trying to look for ways to incorporate AI into my software development workflows, and one of the spots that I like looking into is testing. But one of the challenges is that I don't have a ton of trust in AI writing tests, because sometimes I feel like it's overly confident in what it's doing. So today, I wanted to show you a setup that I used with Claude Flow using the SPARC method, and we're going to see how I was able to add some baseline tests for my NuGet package called Needler. Okay, to kick things off, I have my PowerShell window pulled up here. This is the prompt that I used. I have since gone back in my git history to undo it, and we'll see what it comes up with this time, but I'm going to explain my thought process for putting this prompt together. A quick note: if you're not familiar with Claude Flow, it's something that allows you to run an agentic swarm on top of Claude Code. I do have a video on that if you want to check it out right up here, and you can see how to set up Claude Flow for your own development process. Now, you don't necessarily need Claude Flow for this. You can take the concepts here and apply them, but Claude Flow with the SPARC method gives a lot more structure to what we're going to be looking at. You can check out the Claude Flow GitHub repository; they walk you through what SPARC will be able to do for you, but essentially it sets up specifications for the things we're going to be building. Without that structure, I find a lot of the time that AI goes down a path and keeps going down a bit of a rabbit hole until it almost loses track of what it's doing, and then finally it's like, "Here you go. I'm done." And then I'm generally pretty disappointed. But with something like this, there's a lot more guidance.
So in my NuGet package, I started putting together my projects, and I had some sample applications where I'm like, "Hey, this seems like it's doing the right thing." A lot of the code that I was putting into the package I was borrowing from another product that I've been putting together for myself. So I have a lot of confidence in the code, but I don't have the unit tests and such set up in the new NuGet package. So I said, "Okay, time to use Claude Flow to help out with this." For a little bit of context, this prompt is kind of big. I don't like typing out prompts anymore, so I've been showing in recent videos that I'm actually using something called Whisper, and I have a foot pedal below my desk, so I can do this kind of stuff hands-free. Just to quickly show you: if I start typing like this by speaking, you can see that it puts into the console exactly what I was saying. So, let's get rid of that, and I'm just going to explain this prompt. One of the tricky things when you're using voice-to-text is that it's not going to be perfect, right? As you're typing things, you're correcting them, but generally I'll blab a bunch and then go back and patch it up. But this is essentially what I came up with. I said to use xUnit version 3, and I have noticed that I really need to remind the AI tools that it's not xUnit v2, the previous version; v3 is actually a new NuGet package. So I explicitly called out that it's not the previous v2 package. Then I said that I like using Moq as the mocking framework; that's another package that I like to use. Some people are against it, and that's totally fine, but I said those are the things I want you to use. Then I said the test projects should be a one-to-one mapping with the libraries. In other cases when working with LLMs, when I've said "hey, go add tests to this," they seem to organize tests in ways that I'm not a fan of, and that's totally fine.
I just want to give it a little bit more direction. So instead of one big test project that's trying to test all of my different libraries, I like having dedicated test projects for those libraries directly. So I gave it that instruction, and then I gave it an example: there should be one test project for this project, and it should be named this way. So for NexusLabs.Needler.Plugins, there should be NexusLabs.Needler.Plugins.Tests. Just trying to be very explicit. I said do not add NSubstitute and do not use FluentAssertions. You can see that it wrote out "end substitute" because I was dictating, but something I've also noted is that, probably based on a lot of the training data for these LLMs, they see NSubstitute a lot. They see FluentAssertions. I don't want to use these packages; they're just not my preference. But I've found that even when I tell it to use Moq, it will still sometimes add NSubstitute as well. So it's just kind of silly. Again, just being very specific about what I want and do not want. Then I said do not add arrange-act-assert comments. I don't know if you've tried using AI to write tests, but for me, with basically a 100% guarantee, my tests end up littered with arrange-act-assert. Sometimes the tests are so trivial that there's more comment than code in the test, and I just can't stand it. So I'm just telling it not to do that. Then I said, in fact, use as few comments as possible, and I tried to add in: if you find that you're needing comments, it's likely because the code isn't obvious. There are probably better ways to give more structure to the LLM, but this is just me talking to it like I would a person. Then I tried to give it guidance to not rely on mocks. I am someone who writes software where the code should be mockable. It should be such that you can use mocks and have unit tests set up, and that's fine.
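To illustrate the comment-noise problem, here is a hypothetical sketch (the types and values are made up, not from Needler) showing the arrange-act-assert style the AI defaults to next to the same test with the ceremony stripped out:

```csharp
using Xunit;

public class CommentStyleExamples
{
    // The style the AI tends to produce: more comment than code.
    [Fact]
    public void Add_WithAaaComments()
    {
        // Arrange
        var left = 2;
        var right = 3;

        // Act
        var sum = left + right;

        // Assert
        Assert.Equal(5, sum);
    }

    // The same test without the scaffolding; the code is obvious on its own.
    [Fact]
    public void Add_ReturnsSum()
    {
        Assert.Equal(5, 2 + 3);
    }
}
```

When a test is this trivial, the AAA comments add nothing; if a test genuinely needs section markers, that's usually a sign it's doing too much.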
But just because the code can do that for tests doesn't mean those are always the tests I write. I write code that is testable that way, but it does not mean those are the only tests I write. In fact, it's quite the opposite: I rely more on functional tests. Given a more complex application, I would rather have the dependency injection container allow me to resolve types from it. So I said don't rely on mocks; they should be used in extreme circumstances, and I tried to call that out: for web requests, things where you're going out to the internet, don't actually go do that. We don't have any of that in this library right now, so it's not really necessary here, but I also tried to call out API boundaries for third-party projects. For example, I have Carter in here, and I have Scrutor. We're not going to use them for these tests, but I'm just trying to set the precedent that if you're going to interact with third-party things, that's probably where I'd want to at least consider a mock, and in some cases I would just use the real thing anyway. But if I don't put this in, I have found that it will over-mock things, and then it becomes kind of pointless. I've literally seen it write tests on the mocks themselves, where it's not actually testing anything real. So it does it with a high degree of confidence and adds zero value. Otherwise, prefer to use real instances and resolve those instances from the dependency containers as necessary. In these particular tests, it's not really going to be necessary, but for example, if I had a service in my application, I would ask it to build the dependency container the same way the application does and resolve that service. So it has the real things inside of it, not mocks of everything.
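As a minimal sketch of what "resolve the real thing from the container" looks like in practice — the `IGreeter`/`Greeter` types here are hypothetical stand-ins, not part of Needler — the test builds a `ServiceCollection` the same way the application would and resolves the concrete service rather than mocking it:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Xunit;

// Hypothetical service; stands in for a real type from the application.
public interface IGreeter { string Greet(string name); }

public sealed class Greeter : IGreeter
{
    public string Greet(string name) => $"Hello, {name}!";
}

public class GreeterTests
{
    [Fact]
    public void Greet_UsesRealInstanceResolvedFromContainer()
    {
        // Register services the same way the application does,
        // then resolve the real implementation instead of mocking it.
        var services = new ServiceCollection();
        services.AddSingleton<IGreeter, Greeter>();
        using var provider = services.BuildServiceProvider();

        var greeter = provider.GetRequiredService<IGreeter>();

        Assert.Equal("Hello, Needler!", greeter.Greet("Needler"));
    }
}
```

The payoff is that if the application's registration wiring changes, these tests exercise the new wiring automatically instead of drifting away from it.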
Or you'll sometimes see that it has test setup code where it duplicates constructing an object, and there might be a lot of dependencies, and it's just a mess, because if that code changes in the actual application, we're now testing things set up in a different way. I try to avoid that, so I gave it that guidance: if you're just testing the dedicated library and it doesn't rely on a container, then just new up the things. So I'm giving it a bit of an out there. Then I said I wanted to focus on this particular project, which is NexusLabs.Needler.Injection, and I said to focus on a couple of folders: loaders, sorters, and type filterers. This felt like enough context, because there are several related classes in there, and I didn't say to go write this for everything, because I find that when you go too broad, it just falls apart much faster. You might have a couple of tests where you're like, "This feels okay," and then sometimes it's just really crappy beyond that. So, this is the prompt I used. Let's see if it goes and does what we need it to do. While that's happening, I'll poke around a little bit in the solution to show you where it's hopefully going to add some tests. So, let's go run it. Cool. You can see that it's getting everything up and running, and it's just going to use the SPARC method and pass my prompt inside. While that's running, I'll come back to it, but just to show you, we are going into this project. You can see that I have the loaders, sorters, and filterers. So basically, we should be seeing tests come out on these three, these four, and these two classes. We'll see what it does. These are all pretty lightweight things; there's not a ton of really complex logic. This should be something that's quite simple to test. It's just nice in this case that I can have AI hopefully do most of that work for me.
But we'll see how well it does, because I suspect we'll still need to go through it a little bit and give it some TLC so that it's acceptable. Right, we are done. I just wanted to call out that in a previous video, someone asked me what the token count was. I glanced up a couple of times while this was running, and I saw that the tokens were resetting. It uses a lot of tokens. I was seeing some counts get up to around 15,000, and then it would go off and do something, come back, and the count would climb back up to 15,000 or so. So I don't know exactly how many tokens it uses, but it's an awful lot. Let's see what it says it did, and then we'll go through some of the tests. It did add xUnit v3 and Moq. I saw this happen pretty early. We are using central package management, so I noted that it was putting stuff into there; we'll confirm that, though. It went and added test projects for all of the different projects I have in the solution, which I don't know if that's exactly what I meant by my prompt, but this is the second time I've run that prompt, and both times it did that. So it's probably something I need to be more explicit about. Totally fine, because I think they're empty except the one I asked it to add tests for, which is part three: implemented comprehensive tests for NexusLabs.Needler.Injection, so loaders, sorters, and type filterers. It looks like, if you recall what I said before, it tackled the right classes there. So that's good news. Then it lists the test characteristics following my requirements: xUnit v3, not v2; Moq only where necessary; no NSubstitute; no FluentAssertions. We'll check to see if it added comments or not. Minimal comments, only where absolutely necessary. Real instances, using actual implementations. It says 64 tests: 59 in Injection, and then it added five in the Needler tests. Interestingly, I didn't ask it to do that. I did glance up partway through the run, and I saw that it was doing that.
The reason it gave, just for context, for adding five tests there was to prove that that project was building. Kind of odd, but that's not what I asked it to do, and it still went ahead and did it. So, 64 tests total, and 59 for what I told it to do. Let's jump over to the code. The solution has to be reloaded because it's been changed. Okay, so something I see that I don't like already: yes, it did put "Tests" at the end of the project name for all of the different projects. That's great; that's what I wanted. I personally don't like just blasting the tests out into their own subfolder, but I wasn't transparent or explicit about that ask, right? It just did what it thought was the right thing to do. That's not the end of the world, but it's not what I would like. Generally, I would move these things out of the tests folder there, but that's fine. This is where I asked it to add tests, and we can see that it did add those test files. It might not have been obvious, because the output was probably flying by in the console while it was doing all of the work, but it did say that it was running the tests and that they should pass. We'll verify that. What I wanted to do is step through these tests and see if they actually make sense, because one of the biggest risks or concerns I have with AI writing tests — you might have noticed this — is that when AI gives you answers, it sounds very confident, right? So even when I was going through that checklist, the output was very much like, "Yep, got all this stuff delivered exactly as you wanted," check the box. But it wrote tests — are they actually useful tests? I have this class called the all-assemblies loader. If we were to jump into it — we're not going to do this for every one specifically — it's going to look for everything with an executable or DLL extension in the base directory.
So, where we're running from. Not really that complicated. The first test is "loads all assemblies from base directory." Does this test actually do what it says? The answer is no, which is kind of funny. It says the expectation is that it loads all assemblies from the base directory, but if we look at the assertions, it's just checking that the return is not null and not empty. How many assemblies are in the base directory? Is it 100? Is it one? I don't know, right? This test does not actually confirm that. For this one, it's not that the test itself is bad; it's that it's misleading. The expectation in the title does not line up with the assertion. So that's not a great example, unfortunately. Okay, next one: "contains current test assembly." Does this one actually do that? It looks like the answer is yes. It's checking what the current assembly is. If you're not familiar with this stuff, Assembly.GetExecutingAssembly gives you the currently executing assembly, as the name suggests. Then we're looking through the loaded assembly collection and making sure we have a name that matches. I think that's pretty good. You could maybe use the location of that assembly instead of the name, but that test seems like it makes sense, right? If we're asking for all of the assemblies when we call this thing, it should contain the current one. I think that's an okay test. The next one says that it loads DLLs and executable files, and then it does an assert across all of the results that they must end in .dll or .exe. I think that's a totally fine test to create. This one says "with continue on errors true, handles invalid files." This is interesting, because when I look at this test and what it's trying to do: for this loader, I basically have a condition in there that says if you encounter an error, just continue — but that's optional.
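To make the "misleading title" problem concrete, here is a hypothetical sketch (the `FakeLoader` below is a stand-in, not the real loader from Needler) contrasting the weak assertions the AI wrote with an assertion that actually matches the test's name:

```csharp
using System;
using System.IO;
using System.Linq;
using Xunit;

// Hypothetical stand-in for the real all-assemblies loader.
public static class FakeLoader
{
    public static string[] Load() =>
        Directory.GetFiles(AppContext.BaseDirectory, "*.dll");
}

public class LoaderAssertionExamples
{
    // Weak: the title promises "loads ALL assemblies," but the assertions
    // only prove "returned something." One file would pass; so would fifty.
    [Fact]
    public void LoadsAllAssembliesFromBaseDirectory_WeakVersion()
    {
        var result = FakeLoader.Load();
        Assert.NotNull(result);
        Assert.NotEmpty(result);
    }

    // Sharper: compute the expected set independently so the assertion
    // actually backs up the claim in the test's name.
    [Fact]
    public void LoadsAllAssembliesFromBaseDirectory_SharperVersion()
    {
        var expected = Directory
            .GetFiles(AppContext.BaseDirectory, "*.dll")
            .OrderBy(f => f, StringComparer.Ordinal)
            .ToArray();

        var result = FakeLoader.Load()
            .OrderBy(f => f, StringComparer.Ordinal)
            .ToArray();

        Assert.Equal(expected, result);
    }
}
```

The weak version is exactly the kind of test that shows green while guaranteeing almost nothing, which is why the title/assertion mismatch is worth hunting for in review.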
It's testing that with that option on, we should be able to continue, and that means this should pass, right? That's fair. So far so good. But do we know that this is going to throw an error? When it goes over the assemblies in my output folder, is that going to throw an error? Maybe not. So we don't actually know from this test whether it's handling the error. All this test tells us is that whatever it's running against doesn't throw an error; we don't know if it's actually simulating that condition. That's a bit of a risk. It's misleading in terms of confidence, because it's telling us we have a test that guarantees this behavior, and it doesn't guarantee it. The other thing is that we don't have the inverse, and that would be really helpful here: if we had the inverse, where continue-on-assembly-error is false, then we could explicitly have a scenario that does the opposite. That would show us that, given these two scenarios, the error can either bubble up and get thrown, or get continued over. So far, this isn't terrible, but it reassures me that I will not stop reviewing AI-generated tests. We're not going to go through all of these in extreme detail, so I figured I'd jump over to another one. Let's go over to this default type filterer one. Again, these tests are all quite simple, right? If I scroll through pretty quickly, you can see that it's one line of action and then a check, right? Like an assertion. What this one is doing — this class is the default type filterer — because this is a dependency injection framework, it's checking whether we want types registered by default as transient or singleton. So it's just testing that expected behavior. What it did was generate a bunch of different types for us, right?
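Here is a hedged sketch of what the paired tests could look like — the `FaultTolerantLoader` below is a hypothetical shape, not Needler's actual API — where we deliberately feed in a failing source so we know the error path runs, and we cover both settings of the option:

```csharp
using System;
using System.Collections.Generic;
using Xunit;

// Hypothetical loader; the real API in Needler may differ.
public sealed class FaultTolerantLoader
{
    private readonly bool _continueOnError;

    public FaultTolerantLoader(bool continueOnError) =>
        _continueOnError = continueOnError;

    public List<string> Load(IEnumerable<Func<string>> sources)
    {
        var loaded = new List<string>();
        foreach (var source in sources)
        {
            try { loaded.Add(source()); }
            catch when (_continueOnError)
            {
                // Skip the bad file and keep going.
            }
        }
        return loaded;
    }
}

public class ContinueOnErrorTests
{
    // One source is guaranteed to throw, so the error path definitely runs.
    private static IEnumerable<Func<string>> MixedSources() => new Func<string>[]
    {
        () => "good.dll",
        () => throw new BadImageFormatException("corrupt file"),
        () => "also-good.dll",
    };

    [Fact]
    public void ContinueOnError_True_SkipsInvalidFiles()
    {
        var loader = new FaultTolerantLoader(continueOnError: true);
        var result = loader.Load(MixedSources());
        Assert.Equal(new[] { "good.dll", "also-good.dll" }, result);
    }

    // The inverse: with the option off, the error must bubble up.
    [Fact]
    public void ContinueOnError_False_Throws()
    {
        var loader = new FaultTolerantLoader(continueOnError: false);
        Assert.Throws<BadImageFormatException>(() => { loader.Load(MixedSources()); });
    }
}
```

The key difference from the AI's version is that the failure is injected on purpose, so a green result actually proves the continue-on-error behavior rather than proving nothing went wrong by accident.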
So all the places where you see the typeof keyword here — these are all examples it was coming up with. If I just jump to one of these, you can see there's a whole pile of them listed at the bottom. What's interesting is that a bunch of them are nested inside this class, and technically none of those should get registered by default, because they're nested inside of a type. I know the implementation because I wrote the framework. What it should do is check that none of those actually come up. It's hard to know; we'd have to go through these one by one to see whether it missed an edge case. Again, this is the type of thing where I would really urge you: if it spits out a whole bunch of different test scenarios, you definitely want to check them, because a large number of tests — what was it, 60-some? — does not necessarily mean they're good tests. They could be very misleading. Something else to point out, just quickly scanning through here: a lot of these tests, if you notice, are almost the exact same structure, right? Is-injectable-type, is-injectable-type, assert false, assert true, whatever. We could probably use an xUnit theory here. So one of the things I'm thinking is that I would go back to Claude and say, "Great — for the default type filterer tests, refactor these to use xUnit theories instead," and we'd see a lot of this code get collapsed down. Not terrible. This one probably gives me some confidence, but I would need to go through and make sure it's testing the things I care about. Let's look at one more. This is an assembly sorter; let's do the alphabetical one, because alphabetical is pretty obvious. I think this is going to check that the assemblies it's loading are in alphabetical order, which makes sense. Then sorting with an empty list — so these are some edge cases.
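As a sketch of that fact-to-theory collapse — the `TypeFilter` class and its rule here are hypothetical, not Needler's real filterer — a single `[Theory]` with `[InlineData]` rows replaces a pile of near-identical `[Fact]`s:

```csharp
using System;
using Xunit;

// Hypothetical filter; stands in for the real default type filterer.
public static class TypeFilter
{
    // Made-up rule for illustration: concrete, non-nested classes only.
    public static bool IsInjectable(Type type) =>
        type.IsClass && !type.IsAbstract && !type.IsNested;
}

public class TypeFilterTests
{
    // Each row reads as its own scenario, and the shared structure
    // lives in exactly one place instead of a dozen copy-pasted facts.
    [Theory]
    [InlineData(typeof(string), true)]             // plain public class
    [InlineData(typeof(System.IO.Stream), false)]  // abstract class
    [InlineData(typeof(int), false)]               // value type, not a class
    public void IsInjectable_MatchesExpectation(Type type, bool expected)
    {
        Assert.Equal(expected, TypeFilter.IsInjectable(type));
    }
}
```

Adding a new edge case (say, a nested type) then becomes one extra `[InlineData]` row rather than another duplicated test body.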
If we pass it a null, it should throw, because there are guard clauses in there. This feels pretty good to me. Sorting with a single assembly: we get one back. If we have duplicate names, what should it do? I don't have any error handling — like, if there are two with the same name, should it throw? It doesn't do that, and this is a totally valid test. And then it's giving us a list of assemblies to work with. This isn't a terrible one; it's probably pretty decent. Like I said, a lot of these classes are very simple, so I'm just trying to give you a sense of what I'm looking for when I'm reviewing AI-generated code, especially with tests. But if we go over to the test explorer — that's popping out way too far, that's okay — I'm going to run them all and we'll see. It told me in the console output that it did run them and they all passed, so if anything fails, I'd be very confused, but it's possible. And there we go. Right, so it does actually fail when it's running here. Alphabetical. This one here says it fails. What does it fail on? It says it's not in alphabetical order, because it's finding these things out of order. It's just cut off because of the resolution of my screen. An interesting fact that you don't have context for, because you didn't see it: I have run this prompt before, and it generated tests, and I've already gone through this manually — it just made different tests that time. But this is actually a test that it recreated; this is the second time I've seen it. I think what's happening is that when it's running the assembly loading, there's a difference between running it on the command line versus here. I think this likely has to do with the fact that on the command line, it's running in WSL, the Windows Subsystem for Linux, compared to me running it in Visual Studio.
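A hedged sketch of what these sorter tests could look like — `NameSorter` below is a hypothetical stand-in working on names rather than real `Assembly` objects — covering the happy path plus the empty and null edge cases:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Xunit;

// Hypothetical sorter; the real alphabetical sorter works on assemblies.
public static class NameSorter
{
    public static IReadOnlyList<string> Sort(IEnumerable<string> names)
    {
        // Guard clause, as described for the real sorter (requires .NET 6+).
        ArgumentNullException.ThrowIfNull(names);

        // Ordinal comparison keeps ordering identical across environments.
        return names.OrderBy(n => n, StringComparer.Ordinal).ToList();
    }
}

public class AlphabeticalSorterTests
{
    [Fact]
    public void Sort_ReturnsNamesInAlphabeticalOrder()
    {
        var result = NameSorter.Sort(new[] { "Zeta", "Alpha", "Mike" });
        Assert.Equal(new[] { "Alpha", "Mike", "Zeta" }, result);
    }

    [Fact]
    public void Sort_WithEmptyInput_ReturnsEmpty()
    {
        Assert.Empty(NameSorter.Sort(Array.Empty<string>()));
    }

    [Fact]
    public void Sort_WithNull_ThrowsArgumentNullException()
    {
        Assert.Throws<ArgumentNullException>(() => NameSorter.Sort(null!));
    }
}
```

Worth noting for the WSL-versus-Visual-Studio failure above: culture-sensitive string comparison can order strings differently across operating systems, so if the sorter or the test uses culture-aware sorting, pinning both to `StringComparer.Ordinal` is one plausible way to make the ordering environment-independent.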
That's something I ran into when I was dealing with this the first time: I pushed it up to my build system and that broke, because I think it's also running in Linux there. Something to consider. But there's one more thing, and I was hoping I would easily come across an example here. There's one more thing I think is kind of dangerous that it does in these tests, and I'm just trying to scan through. Sorry. I saw somewhere it said it was going to do it, but I don't see it coming up. Maybe it's in here. Okay, maybe not. I can't find an example of it, but just to explain something I've seen be kind of dangerous: there's a rule that you're not supposed to have if statements in your tests. I don't like saying "always" or "never" — if you've heard me talk about this kind of stuff before, I think there are always exceptions. But in this case, I have seen it, in a single Fact — which is not a parameterized test, right? None of these tests are parameterized — put an if statement around an assertion, because it's like, "Oh, sometimes this doesn't pass," so it wraps it in an if statement. To me, this is extremely dangerous, because it means you have some test code that is sometimes not running. If there's an if statement around an assertion, then sometimes that condition will evaluate differently and the assertion will be skipped. I would really encourage you, when you're reviewing AI-generated tests, to look for if statements. If you're writing parameterized tests and you have if statements and the like, it's a bit of a smell. I wouldn't freak out necessarily, but it's something to be careful about. In a Fact, though — a single, non-parameterized test — I really think that having an if statement in there is dangerous.
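A hypothetical sketch of the smell being described — the data here is made up purely for illustration — showing a conditional assertion that can silently verify nothing, next to a version that always asserts:

```csharp
using Xunit;

public class ConditionalAssertionSmell
{
    // The dangerous pattern: a non-parameterized [Fact] where an if
    // statement guards the assertion, so on some runs the test body
    // executes but verifies nothing at all — and still shows green.
    [Fact]
    public void Dangerous_SometimesAssertsNothing()
    {
        var items = new[] { 1, 2, 3 }; // imagine this comes from real code
        if (items.Length > 3)
        {
            Assert.Equal(4, items[3]); // silently skipped when Length <= 3
        }
    }

    // Safer: make the condition part of the expectation itself,
    // so every run of the test verifies something concrete.
    [Fact]
    public void Safer_AlwaysAsserts()
    {
        var items = new[] { 1, 2, 3 };
        Assert.Equal(3, items.Length);
        Assert.Equal(new[] { 1, 2, 3 }, items);
    }
}
```

The first test passes whether or not the interesting branch runs, which is exactly why a green check mark there inspires false confidence.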
So I've noticed that when it's running tests, looking at the output, and trying to refine things, sometimes instead of fixing the crappy test — or whatever's actually wrong in the code — it goes, "Hmm, if I put an if statement in the test, it works around it." So be very, very careful about that. In general, though, I'm not very upset with this. I think it did a decent job, but I would personally be going through all of these and double-checking them, because it wrote 60-some-odd tests. That's a bunch of code to review. I don't recommend blindly taking that, pushing it, and saying, "Hey, look, green check marks, everyone's happy." I would go through them one by one and make sure the tests make sense. One more quick note: if the format of the tests it creates is not something you like, ask the LLM to change that. Go back to Claude and say, "Hey, I want you to do it this way; structure them differently," like the example I gave you with using facts versus theories. If you like having more parameterized tests, encourage it to do that. The reason I say this is that if you start landing code into your codebase that follows patterns you don't want propagated, and you keep using AI, it will keep propagating those crappy patterns. So, just a friendly reminder. In general, I hope you found this interesting and that it gives you some ideas to think about when you're using AI to write your tests for you. And yes, this is a NuGet package that I will be releasing. It is in very early alpha stages. It's called Needler, and the goal is that it's an opinionated dependency injection setup that lets you scan assemblies and automatically register types. I use it in my own applications, like BrandGhost, which I build on the side. So, thank you so much for watching, and I will see you in the next one. Take care.

Frequently Asked Questions

What is Claude Flow and how does it help with writing tests?

Claude Flow is a tool that allows you to have an agentic swarm on top of Claude Code, providing more structure to your testing process. It helps me create better prompts for generating tests, ensuring that the AI stays focused and doesn't go off on tangents.

Why do you have concerns about AI-generated tests?

I have concerns because AI can sometimes be overly confident in its outputs, which may lead to tests that sound good but don't actually verify the intended functionality. It's crucial to review and validate the tests to ensure they are meaningful and accurate.

How do you ensure the AI-generated tests meet your specific requirements?

I make sure to provide very explicit instructions in my prompts, detailing the frameworks I want to use, the structure of the test projects, and any specific preferences I have, such as avoiding certain packages or comment styles.

These FAQs were generated by AI from the video transcript.