BrandGhost

How to Read HUUUGE Files in C# - Designing a Segmented Stream

This video walks through an initial design for a segmented stream in C#. We jump between the digital whiteboard and then into Visual Studio to discuss some dotnet concepts for how to make it work! Want to see the follow-up with optimizations? Check this out: https://youtu.be/IeW4Q1Uc-Ak For more videos on programming with detailed examples, check this out: https://www.youtube.com/playlist?list=PLzATctVhnsghN6XlmOvRzwh4JSpkRkj2T Check out more Dev Leader content (including full in-depth articl...
Transcript
Today I want to walk through this streaming API that I'm trying to build. I want to talk about this because a lot of the content that I put out, and a lot of the questions I get from people, are about how to get more confident with programming, and I find I'm always defaulting to: go build things, not just read about stuff. This is a perfect example I wanted to highlight, because it's a situation where, if I searched long enough on the internet, I'm sure I could find a package that does something close to what I want, but I wanted to take this time to learn about it myself, try going through the optimization process, and just explore so I have a chance to learn on my own. For some context, here's what I'm going to be building (this will take multiple videos to go through, so this is just the intro). I want to put together a stream. If you're familiar with the System.IO namespace and the Stream class in particular, like file streams, I want to create a stream class, or something with a streaming API; it doesn't necessarily need to be a Stream class, but something that lets me iterate over a bunch of bytes. What I want is an API that feels like I have contiguous access to an entire file stream, where multiple consumers can effectively read from this file stream (or source stream in general, to keep it generalized) without being able to mess each other up, with performance characteristics much better than just seeking to random parts of a stream and thrashing the disk if you're on an HDD versus an SSD. That point alone is what got my brain focused on what this design looks like. So I'm going to spend a
couple seconds in Paint trying to draw out a visual for what I'm thinking. I want to show you where I've gotten to initially, along with some early performance characteristics; they look terrible, and that's okay. I have something that's functioning and actually does exactly what I want; I have a more complex application consuming this and doing what it needs to do, and now I'm going to be going through the optimization process, the testing, and that kind of stuff. So I just want to give you a bit of an intro to what building a streaming API looks like. All right, so I apologize for the brightness, but let's get drawing in Paint (and of course I always apologize for my Paint drawings). I'm going to start drawing what would represent the bytes on a disk that I'm interested in reading. This would be the source data, and in my particular situation I'm focused on extremely large files, so this would be many bytes: let's say anywhere from one gigabyte all the way up to multiple terabytes. The idea here is that it's a lot of bytes, more data than you would ever want to be jamming into RAM. At first glance, if you consider some of my API requirements, I just want contiguous access to data that would represent something like a file, and I want many consumers to be able to quickly access and navigate that data. If you think about pulling it into an array and having it all in memory, that would be perfect, but I can't do that, because I can't jam all of this data into an array. Even in that example I just gave, which seems to me like the most performant way to make this work, well, I'd need it all in memory; in my opinion that would be the quickest way to navigate this stuff, but the problem, like I said, is that it can't all fit. So I'm also going to draw in the fact that I want to have some
consumers down here. These would be multiple consumers; there could be n of them. The idea is that there's no fixed number of consumers I'm designing this for; it's literally made for n consumers. I've drawn three, but assume it's n. They want to be able to read from this data, and I want it to be as performant as possible. What I was thinking originally: I want to pull this data into memory, but because I can only store so much, I can only store the most relevant data. So what does that mean in my particular use case? The most relevant data is most likely the latest data that I've read. If you think about it (I'm just going to use a different color here), I want to start reading at the beginning and then keep reading this way, so the most recent data is going to be at the highest offset I've reached. That would be the case for the most part, and that's the naive way to look at this, but the reality is that while these consumers are mostly operating near the most recent data being read in, each consumer is technically able, if it needs to, to jump ahead or jump back and read other parts of the data. It's not just a sliding window over the most recent data, even though that was my very, very initial approach in designing this. I think that would work for, call it, the 90% use case of what I'm looking at, but truly I need to support access to what feels like a contiguous view of literally the entire thing. This got me thinking even more, because okay, if it's not just a sliding window (let me get rid of all these other red lines here), you know, if I wanted to do a sliding window here and then move the sliding window over, and now it's over here,
right, this was my initial idea, and then these consumers could always read from the current sliding window. But that's not going to fly here: they need to be able to jump outside the bounds of these sliding windows, potentially. The key word is potentially; it's not super common, but it's possible. So I needed to revamp my design a little bit. When I talk about my process of going through these designs, how much am I coding at this point? Only very little, just so I can get familiar with some of the things I have access to. I had already decided the sliding window wasn't going to work before writing anything that would even compile; I was just exploring things like streams and spans to see the different methods I have available, but I hadn't actually coded anything that would attempt to compile. I had to move on to something else, but I still thought my idea was on the right track. I liked the idea of the sliding window, but it wasn't going to accomplish the goal, because things had to look outside of it. Instead, I still want to offer this contiguous view. So I'm going to draw another layer here: I want to have this API for my consumers, and each of them, like I said, should be able to jump around to anywhere they want in here, and this contiguous block that they can see maps directly to what's on the disk up here. The piece I felt was missing goes back to my point about caching: I want to pull in the most recent, or most relevant (I think that's the better word), data. "Most recent" was essentially the sliding-window idea, but "most relevant" will actually support the consumers jumping outside the sliding window when they want to. My thought process at this point was: instead of something like an array for a big sliding window, what other data structure lets me have pieces of information stored, but not necessarily keyed by a direct in-order position like an array?
And for me, a dictionary was the thing that stood out, or rather a particular sort of dictionary implementation that really stood out as actually supporting this concept of "what's most relevant": a least-recently-used cache implementation. What this looked like for me was: okay, I want to have something that's like a cache, so I'm going to draw that here, and this is really the core of what I was building. This is some cache, and it will have some upper bound on size. This allows me, as the developer, to decide how much memory I'm willing to give this thing to work properly. Can I afford to have a gig of RAM allocated for this, and then have up to a gig of data at a time in memory for these consumers to use? Can I only afford 100 megabytes? Is it only 64 megabytes? Whatever. The idea from here was that I would have all of these individual chunks that I could pull in; essentially, I could lazy-load the data that people were trying to read. Every time I had a cache miss, you know, one of the consumers is trying to read a little further ahead and we don't have one of these pulled in, no problem: it just drops another one of these in, goes and loads the data for that chunk, and then I have access to it. And because I agreed that I have some upper bound on size here, I don't have to worry about evicting anything until I fill my cache. Once I'm at that point and there's no more space for blocks, what would happen is, if some other consumer said, hey look, I need (I'm going to use a different color; come on, Paint) data from a different point here, and you can see I don't have any green boxes down here, well, that would mean that the least recently used chunk has to be evicted (it doesn't necessarily have to be in the order I drew it, either), because this cache actually maintains which segment was least
recently used. So if that happened to be this one, right, no one had touched this one in the longest time, this one gets evicted, and then we can pull this one in. Now we maintain this upper size limit on the cache, and what's cool is that because no one was recently using that chunk, at least based on my use case, it's very unlikely someone is going to have to look at it ever again. And even if they do have to look at it again, we would just evict the next least recently used chunk and pull in the next most relevant one. This actually gives me the ability to piece together the chunks in a way that feels contiguous to these consumers. The more consumers I have, the bigger my cache size would likely need to become, just to avoid performance implications. If I had a hundred consumers that needed to go through four terabytes of data and my cache was only, say, a megabyte, it's very likely that with that many consumers on such a small cache, they'd be constantly asking for different pieces of the disk to be pulled into memory. So in my opinion this was a flexible design, and I figured, I think I can go implement this; let's give it a shot as the first pass. All right, so I'm over in Visual Studio, and I'm just going to show you what I have currently. I'm going to start by walking through this setup application, show you the output, and then dive into a couple of details, and then I'll cut the video so I can follow up with another one that shows how I'm optimizing this. In this example, all I'm doing is opening up a file that I have on my desktop, and I have this function whose job is basically just to open up a new FileStream. This is going to be an important point: why didn't I just open a new FileStream and store it in a variable? It's because I need to be able to open new streams, because if I have multiple
consumers, at some point they need to be able to not thrash another consumer's stream. Now, it doesn't have to work this way; you could absolutely implement a stream with locking on it so that it could be shared across threads, for example. I just don't want to dabble with that, and I felt that, at least in my case, I'm totally able to open new streams. If we scroll down a little, these are just some variables I can play around with. For these three lines I have here: the segment size says I want to allocate four megabytes per segment of this file. It's a large file if you compare it to memory, a gig and a half, so I want four-megabyte chunks, and I'm saying I'm willing to hold up to one gigabyte of RAM for these chunks. Then this is just a calculation of how many segments I'd be willing to have in total: this is the count of the segments, and this is the size of memory they should take. Next, I create my segment map; this is the new class I was implementing. I'll briefly touch on what we have here, and it won't make as much sense until we dive into the details, but this ArrayPool is just used so I can have buffers I don't have to constantly allocate, because we have to do some byte copying. The second argument here is the function that will actually open up my stream; again, I'm not passing in the stream directly, but a function that will open a new stream instance for me. The third parameter is just the length, the size of the stream I'm dealing with. With my current streams, I will always know the size, and I thought it would be an interesting follow-up to see if I can change my design later to not require the whole length, because my current use case is looking at files on disk, where I will always know the length, but there might be situations where you're
streaming stuff over the network and you don't know the full size, and this might be a really cool way to have some amount of buffered data without knowing the whole length and still be able to jump around a little within your buffer, though of course, because you can't hold all of it in RAM, you have limitations. The final arguments are just the variables we talked about up here, and with that we can create the map. I'll jump into that after I look at the rest of this test code, but what I wanted to prove was: does my stream actually function end to end and look like a normal stream? All that this code on screen does is time how long it takes to compute a hash of the source stream, and then compute a hash on my segmented stream. This segmented stream just wraps that segment map and gives us the API that makes the data look contiguous. A quick note: I had talked to a contact at Microsoft who I thought would have some really valuable information about how I might approach this, and I'm going to follow up with him again to talk about my current implementation, but we both agreed that something like a Stream for this might not actually be the best approach. Even though the API looks nice, streams involve a lot of copying data around, and we actually want to avoid copying as much as possible. So this might be something I revisit, but most of the work I did was actually in the segment map and less in the segmented stream. The rest of this test code is almost exactly the same; we're just timing the segmented stream instead. I'm going to go ahead and run this, and when I edit the video I'll just fast-forward to the end. All right, so there's good news and there's bad news. If you look at my console output, I have the original and segmented data output, and the good news is that this hash that we have here is
the exact same for both streams. So at least for continuous reading, both streams, the original, of course, and the one I've manually created, produce the exact same data, which means my implementation for continuous reading does in fact work as expected. If even one byte were off, this hash would be completely different, so that's good news. The bad news is that it takes literally twice as long to go through the data, and that's pretty crappy. In follow-up videos I'll walk through how I'm trying to optimize this, because at this point I actually don't know for sure, but the only thing I can think of is doing a better job with buffer copying and trying to minimize it as much as possible. I'm just going to spend a little time going through this segment map to show you a couple of the pieces I thought were interesting, but everything you see in this class should mostly reflect what we were looking at in Paint as I was chatting through it. You can see all the parameters passed in; I still have some code I want to clean up to check some things out, and in this constructor I'm just setting up my variables, nothing too fancy, just some bounds checking, and like I said, I still need to go validate some things. I wanted to expose some of this data so that if I were putting a stream wrapper around this, like I did, I could see it, and if I want to make other implementations that make this data look contiguous, they'd have some of it available as well. This needs to be disposable: I'm potentially going to be pulling a ton of data into RAM, and I need a way to clean it up. And this segments variable I have here, I should have mentioned at the top, uses a caching NuGet package that I'm a really big fan of, called BitFaster.Caching. It has this ConcurrentLru, which is a least-recently-used cache, and then I'm able to put in the
different chunks that I want to pull in. I have them called segments, and that's all the pieces that will be stored in memory. They're keyed by the index of the chunk, which means if we know where we're trying to read within the full size of the data, we have an index for that chunk, because each chunk is the same size, so I can look up any chunk and access its data. What's cool about that, when I phrase it that way (I'm just going to scroll a little lower), is that I can write a method that says: go get me that chunk, go get the segment for the index I have. Basically, I can ask the cache to get the one that exists for this index, or add a new one; if I don't have that segment currently pulled into RAM, no worries, just go load it up, and that's what this code down here does. All a segment is is a really small wrapper around opening up a stream and copying that stream's data into RAM, which means we don't need the stream anymore; it's basically just holding the data in an array I grabbed from the ArrayPool. It's actually a really simple class, and the easiest way to look at it is that I'm lazily creating an array of data from the stream. The other important part about this piece is: what happens if we don't have any space? Well, no worries; if we've reached the capacity of the segments cache, it takes care of evicting the least recently used one. So all of this code right here gives us like 90% of what I needed to have the most relevant chunks of data pulled into memory. The last little bit here, and I guess maybe I should highlight this line too, might not look totally obvious, and that's okay, but because I need to dispose of things at the right time and give back some memory, I need to be able to properly remove things from the cache and reference-count properly, which is handled by the cache, and then I
need to return the arrays I was holding to the ArrayPool so that other segments can go use them. Finally, I just have this leased-segment thing here, and all it's doing is wrapping the disposability of the ISegment and the scoped class that the cache has to use for reference counting. I have this as private just because no one really needs to see it; you can see it's mostly pass-through kind of stuff, and really it's just about being able to call Dispose on these things properly. I'm going to quickly jump over to the stream, just to prove to you that there's not really anything fancy going on in here, but because I have this segment map passed in, if I scroll down, these are just the methods you have to implement on a Stream, because it's an abstract class. This Read method is really where all the magic happens. What you'll see is that all I'm doing is figuring out which segment I need to read, at which offset within that segment, getting the actual segment itself, and then reading a byte at a time from that segment into the buffer. Personally, I think there's a huge optimization to be had here, because what this reader's TryRead is doing is a single byte at a time, allocating one byte at a time here, and I'm pretty confident that if I did something else, like reading in larger chunks across these different segments, I could potentially do a faster byte-array copy in this loop. So when I start investigating where my optimizations are, my eye is on this section right here. Otherwise, the rest of the stream class is quite simple: I have some seeking, and it doesn't support any writing, so this is a read-only type of stream, and then my cleanup code just makes sure that once I'm done with this stream, if it has ownership of the segment map, all of those chunks I pulled into memory get disposed of. All right, so just a quick summary: the design I was looking at was trying to
have a contiguous view of a file stream, or in general any stream, that multiple consumers could go look at. The idea was that I initially wanted to pull the data into a sliding window, but then I realized my consumers need to jump outside that sliding window, so I ended up going with a cache approach: a least-recently-used cache. Where I'm currently at with my design is that it works, it's functional, but it's slower than I want it to be. So I have some next steps where I want to go optimize, and I'm pretty confident it has to do with my byte copying between buffers; I think that's where I'm going to focus my efforts. I will be following up on this video; I haven't recorded the follow-up yet because I haven't made the optimizations, but I'll eventually link it at the end of this one, probably right over here, and I'll update the comment as well, so if you're watching this in the future and this video is now in the past, you can go follow up with the new stuff. Thanks so much for watching. Take care.
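To make the design described in the transcript concrete, here's a minimal, self-contained sketch of the idea: fixed-size segments, lazily loaded through a stream factory, with least-recently-used eviction once a capacity is hit. This is not the video's actual implementation (which uses BitFaster.Caching's ConcurrentLru and ArrayPool-backed buffers); all names here are illustrative, and the "disk" is simulated with an in-memory stream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative LRU segment cache: Dictionary for O(1) lookup by segment index,
// LinkedList to track recency (front = most recently used).
class SegmentCache
{
    private readonly Func<Stream> _openStream;
    private readonly int _segmentSize;
    private readonly int _capacity; // max segments held in memory at once
    private readonly Dictionary<long, LinkedListNode<(long Index, byte[] Data)>> _map = new();
    private readonly LinkedList<(long Index, byte[] Data)> _lru = new();

    public SegmentCache(Func<Stream> openStream, int segmentSize, int capacity)
    {
        _openStream = openStream;
        _segmentSize = segmentSize;
        _capacity = capacity;
    }

    public byte[] GetSegment(long index)
    {
        if (_map.TryGetValue(index, out var node))
        {
            _lru.Remove(node);      // cache hit: move to the front of the recency list
            _lru.AddFirst(node);
            return node.Value.Data;
        }
        if (_map.Count >= _capacity) // cache full: evict the least recently used segment
        {
            var last = _lru.Last!;
            _lru.RemoveLast();
            _map.Remove(last.Value.Index);
        }
        var data = Load(index);      // cache miss: lazily load the segment from the source
        _map[index] = _lru.AddFirst((index, data));
        return data;
    }

    private byte[] Load(long index)
    {
        using var stream = _openStream(); // a fresh stream per load, so no shared Position
        stream.Seek(index * _segmentSize, SeekOrigin.Begin);
        var buffer = new byte[_segmentSize];
        int read = 0, n;
        while (read < buffer.Length && (n = stream.Read(buffer, read, buffer.Length - read)) > 0)
            read += n;
        return buffer;
    }

    public int Count => _map.Count;
}

class Demo
{
    static void Main()
    {
        var source = new byte[256];
        for (int i = 0; i < source.Length; i++) source[i] = (byte)i;

        var cache = new SegmentCache(() => new MemoryStream(source), segmentSize: 16, capacity: 4);
        Console.WriteLine(cache.GetSegment(0)[5]);  // 5
        Console.WriteLine(cache.GetSegment(9)[0]);  // 144 (segment 9 starts at byte 9 * 16)
        for (long i = 0; i < 8; i++) cache.GetSegment(i);
        Console.WriteLine(cache.Count);             // stays capped at 4
    }
}
```

Consumers can jump anywhere in the logical range; only up to `capacity * segmentSize` bytes are ever resident, matching the "most relevant data in memory" goal from the whiteboard.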

Frequently Asked Questions

What is the main purpose of the streaming API being discussed in the video?

The main purpose of the streaming API I'm building is to create a way to read extremely large files efficiently in C#. I want to design an API that allows multiple consumers to access a file stream without interfering with each other, while also optimizing performance to avoid thrashing the disk.
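One way to sketch the "multiple consumers without interference" requirement is the stream-factory approach mentioned in the video: rather than sharing one Stream instance (whose Position is mutable shared state), each consumer or segment load opens its own stream. The file path and setup below are hypothetical, just for illustration.

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // Hypothetical source file, created here so the example is self-contained.
        string path = Path.Combine(Path.GetTempPath(), "sample.bin");
        File.WriteAllBytes(path, new byte[] { 1, 2, 3, 4 });

        // A factory instead of a shared Stream: every call yields an independent stream.
        Func<Stream> openStream = () => File.OpenRead(path);

        // Two "consumers" read independently; one's Seek cannot corrupt the other's Position.
        using (Stream a = openStream())
        using (Stream b = openStream())
        {
            a.Seek(2, SeekOrigin.Begin);
            Console.WriteLine(a.ReadByte()); // 3
            Console.WriteLine(b.ReadByte()); // 1 -- b was untouched by a's seek
        }
        File.Delete(path);
    }
}
```

The alternative, a single shared stream with locking, is also viable, but as noted in the transcript it adds synchronization complexity that this design deliberately avoids.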

Why did you choose to implement a cache for the streaming API?

I chose to implement a cache because it allows me to store the most relevant chunks of data in memory while keeping the overall memory usage manageable. This way, consumers can quickly access the data they need without having to load the entire file into RAM, which is crucial for handling large files.
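The chunk lookup this answer describes comes down to simple index arithmetic. Here's a small sketch using the segment size and cache budget mentioned in the video (4 MB segments, up to 1 GB of RAM); the variable names are my own.

```csharp
using System;
using System.Diagnostics;

class SegmentMath
{
    static void Main()
    {
        const long segmentSize = 4L * 1024 * 1024; // 4 MB per segment, as in the video

        // Map an absolute stream offset to (segment index, offset within that segment).
        long offset = 10_500_000;
        long segmentIndex = offset / segmentSize;          // which chunk to fetch from the cache
        long offsetInSegment = offset % segmentSize;       // where to read inside that chunk

        Debug.Assert(segmentIndex == 2);
        Debug.Assert(offsetInSegment == 10_500_000 - 2 * segmentSize);

        // How many segments fit in the 1 GB memory budget:
        const long cacheBytes = 1L * 1024 * 1024 * 1024;
        long segmentCount = cacheBytes / segmentSize;      // 256 segments

        Console.WriteLine($"segment {segmentIndex}, offset {offsetInSegment}, capacity {segmentCount}");
    }
}
```

Because every chunk is the same size, any absolute position resolves to a cache key in constant time, which is what makes the dictionary-style LRU cache a natural fit.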

What are the next steps you plan to take after this initial implementation?

After this initial implementation, my next steps are to optimize the performance of the streaming API. Specifically, I want to focus on improving the efficiency of buffer copying, as the current implementation is slower than I would like. I'll be recording a follow-up video to share my optimization process.
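As a sketch of the kind of optimization described, here's a hypothetical Read that copies one contiguous run per segment with Array.Copy instead of one byte at a time. GetSegment here is a stand-in for the video's segment map, backed by an in-memory array so the example runs on its own; none of these names come from the actual implementation.

```csharp
using System;

class ChunkedCopy
{
    const int SegmentSize = 16;

    // Stand-in for the segment map: materialize one fixed-size segment of the source.
    static byte[] GetSegment(byte[] source, long index)
    {
        var seg = new byte[SegmentSize];
        Array.Copy(source, index * SegmentSize, seg, 0, SegmentSize);
        return seg;
    }

    // Copy `count` bytes starting at `position`, doing one bulk copy per segment
    // touched, rather than one ReadByte-style copy per byte.
    static int Read(byte[] source, long position, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            long segmentIndex = position / SegmentSize;
            int offsetInSegment = (int)(position % SegmentSize);
            // Largest run we can take from this segment before crossing its boundary.
            int run = Math.Min(count - total, SegmentSize - offsetInSegment);
            byte[] segment = GetSegment(source, segmentIndex);
            Array.Copy(segment, offsetInSegment, buffer, offset + total, run);
            position += run;
            total += run;
        }
        return total;
    }

    static void Main()
    {
        var source = new byte[64];
        for (int i = 0; i < source.Length; i++) source[i] = (byte)i;

        var buffer = new byte[20];
        int read = Read(source, position: 10, buffer, 0, 20); // spans segments 0 and 1
        Console.WriteLine(read);        // 20
        Console.WriteLine(buffer[0]);   // 10
        Console.WriteLine(buffer[19]);  // 29
    }
}
```

Whether this is where the 2x slowdown actually comes from is exactly what the follow-up video sets out to measure; this only illustrates the shape of the candidate fix.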

These FAQs were generated by AI from the video transcript.