Summarify.net

Feed Your OWN Documents to a Local Large Language Model!

fFgyOucIFuk — Published on YouTube channel Dave's Garage on October 8, 2024, 8:55 PM


Summary

This summary is generated by AI and may contain inaccuracies.

- Dave hosts a talk about how to add your own knowledge files and documents to a large language model, and explains the difference between retraining a model and simply providing it documents.
- Dave demos a modestly sized 1 billion parameter model running locally on the dual Nvidia RTX 6000 Ada setup to see how fast it performs, then compares it against a 70 billion parameter model.
- There are three primary ways to add information to an existing model: retraining it, retrieval augmented generation (RAG), and uploading documents directly into the context window. Each has its own strengths depending on how permanent or flexible the model's knowledge needs to be.
- There are several reasons why a formal retraining of a model won't be done here. The first is the lack of access to the model weights; the second is the hardware and software requirements.
- Open WebUI is run locally this time rather than in the Docker container.

Video Description

Dave explains how retraining, RAG (retrieval augmented generation) and context documents serve to expand the functionality of existing models, both local and online. For my book on the autism spectrum, check out: https://amzn.to/3zBinWM

Dave's Attic - Friday 4PM Podcast - https://www.youtube.com/@UCtb6a_CnmGbSns9G8W2Ny0w

Follow me for updates!
Twitter: @davepl1968
Facebook: fb.com/davepl

Transcription

This video transcription is generated by AI and may contain inaccuracies.

Hey, I'm Dave, welcome to my shop. Today's episode is one of the most requested that I've ever done: how to add your own knowledge files and documents to a large language model, both local and online. I'll explain the difference between retraining a model, using retrieval augmented generation or RAG, and providing documents for the context window. I'll show you how to upload your own documents, both to ChatGPT and to a local Ollama model running under Open WebUI. Once you insert your documents into the model, they become part of its knowledge base for answering your questions.

Now, before we dive into retraining, RAG, and context documents, however, I want to give you a little demo of a modestly sized model running locally on the dual Nvidia RTX 6000 Ada setup. That's because in my last episode I saddled the big workstation with a massive model that the other machines couldn't even hope to run, 405 billion parameters loaded into its 512 GB of RAM. But the protests in the comments were many; folks wanted to see this beast of a machine chewing on a regularly sized model to see how it would perform. So let's take a quick look at just how fast it can run a 1 billion parameter model.

To run the smaller model, I'll take a listing and we'll see that we do in fact have Llama 3.2, the 1 billion parameter model. And so I'll run that with the verbose flag, and as soon as the model comes up and is ready, we'll ask it to tell us a story and we'll see how many tokens per second it can actually generate. Well, it's cranking through pretty quick here, 345 tokens per second. Let's request a longer story to see if it can sustain that. It scrolls by very quickly. And when it is done, how many tokens per second do we get? 324. So it's able to easily sustain over 300 tokens per second generating on the dual RTX 6000 Ada machine. I would mention that it's a Threadripper, but it's using almost no CPU during that generation. For comparison, we'll launch the 70 billion parameter model, which is predictably about 70 times as big as the 1 billion parameter one. We'll ask it to tell me a story and we'll see how many tokens per second it can generate once it's done. Now I'll let it run for a little bit here so you can see the natural output speed, which is still very usable if you aren't generating tons of content. I think it's actually quite fast, but it's certainly slower than the 1 billion. Let's fast forward to the end and see just how many tokens per second this one is generating, and at about 20 tokens per second it's quite a ways off of the 300 pace. But it's still more than usable, so it's fast enough for most purposes.

Now let's get to our main topic of adding information to existing large language models. There are three primary ways in which you can do that: retraining a model, using retrieval augmented generation or RAG, and uploading documents directly into the context window. First, let's consider retraining a model. Think of the model as being a bit like a student who has already learned a lot, but now you want to teach them something new or correct their understanding. You go back to the basics with them, bringing in new books, updated lessons, and putting them through another round of study sessions. They don't forget much of what they already know, but they use your training to fine-tune and add to their knowledge.
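(For anyone who wants to reproduce the speed demo above, it comes down to a couple of Ollama commands. This is only a minimal sketch: the exact model tags below are assumptions based on the models shown, and the verbose flag is what makes Ollama report its timing statistics, including the tokens-per-second "eval rate" figure quoted in the video.)

    # List the models installed locally and confirm the 1B Llama 3.2 is present
    ollama list

    # Run the 1B model with --verbose so Ollama prints eval rate (tokens/s) after each reply
    ollama run llama3.2:1b --verbose
    # then at the prompt: "Tell me a story"

    # For comparison, run a 70B-parameter model the same way (tag is an assumption)
    ollama run llama3.1:70b --verbose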
That retraining process is thorough, but it's a lot of computational work, takes a lot of time, and once they've learned it, it's permanent. Every time you use the model from then on, it will have that updated knowledge embedded deep within it, ready to apply in all relevant situations. But retraining a model requires a lot of resources: more data, lots of computing power, and time. It's like sending your student back to school for a while to learn and improve on what they already know.

Now let's compare that to retrieval augmented generation, or RAG. Here, instead of retraining the model, it's as if your student doesn't have all the information they need at their fingertips, but knows exactly where to look. When asked a question, they quickly consult a library of books, pulling out the most relevant sources, and then give you a response that combines what they knew before and what they've just looked up. This process is faster than retraining because it doesn't involve the deep, permanent learning of new material. Instead, it allows the model to retrieve information dynamically from a database or document pool, crafting its answer based on up-to-date sources. It's a much more agile process, and it's great when you need the model to adapt to ever-changing large sets of data without retraining it every time.

And finally, there's uploading documents to the context window. Imagine you're in a one-on-one conversation with the model and you hand it some notes. The model can reference those notes while talking to you, but it won't internalize them the way it would with actual retraining. It's like a student who gets to peek at a cheat sheet during an exam: they can look at your document and use it to answer your questions, but once the exam is over or your session ends, they'll forget the information. This method is the quickest way to provide specific, immediate knowledge, but the information only lasts as long as that specific session or conversation. When you're done, the model won't retain the uploaded document unless you upload it again next time.

So in summary: retraining builds long-term, permanent knowledge within the model; RAG fetches relevant knowledge dynamically without needing to retrain; and uploading documents into the context window is like giving the model a temporary cheat sheet for quick reference. Each has its own strengths, depending on how permanent or flexible you need the model's knowledge to be.

Before we get into these approaches, let's look at why we won't be doing the first method, a formal retraining of a model. The first reason is openness, meaning that you have no access to the ChatGPT model itself, so there's no way you can directly modify or retrain it. With something that's more open, like Llama 3.2, the models are generally designed to be open in the sense that you can access the model weights and modify them, but certain conditions still apply. You'd need access to the model weights, the core data that defines the model's knowledge, and depending on licensing and availability, you may or may not be allowed to use those weights freely for commercial purposes or large-scale projects. But let's say none of those are serious roadblocks. The next problem is hardware and software. Retraining a model takes almost as much in terms of skill and resources as training that model did in the first place. Fine-tuning an LLM like Llama 3.1, even on a smaller dataset, still requires some serious hardware.
That means you're looking at leasing or buying data center GPUs like the Nvidia A100, and depending on the model size, it could require an enormous amount of RAM to work on it. The other hurdle at this point is that it requires significant programming, even if it is mostly just Python; you wind up needing to write code using PyTorch or TensorFlow or something similar to run your fine-tuning tasks. When combined with all the hardware required, this likely rules out full retraining for most people, so we'll focus on RAG and context, both of which are fully doable without custom coding.

Let's start with the easiest of these mechanisms: uploading additional documents into the model's context window. Our documents become the cheat sheets that we referred to earlier, allowing the model to reference them and incorporate them into its answers. With ChatGPT, you've likely noticed and even used the upload button at some point. When you upload a file to ChatGPT, it becomes part of its current context window.

Okay, let's ask about something it may have knowledge of from its general knowledge base, but where I want the specific order in which to push the buttons on the 11/34 to boot it up. Let's see if it knows. So that's sort of a general approach for starting a PDP-11. It does not take into account the owner's manual of the 11/34, which tells you how to actually use the boot switch to boot to a ROM location, which will bootstrap the machine. So it doesn't seem to have that context. In our next step, let's give it that context. This is the simplest of cases beyond just dragging a document into the actual browser window: we'll click on the paperclip to attach a document, we'll say upload from the computer, and I will pick, let's see, we'll pick the PDP-11/34 manual here. Now that it has that document and it's uploaded, we can ask it questions about the document: read the attached document and tell me the order to push the buttons to start the system. And now, thanks in part to the technical documents that we uploaded at the beginning of the session, it has the information that it actually needs to give us the correct answer: use the BOOT/INIT switch, which will go through the M9301 bootstrap, which will boot the system from ROM.

You could do this each and every time that you wanted ChatGPT to have access to this additional knowledge, but it gets a little cumbersome, especially if you have multiple documents to upload each time. So let's look at a better approach: creating our own custom GPT with our context documents fully baked in. Okay, here we are at ChatGPT, but we don't want just regular ChatGPT. Not even o1-preview, not even GPT-4o with canvas. We don't want any of those things; we're going to create our own custom GPT. Let's do that by going to Explore GPTs. Then in the top right, you'll see Create. Then we give a name and a purpose to our custom GPT: PDP-11 Expert. It is an expert at PDP-11 stuff, and what it's going to do is answer questions for us about that stuff. Now, this next point is where we can upload our files to form part of its knowledge. So we click upload files, and I've got a folder called PDP where I have a bunch of documents that are already OCR'd, and they're PDFs, and it can handle that format. So we'll just upload them directly. It'll take a few seconds to upload the files, but it goes surprisingly quick.
And now that they're all uploaded, as soon as our button becomes available here, we'll click Create, and I'll say anybody with the link can access it. PDP-11 Expert is the name, and we'll save it. Next we can click on View GPT to actually use it. Then I'll ask it a very specific question that will require it to do some research. We'll try that. It's searching its knowledge base; let's hope it finds something. You kind of have to actually read the document, but it actually does have that information, so let's see what it comes up with. 8067, that's the correct part number, and that is how you wire-wrap the boards. That would have been really handy when I did this about two weeks ago. I mean, why read the manual when you can just ask ChatGPT?

In a previous episode I showed you how to run Ollama and Open WebUI locally on your machine, so let's take a look at the steps needed to provide context documents to Ollama. Here we are on the local machine running Ollama and Open WebUI. I've clicked upload file by clicking on plus, and I've selected the PDP-11/34 user manual, just as I did last time with ChatGPT. Now let's ask a different but equally specific question: what are the power requirements for the PDP-11/34? It cranks out an answer really quickly, and it is in fact the correct answer. And at the bottom of the answer you will see that it has referenced a context document. Even better, we can click on that link and bring it up and actually see the part of the document that it's making reference to. I find that incredibly handy when you're looking up a citation for where it got the information that you're asking about.

The size of the context window itself can also be quite constraining. Think of numbers like 4,000 tokens for a model like Llama 3. If the documents you are trying to provide exceed the size of that context window, clearly it's not going to work as well. For example, I have a set of a half dozen PDP-11 reference documents that I'd like to incorporate into my searches, and the context window is generally not large enough for that.

Our next step, then, is retrieval augmented generation, or RAG, which is a system designed to dynamically retrieve information from an external knowledge base or database. When the user asks a question, the system searches through a large repository of documents or data and pulls out the most relevant pieces. This information is then combined with the model's internal knowledge to produce a more accurate and contextually informed response. The key strength of RAG is its ability to efficiently handle large amounts of data, pulling only what's necessary at the moment of the query. This allows for a more scalable approach, especially when dealing with complex, evolving datasets. The retrieval process ensures that only the most pertinent information is used, which makes RAG particularly useful for situations where accuracy and specificity are crucial. In contrast, uploading documents into the context window is a more static approach. When documents are uploaded, their content is inserted directly into the model's input window, giving the model access to that information for generating responses. It's essentially as if you had just typed all the documents into the query window. While it's a straightforward way of providing additional information to the model, it can be inefficient for large or numerous documents.
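(A side note on that context-window limit: if you're running models under Ollama, you can inspect a model's context length and, memory permitting, raise the window for an interactive session. This is only a sketch; the model tag and the 8192 value are assumptions, not anything shown in the video.)

    # Show model details, including its context length
    ollama show llama3.2:1b

    # Inside an interactive session, raise the context window for that session
    # (uses more memory; value is an example)
    ollama run llama3.2:1b
    # >>> /set parameter num_ctx 8192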
With the context window approach, the model has to work with everything that's provided in the input, regardless of whether all of it is relevant to the specific question being asked. So RAG's ability to dynamically retrieve information makes it a more efficient and scalable system. It can handle large datasets without overloading the model's memory or processing capacity. The retrieval process allows the system to be more selective, bringing in only the most relevant pieces of information in response to your query. This also makes it more adaptable in situations where the information being accessed frequently changes, as RAG can always pull the latest updates from the source without needing any manual adjustments. Now, the PDP-11 isn't changing much these days, but to maintain accuracy on more current topics, the user would need to regularly upload newer, revised documents, which adds a layer of manual upkeep. Ultimately, RAG offers a more dynamic and scalable approach, especially well suited for handling large and evolving knowledge bases, while uploading documents into the context window is better suited for smaller, more static datasets. RAG ensures that the model's responses are not only accurate but also timely, reflecting the most up-to-date information available. If you're pulling the data from a database directly, you're getting the very latest snapshots of the database; and even if you're not, with documents that are updated frequently you're at least getting the most recent documents. Conversely, uploading the documents provides a simpler though more limited method of augmenting the model's knowledge.

So let's take a look at setting up RAG on our own Ollama system. One note about the configuration: I'm going to run Open WebUI locally this time rather than in the Docker container, simply by enlisting in it from GitHub and then launching the start script from the folder called backend. I'm going to do it that way because it makes the documents folder easily available in my local file system. If you're running it in a Docker container, it'll still work, but you'll need to use the docker copy command to copy your files into the container's version of the data documents folder.

To make our documents available to models within the Ollama Open WebUI system, we need to go to the admin settings. So we'll go to the admin panel, we'll go to settings, and we'll go to documents. Now, I've copied the documents that I'm interested in into the actual folder underneath the Open WebUI folder that I enlisted in, but it won't automatically find them until you come in here and scan. So let's click on scan and see what happens. We'll monitor the other window to see its progress. There you can see it's obviously pulling in the DEC manuals and so on; I can see stuff about cylinders and hard drives. So it's now parsing this data, processing it, and it is now ready, so I can click save, which is slightly obscured by a little dialog box that I can't get rid of for some reason. But I'll click save, and those documents are now available to be used in creating custom models. Now that we have our documents registered with the Open WebUI interface, we can create a new model which incorporates the knowledge encoded in our documents. When we subsequently query the model for answers, it will use retrieval augmented generation to incorporate that knowledge into those answers. So in the Open WebUI interface, we want to go to Workspace.
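(For reference, the non-Docker setup described above amounts to roughly the following. This is a minimal sketch under assumptions: the repository URL is the standard open-webui project, but the data/docs path, the sample source folder, and the container name are illustrative, and the start script assumes the project's dependencies have already been installed per its README.)

    # Clone ("enlist in") Open WebUI and launch it locally from the backend folder
    git clone https://github.com/open-webui/open-webui.git
    cd open-webui/backend
    ./start.sh        # assumes dependencies are installed per the project README

    # Copy the reference PDFs into the local data/docs folder so the
    # Admin Settings > Documents > Scan step can find them (path is an assumption)
    cp ~/pdp11-manuals/*.pdf ./data/docs/

    # If you're running the Docker container instead, copy the files into the container
    # (container name is an assumption):
    docker cp ~/pdp11-manuals/. open-webui:/app/backend/data/docs/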
Once we're in Workspace, we want to create a model. We'll give it a name, we'll select a base model to work with, and I'm actually going to go with the smaller but efficient Llama 3.2, because I want it to be more dependent on my data and less on its own data that it brings to the table. Whoops, I got those backwards; let me swap those. We'll put that there, and we'll type PDP-11 expert in the description, and now it's a simple matter of telling the model which documents to reference. The easiest way is to pick all documents, and it will be able to reference everything in your data documents folder. However, that will be slower. So in this case, I'm just going to give it one or two documents that apply directly to my actual machine, and we'll see if it can then incorporate that document knowledge into its answers using RAG.

Now, back at the main chat window to interact with our model, we'll pick it from the list; it's now available as a PDP expert. Let's try a really specific question and see how it does. And you can see in the left-hand window it has found and hit our data and referred to it to produce its answer. Ah, and because I probably didn't give it the information that it needed on the memory board, the closest it could find was the RQDX3 controller, so it's going to answer my questions with pretty much that as its main knowledge about the PDP-11. So let's ask it something that it might know. Okay, that's a better answer. And let's see: it's got the right jumper and it's found the right point in the manual. I don't know whether the manual actually has more detail than that or not, but it did find the right reference. So as we can see, it is in fact hitting all the RAG data, which we can also see scrolling by in the left-hand window as we do a query. And again, it's able to synthesize and recite the facts that it learned from looking at the RAG data and give me at least approximately correct answers. And so by creating a custom model with your documents embedded in it, it will use retrieval augmented generation as it answers your questions.

If you found today's little sampler of AI augmentation to be any combination of informative or entertaining, please remember that I'm mostly in this for the subs and likes, so I'd be honored if you'd consider subscribing to my channel and leaving a like on the video before you go today. And if you're already subscribed, thank you. And with that, you're now an expert on context windows and retrieval augmented generation; collect your official certificate on the way out the door. Be sure to check out the second channel, Dave's Attic, where you'll find our weekly podcast where we answer viewer questions on episodes like this live on the air. Well, not really live; it's recorded a day in advance, so almost live on the air. The podcasts go live every Friday afternoon, and they're called Shop Talk. So take a look, maybe watch a back episode or two, see how it goes. Thanks, and I'll see you next time.