My new project is an entirely offline AI to answer any question you have. You can think of it like The Hitchhiker’s Guide to the Galaxy, or a Rabbit R1 that doesn’t have connection issues. Or perhaps it is more like Foundation’s Encyclopedia Galactica: an archive of human knowledge in case of the collapse of civilization. It should be a chatbot that can answer any question, yet fit in your hand and work without an internet connection. It could teach you to build a ceramic kiln or help you remember that movie you saw once.
But I’m not making it to make money. Perhaps there is some value in this idea, but I doubt there is a lot of money. The primary goal is like Wikipedia’s:
“to benefit readers by acting as a widely accessible and free encyclopedia.” (Wikipedia:Purpose)
And my secondary goal is to expand my own knowledge. While I know Python well enough to get around a notebook file, I have never set up and run a Python project myself. And while I’m an expert on productizing AI, I need a playground to try out new ideas and get a deeper understanding of the technology beyond that of a user and product manager. And finally, this is a good excuse to use AI as a programmer.
Progress so far
The rest of this post and future entries in the series will be my development blog: what I’m learning, what I’m discovering, and what I’m thinking about as next steps. You can follow along at abrakjamson/The-Archive (github.com). I’m currently at commit a598b1edcf3556a9fa1c8a48034b51e5a1e602f9 as I write this. Forgive the hardcoding – it is not set up to be runnable on anyone else’s computer yet.
The current state of the project uses Phi-3 to turn the question into search terms, searches a local copy of Wikipedia, and uses the results to generate an answer.
Pythonic Projects
My first issue was having little idea how to organize the code. I know I want people to be able to clone the repo and get it running themselves. I searched around the internet, got some advice from Copilot, and looked through some popular Python code bases. Here’s where I ended up for now:
The-Archive
| setup.py              # TODO: guide the user through downloading a model and database
| requirements.txt      # generated by pip freeze, so users can install dependencies
| models/               # host the LLM and any other models
| data/                 # host the database
| src/
|   | The_Archive/
|   |   | app.py              # for execution and display
|   |   | language_model.py   # for the LLM management, prompts, and chains
|   |   | local_wikipedia.py  # a Langchain retriever that calls the local Wikipedia database
|   | test/             # TODO: add tests here (and first learn how this is generally done in Python)
Having only interacted with Python via scripts, I was surprised to learn how it does object orientation. Unlike in Java or C#, you declare “self” explicitly as the first parameter of an object’s method, but you don’t pass it when calling that method. This will trip me up dozens more times, I am sure. I also keep forgetting Python’s capitalization conventions.
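A minimal sketch of the convention, with a made-up class just for illustration:

class Archive:
    def __init__(self, name):
        # "self" is declared explicitly as the first parameter...
        self.name = name

    def describe(self):
        return f"{self.name} answers questions offline"

# ...but it is never passed at the call site; Python supplies it.
archive = Archive("The Archive")
print(archive.describe())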
Choosing a Framework
Any good project needs a framework! I chose Langchain, mostly because it’s very popular. And I’ve nearly quit using Langchain several times already. It’s plenty frustrating, and so far it is barely helpful. I don’t like:
- Langchain Expression Language (LCEL) – this uses the “|” character to pass flow between Runnable objects (see the sketch after this list). Python doesn’t normally use “|” this way; LCEL overloads it to behave like a POSIX pipe, sending output from the left into input on the right. That is what happens with the LLM logic, but it doesn’t feel Pythonic. It’s not particularly easier to read, and I find it does little more than fill in variables inside strings.
- Chains – You use LCEL to define your chain of logic. That’s OK, but if you want to do anything in parallel, it becomes enormously difficult to follow how variables get passed around. And Langchain is too new for Copilot to teach me. In the project’s current state, I have it working without making everything a single chain. I’ll have to rectify this later.
- Local models – The core of this project is running everything locally. I would have thought that the most popular LLM framework would support running an LLM itself. Readers, it does not. You can import a Langchain Ollama library, but it isn’t very robust. It’s clear Langchain is meant to be used by calling an LLM API. I could split hosting the LLM out of the project, but I don’t want to make it any more complicated to run than I have to right now. Later, if this becomes a handheld device or app, separating it out as a microservice would be a good idea.
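To make the LCEL complaint concrete, here is a minimal keyword-extraction chain written with the pipe syntax. The Ollama wrapper and the “phi3” model name are assumptions for illustration, not how the repo is actually wired up:

from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Any Runnable LLM would slot in here; Ollama serving Phi-3 is just one option.
llm = Ollama(model="phi3")

prompt = PromptTemplate.from_template(
    "Give a comma-separated list of search keywords most likely to find "
    "an answer to the question. Do NOT answer the question.\nQuestion: {user_question}"
)

# Each "|" pipes the output of the left Runnable into the input of the right one.
chain = prompt | llm | StrOutputParser()

keywords = chain.invoke({"user_question": "When was Obama born?"})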
I’m sticking with Langchain for now. I’m optimistic that the logging and tracing it builds in with Langsmith will be a big help later.
RAG
You’re probably aware that adding information to an LLM at runtime is called “RAG,” for Retrieval Augmented Generation. It’s pronounced as a word, rhyming with “lollygag.” It’s a bad name, largely because nearly everything you can do with an LLM falls within it; it might as well just be called “developing.” This project won’t use RAG in the narrow sense the term implies, which would be simply adding some data to the user’s prompt with a single call to the LLM. There are far cooler things that can be done beyond that.
I knew from the start that I wanted Wikipedia to be the knowledge source for The Archive. Fortunately, Hugging Face has already done the work of formatting all of English-language Wikipedia into a clean dataset: wikimedia/wikipedia · Datasets at Hugging Face. It’s designed to be good text for pre-training, i.e., teaching a language model what language is, not to be a database for RAG. It’s close, though; I just wish it kept the cross-page links.
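Loading it takes only a couple of lines with the datasets library; a minimal sketch, where the “20231101.en” config is just an example of the date-stamped English dumps the dataset publishes:

from datasets import load_dataset

# Each record has "id", "url", "title", and "text" fields.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
print(wiki[0]["title"])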
Wikipedia’s online search leverages Elasticsearch (ES), which is what I’m using too. ES is designed for production-scale use, but you can run it locally without too much issue. Choosing Elasticsearch meant I had another technology to learn, but Copilot helped me generate the index definition commands and queries I would need. For some reason, fuzzy search was awful. “Obama” kept matching not Barack Obama but some guy named Ohama. Multi-match on the title and body has been better.
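Here’s roughly what that query looks like with the Python Elasticsearch client; the index and field names are my guesses at a sensible setup, not necessarily what the repo uses:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# multi_match across title and body, weighting the title so "Obama"
# prefers the Barack Obama article over lookalike names.
response = es.search(
    index="wikipedia",
    query={
        "multi_match": {
            "query": "Obama",
            "fields": ["title^2", "text"],
        }
    },
    size=3,
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])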
The AI Part
Wikipedia’s online search actually does a pretty decent job if you type in a full question, but I couldn’t find the source code with their magic sauce. And anyway, I imagine future users having a conversation with The Archive, not just typing in well-formatted questions. So I’ve inserted an LLM call to come up with the Wikipedia search terms most likely to answer the question, and I use the LLM’s response to make the actual query to Elasticsearch. Here’s that prompt:
<|system|>Give a comma-separated list of search keywords most likely to find an answer to the user's question. Do NOT answer the question.<|end|>\n
<|user|>When was Obama born?<|end|>\n
<|assistant|>Barack Obama,United States Presidents,Family of Barack Obama<|end|>\n
<|user|>How can I make charcoal?<|end|>\n
<|assistant|>Charcoal,Charcoal Kiln,Retort (Chemistry)<|end|>\n
<|user|>{user_question}<|end>\n
<|assistant|>
(I’ve added line breaks for clarity.) I’ll point out a few things about this prompt:
- The model I’m using for now (Phi-3-mini) has this style of special tokens, e.g., “<|system|>” to mark the system message, as well as the user and assistant messages in the history. Other models use a different syntax; there’s no standard yet (see the sketch after this list).
- Now that I’ve put it here, I see a couple of errors I made. First, that last user message’s end token is malformed, so it will be handled entirely differently; LLMs do not give you nice feedback on this kind of thing. Second, I don’t think I should have the “\n” characters in the string itself; I doubt the model was post-trained with them. I’m also not certain whether I should have the final assistant tag; this kind of thing needs to be tested and evaluated, a.k.a. guessed-and-checked.
- I give two examples, aka “two-shot.” I’m using a small language model (SLM) so it can run locally with room to spare, and SLMs in particular improve dramatically in performance when you give examples. Without this, I would get introduction and conclusion paragraphs included in the response.
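Incidentally, one way to avoid hand-writing (and mistyping) these special tokens is to let the tokenizer’s chat template assemble the prompt. A minimal sketch, assuming the Phi-3-mini tokenizer from Hugging Face and its bundled template; this is not how the repo currently builds its prompts:

from transformers import AutoTokenizer

# The template knows the model's own special-token syntax, so switching models
# later would mostly mean switching the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
messages = [
    {"role": "user", "content": "When was Obama born?"},
    {"role": "assistant", "content": "Barack Obama,United States Presidents,Family of Barack Obama"},
    {"role": "user", "content": "How can I make charcoal?"},
]
# tokenize=False returns the formatted string; add_generation_prompt appends
# the trailing assistant tag so the model knows it is its turn to speak.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)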
The flow from here is to pass the search terms into one query to ES, concatenate the text of the first three articles into a “context,” and call the LLM again, this time to answer the question. I ask the language model to “Use only the following context information for the user’s question” to reduce hallucinations, with good results so far. You can see the whole prompt in the Github repo.
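Put together, the whole flow is two LLM calls sandwiching one Elasticsearch query. A minimal sketch, where the prompt strings, index, and field names are assumptions and llm stands in for whatever wrapper is hosting Phi-3:

KEYWORD_PROMPT = (
    "<|system|>Give a comma-separated list of search keywords most likely to find "
    "an answer to the user's question. Do NOT answer the question.<|end|>"
    "<|user|>{user_question}<|end|><|assistant|>"
)
ANSWER_PROMPT = (
    "<|system|>Use only the following context information for the user's question.\n"
    "{context}<|end|><|user|>{user_question}<|end|><|assistant|>"
)

def answer_question(user_question, es, llm):
    # First LLM call: turn the question into search terms.
    terms = llm.invoke(KEYWORD_PROMPT.format(user_question=user_question))
    # One Elasticsearch query with those terms.
    results = es.search(
        index="wikipedia",
        query={"multi_match": {"query": terms, "fields": ["title^2", "text"]}},
        size=3,
    )
    # Concatenate the top three articles into the context.
    context = "\n\n".join(hit["_source"]["text"] for hit in results["hits"]["hits"])
    # Second LLM call: answer using only that context.
    return llm.invoke(ANSWER_PROMPT.format(context=context, user_question=user_question))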
I’m using Phi-3-mini because it doesn’t use as much memory and supports a long context window. It’s pretty likely that I’ll change it out later as we get better models, which will mean changing out the special tokens.
What’s next?
Everything is functioning pretty well, at least on my machine. The next thing I want to look into is adding Monte Carlo Tree Search (MCTS), the approach currently generating buzz for improving reasoning capabilities. I’m also interested in trying a similar approach, but with a graph search, thereby inventing Monte Carlo Graph Search (MCGS). Ultimately, I’d like to combine multiple search strategies:
- Elasticsearch
- Monte Carlo Graph Search
- Semantic search
and then add a re-rank stage to take the best results. Later, I intend to explore multiple ways of handling conversation memory, and the best OSS options for voice audio. Follow along, and let me know on the socials if you’re finding this interesting!