
Description
I think the next step in the project is a lightweight ANN (approximate k-nearest-neighbour search) vector database. Applications:
- Document store over local documents: code bases, journals, articles. Input: directory with text files.
- Chrome-assistant: A memory over your currently and recently opened tabs with llama-wasm
- Mobile: similar to the local document store
Details:
- The k-ANN database should always be an optional dependency, compiled under a feature flag.
- We will reuse the loaded model for encoding. See: here, section 4.3. It suggests using the analogue of the [CLS] token. See e.g. LlamaIndex. I'm not too sure about decoding; it can just return the text string from the metadata. Alternatively, one can load a separate embedding/decoding model serialized to ggml. (A sketch of the encoding path follows this list.)
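
As a rough sketch of the encoding path: tokenize the document with the model's own vocabulary, run a forward pass, and take the hidden state of the final token as the [CLS]-style document vector. The trait, method names, and the `ann` feature name below are assumptions for illustration, not existing llama-rs API:

```rust
// Hypothetical trait capturing the minimal surface we would need from the
// already-loaded model; llama-rs does not expose this today.
pub trait EmbeddingModel {
    /// Tokenize text with the model's own vocabulary.
    fn tokenize(&self, text: &str) -> Vec<u32>;
    /// Run a forward pass and return the hidden state of the last token
    /// (the analogue of [CLS] for a decoder-only model).
    fn last_hidden_state(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Embed a document by reusing the loaded model. The vector is
/// L2-normalized so cosine similarity reduces to a dot product in the index.
#[cfg(feature = "ann")] // assumed name for the optional feature flag
pub fn embed_document<M: EmbeddingModel>(model: &M, text: &str) -> Vec<f32> {
    let tokens = model.tokenize(text);
    let hidden = model.last_hidden_state(&tokens);
    let norm = hidden.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    hidden.iter().map(|x| x / norm).collect()
}
```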
Problem definition:
- We are OK trading off a bit of performance for something with a minimal surface area.
- We want the database to be persistent: it should persist the index after generation. The usage model is very similar to ggml's: generate the artifact once, or periodically, and then load the artifact (the index) into memory.
Options:
- ❌ Connect to an existing vector database (qdrant, milvus, pinecone). But these are heavy dependencies, and many of their features are designed for scaling out (cloud-native). We like transparency and owning the artifacts involved, and we are willing to trade off a bit of performance and/or implementation complexity for that aim.
- ❌ Compile faiss as an optional dependency. Still a pretty huge dependency.
- 🦀 Something Rust-native, e.g. hora. It is not actively maintained, but it still works and I've run it locally. We can slice out the core functionality (e.g. just HNSW). It already has a persisted format for the index, and we can add mmap support as an optimization. Hopefully we can slice it down to about 2K LOC. (A usage sketch follows this list.)
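
For reference, building and querying an HNSW index with hora looks roughly like this (adapted from hora's examples; exact parameter and serialization APIs may differ between versions):

```rust
use hora::core::ann_index::ANNIndex;
use hora::core::metrics::Metric;
use hora::index::hnsw_idx::HNSWIndex;
use hora::index::hnsw_params::HNSWParams;

/// Build an HNSW index over document embeddings and return the ids of the
/// k nearest neighbours of `query`. The `usize` payload maps back to the
/// document metadata (text, path, ...).
fn build_and_query(embeddings: &[Vec<f32>], query: &[f32], dimension: usize, k: usize) -> Vec<usize> {
    let mut index = HNSWIndex::<f32, usize>::new(dimension, &HNSWParams::<f32>::default());
    for (id, emb) in embeddings.iter().enumerate() {
        index.add(emb, id).unwrap();
    }
    index.build(Metric::Euclidean).unwrap();

    // hora can also dump/load a built index, which is what the
    // "generate the artifact once, then load it" workflow would lean on
    // (with mmap as a later optimization).

    index.search(query, k)
}
```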
Plan:
- Use prompt engineering to allow the model to request a lookup via a special Unicode sequence. llama-rs will detect the sequence in the generated output and trigger a database lookup (see the sketch after this list).
- It is not clear how well this would work. Ideally we would prompt-tune this; LoRA-based fine-tuning might also work.
- Implement either partial encoding with the existing LLM, or allow loading a separate embedding model.
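
A minimal sketch of the lookup trigger, assuming a pair of private-use codepoints as the sentinel and hypothetical `next_token`/`lookup` callbacks (none of this is existing llama-rs code; a real implementation would also need to handle sentinels split across token boundaries):

```rust
// Assumed sentinels: the model is prompted to wrap its query between them.
const LOOKUP_START: &str = "\u{E000}";
const LOOKUP_END: &str = "\u{E001}";

/// Drive generation, watching the decoded output for a lookup request.
/// `next_token` yields the next decoded piece of text (taking any extra
/// context to inject); `lookup` runs the k-ANN query and returns the
/// retrieved text to feed back into the model.
fn run_with_retrieval(
    mut next_token: impl FnMut(&str) -> Option<String>,
    lookup: impl Fn(&str) -> String,
) -> String {
    let mut output = String::new();
    let mut pending_query: Option<String> = None;
    let mut injected = String::new();

    while let Some(piece) = next_token(&injected) {
        injected.clear();
        if let Some(query) = pending_query.as_mut() {
            if piece.contains(LOOKUP_END) {
                // End of the query: run the database lookup and feed the
                // retrieved text back in on the next step.
                injected = lookup(query);
                pending_query = None;
            } else {
                query.push_str(&piece);
            }
        } else if piece.contains(LOOKUP_START) {
            // Start accumulating the model's query instead of emitting it.
            pending_query = Some(String::new());
        } else {
            output.push_str(&piece);
        }
    }
    output
}
```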