# RAG Model Application

A complete Retrieval-Augmented Generation (RAG) system implementation with a vector database, embeddings, and intelligent document retrieval.
## Overview

This project implements a full RAG pipeline that:

- Ingests and processes documents
- Creates embeddings using transformer models
- Stores embeddings in a vector database (FAISS)
- Retrieves relevant documents for queries
- Generates answers using the retrieved context
## Features

- **Document Processing**: Supports TXT, MD, and JSON files
- **Vector Database**: Uses FAISS for efficient similarity search
- **Embeddings**: Sentence Transformers for high-quality embeddings
- **Retrieval**: Semantic search with configurable top-k retrieval
- **Generation**: Integration with Ollama for local LLM inference
- **Interactive CLI**: Command-line interface for querying
## Installation

### Prerequisites

- Python 3.8+
- Ollama (for LLM generation): install from https://ollama.ai

### Setup

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- (Optional) Install and start Ollama for LLM generation:

  ```bash
  # Install Ollama from https://ollama.ai
  # Then pull a model:
  ollama pull llama3.1:8b
  ```

## Usage
### Basic Usage

Run the interactive query interface:

```bash
python main.py
```

The system will:

- Create sample documents if none exist
- Build or load the vector store
- Start an interactive query session
### Programmatic Usage

```python
from rag_system import RAGSystem

# Initialize RAG system
rag = RAGSystem(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    vector_store_path="vector_store",
    llm_model="llama3.1:8b"
)

# Load or create vector store
documents = rag.load_documents(["doc1.txt", "doc2.txt"])
rag.create_vector_store(documents, save=True)

# Query the system
result = rag.query("What is RAG?", k=5)
print(result['answer'])
```

### Adding Your Own Documents

- Place documents in the `documents/` directory (or any directory)
- Update the code to load your documents:

  ```python
  file_paths = [
      "documents/my_doc1.txt",
      "documents/my_doc2.md",
      "documents/data.json"
  ]
  documents = rag.load_documents(file_paths)
  rag.create_vector_store(documents, save=True)
  ```

## Configuration
Set environment variables to customize behavior:

```bash
export RAG_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
export RAG_LLM_MODEL="llama3.1:8b"
export RAG_CHUNK_SIZE=1000
export RAG_CHUNK_OVERLAP=200
export RAG_DEFAULT_K=5
export RAG_VECTOR_STORE_PATH="vector_store"
```
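As an illustration of how these variables might map onto the `RAGSystem` constructor shown above, here is a minimal sketch that reads them with `os.getenv`; the project's actual configuration handling may differ.

```python
import os

from rag_system import RAGSystem

# Hypothetical config loader: maps the RAG_* environment variables onto
# RAGSystem constructor arguments. Names mirror the exports above.
rag = RAGSystem(
    embedding_model=os.getenv("RAG_EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
    vector_store_path=os.getenv("RAG_VECTOR_STORE_PATH", "vector_store"),
    llm_model=os.getenv("RAG_LLM_MODEL", "llama3.1:8b"),
)

chunk_size = int(os.getenv("RAG_CHUNK_SIZE", "1000"))
chunk_overlap = int(os.getenv("RAG_CHUNK_OVERLAP", "200"))
default_k = int(os.getenv("RAG_DEFAULT_K", "5"))
```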
## Production Deployment

### Deployment Strategy

For production, we recommend wrapping the RAG system in a REST API (using FastAPI) and deploying it as a containerized service.
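The Dockerfile below assumes an `api:app` entry point. As a sketch of what such an `api.py` could look like (a hypothetical file, built on the `RAGSystem` interface from the Programmatic Usage section), a minimal FastAPI wrapper might be:

```python
# api.py -- hypothetical FastAPI wrapper around RAGSystem (sketch only).
from fastapi import FastAPI
from pydantic import BaseModel

from rag_system import RAGSystem

app = FastAPI(title="RAG Service")
rag = RAGSystem(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    vector_store_path="vector_store",
    llm_model="llama3.1:8b",
)

class QueryRequest(BaseModel):
    question: str
    k: int = 5

@app.post("/query")
def query(req: QueryRequest):
    # Delegate to the RAG pipeline; query() returns a dict with an 'answer' key
    # (see Programmatic Usage above).
    result = rag.query(req.question, k=req.k)
    return {"answer": result["answer"]}
```

Run it locally with `uvicorn api:app --reload` before building the container image.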
### Docker Deployment

- Dockerfile:

  ```dockerfile
  FROM python:3.9-slim
  WORKDIR /app
  COPY requirements.txt .
  RUN pip install -r requirements.txt
  COPY . .
  CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
  ```

- Run the container:

  ```bash
  docker run -d -p 8000:8000 -v ./vector_store:/app/vector_store rag-service:latest
  ```

### Vector Database Scaling
- **FAISS on GPU**: For large-scale datasets (>1M vectors), use `faiss-gpu` for significantly faster indexing and search.
- **IVF Indexing**: Use Inverted File (IVF) indexing to speed up search by clustering vectors (see the sketch below).
- **External Vector DB**: For distributed scaling, consider migrating from local FAISS files to managed services such as Qdrant, Pinecone, or Weaviate.
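A minimal sketch of an IVF index with FAISS; the `nlist`/`nprobe` values and random data are illustrative, and the dimension assumes the default 384-dim MiniLM embeddings.

```python
import faiss
import numpy as np

d = 384          # embedding dimension (all-MiniLM-L6-v2)
nlist = 1024     # number of IVF clusters (tune to dataset size)

# IVF index: a coarse quantizer clusters the vectors, and search only visits
# the nprobe closest clusters instead of scanning the whole collection.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)

embeddings = np.random.random((100_000, d)).astype("float32")  # placeholder data
index.train(embeddings)   # IVF indexes must be trained before adding vectors
index.add(embeddings)

index.nprobe = 16         # search 16 clusters; higher = better recall, slower
query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)
```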
### Caching Strategies

- **Embedding Cache**: Cache embeddings for frequently ingested documents to avoid re-computation.
- **Query Cache**: Cache results for identical queries in Redis to reduce LLM latency (see the sketch below).
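A minimal sketch of a Redis-backed query cache. It assumes a local Redis instance, the `redis` Python client, and that `rag.query()` returns a JSON-serializable dict; the key prefix and TTL are illustrative.

```python
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 3600  # illustrative TTL

def cached_query(rag, question: str, k: int = 5) -> dict:
    # Key on the question text (plus k) so identical queries hit the cache.
    key = "rag:query:" + hashlib.sha256(f"{question}|{k}".encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = rag.query(question, k=k)  # falls through to the full RAG pipeline
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result
```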
### Performance Optimization

- **Quantization**: Use quantized embedding models (int8) to reduce memory usage and increase speed with minimal accuracy loss.
- **Batch Processing**: Process document ingestion in batches to exploit vectorized computation (see the sketch below).
- **Asynchronous Ingestion**: Offload document processing to a background worker (Celery/RQ) to keep the API responsive.
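For batch processing, Sentence Transformers can encode many chunks per forward pass. A minimal sketch (the chunk list and batch size are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]  # placeholder chunks

# Encoding in batches amortizes model overhead across many chunks;
# batch_size is a tuning knob (larger batches need more memory).
embeddings = model.encode(
    chunks,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
)
print(embeddings.shape)  # (num_chunks, 384) for all-MiniLM-L6-v2
```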
## Architecture

```text
┌─────────────┐
│  Documents  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Text Split  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Embeddings  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│Vector Store │
│   (FAISS)   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Retrieval  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Generation  │
│    (LLM)    │
└─────────────┘
```
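The retrieval stages above can be sketched end to end with the libraries this project uses (Sentence Transformers + FAISS). This is an illustrative outline, not the project's actual `rag_system.py`; the generation step is shown separately in the LLM Integration section.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Documents -> Text Split (naive paragraph split for illustration)
raw_docs = ["RAG combines retrieval with generation.\n\nIt grounds answers in documents."]
chunks = [p for doc in raw_docs for p in doc.split("\n\n") if p.strip()]

# 2. Embeddings
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")

# 3. Vector Store (FAISS, exact L2 index for simplicity)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# 4. Retrieval: embed the query and fetch the top-k nearest chunks
query = "What does RAG do?"
query_vec = model.encode([query], convert_to_numpy=True).astype("float32")
distances, ids = index.search(query_vec, 2)
context = "\n".join(chunks[i] for i in ids[0])

# 5. Generation: pass `context` plus the query to the LLM (see LLM Integration)
print(context)
```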
## Components

### rag_system.py

Core RAG implementation with:

- Document loading and processing
- Embedding generation
- Vector store management
- Retrieval and generation

### main.py

Command-line interface and demo application.
## Vector Database

The system uses FAISS (Facebook AI Similarity Search) for efficient vector storage and retrieval. FAISS supports:

- Fast similarity search
- Scaling to millions of vectors
- CPU and GPU execution
- A variety of indexing methods
Embedding Models
Default: sentence-transformers/all-MiniLM-L6-v2
You can use any sentence transformer model:
all-MiniLM-L6-v2(default, fast, 384 dims)all-mpnet-base-v2(better quality, 768 dims)all-MiniLM-L12-v2(larger, 384 dims)
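Swapping models is a one-line change, but note that the embedding dimension changes with the model, so the vector store must be rebuilt. A minimal check:

```python
from sentence_transformers import SentenceTransformer

# A different model produces vectors of a different dimensionality, so a FAISS
# index built with another model cannot be reused.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.get_sentence_embedding_dimension())  # 768

vec = model.encode("Retrieval-Augmented Generation grounds answers in documents.")
print(vec.shape)  # (768,)
```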
## LLM Integration

The system integrates with Ollama for local LLM inference. Supported models include:

- `llama3.1:8b` (default)
- `mistral:7b`
- `codellama:13b`
- Any Ollama-compatible model
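As an illustration of the generation step, retrieved context can be passed to Ollama's local REST API (`/api/generate` on port 11434); the project's own client code may differ from this sketch, and the context string here is a placeholder.

```python
import requests

context = "RAG combines retrieval with generation."   # e.g. text returned by retrieval
question = "What is RAG?"

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

# Ollama serves a local HTTP API on port 11434 (see Troubleshooting below).
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```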
## Performance Tips

- **Chunk Size**: Adjust to your documents (roughly 500-2000 tokens); see the splitter sketch below
- **Overlap**: Use 10-20% overlap for better context continuity
- **Top-K**: Start with k=5 and adjust based on result quality
- **Embedding Model**: Larger models give better quality but are slower
- **Vector Store**: Use GPU FAISS for large datasets
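Chunk size and overlap can be experimented with directly using LangChain's text splitter (LangChain is already listed as a dependency in Troubleshooting; the import path may vary across LangChain versions, and this splitter counts characters rather than tokens).

```python
# Import path for older LangChain releases; newer releases expose the same
# class from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk (mirrors RAG_CHUNK_SIZE)
    chunk_overlap=200,  # ~20% overlap (mirrors RAG_CHUNK_OVERLAP)
)

text = "RAG systems ground LLM answers in retrieved documents. " * 100  # placeholder
chunks = splitter.split_text(text)
print(len(chunks), "chunks")
```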
## Troubleshooting

### Import Errors

```bash
pip install langchain faiss-cpu sentence-transformers
```

### Ollama Connection Issues

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
```

### Memory Issues

- Reduce the chunk size
- Use a smaller embedding model
- Process documents in batches
## License

See the main repository LICENSE file.