# RAG Model Application

A complete Retrieval-Augmented Generation (RAG) system implementation with a vector database, embeddings, and intelligent document retrieval.
## Overview

This project implements a full RAG pipeline that:

- Ingests and processes documents
- Creates embeddings using transformer models
- Stores embeddings in a vector database (FAISS)
- Retrieves relevant documents for queries
- Generates answers using the retrieved context
## Features

- **Document Processing**: Supports TXT, MD, and JSON files
- **Vector Database**: Uses FAISS for efficient similarity search
- **Embeddings**: Sentence Transformers for high-quality embeddings
- **Retrieval**: Semantic search with configurable top-k retrieval
- **Generation**: Integration with Ollama for local LLM inference
- **Interactive CLI**: Command-line interface for querying
## Installation

### Prerequisites

- Python 3.8+
- Ollama (for LLM generation): install from https://ollama.ai

### Setup

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- (Optional) Install and start Ollama for LLM generation:

  ```bash
  # Install Ollama from https://ollama.ai
  # Then pull a model:
  ollama pull llama3.1:8b
  ```

## Usage
### Basic Usage

Run the interactive query interface:

```bash
python main.py
```

The system will:

- Create sample documents if none exist
- Build or load the vector store
- Start an interactive query session
### Programmatic Usage

```python
from rag_system import RAGSystem

# Initialize RAG system
rag = RAGSystem(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    vector_store_path="vector_store",
    llm_model="llama3.1:8b"
)

# Load or create vector store
documents = rag.load_documents(["doc1.txt", "doc2.txt"])
rag.create_vector_store(documents, save=True)

# Query the system
result = rag.query("What is RAG?", k=5)
print(result['answer'])
```

### Adding Your Own Documents

- Place documents in the `documents/` directory (or any directory)
- Update the code to load your documents:

  ```python
  file_paths = [
      "documents/my_doc1.txt",
      "documents/my_doc2.md",
      "documents/data.json"
  ]
  documents = rag.load_documents(file_paths)
  rag.create_vector_store(documents, save=True)
  ```

## Configuration
Set environment variables to customize behavior:

```bash
export RAG_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
export RAG_LLM_MODEL="llama3.1:8b"
export RAG_CHUNK_SIZE=1000
export RAG_CHUNK_OVERLAP=200
export RAG_DEFAULT_K=5
export RAG_VECTOR_STORE_PATH="vector_store"
```
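As an illustration of how these variables might map onto the `RAGSystem` constructor shown above, here is a minimal sketch that reads them with `os.getenv`; the project's actual configuration handling may differ.

```python
import os

from rag_system import RAGSystem

# Hypothetical config loader: maps the RAG_* environment variables onto
# RAGSystem constructor arguments. Names mirror the exports above.
rag = RAGSystem(
    embedding_model=os.getenv("RAG_EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
    vector_store_path=os.getenv("RAG_VECTOR_STORE_PATH", "vector_store"),
    llm_model=os.getenv("RAG_LLM_MODEL", "llama3.1:8b"),
)

chunk_size = int(os.getenv("RAG_CHUNK_SIZE", "1000"))
chunk_overlap = int(os.getenv("RAG_CHUNK_OVERLAP", "200"))
default_k = int(os.getenv("RAG_DEFAULT_K", "5"))
```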
## Production Deployment

### Deployment Strategy

For production, we recommend wrapping the RAG system in a REST API (using FastAPI) and deploying it as a containerized service.
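The Dockerfile below assumes an `api:app` entry point. As a sketch of what such an `api.py` could look like (a hypothetical file, built on the `RAGSystem` interface from the Programmatic Usage section), a minimal FastAPI wrapper might be:

```python
# api.py -- hypothetical FastAPI wrapper around RAGSystem (sketch only).
from fastapi import FastAPI
from pydantic import BaseModel

from rag_system import RAGSystem

app = FastAPI(title="RAG Service")
rag = RAGSystem(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    vector_store_path="vector_store",
    llm_model="llama3.1:8b",
)

class QueryRequest(BaseModel):
    question: str
    k: int = 5

@app.post("/query")
def query(req: QueryRequest):
    # Delegate to the RAG pipeline; query() returns a dict with an 'answer' key
    # (see Programmatic Usage above).
    result = rag.query(req.question, k=req.k)
    return {"answer": result["answer"]}
```

Run it locally with `uvicorn api:app --reload` before building the container image.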
### Docker Deployment

- Dockerfile:

  ```dockerfile
  FROM python:3.9-slim
  WORKDIR /app
  COPY requirements.txt .
  RUN pip install -r requirements.txt
  COPY . .
  CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
  ```

- Run the container:

  ```bash
  docker run -d -p 8000:8000 -v ./vector_store:/app/vector_store rag-service:latest
  ```

### Vector Database Scaling
- **FAISS on GPU**: For large-scale datasets (>1M vectors), use `faiss-gpu` for significantly faster indexing and search.
- **IVF Indexing**: Use Inverted File (IVF) indexing to speed up search by clustering vectors (see the sketch below).
- **External Vector DB**: For distributed scaling, consider migrating from local FAISS files to managed services such as Qdrant, Pinecone, or Weaviate.
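A minimal sketch of an IVF index with FAISS; the `nlist`/`nprobe` values and random data are illustrative, and the dimension assumes the default 384-dim MiniLM embeddings.

```python
import faiss
import numpy as np

d = 384          # embedding dimension (all-MiniLM-L6-v2)
nlist = 1024     # number of IVF clusters (tune to dataset size)

# IVF index: a coarse quantizer clusters the vectors, and search only visits
# the nprobe closest clusters instead of scanning the whole collection.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)

embeddings = np.random.random((100_000, d)).astype("float32")  # placeholder data
index.train(embeddings)   # IVF indexes must be trained before adding vectors
index.add(embeddings)

index.nprobe = 16         # search 16 clusters; higher = better recall, slower
query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)
```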
### Caching Strategies

- **Embedding Cache**: Cache embeddings for frequently ingested documents to avoid re-computation.
- **Query Cache**: Cache results for identical queries in Redis to reduce LLM latency (see the sketch below).
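A minimal sketch of a Redis-backed query cache. It assumes a local Redis instance, the `redis` Python client, and that `rag.query()` returns a JSON-serializable dict; the key prefix and TTL are illustrative.

```python
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 3600  # illustrative TTL

def cached_query(rag, question: str, k: int = 5) -> dict:
    # Key on the question text (plus k) so identical queries hit the cache.
    key = "rag:query:" + hashlib.sha256(f"{question}|{k}".encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = rag.query(question, k=k)  # falls through to the full RAG pipeline
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result
```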
### Performance Optimization

- **Quantization**: Use quantized embedding models (int8) to reduce memory usage and increase speed with minimal accuracy loss.
- **Batch Processing**: Process document ingestion in batches to exploit vectorized computation (see the sketch below).
- **Asynchronous Ingestion**: Offload document processing to a background worker (Celery/RQ) to keep the API responsive.
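For batch processing, Sentence Transformers can encode many chunks per forward pass. A minimal sketch (the chunk list and batch size are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]  # placeholder chunks

# Encoding in batches amortizes model overhead across many chunks;
# batch_size is a tuning knob (larger batches need more memory).
embeddings = model.encode(
    chunks,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
)
print(embeddings.shape)  # (num_chunks, 384) for all-MiniLM-L6-v2
```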
## Architecture

```text
┌─────────────┐
│  Documents  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Text Split  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Embeddings  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│Vector Store │
│   (FAISS)   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Retrieval  │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ Generation  │
│    (LLM)    │
└─────────────┘
```
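The retrieval stages above can be sketched end to end with the libraries this project uses (Sentence Transformers + FAISS). This is an illustrative outline, not the project's actual `rag_system.py`; the generation step is shown separately in the LLM Integration section.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Documents -> Text Split (naive paragraph split for illustration)
raw_docs = ["RAG combines retrieval with generation.\n\nIt grounds answers in documents."]
chunks = [p for doc in raw_docs for p in doc.split("\n\n") if p.strip()]

# 2. Embeddings
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(chunks, convert_to_numpy=True).astype("float32")

# 3. Vector Store (FAISS, exact L2 index for simplicity)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# 4. Retrieval: embed the query and fetch the top-k nearest chunks
query = "What does RAG do?"
query_vec = model.encode([query], convert_to_numpy=True).astype("float32")
distances, ids = index.search(query_vec, 2)
context = "\n".join(chunks[i] for i in ids[0])

# 5. Generation: pass `context` plus the query to the LLM (see LLM Integration)
print(context)
```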
## Components

### rag_system.py

Core RAG implementation with:

- Document loading and processing
- Embedding generation
- Vector store management
- Retrieval and generation

### main.py

Command-line interface and demo application.
## Vector Database

The system uses FAISS (Facebook AI Similarity Search) for efficient vector storage and retrieval. FAISS supports:

- Fast similarity search
- Scaling to millions of vectors
- CPU and GPU execution
- A variety of indexing methods
Embedding Models
Default: sentence-transformers/all-MiniLM-L6-v2
You can use any sentence transformer model:
all-MiniLM-L6-v2(default, fast, 384 dims)all-mpnet-base-v2(better quality, 768 dims)all-MiniLM-L12-v2(larger, 384 dims)
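Swapping models is a one-line change, but note that the embedding dimension changes with the model, so the vector store must be rebuilt. A minimal check:

```python
from sentence_transformers import SentenceTransformer

# A different model produces vectors of a different dimensionality, so a FAISS
# index built with another model cannot be reused.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.get_sentence_embedding_dimension())  # 768

vec = model.encode("Retrieval-Augmented Generation grounds answers in documents.")
print(vec.shape)  # (768,)
```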
## LLM Integration

The system integrates with Ollama for local LLM inference. Supported models include:

- `llama3.1:8b` (default)
- `mistral:7b`
- `codellama:13b`
- Any Ollama-compatible model
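As an illustration of the generation step, retrieved context can be passed to Ollama's local REST API (`/api/generate` on port 11434); the project's own client code may differ from this sketch, and the context string here is a placeholder.

```python
import requests

context = "RAG combines retrieval with generation."   # e.g. text returned by retrieval
question = "What is RAG?"

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

# Ollama serves a local HTTP API on port 11434 (see Troubleshooting below).
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```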
## Performance Tips

- **Chunk Size**: Adjust to your documents (roughly 500-2000 tokens); see the splitter sketch below
- **Overlap**: Use 10-20% overlap for better context continuity
- **Top-K**: Start with k=5 and adjust based on result quality
- **Embedding Model**: Larger models give better quality but are slower
- **Vector Store**: Use GPU FAISS for large datasets
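Chunk size and overlap can be experimented with directly using LangChain's text splitter (LangChain is already listed as a dependency in Troubleshooting; the import path may vary across LangChain versions, and this splitter counts characters rather than tokens).

```python
# Import path for older LangChain releases; newer releases expose the same
# class from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk (mirrors RAG_CHUNK_SIZE)
    chunk_overlap=200,  # ~20% overlap (mirrors RAG_CHUNK_OVERLAP)
)

text = "RAG systems ground LLM answers in retrieved documents. " * 100  # placeholder
chunks = splitter.split_text(text)
print(len(chunks), "chunks")
```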
## Troubleshooting

### Import Errors

```bash
pip install langchain faiss-cpu sentence-transformers
```

### Ollama Connection Issues

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
```

### Memory Issues

- Reduce the chunk size
- Use a smaller embedding model
- Process documents in batches
## License

See the main repository LICENSE file.