FastAPI AI Kit ships a complete RAG (Retrieval-Augmented Generation) pipeline: document ingestion with chunking, embedding storage in pgvector or Qdrant, and context-aware query endpoints.
How it works
Document → Parse → Chunk → Embed → Store → Index
↓
Query → Embed query → Similarity search → Context injection → LLM → Answer
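To make the two flows above concrete, here is a minimal in-memory sketch. It is not the kit's implementation: the word-count `embed` function stands in for a real embedding model, a Python list stands in for pgvector or Qdrant, and the sample chunks and all function names are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Embed each chunk and keep (text, vector) pairs as an in-memory store."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    """Embed the question and return the top_k most similar chunk texts."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Context injection: place the retrieved chunks ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Made-up sample chunks, purely for illustration.
store = ingest([
    "The cancellation policy requires 30 days notice.",
    "Rate limiting is configured per API key.",
    "The office is closed on public holidays.",
])
question = "What is the cancellation policy?"
prompt = build_prompt(question, retrieve(store, question, top_k=1))
```

The resulting `prompt` is what would be sent to the LLM in the final step of the query flow.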
Ingestion API
Ingest a document
# POST /v1/rag/ingest
# Content-Type: multipart/form-data
# Python example
import httpx
with open("docs/handbook.pdf", "rb") as f:
    response = httpx.post(
        "https://api.example.com/v1/rag/ingest",
        files={"file": f},
        data={"collection": "company-kb"},
        headers={"X-API-Key": "kit_live_..."},
    )
# Response
{
"job_id": "uuid",
"status": "queued",
"doc_id": "uuid"
}
Large documents are processed asynchronously via Celery. Poll the job status endpoint with the returned job_id until processing finishes.
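A polling loop for this could look like the sketch below. The status endpoint's path and the states a job moves through are not documented here, so the loop takes a `fetch_status` callable that you supply. The `"processing"` state and the `/v1/rag/jobs/{job_id}` path in the comment are assumptions, not confirmed API.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job leaves its pending states or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        # Only "queued" appears in the docs; "processing" is an assumed state.
        if job.get("status") not in ("queued", "processing"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still pending after {timeout}s")
        sleep(interval)


# Example wiring with httpx (the jobs path below is hypothetical):
# def fetch_status(job_id: str) -> dict:
#     r = httpx.get(
#         f"https://api.example.com/v1/rag/jobs/{job_id}",
#         headers={"X-API-Key": "kit_live_..."},
#     )
#     r.raise_for_status()
#     return r.json()
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.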
Ingest from URL
# POST /v1/rag/ingest-url
{
"url": "https://docs.example.com/api.md",
"collection": "api-docs"
}
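The same request from Python might look like the following sketch. The helper name `ingest_url` is hypothetical, and since the response shape for this endpoint is not shown, the function simply returns whatever JSON comes back.

```python
def ingest_url(client, url: str, collection: str) -> dict:
    """POST a URL to the ingest-url endpoint and return the parsed JSON body.

    `client` is any object with an httpx.Client-style `post(path, json=...)` method.
    """
    response = client.post(
        "/v1/rag/ingest-url",
        json={"url": url, "collection": collection},
    )
    response.raise_for_status()
    return response.json()


# Usage with httpx, reusing the base URL and key from the file-upload example:
# import httpx
# with httpx.Client(
#     base_url="https://api.example.com",
#     headers={"X-API-Key": "kit_live_..."},
# ) as client:
#     job = ingest_url(client, "https://docs.example.com/api.md", "api-docs")
```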
Supported file types
- PDF (.pdf): text extraction via pdfplumber
- Markdown (.md, .mdx)
- Plain text (.txt)
- HTML (.html): basic tag stripping
Query API
Basic query
# POST /v1/rag/query
{
"question": "What is the cancellation policy?",
"collection": "company-kb"
}
# Response
{
"answer": "According to the documentation, cancellations must be...",
"sources": [
{"file": "handbook.pdf", "page": 12, "relevance": 0.92}
],
"tokens": {"input": 820, "output": 143, "total": 963}
}
With options
{
"question": "How do I configure rate limiting?",
"collection": "api-docs",
"top_k": 5, # Number of chunks to retrieve (default: 5)
"min_score": 0.75, # Minimum cosine similarity (default: none)
"model": "gpt-4o" # Override LLM model
}
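The sketch below shows how `top_k` and `min_score` plausibly interact. It is an illustration, not the kit's code: `select_chunks` is a hypothetical helper, the sample scores are made up, and treating `min_score` as an inclusive bound is an assumption.

```python
from typing import Optional


def select_chunks(
    scored: list[tuple[str, float]],
    top_k: int = 5,
    min_score: Optional[float] = None,
) -> list[tuple[str, float]]:
    """Drop chunks below min_score, then keep the top_k highest-scoring ones."""
    if min_score is not None:
        scored = [(text, score) for text, score in scored if score >= min_score]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


candidates = [("chunk-a", 0.92), ("chunk-b", 0.71), ("chunk-c", 0.80), ("chunk-d", 0.76)]
selected = select_chunks(candidates, top_k=2, min_score=0.75)
# selected == [("chunk-a", 0.92), ("chunk-c", 0.80)]
```

Note that a strict `min_score` can return fewer than `top_k` chunks, or none at all, so callers should handle an empty context.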
Collections
Collections are namespaces for document groups. Use them to separate:
- Customer vs. internal docs
- Different products or versions
- Multi-tenant isolation
# List collections
GET /v1/rag/collections
# Delete a collection
DELETE /v1/rag/collections/{name}
# Get collection stats
GET /v1/rag/collections/{name}/stats
# → {"doc_count": 142, "chunk_count": 8430, "size_mb": 12.4}
Configuration
Vector store selection
# .env
VECTOR_STORE=pgvector # or "qdrant"
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSIONS=1536
CHUNK_SIZE=512 # tokens per chunk
CHUNK_OVERLAP=64 # token overlap between chunks
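To see what `CHUNK_SIZE` and `CHUNK_OVERLAP` mean in practice, here is a sliding-window sketch. It assumes the input is already tokenized. The kit's actual tokenizer and boundary handling are not documented here, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens
    with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
# 1000 tokens with the settings above -> windows starting at 0, 448, and 896
```

With these values each new chunk starts 448 tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.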
pgvector (default)
pgvector uses your existing Postgres database. No additional service required.
DATABASE_URL=postgresql+asyncpg://user:pass@localhost/dbname
The initial Alembic migration creates the vector extension and HNSW index automatically.
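For orientation, a migration of this kind typically runs SQL along the lines of the sketch below. The table and column names (`rag_chunks`, `embedding`) and the helper itself are assumptions; only the extension and the HNSW index come from the description above. The cosine operator class is chosen to match the cosine similarity used by `min_score`.

```python
def vector_migration_sql(table: str = "rag_chunks", column: str = "embedding") -> list[str]:
    """Return the SQL statements such a migration could run, in order.

    Assumes the table already has a pgvector `vector` column named `column`.
    """
    return [
        # Enables the pgvector extension in the current database.
        "CREATE EXTENSION IF NOT EXISTS vector",
        # HNSW index using cosine distance for approximate nearest-neighbor search.
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw "
        f"ON {table} USING hnsw ({column} vector_cosine_ops)",
    ]


# Inside an Alembic migration, each statement would be passed to op.execute():
# def upgrade():
#     for statement in vector_migration_sql():
#         op.execute(statement)
```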
Qdrant
VECTOR_STORE=qdrant
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=optional-api-key # For Qdrant Cloud
No code changes required — the same RAG API works with both backends.
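Since the backend is selected purely through environment variables, a startup check can catch a misconfigured store early. The sketch below is one way to do that. `RagSettings` and `load_settings` are hypothetical names, and the chunking fallbacks mirror the example .env rather than any confirmed defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RagSettings:
    vector_store: str
    chunk_size: int
    chunk_overlap: int
    qdrant_url: Optional[str] = None


def load_settings(env: Optional[Mapping[str, str]] = None) -> RagSettings:
    """Read the RAG variables shown above and fail fast on inconsistent values."""
    env = os.environ if env is None else env
    store = env.get("VECTOR_STORE", "pgvector")  # pgvector is the documented default
    if store not in ("pgvector", "qdrant"):
        raise ValueError(f"VECTOR_STORE must be 'pgvector' or 'qdrant', got {store!r}")
    qdrant_url = env.get("QDRANT_URL")
    if store == "qdrant" and not qdrant_url:
        raise ValueError("QDRANT_URL is required when VECTOR_STORE=qdrant")
    # Fallbacks mirror the example .env; the kit's real defaults may differ.
    chunk_size = int(env.get("CHUNK_SIZE", "512"))
    chunk_overlap = int(env.get("CHUNK_OVERLAP", "64"))
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return RagSettings(store, chunk_size, chunk_overlap, qdrant_url)
```

Calling `load_settings()` with no arguments reads from the process environment; passing a plain dict makes the check easy to unit test.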
Re-ingestion
The pipeline is idempotent. Re-ingesting the same document updates changed chunks and ignores unchanged ones, based on content hashing.
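Content hashing of this kind can be sketched as follows. This is an illustration of the general technique, assuming SHA-256 over each chunk's text. The kit's actual hash function and storage layout are not documented here, and both function names are hypothetical.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reingest(stored_hashes: set[str], new_chunks: list[str]) -> dict:
    """Decide which chunks to embed, which to keep, and which stored hashes to delete."""
    new_by_hash = {chunk_hash(c): c for c in new_chunks}
    return {
        # New or changed content: needs embedding and storing.
        "embed": [c for h, c in new_by_hash.items() if h not in stored_hashes],
        # Unchanged content: the existing embedding is reused.
        "keep": [c for h, c in new_by_hash.items() if h in stored_hashes],
        # Hashes no longer present in the document: stale chunks to remove.
        "delete": sorted(stored_hashes - set(new_by_hash)),
    }


stored = {chunk_hash(c) for c in ["alpha", "beta", "gamma"]}
plan = plan_reingest(stored, ["alpha", "beta changed", "gamma"])
# plan["embed"] == ["beta changed"]; "alpha" and "gamma" are kept as-is
```

Only the changed chunk is re-embedded, which is what keeps repeated ingestion of a mostly unchanged document cheap.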
# Force re-ingestion
POST /v1/rag/ingest
{
"url": "...",
"collection": "...",
"force": true # Clears existing chunks for this doc
}
Using RAG in your own endpoints
from fastapi import APIRouter, Depends

from app.rag import rag_service

# SupportRequest, APIKey, and get_api_key are assumed to be defined elsewhere in your app.
router = APIRouter()


@router.post("/v1/support/chat")
async def support_chat(body: SupportRequest, key: APIKey = Depends(get_api_key)):
    # RAG query returns answer + source chunks
    result = await rag_service.query(
        question=body.message,
        collection="support-docs",
        top_k=5,
    )
    return {"reply": result.answer, "sources": result.sources}
