FastAPI AI Kit

FastAPI AI Kit ships a complete RAG (Retrieval-Augmented Generation) pipeline: document ingestion with chunking, embedding storage in pgvector or Qdrant, and context-aware query endpoints.

How it works

Document → Parse → Chunk → Embed → Store → Index
                                              ↓
Query → Embed query → Similarity search → Context injection → LLM → Answer
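
The query half of the diagram can be sketched in a few lines. This is a toy illustration, not the kit's implementation: `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names, the "embedding" is a plain word count rather than a call to an embedding model, and the sample chunks are made up.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real pipeline calls an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Similarity search: rank stored chunks against the embedded question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Context injection: place the retrieved chunks ahead of the question."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Made-up sample chunks, standing in for stored document chunks.
chunks = [
    "Cancellations are handled by the billing team.",
    "Rate limiting is configured per API key.",
    "The office is closed on public holidays.",
]
question = "How is rate limiting configured?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)  # this string would be sent to the LLM
```

In the real pipeline the embedding step is a model call and the search runs inside pgvector or Qdrant, but the flow of data is the same.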

Ingestion API

Ingest a document

# POST /v1/rag/ingest
# Content-Type: multipart/form-data

# Python example
import httpx

with open("docs/handbook.pdf", "rb") as f:
    response = httpx.post(
        "https://api.example.com/v1/rag/ingest",
        files={"file": f},
        data={"collection": "company-kb"},
        headers={"X-API-Key": "kit_live_..."},
    )

# Response
{
    "job_id": "uuid",
    "status": "queued",
    "doc_id": "uuid"
}

Large documents are processed asynchronously by Celery. Poll the job status endpoint with the returned job_id until processing finishes.
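
A polling loop for the returned job might look like the sketch below. The status endpoint's path and its status values are not specified here, so the helper takes a caller-supplied `fetch_status` function (hypothetical) and assumes that any status other than `"queued"` or `"processing"` is final; only `"queued"` appears in the response above.

```python
import time
from typing import Callable

# Assumption: these are the non-final states. Only "queued" is documented above.
PENDING_STATES = {"queued", "processing"}


def wait_for_job(
    fetch_status: Callable[[str], str],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves the pending states or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status not in PENDING_STATES:
            return status  # final state, e.g. success or failure
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)
```

`fetch_status` would wrap an `httpx.get` call against the job status endpoint and return the status string from its JSON body. Passing `sleep` as a parameter keeps the helper easy to test without real delays.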

Ingest from URL

# POST /v1/rag/ingest-url
{
    "url": "https://docs.example.com/api.md",
    "collection": "api-docs"
}

Supported file types

PDF (.pdf) — text extraction via pdfplumber
Markdown (.md, .mdx)
Plain text (.txt)
HTML (.html) — basic tag stripping
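
A client can check the extension before uploading to avoid a wasted request. This is a client-side convenience sketch: `is_supported` is a hypothetical helper, and matching extensions case-insensitively is an assumption about how the server behaves.

```python
from pathlib import Path

# Extensions taken from the supported file types listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".md", ".mdx", ".txt", ".html"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is one the ingestion API lists as supported."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("docs/handbook.pdf")` is true, while a `.pptx` file would be rejected before any upload is attempted.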

Query API

Basic query

# POST /v1/rag/query
{
    "question": "What is the cancellation policy?",
    "collection": "company-kb"
}

# Response
{
    "answer": "According to the documentation, cancellations must be...",
    "sources": [
        {"file": "handbook.pdf", "page": 12, "relevance": 0.92}
    ],
    "tokens": {"input": 820, "output": 143, "total": 963}
}
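
The `sources` array is enough to render citations next to the answer. The sketch below assumes only the fields shown in the response above; `format_citations` is a hypothetical helper, and treating `page` as optional (for formats without pages) is an assumption.

```python
def format_citations(response: dict) -> list[str]:
    """Turn the query response's sources into display strings, most relevant first."""
    sources = sorted(
        response.get("sources", []),
        key=lambda s: s["relevance"],
        reverse=True,
    )
    lines = []
    for s in sources:
        # Assumption: non-paginated formats may omit "page".
        page = f", p. {s['page']}" if s.get("page") is not None else ""
        lines.append(f"{s['file']}{page} (relevance {s['relevance']:.2f})")
    return lines


response = {
    "answer": "According to the documentation, cancellations must be...",
    "sources": [{"file": "handbook.pdf", "page": 12, "relevance": 0.92}],
    "tokens": {"input": 820, "output": 143, "total": 963},
}
citations = format_citations(response)
```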

With options

{
    "question": "How do I configure rate limiting?",
    "collection": "api-docs",
    "top_k": 5,           # Number of chunks to retrieve (default: 5)
    "min_score": 0.75,    # Minimum cosine similarity (default: none)
    "model": "gpt-4o"     # Override LLM model
}
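
To see how `top_k` and `min_score` interact, consider this small sketch over already-scored chunks. `select_chunks` is a hypothetical function, not the kit's internal code; it only illustrates the semantics described in the comments above.

```python
from typing import Optional


def select_chunks(
    scored: list[tuple[str, float]],
    top_k: int = 5,
    min_score: Optional[float] = None,
) -> list[tuple[str, float]]:
    """Keep at most top_k chunks, dropping any below min_score when it is set."""
    if min_score is not None:
        scored = [(chunk, score) for chunk, score in scored if score >= min_score]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# (chunk id, cosine similarity) pairs, made up for illustration.
scored = [("a", 0.92), ("b", 0.70), ("c", 0.81), ("d", 0.78)]
strict = select_chunks(scored, top_k=2, min_score=0.75)
loose = select_chunks(scored, top_k=5)
```

With a threshold set, a query can return fewer than `top_k` chunks, or none at all if nothing in the collection is similar enough. Applying the threshold before or after the top-k cut selects the same chunks, since both keep a prefix of the ranked list.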

Collections

Collections are namespaces for document groups. Use them to separate:

Customer vs. internal docs
Different products or versions
Multi-tenant isolation

# List collections
GET /v1/rag/collections

# Delete a collection
DELETE /v1/rag/collections/{name}

# Get collection stats
GET /v1/rag/collections/{name}/stats
# → {"doc_count": 142, "chunk_count": 8430, "size_mb": 12.4}

Configuration

Vector store selection

# .env
VECTOR_STORE=pgvector     # or "qdrant"
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSIONS=1536
CHUNK_SIZE=512            # tokens per chunk
CHUNK_OVERLAP=64          # token overlap between chunks
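
The effect of `CHUNK_SIZE` and `CHUNK_OVERLAP` can be illustrated with a sliding window over a token list. This is a minimal sketch of fixed-size chunking with overlap, assuming the text has already been tokenized; `chunk_tokens` is a hypothetical name, and the kit's own chunker may also respect sentence or section boundaries.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [str(i) for i in range(10)]
small = chunk_tokens(tokens, size=4, overlap=1)
```

With the defaults, each window advances by 512 - 64 = 448 tokens, so a 1,000-token document yields three chunks starting at tokens 0, 448, and 896.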

pgvector (default)

pgvector uses your existing Postgres database. No additional service required.

DATABASE_URL=postgresql+asyncpg://user:pass@localhost/dbname

The initial Alembic migration creates the vector extension and HNSW index automatically.

Qdrant

VECTOR_STORE=qdrant
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=optional-api-key    # For Qdrant Cloud

No code changes required — the same RAG API works with both backends.
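
Because the backend is chosen purely through environment variables, a startup check can catch a misconfigured deployment early. The sketch below uses only the variable names shown above; `load_vector_config` is a hypothetical helper, and the assumption that `DATABASE_URL` and `QDRANT_URL` are the only required settings is mine.

```python
from typing import Mapping


def load_vector_config(env: Mapping[str, str]) -> dict:
    """Validate vector store settings and return the ones relevant to the chosen backend."""
    store = env.get("VECTOR_STORE", "pgvector").lower()  # pgvector is the documented default
    if store == "pgvector":
        required = ["DATABASE_URL"]
    elif store == "qdrant":
        required = ["QDRANT_URL"]
    else:
        raise ValueError(f"unknown VECTOR_STORE: {store!r}")

    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"{store} requires: {', '.join(missing)}")

    config = {"store": store, **{name: env[name] for name in required}}
    # The API key is optional and only applies to Qdrant (e.g. Qdrant Cloud).
    if store == "qdrant" and env.get("QDRANT_API_KEY"):
        config["QDRANT_API_KEY"] = env["QDRANT_API_KEY"]
    return config
```

In an application this would be called with `os.environ` at startup, for example `load_vector_config(os.environ)`.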

Re-ingestion

The pipeline is idempotent. Re-ingesting the same document updates changed chunks and ignores unchanged ones, based on content hashing.
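
One way content hashing can drive idempotent re-ingestion is to compare the hashes of the stored chunks with those of the freshly produced chunks. The sketch below is an assumed approach using SHA-256, not the kit's actual code; `chunk_hash` and `plan_reingest` are hypothetical names.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reingest(stored_hashes: set[str], new_chunks: list[str]) -> dict:
    """Decide which chunks need embedding, which can be kept, and which are stale."""
    new_by_hash = {chunk_hash(chunk): chunk for chunk in new_chunks}
    return {
        # New or changed content: must be embedded and stored.
        "embed": [c for h, c in new_by_hash.items() if h not in stored_hashes],
        # Unchanged content: the existing embedding is reused.
        "keep": sorted(h for h in new_by_hash if h in stored_hashes),
        # Content no longer present in the document: can be removed.
        "delete": sorted(stored_hashes - set(new_by_hash)),
    }


stored = {chunk_hash("alpha"), chunk_hash("beta")}
plan = plan_reingest(stored, ["alpha", "gamma"])
```

Only the chunks under `"embed"` cost an embedding call, which is why re-ingesting an unchanged document is cheap.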

# Force re-ingestion
POST /v1/rag/ingest-url
{
    "url": "...",
    "collection": "...",
    "force": true    # Clears existing chunks for this doc
}

Using RAG in your own endpoints

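from fastapi import APIRouter, Depends
from pydantic import BaseModel

from app.rag import rag_service
# APIKey and get_api_key come from the kit's auth layer; import them from
# wherever your project defines them.

router = APIRouter()


class SupportRequest(BaseModel):
    message: str


@router.post("/v1/support/chat")
async def support_chat(body: SupportRequest, key: APIKey = Depends(get_api_key)):
    # rag_service.query returns the generated answer plus the source chunks it used
    result = await rag_service.query(
        question=body.message,
        collection="support-docs",
        top_k=5,
    )
    return {"reply": result.answer, "sources": result.sources}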
