
Chat with PDF: A conversational interface for any document

Upload a PDF, ask anything. RAG with FAISS, OpenAI embeddings, and a 3-tier fallback that works even when the dependencies don't.

Date: April 2025
Status: Completed
Version: 1.0

Overview

It started at 2 AM. I was searching a 200-page technical PDF for one specific authentication error code. Ctrl+F returned 47 matches for "authentication". None of them the one I needed. After the third pass, I gave up and asked the obvious question: what if I could just ask the document?

That question turned into Chat with PDF, a Flask app that lets you upload a PDF, ask anything in natural language, and get answers grounded in the document itself. Built on Retrieval-Augmented Generation (RAG) with OpenAI embeddings and FAISS for vector search.

The naive first attempt (concatenate the whole PDF, send it to GPT) failed immediately. Token limits, context loss, and an OpenAI bill that would have ended the project. That failure forced the actual architecture.

Architecture

The system is a 5-stage pipeline. Each stage is small and replaceable, which became important when I built the fallback tiers.

Key pipeline parameters:

  • Chunk size: 1000 characters
  • Overlap: 420 characters
  • Embedding dimension: 1536
  • Top-k retrieved: 5

The two parameters that matter most are chunk size and overlap. Splitting at fixed character boundaries severs sentences mid-thought and destroys semantic coherence. The fix was RecursiveCharacterTextSplitter with a priority list of separators (paragraph breaks, then newlines, then sentence-ending ". ", then spaces) and a 420-character overlap so retrieval can stitch context across chunk boundaries.
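The idea is easy to sketch in isolation. This standalone chunk_text is illustrative, not the project's code (which uses LangChain's RecursiveCharacterTextSplitter), but it shows the same separator-priority and overlap behavior with the same 1000/420 defaults:

```python
def chunk_text(text, chunk_size=1000, overlap=420,
               separators=("\n\n", "\n", ". ", " ")):
    """Greedy splitter: cut windows of chunk_size characters, pull each
    cut back to the highest-priority separator near the end so sentences
    are not severed, then restart `overlap` characters before the cut."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            for sep in separators:
                cut = window.rfind(sep)
                if cut > chunk_size // 2:  # avoid producing tiny chunks
                    end = start + cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)  # overlap, but always advance
    return chunks
```

The overlap is what makes retrieval forgiving: a fact that straddles a cut appears whole in at least one of the two neighboring chunks.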

Embeddings via OpenAI's text-embedding-ada-002 (1536-dimensional vectors). Storage and retrieval via FAISS using Maximum Marginal Relevance (fetch_k=20, lambda_mult=0.7). This trades pure relevance against diversity, which empirically gave better answers than top-k cosine alone.
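MMR itself is just a small greedy loop. A pure-Python sketch (mmr and dot are illustrative names; in the real pipeline FAISS/LangChain run this over the index for you):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mmr(query_vec, doc_vecs, k=5, fetch_k=20, lambda_mult=0.7):
    """Maximum Marginal Relevance over (assumed L2-normalized) vectors:
    greedily pick k of the fetch_k most query-relevant documents,
    penalizing similarity to documents already chosen."""
    sims = [dot(query_vec, d) for d in doc_vecs]
    candidates = sorted(range(len(doc_vecs)), key=lambda i: -sims[i])[:fetch_k]
    selected = [candidates.pop(0)]  # the single best match seeds the set
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(dot(doc_vecs[i], doc_vecs[j]) for j in selected)
            return lambda_mult * sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lambda_mult=1.0 this degenerates to plain top-k; at 0.7 a near-duplicate chunk loses to a slightly less relevant but novel one, which is exactly the behavior that improved answers.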

The 3-Tier Fallback

The same query interface routes through three different backends depending on what's available:

  • Tier 1, Full RAG. LangChain + OpenAI embeddings + FAISS. Best quality.
  • Tier 2, Keyword RAG. TF-IDF-inspired scoring on regex-extracted keywords (4+ char words, common terms filtered). No embeddings, no API cost. Surprisingly good for technical docs where the right keywords carry most of the signal.
  • Tier 3, Simple search. Raw substring match. The "it works on a fresh laptop with nothing installed" tier.
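The routing can be sketched as follows. rag_backend and full_rag_answer are hypothetical names standing in for the Tier-1 LangChain path, and the Tier-2/3 bodies are deliberately minimal versions of the real scoring:

```python
import re
from collections import Counter

STOPWORDS = {"this", "that", "with", "from", "have", "what", "when", "where"}

def keyword_answer(question, paragraphs):
    """Tier 2: score paragraphs by overlap with 4+-character query keywords."""
    keywords = {w for w in re.findall(r"[a-zA-Z]{4,}", question.lower())
                if w not in STOPWORDS}
    def score(p):
        words = Counter(re.findall(r"[a-zA-Z]{4,}", p.lower()))
        return sum(words[k] for k in keywords)
    best = max(paragraphs, key=score)
    return best if score(best) > 0 else None

def simple_answer(question, paragraphs):
    """Tier 3: raw substring match on the question's longest word."""
    needle = max(question.split(), key=len).lower()
    return next((p for p in paragraphs if needle in p.lower()), None)

def answer(question, paragraphs):
    """Dispatcher: try each tier in order; the caller never sees which ran."""
    try:
        from rag_backend import full_rag_answer  # hypothetical Tier-1 module
        return full_rag_answer(question, paragraphs)
    except ImportError:
        pass  # no LangChain/OpenAI available: degrade silently
    return keyword_answer(question, paragraphs) or simple_answer(question, paragraphs)
```

The dispatcher is the whole trick: a single answer() entry point whose quality degrades instead of its availability.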

The user never sees which tier ran; they just get an answer. Quality drops measurably as you descend the tiers, but the fact that something always works changed who could actually use the tool.

Hallucination prevention

RAG without grounding checks happily invents plausible-sounding answers from nothing. The defense:

  • Custom prompt template that explicitly instructs the model to use only the provided context and to admit when the answer isn't there.
  • Post-hoc validation: extract key concepts from both the generated answer and the source chunks. If unsupported concepts exceed 30% of the response, flag it.
  • Response cap at 4 sentences (~1200 chars). Long responses correlate with hallucination; they're usually the model padding an answer it doesn't actually have.
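The post-hoc check can be sketched like this, treating 4+-character words as a deliberately crude stand-in for "key concepts" (grounding_score and is_grounded are illustrative names; the 30% threshold is the one described above):

```python
import re

def grounding_score(answer, source_chunks):
    """Fraction of answer 'concepts' (crudely: 4+-character words) that
    never appear in the retrieved source chunks."""
    source_words = set(re.findall(r"[a-zA-Z]{4,}",
                                  " ".join(source_chunks).lower()))
    answer_words = re.findall(r"[a-zA-Z]{4,}", answer.lower())
    if not answer_words:
        return 0.0
    unsupported = [w for w in answer_words if w not in source_words]
    return len(unsupported) / len(answer_words)

def is_grounded(answer, source_chunks, threshold=0.30):
    """Flag the response when unsupported concepts exceed the threshold."""
    return grounding_score(answer, source_chunks) <= threshold
```

Word-level matching misses paraphrases, but as a cheap tripwire it catches the worst failure mode: an answer built entirely from terms the document never mentions.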

Numbers

Measured on a held-out set of 100 questions across 12 different PDFs (technical docs, papers, manuals):

  • Response accuracy: 89%
  • Average response time: 2.3 s
  • Context relevance: 94%
  • Cost reduction vs. naive: 80%

The 80% cost reduction is the one I'm proudest of. It came from caching embeddings (@lru_cache(maxsize=100)) and only retrieving 3-6 chunks instead of stuffing the entire context window.
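The caching piece is nearly free to add. A sketch, with a stand-in embedding function in place of the real OpenAI call:

```python
from functools import lru_cache

@lru_cache(maxsize=100)
def embed(text):
    """Cached embedding lookup. Embeddings are deterministic for a given
    model + input, so repeat queries never pay the API twice.
    The body below is a stand-in for the real OpenAI call."""
    # stand-in for: client.embeddings.create(model="text-embedding-ada-002", ...)
    return tuple(sum(ord(c) for c in text[i:i + 3]) / 1000
                 for i in range(0, len(text), 3))
```

One caveat: lru_cache keys on the argument, so this works for individual strings; to cache a document's chunks, call it once per chunk rather than on the whole list.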

What I learned

A few takeaways that generalize beyond this project:

  • Chunking strategy is foundational. It feels like an implementation detail until you ship and realize bad chunks bottleneck everything downstream.
  • Fallbacks aren't compromises. They're features. Building Tier 2 and 3 forced me to understand which parts of the RAG pipeline were actually load-bearing.
  • Hide the complexity. Users don't care about MMR or lambda_mult. They care that the answer is right.
  • Cache aggressively. Embeddings are deterministic, so recomputing them is wasted spend.
  • Always validate against source. If RAG can't ground its answer, it shouldn't answer.

The full write-up with code samples and the dependency-hell story is on Medium. Source is on GitHub.
