Mingqi Hou

RAG Ingestion: Why Loaders and Splitters Come Before Vector Search

Knowledge does not arrive as LangChain Documents. How loaders normalize sources and splitters set retrieval granularity—with a web-scraping demo.

When people first build RAG, attention goes to vector databases, embedding models, and similarity search. Fair—those decide whether you find the right material at query time.

But in a real knowledge-base project you quickly hit another fact:

Knowledge does not show up as Document objects.

It lives in PDFs, web pages, Notion, Word manuals, video transcripts, and email archives. Until you normalize those sources, embedding and retrieval have nothing stable to work on. Normalizing them is the loader's job.

Once content is loaded, documents are often too long. Users care about a paragraph, not a 10,000-token file, and a single vector for a whole document dilutes its semantics. Cutting documents into retrievable pieces is the splitter's job.

From an engineering view, RAG is not only “how to search at question time.” It is how knowledge is cleaned, loaded, chunked, and organized on the way in.

You cannot throw raw files into a vector DB

The textbook flow:

  1. Vectorize the knowledge base
  2. Retrieve on user questions
  3. Stuff results into the prompt
  4. Generate answers

That assumes you already have structured, splittable text. Reality disagrees.

Step zero is not embedding. It is converting each source into a Document with pageContent and metadata.
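A minimal sketch of that step zero. The `Doc` type below is a local stand-in that mirrors the pageContent + metadata shape, not LangChain's own class, and `toDoc` is a hypothetical helper:

```typescript
// Stand-in for the Document shape described above: text plus metadata.
// `Doc` and `toDoc` are illustrative names, not LangChain APIs.
interface Doc {
  pageContent: string;
  metadata: Record<string, string | number>;
}

// Normalize one raw source string into a Doc: collapse whitespace and
// record where the text came from so later steps can cite it.
function toDoc(rawText: string, source: string): Doc {
  const pageContent = rawText.replace(/\s+/g, " ").trim();
  return {
    pageContent,
    metadata: { source, charCount: pageContent.length },
  };
}

const sampleDoc = toDoc("  Refunds are processed\n within 5 days.  ", "faq.html");
console.log(sampleDoc);
```

Whatever the original format was, everything downstream only has to understand this one shape.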

What loaders do

Loaders are the ingestion layer. They do not answer questions or run vector search; they read external data into LangChain Document instances.

That uniform shape powers chunking, filtering, citations, and access control later.
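To make the payoff concrete, here is a small sketch of filtering and citation over a uniform shape. The `SourceDoc` type, the `team` field, and both helpers are assumptions for illustration, not part of any library:

```typescript
// Hypothetical uniform shape: every loaded item carries the same metadata keys.
interface SourceDoc {
  pageContent: string;
  metadata: { source: string; team: string };
}

const loadedDocs: SourceDoc[] = [
  {
    pageContent: "Rotate API keys every 90 days.",
    metadata: { source: "security.pdf", team: "infra" },
  },
  {
    pageContent: "Expense reports are due monthly.",
    metadata: { source: "handbook.docx", team: "finance" },
  },
];

// Filtering (a crude access check): keep only docs the caller's team may see.
function visibleTo(team: string, docs: SourceDoc[]): SourceDoc[] {
  return docs.filter((doc) => doc.metadata.team === team);
}

// Citation: every doc can say where it came from, whatever its original format.
function cite(doc: SourceDoc): string {
  return `${doc.pageContent} [${doc.metadata.source}]`;
}

const citations = visibleTo("infra", loadedDocs).map(cite);
console.log(citations);
```

Neither helper needs to know whether the text started life as a PDF or a Word file, which is the point of normalizing at load time.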

Figure: Loader as input normalization

Why loading alone is not enough

You can embed a whole page. You usually should not.

A long article encoded as one vector averages many topics. Retrieval returns huge noisy context, costs rise, and the model loses focus. What belongs in the index is usually chunks, not whole files.
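The dilution effect can be shown with toy numbers. The sketch below assumes hand-made 3-dimensional vectors (one axis per topic) and models the whole-document vector as the mean of its chunk vectors; real embedding models do not literally average, so treat this as intuition only:

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings": one axis per topic (family, school, career).
const familyChunk = [1, 0, 0];
const schoolChunk = [0, 1, 0];
const careerChunk = [0, 0, 1];

// Assumption: the whole-document vector is the mean of its chunk vectors.
const wholeDoc = [0, 1, 2].map(
  (i) => ((familyChunk[i] ?? 0) + (schoolChunk[i] ?? 0) + (careerChunk[i] ?? 0)) / 3,
);

const familyQuery = [1, 0, 0];
const chunkScore = cosine(familyQuery, familyChunk); // exact topic match
const docScore = cosine(familyQuery, wholeDoc); // same topic, diluted by two others
console.log({ chunkScore, docScore });
```

With three equally weighted topics the whole-document score drops to 1/√3 (about 0.58) while the matching chunk scores 1, so the chunk wins the ranking against other documents far more reliably.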

Figure: From document to chunks

Splitters: balance granularity and coherence

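Good splitting is more than “cut every N characters.”

The goal is to balance retrieval granularity against semantic completeness. That is what chunkSize and chunkOverlap control: chunkSize caps how much text one chunk holds, and chunkOverlap repeats the tail of one chunk at the head of the next so context survives the boundary.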
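To see what the two parameters mean mechanically, here is a deliberately naive character window. `slidingChunks` is a hypothetical helper that ignores sentence boundaries; it only illustrates how chunkSize and chunkOverlap interact:

```typescript
// Fixed-size window with overlap, measured in characters.
// Deliberately naive: it cuts wherever the window ends.
function slidingChunks(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("need 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const windows = slidingChunks("abcdefghij", 4, 2);
console.log(windows); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Each window shares its last two characters with the start of the next one. A production splitter keeps the same two knobs but chooses cut points at natural boundaries.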

Typical ingestion pipeline

  1. Read raw content (loader)
  2. Normalize to Document[]
  3. Split into chunks (splitter)
  4. Embed chunks
  5. Store vectors + metadata
  6. At query time: embed question → retrieve → prompt LLM

Loaders and splitters own the first half. They never talk to the user, but they cap retrieval quality.
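The six steps can be sketched end to end without any external service. Everything here is a toy under stated assumptions: `embedToy` is a word-count stand-in for a real embedding model, the splitter is one chunk per sentence, and all names are hypothetical:

```typescript
interface Chunk {
  text: string;
  source: string;
}
type SparseVector = Record<string, number>;
interface IndexedChunk extends Chunk {
  vector: SparseVector;
}

// Toy "embedding": lowercase word counts. A stand-in for a real model.
function embedToy(text: string): SparseVector {
  const counts: SparseVector = Object.create(null); // no inherited keys
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    counts[word] = (counts[word] ?? 0) + 1;
  }
  return counts;
}

function cosineSparse(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const w of Object.keys(a)) {
    const x = a[w] ?? 0;
    dot += x * (b[w] ?? 0);
    normA += x * x;
  }
  for (const w of Object.keys(b)) {
    const y = b[w] ?? 0;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Steps 1-3: read, normalize, split (here: one chunk per sentence).
function ingest(rawText: string, source: string): Chunk[] {
  return (rawText.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0)
    .map((text) => ({ text, source }));
}

// Steps 4-5: embed each chunk and keep the vector next to its metadata.
function buildIndex(chunks: Chunk[]): IndexedChunk[] {
  return chunks.map((chunk) => ({ ...chunk, vector: embedToy(chunk.text) }));
}

// Step 6 (retrieval half): embed the question and rank chunks by similarity.
function retrieve(index: IndexedChunk[], question: string, k: number): IndexedChunk[] {
  const q = embedToy(question);
  return index
    .slice()
    .sort((a, b) => cosineSparse(q, b.vector) - cosineSparse(q, a.vector))
    .slice(0, k);
}

const toyIndex = buildIndex(
  ingest("Loaders read raw files. Splitters cut text into chunks. Retrievers rank chunks.", "notes.txt"),
);
const topHit = retrieve(toyIndex, "What do splitters cut?", 1);
console.log(topHit[0]?.text, topHit[0]?.source);
```

Note that `retrieve` can only ever return what `ingest` produced, which is why the first half of the pipeline caps the quality of the second.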

Figure: Ingestion pipeline

Demo: web article → chunks → Q&A

Load with a selector

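pnpm add cheerio @langchain/community dotenv

import "dotenv/config"; // loads API keys from .env for the later steps
import "cheerio"; // HTML parser the loader relies on
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

// Keep only paragraphs inside the article body; the selector is site-specific.
const loader = new CheerioWebBaseLoader("https://example.com/article", {
  selector: ".main-area p",
});

// load() fetches the page and returns Document[] (pageContent + metadata).
const documents = await loader.load();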

The selector matters: sidebars, comments, and “related posts” pollute the index if you scrape the whole DOM. Loaders are noise filters, not dumb fetchers.

Split for retrieval

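pnpm add @langchain/textsplitters

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400, // max characters per chunk
  chunkOverlap: 50, // characters repeated between neighboring chunks
  // Tried in order: paragraphs, lines, sentence ends, then words and
  // single characters as fallbacks so long sentences can still be cut.
  separators: ["\n\n", "\n", ". ", "! ", "? ", " ", ""],
});

// Each chunk is a Document that keeps the metadata of its parent.
const splitDocuments = await textSplitter.splitDocuments(documents);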

RecursiveCharacterTextSplitter tries its separators in order, so it prefers natural breaks (sentence punctuation for English; use locale-appropriate separators for Chinese). chunkSize and chunkOverlap trade focus against continuity.
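The "prefer natural breaks" idea can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: `recursiveSplit` is a hypothetical function, it has no overlap, and it drops the separator at each cut point:

```typescript
// Try the coarsest separator first; fall back to finer ones only for
// pieces that are still longer than chunkSize.
function recursiveSplit(text: string, chunkSize: number, separators: string[]): string[] {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  const sep = separators[0];
  if (sep === undefined) {
    // No separators left: hard cut every chunkSize characters.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }
  const finer = separators.slice(1);
  const chunks: string[] = [];
  let current = "";
  for (const part of text.split(sep)) {
    const candidate = current === "" ? part : current + sep + part;
    if (candidate.length <= chunkSize) {
      current = candidate; // still fits: keep merging neighboring parts
      continue;
    }
    if (current !== "") chunks.push(current);
    if (part.length <= chunkSize) {
      current = part;
    } else {
      // A single part is too long: recurse with finer separators.
      chunks.push(...recursiveSplit(part, chunkSize, finer));
      current = "";
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sampleText = "Para one is short.\n\nPara two is a bit longer. It has two sentences.";
const naturalChunks = recursiveSplit(sampleText, 30, ["\n\n", ". ", " "]);
console.log(naturalChunks);
```

The first paragraph fits and stays whole; only the second, longer paragraph is cut again at the sentence boundary. No cut lands mid-word unless every separator has been exhausted.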

Figure: Chunk size and overlap

Full mini pipeline

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "@langchain/classic/vectorstores/memory";

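// ... same loader + splitter as above ...

// Library defaults are used for both models (assumes OPENAI_API_KEY is set in .env).
const embeddings = new OpenAIEmbeddings();
const model = new ChatOpenAI();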

const vectorStore = await MemoryVectorStore.fromDocuments(
  splitDocuments,
  embeddings,
);
const retriever = vectorStore.asRetriever({ k: 2 });

const question = "How did the father's death change the author's outlook?";

const retrievedDocs = await retriever.invoke(question);
const context = retrievedDocs
  .map((doc, i) => `[Passage ${i + 1}]\n${doc.pageContent}`)
  .join("\n\n");

const prompt = `
You are a reading assistant. Answer only from the passages.
If evidence is missing, say so. Do not invent facts.

Passages:
${context}

Question:
${question}
`;

const response = await model.invoke(prompt);

Think in roles, not API names

Why chunk-then-embed beats whole-document embed

One article may cover childhood, family loss, attitude shifts, and education. A question about the father’s death needs the family chunks, not a single diluted vector over the entire piece. Chunking lets retrieval return just those chunks, which improves precision.

Pitfalls

  1. Loader noise — bad selectors → junk in the KB
  2. Chunks too small — incomplete answers
  3. Chunks too large — relevant but noisy hits
  4. Missing metadata — no provenance or filters
  5. One-size-fits-all splitting — legal text ≠ chat logs ≠ API docs
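Several of these pitfalls can be caught with a cheap check before anything is embedded. The sketch below is one possible audit; `auditChunks`, the `source` metadata key, and the thresholds are all assumptions to tune per corpus:

```typescript
interface AuditedChunk {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface AuditReport {
  tooSmall: number; // likely incomplete answers (pitfall 2)
  tooLarge: number; // likely noisy hits (pitfall 3)
  missingSource: number; // no provenance (pitfall 4)
}

// Count chunks that fall outside the size bounds or lack a source field.
function auditChunks(chunks: AuditedChunk[], minChars: number, maxChars: number): AuditReport {
  const report: AuditReport = { tooSmall: 0, tooLarge: 0, missingSource: 0 };
  for (const chunk of chunks) {
    const size = chunk.pageContent.trim().length;
    if (size < minChars) report.tooSmall += 1;
    if (size > maxChars) report.tooLarge += 1;
    if (typeof chunk.metadata["source"] !== "string") report.missingSource += 1;
  }
  return report;
}

const auditReport = auditChunks(
  [
    { pageContent: "OK.", metadata: { source: "a.html" } },
    { pageContent: "x".repeat(500), metadata: {} },
    { pageContent: "A chunk of reasonable length for retrieval.", metadata: { source: "b.html" } },
  ],
  20,
  400,
);
console.log(auditReport);
```

A report like this will not tell you the right chunk size, but a sudden spike in any counter after changing a selector or separator list is a useful early warning.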

Beyond the demo

Summary

RAG is not only “vectors + LLM.” Before search, knowledge must be accepted, cleaned, and chunked. Loaders standardize input; splitters define what “one retrieval unit” means.

If embeddings and retrievers decide how well you search, loaders and splitters decide whether you can search well at all.