RAG Ingestion: Why Loaders and Splitters Come Before Vector Search

Knowledge does not arrive as LangChain Documents. How loaders normalize sources and splitters set retrieval granularity—with a web-scraping demo.

Published May 30, 2026

When people first build RAG, attention goes to vector databases, embedding models, and similarity search. Fair—those decide whether you find the right material at query time.

But in a real knowledge-base project you quickly hit another fact:

Knowledge does not show up as Document objects.

It lives in PDFs, web pages, Notion, Word manuals, video transcripts, and email archives. Until you normalize those sources, embedding and retrieval have nothing stable to work on. Normalizing them is the loader's job.

Once content is loaded, documents are often too long to embed whole. Users usually need one paragraph, not a 10,000-token file, and a single vector for a whole document blurs its topics together. Cutting documents into retrievable pieces is the splitter's job.

From an engineering view, RAG is not only “how to search at question time.” It is how knowledge is cleaned, loaded, chunked, and organized on the way in.

You cannot throw raw files into a vector DB

The textbook flow:

Vectorize the knowledge base
Retrieve on user questions
Stuff results into the prompt
Generate answers

That assumes you already have structured, splittable text. Reality disagrees:

Web pages mix nav, ads, footers, and body
PDFs have headers, broken lines, tables
Word has headings, styles, comments
Video needs transcription
DB rows may need field stitching for one semantic unit

Step zero is not embedding. It is converting each source into Document objects with pageContent + metadata.

What loaders do

Loaders are the ingestion layer. They do not answer questions or run vector search; they read external data into LangChain Document instances:

pageContent — text that gets embedded and retrieved
metadata — URL, title, author, time, chapter, path

That uniform shape is what later makes chunking, metadata filtering, citations, and access control (ACL) possible.
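To make the target shape concrete, here is a minimal sketch of a hand-rolled loader for database rows, the "field stitching" case mentioned earlier. The `SimpleDocument` interface only mirrors the pageContent + metadata shape and is declared locally; the `ProductRow` type, its fields, and the `rowToDocument` helper are hypothetical and not part of LangChain.

```typescript
// Local stand-in for the Document shape (pageContent + metadata).
interface SimpleDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical database row; the field names are illustrative only.
interface ProductRow {
  id: number;
  name: string;
  description: string;
  updatedAt: string;
}

// Stitch several fields into one semantic unit and keep provenance in metadata.
function rowToDocument(row: ProductRow, table: string): SimpleDocument {
  return {
    pageContent: `${row.name}\n\n${row.description}`,
    metadata: {
      source: `${table}/${row.id}`,
      title: row.name,
      updatedAt: row.updatedAt,
    },
  };
}

// Made-up example row.
const exampleRow: ProductRow = {
  id: 7,
  name: "Standing desk",
  description: "Height-adjustable desk with a bamboo top.",
  updatedAt: "2026-01-15",
};
const rowDocument = rowToDocument(exampleRow, "products");
console.log(rowDocument.metadata.source);
```

Whatever the source, the output has the same two fields, so everything downstream can stay source-agnostic.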

Why loading alone is not enough

You can embed a whole page. You usually should not.

A long article encoded as one vector averages many topics. Retrieval returns huge noisy context, costs rise, and the model loses focus. What belongs in the index is usually chunks, not whole files.
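The dilution effect can be seen with toy numbers. The sketch below uses hand-made 3-dimensional "topic" vectors rather than real embeddings, and it models the whole-document vector as the plain average of its chunk vectors. Both are simplifying assumptions for illustration; real embedding models do not literally average chunks.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosine(a: number[], b: number[]): number {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

// Toy axes (assumption): [childhood, family loss, education]. Not real embeddings.
const chunkVectors: number[][] = [
  [1, 0, 0],
  [0, 1, 0],
  [0, 0, 1],
];

// Assumption: approximate the whole-document vector as the mean of its chunks.
const wholeDocVector = [0, 1, 2].map(
  (dim) => chunkVectors.reduce((sum, v) => sum + (v[dim] ?? 0), 0) / chunkVectors.length,
);

// A question that is only about family loss.
const queryVector = [0, 1, 0];

const wholeDocScore = cosine(queryVector, wholeDocVector);
const bestChunkScore = Math.max(...chunkVectors.map((v) => cosine(queryVector, v)));
console.log({ wholeDocScore, bestChunkScore });
```

In this toy setup the matching chunk scores 1 while the averaged document scores about 0.58, and the gap widens as more unrelated topics are folded into the same vector.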

Figure: From document to chunks

Splitters: balance granularity and coherence

Good splitting is more than “cut every N characters.” Chunk size can fail in both directions:

Too large — mixed themes, blunt vectors
Too small — fragments without causality

Goal: balance retrieval granularity and semantic completeness. Hence two knobs: chunkSize caps how long a chunk can be, and chunkOverlap repeats the tail of one chunk at the head of the next so context survives the cut.
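How chunkSize and chunkOverlap interact is easiest to see in a stripped-down sliding window. This is a sketch of the naive character-window approach, not the LangChain splitter; the `slidingWindowChunks` helper is hypothetical and ignores sentence boundaries on purpose.

```typescript
// Naive character window: each chunk starts (chunkSize - chunkOverlap)
// characters after the previous one.
function slidingWindowChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const windowChunks = slidingWindowChunks("abcdefghij", 4, 2);
console.log(windowChunks); // ["abcd", "cdef", "efgh", "ghij"]
```

The last two characters of each chunk reappear at the start of the next one, which is the continuity that overlap buys. The same window also cuts mid-word without hesitation, which is why sentence-aware splitters are usually preferred.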

Typical ingestion pipeline

Read raw content (loader)
Normalize to Document[]
Split into chunks (splitter)
Embed chunks
Store vectors + metadata
At query time: embed question → retrieve → prompt LLM

Loaders and splitters own the first half. They never talk to the user, but they cap retrieval quality.
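The stages above can be sketched as plain function composition. Everything here is a toy stand-in chosen so the flow runs without any library: the `Doc` and `IndexedChunk` types, the paragraph splitter, and especially the keyword-count "embedding" are assumptions for illustration, not how a real embedding model or vector store works.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}
interface IndexedChunk {
  doc: Doc;
  vector: number[];
}

type Loader = () => Doc[];
type Splitter = (docs: Doc[]) => Doc[];
type Embedder = (text: string) => number[];

// Ingestion half: load -> split -> embed -> store (here, an in-memory array).
function ingest(load: Loader, split: Splitter, embed: Embedder): IndexedChunk[] {
  return split(load()).map((doc) => ({ doc, vector: embed(doc.pageContent) }));
}

// Query half: embed the question, rank chunks by dot product, keep the top k.
function retrieve(index: IndexedChunk[], question: string, embed: Embedder, k: number): Doc[] {
  const q = embed(question);
  const score = (v: number[]) => v.reduce((sum, x, i) => sum + x * (q[i] ?? 0), 0);
  return index
    .slice()
    .sort((a, b) => score(b.vector) - score(a.vector))
    .slice(0, k)
    .map((chunk) => chunk.doc);
}

// Toy embedder (assumption): count occurrences of a tiny fixed vocabulary.
const vocabulary = ["loader", "splitter", "vector"];
const toyEmbed: Embedder = (text) =>
  vocabulary.map((word) => text.toLowerCase().split(word).length - 1);

// Toy loader (assumption): one hard-coded document instead of a real source.
const toyLoad: Loader = () => [
  {
    pageContent: "The loader reads files.\n\nThe splitter cuts chunks.",
    metadata: { source: "notes.txt" },
  },
];

// Toy splitter (assumption): one chunk per blank-line-separated paragraph.
const toySplit: Splitter = (docs) => {
  const chunks: Doc[] = [];
  for (const d of docs) {
    d.pageContent.split("\n\n").forEach((part, i) => {
      chunks.push({ pageContent: part, metadata: { ...d.metadata, chunk: i } });
    });
  }
  return chunks;
};

const toyIndex = ingest(toyLoad, toySplit, toyEmbed);
const topHit = retrieve(toyIndex, "What does the splitter do?", toyEmbed, 1)[0];
console.log(topHit ? topHit.pageContent : "no hit");
```

Only `ingest` runs ahead of time, and `retrieve` can only rank what `ingest` stored, which is why the first half of the pipeline caps the quality of the second.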

Figure: Ingestion pipeline

Demo: web article → chunks → Q&A

Load with a selector

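pnpm add cheerio dotenv @langchain/community

import "dotenv/config"; // loads .env into process.env (the OpenAI key is needed later)
import "cheerio"; // peer dependency of CheerioWebBaseLoader
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

// The URL and selector are placeholders; point them at your article and its body container.
const loader = new CheerioWebBaseLoader("https://example.com/article", {
  selector: ".main-area p", // keep only paragraphs inside the main content area
});

// Fetch the page and wrap the selected text in Document objects.
// Top-level await requires an ES module.
const documents = await loader.load();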

The selector matters: sidebars, comments, and “related posts” pollute the index if you scrape the whole DOM. Loaders are noise filters, not dumb fetchers.

Split for retrieval

pnpm add @langchain/textsplitters

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

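const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400, // maximum chunk length, in characters by default
  chunkOverlap: 50, // characters shared between neighboring chunks
  // Sentence punctuation first; " " and "" are fallbacks so an over-long
  // sentence can still be cut down to chunkSize.
  separators: [".", "!", "?", " ", ""],
});

// Each chunk is a new Document that carries over its parent's metadata.
const splitDocuments = await textSplitter.splitDocuments(documents);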

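RecursiveCharacterTextSplitter tries the separators in order and only falls back to a finer one when a piece is still longer than chunkSize, so chunks tend to end at natural breaks (sentence punctuation for English; use locale-appropriate separators such as 。！？ for Chinese). chunkSize and chunkOverlap are measured in characters by default, and together they trade focus against continuity.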
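The "coarse separators first, finer ones only when needed" idea can be sketched in a few lines. This is a simplified illustration, not the library's implementation: the `recursiveSplit` helper is hypothetical, it has no overlap, and unlike the LangChain splitter's default behavior it drops the separator wherever it cuts.

```typescript
// Simplified recursive split: try the coarsest separator, merge small parts
// up to chunkSize, and recurse with finer separators on parts that are too long.
function recursiveSplit(text: string, separators: string[], chunkSize: number): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];

  const [separator, ...finer] = separators;
  if (separator === undefined) {
    // No separators left: hard-cut by characters.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }

  const chunks: string[] = [];
  let current = "";
  for (const part of text.split(separator)) {
    const candidate = current ? current + separator + part : part;
    if (candidate.length <= chunkSize) {
      current = candidate; // still fits, keep merging
      continue;
    }
    if (current) chunks.push(current);
    current = "";
    if (part.length <= chunkSize) {
      current = part; // starts the next chunk
    } else {
      chunks.push(...recursiveSplit(part, finer, chunkSize));
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const sampleText =
  "Loaders read files. Splitters cut text.\n\nEmbeddings map text to vectors.";
const recursiveChunks = recursiveSplit(sampleText, ["\n\n", ". ", " "], 30);
console.log(recursiveChunks);
```

With a 30-character limit, the first paragraph splits cleanly at its sentence boundary, while the second paragraph has no sentence break to use and falls back to word boundaries. That fallback order is what keeps most chunks ending somewhere a reader would also pause.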

Figure: Chunk size and overlap

Full mini pipeline

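pnpm add @langchain/openai @langchain/classic

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "@langchain/classic/vectorstores/memory";

// ... same loader + splitter as above ...

// Both clients read OPENAI_API_KEY from the environment by default.
const embeddings = new OpenAIEmbeddings();
// The model name is only an example; substitute any chat model you have access to.
const model = new ChatOpenAI({ model: "gpt-4o-mini" });

// Embed every chunk and keep the vectors in memory (fine for a demo, lost on exit).
const vectorStore = await MemoryVectorStore.fromDocuments(
  splitDocuments,
  embeddings,
);
// Return the 2 most similar chunks for each query.
const retriever = vectorStore.asRetriever({ k: 2 });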

const question = "How did the father's death change the author's outlook?";

const retrievedDocs = await retriever.invoke(question);
const context = retrievedDocs
  .map((doc, i) => `[Passage ${i + 1}]\n${doc.pageContent}`)
  .join("\n\n");

const prompt = `
You are a reading assistant. Answer only from the passages.
If evidence is missing, say so. Do not invent facts.

Passages:
${context}

Question:
${question}
`;

const response = await model.invoke(prompt);
console.log(response.content); // the model's answer

Think in roles, not API names

load() — pull from the outside world
splitDocuments() — retrieval-friendly granularity
fromDocuments() — build the semantic index
asRetriever() — search interface

Why chunk-then-embed beats whole-document embed

One article may cover childhood, family loss, attitude shifts, and education. A question about the father’s death needs the family chunks, not a single diluted vector over the entire piece. Chunking improves retrieval precision: the passages that come back are about the question, not merely from the same article.

Pitfalls

Loader noise — bad selectors → junk in the KB
Chunks too small — incomplete answers
Chunks too large — relevant but noisy hits
Missing metadata — no provenance or filters
One-size-fits-all splitting — legal text ≠ chat logs ≠ API docs
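The missing-metadata pitfall is easier to appreciate with a concrete filter. A minimal sketch, assuming each chunk carries hypothetical `source` and `updatedAt` metadata fields; the field names, the `filterChunks` helper, and the sample data are illustrative and not a LangChain API.

```typescript
interface Chunk {
  pageContent: string;
  metadata: { source?: string; updatedAt?: string };
}

// Keep chunks from a given source prefix that were updated on or after a cutoff.
// ISO dates (YYYY-MM-DD) compare correctly as plain strings.
function filterChunks(chunks: Chunk[], sourcePrefix: string, updatedSince: string): Chunk[] {
  return chunks.filter((chunk) => {
    const { source, updatedAt } = chunk.metadata;
    // Without metadata there is nothing to check, so the chunk is excluded.
    if (source === undefined || updatedAt === undefined) return false;
    return source.indexOf(sourcePrefix) === 0 && updatedAt >= updatedSince;
  });
}

// Made-up sample chunks: one current, one stale, one with no metadata at all.
const storedChunks: Chunk[] = [
  { pageContent: "Refunds take 5 days.", metadata: { source: "handbook/billing", updatedAt: "2026-03-01" } },
  { pageContent: "Refunds take 14 days.", metadata: { source: "handbook/billing", updatedAt: "2023-06-10" } },
  { pageContent: "Refunds are instant.", metadata: {} },
];

const freshBilling = filterChunks(storedChunks, "handbook/", "2025-01-01");
console.log(freshBilling.map((c) => `${c.pageContent} [${c.metadata.source}]`));
```

The three chunks contradict each other, and similarity search alone cannot tell which one to trust. Only the metadata lets you drop the stale and the unattributed versions, and cite the survivor.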

Beyond the demo

Per-source loaders
Structure-aware splitting (headings, code blocks)
Rich metadata (title hierarchy, ACL, updated_at)
Persistent vector stores
Hybrid search + rerank
Eval sets to prove chunking changes moved metrics
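The last item can start very small. A minimal sketch of a hit-rate check, assuming a hand-labeled eval set that maps each question to the id of the chunk that should come back; the `EvalCase` type, the `hitRateAtK` helper, and the stub retriever are hypothetical stand-ins for a real retriever.

```typescript
interface EvalCase {
  question: string;
  expectedChunkId: string;
}

// Any function that returns ranked chunk ids for a question.
type RetrieveIds = (question: string, k: number) => string[];

// Fraction of questions whose expected chunk appears in the top-k results.
function hitRateAtK(cases: EvalCase[], retrieveIds: RetrieveIds, k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(
    (c) => retrieveIds(c.question, k).indexOf(c.expectedChunkId) !== -1,
  ).length;
  return hits / cases.length;
}

// Made-up eval set with placeholder questions.
const evalCases: EvalCase[] = [
  { question: "q1", expectedChunkId: "a" },
  { question: "q2", expectedChunkId: "b" },
];

// Stub retriever (assumption): always returns the same ranking.
const stubRetriever: RetrieveIds = (_question, k) => ["a", "c", "b"].slice(0, k);

const hitRateAt2 = hitRateAtK(evalCases, stubRetriever, 2);
const hitRateAt3 = hitRateAtK(evalCases, stubRetriever, 3);
console.log({ hitRateAt2, hitRateAt3 });
```

Run the same eval set before and after changing chunkSize, chunkOverlap, or separators; if the number does not move, the change did not help retrieval, whatever it looked like by eye.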

Summary

RAG is not only “vectors + LLM.” Before search, knowledge must be accepted, cleaned, and chunked. Loaders standardize input; splitters define what “one retrieval unit” means.

If embeddings and retrievers decide how well you search, loaders and splitters decide whether you can search well at all.