# Getting Started with RAG Ingestion
This page bridges Chapter 6's design rationale with the actual implementation code. If you want to understand why the pipeline works the way it does -- generation tagging strategy, chunking philosophy, hybrid search design -- start with Chapter 6. This page covers what was built and how to use it.
## Why This Approach
The BBj documentation corpus spans six sources in five distinct formats, each with its own structure: MadCap Flare XHTML topics, PDF manuals, WordPress magazine articles, WordPress knowledge base lessons, Docusaurus MDX tutorials, and BBj source code files. A single generic parser would lose the structural cues that make each format valuable for retrieval. Instead, the pipeline uses source-by-source ingestion -- each source has a dedicated parser that understands its native format.
Every chunk produced by the pipeline carries generation labels (all, character, vpro5, bbj_gui, dwc) so that retrieval can filter by the BBj generation a developer is actually working with. A query from a DWC project returns DWC-relevant documentation first, without excluding universal content. Generation tagging is automatic -- derived from file paths, condition tags, and content patterns.
The chunker is heading-aware: it splits at section boundaries (## and ###) rather than at arbitrary token counts, and prepends a contextual header like BBjSysGui > addWindow > Parameters to each chunk. This preserves document structure so chunks make sense in isolation and produce richer embeddings. See Chunking Strategy in Chapter 6 for the full design rationale.
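As a rough sketch of that behavior (simplified; the real chunker also applies overlap and token-length limits, and the function name here is illustrative):

```python
import re

def split_by_headings(markdown: str, context_header: str) -> list[str]:
    # Split before each ## or ### line; the zero-width lookahead keeps
    # the heading attached to its own section.
    sections = re.split(r"(?m)^(?=#{2,3}\s)", markdown)
    # Prepend the contextual header so each chunk is self-describing.
    return [f"{context_header}\n\n{s.strip()}" for s in sections if s.strip()]
```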
## Pipeline Architecture
Each source feeds into a shared five-stage pipeline:

1. Parse extracts text and metadata.
2. Tag assigns generation labels and a document type.
3. Chunk splits at heading boundaries with overlap.
4. Embed generates vectors via Ollama (or OpenAI).
5. Store bulk-inserts into PostgreSQL with pgvector.
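In outline, the orchestration looks roughly like this (function names are illustrative, not the pipeline's actual API; Document, Chunk, and DocumentParser are the models defined under Key Data Models below):

```python
from collections.abc import Callable, Iterable

def run_pipeline(
    parser: "DocumentParser",
    tag: Callable[["Document"], "Document"],
    chunk: Callable[["Document"], Iterable["Chunk"]],
    embed: Callable[[list["Chunk"]], None],
    store: Callable[[list["Chunk"]], None],
) -> None:
    chunks: list["Chunk"] = []
    for document in parser.parse():      # Parse: source-specific parser
        document = tag(document)         # Tag: generations + document type
        chunks.extend(chunk(document))   # Chunk: heading-aware split
    embed(chunks)                        # Embed: fill chunk.embedding vectors
    store(chunks)                        # Store: bulk insert into pgvector
```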
## Source Coverage
| Source | Parser Module | Content Type | URL Scheme |
|---|---|---|---|
| Flare XHTML | parsers/flare.py | API reference, concepts, migration guides | flare://path |
| PDF | parsers/pdf.py | GUI programming guide | pdf://filename#section |
| WordPress (Advantage) | parsers/wordpress.py | Magazine articles | https://basis.cloud/advantage... |
| WordPress (KB) | parsers/wordpress.py | Knowledge base lessons | https://basis.cloud/knowledge... |
| Docusaurus MDX | parsers/mdx.py | DWC course tutorials | mdx://chapter/file |
| BBj Source | parsers/bbj_source.py | Code examples | file://relative-path |
The URL scheme column shows how each parser constructs source_url values stored alongside chunks. These prefixes enable per-source reporting and retrieval filtering.
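As an illustration, retrieval filtering reduces to a prefix match on source_url (the table name below is an assumption, not the pipeline's actual schema):

```python
# Hypothetical query: restrict retrieval to Flare-sourced chunks.
FLARE_ONLY_SQL = """
SELECT title, content
FROM chunks
WHERE source_url LIKE 'flare://%'
"""
```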
## Key Data Models
### Document
The parser output contract -- every parser yields validated Document objects:
```python
from pydantic import BaseModel, Field

class Document(BaseModel):
    source_url: str
    title: str
    doc_type: str
    content: str
    generations: list[str]
    context_header: str = ""
    deprecated: bool = False
    metadata: dict[str, str] = Field(default_factory=dict)
```
Full source: models.py
### Chunk
The storage-ready contract -- extends Document fields with a content hash for deduplication and an embedding vector:
```python
from pydantic import BaseModel, Field

class Chunk(BaseModel):
    source_url: str
    title: str
    doc_type: str
    content: str
    content_hash: str  # SHA-256, auto-computed by from_content()
    generations: list[str]
    context_header: str = ""
    deprecated: bool = False
    metadata: dict[str, str] = Field(default_factory=dict)
    embedding: list[float] | None = None
```
Chunks are always created via the Chunk.from_content() factory method, which auto-computes the SHA-256 hash. This hash powers idempotent re-ingestion (ON CONFLICT (content_hash) DO NOTHING).
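The dedup mechanism is easy to picture; here is a minimal sketch (the table name and column list in the insert statement are assumptions, not the pipeline's actual schema):

```python
import hashlib

def sha256_of(content: str) -> str:
    # Identical chunk text always yields the same hash, so re-running
    # an ingestion inserts nothing new.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

# Illustrative upsert with psycopg-style placeholders.
INSERT_CHUNK = """
INSERT INTO chunks (source_url, title, content, content_hash, embedding)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (content_hash) DO NOTHING
"""
```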
Full source: models.py
### DocumentParser Protocol
Every parser implements this protocol:
```python
from collections.abc import Iterator
from typing import Protocol, runtime_checkable

@runtime_checkable
class DocumentParser(Protocol):
    def parse(self) -> Iterator[Document]:
        """Yield Document objects from the configured source."""
        ...
```
Full source: parsers/__init__.py
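As an illustration, a hypothetical parser (not one of the pipeline's six) that satisfies the contract:

```python
from collections.abc import Iterator
from pathlib import Path

class NotesParser:
    """Hypothetical example: yields one Document per Markdown file."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def parse(self) -> Iterator[Document]:
        for path in sorted(self.root.glob("*.md")):
            yield Document(
                source_url=f"file://{path}",
                title=path.stem,
                doc_type="concept",
                content=path.read_text(encoding="utf-8"),
                generations=["all"],
            )
```

Because the protocol is runtime-checkable, isinstance(NotesParser(Path(".")), DocumentParser) holds without any explicit inheritance.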
## BBj Intelligence
Between parsing and chunking, the pipeline applies two classification stages and builds contextual headers. Flare-sourced documents get full intelligence enrichment; other parsers pre-populate these fields during parsing.
Generation tagger -- assigns generation labels (all, character, vpro5, bbj_gui, dwc) using three weighted signal sources: file path prefixes (weight 0.6), MadCap condition tags (weight 0.3-0.5), and content regex patterns (weight 0.4). Signals are aggregated and thresholded at 0.3 to produce the final generation list. Documents with no generation above the threshold receive an untagged sentinel.
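A minimal sketch of the aggregation step (the signal encoding and function name are illustrative; the real logic lives in intelligence/):

```python
THRESHOLD = 0.3

def aggregate_generations(signals: list[tuple[str, float]]) -> list[str]:
    # Each signal is (generation, weight), e.g. ("dwc", 0.6) from a
    # file path prefix. Sum the weights per generation and keep every
    # generation whose total clears the threshold.
    scores: dict[str, float] = {}
    for generation, weight in signals:
        scores[generation] = scores.get(generation, 0.0) + weight
    tagged = sorted(g for g, s in scores.items() if s >= THRESHOLD)
    return tagged or ["untagged"]  # sentinel when nothing qualifies
```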
Document type classifier -- categorizes each document as api-reference, concept, example, migration, language-reference, best-practice, or version-note. Uses a data-driven rule registry that scores heading structure, path patterns, and content patterns. The highest-scoring rule above its threshold wins; concept is the fallback default.
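The registry might be pictured like this (rule contents here are invented for illustration; see intelligence/ for the real rules):

```python
import re
from dataclasses import dataclass

@dataclass
class TypeRule:
    doc_type: str
    patterns: list[str]  # regexes scored against headings and content
    threshold: int       # minimum number of matching patterns

RULES = [
    TypeRule("api-reference", [r"(?m)^#+\s*Parameters", r"(?m)^#+\s*Return"], 1),
    TypeRule("migration", [r"Visual PRO/5", r"migrat\w+"], 1),
]

def classify(text: str) -> str:
    best_type, best_score = "concept", 0  # concept is the fallback default
    for rule in RULES:
        score = sum(bool(re.search(p, text)) for p in rule.patterns)
        if score >= rule.threshold and score > best_score:
            best_type, best_score = rule.doc_type, score
    return best_type
```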
Context headers -- builds hierarchical headers like BBj Objects > BBjWindow > addButton > Parameters from TOC section paths, document titles, and section headings. These are prepended to chunk content before embedding so the vector captures structural context.
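Assembly is essentially a join over the hierarchy (argument names are illustrative):

```python
def build_context_header(toc_path: list[str], heading: str | None = None) -> str:
    # ["BBj Objects", "BBjWindow", "addButton"] + "Parameters"
    # -> "BBj Objects > BBjWindow > addButton > Parameters"
    parts = list(toc_path)
    if heading:
        parts.append(heading)
    return " > ".join(parts)
```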
Full source: intelligence/
## Running the Pipeline
See the rag-ingestion/README.md for complete prerequisites (PostgreSQL + pgvector, Ollama, Python 3.12+, uv), installation, and configuration.
Ingest a source:
```bash
bbj-rag ingest --source <name>
```
Available sources: flare, pdf, advantage, kb, mdx, bbj-source. Each source requires its own configuration -- paths for local sources (Flare, PDF, MDX, BBj source) or URLs for web sources (Advantage, KB).
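For example, to ingest the Flare topics and then the PDF manuals:

```bash
bbj-rag ingest --source flare
bbj-rag ingest --source pdf
```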
Parse only (debug mode):
```bash
bbj-rag parse --source <name>
```
Runs the parser without embedding or storing -- useful for verifying parse output before a full ingestion run.
Quality report:
```bash
bbj-rag report
```
Shows chunk distribution by source, generation, and document type with anomaly warnings.
Search validation:
```bash
bbj-rag validate
```
Runs retrieval assertions against embedded data to verify search quality.