RAG Database Design

TL;DR

Every piece of BBj documentation enters a retrieval pipeline that tags it by generation (all, character, vpro5, bbj-gui, dwc), chunks it with contextual headers, embeds it into vectors, and stores it in PostgreSQL with pgvector. Retrieval uses hybrid search -- dense semantic vectors combined with BM25 keyword matching -- followed by cross-encoder reranking and generation-aware scoring. This shared pipeline powers both the IDE extension and documentation chat.

The retrieval layer is the bridge between raw documentation and AI-powered answers. Without it, every query to the fine-tuned model relies solely on what the model memorized during training. With it, the model can ground its responses in actual, current documentation -- citing specific API methods, referencing exact syntax rules, and providing generation-appropriate examples.

As Chapter 2 establishes, the BBj AI strategy follows a two-layer architecture: a shared foundation consumed by multiple applications. The RAG database is the second pillar of that foundation (alongside the fine-tuned model). Both the IDE extension's context enrichment and the documentation chat's response generation depend on the same retrieval API, the same chunked corpus, and the same generation metadata.

This chapter is the technical blueprint for that pipeline: what goes in, how it gets processed, and how it comes back out.

Source Corpus

The RAG database ingests documentation from 7 source types, processed by 7 dedicated parsers into 51K+ chunks. Each source contributes a different kind of knowledge:

Source	Parser	Chunks	Description
MadCap Flare	Flare XML parser	44,587	Primary BBj/webforJ documentation -- API references, concepts, tutorials, migration guides
WordPress (Advantage + KB)	WordPress REST API parser	2,950	Blog posts and knowledge base articles
Web Crawl	Docusaurus HTML parser	1,798	Crawled web documentation
Docusaurus MDX	MDX parser	951	This strategy documentation site
JavaDoc JSON	JavaDoc JSON parser	695	BBj API class/method documentation
BBj Source Code	BBj source parser	106	BBj code examples and patterns
PDF	PDF parser	47	PDF documentation

Total: 51,134 chunks across 7 source groups.

The MadCap Flare documentation is the primary source (87% of the corpus) -- it is the authoritative, maintained documentation that BBj developers rely on. The other sources supplement it with practical code examples, class- and method-level API details, and community knowledge from blog posts and knowledge base articles.

MadCap Flare Ingestion

MadCap Flare is the documentation authoring tool used for BBj's official documentation. Understanding its content format is essential for designing the ingestion pipeline.

Content Format

Flare stores content as individual XHTML topics -- each topic is a separate .htm file containing W3C-compliant XML with standard HTML elements. Topics are organized hierarchically through Table of Contents (TOC) files, and build targets define output formats.

The key characteristics for RAG ingestion:

W3C XHTML compliance -- standard XML parsing works; no proprietary binary format
One topic per file -- natural document boundaries for chunking
Section hierarchy via headings -- <h1>, <h2>, <h3> provide contextual structure
MadCap namespace extensions -- proprietary tags (mc:*, data-mc-*) for conditions, snippets, and cross-references that are irrelevant to RAG content

Clean XHTML Export

Flare provides a Clean XHTML build target that strips all MadCap-specific tags and outputs basic HTML files. This is the recommended ingestion format.

Decision: Clean XHTML as Ingestion Format

Choice: Use MadCap Flare's Clean XHTML export as the primary input for the RAG ingestion pipeline.

Rationale: Clean XHTML strips proprietary MadCap tags (mc:*, data-mc-*), removes conditional content markers and snippet references, and outputs standard HTML that any parser can process. This avoids building a MadCap-specific parser and produces stable, predictable input for the pipeline.

Alternatives considered: Parsing raw Flare project files directly (requires handling MadCap-specific XML namespaces, conditions, and snippets -- significantly more complex). Using Flare's HTML5 output (includes styling and navigation elements that add noise to extracted text).

Status: Operational. The Flare XML parser processes Clean XHTML exports, producing 44,587 chunks -- the largest source in the corpus.

Ingestion Pipeline

The pipeline from Flare export to stored chunks follows a predictable sequence:

interface FlareDocument {
    filePath: string;          // e.g., Content/Topics/BBjAPI/addWindow.htm
    title: string;             // Extracted from <head><title>
    body: string;              // Extracted from <body>, HTML stripped
    headings: string[];        // Section hierarchy for contextual headers
    generation: string[];      // Inferred from content + file path
    docType: 'api-reference' | 'concept' | 'example' | 'migration'
           | 'language-reference' | 'best-practice' | 'version-note';
}

// Pipeline steps:
// 1. Export Clean XHTML from Flare (removes MadCap-specific tags)
// 2. Parse XHTML files, extract text + metadata
// 3. Classify document type from structure and content signals
// 4. Apply generation tagging based on API names, syntax patterns, file paths
// 5. Chunk with contextual headers (preserve section context)
// 6. Embed chunks using selected embedding model
// 7. Store in PostgreSQL with pgvector + generation metadata

Flare does not expose a programmatic API for content extraction. The export step is a manual build target execution or a file system parse of the Flare project directory. Once exported, the rest of the pipeline is fully automated.
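
Step 2 of the pipeline (parse XHTML, extract text and metadata) could look roughly like the sketch below. The function and type names are hypothetical, and regular expressions are used only to keep the example dependency-free; a production parser would more likely rely on a real XML/HTML parser.

```typescript
// Hypothetical sketch of pipeline step 2: pull the title, headings, and plain
// text out of one Clean XHTML topic. Not the project's actual parser.
interface ParsedTopic {
    title: string;
    headings: string[];
    body: string;
}

// Replace tags with spaces, then collapse whitespace.
function stripTags(html: string): string {
    return html.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}

function parseCleanXhtml(xhtml: string): ParsedTopic {
    const titleMatch = /<title>([\s\S]*?)<\/title>/i.exec(xhtml);
    const bodyMatch = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(xhtml);
    const bodyHtml = bodyMatch ? bodyMatch[1] : '';

    // Collect <h1>..<h3> text in document order for contextual headers.
    const headings: string[] = [];
    const headingPattern = /<h[1-3][^>]*>([\s\S]*?)<\/h[1-3]>/gi;
    let match: RegExpExecArray | null;
    while ((match = headingPattern.exec(bodyHtml)) !== null) {
        headings.push(stripTags(match[1]));
    }

    return {
        title: titleMatch ? stripTags(titleMatch[1]) : '',
        headings,
        body: stripTags(bodyHtml),
    };
}

const sampleTopic = `<html><head><title>addWindow</title></head>
<body><h1>BBjSysGui</h1><h2>addWindow</h2><p>Creates a new top-level window.</p></body></html>`;
const parsedTopic = parseCleanXhtml(sampleTopic);
```

The resulting `headings` array is what the later chunking step would use to build contextual headers.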

Multi-Generation Document Structure

The defining design choice for this RAG database is generation metadata on every chunk. This is not optional -- it is what makes BBj retrieval fundamentally different from generic documentation search.

As Chapter 3 establishes, BBj spans four generations of UI technology. A developer asking "how do I create a window?" needs different answers depending on whether they are working with character UI, Visual PRO/5, modern BBj GUI, or DWC browser-based code. The generation metadata on each chunk enables the retrieval system to return the right answer for the right context.

Generation Labels

The RAG database uses the same generation labeling schema as the training data:

Label	Scope	Example Content
`"all"`	Universal patterns	FOR/NEXT loops, file I/O, string functions
`"character"`	Character UI (1980s)	`PRINT @(x,y)`, `INPUT` statements
`"vpro5"`	Visual PRO/5 (1990s)	`PRINT (sysgui)'WINDOW'(...)`, `PRINT (sysgui)'BUTTON'(...)`, `CTRL(sysgui,id,index)`
`"bbj-gui"`	BBj GUI/Swing (2000s)	`BBjAPI().getSysGui()`, `addWindow()`
`"dwc"`	DWC/Browser (2010s+)	`getWebManager()`, `executeAsyncScript`

A document can carry multiple generation labels. An API method like BBjSysGui.addWindow() that works in both desktop Swing and DWC browser contexts would carry ["bbj-gui", "dwc"].
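
As a rough illustration of content-based tagging, the sketch below infers labels from a handful of syntax signals taken from the example column of the table above. The pattern list and the function name are assumptions for illustration; a real tagger would need a much larger curated pattern set plus file-path rules, and it would need API-level knowledge to assign multi-generation labels such as the addWindow() case.

```typescript
type Generation = 'all' | 'character' | 'vpro5' | 'bbj-gui' | 'dwc';

// Illustrative signals only, drawn from the label table's example content.
const generationSignals: Array<{ label: Generation; pattern: RegExp }> = [
    { label: 'character', pattern: /PRINT\s+@\(/i },
    { label: 'vpro5', pattern: /PRINT\s+\(sysgui\)'[A-Z]+'/i },
    { label: 'vpro5', pattern: /CTRL\(sysgui/i },
    { label: 'bbj-gui', pattern: /getSysGui\(\)/ },
    { label: 'dwc', pattern: /getWebManager\(\)|executeAsyncScript/ },
];

function inferGenerations(content: string): Generation[] {
    const found: Generation[] = [];
    for (const { label, pattern } of generationSignals) {
        if (pattern.test(content) && !found.includes(label)) {
            found.push(label);
        }
    }
    // No generation-specific signal: treat the chunk as universal.
    return found.length > 0 ? found : ['all'];
}

const loopTags = inferGenerations('FOR i = 1 TO 10 ... NEXT i');
const legacyTags = inferGenerations("PRINT (sysgui)'WINDOW'(10,10,200,100,title$,flags$,eventmask$)");
const modernTags = inferGenerations('sysgui! = BBjAPI().getSysGui(); wm! = BBjAPI().getWebManager()');
```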

Document Structure Examples

Universal documentation (applies to all generations):

{
    "id": "bbj-for-next-001",
    "type": "language-reference",
    "topic": "FOR/NEXT loop",
    "generation": "all",
    "content": "The FOR/NEXT loop executes a block of statements a specified number of times. Syntax: FOR var = start TO end [STEP increment]...NEXT var",
    "keywords": ["loop", "iteration", "for", "next", "control flow"],
    "contextual_header": "Language Reference > Control Flow > FOR/NEXT"
}

Modern API documentation (multi-generation):

{
    "id": "bbj-addwindow-001",
    "type": "api-reference",
    "class": "BBjSysGui",
    "method": "addWindow",
    "generation": ["bbj-gui", "dwc"],
    "since_version": "12.00",
    "content": "Creates a new top-level window. Syntax: addWindow(int x, int y, int w, int h, String title)...",
    "signatures": [
        "BBjTopLevelWindow addWindow(int x, int y, int w, int h, String title)",
        "BBjTopLevelWindow addWindow(int id, int x, int y, int w, int h, String title)"
    ],
    "related": ["BBjTopLevelWindow", "BBjChildWindow", "addChildWindow"],
    "supersedes": "vpro5-window-create-001",
    "keywords": ["window", "gui", "create", "toplevel"],
    "contextual_header": "BBjSysGui > addWindow"
}

Legacy documentation (still valid, but superseded):

{
    "id": "vpro5-window-create-001",
    "type": "api-reference",
    "verb": "PRINT (sysgui)'WINDOW'(...)",
    "generation": ["vpro5"],
    "deprecated_in": "12.00",
    "still_valid": true,
    "content": "Creates a GUI window using Visual PRO/5 mnemonic syntax. print (sysgui)'window'(x,y,w,h,title$,flags$,eventmask$)...",
    "superseded_by": "bbj-addwindow-001",
    "migration_note": "For new development, use BBjSysGui.addWindow() for better DWC compatibility.",
    "keywords": ["window", "gui", "create", "vpro5", "legacy"],
    "contextual_header": "Visual PRO/5 > GUI > PRINT (sysgui)'WINDOW'(...)"
}

The supersedes and superseded_by links create a graph of modernization paths. When a developer queries legacy documentation, the retrieval system can surface the modern equivalent alongside the legacy answer.
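
One way to walk that graph is sketched below: starting from a legacy document, follow superseded_by links until a document with no successor is reached. The helper name and the `LinkedDoc` shape are hypothetical; only the `id` and `superseded_by` fields come from the examples above.

```typescript
interface LinkedDoc {
    id: string;
    superseded_by?: string;
}

// Follow superseded_by links to the newest known document.
// The visited set guards against accidental cycles in the metadata.
function resolveModernEquivalent(
    startId: string,
    docsById: Map<string, LinkedDoc>
): LinkedDoc | undefined {
    const visited = new Set<string>();
    let current = docsById.get(startId);
    while (current && current.superseded_by && !visited.has(current.id)) {
        visited.add(current.id);
        const next = docsById.get(current.superseded_by);
        if (!next) break; // Dangling link: stop at the last known document
        current = next;
    }
    return current;
}

const docsById = new Map<string, LinkedDoc>([
    ['vpro5-window-create-001', { id: 'vpro5-window-create-001', superseded_by: 'bbj-addwindow-001' }],
    ['bbj-addwindow-001', { id: 'bbj-addwindow-001' }],
]);
const modern = resolveModernEquivalent('vpro5-window-create-001', docsById);
```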

Document Types

Type	Generation	Description	Example
`language-reference`	Usually `"all"`	Core language syntax	FOR/NEXT loops, file I/O
`api-reference`	Varies	Method/class documentation	BBjSysGui.addWindow()
`concept`	Varies	Conceptual explanation	"Understanding BBj Events"
`example`	Varies	Working code sample	"Creating a Grid Application"
`migration`	N/A (has from/to)	How to modernize legacy code	"Migrating from PRINT (sysgui)'WINDOW'(...)"
`best-practice`	Often `"all"`	Recommended patterns	"Error Handling in BBj"
`version-note`	Varies	Version-specific behavior	"New in BBj 23.04: await parameter"

The document type informs both chunking strategy (different types get different chunk sizes) and retrieval ranking (API references are boosted for "how do I" queries; migration docs are boosted when the query involves legacy syntax).

Chunking Strategy

Not all documentation should be chunked the same way. An API reference entry for a single method is compact and self-contained. A conceptual guide explaining the BBj event model spans multiple paragraphs and requires context to be useful. Treating both identically -- either too small or too large -- degrades retrieval quality.

Document-Type-Aware Chunk Sizes

Decision: Variable Chunk Sizes by Document Type

Choice: Use different target chunk sizes based on document type rather than a uniform chunk size across the entire corpus.

Rationale: API references are dense and self-contained -- a 200-400 token chunk captures a complete method signature with its description. Conceptual documentation needs more context to be meaningful -- a 400-600 token chunk preserves the explanation around a concept. Code examples should never be split mid-function. One-size-fits-all chunking either loses context for concepts or wastes vector space on padded API entries.

Alternatives considered: Uniform 512-token chunks (simpler but lower retrieval quality), sentence-level splitting (too granular for technical documentation), full-document embedding (works only for very short documents).

Status: Operational. The ingestion pipeline applies document-type-aware chunking across all 7 source parsers, producing 51K+ chunks with contextual headers.

Document Type	Target Chunk Size	Rationale
API references	200-400 tokens	Compact, self-contained; one method per chunk
Conceptual docs	400-600 tokens	Need surrounding explanation for context
Code examples	Variable (keep intact)	Splitting a code example mid-function destroys its value
Migration guides	400-600 tokens	Need both the legacy and modern patterns together
Language reference	300-500 tokens	Core syntax rules with examples
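
The table could be encoded as a simple lookup, as in the sketch below. The type and function names are hypothetical, the bounds for code examples are placeholders (they are kept whole rather than sized), and the best-practice and version-note types are omitted because the table gives no targets for them.

```typescript
type DocType = 'api-reference' | 'concept' | 'example' | 'migration' | 'language-reference';

interface ChunkTarget {
    minTokens: number;
    maxTokens: number;
    keepIntact: boolean; // true: never split, regardless of length
}

// Values mirror the chunk-size table above.
const chunkTargets: Record<DocType, ChunkTarget> = {
    'api-reference': { minTokens: 200, maxTokens: 400, keepIntact: false },
    'concept': { minTokens: 400, maxTokens: 600, keepIntact: false },
    'example': { minTokens: 0, maxTokens: Infinity, keepIntact: true },
    'migration': { minTokens: 400, maxTokens: 600, keepIntact: false },
    'language-reference': { minTokens: 300, maxTokens: 500, keepIntact: false },
};

// A section is split only when it exceeds its type's upper bound.
function needsSplit(docType: DocType, tokenCount: number): boolean {
    const target = chunkTargets[docType];
    return !target.keepIntact && tokenCount > target.maxTokens;
}
```

With these targets, a 450-token section would be split if it is an API reference but kept whole if it is a conceptual doc.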

Contextual Headers

Every chunk is prepended with its section hierarchy -- the path from the top-level topic through subheadings to the chunk's location. This is one of the most impactful improvements in modern RAG systems because it lets the embedding capture the context that would otherwise be lost when a paragraph is extracted from its document.

Without contextual headers:

"Creates a new top-level window. The first parameter specifies the X coordinate..."

The embedding has no idea this is about BBjSysGui.addWindow() in BBj.

With contextual headers:

"BBjSysGui > addWindow > Parameters: Creates a new top-level window. The first parameter specifies the X coordinate..."

Now the embedding captures both the semantic meaning and the precise API context. This significantly improves retrieval precision for queries like "addWindow parameters" or "BBjSysGui window creation."
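
A minimal sketch of the prepending step, assuming the heading path has already been extracted during parsing (the function name is illustrative):

```typescript
// Prepend the section path to the chunk text, matching the "A > B > C: text"
// shape of the example above.
function withContextualHeader(headingPath: string[], chunkText: string): string {
    const path = headingPath.map(h => h.trim()).filter(h => h.length > 0);
    return path.length > 0 ? `${path.join(' > ')}: ${chunkText}` : chunkText;
}

const embeddedText = withContextualHeader(
    ['BBjSysGui', 'addWindow', 'Parameters'],
    'Creates a new top-level window. The first parameter specifies the X coordinate...'
);
```

The string in `embeddedText` is what would be sent to the embedding model, rather than the bare paragraph.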

Overlap Between Chunks

Adjacent chunks share 10-15% overlap at their boundaries. This ensures that information spanning a chunk boundary is captured in at least one chunk. For a 400-token chunk, this means approximately 40-60 tokens of overlap with the preceding and following chunks.

The overlap is especially important for BBj documentation, where a method description often ends with a code example that starts in the next logical section. Without overlap, a query about a method's usage pattern might retrieve the description chunk but miss the example chunk, or vice versa.
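
The overlap mechanics can be sketched as a sliding window over a token sequence. This is a simplified illustration: it ignores section boundaries and the keep-intact rule for code examples, and the function name is hypothetical.

```typescript
// Split a token sequence into windows of `chunkSize` tokens, where each window
// repeats the last `overlap` tokens of the previous one.
function chunkWithOverlap<T>(tokens: T[], chunkSize: number, overlap: number): T[][] {
    if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
        throw new Error('require chunkSize > 0 and 0 <= overlap < chunkSize');
    }
    const chunks: T[][] = [];
    const step = chunkSize - overlap;
    for (let start = 0; start < tokens.length; start += step) {
        chunks.push(tokens.slice(start, start + chunkSize));
        if (start + chunkSize >= tokens.length) break; // last window reached the end
    }
    return chunks;
}

// 1,000 placeholder tokens, 400-token chunks, 50-token overlap (12.5%).
const tokens = Array.from({ length: 1000 }, (_, i) => i);
const chunks = chunkWithOverlap(tokens, 400, 50);
```

Here the second chunk starts at token 350, so tokens 350 through 399 appear in both the first and second chunks.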

Embedding Strategy

Embeddings are the numeric representations that make semantic search possible. The choice of embedding model determines how well the system understands the meaning behind a query, not just its keywords.

Starting Point: General-Purpose Embeddings

BBj documentation is primarily written in English prose with embedded code snippets. General-purpose embedding models handle English text well, and the contextual headers strategy (described above) compensates for most domain-specific terminology gaps by providing explicit context.

The recommended starting point is a strong open-source embedding model such as BGE-M3 or a comparable model from the MTEB leaderboard. As of early 2026, open-source embedding models have largely closed the gap with proprietary options for English-language retrieval tasks.

Key selection criteria for the embedding model:

Dimension: 768-1024 dimensions provides a good balance of quality and storage efficiency
Sequence length: Must cover the largest conceptual chunks (up to 600 tokens, plus the contextual header); a 512-token limit would truncate them
Code awareness: Models trained on mixed code-and-text corpora handle BBj code snippets within documentation better than pure text models
Self-hostable: The embedding model must run locally -- sending BBj documentation to external APIs may conflict with enterprise data policies
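
To put the dimension criterion in concrete terms, the sketch below estimates raw embedding storage for this corpus. It assumes pgvector's single-precision vector type, which takes 4 bytes per dimension plus 8 bytes per value, and it excludes index and row overhead.

```typescript
// Rough storage estimate for raw embeddings only.
function embeddingStorageBytes(chunkCount: number, dimensions: number): number {
    return chunkCount * (4 * dimensions + 8);
}

const toMiB = (bytes: number): number => bytes / (1024 * 1024);

const bytesAt1024 = embeddingStorageBytes(51134, 1024);
const bytesAt768 = embeddingStorageBytes(51134, 768);
```

For 51,134 chunks this comes to roughly 200 MiB at 1024 dimensions and roughly 150 MiB at 768, so at this corpus size the storage difference is unlikely to drive the choice.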

Fine-Tuning Embeddings (Deferred)

Domain-specific embedding fine-tuning is feasible with as few as 1,000-5,000 query-document pairs and can yield 7% or greater improvement in retrieval precision. For BBj, this would mean creating pairs like:

Query: "how to create a window in DWC" paired with the BBjSysGui.addWindow() documentation chunk
Query: "BBj file I/O" paired with the OPEN/READ/WRITE language reference chunks

However, this optimization should be deferred until the baseline pipeline is operational and retrieval quality can be measured. Premature embedding fine-tuning risks optimizing for the wrong thing before the chunking strategy and corpus coverage are validated.

Vector Store Selection

The vector store is where embedded chunks live and where similarity search happens. The choice matters less than most RAG guides suggest -- at BBj's corpus scale, the differences between options are negligible.

Decision: PostgreSQL with pgvector as Default Vector Store

Choice: As of early 2026, pgvector (PostgreSQL extension) is the default vector store.

Rationale: BBj's total corpus -- MadCap Flare documentation, source code samples, API references, and knowledge base articles -- produces roughly 51,000 chunks. At this scale, pgvector and dedicated vector databases (Qdrant, Weaviate, Milvus) perform comparably. Benchmarks as of January 2026 show sub-millisecond p50 latency differences between pgvector and Qdrant at datasets under 100K vectors. pgvector avoids running a separate database service, integrates natively with SQL for metadata filtering (generation, document type, version), and uses infrastructure that most organizations already operate.

Alternatives considered:

Qdrant -- Purpose-built vector database with excellent filtering and clustering. Better scaling characteristics above 1M vectors. Worth evaluating if the corpus grows significantly or if multi-tenant isolation is needed.
Weaviate -- GraphQL-native vector database with built-in vectorization modules. More complex to operate but offers hybrid search out of the box.
Chroma -- Lightweight, embedded vector store. Good for prototyping but lacks advanced operational features.

Status: Operational for internal exploration. PostgreSQL 17 + pgvector 0.8.0 running via Docker Compose, storing 51,134 embedded chunks with Qwen3-Embedding-0.6B (1024 dimensions) via Ollama.

pgvector Capabilities

pgvector supports the features required for BBj's retrieval strategy:

HNSW indexing -- Approximate nearest neighbor search with configurable recall/speed tradeoff. At 50K vectors, exact search (a sequential scan with no vector index) is also viable; IVFFlat is a second approximate index option.
Distance metrics -- Cosine distance (<=> operator), L2 distance (<->), and negative inner product (<#>). Cosine is the standard for text embeddings; cosine similarity is 1 minus the cosine distance.
SQL integration -- Generation filtering, document type filtering, and metadata joins happen in the same query as the vector search. No separate metadata store needed.
Incremental updates -- New or updated documentation chunks can be inserted or upserted without rebuilding the entire index.
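
For intuition about what the cosine operator and an exact (non-indexed) search compute, here is a small in-memory sketch. It is not how pgvector is implemented; it only reproduces the arithmetic, and it does not guard against zero-length vectors.

```typescript
// Cosine distance, the quantity pgvector's <=> operator returns: 1 - cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
    if (a.length !== b.length) throw new Error('vectors must have equal length');
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact (brute-force) nearest-neighbor search: score every row, then sort.
function exactTopK(
    query: number[],
    rows: { id: string; embedding: number[] }[],
    k: number
): { id: string; similarity: number }[] {
    return rows
        .map(row => ({ id: row.id, similarity: 1 - cosineDistance(query, row.embedding) }))
        .sort((x, y) => y.similarity - x.similarity)
        .slice(0, k);
}

const nearest = exactTopK(
    [1, 0],
    [
        { id: 'a', embedding: [0, 1] },
        { id: 'b', embedding: [1, 0] },
        { id: 'c', embedding: [1, 1] },
    ],
    2
);
```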

When to Reconsider

If the corpus grows beyond 500,000 chunks (unlikely for BBj documentation alone, but possible if community-contributed content or extensive source code analysis is added), or if retrieval latency requirements drop below 10ms at p99, dedicated vector databases offer better scaling characteristics. The migration path is straightforward: export chunks with metadata from PostgreSQL, bulk-import into Qdrant or Weaviate, and update the retrieval API endpoint.

Hybrid Retrieval Strategy

Pure vector search has a well-documented weakness: it struggles with exact terms, identifiers, and API names. A query for BBjSysGui.addWindow() benefits more from keyword matching than from semantic similarity. Conversely, a query like "how do I create a window in a BBj web application" is better served by semantic search that understands the intent behind the words.

BBj documentation retrieval requires both. The recommended approach is hybrid search with four stages:

Stage 1: Dense Vector Search

Embed the query using the same model that embedded the corpus. Search pgvector for the top 20 most semantically similar chunks, optionally filtered by generation metadata.

-- pgvector semantic search with generation filter
SELECT id, content, generation, doc_type,
       1 - (embedding <=> query_embedding) AS similarity
FROM doc_chunks
WHERE generation @> ARRAY['dwc']  -- generation filter
   OR generation = ARRAY['all']    -- always include universal docs
ORDER BY embedding <=> query_embedding
LIMIT 20;

Stage 2: Sparse Keyword Search (BM25)

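Use PostgreSQL's built-in full-text search to find chunks containing the query's exact terms. Note that ts_rank scores by the frequency of matching terms and applies no inverse-document-frequency weighting, so it is not a true BM25 implementation; it serves here as the BM25-style keyword stage. This stage is critical for BBj because API names, method signatures, and BBj-specific keywords (like BBjSysGui, addWindow, CTRL(), PRINT (sysgui)'WINDOW'(...)) are exact identifiers that semantic search may not rank highly.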

-- PostgreSQL full-text search for keyword matching
SELECT id, content, generation, doc_type,
       ts_rank(search_vector, plainto_tsquery('english', query_text)) AS rank
FROM doc_chunks
WHERE search_vector @@ plainto_tsquery('english', query_text)
ORDER BY rank DESC
LIMIT 20;

Stage 3: Reciprocal Rank Fusion

Merge the semantic and keyword result sets using Reciprocal Rank Fusion (RRF). RRF combines rankings from multiple search methods without requiring score normalization -- each result's fused score is the sum of 1 / (k + rank) across all methods, where k is a constant (typically 60).

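The implementation below uses a weighted variant: each method's contribution is multiplied by a weight (0.7 for semantic, 0.3 for keyword) before summing. The weighting reflects BBj's documentation characteristics: semantic search handles conceptual queries well, while keyword search catches the exact API names and BBj syntax that semantic search misses.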

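// SearchResult is assumed to carry at least an `id` field.
function reciprocalRankFusion(
    semanticResults: SearchResult[],
    keywordResults: SearchResult[],
    weights = { semantic: 0.7, keyword: 0.3 },
    k = 60
): SearchResult[] {
    const scores = new Map<string, number>();
    // Keep one result object per id so the fused ids can be mapped back
    const allResults = new Map<string, SearchResult>();

    // index is 0-based, so rank = index + 1
    semanticResults.forEach((result, index) => {
        const rrf = weights.semantic / (k + index + 1);
        scores.set(result.id, (scores.get(result.id) ?? 0) + rrf);
        allResults.set(result.id, result);
    });

    keywordResults.forEach((result, index) => {
        const rrf = weights.keyword / (k + index + 1);
        scores.set(result.id, (scores.get(result.id) ?? 0) + rrf);
        allResults.set(result.id, result);
    });

    // Highest fused score first
    return [...scores.entries()]
        .sort(([, a], [, b]) => b - a)
        .map(([id]) => allResults.get(id)!);
}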
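
A short worked example of the arithmetic, using the 0.7/0.3 weights and k = 60 from above. The two chunks and their ranks are made up for illustration.

```typescript
// Weighted RRF contribution of one result list: weight / (k + rank), rank starting at 1.
function rrfContribution(rank: number, weight: number, k = 60): number {
    return weight / (k + rank);
}

// Chunk A: rank 1 in semantic results, absent from keyword results.
const scoreA = rrfContribution(1, 0.7);

// Chunk B: rank 3 in semantic results and rank 1 in keyword results.
const scoreB = rrfContribution(3, 0.7) + rrfContribution(1, 0.3);
```

Chunk A scores 0.7 / 61 (about 0.0115), while chunk B scores 0.7 / 63 + 0.3 / 61 (about 0.0160). A chunk that appears in both lists can therefore outrank the top result of a single list, which is the behavior that helps exact API-name queries.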

Stage 4: Cross-Encoder Reranking

The top 20 fused results are reranked using a cross-encoder model. Unlike bi-encoders (used for initial embedding), cross-encoders process the query and each candidate document together, producing a more accurate relevance score at the cost of higher latency.

Reranking the top 20 down to the top 5 is a standard pattern that balances precision and latency. The cross-encoder is too slow to run against the full corpus but highly effective on a small candidate set.
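
The shape of the reranking step can be sketched independently of any particular model. In the sketch below, `scorePair` stands in for the cross-encoder, and the toy word-overlap scorer exists only so the example runs; all names are hypothetical.

```typescript
interface Candidate {
    id: string;
    content: string;
}

// Stand-in for a cross-encoder: receives the query and one candidate together
// and returns a relevance score (higher is better).
type PairScorer = (query: string, content: string) => Promise<number>;

async function rerankCandidates(
    candidates: Candidate[],
    query: string,
    scorePair: PairScorer,
    topK = 5
): Promise<Candidate[]> {
    const scored = await Promise.all(
        candidates.map(async candidate => ({
            candidate,
            score: await scorePair(query, candidate.content),
        }))
    );
    return scored
        .sort((a, b) => b.score - a.score)
        .slice(0, topK)
        .map(entry => entry.candidate);
}

// Toy scorer for demonstration only: counts query words present in the content.
const wordOverlapScorer: PairScorer = async (query, content) => {
    const text = content.toLowerCase();
    return query.toLowerCase().split(/\s+/).filter(word => text.includes(word)).length;
};
```

Because every candidate requires its own model call, the cost grows linearly with the candidate count, which is why the input is capped at 20.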

The Complete Retrieval Function

Combining the four stages, followed by the generation-aware scoring adjustment covered in the next section:

async function retrieveDocumentation(
    query: string,
    generationHint?: string
): Promise<DocChunk[]> {
    // Stage 1: Dense vector search (semantic similarity)
    const semanticResults = await vectorStore.search(
        await embed(query),
        { topK: 20, filter: buildGenerationFilter(generationHint) }
    );

    // Stage 2: Sparse keyword search (BM25)
    const keywordResults = await fullTextSearch(query, { topK: 20 });

    // Stage 3: Reciprocal Rank Fusion (weight keys match reciprocalRankFusion's signature)
    const fused = reciprocalRankFusion(semanticResults, keywordResults, {
        semantic: 0.7,
        keyword: 0.3
    });

    // Stage 4: Cross-encoder reranking (top 20 -> top 5)
    const reranked = await rerank(fused.slice(0, 20), query, { topK: 5 });

    // Final step: generation-aware scoring adjustment
    return applyGenerationScoring(reranked, generationHint);
}

This function is the shared retrieval API that both the IDE extension and documentation chat consume. The IDE calls it with a generation hint derived from the Langium parser's AST analysis. The chat interface calls it with a generation hint inferred from conversation context.
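
The retrieval function calls a buildGenerationFilter helper that is not shown. One possible shape is sketched below, assuming the generation column is a text array as in the Stage 1 query and that the vector store accepts a SQL fragment with bound parameters; both the return type and that contract are assumptions.

```typescript
interface SqlFilter {
    clause: string;    // WHERE-clause fragment using numbered placeholders
    params: unknown[]; // Values bound to those placeholders
}

// Hypothetical helper: with a hint, match chunks tagged with that generation
// plus universal chunks; without one, apply no generation restriction.
function buildGenerationFilter(generationHint?: string): SqlFilter {
    if (!generationHint) {
        return { clause: 'TRUE', params: [] };
    }
    return {
        clause: "(generation @> ARRAY[$1]::text[] OR generation = ARRAY['all'])",
        params: [generationHint],
    };
}

const dwcFilter = buildGenerationFilter('dwc');
const noFilter = buildGenerationFilter();
```

Binding the hint as a parameter rather than interpolating it keeps user-influenced values out of the SQL text.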

Generation-Aware Retrieval

Generation metadata is not just a filter -- it is a scoring factor that adjusts relevance based on the developer's working context. A developer writing DWC code should see DWC documentation first, but universal documentation is always relevant, and legacy documentation may provide useful background.

Generation Scoring Logic

function computeGenerationScore(
    docGeneration: string | string[],
    targetGeneration?: string
): number {
    // Normalize so "all", ["all"], "dwc", and ["bbj-gui", "dwc"] share one code path
    const gens = Array.isArray(docGeneration) ? docGeneration : [docGeneration];

    // Universal docs are always highly relevant
    if (gens.includes('all')) return 95;

    if (!targetGeneration) {
        // No hint: prefer modern generations over legacy ones
        if (gens.includes('dwc')) return 90;
        if (gens.includes('bbj-gui')) return 85;
        if (gens.includes('vpro5')) return 50;
        if (gens.includes('character')) return 30;
        return 70; // Unrecognized label
    }

    // With target: exact match scores highest
    if (gens.includes(targetGeneration)) return 100;

    // Close matches for GUI generations
    const guiGenerations = ['vpro5', 'bbj-gui', 'dwc'];
    if (guiGenerations.includes(targetGeneration) &&
        gens.some(g => guiGenerations.includes(g))) {
        return 70;
    }

    return 20; // Different generation
}

The scoring follows these principles:

Universal documentation (generation "all") is always relevant -- core language constructs apply everywhere.
Exact generation matches score highest when a target generation is known.
GUI generation proximity -- vpro5, bbj-gui, and dwc share enough conceptual overlap that cross-references are still useful, even if the exact API differs.
Without a generation hint, the system defaults to preferring modern documentation (DWC > BBj GUI) since most new development targets current platforms.
Legacy documentation is never excluded -- it is deprioritized but still retrievable. Developers maintaining character UI or Visual PRO/5 codebases can explicitly target those generations.
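
How the generation score is combined with the reranker's output is not specified in this chapter. The sketch below shows one possible approach, a weighted blend, purely as an assumption: the 70/30 split, the 0-to-1 reranker scale, and all names are illustrative.

```typescript
interface RankedChunk {
    id: string;
    generation: string[];
    rerankScore: number; // Assumed to be normalized to the 0..1 range
}

// Assumption: blend the reranker score with the generation score (0..100).
// The source does not specify this formula.
function blendGenerationScore(
    chunks: RankedChunk[],
    generationScore: (generation: string[]) => number,
    generationWeight = 0.3
): RankedChunk[] {
    const combined = (chunk: RankedChunk): number =>
        (1 - generationWeight) * chunk.rerankScore +
        generationWeight * (generationScore(chunk.generation) / 100);
    return [...chunks].sort((a, b) => combined(b) - combined(a));
}

// Stand-in scorer reproducing the DWC-target values used in this chapter.
const dwcScore = (generation: string[]): number =>
    generation.includes('all') ? 95 :
    generation.includes('dwc') ? 100 :
    generation.some(g => g === 'vpro5' || g === 'bbj-gui') ? 70 : 20;

const ordered = blendGenerationScore(
    [
        { id: 'vpro5-window-create-001', generation: ['vpro5'], rerankScore: 0.82 },
        { id: 'bbj-addwindow-001', generation: ['bbj-gui', 'dwc'], rerankScore: 0.80 },
    ],
    dwcScore
);
```

In this example the legacy chunk has a slightly higher reranker score, but the blend moves the DWC-compatible chunk to the top without dropping the legacy one.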

Retrieval Example

Consider a developer working in a DWC context who asks: "How do I create a window?"

The retrieval system:

Semantic search finds chunks about window creation across all generations
Keyword search matches "window" and "create" in API references
Fusion merges results, surfacing BBjSysGui.addWindow() and PRINT (sysgui)'WINDOW'(...)
Reranking evaluates relevance to the specific query
Generation scoring boosts the BBjSysGui.addWindow() chunk (generation ["bbj-gui", "dwc"], score 100) and deprioritizes the PRINT (sysgui)'WINDOW'(...) chunk (generation ["vpro5"], score 70 due to GUI proximity)

The response includes the modern addWindow() documentation first, with a reference to the legacy PRINT (sysgui)'WINDOW'(...) syntax for context -- exactly what a developer migrating or maintaining cross-generation code needs.

MCP Integration

The retrieval pipeline described in this chapter is exposed through the BBj MCP server's search_bbj_knowledge tool, defined in Chapter 2. The tool accepts a natural language query and an optional generation filter, performs the hybrid search pipeline described above (dense vector search, BM25 keyword matching, reciprocal rank fusion, cross-encoder reranking), and returns ranked results with source citations.

This is the primary interface for the documentation chat and for any MCP-compatible client that needs to query BBj documentation. Whether a developer asks a question through the chat interface, through Claude, or through Cursor, the same retrieval pipeline returns the same generation-aware results. The generation metadata, chunking strategy, and hybrid search logic described in this chapter are what make those results accurate.

For the complete MCP server architecture and the other two tools (generate_bbj_code and validate_bbj_syntax), see Chapter 2: Strategic Architecture.

Current Status

Where Things Stand

Operational for internal exploration: RAG ingestion pipeline with 7 parsers processing the full documentation corpus into 51K+ chunks across 7 source groups.
Operational for internal exploration: PostgreSQL 17 + pgvector 0.8.0 database via Docker Compose, storing 51,134 embedded chunks with Qwen3-Embedding-0.6B (1024 dimensions) via Ollama.
Operational for internal exploration: REST retrieval API -- POST /search (hybrid retrieval with source-balanced ranking), GET /stats, GET /health.
Operational for internal exploration: MCP search_bbj_knowledge tool providing semantic search across the documentation corpus, available via stdio and Streamable HTTP transports.
Operational for internal exploration: Web chat integration using RAG retrieval to ground Claude API responses in documentation.

The full RAG system is operational for internal exploration. The ingestion pipeline has processed the complete documentation corpus into 51,134 chunks across 7 source groups, stored in PostgreSQL 17 with pgvector 0.8.0. Retrieval is available through the REST API (hybrid search with source-balanced ranking) and the MCP search_bbj_knowledge tool. The generation metadata schema is shared with the fine-tuning training data, ensuring consistency between what the model learned and what the retrieval system provides. The documentation chat uses this retrieval pipeline to ground Claude API responses in actual documentation with source citations.

What Comes Next

This chapter described the retrieval foundation. The related chapters cover its two primary consumers, the model that shares its generation schema, and the surrounding strategy:

IDE Integration -- The VS Code extension uses the retrieval API to enrich code completion prompts with relevant documentation, filtered by the generation the developer is working in.
Documentation Chat -- The chat interface uses the same retrieval API to ground conversational responses in actual documentation, with citations linking back to source topics.
Fine-Tuning the Model -- The training data uses the same generation labeling schema described in this chapter, ensuring the model and the retrieval system share a consistent understanding of BBj's generational structure.
Strategic Architecture -- The two-layer architecture that positions this RAG pipeline as shared infrastructure consumed by multiple applications.
Implementation Roadmap -- Progress to date and forward plan for all components, including the RAG pipeline.
