
Fine-Tuning the Model

TL;DR

We fine-tune Qwen2.5-Coder-14B-Base using a two-stage QLoRA approach -- continued pretraining on raw BBj source code, then instruction fine-tuning on curated ChatML examples -- informed by findings from the bbjllm experiment. The resulting model is exported to GGUF format, evaluated using the bbjcpl compiler's compile@1 metric, and served through Ollama for local, self-hosted inference with zero per-query costs and complete data privacy.

The foundation of the entire BBj AI strategy is a fine-tuned language model. Every other component -- IDE integration, documentation chat, migration tooling -- depends on a model that actually understands BBj syntax, idioms, and multi-generational patterns. Generic LLMs cannot provide this. As Chapter 1 establishes, BBj is effectively invisible in public training corpora. Fine-tuning bridges that gap.

This chapter is the technical blueprint: what the existing bbjllm experiment established and where the recommended approach diverges, why Qwen2.5-Coder-14B-Base is the recommended base model, how training data is structured and pipelined, how two-stage QLoRA training addresses the unique challenges of a zero-representation language, how to evaluate the results using BBj's own compiler, and how the finished model reaches end users through Ollama.

The bbjllm Foundation

The bbjllm repository represents a valuable first attempt at fine-tuning a BBj-aware code model. It demonstrated that fine-tuning is feasible for BBj, produced working training infrastructure, and created a substantial dataset of curated examples. The recommended approach described in this chapter builds on that foundation while addressing specific gaps identified through analysis of the training methodology.

What bbjllm Built

The bbjllm experiment accomplished several important things:

  • Working QLoRA pipeline: A functional training script using Qwen2.5-Coder-32B-Instruct with 4-bit NF4 quantization, LoRA rank 32, and alpha 64 -- a sound, standard QLoRA configuration.
  • 9,922 curated ChatML examples: A substantial dataset in ChatML format aligned with Qwen's native chat template, covering BBj comprehension, completion, and explanation tasks.
  • Proved fine-tuning viability for BBj: Demonstrated that QLoRA can adapt a general-purpose code model toward BBj-specific tasks, establishing the foundation for continued iteration.
  • Established Qwen2.5-Coder as the right model family: Validated that the Qwen2.5-Coder architecture is suitable for BBj fine-tuning, narrowing the model selection question to which variant and size.
  • Correct target module selection: Applied LoRA to all linear layers including MLP -- the recommended configuration for code tasks.

| Aspect | bbjllm Current | Recommended Approach |
| --- | --- | --- |
| Model | Qwen2.5-Coder-32B-Instruct | Qwen2.5-Coder-14B-Base |
| Model variant | Instruct (pre-aligned) | Base (clean slate for domain adaptation) |
| Training stages | Single stage (instruction fine-tuning only) | Two stages (continued pretraining + instruction fine-tuning) |
| Loss computation | Full sequence (system + user + assistant) | Completion only (assistant response tokens) |
| Validation | None (100% of data used for training) | 90/10 train/validation split with periodic evaluation |
| Learning rate | 2e-5 | 2e-4 to 5e-5 (QLoRA-recommended range) |
| Library stack | PEFT 0.12.0 + transformers 4.44.0 | Unsloth 2026.1.4 (2-3x faster, 70% less VRAM) |
| Evaluation | 3 hardcoded test questions | bbjcpl-based compile@1 metric + held-out test set |
| Artifact management | Adapters on training server only | Version-controlled with model cards |

Analysis of the bbjllm training methodology identified three issues that should be resolved before the next training iteration.

1. No validation set. The training script uses 100% of the 9,922 examples for training with zero held-out data. Without a validation set, there is no way to detect overfitting -- the model may memorize training examples rather than learning generalizable BBj patterns. The training loss will decrease regardless of whether the model is actually improving. Fix: Reserve 10% of examples as a validation set and configure evaluation_strategy="steps" to monitor validation loss during training.
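The fix is mechanical. A minimal sketch of the held-out split, assuming the examples are already loaded as a list of ChatML records (the function name and seed are illustrative, not from the bbjllm codebase):

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Deterministically shuffle and split examples into train/validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

examples = [{"id": i} for i in range(9922)]  # stand-in for the ChatML dataset
train, val = train_val_split(examples)
print(len(train), len(val))  # 8930 992
```

With Hugging Face datasets the same split is a one-liner (`dataset.train_test_split(test_size=0.1, seed=42)`); either way, the validation set must stay fixed across runs so loss curves remain comparable.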

2. Full-sequence loss computation. The training script computes loss on the entire sequence: system prompt, user question, and assistant response. The system prompt ("You are an expert BBj programmer...") appears identically in all 9,922 examples, and the user questions are fixed input text. Computing gradients on these constant tokens wastes an estimated 30-40% of the gradient signal on tokens the model does not need to learn to generate. Fix: Mask non-assistant tokens in the loss computation (set labels to -100 for system and user tokens), or use TRL's SFTTrainer which handles completion-only training automatically.
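The masking itself is a label copy with a sentinel. A pure-Python illustration (the token IDs are invented; real code operates on tokenizer output, and -100 is the ignore index that PyTorch's cross-entropy loss skips):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores positions labeled -100

def mask_prompt_tokens(input_ids, prompt_len):
    """Labels start as a copy of input_ids; prompt positions are masked out."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Invented IDs: 6 system+user tokens followed by 4 assistant-response tokens.
input_ids = [101, 2023, 2003, 1037, 3231, 2064, 9999, 8888, 7777, 102]
labels = mask_prompt_tokens(input_ids, prompt_len=6)
print(labels)  # [-100, -100, -100, -100, -100, -100, 9999, 8888, 7777, 102]
```

Only the last four positions now contribute gradient — exactly the "completion only" behavior described above.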

3. Instruct model choice. Fine-tuning Qwen2.5-Coder-32B-Instruct -- a model already trained to follow instructions -- on domain-specific BBj data risks degrading the instruction-following capability that makes the model useful in the first place. This phenomenon, known as the alignment tax, is discussed in detail in the Base Model Selection section below. Fix: Switch to a Base variant (14B-Base recommended) where domain adaptation starts from a clean foundation rather than overwriting existing instruction alignment.

The recommended approach described in the remainder of this chapter builds on bbjllm's foundation -- the same model family, the same QLoRA method, the same curated dataset -- while addressing these three gaps to improve training outcomes.

Base Model Selection

Choosing the right base model determines the ceiling for the fine-tuned result. The model must be capable enough to learn a new language from relatively few examples, licensed for commercial deployment, and available in a Base variant suitable for domain adaptation.

Decision: Qwen2.5-Coder-14B-Base as Primary Recommendation

Choice: Qwen2.5-Coder-14B-Base for fine-tuning, selected based on research into training outcomes and alignment characteristics.

Rationale: 14B shows greater improvement from fine-tuning than 7B (per Qwen technical report), remains trainable on a single 24GB GPU, and the Base variant avoids the alignment tax of fine-tuning an already instruction-tuned model.

Alternatives considered: 7B-Base (smaller but less fine-tuning headroom), 32B-Instruct (used by bbjllm; higher base quality but alignment tax makes domain adaptation counterproductive).

Status: Active research -- bbjllm experiment used 32B-Instruct; research recommends switching to 14B-Base for next training iteration.

Why Qwen2.5-Coder

The Qwen2.5-Coder family (released September 2024 by Alibaba's Qwen team) represents the current state of the art for open-source code models at fine-tunable sizes. The bbjllm experiment validated this choice, and the recommended approach stays within the same model family. Key facts:

  • Sizes available: 0.5B, 1.5B, 3B, 7B, 14B, 32B (both Base and Instruct variants)
  • Training data: 5.5 trillion tokens -- 70% code, 20% text, 10% math -- covering 92+ programming languages
  • Benchmarks: The 14B-Instruct variant exceeds the 7B on all code benchmarks, while the 32B-Instruct matches GPT-4o on code generation tasks. Critically, the Qwen technical report shows the 14B model demonstrates greater improvement from fine-tuning compared to smaller variants.
  • License: Apache 2.0 -- fully permissive for commercial use, modification, and redistribution
  • FIM support: Native fill-in-the-middle capability, critical for IDE code completion where the model must generate code between existing lines

Training Suitability Comparison

The choice between model variants is primarily about training suitability -- how well the model responds to fine-tuning on a niche language like BBj:

| Model | Parameters | Fine-Tuning Improvement | Base vs Instruct | Alignment Tax Risk | Recommendation |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B-Base | 7B | Moderate | Base (clean slate) | None | Starting point for experimentation |
| Qwen2.5-Coder-14B-Base | 14B | High (per Qwen technical report) | Base (clean slate) | None | Primary recommendation |
| Qwen2.5-Coder-32B-Instruct | 32B | Low (alignment tax) | Instruct (pre-aligned) | High | Not recommended for domain fine-tuning |

The 14B-Base model occupies the sweet spot: large enough to show substantial improvement from fine-tuning on new domain data, small enough to train on a single GPU, and available as a Base variant that avoids the alignment tax entirely.

The Alignment Tax

Fine-tuning an Instruct model on BBj data risks degrading its ability to follow instructions -- the very capability that makes it useful. This is the alignment tax: the hidden cost of fine-tuning a model that has already been trained to follow instructions.

The mechanism is straightforward. An Instruct model's weights encode both domain knowledge (how to write code) and instruction-following behavior (how to structure responses, handle multi-step prompts, refuse harmful requests). When you fine-tune on domain-specific BBj data, the training process cannot selectively update only the "code knowledge" portion of the weights. It overwrites both simultaneously. The model may learn BBj syntax but degrade its response quality, producing less structured answers, ignoring parts of complex prompts, or losing its ability to explain its reasoning.

Research confirms this risk. The Shadow-FT study (ICLR 2025) demonstrates that directly fine-tuning Instruct models on domain data can lead to "marginal improvements and even performance degeneration." The authors propose fine-tuning a Base model instead and grafting the weight deltas onto the Instruct version -- evidence that the research community recognizes Instruct fine-tuning as problematic for domain adaptation.

For BBj specifically, this is why the bbjllm experiment's choice of 32B-Instruct is identified as an area for improvement. The model may learn to generate syntactically valid BBj code, but it may simultaneously lose the instruction-following quality that makes a coding assistant useful -- producing code without adequate explanation, misunderstanding multi-part prompts, or generating responses that do not address what was asked.

This is why the recommended approach uses 14B-Base, not 32B-Instruct. Starting from a Base model means the two-stage training process (continued pretraining for syntax, instruction fine-tuning for response quality) builds both capabilities from a clean foundation rather than risking degradation of existing alignment.

Landscape Comparison (as of February 2026)

Beyond the three Qwen variants under active consideration, the broader code model landscape provides additional context:

| Model | Size | FIM | License | Notes |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder (7B/14B/32B) | 7-32B dense | Yes | Apache 2.0 | Recommended family. Proven fine-tuning ecosystem, Base + Instruct variants |
| Qwen3 dense (0.6B-32B) | 0.6-32B dense | Yes | Apache 2.0 | Newer architecture; not yet evaluated for BBj fine-tuning |
| Qwen3-Coder | 480B/30B MoE | Yes | Apache 2.0 | MoE-only (no dense variants); impractical for single-GPU fine-tuning |
| CodeLlama-7B | 7B dense | Yes | Llama 2 | Superseded by Qwen family on all code benchmarks |
| StarCoder2-7B | 7B dense | Yes | BigCode OpenRAIL-M | Superseded by Qwen family on all code benchmarks |

The original strategy paper (January 2025) listed CodeLlama, DeepSeek Coder, and StarCoder2 as candidates. All three have been superseded by Qwen2.5-Coder on code generation benchmarks. Qwen3-Coder (released July 2025) offers impressive capabilities but ships only in large MoE sizes that are impractical for single-GPU fine-tuning and customer self-hosting. The Qwen3 dense models (non-Coder variants) exist but have not been evaluated for BBj fine-tuning. As the Qwen3 ecosystem matures, it may become the preferred base.

Model selection is not a permanent decision. The fine-tuning pipeline described below is model-agnostic -- when a better base model emerges, we retrain on the same curated data.

Training Data Structure

The quality of training data determines the quality of the fine-tuned model. For a low-resource language like BBj, this is the single most important factor -- more important than model size, hyperparameter tuning, or training duration.

Format and Schema

Each training example is a Markdown file with YAML front matter for metadata and structured content sections. The critical design choice is generation labeling: every example is tagged with the BBj generation(s) it applies to, so the model learns to distinguish between character UI patterns, Visual PRO/5 idioms, modern BBj GUI code, and DWC browser-based patterns.

training-data/gui/hello-window.md
---
title: "Hello World Window"
type: completion
generation: ["bbj-gui", "dwc"]
difficulty: basic
tags: [gui, window, sysgui, hello-world]
description: "Create and display a basic BBj window"
---

## Code

```bbj
REM Hello World Window - Modern BBj
sysgui! = BBjAPI().getSysGui()
window! = sysgui!.addWindow(100, 100, 400, 300, "Hello World")
window!.setCallback(window!.ON_CLOSE, "handleClose")

process_events

handleClose:
release
```

## Expected Behavior

A 400x300 pixel window appears at screen position (100,100)
with the title "Hello World". The window remains open until
the user closes it, at which point the program terminates cleanly.

## Explanation

1. **Get GUI manager**: `BBjAPI().getSysGui()` returns the
system GUI interface
2. **Create window**: `addWindow(x, y, width, height, title)`
creates a top-level window
3. **Handle close event**: `setCallback()` connects the
window's close event to a label
4. **Event loop**: `process_events` starts the BBj event
processing loop
5. **Cleanup**: `release` frees all resources and exits

The generation label uses a simple schema:

| Label | Scope | Examples |
| --- | --- | --- |
| `"all"` | Universal patterns | FOR/NEXT loops, file I/O, string functions |
| `"character"` | Character UI (1980s) | `PRINT @(x,y)`, `INPUT` |
| `"vpro5"` | Visual PRO/5 (1990s) | `PRINT (sysgui)'WINDOW'(...)`, `PRINT (sysgui)'BUTTON'(...)`, `CTRL(sysgui,id,index)` |
| `"bbj-gui"` | BBj GUI/Swing (2000s) | `BBjAPI().getSysGui()`, `addWindow()` |
| `"dwc"` | DWC/Browser (2010s+) | `getWebManager()`, `executeAsyncScript` |
| `["bbj-gui", "dwc"]` | Subset array | Patterns shared by modern generations |

Example Types

Training examples fall into four categories, each teaching the model a different capability:

  • Comprehension -- "Explain this BBj code." The model learns to read and describe legacy patterns, a prerequisite for migration assistance.
  • Completion -- "Complete this code." The model learns to generate syntactically valid BBj that matches the surrounding generation context.
  • Migration -- "Convert this Visual PRO/5 code to modern BBj." The model learns to bridge between generations.
  • Explanation -- "What does BBjAPI().getSysGui() do?" The model learns API semantics.

BBj Code in Training Data

To illustrate what the model learns from, here is a modern BBj event handler -- the kind of pattern that generic LLMs consistently fabricate:

Modern BBj Event Handler
```bbj
class public OrderForm

    field private BBjTopLevelWindow window!
    field private BBjEditBox customerField!
    field private BBjButton saveButton!

    method public void create()
        sysgui! = BBjAPI().getSysGui()
        #window! = sysgui!.addWindow(100, 100, 600, 400, "Order Entry")
        #customerField! = #window!.addEditBox(201, 80, 30, 200, 25)
        #saveButton! = #window!.addButton(202, 80, 350, 100, 25, "Save")
        #saveButton!.setCallback(#saveButton!.ON_BUTTON_PUSH, #this!, "onSave")
    methodend

    method public void onSave(BBjButtonPushEvent event!)
        customer$ = #customerField!.getText()
        rem Process the order...
    methodend

classend
```

A fine-tuned model that has seen hundreds of examples like this will understand that ! suffixes denote object references, that # prefixes reference instance fields, and that methodend closes a method block -- none of which a generic LLM knows.

Volume and Quality Targets

Current research consistently shows that data quality outweighs data quantity for instruction fine-tuning. One thousand carefully curated, expert-reviewed examples can outperform ten thousand hastily generated ones.

The data collection strategy combines three efforts:

  1. bbjllm dataset: 9,922 ChatML examples created independently, covering BBj comprehension, completion, and explanation tasks. This dataset has known quality issues (duplicate entries, formatting inconsistencies) that should be addressed before the next training iteration.
  2. training-data/ repository (this repo): 2 seed examples in Markdown format with JSON Schema validation, organized across 7 topic directories ready for expansion. This is the canonical format for new contributions.
  3. Iterative expansion (ongoing): Analyze model failures in evaluation, create targeted examples for weak areas, retrain.

Decision: Quality-First Data Strategy

Choice: Start with a smaller, expert-curated dataset rather than attempting to scrape or auto-generate tens of thousands of examples upfront.

Rationale: For low-resource languages, training data quality is the dominant factor in model performance. A fine-tuned 7B model on high-quality data can match a 70B general model on domain-specific tasks. Investing engineering time in data curation yields better returns than investing in larger models or longer training.

Alternatives considered: Bulk scraping of BBj source repositories (risk of including broken/outdated code), fully automated synthetic generation (risk of compounding errors without human review).

Status: The bbjllm repository contains 9,922 ChatML examples created independently. The training-data/ repository provides 2 seed examples in Markdown format with JSON Schema validation and contributor guides. Conversion pipeline from Markdown to ChatML is planned.

Training Data Pipeline

Two repositories contribute to the training data, each serving a different role:

training-data/ (this repository): The canonical source for new training examples. Contributors create Markdown files with YAML front matter specifying the example type, target generation(s), and difficulty level. Files are validated against a JSON Schema and organized by topic (gui/, file-io/, etc.). This format is human-readable, GitHub-renderable, and designed for expert review.

bbjllm (separate repository): Contains 9,922 ChatML JSONL examples -- the actual input to the training script. These examples were created independently using a different workflow, not converted from the training-data/ Markdown format.

Conversion pipeline (planned): A convert_to_chatml.py script will transform training-data/ Markdown examples into ChatML JSONL suitable for training. This pipeline does not yet exist. The bbjllm examples and training-data/ examples are currently disconnected -- unifying them through a conversion pipeline is a planned improvement.
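As a sketch of what the planned converter might do -- hypothetical in every detail, since convert_to_chatml.py does not yet exist -- the transformation is mostly front-matter parsing plus message assembly. The ad-hoc parser below handles only simple `key: value` front-matter lines, and the fixed system prompt is an invented placeholder:

```python
import json

def markdown_to_chatml(md_text):
    """Hypothetical sketch: turn one training-data/ Markdown example into a ChatML record."""
    # Split off the YAML front matter (between the first two '---' delimiters).
    _, front, body = md_text.split("---", 2)
    meta = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    # Pull the fenced BBj code out of the body.
    code = body.split("```bbj")[1].split("```")[0].strip()
    return {
        "messages": [
            {"role": "system", "content": "You are an expert BBj programmer."},
            {"role": "user", "content": meta.get("description", "")},
            {"role": "assistant", "content": code},
        ]
    }

md = '---\ntitle: "Demo"\ndescription: "Create a window"\n---\n\n## Code\n\n```bbj\nrelease\n```\n'
record = markdown_to_chatml(md)
print(json.dumps(record)[:60])
```

A real pipeline would validate against the JSON Schema first and emit one JSONL line per example, but the shape of the output -- a `messages` list ready for the chat template -- is the essential part.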

The intended flow from contribution to deployment: Markdown example in training-data/ -> JSON Schema validation -> conversion to ChatML JSONL -> QLoRA training -> bbjcpl-based evaluation -> GGUF export -> Ollama deployment.

The bbjllm dataset also has known quality issues -- approximately 375 duplicate entries and 60 examples with formatting inconsistencies -- that should be cleaned before the next training run. These are manageable data preprocessing tasks, not fundamental problems with the dataset.

The QLoRA Fine-Tuning Approach

Full fine-tuning of a 14B parameter model requires updating all ~14 billion weights, demanding multiple high-end GPUs and substantial memory. QLoRA (Quantized Low-Rank Adaptation) achieves comparable results at a fraction of the cost by freezing the base model weights and training only small adapter matrices.

How LoRA Works

LoRA (Low-Rank Adaptation) decomposes weight updates into two small matrices rather than modifying the full weight matrix. For a weight matrix W of dimension d × d, instead of computing the full update ΔW (d × d parameters), LoRA factorizes it as ΔW = A × B, where A is d × r and B is r × d, with rank r typically between 16 and 64. This reduces the trainable parameters from d² to 2dr -- a reduction of over 99% for typical model dimensions.
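The reduction claim can be checked with one line of arithmetic. The sketch below uses d = 5120 as an illustrative hidden dimension (on the order of a 14B-class model, not a measured value) and shows how the savings vary with rank:

```python
d = 5120  # illustrative hidden dimension, on the order of a 14B-class model

for r in (16, 32, 64):
    full = d * d        # parameters in a full delta-W for one d x d matrix
    lora = 2 * d * r    # parameters in the LoRA factors A (d x r) and B (r x d)
    print(f"r={r}: {lora:,} vs {full:,} trainable params "
          f"({1 - lora / full:.2%} reduction)")
```

At rank 16 the per-matrix reduction exceeds 99%; at the higher ranks recommended for learning a new language it is still roughly 98-99%.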

QLoRA adds quantization: the frozen base model weights are stored in 4-bit precision (NF4 format), reducing memory by ~75% compared to full 16-bit weights. Only the LoRA adapter matrices are trained in higher precision.

The result:

| Approach | VRAM Required (14B) | Hardware | Approximate Cost |
| --- | --- | --- | --- |
| Full fine-tuning (FP16) | 120-160 GB | 8x A100 80GB | $100,000+ |
| LoRA (FP16 base) | 40-56 GB | 2-4x A100 | $30,000+ |
| QLoRA (NF4 base) | 16-20 GB | 1x RTX 4090 | ~$1,500 |

QLoRA on a single RTX 4090 makes fine-tuning a 14B model accessible on hardware that would be orders of magnitude more expensive for full fine-tuning. This is not a compromise -- research shows QLoRA matches full fine-tuning quality with no measurable accuracy loss.

Based on current best practices for code model fine-tuning via Unsloth:

| Parameter | Value | Rationale |
| --- | --- | --- |
| LoRA rank (r) | 32-64 | Higher rank for learning a new language; 16 is typical for style tuning |
| LoRA alpha | 2x rank | Standard scaling factor |
| LoRA target | All linear layers | Apply to attention AND MLP layers -- not just attention |
| Quantization | NF4 (4-bit) | QLoRA default; best memory/quality tradeoff |
| Learning rate | 2e-4 to 5e-5 | Lower end for continued pretraining, higher for instruction tuning |
| Epochs | 1-3 | More risks overfitting on small datasets |
| Batch size | 4-8 (with gradient accumulation) | Effective batch size of 32-64 |
| Max sequence length | 2048-4096 | BBj functions are typically compact |
| Completion masking | Assistant tokens only | Mask system/user tokens with -100 in labels |

A critical detail: apply LoRA to all linear layers, not just the attention matrices. Recent research confirms that including MLP layers in the LoRA adaptation significantly improves performance on code tasks compared to attention-only LoRA.

A note on learning rate: The bbjllm training script uses a learning rate of 2e-5, which is 5-10x lower than the recommended QLoRA range of 2e-4 to 5e-5. At this rate, adapter weights may not move far enough from initialization to meaningfully encode BBj knowledge -- the model produces plausible output because the base model is already capable, but the fine-tuning adds minimal value. The learning rate range in the table above is the correct starting point for QLoRA.

Completion masking is equally important: training should compute loss only on the assistant's response tokens, not on the system prompt or user question. The system prompt appears identically in all examples, and computing gradients on these constant tokens wastes 30-40% of the gradient signal. TRL's SFTTrainer handles this automatically when configured for completion-only training.

Two-Stage Training Approach

For a language with near-zero representation in the base model's training data, a two-stage approach produces better results than jumping directly to instruction fine-tuning:

Stage 1 -- Continued Pretraining: Feed the model raw BBj source code (without instruction/response formatting) so it learns the language's syntax, token patterns, and idioms. This builds foundational understanding. The base model needs to learn that ! suffixes denote object references, that # prefixes reference instance fields, that methodend closes a method block, and that process_events is a control flow statement -- none of which appear in its pre-training data.

Stage 2 -- Instruction Fine-Tuning: Train on the ChatML examples (comprehension, completion, migration) so the model learns to follow instructions and produce useful outputs in the BBj domain. This is where the 9,922 examples from bbjllm -- plus any additional examples from the training-data/ repository -- provide the instruction-following capability.

The bbjllm experiment skipped Stage 1 entirely, going directly to instruction fine-tuning on the 32B-Instruct model. The recommended approach includes Stage 1 because BBj has near-zero representation in the base model's pre-training data. Without continued pretraining, the model must simultaneously learn BBj syntax and learn to follow instructions about BBj -- two distinct learning objectives that compete for the same gradient updates. Separating them into stages allows each to be optimized independently.

Stage 1 matters specifically for zero-representation languages because the base model's tokenizer was not designed for BBj. Common BBj tokens like methodend, classend, BBjAPI, and sysgui! will be split into subword fragments. Continued pretraining on raw BBj source code teaches the model's attention layers to recognize these fragment patterns as coherent constructs, even though the tokenizer splits them.
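The two stages consume the same knowledge in different shapes. A sketch of the serialization difference, using Qwen's ChatML template markers (`<|im_start|>`/`<|im_end|>`); the system and user strings are invented examples:

```python
bbj_source = 'sysgui! = BBjAPI().getSysGui()\nprocess_events\nrelease'

# Stage 1 -- continued pretraining: raw source, no chat structure at all.
stage1_sample = bbj_source

# Stage 2 -- instruction fine-tuning: the same knowledge wrapped in ChatML.
def to_chatml(system, user, assistant):
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            f"<|im_start|>assistant\n{assistant}<|im_end|>")

stage2_sample = to_chatml(
    "You are an expert BBj programmer.",
    "Write a minimal BBj program that starts the event loop.",
    bbj_source,
)
print(stage2_sample.count("<|im_start|>"))  # 3
```

In practice the template is applied by the tokenizer (`tokenizer.apply_chat_template`), but the point stands: Stage 1 teaches the model what BBj looks like, Stage 2 teaches it when and how to produce BBj inside a conversation.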

Avoiding Catastrophic Forgetting

A persistent risk in fine-tuning is catastrophic forgetting -- the model loses general capabilities while acquiring domain-specific ones. A BBj-fine-tuned model that can no longer write Python or explain algorithms is less useful than one that retains broad knowledge while adding BBj expertise.

Mitigations:

  • LoRA inherently helps -- by only modifying small adapter weights, the base model's general knowledge is largely preserved.
  • Mixed training data -- include some general code examples (Python, Java, JavaScript) in the training mix to reinforce broad capabilities.
  • Evaluation on both domains -- always measure performance on general code benchmarks (HumanEval) alongside BBj-specific benchmarks. If general performance drops more than 5%, adjust the training mix.

Evaluation Methodology

Without a BBj-specific evaluation framework, improvements from fine-tuning cannot be measured. No public BBj benchmark exists -- the language is too niche for standard code evaluation suites like HumanEval or MBPP. A custom evaluation approach is required, built around BBj's unique advantage: the bbjcpl compiler provides ground-truth syntax validation.

compile@1: Automated Syntax Validation

The compile@1 metric measures the percentage of generated BBj code samples that compile successfully on first attempt using the bbjcpl compiler.

The evaluation process:

  1. Prompt the model with N test cases (natural language descriptions of BBj programs)
  2. Collect the generated code for each test case
  3. Validate each sample with bbjcpl -N (the -N flag compiles without linking, checking syntax only)
  4. Compute the pass rate: compile@1 = (samples that compile) / (total samples)
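Steps 1-4 reduce to a pass rate over compiler exit codes. In a real harness each code would come from something like `subprocess.run(["bbjcpl", "-N", path]).returncode`; the scoring logic itself is trivial:

```python
def compile_at_1(exit_codes):
    """compile@1 = fraction of generated samples whose `bbjcpl -N` run exits 0."""
    if not exit_codes:
        raise ValueError("no samples to score")
    return sum(1 for code in exit_codes if code == 0) / len(exit_codes)

# Exit codes from a hypothetical 10-sample run: 0 = compiled, nonzero = error.
codes = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
print(f"compile@1 = {compile_at_1(codes):.0%}")  # compile@1 = 70%
```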

This metric leverages BBj's "secret weapon": the bbjcpl compiler IS ground truth for syntactic correctness. Unlike natural language evaluation (which requires expensive human review or unreliable LLM-as-judge approaches), compilation is binary and deterministic. A BBj program either compiles or it does not. This creates a unique, non-gameable metric that most niche-language fine-tuning efforts lack.

compile@1 does not measure whether the code is correct (it may compile but produce wrong output), but it establishes a necessary floor: code that does not compile is definitively wrong. For a language where generic LLMs consistently fabricate syntax (.addEventListener() instead of .setCallback(), missing methodend, inventing non-existent API methods), compile@1 is the most important first metric.

Qualitative Evaluation

Compilation alone does not guarantee useful code. Human review evaluates dimensions that automated metrics cannot:

  • Code quality: Does the generated code follow BBj conventions? Proper use of ! suffixes for object references, # prefixes for instance fields, REM comments, and appropriate error handling.
  • Idiomatic patterns: Does the model use generation-appropriate idioms? DWC patterns should use getWebManager() and modern event handling, not legacy PRINT (sysgui)'WINDOW'(...) syntax. Character-mode examples should use PRINT @(x,y), not GUI calls.
  • Documentation quality: Are explanations accurate and helpful? When asked to explain code, does the model correctly identify BBj-specific constructs and their purposes?

Qualitative evaluation is subjective and time-consuming, but it catches failure modes that compile@1 misses: code that compiles but uses deprecated patterns, code that works but is not idiomatic for the target generation, and explanations that are technically incorrect despite sounding plausible.

Baseline Comparison

Evaluation results are meaningful only in comparison. Three baselines establish the performance spectrum:

  1. Qwen2.5-Coder-14B-Base (unmodified): What can the base model do before any fine-tuning? This is the floor. If the fine-tuned model does not beat this baseline on compile@1, fine-tuning has not helped.
  2. Claude API (current system): What does the current RAG + Claude approach achieve? This is the bar to clear for practical deployment. The fine-tuned model does not need to match Claude's general reasoning, but it should approach Claude's BBj-specific output quality.
  3. bbjllm 32B output: What did the existing fine-tuning experiment produce? This is the direct comparison point -- the recommended approach should outperform bbjllm's output, demonstrating that the methodology changes (Base model, two-stage training, completion masking) translate to measurable improvement.

Run the same test set against all three baselines. Report results side-by-side to make improvement (or lack thereof) unambiguous.

Test Set Structure

A well-constructed test set is the foundation of reliable evaluation:

  • Held-out split: Reserve 10% of training data for evaluation -- these examples are never used during training. The training script must enforce this split consistently across runs.
  • Category coverage: Test cases should span all generation labels (all, character, vpro5, bbj-gui, dwc) and all example types (comprehension, completion, migration, explanation). Evaluation results broken down by category reveal which areas the model has learned well and which need more training data.
  • Size: A minimum of 50-100 test cases is needed for statistically meaningful results. Smaller test sets produce noisy metrics where a single test case swings the score by 1-2 percentage points.
  • Difficulty distribution: Include basic, intermediate, and advanced examples. A model that scores 90% on basic examples but 10% on advanced ones has a different profile than one scoring 50% uniformly.
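Because each test case carries a generation label and example type, the per-category breakdown called for above falls out of the same exit codes. A sketch (category names follow the generation schema; the data is invented):

```python
from collections import defaultdict

def compile_at_1_by_category(results):
    """Per-category compile@1; `results` is an iterable of (category, exit_code)."""
    buckets = defaultdict(list)
    for category, exit_code in results:
        buckets[category].append(exit_code == 0)
    return {cat: sum(passed) / len(passed) for cat, passed in buckets.items()}

results = [("bbj-gui", 0), ("bbj-gui", 1), ("dwc", 0), ("vpro5", 1), ("vpro5", 1)]
print(compile_at_1_by_category(results))
# {'bbj-gui': 0.5, 'dwc': 1.0, 'vpro5': 0.0}
```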

Sample Evaluation Test Case

To make the methodology concrete, here is what a single evaluation test case looks like end-to-end:

Prompt: "Write a BBj program that creates a window with a button. When the button is clicked, display a message box saying 'Hello'."

PASS example -- syntactically valid BBj that compiles:

```bbj
REM Create window with button and message box
sysgui! = BBjAPI().getSysGui()
window! = sysgui!.addWindow(100, 100, 400, 300, "Button Demo")
button! = window!.addButton(1, 50, 50, 120, 30, "Click Me")
button!.setCallback(button!.ON_BUTTON_PUSH, "onButtonClick")
window!.setCallback(window!.ON_CLOSE, "onClose")

process_events

onButtonClick:
    i = msgbox("Hello")
    return

onClose:
    release
```

```
$ bbjcpl -N button_demo.bbj
$ echo $?
0
```

The compiler returns exit code 0 -- the code compiles successfully. This test case passes compile@1.

FAIL example -- code with a common LLM fabrication error:

```bbj
REM Create window with button - INCORRECT
sysgui! = BBjAPI().getSysGui()
window! = sysgui!.addWindow(100, 100, 400, 300, "Button Demo")
button! = window!.addButton(1, 50, 50, 120, 30, "Click Me")
button!.addEventListener("click", "onButtonClick")
window!.setCallback(window!.ON_CLOSE, "onClose")

process_events

onButtonClick:
    alert("Hello")
    return

onClose:
    release
```

```
$ bbjcpl -N button_demo_bad.bbj
**Error on line 5: Method addEventListener not found in BBjButton
**Error on line 11: alert is not a recognized function
$ echo $?
1
```

The compiler returns exit code 1 with specific error messages. The model fabricated .addEventListener() (a JavaScript pattern) instead of using BBj's .setCallback(), and used alert() (JavaScript) instead of msgbox() (BBj). These are exactly the kinds of errors that generic LLMs make when they lack BBj training data -- and exactly what fine-tuning should fix.

Reporting Format

Evaluation results should be reported in a standardized format for comparison across training runs:

| Model | compile@1 | Qualitative | Date | Notes |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder-14B-Base (unmodified) | --% | -- | -- | Baseline (pre-fine-tuning) |
| bbjllm 32B-Instruct | --% | -- | -- | Previous experiment |
| Claude API + RAG | --% | -- | -- | Current system |
| 14B-Base fine-tuned (v1) | --% | -- | -- | First recommended-approach run |

Dashes indicate that these evaluations have not yet been run -- the evaluation framework itself is planned. Building it before the next training iteration is a prerequisite for measuring whether methodology changes actually improve outcomes.

Toolchain: Unsloth + llama.cpp + Ollama

The fine-tuning-to-deployment pipeline uses three tools, each handling a distinct stage:

Unsloth -- Fine-Tuning

Unsloth (2026.1.4) is the recommended training framework for QLoRA fine-tuning. Compared to vanilla Hugging Face Transformers + PEFT:

  • 2-3x training speed through custom CUDA kernels
  • 70% less VRAM usage via aggressive memory optimization
  • 0% accuracy loss -- same mathematical operations, just more efficient execution
  • Dynamic 4-bit Quantization -- selectively preserves higher precision for critical parameters, improving accuracy over uniform NF4 quantization
  • 500K context support -- enables training on very long sequences when needed
  • Built-in GGUF export -- can eliminate the llama.cpp conversion step in some workflows

Unsloth natively supports Qwen2.5-Coder and integrates with Hugging Face datasets for data loading. It is fully compatible with the Hugging Face ecosystem (PEFT, TRL, transformers) while providing batteries-included training workflows.

Alternative frameworks worth noting: LLaMA-Factory provides a web UI for fine-tuning configuration, and Axolotl supports advanced training recipes. Both are viable, but Unsloth's speed and memory advantages make it the preferred choice for single-GPU setups.
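As a rough sketch of what an Unsloth-based entry point could look like: the LoRA rank 32 / alpha 64 values come from the bbjllm configuration described earlier, and the ChatML layout matches the Qwen template used throughout this chapter, but the model repo ID, sequence length, and helper names are assumptions rather than the project's actual training script.

```python
def to_chatml(system: str, user: str, assistant: str) -> str:
    """Render one training example in the ChatML layout used by Qwen2.5-Coder."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )

def train() -> None:
    """QLoRA fine-tuning sketch. Run on a CUDA machine with unsloth installed;
    defined but not executed here."""
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="Qwen/Qwen2.5-Coder-14B",  # base variant, not -Instruct
        max_seq_length=4096,
        load_in_4bit=True,                    # QLoRA: 4-bit quantized base
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=32,            # LoRA rank, matching the bbjllm configuration
        lora_alpha=64,
    )
    # From here, hand `model` plus ChatML-formatted examples to TRL's
    # SFTTrainer, with completion masking enabled so loss is computed
    # on assistant responses only.
```

The `to_chatml` helper is the piece worth standardizing early: the same formatting function should feed both the instruction fine-tuning stage and the evaluation prompts, so train/eval formats cannot drift apart.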

Version Comparison

The bbjllm training script pins library versions via starttrain.sh. Several of these are significantly behind current releases, including one critical bug fix:

| Library | bbjllm Version | Current (Feb 2026) | Key Changes |
| --- | --- | --- | --- |
| transformers | 4.44.0 | 5.1.0 | Major v5 release (first in 5 years), significant API changes |
| peft | 0.12.0 | 0.18.1 | 6 minor versions behind, Python 3.9 support dropped |
| bitsandbytes | 0.43.0 | 0.49.1 | Critical: 0.43.0 has QLoRA memory bug wasting 5-10GB VRAM (fixed in 0.43.2) |
| trl | (not used) | 0.27.x | SFTTrainer with built-in completion masking and packing |
| Unsloth | (not used) | 2026.1.4 | Recommended replacement for raw PEFT; 2-3x faster, 70% less VRAM |

Before the next training run, update at minimum bitsandbytes (critical bug fix that recovers 5-10GB of wasted VRAM) and add trl for completion masking support. The recommended approach uses Unsloth, which bundles compatible versions of all libraries and handles version management automatically.

llama.cpp -- GGUF Conversion

After training, the merged model weights (in Hugging Face Safetensors format) need to be converted to GGUF format for efficient CPU/GPU inference. The convert_hf_to_gguf.py script from llama.cpp handles this conversion, including optional quantization to reduce model size.

Common quantization levels for the 14B model:

| Format | Size (14B model) | Quality | Use Case |
| --- | --- | --- | --- |
| F16 | ~28 GB | Full | Development/evaluation |
| Q8_0 | ~14 GB | Near-full | High-end workstations |
| Q4_0 | ~8 GB | Good | Default for deployment |
| Q4_K_M | ~8.5 GB | Better than Q4_0 | Recommended balance |
| Q2_K | ~5 GB | Reduced | Low-resource environments |

For BBj code generation, where output quality matters more than inference speed, Q4_K_M provides the best quality-to-size ratio and is the recommended deployment format. At ~8.5 GB, it fits comfortably on workstations with 16 GB or more of RAM.
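The two-step conversion (Safetensors to F16 GGUF, then quantization to Q4_K_M) can be scripted. The tool names (`convert_hf_to_gguf.py`, `llama-quantize`) come from the llama.cpp project; the checkout location, directory names, and output naming below are assumptions for illustration.

```python
import subprocess

def conversion_commands(merged_dir: str, out_stem: str,
                        quant: str = "Q4_K_M") -> list[list[str]]:
    """Build the two llama.cpp commands: HF safetensors -> F16 GGUF -> quantized GGUF.

    The llama.cpp paths are assumptions; adjust to the local checkout.
    """
    f16 = f"{out_stem}-f16.gguf"
    quantized = f"{out_stem}-{quant.lower()}.gguf"
    return [
        ["python", "llama.cpp/convert_hf_to_gguf.py", merged_dir,
         "--outfile", f16, "--outtype", "f16"],
        ["llama.cpp/build/bin/llama-quantize", f16, quantized, quant],
    ]

def convert(merged_dir: str, out_stem: str) -> None:
    """Run both steps, failing fast if either tool reports an error."""
    for cmd in conversion_commands(merged_dir, out_stem):
        subprocess.run(cmd, check=True)
```

Called as `convert("merged-14b", "bbj-coder-14b")`, this produces `bbj-coder-14b-q4_k_m.gguf`, the file name the Modelfile below expects.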

Ollama Modelfile

The final step is packaging the GGUF model for Ollama. A Modelfile is a simple text file that tells Ollama how to load and configure the model:

Modelfile
FROM ./bbj-coder-14b-q4_k_m.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

SYSTEM "You are a BBj programming assistant. You understand all four generations of BBj: character UI, Visual PRO/5, BBj GUI/Swing, and DWC/browser. When generating code, match the generation context of the surrounding code. Default to modern DWC patterns for new projects."

PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

Creating the model in Ollama is then a single command:

ollama create bbj-coder -f Modelfile

Training Workflow

Fine-tuning is not a one-time event. Each training run produces artifacts that must be preserved, and each evaluation cycle informs the next iteration. This section describes the practical workflow for managing training runs.

Artifact Management

Each training run produces several artifact types with different storage requirements:

  • LoRA adapter weights (100-300 MB): Commit to the repository with Git LFS. These are the primary training output -- small enough for version control, and critical to preserve. Currently, adapter weights exist only on the training server. If that server is rebuilt, all training results are lost.
  • Merged model weights (full precision, ~28 GB for 14B): Store on the training server. These are intermediate artifacts used for GGUF conversion and do not need to be committed to version control.
  • GGUF quantized models (~8.5 GB for Q4_K_M): Store in a model registry for distribution to Ollama deployments. These are the deployment artifacts that reach end users.
  • Training logs and metrics: Track with Weights & Biases (free tier) or TensorBoard for run-to-run comparison. The bbjllm training script currently uses report_to="none", which means loss curves are only visible in stdout and cannot be compared across runs.

What to Commit Back

After each training run, commit the following to the repository:

  1. Updated adapter weights via Git LFS -- these are the versioned training output
  2. Model card documenting: hyperparameters used, training metrics (final training loss, validation loss), evaluation results (compile@1 score and baseline comparison), and training date and duration
  3. Dataset changes -- any deduplication, corrections, or new examples added since the previous run

The model card is particularly important. Without it, there is no record of what changed between training runs or why one model performs differently from another. Each card should be a Markdown file committed alongside the adapter weights.
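A small generator makes it easy to commit a consistent card with every run. This template is purely illustrative -- the section names and field set are assumptions, not an established format for this project.

```python
from datetime import date

def model_card(run_name: str, hyperparams: dict, metrics: dict,
               notes: str = "") -> str:
    """Render a minimal Markdown model card for one training run."""
    lines = [
        f"# Model card: {run_name}",
        "",
        f"Date: {date.today().isoformat()}",
        "",
        "## Hyperparameters",
    ]
    lines += [f"- {k}: {v}" for k, v in hyperparams.items()]
    lines += ["", "## Metrics"]
    lines += [f"- {k}: {v}" for k, v in metrics.items()]
    if notes:
        lines += ["", "## Notes", notes]
    return "\n".join(lines) + "\n"
```

For example, `model_card("bbj-coder-14b v1", {"lora_rank": 32, "lora_alpha": 64}, {"compile@1": "--"})` yields a Markdown file ready to commit alongside the adapter weights, with metric placeholders filled in once evaluation has run.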

Iterative Improvement Process

The training cycle follows a straightforward loop:

  1. Evaluate the current model against baselines (compile@1 for automated scoring, qualitative review for code quality)
  2. Identify weak areas -- which generation labels score lowest? Which example types produce the most compilation errors?
  3. Create targeted training examples for the weak areas identified in step 2
  4. Retrain with the updated dataset (including any data quality fixes)
  5. Evaluate again -- compare results to the previous run using the same test set
  6. If improved, export to GGUF and update the Ollama deployment

Each iteration should change only one variable at a time (dataset changes OR hyperparameter changes, not both) so that improvements or regressions can be attributed to a specific change. Track every run in the experiment tracker so the team can identify which changes had the most impact.

Hosting via Ollama

Self-hosted inference via Ollama is a deliberate architectural choice, not just a deployment convenience. It addresses the two most common objections to AI tooling in enterprise environments: data privacy and ongoing costs.

Decision: Ollama for Local Model Serving

Choice: Ollama (v0.15.x) as the inference runtime for the fine-tuned BBj model. Customers self-host on their own hardware.

Rationale: Zero per-query API costs after initial setup. Customer source code never leaves their network. OpenAI-compatible API means existing tooling (IDE extensions, chat interfaces) can integrate without custom adapters. Cross-platform support (macOS, Windows, Linux) with native desktop applications as of mid-2025.

Alternatives considered: vLLM (higher throughput but more complex deployment, better suited for centralized team serving with LoRA hot-swapping), llama.cpp server directly (lower-level, less user-friendly), cloud API (privacy concerns, ongoing costs).

Status: Ollama infrastructure validated for internal exploration. Model packaging workflow defined but not yet automated.

Why Self-Hosting Matters

For BBj customers, the source code being processed by AI tools often represents decades of business logic -- proprietary algorithms, customer data handling, financial calculations. Sending this to a cloud API is a non-starter for many organizations.

With Ollama:

  • Source code stays on-premises. The model runs on the customer's hardware, behind their firewall.
  • No per-query costs. Once the model is deployed, usage is unlimited. This matters when IDE completions can generate hundreds of inference requests per developer per day.
  • Air-gapped operation. The model works without an internet connection, critical for customers in regulated industries or secure environments.
  • Simple deployment. Customers install Ollama and run a single command: ollama run bbj-coder.

Hardware Requirements for Inference

| Tier | Hardware | Model Size | Performance | Audience |
| --- | --- | --- | --- | --- |
| Minimum | 16GB RAM, any modern CPU | 14B Q4_K_M (~8.5 GB) | ~5-10 tokens/sec | Individual developer |
| Recommended | 24GB+ RAM, GPU with 12GB+ VRAM | 14B Q4_K_M (~8.5 GB) | ~20-40 tokens/sec | Individual developer |
| Team | 32GB+ RAM, RTX 3090/4090 | 14B Q4_K_M (~8.5 GB) | ~30-50 tokens/sec | Shared inference server |
| Enterprise | 64GB+ RAM, A100/H100 | 14B Q8_0 (~14 GB) | ~40-60 tokens/sec | Organization-wide |

The 14B Q4_K_M model at approximately 8.5 GB fits on workstations with 16 GB of RAM, though a GPU is strongly recommended for interactive use cases like IDE code completion where latency matters.

API Compatibility

Ollama exposes an OpenAI-compatible API out of the box. This means any tool built against the OpenAI API -- including the VSCode extension and documentation chat system described in later chapters -- can point to a local Ollama instance by changing only the base URL:

# Instead of:
OPENAI_API_BASE=https://api.openai.com/v1

# Point to local Ollama:
OPENAI_API_BASE=http://localhost:11434/v1

As of Ollama v0.15.x, the OpenAI-compatible API also supports tool calling and structured JSON output -- features that enable more sophisticated integrations as the BBj AI tooling matures.
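A minimal client makes the point concrete: only the base URL distinguishes this from a call to the OpenAI API. The model name `bbj-coder` matches the `ollama create` command in this chapter; the rest is a standard chat-completions request, sketched here with only the Python standard library.

```python
import json
from urllib import request

OLLAMA_BASE = "http://localhost:11434/v1"  # local Ollama, not api.openai.com

def chat_payload(prompt: str, model: str = "bbj-coder") -> dict:
    """Build an OpenAI-style chat completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def complete(prompt: str) -> str:
    """POST to Ollama's OpenAI-compatible endpoint (requires Ollama running)."""
    body = json.dumps(chat_payload(prompt)).encode()
    req = request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Tools already built against the OpenAI client libraries need no code at all: pointing their configured base URL at the Ollama instance is sufficient.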

Deployment Architecture

Model updates are distributed as GGUF files (~8.5 GB for the 14B Q4_K_M model). Customers download new versions from a model registry (which could be a simple file server, Hugging Face Hub, or the Ollama model library) and update with ollama create bbj-coder -f Modelfile. No retraining on the customer side.

MCP Integration

The fine-tuned model described in this chapter is designed to be accessed through the BBj MCP server. Two MCP tools are currently operational: search_bbj_knowledge (RAG-powered documentation retrieval) and validate_bbj_syntax (bbjcpl-based compilation checking). These are defined in Chapter 2.

A third tool, generate_bbj_code, is planned but not yet implemented. When built, it will accept a natural language prompt, a target BBj generation, and optional surrounding code context. It will assemble RAG-retrieved documentation into an enriched prompt and forward it to the Ollama-hosted fine-tuned model -- the same model built through the QLoRA pipeline described above. This follows the RAFT pattern (Retrieval-Augmented Fine-Tuning): the fine-tuned model provides BBj syntax knowledge, while RAG provides current API signatures and documentation context.

Once generate_bbj_code is implemented, any MCP-compatible client -- Claude Desktop, Cursor, VS Code, or a custom application -- will be able to generate BBj code using the fine-tuned model without building custom Ollama integration code. The model's generation awareness, trained through the labeled examples in this chapter, will be available to every client through a single standard tool interface.
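Since generate_bbj_code is not yet implemented, the following is purely an illustrative sketch of what its MCP tool schema might look like. The three inputs (prompt, target generation, optional context) come from the description above; the exact field names, enum labels, and descriptions are assumptions.

```python
# Hypothetical MCP tool definition for the planned generate_bbj_code tool.
# Shape follows the MCP convention of a name, description, and JSON Schema
# for inputs; every concrete string here is illustrative.
GENERATE_BBJ_CODE_SCHEMA = {
    "name": "generate_bbj_code",
    "description": (
        "Generate BBj code from a natural language prompt using the "
        "fine-tuned model served by Ollama, enriched with RAG context."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "prompt": {
                "type": "string",
                "description": "Natural language description of the code to generate",
            },
            "generation": {
                "type": "string",
                "enum": ["character", "vpro5", "gui", "dwc"],
                "description": "Target BBj generation (labels are assumptions)",
            },
            "context": {
                "type": "string",
                "description": "Optional surrounding code for generation matching",
            },
        },
        "required": ["prompt"],
    },
}
```

Whatever the final field names turn out to be, keeping the generation enum identical to the labels used in the training data and the RAG database preserves the consistency the unified architecture depends on.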

For the complete MCP server architecture, tool schemas, and integration patterns, see Chapter 2: Strategic Architecture.

Current Status

Where Things Stand
  • Active research: bbjllm experiment (9,922 ChatML examples fine-tuned on Qwen2.5-Coder-32B-Instruct via QLoRA/PEFT). Research recommends switching to Qwen2.5-Coder-14B-Base with two-stage training approach to address identified gaps (validation, loss masking, model variant).
  • Operational: Training data repository with 2 seed examples, 7 topic directories, JSON Schema validation, and contributor guides. Toolchain components (Unsloth, llama.cpp, Ollama) are publicly available and actively maintained.
  • Planned: Evaluation suite using bbjcpl-based compile@1 metric. Training data conversion pipeline (training-data/ Markdown to ChatML JSONL). Training workflow with artifact management and iterative improvement.

The fine-tuned 14B model is the foundation that both the IDE extension and the planned documentation chat depend on. The generation labeling schema defined in this chapter is shared with the RAG database, ensuring consistency between what the model learned and what the retrieval system provides. As the model improves through iterative training runs -- each evaluated against the compile@1 baseline and tracked through versioned adapter weights -- every consumer application benefits immediately. This is the core value of the unified architecture.