I Trained a Domain Expert LLM That Ships Inside My Product

I needed a model that knows CMMC compliance cold. Not "has read the NIST docs" — actually understands when a security control is met, when it's not, and when the evidence is inconclusive. And it had to run on a customer's laptop. An i7 with 16 gigs of RAM. No GPU. No cloud calls.

This is the story of how I got there, and why the answer wasn't what I expected.

The Problem

I'm building a compliance assessment tool. It pulls security configuration data from Microsoft 365 — conditional access policies, device compliance, audit logs, MFA settings — and evaluates it against 320 CMMC Level 2 objectives.

The Rust evaluators handle the structured checks. But some objectives need judgment. "Is this configuration sufficient?" isn't always a yes/no. I needed a model that could look at raw API data and make a defensible call: Verified, Deficient, or Insufficient Data.

Cloud models are out. This tool runs in air-gapped environments. The model has to ship with the product.

I needed a judge. One that could sit on a laptop, review evidence, and render a verdict. Offline, every time.

Round 1: Train the Judge

I started with Gemma 3 12B. Good model, open weights, fits on my M4 Pro for training. I built an eval harness — 50 test cases covering all three verdict types — and ran the baseline.

58% accuracy. And the distribution was catastrophic: 100% on Verified, 8% on Deficient, 0% on Insufficient Data. The model said "looks good" to everything. Every security control passed. Every tenant was compliant.
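The per-verdict breakdown is what exposes a rubber stamp, since overall accuracy alone can hide it. Here is a minimal sketch of that kind of scoring, not the actual harness: the function name, the data shape, and the toy results are all assumptions made for illustration.

```python
from collections import Counter

VERDICTS = ("Verified", "Deficient", "Insufficient Data")


def per_verdict_accuracy(results):
    """Score (expected, predicted) verdict pairs overall and per expected verdict."""
    total = Counter()
    correct = Counter()
    for expected, predicted in results:
        total[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    if not total:
        raise ValueError("no results to score")
    overall = sum(correct.values()) / sum(total.values())
    # Only report verdict types that actually appear in the test set.
    by_verdict = {v: correct[v] / total[v] for v in VERDICTS if total[v]}
    return overall, by_verdict


# Toy data, not the real test set: a model that answers "Verified" every time.
toy_results = [
    ("Verified", "Verified"),
    ("Verified", "Verified"),
    ("Deficient", "Verified"),
    ("Insufficient Data", "Verified"),
]
overall, by_verdict = per_verdict_accuracy(toy_results)
print(overall, by_verdict)
```

On the toy data the always-Verified model scores 50% overall but 0% on both failure verdicts, which is the same shape as the baseline above.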

This wasn't a judge. It was a rubber stamp. A judge who acquits everyone isn't dispensing justice — they're just not paying attention.

Teaching the Judge Through Experience

I figured the problem was training data. The model hadn't seen enough examples of failure. So I wrote 36 corrective examples — cases where the model should have said Deficient but didn't — and ran LoRA fine-tuning on Apple Silicon.

Round 1: 70%. Insufficient Data jumped from 0% to 60%. The judge was starting to notice problems.

I wrote 31 more examples and bumped the learning rate.

Round 2: 84%. Deficient went from 8% to 67%. Insufficient Data hit 100%. The model was actually reading the evidence — catching disabled policies, empty arrays, noncompliant device states. Real scrutiny. Real judgment.

I thought I was almost there.

The Ceiling

Then I kept going. More case law. More examples. Carefully balanced distributions.

Round 3: 62%. Regression. Round 4: 60%. Worse.

I tried different balancing strategies. Different learning rates. Different checkpoint selections. Every additional round of training degraded the model. The base model's prior — "say Verified" — was an attractor basin. I could nudge the judge partway toward rigor, but push harder and they snapped back to rubber-stamping.

Four rounds of LoRA training taught me something I didn't want to learn: I was trying to bake the law into the judge's brain. Memorize every statute. Internalize every precedent. Know, from experience alone, what matters in every case.

That's not how courts work. That's not even how compliance works.

Hand the Judge the Statute Book

I stopped training. Instead, I wrote a decision guide for each of the 320 objectives — a short document explaining what to look for in the API evidence and what each verdict means for that specific control.

Think of it as the statute book. "This objective is about MFA enforcement. Check if grantControls contains 'mfa'. If the policy state is 'disabled', that's Deficient regardless of what the grant controls say."

I injected the relevant guide into the prompt at eval time. No weight changes. Same model. Same judge — just one who now has the law in front of them instead of trying to remember it.
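Mechanically, the injection can be as simple as a dictionary lookup and a string template. The sketch below assumes a hypothetical objective ID, a hypothetical prompt template, and an illustrative evidence payload; only the guide text comes from the example above.

```python
import json

# Hypothetical guide store. The key is a made-up objective ID; the guide text
# is the MFA example quoted above.
GUIDES = {
    "mfa-enforcement-example": (
        "This objective is about MFA enforcement. Check if grantControls "
        "contains 'mfa'. If the policy state is 'disabled', that's Deficient "
        "regardless of what the grant controls say."
    ),
}

# Assumed prompt layout; the real wording is not given in this post.
PROMPT_TEMPLATE = """You are assessing one compliance objective.

Decision guide:
{guide}

Evidence (raw API data):
{evidence}

Answer with exactly one of: Verified, Deficient, Insufficient Data."""


def build_prompt(objective_id, evidence, guides=GUIDES):
    """Look up the guide for one objective and place it next to the evidence."""
    guide = guides[objective_id]  # raises KeyError if an objective has no guide
    return PROMPT_TEMPLATE.format(
        guide=guide,
        evidence=json.dumps(evidence, indent=2, sort_keys=True),
    )


# Illustrative evidence only: a disabled policy that still lists MFA.
example_evidence = {
    "state": "disabled",
    "grantControls": {"builtInControls": ["mfa"]},
}
prompt = build_prompt("mfa-enforcement-example", example_evidence)
print(prompt)
```

The model weights never change. The only thing that varies per objective is which guide is pasted in.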

100%. Every test case. All three verdict types.

The 8 cases that the fine-tuned model still got wrong? All fixed. Admin counting, empty configurations, unresolved incidents, Secure Score interpretation — every failure was a case where the judge needed the statute, not more experience.

A Smaller Judge Works Fine

If the statute book was doing the heavy lifting, maybe I didn't need a senior judge at all.

I pulled Gemma 3 4B. Stock weights. No fine-tuning. No training whatsoever. Handed it the same decision guides.

100%. One second per inference. 3.3 gigabytes on disk.

A model a third the size, with zero training, matched the 12B model's 100% with the guides, including every case the fine-tuned version still got wrong. The difference wasn't the judge's experience. It was whether they had the law in front of them.

What Ships

The product bundles Gemma 3 4B and a context map: 320 decision guides, one per CMMC objective. At assessment time, each objective gets its guide injected alongside the raw API data. The model reads the evidence, consults the statute, and renders a verdict.
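A rough sketch of that loop follows. It assumes the local model is wrapped in a `generate` callable (prompt in, text out), that evidence arrives as a string per objective, and that an answer naming zero or several verdicts should be flagged rather than guessed at. None of those details are specified here, so treat the names and the parsing rule as hypothetical.

```python
import re

CANONICAL = {
    "verified": "Verified",
    "deficient": "Deficient",
    "insufficient data": "Insufficient Data",
}
# Word boundaries keep "Unverified" from counting as "Verified".
_LABEL = re.compile(r"\b(verified|deficient|insufficient data)\b", re.IGNORECASE)


def parse_verdict(text):
    """Return the single verdict named in the model output, or None if unclear.

    This expects a near-bare label; a phrase like "not verified" would still
    be read as Verified, so the prompt should ask for the label only.
    """
    found = {match.lower() for match in _LABEL.findall(text)}
    if len(found) != 1:
        return None
    return CANONICAL[found.pop()]


def assess(evidence_by_objective, guides, generate):
    """Run one model call per objective and collect the parsed verdicts."""
    verdicts = {}
    for objective_id, evidence in evidence_by_objective.items():
        prompt = (
            f"Decision guide:\n{guides[objective_id]}\n\n"
            f"Evidence:\n{evidence}\n\n"
            "Verdict (Verified, Deficient, or Insufficient Data):"
        )
        verdicts[objective_id] = parse_verdict(generate(prompt))
    return verdicts
```

Passing the model in as a plain callable keeps the loop testable without the model loaded: a stub that returns a fixed string is enough to exercise the parsing.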

It runs on a Dell Latitude with an i7 and 16 gigs of RAM. CPU only. No internet required. About a second per objective, which puts a full 320-objective assessment at a little over five minutes.

Total payload: the judge (3.3 GB) and the law (640 KB).
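The back-of-envelope numbers are easy to check. This assumes every one of the 320 objectives goes through the model sequentially at roughly one second each, and that the guides are about the same size on average.

```python
OBJECTIVES = 320
SECONDS_PER_OBJECTIVE = 1.0  # "about a second", so this is an approximation
GUIDES_TOTAL_KB = 640

total_minutes = OBJECTIVES * SECONDS_PER_OBJECTIVE / 60
kb_per_guide = GUIDES_TOTAL_KB / OBJECTIVES

print(f"{total_minutes:.1f} minutes, {kb_per_guide:.0f} KB per guide on average")
# prints "5.3 minutes, 2 KB per guide on average"
```

Two kilobytes is a few paragraphs of plain text, which is all each guide needs to be.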

The Metaphor Is the Lesson

Fine-tuning is training through experience — trying to make the judge internalize the law by showing them enough cases. It works up to a point. An experienced judge is better than a fresh one. But there's a ceiling, because experience is lossy. You can't encode every statute into intuition.

RAG is handing the judge the statute book at trial time. The judge still does real work — reading evidence, interpreting context, weighing ambiguity, rendering a verdict. That's not nothing. A regex can't do that. A rule engine can't do that for 320 different objectives without becoming unmaintainable.

But the judge doesn't need to memorize the law. They need to read it. And a competent junior judge with the right reference material will outperform a senior judge working from memory every time.

The Point

Most people building domain-expert models start where I started: collect training data, fine-tune, evaluate, iterate. It's the obvious path. You're training the judge.

The ceiling I hit wasn't compute or data. It was approach. I was trying to bake domain knowledge into weights when I should have been passing it in at inference time.

A 3.3 GB model with the right context beat a fine-tuned 12B model without it. Every time. The judge was never the bottleneck. The law was.

If your model needs to know more, you might not need to train it. You might just need to hand it the statute book.