TECHNICAL BENCHMARK REPORT

ConclAive Validation Results

Last updated: 2026-06-01 · All inference local, single workstation · No cloud API calls

EXECUTIVE SUMMARY

Key Numbers

1,000+ Validated Pipeline Runs
94 Module Tests (100% Pass)
13 LoRA Specialist Adapters
73 Scenario Configurations
4 Domains (Land/Naval/Air/c-UxS)
12 Adversary Doctrine Files
<200ms TA2 Combined Latency
~79s Rapid Pipeline End-to-End

All benchmarks run on a single workstation with local inference. No cloud, no API calls. Every number below is reproducible on the same hardware.

BENCHMARK 1

COA Generation — Model Comparison

7 models evaluated on the same NTC scenario (25K-token context). Automated scoring across 9 dimensions: phase completion, doctrine compliance, maneuver axes, deception/ruse, combat enablers, contingency scenarios, decision points, terrain accuracy, and format adherence.

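| MODEL | SIZE | TYPE | SCORE | PHASES | TERRAIN | RUSE | DPs | TIME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LoRA Specialist A | 24B | Fine-tuned | 85 | 4/4 | Clean | 4 refs | 2 | 101.7s |
| Base Model B | 72B | Vanilla | 75 | 4/4 | Clean | 4 refs | 0 | 295.3s |
| Base Model C | 32B | Vanilla | 75 | 4/4 | Clean | 4 refs | 2 | 211.9s |
| Base Model D | 22B | Vanilla | 74 | 4/4 | Clean | 4 refs | 1 | 106.1s |
| Base Model E | 27B | Vanilla | 72 | 4/4 | Clean | 4 refs | 2 | 106.4s |

KEY FINDING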
Model size does not predict COA quality. A 24B fine-tuned specialist (score 85) outperforms a 72B vanilla model (score 75) while running roughly 2.9x faster (101.7s vs 295.3s). On this scenario, task-specific fine-tuning is the primary quality differentiator, not raw parameter count.

Scoring methodology: Each dimension is scored independently. Phases (0-4): structural completeness. Compliance (0-10): doctrine references. Terrain: binary pass/fail with hallucination count. Ruse: reference count. Decision Points: count of explicit DP triggers. Penalties are applied for terrain hallucinations (-3 each), missing EW (-10), missing CAS, CoT leakage, and format violations.
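
The penalty step described above can be sketched as follows. This is a minimal illustration, not the actual scoring script: only the two penalty values stated in the methodology (-3 per terrain hallucination, -10 for missing EW) are encoded. The function name, the `other_penalties` parameter (standing in for the unstated CAS, CoT-leakage, and format penalties), and the floor at zero are assumptions.

```python
# Penalty values taken from the scoring methodology text.
TERRAIN_HALLUCINATION_PENALTY = 3   # points deducted per terrain hallucination
MISSING_EW_PENALTY = 10             # points deducted when EW is missing


def apply_penalties(raw_score: float,
                    terrain_hallucinations: int = 0,
                    missing_ew: bool = False,
                    other_penalties: float = 0.0) -> float:
    """Return the score after deductions.

    `other_penalties` is a hypothetical catch-all for penalties whose
    magnitudes the report does not state (missing CAS, CoT leakage,
    format violations); the caller must supply that total.
    """
    if terrain_hallucinations < 0 or other_penalties < 0:
        raise ValueError("penalty inputs must be non-negative")
    score = raw_score
    score -= TERRAIN_HALLUCINATION_PENALTY * terrain_hallucinations
    if missing_ew:
        score -= MISSING_EW_PENALTY
    score -= other_penalties
    # Assumption: scores are floored at zero rather than going negative.
    return max(score, 0.0)


# Two terrain hallucinations and no EW: 90 - 6 - 10
print(apply_penalties(90, terrain_hallucinations=2, missing_ew=True))  # 74
```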

BENCHMARK 2

Fine-Tuned Specialists vs Vanilla Models

Head-to-head comparison of ConclAive's LoRA fine-tuned specialists against unmodified base models on the same scenario. 18 COA generations across multiple model sizes.

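| MODEL | SIZE | TYPE | PHASES | TERRAIN | RUSE REFS | ENABLERS | AVG TIME |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LoRA Specialist (24B) | 24B | Fine-tuned | 12/12 | 0 violations | 21 | 35 | 85.8s |
| LoRA Specialist (32B) | 32B | Fine-tuned | 12/12 | 0 violations | 26 | 33 | 126.1s |
| LoRA Specialist (8B) | 8B | Fine-tuned | 12/12 | 0 violations | 25 | 35 | 27.7s |
| Vanilla Base (22B) | 22B | Vanilla | 12/12 | 0 violations | 25 | 40 | 101.2s |
| Vanilla Base (32B) | 32B | Vanilla | 12/12 | 4 violations | 40 | 52 | 220.4s |

KEY FINDING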
The fine-tuned 8B model generates compliant COAs in 27.7 seconds, about 8x faster than the vanilla 32B (220.4s), with zero terrain violations versus 4. At a quarter of the parameter count and running on edge hardware, it produces better tactical grounding. The 8B specialist achieves what the vanilla 32B does not: consistent terrain accuracy.

BENCHMARK 3

Red Team — Adversarial Evaluation

Red team models independently identify flaws in generated COAs, assign severity levels, and produce survivability scores. Fine-tuned specialists compared against vanilla base models.

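| MODEL | SIZE | TYPE | FLAWS/COA | CRITICAL | HIGH | FLAW DIVERSITY | AVG TIME |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RT Specialist A | 14B | Fine-tuned | 12.0 | 1.7 | 5.3 | 6/6 categories | 34.4s |
| RT Specialist B | 24B | Fine-tuned | 9.3 | 1.0 | 2.0 | 6/6 categories | 49.9s |
| Vanilla RT (32B) | 32B | Vanilla | 25.0 | 3.3 | 6.0 | 5.7/6 categories | 257.6s |
| Vanilla RT (22B) | 22B | Vanilla | 11.0 | 1.3 | 2.3 | 5.3/6 categories | 53.4s |

KEY FINDING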
Fine-tuned red team models achieve full 6/6 category diversity (logistics, timing, terrain, force ratio, C2, deception) while maintaining structured output. The vanilla 32B generates 25 flaws per COA at roughly 5x to 7.5x the latency of the specialists (257.6s vs 49.9s and 34.4s), favoring quantity over quality. The fine-tuned 14B RT completes a full adversarial review in about 34 seconds.
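
One plausible way to compute the flaw-diversity metric is to count how many of the six named categories appear in each review, then average across COAs. The sketch below assumes each flaw carries a single category label; the function names and the sample reviews are hypothetical and are not taken from the benchmark data.

```python
from typing import Iterable, Sequence

# The six flaw categories named in the report.
FLAW_CATEGORIES = ("logistics", "timing", "terrain",
                   "force ratio", "C2", "deception")


def category_diversity(flaw_labels: Iterable[str]) -> int:
    """Number of distinct recognised categories in one red-team review."""
    return len(set(flaw_labels) & set(FLAW_CATEGORIES))


def mean_diversity(reviews: Sequence[Iterable[str]]) -> float:
    """Average diversity over several reviews (one review per COA)."""
    if not reviews:
        raise ValueError("at least one review is required")
    return sum(category_diversity(r) for r in reviews) / len(reviews)


# Hypothetical reviews for three COAs: two cover all six categories,
# one misses "deception".
reviews = [
    list(FLAW_CATEGORIES),
    list(FLAW_CATEGORIES) + ["terrain", "timing"],   # repeats do not add diversity
    ["logistics", "timing", "terrain", "force ratio", "C2"],
]
print(f"{mean_diversity(reviews):.1f}/6 categories")  # 5.7/6 categories
```

Averaging over a small number of COAs is what produces fractional values such as 5.7/6 or 5.3/6.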

BENCHMARK 4

Automated Judge — Quality Assessment

Independent judge model evaluates 60 COAs across 5 generators, 4 scenarios, and 3 commander profiles. Two metrics: Tactical Coherence (0-10) and Overall Quality (0-10).

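| GENERATOR | SIZE | TYPE | AVG COHERENCE | AVG QUALITY | SAMPLES |
| --- | --- | --- | --- | --- | --- |
| LoRA Specialist A | 22B | Fine-tuned | 8.42 | 8.58 | 12 |
| LoRA Specialist B | 8B | Fine-tuned | 8.33 | 8.42 | 12 |
| LoRA Specialist C | 8B | Fine-tuned | 8.08 | 7.17 | 12 |
| Vanilla Base (70B) | 70B | Vanilla | 7.92 | 7.17 | 12 |

KEY FINDING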
The 8B fine-tuned LoRA Specialist B scores 8.42/10 on quality, outperforming the 70B vanilla model at 7.17/10. Nearly an order of magnitude smaller and running on edge hardware, it produces higher-rated tactical plans. Task-specific distillation beats generic scale for structured military planning.

BENCHMARK 5

Monte Carlo Simulation — Naval Domain

1,000-iteration stochastic simulation for naval operations (near-peer adversary, multiple weather conditions). Each iteration applies random perturbations to 10 parameters and evaluates COA survivability.

| COA | RESULT | 95% CI LOWER | 95% CI UPPER | BASE PROB | TOP FAILURE MODE / NOTE |
| --- | --- | --- | --- | --- | --- |
| CHARLIE | Score 78 | | | | Highest tactical score; aggressive posture with escalation risk surfaced automatically |
| ALPHA | 38.2% | 35.2% | 41.3% | 69.7% | Weather Degradation (16.6%) |
| BRAVO | Score 65 | | | | Conservative posture; logistically exposed under perturbation |

Top failure modes (1000 iterations): Weather Degradation (16.6%), OPFOR Repositioning (16.0%), Submarine Contact (6.9%). Monte Carlo reveals that the highest-scoring COA under deterministic analysis carries significant escalation risk — the system surfaces this tradeoff automatically for commander decision.

Perturbation model: 10 stochastic factors (weather shift, OPFOR reposition, submarine contact, logistics delay, C2 disruption, air superiority loss, political constraint change, EMCON violation, waterspace denial, timing slip). Each sampled from calibrated probability distributions per adversary doctrine file.
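
The overall shape of such a simulation can be sketched as below. This is a toy model under stated assumptions, not the ConclAive simulator: the ten factor names come from the report, but the per-factor probabilities, the rule that a COA survives an iteration only when no perturbation fires, and the attribution of a failure to the first triggered factor are all hypothetical simplifications.

```python
import random
from collections import Counter
from typing import Mapping, Tuple

# The ten stochastic factors listed in the perturbation model.
FACTORS = (
    "weather shift", "OPFOR reposition", "submarine contact",
    "logistics delay", "C2 disruption", "air superiority loss",
    "political constraint change", "EMCON violation",
    "waterspace denial", "timing slip",
)


def run_monte_carlo(event_prob: Mapping[str, float],
                    iterations: int = 1000,
                    seed: int = 0) -> Tuple[float, Counter]:
    """Return (survival rate, failure counts by factor).

    Assumption: each factor fires independently with its given
    probability, and any fired factor fails the COA for that iteration.
    """
    rng = random.Random(seed)          # seeded for reproducibility
    failures: Counter = Counter()
    survived = 0
    for _ in range(iterations):
        triggered = [f for f in FACTORS if rng.random() < event_prob[f]]
        if triggered:
            failures[triggered[0]] += 1  # hypothetical attribution rule
        else:
            survived += 1
    return survived / iterations, failures


# Placeholder probabilities; real values would come from a doctrine file.
placeholder = {name: 0.05 for name in FACTORS}
rate, failure_modes = run_monte_carlo(placeholder)
print(f"survival rate: {rate:.1%}")
print(failure_modes.most_common(3))
```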

BENCHMARK 6

TA2 Human-Machine Teaming Modules

7 modules implementing human-machine teaming requirements. All modules benchmarked with automated test suites. Combined latency under 200ms.

| MODULE | FUNCTION | LATENCY | TESTS | KEY METRIC |
| --- | --- | --- | --- | --- |
| Intent Parser | NL to JP 5-0 weighted criteria | <100ms | 3,000-item CV | 81% accuracy (5-fold) |
| Abstention (L1+L2) | Out-of-domain rejection | <50ms | 300-item | 81% recall, 100% specificity |
| Commander Profile | EMA trait tracking, 4 archetypes | 0.4ms | 7/7 | Converges in <50 interactions |
| Transactive Memory | Knowledge distribution (Wegner 1987) | 0.06ms | 9/9 | Real-time team state |
| Interdependence | Shared goals (Salas Big Five) | 0.04ms | 6/6 | Complementary role mapping |
| Trust Calibration | 6-factor per-COA confidence | 0.05ms | 7/7 | Wilson 95% CI scoring |
| Cognitive Forcing | Anomaly detection, debiasing | 0.87ms | 7/7 | Trust break triggers |

36/36 TA2 Tests Passing
58/58 TA1 Tests Passing
<200ms Combined TA2 Latency
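
The Commander Profile row mentions EMA (exponential moving average) trait tracking. A minimal sketch of an EMA update follows. The smoothing factor `alpha = 0.1`, the 0-to-1 trait scale, and the function name are assumptions chosen for illustration; the report does not state the actual parameters.

```python
def ema_update(current: float, observation: float, alpha: float = 0.1) -> float:
    """Blend a new observation into the running trait estimate.

    alpha is the weight given to the newest observation; 0.1 is a
    hypothetical value, not one taken from the report.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return alpha * observation + (1.0 - alpha) * current


# Hypothetical trait (e.g. risk tolerance) starting at a neutral 0.5,
# with every interaction signalling the maximum value 1.0.
trait = 0.5
for _ in range(50):
    trait = ema_update(trait, 1.0)
print(f"estimate after 50 interactions: {trait:.3f}")
```

With this choice of `alpha`, the remaining gap to the observed value shrinks by a factor of 0.9 per interaction, so the estimate is within 1% of the target after 50 consistent observations.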

BENCHMARK 7

LoRA Specialist Performance

13 fine-tuned LoRA adapters across 4 base model architectures (8B to 32B). Trained via frontier distillation, validated on held-out scenarios.

| SPECIALIST ROLE | FORMAT COMPLIANCE | KEY METRIC | LATENCY | vs VANILLA BASE |
| --- | --- | --- | --- | --- |
| COA Generator | 100% (48/48 COAs) | 90% structure consistency | 29s | +87pp compliance |
| Red Team Analyst | 94% (15/16) | 6.5 flaws/COA avg | 39s | +94pp compliance |
| JP5 Judge | 94% (15/16) | 83% rank preservation | 5s | +100pp compliance |

KEY FINDING
Fine-tuning improves format compliance by +87 to +100 percentage points over vanilla base models. The JP5 Judge specialist goes from 0% structured output (vanilla) to 94% compliance. That is the difference between usable and unusable in a pipeline where each stage must produce parseable output for the next.
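
A format-compliance check of this kind can be sketched as a parse-and-validate pass over each stage's output. The example assumes stage outputs are JSON objects with a fixed set of required keys; the report does not specify the actual output format, so the JSON choice, the key names, and the function names are hypothetical.

```python
import json
from typing import FrozenSet, Sequence

# Hypothetical schema: keys a downstream stage would need.
REQUIRED_KEYS: FrozenSet[str] = frozenset({"coa_id", "phases", "decision_points"})


def is_parseable(output: str, required: FrozenSet[str] = REQUIRED_KEYS) -> bool:
    """True if the output is a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


def compliance_rate(outputs: Sequence[str]) -> float:
    """Fraction of outputs that the next pipeline stage could consume."""
    if not outputs:
        raise ValueError("at least one output is required")
    return sum(is_parseable(o) for o in outputs) / len(outputs)


# 15 well-formed outputs and one free-text answer, mirroring a 15/16 result.
good = json.dumps({"coa_id": "ALPHA", "phases": [], "decision_points": []})
outputs = [good] * 15 + ["The COA looks reasonable overall."]
print(f"{compliance_rate(outputs):.0%}")  # 94%
```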

BENCHMARK 8

Training Data Validation

COA training data generated via frontier distillation from multiple proprietary and open-source generators, automatically filtered for quality.

| DATASET | RUNS | SAMPLES | PASSED | PASS RATE |
| --- | --- | --- | --- | --- |
| COA Training (all) | 29 | 70 | 64 | 91.4% |
| Adapted COA Training | 29 | 43 | 43 | 100% |

Weather coverage: 9 distinct conditions (optimal, storm, snow, rough, overcast, improving, clear, rain, degrading).

Generator coverage: 7 frontier LLMs used for distillation. All failures excluded from training data via automated filtering.

VALIDATION SCOPE

Cross-Domain Coverage

1,000+ validated pipeline runs spanning 4 operational domains, 12 adversary doctrine files, and 73 scenario configurations.

| DOMAIN | SCENARIO TYPES | ADVERSARY DOCTRINES |
| --- | --- | --- |
| Land Warfare | Conventional force-on-force, unconventional | Peer, near-peer |
| Naval Operations | Carrier strike group, littoral operations | Near-peer naval |
| Air Domain | Multi-domain coordination, littoral air | Peer, regional |
| Counter-UxS | Drone swarm defense, counter-surveillance | Regional, non-state |

METHODOLOGY

Reproducibility

Hardware: Single workstation, 128GB unified memory. All inference local.

Models: 3B to 72B parameter range. LoRA adapters (rank 16-64) trained on frontier-distilled data. All models run locally — no cloud API calls during benchmarks.

Evaluation: Automated scoring scripts with deterministic rubrics. Judge models evaluate blind (no knowledge of generator identity). Cross-validation where applicable (5-fold for Intent Parser on 3,000-item corpus).
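
For readers unfamiliar with 5-fold cross-validation, the split over a 3,000-item corpus can be sketched with the standard library alone. This is an illustrative partitioning scheme under assumed settings (shuffled, unstratified, fixed seed); the report does not say how the folds were actually constructed, and the function name is hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 5,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds.

    Every item lands in exactly one test fold; the other k-1 folds
    form the training set for that round.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)           # seeded, reproducible shuffle
    folds = [indices[i::k] for i in range(k)]      # k near-equal slices
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(3000, k=5):
    print(len(train_idx), len(test_idx))  # 2400 600 on each of the 5 rounds
```

The reported accuracy would then be the mean of the per-fold accuracies, each measured on items the model did not see during that round.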

Statistical rigor: Monte Carlo uses 1,000 iterations with Wilson 95% confidence intervals. Red team evaluations use category diversity as a quality metric (6 flaw categories). All training data quality-filtered with automated pass/fail criteria.
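
The Wilson score interval named above can be computed with the standard library as sketched below. The formula is the standard Wilson interval; the input of 382 successes in 1,000 iterations is an assumption inferred from the 38.2% result reported for COA ALPHA, not a count stated in the report.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials
                          + z * z / (4 * trials ** 2)) / denom
    return center - half_width, center + half_width


# Assumed count: 382 surviving iterations out of 1,000 (38.2%).
low, high = wilson_interval(382, 1000)
print(f"{low:.1%} to {high:.1%}")  # 35.2% to 41.3%
```

Under that assumed count, the output agrees with the 35.2% and 41.3% bounds listed for COA ALPHA in Benchmark 5.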

Availability: Full benchmark data and scoring methodology available on request for evaluation purposes.