TECHNICAL BENCHMARK REPORT

ConclAive Validation Results

Last updated: 2026-06-01 · All inference local, single workstation · No cloud API calls

EXECUTIVE SUMMARY

Key Numbers

1,000+ Validated Pipeline Runs

94 Module Tests (100% Pass)

13 LoRA Specialist Adapters

73 Scenario Configurations

4 Domains (Land/Naval/Air/c-UxS)

12 Adversary Doctrine Files

<200ms TA2 Combined Latency

~79s Rapid Pipeline End-to-End

All benchmarks run on a single workstation with local inference. No cloud, no API calls. Every number below is reproducible on the same hardware.

BENCHMARK 1

COA Generation — Model Comparison

7 models evaluated on the same NTC scenario (25K-token context). Automated scoring across 9 dimensions: phase completion, doctrine compliance, maneuver axes, deception/ruse, combat enablers, contingency scenarios, decision points, terrain accuracy, and format adherence.

MODEL	SIZE	TYPE	SCORE	PHASES	TERRAIN	RUSE	DPs	TIME
LoRA Specialist A	24B	Fine-tuned	85	4/4	Clean	4 refs	2	101.7s
Base Model B	72B	Vanilla	75	4/4	Clean	4 refs	0	295.3s
Base Model C	32B	Vanilla	75	4/4	Clean	4 refs	2	211.9s
Base Model D	22B	Vanilla	74	4/4	Clean	4 refs	1	106.1s
Base Model E	27B	Vanilla	72	4/4	Clean	4 refs	2	106.4s

KEY FINDING

Model size does not predict COA quality. A 24B fine-tuned specialist (score 85, 101.7s) outperforms a 72B vanilla model (score 75, 295.3s) while running roughly 3x faster. Task-specific fine-tuning is the primary quality differentiator, not raw parameter count.
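The speed and score comparison in this finding can be checked directly from the Benchmark 1 table. The snippet below is a minimal sketch of that arithmetic; the dictionary layout and the speedup helper are illustrative names, not part of the ConclAive tooling.

```python
# Values copied from the Benchmark 1 table.
results = {
    "LoRA Specialist A": {"size_b": 24, "score": 85, "time_s": 101.7},
    "Base Model B": {"size_b": 72, "score": 75, "time_s": 295.3},
}


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


specialist = results["LoRA Specialist A"]
vanilla = results["Base Model B"]

ratio = speedup(vanilla["time_s"], specialist["time_s"])
score_gap = specialist["score"] - vanilla["score"]
print(f"speedup: {ratio:.2f}x, score gap: {score_gap} points")
# prints "speedup: 2.90x, score gap: 10 points"
```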

Scoring methodology: Each dimension scored independently. Phases (0-4): structural completeness. Compliance (0-10): doctrine references. Terrain: binary pass/fail with hallucination count. Ruse: reference count. Decision Points: count of explicit DP triggers. Penalties applied for: terrain hallucinations (-3 each), missing EW (-10), missing CAS, CoT leakage, format violations.
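As a rough illustration of how the stated penalties could be applied, the sketch below subtracts the two penalty values given above (3 points per terrain hallucination, 10 points for missing EW) from a raw score. The function and class names, the 0-100 raw score, and the clamp at zero are assumptions. The report does not state penalty sizes for missing CAS, CoT leakage, or format violations, so the caller passes those in explicitly rather than the code guessing them.

```python
from dataclasses import dataclass

# Penalty values stated in the scoring methodology.
TERRAIN_HALLUCINATION_PENALTY = 3  # per hallucination
MISSING_EW_PENALTY = 10


@dataclass
class CoaChecks:
    """Hypothetical container for automated check results on one COA."""
    terrain_hallucinations: int = 0
    missing_ew: bool = False


def apply_penalties(raw_score: float, checks: CoaChecks,
                    other_penalties: float = 0.0) -> float:
    """Subtract penalties from a raw score (assumed 0-100), clamped at 0.

    other_penalties stands in for missing CAS, CoT leakage, and format
    violations, whose magnitudes are not given in the report.
    """
    penalty = checks.terrain_hallucinations * TERRAIN_HALLUCINATION_PENALTY
    if checks.missing_ew:
        penalty += MISSING_EW_PENALTY
    return max(0.0, raw_score - penalty - other_penalties)


clean = CoaChecks()
flawed = CoaChecks(terrain_hallucinations=2, missing_ew=True)
print(apply_penalties(90, clean))   # no deductions
print(apply_penalties(90, flawed))  # 90 - (2 * 3) - 10
```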

BENCHMARK 2

Fine-Tuned Specialists vs Vanilla Models

Head-to-head comparison of ConclAive's LoRA fine-tuned specialists against unmodified base models on the same scenario. 18 COA generations across multiple model sizes.

MODEL	SIZE	TYPE	PHASES	TERRAIN	RUSE REFS	ENABLERS	AVG TIME
LoRA Specialist (24B)	24B	Fine-tuned	12/12	0 violations	21	35	85.8s
LoRA Specialist (32B)	32B	Fine-tuned	12/12	0 violations	26	33	126.1s
LoRA Specialist (8B)	8B	Fine-tuned	12/12	0 violations	25	35	27.7s
Vanilla Base (22B)	22B	Vanilla	12/12	0 violations	25	40	101.2s
Vanilla Base (32B)	32B	Vanilla	12/12	4 violations	40	52	220.4s

KEY FINDING

The fine-tuned 8B model generates compliant COAs in 27.7 seconds on average, 8x faster than the vanilla 32B (220.4s), with zero terrain violations against 4. At a quarter of the parameter count and running on edge hardware, it produces better tactical grounding. The 8B specialist achieves what the 32B vanilla model does not: consistent terrain accuracy.

BENCHMARK 3

Red Team — Adversarial Evaluation

Red team models independently identify flaws in generated COAs, assign severity levels, and produce survivability scores. Fine-tuned specialists compared against vanilla base models.

MODEL	SIZE	TYPE	FLAWS/COA	CRITICAL	HIGH	FLAW DIVERSITY	AVG TIME
RT Specialist A	14B	Fine-tuned	12.0	1.7	5.3	6/6 categories	34.4s
RT Specialist B	24B	Fine-tuned	9.3	1.0	2.0	6/6 categories	49.9s
Vanilla RT (32B)	32B	Vanilla	25.0	3.3	6.0	5.7/6 categories	257.6s
Vanilla RT (22B)	22B	Vanilla	11.0	1.3	2.3	5.3/6 categories	53.4s

KEY FINDING

Fine-tuned red team models achieve full 6/6 category diversity (logistics, timing, terrain, force ratio, C2, deception) while maintaining structured output. The vanilla 32B generates 25 flaws/COA at 5x to 7.5x the latency of the fine-tuned models (257.6s vs 49.9s and 34.4s): quantity over quality. The fine-tuned 14B RT completes a full adversarial review in 34 seconds.
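One plausible way to compute the flaw diversity column is to count the distinct flaw categories covered in each COA review and average across reviews, which would explain fractional values such as 5.7/6. The sketch below assumes that definition; the function name and the sample reviews are hypothetical.

```python
# The six flaw categories named in the key finding.
CATEGORIES = ("logistics", "timing", "terrain", "force ratio", "C2", "deception")


def category_diversity(flaw_categories_per_coa: list[list[str]]) -> float:
    """Mean number of distinct known categories covered per COA review."""
    if not flaw_categories_per_coa:
        raise ValueError("need at least one COA review")
    known = set(CATEGORIES)
    counts = [len(set(cats) & known) for cats in flaw_categories_per_coa]
    return sum(counts) / len(counts)


# Hypothetical reviews of three COAs; each inner list holds the
# category label of every flaw the red team model reported.
reviews = [
    ["logistics", "timing", "terrain", "force ratio", "C2", "deception", "timing"],
    ["logistics", "timing", "terrain", "force ratio", "C2", "deception"],
    ["logistics", "timing", "terrain", "force ratio", "C2"],
]
print(f"{category_diversity(reviews):.1f}/{len(CATEGORIES)} categories")
# prints "5.7/6 categories"
```

Repeated flaws in the same category do not raise the metric, so a model cannot inflate diversity by listing more flaws.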

BENCHMARK 4

Automated Judge — Quality Assessment

Independent judge model evaluates 60 COAs across 5 generators, 4 scenarios, and 3 commander profiles. Two metrics: Tactical Coherence (0-10) and Overall Quality (0-10).

GENERATOR	SIZE	TYPE	AVG COHERENCE	AVG QUALITY	SAMPLES
LoRA Specialist A	22B	Fine-tuned	8.42	8.58	12
LoRA Specialist B	8B	Fine-tuned	8.33	8.42	12
LoRA Specialist C	8B	Fine-tuned	8.08	7.17	12
Vanilla Base (70B)	70B	Vanilla	7.92	7.17	12

KEY FINDING

The 8B LoRA Specialist B scores 8.42/10 on quality, ahead of the 70B vanilla model at 7.17/10. It is nearly an order of magnitude smaller, runs on edge hardware, and produces better-rated tactical plans. Task-specific distillation beats generic scale for structured military planning.

BENCHMARK 5

Monte Carlo Simulation — Naval Domain

1,000-iteration stochastic simulation for naval operations (near-peer adversary, multiple weather conditions). Each iteration applies random perturbations to 10 parameters and evaluates COA survivability.

COA	RESULT	95% CI LOWER	95% CI UPPER	BASE PROB	TOP FAILURE MODE
ALPHA	38.2%	35.2%	41.3%	69.7%	Weather Degradation (16.6%)

COA	SCORE	NOTE
CHARLIE	78	Highest tactical score; aggressive posture with escalation risk surfaced automatically
BRAVO	65	Conservative posture; logistically exposed under perturbation

Top failure modes (1000 iterations): Weather Degradation (16.6%), OPFOR Repositioning (16.0%), Submarine Contact (6.9%). Monte Carlo reveals that the highest-scoring COA under deterministic analysis carries significant escalation risk — the system surfaces this tradeoff automatically for commander decision.

Perturbation model: 10 stochastic factors (weather shift, OPFOR reposition, submarine contact, logistics delay, C2 disruption, air superiority loss, political constraint change, EMCON violation, waterspace denial, timing slip). Each sampled from calibrated probability distributions per adversary doctrine file.
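The perturbation loop described above can be sketched as follows. This is a simplified illustration, not the ConclAive simulator: the factor names follow the list above, but every trigger probability and penalty value is invented for the example, since the calibrated distributions are not published in this report. Only the 69.7% base probability is taken from the ALPHA row, and the output is not expected to reproduce the reported 38.2%.

```python
import random
from collections import Counter

# Hypothetical values: name -> (trigger probability, penalty to success probability).
FACTORS = {
    "weather_shift": (0.20, 0.30),
    "opfor_reposition": (0.20, 0.30),
    "submarine_contact": (0.10, 0.25),
    "logistics_delay": (0.10, 0.10),
    "c2_disruption": (0.05, 0.15),
    "air_superiority_loss": (0.05, 0.20),
    "political_constraint_change": (0.05, 0.10),
    "emcon_violation": (0.05, 0.10),
    "waterspace_denial": (0.05, 0.15),
    "timing_slip": (0.10, 0.10),
}


def simulate(base_prob: float, iterations: int = 1000, seed: int = 0):
    """Return (success rate, Counter of dominant failure modes)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    successes = 0
    failure_modes = Counter()
    for _ in range(iterations):
        # Sample which perturbations occur in this iteration.
        triggered = [name for name, (p, _) in FACTORS.items() if rng.random() < p]
        # Each triggered factor lowers the chance that the COA survives.
        prob = base_prob - sum(FACTORS[name][1] for name in triggered)
        prob = min(1.0, max(0.0, prob))
        if rng.random() < prob:
            successes += 1
        elif triggered:
            # Attribute the failure to the triggered factor with the largest penalty.
            worst = max(triggered, key=lambda name: FACTORS[name][1])
            failure_modes[worst] += 1
        else:
            failure_modes["baseline"] += 1
    return successes / iterations, failure_modes


rate, modes = simulate(base_prob=0.697, iterations=1000, seed=1)
print(f"success rate: {rate:.1%}")
print(modes.most_common(3))
```

Attributing each failure to a single dominant factor is one design choice among several; a production model might instead record every triggered factor.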

BENCHMARK 6

TA2 Human-Machine Teaming Modules

7 modules implementing human-machine teaming requirements. All modules benchmarked with automated test suites. Combined latency under 200ms.

MODULE	FUNCTION	LATENCY	TESTS	KEY METRIC
Intent Parser	NL to JP 5-0 weighted criteria	<100ms	3000-item CV	81% accuracy (5-fold)
Abstention (L1+L2)	Out-of-domain rejection	<50ms	300-item	81% recall, 100% specificity
Commander Profile	EMA trait tracking, 4 archetypes	0.4ms	7/7	Converges in <50 interactions
Transactive Memory	Knowledge distribution (Wegner 1987)	0.06ms	9/9	Real-time team state
Interdependence	Shared goals (Salas Big Five)	0.04ms	6/6	Complementary role mapping
Trust Calibration	6-factor per-COA confidence	0.05ms	7/7	Wilson 95% CI scoring
Cognitive Forcing	Anomaly detection, debiasing	0.87ms	7/7	Trust break triggers
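The EMA trait tracking listed for the Commander Profile module can be illustrated with a standard exponential moving average. The sketch below is an assumption about the general technique, not the module's code: the smoothing factor of 0.1, the tolerance of 0.01, and both function names are hypothetical.

```python
def ema_update(current: float, observation: float, alpha: float = 0.1) -> float:
    """One exponential-moving-average step; alpha is a hypothetical smoothing factor."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return alpha * observation + (1.0 - alpha) * current


def interactions_to_converge(start: float, target: float,
                             alpha: float = 0.1, tol: float = 0.01) -> int:
    """Count updates until the estimate is within tol of a constant target."""
    estimate, steps = start, 0
    while abs(estimate - target) > tol:
        estimate = ema_update(estimate, target, alpha)
        steps += 1
    return steps


# A trait estimate starting at 0.0 while the commander consistently signals 1.0.
print(interactions_to_converge(0.0, 1.0))  # prints 44
```

Under these assumed settings the estimate settles in 44 updates, which is in line with the reported convergence in fewer than 50 interactions; a larger alpha would converge faster at the cost of more sensitivity to noise.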

36/36 TA2 Tests Passing

58/58 TA1 Tests Passing

<200ms Combined TA2 Latency

BENCHMARK 7

LoRA Specialist Performance

13 fine-tuned LoRA adapters across 4 base model architectures (8B to 32B). Trained via frontier distillation, validated on held-out scenarios.

SPECIALIST ROLE	FORMAT COMPLIANCE	KEY METRIC	LATENCY	vs VANILLA BASE
COA Generator	100% (48/48 COAs)	90% structure consistency	29s	+87pp compliance
Red Team Analyst	94% (15/16)	6.5 flaws/COA avg	39s	+94pp compliance
JP5 Judge	94% (15/16)	83% rank preservation	5s	+100pp compliance

KEY FINDING

Fine-tuning improves format compliance by +87 to +100 percentage points over vanilla base models. The JP5 Judge specialist goes from 0% structured output (vanilla) to 94% compliance — the difference between usable and unusable in a pipeline where each stage must produce parseable output for the next.
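A format compliance check of the kind implied here can be sketched as a parse-and-validate step. The report does not specify the pipeline's output format, so the example assumes JSON with a hypothetical set of required keys; the function names are likewise illustrative.

```python
import json

# Hypothetical schema: keys a downstream stage is assumed to require.
REQUIRED_KEYS = {"coa_id", "phases", "decision_points"}


def is_compliant(raw_output: str) -> bool:
    """True if the model output parses as a JSON object with all required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)


def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass the compliance check."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(is_compliant(o) for o in outputs) / len(outputs)


good = json.dumps({"coa_id": "ALPHA", "phases": [], "decision_points": []})
bad = "Sure, here is the plan you asked for..."

# 15 parseable outputs out of 16, as in the Red Team and JP5 Judge rows.
print(f"{compliance_rate([good] * 15 + [bad]):.0%}")  # prints "94%"
```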

BENCHMARK 8

Training Data Validation

COA training data generated via frontier distillation from multiple proprietary and open-source generators, automatically filtered for quality.

DATASET	RUNS	SAMPLES	PASSED	PASS RATE
COA Training (all)	29	70	64	91.4%
Adapted COA Training	29	43	43	100%

Weather coverage: 9 distinct conditions (optimal, storm, snow, rough, overcast, improving, clear, rain, degrading).

Generator coverage: 7 frontier LLMs used for distillation. All failures excluded from training data via automated filtering.

VALIDATION SCOPE

Cross-Domain Coverage

1,000+ validated pipeline runs spanning 4 operational domains, 12 adversary doctrine files, and 73 scenario configurations.

DOMAIN	SCENARIO TYPES	ADVERSARY DOCTRINES
Land Warfare	Conventional force-on-force, unconventional	Peer, near-peer
Naval Operations	Carrier strike group, littoral operations	Near-peer naval
Air Domain	Multi-domain coordination, littoral air	Peer, regional
Counter-UxS	Drone swarm defense, counter-surveillance	Regional, non-state

METHODOLOGY

Reproducibility

Hardware: Single workstation, 128GB unified memory. All inference local.

Models: 3B to 72B parameter range. LoRA adapters (rank 16-64) trained on frontier-distilled data. All models run locally — no cloud API calls during benchmarks.

Evaluation: Automated scoring scripts with deterministic rubrics. Judge models evaluate blind (no knowledge of generator identity). Cross-validation where applicable (5-fold for Intent Parser on 3,000-item corpus).
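For readers unfamiliar with the 5-fold procedure, the sketch below shows how a 3,000-item corpus splits into five train/test partitions. It uses only the standard library; the function name and the seed are assumptions, and the actual evaluation scripts may shuffle or stratify differently.

```python
import random


def k_fold_indices(n_items: int, k: int = 5, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(3000, k=5):
    print(len(train_idx), len(test_idx))  # 2400 600 for each of the 5 folds
```

Each item appears in exactly one test fold, so the reported accuracy is an average over predictions made on data the model did not see during that fold's training.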

Statistical rigor: Monte Carlo uses 1,000 iterations with Wilson 95% confidence intervals. Red team evaluations use category diversity as a quality metric (6 flaw categories). All training data quality-filtered with automated pass/fail criteria.
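The Wilson score interval mentioned above can be computed with the standard closed-form formula. The sketch below assumes the ALPHA result of 38.2% corresponds to 382 successes in 1,000 iterations; the function name is illustrative and z = 1.96 is the usual approximation for a 95% interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed counts: 382 surviving iterations out of 1,000.
lo, hi = wilson_interval(382, 1000)
print(f"{lo:.1%} to {hi:.1%}")  # prints "35.2% to 41.3%"
```

Under that assumption the result matches the 95% CI bounds reported for COA ALPHA in Benchmark 5.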

Availability: Full benchmark data and scoring methodology available on request for evaluation purposes.