All benchmarks run on a single workstation with local inference. No cloud, no API calls. Every number below is reproducible on the same hardware.
7 models evaluated on the same NTC scenario (25K-token context). Automated scoring across 9 dimensions: phase completion, doctrine compliance, maneuver axes, deception/ruse, combat enablers, contingency scenarios, decision points, terrain accuracy, and format adherence.
| MODEL | SIZE | TYPE | SCORE | PHASES | TERRAIN | RUSE | DPs | TIME |
|---|---|---|---|---|---|---|---|---|
| LoRA Specialist A | 24B | Fine-tuned | 85 | 4/4 | Clean | 4 refs | 2 | 101.7s |
| Base Model B | 72B | Vanilla | 75 | 4/4 | Clean | 4 refs | 0 | 295.3s |
| Base Model C | 32B | Vanilla | 75 | 4/4 | Clean | 4 refs | 2 | 211.9s |
| Base Model D | 22B | Vanilla | 74 | 4/4 | Clean | 4 refs | 1 | 106.1s |
| Base Model E | 27B | Vanilla | 72 | 4/4 | Clean | 4 refs | 2 | 106.4s |
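The text names nine scoring dimensions but does not say how they are combined into the single SCORE column. As a minimal sketch, assuming each dimension yields a sub-score in [0, 1] and that all nine are weighted equally (both are assumptions, not the documented method), a composite could be computed like this. The function and key names are hypothetical.

```python
from statistics import mean

# The nine dimensions listed in the text; the snake_case keys are illustrative.
DIMENSIONS = (
    "phase_completion", "doctrine_compliance", "maneuver_axes",
    "deception_ruse", "combat_enablers", "contingency_scenarios",
    "decision_points", "terrain_accuracy", "format_adherence",
)


def composite_score(sub_scores: dict[str, float]) -> int:
    """Average nine sub-scores in [0, 1] and scale to 0-100.

    Equal weighting is an assumption; the real scorer may weight
    dimensions differently.
    """
    missing = [d for d in DIMENSIONS if d not in sub_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= sub_scores[d] <= 1.0:
            raise ValueError(f"{d} must be in [0, 1], got {sub_scores[d]}")
    return round(100 * mean(sub_scores[d] for d in DIMENSIONS))
```

Under these assumptions, a COA that is perfect on eight dimensions and scores zero on decision points would come out at 89.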
Head-to-head comparison of ConclAive's LoRA fine-tuned specialists against unmodified base models on the same scenario. 18 COA generations across multiple model sizes.
| MODEL | SIZE | TYPE | PHASES | TERRAIN | RUSE REFS | ENABLERS | AVG TIME |
|---|---|---|---|---|---|---|---|
| LoRA Specialist (24B) | 24B | Fine-tuned | 12/12 | 0 violations | 21 | 35 | 85.8s |
| LoRA Specialist (32B) | 32B | Fine-tuned | 12/12 | 0 violations | 26 | 33 | 126.1s |
| LoRA Specialist (8B) | 8B | Fine-tuned | 12/12 | 0 violations | 25 | 35 | 27.7s |
| Vanilla Base (22B) | 22B | Vanilla | 12/12 | 0 violations | 25 | 40 | 101.2s |
| Vanilla Base (32B) | 32B | Vanilla | 12/12 | 4 violations | 40 | 52 | 220.4s |
Red team models independently identify flaws in generated COAs, assign severity levels, and produce survivability scores. Fine-tuned specialists compared against vanilla base models.
| MODEL | SIZE | TYPE | FLAWS/COA | CRITICAL | HIGH | FLAW DIVERSITY | AVG TIME |
|---|---|---|---|---|---|---|---|
| RT Specialist A | 14B | Fine-tuned | 12.0 | 1.7 | 5.3 | 6/6 categories | 34.4s |
| RT Specialist B | 24B | Fine-tuned | 9.3 | 1.0 | 2.0 | 6/6 categories | 49.9s |
| Vanilla RT (32B) | 32B | Vanilla | 25.0 | 3.3 | 6.0 | 5.7/6 categories | 257.6s |
| Vanilla RT (22B) | 22B | Vanilla | 11.0 | 1.3 | 2.3 | 5.3/6 categories | 53.4s |
Independent judge model evaluates 60 COAs across 5 generators, 4 scenarios, and 3 commander profiles. Two metrics: Tactical Coherence (0-10) and Overall Quality (0-10).
| GENERATOR | SIZE | TYPE | AVG COHERENCE | AVG QUALITY | SAMPLES |
|---|---|---|---|---|---|
| LoRA Specialist A | 22B | Fine-tuned | 8.42 | 8.58 | 12 |
| LoRA Specialist B | 8B | Fine-tuned | 8.33 | 8.42 | 12 |
| LoRA Specialist C | 8B | Fine-tuned | 8.08 | 7.17 | 12 |
| Vanilla Base (70B) | 70B | Vanilla | 7.92 | 7.17 | 12 |
1,000-iteration stochastic simulation for naval operations (near-peer adversary, multiple weather conditions). Each iteration applies random perturbations to 10 parameters and evaluates COA survivability.
| COA | RESULT | 95% CI LOWER | 95% CI UPPER | BASE PROB | TOP FAILURE MODE |
|---|---|---|---|---|---|
| CHARLIE | Score 78 | Highest tactical score: aggressive posture with escalation risk surfaced automatically | | | |
| ALPHA | 38.2% | 35.2% | 41.3% | 69.7% | Weather Degradation (16.6%) |
| BRAVO | Score 65 | Conservative posture: logistically exposed under perturbation | | | |
Top failure modes (1,000 iterations): Weather Degradation (16.6%), OPFOR Repositioning (16.0%), Submarine Contact (6.9%). The Monte Carlo run shows that the highest-scoring COA under deterministic analysis (CHARLIE, score 78) carries significant escalation risk; the system surfaces this tradeoff automatically for the commander's decision.
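The perturb-and-evaluate loop and the confidence bounds can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the multiplicative uniform jitter, the `survives` callback, and the parameter dictionary are all hypothetical, and the source does not say which interval method produced the table's bounds. A Wilson score interval is used here because the Trust Calibration module is listed as using Wilson 95% CI scoring.

```python
import math
import random
from typing import Callable


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def run_monte_carlo(
    survives: Callable[[dict[str, float]], bool],
    base_params: dict[str, float],
    iterations: int = 1000,
    jitter: float = 0.1,  # assumed +/-10% multiplicative noise
    seed: int = 0,
) -> int:
    """Count iterations in which the COA survives a randomly perturbed scenario."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(iterations):
        # Perturb every parameter independently (the perturbation model is an assumption).
        perturbed = {
            name: value * (1 + rng.uniform(-jitter, jitter))
            for name, value in base_params.items()
        }
        if survives(perturbed):
            successes += 1
    return successes


if __name__ == "__main__":
    # Assuming ALPHA's 38.2% corresponds to 382 survivals out of 1,000 iterations.
    low, high = wilson_interval(382, 1000)
    print(f"{low:.1%} to {high:.1%}")
```

Assuming ALPHA's 38.2% result means 382 survivals in 1,000 iterations, `wilson_interval(382, 1000)` gives roughly 35.2% to 41.3%, which is consistent with the bounds reported in the ALPHA row.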
7 modules implementing human-machine teaming requirements. All modules benchmarked with automated test suites. Combined latency under 200ms.
| MODULE | FUNCTION | LATENCY | TESTS | KEY METRIC |
|---|---|---|---|---|
| Intent Parser | NL to JP 5-0 weighted criteria | <100ms | 3,000-item CV | 81% accuracy (5-fold) |
| Abstention (L1+L2) | Out-of-domain rejection | <50ms | 300-item | 81% recall, 100% specificity |
| Commander Profile | EMA trait tracking, 4 archetypes | 0.4ms | 7/7 | Converges in <50 interactions |
| Transactive Memory | Knowledge distribution (Wegner 1987) | 0.06ms | 9/9 | Real-time team state |
| Interdependence | Shared goals (Salas Big Five) | 0.04ms | 6/6 | Complementary role mapping |
| Trust Calibration | 6-factor per-COA confidence | 0.05ms | 7/7 | Wilson 95% CI scoring |
| Cognitive Forcing | Anomaly detection, debiasing | 0.87ms | 7/7 | Trust break triggers |
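To illustrate the EMA trait tracking listed for the Commander Profile module, here is a minimal sketch of an exponential moving average update. The smoothing factor `alpha=0.1`, the neutral prior of 0.5, and the trait name are assumptions chosen for illustration; the source does not state the actual values.

```python
from dataclasses import dataclass, field


@dataclass
class TraitTracker:
    """Tracks commander traits in [0, 1] with an exponential moving average."""

    alpha: float = 0.1    # assumed smoothing factor
    initial: float = 0.5  # assumed neutral prior for an unseen trait
    traits: dict[str, float] = field(default_factory=dict)

    def update(self, trait: str, observation: float) -> float:
        """Blend a new observation into the running estimate and return it."""
        previous = self.traits.get(trait, self.initial)
        current = self.alpha * observation + (1 - self.alpha) * previous
        self.traits[trait] = current
        return current


if __name__ == "__main__":
    tracker = TraitTracker()
    for _ in range(50):
        # Hypothetical trait: a commander who consistently picks high-risk options.
        estimate = tracker.update("risk_tolerance", 1.0)
    print(f"{estimate:.3f}")
```

With these assumed settings, the influence of the prior decays as 0.9 per update, so after 50 consistent observations the estimate lies within about 0.003 of the observed value. That is in line with the stated convergence in under 50 interactions, though the real module may use different parameters.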
13 fine-tuned LoRA adapters across 4 base model architectures (8B to 32B). Trained via frontier distillation, validated on held-out scenarios.
| SPECIALIST ROLE | FORMAT COMPLIANCE | KEY METRIC | LATENCY | vs VANILLA BASE |
|---|---|---|---|---|
| COA Generator | 100% (48/48 COAs) | 90% structure consistency | 29s | +87pp compliance |
| Red Team Analyst | 94% (15/16) | 6.5 flaws/COA avg | 39s | +94pp compliance |
| JP5 Judge | 94% (15/16) | 83% rank preservation | 5s | +100pp compliance |
COA training data generated via frontier distillation from multiple proprietary and open-source generators, automatically filtered for quality.
| DATASET | RUNS | SAMPLES | PASSED | PASS RATE |
|---|---|---|---|---|
| COA Training (all) | 29 | 70 | 64 | 91.4% |
| Adapted COA Training | 29 | 43 | 43 | 100% |
Weather coverage: 9 distinct conditions (optimal, storm, snow, rough, overcast, improving, clear, rain, degrading).
Generator coverage: 7 frontier LLMs used for distillation. All failures excluded from training data via automated filtering.
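The pass rates in the table follow directly from the filter counts (64 of 70 is 91.4%). A minimal sketch of such a filtering step is shown below; the `passes_quality` predicate and the sample structure are hypothetical stand-ins, since the source does not describe the actual quality checks.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def filter_samples(
    samples: Sequence[T],
    passes_quality: Callable[[T], bool],
) -> tuple[list[T], float]:
    """Keep only samples that pass the quality check and report the pass rate."""
    kept = [s for s in samples if passes_quality(s)]
    rate = len(kept) / len(samples) if samples else 0.0
    return kept, rate


if __name__ == "__main__":
    # Hypothetical data: 70 generated samples, 6 of which fail the check.
    samples = [{"id": i, "ok": i >= 6} for i in range(70)]
    kept, rate = filter_samples(samples, lambda s: s["ok"])
    print(len(kept), f"{rate:.1%}")
```

Only the `kept` list would go on to training, which matches the statement that all failures are excluded from the training data.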
1,000+ validated pipeline runs spanning 4 operational domains, 12 adversary doctrine files, and 73 scenario configurations.
| DOMAIN | SCENARIO TYPES | ADVERSARY DOCTRINES |
|---|---|---|
| Land Warfare | Conventional force-on-force, unconventional | Peer, near-peer |
| Naval Operations | Carrier strike group, littoral operations | Near-peer naval |
| Air Domain | Multi-domain coordination, littoral air | Peer, regional |
| Counter-UxS | Drone swarm defense, counter-surveillance | Regional, non-state |