# Energy Decision Stack — v33 Upgrade Roadmap

**Generated:** March 31, 2026
**Purpose:** Prioritized action plan synthesizing the independent research audit (16 analysis improvements + 14 reproducibility gaps) against known gaps already documented in `methodology_faq.md`.

---

## Framing

The audit correctly identifies that the package is **auditable today but not independently reproducible**. The methodology FAQ is honest about this (see §6, "What this analysis IS NOT"). The question is what to fix first to close the credibility gap for the two audiences that matter: investors evaluating the thesis, and operators deciding whether to run a pilot.

The 16 improvement items and 14 data gaps collapse into **four tiers** by impact and feasibility.

---

## Tier 1: Do Now (v32.1 patch — days, not weeks)

These are HTML/content fixes that change no underlying data but dramatically improve how the analysis presents its own limitations.

| # | Audit Item | Action | Effort |
|---|-----------|--------|--------|
| 4 | Coverage badge on augmented exhibits | Add "304 of 404 rows" badge to every exhibit using v32 axes (compressibility, criticality, reasoning demand, control). One CSS class + 4 HTML attributes. | 1 hr |
| 10 | Downgrade unvalidated external claims | Move 408 GW IA, 175 GW transmission, $110B grid optimization out of headline language. Demote to footnoted scenarios with source caveats. Cross-ref `external_sources.md` flags. | 2 hr |
| 11 | Label 60-70% document-work figure | Add an explicit "internal modeled estimate" cite-type badge everywhere this figure appears (hero section, reflexive loop, beachheads). As currently presented, it reads like a sourced statistic rather than an internal estimate. | 1 hr |
| 9 | Surface counterforces in main flow | Add 3-4 "What could break this" callout boxes beside the biggest claims (queue compression, decision quality multiplier, $1T reallocation). Pull content from `counterforces_and_failure_modes.md`. | 3 hr |
| 2 | Inline evidence labels on headline claims | Add claim-type badges (Measured/Modeled/Scenario/Hypothesis) to section headers, not just in the research room. The taxonomy already exists. Extend the `.cite-type` CSS to work at heading level (see the sketch below). | 2 hr |

**Total Tier 1: ~9 hours.** No data changes. No new research. Just honesty surfaced where readers will see it.
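Items 4 and 2 reduce to the same mechanism: an exhibit container or heading carries a data attribute, and a small script injects a styled badge. A minimal sketch in TypeScript, assuming hypothetical attribute names (`data-coverage`, `data-claim-type`); only the `.cite-type` class comes from the audit item itself, everything else is illustrative:

```typescript
// Minimal sketch: inject coverage and claim-type badges from data attributes.
// The attribute names (data-coverage, data-claim-type) are assumptions, not
// the package's actual markup; only .cite-type is named in the audit.

function injectBadges(root: Document = document): void {
  // Coverage badges on v32-augmented exhibits, e.g. data-coverage="304 of 404 rows".
  root.querySelectorAll<HTMLElement>("[data-coverage]").forEach((el) => {
    const badge = root.createElement("span");
    badge.className = "coverage-badge";
    badge.textContent = el.dataset.coverage ?? "";
    el.prepend(badge);
  });

  // Claim-type badges at heading level, e.g. data-claim-type="Modeled".
  root.querySelectorAll<HTMLElement>("h2[data-claim-type], h3[data-claim-type]").forEach((el) => {
    const badge = root.createElement("span");
    badge.className = `cite-type cite-type--${(el.dataset.claimType ?? "").toLowerCase()}`;
    badge.textContent = el.dataset.claimType ?? "";
    el.append(badge);
  });
}

document.addEventListener("DOMContentLoaded", () => injectBadges());
```

Keeping the badge text in the markup rather than hard-coding it in JS means the "304 of 404 rows" figure updates in one place when Tier 2 fills the missing rows.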

---

## Tier 2: Do Next (v33 data phase — 2-4 weeks)

These require new data collection or computation but are bounded and achievable.

| # | Audit Item | Action | Effort |
|---|-----------|--------|--------|
| 5 | Fill or scope out 100 missing augmented rows | Either score the remaining 100 positions on the v32 axes (compressibility, criticality, reasoning demand, control) or formally scope every augmented exhibit to the "304-row subset." Decision: filling is better for credibility. Use the same rubric and analyst-assignment process. | 1-2 wk |
| 3 | Fix workflow/artifact double-counting | Zero out employment estimates for workflow and artifact rows in headcount-facing exhibits. Move them to a separate non-additive panel. The overlap is ~197,250 headcount (~4.9%) per methodology FAQ L47, but the visual impression is worse than the number. | 1 day |
| 13 | Add measured KPIs to beachheads | Pull eval metrics from `eval_spec_treasury_lender_readiness.xlsx` (cycle time, analyst hours, citation fidelity, numeric reconciliation accuracy, unsupported-claim rate, error escape rate) and display as "success criteria" in each beachhead section. These already exist — just surface them. | 4 hr |
| 7 | Publish row-level provenance | Add source_basis, confidence_level, and uncertainty_band columns to the main CSV (see the sketch below this table). The methodology FAQ already defines the confidence intervals (±1.0 points on both compressibility and criticality). Expose them in the Role Explorer tooltip. | 3 days |
| 8 | Resolve compressibility method | Standardize: document that production compressibility = task_exposure_v1 for 290 rows + analyst assignment for 114 rows. Either remove the phantom formula from the FAQ entirely or implement the four-component version and re-score. The FAQ already disclosed this honestly — now clean up the method. | 1 wk |

**Total Tier 2: ~3-4 weeks.** This is the "v33 data fixes" workstream already identified in the superseded `thread_context_energy_decision_stack.md` (now in `superseded/`).
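Items 3 and 7 are both row-level transformations of the main CSV and could ship as one pass. A sketch of the shape in TypeScript; `source_basis`, `confidence_level`, and `uncertainty_band` are the columns named above, while every other field name (`row_type`, `employment_estimate`, `employment_non_additive`) is a hypothetical stand-in for the real schema:

```typescript
// Sketch of one pass over the main CSV rows, covering items 3 and 7 together.
// All field names except the three provenance columns named above are
// assumptions for illustration, not the package's actual schema.

interface Row {
  row_type: "position" | "workflow" | "artifact"; // hypothetical discriminator
  employment_estimate: number;                    // hypothetical headcount field
}

interface Provenance {
  source_basis: string;     // e.g. "BLS OES", "analyst assignment", "formula"
  confidence_level: string; // e.g. "high" | "medium" | "low"
  uncertainty_band: string; // e.g. "±1.0 points"
}

type AugmentedRow = Row & Provenance & {
  employment_non_additive: number | null; // parked for the non-additive panel
};

function augment(rows: Row[], provenanceFor: (row: Row) => Provenance): AugmentedRow[] {
  return rows.map((row) => {
    // Item 3: workflow and artifact rows overlap with position headcount,
    // so zero them in additive exhibits and park the figure separately.
    const doubleCounted = row.row_type !== "position";
    return {
      ...row,
      employment_estimate: doubleCounted ? 0 : row.employment_estimate,
      employment_non_additive: doubleCounted ? row.employment_estimate : null,
      // Item 7: attach row-level provenance columns.
      ...provenanceFor(row),
    };
  });
}

// Example: a formula-derived workflow row gets zeroed and labeled.
const out = augment(
  [{ row_type: "workflow", employment_estimate: 1200 }],
  () => ({ source_basis: "formula", confidence_level: "low", uncertainty_band: "±1.0 points" }),
);
console.log(out[0].employment_estimate, out[0].employment_non_additive); // 0 1200
```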

---

## Tier 3: Structural Upgrade (v33-v34 — 1-2 months)

These change the architecture of the deliverable.

| # | Audit Item | Action | Effort |
|---|-----------|--------|--------|
| 1 | Split into three explicit layers | Restructure into: (a) Core benchmark (404 positions, scores, wage bill), (b) Deployment evidence (beachheads, workflows, evals), (c) Strategic scenarios (TAM, reflexive loop, native-AI, $1T reallocation). Each layer carries its own burden of proof. The single-page format can stay — use clear visual dividers. | 1-2 wk |
| 6 | Replace formulaic employment estimates | This is the single biggest credibility upgrade. 344 of 404 rows use a deterministic formula. Replace with: BLS OES extraction (expand from 60 to ~120 matched codes), company 10-K headcount disclosures, industry association surveys. Target: 200+ rows with external sourcing, formula as fallback for remainder. | 3-4 wk |
| 12 | Replace composite proof-room with real evidence | Requires actual pilot data. The proof-room doc (L220-227) defines what's needed: redacted artifact families, before/after traces, model-eval results, error examples, signoff boundaries. This is gated on running the treasury/lender beachhead pilot. | 2-3 mo |
| 14 | Multi-rater scoring pass | 3-5 analysts independently score a sample (30-50 roles), measure inter-rater reliability (see the sketch below this table), and adjudicate disagreements. The FAQ's validation protocol (§6, inter-rater reliability section) already defines this. Budget: $15-25K for external analysts. | 3-4 wk |
| 15 | Harden production artifact | JS unit tests for calculation functions. Automated chart/data consistency checks. Mobile audit. Accessibility audit against WCAG 2.1 AA. | 1-2 wk |
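For item 14, the reliability number itself is cheap to compute once the independent scoring sheets exist. A minimal sketch in TypeScript of pairwise agreement within the FAQ's ±1.0-point band; the data layout is an assumption, and a formal statistic (weighted kappa, an ICC) could substitute:

```typescript
// Sketch: share of rater pairs agreeing within a tolerance band, per role.
// The scores[roleIndex][raterIndex] layout is an assumed shape, not the real
// scoring-sheet format.

function pairwiseAgreement(scores: number[][], tolerance = 1.0): number {
  let agree = 0;
  let pairs = 0;
  for (const roleScores of scores) {
    for (let i = 0; i < roleScores.length; i++) {
      for (let j = i + 1; j < roleScores.length; j++) {
        pairs += 1;
        if (Math.abs(roleScores[i] - roleScores[j]) <= tolerance) agree += 1;
      }
    }
  }
  return pairs === 0 ? NaN : agree / pairs;
}

// Example: 3 raters on 2 roles; 5 of the 6 pairwise comparisons fall within ±1.0.
console.log(pairwiseAgreement([[4, 5, 4.5], [2, 3.5, 3]])); // ≈ 0.833
```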

---

## Tier 4: Scope Expansion (v34+ — quarters)

These expand the analysis into adjacent territory. Only pursue after Tiers 1-3 are solid.

| # | Audit Item | Action | Notes |
|---|-----------|--------|-------|
| 16 | Competitive and regulatory granularity | Map the AI-for-energy competitive landscape. Add regulator-by-regulator variation (FERC vs. state PUCs vs. BOEM) and international coverage (NOCs, European grid codes). | Large scope. Separate deliverable. |

---

## Reproducibility Gap Summary

The 14-point data request collapses into **five must-haves** for independent reproduction:

1. **Raw BLS extraction + occupation-to-role mapping table** — The 60-role measured foundation. Without this, nobody can verify the starting point.

2. **Row-level scoring audit trail** — Pre-consensus sheets, scorer rationale, override log. Currently the package gives final scores but not the decision process.

3. **Compensation source workbook** — The six layer-level proxies ($58K-$180K) are stated but the calibration source isn't included.

4. **Missing v32 subcomponent inputs** — workflow_standardization, template_density, refresh_frequency, evidence_volume, branching_factor for the 100 rows that need them.

5. **Transformation code/notebook** — The computation path from raw data to the final CSV and headline metrics. Currently only the inline JS exists. A skeleton of what this could look like is sketched below the list.
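Item 5 does not demand a heavy framework: a single script that walks from raw inputs to the published CSV and prints the headline metrics would close the gap. A skeleton in TypeScript for Node, with every path and step a placeholder for whatever the real pipeline contains:

```typescript
// Skeleton of a reproducible transformation path: raw inputs -> final CSV ->
// headline metrics. Every path and step here is a placeholder, not one of the
// package's actual files or functions.

import { readFileSync, writeFileSync } from "node:fs";

// Naive CSV reader for illustration; a real pipeline should use a CSV library
// that handles quoting and escapes.
function loadCsv(path: string): Record<string, string>[] {
  const [header, ...lines] = readFileSync(path, "utf8").trim().split("\n");
  const cols = header.split(",");
  return lines.map((line) => {
    const vals = line.split(",");
    const row: Record<string, string> = {};
    cols.forEach((c, i) => { row[c] = vals[i] ?? ""; });
    return row;
  });
}

function toCsv(rows: Record<string, string>[]): string {
  const cols = Object.keys(rows[0] ?? {});
  return [cols.join(","), ...rows.map((r) => cols.map((c) => r[c]).join(","))].join("\n");
}

function main(): void {
  const raw = loadCsv("raw/bls_oes_extract.csv"); // placeholder input path
  const scored = raw; // scoring and augmentation steps would chain here
  writeFileSync("out/roles_final.csv", toCsv(scored)); // placeholder output path
  console.log("rows:", scored.length); // headline metrics would print here
}

main();
```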

**Priority order:** Items 1 and 2 first (they're the credibility foundation), then 3 and 5 (enable independent computation), then 4 (enables v32 layer completion).

---

## What the HTML Fixes Already Address

The following items from the audit are **already addressed or explicitly queued** by the fixes applied today:

- **Methodology disclaimer** — Added visible "All scores are modeled estimates" banner before Exhibit 1
- **Scenario labeling** — Added "HYPOTHETICAL" badge on Exhibit 4 (scenarios)
- **Counterforce integration** — Identified as Tier 1 priority (not yet implemented)
- **Evidence labels** — Claim-type taxonomy already unified across HTML, CSS, two CSVs (from the 25-fix round). Extending to section headers is Tier 1.

---

## Decision Framework

If the goal is **investor credibility**: prioritize Tier 1 (honest labeling) + Tier 2 items 3, 7, 13 (surface what you already have). Investors reward transparency about limitations more than they punish having them.

If the goal is **operator adoption**: prioritize Tier 2 item 13 (beachhead KPIs) + Tier 3 item 12 (real proof-room evidence). Operators need to see someone ran the pilot and it worked.

If the goal is **academic/peer credibility**: prioritize Tier 3 items 6 and 14 (better employment data + multi-rater scoring). This is the "would it survive peer review" standard.

The package is honest about where it stands. The job now is to surface that honesty where readers will see it, then systematically close the gaps in priority order.

---

## Note: Citation Audit

The v31 package included a structured citation audit file (`SUPERSEDED_citation_audit_v31.csv`, 60 rows). In v32, this function was replaced by `external_sources.md` (a validation table with URL, date, exact/approximated status, and confirmation flags). The external sources table is less structured but more comprehensive. A v33 upgrade could re-introduce a structured citation audit format if needed for institutional review.
