# Methodology FAQ: How to Recreate the Energy Decision Stack Analysis

This document is designed to let someone with domain knowledge audit and challenge every methodological choice in the analysis. The package is auditable at the logic level — assumptions are documented, key formulas are disclosed, and claims are tagged by type and confidence. It is not yet independently reproducible from raw data: the BLS extraction pipeline, row-level scoring audit trail, compensation source workbook, and transformation code are not included in this bundle (these are flagged as v33 deliverables in the upgrade roadmap).

---

## Section 1: Data Sources

### Q: Where did the 404 position list come from?

**A:** Four sources, in order of priority:

1. **BLS Occupational Employment Statistics (OES)** — Downloaded NAICS-based role lists for energy sectors:
   - 211 (Oil and gas extraction)
   - 213 (Support activities for mining)
   - 2211 (Electric power generation)
   - 2212 (Electric power transmission and distribution)
   - 237 (Heavy and civil engineering construction)

   This gave us ~120 BLS occupation codes, of which 60 matched our energy role taxonomy and had direct employment and wage data.

2. **Company organization charts** — Worked with energy operators, utilities, midstream companies, and trading desks to map real org structures. This expanded coverage to roles BLS doesn't track:
   - Asset management and portfolio optimization
   - Minerals and non-op workflows (JOA analysts, division order processors, royalty accountants)
   - Commodity trading and middle/back office
   - Regulatory affairs and government relations
   - Digital transformation and data governance

   This added ~180 roles.

3. **Industry interviews** — Conducted 40+ conversations with energy executives, technical leads, and operational staff to validate role taxonomy and identify workflows. This revealed:
   - Nuanced distinctions within roles (e.g., "reservoir engineer" has 5-6 sub-workflows depending on career stage)
   - Company-specific variations (utilities do more regulatory work than operators)
   - Workflow sequences (how roles interact to deliver decisions)

   This refined definitions and added ~80 specialized roles.

4. **Expansion by workflow and decision type** — Built a role-matrix by:
   - Workflow type (e.g., regulatory, commercial, operational, financial)
   - Decision criticality (e.g., board-level, management, routine)
   - Knowledge domain (geoscience, commercial, technical, financial)

   This ensured full coverage across the decision stack, not just org-chart roles.

**Total:** 404 positions — 373 roles, 24 workflows, and 7 artifacts. Roles are people/seats (e.g., reservoir engineer, trading analyst). Workflows are recurring decision loops (e.g., production accounting close support, rate case evidence assembly). Artifacts are document deliverables (e.g., board pack production, nuclear licensing packet assembly). The dataset tags each row with a `row_type` column to distinguish them.

Coverage is deliberately broad: some roles are 50-person functions in a large integrated company (e.g., reserve engineers), while others are 2-person functions (e.g., chief risk officer). The dataset uses modeled employment estimates and layer-level compensation proxies to compute the exposure-weighted wage bill. Headline figures like the ~90% above-field share are exposure-weighted wage-bill shares (employment × layer comp proxy × compressibility score / 10), not simple headcount shares. The compressibility score alone does not determine ranking.

**Known overlap:** The 31 workflow and artifact rows carry their own employment estimates (totaling ~197,250 headcount, or ~4.9% of the total). These likely overlap with the role-based employment for the people who perform those workflows. Excluding workflow/artifact rows changes the above-field share by ~1 percentage point. This overlap is acknowledged but not yet resolved; a future version should either zero out workflow/artifact employment or tag them as non-additive.

---

### Q: Where did employment estimates come from?

**A:** Two tiers. The distinction matters, and earlier versions of this FAQ blurred it. Here's what the code actually does.

**Tier 1: BLS/industry-sourced rows (60 of 404)**

For 60 rows (58 roles + 1 workflow + 1 artifact), we use Bureau of Labor Statistics Occupational Employment Statistics (OES) or industry-sourced employment data directly. These positions have nationally reported employment counts and median wages tied to specific NAICS codes (211, 213, 2211, 2212, 237). Example: BLS shows ~12,000 petroleum engineers (OES 17-2171). We use this number as-is.

These 60 rows are stored in a lookup table (`_EMP_KNOWN`) in the codebase. When the code encounters one of these positions, it returns the BLS/industry figure directly with no adjustment.

**Tier 2: Formulaic estimates (344 of 404)**

For the remaining 344 rows — the energy-specific positions that BLS doesn't track (JIB analysts, division order processors, interconnection specialists, trading ops, etc.) — the code uses a deterministic estimation formula:

```
employment = layerBase × tierMultiplier
```

Where:
- **layerBase** is a fixed starting point per organizational layer: Physical operations (22,000), Technical (8,000), Corporate (10,000), Advisory (7,000), Capital markets (5,000), Governance (3,000). These were set by reviewing aggregate BLS employment data for each layer and dividing by estimated role count per layer.
- **tierMultiplier** assigns each role to one of three explicit bands: Low (0.75×), Mid (1.0×), High (1.25×). Tier assignment uses a stable hash of the role name string (hash mod 3), producing a near-equal distribution (~⅓ per tier). This gives three discrete employment levels per layer rather than pseudo-random integers that would imply false precision.
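The Tier 2 estimator above can be sketched in a few lines. The layer bases and tier multipliers are the figures from this FAQ; the specific hash function (`sha256` here) is an assumption, since the production code's stable hash is not disclosed.

```python
import hashlib

# Layer bases from this FAQ. The hash function is a stand-in: any stable
# string hash reproduces the "hash mod 3" tier assignment described above.
LAYER_BASE = {
    "Physical operations": 22_000,
    "Technical": 8_000,
    "Corporate operations": 10_000,
    "Advisory": 7_000,
    "Capital markets": 5_000,
    "Governance": 3_000,
}
TIER_MULTIPLIER = [0.75, 1.0, 1.25]  # Low / Mid / High bands

def estimate_employment(role_name: str, layer: str) -> float:
    """Tier 2 estimate: layerBase x tierMultiplier, with the tier chosen
    by a stable hash of the role name (hash mod 3)."""
    digest = hashlib.sha256(role_name.encode("utf-8")).hexdigest()
    tier = int(digest, 16) % 3
    return LAYER_BASE[layer] * TIER_MULTIPLIER[tier]

# Each layer yields exactly three possible values, e.g. Physical operations:
for mult in TIER_MULTIPLIER:
    print(22_000 * mult)  # 16500.0, 22000.0, 27500.0
```

Because the tier depends only on the role name string, the estimate is deterministic: rerunning the pipeline reproduces the same three-band distribution.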

**v32.1 change:** An earlier version included a **scoreFactor** `max(0.15, 1.3 - score/8)` that gave low-scoring positions more employment by construction. This was methodologically suspect because it used the same compressibility score in both the employment estimator and the exposure numerator, partly manufacturing the flagship conclusion. The score factor was removed in v32.1. The continuous **nameHashJitter** (0.7–1.3 range) was also replaced with the three-band tier system to eliminate false precision — exact integers like "8,547 employees" implied measurement accuracy that didn't exist.

**What this means in practice:** The 344 estimated rows produce three employment levels per layer (e.g., Physical: ~16,500 / ~22,000 / ~27,500). They are structurally consistent (field positions are bigger than governance positions) but are not individually calibrated. The formula was designed as scaffolding for directional wage-pool analysis, not as a census.

**What the estimates are good for:** Layer-level aggregation. When you sum all Physical layer employment vs. all Advisory layer employment, the pattern (field has ~29% of headcount but ~10% of exposed wage bill) is driven primarily by the layer base rates and the compensation proxies, not by individual position estimates. That pattern is the claim.

**What the estimates are NOT good for:** Comparing individual role employment counts. The formula produces only three possible values per layer. Don't cite individual role employment as precise.

**Confidence:**
- Tier 1 (BLS/industry-sourced, 60 rows): ±10%. These are government survey data.
- Tier 2 (formulaic, 344 rows): ±50% or more for individual positions. The formula is a structured estimate, not a measurement. Layer-level aggregates are more reliable than individual rows because estimation errors partially cancel.

**Why not use BLS family-matching for all 344?** Because most of these positions don't have a defensible BLS proxy. "Non-op JIB analyst" doesn't map cleanly to any BLS occupation code. Forcing a match would create false precision — the result would look empirical but still be a guess. The formula is more honest: it says "here's our structural assumption about how headcount distributes across layers, with explicit tier bands." You can disagree with the layer bases and recalculate.

**Important:** Employment is a secondary input. The analysis scores roles on compressibility (and related axes), then multiplies by compensation proxies to get exposure-weighted wage bill. If a role's employment estimate is off by 50%, the wage-pool impact is proportionally smaller because aggregation across hundreds of roles dampens individual errors. The directional finding — that the exposed wage bill concentrates above the field — holds across a wide range of employment assumptions.

---

### Q: Where did pay data come from?

**A:** The v32 implementation uses **layer-level median compensation proxies**, not role-specific wage data. This is simpler and more honest than the prior version's description suggested.

Every role is assigned to one of six organizational layers, and every role in that layer uses the same compensation figure:

- Physical operations: $58,000
- Technical: $95,000
- Corporate operations: $72,000
- Advisory: $105,000
- Capital markets: $130,000
- Governance: $180,000

These six figures were calibrated by reviewing BLS OES median wages for energy-sector occupations at each layer and selecting a representative midpoint. They are deliberately round numbers — the intent is to capture the structural gap between layers (field vs. advisory vs. governance), not to estimate any individual role's actual compensation.

**Why not use role-specific BLS wages?** Because the 344 estimated positions don't have BLS wage data. Mixing BLS wages (for 60 rows) with modeled wages (for 344) creates a false sense of precision. The layer proxy keeps the methodology uniform: every role gets the same treatment. If you think $95K is wrong for the technical layer, change it once and recalculate everything.

**Confidence:** ±15-25% for any given role. The proxies are deliberately coarse. A senior reservoir engineer in Houston makes more than $95K; a junior petroleum technologist in Oklahoma makes less. The proxy captures the layer-level center of gravity, not individual compensation.

**Important:** The combination of formulaic employment (Tier 2) and layer-level compensation means that the wage-pool numbers (like the ~$236B exposure-weighted total) are structural estimates. They reflect the relative weight of each layer, not the precise dollar cost of each role. This is disclosed in the dataset and in the "What this does not claim" section of the main page.

---

### Q: How is compensation estimated in the dataset?

**A:** The v32 dataset uses a **layer-level median compensation proxy** rather than role-specific BLS matches. This approach is simpler and more defensible than attempting exact matches.

**The layer_median_comp_proxy system:**

Each role is assigned to one of six layers based on typical responsibility and decision scope:

1. **Physical operations:** $58,000 median annual wage
   - Field technicians, operators, maintenance workers

2. **Technical:** $95,000 median annual wage
   - Engineers, specialists, domain experts

3. **Corporate operations:** $72,000 median annual wage
   - HR, finance operations, compliance, administration

4. **Advisory:** $105,000 median annual wage
   - Analysts, consultants, planners, advisors

5. **Capital markets:** $130,000 median annual wage
   - Traders, portfolio managers, risk officers, pricing specialists

6. **Governance:** $180,000 median annual wage
   - Directors, officers, executives, board-level advisors

**Why this approach:**

Role-by-role BLS matching creates false precision. A "petroleum engineer" title varies wildly in compensation depending on experience, location, and company size. Instead, we assign every role to a layer and use the layer median.

This is more transparent: you can see exactly what compensation assumption is baked into every calculation. If you disagree with the $95K "technical" layer assumption, you can adjust it globally and recalculate the entire wage pool.
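That single-knob recalculation can be sketched as follows. The layer proxies are the six figures above; the rows and the $105K adjustment are illustrative, not values from the dataset.

```python
# Layer-level compensation proxies from this FAQ.
LAYER_COMP = {
    "Physical operations": 58_000,
    "Technical": 95_000,
    "Corporate operations": 72_000,
    "Advisory": 105_000,
    "Capital markets": 130_000,
    "Governance": 180_000,
}

# Two illustrative rows standing in for the 404-position dataset.
rows = [
    {"layer": "Technical", "employment": 12_000, "compressibility": 6.2},
    {"layer": "Corporate operations", "employment": 8_000, "compressibility": 8.1},
]

def wage_pool(rows, comp=LAYER_COMP):
    # Exposure-weighted wage bill: employment x layer proxy x score / 10
    return sum(r["employment"] * comp[r["layer"]] * r["compressibility"] / 10
               for r in rows)

baseline = wage_pool(rows)
# Disagree with the $95K Technical proxy? Change it once and recompute.
adjusted = wage_pool(rows, {**LAYER_COMP, "Technical": 105_000})
```

The design choice this illustrates: because every role in a layer shares one proxy, a single dictionary edit propagates through every downstream figure.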

---

## Section 2: Scoring Methodology

### Q: What are the 4 original axes?

**A:** The v32 methodology added four new layers on top of the original 2024 benchmark. The original four axes were:

1. **Task exposure (0-10 scale)** — How much of the role's work is performed by current LLMs without domain-specific engineering?
   - 9-10: Pure language work (document review, writing, email, communication)
   - 7-8: Language + simple calculation (spreadsheet interpretation, data summarization)
   - 5-6: Language + domain knowledge (financial analysis, regulatory interpretation)
   - 3-4: Domain knowledge + judgment (technical design, deal strategy)
   - 0-2: Physical work, judgment-critical (field operations, live trading execution)

2. **Adoption feasibility (0-10 scale)** — How plausible is it that a typical energy company could deploy a solution in this workflow in the next 18-36 months?
   - 9-10: Standalone digital workflow (all documents are digital, no manual integrations)
   - 7-8: Mostly digital with integration (some legacy systems, doable)
   - 5-6: Requires data engineering (needs 4-8 weeks of integration work)
   - 3-4: Requires domain engineering (needs custom ML or significant scripting)
   - 0-2: Requires regulatory change (can't deploy without regulator approval)

3. **Economic compression risk (0-10 scale)** — How much labor cost could be displaced if this role were fully compressed?
   - 9-10: High-wage, routine work (cost per unit is high; solution saves money fast)
   - 7-8: Medium-wage, recurring work
   - 5-6: Medium-wage, episodic work
   - 3-4: Low-wage or highly episodic work
   - 0-2: Work that can't be automated (judgment-critical, social, regulatory)

4. **Fee disruption risk (0-10 scale)** — How much of the external advisory market is at risk of being brought in-house?
   - 9-10: Routine external advisory (audits, regulatory filings, templates)
   - 7-8: Specialized but standardizable advisory
   - 5-6: Mixed advisory (some routine, some judgment-critical)
   - 3-4: Mostly judgment-critical external work
   - 0-2: External work that can't be internalized (regulatory authority, third-party certification)

These four axes were scored for each role by analysts with energy operating experience. Scores are entered as integers (0-10). The aggregate benchmark rolls these up by wage pool and decision type.

---

### Q: What are the v32 additions?

**A:** Version 32 added expanded-coverage scoring for four new axes (304 of 404 positions), applied on top of the original four:

1. **Workflow compressibility (0-10 scale)** — How plausibly can current AI compress the prep stack around this workflow?

   **Conceptual inputs:** Task exposure, adoption feasibility, workflow standardization, and template density all inform the compressibility score. In practice, the v32 dataset derives compressibility from `task_exposure_v1` for the majority of rows (290 of 404), with analyst adjustment applied to 14 rows where workflow-specific factors warranted a different score. An additional 100 rows have compressibility scores assigned directly by analyst judgment where `task_exposure_v1` was not available.

   **Earlier versions of this FAQ described a weighted formula:**
   ```
   compressibility = (0.40 × task_exposure) + (0.30 × adoption_feasibility)
                     + (0.15 × workflow_standardization) + (0.15 × template_density)
   ```
   This formula was the intended design but was not consistently applied in the production dataset. The columns `workflow_standardization` and `template_density` are not present in the CSV. For transparency: the scores in the dataset reflect a mix of formulaic derivation from task exposure and direct analyst assignment, not a strict application of the four-component formula.

   Scale: 0-2 (low compressibility) to 8-10 (high compressibility)

2. **Decision criticality (0-10 scale)** — How much value can be created or destroyed by this decision?

   Built from four modeled subcomponents:
   - Upside optionality (0-4 scale): what's the profit potential if the decision is right?
     - 3-4: Decisions that could create 10%+ company value (capital allocation, M&A, strategy)
     - 2-3: Decisions worth 1-10% company value (project selection, cost optimization)
     - 1: Decisions worth <1% company value
     - 0: Decisions with no financial upside

   - Downside severity (0-4 scale): what's the risk if the decision is wrong?
     - 3-4: Decisions that could destroy 5%+ company value (safety failures, regulatory breaches, bad reserves books)
     - 2-3: Decisions worth 1-5% company value (operational delays, cost overruns)
     - 1: Decisions worth <1% company value
     - 0: Decisions with no downside

   - Authority level (0-2 scale): who makes the final call?
     - 2: Board, CEO, regulatory authority (irreversible, public)
     - 1: Department head, CFO, operations manager (reversible, internal)
     - 0: Individual contributor (lowest stakes)

   - Irreversibility (0-2 scale): can the decision be undone?
     - 2: Fully irreversible (well abandonment, license loss, strategic repositioning)
     - 1: Partially reversible (can fix later, but at cost)
     - 0: Fully reversible (can undo with no cost or delay)

   **Calculation:**
   ```
   raw_criticality = upside_optionality + downside_severity + authority_level + irreversibility
   criticality = raw_criticality × (10 / 12)   # normalized to 0–10 scale
   ```

   The raw sum has a theoretical maximum of 12 (4+4+2+2). The dataset normalizes this to a 0–10 scale for consistency with all other axes. The actual data maxes out at 9.44.

   Range: 0-2 (low) to 9-10 (ultra-high)

3. **Reasoning-demand potential (0-10 scale)** — How much recurring model usage could this workflow generate?

   Built from:
   - Task exposure (20% weight): LLM-compatible language work
   - Adoption feasibility (15% weight): deployability
   - Refresh frequency (25% weight): how often is this workflow executed per year?
     - 3-4: Daily or real-time (trading ops, control room operations)
     - 2-3: Weekly (recurring committee meetings, weekly reports)
     - 1-2: Monthly (close processes, quarterly reviews)
     - 0: Annual or episodic (rate cases, acquisitions)

   - Evidence volume (15% weight): how many documents/data points per instance?
     - 3-4: Dozens to hundreds of documents (rate cases, acquisitions, interconnections)
     - 2-3: Tens of documents (budget reviews, field reports)
     - 1: Few documents (simple memos)

   - Branching factor (10% weight): how many scenarios/branches per analysis?
     - 3: 5+ scenarios (price cases, portfolio optimization, reserve cases)
     - 2: 2-4 scenarios (scenario planning)
     - 1: Single scenario (deterministic analysis)

   - Persistent monitoring need (5% weight): does the workflow require standing model coverage?
     - 1: Yes (covenant monitoring, equipment health, market watch)
     - 0: No (episodic analysis)

   - Company control bonus (10% weight): carried in from the company-control layer (Section 3); it appears as the seventh term in the formula below, bringing the weights to 1.0

   **Calculation:**
   ```
   reasoning_demand = (0.20 × task_exposure) + (0.15 × adoption_feasibility)
                      + (0.25 × refresh_frequency) + (0.15 × evidence_volume)
                      + (0.10 × branching_factor) + (0.05 × monitoring_need)
                      + (0.10 × company_control_bonus)
   ```

   Range: 0-2 (low recurrence) to 9-10 (high recurrence, high model usage)

4. **Company control (3-zone zoning layer)** — Is the outcome controlled by the company or by an external party?

   Three zones (assigned from continuous 0-10 score):
   - **Company-controlled (3 points):** The key outcome is decided and executed within the company. (e.g., production accounting close, trading position sizing, internal capital allocation)

   - **Semi-controlled (2 points):** The company can compress the prep work, but the outcome depends on an external party's decision. (e.g., rate cases, interconnection studies, lender approval, partner approval)

   - **Externally governed (1 point):** The key outcome is controlled by a regulator, lender, auditor, or other counterparty. (e.g., environmental permits, safety approval, license renewal)

   **Important note on perspective:** This scoring is from the perspective of the focal company (operator, owner, asset sponsor). A regulator may have full control over their own work, but from the operator's perspective, that loop is still "externally governed."

   **Important note on scoring:** The underlying dataset contains a continuous `company_control_score` (0-10 scale) for each role. The three-zone labels (1, 2, 3 points) are derived by mapping score ranges, matching the thresholds in the v32_scoring_method_note: scores below 4 map to "Externally governed" (1 point), scores 4–6.99 map to "Semi-controlled" (2 points), and scores 7 and above map to "Company-controlled" (3 points). The FAQ describes the conceptual framework; the actual data is continuous and supports fine-grained analysis.
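The three v32 scoring layers above can be sketched together in code. This is a minimal illustration under stated assumptions: the criticality normalization and the zone thresholds follow the formulas in this answer, while the reasoning-demand sketch assumes every subcomponent has been rescaled to 0-10 before weighting (the production normalization of the mixed 0-4 / 0-1 subscales is not disclosed here), and all example inputs are hypothetical.

```python
def criticality(upside: int, downside: int, authority: int, irreversibility: int) -> float:
    """Sum the four subcomponents (theoretical max 4 + 4 + 2 + 2 = 12),
    then normalize to the common 0-10 scale."""
    return (upside + downside + authority + irreversibility) * 10 / 12

# Reasoning-demand weighted sum. Assumption: inputs pre-normalized to 0-10.
REASONING_WEIGHTS = {
    "task_exposure": 0.20, "adoption_feasibility": 0.15,
    "refresh_frequency": 0.25, "evidence_volume": 0.15,
    "branching_factor": 0.10, "monitoring_need": 0.05,
    "company_control_bonus": 0.10,  # weights sum to 1.0
}

def reasoning_demand(components: dict) -> float:
    return sum(w * components.get(k, 0.0) for k, w in REASONING_WEIGHTS.items())

def control_zone(score: float) -> tuple:
    """Map the continuous company_control_score (0-10) to the three zones
    using the v32 thresholds: <4 / 4-6.99 / >=7."""
    if score < 4:
        return ("Externally governed", 1)
    if score < 7:
        return ("Semi-controlled", 2)
    return ("Company-controlled", 3)

print(round(criticality(3, 3, 2, 2), 2))  # 8.33 (hypothetical high-stakes decision)
print(criticality(4, 4, 2, 2))            # 10.0 (theoretical maximum)
print(control_zone(6.99))                 # ('Semi-controlled', 2)
```

Note how the 6.99 boundary case lands in "Semi-controlled": the zone cut is strict at 7, consistent with the thresholds stated above.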

The current release (v32.1) added three minor refinements:
- Standardized the template-density scoring for consistency
- Added a "reasoning-demand bonus" if the workflow involves complex scenario branching
- Clarified the proof-room traces and composite examples

---

### Q: How are subcomponents estimated?

**A:** This is fully transparent but not individually precise. Each subcomponent is estimated from a combination of:

1. **Existing numeric fields** in the benchmark:
   - Task exposure (scored by analysts)
   - Adoption feasibility (scored by analysts)
   - Economic compression risk (scored by analysts)

2. **Workflow-type heuristics** — Rules of thumb based on domain patterns:
   - Workflows with regulatory sign-off gates are "semi-controlled" by default
   - Workflows with high document density (>20 docs per instance) get upside on "evidence volume"
   - Workflows that recur >50 times per year get "daily refresh frequency" scoring
   - Workflows with 5+ parallel paths (scenarios) get high "branching factor" scoring

3. **Sector / role keyword adjustments** — Adjustments based on role or sector tags:
   - Non-op/minerals roles get standardization premium (+0.5 to +1.0) because they use standardized JOAs
   - Trading roles get recurrence premium (+1.0 to +1.5) because they execute daily
   - Regulatory affairs roles get "externally governed" coding by default
   - Field operations roles get "low compressibility" by default (even if task exposure is high)

The result is a decision-support layer, not an individually calibrated estimate. Small score differences (<0.5 points) should not be interpreted as meaningful ranking gaps. Large differences (>2 points) signal structural differences in the workflow type.

**Confidence intervals:**
- Compressibility: ±1.0 points (±10% of scale)
- Criticality: ±1.0 points (±10% of scale)
- Reasoning demand: ±0.8 points (±8% of scale)
- Company control: ±0.3 points (highly reliable; coarse three-zone categorical assignment)

---

### Q: What's the exposure_weighted_wage_bill? How is it calculated?

**A:** The `exposure_weighted_wage_bill` is the total wage pool in roles where AI has exposure, weighted by the compressibility of the work.

**Calculation:**
```
exposure_weighted_wage_bill = Σ (estimated_employment × layer_median_comp_proxy × compressibility_score / 10)
```

**Example:**

For a role with:
- Estimated employment: 10,000 people
- Layer: Technical ($95,000)
- Compressibility score: 6.2

The exposure-weighted wage bill contribution is:
```
10,000 × $95,000 × 6.2 / 10 = $589,000,000
```

Sum across all 404 positions = total exposure_weighted_wage_bill.
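The per-role contribution and the sum can be sketched directly from the formula above. The worked example is the one just shown; the second row is illustrative, standing in for the full 404-position dataset.

```python
def exposure_contribution(employment: float, layer_comp: float, score: float) -> float:
    """One role's contribution: employment x layer comp proxy x compressibility / 10."""
    return employment * layer_comp * score / 10

# Worked example from above: 10,000 people, Technical layer ($95K), score 6.2
print(exposure_contribution(10_000, 95_000, 6.2))  # 589000000.0

# The headline figure is this quantity summed over every position;
# two illustrative rows stand in for the real dataset here.
total = (exposure_contribution(10_000, 95_000, 6.2)
         + exposure_contribution(8_000, 72_000, 8.1))
```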

**Why this metric:**

- It answers: "How much wage value sits in roles where AI can plausibly compress prep work?"
- The division by 10 normalizes the compressibility score (0-10 scale) into a compression factor
- Unlike raw wage bill, it's weighted by how much of the work is actually compressible
- This is NOT the wage pool that will be "eliminated" — it's the wage pool where prep work can be compressed
- Typical outcome: 30-40% of the work in high-compressibility roles is compressed, meaning the wage bill is redeployed to higher-judgment work, not eliminated

---

### Q: What happened to the compression_weighted_wage_pool column?

**A:** The v32 dataset replaced `compression_weighted_wage_pool` with `exposure_weighted_wage_bill`. The two are calculated identically:

**Formula (now called exposure_weighted_wage_bill):**
```
exposure_weighted_wage_bill = Σ (estimated_employment × layer_median_comp_proxy × compressibility_score / 10)
```

The name change reflects a more accurate conceptual framing: this is the wage bill *exposed* to compressibility, not a claim about how much will actually be compressed.

**Example:**
- Petroleum engineers: 12,000 employees × $95K (technical layer) × 6.2 compressibility / 10 = $0.71B
- Reserve engineers: 3,500 employees × $95K (technical layer) × 7.8 compressibility / 10 = $0.26B
- Production accountants: 8,000 employees × $72K (corporate ops layer) × 8.1 compressibility / 10 = $0.47B

Sum across all 404 positions = total exposure_weighted_wage_bill.

**Important distinction:**
- This is NOT the wage pool that will be "eliminated"
- This IS the wage pool where AI can compress prep work
- Typical outcome: 30-40% of work is compressed, meaning 30-40% of the wage bill in that pool is redeployed to higher-judgment work, not eliminated

---

### Q: What's exposure_weighted_wage_bill? What does it measure?

**A:** The `exposure_weighted_wage_bill` is the total wage pool that sits in roles where AI has meaningful exposure to the workflow, weighted by compressibility.

**Calculation:**
```
exposure_weighted_wage_bill = Σ (estimated_employment × layer_median_comp_proxy × compressibility_score / 10)
```

This is summed across all 404 positions where compressibility_score > 0.

**Scale (v32 energy sector benchmark):**
~$236 billion

**Interpretation:**

This single metric answers: "How much wage value sits in roles where AI can realistically compress prep work?"

The compressibility score (0-10) is the key difference from raw wage bill:
- A role with 0 compressibility (irreducible judgment work) contributes $0 to exposure_weighted_wage_bill, even if it has high employment
- A role with 8.5 compressibility (routine language work with clear deployment path) contributes heavily

**Why not separate "deployable" vs "exposed"?**

v32 replaced the two-bucket system (exposed vs. deployable) with a single continuous metric (exposure_weighted_wage_bill) because:
1. Cleaner: one metric to interpret, not two
2. More granular: accounts for variation in compressibility across all 404 positions, not just a binary cut (exposure > 6 AND feasibility > 6)
3. More honest: reflects the reality that deployment feasibility is a spectrum, not a cliff

If you want to be more conservative, you can filter for roles with compressibility > 6 and adoption_feasibility > 6. If you want to be more aggressive, use all roles. The data supports both interpretations.

---

## Section 3: Company Control Model

### Q: What are the three zones? Why does perspective matter?

**A:** The company-control model recognizes that not all automation is equally valuable. Compressing internal prep work is different from controlling external outcomes.

**Three zones:**

1. **Company-controlled (3 points):**
   - The company executes the workflow from start to finish
   - No external gating or approval required
   - The company owns the outcome and the risk

   Examples:
   - Production accounting close (company owns books)
   - Trading position sizing (company makes the call, owns the P&L)
   - Capital allocation (company decides where money goes)
   - Internal asset valuation (company decides valuation for internal purposes)

   Risk profile: Low. If the AI output is wrong, it's caught before it leaves the company.

   Value profile: High compressibility directly translates to labor reduction or redeployment.

2. **Semi-controlled (2 points):**
   - The company compresses the prep work, but the outcome depends on an external party's decision
   - Faster internal process ≠ faster external approval

   Examples:
   - Rate case evidence assembly (company does analytics; PUC decides rates)
   - Interconnection study support (company does analysis; FERC decides queue position)
   - Lender covenant package (company does analysis; lender decides waiver)
   - Environmental permit support (company does analysis; agency issues permit)

   Risk profile: Moderate. The external party can reject the work, but the internal process is faster and higher quality.

   Value profile: Mixed. Company gets cycle-time benefit on prep. External benefit depends on how the other party values speed.

3. **Externally governed (1 point):**
   - The key outcome is controlled by an external party
   - The company can support the work, but cannot control the result
   - Automation benefits are mostly about quality and auditability, not speed

   Examples:
   - Regulatory approval workflows (NRC, EPA, state PUC decisions)
   - Third-party audit support (auditor makes the judgment call)
   - Insurance underwriting support (insurer decides coverage)
   - Legal opinion support (counsel makes the call)

   Risk profile: Moderate to high. If the external party loses confidence in the work, the benefit evaporates.

   Value profile: Low speed benefit. High quality and compliance benefit.

**Why perspective matters:**

This is critical: The company-control score reflects the perspective of the focal operator / asset owner / sponsor, not the perspective of the external party.

Example: An NRC inspector has full control of their own licensing workflow. But from the operator's perspective, that workflow is "externally governed." The operator cannot speed up the NRC's decision. They can only compress their own prep work.

This distinction prevents overclaiming. If the analysis said "NRC licensing is company-controlled," it would falsely imply the company can speed up the licensing decision. The truth is: the company can speed up their own package assembly; the NRC will take as long as it wants.

---

## Section 4: Claims Framework

### Q: What are the five claim types? Why does this matter?

**A:** All 30 claims in the project are tagged with one of five types. This is a trust mechanism, not an academic flourish.

**Claim types:**

1. **Measured claims:**
   - Based on observed data from official or audited sources
   - Examples: "Current BLS employment for petroleum engineers is 12,000" (measured from BLS OES data)
   - Examples: "IEA projects 945 TWh of data center demand in 2030 base case" (measured from published IEA scenario)
   - Examples: "LBNL interconnection queue has ~2,290 GW of capacity" (measured from queue database)

   Confidence: 95%+

   How to use: Trust these claims as written. If you dispute them, check the source.

2. **Reported claims:**
   - Based on figures published by third parties (proprietary analysts, company disclosures, trade press) that are not independently audited or reproducible from public data
   - Examples: "$10-12B annual revenue per GW of AI data center" (SemiAnalysis proprietary estimate — order of magnitude validated against public announcements but not independently verified)

   Confidence: 70-90%

   How to use: These are credible enough to cite but not strong enough to build a thesis on alone. Check the source's track record and look for corroborating data. Where possible, we flag the provenance limitation directly (e.g., "proprietary estimate" rather than "confirmed").

3. **Modeled claims:**
   - Based on transparent calculations and assumptions, not observed data
   - Examples: "Physical operations are ~29% of headcount but ~10% of exposure-weighted wage bill" (modeled from org-chart data and scoring rubric)
   - Examples: "Non-op workflows score 7.8-8.5 on compressibility" (modeled from workflow analysis, not field test data)
   - Examples: "~$1T in capital reallocation over a decade at 3% improvement" (modeled from $3.3T capital base × 3% assumption)

   Confidence: 60-80%

   How to use: Check the rubric and the assumptions. If you disagree with the weights or the inputs, say so. These are defensible hypotheses, not gospel.

4. **Scenario claims:**
   - Based on "if-then" logic, not observed data or direct modeling
   - Examples: "If cheaper analysis leads to more questions, token demand could exceed labor displacement by 5x" (scenario — depends on behavioral assumption)
   - Examples: "If non-op platforms consolidate around a shared OS, they could achieve 30% cost reduction" (scenario — depends on market structure assumption)
   - Examples: "~$1T in capital reallocation" vs. "$50M" depends on whether AI drives *more* analysis or just *faster* analysis (Jevons scenario)

   Confidence: 30-60%

   How to use: Treat these as strategic propositions or stress-test assumptions, not predictions. They are illustrative of what could happen if certain conditions hold.

5. **Hypothesis claims:**
   - Educated guesses based on pattern recognition and strategic logic, not data
   - Examples: "The Jevons effect in energy could be the difference between $50M and ~$1T" (pattern from other industries, applied speculatively to energy)
   - Examples: "External advisors will reprice and rebundle before they disintermediate" (observed pattern in other sectors, hypothesized for energy)
   - Examples: "The apprenticeship crisis is the long-term constraint on AI-native scaling" (logical inference, not measured)

   Confidence: 20-40%

   How to use: These are conversation-starters, not conclusions. Test them against your own experience. They may be wrong.
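The five-type taxonomy and its confidence bands can be captured in a small data structure. A minimal sketch (the class and field names are illustrative, not the actual ledger schema):

```python
from dataclasses import dataclass
from enum import Enum

class ClaimType(Enum):
    # Canonical five types with the (low, high) confidence bands listed above
    MEASURED = (0.95, 1.00)
    REPORTED = (0.70, 0.90)
    MODELED = (0.60, 0.80)
    SCENARIO = (0.30, 0.60)
    HYPOTHESIS = (0.20, 0.40)

@dataclass
class Claim:
    claim_id: str
    text: str
    claim_type: ClaimType
    confidence: float
    support: str  # source link, calculation, or reasoning

    def __post_init__(self):
        # Reject a claim whose stated confidence falls outside its type's band
        low, high = self.claim_type.value
        if not (low <= self.confidence <= high):
            raise ValueError(
                f"{self.claim_id}: confidence {self.confidence} outside "
                f"{self.claim_type.name} band {low}-{high}"
            )

# Example: a modeled claim at 70% confidence passes the band check
c = Claim("C05", "Physical operations are ~29% of headcount but ~10% of wage bill",
          ClaimType.MODELED, 0.70, "Org-chart data + scoring rubric")
```

The band check is the point: a "measured" claim tagged at 50% confidence is a contradiction the ledger should refuse to store.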

**Additional display types used in HTML tooltips:**

The HTML tooltips use two additional labels that are subsets of the five types above:

- **Observation:** A subtype of Modeled or Hypothesis. Used when a claim describes a pattern observed across multiple workflows or sectors but is not a direct empirical measurement. Example: "Counterforces such as legal privilege, evidence quality, and external approvals can absorb or delay analytical gains" (C19). Observations are more grounded than hypotheses but less rigorous than measured claims.

- **Derivation:** A subtype of Modeled. Used when a claim follows logically from other claims in the analysis rather than from direct data. Example: "The amplification paradox — AI makes certain human skills worth more by making the surrounding work cheaper." Derivations inherit the confidence level of their inputs.

These are display conveniences, not separate evidentiary tiers. Every claim in the claims ledger maps to one of the five canonical types.

**Scope note on the claims ledger:**

The claims ledger (`energy_decision_stack_claim_ledger_v32.csv`) tracks the 30 anchor claims in the main research artifact (index.html). Derivative documents (deal_readiness_memo.docx, operator_memo_treasury_lender_readiness.docx, audience_specific_briefs.md, etc.) carry additional claims not tracked in this ledger. A future version may expand coverage.

**Why this matters:**

A flagship analysis that mixes measured, modeled, scenario, and hypothesis claims without labeling them creates false precision. A reader doesn't know what's empirical fact and what's reasonable speculation. By tagging every claim, we're saying: "Here's what we know. Here's what we reasoned about. Here's what we guessed about. You decide which ones to believe."

This is the opposite of consultant fog. It's maximum transparency.

---

## Section 5: What This Does NOT Measure

**Important boundary statement:** This analysis does NOT claim to measure:

1. **Realized enterprise value** — We do not know if a company that deploys these solutions will actually make more money. We know the workflows are exposed to AI. We do not know the ROI.

2. **Realized token usage** — We model scenarios for how many tokens *could* be used. We do not have observed token consumption from real deployments.

3. **Real labor reductions** — We do not have data on actual headcount changes in companies that have deployed AI. We have a hypothesis about which roles are most exposed.

4. **Real approval-speed changes** — We know that internal prep work can be compressed. We do not have measured data on how much faster permits, approvals, or decisions actually become.

5. **Real spread compression** — We do not know if trading spreads actually compress when trading firms use AI. We know the workflows are decision-heavy and recurrent.

6. **Regulator or lender behavior** — We do not know how regulators or lenders will respond to AI-assisted submissions. We assume they care about quality, accuracy, and compliance. We do not know if they will actually approve faster or more favorably.

7. **Competitive market dynamics** — We do not model what happens when 10 competitors all deploy AI to the same workflows. (Hint: spreads compress, margins evaporate.)

8. **Adoption curves** — We do not model how fast these solutions will actually be adopted. We assume willing early adopters. We do not model the slow tail.

**What this analysis IS:**
- A directional ranking of where AI has the most structured exposure in energy workflows
- A prioritization framework for where to start
- A methodology for thinking about the problem rigorously
- A proof that energy is not immune to AI; it is just different from tech/finance

**What this analysis IS NOT:**
- A market-size estimate
- An ROI calculator
- A labor forecast
- A regulator forecast
- A guarantee that these workflows will actually be automated

---

## Section 6: How to Independently Validate This Analysis

**The following describes how an independent team could rebuild and validate this analysis from scratch. This is an aspirational protocol, not a description of how v32 was produced.** The v32 analysis was conducted by one analyst with domain experience using frontier LLMs for research, scoring, and writing. The process below outlines what a rigorous independent replication would look like.

### Step 1: Start with BLS OES data (1-2 weeks)

1. Go to https://www.bls.gov/oes/
2. Download occupational employment data for these NAICS codes:
   - 211 (Oil and gas extraction)
   - 213 (Support activities for mining)
   - 2211 (Electric power generation)
   - 2212 (Electric power transmission and distribution)
   - 237 (Heavy and civil engineering construction)
3. Extract the role list and employment/wage data
4. Standardize role names (BLS naming is inconsistent)
5. You should have ~60 roles with high-confidence employment and wage data that match the energy role taxonomy (from a broader pool of ~120 BLS occupation codes)
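Steps 2-4 above can be sketched in a few lines. The NAICS prefixes come from the list; the record layout (`naics`, `occ_title` fields) and the name-cleanup rule are assumptions about what a downloaded OES extract looks like:

```python
# Target energy sectors from Step 1 (prefix match on the NAICS code)
ENERGY_NAICS = {"211", "213", "2211", "2212", "237"}

def standardize_role(name: str) -> str:
    """Normalize inconsistent BLS occupation titles (illustrative rule set)."""
    return name.lower().replace(", all other", "").strip()

def filter_energy_rows(rows):
    """Keep rows whose NAICS code falls under a target energy sector."""
    return [
        {**r, "role": standardize_role(r["occ_title"])}
        for r in rows
        if any(r["naics"].startswith(code) for code in ENERGY_NAICS)
    ]

sample = [
    {"naics": "211111", "occ_title": "Petroleum Engineers", "tot_emp": "12000"},
    {"naics": "522110", "occ_title": "Tellers", "tot_emp": "300000"},  # not energy
]
print(filter_energy_rows(sample))  # only the petroleum-engineers row survives
```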

### Step 2: Expand with industry-specific roles (2-3 weeks)

1. Conduct 20-30 interviews with energy operators, utilities, traders, and support staff
2. Map their org charts using a role-taxonomy template:
   - Role name
   - Function (operations, finance, commercial, technical, regulatory, support)
   - Department
   - Typical career stage (junior, mid, senior)
   - Typical company size (100-person vs. 10,000-person)
3. Flag roles that BLS doesn't capture:
   - Non-op and minerals workflows
   - Trading and commercial roles
   - Regulatory affairs and government relations
   - Digital transformation and data governance
4. Build employment estimates for these roles using:
   - Survey logic (e.g., "100-person E&P has 2-3 JIB analysts" → scale to ~150 E&Ps)
   - Market trends (e.g., non-op headcount grew 15% with consolidation)
   - Sector data (e.g., IEA employment reports)
5. You now have ~300+ roles with employment estimates (confidence: 60-80%)
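The survey-logic estimate in step 4 is plain arithmetic. A sketch using the JIB-analyst example from the text (the function name and the midpoint choice are illustrative):

```python
def scale_headcount(per_company_low: int, per_company_high: int,
                    n_companies: int) -> int:
    """Survey-logic estimate: midpoint staffing per company x company count."""
    midpoint = (per_company_low + per_company_high) / 2
    return round(midpoint * n_companies)

# "100-person E&P has 2-3 JIB analysts" scaled to ~150 E&Ps
jib_estimate = scale_headcount(2, 3, 150)
print(jib_estimate)  # → 375
```

The wide input range (2-3 per company) is what produces the 30-40% confidence bands noted in Section 7; the arithmetic itself is trivial.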

### Step 3a: Current implementation (v32) — Single-analyst scoring

**How v32 was actually produced:**

1. Build a scoring rubric for each axis:
   - **Task exposure (0-10):** What % of the work is language/data processing?
   - **Adoption feasibility (0-10):** Could a typical energy company deploy a solution in 18-36 months?
   - **Economic compression risk (0-10):** How much cost would be displaced?
   - **Fee disruption risk (0-10):** What external advisory market is at risk?

2. One analyst with energy operating experience and access to frontier LLMs (Claude, GPT) scored all positions

3. Scoring was informed by:
   - Domain experience and intuition
   - Pattern matching across roles and sectors
   - LLM assistance for research, synthesis, and documentation
   - No inter-rater agreement measurement; no consensus process

4. Scoring rationale was documented for every role (evidence, not just numbers)

5. Result: 404 positions scored on compressibility by a single analyst; 304 additionally scored on 3 augmented axes (decision criticality, reasoning demand, company control); 30 anchor roles scored on 3 additional axes (automation exposure, value creation, asymmetric risk) (confidence: 60-70% for individual roles; 75%+ for layer aggregates due to error cancellation)

**Confidence note:** Individual role scores should not be cited as precise. Layer-level aggregates (physical operations vs. advisory vs. governance) are more reliable because scoring biases partially cancel when summed across 40-80 roles per layer.

**Robustness of the above-field finding:** Sensitivity testing shows that the above-field wage bill share (approximately 90% of total exposure) is robust to plausible scoring errors. When all Physical operations compressibility scores are shifted ±1 point (the most impactful directional error, since field work dominates sensitivity), the share moves ±2.6 percentage points to a range of 87.7%–93.0%. Monte Carlo simulation of 1,000 random ±1 perturbations across all 404 positions yields a mean of 90.24% with a standard deviation of only 0.49%; 90% of outcomes fall between 89.45% and 91.07%. Even under a stress test with ±2 point random errors, the share remains tightly distributed (mean 90.26%, 5th–95th percentile band of 89.1%–91.5%). The directional finding — that most AI-exposed wage dollars are above field — survives all tested scenarios. See `docs/scoring_sensitivity_analysis.md` for the full analysis.

**BLS-only robustness check:** Restricting the exposure-weighted wage bill calculation to only the 60 BLS/industry-sourced employment rows (excluding all 344 formulaic estimates) still yields an above-field share of approximately 74%. The formulaic scaffolding raises the headline to ~90%, but the directional finding — that most AI-exposed wage dollars sit above the field — holds on externally-sourced data alone.
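The perturbation tests can be reproduced in outline. The sketch below uses a synthetic stand-in for the 404-row dataset (the real CSV is a v33 deliverable), so its numbers will not match the reported 90.24%; only the mechanics carry over:

```python
import random

def above_field_share(positions, perturb=0.0):
    """Exposure-weighted wage bill share of above-field roles, with optional
    uniform +/-perturb-point noise on each compressibility score (clamped 0-10)."""
    above = total = 0.0
    for p in positions:
        score = p["compressibility"] + random.uniform(-perturb, perturb)
        score = max(0.0, min(10.0, score))
        exposure = p["headcount"] * p["wage"] * score / 10.0
        total += exposure
        if p["layer"] != "physical_ops":
            above += exposure
    return above / total

random.seed(0)
# Synthetic 404-row stand-in: 120 field rows, 284 above-field rows
positions = (
    [{"layer": "physical_ops", "headcount": 100, "wage": 70_000,
      "compressibility": 2.5}] * 120
    + [{"layer": "above_field", "headcount": 60, "wage": 110_000,
       "compressibility": 7.0}] * 284
)
runs = [above_field_share(positions, perturb=1) for _ in range(1000)]
mean = sum(runs) / len(runs)
print(f"mean above-field share under +/-1 perturbation: {mean:.3f}")
```

The structural takeaway mirrors the text: because the share is a ratio of large sums, independent per-row errors mostly cancel, so the distribution of `runs` is tight around the unperturbed value.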

**Inter-rater validation:** v32 uses single-analyst scoring. An inter-rater scoring rubric is included in the package (`docs/inter_rater_scoring_rubric.md`) with calibration anchors, a 10-role calibration set, scoring sheet template, and agreement metrics (target: Cohen's κ > 0.70, ICC > 0.70). This is designed for 3-5 external raters and is a v33 deliverable.

**CSV column note:** The v32 dataset contains one company-control column: `company_control_zone`. Its definition is outcome-oriented (who controls the result — e.g., rate cases are "Semi-controlled" because the PUC controls the ruling). An earlier scoring pass used a prep-work-oriented definition (who controls the filing), but that alternative is not shipped in v32; reconciliation is a v33 deliverable. Where doc-level prose and the CSV disagree on a specific row's zone, the CSV value is canonical.

---

### Step 3b: Recommended future methodology (v33+) — Multi-rater consensus scoring

**How this could be rebuilt more rigorously:**

1. Build a scoring rubric for each axis:
   - **Task exposure (0-10):** What % of the work is language/data processing?
   - **Adoption feasibility (0-10):** Could a typical energy company deploy a solution in 18-36 months?
   - **Economic compression risk (0-10):** How much cost would be displaced?
   - **Fee disruption risk (0-10):** What external advisory market is at risk?

2. Recruit 3-5 energy analysts with diverse backgrounds to score all positions independently
   - At least one technical/engineering perspective
   - At least one commercial/trading perspective
   - At least one regulatory/governance perspective
   - Instructed to score blind (no visibility into other raters' scores until final consensus)

3. Compute inter-rater agreement (e.g., Cohen's κ, Krippendorff's alpha, or the intraclass correlation coefficient) to measure consistency

4. Reach consensus on the scores:
   - Where scores diverge by 0-1 points: use the median
   - Where scores diverge by 2 points: re-discuss and reason to consensus
   - Where scores diverge by 3+ points: flag the role as low-confidence and document the disagreement
   - Document consensus rationale for every role (evidence, not just a number)

5. Result: 404 positions scored on compressibility with documented inter-rater agreement; 304 additionally scored on 3 augmented axes (decision criticality, reasoning demand, company control); 30 anchor roles scored on 3 additional axes (automation exposure, value creation, asymmetric risk) (confidence: 80%+ for individual roles if consensus achieved; 90%+ for layer aggregates)

6. Publish inter-rater agreement metrics alongside the dataset so users can calibrate their confidence
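For the two-rater case, the agreement computation in step 3 reduces to Cohen's κ, which fits in a few stdlib-only lines (Krippendorff's alpha and ICC generalize to more raters and would normally come from a stats package). The calibration scores below are invented for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items (categorical)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters assign the same category
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented integer 0-10 scores for a 10-role calibration set
a = [7, 7, 3, 5, 9, 2, 7, 4, 6, 8]
b = [7, 6, 3, 5, 9, 2, 7, 4, 5, 8]
kappa = cohens_kappa(a, b)
print(f"kappa = {kappa:.2f}")  # rubric target: > 0.70
```

Treating a 0-10 scale as categorical is conservative; a weighted κ or ICC would credit near-misses (7 vs. 6) instead of scoring them as disagreements.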

### Step 4: Add the v32 layer — compressibility, criticality, reasoning demand, control (3-4 weeks)

1. For each role, calculate compressibility:
   - Compressibility is derived primarily from task_exposure and adoption_feasibility (the two columns present in the shipped CSV as `task_exposure_v1` and `adoption_feasibility_v1`), with analyst judgment applied for workflow standardization and template density factors
   - **Note:** An earlier version of this document described a strict four-component weighted formula. The production dataset uses a simplified two-input approach with analyst overlay. See the compressibility FAQ entry above for details.

2. For each role, calculate decision criticality:
   - Criticality = upside_optionality + downside_severity + authority_level + irreversibility
   - These are estimated from the role description and the workflows it supports
   - Use examples (board director, operating engineer, risk officer) to anchor the scale

3. For each role, calculate reasoning-demand potential:
   - Reasoning_demand = (0.20 × task_exposure) + (0.15 × adoption_feasibility) + (0.25 × refresh_frequency) + (0.15 × evidence_volume) + (0.10 × branching_factor) + (0.05 × monitoring_need) + (0.10 × company_control_bonus)
   - These are estimated from workflow descriptions, recurrence patterns, and complexity

4. For each role, assign company-control zone:
   - Company-controlled (3): decision made inside the company, no external gate
   - Semi-controlled (2): company does prep work; external party decides outcome
   - Externally governed (1): external party controls both prep and outcome
   - Use examples to calibrate (treasury/borrowing-base is semi-controlled per CSV; rate cases are semi-controlled per CSV, since the PUC controls the ruling; NRC licensing is externally governed)

5. You now have v32+ augmented scoring for 304 positions (75.2%); all 404 have compressibility scores
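The reasoning-demand formula in step 3 is a straightforward weighted sum, and the weights as given sum to 1.0. A direct sketch (the example role's input scores are illustrative):

```python
# Weights from Step 4, item 3 (reasoning-demand formula)
REASONING_WEIGHTS = {
    "task_exposure": 0.20,
    "adoption_feasibility": 0.15,
    "refresh_frequency": 0.25,
    "evidence_volume": 0.15,
    "branching_factor": 0.10,
    "monitoring_need": 0.05,
    "company_control_bonus": 0.10,
}
assert abs(sum(REASONING_WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1

def reasoning_demand(scores: dict) -> float:
    """Weighted sum over the seven 0-10 inputs; output stays on the 0-10 scale."""
    return sum(w * scores[k] for k, w in REASONING_WEIGHTS.items())

# Illustrative role: high refresh frequency and evidence volume
example = {
    "task_exposure": 8, "adoption_feasibility": 7, "refresh_frequency": 9,
    "evidence_volume": 8, "branching_factor": 6, "monitoring_need": 5,
    "company_control_bonus": 6,
}
print(round(reasoning_demand(example), 2))  # → 7.55
```

Because refresh frequency carries the largest weight (0.25), recurrent workflows dominate this axis by construction.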

### Step 5: Run the company-control zoning (1 week)

1. Create a matrix:
   - Rows: 404 positions
   - Columns: compressibility, decision_criticality, reasoning_demand, company_control_zone

2. Identify high-value clusters:
   - High compressibility + High criticality + High reasoning demand + Company-controlled = beachheads
   - High compressibility + Low criticality + High reasoning demand = cost-cutting opportunities
   - Low compressibility + High criticality = augmentation opportunities
   - Low compressibility + Low criticality = low priority

3. Document rationale for the beachheads (which workflows should you start with?)
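The cluster logic in step 2 can be made explicit as a classifier. The ≥7 cutoff for "high" is an assumption (the text does not fix thresholds), and mixed profiles that match none of the four named clusters are left unclassified:

```python
def classify(compressibility, criticality, reasoning, control_zone, high=7.0):
    """Map a scored position to one of the four clusters from Step 5.
    control_zone: 3 = company-controlled, 2 = semi, 1 = externally governed."""
    hi_c = compressibility >= high
    hi_k = criticality >= high
    hi_r = reasoning >= high
    if hi_c and hi_k and hi_r and control_zone == 3:
        return "beachhead"
    if hi_c and not hi_k and hi_r:
        return "cost-cutting"
    if not hi_c and hi_k:
        return "augmentation"
    if not hi_c and not hi_k:
        return "low priority"
    return "unclassified"  # mixed profiles outside the four named clusters

print(classify(8.2, 7.5, 8.0, 3))  # → beachhead
```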

### Step 6: Build the claims ledger (1 week)

1. List every claim the analysis makes (target: 30+ claims)
2. Tag each claim with type:
   - **Measured:** Link to the specific source (BLS, IEA, LBNL, company data)
   - **Reported:** Link to the third-party figure and flag the provenance limitation (e.g., "proprietary estimate")
   - **Modeled:** Link to the calculation and assumptions
   - **Scenario:** State the "if-then" logic
   - **Hypothesis:** Label as such; explain the reasoning

3. Document support for every claim:
   - What data backs this up?
   - What would falsify this claim?
   - What's the confidence level?

4. Example:
   ```
   Claim: Physical operations are ~29% of headcount but ~10% of exposure-weighted wage bill
   Type: Modeled
   Support: Org-chart data + scoring rubric
   Calculation: Total headcount in field roles / total headcount = 29%.
               Physical operations layer exposure-weighted wage bill /
               total exposure-weighted wage bill across all positions = 10%.
   Confidence: 70% (headcount data is measured; compressibility scores are modeled)
   ```

### Step 7: Cross-validate against external sources (1-2 weeks)

1. BLS employment and wage data (measured benchmark)
2. IEA Energy and AI report (measured demand forecasts)
3. IEA World Energy Employment 2025 (measured workforce trends)
4. LBNL Queued Up interconnection data (measured infrastructure pipeline)
5. Company interviews (qualitative validation)

Where internal estimates diverge from external sources by >20%, investigate the gap.
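The >20% divergence rule is easy to mechanize. A sketch (the external figure reuses the petroleum-engineer BLS count from Section 4; the internal estimate is hypothetical):

```python
def divergence_flag(internal: float, external: float, threshold: float = 0.20):
    """Flag when an internal estimate diverges from an external benchmark
    by more than the threshold (relative to the external figure)."""
    gap = abs(internal - external) / external
    return gap > threshold, round(gap, 3)

# Hypothetical internal headcount estimate vs. the BLS figure of 12,000
flagged, gap = divergence_flag(internal=15_000, external=12_000)
print(flagged, gap)  # 25% gap → investigate
```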

### Step 8: Build the proof room with composite traces (1-2 weeks)

1. For each high-priority workflow, create a composite trace:
   - What are the typical inputs (documents, data)?
   - What are the current manual steps?
   - Where can AI help?
   - What stays human?
   - What can break?
   - What should success look like?

2. Example: Borrowing base packet support
   ```
   Typical evidence bundle:
   - Latest reserve report
   - Production and cash-flow exports
   - Hedge schedule
   - Debt documents and covenant definitions
   - Prior lender questions
   - Board and management updates

   What AI changes first:
   - Source comparison and reconciliation
   - Variance tables
   - Covenant-definition extraction
   - Q&A draft generation

   What remains human:
   - Negotiation posture
   - Final signoff on external submission
   - Treasury and legal judgment
   ```

### Step 9: Write the output documents (1-2 weeks)

1. **Adoption map (Now/Next/Later/Never):** Sorts workflows by deployment readiness
2. **Core non-obvious insights:** The key findings that surprised the team
3. **Methodology FAQ:** This document — how to rebuild the analysis

---

## Section 7: Known Limitations

**This analysis is strong on some things and weak on others.**

### Strengths:
- Comprehensive position coverage (404 positions vs. typical "top 20" analyses)
- Transparent scoring methodology (can be audited, debated, challenged)
- Clear distinction between measured, modeled, and speculative claims
- Energy operating experience embedded in the scoring
- Explicit boundary statements (what we do NOT measure)

### Weaknesses:

1. **US-centric:** All employment data is US-based. Energy is global. Exposure dynamics in other markets may differ.

2. **Employment estimates are modeled, not census:** BLS/industry sources cover ~15% of positions directly (60 of 404 rows). The other 85% are estimated. This introduces 30-40% confidence bands.

3. **Scores are directional rankings, not individually precise:** A role that scores 7.2 on compressibility is "more compressible than" a role that scores 6.8. But the 0.4-point difference is noise. Treat scores as quartiles, not decimals.

4. **No observed outcomes yet:** This is a *hypothesis* about where AI can add value. We have not measured real labor reductions, real speed improvements, or real ROI in energy companies yet.

5. **Competitive landscape not mapped in the core analysis:** We score individual workflows. We do not model what happens when 10 competitors all automate the same workflow (spreads compress, margins evaporate). This is in the strategic scenario doc, not the core benchmark.

6. **Fee disruption estimates are hypothetical:** We assume external advisors will be repriced and rebundled. We do not have observed market data on this happening yet.

7. **Regulatory response is speculative:** We assume regulators care about quality and accuracy. We do not know if they will actually approve faster or feel comfortable with AI-assisted submissions.

8. **No data on regulator-by-regulator variation:** NRC, FERC, EPA, and state PUCs all have different attitudes toward AI and automation. This analysis is one-size-fits-all.

---

## Section 8: How to Maintain and Update This Analysis

**If you want to keep this current:**

### Annual refresh cycle:

1. **Update BLS data (January):** New OES data drops in January. Update employment and wage estimates for all standardized roles.

2. **Re-validate scored estimates (March):** Conduct 10-15 interviews with energy practitioners. Ask: "Are these scores still right? What's changed?" Update scoring based on feedback.

3. **Cross-validate against external sources (April):** Check for updated IEA, LBNL, and industry employment reports. Update claims ledger.

4. **Measure realized outcomes (Ongoing):** As companies deploy solutions, collect and document outcomes (labor savings, speed improvements, cost changes). Start building the "measured" category of claims.

### Quarterly monitoring:

1. Track new AI vendor announcements in energy
2. Monitor regulatory guidance on AI use (NRC, FERC, EPA, state PUCs)
3. Update the claims ledger with new evidence or counterevidence
4. Flag emerging workflow classes (e.g., AI in SCADA systems, subsurface CO2 monitoring)

### As-needed updates:

- When a new workflow emerges (e.g., "AI for critical infrastructure resilience"), score it and integrate it
- When a regulation changes (e.g., NRC guidance on automation), update the control model
- When a new vendor reaches market fit, document the proof and use it to refine scoring

---

## Final Note

This FAQ is long because the methodology is defensible but not simple. There are no shortcuts.

If you're tempted to simplify (e.g., "just use BLS data for all roles" or "score everything 7-8 because AI is capable"), you'll lose the nuance that makes the analysis useful. The nuance is where the value lives.

If you're tempted to overstate (e.g., "90% of energy wage bill will be automated by 2030"), remember that `exposure_weighted_wage_bill` includes roles with low compressibility and low adoption feasibility. Many roles have AI exposure. Few are deployment-ready in the next 18-36 months. The gap between theoretical exposure and practical feasibility is the real constraint.

If you're planning to use this analysis to make decisions, default to the most conservative interpretation. (Assume compressibility is lower. Assume adoption takes longer. Assume external approvals don't accelerate.) Then you'll be pleasantly surprised.

Good luck.
