# How this benchmark was built with AI

## Process

One analyst (Raj Mistry) with energy operating experience directed the entire research and production process. Two frontier language models served as research and drafting instruments: Claude Opus 4.6 (Anthropic) and ChatGPT Pro 5.4 (OpenAI). The models were used for literature review, data structuring, scoring assistance, sensitivity analysis, code generation, copy drafting, and production of the final HTML artifact. Every judgment call — role selection, compressibility scoring, narrative framing, what to include and exclude — was made by the analyst.

## What the models did well

- **Literature assembly.** Pulling and synthesizing BLS occupational data, FERC filings, IEA reports, SEC attestation requirements, and cross-industry labor research. Tasks that would have taken weeks of reading took hours.
- **Structural scaffolding.** Generating the 344 formulaic employment estimates from layer base values. Building the initial CSV structure, HTML exhibit code, and interactive chart logic. (A sketch of the scaffolding pattern follows this list.)
- **Sensitivity analysis.** Running Monte Carlo simulations, directional stress tests, and robustness checks across scoring perturbations. The models executed thousands of scenarios that would have been impractical to run manually. (See the perturbation sketch after this list.)
- **Drafting speed.** Producing first drafts of methodology documentation, supporting memos, and narrative sections at a pace that let the analyst focus on substance rather than blank-page paralysis.
- **Internal consistency checking.** Flagging terminology mismatches, broken cross-references, and version-string inconsistencies across a 32-document package.
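
A minimal sketch of the scaffolding pattern referenced above, assuming a hypothetical layer-base and role-share structure; every layer name, share, and number below is an illustrative placeholder, not one of the report's inputs:

```python
import csv

# Hypothetical layer base values and role shares -- illustrative only,
# not the report's actual inputs.
LAYER_BASE_EMPLOYMENT = {
    "upstream_finance": 40_000,
    "regulatory_affairs": 15_000,
}
ROLE_SHARE_OF_LAYER = {
    ("upstream_finance", "treasury_analyst"): 0.06,
    ("upstream_finance", "division_order_analyst"): 0.04,
    ("regulatory_affairs", "puc_rate_case_analyst"): 0.08,
}

def build_rows():
    """Derive one employment estimate per (layer, role) pair from its layer base value."""
    return [
        {
            "layer": layer,
            "role": role,
            "estimated_employment": round(LAYER_BASE_EMPLOYMENT[layer] * share),
        }
        for (layer, role), share in ROLE_SHARE_OF_LAYER.items()
    ]

with open("employment_estimates.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["layer", "role", "estimated_employment"])
    writer.writeheader()
    writer.writerows(build_rows())
```

The same structure explains the compounding-error risk discussed later: any bias in a base value flows into every row derived from it.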

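The sensitivity runs referenced above are easiest to picture as score perturbations followed by a rank-stability check. A minimal sketch under assumed inputs; the roles, scores, and the +/-10-point band are placeholders, not the report's data:

```python
import random

# Hypothetical compressibility scores (0-100); placeholder roles and values.
scores = {
    "division_order_analyst": 72,
    "puc_rate_case_analyst": 65,
    "reservoir_engineer": 38,
}

def rank(s):
    """Roles ordered from most to least compressible."""
    return tuple(sorted(s, key=s.get, reverse=True))

baseline = rank(scores)
trials, stable = 10_000, 0
for _ in range(trials):
    # Jitter each score within an assumed +/-10-point band and re-rank.
    jittered = {role: value + random.uniform(-10, 10) for role, value in scores.items()}
    stable += rank(jittered) == baseline

print(f"Baseline ranking held in {stable / trials:.1%} of {trials} trials")
```

A run like this supports the report's framing that the directional ranking, not any individual score, is the claim.
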
## Where the analyst overrode the models

- **Compressibility scoring for domain-specific roles.** The models consistently scored regulatory and title roles (PUC rate-case analyst, division order analyst, interconnection coordinator) too low because their training data underrepresents these workflows. Analyst adjustment corrected 14 rows; direct assignment covered 100 rows where the models lacked sufficient context.
- **Narrative framing.** Early model drafts positioned the thesis as "AI replaces energy jobs." The analyst reframed toward "AI compresses prep work and amplifies judgment" — a distinction the models repeatedly drifted away from.
- **Audience calibration.** The models defaulted to general-audience explanations. The analyst pushed toward operator-specific language (borrowing-base packets, JIB exceptions, rate-case discovery) that signals credibility to the target readership.
- **Confidence labeling.** Models tend toward false precision. The analyst imposed the claim-type system (measured / reported / modeled / derivation / scenario / observation) and wrote the confidence disclosures. (A minimal illustration of the labeling scheme follows this list.)
- **What to cut.** Models generate exhaustively. The analyst removed entire sections, demoted speculative claims, and enforced the "directional ranking is the claim, not any individual score" framing.
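
The claim-type system is simple to picture as a small data structure. A minimal illustration; the six labels are the report's, while the field names and the example record are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum

class ClaimType(Enum):
    MEASURED = "measured"
    REPORTED = "reported"
    MODELED = "modeled"
    DERIVATION = "derivation"
    SCENARIO = "scenario"
    OBSERVATION = "observation"

@dataclass
class Claim:
    text: str
    claim_type: ClaimType
    disclosure: str  # analyst-written confidence disclosure

example = Claim(
    text="Role-level employment estimate derived from a layer base value",  # hypothetical
    claim_type=ClaimType.DERIVATION,
    disclosure="Derived figure; the directional ranking, not the point estimate, is the claim.",
)
```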

## Where domain judgment was irreducible

- Knowing that borrowing-base redetermination is semi-annual, deadline-driven, and document-heavy — and therefore a better beachhead than reservoir analysis.
- Knowing that a PUC rate case filing is company-controlled on the prep side but externally governed on the outcome side — and that this distinction matters for deployment scoping.
- Knowing that "cycle time" means different things in a deal-readiness memo (end-to-end including mobilization) versus an operator memo (core packet preparation).
- Knowing which roles actually exist inside an E&P treasury team versus which roles a model hallucinates from job-posting language.

## Failure modes observed

- **Hallucinated specificity.** Models would invent plausible-sounding BLS occupation codes, FERC docket numbers, or company names. Every external reference was manually verified.
- **Score anchoring.** When shown a partially completed scoring sheet, models anchored to existing scores rather than evaluating each row independently. Scoring was done in batches with fresh context to mitigate this. (See the batching sketch after this list.)
- **Tone drift.** Over long sessions, models would gradually shift toward either academic hedging or marketing hyperbole. Regular recalibration was required.
- **Structural repetition.** Models tend to produce parallel structures (three-part lists, mirrored sentence patterns) that read well individually but create monotony at page scale. Manual rhythm variation was applied throughout.
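
The anchoring mitigation is mostly workflow discipline: no batch ever sees scores that have already been assigned. A minimal sketch of that batching pattern, where `score_batch` stands in for a model call and is a hypothetical function, not a real API:

```python
def score_in_fresh_batches(rows, score_batch, batch_size=10):
    """Score rows in independent chunks so no chunk sees previously assigned scores."""
    scored = {}
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Each call starts from a fresh context: only unscored rows go in,
        # never the running results, so earlier scores cannot anchor later ones.
        scored.update(score_batch(batch))
    return scored
```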

## What became faster

- Initial research and data structuring: ~10x faster than a solo analyst working without AI
- First drafts of supporting documents: ~5-8x faster
- Sensitivity analysis and robustness testing: ~20x faster (the practical barrier to running them at all was removed)
- Cross-document consistency checking: ~10x faster

## What became riskier

- **False confidence from speed.** Because drafts appeared quickly and read well, the temptation to skip deep review was real. Several scoring errors were caught only on third-pass review.
- **Compounding errors in scaffolding.** The 344 formulaic employment rows propagate any bias in the layer base values across hundreds of cells. A human building a spreadsheet row by row would catch drift earlier.
- **Audience mismatch from training data.** The models' default audience is a generalist reader. Producing content for energy operators required constant correction, and some generalist phrasings likely survived into the final package.

## Cost

- Direct API billing for final production sessions: ~$75 (Claude Opus 4.6 + ChatGPT Pro 5.4)
- Estimated total model spend including research, iteration, and discarded drafts: $500-$5,000
- Equivalent research team benchmark: ~$450K (3 sector analysts + 1 data scientist + 1 designer + 1 editor × 6 months at market rates; illustrative, not audited)
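
The implied arithmetic behind that benchmark, using only the figures above (6 people for 6 months), works out to roughly $12.5K per person-month. A trivial check:

```python
people, months, benchmark_usd = 6, 6, 450_000
person_months = people * months                 # 36 person-months
implied_rate = benchmark_usd / person_months    # 12,500 USD per person-month
print(f"{person_months} person-months -> ${implied_rate:,.0f} per person-month implied")
```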

## Bottom line

The models did the prep work. The analyst made every judgment call. That is the thesis of the report — illustrated in its own production process.
