# Inter-Rater Scoring Rubric: AI Compressibility of Energy Roles

## Purpose

This rubric enables independent external validators to score energy sector roles and workflows for **AI compressibility** — the degree to which current large language models can compress the preparation, research, drafting, and analysis work around a decision or workflow. The goal is to test inter-rater reliability (agreement among independent scorers) on this dimension, which is central to understanding AI's real impact on energy operations and decision-making.

---

## Section 1: Instructions for Independent Raters

### What You Are Scoring

You are assessing a single dimension: **How much of the preparation and documentation work surrounding a role or workflow can current AI (as of early 2026) realistically compress?**

Compressibility is **not about**:
- Whether the role will disappear
- Whether humans should make the final decision
- Whether a company has adopted AI yet
- Whether regulatory constraints allow it
- Whether it's economically rational to do it

Compressibility **is about**:
- Current technical capability to automate prep work (document research, analysis, drafting, synthesis)
- Feasibility of integrating AI into the existing workflow without major retooling
- How standardized and templateable the work is
- Volume of evidence/data the AI could synthesize
- Whether the work recurs often enough for AI assistance to be reused across instances

### Operational Definition

**Compressibility** measures the portion of a role's typical weekly preparation work that could be offloaded to current AI while keeping human review of the AI output tractable.

**Examples of "prep work" (compressible):**
- Literature review and synthesis (regulatory precedents, comparable cases, technical standards)
- First-pass document drafting (permit applications, rate case evidence, due diligence checklists)
- Data assembly and table construction (production reports, contract abstracts, regulatory filings)
- Precedent identification and argument structuring (what worked in past cases, how to position a claim)
- Routine analysis (decline curve fitting, royalty accounting, grid reliability modeling given parameters)
- Q&A response drafting (internal board questions, regulator inquiries, lender requests)

**Examples of non-compressible work:**
- Site visits and physical observation
- Live negotiations and relationship management
- Novel technical judgment calls (e.g., "Is this reservoir worth drilling?")
- Strategic decisions that involve company risk appetite
- Personal sign-off and accountability (e.g., "Do you personally stand behind this seismic interpretation?")

### Time Budget

Expect to spend **8-15 minutes per role**, depending on your familiarity with the sector.

- **If you know the role well:** 8-10 minutes (you'll quickly grasp the typical workflow)
- **If unfamiliar:** 12-15 minutes (read the role definition, think through the workflow, then score)

Total calibration set (10 roles): ~2 hours. Full scoring set (if you commit to ~20-30 roles): 4-6 hours.

### Required Domain Knowledge

**Minimum:** You should be able to answer:
- What does this role do day-to-day?
- What documents or analyses does the person produce?
- Who do they report to and what is their output used for?

**Ideal backgrounds:**
- Energy operations (utility, oil/gas, midstream, renewable)
- Energy project finance or M&A
- Energy consulting
- Energy law or regulatory affairs
- Grid operations or power trading
- Energy construction or equipment management

**Can you score if you're outside energy?** No — the task requires knowing what "standard" looks like in energy. A generic knowledge worker can't calibrate what's routine vs. novel in energy workflows.

---

## Section 2: Scoring Rubric with Anchor Points

### The 1-10 Scale: Definitions and Examples

Use the definitions below to assign a score. The anchor points show real energy roles currently in the dataset that exemplify each band. Read the definition, then find your role somewhere between the anchors at the top and bottom of the band.

---

### **1-2: Almost No AI Compression Possible**

**Definition:**
- Dominant work is hands-on, site-based, real-time, or requires irreplaceable judgment
- Prep work (if any) is so context-dependent and novel that templates don't exist
- AI can assist with information retrieval but cannot meaningfully reduce hours
- The role would take 80-95% as long even with best-in-class AI available

**Why compressibility is low:**
- Heavy physical/sensory component
- Judgment is highly situation-specific, not based on precedent or formula
- Safety or immediate real-world response is central to the role

**Energy sector examples:**
- **Lineman (2.4)** — Climbs poles, cuts/restores live power lines, troubleshoots in real time. AI cannot climb poles or make on-the-spot safety decisions. Prep work (safety procedures, weather data) is minimal and not standardized.
- **Roughneck/driller crew (2.5)** — Operating drilling equipment, managing drilling fluid, responding to wellbore changes in real time. Most of the role is hands-on; prep work is minimal.

**Other roles at this level:** Crane operators, roustabouts, well control supervisors.

**Scoring guidance:** If you think "this person would be doing roughly the same work even if we gave them the world's best AI assistant," score 1-2.

---

### **3-4: Minimal Compression**

**Definition:**
- Significant prep/support work exists but is either heavily physical, highly localized, or very low-leverage
- AI could handle some document assembly or data pulling but it represents a small % of total effort
- Standardization is low; each instance is sufficiently different that templates don't transfer
- Role would take 75-85% as long with best AI

**Why compressibility is low:**
- Field-heavy roles where coordination and hands-on work still dominate
- Administrative support work that's straightforward but tied to local systems/procedures
- Work that requires real-time responsiveness or physical presence

**Energy sector examples:**
- **Gauger/pumper (3.0)** — Daily well monitoring, fluid sampling, production accounting. AI can't do the physical gauging or sampling. Data entry after gauging is routine but infrequent. Mostly on-site work.
- **Well testing operator (3.2)** — Runs pressure/flow tests on wells, interprets real-time sensor data, makes field adjustments. AI can help with post-test report drafting but the core work (running the test, responding to pressure changes) is hands-on.

**Other roles at this level:** Environmental field samplers, marine operations technicians.

**Scoring guidance:** If the role is 70%+ field work or real-time response, with < 20 hours/week of prep/documentation that could be templated, score 3-4.

---

### **5-6: Moderate Compression**

**Definition:**
- Significant prep/analysis work (40-60% of time) that is partially standardized
- AI can compress 30-50% of that prep work (data pulling, first drafts, precedent assembly)
- Each instance has novel elements but within a recognizable pattern
- Some judgment remains mandatory; AI is a force multiplier but not a replacement for the expert
- Role would take 60-75% as long with best AI

**Why compressibility is moderate:**
- Technical judgment is required but often bounded (e.g., "Is this interpretation reasonable?" not "What should we explore?")
- Workflows repeat but with variable inputs that require adaptation
- Mix of creative analysis and mechanical work

**Energy sector examples:**
- **Nuclear reactor operator (5.2)** — Manages steady-state operations, monitors sensors, executes procedures. AI can help with pre-shift briefing assembly and procedure step validation. But hands-on monitoring and emergency response are non-compressible.
- **Energy trader (physical) (6.3)** — Executes trades, monitors positions, prepares daily P&L. AI can help gather market data, structure trade scenarios, and draft trade summaries. But final trade decisions and market judgment are human.

**Other roles at this level:** Production engineers, reservoir engineers in certain workflows, commercial/trading analysts.

**Scoring guidance:** If 40-60% of the role is standardized prep/analysis that AI could accelerate by 40-50%, and the remaining work requires expert judgment but within a pattern, score 5-6.

---

### **7-8: High Compression**

**Definition:**
- 60-75% of the role is prep/documentation/analysis work
- Workflows are standardized; templates and precedents are abundant
- AI can handle 50-75% of that prep work (first drafts, data synthesis, prior art assembly)
- Remaining 25-50% of prep work requires expert judgment/review (quality control, framing decisions, novel combinations)
- Role would take 35-60% as long with best AI

**Why compressibility is high:**
- Work is largely evidence-based (here's what we need to prove; here's the standard way to prove it)
- Prep is repetitive across many instances (same type of application, same structure, different parameters)
- Final decisions still require human judgment but are informed by AI-synthesized evidence

**Energy sector examples:**
- **Completions engineer (7.0)** — Designs well completions (tubing, packers, perforations). Prep work: Gather offset well data, review completion precedents, calculate design parameters, draft completion spec. AI can pull offset data, find precedents, structure design options. Engineer makes the final design call based on reservoir characteristics.
- **Reservoir engineer (7.6)** — Manages reserve estimates, production forecasts, field development plans. Prep: Gather production data, run decline curves, review comparable reserves, assemble documentation. AI can gather data, run standard decline curves, find comparables. Engineer interprets and makes judgment calls.

**Other roles at this level:** Most technical engineering roles, project managers, commercial/bid analysts, regulatory analysts, financial analysts.

**Scoring guidance:** If 60-75% of typical work is standardized prep that AI could handle 50%+ of, and the expert review/judgment is the main irreplaceable part, score 7-8.

---

### **9-10: Near-Complete Prep Compression**

**Definition:**
- 70-85% of the role is prep/documentation/research/drafting with high standardization
- AI can handle 75-90% of the prep work (data assembly, document structure, precedent synthesis, boilerplate drafting)
- Final output still requires human review/sign-off but the prep work is largely offloaded
- Role would take 20-40% as long with best AI (or same time, but much higher output volume/quality)
- Work is highly templatable with routine variations

**Why compressibility is very high:**
- Workflows are standardized; the structure of the work is the same every time
- Evidence gathering and synthesis is systematic (here are the 10 documents we always need; here's how we arrange them)
- Final output is a checkable artifact (a filing, a memo, a compliance checklist), not a novel judgment call

**Energy sector examples:**
- **Payroll / AP / AR clerk (8.6)** — Process invoices, reconcile accounts, run payroll. Almost entirely systematic and rule-based. AI can validate entries, flag exceptions, draft memos. Human review remains but prep is ~90% compressible.
- **Abstractor (9.1)** — Abstracts mineral titles, contracts, regulatory filings. Reads documents, extracts key terms, populates templates. AI can read documents (often better than humans in noisy/scanned PDFs), structure extracts, flag uncertain values. Lawyer reviews for accuracy but prep work is ~85% compressible.

**Other roles at this level:** Document preparation roles, administrative/clerical roles, permitting/filing support, payroll/accounting, data entry/QA roles.

**Scoring guidance:** If 70%+ of the role is templatable document/data work with clear extraction/formatting rules, score 9-10.

---

## Section 3: Calibration Set

### How to Use This Set

Before you score a full set of roles, score these **10 calibration roles** independently. The table below includes the v32 scores for the comparison step that follows; cover or ignore that column until you've scored all 10. Write down your score and confidence (1-5 scale, where 5 is "I'm very sure").

After you've scored all 10, compare your scores to the v32 column. If you're within ±1 point on most roles, you're well calibrated. If you're regularly 2+ points off, re-read the rubric and recalibrate.

### Calibration Roles (Score These First)

| Role | Layer | Typical Workflow | v32 Score | Your Score | Confidence |
|------|-------|---|---|---|---|
| **Lineman** | Physical | Daily power line maintenance, outage response, live circuit work | 2.4 | ___ | ___ |
| **Gauger / Pumper** | Physical | Well monitoring, daily fluid sampling, manual production accounting | 3.0 | ___ | ___ |
| **Nuclear Reactor Operator** | Physical | Steady-state reactor operation, sensor monitoring, procedure execution | 5.2 | ___ | ___ |
| **Energy Trader (Physical)** | Technical | Execute commodity trades, manage positions, produce daily P&L, market monitoring | 6.3 | ___ | ___ |
| **Reservoir Engineer** | Technical | Manage reserves, production forecasts, field development plans, data analysis | 7.6 | ___ | ___ |
| **Completions Engineer** | Technical | Design well completions, specify tubing/packers, review offset data, draft designs | 7.0 | ___ | ___ |
| **Permitting Analyst** | Technical | Prepare environmental permit applications, coordinate with agencies, track compliance | 7.8 | ___ | ___ |
| **Rate Case Analyst** | Advisory | Assemble evidence for utility rate cases, prepare testimony support, precedent research | 8.5 | ___ | ___ |
| **Abstractor** | Advisory | Abstract titles, contracts, regulatory filings; populate extraction templates | 9.1 | ___ | ___ |
| **Payroll / AP / AR Clerk** | Corporate | Process invoices, payroll, reconciliations, variance analysis | 8.6 | ___ | ___ |

---

## Section 4: Scoring Sheet Template

Use this table to record your scores. You may score as many or as few roles as you'd like. Minimum useful set: 10-15 roles. If you score 20+, the data becomes much more powerful for testing inter-rater agreement.

**Instructions:**
1. **role_name:** Copy from the role list provided (or propose new roles)
2. **layer:** Physical, Technical, Corporate, Advisory, Capital Markets, Governance
3. **rater_score:** Your 1-10 score for compressibility
4. **confidence:** How confident are you in that score? (1 = guess, 5 = very sure)
5. **notes:** Brief explanation of your score (optional but helpful for reconciliation)

### Blank Scoring Template

```
| Role Name | Layer | Rater Score | Confidence | Notes |
|-----------|-------|-------------|------------|-------|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
```

### Example Completed Row

| Role Name | Layer | Rater Score | Confidence | Notes |
|-----------|-------|-------------|------------|-------|
| Lineman | Physical | 2.5 | 5 | Almost all field work; safety/real-time response non-negotiable. Minimal prep work. |

---

## Section 5: How Inter-Rater Agreement Will Be Computed

Once we have scores from 3-5 raters, we'll compute agreement metrics to validate the rubric and identify roles where judgment diverges.

### Metrics We'll Calculate

#### **1. Cohen's Kappa (κ) for Pairwise Agreement**
Measures agreement between two raters, accounting for chance agreement.

- **Formula:** κ = (P_o - P_e) / (1 - P_e)
  - P_o = observed agreement (% of roles where raters agree within ±0.5 points)
  - P_e = expected agreement by chance
- **Thresholds:**
  - κ > 0.75: Excellent agreement
  - κ = 0.61-0.75: Substantial agreement (acceptable)
  - κ = 0.41-0.60: Moderate agreement (usable but flag outliers)
  - κ < 0.40: Poor agreement (rubric needs refinement)
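
For concreteness, here is a minimal Python sketch of the pairwise statistic described above. The chance model is our assumption (the text leaves it unspecified): P_e is taken as the fraction of all cross-pairings of the two raters' scores that land within ±0.5 of each other. Function name and scores are illustrative.

```python
# Minimal sketch of the pairwise kappa above. Chance agreement (P_e) is
# estimated as the share of all cross-pairings of the two raters' scores
# that fall within the tolerance; this is an assumption, not the only
# defensible chance model.
import numpy as np

def pairwise_kappa(a, b, tol=0.5):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    p_o = np.mean(np.abs(a - b) <= tol)                    # observed agreement
    p_e = np.mean(np.abs(a[:, None] - b[None, :]) <= tol)  # chance agreement
    return float((p_o - p_e) / (1.0 - p_e))

rater_a = [2.5, 3.0, 5.0, 6.5, 7.5, 9.0]  # illustrative scores
rater_b = [2.0, 4.5, 5.5, 6.0, 8.5, 9.0]
print(f"kappa = {pairwise_kappa(rater_a, rater_b):.2f}")  # -> kappa = 0.60
```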

#### **2. Intraclass Correlation Coefficient (ICC) for All Raters**
Measures agreement among 3+ raters simultaneously.

- **ICC(3,k)** with 95% confidence interval
- **Interpretation:**
  - ICC > 0.75: Excellent reliability
  - ICC = 0.60-0.75: Substantial reliability (acceptable)
  - ICC < 0.60: Poor reliability (flag roles with wide disagreement)
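
A minimal sketch of the ICC(3,k) computation (two-way mixed effects, consistency, average of k raters) built from the Shrout & Fleiss mean squares; the ratings matrix reuses the calibration examples and is illustrative. In practice, `pingouin.intraclass_corr` reports the same quantity, labeled "ICC3k", together with its 95% CI.

```python
# Minimal sketch of ICC(3,k): rows are roles, columns are raters.
import numpy as np

def icc3k(ratings: np.ndarray) -> float:
    """Shrout & Fleiss ICC(3,k) for an (n_roles, k_raters) matrix, no missing scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)  # between roles
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)  # between raters
    ss_total = np.sum((ratings - grand) ** 2)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

ratings = np.array([
    [2.5, 2.0, 2.5, 2.0],   # Lineman
    [3.0, 3.5, 3.0, 3.0],   # Gauger/Pumper
    [7.5, 8.0, 7.5, 8.0],   # Reservoir Engineer
])
print(f"ICC(3,k) = {icc3k(ratings):.3f}")  # -> 0.997 (between-role variance dominates)
```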

#### **3. Mean Absolute Difference (MAD) per Role**
For each role, compute the average absolute deviation from the median rater score.

- **Good:** MAD < 0.8 (raters are within ~0.8 points, on average)
- **Warning:** MAD > 1.2 (disagreement suggests rubric ambiguity or role definition issues)
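
A minimal sketch of the per-role MAD flag, assuming scores arrive as one list per role; role names, scores, and the 1.2 warning threshold mirror the text.

```python
# Minimal sketch: mean absolute deviation from the per-role median.
import numpy as np

def mad_from_median(scores) -> float:
    s = np.asarray(scores, dtype=float)
    return float(np.mean(np.abs(s - np.median(s))))

scores_by_role = {                      # illustrative scores
    "Lineman": [2.5, 2.0, 2.5, 2.0],
    "Gauger/Pumper": [3.0, 3.5, 3.0, 3.0],
    "Reservoir Engineer": [7.5, 8.0, 7.5, 8.0],
}
for role, s in scores_by_role.items():
    mad = mad_from_median(s)
    flag = "  <- rubric ambiguity warning" if mad > 1.2 else ""
    print(f"{role}: MAD = {mad:.3f}{flag}")
```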

#### **4. Per-Rater Bias**
Check whether one rater consistently scores high or low relative to others.

- **Compute:** Mean difference between rater's scores and group median
- **Interpretation:** If the absolute value of any rater's mean bias exceeds 0.5 points, we'll review calibration with that rater
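
A minimal sketch of the bias check; rater labels and scores are illustrative, and the 0.5 review threshold mirrors the text.

```python
# Minimal sketch: each rater's mean signed deviation from the per-role median.
import numpy as np

ratings = np.array([          # rows: roles, columns: raters (illustrative)
    [2.5, 2.0, 2.5, 2.0],
    [3.0, 3.5, 3.0, 3.0],
    [7.5, 8.0, 7.5, 8.0],
])
role_medians = np.median(ratings, axis=1, keepdims=True)
bias = (ratings - role_medians).mean(axis=0)
for name, b in zip(["A", "B", "C", "D"], bias):
    flag = "  <- review calibration" if abs(b) > 0.5 else ""
    print(f"Rater {name}: bias = {b:+.2f}{flag}")
```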

### What Thresholds Indicate Acceptable Agreement?

- **Good enough to publish:** κ > 0.70, ICC > 0.70, overall MAD < 0.9
- **Good enough for internal validation:** κ > 0.60, ICC > 0.60, MAD < 1.0
- **Needs refinement:** κ < 0.50 or ICC < 0.50 (suggests rubric is ambiguous; flag specific roles)
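
These bands are simple to encode; a minimal sketch with thresholds mirroring the text (the function name is illustrative):

```python
# Minimal sketch mapping the three agreement metrics to the decision bands.
def agreement_verdict(kappa: float, icc: float, mad: float) -> str:
    if kappa > 0.70 and icc > 0.70 and mad < 0.9:
        return "good enough to publish"
    if kappa > 0.60 and icc > 0.60 and mad < 1.0:
        return "good enough for internal validation"
    return "needs refinement: flag the highest-MAD roles"

print(agreement_verdict(kappa=0.72, icc=0.74, mad=0.6))  # -> good enough to publish
```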

### Example Output

After 4 raters score a common set of 15 roles, we'll produce a table like:

| Role | Rater A | Rater B | Rater C | Rater D | Median | MAD | v32 Score |
|------|---------|---------|---------|---------|--------|-----|-----------|
| Lineman | 2.5 | 2.0 | 2.5 | 2.0 | 2.25 | 0.25 | 2.4 |
| Gauger/Pumper | 3.0 | 3.5 | 3.0 | 3.0 | 3.0 | 0.125 | 3.0 |
| Reservoir Engineer | 7.5 | 8.0 | 7.5 | 8.0 | 7.75 | 0.25 | 7.6 |

Then compute κ, ICC, and average MAD for the full set.
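
A minimal pandas sketch of assembling that summary table from a wide rater-by-role matrix; the scores reuse the example rows above, and column names are illustrative.

```python
# Minimal sketch: build Median and MAD columns from per-rater scores.
import pandas as pd

df = pd.DataFrame(
    {
        "Rater A": [2.5, 3.0, 7.5],
        "Rater B": [2.0, 3.5, 8.0],
        "Rater C": [2.5, 3.0, 7.5],
        "Rater D": [2.0, 3.0, 8.0],
    },
    index=["Lineman", "Gauger/Pumper", "Reservoir Engineer"],
)
raters = list(df.columns)
df["Median"] = df[raters].median(axis=1)
df["MAD"] = df[raters].sub(df["Median"], axis=0).abs().mean(axis=1)
print(df.round(3))
```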

---

## Section 6: Practical Next Steps

### How Many Raters Do We Need?

- **Minimum:** 3 raters (gives us pairwise agreement + overall ICC)
- **Ideal:** 4-5 raters (more robust ICC estimate, better outlier detection)
- **Bonus:** 6+ raters lets us hold one out as a "tiebreaker" and test their calibration separately

### What Backgrounds Are Ideal?

**Tier 1 (highest priority):**
- Energy operations (utility control room, pipeline dispatcher, field supervisor, asset manager)
- Energy consulting (worked on client teams in oil, gas, renewables, utilities)
- Energy project finance (M&A analyst, project developer, due diligence lead)

**Tier 2 (very useful):**
- Energy law or regulatory affairs
- Power grid operations or wholesale market trading
- Energy construction, equipment, or supply chain

**Tier 3 (limited utility):**
- Generic consulting, software, finance with no energy background

We **cannot use** raters with no energy experience.

### How to Recruit

**Option A: Direct Outreach (Fastest)**
- Email 5-10 energy industry contacts you already have
- Subject: "8-minute peer calibration exercise — validate energy role scoring"
- Ask: "Would you score 10-15 energy roles for AI impact? Takes ~2 hours. We need your domain expertise."

**Option B: Newsletter/Podcast Audience**
- Post in Sunya's energy industry newsletter (if audience is appropriate)
- Guest ask on relevant energy/finance podcasts (traders, operators, consulting)
- LinkedIn post: "Looking for energy domain experts to validate AI compressibility scoring — 2 hours, remote"

**Option C: Formal Recruitment**
- Offer small honorarium ($200-500 per rater for 3-5 hours)
- Post on energy industry Slack communities or forums
- Reach out to energy alumni networks (Rice, Colorado School of Mines, Penn State Energy Club)

### How to Run It

**Step 1: Send Rubric + Calibration Set (5 min per rater)**
- Email this document plus the calibration set scoring template
- Ask them to score the 10 calibration roles first, no peeking at v32 scores

**Step 2: Debrief Each Rater (10 min per rater, optional)**
- Collect their calibration scores
- Share v32 anchor scores
- Quick call: "Are you comfortable with how these map to the rubric? Questions?"
- Adjust rubric language if 3+ raters flag the same issue

**Step 3: Full Scoring (4-6 hours per rater)**
- Provide a curated list of 20-30 roles spanning the full 1-10 range
- Ask them to score independently; no collaboration during scoring
- Turnaround: 2 weeks (don't push hard; this is volunteer work)

**Step 4: Aggregate and Compute Metrics (1 hour)**
- Compile all scores into a table
- Compute κ, ICC, MAD for full set and calibration set separately
- Flag any role with MAD > 1.2 for follow-up
- Compare median rater scores to v32 scores to validate dataset

**Step 5: Publish Results (30 min)**
- Write up inter-rater reliability summary
- Show which roles had high/low agreement
- Use results to either:
  - Publish validated v33 compressibility scores (if κ/ICC > 0.70)
  - Refine rubric and re-test (if κ/ICC < 0.60)
  - Publish with caveats (if 0.60 < κ/ICC < 0.70)

### Communication Template

**Subject: Peer Review: Energy Role Scoring for AI Impact**

Hi [Name],

I'm validating AI compressibility scoring for energy sector roles. I need 3-5 domain experts to independently score a set of energy roles on a 1-10 scale, measuring how much AI can compress the prep/documentation work around each role.

**What's in it for you?**
- Access to the final inter-rater agreement analysis and validated rubric
- Your name credited in the research (if you want it)

**What's required?**
- 5+ years in energy operations, consulting, project finance, or similar
- 4-6 hours over the next 2-3 weeks (~2 hours of calibration, then full scoring at your own pace)
- Willingness to think hard about "what does current AI actually help with?"

**How it works:**
1. You score 10 calibration roles (my initial scoring guide is attached)
2. We calibrate together if needed (quick call)
3. You score 20-30 additional roles at your own pace
4. I run inter-rater agreement stats and validate the rubric

Interested? Reply with your background + availability.

Thanks,
[Your name]

---

## Appendix: Technical Notes

### Confidence Intervals for ICC

If ICC = 0.72 with 4 raters and 20 roles:
- 95% CI might be [0.55, 0.84]
- Interpretation: Substantial agreement; could be as low as 0.55 or as high as 0.84

- With 5 raters: CI narrows to ~[0.60, 0.81]
- With 3 raters: CI widens to ~[0.40, 0.85]

This is why 4-5 raters is the sweet spot.
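
If you want an empirical interval instead of the analytic one, a minimal bootstrap sketch: resample roles with replacement and recompute ICC(3,k) each time. The `icc3k` function repeats the Section 5 sketch for self-containment, and the simulated scores are placeholders.

```python
# Minimal sketch: percentile bootstrap CI for ICC(3,k) over resampled roles.
import numpy as np

def icc3k(ratings):
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)
    ss_total = np.sum((ratings - grand) ** 2)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

rng = np.random.default_rng(42)
true_scores = rng.uniform(1, 10, size=20)                     # 20 simulated roles
ratings = true_scores[:, None] + rng.normal(0, 0.7, (20, 4))  # 4 noisy raters

boot = [icc3k(ratings[rng.integers(0, 20, size=20)]) for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ICC(3,k) = {icc3k(ratings):.2f}, bootstrap 95% CI [{lo:.2f}, {hi:.2f}]")
```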

### Why We Don't Use Simple Correlation

Pearson r would tell us whether raters' scores move together (e.g., both rank Lineman below Reservoir Engineer), but two raters would correlate perfectly even if one scored every role two points higher than the other. We care about absolute calibration, not just ordering, so κ and ICC are more appropriate.

### Handling Outliers

If one rater gives Lineman a score of 6.0 while others give 2-2.5:
- Flag it immediately in the MAD analysis
- Call the rater and ask: "Walk me through your reasoning. I'm not seeing how Lineman gets >4."
- If they articulate a defensible interpretation, it may indicate rubric ambiguity
- If it's a misunderstanding, exclude that score from that role but keep their other scores

### What If Agreement Is Poor?

If κ < 0.50 or ICC < 0.60:

1. **Identify the problematic roles** (highest MAD)
2. **Re-read the rubric** — is the definition clear? Is the band width realistic?
3. **Convene a brief rater discussion** (30 min video call with 3-4 raters)
4. **Refine language** on 2-3 key concepts (e.g., "What counts as 'standardized prep work'?")
5. **Re-score** just the flagged roles with updated rubric
6. **Recompute metrics**

This is normal; first-draft rubrics rarely hit 0.75.

---

## Summary Checklist for Raters

- [ ] I've read and understand Sections 1-2 (instructions and rubric)
- [ ] I've scored the 10 calibration roles independently
- [ ] I've recorded my confidence level for each calibration role
- [ ] I understand that I'm scoring "current AI capability," not company adoption or regulatory permission
- [ ] I'm ready to score a full set of 15-30 roles
- [ ] I can provide notes on any role where my score is >1 point away from v32

Thank you for your time and expertise. This data will help validate the energy decision stack analysis and shape where AI will have the most impact in the sector.
