Turn raw earnings-call transcripts into decision-grade sentiment metrics that quantify how management, analysts, and the market narrative evolve over time.
This repository builds an end-to-end NLP pipeline that produces:
- ✅ Clean speaker-level text blocks
- ✅ Speaker role labelling (Management / Analyst / Operator / Other)
- ✅ Dual-model sentiment scoring
  - VADER (fast, rule-based baseline)
  - FinBERT (finance-domain transformer)
- ✅ Power BI-ready metrics tables
- ✅ A Power BI dashboard: NLP-Dashboard.pbix
Earnings calls are not “just text.” They are strategic communication events that influence:
- Investor confidence
- Risk perception
- Market narratives
- Competitive positioning
Sentiment scoring helps detect changes in:
- Confidence vs caution (tone shifting positive → neutral/negative)
- Uncertainty language (hedging, vague guidance)
- Pressure dynamics (analysts pushing back vs management defending)
- Narrative momentum across quarters and companies
These insights can support:
- Investor Relations (IR): refine messaging; identify where investors are unconvinced
- Equity Research: add consistent sentiment KPIs to qualitative call notes
- Risk / Compliance: flag unusually negative calls for deeper review
- Portfolio strategy: compare narrative trend across companies and time
- Competitive intelligence: benchmark management confidence vs peers
This repo demonstrates the core value loop:
Raw text → structured speaker blocks → model scoring → aggregated KPIs → business dashboard
That’s exactly how NLP engineers turn unstructured language into measurable metrics that guide decisions. As an aspiring engineer myself, I made sure to follow this core value loop!
```
.
├── data/
│   ├── raw/
│   │   └── transcripts_raw.csv
│   └── processed/
│       ├── speaker_blocks_cleaned.csv
│       ├── speaker_blocks_with_vader.csv
│       ├── speaker_blocks_with_finbert.csv   # expected output for merge
│       ├── speaker_blocks_with_sentiment.csv
│       ├── powerbi_call_level_metrics.csv
│       ├── powerbi_role_level_metrics.csv
│       ├── preprocess_checkpoint.txt
│       └── vader_checkpoint.txt
│
├── etl/
│   ├── load_transcripts.py
│   └── preprocess_speaker_blocks.py
│
├── models/
│   ├── sentiment_vader.py
│   └── sentiment_finbert.py
│
├── features/
│   ├── merge_sentimnets.py   # filename typo is intentional (matches repo)
│   └── aggregate_for_powerbi.py
│
├── NLP-Dashboard.pbix
└── requirements.txt
```
This project uses the Hugging Face dataset:
kurry/sp500_earnings_transcripts
etl/load_transcripts.py exports it to:
data/raw/transcripts_raw.csv
Windows:
python -m venv .venv
.venv\Scripts\activate

macOS/Linux:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
IMPORTANT: etl/load_transcripts.py uses Hugging Face datasets:
from datasets import load_dataset
So install it (and ideally add it to requirements.txt): pip install datasets
Preprocessing uses: spacy.load("en_core_web_sm", disable=["parser", "ner"])
Install: python -m spacy download en_core_web_sm
python etl/load_transcripts.py
Output:
- data/raw/transcripts_raw.csv
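A minimal sketch of the export step, assuming the dataset exposes a single train split (the actual etl/load_transcripts.py may differ):

```python
# Illustrative sketch only: export the Hugging Face dataset to a flat CSV.
from pathlib import Path

from datasets import load_dataset

OUT_PATH = Path("data/raw/transcripts_raw.csv")

def export_transcripts() -> None:
    # Download (or load from cache) the S&P 500 earnings-transcript dataset.
    ds = load_dataset("kurry/sp500_earnings_transcripts", split="train")
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    # Dataset.to_csv writes the whole split to a single CSV file.
    ds.to_csv(str(OUT_PATH))

if __name__ == "__main__":
    export_transcripts()
```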
python etl/preprocess_speaker_blocks.py
Outputs:
- data/processed/speaker_blocks_cleaned.csv
- data/processed/preprocess_checkpoint.txt (resume support)
What preprocessing does (technical):
- Reads raw transcripts from data/raw/transcripts_raw.csv
- Parses structured_content safely via:
  - json.loads() first
  - ast.literal_eval() fallback
- Extracts speaker blocks from common keys (segments/blocks/content/dialogue)
- Cleans text:
  - lowercasing
  - removes “forward-looking statements” and “safe harbour” sections
  - strips non-alphabet characters
  - lemmatizes with spaCy
  - removes stopwords
  - keeps alpha tokens with length > 2
- Filters low-signal blocks using:
  - MIN_BLOCK_LEN = 30 words
- Labels speaker role using keyword heuristics:
  - operator / management / analyst / other
Output schema: data/processed/speaker_blocks_cleaned.csv
- symbol
- company_name
- year
- quarter
- date
- speaker
- speaker_role
- clean_text
- block_length
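To make the steps above concrete, here is a condensed, illustrative sketch of the parsing, cleaning, and role-labelling logic; the keyword lists and helper names are assumptions rather than the exact code in etl/preprocess_speaker_blocks.py:

```python
import ast
import json
import re

import spacy

# Parser/NER disabled for speed; only tokenization + lemmatization are needed.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
MIN_BLOCK_LEN = 30  # drop low-signal blocks shorter than this (in words)
SAFE_HARBOUR = re.compile(r"forward-looking statements|safe harbou?r", re.IGNORECASE)

def parse_structured(raw: str):
    """Parse structured_content: json.loads first, ast.literal_eval as fallback."""
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return ast.literal_eval(raw)

def clean_text(text: str) -> str:
    text = text.lower()
    # Drop sentences that contain disclaimer boilerplate.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    text = " ".join(s for s in sentences if not SAFE_HARBOUR.search(s))
    text = re.sub(r"[^a-z\s]", " ", text)  # strip non-alphabet characters
    doc = nlp(text)
    tokens = [
        tok.lemma_ for tok in doc
        if tok.is_alpha and not tok.is_stop and len(tok.text) > 2
    ]
    return " ".join(tokens)

def label_role(speaker: str) -> str:
    """Keyword heuristics: operator / management / analyst / other."""
    s = speaker.lower()
    if "operator" in s:
        return "operator"
    if any(k in s for k in ("ceo", "cfo", "chief", "president", "officer", "vp")):
        return "management"
    if "analyst" in s:
        return "analyst"
    return "other"
```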
python models/sentiment_vader.py
Outputs:
- data/processed/speaker_blocks_with_vader.csv
- data/processed/vader_checkpoint.txt
Adds:
- sentiment_vader (compound score in [-1, 1])
Notes:
- Uses chunked processing (CHUNK_ROWS = 5000)
- Resume support via vader_checkpoint.txt
- Duplicate-protection: script stops if output exists and checkpoint is 0
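A sketch of the chunked, resumable scoring pattern (the repo may use the nltk VADER implementation instead of the vaderSentiment package; constants follow the README):

```python
from pathlib import Path

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

IN_PATH = Path("data/processed/speaker_blocks_cleaned.csv")
OUT_PATH = Path("data/processed/speaker_blocks_with_vader.csv")
CHECKPOINT = Path("data/processed/vader_checkpoint.txt")
CHUNK_ROWS = 5000

analyzer = SentimentIntensityAnalyzer()

def score_in_chunks() -> None:
    done = int(CHECKPOINT.read_text()) if CHECKPOINT.exists() else 0
    if OUT_PATH.exists() and done == 0:
        # Duplicate protection: refuse to append to an output that has no checkpoint.
        raise SystemExit("Output exists but checkpoint is 0; delete both files for a clean rebuild.")
    for i, chunk in enumerate(pd.read_csv(IN_PATH, chunksize=CHUNK_ROWS)):
        if i < done:
            continue  # this chunk was already scored before the interruption
        chunk["sentiment_vader"] = chunk["clean_text"].astype(str).map(
            lambda t: analyzer.polarity_scores(t)["compound"]  # compound score in [-1, 1]
        )
        chunk.to_csv(OUT_PATH, mode="a", header=not OUT_PATH.exists(), index=False)
        CHECKPOINT.write_text(str(i + 1))  # resume point for the next run

if __name__ == "__main__":
    score_in_chunks()
```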
python models/sentiment_finbert.py
Expected output (required for merge step):
- data/processed/speaker_blocks_with_finbert.csv
Required columns (used downstream):
- finbert_sentiment (positive / neutral / negative)
- finbert_confidence (confidence score)
IMPORTANT:
- merge + aggregation scripts assume the file/columns above exist.
- If the current sentiment_finbert.py is not producing them yet, implement/update it so it writes: data/processed/speaker_blocks_with_finbert.csv
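If the script still needs work, a minimal sketch using the Hugging Face transformers pipeline is below; the ProsusAI/finbert checkpoint, batch size, and input file are assumptions:

```python
from pathlib import Path

import pandas as pd
from transformers import pipeline

IN_PATH = Path("data/processed/speaker_blocks_cleaned.csv")
OUT_PATH = Path("data/processed/speaker_blocks_with_finbert.csv")

# FinBERT fine-tuned for financial sentiment; labels are positive / neutral / negative.
finbert = pipeline("text-classification", model="ProsusAI/finbert")

def score_finbert() -> None:
    df = pd.read_csv(IN_PATH)
    texts = df["clean_text"].astype(str).tolist()
    preds = finbert(texts, batch_size=32, truncation=True)
    df["finbert_sentiment"] = [p["label"] for p in preds]    # positive / neutral / negative
    df["finbert_confidence"] = [p["score"] for p in preds]   # softmax confidence
    df.to_csv(OUT_PATH, index=False)

if __name__ == "__main__":
    score_finbert()
```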
python features/merge_sentimnets.py
Inputs:
- data/processed/speaker_blocks_with_vader.csv
- data/processed/speaker_blocks_with_finbert.csv
Output:
- data/processed/speaker_blocks_with_sentiment.csv
Merge method:
- strict inner join on: symbol, company_name, year, quarter, date, speaker, speaker_role, clean_text, block_length
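In pandas terms, the merge amounts to a sketch like this (the actual features/merge_sentimnets.py may structure it differently):

```python
import pandas as pd

KEYS = [
    "symbol", "company_name", "year", "quarter", "date",
    "speaker", "speaker_role", "clean_text", "block_length",
]

vader = pd.read_csv("data/processed/speaker_blocks_with_vader.csv")
finbert = pd.read_csv("data/processed/speaker_blocks_with_finbert.csv")

# Strict inner join: only blocks scored by BOTH models survive.
merged = vader.merge(finbert, on=KEYS, how="inner")
merged.to_csv("data/processed/speaker_blocks_with_sentiment.csv", index=False)
```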
The Python pipeline produces sentiment metrics — but the Power BI report is the “decision layer” that makes those metrics usable in real business workflows.
This project isn’t just NLP in Python — the Power BI layer is where raw text becomes decisions.
The dashboard turns thousands of earnings-call speaker blocks into executive-ready KPIs, allowing business analysts to:
- spot sentiment shifts over time (quarter-to-quarter),
- compare Management vs Analyst tone (credibility vs skepticism),
- separate Prepared Remarks vs Q&A (scripted vs spontaneous),
- and drill into outliers + confidence using purpose-built tooltips.
Business impact (real world):
Sentiment signals can act like an “early-warning system” for guidance risk, investor expectations, PR issues, or competitive pressure — especially when Management’s tone diverges from Analysts’ tone.
Key questions the dashboard answers:
- What’s the overall tone right now?
- Is sentiment improving or deteriorating over time?
- Is Management more optimistic than Analysts (or vice versa)?
- Does sentiment change when we move from scripted Prepared Remarks → spontaneous Q&A?
- What’s the distribution of positive / neutral / negative sentiment?
Slicers:
- Company slicer (single or multi-select)
- Year slicer
- Quarter slicer
- Section slicer: Prepared Remarks vs Q&A
(this is critical — Q&A is where uncertainty and tension usually show up)
The banner is the “exec summary” for the current filter context:
- Total Calls
- Avg FinBERT
- Avg VADER
- Avg Confidence
Example shown in the screenshot (AAPL):
- Total Calls: 10
- Avg FinBERT: 0.73
- Avg VADER: 0.77
- Avg Confidence: 0.82
These are the two tables that my Python pipeline exports and my Power BI dashboard consumes:
Call-Level Metrics Table (one row per earnings call):
- volume context (total_blocks, avg_block_len)
- sentiment aggregates (vader_mean, finbert_mean)
- mix metrics (finbert_pos / finbert_neu / finbert_neg)
- reliability signal (finbert_avg_conf)
- divergence signals (vader_gap_mgmt_minus_analyst, finbert_gap_mgmt_minus_analyst)
- traceability ID (CallKey) used in tooltips for “best/worst call” surfacing

Role-Level Metrics Table (one row per call + role + section):
- role splits (Analyst / Management / Operator)
- section splits (Prepared Remarks / Q&A)
- role-level sentiment means + medians (e.g., vader_median)
- role-level confidence
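Both tables are exported by features/aggregate_for_powerbi.py. Below is a hedged sketch of how the call-level table, including the Management minus Analyst divergence signal, can be built; the label-to-number mapping and exact column handling are assumptions:

```python
import pandas as pd

df = pd.read_csv("data/processed/speaker_blocks_with_sentiment.csv")
CALL_KEYS = ["symbol", "company_name", "year", "quarter", "date"]

# Map FinBERT labels to numbers so they can be averaged per call (assumed mapping).
df["finbert_num"] = df["finbert_sentiment"].map(
    {"positive": 1.0, "neutral": 0.0, "negative": -1.0}
)

call = df.groupby(CALL_KEYS).agg(
    total_blocks=("clean_text", "size"),
    avg_block_len=("block_length", "mean"),
    vader_mean=("sentiment_vader", "mean"),
    finbert_mean=("finbert_num", "mean"),
    finbert_avg_conf=("finbert_confidence", "mean"),
).reset_index()

# Divergence signal: Management tone minus Analyst tone, per call.
role_means = df[df["speaker_role"].isin(["management", "analyst"])].pivot_table(
    index=CALL_KEYS, columns="speaker_role", values="sentiment_vader", aggfunc="mean"
)
gap = (role_means["management"] - role_means["analyst"]).rename(
    "vader_gap_mgmt_minus_analyst"
).reset_index()

call = call.merge(gap, on=CALL_KEYS, how="left")
call.to_csv("data/processed/powerbi_call_level_metrics.csv", index=False)
```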
Power BI data model:
- Two primary fact tables: powerbi_call_level_metrics_v2 and powerbi_role_level_metrics_v2
- DimSection supports clean filtering for Prepared Remarks vs Q&A
- A dedicated MeasuresTable centralizes KPI logic and tooltip logic
- Several small tooltip helper tables (TT_*) exist purely to power tooltip layouts: TT Role, TT Sentiment, TT Axis, TT Extremes
This structure lets my tooltips act as “mini dashboards” without cluttering the main page, which stays a minimal, at-a-glance overview.
I built 3 tooltip pages that activate on hover from the main dashboard. These tooltips don’t just repeat visible charts — they provide diagnostics:
- role divergence
- percentile ranking
- baseline deltas vs company norm
- QoQ momentum
- call extremes (best/worst) with CallKey traceability
- confidence classification (high / medium / low)
Tooltip #1 (Role Gap + Percentile) shows:
- FinBERT Gap and VADER Gap (Management − Analyst)
- Gap Percentile Label (how extreme the divergence is)
- A Role Insights table that breaks down:
  - Sentiment (FinBERT)
  - Sentiment (VADER)
  - Delta (VADER)
  - Delta (FinBERT)
- Confidence badge:
  - ✅ High confidence
  - ⚠️ Medium confidence
  - ❌ Low confidence
Example shown:
- FinBERT Gap = -0.02
- VADER Gap = -0.01
- Gap Percentile Label = 38th%
- Analyst vs Management (FinBERT): 0.80 vs 0.78
- Analyst vs Management (VADER): 0.85 vs 0.84
- Confidence: ✅ High confidence
A divergence between Management and Analysts can signal:
- credibility gaps,
- skepticism in Q&A,
- uncertainty not reflected in scripted remarks,
- narrative management vs fundamentals.
Tooltip #2 (Δ vs Company Baseline) shows:
- Distribution cards:
- +ve Label
- Neutral Label
- -ve Label
- Benchmarking cards:
- vs Company Avg.
- Δ (change vs baseline)
- A mini chart: Δ vs Company Baseline (how the current context differs from the company’s normal tone)
Example shown:
- +ve Label: 86%
- Neutral Label: 1%
- -ve Label: 13%
- +73% vs Company Avg.
- Baseline deltas show directionality by bucket:
- Positive: +0.013
- Negative: -0.004
- Neutral: -0.009
This prevents the classic analytics mistake:
“This quarter is positive”
instead of
“This quarter is positive relative to this company’s baseline behavior.”
It tells you if the tone is actually unusual or just “business as usual” for that company.
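In the report this lives in DAX measures, but the underlying logic is just the filtered context's mean minus the company's overall mean. A rough Python equivalent (the company, quarter value, and numeric mapping below are hypothetical):

```python
import pandas as pd

df = pd.read_csv("data/processed/speaker_blocks_with_sentiment.csv")
# Assumed label-to-number mapping so the FinBERT labels can be averaged.
score = df["finbert_sentiment"].map({"positive": 1.0, "neutral": 0.0, "negative": -1.0})

company = df["symbol"] == "AAPL"                 # hypothetical company filter
context = company & (df["quarter"] == "2023Q3")  # hypothetical slicer context

# Positive delta = this context is more positive than the company's normal tone.
delta_vs_baseline = score[context].mean() - score[company].mean()
print(f"Delta vs company baseline: {delta_vs_baseline:+.3f}")
```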
Tooltip #3 (QoQ Δ + Extremes) shows:
- Momentum cards:
- FinBERT QoQ Δ
- VADER QoQ Δ
- Scale context:
- blocks
- calls
- Sentiment Mix stacked bar (Negative / Neutral / Positive)
- Call Extremes (within Quarter):
  - Best + Worst surfaced using Extreme CallKey
  - Includes FinBERT label strength + traceability
Example shown:
- FinBERT QoQ Δ: +0.80
- VADER QoQ Δ: +0.84
- 5,513 blocks
- 175 calls
- Call Extremes:
- Best: ADBE-20211216-2021Q4 (FinBERT Label: 1.00)
- Worst: ABT-20220419-2022Q1 (FinBERT Label: 0.06)
- Confidence: ✅ High confidence
This is the “so what?” tooltip:
- what changed (QoQ),
- how strong the change is (distribution),
- and which calls created the movement (extremes).
These three screenshots show the tooltips activating directly from the Executive Overview page.
- The dashboard is not static — it’s interactive and context-aware.
- Hovering passes filter context into tooltip pages (company/quarter/section/sentiment bucket).
- The tooltips behave like mini dashboards:
- Example (AAPL • Q3 • Positive): ❌ Low confidence
  - FinBERT Gap: +0.13
  - VADER Gap: -0.21
  - Gap Percentile: 50th%
- Role matrix shows cross-model disagreement patterns (FinBERT vs VADER).
- Example (ABBV): ⚠️ Medium confidence
  - FinBERT QoQ Δ: +0.73
  - VADER QoQ Δ: +0.90
  - 83 blocks, 2 calls
  - Extremes surfaced via CallKey:
    - Best: ABBV-20210203-2020Q4 (0.80)
    - Worst: ABBV-20240202-2023Q4 (0.67)
Use these as “executive questions” — each one maps directly to a slicer + a visual + a tooltip.
- Where to look: KPI Banner + Earnings Call Sentiment Trend
- What to check:
- Avg FinBERT / Avg VADER direction over time
- Avg Confidence (high confidence = signal, low confidence = caution)
- Where to look: Management vs Analyst Sentiment bar chart
- Then hover: Tooltip #1 (Role Gap + Percentile)
- What to check:
- FinBERT Gap + VADER Gap (Mgmt − Analyst)
- Gap Percentile Label (is the divergence extreme or normal?)
- Where to look: Prepared Remarks vs Q&A comparison chart
- How: toggle Section slicer (Prepared Remarks vs Q&A)
- What to check:
- If Q&A sentiment drops while Prepared stays high → potential uncertainty / pressure.
- Where to look: Hover any relevant context → Tooltip #2 (Δ vs Company Baseline)
- What to check:
- +ve / Neutral / -ve Label percentages
- vs Company Avg. uplift
- Baseline deltas by bucket (Positive/Neutral/Negative)
- Where to look: Hover the trend or quarter context → Tooltip #3 (QoQ Δ + Extremes)
- What to check:
- FinBERT QoQ Δ and VADER QoQ Δ
- blocks + calls (sample size)
- Call Extremes (Best/Worst CallKey) to identify the exact calls moving the metric
- Where to look: Tooltip #1 Role Insights table
- What to check:
- If FinBERT is strongly positive but VADER is negative (or vice versa), treat as a language-style edge case
- Use the confidence badge (High/Medium/Low) to decide whether to trust the label
- Where to look: KPI Banner (Avg Confidence) + tooltip confidence badges
- What to check:
- If confidence is low, prefer directional trends + larger sample contexts
- Use blocks/calls (Tooltip #3) as a sanity check
- Where to look: Trend line + Tooltip #3 extremes
- How to decide quickly:
- pick the quarter with the biggest QoQ swing
- confirm it has enough blocks/calls
- grab the Worst CallKey and review the transcript context for that call
- Pick a Company (or compare multiple).
- Filter by Year / Quarter.
- Toggle Prepared Remarks vs Q&A for “scripted vs real”.
- Hover any key visual to open tooltips:
- Role Gap diagnostics
- Baseline deltas vs company norm
- QoQ change + best/worst calls
- Use the tooltip insights to answer:
- “Is sentiment moving?”
- “Is leadership credibility aligned with analysts?”
- “Where are the outliers and how confident are we?”
This report is designed with one goal:
Keep the main page clean and executive-friendly — push the deep diagnostics into interactive tooltips.
Built for long runs + safe interruption:
- data/processed/preprocess_checkpoint.txt
- data/processed/vader_checkpoint.txt
If interrupted (CTRL+C), rerun the script to resume from the last checkpoint.
spaCy model missing:
- Error: Can't find model 'en_core_web_sm'
- Fix: python -m spacy download en_core_web_sm
Hugging Face datasets missing:
- Error: No module named 'datasets'
- Fix: pip install datasets
VADER duplicate protection:
- For a clean rebuild, delete:
- data/processed/speaker_blocks_with_vader.csv
- data/processed/vader_checkpoint.txt