A serverless transcription pipeline optimized for podcasts. It comes with GPU-accelerated transcription, speaker diarization, and LLM-powered refinement.
This is the successor to podwhisperer (v1), rebuilt from the ground up with AWS CDK, WhisperX running on ECS Managed Instances, and a modular pipeline orchestrated by a Lambda function with Durable Execution.
- AWS Durable Execution - Long-running workflows survive Lambda timeouts with automatic checkpointing
- LLM-Powered Refinement - Uses Amazon Bedrock to fix transcription errors and identify speakers by name
- Caption Generation - Automatic VTT, SRT, and JSON caption output with word-level timing
- Managed GPU Instances - ECS Managed Instances with automatic provisioning (no more EC2 management): instances only run when needed
- Scale-to-Zero - Zero cost when idle, automatic spin-up when work arrives
- Modular Pipeline - Each step is configurable and can be enabled/disabled independently
- AWS CDK (TypeScript) - Infrastructure as Code
- AWS Lambda with Durable Execution - Pipeline orchestration (TypeScript)
- Amazon ECS with Managed Instances - GPU worker containers
- WhisperX - Speech recognition with word-level timestamps and speaker diarization (Worker in Python)
- Amazon Bedrock - LLM-based transcript refinement
- Amazon S3 - Audio input and output storage
- Amazon SQS - Message queue for GPU worker tasks
- Amazon EventBridge - S3 event triggers and completion notifications
```mermaid
flowchart TD
    subgraph Input
        S3Upload["S3 Upload<br/>input/audio.mp3"]
    end
    subgraph "Pipeline Lambda"
        EB[EventBridge Trigger]
        CheckExists{Output Exists?}
        SQS[Send to SQS]
        WaitCallback[Wait for Callback]
        Replacement[Replacement Rules]
        LLM[LLM Refinement]
        Normalize[Segments Normalization]
        Finalize[Finalize Transcript]
        Captions[Generate Captions]
        Notify[Send Notification]
    end
    subgraph "GPU Worker"
        Poll[Poll SQS]
        Download[Download Audio]
        Transcribe[WhisperX Transcription]
        Align[Alignment]
        Diarize[Diarization]
        Upload[Upload Raw Transcript]
        Callback[Send Callback]
    end
    subgraph Output
        Raw[raw_transcript.json]
        Refined[refined_transcript.json]
        VTT[caption.vtt]
        SRT[caption.srt]
        JSON[caption.json]
    end

    S3Upload --> EB --> CheckExists
    CheckExists -->|No| SQS --> Poll
    CheckExists -->|Yes| Replacement
    Poll --> Download --> Transcribe --> Align --> Diarize --> Upload --> Callback
    Upload --> Raw
    Callback --> WaitCallback --> Replacement --> LLM --> Normalize --> Finalize --> Captions --> Notify
    Finalize --> Refined
    Captions --> VTT & SRT & JSON
```
The WhisperX container runs on GPU-enabled ECS Managed Instances:
- Download - Fetches audio from the S3 `input/` prefix
- Validation - Checks audio format and converts to WAV if needed
- Transcription - Runs WhisperX with the configured model (default: `large-v2`)
- Alignment - Generates word-level timestamps
- Diarization - Identifies and labels different speakers
- Upload - Saves the raw transcript to the S3 `output/` prefix
Apply regex or literal string replacements to fix common transcription errors:
```json
{
  "replacementRules": [
    { "type": "literal", "search": "AWS lamda", "replacement": "AWS Lambda" },
    { "type": "regex", "search": "\\bec2\\b", "replacement": "EC2" }
  ]
}
```

Uses Amazon Bedrock to improve transcript quality:
- Fixes spelling and grammar errors while preserving meaning
- Identifies speakers by name when possible (e.g., "SPEAKER_00" becomes "Luciano")
- Validates suggestions to prevent aggressive rewrites
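To give a concrete sense of that last guardrail, here is a minimal TypeScript sketch of a word-change-ratio check. It mirrors the `suggestionValidation.maxWordChangeRatio` option documented below, but it is only an illustration, not the pipeline's actual validation code:

```typescript
// Illustrative only: reject an LLM suggestion that rewrites too much of a segment.
// The threshold mirrors suggestionValidation.maxWordChangeRatio from the configuration reference.
function isAcceptableSuggestion(
  original: string,
  suggestion: string,
  maxWordChangeRatio = 0.4,
): boolean {
  const originalWords = original.toLowerCase().split(/\s+/).filter(Boolean)
  const suggestedWords = new Set(suggestion.toLowerCase().split(/\s+/).filter(Boolean))
  // Rough proxy for "changed words": original words that no longer appear in the suggestion
  const changed = originalWords.filter((word) => !suggestedWords.has(word)).length
  const ratio = originalWords.length === 0 ? 0 : changed / originalWords.length
  return ratio <= maxWordChangeRatio
}
```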
Splits long transcript segments into caption-friendly chunks:
- Respects character and word limits per segment
- Splits at natural punctuation boundaries
- Handles speaker changes within segments
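As an illustration of the splitting idea (not the actual implementation), a splitter that respects a character limit and prefers punctuation boundaries could look roughly like this; the limits mirror the `maxCharsPerSegment`, `punctuationChars`, and `punctuationSplitThreshold` options documented below:

```typescript
// Illustrative sketch: split a long segment into caption-sized chunks,
// preferring to break right after punctuation once a chunk is reasonably full.
function splitSegment(
  text: string,
  maxChars = 48,
  punctuation = ['.', ',', '?', '!', ';', ':'],
  punctuationSplitThreshold = 0.7,
): string[] {
  const chunks: string[] = []
  let current = ''
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? `${current} ${word}` : word
    if (candidate.length > maxChars && current) {
      // Hard limit reached: close the current chunk and start a new one
      chunks.push(current)
      current = word
    } else {
      current = candidate
    }
    // Soft split: end the chunk at punctuation once it is "full enough"
    if (punctuation.includes(current.slice(-1)) && current.length >= maxChars * punctuationSplitThreshold) {
      chunks.push(current)
      current = ''
    }
  }
  if (current) chunks.push(current)
  return chunks
}
```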
Generates multiple caption formats from the refined transcript:
- VTT - WebVTT format for web players
- SRT - SubRip format for video editors
- JSON - Simplified format for custom applications
Optional features:
- Word-by-word highlighting (underline, bold, or italic)
- Speaker name prefixes (always, when speaker changes, or never)
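For reference, a WebVTT cue with a speaker prefix and underline word highlighting might look roughly like this (purely illustrative; the exact cues produced by the pipeline may differ):

```text
WEBVTT

00:00:12.000 --> 00:00:14.500
[Luciano] Welcome back to <u>another</u> episode
```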
Sends an EventBridge event when the pipeline completes, including:
- Input/output S3 keys
- Pipeline timing statistics
- All generated output paths
- AWS Account with appropriate permissions
- AWS CLI configured with credentials
- Node.js 22+ and pnpm 9+
- Docker running locally (for container builds)
- HuggingFace Token with access to pyannote models (details below)
The diarization models require a HuggingFace token. Follow these steps to create one:
If you don't have a HuggingFace account yet, sign up at huggingface.co.
Visit the following model pages and accept their license agreements:
- pyannote/segmentation-3.0 - Click "Agree and access repository"
- pyannote/speaker-diarization-3.1 - Click "Agree and access repository"
- Go to HuggingFace Settings → Access Tokens
- Click "Create new token"
- Select "Read" as the token type (fine-grained)
- Give it a name (e.g., `podwhisperer`)
- Under Permissions, enable only:
  - ✅ Read access to contents of all repos under your personal namespace
  - ✅ Read access to contents of all public gated repos you can access
- Click "Create token" and copy the token value
```bash
aws ssm put-parameter \
--name "/podwhisperer/hf_token" \
--type "SecureString" \
--value "hf_your_token_here"
```

Note:
Security: The token is handled securely:
- Build time: Uses Docker BuildKit secrets (not stored in image layers)
- Runtime: Injected via SSM Parameter Store by ECS (not baked into image)
The token must be available as an environment variable (`HF_TOKEN`) when running `cdk deploy`.
If you're uncertain or have strict security requirements, consider creating a dedicated HuggingFace account with no private content and generate the token from that account.
```bash
git clone https://github.com/fourTheorem/podwhisperer.git
cd podwhisperer
pnpm install
```

Create a `cdk/config.ts` file to configure your pipeline. An example is provided at `cdk/config.example.ts`:

```typescript
// cdk/config.ts
import { defineConfig } from '@podwhisperer/config'
export default defineConfig({
transcription: {
model: 'large-v2',
language: 'en',
minSpeakers: 2,
maxSpeakers: 4,
},
llmRefinement: {
bedrockInferenceProfileId: 'eu.anthropic.claude-sonnet-4-20250514-v1:0',
additionalContext: 'This is a tech podcast about AWS and serverless.',
},
captions: {
highlightWords: true,
includeSpeakerNames: 'when-changes',
},
})
```

The `defineConfig()` helper provides type safety and IDE autocomplete for all configuration options. This pattern is similar to the one used by popular tools like Vite, Vitest, and ESLint.
Before deploying, set the HuggingFace token in your environment:
```bash
# Fetch from SSM and export for Docker build
export HF_TOKEN=$(aws ssm get-parameter \
--name "/podwhisperer/hf_token" \
--with-decryption \
--query Parameter.Value --output text)
pnpm cdk deploy
```

```bash
# Get the bucket name from the stack output
BUCKET=$(aws cloudformation describe-stacks \
--stack-name Podwhisperer \
--query 'Stacks[0].Outputs[?OutputKey==`BucketName`].OutputValue' \
--output text)
# Upload your audio file
aws s3 cp my-podcast.mp3 s3://$BUCKET/input/my-podcast.mp3
```

The pipeline will automatically trigger and generate outputs in `s3://$BUCKET/output/`.
| Option | Type | Default | Description |
|---|---|---|---|
| `model` | string | `"large-v2"` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v1`, `large-v2`, `large-v3`) |
| `language` | string | `"en"` | Language code for transcription |
| `minSpeakers` | number | `1` | Minimum speakers for diarization |
| `maxSpeakers` | number | auto | Maximum speakers (`undefined` = auto-detect) |
| `jobTimeoutMinutes` | number | `60` | Timeout for transcription job (max: 720 due to SQS limit) |
| `skipIfOutputExists` | boolean | `false` | Skip transcription if output already exists |
| `hfTokenSsmPath` | string | `"/podwhisperer/hf_token"` | SSM parameter path for HuggingFace token |
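For example, a cost-conscious configuration might combine several of these options (values are illustrative):

```typescript
import { defineConfig } from '@podwhisperer/config'

export default defineConfig({
  transcription: {
    model: 'medium',           // smaller model: faster and cheaper than large-v2
    language: 'en',
    minSpeakers: 1,
    maxSpeakers: 3,
    jobTimeoutMinutes: 90,     // allow longer episodes before the job times out
    skipIfOutputExists: true,  // don't re-transcribe files that already have outputs
  },
})
```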
Array of replacement rules, each with:
| Option | Type | Description |
|---|---|---|
| `type` | `"regex"` \| `"literal"` | Matching type |
| `search` | string | Pattern or string to find |
| `replacement` | string | Text to replace matches with |
| Option | Type | Default | Description |
|---|---|---|---|
| `bedrockInferenceProfileId` | string | required | Bedrock inference profile (e.g., `"eu.anthropic.claude-sonnet-4-20250514-v1:0"`) |
| `additionalContext` | string | - | Context to help the LLM understand domain-specific terms |
| `modelConfig.max_tokens` | number | `64000` | Maximum tokens in response |
| `modelConfig.temperature` | number | `0.2` | Model temperature |
| `suggestionValidation.enabled` | boolean | `true` | Enable validation of LLM suggestions |
| `suggestionValidation.maxWordChangeRatio` | number | `0.4` | Max percentage of words that can change (0-1) |
| `suggestionValidation.maxNormalizedEditDistance` | number | `0.5` | Max normalized edit distance (0-1) |
| `suggestionValidation.maxConsecutiveChanges` | number | `3` | Max consecutive word changes allowed |
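Putting these together, and assuming the dotted option names map onto nested objects (check `cdk/config.example.ts` for the exact shape), an `llmRefinement` block might look like this:

```typescript
import { defineConfig } from '@podwhisperer/config'

export default defineConfig({
  llmRefinement: {
    bedrockInferenceProfileId: 'eu.anthropic.claude-sonnet-4-20250514-v1:0',
    additionalContext: 'This is a tech podcast about AWS and serverless.',
    modelConfig: {
      max_tokens: 64000,
      temperature: 0.2,
    },
    suggestionValidation: {
      enabled: true,
      maxWordChangeRatio: 0.4,        // reject suggestions that change more than 40% of words
      maxNormalizedEditDistance: 0.5,
      maxConsecutiveChanges: 3,
    },
  },
})
```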
| Option | Type | Default | Description |
|---|---|---|---|
| `normalize` | boolean | `true` | Enable segment normalization |
| `maxCharsPerSegment` | number | `48` | Max characters per caption segment |
| `maxWordsPerSegment` | number | `10` | Max words per caption segment |
| `splitSegmentAtSpeakerChange` | boolean | `true` | Force split when speaker changes |
| `punctuationSplitThreshold` | number | `0.7` | Threshold for splitting at punctuation (0-1) |
| `punctuationChars` | string[] | `[".", ",", "?", "!", ";", ":"]` | Characters that can end a segment |
| Option | Type | Default | Description |
|---|---|---|---|
| `generateVtt` | boolean | `true` | Generate WebVTT captions |
| `generateSrt` | boolean | `true` | Generate SRT captions |
| `generateSimplifiedJson` | boolean | `true` | Generate JSON captions |
| `highlightWords` | boolean | `false` | Enable word-by-word highlighting |
| `highlightWith` | `"underline"` \| `"bold"` \| `"italic"` | `"underline"` | Highlight style |
| `includeSpeakerNames` | `"never"` \| `"always"` \| `"when-changes"` | `"when-changes"` | When to show speaker names |
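For example, to generate only VTT and SRT captions with bold word highlighting and speaker names on every cue (values are illustrative):

```typescript
import { defineConfig } from '@podwhisperer/config'

export default defineConfig({
  captions: {
    generateVtt: true,
    generateSrt: true,
    generateSimplifiedJson: false,
    highlightWords: true,
    highlightWith: 'bold',
    includeSpeakerNames: 'always',
  },
})
```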
| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Send EventBridge notification on completion |
| `eventBusName` | string | `"default"` | EventBridge event bus name |
| `source` | string | `"podwhisperer.pipeline"` | Event source identifier |
| `detailType` | string | `"Pipeline Completed"` | Event detail type |
For an input file `input/episode-42.mp3`, the pipeline generates:

| File | Description |
|---|---|
| `output/episode-42_raw_transcript.json` | Raw WhisperX output with word-level timing |
| `output/episode-42_refined_transcript.json` | Post-processed transcript with all refinements |
| `output/episode-42_caption.vtt` | WebVTT captions |
| `output/episode-42_caption.srt` | SRT captions |
| `output/episode-42_caption.json` | Simplified JSON captions with speaker mapping (custom format, useful, for example, for generating static sites or dynamic visualizations) |
This pipeline uses several billable AWS services:
| Service | Cost Driver | Notes |
|---|---|---|
| EC2 GPU Instances | ~$0.50-1.00/hour (on-demand) | g4dn.xlarge or similar; scales to zero when idle |
| Amazon Bedrock | ~$3-15 per 1M tokens | Varies by model; Claude Sonnet used for refinement |
| S3 Storage | ~$0.023/GB/month | Audio files + transcripts |
| Lambda | Minimal | Durable execution included in Lambda pricing |
| SQS | Minimal | First 1M requests/month free |
Typical cost for a 1-hour podcast: $0.15-0.50 in GPU time (roughly 10-15 minutes of processing), plus LLM tokens if refinement is enabled.
To minimize costs:
- Scale-to-zero is enabled by default (no GPU costs when idle)
- Disable LLM refinement if not needed (`llmRefinement: undefined`)
- Use a smaller Whisper model (`medium` instead of `large-v2`)
To tear down all resources and stop incurring costs:
```bash
pnpm cdk destroy
```

This removes:
- S3 bucket (will fail if not empty; empty it first or use `--force`)
- ECS cluster and GPU capacity provider
- Lambda functions and log groups
- SQS queues and EventBridge rules
To empty the S3 bucket before destroying:
```bash
aws s3 rm s3://$BUCKET --recursive
```

By default, uploading the same file again triggers a full re-processing. The previous outputs are overwritten.
To avoid re-processing files that already have outputs:
```typescript
transcription: {
  skipIfOutputExists: true,
  // ...
}
```

With this enabled, the pipeline checks for existing output files and skips transcription if found. Post-processing steps still run, allowing you to update LLM refinement or caption settings without re-transcribing.
To force re-processing when `skipIfOutputExists` is enabled:

- Delete the output files:

  ```bash
  aws s3 rm s3://$BUCKET/output/episode-42_raw_transcript.json
  ```

- Re-upload the input file:

  ```bash
  aws s3 cp my-podcast.mp3 s3://$BUCKET/input/episode-42.mp3
  ```
When the pipeline completes, it sends an event to EventBridge. You can create rules to trigger downstream workflows.
An example event payload:

```json
{
"source": "podwhisperer.pipeline",
"detail-type": "Pipeline Completed",
"detail": {
"bucket": "podwhisperer-bucket-abc123",
"inputKey": "input/episode-42.mp3",
"outputKeys": {
"rawTranscript": "output/episode-42_raw_transcript.json",
"refinedTranscript": "output/episode-42_refined_transcript.json",
"captions": {
"vtt": "output/episode-42_caption.vtt",
"srt": "output/episode-42_caption.srt",
"json": "output/episode-42_caption.json"
}
},
"timing": {
"pipelineStartedAt": "2025-01-26T12:00:00.000Z",
"pipelineCompletedAt": "2025-01-26T12:15:30.000Z",
"pipelineDurationMs": 930000
}
}
}
```

For example, to trigger a downstream Lambda function when the pipeline completes:

```typescript
// In your CDK stack
const rule = new events.Rule(this, 'PodwhispererComplete', {
eventPattern: {
source: ['podwhisperer.pipeline'],
detailType: ['Pipeline Completed'],
},
});
rule.addTarget(new targets.LambdaFunction(myDownstreamLambda));
```

To publish an SNS notification instead:

```typescript
const topic = new sns.Topic(this, 'TranscriptReady');
const rule = new events.Rule(this, 'NotifyOnComplete', {
eventPattern: {
source: ['podwhisperer.pipeline'],
detailType: ['Pipeline Completed'],
},
});
rule.addTarget(new targets.SnsTopic(topic));
```

Since the pipeline uses ECS Managed Instances with scale-to-zero, the first transcription after a period of inactivity can take 5-15 minutes while AWS provisions GPU capacity and starts the container (~15GB image).
To avoid this cold start delay, you can send a warmup message to the SQS queue ahead of time, keeping the GPU worker alive and ready for incoming jobs.
```bash
# Get the queue URL from stack outputs
QUEUE_URL=$(aws cloudformation describe-stacks \
--stack-name Podwhisperer \
--query 'Stacks[0].Outputs[?OutputKey==`QueueUrl`].OutputValue' \
--output text)
# Send a warmup message (keep warm for 30 minutes - default)
aws sqs send-message \
--queue-url $QUEUE_URL \
--message-body '{"type": "warmup"}'
# Or specify a custom duration
aws sqs send-message \
--queue-url $QUEUE_URL \
--message-body '{"type": "warmup", "until": "2025-01-15T14:30:00Z"}'
```

| Field | Type | Required | Description |
|---|---|---|---|
| `type` | `"warmup"` | Yes | Identifies this as a warmup message |
| `until` | ISO 8601 timestamp | No | When warmup expires (default: 30 minutes from receipt) |
- When the worker receives a warmup message, it immediately acknowledges (deletes) it
- The worker sets an internal "keep warm until" timestamp
- During the warmup period, the normal auto-shutdown behavior (exit after 3 empty polls) is disabled
- Normal transcription jobs continue to process as usual during warmup
- Once the warmup period expires, normal auto-shutdown behavior resumes
- Multiple warmup messages: If a new warmup message arrives while already warm, the warmup period is only extended (never shortened)
- Scheduled batch processing: Send a warmup message 10-15 minutes before uploading multiple files
- User-triggered workflows: Warm up capacity when a user starts a workflow that will need transcription
- Predictable usage patterns: If you know transcription jobs arrive at certain times, pre-warm beforehand
Cost Note: Warmup keeps GPU instances running, which incurs costs (~$0.50-1.00/hour). Only use warmup when you expect jobs to arrive soon.
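If jobs follow a predictable schedule, the warmup message itself can be automated. Below is a minimal CDK sketch using a scheduled EventBridge rule that targets the worker queue; the `jobQueue` reference, rule name, and schedule are illustrative assumptions, not part of the stack's documented API:

```typescript
import * as events from 'aws-cdk-lib/aws-events'
import * as targets from 'aws-cdk-lib/aws-events-targets'
import * as sqs from 'aws-cdk-lib/aws-sqs'

// In your CDK stack.
// Assumes you have a reference to the GPU worker queue; how you obtain it depends on your stack.
declare const jobQueue: sqs.IQueue

// Example: pre-warm every weekday at 07:45 UTC, ahead of a morning batch upload
const warmupRule = new events.Rule(this, 'PreWarmGpuWorker', {
  schedule: events.Schedule.cron({ minute: '45', hour: '7', weekDay: 'MON-FRI' }),
})

warmupRule.addTarget(
  new targets.SqsQueue(jobQueue, {
    message: events.RuleTargetInput.fromObject({ type: 'warmup' }),
  }),
)
```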
```bash
# Build all packages
pnpm run build
# Lint and format
pnpm lint
pnpm format
# Run tests
pnpm --filter @podwhisperer/pipeline test
# Synthesize CloudFormation (without deploying)
pnpm cdk synth
# Compare with deployed stack
pnpm cdk diff
```

For the WhisperX container image (Python worker):

```bash
cd whisperx-image
# Install dev dependencies
make install-dev
# Run linter
make lint
# Run formatter
make format
# Run tests
make test
# Run tests with coverage report
make test-cov
```

Everyone is welcome to contribute to this project. You can contribute simply by reporting bugs or suggesting improvements in a GitHub issue.
Licensed under the MIT License. © fourTheorem.