Cassandra database schemas, sample data, and multi-platform data loaders for the KillrVideo reference application.
This repository now includes comprehensive data loading tools for multiple platforms:
- OSS Cassandra 5.0+: DSBulk and Python loaders for on-premises deployments
- Astra Tables: CQL-compatible tables with client-side embedding generation
- Astra Collections: Schema-less JSON documents with automatic vectorization
Each approach demonstrates different Cassandra capabilities and deployment patterns. Choose based on your use case!
killrvideo-data/
├── data/                    # Shared data source
│   ├── csv/                 # CSV files (~1.8MB, 1800+ rows)
│   └── schemas/             # Versioned CQL schemas
├── loaders/                 # Platform-specific data loaders
│   ├── oss-cassandra/       # OSS Cassandra 5.0+ loaders
│   ├── astra-tables/        # Astra with Tables (structured)
│   ├── astra-collections/   # Astra with Collections (JSON)
│   └── requirements.txt     # Python dependencies
├── examples/                # Sample data and queries
├── graph/                   # Graph database schema (DSE only)
├── migrating/               # Version migration guides
└── search/                  # Search integration (DSE only)
The KillrVideo dataset includes:
- 150+ users with authentication and preferences
- 300-500 videos from real YouTube content (DataStax channel)
- 500-1,500 comments with tech-themed content
- 30-50 tags with relationships
- Counters for views, ratings, and tag usage
- Vector columns ready for embedding population (8-16 dimensions)
Total size: ~1.8MB compressed, perfect for learning and demos.
| Platform | Use Case | Setup Time | Best For |
|---|---|---|---|
| OSS Cassandra | On-premises, full control | 10 min | Production deployments, learning Cassandra internals |
| Astra Tables | Cloud, structured data | 5 min | CQL applications, migrating from OSS |
| Astra Collections | Cloud, flexible schema | 5 min | Rapid prototyping, evolving schemas |
Each platform has a dedicated loader with comprehensive documentation:
- OSS Cassandra 5.0+ Loader
- Astra Tables Loader
- Astra Collections Loader
Choose this if:
- Running on-premises or private cloud
- Need full control over infrastructure
- Want to learn Cassandra internals
- Have existing Cassandra clusters
- Require features not in Astra (UDFs, custom compaction, etc.)
Features:
- Vector types (vector<float, N>)
- Storage-Attached Indexes (SAI)
- Data masking for PII
- Full CQL feature set
Loading approach:
- DSBulk (recommended for bulk loading)
- Python SDK (more flexibility)
Vector embeddings:
- Pre-compute client-side or load NULL vectors
Get started: → OSS Cassandra Loader
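As a rough illustration of the two options under "Vector embeddings" (pre-compute client-side, or load NULL and populate later), the sketch below renders a value for a `vector<float, N>` column as a CQL literal. The helper name `vector_literal` and its dimension check are assumptions made for illustration and are not taken from the repository's loaders. The bracketed form mirrors the vector literal used in this README's ANN query example.

```python
from typing import Optional, Sequence


def vector_literal(values: Optional[Sequence[float]], dim: int) -> str:
    """Render a CQL literal for a vector<float, dim> column (hypothetical helper)."""
    if values is None:
        # No embedding yet: write NULL now and populate the column later.
        return "null"
    if len(values) != dim:
        # A vector column has a fixed dimension, so reject mismatches early.
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    return "[" + ", ".join(repr(float(v)) for v in values) + "]"


print(vector_literal([0.1, 0.2, 0.3], 3))
print(vector_literal(None, 3))
```

Checking the dimension before writing keeps a malformed embedding from failing deep inside a bulk load.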
Choose this if:
- Want cloud-native managed Cassandra
- Need CQL compatibility for existing apps
- Prefer structured, typed schema
- Want control over embedding generation
- Migrating from OSS Cassandra
Features:
- Fully managed, serverless
- CQL binary protocol + Table API
- Storage-Attached Indexes (SAI)
- Manual index management
- Client-side embedding generation
Loading approach:
- Python SDK with cassandra-driver
- Configure embedding provider (OpenAI, Hugging Face, etc.)
- Pre-compute vectors during loading
Vector embeddings:
- Generated client-side
- Supports dimension reduction (e.g., 1536D → 16D)
- Multiple provider options
Get started: → Astra Tables Loader
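This README does not say how the loader reduces dimensions, so the following is only a sketch of one simple option: a seeded Gaussian random projection from a provider-sized embedding (1536D) down to the 16D the schema's vector columns can hold. The function names `make_projection` and `reduce_vector` are hypothetical, and the loader itself may use a different technique.

```python
import math
import random
from typing import List


def make_projection(in_dim: int, out_dim: int, seed: int = 42) -> List[List[float]]:
    """Build a fixed out_dim x in_dim Gaussian projection matrix (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so every row is projected the same way
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)] for _ in range(out_dim)]


def reduce_vector(vec: List[float], projection: List[List[float]]) -> List[float]:
    """Project vec to the lower dimension and L2-normalize the result."""
    if any(len(row) != len(vec) for row in projection):
        raise ValueError("projection width must match the input vector length")
    reduced = [sum(w * x for w, x in zip(row, vec)) for row in projection]
    norm = math.sqrt(sum(v * v for v in reduced))
    return [v / norm for v in reduced] if norm else reduced


projection = make_projection(1536, 16)
embedding = [0.001 * i for i in range(1536)]  # stand-in for a provider embedding
print(len(reduce_vector(embedding, projection)))
```

The same projection matrix has to be reused for stored vectors and for query vectors; otherwise similarity scores between them are meaningless.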
Choose this if:
- Building new applications rapidly
- Schema is evolving or unknown
- Want automatic vector generation
- Prefer JSON/document model
- Don't need CQL compatibility
Features:
- Schema-less JSON documents
- Automatic indexing of all fields
- Automatic vectorization (Astra Vectorize)
- Data API (HTTP REST)
- Nested documents (no denormalization needed)
Loading approach:
- Python SDK with astrapy
- Transform relational data to JSON documents
- Let Astra generate embeddings automatically
Vector embeddings:
- Automatic server-side generation
- Configure provider once in Astra Portal
- No client-side embedding code needed
Get started: → Astra Collections Loader
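To make "transform relational data to JSON documents" concrete, here is a minimal sketch that turns one flat CSV row into a nested document. The column names, the pipe-delimited `tags` format, and the sample row are assumptions for illustration and may not match the files in data/csv/. `_id` and `$vectorize` are the Data API's reserved fields for the document identifier and for the text Astra should embed server-side.

```python
import csv
import io
from typing import Dict, List

# Hypothetical sample; the real CSV headers may differ.
SAMPLE_CSV = """videoid,name,description,tags,userid,user_name
v1,Intro to Cassandra,Basics of data modeling,cassandra|nosql,u1,Ada
"""


def row_to_document(row: Dict[str, str]) -> dict:
    """Map one flat CSV row to a nested JSON-style document."""
    return {
        "_id": row["videoid"],
        "name": row["name"],
        "tags": [t for t in row["tags"].split("|") if t],
        # Nest the uploader instead of keeping a separate denormalized table.
        "uploader": {"userid": row["userid"], "name": row["user_name"]},
        # Text that Astra Vectorize would embed server-side.
        "$vectorize": f'{row["name"]}. {row["description"]}',
    }


def load_documents(csv_text: str) -> List[dict]:
    """Parse CSV text and convert every row to a document."""
    return [row_to_document(r) for r in csv.DictReader(io.StringIO(csv_text))]


print(load_documents(SAMPLE_CSV)[0]["uploader"])
```

Documents shaped like this could then be passed to the collection's insert call in astrapy; that step is omitted here because it needs live credentials.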
| Feature | OSS Cassandra | Astra Tables | Astra Collections |
|---|---|---|---|
| Deployment | Self-managed | Managed (cloud) | Managed (cloud) |
| Schema | Fixed CQL types | Fixed CQL types | Flexible JSON |
| Query Language | CQL | CQL + Table API | Data API (JSON) |
| Indexing | Manual (SAI) | Manual (SAI) | Automatic |
| Vectorization | Manual | Client-side | Server-side |
| Denormalization | Required | Required | Not needed |
| Setup Complexity | High | Medium | Low |
| Learning Curve | Steep | Moderate | Gentle |
| Cost | Infrastructure | Pay-as-you-go | Pay-as-you-go |
| CQL Compatibility | Full | Full | None |
| Best For | Production, control | CQL apps, migration | Rapid dev, flexibility |
The repository includes multiple schema versions demonstrating Cassandra evolution:
- Basic schema with denormalized tables
- UDTs and collections
- Secondary indexes
- Use for: Legacy systems, learning basics
- Virtual tables
- Arithmetic operators
- Current time functions
- Improved UDF support
- Use for: Cassandra 4.0 clusters
- Vector types for AI/ML features
- Storage-Attached Indexes (SAI) replace many denormalized tables
- Data masking for PII protection
- Enhanced functions (currentTimestamp, etc.)
- Use for: Modern deployments, OSS Cassandra 5.0+
- Adapted from v5 for Astra compatibility
- SAI with Astra-specific syntax
- Data masking removed (not yet supported)
- Optimized for serverless
- Use for: Astra Tables approach
See CLAUDE.md for detailed schema architecture and design decisions.
All platforms:
- Python 3.8+
OSS Cassandra:
- Cassandra 5.0+ cluster
- DSBulk 1.11.0+ (for vector support)
Astra (both Tables and Collections):
- Astra DB database
- Astra token with Database Administrator role
- Embedding provider API key (OpenAI, Hugging Face, etc.)
cd loaders
pip install -r requirements.txt
This installs:
- cassandra-driver - for OSS Cassandra and Astra Tables
- astrapy - for Astra Collections
- pyyaml - for configuration
- python-dotenv - for environment variables
cd loaders/oss-cassandra
# Using DSBulk (fastest)
./load_with_dsbulk.sh
# Using Python (more control)
python load_with_python.py --host 127.0.0.1 --keyspace killrvideo
Time: ~1-2 minutes
cd loaders/astra-tables
# Configure embedding provider
cp config.example.yaml config.yaml
# Edit config.yaml with your credentials
# Test configuration
python setup_embeddings.py --config config.yaml
# Load data with embedding generation
python load_with_embeddings.py --config config.yaml
Time: ~5-10 minutes (embedding generation)
cd loaders/astra-collections
# Configure
cp config.example.yaml config.yaml
# Edit config.yaml
# Load data
python load_to_collections.py --config config.yaml
# Configure Vectorize in Astra Portal after loading
Time: ~3-5 minutes
After loading data, try the example queries:
# For v4
cqlsh -f examples/schema-v4-query-examples.cql
# For v5
cqlsh -f examples/schema-v5-query-examples.cql
Examples include:
- Basic CRUD operations
- SAI index queries
- Counter table operations
- Collection type usage
- Vector similarity search (once vectors are populated)
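For intuition about the vector similarity search item, an ANN query returns the rows whose stored vectors are most similar to a query vector. The sketch below computes the exact, brute-force version of that ranking in plain Python, assuming cosine similarity (one of the similarity functions SAI vector indexes support). The row data and the `top_k` helper are hypothetical, and the index in Cassandra approximates this result rather than scanning every row.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], rows: List[Tuple[str, Sequence[float]]], k: int) -> List[str]:
    """Return the ids of the k rows most similar to query (exact, not approximate)."""
    ranked = sorted(rows, key=lambda row: cosine_similarity(query, row[1]), reverse=True)
    return [row_id for row_id, _ in ranked[:k]]


# Hypothetical (videoid, content_features) pairs with 2-dimensional vectors.
videos = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
print(top_k([1.0, 0.0], videos, 2))
```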
-- Find similar videos (requires populated vectors)
SELECT videoid, name FROM videos
ORDER BY content_features ANN OF [0.1, 0.2, ..., 0.9]
LIMIT 10;
-- Query on any SAI-indexed column
SELECT * FROM videos WHERE tags CONTAINS 'cassandra';
SELECT * FROM users WHERE account_status = 'active';
-- PII automatically masked for non-admin users
SELECT email FROM users LIMIT 5;
-- Result: m****@example.com (masked)
-- Update video view counts
UPDATE video_playback_stats
SET views = views + 1, unique_viewers = unique_viewers + 1
WHERE videoid = ?;
- CLAUDE.md - Comprehensive schema architecture and design guide
- CQL Version Chart - Feature availability by version
- Migration Guides - Upgrading between versions
- OSS Cassandra Loader - Full OSS documentation
- Astra Tables Loader - Tables + embedding generation
- Astra Collections Loader - Collections + auto vectorization
- v4 Data Examples - Sample inserts for v4
- v5 Data Examples - Sample inserts for v5
- v4 Query Examples - Query patterns for v4
- v5 Query Examples - Query patterns for v5
- Learning Cassandra? → Start with OSS Cassandra to understand internals
- Building production app? → Use Astra (Tables or Collections based on use case)
- Rapid prototyping? → Use Astra Collections for fastest development
- Existing CQL app? → Use Astra Tables for easy migration
Yes! The CSV files in data/csv/ work with all loaders. Each loader transforms the data appropriately for its platform.
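Because every loader reads from the same data/csv/ directory, a script can locate it by walking up from its own location instead of relying on the current working directory. The sketch below shows one way to do that; `find_csv_dir` is a hypothetical helper and is not taken from the repository's loaders.

```python
from pathlib import Path


def find_csv_dir(start) -> Path:
    """Walk up from start until a data/csv directory is found (hypothetical helper)."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        csv_dir = candidate / "data" / "csv"
        if csv_dir.is_dir():
            return csv_dir
    raise FileNotFoundError(f"no data/csv directory found above {start}")


# Typical use inside a loader script:
# csv_dir = find_csv_dir(Path(__file__).parent)
```

With this approach a loader under loaders/oss-cassandra/ finds the shared CSVs regardless of the directory it is launched from.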
- OSS Cassandra: Pre-compute or load NULL, populate later
- Astra Tables: Generated client-side during loading
- Astra Collections: Generated automatically by Astra (server-side)
Absolutely! Load the same data into all three platforms and compare:
- Query performance
- Development experience
- Feature set
- Cost
This is a great way to understand trade-offs!
Solution: CSVs are in data/csv/. Make sure you're running from the correct directory.
Solution: Load the appropriate schema for your platform from data/schemas/
Solution: Check your embedding provider API key and rate limits
Solution: Verify Astra token has Database Administrator permissions
For detailed troubleshooting, see the README for your specific loader.
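If embedding requests fail because of provider rate limits, retrying with exponential backoff is a common mitigation. The sketch below is provider-agnostic: `RateLimitError` and `with_backoff` are hypothetical names, and a real loader would catch whatever exception its embedding client actually raises for a rate-limit response.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), doubling the wait after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ... by default
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.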
Beginner:
- Start with Astra Collections (easiest)
- Load data and run example queries
- Explore automatic indexing and vectorization
Intermediate:
- Try Astra Tables
- Compare with Collections approach
- Understand schema design trade-offs
Advanced:
- Set up OSS Cassandra
- Load data and configure replication
- Tune performance and explore advanced features
Contributions welcome! Areas of interest:
- Additional loaders (e.g., DSE, ScyllaDB)
- Pre-computed embedding datasets
- More comprehensive query examples
- Performance benchmarks
- Additional data generators
- KillrVideo Application - Reference application
- DataStax Documentation - Official docs
- Astra Portal - Create Astra databases
Questions? Open an issue or check the loader documentation.
Need help choosing? See the platform comparison table above.