High-performance streaming ingestion from Kafka to Apache Iceberg
K2I (Kafka to Iceberg) is a production-grade streaming ingestion engine written in Rust that resolves the latency-cost trade-off in data pipelines. It consumes events from Apache Kafka, buffers them in memory using Apache Arrow for sub-second query freshness, and writes them to Apache Iceberg tables in Parquet format with exactly-once semantics.
- Sub-second data freshness - In-memory hot buffer with Arrow columnar storage
- Exactly-once semantics - Write-ahead transaction log for crash recovery
- Multiple catalog backends - REST, Hive Metastore, AWS Glue, Nessie
- Smart backpressure - Automatic consumer pausing when buffer is full
- Automated maintenance - Compaction, snapshot expiration, orphan cleanup
- Production observability - Prometheus metrics, health endpoints, structured logging
- Single-process simplicity - No distributed coordination overhead
Download the latest binary from the GitHub Releases page, or use one of the methods below.

With Homebrew:

```bash
brew install osodevops/tap/k2i
```

With the shell installer:

```bash
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/osodevops/k2i/releases/latest/download/k2i-cli-installer.sh | sh
```

Or download the appropriate binary for your architecture from releases:

```bash
# Example for x86_64
curl -LO https://github.com/osodevops/k2i/releases/latest/download/k2i-cli-x86_64-unknown-linux-gnu.tar.xz
tar -xJf k2i-cli-x86_64-unknown-linux-gnu.tar.xz
sudo mv k2i /usr/local/bin/
```

With Docker:

```bash
docker pull ghcr.io/osodevops/k2i:latest
docker run --rm -v /path/to/config:/etc/k2i ghcr.io/osodevops/k2i:latest ingest --config /etc/k2i/config.toml
```

See the image on GitHub Container Registry.
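To reach the health and metrics endpoints from the host, publish the ports used in the sample configuration below (8080 and 9090 are the values from this README's example `config.toml`, not fixed defaults):

```bash
# Publish the health (8080) and metrics (9090) ports from the example
# config.toml so they are reachable from the host.
docker run --rm \
  -v /path/to/config:/etc/k2i \
  -p 8080:8080 -p 9090:9090 \
  ghcr.io/osodevops/k2i:latest ingest --config /etc/k2i/config.toml
```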
Or build from source:

```bash
git clone https://github.com/osodevops/k2i.git
cd k2i
cargo build --release
```

Binary location: `target/release/k2i`
Create a configuration file `config.toml`:

```toml
[kafka]
bootstrap_servers = "localhost:9092"
topic = "events"
group_id = "k2i-ingestion"
[buffer]
max_size_mb = 500
flush_interval_seconds = 30
[iceberg]
catalog_type = "rest"
rest_uri = "http://localhost:8181"
warehouse = "s3://my-bucket/warehouse"
database = "analytics"
table = "events"
[storage]
type = "s3"
bucket = "my-bucket"
region = "us-east-1"
[server]
health_port = 8080
metrics_port = 9090
```

Validate the configuration, then start ingestion:

```bash
k2i validate --config config.toml
k2i ingest --config config.toml
```

Check the running service:

```bash
# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:9090/metrics
```

Modern data architectures face a fundamental tension:
| Approach | Latency | Cost | Complexity |
|---|---|---|---|
| Real-time streaming (Kafka + KSQL) | Milliseconds | High | High |
| Micro-batch (Spark Streaming) | Seconds-Minutes | Medium | Medium |
| Batch ETL (Airflow + Spark) | Minutes-Hours | Low | Low |
Streaming data into Iceberg creates additional challenges:
- Small file problem - Each micro-batch creates new files, degrading query performance
- Exactly-once complexity - Coordinating Kafka, object storage, and catalog commits
- Operational burden - Manual compaction, snapshot expiration, orphan cleanup
K2I resolves these trade-offs through:
- Hot/Cold Architecture - In-memory Arrow buffer for immediate queries, Parquet files for cost-efficient analytics
- Write-Ahead Logging - Transaction log ensures exactly-once semantics and crash recovery (see the sketch after this list)
- Single-Process Design - No distributed coordination, deterministic behavior, simple operations
- Automated Maintenance - Background compaction, expiration, and cleanup
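The write-ahead log is what makes crash recovery safe: entries are appended with a checksum, and on restart replay stops at the first corrupt entry. Here is a minimal sketch of such a checksummed entry in Rust; the field layout and the `crc32fast` crate are illustrative assumptions, not K2I's actual on-disk format.

```rust
// Illustrative sketch only; K2I's real transaction log lives in
// crates/k2i-core/src/txlog and its format may differ.
use crc32fast::Hasher;

struct TxLogEntry {
    kafka_offset: i64, // highest offset covered by this entry (assumed field)
    payload: Vec<u8>,  // serialized operation, e.g. a flush manifest
    crc32: u32,        // checksum over offset + payload
}

impl TxLogEntry {
    fn new(kafka_offset: i64, payload: Vec<u8>) -> Self {
        let crc32 = Self::checksum(kafka_offset, &payload);
        Self { kafka_offset, payload, crc32 }
    }

    fn checksum(offset: i64, payload: &[u8]) -> u32 {
        let mut h = Hasher::new();
        h.update(&offset.to_le_bytes());
        h.update(payload);
        h.finalize()
    }

    /// During recovery, replay stops at the first entry that fails this
    /// check: everything before it is known-good; the torn tail is discarded.
    fn is_valid(&self) -> bool {
        Self::checksum(self.kafka_offset, &self.payload) == self.crc32
    }
}
```

How K2I compares with other ingestion options: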
| Feature | K2I | Spark Streaming | Flink | Kafka Connect |
|---|---|---|---|---|
| Single process | Yes | No | No | Per-connector |
| Sub-second latency | Yes | Minutes | Seconds | Seconds |
| Exactly-once | Yes | Yes | Yes | Depends |
| Auto compaction | Yes | No | No | No |
| Hot buffer queries | Yes | No | No | No |
| Memory footprint | Low | High | High | Medium |
| Operational complexity | Low | High | High | Medium |
K2I is not the right tool for every workload. Consider alternatives for:

- Complex transformations - Use Apache Flink for stream processing with joins, aggregations
- Multi-source ingestion - K2I is optimized for Kafka; use Flink for diverse sources
- CDC replication - For database change data capture with deletes, consider Moonlink
```
+----------------------------------------------------------------------+
|                         K2I Ingestion Engine                          |
+----------------------------------------------------------------------+
|                                                                      |
|  +------------------+   +------------------+   +------------------+  |
|  |   SmartKafka     |   |   Hot Buffer     |   |  Iceberg Writer  |  |
|  |   Consumer       |-->| (Arrow + Index)  |-->|   (Parquet)      |  |
|  |                  |   |                  |   |                  |  |
|  | - rdkafka        |   | - RecordBatch    |   | - Catalog        |  |
|  | - Backpressure   |   | - DashMap Index  |   | - Object Store   |  |
|  | - Retry Logic    |   | - TTL Eviction   |   | - Atomic Commit  |  |
|  +------------------+   +------------------+   +------------------+  |
|           |                      |                      |            |
|           v                      v                      v            |
|  +----------------------------------------------------------------+  |
|  |                        Transaction Log                         |  |
|  |  - Append-only entries with CRC32 checksums                    |  |
|  |  - Periodic checkpoints for fast recovery                      |  |
|  |  - Idempotency records for exactly-once semantics              |  |
|  +----------------------------------------------------------------+  |
|                                                                      |
+----------------------------------------------------------------------+
```
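The SmartKafka consumer's backpressure, shown above, pauses fetching when the hot buffer fills and resumes once it drains, rather than dropping records. Below is a minimal sketch of that pattern using rdkafka's `pause`/`resume`; the `HotBuffer` type, the byte accounting, and the half-capacity resume threshold are illustrative assumptions, not K2I's actual implementation.

```rust
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::error::KafkaResult;

/// Hypothetical stand-in for K2I's Arrow hot buffer.
struct HotBuffer {
    bytes_used: usize,
}

/// Pause consumption when the buffer is full; resume once it has drained
/// below half capacity. The hysteresis gap avoids pause/resume churn.
fn apply_backpressure(
    consumer: &BaseConsumer,
    buffer: &HotBuffer,
    max_bytes: usize,
    paused: &mut bool,
) -> KafkaResult<()> {
    let assignment = consumer.assignment()?;
    if !*paused && buffer.bytes_used >= max_bytes {
        // Stop fetching; records stay in Kafka, nothing is dropped.
        consumer.pause(&assignment)?;
        *paused = true;
    } else if *paused && buffer.bytes_used < max_bytes / 2 {
        consumer.resume(&assignment)?;
        *paused = false;
    }
    Ok(())
}
```

Further documentation: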
| Document | Description |
|---|---|
| Whitepaper | Comprehensive technical whitepaper |
| Quick Start | Get started in 5 minutes |
| Configuration | Complete configuration reference |
| Architecture | System design and internals |
| Commands | CLI command reference |
| Deployment | Production deployment guide |
| Troubleshooting | Common issues and solutions |
```bash
# Validate configuration
k2i validate --config config.toml

# Start ingestion
k2i ingest --config config.toml

# Check service status
k2i status --url http://localhost:8080

# Run manual compaction
k2i maintenance compact --config config.toml

# Expire old snapshots
k2i maintenance expire-snapshots --config config.toml

# Clean orphan files
k2i maintenance cleanup-orphans --config config.toml
```

Typical performance targets:

| Metric | Target | Notes |
|---|---|---|
| Query freshness (hot) | < 1ms | In-memory Arrow buffer |
| Query freshness (cold) | 30s | Configurable flush interval |
| Flush latency (P50) | 200ms | End-to-end flush cycle |
| Flush latency (P99) | 800ms | Including catalog commit |
| Throughput | 10-100K msg/s | Configuration dependent |
| Memory usage | 200MB - 2GB | Based on buffer size |
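Cold-path freshness tracks the flush interval, so the `[buffer]` settings are the main tuning knob: a larger buffer and longer interval yield fewer, larger Parquet files at the cost of memory and staleness. A hedged example using only the keys from the sample config above (the values are illustrative, not recommendations):

```toml
# Lower-latency profile: smaller buffer, flush roughly every 5 seconds.
[buffer]
max_size_mb = 256
flush_interval_seconds = 5
```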
K2I supports multiple Iceberg catalog backends:
| Catalog | Configuration | Features |
|---|---|---|
| REST | `catalog_type = "rest"` | OAuth2, Bearer token, custom headers |
| Hive Metastore | `catalog_type = "hive"` | Thrift protocol, schema sync |
| AWS Glue | `catalog_type = "glue"` | IAM roles, cross-account access |
| Nessie | `catalog_type = "nessie"` | Git-like branching, time travel |
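Switching backends means swapping the `[iceberg]` catalog settings. A sketch of a Glue-backed setup follows; `catalog_type = "glue"` and the `warehouse`/`database`/`table` keys come from this README, but any Glue-specific keys are backend-dependent assumptions, so check the configuration reference for exact field names:

```toml
[iceberg]
catalog_type = "glue"
warehouse = "s3://my-bucket/warehouse"
database = "analytics"
table = "events"
# Glue-specific settings (region, IAM role, etc.) vary by backend;
# see the configuration reference for the exact key names.
```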
Building from source requires:
- Rust 1.85+ (for edition 2024)
- CMake
- OpenSSL development libraries
- SASL development libraries
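On Debian or Ubuntu, for example, the non-Rust dependencies above map to these packages (names are assumptions that vary by distribution):

```bash
# Debian/Ubuntu package names; adjust for your distribution.
sudo apt-get install -y cmake libssl-dev libsasl2-dev
```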
```bash
# Clone the repository
git clone https://github.com/osodevops/k2i.git
cd k2i

# Build release binary
cargo build --release

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run -p k2i-cli -- --help
```

Running the test suite:

```bash
# Unit tests
cargo test --lib

# All tests
cargo test

# With coverage
cargo tarpaulin --out Html
```

Project layout:

```
k2i/
├── crates/
│   ├── k2i-core/              # Core library
│   │   ├── src/
│   │   │   ├── kafka/         # Smart Kafka consumer
│   │   │   ├── buffer/        # Hot buffer (Arrow + indexes)
│   │   │   ├── iceberg/       # Iceberg writer
│   │   │   ├── txlog/         # Transaction log
│   │   │   ├── catalog/       # Catalog backends
│   │   │   └── maintenance/   # Compaction, expiration
│   │   └── tests/
│   └── k2i-cli/               # CLI binary
├── config/                    # Example configurations
└── docs/                      # Documentation
```
OSO engineers are solely focused on deploying, operating, and maintaining Apache Kafka platforms. If you need SLA-backed support or advanced features for compliance and security, our Enterprise Edition extends the core tool with capabilities designed for large-scale, regulated environments.
| Feature Category | Enterprise Capability |
|---|---|
| Security & Compliance | AES-256 Encryption (client-side encryption at rest) |
| | GDPR Compliance Tools (PII masking, data retention policies) |
| | Audit Logging (comprehensive trail of all operations) |
| | Role-Based Access Control (granular permissions) |
| Advanced Integrations | Schema Registry Integration (Avro/Protobuf with ID mapping) |
| | Secrets Management (Vault / AWS Secrets Manager integration) |
| | SSO / OIDC (Okta, Azure AD, Google Auth) |
| Scale & Operations | Multi-Table Support (single process, multiple destinations) |
| | Log Shipping (Datadog, Splunk, Grafana Loki) |
| | Advanced Metrics & Dashboard (throughput, latency, drill-down UI) |
| Support | 24/7 SLA-Backed Support & dedicated Kafka consulting |
Need help resolving operational issues or planning a data lakehouse strategy? Our team of experts can help you design, deploy, and operate K2I at scale.
Talk with an expert today or email us at enquiries@oso.sh.
We welcome contributions of all kinds!
- Report Bugs: Found a bug? Open an issue on GitHub.
- Suggest Features: Have an idea? Request a feature.
- Contribute Code: Check out our good first issues for beginner-friendly tasks.
- Improve Docs: Help us improve the documentation by submitting pull requests.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
K2I is licensed under the Apache License 2.0.
K2I draws inspiration from the Moonlink architecture by Mooncake Labs.
Built with these excellent Rust crates:
- rdkafka - Kafka consumer (librdkafka bindings)
- arrow - In-memory columnar storage
- parquet - File format encoding
- iceberg - Table format operations
- tokio - Async runtime
- axum - HTTP servers
Made with care by OSO