High-performance streaming ingestion from Kafka to Apache Iceberg
K2I (Kafka to Iceberg) is a production-grade streaming ingestion engine written in Rust that resolves the latency-cost trade-off in data pipelines. It consumes events from Apache Kafka, buffers them in memory using Apache Arrow for sub-second query freshness, and writes them to Apache Iceberg tables in Parquet format with exactly-once semantics.
- Sub-second data freshness - In-memory hot buffer with Arrow columnar storage
- Exactly-once semantics - Write-ahead transaction log for crash recovery
- Multiple catalog backends - REST, Hive Metastore, AWS Glue, Nessie
- Smart backpressure - Automatic consumer pausing when buffer is full
- Automated maintenance - Compaction, snapshot expiration, orphan cleanup
- Production observability - Prometheus metrics, health endpoints, structured logging
- Single-process simplicity - No distributed coordination overhead
Download the latest binary from the GitHub Releases page, or use one of the methods below.

With Homebrew:

```bash
brew install osodevops/tap/k2i
```

With the shell installer:

```bash
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/osodevops/k2i/releases/latest/download/k2i-cli-installer.sh | sh
```

Or download the appropriate binary for your architecture from releases:

```bash
# Example for x86_64
curl -LO https://github.com/osodevops/k2i/releases/latest/download/k2i-cli-x86_64-unknown-linux-gnu.tar.xz
tar -xJf k2i-cli-x86_64-unknown-linux-gnu.tar.xz
sudo mv k2i /usr/local/bin/
```

With Docker:

```bash
docker pull ghcr.io/osodevops/k2i:latest
docker run --rm -v /path/to/config:/etc/k2i ghcr.io/osodevops/k2i:latest ingest --config /etc/k2i/config.toml
```

See the image on GitHub Container Registry.
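To reach the health and metrics endpoints from the host, publish the ports used in the sample configuration below (8080 and 9090 are the values from this README's example `config.toml`, not fixed defaults):

```bash
# Publish the health (8080) and metrics (9090) ports from the example
# config.toml so they are reachable from the host.
docker run --rm \
  -v /path/to/config:/etc/k2i \
  -p 8080:8080 -p 9090:9090 \
  ghcr.io/osodevops/k2i:latest ingest --config /etc/k2i/config.toml
```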
Or build from source:

```bash
git clone https://github.com/osodevops/k2i.git
cd k2i
cargo build --release
```

Binary location: `target/release/k2i`
Create a configuration file `config.toml`:

```toml
[kafka]
bootstrap_servers = "localhost:9092"
topic = "events"
group_id = "k2i-ingestion"
[buffer]
max_size_mb = 500
flush_interval_seconds = 30
[iceberg]
catalog_type = "rest"
rest_uri = "http://localhost:8181"
warehouse = "s3://my-bucket/warehouse"
database = "analytics"
table = "events"
[storage]
type = "s3"
bucket = "my-bucket"
region = "us-east-1"
[server]
health_port = 8080
metrics_port = 9090
```

Validate the configuration, then start ingestion:

```bash
k2i validate --config config.toml
k2i ingest --config config.toml
```

Check the running service:

```bash
# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:9090/metrics
```

Modern data architectures face a fundamental tension:
| Approach | Latency | Cost | Complexity |
|---|---|---|---|
| Real-time streaming (Kafka + KSQL) | Milliseconds | High | High |
| Micro-batch (Spark Streaming) | Seconds-Minutes | Medium | Medium |
| Batch ETL (Airflow + Spark) | Minutes-Hours | Low | Low |
Streaming data into Iceberg creates additional challenges:
- Small file problem - Each micro-batch creates new files, degrading query performance
- Exactly-once complexity - Coordinating Kafka, object storage, and catalog commits
- Operational burden - Manual compaction, snapshot expiration, orphan cleanup
K2I resolves these trade-offs through:
- Hot/Cold Architecture - In-memory Arrow buffer for immediate queries, Parquet files for cost-efficient analytics
- Write-Ahead Logging - Transaction log ensures exactly-once semantics and crash recovery (see the sketch after this list)
- Single-Process Design - No distributed coordination, deterministic behavior, simple operations
- Automated Maintenance - Background compaction, expiration, and cleanup
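The write-ahead log is what makes crash recovery safe: entries are appended with a checksum, and on restart replay stops at the first corrupt entry. Here is a minimal sketch of such a checksummed entry in Rust; the field layout and the `crc32fast` crate are illustrative assumptions, not K2I's actual on-disk format.

```rust
// Illustrative sketch only; K2I's real transaction log lives in
// crates/k2i-core/src/txlog and its format may differ.
use crc32fast::Hasher;

struct TxLogEntry {
    kafka_offset: i64, // highest offset covered by this entry (assumed field)
    payload: Vec<u8>,  // serialized operation, e.g. a flush manifest
    crc32: u32,        // checksum over offset + payload
}

impl TxLogEntry {
    fn new(kafka_offset: i64, payload: Vec<u8>) -> Self {
        let crc32 = Self::checksum(kafka_offset, &payload);
        Self { kafka_offset, payload, crc32 }
    }

    fn checksum(offset: i64, payload: &[u8]) -> u32 {
        let mut h = Hasher::new();
        h.update(&offset.to_le_bytes());
        h.update(payload);
        h.finalize()
    }

    /// During recovery, replay stops at the first entry that fails this
    /// check: everything before it is known-good; the torn tail is discarded.
    fn is_valid(&self) -> bool {
        Self::checksum(self.kafka_offset, &self.payload) == self.crc32
    }
}
```

How K2I compares with other ingestion options: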
| Feature | K2I | Spark Streaming | Flink | Kafka Connect |
|---|---|---|---|---|
| Single process | Yes | No | No | Per-connector |
| Sub-second latency | Yes | Minutes | Seconds | Seconds |
| Exactly-once | Yes | Yes | Yes | Depends |
| Auto compaction | Yes | No | No | No |
| Hot buffer queries | Yes | No | No | No |
| Memory footprint | Low | High | High | Medium |
| Operational complexity | Low | High | High | Medium |
K2I is not the right tool for every workload. Consider alternatives for:

- Complex transformations - Use Apache Flink for stream processing with joins, aggregations
- Multi-source ingestion - K2I is optimized for Kafka; use Flink for diverse sources
- CDC replication - For database change data capture with deletes, consider Moonlink
```
+----------------------------------------------------------------------+
|                         K2I Ingestion Engine                          |
+----------------------------------------------------------------------+
|                                                                      |
|  +------------------+   +------------------+   +------------------+  |
|  |   SmartKafka     |   |   Hot Buffer     |   |  Iceberg Writer  |  |
|  |   Consumer       |-->| (Arrow + Index)  |-->|   (Parquet)      |  |
|  |                  |   |                  |   |                  |  |
|  | - rdkafka        |   | - RecordBatch    |   | - Catalog        |  |
|  | - Backpressure   |   | - DashMap Index  |   | - Object Store   |  |
|  | - Retry Logic    |   | - TTL Eviction   |   | - Atomic Commit  |  |
|  +------------------+   +------------------+   +------------------+  |
|           |                      |                      |            |
|           v                      v                      v            |
|  +----------------------------------------------------------------+  |
|  |                        Transaction Log                         |  |
|  |  - Append-only entries with CRC32 checksums                    |  |
|  |  - Periodic checkpoints for fast recovery                      |  |
|  |  - Idempotency records for exactly-once semantics              |  |
|  +----------------------------------------------------------------+  |
|                                                                      |
+----------------------------------------------------------------------+
```
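The SmartKafka consumer's backpressure, shown above, pauses fetching when the hot buffer fills and resumes once it drains, rather than dropping records. Below is a minimal sketch of that pattern using rdkafka's `pause`/`resume`; the `HotBuffer` type, the byte accounting, and the half-capacity resume threshold are illustrative assumptions, not K2I's actual implementation.

```rust
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::error::KafkaResult;

/// Hypothetical stand-in for K2I's Arrow hot buffer.
struct HotBuffer {
    bytes_used: usize,
}

/// Pause consumption when the buffer is full; resume once it has drained
/// below half capacity. The hysteresis gap avoids pause/resume churn.
fn apply_backpressure(
    consumer: &BaseConsumer,
    buffer: &HotBuffer,
    max_bytes: usize,
    paused: &mut bool,
) -> KafkaResult<()> {
    let assignment = consumer.assignment()?;
    if !*paused && buffer.bytes_used >= max_bytes {
        // Stop fetching; records stay in Kafka, nothing is dropped.
        consumer.pause(&assignment)?;
        *paused = true;
    } else if *paused && buffer.bytes_used < max_bytes / 2 {
        consumer.resume(&assignment)?;
        *paused = false;
    }
    Ok(())
}
```

Further documentation: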
| Document | Description |
|---|---|
| Whitepaper | Comprehensive technical whitepaper |
| Quick Start | Get started in 5 minutes |
| Configuration | Complete configuration reference |
| Architecture | System design and internals |
| Commands | CLI command reference |
| Deployment | Production deployment guide |
| Troubleshooting | Common issues and solutions |
```bash
# Validate configuration
k2i validate --config config.toml

# Start ingestion
k2i ingest --config config.toml

# Check service status
k2i status --url http://localhost:8080

# Run manual compaction
k2i maintenance compact --config config.toml

# Expire old snapshots
k2i maintenance expire-snapshots --config config.toml

# Clean orphan files
k2i maintenance cleanup-orphans --config config.toml
```

Typical performance targets:

| Metric | Target | Notes |
|---|---|---|
| Query freshness (hot) | < 1ms | In-memory Arrow buffer |
| Query freshness (cold) | 30s | Configurable flush interval |
| Flush latency (P50) | 200ms | End-to-end flush cycle |
| Flush latency (P99) | 800ms | Including catalog commit |
| Throughput | 10-100K msg/s | Configuration dependent |
| Memory usage | 200MB - 2GB | Based on buffer size |
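Cold-path freshness tracks the flush interval, so the `[buffer]` settings are the main tuning knob: a larger buffer and longer interval yield fewer, larger Parquet files at the cost of memory and staleness. A hedged example using only the keys from the sample config above (the values are illustrative, not recommendations):

```toml
# Lower-latency profile: smaller buffer, flush roughly every 5 seconds.
[buffer]
max_size_mb = 256
flush_interval_seconds = 5
```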
K2I supports multiple Iceberg catalog backends:
| Catalog | Configuration | Features |
|---|---|---|
| REST | `catalog_type = "rest"` | OAuth2, Bearer token, custom headers |
| Hive Metastore | `catalog_type = "hive"` | Thrift protocol, schema sync |
| AWS Glue | `catalog_type = "glue"` | IAM roles, cross-account access |
| Nessie | `catalog_type = "nessie"` | Git-like branching, time travel |
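Switching backends means swapping the `[iceberg]` catalog settings. A sketch of a Glue-backed setup follows; `catalog_type = "glue"` and the `warehouse`/`database`/`table` keys come from this README, but any Glue-specific keys are backend-dependent assumptions, so check the configuration reference for exact field names:

```toml
[iceberg]
catalog_type = "glue"
warehouse = "s3://my-bucket/warehouse"
database = "analytics"
table = "events"
# Glue-specific settings (region, IAM role, etc.) vary by backend;
# see the configuration reference for the exact key names.
```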
Building from source requires:
- Rust 1.85+ (for edition 2024)
- CMake
- OpenSSL development libraries
- SASL development libraries
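On Debian or Ubuntu, for example, the non-Rust dependencies above map to these packages (names are assumptions that vary by distribution):

```bash
# Debian/Ubuntu package names; adjust for your distribution.
sudo apt-get install -y cmake libssl-dev libsasl2-dev
```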
```bash
# Clone the repository
git clone https://github.com/osodevops/k2i.git
cd k2i

# Build release binary
cargo build --release

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run -p k2i-cli -- --help
```

Running the test suite:

```bash
# Unit tests
cargo test --lib

# All tests
cargo test

# With coverage
cargo tarpaulin --out Html
```

Project layout:

```
k2i/
├── crates/
│   ├── k2i-core/              # Core library
│   │   ├── src/
│   │   │   ├── kafka/         # Smart Kafka consumer
│   │   │   ├── buffer/        # Hot buffer (Arrow + indexes)
│   │   │   ├── iceberg/       # Iceberg writer
│   │   │   ├── txlog/         # Transaction log
│   │   │   ├── catalog/       # Catalog backends
│   │   │   └── maintenance/   # Compaction, expiration
│   │   └── tests/
│   └── k2i-cli/               # CLI binary
├── config/                    # Example configurations
└── docs/                      # Documentation
```
OSO engineers are solely focused on deploying, operating, and maintaining Apache Kafka platforms. If you need SLA-backed support or advanced features for compliance and security, our Enterprise Edition extends the core tool with capabilities designed for large-scale, regulated environments.
| Feature Category | Enterprise Capability |
|---|---|
| Security & Compliance | AES-256 Encryption (client-side encryption at rest) |
| | GDPR Compliance Tools (PII masking, data retention policies) |
| | Audit Logging (comprehensive trail of all operations) |
| | Role-Based Access Control (granular permissions) |
| Advanced Integrations | Schema Registry Integration (Avro/Protobuf with ID mapping) |
| | Secrets Management (Vault / AWS Secrets Manager integration) |
| | SSO / OIDC (Okta, Azure AD, Google Auth) |
| Scale & Operations | Multi-Table Support (single process, multiple destinations) |
| | Log Shipping (Datadog, Splunk, Grafana Loki) |
| | Advanced Metrics & Dashboard (throughput, latency, drill-down UI) |
| Support | 24/7 SLA-Backed Support & dedicated Kafka consulting |
Need help resolving operational issues or planning a data lakehouse strategy? Our team of experts can help you design, deploy, and operate K2I at scale.
Talk with an expert today or email us at enquiries@oso.sh.
We welcome contributions of all kinds!
- Report Bugs: Found a bug? Open an issue on GitHub.
- Suggest Features: Have an idea? Request a feature.
- Contribute Code: Check out our good first issues for beginner-friendly tasks.
- Improve Docs: Help us improve the documentation by submitting pull requests.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
K2I is licensed under the Apache License 2.0.
K2I draws inspiration from the Moonlink architecture by Mooncake Labs.
Built with these excellent Rust crates:
- rdkafka - Kafka consumer (librdkafka bindings)
- arrow - In-memory columnar storage
- parquet - File format encoding
- iceberg - Table format operations
- tokio - Async runtime
- axum - HTTP servers
Made with care by OSO