A scalable, fault-tolerant distributed job scheduling system built with Rust, Apache Cassandra, and Apache Kafka. Designed to decouple job intake, scheduling, and execution using modular microservices, enabling high availability and observability across services.
The Distributed Job Runner is composed of three key services:
- Job Service: Handles HTTP job submission requests and stores metadata in Apache Cassandra.
- Scheduling Service: Periodically polls the database, delegates jobs to workers via RPC, and sends jobs to Kafka.
- Scheduling Worker Pool: Listens for assignments, sends heartbeat signals, and publishes jobs to Kafka for downstream execution.
Built using Tokio for asynchronous concurrency and gRPC/RPC for inter-service communication. Easily deployable via Kubernetes, with support for scaling, failure detection, and load redistribution.
- Accepts HTTP job submission requests (job metadata + scheduling info).
- Persists job and schedule data to Apache Cassandra.
- Stateless, horizontally scalable via Kubernetes.
- Runs periodic tasks (every minute) to check for due jobs in Cassandra.
- Sends job data to active scheduling workers using RPC.
- Maintains heartbeat monitoring and redistributes jobs if workers fail.
- Receive jobs via RPC, store assigned segment IDs.
- Publish job payloads to Apache Kafka for execution by downstream consumers.
- Send regular heartbeats to the scheduler.
-- Keyspace
CREATE KEYSPACE IF NOT EXISTS dist_task_runner
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
-- Job metadata
CREATE TABLE dist_task_runner.job_table(
user_id int,
job_id uuid,
is_recurring boolean,
run_interval text,
max_retry_count int,
created_time bigint,
PRIMARY KEY((user_id), job_id)
) WITH CLUSTERING ORDER BY (job_id ASC);
-- Job scheduling metadata
CREATE TABLE dist_task_runner.task_schedule(
next_execution_time bigint,
job_id uuid,
segment int,
PRIMARY KEY((next_execution_time, segment), job_id)
) WITH CLUSTERING ORDER BY (job_id ASC);
-- Execution history
CREATE TABLE dist_task_runner.task_history (
job_id uuid,
execution_time bigint,
status text,
retry_count int,
last_update_time bigint,
PRIMARY KEY ((job_id), execution_time)
) WITH CLUSTERING ORDER BY (execution_time ASC);- Rust + Tokio – Async microservices
- Apache Cassandra – Distributed NoSQL database
- Apache Kafka – Distributed log/message queue
- gRPC / RPC – Service communication
- Kubernetes – Service orchestration and scaling
- Add Web UI for job monitoring and manual triggering
- Add distributed tracing and metrics (OpenTelemetry)
- Support CRON expressions or ISO 8601 repeat rules