Distributed Job Runner

A scalable, fault-tolerant distributed job scheduling system built with Rust, Apache Cassandra, and Apache Kafka. Designed to decouple job intake, scheduling, and execution using modular microservices, enabling high availability and observability across services.

Overview

The Distributed Job Runner is composed of three key services:

Job Service: Handles HTTP job submission requests and stores metadata in Apache Cassandra.
Scheduling Service: Periodically polls the database, delegates jobs to workers via RPC, and sends jobs to Kafka.
Scheduling Worker Pool: Listens for assignments, sends heartbeat signals, and publishes jobs to Kafka for downstream execution.

Built using Tokio for asynchronous concurrency and gRPC/RPC for inter-service communication. Easily deployable via Kubernetes, with support for scaling, failure detection, and load redistribution.

Components

Job Service (`job_service`)

Accepts HTTP job submission requests (job metadata + scheduling info).
Persists job and schedule data to Apache Cassandra.
Stateless, horizontally scalable via Kubernetes.

Scheduling Service (`scheduling_service`)

Runs periodic tasks (every minute) to check for due jobs in Cassandra.
Sends job data to active scheduling workers using RPC.
Maintains heartbeat monitoring and redistributes jobs if workers fail.

Scheduling Workers (`scheduling_service_worker`)

Receive jobs via RPC, store assigned segment IDs.
Publish job payloads to Apache Kafka for execution by downstream consumers.
Send regular heartbeats to the scheduler.

Cassandra Schema

-- Keyspace
CREATE KEYSPACE IF NOT EXISTS dist_task_runner
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Job metadata
CREATE TABLE dist_task_runner.job_table(
    user_id int,
    job_id uuid,
    is_recurring boolean,
    run_interval text,
    max_retry_count int,
    created_time bigint,
    PRIMARY KEY((user_id), job_id)
) WITH CLUSTERING ORDER BY (job_id ASC);

-- Job scheduling metadata
CREATE TABLE dist_task_runner.task_schedule(
    next_execution_time bigint,
    job_id uuid,
    segment int,
    PRIMARY KEY((next_execution_time, segment), job_id)
) WITH CLUSTERING ORDER BY (job_id ASC);

-- Execution history
CREATE TABLE dist_task_runner.task_history (
    job_id uuid,
    execution_time bigint,
    status text,
    retry_count int,
    last_update_time bigint,
    PRIMARY KEY ((job_id), execution_time)
) WITH CLUSTERING ORDER BY (execution_time ASC);

Technologies Used

Rust + Tokio – Async microservices
Apache Cassandra – Distributed NoSQL database
Apache Kafka – Distributed log/message queue
gRPC / RPC – Service communication
Kubernetes – Service orchestration and scaling

Future Work

Add Web UI for job monitoring and manual triggering
Add distributed tracing and metrics (OpenTelemetry)
Support CRON expressions or ISO 8601 repeat rules

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
job_service		job_service
scheduling-service-worker		scheduling-service-worker
scheduling_service		scheduling_service
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Distributed Job Runner

Overview

Components

Job Service (`job_service`)

Scheduling Service (`scheduling_service`)

Scheduling Workers (`scheduling_service_worker`)

Cassandra Schema

Technologies Used

Future Work

About

Uh oh!

Releases

Packages

Uh oh!

Languages

kpp16/distributed-job-runner

Folders and files

Latest commit

History

Repository files navigation

Distributed Job Runner

Overview

Components

Job Service (job_service)

Scheduling Service (scheduling_service)

Scheduling Workers (scheduling_service_worker)

Cassandra Schema

Technologies Used

Future Work

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Job Service (`job_service`)

Scheduling Service (`scheduling_service`)

Scheduling Workers (`scheduling_service_worker`)

Packages