TaskVault is an open-source, content-aware caching layer for engineering teams running expensive, repeatable tasks. It fingerprints inputs, recognizes identical work, and serves cached results instantly—turning wasted compute into saved time, money, and infrastructure capacity.
Think of it as a secure vault for deterministic work: if the input matches, the result is already waiting.
Your team wastes compute resources:
- ML engineers retrain models with the same datasets → GPU waste (100s €/month)
- Build systems rebuild unchanged code → CI/CD sprawl
- Data engineers re-transform identical datasets → ETL overhead
- DevOps reruns tests with same parameters → Test suite bloat
Solution: TaskVault learns which computations are identical by analyzing input content, not just surface parameters.
- Content-Aware Hashing: Uses Blake3 for cryptographically-secure input fingerprinting. Same input = same hash, regardless of parameter names.
- Format Agnostic: Cache JSON, binary, CSV, model checkpoints, or raw file outputs—TaskVault handles everything.
- Distributed-Ready: Single-node SQLite for dev/edge, PostgreSQL for teams, Kubernetes-native with gRPC sync.
- Eviction Policies: Configurable TTL (time-to-live) and LRU (least-recently-used) cleanup.
- Zero-Downtime Integration: CLI wrapper, environment hooks, or programmatic SDK—no code changes required.
- Full Audit Trail: Every hit/miss/error logged with timestamps and task metadata.
- Production-Ready: Proper error handling, concurrent access, corruption detection.
# Clone repository
git clone https://github.com/taskvault/taskvault.git
cd taskvault
# Build from source (requires Go 1.21+)
go build -o taskvault ./cmd/taskvault./taskvault initCreates .taskvault/config.yaml:
cache_dir: .taskvault/cache
max_size_gb: 10
hash_algorithm: blake3
log_level: info
policies:
default:
ttl_seconds: 604800 # 7 days
max_size_bytes: 104857600 # 100 MB
strategy: lru# First time: do the work
python train_model.py --dataset data.csv > model.pkl
# Cache the result
./taskvault cache save train_model data.csv model.pkl
# Output: ✓ Cached train_model (hash: a3f2b1c8..., size: 5242880 bytes)# Same input, same dataset
./taskvault cache get train_model data.csv model_restored.pkl
# Output: ✓ Cache hit for train_model (size: 5242880 bytes)
# Task skipped! Result restored in milliseconds../taskvault cache statsOutput:
TaskVault Cache Statistics
==========================
Entries: 1247
Total Size: 7.43 GB
Cache Limit: 10.00 GB
Usage: 74.3%
#!/bin/bash
# cicd-test.sh - Cache test results
TASKVAULT=./taskvault
CACHE_KEY="$1"
TEST_INPUT="$2"
TEST_OUTPUT="$3"
# Check if result is cached
if $TASKVAULT cache get "$CACHE_KEY" "$TEST_INPUT" "$TEST_OUTPUT" 2>/dev/null; then
echo "✓ Tests passed (from cache)"
exit 0
fi
# Cache miss: run tests
if npm test > "$TEST_OUTPUT" 2>&1; then
# Save result
$TASKVAULT cache save "$CACHE_KEY" "$TEST_INPUT" "$TEST_OUTPUT"
exit 0
else
exit 1
fiUsage in CI:
# .github/workflows/test.yml
- name: Run tests with cache
run: ./cicd-test.sh "unit_tests" package.json test-results.txtResult: 70% reduction in CI execution time when tests haven't changed.
# data_pipeline.py
from taskvault.sdk import Client
client = Client(".taskvault/config.yaml")
def process_dataset(csv_file: str) -> bytes:
"""Load CSV, normalize, aggregate."""
with open(csv_file, 'rb') as f:
input_data = f.read()
# Check cache
cached, hit = client.get_cached_result("aggregate", input_data)
if hit:
print("✓ Using cached aggregation result")
return cached
# Cache miss: do real work
df = pd.read_csv(csv_file)
result = df.groupby('category').agg({'value': 'sum'}).to_json()
# Save for future runs
client.cache_result("aggregate", input_data, result.encode())
return result.encode()
# Even if same CSV is passed 100x, computation happens once
for i in range(100):
output = process_dataset("sales_data.csv") # Hits cache 99 timesResult: 99 jobs → 1 computation, 150x speedup on retruns.
# train.py
from taskvault.sdk import Client
import torch, pickle
client = Client()
def train_model(dataset_path: str, hyperparams: dict):
# Serialize hyperparams to bytes for hashing
config = json.dumps(hyperparams, sort_keys=True).encode()
with open(dataset_path, 'rb') as f:
dataset = f.read()
task_input = config + dataset # Combine for full determinism
# Try cache
cached, hit = client.get_cached_result("ml_training", task_input)
if hit:
model = pickle.loads(cached)
print(f"✓ Loaded model from cache (saved 2 hours of GPU time)")
return model
# Train for real
model = train_transformer(dataset, hyperparams)
model_bytes = pickle.dumps(model)
# Cache for future identical runs
client.cache_result("ml_training", task_input, model_bytes)
return model
# Run with same data/params: instant result
model = train_model("dataset.pkl", {"lr": 1e-4, "epochs": 100})Result: Saves $500-2000 in GPU costs per model variant.
# .taskvault/config.yaml
# Directory for cache storage (default: .taskvault/cache)
cache_dir: ~/.taskvault/cache
# Maximum cache size in GB (auto-evicts LRU when exceeded)
max_size_gb: 50
# Hashing algorithm: blake3 (fast) or sha256 (compatible)
hash_algorithm: blake3
# Logging detail: debug, info, warn, error
log_level: info
# Service port for future REST API
service_port: 9999
# Per-task caching policies
policies:
default:
ttl_seconds: 604800 # Cache for 7 days
max_size_bytes: 104857600 # Max 100 MB per entry
strategy: lru # Evict least-recently-used
ml_training:
ttl_seconds: 2592000 # Keep ML models 30 days
max_size_bytes: 5368709120 # Allow 5 GB per model
strategy: lru
unit_tests:
ttl_seconds: 86400 # Tests fresh after 24h
max_size_bytes: 52428800 # Limit to 50 MB
strategy: lrupackage main
import "github.com/taskvault/taskvault/pkg/sdk"
func main() {
client, err := sdk.NewClient(".taskvault/config.yaml")
defer client.Close()
// Save result
input := []byte("training_data.csv contents")
output := []byte("trained_model.pkl contents")
cacheKey, _ := client.CacheResult("train_model", input, output)
// Get result
cached, hit, _ := client.GetCachedResult("train_model", input)
if hit {
fmt.Println("Cache hit:", len(cached), "bytes")
}
// Stats
stats, _ := client.GetStats()
fmt.Println(stats)
}from taskvault.sdk import Client
client = Client()
cached, hit = client.get_cached_result("task_name", input_bytes)
if hit:
use_result(cached)┌─────────────────────────────────────────────────────┐
│ CLI / SDK Wrappers (Go, Python, Node) │
├─────────────────────────────────────────────────────┤
│ Cache Manager (Scheduling, TTL, Eviction) │
├─────────────────────────────────────────────────────┤
│ Content Hash Engine (Blake3, SHA256) │
├─────────────────────────────────────────────────────┤
│ Metadata Store (SQLite / PostgreSQL) │
├─────────────────────────────────────────────────────┤
│ Blob Storage (Local FS / S3 / GCS) │
└─────────────────────────────────────────────────────┘
.taskvault/
├── config.yaml # Configuration
├── cache/
│ ├── cache.db # SQLite metadata DB
│ ├── audit.log # Audit trail (hits/misses)
│ └── blobs/ # Content storage
│ ├── a3f2b1c8d5e... # Content hash → blob
│ └── ...
└── metrics/
└── prometheus.txt # Future: Prometheus metrics
If a blob is missing but metadata exists, TaskVault:
- Detects mismatch on
Get() - Removes corrupted metadata
- Returns cache miss (not error)
All operations are thread-safe via sync.RWMutex:
- Multiple readers in parallel
- Exclusive writers with lock
- Audit logging safe under contention
When cache exceeds max_size_gb:
- Scans for expired entries (TTL)
- Removes LRU entries until usage < 50% of limit
- Prevents thrashing with conservative eviction
Audit log format:
[2025-02-02T14:30:45Z] HIT get task=model_training hash=a3f2b1...
[2025-02-02T14:30:46Z] MISS get task=data_pipeline hash=c8d9f0...
[2025-02-02T14:30:47Z] ERROR hash_error task=invalid_input error=EOF
Parse with:
grep "HIT" .taskvault/cache/audit.log | wc -l # Total hits
grep "MISS" .taskvault/cache/audit.log | wc -l # Total misses- SaaS Dashboard (Q2 2025): Web UI for multi-team management, cost analytics
- Distributed Sync (Q3 2025): gRPC-based cache sharing across CI nodes
- ML Model Registry (Q4 2025): Integration with Hugging Face, MLflow
- Cloud Backends (Q1 2026): S3, GCS, Azure Blob native support
- Smart Compression (Q2 2026): Predictive precomputation based on workflow patterns
- Enterprise (Q3 2026): SAML auth, audit compliance, SLA guarantees
TaskVault is MIT-licensed open-source. We welcome:
- Bug reports via GitHub Issues
- Feature PRs with tests
- Documentation improvements
MIT License - see LICENSE file for details.
Start caching now. Stop wasting compute.