Conversation

@arreyder (Contributor) commented Jan 9, 2026

Summary

Two defensive fixes to prevent corrupt c1z files from being saved/uploaded:

  1. fsync before compression: Add an explicit fsync on the sqlite database file after Close() but before saveC1z() reads it
  2. Size validation in SaveC1Z: Detect truncated copies by comparing the bytes copied to the source file size

Problem

Production incidents showed c1z corruption where:

  • Sync succeeded
  • SaveC1Z logged success with byte count
  • But the uploaded file had a truncated zstd stream (missing end-of-stream marker)
  • Error: "corrupt stream, did not find end of stream"

The zstd end-of-stream marker is written by zstd.Close(). For it to be missing, truncation must occur during or after compression - either in saveC1z() itself or during the copy in SaveC1Z().
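
To make the failure mode concrete, here is a minimal, self-contained sketch of how a truncated zstd stream fails to decode. It assumes github.com/klauspost/compress/zstd purely for illustration; the exact library and error text used by the c1z reader may differ.

package main

import (
    "bytes"
    "fmt"
    "io"

    "github.com/klauspost/compress/zstd"
)

func main() {
    // Compress a payload; Encoder.Close() flushes the final block and
    // writes the frame's end-of-stream marker.
    var buf bytes.Buffer
    enc, err := zstd.NewWriter(&buf)
    if err != nil {
        panic(err)
    }
    if _, err := enc.Write([]byte("example payload")); err != nil {
        panic(err)
    }
    if err := enc.Close(); err != nil {
        panic(err)
    }

    // Simulate a short copy by dropping the last few bytes of the stream.
    truncated := buf.Bytes()[:buf.Len()-4]

    dec, err := zstd.NewReader(bytes.NewReader(truncated))
    if err != nil {
        panic(err)
    }
    defer dec.Close()

    // Decoding fails because the frame never terminates cleanly.
    _, err = io.ReadAll(dec)
    fmt.Println("decode error:", err)
}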

Root Cause Analysis

Layer | What could go wrong | Fix
sqlite → saveC1z | Stale data in kernel buffers on ZFS/aggressive caching | fsync before read
saveC1z → SaveC1Z | Truncated io.Copy | Size validation

Why fsync helps (defense in depth)

On filesystems with aggressive caching (like ZFS with ARC), data written by sqlite during Close() may still be in kernel buffers. Without explicit fsync, saveC1z() could theoretically read stale/incomplete data.

Why size validation catches the actual symptom

The observed corruption was a truncated zstd stream. If io.Copy in SaveC1Z() returns fewer bytes than the source file size, we now fail immediately rather than saving a corrupt file.

Changes

pkg/dotc1z/c1file.go

// After rawDb.Close(), before saveC1z():
dbFile, err := os.OpenFile(c.dbFilePath, os.O_RDONLY, 0)
if err != nil { ... }
if err := dbFile.Sync(); err != nil { ... }
dbFile.Close()

pkg/dotc1z/manager/local/local.go

// Before io.Copy:
srcStat, err := tmpFile.Stat()
if err != nil { ... }
expectedSize := srcStat.Size()

// After io.Copy:
if size != expectedSize {
    return fmt.Errorf("copy truncated: copied %d bytes, expected %d bytes", size, expectedSize)
}
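
For illustration only, the same check as a self-contained helper. copyFull is a hypothetical name rather than the function the PR touches, and it assumes the source file is positioned at offset 0 when the copy starts.

package example

import (
    "fmt"
    "io"
    "os"
)

// copyFull copies src into dst and fails fast when the number of bytes
// written does not match the source file's size.
func copyFull(dst io.Writer, src *os.File) (int64, error) {
    st, err := src.Stat()
    if err != nil {
        return 0, fmt.Errorf("failed to stat source file: %w", err)
    }
    expectedSize := st.Size()

    n, err := io.Copy(dst, src)
    if err != nil {
        return n, fmt.Errorf("copy failed: %w", err)
    }
    if n != expectedSize {
        return n, fmt.Errorf("copy truncated: copied %d bytes, expected %d bytes", n, expectedSize)
    }
    return n, nil
}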

Cost

Fix | CPU | I/O | Latency
fsync | ~0 | 1 disk sync | ~1-10ms
Size validation | ~0 | 1 stat() | <1ms

Total: <15ms - negligible for sync operations that take minutes.

Test plan

  • go build ./pkg/dotc1z/... passes
  • go test ./pkg/dotc1z/... passes (113s, all tests pass)
  • Deploy to staging and monitor for c1z corruption incidents
  • Verify size validation error is logged if truncation occurs

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Ensured on-disk buffers are flushed before creating compressed outputs to reduce risk of data loss.
    • Added strict file-size verification after copy operations to detect and fail on truncated transfers.
    • Improved error reporting and handling around file copy and sync steps to surface failures earlier.

Two defensive fixes to prevent corrupt c1z files from being saved:

1. Add explicit fsync on sqlite database file after Close() but before
   saveC1z() reads it for compression. On filesystems with aggressive
   caching (like ZFS with ARC), data may still be in kernel buffers.

2. Add size validation in SaveC1Z to detect truncated copies. Compares
   bytes copied to source file size and fails if they don't match. This
   catches the actual symptom we observed: truncated zstd streams missing
   the end-of-stream marker.

Root cause investigation: Production incidents showed c1z corruption where
sync succeeded and SaveC1Z logged success, but the uploaded file had a
truncated zstd stream (missing end-of-stream marker). The size validation
would catch this scenario and fail fast rather than uploading corrupt data.

Cost: ~5ms for fsync + <1ms for stat - negligible for sync operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Jan 9, 2026

Walkthrough

Adds a pre-compression fsync step that opens the DB file read-write and calls Sync when modifications exist, plus additional file-size/stat validation after file-copy operations to detect truncation before finalizing C1Z save.

Changes

Cohort / File(s) | Summary
C1Z pre-sync (pkg/dotc1z/c1file.go) | Adds a pre-compression filesystem synchronization in CloseContext: opens the DB file read‑write, calls Sync, closes it, and returns cleanup-wrapped errors on failure. Retains WAL checkpoint behavior and subsequent saveC1z call.
Copy integrity checks (pkg/dotc1z/manager/local/local.go) | In SaveC1Z and copyFileToTmp compute expectedSize via Stat, verify io.Copy wrote the full size, and return explicit errors on stat failures or size mismatches; keeps explicit fsync steps.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudged the disk with a careful paw,
I counted bytes till I saw them all,
A sync, a click, a safe little zip—
Hooray for files on a perfect trip! 🥕✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title directly and clearly summarizes the main changes: adding fsync and size validation to prevent c1z file corruption, which matches the core objectives of the PR.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e24cb0 and 670333b.

📒 Files selected for processing (1)
  • pkg/dotc1z/c1file.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/dotc1z/c1file.go
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: go-test (1.25.2, windows-latest)
  • GitHub Check: go-test (1.25.2, ubuntu-latest)

Sync() on a read-only file descriptor fails on Windows because
FlushFileBuffers requires write access. Changed from O_RDONLY to
O_RDWR so the fsync works correctly on both Linux and Windows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
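
For reference, a minimal sketch of the resulting sync step; the function name and error messages are illustrative, not the SDK's actual code.

package example

import (
    "fmt"
    "os"
)

// syncToDisk opens path read-write (Sync on a read-only handle fails on
// Windows because FlushFileBuffers requires write access), flushes kernel
// buffers to stable storage, and closes the handle.
func syncToDisk(path string) error {
    f, err := os.OpenFile(path, os.O_RDWR, 0)
    if err != nil {
        return fmt.Errorf("failed to open file for sync: %w", err)
    }
    if err := f.Sync(); err != nil {
        f.Close()
        return fmt.Errorf("failed to fsync file: %w", err)
    }
    return f.Close()
}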

Copilot AI left a comment

Pull request overview

This PR implements two defensive fixes to prevent c1z file corruption observed in production where compressed files had truncated zstd streams missing end-of-stream markers.

Key changes:

  • Added explicit fsync after SQLite database close but before compression to ensure data is on disk
  • Added size validation after file copy operations to detect truncated transfers

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
pkg/dotc1z/c1file.go | Adds fsync call on database file after SQLite Close() and before compression to ensure all data is written to disk, particularly important on filesystems with aggressive caching
pkg/dotc1z/manager/local/local.go | Adds size validation by comparing bytes copied via io.Copy to source file size to detect and fail on truncated transfers before saving corrupt files
