Skip to content

Conversation

@arreyder
Copy link
Contributor

@arreyder arreyder commented Dec 23, 2025

Maybe fix c1z corruption by using atomic writes in saveC1z

we are seeing "corrupt stream, did not find end of stream" errors on c1z files. The issue: saveC1z wrote directly to the output file, so any interruption (crash, error, kill) left a partial/corrupt file that could get uploaded to S3.

write to a temp file first, rename to output on success. If anything fails, the original file is untouched and the temp gets cleaned up.

I cannot attribute/explain, how/if this relates to zfs? Only guess is it some how exposed a latent issue.

also added tests for the atomic behavior and a benchmark.

Summary by CodeRabbit

  • Bug Fixes

    • Safer save process using temporary files, explicit sync/close, and atomic rename to prevent partial or corrupted output.
    • Improved cleanup on failure and clearer, more specific error reporting for file operations.
  • Tests

    • Added tests and benchmarks validating atomic save behavior, cleanup on failure, preservation of existing outputs on error, and performance across varied sizes.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 23, 2025

Walkthrough

Save routines were changed to write to a temporary file (os.CreateTemp), sync and close it, then atomically rename to the final path; temp-file cleanup and more robust Close()/error handling were added. Tests and a benchmark were added to validate atomic-write behavior and performance.

Changes

Cohort / File(s) Summary
Atomic write implementation
pkg/dotc1z/file.go, pkg/dotc1z/manager/local/local.go
Replace direct writes with os.CreateTemp → write → Sync/close → os.Rename atomic replacement; add defer cleanup of temp file, explicit close-error handling/logging, niling file vars to avoid double-close, and updated error messages/comments about race mitigation.
Tests & Benchmarks
pkg/dotc1z/file_test.go
Add BenchmarkSaveC1z (1KB/100KB/1MB) and TestSaveC1zAtomicWrite subtests verifying preservation of existing output on failures, temp-file cleanup, and behavior when source DB is missing; add fmt import for benchmark labels.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • ggreer
  • loganintech

Poem

I’m a rabbit with a tiny pen, 🐇
I write to temp and try again.
I close, I sync, I rename true,
No half-done files will bother you.
Hop—atomic now, the world’s like new. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'atomic write on saveC1z, maybe stop shipping bad c1z' clearly describes the main change: implementing atomic writes in the saveC1z function to prevent partial/corrupt files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/zstd-incomplete-stream-on-error

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/dotc1z/file_test.go (1)

120-132: Test may not exercise temp file cleanup path.

When dbFilePath doesn't exist, saveC1z fails at line 67 (os.Open), before the temp file is created at line 83. This test validates that no temp files exist, but the cleanup code in the defer is never actually exercised.

Consider injecting a failure after the temp file is created (e.g., mocking a write error, or using a read-only output directory for Sync to fail) to truly test the cleanup path.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ae8bd8 and 43ff485.

📒 Files selected for processing (2)
  • pkg/dotc1z/file.go
  • pkg/dotc1z/file_test.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2024-10-08T21:29:30.695Z
Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:0-0
Timestamp: 2024-10-08T21:29:30.695Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `cleanup` method of the `DBCache` struct in the `pkg/uhttp/dbcache.go` file.

Applied to files:

  • pkg/dotc1z/file.go
📚 Learning: 2024-09-03T15:53:47.572Z
Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:0-0
Timestamp: 2024-09-03T15:53:47.572Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `NewDBCache` function of the `DBCache` struct in the `pkg/uhttp/dbcache.go` file.

Applied to files:

  • pkg/dotc1z/file.go
🪛 GitHub Check: go-lint
pkg/dotc1z/file_test.go

[failure] 80-80:
unnecessary trailing newline (whitespace)


[failure] 85-85:
Comment should end in a period (godot)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: go-test (1.25.2, ubuntu-latest)
  • GitHub Check: go-test (1.25.2, windows-latest)
🔇 Additional comments (4)
pkg/dotc1z/file.go (2)

70-76: LGTM!

Proper nil-guard and error logging for deferred cleanup. This ensures close errors are visible without masking the primary return value.


78-98: Well-designed atomic write pattern.

Creating the temp file in the same directory as the output ensures they're on the same filesystem, which is required for atomic rename. The unconditional os.Remove(tmpPath) in defer is safe—after a successful rename, the temp file no longer exists at that path, so the remove silently fails.

pkg/dotc1z/file_test.go (2)

13-42: LGTM!

Good benchmark coverage with varying input sizes. The use of b.ResetTimer() correctly excludes setup time from measurements.


134-144: LGTM!

This correctly validates that no output file is created when the source file is missing.

Comment on lines +150 to +154
// Atomic rename: outputFilePath now has complete, valid data
// This is the only point where outputFilePath is modified
if err = os.Rename(tmpPath, outputFilePath); err != nil {
return fmt.Errorf("failed to rename temp file to output: %w", err)
}
Copy link

@coderabbitai coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Go os.Rename Windows behavior destination file exists atomicity

💡 Result:

Short answer

  • os.Rename on Windows does NOT atomically replace an existing destination file. On non‑Unix platforms (Windows) Rename is not guaranteed atomic and will fail or behave differently when the destination exists. Use a platform‑specific replace (MoveFileEx with REPLACE_EXISTING) or an atomic helper package. [1][2][3]

Details & practical options

  • The Go runtime/OS docs state Rename may replace dst on Unix but "Even within the same directory, on non‑Unix platforms Rename is not an atomic operation." (Windows has different semantics). [3]
  • The Go issue tracker records this platform difference and the rationale for providing a separate “replace” style operation on Windows. os.Rename was not made to provide Unix‑style atomic overwrite on Windows. [1]
  • Workarounds:
    • Use a library that calls the Windows API (MoveFileEx with MOVEFILE_REPLACE_EXISTING) to perform an atomic replace on Windows (examples: natefinch/atomic and other "replacefile"/"atomic rename" packages). These guarantee either full replace or no change on Windows. [2][4]
    • If you attempt manual Remove(dst) + os.Rename(src,dst) be aware that this is not atomic and has race/permission/use‑in‑use failure modes on Windows. [1][5]

Sources

  • Go issue: "os: make Rename atomic on Windows" (golang/go). [1]
  • github.com/natefinch/atomic (atomic replace/write for Windows using MoveFileEx). [2]
  • os.Rename docs / pkg: "Even within the same directory, on non‑Unix platforms Rename is not an atomic operation." [3]
  • replace/atomic helper implementations (example pkg docs). [4]
  • Discussion/examples showing Windows rename behavior and pitfalls. [5]

🏁 Script executed:

# First, let me examine the code at lines 150-154 and surrounding context
cd /root && find . -type f -name "file.go" -path "*/dotc1z/*" 2>/dev/null | head -5

Repository: ConductorOne/baton-sdk

Length of output: 121


🏁 Script executed:

# Find the file
find . -type f -name "file.go" -path "*dotc1z*" 2>/dev/null

Repository: ConductorOne/baton-sdk

Length of output: 86


🏁 Script executed:

# Also search for any files in dotc1z directory
find . -type d -name "dotc1z" 2>/dev/null

Repository: ConductorOne/baton-sdk

Length of output: 78


🏁 Script executed:

# Read the file, focusing on lines around 150-154
cat -n ./pkg/dotc1z/file.go | sed -n '120,160p'

Repository: ConductorOne/baton-sdk

Length of output: 1213


🏁 Script executed:

# Read the full function to understand the context - let's see from the function start
cat -n ./pkg/dotc1z/file.go | sed -n '80,160p'

Repository: ConductorOne/baton-sdk

Length of output: 2633


🏁 Script executed:

# Check the function signature and see if outputFilePath is expected to exist or not
cat -n ./pkg/dotc1z/file.go | sed -n '60,90p'

Repository: ConductorOne/baton-sdk

Length of output: 1197


🏁 Script executed:

# Also search for calls to this function to understand the usage pattern
rg "os\.Rename" ./pkg/dotc1z/file.go -B 5 -A 5

Repository: ConductorOne/baton-sdk

Length of output: 385


🏁 Script executed:

# Search for calls to saveC1z function
rg "saveC1z" ./pkg/dotc1z/ -B 2 -A 2

Repository: ConductorOne/baton-sdk

Length of output: 4153


🏁 Script executed:

# Also check if there's any code that removes or checks for outputFilePath existence
rg "outputFilePath" ./pkg/dotc1z/file.go -B 2 -A 2

Repository: ConductorOne/baton-sdk

Length of output: 953


🏁 Script executed:

# Let me check the specific test that validates atomic write behavior with existing file
cat -n ./pkg/dotc1z/file_test.go | sed -n '50,130p'

Repository: ConductorOne/baton-sdk

Length of output: 3341


🏁 Script executed:

# Check if there's a test that specifically tests re-saving to the same outputFilePath
cat -n ./pkg/dotc1z/file_test.go | sed -n '130,180p'

Repository: ConductorOne/baton-sdk

Length of output: 721


🏁 Script executed:

# Search for any code that might handle Windows-specific behavior for file operations
rg "windows\|Windows\|GOOS\|runtime.GOOS" ./pkg/dotc1z/ -i

Repository: ConductorOne/baton-sdk

Length of output: 48


Address Windows compatibility for atomic file replacement.

On Unix, os.Rename atomically replaces an existing destination file. On Windows, os.Rename returns an error if the destination already exists—it does not support atomic replacement.

If outputFilePath already exists during a re-sync operation, this will fail on Windows. Consider using a platform-aware helper (e.g., natefinch/atomic or a wrapper calling MoveFileEx with MOVEFILE_REPLACE_EXISTING on Windows) to support atomic replacement across platforms.

🤖 Prompt for AI Agents
In pkg/dotc1z/file.go around lines 150 to 154, the code uses os.Rename to move
tmpPath to outputFilePath which is atomic on Unix but fails on Windows if the
destination exists; replace this call with a platform-aware atomic replace
helper (for example use the natefinch/atomic package or implement a small
wrapper that calls MoveFileEx with MOVEFILE_REPLACE_EXISTING on Windows and
os.Rename on Unix), ensure the helper preserves file permissions/ownership where
appropriate, return the helper's error on failure, and add a brief comment
noting the platform-specific behavior and why the helper is used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ This is valid. This code is exercised inservice mode.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kans, thank you for confirming this is a valid concern. Since this code runs in service mode and could potentially be executed on Windows, addressing the atomic rename behavior would improve cross-platform reliability.

Would you like to:

  1. Address the Windows compatibility in this PR by using a library like natefinch/atomic or implementing a platform-aware wrapper?
  2. Create a follow-up issue to track this for a future PR?

If you'd prefer option 2, I can create an issue with the details for you.


🧠 Learnings used
Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:0-0
Timestamp: 2024-10-08T21:29:30.695Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `cleanup` method of the `DBCache` struct in the `pkg/uhttp/dbcache.go` file.

Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:0-0
Timestamp: 2024-09-03T15:49:24.881Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `cleanup` method of the `DBCache` struct in the `pkg/uhttp/dbcache.go` file.

Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:0-0
Timestamp: 2024-09-03T15:53:47.572Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `NewDBCache` function of the `DBCache` struct in the `pkg/uhttp/dbcache.go` file.

Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:0-0
Timestamp: 2024-09-03T15:48:24.208Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `DBCache` methods in the `pkg/uhttp/dbcache.go` file.

Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:0-0
Timestamp: 2024-10-08T21:29:30.695Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `DBCache` methods in the `pkg/uhttp/dbcache.go` file.

Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:267-278
Timestamp: 2024-10-08T21:29:30.695Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `DBCache` methods in the `pkg/uhttp/dbcache.go` file.

Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:267-278
Timestamp: 2024-09-03T15:50:49.204Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `DBCache` methods in the `pkg/uhttp/dbcache.go` file.

Learnt from: mchavez
Repo: ConductorOne/baton-sdk PR: 211
File: pkg/uhttp/dbcache.go:0-0
Timestamp: 2024-09-03T15:51:48.712Z
Learning: The user mchavez has fixed the issue related to improving error messages and adding comments in the `Get` method of the `DBCache` struct in the `pkg/uhttp/dbcache.go` file.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/dotc1z/manager/local/local.go (1)

166-204: LGTM: Atomic write pattern correctly implemented.

The refactoring successfully implements the atomic write pattern to prevent partial/corrupt c1z files:

  1. ✓ Temp file created in destination directory (ensures same filesystem for atomic rename)
  2. ✓ Data synced before rename (critical for durability, especially on ZFS)
  3. ✓ File closed explicitly with error handling before rename
  4. ✓ Atomic rename operation via os.Rename
  5. ✓ Cleanup of temp file on failure via defer
  6. ✓ Prevention of double-close by setting dstFile = nil

This correctly addresses the PR objective of preventing corrupt streams by ensuring the destination file never contains partial data.

Optional: Consider syncing the directory for maximum durability

For maximum durability guarantees, consider syncing the parent directory after the rename to ensure the directory entry is persisted:

// Atomic rename: filePath now has complete, valid data.
if err := os.Rename(dstTmpPath, l.filePath); err != nil {
    return fmt.Errorf("failed to rename temp file to destination: %w", err)
}

// Optional: sync directory to ensure rename is persisted
if dir, err := os.Open(dstDir); err == nil {
    dir.Sync()
    dir.Close()
}

Note: This is often omitted in practice as the durability benefit is marginal in most scenarios, and the code is already robust without it.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a5869d3 and 1dd37f8.

📒 Files selected for processing (1)
  • pkg/dotc1z/manager/local/local.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: go-test (1.25.2, windows-latest)
🔇 Additional comments (1)
pkg/dotc1z/manager/local/local.go (1)

8-8: LGTM: Import addition supports temp file path handling.

The filepath import is necessary for the Dir() and Base() operations used in the atomic write implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants