Fix cloud journal archival to S3#73

Merged
zourzouvillys merged 3 commits into main from
theo/fix-cloud-journal-archival
Mar 9, 2026

@zourzouvillys
Collaborator

Summary

  • Journal rotation on cloud: Live JournalWriter in InstanceState.ensureBroker() had zero RotateDuration, so files never rotated and on-rotate archival never fired. Added -journal-rotate-duration flag (default PT1H) to lplex-cloud, threaded through InstanceManager -> InstanceState -> JournalConfig.
  • Startup archive sweep: Added archiveUnarchived() to JournalKeeper.Run() that scans all directories on startup and archives any .lpj files missing .archived markers (skipping the newest file per dir, which may be the active writer). Catches files missed due to crashes, restarts, or late archive configuration.
  • Docs & config: Updated configuration docs, retention docs, cloud self-hosted guide, lplex-cloud.conf.example, and README with the new flag and startup sweep behavior.
  • Entrypoint: Added -journal-rotate-duration PT1H to the Docker entrypoint in the infra repo.
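
The PT1H default is an ISO 8601 duration, which Go's time.ParseDuration does not accept (it expects forms such as 1h). A minimal sketch of how such a flag value could be converted, assuming only the time components (hours, minutes, seconds) matter; parseISODuration is a hypothetical helper for illustration, not necessarily how lplex-cloud parses the flag:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// isoTimePart matches the time-only subset of ISO 8601 durations,
// e.g. PT1H, PT30M, PT1H30M15S. Date components and fractions are
// deliberately not supported in this sketch.
var isoTimePart = regexp.MustCompile(`^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$`)

// parseISODuration (hypothetical) converts a PT..H..M..S string to a
// time.Duration.
func parseISODuration(s string) (time.Duration, error) {
	m := isoTimePart.FindStringSubmatch(s)
	if m == nil || s == "PT" {
		return 0, fmt.Errorf("unsupported ISO 8601 duration %q", s)
	}
	units := []time.Duration{time.Hour, time.Minute, time.Second}
	var total time.Duration
	for i, unit := range units {
		if m[i+1] == "" {
			continue // component not present in the input
		}
		n, err := strconv.ParseInt(m[i+1], 10, 64)
		if err != nil {
			return 0, err
		}
		total += time.Duration(n) * unit
	}
	return total, nil
}
```

With this helper, parseISODuration("PT1H") yields time.Hour, while a Go-style value such as "1h" is rejected with an error.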

Test plan

  • TestStartupArchivesUnarchivedFiles - non-archived files get archived on startup, newest skipped
  • TestStartupSkipsAlreadyArchivedFiles - already-archived files aren't re-archived
  • TestStartupSkipsActiveFile - most recent file per dir is skipped
  • TestCloudJournalWriterRotation - duration flows from InstanceManager to JournalWriter config, OnRotate fires
  • TestSetJournalRotateDurationRetroactive - retroactive update of existing instances
  • go build ./...
  • golangci-lint run (0 issues)
  • Rebuild Docker image and deploy

Three bugs prevented journal archival from working on lplex-cloud:

1. Live JournalWriter had zero RotateDuration, so files never rotated
   and the on-rotate trigger never fired. Add -journal-rotate-duration
   flag (default PT1H) and thread it through InstanceManager to each
   instance's JournalConfig.

2. On startup, the keeper only archived files during hard-expire. If the
   process restarted with rotated but non-expired .lpj files missing
   .archived markers, they sat there until max-age. Add a one-shot
   archiveUnarchived() sweep at startup that archives any unarchived
   files (skipping the newest per dir as it may be the active writer).

3. (Already fixed externally) archive-to-s3.sh read .file instead of
   .path from the JSONL metadata.
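
The selection step of the startup sweep in item 2 can be sketched as a pure function over one directory's entries. This is an illustration under stated assumptions, not the code in this PR: journalFile and selectUnarchived are hypothetical names, the marker is assumed to be named "<file>.lpj.archived", and "newest" is assumed to mean the latest modification time. The skipNewest switch is there because the third commit message in this PR removes the skip.

```go
package main

import (
	"sort"
	"strings"
)

// journalFile (hypothetical) describes one directory entry.
// ModUnix is the modification time in Unix seconds.
type journalFile struct {
	Name    string
	ModUnix int64
}

// selectUnarchived returns the .lpj files that have no
// "<name>.archived" marker next to them, oldest first. When
// skipNewest is true, the most recently modified .lpj file is left
// out because it may still be open for writing.
func selectUnarchived(entries []journalFile, skipNewest bool) []string {
	markers := make(map[string]bool)
	var journals []journalFile
	for _, e := range entries {
		switch {
		case strings.HasSuffix(e.Name, ".lpj.archived"):
			markers[strings.TrimSuffix(e.Name, ".archived")] = true
		case strings.HasSuffix(e.Name, ".lpj"):
			journals = append(journals, e)
		}
	}
	sort.Slice(journals, func(i, j int) bool {
		return journals[i].ModUnix < journals[j].ModUnix
	})
	if skipNewest && len(journals) > 0 {
		journals = journals[:len(journals)-1]
	}
	var out []string
	for _, j := range journals {
		if !markers[j.Name] {
			out = append(out, j.Name)
		}
	}
	return out
}
```

Note that with a single .lpj file in the directory and skipNewest set, the function returns nothing, which is the situation the third commit message describes.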

The previous commit only threaded RotateDuration through to the cloud's
JournalWriter. Add RotateSize too so files rotate on whichever threshold
hits first (time or size), matching the boat-side behavior.
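
"Whichever threshold hits first" amounts to an OR over two independent checks. A minimal sketch, assuming a zero value disables the corresponding check (the first commit message confirms this for RotateDuration; for RotateSize it is an assumption). RotationConfig and shouldRotate are hypothetical names; only the field names come from this PR.

```go
package main

import "time"

// RotationConfig (hypothetical) carries both rotation thresholds.
type RotationConfig struct {
	RotateDuration time.Duration // 0 disables time-based rotation
	RotateSize     int64         // bytes; 0 assumed to disable size-based rotation
}

// shouldRotate reports whether the current journal file should be
// closed and a new one started. age is the time since the file was
// opened, size is the number of bytes written so far.
func shouldRotate(cfg RotationConfig, age time.Duration, size int64) bool {
	if cfg.RotateDuration > 0 && age >= cfg.RotateDuration {
		return true
	}
	if cfg.RotateSize > 0 && size >= cfg.RotateSize {
		return true
	}
	return false
}
```

With both fields zero, shouldRotate never returns true, which mirrors the original bug: no rotation, so no on-rotate archival.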

Rename SetJournalRotateDuration to SetJournalRotation(duration, size)
since both knobs travel together.
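
A sketch of what a combined setter with retroactive update might look like. The type names InstanceManager, InstanceState, JournalConfig and the method name SetJournalRotation come from this PR; every field, the GetOrCreate and Journal helpers, and the locking scheme are assumptions made for illustration.

```go
package main

import (
	"sync"
	"time"
)

// JournalConfig holds the rotation thresholds for one instance.
type JournalConfig struct {
	RotateDuration time.Duration
	RotateSize     int64
}

// InstanceState is reduced here to the journal config it carries.
type InstanceState struct {
	mu      sync.Mutex
	journal JournalConfig
}

// Journal returns a copy of the instance's current journal config.
func (s *InstanceState) Journal() JournalConfig {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.journal
}

// InstanceManager remembers the thresholds so new instances inherit them.
type InstanceManager struct {
	mu        sync.Mutex
	rotation  JournalConfig
	instances map[string]*InstanceState
}

func NewInstanceManager() *InstanceManager {
	return &InstanceManager{instances: make(map[string]*InstanceState)}
}

// SetJournalRotation stores both thresholds for future instances and
// applies them to instances that already exist.
func (im *InstanceManager) SetJournalRotation(d time.Duration, size int64) {
	im.mu.Lock()
	defer im.mu.Unlock()
	im.rotation = JournalConfig{RotateDuration: d, RotateSize: size}
	for _, inst := range im.instances {
		inst.mu.Lock()
		inst.journal = im.rotation
		inst.mu.Unlock()
	}
}

// GetOrCreate (hypothetical) returns the instance for id, seeding a
// new one with the manager's current thresholds.
func (im *InstanceManager) GetOrCreate(id string) *InstanceState {
	im.mu.Lock()
	defer im.mu.Unlock()
	if inst, ok := im.instances[id]; ok {
		return inst
	}
	inst := &InstanceState{journal: im.rotation}
	im.instances[id] = inst
	return inst
}
```

Passing both values in one call avoids a window in which an instance has the new duration but the old size. Whether an already-running JournalWriter picks up the change immediately is not shown here.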

Three bugs prevented journal files from reaching S3:

1. Shutdown ordering: cancel() killed the keeper before im.Shutdown()
   stopped brokers. Journal finalize -> OnRotate fired into a dead keeper.
   Fix: shutdown servers, then brokers, then keeper.

2. stopBroker didn't wait for journal writer goroutine. finalize() and
   OnRotate ran asynchronously after stopBroker returned. Added
   journalDone channel so stopBroker blocks until finalize completes.

3. archiveUnarchived skipped the newest file per directory (assumed active).
   With only one file (common after short-lived deployments), nothing got
   archived. Since the sweep runs at startup before any brokers exist,
   all files are completed. Removed the skip entirely.

Also added drain logic to keeper's Run() so it processes remaining
rotation notifications after context cancellation.
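
The ordering fix, the journalDone wait, and the drain can be sketched together. This is a simplified model under stated assumptions, not the PR's code: the keeper only records paths instead of archiving them, the rotation queue is assumed to be a buffered channel, and all names other than journalDone and stopBroker are hypothetical.

```go
package main

import "context"

// keeper records rotated journal paths (standing in for archival).
type keeper struct {
	rotations chan string   // buffered queue of rotation notifications
	archived  []string      // written only by run; read after done closes
	done      chan struct{} // closed when run returns
}

func newKeeper() *keeper {
	return &keeper{rotations: make(chan string, 16), done: make(chan struct{})}
}

func (k *keeper) run(ctx context.Context) {
	defer close(k.done)
	for {
		select {
		case p := <-k.rotations:
			k.archived = append(k.archived, p)
		case <-ctx.Done():
			// Drain notifications queued before cancellation.
			for {
				select {
				case p := <-k.rotations:
					k.archived = append(k.archived, p)
				default:
					return
				}
			}
		}
	}
}

type broker struct {
	stop        chan struct{}
	journalDone chan struct{} // closed once finalize has completed
}

func startBroker(k *keeper, finalPath string) *broker {
	b := &broker{stop: make(chan struct{}), journalDone: make(chan struct{})}
	go func() {
		defer close(b.journalDone)
		<-b.stop
		k.rotations <- finalPath // finalize: notify the keeper (OnRotate)
	}()
	return b
}

// stopBroker blocks until the journal goroutine has finalized.
func (b *broker) stopBroker() {
	close(b.stop)
	<-b.journalDone
}

// shutdownInOrder stops brokers first, then the keeper, and returns
// the paths the keeper saw.
func shutdownInOrder(paths ...string) []string {
	ctx, cancel := context.WithCancel(context.Background())
	k := newKeeper()
	go k.run(ctx)
	var brokers []*broker
	for _, p := range paths {
		brokers = append(brokers, startBroker(k, p))
	}
	for _, b := range brokers {
		b.stopBroker()
	}
	cancel() // only now is it safe to stop the keeper
	<-k.done
	return k.archived
}
```

Because stopBroker returns only after the final notification is queued, and the keeper drains its queue after cancellation, every finalized path reaches the keeper even if cancel() wins the select race.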

zourzouvillys merged commit 0cc2cea into main Mar 9, 2026
8 checks passed
zourzouvillys deleted the theo/fix-cloud-journal-archival branch March 9, 2026 01:44