A patch for the Lustre 2.15.8 llite client that tracks which
2 GB-aligned blocks have been written to large files. The bitmap is persisted as the
extended attribute user.dirty_blockmap and survives unmount/remount.
Large Lustre files (HPC workloads, checkpoints, scientific datasets) are often written in partial passes: only certain regions of a multi-terabyte file change between runs. Without block-level write tracking, downstream consumers (backup, tiering, integrity checking) must either scan the entire file or rely on coarse-grained modification timestamps.
user.dirty_blockmap gives any userspace tool a compact, persistent bitmap of exactly
which 2 GB regions have ever been written, at negligible runtime cost.
| Property | Value |
|---|---|
| Block size | 2 GB |
| Minimum tracked file size | 2 GB (smaller files: no bitmap, no overhead) |
| Maximum tracked file size | 1 PB (larger files: -EFBIG) |
| Maximum blocks | 524,288 (1 PB ÷ 2 GB) |
| Bitmap storage | 8,192 × __u64 = 64 KB (fits in a single xattr value) |
| xattr name | user.dirty_blockmap |
| Encoding | Raw little-endian uint64_t array, no header |
Bit b of word w being set means 2 GB block (w × 64 + b) has been written at
least once. Block numbering starts at 0 (file bytes 0–2 GB).
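
As a rough illustration of this layout, the following Python sketch maps a byte range to the block numbers it touches and sets the corresponding bits in a list of 64-bit words. The helper names (`blocks_for_range`, `mark_range`, `is_dirty`) are hypothetical and are not taken from the patch; the kernel code may differ in detail.

```python
BLOCK_SIZE = 2 * 1024**3      # 2 GB block size
MAX_BLOCKS = 524_288          # 1 PB / 2 GB
NWORDS = MAX_BLOCKS // 64     # 8,192 words of 64 bits each


def blocks_for_range(offset, length):
    """Return the block numbers touched by bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return range(first, last + 1)


def mark_range(words, offset, length):
    """Set the bit of every block touched by the byte range (hypothetical helper)."""
    for block in blocks_for_range(offset, length):
        if block >= MAX_BLOCKS:
            raise OverflowError("offset beyond the 1 PB tracking limit")
        words[block // 64] |= 1 << (block % 64)


def is_dirty(words, block):
    """Bit b of word w covers block w * 64 + b."""
    return (words[block // 64] >> (block % 64)) & 1 == 1


words = [0] * NWORDS
# A 1 MB write that straddles the boundary between block 0 and block 1.
mark_range(words, BLOCK_SIZE - 512 * 1024, 1024 * 1024)
print(list(blocks_for_range(BLOCK_SIZE - 512 * 1024, 1024 * 1024)))  # [0, 1]
print(hex(words[0]))  # 0x3
```

A write that crosses a 2 GB boundary marks both neighbouring blocks, so even a small write can set two bits.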
| File | Change |
|---|---|
| `lustre/llite/dirty_blockmap.c` | New file: all bitmap logic |
| `lustre/llite/llite_internal.h` | Constants, `struct ll_dirty_blockmap`, `lli_dirty_blockmap` field, declarations |
| `lustre/llite/file.c` | Hooks in `ll_file_open`, `ll_file_write_iter`, `ll_file_release`, `ll_fsync` |
| `lustre/llite/vvp_io.c` | Periodic flush hook in `vvp_io_rw_end` (`CIT_WRITE`) |
| `lustre/llite/llite_lib.c` | Persist and free on inode eviction in `ll_clear_inode` |
| `lustre/llite/Makefile.in` | Add `dirty_blockmap.o` to the kernel module build |

```
open()
  └─ ll_file_open()
       └─ ll_dirty_blockmap_alloc()   if file >= 2 GB at open time
                                      (fresh zeroed bitmap, no xattr load)

write()
  └─ ll_file_write_iter()
       ├─ lazy alloc                  if no bitmap yet and (ki_pos >= 2 GB or
       │                              i_size_read() >= 2 GB); handles dd O_TRUNC
       │                              and writes at low offsets into large sparse files
       └─ ll_dirty_blockmap_mark()    set bits for written byte range
  └─ vvp_io_rw_end()                  [CIT_WRITE, after each IO completion]
       └─ ll_dirty_blockmap_store()   periodic flush if dbm_dirty
                                      (same cadence as mtime updates; survives long-running opens)

close()
  └─ ll_file_release()
       └─ ll_dirty_blockmap_store()   persist xattr if bitmap is dirty

fsync()
  └─ ll_fsync()
       └─ ll_dirty_blockmap_store()   persist xattr if bitmap is dirty

inode eviction
  └─ ll_clear_inode()
       ├─ ll_dirty_blockmap_store()   final persist
       └─ ll_dirty_blockmap_free()    release memory
```

- `lli->lli_lock` is `rwlock_t` in 2.15.8: use `write_lock`/`write_unlock` in `ll_dirty_blockmap_free()`, not `spin_lock`.
- xattr I/O buffers are heap-allocated via `OBD_ALLOC`/`OBD_FREE`: the full bitmap is 64 KB, which exceeds the kernel stack limit.
- `md_setxattr()` requires a `struct ptlrpc_request **`: always pass `&req` and call `ptlrpc_req_finished(req)` afterwards to free the MDS reply buffer.
- The bitmap init in `ll_file_open()` is placed before the final `GOTO(out_och_free, rc)` on the success path. Placing it before the `out_och_free:` label itself would land in unreachable dead code.
- `ll_file_write_iter()` uses `rc_normal` (not `result`). Write start offset is `iocb->ki_pos - rc_normal` because `ki_pos` has already advanced by the time the hook runs.
- Lazy allocation in `ll_file_write_iter()` triggers when either `iocb->ki_pos >= DIRTY_BLOCKMAP_MIN_FILESIZE` (file grew past threshold) or `i_size_read() >= DIRTY_BLOCKMAP_MIN_FILESIZE` (write at any offset into a large sparse file, e.g. `pwrite` at offset 0 on a 3 GB sparse file).

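
The offset arithmetic and the lazy-allocation condition from the last two notes can be modeled in a few lines of Python. This is a userspace sketch only, not the kernel code: `write_start` and `needs_bitmap` are hypothetical names, and `MIN_FILESIZE` stands in for `DIRTY_BLOCKMAP_MIN_FILESIZE`, assumed here to be 2 GB as stated in the properties table.

```python
GB = 1024**3
MIN_FILESIZE = 2 * GB  # stand-in for DIRTY_BLOCKMAP_MIN_FILESIZE (assumed 2 GB)


def write_start(ki_pos_after, bytes_written):
    """Recover the start offset once ki_pos has already advanced past the write."""
    return ki_pos_after - bytes_written


def needs_bitmap(ki_pos_after, i_size):
    """Model of the lazy-allocation trigger: the write position or the
    file size has reached the tracking threshold."""
    return ki_pos_after >= MIN_FILESIZE or i_size >= MIN_FILESIZE


# 4 KB pwrite at offset 0 into a 3 GB sparse file: ki_pos ends at 4096.
print(write_start(4096, 4096))                      # 0
print(needs_bitmap(4096, 3 * GB))                   # True (file size triggers)
# Small write into a small file: no bitmap, no overhead.
print(needs_bitmap(4096, 4096))                     # False
# Sequential write that has just crossed the 2 GB mark.
print(needs_bitmap(2 * GB + 4096, 2 * GB + 4096))   # True
```

Checking the file size as well as the write position is what lets a write at a low offset into an already large file be tracked.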
The patch is generated by a script that applies all changes to a clean Lustre 2.15.8
source tree and produces a git format-patch output.

```
git clone https://github.com/lustre/lustre-release
cd lustre-release
git checkout 2.15.8
bash /path/to/generate_dirty_blockmap_patch.sh
```

Output: `../0001-llite-add-2GB-block-dirty-blockmap-via-user-xattr.patch`

To apply to another tree:

```
git checkout 2.15.8
git apply 0001-llite-add-2GB-block-dirty-blockmap-via-user-xattr.patch
```

Build and install as you would any Lustre client RPM:

```
make rpms
dnf install -y kmod-lustre-client-*.rpm lustre-client-*.rpm
```

- Lustre 2.15.8 source tree
- Lustre mount with `user_xattr` option (verify with `mount | grep lustre`)
- Files must be ≥ 2 GB to be tracked

```python
import os
import struct

BLOCK_SIZE = 2 * 1024**3  # 2 GB; must match the kernel module's block size


def read_dirty_blockmap(path):
    """Print a summary of the user.dirty_blockmap xattr of path."""
    try:
        data = os.getxattr(path, 'user.dirty_blockmap')
    except OSError:
        print(f"{path}: no dirty_blockmap (file < 2 GB or never written)")
        return

    file_size = os.stat(path).st_size
    nwords = len(data) // 8
    # Raw little-endian uint64_t array, no header; ignore any trailing partial word.
    words = struct.unpack(f'<{nwords}Q', data[:nwords * 8])
    total_blks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    dirty_blks = sum(bin(w).count('1') for w in words)

    print(f"File: {path}")
    print(f"Size: {file_size:,} bytes ({file_size / BLOCK_SIZE:.2f} × 2 GB blocks)")
    print(f"Dirty blocks: {dirty_blks} / {total_blks}")
    print("Block map: ", end="")
    # One digit per block, stopping at the last block covered by the file.
    for block in range(min(total_blks, nwords * 64)):
        print((words[block // 64] >> (block % 64)) & 1, end="")
    print()


read_dirty_blockmap("/lustre/myfile.bin")
```

Example output for a 3 GB file where only the second 2 GB block was written:

```
File: /lustre/myfile.bin
Size: 3,221,225,472 bytes (1.50 × 2 GB blocks)
Dirty blocks: 1 / 2
Block map: 01
```

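
A backup or tiering tool would typically want byte ranges rather than a bit string. The sketch below, which relies on the same xattr encoding, coalesces consecutive dirty blocks into `(start, end)` byte extents and ignores any bits beyond the current file size. The function name `dirty_extents` is hypothetical, and the raw bytes are passed in directly so the example runs without a Lustre mount.

```python
import struct

GB = 1024**3
BLOCK_SIZE = 2 * GB  # must match the kernel module's block size


def dirty_extents(data, file_size):
    """Return a list of (start, end) byte extents covering dirty blocks.

    data is the raw xattr value (little-endian uint64 words); end is
    exclusive and clamped to file_size.
    """
    nwords = len(data) // 8
    words = struct.unpack(f"<{nwords}Q", data[:nwords * 8])
    total_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    extents = []
    for block in range(min(total_blocks, nwords * 64)):
        if not (words[block // 64] >> (block % 64)) & 1:
            continue
        start = block * BLOCK_SIZE
        end = min(start + BLOCK_SIZE, file_size)
        if extents and extents[-1][1] == start:
            extents[-1] = (extents[-1][0], end)  # extend the previous extent
        else:
            extents.append((start, end))
    return extents


# Synthetic bitmap: blocks 1, 2 and 5 dirty in a 12 GB file (6 blocks).
raw = struct.pack("<Q", 0b100110)
print(dirty_extents(raw, 12 * GB))
# [(2147483648, 6442450944), (10737418240, 12884901888)]
```

On a real mount the inputs would come from `os.getxattr(path, 'user.dirty_blockmap')` and `os.stat(path).st_size`, as in the reader above.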
| Test | Expected | Result |
|---|---|---|
| `dd` (no fsync, O_TRUNC) into 3 GB file | 2 dirty blocks | ✅ |
| `pwrite` at offset 0 into 3 GB sparse file | block 0 dirty (`10`) | ✅ |
| `pwrite` at offset 2.5 GB only | block 1 dirty (`01`) | ✅ |
| OR merge: write block 0, reopen, write block 1 | both blocks dirty (`11`) | ✅ |
| Read-only open+close | xattr unchanged | ✅ |
| File < 2 GB | no xattr, no overhead | ✅ |

- Truncation: bits above the new file size are not cleared when a file is truncated. A future version should hook `ll_setattr` to zero stale bits.
- Block size is fixed at compile time (2 GB). It is not encoded in the xattr, so the constant must match between the kernel module and any userspace reader.
- Concurrency: the xattr update is a read-OR-write sequence with no distributed lock. Lustre has no client-side `LCK_EX` path for `MDS_INODELOCK_XATTR` (`IT_SETXATTR` is obsolete; ACL consistency relies on MDS-side serialization). Two nodes flushing concurrently at `close()` or write commit can race: both read the same xattr, both OR in their bits, and the last writer wins, potentially losing the other node's bits. With 2 GB block granularity this window is very narrow in practice. The consequence is a missed dirty block, not data corruption or data loss.
- Tested on RHEL 9.7, kernel `5.14.0-611.36.1.el9_7.x86_64`.

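
The lost-update race described in the concurrency item can be illustrated with a small in-memory model. This is purely a simulation under stated assumptions: the dictionary `mds` stands in for the xattr stored on the MDS, and `read_xattr`/`write_xattr` are hypothetical helpers that mimic the two halves of the read-OR-write sequence. None of these names come from the patch.

```python
XATTR = "user.dirty_blockmap"


def read_xattr(mds):
    """First half of a flush: fetch the current bitmap (one 64-bit word here)."""
    return mds[XATTR]


def write_xattr(mds, snapshot, local_bits):
    """Second half of a flush: OR the local bits into the snapshot and store it."""
    mds[XATTR] = snapshot | local_bits


# Serialized flushes: both nodes' bits survive.
mds = {XATTR: 0}
write_xattr(mds, read_xattr(mds), 0b01)  # node A marks block 0
write_xattr(mds, read_xattr(mds), 0b10)  # node B marks block 1
serialized = mds[XATTR]
print(bin(serialized))                   # 0b11

# Interleaved flushes: both nodes read before either one writes.
mds = {XATTR: 0}
snap_a = read_xattr(mds)
snap_b = read_xattr(mds)
write_xattr(mds, snap_a, 0b01)           # node A stores 0b01
write_xattr(mds, snap_b, 0b10)           # node B overwrites with 0b10
interleaved = mds[XATTR]
print(bin(interleaved))                  # 0b10, node A's block 0 is lost
```

The interleaved case loses a bit only when the two snapshots are taken before either store completes, which matches the narrow window described above.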
GPL-2.0 — same as the Lustre source tree this patch applies to.