ucphhpc/lustre-dirty-blockmap

lustre-dirty-blockmap

A patch for the Lustre 2.15.8 llite client that tracks which 2 GB-aligned blocks of a large file have been written. The bitmap is persisted as the extended attribute user.dirty_blockmap and survives unmount/remount.


Motivation

Large Lustre files (HPC workloads, checkpoints, scientific datasets) are often written in partial passes: only certain regions of a multi-terabyte file change between runs. Without block-level write tracking, downstream consumers (backup, tiering, integrity checking) must either scan the entire file or rely on coarse-grained modification timestamps.

user.dirty_blockmap gives any userspace tool a compact, persistent bitmap of exactly which 2 GB regions have ever been written, at negligible runtime cost.


Design

| Property | Value |
| --- | --- |
| Block size | 2 GB |
| Minimum tracked file size | 2 GB (smaller files: no bitmap, no overhead) |
| Maximum tracked file size | 1 PB (larger files: -EFBIG) |
| Maximum blocks | 524,288 (1 PB ÷ 2 GB) |
| Bitmap storage | 8,192 × __u64 = 64 KB (fits in a single xattr value) |
| xattr name | user.dirty_blockmap |
| Encoding | Raw little-endian uint64_t array, no header |
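
The figures in the table follow from the block size and the size limit. The check below is a small sketch that assumes "GB" and "PB" mean binary units (2^30 and 2^50 bytes), which is the reading under which 1 PB ÷ 2 GB gives 524,288. The constant names are illustrative and are not taken from the patch.

```python
# Assumption: binary units (GiB / PiB); names are illustrative only.
BLOCK_SIZE = 2 * 2**30      # 2 GB block
MAX_FILE_SIZE = 2**50       # 1 PB upper limit
BITS_PER_WORD = 64          # one __u64 per bitmap word

max_blocks = MAX_FILE_SIZE // BLOCK_SIZE      # number of trackable blocks
bitmap_words = max_blocks // BITS_PER_WORD    # __u64 words needed
bitmap_bytes = bitmap_words * 8               # size of the xattr value

print(max_blocks, bitmap_words, bitmap_bytes)  # 524288 8192 65536
```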

Bit b of word w being set means 2 GB block (w × 64 + b) has been written at least once. Block numbering starts at 0, so block 0 covers file bytes 0 through 2 GB − 1.
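
The mapping between byte offsets, block numbers, and (word, bit) positions can be sketched in a few lines of Python. The helper names below are hypothetical and exist only for illustration; the kernel patch has its own implementation in dirty_blockmap.c.

```python
BLOCK_SIZE = 2 * 1024**3  # 2 GB; must match the kernel module


def block_to_word_bit(block):
    """Return (word index, bit index) for a block number."""
    return divmod(block, 64)


def blocks_for_write(offset, length):
    """Return the range of block numbers touched by a write."""
    if length <= 0:
        return range(0)
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return range(first, last + 1)


def is_dirty(words, block):
    """Test one block's bit in a sequence of unpacked uint64 words."""
    w, b = block_to_word_bit(block)
    return bool((words[w] >> b) & 1)
```

For example, a 2-byte write starting at the last byte of block 0 (offset BLOCK_SIZE - 1) touches blocks 0 and 1, and block 70 lives in word 1, bit 6.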


Files Changed

| File | Change |
| --- | --- |
| lustre/llite/dirty_blockmap.c | New file: all bitmap logic |
| lustre/llite/llite_internal.h | Constants, struct ll_dirty_blockmap, lli_dirty_blockmap field, declarations |
| lustre/llite/file.c | Hooks in ll_file_open, ll_file_write_iter, ll_file_release, ll_fsync |
| lustre/llite/vvp_io.c | Periodic flush hook in vvp_io_rw_end (CIT_WRITE) |
| lustre/llite/llite_lib.c | Persist + free on inode eviction in ll_clear_inode |
| lustre/llite/Makefile.in | Add dirty_blockmap.o to the kernel module build |

Lifecycle

open()
  └─ ll_file_open()
       └─ ll_dirty_blockmap_alloc()   if file >= 2 GB at open time
            (fresh zeroed bitmap — no xattr load)

write()
  └─ ll_file_write_iter()
       ├─ lazy alloc if no bitmap yet and (ki_pos >= 2 GB or
       │  i_size_read() >= 2 GB) — handles dd O_TRUNC and writes
       │  at low offsets into large sparse files
       └─ ll_dirty_blockmap_mark()    set bits for written byte range
  └─ vvp_io_rw_end()  [CIT_WRITE, after each IO completion]
       └─ ll_dirty_blockmap_store()   periodic flush if dbm_dirty
            (same cadence as mtime updates — survives long-running opens)

close()
  └─ ll_file_release()
       └─ ll_dirty_blockmap_store()   persist xattr if bitmap is dirty

fsync()
  └─ ll_fsync()
       └─ ll_dirty_blockmap_store()   persist xattr if bitmap is dirty

inode eviction
  └─ ll_clear_inode()
       ├─ ll_dirty_blockmap_store()   final persist
       └─ ll_dirty_blockmap_free()    release memory
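
The lifecycle above starts each open with a fresh zeroed bitmap and relies on the store step to merge with whatever is already persisted. A userspace model of that behaviour might look like the sketch below. It is not the kernel code: the class name is made up, and a plain dict stands in for the xattr held on the MDS.

```python
NWORDS = 8192  # 64 KB bitmap, as described in the Design section


class BlockmapModel:
    """Toy model of one open file's in-memory bitmap (hypothetical)."""

    def __init__(self, xattr_store):
        self.xattrs = xattr_store      # dict standing in for persisted xattrs
        self.words = [0] * NWORDS      # fresh zeroed bitmap, no xattr load
        self.dirty = False

    def mark(self, block):
        w, b = divmod(block, 64)
        self.words[w] |= 1 << b
        self.dirty = True

    def store(self):
        if not self.dirty:
            return                     # read-only opens leave the xattr alone
        old = self.xattrs.get("user.dirty_blockmap", [0] * NWORDS)
        # read-OR-write: bits persisted by earlier opens are kept
        self.xattrs["user.dirty_blockmap"] = [o | n for o, n in zip(old, self.words)]
        self.dirty = False


xattrs = {}
first = BlockmapModel(xattrs); first.mark(0); first.store()     # write block 0
second = BlockmapModel(xattrs); second.mark(1); second.store()  # reopen, write block 1
print(bin(xattrs["user.dirty_blockmap"][0]))  # 0b11
```

This mirrors the "OR merge" and "read-only open+close" rows in the verified test cases.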

Implementation Notes (Lustre 2.15.8 Specifics)

  • lli->lli_lock is rwlock_t in 2.15.8 — use write_lock/write_unlock in ll_dirty_blockmap_free(), not spin_lock.
  • xattr I/O buffers are heap-allocated via OBD_ALLOC/OBD_FREE because the full bitmap is 64 KB, which exceeds the kernel stack limit.
  • md_setxattr() requires a struct ptlrpc_request ** — always pass &req and call ptlrpc_req_finished(req) afterwards to free the MDS reply buffer.
  • The bitmap init in ll_file_open() is placed before the final GOTO(out_och_free, rc) on the success path. Placing it before the out_och_free: label itself would land in unreachable dead code.
  • ll_file_write_iter() uses rc_normal (not result). Write start offset is iocb->ki_pos - rc_normal because ki_pos has already advanced by the time the hook runs.
  • Lazy allocation in ll_file_write_iter() triggers when either iocb->ki_pos >= DIRTY_BLOCKMAP_MIN_FILESIZE (file grew past threshold) or i_size_read() >= DIRTY_BLOCKMAP_MIN_FILESIZE (write at any offset into a large sparse file, e.g. pwrite at offset 0 on a 3 GB sparse file).
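
The last two notes can be restated as a small Python model: the written range is recovered from the already-advanced position, and allocation is triggered by either the position or the file size. The function names are hypothetical, and MIN_FILESIZE stands in for DIRTY_BLOCKMAP_MIN_FILESIZE; this is a sketch of the logic as described, not a transcription of the C code.

```python
MIN_FILESIZE = 2 * 1024**3  # stands in for DIRTY_BLOCKMAP_MIN_FILESIZE


def written_range(ki_pos_after, bytes_written):
    """Recover (start, end) of a completed write.

    ki_pos has already advanced past the written bytes when the hook
    runs, so the start offset is ki_pos - rc_normal.
    """
    return ki_pos_after - bytes_written, ki_pos_after


def should_allocate(ki_pos_after, i_size):
    """Lazy-allocation test: file grew past the threshold, or is already large."""
    return ki_pos_after >= MIN_FILESIZE or i_size >= MIN_FILESIZE


# pwrite of 4096 bytes at offset 0 into a 3 GB sparse file
print(written_range(4096, 4096))             # (0, 4096)
print(should_allocate(4096, 3 * 1024**3))    # True
```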

Building

The patch is generated by a script that applies all changes to a clean Lustre 2.15.8 source tree and writes the result in git format-patch form.

git clone https://github.com/lustre/lustre-release
cd lustre-release
git checkout 2.15.8
bash /path/to/generate_dirty_blockmap_patch.sh

Output: ../0001-llite-add-2GB-block-dirty-blockmap-via-user-xattr.patch

To apply to another tree:

git checkout 2.15.8
git apply 0001-llite-add-2GB-block-dirty-blockmap-via-user-xattr.patch

Build and install as you would any Lustre client RPM:

make rpms
dnf install -y kmod-lustre-client-*.rpm lustre-client-*.rpm

Requirements

  • Lustre 2.15.8 source tree
  • Lustre mount with user_xattr option (verify with mount | grep lustre)
  • Files must be ≥ 2 GB to be tracked
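
The user_xattr requirement can also be checked programmatically. The sketch below parses text in /proc/mounts layout and reports Lustre mount points whose option list lacks user_xattr. The function name and the sample lines are invented for illustration; on a real client the input would be the contents of /proc/mounts, and the exact option string shown by a given Lustre version is an assumption worth confirming by eye.

```python
def lustre_mounts_missing_user_xattr(mounts_text):
    """Return Lustre mount points without user_xattr in their options."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts layout: device, mount point, fs type, options, ...
        if len(fields) < 4 or fields[2] != "lustre":
            continue
        if "user_xattr" not in fields[3].split(","):
            missing.append(fields[1])
    return missing


# Made-up sample input, not captured from a real system
sample = (
    "10.0.0.1@tcp:/fs1 /lustre lustre rw,flock,user_xattr,lazystatfs 0 0\n"
    "10.0.0.1@tcp:/fs2 /scratch lustre rw,lazystatfs 0 0\n"
)
print(lustre_mounts_missing_user_xattr(sample))  # ['/scratch']
```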

Reading the Bitmap

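import errno, os, struct

def read_dirty_blockmap(path):
    BLOCK_SIZE = 2 * 1024**3  # 2 GB; must match the kernel module's block size

    try:
        data = os.getxattr(path, 'user.dirty_blockmap')
    except OSError as e:
        # ENODATA means the xattr is absent; anything else (missing file,
        # xattrs not supported on this mount) is a real error.
        if e.errno != errno.ENODATA:
            raise
        print(f"{path}: no dirty_blockmap (file < 2 GB or never written)")
        return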

    file_size  = os.stat(path).st_size
    nwords     = len(data) // 8
    words      = struct.unpack(f'<{nwords}Q', data)
    total_blks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    dirty_blks = sum(bin(w).count('1') for w in words)

    print(f"File:         {path}")
    print(f"Size:         {file_size:,} bytes  ({file_size / BLOCK_SIZE:.2f} × 2 GB blocks)")
    print(f"Dirty blocks: {dirty_blks} / {total_blks}")
    print(f"Block map:    ", end="")
    for w_idx, w in enumerate(words):
        for bit in range(64):
            block = w_idx * 64 + bit
            if block >= total_blks:
                break
            print((w >> bit) & 1, end="")
    print()

read_dirty_blockmap("/lustre/myfile.bin")

Example output for a 3 GB file where only the second 2 GB block was written:

File:         /lustre/myfile.bin
Size:         3,221,225,472 bytes  (1.50 × 2 GB blocks)
Dirty blocks: 1 / 2
Block map:    01
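
A backup or tiering tool usually wants byte ranges rather than a bit string. The sketch below turns the raw xattr value into a list of (start, end) byte ranges, merging adjacent dirty blocks and clipping the last range to the file size. The function name is hypothetical, and the block size is assumed to be the 2 GB compile-time default.

```python
import struct

BLOCK_SIZE = 2 * 1024**3  # 2 GB; must match the kernel module


def dirty_byte_ranges(data, file_size):
    """Return merged (start, end) byte ranges covered by dirty blocks."""
    nwords = len(data) // 8
    words = struct.unpack(f'<{nwords}Q', data[:nwords * 8])
    total_blks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    ranges = []
    for block in range(min(total_blks, nwords * 64)):
        if not (words[block // 64] >> (block % 64)) & 1:
            continue
        start = block * BLOCK_SIZE
        end = min(start + BLOCK_SIZE, file_size)   # clip the final block
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)      # extend the previous range
        else:
            ranges.append((start, end))
    return ranges
```

With the 3 GB example above (only block 1 dirty), the result is a single range from byte 2,147,483,648 to byte 3,221,225,472.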

Verified Test Cases

| Test | Expected result |
| --- | --- |
| dd (no fsync, O_TRUNC) into 3 GB file | 2 dirty blocks |
| pwrite at offset 0 into 3 GB sparse file | block 0 dirty (10) |
| pwrite at offset 2.5 GB | only block 1 dirty (01) |
| OR merge: write block 0, reopen, write block 1 | both blocks dirty (11) |
| Read-only open+close | xattr unchanged |
| File < 2 GB | no xattr, no overhead |

Limitations & Known Issues

  • Truncation: bits above the new file size are not cleared when a file is truncated. A future version should hook ll_setattr to zero stale bits.
  • Block size is fixed at compile time (2 GB). It is not encoded in the xattr, so the constant must match between the kernel module and any userspace reader.
  • Concurrency: the xattr update is a read-OR-write sequence with no distributed lock. Lustre has no client-side LCK_EX path for MDS_INODELOCK_XATTR (IT_SETXATTR is obsolete; ACL consistency relies on MDS-side serialization). Two nodes flushing concurrently at close() or write commit can race: both read the same xattr, both OR in their own bits, and the last writer wins, potentially dropping the other node's bits. With 2 GB block granularity this window is very narrow in practice. The consequence is a missed dirty block, not data corruption or data loss.
  • Tested on RHEL 9.7, kernel 5.14.0-611.36.1.el9_7.x86_64.
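
Until truncation is handled in the kernel, a userspace reader can work around stale bits by masking everything past the current end of file. The helper below is one possible sketch; its name is invented, and it assumes the reader uses the same 2 GB block size as the module.

```python
BLOCK_SIZE = 2 * 1024**3  # 2 GB; must match the kernel module


def mask_stale_bits(words, file_size):
    """Clear bits for blocks that lie beyond the current file size."""
    total_blks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    masked = []
    for w_idx, w in enumerate(words):
        # number of bits in this word that still map to real blocks
        valid = min(64, max(0, total_blks - w_idx * 64))
        masked.append(w & ((1 << valid) - 1))
    return masked


# A file truncated from 6 GB to 3 GB: block 2's bit is stale
print(mask_stale_bits([0b111], 3 * 1024**3))  # [3]
```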

License

GPL-2.0 — same as the Lustre source tree this patch applies to.
