@pfliu
Owner

@pfliu pfliu commented Sep 24, 2025

No description provided.

Pingfan Liu and others added 13 commits August 8, 2025 20:00
In later patches, the PE format parser will extract the Linux kernel
inside the image and try its real format parser. So make
kexec_image_load_default global.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
To: kexec@lists.infradead.org
The kexec PE format parser needs the kernel's built-in decompressor to
decompress the kernel image. So move the decompressor out of the __init
sections.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
To: linux-kernel@vger.kernel.org
In the secure kexec_file_load case, the buffer that holds the kernel
image must not be accessible from user space.

Typically, BPF data flow occurs between user space and kernel space in
either direction.  However, kexec_file_load presents a unique case where
user-originated data must be parsed and then forwarded to the kernel for
subsequent parsing stages.  This necessitates a mechanism to channel the
intermediate data from the BPF program directly to the kernel.

bpf_kexec_carrier() is introduced to serve that purpose.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
To: bpf@vger.kernel.org
This commit bridges the gap between bpf-prog and the kernel
decompression routines. At present, only a global memory allocator is
used for the decompression. Later, if needed, the decompress_fn's
prototype can be changed to pass in a task related allocator.

This memory allocator can allocate 2MB each time with a transient
virtual address, up to a 1GB limit.  After decompression finishes, it
presents all of the decompressed data in a new unified virtual
address space.
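
The chunked scheme described above can be sketched in user-space C. This
is only an illustration under assumptions: chunk_pool, pool_grab() and
pool_unify() are hypothetical names, malloc() stands in for the kernel's
page allocation, and a plain copy stands in for whatever remapping the
real allocator performs to build the unified view.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (2UL << 20)			/* 2MB per allocation */
#define POOL_LIMIT (1UL << 30)			/* 1GB overall limit */
#define MAX_CHUNKS (POOL_LIMIT / CHUNK_SIZE)

/* Hypothetical pool that tracks the chunks handed out so far. */
struct chunk_pool {
	void *chunks[MAX_CHUNKS];
	size_t nr_chunks;
};

/* Hand out one more 2MB chunk, or NULL once the 1GB limit is reached. */
void *pool_grab(struct chunk_pool *pool)
{
	void *chunk;

	if (pool->nr_chunks >= MAX_CHUNKS)
		return NULL;
	chunk = malloc(CHUNK_SIZE);
	if (chunk)
		pool->chunks[pool->nr_chunks++] = chunk;
	return chunk;
}

/*
 * Present the first 'len' bytes of output as one contiguous buffer and
 * release the transient chunks. Returns NULL if the buffer cannot be
 * allocated; the chunks are released either way.
 */
void *pool_unify(struct chunk_pool *pool, size_t len)
{
	unsigned char *out;
	size_t i, copied = 0;

	assert(len <= pool->nr_chunks * CHUNK_SIZE);
	out = malloc(len ? len : 1);
	for (i = 0; i < pool->nr_chunks; i++) {
		size_t left = len - copied;
		size_t n = left < CHUNK_SIZE ? left : CHUNK_SIZE;

		if (out && n) {
			memcpy(out + copied, pool->chunks[i], n);
			copied += n;
		}
		free(pool->chunks[i]);
	}
	pool->nr_chunks = 0;
	return out;
}
```

A caller would grab chunks while the decompressor produces output and
call pool_unify() once with the total output length.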

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
To: bpf@vger.kernel.org
As UEFI becomes popular, a few architectures support booting a PE format
kernel image directly. But the internals of the PE format vary, which
means a separate parser is needed for each format.

This patch (with the rest in this series) introduces a common skeleton
for all parsers and leaves the format parsing to the bpf-prog, so the
kernel code can remain relatively stable.

A new kexec_file_ops is implemented, named pe_image_ops.

There are some placeholder functions in this patch. (They will take
effect after the introduction of the kexec bpf light skeleton and bpf
helpers.) Overall, the parsing process is a pipeline. The current
bpf-prog parser is attached to bpf_handle_pefile() and detached at the
end of the current stage by disarm_bpf_prog(). The result parsed by the
current bpf-prog is buffered in the kernel by prepare_nested_pe() and
delivered to the next stage. For each stage, the bpf bytecode is
extracted from the '.bpf' section in the PE file.
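
The staged flow can be modelled with a toy user-space sketch. Everything
here is hypothetical (toy_image, pipeline_stats, run_stage(),
run_pipeline() and the MAX_STAGES cap are not taken from the patch); it
only illustrates the attach, parse, detach, hand-over loop and does not
involve real BPF programs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_STAGES 8	/* hypothetical cap on nesting depth */

/* Toy image: a PE container wraps 'inner'; a plain kernel has no inner. */
struct toy_image {
	bool is_pe;
	const struct toy_image *inner;
};

/* Bookkeeping so the attach/detach pairing can be checked. */
struct pipeline_stats {
	int attached;
	int detached;
};

/* One stage: arm the parser, let it unwrap the payload, disarm it. */
const struct toy_image *run_stage(const struct toy_image *img,
				  struct pipeline_stats *st)
{
	const struct toy_image *result;

	st->attached++;		/* parser taken from this image's '.bpf' section */
	result = img->inner;	/* parsed result is buffered for the next stage */
	st->detached++;		/* parser is detached before the next stage runs */
	return result;
}

/* Unwrap PE containers stage by stage until a non-PE image remains. */
const struct toy_image *run_pipeline(const struct toy_image *img,
				     struct pipeline_stats *st)
{
	int stage;

	for (stage = 0; img && img->is_pe && stage < MAX_STAGES; stage++)
		img = run_stage(img, st);

	/* Still a container after MAX_STAGES: treat it as a failure. */
	return (img && img->is_pe) ? NULL : img;
}
```

With a container that wraps a second container that wraps a kernel,
run_pipeline() runs two stages and returns the innermost image.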

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
To: kexec@lists.infradead.org
This patch does two things:
First, register as a listener on bpf_copy_to_kernel().
Second, so that the hooked bpf-prog can call sleepable kfuncs,
bpf_handle_pefile and bpf_post_handle_pefile are marked as
KF_SLEEPABLE.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
Analogous to kernel/bpf/preload/iterators/Makefile,
this Makefile is not invoked by the Kbuild system. It needs to be
invoked manually when kexec_pe_parser_bpf.c is changed so that
kexec_pe_parser_bpf.lskel.h can be re-generated by the command "bpftool
gen skeleton -L kexec_pe_parser_bpf.o".

kexec_pe_parser_bpf.lskel.h is used directly by the kernel kexec code in
a later patch. For this patch, bpf bytecode is contained in
opts_data[] and opts_insn[] in kexec_pe_parser_bpf.lskel.h, but in the
following patch, they will be removed and only the function API in
kexec_pe_parser_bpf.lskel.h will be left.

As exposed in kexec_pe_parser_bpf.lskel.h, the interface between the
bpf-prog and the kernel consists of:

four maps:
                struct bpf_map_desc ringbuf_1;
                struct bpf_map_desc ringbuf_2;
                struct bpf_map_desc ringbuf_3;
                struct bpf_map_desc ringbuf_4;
four sections:
                struct bpf_map_desc rodata;
                struct bpf_map_desc data;
                struct bpf_map_desc bss;
                struct bpf_map_desc rodata_str1_1;

two progs:
        SEC("fentry.s/bpf_handle_pefile")
        SEC("fentry.s/bpf_post_handle_pefile")

They are fixed and provided for all kinds of bpf-progs that interact
with the kexec kernel component.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
The routine to search for a symbol in an ELF can be shared, so split it out.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
To: kexec@lists.infradead.org
All kexec PE bpf progs should align with the interface exposed by the
light skeleton:
    four maps:
                    struct bpf_map_desc ringbuf_1;
                    struct bpf_map_desc ringbuf_2;
                    struct bpf_map_desc ringbuf_3;
                    struct bpf_map_desc ringbuf_4;
    four sections:
                    struct bpf_map_desc rodata;
                    struct bpf_map_desc data;
                    struct bpf_map_desc bss;
                    struct bpf_map_desc rodata_str1_1;
    two progs:
            SEC("fentry.s/bpf_handle_pefile")
            SEC("fentry.s/bpf_post_handle_pefile")

With the above presumption, the integration consists of two parts:
  1. Call the API exposed by the light skeleton from kexec.
  2. The opts_insn[] and opts_data[] are bpf-prog dependent and
     can be extracted and passed in from user space. In the
     kexec_file_load design, a PE file has a .bpf section whose
     content is an ELF, and the ELF contains opts_insn[] and opts_data[].
     As a bonus, the BPF bytecode is placed under the protection of the
     entire PE signature.
     (Note: since opts_insn[] contains the ringbuf size information,
      the bpf-prog writer can adjust the size according to
      the kernel image size without modifying the kernel code.)

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
Now everything is ready for the kexec PE image parser. Select it on arm64
for zboot and UKI image support.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
To: linux-arm-kernel@lists.infradead.org
This BPF program aligns with the convention defined in the kernel file
kexec_pe_parser_bpf.lskel.h, where the interface between the BPF program
and the kernel is established, and is composed of:
    four maps:
                    struct bpf_map_desc ringbuf_1;
                    struct bpf_map_desc ringbuf_2;
                    struct bpf_map_desc ringbuf_3;
                    struct bpf_map_desc ringbuf_4;
    four sections:
                    struct bpf_map_desc rodata;
                    struct bpf_map_desc data;
                    struct bpf_map_desc bss;
                    struct bpf_map_desc rodata_str1_1;

    two progs:
            SEC("fentry.s/bpf_handle_pefile")
            SEC("fentry.s/bpf_post_handle_pefile")

This BPF program only uses ringbuf_1, so it minimizes the size of the
other three ringbufs to one byte.  The size of ringbuf_1 is deduced from
the size of the uncompressed file 'vmlinux.bin', which is usually less
than 64MB. With the help of a group of bpf kfuncs: bpf_decompress(),
bpf_copy_to_kernel(), bpf_mem_range_result_put(), this bpf-prog stores
the uncompressed kernel image inside the kernel space.

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
The objcopy binary can append a section to a PE file, but it disregards
the DOS header. However, the zboot format carries important information
in the DOS header: payload offset and size.

To track this information and append a new PE section, a dedicated
binary tool is introduced to build zboot images. The payload's relative
offset within the .data section remains unchanged.  Therefore, the .data
section offset in the new PE file, plus the payload offset within that
section, yields the payload offset within the new PE file.
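
The offset rule stated above amounts to a small piece of arithmetic. The
sketch below uses a hypothetical helper name and plain file offsets as
parameters; it is not the tool's actual code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Recompute the payload's file offset after the .data section has moved.
 * The payload keeps its position relative to the start of .data, so only
 * the section's own file offset changes.
 */
uint32_t new_payload_offset(uint32_t old_payload_off,
			    uint32_t old_data_off,
			    uint32_t new_data_off)
{
	uint32_t rel;

	/* The payload must live inside (or at the start of) .data. */
	assert(old_payload_off >= old_data_off);
	rel = old_payload_off - old_data_off;

	return new_data_off + rel;
}
```

For example, a payload 0x1000 bytes into a .data section that moves from
file offset 0x10000 to 0x12000 ends up at file offset 0x13000.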

Finally, the new PE file 'zboot.efi' can be built with the command:
  make -C tools/kexec zboot

Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
@pfliu
Owner Author

pfliu commented Sep 24, 2025

/gemini review

@gemini-code-assist

Summary of Changes

Hello @pfliu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Linux kernel's kexec functionality by introducing a BPF-driven mechanism for loading and parsing UEFI Portable Executable (PE) images. This new approach allows for flexible and dynamic handling of complex boot image formats, such as zboot, where various components like the kernel, initrd, and command line might be compressed or nested within the PE file. By integrating BPF, the kernel can delegate the intricate parsing logic to user-defined programs, making the kexec process more adaptable to diverse boot environments and future format evolutions without requiring kernel recompilation for each new format.

Highlights

  • BPF-assisted PE Image Loading: Introduces a new kexec file loader (kexec_pe_image_ops) that leverages BPF programs to parse UEFI Portable Executable (PE) files, enabling dynamic and extensible handling of boot images.
  • Dynamic Parsing with Embedded BPF: BPF programs, embedded within a dedicated .bpf section of a PE image, can now dynamically extract and decompress kernel, initrd, and command line components during the kexec loading process.
  • New BPF Kernel Functions and Infrastructure: Adds essential BPF kernel functions (bpf_decompress, bpf_copy_to_kernel) and supporting data structures (mem_range_result, carrier_listener) to allow BPF programs to manage kernel memory, perform decompression, and pass extracted resources back to the kexec loader.
  • Zboot Image Support and Tooling: Provides a userspace tool (zboot_image_builder) to inject BPF parsing logic into zboot EFI images, ensuring proper handling of compressed kernel payloads while maintaining PE header integrity. A corresponding BPF program (zboot_parser_bpf.c) is also introduced for this purpose.
  • ARM64 Architecture Integration: Enables the KEXEC_PE_IMAGE configuration option for ARM64 architectures, allowing this new PE image loading mechanism to be utilized on ARM64 systems.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature for kexec, allowing PE files to be loaded and parsed with the help of BPF programs. This is a complex change that adds a new BPF "carrier" framework for passing data from BPF to the kernel, a PE image loader, and BPF programs for parsing specific PE formats like zboot. While the overall architecture is sound, my review has identified several critical issues, including a mutex deadlock and use-after-free vulnerability in the BPF decompression helper, a memory leak in the PE image loader's error path, and other high-severity bugs related to incorrect resource management and logic. These critical issues must be addressed before this change can be merged.

Comment on lines +3906 to +3911
if (!decompressor) {
	kfree(range);
	vfree(input_buf);
	pr_err("Can not find decompress method\n");
	goto error;
}

critical

There are critical issues in this error handling path:

  1. If decompress_method() returns NULL, the code jumps to the error: label while output_buf_mutex is still held. This will cause the function to return without releasing the mutex, leading to a deadlock on subsequent calls.
  2. In the same error path, range is freed via kfree(range), but the function proceeds to return this now-dangling range pointer, which is a use-after-free vulnerability.

The error handling logic must be refactored to ensure the mutex is always unlocked and resources are cleaned up correctly on all paths. Failures after acquiring the mutex should jump to a cleanup path that unlocks the mutex before returning.
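
A minimal user-space sketch of the cleanup structure described above,
using a pthread mutex in place of the kernel mutex. All names here
(decompress_sketch(), range_result, the have_decompressor flag) are
hypothetical, and the order of locking and allocation is an assumption,
not the actual layout of bpf_decompress().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

pthread_mutex_t output_buf_lock = PTHREAD_MUTEX_INITIALIZER;

struct range_result {
	void *buf;
	size_t size;
};

/*
 * 'have_decompressor' stands in for the decompressor lookup succeeding.
 * Every exit after the lock is taken goes through out_unlock, and a freed
 * range is never returned to the caller.
 */
struct range_result *decompress_sketch(size_t size, int have_decompressor,
				       int *err)
{
	struct range_result *range;
	void *input_buf;

	*err = 0;
	pthread_mutex_lock(&output_buf_lock);

	range = calloc(1, sizeof(*range));
	input_buf = malloc(size ? size : 1);
	if (!range || !input_buf) {
		*err = -ENOMEM;
		goto out_free;
	}

	if (!have_decompressor) {
		*err = -EINVAL;
		goto out_free;
	}

	/* Success: the buffer is now owned by the result. */
	range->buf = input_buf;
	range->size = size;
	goto out_unlock;

out_free:
	free(input_buf);
	free(range);
	range = NULL;	/* do not hand back a dangling pointer */
out_unlock:
	pthread_mutex_unlock(&output_buf_lock);
	return range;
}
```

Because the failure path releases the lock, a later call can still
acquire it and succeed.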

Comment on lines +440 to +452
for (int i = 0; i < 3; i++) {
	r = parsed_resource[i].r;
	if (!r)
		continue;
	parsed_resource[i].r = NULL;
	/*
	 * The release of buffer defers to
	 * kimage_file_post_load_cleanup()
	 */
	r->buf = NULL;
	r->buf_sz = 0;
	mem_range_result_put(r);
}

critical

There is a critical memory leak in the error handling path. The cleanup logic at the err: label is also used for the success path. It unconditionally sets r->buf = NULL before calling mem_range_result_put(r).

On a failure path, this prevents __mem_range_result_free from freeing the buffer pointed to by r->buf, causing a memory leak. The cleanup logic should differentiate between success and failure. On failure, mem_range_result_put(r) should be called without modifying r to ensure the buffer is freed.

err:
	/* On success, ldata is not an error pointer. */
	bool success = !IS_ERR(ldata);

	for (int i = 0; i < 3; i++) {
		r = parsed_resource[i].r;
		if (!r)
			continue;
		parsed_resource[i].r = NULL;
		/*
		 * On success, buffer ownership is transferred to kimage.
		 * Prevent freeing it here. On failure, the buffer should be
		 * freed by mem_range_result_put().
		 */
		if (success) {
			r->buf = NULL;
			r->buf_sz = 0;
		}
		mem_range_result_put(r);
	}

Comment on lines +143 to +146
if (range->alloc_type == TYPE_KALLOC)
	kfree(kbuf);
else
	vfree(kbuf);

high

There is a bug in this error handling block. You are checking range->alloc_type to determine which deallocation function to use for kbuf. However, range->alloc_type is not initialized until line 157. At this point, it holds an uninitialized value, which will lead to undefined behavior. You should use the local alloc_type variable, which holds the correct value.

if (alloc_type == TYPE_KALLOC)
	kfree(kbuf);
else
	vfree(kbuf);

image->cmdline_buf_len = cmdline_len;
}

if (kernel == NULL || initrd == NULL || cmdline == NULL) {

high

This check is too strict. A valid kexec image might not have an initrd, and the command line can be empty. The kexec_file_load syscall supports these cases. This check will cause pe_image_load to fail for valid scenarios. It should be relaxed to only require that the kernel is present.

if (kernel == NULL) {

strcpy(c, "cmdline ");
c += strlen("cmdline ");
}
c = '\0';

high

There is a bug in the error message construction. c = '\0'; assigns the null character's value (0) to the pointer c, which is incorrect and will likely cause a crash. To null-terminate the string, you should write to the memory location pointed to by c.

		*c = '\0';
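
As an alternative to manual pointer bookkeeping, the list of missing
components could be built with bounded appends. The following is a
user-space sketch with hypothetical helper names (append_word(),
describe_missing()); in kernel code a helper such as scnprintf() would
play the role that snprintf() plays here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append 'word' and a trailing space at offset 'pos' without writing past
 * 'size' bytes. Returns the new offset; the buffer stays NUL-terminated.
 */
size_t append_word(char *buf, size_t size, size_t pos, const char *word)
{
	int n;

	if (pos >= size)
		return pos;
	n = snprintf(buf + pos, size - pos, "%s ", word);
	if (n < 0)
		return pos;
	/* snprintf() reports the untruncated length; clamp to what fit. */
	if ((size_t)n >= size - pos)
		return size - 1;
	return pos + (size_t)n;
}

/* Fill 'buf' with the names of the components that are NULL. */
size_t describe_missing(char *buf, size_t size, const void *kernel,
			const void *initrd, const void *cmdline)
{
	size_t pos = 0;

	assert(size > 0);
	buf[0] = '\0';
	if (!kernel)
		pos = append_word(buf, size, pos, "kernel");
	if (!initrd)
		pos = append_word(buf, size, pos, "initrd");
	if (!cmdline)
		pos = append_word(buf, size, pos, "cmdline");
	return pos;
}
```

The returned offset is the length of the string, so no separate
termination step is needed.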

Comment on lines +55 to +56
This option makes the kexec_file_load() syscall cooperates with bpf-prog
to parse PE format file

medium

The help text has some minor grammatical errors and could be clearer. Consider rephrasing for better readability.

  This option enables the kexec_file_load() syscall to cooperate with a BPF
  program to parse a PE format file.

return __builtin_memcmp(local_name, res_kernel, 4);
}

SEC("fentry.s/bpf_handle_pefile")

medium

The section name fentry.s/bpf_handle_pefile appears to be incorrect. The standard format for fentry/fexit programs is fentry/FUNCTION_NAME. The .s is likely a typo and should be removed.

SEC("fentry/bpf_handle_pefile")

return 0;
}

SEC("fentry.s/bpf_post_handle_pefile")

medium

Similar to the other fentry program, the section name fentry.s/bpf_post_handle_pefile appears to be incorrect. The .s is likely a typo and should be removed.

SEC("fentry/bpf_post_handle_pefile")

Comment on lines +275 to +277
close(base_fd);
close(out_fd);
close(bpf_fd);

medium

This function has resource leaks. The file pointer bin_fp is opened with fopen but never closed. Additionally, bpf_fd is closed but was never opened, which will attempt to close an arbitrary file descriptor.

	close(base_fd);
	close(out_fd);
	fclose(bin_fp);

/* Overwrite buf */
bpf_probe_read((void *)buf, payload_size, context->image + payload_offset);
bpf_printk("Calling bpf_kexec_decompress()\n");
struct mem_range_result *r = bpf_decompress(buf, payload_size - 4);

medium

The expression payload_size - 4 uses a magic number 4. Please add a comment to explain why 4 bytes are being subtracted from the payload size before decompression. This will improve code readability and maintainability.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature enabling kexec to load and parse PE files using BPF programs. This is an innovative approach that involves adding new BPF helpers, a generic "carrier" mechanism for BPF-to-kernel data transfer, and new tooling for embedding BPF programs into PE files. The changes are extensive and touch multiple kernel subsystems. My review has identified a few critical and high-severity bugs in the implementation of the new BPF helpers and the PE image loading logic, along with some medium-severity issues in the supporting tools and the BPF program itself. These should be addressed to ensure the stability and correctness of this new functionality.

Comment on lines +141 to +150
ret = copy_from_kernel_nofault(kbuf, buf, size);
if (unlikely(ret < 0)) {
	if (range->alloc_type == TYPE_KALLOC)
		kfree(kbuf);
	else
		vfree(kbuf);
	kfree(range);
	ret = -EINVAL;
	goto err;
}

critical

This error handling block has a couple of issues:

  1. copy_from_kernel_nofault() returns an unsigned long indicating the number of bytes not copied. A non-zero value means failure. The current check ret < 0 is incorrect because the return value is never negative. Also, storing the result in an int can cause truncation.
  2. The free logic uses range->alloc_type, but this field is not initialized until line 157. The local alloc_type variable should be used instead.

Here's a suggested fix that addresses both points. I've also changed the error code to -EFAULT which is more appropriate for a copy failure.

if (unlikely(copy_from_kernel_nofault(kbuf, buf, size))) {
	if (alloc_type == TYPE_KALLOC)
		kfree(kbuf);
	else
		vfree(kbuf);
	kfree(range);
	ret = -EFAULT;
	goto err;
}

Comment on lines +403 to +423
if (kernel == NULL || initrd == NULL || cmdline == NULL) {
	char *c, buf[64];

	c = buf;
	if (kernel == NULL) {
		strcpy(c, "kernel ");
		c += strlen("kernel ");
	}
	if (initrd == NULL) {
		strcpy(c, "initrd ");
		c += strlen("initrd ");
	}
	if (cmdline == NULL) {
		strcpy(c, "cmdline ");
		c += strlen("cmdline ");
	}
	c = '\0';
	pr_err("Can not extract data for %s", buf);
	ldata = ERR_PTR(-EINVAL);
	goto err;
}

high

This error handling logic has several flaws:

  1. It treats a missing initrd or cmdline as a fatal error by checking them with ||. These are often optional, and only a missing kernel should be a fatal error.
  2. The error message construction is buggy. It uses strcpy in a loop, which overwrites the buffer on each iteration instead of appending. For example, if both kernel and initrd are NULL, the buffer will only contain "initrd ".
  3. c = '\0'; is incorrect for null-terminating a string. It should be *c = '\0';.

I suggest refactoring this to only fail on a missing kernel and issue warnings for other missing components.

if (!kernel) {
	pr_err("Can not extract kernel from PE container\n");
	ldata = ERR_PTR(-EINVAL);
	goto err;
}

if (!initrd)
	pr_warn("Could not find initrd in PE container\n");

if (!cmdline)
	pr_warn("Could not find cmdline in PE container\n");

Comment on lines +3723 to +3727
/*
 * At present, one global allocator for decompression. Later if needed, changing the
 * prototype of decompress_fn to introduce each task's allocator.
 */
static DEFINE_MUTEX(output_buf_mutex);

medium

The use of a global mutex (output_buf_mutex) and a single global allocator instance (dcmpr_allocator) makes the bpf_decompress kfunc non-reentrant. While this might be acceptable for the current kexec use case where concurrent calls are unlikely, it limits the general applicability of this helper for other potential use cases in the future. As noted in the comment on line 3724, making this mechanism re-entrant would be a valuable improvement for wider adoption.

munmap(base_start_addr, sb.st_size);
close(base_fd);
close(out_fd);
close(bpf_fd);

medium

The file descriptor bpf_fd is closed here, but it was never initialized or opened. Meanwhile, the file pointer bin_fp is opened with fopen but never closed, leading to a resource leak. You should close bin_fp using fclose.

fclose(bin_fp);

/* Overwrite buf */
bpf_probe_read((void *)buf, payload_size, context->image + payload_offset);
bpf_printk("Calling bpf_kexec_decompress()\n");
struct mem_range_result *r = bpf_decompress(buf, payload_size - 4);

medium

The subtraction of 4 from payload_size appears to be a magic number. To improve code clarity and maintainability, please add a comment explaining why these 4 bytes are being excluded from the decompression payload. For instance, it might be a checksum or size field that isn't part of the compressed stream.

pfliu pushed a commit that referenced this pull request Dec 1, 2025
The kernel forbids the creation of non-FDB nexthop groups with FDB
nexthops:

 # ip nexthop add id 1 via 192.0.2.1 fdb
 # ip nexthop add id 2 group 1
 Error: Non FDB nexthop group cannot have fdb nexthops.

And vice versa:

 # ip nexthop add id 3 via 192.0.2.2 dev dummy1
 # ip nexthop add id 4 group 3 fdb
 Error: FDB nexthop group can only have fdb nexthops.

However, as long as no routes are pointing to a non-FDB nexthop group,
the kernel allows changing the type of a nexthop from FDB to non-FDB and
vice versa:

 # ip nexthop add id 5 via 192.0.2.2 dev dummy1
 # ip nexthop add id 6 group 5
 # ip nexthop replace id 5 via 192.0.2.2 fdb
 # echo $?
 0

This configuration is invalid and can result in a NPD [1] since FDB
nexthops are not associated with a nexthop device:

 # ip route add 198.51.100.1/32 nhid 6
 # ping 198.51.100.1

Fix by preventing nexthop FDB status change while the nexthop is in a
group:

 # ip nexthop add id 7 via 192.0.2.2 dev dummy1
 # ip nexthop add id 8 group 7
 # ip nexthop replace id 7 via 192.0.2.2 fdb
 Error: Cannot change nexthop FDB status while in a group.

[1]
BUG: kernel NULL pointer dereference, address: 00000000000003c0
[...]
Oops: Oops: 0000 [#1] SMP
CPU: 6 UID: 0 PID: 367 Comm: ping Not tainted 6.17.0-rc6-virtme-gb65678cacc03 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014
RIP: 0010:fib_lookup_good_nhc+0x1e/0x80
[...]
Call Trace:
 <TASK>
 fib_table_lookup+0x541/0x650
 ip_route_output_key_hash_rcu+0x2ea/0x970
 ip_route_output_key_hash+0x55/0x80
 __ip4_datagram_connect+0x250/0x330
 udp_connect+0x2b/0x60
 __sys_connect+0x9c/0xd0
 __x64_sys_connect+0x18/0x20
 do_syscall_64+0xa4/0x2a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: 38428d6 ("nexthop: support for fdb ecmp nexthops")
Reported-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68c9a4d2.050a0220.3c6139.0e63.GAE@google.com/
Tested-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250921150824.149157-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pfliu pushed a commit that referenced this pull request Dec 1, 2025
Ido Schimmel says:

====================
nexthop: Various fixes

Patch #1 fixes a NPD that was recently reported by syzbot.

Patch #2 fixes an issue in the existing FIB nexthop selftest.

Patch #3 extends the selftest with test cases for the bug that was fixed
in the first patch.
====================

Link: https://patch.msgid.link/20250921150824.149157-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pfliu pushed a commit that referenced this pull request Dec 1, 2025
Add 0x29 as the accelerometer address for the Dell Latitude E6530 to
lis3lv02d_devices[].

The address was verified as below:

    $ cd /sys/bus/pci/drivers/i801_smbus/0000:00:1f.3
    $ ls -d i2c-*
    i2c-20
    $ sudo modprobe i2c-dev
    $ sudo i2cdetect 20
    WARNING! This program can confuse your I2C bus, cause data loss and worse!
    I will probe file /dev/i2c-20.
    I will probe address range 0x08-0x77.
    Continue? [Y/n] Y
         0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
    00:                         08 -- -- -- -- -- -- --
    10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    20: -- -- -- -- -- -- -- -- -- UU -- 2b -- -- -- --
    30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    40: -- -- -- -- 44 -- -- -- -- -- -- -- -- -- -- --
    50: UU -- 52 -- -- -- -- -- -- -- -- -- -- -- -- --
    60: -- 61 -- -- -- -- -- -- -- -- -- -- -- -- -- --
    70: -- -- -- -- -- -- -- --
    $ cat /proc/cmdline
    BOOT_IMAGE=/vmlinuz-linux-cachyos-bore root=UUID=<redacted> rw loglevel=3 quiet dell_lis3lv02d.probe_i2c_addr=1
    $ sudo dmesg
    [    0.000000] Linux version 6.16.6-2-cachyos-bore (linux-cachyos-bore@cachyos) (gcc (GCC) 15.2.1 20250813, GNU ld (GNU Binutils) 2.45.0) #1 SMP PREEMPT_DYNAMIC Thu, 11 Sep 2025 16:01:12 +0000
    […]
    [    0.000000] DMI: Dell Inc. Latitude E6530/07Y85M, BIOS A22 11/30/2018
    […]
    [    5.166442] i2c i2c-20: Probing for lis3lv02d on address 0x29
    [    5.167854] i2c i2c-20: Detected lis3lv02d on address 0x29, please report this upstream to platform-driver-x86@vger.kernel.org so that a quirk can be added
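
As a rough illustration of what such a quirk entry amounts to, the sketch below models a per-model address table in user space. The struct, the table name accel_quirks and the helper accel_addr_for() are hypothetical and are not the driver's actual definitions; the exact string comparison is likewise an assumption made for brevity. Only the "Latitude E6530" / 0x29 pair comes from the log above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of a per-model quirk table. */
struct accel_quirk {
    const char *product_name;  /* DMI product name */
    unsigned short i2c_addr;   /* 7-bit I2C address of the accelerometer */
};

static const struct accel_quirk accel_quirks[] = {
    { "Latitude E6530", 0x29 },  /* address reported in the log above */
};

/* Return the known address for a model, or 0 when there is no entry. */
static unsigned short accel_addr_for(const char *product_name)
{
    size_t i;

    assert(product_name != NULL);
    for (i = 0; i < sizeof(accel_quirks) / sizeof(accel_quirks[0]); i++) {
        if (strcmp(accel_quirks[i].product_name, product_name) == 0)
            return accel_quirks[i].i2c_addr;
    }
    return 0;
}
```

With an entry like this in place, a lookup for the model succeeds directly and the address no longer has to be discovered through the probe_i2c_addr fallback.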

Signed-off-by: Nickolay Goppen <setotau@mainlining.org>
Reviewed-by: Hans de Goede <hansg@kernel.org>
Link: https://patch.msgid.link/20250917-dell-lis3lv02d-latitude-e6530-v1-1-8a6dec4e51e9@mainlining.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
pfliu pushed a commit that referenced this pull request Dec 1, 2025
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():

    BUG: unable to handle page fault for address: ffffbc3840291000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
    Oops: 0000 [#1] SMP NOPTI
    CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G                 N  6.17.0-rc3 #10 PREEMPT(voluntary)
    Tainted: [N]=TEST
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
    RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
    [...]
    Call Trace:
    <TASK>
    __msan_memset+0xee/0x1a0
    sha224_final+0x9e/0x350
    test_hash_buffer_overruns+0x46f/0x5f0
    ? kmsan_get_shadow_origin_ptr+0x46/0xa0
    ? __pfx_test_hash_buffer_overruns+0x10/0x10
    kunit_try_run_case+0x198/0xa00

This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e.  the next page is unmapped.

The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned.  Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer.  However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address.  This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.

To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.
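
The idea can be sketched in a small user-space model. This is not the kernel code: set_shadow_origin(), ORIGIN_SIZE and the flat shadow/origins arrays are hypothetical stand-ins, assuming one shadow byte per memory byte and one 32-bit origin slot per 4 memory bytes. The point it illustrates is that the shadow word checked for each origin slot is read from the aligned position, so the read never runs past the shadow that backs the slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORIGIN_SIZE 4u  /* one origin slot covers 4 bytes of memory */

/*
 * Hypothetical model: shadow[] holds n bytes (n a multiple of ORIGIN_SIZE),
 * origins[] holds n / ORIGIN_SIZE slots. Sets the shadow of [off, off + size)
 * to b and updates every origin slot that overlaps that range.
 */
static void set_shadow_origin(uint8_t *shadow, uint32_t *origins, size_t n,
                              size_t off, size_t size, uint8_t b,
                              uint32_t origin)
{
    size_t pad = off % ORIGIN_SIZE;            /* bytes before off in its slot */
    size_t first = (off - pad) / ORIGIN_SIZE;  /* first overlapping slot */
    size_t count = (pad + size + ORIGIN_SIZE - 1) / ORIGIN_SIZE;
    size_t i;
    uint32_t word;

    assert(n % ORIGIN_SIZE == 0 && off <= n && size <= n - off);

    memset(shadow + off, b, size);

    for (i = 0; i < count; i++) {
        /* Read the 4 shadow bytes at the aligned slot position, not at off. */
        memcpy(&word, shadow + (first + i) * ORIGIN_SIZE, sizeof(word));
        /* A zero origin only replaces a slot whose shadow is fully clear. */
        if (origin || word == 0)
            origins[first + i] = origin;
    }
}
```

In this model, clearing the shadow of the last three bytes of an 8-byte region leaves the slot's origin intact, because the first byte of that slot is still poisoned. Reading the shadow word from the unaligned offset instead would touch one byte past the end of the array, which is the model's analogue of the crash described above.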

Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org
Fixes: 2ef3cec ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pfliu pushed a commit that referenced this pull request Dec 1, 2025
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 and reaches
pagemap_scan_backout_range(), the kernel panics with a null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 #22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into the same dereference issue are already
(directly or transitively) protected by checking p->vec_buf.
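
A minimal user-space sketch of the guard follows. The types struct region and struct scan_state, the constant MODEL_PAGE_SIZE and the function backout_range() are simplified, hypothetical stand-ins for the kernel structures, and the early return (which also skips the page-count adjustment) is an assumption of this model, not a statement about the actual patch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumed page size for this model */

struct region {
    unsigned long start, end;
};

/* Hypothetical stand-in for the scan's private state. */
struct scan_state {
    struct region *vec_buf;     /* stays NULL when vec_len == 0 */
    size_t vec_buf_index;
    unsigned long found_pages;
};

/* Undo the part of the current region that starts at addr. */
static void backout_range(struct scan_state *p, unsigned long addr,
                          unsigned long end)
{
    struct region *cur;

    assert(end >= addr);

    /* No output buffer was allocated, so there is nothing to back out. */
    if (!p->vec_buf)
        return;

    cur = &p->vec_buf[p->vec_buf_index];
    if (cur->start != addr)
        cur->end = addr;               /* trim the tail of the region */
    else
        cur->start = cur->end = 0;     /* the whole region is dropped */

    p->found_pages -= (end - addr) / MODEL_PAGE_SIZE;
}
```

Checking the pointer before forming cur keeps the function safe for callers that request only side effects and therefore never allocate an output vector.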

Note:
According to the PAGEMAP_SCAN man page, vec_len = 0 appears to be valid
when no output is requested and the caller is only interested in the side
effects, hence it passes the check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>