Kexec bpf v5 #1
In later patches, the PE format parser will extract the Linux kernel embedded in the image and hand it to its real format parser, so make kexec_image_load_default global.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
To: kexec@lists.infradead.org
The kexec PE format parser needs the kernel's built-in decompressor to decompress the kernel image, so move the decompressor out of the __init sections.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
To: linux-kernel@vger.kernel.org
In the secure kexec_file_load case, the buffer that holds the kernel image must not be accessible from user space. Typically, BPF data flows between user space and kernel space, in either direction. kexec_file_load is a special case: user-originated data must be parsed and then forwarded to the kernel for the subsequent parsing stages. This requires a mechanism that channels the intermediate data from the BPF program directly to the kernel. bpf_kexec_carrier() is introduced to serve that purpose.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
To: bpf@vger.kernel.org
This commit bridges the gap between the bpf-prog and the kernel decompression routines. At present, only a global memory allocator is used for decompression. Later, if needed, the prototype of decompress_fn can be changed to pass in a task-related allocator. The allocator hands out 2MB at a time, each with a transient virtual address, up to a 1GB limit. After decompression finishes, it presents all of the decompressed data in a new unified virtual address space.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
To: bpf@vger.kernel.org
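The allocation scheme in the commit message above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: chunk_allocator, chunk_write() and chunk_finish() are hypothetical names that do not come from the patch, malloc() stands in for the kernel allocator, and the final copy stands in for presenting the chunks at one unified virtual address, which the kernel code may well do by remapping rather than copying.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE  (2UL << 20)	/* 2MB per allocation, per the commit message */
#define TOTAL_LIMIT (1UL << 30)	/* 1GB cap, per the commit message */
#define MAX_CHUNKS  (TOTAL_LIMIT / CHUNK_SIZE)

/* Hypothetical user-space stand-in for the allocator state. */
struct chunk_allocator {
	unsigned char *chunks[MAX_CHUNKS];
	size_t nr_chunks;
	size_t used;		/* total bytes written so far */
};

/* Append len bytes of decompressed output, growing one chunk at a time. */
static int chunk_write(struct chunk_allocator *a, const void *src, size_t len)
{
	const unsigned char *p = src;

	while (len) {
		size_t idx = a->used / CHUNK_SIZE;
		size_t off = a->used % CHUNK_SIZE;
		size_t n = CHUNK_SIZE - off;

		if (idx >= MAX_CHUNKS)
			return -1;	/* would exceed the 1GB limit */
		if (idx == a->nr_chunks) {
			a->chunks[idx] = malloc(CHUNK_SIZE);
			if (!a->chunks[idx])
				return -1;
			a->nr_chunks++;
		}
		if (n > len)
			n = len;
		memcpy(a->chunks[idx] + off, p, n);
		a->used += n;
		p += n;
		len -= n;
	}
	return 0;
}

/* Present all output as one contiguous buffer, then release the chunks. */
static unsigned char *chunk_finish(struct chunk_allocator *a, size_t *size)
{
	unsigned char *out = malloc(a->used ? a->used : 1);
	size_t i, done = 0;

	if (out) {
		for (i = 0; i < a->nr_chunks; i++) {
			size_t n = a->used - done;

			if (n > CHUNK_SIZE)
				n = CHUNK_SIZE;
			memcpy(out + done, a->chunks[i], n);
			done += n;
		}
		*size = a->used;
	}
	for (i = 0; i < a->nr_chunks; i++)
		free(a->chunks[i]);
	memset(a, 0, sizeof(*a));
	return out;
}
```

A decompressor's flush callback would call chunk_write() repeatedly, and the caller would call chunk_finish() once at the end. The state is a single object here, which mirrors the "one global allocator" limitation the commit message mentions.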
As UEFI becomes popular, a few architectures support booting a PE format kernel image directly. But the internals of the PE format vary, which would mean one parser per format. This patch (with the rest of this series) introduces a common skeleton for all parsers and leaves the format parsing to the bpf-prog, so the kernel code can stay relatively stable. A new kexec_file_ops, named pe_image_ops, is implemented. This patch contains some placeholder functions; they take effect after the kexec bpf light skeleton and the bpf helpers are introduced. Overall, parsing is a pipeline: the current bpf-prog parser is attached to bpf_handle_pefile() and detached at the end of the current stage by disarm_bpf_prog(). The result parsed by the current bpf-prog is buffered in the kernel by prepare_nested_pe() and delivered to the next stage. For each stage, the bpf bytecode is extracted from the '.bpf' section of the PE file.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
To: kexec@lists.infradead.org
This patch does two things. First, it registers as a listener on bpf_copy_to_kernel(). Second, so that the hooked bpf-prog can call sleepable kfuncs, bpf_handle_pefile and bpf_post_handle_pefile are marked as KF_SLEEPABLE.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
Analogous to kernel/bpf/preload/iterators/Makefile,
this Makefile is not invoked by the Kbuild system. It needs to be
invoked manually when kexec_pe_parser_bpf.c is changed so that
kexec_pe_parser_bpf.lskel.h can be regenerated by the command "bpftool
gen skeleton -L kexec_pe_parser_bpf.o".
kexec_pe_parser_bpf.lskel.h is used directly by the kernel kexec code in
a later patch. In this patch, bpf bytecode is contained in
opts_data[] and opts_insn[] in kexec_pe_parser_bpf.lskel.h, but in the
following patch they will be removed, leaving only the function API in
kexec_pe_parser_bpf.lskel.h.
As exposed in kexec_pe_parser_bpf.lskel.h, the interface between the
bpf-prog and the kernel consists of:
four maps:
struct bpf_map_desc ringbuf_1;
struct bpf_map_desc ringbuf_2;
struct bpf_map_desc ringbuf_3;
struct bpf_map_desc ringbuf_4;
four sections:
struct bpf_map_desc rodata;
struct bpf_map_desc data;
struct bpf_map_desc bss;
struct bpf_map_desc rodata_str1_1;
two progs:
SEC("fentry.s/bpf_handle_pefile")
SEC("fentry.s/bpf_post_handle_pefile")
They are fixed and provided for all kinds of bpf-progs that interact
with the kexec kernel component.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
The routine that searches for a symbol in an ELF file can be shared, so split it out.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
To: kexec@lists.infradead.org
All kexec PE bpf-progs should align with the interface exposed by the
light skeleton:
four maps:
struct bpf_map_desc ringbuf_1;
struct bpf_map_desc ringbuf_2;
struct bpf_map_desc ringbuf_3;
struct bpf_map_desc ringbuf_4;
four sections:
struct bpf_map_desc rodata;
struct bpf_map_desc data;
struct bpf_map_desc bss;
struct bpf_map_desc rodata_str1_1;
two progs:
SEC("fentry.s/bpf_handle_pefile")
SEC("fentry.s/bpf_post_handle_pefile")
With the above presumption, the integration consists of two parts:
-1. Call the API exposed by the light skeleton from kexec.
-2. The opts_insn[] and opts_data[] are bpf-prog dependent and
can be extracted and passed in from user space. In the
kexec_file_load design, a PE file has a .bpf section whose
content is an ELF file, and the ELF contains opts_insn[] and opts_data[].
As a bonus, the BPF bytecode is covered by the signature of the
entire PE file.
(Note: since opts_insn[] contains the ringbuf size, the bpf-prog
writer can choose a suitable size according to the kernel image
size without modifying the kernel code.)
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
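To illustrate the ".bpf section inside a PE file" idea from the commit message above, here is a minimal sketch of a by-name section lookup. pe_find_section() is a hypothetical helper and not the function used by this series; the field offsets are those of the standard PE/COFF headers (e_lfanew at 0x3c, a 20-byte COFF header after the "PE\0\0" signature, 40-byte section headers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read little-endian integers without assuming alignment. */
static uint16_t le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | p[1] << 8);
}

static uint32_t le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Locate a named section in an in-memory PE image (hypothetical helper).
 * Returns 0 and stores the raw-data offset and size on success, or -1 if
 * the image is malformed or the section is absent.
 */
static int pe_find_section(const uint8_t *img, size_t img_len,
			   const char *name, uint32_t *off, uint32_t *len)
{
	uint8_t want[8] = {0};	/* section names are 8 bytes, NUL padded */
	size_t nlen = strlen(name);
	uint32_t pe_off, nsec, opt_sz, i;
	size_t tbl;

	if (nlen > 8 || img_len < 0x40 || img[0] != 'M' || img[1] != 'Z')
		return -1;
	memcpy(want, name, nlen);

	pe_off = le32(img + 0x3c);		/* e_lfanew */
	if (pe_off > img_len || img_len - pe_off < 24 ||
	    memcmp(img + pe_off, "PE\0\0", 4))
		return -1;

	nsec = le16(img + pe_off + 4 + 2);	/* NumberOfSections */
	opt_sz = le16(img + pe_off + 4 + 16);	/* SizeOfOptionalHeader */
	tbl = (size_t)pe_off + 24 + opt_sz;	/* start of section table */

	for (i = 0; i < nsec; i++) {
		size_t pos = tbl + (size_t)i * 40;
		uint32_t raw_len, raw_off;

		if (pos > img_len || img_len - pos < 40)
			return -1;
		if (memcmp(img + pos, want, 8))
			continue;
		raw_len = le32(img + pos + 16);	/* SizeOfRawData */
		raw_off = le32(img + pos + 20);	/* PointerToRawData */
		if (raw_off > img_len || raw_len > img_len - raw_off)
			return -1;
		*off = raw_off;
		*len = raw_len;
		return 0;
	}
	return -1;
}
```

Calling pe_find_section(image, image_len, ".bpf", &off, &len) would then give the byte range that, according to the commit message, holds the ELF file carrying opts_insn[] and opts_data[].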
Now everything is ready for the kexec PE image parser. Select it on arm64 for zboot and UKI image support.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
To: linux-arm-kernel@lists.infradead.org
This BPF program aligns with the convention defined in the kernel file
kexec_pe_parser_bpf.lskel.h, where the interface between the BPF program
and the kernel is established, and is composed of:
four maps:
struct bpf_map_desc ringbuf_1;
struct bpf_map_desc ringbuf_2;
struct bpf_map_desc ringbuf_3;
struct bpf_map_desc ringbuf_4;
four sections:
struct bpf_map_desc rodata;
struct bpf_map_desc data;
struct bpf_map_desc bss;
struct bpf_map_desc rodata_str1_1;
two progs:
SEC("fentry.s/bpf_handle_pefile")
SEC("fentry.s/bpf_post_handle_pefile")
This BPF program only uses ringbuf_1, so it minimizes the size of the
other three ringbufs to one byte. The size of ringbuf_1 is deduced from
the size of the uncompressed file 'vmlinux.bin', which is usually less
than 64MB. With the help of a group of bpf kfuncs: bpf_decompress(),
bpf_copy_to_kernel(), bpf_mem_range_result_put(), this bpf-prog stores
the uncompressed kernel image inside the kernel space.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
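BPF ring buffer maps need a size that is a power of two and a multiple of the page size, so deducing the size of ringbuf_1 from vmlinux.bin involves rounding up. The following is a sketch of one way to do that; ringbuf_size_for() and the 4096-byte page size are assumptions made for illustration, and the series may compute the value differently.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096u	/* assumed page size */

/*
 * Smallest power-of-two, page-aligned size that can hold image_size bytes.
 * Returns 0 if image_size is too large to round up safely.
 */
static uint64_t ringbuf_size_for(uint64_t image_size)
{
	uint64_t sz = PAGE_SZ;

	if (image_size > (UINT64_C(1) << 62))
		return 0;
	while (sz < image_size)
		sz <<= 1;
	return sz;
}
```

With this rounding, a 40MB uncompressed image would call for a 64MB ring buffer, which is in line with the "usually less than 64MB" remark in the commit message.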
The objcopy binary can append a section to a PE file, but it disregards the DOS header. However, the zboot format carries important information in the DOS header: the payload offset and size. To track this information and append a new PE section, a dedicated binary tool is introduced here to build zboot images. The payload's relative offset within the .data section remains unchanged. Therefore, the .data section offset in the new PE file, plus the payload offset within that section, yields the payload offset within the new PE file. Finally, the new PE file 'zboot.efi' can be built with the command: make -C tools/kexec zboot
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: bpf@vger.kernel.org
To: kexec@lists.infradead.org
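The payload offset arithmetic in the commit message above can be written down as a short sketch. zboot_new_payload_offset() and its parameter names are hypothetical; the only rule taken from the commit message is that the payload keeps its relative offset inside the .data section.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute where the payload lands in the rebuilt PE file (hypothetical
 * helper). All offsets are file offsets in bytes.
 * Returns 0 on success, -1 if the inputs are inconsistent.
 */
static int zboot_new_payload_offset(uint32_t old_payload_off,
				    uint32_t old_data_off,
				    uint32_t data_size,
				    uint32_t new_data_off,
				    uint32_t *new_payload_off)
{
	uint32_t rel;

	if (old_payload_off < old_data_off)
		return -1;		/* payload is not inside .data */
	rel = old_payload_off - old_data_off;
	if (rel >= data_size)
		return -1;		/* payload starts past the section end */
	if (new_data_off > UINT32_MAX - rel)
		return -1;		/* result would overflow */

	/* The offset within .data is unchanged; only the section moves. */
	*new_payload_off = new_data_off + rel;
	return 0;
}
```

The result is the value a build tool would write back into the DOS header next to the unchanged payload size.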
/gemini review
Summary of Changes
Hello @pfliu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Linux kernel's kexec functionality by introducing a BPF-driven mechanism for loading and parsing UEFI Portable Executable (PE) images. This new approach allows for flexible and dynamic handling of complex boot image formats, such as zboot, where various components like the kernel, initrd, and command line might be compressed or nested within the PE file. By integrating BPF, the kernel can delegate the intricate parsing logic to user-defined programs, making the kexec process more adaptable to diverse boot environments and future format evolutions without requiring kernel recompilation for each new format.
Code Review
This pull request introduces a significant new feature for kexec, allowing PE files to be loaded and parsed with the help of BPF programs. This is a complex change that adds a new BPF "carrier" framework for passing data from BPF to the kernel, a PE image loader, and BPF programs for parsing specific PE formats like zboot. While the overall architecture is sound, my review has identified several critical issues, including a mutex deadlock and use-after-free vulnerability in the BPF decompression helper, a memory leak in the PE image loader's error path, and other high-severity bugs related to incorrect resource management and logic. These critical issues must be addressed before this change can be merged.
if (!decompressor) {
	kfree(range);
	vfree(input_buf);
	pr_err("Can not find decompress method\n");
	goto error;
}
There are critical issues in this error handling path:
- If decompress_method() returns NULL, the code jumps to the error: label while output_buf_mutex is still held. This will cause the function to return without releasing the mutex, leading to a deadlock on subsequent calls.
- In the same error path, range is freed via kfree(range), but the function proceeds to return this now-dangling range pointer, which is a use-after-free vulnerability.
The error handling logic must be refactored to ensure the mutex is always unlocked and resources are cleaned up correctly on all paths. Failures after acquiring the mutex should jump to a cleanup path that unlocks the mutex before returning.
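The cleanup pattern asked for above can be sketched in user space as follows. This is a model, not the patch: toy_mutex, range_result, find_decompressor() and decompress_model() are hypothetical stand-ins, and the toy lock asserts on misuse so that a forgotten unlock shows up immediately instead of as a deadlock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy lock standing in for output_buf_mutex; the asserts catch misuse. */
struct toy_mutex { int held; };
static struct toy_mutex output_buf_lock;

static void toy_lock(struct toy_mutex *m)   { assert(!m->held); m->held = 1; }
static void toy_unlock(struct toy_mutex *m) { assert(m->held);  m->held = 0; }

struct range_result { void *buf; size_t size; };

typedef int (*decompress_fn)(const void *in, size_t len,
			     struct range_result *out);

/* Placeholder "decompressor": copies the input unchanged. */
static int fake_decompress(const void *in, size_t len,
			   struct range_result *out)
{
	out->buf = malloc(len);
	if (!out->buf)
		return -1;
	memcpy(out->buf, in, len);
	out->size = len;
	return 0;
}

/* Hypothetical lookup: NULL means the format was not recognised. */
static decompress_fn find_decompressor(const unsigned char *in, size_t len)
{
	return (len >= 2 && in[0] == 0x1f && in[1] == 0x8b) ?
		fake_decompress : NULL;
}

static struct range_result *decompress_model(const void *in, size_t len)
{
	struct range_result *range = calloc(1, sizeof(*range));
	decompress_fn fn;

	if (!range)
		return NULL;

	toy_lock(&output_buf_lock);
	fn = find_decompressor(in, len);
	if (!fn)
		goto err_unlock;
	if (fn(in, len, range))
		goto err_unlock;
	toy_unlock(&output_buf_lock);
	return range;

err_unlock:
	/* Every failure after taking the lock funnels through here. */
	toy_unlock(&output_buf_lock);
	free(range);
	return NULL;	/* never hand back the freed pointer */
}
```

Both points of the review are covered by the single err_unlock label: the lock is dropped on every exit, and the caller only ever sees NULL or a live object.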
for (int i = 0; i < 3; i++) {
	r = parsed_resource[i].r;
	if (!r)
		continue;
	parsed_resource[i].r = NULL;
	/*
	 * The release of buffer defers to
	 * kimage_file_post_load_cleanup()
	 */
	r->buf = NULL;
	r->buf_sz = 0;
	mem_range_result_put(r);
}
There is a critical memory leak in the error handling path. The cleanup logic at the err: label is also used for the success path. It unconditionally sets r->buf = NULL before calling mem_range_result_put(r).
On a failure path, this prevents __mem_range_result_free from freeing the buffer pointed to by r->buf, causing a memory leak. The cleanup logic should differentiate between success and failure. On failure, mem_range_result_put(r) should be called without modifying r to ensure the buffer is freed.
err:
	/*
	 * On success, ldata is not an error pointer. Declare "bool success;"
	 * with the other locals: before C23, a declaration cannot directly
	 * follow a label.
	 */
	success = !IS_ERR(ldata);
	for (int i = 0; i < 3; i++) {
		r = parsed_resource[i].r;
		if (!r)
			continue;
		parsed_resource[i].r = NULL;
		/*
		 * On success, buffer ownership is transferred to kimage.
		 * Prevent freeing it here. On failure, the buffer should be
		 * freed by mem_range_result_put().
		 */
		if (success) {
			r->buf = NULL;
			r->buf_sz = 0;
		}
		mem_range_result_put(r);
	}

if (range->alloc_type == TYPE_KALLOC)
	kfree(kbuf);
else
	vfree(kbuf);
There is a bug in this error handling block. You are checking range->alloc_type to determine which deallocation function to use for kbuf. However, range->alloc_type is not initialized until line 157. At this point, it holds an uninitialized value, which will lead to undefined behavior. You should use the local alloc_type variable, which holds the correct value.
if (alloc_type == TYPE_KALLOC)
	kfree(kbuf);
else
	vfree(kbuf);

	image->cmdline_buf_len = cmdline_len;
}

if (kernel == NULL || initrd == NULL || cmdline == NULL) {
This check is too strict. A valid kexec image might not have an initrd, and the command line can be empty. The kexec_file_load syscall supports these cases. This check will cause pe_image_load to fail for valid scenarios. It should be relaxed to only require that the kernel is present.
if (kernel == NULL) {

	strcpy(c, "cmdline ");
	c += strlen("cmdline ");
}
c = '\0';

This option makes the kexec_file_load() syscall cooperates with bpf-prog
to parse PE format file

	return __builtin_memcmp(local_name, res_kernel, 4);
}

SEC("fentry.s/bpf_handle_pefile")

	return 0;
}

SEC("fentry.s/bpf_post_handle_pefile")

close(base_fd);
close(out_fd);
close(bpf_fd);

/* Overwrite buf */
bpf_probe_read((void *)buf, payload_size, context->image + payload_offset);
bpf_printk("Calling bpf_kexec_decompress()\n");
struct mem_range_result *r = bpf_decompress(buf, payload_size - 4);
Code Review
This pull request introduces a significant new feature enabling kexec to load and parse PE files using BPF programs. This is an innovative approach that involves adding new BPF helpers, a generic "carrier" mechanism for BPF-to-kernel data transfer, and new tooling for embedding BPF programs into PE files. The changes are extensive and touch multiple kernel subsystems. My review has identified a few critical and high-severity bugs in the implementation of the new BPF helpers and the PE image loading logic, along with some medium-severity issues in the supporting tools and the BPF program itself. These should be addressed to ensure the stability and correctness of this new functionality.
ret = copy_from_kernel_nofault(kbuf, buf, size);
if (unlikely(ret < 0)) {
	if (range->alloc_type == TYPE_KALLOC)
		kfree(kbuf);
	else
		vfree(kbuf);
	kfree(range);
	ret = -EINVAL;
	goto err;
}
This error handling block has a couple of issues:
- copy_from_kernel_nofault() returns 0 on success and a negative error code on failure, so any non-zero return value means failure. The call can be tested directly instead of going through ret.
- The free logic uses range->alloc_type, but this field is not initialized until line 157. The local alloc_type variable should be used instead.
Here's a suggested fix that addresses both points. I've also changed the error code to -EFAULT, which is more appropriate for a copy failure.
if (unlikely(copy_from_kernel_nofault(kbuf, buf, size))) {
	if (alloc_type == TYPE_KALLOC)
		kfree(kbuf);
	else
		vfree(kbuf);
	kfree(range);
	ret = -EFAULT;
	goto err;
}

if (kernel == NULL || initrd == NULL || cmdline == NULL) {
	char *c, buf[64];

	c = buf;
	if (kernel == NULL) {
		strcpy(c, "kernel ");
		c += strlen("kernel ");
	}
	if (initrd == NULL) {
		strcpy(c, "initrd ");
		c += strlen("initrd ");
	}
	if (cmdline == NULL) {
		strcpy(c, "cmdline ");
		c += strlen("cmdline ");
	}
	c = '\0';
	pr_err("Can not extract data for %s", buf);
	ldata = ERR_PTR(-EINVAL);
	goto err;
}
This error handling logic has several flaws:
- It treats a missing initrd or cmdline as a fatal error by checking them with ||. These are often optional, and only a missing kernel should be a fatal error.
- The error message construction is fragile. It appends with unbounded strcpy() calls into a fixed 64-byte buffer and relies on advancing c by hand after each copy.
- c = '\0'; is incorrect for null-terminating a string: it assigns a null pointer to c instead of writing a terminator. It should be *c = '\0';.
I suggest refactoring this to only fail on a missing kernel and issue warnings for other missing components.
if (!kernel) {
	pr_err("Can not extract kernel from PE container\n");
	ldata = ERR_PTR(-EINVAL);
	goto err;
}
if (!initrd)
	pr_warn("Could not find initrd in PE container\n");
if (!cmdline)
	pr_warn("Could not find cmdline in PE container\n");

/*
 * At present, one global allocator for decompression. Later if needed, changing the
 * prototype of decompress_fn to introduce each task's allocator.
 */
static DEFINE_MUTEX(output_buf_mutex);
The use of a global mutex (output_buf_mutex) and a single global allocator instance (dcmpr_allocator) makes the bpf_decompress kfunc non-reentrant. While this might be acceptable for the current kexec use case where concurrent calls are unlikely, it limits the general applicability of this helper for other potential use cases in the future. As noted in the comment on line 3724, making this mechanism re-entrant would be a valuable improvement for wider adoption.
munmap(base_start_addr, sb.st_size);
close(base_fd);
close(out_fd);
close(bpf_fd);

/* Overwrite buf */
bpf_probe_read((void *)buf, payload_size, context->image + payload_offset);
bpf_printk("Calling bpf_kexec_decompress()\n");
struct mem_range_result *r = bpf_decompress(buf, payload_size - 4);
The subtraction of 4 from payload_size appears to be a magic number. To improve code clarity and maintainability, please add a comment explaining why these 4 bytes are being excluded from the decompression payload. For instance, it might be a checksum or size field that isn't part of the compressed stream.
The kernel forbids the creation of non-FDB nexthop groups with FDB nexthops:
 # ip nexthop add id 1 via 192.0.2.1 fdb
 # ip nexthop add id 2 group 1
 Error: Non FDB nexthop group cannot have fdb nexthops.
And vice versa:
 # ip nexthop add id 3 via 192.0.2.2 dev dummy1
 # ip nexthop add id 4 group 3 fdb
 Error: FDB nexthop group can only have fdb nexthops.
However, as long as no routes are pointing to a non-FDB nexthop group, the kernel allows changing the type of a nexthop from FDB to non-FDB and vice versa:
 # ip nexthop add id 5 via 192.0.2.2 dev dummy1
 # ip nexthop add id 6 group 5
 # ip nexthop replace id 5 via 192.0.2.2 fdb
 # echo $?
 0
This configuration is invalid and can result in a NPD [1] since FDB nexthops are not associated with a nexthop device:
 # ip route add 198.51.100.1/32 nhid 6
 # ping 198.51.100.1
Fix by preventing nexthop FDB status change while the nexthop is in a group:
 # ip nexthop add id 7 via 192.0.2.2 dev dummy1
 # ip nexthop add id 8 group 7
 # ip nexthop replace id 7 via 192.0.2.2 fdb
 Error: Cannot change nexthop FDB status while in a group.
[1]
BUG: kernel NULL pointer dereference, address: 00000000000003c0
[...]
Oops: Oops: 0000 [#1] SMP
CPU: 6 UID: 0 PID: 367 Comm: ping Not tainted 6.17.0-rc6-virtme-gb65678cacc03 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014
RIP: 0010:fib_lookup_good_nhc+0x1e/0x80
[...]
Call Trace:
 <TASK>
 fib_table_lookup+0x541/0x650
 ip_route_output_key_hash_rcu+0x2ea/0x970
 ip_route_output_key_hash+0x55/0x80
 __ip4_datagram_connect+0x250/0x330
 udp_connect+0x2b/0x60
 __sys_connect+0x9c/0xd0
 __x64_sys_connect+0x18/0x20
 do_syscall_64+0xa4/0x2a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: 38428d6 ("nexthop: support for fdb ecmp nexthops")
Reported-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68c9a4d2.050a0220.3c6139.0e63.GAE@google.com/
Tested-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250921150824.149157-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel says:
====================
nexthop: Various fixes
Patch #1 fixes a NPD that was recently reported by syzbot.
Patch #2 fixes an issue in the existing FIB nexthop selftest.
Patch #3 extends the selftest with test cases for the bug that was fixed in the first patch.
====================
Link: https://patch.msgid.link/20250921150824.149157-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add 0x29 as the accelerometer address for the Dell Latitude E6530 to
lis3lv02d_devices[].
The address was verified as below:
$ cd /sys/bus/pci/drivers/i801_smbus/0000:00:1f.3
$ ls -d i2c-*
i2c-20
$ sudo modprobe i2c-dev
$ sudo i2cdetect 20
WARNING! This program can confuse your I2C bus, cause data loss and worse!
I will probe file /dev/i2c-20.
I will probe address range 0x08-0x77.
Continue? [Y/n] Y
0 1 2 3 4 5 6 7 8 9 a b c d e f
00: 08 -- -- -- -- -- -- --
10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
20: -- -- -- -- -- -- -- -- -- UU -- 2b -- -- -- --
30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
40: -- -- -- -- 44 -- -- -- -- -- -- -- -- -- -- --
50: UU -- 52 -- -- -- -- -- -- -- -- -- -- -- -- --
60: -- 61 -- -- -- -- -- -- -- -- -- -- -- -- -- --
70: -- -- -- -- -- -- -- --
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-linux-cachyos-bore root=UUID=<redacted> rw loglevel=3 quiet dell_lis3lv02d.probe_i2c_addr=1
$ sudo dmesg
[ 0.000000] Linux version 6.16.6-2-cachyos-bore (linux-cachyos-bore@cachyos) (gcc (GCC) 15.2.1 20250813, GNU ld (GNU Binutils) 2.45.0) #1 SMP PREEMPT_DYNAMIC Thu, 11 Sep 2025 16:01:12 +0000
[…]
[ 0.000000] DMI: Dell Inc. Latitude E6530/07Y85M, BIOS A22 11/30/2018
[…]
[ 5.166442] i2c i2c-20: Probing for lis3lv02d on address 0x29
[ 5.167854] i2c i2c-20: Detected lis3lv02d on address 0x29, please report this upstream to platform-driver-x86@vger.kernel.org so that a quirk can be added
Signed-off-by: Nickolay Goppen <setotau@mainlining.org>
Reviewed-by: Hans de Goede <hansg@kernel.org>
Link: https://patch.msgid.link/20250917-dell-lis3lv02d-latitude-e6530-v1-1-8a6dec4e51e9@mainlining.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():
BUG: unable to handle page fault for address: ffffbc3840291000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary)
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
[...]
Call Trace:
<TASK>
__msan_memset+0xee/0x1a0
sha224_final+0x9e/0x350
test_hash_buffer_overruns+0x46f/0x5f0
? kmsan_get_shadow_origin_ptr+0x46/0xa0
? __pfx_test_hash_buffer_overruns+0x10/0x10
kunit_try_run_case+0x198/0xa00
This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e. the next page is unmapped.
The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned. Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer. However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address. This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.
To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.
Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org
Fixes: 2ef3cec ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 and reaches pagemap_scan_backout_range(), the kernel panics with a null-ptr-deref:
[ 44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[ 44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[ 44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 #22 PREEMPT(none)
[ 44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80
<snip registers, unreliable trace>
[ 44.946828] Call Trace:
[ 44.947030] <TASK>
[ 44.949219] pagemap_scan_pmd_entry+0xec/0xfa0
[ 44.952593] walk_pmd_range.isra.0+0x302/0x910
[ 44.954069] walk_pud_range.isra.0+0x419/0x790
[ 44.954427] walk_p4d_range+0x41e/0x620
[ 44.954743] walk_pgd_range+0x31e/0x630
[ 44.955057] __walk_page_range+0x160/0x670
[ 44.956883] walk_page_range_mm+0x408/0x980
[ 44.958677] walk_page_range+0x66/0x90
[ 44.958984] do_pagemap_scan+0x28d/0x9c0
[ 44.961833] do_pagemap_cmd+0x59/0x80
[ 44.962484] __x64_sys_ioctl+0x18d/0x210
[ 44.962804] do_syscall_64+0x5b/0x290
[ 44.963111] entry_SYSCALL_64_after_hwframe+0x76/0x7e
vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are allocated and p->vec_buf remains set to NULL.
This breaks an assumption made later in pagemap_scan_backout_range(), that page_region is always allocated for p->vec_buf_index.
Fix it by explicitly checking p->vec_buf for NULL before dereferencing.
Other sites that might run into the same deref issue are already (directly or transitively) protected by checking p->vec_buf.
Note: From the PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output is requested and the caller is only interested in the side effects, hence it passes the check in pagemap_scan_get_args().
This issue was found by syzkaller.
Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
No description provided.