aarch64: Don't perform unaligned memory accesses #934
On a bare metal system with the MMU disabled or alignment checking enabled, accessing memory through an unaligned address causes an alignment fault exception. By changing the memory accesses to smaller sizes that match the alignment of the addresses, the code can run without depending on the CPU handling and accepting unaligned addresses.

Signed-off-by: Anders Sonmark <Anders.Sonmark@axis.com>
Thanks for your PR!
In keccak_f1600_x4_v8a_scalar_hybrid_asm.S, function keccak_f1600_x4_scalar_v8a_hybrid_asm is called with X0 aligned to 128 bits, but it then sets X4 = X0 + 0xC8 at line 69, making X4 only 64-bit aligned. The following "ldp q25, q26, [x0]" is thus fine, but "ldp q27, q28, [x4]" fails, as it tries to read 2x128 bits from a 64-bit-aligned pointer. The same happens again when X4 is set up for writing at line 976. My patch changes the reads and writes through X4 to 4x64 bits instead to fix this.

In polyz_unpack_19_asm.S, the loop increases X1 by 0x28, so even if it was 128-bit aligned the first time, it will be only 64-bit aligned in the second iteration. My patch changes the reads from 128+128+64 bits to 5x64 bits.

In rej_uniform_asm.S, X7 starts out 128-bit aligned, but the increments inside the loop at lines 108, 110, 112, and 114 only guarantee 32-bit alignment, so my patch changes the writes from 128 bits to 4x32 bits.

It is possible that there are other unaligned accesses that I haven't noticed. These occurrences are the ones that caused exceptions when I tried to verify ML-DSA-87 signatures on a Cortex-A55 with the MMU disabled. With these fixes, I can successfully verify signatures with native aarch64 backends for both arithmetic and fips202.
Thanks for the details! I'll try to reproduce this tomorrow.
Many thanks @flynd, this is interesting. I was not aware that when the MMU is disabled, unaligned accesses fault. Can you say more about your application context? |
The CPU only handles unaligned addresses automatically for address ranges marked as Normal memory. With the MMU disabled, however, all memory is treated as Device memory, which is why the accesses cause alignment faults.
The use case is to verify signatures for secure boot in early boot stages of an embedded system that lacks hardware support for ML-DSA. For simplicity, we do not enable the MMU in the earliest boot stages, which is why this alignment problem was noticeable. We will only use ML-DSA-87 and primarily only use keys that are built into the boot code. As the early boot environment runs in internal RAM, before DRAM is available, minimizing the memory footprint is important to us. The option to build only for a specific parameter set to optimize for a single use case is thus appreciated. MLD_CONFIG_REDUCE_RAM is also very helpful.
@flynd Thank you for the context! @mkannwischer and I are happy to help where we can to make mldsa-native usable in your context, so please keep the issues coming. We recently worked through the integration into CHERIoT, which also drove a number of improvements.
Did you notice that you can customize allocation now? https://github.com/pq-code-package/mldsa-native/blob/main/mldsa/mldsa_native_config.h#L445
I agree those are included, but they should not be built because of guards in the respective files. Otherwise, we have a bug. Can you check?
I forgot to say, but yes, I also have MLD_CONFIG_INTERNAL_API_QUALIFIER defined to static. After checking again, I see that all functions I explicitly needed to add static to are located in mldsa/src/fips202/fips202.c, mldsa/src/fips202/fips202x4.c, and mldsa/src/fips202/keccakf1600.c. I've also marked the data tables as static in mldsa/src/native/aarch64/src/aarch64_zetas.c, mldsa/src/native/aarch64/src/polyz_unpack_table.c, mldsa/src/native/aarch64/src/rej_uniform_eta_table.c, and mldsa/src/native/aarch64/src/rej_uniform_table.c. After this, mldsa-native no longer has any global symbols in the linked ELF (except the entry point I've added, which is a wrapper around mldsa_verify that also sets up a heap provided by the caller).
Yes, I'm using that option to define a heap for memory allocation, and I appreciate that the exact heap size needed is documented in mldsa_native.h, so I didn't have to guess how large it has to be. But I still count 3.6 kB of stack usage (or 7.3 kB without MLD_CONFIG_REDUCE_RAM), which I was hoping to squeeze below 2 kB if possible.
I checked the linked ELF to verify that all symbols were actually used somewhere, and several weren't. Also, I don't want to import the entire mldsa-native repository into my project, only the necessary files, so if the guards are inside each file I would still need to import extra files even though they add nothing to the output with my configuration.
We are tracking this in #867.
I opened #936
Oof. You are right. #937 We are doing this correctly in mlkem-native (e.g. https://github.com/pq-code-package/mlkem-native/blob/main/mlkem/src/native/aarch64/src/polyvec_basemul_acc_montgomery_cached_asm_k3.S#L52), but missed it for mldsa-native. But regardless, it's true that currently you would need to manually adjust the