@flynd

@flynd flynd commented Feb 2, 2026

On a bare metal system with the MMU disabled or alignment checking enabled, accessing memory using an unaligned address causes an alignment fault exception. By changing the memory accesses to smaller sizes that match the alignment of the addresses, the code can run without depending on the CPU handling and accepting unaligned addresses.

Signed-off-by: Anders Sonmark <Anders.Sonmark@axis.com>

@flynd flynd requested a review from a team as a code owner February 2, 2026 12:01
@flynd flynd force-pushed the fix-aarch64-unaligned branch from b2f1933 to dc8a7a1 Compare February 2, 2026 12:09
@mkannwischer
Contributor

mkannwischer commented Feb 2, 2026

On a bare metal system with the MMU disabled or alignment checking enabled, accessing memory using an unaligned address causes an alignment fault exception. By changing the memory accesses to smaller sizes that match the alignment of the addresses, the code can run without depending on the CPU handling and accepting unaligned addresses.

Signed-off-by: Anders Sonmark <Anders.Sonmark@axis.com>

Thanks for your PR!
Can you elaborate where exactly you got a fault due to an unaligned memory access? I took a quick glance at the changes, but I believe all of those addresses should be aligned as we are using MLD_ALIGN for the corresponding allocations.
If you have experienced an alignment fault, then that would hint that MLD_ALIGN is missing somewhere.

@flynd
Author

flynd commented Feb 2, 2026

Thanks for your PR! Can you elaborate where exactly you got a fault due to an unaligned memory access? I took a quick glance at the changes, but I believe all of those addresses should be aligned as we are using MLD_ALIGN for the corresponding allocations. If you have experienced an alignment fault, then that would hint that MLD_ALIGN is missing somewhere.

In keccak_f1600_x4_v8a_scalar_hybrid_asm.S, function keccak_f1600_x4_scalar_v8a_hybrid_asm is called with X0 aligned to 128 bits, but it then sets X4=X0+0xC8 at line 69, making X4 only 64 bit aligned. The following "ldp q25, q26, [x0]" is thus fine, but "ldp q27, q28, [x4]" fails as it tries to read 2x128 bits from a 64 bit aligned pointer. The same happens again when X4 is set up for writing at line 976. My patch changes the reads and writes with X4 to 4x64 bits instead to fix this.

In polyz_unpack_19_asm.S, the loop increases X1 by 0x28, so even if it was 128 bit aligned the first time, it will be only 64 bit aligned in the second iteration. My patch changes the reads from 128+128+64 bits to 5x64 bits.
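The alignment arithmetic behind the two offsets above can be checked with a small C sketch. The helper name `guaranteed_alignment` and the base address 0x1000 are hypothetical and chosen only for illustration; nothing here is taken from the mldsa-native sources.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the largest power-of-two alignment, in bytes and capped at
 * max_align, that the given address satisfies. Illustrative helper only. */
unsigned guaranteed_alignment(uintptr_t addr, unsigned max_align)
{
  /* max_align must be a nonzero power of two. */
  assert(max_align != 0 && (max_align & (max_align - 1)) == 0);

  unsigned align = max_align;
  while (align > 1 && (addr % align) != 0)
  {
    align /= 2;
  }
  return align;
}
```

With a 16-byte-aligned base of 0x1000, `guaranteed_alignment(0x1000 + 0xC8, 16)` evaluates to 8, because 0xC8 is 200 and 200 mod 16 is 8. The same holds for one step of 0x28 (40 mod 16 is 8), while two steps (0x50) land on a 16-byte boundary again, so every other iteration of such a loop is only 64-bit aligned.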

In rej_uniform_asm.S, X7 starts out 128 bit aligned but the increase inside the loop at lines 108, 110, 112, and 114 only guarantees 32 bit alignment so my patch changes the writes from 128 bit to 4x32 bits.
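As a rough C analogue of that last change (the actual patch is in assembly), the sketch below writes 128 bits as four separate 32-bit stores, so the destination only needs 4-byte alignment. The function name `store_4x32` is hypothetical, and `volatile` is used here on the assumption that it keeps the compiler from merging the four stores into a wider access.

```c
#include <assert.h>
#include <stdint.h>

/* Writes four 32-bit lanes using four word-sized stores. The destination
 * only has to be 4-byte aligned, not 16-byte aligned. */
void store_4x32(volatile uint32_t *dst, const uint32_t lanes[4])
{
  /* Precondition: word alignment is still required. */
  assert(((uintptr_t)dst % 4) == 0);

  for (int i = 0; i < 4; i++)
  {
    dst[i] = lanes[i];
  }
}
```

The trade-off is more store instructions per iteration in exchange for not relying on the hardware to accept an unaligned 128-bit access.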

It is possible that there are other unaligned accesses that I haven't noticed. These occurrences are the ones that caused exceptions when I tried to verify ML-DSA-87 signatures on a Cortex A55 with the MMU disabled. With these fixes, I can successfully verify signatures with native aarch64 backends for both arithmetic and fips202.

@mkannwischer
Contributor

Thanks for your PR! Can you elaborate where exactly you got a fault due to an unaligned memory access? I took a quick glance at the changes, but I believe all of those addresses should be aligned as we are using MLD_ALIGN for the corresponding allocations. If you have experienced an alignment fault, then that would hint that MLD_ALIGN is missing somewhere.

In keccak_f1600_x4_v8a_scalar_hybrid_asm.S, function keccak_f1600_x4_scalar_v8a_hybrid_asm is called with X0 aligned to 128 bits, but it then sets X4=X0+0xC8 at line 69, making X4 only 64 bit aligned. The following "ldp q25, q26, [x0]" is thus fine, but "ldp q27, q28, [x4]" fails as it tries to read 2x128 bits from a 64 bit aligned pointer. The same happens again when X4 is set up for writing at line 976. My patch changes the reads and writes with X4 to 4x64 bits instead to fix this.

In polyz_unpack_19_asm.S, the loop increases X1 by 0x28, so even if it was 128 bit aligned the first time, it will be only 64 bit aligned in the second iteration. My patch changes the reads from 128+128+64 bits to 5x64 bits.

In rej_uniform_asm.S, X7 starts out 128 bit aligned but the increase inside the loop at lines 108, 110, 112, and 114 only guarantees 32 bit alignment so my patch changes the writes from 128 bit to 4x32 bits.

It is possible that there are other unaligned accesses that I haven't noticed. These occurrences are the ones that caused exceptions when I tried to verify ML-DSA-87 signatures on a Cortex A55 with the MMU disabled. With these fixes, I can successfully verify signatures with native aarch64 backends for both arithmetic and fips202.

Thanks for the details! I'll try to reproduce this tomorrow.

@hanno-becker
Contributor

hanno-becker commented Feb 2, 2026

Many thanks @flynd, this is interesting. I was not aware that when the MMU is disabled, unaligned accesses fault.

Can you say more about your application context?

@flynd
Author

flynd commented Feb 2, 2026

Many thanks @flynd, this is interesting. I was not aware that when the MMU is disabled, unaligned accesses fault.

The CPU only handles unaligned addresses automatically for address space marked as normal memory. With the MMU disabled, however, all address space is treated as device memory, which is why unaligned accesses cause alignment faults.
The C code needs to be built with -mstrict-align for this environment; otherwise the compiler will assume that unaligned accesses are acceptable and produce code that triggers alignment faults.
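On the C side, a common way to read from a possibly unaligned address without casting to a wider pointer type is to copy through `memcpy`. The sketch below uses a hypothetical helper name, `load_u64_unaligned`, and assumes that a compiler invoked with GCC's AArch64 option `-mstrict-align` lowers the copy to accesses that do not depend on hardware support for unaligned addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads 8 bytes starting at p, which may have any alignment.
 * The value is assembled in the host's native byte order. */
uint64_t load_u64_unaligned(const unsigned char *p)
{
  uint64_t v;

  assert(p != NULL);
  /* memcpy makes no alignment assumption about p, unlike
   * dereferencing a (const uint64_t *) cast of the same address. */
  memcpy(&v, p, sizeof v);
  return v;
}
```

This only covers compiler-generated code; hand-written assembly still has to pick access sizes that match the pointer alignment.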

Can you say more about your application context?

The use case is to verify signatures for secure boot in early boot stages of an embedded system that lacks hardware support for ML-DSA. For simplicity, we do not enable the MMU in the earliest boot stages, which is why this alignment problem was noticeable. We will only use ML-DSA-87 and primarily only use keys that are built into the boot code.
The development is currently at a prototype stage: getting it working, evaluating whether mldsa-native is usable for us, and, if so, which config options work best for us.

As the early boot environment runs in internal RAM, before DRAM is available, minimizing the memory footprint is important to us. The option to build only for a specific parameter set to optimize for a single use case is thus appreciated. MLD_CONFIG_REDUCE_RAM is also very helpful.
However, there are a few things that I currently have as local (very hackish) patches that would be good to have support for. Unless I figure out how to make them into nice patches, I will most likely open issues to request these features soon:

  • An option to disable signature creation and key generation to build only the necessary code for verifying signatures.
  • An option to force all functions and data as static. Defining "MLD_CONFIG_EXTERNAL_API_QUALIFIER static" only affects some functions, but I have many more "static" directives added right now to let gcc build everything as one unit and freely optimize. (It currently also lets gcc remove all functions and data that are unused due to mldsa_verify() being the only entry point.)
  • mldsa_native_asm.S includes several files that aren't needed for a given parameter set, making it build more than necessary. With a build option to disable signature creation, additional includes can also be removed.
  • It would also be nice if the stack frame could be reduced further by utilizing the heap more.

@hanno-becker
Contributor

hanno-becker commented Feb 2, 2026

@flynd Thank you for the context! @mkannwischer and I are happy to help where we can to make mldsa-native usable for your context, so please keep the issues coming. We recently worked through the integration into CherIOT, which also drove a number of improvements.

An option to force all functions and data as static. Defining "MLD_CONFIG_EXTERNAL_API_QUALIFIER static" only affects some functions, but I have many more "static" directives added right now to let gcc build everything as one unit and freely optimize. (It currently also lets gcc remove all functions and data that are unused due to mldsa_verify() being the only entry point.)

Did you consider MLD_CONFIG_INTERNAL_API_QUALIFIER? If you use static for MLD_CONFIG_EXTERNAL_API_QUALIFIER and MLD_CONFIG_INTERNAL_API_QUALIFIER, you should not have external linkage at all (or so I hoped). See e.g. https://github.com/pq-code-package/mldsa-native/blob/main/examples/monolithic_build_multilevel_native/main.c.
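To illustrate the idea, here is a minimal self-contained sketch of how qualifier macros can give a single-translation-unit build internal linkage throughout. The `DEMO_` macros and all function names are hypothetical stand-ins for `MLD_CONFIG_EXTERNAL_API_QUALIFIER`, `MLD_CONFIG_INTERNAL_API_QUALIFIER`, and the library's functions; this is not the library's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two configuration macros. */
#define DEMO_EXTERNAL_API_QUALIFIER static
#define DEMO_INTERNAL_API_QUALIFIER static

/* Internal helper: the qualifier expands to static, so the symbol
 * is not visible outside this translation unit. */
DEMO_INTERNAL_API_QUALIFIER int demo_check(const unsigned char *sig, size_t len)
{
  assert(sig != NULL);
  return (len > 0 && sig[0] == 0x5A) ? 0 : -1;
}

/* "Public" API of the demo library: also static in this configuration. */
DEMO_EXTERNAL_API_QUALIFIER int demo_verify(const unsigned char *sig, size_t len)
{
  return demo_check(sig, len);
}

/* The only function with external linkage: an application-level wrapper
 * that lives in the same translation unit as the library code. */
int boot_verify(const unsigned char *sig, size_t len)
{
  return demo_verify(sig, len);
}
```

Because everything except `boot_verify` has internal linkage, the compiler is free to inline across the whole unit and to drop functions that the wrapper never reaches.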

It would also be nice if the stack frame could be reduced further by utilizing the heap more.

Did you notice that you can customize allocation now? https://github.com/pq-code-package/mldsa-native/blob/main/mldsa/mldsa_native_config.h#L445

mldsa_native_asm.S includes several files that aren't needed for a given parameter set, making it build more than necessary. With a build option to disable signature creation, additional includes can also be removed.

I agree those are included, but they should not be built because of guards in the respective files. Otherwise, we have a bug. Can you check?

@flynd
Author

flynd commented Feb 2, 2026

An option to force all functions and data as static. Defining "MLD_CONFIG_EXTERNAL_API_QUALIFIER static" only affects some functions, but I have many more "static" directives added right now to let gcc build everything as one unit and freely optimize. (It currently also lets gcc remove all functions and data that are unused due to mldsa_verify() being the only entry point.)

Did you consider MLD_CONFIG_INTERNAL_API_QUALIFIER? If you use static for MLD_CONFIG_EXTERNAL_API_QUALIFIER and MLD_CONFIG_INTERNAL_API_QUALIFIER, you should not have external linkage at all (or so I hoped). See e.g. https://github.com/pq-code-package/mldsa-native/blob/main/examples/monolithic_build_multilevel_native/main.c.

I forgot to say, but yes, I also have MLD_CONFIG_INTERNAL_API_QUALIFIER defined to static. After checking again, I see that all functions I explicitly needed to add static to are located in mldsa/src/fips202/fips202.c, mldsa/src/fips202/fips202x4.c, and mldsa/src/fips202/keccakf1600.c. I've also marked the data tables as static in mldsa/src/native/aarch64/src/aarch64_zetas.c, mldsa/src/native/aarch64/src/polyz_unpack_table.c, mldsa/src/native/aarch64/src/rej_uniform_eta_table.c, and mldsa/src/native/aarch64/src/rej_uniform_table.c. After this, mldsa-native no longer has any global symbols in the linked ELF (except the entry point I've added, which is a wrapper around mldsa_verify that also sets up a heap provided by the caller).

It would also be nice if the stack frame could be reduced further by utilizing the heap more.

Did you notice that you can customize allocation now? https://github.com/pq-code-package/mldsa-native/blob/main/mldsa/mldsa_native_config.h#L445

Yes, I'm using that option to define a heap for memory allocation and I must say that I appreciate that you have the exact heap size necessary documented in mldsa_native.h so I didn't have to guess how large it needs to be. But I still count 3.6kB of stack needed (or 7.3kB without MLD_CONFIG_REDUCE_RAM) which I was hoping to squeeze down below 2kB if possible.

mldsa_native_asm.S includes several files that aren't needed for a given parameter set, making it build more than necessary. With a build option to disable signature creation, additional includes can also be removed.

I agree those are included, but they should not be built because of guards in the respective files. Otherwise, we have a bug. Can you check?

I checked the linked elf to verify that all symbols were actually used somewhere and there were several that weren't. Also, I don't want to import the entire mldsa-native git into my project, only the necessary files, so if the conditions are inside each file I would still need to import extra files even though they don't add anything to the output with my configuration.
This is part of my current local patch, as an example. I've tried to duplicate the conditions from where each of the functions is called:

+#if defined(MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L4) && MLD_CONFIG_PARAMETER_SET == 44
 #include "src/native/aarch64/src/mld_polyvecl_pointwise_acc_montgomery_l4.S"
+#elif defined(MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L5) && MLD_CONFIG_PARAMETER_SET == 65
 #include "src/native/aarch64/src/mld_polyvecl_pointwise_acc_montgomery_l5.S"
+#elif defined(MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L7) && MLD_CONFIG_PARAMETER_SET == 87
 #include "src/native/aarch64/src/mld_polyvecl_pointwise_acc_montgomery_l7.S"
+#endif

@hanno-becker
Contributor

But I still count 3.6kB of stack needed (or 7.3kB without MLD_CONFIG_REDUCE_RAM) which I was hoping to squeeze down below 2kB if possible.

We are tracking this in #867.

I forgot to say, but yes, I also have MLD_CONFIG_INTERNAL_API_QUALIFIER defined to static. After checking again, I see that all functions I explicitly needed to add static to are located in mldsa/src/fips202/fips202.c, mldsa/src/fips202/fips202x4.c, and mldsa/src/fips202/keccakf1600.c. I've also marked the data tables as static in mldsa/src/native/aarch64/src/aarch64_zetas.c, mldsa/src/native/aarch64/src/polyz_unpack_table.c, mldsa/src/native/aarch64/src/rej_uniform_eta_table.c, and mldsa/src/native/aarch64/src/rej_uniform_table.c. After this, mldsa-native no longer has any global symbols in the linked ELF (except the entry point I've added, which is a wrapper around mldsa_verify that also sets up a heap provided by the caller).

I opened #936

I checked the linked elf to verify that all symbols were actually used somewhere and there were several that weren't. Also, I don't want to import the entire mldsa-native git into my project, only the necessary files, so if the conditions are inside each file I would still need to import extra files even though they don't add anything to the output with my configuration.
This is part of my current local patch, as an example. I've tried to duplicate the conditions from where each of the functions is called:

Oof. You are right. #937 We are doing this correctly in mlkem-native (e.g. https://github.com/pq-code-package/mlkem-native/blob/main/mlkem/src/native/aarch64/src/polyvec_basemul_acc_montgomery_cached_asm_k3.S#L52), but missed it for mldsa-native.

But regardless, it's true that currently you would need to manually adjust mldsa_native_asm.S.
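The difference between guarding at the include site and guarding inside each file can be sketched in plain C. All `DEMO_`/`demo_` names below are hypothetical, and the guard conditions are simplified; the real files would use the library's own macros such as MLD_CONFIG_PARAMETER_SET.

```c
#include <assert.h>

/* Hypothetical stand-in for the configured parameter set. */
#ifndef DEMO_PARAMETER_SET
#define DEMO_PARAMETER_SET 87
#endif

static_assert(DEMO_PARAMETER_SET == 44 || DEMO_PARAMETER_SET == 65 ||
                  DEMO_PARAMETER_SET == 87,
              "unsupported DEMO_PARAMETER_SET");

/* Each variant carries its own guard, mimicking a per-file guard: all
 * three can be included unconditionally, but only one is emitted. */
#if DEMO_PARAMETER_SET == 44
int demo_pointwise_acc_l4(void) { return 4; }
#endif

#if DEMO_PARAMETER_SET == 65
int demo_pointwise_acc_l5(void) { return 5; }
#endif

#if DEMO_PARAMETER_SET == 87
int demo_pointwise_acc_l7(void) { return 7; }
#endif

/* The caller selects the variant with the same condition as the guard. */
int demo_vector_length(void)
{
#if DEMO_PARAMETER_SET == 44
  return demo_pointwise_acc_l4();
#elif DEMO_PARAMETER_SET == 65
  return demo_pointwise_acc_l5();
#else
  return demo_pointwise_acc_l7();
#endif
}
```

Per-file guards keep unused variants out of the binary, but the files themselves still have to be present in the tree, which is why a consumer who imports only selected files would also want the conditions at the include site.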
