aarch64: Don't perform unaligned memory accesses #934

flynd · 2026-02-02T12:01:09Z

On a bare metal system with the MMU disabled or alignment checking enabled, accessing memory using an unaligned address causes an alignment fault exception. By changing the memory accesses to smaller sizes that match the alignment of the addresses, the code can run without depending on the CPU handling and accepting unaligned addresses.

Signed-off-by: Anders Sonmark <Anders.Sonmark@axis.com>

mkannwischer · 2026-02-02T13:05:01Z

Thanks for your PR!
Can you elaborate on where exactly you got a fault due to an unaligned memory access? I took a quick glance at the changes, but I believe all of those addresses should be aligned, as we are using MLD_ALIGN for the corresponding allocations.
If you have experienced an alignment fault, then that would hint that MLD_ALIGN is missing somewhere.

flynd · 2026-02-02T13:29:58Z

In keccak_f1600_x4_v8a_scalar_hybrid_asm.S, function keccak_f1600_x4_scalar_v8a_hybrid_asm is called with X0 aligned to 128 bits, but it then sets X4=X0+0xC8 at line 69, making X4 only 64 bit aligned. The following "ldp q25, q26, [x0]" is thus fine, but "ldp q27, q28, [x4]" fails as it tries to read 2x128 bits from a 64 bit aligned pointer. The same happens again when X4 is set up for writing at line 976. My patch changes the reads and writes with X4 to 4x64 bits instead to fix this.
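
The alignment arithmetic in the keccak case can be checked with a few lines of C. This is only an illustrative sketch: the helper name and the example base address are hypothetical, and the code under discussion is assembly, not C. It relies on 0xC8 being 200 bytes, the size of one Keccak-f[1600] state.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest power-of-two alignment (in bytes) of a non-zero address.
 * Isolating the lowest set bit gives the biggest power of two dividing it. */
static size_t sketch_alignment_bytes(uintptr_t addr)
{
  return (size_t)(addr & (~addr + 1u));
}

/* Hypothetical 16-byte aligned state pointer, standing in for X0. */
#define SKETCH_X0 ((uintptr_t)0x1000)
/* X4 = X0 + 0xC8, as set up at the start of the function. */
#define SKETCH_X4 (SKETCH_X0 + 0xC8u)
```

With these definitions, `sketch_alignment_bytes(SKETCH_X4)` evaluates to 8, so a 16-byte access through X4 is not naturally aligned, while 8-byte accesses are.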

In polyz_unpack_19_asm.S, the loop increases X1 by 0x28, so even if it was 128 bit aligned the first time, it will be only 64 bit aligned in the second iteration. My patch changes the reads from 128+128+64 bits to 5x64 bits.
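
A small C model of this loop shows the pattern. It is a sketch under assumptions: the names are hypothetical, and in C the compiler chooses the actual instructions, so this only models the access sizes described above, not the assembly in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_STRIDE 0x28u /* bytes consumed per loop iteration */

/* Offset of the read pointer modulo 16 at the start of iteration i,
 * assuming the pointer was 16-byte aligned at iteration 0. */
static unsigned sketch_offset_mod16(unsigned i)
{
  return (i * SKETCH_STRIDE) % 16u;
}

/* Read 40 bytes as five 64-bit words; src only needs 8-byte alignment. */
static void sketch_load_5x64(const uint64_t *src, uint64_t out[5])
{
  size_t k;
  for (k = 0; k < 5; k++)
  {
    out[k] = src[k];
  }
}
```

The offset alternates between 0 and 8 from one iteration to the next, so every second iteration starts on an address that is only 8-byte aligned.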

In rej_uniform_asm.S, X7 starts out 128 bit aligned but the increase inside the loop at lines 108, 110, 112, and 114 only guarantees 32 bit alignment so my patch changes the writes from 128 bit to 4x32 bits.
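
Why the output pointer is only 32-bit aligned becomes clearer from a scalar version of rejection sampling, where the pointer advances by one 4-byte coefficient per accepted candidate. The following is a plain C sketch of that scalar technique (3 bytes per candidate, top bit masked, accept if below q), not the vectorized code in rej_uniform_asm.S; the function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_Q 8380417 /* ML-DSA modulus q */

/* Scalar rejection sampling: each 3-byte group yields a 23-bit candidate,
 * which is kept only if it is below q.
 * Returns the number of coefficients written to out. */
static size_t sketch_rej_uniform(int32_t *out, size_t out_len,
                                 const uint8_t *buf, size_t buf_len)
{
  size_t ctr = 0, pos = 0;
  while (ctr < out_len && pos + 3 <= buf_len)
  {
    uint32_t t = (uint32_t)buf[pos] | ((uint32_t)buf[pos + 1] << 8) |
                 ((uint32_t)(buf[pos + 2] & 0x7F) << 16);
    pos += 3;
    if (t < SKETCH_Q)
    {
      /* One 32-bit store; the write position advances by 4 bytes. */
      out[ctr++] = (int32_t)t;
    }
  }
  return ctr;
}
```

Because the number of accepted candidates per step varies, the write position can land on any multiple of 4 bytes, which is why a 128-bit store to it cannot be assumed aligned.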

It is possible that there are other unaligned accesses that I haven't noticed. These occurrences are the ones that caused exceptions when I tried to verify ML-DSA-87 signatures on a Cortex A55 with the MMU disabled. With these fixes, I can successfully verify signatures with native aarch64 backends for both arithmetic and fips202.

mkannwischer · 2026-02-02T13:33:57Z

Thanks for the details! I'll try to reproduce this tomorrow.

hanno-becker · 2026-02-02T14:12:05Z

Many thanks @flynd, this is interesting. I was not aware that when the MMU is disabled, unaligned accesses fault.

Can you say more about your application context?

flynd · 2026-02-02T16:19:18Z

> Many thanks @flynd, this is interesting. I was not aware that when the MMU is disabled, unaligned accesses fault.

The CPU only handles unaligned addresses automatically for address space marked as Normal memory. With the MMU disabled, however, all data accesses are treated as Device memory, which is why unaligned accesses cause alignment faults.
The C code needs to be built with -mstrict-align for this environment; otherwise the compiler will assume that unaligned accesses are acceptable and may produce code that triggers alignment faults.
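
On the C side, one portable way to express a possibly unaligned access is to go through memcpy and let the compiler pick loads that are safe for the target (on GCC for AArch64, in combination with -mstrict-align). This is a generic sketch with a hypothetical helper name, not code taken from mldsa-native.

```c
#include <stdint.h>
#include <string.h>

/* Load a 64-bit value in host byte order from an arbitrarily aligned address.
 * memcpy makes no alignment assumption about p, so the compiler must emit
 * accesses that are valid for the configured target. */
static uint64_t sketch_load_u64(const unsigned char *p)
{
  uint64_t v;
  memcpy(&v, p, sizeof v);
  return v;
}
```

Casting `p` to `const uint64_t *` and dereferencing it would instead promise the compiler an 8-byte aligned pointer, which is exactly the assumption that fails in this environment.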

> Can you say more about your application context?

The use case is to verify signatures for secure boot in early boot stages of an embedded system that lacks hardware support for ML-DSA. For simplicity, we do not enable the MMU in the earliest boot stages, which is why this alignment problem was noticeable. We will only use ML-DSA-87 and primarily only use keys that are built into the boot code.
The development is currently at a prototype stage: the goal is to get it working, evaluate whether mldsa-native is usable for us, and, if so, determine which config options work best for us.

As the early boot environment runs in internal RAM, before DRAM is available, minimizing the memory footprint is important to us. The option to build only for a specific parameter set to optimize for a single use case is thus appreciated. MLD_CONFIG_REDUCE_RAM is also very helpful.
However, there are a few things I currently handle with local (very hackish) patches that would be good to have support for. Unless I figure out how to make them into nice patches, I will most likely open issues to request these features soon:

- An option to disable signature creation and key generation to build only the necessary code for verifying signatures.
- An option to force all functions and data as static. Defining "MLD_CONFIG_EXTERNAL_API_QUALIFIER static" only affects some functions, but I have many more "static" directives added right now to let gcc build everything as one unit and freely optimize. (It currently also lets gcc remove all functions and data that are unused due to mldsa_verify() being the only entry point.)
- mldsa_native_asm.S includes several files that aren't needed for a given parameter set, making it build more than necessary. With a build option to disable signature creation, additional includes can also be removed.
- It would also be nice if the stack frame could be reduced further by utilizing the heap more.

hanno-becker · 2026-02-02T16:29:11Z

@flynd Thank you for the context! @mkannwischer and I are happy to help where we can to make mldsa-native usable for your context, so please keep the issues coming. We recently worked through the integration into CherIOT, which also drove a number of improvements.

> An option to force all functions and data as static. Defining "MLD_CONFIG_EXTERNAL_API_QUALIFIER static" only affects some functions, but I have many more "static" directives added right now to let gcc build everything as one unit and freely optimize. (It currently also lets gcc remove all functions and data that are unused due to mldsa_verify() being the only entry point.)

Did you consider MLD_CONFIG_INTERNAL_API_QUALIFIER? If you use static for MLD_CONFIG_EXTERNAL_API_QUALIFIER and MLD_CONFIG_INTERNAL_API_QUALIFIER, you should not have external linkage at all (or so I hoped). See e.g. https://github.com/pq-code-package/mldsa-native/blob/main/examples/monolithic_build_multilevel_native/main.c.
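
As a rough illustration of that suggestion, a configuration along the following lines would be the starting point. The three macro names come from this thread; the two functions are hypothetical stand-ins, and the sketch assumes the qualifier macros are expanded in front of each declaration.

```c
/* Configuration fragment: a single parameter set, no external linkage. */
#define MLD_CONFIG_PARAMETER_SET 87
#define MLD_CONFIG_EXTERNAL_API_QUALIFIER static
#define MLD_CONFIG_INTERNAL_API_QUALIFIER static

/* Hypothetical stand-ins showing the effect of the qualifiers:
 * both functions end up with internal linkage in this translation unit. */
MLD_CONFIG_INTERNAL_API_QUALIFIER int sketch_internal_helper(void)
{
  return 1;
}

MLD_CONFIG_EXTERNAL_API_QUALIFIER int sketch_api_entry(void)
{
  return sketch_internal_helper() + 1;
}
```

Any symbol declared without one of the two qualifier macros keeps external linkage, which matches the gap reported for the fips202 sources and the aarch64 tables later in the thread.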

> It would also be nice if the stack frame could be reduced further by utilizing the heap more.

Did you notice that you can customize allocation now? https://github.com/pq-code-package/mldsa-native/blob/main/mldsa/mldsa_native_config.h#L445

> mldsa_native_asm.S includes several files that aren't needed for a given parameter set, making it build more than necessary. With a build option to disable signature creation, additional includes can also be removed.

I agree those are included, but they should not be built because of guards in the respective files. Otherwise, we have a bug. Can you check?

flynd · 2026-02-02T19:26:09Z

> > An option to force all functions and data as static. Defining "MLD_CONFIG_EXTERNAL_API_QUALIFIER static" only affects some functions, but I have many more "static" directives added right now to let gcc build everything as one unit and freely optimize. (It currently also lets gcc remove all functions and data that are unused due to mldsa_verify() being the only entry point.)
>
> Did you consider MLD_CONFIG_INTERNAL_API_QUALIFIER? If you use static for MLD_CONFIG_EXTERNAL_API_QUALIFIER and MLD_CONFIG_INTERNAL_API_QUALIFIER, you should not have external linkage at all (or so I hoped). See e.g. https://github.com/pq-code-package/mldsa-native/blob/main/examples/monolithic_build_multilevel_native/main.c.

I forgot to say, but yes, I also have MLD_CONFIG_INTERNAL_API_QUALIFIER defined to static. After checking again, I see that all functions I explicitly needed to add static to are located in mldsa/src/fips202/fips202.c, mldsa/src/fips202/fips202x4.c, and mldsa/src/fips202/keccakf1600.c. I've also marked the data tables as static in mldsa/src/native/aarch64/src/aarch64_zetas.c, mldsa/src/native/aarch64/src/polyz_unpack_table.c, mldsa/src/native/aarch64/src/rej_uniform_eta_table.c, and mldsa/src/native/aarch64/src/rej_uniform_table.c. After this, mldsa-native no longer has any global symbols in the linked ELF (except the entry point I've added, which is a wrapper around mldsa_verify that also sets up a heap provided by the caller).

> > It would also be nice if the stack frame could be reduced further by utilizing the heap more.
>
> Did you notice that you can customize allocation now? https://github.com/pq-code-package/mldsa-native/blob/main/mldsa/mldsa_native_config.h#L445

Yes, I'm using that option to define a heap for memory allocation and I must say that I appreciate that you have the exact heap size necessary documented in mldsa_native.h so I didn't have to guess how large it needs to be. But I still count 3.6kB of stack needed (or 7.3kB without MLD_CONFIG_REDUCE_RAM) which I was hoping to squeeze down below 2kB if possible.

> > mldsa_native_asm.S includes several files that aren't needed for a given parameter set, making it build more than necessary. With a build option to disable signature creation, additional includes can also be removed.
>
> I agree those are included, but they should not be built because of guards in the respective files. Otherwise, we have a bug. Can you check?

I checked the linked ELF to verify that all symbols were actually used somewhere, and there were several that weren't. Also, I don't want to import the entire mldsa-native git repository into my project, only the necessary files, so if the conditions are inside each file I would still need to import extra files even though they don't add anything to the output with my configuration.
This is part of my current local patch as an example. I've tried to duplicate the conditions from where each of the functions is called:

+#if defined(MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L4) && MLD_CONFIG_PARAMETER_SET == 44
 #include "src/native/aarch64/src/mld_polyvecl_pointwise_acc_montgomery_l4.S"
+#elif defined(MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L5) && MLD_CONFIG_PARAMETER_SET == 65
 #include "src/native/aarch64/src/mld_polyvecl_pointwise_acc_montgomery_l5.S"
+#elif defined(MLD_USE_NATIVE_POLYVECL_POINTWISE_ACC_MONTGOMERY_L7) && MLD_CONFIG_PARAMETER_SET == 87
 #include "src/native/aarch64/src/mld_polyvecl_pointwise_acc_montgomery_l7.S"
+#endif
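
The effect of such a chain can be modelled in plain C: only the branch for the configured parameter set is compiled, so nothing from the other two variants reaches the object file. This is a hypothetical sketch; `sketch_vector_len` is a made-up stand-in for the per-level routines, returning the suffix of the variant that was selected (l4, l5 or l7).

```c
/* Default to the parameter set used in this thread if none is configured. */
#ifndef MLD_CONFIG_PARAMETER_SET
#define MLD_CONFIG_PARAMETER_SET 87
#endif

#if MLD_CONFIG_PARAMETER_SET == 44
/* Stand-in for the _l4 variant. */
static int sketch_vector_len(void) { return 4; }
#elif MLD_CONFIG_PARAMETER_SET == 65
/* Stand-in for the _l5 variant. */
static int sketch_vector_len(void) { return 5; }
#elif MLD_CONFIG_PARAMETER_SET == 87
/* Stand-in for the _l7 variant. */
static int sketch_vector_len(void) { return 7; }
#else
#error "unsupported MLD_CONFIG_PARAMETER_SET"
#endif
```

Placing the same condition inside each per-level file has the same effect on the binary, but the files for the unused levels still have to be present in the source tree.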

hanno-becker · 2026-02-02T19:54:08Z

> But I still count 3.6kB of stack needed (or 7.3kB without MLD_CONFIG_REDUCE_RAM) which I was hoping to squeeze down below 2kB if possible.

We are tracking this in #867.

> I forgot to say, but yes, I also have MLD_CONFIG_INTERNAL_API_QUALIFIER defined to static. After checking again, I see that all functions I explicitly needed to add static to are located in mldsa/src/fips202/fips202.c, mldsa/src/fips202/fips202x4.c, and mldsa/src/fips202/keccakf1600.c. I've also marked the data tables as static in mldsa/src/native/aarch64/src/aarch64_zetas.c, mldsa/src/native/aarch64/src/polyz_unpack_table.c, mldsa/src/native/aarch64/src/rej_uniform_eta_table.c, and mldsa/src/native/aarch64/src/rej_uniform_table.c. After this, mldsa-native no longer has any global symbols in the linked ELF (except the entry point I've added, which is a wrapper around mldsa_verify that also sets up a heap provided by the caller).

I opened #936

> I checked the linked ELF to verify that all symbols were actually used somewhere, and there were several that weren't. Also, I don't want to import the entire mldsa-native git repository into my project, only the necessary files, so if the conditions are inside each file I would still need to import extra files even though they don't add anything to the output with my configuration.
> This is part of my current local patch as an example. I've tried to duplicate the conditions from where each of the functions is called:

Oof. You are right: #937. We are doing this correctly in mlkem-native (e.g. https://github.com/pq-code-package/mlkem-native/blob/main/mlkem/src/native/aarch64/src/polyvec_basemul_acc_montgomery_cached_asm_k3.S#L52), but missed it for mldsa-native.

But regardless, it's true that currently you would need to manually adjust mldsa_native_asm.S.

flynd requested a review from a team as a code owner February 2, 2026 12:01

flynd force-pushed the fix-aarch64-unaligned branch from b2f1933 to dc8a7a1 Compare February 2, 2026 12:09

hanno-becker mentioned this pull request Feb 2, 2026: [BENCH] aarch64: Don't perform unaligned memory accesses #935 (Draft)

This was referenced Feb 2, 2026:
Ensure that all non-static symbols are guarded by MLD_CONFIG_{INTERNAL/EXTERNAL}_API_QUALIFIER #936 (Open)
Guard parameter-set specific backend files #937 (Open)
