[PATCH v1] linux-gen: queue: add loop unrolling for event conversions#2211

Open

sanvai01 wants to merge 1 commit intoOpenDataPlane:masterfrom

sanvai01:dev/loop_unrolling

sanvai01 commented Apr 30, 2025

Added loop unrolling for event to header conversions to improve performance for single producer single consumer and basic queues.


          linux-gen: queue: add loop unrolling for event conversions

8efbe1f

Added loop unrolling for event to header conversions to improve
performance for single producer single consumer and basic queues.

Signed-off-by: Sanjyot Vaidya <Sanjyot.Vaidya@arm.com>

odpbuild changed the title ~~linux-gen: queue: add loop unrolling for event conversions~~ [PATCH v1] linux-gen: queue: add loop unrolling for event conversions

JannePeltonen reviewed

View reviewed changes

platform/linux-generic/odp_queue_basic.c

+              		event_index[i+2] = event_hdr[idx+2]->index.u32;
+              		event_index[i+3] = event_hdr[idx+3]->index.u32;
+              	}
+              	switch (num & 0x3) {

Collaborator

JannePeltonen May 2, 2025

This looks like it is going to increase code size and add a few new branches (although I did not look at any generated assembly). It is not clear to me that the change is a net positive in actual applications and not just in microbenchmarks that presumably always enqueue long bursts. I believe some ODP apps have high instruction cache pressure.

Another possible optimization might be to get rid of the whole header-pointer-to-index conversion and just store the pointers in rings (but that would then use more cache lines in the rings, so I do not know if that is good or not).

Contributor

vinodkrishna-arm May 6, 2025

Yes, cache pressure is something we need to understand. Is there any benchmarking application (not a micro benchmark like we used here) that can be executed to confirm the behaviour?

MatiasElo reviewed

View reviewed changes

platform/linux-generic/odp_queue_basic.c

+              	switch (num & 0x3) {
+              	case 3:
+              		event_index[i++] = event_hdr[idx++]->index.u32;
+              		__attribute__((fallthrough));

Collaborator

MatiasElo May 2, 2025

Project convention has been to use simply /* Fallthrough */ comments. i386 build seems to have issues with these attributes.

platform/linux-generic/odp_queue_basic.c

+              	switch (num & 0x3) {
+              	case 3:
+              		event_hdr[i++] = _odp_event_hdr_from_index_u32(event_index[idx++]);

Collaborator

MatiasElo May 2, 2025

Unnecessary newline.

platform/linux-generic/odp_queue_basic.c

Collaborator

MatiasElo May 2, 2025

While testing this I noticed that now we could have separate optimized implementations for single and multi event enq/deq functions (e.g. odp_queue_enq(), odp_queue_enq_multi()). Currently, the single event variants e.g. plain_queue_enq()) share the same code and I measured some performance penalty from these changes.

Contributor

vinodkrishna-arm May 6, 2025

yes, this makes sense.

platform/linux-generic/odp_queue_basic.c

		@@ -515,20 +515,52 @@ static inline void event_index_from_hdr(uint32_t event_index[],
		_odp_event_hdr_t *event_hdr[], int num)

Collaborator

MatiasElo May 2, 2025

Since event_index_from_hdr() and event_index_to_hdr() functions are identical for basic and spsc queues, their code could be moved to platform/linux-generic/include/odp_event_internal.h. The convention has been to prefix implementation internal functions with _odp, so _odp_event_index_from_hdr() and _odp_event_index_to_hdr().

Collaborator

MatiasElo commented Jun 26, 2025

After this commit this PR seems no longer necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet