[PATCH v1] linux-gen: queue: add loop unrolling for event conversions#2211
[PATCH v1] linux-gen: queue: add loop unrolling for event conversions#2211sanvai01 wants to merge 1 commit intoOpenDataPlane:masterfrom
Conversation
Added loop unrolling for event to header conversions to improve performance for single producer single consumer and basic queues. Signed-off-by: Sanjyot Vaidya <Sanjyot.Vaidya@arm.com>
| event_index[i+2] = event_hdr[idx+2]->index.u32; | ||
| event_index[i+3] = event_hdr[idx+3]->index.u32; | ||
| } | ||
| switch (num & 0x3) { |
There was a problem hiding this comment.
This looks like it is going to increase code size and add a few new branches (although I did not look at any generated assembly). It is not clear to me that the change is a net positive in actual applications and not just in microbenchmarks that presumably always enqueue long bursts. I believe some ODP apps have high instruction cache pressure.
Another possible optimization might be to get rid of the whole header-pointer-to-index conversion and just store the pointers in rings (but that would then use more cache lines in the rings, so I do not know if that is good or not).
There was a problem hiding this comment.
Yes, cache pressure is something we need to understand. Is there any benchmarking application (not a micro benchmark like we used here) that can be executed to confirm the behaviour?
| switch (num & 0x3) { | ||
| case 3: | ||
| event_index[i++] = event_hdr[idx++]->index.u32; | ||
| __attribute__((fallthrough)); |
There was a problem hiding this comment.
Project convention has been to use simply /* Fallthrough */ comments. i386 build seems to have issues with these attributes.
| switch (num & 0x3) { | ||
| case 3: | ||
| event_hdr[i++] = _odp_event_hdr_from_index_u32(event_index[idx++]); | ||
|
|
There was a problem hiding this comment.
While testing this I noticed that now we could have separate optimized implementations for single and multi event enq/deq functions (e.g. odp_queue_enq(), odp_queue_enq_multi()). Currently, the single event variants e.g. plain_queue_enq()) share the same code and I measured some performance penalty from these changes.
There was a problem hiding this comment.
yes, this makes sense.
| @@ -515,20 +515,52 @@ static inline void event_index_from_hdr(uint32_t event_index[], | |||
| _odp_event_hdr_t *event_hdr[], int num) | |||
There was a problem hiding this comment.
Since event_index_from_hdr() and event_index_to_hdr() functions are identical for basic and spsc queues, their code could be moved to platform/linux-generic/include/odp_event_internal.h. The convention has been to prefix implementation internal functions with _odp, so _odp_event_index_from_hdr() and _odp_event_index_to_hdr().
|
After this commit this PR seems no longer necessary. |
Added loop unrolling for event to header conversions to improve performance for single producer single consumer and basic queues.