Conversation
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
bosilca
left a comment
There was a problem hiding this comment.
Overall looking good. Do the cleanup and then let's look at it again.
You should run some performance tests with your local_reduction tester, and see how this compares with the default implementation.
| */ | ||
| static int avx_component_open(void) | ||
| { | ||
| opal_output(ompi_op_base_framework.framework_output, "avx component open"); |
| Note that if this function returns non-OMPI_SUCCESS, then this | ||
| component won't even be shown in ompi_info output (which is | ||
| probably not what you want). | ||
| */ |
| */ | ||
| static int avx_component_close(void) | ||
| { | ||
| opal_output(ompi_op_base_framework.framework_output, "avx component close"); |
There was a problem hiding this comment.
In fact remove all opal_output that are for debugging.
| return OMPI_SUCCESS; | ||
| } | ||
|
|
||
| static char *avx_component_version; |
| types are supported. This allows you to change the behavior of | ||
| this component at run-time (by setting these MCA params at | ||
| run-time), simulating different kinds of hardware. */ | ||
| mca_op_avx_component.hardware_available = true; |
There was a problem hiding this comment.
This is not the final version right ?
| module->opm_3buff_fns[i] = ompi_op_avx_3buff_functions[OMPI_OP_BASE_FORTRAN_BXOR][i]; | ||
| } | ||
| break; | ||
| case OMPI_OP_BASE_FORTRAN_LAND: |
| struct ompi_op_base_module_1_0_0_t *module) \ | ||
| { \ | ||
| int step; \ | ||
| switch(type_size) { \ |
There was a problem hiding this comment.
I'm sure there is a better way to decide the step.
| int size = *count/step; \ | ||
| int i; \ | ||
| int round = size*64; \ | ||
| for (i = 0; i < round; i+=64) { \ |
There was a problem hiding this comment.
Are these operations supposed to apply on cache-line aligned elements ?
| struct ompi_op_base_module_1_0_0_t *module) \ | ||
| { \ | ||
| int step; \ | ||
| switch(type_size) { \ |
| int size = *count/step; \ | ||
| int i; \ | ||
| int round = size*64; \ | ||
| for (i = 0; i < round; i+=64) { \ |
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
| static int avx_component_init_query(bool enable_progress_threads, | ||
| bool enable_mpi_thread_multiple) | ||
| { | ||
| if (mca_op_avx_component.hardware_available && !enable_mpi_thread_multiple) { |
There was a problem hiding this comment.
Why are you disabling the component if threading is enabled ?
The float add operator is now validated. Trying to figure out the fastest implementation. Add default avx512 flags to compile seamlesly Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
ca1b5de to
65644d7
Compare
During configure it detects if the target architecture is x86_64 to enable itself. Then during query it detects processor capabilities using cpuid and disable itself if AVX2 support is not found. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
ompi/mca/op/avx/op_avx_functions.c
Outdated
| for (i = 0; i < round; i+=64) { \ | ||
| __m512i vecA = _mm512_loadu_si512((in+i));\ | ||
| __m512i vecB = _mm512_loadu_si512((out+i));\ | ||
| __m512i vecA = _mm512_loadu_si512((_in+i));\ |
There was a problem hiding this comment.
Definition for _mm512_load_si512:
__m512i _mm512_load_si512 (void const* mem_addr)
I am not sure we need to do type convert. The types_per_step will be different based on the type, I tried to calculate the correct step for each type, but it seems not working.
For correctness purpose I will remove type convertion now. If we really need to specific type. I will fix later.
There was a problem hiding this comment.
This is plain ugly and highly non portable. I suggest you put back the typecast and add not 64 to i but step. In fact you should model this on the float version below, without using the mask for the last set of operations but instead falling back to a lower vector op and then a Duff device.
…device code Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
Signed-off-by: dongzhong <zhongdong0321@hotmail.com>
|
This has been moved to OMPI master in 7419 |
Status : Checked all operations with all types.