Conversation
0199b6c to
26680e1
Compare
|
The cray compile warnings/errors look legit. Not sure why lots of warns and notes led to GCC returning failure however. |
|
Because my .m4 screwed up the flags. Fix on the way. |
93f23d9 to
84e8a49
Compare
0aa9492 to
1fbeead
Compare
1fbeead to
6f2c04b
Compare
9595ad0 to
075549f
Compare
3438d2d to
136b6da
Compare
|
@bosilca I just pushed two commits to fix some typos FWIW, I am also writing the For SVE, would you rather have it in its own module (e.g. |
|
Thanks @ggouaillardet . Regarding the test I have a way more extensive version in the pipeline, should be able to push in the next few days. For the ARM extensions having all ARM-V* versions in a single place can be beneficial, as we will be able to reuse parts of the code. It might however make the compiling stage more complex as we will need to provide specialized flags for some of the .c file (to enable specific architecture options). What's you take on this ? |
|
@bosilca I pushed a new fix in your PR I added support for NEON and SVE in my own branch (based on top of this one) at https://github.com/ggouaillardet/ompi/tree/topic/op_arm For now, I did two distinct components, and though there is some overlap, there is not that much imho. If I understand correctly the
before each step, the flags are tested, and draining the main loop (e.g. the last 63
we could also do loop peeling to be faster when data is not optimally aligned. In theory, we could have several components (e.g. one per avx type) but in order to avoid code duplication, it can be left in a single module (and since subroutine selection is performed at selection time, there is no runtime overhead) |
|
bot:ibm:xl:retest |
|
@ggouaillardet I did not hear about anything but performance issues while mixing AVX and SSE, and these are not a concern for us because in the worst case we do such a transition once. We could follow Intel's suggestions and pass the code through vtune to make sure. The code can certainly be improved, but I don't know by how much. I assumed that most of the MPI_Op executes on data large enough to completely hide the few cycles lost in the branch mismanagement, so that reordering and loop peeling might not be the such a deal breaker. But we might want to investigate how to deal with the case where the user data alignment matches the AVX needs, as this should allow us to use AVX instructions for aligned memory (load instead of loadu) accesses but I don't how much we can gain. Some hits can be found here on McCalpin answer. For the SVE code, @dong0321 and myself were working on a branch in my repo, that you can access via a PR. It is mostly based on the AVX code, and has no support for Neon. We should merge our efforts on this, and reach out to Fujitsu (@Shinji-Sumimoto) and ARM (@shamisp) for final validation. |
|
I don't understand the AZR issue here. It seems that the instruction is not supported, but it did find it while looking for it during configure. So I wonder of somehow the additional flags detected during configure are not passed to the compiler for the avx_functions.c file. But for this I would need to see the compile arguments used. Can someone form Mellanox send me this info please. |
|
bot:aws:retest Same real-but-unrelated-to-this-PR timeout we've been seeing for a while (see #7847). |
|
bot:aws:retest |
jsquyres
left a comment
There was a problem hiding this comment.
I'm running into problems when testing this PR locally. More information coming shortly.
Intel icc 19.04When I compile with icc 19.04, I get the following (which is probably unsurprising): But then I get this when compiling nearly every file in Open MPI: These are just warnings, but it makes the output from GNU Gcc 7.3.0, 8.2.0, and 9.3.0With all 3 of these versions of GCC, I get the following: And then: This is on an older Sandy Bridge architecture machine, so it could be before AVX2 was available...? My full configure output from gcc 9.3.0 and corresponding config.log is attached. |
|
The issue reported by the Intel compiler has nothing to do with this patch. It's about the compiler being vocal about a volatile being passed as an Atomic. |
|
for the second case can you please check the flags used to compile the AVX2 library. All these instructions are part of the AVX2 support, it looks as if the -mavx2 flag was not passed to the compiler. |
|
Looks like Here's the full output: avx-make-V=1.txt |
|
Filed #7909 about the intel compiler atomic warnings. |
|
@bosilca With a bunch of trial and error, this program fails to compile for me on my systems: Calling functions like |
|
We figured it out, Jeff's assembler (v2.20.51.0.2) was from an older gcc version. With a newer version everything worked as expected. I will add a test to make sure I catch such corner cases, and disable AVX2 support to allow the build process to reach completion. |
|
Just to clarify:
Case in point:
The fix is to strengthen OMPI's avx |
|
@bosilca Can you apply the same kind of fix for the 512-sized vectors that you did for the 256-sized vectors? I'm now getting errors with |
|
More bad news, I'm afraid. I ran the reduce_local test with gcc 9+newest Here's the stdout from configure in this environment: And you can see the propagation of those flags in a rendered Makefile: Is building with |
|
Why don't you just detect the platform and turn it "off" for anything older than skylake?? It isn't worth all the trouble for these corner cases. |
|
I will modify the 512 case as well, not because I think it's the right approach, but because I want to get this done. While on the 256 case I understand that |
|
@jsquyres the test is correct. As I mentionned the AVX instructions use a different overflow policy, the resulting value is set to the min/max of the representative value for the type. Thus, unsigned 8 bits 5 + 253 will be 2, except with AVX where the result will be 255. Good thing that MPI does not mandate how these operations behave with regard to overflow. I will change the test to avoid printing this information. |
|
Sorry for the delay; I finally got to testing this PR again this morning. It looks good! I pushed a commit to this PR that fixes up the check_ops.sh script -- feel free to squash if you like it. I found that on my Sandy Bridge machines:
Yay! But after I fixed up the check_ops script, I noticed that all three cases are failing MIN and MAX with 64-bit values: |
Add logic to handle different architectural capabilities Detect the compiler flags necessary to build specialized versions of the MPI_OP. Once the different flavors (AVX512, AVX2, AVX) are built, detect at runtime which is the best match with the current processor capabilities. Add validation checks for loadu 256 and 512 bits. Add validation tests for MPI_Op. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> Signed-off-by: dongzhong <zhongdong0321@hotmail.com> Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
|
A typo in the tester (max and min were not correctly computed). I merged everything into a single commit, this is ready to go. |
|
bot:aws:retest |
|
🎉 |
Add support for AVX instructions for all MPI_Op.