Parallel moorings #682
base: develop
Conversation
|
Thanks @fsalmon001. I can compile netcdf-c (latest version, 4.9.3) but netcdf-cxx fails to compile |
|
Hi again @fsalmon001, If you know apptainer, maybe you could help me to make a container. Here is my recipe file: |
|
PS this is the environment file |
|
Ok, I had the same troubles. The only way I found is to first install an older version of hdf5 with parallel support:

```
wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.7/src/hdf5-1.10.7.tar.gz
tar -xzf hdf5-1.10.7.tar.gz
cd hdf5-1.10.7
./configure --prefix=/usr/hdf5-1.10 --enable-parallel --enable-shared
make -j8
make install
```

Then, I used netcdf-c-4.8.1:

```
export CPPFLAGS="-I/usr/hdf5-parallel/include"
export LDFLAGS="-L/usr/hdf5-parallel/lib"
export LD_LIBRARY_PATH="/usr/hdf5-parallel/lib:$LD_LIBRARY_PATH"
cd netcdf-c-4.8.1
./configure --prefix=/usr/netcdf-parallel \
    --enable-parallel-tests \
    --disable-detect-parallel
make -j8
make install
```

Then, if you had already installed another version of netcdf, you need to point to this new version to have access to the parallel functions. You can check with `nc-config --has-parallel`; if the answer is no, there is an issue. Of course, you need to change the paths. Tell me if it works; I did not write every step, so it may be incomplete. |
|
Thanks @fsalmon001. Where did you download that version of netcdf-c from? And do we still need netcdf-cxx? |
|
|
I guess it was here: https://github.com/Unidata/netcdf-c/tree/v4.8.1 @tdcwilliams. For the C++ version, parallel I/O is not implemented in it, so I had to use the C version for parallelization. However, netcdf-cxx is still used sequentially in neXtSIM. For a small grid and a small number of processors, the sequential approach is still more efficient than the parallel one, because I had to add a big preliminary step before writing the netCDF file in parallel. |
|
Thanks @fsalmon001, I'll try it out tomorrow. Would it be much work to change the sequential writing to netcdf-c? |
|
|
In gridoutput, no, but there are netCDF functions elsewhere, in other functions. I did not have a look at that @tdcwilliams |
|
Ok sure, thanks @fsalmon001. I'll try to compile netcdf-cxx then. Is version 4.3.1 ok, and what flags did you use for configure for that one? |
|
|
I did not need to modify netcdf-cxx, I did not even recompile it after compiling netcdf-c @tdcwilliams |
|
Hi @fsalmon001, I still need to try an irregular output grid though. Maybe I'll also try large_arctic_5km to see the difference there as well. |
|
Hi again @fsalmon001, This happens both when I use a container and when I compile with intel compilers. |
|
Hi @tdcwilliams, No, it never happens for me. Maybe this stems from a different MPI configuration, or maybe an MPI barrier is needed somewhere. Do you know in which function the process hangs? |
|
Hi @fsalmon001 |
|
Thank you @tdcwilliams. I am not sure about the issue, but I had some problems here. The function nc_put_vara_float must be called by every process the same number of times. This is why I do an MPI reduce to get the max_size of the loop, and the processors with less data send dummy information. If it blocks here, I think the number of nc_put_vara_float calls is not the same for each process. With my configuration it is the same, so I assume you use an option that I did not, which leads to differences here. Maybe the problem is that M_nodal_variables or M_elemental_variables are not the same on all processors, but that would be strange. Maybe you could put some `std::cout << M_comm.rank() << " " << k << std::endl;` in some places in the loop; I feel some process does not perform each iteration, but I do not understand why. |
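The call-count contract described in this comment can be sketched without MPI or netCDF. This is a minimal pure-Python stand-in: `collective_write_rounds`, `local_sizes`, and the `"chunk"` labels are illustrative names, not the neXtSIM code, and the MPI reduce and nc_put_vara_float calls are replaced by plain Python equivalents.

```python
# Sketch of the collective-write contract: every rank must issue the
# collective write (nc_put_vara_float in the real code) the same number of
# times, so ranks with fewer local chunks pad with dummy writes.

def collective_write_rounds(local_sizes):
    """Return, per rank, the sequence of writes issued (None = dummy write)."""
    max_size = max(local_sizes)            # plays the role of MPI_Allreduce(MAX)
    calls = []
    for n in local_sizes:
        rounds = [f"chunk{k}" for k in range(n)]
        rounds += [None] * (max_size - n)  # dummy (zero-count) writes
        calls.append(rounds)
    return calls

calls = collective_write_rounds([3, 1, 2])
# All ranks issue the same number of collective calls:
assert {len(c) for c in calls} == {3}
```

If one rank skips a round (no dummy write), the collective call blocks exactly as described in the discussion above.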
|
Hi @fsalmon001, It is hanging at the nullptr line when local_size < max_size; I tried making sure count = {0,0,0} but it still hung |
|
Hi @tdcwilliams, is it with the same version of netcdf as mine? I do not understand why it cannot run with a null pointer. Is it at the first iteration of the loop or after some iterations? What are your options for moorings? Maybe I can test here. |
|
Hi @fsalmon001, yes it is the same version of netcdf. |
|
Hi again @fsalmon001 |
|
Actually Claude suggested changing mode to NC_INDEPENDENT for the dummy write and it no longer hangs. |
|
Thank you @tdcwilliams. I knew about NC_INDEPENDENT, but I think there was an issue with it. Maybe every process overwrites what the others did. Did you look at the resulting netCDF file? Is it good? Maybe it was an issue during my development, but if each process now writes only non-overlapping rectangles, this could be ok. |
|
Hi @fsalmon001 , when I go back to 32 cores it hangs again (also writing more variables every 6 hours, instead of every time step) |
|
Ok, I will have a closer look and get back to you @tdcwilliams |
|
I do not understand how we can have such different results. Here it works with your options, but with NC_INDEPENDENT I have NaN in the netCDF output file @tdcwilliams. Do you really get a good netCDF file with correct values with NC_INDEPENDENT? |
|
And if you add an `M_comm.barrier()` just after the nc_put_vara_float functions @tdcwilliams? Maybe the problem for you is that some processes exit the loop and the writeNetCDFParallel function before others. I ran cases with 128 processors on a cluster where I have netcdf 4.8.1. |
|
Hi @tdcwilliams, But when using only NC_INDEPENDENT, you should have NaN in the output file since each process overwrites the file. |
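The non-overlap requirement behind this comment can be illustrated with a minimal pure-Python sketch. The row-wise partition below is an illustrative stand-in, not the neXtSIM decomposition: in independent mode each rank writes its own hyperslab (start, count) directly, so the slabs must tile the grid without overlap or ranks clobber each other's values.

```python
# Illustrative sketch: partition nrows of a grid into contiguous,
# non-overlapping row slabs, one (start, count) hyperslab per rank.

def row_slabs(nrows, nranks):
    base, extra = divmod(nrows, nranks)
    slabs, start = [], 0
    for r in range(nranks):
        count = base + (1 if r < extra else 0)  # spread the remainder
        slabs.append((start, count))
        start += count
    return slabs

slabs = row_slabs(10, 3)
covered = [r for s, c in slabs for r in range(s, s + c)]
assert covered == list(range(10))  # full coverage, each row written exactly once
```

If two slabs overlapped, the last writer would win in independent mode, which is one way partially initialized buffers could surface as NaN in the file.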
|
Hi @fsalmon001, Similarly, commenting out the call with nullptr only works for 2 CPUs. (This option makes good nc files, although the domain is truncated more than usual.) |
This reverts commit 606ced6. Only works for 2 CPUs; similarly, commenting out the nullptr write only works for 2 CPUs.
|
I also asked some generative AIs. Have you tried, like in your commit, to initialize ...? Sorry, I have nothing more @tdcwilliams |
|
Hi @fsalmon001, With 3 CPUs my output is: so only one of the ranks exits. I now have a test script (test_netcdf_parallel.cpp); it runs with This test actually passes (it doesn't hang and makes a sensible .nc file) in my environment, so I am not sure about the difference between the call in nextsim and this one. I don't know if it would be worth having a test that is more similar to nextsim. I don't know if you have any experience with containers - would you be able to try to make a container where your code runs? Then we would know we could run it anywhere. I can give you some recipe files to start from if you were willing to try that? |
|
Yes, it is strange @tdcwilliams. Could it be a conflict between different versions of netcdf, if several are loaded? Maybe one processor runs the nc_put_vara_float function of netcdf 1 and another the function of netcdf 2, which would prevent them from communicating. I don't know if this is possible. No, I am not familiar with containers, but you can give me the files and I will see if I can do it. |
|
Hi @fsalmon001, I've put the files here: On your personal computer, install apptainer ( The nextsim.def file has the compilation formulae that I used for the netcdf libraries. Incidentally, on our HPC nextsim runs 2x faster inside apptainer than with intel compilers, so I think it would be worth getting it to work for its own sake, and not just for portability. Once you've built the image files and compiled the model, you may need some more help to run the model. On our HPC, I source this file: https://github.com/nansencenter/nextsim-env/blob/apptainer-netcdf-parallel/machines/tim/fram/env/pynextsim.apptainer.src which mounts some directories inside the container with forcing data etc. and sets some variables inside it. It works with |
|
Hi @tdcwilliams, With your nextsim.def and nextsim.sh, the code already runs inside the container with `moorings.parallel_output=true` here, with:

```
[moorings]
#snapshot=true
output_timestep=1
spacing = 10
use_moorings=true
output_time_step_units=time_steps
file_length=monthly
variables = conc
#variables = thick
#variables = velocity
#variables = ridge_ratio
#variables = damage
parallel_output = true
```
|
|
Thanks very much for checking that, @fsalmon001. A bit puzzling, but maybe our HPC is the problem (it's getting shut down soon). I'll try the new machine and see what happens. |
|
|
Hi @fsalmon001 |
|
Hi @tdcwilliams, I am on my laptop, with ubuntu 22.04.3. I just go into my container, then |
|
Hi @fsalmon001, Can I double check that it runs for the same mesh as I was using ( You copy it to |
|
By the way @fsalmon001, |
|
Hi @tdcwilliams, can you also give me your mesh file please? I have only the small arctic mesh. No, I did not rigorously check the memory, but we keep the same global grid in each process, only with local data, so the memory need can only decrease, though not as much as in a fully parallel decomposition. The final file is obviously the same size. |
|
Hi @fsalmon001 |
|
I did not expect that the irregular grid would be curvilinear. What I have coded is not suitable for this kind of grid. I tried to find a way to make it work, but I did not find a simple approach because my algorithm is based on rectangles. I think we need a totally different approach for this. I will think a bit about it in case I find a workaround, but I don't think it will be possible. If I can't, I will remove the parts corresponding to the irregular grids from my code @tdcwilliams. |
|
Thanks @fsalmon001. |
|
Hi @fsalmon001 |
|
The result is: |
|
In addition @tdcwilliams, I think I have found a way to deal with the irregular grids. There are more parallel exchanges, so this should be less time-efficient, but in terms of memory it should be similar to the regular grids. It works here. Moreover, I had an issue with hanging runs. This stemmed from MPI exchanges, so I replaced them with non-blocking exchanges. Could you then try again with this commit, using both the regular and irregular grids, please? Hopefully this could solve your problem. |
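The kind of hang the switch to non-blocking exchanges avoids can be illustrated without MPI. This is a loose pure-Python stand-in (threads and buffered queues instead of MPI ranks; all names are illustrative): with rendezvous-style blocking sends, two ranks that both send before they receive can deadlock, whereas posting the send first as a non-blocking operation (like MPI_Isend) lets both receives complete.

```python
# Two "ranks" exchange one message each. The buffered put() returns
# immediately (like a posted MPI_Isend), so the subsequent blocking get()
# (like MPI_Recv) always finds data and neither side deadlocks.
import queue
import threading

def rank(my_inbox, peer_inbox, my_value, result):
    peer_inbox.put(my_value)       # non-blocking "Isend": buffered, returns at once
    result.append(my_inbox.get())  # blocking "Recv": the peer's send already posted

a_box, b_box = queue.Queue(), queue.Queue()
got_a, got_b = [], []
t0 = threading.Thread(target=rank, args=(a_box, b_box, "from A", got_a))
t1 = threading.Thread(target=rank, args=(b_box, a_box, "from B", got_b))
t0.start(); t1.start(); t0.join(); t1.join()
assert got_a == ["from B"] and got_b == ["from A"]
```

With unbuffered rendezvous sends, the same send-then-receive ordering on both sides would block forever, which is consistent with the hang described above.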
|
Hi @fsalmon001 I tried the latest code but it still hangs. Could you try one more thing with the container please? |
|
Hi @tdcwilliams, with your command it still works, with both regular and irregular grids. |
|
Thanks @fsalmon001. |
So far, there are two options to write a mooring file:
In the proposed pull request, one file is written in parallel using the parallel netCDF library (you need to compile netCDF with the parallel option). NeXtSIM can still be compiled without parallel netCDF, but this option cannot then be used.
The netCDF library can only write rectangular regions efficiently, so there is first a rectangle decomposition of the domain.
Then each processor writes the part of the grid corresponding to one rectangle or a set of rectangles.
I tested it only with a regular grid because I have no file for an arbitrary grid. Please check with an arbitrary grid if you use it; there could be issues in this case.
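The rectangle decomposition mentioned above could look roughly like the following greedy sketch. This is not the neXtSIM algorithm (which is not shown in this thread); it is a minimal pure-Python illustration, with hypothetical names, of turning a rank's 0/1 ownership mask into rectangular regions suitable for hyperslab writes.

```python
def rows_to_rectangles(mask):
    """Greedily split a 0/1 ownership mask into height-1 rectangles
    (row, col_start, width). Illustrative only: netCDF hyperslab writes
    want rectangular regions, so non-rectangular ownership is covered
    by a set of rectangles."""
    rects = []
    for i, row in enumerate(mask):
        j = 0
        while j < len(row):
            if row[j]:
                start = j
                while j < len(row) and row[j]:  # extend the run of owned cells
                    j += 1
                rects.append((i, start, j - start))
            else:
                j += 1
    return rects

mask = [[1, 1, 0, 1],
        [0, 1, 1, 1]]
assert rows_to_rectangles(mask) == [(0, 0, 2), (0, 3, 1), (1, 1, 3)]
```

Each returned tuple maps naturally onto a (start, count) pair for one hyperslab write; the number of rectangles per rank then feeds the call-count matching discussed earlier in the thread.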