Technical Working Group Meeting, August 2020

Minutes

Date: 12th August, 2020

Attendees:

Aidan Heerdegen (AH) CLEX ANU
Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
Russ Fiedler (RF) CSIRO Hobart
Rui Yang (RY), Paul Leopardi (PL) NCI
Nic Hannah (NH) Double Precision
Marshall Ward (GFDL)

PIO with MOM (and NCAR)

MW: NCAR running global 1/8 CESM with MOM6. Struggling with IO requirements. Worked out they need to use io_layout. Interested in parallel. Our patch is not in FMS, and never will be. They understand but don’t know what to do. Can’t guarantee my patch will work with other models and in the future. Said COSIMA guys are not using it. Using mppnccombine-fast so don’t need it. Is that working? AK: Yes.

RY: Issue is no compression. Previous PIO not support compression and size is huge. Now netCDF supports parallel compression, so maybe look at it again. Haven’t got time to look at it. Should be a better solution for COSIMA group.

MW: Ideally Ed Harnett or someone else from NCAR would add PIO to FMS. They have been working on latest FMS rewrite for more than 2 years. Haven’t finished API update. FMS API is very high level. They have decided too high level to do PIO. FMS completely rewriting API. Ed stopped until FMS update. Added PIO and used direct netCDF IO calls. Bit hard-wired but suitable for MOM-like domains. Options 1 sit and wait, Option 2 is do their own, Option 3 is do it now and use fork of FMS. Maybe Option 4 is mppnccombine-fast. What do you think?

AK: Outputting compressed tiles with io_layout and using fast combine. Potential issue is if io_layout makes small tiles. MW: Chunk size has to match tile size? Do tiles have to be same size? AH: Yes. Still works but slower but has to do deflate/reflate step. It is fast when it can just copy compressed chunks from one file to another. Limit is only filesystem speed. Still uses much less memory if it has to do deflate/reflate. Chooses first tile size and imposes that on all tiles. If first tile is not typical tile size for most files could do end up reflating/deflating a lot of the data. Also have to choose processor layout and io_layout carefully. For example 0.25 1920×1080 doesn’t have consistent tile sizes. MW: Trying to figure out if it is worth telling them to reach out to you guys. Sent them a link to the repo. AH: Might be a decent way to keep going until they get a better solution.

MW: Bob had strategy to force modelling services to include PIO support by getting NCAR to use PIO.

NH: Can they use PIO patch with their current version of FMS? MW: They want to get rid of the old functions.

Bad idea to ditch old API, creates a lot of problems. The parallel-IO work is on a branch.

AH: Regional output would be much better. Output one file per rank. Can aggregate with PIO? NH: One output file. Can set chunking. AH: Not doing regional outputs any more because so slow. Would give more flexibility. AK: Slow because of huge number of files. Chunks are tiny and unusable. Need to use non-fast. AH: I thought it was hust the output is slow. RF: Many processors on same node will pump output. MW: Many outputs will throttle lustre. Only have a couple of hundred disks. Will get throttled. AH: Another good reason to use for MOM. MW: Change with io_layout? RF: No, always output for themselves. MW: Wonder how patch would behave. AH: NCAR constrained to stay consistent with FMS. MOM5 is not so constrained, should just use it. NH: Should try it if code already. parallel netcdf is a game changer. AH: I have a long-standing branch to add FMS as a sub-tree. Should do it that way. Have our own FMS fork with the code changes. MW: Only took 3 years!

AK: Put in a deflate level as namelist parameter as it defaulted to 5. Used 1 as much faster but compression was the same.

CICE PIO

NH: Solved all known issues. Using PIO output driver. Works well. Can set chunking, do compression, a lot faster. Ready to merge and will do soon. I don’t understand why it is so much better than what we had. I don’t understand the configuration of it very well. Documentation is not great. When they suggested changes they didn’t perform well. Don’t understand why it is working so well as it is, and would like to make it even better.

NH: Converted normal netcdf CICE output driver to use latest parallel netCDF library with compression. So 3 ways, serial, same netCDF with pnetcdf compressed output, or PIO library. netcdf way is redundant as not as fast as PIO. Don’t know why. Should be doing this with MOM as well. Couldn’t recall details of MW and RY previous work. Should think about reviving that. Makes sense for us to do that, and have code already.

MW: Performance difference is concerning. NH: Has another player of gathers compared to MPI-IO layer. PIO adds another layer of gathering and buffering. With messy CICE layout PIO is bringing all those bits it needs and handing it lower layer. Maybe possible reason for performance difference. RY: PIO does some mapping from compute to IO domain. Similar to io_layout in MOM. Doesn’t use all ranks to do IO. Sends more data to a single rank to IO, saves contention issues. NH: MPI-IO has aggregators? RY: In the library you can select number of aggregators. Default is 1 aggregator per node. If you use PIO to use single rank per node this matches MPI-IO. Did this in the paper where we tested this. If consistent io_layout, aggregator number and lustre striping should get good performance.

RY: Tried different compression levels? NH: Just using level 1. Did some testing in serial case not much point going higher. Current tests doing all possible outputs. RF: A lot of compression will be due to empty fields. RY: compression performance is related to chunk size. NH: performance difference with chunk size. Too big and too small is slower. Default chunk size is fastest for writing. 360×300 for 2D field. Might not be ok for reading. RY: Should consider both read and write. Write once and many read patterns. MW: Parallel reads were slower than POSIX reads. AH: What is dependence of time on chunk size. NH: Depends how many fields we output. Cutting down should be fast for larger chunk size. Is a namelist config currently. Tell it chunk dimension. RY: Did similar with MOM. AH: CICE mostly 2D, how many have layers. AH: What chunking in layers? NH: No chunking, chunk size is 1, for layers. AH: Have noticed access patterns have changed with extremes. Looking more at time series, and sensitive to time chunking. Time chunking has to be set to 1? NH: With unlimited not sure RF: Can chunk in time with unlimited, but would be difficult as need to buffer until writing possible. With ICE data layers/categories are read at once. Usually aggregated, not used as individual layers. Make more sense to make category chunk the max size. Still a small cache for each chunk. netCDF4 4.7.4 increased default cache size from 4M to 16M.

AH: I thought deflate level 4 or 5 was still worth it. NH: Can give it a try. Don’t really care about deflate level, just getting rid of zeroes.

Masking ICE from MOM – ICE bypass

NH: Chatted with RF on slack. Mask to bypass ICE. Don’t talk to ice in certain areas. Like the idea. Don’t know how much performance improvement. RF: Not sure it would make much difference. Just communication between CICE and MOM. NH: Also get rid of all the CICE ranks that do nothing. RF: Those are hidden away because of round-robin and evenly spread. Layout with no work would make a difference. NH: What motivated me was the IO would be nicer without crazy layouts. If didn’t bother with middle, would do one block per cpu, one in north and south. Would improve comms and IO. If it were easy would try. Maybe lots of work. AK: Using halo masking so no comms with ice-free blocks.

AH: What about ice to ice comms with far flung PEs? RF: Smart enough to know if it needs to MPI or not. Physically co-located rank will not use MPI. AH: Thought it would be easy? NH: Not sure it is justified in terms of performance improvement. With IO tiny blocks were killing performance, so this was a solution to IO problem. MW: Two issues are funny comms patterns and calculations are expensive, but ice distribution is unpredictable. Don’t know which PEs will be busy. Load imbalances will be dynamic in time. Seasonal variation is order 20%. Might improve comms, but that wasn’t the problem. Stress tensor calcs are expensive, so ice regions will do a lot more work. NH: Reason to use small blocks which improves ability to balance load. MW: Alisdair struggling with icebergs. Needs dynamic load balancing. Difficult problem. RF: Small blocks are good. Min max problem. Every rank has same amount of work, not too much or too little. CICE ignores tiles without ice. CICE6 a lot of this can be done dynamically. There is dynamic allocation of memory. AH: Dynamic load balancing? RF: Who knows. Now using OpenMP. AH: Doesn’t seem to make much difference with UM. MW: Uses it a lot with IO as IO is so expensive.

AH: A major reason to pursue masking is it might make it easier when scaling up. If round-robin magically scales well that is ok, but last time there was a lot of analysis with heat maps and discussion about optimal block sizes. Conceptually it might be easier to understand how to best optimise for new config. NH: Does seem to make sense, could simplify some aspects of config. Not sure if it is justified. MW: Easy to look at comms inefficiency. Did this a lot for MOM5, and mostly it wasn’t comms. Sometimes the hardware, or a library not handling the messages well, rather than comms message composition. Bob does this a lot. Sees comms issues, fixes it and doesn’t make a big difference. Definitely worth running the numbers. NH: Andy made the point. This is an architecture thing. Can’t make changes like this unilaterally. Coupled model as well. Fundamentally different architecture could be an issue. MW: Feel like CPUs are the main issue not comms. Could afford to do comms badly. NH: comms and disks seems pretty good on gadi. Not sure we’re pushing the limits of the machine yet. Might have double or triple size of model. AH: Models are starting to diverge in terms of architecture. Coupled model will never have 20K ocean cpus any time soon. NH: Don’t care about ice or ocean performance.

AH: ESM1.5 halved runtime by doubling ocean CPUS. RF: BGC model takes more time. Was probably very tightly tuned for coupled model. 5-6 extra tracers. MW: 3 on by default, triple expensive part of model. UM is way more resources. AH: Did an entire CMIP deck run half as fast as they could have done. My point is that at some point we might not be able to keep infrastructure the same. Also if there is code that needs to move in case we need to do this in the future. NH: Code is more of an ocean calculation anyway? RF: Kind of. Presume there is a separate ice calc. Coupling code taken from gfdl/MOM and put into CICE to bypass FMS and coupler code. From GFDL coupler code. Rather than ocean_solo.f90 goes through coupler.f90. NH: If 10 or 20K cores might revisit some of these ideas. Goal to get to those core counts working, not sure about production.

MW: Still thinking about super high res, like 1/30. OceanMaps people wanted it? More concrete. RF: Some controversy with OceanMaps and BoM. Wanting to go to UM and Nemo. There is a meeting, and a presentation from CLEX. Wondering about opportunity to go to very high core counts (20K/40K). AH: Didn’t GFDL run 60K cores for their ESM model? NH: Never heard about it. Atmosphere more aggressive. RF: Config I got for CM4 was about 10K. 3000 line diag_table. AH: Performance must be IO limited? MW: Not sure. Separated from that group.

New bathymetry

AK: Russ made a script to use GEBCO from scratch. Worked on that to polish it up. Everything so far has been automatic. RF: Always some things that need intervention for MOM that aren’t so much physically realistic but are required for the model. AK: Identified some key straits. Retaining previous land masks so as not to need to redo mapping files. 0.25 need to remove 3 ocean points and add 2 points. Make remap weights scripts are not working on gadi, due to ESMF install. Just installed latest esmf locally, 8.0.1, currently running. AH: ESMF install for WRF doesn’t work? AK: Can’t find opal/MCA MPI error. RF: That is an MPI error.

AH: Sounds like the sort of error that was a random error, but if happening deterministically not sure. AK: Might be a library version issue. AH: They have wrappers to guess MPI library, major version the same should be the same.

AH: All this is scriptable and be re-run right? Bathymetries are intimately tied to vertical grid, so needs to be re-run if that is changed. AK: Vision is certainly for it to be largely automated. Not quite there yet.

NH: I’ll have a quick look too. Noticed there is no module load esmf? AK: Using esmf/nuwrf. I’ll have a look at what esmf built with. AH: I want esmf installed centrally. We should get more people to ask. NH: I think it is very important. AK: Definitely need it for remapping weights. AH: Other people need it as well.