Technical Working Group Meeting, May 2019

Minutes

Date: 15th May, 2019
Attendees:

  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK) COSIMA, ANU
  • Marshall Ward (MW) GFDL
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Nic Hannah (NH) Double Precision
  • Rui Yang (RY) NCI

Agenda

– Follow up on migrating FMS to an external library
– WOMBAT in harmonised MOM update and testing
– Tenth load balancing
– CICE IO bound in high core counts

CICE IO bound in high core counts

AK: Runs with the new CICE executables NH compiled a while ago. Performance slowdown with compression level 5. Tested level 1: files only a few % larger, and IO time dropped from 2500s to 1800s; 1300s without compression. Ice output compresses well even at low levels because so much of the domain is missing data.
NH: Went from netCDF3 to netCDF4. Might be worth trying no compression. AK: Have a run with compression level zero. RF: It does impact walltime; MOM is left waiting. Usually CICE waits on MOM, but when outputting it's the other way around. MW: Was MOM output compressed before, and now both? NH: Compression plus daily output is the issue. AH: What is the chunking? RF: Uses the default. AH: Some libraries choose weird chunk sizes for the time dimension? RF: No funny business, all sensible. RF: All these point-to-point gathers may not be efficient. MW: Do you know where the time is taken? RF: There's a slowdown, but not sure of the split between gather and write. NH: Breaking new ground: daily output, running at scale, and an unusual tile distribution, which increases the comms needed to gather. Many new things at once. MW: Still on sectrobin? AH: IO is 10% of total runtime.
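For reference, a quick way to check what chunking and deflate settings a CICE history file actually ended up with: a minimal sketch using the netCDF4-python library, with a hypothetical filename.

    import netCDF4

    # Print each variable's chunk shape and compression filters.
    with netCDF4.Dataset("iceh.1958-01-01.nc") as ds:   # placeholder filename
        for name, var in ds.variables.items():
            print(name,
                  "chunking:", var.chunking(),   # 'contiguous' or list of chunk sizes
                  "filters:", var.filters())     # zlib/shuffle/complevel settings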
NH: With MOM we do all this in post-processing, to keep model performance as good as possible. Anything that slows the model as a whole should be post-processed; didn't think about that option when the change went in. If it is slowing things down overall, back out the change and work out a post-processing step. AK: Half the data in the daily files is static, which is totally unnecessary. Made an issue to maybe output the static data to a file once. RF: Aggregate daily files to monthly? AK: Slows down output from the model. Less compressible? RF: Highly correlated, will compress easily. AH: How much extra wait time? RF: The whole write time. AK: 25 or 18% of MOM runtime. AH: So the issue disappears with monthly output? RF: Yes. RY: Does CICE write to a single file? RF: Yes, through one processor. RY: Can we do it like MOM, with each processor writing to its own file? NH: Yes, good idea, but more complicated than in MOM: CICE tiles are not located close to each other in space. RF: Could use the PIO interface. Not compatible with the centrally installed netCDF libraries; there are bugs in that version of HDF. Need OpenMPI > 1.10.4 and netCDF > 4.6.1. MW: PIO is a good candidate, and RY can help. Are the CICE developers looking into this? Have we stayed in touch with them? NH: Look at the CICE6 GitHub. RF: Looked, but no active development on IO in any fundamental way.
NH: If we did decide to go that way, good opportunity to feed that back to CICE community.
MW: NCAR, as the developer of PIO, is keen to get it into other models. If CICE is on their radar we might get some feedback there. RY: MOM has an IO layer a bit like PIO. Tried PIO in MOM and found it was not a good candidate. MW: Yeah, not a good idea for MOM6 either; it already does something like that.
RY: Parallel compression will be supported in netCDF in future. Have been experimenting with my own version of the library and got some positive results.
End result: take compression out, take out the static fields, and handle it in post-processing. Is anyone using the daily fields? RF: We're interested in daily ice fields for data assimilation. MW: Shorter runs though? RF: 20 years.
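A hedged sketch of that post-processing step: recompress a daily file offline at deflate level 1 and drop the static (time-invariant) fields. The filenames and the test for "static" are illustrative assumptions, using netCDF4-python.

    import netCDF4

    def recompress(src_path, dst_path, complevel=1):
        """Copy a daily file, recompressing at the given deflate level and
        skipping variables without a time dimension (the static fields)."""
        with netCDF4.Dataset(src_path) as src, \
             netCDF4.Dataset(dst_path, "w") as dst:
            dst.setncatts({a: src.getncattr(a) for a in src.ncattrs()})
            for name, dim in src.dimensions.items():
                dst.createDimension(name, None if dim.isunlimited() else len(dim))
            for name, var in src.variables.items():
                if "time" not in var.dimensions:
                    continue  # static field: leave it for a write-once file
                fill = getattr(var, "_FillValue", None)
                out = dst.createVariable(name, var.dtype, var.dimensions,
                                         zlib=True, complevel=complevel,
                                         fill_value=fill)
                out.setncatts({a: var.getncattr(a) for a in var.ncattrs()
                               if a != "_FillValue"})
                out[:] = var[:]

    recompress("iceh.1958-01-01.nc", "iceh.1958-01-01.z1.nc")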
NH: Instead of writing individual daily files, should write daily records to a single file; static fields won't be replicated, and it may benefit from some netCDF buffering. AH: Big code change? NH: Not sure. AK: CICE has a file naming convention for different frequencies; frequency is part of the filename. NH: So it could already output daily data into monthly files? AK: No, the filename encodes time and frequency; it doesn't seem to write repeatedly to any of its output files. AH: Would need to define an unlimited time dimension.
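A minimal sketch of NH's suggestion, with assumed filenames: gather a month of daily files into one file with an unlimited time dimension, so static fields are stored once rather than ~30 times (using the xarray library).

    import xarray as xr

    # Open a month of daily files and write a single monthly file with an
    # unlimited (appendable) time dimension and light compression.
    daily = xr.open_mfdataset("iceh.1958-01-??.nc", combine="by_coords")
    daily.to_netcdf("iceh.1958-01.nc",
                    unlimited_dims=["time"],
                    encoding={v: {"zlib": True, "complevel": 1}
                              for v in daily.data_vars})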
NH: Make a GitHub issue; if it's high priority, could get some time allocated. MW: Make the issue in the CICE repo and inform them what we're doing. They mentioned an NCAR community board.
AH: Could the compression level be a namelist option, rather than requiring a recompile?

Tenth load balancing

AK: RF suggested a smaller core count of 799. Doesn't change wall time, which is a win. How low can we go? RF: Worked out a few more configs. With a slight change of tile size, 720 would be ok: 36×36 or 40×30. Running some quick tests with a tool under /short/v45/masking; it outputs the masks and where tiles get located, and also the number of processors/blocks you need. AH: Put the code on the COSIMA GitHub? RF: It's just a quick little thing. AH: Yes, but useful.
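A rough illustration of the masking idea (not RF's actual tool): given an ocean/land mask, count how many blocks of a given tile size contain any ocean, since all-land blocks need no processor.

    import numpy as np

    def active_blocks(ocean_mask, bx, by):
        """Count blocks of size bx x by holding at least one ocean cell."""
        ny, nx = ocean_mask.shape
        count = 0
        for j in range(0, ny, by):
            for i in range(0, nx, bx):
                if ocean_mask[j:j + by, i:i + bx].any():
                    count += 1
        return count

    # Placeholder random mask standing in for the real 0.1-degree land mask.
    mask = np.random.rand(2700, 3600) > 0.3
    for bx, by in [(36, 36), (40, 30)]:
        print((bx, by), active_blocks(mask, bx, by), "active blocks")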
AH: Down from 1380. Big win. Total core count? AK: Not sure. RF: Total just over 5000. AH: Still running on normalbw? AK: Yes. AH: The wait on normal is crazy. RF: Look at skylake? Usually empty. RY: Yes, new nodes, but not a large total core count. AK: Get 6 months per submission without daily outputs; with daily ice output it goes over by 30-45 minutes. dt=600s.
NH: If there's no-one else to fix it, assign the issue to NH.

WOMBAT

RF: Got Matear up to speed. Ran a few tests. One or two bugs yet to be fixed: a couple of fields weren't coming through from OASIS properly. It was the ice field that wasn't coming through correctly. Got it going with external fields forcing it. Figured out the changes needed to get it running properly in full ACCESS mode. Will run some test cases once the bugs are fixed. MC: Now running with calculated gas exchange coefficients? RF: That's the way it was originally written, the way fields were ingested into MOM. MC: Using the same wind field for BGC and wind mixing? RF: Yes, all together. MC: What level is the wind? In ACCESS-ESM we were getting the lowest atmospheric level wind. MC: Will CICE send a 10m wind through OASIS? RF: Not the FMS coupler, this is just the OASIS 10m wind. MC: And in the ACCESS-ESM case?
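For context, one common gas transfer velocity parameterisation driven by the 10m wind is the Wanninkhof (2014) quadratic form. This is an illustration only, not necessarily the exact scheme WOMBAT uses.

    import numpy as np

    def gas_transfer_velocity(u10, schmidt):
        """k (cm/hr) from 10 m wind speed (m/s) and Schmidt number,
        using the Wanninkhof (2014) quadratic wind-speed relation."""
        return 0.251 * u10**2 * (schmidt / 660.0) ** -0.5

    print(gas_transfer_velocity(np.array([5.0, 10.0]), 660.0))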
AH: Hakase could be used as a guinea pig. Do any of these changes affect ACCESS-CM2? RF: Shouldn't. AH: Do we need to do any bit reproducibility tests? RF: Shouldn't change anything.

Migrating FMS to an external library

AH: I put my hand up to do the change and test.
MW: FMS was updated to Xanadu a couple of weeks ago. AH: So a good time to try it out. MW: Already tried it, and put some MOM patches in to fix some issues. AH: On the GFDL FMS repo? MW: They have opted not to take the parallel netCDF (MPI-IO) patch that RY and I worked on. Have set up a branch with parallel IO, and Xanadu has been merged into that branch. May want to use the branch with the parallel netCDF extensions. There's an ongoing conversation about this; they may merge it in. Can use what you want; your call.
RF: Any whitespace issues? MW: FMS and MOM6 live on different planets. They don’t interact much. Don’t collaborate with FMS guys.
MW: Alistair is getting miffed at the red buttons on the Jenkins server. He and/or I will look at some GFDL-independent solution. Happy for NH to be involved as much or as little as he wants. NH: They should be more blue than red. MW: Happened in March due to checksumming? NH: Bitrot; Jenkins is fragile, and Scott often fixes it. Good idea, happy to help in any way. May be easier to set up on raijin: do one qsub and run them all under one submission. MW: slurm is sort of designed to do that. NH: slurm is awesome. MW: slurm is better. NH: Like it a lot more. MW: Good for running multiple jobs per submission; blurs the line between MPI and the scheduler. Some sort of meta-scheduling: place jobs on ranks within the request. AH: More flexibility.
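A minimal sketch of the "one qsub, many runs" idea: inside a single scheduler allocation, launch each test in turn. The run directories, rank count, and executable name are hypothetical.

    import subprocess

    # Hypothetical test directories and executable; each run reuses the
    # resources of the single scheduler allocation, one after another.
    runs = ["test_mom_sis", "test_om2", "test_repro"]
    for run in runs:
        subprocess.run(["mpirun", "-n", "240", "./model.exe"],
                       cwd=run, check=True)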

Actions

  • Update MOM build to use external FMS library (CMake) – AH
  • Finish WOMBAT integration – RF
  • Create GitHub issue(s) for CICE compression – AK

Technical Working Group Meeting, April 2019

Minutes

Date: 10th April, 2019
Attendees:

  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK) COSIMA, ANU
  • Marshall Ward (MW) GFDL
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Nic Hannah (NH) Double Precision

Updates

MW: Discovered the Travis test GFDL uses for FMS has been failing for six months. MW's fault: introduced a new MPI function that doesn't exist in OpenMPI 1.6, which is what Travis uses. Doesn't show up in MOM5 as the changes aren't in there. Solution was to switch to MPICH; even bleeding-edge Travis only has OpenMPI 1.10.

FMS in MOM5

AH: Link FMS rather than have it in the repo. NH: I agree. MW: Still think subtree is the best solution. Now that FMS has a dedicated automake build, could it be formally installed as a module? MW: Had a long chat about this with Alistair. Not hot on submodule/subtree. NH: Just have it in CMake or a script. AH: Makes sense.
RF: One of my jobs in the decadal project is to have MOM5 linked with AM4, which uses a more recent version of FMS, so this will be useful for RF. Have to get MOM5 talking with the FMS used in AM4. MW: Have run MOM5 with the latest FMS. RF: Just making sure there are no surprises, such as changes of interfaces. Not just FMS, also other bits and pieces. AH: Auto testing with multiple FMS versions.
MW: If we go the path of building independent libraries, not sure how the C world tracks this. ABI changes? How do you manage binary compatibility? NH: Did that with OASIS, and not sure it was worthwhile. MW: C programs don't seem to have these issues. NH: Using precompiled libs is necessary in Linux generally, but for us there's no reason not to compile FMS when compiling MOM. MW: Not keeping a public library? NH: More complexity than necessary. We're just talking about splitting the source code out into a separate repo, which is a good idea. MW: MOM6 has an FMS repo, and a macro repo above that that builds everything. Not sure we want to go that way. NH: ACCESS-OM2 works that way too, but the experiment repos are separate. MW: Maybe submodules/subtrees aren't so bad. NH: The MOM5 repo can have a build script that references a build script for FMS. MW: Doesn't CMake have some functionality to check it out for you?
AH: Finish off CMake build scripts and add in FMS stuff.
MW: What they do with MOM6 is having issues; will bring it up with them. Maybe some convergence on library dependencies. AH: Don't favour a central library install with MPI dependencies.

WOMBAT in MOM5

RF: Not much to report. Making sure WOMBAT can be called in MOM-SIS is the only outstanding issue; just changing a few if statements. MC: Comfortable that it will run in the ESM framework? RF: Not sure who is going to test it. MC: OM2? RF: Should run in OM2.
MC: Richard Matear went to visit AH. I haven't run anything yet, but experiments running under payu are ready to go: an OM2 test with WOMBAT.
AH: Is there a PR for these code changes? Make a PR. RF: Will do. Split up the testing to avoid duplication.
MC: Hakase wants to run this too I believe?

Tenth Model

AK: Set up RYF for Spence; run 20 years. Looking at it as a test bed for an improved config: improved bathymetry from RF, conservative temperature. Running at half the cost of the previous config: 10Mh/yr, 60-65 KSU. Speed-up comes from a longer ocean timestep, and ice is now 2 timesteps per ocean timestep compared to 3, due to removal of fine cells in the bathymetry at the tripole. Wanted to use non-mushy ice, but low ice concentration in the Baltic fails to converge the thermodynamic temperature profile. Should converge in a few steps, but even with the limit at 100 iterations it still doesn't converge. Paul is using mushy. Had a run with non-mushy ice crash up top. Spinup7 is mushy, spinup8 is non-mushy; 10-15% extra cost for mushy ice, though not sure if we're CICE or MOM bound. Other resolutions are not using mushy. Want to set up an IAF tenth run starting in 1958; can afford it with the cheaper model.
AK: TEOS-10: not sure if we want to use it. Would need absolute salinity and conservative temperature.
AK: Noticed the gyres are much too weak at all resolutions. Looks like we have to be careful with JRA55. Did a test at 0.25° with absolute wind rather than relative; no change. Florida Current is 65% of obs, EAC about 70%. The Gulf Stream is not separating properly: mean position ok, but too variable, maybe insufficient momentum. Doesn't go around the Grand Banks properly, which causes SST biases. Not sure how much to fix before the IAF run.
NH: All resolutions? AK: All resolutions are too weak. Gulf Stream separation is ok at tenth on average, but there is too much variance. The mean position at 0.25° is really bad. Biases around the Grand Banks are similar at all resolutions. NH: Would improving separation improve the biases? AK: Maybe. SSH variability is localised in the model, but stretched out in obs.
NH: Is this specific to JRA55? Does it happen with CORE forcing as well? AK: Don't know. Griffies said others find the gyres a bit weak with JRA55. It uses scatterometer winds, which are relative to an eddying ocean, with eddies not in the same locations as in a model. The JRA55 paper suggests adding the climatological mean current to the wind used to force the ocean. AH: Should that be in the product? AK: Griffies says people aren't too keen on Tsujino's suggestion. AH: Diagnose the wind stress from the 0.25° test? AK: Yes. 10-20% change in stress in western boundary currents and the Southern Ocean, where there are large mean currents. The stress changes are in quite small areas, so not a big effect on the gyres.
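A sketch of the relative-wind stress issue under discussion: stress computed from wind relative to the surface current, with the suggested fix of adding the climatological mean current back to the wind so only the current anomaly is felt. The air density and drag coefficient values are illustrative assumptions.

    import numpy as np

    rho_a, c_d = 1.22, 1.5e-3   # assumed air density (kg/m^3) and drag coeff.

    def zonal_stress(u_wind, u_ocean, u_clim=0.0):
        """Stress from wind relative to the surface current; passing the
        climatological mean current as u_clim restores the mean forcing."""
        u_rel = u_wind + u_clim - u_ocean
        return rho_a * c_d * np.abs(u_rel) * u_rel

    # Relative winds weaken the stress over a mean current; adding the
    # climatological current back recovers the absolute-wind value.
    print(zonal_stress(5.0, 1.0), zonal_stress(5.0, 1.0, u_clim=1.0))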
MC: Do you recall what the EAC numbers were in the model compared to obs? I thought we had 20Sv, which is similar to OFAM/BRAN. AK: Obs: 18.7 Sv; 17.5 and 17.2 Sv (to about 2000m) in the models; 22.1 ± 7.5 Sv from a mooring. The Florida Current is 30% too low, and it is well observed.
NH: What is the big challenge for the future? AK: Not sure how much to change before the next IAF. Will put out a call for diagnostics, and also explain the config and see if people have an issue with it.
AH: Is it true that MOM5 doesn't fully support TEOS-10? RF: It's not obvious to the user how to use TEOS-10; it's kind of fudged. The proper way is to carry an extra tracer: preformed salinity plus an adjustment factor to create absolute salinity. Another way is to have absolute salinity as a single variable with the adjustment factor set to zero. To use full TEOS-10 in MOM5 you need 2 tracers; doing it the same as the rest of the world would mean carrying a zero tracer, and we don't want a wasteful tracer.
RF: There is a newer way to parameterise the equation of state; it needs updating. AH: A new module? RF: Yes, just a switch.
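For reference, the TEOS-10 variables AK mentions can be computed offline with the GSW-Python library; a minimal sketch with placeholder input values.

    import gsw  # GSW-Python, the TEOS-10 toolbox

    SP, t, p = 35.0, 10.0, 100.0   # practical salinity, in-situ T (degC), pressure (dbar)
    lon, lat = 150.0, -40.0        # placeholder location
    SA = gsw.SA_from_SP(SP, p, lon, lat)  # Absolute Salinity (g/kg)
    CT = gsw.CT_from_t(SA, t, p)          # Conservative Temperature (degC)
    print(SA, CT)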

FAFMIP errors in ACCESS-OM2

RF: Frazil is not being redistributed; needs fixing. AH: Does it affect other runs? RF: No, just FAFMIP.