Minutes
Date: 10th January 2018
Attendees:
- Marshall Ward (Chair) (NCI)
- Aidan Heerdegen, Andrew Kiss (ARCCSS/ARCClEx ANU)
- Fanghua Wu (National Climate Center, China Meteorological Administration, Visitor ANU)
- James Munroe (Memorial University of Newfoundland)
- Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
- Peter Dobrohotoff (CSIRO Aspendale)
MPI Errors on raijin
- Marshall: MPI was updated on raijin to try to improve MPI performance by turning on all Mellanox accelerators. Mellanox can be hit and miss. Had to recompile as the fabric was updated. Does OpenMPI have direct calls to Mellanox in the library?
- Andrew: tenth job hung. Raised issue. Ben said to turn off all MCA flags. Then reading the conf file. Lustre issue. Should be fixed.
- Marshall: fca is now turned off. MOM has few collectives.
- NCI changed default config settings which may have let other jobs work. Has now turned everything off. Maybe try and get some UM/MOM testing into NCI upgrade testing suite to avoid these issues.
- Nic: if we can’t get tenth running, fairly sure it is mxm stuff. Marshall: mxm reported too many retries error (jobs become list). Andrew’s jobs are hanging in MPI_Init(). Nic: experienced hang on gather in OASIS without mxm.
- Russ: working on CM2.5. Getting sat-vap pressure errors in first timestep, which is why Russ implemented the traceback update in MOM5 code. Sat-vap errors are written to stdout. Anything written to stdout on a PE other than the root PE doesn’t get output. FMS has stdlog, stdout and stderr. Unless recompiled, messages will not make it to stdout. Most gets redirected to /dev/null in FMS for every PE except the root PE. Performance issue.
- mpirun can report which rank is reporting which line (tag output)
- Nic: so any “print *” statements will have a major performance hit. Used to be a lot of them in MOM5.
- Marshall: will have to assess all accelerators separately from now on. 5 accelerators, 2 never implemented, all off by default. Now all 5 are available and all are off by default; initially they were on, but this caused instabilities.
- Marshall: NCI are continuing to push OpenMPI 2/3. Still have speed issues. Need to get on top of it. Memory pinning has changed. Either explicit or changed. This is a critical performance issue for us. We need to be able to say why we don’t use OpenMPI 2.
- Peter: need a broader discussion around older versions, compiled against older libraries. Papers have been published with these executables. Do we continue to support software which is still being used for science? Scientists aren’t always interested in the same things as HPC people. Bigger discussion needs to be had. To what extent does NCI as a partner need to be supporting these deprecated libraries?
- Marshall: it is our job to communicate our issues to program leaders.
- Peter: is reproducibility off the table once hardware changes? Yes. By 2nd quarter next year clean slate.
- James: do we need to run these models on other hardware for reproducibility? Marshall: NCI have some secret stuff (other architectures)! Not on the normally accessible queues, but maybe could be used for testing?
- If we have our jobs in MOM6 tests we get cross platform tests for free.
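Illustrative note on the stdout discussion above: a minimal Python sketch, not FMS code, of the behaviour described, where only the root PE writes to the real stdout and every other PE writes to the null device. The names open_rank_stream and report are hypothetical and exist only for this example.

```python
import os
import sys


def open_rank_stream(rank, root=0):
    """Return the real stdout on the root rank and the null device elsewhere.

    Hypothetical helper: it only mimics the redirection described in the
    minutes and is not taken from FMS.
    """
    if rank == root:
        return sys.stdout
    return open(os.devnull, "w")


def report(stream, rank, message):
    """Write a rank-tagged message and return the number of characters written."""
    return stream.write(f"[rank {rank}] {message}\n")


if __name__ == "__main__":
    # Pretend to be four ranks in turn; only rank 0 produces visible output.
    for rank in range(4):
        stream = open_rank_stream(rank)
        report(stream, rank, "sat-vap pressure out of range")
        if stream is not sys.stdout:
            stream.close()
```

In this sketch the message string is still built on every rank before it is discarded, so the formatting work is not avoided just because nothing reaches the terminal.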
Action Clean-up
- Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic. Can’t merge pull request without testing. Ping Mirko about pull request? Russ might be able to test Mirko’s code. Russ to make test case.
- Create document outlining options for configuration sharing? No, configs now in github. Iterate on that
- Ask Dale Roberts about effects of OpenMP for Roger (Marshall): not relevant. Delete.
- Start a new google doc about coupler issues and MATM (Marshall). Too vague. Delete.
- Add new access-om2 test cases to Jenkins test suite (Nic). Done and ongoing. Delete.
- Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall) — Marshall raised this in front of NCI Systems. Does not want ANY SERVER OF ANY SORT running on any machine. Process should only talk to files on a filesystem. Could be sticky. What about an IO server of some kind?
- Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes. No progress to date. Important step.
- Russ to add all his ocean bathymetry code to OceansAus repo. Not done.
- Check current sea surface salinity restoring smoothing (Aidan). Russ: can see some strange patterns in north. Laptev/Kara sea. Russ to provide images on slack.
- Test Andy’s 5 year config with different netcdf library versions to check MATM error is not just a library issue (Aidan). No need. Fixed.
- Send link to spinup diagnostics spreadsheet to Russ (Andrew Kiss). Done.
- Follow up with NCI MAS people (Marshall). Need to turn on netcdf crawler on hh5, and need read access to postgres DB. There was some follow up to an email Aidan sent at the end of last year, promising “early January”.
- Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation? Marshall will do this.
- Collation errors on regional outputs (Aidan). Fixed on Paul’s newest runs. Unknown why it occurred, possibly mismatched t and u grids.
- Nic said he would get IAF working. Had to rewrite MATM to fix stuff up.
Actions
New:
- Nudging code test case (Russ)
- Redo SSS restoring with patch smoothing (Aidan)
- Follow up with NCI MAS people. Need something by end of the month (Marshall)
- Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
- CICE and MATM need to output namelists for metadata crawling (Andrew)
- Doodle poll for new meeting time (Marshall)
Existing:
- Move FMS to submodule of MOM5 github repo (Marshall)
- Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
- Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
- Blog post around issues with high core count jobs and mxm mtl (Nic)
- Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
- Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
- Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
- Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).
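Illustrative note on the barrier action above: a minimal sketch, assuming Python threads stand in for MPI ranks and a threading.Barrier stands in for the synchronising halo update. It is not CICE or MOM code; run_ranks and its arguments are hypothetical. It shows why a barrier placed just before the timer moves load-imbalance wait time out of the timed section.

```python
import threading
import time


def run_ranks(work_seconds, use_barrier):
    """Simulate ranks with uneven work followed by a synchronising step.

    work_seconds: per-rank workload in seconds (hypothetical stand-in for
    uneven ice load). Returns the time each rank spent in the timed section.
    """
    n_ranks = len(work_seconds)
    pre_timer_barrier = threading.Barrier(n_ranks)
    halo_sync = threading.Barrier(n_ranks)  # stands in for the halo update
    timings = [0.0] * n_ranks

    def rank_body(rank):
        time.sleep(work_seconds[rank])      # uneven workload before the update
        if use_barrier:
            pre_timer_barrier.wait()        # absorb the imbalance outside the timer
        start = time.perf_counter()
        halo_sync.wait()                    # every rank must arrive before any proceeds
        timings[rank] = time.perf_counter() - start

    threads = [threading.Thread(target=rank_body, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return timings


if __name__ == "__main__":
    work = [0.0, 0.2]  # rank 1 has more work than rank 0
    print("timed section without barrier:", run_ranks(work, use_barrier=False))
    print("timed section with barrier:   ", run_ranks(work, use_barrier=True))
```

Without the extra barrier, the lightly loaded rank's timer includes the time spent waiting for the slow rank. With it, the timed section reflects only the synchronising step itself. If the long halo timings disappear once the barrier is added, that points to load imbalance rather than slow communication.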