Technical Working Group Meeting, January 2018

Minutes

Date: 10th January 2018
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen, Andrew Kiss (ARCCSS/ARCClEx ANU)
  • Fanghua Wu (National Climate Center, China Meteorological Administration, Visitor ANU)
  • James Munroe (Memorial University of Newfoundland)
  • Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff (CSIRO Aspendale)

MPI Errors on raijin

  • Marshall: MPI updates on raijin to try and improve MPI performance by turning on all Mellanox accelerators. Mellanox can be hit and miss. Had to recompile as fabric was updated. Does OpenMPI have direct calls to Mellanox in its library?
  • Andrew: tenth job hung. Raised issue. Ben said turn off all mca flags. Then reading the conf file. Lustre issue. Should be fixed.
  • Marshall: fca is now turned off. MOM has few collectives.
  • NCI changed default config settings which may have let other jobs work. Has now turned everything off. Maybe try and get some UM/MOM testing into NCI upgrade testing suite to avoid these issues.
  • Nic: if we can’t get tenth running, fairly sure it is mxm stuff. Marshall: mxm reported too many retries error (jobs become list). Andrew’s jobs are hanging in MPI_Init(). Nic: experienced hang on gather in OASIS without mxm.
  • Russ: working on CM2.5. Getting sat-vap pressure errors in first timestep, which is why Russ implemented the traceback update in MOM5 code. Sat-vap errors are written to stdout, but anything written to stdout on a PE other than the root PE doesn’t get output. FMS has stdlog, stdout and stderr. Unless recompiled, these errors will not make it to stdout. Most gets redirected to /dev/null in FMS for every PE except root PE. Performance issue.
  • mpirun can report which rank is reporting which line (tag output, e.g. the --tag-output option of Open MPI’s mpirun)
  • Nic: so any “print *” statements will have a major performance hit. Used to be a lot of them in MOM5.
  • Marshall: will have to assess all accelerators separately from now on. There are 5 accelerators; previously 2 were never implemented and all were off by default. All 5 are now available. They were initially turned on but caused instabilities, so all are off by default again.
  • Marshall: NCI are continuing to push OpenMPI 2/3. Still have speed issues. Need to get on top of it. Memory pinning has changed. Either explicit or changed. This is a critical performance issue for us. We need to be able to say why we don’t use OpenMPI 2.
  • Peter: need a broader discussion around older versions, compiled against older libraries. Papers have been published with these executables. Do we continue to support software which is still being used for science. Scientists aren’t always interested in same things as HPC people. Bigger discussion needs to be had. To what extent does NCI as a partner need to be supporting these deprecated libraries?
  • Marshall: our job to communicate our issues to program leaders
  • Peter: is reproducibility off the table once hardware changes? Yes. By 2nd quarter next year clean slate.
  • James: do we need to run these models on other hardware for reproducibility? Marshall: NCI have some secret stuff (other architectures)! Not on the normally accessible queues, but maybe could be used for testing?
  • If we have our jobs in MOM6 tests we get cross platform tests for free.
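
The stdout behaviour Russ described (output from non-root PEs being discarded) can be illustrated with a minimal Python sketch. This is not FMS code: `ROOT_PE`, `stream_for_rank` and `report` are hypothetical names, and the sketch assumes only what is noted above, namely that the root PE writes to the real stdout and every other PE writes to the null device.

```python
import os
import sys

ROOT_PE = 0  # assumption: rank 0 is the root PE


def stream_for_rank(rank, root=ROOT_PE):
    """Return the stream a given rank should write diagnostics to.

    The root PE gets the real stdout; every other PE gets a handle on the
    null device, so anything it prints is silently discarded.
    """
    if rank == root:
        return sys.stdout
    return open(os.devnull, "w")


def report(rank, message):
    """Write a message on behalf of `rank`; return True if it is visible."""
    stream = stream_for_rank(rank)
    print(f"[PE {rank}] {message}", file=stream)
    visible = stream is sys.stdout
    if not visible:
        stream.close()  # only close the handle we opened ourselves
    return visible
```

In this sketch `report(3, "sat vap pressure out of range")` returns False: the message is written but never reaches the job output, which is the situation in which an error raised away from the root PE goes unseen.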

Action Clean-up

  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic. Can’t merge pull request without testing. Ping Mirko about pull request? Russ might be able to test Mirko’s code. Russ to make test case.
  • Create document outlining options for configuration sharing? No, configs now in github. Iterate on that
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall): not relevant. Delete.
  • Start a new google doc about coupler issues and MATM (Marshall). Too vague. Delete.
  • Add new access-om2 test cases to Jenkins test suite (Nic). Done and ongoing. Delete.
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall) — Marshall raised this in front of NCI Systems. Does not want ANY SERVER OF ANY SORT running on any machine. Process should only talk to files on a filesystem. Could be sticky. What about an IO server of some kind?
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes — no progress to date. Important step.
  • Russ to add all his ocean bathymetry code to OceansAus repo. Not done.
  • Check current sea surface salinity restoring smoothing (Aidan) Russ: can see some strange patterns in north. Laptev/Kara sea. Russ to provide images on slack.
  • Test Andy’s 5 year config with different netcdf library versions to check MATM error is not just a library issue (Aidan). No need. Fixed.
  • Send link to spinup diagnostics spreadsheet to Russ (Andrew Kiss). Done.
  • Follow up with NCI MAS people (Marshall). Need to turn on netcdf crawler on hh5, and need read access to postgres DB. There was some follow up to an email Aidan sent at the end of last year, promising “early January”.
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation? Marshall will do this.
  • Collation errors on regional outputs (Aidan). Fixed on Paul’s newest runs. Unknown why it occurred, possibly mismatched t and u grids.
  • Nic said he would get IAF working. Had to rewrite MATM to fix stuff up.

Actions

New:

  • Nudging code test case (Russ)
  • Redo SSS restoring with patch smoothing (Aidan)
  • Follow up with NCI MAS people. Need something by end of the month (Marshall)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (Andrew)
  • Doodle poll for new meeting time (Marshall)

Existing:

  • Move FMS to submodule of MOM5 github repo (Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
  • Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).
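
The reasoning behind the barrier action can be sketched with a small arithmetic model in Python. It is a simplification, not a measurement: it assumes the halo exchange cannot complete until every rank has arrived (in practice only neighbouring ranks matter), and `halo_timer_readings` is a hypothetical name. Without a barrier, the time a fast rank spends waiting for the slowest rank is charged to the halo timer; with a barrier before the timer starts, that wait is absorbed by the barrier instead.

```python
def halo_timer_readings(compute_times, halo_cost, use_barrier):
    """Model what each rank's halo-update timer would report.

    compute_times: seconds each rank spends computing before the halo update.
    halo_cost: seconds the exchange itself takes once all ranks have arrived.
    use_barrier: if True, ranks synchronise before the timer starts, so the
        wait for the slowest rank is excluded from the halo timer.
    """
    slowest = max(compute_times)
    readings = []
    for t in compute_times:
        wait = slowest - t  # idle time spent waiting for the slowest rank
        if use_barrier:
            readings.append(halo_cost)         # wait absorbed by the barrier
        else:
            readings.append(wait + halo_cost)  # wait charged to the halo timer
    return readings
```

With compute times of 1.0, 1.0 and 4.0 seconds and a 0.5 second exchange, the model gives halo timings of 3.5, 3.5 and 0.5 seconds without the barrier and 0.5 seconds on every rank with it. If the real timings collapse in the same way once the barrier is added, the slow halo updates are load imbalance rather than communication cost.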

Technical Working Group Meeting, December 2017

Minutes

Date: 14th December 2017
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen, Andrew Kiss (ARCCSS/ARCClEx ANU)
  • Fanghua Wu (National Climate Center, China Meteorological Administration, Visitor ANU)
  • James Munroe (Memorial University of Newfoundland, Visitor ANU)
  • Russ Fiedler (CSIRO Hobart)

 

Output file metadata indexing

  • MAS database at NCI. POSIX info. ncdump blob. nodal style. Can put index on netcdf files and search by them.
  • James did similar thing for COSIMA cookbook running in user space. James has had no action on MAS DB so far.
  • Currently NCI is POSIX crawling hh5. James: need to switch on netcdf for certain directories.
  • James: Ben pitched MAS as a great innovation. Maybe Andy needs to formally ask Ben for this?
  • What can MAS deliver that existing DB cannot? James: stopped developing DB because of MAS. SQLite was 40-50K vars/files. Spinup of 1 deg model has 1M+ variables/metadata. SQLite already 1-2GB. Only scales to 1M rows. Can’t deploy postgres without admin access. Could host one, but should live on NCI resource. Makes sense to use MAS.
  • James: just a user role in DB and switching on netcdf indexing — should be fine. Marshall will follow up with NCI MAS bods to make sure this happens soon.
  • Andrew was concerned that this will have on-going support. Use of MAS in other high profile projects (Geoscience Australia for example) means this is a critical piece of infrastructure.
  • James: can we just access their schema? Want to open source, not sure how. James: NCI has confluence, do they have bitbucket license? Marshall: no.
  • Need mom.out copied to hh5 also to be able to index important info with f90nml. Russ: logfile has just namelist info.
  • Andrew: any equivalent for CICE and MATM? Maybe not? Andrew: need to make CICE and MATM print out namelists.
  • Marshall: get Ben/Andy to endorse official use of MAS by CoE.
  • Marshall: do we need to add attributes to files to accommodate this? James: does the executable spit out a version string? No. Marshall: his build script puts a version string. Russ: version part of FMS? Russ: Marshall took version out when moved to oom version of FMS. See Issue #31 on GitHub (can’t find issue Russ refers to). Russ: already have a version.c
  • James: CSIRO wants some of the automated processing for decadal prediction. Can we apply to both?
  • James: make a MOM module? Marshall: make codebase a submodule of payu
  • Aidan talks about reproducible builds using spack. Reproducible builds require a package manager so that it can find and know about all the components of the build.
  • Marshall will put hashes in executable in MOM.
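
The kind of user-space index discussed above (one row per file, variable and attribute, searchable by variable) can be sketched with Python's standard `sqlite3` module. This is an illustration only, not the COSIMA cookbook or MAS schema: the table name `ncvars`, its columns, the helper names and the example paths are all assumptions.

```python
import sqlite3

# Hypothetical metadata records: (file path, variable, attribute, value)
EXAMPLE_RECORDS = [
    ("output000/ocean.nc", "temp", "units", "degrees C"),
    ("output000/ocean.nc", "salt", "units", "psu"),
    ("output001/ocean.nc", "temp", "units", "degrees C"),
    ("output001/ice.nc", "aice", "units", "1"),
]


def build_index(records):
    """Create an in-memory SQLite index of netCDF variable metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ncvars (path TEXT, variable TEXT, attr TEXT, value TEXT)"
    )
    # Index on the variable name so lookups do not scan the whole table
    conn.execute("CREATE INDEX idx_variable ON ncvars (variable)")
    conn.executemany("INSERT INTO ncvars VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


def files_with_variable(conn, variable):
    """Return the sorted list of files that contain `variable`."""
    rows = conn.execute(
        "SELECT DISTINCT path FROM ncvars WHERE variable = ? ORDER BY path",
        (variable,),
    )
    return [path for (path,) in rows]
```

Calling `files_with_variable(build_index(EXAMPLE_RECORDS), "temp")` returns the two ocean files. A single-file database like this is convenient in user space, but the row counts James quoted are the reason a server-side database hosted at NCI was preferred.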

COSIMA Models

  • Andrew tenth degree runs: salinity crashes in the Arctic. Recent crash: MPI Abort error code 111. Resubmit? Use Broadwell.
  • Andrew: has added regional runoff caps. Tighter caps in Arctic rivers.
  • Paul Spence issue with regional outputs, had incorrect bounds. Might affect in future. High temporal resolution in small regions.
  • Russ: something weird happened a while ago. Mixing velocity and tracer grids in a single file? At least for regional output. Mixing u and t grids? Aidan to look into it.
  • Migrating to FMS submodule. When Marshall updated to oom one of the open boundary cases broke. Took 2-3 weeks of scientific coding to fix.
  • Russ looking at CM2.5 and new FMS. AM4 has been released.
  • Marshall: will make FMS a submodule. This works for decadal prediction people who will need this work done in any case.
  • COSIMA will do JRA55 IAF tenth run.
  • Meetings will be on Wednesdays next year, at 11.30am.

Actions

New:

  • CICE and MATM need to output namelists for metadata crawling (no-one assigned)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • Make MOM (and other models) emit GitHub version hash (Marshall)
  • Collation errors on regional outputs (Aidan)
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Follow up with NCI MAS people (Marshall)

Existing:

  • Send link to spinup diagnostics spreadsheet to Russ (Andrew Kiss)
  • Nic add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation.
  • Test Andy’s 5 year config with different netcdf library versions to check MATM error is not just a library issue (Aidan)
  • Check current sea surface salinity restoring smoothing (Aidan)
  • Russ to add all his ocean bathymetry code to OceansAus repo.
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Add new test cases to Jenkins test suite (Nic).
  • Start a new google doc about coupler issues and MATM (Marshall)
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Create document outlining options for configuration sharing (?)