Technical Working Group Meeting, February 2018

Minutes

Date: 13th February 2018
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen (ARCCSS/ARC CLEx ANU)
  • James Munroe (Memorial University of Newfoundland)
  • Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff (CSIRO Aspendale)

MPI Errors on raijin

  • Marshall still having MPI problems.
  • Russ is running ~1200-processor jobs, though not many lately, and hasn’t had any big MPI problems.
  • Russ hit a getcwd() race condition in the kernel while mom4p1 was reading an input namelist; the error was “can’t get current working directory”. He hit a similar problem 8 years ago, which was a Lustre problem then. The run crashed, but it has only happened once. The same file being read by every processor was possibly the issue; newer MOM does a single read and uses MPI to distribute the contents (see the sketch after this list).
  • Marshall found a performance issue in MOM on a Cray system using NFS rather than Lustre: it struggled to read ocean_grid.nc, possibly because of multiple reads of the same file, and spent 40% of the run time opening and reading the file. May need some testing on different systems.
  • Kial Stewart had some run time blowout issues. Wondered if it could be an MPI problem. Aidan asked if others had seen similar issues. Matt is not running production runs. Not sure if he has had any run time issues. Hasn’t got a clean baseline. Not seeing anything like 2x longer.
  • Marshall: lots of people have hanging-job issues right now. NCI has had waves of issues over the past 5 years where this happens more frequently and then abates. Rhodri at RSES is having similar issues, which may mean NCI will take more notice.
  • Matt will keep monitoring. Marshall: can switch over to a debug MPI library, which will give better info. Ben M says it will run slower, but Marshall hasn’t noticed a slowdown; if the runtime is similar, maybe use it. Marshall is learning about MPI debug info (TBs of it). Intel 17 now includes C code in the backtrace. The debug library gives a bit more information about MPI state when it bails.
  • Matt using old compilations. In coupled decadal config switched from mom4p1 to mom5 last year, recompiled then.
  • Russ got the openmpi/1.6.3 library path issue, recompiled and was fine.
  • Peter working now. Memory pinning stuffed them up. Perhaps due to resource exhaustion. Had some problems with getting on the queue. Was using up the time too quickly.
  • Marshall: epic issues with OpenMPI 2.x running the tenth-degree model. Getting random rank failures, not always the same rank. When it fails it thrashes the stack and kills the backtrace process, so he gets a backtrace of the backtrace; other nodes wipe their stack and stop gracefully. Looks like a severe stack overflow, which might be in the library itself. Trying valgrind. It fails in collectives when running ACCESS-OM2; test programs are fine, so it seems to be a memory issue and needs a large model run for a decent time to trigger. OpenMPI 3.x fails in the same way as 2.x, in a memory offset function (longjmp_).
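
A rough sketch of the “read once and distribute via MPI” pattern mentioned above (what newer MOM does for namelist input), written here in Python with mpi4py purely for illustration; MOM itself does this in Fortran and the file name is a placeholder.

    # Sketch: only rank 0 touches the filesystem; everyone else gets the
    # contents over MPI, avoiding N simultaneous opens of the same file
    # (the suspected getcwd()/Lustre trigger). 'input.nml' is a placeholder.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    namelist_text = None
    if comm.Get_rank() == 0:
        with open("input.nml") as f:
            namelist_text = f.read()

    # Broadcast the file contents to all other ranks.
    namelist_text = comm.bcast(namelist_text, root=0)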

Models

  • Paul’s WOMBAT runs were failing because they didn’t have the xgrid alltoall change. Has Matt put the WOMBAT changes into MOM5? Matt: WOMBAT has some ugly hooks into the ice/ocean boundary conditions. He is putting WOMBAT into the ACCESS coupled model; the update to MOM5 should be straightforward.
  • Russ has put current MOM5 code into CM2.5 for decadal prediction. Should add in WOMBAT. Try and keep it up to date as much as possible. Kill two birds with one stone.
  • Russ’s monitor code runs on one processor; if the whole job stops it takes it down. All you have to do is instrument MATM. He wrote it to not interact with the WORLD communicator from OASIS: the monitor gets spawned as a slave via MPI_Comm_spawn, and if it sees too long a gap between segments it issues an abort to its parent (MATM), which cascades. It only needs to communicate with MATM, but it needs knowledge of how long things should take (see the sketch after this list). Marshall: could we integrate this into MATM? Does it need an extra rank and sub-communicator? Russ: yes, could do something more complicated, but this was easy and stands alone, without interacting with OASIS; MATM already ends up wasting ranks.
  • CICE ncpus. Russ: use a halo approach like the MOM barotropic solver? Marshall: has had this suggestion before; CICE’s future was uncertain and it was never a big bottleneck or resource use, so it was not a target.
  • Marshall: profile OM2 at tenth-degree resolution. Russ: optimise the processor layout? Nic did it; could we improve this? Sub-blocks? Marshall: slender2? Russ: no, cartesian, a single 90×90 block per rank; no sub-blocking, compiled with 90×90. Once you get to large processor counts it starts to fall apart. Marshall: CICE is best suited to hybrid parallelism, 1 rank per node with n threads per rank and load balancing of threads, but we are a long way from that.
  • Russ looked at some of Andrew’s timings. Hard to make much sense of them without knowing where all the blocking/synchronising is happening.
  • Marshall: a fan of Score-P; he has met the developer. It can make cartesian maps of processor timings; he did this for LIM profiling.
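
A minimal sketch of the monitor pattern Russ described, as a single-rank watchdog spawned from the instrumented MATM via MPI_Comm_spawn, shown here in Python with mpi4py for illustration only; the real code is Fortran, and the script name, timeout, tags and heartbeat messages are all assumptions.

    # monitor.py: hypothetical single-rank watchdog spawned by MATM.
    # Timeout, tags and messages are invented for this sketch.
    import time
    from mpi4py import MPI

    TIMEOUT = 3600.0                   # assumed upper bound for one segment
    parent = MPI.Comm.Get_parent()     # intercommunicator back to MATM

    while True:
        t0 = MPI.Wtime()
        # Wait for the next heartbeat; if it takes too long, abort the job.
        while not parent.Iprobe(source=0, tag=0):
            if MPI.Wtime() - t0 > TIMEOUT:
                parent.Abort(1)        # cascades and takes the whole job down
            time.sleep(1.0)
        if parent.recv(source=0, tag=0) == "done":
            break

    parent.Disconnect()

The instrumented MATM side would spawn the watchdog once and send it a heartbeat after each segment, roughly:

    monitor = MPI.COMM_SELF.Spawn("python3", args=["monitor.py"], maxprocs=1)
    monitor.send("heartbeat", dest=0, tag=0)   # after each segment
    monitor.send("done", dest=0, tag=0)        # at the end of the run
    monitor.Disconnect()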

MAS Database for COSIMA Cookbook

  • James has queried the DB and taken a quick look. Marshall: happy with it? James: it is useful that a third party is doing the filesystem crawling, though we might need to do some extra crawling for faster updates. Their shard approach will scale better than James’s. Will go forward: take the schema they have developed and write tools to use it, using MAS and falling back to a local DB when it is not available (see the fallback sketch after this list). They gave us what we asked for.
  • Aidan still unable to access it due to a low uid number not playing nicely with security settings.
  • MAS will be good for other people who might have data they want to use that is not under our control.
  • James: regarding the COSIMA Cookbook experiment DB file, if you delete the file and recreate it, the umask will stuff up the permissions. Logic can be put into the software to do the permissions checking (see the permissions sketch after this list).
  • With MAS DB, James can keep an eye on how often it is updating with a view to requesting more frequent updates if needed.
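
One possible shape for the “use MAS, fall back to a local DB” tooling James described, sketched in Python with SQLAlchemy; the connection URLs are placeholders and the real schema is whatever MAS provides, so this is an assumption about the pattern rather than the actual implementation.

    # Hypothetical sketch: try the central MAS database first, fall back to a
    # local SQLite copy if MAS is unreachable. URLs and paths are placeholders.
    from sqlalchemy import create_engine, exc

    MAS_URL = "postgresql://user@mas-host/mas"               # placeholder
    LOCAL_URL = "sqlite:////local/path/cosima_cookbook.db"   # placeholder

    def get_engine():
        try:
            engine = create_engine(MAS_URL)
            engine.connect().close()   # cheap reachability check
            return engine
        except exc.OperationalError:
            # MAS unavailable (or access problems): use the local database.
            return create_engine(LOCAL_URL)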
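
For the umask issue with a recreated experiment DB file, the permissions check could look roughly like this; the path and the target mode (group read/write) are assumptions rather than what the Cookbook actually enforces.

    # Hypothetical permissions check: if the umask left the recreated DB file
    # without group read/write, add those bits back.
    import os
    import stat

    DB_PATH = "/path/to/cosima_cookbook.db"      # placeholder path
    WANT = stat.S_IRGRP | stat.S_IWGRP           # assumed target permissions

    mode = stat.S_IMODE(os.stat(DB_PATH).st_mode)
    if (mode & WANT) != WANT:
        os.chmod(DB_PATH, mode | WANT)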

MPI-IO

  • Marshall: Rui has made a lot of progress on MPI-IO. Marshall wrote a bunch of Fortran hooks into parallel netCDF and put them into FMS, then handed it over to Rui, who has worked on it a lot. netCDF4 struggles: it takes a long time to open and close files, with serious synchronisation issues around metadata, and the bottleneck is in MPI-IO tapping into parallel HDF5, which has been around for a long time but no one uses it. Rui has a working relationship with the Urbana HDF5 group. Rui has switched to pnetcdf, which works really well and doesn’t have the metadata synchronisation issues. It is not what we traditionally use, but shows 3-10x speed-ups over the serial case: writes that took a few hours now take an hour. It is usable, and the metrics show a clear improvement (see the parallel-write sketch after this list).
  • The downside is netCDF3 output, which isn’t what we usually produce. We usually stitch multiple files together; this approach eliminates that post-processing and makes a single coherent file. Not sure what to do: do we want to use this approach? It is good enough to put in the main MOM code, but turned off by default. It is very sensitive to Lustre striping and the number of writers; correct Lustre settings are essential.
  • If the overhead is a few minutes, it might be convenient to eliminate post-processing, thinking ahead to 1/30th-degree simulations. Aidan: would still need to post-process to convert to compressed netCDF4.
  • James: mppnccombine isn’t based on pnetcdf? Russ: no; it was attempted, but has been abandoned. Marshall: rewrite mppnccombine to use MPI-IO? Yes, a good idea, but NCI wanted to test it inside a model.
  • It has been very instructive: netCDF gets in the way, and so does HDF5. Not sure what the best way forward is. Performance might scale with the number of writers; maybe 1-2 writers per node, with collectives on the nodes and the output written by that number of writers (see the writers-per-node sketch after this list).
  • Dale has done this with native IO in the UM and got a 4x speed-up, though it is quite profligate with CPU hours.
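
For reference, the parallel single-file write being discussed looks roughly like the following when expressed through the netCDF4-python bindings; the actual FMS/MOM work is in Fortran against pnetcdf, this needs a parallel-enabled netCDF/pnetcdf stack, and the dimensions, variable name and file name are made up.

    # Illustrative parallel write: each rank writes its own slab of one shared
    # file instead of per-rank files stitched together with mppnccombine.
    # Requires netCDF4-python built against a parallel-enabled library stack.
    from mpi4py import MPI
    import numpy as np
    from netCDF4 import Dataset

    comm = MPI.COMM_WORLD
    rank, nranks = comm.Get_rank(), comm.Get_size()

    nlat_local, nlon = 10, 360                 # made-up decomposition along latitude
    ds = Dataset("ocean.nc", "w", parallel=True, comm=comm, info=MPI.Info(),
                 format="NETCDF3_64BIT_DATA")  # classic format routes through pnetcdf
    ds.createDimension("lat", nlat_local * nranks)
    ds.createDimension("lon", nlon)
    temp = ds.createVariable("temp", "f4", ("lat", "lon"))
    temp.set_collective(True)                  # collective writes matter for performance

    lat0 = rank * nlat_local
    temp[lat0:lat0 + nlat_local, :] = np.full((nlat_local, nlon), float(rank))
    ds.close()

As noted above, the Lustre striping of the output directory would still need to be set to match the number of writers for this to perform well.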
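
The 1-2 writers per node idea could be prototyped by splitting the world communicator by shared-memory node and nominating the lowest rank on each node as a writer, roughly as below; the gathering of data onto the writers is omitted and all names here are illustrative.

    # Sketch: choose one writer rank per node by splitting MPI_COMM_WORLD by
    # shared-memory node; the gather of field data onto the writers is omitted.
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    # Ranks sharing a node end up in the same sub-communicator.
    node_comm = world.Split_type(MPI.COMM_TYPE_SHARED, key=world.Get_rank())

    is_writer = (node_comm.Get_rank() == 0)    # e.g. one writer per node
    # Communicator containing only the writers, used for the collective file I/O.
    writer_comm = world.Split(0 if is_writer else MPI.UNDEFINED, key=world.Get_rank())

    if is_writer:
        # writer_comm.Get_size() equals the number of nodes; these ranks would
        # open the shared file in parallel and write the gathered node data.
        pass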

Actions

New:

  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (Marshall)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (Russ)
  • Profile ACCESS-OM2-01 (Marshall)

Existing:

  • Move FMS to submodule of MOM5 github repo (Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
  • Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).
  • Nudging code test case (Russ)
  • Redo SSS restoring with patch smoothing (Aidan)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (Andrew)