Technical Working Group Meeting, March 2018

Minutes

Date: 13th March 2018
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen and Andrew Kiss (CLEX ANU)
  • Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff, Roger Bodman (CSIRO Aspendale)

Models

ACCESS-OM2-01

Andrew: stability was bad. A couple of weeks ago Ben Menadue said a dodgy cable was fixed, and then it was working fine. Lately maybe 1 in 8 model runs fail. Model stops before time stepping. Does init. No model runtime. All the hangs look the same. Saved most of the outputs. Never timesteps. Hangs Marshall was looking at were OASIS restart writes (OpenMPI 2.1). OpenMPI 1.10.2 fails differently and maybe more randomly. OpenMPI 2.0 was fixing a problem but adding a new problem.
Aidan: Russ’ stuff might be useful to kill the job and resubmit.
Aidan: Should use padb to find where processes hang. Marshall: can take 5+ mins to produce trace. Aidan: Need to give Marshall info on crashes.
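Note: the kill-and-resubmit idea can be sketched as a small watchdog. This is a minimal illustration only, not Russ' script: the function names, the choice of a log file as the progress signal, and the timeouts are all assumptions. It treats a job as hung if the file it is expected to write stops changing for a set period.

```python
import os
import subprocess
import time


def _stamp(path):
    """Return (mtime, size) for path, or None if the file does not exist yet."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_mtime_ns, st.st_size)


def run_with_watchdog(cmd, progress_file, stall_timeout, poll_interval=1.0):
    """Run cmd and kill it if progress_file stops changing for stall_timeout seconds.

    Returns (returncode, hung); hung is True when the watchdog killed the job.
    All names here are hypothetical.
    """
    proc = subprocess.Popen(cmd)
    last_stamp = _stamp(progress_file)
    last_change = time.monotonic()
    while proc.poll() is None:
        stamp = _stamp(progress_file)
        now = time.monotonic()
        if stamp != last_stamp:
            # The job wrote something: reset the stall clock.
            last_stamp, last_change = stamp, now
        elif now - last_change > stall_timeout:
            proc.kill()
            proc.wait()
            return proc.returncode, True
        time.sleep(poll_interval)
    return proc.returncode, False


def run_until_success(cmd, progress_file, stall_timeout, max_attempts=3,
                      poll_interval=1.0):
    """Resubmit cmd until it finishes cleanly; return the attempt number or None."""
    for attempt in range(1, max_attempts + 1):
        returncode, hung = run_with_watchdog(
            cmd, progress_file, stall_timeout, poll_interval)
        if not hung and returncode == 0:
            return attempt
    return None
```

Since the hangs described here happen before the first timestep, a stall timeout a little longer than normal initialisation would be the natural setting, but that value would need to be measured.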
Aidan: ACCESS-OM2-01 running on only 4.2K processes. Shouldn’t be as fast as MOM-SIS-01 which was running on 5232 (7200 config masked).
Andrew: can’t get model on to queue during business hours. Runs start in the evening.
Aidan: Queue stuff unavoidable. Just very busy with some large jobs queued that Andrew’s job can’t leapfrog.
Andrew: ACCESS-OM2-01 quite variable in runtime. Between 2 and 3 hours per month. Just one month per submit. Variability forces shorter runs. Currently running 360s, every month but August. Currently testing 400s.
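Note: a quick worked calculation of what the timestep change means in step counts. The 31-day month is an assumption for illustration, and the idea that wall time scales with step count is only a rough guide given the runtime variability noted above.

```python
SECONDS_PER_DAY = 86400


def steps_per_month(days, dt):
    """Number of model timesteps of dt seconds needed to cover `days` days."""
    total = days * SECONDS_PER_DAY
    if total % dt != 0:
        raise ValueError("timestep does not divide the run length evenly")
    return total // dt


for dt in (360, 400):
    print(dt, steps_per_month(31, dt))
```

Both 360s and 400s divide a day evenly (240 and 216 steps per day), so moving to 400s takes 10% fewer steps per month.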
Marshall: tried yalla pml? Any effect? Andrew: hard to tell. Marshall: 100% OpenMPI 2.0 hang went away with yalla. Andrew: of first 4 submits, 2 hung. Seems ok now, but no big improvement.
Marshall: rewrote global field to use alltoallw (actually using gather). Didn’t fix hang. 1 and 025 deg worked and produced correct response. Tried throttling, update a fraction of the domain and 0.1 ran. Library can’t handle the number of messages.
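Note: the throttling idea (update a fraction of the domain at a time) can be modelled serially. The sketch below is not the FMS code and makes no MPI calls; it only shows the scheduling, where the root handles at most max_in_flight tiles per stage. All function names and the tile layout are assumptions.

```python
def decompose(field, tiles_x, tiles_y):
    """Split a 2D list-of-rows field into tiles: rank -> (i0, j0, block)."""
    ny, nx = len(field), len(field[0])
    tiles = {}
    rank = 0
    for ty in range(tiles_y):
        j0, j1 = ty * ny // tiles_y, (ty + 1) * ny // tiles_y
        for tx in range(tiles_x):
            i0, i1 = tx * nx // tiles_x, (tx + 1) * nx // tiles_x
            tiles[rank] = (i0, j0, [row[i0:i1] for row in field[j0:j1]])
            rank += 1
    return tiles


def batches(ranks, max_in_flight):
    """Yield successive groups of at most max_in_flight ranks."""
    for start in range(0, len(ranks), max_in_flight):
        yield ranks[start:start + max_in_flight]


def staged_gather(tiles, nx, ny, max_in_flight):
    """Assemble the global field one batch of tiles at a time.

    Returns (field, peak), where peak is the largest number of tiles
    outstanding at once.
    """
    field = [[None] * nx for _ in range(ny)]
    peak = 0
    for batch in batches(sorted(tiles), max_in_flight):
        # In an MPI code, the root would post receives for this batch only
        # and wait for them all before starting the next batch.
        pending = [tiles[r] for r in batch]
        peak = max(peak, len(pending))
        for i0, j0, block in pending:
            for dj, row in enumerate(block):
                field[j0 + dj][i0:i0 + len(row)] = row
    return field, peak
```

The trade-off is more stages (and so more latency) in exchange for a bounded number of outstanding messages.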
Russ: global field just for restart? Marshall: No. Yes used to produce OASIS restart. Also used when we use io_layout. Uses function to gather whole io subdomain onto individual master rank. Used inside FMS. Probably don’t fail at the moment. Thought that function was a relic. Still there and still used. Code changes did work, but failed on io servers when I have an io server of 1 (happens with masked runs).
If Marshall can get it working, can chunk alltoallw, free us from magic mpi/accelerator flags. This is a positive thing.
Marshall: Hanging on single tile in masked run is a bug. MOM has some logic checking for a single tile and not running stuff, which might make other bits hang if they are expecting some communication.
Marshall: now using MPI types, avoids buffers.
Roadmap for OpenMPI updates:
  1. Resolve issue with io_layout so that the 1-rank case works
  2. Clean up MPI code, get out of FMS code
  3. Try chunking/staging in alltoall. Might not need MXM at all
  4. Try to get into FMS and switch to independent FMS module (maybe GFDL)
  5. Address performance issue
  6. Run on OpenMPI 2.0
Marshall: Hope to solve random hangs. Run on supported libraries. Less dependent on magic MPI flags. More resilient for new machine. Maybe wait to submodule until GFDL accepts patch?
Marshall: If MPI issues happen again we have a better strategy. Can’t just replace point to points with collectives. Library has issues. Won’t scale to a new machine.

MOM-SIS-025-WOMBAT

Matt: Paul Spence is happy with WOMBAT 025. Paul isn’t hanging? Aidan: Does Matt see WOMBAT runtime variations? Matt: not too frequently. Maybe check with Paul? When Matt did initial testing, got identical output and timings. So similar he wasn’t sure if he had put the new code in!
Marshall: Paul was getting hang in xgrid init. New code required for that. Runtime is weird. Need more info before getting worked up.
Marshall: diag_step stuff is really slow. Expensive function. Scales horribly.
Aidan: Matt, have you done a pull request to mom-ocean? Matt: need testing? Marshall: if it works for you, don’t worry. Matt: a number of hooks from tracer package into ocean and ice model. Similar to ACCESS, not quite the same. Few extra variables in boundary package.

Other business

Marshall: Unlimited time axis on sss restoring. Aidan: Yes was an issue, fixed.
Roger: not that busy in this space. Still wondering about change between 12×8 and 8×12. Marshall: no expectation for that to be reproducible.
Marshall: restart issues seem more severe. Has it been looked into? Peter: No. Tasked with fixing, not looked at this year.
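Note: assuming 12×8 and 8×12 refer to processor layouts, one common reason not to expect bit reproducibility is that floating-point addition is not associative, so a global sum accumulated tile by tile depends on how the domain is split. The sketch below is illustrative only; the field size, random values and function name are assumptions, not taken from the model.

```python
import math
import random


def tiled_sum(field, tiles_x, tiles_y):
    """Sum a 2D field by first summing each tile, then adding the partial sums."""
    ny, nx = len(field), len(field[0])
    total = 0.0
    for ty in range(tiles_y):
        j0, j1 = ty * ny // tiles_y, (ty + 1) * ny // tiles_y
        for tx in range(tiles_x):
            i0, i1 = tx * nx // tiles_x, (tx + 1) * nx // tiles_x
            partial = 0.0
            for j in range(j0, j1):
                for i in range(i0, i1):
                    partial += field[j][i]
            total += partial
    return total


random.seed(0)
field = [[random.random() for _ in range(96)] for _ in range(96)]

sum_12x8 = tiled_sum(field, 12, 8)
sum_8x12 = tiled_sum(field, 8, 12)
reference = math.fsum(x for row in field for x in row)

# The two layouts agree to rounding error but need not agree bit for bit.
print(sum_12x8 - reference, sum_8x12 - reference)

# The underlying effect in its smallest form: grouping changes the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

Any difference of this kind starts at rounding level; in a model it can then grow over a long integration.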

Common codebase

Peter: Agenda item that Nic and I would bring OM2 and CICE5 to a common codebase. Nic doesn’t see this as being feasible anymore because the models have diverged so much.
Marshall: Does he mean he has put in coupling code that has diverged from yours? Peter: not sure. Hoping to talk to Nic today about this. Is the idea really dead? Just taking a lead from Nic as don’t know about the other codebases. Had a couple of meetings with Nic. Nic went and looked at the code, expected the differences to be trivial, but they weren’t.
Peter: Would like to work from a common codebase. Would like to capture the activity on GitHub. Some scientists would like them to be the same, can’t really make the case for that, but that is what they want. Not sure how to proceed. If we don’t share code now, we won’t ever. Do we just drop our MOM5 and grab the GitHub version? Seems like a lot of work at this point.
Matt: can you clarify the relationship between the code bases? Is it closely related to OM, CM etc? Peter: No, can’t give clarity.
Peter: in 2015 Nic and Hailin put together a version of MOM5 that was to be used with GA6. No idea what was specific about Hailin’s version. Not sure why they can’t be brought back together. Can we do some emails/issues to get this moving because there is a month.
Russ: I’d like to be brought in on this. Part of my decadal work is to couple wave watch 3.0 into ACCESS-CM2 and OM2. Worry that they are diverging so much.
Marshall: Andy would be disappointed about this news. Six years ago aspired to this goal. 3 years passed, nothing happened, ok, but to drop it now is unfortunate. Marshall: difficult for Aspendale, and volatile with runs about to begin. Next CMIP aspiring to do this? Disappointing to the science guys. Can resources be pumped into this?
Andrew: one of the objectives of COSIMA was to avoid duplication of effort. Marshall: doing better than duplication. Some redundancy.
Aidan: I think MOM should be relatively straightforward to get harmonised, CICE is the issue? Russ: yeah, problem with CICE. A lot of the things that are done in individual components should be done in OASIS-mct. Averages and double looping that is really confusing. Using native OASIS calls to do averaging would be much simpler. Old OASIS had to bring it all on to one processor, which was a disaster. Decisions made were sensible then, but coming back to bite.
Russ: Way to run with OASIS is to call it every timestep. Let OASIS decide. Don’t need specialist code in individual components. It is distributed. Time averages can be done on the local processor.
Marshall: I think the problem Nic found was an inefficiency between ocean/ice code communication. Maybe that makes merge undoable. Weird log-jamming of messages. Nic has done this and done it in an ocean/ice context without too much consideration of atmosphere.
Marshall: Nic and I were going to sit down and look at it.
Russ: would like to get my head around it.
Aidan: can we converge to a common codebase? Maybe CM needs to make these changes anyway?
Peter: Nic said need to get CM2 and ESM code up to date with MOM5. From CM perspective, need to be a bit risk averse. Also risk with not changing. Already doing spin ups for CMIP6. A specific change I am aware of: pull request from Fabio. Ticket #211? Steve Griffies e patch? Paul Spence convection code changes. Were important to his OM2 model. Haven’t been regression tested. Needed for a student.
Peter: Changes important enough to pull into CMIP6. Conflict with direct merge. Can do by hand. Marshall: hand merge if required at this stage.

Actions

New:

  • Follow up with Andy Hogg regarding shared codebase (Marshall)
  • Marshall to liaise with Andrew Kiss about tenth model hangs (Andrew, Marshall)
  • Pull request for WOMBAT changes into MOM5 repo (Matt, Marshall)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (Russ)

Existing:

  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (Marshall)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to GitHub (Russ)
  • Profile ACCESS-OM2-01 (Marshall)
  • Move FMS to submodule of MOM5 github repo (Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
  • Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).
  • Nudging code test case (Russ)
  • Redo SSS restoring with patch smoothing (Aidan)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (Andrew)
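
Note on the action item about adding an MPI barrier before the ice halo update timer: the reasoning can be shown with a toy model that uses threads in place of MPI ranks. It is a sketch under assumptions (function name, sleep-based workloads, a barrier standing in for the halo exchange), not CICE code. Without the extra barrier, a lightly loaded rank's halo timer absorbs the time it spends waiting for the slowest rank.

```python
import threading
import time


def timed_halo(loads, exchange_cost, sync_first):
    """Return the halo time seen by each simulated rank.

    loads: seconds of uneven work per rank before the halo update.
    exchange_cost: seconds the exchange itself takes on every rank.
    sync_first: if True, synchronise all ranks before starting the timer.
    """
    n = len(loads)
    barrier = threading.Barrier(n)
    halo_times = [0.0] * n

    def rank(r):
        time.sleep(loads[r])            # uneven compute (load imbalance)
        if sync_first:
            barrier.wait()              # the extra barrier from the action item
        start = time.perf_counter()
        time.sleep(exchange_cost)       # cost of the exchange itself
        barrier.wait()                  # exchange completes when all ranks arrive
        halo_times[r] = time.perf_counter() - start

    threads = [threading.Thread(target=rank, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return halo_times


if __name__ == "__main__":
    loads = [0.0, 0.0, 0.3]             # the last rank has all the work
    print("no barrier:  ", timed_halo(loads, 0.02, sync_first=False))
    print("with barrier:", timed_halo(loads, 0.02, sync_first=True))
```

If the halo timings drop to roughly the exchange cost once the barrier is added, the slow timings were load imbalance rather than slow communication.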