Technical Working Group Meeting, August 2017

Minutes

Date: 8th August 2017
Attendees:

  • Marshall Ward (NCI, Chair)
  • Aidan Heerdegen (ARCCSS ANU)
  • Nicholas Hannah (ARCCSS/Double Precision)
  • Russ Fiedler and Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff and Roger Bodman (CSIRO Aspendale)

COSIMA Models

  • Nic: Toy model is good for yes/no type tests
  • Invite James Munroe to the TWG meetings. He could be of assistance getting a test suite together.
  • Nic: using the Issues interface on GitHub has been very helpful. Hasn’t needed to write emails and has had answers to problems. Steve and Russ have been super helpful. Marshall: good that users have been using Issues. Nic: every time I go to write an email I ask, can I make this a GitHub issue? Dave Bi and Siobhan have useful input. Marshall: how do we get them onto GitHub too? Russ: Arnold has used it. Hopefully Dave and Siobhan will jump on to it also.
  • Peter: using a different MOM. Nic and Peter to meet to sort this out. Roger: can CICE be on the same repo? Nic + Peter will liaise.
  • Invite Arnold to these meetings.
  • Marshall: Bit reproducibility issues. Surprised ACCESS-CM2 has bit reproducibility problems. Nic found 4 issues that affected this. Payu was missing restarts. Also found a couple of issues where there was a different code branch for the first coupling time step. Red Sea and another one. Added checksums to all coupling fields and tested all of these. Runs now restart OK from restarts, but after 3 or 4 time steps they start to diverge. Probably still some small issue with restarts. It wasn’t just one problem. Some of these were Nic’s issues with restarts and payu, and some were particular to Nic’s setup. When it is solved can talk to the CM guys. The general approach is useful though. Concentrate on coupling fields.
  • Pete: combing log files and looking for checksums and numbers to compare between the runs.
  • Nic: once I get through all this I can talk to Peter about the method and use it to work on Peter’s issues.
  • Marshall: was Red Sea confirmed to be a repro issue? Did Russ fix it? Russ: Aspendale code was fixed. Marshall: was fixed in Aspendale but not main repo? Yes. Marshall: Arnold knew about that. Peter: we had the fix in that case and hadn’t shared it.
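
The checksum approach discussed above (compare coupling-field checksums between two runs and find the first time step at which they differ) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method actually used: the log line format, the field names, and the function names are all hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical log line format (an assumption, not real model output):
#   [chksum] step=3 field=sst value=9A3F00C2
CHECKSUM_RE = re.compile(
    r"\[chksum\]\s+step=(\d+)\s+field=(\w+)\s+value=([0-9A-Fa-f]+)"
)


def parse_checksums(log_text: str) -> Dict[Tuple[int, str], str]:
    """Map (time step, field name) to the checksum string found in a log."""
    checksums: Dict[Tuple[int, str], str] = {}
    for line in log_text.splitlines():
        match = CHECKSUM_RE.search(line)
        if match:
            step, field, value = match.groups()
            checksums[(int(step), field)] = value.upper()
    return checksums


def first_divergence(log_a: str, log_b: str) -> Optional[Tuple[int, List[str]]]:
    """Return (step, fields) for the earliest step where checksums differ.

    Only (step, field) pairs present in both logs are compared.
    Returns None if every shared checksum matches.
    """
    a = parse_checksums(log_a)
    b = parse_checksums(log_b)
    common = sorted(set(a) & set(b))
    for step in sorted({s for s, _ in common}):
        differing = [f for (s, f) in common if s == step and a[(s, f)] != b[(s, f)]]
        if differing:
            return step, differing
    return None
```

Reporting the earliest differing step together with the fields involved narrows the search to the coupling fields that diverge first, which is the point of concentrating on coupling fields.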

CMIP6

  • Peter talked to David Karoly. Wondering about spin-up for CMIP6. 500-1000yr spinup. Is it possible to spin up ocean first by forcing an Ocean/Ice model with JRA? Then add CM when it is ready. Ice issues might make big difference to stratification. Aidan: model drift issues won’t help. Marshall: can you stabilise stratification with OM run? Russ: deep stuff takes thousands of years to get into steady state. Marshall: ask this at a MOM meeting? Russ: ask Ocean modellers.
  • What is the CMIP schedule? Peter + Roger: about six months behind. Start production run early next year. Still working on configuration. Coupled CABLE model running now. The reproducibility issue has turned into several issues which need to be nailed down.
  • Marshall: when do you feel like you need to fix source/versions? Roger: delaying until the end of the year. How long would a 500 year 1 deg MOM spin up take? Nic: 0.5 hour/year. 50 years/day without queuing issues. Have to take crashing into account. Marshall: does 1 deg crash? Nic: probably not.
  • Nic: created an issue recently, wanted to run 50 years in a single submit. Memory leaks limit how long you can run. Maybe only 3 years.
  • Aidan: Should add multi-year runs per submission ability to payu.
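
The spin-up timing quoted above can be worked through as a rough estimate. The sketch below uses the figures from the discussion (0.5 hours of wall time per model year, and a possible limit of about 3 model years per submission). It ignores queue time and crashes, and the function names are illustrative only.

```python
import math


def spinup_wall_days(model_years: float, hours_per_model_year: float = 0.5) -> float:
    """Wall-clock days of continuous running needed for a spin-up."""
    return model_years * hours_per_model_year / 24.0


def submissions_needed(model_years: int, years_per_submission: int) -> int:
    """Number of job submissions if each submission covers a fixed number of years."""
    return math.ceil(model_years / years_per_submission)


if __name__ == "__main__":
    for years in (500, 1000):
        days = spinup_wall_days(years)
        jobs = submissions_needed(years, 3)
        print(f"{years} model years: {days:.1f} days of run time, {jobs} submissions")
```

At 0.5 hours per model year the throughput is 48 model years per day, consistent with the roughly 50 years/day quoted. A 500 year spin-up therefore needs about 10.4 days of continuous run time, or 167 submissions at 3 years each.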

Bathymetry

  • Aidan: how do you deal with non-advective cells? Russ: it is potholes with no advective velocity possible. If you allow cells to fill if they’re too thin, can create cells that have no velocities.
  • Russ to add his code to OceansAus repo.

New HPC

  • Marshall: Tender for new machine. Need to understand current limits of codes, whether the new machine will work, and what we need to get more performance. Convinced MOM is a RAM-bound code. Vectorisation is not making a difference. Want a machine with more RAM bandwidth, not more vectorisation. Away from KNL and Skylake, towards IBM POWER and AMD.
  • Peter: Met Office XC40 can run coupled model with 48*24 processors. They are 32-processor Broadwell nodes. Marshall: maybe running more threads? Roger: they run 2 threads.
  • Marshall: bring errors that stop it working. Roger: ok, will get some info together.
  • Marshall: incorrect understanding of message passing and halos. MPI messages in MOM5 are healthy. Get GB/s bandwidth, even corners. Problems are related to library or load imbalance, or maybe CPU throttling. We are doing a reasonable job of MPI. Faster interconnect may be useful. Broadwell is 10-15% faster, as it has faster interconnect.

Actions

New:

  • Invite Arnold Sullivan and James Munroe to TWG meetings.
  • Add feature request to payu: multiple runs per submission
  • Ask MOM Ocean meeting about 1000yr OM spin-up possibility
  • Russ to add all his ocean bathymetry code to OceansAus repo.

Existing:

  • Aidan investigate tenth degree MOM configs for benchmarks.
  • Possible benchmark configs (everyone)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Test Nic’s access-om model config on OceansAus (All)
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Add new test cases to Jenkins test suite (Nic).
  • Start a new google doc about coupler issues and MATM (Marshall)
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Do longer runs with Nic’s 1 deg and 0.25 deg ACCESS-OM2-JRA55 configs (Andy and Aidan)
  • Try repeat year forcing with Nic’s configurations (Nic and Andy)
  • Create document outlining options for configuration sharing (?)