Technical Working Group Meeting, April 2018

Minutes

Date: 10th April 2018
Attendees:

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Peter Dobrohotoff (PD), CSIRO Aspendale

COSIMA Meeting

MW: Will we present something again? Same as last time, a list of achievements? Consensus was yes we should.
MW: Want to make time during the meeting to work together on a collective problem? NH: Yes, definitely. Even for just a short time.

Common codebase

AH: Have rebased PD's MOM code onto the latest MOM5 source. Wrapped all of PD's incompatible changes in ifdef ACCESS_CM statements. This is available as branch cm2 on the MOM5 GitHub repo. Created a pull request to allow code review (https://github.com/mom-ocean/MOM5/pull/214).

AH: PD has provided a rose suite (u-aw048) for testing. AH successfully copied (u-aw405) and ran this suite. Created another suite (u-aw445) to test that this reproduces first copy with no changes. It does. Created a third suite (u-aw497), changed git URL to point at cm2 branch, but initial compile failed to find the source. Eventually got it to recognise the updated source, now compile fails due to an absence of a main routine. Needs some modification.

PD: Best to make small incremental changes to a suite. In this case just change the Fortran files and see if it works. Avoid changing how the compile is done, definitely avoid changing rose app conf. Just determining the compile flags used in the UM fcm build from the Met Office was not a trivial task.

MW: cylc is good, small and configurable. rose is difficult and opaque.

AH: Will get some help from others via the cm2-om2-harmonisation slack channel

AH: Next step is to do the same for CICE5 as has been done for MOM5.

NH: CSIRO is getting CICE from the UKMO. There are code changes under src, not just in the drivers.

AH: Does UKMO version of CICE5 have a special licence? Will we be able to host UKMO modifications to CICE5 on OceansAus repo? NH: CICE has a CICE licence

MW: UKMO runs CICE as a NEMO/CICE5 executable, not linked through OASIS like us.

Models

PD: ACCESS-CM2 doesn’t reproduce over restarts. Would like to run CICE stand alone. Does CICE reproduce over restarts?

NH: cleanest way to test is in the coupling code, before any model sends anything to another model, checksum fields. That is the point you can compare the output of a model. In the case of restarts, can include in checksum what current model run time is.

MW: there are 2 restarts, individual models and OASIS restarts. Have to make sure the same number of time steps is triggered in the models.

NH: OASIS restarts are not a problem. OASIS get tells you if it read out of a restart. Not reading out magically, still working using PUT and GET. GET is from a file instead of from a model.
After each GET and before each PUT, print out time, processor and checksum; can then see when the checksums diverge. Will be different before the PUT, can then identify the model. If you determine CICE was the culprit, can then look at a CICE-only run.
PD: How to do checksum? NH: Just sum whole array. MW: use MPP_checksum? NH: not in CICE. Our CICE code will output checksums if required. NH has already done this for CICE.
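A minimal sketch of the logging NH describes, written in Python for brevity rather than in the Fortran of the coupling code. The function names, log format and sample fields are hypothetical; only the idea comes from the discussion (sum the whole array, print time, processor and checksum around each PUT and GET, then find where two runs diverge).

```python
def field_checksum(field):
    """Sum the whole array; the hex form shows every bit of the result."""
    return float(sum(field)).hex()


def log_entry(event, field_name, model_time, rank, field):
    """Build one log line for a coupling event ("PUT" or "GET")."""
    return f"{event} {field_name} t={model_time} pe={rank} chk={field_checksum(field)}"


def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing line, or None."""
    for i, (a, b) in enumerate(zip(log_a, log_b)):
        if a != b:
            return i, a, b
    return None


# Hypothetical data: two runs whose field differs from the second coupling step on.
run_a = [log_entry("PUT", "ice_to_ocn", t, 0, [1.0, 2.0, 3.0 + t]) for t in range(3)]
run_b = [
    log_entry("PUT", "ice_to_ocn", t, 0, [1.0, 2.0, 3.0 + t + (1e-12 if t >= 1 else 0.0)])
    for t in range(3)
]
divergence = first_divergence(run_a, run_b)
print(divergence)
```

Including the model time in each line, as NH suggests for restarts, means a run that resumes at the wrong time shows up as a mismatch even when the field values happen to agree.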
MW: does OM2 reproduce? NH: I spent some time on this. Repro for a couple of coupling time steps, then diverges.
MW: PD finds that one 2-day run does not give the same result as two 1-day runs (1×2 vs 2×1). Have we done equivalent tests with OM2? NH: yes, not passing.
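The shape of such a restart test can be sketched with a toy model. Everything here is an assumption for illustration (the two-time-level update, the `complete` flag, the step counts); it is not ACCESS code. It shows how one long run is compared bit for bit with two shorter runs joined by a restart, and how a restart that drops part of the state breaks the comparison.

```python
def run(state, nsteps):
    """Advance a toy two-time-level model; each new value depends on both levels."""
    x, x_prev = state
    for _ in range(nsteps):
        x, x_prev = x_prev + 0.1 * x, x
    return (x, x_prev)


def write_restart(state, complete=True):
    """Return the state saved at the end of a run.

    With complete=False the older time level is lost and replaced by the
    current one, mimicking a restart file that misses part of the state.
    """
    x, x_prev = state
    return (x, x_prev) if complete else (x, x)


initial = (1.0, 1.0)

# One run of 4 steps versus two runs of 2 steps joined by a restart.
one_long = run(initial, 4)
two_short_ok = run(write_restart(run(initial, 2), complete=True), 2)
two_short_bad = run(write_restart(run(initial, 2), complete=False), 2)

print("complete restart reproduces:  ", one_long == two_short_ok)
print("incomplete restart reproduces:", one_long == two_short_bad)
```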
NH: MOM by itself might not do this anymore? MC: restarts ok. RF: if you had the redsea bug it wouldn’t reproduce. Old repro results null and void. Models also have coupling code that might cause repro issues.
MW: UM passes, would think GC3 reproduces.
MW: interesting that OM2 does not reproduce. Easier platform to test. ACCESS-OM2 needs to reproduce. NH: looking at this. With MATM changes, needs to make sure it works to get others to use it.
MW: Does someone want to check MOM? Restarts, processor layouts. AK: don’t change layouts often so wouldn’t know if it does currently reproduce with layout changes. NH: non-repro with layout changes indicates a bug. MW: maybe not a bug, but definitely volatile behaviour, maybe in a collective.
MW: Did find a repro problem with MPP_sum. Ran MPP_sum and MPP_reprosum, and got a difference in one bit. Even something that simple can cause issues. GFDL always matches with same test. Maybe something we can control with compiler flags.
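A small illustration of why a global sum can differ in the last bit: floating-point addition is not associative, so grouping the same values into different per-processor partial sums can change the result. This is a sketch only and does not show how MPP_sum or MPP_reprosum are implemented; `layout_sum` and the chunk sizes are hypothetical, and Python's `math.fsum` stands in for an order-independent sum.

```python
import math


def layout_sum(values, chunks):
    """Add up per-"processor" partial sums for a given split of the values."""
    total = 0.0
    start = 0
    for size in chunks:
        partial = 0.0
        for v in values[start:start + size]:
            partial += v
        total += partial
        start += size
    return total


values = [0.1, 0.2, 0.3]

# Same data, two different "layouts".
sum_a = layout_sum(values, [2, 1])  # (0.1 + 0.2) + 0.3
sum_b = layout_sum(values, [1, 2])  # 0.1 + (0.2 + 0.3)
print(sum_a.hex(), sum_b.hex(), sum_a == sum_b)

# A correctly rounded sum does not depend on the order of the inputs.
exact_fwd = math.fsum(values)
exact_rev = math.fsum(reversed(values))
print(exact_fwd == exact_rev)
```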
PD: As voltages go down, random errors can occur. MW: Bob found a bug with the tridiagonal solver due to a voltage issue in an Intel chip. Maybe something going on with flags?
RF: GFDL definitely use precise option. Atmospheric model crashes otherwise. MW: MetOffice also uses precise.
MW: Maybe all could look more carefully?
AH: Do we have a reproducibility checklist? Some strategy. Shared google doc?
NH: starting work on tenth degree performance. Anyone interested in doing some profiling? MW: Andy Hogg is pressuring to do this. Will do it this week, and send to NH. Hope to have a bunch of profiles for the meeting.
NH: MATM is now clean, 100 lines of code, uses CMake. Hoping to start using it. All goes through CICE. Nothing about coupling has changed.
NH: Want to use newest version of OASIS-mct (v3 not v2). Improvement in performance, can collect together MPI comms.

Actions

New:

  • Poll TWG on list of achievements for Meeting presentation (MW)
  • Shared google doc on reproducibility strategy (AH)

Existing:

  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare our OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)