Technical Working Group Meeting, June 2018

Minutes

Date: 12th June 2018
Attendees:

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Peter Dobrohotoff (PD), CSIRO Aspendale
  • Justin Freeman (JF), BoM Melbourne
  • Nic Hannah (NH), Double Precision

TWG Meeting

JF: Would be able to attend more regularly if there was a calendar invite, which would enable him to schedule the meeting. How do we integrate calendars for Justin?

COSIMA Models

AK: Bathymetry error in tenth model in Cumberland Sound, Baffin Island. Causes model blow-ups.
RF: Yes, blast it out. Russ will do it today. AH: Do we need any changes to restart/input files? RF: If eta_t is below zero, might have to set it to zero. Otherwise will complain about penetrating rock.
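RF's suggested restart adjustment could be sketched roughly as below. This is only an illustration, assuming the restart's eta_t field has already been read into a 2D list and that a boolean mask marks the cells touched by the bathymetry edit. The function name and the mask are hypothetical and are not part of MOM.

```python
def clamp_negative_eta(eta_t, edited_mask):
    """Return a copy of eta_t with negative values set to zero in edited cells.

    eta_t       -- 2D list of sea surface heights (hypothetical restart field)
    edited_mask -- 2D list of bools, True where the bathymetry was changed
    """
    return [
        [0.0 if (edited and eta < 0.0) else eta
         for eta, edited in zip(eta_row, mask_row)]
        for eta_row, mask_row in zip(eta_t, edited_mask)
    ]


# Made-up values: only the negative value inside the edited region is reset.
eta_t = [[-0.3, 0.1], [-0.2, 0.4]]
edited_mask = [[True, False], [False, True]]
clamped = clamp_negative_eta(eta_t, edited_mask)
```

Cells outside the mask are left untouched, so the rest of the restart is unchanged.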
AK: tenth very unstable over the weekend.
MW: longjmp error means the backtrace is failing. Memory got so severely corrupted that it can’t be properly debugged.
“nearest_index array must be monotonically increasing” error.
AK: Sweep and resubmit, and it works.
AK: More errors since turning on diagnostics for Adele. RF: are these globals? MW: could be FMS bugs because MPI is being strained and things are out of order.
AK: daily outputs in regional area: temp, salt, uhrho_et, vhrho_nt, rho_dzt. RF: spewing output from a lot of processors as regional outputs do not use io_layout, so every affected processor is outputting data. AK: only doing this for 2 years and will then turn it off. It has slowed the model down. Timing has become erratic. RF: Some processors are not outputting the field; not sure why it should make it unstable.
AK: Put up as an issue.
AH: What is the current model config for tenth, and performance? AK: 4.5K on MOM, 2K on cice. Runs with 450s timestep, 1.5 hr/mo. Now running at 400s. Crash in Baffin Island goes away with shorter timestep.
AH: Try and get tenth running faster. Ice no longer holding back timestep. AK: Was running 540s before Baffin Island issues.
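As a rough guide to what the timestep change means for throughput, the figures quoted above can be scaled as below. This sketch assumes wall time is inversely proportional to the timestep (same cost per step), which ignores I/O, diagnostics and the erratic timing noted in these minutes. The function name is hypothetical.

```python
def scaled_hours_per_month(ref_hours, ref_dt_s, new_dt_s):
    """Estimate wall hours per model month at new_dt_s.

    Assumes the number of steps, and hence wall time, scales as 1/dt.
    """
    return ref_hours * ref_dt_s / new_dt_s


REF_HOURS = 1.5   # hr per model month, as reported by AK
REF_DT_S = 450    # timestep (s) at which that figure was measured

# Estimates for the three timesteps mentioned in the discussion.
estimates = {dt: scaled_hours_per_month(REF_HOURS, REF_DT_S, dt)
             for dt in (400, 450, 540)}

for dt, hours in estimates.items():
    print(f"dt={dt}s: ~{hours:.2f} hr per model month")
```

Under that assumption, going from 450s to 400s costs roughly 12% more wall time per model month, while returning to 540s would save roughly 17%.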
MW: netCDF4 v4.4 has FPE turned on. Built by a different person. Historically always had FPE disabled. AK: 4.2.1.1 in MOM. 4.3.2 in CICE. 4.4.1 in matm. OASIS has default. AK: waiting for yatm build to be signed off. Ben M suggested we should be using openmpi/1.10.7 (optionally with debug). A number of bugs were fixed between 1.10.2 and 1.10.7.
AK: Want to try out orange layout with CICE. Currently 2000 cores with no landmasking and 1 block / processor. Could be run a lot cheaper. Currently MOM bound. Should be able to run on well below 2000 cores. Waiting for yatm to be sorted out. Trying some frankenstein builds and backporting to matm.
AK: Timing is very inconsistent. RF: Ocean eta and pbot diagnose has a collective. Does a sum. Somewhere it has hung. All depend on this function. MW: Could be load imbalance in CICE.
MW: MPI_Comm_split hangs or fails intermittently.
AK: No stock runs since looking at runtime. MW: thinks his profiling was wrong because of lack of ice. AK: looking at the load imbalance there is ice. MW: ran from rest for 10 days. CICE would normally do work that wasn’t captured. MW: tried to redo profiles and all runs stopped working. Shocking.
MW: moved on to yatm. Putting scorep into yatm had issues, so has not redone the profiles with realistic ice.
AK: will spin off run with no diagnostics as point of comparison. MW: at dt=300s, 100s/day seemed reproducible. Andrew’s is 50% slower. Maybe more stuff happening. Two issues need to be resolved: first, the CICE-bound results are different; second, the MOM slowdown. Matching MOM-SIS is an important goal.
MC: how much longer will the spinup run? When will it switch to IAF? AK: will switch to IAF ASAP. Andy is running RYF @ quarter. Then Paul Spence will run IAF quarter. Currently 34 years of spin-up with 84/85 repeat year. MC: Will it start from year zero? AK: there are biases in RYF, so not sure if we should spin off from this run. Might depend on how many years we have to get done.
MC: will there be multiple cycles of IAF? AK: depends. MC: start at WOA or from RYF spin-up?
AH: For the model documentation paper there will be the standard 5 x IAF (JRA55) protocol for 1 degree and 0.25 degree. The MOM meeting discussed strategy for 0.1 degree. Andy Hogg thought the tenth was just too expensive to run this protocol and might have to run only one cycle of IAF, or maybe spin up with RYF and then run IAF from 85 onwards. Whatever was done would be repeated in a second quarter degree run to provide a point of comparison between the different resolutions.
RF: interested from 93 once the satellites go up.
JF: wanted to get up to speed. Looks over minutes when they come out, very useful. Mirko has been doing some runs. Will try and join in regularly. BoM will take up ACCESS-OM2 when up to speed. Will be OceanMaps version, used for forecasting.
AH: Andy running KDS50 for 0.25 deg for RYF spin up. Found KDS75 too unstable.
JF: Mirko is testing COSIMA models in back end. Mirko is getting up to speed on what we’ve done. Need the 75 level (COSIMA) grid. Will do some hindcast runs and compare with OceanMaps. Don’t have experience with the sea ice model. Don’t know how it will affect forecasting. Need to look at the ice parameterisation. Also need to look at data assimilation. Will talk to Russ and Matt. At some point will be able to contribute back, will work from GitHub repo, using same codebase.
AK: run parameters and namelists on git repo are a long way out of date. JF: can we make sure these are updated? AK: Still in a state of flux. Still bedding down YATM configurations. Will do best.

ACCESS OM2/CM2 Code Harmonisation

AH: What is the other significant code difference in CM2 that Russ wanted to reimplement? RF: wave mixing scheme. Gets added into KPP. Comes via CVMix package. Two ways to implement. 1. 10m winds to come in via sbc. 2. Can empirically calculate them in MOM6. Russ has implemented this scheme under CM2.5 framework. Run for a while. Had to put in a limiter because it caused too much mixing. Dave reckoned it didn’t make a difference. Haven’t looked at the most recent results. Running with CM2.5 coupled model.
RF: Also another scheme Russ wants to implement. Slightly different to ACCESS-CM2. Both schemes already in MOM6. One of them is in CVMix. That is what Dave has implemented in MOM5. Taken routine out of CVMix and plopped it into KPP module to give enhanced mixing. Also need 10m wind information to come in. Need changes in surface flux code. Russ has done this. Russ has implemented same thing, just a change in the way winds get through. Not sure why ACCESS-CM2 didn’t see a difference.
RF: Occasionally get massive mixing coefficients in KPP so put in a limiter.
RF: will put code changes into master branch. AH: when you have done this I can pull into CM2 and can test. RF: Griffies wants it in MOM5.
PD: followed along in Slack channel. Not sure about all technical details. Big difference after 10 days between harmonised code and CM2 codebase. Has this been solved? How far along are we with this? Spinups will not have harmonised code if we don’t have a frozen version soon. ESM and CM2 groups want to know how close we are. We haven’t helped much to this point. How can I contribute?
PD: copied suite. Ran it. Thought he was tracking down a bug. PD: couldn’t find preprocessed source files. MW: do we run cpp? I get the right source code lines and don’t see .f90 files. No we don’t … which is why Peter couldn’t find them.
RF: why was Red Sea fix timing different? CM code has a fix? AH: might be because my fix uses relative time, not absolute model time. RF: timing fix should have absolute origin. AH: I’ll check.
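One way a relative versus absolute time origin could change when a periodic fix fires is sketched below. This is a toy illustration only: the interval, timestep and segment times are made-up values, and fire_times is a hypothetical helper, not code from MOM or ACCESS-CM2.

```python
def fire_times(segment_start_s, segment_end_s, step_s, interval_s, origin_s):
    """Model times (s) in one run segment at which a periodic fix would fire.

    The fix fires whenever the time elapsed since origin_s is a whole
    multiple of interval_s.
    """
    return [t for t in range(segment_start_s, segment_end_s + 1, step_s)
            if (t - origin_s) % interval_s == 0]


STEP_S, INTERVAL_S = 1800, 3600                   # hypothetical values
SEGMENT_START_S, SEGMENT_END_S = 5400, 12600      # segment restarted mid-interval

# Absolute origin: the phase is tied to model time zero, so restarts do not shift it.
absolute = fire_times(SEGMENT_START_S, SEGMENT_END_S, STEP_S, INTERVAL_S, origin_s=0)

# Relative origin: the phase is reset to the start of each run segment.
relative = fire_times(SEGMENT_START_S, SEGMENT_END_S, STEP_S, INTERVAL_S,
                      origin_s=SEGMENT_START_S)
```

With these numbers the two lists have no times in common, so two codes that differ only in the choice of origin would apply the fix on different steps.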
AH: I don’t think there is that much more to go for the harmonisation.
PD: when can I run harmonised MOM?
RF: when I can find some time to put in there. Now we have a way forward. Hopefully in a week or two.
PD: will put runs on ASAP. If harmonised code not ready, won’t be in spinups.
AH: will liaise with Peter and tell him as soon as something is ready.
MW: if there are differences what do they use? AH: they will use the MOM5 repo as far as I know.

Actions

New:

  • Edit tenth bathymetry to remove Cumberland Sound (RF)
  • Create calendar invites to TWG Meeting (AH)
  • Update model namelists and other configurations on OceansAus repo (AK)
  • Check Red Sea fix timing is absolute, not relative (AH)

Existing:

  • Shared google doc on reproducibility strategy (AH)
  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)
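The reasoning behind the MPI barrier action above (NH) can be illustrated with a toy timing model. This is a sketch with no real MPI calls: it assumes the halo update cannot complete on any rank until the slowest rank has arrived, and the arrival times and halo cost are made-up numbers. The function name is hypothetical.

```python
def halo_timer(arrival_s, halo_cost_s, barrier_first):
    """Per-rank time charged to the halo-update timer.

    arrival_s     -- time (s) at which each rank reaches the halo update
    halo_cost_s   -- true cost of the halo exchange once all ranks are present
    barrier_first -- if True, a barrier outside the timer absorbs the waiting
    """
    last = max(arrival_s)
    if barrier_first:
        # All ranks are synchronised before the timer starts,
        # so the timer sees only the exchange itself.
        return [halo_cost_s for _ in arrival_s]
    # Without the barrier, early ranks wait inside the timed region
    # for the slowest rank, inflating the halo timer.
    return [(last - t) + halo_cost_s for t in arrival_s]


arrival = [1.0, 1.2, 3.0]   # rank 2 has much more ice work (hypothetical)
without_barrier = halo_timer(arrival, 0.1, barrier_first=False)
with_barrier = halo_timer(arrival, 0.1, barrier_first=True)
```

If the measured halo time collapses once the barrier is added, the slow timings are load imbalance showing up as synchronisation wait rather than a slow halo exchange.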