Technical Working Group Meeting, July 2018

Minutes

Date: 10th July 2018
Attendees:

Marshall Ward (MW) (Chair), NCI
Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
Peter Dobrohotoff (PD), CSIRO Aspendale

Tenth Model Update

AK: Starting from yr 38, new codebase, hangs on MPI_finalize. Hard to trace. padb can’t help.

MW: The ranks we are not getting traces from have probably shut down cleanly. Depending on how a process exits, it may leave cleanly. Russ added explicit backtrace calls to force a backtrace.

RF: Only if it goes through a particular routine. If it calls a FATAL within MOM. Doesn’t happen if something internal goes haywire.

AH: bug in the code according to Nic? AK: recompiled. Will run this morning. MW: may need to force backtraces in some situations?

OM2/CM2 Code Harmonisation

AH: finished wave mixing?

RF: Tested in CM2.5. Happy with results. Can get from my GitHub. MW: happy doing PR? AH: I will take care of the code wrangling.

RF: But not completely happy with it on the MOM side. Using one instantaneous value rather than an average. Should be fine for ACCESS. AH: your code? RF: had to add in a field, bit tricky to know what it is doing. Has to get it from CICE, hard to figure out what is going on. Matt fixed it 5 years ago. Getting one value in the coupling stage. One timestep fine. Coupling several time steps. Only when MOM-SIS. Running with ACCESS it is fine. MW: will it produce the same numbers for a MOM-SIS run? RF: won’t change anything. MW: Change FMS code? RF: No, coupling code. Added in an extra field. In the coupling code, in the flux exchange. Atmosphere supplies wind at its bottom level. Scaling law to calculate 10m winds, which is what ACCESS passes. In normal MOM code, doesn’t have access to that field. Added in something to get that field, pass to ICE, which then passes to the ocean. ACCESS code, through OASIS, it passes these 10m winds directly. Difference to ACCESS-CM2 code, passed through ocean_sbc now. ACCESS took the ice/ocean boundary field and sent directly to KPP scheme. Shouldn’t do that, should be going through ocean_sbc. Now 10m winds are in the velocity derived type. Cleans up the interfaces. Another slight change with the fraction of ice passed in. Had some #ifdef in ocean_model. Now made aice a local variable in ocean_sbc. Doesn’t have to be passed around. Same interface with ACCESS or MOM-SIS.

RF: Now distinction in KPP code for ACCESS version and MOM version, just some namelist variables. MW: 10m always passed? RF: there by default. MW: so memory usage is marginally higher? RF: 2D field. MW: Otherwise no impact? Increased data transfer? MW: Previously MOM-SIS was converting to a stress? RF: Yeah, just have this extra field. MW: Good we now have 10m winds. Was awkward. RF: Also put in an empirical method to calculate 10m winds given friction velocity. Don’t need 10m winds in case forcing with fluxes, can’t regenerate this. This is copied from MOM6.
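Illustration of the kind of empirical relation RF describes (10m wind from friction velocity). This is a generic textbook sketch, not the routine copied from MOM6 and not RF's code: it assumes a neutral logarithmic wind profile with a Charnock-type roughness length, an air-side friction velocity, and illustrative constants. The function name and all parameter values are assumptions.

```python
import math

VON_KARMAN = 0.4   # von Karman constant
GRAVITY = 9.81     # m s-2
CHARNOCK = 0.011   # assumed Charnock parameter (illustrative value)
NU_AIR = 1.5e-5    # assumed kinematic viscosity of air, m2 s-1


def u10_from_ustar(ustar: float) -> float:
    """Neutral 10 m wind speed (m/s) from air-side friction velocity (m/s).

    Hypothetical helper for illustration only.
    """
    if ustar <= 0.0:
        return 0.0
    # Roughness length: Charnock (wave) term plus a smooth-flow viscous term.
    z0 = CHARNOCK * ustar**2 / GRAVITY + 0.11 * NU_AIR / ustar
    # Neutral logarithmic wind profile evaluated at 10 m.
    return (ustar / VON_KARMAN) * math.log(10.0 / z0)
```

A relation of this kind only matters when the forcing supplies fluxes rather than winds, which is the case RF mentions. Stability corrections are ignored here.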

RF: Do you mind having elemental functions? AH: No! Love elemental. MW: good, aspirational.

RF: Aidan look at what I’ve done. MW: this is the bottleneck in the code merge. Awesome!

PD: Russ, thanks for work on Harmonisation. Can we now test on CM2? RF: Yes, depends on Aidan’s updates. Just based on current version of MOM. Anything else isn’t there. Once Aidan does his stuff should be fine. AH: Yes I will do this. PD: timeframe is yesterday. Estimate of timeframe pass on to ESM meeting and CMIP6 meeting. Other people make decisions. Hopeful to use harmonised code. AH: Want me to attend a meeting? PD: ESM meeting on Friday. Matt, helpful? MC: a report from PD would be enough. PD: If RF is there, that might be enough. PD: 11 am Friday.

Model Reproducibility

PD: Any work on restarts? Working on warm restarts in CM2? AH: more information? MW: 2×1 vs 1×2 jobs. AH: not that I know. Stability more information.

PD: the way a scientist sees this, a model is perturbed at every restart. How can you write a paper with this “feature”? Needs to be fixed. MW: MOM5/6 can do this. ACCESS-CM1 can. ACCESS-OM2 cannot. UM can do this. PD: CMIP5 runs can do this. MW: tested MOM-SIS and didn’t work. Steve reckons settings are correct for reproducibility. Tested a year ago, all differences in tripole. Didn’t push this. With GFDL coupler, atm, ice model. Not our model. Nic confirmed this issue with OM2. libaccessom2 growing pains have pushed this out. AK: scientific credibility requires this to be solved. MW: floating point arithmetic is a perturbation. MW: Get consistency with consistent restart times. PD: if FP errors are of the same magnitude as restart errors, maybe we can say they’re ok. Interesting perspective. AK: should be able to save the state of the model, reload and carry on. MW: the order something is being handled with init not same as time step. AK: fields calculated from saved variables might not match. MW: need checksums at every step. Needs to be someone’s job. Need to communicate that to the people that control us. AH: Needs to be prioritised by Andy Hogg. AK: Are these differences large, or least significant bit? If the model is stable any perturbation will disappear, but maybe a new trajectory due to chaos. If they are the same order as numerical round-off error, or a different compiler or optimisation, they may be of the order of perturbations that are being made all the time. PD: a lot of calculations from one time step to another. When you say how big is this change? Measured at the end of the time step after the restart. AK: model state at beginning of restart must be the point where they are different. When do we measure difference? AK: when restart and initialise fields. At that stage should match the state when model restarts were written. MW: hard to define model state. global vars, scratch fields etc. Need to define state, then compare checksums at end of run, and beginning of next time step. After 30 time steps, get checksums, then proceed. Then compare to timesteps with a restart run.
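The checksum comparison MW describes can be sketched with a toy model. This is a minimal illustration, not code from any of the models discussed: the two-time-level scheme, the field names (`x`, `f_old`) and all function names are hypothetical. It assumes the state is fully defined by the fields listed, and shows how a restart that omits one field (and recomputes it at initialisation) changes the checksum relative to a continuous run.

```python
import hashlib
import struct

DT = 0.1  # toy time step


def tendency(x):
    """Toy physics: linear damping."""
    return [-xi for xi in x]


def init(x0):
    """Cold start: no previous tendency exists, so use the current one."""
    return {"x": list(x0), "f_old": tendency(x0)}


def step(state):
    """Second-order Adams-Bashforth step; needs the previous tendency."""
    f = tendency(state["x"])
    x_new = [xi + DT * (1.5 * fi - 0.5 * fo)
             for xi, fi, fo in zip(state["x"], f, state["f_old"])]
    return {"x": x_new, "f_old": f}


def run(state, nsteps):
    for _ in range(nsteps):
        state = step(state)
    return state


def pack(values):
    """Bit-exact little-endian encoding of a list of doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def checksum(state):
    """Bitwise checksum over every field, visited in sorted-name order."""
    digest = hashlib.sha256()
    for name in sorted(state):
        digest.update(name.encode())
        digest.update(pack(state[name]))
    return digest.hexdigest()


def write_restart(state, fields):
    """Serialise only the named fields."""
    return {name: pack(state[name]) for name in fields}


def read_restart(restart):
    """Rebuild the state; a missing field is recomputed as at a cold start."""
    state = {name: list(struct.unpack(f"<{len(raw) // 8}d", raw))
             for name, raw in restart.items()}
    if "f_old" not in state:
        state["f_old"] = tendency(state["x"])
    return state


def compare(x0=(1.0, 2.0, 3.0), total=30, split=15):
    """Checksums of a continuous run, a complete restart, an incomplete one."""
    continuous = run(init(x0), total)
    halfway = run(init(x0), split)
    complete = run(read_restart(write_restart(halfway, ["x", "f_old"])),
                   total - split)
    incomplete = run(read_restart(write_restart(halfway, ["x"])),
                     total - split)
    return checksum(continuous), checksum(complete), checksum(incomplete)
```

Calling `compare()` returns three checksums: the first two agree, the third does not. In a real model the hard part is the one MW raises, namely deciding which fields make up the state in the first place.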

PD: each processor checksums array, print that out. AK: specific reproducible order to sum? MW: MOM or UM safe operation, need a gather on a rank. PD: can we work on what a state might be? MW: Can do this in MOM. Need it for all models. Hard for coupled models. MOM has framework for this. Could be as simple as OASIS getting out of sync. Depending on configuration it might not be restarting correctly. AH: Nic has tested OASIS field consistency. MW: volatile time. PD: lags might be set explicitly for first time step? MW: restarts are supposed to handle that. AH: could we use compiler options to perturb FP operations to get scale of differences? MW: fused multiply add might not be reproducible. PD: some clarity about what the model can and cannot do. MW: push this up to science leaders. Bob Hallberg did a cool thing with MOM6: it converts FP to fixed point, does the global sum, and converts back to FP.
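The fixed-point idea MW mentions can be sketched as follows. This is a simplified illustration of the general technique, not the MOM6 implementation: it relies on Python's arbitrary-precision integers, whereas a production version would need fixed-width integers and overflow handling. `SCALE_BITS` and the function names are assumptions. Because integer addition is exact and associative, the result does not depend on the order of the values or on how they are split across ranks.

```python
SCALE_BITS = 40  # assumed number of fractional bits kept (illustrative)


def to_fixed(x: float) -> int:
    """Convert a float to an integer count of 2**-SCALE_BITS units."""
    # Scaling by a power of two is exact; round() then drops anything
    # smaller than the chosen resolution.
    return round(x * 2.0**SCALE_BITS)


def local_sum(values) -> int:
    """Per-rank partial sum, held as an exact integer."""
    return sum(to_fixed(v) for v in values)


def global_sum(chunks) -> float:
    """Combine per-rank integer sums, then convert back to floating point."""
    total = sum(local_sum(chunk) for chunk in chunks)
    return total / 2**SCALE_BITS
```

For example, `global_sum([[1e16, 1.0], [-1e16]])` and `global_sum([[1e16], [-1e16, 1.0]])` give the same answer, whereas a plain floating point sum of the same three numbers depends on the order of addition. The cost is a fixed resolution: contributions smaller than `2**-SCALE_BITS` are rounded away.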

MW: lack of testing and reproducibility means we can’t confidently change code quickly and easily. AK: engineering problem. useful for finding subtle bugs the way code is written. Hard to know how big this effect might be. MW: lab can do stuff. PD: is this a showstopper? MW: need a conversation at CSIRO wrt CMIP6. They have rules. PD: this isn’t a showstopper for science publishing? AK: depends on size of perturbation. AK: for testing need to walk all used code branches.

FMS (MPI) updates

MW: Been rewriting global field function. Done for a while, concerned about performance. Fair bit slower than original. Fixed stability. Probed it, but it was MPI alltoallw and it was slow. Tested against other MPI libraries. In Intel MPI alltoallw is a lot faster than p2p. OpenMPI is across the board slower than IntelMPI. Whatever I did was not a question of performance. Has anyone been testing IntelMPI? Maybe we have been making our lives hard by using OpenMPI? What do people think? RF: Makes no difference to us. Up to MW. MW: will keep testing. MW: Intel is not necessarily faster, but it might be smarter about choosing algorithms. Around the 1000 ranks it makes a bad choice. AH: How size sensitive? MW: has not tested alltoallw. Others are faster on IntelMPI. 2x as fast. Small tests. AH: full MOM test with MOM? MW: Years ago, volatile timing. This was IntelMPI 4, when it was sort of bad. Seems to have improved. Intel MPI is MPICH.

Actions

New:

Incorporate RF wave mixing update into MOM5 codebase (AH)
Code harmonisation updates to ACCESS and ESM meetings (PD, RF)

Existing:

Edit tenth bathymetry to remove Cumberland Sound (RF)
Update model name list and other configurations on OceansAus repo (AK)
Check Red Sea fix timing is absolute, not relative (AH)
Shared google doc on reproducibility strategy (AH)
Follow up with Andy Hogg regarding shared codebase (MW)
MW to liaise with AK about tenth model hangs (AK, MW)
Pull request for WOMBAT changes into MOM5 repo (MC, MW)
Compare our OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to GitHub (RF)
Profile ACCESS-OM2-01 (MW)
Move FMS to submodule of MOM5 github repo (MW)
Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
Blog post around issues with high core count jobs and mxm mtl (NH)
Look into OPeNDAP/THREDDS for use with MOM on raijin (AH, NH)
Add RF ocean bathymetry code to OceansAus repo (RF)
Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
Nudging code test case (RF)
Redo SSS restoring with patch smoothing (AH)
Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
CICE and MATM need to output namelists for metadata crawling (AK)