Technical Working Group Meeting, September 2020

Minutes

Date: 16th September, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

ACCESS-CM2 + 0.25deg Ocean

PD: Dave Bi is thinking about a 0.25deg ocean. Still fairly unfamiliar with MOM5. Trying to keep it as harmonised as possible, learning from the 1 degree harmonisation. PD: Doing a performance plan for the current financial year, hence talking about this now. Asked supervisors what they want, and this popped out of that conversation with Dave Bi. AH: Andy Hogg has been pushing CLEX to do the same work. PD: Maybe some of this is already happening, need some extra help from CSIRO? AH: We think it is much more that CSIRO has a lot of experience with model tuning and validation. Not something researchers want to do, but they want to use a validated model and produce research. So a win-win. PD: Validation scripts are something we run all the time, so yes, it would be a good collaboration. Won’t be any CMIP6 submission with this model. AH: Andy Hogg keen to have a meeting. PD: Agree, that sounds like a good idea.
PL: How do we get the baseline parameterisation, ocean mask and all that? Grab from OM2? PD: Yes, grab from OM2. AH: Yes, but there is still some tuning in the coupled versus forced model.

ACCESS-OM2 and MOM-SIS benchmarking on Broadwell on gadi

PL: Started this week. Not much to report. A run restarting from existing data fell over, so just recreated a new baseline. Done a couple of MOM-SIS runs. Waiting on more results. Anecdotally expecting around 20% speed degradation.

Update on PIO in CICE and MOM

AK: Test run with NH’s exe. Tried to reproduce a 3 month production run at 0.1deg. Issue with CPU masking in grid fields. Have an updated exe set running. Some performance figures are under the CICE issue on GitHub. Speeds things up quite considerably. AH: Not waiting on ice? AK: 75% less waiting on ice.
AH: NH is getting queue limits changed to run up to 32K cores.
PD: A flagship project such as this should be encouraged. Have heard 70% NCI utilisation? May be able to get more time. RY: No idea about utilisation. The walltime limit can be adjusted; not sure about the CPU limit. AH: I believe it can. They just wanted some details. Have brought this issue up at the Scheme Managers meeting. Would like to get the CPU limits increased across the board. There is a positive reaction to increasing limits, but no motivation to do so. Need to kick it up to policy/board level to get those changes. Will try and do that.
NH: Some hesitation. These runs consume 70-80 kSU/hr, so need to be careful. PL: What is the research motivation? NH: Building on previous work of PL and MW. With PIO in CICE we can make practical configs with daily ice output on many more cores. Turning Paul’s scaling work into production configs. Possible due to PL and MW’s work, moving to gadi, and having PIO in CICE.
NH: Got 3 new configs. Existing small (5K), plus medium (8K), large (16K) and x-large (32K). MOM core counts double between configs; CICE doesn’t have to double. Running short runs of the configs to test. PL: 16K is where I stopped. NH: Andy Hogg said it would be good to have a document showing scalability for NCMAS. PL: All up on GitHub. NH: Will take another look. NH: Getting easier and easier to make new configs. CICE load balancing used to be tricky; the current setup seems to work well as CPUs increase.
PD: What is the situation with reproducibility? The 1 degree MOM run uses a 12×8 layout. Would it be the same with 8×12? Or more processors? NH: It is possible to make MOM bit reproducible with a different layout and processor count, but it is not on by default. Can’t do it with CICE, so no big advantage. PL: What if CICE doesn’t change? NH: Should be ok if CICE is kept the same; can change the MOM layout with the correct options and get repro. RF: Generally have to knock down optimisation and the floating point config. Once you’re in operational mode do you care? Good to show as a test, but operationally you want to get through as fast as possible, as long as results are statistically the same. PL: Climatologically the same? RF: Yeah. PL: With all the other components, it becomes another ensemble member. RF: Exactly. NH: Repro is a good test. Sometimes bugs show up when runs don’t reproduce; that is why those repro options exist in MOM. If you make a change that shouldn’t change answers, a repro test can check that. Without repro you don’t have that option.
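As an illustration of the repro test NH describes, a minimal sketch (file paths and variable names are placeholders, not the actual test harness) could compare two runs bit-for-bit:

```python
# Minimal sketch of a bit-reproducibility check between two runs that
# should be answer-identical (e.g. same physics, different MOM layout).
# File paths and variable names are placeholders.
import numpy as np
import netCDF4

def runs_identical(file_a, file_b, variables):
    """Return True if every listed variable is bit-identical in both files."""
    with netCDF4.Dataset(file_a) as a, netCDF4.Dataset(file_b) as b:
        a.set_auto_mask(False)   # compare raw stored values, not masked arrays
        b.set_auto_mask(False)
        for name in variables:
            va = a.variables[name][:]
            vb = b.variables[name][:]
            # array_equal compares every element exactly (no tolerance),
            # which is what "bit reproducible" requires
            if va.shape != vb.shape or not np.array_equal(va, vb):
                print(f"mismatch in {name}")
                return False
    return True

if __name__ == "__main__":
    ok = runs_identical("run_12x8/ocean.nc", "run_8x12/ocean.nc",
                        ["temp", "salt", "eta_t"])
    print("bit reproducible" if ok else "NOT reproducible")
```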
RF: Working with Pavel Sakov, struggling with some of the configs, updating YATM to the latest version. Moving on to the 0.1 degree model. Hoping to run 96 member ensembles, 3 days or so each, with daily output. A lot of start/stop overhead, so PIO will help a lot. Maybe look at start-up and tidy-up time; very different to 6 month runs. AH: Use ncar_rescale? RF: Just the standard config. Not sure if it reads it in or does it dynamically. AH: Worth precalculating if not doing so. Any sensitivity to io_layout at start-up? RF: At the data assimilation step, restarts may come back joined rather than separate. Thousands of processors trying to read the same file. AH: mppncscatter? RF: Thinking about that. Haven’t looked at timings yet, but there will be some issues. AH: How long does the DA step take? RF: Not sure. Right at the very start of this process. Pavel has had success with 1 degree. Impressed with the quality of results from the model, especially the ice.
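On the joined-restart issue RF raises, the sketch below only illustrates the general idea of pre-splitting a field into per-tile files so each PE group reads its own small file; it is not mppncscatter and ignores MOM’s real decomposition, halos and restart file naming:

```python
# Illustration only: split one 2D variable from a combined file into
# ntx x nty tile files. NOT mppncscatter; real MOM restarts have their
# own decomposition rules, extra dimensions and naming conventions.
import netCDF4

def split_field(infile, varname, ntx, nty):
    with netCDF4.Dataset(infile) as src:
        data = src.variables[varname][:]          # assumed shape (ny, nx)
        ny, nx = data.shape
        for ty in range(nty):
            for tx in range(ntx):
                j0, j1 = ty * ny // nty, (ty + 1) * ny // nty
                i0, i1 = tx * nx // ntx, (tx + 1) * nx // ntx
                tile = data[j0:j1, i0:i1]
                outname = f"{infile}.{ty * ntx + tx:04d}"   # e.g. file.nc.0000
                with netCDF4.Dataset(outname, "w") as dst:
                    dst.createDimension("y", tile.shape[0])
                    dst.createDimension("x", tile.shape[1])
                    var = dst.createVariable(varname, tile.dtype, ("y", "x"))
                    var[:] = tile

# Placeholder file/variable names:
split_field("joined_restart.nc", "some_2d_field", ntx=4, nty=4)
```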
AH: Maybe Pavel could present to a COSIMA meeting? RF: He presented to BlueLink last week. AH: Always good to get a different view of the model.

Testing

AH: Trying to get the testing framework NH set up on Jenkins running again. Wanted to use this to check the FMS update hasn’t broken anything. Can then update the FMS fork with Marshall and Rui’s PIO changes.
NH: A couple of months ago got most of the ACCESS-OM2 tests running. The MOM5 runs were not well maintained. The MOM6 ones were better maintained and ran consistently until gadi. Can get those working again if it is a priority; they were being looked at by GFDL. AH: Might be a priority as we want to transition to MOM6.
NH: Don’t have scaling results yet. Will probably be pretty similar to Paul’s numbers. Will show you next time. PL: Will update the GitHub scaling info. NH: Planning to do some simple plots and tables using AK’s scripts that pull information out of runs.
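As a rough illustration of the kind of scaling summary NH mentions, the sketch below prints a table and plot from run timings (the numbers here are placeholders, not results from these runs):

```python
# Rough illustration of a scaling summary. Timings are placeholders,
# not actual ACCESS-OM2 results.
import matplotlib.pyplot as plt

cores    = [5000, 8000, 16000, 32000]    # config sizes discussed
walltime = [100.0, 65.0, 36.0, 22.0]     # placeholder hours per model year

speedup = [walltime[0] / t for t in walltime]
efficiency = [s * cores[0] / c for s, c in zip(speedup, cores)]

print(f"{'cores':>7} {'walltime':>9} {'speedup':>8} {'efficiency':>11}")
for c, t, s, e in zip(cores, walltime, speedup, efficiency):
    print(f"{c:>7} {t:>9.1f} {s:>8.2f} {e:>11.2f}")

plt.loglog(cores, speedup, "o-", label="measured (placeholder)")
plt.loglog(cores, [c / cores[0] for c in cores], "--", label="ideal")
plt.xlabel("cores")
plt.ylabel("speedup relative to smallest config")
plt.legend()
plt.savefig("scaling.png")
```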

Bathymetry

AK: Got the list of edits from Simon Marsland for the original topography. Wanted to get feedback about what should be carried across. Pushed a lot into the COSIMA topography_tools repo, which is used as a submodule in the other repos that create the 1 degree and 0.25 degree topographies. The topography is documented with the git hash of the repo which created it. Pretty much finished 0.25; just a little hand editing required. Hoping to get test runs with the old and new bathymetry.
AH: KDS50 vertical grid? QA analyses? AK: Partial cells are used optimally from the KDS50 grid. The source data is also improved (GEBCO), so no potholes and artificial shelves. AH: Sounds like a nice, well documented process which can be picked up and improved in the future.
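As an illustration of the kind of pothole check a topography tool might apply (this is not the actual topography_tools code):

```python
# Minimal illustration of filling single-cell "potholes": cells deeper
# than all four neighbours. NOT the actual topography_tools implementation.
import numpy as np

def fill_potholes(depth):
    """depth: 2D array of positive depths (0 = land). Returns a filled copy."""
    d = depth.copy()
    # Stack the four neighbour depths; padding with the cell's own value
    # means boundary cells are never flagged as potholes.
    padded = np.pad(d, 1, mode="edge")
    neighbours = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                           padded[1:-1, :-2], padded[1:-1, 2:]])
    deepest_neighbour = neighbours.max(axis=0)
    pothole = (d > deepest_neighbour) & (d > 0)
    d[pothole] = deepest_neighbour[pothole]   # raise pothole to deepest neighbour
    return d
```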
AK: The way it is done could be used for other input files: keep everything in a git repo and embed metadata in the file linking it to the exact git hash. Good practice. Could also use manifests to hash inputs? NH: Great, we have talked about reproducible inputs for a long time. AH: With manifests, hashed outputs can be tracked back. Ideally would put hashes in every output. There is an existing feature for unique experiment IDs in payu, but it has not gone further; still think it is a good idea.
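A minimal sketch of embedding the generating repo’s git hash as file metadata, as AK describes (attribute and file names below are placeholders):

```python
# Minimal sketch: record the git commit of the generating repo as a
# global attribute of a netCDF output file. Names are placeholders.
import subprocess
import netCDF4

def stamp_git_hash(ncfile, repo_dir="."):
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True).strip()
    with netCDF4.Dataset(ncfile, "a") as ds:   # append mode: edit attributes in place
        ds.setncattr("history_git_hash", commit)
    return commit

stamp_git_hash("topog.nc", repo_dir="topography_tools")
```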
AK: Process can be applied to other inputs. AH: The more you do it, the more sense it makes to create supporting tools to make this all easier.

Jupyterhub on gadi

AH: What is the advantage of using jupyterhub?
JM: Configurable-http-proxy and jupyterhub forward ports through a single ssh tunnel. If the program crashes and is re-run, it might choose a different port but the script doesn’t know; this is a barrier. Also does persistence, basically like tmux for jupyter processes. AH: Can’t do the ramping up and down using a bash script? JM: Could do; that is handled through dask-jobqueue, and a bash script could use that too. JM: The long term goal would be a jupyterhub.nci.org.au. Difficult to deploy services at scale. AH: Pawsey and the NZ HPC mob were doing it.
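A minimal sketch of the adaptive scaling dask-jobqueue provides, as JM describes (queue, resources and walltime below are placeholders, not a tested gadi configuration):

```python
# Minimal sketch of adaptive worker scaling with dask-jobqueue on a PBS
# system; queue, resources and walltime are placeholders.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    queue="normal",          # placeholder PBS queue
    cores=48,                # cores per PBS job
    memory="190GB",          # memory per PBS job
    walltime="02:00:00",
)
# Let dask submit and kill PBS jobs as the workload grows and shrinks:
# this is the "ramping up and down" handled for the user.
cluster.adapt(minimum=0, maximum=8)

client = Client(cluster)     # the notebook then uses this client as usual
```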