Technical Working Group Meeting, March 2020

Minutes

Date: 18th March, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Scalability of ACCESS-OM2 on gadi

(Paul’s report is attached at the end)

PL: Looking at scaling. Started with ACCESS-OM2, but moved to testing MOM5 directly using MOM5-SIS. Using POM25, a global 0.25° configuration with NYF forcing; this is the model MW developed for scaling tests prior to ACCESS-OM2. Had to specify min_thickness in ocean_topog_nml.

PL: Tested scaling at 960/1920/3840/7680/15360 cores, with no masking. Scales well up to some point between 7680 and 15360 cores.

PL: Tested the effect of vectorisation options (AVX2/AVX512/AVX512-REPRO). Found no difference in runtime at 15360 cores. MW: Probably communication bound at that CPU count. Repro did not change the time.

MW: Never seen significant speed up from vectorisation. Typically only a few percent improvement. The code is RAM bound, so cannot provide enough data to make use of vectorisation. Still worth working toward a point where we can take advantage of vectorisation.

PL: Had one “slow” run outlier out of 20 runs, which ran 20% slower. It ran on different nodes to the other jobs, not sure if that is significant. MW: IO can cause that. AH: Andy Hogg also had some slow jobs due to a bad node. AK: One job was 20x slower. Also the RYF runs became consistently slower a few weeks ago. MW: OpenMPI can prepend timestamps to output, which can help identify issues.

PL: Getting some segfaults in ompi_request_wait_completion, caused by pmpi_wait and pmpi_bcast, both called from the coupler. NH: Could be a bad bit of memory in the buffer, and if it tries to copy it can segfault. PL: Thinking to run again using valgrind, but that would require compiling our own version of the valgrind wrapper for OpenMPI 4.0.2. Would be easier with Intel MPI, but no-one else has used this. Searching turned up some similar cases associated with UCX, but sufficiently different to not be sure. These issues are with the highest core count. MW: Often see a lot of problems at high core counts. NH: Finding bugs can be a never ending task. Use time wisely to fix bugs that affect people. MW: Quarter degree at 15K cores would have very small tile sizes. Could be the source of the issue. AH: This is not a configuration that we would use, so it is not worth spending time chasing bugs.

PL: Next testing target is 0.1 degree, but not sure which configuration and forcing data to use. Will not use MOM5-SIS, but will use ACCESS-OM2 for direct comparison purposes. AK: Configurations used in the model description paper have not been ported to gadi. Moving on to a new iteration. Andy Hogg is running a configuration that is quite similar, but moving to new configurations with updated software and forcing. Those are not quite ready.

PL: Need a starting configuration for testing. Want to confine it to scalability testing and compiler flags. NH: ACCESS-OM2 is set up to be well balanced for particular configurations. Can’t just double CPUs on all models, as load imbalance between submodels will dominate any other performance changes. That makes it a problematic configuration for clean comparisons of things like compiler flags. MW: A useful approach was to check the scalability of sub-model components independently. Required careful definition of timers to strategically ignore coupling time. MOM was easy, CICE was more difficult, but work with Nic’s timers helped a lot. Try to time the bits of code that are doing computation and separate them from code that waits on other parts. A coupled model is a real challenge to test. Figure out what timers we used and trust those. Can reverse engineer from my old scripts.

PL: Should do MOM-SIS scalability work? MW: Easier task, and some lessons can be learned, but runtime will not match between MOM-SIS and ACCESS-OM2. Would be more of a practice run. PL: Maybe getting out of scope. Would need 0.1 MOM-SIS config. RY: Yes we have that one. If PL wanted to run ACCESS-OM2-01 is there a configuration available? AK: Andy Hogg’s currently running configuration would work. PL: Next quarter need to free up time to do other things.

MW: Might be valuable to get some score-p or similar numbers on the current production model. Useful to have a record of those timings to share. A scaling test might be too much, but a profile/timing test is more tractable. RY: Any issues with score-p? Overhead? MW: Typical, 10-20%, so it skews numbers but gives an in-depth view. Can do it one sub-model at a time. Had to hack a lot of scripts, and get NH to rewrite some code to get it to work. score-p is always done at compile time. Doesn’t affect payu. Try building MOM-SIS with score-p, then try MOM within ACCESS-OM2. Then move on to CICE and maybe libaccessom2. PL: The build script does include some score-p hooks. MW: Even without score-p MOM has very good internal timers, but they do not give per-rank times. score-p is great for measuring load imbalance. AH: payu has a repeat option, which repeats the same forcing period and so removes variability due to forcing. Need to think about what period you want to repeat as far as season goes. AK: CICE has idealised initial ice, which evolves rapidly. MW: My earlier profile runs had no ice, which affects performance; not sure the effect is huge, maybe 10-20%.

MW: Overall surprised at lack of any speed up with vectorisation, and lack of slow-down with repro. PL: Will verify those numbers with 960 core config.

AH: Surprised how well it scaled. Did it scale that well on raijin? MW: The performance scaling elbow did show up at a lower core count on raijin. AH: 3x more processors per node has an effect? MW: Yes, a big part of it. AH: 0.1° scaled well on raijin, so it should scale better on gadi. 1/30th should scale well. The only bottleneck will be whether the library can handle that many ranks.

NH: If the repro flags don’t change performance that is interesting. We seem to regularly revisit the question of what trade-off the repro flags have; it would be good to avoid that. MW: Probably best to have an automated pipeline calculating these numbers. NH: People have an issue with the fp0 flag. MW: Shouldn’t affect performance. NH: Make sure fp0 is in there. MW: Agree 100%.

ACCESS-OM2 update

AH: Do we have a gadi compatible master branch? AK: No, not currently. NH: At a previous TWG meeting I self-assigned getting master gadi compatible. Merged all the gadi-transition branches and tested; it seemed to be working ok. At a subsequent meeting AK said there were other changes required, so I stopped at that point. The gadi-transition branches still exist, but much has already been merged and tested on a couple of configurations. Have since moved on to working on other things.

NH: Close, once AK has put all the things he wants into the gadi-transition branch. The previous merge didn’t include everything AK wanted in there. Happy to spend more time on that after finishing the JRA55 v1.4 work.

JRA55-do v1.4 update

NH: Made code changes in all the models, but have not checked existing experiments are unchanged with modified code.

NH: v1.4 has a new coupling field, ice calving. Passing this through to CICE as a separate field. In CICE split into two fields, liquid water flux and a heat flux. MOM in ACCESS-CM2 already handles both these fields. Just had to change preprocessor flags to make it work for ACCESS-OM2 as well.

NH: Lots of options. It is possible to join the liquid and solid runoff at the atmosphere, which becomes the same as we have now. Or join them in CICE and have a water flux but not a heat flux.

Strange MOM6 error

AH: A quick update on Navid’s error. Made a little mpi4py script to run before payu to check the status of the nodes, and all but the root node had a stale version of the work directory, as if it hadn’t been archived. The link to the executable was gone, but everything else was there. Reported to NCI; Ben Menadue does not know why this is happening. Also tried a delay option between runs and this helped somewhat, but also had some strange comms errors trying to connect to the exec nodes. Will next try turning off all the input/output I can find, in case it is a file lock error. Have been told Lustre cannot be in this state.
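The check script itself is not reproduced in the minutes; the following is a minimal sketch of the idea only, assuming payu's usual layout with a work directory containing a symlink to the executable (the paths and executable name here are hypothetical). Each rank reports its view of the filesystem, and the root rank prints any ranks that see a stale work directory.

    # Minimal sketch (hypothetical paths): each MPI rank checks its view of the
    # payu work directory and the executable symlink; root rank reports stale nodes.
    import os
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    work = "work"                              # hypothetical payu work directory
    exe = os.path.join(work, "MOM6")           # hypothetical executable symlink

    status = (comm.Get_rank(), MPI.Get_processor_name(),
              os.path.isdir(work), os.path.exists(exe))
    gathered = comm.gather(status, root=0)

    if comm.Get_rank() == 0:
        for rank, node, has_work, has_exe in gathered:
            if not (has_work and has_exe):
                print(f"rank {rank} on {node}: work={has_work} exe={has_exe}")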

MW: The old drivers do a lot of moving directories from work to archive, and then relabelling. Is it still moving directories around to archive them? Maybe replace that with a hard copy of the directory to archive. The MOM6 driver is the MOM5 driver, so maybe all the old drivers are doing this. Definitely worth understanding, but a quick fix is to copy rather than move.

NH: The filesystem and symbolic links might be an issue. MW: Maybe symbolic links are an issue on these mounted filesystems. AH: There was a suggestion it might be because it was running on home, which is NFS mounted, but that wasn’t the problem. MW: Often on raijin you just got the same nodes back when you resubmitted, so maybe some sort of smart caching.

 

Scalability of ACCESS-OM2 on Gadi – Paul Leopardi 18 March 2020

Technical Working Group Meeting, February 2020

Minutes

Date: 27th February, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

New installed payu version

Version 1.0.7 is now installed in conda/analysis3-20.01 (and analysis3-unstable).

AH: payu is now 100% gadi compatible. Default cpus/node is now 48 and memory 192GB/node. The Python interpreter and short path are determined automatically, and the model config and manifests are scanned to automatically determine the storage flags. Using qsub_flags to manually specify storage flags no longer works, as the automatically determined storage flag option is appended and overrides the manually specified one.
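For illustration, the storage flag determination works along these lines (a sketch of the idea only, not payu's actual code): project codes are extracted from any /g/data and /scratch paths found in the config and manifests and joined into a PBS storage directive.

    # Sketch only (not payu's implementation): derive a PBS storage flag from
    # the /g/data and /scratch projects referenced by config and manifest paths.
    import re

    def storage_flag(paths):
        tokens = set()
        for p in paths:
            m = re.match(r"/g/data/([^/]+)/", p)
            if m:
                tokens.add(f"gdata/{m.group(1)}")
            m = re.match(r"/scratch/([^/]+)/", p)
            if m:
                tokens.add(f"scratch/{m.group(1)}")
        return "+".join(sorted(tokens))

    # Example paths are illustrative only.
    print(storage_flag(["/g/data/hh5/forcing/jra55.nc", "/scratch/v45/expt/work/input.nc"]))
    # gdata/hh5+scratch/v45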

RF: Paul Sandery is having issues getting the 0.1 deg model working. [AH: turns out it was a typo in config.yaml]

AH: No need for the number of cpus in a payu job to be divisible by the number of CPUs in a node. Request however many the job uses, and payu will pad the request to make sure the PBS submission requests an integer number of nodes if ncpus is greater than the number in a single node. PL: Rounds up for each model? AH: No, just the total. MW: Models will be spread across nodes, so a node can have different models on it.
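As a worked example of the padding described above (assuming 48 cores per gadi node, as stated earlier):

    # Illustration of the node padding: requests that fit in one node are left
    # alone, larger requests are rounded up to a whole number of 48-core nodes.
    import math

    def padded_ncpus(ncpus, cores_per_node=48):
        if ncpus <= cores_per_node:
            return ncpus
        return math.ceil(ncpus / cores_per_node) * cores_per_node

    print(padded_ncpus(30))    # 30  (fits in a single node, no padding)
    print(padded_ncpus(799))   # 816 (17 full nodes)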

AH: Andy Hogg ran 80-odd submits with the tenth degree model. Occasional hang; resubmitting is ok. Might be more stable than raijin.

AH: Navid has MOM6 model that cannot run more than a couple of submits without it crashing with an error that it cannot find the executable. Weird error, let me know if you see anything similar.

NH: Caution with disks and where to put things. Reading input files can be very slow sometimes, or not, and then files are not there and turn up later. If the executable is missing, is it running off a disk that is not good? MW: Filesystems are very complicated on gadi. NH: Less certainty of performance with such a different system, with the data filesystems being mounted separately. I’d look at this.
PD: A good place to look is whether the disk has got caught up doing too many tasks. gdata just hangs, and saving a text file takes a while. Due to being on the login node? Get similar delays with an interactive job on an execute node.
AH: People are reporting issues with login delays. Probably a disk issue? Navid’s job is not being run from gdata, but from scratch. Inclined to blame the new system of mounting. Could we use jobfs? MW: Like in the old days when we ran on the node? Good luck! AH: Could just do some tests. NH: Concerning if scratch is slow.
AH: Not sure if the filesystems are mounted with NFS. MW: That is what we do on gaia, and we have tons of problems with mount on demand. Biggest frustration with using the GFDL machine. It’s a nightmare. At least NCI have Lustre know-how. AH: Used to have a lot of problems with NFS cache errors in the past, files disappearing and reappearing. Does sound similar to Navid’s problem.
MW: Raijin’s filesystem was quite good. Why the change? AH: Security. Commercial in confidence stuff. I think it is overblown. Can’t see anyone else’s jobs in the queue. Can’t even check if other people are running on the project. They are moving to 2-factor auth also.

What is required to get gadi transition into master for ACCESS-OM2

AH: Andrew Kiss is on personal leave but sent around an email:
re. gadi-transition, we could proceed like so:
– we’ve also been transitioning libaccessom2 to use submodules for its dependencies instead of cmake https://github.com/COSIMA/libaccessom2/issues/29 which would require this commit https://github.com/COSIMA/libaccessom2/tree/53a86efcd01672c655c93f2d68e9f187668159de (not currently in gadi-transition branch)
– get the libaccessom2 tests working https://github.com/COSIMA/libaccessom2/issues/36
– there’s a gadi-transition branch libaccessom2, cice and mom that could be merged into master. They use openMPI4.0.2
– there’s also a gadi-transition branch for all the primary (ie JRA, non-minimal) configurations but the exe paths would need to be updated before merging to master
– the access-om2 gadi-transition branch would then need to be updated to use the correct submodules for model components and configurations. We also want to remove the core and minimal config submodules https://github.com/COSIMA/access-om2/issues/183
also fyi the current gadi build instructions are here
AH: Feels urgent that people can use on gadi. Any comments on Andrew’s email?
PL: Is the transition to submodules finished? AH: That is on a separate branch. NH: I did that work. Put it in a dev branch. Not intending it to be part of the gadi transition, to keep the number of additions to a minimum. AH: Agree, if that is the easiest. Master is broken for gadi, so anything that works is an improvement. If there is no feedback we can do this offline. Could make a project to be explicit about what is required. NH: Given that gadi-transition does work, and Andrew and Andy use it, it wouldn’t hurt to put it in now. The work that PL has done to make sure it reproduces ticks that box. So it is ready to go, and we are able to reproduce if we need to. I’ll merge it and do some interactive testing. Then people can use it and I can do automatic testing.
PL: What branch will it be merged into? A lot of branches in a lot of repos.
NH: Isolate gadi-transition branches and merge into master straight away. Not bother with other development branches at this stage. Want to get something in master that people can use. In future bring everything into dev as discussed, with master staying stable, just bug fixes, until decide to update from dev. I’ll go through the branches and just bring in the gadi transition stuff. PL: So dev will have submodule changes and master will not? NH: For the time being. With previous discussion we’ll be slower moving on master, to make sure it is working. Having dev will allow us to move that more rapidly. People can run off dev at their own risk. AH: Submodules will remain a named feature branch and pulled into dev at some future time. Should discourage having personal development branches on the main repo. If you want to experiment do it on your own fork. Branches on the main repo should be master, dev or named feature to keep it clean and everyone can understand what they mean.

Stack array errors and heap-array option

AH: Apologies, the minutes from the last TWG meeting are not on the COSIMA website. There is an IT issue with the server. We wanted to follow up on the stack array errors.
AH: Did we ever test on raijin with the same compiler? Is there any way we can do a comparative test? Use a raijin image? Any more from Dale about this stack stuff? PL: Haven’t heard anything. AH: At the last meeting there was some mention of a limit on UM stacksize. RY: Already fixed Ilia’s issue; fixed by making stacksize unlimited. RF: Always run with unlimited stack size. When I had the problem it was only fixed by setting heap-arrays small or zero. When I went into the code and changed the array allocation from automatic to allocatable the error went away.
MW: If I have an automatic array I get three different heap allocations for three different compilers. RF: This option forces all arrays on to the heap.
AH: This was fixed a while ago Rui? RY: Not clear this is the same problem. Ilia’s issue was at the end of 2019 when gadi first came online. Not sure it is the same issue.

BGC Update

AH: Russ forwarded an update to Andy Hogg.
RF: Work was completed on raijin in 2019. BGC code is in to MOM and CICE. Required changes in CICE: moving arrays around to different modules due to scope issues, to allow optional fields to be sent. The main one is to send 10m winds to the ocean, not just the wind stress. Holding off on issuing a PR until the gadi transition is done so it can go in cleanly.
NH: Will be useful for JRA1.4 work.
RF: Hakase will be using it for BGC. Passing algae between ice and ocean components. To add new field, need to add field to code, but don’t have to be passed. Just picked up from namcouple using the flags in OASIS to see if it’s registered.
AH: Can this be the next cab off the rank after gadi-transition, before AKs science tweaks. Not relying on any changes in Andrews branches? RF: Would like to get gadi transition out of the way and then test these changes. Not tested on gadi yet.
From Russ’s forwarded update: How to proceed? Testing?
I’ve held off issuing a pull request until the dust settles wrt the gadi transition. There’s a bit of code rearrangement in order to allow optional fields (10m wind speed, but this can be extended) to be passed from CICE.
The flags ACCESS-OM-BGC (tested) and ACCESS-ESM (untested) enable compilation of the BGC code. The 10m winds need to be added to the namcouple files and the MOM coupling fields namelist.

JRA55-do counter-rotating cyclones

RF: Fortunately Paul Sandery’s run starts in 1988, and the last reverse cyclone is in 1987. CAFE60 uses a whole-month window, so it is washed out in the average.
One of the RYF runs (83-84) has a reverse cyclone. Tell Kial.

Scaling

PL: Thanks to Marshall for getting me up to speed on the scaling tests and sharing scripts. Can reproduce the diagrams so can compare between raijin and gadi.
AH: Any more performance numbers? PL: Now in a position to answer questions, just need to know what questions to ask.
AH: ACCESS-OM2-01 currently runs on around 5K cores; would love to be able to scale to 10K, and 20K would be even better. MW: MOM scaled to 50K. AH: CICE doesn’t scale as well. MW: Any work on CICE distributions? RF: Nope. Would need to be done again at higher core counts. MW: The current one is working really well. AH: On NH’s to-do list was to experiment with layouts and load balancing. MW: Alistair is very interested in load balancing sea ice models, particularly icebergs. He has some quasi-Lagrangian code in SIS2 to load balance icebergs. Maybe some ideas will translate, or vice versa.
PL: For the moment will just look at MOM and see how it scales at 0.1? AH: Maybe just try doubling everything and see if it scales ok? MW: I used to make processor heat maps to get the load imbalance of CICE. Would be good to keep an eye on that while working on scaling. Tony Craig (CICE developer) is very interested.
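The heat maps MW mentions are straightforward to produce once per-rank timings are available; a minimal sketch (with stand-in random data in place of real CICE timings, and a hypothetical 16×8 processor layout) might look like:

    # Sketch with stand-in data: visualise per-rank timings over the processor
    # layout to expose load imbalance.
    import numpy as np
    import matplotlib.pyplot as plt

    nx, ny = 16, 8                                                    # hypothetical layout
    rank_times = np.random.default_rng(0).uniform(50, 150, nx * ny)   # stand-in seconds

    plt.imshow(rank_times.reshape(ny, nx), cmap="viridis")
    plt.colorbar(label="time in ice step (s)")
    plt.title("Per-rank timing heat map (illustrative)")
    plt.savefig("cice_rank_heatmap.png")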

Atmosphere/coupled models

PD: Still using the code frozen for the CMIP runs. Extending the number of runs in the ensemble.
AH: People in CLEX are keen to run CM2. PD: Not aware of that, maybe through someone else, maybe Simon or Martin? CM2 and ESM-1.5 runs have been published under the s38 project.
AH: Scott Wales is doing an ultra high resolution atmosphere run over Australia, under the STRESS2020 project. PD: Atmosphere only, do you know what resolution? I’ve also done some high res atmosphere-only runs, on a project to improve the turbulent kinetic energy spectrum in the UM. Working on code to put stochastic backscatter into the low res N96 (CMIP6) atmosphere. Got some good results injecting turbulent kinetic energy into small scales to compensate for the artificial dissipation associated with the semi-Lagrangian timestep in the UM. The test is to see how the improved N96 results compare to N512 runs, using STRESS2020 resources. Working with Jorgen Frederiksen. Should talk to Scott.
AH: At the moment Scott is targeting 400m over Australia. PL: Convection resolving? AH: Planning a 2 day run to simulate Cyclone Debbie. A nested 400m run for Australia, inside BARRA at 2.2km. 10500×13000 grid points. PD: We’re going global. MW: How many levels? Same as global? PD: 85. AH: The major problem is running out of memory. MW: More cores should mean less memory. Maybe their Helmholtz solver imposes some memory limit on the ranks. AH: Currently waiting for the large memory nodes to come online.

New FMS

MW: A new FMS version is coming. Targeting autotools and getting rid of mkmf. If you’re on MOM5 you can use your frozen version. Completely rewritten IO in FMS: now a thin wrapper to netCDF. No more magic functions like save_restart, write_restart; they have been replaced by lower level ops to allow model developers to have more control. Not sure of the significance for MOM5. AH: API compatible? MW: They will keep it compatible with the old API as long as they can. Could dump it in and slowly integrate. Only raising it in case you want to do more innovative stuff with IO. PL: Affects MOM6 mainly? MW: MOM6 is one of the main targets. PL: Parallel IO support? MW: Part of the reason. They want parallel IO in the atmosphere model, which NCAR now uses, so it is now an important model. This implements the hooks for that work. RY: Is MPI-IO still there or will it be replaced by PIO? MW: It is still there. RY: Simpler to do one? MW: They’ve sent a patch to get MOM6 working with that now. Doesn’t work currently. Not sure about the progress, but I know you were interested in PIO. RF: We’re interested from the ice point of view. A new version of BRAN will need daily inputs in CICE. Performance is terrible as IO is collected on to one processor. MW: FMS will not help CICE, but it is a test case for whether PIO is a valid solution.

COSIMA 2019 Report

This report summarises the fourth meeting of the Consortium for Ocean Sea Ice Modelling in Australia (COSIMA), held in Canberra on 3-4 September 2019. Shweta Sharma has provided a more informal (and entertaining) report here.

Aims & Goals

The annual COSIMA workshop aims to:

  • Maintain and grow the established community around ocean-sea ice modelling in Australia;
  • Discuss recent scientific advances in ocean and sea ice research in a forum that is inclusive and model-agnostic, particularly including observational programs;
  • Agree on immediate next steps in the COSIMA model development plan; and
  • Develop a long-term vision for ocean-sea ice model development to support Australian researchers.

Participants


Attendees included Alberto Alberello (U Adelaide), Christopher Bladwell (UNSW), Fabio Boeira Dias (UTAS/CSIRO), Gary Brassington (BOM), Matt Chamberlain (CSIRO), Navid Constantinou (ANU), Prasanth Divakaran (BOM), Kelsey Druken (NCI), Matthew England (UNSW), Ben Evans (NCI), Hakase Hayashida (IMAS, UTAS), Petra Heil (AAD & AAPP), Andy Hogg (ANU), Ryan Holmes (UNSW), Maurice Huguenin (UNSW), Yi Jin (CSIRO), Andrew Kiss (ANU), Andreas Klocker (UTAS), Qian Li (UNSW), Kewei Lyu (CSIRO), Simon Marsland (CSIRO), Josue Martinez Moreno (ANU), Richard Matear (CSIRO), Ruth Moorman (ANU), Adele Morrison (ANU), Eric Mortenson (CSIRO), Jemima Rama (ANU), Paul Sandery (CSIRO), Abhishek Savita (UTAS/IMAS/CSIRO), Callum Shakespeare (ANU), Shweta Sharma (UNSW), Callum Shaw (ANU), Taimoor Sohail (ANU), Paul Spence (UNSW), Kial Stewart (ANU), Veronica Tamsitt (UNSW), Mirko Velic (BOM), Nick Velzeboer (ANU), Jingbo Wang (NCI), Xuebin Zhang (CSIRO), Xihan Zhang (ANU), Aihong Zhong (BOM), plus those who attended via video conference.

Program

Tuesday 3rd September

Session 1 (Chair – Navid Constantinou)

Andrew Kiss (ANU): ACCESS-OM2 update
Simon Marsland (CSIRO): ACCESS and CMIP6
Hakase Hayashida (IMAS, UTAS): Preliminary results of biogeochemistry simulation with ACCESS-OM2 and plans for OMIP-BGC and IAMIP
Ben Evans (NCI): Addressing the next HPC challenges for Climate and Weather

Session 2 (Chair – Andreas Klocker)

Veronica Tamsitt (UNSW): Lagrangian pathways and residence time of warm Circumpolar Deep Water on the Antarctic continental shelf
Ruth Moorman (ANU): Response of Antarctic ocean circulation to increased glacial meltwater
Kewei Lyu (CSIRO): Southern Ocean heat uptake and redistribution in theoretical framework and model perturbation experiments
Fabio Boeira Dias (UTAS/CSIRO): High-latitude Southern Ocean response to changes in surface momentum, heat and freshwater fluxes under 2xCO2 concentration

Session 3 (Chair – Simon Marsland)

Xuebin Zhang (CSIRO): Dynamical downscaling of climate changes with OFAM3
Matt Chamberlain (CSIRO): Multiscale data assimilation in Bluelink Reanalysis
Paul Sandery (CSIRO): A data assimilation framework for ocean-sea-ice prediction
Prasanth Divakaran (Bureau of Meteorology): OceanMAPS 3.3 Developments

Wednesday 4th September

Session 4 (Chair – Veronica Tamsitt)

Ryan Holmes (UNSW):  Atlantic ocean heat transport enabled by Indo-Pacific heat uptake and mixing
Eric Mortenson (CSIRO): Decoupling of carbon and heat uptake rates of the global ocean over the 21st century
Christopher Bladwell (UNSW): Diahaline transport in global ocean models
Abhishek Savita (UTAS, IMAS, CSIRO): Uncertainty in the estimation of global and regional ocean heat content since 1970
Gary Brassington (BOM): Comparison of ACCESS-OM2-01 to other models and observations

Session 5 (Chair – Qian Li)

Xihan Zhang (ANU): Gulf Stream separation in ACCESS-OM2
Alberto Alberello (U Adelaide): Impacts of winter cyclones on sea ice dynamics
Petra Heil (AAD): Sea ice in the ACCESS-OM2-01: Exploring near-coastal processes

COSIMA Discussion (Chair – Paul Spence)

Open discussion highlighted a number of potential avenues for work in the near-term, as well as some suggestions for directions that could be included in a future COSIMA funding bid.

Near-term Priorities

  • Running the COSIMA cookbook on the VDI is becoming untenable, and recent improvements in the cookbook have not been widely adopted. This should be a priority, possibly with a tutorial session at the CLEx Annual Workshop?
  • Start investigating coupled data assimilation for parameter estimation, especially for sea ice.
  • Start serious perturbation experiments with ACCESS-OM2-01, potentially including:
    • SAMx (RYF forcing with SAM Extreme Years).
    • Adding katabatic winds?
    • Tropical mixing and the AMOC.
    • Influence of the Amundsen Sea Low on the Southern Ocean.
    • Turbulent Kinetic Energy and winds in the Southern Ocean.
  • OMIP2 contribution for CMIP6.
  • Improve communication of COSIMA achievements.
  • Begin work on nesting regional MOM6 models.

Longer term suggestions

  • Better connection with Paleo community.
  • Improve links with the wave community.
  • Start running ensemble simulations?
  • Do we need to move to CICE6?
  • Capacity to run future scenarios based on coupled model output.

It was agreed that COSIMA V will be held in 2020, hosted by Xuebin Zhang in Hobart.

Additional discussion points are given here.

Awards

The COSIMA Most Selfless Contributor Awards for 2017, 2018 and 2019 were presented in absentia to

  • 2017 James Munroe
  • 2018 Marshall Ward
  • 2019 Russ Fiedler (pictured)

in appreciation of their tireless efforts which have greatly improved the software used by the COSIMA community.

Technical Working Group Meeting, August 2019

Minutes

Date: 14th August, 2019
Attendees:

  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) RSES ANU, Andrew Kiss (AK)  COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Marshall Ward (MW) GFDL
  • Nic Hannah (NH), Double Precision
  • James Munroe (JM), COSIMA

PIO work with CICE

NH: The PIO code in CICE is not as complete or thorough as the netCDF code. Nothing to suggest it won’t work. Relies on the NCAR PIO library, and a CESM utility library; these are dependencies which are not part of CICE. Built the PIO dependency on raijin, ran into the CESM dependency. Can either remove the dependency or remove the code.
NH: Initially thought to use the MOM approach: tile and collate. Russ’ comments encouraged me to try PIO. It will be supported in future and will be supported in CICE6. Nothing working yet, but will soon test with 1 degree.
RF: There is a real bottleneck with high frequency output, so it is worth a go. There has been an attempt to put this into FMS by Hartnett. AH: Different to parallel netCDF? NH: PIO is a wrapper around parallel netCDF, written by NCAR to simplify parallel netCDF. Another layer. On GitHub, continuing to be maintained. RY: A wrapper that does the work of matching the computing domain to the IO domain. Not so useful for MOM5 as it has an io_layout already.
MW: Hartnett is motivated by FV3 (the forecast model) rather than the ocean. Not sure which project he is even involved in.
NH: The big test is handling an interesting CICE layout, the difference between the cartesian grid and the PE layout. MW: PIO will support explicit decompositions and other approaches.
NH: The parallel netCDF version on raijin only links with OpenMPI 3.0. RY: The new machine launches soon. OpenMPI 1.* will be dropped; no new software will depend on version 1. MW: OpenMPI 2 is not good. Should use 3.
NH: Probably have to test this with OpenMPI 3.0. RY: 3.1.3. Switch everything to that. A good test for the new machine. AH: Working now? RY: My fault. Used an unmatched OpenMPI library. Everything looks fine. OpenMPI 2/3/4 with Intel 19 all working. 1 deg & 0.25 deg working. Tenth not working. MW: I was able to run tenth with 3.1.2/3.1.3.
MW: One of the intel compilers broke MOM. A compiler bug with types in types.
AH: Should start an issue for testing. RY: Will email MW directly. It is not a MOM bug.
MW: Tried MOM-SIS tenth? Good test. RY: From earlier this year we do have this working. This testing is for the new machine, so ACCESS-OM2.

OMIP date restart protocol

RF: Talked to Griffies. GFDL take an ensemble approach: run for N years using true dates, and at the finish reset back to the start date with the correct calendar, storing the new output in a different directory. End up with 5 sequences of 55 years. All dates are correct. No issues with leap years going wrong. Think this is the best way to go.
AK: Came to the conclusion that this was the right way to go, mostly due to the leap year issue. The problem is, can we get the model to do that? Maurice and Ryan had issues with CICE getting the correct date. CICE has a flag “use_restart_dates”. Suggested setting this to false and setting the dates in access_restart.nml, but CICE is not picking up the dates. Looks like libaccessom2 is not passing them on to CICE. Some confusion about exactly what they have done. There are some instructions on the wiki for restarting IAF from RYF at tenth degree, but they don’t work for other people. NH: I’ll look at it. AK: Will send an issue. NH: Didn’t realise it was happening. CICE date handling is not great.
AH: A downside with the ensemble approach is that it is difficult to get metrics across the whole time series. RF: Need extra metadata added in, maybe which cycle you’re in: an extra variable which gives the actual number of days since the start of the run. Done with post-processing. Might be able to concatenate files using the extra metadata. AH: Always have issues with missing leap years if it spans a century, but only daily output is an issue. AK: The Cookbook could do something. MC: Pretend it is noleap? JM: Is the data being looked at as a time series? AH: Extra metadata, say an offset day, is a good idea. RF: Add a buffer in the netCDF file so we don’t need copies. mppnccombine can add padding; usually done with nccreate, make sure the header has some space. hbuf?
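As a sketch of the extra-metadata idea (the file and variable names are assumptions, not an agreed convention), a post-processing step could stamp each ensemble segment with its offset from the start of the full run:

    # Sketch only: record the segment's offset (days since the start of the
    # full forcing run) as a scalar variable so concatenated output can be
    # placed on one continuous time axis.
    from netCDF4 import Dataset

    with Dataset("ocean_month.nc", "a") as ds:        # hypothetical output file
        if "run_offset_days" not in ds.variables:
            v = ds.createVariable("run_offset_days", "f8")
            v.long_name = "days since the start of the full forcing run"
        # illustrative value: start of the second 55-year cycle, noleap count
        ds.variables["run_offset_days"].assignValue(20075.0)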

Strategy for CICE updates for flexibly adding fields

RF: The way the CICE drivers work, the variables you want are either hard coded, or you muck around with pre-processing to compile them in and out. Wondering if anyone has looked at doing it on the fly, using the error codes coming back when setting up variables, so we have a flexible number of variables passed in and out. Would like this to pass total wind speed, to harmonise code. Also Hakase wants it for some BGC stuff: phytoplankton through to the ice. So specify the variables, and work out if they’re there or not.
NH: Would want the exe to handle configurations with different sets of coupling fields. Sometimes include total wind speed, sometimes not. RF: We would know the complete set; if a field is not there, skip it. Currently they have to be hard wired in, or you make another driver. NH: The way to do it is to start with the superset in namcouple, and the code would exclude certain variables. RF: Maybe if a variable is not in namcouple, return an error code, but ignore the error. NH: Shouldn’t be too hard to do. OASIS does return error codes that could be used: either abort or return an error code. If it aborts we could change that. AH: Restart fields? NH: Should be handled behind the scenes.

Paths for JRA55-do forcing files. Some changes to support v1.4

AH: JRA55-do is now part of Input4MIPs, which is part of CMIP6. Have to use the copy that is in CMIP6. It encodes all the metadata in the filename, and consequently doesn’t currently work with YATM. Circumvented by creating symbolic links that work with YATM. When I did this I couldn’t reproduce. Not sure if this is actually an issue with the fields being different or not.
AH: Tried to use testing framework NH developed for this using jenkins. The historical test that tests against known checksums doesn’t seem to actually compare them. Not sure if that is intentional. Would like to use framework, as NH has done a great job with it.
MW: MOM6 has a diag_mediator, which supports a CMOR name alongside the internal model name. Porting it to MOM5 is a big task, but the idea is good and saved them a lot of work. Could create a thin wrapper to translate to the CMOR name if that helps. AK: How would that integrate with YATM? MW: Don’t know. It is at the FMS level, so it would only help with one model (MOM). AK: YATM accesses the JRA files, so it is a libaccessom2 change. AH: Looked at the YATM code. It generates the filename from the date. Input4MIPs filenames have the current year and the next year, so this would require code changes. Might it just be easier to create a file with a date->filename mapping? AH: Possible to do. Would need to add a token for year+1. Probably best to do it that way.
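A sketch of the mapping-file alternative follows (the filename template below is made up to illustrate the year and year+1 tokens, not the real Input4MIPs naming):

    # Sketch only: write a year -> forcing-filename table that YATM could read,
    # instead of building Input4MIPs-style names in Fortran. TEMPLATE is hypothetical.
    TEMPLATE = ("rsds_input4MIPs_atmosphericState_OMIP_MRI-JRA55-do-1-4_gr_"
                "{y}01010130-{y1}01010000.nc")

    with open("forcing_filename_map.txt", "w") as f:
        for year in range(1958, 2019):
            f.write(f"{year} {TEMPLATE.format(y=year, y1=year + 1)}\n")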
AK: Also need code changes for v1.4. Solid and liquid runoff are separate. What to do with solid runoff? Griffies says either use an iceberg model, or melt the bergs and add them to the runoff. Do we take into account the latent heat of fusion? Assuming solid runoff is at zero degrees, which could be a problem. Put in a request to download v1.4. The scripts they have should automatically download it, but it has not appeared. MW: Think GFDL only has v1.3.
MW: The fields go to the end of 2017; has 2018 been downloaded? Am I looking in the wrong place? Looking in ua8. AK: Should look in qv56. qv56 goes up to Feb 2018. AH: If it is not automatically downloading, we should ask. What does the OMIP protocol say about the end date? AK: JRA55 can find out about 2018. RF: It is specified, but we would like the latest for ongoing runs.

Testing FMS merge

AH: Putting FMS in as a sub-repo. Just needs testing. If it reproduces checksums for a month are we sure it is ok? Is that sufficient?
NH: When Marshall upgraded FMS, we went through every MOM test, including 0.25. Can’t recall how strict we were. AH: Is the testing framework still there? NH: It is there. Because it never gets used, it might have rotted a bit. Can give Jenkins the URL of a PR and it would do it. We should work together to get that working.

New NCI HPC hardware announcement

RY: The system arrives by the end of the year, in two phases: first install the new machine with Cascade Lake nodes, with a short period where gadi and raijin run simultaneously. After that the Skylake and Broadwell nodes will be merged into the new machine and the Sandy Bridge nodes removed. 100 GPUs installed: 16 Skylake K80 nodes. PBS Pro again. Storage and network are InfiniBand, with 200GB/s transfer speed. OS is CentOS 8. AH: Trying to figure out the total core count for the new machine. Do you know what the core count will be? RY: Not clear on the exact number. Can check with the system guys if they know. If 32 cores/node, 150+K processors. AH: Will runtimes be extended for the new machine? Find 5 hours too low for high core count jobs; it reduces flexibility. RY: Queue time limits are per project and quite flexible. Contact NCI help. AH: Have asked for time limit changes in the past, but they are usually time limited. RY: Have been asked by other users, not sure about the policy. A good time to ask and get a better policy for the new machine.

New Repeat Year Spinup

We are currently undertaking a new spinup simulation using our highest resolution ocean-sea ice model: ACCESS-OM2-01. Our goal is to run at least 50 years of this simulation in the third quarter of 2019, using the  1990-91 Repeat Year Forcing strategy (that we call RYF9091). After the first month of the quarter, we have managed to complete 23 years of the simulation.

Early indications are that temperature and ice biases are reduced in the new simulation (compared with our previous RYF8485 spinup) but that many large scale circulation metrics (such as ACC transport) are largely unchanged.

Globally Averaged temperature from RYF9091 spinup.
ACC Transport from RYF9091 spinup

For those who are interested, this simulation is being stored in /g/data3/hh5/tmp/cosima/access-om2-01/01deg_jra55v13_ryf9091. Timeseries describing the simulation, periodically updated, can be found here, and additional diagnostics are available upon request.

COSIMA IV, ANU, 3-4 Sept 2019

Announcing that our 4th Annual COSIMA workshop will be held at ANU on Tuesday 3rd and Wednesday 4th September.

The workshop will feature our regular mix of talks and discussions, covering a mix of physical oceanography, sea ice, model development and technical advances. Our workshops are not restricted to ACCESS-OM2/MOM users – we welcome contributions from all who consider themselves part of the COSIMA community.

Registration is now closed, but you can join via Zoom – see below.

We will begin with morning tea at about 10am on the first morning (3 Sept), with the first talk at about 10:30.

Click here for the workshop program (updated 1st Sept).

The workshop report provides slides from the presentations and a summary of the discussion. Shweta’s post on the workshop is here.

Joining via Zoom
The workshop will be streamed via Zoom. When joining via zoom, please mute your mic and hide your video. Unfortunately we probably won’t have an ability to take questions from remote participants.

Connection details:

Join from a PC, Mac, iPad, iPhone or Android device: https://anu.zoom.us/j/8232211676

Join from a H.323/SIP room system:
Dial: +61 2 6222 7588
or SIP:7588@aarnet.edu.au
or H323:8232211676@182.255.112.21 (From Cisco)
or H323:182.255.112.21##8232211676 (From Huawei, LifeSize, Polycom)
or 162.255.36.11 or 162.255.37.11 (U.S.)

Meeting ID: 8232211676

ACCESS-OM2 at the CLEx Winter School

This week saw the ARC Centre of Excellence for Climate System Science hold its annual Winter School. This year’s edition is on modelling the climate system, hosted by the University of Melbourne. Yesterday was “Oceans Day” at the Winter School, and the highlight was an afternoon lab session based on ACCESS-OM2. Tasks included:

  • To run the ACCESS-OM2 model (about 65 of the 70 students managed to do this); and
  • To analyse some existing ACCESS-OM2 model output using the COSIMA cookbook (most students also completed this, with about a third of students producing great plots to show some new and intriguing results).

Progress was enabled by erstwhile helpers from CLEx’s CMS Team – Aidan, Claire, Holger, Paola and Scott!

ACCESS-OM2 Workshop
Feverishly running ACCESS-OM2 at the 2019 CLEx Winter School.

ACCESS-OM2 Evaluation Paper

A community paper evaluating the performance of ACCESS-OM2 has been submitted to Geoscientific Model Development (GMD).  The paper has 30 authors from across the Australian community. GMD has an open review process, so you can track its progress.

This paper outlines the performance of ACCESS-OM2 at three different model resolutions, as indicated in the figure below.  It aims to spell out which versions of the model are suitable for different types of studies, and highlights the performance of the 0.1° resolution configuration (referred to as ACCESS-OM2-01). The paper shows that ACCESS-OM2 does a good job of representing many features of the ocean. Historical sea ice extent trends are well-represented, and the surface properties and transects in each ocean basin compare well with the observational record. The large scale overturning circulation, flow through the Indonesian archipelago and patterns of boundary currents are realistic, supporting the notion that this suite of models is competitive with similar models from other institutions. Areas for improvement include the relatively weak barotropic transport in the midlatitude gyres, particularly in the Pacific Ocean, the weaker than observed Drake Passage transport and the weak AMOC in the 1° configuration. For full details, feel free to browse the paper!

Kiss, A. E., Hogg, A. McC., Hannah, N., Boeira Dias, F., Brassington, G. B., Chamberlain, M. A., Chapman, C., Dobrohotoff, P., Domingues, C. M., Duran, E. R., England, M. H., Fiedler, R., Griffies, S. M., Heerdegen, A., Heil, P., Holmes, R. M., Klocker, A., Marsland, S. J., Morrison, A. K., Munroe, J., Oke, P. R., Nikurashin, M., Pilo, G. S., Richet, O., Savita, A., Spence, P., Stewart, K. D., Ward, M. L., Wu, F., and Zhang, X.: ACCESS-OM2: A Global Ocean-Sea Ice Model at Three Resolutions, Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2019-106, in review, 2019.

Technical Working Group Meeting, March 2019

Minutes

Date: 12th March, 2019
Attendees:

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK)  COSIMA, ANU
  • Russ Fiedler (RF), Matt Chamberlain(MC) CSIRO Hobart

Updates

MW: Submitted the parallel IO FMS patch. The new automake made the PR more complicated. FMS is now buildable by automake. If we add new files/dependencies the build will fail. Not very auto. Works well. Some tuning with Lustre is needed; just need a good io_layout with large contiguous chunks.

AH: Compression? MW: IO runtimes double with compression. Testing some of the newer algorithms and getting better numbers. AH: How does he get other compression into netCDF? MW: Custom libraries. Will make them accessible when the time comes. Have been using it; good. MW: Haven’t heard back from GFDL yet. AH: Will be needed for further high res models.

MW: RF was a big help with all the fill value stuff. RF: You put missing values in missing tiles. There is an mppnccombine bug which stuffs it up. MW: Used the netCDF fill value for restarts; MOM sets it to the CMOR fill value.
The automake goal is to make FMS into a library that can be centrally installed.
AH: Should just link against libFMS rather than have it in the repo.
RF: Part of the decadal project is to update FMS. Wants to use it in AR4. MW: Make it a loadable module? AH: Dynamic linking? AH: Need access to compiled module files?
MW: Issues with GNU/Intel. Dale Roberts is working on this. Also module versions are an issue: Intel 18 can’t read Intel 19 mod files, for example. Don’t think there are any includes; if there are, they should be fixed. RF: There is an include of mpif.h. Not sure how they include netcdf.
RF: Advanced with getting WOMBAT into CM2 harmonised MOM5. Fixed a few issues with redundant ifdefs. Hopefully sometime this week can finish. Pretty close by the end of the day.
MC: Might have an executable with WOMBAT compiled into it? RF: Yeah, just changed a few interfaces to get rid of ifdefs, changed to optional arguments for the WOMBAT arguments.
MC: If Matear is there tomorrow, can try something.
AH: Now have a suite of models. Different resolutions, now with BGC as well. Great position to be in.

Technical Working Group Meeting, January 2019

Minutes

Date: 15th January, 2019
Attendees:

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK)  COSIMA, ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD), CSIRO Aspendale

MOM5 CM2 code harmonisation

PD: Stopping an 18 year run with harmonised code. Seems successful. Not losing summer Antarctic sea ice, which had been an issue. Dave Bi gave his approval.

PD: Looking at new bug fixes on GitHub. Didn’t appear to be in the CSIRO code.

RF: The fixes I added weren’t to do with harmonisation. A diagnostic for transport on density levels. PD: Won’t affect our model run? RF: Yes. PD: Maybe should keep the 18 year run going. RF: There is also a fix to a submeso scale smoothing that you’re probably not using. 99% likely you’re not using that. PD: Could check by looking at the namelist? RF: It’s smooth hblt, or something; there is also a note in the code specifying that the namelist option shouldn’t be used because of this error. PD: Could you send me the namelist value? RF: Should be in the GitHub issue/pull-request. AH: I’ll put links from your commits on to the slack channel.

https://github.com/mom-ocean/MOM5/commit/11f06f989645b1b21aa990ade61440976451bbb6

https://github.com/mom-ocean/MOM5/commit/06a6d0afb55a1f188d0b58b89513d646d47f062f

AH: hblt_smooth

RF: If you applied the smoothing you could smooth into rock. PD: Keep the current run going? Want to get an ENSO spectrum. RF: Shouldn’t make a difference, but you will get noise due to changes in the red sea fix. The statistics will be the same.

AH: In the release candidate code we reproduced a red sea fix timing bug to make the comparison as clean as possible, using a namelist option. That has been stripped out before merging into the master branch. If you continue with this test run and it becomes your spin up run then you will not be able to do a clean comparison if you then change something else from the master branch version. Is this a test run or will it become a spin up?

PD: Sometimes test runs become real runs, but I’ll say this is a test.

AH: If at any point you start a spin up you need to be on a commit on the master branch. Currently running from a commit on a pull request that no longer exists. There is no comparable commit on the MOM master branch repo because of removing the salinity time unfix option, and merging in RF’s bug fixes.

NH: Second that. Also important if we want to continue to be harmonised, this is a divergence and if we carry on with that we’re diverging immediately.

PD: Alright. Will leave this run going to test ENSO spectrum, but will start a new parallel run for safety. Don’t want to be in the position where the test run gets turned into a spin-up because of lack of time.

AH: Calendar time not compute time is your constraint? PD: Yes. AH: Definitely agree with that strategy. PD: Never have enough compute time, but always have that trade off.

AH: Will tag the code with CM2 version which can use to identify the code. PD: Should tag straight away and I will clone and let people know this is the correct code.

NH: Reproducibility is important, and the current MOM code does not reproduce. PD: On restarts. NH: Yes, so would be a shame to lose that by not using the merged code.

AH: Not only is it reproducible, but NH is running tests for this. NH: That’s right. Just a simple 2 day versus 2×1 day run comparison. Doing that test presently turns off the red sea fix. Now I can turn it back on? RF: It should now reproduce. AH: Turn it back on and check it reproduces? AH: This is reproducible between runs, not necessarily reproducible to before RF’s fix. PD: Which is reproducible on restarts and which isn’t? AH: The current harmonised code in the MOM5 master is reproducible between restarts; the code from the pull request that PD is currently running is potentially reproducible if you turn off the flag we introduced which emulated the incorrect behaviour of the old MOM5 CM2 code for testing purposes. PD: So the harmonised code in the pull request is different to the master branch? AH: Yes, in that the hack to emulate the incorrect timing behaviour of the red sea fix has been removed, and a couple of RF’s bug fixes added. PD: Going forward will there be a harmonised branch and a main branch? AH: No, there will be a tag identifying where you get your code from. If you need to pull in updates but don’t want all the updates in master, then you might start a new branch, and cherry pick those updates. At that point you might have a different branch, but not currently. In general better off not having a separate branch, as it just starts diverging again. I don’t know how you guys work and I’m guessing you don’t want code changes, but if, for example, you just wanted to add code changes that added diagnostics, you could add those in, and have something to compare against. PD: Just trying to figure out how it will work, if there are different branches, and if down the track we want to develop the harmonised code again. AH: Yes, and if you have some testing then you can add code and test to see if there are differences in the output, and so add code changes with confidence.

MW: Highlights that we have not been tagging MOM for some time. Maybe we need more regular tagging.

 

ESM code harmonisation

PD: In an ESM meeting on Friday Tilo said it was too late to include changed code for CMIP6.

AH: Whatever we did he wouldn’t put it in CMIP6? PD: Yes. AH: Invested too much time in the spin ups? PD: Yes. AH: So no particular rush to do this. PD: I thought it was good to pass that on. AH: Good to know, thanks. MW: There is the ESM for CMIP6, but also the one in the CoE. Will they use what you’re working on? AH: Yes, but it is a shame, as we’re not then using the same code as the CMIP6 submission. MW: I thought that was your interest in ESM. AH: Yes, not sure.

MOM5 Governance model

MW: Not sure how much we can do without Steve being present, or anyone from GFDL.

AH: Made some notes, hoped to get some feedback and others make changes, additions. MW did add some useful points about defining domain experts.

AH: The MOM5 repository and community is not welcoming to those outside the current clique of COSIMA. Pull requests languish for years without attention because it is no-one’s responsibility, no-one has a defined role. Even if some people put their hands up to monitor pull requests.

Have a contributing.md, to tell people how to contribute to the code. There are users outside this group who use it, see them on the MOM users mailing list, and can use that channel to advertise it.

MW: That was my experience as a grad student. Not sure who is in charge of the code. Tried to contribute code and found it intimidating. Not a fan of overly prescriptive instructions on how pull requests must look, or how code must be written. That has made me less likely to report bugs. Would rather get bad contributions than no contributions. AH: Poor quality contributions require effort from us to work through them. MW: Yes, just want to make sure they know it is ok to screw up. A lot of projects require a lot of environment information etc, which can be onerous for new users. Now I tend to ignore those instructions and wait for devs to ask for it. Do think that a governance model is good, but want to encourage contributions and emphasise it is the effort that matters rather than the quality of the contribution.

NH: I think I agree. Want to be more friendly to outsiders, also agree with MW, want to put as few roadblocks as possible. People have little incentive to contribute to the repository. The harder we make it, the less likely they are to contribute. With Pull Requests, in order to merge we need testing, but people aren’t going to do those tests. They test for themselves to satisfy their scientific goals and that’s it. Not sure how to reconcile that.

MW: Automated code coverage tests help. A lot of CI services panic too much when code coverage drops after a contribution. This can be useful if it then spurs contributors to improve units tests to improve that metric. AH: Currently have 0% code coverage, so it can’t get worse!

MW: Yes, one issue is we have no code coverage presently. But looking more broadly if we have some automated tests that can tell us what is broken that can help a lot. I know we have testing. AH: Only compilation testing.

AH: If we do this, it will be a burden, want to minimise this, hence the document defined some steps for assessing a PR:

  1. PR assigned to committer
  2. Committer checks PR conforms to guidelines
  3. PR passes CI checks
  4. Iterate until correct or time limit passes (say developer does not update PR as needed)
  5. Testing? Do we push that burden on to contributor to demonstrate bit repro? What about performance?
  6. Accept or reject (rejection is automatic if contributor does not meet expectations above and time limit is exceeded, so rejection at this point would be unusual, but might occur if the efficacy or efficiency of the code is questionable)

MW: James Munroe has experience contributing to dask, managed by experienced software engineers. He had to do a lot of testing, as well as style and engineering changes. They had someone asking him to make changes until it was deemed good enough. We could do this too, but it would require engagement from us. Also, are we overthinking this? We hardly get any pull requests.

AH: Yes, but some of this stuff is useful for us to do too. A lot of it is just good communication. It may sound onerous, but in many ways it makes things easier, so people aren’t trying to guess the right thing to do. For example, with code style it may be as simple as pointing to a particularly well written section/module and saying “seek to emulate this”.

MW: GFDL is struggling with this right now. I sent them a beefy FMS patch and they did not know how to handle it. They didn’t want it, but they didn’t want to just reject it. Might be worth figuring out how they are dealing with it. AH: They might copy what we end up doing. MW: Yes, they might end up copying this.

AH: First up, should we consider roles, e.g.

Contributors: Anyone who wants to contribute code

Committer: Anyone with commit access

Maintainers: Committers who check PRs, assign PRs to other committers and maybe do some other admin tasks.

Admins: Do we need another layer? Currently only Steve, Nic and I are Organisation Owners and Admins on the MOM5 repo. Need at least 2 admins at all times (run over by bus scenario).

Sponsoring institutions: Acknowledge role of institutions which provide time for code development?

BDFL: Steve? (Does he even want to be a Benevolent Dictator?)

 

AH: Do we want to do something like this? NH: Yes. It is a great idea. Write it down and we can use it ourselves, and our collaborators.

AH: Are we happy with the roles? NH: Yes, need to look more closely. Looks good. Keep it simple.

MW: Maintainers should be admins, at least for now. I wonder if Steve would prefer to not have a formal role, as he is transitioning out of the coding. AH: Would he want to be BDFL? MW: Would probably accept it, but maybe he doesn’t want to.

AH: I think it is good to keep him there. At some point we will transition to MOM6 and it will no longer be my role to look after this. Others will move on too, so it would be good to have Steve still there in that eventuality.

MW: He could be a decider. AH: Yes, like having a Queen. Just have to host him in visits every now and then. MW: Head of State, but not the Head of Government.

AH: Ok, get rid of admin/maintainer distinction. Just make them Maintainers.

MW: Sponsoring institutions is interesting. AH: It was to acknowledge that institutions that pay people like me might have a reasonable expectation that the people they pay to help maintain the code could have some say in the way it is run. RF: I don’t like that, it is a community code. MW: I agree with RF, but there is some reality. AH: More about expectations. If I leave my job, the CoE might have some expectation that my replacement would be able to become a committer/admin. MW: I worry about sponsoring institutions insisting on having some oversight. AH: I always thought CSIRO was a bit keener on the whole “official stamp of approval” thing, but if we don’t like it I’ll get rid of it. No worries.

AH: Ok. Lets codify the roles, and look into a contributing.md to sit at the top of the repo to tell people how to contribute code. This all started with wanting to have timely responses to Pull Requests. Who wants to be what?

MW: What is the

NH: Happy to be a maintainer, but COSIMA would have to pay for my time. AH:

MW: Don’t have an explicit list of users.

RF: How far back do you go with sponsoring institutions? Back to 2009?

MW: I would prefer not to involve them.

RF: Just leave it. Keep it simpler.

MW: When I say sponsor, I mean someone who can’t function without this code. Need a maintainer.

AH: It is literally my job, and happy to volunteer as a maintainer.

AH: Maintainers check PRs, and then assign them to people, so we need others to be willing to have a PR assigned to them. MW: Happy to do it, but feel like others are more invested. AH: Doesn’t have to be just you. It is just having someone respond to a PR, check progress, feed back what is required to get it merged, and close it if it doesn’t meet standards and the submitter doesn’t respond to queries to fix it.

MW: Yes we need that role, at the moment it is AH and NH. If you want to add me I’m happy to do it, just not sure where I will be in the near future.

AH: Ok, how do people get these roles? Commit access is currently Steve, NH and AH. Do other people need it, or want it? How do we decide who should have it?

MW: Best to have one person responsible for commits/merges. AH: Steve recently merged a commit that needed to be reverted. MW: Which is what prompted this. AH: Yes, and he wouldn’t have felt the need to do it if he knew it was being handled.

AH: Do we need to do this now? Specify who is on “the committee”?

MW: Contributor is obvious. Pull committers from contributors. Pull maintainers from committers. If you don’t get enough commits, then a project just dies. AH: Don’t need to codify how someone becomes a committer? MW: No. We’re too small. Look at the issues and PRs; they represent very few people. AH: Ok, so no need right now.

MW: Quite premature to codify this.

AH: Need a `contributing.md` that explains how to contribute.

MW: Any role for governance is getting the word out. Find the user base. There are a lot of MOM users out there. Steve cited a thousand MOM users in 2010. Governance should be about finding where these people are, and then building the community. AH: You run the danger of saying “hey guys, come and contribute” and not having processes in place. MW: A good problem to have. AH: What happens if you got 10 PRs tomorrow, what would you do with them? MW: Deal with them one at a time. AH: But who would deal with them? We have existing PRs that aren’t being dealt with; one has been there for a year. MW: Yeah, but it is a bit stupid. Changed an integer to an 8-byte. RF: I would have rejected it because it used a specific kind. The change was a good idea, but badly implemented. MW: Ok, let’s look at this and learn some lessons and produce a document.

MW: So let’s look at this PR. What are we going to do about it? Who do we assign this to? I would assign it to Russ as he has expressed a strong opinion. AH: Ok. MW: What would you say about this PR, RF? RF: I would split it into two, as the changes are unrelated. If you do submit these things, try to keep to individual issues. Specifically about this one: the first change is non-portable, which can easily be fixed. Would we do the fix, or would he do the fix? All it needs is a selected_int_kind. MW: This is a portability issue. RF: Yes, anything that gets contributed from now on should be standard conforming. MW: The lessons should go into an evolving document about what we want. RF: Yes, standards compliant, independent commits. MW: Would you be ok with being assigned this? AH: Already done it. RF: The second part could be a real bug in remapping vectors. Maybe get back to him and ask if he can show a test case. AH: Ideally this would be associated with an Issue which explains the problem. AK: Ideally it would also include a test case that fails and then passes with the fix in the PR. MW: I would argue against pedantry. GitHub treats Issues and PRs similarly, so while an Issue with a test would be ideal, it shouldn’t be mandatory. If you would rather formalise it, then fine.

RF: Have to leave

Subsequent discussion:

AK: Who feels they have responsibility, so things don’t languish, and who has authority to say what is going to happen. NH: And who has time. MW: Good not to dump maintainer responsibilities on NH.

AH: The suggestion is to be prescriptive, so as not to waste time wondering how to do things. Not mandatory, but good to give guidance.

MW: The current touchy point is assigning stuff to RF. We had a meeting and asked if it was ok, but that won’t scale. Let’s clean up the pull requests.

MW: I have a different idea of the purpose of the Issues page to AH for payu. You see it as a dumping ground for ideas (AH: Yes), and I see it as something to keep as small as possible. Ok for a small group, but with a larger group it can balloon and good ideas get lost. “Bad idea” or “not enough time” should be valid criteria for closing an issue, without losing the idea. Would be great to get issues down to 1-2 pages; it would reveal a lot of lessons.

AK: Paul’s PR is pretty substantial. A whole new file, ocean_basal_tracer. MW: But it bungled the build script, and it also added a timer that does nothing. This is a good issue with a terrible title, and a good example to work through.

MW: These projects work best when they’re community driven, when we agree to do it. AH: If you put up your hand to be assigned stuff, you will get assigned stuff; if people don’t, they won’t. AK: If someone has that role you feel ok assigning things to them.

MW: When we talk about governance, let’s not worry about users; let’s figure out how to assign issues, that is a great place to start. Not worry so much about how users should make PRs.

NH: I agree, but maybe still work on guidelines for contributors. There was a document called “How to contribute”, based on technical specifications. It was a markdown doc on the original website. AH: Maybe that still exists somewhere. NH: It’ll be around somewhere; it outlines how to do a PR and things like that. AH: Should definitely start with that.

MW: Make it a goal this week to deal with those two PRs. Assign me Paul’s if you want.

AH: Willing to be a maintainer? NH: Yes, I like being part of that community. Hard to know how to dedicate time to it sometimes. We need to do a certain amount of maintenance. Maybe think of it as pro-bono professional work. Matt England is keen on that sort of stuff, he might be willing to support it/sanction it/pay for it.

Agreed: Maintainers: Aidan, Nic, Marshall. Ask Steve if willing to have BDFL status.

COSIMA Models – ACCESS-OM2-01

AK: Modified topography to remove terraces (bug). Smoothed a seamount near Severny Island, and eliminated very small cells near the tripoles. 54 KSU / model year. This is repeat year forcing. Time step is 720s. Did a test run with 900s and it crashed. 720s is a factor of 1800 (coupling time step). No factors in between.

AK: MOM was going unstable, generating 18 m/s velocities, and CICE was the first thing to crash, with an ice remapping error. Was adjusting ndtd (the number of CICE time steps per MOM baroclinic timestep); using ndtd=3 kept CICE stable with a vaguely unstable MOM. At one point in the run I added Rayleigh damping in Kara Strait, as I had done in the IAF run: MOM was going unstable in Kara Strait, moving ice around too quickly, and CICE crashed. Once I stabilised MOM, could get away with ndtd=2, so now the model is CICE bound rather than MOM bound. AH: What is the runtime? AK: 2 hours, so can do 2 months/submit. MW: What were the fails at the beginning? Related? AK: That is output from a run-summarising tool. It just shows when runs failed; not necessarily crashes, might have just stopped to change something. AH: So when you say MOM went unstable, not so unstable it crashed, but unphysical velocities? AK: Haven’t looked in detail, but imagine it would be grid-scale alternation and that sort of thing. AH: Does that mean, in order to be compatible with CICE, maybe we need to reduce some of MOM’s thresholds for warnings so they are better matched? AK: Yes. CICE has a CFL condition that is tighter than MOM’s. MW: They are solving different equations. AH: Not saying they should, or could, be exactly the same, but MOM is more permissive than CICE, and the crashes happen in CICE. So if you reduce the limits in MOM it will diagnose the problem in the correct place, rather than indirectly through CICE crashing. A new user would then get the right information.
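
For context, some illustrative arithmetic using the definition of ndtd given above and the 720 s ocean step (not from the meeting itself):

$$\Delta t_{\mathrm{CICE}} = \frac{\Delta t_{\mathrm{MOM}}}{\mathrm{ndtd}}, \qquad \frac{720\,\mathrm{s}}{3} = 240\,\mathrm{s}, \qquad \frac{720\,\mathrm{s}}{2} = 360\,\mathrm{s}$$

so dropping ndtd from 3 to 2 gives CICE a longer sub-step, consistent with CICE’s tighter CFL condition becoming the limiting factor once MOM is stabilised.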

AK: Haven’t examined the fields. It is right on the barotropic CFL limit, with a splitting factor of 80. The outputs show it is right on the limit. MW: Griffies would hate that. AH: Pretty routine to change that when the timestep changes. AK: Had to go to 100 for dt=900s for it to pass that check. Might be wise to up that number.
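
As a sanity check on those numbers (illustrative arithmetic only): the barotropic sub-step is the baroclinic step divided by the splitting factor, so raising the split from 80 to 100 as dt goes from 720 s to 900 s keeps the sub-step unchanged:

$$\frac{720\,\mathrm{s}}{80} = 9\,\mathrm{s}, \qquad \frac{900\,\mathrm{s}}{100} = 9\,\mathrm{s}$$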

MW: This is tenth degree? Really dropped your CPU count. AK: This is a minimal config. Dropped from 6K to 2K cores, getting the same throughput as the larger model. MW: But with lower ndtd and a higher timestep. AK: Yes. AH: Timestep is like a superpower.

NH: Great! Well done. AK: Thanks for your work. On the number of CPUs, it is good to run minimal configs, as they are inherently more load balanced. Each CPU is doing more work and a greater variety of work, which is a good reason this minimal config is more efficient. MW: Won’t it run slower if you reduced the timestep? That doesn’t necessarily make it more efficient? NH: I think it does.

NH: How does this compare to MOM-SIS? We can’t really compare without running with the same grid? MW: Can usually use the MOM timing and add 15%. I went back and checked these numbers. AH: I think I was getting six months/submit with MOM-SIS at a nominal 7200 core layout, with dt=600s. The larger ACCESS-OM2 config has 6K cores in total, but only 4300 MOM cores.

PD: Have to go now. Bye.

NH: Are you going to try getting the bigger config running? AH: 6 months per submit is very interesting. Could do some decent 100 year runs. Could be worth looking at.

MW: According to the numbers I have been posting, we could speed MOM up by a factor of 4 with more cores. The CICE ice step is also scaling well. Don’t know if that is because it switched to sectrobin, or whether I am missing a coupling bottleneck. This model could scale well. Has anyone checked with sectrobin? NH: This is using it. MW: I am talking about much larger counts; I get improvement up to 12K CICE cores. AH: Load balancing not an issue? MW: Maybe, but I’m not seeing it with the function I am testing. Is this new? Or was this always the case and I’m not seeing what slows the model down? If it is new, maybe we should experiment with throwing more cores at CICE. Even with the low config, maybe you could ramp the CICE cores up to 600? NH: That one is MOM bound. AK: The IAF was using round robin, and dt=450s. MW: Right, so that is your difference right there. AK: Dropping ndtd to 2 makes it MOM bound.

NH: Let me know if there is anything I can do to improve the load balancing in the larger config. Pretty sure there is a larger config using sectrobin.

MW: I’ve seen consistent improvements in all CICE configs, including 1 degree. Doubling the CPUs in CICE and MOM individually in the 1 degree config saw improvement in both, but I got errors when I tried to do both. Gave up because 1 degree is not so important. Will try again in tenth degree.

NH: Sounds like you’re discovering what the best config should be using an evidence-based mechanism. The 1 degree config was pulling numbers out of a hat that balanced ok with decent efficiency. Maybe we want small/medium/large configs, but not for all resolutions; at 1 degree maybe we just want max throughput. Would be great to capture what you’ve discovered, and actually start to create those configurations.

MW: The MOM numbers are clear. Time per step is very informative. Want numbers around 0.1 s/step. Allows us to tackle this independent of the time step. Surprised CICE is scaling. Reminds me of the last COSIMA talk, which everyone said was wrong because I didn’t have enough ice. Will make a figure which will show strong scaling numbers for CICE. Looks like it is a strong-scaling model.
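
As an aside on why time per step is a useful timestep-independent metric (back-of-envelope arithmetic, not from the meeting): at the 0.1 s/step target and the 720 s baroclinic step quoted above,

$$\frac{365 \times 86400\,\mathrm{s}}{720\,\mathrm{s/step}} \approx 4.4\times10^{4}\ \mathrm{steps/year}, \qquad 4.4\times10^{4}\ \mathrm{steps} \times 0.1\,\mathrm{s/step} \approx 1.2\ \mathrm{hours}$$

of MOM wall time per model year, so per-step cost and timestep choice can be assessed separately.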

AK: On CICE, there is a workshop in Hobart in late Feb. It is a CICE modelling workshop, and Elizabeth Hunke is visiting. AH: Are they up to CICE6? MW: We are dealing with a fork. Los Alamos has a version no-one is allowed to see. NH: Worth looking at the CICE repo. Our CICE5 version contains all changes up to the CICE6 tag, so we’re as up to date as you can be on CICE5. AH: Have you back-ported stuff? NH: Yes, it wasn’t very hard. This is CICE6, but it has CICE5 in the repo. AH: They have a cice6 branch. MW: What is Icepack? NH: They have split out the column physics to make it more portable. It is just code from CICE5 but repackaged. AH: Icepack is included in CICE6 as a submodule.

AH: Ruth has been having queuing issues, anyone else having problems? Might just be someone using broadwell heavily. Do you know if we have to use broadwell for minimal? NH: My runs used to crash. AK: My runs are ok on normal, running the same as Ruth. AH: Using less than 1GB/processor? NH: I put something in config.yaml so when running in normal it used the high memory nodes. Did you take that out? AK: Yes, I removed that. NH: Who knows. So much has changed since then. Memory can spike. AK: Ruth should be able to run on normal without requesting extra memory. MW: What made you think it was a memory issue? NH: It crashed at a particular place on initialisation in MOM where I know there is a big jump in memory usage. I have made an issue about it. It is in FMS. As soon as I went to the high memory node it went away. It doesn’t print out specific memory errors, but as soon as I increased memory it ran fine. Good to go back and check if it has gone away. It is something that gets done on the MOM root PE on initialisation that increases memory usage a lot. AH: Shame we no longer have access to the PBS logs, as there was a lot of information in there on this sort of thing.

MW: Intel 19 gives awesome error messages. Not only are tracebacks more readable, but they give 7 lines of code each side of the error. Now getting MPI tracebacks without even using mpi-debug. Can’t use openmpi/1.10 with Intel 19.

Parameter sweep runs

AH: To use extra time at the end of last quarter it was suggested we could do a parameter sweep, which would use up a lot of compute time in short order. It didn’t end up happening, but Andy Hogg thought it would be a good capability to develop in case we did want to deploy it. I understand AK has done this a lot? AK: Well, sweeps of 2 parameters, yes.

AH: Not a big deal: using payu, maybe have a YAML file specifying parameters and how to sweep them, create a bunch of payu configs programmatically and run them. MW: Always wanted to add this as a feature in payu. Spin off 20 runs with one file. AH: If anyone has particular ideas, or wants to do it, feel free; otherwise it will go on my very long to-do list. NH: I couldn’t think of anything. I don’t know enough about the science. AH: Me too. I have no idea what parameters I would want to change, so that is where it stopped. The parameter search could mean you always have something to run if there is spare time, and you add it to the database. If you have a database of runs, you effectively have a parameter search if you change anything.
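
A minimal sketch of what such a driver could look like, assuming a hypothetical `sweep.yaml` layout, placeholder experiment and namelist names, and that `PyYAML` and `f90nml` are available; this is illustrative only, not an agreed design:

```python
# Sketch: expand a hypothetical sweep.yaml into one payu experiment per
# parameter combination, patch the namelists, and submit each run.
#
# Hypothetical sweep.yaml (file, group and variable names are placeholders):
#
#   parameters:
#     ocean/input.nml:
#       ocean_vert_mix_nml:
#         some_diffusivity: [1.0e-5, 2.0e-5]
#     ice/cice_in.nml:
#       dynamics_nml:
#         some_parameter: [0.5, 1.0]

import itertools
import shutil
import subprocess
from pathlib import Path

import f90nml  # Fortran namelist reading/patching
import yaml

CONTROL = Path("01deg_jra55_ryf")  # existing payu control experiment (placeholder)
SWEEP_SPEC = Path("sweep.yaml")


def expand(spec):
    """Yield (label, {namelist_file: nested_patch}) for every combination."""
    entries = []
    for nml_file, groups in spec["parameters"].items():
        for group, names in groups.items():
            for name, values in names.items():
                entries.append((nml_file, group, name, values))

    for combo in itertools.product(*(e[3] for e in entries)):
        label = "_".join(f"{e[2]}_{v}" for e, v in zip(entries, combo))
        patches = {}
        for (nml_file, group, name, _), value in zip(entries, combo):
            patches.setdefault(nml_file, {}).setdefault(group, {})[name] = value
        yield label, patches


def main():
    spec = yaml.safe_load(SWEEP_SPEC.read_text())
    for label, patches in expand(spec):
        expt = Path(f"{CONTROL.name}_{label}")
        if not expt.exists():
            shutil.copytree(CONTROL, expt)  # one control directory per member
        for nml_file, patch in patches.items():
            target = expt / nml_file
            tmp = Path(str(target) + ".new")
            f90nml.patch(str(target), patch, str(tmp))  # write patched namelist
            tmp.replace(target)
        # 'payu run' submits the job from inside the experiment directory
        subprocess.run(["payu", "run"], cwd=expt, check=True)


if __name__ == "__main__":
    main()
```

Each generated directory is then an ordinary payu experiment, so the runs land in the usual archive and could feed the run database mentioned above.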

AK: Have a current grant to do a study on parameter sensitivity in CICE. This capability could be useful.

NH: I have written something which can divide your grid by an integer factor, and creates initial conditions and everything. Could be interesting to programmatically create a bunch of configs and run them with something like this.