Technical Working Group Meeting, September 2019


Date: 11th September, 2019

  • Aidan Heerdegen (AH) CLEX ANU,  Andrew Kiss (AK)  COSIMA ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Nic Hannah (NH), Double Precision


AK: JRA55 v1.4 splits runoff into liquid and solid. Most elegant way to support? Have a flag in accessom2 namelist to enable combining these runoffs. NH: Is it a problem in terms of physics? Have to melt it? AK: Had previously ignored this anyway, so ok to continue. NH: Backward compatibility!
AK: Some interest in multiplicative scaling and additive perturbations to allow for model perturbation runs. NH: Look at existing code. Might not be too hard. AK: Test framework for libaccessom2? NH: When I did scaling it took longer to write the test than to make the code change. All there, could use as an example. Worth running tests, don’t want to get it wrong. AK: Not familiar with pytest. NH: In this case just copy the scaling test, modify it, and get pytest to run just that test. Once you’ve got just that test running and passing you’re done.
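A minimal sketch of the idea discussed above, not the libaccessom2 implementation: a hypothetical helper, perturb_field, applies a multiplicative scaling followed by an additive perturbation, with two pytest-style tests alongside. The function name, argument names and the order of operations (scale first, then offset) are assumptions for illustration.

```python
def perturb_field(values, scale=1.0, offset=0.0):
    """Return scale * value + offset for each forcing value (hypothetical helper)."""
    return [scale * v + offset for v in values]


def test_scaling_only():
    # Multiplicative scaling alone, in the spirit of an existing scaling test.
    assert perturb_field([1.0, 2.0], scale=2.0) == [2.0, 4.0]


def test_scaling_and_offset():
    # Assumed order: scaling is applied first, then the additive perturbation.
    assert perturb_field([1.0, 2.0], scale=2.0, offset=0.5) == [2.5, 4.5]
```

If this were saved as test_perturb.py (a hypothetical filename), pytest could be asked to run just the new test with `pytest test_perturb.py::test_scaling_and_offset`, or by name with `pytest -k test_scaling_and_offset`.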
AK: New JRA55 now in Input4MIPS. Used JRA v1.3 from that directory and didn’t reproduce. AH: Correct. Didn’t work out why it wasn’t reproducing. AK: Ingesting the wrong files? Should be identical. AH: Never figured out what was wrong. Didn’t match checksums from historical runs. Next step was to regenerate those checksums to make sure the historical ones were correct. Could have been ok, but didn’t get that far.
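One way to rule out the "ingesting the wrong files" possibility raised above would be to compare the forcing files in the two locations byte for byte. The sketch below is only an illustration using the standard library; the helper names are hypothetical, and files holding identical data can still differ in metadata, so a mismatch here is a prompt to look closer rather than proof that the data differ.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(paths_a, paths_b):
    """Pair files by position and return the pairs whose digests differ."""
    return [(a, b) for a, b in zip(paths_a, paths_b)
            if sha256_of(a) != sha256_of(b)]
```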
AH: JRA55-do is now on the automatic download list, should be kept up to date by NCI. If it isn’t let us know.
NH: Liquid and frozen runoff backwards compat, but what about future? AK: Some desire to perturb solid and liquid separately, and/or distribute solid runoff. NH: Can we just put it somewhere and allow model to deal with it. AK: In terms of distributing it, not sure. Some people are waiting on this for CMIP6 OMIP run. Leave open for the future. NH: MOM5 doesn’t have icebergs? AK: No. Depoorter et al. has written a paper for meltwater distribution. Maybe use a map to distribute. RF: What they use for ACCESS-CM2. Read in from a file.
AK: Naming convention for JRA55 v1.4 has year+1 fields. Put in a PR some time ago. AH: Problem with operator in token? NH: Should be fine as long as within quotes. AK: Just a string search shouldn’t make a difference.
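To illustrate the point that a plain string search is unaffected by the "+" in the token, here is a small sketch. The token spellings {{year}} and {{year+1}}, the function name and the example filename are all assumptions for illustration, not taken from the libaccessom2 source.

```python
def expand_filename(template, year):
    """Replace year tokens in a forcing filename template.

    Uses plain string replacement, so the '+' inside the token is matched
    literally and never interpreted as an operator.
    """
    return (template
            .replace("{{year+1}}", str(year + 1))
            .replace("{{year}}", str(year)))


if __name__ == "__main__":
    # Hypothetical template in the style of a year / year+1 filename.
    print(expand_filename("rsds_{{year}}01010130-{{year+1}}01010000.nc", 1990))
```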
AK: Can’t get libaccessom2 to compile and link to correct netcdf library. Ben Menadue tried and worked ok for him. Problem with findnetCDF plugin for CMake. Not properly supported on NCI. Edited the CMake file to remove this, could find netCDF, but used different versions for include than linking. Should move to a newer version of netCDF. v4.7.1 has just been released. Have requested this be installed on NCI. NH: Does supported include CMake infrastructure around library? If getting findnetCDF working was NCI responsibility that would be great. Difficult getting system library stuff working properly with CMake. CMake isn’t well supported in HPC environments. AK: Ben suggested adding logic to check and not use on NCI. NH: Definitely upgrade, to 4.7 if they install it.
AH: Didn’t Ben Menadue login as AK and it ran ok? AK: No, he didn’t do that as far as I know. AH: Definitely check there is nothing in .bashrc. Also worth checking if there is a csh login file that is sourced by the csh build scripts.

OpenMPI testing

RY: Tested OpenMPI 2, 3, 4 and Intel 2019. Consistent results between all OpenMPI versions at 1, 0.25 and 0.1 degree. Some differences from Intel 2017, not from the MPI library. Not sure if the difference is acceptable or not. Would like some help to check differences.
Just looking at access-om2.out differences. Maybe need to look at output file like RF: Need to compile with strict floating point precision to get repro results. MOM is pretty good. Don’t know about CICE. Can’t use standard compilation options. fp-precise at a minimum.
RY: If this difference is not acceptable need to use flags to check difference between 2017 and 2019? RF: Once get a bit change, chaos and get divergence. RY: Intel 2017 still on new system. AH: So not only newest versions of modules on gadi? RY: 2017 will be there, but no system software built with it. AH: Done a lot of testing. Should be possible to just use 1 degree as a test to get 2017 and 2019 to agree. There are repro build targets in some of those build files. Could try and find them. RY: Yes please.
AK: Any difference in performance? RY: No big difference. NH: New machine? RY: No, old machine, with broadwell.
RY: NCI recently sent out gadi update and blog and webpage. 48 cores/node. NH: Did we think it was 64 cores/node? AH: Still 150K cores in gadi, with 30K of broadwell+skylake. Maybe have to change some decompositions. RY: Not the same as any existing processors.
AH: Two week overlap with gadi, then short will be read only on gadi. RF: There was panic in ACCESS due to an email that said short would disappear in mid October. AH: Easy to misread those dates.

accessom2 release strategy

AK: Harmonising accessom2 configurations. Somewhat haphazard release strategy, but not tested. Maybe master branch that is known good, and have a dev branch people can try if they want? Any thoughts?
NH: Good way is really time consuming and labor intensive. Would mean testing every new configuration. Not sure if we can do that. Tried to keep master of parent repo only references master of all the control experiments. Not sure if necessary or desirable? Maybe makes more sense to develop freely on own experiment and keep everything in control stable? Not sure. If all control experiments are stable and working, can be a bit slow to update. Just update your experiment.
AK: Some people are cloning directly from experiment repos, some cloning all of access-om2. Would reduce confusion if control directories under accessom2 are kept up to date with latest known good version. NH: Does make sense I guess. Shame for people to clone something that is broken which has already been fixed. There is some python code in utils directory which can update everything. Builds everything at all resolutions, copies to public space, updates all exes in config.yaml and does something with input directories. AK: I ended up writing up something like that myself.
AH: Should split out control dirs from access-om2 repo. Is a support burden to keep them synched. Not all users need entire repository, as using precompiled binaries. Tends to confuse people. NH: Did need a way for config to reference source code and vice versa. AH: Required to “publish” code? Maybe worth looking into. NH: Ideally from the experiment directories need to know what code you’re using. Probably got that covered. In config.yaml do reference the code and it’s in the executable as well. When run executable it prints out the hash from the source code. Enough to link them?
AH: I recall NH wanted to flip it around and have the source code part of the experiment. NH: Probably too confusing for users. AH: True, but a useful idea to help refine a goal and best way to achieve it.
AH: A dev branch is a good idea. Then you have the idea that this is the version that will replace the current master. Can then possibly entrain others into the testing. Users who want updates can test stuff, you can make a PR and detail testing that has been done.
NH: Good idea. Some documentation that says experiments have stable and dev. When people are aware and have a problem, wonder if they can go to dev, see if it fixes. AK: Bug fixes should go into master ASAP. Feature development is not so urgent. A bit gray, as sometimes people need a feature but they can work off dev. AH: Now have some process for this: hot fixes that go straight in. Other branches are dev/feature branches. Maybe always accumulate changes into dev. Any organisation helps.
NH: Re: Removing experiment repositories: namelists depend on source code. AK: Covered by executables defined in config.yaml. NH: Yes ok.


RF: Did it work? It’s got a lot of merges. RF: Just two lines. Did a merge and pushed it to my branches on GitHub. AH: I’ll merge it in. Just wanted to check. AH: Can always make a new master branch that tracks the origin, check that out and pull in code from other branches. RF: Have a lot of other branches. AH: Can get very confusing.

payu restart issue

AH: Issue has resurfaced. I commented on #193, but didn’t look into the source of the problem. Should look into it rather than talk about it here.

FMS subrepo

AH: Still not done the testing on this. Been sick. Will try and get back to it.

Tenth update

AK: Andy done 50 years with RYF 90/91. Running stably. AH: What timestep? RF: Think he was using 600s. AK: 3 months / submit. Should ask for longer wall time limit. RF: Depends on how queues will be on new machine, what limits and what performance. AH: Talking about high temporal res output. AK: Putting out 3D daily prognostic fields. Want it for particle tracking. Including vertical velocity. Slowed it down a little bit. RF: More slowdown through ice. AK: No daily outputs from CICE.


NH: Still in progress. AK: Also requires newer version of netCDF? NH: Requires specific version of netCDF. Needs parallel version. Not a parallel build for every version. AK: Has parallel for 4.6.1. RF: Bug in HDF5 library which it is linked to. Documented in PIO. Probably a bug we’re not going to trip. Doing a collective write, and some of the processors not taking part/writing no data. Fixed next version of HDF5 1.10.4? AH: Not a netCDF version so much as the HDF library it links to. RF: Yes. AH: So should make sure we ask for a version of netCDF that doesn’t have this bug? AK: Add to request.
RY: If want parallel version, use OpenMPI 3 or 4? AH: Good question! RY: All dependencies will be available and very easy to use. AH: This using spack? RY: Above spack and other stuff. Automatic builds with all possible combinations. AH: Using it for your builds? RY: We are requested to test and are now using. Difficult to create new versions currently. In transition difficult, but in new system should be fixed quite easily. AH: Should fix the various versions of OpenMPI with different compilers. RY: Yes. AH: Will have a compiler/OpenMPI toolchain? RY: Will automatically use correct MPI and compiler. AH: Any documentation? RY: Some preliminary, but not released. When gadi is up all this should be available.
AK: Should I ask for a specific version of MPI? RY: If don’t specify, will be built with 3 or 4. Do you have a preference? AK: No, just want the version with performance and stability we need. Do we need to use the same MPI version across all components? RY: Not necessarily. Good time to try OpenMPI3. No performance benefit as system hardware is still old hardware.

Technical Working Group Meeting, July 2019


Date: 1tth July, 2019

  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) RSES ANU, Andrew Kiss (AK)  COSIMA ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Peter Dobrohotoff (PD), CSIRO Aspendale
  • Marshall Ward (MW) GFDL
  • Nic Hannah (NH), Double Precision

Config checking

AH: Made payu configuration checker. Include safety checks for synching scripts in BASH scripts. Interested in checking for bad namelist options. Russ any specific bad ones.
RF: Kpp kbl standard method should be false, red sea fix used in access-cm. For a CM model maybe warn. OM model just not allowed. AK: nprocs and ncpus driver issue?
NH: diag_step checks? Too frequent ok for low res, bad for high. RF: For production runs don’t want? But how you diagnose problems. Best way to find how things are going wrong. Maurice’s issue was trivial to spot. AH: Definitely say don’t want debug_this_module turned on. RF: diag_table debug turned on, should be turned off. Creates huge numbers of messages. AK: Setting up new updated configs. Ten configs. Make them more homogenised. Fixing all these things as they go. AH: These things will get changed by mistake. Don’t have enough people to keep checking things. Doesn’t scale. This will allow new users to submit a config that at least passes these checks, also gives others confidence to change config knowing they have produced something that meets some minimum standard, appropriate for public facing production.
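As a sketch of the kind of check discussed above (not the payu configuration checker itself), the function below scans already-parsed namelists, represented as nested dictionaries, for two of the options mentioned: debug_this_module turned on in any group, and kbl_standard_method set to true. The group name ocean_vert_kpp_nml and the function name are illustrative assumptions; the warn-for-coupled, error-for-ocean-only split follows RF's suggestion.

```python
def check_namelists(namelists, coupled=False):
    """Return a list of (level, message) findings for parsed namelists.

    `namelists` maps group name -> {option: value}. The group and option
    names checked here are illustrative assumptions.
    """
    findings = []
    # Debug output should not be left on in production configurations.
    for group, options in namelists.items():
        if options.get("debug_this_module"):
            findings.append(("error", f"{group}: debug_this_module is on"))
    # Warn for a coupled (CM) model, refuse outright for an ocean-only (OM) model.
    kpp = namelists.get("ocean_vert_kpp_nml", {})
    if kpp.get("kbl_standard_method"):
        level = "warning" if coupled else "error"
        findings.append((level, "kbl_standard_method should be false"))
    return findings
```

A configuration would pass when the returned list contains no "error" entries, which gives new users a minimum standard to meet before submitting a config.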

Tenth Run

AK: Andy Hogg is running ACCESS-OM2-01 with JRA-55-do RYF90/91, seems to have smaller biases than previous repeat year (84/85). Currently 13 years. When it does die it is CICE CFL problem. Sometimes the same date, subsequent years didn’t occur. Checked dates. Storm goes near tripole. Not currently messing with forcing winds. Did this with 84/85 but this doesn’t seem as bad, so haven’t done it so far. One drawback doesn’t run at dt=600s, but takes 2.55 hours to do 3 months. With dt=600s could do 6 month submits, which would mean less queue wait. Should be straightforward to fix the winds to enable this. AH: Not a priority considering extra SU cost. AK: About 10%. Cost of losing 6 month submit halfway through > 10%. AH: Shame it is the tail wagging the dog. AK: Could ask for 6 hour limit from NCI? AH: Worth trying. Done it before, and have seen others with increased limits. Prefer to do it, but time limited, and just for one project. AK: Hopefully limits will change with new machine. Currently 65KSU/3mths. RF: MOM or CICE bound? AK: Fraction of time MOM is waiting 2-3%. RF: Not greatly MOM bound. Throw a few more processors at MOM to get it to run less than 2.5 hours?
AK: 40 years. IAF not split, just start from climatology. AH: When will IAF start? AK: No plans, not simultaneously run RYF and IAF.

NCI update

AH: Attended an NCI scheme manager meeting. Mostly about new storage scheme for short term storage. Push came from CSIRO to change to scratch model, but some others in CSIRO not happy. PD: Wasn’t aware that was being driven from this end. Maybe further up the food chain.
AH: Change to time-limited scratch, or a tidal model deleting oldest data first. Maybe a split scheme with old style short on one disk, time limited on another, but not a lot of appetite for that.
RY: First stage November. Our group look for HPC application for new machine. Already have ACCESS-OM from Andy Hogg to look into software state. New machine some old library will not be maintained.


Recently used ACCESS-OM2 with OpenMPI 3.0. Seems to hang? Know this issue? Or avoid 3.0? Some work required to run on new machine. Spend some time on this work.
AH: Marshall any ideas? MW: Have tried 3.0.0 3.0.1, maybe 3.1.1. Earlier ones didn’t work then got fixed. Newest 3.x should work. RY: Tried 3.1.3, MOM keeps hanging until finish of job. Should finish at 40min. Keeps hanging. 1.10.2 works. 3.1.3 hanging. MW: Sure I got it running. Will make sure they are in repo. RY: Catch up with you personally? MW: where it hung should tell you. RY: Talk later offline.
AH: Definitely need it working on the new machine. MW: No work needed to be done, it just worked. AH: What changes would you have made? MW: Just versions, environment file and flags. Maybe using some of the alltoallw changes, but I don’t think that was a deal-breaker.
AH: What is the minimum version OpenMPI supported on new machine? RY: Under discussion. System guys will decide. Have to prepare for any. Not sure OpenMPI 1.10 will still be supported. Don’t know. AH: Likely to be OpenMPI 3.x+? RY: New machine with new architecture. Performance enhancements with new architecture. MW: What arch? RY: Now have skylake. Newer than Skylake. MW: Intel architecture, not Ryzen. AVX512 can’t benefit. fma which we already have. AH: AVX512 because can’t vectorise enough? MW: Currently vectorising, but bandwidth limited. Ryzen has better bandwidth. RY: Not announced. No idea. AH: At scheme managers meeting it was an Intel chip. Told it was November when they commission new nodes, take equivalent raijin nodes offline. Iron out the bugs, and early next year will turn off the rest of raijin and turn on the rest of the new machine and at that point it will be larger than raijin now, but not a huge increase in compute. AH: Thanks for bringing that up Rui, as we definitely need to keep an eye on this for the new machine.
MW: Apparently used 3.0.3. Maybe a reference point to start with. RY: Start with 3.0.3? MW: Whole space is volatile, some 3.0.* series work some don’t. But start with 3.0.3 and Intel19.
AH: Would be nice to have a spack like build tool so can say for certain what was run. MW: payu build! AH: spack written by a smart guy from TACC, and lots of people use it, and they still have a lot of issues. Not an easy problem to solve. MW: Dale was keen on it. AH: When we met with Dale he was thinking to have spack as a tool preconfigured with compiler toolchains that we can build our tools from. RY: Dale is very busy getting new for the new machine.

Splitting off FMS

AH: Been working on Cmake to compile FMS separately from MOM. Been using the FMS fork in mom-ocean repo with your alltoallw changes. MW: Also a branch on the GFDL repo with those changes.
AH: How to organise the FMS fork? Have a branch that tracks GFDL and master contains our local changes? Could have a branch called gfdlmaster, could have our master branch exactly track the GFDL FMS. Any opinions on how to organise this? MW: Don’t want to use GFDL FMS? AH: I want an easy way to update FMS without touching MOM source tree. MW: Want to get FMS out of MOM? AH: Yes. MW: And want to know how to refer to FMS you want to use? AH: FMS we want to use is a fork on mom-ocean. Gives flexibility to add changes when we need to.
MW: Best to have your own FMS fork. GFDL don’t want to support anything but for GFDL, including MOM5. Don’t really want to get involved in supporting other projects. Will be receptive. No harm in using FMS repo straight, but if doing anything with FMS better off maintaining own version and update as see fit. Don’t see compatibility with older models as priority. Planning a big IO rewrite. Wouldn’t be surprised if it starts breaking and not salvageable.
AH: alltoallw we definitely want on our architecture as we’ve had issues in the past? MW: A lot of work, return not what I’d hope. Latest MPI version bigger impact. Are cases with speed up, but such an infrequent operation not such a big deal. AH: Stopped initialisation hangs? MW: Yes, some rare scenarios where they did alltoall with point to points that broke a lot. In OpenMPI 2.0/3.0 and later they changed something, scenario no longer happened. Segfaulted before, now properly checking. Only necessary for 1.10. It is better, as collectives are generally more responsible. May become necessary, assuming 3.0.3 works.
AH: If want alltoallw, would keep a branch with those changes and rebase on to gfdl master. This would be a well documented branch, or branches, and a well documented way of applying those changes when an update is required.
MW: Can CMake build as libfms and link to MOM when you build it. No submodules, rely on CMake. Does that work? AH: FMS is not suitable to be a loadable module. Get OpenMPI conflicts, best to build at the same time with the same compiler toolchain. There is a new CMake tool called FetchContent that can grab a repository and it behaves like it is physically in the source tree. Works well, but not great versioning. MW: Isn’t Nic already doing something like this for ACCESS-OM2 to pull in specific versions of json-fortran? AH: Yes, you can specify a library git hash. The only thing stopping it from working is relocating the versioning string stuff Nic did as it is currently sitting in the FMS directory, and that is going to disappear. Needs its own directory, maybe ocean_shared? RF: ocean_shared is used for other tracers. MW: should not use that name. AH: Ok, will make a new directory called version. Can recreate the sed script functionality that is currently in the build script in CMake using template files. Quite a clean solution. I have a cmake branch on the MOM5 repo and FMS fork on mom-ocean, will get them compiling properly and working properly together. There is a way forward.
MW: Alistair is pretty interested, might be a template for MOM6. AH: Angus already did this for MOM6? MW: Angus, is what you did still viable? AG: Haven’t tried recently, don’t know why it wouldn’t work. Replicating mkmf process in CMake. MW: Automake is not good and won’t touch it. AH: Surprised there was no way to build FMS from the FMS repo. Relies on being imported into another project that knows how to build it. Not sure it is great that a project can’t build itself. MW: CMake support not widespread enough? Not available everywhere? AG: Updates frequently, can have features that break old versions. Used in a lot of projects. Surprised if it went away. AH: CMake can be brilliant, but also terrible, but better than mkmf. MW: mkmf is doing two jobs, importing stuff and working out dependencies. Does work well for the latter job. Set a high bar. AH: Haven’t done proper comparisons, but CMake seems to be better for dependencies. Can do parallel builds with CMake you can’t with mkmf. MW: mkmf just generates a makefile, which is already parallel. AH: So does CMake. AG: doesn’t seem like a good makefile, don’t know if the dependency tree is deficient. Rebuilds too much even after touching a single file. MW: if CMake intelligently supports mod files then it is fantastic. AG: Has native fortran support. AH: Speed point of view, CMake is better. Generated correct dependencies so that parallel compilation worked. Couldn’t do that with mkmf. Also had compilation cascade issues. MW: I build 5 exes at once, so it always looks fast to me. AG: MW same makefile gen as mkmf. MW: More readable makefile than automake? AG: Yes. More readable than automake. AH: When the magic works CMake is great, when it doesn’t it is a pain, but the magic is worth it. Also supports multiple architectures.


RF: Aidan can you approve change to FAFMIP. Starting to get conflicts. Ryan’s changes put it all in conflict. Riccardo has disappeared, but Fabio’s changes so it is all the same bit for bit. AH: Current conflict in ocean_frazil. RF: Because you put Ryan’s changes in. AH: Sorry. Could rebase on Ryan’s changes. Maybe pull in Ryan’s changes. AG: Could check out the branch, make changes and push to the branch. AH: I’ll try doing it directly on GitHub, get back to you about it. RF: Get that done and I can finish up some of the WOMBAT stuff. With the ESM model I also have to make some changes to CICE. A couple of design things with the number of fields that are passed. Hard wired at the moment. A couple of issues there. Have a chat at a later stage. Rather than hard wire fields, flexibility, test error codes, make compatible with namcouple, so can be done on the fly. Also feed into BGC Hakase is putting into CICE. Need to pass BGC fields between the two modules. Rather than having a plethora of drivers, or CPP directives, better ways to do it.
AH: Made that change on GitHub and merged it. Once checks are finished will accept the PR.
MW: Been working on a test with MOM6, where we turn on every diagnostic, fantastic for finding bugs. Found nearly 2 dozen bugs. Don’t actually register the diagnostics with FMS, just spoof the whole thing at the diag_mediator level, which is a wrapper around the diag manager. Interesting if this could be translated to MOM5. Don’t know a natural way to do it, but might be worth some thought at some point. RF: Code you’re putting into MOM6, not the diagnostic manager? MW: Yes. FMS moves too slow, very conservative, don’t have a robust test framework so are worried about putting in changes. There are some hints that maybe this code could be shared with MOM5. Lots more in there than just this. Just raising it as food for thought. AK: Put as an issue? MW: Opposed to those sorts of issues, but you can if you want.
AK: Want to set up new vanilla reference versions of the 1 and 0.25 deg ACCESS-OM2 models. The forcing on those use 2nd order conservative interpolation. There are overshoots for some fields which have to be positive definite. Would like 1st order conservative for some fields. Do they exist? NH: They should be there, we were using 1st order for a long time, and should be in the input directory. Not sure how well they are named. Should say in the filename, have a look and if you can’t find them we can recreate them.

Technical Working Group Meeting, June 2019


Date: 19th June, 2019

  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK)  COSIMA, ANU
  • Russ Fiedler (RF), Matt Chamberlain(MC) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Peter Dobrohotoff (PD), CSIRO Aspendale
  • James Monroe (JM) Memorial University


RF: FAFMIP into MOM. Riccardo will do his tests. Don’t expect issues. AH: Did Fabio notice problems? RF: Started with ice formation used by ACCESS wasn’t coded up. Did that and then noticed way things were being done didn’t match with what was in the literature. Mismatch between what Griffies did and Riccardo wrote. Now at a stage where that is now consistent. Talking with Trevor McDougall about equation of state. Coded in MOM not totally consistent with what protocol says should be done. All groups do it a little differently. How badly can we violate the freezing condition and still get reasonable results. If you do this incorrectly can fall below freezing and not form frazil. Behaves ok down to -3 degrees. Hopefully won’t get that far. Are other approaches, have to have a think about that. Will stick with what is done currently. AH: Modifications? RF: Look at other mods to see if we can do it more consistently. AK: More consistently without additional tracer? RF: Still need additional tracer, but more consistent, temp and redistributed heat tracers see the same values of frazil. The way Griffies et al constructed get slightly different values. Not completely clean. Can’t get runaway with one of the tracers. Safe but not right way. Other ways: fix problem with implicit diffusion. Code as it stands is at least consistent with what has been written up. MC: None of these full TEOS-10? RF: Yes TEOS-10. Also had to fix the conversions to potential temperature. MC: Dealing with salinity etc? RF: Simplified version. Need these changes to do FAFMIP correctly.
AH: Any other ramifications? RF: None. All changes only take place in this style of experiment. Everything separate from other experiments. Only issue was prognostic versus pot temp.
AH: Merged independent of the WOMBAT stuff? RF: No. WOMBAT stuff relies on changes on ocean_sbc. Have to rebase. Get FAFMIP in first.


 RF: Haven’t had a chance to sit with Matear and test it properly. Just a few changes needed from current code. Hopefully pin down Matear. AH: Hakase with WOMBAT  in tenth? RF: Yes. Hakase will test. Currently inputting winds via a file rather than in through coupler. MC: Richard Matear is working directly with Hakase.
RF: Few lines in the coupler that I have to add and a namelist item. In namcouple file need to pass 10m winds. It is in CM2 code, but not in OM2. AH: Can Hakase work with ice BGC stuff in his current setup? Is this slowing him down? RF: No idea.
AH: Few weeks? RF: Have to rebase WOMBAT stuff.

CICE Mushy ice

RF: Code suddenly got changed and altered and no-one knew why? AH: Nick been keeping our codebase up to CICE6. RF: He made other changes that caused problems. That code also moved to CICE5 svn repository. AH: Backporting to CICE5? A lot of assumed logic in those code changes. RF: Have to be familiar with how POP code makes salinity changes. Doesn’t go through the surface like MOM. The clause where “this is done elsewhere”, not true for all models. Nowhere in the code those salt fluxes are being calculated. AK: Proof in runs, results show drift. RF: Looking at it, needs that if clause removed for coupling to MOM. AH: We’re not part of any CICE6 test suite so they can’t spot errors. AK: Elizabeth Hunke said consortium was open, anyone can join. Have a comprehensive testing regimen. Get more involved so they test our use cases? AH: Definitely need more oversight on code changes into CICE. JM: Any testing when code changes added to CICE5? AH: Not currently no. Nic has some scheduled Jenkins tests but not sure on the status of those.

AK: Hit problem as using mushy ice. Wouldn’t see it otherwise. Using to overcome bug in other scheme, but don’t really want to use it. Slow, don’t need. AH: Can we fix it? AK: Iterative solver fails in high res case. Happens in fresh water regions with low ice concentration. Had intended to dig down more. AH: Would struggle to find this bug anyway as we wouldn’t routinely test tenth.
AH: Fixed now. AK: Not sure about any other problems with changing parameter setting. Took a lot of digging. AH: don’t want science changes without reason

Ob runoff

AK: Not sure how important this is. Shows how runoff code can fail. Cut away a lot of the Ob estuary due to small grid cells causing instabilities. Runoff is done on the fly. Find all runoff that is on land, move to  nearest coastline. Then check for high runoff and spread out if over threshold. Some runoff goes to embayment to the west. Changes to the Ob means that is the nearest bit of ocean. GitHub issue
Not sure how important it is. Similar issue with spreading out. Uses kdtree to find neighbouring points. Doesn’t account if there is land between those points. JM: Can tunnel. AK: What could be done to make land impassable.
JM: Resolution on that discussion?
AK: Not sure high enough priority to spend time on. AH: Use connectivity? Like used to find isolated water bodies. Move land runoff to nearest connected wet cell. AK: Depends on runoff being ocean in the first place? AH: Yes. RF: If can get to right place and just smear it out and use neighbouring ocean points. AH: Is all JRA55 runoff currently on a wet cell on the JRA55 grid? AK: Don’t know if it is a wet cell, it is on the coast. AH: Need to look into that.
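The connectivity idea above can be sketched as a breadth-first search that only steps through wet cells, so land is impassable when choosing cells to spread runoff over. This is an illustration only, not the runoff code in ACCESS-OM2: the function name, the list-of-lists mask and the 4-neighbour connectivity are assumptions, and periodic or tripole boundaries are ignored.

```python
from collections import deque


def reachable_wet_cells(wet, start, max_steps):
    """Wet cells reachable from `start` through water in <= max_steps moves.

    `wet` is a 2D list of truthy (ocean) / falsy (land) values and `start`
    is assumed to be a wet (i, j) cell. Uses 4-neighbour connectivity.
    """
    nrows, ncols = len(wet), len(wet[0])
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (i, j), steps = queue.popleft()
        if steps == max_steps:
            continue  # do not expand beyond the step limit
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            inside = 0 <= ni < nrows and 0 <= nj < ncols
            if inside and wet[ni][nj] and (ni, nj) not in seen:
                seen.add((ni, nj))
                queue.append(((ni, nj), steps + 1))
    return seen


if __name__ == "__main__":
    # Two channels separated by a strip of land (middle column).
    wet = [
        [1, 0, 1],
        [1, 0, 1],
        [1, 0, 1],
    ]
    print(sorted(reachable_wet_cells(wet, (0, 0), 2)))
```

In the example grid the cell (0, 2) is only two columns away from the start, so a pure distance search would pick it up, but the search through water never reaches it because the middle column is land.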
AH: How important? AK: Not paying close attention to the Arctic. Correct volume of fresh water, just in slightly wrong location. Already severe liberties at that location. Points to failure mode of this method. Can cross land.

Splitting FMS and other components

AH: You want to talk about other components as well Russ?
RF: If we start doing things like that to MOM repo. Will that affect anyone else who already has stuff from there? Cause problems if they want to update if we move to different setup?
AK:  Proposal to put FMS codebase into different repo? AH: Yes. AH: Can’t compile without pulling from another repo. RF: Not sure how it would all work. Use submodules? JM: In submodule right now? RF: Not for MOM5.
AH: I proposed to use CMake to create an alternate way of compiling to pull in those libraries from external repos. Could keep the FMS directory in the repo, but at some point the MOM5 code may use features in an updated FMS that are incompatible. However, they can always pull from a previous commit. Could tag a commit as the last one that had FMS included. Marshall did update FMS in the past. Desirable to go this way, to have a tighter coupling with changes in FMS, put in pull requests to main repo for features we want.
AH: Got CMake working for half the builds. Super simple to swap out external library, already compile it separately. Will finish this so people can test as proof of concept.

Langmuir KPP

AK: Progress with ACCESS-CM2. Turned on langmuir param for kpp and improved Antarctic intermediate water. Should we turn it on for OM2? RF: Our coupled runs got improvement in southern ocean. Getting shallow summer mixed layers. Helped deepen them a little bit. Different types of simulations, but work in the right direction.
RF: Not sure if that is an issue mixed layers in southern ocean over summer? If shallow, could be good. AH: Turn on/off or parameter? RF: Just turn on/off. Pretty sure I changed ACCESSOM2 to get wind coming through. Might need change in namcouple. AK: Need wind velocity as well as stress through coupler? RF: Two ways. Both have been enabled. Standard to pass 10m winds as well as stresses. Other way, if don’t pass winds, flag in kpp scheme can derive 10m winds. MOM6 does it that way. Pass through stress and calculated 10m. AH: Would still work without passing winds? RF: If forcing model with stresses and don’t have winds, this is an alternate way. Not being used currently as most models can pass wind.
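As a rough illustration of deriving a 10m wind speed from stress, the sketch below inverts the standard bulk formula |tau| = rho_air * C_d * U10^2 with a constant drag coefficient. This is not necessarily what the KPP scheme in MOM does: the function name and the values of rho_air and drag_coeff are assumptions chosen for illustration, and real schemes often use a wind-dependent drag coefficient.

```python
import math


def wind_speed_from_stress(taux, tauy, rho_air=1.22, drag_coeff=1.3e-3):
    """Estimate 10m wind speed (m/s) from wind stress components (N/m^2).

    Inverts |tau| = rho_air * drag_coeff * U10**2, assuming a constant drag
    coefficient. Both default constants are illustrative assumptions.
    """
    tau = math.hypot(taux, tauy)  # stress magnitude
    return math.sqrt(tau / (rho_air * drag_coeff))
```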
AK: Might be a good time to compare OM2 and CM2. Perhaps there are beneficial changes from one or the other? Might just be model specific changes?  AH: How would this happen? AK: Maybe a meeting. Sent an email to Dave and Peter. Look at the namelists and input files.

Other updates

PD: Not up too much. Interested in getting models aligned and best outcomes for both. Maybe have a small VC and discuss. A fairly complicated set of outputs, suites etc. Can be difficult navigating this structure. Definitely encourage talking about it.
AH: What is the status of your runs? PD: PI control is up to year 950. A lot of that is spinup. Historical forked around yr 900, and a 4x historical. This is CM2. No carbon cycle. Two submissions, ACCESS-ESM-1.5. Old atmosphere, cice, updated MOM. ACCESS-CM2 is much newer atmosphere, full aerosol scheme, 5-6x slower, but no carbon cycle. ESM is a lot further along. CM2 is not as advanced. Took some time to reach equilibrium. AH: Happy with results? PD: Yeah, seems pretty good. Climate sensitivity seems about right. Sensitivity is a lot higher for CMIP6 than CMIP5.
JM: Will attend meetings going forward. To complement some stuff Angus is doing on the cookbook.

Technical Working Group Meeting, May 2019


Date: 15th May, 2019

  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK)  COSIMA, ANU
  • Marshall Ward (MW) GFDL
  • Russ Fiedler (RF), Matt Chamberlain(MC) CSIRO Hobart
  • Nic Hannah  (NH) Double Precision
  • Rui Yang (RY) NCI


– Follow up on migrating FMS to an external library
– WOMBAT in harmonised MOM update and testing
– Tenth load balancing
– CICE IO bound in high core counts

CICE IO bound in high core counts

AK: Runs with new CICE executables NH compiled a while ago. Performance slowdown with compression level 5. Tested with level 1: a few % larger in size, 2500s -> 1800s for IO time. 1300s without compression. Compresses well with low value because a lot of missing data with ice.
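AK's point that ice output compresses well even at a low level can be illustrated with the standard `zlib` module, which implements the same deflate algorithm that netCDF4 compression uses. This is a minimal sketch under assumed conditions: the grid size, the 10% ice coverage and the fill value are invented for the example and are not taken from the CICE output.

```python
import random
import struct
import zlib

FILL = 1.0e20          # assumed missing-value marker
NROWS, NCOLS = 200, 500
ICE_ROWS = 20          # assume only 10% of rows hold real ice data

random.seed(0)
values = []
for j in range(NROWS):
    for i in range(NCOLS):
        # Real (random) data in the "ice" rows, fill value everywhere else.
        values.append(random.random() if j < ICE_ROWS else FILL)

# Pack as 32-bit floats, then compress at the two levels discussed.
raw = struct.pack(f"{len(values)}f", *values)
sizes = {level: len(zlib.compress(raw, level)) for level in (1, 5)}
ratios = {level: size / len(raw) for level, size in sizes.items()}
```

Comparing `ratios[1]` with `ratios[5]` shows how little the higher level gains when most of the field is a repeated fill value; the timings quoted in the meeting are a separate matter and are not reproduced by this sketch.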
NH: Went from netCDF3 to netCDF4. Might be worth trying no compression. AK: have a run with compression level zero. RF: Does impact on walltime. MOM is waiting. Usually have CICE waiting on MOM, but when outputting is the other way. MW: Compressing MOM before, now both? NH: Compressing and daily output an issue. AH: What is the chunking? RF: Uses default. AH: Some libraries choose weird values for time value? RF: No funny business, all sensible. RF: All these point to point gather, maybe not efficient. MW: Do you know where the time taken is? RF: Slowdown, but not sure split between gather and write. NH: Breaking new ground, daily output and running at scale, and unusual tile distribution. Increases the COMMS to gather. So many different new things. MW: On sectrobin still? AH: 10% of total runtime.
NH: With MOM do all this with post-processing to get performance of model best as possible. Anything we do slowing model as whole, should post-process. Didn’t think about that option when put change in. If slowing down as a whole, back out change and work out post-processing step. AK: Half the data in daily files is static. Totally unnecessary. Made issue to maybe output static data to a file once. RF: Aggregate daily files to monthly? AK: Slows down output from model. Less compressible? RF: Highly correlated, will compress easily. AH: How much extra wait time? RF: The whole write time. AK: 25 or 18% in MOM runtime. AH: Monthly output issue disappears? RF: Yes. RY: CICE write to single file? RF: Yes through one processor. RY: Can we do it like MOM, each processor writes data to its own file. NH: Yes, good idea, but more complicated than MOM. CICE tiles are not located close to each other in space. RF: Could use PIO interface. Not compatible with centrally installed netCDF libraries. Bugs in version of HDF. Need OpenMPI > 1.10.4  and netCDF > 4.6.1. MW: PIO good candidate, RY can help. CICE developers looking into this? Stayed in touch with them? NH: Look at CICE6 GitHub. RF: Looked, but no active development on IO in any fundamental way.
NH: If we did decide to go that way, good opportunity to feed that back to CICE community.
MW: NCAR as a developer of PIO, keen to get it into other models. If CICE is on their radar might get some feedback there. RY: MOM has IO layer a bit like PIO. MW: Not a good idea to use PIO in MOM6.
RY: Tried PIO in MOM and found it was not a good candidate. MW: Yeah, MOM6 was already doing something like that.
RY: Parallel compression will be supported in future in netCDF.
RY: Been experimenting with my own version of library and got some positive results.
End result: take compression out, take out static fields. Post processing. Is anyone using daily fields? RF: We’re interested in daily ice fields. Using data assimilation. MW: Shorter runs though? RF: 20 years.
NH: Instead of writing individual daily files, should write to a single file, static fields won’t be replicated, maybe benefit from some netCDF buffering. AH: Big code change? NH: Not sure. AK: Has a file naming convention for different frequencies. Frequency part of filename. NH: Saying could already output daily into monthly files? AK: No, filename encodes time and frequency. Doesn’t seem to write repeatedly to any of its output files. AH: Define unlimited dimension.
NH: Make a GitHub issue. If high priority could get some time. MW: Make the issue in the CICE repo, inform them what we’re doing. They mentioned an NCAR community board.
AH: Make a namelist option and recompile? Compression level as option?

Tenth load balancing

AK: RF suggested a smaller core count of 799. Doesn’t change wall time which is a win. How low can we go? RF: Worked out a few more configs. Slight change of tile size, 720 would be ok. 36×36 or 40×30. Running some quick tests with tool under /short/v45/masking. Run and output masks and where tiles get located. Also number of processors/blocks you need. AH: Put code on COSIMA GitHub? RF: Just a quick little thing. AH: Yes but useful.
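RF's masking tool is not shown in these minutes. As a rough illustration of the underlying idea (counting how many tiles of a given size contain any ocean, since land-only tiles need no processor), here is a minimal sketch. The function name `count_ocean_blocks` and the toy mask are hypothetical and far smaller than the real grid.

```python
def count_ocean_blocks(mask, block_ny, block_nx):
    """Count blocks containing at least one ocean cell (mask value 1 = ocean, 0 = land)."""
    ny, nx = len(mask), len(mask[0])
    total = 0
    ocean = 0
    for j0 in range(0, ny, block_ny):
        for i0 in range(0, nx, block_nx):
            total += 1
            rows = mask[j0:j0 + block_ny]
            # A block is kept if any cell inside it is ocean.
            if any(cell for row in rows for cell in row[i0:i0 + block_nx]):
                ocean += 1
    return ocean, total

# Toy 4x6 mask, land increasing towards the lower right.
toy_mask = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
small_blocks = count_ocean_blocks(toy_mask, 2, 2)   # (ocean blocks, total blocks)
wide_blocks = count_ocean_blocks(toy_mask, 2, 3)
```

Trying several block shapes on the same mask, as done here with 2×2 and 2×3, mirrors the comparison of candidate tile sizes discussed above.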
AH: Down from 1380. Big win. Total core count? AK: not sure. RF: Total just over 5000. AH: Still running on normalbw? AK: Yes. AH: Wait on normal crazy. RF: Look at skylake? Usually empty. RY: Yes new nodes, not large total core count. AK: Get 6mo/submit without daily outputs. Daily over by 30/45mins with ice. dt=600s.
NH: If no-one else to fix, assign NH to issue.


RF: Got Matear up to speed. Ran a few tests. One or two bugs yet to be fixed. A couple of fields that weren’t coming through from OASIS properly. Was the ice field, wasn’t coming through correctly. Got it going with external fields forcing it. Figured out changes to get it running properly with full ACCESS mode. Running some test cases after bugs fixed. MC: Now running with calculated gas exchange coefficients. RF: The way it was originally written the way fields were ingested into MOM. MC: Using the same wind field in BGC and wind mixing? RF: Yes, all together. MC: Level of the wind? In ACCESS-ESM was getting lowest atmospheric wind. MC: CICE will send a 10m wind through OASIS? RF: Not FMS coupler, this is just OASIS 10m wind. MC: ACCESS-ESM case?
AH: Hakase could be used as a guinea pig. Any of these changes affect ACCESS-CM2? RF: Shouldn’t. AH: Do we need to do any bit repro tests? RF: Shouldn’t change anything.

migrating FMS to an external library

AH: I put my hand up to do the change and test.
MW: FMS updated to Xanadu a couple of weeks ago. AH: So a good time to try it out. MW: Already tried it, put some MOM patches in to fix some issues. AH: On the GFDL FMS repo? MW: They have opted not to take the parallel netCDF using MPI IO patch RY and I worked on. Have set up a branch with parallel IO, and Xanadu has been merged into that branch. May want to use branch with parallel netCDF extensions. Ongoing conversation with this. They may merge it in. Can use what you want. Your call as to what to use.
RF: Any whitespace issues? MW: FMS and MOM6 live on different planets. They don’t interact much. Don’t collaborate with FMS guys.
MW: Alistair getting miffed at the red buttons on the jenkins server. He/I will look at some GFDL independent solution. Happy for NH to be involved as much or as little as he wants. NH: They should be more blue than red. MW: Happened in March due to checksumming? NH: Bitrot, Jenkins is fragile. Scott often fixes it. Good idea, happy to help in any way. May be easier to set up on raijin. Does one qsub and runs them all under one sub. MW: slurm is sort of designed to do that. NH: slurm is awesome. MW: slurm is better. NH: like it a lot more. MW: Good for running multiple jobs per submission. Blurs the line between MPI and scheduler. Some sort of meta-scheduling. Place jobs on ranks within the request. AH: More flexibility.


  • Update MOM build to use external FMS library (CMake) – AH
  • Finish WOMBAT integration – RF
  • Make CICE compression issues – AK

Technical Working Group Meeting, April 2019


Date: 10th April, 2019

  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK)  COSIMA, ANU
  • Marshall Ward (MW) GFDL
  • Russ Fiedler (RF), Matt Chamberlain(MC) CSIRO Hobart
  • Nic Hannah (Double Precision)


MW: Discovered travis test GFDL uses for FMS has been failing for six months. MW fault. Introduced new MPI function. Function doesn’t exist in openMPI 1.6, which is what travis uses. Doesn’t show up in MOM5 as changes not in there. Solution was to switch to MPICH. Bleeding edge travis only uses openMPI 1.10.


AH: Link FMS rather than have in repo. NH: I agree. MW: Still think subtree best solution. Now FMS has dedicated automake build can formally install as module? MW: Had a long chat about this with Alistair. Not hot on submodule/subtree. NH: Just have in CMake or a script. AH: Makes sense.
RF: One of my jobs with decadal project is to have MOM5 linked with AM4. Uses a more recent version of FMS. Will be useful for RF. Have to get MOM5 talking with FMS used in AM4. MW: Have run MOM5 with latest FMS. RF: Just making sure no surprises, changes of interfaces. Not just FMS, also other bits and pieces. AH: Auto testing with multiple FMS.
MW: If we go path of building independent libraries, not sure how C world tracks this? ABI changes? How do you manage binary compatibilities? NH: Did that with OASIS, and not sure it was worthwhile. MW: C programs don’t seem to have these issues. NH: Using precompiled libs necessary in linux, for us no reason not to compile FMS when compiling MOM. MW: Not keeping public library? NH: More complexity than necessary. We’re just talking about splitting source code out into separate repo. Good idea. MW: MOM6 has FMS repo, and a macro repo above that builds everything. Not sure we want to go that way. NH: access om2 works that way too but experiment repos are separate. MW: Maybe submodules / subtrees aren’t so bad. NH: MOM5 repo can have a build script that references a build script for FMS. MW: Doesn’t CMake have some functionality to check it out for you?
AH: Finish off CMake build scripts and add in FMS stuff.
MW: What they do with MOM6 is having issues. Will bring up with them. Maybe some convergence on library dependencies. AH: Don’t favour central lib install with MPI dependencies.


RF: Not much to report. Make sure WOMBAT can be called in MOM-SIS. Only outstanding issue. Just changing a few if statements. MC: Comfortable that it will run in ESM framework? RF: Not sure who is going to test? MC: OM2? RF: Should run in OM2.
MC: Richard Matear went to visit AH. I haven’t run anything yet. With experiments running under payu ready to go. OM2 test with WOMBAT.
AH: Is there a PR for these code changes? Make a PR. RF: Maybe said to do that. Split up testing to avoid duplication.
MC: Hakase wants to run this too I believe?

Tenth Model

AK: Set up RYF for Spence. Run 20 years. Looking at test bed for improved config. Improved bathy from RF. Conservative temp. Running at half the cost of previous config. 10Mh/yr, 60-65 KSU. Speed up from higher ocean timestep, and ice is now 2 time steps per ocean timestep, compared to 3. Due to removal of fine cells in bathymetry in tripole. Wanted to use non-mushy ice, but low ice conc in Baltic fails to converge thermodynamic temp profile. Should converge in a few steps, but limit at 100 and still doesn’t converge. Paul is using mushy. Had a run with non-mushy up top crash. Spinup7 is mushy, spinup8 is non-mushy. 10-15% extra cost for mushy ice. Not sure if we’re CICE or MOM bound. Other resolutions are not using mushy. Want to set up an IAF tenth run starting in 1958. Can afford with cheaper model.
AK: TEOS-10, not sure if we want to use. Need absolute salinity and cons temp.
AK: Noticed gyres are much too weak in all resolutions. Looks like have to be careful with JRA55. Did a test with 0.25 with abs wind rather than relative. No change. Florida current 65%, EAC about 70%. Gulf stream is not separating properly. Mean position ok, but too variable. Maybe insufficient momentum. Doesn’t go around grand banks properly. Causes SST biases. Not sure how much to fix before IAF.
NH: All resolutions? AK: All resolutions are too weak. Gulf stream separation is ok in tenth in average, but too much variance. Mean position in 0.25 is really bad. Biases around grand banks similar in all resolutions. NH: Improving separation improve biases? AK: Maybe. SSH is localised in model, but stretched out in obs.
NH: Is this specific to JRA55? Does it happen with CORE forcing as well? AK: Don’t know. Griffies said others find gyres a bit weak with JRA55. It uses scatterometer winds, which are relative to an eddying ocean. Not in the same location as a model. JRA55 paper suggests adding climatological mean current to the wind to force the ocean. AH: Should that be in the product? AK: Griffies says people aren’t too keen on Tsujino’s suggestion. AH: Diagnose wind stress from 0.25 test? AK: yes. 10-20% change in stress in western boundary currents and southern ocean where large mean currents. Stress changes are in quite small areas, not a big effect on gyres.
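The size of the stress change AK quotes is consistent with the quadratic dependence of bulk wind stress on the wind relative to the ocean surface. A minimal sketch under assumed values (constant drag coefficient, a 10 m/s wind and an aligned 1 m/s current, none of which come from the meeting or from the model configuration):

```python
RHO_AIR = 1.2   # kg m^-3, assumed constant air density
CD = 1.3e-3     # assumed constant drag coefficient

def wind_stress(u_air, u_ocean=0.0):
    """Bulk stress (N m^-2) along one axis from the wind relative to the ocean surface."""
    du = u_air - u_ocean
    return RHO_AIR * CD * abs(du) * du

absolute = wind_stress(10.0)        # ocean current ignored
relative = wind_stress(10.0, 1.0)   # 1 m/s current aligned with the wind
reduction = 1.0 - relative / absolute   # fractional drop in stress, 1 - (9/10)**2
```

With these numbers the reduction is about 19%, and it only applies where the mean current is large, which fits the remark that the changes are confined to small areas.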
MC: Do you recall what the EAC numbers were in model compared to obs? I thought we had 20Sv which is similar to OFAM/BRAN. AK: Obs: 18.7, 17.5 and 17.2 Sv about 2000m in models. 22.1 ± 7.5 from a mooring. Florida current is 30% too low. Well observed.
NH: What is big challenge in future? AK: Not sure how much to change before next IAF. Will put out a call for diagnostics. Also explain config and see if people have an issue with that.
AH: Doesn’t MOM5 not fully support TEOS-10? RF: Not obvious to user that can use TEOS-10. Kind of fudged. Proper way is to carry an extra tracer. Have preformed salinity and an adjustment factor to create abs salinity. Another way is to have abs salinity as a single variable and adjustment factor is zero. To use full TEOS-10 in MOM5, need 2 tracers. If you do it the same as the rest of the world would have a zero tracer. Don’t want a wasteful tracer.
RF: There is a newer way to parameterise the equation of state. Need updated. AH: New module? RF: Yes, just switch.


RF: Frazil not being redistributed. Needs fixing. AH: Affect other runs? RF: No just FAFMIP.

Technical Working Group Meeting, February 2019


Date: 14th February, 2019

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK)  COSIMA, ANU
  • Russ Fiedler (RF), Matt Chamberlain(MC) CSIRO Hobart
  • Peter Dobrohotoff (PD), CSIRO Aspendale

TWG Meta Stuff

AH will redo MOM5 governance doc for next meeting.
AH finding minutes a burden, MW suggested exploring other options.
MW: Will leave at the end of March. Will maybe try and attend. Given time.
Even in anarchy someone has to send out the email.

CICE Meeting

MW: CICE meeting. AK going. Going to Hobart Ocean Workshop? MC & RF not registered, might drop in.
MW: Also a VC: chat with Elizabeth Hunke. Who is going to attend? Just me? AK: Yes. MW: AH  come too? Ben Evans asked Rui to come. Not sure about NH. Assume interested.
MW: What to ask her about? Agenda? What motivated it? AK: Just that Elizabeth is around and could chat. AK: Any point me turning up a day before, more talking about Petra, but Petra thought it might be useful. Could show her how we set stuff up, some results?
RF: Anyone from Aspendale coming down? PD: Not sure. MC: Simon Marsland and Siobhan on the attendance list.
MW: Might ask about using latest GitHub branch (cice6). If we were to use it what should we do? Incorporate changes from OM2 codebase? Others more interested in physics?
AK: Might be interested in scaling work. Hoping to put some in my talk. MW: Fine with me.
MW: Not done as much as Tony Craig (?) on load balancing.
Monday 18th @3pm with Elizabeth (2 hours)
AK: Valuable networking opportunity.
MW: Would be great for NH to come.
MW: Maybe AK give a run down of some of the runs, start from there.

MOM5 Pull Requests

MW: RF been busy
RF: Bug in one of those in GW scheme. Was testing temperature in the wrong direction. Also something odd happens to temp rebinning at the bottom of a level compared to density. Missing value is zero. Interpolates first non-zero temperature to below bottom level. Because density in the rock is zero, can’t get a bounding. Problem with the way the diagnostic is originally done.
RF: Calculates transport in density one don’t account for transport in lower half of bottom cell, but temperature remapping you do. MW: Haven’t looked at the patch yet. Is this what Ryan Holmes was asking about? RF: This would speed up Ryan’s remapping. His PR was different. Trying to remap onto different levels. He sort of fudged the code. Take code from remapping onto density levels, and made something spoof, pretends neutral density is temp or salt. Don’t like what he’s done. Probably works, but not totally sure, but my optimisations might break some of the things he does. AH: Your optimisations are field dependent? RF: Yes. Assume it is density, with assumption density increases as you get deeper.  MW: He added a neutral density thing? RF: Trying to trick the code into something else.
RF: Can’t do it on more than one variable.
AH: Might be worth telling Ryan this might break his code.
RF: I thought he had put the commit in there. AH: No deleted the PR. He doesn’t have commit rights.
MW: Has a hard coded neutral density point that he has defined.
AH: RF still thought worthwhile? RF: Yeah, have a general thing, remap to level? A lot of code would be copy/paste. Could be a lot of work. AH: classes of rebinning?
MW: Not sure I understand exactly what RF’s commit does. Not sure I can add value.
RF: Just a lot faster.
AH: How did you pick up the error? RF: Was worried about it. Hadn’t checked rebinning to temperature. Wasn’t sure I had accounted for reverse in signs. In transport beta _ gm. Neutral physics utilities module. Checking for maximum and minimum temperatures on wrong levels. Hadn’t tested that diagnostic. Missed temperature. When tested failed. Doesn’t alter results of simulation, diagnostic slightly wrong. Other things were bit repro, all checksums were identical.
AH: So when code changes are made to diagnostics make sure those diagnostics are tested. Make sure we paste in pics of diagnostics. Make sure to use double precision in `diag_table`.

MOM5 Governance

Last month agreed to tackle PRs. MW: Paul never answered. Other didn’t answer. AH: AK didn’t answer! 😉
MW: A lot of weird hard constants in FMS. Data structures are weird.
MW: Other PRs when we got no answer? Ask for an update after a month without interaction? Have some policy? Paul looks more valuable. Other one is more FMS. Could call phone.
General approach for non-responding PRs: Get in contact again. Warn it will be closed. Close and say they can reopen.
MW: Sometimes got good ideas with poor implementation, accepted and completely reimplemented. RF: Short one best to redo a different way, and reject the FMS stuff. Contact Paul and get it done?
MW: No answer after prolonged time, incorporate good ideas in a different branch.
AH: Why coding now? RF: Had these ideas for ages, but noticed low hanging fruit. Remapping and submeso scale. Knew we could make significant time savings. Knew about these ages ago. Similar with tidal mixing. AH: Uses MOM timings? RF: It was slow, and looked at it and wondered about looping. MC: With changes what improvements? RF: 20-30% in each module. I run short cases, so data writing might dominate a bit. Will depend on the size of the model. Time spent on each tile proportional to mixed layer. MW: Shallow levels will be a big improvement? RF: yes. MW: Not iterating where there aren’t values? RF: Yes. Two types of tests, check if entire tile can be stopped, other times if a latitude can be stopped. RF: Did test of 1200 cpu job on OFAM grid, took 30% off those routines.
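The two kinds of test RF mentions (stop the whole tile, or skip a latitude row) can be sketched as early exits in a level loop. This is an illustration only, not RF's patch: the function name, the `active_levels` array (number of levels needing work in each column) and the toy values are assumptions.

```python
def count_work(active_levels, nk):
    """Count cells visited when levels below each column's active depth are skipped.

    active_levels[j][i] is the number of levels needing work in column (j, i).
    """
    visited = 0
    for k in range(nk):
        # Tile-level test: stop once no column in the tile is active at this depth.
        if all(k >= n for row in active_levels for n in row):
            break
        for row in active_levels:
            # Latitude-level test: skip a row with no active column at this depth.
            if all(k >= n for n in row):
                continue
            for n in row:
                if k < n:
                    visited += 1   # stand-in for the real per-cell computation
    return visited

toy_levels = [[2, 1], [0, 0]]        # 2x2 tile, second row needs no work at all
trimmed = count_work(toy_levels, 5)  # cells visited with the early exits
naive = 5 * 2 * 2                    # cells visited by a full nk x nj x ni loop
```

The work done scales with the active depth in each tile rather than the full column, which is the proportionality to mixed layer depth described above.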
AH: submeso is 10% of total ocean runtime.
RF: Starting a big run, good time to get it in.
MW: Sometimes said MOM was well balanced. Aggressively masks everything.
RF: Imbalance comes through the parameterisation code. KPP, Tidal mixing. Found another weird thing in the barotropic routines. Takes a lot of time. eta and pbot diagnose. No reason to diagnose the pressure at bottom on a u cell. Except if you’re writing the diagnostic. AH: standard for the code to check if diagnostic used before calculating? RF: Required for restart file. Check at restart stage and write it out that time. AH: don’t restarts have to be field_table? RF: No
AH: If they don’t affect science can add to 0.1 at any time.
Ocean eta and pbot diagnose 10% of runtime.
AH: should we prioritise any changes? RF: just the ones I have put in. Others not so much. I’ll fix up the PR. Just got compiled and testing.

netCDF Parallel MPI IO

MW: Parallel IO stuff looking good and nearly done. Getting parallel IO without collation. Even restarts. A few masked cases where things look odd  with completely missing values.
MW: Fill value versus zero over land? If I do mppnccombine intelligently turns zero over land into missing values.
RF: When MOM sends diagnostics sends a mask with the call.
MW: Should land be zero or fill value? RF: should be fill. MC: What about restarts? RF: Used to have zero and then changed. Turned up in the density restarts.
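The zero versus fill value question can be made concrete with a small masking example. This is a minimal sketch, assuming a hypothetical helper `apply_land_mask` and an invented fill value; it is not the FMS or mppnccombine code.

```python
FILL_VALUE = -1.0e20   # assumed missing-value marker

def apply_land_mask(field, mask, fill_value=FILL_VALUE):
    """Return a copy of field with land points (mask == 0) replaced by fill_value."""
    return [
        [value if wet else fill_value for value, wet in zip(frow, mrow)]
        for frow, mrow in zip(field, mask)
    ]

# Zeros over land, plus one genuine 0.0 at an ocean point.
sst = [[0.0, 12.5], [0.0, 0.0]]
mask = [[0, 1], [1, 0]]
masked = apply_land_mask(sst, mask)
```

Using the mask keeps the genuine 0.0 at the ocean point, which a rule that treats every zero as land could not do.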
AH: Performance?
MW: As fast as the number of disks. Can be subtle to configure. Have to balance the nodes with io_layout with ncpus on node. Negligible with 0.25 deg. Write speeds at about speed of lustre (half speed x number of disks).
PD: Fan of missing_value stuff. Parallel IO work from Dale.
MW: Rui will know about timing variance. Worried GFDL will find it slow and reject. Rui looked into compressed parallel IO. Interesting results. Reasonably fast. It’s half the speed of non-compressed. What is the serial (offline) compression time? No idea. AK: Is speed MB/s. Or twice as slow for total data file? MW: Twice as slow as the entire dataset.
MW: Currently uncompressed. Can then compress. RF: Need to work for regional output. MW: Do at FMS level. AH: Should test for regional output. RF: Regional output done by geographic rather than index. If by index would make it easier. MW: If you can get that for a test.



  • Amend MOM5 governance doc (AH)
  • Feedback to RF PRs (MW+AH)
  • Check back on Paul’s PR (MW)


  • Shared google doc on reproducibility strategy (AH)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • CICE and MATM need to output namelists for metadata crawling (AK)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)

Technical Working Group Meeting, December 2018


Date: 11th December 2018

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) and Andy M Hogg (AMH) CLEX, Andrew Kiss (AK)  COSIMA, ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Nic Hannah (NH) Double Precision



MW: Been profiling CICE, score-p profiling doesn’t work. Been timing by time step. Anomalously long time spent at step 72. AH: could it be atmosphere being updated. JRA55 is 3 hourly. Not sure timestep. MW: Seem to have lost my logs. Not sure best way to handle it.

CM2 Harmonisation update

AH: Peter has been testing release candidate. Russ supplied a diag_table which just outputs fields for first 2 time steps which is really good for seeing code issues. Russ found some bugs introduced by me. A couple of logic errors with preprocessor flags and omission of a couple of lines that got lost in translation. Confident latest update has squashed all the bugs. MW: Not old bugs? AH: Did find some old issues. Russ found a stuffed iceberg file. RF: Not related, but is something they were using for CMIP6. AH: Did find some old bugs, had to emulate the lack of reproducibility from the Red Sea salinity fix timing bug to be able to closely reproduce CM2 output. Put a flag in to do the wrong thing to do the same as theirs, will remove before merging. MW: I thought Red Sea fix had been changed to be faster but not reproducible. RF: That’s right, but not issue. This has to do with timing. Aidan fixed it, but not compatible with what they are using. AH: Just need something that reproduces CM2 output.

Narrator: The new way of doing salt fix will reproduce over time steps, but is not bit reproducible with the old algorithm. Don’t see that effect in these tests.

AH: Peter has a test suite which is old CM2, and a copy which uses updated MOM. He compiles the new code manually and runs the two suites side by side. Both use Russ’ diag_table. Just find out which fields don’t match. Most are the same, few different, seem to be affected by the same issue. Once we’re good for a few time steps then maybe look at them after a few months RF: Once chaos starts, hard to say. As long as nothing gross happening. Unless there is something further on with coupling. AH: Yes, look after a month and check it looks close. MW: Not trying to be bit reproducible? AH: Just want to fix my bugs. RF: Make sure you’re getting the same forcing fields. Can see out in the open ocean hardly any change. Just noise. This means we’re close. Saw the outline of where the forcing field is supposed to be. The bug in the forcing field data showed up, which indicated the issue. AH: Once we’ve confirmed fixed, will merge PR and then move on to ESM.
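The side-by-side comparison AH describes amounts to finding which output fields differ between the two suites. A minimal sketch, assuming each run has been reduced to a dictionary of per-field checksums; the function name, field names and checksum strings are hypothetical and not taken from Peter's suites.

```python
def mismatched_fields(reference, candidate):
    """Return sorted names of fields whose checksums differ or that are missing from one run."""
    names = set(reference) | set(candidate)
    return sorted(n for n in names if reference.get(n) != candidate.get(n))

# Hypothetical checksums for the first two time steps of each suite.
old_cm2 = {"temp": "a1f3", "salt": "9c2e", "sfc_hflux": "77b0"}
new_mom = {"temp": "a1f3", "salt": "9c2e", "sfc_hflux": "d412"}
differing = mismatched_fields(old_cm2, new_mom)
```

A short list of differing fields that share a common input, as in the forcing field case RF mentions, points towards a single underlying issue.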

MW: Will the CM2 code remain in step with the MOM5 code? RF: CSIRO Aspendale not doing much code development at the moment. AH: Peter is pulling directly from his GitHub repo, but once it is harmonised they will pull directly from the MOM5 repo. They will want to have a tag and pull from the tag. RF: Yes they will want frozen versions. AH: Should have some automated tests, if we find a bug, should be able to update CM2 code and confirm doesn’t change important answers.

AH: Short answer: Lots of progress. I made lots of bugs and Russ found them. Thanks Russ. NH: Yes thanks Russ.

Model reproducibility and payu bug

NH: working on documentation, wiki, tech report and model paper. Like to do more. Wiki doc easier as a brain dump. Made sure ACCESS-OM2 Jenkins tests are passing. Takes time, something always seems to go wrong. Six tests passing and useful. Repro test working and now reproducing across restarts. Wasn’t working due to 1. payu bug, 2. red sea fix and 3. compiling with repro.
NH: Doing 2 runs with and without that payu bug on 1 and 0.25 degree. Doing 4 years as individual 1 year submits. Make sure bug not too serious. The way the coupling field restarts are done not good. Ocean has to write out a restart for cice. Copy of restart file missing. Had in the past. Refactor with libaccessom2 and change of payu model driver didn’t carry this over. Means every first forcing fields that the ice model gets at the beginning of a new submit for the first coupling step are from the beginning of the run, not the previous run. Ice model is getting the wrong forcing for the first 3 hours.
MW: Has it been fixed? All runs affected? AK: Yes fixed now. Scope which runs affected. Only since YATM? NH: Yes. If your run uses YATM it will have this problem. Around the time the bug introduced. Restructured how config.yaml organised. Created libaccessom2 driver, and bug came in at that point. MW: Used to have oasis driver that did that. NH: Restart repro test existed but failing for other reasons, not being kept up to date. If that test was passing and then started failing, then would have been noticed. Doing a post mortem to see if there is anything significant on a 5 year run. Gut feeling, just in the ice. RF: Will just be the SST that it sees. If running a month at a time significant. Yearly not so important. Also depends what was in the initial coupling field. NH: Initial field correct, probably January. RF: Didn’t get updated for changes to landmasks? NH: Land has been eliminated so not necessary. NH: Any run which is a multiple of 1 year, problem is smaller. AH: Quarter and 1 degree aren’t that affected, tenth most affected. NH: Could do 1 month 1 degree runs. AH: Good idea. Don’t forget about runspersub option, could do 50 in a single submit. MW: payu restart flag now works as well. Could be useful for testing reproducibility. NH: This could be a problem in other cases as well. Existing restart is based on a specific time. May be correct for the specific model it was created for. RF: Should be matched to initial condition, with correct fields. MW: This is a cold start? NH: Needs to be created each time based on start time of your forcing. AH: Write code into model to read in IC and write back out to coupling fields? NH: Something like that might be good.
AK: Bunch of fields SST, SSS, SS velocity, SS slope, frazil ice formation energy. RF: SST and SSS only ones not zero in a cold start. AK: Replace by initial condition for entire experiment NH: There is a single file in the ACCESS-OM2 input directory that all experiments use. NH: Could diff that against what it should have been. MC: That is cold start bug, not so important. Warm start bug fixed? NH: yes fixed in latest version 0.11.2. AK: People aren’t using that? MW: No, because it was broken. Now fixed. AH: Arguably should delete payu versions with the known warm start bug. Or back port the fix? MW: Don’t have framework to back port fix. AH: How many versions affected? NH: Put a warning message/assert in that stops and doesn’t let it load. MW: happy to delete old versions. Some people use a specific payu version. Easy to put warnings in module files. Can also delete old ones. Not a huge problem.
AH: figure out which payu versions affected. Make a decision based on that. MW: Only those with libaccessom2. AH: Don’t delete straight away. Turn off modules first. See if there are people affected. AK: Could be people not using access-om2. AH: yes, but can use new versions. Need to make sure people not using buggy code. AK: Possibly move to new space. AH: yes, but might not be necessary. MW: May be impossible to back port fixes. Driver might not be functional. No problem doing backports, not sure how.
AH/MW: Might not need to back port, should:
  1. Confirm payu/0.11.2 working correctly
  2. Set as default version
  3. Determine which payu versions affected
  4. Turn off affected modules in modulefile and issue message about bug, what module to load and to email climate_help if users still have issues
  5. When complain assess individual cases
  6. If necessary move payu module to non-app path
  7. Delete old versions?
2 week time frame.
MW: People shouldn’t be encouraged not to specify module versions.
MW: Make sure 0.11.2 working correctly. Works for NH and AH. AK a good test for it as running. AK: Not running at the moment. Can we use old mppnccombine with payu/0.11.2. AH: Yes. MW: Use whichever you want. AH: works better for 1 deg in any case.
MW: added a restart directory feature. run 0 uses the restart and reset counters back to zero. AK: Had been copying stuff. MW: I’ve been symlinking and other hideous things. AK: Documents what you did better. AH: Used to have problems with drivers trying to delete symlinks when cleaning up restart directories.
AH: Will finish manifest this week. Chatted with Marshall and reimplementing it a bit differently. Will make NH’s job a lot easier. Run config has all the files, just need to clone and run. NH: awesome.
NH: Want any post-mortem or checking on tenth model for the payu bug? Could do some short 1 month runs. AK: Not sure what we would do with the information. Diagnosis without treatment. Interesting from an academic viewpoint. Planning to do a longer re-run with other changes and will be fixed in that. Interesting to see a couple of months and see scope of issue. Is it negligible? Maybe tell people. AH: Choose a worst case: Southern summer? NH: Ok, might do that.


MW: Been using OpenMPI/3.0.3. Working well. Speeds same as 1.10. uses ucx by default. Turn off all flags, except error aggregate if you want. Can try 3.1.3, had some issues. Likely the version on the next machine.
AH: Test on Jenkins with new OpenMPI? MW: Good idea
MW confirmed that using hyperthreading option in payu is harmless (might even be on by default).



RF: Wanted to get rid of Ob river? 1150 looks good. Need an inlet to keep runoff in correct place. See GitHub issue. Plot shows 0.25 degree cell size is cut off.
AMH: Need to get rid of the Ob. Russ’ plot at 1150m looks good, maybe smooth out corners. RF: Have to look at index space, straight edges, no inlet, things like that. Depth is minimum depth, 10m, a lot more shallow in actuality. AK: Only real reason to keep it is to have the runoff in the right place. Had to smooth to stop model crashing. Main reason to keep is to make sure runoff is mapped correctly. AMH: Where is runoff coming from? Take it too far up and might get remapped to the wrong embayment. Why I like the minimal change. It is stable. AK: Yes since Russ’ fix that stops salinity drop below zero with ice formation. AH: If your map had water at depth zero, as opposed to land, then can follow the water along until it is > 0. Say this is water, use for remapping but not for model. AK: Need a separate file? AH: Not necessarily. Remapping uses its own logic anyway. AK: Remapping takes no account of topography. NH: Could make the distance function smarter, use a directional weight, something like AH suggested, or take into account topography. AK: Go downslope.
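Sketch of the point about remapping ignoring topography: a pure nearest-wet-cell rule in index space, which is why runoff can land in the wrong embayment. This is a minimal illustration only, not the libaccessom2 remapping code; the function name and the 0/1 mask layout are hypothetical.

```python
def nearest_wet_cell(wet, i, j):
    """Return (row, col) of the wet cell closest to (i, j), or None if all dry.

    wet is a list of rows, each a list of 0 (land) or 1 (ocean).
    Distance is plain squared index distance: no topography, no
    direction, no coastline following. Ties go to the first cell
    found in row-major order.
    """
    best = None
    best_d2 = None
    for r, row in enumerate(wet):
        for c, is_wet in enumerate(row):
            if not is_wet:
                continue
            d2 = (r - i) ** 2 + (c - j) ** 2
            if best_d2 is None or d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best
```

A smarter distance function, as NH and AH suggest, would replace the `d2` line with a weight that takes direction or topography into account.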
RF: Other problem was Southampton Island. Just taking out inlet was sufficient. AMH: Keep Island separated from mainland? RF: Yes. Hasn’t been causing problems? AK: No. AMH: Will leave cells smaller than 1150m. AK: Yes, but not too bad. Also an abrupt change in spacing. RF: Yes tripolar grid has discontinuity. AH: Cut off at 1150m, what was it before? AK: 880m. All crashes I had with ice remap error were less than 1100m. Those can be eliminated with closing channels. AMH: Worried about Southampton. AK: Never had issues there. Will be getting new constraints. Had to put damping on Kara Strait, and had issues with seamount off tip of Severny. AMH: Ok, keep it at 1150m and see.
AK: In quarter degree Baffin Island is attached to Canadian mainland. Tenth has much more open water. A lot of it extremely shallow (less than 100m), so unlikely important for sea water transport, but likely important for ice transport. AMH: And therefore fresh water transport. AH: Who will do this? RF: Planning to do it today or tomorrow. AMH: Awesome, thanks.


AMH: getting different numbers between IAF and RYF due to AK needing more ice time steps in IAF case. He can’t run with ndtd=2, so load imbalanced to cice. ndtd=2 with minimal. AK: Time difference is due to value of ndtd. Ruth still getting bad departure points with minimal. Reduces ocean time step for a single submit. I reduced ndtd instead. AMH: This has caused a load imbalance. Not the same as our optimisation that NH targeted. NH used ndtd=2 in optimisation. AK using 50% more time.
MW: What optimisation? AMH: When NH looking at load balancing. AK using 50% more time steps, and taking 50% more time.
NH: Now have a rebalanced tenth minimal with ndtd=3. With the bathymetry changes might not need it. AH: Hold off on that until AK can tell if we need it. AK: May still on occasion need to reduce time step every 5 or 10 years, preferable to ndtd=3. IAF variability means can’t guarantee it will work with every year.
MW: OASIS timing issue. Struggling to define main loop time. Looking at 1 deg, outputting time of every time step. Not literally useful due to overhead. AH: Give you scaling? MW: Not sure.
MW: timing between 170-200ms per step. Step 32 get a big number. 36s in one, 72s in the other. Is it just waiting? Doing IO? Maybe some sort of OASIS thing happening to bootstrap. Get infrequent huge time steps. Run again and don’t get them. Going to remove the largest timestep. Anyone know what is causing this?
NH: What are you profiling? MW: Just the coupling step. Reporting the coupling code.
MW: Does it do a lot of IO on that first coupling step? NH: Yes it does on the first step. What about CICE diagnostics? Are they printing to ice_diag.d. Should be consistent. If it goes away?
RF: CICE does IO through one PE, so does a global collective. MW: Could be IO and MPI collective issues. Not sure if this is legitimate timing or not?
NH: Not sure what the bigger picture is, but find targeting specific routines to look at load imbalance. NH: definitely look into CICE diagnostics.
MW: Timing so inconsistent. AH: Run a bunch and use the minimum. Turn off all diagnostics. AH: For the paper MOM scales well. Need to say something about CICE scaling. Doesn’t need to be the final word. MOM gives some leeway and these are the best configurations …
NH: Happy to help. Can do more fine grained stuff. Do some counting. MW: like score-p but it dies with CICE.
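Sketch of the two timing suggestions above (MW: remove the largest timestep; AH: run a bunch and use the minimum). It assumes per-step coupling times have already been collected into plain lists, one per repeat run; the function name and input layout are hypothetical, not part of any profiling tool mentioned here.

```python
def min_mean_step_time(runs, drop_largest=1):
    """Return the smallest per-run mean step time, in seconds.

    runs is a list of lists of per-step times, one inner list per
    repeat run. In each run the drop_largest slowest steps (e.g. an
    initial IO or bootstrap step) are discarded before averaging,
    then the minimum over runs is taken.
    """
    means = []
    for times in runs:
        if drop_largest >= len(times):
            raise ValueError("not enough steps left after dropping outliers")
        kept = sorted(times)[: len(times) - drop_largest]
        means.append(sum(kept) / len(kept))
    return min(means)
```

Taking the minimum assumes the noise (IO, collectives, waiting) only ever adds time, so the fastest run is the closest to the real cost.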

Grid scale noise

RF: Chris Chapman problem with submeso scale stuff (see issue). There is a smoothing feature in submeso but says it doesn’t reproduce. Think I found a bug. Does smoothing of mixed layer. Possible to put mixed layer into rock with smoothing, doesn’t seem to be any check. Might get some others to look at it. If they agree we might be able to fix it and reduce the checkerboard. AK: This in MOM6? Also in MOM5? RF: There is a namelist parameter, says not to use because not repro, but because buggy. No reason it shouldn’t reproduce.
MW: Is this filtering a numerical mode? AK: KPP purely numerical, so adjacent columns can decouple. RF: Will point out code and see if people agree. AK: Get fixed and could be good to put in for next tenth degree run.

Technical Working Group Meeting, September 2018


Date: 11th September 2018

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD), CSIRO Aspendale

Clean up Actions list


  • Incorporate RF wave mixing update into MOM5 codebase + bug fix (AH)
  • Code harmonisation updates to ACCESS and ESM meetings (PD, RF)
  • Check red sea fix timing is absolute, not relative (AH)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Profile ACCESS-OM2-01 (MW)


  • Follow up with Andy Hogg regarding shared codebase (MW)
  • Nudging code test case (RF)


MW: 4 block success. 16 block didn’t work. sectrobin also didn’t work. Limited perspective on problem.

RF: blow out in time with extra blocks was halo updates. Weakness with round robin. A lot of overhead, no local comms. Maybe 8 tiles/processor might work. Marshall’s profiling showed small number of processors dominated run time. Want to minimise the maximum. That is the limiter

AH: Where are the max tiles?

RF: Seasonal ice near Hudson Bay, Sea of Okhotsk and Aleutian Islands.

MW: Nic used total CPU count less than number of blocks

RF: Could run with more, or less. MW: 80 CPUs less, could solve this.

AH: General strategy to concentrate on not assigning CPUs to the low work (blue areas) and let the high work areas take care of themselves?

RF: Only worried about slowest tile. Nice to have even distribution, but hard to achieve that in practice.

AH: Slowest tiles change over time RF: read in a map of expected ice concentration. Or have a heuristic, say weight by latitude. AH: If identify areas that do very little work, say never want to have many processors there, and free up processors for high work areas.

AK: There are five hot stripes and four cold stripes. Some processors have 5 blocks, some have 4. The outlying busiest ranks are on those hot stripes. If we get rid of striping with more even split, that would have maybe a spike on a lower baseline

RF: About half the processors have 5 about half have 4, request a few more PEs and that would close to balancing this issue.

NH: First attempt 1600 PEs with an even 4 blocks across all. With idealised test case Ocean was not blocking at all. Thought could save a couple of hundred PEs, and there was not a big difference. However Andrew’s real world config is behaving differently. Worth going back up to 1600 and doing an even 4 or even 8 blocks. Assumed wanted everything to be even. Seemed roughly the same to have a mix. This profiling shows I was wrong.

RF: Can easily work out to get exactly 5 blocks per PE. AK: If you give me that number I can try it. NH: 5 across the board is better. Don’t want a single PE doing more work. RF: Slowest one kills you.
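Sketch of the arithmetic behind the 4-versus-5 blocks discussion. It only counts blocks per PE and assumes blocks are dealt out as evenly as possible; it ignores how much ice work each block carries. The function names are hypothetical, and the block count in the note below is made up (only the 1385 PE figure comes from this meeting).

```python
def block_balance(n_blocks, n_pes):
    """Return (max_blocks_per_pe, min_blocks_per_pe, n_pes_holding_max).

    Assumes an as-even-as-possible split of n_blocks over n_pes.
    """
    lo, extra = divmod(n_blocks, n_pes)
    hi = lo + 1 if extra else lo
    n_hi = extra if extra else n_pes
    return hi, lo, n_hi


def pes_for_even_split(n_blocks, blocks_per_pe):
    """Smallest PE count such that no PE holds more than blocks_per_pe blocks."""
    return -(-n_blocks // blocks_per_pe)  # integer ceiling division
```

For example, a made-up 6200 blocks on 1385 PEs leaves 660 PEs with 5 blocks and 725 with 4, whereas 1240 PEs gives exactly 5 each. The slowest PE sets the pace, so the aim is to lower the maximum rather than the average.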

AH: How does the land masking affect it? A thicker stripe in NH? RF: Yes. Did I post a picture of where tiles are allocated? NH: More blocks means getting rid of more land? RF: Lose with communication cost.

NH: In order to get this working I ran into the raijin problem: messages getting lost and deadlocks. When we got 0.1 deg MOM-SIS working had issues with point to point sends and recvs, and Marshall changed that to proper gather to get initialisation working. The gather inside CICE is implemented with point to point sends and recvs. Assume similar. It is doing a send for every block. MW: Andrew’s finished ok? AK: Ran with 30×35. MW: mxm might resolve this problem? NH: Resolved by putting in a barrier after all the sends, otherwise deadlocks. MW: Did you add barriers? NH: Yes to the MPI gather code. MW: Clear that CICE is heavily barriered. NH: Could implement properly with MPI_gather. MW: Caveat didn’t work with the global field. NH: Only does a global gather once when writing out restarts. Not too bad. MW: A lot of MPI ranks? NH: 1600 x number of blocks is the number of sends. MW: So number of messages, not number of ranks. MW: Only added barrier for restart? NH: Could have done that, but added in MPI_gather. Maybe that is bad? Actually didn’t add, just enabled it by defining a preprocessor flag.

AH: Is there an effect that it gets wider in the north that you’re sampling more ice in those areas?

AH: Should we pull out the slowest blocks and see where all the blocks are that contribute to the slowest processors.

RF: Correspond to areas of highest ice concentration. AH: There is ice in Okhotsk in northern summer? RF: Yes.

MW: Arctic and Antarctic are sharing work. RF: How many for this run? MW: 1385 RF: If you run with 1500 or so get an even distribution.

NH: Should decide what is the next step/run?

MW: Two options, massively increase number of blocks, but this is blowing out with comms time, or even divided 5 blocks. RF: Yes that is the one to do next.

AK: sectrobin should solve the communications issue but couldn’t get it to run. NH: Not sure if code needs to change? RF: Test on 1 degree model.

AK: First step to even up current run with 4 or 5 blocks. MW: Should confirm that many blocks is a comms problem and not a tripole issue for example. But this is a research problem.

AK: Will switch to this for 0.1 deg production as it is already better.

NH: New code 1 block per PE gives identical answers to old code. 4 blocks does not give identical answers to old code. Not sure if I should expect it to be the same. Don’t know how CICE works. In terms of coupling it should be the same if you’re coupling to individual blocks or multiple blocks. Not ruling out it should be identical and there is something going wrong. AK: What would make it non-identical? Order of summation? NH: could be something like that. MW: Might be CICE doing a layer calc before doing vertical? Have to know more about CICE. NH: might be worth looking into further so at least we know that we’re not making bugs.

AK: How would I switch to this for the production run? Not bitwise identical? Just check fields look physically reasonable? NH: Hard problem. Can’t see physical difference. Only looking at last few bits of a floating point number. MW: Did an MPI sum on a single rank and it changed the last bit. Found it running the FMS diagnostics and that is why they failed. Don’t fail at GFDL. Scary stuff. NH: Scary and time consuming.
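Sketch of why order of summation (AK's suggestion above) can change the last bits of a result. It is a generic floating point illustration, not CICE or FMS code; whether this is the actual cause of the 4-block difference was not established at the meeting.

```python
import math


def naive_sum(values):
    """Add values one at a time, in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


def order_independent_sum(values):
    """Exactly rounded sum of the inputs, so input order does not matter."""
    return math.fsum(values)


vals = [0.1, 0.2, 0.3]
forward = naive_sum(vals)             # ((0.1 + 0.2) + 0.3)
backward = naive_sum(reversed(vals))  # ((0.3 + 0.2) + 0.1)
print(forward == backward, abs(forward - backward))
```

The two naive sums differ only in the last bit, which is the kind of difference that breaks bitwise reproducibility while being physically invisible.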

MW: Clear strategy. Get rid of bands. Go with 1600 cores. Have a 16 block job running, will keep everyone updated.

Code Harmonisation

AH: My understanding with the ESM harmonisation is that we’re close, as we haven’t yet put in the coupling changes from CM2 that you had to take out of the ESM code. PD: Dave Bi’s iceberg scheme? AH: If we get the WOMBAT code into MOM5 that would be harmonised I think. PD: Maybe Matt has a better handle?

MC: Are the OM and CM almost harmonised except for iceberg information? Are they almost the same? AH: I believe so. Once we get WOMBAT in there we’re good to go. Russ had a different idea about how to handle the case of different coupling fields.

RF: Have to get rid of ACCESS keyword. In many cases redundant. AH: ACCESS keyword can be replaced  by ACCESS_CM or ACCESS_OM. RF: Yes!

RF: On CICE side of things (and probably MOM) coupling fields are currently defined as parameters. Can use calls to PRISM, test return code, put some tests for legal code/parameters for icebergs for example. Don’t need ifdef’s, can test on the fly. A lot easier than recompiling every time.

AH: How do we implement this? Put WOMBAT code in now so we have an ESM harmonised version and then deal with coupling etc as this is ACCESS-CM? RF: Want to bed down ACCESS-CM and OM harmonised first. The WOMBAT stuff will move in quite simply. I’d like to take that on, have been tasked to do this to take some of the load off Matt. Get this first step out of the way and then move on to WOMBAT and ESM. Until the first step done things can be in a state of flux.

MC: Is wind enhanced mixing in ACCESS-OM? RF: Yes. MC: FAMIP in ACCESS-OM? RF: They’re in MOM5. MC: They weren’t in ACCESS-CM code. AH: That is a 3 year old fork. MC: Can we update ESM from ACCESS-OM? AH: This morning putting WOMBAT changes into MOM5 pull request. Can grab and check if it works. MC: What is the difference in pulling from one direction to the other? AH: ESM is a 3 year old fork with little history in common with current MOM. Couldn’t code into ESM, would be too difficult. Cherry picked your changes into the MOM5 code, but wouldn’t work the other way. Will liaise with Russ to get ACCESS-CM changes.

AH: Would WOMBAT always be part of MOM5-SIS. MW: Is it big? RF: No, very small. MW: Let’s leave it in MOM5. Just executable bloat. RF: Just a few fields. MC: Allocated, so if not turned on, then no issues. RF: WOMBAT wants the 10m waves, but we need that for the wave mixing as well.

Travis CI on MOM5

AH: ACCESS-OM no longer compiles because you need libaccessom2 as well. NH: Same before. Always needed OASIS. AH: I’ve got CM compiling by pulling in OASIS and make it. All the compilation tests are passing. Could pull in the libaccessom2 and compile in a similar way to ACCESS-CM. There is no old ACCESS-OM build anymore. It is ACCESS-OM2. MW: Do we want to do this external to the repo? AH: Nice to have the tests there and passing. OM now has different driver code to CM, so can’t be sure you’ve done it properly without an ACCESS-OM compilation test. NH: There always needs to be a dependency on a coupler. libaccessom2 is more than a coupler. Maybe some of it is undesirable. Not worse than having a dependency on OASIS. AH: Just wanted to make sure there wasn’t an ACCESS-OM that was independent of libaccessom2. MW: Can you provide libaccessom2 as a binary and headers? AH: Yes, that is a possibility. NH: Could just be a .a file. MW: that is how you handle dependencies, as a binary, like libc. MW: Do you call OASIS in MOM? NH: Yes. In yatm don’t directly call OASIS. Could change coupler in future without changing models. MW: No problem with wrapping OASIS. AH: Can do the same thing I did with CM, pulled in OASIS, built it. Pretty straightforward.



  • Create even 5 blocks per PE map for CICE (RF)
  • Get coupling changes into MOM for harmonisation (RF+AH)


  • Update model name list and other configurations on OceansAus repo (AK)
  • Shared google doc on reproducibility strategy (AH)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare out OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)
  • Update ACCESS-OM2 model configs (AK)

Technical Working Group Meeting, August 2018


Date: 14th August 2018

  • Marshall Ward (MW), NCI
  • Aidan Heerdegen (AH) (Chair) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Nic Hannah (NH) Double Precision

Peter Dobrohotoff (CSIRO Aspendale) gave his apologies that he couldn’t attend.

MPI Coding Update

MW: Fixed constellation of bugs. 1/10th still not working under MPI, looks like a new issue. How do I approach getting this code into repositories? Not invested, but want working in long term. Intel v18 has a bug, fixed, but found another. Dale has built an OpenMPI3 library using Intel v17. Do people want to use it? Or afraid another issue?

AH: What advantages? MW: Hangs in MPI_Init, commsplit, and hangs at random time steps. MXM seems to have solved random hangs. First two still happening. Still getting random fails during initialisation. Betting on newer library to solve them. Do we want to invest in new libraries and hope we get new solutions, or happy with status quo?

AH: Is there another way? Maybe dev branch on MOM5, submit to more testing? MW: Yes. I can add all build stuff to all repos, independent of NCI configs. Optionally turn them on over time. AH: Just using different versions of OpenMPI? MW: Yes, have my FMS changes as well. AH: FMS changes in master branch? MW: Could do multiple ways. FMS not been updated to GFDL version. Could have subtree/submodule or just dump in. Will change everyone’s code, so might not be solution. I want people to start testing soon. AH: The init hangs: critical core count when these occur? MW: Yes. Tenth gets them, not at 1 degree. AH: Don’t have these at a quarter either? MW: don’t run 1/4, but frequency of errors increases with cores. AH: If we can test this by running the model for a single time step lots of times to quantify issue, see how bad it is, and see if we get improvement. MW: Have not seen commsplit hang with newer versions of OpenMPI. AH: Do you get hangs AK? AK: Get hangs at initialisation about 10% of time. Since going to YATM not had these issues. Also much more consistent with timing. Was 1.5-2.5 hours/month. Now 4h5m-4h10m for 3 month submit. MW: When we studied variability always IO issues. AK: Don’t know what is behind it. AH: Has coincided with transition to YATM? AK: Yes.

AH: Just got to a point where the model is running at all. AK: By late today will have 1 year of IAF. Need capacity to shift to new MPI versions. MW: Concerned about next machine. OpenMPI 1.0 will not be available. Not concerned short term. Concerned long term and stability issues. AK: Yet to fail

MW: Just updating. Want a plan. Will submit build changes to all the projects. Will also replace FMS with submodule but no change in code, but can be changed when required. AH: with submodule can easily test changes. Need a plan for implementation, testing, check timings.

NH: YATM makes sure it does all its work before ICE asks for anything. Does reads and regridding and waiting until required. Would hide any jitter in this disk access. AK: Preemptive fetching data. MW: Good case for IO servers.

Date handling bug in YATM

AH: We all good now? This fixed? AK: Yes. Just had to tell it to ignore the date in ice restart for the first one. AH: We have a method for this strange restart? Will we need to do this again? NH: Will happen any time you want to use someone else’s restart and don’t want to use their calendar. AH: Are there code changes we need to support this? AK: No. One off thing. Just need to change a value in CICE restart and then change back again. AH: Kial got burnt with this once. MW: We could do this at payu level. NH: Little more to it. Also necessary to MOM and YATM date. AK: 3 or 4 things I needed to do to make it go. AH: Think if we need to streamline this? NH: To begin with just document on wiki. AK: I can put it up there.

MOM5 wind langmuir mixing stuff

AH: Fixed?

RF: Should now be able to do an ACCESS-OM run but not do langmuir mixing unless using u10 calculated from empirical formula. Won’t break current ACCESS-OM runs. Have to look into CICE5 and work out how to get winds into ACCESS-OM. ACCESS-CM was fine. OM I thought there was an option to pass winds, misread a preprocessor flag.

RF: A couple of other issues. The order it passes the fields between ice and ocean, the 10m winds are 22nd, another one at 21 which isn’t used. They can’t be done as a common thing. Strange code. The changes I have done will make it safe for the time being. Have to explicitly compile you want to use the winds. AH: Under what circumstances can you use langmuir in ACCESS-OM? RF: Can use MOM6 style to calculate 10m winds. Need to turn on another namelist option. Currently don’t pass winds in OM models. AH: If want to use langmuir do you need to also set the compile flag? RF: At moment can’t pass 10m winds from cice to MOM. Defaults will be fine. ACCESS wind preprocessor flag is for ACCESS CM. ACCESS wind flag is a placeholder at the moment. All allocations made no matter what. Problem all allocations no matter what. Currently initialised to zeroes. AH: A placeholder for the future when get winds through CICE? RF: Yes. Would like to figure out how they pass 10m winds in fully coupled model, and whether they mask them with ice or not. Currently not clear. Would like to make them compatible. AH: I will put in those changes, add new changes and submit a PR. MW: Turn off langmuir by default? It’s broken? RF: No, can calculate winds in MOM6 style. Not using this at the moment.

MOM5 testing

AH: Above bug brought home issue of testing on MOM5 repo. Currently have 3 targets, MOM-SIS, ACCESS-OM, ACCESS-CM. This got through because I only tested MOM-SIS. NH: There is a jenkins setup which runs every MOM6 test case. Can’t remember if it has ACCESS-OM, ACCESS-CM builds/test cases. Spent a lot of time to set up testing but it takes maintenance work. Doesn’t run periodically. Love the idea of testing MOM5, please look at what I have already done. Like the idea of production ready, but it takes effort to maintain the system, might not be justified by the number of tests we have to run. If we had weekly PRs would make sense. If infrequent need to revisit the testing every time. AH: Idea was to do some simple builds. MW: build tests on travis? AH: Yes. MW: Don’t have to run, just build. Nic did a lot of work to do runs.

NH: Periodically: ACCESS-OM2 build test, and a fast run test (1 day experiment). There is a lot of stuff being done for MOM5, no build test. MW: Is this the GFDL tests? AH: Nic runs the GFDL tests. NH: MOM5 runs not run as frequently. Not maintained, going red. Sure if something simple, maybe not worth doing on Jenkins. But definitely take a look? MW: Travis for commits? Weekly Jenkins runs for commits. AH: can see five MOM5 builds. NH: folder on Jenkins. MOM6 guys get a lot of value from it. AH: will take a look. NH: If you can’t change anything let me know.

ESM 1.5 Repo

MC: Didn’t work with https. AH: Made an ESM1.5 repo on OceansAus for Matt to upload MOM5+Wombat code. Pretty much frozen. Peter wanted somewhere to put this. Should it be possible to https? RF: Always had to use ssh. NH: Just need to put in password. MW: https should work, help to know error. AH: give it another crack. Complain on slack. Get on slack.

AH: ESM1.5 repo on OceansAus won’t change much (now frozen), but we have goal of getting WOMBAT code into main MOM5 repo. MC: Might do that in parallel. Who knows what will happen to ESM1.5. Depends on where ACCESS-CM investment goes. ACCESS-CM2 is quite expensive. In the process of putting WOMBAT into ACCESS-OM-1.0. Going through steps. Put WOMBAT into it and submit PR. AH: ESM1.5 is just MOM+ Wombat? AH: We’re doing the harmonisation, so MOM5 master will have all the important changes. Once we have WOMBAT we have an ESM1.5 equivalent. ESM1.5 will be workhorse coupled model for CoE because ACCESS-CM2 is too expensive. Whilst ESM1.5 on OceansAus will be the canonical version, the MOM5 repo will be effectively the same but can include updates to diagnostics etc.

MC: Checked out ACCESS-OM2-1.0, compiled, but falling over on running payu. Config file has changed a lot since I last used. Want to run a 1 degree RYF model as basis. MW: Is Matt using the version that isn’t working? Is that what Matt is dealing with? AH: Matt, get on slack and let us know your issues, and we’ll get you going. AK: looking for a working setup? MC: Yes, 1 deg JRA RYF. AK: Can point you to working config. MC: A month since I cloned. AK: Yeah, need to update.

AK: Asking for just config? MC: Yes, but any information useful. AH: Kial has a lot of configs. I cloned one and changed exe paths and was up and running very quickly. MC: I cloned and built, but when I checked out the config it was pointing to common shared exes. AK: should change that. NH: Maybe documentation is out of date. Should follow the simple “if you’re a raijin user” instructions. MC: Yes mostly worked. AH: Get on slack! MC: Browser is out of date. NH: If you do it again, follow the quickstart for raijin users instructions. If that doesn’t work we need to fix stuff. MW: a lot of user problems we don’t know about. We have to think about students who will be coming to run this. If Matt can’t figure it out there is no hope. AK: there is a lot that needs to be updated for the more complex instructions. Also the configurations in control are not what they’re currently using. Could fix that easily.

JRA55-do versioning

AH: Andrew has issues with a ‘latest’ directory that has symlinks that point to most recent version. AK: Common use case is perturbation experiments. Go back to previous restart and branch a new experiment, but need to know what forcing was used. Rather than latest, have a directory which is named for the date it was setup, or date forcing was updated. If and when things are changed, make another one. All softlinks. AH: One good thing about latest is you have a config that always works with most recent version. If you have a config with latest, they start a new model and they can be confident that it works. AK: No problem extending forcing, only an issue if old forcing files change. AH: they have versioning issues with CMIP5, have a database. NH: latest is not reproducible. Experiment I ran, but latest is changed. Problem with old system, every version jumbled in one directory. At times there were different variables which had different versions. Not all variables had the same version. AH: That is correct. NH: If there is a single directory that has all the variables for that version that is fine. AH: some cases the variables don’t have the same version. I agree this is an issue, but best solved with manifests in payu. MW: filename is not a good system. Filenames change and hashes don’t. AH: If someone has a naming scheme they want, then happy to implement it, but will keep latest, and solve using manifests. NH: was there a reason to put all versions in same directory? AH: the way the JRA55 people publish it.
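Sketch of MW's point that hashes are a better identifier than filenames: record a content hash per forcing file when an experiment is set up, then check later whether anything under 'latest' has changed. This is a minimal illustration, not payu's manifest format or implementation; the function names, the dict layout and the choice of SHA-256 are all assumptions.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(manifest):
    """Return the paths whose current content no longer matches the manifest.

    manifest maps file path -> hex digest recorded at setup time.
    """
    return [path for path, digest in manifest.items()
            if file_sha256(path) != digest]
```

With this kind of record a config can keep pointing at 'latest' and still detect that an old forcing file was replaced, whereas extending the forcing with new files leaves existing hashes untouched.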

AK: Do we care if JRA forcing is extended? Does it affect reproducibility? NH: Not an issue. YATM has no end date for an experiment. You set a forcing start/end date, so no problem.


RF: Pavel Sakov is running a KDS75 MOM only on OFAM -75/+75 tenth model. Running 600s timestep from the start, hoping to get up to 900s. The problems in global model is not between +-75. NH: Just poles messing us up. RF: From a flat surface, huge heave. NH: all those little grid boxes. AK: Yes the tripole is the issue. AH: redo bathymetry? RF: did a naive regridding, some issues, potholes etc. Still works. Will be running a 100 member ensemble. AH: What is he trying to find out? RF: look at some issues with OFAM/BRAN/OceanMaps. Interested to see how much is due to vertical resolution. Also a test for the future. An intermediate model between what we run at the moment, and what Andrew is running. MC: interested in a figure from Kial at the COSIMA meeting, showing how variability changes with surface resolution. AH: how long will he run? RF: A year or two. Thought you might be interested.



  • Incorporate RF wave mixing update into MOM5 codebase + bug fix (AH)
  • Code harmonisation updates to ACCESS and ESM meetings (PD, RF)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)
  • Update ACCESS-OM2 model configs (AK)


  • Edit tenth bathymetry to remove Cumberland Sound (RF)
  • Update model name list and other configurations on OceansAus repo (AK)
  • Check red sea fix timing is absolute, not relative (AH)
  • Shared google doc on reproducibility strategy (AH)
  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare out OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH)
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)

Technical Working Group Meeting, July 2018


Date: 10th July 2018

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Peter Dobrohotoff (PD), CSIRO Aspendale

Tenth Model Update

AK: Starting from yr 38, new codebase, hangs on MPI_finalize. Hard to trace. padb can’t help.
MW: The ranks we are not getting traces from probably shut down cleanly. Depending on how a process exits it may leave cleanly. Russ added explicit backtrace calls to force a backtrace.
RF: Only if it goes through a particular routine. If it calls a FATAL within MOM. Doesn’t happen if something internal goes haywire.
AH: bug in the code according to Nic? AK: recompiled. Will run this morning. MW: may need to force backtraces in some situations?

OM2/CM2 Code Harmonisation

AH: finished wave mixing?
RF: Tested in CM2.5. Happy with results. Can get from my GitHub. MW: happy doing PR? AH: I will take care of the code wrangling.
RF: But not completely happy with it on the MOM side. Using one instantaneous value rather than an average. Should be fine for ACCESS. AH: your code? RF: had to add in a field, bit tricky to know what it is doing. Has to get it from CICE, hard to figure out what is going on. Matt fixed it 5 years ago. Getting one value in the coupling stage. One timestep fine. Coupling several time steps. Only when MOM-SIS. Running with ACCESS it is fine. MW: will it produce the same numbers for a MOM-SIS run? RF: won’t change anything. MW: Change FMS code? RF: No, coupling code. Added in an extra field. In the coupling code, in the flux exchange. Atmosphere supplies wind at its bottom level. Scaling law to calculate 10m winds, which is what ACCESS passes. In normal MOM code, doesn’t have access to that field. Added in something to get that field, pass to ICE, which then passes to the ocean. ACCESS code, through OASIS, it passes these 10m winds directly. Difference to ACCESS-CM2 code, passed through ocean_sbc now. ACCESS took the ice/ocean boundary field and sent directly to KPP scheme. Shouldn’t do that, should be going through ocean_sbc. Now 10m winds are in the velocity derived type. Cleans up the interfaces. Another slight change with the fraction of ice passed in. Had some #ifdef in ocean_model. Now made aice a local variable in ocean_sbc. Doesn’t have to be passed around. Same interface with ACCESS or MOM-SIS.
RF: Now distinction in KPP code for ACCESS version and MOM version, just some namelist variables. MW: 10m always passed? RF: there by default. MW: so memory usage is marginally higher? RF: 2D field. MW: Otherwise no impact? Increased data transfer? MW: Previously MOM-SIS was converting to a stress? RF: Yeah, just have this extra field. MW: Good we now have 10m winds. Was awkward. RF: Also put in an empirical method to calculate 10m winds given friction velocity. Don’t need 10m winds in case forcing with fluxes, can’t regenerate this. This is copied from MOM6.
RF: Do you mind having elemental functions? AH: No! Love elemental. MW: good, aspirational.
RF: Aidan look at what I’ve done. MW: this is the bottleneck in the code merge. Awesome!
PD: Russ, thanks for work on Harmonisation. Can we now test on CM2? RF: Yes, depends on Aidan’s updates. Just based on current version of MOM. Anything else isn’t there. Once Aidan does his stuff should be fine. AH: Yes I will do this. PD: timeframe is yesterday. Estimate of timeframe pass on to ESM meeting and CMIP6 meeting. Other people make decisions. Hopeful to use harmonised code. AH: Want me to attend a meeting? PD: ESM meeting on Friday. Matt, helpful? MC: a report from PD would be enough. PD: If RF is there, that might be enough. PD: 11 am Friday.

Model Reproducibility

PD: Any work on restarts? Working on warm restarts in CM2? AH: more information? MW: 2×1 vs 1×2 jobs. AH: not that I know. Stability more information.
PD: the way a scientist sees this, a model is perturbed at every restart. How can you write a paper with this “feature”? Needs to be fixed. MW: MOM5/6 can do this. ACCESS-CM1 can. ACCESS-OM2 cannot. UM can do this. PD: CMIP5 runs can do this. MW: tested MOM-SIS and didn’t work. Steve reckons settings are correct for reproducibility. Tested a year ago, all differences in tripole. Didn’t push this. With GFDL coupler, atm, ice model. Not our model. Nic confirmed this issue with OM2. libaccessom2 growing pains have pushed this out. AK: scientific credibility requires this to be solved. MW: floating point arithmetic is a perturbation. MW: Get consistency with consistent restart times. PD: if FP errors are of the same magnitude as restart errors, maybe we can say they’re ok. Interesting perspective. AK: should be able to save the state of the model and reload and carry on. MW: the order something is handled in init is not the same as in a time step. AK: fields calculated from saved variables might not match. MW: need checksums at every step. Needs to be someone’s job. Need to communicate that to the people that control us. AH: Needs to be prioritised by Andy Hogg. AK: Are these differences large, or least sig bit? Any perturbation, if the model is stable, will disappear, but maybe new trajectory due to chaos. Same order as numerical round off error, or different compiler, optimisation, maybe of the order of changes that are being made all the time. PD: a lot of calculations from one time step to another. When you say how big is this change? Measured at the end of the time step after the restart. AK: model state at beginning of restart must be the point where they are different. When do we measure difference? AK: when restart and initialise fields. At that stage should match when model restarts were written. MW: hard to define model state. global vars, scratch fields etc. Need to define state, then compare checksums at end of run, and beginning of next time step.
After 30 time steps, get checksums, then proceed. Then compare to timesteps with a restart run.
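The checksum comparison described above can be sketched with a toy model. This is only an illustration under stated assumptions: the logistic-map "model", the function names and the text restart format are all hypothetical and have nothing to do with the MOM, CICE or OASIS code. It runs a reference case straight through, runs a second case that writes and reads a restart part way, and reports the first time step whose checksum differs.

```python
import hashlib
import struct


def step(state):
    """Advance the toy model one time step (a chaotic logistic map per cell)."""
    return [3.9 * x * (1.0 - x) for x in state]


def checksum(state):
    """Hash the exact bit pattern of the state, so a one-bit change is detected."""
    return hashlib.sha256(struct.pack(f"<{len(state)}d", *state)).hexdigest()


def run(state, nsteps):
    """Step the model nsteps times, recording a checksum after every step."""
    sums = []
    for _ in range(nsteps):
        state = step(state)
        sums.append(checksum(state))
    return state, sums


def write_restart(state, digits=None):
    """Serialise the state as text. repr() round-trips a float exactly;
    a fixed number of digits mimics a restart that loses information."""
    if digits is None:
        return [repr(x) for x in state]
    return [f"{x:.{digits}f}" for x in state]


def read_restart(restart):
    """Rebuild the state from its text form."""
    return [float(s) for s in restart]


def first_mismatch(initial, total_steps, restart_at, digits=None):
    """Return the first time step (1-based) whose checksum differs between a
    straight-through run and a run restarted at restart_at, or None."""
    _, reference = run(initial, total_steps)
    state, before = run(initial, restart_at)
    state = read_restart(write_restart(state, digits))
    _, after = run(state, total_steps - restart_at)
    for n, (ref, got) in enumerate(zip(reference, before + after), start=1):
        if ref != got:
            return n
    return None


initial = [0.1, 0.2, 0.3, 0.4]
print("exact restart:", first_mismatch(initial, 60, 30))
print("lossy restart:", first_mismatch(initial, 60, 30, digits=6))
```

With a bit-exact restart no step should differ. With the lossy restart the first difference cannot appear before the restart point, which is the kind of localisation per-step checksums are meant to give.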
PD: each processor checksums array, print that out. AK: specific reproducible order to sum? MW: MOM or UM safe operation, need a gather on a rank. PD: can we work on what a state might be? MW: Can do this in MOM. Need it for all models. Hard for coupled models. MOM has framework for this. Could be as simple as OASIS getting out of sync. Depending on configuration it might not be restarting correctly. AH: Nic has tested OASIS field consistency. MW: volatile time. PD: lags might be set explicitly for first time step? MW: restarts are supposed to handle that. AH: could we use compiler options to perturb FP operations to get scale of differences. MW: fused multiply add might not be reproducible. PD: some clarity about what the model can and cannot do. MW: push this up to science leaders. Bob Hallberg did a cool thing with MOM6: converts FP to fixed point, does the global sum and converts back to FP.
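The fixed-point idea MW mentions can be illustrated with a minimal sketch. It is not the MOM6 implementation: the names and the FRACTION_BITS resolution are assumptions chosen for illustration, and Python's arbitrary-precision integers stand in for the fixed-width integers a Fortran code would have to manage. The point is only that integer addition is exact, so the result does not depend on the order in which ranks or array elements are summed.

```python
import random

FRACTION_BITS = 40  # assumed resolution: values are kept to the nearest 2**-40


def naive_sum(values):
    """Plain left-to-right floating point sum; the result depends on order."""
    total = 0.0
    for v in values:
        total += v
    return total


def to_fixed(x, bits=FRACTION_BITS):
    """Convert a float to an integer count of 2**-bits units."""
    return round(x * 2.0 ** bits)


def reproducible_sum(values, bits=FRACTION_BITS):
    """Sum in integer arithmetic (exact, hence order independent), then
    convert the total back to floating point once at the end."""
    total = 0
    for v in values:
        total += to_fixed(v, bits)
    return total / 2 ** bits


a = [1e16, 1.0, -1e16]
b = [1e16, -1e16, 1.0]  # same numbers, different order
print("naive:      ", naive_sum(a), naive_sum(b))
print("fixed point:", reproducible_sum(a), reproducible_sum(b))

# Shuffling a larger set of values leaves the fixed-point result unchanged.
values = [random.uniform(-1e6, 1e6) for _ in range(1000)]
shuffled = random.sample(values, len(values))
print(reproducible_sum(values) == reproducible_sum(shuffled))
```

The trade-off in this sketch is that anything smaller than 2**-FRACTION_BITS is rounded away on conversion, so the resolution has to be chosen to suit the range of the field being summed.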
MW: lack of testing and reproducibility means we can’t confidently change code quickly and easily. AK: engineering problem. Useful for finding subtle bugs in the way code is written. Hard to know how big this effect might be. MW: lab can do stuff. PD: is this a showstopper? MW: need a conversation at CSIRO wrt CMIP6. They have rules. PD: this isn’t a showstopper for science publishing? AK: depends on size of perturbation. AK: for testing need to walk all used code branches.

FMS (MPI) updates

MW: Been rewriting the global field function. Done for a while, concerned about performance. Fair bit slower than original. Fixed stability. Probed it, but it was MPI alltoallw and it was slow. Tested against other MPI libraries. In IntelMPI alltoallw is a lot faster than p2p. OpenMPI is across the board slower than IntelMPI. Whatever I did was not a question of performance. Has anyone been testing IntelMPI? Maybe we have been making our lives hard by using OpenMPI? What do people think? RF: Makes no difference to us. Up to MW. MW: will keep testing. MW: Intel is not necessarily faster, but it might be smarter about choosing algorithms. Around 1000 ranks it makes a bad choice. AH: How size sensitive? MW: has not tested alltoallw. Others are faster on IntelMPI. 2x as fast. Small tests. AH: full MOM test with MOM? MW: Years ago, volatile timing. This was IntelMPI 4, when it was sort of bad. Seems to have improved. IntelMPI is based on MPICH.



  • Incorporate RF wave mixing update into MOM5 codebase (AH)
  • Code harmonisation updates to ACCESS and ESM meetings (PD, RF)


  • Edit tenth bathymetry to remove Cumberland Sound (RF)
  • Update model name list and other configurations on OceansAus repo (AK)
  • Check red sea fix timing is absolute, not relative (AH)
  • Shared google doc on reproducibility strategy (AH)
  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare our OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH)
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)