Technical Working Group Meeting, July 2019

Minutes

Date: 1tth July, 2019
Attendees:

  • Aidan Heerdegen (AH) CLEX ANU
  • Angus Gibson (AG) RSES ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Peter Dobrohotoff (PD), CSIRO Aspendale
  • Marshall Ward (MW) GFDL
  • Nic Hannah (NH), Double Precision

Config checking

AH: Made payu configuration checker. Include safety checks for syncing scripts in BASH scripts. Interested in checking for bad namelist options. Russ, any specific bad ones?
RF: Kpp kbl standard method should be false, red sea fix used in access-cm. For a CM model maybe warn. OM model just not allowed. AK: nprocs and ncpus driver issue?
NH: diag_step checks? Too frequent ok for low res, bad for high. RF: For production runs don’t want? But that is how you diagnose problems. Best way to find how things are going wrong. Maurice’s issue was trivial to spot. AH: Definitely say don’t want debug_this_module turned on. RF: diag_table debug turned on, should be turned off. Creates huge numbers of messages. AK: Setting up new updated configs. Ten configs. Make them more homogenised. Fixing all these things as they go. AH: These things will get changed by mistake. Don’t have enough people to keep checking things. Doesn’t scale. This will allow new users to submit a config that at least passes these checks, also gives others confidence to change config knowing they have produced something that meets some minimum standard, appropriate for public facing production.
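
A minimal sketch of the kind of namelist check discussed above, assuming settings appear as simple `name = value` lines inside `&group ... /` blocks. This is not payu's actual checker: the function name `find_flagged_settings`, the rule table and the group names in the example are hypothetical. Only the option name debug_this_module comes from the discussion, and the sketch only matches the literal value `.true.`.

```python
import re

# Hypothetical rule set: option name -> value that should trigger a warning.
# Only debug_this_module comes from the discussion; extend as needed.
FLAGGED = {"debug_this_module": ".true."}

GROUP = re.compile(r"^\s*&(\w+)")
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*([^,!\s]+)")


def find_flagged_settings(namelist_text, flagged=FLAGGED):
    """Return a list of (group, option, value) for each flagged setting."""
    problems = []
    group = None
    for line in namelist_text.splitlines():
        group_match = GROUP.match(line)
        if group_match:
            group = group_match.group(1)
            continue
        match = ASSIGNMENT.match(line)
        if not match:
            continue
        option = match.group(1).lower()
        value = match.group(2).lower()
        if flagged.get(option) == value:
            problems.append((group, option, value))
    return problems


# Illustrative input only; the group names are placeholders.
example = """
&ocean_frazil_nml
    debug_this_module = .true.
/
&ocean_model_nml
    debug_this_module = .false.
/
"""

for group, option, value in find_flagged_settings(example):
    print(f"WARNING: {group}: {option} = {value}")
```

A table of rules keeps the checks easy to extend as more bad options are identified, without touching the scanning code.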

Tenth Run

AK: Andy Hogg is running ACCESS-OM2-01 with JRA-55-do RYF90/91, seems to have smaller biases than previous repeat year (84/85). Currently 13 years. When it does die it is CICE CFL problem. Sometimes the same date, subsequent years didn’t occur. Checked dates. Storm goes near tripole. Not currently messing with forcing winds. Did this with 84/85 but this doesn’t seem as bad, so haven’t done it so far. One drawback doesn’t run at dt=600s, but takes 2.55 hours to do 3 months. With dt=600s could do 6 month submits, which would mean less queue wait. Should be straightforward to fix the winds to enable this. AH: Not a priority considering extra SU cost. AK: About 10%. Cost of losing 6 month submit halfway through > 10%. AH: Shame it is the tail wagging the dog. AK: Could ask for 6 hour limit from NCI? AH: Worth trying. Done it before, and have seen others with increased limits. Prefer to do it, but time limited, and just for one project. AK: Hopefully limits will change with new machine. Currently 65KSU/3mths. RF: MOM or CICE bound? AK: Fraction of time MOM is waiting 2-3%. RF: Not greatly MOM bound. Throw a few more processors at MOM to get it to run less than 2.5 hours?
AK: 40 years. IAF not split, just start from climatology. AH: When will IAF start? AK: No plans, not simultaneously run RYF and IAF.

NCI update

AH: Attended an NCI scheme manager meeting. Mostly about new storage scheme for short term storage. Push came from CSIRO to change to scratch model, but some others in CSIRO not happy. PD: Wasn’t aware that was being driven from this end. Maybe further up the food chain.
AH: Change to time-limited scratch, or a tidal model deleting oldest data first. Maybe a split scheme with old style short on one disk, time limited on another, but not a lot of appetite for that.
RY: First stage November. Our group look for HPC application for new machine. Already have ACCESS-OM from Andy Hogg to look into software state. New machine some old library will not be maintained.

OpenMPI3 and ACCESS-OM2

RY: Recently used ACCESS-OM2 with OpenMPI 3.0. Seems to hang? Know this issue? Or avoid 3.0? Some work required to run on new machine. Spend some time on this work.
AH: Marshall, any ideas? MW: Have tried 3.0.0, 3.0.1, maybe 3.1.1. Earlier ones didn’t work then got fixed. Newest 3.x should work. RY: Tried 3.1.3, MOM keeps hanging until finish of job. Should finish at 40min. Keeps hanging. 1.10.2 works. 3.1.3 hanging. MW: Sure I got it running. Will make sure they are in repo. RY: Catch up with you personally? MW: Where it hung should tell you. RY: Talk later offline.
AH: Definitely need it working on the new machine. MW: No work needed to be done, it just worked. AH: What changes would you have made? MW: Just versions, environment file and flags. Maybe using some of the alltoallw changes, but I don’t think that was a deal-breaker.
AH: What is the minimum version of OpenMPI supported on new machine? RY: Under discussion. System guys will decide. Have to prepare for any. Not sure OpenMPI 1.10 will still be supported. Don’t know. AH: Likely to be OpenMPI 3.x+? RY: New machine with new architecture. Performance enhancements with new architecture. MW: What arch? RY: Now have Skylake. Newer than Skylake. MW: Intel architecture, not Ryzen. AVX512 can’t benefit. fma which we already have. AH: AVX512 because can’t vectorise enough? MW: Currently vectorising, but bandwidth limited. Ryzen has better bandwidth. RY: Not announced. No idea. AH: At scheme managers meeting it was an Intel chip. Told it was November when they commission new nodes, take equivalent raijin nodes offline. Iron out the bugs, and early next year will turn off the rest of raijin and turn on the rest of the new machine and at that point it will be larger than raijin now, but not a huge increase in compute. AH: Thanks for bringing that up Rui, as we definitely need to keep an eye on this for the new machine.
MW: Apparently used 3.0.3. Maybe a reference point to start with. RY: Start with 3.0.3? MW: Whole space is volatile, some 3.0.* series work some don’t. But start with 3.0.3 and Intel19.
AH: Would be nice to have a spack like build tool so can say for certain what was run. MW: payu build! AH: spack written by a smart guy from TACC, and lots of people use it, and they still have a lot of issues. Not an easy problem to solve. MW: Dale was keen on it. AH: When we met with Dale he was thinking to have spack as a tool preconfigured with compiler toolchains that we can build our tools from. RY: Dale is very busy getting new for the new machine.

Splitting off FMS

AH: Been working on CMake to compile FMS separately from MOM. Been using the FMS fork in mom-ocean repo with your alltoallw changes. MW: Also a branch on the GFDL repo with those changes.
AH: How to organise the FMS fork? Have a branch that tracks GFDL and master contains our local changes? Could have a branch called gfdlmaster, could have our master branch exactly track the GFDL FMS. Any opinions on how to organise this? MW: Don’t want to use GFDL FMS? AH: I want an easy way to update FMS without touching MOM source tree. MW: Want to get FMS out of MOM? AH: Yes. MW: And want to know how to refer to FMS you want to use? AH: FMS we want to use is a fork on mom-ocean. Gives flexibility to add changes when we need to.
MW: Best to have your own FMS fork. GFDL don’t want to support anything but for GFDL, including MOM5. Don’t really want to get involved in supporting other projects. Will be receptive. No harm in using FMS repo straight, but if doing anything with FMS better off maintaining own version and update as see fit. Don’t see compatibility with older models as priority. Planning a big IO rewrite. Wouldn’t be surprised if it starts breaking and not salvageable.
AH: alltoallw we definitely want on our architecture as we’ve had issues in the past? MW: A lot of work, return not what I’d hope. Latest MPI version bigger impact. Are cases with speed up, but such an infrequent operation not such a big deal. AH: Stopped initialisation hangs? MW: Yes, some rare scenarios where they did alltoall with point to points that broke a lot. In OpenMPI 2.0/3.0 and later they changed something, scenario no longer happened. Segfaulted before, now properly checking. Only necessary for 1.10. It is better, as collectives are generally more responsible. May become necessary, assuming 3.0.3 works.
AH: If want alltoallw, would keep a branch with those changes and rebase on to gfdl master. This would be a well documented branch, or branches, and a well documented way of applying those changes when an update is required.
MW: Can CMake build as libfms and link to MOM when you build it. No submodules, rely on CMake. Does that work? AH: FMS is not suitable to be a loadable module. Get OpenMPI conflicts, best to build at the same time with the same compiler toolchain. There is a new CMake tool called FetchContent that can grab a repository and it behaves like it is physically in the source tree. Works well, but not great versioning. MW: Isn’t Nic already doing something like this for ACCESS-OM2 to pull in specific versions of json-fortran? AH: Yes, you can specify a library git hash. The only thing stopping it from working is relocating the versioning string stuff Nic did as it is currently sitting in the FMS directory, and that is going to disappear. Needs its own directory, maybe ocean_shared? RF: ocean_shared is used for other tracers. MW: Should not use that name. AH: Ok, will make a new directory called version. Can recreate the sed script functionality that is currently in the build script in CMake using template files. Quite a clean solution. I have a cmake branch on the MOM5 repo and FMS fork on mom-ocean, will get them compiling properly and working properly together. There is a way forward.
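
The template-file idea mentioned above, where a placeholder in a source template is filled in at build time instead of editing the file with sed, can be sketched in Python as follows. This only illustrates the substitution step and is not the MOM5 build script or its CMake replacement. The module name `version_mod`, the parameter `MOM_COMMIT_HASH`, the placeholder `commit_hash` and the hash value are all hypothetical.

```python
from string import Template

# Hypothetical template for a generated Fortran source file; the module,
# parameter and placeholder names are illustrative only.
VERSION_TEMPLATE = Template(
    "module version_mod\n"
    "  character(len=*), parameter :: MOM_COMMIT_HASH = '$commit_hash'\n"
    "end module version_mod\n"
)


def render_version_source(commit_hash):
    """Fill the template with the given commit hash and return the source text."""
    return VERSION_TEMPLATE.substitute(commit_hash=commit_hash)


# Placeholder hash, not a real commit.
print(render_version_source("0123abc"))
```

Keeping the placeholder in a dedicated template means the substitution no longer depends on the layout of a file inside the FMS directory.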
MW: Alistair is pretty interested, might be a template for MOM6. AH: Angus already did this for MOM6? MW: Angus, is what you did still viable? AG: Haven’t tried recently, don’t know why it wouldn’t work. Replicating mkmf process in CMake. MW: Automake is not good and won’t touch it. AH: Surprised there was no way to build FMS from the FMS repo. Relies on being imported into another project that knows how to build it. Not sure it is great that a project can’t build itself. MW: CMake support not widespread enough? Not available everywhere? AG: Updates frequently, can have features that break old versions. Used in a lot of projects. Surprised if it went away. AH: CMake can be brilliant, but also terrible, but better than mkmf. MW: mkmf is doing two jobs, importing stuff and working out dependencies. Does work well for the latter job. Set a high bar. AH: Haven’t done proper comparisons, but CMake seems to be better for dependencies. Can do parallel builds with CMake you can’t with mkmf. MW: mkmf just generates a makefile, which is already parallel. AH: So does CMake. AG: Doesn’t seem like a good makefile, don’t know if the dependency tree is deficient. Rebuilds too much even after touching a single file. MW: If CMake intelligently supports mod files then it is fantastic. AG: Has native Fortran support. AH: Speed point of view, CMake is better. Generated correct dependencies so that parallel compilation worked. Couldn’t do that with mkmf. Also had compilation cascade issues. MW: I build 5 exes at once, so it always looks fast to me. AG: MW same makefile gen as mkmf. MW: More readable makefile than automake? AG: Yes. More readable than automake. AH: When the magic works CMake is great, when it doesn’t it is a pain, but the magic is worth it. Also supports multiple architectures.
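
The dependency job discussed above, working out which Fortran files must be compiled before others because of module use, can be sketched as follows. This is a simplified illustration and not how mkmf or CMake implement it: it ignores preprocessing, submodules and include files. The function name `compile_order` and the example file contents are hypothetical. It assumes Python 3.9 or later for `graphlib`.

```python
import re
from graphlib import TopologicalSorter

# "module foo" on a line of its own; excludes "module procedure foo".
MODULE_DEF = re.compile(r"^\s*module\s+(\w+)\s*$", re.IGNORECASE | re.MULTILINE)
# "use foo" at the start of a line; "use, intrinsic ::" forms are ignored.
USE_STMT = re.compile(r"^\s*use\s+(\w+)", re.IGNORECASE | re.MULTILINE)


def compile_order(sources):
    """Order files so each comes after the files defining modules it uses.

    sources maps file name -> Fortran source text.
    """
    # Record which file provides each module.
    provider = {}
    for name, text in sources.items():
        for module in MODULE_DEF.findall(text):
            provider[module.lower()] = name

    # Map each file to the set of files it depends on. Modules that are
    # not defined in the given sources (e.g. external libraries) are skipped.
    graph = {}
    for name, text in sources.items():
        deps = {
            provider[module.lower()]
            for module in USE_STMT.findall(text)
            if module.lower() in provider
        }
        deps.discard(name)
        graph[name] = deps
    return list(TopologicalSorter(graph).static_order())


# Illustrative file contents only.
sources = {
    "ocean_model.F90": (
        "module ocean_model_mod\n"
        "  use ocean_types_mod\n"
        "  use mpp_mod\n"
        "end module ocean_model_mod\n"
    ),
    "ocean_types.F90": "module ocean_types_mod\nend module ocean_types_mod\n",
    "mpp.F90": "module mpp_mod\nend module mpp_mod\n",
}
print(compile_order(sources))
```

Files with no dependency between them can be compiled in parallel, which is why a correct dependency graph matters for parallel builds.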

Codebase

RF: Aidan, can you approve change to FAFMIP. Starting to get conflicts. Ryan’s changes put it all in conflict. Riccardo has disappeared, but Fabio’s changes so it is all the same bit for bit. AH: Current conflict in ocean_frazil. RF: Because you put Ryan’s changes in. AH: Sorry. Could rebase on Ryan’s changes. Maybe pull in Ryan’s changes. AG: Could check out the branch, make changes and push to the branch. AH: I’ll try doing it directly on GitHub, get back to you about it. RF: Get that done and I can finish up some of the WOMBAT stuff. With the ESM model I also have to make some changes to CICE. A couple of design things with the number of fields that are passed. Hard wired at the moment. A couple of issues there. Have a chat at a later stage. Rather than hard wire fields, flexibility, test error codes, make compatible with namcouple, so can be done on the fly. Also feed into BGC Hakase is putting into CICE. Need to pass BGC fields between the two modules. Rather than having a plethora of drivers, or CPP directives, better ways to do it.
AH: Made that change on GitHub and merged it. Once checks are finished will accept the PR.
MW: Been working on a test with MOM6, where we turn off every diagnostic, fantastic for finding bugs. Found nearly 2 dozen bugs. Don’t actually register the diagnostics with FMS, just spoof the whole thing at the diag_mediator level, which is a wrapper around the diag manager. Interesting if this could be translated to MOM5. Don’t know a natural way to do it, but might be worth some thought at some point. RF: Code you’re putting into MOM6, not the diagnostic manager? MW: Yes. FMS moves too slow, very conservative, don’t have a robust test framework so are worried about putting in changes. There are some hints that maybe this code could be shared with MOM5. Lots more in there than just this. Just raising it as food for thought. AK: Put as an issue? MW: Opposed to those sorts of issues, but you can if you want.
AK: Want to set up new vanilla reference versions of the 1 and 0.25 deg ACCESS-OM2 models. The forcing on those use 2nd order conservative interpolation. There are overshoots for some fields which have to be positive definite. Would like 1st order conservative for some fields. Do they exist? NH: They should be there, we were using 1st order for a long time, and should be in the input directory. Not sure how well they are named. Should say in the filename, have a look and if you can’t find them we can recreate them.
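
The overshoot issue raised above can be illustrated with a small 1-D sketch. This is not the interpolation code used to generate the ACCESS-OM2 forcing weights. It assumes a second-order scheme that adds an unlimited centred-difference gradient term to the cell average, and the function names and grid values are hypothetical.

```python
def overlap(a, b):
    """Return the (start, end) overlap of two 1-D intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if end > start else None


def remap(src_edges, src_values, dst_cell, order):
    """Conservatively remap 1-D cell averages onto one destination cell."""
    n = len(src_values)

    def centre(k):
        return 0.5 * (src_edges[k] + src_edges[k + 1])

    total = 0.0
    for i in range(n):
        ov = overlap((src_edges[i], src_edges[i + 1]), dst_cell)
        if ov is None:
            continue
        value = src_values[i]
        if order == 2:
            # Unlimited centred-difference gradient (one-sided at the ends).
            lo, hi = max(i - 1, 0), min(i + 1, n - 1)
            if hi != lo:
                gradient = (src_values[hi] - src_values[lo]) / (centre(hi) - centre(lo))
                value += gradient * (0.5 * (ov[0] + ov[1]) - centre(i))
        total += value * (ov[1] - ov[0])
    return total / (dst_cell[1] - dst_cell[0])


# A non-negative field with a sharp jump, on unit-width source cells.
edges = [0.0, 1.0, 2.0, 3.0]
values = [0.0, 0.0, 10.0]
target = (1.0, 1.5)  # destination cell next to the jump

first = remap(edges, values, target, order=1)
second = remap(edges, values, target, order=2)
print(first, second)  # 0.0 -1.25
```

The first-order result is a weighted average of source values, so it stays within their range. The second-order result goes negative next to the jump, which is the problem for fields that have to be positive definite.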