Technical Working Group Meeting, September 2019

Minutes

Date: 11th September, 2019
Attendees:

  • Aidan Heerdegen (AH) CLEX ANU, Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision

libaccessom2

AK: JRA55 v1.4 splits runoff into liquid and solid. What is the most elegant way to support this? Have a flag in the accessom2 namelist to enable combining these runoffs. NH: Is it a problem in terms of physics? Do we have to melt it? AK: Had previously ignored this anyway, so ok to continue. NH: Backward compatibility!
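For illustration, such a flag might look like the sketch below; both the namelist group and the flag name are placeholders, not existing libaccessom2 options.

    &forcing_nml                  ! hypothetical group name
        ! hypothetical flag: if true, add solid (frozen) runoff into liquid
        ! runoff before passing it to the ocean, preserving pre-v1.4 behaviour
        combine_runoff = .true.
    /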
AK: Some interest in multiplicative scaling and additive perturbations to allow for model perturbation runs. NH: Look at the existing code. Might not be too hard. AK: Test framework for libaccessom2? NH: When I did scaling it took longer to write the test than to make the code change. All there, could use as an example. Worth running the tests, don’t want to get it wrong. AK: Not familiar with pytest. NH: In this case just copy the scaling test, modify it, and get pytest to run just that test. Once you have that test running and passing you’re done.
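A rough sketch of the workflow NH describes, assuming the libaccessom2 tests live in a tests/ directory and the scaling test is in a file named something like test_forcing_scaling.py (names are illustrative; check the repository for the actual layout):

    # copy the existing scaling test as a starting point
    cp tests/test_forcing_scaling.py tests/test_forcing_perturbation.py
    # edit the copy, then run only that test file
    python -m pytest tests/test_forcing_perturbation.py
    # or select the test by name
    python -m pytest -k forcing_perturbation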
AK: New JRA55 is now in Input4MIPS. Used JRA v1.3 from that directory and it didn’t reproduce. AH: Correct. Didn’t work out why it wasn’t reproducing. AK: Ingesting the wrong files? They should be identical. AH: Never figured out what was wrong. Didn’t match checksums from historical runs. Next step was to regenerate those checksums to make sure the historical ones were correct. Could have been ok, but didn’t get that far.
AH: JRA55-do is now on the automatic download list, should be kept up to date by NCI. If it isn’t let us know.
NH: Liquid and frozen runoff are handled backwards-compatibly, but what about the future? AK: Some desire to perturb solid and liquid separately, and/or to distribute solid runoff. NH: Can we just put it somewhere and allow the model to deal with it? AK: In terms of distributing it, not sure. Some people are waiting on this for the CMIP6 OMIP run. Leave open for the future. NH: MOM5 doesn’t have icebergs? AK: No. Depoorter et al. have published a paper on meltwater distribution. Maybe use a map to distribute it. RF: That is what they use for ACCESS-CM2. Read in from a file.
AK: The naming convention for JRA55 v1.4 has year+1 fields. Put in a PR for this some time ago. AH: Any problem with the operator in the token? NH: Should be fine as long as it is within quotes. AK: It is just a string search, so it shouldn’t make a difference.
AK: Can’t get libaccessom2 to compile and link to the correct netCDF library. Ben Menadue tried and it worked ok for him. The problem is with the FindNetCDF plugin for CMake, which is not properly supported on NCI. Edited the CMake file to remove it; netCDF could then be found, but a different version was used for the includes than for linking. Should move to a newer version of netCDF; v4.7.1 has just been released and has been requested for installation on NCI. NH: Does “supported” include the CMake infrastructure around the library? If getting FindNetCDF working were NCI’s responsibility that would be great. Difficult getting system library stuff working properly with CMake; CMake isn’t well supported in HPC environments. AK: Ben suggested adding logic to check for NCI and not use it there. NH: Definitely upgrade, to 4.7 if they install it.
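One way to diagnose the include/link mismatch described above is to compare what the loaded netCDF module reports against what CMake actually cached; a sketch, assuming the NCI module provides the standard nc-config/nf-config tools and the build directory is called build:

    module load netcdf/4.7.1    # or whichever version is installed
    nc-config --includedir      # C include path
    nf-config --includedir      # Fortran include path (.mod files)
    nc-config --libs            # link line
    # compare with the paths CMake picked up:
    grep -i netcdf build/CMakeCache.txt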
AH: Didn’t Ben Menadue log in as AK and run it ok? AK: No, he didn’t do that as far as I know. AH: Definitely check there is nothing in .bashrc. Also worth checking if there is a csh login file that is sourced by the csh build scripts.

OpenMPI testing

RY: Tested OpenMPI 2, 3 and 4 with Intel 2019. Consistent results across all OpenMPI versions at 1°, 0.25° and 0.1°. Some differences between Intel 2017 and 2019, not from the MPI library. Not sure if the difference is acceptable or not. Would like some help checking the differences.
Just looking at differences in access-om2.out. Maybe need to look at an output file like ocean.nc? RF: Need to compile with strict floating point precision to get repro results. MOM is pretty good; don’t know about CICE. Can’t use the standard compilation options; fp-precise at a minimum.
RY: If this difference is not acceptable, do we need to use flags to check the difference between 2017 and 2019? RF: Once you get a single bit change, chaos takes over and the runs diverge. RY: Intel 2017 will still be on the new system. AH: So not only the newest versions of modules on gadi? RY: 2017 will be there, but no system software will be built with it. AH: We have done a lot of testing before. Should be possible to just use the 1 degree config as a test to get 2017 and 2019 to agree. There are repro build targets in some of those build files. Could try and find them. RY: Yes please.
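For reference, a reproducible (repro) build typically uses stricter Intel floating-point options than the default optimised build. An illustrative set of flags only; the actual repro targets in the ACCESS-OM2 build files may differ:

    # Intel Fortran: trade some speed for consistent floating-point results
    FFLAGS += -fp-model precise -fp-model source -fimf-arch-consistency=true -align all

All components (MOM, CICE, libaccessom2) would need to be built this way before comparing Intel 2017 and 2019 output bit for bit.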
AK: Any difference in performance? RY: No big difference. NH: On the new machine? RY: No, the old machine, with Broadwell.
RY: NCI recently sent out a gadi update via blog and webpage: 48 cores/node. NH: Did we think it was 64 cores/node? AH: Still 150K cores in gadi, with 30K of Broadwell+Skylake. Maybe have to change some decompositions. RY: Not the same as any existing processors.
AH: Two-week overlap with gadi, then /short will be read-only. RF: There was panic in ACCESS due to an email that said /short would disappear in mid-October. AH: Easy to misread those dates.

accessom2 release strategy

AK: Harmonising the accessom2 configurations. The release strategy has been somewhat haphazard, and releases are not tested. Maybe have a master branch that is known good, and a dev branch people can try if they want? Any thoughts?
NH: Doing this well is really time-consuming and labour-intensive. It would mean testing every new configuration; not sure if we can do that. Have tried to keep master of the parent repo referencing only master of all the control experiments. Not sure if that is necessary or desirable. Maybe it makes more sense to develop freely on your own experiment and keep everything in the control experiments stable? Not sure. If all control experiments must be stable and working, they can be a bit slow to update. Just update your own experiment.
AK: Some people are cloning directly from the experiment repos, some are cloning all of access-om2. It would reduce confusion if the control directories under accessom2 were kept up to date with the latest known good version. NH: Does make sense I guess. A shame for people to clone something that is broken when it has already been fixed. There is some python code in the utils directory which can update everything: it builds everything at all resolutions, copies to public space, updates all exes in config.yaml and does something with the input directories. AK: I ended up writing something like that myself.
AH: Should split the control dirs out of the access-om2 repo. It is a support burden to keep them synched, and not all users need the entire repository, as they use precompiled binaries. Tends to confuse people. NH: Did need a way for a config to reference the source code and vice versa. AH: Required to “publish” code? Maybe worth looking into. NH: Ideally from the experiment directories you need to know what code you are using. Probably got that covered: config.yaml references the executables, and the hash is in the executable as well; when the executable runs it prints out the hash from the source code. Is that enough to link them?
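For context, a trimmed sketch of the relevant part of an ACCESS-OM2 payu config.yaml; the executable paths are illustrative placeholders, not the actual published binaries:

    model: access-om2
    submodels:
      - name: atmosphere
        model: yatm
        exe: /path/to/bin/yatm_example.exe
      - name: ocean
        model: mom
        exe: /path/to/bin/fms_ACCESS-OM_example.exe
      - name: ice
        model: cice
        exe: /path/to/bin/cice_auscom_example.exe

As NH notes, the executables also print the hash of the source they were built from at run time, which is what ties an experiment back to a particular build.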
AH: I recall NH wanted to flip it around and have the source code part of the experiment. NH: Probably too confusing for users. AH: True, but a useful idea to help refine a goal and best way to achieve it.
AH: A dev branch is a good idea. Then you have the idea that this is the version that will replace the current master. Can then possibly entrain others into the testing. Users who want updates can test stuff, you can make a PR and detail testing that has been done.
NH: Good idea. Some documentation that says the experiments have stable and dev branches; then when people are aware and hit a problem, they can try dev and see if it fixes it. AK: Bug fixes should go into master ASAP. Feature development is not so urgent. A bit grey, as sometimes people need a feature, but they can work off dev. AH: Now have some process for this: hot fixes go straight in; other branches are dev/feature branches. Maybe always accumulate changes into dev. Any organisation helps.
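A minimal sketch of the branching workflow being discussed, using placeholder branch names:

    # feature work accumulates on dev
    git checkout dev
    git merge feature/new-forcing-option

    # urgent bug fixes go straight into master...
    git checkout master
    git merge hotfix/runoff-bug
    # ...and are folded back into dev so the branches do not drift
    git checkout dev
    git merge master

    # once dev has been tested, open a PR from dev to master,
    # detailing the testing that was done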
NH: Re removing the experiment repositories: the namelists depend on the source code. AK: Covered by the executables defined in config.yaml. NH: Yes, ok.

FAFMIP PR

RF: Did it work? It’s got a lot of merges. Just two lines changed. Did a merge and pushed it to my branches on GitHub. AH: I’ll merge it in. Just wanted to check. Can always make a new master branch that tracks origin, check that out and pull in code from other branches. RF: Have a lot of other branches. AH: Can get very confusing.
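A sketch of the clean-up AH suggests, assuming the remote is called origin and the default branch is master:

    git fetch origin
    git checkout -B master origin/master    # local master reset to match origin
    git merge my-feature-branch             # then bring in only the changes wanted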

payu restart issue

AH: Issue has resurfaced. I commented on #193, but didn’t look into the source of the problem. Should look into it rather than talk about it here.

FMS subrepo

AH: Still not done the testing on this. Been sick. Will try and get back to it.

Tenth update

AK: Andy has done 50 years with RYF 90/91. Running stably. AH: What timestep? RF: Think he was using 600s. AK: 3 months per submit. Should ask for a longer wall time limit. RF: Depends on how the queues will be on the new machine, what limits and what performance. AH: Talking about high temporal resolution output. AK: Putting out 3D daily prognostic fields, including vertical velocity. Want it for particle tracking. Slowed it down a little bit. RF: More slowdown through ice. AK: No daily outputs from CICE.
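For reference, daily 3D prognostic output from MOM is driven by entries in the FMS diag_table along these lines; the file name is a placeholder and the field names (u, v and the vertical velocity wt) should be checked against the diagnostics actually registered in the configuration:

    # file entry: "file_name", output_freq, "units", format, "time_units", "long_name"
    "ocean_daily_3d", 1, "days", 1, "days", "time"
    # field entries: "module", "field", "output_name", "file_name", "sampling", "reduction", "region", packing
    "ocean_model", "u",  "u",  "ocean_daily_3d", "all", "mean", "none", 2
    "ocean_model", "v",  "v",  "ocean_daily_3d", "all", "mean", "none", 2
    "ocean_model", "wt", "wt", "ocean_daily_3d", "all", "mean", "none", 2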

CICE PIO

NH: Still in progress. AK: Also requires a newer version of netCDF? NH: Requires a specific version of netCDF: it needs a parallel build, and there isn’t a parallel build for every version. AK: There is a parallel build for 4.6.1. RF: There is a bug in the HDF5 library it is linked to, documented in PIO. Probably a bug we’re not going to trip: it occurs when doing a collective write with some of the processors not taking part/writing no data. Fixed in the next version of HDF5, 1.10.4? AH: So it is not the netCDF version so much as the HDF5 library it links to. RF: Yes. AH: So we should make sure we ask for a version of netCDF that doesn’t have this bug? AK: Will add it to the request.
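To check whether a given netCDF module is a parallel build, and which HDF5 it links against, something like the following works; the module name is a placeholder:

    module load netcdf/4.6.1        # placeholder; use the parallel-enabled module
    nc-config --version
    nc-config --has-parallel        # reports "yes" for a parallel build
    # see which HDF5 library it actually links (the bug above was thought fixed in 1.10.4)
    ldd $(nc-config --prefix)/lib/libnetcdf.so | grep hdf5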
RY: If you want a parallel version, should it use OpenMPI 3 or 4? AH: Good question! RY: All dependencies will be available and very easy to use. AH: Is this using spack? RY: It sits above spack and other tools. Automatic builds with all possible combinations. AH: Are you using it for your builds? RY: We were asked to test it and are now using it. Currently it is difficult to create new versions; during the transition it is difficult, but in the new system it should be fixed quite easily. AH: That should fix the problem of various versions of OpenMPI with different compilers. RY: Yes. AH: Will there be a compiler/OpenMPI toolchain? RY: It will automatically use the correct MPI and compiler. AH: Any documentation? RY: Some preliminary documentation, but not released. When gadi is up all this should be available.
AK: Should I ask for a specific version of MPI? RY: If you don’t specify, it will be built with 3 or 4. Do you have a preference? AK: No, just want whichever version has the performance and stability we need. Do we need to use the same MPI version across all components? RY: Not necessarily. Good time to try OpenMPI 3. No performance benefit, as the system hardware is still old hardware.