Technical Working Group Meeting, August 2018

Minutes

Date: 14th August 2018
Attendees:

  • Marshall Ward (MW), NCI
  • Aidan Heerdegen (AH) (Chair) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Nic Hannah (NH) Double Precision

Peter Dobrohotoff (CSIRO Aspendale) gave his apologies that he couldn’t attend.

MPI Coding Update

MW: Fixed constellation of bugs. 1/10th still not working under MPI, looks like a new issue. How do I approach getting this code into repositories? Not invested, but want it working in the long term. Intel v18 has a bug, fixed, but found another. Dale has built an OpenMPI3 library using Intel v17. Do people want to use it? Or afraid of another issue?

AH: What advantages? MW: Hangs in MPI_Init, commsplit, and hangs at random time steps. MXM seems to have solved random hangs. First two still happening. Still getting random fails during initialisation. Betting on newer library to solve them. Do we want to invest in new libraries and hope we get new solutions, or happy with status quo?

AH: Is there another way? Maybe dev branch on MOM5, submit to more testing? MW: Yes. I can add all build stuff to all repos, independent of NCI configs. Optionally turn them on over time. AH: Just using different versions of OpenMPI? MW: Yes, have my FMS changes as well. AH: FMS changes in master branch? MW: Could do multiple ways. FMS not been updated to GFDL version. Could have subtree/submodule or just dump in. Will change everyone’s code, so might not be solution. I want people to start testing soon. AH: The init hangs: critical core count when these occur? MW: Yes. Tenth gets them, not at 1 degree. AH: Don’t have these at a quarter either? MW: Don’t run 1/4, but frequency of errors increases with cores. AH: We could test this by running the model for a single time step lots of times to quantify the issue, see how bad it is, and see if we get improvement. MW: Have not seen commsplit hang with newer versions of OpenMPI. AH: Do you get hangs, AK? AK: Get hangs at initialisation about 10% of time. Since going to YATM not had these issues. Also much more consistent with timing. Was 1.5-2.5 hours/month. Now 4h5m-4h10m for 3 month submit. MW: When we studied variability it was always IO issues. AK: Don’t know what is behind it. AH: Has coincided with transition to YATM? AK: Yes.
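The single-time-step test AH suggests could be scripted roughly as follows. This is a minimal sketch, not an existing tool: `count_hangs`, the timeout value and the launch command are hypothetical placeholders, and a run is counted as a hang simply because it exceeds the timeout.

```python
import subprocess
import sys


def count_hangs(cmd, n_runs, timeout_s):
    """Run `cmd` n_runs times and return (hangs, failures, successes).

    Assumption: a healthy single-time-step run finishes well inside
    timeout_s, so any run that exceeds it is counted as a hang.
    """
    hangs = failures = successes = 0
    for _ in range(n_runs):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process when the timeout expires.
            hangs += 1
            continue
        if result.returncode == 0:
            successes += 1
        else:
            failures += 1
    return hangs, failures, successes


if __name__ == "__main__":
    # Placeholder command: substitute the real single-time-step model launch.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    hangs, failures, successes = count_hangs(placeholder_cmd, n_runs=5, timeout_s=30)
    print(f"hangs={hangs} failures={failures} successes={successes}")
```

Running the same count against each MPI library build would give comparable hang rates. Note that when the command is an MPI launcher, killing it on timeout may not clean up every rank, so a real harness would probably need to run inside the batch system.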

AH: Just got to a point where the model is running at all. AK: By late today will have 1 year of IAF. Need capacity to shift to new MPI versions. MW: Concerned about next machine. OpenMPI 1.0 will not be available. Not concerned short term. Concerned long term and stability issues. AK: Yet to fail.

MW: Just updating. Want a plan. Will submit build changes to all the projects. Will also replace FMS with submodule but no change in code, but can be changed when required. AH: With submodule can easily test changes. Need a plan for implementation, testing and checking timings.

NH: YATM makes sure it does all its work before ICE asks for anything. Does reads and regridding and then waits until required. Would hide any jitter in this disk access. AK: Preemptively fetching data. MW: Good case for IO servers.
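The read-ahead behaviour NH describes can be illustrated with a small sketch. This is not YATM's implementation: `prefetch` and `read_record` are hypothetical names, and a background thread with a bounded queue stands in for whatever mechanism YATM actually uses.

```python
import queue
import threading


def prefetch(read_record, n_records, depth=1):
    """Yield read_record(i) for i in range(n_records), reading ahead in a
    background thread so slow reads overlap with the consumer's own work."""
    buffer = queue.Queue(maxsize=depth)  # bounds how far ahead we read

    def producer():
        try:
            for i in range(n_records):
                # put() blocks while the buffer is full, i.e. until the
                # consumer has taken the previous record.
                buffer.put(("ok", read_record(i)))
        except Exception as exc:
            # Hand read errors to the consumer instead of leaving it waiting.
            buffer.put(("error", exc))

    threading.Thread(target=producer, daemon=True).start()

    for _ in range(n_records):
        status, value = buffer.get()  # returns at once if the read is done
        if status == "error":
            raise value
        yield value


if __name__ == "__main__":
    # Stand-in for an expensive read-and-regrid step.
    for record in prefetch(lambda i: f"forcing record {i}", n_records=3):
        print(record)
```

As long as a read finishes before the consumer asks for the next record, variation in read time is hidden, which is the effect on disk-access jitter described above.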

Date handling bug in YATM

AH: We all good now? This fixed? AK: Yes. Just had to tell it to ignore the date in ice restart for the first one. AH: We have a method for this strange restart? Will we need to do this again? NH: Will happen any time you want to use someone else’s restart and don’t want to use their calendar. AH: Are there code changes we need to support this? AK: No. One-off thing. Just need to change a value in CICE restart and then change back again. AH: Kial got burnt with this once. MW: We could do this at payu level. NH: Little more to it. Also need to change the MOM and YATM dates. AK: 3 or 4 things I needed to do to make it go. AH: Think if we need to streamline this? NH: To begin with just document on wiki. AK: I can put it up there.

MOM5 wind langmuir mixing stuff

AH: Fixed?

RF: Should not be able to do an ACCESS-OM run but not do langmuir mixing unless using u10 calculated from empirical formula. Won’t break current ACCESS-OM runs. Have to look into CICE5 and work out how to get winds into ACCESS-OM. ACCESS-CM was fine. For OM I thought there was an option to pass winds, but I misread a preprocessor flag.

RF: A couple of other issues. The order it passes the fields between ice and ocean, the 10m winds are 22nd, another one at 21 which isn’t used. They can’t be done as a common thing. Strange code. The changes I have done will make it safe for the time being. Have to explicitly compile them in if you want to use the winds. AH: Under what circumstances can you use langmuir in ACCESS-OM? RF: Can use MOM6 style to calculate 10m winds. Need to turn on another namelist option. Currently don’t pass winds in OM models. AH: If want to use langmuir do you need to also set the compile flag? RF: At moment can’t pass 10m winds from CICE to MOM. Defaults will be fine. ACCESS wind preprocessor flag is for ACCESS-CM. ACCESS wind flag is a placeholder at the moment. Problem is all allocations are made no matter what. Currently initialised to zeroes. AH: A placeholder for the future when get winds through CICE? RF: Yes. Would like to figure out how they pass 10m winds in fully coupled model, and whether they mask them with ice or not. Currently not clear. Would like to make them compatible. AH: I will put in those changes, add new changes and submit a PR. MW: Turn off langmuir by default? It’s broken? RF: No, can calculate winds in MOM6 style. Not using this at the moment.

MOM5 testing

AH: Above bug brought home issue of testing on MOM5 repo. Currently have 3 targets, MOM-SIS, ACCESS-OM, ACCESS-CM. This got through because I only tested MOM-SIS. NH: There is a Jenkins setup which runs every MOM6 test case. Can’t remember if it has ACCESS-OM, ACCESS-CM builds/test cases. Spent a lot of time to set up testing but it takes maintenance work. Doesn’t run periodically. Love the idea of testing MOM5, please look at what I have already done. Like the idea of production ready, but it takes effort to maintain the system, might not be justified by the number of tests we have to run. If we had weekly PRs would make sense. If infrequent need to revisit the testing every time. AH: Idea was to do some simple builds. MW: Build tests on Travis? AH: Yes. MW: Don’t have to run, just build. Nic did a lot of work to do runs.

NH: Periodically: ACCESS-OM2 build test, and a fast run test (1 day experiment). There is a lot of stuff being done for MOM5, no build test. MW: Is this the GFDL tests? AH: Nic runs the GFDL tests. NH: MOM5 runs not run as frequently. Not maintained, going red. Sure if something simple, maybe not worth doing on Jenkins. But definitely take a look? MW: Travis for commits? Weekly Jenkins runs for commits. AH: can see five MOM5 builds. NH: folder mom-ocean.org on Jenkins. MOM6 guys get a lot of value from it. AH: will take a look. NH: If you can’t change anything let me know.

ESM 1.5 Repo

MC: Didn’t work with https. AH: Made an ESM1.5 repo on OceansAus for Matt to upload MOM5+Wombat code. Pretty much frozen. Peter wanted somewhere to put this. Should it be possible to https? RF: Always had to use ssh. NH: Just need to put in password. MW: https should work, help to know error. AH: give it another crack. Complain on slack. Get on slack.

AH: ESM1.5 repo on OceansAus won’t change much (now frozen), but we have goal of getting WOMBAT code into main MOM5 repo. MC: Might do that in parallel. Who knows what will happen to ESM1.5. Depends on where ACCESS-CM investment goes. ACCESS-CM2 is quite expensive. In the process of putting WOMBAT into ACCESS-OM-1.0. Going through steps. Put WOMBAT into it and submit PR. AH: ESM1.5 is just MOM + WOMBAT? AH: We’re doing the harmonisation, so MOM5 master will have all the important changes. Once we have WOMBAT we have an ESM1.5 equivalent. ESM1.5 will be workhorse coupled model for CoE because ACCESS-CM2 is too expensive. Whilst ESM1.5 on OceansAus will be the canonical version, the MOM5 repo will be effectively the same but can include updates to diagnostics etc.

MC: Checked out ACCESS-OM2-1.0, compiled, but falling over on running payu. Config file has changed a lot since I last used it. Want to run a 1 degree RYF model as basis. MW: Is Matt using the version that isn’t working? Is that what Matt is dealing with? AH: Matt, get on slack and let us know your issues, and we’ll get you going. AK: Looking for a working setup? MC: Yes, 1 deg JRA RYF. AK: Can point you to working config. MC: A month since I cloned. AK: Yeah, need to update.

AK: Asking for just config? MC: Yes, but any information useful. AH: Kial has a lot of configs. I cloned one and changed exe paths and was up and running very quickly. MC: I cloned and built, but when I checked out the config it was pointing to common shared exes. AK: Should change that. NH: Maybe documentation is out of date. Should follow the simple “if you’re a raijin user” instructions. MC: Yes mostly worked. AH: Get on slack! MC: Browser is out of date. NH: If you do it again, follow the quickstart for raijin users instructions. If that doesn’t work we need to fix stuff. MW: A lot of user problems we don’t know about. We have to think about students who will be coming to run this. If Matt can’t figure it out there is no hope. AK: There is a lot that needs to be updated for the more complex instructions. Also the configurations in control are not what they’re currently using. Could fix that easily.

JRA55-do versioning

AH: Andrew has issues with a ‘latest’ directory that has symlinks that point to most recent version. AK: Common use case is perturbation experiments. Go back to previous restart and branch a new experiment, but need to know what forcing was used. Rather than latest, have a directory which is named for the date it was set up, or date forcing was updated. If and when things are changed, make another one. All softlinks. AH: One good thing about latest is you have a config that always works with most recent version. If someone has a config with latest and starts a new model, they can be confident that it works. AK: No problem extending forcing, only an issue if old forcing files change. AH: They have versioning issues with CMIP5, have a database. NH: latest is not reproducible. Experiment I ran, but latest is changed. Problem with old system, every version jumbled in one directory. At times there were different variables which had different versions. Not all variables had the same version. AH: That is correct. NH: If there is a single directory that has all the variables for that version that is fine. AH: In some cases the variables don’t have the same version. I agree this is an issue, but best solved with manifests in payu. MW: Filename is not a good system. Filenames change and hashes don’t. AH: If someone has a naming scheme they want, then happy to implement it, but will keep latest, and solve using manifests. NH: Was there a reason to put all versions in same directory? AH: The way the JRA55 people publish it.
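MW's point that filenames change and hashes don’t can be illustrated with a content-hash manifest. This is a minimal sketch, not payu's manifest format or code: `file_hash`, `build_manifest` and `changed_files` are hypothetical helpers, and SHA-256 is an assumed choice of hash.

```python
import hashlib
import os
import tempfile


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each path to the hash of the file it resolves to.

    Symlinks (such as those under a 'latest' directory) are followed, so
    the manifest records the content that was actually used.
    """
    return {p: file_hash(os.path.realpath(p)) for p in paths}


def changed_files(manifest):
    """Return the paths whose current content no longer matches the manifest."""
    return sorted(
        p for p, digest in manifest.items()
        if file_hash(os.path.realpath(p)) != digest
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        forcing = os.path.join(tmp, "forcing.nc")  # stand-in forcing file
        with open(forcing, "wb") as f:
            f.write(b"version 1")
        manifest = build_manifest([forcing])
        with open(forcing, "wb") as f:
            f.write(b"version 2")  # same name, new content
        print(changed_files(manifest) == [forcing])
```

Storing such a manifest alongside each run would let a branched perturbation experiment detect that a symlink under latest now resolves to different forcing, without relying on directory or file naming.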

AK: Do we care if JRA forcing is extended? Does it affect reproducibility? NH: Not an issue. YATM has no end date for an experiment. You set a forcing start/end date, so no problem.

Misc

RF: Pavel Sakov is running a KDS75 MOM only on OFAM -75/+75 tenth model. Running 600s timestep from the start, hoping to get up to 900s. The problems in global model are not between +-75. NH: Just poles messing us up. RF: From a flat surface, huge heave. NH: All those little grid boxes. AK: Yes the tripole is the issue. AH: Redo bathymetry? RF: Did a naive regridding, some issues, potholes etc. Still works. Will be running a 100 member ensemble. AH: What is he trying to find out? RF: Look at some issues with OFAM/BRAN/OceanMaps. Interested to see how much is due to vertical resolution. Also a test for the future. An intermediate model between what we run at the moment, and what Andrew is running. MC: Interested in a figure from Kial at the COSIMA meeting, showing how variability changes with surface resolution. AH: How long will he run? RF: A year or two. Thought you might be interested.

Actions

New:

  • Incorporate RF wave mixing update into MOM5 codebase + bug fix (AH)
  • Code harmonisation updates to ACCESS and ESM meetings (PD, RF)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)
  • Update ACCESS-OM2 model configs (AK)

Existing:

  • Edit tenth bathymetry to remove Cumberland Sound (RF)
  • Update model name list and other configurations on OceansAus repo (AK)
  • Check red sea fix timing is absolute, not relative (AH)
  • Shared google doc on reproducibility strategy (AH)
  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)