Technical Working Group Meeting, November 2018

Minutes

Date: 13th November 2018
Attendees:

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) CLEX, and Andrew Kiss (AK) COSIMA, ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Nic Hannah (NH) Double Precision

Payu update

MW: payu is now python3 compatible. Can be run from a .local install. No longer uses modules. Tagged 0.11.1. AH: will also install into conda environment on raijin. NH: sounds great. Have used in conda python27 environment. Good to have python3.

MW: Want to get this to a position where I can leave it to others to support. Might get GFDL interested in it. Have time to wrap up a lot of things.

AK: How rigid is the 3 digit output in archives? MW: Only a print statement. Should work with higher numbers. AH: Won’t list nicely. MW: Had meant to add a format option.
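A minimal sketch of the point about three-digit archive numbers, assuming directory names of the form `output005`. The `archive_name` helper and its `width` parameter are hypothetical stand-ins for the format option MW mentions, not payu's actual code. It shows that higher numbers still format, but that a plain alphabetical listing no longer matches run order.

```python
def archive_name(prefix, counter, width=3):
    """Return a zero-padded archive name such as 'output005' (hypothetical helper)."""
    return f"{prefix}{counter:0{width}d}"


names = [archive_name("output", n) for n in (5, 998, 999, 1000)]

# Alphabetical order, as a plain directory listing would show it.
lexical = sorted(names)

# Run order, recovered by sorting on the integer suffix.
numeric = sorted(names, key=lambda s: int(s[len("output"):]))

print(lexical)
print(numeric)
```

Widening the padding (for example `width=4`) would keep alphabetical and run order in agreement for new runs, at the cost of names that differ from existing archives.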

MW: Try out the new payu, want to make it the new one.

TWG Organisation

AH: Peter Dobrohotoff sends his apologies, cannot make meeting.

AH: Do we need to decide on a new chairman? MW: Happy to resign straight away. MW: Considering going to GFDL in February to bootstrap remote working. AH: Nic did you want to take over? NH: No, convinced it’s not a good idea. AH: Sort something out next meeting. Last one of the year?

OM2/CM2 MOM5 Harmonisation

AH: Peter Dobrohotoff sends his apologies, cannot make meeting.

AH: Shame, as PD tested harmonised MOM5 but used incorrect namelist options. Losing some momentum as would like to have that checked off so we could start harmonised ESM.

MC: Wrong namelist options? RF: Didn’t have correct namelist options to include my new mixing scheme.

AH: They were also concerned about background mixing, in case that was having an effect. Could point them to the relevant PR/Issue on GitHub with all the plots and documentation showing it was working correctly. This is a valuable resource and a good way of working.

AH: Richard Matear contacted me and wanted to know status of ESM harmonisation and what he could do to help it progress.

ESM MOM5 Harmonisation

ESM is using a version of MOM5 updated to the beginning of the year with WOMBAT added by MC. It was decided not to continue with ESM harmonisation until CM2 has bedded down, as it requires some of the code changes from the CM2 harmonisation.

MW: Is cylc suite used for testing using the MOM repo compilation? AH: I believe Peter is currently turning off the automatic compilation in the suite and using the repo compilation script on the command line to create the MOM executable. MW: I think improving/streamlining and harmonising the cylc suite is as important as harmonising the code. AH: I would have liked a test suite that incorporates this, but this is the way Peter has been comfortable working. Can’t progress ESM until have the all clear that CM2 is working correctly.

MW: Richard & Matt are using the ESM model? Not using GFDL stack? MC: GFDL in decadal project. This harmonisation will get WOMBAT in the MOM5 master branch, which has long been a goal. Decadal project not using UM. Research effort there is data assimilation and reanalysis that they’re running, rather than updating the model. Won’t hear much from them. When AH gets the WOMBAT code in, please contact us. Have experience using payu runs at 0.25 deg. Will fix you up with files and help when it comes to testing. AH: I will mostly be relying on others to run stuff. MC: Not running under payu currently. Out of the loop at the moment. Did run one of Kial’s runs. I need help running with payu. AH: Yes we can do that together. Holger is trying to get payu to run with ESM, making slow progress.

MW: Who is supporting the ESM? Surprised Richard will contact AH. AH: Someone told him it was part of this code harmonisation. MC: Richard’s interest is WOMBAT in MOM5. AH: ESM will be the CLEX coupled model. MW: Who will be responsible for ESM? AH: Tilo is doing a CMIP6 submission with ESM. CLEX will be wanting to use all of Tilo’s runs to spin off their own experiments. At that point CLEX CMS will support the model with payu on NCI HPC systems.

MC: Tilo is doing the work of the equivalent of the entire CM2 team to get ESM working. We support Tilo somewhat, and there are others who are no longer formally part of the team but contribute. Richard has interests in this space also. AH: Tilo can benefit from work we’re doing. Scott found a 10% improvement in UM speed. MC: Speed/efficiency not a priority right now. Focus on land model, forcing etc. Not #1 priority. AH: If they’re open to that input, they can still get the benefit even if it isn’t their focus.

MC: Have to go to another meeting.

AH: Do we need to move the meeting? MW: Happy to move to another day if it works for people. AH: Doodle poll? RF: Next year. MW: Ok, do something next year.

Minimal 0.1 MOM

AH: Working really well. Ruth would like to get it to run a little faster so she could get 2 months per submit, which would improve her throughput a lot. Does NH have any ideas to speed it up? Seems like MOM model has 10 minutes of spare time. Is it CICE bound? NH: That might be the initialisation time. AH: Don’t MOM timings take into account initialisation? NH: Not until recently. Marshall fixed it. MW: Clocks weren’t showing MPI initialisation time, just MOM initialisation time. AH: Could be 10 minutes? MW: I would be surprised. Would guess 5 mins, less than 10. Big model, spends a lot of time on field exchange.

AH: Not compiled with AVX2. Will that help? MW: Did AVX and AVX2 test. Could see the difference, wasn’t large enough to bother making non-compatible binaries. NH: Slack me the path and I can take a look. Haven’t spent a lot of time trying to optimise it. MW: Can sometimes improve time by changing layout. GFDL tried very long tiles that means halo updates are only north/south. NH: She might have an older config. I switched to Sandybridge for efficiency. MW: I think you can get 7-8% speed up going from AVX to AVX2. AH: I advised Ruth to use broadwell due to memory requirements making for better throughput.

AH: I suggested she have higher diag_steps, but RF pointed out global scalars mean she is doing daily global MPI calls. RF said doesn’t necessarily have to be this way? Could it be changed in FMS? RF: Yeah, all the diagnostic code. In many cases time average can be commuted with area average. Every timestep doing an MPP sum or MPP global sum, can do local sums on local process and call MPP sum when you need to. Could rewrite the MOM code to do cumulative average and do an instantaneous output. Sort of fudge an average. MW: Wouldn’t make much difference to speed. RF: Depends how much those global sums are hurting you if any, but it might be a single global sum acts as a synchronisation point and doesn’t really matter. MW: I told NCI MOM didn’t do collectives because they were so fast they didn’t show up in profile. So unlikely to help a lot. RF: MOM collectives are very simple. If not doing bitwise stuff, just taking a collective of one number. MW: Caveat, only tested at 0.25 deg, so can’t know for sure it is the same at 0.1 deg. Should do it because it will eventually start to bite. Could do a profile?
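RF's remark that the time average can be commuted with the area average can be illustrated with a small sketch. This is not MOM or FMS code: `global_sum` is a hypothetical stand-in for a collective such as an MPP global sum, each "process" holds a single cell, and the call counter exists only to show how many collectives each approach would issue.

```python
def global_sum(local_values, counter):
    """Stand-in for an MPI collective that sums one number per process."""
    counter["calls"] += 1
    return sum(local_values)


def mean_with_sum_every_step(fields, areas, counter):
    """Area-average every timestep (one collective per step), then time-average."""
    total_area = sum(areas)
    acc = 0.0
    for step in fields:  # step[p] is the value held by process p
        weighted = [v * a for v, a in zip(step, areas)]
        acc += global_sum(weighted, counter) / total_area
    return acc / len(fields)


def mean_with_deferred_sum(fields, areas, counter):
    """Accumulate area-weighted values locally, then do a single collective."""
    total_area = sum(areas)
    local_acc = [0.0] * len(areas)
    for step in fields:
        for p, (v, a) in enumerate(zip(step, areas)):
            local_acc[p] += v * a  # purely local work, no communication
    return global_sum(local_acc, counter) / (total_area * len(fields))


fields = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]  # two timesteps, three processes
areas = [1.0, 2.0, 1.0]
every_step, deferred = {"calls": 0}, {"calls": 0}
print(mean_with_sum_every_step(fields, areas, every_step), every_step["calls"])
print(mean_with_deferred_sum(fields, areas, deferred), deferred["calls"])
```

Both functions return the same mean because the weights are fixed in time. Whether the saved collectives matter in practice is the open question raised by MW and RF, and would need a profile at 0.1 deg to settle.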

AH: AK only does 2 months/submit, so maybe we would all be better running a minimal config? MW: Doesn’t bode well for exascale. AH: So many constraints with PBS etc. AK: Optimised for model+machine+queue constraints not for the model on its own. NH: Could just bump all the CPUs by 10%? AH: You’ve got a nice sweet spot there, first try AVX2 and see if we can get the speed up we need.

MW: Vectorisation can help. AH: Would bigger tiles help? MW: So much time moving in and out of L1 cache that it doesn’t make much difference. AH: Broadwell got bigger caches? MW: Bigger L3 but 12 more cores. NH: Is she using 600s timestep without crashing? AH: Had an ice remap crash after a couple of years. AK: Unclear if that will happen again. Doing RYF, and I found once the crashes started happening they kept happening. RF: Using latest bathymetry? AK: Yes. RF: Any difference? AK: No idea. NH: I think it is generally more stable.

MW: Can play with barotropic halo. Barotropic solver has halo of 10 so it can do its work every 10 steps. Might be able to get some speed up by playing with that. AH: Put path to Ruth’s control directory on TWG slack channel.

NCRIS ACCESS meeting

NCRIS is doing a scoping study to see if it is feasible for a team of 15-20 people to support ACCESS modelling in Australia, which would be used for submission for funding from NCRIS. The meeting was to get feedback to help write the submission.

Some discussion of the experience of the meeting.

Calendar issues

RF: MOM uses Proleptic Gregorian calendar type, but does not use the correct calendar attribute when outputting the file. It sets it as Gregorian instead. So, when using days since 01/01/0001 there is a jump in October 1582 depending on which calendar is used. Get a 2 day offset for IAF files because of this incorrect calendar attribute. Found python netCDF interface uses udunits calendars and has problems. Had to force it to proleptic gregorian to read dates correctly. Big issue when dealing with daily data. Output files need to be fixed. Could change the calendar attribute to Proleptic Gregorian or change units to be days since a year after 1582. MW: GFDL use since 1900? RF: Yes, as this is what Ferret uses.
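The 2 day offset RF describes can be reproduced with a short worked example. This is a sketch using the standard integer Julian Day Number formulas, not code from MOM or any netCDF library, and the function names are illustrative. "Mixed" here means the Julian calendar up to 4 October 1582 and the Gregorian calendar from 15 October 1582, which is how the CF `gregorian`/`standard` calendar attribute is interpreted.

```python
from datetime import date


def _common_terms(year, month, day):
    """Terms shared by the Julian and Gregorian Julian Day Number formulas."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4, y


def jdn_proleptic_gregorian(year, month, day):
    base, y = _common_terms(year, month, day)
    return base - y // 100 + y // 400 - 32045


def jdn_julian(year, month, day):
    base, _ = _common_terms(year, month, day)
    return base - 32083


def jdn_mixed(year, month, day):
    """Julian calendar before the October 1582 switch, Gregorian after it."""
    if (year, month, day) >= (1582, 10, 15):
        return jdn_proleptic_gregorian(year, month, day)
    return jdn_julian(year, month, day)


def days_since_year_one(year, month, day, calendar):
    """Days since 0001-01-01 under the named calendar."""
    jdn = jdn_proleptic_gregorian if calendar == "proleptic_gregorian" else jdn_mixed
    return jdn(year, month, day) - jdn(1, 1, 1)


proleptic = days_since_year_one(2000, 1, 1, "proleptic_gregorian")
mixed = days_since_year_one(2000, 1, 1, "gregorian")
print(proleptic, mixed, mixed - proleptic)

# Python's datetime is proleptic Gregorian, so it agrees with the first value.
print(date(2000, 1, 1).toordinal() - 1)
```

The same time value therefore decodes to dates two days apart depending on which calendar attribute the reader trusts. With an origin after 1582 the two calendars agree, which is why changing the units origin also removes the problem.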

AH: Had a lot of date issues using python. Uses date library from numpy, which has a limited date range due to its nanosecond resolution. We often have to do date offsets anyway, so probably don’t see this issue as much. Should we put proleptic gregorian into MOM? MW: Shouldn’t we change the start date? RF: That is the easiest thing. There is a lot of broken software that doesn’t treat these calendars correctly. MW: Should tell GFDL about this. RF: Looked at the code and made some changes, but not uploaded. AK: Is MOM using the correct dates? As with coupling to CICE etc? RF: Works ok internally. AH: Arguably a bug if they’re using proleptic and not using correct attribute. RF: Yes. CMIP6 accounts for this. Checks for dates before 1582 and requires using proleptic gregorian. Future runs should have an offset of some later date.
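The limited date range AH mentions follows from simple arithmetic, sketched below under the assumption that times are stored as a signed 64-bit count of nanoseconds since 1970-01-01 (the layout of numpy's `datetime64[ns]`). The sketch uses only the standard library and truncates to microseconds, since that is the finest resolution Python's `datetime` supports.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
MAX_NS = 2**63 - 1  # largest value of a signed 64-bit integer

# Longest representable offset from the epoch, truncated to microseconds.
span = timedelta(microseconds=MAX_NS // 1000)

latest = EPOCH + span
earliest = EPOCH - span
span_years = MAX_NS / 1e9 / (365.2425 * 86400)

print(earliest, latest, round(span_years, 1))
```

The representable window is roughly 292 years either side of 1970, so model dates counted from year 1 fall well outside it and have to be offset or handled with a different date type.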

RF: Getting huge number of messages from restoring files starting at year 0000. Restoring files on a time modulo axis and created from Ferret, which automatically treats any file with a start year of zero or one as modulo. However year zero does not exist and is incorrect. Just need to change that attribute in the restoring files, won’t make any difference to operation but save a huge number of warning messages. MW: I get 482,000 lines of errors. I would be very happy if this is fixed. MW: Someone should change those fields. RF: I don’t have access. MW: Should go and edit the public forcing fields. What specific files? AH: If you’re talking about the ACCESS-OM2 configs, NH has the most ability to change them. MW: salt_sfc_restore? RF: Yes, and temperature, chlorophyll. Anything seasonal, a restoring. MW: Anything that says “months since 0000”? AH: Yes change to 0001. RF: Anything that uses that date (zero years) can be changed. AK: Anything that isn’t JRA that isn’t multiyear? RF: Maybe runoff? Do we use the JRA runoff? The problem really is the stuff MOM reads directly, like sponges. NH: I am happy to look into this. I might be the only one with access, hope not. Have been thinking about this for a while. Changed the OASIS code as well to ensure DEBUG_LEVEL zero does not output anything. Was outputting thousands of lines. Also an Andy Hogg GitHub issue and this was the next one on my list. MW: So far I can only find salt restore and ssw shortwave. RF: Shouldn’t be using that. Should be using GFDL formulation which reads in chl.nc.
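The change from “months since 0000” to 0001 discussed above amounts to a small edit of the time units string. The sketch below operates on the string only; `fix_year_zero` is a hypothetical helper, and the example units strings are assumptions about what the restoring files contain. Writing the corrected attribute back into a netCDF file would need a separate tool and is not shown.

```python
import re

# Match a reference year made only of zeros directly after "since".
_YEAR_ZERO = re.compile(r"(since\s+)0{1,4}(?=-|\s|$)")


def fix_year_zero(units):
    """Return the units string with a year-zero origin replaced by year 0001."""
    return _YEAR_ZERO.sub(r"\g<1>0001", units)


print(fix_year_zero("months since 0000-01-01 00:00:00"))
print(fix_year_zero("days since 1900-01-01 00:00:00"))
```

Strings whose origin is already year 0001 or later pass through unchanged, so the helper could be applied to every time axis in a set of files without first checking which ones are affected.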

AK: Some files in those ACCESS-OM2 input tarballs that aren’t used. Should they be removed? NH: Posted on slack about this? AK: Yes, but not sure they aren’t used by someone. NH: Bit messy how this is done. Should really just have a bunch of files and grab what they need. Would save a lot of space. Currently versioning sets of files rather than individual files.

mppnccombine-fast

AK: Some issues on GitHub. RF: Been discussing this with AH and Scott Wales. An attribute needs to be removed. AH: Biggest issue is regional outputs having incorrect dimensioning. Which has been fixed. Also fixed the unlimited dimension getting squashed. Also another issue with passing too many files on the command line due to an MPI issue. Requires a change to payu as globbing is now done internally so any glob needs to be quoted. It’s on my list of tasks.
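A minimal sketch of why the glob has to be quoted once the tool expands it internally. The tool name `combine-tool`, the argument order and the file names are placeholders, not the real mppnccombine-fast or payu interface. The point is only that the pattern must reach the tool as a single unexpanded argument rather than as thousands of file names.

```python
import shlex


def combine_command(tool, output, pattern):
    """Build a shell command line that passes the glob pattern through unexpanded."""
    return " ".join([tool, shlex.quote(output), shlex.quote(pattern)])


# Placeholder names: the pattern is quoted, so the shell leaves the '*' alone.
print(combine_command("combine-tool", "ocean.nc", "ocean.nc.*"))
```

If the command is launched from Python as an argument list without a shell (for example `subprocess.run([tool, output, pattern])`), no expansion happens and no quoting is needed. Quoting only matters when the command passes through a shell.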

MW: Original tool used a pattern? AH: Didn’t implement that in mppnccombine-fast, maybe we should? MW: Stopped doing that in payu to support some coupled FMS codes where tiles didn’t start at zero, but could go back to the old way. AH: Does using the pattern work with masked configs when tiles are missing? I can’t recall. MW: Not sure.

ACCESS-OM2 disk usage

AH: AK and I went through some of the 0.1 deg output directories and found we could get significant space savings in the ice diagnostics. AK: Ice outputs are not compressed, daily data is in individual files half of which is grid data. Can get an 8-fold decrease in size. Out of 20TB of total data can save 12TB of space. AH: Want to make a post processing script to run this automatically. AK: Yes, also delete all the zero length log files. AH: This was to clean up for archiving. MW: payu should do this, maybe not looking in the right places. NH: FORTRAN has an option when closing a file to delete if empty, so looking into that. Also some CICE logs just have one line at the top with exactly the same text. AH: Yes we found those, matched the same number of bytes and deleting them. MW: If payu isn’t deleting zero length files, it is not sweeping through submodels. AH: A lot tidier after cleaning. AK: Yes an hour well spent.
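The zero length log cleanup could be scripted along the following lines. This is a sketch only, not the post processing script AH refers to and not payu's behaviour: the function names are hypothetical, and the walk descends into every subdirectory so that submodel directories are included. The default is a dry run that only reports what would be removed.

```python
import os


def find_empty_files(root):
    """Return sorted paths of all zero-length regular files under root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


def remove_empty_files(root, dry_run=True):
    """List zero-length files under root and delete them unless dry_run is set."""
    paths = find_empty_files(root)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

The identical one-line CICE logs mentioned above would need a different test, for example comparing file size or contents against a known header, which is left out here.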