Technical Working Group Meeting, August 2019

Minutes

Date: 14th August, 2019
Attendees:

  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) RSES ANU, Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Marshall Ward (MW) GFDL
  • Nic Hannah (NH), Double Precision
  • James Munroe (JM), COSIMA

PIO work with CICE

NH: PIO code in CICE not as complete or thorough as netCDF code. Nothing to suggest it won’t work. Relies on NCAR PIO library, and a CESM utility library. Dependencies which are not part of CICE. Built PIO dependency on raijin, ran into CESM dependency. Can either remove dependency or remove code.
NH: Initially thought to use the MOM approach: tile and collate. Russ' comments encouraged trying PIO instead. It will be supported in future, including in CICE6. Nothing working yet, but will soon test with 1 degree.
RF: Real bottleneck with high-frequency output. Worth a go. There is an attempt by Hartnett to put this into FMS. AH: Different to parallel netCDF? NH: PIO is a wrapper around parallel netCDF, written by NCAR to simplify it. Another layer. On GitHub, continuing to be maintained. RY: A wrapper that does the work to match the compute domain to the IO domain. Not so useful for MOM5 as it has io_layout already.
MW: Hartnett motivated by FV3 (forecast model) rather than ocean. Not sure which project he's even involved in.
NH: Big test is handling the interesting CICE layout, the difference between the Cartesian grid and the PE layout. MW: PIO will support explicit decompositions and other approaches.
NH: Parallel netCDF version on raijin only links with OpenMPI 3.0. RY: New machine launched soon. OpenMPI 1.* will be dropped; no new software will depend on it. MW: OpenMPI 2 is not good. Should use 3.
NH: Probably have to test this with OpenMPI 3.0. RY: 3.1.3; switch everything to that. Good test for the new machine. AH: Working now? RY: My fault, used a mismatched OpenMPI library. Everything looks fine now: OpenMPI 2/3/4 with Intel 19 all working. 1 deg and 0.25 deg working; tenth not working. MW: I was able to run tenth with 3.1.2/3.1.3.
MW: One of the Intel compilers broke MOM. A compiler bug with types within types.
AH: Should start an issue for testing. RY: Will email MW directly. Not a MOM bug.
MW: Tried MOM-SIS tenth? Good test. RY: Do have this working from earlier this year. This testing is for the new machine, so ACCESS-OM2.

OMIP date restart protocol

RF: Talked to Griffies. GFDL take ensemble approach. Run for N years using true dates. At finish reset back to start date with correct calendar. Storing new stuff in different directory. End up with 5 sequences of 55 years. All dates are correct. No issues with leap years going wrong. Think this is the best way to go.
AK: Came to the conclusion that this was the right way to go, mostly due to the leap year issue. Problem is whether we can get the model to do it; Maurice and Ryan had issues with CICE getting the correct date. CICE has a flag "use_restart_dates". Suggested setting this to false and setting the dates in access_restart.nml, but CICE is not picking up the dates. Looks like libaccessom2 is not passing them on to CICE. Some confusion about exactly what they have done. There are instructions on the wiki for restarting (an IAF run from RYF at tenth), but they don't work for other people. NH: I'll look at it. AK: Will send the issue. NH: Didn't realise this was happening. CICE date handling is not great.
AH: Downside with the ensemble approach is that it is difficult to compute metrics across the whole time series. RF: Need extra metadata added in, maybe which cycle you're in, plus an extra variable giving the actual number of days since the start of the run. Done with post-processing; might then be able to concatenate files using the extra metadata (see the sketch below). AH: Always have issues with missing leap years if the run spans a century, but only daily output is an issue. AK: The Cookbook could do something. MC: Pretend it is noleap? JM: Is the data being analysed as a time series? AH: Extra metadata, say an offset day, is a good idea. RF: Add buffer space in the netCDF header so copies aren't needed; mppnccombine can add padding. Usually done with nccreate, making sure the header has some space. hbuf?
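A minimal sketch of RF's extra-metadata idea, assuming Python with xarray and one output file per cycle; the paths, the days_since_run_start variable name and the omip_cycle attribute are all hypothetical:

    import numpy as np
    import xarray as xr

    N_CYCLES = 5
    CYCLE_DAYS = 55 * 365.0   # assumes a noleap calendar, 55 years per cycle

    cycles = []
    for n in range(N_CYCLES):
        # hypothetical per-cycle output layout
        ds = xr.open_dataset(f"cycle{n}/ocean_scalar.nc", use_cftime=True)
        t0 = ds["time"].values[0]
        # days since the start of this cycle, from the model's own time axis
        days = np.array([(t - t0).days for t in ds["time"].values], dtype=float)
        # RF's extra variable: actual days since the start of the whole run
        ds["days_since_run_start"] = ("time", days + n * CYCLE_DAYS)
        ds.attrs["omip_cycle"] = n        # which cycle this file belongs to
        cycles.append(ds)

    combined = xr.concat(cycles, dim="time")   # one record for analysis

The true dates in each cycle stay untouched; analysis concatenates on the offset variable instead.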

Strategy for CICE updates for flexibly adding fields

RF: The way CICE drivers work, the variables you want are either hard coded, or you muck around with pre-processing to compile them in and out. Wondering if anyone has looked at doing it on the fly, using the error codes coming back when setting up variables, so there can be a flexible number of variables passed in and out. Would like this to pass total wind speed, to harmonise code. Also Hakase wants it for some BGC stuff: phytoplankton through to the ice. So specify the variables and work out whether they're there or not.
NH: Would want the exe to handle configurations with different sets of coupling fields, sometimes including total wind speed, sometimes not. RF: Would know the complete set; if a field is not there, skip it. Currently they have to be hard wired in, or you make another driver. NH: Way to do it: start with the superset in namcouple, and the code would exclude certain variables. RF: Maybe if a variable is not in namcouple, return an error code, but ignore the error. NH: Shouldn't be too hard to do. OASIS does return error codes that could be used; it can either abort or return an error code, and if it aborts we could change that. AH: Restart fields? NH: Should be handled behind the scenes.
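A sketch of the superset approach, in Python pseudocode for brevity (the real driver is Fortran); register_field is a hypothetical stand-in for the OASIS variable-definition call, which per the discussion above can return an error code instead of aborting, and the field names are illustrative:

    # what this particular namcouple happens to define (illustrative)
    CONFIGURED = {"swflx", "lwflx", "tair", "uwnd", "vwnd"}

    def register_field(name):
        """Hypothetical stand-in for the OASIS call: returns (err, var_id),
        with a nonzero err if the field is absent from namcouple."""
        return (0, len(name)) if name in CONFIGURED else (1, None)

    # superset of every field any configuration might couple (illustrative)
    SUPERSET = ["swflx", "lwflx", "tair", "uwnd", "vwnd", "wndsp", "phyto"]

    active = {}
    for name in SUPERSET:
        err, var_id = register_field(name)
        active[name] = var_id if err == 0 else None  # ignore error, just skip

    # at exchange time, only the configured fields are put/got
    coupled = [name for name, vid in active.items() if vid is not None]
    print("coupling:", coupled)

One executable handles any configuration; dropping a field from namcouple simply deactivates it, with no recompile or separate driver.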

Paths for JRA55-do forcing files. Some changes to support v1.4

AH: JRA55-do is now part of Input4MIPs, part of CMIP6, so we have to use the CMIP6 copy. It encodes all the metadata in the filename, and consequently doesn't currently work with YATM. Circumvented by creating symbolic links with names that work with YATM. When I did this I couldn't reproduce; not sure if this is actually an issue with the fields being different or not.
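The symlink workaround, sketched in Python; both filename patterns here are illustrative, not the real Input4MIPs or YATM ones:

    import os

    SRC = ("rsds_input4MIPs_atmosphericState_OMIP_MRI-JRA55-do-1-4_gr_"
           "{y}01010130-{y}12312230.nc")   # illustrative CMIP6-style name
    DST = "rsds.{y}.nc"                    # illustrative YATM-friendly name

    for y in range(1958, 2019):
        src, dst = SRC.format(y=y), DST.format(y=y)
        # link only where the source exists and the link isn't already there
        if os.path.exists(src) and not os.path.lexists(dst):
            os.symlink(src, dst)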
AH: Tried to use the testing framework NH developed for this, using Jenkins. The historical test that tests against known checksums doesn't seem to actually compare them; not sure if that is intentional. Would like to use the framework, as NH has done a great job with it.
MW: MOM6 has a diag_mediator, which supports a CMOR name alongside the internal model name. Porting it to MOM5 is a big task, but the idea is good and saved them a lot of work. Could create a thin wrapper to translate to CMOR names if that helps. AK: How would that integrate with YATM? MW: Don't know. It is at the FMS level, so would only help with one model (MOM). AK: YATM accesses the JRA files, so it would be a libaccessom2 change. AH: Looked at the YATM code. It generates the filename from the date. Input4MIPs files span the current year and the next year, so this would require code changes. Might it just be easier to create a file with a date->filename mapping? Possible to do; would need to add a token for year+1. Probably best to do it that way.
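A minimal sketch of the year+1 token idea, assuming YATM's filename templates could grow a second token alongside the existing year one; the token spellings and the filename pattern are hypothetical:

    def expand(template: str, year: int) -> str:
        # replace the longer token first so "{{year}}" doesn't clobber it
        return (template
                .replace("{{year+1}}", str(year + 1))
                .replace("{{year}}", str(year)))

    # illustrative Input4MIPs-style pattern spanning into the next year
    tmpl = "tas_input4MIPs_OMIP_gr_{{year}}01010130-{{year+1}}01010000.nc"
    print(expand(tmpl, 1984))
    # tas_input4MIPs_OMIP_gr_198401010130-198501010000.nc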
AK: Also need code changes for v1.4: solid and liquid runoff are separate. What to do with solid runoff? Griffies says they either use an iceberg model, or melt the icebergs and add them to the runoff. Should we take into account the latent heat of fusion? Assuming the solid runoff is at zero degrees, which could be a problem. Put in a request to download v1.4; the scripts they have should automatically download it, but haven't. MW: Think GFDL only has v1.3.
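For the melt option, a sketch of the bookkeeping under the assumption flagged above (solid runoff arriving at 0 degC as a mass flux); function and variable names are illustrative:

    L_FUSION = 3.34e5   # J/kg, latent heat of fusion of ice

    def melt_solid_runoff(liquid, solid):
        """liquid, solid: runoff mass fluxes [kg m-2 s-1].
        Returns the combined liquid runoff and the (negative) heat flux
        the ocean pays to melt the ice [W m-2]."""
        return liquid + solid, -L_FUSION * solid

    total, heat = melt_solid_runoff(liquid=2.0e-5, solid=5.0e-6)
    # heat = -1.67 W m-2: melting cools the ocean even before the
    # meltwater's own temperature is considered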
MW: Fields go to the end of 2017; has 2018 been downloaded? Or am I looking in the wrong place? Looking in ua8. AK: Should look in qv56, which is up to Feb 2018. AH: If it's not automatically downloading, we should ask. What does the OMIP protocol say about the end date? AK: Can find out about 2018 from JRA55. RF: It is specified, but would like the latest data for ongoing runs.

Testing FMS merge

AH: Putting FMS in as a sub-repo. Just needs testing. If it reproduces checksums for a month, are we sure it is ok? Is that sufficient?
NH: When Marshall upgraded FMS, he went through every MOM test, including 0.25. Can't recall how strict we were. AH: Is the testing framework still there? NH: It is there, but because it never gets used it might have rotted a bit. Can give Jenkins the URL of a PR and it will run it. We should work together to get that working.
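The reproducibility check amounts to diffing checksum lines from two run logs; a sketch, with a made-up log line format and paths rather than whatever MOM actually prints:

    import re

    def checksums(path):
        # illustrative pattern: lines like "checksum: deadbeef"
        pat = re.compile(r"checksum\s*[:=]\s*(\S+)", re.IGNORECASE)
        with open(path) as f:
            return [m.group(1) for line in f if (m := pat.search(line))]

    a = checksums("control.stdout")        # hypothetical log paths
    b = checksums("fms-subrepo.stdout")
    print("reproduced" if a and a == b else "mismatch or no checksums found")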

New NCI HPC hardware announcement

RY: System by the end of the year, in 2 phases. First, install the new machine with Cascade Lake nodes, with a short period where Gadi and raijin run simultaneously. After that the Skylake and Broadwell nodes will be merged into the new machine and the Sandy Bridge nodes removed. 100 GPUs installed. 16 Skylake K80 nodes. PBS Pro again. Storage and network on InfiniBand, 200 GB/s transfer speed. OS is CentOS 8. AH: Trying to figure out the total core count for the new machine. Do you know what it will be? RY: Not clear on the exact number; can check with the systems people. If 32 cores/node, 150K+ processors. AH: Will run-time limits be extended for the new machine? Find 5 hours too low for high core count jobs; it reduces flexibility. RY: Queue time limits are per project and quite flexible; contact NCI help. AH: Have asked for time limit changes in the past, but they are usually time limited. RY: Have been asked by other users; not sure about the policy. Good time to ask and get a better policy for the new machine.