Technical Working Group Meeting, June 2020

Minutes

Date: 10th June, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)

Zarr

JM: Already have compression with netCDF4. How do they consume the data? Jupyter/dask? AK: Mostly in COSIMA, BoM and CSIRO have their own workflows. JM: Maybe archival netCDF4? As long as writing/parallel IO/combining, no hurry to move to zarr direct output. JM: Inodes not an issue. Done badly can be bad. lustre file system has a block size, so natural minimum size. At least as many inodes as allocatable units on FS. If a problem wrap whole thing in uncompressed zarr. RY: many filters. blosc is pretty good. can use in netcdf4, but not portable. needs to compiled into library. netCDF4 now supports parallel compression. HDF supported a couple of years ago.
AH: As we’re in science, want stable well supported software. Unlikely to use bleeding edge right now. Probably won’t output directly from model for the time being. Maybe post process.
NH: What about converting to zarr from uncollated ocean output? Why collating when zarr uncollates anyway? Also collate as difficult to use uncollated output. How easy to uncollated to zarr. JM: Should pretty straightforward. Write that block directly to part of directory tree. Why is collating hard? Don’t just copy blocks to appropriate place in file? AH: Outputs are compressed. Need to be uncompressed and then recompressed. Scott Wales has made a fast collation tool (mppnccombine-fast) that just copies already compressed data. There are subtitles. Your io_layout determines block size, as netCDF library chooses automatically. Some of the quarter degree configs had very small tiles which led to very small chunks and terrible IO. AK: regional output is one tile per PE and mppnccombine can’t hand the number of files in a glob. AH: Yes that is disastrous. Not sure was a good idea to compress all IO.
JM: Good idea to compress even for intermediate storage. Regional collation: what do we currently collate? Original collation tool? AK: Yes, but don’t have a solution JM: definitely need to combine to get decent chunk sizes. If interested happy to talk about moving directly to zarr, or parallelising in some way. AK: Would want a uniform approach/format across outputs. JM: not sure why collate runs out of files. AK: Shell can’t pass that many files to it. AH: My recollection is that it is a limit in mpirun, which is why mppnccombine-fast allows quoting globs to get around this issue. Always interested in hearing about any new approaches to improving IO and processing at scale.

ACCESS-OM2-BGC testing and rollout

AH: Congrats on Russ getting this all working. How do we roll this out?
AK: All model components are now WOMBAT/BGC versions, but two MOM exes. One with BGC and one without. All in ak-dev. No standard configs that refer to BGC. Need some files. RF: About 1G, maybe 10 files. Climatologies for forcing and 3 restart files. This will work with standard 1 degree test cases. Slight change to a field table, and a different o2i.nc (OASIS restart file). Not much change. Will work on that with Richard and Hakase. Haven’t tested with current version. Maintaining a 1 degree config should be ok. AK: No interest at high res? NH: Yes, but not worth supporting as yet. One degree shows how it is used.
AK: Would BGC be standard 1 degree with BGC as option, or separate config specialised for BGC? RF: Get together and give you more info. Probably stand separately as an additional config. Currently set it up as a couple of separate input directories. AH: So RF going to work up a 1 degree BGC config. RF: Yes. AH: So work up config, make sure it works, and then tell people about it. RF: The people who are mainly interested know about the progress.

ACCESS-OM2 release version plan

AH: Fixing configs, code, bathymetry. Do we need a plan? Need help?
AK: Considering merging all ak-dev configs into master. Was constant albedo, moving to Large and Yeager as RF advised. Didn’t make a big difference. How much do we want to polish? Initial condition is wrong. Initial condition is potential temp rather than conservative. Small compared to WOA and drift. Not sure if worth fixing. Also bathymetry at 1 and 0.25 degree. Not sure who would fix it. AH: Talk to Andy Hogg if unsure about resources? AK: A number of problems could all be fixed in one go.
AK: Not much left to do on code. PIO with CICE. Still issues?
NH: Good news. In theory compression is supported on PIO with latest netCDF. PIO library also has to enable those features. There was a GitHub issue that indicated they just needed to change some tests and it would be fine. Not true. There is code in their library that will not let you call deflate. Tried commenting out, and one of the devs thought was reasonable. Getting some segfaults at netCDF definition phase. Want to explore until decide it is waste of time, will go back to offline compression if it doesn’t pan out. Done the naive test if there is a simple work-around. Looking increasingly like won’t work easily. Could do valgrind runs etc.
AK: Bleeding edge isn’t best for production runs. NH: Agreed. Will try a few more things before giving up. AK: Offline compression might be safer for production. NH: Agreed, random errors can accumulate. RY: Segfault is with PIO? NH: Using newly installed netCDF4.7.4p, and latest version of PIO wrapper with some commented out some checks. Not complicated just calls deflate_var. RY: When install PIO do you link to new netCDF library? NH: Yes. Did have that problem before. AG pointed it out, and fixed that. Takes a while to become stable. RY: Do you need to specify a new flag with parallel compression when you open the with nf_open? NH: Not doing that. Possible PIO doing that. RY: Maybe PIO library not updated to use new flags to use this correctly. NH: If using netCDF directly what would you expect to see? Part of nf_create? RY: Maybe hard wired for serial. NH: Possible they’ve overlooked something like this. Might be worth a little bit of time to look into that. PIO does allow compression on serial output only. Will do some quick checks. Still shouldn’t segfault. Been dragging on, keep thinking a solution is around the corner, but might be time to give it up. A bit unsatisfying, but need to use my time wisely. AH: It’s not nothing to add a post-processing compression step and then take it out again. Not no work, and no testing, so out of the box compression would be nice. Major code update you’re waiting on? Need to update fortran datetime library? Just a couple of PRs from PL and NH. Paul’s look like a bug fix PL: Mine an edge case? AK: Incorrect unit conversion. AH: Don’t know where in the code and if it affects us. AK: We’re using the version that fixes that. NH: Guy who wrote that library wrote a book called “Modern Fortran”.

Testing update

 AH: MOM travis testing properly testing ACCESS-OM2 and ACCESS-OM2-BGC as have also updated libaccessom2 testing so that it creates releases that can be used in the MOM travis testing. Lots of boring/stupid commits to get the testing working. AK: Latest version gets used regardless of what is in the ACCESS-OM2 repo? AH: Not testing the build of ACCESS-OM2, just the MOM5 bit of that build linking to the libaccessom2 library so that it can produce an executable so we know it worked successfully. We do compile OASIS, and uses just the most recent version. Had intended to do the same thing with OASIS, make it create releases that could be used. Haven’t done it, but it is a relatively fast build. AK: OASIS is now a submodule in the libaccessom2 build. AH: Yes, but still have a dependency on OASIS. Don’t have a clean dependence on libaccessom2. Might be possible to refactor, but probably not worth it. So yes have a dependence on libaccessom2 *and* OASIS.
AH: Previously travis allowed ACCESS-om2 to fail and the only way you knew it worked correctly would have to look at the logs to determine if everything was successful up to the linking step.
AH: Currently fixing up Jenkins automated testing. Starting on libaccessom2 testing. Hopefully won’t be too difficult. NH: Definitely need to have it working.
PL: Writing up testing/scaling tests for ACCESS-OM2.