Technical Working Group Meeting, May 2020

Minutes

Date: 20th May, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants

ACCESS-OM2-01 scaling experiments

See PL’s associated scaling doc, scaling spreadsheet and python notebook
PL: Scaled MOM5 and CICE5 by the same amount. Based on 01deg_jra55v13_ryf9091. Ran an initial run to get restart output from February 1900, then restart runs for February (28 days). 540s time step, 4480 steps. No diagnostic output. Left ice as is.
PL: Scaled CICE ncpus and ntask proportionally. Scaled MOM from 80×75 (4358) to 160×150 (16548). Scaling looks ok looking just at the ocean timer and the ice timer. Didn't have daily CICE output.
PL: Most efficient at 10K cores based on total wall time. Ocean timer shows perfect scaling. CICE-only timer also shows good scaling.
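As a quick cross-check of the numbers above (just arithmetic on what is stated here, not taken from the run configuration), a 28-day run at a 540 s time step does give 4480 steps:

    # Sanity check: 28 days at a 540 s time step
    run_seconds = 28 * 24 * 60 * 60   # 2419200 s
    dt = 540                          # time step in seconds
    print(run_seconds // dt)          # 4480
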
NH: Keen to try these configs in production.
PL: Not sure how appropriate for production, no IO. NH: Good place to start, turn on output and see how it goes. Looks well balanced. Somewhat surprised.
PL: Now trying to reproduce Marshall's figures from the report. Those scale ocean and ice separately. Yet to get reproduction runs going. Working through namelist differences. Sometimes get a silent hang. Worth scaling ocean and ice at the same time.
AH: Why do both models scale so well individually but not so well when combined? RY: CICE is waiting for MOM. Maybe some more optimal setting for CPU numbers? AH: Seems odd MOM is scaling better than CICE, but CICE is waiting for MOM.
NH: CICE is waiting less for the ocean as cpus are scaled. oasis_recv is constant, which means MOM is not waiting on CICE. Definitely don't want MOM waiting for CICE. RY: If we increase MOM and reduce CICE would we get better performance? PL: Not sure. Might be useful to know how I got those numbers: from the log file, and the figures are total time divided by number of steps. RF: That output is from access-om2.out. Just a summary, won't show the load balance within MOM. PL: Any guidance would be useful. RF: Look in access-om2.out.
AH: Look for the MOM timers. Might be some information about the range of values; could be some very slow PEs masked by the average. RF: The check mask is out of date. Has 12-16 processors which are purely land. Changed the land mask and didn't update the mask. Some processors only have values in the halo boundaries. Crashes otherwise.
PL: Regenerated new mask files. Numbers should agree with what was done. Any more advice would be welcome. Send email or talk on slack. RF: I'll look at CICE layouts and balance, and masks. CICE is also seasonally dependent.
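For illustration of the calculation PL describes (a total timer value divided by the number of steps), a minimal sketch; the timer totals below are placeholders, not values from any actual access-om2.out:

    # Hypothetical timer totals (seconds) read manually from access-om2.out
    nsteps = 4480
    ocean_total = 10000.0   # placeholder "Ocean" wall time
    ice_total = 9500.0      # placeholder "Ice" wall time
    print("ocean s/step:", ocean_total / nsteps)
    print("ice   s/step:", ice_total / nsteps)
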
PL: Moving to a more conventional experiment layout. Will move to a shared location. AH: Could put in /g/data/v45.
AK: CICE is scaling with serial IO. Nic has almost finished PIO. It will stop scaling without PIO. Runs much faster with parallel IO even with monthly outputs. AH: Seems to be scaling ok. AK: Any output written? AH: Running for a month, so should be some output.
AH: Ran from initial conditions? PL: Yes. Ran for 1 month with timestep of 300s. Then ran from those restarts with timestep of 540s. AH: There is an ice climatology? RF: If run for a month, should have generated ice. AK: Ice generated from surface temperature in initial conditions.
RY and PL left meeting.
AH: Maybe a bit more to look at in PL's runs. NH: May have misunderstood where those numbers came from. RF: Looked like it was scaling nice and linear. AH: Yes for each model, but together scaling died going to 20K. RF: Not sure these results are that useful once IO is turned on. Some code paths are not currently being exercised without IO: putting stuff on density levels, and a whole lot of globals/collectives that aren't being done. AH: Encouraging though. NH: In principle it can scale up.

PIO compilation in ACCESS-OM2

NH: Got a reply from NCI. Resistance to having PIO in a module. Best to be self-sufficient. If it turns out to be an issue we can address it later. Will make it a submodule. Clean up the build process. Changes to the CICE repo. One CICE namelist change, to tell it to not explicitly use netCDF for certain things. A bit odd.
NH: Experiment repos will require updates. Maybe AK will report some more realistic performance numbers.
AH: PIO with MOM? NH: Not sure. CICE isn't doing a great deal in the configuration I am using. Seems to all work inside parallel netCDF, as it is doing output from all processors. Can use IO nodes and use comms, but that doesn't show a performance improvement, and looks worse in many cases. We could configure it in the same way without using PIO. RF: Don't have much control over where we put processors. CICE is at the end, probably sharing with MOM. Playing with the layout might be tricky. NH: At some stage put CICE on all its own nodes. RF: Once YATM is on the first node, it ends up messing things up. NH: Why are we doing that? RF: Something to do with OASIS in the old days. Now have YATM and the root PE of MOM on the same node. Would make sure all root PEs are on their own node, no contention. YATM and MOM are also on the same NUMA partition. NH: We should change that, easy fix. YATM doesn't do much on 0.1 as the rest of the model takes so long. RF: Two IO processors on the same node: the MOM root PE, used for diagnostics, and the YATM process. NH: If each model were on its own nodes, could make sure each node has a single IO processor. With PIO, if you want 1 IO processor per 16 processors, don't know if it is talking across nodes.
JM: In terms of PIO are multiple nodes writing to the same file? NH: For CICE every single process is writing to the same file at the same time. Works well. Haven't looked into it deeply; probably the optimum is something in between. Still a big improvement over serial output. AH: Kaizen (改善): small incremental improvements all the time. Compressed netCDF output? NH: No. PIO GitHub talked about supporting compression. AH: Same as what RY and Marshall did? NH: Yes. Have to wait for a parallel netCDF implementation which supports it. Confusing because there is also p-netCDF; PIO is a wrapper. AH: Yes, wraps p-netCDF and MPI-netCDF. p-netCDF is only netCDF3, not based on HDF5. AK: Will need a post-processing compression step. NH: Task not done until compression is done. AK: Very sparse data, shame not to compress.
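For illustration of the post-processing compression step AK mentions, a minimal sketch assuming the uncompressed model output is ordinary netCDF that xarray can read; the filename is hypothetical and this is not the actual OM2 workflow:

    import xarray as xr

    # Recompress an uncompressed netCDF file after the run (filename is hypothetical)
    ds = xr.open_dataset("iceh.1900-02.nc")
    encoding = {v: {"zlib": True, "complevel": 5, "shuffle": True} for v in ds.data_vars}
    ds.to_netcdf("iceh.1900-02.compressed.nc", encoding=encoding)
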
AH: xarray is supporting sparse data now, FYI. Can mean a lot less memory use for some data.
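For illustration of the sparse support AH mentions, a minimal sketch using the sparse package as a duck array inside xarray (the array here is made up):

    import numpy as np
    import sparse
    import xarray as xr

    # A mostly-zero field stored as a sparse COO array instead of a dense one
    dense = np.zeros((1000, 1000))
    dense[10, 20] = 1.5
    da = xr.DataArray(sparse.COO.from_numpy(dense), dims=("y", "x"))
    print(da.data.nbytes)   # far smaller than the 8 MB dense equivalent
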

Compiling with/without WOMBAT

AH: Any speed/memory use implications to always having it compiled in? RF: Should be separate. Overhead is basically nothing. Will only allocate BGC arrays if they're in the field table. Should be kept separate like all other BGC packages. I put some lines in the compile scripts. Also if you want to compile without ACCESS.
AK: If we want to maintain harmony with CM2 do we want a non-BGC compilation? RF: Yes. AK: From the point of view of OM2 users it would be nice to be able to switch BGC on and off just through namelists. RF: Switched on via field_table. Strange design choices years ago. Also need changes in some of the restart files, o2i.nc and i2o.nc. AK: Not something that can be switched on and off? RF: No.

MOM Pull Requests

AH: Guidance for checking? RF: The two main changes in the code are probably fine. Maybe the ACCESS compilation scripts, unless we want to change that so it gets compiled in all the time for ACCESS-OM. AH: Decided not to, I think. RF: Made changes to install.sh to specify the type of model. AH: A separate model designation with WOMBAT? RF: ACCESS-OM-BGC is a new model type. Ran tests, all ok. AH: Do we need any tests to check it hasn't changed non-BGC runs? RF: Shouldn't be anything that affects a normal run. Code compiled ok on Travis. Put in some heat diagnostics, the fluxes from CICE, might be the only thing. AH: Are the Jenkins tests working? NH: ACCESS-OM2 tests haven't worked since we moved to the new machine. RF: Run a 1 degree model and see how it goes. AH: I'll do that.
AK: Managing ACCESS-OM2, should the distinction between BGC and non-BGC be in the control directories? So the build script builds both and you choose which in the config, or compile once, supporting both. AH: I don't think BGC is a supported configuration yet; needs testing. How it is implemented, shared or separate exes, is just a choice of whatever you decide is best.
AH: Turns out that Geos PR was a mistake. Asked about it, and they closed it.

Bad bathymetry

AH: Any comments? Does it need fixing? RF: Bad bathymetry needs to be fixed, or copy bathymetry from somewhere else. Bad around Australia. Same for CM2. Mentioned it 3-4 years ago, still not updated. Some pits in the Gulf of Carpentaria down to 120m in the 0.25; the 1 degree goes down to 80m. Should be no deeper than 60m. OCCAM created some bad bathymetry in Bass Strait and off the coast of China. Russian and Alaskan issues, and the White Sea. Remapping indices got mucked up. AH: Wasn't the 0.25 fixed north of Bering Strait? RF: Doesn't look like it.
JM: The bathymetry files are wrong in certain regions? RF: Came from the Southampton OCCAM model. They ran it with a normal Mercator grid and a transverse Mercator across the top. When remapping onto a spherical grid the indices got mucked up and produced some strange bathymetry. GFDL inherited it and based a bunch of models on it. Leaked through to the ACCESS models. It was in the US forecast model and they noticed all the stuff around Alaska.
AH: Should be relatively straightforward as this only affects ocean bottom cells, and doesn't touch coasts? RF: Yes. AK: Base it on a coarsened tenth grid? RF: Not a big job, just a few slabs that need smoothing/removing. AH: Does this need to be fixed for the next release of OM2? RF: Yes. AK: No. RF: Get a student to look at it. AK: Also land mask inconsistencies, would be good to have all three models consistent. There are big curvy bits of coastline keeping ocean away from the tripoles. AH: The 1 degree is very much a model that isn't that realistic. The tenth starts to look much more like real life.
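For illustration of the kind of shelf fix RF describes, a minimal sketch assuming a hypothetical topog.nc with a depth variable on latitude/longitude coordinates named yt_ocean/xt_ocean; the file, variable, box and coordinate names are assumptions, not the actual grid-editing workflow:

    import xarray as xr

    # Cap spuriously deep cells in a shelf box (e.g. Gulf of Carpentaria) at 60 m
    topog = xr.open_dataset("topog.nc").load()
    box = dict(yt_ocean=slice(-18, -10), xt_ocean=slice(135, 142))  # rough box
    depth = topog["depth"]
    depth.loc[box] = depth.loc[box].clip(max=60.0)
    topog.to_netcdf("topog_fixed.nc")
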

Zarr file format

AH: Wanted to engage JM about zarr. RF: Interested as this is being used in the decadal prediction project. JM: Exactly. Talked today about parallelising output from the model into netCDF, and then post-analysis requires transforming to zarr. Zarr is a distributed file format that stores files in directories; each chunk is a separate file, with parallelisation handled by the filesystem. Should we write directly into a zarr-like file format? There are other file formats like it. A future netCDF may have a zarr-like back-end. RF: There is some discussion on the netCDF GitHub about zarr, looks like just one person. JM: Unidata is willing to move away from HDF5. Parallelisation of HDF5 has never worked the way it was supposed to. Instead of using parallel IO, just write directly to the format people want to use. AH: Got the impression the netCDF people never got the buy-in from HDF5 that they thought they would get. HDF5 just do their own thing. JM: Still have people using netCDF3. AH: A strength of netCDF: they could swap the back end again and keep the same interface. JM: Same data model. AH: What is the physical format of a zarr blob? JM: It is a binary blob that supports different filters/compression schemes. AH: Does it do machine-independent storage? Bad old days of swapping endianness on binary files. AG: In zarr there are raw data blobs, and associated metadata files that describe the filter/endianness etc.
JM: Inodes are not a problem. Chunks are still relatively large, on the order of the Lustre striping size. Can wrap the whole thing inside an uncompressed zip file. Parallelises for reading just fine. Works like a tar: there's an index of where to read, and it supports multiple reads on the same file. AH: Would want to do this when archiving.
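For illustration of the directory and zip forms of a zarr store that JM describes, a minimal sketch with xarray and zarr 2.x (the dataset is made up):

    import numpy as np
    import xarray as xr
    import zarr

    # Toy dataset: each chunk of the zarr store is written as a separate file
    ds = xr.Dataset({"sst": (("time", "y", "x"), np.random.rand(12, 100, 100))})
    ds.to_zarr("sst.zarr", encoding={"sst": {"chunks": (1, 100, 100)}}, mode="w")

    # The same store wrapped in a single (uncompressed) zip file for archiving
    with zarr.ZipStore("sst.zarr.zip", mode="w") as store:
        ds.to_zarr(store)

    # Reading back from the zip works much the same way
    print(xr.open_zarr(zarr.ZipStore("sst.zarr.zip", mode="r"))["sst"].mean().compute())
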
NH: Another one is TileDB, which is a file format. JM: There are other backends, n5/z5. Distributed storage for large data sets.
AH: At one stage we did wonder if collation was even necessary with tools like xarray, but never looked into it. NH: Things have changed a lot. xarray is relatively new; 3-5 years ago it might segfault on tenth model data. So much better now, so many more possibilities.
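For illustration of AH's point about skipping collation, a minimal sketch that lazily opens uncollated per-tile files with xarray/dask, assuming each tile file carries its own coordinate values; the filename pattern and variable name are hypothetical:

    import xarray as xr

    # Open uncollated tile files lazily and let xarray assemble them by coordinates
    ds = xr.open_mfdataset(
        "archive/output000/ocean/ocean_daily.nc.*",   # hypothetical per-tile files
        combine="by_coords",
        parallel=True,
    )
    print(ds["temp"].isel(time=0).mean().compute())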