Technical Working Group Meeting, July 2020


Date: 10th June, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)
  • Marshall Ward (MW) GFDL

Optimisation report

PL: Have a full report, need review before release. This is an excerpt.
PL: Aims for perf tuning and options for configuration. Did a comparison with Marshall’s previous report on raijin.
Testing included MOM-SIS at 0.25 and 1 deg to get idea of MOM scalability stand-alone.
Then ACCESS-OM2 at 0.1 deg. Testing with land masking, scaling MOM and CICE proportionally.
Couldn’t repeat Marshall’s exactly. ACCESS-OM2 results based on different configs. Differences:
  1. continuation run
  2. time step 540s v 400s
  3. MOM and CICE were scaled proportionally
  4. Scaling taken to 20k v 16k
MOM-SIS at 0.25 degrees on gadi 25% faster than ACCESS-OM2 on raijin at low end of CPU scaling. Twice as fast for MOM-SIS at 0.1 degrees. Scalability at high end better.
ACCESS-OM2: With 5K MOM cores, MOM is 50-100% faster than MOM on raijin. Almost twice as fast at 16K, scaled out to 20K. CICE: with 2.5K cores CICE on gadi seems 50% faster than CICE on raijin. Scales to 2.5 times as fast at 16K OM2 CPUs.
Days per cpu day. From 799/4358 CICE/MOM cpus does not scale well.
Tried to look at wait time as fraction of wall time. Waiting constant for high CICE ncpus, decreases with high core counts with low CICE ncpus. So higher core count probably best to reduce ncpus as proportion. In this case half the usual fraction.
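The two metrics PL mentions can be sketched as below. This is only an illustration: "days per cpu day" is read here as simulated model days per CPU-day consumed, the function names are hypothetical, and the wall and wait times are made-up placeholders, not values from the report.

```python
SECONDS_PER_DAY = 86400


def days_per_cpu_day(model_days, ncpus, wall_seconds):
    """Simulated days delivered per CPU-day consumed (assumed definition)."""
    cpu_days = ncpus * wall_seconds / SECONDS_PER_DAY
    return model_days / cpu_days


def wait_fraction(wait_seconds, wall_seconds):
    """Fraction of wall time a component spends waiting on its partner."""
    return wait_seconds / wall_seconds


# 799 CICE + 4358 MOM ranks, as quoted above (the YATM rank is ignored).
ncpus = 799 + 4358
wall = 3600.0   # hypothetical wall time in seconds for a 28-day run
wait = 540.0    # hypothetical CICE wait time in seconds

print(f"{days_per_cpu_day(28, ncpus, wall):.3f} model days per CPU-day")
print(f"{wait_fraction(wait, wall):.1%} of wall time spent waiting")
```

Comparing the first number across core counts shows where efficiency falls off; comparing the second shows whether the CICE share of ranks is too high or too low.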
JM: How significant are results statistically? PL: Expensive. Only ran 3-4 runs. Spread quite low. Waiting time varied the most. Stat sig probably not due to small sample size.
MW: Timers in libaccessom2 were better than OASIS timers, which include bootstrapping times which are impossible to remove. Also noisy IO timers. Not sure how long your runs were. Longer would be more accurate. PL: Runs are for 1 calendar month (28 days). MW: oasis_put and get are slightly magical, difficult to know what they’re doing. PL: Still have outputs, could reanalyse.
MW: Speedups seem very high. Must be configuration thing. PL: Worried not a straight comparison. MW: If network time is 15-20%, wouldn’t make a difference. Always been RAM bound, which would be good if that wasn’t an issue now. PL: Very meticulous documentation of configuration, and is very reproducible. Made a shell script that pulls everything from GitHub. MW: I think your runs are better referenced. While experiments were being released, seemed some parameters etc were changing as I was testing. That could be the difference. Wish I documented it better.
AH: All figures independent of ocean timestep? PL: All timesteps at 540s. AK: Production runs are 15-20% faster, but a lot of IO. PL: Switched off IO and only output at end of the month. It really drags it. Make sure it isn’t IO bound. Probably memory bound. Didn’t do any profiling that was worth presenting. MW: Got a FLOP rate? PL: Yes, but not at my fingertips. If it is around a GFLOP, probably RAM bound. PL: Now profiling ACCESS-CM2 with ARM-map. RY is looking at a micro view, looking at OpenMP and compilation flag level. RY: Gadi has 4GB/core, raijin has 2GB/core. Not sure about bandwidth. Also 24 cores/node. Much less remote node comms. Maybe a big reduction in MPI overheads. MW: OpenMPI icx stuff helping. RY: Lots of on-node comms. Not sure how much. MW: Believe at high ranks. At modest resolutions comms not a huge fraction of run time. Normal scalable configs only about 20%. PL: The way the scaling was done was different. MW scaled components separately. MW: I was using clocks that separated out wait time.
RY: If config timestep matters, any rule for choosing a good one? AK: Longest timestep that is numerically stable. 540s is stable most of the time.
MW: How have progressed on CICE layout stuff? Changed in the last year? I was using sect-robin. RF: sect-robin or round-robin. AK: You use sect-robin, production did use round-robin, but not sect-robin. Less comms overhead, not sure about load balance.
PL: Is there any value in releasing the report? NH: Would be interested in reading it. Looking to get these bigger configurations. AH: Worth documenting the performance at this point. RY: Anything else worth trying? AH: Why 20K limit? PL: Believe that is a PBS queue limit. Some projects can apply for an exception. RY: For each queue there are limits. Can talk with admin if necessary to increase them. AH: Will bring this up at Scheme Managers meeting. They should be updated with gadi being a much bigger machine. Would give more flexibility with configurations. Scalability is very encouraging.
RF: BlueLink runs very short jobs, 3 days at a time. Quite a bit of start-up and termination time. How much does that vary with various runs? MW: I did plot init times, it was in proportion to the number of cores. Entirely MPI_Init. It has been a point of research in OpenMPI. Double ranks, double boot time. RF: Also initialisation of OASIS, reading and writing of restart files. PL: Information has been collected, but hasn’t been analysed. RF: Paul Sandery’s runs are 20% of the time. MW: MPI_Init is brutal, and then exchange grid. There are obvious improvements. Still doing alltoall when it only needs to query neighbours. Can be sped up by applying geometry. Don’t need to preconnect. That is bad.
PL: At least one case where MPI collectives were being replaced with a loop of point to points. Was collective unstable? MW: Yes, but also may be there for reproducibility. MW: I re-wrote a lot of those with collectives, but they had no impact. At one time collectives were very unreliable. Probably don’t need to be that way anymore. MW: I doubt that they would be better. Hinges on my assertion that comms are not a major chunk.
AH: MOM is RAM bound or cache-bound? MW: When doing calculations the model is waiting to load data from memory. AH: Memory bandwidth improves all the time. MW: Yes, but increase the number of cores and it’s a wash. It could have improved. AMD is doing better, but Intel know there is a problem.
AH: To wrap up. Yes would like full report. This is useful for NH to work up new configurations, as naive scaling is not the way to go. RF would also like the initialisation numbers, which PL can provide.

ACCESS-OM2 release plan

AH: Are we going to change bathymetry? Consulted with Andy who consulted with COSIMA group. What is the upshot? AK: Ryan and Abhi want to do some MOM runs. Problems with bathy. Andy wants run to start, if someone has time to do it, otherwise keep going. Does anyone have some idea how long it would take? RF: 1 deg would be fairly quick. We know where the problems are. Shouldn’t be a big job. Maybe a few days. 1 deg in a day for an experienced person. GFDL has some python tools for adjusting bathymetry on their GitHub. Point and click. Alistair wrote it. Might be in MOM-SIS examples repo. MW: Don’t know, could ask. RF: Could be something that would make it straightforward.
AK: Will have a look.
MW: topog problems in MOM6 not usually the same as MOM5 due to isopycnal coordinate.
AK: Some specific points that need fixing? RF: I think I put some regions in a GitHub issue. AK: What level to dig into this? RF: Take pits out. Set to min depth in config. Regions which should be min depth and have a hole. Gulf of Carpentaria trivial. Laptev should all be set to min depth. NH: I did write some python stuff called topog_tools, can feed it csv of points and it will fix it. Will fill in holes automatically. Also smooths humps. May still have to look at the python and fix stuff up. Another possibility.
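The kind of fix RF describes (clamp shallow wet cells to the minimum depth, take isolated pits out) can be sketched as follows. This is a minimal illustration only, not the algorithm used by topog_tools or the GFDL tools: the function name, the 4-neighbour pit test and the non-periodic boundaries are all assumptions made for the example.

```python
def fix_topography(depth, min_depth):
    """Return a corrected copy of `depth` (2D list of metres, 0 = land).

    Step 1: wet cells shallower than `min_depth` are deepened to `min_depth`.
    Step 2: a wet cell deeper than all of its 4 neighbours (a single-cell
    pit) is raised to the depth of its deepest neighbour.
    Boundaries are treated as closed (no longitude wrap) for simplicity.
    """
    nj, ni = len(depth), len(depth[0])
    clamped = [[max(d, min_depth) if d > 0 else d for d in row] for row in depth]
    fixed = [row[:] for row in clamped]
    for j in range(nj):
        for i in range(ni):
            if clamped[j][i] <= 0:
                continue  # land stays land
            neighbours = [
                clamped[jj][ii]
                for jj, ii in ((j - 1, i), (j + 1, i), (j, i - 1), (j, i + 1))
                if 0 <= jj < nj and 0 <= ii < ni
            ]
            if not neighbours:
                continue
            deepest = max(neighbours)
            if clamped[j][i] > deepest:
                fixed[j][i] = max(deepest, min_depth)
    return fixed


example = [
    [0.0, 10.0, 10.0],
    [10.0, 50.0, 10.0],   # 50 m hole in a 10 m shelf
    [10.0, 10.0, 5.0],    # 5 m cell shallower than the minimum depth
]
print(fix_topography(example, min_depth=10.0))
```

A real tool would also need to handle the periodic longitude boundary and the tripole, and would restrict the fix to listed regions rather than the whole grid.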
AK: Another issue is quantisation to the vertical grid. A lot of terracing that has been inherited. RF: Different issue. Generating a new grid. 1 degree not too bad. 0.25 would be weighty still. MC: BGC in 0.25, found a hole off Peru that filled up with nutrients.
AH: Only thing you’re waiting on? AK: Could do an interim release without topog fixes. People want to use. master is so far behind now. Also updating wiki which refers to new config. Might merge ak-dev into master, and tag as 1.7, and have wiki instructions up to date with that. AH: After bathy update what would version be? AK: 2.0. AH: Just wondering what constitutes a change in model version. AK: Maybe one criterion is if restarts are unusable from one version to the next. Changing bathy would make these incompatible. AH: Good version.
AH: Final word on ice data compression? NH: Decided deflate within model was too difficult due to bugs. Then Angus recognised the traceback for my segfault which was great. Wasted some time not implementing it correctly. Now working correctly. Using different IO subsystem. IO_MP rather than ROMIO_MP. Got more segfaults. Traced and figured out. Need to be able to tell netCDF the file view. The view of a file that a rank is responsible for. MPI expecting that to be set. One way to set it up is to specify the chunks to something that makes sense. Once I did that file view was correct. Then ran into bugs in PIO library. Seem like cut n’ paste mistakes. No test for chunking. Library wrapper is wrong. Fixed that. Now working. Learnt a lot and a satisfying outcome. Significantly faster. Partly parallel, also different layer does more optimisation. Noticed with 1 degree things are flying. Nothing definitive, but seems a good outcome. Will get some definitive numbers and a better explanation. Will have something to merge. Will have some PRs to add to and some bug reports to PIO. RY: PIO just released a new version yesterday. NH: Didn’t know that. Tracking issues that are relevant to me. Still sitting there. Will try new version. RY: Happy it is working now. NH: Was getting frustrated with PIO, wondered why not using netCDF directly. For what we do pretty thin wrapper around netCDF. Main advantage is the way it handles the ice mapping. Worth keeping just for that. MW: FMS has most of a PIO wrapper but not the parallel bit.
PL: Any of that fix needs to be pushed upstream? NH: Changes to the CICE code. Will push to upstream CICE. Will be a couple of changes to PIO. AK: Dynamically determines chunking? NH: Need to set that. Dynamically figures out tuneable parameters under the hood about the number of aggregators. Looking at what each rank is doing. Knows what filesystem it is on. Dependent on how it is installed. Assuming it knows it is on lustre. Can generate optimal settings. Can explain more when do a summary.
AK: Want to make sure OP files are consistently chunked. NH: Using the chunking to set the file view. Another way to explicitly set the file view using MPI API. Chunks are the same size as the data that each PE has. In CICE each block is a chunk. MW: These are netCDF chunks? AK: More chunks than cores? NH: Yes. Is that bad or good? This level is perfect for writing. Every rank is chunked on what it knows to do. Not too bad for reading. JM: How large a chunk? NH: In 1 degree every PE has full domain 300 rows x 20 columns. JM: Those are small. Need bigger for reading. AK: For tenth 36×30. Something like 9000 blocks/chunk. NH: Might be a problem for reading? RF: Yes for analysis. JM: Fixed cost for every read operation. A lot of network chatter. AH: Is that the ice or the ocean in the tenth? Not sure. Chunk size is 36×30. A lot of that is ice free, 30% is land. MW: Ideal chunks are based on IO buffers in the filesystem. AH: Best chunking depends a lot on access patterns. JM: One chunk should be many minimal file units big. AH: When 0.25 had one file per PE it was horrendously bad for IO. Crippled Ryan’s analysis scripts. If you’re using sect robin that could make the view complicated? NH: Wasn’t Ryan’s issue that the time dimension was also chunked? AH: He was testing mppnccombine-fast which just copies chunks as-is, which were set by the very small tile sizes. Similar to your problem? NH: Probably worse. Not doing MOM tiles, doing CICE blocks, which are even smaller. Same grid as MOM. RF: Fewer PEs, so blocks are half the size. MOM5 tile size to CICE block size are comparable apart from 1 degree model.
NH: Will carry on with this. Better than deflating externally, but could run into some problems. The chunks in the file view don’t have to be the same. Will this be really bad for read performance? Gathering that it is. What could be done about it? Limited by what each rank can do. No reason the chunks have to be the same as the file view. Could have multiple processors contribute to a chunk. Can’t do without a collective. MW: MPI-IO does collectives under the hood. Can configure MPI-IO to build your chunk for you? NH: Currently every rank does its own IO as it was simpler and faster. MW: Can’t it all be configured at the MPI-IO layer? RY: PIO can map compute domain to IO domain. Previous work had one IO rank per node. IO rank collects all data from node. Set chunking at this level. NH: For example, our chunk size could be 48 times bigger. RY: Yes. Also best performance is single rank per node. PIO does have this function to map from compute to IO domain, and why we used it. Can also specify how many aggregators / node. First decide how many IO ranks per node, and how many aggregators per node. Those should match. Can also set number of stripes and stripe size to match chunk size. IO rank per node is the most important, as will set chunk size. MW: Only want same number of writers as OSTs. RY: Many writers per node, will easily saturate the network and be the bottleneck. AH: Have to go, but definitely need to solve this, as scaling to 20K cores will kill this otherwise.
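The chunk arithmetic behind this discussion can be sketched as below. The 3600×2700 horizontal grid for the tenth degree model and the 4 bytes per value are assumptions (neither is stated in these minutes), and the helper name is hypothetical. With those assumptions a 36×30 block gives the roughly 9000 chunks AK quotes, each only a few kB per 2D field, and aggregating 48 ranks per IO rank makes each chunk 48 times bigger, as NH suggests.

```python
def chunk_stats(grid_shape, chunk_shape, bytes_per_value=4):
    """Number of chunks covering a 2D grid and the uncompressed chunk size.

    Shapes are (ny, nx). Partial chunks at the edges count as whole chunks.
    """
    ny, nx = grid_shape
    cy, cx = chunk_shape
    n_chunks = -(-ny // cy) * -(-nx // cx)  # ceiling division in each direction
    chunk_bytes = cy * cx * bytes_per_value
    return n_chunks, chunk_bytes


# Assumed 0.1 degree grid of 3600 x 2700 points with 36 x 30 CICE blocks.
n_chunks, chunk_bytes = chunk_stats((2700, 3600), (30, 36))
print(n_chunks, "chunks of", chunk_bytes, "bytes per 2D field")

# One IO rank per 48-core node gathering all blocks on that node.
ranks_per_io_rank = 48
print("aggregated chunk:", chunk_bytes * ranks_per_io_rank, "bytes")
```

Even the aggregated chunk is small next to the 1GB stripe size RY mentions, which is consistent with JM's point that a chunk should span many minimal file units.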

MW: Will also help RF. If you’re desperate should look at the patch RY and I did. Will help a lot once you’ve identified your initialisation slow down. RF: Yes, will do once I’ve worked out where the blockages are. Just seen some timings from Paul Sandery, but haven’t looked into it deeply yet. NH: Even with rubbish config, model is showing performance improvements. Will continue with that, and will consider the chunk size stuff as an optimisation task. MW: Sounds like you’ve gone from serial write to massively parallel, so inverted the problem, from one disk bound, to network bound within lustre. If you can find a sweet spot in between then should see another big speed improvement. NH: Config step pretty easy to do with PIO. Will talk to RY about it. RY: Could have a user settable parameter to specify IO writers per node. PL: Need to look into lustre striping size? RY: Currently set to 1GB, so probably ok, but can always tune this. NH: Just getting a small taste of the huge world of IO optimisation. MW: Just interesting to be IO bound again. NH: Heavily impacted by IO performance with dailies with tenth. MW: IO domain offset the problem. Still there but could be dealt with in parallel with next run so could be sort of ignored.

AK: This is going to speed up the run. Worst case is post-processing to get chunking sorted out. NH: Leaves us back at the point of having to do another step, which I would like to avoid. Maybe different, before was a step to deflate, maybe rechunking was always going to be a problem.
AK: Revisiting Issue #212, so we need to change model ordering. Concerns about YATM and MOM sharing a node and affecting IO bandwidth. Tried this test, there is an assert statement that fails. libaccessom2 demands YATM is first PE. NH: Will look at why I put the assert there. Weirdly proud there is an assert. RF: Remember this, when playing around with OASIS intercommunicators, might have been conceptually this was the easiest way to get it to work. MW: I recall insisting on a change to the intercommunicator to get score-p working. AK: Not sure how important this is. NH: There are other things. Maybe something to do with the runoff. The ice model needs to talk to YATM at some point. Maybe a scummy way of knowing where YATM is. For every config maybe then know where YATM is. These would be shortcut reasons. PL: Give it its own communicator and use that? NH: Maybe that is what we used to do. Could always go back to what we had before. RF: Just an idea if it would have an impact. Could give YATM its own node as a test. MW: Not sure why it is that way. Should be easy to fix. NH: Ok, certain configurations are shared. Like timers and coupling fields. Instead of each model having their own understanding, share this information. So models check timestep compatibility etc. Using it to share configs. Another way to do that. MW: Doesn’t have to be rank zero. NH: Sure it is just a hack. MW: libaccessom2 is elegant, can’t say the same for all the components it talks to. RF: There is a hard-wired broadcast from rank zero at the end.


MW: Ever talk about MOM6? AK: Angus is getting a regional circumantarctic MOM6 config together. RF: Running old version of CM4 for decadal project. PL: Maybe a good topic for next meeting?


Technical Working Group Meeting, June 2020


Date: 10th June, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)


JM: Already have compression with netCDF4. How do they consume the data? Jupyter/dask? AK: Mostly in COSIMA, BoM and CSIRO have their own workflows. JM: Maybe archival netCDF4? As long as writing/parallel IO/combining, no hurry to move to zarr direct output. JM: Inodes not an issue. Done badly can be bad. lustre file system has a block size, so natural minimum size. At least as many inodes as allocatable units on FS. If a problem wrap whole thing in uncompressed zarr. RY: Many filters. blosc is pretty good. Can use in netCDF4, but not portable. Needs to be compiled into library. netCDF4 now supports parallel compression. HDF supported a couple of years ago.
AH: As we’re in science, want stable well supported software. Unlikely to use bleeding edge right now. Probably won’t output directly from model for the time being. Maybe post process.
NH: What about converting to zarr from uncollated ocean output? Why collate when zarr uncollates anyway? Also collate as difficult to use uncollated output. How easy is it to go from uncollated to zarr? JM: Should be pretty straightforward. Write that block directly to part of directory tree. Why is collating hard? Don’t just copy blocks to appropriate place in file? AH: Outputs are compressed. Need to be uncompressed and then recompressed. Scott Wales has made a fast collation tool (mppnccombine-fast) that just copies already compressed data. There are subtleties. Your io_layout determines block size, as netCDF library chooses automatically. Some of the quarter degree configs had very small tiles which led to very small chunks and terrible IO. AK: Regional output is one tile per PE and mppnccombine can’t handle the number of files in a glob. AH: Yes that is disastrous. Not sure was a good idea to compress all IO.
JM: Good idea to compress even for intermediate storage. Regional collation: what do we currently collate? Original collation tool? AK: Yes, but don’t have a solution. JM: Definitely need to combine to get decent chunk sizes. If interested happy to talk about moving directly to zarr, or parallelising in some way. AK: Would want a uniform approach/format across outputs. JM: Not sure why collate runs out of files. AK: Shell can’t pass that many files to it. AH: My recollection is that it is a limit in mpirun, which is why mppnccombine-fast allows quoting globs to get around this issue. Always interested in hearing about any new approaches to improving IO and processing at scale.
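The "quoted glob" workaround AH describes amounts to passing the pattern itself to the tool and expanding it inside the program, so the launcher never sees thousands of file names. A minimal sketch of that idea follows; it is not the implementation in mppnccombine-fast, and the function name and file pattern are hypothetical.

```python
import glob


def expand_inputs(args):
    """Expand glob patterns given as arguments inside the program.

    Each argument is treated as a pattern; arguments that match nothing
    are passed through unchanged so a later open() reports the error.
    """
    files = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        files.extend(matches if matches else [arg])
    return files


# The caller quotes the pattern so the shell does not expand it, e.g.
#   some_collate_tool --output ocean.nc 'ocean.nc.*'
# and the tool then calls expand_inputs(["ocean.nc.*"]) itself.
```

The command line then stays one short argument regardless of how many tiles were written.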

ACCESS-OM2-BGC testing and rollout

AH: Congrats on Russ getting this all working. How do we roll this out?
AK: All model components are now WOMBAT/BGC versions, but two MOM exes. One with BGC and one without. All in ak-dev. No standard configs that refer to BGC. Need some files. RF: About 1G, maybe 10 files. Climatologies for forcing and 3 restart files. This will work with standard 1 degree test cases. Slight change to a field table, and a different (OASIS restart file). Not much change. Will work on that with Richard and Hakase. Haven’t tested with current version. Maintaining a 1 degree config should be ok. AK: No interest at high res? NH: Yes, but not worth supporting as yet. One degree shows how it is used.
AK: Would BGC be standard 1 degree with BGC as option, or separate config specialised for BGC? RF: Get together and give you more info. Probably stand separately as an additional config. Currently set it up as a couple of separate input directories. AH: So RF going to work up a 1 degree BGC config. RF: Yes. AH: So work up config, make sure it works, and then tell people about it. RF: The people who are mainly interested know about the progress.

ACCESS-OM2 release version plan

AH: Fixing configs, code, bathymetry. Do we need a plan? Need help?
AK: Considering merging all ak-dev configs into master. Was constant albedo, moving to Large and Yeager as RF advised. Didn’t make a big difference. How much do we want to polish? Initial condition is wrong. Initial condition is potential temp rather than conservative. Small compared to WOA and drift. Not sure if worth fixing. Also bathymetry at 1 and 0.25 degree. Not sure who would fix it. AH: Talk to Andy Hogg if unsure about resources? AK: A number of problems could all be fixed in one go.
AK: Not much left to do on code. PIO with CICE. Still issues?
NH: Good news. In theory compression is supported on PIO with latest netCDF. PIO library also has to enable those features. There was a GitHub issue that indicated they just needed to change some tests and it would be fine. Not true. There is code in their library that will not let you call deflate. Tried commenting out, and one of the devs thought was reasonable. Getting some segfaults at netCDF definition phase. Want to explore until decide it is waste of time, will go back to offline compression if it doesn’t pan out. Done the naive test if there is a simple work-around. Looking increasingly like won’t work easily. Could do valgrind runs etc.
AK: Bleeding edge isn’t best for production runs. NH: Agreed. Will try a few more things before giving up. AK: Offline compression might be safer for production. NH: Agreed, random errors can accumulate. RY: Segfault is with PIO? NH: Using newly installed netCDF4.7.4p, and latest version of PIO wrapper with some checks commented out. Not complicated, just calls deflate_var. RY: When you install PIO do you link to new netCDF library? NH: Yes. Did have that problem before. AG pointed it out, and fixed that. Takes a while to become stable. RY: Do you need to specify a new flag with parallel compression when you open the file with nf_open? NH: Not doing that. Possible PIO doing that. RY: Maybe PIO library not updated to use new flags to use this correctly. NH: If using netCDF directly what would you expect to see? Part of nf_create? RY: Maybe hard wired for serial. NH: Possible they’ve overlooked something like this. Might be worth a little bit of time to look into that. PIO does allow compression on serial output only. Will do some quick checks. Still shouldn’t segfault. Been dragging on, keep thinking a solution is around the corner, but might be time to give it up. A bit unsatisfying, but need to use my time wisely. AH: It’s not nothing to add a post-processing compression step and then take it out again. Not no work, and no testing, so out of the box compression would be nice. Major code update you’re waiting on? Need to update fortran datetime library? Just a couple of PRs from PL and NH. Paul’s looks like a bug fix. PL: Mine an edge case? AK: Incorrect unit conversion. AH: Don’t know where in the code and if it affects us. AK: We’re using the version that fixes that. NH: Guy who wrote that library wrote a book called “Modern Fortran”.

Testing update

AH: MOM travis testing is now properly testing ACCESS-OM2 and ACCESS-OM2-BGC, as have also updated libaccessom2 testing so that it creates releases that can be used in the MOM travis testing. Lots of boring/stupid commits to get the testing working. AK: Latest version gets used regardless of what is in the ACCESS-OM2 repo? AH: Not testing the build of ACCESS-OM2, just the MOM5 bit of that build linking to the libaccessom2 library so that it can produce an executable so we know it worked successfully. We do compile OASIS, and use just the most recent version. Had intended to do the same thing with OASIS, make it create releases that could be used. Haven’t done it, but it is a relatively fast build. AK: OASIS is now a submodule in the libaccessom2 build. AH: Yes, but still have a dependency on OASIS. Don’t have a clean dependence on libaccessom2. Might be possible to refactor, but probably not worth it. So yes have a dependence on libaccessom2 *and* OASIS.
AH: Previously travis allowed ACCESS-OM2 to fail, and the only way to know it worked correctly was to look at the logs to determine if everything was successful up to the linking step.
AH: Currently fixing up Jenkins automated testing. Starting on libaccessom2 testing. Hopefully won’t be too difficult. NH: Definitely need to have it working.
PL: Writing up testing/scaling tests for ACCESS-OM2.

Updated COSIMA Cookbook default database

The COSIMA Cookbook is the recommended, and supported, method for finding and accessing COSIMA datasets.

Currently COSIMA datasets are located in temporary storage under the hh5 project on the /g/data filesystem at NCI. The default COSIMA Cookbook database (/g/data/hh5/tmp/cosima/database/access-om2.db) indexes data in this location.

The COSIMA datasets are being moved to a new project, ik11: dedicated storage provided by an ARC LIEF grant. As part of this transition the default database will change to:

/g/data/ik11/databases/cosima_master.db

and will index all data in /g/data/ik11/outputs/. The database is updated daily.

This change will take place from Wednesday the 1st of July. To access the old database pass an argument to create_session:

session = cc.database.create_session(db='/g/data/hh5/tmp/cosima/database/access-om2.db')

or set the COSIMA_COOKBOOK_DB environment variable, e.g. for bash

export COSIMA_COOKBOOK_DB=/g/data/hh5/tmp/cosima/database/access-om2.db

The new ik11 database (/g/data/ik11/databases/cosima_master.db) can be selected explicitly in the same way, either by passing its path to create_session or by setting COSIMA_COOKBOOK_DB.
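For scripts that need to work both before and after the switch, the choice of database can be made explicit. The sketch below is only an illustration: resolve_db is a hypothetical helper, and the precedence it applies (explicit argument, then COSIMA_COOKBOOK_DB, then a default) is an assumption for the example rather than documented Cookbook behaviour. The two paths are the ones given above.

```python
import os

# Database paths quoted in this announcement.
OLD_DB = "/g/data/hh5/tmp/cosima/database/access-om2.db"
NEW_DB = "/g/data/ik11/databases/cosima_master.db"


def resolve_db(db=None, environ=None, default=NEW_DB):
    """Hypothetical helper: pick a database path.

    Order of preference: explicit `db` argument, then the
    COSIMA_COOKBOOK_DB environment variable, then `default`.
    """
    environ = os.environ if environ is None else environ
    return db or environ.get("COSIMA_COOKBOOK_DB") or default


# Usage with the Cookbook (requires cosima_cookbook, so left commented out):
# import cosima_cookbook as cc
# session = cc.database.create_session(db=resolve_db())
```

Passing db=OLD_DB keeps an analysis pinned to the hh5 index for as long as that file exists.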

Technical Working Group Meeting, May 2020


Date: 20th May, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants

ACCESS-OM2-01 scaling experiments

See PL’s associated scaling doc, scaling spreadsheet and python notebook
PL: Scaled MOM5 and CICE5 by same amount. Based on 01deg_jra55v13_ryf9091. Run an initial run to get restart output from February 1900. Restart runs for February (28 days). 540s time step. 4480 steps. No diagnostic output. Left ice as is.
PL: CICE scaled ncpus and ntask proportionally. Scaled MOM from 80×75 (4358) to 160×150 (16548). Scaling ok looking just at ocean timer and ice timer. Didn’t have daily ice output.
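The numbers quoted here can be checked with a little arithmetic. The sketch below assumes the figures in parentheses are the MOM rank counts left after land-only tiles are masked out of the layout; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86400


def steps_per_run(days, timestep_s):
    """Number of ocean time steps in a run of `days` at `timestep_s` seconds."""
    total = days * SECONDS_PER_DAY
    if total % timestep_s:
        raise ValueError("time step does not divide the run length")
    return total // timestep_s


def masked_fraction(layout, ranks_used):
    """Fraction of layout tiles dropped (assumed to be land-only tiles)."""
    nx, ny = layout
    total = nx * ny
    return (total - ranks_used) / total


print(steps_per_run(28, 540))  # 28-day February at 540 s: 4480 steps
print(f"{masked_fraction((80, 75), 4358):.1%} of 80x75 tiles masked")
print(f"{masked_fraction((160, 150), 16548):.1%} of 160x150 tiles masked")
```

Under that assumption the finer layout removes a slightly larger share of tiles, so rank counts grow a little less than fourfold when both layout dimensions are doubled.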
PL: Most efficient at 10K cores with total wall time. Ocean timer shows perfect scaling. CICE only timer also shows good scaling.
NH: Keen to try these configs in production.
PL: Not sure how appropriate for production, no IO. NH: Good place to start, turn on output and see how it goes. Looks well balanced. Somewhat surprised.
PL: Now trying to reproduce Marshall’s figures from the report. Scales ocean and ice separately. Yet to get reproduction runs going. Working through namelist differences. Sometimes get a silent hang. Worth scaling ocean and ice at same time.
AH: Why do both models scale so well but overall not so well when combined. RY: CICE is waiting for MOM. Maybe some more optimal setting for CPU numbers?  AH: Seems odd MOM scaling better than CICE, but CICE waiting for MOM.
NH: CICE is waiting less for ocean as cpus scaled. oasis_recv is constant, which means MOM not waiting on CICE. Definitely don’t want MOM waiting for CICE. RY: If increase MOM and reduce CICE would we get better performance? PL: Not sure. Might be useful to know how I got those numbers, using log file, and figures are total time divided by number of steps. RF: Output from access-om2.out. Just summary, won’t show load balance with MOM. PL: Any guidance would be useful. RF: Look in access-om2.out.
AH: Look for MOM timers. Might be some information about range of values, could be some very slow PEs masked by average. RF: the check mask is out of date. Has 12-16 processors which are purely land. Changed land mask and didn’t update mask. Some processors only have values in the halo boundaries. Crashes otherwise.
PL: Regenerated new mask files. Numbers should agree with what was done. Any more advice would be welcome. Send email or talk on slack. RF: I’ll look at CICE layouts and balance, and masks. CICE Is also seasonally dependent.
PL: Moving to a more conventional experiment layout. Will move to a shared location. AH: Could put in /g/data/v45.
AK: CICE scaling with serial IO. Nic almost finished PIO. Will stop scaling without PIO. Runs much faster with parallel even with monthly outputs. AH: Seems to be scaling ok. AK: Any output written? AH: Running for a month, so should be some output.
AH: Ran from initial conditions? PL: Yes. Ran for 1 month with timestep of 300s. Then ran from those restarts with timestep of 540s. AH: There is an ice climatology? RF: If run for a month, should have generated ice. AK: Ice generated from surface temperature in initial conditions.
RY and PL left meeting.
AH: Maybe a bit more to look at in PL runs. NH: May have misunderstood where those numbers came from. RF: Looked like it was scaling nice and linear. AH: Yes for each model, but together scaling died going to 20K. RF: Not sure these results are that useful when IO is turned on. Code paths not currently going through without IO. Putting stuff on density levels. And a whole lot of globals/collectives that aren’t being done. AH: Encouraging though. NH: In principle can scale up.

PIO compilation in ACCESS-OM2

NH: Got a reply from NCI. Resistance to having PIO in a module. Best to be self sufficient. If it turns out to be an issue can address later. Will make a submodule. Clean up the build process. Changes to CICE repo. One CICE namelist change, tell it to not explicitly use netCDF for certain things. Bit odd.
NH: Experiment repos will require updates. Maybe AK will report some more realistic performance numbers.
AH: PIO with MOM? NH: Not sure. CICE isn’t doing a great deal in the configuration I am using. Seems to all work inside parallel netCDF as doing output from all processors. Can use IO nodes and use comms, but doesn’t show performance improvement, and looks worse in many cases. We could configure in the same way without using PIO. RF: Don’t have much control where we put processors. CICE at the end. Probably sharing with MOM. Playing with layout might be tricky. NH: At some stage put CICE on all it’s own nodes. RF: Once YATM is on the first node, it ends up messing things up. NH: Why are we doing that? RF: Something to do with OASIS in the old days. Now have YATM and root PE of MOM on same node. Would make sure all root PEs on their own node. No contention. YATM and MOM also on same NUMA partition. NH: We should change that, easy fix. YATM doesn’t do much on 0.1 as rest of the model takes so long. RF: Two IO processors on the same node. MOM root PE uses for diagnostics and YATM process. NH: If each model on their own node. Could make sure each node has a single IO processor. With PIO if want 1 node per 16 processors, don’t know if it is talking across nodes.
JM: In terms of PIO are multiple nodes writing to the same file? NH: For CICE every single process writing to same file at the same time. Works well. Haven’t looked into it deeply, probably the optimum is something in between. Still a big improvement over serial output. AH: Kaizen (改善): small incremental improvements all the time. Compressed netCDF output? NH: No. PIO GitHub talked about supporting compression. AH: Same as what RY and Marshall did? NH: Yes. Have to wait for parallel netCDF implementation which supports it. Confusing because there is also p-netCDF. PIO is a wrapper. AH: Yes, wraps p-netCDF and MPI-NetCDF. p-netCDF is only netCDF3, not based on HDF5. AK: Will need post-processing compression step. NH: Task not done until compression done. AK: Very sparse data, shame not to compress.
AH: xarray is supporting sparse data now. FYI. Can mean a lot less memory use for some data.

Compiling with/without WOMBAT

AH: Any speed/memory use implications to always have it compiled in? RF: Should be separate. Overhead basically nothing. Will only allocate BGC arrays if they’re in the field table. Should be kept separate like all other BGC packages. I put in some lines in the compile scripts. Also if you want to compile without ACCESS.
AK: If want to maintain harmony with CM2 want a non-BGC compilation? RF: Yes. AK: From the point of view of OM2 users would be nice to be able to switch BGC on and off just through namelists. RF: Switched on via field_table. Strange design choices years ago. Also need changes in some of the restart files. AK: Not something that can be switched on and off? RF: No.

MOM Pull Requests

AH: Guidance for checking? RF: Two main changes in the code are probably fine. Maybe the ACCESS compilation scripts. Unless want to change that it gets compiled in all the time for ACCESS-OM. AH: Decided not I think. RF: Made changes to specify the type of model. AH: Separate model designation with WOMBAT? RF: ACCESS-OM-BGC is a new model type. Run tests, all ok. AH: Do we need any tests to check it hasn’t changed non-BGC tests? RF: Shouldn’t be anything that affects a normal run. Code compiled ok on travis. Put in some heat diagnostics, the fluxes from CICE, might be the only thing. AH: Are Jenkins tests working? NH: ACCESS-OM2 tests haven’t worked since moved to the new machine. RF: Run a 1 degree model and see how it goes. AH: I’ll do that.
AK: Managing ACCESS-OM2, should the distinction between BGC and non-BGC be in the control directories. So build script builds both and choose which in the config, or compile once, supporting both. AH: I don’t think BGC is a supported configuration yet. Needs testing. How it is implemented, shared or separate exes is just a choice of how you decide is the best.
AH: Turns out that Geos PR was a mistake. Asked about it, and they closed it.

Bad bathymetry

AH: Any comments? Does it need fixing? RF: Bad bathymetry needs to be fixed, or copy bathymetry from somewhere else. Bad around Australia. Same for CM2. Mentioned it 3-4 years ago, still not updated. Some pits in Gulf of Carpentaria down to 120m in 0.25. 1 degree goes down to 80m. Should be no deeper than 60m. OCCAM created some bad bathymetry in Bass Strait, off coast of China. Russian and Alaskan issues, and White Sea. Remapping indices got mucked up. AH: Wasn’t 0.25 fixed north of Bering Strait? RF: Doesn’t look like it.
JM: Bathymetry files are wrong in certain regions? RF: Came from Southampton OCCAM model. They ran it with a normal mercator and a transverse mercator across the top. Remapping onto spherical grid indices got mucked up and got some strange bathymetry. GFDL inherited it and based a bunch of models on it. Leaked through to the ACCESS models. Was in the US forecast model and they noticed all the stuff around Alaska.
AH: Should be relatively straightforward as this is only ocean bottom cells, and doesn’t touch coasts? RF: Yes. AK: Base on a coarsened tenth grid? RF: Not a big job, just a few slabs that need smoothing/removing. AH: Does this need to be fixed for the next release of OM2? RF: Yes. AK: No. RF: Get a student to look at it. AK: Also land mask inconsistencies, would be good to have all three models consistent. There are big curvy bits of coastline keeping ocean away from tripoles. AH: The 1 degree is very much a model, that isn’t that realistic. Tenth starts to look much more like real life.

Zarr file format

AH: Wanted to engage JM about zarr. RF: Interested as this is being used in decadal prediction project. JM: Exactly. Talked today about parallelising output from model into netCDF, and then post-analysis requires transforming to zarr. Zarr is a distributed file format that stores files in directories, each chunk is a separate file, parallelisation handled by filesystem. Should we write directly into a zarr-like file format? There are file formats like it. netCDF5 may have a zarr-like back-end. RF: There is some discussion on the netCDF GitHub about zarr, looks like just one person. JM: Unidata is willing to move away from HDF5. Parallelisation of HDF5 has never worked the way it was supposed to. Instead of using parallel IO, just write directly to the format people want to use. AH: Got the impression netCDF people never got the buy-in from HDF5 that they thought they would get. HDF5 just do their own thing. JM: Still have people using netCDF3. AH: A strength of netCDF, they could hop back end again and keep the same interface. JM: Same data model. AH: What is the physical format of a zarr blob? JM: It is a binary blob that supports different filters/compression schemes. AH: Does it do machine-independent storage? Bad old days with swapping endianness on binary files. AG: In zarr there are raw data blobs, and associated metadata files that describe the filter/endianness etc.
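The layout JM and AG describe (one file per chunk, plus a metadata file recording byte order and compression) can be imitated with the standard library alone. The sketch below is a toy and does not use or reproduce the real zarr library or its on-disk specification: the file name `meta.json`, the metadata keys and the function names are all assumptions made for illustration.

```python
import json
import struct
import zlib
from pathlib import Path


def write_chunked(values, store, chunk_len):
    """Write a 1-D sequence of floats as one compressed file per chunk."""
    store = Path(store)
    store.mkdir(parents=True, exist_ok=True)
    meta = {
        "shape": [len(values)],
        "chunks": [chunk_len],
        "dtype": "<f8",        # little-endian 64-bit float, stated explicitly
        "compressor": "zlib",
    }
    (store / "meta.json").write_text(json.dumps(meta))
    for start in range(0, len(values), chunk_len):
        chunk = values[start:start + chunk_len]
        raw = struct.pack(f"<{len(chunk)}d", *chunk)
        # Each chunk is an independent file, so writers need no coordination.
        (store / str(start // chunk_len)).write_bytes(zlib.compress(raw))


def read_chunk(store, index):
    """Read one chunk using only the metadata and that chunk's file."""
    store = Path(store)
    meta = json.loads((store / "meta.json").read_text())
    if meta["dtype"] != "<f8" or meta["compressor"] != "zlib":
        raise ValueError("unsupported dtype or compressor in metadata")
    raw = zlib.decompress((store / str(index)).read_bytes())
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))
```

Because the byte order is recorded in the metadata and applied on both write and read, the stored blobs do not depend on the endianness of the machine that wrote them.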
JM: Inodes not a problem. Still relatively large, on the order of the lustre striping scheme. Can wrap the whole thing inside an uncompressed zip file. Parallelises for reading just fine. Works like a tar, index on where to read, supports multiple reads on same file. AH: Would want to do this when archiving.
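JM's suggestion of wrapping a chunk directory in an uncompressed zip can be sketched with Python's `zipfile` module. This is only an illustration of the idea, not a zarr tool: the function names are hypothetical, and the zip file is assumed to be written outside the directory being packed.

```python
import zipfile
from pathlib import Path


def pack_store(store_dir, zip_path):
    """Wrap every file under store_dir into one uncompressed (stored) zip."""
    store_dir = Path(store_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(store_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the directory root.
                zf.write(path, arcname=path.relative_to(store_dir).as_posix())


def read_member(zip_path, name):
    """Read a single member; the zip central directory acts as the index."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(name)
```

The archive then occupies a single inode, while any one chunk can still be read without unpacking the rest.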
NH: Another one is TileDB, which is a file format. JM: There are other backends, n5/z5. Distributed storage for large data sets.
AH: At one stage we did wonder if collation was even necessary with tools like xarray, but never looked into it. NH: Things have changed a lot. xarray is relatively new. 3-5 years ago might segfault on tenth model data. So much better now, so many more possibilities.


Technical Working Group Meeting, April 2020


Date: 29th April, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Apologies from Peter Dobrohotoff.

JRA55-do v1.4 support

AK: Staged rollout. NH tagged some branches, so existing master tagged 1.3.0, using old JRA55-do v 1.3.1 using NH new exes which also support 1.4
AK: Also working on a new feature branch for 1.4. Same exes configured to use JRA 1.4 version. Seems to run ok. Not looked at output. Will look at that today. Once satisfied that is ok will move into master, tag 1.4.
AK: Also looking at ak-dev branch with a wide variety of changes. Once this is ok will tag with a new ACCESS-OM2 version. Will be new standard for new experiments. Good to make an equivalent point across repos.
AH: COSIMA cookbook hackathon showed value of project boards. Might be a good idea next time something like this attempted. AK: Tried, but it didn’t go anywhere.
NH: Two freshwater fields coming from forcing, liquid and solid. Both go into the ICE model which accepts one new forcing field. Get added together, solid magically becomes liquid without heat changes, passed straight to Ocean. Ocean and Ice models have also been changed to accept liquid part of land/ice melt and heat part of land ice melt. Exist but just pass zeroes. Extra engineering not being used as yet. A harmonisation step which takes us close to CM2 as coupled model uses these fields.
RF: With my WOMBAT updates incorporated this new code, could get rid of ACCESS-CM preprocessor directives.
NH: In the future can put work into calculating those fields correctly in the ice model. Not a huge amount of work. Will then have river runoff, land ice runoff and land ice melt heat.
NH: New executables have another change, support different numbers of coupling fields. Land/ice coupling fields are optional. At runtime figures out what coupling fields used. Dependent on namcouple being consistent. Coded internally as a maximum set of coupling fields. You can take coupling fields out but not add new ones. Possibly useful for others. Not a fully flexible coupling framework.
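The rule NH describes (a fixed maximum set of coupling fields, from which a configuration may drop fields but never add new ones) can be sketched as below. This is not the libaccessom2 implementation, which is Fortran: the function name is hypothetical and the field names are placeholders loosely based on the three runoff/melt fields mentioned in these minutes.

```python
# Placeholder names for the maximum set known to the executable (assumed).
MAX_COUPLING_FIELDS = ("river_runoff", "land_ice_runoff", "land_ice_melt_heat")


def active_coupling_fields(requested, known=MAX_COUPLING_FIELDS):
    """Return the requested fields in canonical order.

    Fields may be left out of the request, but any field outside the
    compiled-in maximum set is an error.
    """
    unknown = [name for name in requested if name not in known]
    if unknown:
        raise ValueError(f"unsupported coupling fields: {unknown}")
    return [name for name in known if name in requested]
```

In this sketch `requested` stands in for whatever the models read from namcouple at runtime, which is why the configuration files have to stay consistent with each other.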
NH: Working on ak-dev branch. Harmonising namcouple files. Have a lot of configuration fields, but a lot ignored. Could use same namcouple in all configs, but in practice might leave them looking a little different. They include the timestep in them, but ignored. Could set to zero? AH: Or a flag value that is obviously ignored?
NH: Only three variables used in namcouple. Rest ignored, but must parse properly. Needs cruft to make it parse. Never liked namcouple. Completely inflexible, values must be changed in multiple places.
AK: On version Oasis3-mct2, have they improved it in new version?
NH: Can now bunch fields together, pass a single 3d field instead of many 2d fields. Should improve performance. RF: Not through namcouple at all. Just a function call.
MW: What does OASIS do now? NH: Just doing routing. Which is done by MCT anyway. Remapping done by ESMF. Coupler meant to do 3 things, config, remap and routing. Made libaccessom2 do as much as possible automatically. So OASIS does very little. Still using API, so would require effort to remove.
MW: Know about NUOPC? NCAR is using it. NH: Coupling API. If all use the same API then can go plug and play. MW: MOM6 has a NUOPC driver. NH: In the future would like to look at OASIS4, but probably just chuck OASIS, use MCT to do the routing and ESMF to do remapping. MW: NCAR dropped MCT. NH: MCT is a small team. AK: Something that would suit ACCESS-CM. Any critical things that rely on OASIS? MW: At mercy of UM. Probably still use OASIS due to Europe. NH: Not using ESMF, so using OASIS a lot more than we are. Might never change because of that. AK: Even moving to v4 would require coordination with CM2. NH: Nicer and cleaner, but no clear benefit.

Updated ACCESS-OM2 model configs

AK: 3 different tags. 1.3.1, 1.4 in works. ak-dev new tag. 1.4 intended to be minimal other than change in JRA55-do version. ak-dev making more extensive changes. Using mppnccombine-fast for tenth. Output compressed data and use fast collation. Not worthwhile for 1 deg. With 0.25 output uncompressed and use mppnccombine to do compression. Hopefully output will be a reasonable size.
AK: If outputting uncompressed restarts might get large. Might want to collate restarts. Wanted to verify which run is collated: just finished, or the previous run? AH: Yes it is the restarts which are not used in the next run.
AH: Because quarter degree is not compressed won’t get the inconsistent chunk sizes between different sized tiles. Ryan had the problem when he had an io_layout with very small chunk sizes which made his performance very bad. mppnccombine-fast might be faster, will definitely use less memory. Still got compression overhead but memory use much reduced. AK: Not such a big issue as tenth. AH: Paul Spence had some issues with the time to collate his outputs. Maybe because they were compressed. Would recommend using AK: Fast version will always be faster? AH: Yes, at least no slower, but definitely uses less memory and will be much faster with compressed output.
MW: No appetite for FMS with parallel-IO? AH: Compression? Without it won’t bother probably. RY: Did some tests on parallel IO compression tests. Can’t recall results. Interested to try again. Requires a bit more memory. gadi has optane as storage or as memory. Interesting to test. Probably can use that for parallel compression or even just serial compression. Thinking about, but haven’t started. AH: Please keep us updated.
NH: Anyone have thoughts on CICE? Planning on parallel IO on CICE. Are we going to need a compression step? RF: With daily would like compression. Post processing to do compression on smaller number of PEs would be fine. Improving IO is critical for Paul Sandery and Pavel. NH: Might need a post processing step similar to MOM. RF: Yes. Getting parallel IO is the most important. Worry about compression later. NH: Did a run yesterday with parallel IO. Completed successfully. Output was garbage. Was expecting to do heaps of work and segfaults. Surprised at that. RF: Misaligned or complete garbage? NH: Default assumption as bad as can be. Just used parallel-IO output driver on CICE. AK and RF realised daily CICE output was a bottleneck on 0.1 performance. As model code existed, decided to get working. RY: Parallel IO needs to set up mapping correctly between compute and IO domains. NH: Should be part of the current implementation. Mapping is a tricky part of CICE. AK: Values out of range, so maybe not just a mapping issue? NH: Completely broken, but not segfaulting. Just getting it building was one hurdle. Also had to call the right initialisation stuff within CICE. Had to rewrite some of it that was depending on another library from one of the NCAR models (CESM). CICE is used with CESM and they had a dependency on another utility library. Changed some code to remove dependence. Relatively positive. Library under active development and well supported. AH: Did they develop just for their use case, and maybe doesn’t support round-robin? NH: Not sure. We do know never been used in any other model than CESM.
MW: Ed Hartnett (PIO) eager to get into FMS. Also lead maintainer of netCDF4.

Status of WOMBAT in ACCESS-OM2

RF: Compiled. Next is testing. Up to current ACCESS-OM2 code changes. Had issues with submodules. AK: Previously libaccessom2 dependencies brought in through CMake, now moved to submodules. If you have an existing repo will have to initialise submodules to pull in latest from GitHub.
RF: Made some changes to installation procedures. Can go between BGC version or standard ACCESS-OM. Want it to be different for BGC version. Changes to install scripts and hashexe etc. AH: Good that it is up to date, could have been a messy merge otherwise. RF: Will run tests today or tomorrow.


AH: See this PR? Seemed a bit odd to me. First idea was to ask them to split the PR into science changes and config changes. RF: Looked like a lot of it was config changes. MW: Adding the GEOS5 stuff, which they shouldn’t. Code changes are challenging. Introduced a generic tracer, not sure what they’re doing with it. AH: Strategy? Ask them to wrap science stuff in preprocessor flags? MW: First step is to get config stuff out. Asked GFDL about it. GEOS are switching from MOM5 to MOM6. This must be associated with that effort to validate their runs. Maybe just giving back what it took to get it work. Maybe just makes his build process easier. AH: They have a specific requirement to use the same FMS library. Seems odd, as MOM5 and MOM6 are not likely to share FMS versions in the future. MW: Thorny topic, as it is not clear how FMS compatible MOM6 will be in the future. AH: Using FMS for less and less. MW: The PR needs to be cleaned up. AH: Also put in a CMake build system. MW: They need to explain more.
AK: Has conflicts, so can’t be merged at the moment. AH: Only going to get more conflicted. Which is why I was thinking they could split it up. I have a CMake build system in another branch, but never finished. If we can use theirs, cool. I’ll engage with them.


AH: Been experimenting with graceful error recovery with payu. Can specify a script which can decide if the error is something you can just resubmit after. Mostly of interest to the production guys.
PL: Scalability testing with land masks, manifests, and payu setup. Supposed to be simpler but taking some time to get used to it. AH: Manifests are relatively new so some of the use cases have not been as well tested. MW: Are not all using manifests? AH: They are, but can be used in different ways. Tracking always works, but options to reproduce inputs and runs. Suggested PL could use reproduce to start a run. It was confounded by some restarts being missing, so not quite sure if it works as we would like. This is a very desirable feature, as it makes it very simple to fork off new runs from existing ones as well as making sure the files are consistent. PL: Working now. Next step is to change core counts and look for scalability numbers. AH: When I was doing scalability stuff for MOM-SIS I used input directory categories to isolate processor changes. Not quite doing that same thing anymore, but you can do something similar, but you won’t want to use the reproduce flag if you are changing any of the input files.
AK: Just MOM scaling or CICE as well? PL: Just looking at MOM to begin with to see dependency and wait times. AK: CICE run time is critically dependent on daily outputs. Relevance to scaling data to production output. MW: Make sure your clock can tell them apart. In principle can distinguish compute from IO. AH: Daily output always part of production? AK: Ice modellers want very high temporal output. Ice is very dynamic. Even daily output not enough to resolve some features. Maybe wait for PIO for CICE scaling tests? AH: I thought scaling tests always turned off IO? Can’t properly test scaling with daily output, as it dominates runtime.
NH: Would be nice to look at performance with and without PIO. PL: Will also look at CICE. Start with ocean model. AK: Were you (MW) running models coupled for paper scaling numbers? MW: Coupled. Not sure what IO was set to. Subtracted it and don’t recall it was large. Don’t recall a bottleneck, so might have had it turned off. RF: Wouldn’t be running with daily IO. Monthly IO doesn’t show up. MW: Sounds likely.
AK: For IAF had a lot of daily CICE output. Not complete set of fields.
MW: Starting to run performance tests at GFDL and want to use payu. Has it changed much? Manifest stuff hasn’t made a big difference? Will have to get slurm working. Filesystem will be a nightmare. You moved PBS stuff into a component? AH: No, you did that. Not huge differences. Will be great to have slurm support.

Technical Working Group Meeting, March 2020


Date: 18th March, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Scalability of ACCESS-OM2 on gadi

(Paul’s report is attached at the end)

PL: Looking at scaling. Started with ACCESS-OM2, but went to testing MOM5 directly with MOM5-SIS. Using POM25, global 0.25 model with NYF forcing. The model MW developed for testing scaling prior to ACCESS-OM2. Had to specify min_thickness in ocean_topog_nml.

PL: Tested the scaling of 960/1920/3840/7680/15360, with no masking. Scales well up to some point between 7680 and 15360.
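A statement like "scales well up to some point" is usually read off speedup and parallel efficiency relative to the smallest run. The helper below is a generic sketch of that arithmetic, not the analysis used in PL's report; the function name is hypothetical and no measured timings from the report are reproduced here.

```python
def scaling_table(runtimes):
    """Compute speedup and parallel efficiency relative to the smallest run.

    runtimes maps core count -> wall time in seconds.
    Returns a list of (cores, speedup, efficiency) sorted by core count,
    where efficiency is speedup divided by the increase in core count.
    """
    cores = sorted(runtimes)
    base_cores = cores[0]
    base_time = runtimes[base_cores]
    table = []
    for n in cores:
        speedup = base_time / runtimes[n]
        efficiency = speedup / (n / base_cores)
        table.append((n, speedup, efficiency))
    return table
```

Fed a dictionary of wall times keyed by the core counts tested, an efficiency that drops well below 1 between two entries marks the elbow in the scaling curve.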

PL: Tested effect of vectorising options (AVX2/AVX512/AVX512-REPRO). Found no difference in runtime with 15360 cores. MW: Probably communication bound at the CPU count. Repro did not change time.

MW: Never seen significant speed up from vectorisation. Typically only a few percent improvement. Code is RAM bound, so cannot provide enough data to make use of vectorisation. Still worth working toward a point where we can take advantage of vectorisation.

PL: Had one “slow” run outlier out of 20 runs. Ran 20% slower. Ran on different nodes to other jobs, not sure if that is significant. MW: IO can cause that. AH: Andy Hogg also had some slow jobs due to a bad node. AK: Job was 20x slower. Also RYF runs became consistently slower a few weeks ago. MW: OpenMPI can prepend timestamps in front of output, can help to identify issues.

PL: Getting some segfaults in ompi_request_wait_completion, caused by pmpi_wait and pmpi_bcast. Both called from the coupler. NH: Could be a bad bit of memory in the buffer, and if it tries to copy it can segfault. PL: Thinking to run again using valgrind, but would require compiling own version of valgrind wrapper for OpenMPI 4.0.2. Would be easier to use Intel MPI, but no-one else has used this. Saw some cases similar when searching which were associated with UCX, but sufficiently different to not be sure. These issues are with highest core count. MW: Often see a lot of problems at high core counts. NH: Finding bugs can be a never ending bug. Use time wisely to fix bugs that affect people. MW: Quarter degree at 15K cores would have very small tile sizes. Could be the source of the issue. AH: This is not a configuration that we would use, so it is not worth spending time chasing bugs.

PL: Next testing target is 0.1 degree, but not sure which configuration and forcing data to use. Will not use MOM5-SIS, but will use ACCESS-OM2 for direct comparison purposes. AK: Configurations used in the model description paper have not been ported to gadi. Moving on to a new iteration. Andy Hogg is running a configuration that is quite similar, but moving to new configurations with updated software and forcing. Those are not quite ready.

PL: Need a starting configuration for testing. Want to confine to scalability testing and compiler flags. NH: ACCESS-OM2 is setup to be well balanced for particular configurations. Can’t just double CPUs on all models as load imbalance between submodels will dominate any other performance changes. Makes it a problematic config for clean configurations for things like compiler flags. MW: Useful approach was to check scalability of sub-model components independently. Required careful definition of timers to strategically ignore coupling time. MOM was easy, CICE was more difficult, but work with Nic’s timers helped a lot. Try to time the bits of code that are doing computation and separate from code that waits on other parts. Coupled model is a real challenge to test. Figure out what timers we used and trust those. Can reverse engineer from my old scripts.
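MW's advice to time computation separately from time spent waiting on other components can be illustrated with a small accumulator of named timers. The models themselves use Fortran timers inside MOM, CICE and libaccessom2; the Python class below is only a sketch of the bookkeeping, with hypothetical names throughout.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RegionTimers:
    """Accumulate wall time per named region, e.g. 'compute' and 'wait'."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def region(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Add the elapsed time even if the timed code raised.
            self.totals[name] += self._clock() - start

    def fraction(self, name):
        """Fraction of all timed wall time spent in the named region."""
        total = sum(self.totals.values())
        return self.totals[name] / total if total else 0.0
```

Wrapping the model step in `region("compute")` and the coupler exchange in `region("wait")` gives the wait time as a fraction of timed wall time. Regions should not be nested, since nested regions would be counted twice in the total.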

PL: Should do MOM-SIS scalability work? MW: Easier task, and some lessons can be learned, but runtime will not match between MOM-SIS and ACCESS-OM2. Would be more of a practice run. PL: Maybe getting out of scope. Would need 0.1 MOM-SIS config. RY: Yes we have that one. If PL wanted to run ACCESS-OM2-01 is there a configuration available? AK: Andy Hogg’s currently running configuration would work. PL: Next quarter need to free up time to do other things.

MW: Might be valuable to get some score-p or similar numbers on current production model. Useful to have a record of those timings to share. Scaling test might be too much, but a profile/timing test is more tractable. RY: Any issues with score-p? Overhead? MW: Typical, 10-20%, so skews numbers but get in-depth view. Can do it one sub-model at a time. Had to hack a lot of scripts, and get NH to rewrite some code to get it to work. score-p is always done at compile time. Doesn’t affect payu. Try building MOM-SIS with score-p, then try MOM within ACCESS-OM2. Then move on to CICE and maybe libaccessom2. PL: Build script does include some score-p hooks. MW: Even without score-p MOM has very good internal timers. Not getting per-rank times. score-p is great for measuring load imbalance. AH: payu has a repeat option, which repeats the same time, which removes variability due to forcing. Need to think about what time you want to repeat as far as season. AK: CICE has idealised initial ice, evolves rapidly. MW: My earlier profile runs had no ice, which affects performance. MW: Not sure it is huge, maybe 10-20%, but not huge.

MW: Overall surprised at lack of any speed up with vectorisation, and lack of slow-down with repro. PL: Will verify those numbers with 960 core config.

AH: Surprised how well it scaled. Did it scale that well on raijin? MW: The performance scaling elbow did show up lower. AH: 3x more processors per node has an effect? MW: Yes, big part of it. AH: 0.1 scaled well on raijin, so should scale better on gadi. 1/30th should scale well. Only bottleneck will be if the library can handle that many ranks.

NH: If repro flags don’t change performance that is interesting. Seem to regularly have a “what trade off does repro flags have?”, would be good to avoid. MW: Probably best to have an automated pipeline calculating these numbers. NH: People have an issue with fp0 flag. MW: Shouldn’t affect performance. NH: Make sure fp0 is in there. MW: Agree 100%.

ACCESS-OM2 update

AH: Do we have a gadi compatible master branch on gadi? AK: No, not currently. NH: At a previous TWG meeting I self-assigned getting master gadi compatible. Merged all gadi-transition branches and tested, seemed to be working ok. Subsequent meeting AK said there were other changes required, so stopped at that point. gadi-transition branches still exist, but much has already been merged and tested on a couple of configurations. Have since moved to working on other things.

NH: Close if AK has all the things he wants into gadi-transition branch. Previous merge didn’t include all the things AK wanted in there. Happy to spend more time on that after finishing JRA55 v1.4 stuff.

JRA55-do v1.4 update

NH: Made code changes in all the models, but have not checked existing experiments are unchanged with modified code.

NH: v1.4 has a new coupling field, ice calving. Passing this through to CICE as a separate field. In CICE split into two fields, liquid water flux and a heat flux. MOM in ACCESS-CM2 already handles both these fields. Just had to change preprocessor flags to make it work for ACCESS-OM2 as well.

NH: Lots of options. Possible to join liquid and solid ice at atmosphere and becomes the same as we have now. Can join in CICE and have a water flux but not a heat flux.

Strange MOM6 error

AH: A quick update with Navid’s error. Made a little mpi4py script to run before payu to check status of nodes, and all but root node had a stale version of the work directory. Like it hadn’t been archived. Link to executable was gone, but everything else was there. Reported to NCI, Ben Menadue does not know why this is happening. Also tried a delay option between runs and this helped somewhat, but also had some strange comms errors trying to connect to exec nodes. Will next try turning off all input/output I can find in case it is a file lock error. Have been told Lustre cannot be in this state.

MW: In old driver do a lot of moving directories from work to archive, and then relabelling. Is it still moving directories around to archive them? Maybe replace with hard copy of directory to archive. MOM6 driver is the MOM5 driver, so maybe all old drivers are doing this. Definitely worth understanding, but a quick fix to copy rather than move.

NH: Filesystem and symbolic links might be an issue MW: Maybe symbolic links are an issue on these mounted filesystems. AH: There was a suggestion it might be because it was running on home which is NFS mounted, but that wasn’t the problem. MW: Often with raijin you just got the same nodes back when you resubmit, so maybe some sort of smart caching.


Scalability of ACCESS-OM2 on Gadi – Paul Leopardi 18 March 2020



Technical Working Group Meeting, February 2020


Date: 27th February, 2020
  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

New installed payu version

Version 1.0.7 is now installed in conda/analysis3-20.01 (analysis3-unstable).

AH: payu is now 100% gadi compatible. Default cpus/node is now 48 and memory 192GB/node. The Python interpreter path, short path and manifests are scanned to automatically determine the storage flags. Using qsub_flags to manually specify storage flags no longer works: the automatically determined storage flag option is appended, and the manually specified one is not honoured.

RF: Paul Sandery having issues getting 0.1 deg model working. [AH: turns out it was a typo in config.yaml]

AH: No need for the number of cpus in a payu job to be divisible by the number of CPUs in a node. Request however many the job uses, and payu will pad the request to make sure the PBS submission is requesting an integer number of nodes if ncpus is greater than the number in a single node. PL: Rounds up for each model? AH: No, just the total. MW: Will spread models across ranks, so a rank can have different models on it.
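The padding rule AH describes can be written out as a few lines of arithmetic. This is a minimal sketch and not payu's own code: the function name `padded_ncpus` is hypothetical, and the default of 48 cpus per node is the gadi value quoted above.

```python
def padded_ncpus(ncpus, cpus_per_node=48):
    """Return the CPU count to request from PBS for a job using ncpus.

    Requests that fit within a single node are left unchanged; larger
    requests are rounded up to a whole number of nodes. Only the total
    is padded, not each model's share.
    """
    if ncpus <= cpus_per_node:
        return ncpus
    nodes = -(-ncpus // cpus_per_node)  # ceiling division on integers
    return nodes * cpus_per_node
```

For example, a job using 49 cores would request 96 (two whole nodes), while a 24-core job would still request 24.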

AH: Andy Hogg ran out 80 odd submits with the tenth model. Occasional hang, resubmit ok. Might be more stable than raijin.

AH: Navid has MOM6 model that cannot run more than a couple of submits without it crashing with an error that it cannot find the executable. Weird error, let me know if you see anything similar.

NH: Caution with disks and where to put things. Reading input files can be very slow sometimes, or not, and then files not there and turn up later. If executable is missing, running off a disk that is not good? MW: Filesystems are very complicated on gadi? NH: Less certainty of performance with such a different system with data file systems being mounted separately. I’d look at this.
PD: Good place to look if disk has got caught up doing too many tasks. gdata just hangs, saving text file takes a while. Due to being on login node? Get similar delays with interactive job on execute node.
AH: People reporting issues with login delays. Probably a disk issue? Navid’s job is not being run from gdata, but from scratch. Inclined to blame new system of mounting. Could we use jobfs. MW: Like in the old days when we ran on the node? Good luck! AH: Could just do some tests. NH: Concerning if scratch is slow.
AH: Not sure if filesystems are mounted with NFS. MW: That is what we do on gaia, and have tons of problems with mount on demand. Biggest frustration with using GFDL machine. It’s a nightmare. At least NCI have lustre know-how. AH: Used to have a lot of problems with NFS cache errors in the past, files disappearing and reappearing. Does sound similar to Navid’s problem.
MW: Raijin’s filesystem was quite good. Why the change? AH: Security. Commercial in confidence stuff. I think it is overblown. Can’t see anyone else’s jobs on the queue. Can’t even check if other people are running on the project. Are moving to 2-factor auth also.

What is required to get gadi transition into master for ACCESS-OM2

AH: Andrew Kiss is on personal leave but sent around an email:
re. gadi-transition, we could proceed like so:
– we’ve also been transitioning libaccessom2 to use submodules for its dependencies instead of cmake which would require this commit (not currently in gadi-transition branch)
– get the libaccessom2 tests working
– there’s a gadi-transition branch libaccessom2, cice and mom that could be merged into master. They use openMPI4.0.2
– there’s also a gadi-transition branch for all the primary (ie JRA, non-minimal) configurations but the exe paths would need to be updated before merging to master
– the access-om2 gadi-transition branch would then need to be updated to use the correct submodules for model components and configurations. We also want to remove the core and minimal config submodules
also fyi the current gadi build instructions are here
AH: Feels urgent that people can use on gadi. Any comments on Andrew’s email?
PL: Transition to submodules finished? AH: That is on a separate branch. NH: I did that work. Put it in a dev branch. Not intending to be part of gadi transition to have least number additions. AH: Agree if that is the easiest. Master is broken for gadi, so anything that works is an improvement. If there is no feedback can do this offline. Could make a project to be explicit about what is required. NH: Given that gadi-transition does work. Andrew and Andy use it. Wouldn’t hurt to put it in now. Work that PL has done to make sure it does reproduce ticks that box. So ready to go. Able to reproduce if we need to. I’ll merge it and do some interactive testing. Then people can use it and I can do automatic testing.
PL: What branch will it be merged into? A lot of branches in a lot of repos.
NH: Isolate gadi-transition branches and merge into master straight away. Not bother with other development branches at this stage. Want to get something in master that people can use. In future bring everything into dev as discussed, with master staying stable, just bug fixes, until decide to update from dev. I’ll go through the branches and just bring in the gadi transition stuff. PL: So dev will have submodule changes and master will not? NH: For the time being. With previous discussion we’ll be slower moving on master, to make sure it is working. Having dev will allow us to move that more rapidly. People can run off dev at their own risk. AH: Submodules will remain a named feature branch and pulled into dev at some future time. Should discourage having personal development branches on the main repo. If you want to experiment do it on your own fork. Branches on the main repo should be master, dev or named feature to keep it clean and everyone can understand what they mean.

Stack array errors and heap-array option

AH: Apologies minutes from last TWG meeting are not on the COSIMA website. There is an IT issue with the server. We wanted to follow up with stack array errors.
AH: Did ever test on raijin with same compiler? Is there any way we can do comparative test? Use raijin image? Any more from Dale about this stack stuff? PL: Haven’t heard anything. AH: Last meeting some mention of there being a limit on UM stacksize. RY: Already fixed Ilia’s issue. Fixed by making stacksize unlimited. RF: Always run with unlimited stack size. When had problem only fixed by setting heap arrays small or zero. When I went into code and made array allocation from automatic to allocatable the error went away.
MW: If I have an automatic array I get three different heap allocations for three different compilers. RF: This option forces all arrays on to the heap.
AH: This was fixed a while ago Rui? RY: Not clear this is the same problem. Ilia’s issue was the end of 2019 when gadi first on line. Not sure it is the same issue.

BGC Update

AH: Russ forwarded an update to Andy Hogg.
RF: Work was completed on raijin in 2019. BGC code in to MOM and CICE. Required changes in CICE: moving arrays around to different modules due to scope issues which allow optional fields to be sent. Main one is to send 10m winds to ocean, not just the wind stress. Holding off to issue PR until gadi transition done so could go in clearly.
NH: Will be useful for JRA1.4 work.
RF: Hakase will be using it for BGC. Passing algae between ice and ocean components. To add new field, need to add field to code, but don’t have to be passed. Just picked up from namcouple using the flags in OASIS to see if it’s registered.
AH: Can this be the next cab off the rank after gadi-transition, before AK’s science tweaks? Not relying on any changes in Andrew’s branches? RF: Would like to get gadi transition out of the way and then test these changes. Not tested on gadi yet.
How to proceed? Testing?
I’ve held off issuing a pull request until the dust settles wrt the gadi transition. There’s a bit of code rearrangement in order to allow optional fields (10m wind speed but this can be extended) to be passed from CICE.
The flags ACCESS-OM-BGC (tested) and ACCESS-ESM (untested) enable compilation of the BGC code. The 10m winds need to be added to the namcouple files and the MOM coupling fields namelist.
Work done on raijin last year. Changes in CICE to move arrays around in modules due to scope issues. Main one is to send 10m winds to ocean. Not just wind stress. Holding off until gadi-transition done.
NH: Useful for stuff I’m doing with JRAv1.4.
RF: Hakase will use for BGC, passing algae between ice and ocean components. Have to change code to add fields. Don’t need to hard code as much. Once field in there optional to pass. Using the OASIS flags to see if registered.

JRA55-do counter-rotating cyclones

RF: Fortunately Paul Sandrey’s started in 1988. Last reverse cyclone in 1987. Cafe 60 use whole month window, so washed out on the average.
One of the RYF runs has reverse cyclone (83-84). Tell Kial.


PL: Thanks to Marshall for getting me up to speed on scaling tests and sharing scripts. Can reproduce diagrams so can compare between raijin and gadi.
AH: Any more performance numbers? PL: Now in a position to answer questions, just need to know what questions to ask.
AH: ACCESS-OM2-01 currently running around 5K cores, would love to be able to scale to 10K, 20K even better. MW: MOM scaled to 50K. AH: CICE doesn’t scale as well. MW: Any work on CICE distributions? RF: Nope. Would need to be done again at higher core counts. MW: Current one working really well. AH: On NH’s to-do list was to experiment with layouts and load balancing. MW: Alistair is very interested in load balancing sea ice models. Particularly icebergs. Has some quasi lagrangian code in SIS2 to load balance icebergs. Maybe some ideas will translate or vice versa.
PL: For the moment will just look at MOM and see how it scales at 0.1? AH: Maybe just try doubling everything and see if it scales ok? MW: Used to make those processor heat maps to get the load imbalance of CICE. Would be good to keep an eye on that while working with scaling. Tony Craig (CICE developer) is very interested.

Atmosphere/coupled models

PD: Still using code frozen for CMIP runs. Extending number of runs in ensemble.
AH: People in CLEX are keen to run CM2. PD: Not aware, maybe through someone else, maybe Simon or Martin? CM2 and ESM-1.5 runs have been published under s38 project.
AH: Scott Wales doing an ultra high resolution atmosphere run over Australia, under  the STRESS2020 project. PD: Atmosphere only, do you know what resolution? I’ve also done some high res atmosphere only runs. On a project to improve turbulent kinetic energy spectrum in UM. Working on code to put stochastic back scatter into low res N96 (CMIP6) atmosphere. Got some good results injecting turbulent kinetic energy into small scales to improve artificial dissipation associated with semi-lagrangian timestep in UM. To test this is to see how improved N96 results compare to N512 runs using STRESS2020 resources. Working with Jorgen Fredrikson. Should talk to Scott.
AH: At the moment Scott is targeting 400m over Australia. PL: Convection resolving? AH: Planning a 2 day run to simulate Cyclone Debbie. Nested 400m run for Australia, inside BARRA at 2.2km. 10500×13000. PD: We’re going global. MW: How many levels? Same as global? PD: 85. AH: Major problem is running out of memory. MW: More cores should mean less memory. Maybe their Helmholtz solver imposes some memory limit on the ranks. AH: Currently waiting for large memory nodes to come online.


MW: New FMS version coming. Targeting auto tools and getting rid of mkmf. If you’re on MOM5 you can use your frozen version. Completely rewritten IO in FMS. Now a thin wrapper to netCDF. No more magic functions like save_restart, write_restart. They have been replaced by lower level ops to allow model developers to have more control. Not sure of the significance for MOM5. AH: API compatible? MW: Keep compatible with old API as long as they can. Could dump it in and slowly integrate. Only raising in case you want to do more innovative stuff with IO. PL: Affects MOM6 mainly? MW: MOM6 is one of the main targets. PL: Parallel IO support? MW: Part of the reason. They want parallel IO in the atmosphere model, which NCAR now uses. Now an important model. This implements the hooks for that work. RY: MPI-IO still there or be replaced by PIO? MW: It is. RY: Simpler to do one? MW: They’ve sent a patch to get MOM6 working with that now. Doesn’t work currently. Not sure about the progress, but know you were interested in PIO. RF: We’re interested from the ICE point of view. New version of BRAN will need daily inputs in CICE. Performance is terrible as IO is collected on to one processor. MW: FMS will not help CICE, but a test case if PIO is a valid solution.

Technical Working Group Meeting, November 2019


Date: 27th November, 2019
  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) ANU, Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

ACCESS-OM2 on gadi

PL: Submodules not updated (#176). Reported bug from CICE5 but not being built. AK: not sure how to release this. Sometimes model components updated but not tested. AH: gadi transition branch? AK: Yes. PL: Science bug.
PL: To test had to copy files around. Needed to update config.yaml and atmosphere.json. Made fork of 1deg_JRA55_RYF for testing. Had to move to non-public places as don’t have access to public places. Will send details in an email.
PL: conda/analysis3-unstable needs to be updated, payu not working on gadi. AH: Did update, still not working. Update only tested on interactive job. PBS job strips out environment. Wanted to consult with Marshall about why payu works as it does currently. Difficult to debug with payu-run as it does not have the same environment as “payu run”. PL: Work-around to add -V option to qsub_flags in config.yaml. AH: This is what I am considering changing payu to do by default. Not sure. Currently looking into this.
PL: nccmp module not on gadi. Been using for reproducibility testing. In backlog. RY: Can install personally, don’t have to wait for system install.
PL: Running on gadi. Got 1 deg RYF55 finished. Did not have mppnccombine compiled. Will have to do this to get this working correctly. Got something for baseline for comparison. Report by the end of the week.
RY: gadi 48 cores. Default based on broadwell (28 cores). Do you have an up to date config? Paul currently changes core count in his config, but is it done in official config?
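A minimal sketch of why core counts may need revisiting when moving from 28-core to 48-core nodes: a count that fills whole nodes on one machine can leave cores idle on the other. The core count of 5180 is hypothetical, chosen only for illustration, and is not taken from any official configuration.

```python
import math

def node_usage(ncpus, cores_per_node):
    """Return (nodes, idle_cores) when ncpus ranks are packed onto whole nodes."""
    nodes = math.ceil(ncpus / cores_per_node)
    idle = nodes * cores_per_node - ncpus
    return nodes, idle

# Hypothetical core count, for illustration only.
ncpus = 5180
for name, cores in (("broadwell", 28), ("gadi", 48)):
    nodes, idle = node_usage(ncpus, cores)
    print(f"{name}: {nodes} nodes, {idle} idle cores")
```

With these numbers the job fills 185 broadwell nodes exactly, but needs 108 gadi nodes with 4 cores left idle.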
AH: I was in the process of making an official configuration for gadi. Copied all inputs that were in /short/public to the ik11 project. Once directory structure finalised will make a config that runs, update on GitHub, and look at making the same changes for other configs. Make an exemplar config with those changes. RY: Should work on same configs.
RY: Anyone else running on gadi? AH: No.
AH: What are the impediments to others updating ACCESS-OM2 on GitHub? People not sure if they can? How they should go about it? AK: Put my hand up to do this. Other model components also need updating. AH: Maybe dev branch that everyone pulls from. Easier to make changes without worrying about breaking. So everyone working from the same version and don’t have to re-fix known bugs.
AH: Environment stuff? MW: Something about python exec command. Nuance? Wholesale copy everything? Wanted to create idealised processes, rather than depend on what users have set up. payu run submits job to PBS with whole new environment. Explicitly give environment variables.
AH: Drawback payu-run does not use same environment as payu run. MW: Not launching a process. payu run submits to PBS and starts posix process with defined environment. Exception when explicitly give it environment variables. AH: One work-around is to make list of environment variables want to keep. Losing MODULEPATH variables. PL: module env being used by payu required modules 3. Modules 4 works differently. Python code from modules 4 may work better.
MW: Fixed? AH: Thought I had, but was fooled because using payu-run. MW: If you set MODULEPATH locally, it won’t be exported to payu run process.
PL: What is the fix? MW: On raijin there was a bootstrap script in init dir, which sets everything. I duplicated those commands and put them in the payu module that did equivalent bootstrap. If moving to gadi and it is different none of that bootstrap script works. PL: Bootstrap script there, but completely different. MW: Was old version, and never actually used the bootstrap script. Maybe exec the bootstrap script they provide? AH: Or pass through environment variables that are set already. MW: Do whatever you think is best. Did try and make it so ‘payu run’ job was clean and always looked the same regardless of who submits. If we take entire ENV and submit to run, every run will be different. One variable is a controlled solution. Should be possible for the job on the submitted node to set it up on its own. Should get it going and not be held up by my purist notions. AH: Try/except blocks can be used to support multiple approaches. MW: Definitely need to bootstrap the modules. PL: Sent through email with details.
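The "list of environment variables to keep" idea can be sketched as below. This is not payu's actual code: the function name, the whitelist contents beyond MODULEPATH, the example path and the PAYU_EXAMPLE variable are all hypothetical.

```python
import os

# Hypothetical whitelist; MODULEPATH is the variable named in the discussion.
KEEP = ("MODULEPATH",)

def job_environment(explicit=None, keep=KEEP, parent=None):
    """Build a minimal environment for a submitted job.

    Only whitelisted variables are copied from the parent environment,
    then explicitly supplied variables are added and take precedence.
    """
    parent = os.environ if parent is None else parent
    env = {name: parent[name] for name in keep if name in parent}
    env.update(explicit or {})
    return env

# Placeholder parent environment, so the example does not depend on the host.
parent = {"MODULEPATH": "/apps/modules", "HOME": "/home/user", "EDITOR": "vi"}
env = job_environment(explicit={"PAYU_EXAMPLE": "1"}, parent=parent)
print(sorted(env))
```

Passing only a short, named list keeps submitted jobs close to identical between users, which is the property MW wanted, while still carrying through the one variable the module system needs.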

OpenMPI/4.0.1 on gadi

AH: Angus reported openmpi/4.0.1 seems broken. Has this been fixed?
AG: Any wrapped commands (mpicc, mpifort) will print whitespace before output. In most cases ok, but can break configure scripts. Ben M knows about it, but not why.
PL: Divide by zero error in MPI_Init. MW: Remember that one UCX back-end, FP exception. Evaluates a log function when evaluating binary tree when working out communication. Ben M told them about it, but got nothing back. We use FP exception checking, but can’t ignore for just MPI. PL: Work-around like turn off UCX? MW: Could turn off FP exceptions. A race condition, so not every job sees it. RY: Can turn off UCX. Can use ob1 instead of UCX. Also try that. PL: Wasn’t sure it would work on gadi.
AH: Maybe 4.0.1 not a good candidate for testing? Get intermittent crashes.

Russ update on model performance on gadi

RF: Been testing OFAM bluelink, compiled as MOM-SIS without doing ice. Performance was fantastic. 2x faster than Sandy Bridge. Don’t get hammered with extra cost on new CPUs. Initialisation was very fast. A lot of files, so might be a low load issue. Dropped from 100s to 8s. Doing data assimilation runs, run 3 days at a time. 25% of the run time was init. Now pretty much zero. MOM5 performance was really good.
RF: Did notice some variation on start up of CM4. Still a lot faster. Reads in a lot more files and a lot more data. Still considerably faster than on raijin. MW: MOM has IO timers, do you have those on? FMS timers. Rui used them a lot. RF: No, didn’t turn them on.
RF: Running CM4 was about 15% faster than Broadwell. Improved but will cost a lot more for decadal prediction. RY: 15% is normal. Martin reports UM is 30% quicker. RF: SIS2 load balance is bad. Probably a bunch of things being covered up. Needs more testing.
MW: Bob has never talked about SIS2 load imbalance. Presumably oblivious to them. RF: Would have to be. Regular layout would lead to many redundant processors. MW: Alistair has done some iceberg code load balance improvements. RF: Doesn’t take much time. Had to turn off iceberg stuff on raijin. netcdf stuff broke it. Might turn back on. Time spent in iceberg code minimal.

Stack array errors and heap array option

RF: When compiling need to set heap-arrays option in compiler, otherwise get segfaults with stack, even when stack set to unlimited. Wasn’t an issue on raijin. Happened for both MOM5 and CM4. PL: Dale mentioned about stack size limited to 8MB. RF: I unlimited stack size, so shouldn’t have been an issue. Got all sorts of issues with unmapped addresses. First one saw it was automatic so tried moving to allocatable, moved error. Then tried different heap-arrays size options, which moved error again. MOM5 dropped to heap-arrays 5KB. Same for CM4 but set to zero for SIS2 and it got through. Different models, seems ubiquitous. MW: Intel fortran?
MW: When compile and run on CRAY machines stack vars use malloc, so heap variables not stack. Same model, same compiler on laptop (gcc), same variables are stack variables. Is it possible moving from raijin to gadi something different about malloc. RY: CentOS 7 v 8 makes some difference. MW: Is kernel making some decisions on malloc? RY: Had similar issues with UM. Stacksize unlimited seemed to fix for UM. But Dale talked about this in ACCESS meeting, kernel changed something that caused this problem.
NH: Intel compiler has a heap-arrays option. Useful in some cases. Models can have array bounds overruns, and easier to track when trash heap compared to stack. RY: Slower? NH: Depends. Doesn’t do it for everything, just the larger arrays. RF: If you just set heap-arrays, all on heap. Can control it. MW: In MOM6 explicit places we declare variables we know we won’t use, contingent on assumption they are stack vars. Can’t make those assumptions any longer.
NH: Surprised to hear linux kernel. Would think it was Fortran runtime or compiler. MW: runtime or libc. Couldn’t figure out why different results with same compiler on different platforms. NH: Calculating variables addresses, compiler computes stack offsets. Looking at the executable there are static offsets. Needs to be done at compile time. MW: Shouldn’t be running models that need to use heap. Should be resilient to either choice. No? NH: Comes down to algorithms used to manage memory. Heap has algorithm to minimise fragmentation. Don’t have an answer, will need to think about it.
MW: Can you send a bug report for SIS2? RF: Could be everywhere that has run out of stack space. Just the first one I tried to fix this.
AH: What OS are you running on your laptop? MW: Archlinux. Comparing them to the travis VMs. AH: At some point the compiler has to query the system to see what resources are available? MW: The fact that you’re typing stacksize unlimited shows you accessing the kernel. AH: Seems strange, system has plenty of memory. MW: I’m interested in this problem. AH: Problem should be reported to relevant NCI people (Dale/Ben?). Potentially affecting a lot of codes. Not tenable that everyone who has this issue have to debug it themselves. MW: Bad memory explicit in stack, buried in the heap? NH: Can make a huge difference. Layout of memory is different. More likely something on HEAP won’t affect other variables. More fragmented on stack. Heap memory more tightly packed. MW: Fixed a couple of dozen memory access bugs in MOM6 and they take it seriously. RF: Old versions I’m using with CM4 release. Happens with MOM5. Only FMS common. MW: Wondering if this is a bug that is hidden moving from stack to heap.
MW: Using GCC9.0 to find these. Few flags to find stuff. Initialise with NaNs. malloc-perturb is an environment variable you can turn on and that helps. Turns on signalling NaNs. Any FP op generates an error now. Finds a lot of zeroes in bad memory accesses that didn’t trigger errors. Trying to not use valgrind, but that would work also.
RF: Switch in GCC that does something similar to valgrind. Puts in guards around arrays. MW: Don’t know the explicit option, using -Wall, turns it on for me. GCC9.0 is very aggressive at finding issues in a way that 5/6/7 were not.
AH: Same compiler on raijin and gadi, see if gadi only issue. RF: Not sure if it was the same version of 2019 I was using. AG: One overlapping compiler 2019.3. RF: Recently recompiled MOM-SIS build. Will look and see if it is the same. AH: Useful data point if same issue is gadi specific.

Update on BGC

AH: Andy Hogg has asked for an update. People at Melbourne would like to use it. RF: On my desk with Hakase. Been promising. Will prioritise. Almost there for a while. Been distracted with gadi. On to-do list.
MC: Do we know who in Melbourne wants to use it? AH: A student, not sure who.

New projects to support COSIMA and ACCESS-OM2 on gadi

AH: /g/data/ik11 is where inputs that were on /short/public will now live. Not sure exactly how this will be organised. Will most likely have input and output directories. Might be some pre-published COSIMA datasets there. Part of a publishing pipeline. AK: Moving data from scratch to this as a holding area? AH: People were using datasets from hh5 that had no status, not sure how to reference them.
AK: Control directories are separate, and not well connected to the data on hh5. Nice to have ways to link things more firmly. AH: To-do for payu is have experiment tracking IDs. Generate UUIDs as unique identifiers for experiments. Will go in metadata file. Not linked to git hash. If they don’t exist, make new ones. AK: Have data on hh5 and the control directories have been moved or deleted. Lose the git history of the runs that were used to generate the output. AH: Nothing to stop that all being in the same directory. Nic has advocated this for some time. Could change the way we do things. AK: Not sure on solution, but flagging as an issue.
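The "generate a UUID if one doesn't exist" behaviour could look roughly like this. It is a sketch only: the JSON file format, the `metadata.json` name and the `experiment_uuid` key are assumptions, since the minutes only say the ID will go in a metadata file.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path

def ensure_experiment_id(metadata_path):
    """Return the experiment ID stored in metadata_path, creating one if absent."""
    path = Path(metadata_path)
    metadata = json.loads(path.read_text()) if path.exists() else {}
    if "experiment_uuid" not in metadata:  # hypothetical key name
        metadata["experiment_uuid"] = str(uuid.uuid4())
        path.write_text(json.dumps(metadata, indent=2))
    return metadata["experiment_uuid"]

# Demonstration in a temporary directory: the second call reuses the first ID.
with tempfile.TemporaryDirectory() as tmp:
    meta_file = os.path.join(tmp, "metadata.json")
    first = ensure_experiment_id(meta_file)
    second = ensure_experiment_id(meta_file)
    print(first == second)
```

Because the ID is random rather than derived from a git hash, it survives the control directory being moved or re-cloned, provided the metadata file travels with the output.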
AH: Published dataset from the COSIMA paper is almost ready. New location for COSIMA published data will be cj50. To do this publishing have created a python/xarray tool to create published dataset from raw model data. Splits data into separate files for each variable, a year per file in most cases. Needs a specific naming convention for THREDDS publishing. Using xarray it doesn’t matter what the temporal range of each model output file is. Uses pandas style resampling to generate outputs. In theory simple, in practice there are many many exceptions and specific tweaks to be standards compliant. Same tool can handle MOM and CICE outputs, which are different models, and radically different file metadata and layout. If you have something you might find it useful for, it is called splitvar. Also made a tool called addmeta for adding metadata. Do the metadata modification as a separate step as it is always fiddly. Uses yaml formatted files to define metadata. The metadata for the COSIMA data publishing is available.
PL: Published data is netCDF format with all the correct metadata? AH: MOM doesn’t put much metadata in the files. To make this better connection between runs and outputs is to insert the experiment tracking id mentioned above into the files. Would be nice to put that into a namelist so that MOM could put it in the file. Best option, and if anyone knows how would like to know. Another option is a post-processing step, on all the tiled outputs. MOM isn’t the only model we run. Not all output netCDF. Would be nice if there was a consistent way for payu to do this. COSIMA published data should be up before the end of the year.
PL: Will ik11 replace hh5 and v45? AH: hh5 is storage space that is part of an ARC LIEF grant from the Australian climate community. The COE CMS team was tasked with managing this, and people could ask for temporary storage allocations. In practice it is harder to get people to remove their data. COSIMA was one of the first to ask for an allocation, but it has somewhat outgrown the original intent of hh5, as it has been there for a long time and grown quite large. hh5 might still be used for some model outputs. Not sure. ik11 started because we needed somewhere to put common model inputs/exes because /short/public went away and /scratch/public is ephemeral. /scratch space is difficult to utilise because of the ephemeral nature. NH: Have some experience with /scratch space on Pawsey. Once you lose data you make sure you have a better system to make sure your data is backed up. Possibly a good thing. AH: Doesn’t suit the workflow people currently use, where they come back and run some more of a model after a break. Suits workflows that create large amounts of data and then do a massive reduction and only save the reduced dataset. Maybe suits ensemble guys. With our models we want to keep everything we create. NH: Doesn’t all the model output go to scratch? AH: Yes, but model output doesn’t get reduced, so end up having to mirror the data.

Technical Working Group Meeting, September 2019


Date: 11th September, 2019

  • Aidan Heerdegen (AH) CLEX ANU, Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Nic Hannah (NH), Double Precision


AK: JRA55 v1.4 splits runoff into liquid and solid. Most elegant way to support? Have a flag in accessom2 namelist to enable combining these runoffs. NH: Is it a problem in terms of physics? Have to melt it? AK: Had previously ignored this anyway, so ok to continue. NH: Backward compatibility!
AK: Some interest in multiplicative scaling and additive perturbations to allow for model perturbation runs. NH: Look at existing code. Might not be too hard. AK: Test framework for libaccessom2? NH: When did scaling it took longer to write the test than make the code change. All there, could use as an example. Worth it to run tests, don’t want to get it wrong. AK: Not familiar with pytest. NH: In this case just copying scaling test, modify, and get pytest to run just that test. Once got just that test running and passing you’re done.
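As a rough illustration of the scaling-plus-offset idea and the kind of small pytest-style test NH describes, here is a sketch. It is not libaccessom2 code: the function name, the argument names and the choice to scale before adding the offset are all assumptions.

```python
def perturb(values, scale=1.0, offset=0.0):
    """Apply a multiplicative scaling, then an additive perturbation."""
    return [scale * v + offset for v in values]

def test_perturb_scale_and_offset():
    # 2.0 * 1.0 + 0.5 and 2.0 * 2.0 + 0.5
    assert perturb([1.0, 2.0], scale=2.0, offset=0.5) == [2.5, 4.5]

def test_perturb_defaults_are_identity():
    # With the defaults the forcing field is passed through unchanged.
    assert perturb([1.0, 2.0]) == [1.0, 2.0]
```

If this were saved as a hypothetical `test_perturb.py`, a single test could be selected with a pytest node ID such as `pytest test_perturb.py::test_perturb_scale_and_offset`, which matches the "run just that test" workflow.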
AK: New JRA55 now in Input4MIPS. Used JRA v1.3 from that directory and didn’t reproduce. AH: Correct. Didn’t work out why it wasn’t reproducing. AK: Ingesting the wrong files? Should be identical. AH: Never figured out what was wrong. Didn’t match checksums from historical runs. Next step was to regenerate those checksums to make sure the historical ones were correct. Could have been ok, but didn’t get that far.
AH: JRA55-do is now on the automatic download list, should be kept up to date by NCI. If it isn’t let us know.
NH: Liquid and frozen runoff backwards compat, but what about future? AK: Some desire to perturb solid and liquid separately, and/or distribute solid runoff. NH: Can we just put it somewhere and allow model to deal with it. AK: In terms of distributing it, not sure. Some people are waiting on this for CMIP6 OMIP run. Leave open for the future. NH: MOM5 doesn’t have icebergs? AK: No. Depoorter et al. has written a paper for meltwater distribution. Maybe use a map to distribute. RF: What they use for ACCESS-CM2. Read in from a file.
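One way to read "use a map to distribute" is to spread a solid runoff total over cells in proportion to a weight map, so the total is conserved. The sketch below assumes that interpretation; the function name and the numbers are hypothetical and not taken from ACCESS-CM2 or any published meltwater map.

```python
def distribute_runoff(total, weights):
    """Spread a total flux over cells in proportion to non-negative weights."""
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return [total * w / weight_sum for w in weights]

# Hypothetical numbers: 8 units of solid runoff spread over four cells.
spread = distribute_runoff(8.0, [1.0, 1.0, 2.0, 0.0])
print(spread)
print(sum(spread))
```

Normalising by the sum of the weights means the map only sets the spatial pattern; the amount of runoff entering the ocean is unchanged.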
AK: Naming convention for JRA55 v1.4 has year+1 fields. Put in a PR some time ago. AH: Problem with operator in token? NH: Should be fine as long as within quotes. AK: Just a string search shouldn’t make a difference.
AK: Can’t get libaccessom2 to compile and link to correct netcdf library. Ben Menadue tried and worked ok for him. Problem with findnetCDF plugin for CMake. Not properly supported on NCI. Edited the CMake file to remove this, could find netCDF, but used different versions for include than linking. Should move to a newer version of netCDF. v4.7.1 has just been released. Have requested this be installed on NCI. NH: Does supported include CMake infrastructure around library? If getting findnetCDF working was NCI responsibility that would be great. Difficult getting system library stuff working properly with CMake. CMake isn’t well supported in HPC environments. AK: Ben suggested adding logic to check and not use on NCI. NH: Definitely upgrade, to 4.7 if they install it.
AH: Didn’t Ben Menadue login as AK and it ran ok? AK: No, he didn’t do that as far as I know. AH: Definitely check there is nothing in .bashrc. Also worth checking if there is a csh login file that is sourced by the csh build scripts.

OpenMPI testing

RY: OpenMPI 2,3,4 and Intel 2019. Consistent results across all OpenMPI versions. 1, 0.25 and 0.1. Some differences compared with Intel 2017, not from MPI library. Not sure if difference is acceptable or not? Would like some help to check differences.
Just looking at access-om2.out differences. Maybe need to look at output file like RF: Need to compile with strict floating point precision to get repro results. MOM is pretty good. Don’t know about CICE. Can’t use standard compilation options. fp-precise at a minimum.
RY: If this difference is not acceptable need to use flags to check difference between 2017 and 2019? RF: Once get a bit change, chaos and get divergence. RY: Intel 2017 still on new system. AH: So not only newest versions of modules on gadi? RY: 2017 will be there, but no system software built with it. AH: Done a lot of testing. Should be possible to just use 1 degree as a test to get 2017 and 2019 to agree. There are repro build targets in some of those build files. Could try and find them. RY: Yes please.
AK: Any difference in performance? RY: No big difference. NH: New machine? RY: No, old machine, with broadwell.
RY: NCI recently sent out gadi update and blog and webpage. 48 cores/node. NH: Did we think it was 64 cores/node? AH: Still 150K cores in gadi, with 30K of broadwell+skylake. Maybe have to change some decompositions. RY: Not the same as any existing processors.
AH: Two week overlap with gadi, then short will be read only on gadi. RF: There was panic in ACCESS due to an email that said short would disappear in mid October. AH: Easy to misread those dates.

accessom2 release strategy

AK: Harmonising accessom2 configurations. Somewhat haphazard release strategy, but not tested. Maybe master branch that is known good, and have a dev branch people can try if they want? Any thoughts?
NH: Good way is really time consuming and labor intensive. Would mean testing every new configuration. Not sure if we can do that. Tried to keep master of parent repo only references master of all the control experiments. Not sure if necessary or desirable? Maybe makes more sense to develop freely on own experiment and keep everything in control stable? Not sure. If all control experiments are stable and working, can be a bit slow to update. Just update your experiment.
AK: Some people are cloning directly from experiment repos, some cloning all of access-om2. Would reduce confusion if control directories under accessom2 are kept up to date with latest known good version. NH: Does make sense I guess. Shame for people to clone something that is broken which has already been fixed. There is some python code in utils directory which can update everything. Builds everything at all resolutions, copies to public space, updates all exes in config.yaml and does something with input directories. AK: I ended up writing up something like that myself.
AH: Should split out control dirs from access-om2 repo. Is a support burden to keep them synched. Not all users need entire repository, as using precompiled binaries. Tends to confuse people. NH: Did need a way for config to reference source code and vice versa. AH: Required to “publish” code? Maybe worth looking into. NH: Ideally from the experiment directories need to know what code you’re using. Probably got that covered. In config.yaml do reference the code and it’s in the executable as well. When run executable it prints out the hash from the source code. Enough to link them?
AH: I recall NH wanted to flip it around and have the source code part of the experiment. NH: Probably too confusing for users. AH: True, but a useful idea to help refine a goal and best way to achieve it.
AH: A dev branch is a good idea. Then you have the idea that this is the version that will replace the current master. Can then possibly entrain others into the testing. Users who want updates can test stuff, you can make a PR and detail testing that has been done.
NH: Good idea. Some documentation that says experiments have stable and dev. When people are aware and have a problem, wonder if they can go to dev, see if it fixes. AK: Bug fixes should go into master ASAP. Feature development is not so urgent. A bit gray, as sometimes people need a feature but they can work off dev. AH: Now have some process for this: hot fixes that go straight in. Other branches are dev/feature branches. Maybe always accumulate changes into dev. Any organisation helps.
NH: Re: Removing experiment repositories: namelists depend on source code. AK: Covered by executables defined in config.yaml. NH: Yes ok.


RF: Did it work? It’s got a lot of merges. RF: Just two lines. Did a merge and pushed it to my branches on GitHub. AH: I’ll merge it in. Just wanted to check. AH: Can always make a new master branch that tracks the origin, check that out and pull in code from other branches. RF: Have a lot of other branches. AH: Can get very confusing.

payu restart issue

AH: Issue has resurfaced. I commented on #193, but didn’t look into the source of the problem. Should look into it rather than talk about it here.

FMS subrepo

AH: Still not done the testing on this. Been sick. Will try and get back to it.

Tenth update

AK: Andy has done 50 years with RYF 90/91. Running stably. AH: What timestep? RF: Think he was using 600s. AK: 3 months / submit. Should ask for longer wall time limit. RF: Depends on how queues will be on new machine, what limits and what performance. AH: Talking about high temporal res output. AK: Putting out 3D daily prognostic fields. Want it for particle tracking. Including vertical velocity. Slowed it down a little bit. RF: More slowdown through ice. AK: No daily outputs from CICE.


NH: Still in progress. AK: Also requires newer version of netCDF? NH: Requires specific version of netCDF. Needs parallel version. Not a parallel build for every version. AK: Has parallel for 4.6.1. RF: Bug in HDF5 library which it is linked to. Documented in PIO. Probably a bug we’re not going to trip. Doing a collective write, and some of the processors not taking part/writing no data. Fixed next version of HDF5 1.10.4? AH: Not a netCDF version so much as the HDF library it links to. RF: Yes. AH: So should make sure we ask for a version of netCDF that doesn’t have this bug? AK: Add to request.
RY: If want parallel version, use OpenMPI 3 or 4? AH: Good question! RY: All dependencies will be available and very easy to use. AH: This using spack? RY: Above spack and other stuff. Automatic builds with all possible combinations. AH: Using it for your builds? RY: We are requested to test and are now using. Difficult to create new versions currently. In transition difficult, but in new system should be fixed quite easily. AH: Should fix the various versions of OpenMPI with different compilers. RY: Yes. AH: Will have a compiler/OpenMPI toolchain? RY: Will automatically use correct MPI and compiler. AH: Any documentation? RY: Some preliminary, but not released. When gadi is up all this should be available.
AK: Should I ask for a specific version of MPI? RY: If don’t specify, will be built with 3 or 4. Do you have a preference? AK: No, just want the version with performance and stability we need. Do we need to use the same MPI version across all components? RY: Not necessarily. Good time to try OpenMPI3. No performance benefit as system hardware is still old hardware.

Technical Working Group Meeting, August 2019


Date: 14th August, 2019

  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) RSES ANU, Andrew Kiss (AK)  COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Marshall Ward (MW) GFDL
  • Nic Hannah (NH), Double Precision
  • James Munroe (JM), COSIMA

PIO work with CICE

NH: PIO code in CICE not as complete or thorough as netCDF code. Nothing to suggest it won’t work. Relies on NCAR PIO library, and a CESM utility library. Dependencies which are not part of CICE. Built PIO dependency on raijin, ran into CESM dependency. Can either remove dependency or remove code.
NH: Initially thought to use the MOM approach. Tile and collate. Russ’ comments encouraged to try PIO. Will be supported in future and will be supported in CICE6. Nothing working, but will soon test with 1 degree.
RF: Real bottleneck with high freq output. Worth a go. Attempt to put this into FMS by Hartnett. AH: Different to parallel netCDF? NH: PIO is wrapper around parallel netcdf. Written by NCAR to simplify parallel netcdf. Another layer. On GitHub, continuing to be maintained. RY: Wrapper that does work to match computing to IO domain. Not so useful for MOM5 as it has io_layout already.
MW: Hartnett motivated by FE3 (forecast model) rather than ocean. Not sure what project even involved in.
NH: Big test is handling interesting CICE layout, difference between cartesian grid and PE layout. MW: PIO will support explicit decomposition and other approaches.
NH: Parallel netCDF version on raijin only links with OpenMPI3.0. RY: New machine launched soon. OpenMPI 1.* will be dropped. No new software depending on 1. MW: OpenMPI 2 is not good. Should use 3.
NH: Probably have to test this with OpenMPI 3.0. RY: 3.1.3. Switch everything to that. Good test for new machine. AH: Working now? RY: My fault. Used unmatched OpenMPI library. Everything looks fine. OpenMPI 2/3/4 with Intel 19. All working. 1 deg & 0.25 deg working. Tenth not working. MW: I was able to run tenth with 3.1.2/3.1.3.
MW: One of the intel compilers broke MOM. A compiler bug with types in types.
AH: Should start an issue for testing. RY: Will email MW directly. RY: Not a MOM bug.
MW: Tried MOM-SIS tenth? Good test. RY: From earlier this year do have this working. This is testing for new machine, so ACCESS-OM2.

OMIP date restart protocol

RF: Talked to Griffies. GFDL take ensemble approach. Run for N years using true dates. At finish reset back to start date with correct calendar. Storing new stuff in different directory. End up with 5 sequences of 55 years. All dates are correct. No issues with leap years going wrong. Think this is the best way to go.
AK: Came to conclusion that this was right way to go, mostly due to leap year issue. Problem is, can we get the model to do that, but Maurice and Ryan had issues. Issue with CICE getting the correct date. CICE has a flag “use_restart_dates”. Suggested set this to false, and set the dates in access_restart.nml, but CICE is not picking up dates. Looks like libaccessom2 is not passing them on to CICE. Some confusion about exactly what they have done. Some instructions on Wiki for restarting, from restarting IAF from RYF at tenth, but doesn’t work for other people. NH: I’ll look at it. AK: Will send issue. NH: Didn’t realise it was happening. CICE date handling is not great.
AH: Downside with ensemble, difficult to get metrics across the whole time series. RF: Need extra meta-data added in. Maybe which cycle you’re in. An extra variable which gives the actual number of days since the start of the run. Done with post-processing. Might be able to concatenate files using extra meta-data. AH: Always have issues with missing leap years if it spans a century. But only daily is an issue. AK: Cookbook do something. MC: Pretend it is no leap? JM: Data looking at as time series? AH: Extra metadata, say offset day is a good idea. RF: Add buffer in netCDF file so don’t need copies. mppnccombine can add padding. Usually done with nccreate, make sure the header has some space. hbuf?
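The "days since the start of the run" variable RF and AH describe could be computed in post-processing roughly as below. This is a minimal sketch only: the function name, the zero-based cycle index and the default cycle start year of 1958 are assumptions made for illustration. Only the 55-year cycle length comes from the discussion above.

```python
from datetime import date


def days_since_run_start(cycle, current, cycle_start_year=1958, cycle_years=55):
    """Return elapsed days since the start of the whole multi-cycle run.

    cycle   -- zero-based index of the forcing cycle (assumed convention)
    current -- true calendar date within that cycle
    The default start year and the argument names are illustrative only.
    """
    start = date(cycle_start_year, 1, 1)
    end = date(cycle_start_year + cycle_years, 1, 1)
    if not (start <= current < end):
        raise ValueError("date lies outside the forcing cycle")
    # Every cycle uses the same true calendar, so each has the same length.
    days_per_cycle = (end - start).days
    return cycle * days_per_cycle + (current - start).days
```

Because every cycle keeps the true calendar, the offset is a whole number of identical cycle lengths plus the position within the current cycle, so leap days are counted correctly and files from different cycles could be concatenated on this axis.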

Strategy for CICE updates for flexibly adding fields

RF: Way CICE drivers work, variables you want are either hard coded, or muck around with pre-processing to compile them in and out. Wondering if anyone looked at doing it on the fly. Using error codes coming back when setting up variables, so have flexible number of variables passed in and out. Would like this to pass total wind speed, to harmonise code. Also Hakase wants it for some BGC stuff. Phytoplankton through to the ice. So specify the variables, work out if they’re there or not.
NH: Would want the exe to handle configuration with different sets of coupling fields. Sometimes include total wind speed, sometimes not. RF: would know complete set, if not there skip it. Currently have to be hard wired in, or make another driver. NH: Way to do it, start with superset in namcouple, and code would exclude certain variables. RF: Maybe if variable not in namcouple, return an error code, but ignore error. NH: Shouldn’t be too hard to do. NH: OASIS does return error codes that could be used. Either abort or return error code. If aborting could change that. AH: Restart fields? NH: Should do behind the scenes.
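The "start with a superset, skip what is not configured" logic discussed above can be sketched as follows. This is Python standing in for what would be Fortran in the CICE driver, and it is not the OASIS API: define_field is a hypothetical callable that returns 0 on success and a non-zero error code when a field is not in namcouple, and the field names are made up.

```python
def define_coupling_fields(superset, define_field, required=()):
    """Try to define every field in superset and return those that succeeded.

    define_field(name) is a hypothetical stand-in for the coupler call.
    It returns 0 on success, or a non-zero error code when the field is
    not configured (for example, absent from namcouple).
    """
    active = []
    for name in superset:
        err = define_field(name)
        if err == 0:
            active.append(name)
        elif name in required:
            # A missing mandatory field is still a fatal configuration error.
            raise RuntimeError(
                f"required coupling field {name!r} missing (error {err})"
            )
        # Otherwise the field is optional: ignore the error and skip it.
    return active


# Illustrative use with a fake coupler that only knows two fields.
configured = {"swflx", "wind_speed"}


def fake_define(name):
    return 0 if name in configured else 1


active_fields = define_coupling_fields(
    ["swflx", "wind_speed", "phyto"], fake_define, required=["swflx"]
)
```

The driver would then loop over active_fields only when sending and receiving, so one executable could serve configurations with and without, say, total wind speed.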

Paths for JRA55-do forcing files. Some changes to support v1.4

AH: JRA55-do not part of Input4MIPs, part of CMIP6. Have to use the copy that is CMIP6. Encodes all the metadata in filename, consequently doesn’t currently work with YATM. Circumvented by creating symbolic links that worked with YATM. When I did this couldn’t reproduce. Not sure if this is actually an issue with the fields being different or not.
AH: Tried to use testing framework NH developed for this using jenkins. The historical test that tests against known checksums doesn’t seem to actually compare them. Not sure if that is intentional. Would like to use framework, as NH has done a great job with it.
MW: MOM6 has diag_mediator, supports CMOR name alongside internal model name. Porting to MOM5 is a big task, but idea is good and saved them a lot of work. Could create a thin wrapper to translate to CMOR name if that helps. AK: How integrate with YATM? MW: Don’t know. At FMS level, so only help with 1 model (MOM). AK: YATM accesses the JRA files. So libaccessom2 change. AH: Looked at YATM code. Generates filename from date. Input4MIPs has current year and next year, so would require code changes. Might just be easier to create a file with date->filename mapping? AH: Possible to do. Would need to add a token for year+1. Probably best to do it that way.
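The year+1 token idea might look something like the sketch below. This is not YATM code or its actual token syntax: the token spellings {year} and {year+1}, the function name and the example filename are all assumptions for illustration.

```python
from datetime import date


def forcing_filename(template, when):
    """Expand a filename template for the given date.

    Supported tokens (assumed spellings, not YATM's real syntax):
      {year}    four-digit year of the date
      {year+1}  four-digit year following the date
    """
    return (
        template
        .replace("{year+1}", f"{when.year + 1:04d}")
        .replace("{year}", f"{when.year:04d}")
    )


# Hypothetical template: the file for a year spans into the next year.
name = forcing_filename("field_{year}0101-{year+1}0101.nc", date(1990, 6, 1))
```

Keeping the expansion in a template string means a new naming scheme would need only a configuration change rather than further code changes.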
AK: Also need code changes with v1.4. Solid and liquid runoff are separate. What to do with solid runoff? Griffies either use iceberg model, or melt them and add them to runoff. Take account of latent heat of fusion? Assuming solid runoff is at zero, which could be a problem. Put in a request to download v1.4. Scripts they have should automatically download it, but have not. MW: Think GFDL only has v1.3.
MW: Fields go to end of 2017, is 2018 downloaded? Looking in wrong place? Looking in ua8. AK: Should look in qv56. AK: qv56 up to feb 2018. AH: If not automatically downloading, we should ask. What does the OMIP protocol say about end date? AK: JRA55 can find out about 2018. RF: It is specified, but would like latest for ongoing runs.

Testing FMS merge

AH: Putting FMS in as a sub-repo. Just needs testing. If it reproduces checksums for a month we’re sure it is ok? Is that sufficient?
NH: When Marshall upgraded FMS, went through every MOM test. Including 0.25. Can’t recall how strict we were. AH: Testing framework still there? NH: It is there. Because it never gets used, might be rotted a bit. Can give Jenkins URL of PR and it would do it. We should work together to get that working.
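A reproducibility check of the kind AH asks about reduces to comparing the checksums from a new run against a stored reference and failing on any difference. The sketch below assumes the checksums have already been parsed into dictionaries keyed by field name. It is not the existing Jenkins framework, and the function name is hypothetical.

```python
def compare_checksums(expected, actual):
    """Return a list of human-readable mismatches between two checksum maps.

    expected and actual map a field name to its checksum string.
    An empty list means the run reproduced the reference.
    """
    problems = []
    for name in sorted(set(expected) | set(actual)):
        if name not in actual:
            problems.append(f"{name}: missing from new run")
        elif name not in expected:
            problems.append(f"{name}: not in reference")
        elif expected[name] != actual[name]:
            problems.append(f"{name}: {expected[name]} != {actual[name]}")
    return problems
```

A test would then assert that the returned list is empty, which makes a comparison that is silently skipped show up as an explicit failure.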

New NCI HPC hardware announcement

RY: System by end of the year. 2 phases, install new machine with Cascade Lake nodes. Short period gadi and raijin run simultaneously. After that Skylake and Broadwell will be merged with new machine and SandyBridge nodes removed. 100 GPU installed. 16 Skylake k-80 nodes. PBS pro again. Storage and network infiniband. 200GB/s transfer speed. OS is CentOS 8. AH: Trying to figure out total core count for new machine. Do you know what core count will be? RY: Not clear on exact number. Can check with system guys if they know the exact number. If 32 cores/node, 150+K processors. AH: Will runtimes be extended for new machine? Find 5 hours too low for high core count jobs. Reduces flexibility. RY: Queue time limits are per project. Quite flexible. Contact NCI help. AH: Have asked for time limit changes in past, but usually time limited. RY: Have been asked by other users, not sure about the policy. Good time to ask and get a better policy for the new machine.