Technical Working Group Meeting, February 2021


Date: 17th February, 2021
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Paul Leopardi (PL), Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision


AH: Been working with Mark Cheeseman from DownUnder GeoSolutions (DUG)
MW: GFDL collaborating with Seattle group Vulcan. Mark Cheeseman was on staff. Now with DUG. NH: Worked at NCI and then Pawsey, saw him there when he gave a talk. Went to US project Vulcan. Then contacted me; now contacting us from DUG. Interested in REASP startup, now on ACCESS-OM2 project.
NH: Curious about Vulcan, what is it? MW: Vulcan is a project to wrap fortran core with python. Similar to f2py. Controlled by python and a domain specific language to put arrays on GPUs. Not my style. GFDL atmosphere group like it. Not sure if they are using it in production. Lucas Harris (head of atmosphere) spruiks it. NH: Vulcan were advertising a job in their team (short term). MW: Vulcan was a Paul Allen start-up. Jeremy McGibbon is a good guy to talk to.
MW: MITgcm was rewritten in Julia. PL: Domain specific languages becoming more popular. NH: Fortran is having a revival as well. MW: Talked to Ondřej Čertík who made LFortran. Wants to connect with the flint project. Is doing good stuff. Vulcan is similar to PSyclone (LFRIC). Similar spirit. PL: PSyclone is one of the enabling technologies. LFRIC is the application of it to weather modelling. MW: Will be UM15 before LFRIC.
PL: UM model has a huge amount that isn’t just the dynamical core that all has to be integrated. Has to be more scientifically valid and performant than current. MW: Everybody is trying to work around compiler/hardware. Seems backwards. Need compilers to do this. Need better libraries. malloc should be gpu aware. Maybe co-arrays should be gpu aware. Seems silly to write programs to write programs.
AH: Agree that compiler should do the work. Was dismayed when talked to Intel guy at training session when he said compiler writers didn’t know or care overly about fortran. Don’t expect them to optimise language specific features well. If you want performance write it in the code, don’t expect compiler to do it for you.
MW: If can write MPI, surely can write a compiler to work with GPUs. Can figure out language/library level.
AH: DUG contacted us and wanted to move into weather and climate. Met them at the NCI Leadership conference. Stu Midgley told me they had a top-25 compute capability, larger than raijin, though gadi is now bigger. Biggest centre is in Texas. Immersed all their compute in oil tanks for cooling. Reckons their compute cycles are half the cost of NCI.
NH: Selling HPC compute cycles: HPC in the cloud. Also recently listed on the stock exchange. AH: Floated at $1.50, now down to $1.00 a share. NH: Don't understand how they can make money. What is so special about HPC cloud? Competing with AWS and NCI and governments. MW: Trying to pull researchers? NH: Getting research codes working. MW: NCI is free, how do you beat that? PL: Free depending on who you are.
AH: Liaised with Mark. Pointed him at repos, got him access to v45 to see how it runs on gadi. He has duplicated the setup on their machines, not using payu. Only running 1 degree so far.
AH: Had a meeting with them and Andy Hogg. Showed us their analysis software. Talked about their business model. Made the point we use a lot of cycles, but produce a lot of data that needs to be accessed and analysed over a long time period. Seemed a difficult sell to us for this reason. They view their analysis software as a "free" add-on to get people to buy cycles. Developed their own blocked-binary storage format similar to zarr. Can utilise multiple compression filters. Will need to make this sort of thing public for serious uptake IMO. Researchers will not want to be locked into a proprietary format. Andy pointed out there were some researchers who can't get the NCI cycles they need. DUG know of Aus researchers who currently go overseas for compute cycles. Also targeting those who don't have enough in-house expertise: they will run models for clients. NH: Will run models, not just compute? AH: They'll do whatever you want. Quite customer focussed. MW: Running a model for you is a valuable service. AH: Would military guys who run BlueLink model be interested in those services RF? RF: Want to keep secure and in house. AH: Big scale and looking for new markets.

PIO update

NH: Task was to improve IO performance of CICE and MOM. Wanted async IO using PIO in CICE. Numbers showed could improve speed 1-2% if not waiting on CICE IO. RF sent out a presentation on PIO, active on GitHub, moving quickly. Software all there to do this. 6 months ago no support for fortran async IO. First problem: OASIS uses MPI_COMM_WORLD to do everything. With IO tasks can't use MPI_COMM_WORLD. New version, OASIS-mct v4, makes it easier not to use MPI_COMM_WORLD. Upgraded OASIS. Has new checks in coupling. Historically doing some weird stuff in start-up. New OASIS version didn't like it. Ocean and Ice both write out coupling restarts at the end of the run. Payu copies both over but they are read in by just Ice, then sent back to Ocean. Rather than each model writing and reading their own restarts you get this weird swap thing. Get unbalanced comms. OASIS has a check for this. Could have disabled the check, but decided to change coupling as you would expect. Don't know why it was done this way. RF: Did it this way due to timing. Needs to store an average of the previous step, to do with multiple time steps. Also might have changed the order in the way things were done with the coupling, and how time steps were staggered. From a long time ago. NH: Might also be with dynamic atmosphere, need different coupling strategy. Made change, checked answers are identical. Now we're using OASIS-mct 4.0. Now integrating the IO server.
NH: Open to advice/ideas. Start off with all MPI processes (world), gets split to compute group and IO groups. New compute group/world fed to OASIS and split into models. payu will see a new “model” which is the IO server. Maybe 10 lines of fortran, calls into PIO library and sits and waits.
NH: Other possibility, could split world into 3 models, then split CICE and IO into separate models. Doesn’t work, have to tell everyone else what the new world is. Async IO has to be the first thing that happens to create separate group. OASIS upgrade was smooth and changed no results. Should have async IO server soon. Could also use it for MOM. Wonder if it will be worthwhile. MW: Be nice. PL: Testing on gadi? MW: Using NCAR PIO? NH: Yes. AH: NCAR PIO guys very responsive and good to work with.
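The world split NH describes (all MPI processes split into a compute group, which OASIS then divides among the models, plus an IO group running the PIO server) can be sketched as pure rank bookkeeping. This is an illustrative sketch only, not the actual ACCESS-OM2 code; `split_world` is a hypothetical helper standing in for the `MPI_Comm_split` logic:

```python
# Illustrative sketch: partition a world of MPI ranks into a compute group
# (handed to OASIS, then split into the individual models) and an IO group
# (ranks that sit in the PIO server loop waiting for data).

def split_world(world_size, n_io):
    """Return (compute_ranks, io_ranks); IO ranks bunched at the end,
    mirroring the 'separate executable launched by payu' layout."""
    if n_io >= world_size:
        raise ValueError("need at least one compute rank")
    compute = list(range(world_size - n_io))
    io = list(range(world_size - n_io, world_size))
    return compute, io

compute, io = split_world(world_size=16, n_io=2)
# ranks 0..13 form the new compute "world" fed to OASIS; 14-15 are the IO server
```

As NH notes, this split has to be the very first thing that happens, because every later split (into models) must start from the reduced compute world rather than the original world.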
MW: They have been good custodians of netCDF.
PL: Also work with UM atmosphere with CM2? NH: No reason not to. Seems fairly easy to use starting from a netCDF style interface. Library calls look identical. With MOM might be more difficult. netCDF calls buried in FMS. If UM has obvious IO interface using netCDF, should be obvious how to do it.
PL: Wondering about upgrading from OASIS3-mct, might baulk at coupling. NH: Should be ok. Might have problem I saw, not sure if it happens with climate model. Worst case scenario you comment out the check. Just trying to catch ugly user behaviour, not an error condition. AH: Surprised you can’t turn off with a flag. NH: New check they had forgotten to include previously.
AH: UM already has IO server? MW: In the UM, so have to launch a whole UM instance which is an IO server. Much prefer the way NH has done it.
NH: PIO does an INIT and never returns. Must be IO server. MW: Prefer that model with IO outside rest of model code. NH: Designed with idea multiple submodels would use same IO server. MW: When Dale did IO server in UM, launch say 16 ranks, 15 ran model, 16th waiting for data. Always seemed like a waste. Does your IO server have its own MPI PE? Perfect if it was just a threaded process that doesn't have a dedicated PE. RF: Locked in. Tell it how many you want and where they are. Maybe every 48th processor on the whole machine or CICE nodes. Doesn't have to be one on every node. Can send it an array designating IO servers, split etc handled automatically. Also a flag to indicate whether a PE will do IO. NH: I have set it up as a separate executable launched by payu. All IO bunched together. Other option is to do what RF said, part of the CICE executable, starts but never computes CICE code. More flexible? RF: Think so. Better way to do it I think. NH: If we were to go to MOM, would want 1 IO server per x number of PEs rather than bunching at the end.
AH: Keep local, reducing communications. Sounds difficult to optimise. PL: PE placement script? NH: Could do it all at the library level. Messed around with aggregators and strides and made no difference. Tried an aggregator every node, and then a few over the whole job which was the fastest. AH: Didn't alter lustre striping? RY: Striping depends on how many nodes you use. AH: Something that was a many-dimensional optimisation space as Rui showed; throwing in this makes it very difficult. Would need to know the breakdown of IO wait time. NH: IO server the best, non-blocking IO sends and you're done. Doesn't matter where the IO server PEs are. Unless waiting on IO server. PL: Waiting on send buffers? NH: Why? Work, send IO, go back to work, and buffers should be empty when you come back to send more. MW: IO only one-directional fire and forget. Should be able to schedule IO jobs and get back to work. As long as IO work doesn't interfere with computing. PL: Situation where output held up, and send buffers not clean, and wind up with gap in output. MW: Possible. Hope to drain buffers. NH: If you ever wait on the IO server something is broken in your system. MW: Always thought optimal would be IO process is 49th process on a 48-core node, OS squeezing in IO when there is free time. RY: If bind IO to core always doing IO. MW: Imagine not-bound, but maybe not performant. RY: Normally IO server bound to single node but performance depends on number of nodes when writing to a single file. Can easily saturate a single node's IO bandwidth. Better to distribute IO processes around different nodes to utilise multiple nodes' bandwidth. NH: I think the library makes that ok to do. RY: User has to provide mapping file. NH: Have to tell it which ranks do what. RY: Issues with moving between heterogeneous nodes on gadi. Core number per node will change. We can provide a mapping for acceptable performance. Need lots of benchmarking.
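The two IO-rank placements discussed (RF's "every 48th processor", i.e. one per node, versus NH's current bunched-at-the-end separate executable) amount to two different mapping arrays. A minimal sketch, with illustrative names only (the real mapping is supplied to PIO as a rank list / flag array by the user):

```python
# Hypothetical sketch of the two placements for IO-server ranks.

def io_ranks_strided(world_size, stride=48):
    """One IO rank at the start of each block of `stride` ranks,
    e.g. one per 48-core node, spreading IO over many nodes' bandwidth."""
    return [r for r in range(world_size) if r % stride == 0]

def io_ranks_bunched(world_size, n_io):
    """All IO ranks together at the end of the rank list
    (the current separate-executable layout)."""
    return list(range(world_size - n_io, world_size))
```

Per RY's point, the strided layout avoids saturating a single node's IO bandwidth, at the cost of a mapping that must be recomputed when core counts per node change across heterogeneous gadi nodes.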
AH: If IO server works well, it will simplify timings to understand what the model is doing? How much time spent waiting? NH: Looking like MOM spending a lot of time on IO. AK: Daily 3D fields became too slow. So turned them off. AH: So IO is a bottleneck for that.
NH: Any thoughts on PIO in MOM? RY: When I did parallel IO in MOM there already exists a compute and IO domain in MOM. Don't need to use PIO. Native library supports it directly. IO server in OASIS? Depends how MOM couples to OASIS. Can set up PIO as a separate layer and forget the IO domain, create mapping to PIO and use IO PEs to do these things. Easier in MOM if you pick up the IO domain. MW tried to improve FMS domain settings to pick up IO domain as separate MPI world. NH: Still not going to work with OASIS? RY: Similar to UM. Doesn't talk to external ranks. MW: Fixed to grid. PL: OASIS does an MPI_Isend wrapped so it waits for the buffer before it does the MPI_Isend, to make sure the buffer is always ready. Sort of blocking non-blocking. Not sure if IO would have to work the same way? NH: Assume so, would have to check the request is completed, otherwise overwrite buffers.


MW: FMS2 rewrite nearing completion. Pretty severe. Can’t run older models with FMS2. Bob has done a massive rewrite of MOM6. Made an infra layer. Now an abstraction of the interface to FMS from MOM6, can run either FMS or FMS2. MOM6 will continue to have a life with legacy FMS. MOM6 will support any framework you choose, FMS, FMS2, ESMF, our own fork of FMS with PIO. That is now a safe boundary configurable at the MOM6 level. Could be achieved in MOM5, or if migrate MOM6 have this flexibility available. Not sure if it helps, but useful to know about.
MW: MOM6 is a rolling release. Militantly against versioning. Now in CESM, not sure if it is production currently. It is their active model for development.

ACCESS-OM2 tenth

AK: Not currently running. Finished 3rd IAF cycle a month ago. All done with new 12K configuration. Only real instability was a bad node. Hardware fix.
AK: Talk of doing another cycle with BGC at 0.1 deg. Not discussed how to configure. AH: Would need all the BGC fields? AK: Assume so.
AH: A while ago there were parameter and code updates. Are those all included? AK: Yes, using the latest versions. AH: Was the run restarted? AK: Yes. AH: Metrics ok? AMOC collapsing? AK: Not been looked at in detail. Only looked at sea-ice and that has been fine, but mostly a result of atmospheric forcing. AH: Almost have an OMIP submission? AK: Haven't saved the correct outputs. AH: Not sure if need to save for whole run, or just final one.


AH: Continuing with this. Going slowly, not huge resources available at Aspendale. Currently trying to harmonise CICE code bases by getting their version of CICE into git and doing a PR against our version of CICE so we can see what needs updating since the forking of that CICE version.
PL: I am doing performance testing on CM2, how do I keep up to date? AH: I will invite you to the sporadic meetings.

Technical Working Group Meeting, November 2020


Date: 11th November, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Paul Leopardi (PL), Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale
  • James Munroe (JM) Sweetwater Consultants
  • Marshall Ward (MW) GFDL


Updated ACCESS-OM2-01 configurations

NH: Testing for practical/usable 0.1 configs. Can expand configs as there is more available resource on gadi and CICE not such a bottleneck. When IO is on, scalability has been a problem, particularly with CICE. IO is no longer a bottleneck with CICE, but CICE still has a scalability problem. All I/O was routed through the root processor with a lot of gathers etc. So now have a chance to have larger practically useful configs.
NH: Doesn’t look like we can scale up to tens of thousands of cores.
NH: Coupling diagram. YATM doesn't receive any fields from ICE or OCEAN model. Just a sync message. Can just send as much as it wants. Seemed a little dangerous, maybe stressing MPI buffers, or not intended usage. YATM always tries to keep ahead of the rest of the model, and sends whenever it can. Does hang on a sync message. Will prepare coupling fields; once done will do a non-blocking sync. Should always have a send waiting for CICE to receive. When ice and ocean model couple on every time step, both send and then both receive. Eliminates time between sending and waiting to receive. CICE should never wait on YATM. CICE and MOM need to run at roughly the same pace. MOM should wait less as it consumes more resources. Internally CICE and MOM processors should be balanced. Previously CICE I/O has been a big problem. Usually caused other CICE CPUs to wait, and MOM CPUs.
NH: Reworked coupling strategy originally from CSIRO to do as best as we can.
NH: Four new configs: small, medium, large and x-large. New configs look better than existing because old config did daily ice output, which it was never configured to do. Comparing new configs to worst existing config. All include daily ice and ocean output. Small basically the same as previous config. All tests are warm starts from April, from a restart from AK's run. Lots of ice. Three changes from existing in all configs: PIO, a small coupling change where both models send simultaneously, a small change to initialisation in CICE. Medium config (10K) is very good. Efficiency very similar to existing/small. Large not scaling too well. Hitting limit on CICE scaling. Probably limit of efficiency. X-Large hasn't been able to be tested. Wanted to move projects, run larger job. AH: X-large likely to be less efficient than large. NH: Unlikely to use it. Probably won't use Large either.
AH: MOM waiting time double from med -> large. NH: MOM scaling well. CICE doesn’t scale well. Tried a few different processor counts, doesn’t help much. Problem getting right number of CICE cpus. Doesn’t speed up if you add more. AK: Played around with block numbers? NH: Keep blocks around 6 or 7. As high as 10 no down side. Beyond that, not much to do.
PL: Couldn’t see much change with number of blocks. Keeping same seemed better than varying. NH: Did you try small number like 3-4. PL: No. RF: Will change by season. Need enough blocks scattered over globe so you never blow out. PL: I was only testing one time of the year. NH: Difficult to test properly without long tests.
NH: PIO is great. Medium is usable. Large and X-Large probably too wasteful. Want to compare to PL and MW results. Might be some disagreement with medium config. From PL plots shouldn’t have scaled this well. PL: Maybe re-run my tests with new executables/configs. NH: Good idea. Will send those configs. Will run X-Large anyway.
NH: Had to fix ocean and ice restarts after changing layout. Didn’t expect it would be required but got a lot of crashes. Interpolate right on to land and then mask land out again. In ice case, making sure no ice on land. PL: I had to create restart for every single grid configuration.
NH: Not clear why this is necessary. Thought we’d fixed that, at least with ocean. Maybe with PIO introduced something weird. Would be good to understand. Put 2 scripts in topography_tools repo
RF: Will you push changes for initialisation? Would like to test them with our 3 day runs. NH: Will do.
RF: Working on i2o restarts. Code writes out a variable, defines the next variable, chance of doing a copy each time. About 10-12 copies of the files in and out. NH: OASIS? RF: No coupled_netcdf thing. Rewritten and will test today. Simple change to define variables in advance. NH: Still doing a serial write? RF: Yes. Could probably be done with PIO as well. Will just build on what is there at the moment.
AH: Large not as efficient as Medium or Small, but more efficient than current. NH: Due to daily output and no PIO. AH: Yes, but still an improvement on current. AH: How does this translate to model years per submit? AK: Small is 3 months in just over 2.5 hours. Should be able to do 1 year/submit in large. AH: Might be a reason to use large. Useful for example if someone wanted to do 3-daily outputs for particle tracking for RYF runs. Medium is 6 months/submit? Or maybe a year? AK: Possibly. Not so important when queue wait was longer. NH: I calculate medium would take 5.2 hours/year. Could give medium another thousand processors on the ocean. Might be smarter. AK: Target wall time.
AK: About to start 3rd IAF cycle. Andy keen to try medium. Good enough for production? NH: No real changes. AK: Do you have a branch I could look at? NH: Maybe test a small bump on medium to get it under 5 hours. AK: Hold off until config that gets under 5 hours? NH: Spend the rest of the day and get it under 5 hours. AK: Next few days, updates to diagnostics.
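The throughput numbers above can be checked with simple arithmetic. A back-of-envelope sketch (illustrative only): small at 3 months in ~2.5 hours works out to ~10 hours per model year, while NH's 5.2 hours/year estimate for medium just misses a 1-year submit under the 5-hour target, hence the "small bump":

```python
# Convert walltime-per-submit figures into hours per model year.
def hours_per_model_year(months_per_submit, hours_per_submit):
    return hours_per_submit * 12.0 / months_per_submit

small_rate = hours_per_model_year(3, 2.5)   # small: ~10 hours per model year
medium_rate = 5.2                           # NH's estimate for medium
year_fits_in_5h = medium_rate < 5.0         # False: needs the bump NH mentions
```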
PL: Any testing to see if it changes climate output? NH: None. Don’t know how to do it. MOM should be reproducible. Assume CICE isn’t. AH: Run, check gross metrics like ice volume. Not trying to reproduce anything. Don’t know if there is an issue if it is due to this or not. Cross that bridge when we have to.
AH: Any code changes as well to add to core/cpu bumps? RF: Just a few minutes at the beginning and end.
MW: Shouldn’t change bit repro? NH: Never verified CICE gives same answers with different layouts. Assuming not. MW: Sorry, thought you meant moving from serial to PIO. AH: AK does done testing with that. AK: With PIO only issue is restart files within landmarked processors. MW: Does CICE do a lot of collectives? Only real source of issues with repro. PL: Fix restart code might perturb things? NH: Possibly? AH: Can’t statistically test enough on tenth. PL: Martin Dix sent me an article about how to test climate using ensembles.
AH: When trying to identify what doesn't scale? Already know? EVP solver. NH: Just looked at the simplest numbers. Might be other numbers we could pull out to look at it. Maybe other work already done by PL and MW. If goal is to improve scalability of CICE can go down that rabbit hole. AH: Medium is efficient; a better use of effort to concentrate on finessing that. Good result too. NH: Year in a single submit with good efficiency is a good outcome. As long as it can get through the queue, and had no trouble with testing. AH: Throughput might be an issue, but shouldn't be with a much larger machine now.
AK: Less stable at 10K? Run into more MPI problems with larger runs. Current (small) config is pretty reliable. Resub script works well, falls over 5-10% of the time. Falls over at the start, gets resubmitted. Dies at the beginning so no zombie jobs that run to wall time limit. NH: Should figure out if it's our problem or someone else's? In the libraries? AK: Don't know. Runs almost every time when restarted. Tends to be MPI stuff, sometimes lower down in the interconnect. Do tend to come in batches. MW: There were times when libraries would lose connections to other ranks. Those used to happen with lower counts on other machines. Depends on the quality of the library. MPI is a thin layer above garbage that may not work as you expect at very high core counts.
PL: What happens when the job restarts? Different hardware with different properties? AK: Not sure if it is hardware. MW: mxm used to be unstable. JM: As we get to high core count jobs, we may not be able to trust that every node will be ok for the whole run. Should we get used to this? In the cloud this is expected. In future will have to cope with this? That one day we cannot guarantee an MPI rank will ever finish?
MW: Did see a presentation with a heap of OpenMPI devs; they acknowledge that MPI is not well suited to that problem. Not designed for pinging and rebooting. Actions in implementation and standard to address that scenario. Not just EC2. Not great. Once an MPI job starts to fray it falls to pieces. Will take time for library devs to address the problem, also whether MPI is the appropriate back end for this IPC scenario. PL: 6-7 yrs ago ANU had a project with Fujitsu on exascale and what happens when hardware goes away. Not MPI as such. Mostly mathematics of sparse grids. Not sure what happened to it. NH: With cloud you expect it to happen, and know why. We might expect it, but don't know why. If we know why, if it is dropping nodes we can cope. If just a transient error, due to corrupted memory due to bad code in our model, want to know and fix. MW: MPI doesn't provide an API to investigate it. MPI traces on different machines (e.g. Cray v gadi) look completely different. Answering that question is implementation specific. Not sure what extra memory is being used, which ports, and no way to query. MPI may not be capable of doing what we're asking for, unless they expand the API and implementors commit to it.
AH: Jobs as currently constituted don't fit the cloud computing model, as jobs are not independent, they rely on results from neighbouring cells. Not sure what the answer is?
MW: Comes up a lot for us. NOAA director is private sector. Loves cloud compute. Asks why don't we run on cloud? Any information for a formal response could be useful longer term. Same discussions at your places? JM: Problem is in infrastructure. Need an MPI with built-in redundancy to the model. Scheduler can keep track and resubmit. Same thing happens in TCP networking. Can imagine a calculation layer which is MPI-like. MW: We need infiniband. Can't rely on TCP, useful stock answer. JM: Went to DMAC meeting, advocating regional models in the cloud. Work on 96 cores. Spin up a long tail of regional models. Decent interconnect on small core counts. Maybe single racks within data centres. MW: Looking at small scale MOM6 jobs. Would help all of us if we keep this in mind, what we do and do not need for big machines. Fault tolerance not something I've thought about. JM: NH's point about the difference between faulty code and the machine being bad is important.
AH: Regarding the idea of having an MPI scheduler that can cope with nodes disappearing. Massive problem to save model state and transfer it around. A python pickle for massive amounts of data. NH: That is what a restart is. MW: Currently do it the dumbest way possible with files. JM: Is there a point where individual cores fail, MPI mesh breaks down. Destroys whole calculation. Anticipating the future. NH: Scale comes down to library level. A lot of value in being able to run these models on the cloud. Able to run on machines that are interruptible. Generally cheaper than defined duration. Simplest way is to restart more often, maybe every 10 days. Never lose much when you get interrupted. Lots we could do with what we have. AH: Interesting, even if they aren't routinely used for that. Makes the models more flexible, usable, shareable.
MW: Looking at paper PL sent, are AWS and cori comparable for moderate size jobs? PL: Looking at a specific MPI operation. JM: Assuming machine stays up for the whole job. Not really testing redundancy.

Testing PIO enabled FMS in MOM5

AH: Last meeting talked to MW about his previous update of FMS to ulm. MW put code directly into shared dir in MOM5 source tree. Made and merged PR where shared dir is now a subtree. Found exact commit MW had used from FMS repo, and recreated the changes he subsequently made so old configs still work: adding back some namelist variables that don’t do anything, but mean old configs still work. FMS code is identical, but now can swap out FMS more easily with subtree. There is a fork of FMS in the mom-ocean GitHub organisation. The master branch is the same as the code that is in the MOM5 subtree. Have written a README to explain the workflow for updating this. If  we want to update FMS, create a feature branch, create a PR, create a matching branch on MOM5, use subtree to update FMS and make a PR on the MOM5 repo. This will then do all the compilation tests necessary for MOM5. Both PRs are done concurrently. Bit clunky, but a lot easier than otherwise, and changes we make are tied to the FMS codebase. Wasn’t the case when they were bare code changes directly included in MOM5.
AH: Can now look at testing a PIO enabled FMS branch based on work of RY and MW. Is there an appetite for that? MW and RY changes are on a branch on FMS branch. Based on a much more recent version of FMS? MW: Pretty old, whatever it is based on. 1400 commits behind. AH: Current is 2015, PIO branch is based on 2018. MW: Must be compatible with MOM5 because it was tested. Not a big issue if it is ulm or not.
AH: So should we do that? I could make a branch with this updated PIO-enabled FMS. NH: Great idea. Uses what we've already done. AH: RY and MW have done it all, just have to include it. Is compression of outputs handled by the library? RY: Testing parallel compression. Code does not need changing. Just need to change deflate level flag in FMS. Parallel IO can pick that up. No need for code changes. Not compatible with current HDF version 1.10. If you specify chunk layout it will crash. Tested HDF 1.12. Works much better than 1.10. Performance looks pretty good. Compression ratio 60%. Sequential IO 2.06TB, using compression 840GB. Performance is better than sequential IO. Currently available HDF library not compatible. Will present results at next meeting. Not rushing too much, need stable results. When netCDF supports parallel compression, always wanted to test that, and see if there are code changes. Current HDF library only compatible with certain chunk layouts. AH: Certainly need stability for production runs.
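A quick sanity check on the quoted numbers: 2.06 TB sequential versus 840 GB compressed is roughly a 60% size reduction, consistent with the "compression ratio 60%" figure (binary units assumed for the TB-to-GB conversion):

```python
# Check the quoted ~60% compression from 2.06 TB down to 840 GB.
uncompressed_gb = 2.06 * 1024                        # 2.06 TB in GB
compressed_gb = 840.0
reduction = 1.0 - compressed_gb / uncompressed_gb    # fraction saved, ~0.60
```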
NH: Had similar issues. Got p-netcdf and PIO working. PIO is faster, no idea why. Also had trouble with crashes, mostly in MPI libraries and not netCDF libraries. Used MPI options to avoid problematic code. PIO and compression turned on, very sensitive to chunk size. Could only get it working with chunks the same as block size. Wasn't good, blocks too small. Now got it working with MPI options which avoid bad parts of code. RY: ompi is much more stable. NH: Also had to manually set the number of aggregators. RY: Yes, HDF v1.12 is much more stable. Should try that. Parallel IO works fine, so not an MPI issue, so definitely comes from the HDF5 library. So much better moving to HDF 1.12. MW: Is PIO doing some of the tuning RY has done manually? NH: Needs more investigation, but possibly being more intelligent about gathers before writes. MW: RY did a lot of tuning about how much to gather before writes. NH: Lots of config options. Didn't change much. Initially expected parallel netCDF and PIO to be the same, and surprised PIO was so much better. Asked on GitHub why that might be, but got no conclusive answer.
AH: So RY, hold off and wait for changes? RY: Yes doing testing, but same code works. AH: Even though didn’t support deflate level before? RY: Existed with serial IO. PIO can pick up the deflate level. Before would ignore or crash.


MW: At prior MOM6 meeting was obvious preference to move MOM6 source code to mom-ocean organisation if that is ok with current occupants. Hadn't had a chance to mention this when NH was present. If they did that were worried about you guys getting upset. AH: We are very happy for them to do this. NH: Don't know time frame. Consensus was unanimous? AH: Definitely from my point of view makes the organisation a vibrant place. NH: I don't own or run, but think it would be cool to have all the codebase in one organisation. Saw the original MOM1 discs in Matt England's office, which spurred putting all the versions on GitHub. So would be awesome. MW: COSIMA doesn't have a stake or own this domain? AH: Not at all. I have just invited MW to be an owner, and you can take care of it. MW: Great, all settled.
PL: Action items for NH? NH: Sent PL configs, commit all code changes tested on current config and give to RF and fix up Medium config. AH: four, shared config with RF and AK, and please send me the slides you presented. RY: PIO changes all there for PL to test? NH: Yes. PL: Definitely need those to test the efficiency of the code. AK: PIO in master branch of CICE on the repo.
PL: Have an updated report that is currently being checked over by Ben. Will release when that is given the ok.
AH: Finally finished the cmake stuff for MOM5 for all the other MOM5 test configs. Will mean MOM5 can compile in 5 minutes as parallel compilation works better due to the dependency tree being correctly determined.

Technical Working Group Meeting, October 2020


Date: 13th October, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

CICE Initialisation

RF: NH found CICE initialisation broadcast timing issue. Using PIO to read in those files? NH: Just a read, regular netCDF calls. Still confused. Thought slowdown was OASIS setting up routing tables. RF pointed out docs that talked about OASIS performance and start up times. No reason ours should be > 20s. Weren’t using settings that would lead to that. Found broadcasts in CICE initialisation that took 2 minutes. Now take 20s. That is the routing table stuff. Confused why it is so bad. Big fields, surprised it was so bad. PL: Not to do with the way mpp works? At one stage a broadcast was split out to a loop with point to points. NH: Within MCT and OASIS has stuff like that. We turned it off. MW had the same issue with xgrid and FMS. Couldn’t get MOM-SIS working with that. Removed those.
MW: PL talking about FMS. RF: These are standard MPI broadcasts of 2D fields. MW: Once collectives were unreliable, so FMS didn’t use them. Now collectives exceed point to point but code never got updated. PL: Still might be slowing down initialisation?
NH: Now less than a second. Big change in start up time. Next would be to use newer version of OASIS. From docs and papers could save another 10-15s if we wanted to. Not sure it is worth the effort. Maybe lower hanging fruit in termination. RF: Yes, another 2-3 minutes for Pavel’s runs. Will need to track exactly what he is doing, how much IO, which bits stalling. Just restarts, or finalising diagnostics. AH: What did you use to track it down? NH: Print statements. Binary search. Strong hunch it was CICE. RF: Time wasn’t being tracked in CICE timers. NH: First suspected my code, CICE was the next candidate. AH: Close timer budgets? RF: A lot depend on MPI, others need to call system clock.
NH: Will push those changes to CICE so RF can grab them.
NH: Noticed another thing in CICE. Order in coupling: both models do work, ocn sends to ice, ice recvs from ocn, ice sends to ocn, ocn recvs from ice. So send and recv are paired after the work. Not the best way. Both should work, both send, and then both recv. Minute difference, but does mean ocean is not waiting for ice. AH: Might affect pathological cases.
AH: Re finalisation, no broadcasts at end too? NH: Even error messages, couple of megabytes of UCX errors. Maybe improve termination by 10% by cleaning that up. CICE using PIO for restarts. Auscom driver is not using PIO and has other restarts. ACCESS specific restarts are not using PIO. Could look at. From logs, YATM finished 3-4 minutes before everything else. AH: Just tenth? NH: Just at 0.1. Not sure a problem, just could be better.

Progress in PIO testing in CICE with ACCESS-OM2

NH: AK done most of the testing. Large parts of the globe where no PE does a write. Each PE writing its own area. Over land no write, so filled with netCDF _FillValue. When change ice layout different CPUs have different pattern of missing tiles and can read uninitialised values, same as MOM. Way to get around that is to just fill with zeroes. Could use any ice restart with any new run with different PE layout. AH: Why does computation care over land? NH: CICE doesn’t apply land mask to anything. Not like MOM which excludes calculations over land, just excluding calc over land where there is no cpu. Code isn’t careful. If change PE layout, parts which weren’t calculated are now calculated. RF: Often uses comparison to zero to mask out land. When there is a _FillValue doesn’t properly detect that it isn’t land. MW: A lot of if water statements in code. RF: Pack into 1D array with offsets. NH: Not how I thought it works. RF: Assumes certain things have been done. Doesn’t expect ever to change, running the same way all the time. Because dumped whole file from one processor, never ran into the problem. NH: Maybe ok to do what they did before: changed _FillValue to zero. RF: Nasty though, is a valid number. NH: Alternative is to give up on using restarts. RF: Process restarts to fill where necessary, and put in zeroes where necessary. NH: Same with MOM? How do we fix it with MOM? RF: Changed a while ago. Tested on _FillValue or _MissingValue, especially in thickness module. PL: Does this imply being able to restart from restart file means can’t change processor layout? Just in CICE or also MOM? RF: Will be sensitive to tile size. Distribution of tiles (round-robin etc) still have the same number of tiles, so not sensitive. AH: MOM always have to collate restarts if changing IO layout. AK: Why is having _FillValue of zero worse than just putting in zero? RF: Often codes check for _FillValue, like MissingValue. So might affect other codes.
NH: Ok, settle on post-processing to reuse CICE restarts in different layouts. AK: Sounds good. Put a note in the issue and close it. AH: Make another issue with a script to post-process restarts. NH: Use the default netCDF _FillValue.
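The post-processing step agreed above could be sketched as follows. This is a minimal illustration with numpy only (the real script would read and rewrite netCDF restart files); `fill_to_zero` and the toy array are hypothetical names, and the fill constant is the netCDF default _FillValue for doubles.

```python
import numpy as np

# Cells never written by any PE carry the netCDF default _FillValue.
# CICE's land masking compares against 0.0, so replace fill values with
# zeroes before reusing the restart under a different PE layout.
NC_DEFAULT_FILL = 9.969209968386869e+36  # netCDF default _FillValue (double)

def fill_to_zero(field, fill_value=NC_DEFAULT_FILL):
    """Return a copy of `field` with fill values replaced by 0.0."""
    out = np.array(field, dtype=float, copy=True)
    out[np.isclose(out, fill_value, rtol=1e-6)] = 0.0
    return out

# Toy restart tile where two blocks were never computed (fill values)
tile = np.array([[1.5, NC_DEFAULT_FILL, 0.2],
                 [NC_DEFAULT_FILL, 0.0, 3.1]])
clean = fill_to_zero(tile)
```

Applied over every restart field, this makes the file safe for any block/PE layout at the cost of conflating land with genuinely zero values, which is exactly the trade-off RF notes.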

Scaling performance of new Configurations for 0.1

NH: Working on small (6K), medium (10K+1.2K), large (18K) and x-large (33K+) ACCESS-OM2-01 configs. Profiling and tuning. PIO improves a lot. Small is better than before. CICE had scalability issues even before PIO, but with little IO? PL: Removed as much IO as possible. Hard-wired to remove all restart writing. Some restart writing didn’t honour the write_restart flag. Not incorporated in main code, still on a branch.
NH: Medium scales ok. Not as good as MOM, but efficiency not a lot worse. Large and x-large might not be that efficient. x-large just takes a long time to even start. 5-10 minutes. Will have enough data for a presentation soon. Still tweaking balance between models. Can slow down when decrease the number of ICE cpus *and* increase them. Can speed up current config a little by decreasing the number of CICE cpus. It is a balance between the average time that each CPU takes to complete an ice step, and the worst case time. As increase CPUs the worst case maximum increases. Mean minus worst case decreases. RF: Changing tile size? NH: Haven’t needed to. Trying to keep number of blocks/cpu around 5-10. RF: The fewer the tiles, the larger the chance they are unevenly distributed across ice/ocean. Only 3-4 tiles per processor, some may have 1, others 2. So 50% difference in load. About 6 tiles/processor is a sweet spot. NH: Haven’t created a config less than 7. From PL work, not bad to err on the side of having more. Not noticeable difference between 8 and 10, err on higher side. Marshall, did you find CICE had serious limits to scalability?
MW: Can’t recall much about CICE scaling work. Think some of the bottlenecks have been fixed since then. NH: Always wanted to compete with MOM-SIS. Seems hard to do with CICE. MW: Recall CICE does have trouble scaling up computations. EVP solver is much more complex. SIS is a much simpler model.
AH: When changing the number of CPUs and testing the sensitivity of run time, are you also changing the CPUs/block? NH: Algorithm does distribution. Tell it how many blocks overall then give it any number of CPUs and will distribute them over the CPUs. Only two things. Build system conflates them a little. Uses the number of cpus to calculate the number of blocks. Not necessary. Can just say how many blocks you want and then change how many cpus as much as you want. RF: I wrote a tool to calculate the optimal number of CPUs. Could run that quickly. NH: What is it called? RF: Under v45. Maybe masking under my account. NH: So don’t change the number of blocks, just change number of cpus in config. So can use CICE to soak up extra CPUs to use full ranks. That is why we were using 799. Should change that number with different node sizes. AH: Small changes of the number of CPUs just mean a small change to average number of blocks per CPU? Trying to understand sensitivity of run time considering such a small change. NH: Some collective ops, so more CPUs you have the slower. MW: Have timers for all the ranks. Down at subroutine level? NH: No, just a couple of key things. Ice step time.
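RF's tile-count point above (few tiles per rank means a large relative load imbalance; ~6 per rank is a sweet spot) can be sketched numerically. This is an illustrative round-robin count, not CICE's actual distribution algorithm; function names are hypothetical.

```python
# Distributing B blocks over P ranks, some ranks get ceil(B/P) blocks and
# some floor(B/P). With only a few blocks per rank, that +/-1 block is a
# large fraction of each rank's work; with ~6+ blocks per rank it shrinks.
def block_counts(n_blocks, n_cpus):
    """Blocks assigned to each rank under a simple round-robin distribution."""
    return [n_blocks // n_cpus + (1 if r < n_blocks % n_cpus else 0)
            for r in range(n_cpus)]

def imbalance(n_blocks, n_cpus):
    """Worst-case work relative to the mean (1.0 = perfectly balanced)."""
    counts = block_counts(n_blocks, n_cpus)
    return max(counts) / (n_blocks / n_cpus)

coarse = imbalance(3, 2)    # 2 vs 1 blocks: worst rank has 33% above mean
lumpy = imbalance(7, 2)     # ~3-4 blocks per rank
smooth = imbalance(13, 2)   # ~6-7 blocks per rank: much flatter
```

The worst-case rank sets the ice step time, which is why NH balances mean against worst-case when choosing CICE core counts.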
MW: Been using perf. Linux line profiler. Getting used to using it in parallel. Very powerful. Tells you exactly where you are wasting your time. AH: Learning another profiler? MW: Yes, but no documentation. score-p is good, but overhead is too erratic. Are the Allinea tools available? PL: Yes, bit restricted. MW: Small number of licenses? Why I gave up on them.

1 degree CICE ice config

AK: Re CICE core count, 1 degree config has 241 cores. Wasting 20% of cpu time on 1 degree. Currently has 24 cores for CICE. Should reduce to 23? NH: Different for 1 degree, not using round-robin. Was playing with 1 degree. Assuming it wasn’t as bad with gadi, were wasting 11. Maybe just add or subtract a stripe to improve efficiency. Will take a look. Improve the SU efficiency a lot.
RF: fortran programs in /scratch/v45/raf599/masking which will work out the processors for MOM and CICE. Also a couple of FERRET scripts which, given a sample ICE distribution, will tell you how much work is expected to be done for a round-robin distribution. Ok, so not quite valid. Will look at changing the script to support sect-robin. More of a dynamic thing for how a typical run might go. Performance changes seasonally. NH: Another thing we could do is get a more accurate calculation of work per block. Work per block is based on latitude. Higher latitude blocks are going to do more work. Can also give it a file specifying the work done. Is ice evenly distributed across latitudes? RF: Index space, and is variable. Maybe some sort of seasonal climatology, or an annual thing. AH: Seasonal would make sense. AH: Can give it a file with a time dimension? RF: Have to tell it the file, use a script. Put it in a namelist, or get it to figure out. NH: Hard to know if it is worth the effort. AH: Start with a climatology and see if it makes a difference. Ice tends to stick close to coastlines. Antarctic peninsula will have ice at higher latitudes than in Weddell Sea for example. Also in the Arctic the ice drains out at specific points. RF: Most variable southerly part is north of Japan, coming a fair way south near Sakhalin. Would have a few tiles there with low weight. NH: Hard to test, need to run a whole year, not just a 10 day run. Doing all scaling tests with warm run in June. MW: CICE run time is proportional to amount of ice in ocean. There is a 10% seasonal variation. RF: sect-robin of small tiles tries to make sure each processor has work in northern and southern hemispheres. Others divide into northern and southern tiles. SlenderX2? NH: That is what we’re using for the 1 degree.

FMS updates

AH: Was getting the MOM jenkins tests running specifically to test the FMS pull request which uses subtree so you can switch in and out FMS versions easily. Very similar to what MW had already done. Just have to move a few files out of the FMS shared directory. When I did the tests it took me a week to find MW had already found all these bugs. When MW put in ulm he reverted some changes to multithreaded reads and writes so the tests didn’t break. MW: Quasi-ulm. AH: Those changes were hard coded inside a MOM repo. MW: In the FMS code in MOM. AH: Not a separate FMS branch where they exist? MW: No. Maybe some clever git cherry-picking would work. MW: Has changed a lot since then. Unlikely MOM5 will ever change FMS again. AH: Intention was to be able to make changes in a separate fork. MW: Changes to your own FMS? Ok. Hopefully would have done it in a single commit. AH: Allowed those options but did nothing with them? MW: Don’t remember, sorry. AH: Yours and Rui’s changes are in an FMS branch? MW: Yes. Changes I did there would not be in shared dir of MOM5. Search for commits by me maybe. If you’re defining subtree wouldn’t you want the current version to be what is in shared dir right now? AH: Naively thought that was what ulm was. MW: I fixed some things, maybe bad logic, race conditions. There were problems. Not sure if I did them on FMS or MOM side. AH: Just these things not supported in namelist, not important. Will look and ask specific questions.


AH: Given Martin Dix latest 0.25 bathymetry for the ACCESS-CM2-025 configuration. Also need to generate regridding weights, will rely on latest scripts AK used for ACCESS-OM2-025. AK: Was issue with ESMF_RegridWeightGen relying on a fix from RF. Were you going to do a PR upstream, RF? RF: You can put in the PR if you want. Just say it is coming from COSIMA.
MW: Regarding hosting MOM6 code in mom-ocean GitHub repo. Who is the caretaker of that group? AH: Probably me, as I’m paid to do it. Not sure. MW: Natural place to move the source code. They’re only hesitant because not sure if it is Australian property. Also wants complete freedom to add anything there, including other dependencies like CVMix and TEOS-10. AH: I think the more the better. Makes a more vital place. MW: So if Alistair comes back to me I can say it is ok? AH: Steve is probably the one to ask. MW: Steve is the one who is advocating for it. MW: They had a model of no central point, now they have five central points. NOAA-GFDL is now the central point. Helps to distance the code from one organisation. Would be an advantage to be in a neutral space. COSIMA doesn’t have a say? AH: COSIMA has its own organisation, so no. AH: Just ask for Admin privileges. MW: You’re a reluctant caretaker? AH: NH just didn’t have time, so I stepped in and helped Steve with the website and stuff. Sounds great.

Technical Working Group Meeting, September 2020


Date: 16th September, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

ACCESS-CM2 + 0.25deg Ocean

PD: Dave Bi thinking about 0.25 ocean. Still fairly unfamiliar with MOM5. Trying to keep as harmonised as possible. Learn from 1 degree case harmonisation. PD: Doing performance plan for current financial year hence talk about that now. Asking supervisors what they want and this popped out of that conversation with Dave Bi. AH: Andy Hogg has been pushing CLEX to do the same work. PD: Maybe some of this already happening, need some extra help from CSIRO? AH: We think it is much more that CSIRO has a lot of experience with model tuning and validation. Not something researchers want to do, but want to use a validated model and produce research. So a win-win. PD: Validation scripts are something we run all the time, so yes would be a good collaboration. Won’t be any CMIP6 submission with this model. AH: Andy Hogg keen to have a meeting. PD: Agree that sounds like a good idea.
PL: How do you get the baseline parameterisation and ocean mask and all that? Grab from OM2? PD: Yes, grab from OM2. AH: Yes, but still some tuning in coupled versus forced model.

ACCESS-OM2 and MOM-SIS benchmarking on Broadwell on gadi

PL: Started this week. Not much to report. Running restart based on existing data fell over. Just recreated a new baseline. Done a couple of MOM-SIS runs. Waiting on more results. Anecdotally expecting 20% speed degradation.

Update on PIO in CICE and MOM

AK: Test run with NH exe. Tried to reproduce 3 month production run at 0.1deg. Issue with CPU masking in grid fields. Have an updated exe set running. Some performance figures under CICE issue on GitHub. Speeds things up quite considerably. AH: Not waiting on ice? AK: 75% less waiting on ice.
AH: Nick getting queue limits changed to run up to 32K.
PD: Flagship project such as this should be encouraged. Have heard 70% NCI utilisation? May be able to get more time. RY: No idea about utilisation. Walltime limitation can be adjusted. Not sure about CPU limit. AH: I believe it can. They just wanted some details. Have brought up this issue at Scheme Managers meeting. Would like to get number of cpu limits increased across the board. There is positive reaction to increasing limits, but no motivation to do so. Need to kick it up to policy/board level to get those changes. Will try and do that.
NH: Some hesitation. Consume 70-80KSU/hr. Need to be careful. PL: What is research motivation? NH: Building on previous work of PL and MW. With PIO in CICE can make practical configs with daily ice output with lots more cores. Turning Paul’s scaling work into a production config. Possible due to PL, MW work, moving to gadi and having PIO in CICE.
NH: Got 3 new configs. Existing small (5K), medium (8K), large (16K) and x-large (32K). MOM doubling. CICE doesn’t have to double. Running short runs of configs to test. PL: 16K is where I stopped. NH: Andy Hogg said good to have a document to show scalability for NCMAS. PL: All up on GitHub. NH: Will take another look. NH: Getting easier and easier to make new configs. CICE load balancing used to be tricky. Current setup seems to work well to increase cpus.
PD: What is situation with reproducibility? In 1 degree MOM run 12×8. Would be same with 8×12? More processors? NH: Possible to make MOM bit repro with different layout and processor counts, but not on by default. Can’t do with CICE, so no big advantage. PL: What if CICE doesn’t change? NH: Should be ok if keep CICE same, can change MOM layout with correct options and get repro. RF: Generally have to knock down optimisation and floating point config. Once you’re in operational mode do you care? Good to show as a test, but operationally want to get through fast as possible as long as results statistically the same. PL: Climatologically the same? RF: Yeah. PL: All the other components, becomes another ensemble member. RF: Exactly. NH: Repro is a good test. Sometimes there are bugs that show up when don’t reproduce. That is why those repro options exist in MOM. If you do a change that shouldn’t change answers, then repro test can check that change. Without repro don’t have that option.
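NH's point that reproducibility is a useful test (a change that shouldn't change answers can be checked bitwise) could be sketched as below. A minimal illustration using hashes of raw field bytes; in practice the fields would come from restart files, and "climatologically the same" runs will still fail this bitwise check.

```python
import hashlib
import numpy as np

# Bitwise repro check: hash a field's raw bytes. Two runs that "should not
# change answers" must produce identical digests; statistical equivalence
# (RF's operational mode) will generally not.
def field_hash(field):
    """Hex digest of a field's raw bytes (bitwise, not climatological)."""
    return hashlib.sha256(np.ascontiguousarray(field).tobytes()).hexdigest()

run_a = np.linspace(0.0, 1.0, 100).reshape(10, 10)
run_b = run_a.copy()         # e.g. same run with a layout change + repro flags
run_c = run_a + 1e-16        # tiny perturbation: statistically identical
same = field_hash(run_a) == field_hash(run_b)
diff = field_hash(run_a) != field_hash(run_c)
```

This is the kind of check that catches the bugs NH mentions, which only show up when a supposedly answer-neutral change fails to reproduce.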
RF: Working with Pavel Sakov, struggling with some of the configs, updating YATM to latest version. Moving on to 0.1 degree model. Hoping to run 96 member ensembles, 3 days or so, with daily output. A lot of start/stop overhead. PIO will help a lot. Maybe look at start up and tidy up time. A lot different to 6 month runs. AH: Use ncar_rescale? RF: Just standard config. Not sure if it reads in or does dynamic. AH: Worth precalculating if not doing so. Sensitivity to io_layout with start-up? RF: Data assimilation step, restarts may come back joined rather than separate. Thousands of processors trying to read the same file. AH: mppncscatter? RF: Thinking about that. Haven’t looked at timings yet, but will be some issues. AH: How long does DA step take? RF: Not sure. Right at the very start of this process. Pavel has had success with 1 degree. Impressed with quality of results from the model. Especially ice.
AH: Maybe Pavel could present to a COSIMA meeting? RF: Presented to BlueLink last week. AH: Always good to get a different view of the model.


AH: Trying to get testing framework NH setup on Jenkins running again. Wanted to use this to check FMS update hasn’t broken anything. Can then update FMS fork with Marshall and Rui’s PIO changes.
NH: A couple of months ago got most of the ACCESS-OM2 tests running. MOM5 runs were not well maintained. MOM6 were better maintained and run consistently until gadi. Can get those working again if it is a priority. Was being looked at by GFDL. AH: Might be priority as we want to transition to MOM6.
NH: Don’t have scaling results yet. Will probably be pretty similar to Paul’s numbers. Will show you next time. PL: Will update GitHub scaling info. NH: Planning to do some simple plots and tables using AK’s scripts that pull out information from runs.


AK: Got list of edits from Simon Marsland for original topography. Wanted to get feedback about what should be carried across. Pushed a lot into COSIMA topography_tools repo. Use as a submodule in other repos which create the 1 degree and 0.25 degree topographies. Document the topography with the githash of the repo which created it. Pretty much finished 0.25. Just a little hand editing required. Hoping to get test runs with old and new bathymetry.
AH: KDS50 vertical grid? QA analyses? AK: Partial cells used optimally from KDS50 grid. Source data is also improved (GEBCO) so no potholes and shelves. AH: Sounds like a nice well documented process which can be picked up and improved in the future.
AK: Way it is done could be used for other input files: have all in a git repo and embed metadata into the file linking it to the exact git hash. Good practice. Could also use manifests to hash inputs? NH: Great, talked about reproducible inputs for a long time. AH: Hashing output can track back with manifests. Ideally would put hashes in every output. There is an existing feature for unique experiment IDs in payu, but has not gone further, still think it is a good idea.
AK: Process can be applied to other inputs. AH: The more you do it, the more sense it makes to create supporting tools to make this all easier.
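The manifest idea above (hash every input so an experiment can be tracked back exactly) can be sketched in a few lines. Illustrative only: payu's real manifests use the yamanifest YAML format; this version just maps filenames to sha256 digests, and the file names are made up.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Record a content hash for every input file, so a run's inputs can be
# verified and traced back exactly (the core of the manifest idea).
def make_manifest(input_dir):
    """Map each file under input_dir to the sha256 of its contents."""
    return {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(Path(input_dir).glob('*')) if p.is_file()}

with tempfile.TemporaryDirectory() as d:
    (Path(d) / 'topog.nc').write_bytes(b'fake topography data')
    (Path(d) / 'ocean_grid.nc').write_bytes(b'fake grid data')
    manifest = make_manifest(d)
    manifest_json = json.dumps(manifest, indent=2)
```

Embedding the same hashes into model output (as AH suggests) then lets any result be traced back to the exact inputs that produced it.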

Jupyterhub on gadi

AH: What is the advantage to using jupyterhub?
JM: Configurable http proxy and jupyterhub forward ports through a single ssh tunnel. If a program crashes and is re-run, it might choose a different port but the script doesn’t know. This is a barrier. Also does persistence, basically like tmux for jupyter processes. AH: Can’t do ramping up and down using a bash script? JM: Could do, that is handled through dask-jobqueue. Bash script could use that too. JM: Long term goal would be deploying services at scale, which is difficult. AH: Pawsey and the NZ HPC mob were doing it.

Technical Working Group Meeting, August 2020


Date: 12th August, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (GFDL)

PIO with MOM (and NCAR)

MW: NCAR running global 1/8 CESM with MOM6. Struggling with IO requirements. Worked out they need to use io_layout. Interested in parallel. Our patch is not in FMS, and never will be. They understand but don’t know what to do. Can’t guarantee my patch will work with other models and in the future. Said COSIMA guys are not using it. Using mppnccombine-fast so don’t need it.  Is that working? AK: Yes.
RY: Issue is no compression. Previous PIO did not support compression and the size is huge. Now netCDF supports parallel compression, so maybe look at it again. Haven’t got time to look at it. Should be a better solution for the COSIMA group.
MW: Ideally Ed Harnett or someone else from NCAR would add PIO to FMS. They have been working on the latest FMS rewrite for more than 2 years. Haven’t finished the API update. FMS API is very high level. They have decided it is too high level to do PIO. FMS completely rewriting API. Ed stopped until the FMS update. Added PIO and used direct netCDF IO calls. Bit hard-wired but suitable for MOM-like domains. Option 1 is sit and wait, Option 2 is do their own, Option 3 is do it now and use a fork of FMS. Maybe Option 4 is mppnccombine-fast. What do you think?
AK: Outputting compressed tiles with io_layout and using fast combine. Potential issue is if io_layout makes small tiles. MW: Chunk size has to match tile size? Do tiles have to be same size? AH: Yes. Still works, but slower as it has to do a deflate/reflate step. It is fast when it can just copy compressed chunks from one file to another. Limit is only filesystem speed. Still uses much less memory if it has to do deflate/reflate. Chooses first tile size and imposes that on all tiles. If first tile is not the typical tile size for most files, could end up reflating/deflating a lot of the data. Also have to choose processor layout and io_layout carefully. For example 0.25 1920×1080 doesn’t have consistent tile sizes. MW: Trying to figure out if it is worth telling them to reach out to you guys. Sent them a link to the repo. AH: Might be a decent way to keep going until they get a better solution.
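The tile-size constraint above can be checked ahead of time: compressed-chunk copying only works when every IO tile has the same extents. A sketch below, assuming an FMS-style split where the remainder is spread one extra point per tile (the actual FMS decomposition may differ); grid and layout numbers are illustrative.

```python
# mppnccombine-fast copies compressed chunks only when every tile matches
# the chunk size taken from the first tile; a grid that doesn't divide
# evenly by the io_layout forces the slow deflate/reflate path.
def tile_sizes(n_points, n_tiles):
    """Per-tile extents along one axis (remainder spread over early tiles)."""
    base, extra = divmod(n_points, n_tiles)
    return [base + (1 if t < extra else 0) for t in range(n_tiles)]

def uniform_tiles(nx, ny, io_layout):
    """True if all IO tiles share the same (x, y) extents."""
    lx, ly = io_layout
    return (len(set(tile_sizes(nx, lx))) == 1
            and len(set(tile_sizes(ny, ly))) == 1)

even = uniform_tiles(1440, 1080, (16, 12))    # 90 x 90 tiles everywhere
uneven = uniform_tiles(1440, 1080, (16, 16))  # 1080/16 = 67.5: mixed sizes
```

Running such a check when choosing the io_layout would flag configurations that will fall off the fast path before a run starts.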
MW: Bob had strategy to force modelling services to include PIO support by getting NCAR to use PIO.
NH: Can they use PIO patch with their current version of FMS? MW: They want to get rid of the old functions. Bad idea to ditch the old API, creates a lot of problems. The parallel-IO work is on a branch.
AH: Regional output would be much better. Output one file per rank. Can aggregate with PIO? NH: One output file. Can set chunking. AH: Not doing regional outputs any more because so slow. Would give more flexibility. AK: Slow because of huge number of files. Chunks are tiny and unusable. Need to use non-fast. AH: I thought it was just that the output is slow. RF: Many processors on same node will pump output. MW: Many outputs will throttle lustre. Only have a couple of hundred disks. Will get throttled. AH: Another good reason to use for MOM. MW: Change with io_layout? RF: No, always output for themselves. MW: Wonder how patch would behave. AH: NCAR constrained to stay consistent with FMS. MOM5 is not so constrained, should just use it. NH: Should try it if the code already exists. Parallel netcdf is a game changer. AH: I have a long-standing branch to add FMS as a subtree. Should do it that way. Have our own FMS fork with the code changes. MW: Only took 3 years!
AK: Put in a deflate level as namelist parameter as it defaulted to 5. Used 1 as much faster but compression was the same.


NH: Solved all known issues. Using PIO output driver. Works well. Can set chunking, do compression, a lot faster. Ready to merge and will do soon. I don’t understand why it is so much better than what we had. I don’t understand the configuration of it very well. Documentation is not great. When they suggested changes they didn’t perform well. Don’t understand why it is working so well as it is, and would like to make it even better.
NH: Converted normal netcdf CICE output driver to use latest parallel netCDF library with compression. So 3 ways, serial, same netCDF with pnetcdf compressed output, or PIO library. netcdf way is redundant as not as fast as PIO. Don’t know why. Should be doing this with MOM as well. Couldn’t recall details of MW and RY previous work. Should think about reviving that. Makes sense for us to do that, and have code already.
MW: Performance difference is concerning. NH: Has another layer of gathers compared to the MPI-IO layer. PIO adds another layer of gathering and buffering. With messy CICE layout PIO is bringing all those bits it needs and handing it to the lower layer. Maybe a possible reason for the performance difference. RY: PIO does some mapping from compute to IO domain. Similar to io_layout in MOM. Doesn’t use all ranks to do IO. Sends more data to a single rank to IO, saves contention issues. NH: MPI-IO has aggregators? RY: In the library you can select number of aggregators. Default is 1 aggregator per node. If you use PIO to use single rank per node this matches MPI-IO. Did this in the paper where we tested this. If consistent io_layout, aggregator number and lustre striping should get good performance.
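RY's mapping point (many compute ranks funnel to one IO rank per node, matching MPI-IO's default of one aggregator per node) can be sketched as below. Purely illustrative rank arithmetic; 48 ranks per node matches a gadi Cascade Lake node, and the function names are made up.

```python
# Compute-to-IO mapping: one IO rank per node. If PIO's io_layout and
# MPI-IO's aggregator count agree on this, data is gathered once per node
# and written without contention between ranks on the same node.
def io_rank_for(rank, ranks_per_node=48):
    """IO rank handling a given compute rank: first rank of its node."""
    return (rank // ranks_per_node) * ranks_per_node

def io_ranks(n_ranks, ranks_per_node=48):
    """The set of ranks that actually perform IO (one per node)."""
    return sorted({io_rank_for(r, ranks_per_node) for r in range(n_ranks)})

writers = io_ranks(144)  # 3 nodes -> ranks 0, 48, 96 do the writing
```

Aligning lustre striping to the same writer count completes the picture RY describes.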
RY: Tried different compression levels? NH: Just using level 1. Did some testing in the serial case, not much point going higher. Current tests doing all possible outputs. RF: A lot of compression will be due to empty fields. RY: Compression performance is related to chunk size. NH: There is a performance difference with chunk size. Too big and too small are slower. Default chunk size is fastest for writing. 360×300 for 2D field. Might not be ok for reading. RY: Should consider both read and write. Write once, read many patterns. MW: Parallel reads were slower than POSIX reads. AH: What is dependence of time on chunk size? NH: Depends how many fields we output. Cutting down should be fast for larger chunk size. Is a namelist config currently. Tell it chunk dimension. RY: Did similar with MOM. AH: CICE mostly 2D, how many have layers? AH: What chunking in layers? NH: No chunking, chunk size is 1, for layers. AH: Have noticed access patterns have changed with extremes. Looking more at time series, and sensitive to time chunking. Time chunking has to be set to 1? NH: With unlimited not sure. RF: Can chunk in time with unlimited, but would be difficult as need to buffer until writing possible. With ICE data layers/categories are read at once. Usually aggregated, not used as individual layers. Makes more sense to make category chunk the max size. Still a small cache for each chunk. netCDF4 4.7.4 increased default cache size from 4M to 16M.
AH: I thought deflate level 4 or 5 was still worth it. NH: Can give it a try. Don’t really care about deflate level, just getting rid of zeroes.
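NH's "just getting rid of zeroes" point can be demonstrated with plain zlib (the same deflate algorithm netCDF uses). A toy field dominated by zeroes with a band of random "ice" values; exact sizes depend on the data, so treat the numbers as illustrative.

```python
import zlib
import numpy as np

# For fields that are mostly zero (e.g. ice-free ocean), deflate level 1
# already removes nearly all the volume; higher levels buy little because
# the incompressible part is the random data, not the zeroes.
field = np.zeros((360, 300))                                # one 2D chunk
field[:30, :] = np.random.default_rng(0).random((30, 300))  # band of "ice"
raw = field.tobytes()

size_l1 = len(zlib.compress(raw, 1))  # deflate level 1
size_l5 = len(zlib.compress(raw, 5))  # deflate level 5
ratio_l1 = len(raw) / size_l1         # already a large compression ratio
```

This is why level 1 is attractive for model output: most of the win comes from the zero regions, and higher levels mainly add CPU time.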

Masking ICE from MOM – ICE bypass

NH: Chatted with RF on slack. Mask to bypass ICE. Don’t talk to ice in certain areas. Like the idea. Don’t know how much performance improvement. RF: Not sure it would make much difference. Just communication between CICE and MOM. NH: Also get rid of all the CICE ranks that do nothing. RF: Those are hidden away because of round-robin and evenly spread. Layout with no work would make a difference. NH: What motivated me was the IO would be nicer without crazy layouts. If didn’t bother with middle, would do one block per cpu, one in north and south. Would improve comms and IO. If it were easy would try. Maybe lots of work. AK: Using halo masking so no comms with ice-free blocks.
AH: What about ice to ice comms with far flung PEs? RF: Smart enough to know if it needs to MPI or not. Physically co-located rank will not use MPI. AH: Thought it would be easy? NH: Not sure it is justified in terms of performance improvement. With IO tiny blocks were killing performance, so this was a solution to IO problem. MW: Two issues are funny comms patterns and calculations are expensive, but ice distribution is unpredictable. Don’t know which PEs will be busy. Load imbalances will be dynamic in time. Seasonal variation is order 20%. Might improve comms, but that wasn’t the problem. Stress tensor calcs are expensive, so ice regions will do a lot more work. NH: Reason to use small blocks which improves ability to balance load. MW: Alisdair struggling with icebergs. Needs dynamic load balancing. Difficult problem. RF: Small blocks are good. Min max problem. Every rank has same amount of work, not too much or too little. CICE ignores tiles without ice. CICE6 a lot of this can be done dynamically. There is dynamic allocation of memory. AH: Dynamic load balancing? RF: Who knows. Now using OpenMP. AH: Doesn’t seem to make much difference with UM. MW: Uses it a lot with IO as IO is so expensive.
AH: A major reason to pursue masking is it might make it easier when scaling up. If round-robin magically scales well that is ok, but last time there was a lot of analysis with heat maps and discussion about optimal block sizes. Conceptually it might be easier to understand how to best optimise for new config. NH: Does seem to make sense, could simplify some aspects of config. Not sure if it is justified. MW: Easy to look at comms inefficiency. Did this a lot for MOM5, and mostly it wasn’t comms. Sometimes the hardware, or a library not handling the messages well, rather than comms message composition. Bob does this a lot. Sees comms issues, fixes it and doesn’t make a big difference. Definitely worth running the numbers. NH: Andy made the point. This is an architecture thing. Can’t make changes like this unilaterally. Coupled model as well. Fundamentally different architecture could be an issue. MW: Feel like CPUs are the main issue not comms. Could afford to do comms badly. NH: comms and disks seems pretty good on gadi. Not sure we’re pushing the limits of the machine yet. Might have double or triple size of model. AH: Models are starting to diverge in terms of architecture. Coupled model will never have 20K ocean cpus any time soon. NH: Don’t care about ice or ocean performance.
AH: ESM1.5 halved runtime by doubling ocean CPUs. RF: BGC model takes more time. Was probably very tightly tuned for coupled model. 5-6 extra tracers. MW: 3 on by default, triple expensive part of model. UM is way more resources. AH: Did an entire CMIP deck run half as fast as they could have done. My point is that at some point we might not be able to keep infrastructure the same. Also if there is code that needs to move in case we need to do this in the future. NH: Code is more of an ocean calculation anyway? RF: Kind of. Presume there is a separate ice calc. Coupling code taken from gfdl/MOM and put into CICE to bypass FMS and coupler code. From GFDL coupler code. Rather than ocean_solo.f90 goes through coupler.f90. NH: If 10 or 20K cores might revisit some of these ideas. Goal to get to those core counts working, not sure about production.
MW: Still thinking about super high res, like 1/30. OceanMaps people wanted it? More concrete. RF: Some controversy with OceanMaps and BoM. Wanting to go to UM and Nemo. There is a meeting, and a presentation from CLEX. Wondering about opportunity to go to very high core counts (20K/40K). AH: Didn’t GFDL run 60K cores for their ESM model? NH: Never heard about it. Atmosphere more aggressive. RF: Config I got for CM4 was about 10K. 3000 line diag_table. AH: Performance must be IO limited? MW: Not sure. Separated from that group.

New bathymetry

AK: Russ made a script to generate the bathymetry from GEBCO from scratch. Worked on that to polish it up. Everything so far has been automatic. RF: There are always some things that need intervention for MOM that aren't so much physically realistic but are required by the model. AK: Identified some key straits. Retaining previous land masks so as not to need to redo mapping files. 0.25 needs 3 ocean points removed and 2 points added. The make-remap-weights scripts are not working on gadi, due to the ESMF install. Just installed the latest ESMF locally, 8.0.1, currently running. AH: The ESMF install for WRF doesn't work? AK: Can't find opal/MCA, an MPI error. RF: That is an MPI error.
AH: Sounds like the sort of error that used to be random, but if it's happening deterministically, not sure. AK: Might be a library version issue. AH: They have wrappers that guess the MPI library; if the major version is the same it should work.
AH: All this is scriptable and can be re-run, right? Bathymetries are intimately tied to the vertical grid, so they need to be re-run if that is changed. AK: The vision is certainly for it to be largely automated. Not quite there yet.
NH: I’ll have a quick look too. Noticed there is no module load esmf? AK: Using esmf/nuwrf. I’ll have a look at what esmf built with. AH: I want esmf installed centrally. We should get more people to ask. NH: I think it is very important. AK: Definitely need it for remapping weights. AH: Other people need it as well.

Technical Working Group Meeting, July 2020


Date: 10th June, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)
  • Marshall Ward (GFDL)

Optimisation report

PL: Have a full report, need review before release. This is an excerpt.
PL: Aims for perf tuning and options for configuration. Did a comparison with Marshall’s previous report on raijin.
Testing included MOM-SIS at 0.25 and 1 deg to get idea of MOM scalability stand-alone.
The ACCESS-OM2 at 0.1 deg. Testing with land masking, scaling MOM and CICE proportionally.
Couldn’t repeat Marshall’s exactly. ACCESS-OM2 results based on different configs. Differences:
  1. continuation run
  2. time step 540s v 400s
  3. MOM and CICE were scaled proportionally
  4. Scaling taken to 20k v 16k
MOM-SIS at 0.25 degrees on gadi 25% faster than ACCESS-OM2 on raijin at low end of CPU scaling. Twice as fast for MOM-SIS at 0.1 degrees. Scalability at high end better.
ACCESS-OM2: With 5K MOM cores, MOM is 50-100% faster than MOM on raijin. Almost twice as fast at 16K, scaled out to 20K. CICE: with 2.5K cores CICE on gadi seems 50% faster than CICE on raijin. Scales to 2.5 times as fast at 16K OM2 CPUs.
Days per CPU day: from 799/4358 CICE/MOM CPUs it does not scale well.
Tried to look at wait time as a fraction of wall time. Waiting is constant for high CICE ncpus, and decreases at high core counts with low CICE ncpus. So at higher core counts it is probably best to reduce CICE ncpus as a proportion of the total. In this case half the usual fraction.
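The scaling comparisons in this section reduce to speedup and parallel efficiency relative to a baseline run. A minimal sketch; the baseline core count echoes the 4358-CPU MOM figure above, but the walltimes here are invented purely for illustration:

```python
def speedup_and_efficiency(base_cores, base_walltime, cores, walltime):
    """Return (speedup, parallel efficiency) relative to a baseline run."""
    speedup = base_walltime / walltime
    efficiency = speedup / (cores / base_cores)
    return speedup, efficiency

# Hypothetical walltimes (hours), for illustration only.
runs = [(4358, 10.0), (8716, 5.5), (17432, 3.2)]
base_cores, base_wall = runs[0]
for cores, wall in runs:
    s, e = speedup_and_efficiency(base_cores, base_wall, cores, wall)
    print(f"{cores:6d} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

Efficiency falling well below 1 at the top end is the usual sign that adding cores is no longer paying off.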
JM: How statistically significant are the results? PL: Expensive. Only ran 3-4 runs. Spread quite low. Waiting time varied the most. Statistical significance probably can't be established due to the small sample size.
MW: Timers in libaccessom2 were better than OASIS timers, which include bootstrapping times that are impossible to remove. Also noisy IO timers. Not sure how long your runs are. Longer would be more accurate. PL: Runs are for 1 calendar month (28 days). MW: oasis_put and get are slightly magical, difficult to know what they're doing. PL: Still have the outputs, could reanalyse.
MW: Speedups seem very high. Must be a configuration thing. PL: Worried it's not a straight comparison. MW: If network time is 15-20%, it wouldn't make a difference. Always been RAM bound, which would be good if that wasn't an issue now. PL: Very meticulous documentation of configuration, and it is very reproducible. Made a shell script that pulls everything from GitHub. MW: I think your runs are better referenced. While experiments were being released, it seemed some parameters etc. were changing as I was testing. That could be the difference. Wish I had documented it better.
AH: All figures independent of ocean timestep? PL: All timesteps at 540s. AK: Production runs are 15-20% faster, but with a lot of IO. PL: Switched off IO and only output at the end of the month. It really drags it. Make sure it isn't IO bound. Probably memory bound. Didn't do any profiling that was worth presenting. MW: Got a FLOP rate? PL: Yes, but not at my fingertips. If it is around a GFLOP, probably RAM bound. PL: Now profiling ACCESS-CM2 with ARM MAP. RY is looking at a micro view, at the OpenMP and compilation flag level. RY: Gadi has 4GB/core, raijin had 2GB/core. Not sure about bandwidth. Also 24 cores/socket. Much less remote node comms. Maybe a big reduction in MPI overheads. MW: OpenMPI UCX stuff helping. RY: Lots of on-node comms. Not sure how much. MW: Believe so at high ranks. At modest resolutions comms are not a huge fraction of run time. Normal scalable configs only about 20%. PL: The way the scaling was done was different. MW scaled components separately. MW: I was using clocks that separated out wait time.
RY: If the config timestep matters, any rule for choosing a good one? AK: Longest timestep that is numerically stable. 540s is stable most of the time.
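The "longest numerically stable timestep" is usually bounded by CFL-type conditions. As a rough illustration only: the sketch below computes the external-gravity-wave limit, which is far shorter than the production 540 s step because MOM5's split-explicit barotropic solver sub-steps those fast waves separately, leaving the baroclinic step limited by slower processes. The grid spacing and depth are assumed round numbers, not taken from the configs discussed here.

```python
import math

def external_wave_dt(dx, depth, g=9.81):
    """CFL-style limit for an external gravity wave: dt < dx / sqrt(g*H)."""
    wave_speed = math.sqrt(g * depth)  # shallow-water gravity wave speed
    return dx / wave_speed

# ~0.1 degree grid spacing (~11 km) over a 5000 m deep ocean
print(f"max dt ~ {external_wave_dt(11_000.0, 5000.0):.0f} s")
```

The barotropic limit of roughly 50 s versus the 540 s baroclinic step is exactly why the barotropic mode is sub-stepped.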
MW: How have you progressed on the CICE layout stuff? Has it changed in the last year? I was using sect-robin. RF: sect-robin or round-robin. AK: You used sect-robin; production did use round-robin, but not sect-robin. Less comms overhead, not sure about load balance.
PL: Is there any value in releasing the report? NH: Would be interested in reading it. Looking to get these bigger configurations. AH: Worthwhile to document the performance at this point. RY: Anything else worth trying? AH: Why the 20K limit? PL: Believe that is a PBS queue limit. Some projects can apply for an exception. RY: For each queue there are limits. Can talk with the admins if necessary to increase them. AH: Will bring this up at the Scheme Managers meeting. They should be updated with gadi being a much bigger machine. Would give more flexibility with configurations. Scalability is very encouraging.
RF: BlueLink runs very short jobs, 3 days at a time. Quite a bit of start-up and termination time. How much does that vary across the various runs? MW: I did plot init times; it was in proportion to the number of cores. Entirely MPI_Init. It has been a point of research in OpenMPI. Double the ranks, double the boot time. RF: Also initialisation of OASIS, reading and writing of restart files. PL: That information has been collected, but hasn't been analysed. RF: For Paul Sandery's runs it is 20% of the time. MW: MPI_Init is brutal, and then the exchange grid. There are obvious improvements. Still doing alltoall when it only needs to query neighbours. Can be sped up by applying geometry. Doesn't need to preconnect; that is bad.
PL: At least one case where MPI collectives were being replaced with a loop of point-to-points. Were collectives unstable? MW: Yes, but they may also be there for reproducibility. MW: I re-wrote a lot of those with collectives, but they had no impact. At one time collectives were very unreliable. Probably don't need to be that way anymore. I doubt they would be better; that hinges on my assertion that comms are not a major chunk.
AH: MOM is RAM bound or cache-bound? MW: When doing calculations the model is waiting to load data from memory. AH: Memory bandwidth improves all the time. MW: Yes, but increase the number of cores and it’s a wash. It could have improved. AMD is doing better, but Intel know there is a problem.
AH: To wrap up: yes, we would like the full report. This is useful for NH to work up new configurations, as naive scaling is not the way to go. Also the initialisation numbers RF would like, which PL can provide.

ACCESS-OM2 release plan

AH: Are we going to change the bathymetry? Consulted with Andy who consulted with the COSIMA group. What is the upshot? AK: Ryan and Abhi want to do some MOM runs. Problems with bathy. Andy wants the run to start; if someone has time to fix it, good, otherwise keep going. Does anyone have some idea how long it would take? RF: 1 deg would be fairly quick. We know where the problems are. Shouldn't be a big job. Maybe a few days. 1 deg in a day for an experienced person. GFDL has some python tools for adjusting bathymetry on their GitHub. Point and click. Alisdair wrote it. Might be in the MOM-SIS examples repo. MW: Don't know, could ask. RF: Could be something that would make it straightforward.
AK: Will have a look.
MW: topog problems in MOM6 not usually the same as MOM5 due to isopycnal coordinate.
AK: Some specific points that need fixing? RF: I think I put some regions in a GitHub issue. AK: What level to dig into this? RF: Take pits out. Set to min depth in config. Regions which should be min depth and have a hole. Gulf of Carpentaria is trivial. Laptev should all be set to min depth. NH: I did write some python stuff called topog_tools; you can feed it a CSV of points and it will fix them. Will fill in holes automatically. Also smooths humps. May still have to look at the python and fix stuff up. Another possibility.
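The CSV-driven point-fix workflow NH describes could, in spirit, look like the toy sketch below. The grid representation, CSV format, and function name are invented for illustration and are not topog_tools' actual interface:

```python
import csv
import io

def fix_points(depth, fixes, min_depth=10.0):
    """Apply (i, j, action) fixes to a 2D depth grid (list of rows).

    action 'fill' closes a hole (sets the cell to land, depth 0);
    action 'min' sets the cell to the model's minimum depth.
    """
    for i, j, action in fixes:
        depth[j][i] = 0.0 if action == "fill" else min_depth
    return depth

# Hypothetical CSV of problem points: i, j, action
csv_text = "3,1,fill\n0,2,min\n"
fixes = [(int(i), int(j), a) for i, j, a in csv.reader(io.StringIO(csv_text))]

grid = [[100.0] * 5 for _ in range(4)]  # uniform 100 m deep toy grid
fix_points(grid, fixes)
print(grid[1][3], grid[2][0])  # 0.0 10.0
```

The real tool also smooths humps and fills enclosed holes automatically, which a point list alone can't express.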
AK: Another issue is quantisation to the vertical grid. A lot of terracing that has been inherited. RF: Different issue. Generating a new grid. 1 degree not too bad. 0.25 would still be a weighty task. MC: BGC in 0.25 found a hole off Peru that filled up with nutrients.
AH: Only thing you're waiting on? AK: Could do an interim release without topog fixes. People want to use it. master is so far behind now. Also updating the wiki, which refers to the new config. Might merge ak-dev into master, tag as 1.7, and have wiki instructions up to date with that. AH: After the bathy update what would the version be? AK: 2.0. AH: Just wondering what constitutes a change in model version. AK: Maybe one criterion is if restarts are unusable from one version to the next. Changing bathy would make these incompatible. AH: Good criterion.
AH: Final word on ice data compression? NH: Decided deflate within the model was too difficult due to bugs. Then Angus recognised the traceback for my segfault, which was great. Wasted some time not implementing it correctly. Now working correctly. Using a different IO subsystem: OMPIO rather than ROMIO. Got more segfaults. Traced and figured out. Need to be able to tell netCDF the file view: the view of a file that a rank is responsible for. MPI expects that to be set. One way to set it up is to specify the chunks to something that makes sense. Once I did that the file view was correct. Then ran into bugs in the PIO library. Seem like cut n' paste mistakes. No test for chunking. Library wrapper is wrong. Fixed that. Now working. Learnt a lot and a satisfying outcome. Significantly faster. Partly parallel, also a different layer does more optimisation. Noticed with 1 degree things are flying. Nothing definitive, but seems a good outcome. Will get some definitive numbers and a better explanation. Will have something to merge. Will have some PRs to add and some bug reports for PIO. RY: PIO just released a new version yesterday. NH: Didn't know that. Tracking issues that are relevant to me. Still sitting there. Will try the new version. RY: Happy it is working now. NH: Was getting frustrated with PIO, wondered why not use netCDF directly. For what we do it's a pretty thin wrapper around netCDF. Main advantage is the way it handles the ice mapping. Worth keeping just for that. MW: FMS has most of a PIO wrapper but not the parallel bit.
PL: Does any of that fix need to be pushed upstream? NH: Changes to the CICE code; will push those upstream to CICE. Will be a couple of changes to PIO. AK: Does it dynamically determine chunking? NH: Need to set that. It dynamically figures out tuneable parameters under the hood, such as the number of aggregators, by looking at what each rank is doing. It knows what filesystem it is on. Dependent on how it is installed. Assuming it knows it is on Lustre, it can generate optimal settings. Can explain more when I do a summary.
AK: Want to make sure output files are consistently chunked. NH: Using the chunking to set the file view. Another way is to explicitly set the file view using the MPI API. Chunks are the same size as the data each PE has. In CICE each block is a chunk. MW: These are netCDF chunks? AK: More chunks than cores? NH: Yes. Is that bad or good? This level is perfect for writing. Every rank is chunked on what it knows to do. Not too bad for reading. JM: How large a chunk? NH: In 1 degree every PE has 300 rows x 20 columns. JM: Those are small. Need bigger for reading. AK: For tenth, 36x30. Something like 9000 blocks/chunks. NH: Might be a problem for reading? RF: Yes, for analysis. JM: Fixed cost for every read operation. A lot of network chatter. AH: Is that the ice or the ocean in the tenth? Not sure. Chunk size is 36x30. A lot of that is ice free, 30% is land. MW: Ideal chunks are based on IO buffers in the filesystem. AH: Best chunking depends a lot on access patterns. JM: One chunk should be many minimal file units big. AH: When 0.25 had one file per PE it was horrendously bad for IO. Crippled Ryan's analysis scripts. If you're using sect-robin that could make the view complicated? NH: Wasn't Ryan's issue that the time dimension was also chunked? AH: He was testing mppnccombine-fast, which just copies chunks as-is, and those were set by the very small tile sizes. Similar to your problem? NH: Probably worse. Not doing MOM tiles, doing CICE blocks, which are even smaller. Same grid as MOM. RF: Fewer PEs, so blocks are half the size. MOM5 tile size and CICE block size are comparable, apart from the 1 degree model.
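To make the read-cost concern concrete: with one netCDF chunk per CICE block, a whole-field read touches one chunk per block. A back-of-envelope sketch, assuming the 0.1 degree grid is 3600x2700 points with the 36x30 blocks quoted above (which reproduces the ~9000 figure mentioned):

```python
def chunk_stats(nx, ny, cx, cy, bytes_per_value=4):
    """Number of chunks and per-chunk size for a (cx, cy) chunk shape."""
    nchunks = -(-nx // cx) * -(-ny // cy)  # ceiling division per dimension
    chunk_bytes = cx * cy * bytes_per_value
    return nchunks, chunk_bytes

nchunks, chunk_bytes = chunk_stats(3600, 2700, 36, 30)
print(nchunks, "chunks of", chunk_bytes, "bytes")  # 9000 chunks of 4320 bytes
```

Nine thousand ~4 kB reads per 2D field is the "fixed cost for every read operation" JM warns about; chunks spanning several blocks would cut that count dramatically.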
NH: Will carry on with this. Better than deflating externally, but could run into some problems. The chunks and the file view don't have to be the same. Will this be really bad for read performance? Gathering that it is. What could be done about it? Limited by what each rank can do. No reason the chunks have to be the same as the file view. Could have multiple processors contribute to a chunk. Can't do that without a collective. MW: MPI-IO does collectives under the hood. Can you configure MPI-IO to build your chunk for you? NH: Currently every rank does its own IO as it was simpler and faster. MW: Can't it all be configured at the MPI-IO layer? RY: PIO can map the compute domain to an IO domain. Previous work had one IO rank per node. The IO rank collects all data from the node. Set chunking at this level. NH: For example, our chunk size could be 48 times bigger. RY: Yes. Also best performance is a single IO rank per node. PIO does have this function to map from compute to IO domain, which is why we used it. Can also specify how many aggregators per node. First decide how many IO ranks per node, and how many aggregators per node. Those should match. Can also set the stripe count and stripe size to match the chunk size. IO ranks per node is the most important setting, as it will set the chunk size. MW: Only want the same number of writers as OSTs. RY: Many writers per node will easily saturate the network and be the bottleneck. AH: Have to go, but we definitely need to solve this, as scaling to 20K cores will kill this otherwise.

MW: Will also help RF. If you're desperate you should look at the patch RY and I did. Will help a lot once you've identified your initialisation slow down. RF: Yes, will do once I've worked out where the blockages are. Just seen some timings from Paul Sandery, but haven't looked into it deeply yet. NH: Even with a rubbish config, the model is showing performance improvements. Will continue with that, and will consider the chunk size stuff as an optimisation task. MW: Sounds like you've gone from serial write to massively parallel, so inverted the problem: from disk bound to network bound within Lustre. If you can find a sweet spot in between you should see another big speed improvement. NH: The config step is pretty easy to do with PIO. Will talk to RY about it. RY: Could have a user-settable parameter to specify IO writers per node. PL: Need to look into the Lustre stripe size? RY: Currently set to 1GB, so probably ok, but can always tune this. NH: Just getting a small taste of the huge world of IO optimisation. MW: Just interesting to be IO bound again. NH: Heavily impacted by IO performance with dailies at tenth. MW: The IO domain offset the problem. Still there, but could be dealt with in parallel with the next run, so could be sort of ignored.

AK: This is going to speed up the run. Worst case is post-processing to get the chunking sorted out. NH: Leaves us back at the point of having to do another step, which I would like to avoid. Maybe different: before it was a step to deflate; maybe rechunking was always going to be a problem.
AK: Revisiting Issue #212, so we need to change model ordering. Concerns about YATM and MOM sharing a node and affecting IO bandwidth. Tried this test; there is an assert statement that fails. libaccessom2 demands YATM is the first PE. NH: Will look at why I put the assert there. Weirdly proud there is an assert. RF: Remember this, when playing around with OASIS intercommunicators; might have been that conceptually this was the easiest way to get it to work. MW: I recall insisting on a change to the intercommunicator to get Score-P working. AK: Not sure how important this is. NH: There are other things. Maybe something to do with the runoff. The ice model needs to talk to YATM at some point. Maybe a scummy way of knowing where YATM is: for every config you then know where YATM is. These would be shortcut reasons. PL: Give it its own communicator and use that? NH: Maybe that is what we used to do. Could always go back to what we had before. RF: Just an idea, to see if it would have an impact: could give YATM its own node as a test. MW: Not sure why it is that way. Should be easy to fix. NH: OK, certain configurations are shared, like timers and coupling fields. Instead of each model having its own understanding, share this information. So models check timestep compatibility etc. Using it to share configs. Another way to do that. MW: Doesn't have to be rank zero. NH: Sure, it is just a hack. MW: libaccessom2 is elegant; can't say the same for all the components it talks to. RF: There is a hard-wired broadcast from rank zero at the end.


MW: Ever talk about MOM6? AK: Angus is getting a regional circum-Antarctic MOM6 config together. RF: Running an old version of CM4 for the decadal project. PL: Maybe a good topic for next meeting?


Technical Working Group Meeting, June 2020


Date: 10th June, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)


JM: Already have compression with netCDF4. How do they consume the data? Jupyter/dask? AK: Mostly in COSIMA; BoM and CSIRO have their own workflows. JM: Maybe archive as netCDF4? As long as writing/parallel IO/combining works, no hurry to move to direct zarr output. JM: Inodes not an issue. Done badly it can be bad. The Lustre filesystem has a block size, so there's a natural minimum size. At least as many inodes as allocatable units on the FS. If it's a problem, wrap the whole thing in uncompressed zarr. RY: Many filters. blosc is pretty good. Can use it in netCDF4, but not portable; needs to be compiled into the library. netCDF4 now supports parallel compression. HDF5 supported it a couple of years ago.
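Deflate's behaviour on model output is easy to demonstrate with the standard library: runs of identical values (large land or constant regions) collapse to almost nothing, while incompressible data barely shrinks. A small sketch using `zlib`, which implements the same deflate algorithm netCDF4/HDF5 use:

```python
import os
import zlib

sparse = bytes(400_000)      # mostly-constant field: 400 kB of zeros
noisy = os.urandom(400_000)  # incompressible data for comparison

for name, data in [("sparse", sparse), ("noisy", noisy)]:
    for level in (1, 5, 9):
        size = len(zlib.compress(data, level))
        print(f"{name} level {level}: {size} bytes")
```

The sparse field compresses to a few hundred bytes at any level, while the random data slightly grows; higher deflate levels mostly trade CPU for marginal size gains.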
AH: As we’re in science, want stable well supported software. Unlikely to use bleeding edge right now. Probably won’t output directly from model for the time being. Maybe post process.
NH: What about converting to zarr from uncollated ocean output? Why collate when zarr uncollates anyway? Also, we collate because it's difficult to use uncollated output. How easy is it to go from uncollated to zarr? JM: Should be pretty straightforward. Write each block directly to its part of the directory tree. Why is collating hard? Don't you just copy blocks to the appropriate place in the file? AH: Outputs are compressed. They need to be uncompressed and then recompressed. Scott Wales has made a fast collation tool (mppnccombine-fast) that just copies already-compressed data. There are subtleties. Your io_layout determines block size, as the netCDF library chooses chunk sizes automatically. Some of the quarter degree configs had very small tiles, which led to very small chunks and terrible IO. AK: Regional output is one tile per PE and mppnccombine can't handle the number of files in a glob. AH: Yes, that is disastrous. Not sure it was a good idea to compress all IO.
JM: Good idea to compress even for intermediate storage. Regional collation: what do we currently collate with? The original collation tool? AK: Yes, but don't have a solution. JM: Definitely need to combine to get decent chunk sizes. If interested, happy to talk about moving directly to zarr, or parallelising in some way. AK: Would want a uniform approach/format across outputs. JM: Not sure why collate runs out of files. AK: The shell can't pass that many files to it. AH: My recollection is that it is a limit in mpirun, which is why mppnccombine-fast allows quoting globs to get around this issue. Always interested in hearing about any new approaches to improving IO and processing at scale.
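The quoted-glob workaround AH describes works because pattern expansion happens inside the tool rather than in the shell, so argv never exceeds the ARG_MAX limit. A minimal sketch of the idea (not mppnccombine-fast's actual code; the file naming is invented):

```python
import glob
import os
import tempfile

def expand_patterns(patterns):
    """Expand shell-style patterns in-process, sidestepping ARG_MAX.

    Passing 'out.nc.*' quoted on the command line means the shell never
    builds a huge argv; the tool calls glob itself.
    """
    files = []
    for pat in patterns:
        files.extend(sorted(glob.glob(pat)))
    return files

# Demonstration with a few temporary tile files
with tempfile.TemporaryDirectory() as d:
    for n in range(5):
        open(os.path.join(d, f"out.nc.{n:04d}"), "w").close()
    tiles = expand_patterns([os.path.join(d, "out.nc.*")])
    print(len(tiles), "tiles")  # 5 tiles
```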

ACCESS-OM2-BGC testing and rollout

AH: Congrats on Russ getting this all working. How do we roll this out?
AK: All model components are now WOMBAT/BGC versions, but there are two MOM exes: one with BGC and one without. All in ak-dev. No standard configs that refer to BGC. Need some files. RF: About 1GB, maybe 10 files. Climatologies for forcing and 3 restart files. This will work with the standard 1 degree test cases. Slight change to a field table, and a different OASIS restart file. Not much change. Will work on that with Richard and Hakase. Haven't tested with the current version. Maintaining a 1 degree config should be ok. AK: No interest at high res? NH: Yes, but not worth supporting as yet. One degree shows how it is used.
AK: Would BGC be standard 1 degree with BGC as option, or separate config specialised for BGC? RF: Get together and give you more info. Probably stand separately as an additional config. Currently set it up as a couple of separate input directories. AH: So RF going to work up a 1 degree BGC config. RF: Yes. AH: So work up config, make sure it works, and then tell people about it. RF: The people who are mainly interested know about the progress.

ACCESS-OM2 release version plan

AH: Fixing configs, code, bathymetry. Do we need a plan? Need help?
AK: Considering merging all ak-dev configs into master. Was using constant albedo, moved to Large and Yeager as RF advised. Didn't make a big difference. How much do we want to polish? The initial condition is wrong: it is potential temperature rather than conservative. Small compared to WOA and drift. Not sure if worth fixing. Also bathymetry at 1 and 0.25 degrees. Not sure who would fix it. AH: Talk to Andy Hogg if unsure about resources? AK: A number of problems could all be fixed in one go.
AK: Not much left to do on code. PIO with CICE. Still issues?
NH: Good news. In theory compression is supported in PIO with the latest netCDF. The PIO library also has to enable those features. There was a GitHub issue that indicated they just needed to change some tests and it would be fine. Not true. There is code in their library that will not let you call deflate. Tried commenting it out, and one of the devs thought that was reasonable. Getting some segfaults at the netCDF definition phase. Want to explore until I decide it is a waste of time; will go back to offline compression if it doesn't pan out. Done the naive test of whether there is a simple work-around. Looking increasingly like it won't work easily. Could do valgrind runs etc.
AK: Bleeding edge isn't best for production runs. NH: Agreed. Will try a few more things before giving up. AK: Offline compression might be safer for production. NH: Agreed, random errors can accumulate. RY: Is the segfault in PIO? NH: Using the newly installed netCDF 4.7.4p, and the latest version of the PIO wrapper with some checks commented out. Not complicated, just calls deflate_var. RY: When you install PIO do you link to the new netCDF library? NH: Yes. Did have that problem before. AG pointed it out, and fixed that. Takes a while to become stable. RY: Do you need to specify a new flag for parallel compression when you open the file with nf_open? NH: Not doing that. Possibly PIO is doing that. RY: Maybe the PIO library has not been updated to use the new flags correctly. NH: If using netCDF directly what would you expect to see? Part of nf_create? RY: Maybe hard wired for serial. NH: Possible they've overlooked something like this. Might be worth a little bit of time to look into that. PIO does allow compression on serial output only. Will do some quick checks. Still shouldn't segfault. Been dragging on; keep thinking a solution is around the corner, but might be time to give it up. A bit unsatisfying, but need to use my time wisely. AH: It's not nothing to add a post-processing compression step and then take it out again. Not no work, and no testing, so out-of-the-box compression would be nice. Major code update you're waiting on? Need to update the fortran datetime library? Just a couple of PRs from PL and NH. Paul's looks like a bug fix. PL: Mine is an edge case? AK: Incorrect unit conversion. AH: Don't know where in the code and if it affects us. AK: We're using the version that fixes that. NH: The guy who wrote that library wrote a book called "Modern Fortran".

Testing update

AH: MOM Travis testing is now properly testing ACCESS-OM2 and ACCESS-OM2-BGC, as I have also updated libaccessom2 testing so that it creates releases that can be used in the MOM Travis testing. Lots of boring/stupid commits to get the testing working. AK: The latest version gets used regardless of what is in the ACCESS-OM2 repo? AH: Not testing the build of ACCESS-OM2, just the MOM5 bit of that build linking to the libaccessom2 library so that it can produce an executable and we know it worked successfully. We do compile OASIS, and it uses just the most recent version. Had intended to do the same thing with OASIS, make it create releases that could be used. Haven't done it, but it is a relatively fast build. AK: OASIS is now a submodule in the libaccessom2 build. AH: Yes, but we still have a dependency on OASIS. Don't have a clean dependence on libaccessom2. Might be possible to refactor, but probably not worth it. So yes, we have a dependence on libaccessom2 *and* OASIS.
AH: Previously Travis allowed the ACCESS-OM2 build to fail, and the only way to know it worked correctly was to look at the logs to determine whether everything was successful up to the linking step.
AH: Currently fixing up Jenkins automated testing. Starting on libaccessom2 testing. Hopefully won’t be too difficult. NH: Definitely need to have it working.
PL: Writing up testing/scaling tests for ACCESS-OM2.

Technical Working Group Meeting, May 2020


Date: 20th May, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants

ACCESS-OM2-01 scaling experiments

See PL’s associated scaling doc, scaling spreadsheet and python notebook
PL: Scaled MOM5 and CICE5 by the same amount. Based on 01deg_jra55v13_ryf9091. Ran an initial run to get restart output from February 1900. Restart runs for February (28 days). 540s time step. 4480 steps. No diagnostic output. Left ice as is.
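The quoted numbers are self-consistent: a 28-day month at a 540 s timestep is exactly the 4480 steps stated:

```python
seconds_per_day = 86_400
days, dt = 28, 540
steps = days * seconds_per_day // dt
print(steps)  # 4480
```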
PL: CICE scaled ncpus and ntasks proportionally. Scaled MOM from 80x75 (4358) to 160x150 (16548). Scaling looks ok looking just at the ocean timer and ice timer. Didn't have daily CICE output.
PL: Most efficient at 10K cores with total wall time. Ocean timer shows perfect scaling. CICE only timer also shows good scaling.
NH: Keen to try these configs in production.
PL: Not sure how appropriate for production, no IO. NH: Good place to start, turn on output and see how it goes. Looks well balanced. Somewhat surprised.
PL: Now trying to reproduce Marshall’s figures from the report. Scales ocean and ice separately. Yet to get reproduction runs going. Working through namelist differences. Sometimes get a silent hang. Worth scaling ocean and ice at same time.
AH: Why do both models scale so well but overall not so well when combined? RY: CICE is waiting for MOM. Maybe some more optimal setting for CPU numbers? AH: Seems odd MOM is scaling better than CICE, but CICE is waiting for MOM.
NH: CICE is waiting less for the ocean as cpus are scaled. oasis_recv is constant, which means MOM is not waiting on CICE. Definitely don't want MOM waiting for CICE. RY: If we increase MOM and reduce CICE would we get better performance? PL: Not sure. Might be useful to know how I got those numbers: using the log file, and figures are total time divided by number of steps. RF: Output from access-om2.out is just a summary; it won't show load balance within MOM. PL: Any guidance would be useful. RF: Look in access-om2.out.
AH: Look for MOM timers. Might be some information about range of values, could be some very slow PEs masked by average. RF: the check mask is out of date. Has 12-16 processors which are purely land. Changed land mask and didn’t update mask. Some processors only have values in the halo boundaries. Crashes otherwise.
PL: Regenerated new mask files. Numbers should agree with what was done. Any more advice would be welcome. Send email or talk on slack. RF: I’ll look at CICE layouts and balance, and masks. CICE Is also seasonally dependent.
PL: Moving to a more conventional experiment layout. Will move to a shared location. AH: Could put it in /g/data/v45.
AK: CICE scaling with serial IO. Nic almost finished PIO. Will stop scaling without PIO. Runs much faster with parallel even with monthly outputs. AH: Seems to be scaling ok. AK: Any output written? AH: Running for a month, so should be some output.
AH: Ran from initial conditions? PL: Yes. Ran for 1 month with timestep of 300s. Then ran from those restarts with timestep of 540s. AH: There is an ice climatology? RF: If run for a month, should have generated ice. AK: Ice generated from surface temperature in initial conditions.
RY and PL left meeting.
AH: Maybe a bit more to look at in PL's runs. NH: May have misunderstood where those numbers came from. RF: Looked like it was scaling nice and linearly. AH: Yes for each model, but together the scaling died going to 20K. RF: Not sure these results will hold when IO is turned on. Some code paths are not currently exercised without IO: putting stuff on density levels, and a whole lot of globals/collectives that aren't being done. AH: Encouraging though. NH: In principle it can scale up.

PIO compilation in ACCESS-OM2

NH: Got a reply from NCI. Resistance to having PIO in a module. Best to be self sufficient. If it turns out to be an issue can address later. Will make a submodule. Clean up the build process. Changes to CICE repo. One CICE namelist change, tell it to not explicitly use netCDF for certain things. Bit odd.
NH: Experiment repos will require updates. Maybe AK will report some more realistic performance numbers.
AH: PIO with MOM? NH: Not sure. CICE isn't doing a great deal in the configuration I am using. Seems to all work inside parallel netCDF as it's doing output from all processors. Can use IO nodes and comms, but that doesn't show a performance improvement, and looks worse in many cases. We could configure it the same way without using PIO. RF: Don't have much control over where we put processors. CICE is at the end, probably sharing with MOM. Playing with the layout might be tricky. NH: At some stage put CICE on its own nodes. RF: Once YATM is on the first node, it ends up messing things up. NH: Why are we doing that? RF: Something to do with OASIS in the old days. Now have YATM and the root PE of MOM on the same node. Would make sure all root PEs are on their own node. No contention. YATM and MOM are also on the same NUMA partition. NH: We should change that, easy fix. YATM doesn't do much at 0.1 as the rest of the model takes so long. RF: Two IO processes on the same node: the MOM root PE, used for diagnostics, and the YATM process. NH: If each model is on its own nodes, could make sure each node has a single IO processor. With PIO, if you want 1 IO rank per 16 processors, you don't know if it is talking across nodes.
JM: In terms of PIO are multiple nodes writing to the same file? NH: For CICE every single process is writing to the same file at the same time. Works well. Haven't looked into it deeply, probably the optimum is something in between. Still a big improvement over serial output. AH: Kaizen (改善): small incremental improvements all the time. Compressed netCDF output? NH: No. PIO GitHub talked about supporting compression. AH: Same as what RY and Marshall did? NH: Yes. Have to wait for a parallel netCDF implementation which supports it. Confusing because there is also p-netCDF. PIO is a wrapper. AH: Yes, wraps p-netCDF and MPI-NetCDF. p-netCDF is only netCDF3, not based on HDF5. AK: Will need a post-processing compression step. NH: Task not done until compression done. AK: Very sparse data, shame not to compress.
AH: xarray is supporting sparse data now. FYI. Can mean a lot less memory use for some data.

Compiling with/without WOMBAT

AH: Any speed/memory use implications to always have it compiled in? RF: Should be separate. Overhead basically nothing. Will only allocate BGC arrays if they're in the field table. Should be kept separate like all other BGC packages. I put in some lines in the compile scripts. Also if you want to compile without ACCESS.
AK: If we want to maintain harmony with CM2, want a non-BGC compilation? RF: Yes. AK: From the point of view of OM2 users it would be nice to be able to switch BGC on and off just through namelists. RF: Switched on via field_table. Strange design choices years ago. Also need changes in some of the restart files. AK: Not something that can be switched on and off? RF: No.

MOM Pull Requests

AH: Guidance for checking? RF: Two main changes in the code are probably fine. Maybe the ACCESS compilation scripts. Unless we want to change that, it gets compiled in all the time for ACCESS-OM. AH: Decided not, I think. RF: Made changes to specify the type of model. AH: Separate model designation with WOMBAT? RF: ACCESS-OM-BGC is a new model type. Ran tests, all ok. AH: Do we need any tests to check it hasn't changed non-BGC tests? RF: Shouldn't be anything that affects a normal run. Code compiled ok on Travis. Put in some heat diagnostics, the fluxes from CICE, might be the only thing. AH: Are Jenkins tests working? NH: ACCESS-OM2 tests haven't worked since the move to the new machine. RF: Run a 1 degree model and see how it goes. AH: I'll do that.
AK: Managing ACCESS-OM2, should the distinction between BGC and non-BGC be in the control directories? So the build script builds both and you choose which in the config, or compile once, supporting both. AH: I don't think BGC is a supported configuration yet. Needs testing. How it is implemented, shared or separate exes, is just a choice of whatever you decide is best.
AH: Turns out that GEOS PR was a mistake. Asked about it, and they closed it.

Bad bathymetry

AH: Any comments? Does it need fixing? RF: Bad bathymetry needs to be fixed, or copy bathymetry from somewhere else. Bad around Australia. Same for CM2. Mentioned it 3-4 years ago, still not updated. Some pits in Gulf of Carpentaria down to 120m in 0.25. 1 degree goes down to 80m. Should be no deeper than 60m. OCCAM created some bad bathymetry in Bass Strait, off coast of China. Russian and Alaskan issues, and White Sea. Remapping indices got mucked up. AH: Wasn’t 0.25 fixed north of Bering Strait? RF: Doesn’t look like it.
JM: Bathymetry files are wrong in certain regions? RF: Came from Southampton OCCAM model. They ran it with a normal mercator and a transverse mercator across the top. Remapping onto spherical grid indices got mucked up and got some strange bathymetry. GFDL inherited it and based a bunch of models on it. Leaked through to the ACCESS models. Was in the US forecast model and they noticed all the stuff around Alaska.
AH: Should be a relatively straightforward fix as this only affects ocean bottom cells, and doesn't touch coasts? RF: Yes. AK: Base on a coarsened tenth grid? RF: Not a big job, just a few slabs that need smoothing/removing. AH: Does this need to be fixed for the next release of OM2? RF: Yes. AK: No. RF: Get a student to look at it. AK: Also land mask inconsistencies, would be good to have all three models consistent. There are big curvy bits of coastline keeping ocean away from the tripoles. AH: The 1 degree is very much a model, that isn't that realistic. Tenth starts to look much more like real life.
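As a toy illustration of the kind of fix RF describes — capping anomalously deep cells in a shallow region such as the Gulf of Carpentaria at around 60 m — a minimal sketch, assuming depths in metres with land cells stored as 0.0 (the actual bathymetry editing would be done on the model's netCDF topography file):

```python
def cap_depth(depth, max_depth):
    """Clip sea-floor depth to max_depth within a region;
    land cells (0.0) are left untouched."""
    return [[min(d, max_depth) if d > 0 else d for d in row]
            for row in depth]

# A 120 m pit and an 80 m cell get capped at 60 m; land stays land.
fixed = cap_depth([[120.0, 0.0], [40.0, 80.0]], 60.0)
```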

Zarr file format

AH: Wanted to engage JM about zarr. RF: Interested as this is being used in the decadal prediction project. JM: Exactly. Talked today about parallelising output from the model into netCDF, and then post-analysis requires transforming to zarr. Zarr is a distributed file format that stores files in directories, each chunk is a separate file, parallelisation handled by the filesystem. Should we write directly into a zarr-like file format? There are file formats like it. netCDF5 may have a zarr-like back-end. RF: There is some discussion on the netCDF GitHub about zarr, looks like just one person. JM: Unidata is willing to move away from HDF5. Parallelisation of HDF5 has never worked the way it was supposed to. Instead of using parallel IO, just write directly to the format people want to use. AH: Got the impression netCDF people never got the buy-in from HDF5 that they thought they would get. HDF5 just do their own thing. JM: Still have people using netCDF3. AH: A strength of netCDF, they could hop back end again and keep the same interface. JM: Same data model. AH: What is the physical format of a zarr blob? JM: It is a binary blob that supports different filters/compression schemes. AH: Does it do machine-independent storage? Bad old days with swapping endianness on binary files. AG: In zarr there are raw data blobs, and associated metadata files that describe the filter/endianness etc.
JM: Inodes not a problem. Chunks are still relatively large, on the order of the Lustre striping scheme. Can wrap the whole thing inside an uncompressed zip file. Parallelises for reading just fine. Works like a tar: there is an index on where to read, and it supports multiple reads on the same file. AH: Would want to do this when archiving.
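A minimal, stdlib-only sketch of the layout JM describes — one binary file per chunk plus a small JSON metadata file in a directory. Real zarr additionally supports compression and filters and records the byte order in the metadata; the file names and metadata keys here only loosely mimic zarr's:

```python
import json
import os
from array import array

def write_store(path, values, chunk_len):
    """Write a 1-D float array zarr-style: one file per chunk, plus
    a JSON metadata file describing shape and chunking. Chunks are
    written in native byte order (real zarr records the byte order)."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, ".zarray"), "w") as f:
        json.dump({"shape": [len(values)],
                   "chunks": [chunk_len], "dtype": "f8"}, f)
    for i in range(0, len(values), chunk_len):
        with open(os.path.join(path, str(i // chunk_len)), "wb") as f:
            array("d", values[i:i + chunk_len]).tofile(f)

def read_store(path):
    """Reassemble the array by reading chunk files in index order.
    Each chunk could equally be read by a different process."""
    with open(os.path.join(path, ".zarray")) as f:
        meta = json.load(f)
    nchunks = -(-meta["shape"][0] // meta["chunks"][0])  # ceil division
    out = []
    for c in range(nchunks):
        a = array("d")
        with open(os.path.join(path, str(c)), "rb") as f:
            a.frombytes(f.read())
        out.extend(a)
    return out
```

Because each chunk is an independent file, concurrent writers never contend for a shared file lock, which is the property that lets the filesystem handle the parallelism.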
NH: Another one is TileDB, which is a file format. JM: There are other backends, n5/z5. Distributed storage for large data sets.
AH: At one stage we did wonder if collation was even necessary with tools like xarray, but never looked into it. NH: Things have changed a lot. xarray is relatively new. 3-5 years ago it might segfault on tenth model data. So much better now, so many more possibilities.


Technical Working Group Meeting, April 2020


Date: 29th April, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Apologies from Peter Dobrohotoff.

JRA55-do v1.4 support

AK: Staged rollout. NH tagged some branches, so existing master tagged 1.3.0, using old JRA55-do v1.3.1 with NH's new exes which also support 1.4.
AK: Also working on a new feature branch for 1.4. Same exes configured to use JRA 1.4 version. Seems to run ok. Not looked at output. Will look at that today. Once satisfied that is ok will move into master, tag 1.4.
AK: Also looking at ak-dev branch with a wide variety of changes. Once this is ok will tag with a new ACCESS-OM2 version. Will be new standard for new experiments. Good to make an equivalent point across repos.
AH: COSIMA cookbook hackathon showed value of project boards. Might be a good idea next time something like this attempted. AK: Tried, but it didn’t go anywhere.
NH: Two freshwater fields coming from forcing, liquid and solid. Both go into the ICE model which accepts one new forcing field. Get added together, solid magically becomes liquid without heat changes, passed straight to Ocean. Ocean and Ice models have also been changed to accept liquid part of land/ice melt and heat part of land ice melt. Exist but just pass zeroes. Extra engineering not being used as yet. A harmonisation step which takes us close to CM2 as coupled model uses these fields.
RF: With my WOMBAT updates incorporating this new code, could get rid of ACCESS-CM preprocessor directives.
NH: In the future can put work into calculating those fields correctly in the ice model. Not a huge amount of work. Will then have river runoff, land ice runoff and land ice melt heat.
NH: New executables have another change, support for different numbers of coupling fields. Land/ice coupling fields are optional. At runtime it figures out which coupling fields are used, dependent on namcouple being consistent. Coded internally as a maximum set of coupling fields. You can take coupling fields out but not add new ones. Possibly useful for others. Not a fully flexible coupling framework.
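The "maximum set" scheme NH describes can be sketched as a filter over a fixed internal list. The field names below are hypothetical, purely for illustration — they are not the actual ACCESS-OM2 coupling field names:

```python
# Hypothetical maximum set of coupling fields, in internal order.
MAX_COUPLING_FIELDS = ["swflx", "lwflx", "rain", "snow",
                       "runoff", "licefw", "liceht"]

def active_fields(namcouple_fields):
    """Return the fields active at runtime: those listed in namcouple,
    kept in internal order. Removing fields is allowed; adding a field
    outside the maximum set is an error."""
    unknown = set(namcouple_fields) - set(MAX_COUPLING_FIELDS)
    if unknown:
        raise ValueError(f"fields not in maximum set: {sorted(unknown)}")
    wanted = set(namcouple_fields)
    return [f for f in MAX_COUPLING_FIELDS if f in wanted]
```

So a config omitting the land-ice fields simply lists fewer entries in namcouple, and the same executable adapts.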
NH: Working on ak-dev branch. Harmonising namcouple files. Have a lot of configuration fields, but a lot ignored. Could use same namcouple in all configs, but in practice might leave them looking a little different. They include the timestep in them, but ignored. Could set to zero? AH: Or a flag value that is obviously ignored?
NH: Only three variables used in namcouple. Rest ignored, but must parse properly. Needs cruft to make it parse. Never liked namcouple. Completely inflexible, values must be changed in multiple places.
AK: On version Oasis3-mct2, have they improved it in new version?
NH: Can now bunch fields together, pass a single 3d field instead of many 2d fields. Should improve performance. RF: Not through namcouple at all. Just a function call.
MW: What does OASIS do now? NH: Just doing routing. Which is done by MCT anyway. Remapping done by ESMF. Coupler meant to do 3 things, config, remap and routing. Made libaccessom2 do as much as possible automatically. So OASIS does very little. Still using API, so would require effort to remove.
MW: Know about NUOPC? NCAR is using it. NH: Coupling API. If all use the same API then can go plug and play. MW: MOM6 has a NUOPC driver. NH: In the future would to look at OASIS4, but probably just chuck OASIS, use MCT to do the routing and ESMF to do remapping. MW: NCAR dropped MCT. NH: MCT is a small team. AK: Something that would suit ACCESS-CM. Any critical things that rely on OASIS? MW: At mercy of UM. Probably still use OASIS due to Europe. NH: Not using ESMF, so using OASIS a lot more than we are. Might never change because of that. AK: Even moving to v4 would require coordination with CM2. NH: Nicer and cleaner, but no clear benefit.

Updated ACCESS-OM2 model configs

AK: 3 different tags: 1.3.1; 1.4 in the works; ak-dev will be a new tag. 1.4 intended to be minimal other than the change in JRA55-do version. ak-dev making more extensive changes. Using mppnccombine-fast for tenth. Output compressed data and use fast collation. Not worthwhile for 1 deg. With 0.25 output uncompressed and use mppnccombine to do compression. Hopefully output will be a reasonable size.
AK: If outputting uncompressed, restarts might get large. Might want to collate restarts. Wanted to verify which run is collated: just finished, or the previous run? AH: It is the restarts which are not used in the next run.
AH: Because quarter degree is not compressed won't get the inconsistent chunk sizes between different sized tiles. Ryan had the problem when he had an io_layout with very small chunk sizes which made his performance very bad. mppnccombine-fast might be faster, will definitely use less memory. Still got compression overhead but memory use much reduced. AK: Not such a big issue as tenth. AH: Paul Spence had some issues with the time to collate his outputs. Maybe because they were compressed. Would recommend using it. AK: Fast version will always be faster? AH: Yes, at least no slower, but definitely uses less memory and will be much faster with compressed output.
MW: No appetite for FMS with parallel IO? AH: Compression? Without it probably won't bother. RY: Did some tests on parallel IO compression. Can't recall results. Interested to try again. Requires a bit more memory. gadi has Optane as storage or as memory. Interesting to test. Probably can use that for parallel compression or even just serial compression. Thinking about it, but haven't started. AH: Please keep us updated.
NH: Anyone have thoughts on CICE? Planning on parallel IO in CICE. Are we going to need a compression step? RF: With daily would like compression. Post processing to do compression on a smaller number of PEs would be fine. Improving IO is critical for Paul Sandery and Pavel. NH: Might need a post processing step similar to MOM. RF: Yes. Getting parallel IO is the most important. Worry about compression later. NH: Did a run yesterday with parallel IO. Completed successfully. Output was garbage. Was expecting to do heaps of work and segfaults. Surprised at that. RF: Misaligned or complete garbage? NH: Default assumption as bad as can be. Just used the parallel IO output driver in CICE. AK and RF realised daily CICE output was a bottleneck on 0.1 performance. As model code existed, decided to get it working. RY: Parallel IO needs to set up the mapping correctly between compute and IO domains. NH: Should be part of the current implementation. Mapping is a tricky part of CICE. AK: Values out of range, so maybe not just a mapping issue? NH: Completely broken, but not segfaulting. Just getting it building was one hurdle. Also had to call the right initialisation stuff within CICE. Had to rewrite some of it that was depending on another library from one of the NCAR models (CESM). CICE is used with CESM and they had a dependency on another utility library. Changed some code to remove the dependence. Relatively positive. Library under active development and well supported. AH: Did they develop it just for their use case, and maybe it doesn't support round-robin? NH: Not sure. We do know it has never been used in any model other than CESM.
MW: Ed Hartnett (PIO) eager to get into FMS. Also lead maintainer of netCDF4.

Status of WOMBAT in ACCESS-OM2

RF: Compiled. Next is testing. Up to current ACCESS-OM2 code changes. Had issues with submodules. AK: Previously libaccessom2 dependencies were brought in through CMake, now moved to submodules. If you have an existing repo you will have to initialise submodules to pull in the latest from GitHub.
RF: Made some changes to installation procedures. Can go between BGC version or standard ACCESS-OM. Want it to be different for the BGC version. Changes to install scripts and hashexe etc. AH: Good that it is up to date, could have been a messy merge otherwise. RF: Will run tests today or tomorrow.


AH: See this PR? Seemed a bit odd to me. First idea was to ask them to split the PR into science changes and config changes. RF: Looked like a lot of it was config changes. MW: Adding the GEOS5 stuff, which they shouldn't. Code changes are challenging. Introduced a generic tracer, not sure what they're doing with it. AH: Strategy? Ask them to wrap science stuff in preprocessor flags? MW: First step is to get config stuff out. Asked GFDL about it. GEOS are switching from MOM5 to MOM6. This must be associated with that effort to validate their runs. Maybe just giving back what it took to get it to work. Maybe it just makes their build process easier. AH: They have a specific requirement to use the same FMS library. Seems odd, as MOM5 and MOM6 are not likely to share FMS versions in the future. MW: Thorny topic, as it is not clear how FMS compatible MOM6 will be in the future. AH: Using FMS for less and less. MW: The PR needs to be cleaned up. AH: Also put in a CMake build system. MW: They need to explain more.
AK: Has conflicts, so can't be merged at the moment. AH: Only going to get more conflicted. Which is why I was thinking they could split it up. I have a CMake build system in another branch, but never finished it. If we can use theirs, cool. I'll engage with them.


AH: Been experimenting with graceful error recovery with payu. Can specify a script which can decide if the error is something you can just resubmit after. Mostly of interest to the production guys.
PL: Scalability testing with land masks, manifests, and payu setup. Supposed to be simpler but taking some time to get used to it. AH: Manifests are relatively new so some of the use cases have not been as well tested. MW: Are not all using manifests? AH: They are, but can be used in different ways. Tracking always works, but there are options to reproduce inputs and runs. Suggested PL could use reproduce to start a run. It was confounded by some restarts being missing, so not quite sure if it works as we would like. This is a very desirable feature, as it makes it very simple to fork off new runs from existing ones as well as making sure the files are consistent. PL: Working now. Next step is to change core counts and look for scalability numbers. AH: When I was doing scalability stuff for MOM-SIS I used input directory categories to isolate processor changes. Not quite doing that same thing anymore, but you can do something similar, but you won't want to use the reproduce flag if you are changing any of the input files.
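The manifest reproduce options discussed above live in the experiment's config.yaml; a sketch of the relevant fragment follows (keys as I recall them from the payu documentation — check the current docs before relying on them):

```yaml
# config.yaml fragment (illustrative)
manifest:
    reproduce:
        exe: true      # refuse to run if the executable hash changes
        input: true    # refuse to run if any input file hash changes
```

With both set to false (the tracking-only mode), manifests still record hashes of everything used, which is what makes forking a run from an existing one reproducible after the fact.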
AK: Just MOM scaling or CICE as well? PL: Just looking at MOM to begin with to see dependency and wait times. AK: CICE run time is critically dependent on daily outputs. Relevance of scaling data to production output. MW: Make sure your clock can tell them apart. In principle can distinguish compute from IO. AH: Daily output always part of production? AK: Ice modellers want very high temporal output. Ice is very dynamic. Even daily output is not enough to resolve some features. Maybe wait for PIO for CICE scaling tests? AH: I thought scaling tests always turned off IO? Can't properly test scaling with daily output, as it dominates runtime.
NH: Would be nice to look at performance with and without PIO. PL: Will also look at CICE. Start with the ocean model. AK: Were you (MW) running models coupled for the paper scaling numbers? MW: Coupled. Not sure what IO was set to. Subtracted it and don't recall it was large. Don't recall a bottleneck, so might have had it turned off. RF: Wouldn't be running with daily IO. Monthly IO doesn't show up. MW: Sounds likely.
AK: For IAF had a lot of daily CICE output. Not complete set of fields.
MW: Starting to run performance tests at GFDL and want to use payu. Has it changed much? Manifest stuff hasn’t made a big difference? Will have to get slurm working. Filesystem will be a nightmare. You moved PBS stuff into a component? AH: No, you did that. Not huge differences. Will be great to have slurm support.

Technical Working Group Meeting, March 2020


Date: 18th March, 2020
  • Aidan Heerdegen (AH) CLEX ANU
  • Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Scalability of ACCESS-OM2 on gadi

(Paul’s report is attached at the end)

PL: Looking at scaling. Started with ACCESS-OM2, but went to testing MOM5 directly with MOM5-SIS. Using POM25, a global 0.25 model with NYF forcing. The model MW developed for testing scaling prior to ACCESS-OM2. Had to specify min_thickness in ocean_topog_nml.

PL: Tested the scaling of 960/1920/3840/7680/15360, with no masking. Scales well up to some point between 7680 and 15360.
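One way to quantify "scales well" is parallel efficiency relative to the smallest run, i.e. measured speedup divided by ideal speedup. A trivial helper (the run times below are made up for illustration, not PL's actual numbers):

```python
def parallel_efficiency(base_cores, base_time, cores, time):
    """Measured speedup over the base run, divided by the ideal
    (linear) speedup implied by the core-count ratio. 1.0 = perfect."""
    return (base_time / time) / (cores / base_cores)

# Doubling cores and halving runtime is perfect scaling:
parallel_efficiency(960, 100.0, 1920, 50.0)   # -> 1.0
# 16x the cores but only 10x faster marks the scaling elbow:
parallel_efficiency(960, 100.0, 15360, 10.0)  # -> 0.625
```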

PL: Tested effect of vectorising options (AVX2/AVX512/AVX512-REPRO). Found no difference in runtime with 15360 cores. MW: Probably communication bound at that CPU count. Repro did not change time.

MW: Never seen significant speed up from vectorisation. Typically only a few percent improvement. Code is RAM bound, so cannot provide enough data to make use of vectorisation. Still worth working toward a point where we can take advantage of vectorisation.

PL: Had one "slow" run outlier out of 20 runs. Ran 20% slower. Ran on different nodes to other jobs, not sure if that is significant. MW: IO can cause that. AH: Andy Hogg also had some slow jobs due to a bad node. AK: Job was 20x slower. Also RYF runs became consistently slower a few weeks ago. MW: OpenMPI can prepend timestamps to output, which can help to identify issues.

PL: Getting some segfaults in ompi_request_wait_completion, caused by pmpi_wait and pmpi_bcast. Both called from the coupler. NH: Could be a bad bit of memory in the buffer, and if it tries to copy it can segfault. PL: Thinking to run again using valgrind, but would require compiling own version of the valgrind wrapper for OpenMPI 4.0.2. Would be easier with Intel MPI, but no-one else has used this. Saw some similar cases when searching which were associated with UCX, but sufficiently different to not be sure. These issues are with the highest core count. MW: Often see a lot of problems at high core counts. NH: Finding bugs can be a never-ending task. Use time wisely to fix bugs that affect people. MW: Quarter degree at 15K cores would have very small tile sizes. Could be the source of the issue. AH: This is not a configuration that we would use, so it is not worth spending time chasing bugs.

PL: Next testing target is 0.1 degree, but not sure which configuration and forcing data to use. Will not use MOM5-SIS, but will use ACCESS-OM2 for direct comparison purposes. AK: Configurations used in the model description paper have not been ported to gadi. Moving on to a new iteration. Andy Hogg is running a configuration that is quite similar, but moving to new configurations with updated software and forcing. Those are not quite ready.

PL: Need a starting configuration for testing. Want to confine to scalability testing and compiler flags. NH: ACCESS-OM2 is set up to be well balanced for particular configurations. Can't just double CPUs on all models as load imbalance between submodels will dominate any other performance changes. Makes it problematic as a clean configuration for testing things like compiler flags. MW: Useful approach was to check scalability of sub-model components independently. Required careful definition of timers to strategically ignore coupling time. MOM was easy, CICE was more difficult, but work with Nic's timers helped a lot. Try to time the bits of code that are doing computation and separate them from code that waits on other parts. Coupled model is a real challenge to test. Figure out what timers we used and trust those. Can reverse engineer from my old scripts.

PL: Should do MOM-SIS scalability work? MW: Easier task, and some lessons can be learned, but runtime will not match between MOM-SIS and ACCESS-OM2. Would be more of a practice run. PL: Maybe getting out of scope. Would need 0.1 MOM-SIS config. RY: Yes we have that one. If PL wanted to run ACCESS-OM2-01 is there a configuration available? AK: Andy Hogg’s currently running configuration would work. PL: Next quarter need to free up time to do other things.

MW: Might be valuable to get some score-p or similar numbers on the current production model. Useful to have a record of those timings to share. Scaling test might be too much, but a profile/timing test is more tractable. RY: Any issues with score-p? Overhead? MW: Typical, 10-20%, so skews numbers but get an in-depth view. Can do it one sub-model at a time. Had to hack a lot of scripts, and get NH to rewrite some code to get it to work. score-p is always done at compile time. Doesn't affect payu. Try building MOM-SIS with score-p, then try MOM within ACCESS-OM2. Then move on to CICE and maybe libaccessom2. PL: Build script does include some score-p hooks. MW: Even without score-p MOM has very good internal timers. Not getting per-rank times. score-p is great for measuring load imbalance. AH: payu has a repeat option, which repeats the same time period, which removes variability due to forcing. Need to think about what time you want to repeat as far as season. AK: CICE has idealised initial ice, evolves rapidly. MW: My earlier profile runs had no ice, which affects performance. Not sure it is huge, maybe 10-20%, but not huge.

MW: Overall surprised at lack of any speed up with vectorisation, and lack of slow-down with repro. PL: Will verify those numbers with 960 core config.

AH: Surprised how well it scaled. Did it scale that well on raijin? MW: The performance scaling elbow did show up lower. AH: 3x more processors per node has an effect? MW: Yes, big part of it. AH: 0.1 scaled well on raijin, so should scale better on gadi. 1/30th should scale well. Only bottleneck will be if the library can handle that many ranks.

NH: If repro flags don't change performance that is interesting. Seem to regularly have a "what trade-off do repro flags have?" discussion, would be good to avoid. MW: Probably best to have an automated pipeline calculating these numbers. NH: People have an issue with the fp0 flag. MW: Shouldn't affect performance. NH: Make sure fp0 is in there. MW: Agree 100%.

ACCESS-OM2 update

AH: Do we have a gadi compatible master branch on gadi? AK: No, not currently. NH: At a previous TWG meeting I self-assigned getting master gadi compatible. Merged all gadi-transition branches and tested, seemed to be working ok. Subsequent meeting AK said there were other changes required, so stopped at that point. gadi-transition branches still exist, but much has already been merged and tested on a couple of configurations. Have since moved to working on other things.

NH: Close if AK has all the things he wants into gadi-transition branch. Previous merge didn’t include all the things AK wanted in there. Happy to spend more time on that after finishing JRA55 v1.4 stuff.

JRA55-do v1.4 update

NH: Made code changes in all the models, but have not checked existing experiments are unchanged with modified code.

NH: v1.4 has a new coupling field, ice calving. Passing this through to CICE as a separate field. In CICE split into two fields, liquid water flux and a heat flux. MOM in ACCESS-CM2 already handles both these fields. Just had to change preprocessor flags to make it work for ACCESS-OM2 as well.

NH: Lots of options. Possible to join liquid and solid ice at atmosphere and becomes the same as we have now. Can join in CICE and have a water flux but not a heat flux.
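The distinction NH draws — joining the solid (calving) flux into the liquid flux with or without accounting for melt heat — can be written out as a small worked example. The latent heat constant is the standard approximate value for ice; the function and its signature are illustrative, not the actual coupling code:

```python
LATENT_HEAT_FUSION = 3.34e5  # J/kg, approximate latent heat of fusion of ice

def combine_runoff(liquid, solid, account_for_melt_heat=True):
    """Combine liquid and solid freshwater fluxes (kg/m^2/s).
    Without melt heat: solid 'magically' becomes liquid, as currently done.
    With melt heat: also return the heat sink (W/m^2, negative = cooling)
    required to melt the solid flux."""
    water_flux = liquid + solid
    heat_flux = -LATENT_HEAT_FUSION * solid if account_for_melt_heat else 0.0
    return water_flux, heat_flux
```

This is why CICE can accept the joined field with "a water flux but not a heat flux": dropping the second return value is energetically inconsistent but numerically simple.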

Strange MOM6 error

AH: A quick update with Navid's error. Made a little mpi4py script to run before payu to check the status of nodes, and all but the root node had a stale version of the work directory. Like it hadn't been archived. Link to executable was gone, but everything else was there. Reported to NCI, Ben Menadue does not know why this is happening. Also tried a delay option between runs and this helped somewhat, but also had some strange comms errors trying to connect to exec nodes. Will next try turning off all input/output I can find in case it is a file lock error. Have been told Lustre cannot be in this state.
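The per-node check described above reduces to a small function run on every rank; in the real script this would be driven by mpi4py with the results gathered to rank 0 to identify nodes seeing a stale filesystem view. The helper below is a stdlib-only sketch and the expected entry names are hypothetical:

```python
import os

def missing_entries(work_dir, expected):
    """Return the expected entries absent from work_dir. Run on every
    MPI rank (e.g. via mpi4py's comm.gather) so rank 0 can report which
    nodes have a stale view of the work directory."""
    return [name for name in expected
            if not os.path.exists(os.path.join(work_dir, name))]
```

A non-empty result on a subset of ranks, as AH observed (the executable link missing everywhere but the root node), points at filesystem caching rather than the model itself.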

MW: In old driver do a lot of moving directories from work to archive, and then relabelling. Is it still moving directories around to archive them? Maybe replace with hard copy of directory to archive. MOM6 driver is the MOM5 driver, so maybe all old drivers are doing this. Definitely worth understanding, but a quick fix to copy rather than move.

NH: Filesystem and symbolic links might be an issue. MW: Maybe symbolic links are an issue on these mounted filesystems. AH: There was a suggestion it might be because it was running on home which is NFS mounted, but that wasn't the problem. MW: Often with raijin you just got the same nodes back when you resubmit, so maybe some sort of smart caching.


Scalability of ACCESS-OM2 on Gadi – Paul Leopardi 18 March 2020