Technical Working Group Meeting, November 2020

Minutes

Date: 11th September, 2020

Attendees:

Aidan Heerdegen (AH) CLEX ANU
Andrew Kiss (AK) COSIMA ANU
Angus Gibson (AG) RSES ANU
Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
Paul Leopardi (PL), Rui Yang (RY) NCI
Nic Hannah (NH) Double Precision
Peter Dobrohotoff (PD) CSIRO Aspendale
James Munroe (JM) Sweetwater Consultants
Marshall Ward (MW) GFDL

Updated ACCESS-OM-01 configurations

NH: Testing for practical/usable 0.1 configs. Can expand configs as there is more available resource on gadi and CICE not such a bottleneck. When IO is on, scalability has been a problem, particularly with CICE. IO no longer bottleneck with CICE, but CICE still scalability problem. All I/O was routed through root processor with a lot of gathers etc. So now have a chance to have larger practically useful configs.

NH: Doesn’t look like we can scale up to tens of thousands of cores.

NH: Coupling diagram. YATM doesn’t receive any fields from ICE or OCEAN model. Just sync message. Can just send as much as it wants. Seemed a little dangerous, maybe stressing MPI buffers, or not intended usage. YATM always tries to keep ahead of the rest of the model, and send whenever it can. Does hang on a sync message. Will prepare coupling fields, once done will do a non-blocking sync. Should always send waiting for CICE to receive. When ice and ocean model coupling on every time step, both send and then both receive. Eliminate time between sending and waiting to receive. CICE should never wait on YATM. CICE and MOM need to run at roughly same pace. MOM should wait less as it consumes more resources. Internally CICE and MOM processors should be balanced. Previously CICE I/O has been big problem. Usually caused other CICE CPUs to wait, and MOM CPUs.

NH: Reworked coupling strategy originally from CSIRO to do as best as we can.

NH: Four new configs: small, medium, large and x-large. New configs look better than existing because old config did daily ice output, which it was never configured to do. Comparing new configs to worst existing config. All include daily ice and ocean output. Small basically the same as previous config. All tests are warm from April from restart from AK run. Lots of ice. Three changes from existing in all configs: PIO, small coupling change where both models send simultaneously, small change to initialisation in CICE. Medium config (10K) is very good. Efficiency very similar to existing/small. Large not scaling too well. Hitting limit on CICE scaling. Probably limit of efficiency. X-Large hasn’t been able to be tested. Wanted to move projects, run larger job. AH: X-large likely to be less efficient than large. NH: Unlikely to use it. Probably won’t use Large either.

AH: MOM waiting time double from med -> large. NH: MOM scaling well. CICE doesn’t scale well. Tried a few different processor counts, doesn’t help much. Problem getting right number of CICE cpus. Doesn’t speed up if you add more. AK: Played around with block numbers? NH: Keep blocks around 6 or 7. As high as 10 no down side. Beyond that, not much to do.

PL: Couldn’t see much change with number of blocks. Keeping same seemed better than varying. NH: Did you try small number like 3-4. PL: No. RF: Will change by season. Need enough blocks scattered over globe so you never blow out. PL: I was only testing one time of the year. NH: Difficult to test properly without long tests.

NH: PIO is great. Medium is usable. Large and X-Large probably too wasteful. Want to compare to PL and MW results. Might be some disagreement with medium config. From PL plots shouldn’t have scaled this well. PL: Maybe re-run my tests with new executables/configs. NH: Good idea. Will send those configs. Will run X-Large anyway.

NH: Had to fix ocean and ice restarts after changing layout. Didn’t expect it would be required but got a lot of crashes. Interpolate right on to land and then mask land out again. In ice case, making sure no ice on land. PL: I had to create restart for every single grid configuration.

NH: Not clear why this is necessary. Thought we’d fixed that, at least with ocean. Maybe with PIO introduced something weird. Would be good to understand. Put 2 scripts in topography_tools repo fix_ice_restarts.py fix_ocean_restarts.py.
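A minimal sketch of the kind of clean-up described for the ice case (making sure there is no ice on land). This is not the contents of fix_ice_restarts.py; the function name, the plain-list field representation and the mask convention (1 = ocean, 0 = land) are all assumptions for illustration.

```python
def zero_ice_on_land(ice_field, ocean_mask):
    """Return a copy of ice_field with every land cell set to 0.0.

    ice_field  : 2D list of floats (hypothetical ice variable from a restart)
    ocean_mask : 2D list of ints, 1 = ocean, 0 = land (assumed convention)
    """
    if len(ice_field) != len(ocean_mask):
        raise ValueError("field and mask have different numbers of rows")
    cleaned = []
    for row, mask_row in zip(ice_field, ocean_mask):
        if len(row) != len(mask_row):
            raise ValueError("field and mask rows have different lengths")
        # Keep the value on ocean cells, force zero on land cells.
        cleaned.append([v if m == 1 else 0.0 for v, m in zip(row, mask_row)])
    return cleaned


# Tiny made-up example: two of the four cells are land.
ice = [[0.9, 0.5], [0.2, 0.0]]
mask = [[1, 0], [0, 1]]
cleaned = zero_ice_on_land(ice, mask)
```

A real restart file holds many variables and categories, so the same masking would need to be applied to each of them.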

RF: Will you push changes for initialisation? Would like to test them with our 3 day runs. NH: Will do.

RF: Working on i2o restarts. Code writes out a variable, defines the next variable, chance of doing a copy each time. About 10-12 copies of the files in and out. NH: OASIS? RF: No coupled_netcdf thing. Rewritten and will test today. Simple change to define variables in advance. NH: Still doing a serial write? RF: Yes. Could probably be done with PIO as well. Will just build on what is there at the moment.

AH: Large not as efficient as Medium or Small, but more efficient than current. NH: Due to daily output and no PIO. AH: Yes, but still an improvement on current. AH: How does this translate to model years per submit? AK: Small is 3 months in just over 2.5 hours. Should be able to do 1 year/submit in large. AH: Might be a reason to use large. Useful for example if someone wanted to do 3-daily outputs for particle tracking for RYF runs. Medium is 6 months/submit? Or maybe a year? AK: Possibly. Not so important when queue wait was longer. NH: I calculate medium would take 5.2 hours/year. Could make medium a thousand processors on the ocean. Might be smarter. AK: Target wall time.
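The throughput figures quoted above can be turned into model months per submit with a little arithmetic. This is a sketch only: the 5-hour wall-time limit is an assumed value, and the function and variable names are hypothetical.

```python
import math


def months_per_submit(hours_per_model_year, walltime_limit_hours):
    """Whole model months that fit inside one job's wall-time limit."""
    return math.floor(walltime_limit_hours * 12 / hours_per_model_year)


# Figures quoted in the discussion.
small_hours_per_year = 2.5 * 4    # 3 months in ~2.5 hours, so ~10 hours/year
medium_hours_per_year = 5.2       # NH's estimate for the medium config

WALLTIME_LIMIT_HOURS = 5.0        # assumption, not stated in the minutes

small_months = months_per_submit(small_hours_per_year, WALLTIME_LIMIT_HOURS)
medium_months = months_per_submit(medium_hours_per_year, WALLTIME_LIMIT_HOURS)
print(small_months, medium_months)
```

Under that assumed limit, medium at 5.2 hours/year falls just short of a full model year per submit. Small is quoted as "just over" 2.5 hours per 3 months, so its real figure would be slightly lower than this estimate.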

AK: About to start 3rd IAF cycle. Andy keen to try medium. Good enough for production? NH: No real changes. AK: Do you have a branch I could look at. NH: Maybe test a small bump on medium to get it under 5 hours. AK: Hold off until config that gets under 5 hours? NH: Spend the rest of the day and get it under 5 hours. AK: Next few days, updates to diagnostics.

PL: Any testing to see if it changes climate output? NH: None. Don’t know how to do it. MOM should be reproducible. Assume CICE isn’t. AH: Run, check gross metrics like ice volume. Not trying to reproduce anything. Don’t know if there is an issue if it is due to this or not. Cross that bridge when we have to.

AH: Any code changes as well to add to core/cpu bumps? RF: Just a few minutes at the beginning and end.

MW: Shouldn’t change bit repro? NH: Never verified CICE gives same answers with different layouts. Assuming not. MW: Sorry, thought you meant moving from serial to PIO. AH: AK has done testing with that. AK: With PIO only issue is restart files within land-masked processors. MW: Does CICE do a lot of collectives? Only real source of issues with repro. PL: Fix restart code might perturb things? NH: Possibly? AH: Can’t statistically test enough on tenth. PL: Martin Dix sent me an article about how to test climate using ensembles.

AH: When trying to identify what doesn’t scale? Already know? EVP solver? NH: Just looked at the simplest numbers. Might be other numbers we could pull out to look at it. Maybe other work. Already done by PL and MW. If goal is to improve scalability of CICE can go down that rabbit hole. AH: Medium is efficient and a better use of effort to concentrate on finessing that. Good result too. NH: Year in a single submit with good efficiency is a good outcome. As long as it can get through the queue, and had no trouble with testing. AH: Throughput might be an issue, but shouldn’t be with much larger machine now.

AK: Less stable at 10K? Run into more MPI problems with larger runs. Current (small) config is pretty reliable. Resub script works well, falls over 5-10% of the time. Falls over at the start, gets resubmitted. Dies at the beginning so no zombie jobs that run to wall time limit. NH: Should figure out if our problem or someone else’s? In the libraries? AK: Don’t know. Runs almost every time when restarted. Tends to be MPI stuff, sometimes lower down interconnect. Do tend to come in batches. MW: There were times when libraries would lose connections to other ranks. Those used to happen with lower counts on other machines. Depends on the quality of the library. MPI is a thin layer above garbage that may not work as you expect at very high core counts.

PL: What happens when the job restarts? Different hardware with different properties? AK: Not sure if it is hardware. MW: mxm used to be unstable. JM: As we get to high core count jobs, may not be able to trust that every node will be ok for the whole run? Should get used to this? In cloud this is expected. In future will have to cope with this? That one day cannot guarantee an MPI rank will ever finish?

MW: Did see a presentation with a heap of OpenMPI devs, acknowledge that MPI is not well suited to that problem. Not designed for pinging and rebooting. Actions in implementation and standard to address that scenario. Not just EC2. Not great. Once MPI starts to fray it falls to pieces. Will take time for library devs to address problem, also if MPI is the appropriate back end for IPC scenario. PL: 6-7 yrs ago ANU had a project with Fujitsu and exascale and what happens when hardware goes away. Not MPI as such. Mostly mathematics of sparse grids. Not sure what happened to it. NH: With cloud expect to happen, and why. We might expect it, but don’t know why. If know why, if it is dropping nodes we can cope. If just transient error, due to corrupted memory due to bad code in our model, want to know and fix. MW: MPI doesn’t provide API to investigate it. MPI trace on different machines (e.g. CRAY v gadi) look completely different. Answering that question is implementation specific. Not sure what extra memory is being used, which ports, and no way to query. MPI may not be capable of doing what we’re asking for, unless they expand the API and implementors commit to it.

AH: Jobs as currently constituted don’t fit cloud computing model, as jobs are not independent, rely on results from neighbouring cells. Not sure what the answer is?

MW: Comes up a lot for us. NOAA director is private sector. Loves cloud compute. Asks why don’t we run on cloud? Any information for a formal response could be useful longer term. Same discussions at your places? JM: Problem is in infrastructure. Need an MPI with built in redundancy to model. Scheduler can keep track and resubmit. Same thing happens in TCP networking. Can imagine a calculation layer which is MPI like. MW: We need infiniband. Can’t rely on TCP, useful stock answer. JM: Went to DMAC meeting, advocating regional models in the cloud. Work on 96 cores. Spin up a long tail of regional models. Decent interconnect on small core counts. Maybe single racks within data centres. MW: Looking at small scale MOM6 jobs. Would help all of us if we keep this in mind, what we do and do not need for big machines. Fault tolerance not something I’ve thought about. JM: NH point about difference between faulty code and machine being bad is important.

AH: Regarding idea of having an MPI scheduler that can cope with nodes disappearing. Massive problem to save model state and transfer around. A python pickle for massive amounts of data. NH: That is what restart is. MW: Currently do it the dumbest way possible with files. JM: Is there a point where individual cores fail, MPI mesh breaks down. Destroys whole calculation. Anticipating the future. NH: Scale comes down to library level. A lot of value in being able to run these models on the cloud. Able to run on machines that are interruptible. Generally cheaper than defined duration. Simplest way is to restart more often, maybe every 10 days. Never lose much when you get interrupted. Lots we could do with what we have. AH: Interesting, even if they aren’t routinely used for that. Makes the models more flexible, usable, shareable.

MW: Looking at paper PL sent, are AWS and cori comparable for moderate size jobs? PL: Looking at a specific MPI operation. JM: Assuming machine stays up for the whole job. Not really testing redundancy.

Testing PIO enabled FMS in MOM5

AH: Last meeting talked to MW about his previous update of FMS to ulm. MW put code directly into shared dir in MOM5 source tree. Made and merged PR where shared dir is now a subtree. Found exact commit MW had used from FMS repo, and recreated the changes he subsequently made so old configs still work: adding back some namelist variables that don’t do anything, but mean old configs still work. FMS code is identical, but now can swap out FMS more easily with subtree. There is a fork of FMS in the mom-ocean GitHub organisation. The master branch is the same as the code that is in the MOM5 subtree. Have written a README to explain the workflow for updating this. If we want to update FMS, create a feature branch, create a PR, create a matching branch on MOM5, use subtree to update FMS and make a PR on the MOM5 repo. This will then do all the compilation tests necessary for MOM5. Both PRs are done concurrently. Bit clunky, but a lot easier than otherwise, and changes we make are tied to the FMS codebase. Wasn’t the case when they were bare code changes directly included in MOM5.

AH: Can now look at testing a PIO enabled FMS branch based on work of RY and MW. Is there an appetite for that? MW and RY changes are on a branch of the FMS fork. Based on a much more recent version of FMS? MW: Pretty old, whatever it is based on. 1400 commits behind. AH: Current is 2015, PIO branch is based on 2018. MW: Must be compatible with MOM5 because it was tested. Not a big issue if it is ulm or not.

AH: So should we do that? I could make a branch with this updated PIO enabled FMS. NH: Great idea. Uses what we’ve already done. AH: RY and MW have done it all, just have to include it. Is compression of outputs handled by library? RY: Testing parallel compression. Code does not need changing. Just need to change deflate level flag in FMS. Parallel IO can import that. No need for code changes. Not compatible with current HDF version 1.10. If specify chunk layout will crash. Test HDF 1.12. Works much better than 1.10. Performance looks pretty good. Compression ratio 60%. Sequential IO 2.06TB, using compression 840GB. Performance is better than sequential IO. Currently available HDF library not compatible. Will present results at next meeting. Not rush too much, need stable results. When netCDF support parallel compression, always wanted to test that, and see if there are code changes. Current HDF library layout only compatible with certain chunk layouts. AH: Certainly need stability for production runs.
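As a check on the figures RY quoted, the sizes can be converted to a space saving and a compression factor. A small sketch, assuming decimal units (1 TB = 1000 GB) and that the quoted 60% refers to space saved; the variable names are hypothetical.

```python
# Sizes quoted in the minutes, converted to GB (assuming 1 TB = 1000 GB).
sequential_gb = 2.06 * 1000   # sequential IO output
compressed_gb = 840.0         # output with parallel compression

# Fraction of the original volume that is saved by compression.
space_saving = 1.0 - compressed_gb / sequential_gb

# How many times smaller the compressed output is.
compression_factor = sequential_gb / compressed_gb

print(f"space saving: {space_saving:.1%}")
print(f"compression factor: {compression_factor:.2f}x")
```

On these numbers the saving is roughly 59%, consistent with the quoted 60%, and the compressed output is about 2.5 times smaller.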

NH: Had similar issues. Got p-netcdf and PIO working. PIO is faster, no idea why. Also had trouble with crashes, mostly in MPI libraries and not netCDF libraries. Used MPI options to avoid problematic code. PIO and compression turned on, very sensitive to chunk size. Could only get it working with chunks the same as block size. Wasn’t good, blocks too small. Now got it working with MPI options which avoid bad parts of code. RY: ompi is much more stable. NH: Also had to manually set the number of aggregators. RY: Yes, HDF v 1.12 is much more stable. Should try that. Parallel IO works fine, so not MPI issue, so definitely comes from HDF5 library. So much better moving to HDF1.12. MW: Is PIO doing some of the tuning RY has done manually? NH: Needs more investigation, but possibly being more intelligent about gathers before writes. MW: RY did a lot of tuning about how much to gather before writes. NH: Lots of config options. Didn’t change much. Initially expected parallel netCDF and PIO to be the same, and surprised PIO was so much better. Asked on GitHub why that might be, but got no conclusive answer.

AH: So RY, hold off and wait for changes? RY: Yes doing testing, but same code works. AH: Even though didn’t support deflate level before? RY: Existed with serial IO. PIO can pick up the deflate level. Before would ignore or crash.

Miscellaneous

MW: At prior MOM6 meeting was obvious preference to move MOM6 source code to mom-ocean organisation if that is ok with current occupants. Hadn’t had a chance to mention this when NH was present. If they did that were worried about you guys getting upset. AH: We are very happy for them to do this. NH: Don’t know time frame. Consensus was unanimous? AH: Definitely from my point of view makes the organisation a vibrant place. NH: I don’t own or run, but think it would be cool to have all the codebase in one organisation. Saw the original MOM1 discs in Matt England’s office, which spurred putting all the versions on GitHub. So would be awesome. MW: COSIMA doesn’t have a stake or own this domain? AH: Not at all. I have just invited MW to be an owner, and you can take care of it. MW: Great, all settled.

PL: Action items for NH? NH: Send PL configs, commit all code changes tested on current config and give to RF, and fix up Medium config. AH: Four: share config with RF and AK, and please send me the slides you presented. RY: PIO changes all there for PL to test? NH: Yes. PL: Definitely need those to test the efficiency of the code. AK: PIO in master branch of CICE on the repo.

PL: Have an updated report that is currently being checked over by Ben. Will release when that is given the ok.

AH: Working on finally finishing the cmake stuff for MOM5 for all the other MOM5 test configs. Will mean MOM5 can compile in 5 minutes as parallel compilation works better due to dependency tree being correctly determined.