Technical Working Group Meeting, July 2020

Minutes

Date: 10th June, 2020

Attendees:

Aidan Heerdegen (AH) CLEX ANU
Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
Rui Yang (RY), Paul Leopardi (PL) NCI
Nic Hannah (NH) Double Precision
James Munroe (JM) Sweetwater Consultants
Peter Dobrohotoff (CSIRO Aspendale)
Marshall Ward (GFDL)

Optimisation report

PL: Have a full report, need review before release. This is an excerpt.

PL: Aims for perf tuning and options for configuration. Did a comparison with Marshall’s previous report on raijin.

Testing included MOM-SIS at 0.25 and 1 deg to get idea of MOM scalability stand-alone.

The ACCESS-OM2 at 0.1 deg. Resting with land masking, scaling MOM and CICE proportional.

Couldn’t repeat Marshall’s exactly. ACCESS-OM2 results based on different configs. Differences:

continuation run
time step 540s v 400s
MOM and CICE were scaled proportionally
Scaling taken to 20k v 16k

MOM-SIS at 0.25 degrees on gadi 25% faster than ACCESS-OM2 on raijin at low end of CPU scaling. Twice as fast for MOM-SIS at 0.1 degrees. Scalability at high end better.

ACCESS-OM2: With 5K MOM cores, MOM is 50-100% faster than MOM on raijin. Almost twice as fast at 16K, scaled out to 20K. CICE: with 2.5K cores CICE on gadi seems 50% faster than CICE on raijin. Scales to 2.5 times as fast at 16K OM2 CPUs.

Days per cpu day. From 799/4358 CICE/MOM cpus does not scale well.

Tried to look at wait time as fraction of wall time. Waiting constant for high CICE ncpus, decreases with high core counts with low CICE ncpus. So higher core count probably best to reduce ncpus as proportion. In this case half the usual fraction.

JM: How significant are results statistically? PL: Expensive. Only ran 3-4 runs. Spread quite low. Waiting time varied the most. Stat sig probably not due to small sample size.

MW: Timers in libacessom2 were better than OASIS timers, which include bootstrapping times which are impossible to remove. Also noisy IO timers. Not sure how long your run. Longer would be more accurate. PL: Runs are for 1 calendar month (28 days). MW: oasis_put and get are slightly magical, difficult to know what they’re doing. PL: Still have outputs, could reanalyse.

MW: Speedups seem very high. Must be configuration thing. PL: Worried not a straight comparison. MW: If network time is 15-20%, wouldn’t make a difference. Always been RAM bound, which be good if that wasn’t an issue now. PL: Very meticulous documentation of configuration, and is very reproducible. Made a shell script that pulls everything from GitHub. MW: I think your runs are better referenced. While experiments were being released, seemed some parameters etc were changing as I was testing. That could be the difference. Wish I documented it better.

AH: All figures independent of ocean timestep. PL: All timestep at 540s. AK: Production runs are 15-20% faster, but a lot of IO. PL: Switched off IO and only output at end of the month. It really drags it. Make sure it isn’t IO bound. Probably memory bound. Didn’t do any profiling that was worth presenting. MW: Got a FLOP rate? PL: Yes, but not at my fingertips. If it is around a GFLOP, probably RAM bound. PL: Now profiling ACCESS-CM2 with ARM-map. RY is looking at a micro view, looking at OpenMP and compilation flag level. RY: Gadi has 4gb/core, raijin has 2GB/core. Not sure about bandwidth. Also 24 cores/node. Much less remote note comms. Maybe a big reduction in MPI overheads. MW: OpenMPI icx stuff helping. RY: Lots of on-node comms. Not sure how much. MW Believe at high ranks. At modest resolutions comms not a huge fraction of run time. Normal scalable configs only about 20%. PL: The way the scaling was done was different. MW scaled components separately. MW: I was using clocks that separated out wait time.

RY: If config timestep matters, any rule for choosing a good one? AK Longest timestep that is numericaly stable. 540s is stable most of the time.

MW: How have progressed on CICE layout stuff? Changed in the last year? I was using sect-robin. RF: sect-robin or round-robin. AK: You use sect-robin, production did use round-robin, but not sect-robin. Less comms overhead, not sure about load balance.

PL: Is there any value in releasing the report. NH: Would be interested in reading it. Looking to get these bigger configurations. AH: Worth to document the performance at this point. RY: Any other else worth trying? AH: Why 20K limit? PL: Believe that is a PBS queue limit. Some projects can apply for an exception. RY: For each queue there are limits. Can talk with admin if necessary to to increase. AH: Will bring this up at Scheme Managers meeting. They should be updated with gadi being a much bigger machine. Would give more flexibility with configurations. Scalability is very encouraging.

RF: BlueLink runs very short jobs, 3days at a time. Quite a bit of start-up and termination time. How much does that vary with various runs. MW: I did plot init times, it was in proportion to the number of cores. Entirely MPI_Init. It has been a point of research in OpenMPI. Double ranks, double boot time. RF: Also initialisation of OASIS, reading and writing of restart files. PL: Information has been collected, but hasn’t been analysed. RF: Paul Sandery’s runs are 20% of the time. MW: MPI_Init is brutal, and then exchange grid. There are obvious improvements. Still doing alltoall when it only needs to query neighbours. Can be speed up by applying geometry. Don’t need to preconnect. That is bad.

PL: At least one case where MPI collectives were being replaced with a loop of point to points. Was collective unstable? MW: Yes, but also may be there for reproducibility. MW: I re-wrote a lot of those with collectives, but they had no impact. At one time collectives were very unreliable. Probably don’t need to be that way anymore. MW I doubt that they would better. Hinges on my assertion that comms are not a major chunk.

AH: MOM is RAM bound or cache-bound? MW: When doing calculations the model is waiting to load data from memory. AH: Memory bandwidth improves all the time. MW: Yes, but increase the number of cores and it’s a wash. It could have improved. AMD is doing better, but Intel know there is a problem.

AH: To wrap-up. Yes would like full report. This is useful for NH to work up new configurations, as naive scaling is not the way to go. Also intialisation numbers RF would like that PL can provide.

ACCESS-OM2 release plan

AH: Are we going to change bathymeytry? Consulted with Andy who consulted with COSIMA group. What is the upshot? AK: Ryan and Abhi want to do some MOM runs. Problems with bathy. Andy wants run to start, if someone has time to do it, otherwise keep going. Does anyone have some idea how long it would take. RF: 1 deg would be fairly quick. We know where the problems are. Shouldn’t be a big job. Maybe a few days. 1 deg in a day for an experienced person. GFDL has some python tools for adjusting bathymetry on their GitHub. Point and click. Alisdair wrote it. Might be in MOM-SIS examples repo. MW: Don’t know, could ask. RF: Could be something that would make it straightforward.

AK: Will have a look.

MW: topog problems in MOM6 not usually the same as MOM5 due to isopycnal coordinate.

AK: Some specific points that need fixing? RF: I think I put some regions in a GitHub issue. AK: What level to dig into this? RF: Take pits out. Set to min depth in config. Regions which should be min depth and have a hole. Gulf of Carpenteria trivial. Laptev should all be set to min depth. NH: I did write some python stuff called topog_tools, can feed it csv of points and it will fix it. Will fill in holes in automatically. Also smooths humps. May still have to look at the python and fix stuff up. Another possibility.

AK: Another issue is quantisation to the vertical grid. A lot of terracing that has been inherited. RF: Different issue. Generating a new grid. 1 degree not too bad. 0.25 would be weighty still. MC: BGC in 0.25, found a hole off Peru that filled up with nutrients.

AH: Only thing you’re waiting on? AK: Could do an interim release without topog fixes. People want to use. master is so far behind now Also updating wiki which refers to new config. Might merge ak-dev into master, and tag as 1.7, and have wiki instructions up to date with that. AH: After bathy update what would version be? AK: 2.0. AH: Just wondering what constitutes a change in model version. AK: Maybe one criteria if restarts are unusable from one version to the next. Changing bathy would make these incompatible. AH: Good version.

AH: Final word on ice data compression? NH: Decided deflate within model was too difficult due to bugs. Then Angus recognised the traceback for my degfault which was great. Wasted some time not implementing it correctly. Now working correctly. Using different IO subsystem. IO_MP rather than ROMIO_MP. Got more segfaults. Traced and figured out. Need to be able to tell netCDF the file view. The view of a file that a rank is responsible for. MPI expecting that to be set. One way to set it up is to specify the chunks to something that makes sense. Once I did that file view was correct. Then ran into bugs in PIO library. Seem like cut n’ paste mistakes. No test for chunking. Library wrapper is wrong. Fixed that. No working. Learnt a lot and a satisfying outcome. Significantly faster. Partly parallel, also different layer does more optimisation. Noticed with 1 degree things are flying. Nothing definitive, but seems a good outcome. Will get some definitive numbers and a better explanation. Will have something to merge. Will have some PRs to add to and some bug reports to PIO. RY: PIO just released a new version yesterday. NH: Didn’t know that. Tracking issues that are relevant to me. Still sitting there. Will try new version. RY: Happy it is working now. NH: Was getting frustrated with PIO, wondered why not using netCDF directly. For what we do pretty thin wrapper around netCDF. Main advantage is the way it handles the ice mapping. Worth keeping just for that. MW: FMS has most of a PIO wrapper but not the parallel bit.

PL: Any of that fix needs to be pushed upstream? NH: Changes to the CICE code. Will push to upstream CICE. Will be a couple of changes to PIO. AK: Dynamically determines chunking? NH: Need to set that. Dynamically figures out tuneable parameters under the hood about the number of aggregators. Looking at what each rank is doing. Knows what filesystem it is on. Dependent on how it is installed. Assuming it knows it is on lustre. Can generate optimal settings. Can explain more when do a summary.

AK: Want to make sure OP files are consistently chunked. NH: Using the chunking to set the file view. Another way to explicitly set the file view using MPI API. Chunks are the same size as the data that each PE has. In CICE each block is a chunk. MW: These are netCDF chunks? AK: More chunks than cores? NH: Yes. Is that bad or good? This level is perfect for writing. Every rank is chunked on what it knows to do. Not too bad for reading. JM: How large a chunk? NH: In 1 degree every PE has full domain 300 rows x 20 columns. JM: Those are small. Need bigger for reading AK: For tenth 36×30. Something like 9000 blocks/chunk. NH: Might be a problem for reading? RF: Yes for analysis. JM: Fixed cost for every read operation. A lot of network chatter. AH: Is that the ice or the ocean in the tenth? Not sure. Chunk size is 36×30. A lot of that is ice free, 30% is land. MW: Ideal chunks are based on IO buffers in the filesystem. AH: Best chunking depends a lot on access patterns. JM: One chunk should be many minimal file units big. AH: When 0.25 had one file per PE it was horrendously bad for IO. Crippled Ryan’s analysis scripts. If you’re using sect robin that could make the view complicated? NH: Wasn’t Ryan’s issue that the time dimension was also chunked? AH: He was testing mppnccombine-fast which just copies chunks as-is, which were set by the very small tiles sizes. Similar to your problem? NH: Probably worse. Not doing MOM tiles, doing CICE blocks, which are even smaller. Same grid as MOM. RF: Fewer PEs, so blocks are half the size. MOM5 tile size to CICE block size are comparable apart from 1 degree model.

NH: Will carry on with this. Better than deflating externally, but could run into some problems. The chunks in the file view don’t have to be the same. Will this be really bad for read performance? Gathering that it is. What could be done about it? Limited by what each rank can do. No reason the chunks have to be the same as the file view. Could have multiple processors contribute to a chunk. Can’t do without out collective. MW: MPI-IO does collectives under the hood. Can configure MPI-IO to build your chunk for you? NH: Currently every rank does it own IO as it was simpler and faster. MW: Can’t all be configured at MPI-IO later? RY: PIO can map compute domain to IO domain. Previous work had one IO rank per node. IO rank collect all data from node. Set chunking at this level. NH: For example, our chunk size could be 48 times bigger. RY: Yes. Also best performance is single rank per node. PIO does have this function to map from compute to IO domain, and why we used it. Can also specify how many aggregators / node. First decide how many IO ranks per node, and how many aggregators per node. Those should match. Can also number of stripes and strip size to match chunk size. IO rank per node is the most important, as will set chunk size. MW: Only want same number of writers as OSTs. RY: Many writers per node, will easily saturate the network and be the bottleneck. AH: Have to go, but definitely need to solve this, as scaling to 20K cores will kill this otherwise.

MW: Will also help RF. If you’re desperate should look at the patch RY and I did. Will help a lot once you’ve identified your initialisation slow down. RF: Yes, will do once I’ve worked out where the blockages are. Just seen some timings from Paul Sandery, but haven’t looked into it deeply yet. NH: Even with rubbish config, model is showing performance improvements. Will continue with that, and will consider the chunk size stuff as an optimisation task. MW: Sounds like you’ve gone from serial write to massively parallel, so inverted the problem, from one disk bound, to network bound within lustre. If you can find a sweet spot in between then should see another big speed improvement. NH: Config step pretty easy to do with PIO. Will talk to RY about it. RY: Could have a user settable parameter to specify IO writers per node. PL: Need to look into lustre striping size? RY: Currently set to 1GB, so probably ok, but can always tune this. NH: Just getting a small taste of the huge world of IO optimisation. MW: Just interesting to be IO bound again. NH: heavily impacted by IO performance with dailies with tenth. MW: IO domain offset the problem. Still there but could be dealt with in parallel with next run so could be sort of ignored.

AK: This is going to speed up the run. Worse case is post-processing to get chunking sorted out. NH: Leaves us back at the point of having to do another step, which I would like to avoid. Maybe different, before was a step to deflate, maybe rechecking was always going to be a problem.

AK: Revisiting Issue #212, so we need to change model ordering. Concerns about YATM and MOM sharing a node and affecting IO bandwidth. Tried this test, there is an assert statement that fails. libaccessom2 demands YATM is first PE. NH: Will look at why I put the assert there. Weirdly proud there is an assert. RF: Remember this, when playing around with OASIS intercommunicators, might have been conceptually this was the easiest way to get it work. MW: I recall insisting on a change to the intercommunicator to get score-p working. AK: Not sure how important this is. NH: There are other things. Maybe something to do with the runoff. The ice model needs to talk to YATM at some point. Maybe a scummy way of knowing where YATM is. For every config maybe then know where YATM is. These would be shortcut reasons. PL: Give it it’s own communicator and use that? NH: Maybe that is what we used to do. Could always go back to what we had before. RF: Just an idea if it would have an impact. Could give YATM it’s own node as a test. MW: Not sure why it is that way. Should be easy to fix. NH: Ok, certain configurations are shared. Like timers and coupling fields. Instead of each model have their own understanding, share this information. So models check timestep compatibility etc. Using it to share configs. Another way to do that. MW: Doesn’t have to rank zero. NH: Sure it is just a hack. MW: libaccessom2 is elegant, can’t say the same for all the components it talks to. RF: There is a hard-wired broadcast from rank zero at the end.

MOM6

MW: Ever talk about MOM6? AK: Angus is getting a regional circumantarctic MOM6 config together. RF: Running old version of CM4 for decadal project. PL: Maybe a good topic for next meeting?

Attachments

https://cosima.org.au/wp-content/uploads/2020/08/ACCESS-OM2-scalability-on-Gadi.pdf

https://cosima.org.au/wp-content/uploads/2020/08/ACCESS-OM2-Performance-Report.pdf