Aidan Heerdegen (AH) CLEX ANU
- Andrew Kiss (AK) COSIMA ANU
- Angus Gibson (AG) RSES ANU
Russ Fiedler (RF) CSIRO Hobart
Paul Leopardi (PL), Rui Yang (RY) NCI
Nic Hannah (NH) Double Precision
MW: Will also help RF. If you’re desperate should look at the patch RY and I did. Will help a lot once you’ve identified your initialisation slow down. RF: Yes, will do once I’ve worked out where the blockages are. Just seen some timings from Paul Sandery, but haven’t looked into it deeply yet. NH: Even with rubbish config, model is showing performance improvements. Will continue with that, and will consider the chunk size stuff as an optimisation task. MW: Sounds like you’ve gone from serial write to massively parallel, so inverted the problem, from one disk bound, to network bound within lustre. If you can find a sweet spot in between then should see another big speed improvement. NH: Config step pretty easy to do with PIO. Will talk to RY about it. RY: Could have a user settable parameter to specify IO writers per node. PL: Need to look into lustre striping size? RY: Currently set to 1GB, so probably ok, but can always tune this. NH: Just getting a small taste of the huge world of IO optimisation. MW: Just interesting to be IO bound again. NH: heavily impacted by IO performance with dailies with tenth. MW: IO domain offset the problem. Still there but could be dealt with in parallel with next run so could be sort of ignored.
Apologies from Peter Dobrohotoff.
(Paul’s report is attached at the end)
PL: Looking at scaling. Started with ACCESS-OM2, but went to testing MOM5 directly with MOM5-SIS. Using POM25, global 0.25 model with NYF forcing. The model MW developed for testing scaling prior to ACCESS-OM2. Had to add specify min_thickness in ocean_topog_nml.
PL: Tested the scaling of 960/1920/3840/7680/15360, with no masking. Scales well up to some point between 7680 and 15360.
PL: Tested effect of vectorising options (AVX2/AVX512/AVX512-REPRO). Found no difference in runtime with 15360 cores. MW: Probably communication bound at the CPU count. Repro did not change time.
MW: Never seen significant speed up from vectorisation. Typically only a few percent improvement. Code is RAM bound, so cannot provide enough data to make use of vectorisation. Still worth working toward a point where we can take advantage of vectoeisatio.
PL: Had one “slow” run outlier out of 20 runs. Ran 20% slower. Ran on different nodes to other jobs, not sure if that is significant. MW: IO can cause that. AH: Andy Hogg also had some slow jobs due to a bad node. AK: Job was 20x slower. Also RYF runs become consistently slower a few weeks ago. MW: OpenMPI can prepend timestamps in front of output, can help to identify issues.
PL: Getting some segfaults in ompi_request_wait_completion, caused by pmpi_wait and pmpi_bcast. Both called from the coupler. NH: Could be a bad bit of memory in the buffer, and if it tries to copy it can segfault. PL: Thinking to run again using valgrind, but would require compiling own version of valgrind wrapper for OpenMPI 4.0.2. Would be easier to Intel MPI, but no-one else has use this. Saw some cases similar when searching which were associated with UCX, but sufficiently different to not be sure. These issues are with highest core count. MW: Often see a lot of problems at high core counts. NH: Finding bugs can be a never ending bug. Use time wisely to fix bugs that affect people. MW: Quarter degree at 15K cores would have very small tile sizes. Could be the source of the issue. AH: This is not a configuration that we would use, so it is not worth spending time chasing bugs.
PL: Next testing target is 0.1 degree, but not sure which configuration and forcing data to use. Will not use MOM5-SIS, but will use ACCESS-OM2 for direct comparison purposes. AK: Configurations used in the model description paper have not been ported to gadi. Moving on to a new iteration. Andy Hogg is running a configuration that is quite similar, but moving to new configurations with updated software and forcing. Those are not quite ready.
PL: Need a starting configuration for testing. Want to confine to scalability testing and compiler flags. NH: ACCESS-OM2 is setup to be well balanced for particular configurations. Can’t just double CPUs on all models as load imbalance between submodels will dominate any other performance changes. Makes it a problematic config for clean configurations for things like compiler flags. MW: Useful approach was to check scalability of sub-model components independently. Required careful definition of timers to strategically ignore coupling time. MOM was easy, CICE was more difficult, but work with Nic’s timers helped a lot. Try to time the bits of code that are doing computation and separate from code that waits on other parts. Coupled model is a real challenge to test. Figure out what timers we used and trust those. Can reverse engineer from my old scripts.
PL: Should do MOM-SIS scalability work? MW: Easier task, and some lessons can be learned, but runtime will not match between MOM-SIS and ACCESS-OM2. Would be more of a practice run. PL: Maybe getting out of scope. Would need 0.1 MOM-SIS config. RY: Yes we have that one. If PL wanted to run ACCESS-OM2-01 is there a configuration available? AK: Andy Hogg’s currently running configuration would work. PL: Next quarter need to free up time to do other things.
MW: Might be valuable to get some score-p or similar numbers on current production model. Useful to have a record of those timings to share. Scaling test might be too much, but a profile/timing test is more tractable. RY: Any issues with score-p? Overhead? MW: Typical, 10-20%, so skews numbers but get in-depth view. Can do it one sub-model at a time. Had to hack a lot scripts, and get NH to rewrite some code to get it to work. score-p is always done at compile time. Doesn’t affect payu. Try building MOM-SIS with score-p, then try MOM within ACCESS-OM2. Then move on to CICE and maybe libaccessom2. PL: Build script does include some score-p hooks. MW: Even without score-p MOM has very good internal timers. Not getting per-rank times. score-p is great for measuring load imbalance. AH: payu has a repeat option, which repeats the same time, which removes variability due to forcing. Need to think about what time you want to repeat as far as season. AK: CICE has idealised initial ice, evolves rapidly. MW: My earlier profile runs had no ice, which affects performance. MW: Not sure it is huge, maybe 10-20%, but not huge.
MW: Overall surprised at lack of any speed up with vectorisation, and lack of slow-down with repro. PL: Will verify those numbers with 960 core config.
AH: Surprised how well it scaled. Did it scale that well on raijin? MW: The performance scaling elbow did show up lower. AH: 3x more processors per node has an effect? MW: Yes, big part of it. AH: 0.1 scaled well on raijin, so should scale better on gadi. 1/30th should scale well. Only bottleneck will be if the library can handle that many ranks.
NH: If repro flags don’t change performance that is interesting. Seem to regularly have a “what trade off does repro flags have?”, would be good to avoid. MW: Probably best to have an automated pipeline calculating these numbers. NH: People have an issue with fp0 flag. MW: Shouldn’t affect performance. NH: Make sure fp0 is in there. MW: Agree 100%.
AH: Do we have a gadi compatible master branch on gadi? AK: No, not currently. NH: At a previous TWG meeting I self-assigned getting master gadi compatible. Merged all gadi-transition branches and tested, seemed to be working ok. Subsequent meeting AK said there were other changes required, so stopped at that point. gadi-transition branches still exist, but much has already been merged and tested on a couple of configurations. Have since moved to working on other things.
NH: Close if AK has all the things he wants into gadi-transition branch. Previous merge didn’t include all the things AK wanted in there. Happy to spend more time on that after finishing JRA55 v1.4 stuff.
NH: Made code changes in all the models, but have not checked existing experiments are unchanged with modified code.
NH: v1.4 has a new coupling field, ice calving. Passing this through to CICE as a separate field. In CICE split into two fields, liquid water flux and a heat flux. MOM in ACCESS-CM2 already handles both these fields. Just had to change preprocessor flags to make it work for ACCESS-OM2 as well.
NH: Lots of options. Possible to join liquid and solid ice at atmosphere and becomes the same as we have now. Can join in CICE and have a water flux but not a heat flux.
AH: A quick update with Navid’s error. Made a little mpi4python script to run before payu to check status of nodes, and all but root node had a stale version of the work directory. Like it hadn’t been archived. Link to executable was gone, but everything else was there. Reported to NCI, Ben Menadue does not know why this is happening. Also tried a delay option between runs and this helped somewhat, but also had some strange comms errors trying to connect to exec nodes. Will next try turning off all input/output can find in case it is a file lock error. Have been told Lustre cannot be in this state.
MW: In old driver do a lot of moving directories from work to archive, and then relabelling. Is it still moving directories around to archive them? Maybe replace with hard copy of directory to archive. MOM6 driver is the MOM5 driver, so maybe all old drivers are doing this. Definitely worth understanding, but a quick fix to copy rather than move.
NH: Filesystem and symbolic links might be an issue MW: Maybe symbolic links are an issue on these mounted filesystems. AH: There was a suggestion it might be because it was running on home which is NFS mounted, but that wasn’t the problem. MW: Often with raijin you just got the same nodes back when you resubmit, so maybe some sort of smart caching.