Date: 13th October, 2020
Aidan Heerdegen (AH) CLEX ANU
Andrew Kiss (AK) COSIMA ANU
Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
Paul Leopardi (PL) NCI
Nic Hannah (NH) Double Precision
Peter Dobrohotoff (PD) CSIRO Aspendale
RF: NH found CICE initialisation broadcast timing issue. Using PIO to read in those files? NH: Just a read, regular netCDF calls. Still confused. Thought slowdown was OASIS setting up routing tables. RF pointed out docs that talked about OASIS performance and start up times. No reason ours should be > 20s. Weren’t using settings that would lead to that. Found broadcasts in CICE initialisation that took 2 minutes. Now take 20s. That is the routing table stuff. Confused why it is so bad. Big fields, surprised it was so bad. PL: Not to do with the way mpp works? At one stage a broadcast was split out to a loop with point to points. NH: Within MCT and OASIS has stuff like that. We turned it off. MW had the same issue with xgrid and FMS. Couldn’t get MOM-SIS working with that. Removed those.
MW: PL talking about FMS. RF: These are standard MPI broadcasts of 2D fields. MW: Once collectives were unreliable, so FMS didn’t use them. Now collectives exceed point to point but code never got updated. PL: Still might be slowing down initialisation?
NH: Now less than a second. Big change in start up time. Next would be to use newer version of OASIS. From docs and papers could save another 10-15s if we wanted to. Not sure it is worth the effort. Maybe lower hanging fruit in termination. RF: Yes, another 2-3 minutes for Pavel’s runs. Will need to track exactly what he is doing, how much IO, which bits stalling. Just restarts, or finalising diagnostics? AH: What did you use to track it down? NH: Print statements. Binary search. Strong hunch it was CICE. RF: Time wasn’t being tracked in CICE timers. NH: First suspected my code, CICE was the next candidate. AH: Close timer budgets? RF: A lot depend on MPI, others need to call system clock.
NH: Will push those changes to CICE so RF can grab them.
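Note: a minimal sketch of the "close timer budgets" idea, i.e. comparing the sum of named phase timers against total wall time so that untimed sections (like the broadcasts above) show up as a remainder. This is not CICE's timer code; the class name PhaseTimers, the phase name and the injectable clock are all hypothetical.

```python
import time
from contextlib import contextmanager


class PhaseTimers:
    """Accumulate wall-clock time per named phase (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the sketch can be tested
        self._start = clock()    # start of the overall budget
        self.elapsed = {}

    @contextmanager
    def phase(self, name):
        t0 = self._clock()
        try:
            yield
        finally:
            dt = self._clock() - t0
            self.elapsed[name] = self.elapsed.get(name, 0.0) + dt

    def untracked(self):
        """Wall time since construction that no phase timer accounts for."""
        total = self._clock() - self._start
        return total - sum(self.elapsed.values())


if __name__ == "__main__":
    timers = PhaseTimers()
    with timers.phase("read_grid"):   # hypothetical phase name
        time.sleep(0.01)
    time.sleep(0.02)                  # stands in for an untimed broadcast
    print(timers.elapsed, "untracked:", round(timers.untracked(), 3))
```

A large untracked remainder is the signal that something outside the existing timers is taking the time.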
NH: Made another mention in CICE. Order in coupling: both models do work, then ocn send to ice, ice recv from ocn, ice send to ocn, ocn recv from ice. So send and recv are paired after the work. Not best way. Both should work, both send and then both recv. Minute difference, but does mean ocean is not waiting for ice. AH: Might affect pathological cases.
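Note: a toy timing model of the two coupling orders NH describes. It is a sketch under stated assumptions, not a measurement: sends are assumed to be buffered (non-blocking), every send or recv is assumed to cost the same small time t_msg, and the function names and numbers are made up for illustration.

```python
def ocean_step_end_paired(t_ocn, t_ice, t_msg):
    """Current order: ocn send, ice recv, ice send, ocn recv."""
    ocn_sent = t_ocn + t_msg
    # Ice must complete its recv from the ocean before it sends.
    ice_received = max(t_ice, ocn_sent) + t_msg
    ice_sent = ice_received + t_msg
    # Ocean recv completes once the ice data has been sent.
    return max(ocn_sent, ice_sent) + t_msg


def ocean_step_end_send_first(t_ocn, t_ice, t_msg):
    """Proposed order: both work, both send, then both recv."""
    ocn_sent = t_ocn + t_msg
    ice_sent = t_ice + t_msg   # ice sends straight after its own work
    return max(ocn_sent, ice_sent) + t_msg


if __name__ == "__main__":
    # Hypothetical step times (seconds): ocean 10, ice 8, message 0.5.
    print(ocean_step_end_paired(10.0, 8.0, 0.5))
    print(ocean_step_end_send_first(10.0, 8.0, 0.5))
```

In this model the ocean only gains a couple of message times per coupling step, which fits the "minute difference" comment.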
AH: Re finalisation, no broadcasts at end too? NH: Even error messages, couple of megabytes of UCX errors. Maybe improve termination by 10% by cleaning that up. CICE using PIO for restarts. Auscom driver is not using PIO and has other restarts. ACCESS specific restarts are not using PIO. Could look at. From logs, YATM finished 3-4 minutes before everything else. AH: Just tenth? NH: Just at 0.1. Not sure a problem, just could be better.
Progress in PIO testing in CICE with ACCESS-OM2
NH: AK done most of the testing. Large parts of the globe where no PE does a write. Each PE writing its own area. Over land no write, so filled with netCDF _FillValue. When the ice layout changes, different CPUs have a different pattern of missing tiles and can read uninitialised values, same as MOM. Way to get around that is to just fill with zeroes. Could then use any ice restart with any new run with a different PE layout. AH: Why does computation care over land? NH: CICE doesn’t apply land mask to anything. Not like MOM which excludes calculations over land, just excluding calc over land where there is no CPU. Code isn’t careful. If change PE layout, parts which weren’t calculated are now calculated. RF: Often uses comparison to zero to mask out land. When there is a _FillValue doesn’t properly detect that isn’t land. MW: A lot of if water statements in code. RF: Pack into 1D array with offsets. NH: Not how I thought it works. RF: Assumes certain things have been done. Doesn’t expect ever to change, running the same way all the time. Because dumped whole file from one processor, never ran into the problem. NH: Maybe ok to do what they did before: changed _FillValue to zero. RF: Nasty though, is a valid number. NH: Alternative is to give up on using restarts. RF: Process restarts to fill where necessary, and put in zeroes where necessary. NH: Same with MOM? How do we fix it with MOM? RF: Changed a while ago. Tested on _FillValue or _MissingValue, especially in thickness module. PL: Does this imply being able to restart from restart file means can’t change processor layout? Just in CICE or also MOM? RF: Will be sensitive to tile size. Distribution of tiles (round-robin etc) still have the same number of tiles, so not sensitive. AH: With MOM always have to collate restarts if changing IO layout. AK: Why is having _FillValue of zero worse than just putting in zero? RF: Often codes check for _FillValue, like MissingValue. So might affect other codes.
NH: Ok settle on post-processing to reuse CICE restarts in different layouts. AK: Sounds good. Put a note in the issue and close it. AH: Make another issue with script to post process restarts. NH: Use default netCDF _FillValue.
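Note: a minimal sketch of the post-processing step agreed above, replacing cells that still hold the default netCDF fill value with zero so a restart can be reused with a different PE layout. It works on an in-memory 2D list only. Reading and writing the actual restart file is left out, the function name fill_unwritten is hypothetical, and it assumes the restart fields are doubles.

```python
# Default netCDF fill value for doubles (NC_FILL_DOUBLE in netcdf.h).
NC_FILL_DOUBLE = 9.9692099683868690e36


def fill_unwritten(field, fill_value=NC_FILL_DOUBLE, replacement=0.0):
    """Return a copy of a 2D field with fill-valued cells replaced."""
    return [
        [replacement if value == fill_value else value for value in row]
        for row in field
    ]


if __name__ == "__main__":
    # Made-up 2x2 field: two cells were never written by any PE.
    restart_field = [[1.5, NC_FILL_DOUBLE], [NC_FILL_DOUBLE, 0.25]]
    print(fill_unwritten(restart_field))
```

Keeping the default _FillValue in the files written by the model, and zeroing only in a separate script, avoids RF's concern about zero being a valid number that other codes may check against.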
Scaling performance of new Configurations for 0.1
NH: Working on small (6K), medium (10K+1.2K), large (18K) and x-large (33K+) ACCESS-OM2-01 configs. Profiling and tuning. PIO improves a lot. Small is better than before. CICE has scalability issues, but that was before PIO, and with little IO? PL: Removed as much IO as possible. Hard-wired to remove all restart writing. Some restart writing didn’t honour write_restart flag. Not incorporated in main code, still on a branch.
NH: Medium scales ok. Not as good as MOM, but efficiency not a lot worse. Large and x-large might not be that efficient. x-large just takes a long time to even start. 5-10 minutes. Will have enough data for a presentation soon. Still tweaking balance between models. Can slow down both when decreasing the number of CICE CPUs *and* when increasing them. Can speed up current config a little by decreasing the number of CICE CPUs. It is a balance between the average time that each CPU takes to complete ice step, and the worst case time. As CPUs increase, the worst case maximum increases. Mean minus worst case decreases. RF: Changing tile size? NH: Haven’t needed to. Trying to keep number of blocks/CPU around 5-10. RF: The fewer the tiles, the larger the chance they are not distributed evenly across ice/ocean. Only 3-4 tiles per processor, some may have 1, others with 2. So 50% difference in load. About 6 tiles/processor is a sweet spot. NH: Haven’t created a config less than 7. From PL work, not bad to err on the side of having more. Not noticeable difference between 8 and 10, err on higher side. Marshall, did you find CICE had serious limits to scalability?
MW: Can’t recall much about CICE scaling work. Think some of the bottlenecks have been fixed since then. NH: Always wanted to compete with MOM-SIS. Seems hard to do with CICE. MW: Recall CICE does have trouble scaling up computations. EVP solver is much more complex. SIS is a much simpler model.
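Note: a small arithmetic sketch of RF's point about tiles per processor. It assumes blocks are spread as evenly as possible and that every block costs the same, which is a simplification (ice work per block varies). The function names and block counts are hypothetical.

```python
def block_spread(n_blocks, n_cpus):
    """Smallest and largest number of blocks any CPU gets."""
    low, extra = divmod(n_blocks, n_cpus)
    high = low + (1 if extra else 0)
    return low, high


def imbalance(n_blocks, n_cpus):
    """Fraction of the busiest CPU's time the least loaded CPU sits idle."""
    low, high = block_spread(n_blocks, n_cpus)
    return (high - low) / high


if __name__ == "__main__":
    # Made-up counts: 10 blocks on 7 CPUs gives 1 or 2 blocks per CPU.
    print(block_spread(10, 7), imbalance(10, 7))
    # 45 blocks on 7 CPUs gives 6 or 7 blocks per CPU.
    print(block_spread(45, 7), imbalance(45, 7))
```

With 1 or 2 blocks per CPU the spread is 50%, with 6 or 7 it drops to about 14%, which is why more blocks per CPU is the safer side to err on.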
AH: When changing the number of CPUs and testing the sensitivity of run time, are you also changing the CPUs/block? NH: Algorithm does distribution. Tell it how many blocks overall then give it any number of CPUs and will distribute it over the CPUs. Only two things. Build system conflates them a little. Uses the number of CPUs to calculate the number of blocks. Not necessary. Can just say how many blocks you want and then change how many CPUs as much as you want. RF: I wrote a tool to calculate the optimal number of CPUs. Could run that quickly. NH: What is it called? RF: Under v45. Maybe masking under my account. NH: So don’t change the number of blocks, just change number of CPUs in config. So can use CICE to soak up extra CPUs to use full ranks. That is why we were using 799. Should change that number with different node sizes. AH: Small changes of the number of CPUs just means a small change to average number of blocks per CPU? Trying to understand sensitivity of run time considering such a small change. NH: Some collective ops, so the more CPUs you have the slower. MW: Have timers for all the ranks. Down at subroutine level? NH: No, just a couple of key things. Ice step time.
MW: Been using perf. Linux line profiler. Getting used to using it in parallel. Very powerful. Tell you exactly where you are wasting your time. AH: Learning another profiler? MW: Yes, and there is no documentation. score-p is good, but overhead is too erratic. Are the Allinea tools available? PL: Yes, bit restricted. MW: Small number of licenses? Why I gave up on them.
1 degree CICE ice config
AK: Re CICE core count, 1 degree config has 241 cores. Wasting 20% of CPU time on 1 degree. Currently has 24 cores for CICE. Should reduce to 23? NH: Different for 1 degree, not using round-robin. Was playing with 1 degree. Assuming wasn’t as bad with gadi, were wasting 11. Maybe just add or subtract a stripe to improve efficiency. Will take a look. Improve the SU efficiency a lot.
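Note: a sketch of the whole-node arithmetic behind the core count question. The node sizes are assumptions not stated in the minutes: 48 cores per node on gadi and 28 cores per node on the previous machine. The function name idle_cores is hypothetical.

```python
def idle_cores(total_cores, cores_per_node):
    """Cores charged but unused when a job must take whole nodes."""
    nodes = -(-total_cores // cores_per_node)   # ceiling division
    return nodes * cores_per_node - total_cores


if __name__ == "__main__":
    # 241 cores is the 1 degree total quoted above; node sizes are assumed.
    for cores_per_node in (28, 48):
        print(cores_per_node, idle_cores(241, cores_per_node))
    # Dropping one CICE core (24 to 23) gives 240 cores in total.
    print(48, idle_cores(240, 48))
```

Under these assumed node sizes, 241 cores leaves 11 idle on 28-core nodes and 47 idle on 48-core nodes (roughly 20% on top of the 241 in use), while 240 cores fills 48-core nodes exactly.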
RF: Fortran programs in /scratch/v45/raf599/masking which will work out the processors for MOM and CICE. Also a couple of FERRET scripts which, given a sample ice distribution, will tell you how much work is expected to be done for a round-robin distribution. Ok, so not quite valid. Will look at changing the script to support sect-robin. More of a dynamic thing for how a typical run might go. Performance changes seasonally. NH: Another thing we could do is get a more accurate calculation of work per block. Work per block is based on latitude. The higher the latitude, the more work that block is going to do. Can also give it a file specifying the work done. Is ice evenly distributed across latitudes? RF: Index space, and is variable. Maybe some sort of seasonal climatology, or an annual thing. AH: Seasonal would make sense. AH: Can give it a file with a time dimension? RF: Have to tell it the file, use a script. Put it in a namelist, or get it to figure out. NH: Hard to know if it is worth the effort. AH: Start with a climatology and see if it makes a difference. Ice tends to stick close to coastlines. Antarctic peninsula will have ice at higher latitudes than in Weddell Sea for example. Also in the Arctic the ice drains out at specific points. RF: Most variable southerly part is north of Japan, coming a fair way south near Sakhalin. Would have a few tiles there with low weight. NH: Hard to test, need to run a whole year, not just a 10 day run. Doing all scaling tests with warm run in June. MW: CICE run time is proportional to amount of ice in ocean. There is a 10% seasonal variation. RF: sect-robin of small tiles tries to make sure each processor has work in northern and southern hemispheres. Others divide into northern and southern tiles. SlenderX2? NH: That is what we’re using for the 1 degree.
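Note: a toy comparison of how per-block work weights could change load balance. It is a sketch only and not CICE's distribution code: the weights are made up, round_robin just deals blocks out in order, and greedy_by_weight is a generic heaviest-block-first heuristic swapped in for illustration.

```python
import heapq


def round_robin(weights, n_cpus):
    """Deal blocks out in order, ignoring their weights."""
    loads = [0.0] * n_cpus
    for index, weight in enumerate(weights):
        loads[index % n_cpus] += weight
    return loads


def greedy_by_weight(weights, n_cpus):
    """Give each block, heaviest first, to the least loaded CPU."""
    heap = [(0.0, cpu) for cpu in range(n_cpus)]
    heapq.heapify(heap)
    for weight in sorted(weights, reverse=True):
        load, cpu = heapq.heappop(heap)
        heapq.heappush(heap, (load + weight, cpu))
    loads = [0.0] * n_cpus
    for load, cpu in heap:
        loads[cpu] = load
    return loads


if __name__ == "__main__":
    # Hypothetical work per block, e.g. from an ice climatology.
    block_work = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
    print(max(round_robin(block_work, 2)))
    print(max(greedy_by_weight(block_work, 2)))
```

The busiest CPU sets the ice step time, so the quantity to compare between schemes is the maximum load, not the mean.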
AH: Was getting the MOM Jenkins tests running specifically to test the FMS pull request which uses subtree so you can switch in and out FMS versions easily. Very similar to what MW had already done. Just have to move a few files out of the FMS shared directory. When I did the tests it took me a week to find MW had already found all these bugs. When MW put in ulm he reverted some changes to multithreaded reads and writes so the tests didn’t break. MW: Quasi-ulm. AH: Those changes were hard coded inside a MOM repo. MW: In the FMS code in MOM. AH: Not a separate FMS branch where they exist? MW: No. Maybe some clever git cherry-picking would work. MW: Has changed a lot since then. Unlikely MOM5 will ever change FMS again. AH: Intention was to be able to make changes in a separate fork. MW: Changes to your own FMS? Ok. Hopefully would have done it in a single commit. AH: Allowed those options but did nothing with them? MW: Don’t remember, sorry. AH: Yours and Rui’s changes are in an FMS branch? MW: Yes. Changes I did there would not be in shared dir of MOM5. Search for commits by me maybe. If you’re defining subtree wouldn’t you want the current version to be what is in shared dir right now? AH: Naively thought that was what ulm was. MW: I fixed some things, maybe bad logic, race conditions. There were problems. Not sure if I did them on FMS or MOM side. AH: Just these things not supported in namelist, not important. Will look and ask specific questions.
AH: Given Martin Dix the latest 0.25 bathymetry for the ACCESS-CM2-025 configuration. Also need to generate regridding weights, will rely on latest scripts AK used for ACCESS-OM2-025. AK: Was issue with ESMF_RegridWeightGen relying on a fix from RF. Were you going to do a PR to the upstream, RF? RF: You can put in the PR if you want. Just say it is coming from COSIMA.
MW: Regarding hosting MOM6 code in mom-ocean GitHub repo. Who is the caretaker of that group? AH: Probably me, as I’m paid to do it. Not sure. MW: Natural place to move the source code. They’re only hesitant because not sure if it is Australian property. Also want complete freedom to add anything there, including other dependencies like CVMix and TEOS-10. AH: I think the more the better. Makes a more vital place. MW: So if Alistair comes back to me I can say it is ok? AH: Steve is probably the one to ask. MW: Steve is the one who is advocating for it. MW: They had a model of no central point, now they have five central points. NOAA-GFDL is now the central point. Helps to distance the code from one organisation. Would be an advantage to be in a neutral space. COSIMA doesn’t have a say? AH: COSIMA has its own organisation, so no. AH: Just ask for Admin privileges. MW: You’re a reluctant caretaker? AH: NH just didn’t have time, so I stepped in and helped Steve with the website and stuff. Sounds great.