Technical Working Group Meeting, November 2020

Minutes

Date: 11th November, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Paul Leopardi (PL), Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale
  • James Munroe (JM) Sweetwater Consultants
  • Marshall Ward (MW) GFDL


Updated ACCESS-OM2-01 configurations

NH: Testing for practical/usable 0.1 configs. Can expand configs as there are more resources available on gadi and CICE is not such a bottleneck. When IO is on, scalability has been a problem, particularly with CICE. IO is no longer the bottleneck with CICE, but CICE still has a scalability problem. Previously all I/O was routed through the root processor with a lot of gathers etc. So now have a chance at larger, practically useful configs.
NH: Doesn’t look like we can scale up to tens of thousands of cores.
NH: Coupling diagram. YATM doesn’t receive any fields from the ICE or OCEAN model, just a sync message, so it can send as much as it wants. Seemed a little dangerous, maybe stressing MPI buffers, or not the intended usage. YATM always tries to keep ahead of the rest of the model, and sends whenever it can. Does hang on a sync message. Will prepare coupling fields and, once done, will do a non-blocking sync, so there should always be a send waiting for CICE to receive. When the ice and ocean models couple on every time step, both send and then both receive, eliminating the time between sending and waiting to receive. CICE should never wait on YATM. CICE and MOM need to run at roughly the same pace; MOM should wait less as it consumes more resources. Internally CICE and MOM processors should be balanced. Previously CICE I/O has been a big problem, usually causing other CICE CPUs to wait, and MOM CPUs too.
NH: Reworked coupling strategy originally from CSIRO to do as best as we can.
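A minimal sketch of the send-first, receive-second ordering described above, with mpi4py standing in for the actual OASIS/libaccessom2 calls (rank roles and field sizes are illustrative only):

    # Hypothetical sketch of the coupling order change (not the OASIS code).
    # Old order: ocn sends, ice receives, ice sends, ocn receives; the second
    # exchange cannot start until the first completes.
    # New order: both models post their send first, then both receive, so the
    # sends overlap and neither side waits on the other's compute.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # pretend rank 0 is MOM, rank 1 is CICE
    peer = 1 - rank

    fields_out = np.zeros(100)      # coupling fields computed this step
    fields_in = np.empty(100)

    # Post the send immediately after the work is done...
    send_req = comm.Isend(fields_out, dest=peer, tag=0)
    # ...then block on the receive; the peer has already posted its send too.
    comm.Recv(fields_in, source=peer, tag=0)
    send_req.Wait()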
NH: Four new configs: small, medium, large and x-large. New configs look better than existing because the old config did daily ice output, which it was never configured to do, so we are comparing new configs to the worst existing config. All include daily ice and ocean output. Small is basically the same as the previous config. All tests are warm starts from April, from a restart from AK’s run, so lots of ice. Three changes from existing in all configs: PIO, a small coupling change where both models send simultaneously, and a small change to initialisation in CICE. Medium config (10K) is very good; efficiency very similar to existing/small. Large is not scaling too well. Hitting the limit on CICE scaling. Probably the limit of efficiency. X-Large hasn’t been able to be tested. Wanted to move projects, run a larger job. AH: X-large likely to be less efficient than large. NH: Unlikely to use it. Probably won’t use Large either.
AH: MOM waiting time doubles from medium to large. NH: MOM scaling well. CICE doesn’t scale well. Tried a few different processor counts, doesn’t help much. Problem getting the right number of CICE cpus; doesn’t speed up if you add more. AK: Played around with block numbers? NH: Keep blocks per cpu around 6 or 7. As high as 10 there is no downside. Beyond that, not much to do.
PL: Couldn’t see much change with the number of blocks. Keeping it the same seemed better than varying. NH: Did you try a small number like 3-4? PL: No. RF: Will change by season. Need enough blocks scattered over the globe so you never blow out. PL: I was only testing one time of the year. NH: Difficult to test properly without long tests.
NH: PIO is great. Medium is usable. Large and X-Large probably too wasteful. Want to compare to PL and MW results. Might be some disagreement with the medium config: from PL’s plots it shouldn’t have scaled this well. PL: Maybe re-run my tests with new executables/configs. NH: Good idea. Will send those configs. Will run X-Large anyway.
NH: Had to fix ocean and ice restarts after changing layout. Didn’t expect it would be required but got a lot of crashes. Interpolate right onto land and then mask land out again. In the ice case, making sure there is no ice on land. PL: I had to create a restart for every single grid configuration.
NH: Not clear why this is necessary. Thought we’d fixed that, at least with the ocean. Maybe PIO introduced something weird. Would be good to understand. Put 2 scripts in the topography_tools repo: fix_ice_restarts.py and fix_ocean_restarts.py.
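A minimal sketch of the ocean case under assumed file and variable names: extend ocean values onto land so a new layout reads something sensible, leaving the model’s own land mask to remove them again (the real implementation is fix_ocean_restarts.py in topography_tools):

    # Hypothetical sketch of the restart-fixing idea, not the actual script.
    import numpy as np
    import netCDF4
    from scipy import ndimage

    # Assumed restart file and variable names, for illustration only.
    ds = netCDF4.Dataset("ocean_temp_salt.res.nc", "r+")
    temp = ds["temp"][:]                 # masked array: land cells masked
    land = np.ma.getmaskarray(temp)

    # For every cell, the index of the nearest ocean (unmasked) cell.
    # Applied over all dimensions at once for brevity; per-level in practice.
    nearest = ndimage.distance_transform_edt(
        land, return_distances=False, return_indices=True)

    # Extend ocean values onto land, then write back; the model's land mask
    # removes them again at run time.
    ds["temp"][:] = temp.data[tuple(nearest)]
    ds.close()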
RF: Will you push changes for initialisation? Would like to test them with our 3 day runs. NH: Will do.
RF: Working on i2o restarts. Code writes out a variable, defines the next variable, with a chance of copying the file each time. About 10-12 copies of the files in and out. NH: OASIS? RF: No, the coupled_netcdf thing. Rewritten and will test today. Simple change to define variables in advance. NH: Still doing a serial write? RF: Yes. Could probably be done with PIO as well. Will just build on what is there at the moment.
AH: Large not as efficient as Medium or Small, but more efficient than current. NH: Due to daily output and no PIO. AH: Yes, but still an improvement on current. AH: How does this translate to model years per submit? AK: Small is 3 months in just over 2.5 hours. Should be able to do 1 year/submit in large. AH: Might be a reason to use large. Useful for example if someone wanted to do 3-daily outputs for particle tracking for RYF runs. Medium is 6 months/submit? Or maybe a year? AK: Possibly. Not so important when the queue wait was longer. NH: I calculate medium would take 5.2 hours/year. Could give medium an extra thousand processors on the ocean. Might be smarter. AK: Target wall time.
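Back-of-envelope throughput and cost arithmetic behind these figures, with assumed numbers (a 10K-core medium config and gadi’s nominal 2 SU per core-hour on the normal queue):

    # Rough throughput/cost arithmetic (all numbers assumed from the
    # discussion; gadi's normal queue charges ~2 SU per core-hour).
    def hours_per_model_year(months_per_submit, hours_per_submit):
        return hours_per_submit * 12 / months_per_submit

    small = hours_per_model_year(3, 2.5)   # ~10 h/yr: "3 months in 2.5 hours"
    medium = 5.2                           # NH's figure for the 10K config

    su_per_year = 10_000 * medium * 2      # ~104 kSU per model year
    print(f"small: {small:.1f} h/yr, medium: {medium} h/yr, "
          f"medium cost: {su_per_year / 1000:.0f} kSU/yr")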
AK: About to start 3rd IAF cycle. Andy keen to try medium. Good enough for production? NH: No real changes. AK: Do you have a branch I could look at? NH: Maybe test a small bump on medium to get it under 5 hours. AK: Hold off until a config that gets under 5 hours? NH: Will spend the rest of the day and get it under 5 hours. AK: Next few days, updates to diagnostics.
PL: Any testing to see if it changes climate output? NH: None. Don’t know how to do it. MOM should be reproducible. Assume CICE isn’t. AH: Run, check gross metrics like ice volume. Not trying to reproduce anything. If there is an issue, don’t know whether it is due to this or not. Cross that bridge when we have to.
AH: Any code changes as well, on top of the core/cpu bumps? RF: Just a few minutes at the beginning and end.
MW: Shouldn’t change bit repro? NH: Never verified CICE gives the same answers with different layouts. Assuming not. MW: Sorry, thought you meant moving from serial to PIO. AH: AK has done testing with that. AK: With PIO the only issue is restart files within land-masked processors. MW: Does CICE do a lot of collectives? Only real source of issues with repro. PL: Fixing the restart code might perturb things? NH: Possibly? AH: Can’t statistically test enough at tenth. PL: Martin Dix sent me an article about how to test climate using ensembles.
AH: When trying to identify what doesn’t scale? Already know? EVP solver? NH: Just looked at the simplest numbers. Might be other numbers we could pull out to look at it. Maybe other work, already done by PL and MW. If the goal is to improve scalability of CICE can go down that rabbit hole. AH: Medium is efficient and a better use of effort to concentrate on finessing that. Good result too. NH: A year in a single submit with good efficiency is a good outcome. As long as it can get through the queue, and had no trouble with testing. AH: Throughput might be an issue, but shouldn’t be with a much larger machine now.
AK: Less stable at 10K? Run into more MPI problems with larger runs. Current (small) config is pretty reliable. Resub script works well; falls over 5-10% of the time. Falls over at the start and gets resubmitted. Dies at the beginning so no zombie jobs that run to the wall time limit. NH: Should figure out if it is our problem or someone else’s? In the libraries? AK: Don’t know. Runs almost every time when restarted. Tends to be MPI stuff, sometimes lower down in the interconnect. Do tend to come in batches. MW: There were times when libraries would lose connections to other ranks. Those used to happen with lower counts on other machines. Depends on the quality of the library. MPI is a thin layer above garbage that may not work as you expect at very high core counts.
PL: What happens when the job restarts? Different hardware with different properties? AK: Not sure if it is hardware. MW: mxm used to be unstable. JM: As we get to high core count jobs, maybe we can’t trust that every node will be ok for the whole run? Should we get used to this? In the cloud this is expected. In future we will have to cope with this? That one day we cannot guarantee an MPI rank will ever finish?
MW: Did see a presentation with a heap of OpenMPI devs; they acknowledge that MPI is not well suited to that problem. Not designed for pinging and rebooting. Actions in implementation and standard to address that scenario. Not just EC2. Not great. Once an MPI job starts to fray it falls to pieces. Will take time for library devs to address the problem, and also whether MPI is the appropriate back end for this IPC scenario. PL: 6-7 yrs ago ANU had a project with Fujitsu on exascale and what happens when hardware goes away. Not MPI as such. Mostly mathematics of sparse grids. Not sure what happened to it. NH: With cloud you expect it to happen, and know why. We might expect it, but don’t know why. If we know why, if it is dropping nodes we can cope. If it is a transient error due to corrupted memory from bad code in our model, we want to know and fix it. MW: MPI doesn’t provide an API to investigate it. MPI traces on different machines (e.g. Cray v gadi) look completely different. Answering that question is implementation specific. Not sure what extra memory is being used, which ports, and no way to query. MPI may not be capable of doing what we’re asking for, unless they expand the API and implementors commit to it.
AH: Jobs as currently constituted don’t fit the cloud computing model, as jobs are not independent; they rely on results from neighbouring cells. Not sure what the answer is?
MW: Comes up a lot for us. NOAA director is from the private sector. Loves cloud compute. Asks why don’t we run on the cloud? Any information for a formal response could be useful longer term. Same discussions at your places? JM: Problem is in infrastructure. Need an MPI with redundancy built into the model. Scheduler can keep track and resubmit. Same thing happens in TCP networking. Can imagine a calculation layer which is MPI-like. MW: We need infiniband. Can’t rely on TCP; useful stock answer. JM: Went to a DMAC meeting, advocating regional models in the cloud. Work on 96 cores. Spin up a long tail of regional models. Decent interconnect on small core counts. Maybe single racks within data centres. MW: Looking at small scale MOM6 jobs. Would help all of us if we keep this in mind, what we do and do not need for big machines. Fault tolerance not something I’ve thought about. JM: NH’s point about the difference between faulty code and the machine being bad is important.
AH: Regarding the idea of having an MPI scheduler that can cope with nodes disappearing: it is a massive problem to save model state and transfer it around. A python pickle for massive amounts of data. NH: That is what a restart is. MW: Currently do it the dumbest way possible with files. JM: Is there a point where individual cores fail and the MPI mesh breaks down? Destroys the whole calculation. Anticipating the future. NH: Scale comes down to the library level. A lot of value in being able to run these models on the cloud. Able to run on machines that are interruptible; generally cheaper than defined duration. Simplest way is to restart more often, maybe every 10 days. Never lose much when you get interrupted. Lots we could do with what we have. AH: Interesting, even if they aren’t routinely used for that. Makes the models more flexible, usable, shareable.
MW: Looking at the paper PL sent, are AWS and Cori comparable for moderate size jobs? PL: Looking at a specific MPI operation. JM: Assuming the machine stays up for the whole job. Not really testing redundancy.

Testing PIO enabled FMS in MOM5

AH: Last meeting talked to MW about his previous update of FMS to ulm. MW put code directly into the shared dir in the MOM5 source tree. Made and merged a PR where the shared dir is now a subtree. Found the exact commit MW had used from the FMS repo, and recreated the changes he subsequently made so old configs still work: adding back some namelist variables that don’t do anything, but mean old configs still work. FMS code is identical, but now we can swap out FMS more easily with the subtree. There is a fork of FMS in the mom-ocean GitHub organisation. The master branch is the same as the code that is in the MOM5 subtree. Have written a README to explain the workflow for updating this. If we want to update FMS: create a feature branch, create a PR, create a matching branch on MOM5, use subtree to update FMS and make a PR on the MOM5 repo. This will then do all the compilation tests necessary for MOM5. Both PRs are done concurrently. Bit clunky, but a lot easier than otherwise, and changes we make are tied to the FMS codebase. Wasn’t the case when they were bare code changes directly included in MOM5.
AH: Can now look at testing a PIO enabled FMS branch based on the work of RY and MW. Is there an appetite for that? MW and RY’s changes are on a branch of the FMS fork. Based on a much more recent version of FMS? MW: Pretty old, whatever it is based on. 1400 commits behind. AH: Current is 2015, PIO branch is based on 2018. MW: Must be compatible with MOM5 because it was tested. Not a big issue if it is ulm or not.
AH: So should we do that? I could make a branch with this updated PIO enabled FMS. NH: Great idea. Uses what we’ve already done. AH: RY and MW have done it all, just have to include it. Is compression of outputs handled by the library? RY: Testing parallel compression. Code does not need changing. Just need to change the deflate level flag in FMS; parallel IO can pick that up. No need for code changes. Not compatible with current HDF version 1.10: if you specify a chunk layout it will crash. Tested HDF 1.12; works much better than 1.10. Performance looks pretty good. Compression ratio 60%: sequential IO 2.06TB, using compression 840GB. Performance is better than sequential IO. Currently available HDF library not compatible. Will present results at next meeting. Not rushing too much, need stable results. When netCDF supported parallel compression, always wanted to test that, and see if there are code changes. Current HDF library only compatible with certain chunk layouts. AH: Certainly need stability for production runs.
NH: Had similar issues. Got p-netcdf and PIO working. PIO is faster, no idea why. Also had trouble with crashes, mostly in MPI libraries and not netCDF libraries. Used MPI options to avoid problematic code. PIO and compression turned on, very sensitive to chunk size. Could only get it working with chunks the same as block size. Wasn’t good, blocks too small. Now got it working with MPI options which avoid bad parts of the code. RY: ompi is much more stable. NH: Also had to manually set the number of aggregators. RY: Yes, HDF v1.12 is much more stable. Should try that. Parallel IO works fine, so not an MPI issue, so it definitely comes from the HDF5 library. So much better moving to HDF 1.12. MW: Is PIO doing some of the tuning RY has done manually? NH: Needs more investigation, but possibly being more intelligent about gathers before writes. MW: RY did a lot of tuning about how much to gather before writes. NH: Lots of config options. Didn’t change much. Initially expected parallel netCDF and PIO to be the same, and surprised PIO was so much better. Asked on GitHub why that might be, but got no conclusive answer.
AH: So RY, hold off and wait for changes? RY: Yes, doing testing, but the same code works. AH: Even though it didn’t support deflate level before? RY: It existed with serial IO. PIO can pick up the deflate level. Before it would ignore or crash.
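For reference, parallel compressed writes of the kind RY describes require collective access to each compressed variable; a minimal netCDF4-python sketch, independent of the FMS implementation (variable names and sizes are illustrative, and it assumes a parallel-enabled netCDF build, netCDF-C 4.7.4 or later; RY reports HDF5 1.12 is far more stable than 1.10):

    # Minimal sketch of a parallel compressed write (not the FMS code).
    from mpi4py import MPI
    import numpy as np
    from netCDF4 import Dataset

    comm = MPI.COMM_WORLD
    rank, nranks = comm.Get_rank(), comm.Get_size()

    nc = Dataset("out.nc", "w", parallel=True, comm=comm, info=MPI.Info())
    nc.createDimension("y", 300 * nranks)
    nc.createDimension("x", 360)
    v = nc.createVariable("sst", "f4", ("y", "x"),
                          zlib=True, complevel=1,       # deflate level 1
                          chunksizes=(300, 360))        # one chunk per rank
    v.set_collective(True)   # compression demands collective parallel I/O
    v[rank * 300:(rank + 1) * 300, :] = np.full((300, 360), float(rank))
    nc.close()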

Miscellaneous

MW: At a prior MOM6 meeting there was an obvious preference to move the MOM6 source code to the mom-ocean organisation, if that is ok with the current occupants. Hadn’t had a chance to mention this when NH was present. If they did that, they were worried about you guys getting upset. AH: We are very happy for them to do this. NH: Don’t know the time frame. Consensus was unanimous? AH: Definitely from my point of view; makes the organisation a vibrant place. NH: I don’t own or run it, but think it would be cool to have all the codebase in one organisation. Saw the original MOM1 discs in Matt England’s office, which spurred putting all the versions on GitHub. So would be awesome. MW: COSIMA doesn’t have a stake or own this domain? AH: Not at all. I have just invited MW to be an owner, and you can take care of it. MW: Great, all settled.
PL: Action items for NH? NH: Send PL configs, commit all code changes tested on the current config and give them to RF, and fix up the medium config. AH: A fourth: share the config with RF and AK, and please send me the slides you presented. RY: PIO changes all there for PL to test? NH: Yes. PL: Definitely need those to test the efficiency of the code. AK: PIO is in the master branch of CICE on the repo.
PL: Have an updated report that is currently being checked over by Ben. Will release when that is given the ok.
AH: Finally finished the cmake stuff for MOM5 for all the other MOM5 test configs. Will mean MOM5 can compile in 5 minutes, as parallel compilation works better due to the dependency tree being correctly determined.

Technical Working Group Meeting, October 2020

Minutes

Date: 13th October, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

CICE Initialisation

RF: NH found CICE initialisation broadcast timing issue. Using PIO to read in those files? NH: Just a read, regular netCDF calls. Still confused. Thought slowdown was OASIS setting up routing tables. RF pointed out docs that talked about OASIS performance and start up times. No reason ours should be > 20s. Weren’t using settings that would lead to that. Found broadcasts in CICE initialisation that took 2 minutes. Now take 20s. That is the routing table stuff. Confused why it is so bad. Big fields, surprised it was so bad. PL: Not to do with the way mpp works? At one stage a broadcast was split out to a loop with point to points. NH: Within MCT and OASIS has stuff like that. We turned it off. MW had the same issue with xgrid and FMS. Couldn’t get MOM-SIS working with that. Removed those.
MW: PL talking about FMS. RF: These are standard MPI broadcasts of 2D fields. MW: At one time collectives were unreliable, so FMS didn’t use them. Now collectives outperform point to point but the code never got updated. PL: Still might be slowing down initialisation?
NH: Now less than a second. Big change in start up time. Next would be to use a newer version of OASIS. From docs and papers could save another 10-15s if we wanted to. Not sure it is worth the effort. Maybe lower hanging fruit in termination. RF: Yes, another 2-3 minutes for Pavel’s runs. Will need to track exactly what he is doing, how much IO, which bits are stalling. Just restarts, or finalising diagnostics. AH: What did you use to track it down? NH: Print statements. Binary search. Strong hunch it was CICE. RF: Time wasn’t being tracked in CICE timers. NH: First suspected my code, CICE was the next candidate. AH: Close timer budgets? RF: A lot depend on MPI, others need to call the system clock.
NH: Will push those changes to CICE so RF can grab them.
NH: Made another change in CICE. Order in coupling: both models do work, ocn sends to ice, ice recvs from ocn, ice sends to ocn, ocn recvs from ice. So send and recv are paired after the work. Not the best way. Both should work, then both send and then both recv. Minute difference, but does mean the ocean is not waiting for ice. AH: Might affect pathological cases.
AH: Re finalisation, no broadcasts at the end too? NH: Even error messages, a couple of megabytes of UCX errors. Maybe improve termination by 10% by cleaning that up. CICE using PIO for restarts. The Auscom driver is not using PIO and has other restarts. ACCESS specific restarts are not using PIO. Could look at that. From logs, YATM finished 3-4 minutes before everything else. AH: Just tenth? NH: Just at 0.1. Not sure it is a problem, just could be better.

Progress in PIO testing in CICE with ACCESS-OM2

NH: AK has done most of the testing. Large parts of the globe where no PE does a write. Each PE writes its own area. Over land no write, so filled with the netCDF _FillValue. When you change the ice layout, different CPUs have a different pattern of missing tiles and can read uninitialised values, same as MOM. Way to get around that is to just fill with zeroes. Could then use any ice restart with any new run with a different PE layout. AH: Why does computation care over land? NH: CICE doesn’t apply a land mask to anything. Not like MOM which excludes calculations over land; CICE just excludes calc over land where there is no cpu. Code isn’t careful. If you change the PE layout, parts which weren’t calculated are now calculated. RF: Often uses comparison to zero to mask out land. When there is a _FillValue it doesn’t properly detect it as land. MW: A lot of "if water" statements in the code. RF: Pack into 1D array with offsets. NH: Not how I thought it works. RF: Assumes certain things have been done. Doesn’t expect things ever to change, running the same way all the time. Because the whole file was dumped from one processor, never ran into the problem. NH: Maybe ok to do what they did before: change _FillValue to zero. RF: Nasty though, zero is a valid number. NH: Alternative is to give up on reusing restarts. RF: Process restarts to fill where necessary, and put in zeroes where necessary. NH: Same with MOM? How do we fix it with MOM? RF: Changed a while ago. Tested on _FillValue or _MissingValue, especially in the thickness module. PL: Does this imply being able to restart from a restart file means you can’t change the processor layout? Just in CICE or also MOM? RF: Will be sensitive to tile size. Distribution of tiles (round-robin etc) still has the same number of tiles, so not sensitive. AH: MOM always has to collate restarts if changing the IO layout. AK: Why is having a _FillValue of zero worse than just putting in zero? RF: Often codes check for _FillValue, like MissingValue. So might affect other codes.
NH: Ok, settle on post-processing to reuse CICE restarts in different layouts. AK: Sounds good. Put a note in the issue and close it. AH: Make another issue with a script to post-process restarts. NH: Use the default netCDF _FillValue.
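A minimal sketch of that post-processing, under an assumed restart filename: replace anything left at _FillValue with zero so any PE layout can read it:

    # Hypothetical sketch of the agreed post-processing for CICE restarts:
    # replace _FillValue cells (land/never-written tiles) with zeroes.
    import numpy as np
    import netCDF4

    ds = netCDF4.Dataset("iced.nc", "r+")       # assumed restart filename
    for name, var in ds.variables.items():
        data = var[:]                           # masked where _FillValue
        if np.ma.is_masked(data):
            var[:] = np.ma.filled(data, 0.0)    # zero is what CICE expects
    ds.close()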

Scaling performance of new configurations for 0.1

NH: Working on small (6K), medium (10K+1.2K), large (18K) and x-large (33K+) ACCESS-OM2-01 configs. Profiling and tuning. PIO improves things a lot. Small is better than before. CICE had scalability issues even before PIO, but with little IO? PL: Removed as much IO as possible. Hard-wired to remove all restart writing. Some restart writing didn’t honour the write_restart flag. Not incorporated in main code, still on a branch.
NH: Medium scales ok. Not as good as MOM, but efficiency not a lot worse. Large and x-large might not be that efficient. x-large just takes a long time to even start, 5-10 minutes. Will have enough data for a presentation soon. Still tweaking balance between the models. Can slow down both when you decrease the number of ICE cpus *and* when you increase them. Can speed up the current config a little by decreasing the number of CICE cpus. It is a balance between the average time that each CPU takes to complete an ice step, and the worst case time. As you increase CPUs the worst case maximum increases; mean minus worst case decreases. RF: Changing tile size? NH: Haven’t needed to. Trying to keep the number of blocks/cpu around 5-10. RF: The fewer the tiles, the larger the chance they are distributed unevenly across ice/ocean. With only 3-4 tiles per processor, some may have 1, others 2, so a 50% difference in load. About 6 tiles/processor is a sweet spot. NH: Haven’t created a config with less than 7. From PL’s work, not bad to err on the side of having more. No noticeable difference between 8 and 10, so err on the higher side. Marshall, did you find CICE had serious limits to scalability?
MW: Can’t recall much about CICE scaling work. Think some of the bottlenecks have been fixed since then. NH: Always wanted to compete with MOM-SIS. Seems hard to do with CICE. MW: Recall CICE does have trouble scaling up computations. EVP solver is much more complex. SIS is a much simpler model.
AH: When changing the number of CPUs and testing the sensitivity of run time, are you also changing the CPUs/block? NH: The algorithm does the distribution. Tell it how many blocks overall, then give it any number of CPUs and it will distribute them over the CPUs. Only two things. The build system conflates them a little: it uses the number of cpus to calculate the number of blocks. Not necessary. Can just say how many blocks you want and then change how many cpus as much as you want. RF: I wrote a tool to calculate the optimal number of CPUs. Could run that quickly. NH: What is it called? RF: Under v45, maybe masking, under my account. NH: So don’t change the number of blocks, just change the number of cpus in the config. So can use CICE to soak up extra CPUs to fill whole nodes. That is why we were using 799. Should change that number with different node sizes. AH: Small changes in the number of CPUs just mean a small change to the average number of blocks per CPU? Trying to understand the sensitivity of run time considering such a small change. NH: Some collective ops, so the more CPUs you have the slower. MW: Have timers for all the ranks. Down at subroutine level? NH: No, just a couple of key things. Ice step time.
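The blocks-per-CPU arithmetic being discussed, as a sketch (grid and block sizes are assumed; the 0.1 deg CICE grid is taken to be 3600x2700 here):

    # Sketch of the block-count arithmetic (all sizes assumed).
    import math

    nx, ny = 3600, 2700            # assumed global grid
    bx, by = 45, 45                # assumed block size

    nblocks = math.ceil(nx / bx) * math.ceil(ny / by)    # 80 * 60 = 4800
    for ncpus in (480, 600, 799, 960):
        print(f"{ncpus} cpus -> {nblocks / ncpus:.1f} blocks/cpu")  # aim ~5-10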
MW: Been using perf, the Linux profiler. Getting used to using it in parallel. Very powerful. Tells you exactly where you are wasting your time. AH: Learning another profiler? MW: Yes, but no documentation. score-p is good, but overhead is too erratic. Are the Allinea tools available? PL: Yes, a bit restricted. MW: Small number of licenses? Why I gave up on them.

1 degree CICE ice config

AK: Re CICE core count, the 1 degree config has 241 cores. Wasting 20% of cpu time at 1 degree. Currently has 24 cores for CICE. Should reduce to 23? NH: Different for 1 degree, not using round-robin. Was playing with 1 degree. Assuming it wasn’t as bad with gadi; were wasting 11. Maybe just add or subtract a stripe to improve efficiency. Will take a look. Could improve the SU efficiency a lot.
RF: fortran programs in /scratch/v45/raf599/masking will work out the processors for MOM and CICE. Also a couple of FERRET scripts which, given a sample ICE distribution, will tell you how much work is expected to be done for a round-robin distribution. Ok, so not quite valid here. Will look at changing the script to support sect-robin. More of a dynamic thing for how a typical run might go. Performance changes seasonally. NH: Another thing we could do is get a more accurate calculation of work per block. Work per block is based on latitude: the higher the latitude, the more work that block is going to do. Can also give it a file specifying the work done. Is ice evenly distributed across latitudes? RF: In index space, and it is variable. Maybe some sort of seasonal climatology, or an annual thing. AH: Seasonal would make sense. AH: Can give it a file with a time dimension? RF: Have to tell it the file, use a script. Put it in a namelist, or get it to figure it out. NH: Hard to know if it is worth the effort. AH: Start with a climatology and see if it makes a difference. Ice tends to stick close to coastlines. The Antarctic peninsula will have ice at higher latitudes than in the Weddell Sea for example. Also in the Arctic the ice drains out at specific points. RF: The most variable southerly part is north of Japan, coming a fair way south near Sakhalin. Would have a few tiles there with low weight. NH: Hard to test, need to run a whole year, not just a 10 day run. Doing all scaling tests with a warm run in June. MW: CICE run time is proportional to the amount of ice in the ocean. There is a 10% seasonal variation. RF: sect-robin with small tiles tries to make sure each processor has work in the northern and southern hemispheres. Others divide into northern and southern tiles. SlenderX2? NH: That is what we’re using for the 1 degree.
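A sketch of how a work-per-cell file might be generated from an ice concentration climatology (file and variable names are assumptions; the exact CICE namelist hook is not covered here):

    # Hypothetical generation of a per-cell work weight for CICE load
    # balancing, from a monthly ice concentration climatology.
    import numpy as np
    import netCDF4

    clim = netCDF4.Dataset("aice_monthly_climatology.nc")  # assumed input
    aice = clim["aice"][:]                # (12, ny, nx) monthly concentration

    # Weight = baseline dynamics cost everywhere + extra cost where ice
    # occurs. Using the seasonal maximum guards against blowing out in any
    # one month; the factors here are illustrative only.
    work = 1.0 + 4.0 * aice.max(axis=0)

    out = netCDF4.Dataset("cice_work_per_cell.nc", "w")
    out.createDimension("ny", work.shape[0])
    out.createDimension("nx", work.shape[1])
    v = out.createVariable("work", "f8", ("ny", "nx"))
    v[:] = np.ma.filled(work, 1.0)        # land gets the baseline weight
    out.close()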

FMS updates

AH: Was getting the MOM jenkins tests running specifically to test the FMS pull request, which uses a subtree so you can switch FMS versions in and out easily. Very similar to what MW had already done. Just have to move a few files out of the FMS shared directory. When I did the tests it took me a week to find MW had already found all these bugs. When MW put in ulm he reverted some changes to multithreaded reads and writes so the tests didn’t break. MW: Quasi-ulm. AH: Those changes were hard coded inside the MOM repo. MW: In the FMS code in MOM. AH: Not a separate FMS branch where they exist? MW: No. Maybe some clever git cherry-picking would work. MW: Has changed a lot since then. Unlikely MOM5 will ever change FMS again. AH: Intention was to be able to make changes in a separate fork. MW: Changes to your own FMS? Ok. Hopefully would have done it in a single commit. AH: Allowed those options but did nothing with them? MW: Don’t remember, sorry. AH: Yours and Rui’s changes are in an FMS branch? MW: Yes. Changes I did there would not be in the shared dir of MOM5. Search for commits by me maybe. If you’re defining a subtree wouldn’t you want the current version to be what is in the shared dir right now? AH: Naively thought that was what ulm was. MW: I fixed some things, maybe bad logic, race conditions. There were problems. Not sure if I did them on the FMS or MOM side. AH: Just these things not supported in the namelist, not important. Will look and ask specific questions.

Miscellaneous

AH: Have given Martin Dix the latest 0.25 bathymetry for the ACCESS-CM2-025 configuration. Also need to generate regridding weights; will rely on the latest scripts AK used for ACCESS-OM2-025. AK: There was an issue with ESMF_RegridWeightGen relying on a fix from RF. Were you going to do a PR to upstream, RF? RF: You can put in the PR if you want. Just say it is coming from COSIMA.
MW: Regarding hosting MOM6 code in the mom-ocean GitHub repo: who is the caretaker of that group? AH: Probably me, as I’m paid to do it. Not sure. MW: Natural place to move the source code. They’re only hesitant because not sure if it is Australian property. Also wants complete freedom to add anything there, including other dependencies like CVMix and TEOS-10. AH: I think the more the better. Makes it a more vital place. MW: So if Alistair comes back to me I can say it is ok? AH: Steve is probably the one to ask. MW: Steve is the one who is advocating for it. MW: They had a model of no central point, now they have five central points. NOAA-GFDL is now the central point. Helps to distance the code from any one organisation. Would be an advantage to be in a neutral space. COSIMA doesn’t have a say? AH: COSIMA has its own organisation, so no. AH: Just ask for Admin privileges. MW: You’re a reluctant caretaker? AH: NH just didn’t have time, so I stepped in and helped Steve with the website and stuff. Sounds great.

Technical Working Group Meeting, September 2020

Minutes

Date: 16th September, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

ACCESS-CM2 + 0.25deg Ocean

PD: Dave Bi is thinking about a 0.25 ocean. Still fairly unfamiliar with MOM5. Trying to keep as harmonised as possible. Learn from the 1 degree harmonisation case. PD: Doing a performance plan for the current financial year, hence talking about this now. Asked supervisors what they want and this popped out of a conversation with Dave Bi. AH: Andy Hogg has been pushing CLEX to do the same work. PD: Maybe some of this is already happening; need some extra help from CSIRO? AH: We think it is much more that CSIRO has a lot of experience with model tuning and validation. Not something researchers want to do; they want to use a validated model and produce research. So a win-win. PD: Validation scripts are something we run all the time, so yes would be a good collaboration. Won’t be any CMIP6 submission with this model. AH: Andy Hogg keen to have a meeting. PD: Agree that sounds like a good idea.
PL: How do you get the baseline parameterisation and ocean mask and all that? Grab from OM2? PD: Yes, grab from OM2. AH: Yes, but still some tuning in coupled versus forced model.

ACCESS-OM2 and MOM-SIS benchmarking on Broadwell on gadi

PL: Started this week. Not much to report. Running restart based on existing data fell over. Just recreated a new baseline. Done a couple of MOM-SIS runs. Waiting on more results. Anecdotally expecting 20% speed degradation.

Update on PIO in CICE and MOM

AK: Test run with NH exe. Tried to reproduce 3 month production run at 0.1deg. Issue with CPU masking in grid fields. Have an updated exe set running. Some performance figures under CICE issue on GitHub. Speeds things up quite considerably. AH: Not waiting on ice? AK: 75% less waiting on ice.
AH: Nic is getting queue limits changed to run up to 32K.
PD: Flagship project such as this should be encouraged. Have heard 70% NCI utilisation? May be able to get more time. RY: No idea about utilisation. Walltime limitation can be adjusted. Not sure about CPU limit. AH: I believe it can. They just wanted some details. Have brought up this issue at Scheme Managers meeting. Would like to get number of cpu limits increased across the board. There is positive reaction to increasing limits, but no motivation to do so. Need to kick it up to policy/board level to get those changes. Will try and do that.
NH: Some hesitation. Consumes 70-80 KSU/hr. Need to be careful. PL: What is the research motivation? NH: Building on previous work of PL and MW. With PIO in CICE can make practical configs with daily ice output with lots more cores. Turning Paul’s scaling work into production configs. Possible due to PL and MW’s work, moving to gadi and having PIO in CICE.
NH: Got 3 new configs beyond the existing small (5K): medium (8K), large (16K) and x-large (32K). MOM doubling each time. CICE doesn’t have to double. Running short runs of configs to test. PL: 16K is where I stopped. NH: Andy Hogg said it would be good to have a document showing scalability for NCMAS. PL: All up on GitHub. NH: Will take another look. NH: Getting easier and easier to make new configs. CICE load balancing used to be tricky. Current setup seems to work well when increasing cpus.
PD: What is the situation with reproducibility? In the 1 degree MOM run 12×8. Would it be the same with 8×12? More processors? NH: Possible to make MOM bit repro with different layout and processor counts, but not on by default. Can’t do it with CICE, so no big advantage. PL: What if CICE doesn’t change? NH: Should be ok if CICE is kept the same; can change MOM layout with the correct options and get repro. RF: Generally have to knock down optimisation and floating point config. Once you’re in operational mode do you care? Good to show as a test, but operationally want to get through as fast as possible as long as results are statistically the same. PL: Climatologically the same? RF: Yeah. PL: All the other components, becomes another ensemble member. RF: Exactly. NH: Repro is a good test. Sometimes there are bugs that show up when things don’t reproduce. That is why those repro options exist in MOM. If you do a change that shouldn’t change answers, then a repro test can check that change. Without repro you don’t have that option.
RF: Working with Pavel Sakov, struggling with some of the configs, updating yatm to the latest version. Moving on to the 0.1 degree model. Hoping to run 96 member ensembles, 3 days or so, with daily output. A lot of start/stop overhead. PIO will help a lot. Maybe look at start up and tidy up time. A lot different to 6 month runs. AH: Use ncar_rescale? RF: Just standard config. Not sure if it reads in or does it dynamically. AH: Worth precalculating if not doing so. Sensitivity to io_layout with start-up? RF: In the data assimilation step, restarts may come back joined rather than separate. Thousands of processors trying to read the same file. AH: mppncscatter? RF: Thinking about that. Haven’t looked at timings yet, but there will be some issues. AH: How long does the DA step take? RF: Not sure. Right at the very start of this process. Pavel has had success with 1 degree. Impressed with quality of results from the model. Especially ice.
AH: Maybe Pavel could present to a COSIMA meeting? RF: Presented to BlueLink last week. AH: Always good to get a different view of the model.

Testing

AH: Trying to get the testing framework NH set up on Jenkins running again. Wanted to use this to check the FMS update hasn’t broken anything. Can then update the FMS fork with Marshall and Rui’s PIO changes.
NH: A couple of months ago got most of the ACCESS-OM2 tests running. MOM5 runs were not well maintained. MOM6 were better maintained and run consistently until gadi. Can get those working again if it is a priority. Was being looked at by GFDL. AH: Might be priority as we want to transition to MOM6.
NH: Don’t have scaling results yet. Will probably be pretty similar to Paul’s numbers. Will show you next time. PL: Will update GitHub scaling info. NH: Planning to do some simple plots and tables using AK’s scripts that pull out information from runs.

Bathymetry

AK: Got list of edits from Simon Marsland for original topography. Wanted to get feedback about what should be carried across. Pushed a lot into COSIMA topography_tools repo. Use as a submodule in other repos which create the 1 degree and 0.25 degree topographies. Document the topography with the githash of the repo which created it. Pretty much finished 0.25. Just a little hand editing required. Hoping to get test runs with old and new bathymetry.
AH: KDS50 vertical grid? QA analyses? AK: Partial cells used optimally from the KDS50 grid. Source data is also improved (GEBCO) so no potholes and shelves. AH: Sounds like a nice well documented process which can be picked up and improved in the future.
AK: The way it is done could be used for other input files: have it all in a git repo and embed metadata into the file linking it exactly to a git hash. Good practice. Could also use manifests to hash inputs? NH: Great, have talked about reproducible inputs for a long time. AH: Hashing output can be tracked back with manifests. Ideally would put hashes in every output. There is an existing feature for unique experiment IDs in payu, but it has not gone further; still think it is a good idea.
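A sketch of the provenance practice being described: stamp a generated input with the creating repo’s git hash and a content hash of its source data (attribute and file names here are illustrative; payu’s manifests handle the input-hashing side automatically):

    # Illustrative provenance stamping for a generated input file.
    import hashlib
    import subprocess
    import netCDF4

    def md5sum(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    githash = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()

    ds = netCDF4.Dataset("topog.nc", "r+")
    ds.history = f"generated by topography_tools @ {githash}"  # assumed attrs
    ds.source_md5 = md5sum("gebco_source.nc")                  # assumed file
    ds.close()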
AK: Process can be applied to other inputs. AH: The more you do it, the more sense it makes to create supporting tools to make this all easier.

Jupyterhub on gadi

AH: What is the advantage to using jupyterhub?
JM: The configurable http proxy and jupyterhub forward ports through a single ssh tunnel. If the program crashes and is re-run, it might choose a different port but the script doesn’t know. This is a barrier. Also does persistence, basically like tmux for jupyter processes. AH: Can’t do ramping up and down using a bash script? JM: Could do; that is handled through dask-jobqueue. A bash script could use that too. JM: Long term goal would be a jupyterhub.nci.org.au. Difficult to deploy services at scale. AH: Pawsey and the NZ HPC mob were doing it.

Technical Working Group Meeting, August 2020

Minutes

Date: 12th August, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

PIO with MOM (and NCAR)

MW: NCAR running global 1/8 CESM with MOM6. Struggling with IO requirements. Worked out they need to use io_layout. Interested in parallel IO. Our patch is not in FMS, and never will be. They understand but don’t know what to do. Can’t guarantee my patch will work with other models and in the future. Said the COSIMA guys are not using it; using mppnccombine-fast so they don’t need it. Is that working? AK: Yes.
RY: Issue is no compression. Previous PIO did not support compression and sizes were huge. Now netCDF supports parallel compression, so maybe look at it again. Haven’t got time to look at it. Should be a better solution for the COSIMA group.
MW: Ideally Ed Hartnett or someone else from NCAR would add PIO to FMS. They have been working on the latest FMS rewrite for more than 2 years. Haven’t finished the API update. The FMS API is very high level. They have decided it is too high level to do PIO. FMS is completely rewriting the API. Ed stopped until the FMS update. Added PIO and used direct netCDF IO calls. Bit hard-wired but suitable for MOM-like domains. Option 1 is sit and wait, Option 2 is do their own, Option 3 is do it now and use a fork of FMS. Maybe Option 4 is mppnccombine-fast. What do you think?
AK: Outputting compressed tiles with io_layout and using fast combine. Potential issue is if io_layout makes small tiles. MW: Chunk size has to match tile size? Do tiles have to be the same size? AH: Yes. Still works if not, but slower as it has to do a deflate/reflate step. It is fast when it can just copy compressed chunks from one file to another; the limit is only filesystem speed. Still uses much less memory even if it has to deflate/reflate. Chooses the first tile size and imposes that on all tiles. If the first tile is not the typical tile size for most files, could end up reflating/deflating a lot of the data. Also have to choose processor layout and io_layout carefully. For example 0.25 1920×1080 doesn’t have consistent tile sizes. MW: Trying to figure out if it is worth telling them to reach out to you guys. Sent them a link to the repo. AH: Might be a decent way to keep going until they get a better solution.
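A quick check of the constraint described above: mppnccombine-fast can only copy compressed chunks verbatim when the io_layout divides the grid into uniform tiles (grid sizes below are illustrative):

    # Sketch: does an io_layout give uniform tile sizes, so mppnccombine-fast
    # can copy compressed chunks directly instead of deflating/reflating?
    def uniform_tiles(nx, ny, io_nx, io_ny):
        return nx % io_nx == 0 and ny % io_ny == 0

    print(uniform_tiles(1440, 1080, 8, 6))   # True: every tile is 180x180
    print(uniform_tiles(1440, 1080, 7, 6))   # False: last column is narrower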
MW: Bob had strategy to force modelling services to include PIO support by getting NCAR to use PIO.
NH: Can they use the PIO patch with their current version of FMS? MW: They want to get rid of the old functions. Bad idea to ditch the old API, creates a lot of problems. The parallel-IO work is on a branch.
AH: Regional output would be much better. Output one file per rank. Can aggregate with PIO? NH: One output file. Can set chunking. AH: Not doing regional outputs any more because so slow. Would give more flexibility. AK: Slow because of the huge number of files. Chunks are tiny and unusable. Need to use non-fast. AH: I thought it was just that the output is slow. RF: Many processors on the same node will pump output. MW: Many outputs will throttle lustre. Only have a couple of hundred disks. Will get throttled. AH: Another good reason to use it for MOM. MW: Change with io_layout? RF: No, always output for themselves. MW: Wonder how the patch would behave. AH: NCAR constrained to stay consistent with FMS. MOM5 is not so constrained, should just use it. NH: Should try it since the code already exists. Parallel netcdf is a game changer. AH: I have a long-standing branch to add FMS as a subtree. Should do it that way. Have our own FMS fork with the code changes. MW: Only took 3 years!
AK: Put in the deflate level as a namelist parameter as it defaulted to 5. Used 1 as it is much faster but compression was the same.

CICE PIO

NH: Solved all known issues. Using PIO output driver. Works well. Can set chunking, do compression, a lot faster. Ready to merge and will do soon. I don’t understand why it is so much better than what we had. I don’t understand the configuration of it very well. Documentation is not great. When they suggested changes they didn’t perform well. Don’t understand why it is working so well as it is, and would like to make it even better.
NH: Converted normal netcdf CICE output driver to use latest parallel netCDF library with compression. So 3 ways, serial, same netCDF with pnetcdf compressed output, or PIO library. netcdf way is redundant as not as fast as PIO. Don’t know why. Should be doing this with MOM as well. Couldn’t recall details of MW and RY previous work. Should think about reviving that. Makes sense for us to do that, and have code already.
MW: Performance difference is concerning. NH: Has another layer of gathers compared to the MPI-IO layer. PIO adds another layer of gathering and buffering. With a messy CICE layout PIO is bringing all those bits it needs and handing them to the lower layer. Maybe a possible reason for the performance difference. RY: PIO does some mapping from compute to IO domain. Similar to io_layout in MOM. Doesn’t use all ranks to do IO. Sends more data to a single rank to do IO, saves contention issues. NH: MPI-IO has aggregators? RY: In the library you can select the number of aggregators. Default is 1 aggregator per node. If you use PIO with a single rank per node this matches MPI-IO. Did this in the paper where we tested this. If io_layout, aggregator number and lustre striping are consistent, should get good performance.
RY: Tried different compression levels? NH: Just using level 1. Did some testing; in the serial case not much point going higher. Current tests doing all possible outputs. RF: A lot of the compression will be due to empty fields. RY: Compression performance is related to chunk size. NH: There is a performance difference with chunk size. Too big and too small are both slower. The default chunk size is fastest for writing: 360×300 for a 2D field. Might not be ok for reading. RY: Should consider both read and write. Write once and many read patterns. MW: Parallel reads were slower than POSIX reads. AH: What is the dependence of time on chunk size? NH: Depends how many fields we output. Cutting down should be fast for a larger chunk size. Is a namelist config currently: tell it the chunk dimension. RY: Did similar with MOM. AH: CICE mostly 2D, how many have layers? AH: What chunking in layers? NH: No chunking for layers, chunk size is 1. AH: Have noticed access patterns have changed with extremes work. Looking more at time series, and sensitive to time chunking. Time chunking has to be set to 1? NH: With unlimited, not sure. RF: Can chunk in time with unlimited, but would be difficult as you need to buffer until writing is possible. With ICE data, layers/categories are read at once. Usually aggregated, not used as individual layers. Makes more sense to make the category chunk the max size. Still a small cache for each chunk. netCDF 4.7.4 increased the default cache size from 4M to 16M.
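The read-pattern trade-off AH raises, as a sketch: a (1, ny, nx) chunk shape is write-friendly but makes a single-point time series inflate one whole chunk per time step, while chunking along time does the reverse (all shapes below are illustrative):

    # Chunk-shape trade-off sketch: spatial maps vs point time series.
    import netCDF4

    nc = netCDF4.Dataset("chunk_demo.nc", "w")
    nc.createDimension("time", None)     # unlimited
    nc.createDimension("y", 300)
    nc.createDimension("x", 360)

    # Write-friendly: one chunk per 2D field; a 1000-step time series at a
    # single point then reads (and inflates) 1000 whole chunks.
    nc.createVariable("aice_map", "f4", ("time", "y", "x"),
                      zlib=True, complevel=1, chunksizes=(1, 300, 360))

    # Read-friendly for time series: chunk along time too (still allowed on
    # an unlimited dimension, at the cost of buffering while writing).
    nc.createVariable("aice_ts", "f4", ("time", "y", "x"),
                      zlib=True, complevel=1, chunksizes=(100, 30, 36))
    nc.close()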
AH: I thought deflate level 4 or 5 was still worth it. NH: Can give it a try. Don’t really care about deflate level, just getting rid of zeroes.

Masking ICE from MOM – ICE bypass

NH: Chatted with RF on slack. Mask to bypass ICE. Don’t talk to ice in certain areas. Like the idea. Don’t know how much performance improvement. RF: Not sure it would make much difference. Just communication between CICE and MOM. NH: Also get rid of all the CICE ranks that do nothing. RF: Those are hidden away because of round-robin and evenly spread. Layout with no work would make a difference. NH: What motivated me was the IO would be nicer without crazy layouts. If didn’t bother with middle, would do one block per cpu, one in north and south. Would improve comms and IO. If it were easy would try. Maybe lots of work. AK: Using halo masking so no comms with ice-free blocks.
AH: What about ice to ice comms with far flung PEs? RF: Smart enough to know if it needs MPI or not. Physically co-located ranks will not use MPI. AH: Thought it would be easy? NH: Not sure it is justified in terms of performance improvement. With IO, tiny blocks were killing performance, so this was a solution to an IO problem. MW: Two issues are funny comms patterns and that calculations are expensive, but ice distribution is unpredictable. Don’t know which PEs will be busy. Load imbalances will be dynamic in time. Seasonal variation is order 20%. Might improve comms, but that wasn’t the problem. Stress tensor calcs are expensive, so ice regions will do a lot more work. NH: Reason to use small blocks, which improves the ability to balance load. MW: Alistair struggling with icebergs. Needs dynamic load balancing. Difficult problem. RF: Small blocks are good. Min-max problem: every rank has the same amount of work, not too much or too little. CICE ignores tiles without ice. In CICE6 a lot of this can be done dynamically. There is dynamic allocation of memory. AH: Dynamic load balancing? RF: Who knows. Now using OpenMP. AH: Doesn’t seem to make much difference with the UM. MW: Uses it a lot with IO as IO is so expensive.
AH: A major reason to pursue masking is it might make it easier when scaling up. If round-robin magically scales well that is ok, but last time there was a lot of analysis with heat maps and discussion about optimal block sizes. Conceptually it might be easier to understand how to best optimise for new config. NH: Does seem to make sense, could simplify some aspects of config. Not sure if it is justified. MW: Easy to look at comms inefficiency. Did this a lot for MOM5, and mostly it wasn’t comms. Sometimes the hardware, or a library not handling the messages well, rather than comms message composition. Bob does this a lot. Sees comms issues, fixes it and doesn’t make a big difference. Definitely worth running the numbers. NH: Andy made the point. This is an architecture thing. Can’t make changes like this unilaterally. Coupled model as well. Fundamentally different architecture could be an issue. MW: Feel like CPUs are the main issue not comms. Could afford to do comms badly. NH: comms and disks seems pretty good on gadi. Not sure we’re pushing the limits of the machine yet. Might have double or triple size of model. AH: Models are starting to diverge in terms of architecture. Coupled model will never have 20K ocean cpus any time soon. NH: Don’t care about ice or ocean performance.
AH: ESM1.5 halved runtime by doubling ocean CPUs. RF: The BGC model takes more time. Was probably very tightly tuned for the coupled model. 5-6 extra tracers. MW: 3 on by default, triples the expensive part of the model. UM is way more resources. AH: Did an entire CMIP deck run half as fast as they could have done. My point is that at some point we might not be able to keep infrastructure the same. Also if there is code that needs to move in case we need to do this in the future. NH: Code is more of an ocean calculation anyway? RF: Kind of. Presume there is a separate ice calc. Coupling code taken from gfdl/MOM and put into CICE to bypass FMS and coupler code. From GFDL coupler code. Rather than ocean_solo.f90, goes through coupler.f90. NH: If 10 or 20K cores might revisit some of these ideas. Goal is to get those core counts working, not sure about production.
MW: Still thinking about super high res, like 1/30. OceanMaps people wanted it? More concrete. RF: Some controversy with OceanMaps and BoM. Wanting to go to UM and Nemo. There is a meeting, and a presentation from CLEX. Wondering about opportunity to go to very high core counts (20K/40K). AH: Didn’t GFDL run 60K cores for their ESM model? NH: Never heard about it. Atmosphere more aggressive. RF: Config I got for CM4 was about 10K. 3000 line diag_table. AH: Performance must be IO limited? MW: Not sure. Separated from that group.

New bathymetry

AK: Russ made a script to use GEBCO from scratch. Worked on that to polish it up. Everything so far has been automatic. RF: Always some things that need intervention for MOM that aren’t so much physically realistic but are required for the model. AK: Identified some key straits. Retaining previous land masks so as not to need to redo mapping files. 0.25 needs to remove 3 ocean points and add 2 points. The make-remap-weights scripts are not working on gadi, due to the ESMF install. Just installed the latest esmf locally, 8.0.1, currently running. AH: The ESMF install for WRF doesn’t work? AK: Can’t find opal/MCA, an MPI error. RF: That is an MPI error.
AH: Sounds like the sort of error that is usually random, but not sure if it is happening deterministically. AK: Might be a library version issue. AH: They have wrappers to guess the MPI library; if the major version is the same it should be the same.
AH: All this is scriptable and can be re-run, right? Bathymetries are intimately tied to the vertical grid, so need to be re-run if that is changed. AK: Vision is certainly for it to be largely automated. Not quite there yet.
NH: I’ll have a quick look too. Noticed there is no module load esmf? AK: Using esmf/nuwrf. I’ll have a look at what esmf was built with. AH: I want esmf installed centrally. We should get more people to ask. NH: I think it is very important. AK: Definitely need it for remapping weights. AH: Other people need it as well.

Technical Working Group Meeting, July 2020

Minutes

Date: 10th July, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (PD) CSIRO Aspendale
  • Marshall Ward (MW) GFDL

Optimisation report

PL: Have a full report, need review before release. This is an excerpt.
PL: Aims for perf tuning and options for configuration. Did a comparison with Marshall’s previous report on raijin.
Testing included MOM-SIS at 0.25 and 1 deg to get an idea of MOM scalability stand-alone.
Then ACCESS-OM2 at 0.1 deg. Testing with land masking, scaling MOM and CICE proportionally.
Couldn’t repeat Marshall’s exactly. ACCESS-OM2 results based on different configs. Differences:
  1. continuation run
  2. time step 540s v 400s
  3. MOM and CICE were scaled proportionally
  4. Scaling taken to 20k v 16k
MOM-SIS at 0.25 degrees on gadi is 25% faster than ACCESS-OM2 on raijin at the low end of CPU scaling. Twice as fast for MOM-SIS at 0.1 degrees. Scalability at the high end is better.
ACCESS-OM2: With 5K MOM cores, MOM is 50-100% faster than MOM on raijin. Almost twice as fast at 16K, scaled out to 20K. CICE: with 2.5K cores CICE on gadi seems 50% faster than CICE on raijin. Scales to 2.5 times as fast at 16K OM2 CPUs.
Model days per CPU day: from 799/4358 CICE/MOM cpus it does not scale well.
Tried to look at wait time as a fraction of wall time. Waiting is constant for high CICE ncpus, and decreases at high core counts with low CICE ncpus. So at higher core counts it is probably best to reduce CICE ncpus as a proportion; in this case half the usual fraction.
JM: How significant are the results statistically? PL: Expensive. Only ran 3-4 runs. Spread quite low. Waiting time varied the most. Probably not statistically significant due to small sample size.
MW: Timers in libaccessom2 were better than OASIS timers, which include bootstrapping times that are impossible to remove. Also noisy IO timers. Not sure how long your runs are; longer would be more accurate. PL: Runs are for 1 calendar month (28 days). MW: oasis_put and get are slightly magical, difficult to know what they’re doing. PL: Still have outputs, could reanalyse.
MW: Speedups seem very high. Must be a configuration thing. PL: Worried it is not a straight comparison. MW: If network time is 15-20%, wouldn’t make a difference. Always been RAM bound, which would be good if that wasn’t an issue now. PL: Very meticulous documentation of configuration, and it is very reproducible. Made a shell script that pulls everything from GitHub. MW: I think your runs are better referenced. While experiments were being released, it seemed some parameters etc were changing as I was testing. That could be the difference. Wish I had documented it better.
AH: All figures independent of ocean timestep? PL: All timestep at 540s. AK: Production runs are 15-20% faster, but a lot of IO. PL: Switched off IO and only output at the end of the month. It really drags it. Make sure it isn’t IO bound. Probably memory bound. Didn’t do any profiling that was worth presenting. MW: Got a FLOP rate? PL: Yes, but not at my fingertips. If it is around a GFLOP, probably RAM bound. PL: Now profiling ACCESS-CM2 with ARM-map. RY is looking at a micro view, at the OpenMP and compilation flag level. RY: Gadi has 4GB/core, raijin has 2GB/core. Not sure about bandwidth. Also 24 cores/node. Much less remote node comms. Maybe a big reduction in MPI overheads. MW: OpenMPI UCX stuff helping. RY: Lots of on-node comms. Not sure how much. MW: Believe at high ranks. At modest resolutions comms are not a huge fraction of run time. Normal scalable configs only about 20%. PL: The way the scaling was done was different: MW scaled components separately. MW: I was using clocks that separated out wait time.
RY: If the config timestep matters, any rule for choosing a good one? AK: The longest timestep that is numerically stable. 540s is stable most of the time.
MW: How have you progressed on the CICE layout stuff? Changed in the last year? I was using sect-robin. RF: sect-robin or round-robin. AK: You used sect-robin; production did use round-robin, but not sect-robin. Less comms overhead, not sure about load balance.
PL: Is there any value in releasing the report? NH: Would be interested in reading it. Looking to get these bigger configurations. AH: Worth documenting the performance at this point. RY: Anything else worth trying? AH: Why the 20K limit? PL: Believe that is a PBS queue limit. Some projects can apply for an exception. RY: For each queue there are limits. Can talk with admin if necessary to increase them. AH: Will bring this up at the Scheme Managers meeting. They should be updated with gadi being a much bigger machine. Would give more flexibility with configurations. Scalability is very encouraging.
RF: BlueLink runs very short jobs, 3 days at a time. Quite a bit of start-up and termination time. How much does that vary between runs? MW: I did plot init times; it was in proportion to the number of cores. Entirely MPI_Init. It has been a point of research in OpenMPI: double the ranks, double the boot time. RF: Also initialisation of OASIS, reading and writing of restart files. PL: Information has been collected, but hasn’t been analysed. RF: For Paul Sandery’s runs it is 20% of the time. MW: MPI_Init is brutal, and then the exchange grid. There are obvious improvements. Still doing alltoall when it only needs to query neighbours. Can be sped up by applying geometry. Don’t need to preconnect; that is bad.
PL: At least one case where MPI collectives were being replaced with a loop of point-to-points. Was the collective unstable? MW: Yes, but it may also be there for reproducibility. I re-wrote a lot of those with collectives, but they had no impact. At one time collectives were very unreliable; probably don’t need to be that way anymore. I doubt they would be better. Hinges on my assertion that comms are not a major chunk.
AH: MOM is RAM bound or cache-bound? MW: When doing calculations the model is waiting to load data from memory. AH: Memory bandwidth improves all the time. MW: Yes, but increase the number of cores and it’s a wash. It could have improved. AMD is doing better, but Intel know there is a problem.
AH: To wrap up: yes, would like the full report. This is useful for NH to work up new configurations, as naive scaling is not the way to go. Also PL can provide the initialisation numbers RF would like.

ACCESS-OM2 release plan

AH: Are we going to change bathymetry? Consulted with Andy, who consulted with the COSIMA group. What is the upshot? AK: Ryan and Abhi want to do some MOM runs. Problems with bathymetry. Andy wants the run to start: fix it if someone has time to do it, otherwise keep going. Does anyone have some idea how long it would take? RF: 1 deg would be fairly quick. We know where the problems are. Shouldn’t be a big job. Maybe a few days; 1 deg in a day for an experienced person. GFDL has some python tools for adjusting bathymetry on their GitHub. Point and click. Alisdair wrote it. Might be in the MOM-SIS examples repo. MW: Don’t know, could ask. RF: Could be something that would make it straightforward.
AK: Will have a look.
MW: topog problems in MOM6 not usually the same as MOM5 due to isopycnal coordinate.
AK: Some specific points that need fixing? RF: I think I put some regions in a GitHub issue. AK: What level to dig into this? RF: Take pits out; set to min depth in config. Regions which should be min depth and have a hole. Gulf of Carpentaria is trivial. Laptev should all be set to min depth. NH: I did write some python stuff called topog_tools; can feed it a csv of points and it will fix it. Will fill in holes automatically, also smooths humps. May still have to look at the python and fix stuff up. Another possibility.
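For illustration, a minimal sketch of this kind of point fix in python (file, variable and CSV names are hypothetical, not NH’s actual topog_tools interface), assuming a csv of i,j indices to reset to the configured minimum depth:

    import csv
    import netCDF4

    MIN_DEPTH = 10.0  # hypothetical minimum depth from the model config (m)

    topog = netCDF4.Dataset("topog.nc", "r+")
    depth = topog.variables["depth"]  # assumed depth variable, positive down

    with open("fix_points.csv") as f:
        for i, j in csv.reader(f):  # one i,j grid point per row
            i, j = int(i), int(j)
            if depth[j, i] > MIN_DEPTH:  # take the pit out
                depth[j, i] = MIN_DEPTH

    topog.close()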
AK: Another issue is quantisation to the vertical grid. A lot of terracing that has been inherited. RF: Different issue. Generating a new grid. 1 degree not too bad. 0.25 would be weighty still. MC: BGC in 0.25, found a hole off Peru that filled up with nutrients.
AH: Only thing you’re waiting on? AK: Could do an interim release without topog fixes. People want to use. master is so far behind now Also updating wiki which refers to new config. Might merge ak-dev into master, and tag as 1.7, and have wiki instructions up to date with that. AH: After bathy update what would version be? AK: 2.0. AH: Just wondering what constitutes a change in model version. AK: Maybe one criteria if restarts are unusable from one version to the next. Changing bathy would make these incompatible. AH: Good version.
AH: Final word on ice data compression? NH: Decided deflate within the model was too difficult due to bugs. Then Angus recognised the traceback for my segfault, which was great. Wasted some time not implementing it correctly. Now working correctly. Using a different IO subsystem: OMPIO rather than ROMIO. Got more segfaults. Traced and figured out: need to be able to tell netCDF the file view, the view of the file that a rank is responsible for. MPI expects that to be set. One way to set it up is to specify the chunking to something that makes sense. Once I did that the file view was correct. Then ran into bugs in the PIO library. Seemed like cut-and-paste mistakes: no test for chunking, library wrapper is wrong. Fixed that. Now working. Learnt a lot and a satisfying outcome. Significantly faster: partly because it is parallel, also a different layer does more optimisation. Noticed with 1 degree things are flying. Nothing definitive, but seems a good outcome. Will get some definitive numbers and a better explanation. Will have something to merge, some PRs to add to, and some bug reports to PIO. RY: PIO just released a new version yesterday. NH: Didn’t know that. Tracking issues that are relevant to me; still sitting there. Will try the new version. RY: Happy it is working now. NH: Was getting frustrated with PIO, wondered why not use netCDF directly. For what we do it is a pretty thin wrapper around netCDF. Main advantage is the way it handles the ice mapping; worth keeping just for that. MW: FMS has most of a PIO wrapper but not the parallel bit.
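A rough sketch of the idea NH describes, in netCDF4-python rather than the actual CICE/PIO Fortran (assuming an MPI-enabled netCDF4-python build; dimensions and names are made up): with the chunk size set equal to the per-rank block, each rank writes whole chunks and the file view lines up.

    from mpi4py import MPI
    import numpy as np
    from netCDF4 import Dataset

    comm = MPI.COMM_WORLD
    rank, nranks = comm.Get_rank(), comm.Get_size()
    ny_block, nx = 20, 300  # hypothetical per-rank block

    nc = Dataset("out.nc", "w", parallel=True, comm=comm, format="NETCDF4")
    nc.createDimension("y", ny_block * nranks)
    nc.createDimension("x", nx)
    # chunk size == per-rank block, so each rank writes only its own chunks
    v = nc.createVariable("aice", "f4", ("y", "x"), chunksizes=(ny_block, nx))
    v[rank * ny_block:(rank + 1) * ny_block, :] = np.ones((ny_block, nx), "f4")
    nc.close()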
PL: Does any of that fix need to be pushed upstream? NH: Changes to the CICE code; will push upstream to CICE. Will be a couple of changes to PIO. AK: Does it dynamically determine chunking? NH: Need to set that. It dynamically figures out tuneable parameters under the hood, such as the number of aggregators, by looking at what each rank is doing. Knows what filesystem it is on, dependent on how it is installed. Assuming it knows it is on lustre, it can generate optimal settings. Can explain more when I do a summary.
AK: Want to make sure output files are consistently chunked. NH: Using the chunking to set the file view. Another way is to explicitly set the file view using the MPI API. Chunks are the same size as the data that each PE has; in CICE each block is a chunk. MW: These are netCDF chunks? AK: More chunks than cores? NH: Yes. Is that bad or good? This level is perfect for writing: every rank writes the chunks it knows about. Not too bad for reading. JM: How large is a chunk? NH: In 1 degree every PE has 300 rows x 20 columns of the full domain. JM: Those are small. Need bigger for reading. AK: For tenth 36×30. Something like 9000 blocks/chunks. NH: Might be a problem for reading? RF: Yes, for analysis. JM: Fixed cost for every read operation; a lot of network chatter. AH: Is that the ice or the ocean in the tenth? Not sure. Chunk size is 36×30. A lot of that is ice free, 30% is land. MW: Ideal chunks are based on IO buffers in the filesystem. AH: Best chunking depends a lot on access patterns. JM: One chunk should be many minimal file units big. AH: When 0.25 had one file per PE it was horrendously bad for IO. Crippled Ryan’s analysis scripts. If you’re using sect-robin that could make the view complicated? NH: Wasn’t Ryan’s issue that the time dimension was also chunked? AH: He was testing mppnccombine-fast, which just copies chunks as-is, and they were set by the very small tile sizes. Similar to your problem? NH: Probably worse. Not doing MOM tiles, doing CICE blocks, which are even smaller, on the same grid as MOM. RF: Fewer PEs, so blocks are half the size. MOM5 tile sizes and CICE block sizes are comparable, apart from the 1 degree model.
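Since per-block chunks that suit writing can be bad for analysis, one option is a post-hoc rechunk with xarray (a sketch only; file, variable and chunk sizes here are hypothetical and the right chunks depend on access patterns):

    import xarray as xr

    ds = xr.open_dataset("iceh.nc")  # file written with tiny 36x30 chunks
    # rewrite with larger chunks suited to typical read patterns
    encoding = {"aice": {"chunksizes": (1, 540, 720), "zlib": True}}
    ds.to_netcdf("iceh_rechunked.nc", encoding=encoding)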
NH: Will carry on with this. Better than deflating externally, but could run into some problems. The chunks and the file view don’t have to be the same. Will this be really bad for read performance? Gathering that it is. What could be done about it? Limited by what each rank can do. No reason the chunks have to be the same as the file view: could have multiple processors contribute to a chunk, but can’t do that without a collective. MW: MPI-IO does collectives under the hood. Can you configure MPI-IO to build your chunk for you? NH: Currently every rank does its own IO as it was simpler and faster. MW: Can’t it all be configured at the MPI-IO layer? RY: PIO can map the compute domain to an IO domain. Previous work had one IO rank per node; the IO rank collects all data from the node. Set chunking at this level. NH: For example, our chunk size could be 48 times bigger. RY: Yes. Also best performance is a single IO rank per node. PIO does have this function to map from compute to IO domain, which is why we used it. Can also specify how many aggregators per node. First decide how many IO ranks per node and how many aggregators per node; those should match. Can also set the number of stripes and stripe size to match the chunk size. IO ranks per node is the most important, as it will set the chunk size. MW: Only want the same number of writers as OSTs. RY: With many writers per node, you will easily saturate the network and it will be the bottleneck. AH: Have to go, but we definitely need to solve this, as scaling to 20K cores will kill this otherwise.

MW: Will also help RF. If you’re desperate, look at the patch RY and I did. Will help a lot once you’ve identified your initialisation slow-down. RF: Yes, will do once I’ve worked out where the blockages are. Just seen some timings from Paul Sandery, but haven’t looked into them deeply yet. NH: Even with a rubbish config, the model is showing performance improvements. Will continue with that, and will treat the chunk size stuff as an optimisation task. MW: Sounds like you’ve gone from serial write to massively parallel, so inverted the problem: from disk bound to network bound within lustre. If you can find a sweet spot in between you should see another big speed improvement. NH: The config step is pretty easy to do with PIO. Will talk to RY about it. RY: Could have a user-settable parameter to specify IO writers per node. PL: Need to look into lustre striping size? RY: Currently set to 1GB, so probably ok, but can always tune this. NH: Just getting a small taste of the huge world of IO optimisation. MW: Just interesting to be IO bound again. NH: Tenth was heavily impacted by IO performance with dailies. MW: The IO domain offset the problem. Still there, but could be dealt with in parallel with the next run, so could be sort of ignored.

AK: This is going to speed up the run. Worst case is post-processing to get the chunking sorted out. NH: That leaves us back at the point of having to do another step, which I would like to avoid. Maybe different: before it was a step to deflate; maybe rechunking was always going to be a problem.
AK: Revisiting Issue #212, we need to change model ordering. Concerns about YATM and MOM sharing a node and affecting IO bandwidth. Tried this test; there is an assert statement that fails: libaccessom2 demands YATM is the first PE. NH: Will look at why I put the assert there. Weirdly proud there is an assert. RF: Remember this from playing around with OASIS intercommunicators; might have been that conceptually this was the easiest way to get it to work. MW: I recall insisting on a change to the intercommunicator to get score-p working. AK: Not sure how important this is. NH: There are other things. Maybe something to do with the runoff. The ice model needs to talk to YATM at some point. Maybe a crummy way of knowing where YATM is: for every config you then know where YATM is. These would be shortcut reasons. PL: Give it its own communicator and use that? NH: Maybe that is what we used to do. Could always go back to what we had before. RF: Just an idea, if it would have an impact: could give YATM its own node as a test. MW: Not sure why it is that way. Should be easy to fix. NH: Ok, certain configurations are shared, like timers and coupling fields. Instead of each model having its own understanding, share this information, so models check timestep compatibility etc. Using it to share configs. There is another way to do that. MW: Doesn’t have to be rank zero. NH: Sure, it is just a hack. MW: libaccessom2 is elegant; can’t say the same for all the components it talks to. RF: There is a hard-wired broadcast from rank zero at the end.
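PL’s suggestion, sketched with mpi4py for illustration only (libaccessom2 is Fortran and its actual coupler setup differs): split the world communicator so YATM is found via its own communicator rather than by assuming it is the first PE.

    from mpi4py import MPI

    world = MPI.COMM_WORLD
    # colour 0 for YATM (here arbitrarily the last rank), 1 for the rest;
    # rank placement then no longer matters
    is_yatm = world.Get_rank() == world.Get_size() - 1
    comm = world.Split(color=0 if is_yatm else 1, key=world.Get_rank())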

MOM6

MW: Do we ever talk about MOM6? AK: Angus is getting a regional circum-Antarctic MOM6 config together. RF: Running an old version of CM4 for the decadal project. PL: Maybe a good topic for next meeting?

Attachments

Technical Working Group Meeting, June 2020

Minutes

Date: 10th June, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)

Zarr

JM: Already have compression with netCDF4. How do they consume the data? Jupyter/dask? AK: Mostly in COSIMA; BoM and CSIRO have their own workflows. JM: Maybe archival netCDF4? As long as writing/parallel IO/combining works, no hurry to move to direct zarr output. JM: Inodes not an issue; done badly it can be bad. The lustre file system has a block size, so a natural minimum size. At least as many inodes as allocatable units on the FS. If it is a problem, wrap the whole thing in an uncompressed zip. RY: Many filters; blosc is pretty good. Can use it in netCDF4, but not portable: needs to be compiled into the library. netCDF4 now supports parallel compression; HDF5 supported it a couple of years ago.
AH: As we’re in science, want stable well supported software. Unlikely to use bleeding edge right now. Probably won’t output directly from model for the time being. Maybe post process.
NH: What about converting to zarr from uncollated ocean output? Why collate when zarr uncollates anyway? We collate because it is difficult to use uncollated output. How easy is uncollated to zarr? JM: Should be pretty straightforward: write each block directly to its part of the directory tree. Why is collating hard? Don’t you just copy blocks to the appropriate place in the file? AH: Outputs are compressed. They need to be uncompressed and then recompressed. Scott Wales has made a fast collation tool (mppnccombine-fast) that just copies already-compressed data. There are subtleties. Your io_layout determines block size, as the netCDF library chooses chunking automatically. Some of the quarter degree configs had very small tiles, which led to very small chunks and terrible IO. AK: Regional output is one tile per PE and mppnccombine can’t handle the number of files in a glob. AH: Yes, that is disastrous. Not sure it was a good idea to compress all IO.
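JM’s suggestion could look something like this (a sketch; the glob and output names are hypothetical): go straight from uncollated per-PE tiles to a zarr store, skipping the collation step.

    import xarray as xr

    # open all per-PE tiles and let xarray stitch them by coordinates
    ds = xr.open_mfdataset("ocean_daily.nc.*", combine="by_coords", parallel=True)
    ds.to_zarr("ocean_daily.zarr")  # each chunk lands as its own file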
JM: Good idea to compress even for intermediate storage. Regional collation: what do we currently collate with? The original collation tool? AK: Yes, but don’t have a solution. JM: Definitely need to combine to get decent chunk sizes. If interested, happy to talk about moving directly to zarr, or parallelising in some way. AK: Would want a uniform approach/format across outputs. JM: Not sure why collate runs out of files. AK: The shell can’t pass that many files to it. AH: My recollection is that it is a limit in mpirun, which is why mppnccombine-fast allows quoting globs to get around this issue. Always interested in hearing about any new approaches to improving IO and processing at scale.

ACCESS-OM2-BGC testing and rollout

AH: Congrats on Russ getting this all working. How do we roll this out?
AK: All model components are now WOMBAT/BGC versions, but two MOM exes. One with BGC and one without. All in ak-dev. No standard configs that refer to BGC. Need some files. RF: About 1G, maybe 10 files. Climatologies for forcing and 3 restart files. This will work with standard 1 degree test cases. Slight change to a field table, and a different o2i.nc (OASIS restart file). Not much change. Will work on that with Richard and Hakase. Haven’t tested with current version. Maintaining a 1 degree config should be ok. AK: No interest at high res? NH: Yes, but not worth supporting as yet. One degree shows how it is used.
AK: Would BGC be standard 1 degree with BGC as option, or separate config specialised for BGC? RF: Get together and give you more info. Probably stand separately as an additional config. Currently set it up as a couple of separate input directories. AH: So RF going to work up a 1 degree BGC config. RF: Yes. AH: So work up config, make sure it works, and then tell people about it. RF: The people who are mainly interested know about the progress.

ACCESS-OM2 release version plan

AH: Fixing configs, code, bathymetry. Do we need a plan? Need help?
AK: Considering merging all ak-dev configs into master. Was constant albedo, moving to Large and Yeager as RF advised; didn’t make a big difference. How much do we want to polish? The initial condition is wrong: it is potential temperature rather than conservative. Small compared to the difference from WOA and the drift. Not sure if worth fixing. Also bathymetry at 1 and 0.25 degree; not sure who would fix it. AH: Talk to Andy Hogg if unsure about resources? AK: A number of problems could all be fixed in one go.
AK: Not much left to do on code. PIO with CICE. Still issues?
NH: Good news. In theory compression is supported in PIO with the latest netCDF. The PIO library also has to enable those features. There was a GitHub issue that indicated they just needed to change some tests and it would be fine. Not true: there is code in their library that will not let you call deflate. Tried commenting it out, and one of the devs thought that was reasonable. Getting some segfaults at the netCDF definition phase. Want to explore until I decide it is a waste of time; will go back to offline compression if it doesn’t pan out. Done the naive test to see if there is a simple work-around. Looking increasingly like it won’t work easily. Could do valgrind runs etc.
AK: Bleeding edge isn’t best for production runs. NH: Agreed. Will try a few more things before giving up. AK: Offline compression might be safer for production. NH: Agreed, random errors can accumulate. RY: Segfault is with PIO? NH: Using newly installed netCDF4.7.4p, and latest version of PIO wrapper with some commented out some checks. Not complicated just calls deflate_var. RY: When install PIO do you link to new netCDF library? NH: Yes. Did have that problem before. AG pointed it out, and fixed that. Takes a while to become stable. RY: Do you need to specify a new flag with parallel compression when you open the with nf_open? NH: Not doing that. Possible PIO doing that. RY: Maybe PIO library not updated to use new flags to use this correctly. NH: If using netCDF directly what would you expect to see? Part of nf_create? RY: Maybe hard wired for serial. NH: Possible they’ve overlooked something like this. Might be worth a little bit of time to look into that. PIO does allow compression on serial output only. Will do some quick checks. Still shouldn’t segfault. Been dragging on, keep thinking a solution is around the corner, but might be time to give it up. A bit unsatisfying, but need to use my time wisely. AH: It’s not nothing to add a post-processing compression step and then take it out again. Not no work, and no testing, so out of the box compression would be nice. Major code update you’re waiting on? Need to update fortran datetime library? Just a couple of PRs from PL and NH. Paul’s look like a bug fix PL: Mine an edge case? AK: Incorrect unit conversion. AH: Don’t know where in the code and if it affects us. AK: We’re using the version that fixes that. NH: Guy who wrote that library wrote a book called “Modern Fortran”.

Testing update

AH: MOM travis testing is properly testing ACCESS-OM2 and ACCESS-OM2-BGC, as I have also updated libaccessom2 testing so that it creates releases that can be used in the MOM travis testing. Lots of boring commits to get the testing working. AK: The latest version gets used regardless of what is in the ACCESS-OM2 repo? AH: Not testing the build of ACCESS-OM2, just the MOM5 part of that build, linking to the libaccessom2 library so that it produces an executable and we know it worked successfully. We do compile OASIS, using just the most recent version. Had intended to do the same thing with OASIS, make it create releases that could be used. Haven’t done it, but it is a relatively fast build. AK: OASIS is now a submodule in the libaccessom2 build. AH: Yes, but we still have a dependency on OASIS. Don’t have a clean dependence on libaccessom2. Might be possible to refactor, but probably not worth it. So yes, we have a dependence on libaccessom2 *and* OASIS.
AH: Previously travis allowed ACCESS-OM2 to fail, and the only way to know it worked correctly was to look at the logs to determine if everything was successful up to the linking step.
AH: Currently fixing up Jenkins automated testing. Starting on libaccessom2 testing. Hopefully won’t be too difficult. NH: Definitely need to have it working.
PL: Writing up testing/scaling tests for ACCESS-OM2.

Updated COSIMA Cookbook default database

The COSIMA Cookbook is the recommended, and supported, method for finding and accessing COSIMA datasets.

Currently COSIMA datasets are located in temporary storage under the hh5 project on the /g/data filesystem at NCI. The default COSIMA Cookbook database (/g/data/hh5/tmp/cosima/database/access-om2.db) indexes data in this location.

The COSIMA datasets are being moved to a new project, ik11: dedicated storage provided by an ARC LIEF grant. As part of this transition the default database will change to:

/g/data/ik11/databases/cosima_master.db

and will index all data in /g/data/ik11/outputs/. The database is updated daily.

This change will take place from Wednesday the 1st of July. To access the old database pass an argument to create_session:

session = cc.database.create_session(db='/g/data/hh5/tmp/cosima/database/access-om2.db')

or set the COSIMA_COOKBOOK_DB environment variable, e.g. for bash

export COSIMA_COOKBOOK_DB=/g/data/hh5/tmp/cosima/database/access-om2.db

The new ik11 database can likewise be accessed by passing its path (/g/data/ik11/databases/cosima_master.db) to create_session or the environment variable, as above.
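For example (experiment and variable names illustrative), after the changeover a session against the new default database is simply:

    import cosima_cookbook as cc

    session = cc.database.create_session()  # defaults to the ik11 master database
    temp = cc.querying.getvar('01deg_jra55v13_ryf9091', 'temp', session, n=1)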

Technical Working Group Meeting, May 2020

Minutes

Date: 20th May, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants

ACCESS-OM2-01 scaling experiments

See PL’s associated scaling doc, scaling spreadsheet and python notebook
PL: Scaled MOM5 and CICE5 by the same amount. Based on 01deg_jra55v13_ryf9091. Ran an initial run to get restart output from February 1900, then restart runs for February (28 days). 540s time step, 4480 steps. No diagnostic output. Left ice as is.
PL: CICE ncpus and ntask scaled proportionally. Scaled MOM from 80×75 (4358) to 160×150 (16548). Scaling is ok looking just at the ocean timer and ice timer. Didn’t have daily CICE output.
PL: Most efficient at 10K cores with total wall time. Ocean timer shows perfect scaling. CICE only timer also shows good scaling.
NH: Keen to try these configs in production.
PL: Not sure how appropriate for production, no IO. NH: Good place to start, turn on output and see how it goes. Looks well balanced. Somewhat surprised.
PL: Now trying to reproduce Marshall’s figures from the report. Scales ocean and ice separately. Yet to get reproduction runs going. Working through namelist differences. Sometimes get a silent hang. Worth scaling ocean and ice at same time.
AH: Why do both models scale so well individually but not so well when combined? RY: CICE is waiting for MOM. Maybe some more optimal setting for CPU numbers? AH: Seems odd that MOM is scaling better than CICE, but CICE is waiting for MOM.
NH: CICE waits less for the ocean as cpus are scaled. oasis_recv is constant, which means MOM is not waiting on CICE. Definitely don’t want MOM waiting for CICE. RY: If we increase MOM and reduce CICE would we get better performance? PL: Not sure. Might be useful to know how I got those numbers: using the log file, and the figures are total time divided by number of steps. RF: Output is from access-om2.out. Just a summary; won’t show load balance within MOM. PL: Any guidance would be useful. RF: Look in access-om2.out.
AH: Look for the MOM timers. Might be some information about the range of values; could be some very slow PEs masked by the average. RF: The check mask is out of date. Has 12-16 processors which are purely land: the land mask was changed and the mask not updated. Some processors only have values in the halo boundaries. Crashes otherwise.
PL: Regenerated new mask files. Numbers should agree with what was done. Any more advice would be welcome; send email or talk on slack. RF: I’ll look at CICE layouts and balance, and masks. CICE is also seasonally dependent.
PL: Moving to a more conventional experiment layout. Will move to a shared location. AH: Could put it in /g/data/v45.
AK: CICE scaling with serial IO. Nic almost finished PIO. Will stop scaling without PIO. Runs much faster with parallel even with monthly outputs. AH: Seems to be scaling ok. AK: Any output written? AH: Running for a month, so should be some output.
AH: Ran from initial conditions? PL: Yes. Ran for 1 month with timestep of 300s. Then ran from those restarts with timestep of 540s. AH: There is an ice climatology? RF: If run for a month, should have generated ice. AK: Ice generated from surface temperature in initial conditions.
RY and PL left meeting.
AH: Maybe a bit more to look at in PL’s runs. NH: May have misunderstood where those numbers came from. RF: Looked like it was scaling nice and linear. AH: Yes for each model, but together scaling died going to 20K. RF: Not sure these results are that useful once IO is turned on. Code paths are not currently being exercised without IO: putting stuff on density levels, and a whole lot of globals/collectives that aren’t being done. AH: Encouraging though. NH: In principle it can scale up.

PIO compilation in ACCESS-OM2

NH: Got a reply from NCI. Resistance to having PIO in a module. Best to be self sufficient. If it turns out to be an issue can address later. Will make a submodule. Clean up the build process. Changes to CICE repo. One CICE namelist change, tell it to not explicitly use netCDF for certain things. Bit odd.
NH: Experiment repos will require updates. Maybe AK will report some more realistic performance numbers.
AH: PIO with MOM? NH: Not sure. CICE isn’t doing a great deal in the configuration I am using. Seems to all work inside parallel netCDF, as it is doing output from all processors. Can use IO nodes and comms, but that doesn’t show a performance improvement, and looks worse in many cases. We could configure it the same way without using PIO. RF: Don’t have much control over where we put processors. CICE is at the end, probably sharing with MOM. Playing with layout might be tricky. NH: At some stage put CICE on all its own nodes. RF: Once YATM is on the first node, it ends up messing things up. NH: Why are we doing that? RF: Something to do with OASIS in the old days. Now have YATM and the root PE of MOM on the same node. Would make sure all root PEs are on their own node; no contention. YATM and MOM are also on the same NUMA partition. NH: We should change that, easy fix. YATM doesn’t do much at 0.1 as the rest of the model takes so long. RF: Two IO processors on the same node: the MOM root PE, used for diagnostics, and the YATM process. NH: If each model is on its own nodes, could make sure each node has a single IO processor. With PIO, if you want 1 IO node per 16 processors, don’t know if it is talking across nodes.
JM: In terms of PIO, are multiple nodes writing to the same file? NH: For CICE every single process is writing to the same file at the same time. Works well. Haven’t looked into it deeply; probably the optimum is something in between. Still a big improvement over serial output. AH: Kaizen (改善): small incremental improvements all the time. Compressed netCDF output? NH: No. PIO GitHub talked about supporting compression. AH: Same as what RY and Marshall did? NH: Yes. Have to wait for a parallel netCDF implementation which supports it. Confusing because there is also p-netCDF; PIO is a wrapper. AH: Yes, it wraps p-netCDF and parallel netCDF4; p-netCDF is only netCDF3, not based on HDF5. AK: Will need a post-processing compression step. NH: Task is not done until compression is done. AK: Very sparse data, shame not to compress.
AH: xarray is supporting sparse data now. FYI. Can mean a lot less memory use for some data.
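For instance, converting a mostly-empty DataFrame with sparse=True (a toy example; requires the sparse package):

    import pandas as pd
    import xarray as xr

    df = pd.DataFrame({"x": [0, 1000], "y": [0, 2000], "v": [1.0, 2.0]})
    # dense conversion would materialise a 1001x2001 array of mostly NaNs;
    # sparse=True stores only the two values
    ds = xr.Dataset.from_dataframe(df.set_index(["x", "y"]), sparse=True)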

Compiling with/without WOMBAT

AH: Any speed/memory use implications of always having it compiled in? RF: Should be separate. Overhead is basically nothing: will only allocate BGC arrays if they’re in the field table. Should be kept separate like all other BGC packages. I put some lines in the compile scripts, also if you want to compile without ACCESS.
AK: If want to maintain harmony with CM2 want a non-BGC compilation? RF: Yes. AK: From the point of view of OM2 users would be nice to be able to switch BGC on and off just through namelists. RF: Switched on via field_table. Strange design choices years ago. Also need changes in some of the restart files, o2i.nc and i2o.nc. AK: Not something that can be switched on and off? RF: No.

MOM Pull Requests

AH: Guidance for checking? RF: The two main changes in the code are probably fine. Maybe the ACCESS compilation scripts, unless we want to change so that it gets compiled in all the time for ACCESS-OM. AH: Decided not to, I think. RF: Made changes to install.sh to specify the type of model. AH: A separate model designation with WOMBAT? RF: ACCESS-OM-BGC is a new model type. Ran tests, all ok. AH: Do we need any tests to check it hasn’t changed non-BGC runs? RF: Shouldn’t be anything that affects a normal run. Code compiled ok on travis. Put in some heat diagnostics, the fluxes from CICE; might be the only thing. AH: Are Jenkins tests working? NH: ACCESS-OM2 tests haven’t worked since we moved to the new machine. RF: Run a 1 degree model and see how it goes. AH: I’ll do that.
AK: For managing ACCESS-OM2, should the distinction between BGC and non-BGC be in the control directories? So the build script builds both and you choose which in the config, or compile once, supporting both. AH: I don’t think BGC is a supported configuration yet; needs testing. How it is implemented, shared or separate exes, is just a choice of what you decide is best.
AH: Turns out that Geos PR was a mistake. Asked about it, and they closed it.

Bad bathymetry

AH: Any comments? Does it need fixing? RF: Bad bathymetry needs to be fixed, or copy bathymetry from somewhere else. Bad around Australia. Same for CM2. Mentioned it 3-4 years ago, still not updated. Some pits in Gulf of Carpentaria down to 120m in 0.25. 1 degree goes down to 80m. Should be no deeper than 60m. OCCAM created some bad bathymetry in Bass Strait, off coast of China. Russian and Alaskan issues, and White Sea. Remapping indices got mucked up. AH: Wasn’t 0.25 fixed north of Bering Strait? RF: Doesn’t look like it.
JM: Bathymetry files are wrong in certain regions? RF: Came from the Southampton OCCAM model. They ran it with a normal Mercator and a transverse Mercator across the top. In remapping onto a spherical grid the indices got mucked up, producing some strange bathymetry. GFDL inherited it and based a bunch of models on it, and it leaked through to the ACCESS models. Was in the US forecast model and they noticed all the stuff around Alaska.
AH: Should be relatively straightforward as this is only ocean bottom cells, and doesn’t touch coasts? RF: Yes. AK: Base it on a coarsened tenth grid? RF: Not a big job, just a few slabs that need smoothing/removing. AH: Does this need to be fixed for the next release of OM2? RF: Yes. AK: No. RF: Get a student to look at it. AK: Also land mask inconsistencies; would be good to have all three models consistent. There are big curvy bits of coastline keeping ocean away from the tripoles. AH: The 1 degree is very much a model that isn’t that realistic. Tenth starts to look much more like real life.

Zarr file format

AH: Wanted to engage JM about zarr. RF: Interested as this is being used in the decadal prediction project. JM: Exactly. Talked today about parallelising output from the model into netCDF, while post-analysis requires transforming to zarr. Zarr is a distributed file format that stores files in directories: each chunk is a separate file, and parallelisation is handled by the filesystem. Should we write directly into a zarr-like file format? There are file formats like it; netCDF may get a zarr-like back-end. RF: There is some discussion on the netCDF GitHub about zarr; looks like just one person. JM: Unidata is willing to move away from HDF5. Parallelisation of HDF5 has never worked the way it was supposed to. Instead of using parallel IO, just write directly to the format people want to use. AH: Got the impression the netCDF people never got the buy-in from HDF5 that they thought they would; HDF5 just do their own thing. JM: Still have people using netCDF3. AH: A strength of netCDF: they could hop back-end again and keep the same interface. JM: Same data model. AH: What is the physical format of a zarr blob? JM: It is a binary blob that supports different filters/compression schemes. AH: Does it do machine independent storage? Bad old days of swapping endianness on binary files. AG: In zarr there are raw data blobs, and associated metadata files that describe the filters/endianness etc.
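A small example of the directory-of-chunks layout JM describes, using the zarr python API with a blosc compressor (shapes and names arbitrary):

    import numpy as np
    import zarr
    from numcodecs import Blosc

    z = zarr.open("example.zarr", mode="w", shape=(4320, 2700),
                  chunks=(540, 540), dtype="f4",
                  compressor=Blosc(cname="zstd", clevel=3))
    z[:] = np.random.rand(4320, 2700).astype("f4")
    # on disk: example.zarr/.zarray holds the metadata (shape, chunks,
    # compressor, endianness) and each chunk is its own file, e.g. 0.0, 0.1, ...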
JM: Inodes not a problem. Still relatively large, on the order of the lustre striping scheme. Can wrap the whole thing inside an uncompressed zip file. Parallelises for reading just fine. Works like a tar, index on where to read, supports multiple reads on same file. AH: Would want to do this when archiving.
NH: Another one is TileDB, which is a file format. JM: There are other backends, n5/z5. Distributed storage for large data sets.
AH: At one stage we did wonder if collation was even necessary with tools like xarray, but never looked into it. NH: Things have changed a lot; xarray is relatively new. 3-5 years ago it might segfault on tenth model data. So much better now, so many more possibilities.

 

Technical Working Group Meeting, April 2020

Minutes

Date: 29th April, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Apologies from Peter Dobrohotoff.

JRA55-do v1.4 support

AK: Staged rollout. NH tagged some branches: existing master tagged 1.3.0, using old JRA55-do v1.3.1 with NH’s new exes, which also support 1.4.
AK: Also working on a new feature branch for 1.4. Same exes configured to use JRA 1.4 version. Seems to run ok. Not looked at output. Will look at that today. Once satisfied that is ok will move into master, tag 1.4.
AK: Also looking at ak-dev branch with a wide variety of changes. Once this is ok will tag with a new ACCESS-OM2 version. Will be new standard for new experiments. Good to make an equivalent point across repos.
AH: The COSIMA cookbook hackathon showed the value of project boards. Might be a good idea next time something like this is attempted. AK: Tried, but it didn’t go anywhere.
NH: Two freshwater fields come from the forcing, liquid and solid. Both go into the ice model, which accepts one new forcing field. They get added together: solid magically becomes liquid without heat changes, and is passed straight to the ocean. The ocean and ice models have also been changed to accept the liquid part of land ice melt and the heat part of land ice melt. These exist but just pass zeroes; extra engineering not being used as yet. A harmonisation step which takes us closer to CM2, as the coupled model uses these fields.
RF: My WOMBAT updates incorporated this new code; could get rid of the ACCESS-CM preprocessor directives.
NH: In the future can put work into calculating those fields correctly in the ice model. Not a huge amount of work. Will then have river runoff, land ice runoff and land ice melt heat.
NH: New executables have another change, support different numbers of coupling fields. Land/ice coupling fields are optional. At runtime figures out what coupling fields used. Dependent on namcouple being consistent. Coded internally as a maximum set of coupling fields. You can take coupling fields out but not add new ones. Possibly useful for others. Not a fully flexible coupling framework.
NH: Working on ak-dev branch. Harmonising namcouple files. Have a lot of configuration fields, but a lot ignored. Could use same namcouple in all configs, but in practice might leave them looking a little different. They include the timestep in them, but ignored. Could set to zero? AH: Or a flag value that is obviously ignored?
NH: Only three variables are used in namcouple. The rest are ignored, but must parse properly; needs cruft to make it parse. Never liked namcouple. Completely inflexible: values must be changed in multiple places.
AK: We are on OASIS3-MCT v2; have they improved it in the new version?
NH: Can now bunch fields together, pass a single 3d field instead of many 2d fields. Should improve performance. RF: Not through namcouple at all. Just a function call.
MW: What does OASIS do now? NH: Just doing routing. Which is done by MCT anyway. Remapping done by ESMF. Coupler meant to do 3 things, config, remap and routing. Made libaccessom2 do as much as possible automatically. So OASIS does very little. Still using API, so would require effort to remove.
MW: Know about NUOPC? NCAR is using it. NH: A coupling API. If all models use the same API then you can go plug and play. MW: MOM6 has a NUOPC driver. NH: In the future would like to look at OASIS4, but probably just chuck OASIS, use MCT to do the routing and ESMF to do remapping. MW: NCAR dropped MCT. NH: MCT is a small team. AK: Something that would suit ACCESS-CM. Any critical things that rely on OASIS? MW: At the mercy of the UM. Probably still use OASIS due to Europe. NH: They are not using ESMF, so using OASIS a lot more than we are. Might never change because of that. AK: Even moving to v4 would require coordination with CM2. NH: Nicer and cleaner, but no clear benefit.

Updated ACCESS-OM2 model configs

AK: 3 different tags: 1.3.1, 1.4 in the works, and a new ak-dev tag. 1.4 intended to be minimal, other than the change in JRA55-do version. ak-dev is making more extensive changes. Using mppnccombine-fast for tenth: output compressed data and use fast collation. Not worthwhile for 1 deg. With 0.25, output uncompressed and use mppnccombine to do the compression. Hopefully output will be a reasonable size.
AK: If outputting uncompressed restarts might get large. Might want to collate restarts. Wanted to verify which run is collated: just finished, or the previous run? AH: Yes it is the restarts which are not used in the next run.
AH: Because quarter degree is not compressed, you won’t get the inconsistent chunk sizes between different sized tiles. Ryan had the problem when he had an io_layout with very small chunk sizes, which made his performance very bad. mppnccombine-fast might be faster and will definitely use less memory. Still got compression overhead but memory use is much reduced. AK: Not such a big issue as tenth. AH: Paul Spence had some issues with the time to collate his outputs, maybe because they were compressed. Would recommend using it. AK: The fast version will always be faster? AH: Yes, at least no slower, but it definitely uses less memory and will be much faster with compressed output.
MW: No appetite for FMS with parallel-IO? AH: Compression? Without it won’t bother probably. RY: Did some tests on parallel IO compression tests. Can’t recall results. Interested to try again. Requires a bit more memory. gadi has optane as storage or as memory. Interesting to test. Probably can use that for parallel compression or even just serial compression. Thinking about, but haven’t started. AH: Please keep us updated.
NH: Anyone have thoughts on CICE? Planning on parallel IO in CICE. Are we going to need a compression step? RF: With daily output would like compression. Post-processing to do compression on a smaller number of PEs would be fine. Improving IO is critical for Paul Sandery and Pavel. NH: Might need a post-processing step similar to MOM. RF: Yes. Getting parallel IO is the most important; worry about compression later. NH: Did a run yesterday with parallel IO. Completed successfully. Output was garbage. Was expecting to do heaps of work and segfaults; surprised at that. RF: Misaligned or complete garbage? NH: Default assumption is as bad as can be. Just used the parallel IO output driver in CICE. AK and RF realised daily CICE output was a bottleneck on 0.1 performance. As the model code existed, decided to get it working. RY: Parallel IO needs the mapping set up correctly between compute and IO domains. NH: Should be part of the current implementation. Mapping is a tricky part of CICE. AK: Values out of range, so maybe not just a mapping issue? NH: Completely broken, but not segfaulting. Just getting it building was one hurdle. Also had to call the right initialisation stuff within CICE. Had to rewrite some of it that was depending on another library from one of the NCAR models (CESM): CICE is used with CESM and they had a dependency on another utility library. Changed some code to remove the dependence. Relatively positive. Library is under active development and well supported. AH: Did they develop it just for their use case, and maybe it doesn’t support round-robin? NH: Not sure. We do know it has never been used in any model other than CESM.
MW: Ed Hartnett (PIO) eager to get into FMS. Also lead maintainer of netCDF4.

Status of WOMBAT in ACCESS-OM2

RF: Compiled. Next is testing. Up to date with current ACCESS-OM2 code changes. Had issues with submodules. AK: Previously libaccessom2 dependencies were brought in through CMake, now moved to submodules. If you have an existing repo you will have to initialise submodules to pull in the latest from GitHub.
RF: Made some changes to installation procedures. Can go between the BGC version and standard ACCESS-OM. Want it to be different for the BGC version. Changes to install scripts and hashexe etc. AH: Good that it is up to date; could have been a messy merge otherwise. RF: Will run tests today or tomorrow.

MOM5 PR from GEOS-ESM

AH: Seen this PR? Seemed a bit odd to me. First idea was to ask them to split the PR into science changes and config changes. RF: Looked like a lot of it was config changes. MW: Adding the GEOS5 stuff, which they shouldn’t. Code changes are challenging. Introduced a generic tracer; not sure what they’re doing with it. AH: Strategy? Ask them to wrap science stuff in preprocessor flags? MW: First step is to get the config stuff out. Asked GFDL about it. GEOS are switching from MOM5 to MOM6; this must be associated with that effort to validate their runs. Maybe just giving back what it took to get it to work. Maybe it just makes their build process easier. AH: They have a specific requirement to use the same FMS library. Seems odd, as MOM5 and MOM6 are not likely to share FMS versions in the future. MW: Thorny topic, as it is not clear how FMS compatible MOM6 will be in the future. AH: Using FMS for less and less. MW: The PR needs to be cleaned up. AH: Also put in a CMake build system. MW: They need to explain more.
AK: Has conflicts, so can’t be merged at the moment. AH: Only going to get more conflicted, which is why I was thinking they could split it up. I have a CMake build system in another branch, but never finished it; if we can use theirs, cool. I’ll engage with them.

Miscellaneous

AH: Been experimenting with graceful error recovery with payu. Can specify a script which can decide if the error is something you can just resubmit after. Mostly of interest to the production guys.
PL: Scalability testing with land masks, manifests, and payu setup. Supposed to be simpler but taking some time to get used to. AH: Manifests are relatively new, so some of the use cases have not been well tested. MW: Are they not all using manifests? AH: They are, but they can be used in different ways. Tracking always works, but there are options to reproduce inputs and runs. Suggested PL could use reproduce to start a run. It was confounded by some restarts being missing, so not quite sure if it works as we would like. This is a very desirable feature, as it makes it very simple to fork off new runs from existing ones as well as making sure the files are consistent. PL: Working now. Next step is to change core counts and look for scalability numbers. AH: When I was doing scalability stuff for MOM-SIS I used input directory categories to isolate processor changes. Not quite doing the same thing anymore, but you can do something similar; you won’t want to use the reproduce flag if you are changing any of the input files.
AK: Just MOM scaling or CICE as well? PL: Just looking at MOM to begin with, to see dependency and wait times. AK: CICE run time is critically dependent on daily outputs. Relevance of scaling data depends on production output. MW: Make sure your clock can tell them apart; in principle you can distinguish compute from IO. AH: Is daily output always part of production? AK: Ice modellers want very high temporal output. Ice is very dynamic; even daily output is not enough to resolve some features. Maybe wait for PIO for CICE scaling tests? AH: I thought scaling tests always turned off IO? Can’t properly test scaling with daily output, as it dominates runtime.
NH: Would be nice to look at performance with and without PIO. PL: Will also look at CICE; start with the ocean model. AK: Were you (MW) running the models coupled for the paper scaling numbers? MW: Coupled. Not sure what IO was set to. Subtracted it and don’t recall it was large. Don’t recall a bottleneck, so might have had it turned off. RF: Wouldn’t have been running with daily IO; monthly IO doesn’t show up. MW: Sounds likely.
AK: For IAF had a lot of daily CICE output. Not complete set of fields.
MW: Starting to run performance tests at GFDL and want to use payu. Has it changed much? Manifest stuff hasn’t made a big difference? Will have to get slurm working. Filesystem will be a nightmare. You moved PBS stuff into a component? AH: No, you did that. Not huge differences. Will be great to have slurm support.

Technical Working Group Meeting, March 2020

Minutes

Date: 18th March, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Scalability of ACCESS-OM2 on gadi

(Paul’s report is attached at the end)

PL: Looking at scaling. Started with ACCESS-OM2, but went to testing MOM5 directly with MOM5-SIS. Using POM25, a global 0.25 model with NYF forcing: the model MW developed for testing scaling prior to ACCESS-OM2. Had to specify min_thickness in ocean_topog_nml.

PL: Tested the scaling of 960/1920/3840/7680/15360, with no masking. Scales well up to some point between 7680 and 15360.

PL: Tested the effect of vectorising options (AVX2/AVX512/AVX512-REPRO). Found no difference in runtime with 15360 cores. MW: Probably communication bound at that CPU count. Repro did not change the time.

MW: Never seen significant speed-up from vectorisation. Typically only a few percent improvement. The code is RAM bound, so cannot provide enough data to make use of vectorisation. Still worth working toward a point where we can take advantage of vectorisation.

PL: Had one “slow” run outlier out of 20 runs. Ran 20% slower. Ran on different nodes to other jobs, not sure if that is significant. MW: IO can cause that. AH: Andy Hogg also had some slow jobs due to a bad node. AK: Job was 20x slower. Also RYF runs become consistently slower a few weeks ago. MW: OpenMPI can prepend timestamps in front of output, can help to identify issues.

PL: Getting some segfaults in ompi_request_wait_completion, caused by pmpi_wait and pmpi_bcast, both called from the coupler. NH: Could be a bad bit of memory in the buffer, which can segfault when copied. PL: Thinking of running again under valgrind, but that would require compiling my own version of the valgrind wrapper for OpenMPI 4.0.2. Would be easier with Intel MPI, but no-one else has used this. Saw some similar cases when searching, associated with UCX, but sufficiently different to not be sure. These issues are with the highest core count. MW: Often see a lot of problems at high core counts. NH: Finding bugs can be a never-ending job; use time wisely to fix bugs that affect people. MW: Quarter degree at 15K cores would have very small tile sizes; could be the source of the issue. AH: This is not a configuration that we would use, so it is not worth spending time chasing bugs.

PL: Next testing target is 0.1 degree, but not sure which configuration and forcing data to use. Will not use MOM5-SIS, but will use ACCESS-OM2 for direct comparison purposes. AK: Configurations used in the model description paper have not been ported to gadi. Moving on to a new iteration. Andy Hogg is running a configuration that is quite similar, but moving to new configurations with updated software and forcing. Those are not quite ready.

PL: Need a starting configuration for testing. Want to confine it to scalability testing and compiler flags. NH: ACCESS-OM2 is set up to be well balanced for particular configurations. Can’t just double CPUs on all models, as load imbalance between submodels will dominate any other performance changes. Makes it problematic for clean comparisons of things like compiler flags. MW: A useful approach was to check scalability of sub-model components independently. Required careful definition of timers to strategically ignore coupling time. MOM was easy; CICE was more difficult, but work with Nic’s timers helped a lot. Try to time the bits of code that are doing computation separately from code that waits on other parts. The coupled model is a real challenge to test. Figure out what timers we used and trust those. Can reverse engineer from my old scripts.

PL: Should do MOM-SIS scalability work? MW: Easier task, and some lessons can be learned, but runtime will not match between MOM-SIS and ACCESS-OM2. Would be more of a practice run. PL: Maybe getting out of scope. Would need 0.1 MOM-SIS config. RY: Yes we have that one. If PL wanted to run ACCESS-OM2-01 is there a configuration available? AK: Andy Hogg’s currently running configuration would work. PL: Next quarter need to free up time to do other things.

MW: Might be valuable to get some score-p or similar numbers on the current production model. Useful to have a record of those timings to share. A scaling test might be too much, but a profile/timing test is more tractable. RY: Any issues with score-p? Overhead? MW: Typical, 10-20%, so it skews numbers but you get an in-depth view. Can do it one sub-model at a time. Had to hack a lot of scripts, and get NH to rewrite some code to get it to work. score-p is always done at compile time; doesn’t affect payu. Try building MOM-SIS with score-p, then try MOM within ACCESS-OM2. Then move on to CICE and maybe libaccessom2. PL: The build script does include some score-p hooks. MW: Even without score-p MOM has very good internal timers, but not per-rank times. score-p is great for measuring load imbalance. AH: payu has a repeat option, which repeats the same time, removing variability due to forcing. Need to think about what time you want to repeat as far as season goes. AK: CICE has idealised initial ice, which evolves rapidly. MW: My earlier profile runs had no ice, which affects performance. Not sure it is huge, maybe 10-20%, but not huge.

MW: Overall surprised at lack of any speed up with vectorisation, and lack of slow-down with repro. PL: Will verify those numbers with 960 core config.

AH: Surprised how well it scaled. Did it scale that well on raijin? MW: The performance scaling elbow did show up lower. AH: 3x more processors per node has an effect? MW: Yes, big part of it. AH: 0.1 scaled well on raijin, so should scale better on gadi. 1/30th should scale well. Only bottleneck will be if the library can handle that many ranks.

NH: If repro flags don’t change performance that is interesting. Seem to regularly have a “what trade off does repro flags have?”, would be good to avoid. MW: Probably best to have an automated pipeline calculating these numbers. NH: People have an issue with fp0 flag. MW: Shouldn’t affect performance. NH: Make sure fp0 is in there. MW: Agree 100%.

ACCESS-OM2 update

AH: Do we have a gadi compatible master branch? AK: No, not currently. NH: At a previous TWG meeting I self-assigned getting master gadi compatible. Merged all gadi-transition branches and tested; seemed to be working ok. At a subsequent meeting AK said there were other changes required, so I stopped at that point. gadi-transition branches still exist, but much has already been merged and tested on a couple of configurations. Have since moved to working on other things.

NH: Close, once AK has everything he wants in the gadi-transition branch. The previous merge didn’t include all the things AK wanted in there. Happy to spend more time on that after finishing the JRA55 v1.4 stuff.

JRA55-do v1.4 update

NH: Made code changes in all the models, but have not checked existing experiments are unchanged with modified code.

NH: v1.4 has a new coupling field, ice calving. Passing this through to CICE as a separate field. In CICE split into two fields, liquid water flux and a heat flux. MOM in ACCESS-CM2 already handles both these fields. Just had to change preprocessor flags to make it work for ACCESS-OM2 as well.

NH: Lots of options. Possible to join liquid and solid ice at the atmosphere, which becomes the same as we have now. Can join in CICE and have a water flux but not a heat flux.

Strange MOM6 error

AH: A quick update with Navid’s error. Made a little mpi4python script to run before payu to check status of nodes, and all but root node had a stale version of the work directory. Like it hadn’t been archived. Link to executable was gone, but everything else was there. Reported to NCI, Ben Menadue does not know why this is happening. Also tried a delay option between runs and this helped somewhat, but also had some strange comms errors trying to connect to exec nodes. Will next try turning off all input/output can find in case it is a file lock error. Have been told Lustre cannot be in this state.

MW: The old driver does a lot of moving directories from work to archive, and then relabelling. Is it still moving directories around to archive them? Maybe replace with a hard copy of the directory to archive. The MOM6 driver is the MOM5 driver, so maybe all old drivers are doing this. Definitely worth understanding, but a quick fix is to copy rather than move.

NH: Filesystem and symbolic links might be an issue MW: Maybe symbolic links are an issue on these mounted filesystems. AH: There was a suggestion it might be because it was running on home which is NFS mounted, but that wasn’t the problem. MW: Often with raijin you just got the same nodes back when you resubmit, so maybe some sort of smart caching.

 

Scalability of ACCESS-OM2 on Gadi – Paul Leopardi 18 March 2020