Technical Working Group Meeting, April 2021

Minutes

Date: 21st April, 2021
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Paul Leopardi (PL), Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale
  • Mark Cheeseman (MC) DownUnder GeoSolutions (DUG)

ACCESS-OM2 Model Scaling on DUG

MC: ACCESS-OM2 running on KNL and Cascade Lake. Concentrating on KNL.
0.1 degree config not running. Issues with our OpenMPI 4.1.0 upgrade affect this.
OpenMP threading provides no performance gain in CICE or OASIS.
Profiling difficult but possible. Scalability is an issue.
Modifications: Added AVX512 compiler flag plus appropriate array alignment. Added VECTOR pragmas in key CICE routines; Intel-specific pragmas needed where necessary. One or two routines in CICE need them, and they have been listed.
OpenMP enablement: A bug was stopping OpenMP; needed to add the openmp flag to LDFLAGS in MOM and OASIS. The CONCURRENT compiler directive is not recognised by the latest Intel version.
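For reference, a minimal sketch of the style of change described above: an Intel-specific vectorisation directive on an inner loop plus OpenMP threading over CICE blocks. The routine, loop bounds and build flags shown are illustrative assumptions only, not DUG's actual modifications.

    ! Sketch only: Intel-style vectorisation directive plus OpenMP threading over blocks.
    ! Typical Intel build flags for this approach: -xMIC-AVX512 -align array64byte -qopenmp
    ! (the -qopenmp flag is also needed in LDFLAGS, as noted above).
    subroutine stress_like_kernel(nx, ny, nblocks, a, b, c)
      implicit none
      integer, intent(in) :: nx, ny, nblocks
      real(kind=8), intent(in)    :: a(nx,ny,nblocks), b(nx,ny,nblocks)
      real(kind=8), intent(inout) :: c(nx,ny,nblocks)
      integer :: i, j, iblk

      !$OMP PARALLEL DO PRIVATE(iblk, j, i)
      do iblk = 1, nblocks
         do j = 1, ny
            !DIR$ VECTOR ALIGNED   ! Intel-specific: assumes 64-byte aligned arrays
            do i = 1, nx
               c(i,j,iblk) = c(i,j,iblk) + a(i,j,iblk) * b(i,j,iblk)
            end do
         end do
      end do
      !$OMP END PARALLEL DO
    end subroutine stress_like_kernel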
1 degree model on 68-core KNL nodes. Scaling not great. Tried a number of different configs, processor shapes and domain decompositions. Latitude-flavoured CICE decomposition (slenderX1) was best. OASIS doesn’t scale.
By component, stress takes the largest percentage of time (12 MPI tasks). Not limited by comms. 37% vectorisation; could be non-contiguous data or malloc stalls. OpenMP would have most effect on vectorisation.
OASIS does a lot of memory alloc/dealloc of temporary buffers, a known weakness of KNL: 37% in isend_, 16% in non-contiguous memory placement. Need to keep data in cache/fast RAM. OpenMP enablement is possible in OASIS. Has it been explored? Looks like multithreaded reading and multithreaded MPI passing have been tried; at best it did nothing, at worst it caused crashes.
PL & NH: Never tried.
MC: Could talk to French OASIS team see if they have tried this.
0.25 scales ok out to 18 KNL nodes (1224 cores). OASIS scales poorly. Other components don’t scale well > 18 nodes. MOM scales well by itself at 0.25, is this all OASIS or OpenMPI as well? Maybe a little of both, but can’t prove it.
New work: OpenMP running. CICE looks promising. Possibly OASIS.
0.1 still not working. UCX driver not working. Input data available due to new transfer tool.
DUG Insight use: real-time analysis and computation is the target.
1 yr of 0.1 data downloaded for next prototyping effort. See if real time analysis and computation possible.
NH: Careful not to measure waiting in OASIS time. This is mostly what it does. MC: Masking load imbalance? NH: Might be. Just be careful. It is where the model imbalances show up. MC: Did try different layouts and still shows similar behaviour. NH: Really hard to benchmark coupled model for this reason. Look at ice and ocean numbers in isolation. MC: Interesting to talk about common approaches to benchmarking. Talked to CRAY about this sort of thing. Sometimes separate components and run in isolation. score-p doesn’t work like that. Using gprof.
PL: Anything useful you can say about OpenMP in MOM? MC: Not spent much time on that. PL: Looking at the source, not many places where OpenMP is present. Maybe not a great fit for KNL without a very good MPI implementation. MC: Two ways to get performance on KNL: exploit on-node parallelisation, and use as many of those cores as possible. If you can split up well with MPI it can work well. Worked ok with MOM by itself on KNL. Not so worried. Good parallelisation as it is. When linking in OASIS calls with the coupling module inside MOM, if OASIS is built with OpenMP you need that flag in your link statement, otherwise you get a kmp linking error.
AH: Re 0.25 scaling, how many nodes is last data point? MC: 34. AH: So 2.5K cores. 0.25 stops scaling much above that on conventional Intel cores. MC: Ideally move scalability forward, as KNL cpus are 1/3 clock speed, so need more scalability to match that. Would hope OpenMP might help. AH: Scaling most difficult with load balancing in multi-model coupled models.
NH: Good to compare gadi scaling results to see how much it can realistically scale, to see whether it makes sense to go to KNL. If we can’t squeeze more scalability out of it, it is never going to match conventional architecture.
MC: New architectures like Sapphire Rapids are basically a beefed-up KNL. Faster cores, but a big chunk of fast memory. Could look like KNL. Might be a preview of the Intel world.
NH: KNLs are much cheaper to run, so given that discount is it possible to do a cost comparison? An a priori cost calculation.
MC: Could have done, but more like a sales pitch. Standard analysis to see best fit for code.
RF: Not sure why the atmospheric boundary layer routine popped up in OM2. Don’t think it should go through that routine. Square processor domain decomposition in 0.25 is massively unbalanced. MC: Tried a number of domain decompositions, layout and type. RF: Depends on the month, depending on north/south distribution of ice. Using one month gives a biased result. MC: Used slender for 1 deg, square for 0.25. Using just January. RF: CICE doing all its work in the northern hemisphere. A lot of nodes doing no work. Slender is ok at 1 degree with a few tiles north and south, should be ok. MC: Maybe a couple of 1 month runs? RF: Often run 1 degree for a whole year; reasonably well distributed over the year. Jan & Jul or Mar & Sep would give a good spread, and will give max/min.

DUG copy

MC: Want to move TBs of data into/out of as many sites as possible. Current tools are easy but not quick or efficient. Visualisation/real-time monitoring. Biggest bottleneck is the client endpoint. AARNET gives 2+MiB. Current tools get <1% of peak speed. Azure and AWS provide their own high speed data transfer tools.
MC: Currently experimental. CLI. Allows quick transfer of directories, very large files and/or large numbers of files. Creates parallel rsync-like file transfers, 192-way by default. All transfers are logged and can be restarted. ssh authentication. Self-contained single binary that runs on CentOS/Ubuntu. Includes verification checks. Part of a utility called dug. HPC users will get a cut-down version.
MC: CLI looks like scp. Can do offline verification.
MC: Copied 400GB from gadi to Perth at over 500MB/s.
MC: Current challenges are authentication and IT policy limitations on remote sites. Want even faster without ridiculous numbers of threads. In development right now. Hope to release in 2021 to DUG customers.
AH: Would be great if some part of it could be open source. Still offer special features for customers but get value from exposure for DUG and potentially improvements from the community. Ever talked to ESGF? Not sure if they have a decent tool.
MC: Always a balancing act. Don’t want to have two separate projects. Do have some open source projects. Have vigorous discussions about this.

Forcing perturbation feature

NH: Dropped PIO due to working on forcing. Work going well. Currently working on JSON parsing code. One of the trickiest bits. Split out a separate library for forcing management. Includes all the perturbation, reading/caching/indexing of netCDF files. Cleaner way to break things out. API to configure forcing. Stand-alone tests. Ready soon. Ryan keen to use.
AK: Visiting students keen to do perturbation experiments.

Relative humidity forcing of ACCESS-OM2

RF: Not done much. Will develop a completely new module using all the Gill (1982) formulations, rather than work in with the code as-is, which would be a mess of conditional compilation and IF/THENs. Have everything consistent rather than GFDL style. GFDL developed theirs with an atmospheric model in mind rather than just the boundary layer. Should be clean and compact, just a few elemental functions. The other module is pretty old (around 2000). When Simon/Dave took the saturation vapour pressure stuff they picked the eyes out of bits. There is redundant code. It’s a real nightmare.

CICE Harmonisation

NH: Which way do we go? Who merges into whom? Might be pretty easy for us to look at their changes to begin with if there are not too many. When they come to grab our stuff it will be a lot easier.
AH: I took it upon myself to make a PR that isn’t just rubbish. Problem is that after the codebases diverged there was a big commit where they dumped *all* the GA7 changes in. Ended up touching a lot of files that NH had also changed. So quite messy. Was trying to make something we could compare. Maybe just put that up as a branch to begin with. Anything that stops code being hidden in private svn repos. CICE harmonisation is not critical. Can be done at a later stage, urgency depending on performance at 0.25 degrees.
AH: They have already advertised a postdoc to work on a model that doesn’t exist yet. NH: Good to have a customer first!

COSIMA metadata

AH: Cleaning up of tags in cosima data collection would be good. Can be very explicit about what you want people to use. No point having multiple tags for same thing. Needs someone to take charge of that. AK: Need a glossary for what tags are what. AH: Could put on database build repo?

Assimilation

RF: Still working with Pavel on a reduced config and restarts for assimilation runs. Some restarts are redundant when changing the ocean state at the beginning. How to override state and do it consistently. Fewer restarts = faster start-up. Make some restarts optional and allow the user to turn them off. All restarts are needed for a proper run, but not for an assimilation run.

RF: Working on configuration using fewer nodes. Begin and end time is almost a constant. Only run 3 days. Start/stop is large percentage of wall time. A lot less cost if using fewer cores. NH: Still some minimal configs on COSIMA repo. No longer maintained. Were being used some time ago for cheap(er) testing RF: Will take a look, be a good starting point. AH: It is a 2064 cpu config. RF: 2000 cpu is perfect. If Pavel runs ensemble KF can run several in parallel, 2K core jobs easy to get on machine. AH: Paul Spence is running a bunch of 2K 0.25 jobs and getting great throughput.

AH: How does MPI_Init scale with core count? NH: Can’t recall off the top of my head. OASIS is non-linear, but made a lot of changes to speed it up. Are you using the new OASIS version? Didn’t we say that would speed up initialisation time? RF: Thought there was a chance, a potentially good possibility. NH: Will merge that soon.

Technical Working Group Meeting, March 2021

Minutes

Date: 17th March, 2021
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Paul Leopardi (PL), Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale
  • Mark Cheeseman (MC) Down Under Geosolutions (DUG)
  • James Munroe (JM) Sweetwater Consultants

General forcing perturbation support

AK: Would like to perturb forcing in model run without having to change forcing files. Would like to support linear function on forcing. Not most general, covers most current cases. How to represent scaling and offset fields. Common to separate in time and space. Generalisation is sum of separate components.
AK: Currently implement arbitrary spatiotemporal or constant perturbations. Full spatiotemporal variation is data intensive. Need to work out details of implementation. Is the proposal feasible? One component should be able to be generalised to N components. Calendar is a bit more complicated. Currently support experiment and forcing calendars. These can be different, e.g. RYF forcing. May want to tie a perturbation to the experiment or forcing calendar. Perturbation should have a time coordinate and is only applied when it matches the current forcing time. Is this feasible?
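In symbols (my notation, not from the meeting), the proposed linear perturbation of a forcing field F would be something like

    \tilde{F}(x,y,t) = \Big( \sum_i s_i(x,y)\,\sigma_i(t) \Big)\, F(x,y,t) \;+\; \sum_j o_j(x,y)\,\omega_j(t)

where each scaling component is a spatial field s_i times a temporal factor \sigma_i, each offset component is o_j times \omega_j, and a constant or fully spatiotemporal perturbation is just the special case where one factor is trivial (constant) or not separable (a full (x,y,t) field).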
JM: Precompute spatial/temporal fields. Going to be standardised ramps? Are perturbations going to be simple functions of time and space? Reproducibly generated? AK: Not necessarily. EOFs would not work like this. NH: Important to know what the model is forced with. This complicates it a bit. A file is a dead-end. Files are a complete representation, even if you don’t know how you made the files. Want to document the creation of forcing files and attach code. Don’t want it to be too expressive or complex, as that will introduce more difficulty in understanding what it does. Sounds like if you could put arbitrary mathematical functions in there that would be preferable? JM: Say a temporal perturbation: have to look up a file, which has calendar issues. Different information to saying this is a linear ramp, or a step function every 12 years. Can put that tooling somewhere else in the workflow. This is maybe better; previously you had a whole new forcing, which is worse. So this is an improvement.
NH: Pretty arbitrary what can go in there. Can damp a single storm or collection of storms. AK: Manifests will document which files are used with hashes on the data. Yes not evident what is happening without looking at the file. Would encourage comments in netCDF attributes with git commit for script that made the file. So files carry with them a reference to what created them.
AH: Maybe insist on a comment field in the JSON file? AK: Encourage people to make it a URL to the commit of the script that made the file. AH: Yes, but some short descriptive information. We know it won’t necessarily mean it is updated, but make it compulsory and make them put something in there. AK: Compulsory means people tend to copy & paste from previous, and bad information is worse than none. AH: Allow people to do the right thing. A pithy comment can have a lot of information in it. NH: Enforce comments in netCDF files? AH: Most netCDF files have a comment field in them, but often not that useful. NH: Allow or enforce comment. Make it possible to have a comment. AK: Ignore any non-defined field so they can put anything they like? AH: Always a bad idea to allow through things that are wrong, as it can mean people think they have defined something, but a typo can mean they haven’t. Happens with payu config.yaml files sometimes.
PL: What is description field? NH: For the file as a whole.
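Purely as an illustration of the structure being discussed (the key names below are invented for this sketch and are not the actual schema), a perturbation entry with a compulsory comment might look like:

    {
      "filename": "INPUT/RYF.rsds.1990_1991.nc",
      "fieldname": "rsds",
      "perturbations": [
        {
          "type": "scaling",
          "dimension": "spatiotemporal",
          "value": "INPUT/rsds_scaling.nc",
          "calendar": "experiment",
          "comment": "https://github.com/<org>/<scripts-repo>/commit/<hash> generated rsds_scaling.nc"
        },
        {
          "type": "offset",
          "dimension": "constant",
          "value": 5.0,
          "calendar": "forcing",
          "comment": "uniform +5 W m-2 offset"
        }
      ]
    }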

DUG scaling with ACCESS-OM2

MC: Running ACCESS-OM2 on CascadeLake cluster and KNL cluster, about 10-15% slower than gadi. Probably due to slightly slower Intel Cascade Lake, and MPI interconnect also slightly slower. Very early results. PL: Is that across the board speed, or top end scalability or both? MC: Both across the board and individual components. Only running quarter and one degree so far. Most results are one degree. Run quarter a couple of times. Doing that work in Houston and has had problems recently. AH: How does it work on KNLs? MC: Slower per node. Scaling is just as good. OpenMP not as good as thought. Will look at, as OpenMP critical to getting good scaling on KNL. PL: With OpenMP, which components? MOM and CICE? MC: Both CICE and MOM. Not done anything deep, just turned it on. NH: Little to no instrumentation. Not much OpenMP stuff. RF: There is in CICE. Not a huge amount. Thread over the blocks. NH: Cool.
MC: Had another guy working on it. He has been running in the last week. Will be back on to it next week looking at OpenMP. AH: Anything stopping 0.1? MC: Just getting data. Getting timeouts when transferring. Some guys looking at it who work on SKA moving 10+TB quickly and easily. Will get in touch to try a high speed trial between DUG and ANU. AH: Does it involve running a client on our side? MC: I think so, just something at the very end, doesn’t need IT support. Not 100% sure. AH: Very interested in this. MC: If it works we should have the entire 0.1 dataset in minutes.
MC: Visualisation guys interested due to Insight work. Different from normal data. Very excited. AH: Lots of small scales. MC: Interested in path tracing. Made a filter to follow an isopycnal along the Gulf Stream. POV or from a coordinate position.
NH: Seen the great one on youtube? Awesome visualisation which shows isopycnal surfaces around Antarctica. AH: NCI one uses iso-surfaces to pick out density classes.
MC: Any simulation or benchmark we could do that would help you guys out? NH: 0.1 with daily ocean output. PL: Got all the benchmark documentation? Concentrated on not producing output.
AH: Any point pushing the scaling? NH: Focussing on year/submit. Get a year done in 5 hours, whatever it takes. Not so interested in how big we can get it. PL: As KNL individual cores slower, how good is top end scaling as that is where speed comes from. Could try AMDs as well. Don’t have enough for large scale runs. As most have GPUs on them, ML guys use them a lot. AH: AMDs maybe interesting. Pawsey have AMDs for their new system. Historically AMDs had a big memory bandwidth advantage. Isn’t model cache limited? AK: Didn’t Marshall say it was the speed of the RAM. AH: So memory bandwidth improvement might make a difference.
MC: For memory-transaction-bound calcs it depends on memory type and data being moved. AMD and CL are faster, depending on data types. Switching between 32-bit integer and 64-bit real causes a problem with AMD; Intel works better. Bioinformatics and genomics hit this hard. Intel is better by far. Using the latest Intel compiler. PL: Complicated by alignment and AVX? MC: Maybe Intel preloads vector units better? Open question we are trying to answer.
PL: Other scaling: OpenMP vs MPI? More threads vs more cores? NH: Curious about this for CICE. KNL has 4 threads/core, need OpenMP to make full use of it. On gadi only one FPU/core, maybe don’t care about threads? PL: Depends on latency in MPI calls vs what happens in OpenMP threading. MC: For KNL 2 threads/core is a sweet spot for saturating vector units, very few codes can use 4 threads/core. AH: Which MPI library? MC: Using OpenMPI 4.1, also have Intel MPI. AH: Never used Intel, so don’t know if there is much difference. Can you comment on any difference? MC: Mostly OpenMPI, mostly optimised for that. Have recently put some time into Intel MPI; some recent Mellanox collectives (hierarchical collectives) don’t work as well in newer OpenMPI. Due to some of the lower-level driver software. Got a really good guy working on this. Don’t see those issues with Intel MPI. Only notice a performance difference on a couple of very specific codes.

PIO

NH: Last time updated on getting async IO working. PIO library allows synch or asynch IO servers. Trying to get async working. Had to change OASIS version, which meant changes to coupling code. Also change handling of communicators. All done. Very close to getting working. Some memory corruption inside C library within PIO. Fairly new code, especially FORTRAN API. Maybe just running into some bugs. Same thing happened when I first started with PIO. Test case is very simple. Single model, single CPU doing IO. Our case is more complex. In the process of working out what is going on. Now have to run with valgrind. Couldn’t find with code inspection. AH: Make test case more complex until it falls over? NH: My 1 degree test case is pretty simple. Errors early on. Good idea, might try that. Running valgrind on test case would be much simpler. AH: Can be tricky to make a test case that fails. NH: Run of the mill memory corruption should pop up with test case. AH: Maybe just adding one more CPU will make it fail.
NH: PIO devs good to work with. Get changes in quickly, feel positive about it.

ACCESS-CM2-025

PL: Have accumulated some fixes to CICE. Code for CM2 was a file copy rather than pulled from GitHub. Don’t know where to put my fixes. Currently thinking of creating a branch from my own fork of CICE and put it there. One example in ice_grid.f90 looks like if auscom is defined then OpenMP won’t work. Not sure how OpenMP works with CICE. NH: Don’t remember that in our version of CICE. Got the entire bundle from Martin Dix as stand-alone test case for ACCESS-CM2.
PD: There is a repository for CICE. Would be good to create a branch for that. Defer back to Martin where to put it. Maybe a .svn dir in tarball? PL: Possibly? In first instance will create a branch in my fork on my own GitHub CICE repo. PD: Do an svn info. Not sure if it is the same code version as CMIP runs. PL: Will take offline.
AH: I am working on CICE harmonisation between OM2 and CM2. My strategy was to look at the subversion history and trace it back to the shared history of the CICE version we have in COSIMA. Got very close, maybe a commit away from linking them. The CICE version used in the CMIP runs has a single large commit from UKMO after the versions diverged. NH cloned from Hailin’s repo but before this large commit. A lot of changes touched similar files. Intention was to make a branch on our CICE repo and a pull request where we could see what all the changes are. Maybe PL could make his changes there. We are making this for ACCESS-CM2-025; not sure if the main ACCESS-CM2 will end up pulling from the same repository in the future.
NH: Is this everything except the driver? In the drivers folder there are auscom and accesscm drivers, not changing those? AH: Not as far as I know. Pulled in a lot of code changes. Wanted something people could look at, and see if changes affect OM2. Maybe we don’t have time to make a version that both OM2 and CM2 could work with. Could be a separate branch.
PL: Harmonisation is great for MOM5. Would be great to have CICE harmonised, but it is much more tightly bound to the atmosphere model. Hence big changes with the GA7 release. Need to make sure CICE is correctly coupled to the UM. CICE is the intermediary between atmosphere and ocean. Not sure what the costs and benefits would be. Dave Bi would know how much effort and what scope. AH: Valuable even having it in the same repo. Can cherry-pick some of the changes NH has made and will make. Maybe a 0.25 might need some of the changes NH has made for decent performance. PL: There is a lot of interest in the improvements and bug fixes, but not sure from this distance about the effort required.
NH: We are completely up to date with upstream CICE5. Half a dozen commits that are good and valuable. Also brought some things in from CICE6. Not scientific, so less of an issue. PL: Also updating the OASIS coupler? Will that complicate things with the UM? NH: I’ve updated to OASIS3-MCT 4.0; there should be some performance improvements. Not a lot of coupling time, not sure how much difference it will make, but it is a bottleneck, so any improvement will make an impact. Upgrading OASIS was flawless except for a warning about unbalanced comms. No changes to namcouple or the API.

Technical Working Group Meeting, February 2021

Minutes

Date: 17th February, 2021
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Paul Leopardi (PL), Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

DUG and ACCESS-OM2

AH: Been working with Mark Cheeseman from DownUnder GeoSolutions (DUG)
MW: GFDL collaborating with Seattle group Vulcan. Mark Cheeseman was on staff. Now with DUG. NH: Worked at NCI and then Pawsey, saw him there when he gave a talk. Went to US project Vulcan. Then contacted me, now working at DUG. Interested in the REASP startup, now on the ACCESS-OM2 project.
NH: Curious about Vulcan, what is it? MW: Vulcan is a project to wrap fortran core with python. Similar to f2py. Controlled by python and a domain specific language to put arrays on GPUs. Not my style. GFDL atmosphere group like it. Not sure if they are using it in production. Lucas Harris (head of atmosphere) spruiks it. NH: Vulcan were advertising a job in their team (short term). MW: Vulcan was a Paul Allen start-up. Jeremy McGibbon is a good guy to talk to.
MW: MITgcm was rewritten in Julia. PL: Domain specific languages becoming more popular. NH: Fortran is having a revival as well. MW: Talked to Ondřej Čertík who made LFortran. Wants to connect with the flint project. fortran-lang.org is doing good stuff. Vulcan is similar to PSyclone (LFRIC). Similar spirit. PL: PSyclone is one of the enabling technologies; LFRIC is the application of it to weather modelling. MW: It will be UM15 before LFRIC.
PL: UM model has a huge amount that isn’t just the dynamical core that all has to be integrated. Has to be more scientifically valid and performant than current. MW: Everybody is trying to work around compiler/hardware. Seems backwards. Need compilers to do this. Need better libraries. malloc should be gpu aware. Maybe co-arrays should be gpu aware. Seems silly to write programs to write programs.
AH: Agree that compiler should do the work. Was dismayed when talked to Intel guy at training session when he said compiler writers didn’t know or care overly about fortran. Don’t expect them to optimise language specific features well. If you want performance write it in the code, don’t expect compiler to do it for you.
MW: If can write MPI, surely can write a compiler to work with GPUs. Can figure out language/library level.
AH: DUG contacted us and wanted to move into weather and climate. Met them at the NCI Leadership conference. Stu Midgley told me they had a top 25 compute capability, larger than raijin but now gadi is bigger. Biggest centre is in Texas. Immersed all their compute in oil tanks for cooling. Reckons their compute cycles are half the cost of NCI.
NH: Selling HPC compute cycles: HPC in the cloud. Also recently listed on the stock exchange. AH: Floated at $1.50, now down to $1.00 a share. NH: Don’t understand how they can make money. What is so special about HPC cloud? Competing with AWS and NCI and governments. MW: Trying to pull researchers? NH: Getting research codes working. MW: NCI is free, how do you beat that? PL: Free depending on who you are.
AH: Liaised with Mark. Pointed him at repos, got him access to v45 to see how it runs on gadi. He has duplicated the setup on their machines, not using payu. Only running 1 degree so far.
AH: Had a meeting with them and Andy Hogg. Showed us their analysis software. Talked about their business model. Made the point we use a lot of cycles, but produce a lot of data that needs to be accessed and analysed over a long time period. Seemed a difficult sell to us for this reason. They view their analysis software as a “free” add-on to get people to buy cycles. Developed their own blocked-binary storage format similar to zarr. Can utilise multiple compression filters. Will need to make this sort of thing public for serious uptake IMO. Researchers will not want to be locked into a proprietary format. Andy pointed out there were some researchers who can’t get the NCI cycles they need. DUG know of Aus researchers who currently go overseas for compute cycles. Also targeting those who don’t have enough in-house expertise: they will run models for clients. NH: Will run models, not just compute? AH: They’ll do whatever you want. Quite customer focussed. MW: Running a model for you is a valuable service. AH: Would the military guys who run the BlueLink model be interested in those services, RF? RF: Want to keep it secure and in house. AH: Big scale and looking for new markets.

PIO update

NH: Task was to improve IO performance of CICE and MOM. Wanted async IO using PIO in CICE. Numbers showed we could improve speed 1-2% if not waiting on CICE IO. RF sent out a presentation on PIO; it is active on GitHub, moving quickly. The software is all there to do this. 6 months ago there was no support for Fortran async IO. First problem: OASIS uses MPI_COMM_WORLD to do everything. With IO tasks we can’t use MPI_COMM_WORLD. The new version, OASIS3-MCT v4, makes it easier not to use MPI_COMM_WORLD. Upgraded OASIS. Has new checks in coupling. Historically we were doing some weird stuff in start-up, and the new OASIS version didn’t like it. Ocean and Ice both write out coupling restarts at the end of the run. Payu copies both over but they are read in by just Ice, then sent back to Ocean. Rather than each model writing and reading their own restarts we get this weird swap thing. Get unbalanced comms. OASIS has a check for this. Could have disabled the check, but decided to change the coupling as you would expect. Don’t know why it was done this way. RF: Did it this way due to timing. Needs to store an average of the previous step, to do with multiple time steps. Also might have changed the order in the way things were done with the coupling, and how time steps were staggered. From a long time ago. NH: Might also be with a dynamic atmosphere, need a different coupling strategy. Made the change, checked answers are identical. Now we’re using OASIS3-MCT 4.0. Now integrating the IO server.
NH: Open to advice/ideas. Start off with all MPI processes (world), gets split to compute group and IO groups. New compute group/world fed to OASIS and split into models. payu will see a new “model” which is the IO server. Maybe 10 lines of fortran, calls into PIO library and sits and waits.
NH: Other possibility, could split world into 3 models, then split CICE and IO into separate models. Doesn’t work, have to tell everyone else what the new world is. Async IO has to be the first thing that happens to create separate group. OASIS upgrade was smooth and changed no results. Should have async IO server soon. Could also use it for MOM. Wonder if it will be worthwhile. MW: Be nice. PL: Testing on gadi? MW: Using NCAR PIO? NH: Yes. AH: NCAR PIO guys very responsive and good to work with.
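A minimal sketch of the world-splitting idea described above, using plain MPI (the PIO and OASIS calls are deliberately left as comments, since the real wiring in libaccessom2 is more involved):

    ! Sketch: split MPI_COMM_WORLD into compute and IO-server groups before any
    ! coupler initialisation. The last n_io ranks become IO servers; the rest form
    ! the "compute world" that would be handed to OASIS instead of MPI_COMM_WORLD.
    program split_world_sketch
      use mpi
      implicit none
      integer :: ierr, world_rank, world_size, n_io, color, new_comm

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

      n_io  = 4                                              ! illustrative IO-server count
      color = merge(1, 0, world_rank >= world_size - n_io)   ! 1 = IO server, 0 = compute
      call MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, new_comm, ierr)

      if (color == 1) then
         ! IO-server ranks: hand new_comm to the async PIO initialisation and then
         ! sit servicing IO requests until finalised.
      else
         ! Compute ranks: pass new_comm to the coupler in place of MPI_COMM_WORLD,
         ! which then splits it further into the individual model communicators.
      end if

      call MPI_Finalize(ierr)
    end program split_world_sketch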
MW: They have been good custodians of netCDF.
PL: Also work with UM atmosphere with CM2? NH: No reason not to. Seems fairly easy to use starting from a netCDF style interface. Library calls look identical. With MOM might be more difficult. netCDF calls buried in FMS. If UM has obvious IO interface using netCDF, should be obvious how to do it.
PL: Wondering about upgrading from OASIS3-mct, might baulk at coupling. NH: Should be ok. Might have problem I saw, not sure if it happens with climate model. Worst case scenario you comment out the check. Just trying to catch ugly user behaviour, not an error condition. AH: Surprised you can’t turn off with a flag. NH: New check they had forgotten to include previously.
AH: UM already has IO server? MW: In the UM, so have to launch a whole UM instance which is an IO server. Much prefer the way NH has done it.
NH: PIO does an INIT and never returns. Must be the IO server. MW: Prefer that model with IO outside the rest of the model code. NH: Designed with the idea that multiple submodels would use the same IO server. MW: When Dale did the IO server in the UM, launch say 16 ranks, 15 ran the model, the 16th waiting for data. Always seemed like a waste. Does your IO server have its own MPI PE? Perfect if it was just a threaded process that doesn’t have a dedicated PE. RF: Locked in. Tell it how many you want and where they are. Maybe every 48th processor on the whole machine or on the CICE nodes. Doesn’t have to be one on every node. Can send it an array designating IO servers; split etc handled automatically. Also a flag to indicate if a PE will do IO. NH: I have set it up as a separate executable launched by payu. All IO bunched together. Other option is to do what RF said, part of the CICE executable, starts but never computes CICE code. More flexible? RF: Think so. Better way to do it I think. NH: If we were to go to MOM, would want 1 IO server per x number of PEs rather than bunching at the end.
AH: Keep it local, reducing communications. Sounds difficult to optimise. PL: PE placement script? NH: Could do it all at the library level. Messed around with aggregators and strides and it made no difference. Tried an aggregator every node, and then a few over the whole job, which was the fastest. AH: Didn’t alter lustre striping? RY: Striping depends on how many nodes you use. AH: Something that was a many-dimensional optimisation space as Rui showed; throwing in this makes it very difficult. Would need to know the breakdown of IO wait time. NH: IO server is the best: non-blocking IO sends and you’re done. Doesn’t matter where the IO server PEs are, unless waiting on the IO server. PL: Waiting on send buffers? NH: Why? Work, send IO, go back to work and it should be empty when you come back to send more. MW: IO is only one-directional, fire and forget. Should be able to schedule IO jobs and get back to work. As long as IO work doesn’t interfere with computing. PL: Situation where output is held up, and send buffers not clean, and you wind up with a gap in output. MW: Possible. Hope to drain buffers. NH: If you ever wait on the IO server something is broken in your system. MW: Always thought optimal would be the IO process as the 49th process on a 48 core node, with the OS squeezing in IO when there is free time. RY: If you bind IO to a core it is always doing IO. MW: Imagine not-bound, but maybe not performant. RY: Normally the IO server is bound to a single node but performance depends on the number of nodes when writing to a single file. Can easily saturate a single node’s IO bandwidth. Better to distribute different IO processes around different nodes to utilise multiple nodes’ bandwidth. NH: I think the library makes that ok to do. RY: User has to provide a mapping file. NH: Have to tell it which ranks do what. RY: Issues with moving between heterogeneous nodes on gadi. Core number per node will change. We can provide a mapping for acceptable performance. Need lots of benchmarking.
AH: If IO server works well, it will simplify timings to understand what the model is doing? How much time spent waiting? NH: Looking like MOM spending a lot of time on IO. AK: Daily 3D fields became too slow. So turned them off. AH: So IO is a bottleneck for that.
NH: Any thoughts on PIO in MOM? RY: When I did parallel IO in MOM there already existed a compute and IO domain in MOM. Don’t need to use PIO. The native library supports it directly. IO server in OASIS? Depends how MOM couples to OASIS. Can set up PIO as a separate layer, forget the IO domain, create a mapping to PIO and use IO PEs to do these things. Easier in MOM if you pick up the IO domain. MW tried to improve FMS domain settings to pick up the IO domain as a separate MPI world. NH: Still not going to work with OASIS? RY: Similar to UM. Doesn’t talk to external ranks. MW: Fixed to grid. PL: OASIS does an MPI ISEND wrapped so it waits for the buffer before it does the ISEND, to make sure the buffer is always ready. Sort of blocking non-blocking. Not sure if IO would have to work the same way? NH: Assume so, would have to check the request is completed, otherwise overwrite buffers.

FMS MOM6

MW: FMS2 rewrite nearing completion. Pretty severe. Can’t run older models with FMS2. Bob has done a massive rewrite of MOM6. Made an infra layer. Now an abstraction of the interface to FMS from MOM6, can run either FMS or FMS2. MOM6 will continue to have a life with legacy FMS. MOM6 will support any framework you choose, FMS, FMS2, ESMF, our own fork of FMS with PIO. That is now a safe boundary configurable at the MOM6 level. Could be achieved in MOM5, or if migrate MOM6 have this flexibility available. Not sure if it helps, but useful to know about.
MW: MOM6 is a rolling release. Militantly against versioning. Now in CESM, not sure if it is production currently. It is their active model for development.

ACCESS-OM2 tenth

AK: Not currently running. Finished 3rd IAF cycle a month ago. All done with new 12K configuration. Only real instability was a bad node. Hardware fix.
AK: Talk of doing another cycle with BGC at 0.1 deg. Not discussed how to configure. AH: Would need all the BGC fields? AK: Assume so.
AH: A while ago there were parameter and code updates. Are those all included? AK: Yes, using the latest versions. AH: Was the run restarted? AK: Yes. AH: Metrics ok? AMOC collapsing? AK: Not been looked at in detail. Only looked at sea-ice and that has been fine, but mostly a result of atmospheric forcing. AH: Almost have an OMIP submission? AK: Haven’t saved the correct outputs. AH: Not sure if need to save for whole run, or just final one.

ACCESS-CM2-025

AH: Continuing with this. Going slowly, not huge resources available at Aspendale. Currently trying to harmonise the CICE code bases by getting their version of CICE into git and doing a PR against our version of CICE so we can see what needs updating since the forking of that CICE version.
PL: I am doing performance testing on CM2, how do I keep up to date? AH: I will invite you to the sporadic meetings.

Technical Working Group Meeting, December 2020

Minutes

Date: 9th December, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

Testing with spack

NH: Testing spack on a lightly supported cluster. Installed WRF and all dependencies with 2 commands. Only system dependencies were the compiler and libc. Automatically detects compilers. Can give hints to find others. Tell it which compiler to use for the build. Can use system modules using configuration files. AH: Directly supports modules based on Lmod. Talked to some of the NCI guys about Lmod, as the raijin version of modules was so out of date. The C version of modules has been updated, so they installed that on gadi. Lmod has some nice features, like modules based on compiler toolchain. Avoids the problem with Intel/GNU subdirectories that exist on gadi. NCI said they were hoping to support spack, by setting up these configs so users could spack build things. Didn’t happen, but would have been a very nice way to operate to help us.

NH: Primary use case is an under-supported system where you can’t trust anything to work. Just want to get stuff working. Couldn’t find an MPI install using the latest/correct compiler. gadi is well maintained. See spack as a portability tool. Containment is great.
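For context, the two-command style of workflow NH describes looks roughly like this with current spack (the package and compiler names are only examples; site-specific MPI and module settings go in spack's configuration files such as packages.yaml):

    # one-off setup: let spack discover available compilers and system packages
    spack compiler find
    spack external find

    # then a single command builds WRF plus all remaining dependencies,
    # with % selecting the compiler to use
    spack install wrf %gcc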

AH: Was particularly interested in concretisation, id of build, allows reproducibility of build and identification of all components.

NH: Rely on MPI configured for the system. Not going to have our own MPI version. AH: Yes. Would be nice if someone like Dale made configs so we could use spack. Everything they think is important to control and configure they can do so. Probably not happy with people building their own SSL libraries. Thought it would improve NCI’s own processes around building software. Dale said he found the system a bit fragile, too easy to break. When building for a large number of users they weren’t happy with that. Thought it was a great idea for NCI, to specify builds, and also easy to create libraries for all compiler toolchains programmatically.

AG: Haven’t tried recently.

Parallel compression of netCDF in MOM5

parallel_compression_mom5

RY: Continuation of previous PIO work, including compression as now supported by netCDF. Used FMS IO benchmark test_mpp_io to tune parameters. 174GB -> 74GB with level 4 deflation. Tested two PE numbers. Tested two schemes, ROMIO and OMPIO. HDF5 v1.10.x gave lots of errors; v1.12.x much better. Only had to change deflate_level in the mpp_io_nml namelist, no source code changes.
RY: Best settings: for 720 PEs, layout (48,15) with best IO layout (24,15); for 1440 PEs, layout (48,30) with best IO layout (12,30). For non-compressed IO, match chunk size with the IO layout. Best times keep x contiguous when compression is turned on: memory access dominates, so keep the layout contiguous, hence the x-axis is kept contiguous.
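For reference, the settings being tuned here live in the FMS/MOM5 namelists and look roughly like the following for the 720 PE case quoted above (treat this as indicative rather than a verified configuration):

    &mpp_io_nml
        deflate_level = 4      ! netCDF-4 deflation level used in these tests
    /

    &ocean_model_nml
        layout    = 48,15      ! 720 PE processor decomposition
        io_layout = 24,15      ! best-performing IO layout found for that decomposition
    /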

RY: Stripe count affects non-compressed more than compressed. PIO doesn’t work perfectly with Lustre, fails with very large stripe count. With large file sizes (2TB) can be faster to write compressed IO due to less IO time.

  • Large measurement variability in IO intensive benchmark as affected by IO activity. Difficult to get stable benchmark.
  • Use HDF5 1.12.x, much more stable.
  • Use OMPIO for non-compressed PIO
  • Similar performance between OMPIO and ROMIO for compressed performance

RY: Early stage of work. Many compression libraries available. Here only used zlib. Other libraries will lead to smaller size and faster compress times. Can be used as external HDF filter. File created like this requires filter to be compiled into library.

NH: How big is measurement variability? RY: Can be very different, took shortest one. Sometimes double. TEST_MPP_IO is much more stable. Real case much less so.

NH: Experiencing similar variability with the ACCESS model with CICE IO. Anything we can do? Buffering? RY: Can increase IO data size and see what happens. Thinking it is the lustre file system. A larger stripe count touches more lustre servers. There is a limit to performance as you increase the stripe count, as you start to get noise from the system. NH: What are the defaults, and how do you set the stripe count? RY: Default is 1, which is terrible. Can set using MPI hints, or use lfs setstripe on a directory. Any file created in that directory will use that stripe count. OMPIO and ROMIO have different flags for setting hints. The stripe setting is persistent between reboots. Use lfs getstripe to check. AH: Needs to be set to an appropriate value for all files written to that directory.
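For the record, the directory-level approach mentioned above looks like this on Lustre (the stripe count of 8 and the path are just examples):

    # set the default stripe count for new files created under this directory
    lfs setstripe -c 8 /scratch/<project>/<user>/archive/output000

    # check the current stripe settings
    lfs getstripe /scratch/<project>/<user>/archive/output000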

NH: Did you change MPI IO aggregators or aggregator buffer size? RY: Yes. Buffer size doesn’t matter too much. Aggregator does matter. Previous work was based on raijin with 16 cores per node. Now have 48 cores, so that experience doesn’t apply to gadi. Aggregator default is 1 per node for ROMIO. Increasing aggregators doesn’t change too much, doesn’t matter for gadi. OMPIO can change aggregators, doesn’t change too much.

NH: Why deflate level 4? Tried any others? RY: 4 is default. 1 and 4 doesn’t change too much. Time doesn’t change too much either. Don’t use 5 or 6 unless good reason as big increase in compression time. 4 is good balance between performance and compression ratio.

NH: Using HDF5 v1.12.x. With the previous version of HDF5, any performance differences? RY: No performance difference. More features. Just more stable with lustre. With a single stripe count both work; as soon as you increase the stripe count v1.10 crashes. Single stripe count performance is bad. Built my own v1.12 and didn’t have the problem.

NH: Will look into using this for CICE5. AH: Won’t work with system HDF5 library?

AH: Special options for building HDF5 v1.12? RY: Only if you need to keep compatibility with v1.10. Didn’t have any issues myself, but apparently not always readable without adding this flag. Very new version of the library.

AH: Will this be installed centrally? RY: Send a request to NCI help. Best for request to come from users.

AH: Worried about the chunk shapes in the file. Best performance is with contiguous chunks in one dimension, which could lead to slow read access along other dimensions. RY: If chunks are too small the number of metadata operations blows out. Very large chunks use more memory and parallel compression is not so efficient. So need the best chunk layout. AH: Almost need a mask on the optimisation heat map to optimise performance within a usable chunk size regime. RY: Haven’t done this. Parallel decompression is not new, but do need to think about the balance between IO and memory operations.

RF: Chunk size 50 in vertical will make it very slow for 2D horizontal slices. A global map would require reading in the entire dataset. RY: For write not an issue, for read yes a big issue. If include z-direction in chunk layout optimisation would mean a large increase in parameter space.

AH: Optimisation based on performance from simpler benchmark. Numbers didn’t correlate that well with more complex benchmark due to being a much larger file. Would running the benchmark with a larger file change the layouts used for the real world test? RY: Always true that chunk size along x should be contiguous. Probably y chunk size would change with real world example. Trends are the same. Default chunk layout slices all 3 axes. Best performance is always better than default chunk layout.
AH: Larger core counts now around 10K cores. RY: Have to select correct io_layout. Restricts the number of PEs. AH: This is an order of magnitude larger. RY: Filesystem has limited number of IO server. This sets the maximum number of IO PEs. Should always keep number of IO server less than this.
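As a concrete illustration of where the chunk-shape trade-off discussed above gets set, defining a deflated, chunked variable with the netCDF Fortran API looks roughly like this (dimension sizes and the chunk shape are illustrative, echoing the preference for x-contiguous chunks):

    ! Sketch: a variable chunked contiguously in x, in bands in y, a single level in z.
    ! Good for writing and for horizontal maps, but a depth profile or point time series
    ! would touch many chunks, which is the read-side concern raised above.
    program chunking_sketch
      use netcdf
      implicit none
      integer :: ncid, xdim, ydim, zdim, tdim, varid, status

      status = nf90_create("temp_sketch.nc", ior(NF90_NETCDF4, NF90_CLOBBER), ncid)
      status = nf90_def_dim(ncid, "xt_ocean", 1440, xdim)
      status = nf90_def_dim(ncid, "yt_ocean", 1080, ydim)
      status = nf90_def_dim(ncid, "st_ocean", 50,   zdim)
      status = nf90_def_dim(ncid, "time", NF90_UNLIMITED, tdim)
      status = nf90_def_var(ncid, "temp", NF90_REAL, (/ xdim, ydim, zdim, tdim /), varid, &
                            chunksizes=(/ 1440, 108, 1, 1 /), &
                            deflate_level=4, shuffle=.true.)
      status = nf90_enddef(ncid)
      status = nf90_close(ncid)
    end program chunking_sketch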

ACCESS-OM2-01 runs

NH: AK has been running 0.1 and seeing a lot of variation in run time due to IO performance in CICE. More than half the submits are more than 100% worse than the best ones. Is this system variability we can’t do much about? Also all workers are doing IO. Don’t have async IO, don’t have an IO server. Looking at this with PIO. Have no parallelism in IO so any system problem affects our whole model pipeline. RY: Yes, an IO server will mean you can send IO and continue calculations. Dedicated PE for IO. UM has an IO server. NH: Ok, maybe go down this path. AH: Code changes in CICE? NH: Exists in the PIO library. Doesn’t exist in the Fortran API for the version we’re using. Does exist for C code. On their roadmap for the next release. A simple change to the INIT call to use IO servers for asynchronous IO. Currently uses a stride to tell it how many IO servers per compute. AH: Are CICE PEs aligned with nodes? Talked about shifting YATM, any issues with CICE IO PEs sharing nodes with MOM? NH: Fastest option is every CPU doing its own IO. Using stride > 1 doesn’t improve IO time. RY: IO accesses a single server, doesn’t have to jump to a different file system server. There is some overhead when touching multiple file system servers, when using striping for example.
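As context for the "stride" mentioned above, a rough sketch (from memory, unverified) of the existing intracomm form of PIO initialisation in the Fortran API, where a stride selects which compute ranks double as IO tasks; the argument list should be checked against the ParallelIO documentation:

    ! Sketch only: every stride-th rank also acts as an IO task (no separate IO server).
    program pio_stride_sketch
      use mpi
      use pio
      implicit none
      type(iosystem_desc_t) :: iosystem
      integer :: ierr, my_rank, nranks, stride, niotasks

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      stride   = 4                    ! one IO task per 4 compute ranks (illustrative)
      niotasks = max(1, nranks / stride)

      ! comp_rank, comp_comm, num_iotasks, num_aggregators, stride, rearranger, iosystem
      call PIO_init(my_rank, MPI_COMM_WORLD, niotasks, 1, stride, PIO_rearr_box, iosystem)

      ! ... define decompositions, write distributed arrays ...

      call PIO_finalize(iosystem, ierr)
      call MPI_Finalize(ierr)
    end program pio_stride_sketch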

AH: Run time instability too large? AK: Variable but satisfactory. High core count for a week. 2 hours for 3 months. AH: Still 3 month submits? AK: Still need to sometimes drop time step. 200KSU/year. Was 190KSU/year, but also turned off 3D daily tracer output. AH: More SUs, not better throughput? AK: Was hitting walltime limits with 3D daily tracer output. Possibly would work to run 3 months/submit with lower core count without daily tracers.

AK: Queue time is negligible. 3 model years/day. Over double previous throughput. Variability of walltime is not too high 1.9-2.1 hours for 3 months. Like 10% variability.

AH: Any more crashes? Previously said 10-15% runs would error but could be resubmitted. AK: Bad node. Ran without a hitch over weekend. NH: x77 scratch still an issue? AK: Not sure. AH: Had issues, thought they were fixed, but still affected x77 and some other projects. Maybe some lustre issues? AK: Did claim it was fixed a number of times, but wasn’t.

 

Tripole seam issue in CICE

AH: Across the tripole seam one of the velocity fields wasn’t in the right direction, which caused weird flow. AK: Not a crash issue. Just shouldn’t happen, occurs occasionally. The velocity field isn’t affected; it is seen in some derived terms, or coupling terms. Do sometimes get excess shear along that line. RF: There are some inconsistencies with how some fields are being treated. Should come out ok. Heat fluxes slightly off, using wrong winds. They should be interpolated. What gets sent back to MOM is ok, aligned in the right spot. No anti-symmetry being broken. AK: Also true for CICE? RF: Yeah, winds are being done on u cells correctly. Don’t think CICE sees that. AH: If everything is ok, why does it occur? RF: Some other term not being done correctly, either in CICE or MOM. Coupling looks ok. Some other term not being calculated correctly.

AH: How much has our version of CICE changed from the version CSIRO used for ACCESS-ESM-1.5 NH: Our ICE repo has full git history which includes the svn history. Either in the git history or in a file somewhere. Should be able to track everything. Can also do a diff. I don’t know what they’ve done, so can’t comment. Have added tons of stuff for libaccessom2. Have back-ported bug fixes they don’t have. We have newest version of CICE5 up to when development stopped which include bug fixes. As well as CICE6 back ports. AH: Can see you have started on top of Hailin’s changes. NH: They have an older version of CICE5, we have a newer version which includes some bug fixes which affect those older versions.

RF: Also the auscom driver vs the access driver. They used to be quite similar; ours has diverged a lot with NH’s work on libaccessom2. We do a lot smarter things with coupling, with the orange peel segment thing. There is an apple and an orange; we use the orange. NH: The only CICE layout they use is slender. They don’t use the special OASIS magic to support that. Definitely improves things a lot in quarter degree. Our quarter degree performance is a lot better because of our layout. AH: They also have a 1 degree UM, so broadly similar to a quarter degree ocean. NH: Will make a difference to efficiency. AH: Efficiency is probably a second order concern, just get it running initially.

 

Improve init and termination time

AH: Congratulations on the work to improve init and termination time. RF: Mostly NH’s work. I have just timed it. NH: PIO? RF: Mostly down to reading in restart fields on each processor. Knocked off a lot of time, a minute or so. PIO also helped out a lot. Pavel is doing a lot of IO with CICE. Timed the work of doing all the netCDF definitions first and then the writing: takes 14s, including gathering onto a single node and writing the restart file. The i2o.nc could be done easily with PIO. Also implemented the same thing for MOM, haven’t submitted that. Taking 4s there. Gathering global fields is just bad. Causes crashes at the end of a run. There are two other files, cicemass and ustar, that do the same thing, but single file, single variable, so they don’t need special treatment.

RF: Setting environment variable turns off UCX messages. Put into payu? Saves thousands of lines in output file.

Technical Working Group Meeting, November 2020

Minutes

Date: 11th November, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Paul Leopardi (PL), Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale
  • James Munroe (JM) Sweetwater Consultants
  • Marshall Ward (MW) GFDL

 

Updated ACCESS-OM2-01 configurations

NH: Testing for practical/usable 0.1 configs. Can expand configs as there is more available resource on gadi and CICE is not such a bottleneck. When IO is on, scalability has been a problem, particularly with CICE. IO is no longer the bottleneck with CICE, but CICE still has a scalability problem. All I/O was routed through the root processor with a lot of gathers etc. So now have a chance to have larger practically useful configs.
NH: Doesn’t look like we can scale up to tens of thousands of cores.
NH: Coupling diagram. YATM doesn’t receive any fields from ICE or OCEAN model. Just sync message. Can just send as much as it wants. Seemed a little dangerous, maybe stressing MPI buffers, or not intended usage. YATM always tries to keep ahead of the rest of the model, and send whenever it can. Does hang on a sync message. Will prepare coupling fields, once done will do a non-blocking sync. Should always send waiting for CICE to receive. When ice and ocean model coupling on every time step, both send and then both receive. Eliminate time between sending and waiting to receive. CICE should never wait on YATM. CICE and MOM need to run at roughly same pace. MOM should wait less as it consumes more resources. Internally CICE and MOM processors should be balanced. Previously CICE I/O has been big problem. Usually caused other CICE CPUs to wait, and MOM CPUs.
NH: Reworked coupling strategy originally from CSIRO to do as best as we can.
NH: Four new configs: small, medium, large and x-large. New configs look better than the existing one because the old config did daily ice output, which it was never configured to do. Comparing new configs to the worst existing config. All include daily ice and ocean output. Small is basically the same as the previous config. All tests are warm starts from an April restart from AK’s run. Lots of ice. Three changes from existing in all configs: PIO, a small coupling change where both models send simultaneously, and a small change to initialisation in CICE. Medium config (10K) is very good. Efficiency very similar to existing/small. Large is not scaling too well. Hitting the limit on CICE scaling. Probably the limit of efficiency. X-Large hasn’t been able to be tested. Wanted to move projects, run a larger job. AH: X-large likely to be less efficient than large. NH: Unlikely to use it. Probably won’t use Large either.
AH: MOM waiting time double from med -> large. NH: MOM scaling well. CICE doesn’t scale well. Tried a few different processor counts, doesn’t help much. Problem getting right number of CICE cpus. Doesn’t speed up if you add more. AK: Played around with block numbers? NH: Keep blocks around 6 or 7. As high as 10 no down side. Beyond that, not much to do.
PL: Couldn’t see much change with number of blocks. Keeping same seemed better than varying. NH: Did you try small number like 3-4. PL: No. RF: Will change by season. Need enough blocks scattered over globe so you never blow out. PL: I was only testing one time of the year. NH: Difficult to test properly without long tests.
NH: PIO is great. Medium is usable. Large and X-Large probably too wasteful. Want to compare to PL and MW results. Might be some disagreement with medium config. From PL plots shouldn’t have scaled this well. PL: Maybe re-run my tests with new executables/configs. NH: Good idea. Will send those configs. Will run X-Large anyway.
NH: Had to fix ocean and ice restarts after changing layout. Didn’t expect it would be required but got a lot of crashes. Interpolate right on to land and then mask land out again. In ice case, making sure no ice on land. PL: I had to create restart for every single grid configuration.
NH: Not clear why this is necessary. Thought we’d fixed that, at least with the ocean. Maybe PIO introduced something weird. Would be good to understand. Put 2 scripts in the topography_tools repo: fix_ice_restarts.py and fix_ocean_restarts.py.
RF: Will you push changes for initialisation? Would like to test them with our 3 day runs. NH: Will do.
RF: Working on i2o restarts. Code writes out a variable, then defines the next variable, with a chance of doing a file copy each time. About 10-12 copies of the file in and out. NH: OASIS? RF: No, the coupled_netcdf thing. Rewritten and will test today. Simple change to define variables in advance. NH: Still doing a serial write? RF: Yes. Could probably be done with PIO as well. Will just build on what is there at the moment.
AH: Large is not as efficient as Medium or Small, but more efficient than current. NH: Due to daily output and no PIO. AH: Yes, but still an improvement on current. AH: How does this translate to model years per submit? AK: Small is 3 months in just over 2.5 hours. Should be able to do 1 year/submit with large. AH: Might be a reason to use large. Useful for example if someone wanted to do 3-daily outputs for particle tracking for RYF runs. Medium is 6 months/submit? Or maybe a year? AK: Possibly. Not so important when queue wait was longer. NH: I calculate medium would take 5.2 hours/year. Could give medium another thousand processors on the ocean. Might be smarter. AK: Target wall time.
AK: About to start 3rd IAF cycle. Andy keen to try medium. Good enough for production? NH: No real changes. AK: Do you have a branch I could look at. NH: Maybe test a small bump on medium to get it under 5 hours. AK: Hold off until config that gets under 5 hours? NH: Spend the rest of the day and get it under 5 hours. AK: Next few days, updates to diagnostics.
PL: Any testing to see if it changes climate output? NH: None. Don’t know how to do it. MOM should be reproducible. Assume CICE isn’t. AH: Run, check gross metrics like ice volume. Not trying to reproduce anything. Don’t know if there is an issue if it is due to this or not. Cross that bridge when we have to.
AH: Any code changes as well to add to core/cpu bumps? RF: Just a few minutes at the beginning and end.
MW: Shouldn’t change bit repro? NH: Never verified CICE gives the same answers with different layouts. Assuming not. MW: Sorry, thought you meant moving from serial to PIO. AH: AK has done testing with that. AK: With PIO the only issue is restart files within land-masked processors. MW: Does CICE do a lot of collectives? Only real source of issues with repro. PL: Fixing the restart code might perturb things? NH: Possibly? AH: Can’t statistically test enough on tenth. PL: Martin Dix sent me an article about how to test climate using ensembles.
AH: When trying to identify what doesn’t scale? Already know? The EVP solver? NH: Just looked at the simplest numbers. Might be other numbers we could pull out to look at it. Maybe other work, already done by PL and MW. If the goal is to improve scalability of CICE we can go down that rabbit hole. AH: Medium is efficient and it is a better use of effort to concentrate on finessing that. Good result too. NH: A year in a single submit with good efficiency is a good outcome. As long as it can get through the queue, and had no trouble with testing. AH: Throughput might be an issue, but shouldn’t be with a much larger machine now.
AK: Less stable at 10K? Run into more MPI problems with larger runs. Current (small) config is pretty reliable. Resub script works well, falls over 5-10% of the time. Falls over at the start, gets resubmitted. Dies at the beginning so no zombie jobs that run to the wall time limit. NH: Should figure out if it is our problem or someone else’s? In the libraries? AK: Don’t know. Runs almost every time when restarted. Tends to be MPI stuff, sometimes lower down in the interconnect. Do tend to come in batches. MW: There were times when libraries would lose connections to other ranks. Those used to happen with lower counts on other machines. Depends on the quality of the library. MPI is a thin layer above garbage that may not work as you expect at very high core counts.
PL: What happens when the job restarts? Different hardware with different properties? AK: Not sure if it is hardware. MW: mxm used to be unstable. JM: As we get to high core count jobs, maybe we can’t trust that every node will be ok for the whole run? Should get used to this? In the cloud this is expected. In future will have to cope with this? That one day we cannot guarantee an MPI rank will ever finish?
MW: Did see a presentation with the OpenMPI devs, who acknowledge that MPI is not well suited to that problem. Not designed for pinging and rebooting. Actions in implementation and standard to address that scenario. Not just EC2. Not great. Once an MPI job starts to fray it falls to pieces. Will take time for library devs to address the problem, also whether MPI is the appropriate back end for that IPC scenario. PL: 6-7 yrs ago ANU had a project with Fujitsu on exascale and what happens when hardware goes away. Not MPI as such. Mostly mathematics of sparse grids. Not sure what happened to it. NH: With cloud you expect it to happen, and know why. We might expect it, but don’t know why. If we know why, if it is dropping nodes we can cope. If it is a transient error, due to corrupted memory due to bad code in our model, we want to know and fix it. MW: MPI doesn’t provide an API to investigate it. MPI traces on different machines (e.g. CRAY v gadi) look completely different. Answering that question is implementation specific. Not sure what extra memory is being used, which ports, and no way to query. MPI may not be capable of doing what we’re asking for, unless they expand the API and implementors commit to it.
AH: Jobs as currently constituted don’t fit the cloud computing model, as ranks are not independent, they rely on results from neighbouring cells. Not sure what the answer is.
MW: Comes up a lot for us. NOAA director is from the private sector. Loves cloud compute. Asks why don’t we run on cloud? Any information for a formal response could be useful longer term. Same discussions at your places? JM: Problem is in infrastructure. Need an MPI with redundancy built in to the model. Scheduler can keep track and resubmit. Same thing happens in TCP networking. Can imagine a calculation layer which is MPI-like. MW: We need infiniband. Can’t rely on TCP, useful stock answer. JM: Went to DMAC meeting, advocating regional models in the cloud. Work on 96 cores. Spin up a long tail of regional models. Decent interconnect on small core counts. Maybe single racks within data centres. MW: Looking at small scale MOM6 jobs. Would help all of us if we keep this in mind, what we do and do not need for big machines. Fault tolerance not something I’ve thought about. JM: NH's point about the difference between faulty code and the machine being bad is important.
AH: Regarding the idea of having an MPI scheduler that can cope with nodes disappearing. Massive problem to save model state and transfer it around. A python pickle for massive amounts of data. NH: That is what a restart is. MW: Currently do it the dumbest way possible with files. JM: Is there a point where individual cores fail, the MPI mesh breaks down. Destroys the whole calculation. Anticipating the future. NH: Scale comes down to library level. A lot of value in being able to run these models on the cloud. Able to run on machines that are interruptible. Generally cheaper than defined duration. Simplest way is to restart more often, maybe every 10 days. Never lose much when you get interrupted. Lots we could do with what we have. AH: Interesting, even if they aren’t routinely used for that. Makes the models more flexible, usable, shareable.
MW: Looking at the paper PL sent, are AWS and Cori comparable for moderate size jobs? PL: Looking at a specific MPI operation. JM: Assuming the machine stays up for the whole job. Not really testing redundancy.

Testing PIO enabled FMS in MOM5

AH: Last meeting talked to MW about his previous update of FMS to ulm. MW put code directly into the shared dir in the MOM5 source tree. Made and merged a PR where the shared dir is now a subtree. Found the exact commit MW had used from the FMS repo, and recreated the changes he subsequently made: adding back some namelist variables that don’t do anything, but mean old configs still work. FMS code is identical, but now we can swap out FMS more easily with the subtree. There is a fork of FMS in the mom-ocean GitHub organisation. The master branch is the same as the code that is in the MOM5 subtree. Have written a README to explain the workflow for updating this. If we want to update FMS, create a feature branch on the FMS fork, create a PR, create a matching branch on MOM5, use subtree to update FMS and make a PR on the MOM5 repo. This will then do all the compilation tests necessary for MOM5. Both PRs are done concurrently. Bit clunky, but a lot easier than otherwise, and changes we make are tied to the FMS codebase. Wasn’t the case when they were bare code changes directly included in MOM5.
AH: Can now look at testing a PIO enabled FMS branch based on the work of RY and MW. Is there an appetite for that? MW and RY's changes are on a branch of the FMS fork. Based on a much more recent version of FMS? MW: Pretty old, whatever it is based on. 1400 commits behind. AH: Current is 2015, PIO branch is based on 2018. MW: Must be compatible with MOM5 because it was tested. Not a big issue if it is ulm or not.
AH: So should we do that? I could make a branch with this updated PIO enabled FMS. NH: Great idea. Uses what we’ve already done. AH: RY and MW have done it all, just have to include it. Is compression of outputs handled by the library? RY: Testing parallel compression. Code does not need changing. Just need to change the deflate level flag in FMS. Parallel IO can pick that up. No need for code changes. Not compatible with current HDF version 1.10. If you specify a chunk layout it will crash. Testing HDF 1.12. Works much better than 1.10. Performance looks pretty good. Compression reduces size by about 60%: sequential IO 2.06TB, with compression 840GB. Performance is better than sequential IO. Currently available HDF library not compatible. Will present results at next meeting. Don't want to rush too much, need stable results. When netCDF added support for parallel compression, always wanted to test it and see if there are code changes. Current HDF library is only compatible with certain chunk layouts. AH: Certainly need stability for production runs.
NH: Had similar issues. Got pnetcdf and PIO working. PIO is faster, no idea why. Also had trouble with crashes, mostly in MPI libraries and not netCDF libraries. Used MPI options to avoid the problematic code. With PIO and compression turned on, very sensitive to chunk size. Could only get it working with chunks the same as block size. Wasn’t good, blocks too small. Now got it working with MPI options which avoid bad parts of the code. RY: OMPIO is much more stable. NH: Also had to manually set the number of aggregators. RY: Yes, HDF 1.12 is much more stable. Should try that. Parallel IO works fine, so not an MPI issue, so it definitely comes from the HDF5 library. So much better moving to HDF 1.12. MW: Is PIO doing some of the tuning RY has done manually? NH: Needs more investigation, but possibly being more intelligent about gathers before writes. MW: RY did a lot of tuning about how much to gather before writes. NH: Lots of config options. Didn’t change much. Initially expected parallel netCDF and PIO to be the same, and surprised PIO was so much better. Asked on GitHub why that might be, but got no conclusive answer.
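(A minimal sketch of the kind of parallel compressed netCDF write being tested, using mpi4py + netCDF4-python; needs an HDF5 new enough for parallel filters (1.10.3+, ideally 1.12 as RY suggests). Grid size, chunking and the even rank split are illustrative assumptions:)

    # Each rank writes its own slab of a deflated variable in parallel.
    # Compressed parallel writes must be collective.
    from mpi4py import MPI
    from netCDF4 import Dataset
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nranks = comm.Get_rank(), comm.Get_size()
    ny, nx = 2700, 3600                      # illustrative global grid
    rows = ny // nranks                      # assume it divides evenly for the sketch

    ds = Dataset("parallel_test.nc", "w", parallel=True, comm=comm,
                 info=MPI.Info())
    ds.createDimension("ny", ny)
    ds.createDimension("nx", nx)
    v = ds.createVariable("temp", "f4", ("ny", "nx"),
                          zlib=True, complevel=1,
                          chunksizes=(rows, nx))  # chunk ~ per-rank slab
    v.set_collective(True)                   # required for compressed parallel IO
    v[rank*rows:(rank+1)*rows, :] = np.full((rows, nx), rank, dtype="f4")
    ds.close()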
AH: So RY, hold off and wait for changes? RY: Yes doing testing, but same code works. AH: Even though didn’t support deflate level before? RY: Existed with serial IO. PIO can pick up the deflate level. Before would ignore or crash.

Miscellaneous

MW: At the prior MOM6 meeting there was an obvious preference to move MOM6 source code to the mom-ocean organisation if that is ok with the current occupants. Hadn’t had a chance to mention this when NH was present. If they did that they were worried about you guys getting upset. AH: We are very happy for them to do this. NH: Don’t know the time frame. Consensus was unanimous? AH: Definitely from my point of view, makes the organisation a vibrant place. NH: I don’t own or run it, but think it would be cool to have all the codebases in one organisation. Saw the original MOM1 discs in Matt England’s office, which spurred putting all the versions on GitHub. So would be awesome. MW: COSIMA doesn’t have a stake or own this domain? AH: Not at all. I have just invited MW to be an owner, and you can take care of it. MW: Great, all settled.
PL: Action items for NH? NH: Sent PL configs; commit all code changes tested on the current config and give to RF; and fix up the Medium config. AH: A fourth: share config with RF and AK, and please send me the slides you presented. RY: PIO changes all there for PL to test? NH: Yes. PL: Definitely need those to test the efficiency of the code. AK: PIO is in the master branch of CICE on the repo.
PL: Have an updated report that is currently being checked over by Ben. Will release when that is given the ok.
AH: Working on finally finishing the cmake stuff for MOM5 for all the other MOM5 test configs. Will mean MOM5 can compile in 5 minutes, as parallel compilation works better due to the dependency tree being correctly determined.

Technical Working Group Meeting, October 2020

Minutes

Date: 13th October, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

CICE Initialisation

RF: NH found a CICE initialisation broadcast timing issue. Using PIO to read in those files? NH: Just a read, regular netCDF calls. Still confused. Thought the slowdown was OASIS setting up routing tables. RF pointed out docs that talked about OASIS performance and start up times. No reason ours should be > 20s. Weren’t using settings that would lead to that. Found broadcasts in CICE initialisation that took 2 minutes. Now take 20s. That is the routing table stuff. Confused why it is so bad. Big fields, but surprised it was so bad. PL: Not to do with the way mpp works? At one stage a broadcast was split out to a loop with point to points. NH: MCT and OASIS have stuff like that. We turned it off. MW had the same issue with xgrid and FMS. Couldn’t get MOM-SIS working with that. Removed those.
MW: PL is talking about FMS. RF: These are standard MPI broadcasts of 2D fields. MW: At one time collectives were unreliable, so FMS didn’t use them. Now collectives outperform point to point but the code never got updated. PL: Still might be slowing down initialisation?
NH: Now less than a second. Big change in start up time. Next would be to use a newer version of OASIS. From docs and papers could save another 10-15s if we wanted to. Not sure it is worth the effort. Maybe lower hanging fruit in termination. RF: Yes, another 2-3 minutes for Pavel’s runs. Will need to track exactly what he is doing, how much IO, which bits are stalling. Just restarts, or finalising diagnostics. AH: What did you use to track it down? NH: Print statements. Binary search. Strong hunch it was CICE. RF: Time wasn’t being tracked in CICE timers. NH: First suspected my code, CICE was the next candidate. AH: Close timer budgets? RF: A lot depend on MPI, others need to call the system clock.
NH: Will push those changes to CICE so RF can grab them.
NH: Noticed another thing in CICE. Order in coupling: both models do work, ocn sends to ice, ice recvs from ocn, ice sends to ocn, ocn recvs from ice. So send and recv are paired after the work. Not the best way. Both should do the work, both send, and then both recv. Minute difference, but does mean ocean is not waiting for ice. AH: Might affect pathological cases.
AH: Re finalisation, no broadcasts at end too? NH: Even error messages, couple of megabytes of UCX errors. Maybe improve termination by 10% by cleaning that up. CICE using PIO for restarts. Auscom driver is not using PIO and has other restarts. ACCESS specific restarts are not using PIO. Could look at. From logs, YATM finished 3-4 minutes before everything else. AH: Just tenth? NH: Just at 0.1. Not sure a problem, just could be better.

Progress in PIO testing in CICE with ACCESS-OM2

NH: AK has done most of the testing. Large parts of the globe where no PE does a write. Each PE writes its own area. Over land no write, so filled with netCDF _FillValue. When the ice layout changes, different CPUs have a different pattern of missing tiles and can read uninitialised values, same as MOM. Way to get around that is to just fill with zeroes. Could then use any ice restart with any new run with a different PE layout. AH: Why does the computation care over land? NH: CICE doesn’t apply a land mask to anything. Not like MOM which excludes calculations over land; CICE just excludes calculation over land where there is no cpu. Code isn’t careful. If the PE layout changes, parts which weren’t calculated before are now calculated. RF: Often uses comparison to zero to mask out land. When there is a _FillValue it doesn’t properly detect it as land. MW: A lot of 'if water' statements in the code. RF: Packs into a 1D array with offsets. NH: Not how I thought it works. RF: Assumes certain things have been done. Doesn’t expect it ever to change, running the same way all the time. Because the whole file was dumped from one processor, never ran into the problem. NH: Maybe ok to do what they did before: change _FillValue to zero. RF: Nasty though, zero is a valid number. NH: Alternative is to give up on reusing restarts. RF: Process restarts to fill where necessary, and put in zeroes where necessary. NH: Same with MOM? How do we fix it with MOM? RF: Changed a while ago. Tests on _FillValue or _MissingValue, especially in the thickness module. PL: Does this imply being able to restart from a restart file means you can’t change processor layout? Just in CICE or also MOM? RF: Will be sensitive to tile size. Distribution of tiles (round-robin etc) still has the same number of tiles, so not sensitive. AH: MOM always has to collate restarts if changing IO layout. AK: Why is having a _FillValue of zero worse than just putting in zero? RF: Often codes check for _FillValue, like MissingValue. So might affect other codes.
NH: Ok, settle on post-processing to reuse CICE restarts in different layouts. AK: Sounds good. Put a note in the issue and close it. AH: Make another issue with a script to post process restarts. NH: Use the default netCDF _FillValue.
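(A sketch of the restart post-processing agreed above: replace _FillValue gaps in a collated CICE restart with zeroes so a different PE layout reads initialised values; file handling is illustrative, not the actual script:)

    # Replace _FillValue (land / never-written blocks) with 0.0 in every
    # float variable of a CICE restart, so any PE layout reads initialised values.
    import numpy as np
    from netCDF4 import Dataset

    with Dataset("cice_restart.nc", "r+") as ds:
        for name, var in ds.variables.items():
            if not np.issubdtype(var.dtype, np.floating):
                continue
            data = var[:]                  # masked array wherever _FillValue is set
            if np.ma.is_masked(data):
                var[:] = np.ma.filled(data, 0.0)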

Scaling performance of new Configurations for 0.1

NH: Working on small (6K), medium (10K+1.2K), large (18K) and x-large (33K+) ACCESS-OM2-01 configs. Profiling and tuning. PIO improves a lot. Small is better than before. CICE has scalability issues, but that was before PIO, and with little IO? PL: Removed as much IO as possible. Hard-wired to remove all restart writing. Some restart writing didn’t honour the write_restart flag. Not incorporated in the main code, still on a branch.
NH: Medium scales ok. Not as good as MOM, but efficiency not a lot worse. Large and x-large might not be that efficient. x-large just takes a long time to even start. 5-10 minutes. Will have enough data for a presentation soon. Still tweaking balance between models. Can slow down both when decreasing the number of ICE cpus *and* when increasing them. Can speed up the current config a little by decreasing the number of CICE cpus. It is a balance between the average time that each CPU takes to complete the ice step, and the worst case time. As CPUs increase the worst case maximum increases. Mean minus worst case decreases. RF: Changing tile size? NH: Haven’t needed to. Trying to keep the number of blocks/cpu around 5-10. RF: The fewer tiles per processor, the larger the chance of uneven distribution of ice/ocean work. Only 3-4 tiles per processor, some may have 1, others 2. So 50% difference in load. About 6 tiles/processor is a sweet spot. NH: Haven’t created a config less than 7. From PL's work, not bad to err on the side of having more. Not a noticeable difference between 8 and 10, err on the higher side. Marshall, did you find CICE had serious limits to scalability?
MW: Can’t recall much about CICE scaling work. Think some of the bottlenecks have been fixed since then. NH: Always wanted to compete with MOM-SIS. Seems hard to do with CICE. MW: Recall CICE does have trouble scaling up computations. EVP solver is much more complex. SIS is a much simpler model.
AH: When changing the number of CPUs and testing the sensitivity of run time, are you also changing the CPUs/block? NH: The algorithm does the distribution. Tell it how many blocks overall, then give it any number of CPUs and it will distribute them over the CPUs. Only two things. Build system conflates them a little. Uses the number of cpus to calculate the number of blocks. Not necessary. Can just say how many blocks you want and then change how many cpus as much as you want. RF: I wrote a tool to calculate the optimal number of CPUs. Could run that quickly. NH: What is it called? RF: Under v45. Maybe masking under my account. NH: So don’t change the number of blocks, just change the number of cpus in the config. So can use CICE to soak up extra CPUs to fill out nodes. That is why we were using 799. Should change that number with different node sizes. AH: Small changes in the number of CPUs just mean a small change to the average number of blocks per CPU? Trying to understand the sensitivity of run time considering such a small change. NH: Some collective ops, so the more CPUs you have the slower. MW: Have timers for all the ranks. Down at subroutine level? NH: No, just a couple of key things. Ice step time.
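(A sketch of the block/CPU bookkeeping NH describes: the total number of blocks is fixed by the grid and block size, and the CPU count only changes blocks per CPU. The grid and block sizes are taken from elsewhere in these minutes and are assumptions:)

    # Blocks are set by grid size / block size; CPUs only change blocks per CPU.
    import math

    nx, ny = 3600, 2700            # ACCESS-OM2-01 horizontal grid (assumed)
    block_x, block_y = 36, 30      # CICE block size quoted later in the minutes
    nblocks = math.ceil(nx / block_x) * math.ceil(ny / block_y)

    for ncpus in (799, 1200, 1600):
        print(f"{ncpus:5d} CICE cpus -> {nblocks / ncpus:5.1f} blocks/cpu "
              f"(target roughly 5-10)")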
MW: Been using perf. Linux line profiler. Getting used to using it in parallel. Very powerful. Tells you exactly where you are wasting your time. AH: Learning another profiler? MW: Yes, and with no documentation. score-p is good, but the overhead is too erratic. Are the Allinea tools available? PL: Yes, a bit restricted. MW: Small number of licenses? Why I gave up on them.

1 degree CICE ice config

AK: Re CICE core count, the 1 degree config has 241 cores. Wasting 20% of cpu time on 1 degree. Currently has 24 cores for CICE. Should reduce to 23? NH: Different for 1 degree, not using round-robin. Was playing with 1 degree. Assuming it wasn’t as bad with gadi, were wasting 11. Maybe just add or subtract a stripe to improve efficiency. Will take a look. Could improve the SU efficiency a lot.
RF: Fortran programs in /scratch/v45/raf599/masking which will work out the processors for MOM and CICE. Also a couple of FERRET scripts which, given a sample ICE distribution, will tell you how much work is expected to be done for a round-robin distribution. Ok, so not quite valid. Will look at changing the script to support sect-robin. More of a dynamic thing for how a typical run might go. Performance changes seasonally. NH: Another thing we could do is get a more accurate calculation of work per block. Work per block is based on latitude. The higher the latitude the more work that block is going to do. Can also give it a file specifying the work done. Is ice evenly distributed across latitudes? RF: Index space, and it is variable. Maybe some sort of seasonal climatology, or an annual thing. AH: Seasonal would make sense. AH: Can give it a file with a time dimension? RF: Have to tell it the file, use a script. Put it in a namelist, or get it to figure it out. NH: Hard to know if it is worth the effort. AH: Start with a climatology and see if it makes a difference. Ice tends to stick close to coastlines. The Antarctic Peninsula will have ice at higher latitudes than in the Weddell Sea for example. Also in the Arctic the ice drains out at specific points. RF: Most variable southerly part is north of Japan, coming a fair way south near Sakhalin. Would have a few tiles there with low weight. NH: Hard to test, need to run a whole year, not just a 10 day run. Doing all scaling tests with a warm run in June. MW: CICE run time is proportional to the amount of ice in the ocean. There is a 10% seasonal variation. RF: sect-robin of small tiles tries to make sure each processor has work in northern and southern hemispheres. Others divide into northern and southern tiles. SlenderX2? NH: That is what we’re using for the 1 degree.
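(A sketch of the work-per-block weighting idea: build a seasonal or annual climatology of ice concentration per cell from existing output as a proxy for work; the paths, the aice variable name and the output format CICE would accept are all assumptions:)

    # Build a monthly climatology of ice concentration as a proxy for
    # per-cell work, to weight the block distribution instead of latitude.
    import xarray as xr

    hist = xr.open_mfdataset("archive/output*/ice/OUTPUT/iceh.????-??.nc")
    clim = hist["aice"].groupby("time.month").mean("time")   # 12 x nj x ni

    # Annual-mean alternative if a single static weight field is wanted:
    annual = clim.mean("month").rename("work_per_cell")
    annual.to_netcdf("cice_work_weights.nc")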

FMS updates

AH: Was getting the MOM jenkins tests running specifically to test the FMS pull request which uses subtree so you can switch FMS versions in and out easily. Very similar to what MW had already done. Just have to move a few files out of the FMS shared directory. When I did the tests it took me a week to find MW had already found all these bugs. When MW put in ulm he reverted some changes to multithreaded reads and writes so the tests didn’t break. MW: Quasi-ulm. AH: Those changes were hard coded inside the MOM repo. MW: In the FMS code in MOM. AH: Not a separate FMS branch where they exist? MW: No. Maybe some clever git cherry-picking would work. MW: Has changed a lot since then. Unlikely MOM5 will ever change FMS again. AH: Intention was to be able to make changes in a separate fork. MW: Changes to your own FMS? Ok. Hopefully would have done it in a single commit. AH: Allowed those options but did nothing with them? MW: Don’t remember, sorry. AH: Yours and Rui’s changes are in an FMS branch? MW: Yes. Changes I did there would not be in the shared dir of MOM5. Search for commits by me maybe. If you’re defining a subtree wouldn’t you want the current version to be what is in the shared dir right now? AH: Naively thought that was what ulm was. MW: I fixed some things, maybe bad logic, race conditions. There were problems. Not sure if I did them on the FMS or MOM side. AH: Just these things not supported in the namelist, not important. Will look and ask specific questions.

Miscellaneous

AH: Have given Martin Dix the latest 0.25 bathymetry for the ACCESS-CM2-025 configuration. Also need to generate regridding weights, will rely on the latest scripts AK used for ACCESS-OM2-025. AK: There was an issue with ESMF_RegridWeightGen relying on a fix from RF. Were you going to do a PR upstream, RF? RF: You can put in the PR if you want. Just say it is coming from COSIMA.
MW: Regarding hosting MOM6 code in the mom-ocean GitHub organisation. Who is the caretaker of that group? AH: Probably me, as I’m paid to do it. Not sure. MW: Natural place to move the source code. They’re only hesitant because they are not sure if it is Australian property. They also want complete freedom to add anything there, including other dependencies like CVMix and TEOS-10. AH: I think the more the better. Makes it a more vital place. MW: So if Alistair comes back to me I can say it is ok? AH: Steve is probably the one to ask. MW: Steve is the one who is advocating for it. MW: They had a model of no central point, now they have five central points. NOAA-GFDL is now the central point. Helps to distance the code from one organisation. Would be an advantage to be in a neutral space. COSIMA doesn’t have a say? AH: COSIMA has its own organisation, so no. AH: Just ask for Admin privileges. MW: You’re a reluctant caretaker? AH: NH just didn’t have time, so I stepped in and helped Steve with the website and stuff. Sounds great.

Technical Working Group Meeting, September 2020

Minutes

Date: 16th August, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

ACCESS-CM2 + 0.25deg Ocean

PD: Dave Bi thinking about 0.25 ocean. Still fairly unfamiliar with MOM5. Trying to keep it as harmonised as possible. Learn from the 1 degree harmonisation. PD: Doing the performance plan for the current financial year, hence talking about this now. Asked supervisors what they want, and this popped out of that conversation with Dave Bi. AH: Andy Hogg has been pushing CLEX to do the same work. PD: Maybe some of this is already happening, need some extra help from CSIRO? AH: We think it is much more that CSIRO has a lot of experience with model tuning and validation. Not something researchers want to do, but they want to use a validated model and produce research. So a win-win. PD: Validation scripts are something we run all the time, so yes would be a good collaboration. Won’t be any CMIP6 submission with this model. AH: Andy Hogg keen to have a meeting. PD: Agree that sounds like a good idea.
PL: How do we get baseline parameterisation and ocean mask and all that. Grab from OM2? PD: Yes, grab from OM2. AH: Yes, but still some tuning in coupled versus forced model.

ACCESS-OM2 and MOM-SIS benchmarking on Broadwell on gadi

PL: Started this week. Not much to report. Running restart based on existing data fell over. Just recreated a new baseline. Done a couple of MOM-SIS runs. Waiting on more results. Anecdotally expecting 20% speed degradation.

Update on PIO in CICE and MOM

AK: Test run with NH exe. Tried to reproduce 3 month production run at 0.1deg. Issue with CPU masking in grid fields. Have an updated exe set running. Some performance figures under CICE issue on GitHub. Speeds things up quite considerably. AH: Not waiting on ice? AK: 75% less waiting on ice.
AH: Nick getting queue limits changed to run up to 32K.
PD: Flagship project such as this should be encouraged. Have heard 70% NCI utilisation? May be able to get more time. RY: No idea about utilisation. Walltime limitation can be adjusted. Not sure about CPU limit. AH: I believe it can. They just wanted some details. Have brought up this issue at Scheme Managers meeting. Would like to get number of cpu limits increased across the board. There is positive reaction to increasing limits, but no motivation to do so. Need to kick it up to policy/board level to get those changes. Will try and do that.
NH: Some hesitation. Consumes 70-80 KSU/hr. Need to be careful. PL: What is the research motivation? NH: Building on previous work of PL and MW. With PIO in CICE can make practical configs with daily ice output with lots more cores. Turning Paul’s scaling work into production configs. Possible due to PL and MW's work, moving to gadi and having PIO in CICE.
NH: Got 3 new configs on top of the existing small (5K): medium (8K), large (16K) and x-large (32K). MOM doubling. CICE doesn’t have to double. Running short runs of configs to test. PL: 16K is where I stopped. NH: Andy Hogg said it is good to have a document to show scalability for NCMAS. PL: All up on GitHub. NH: Will take another look. NH: Getting easier and easier to make new configs. CICE load balancing used to be tricky. Current setup seems to work well as cpus increase.
PD: What is situation with reproducibility? In 1 degree MOM run 12×8. Would be same with 8×12? More processors? NH: Possible to make MOM bit repro with different layout and processor counts, but not on by default. Can’t do with CICE, so no big advantage. PL: What if CICE doesn’t change? NH: Should be ok if keep CICE same, can change MOM layout with correct options and get repro. RF: Generally have to knock down optimisation and floating point config. Once you’re in operational mode do you care? Good to show as a test, but operationally want to get through fast as possible as long as results statistically the same. PL: Climatologically the same? RF: Yeah. PL: All the other components, becomes another ensemble member. RF: Exactly. NH: Repro is a good test. Sometimes there are bugs that show up when don’t reproduce. That is why those repro options exist in MOM. If you do a change that shouldn’t change answers, then repro test can check that change. Without repro don’t have that option.
RF: Working with Pavel Sakov, struggling with some of the configs, updating YATM to the latest version. Moving on to the 0.1 degree model. Hoping to run 96 member ensembles, 3 days or so, with daily output. A lot of start/stop overhead. PIO will help a lot. Maybe look at start up and tidy up time. A lot different to 6 month runs. AH: Use ncar_rescale? RF: Just standard config. Not sure if it reads it in or does it dynamically. AH: Worth precalculating if not doing so. Sensitivity to io_layout with start-up? RF: Data assimilation step, restarts may come back joined rather than separate. Thousands of processors trying to read the same file. AH: mppncscatter? RF: Thinking about that. Haven’t looked at timings yet, but will be some issues. AH: How long does the DA step take? RF: Not sure. Right at the very start of this process. Pavel has had success with 1 degree. Impressed with quality of results from the model. Especially ice.
AH: Maybe Pavel could present to a COSIMA meeting? RF: Presented to BlueLink last week. AH: Always good to get a different view of the model.

Testing

AH: Trying to get testing framework NH setup on Jenkins running again. Wanted to use this to check FMS update hasn’t broken anything. Can then update FMS fork with Marshall and Rui’s PIO changes.
NH: A couple of months ago got most of the ACCESS-OM2 tests running. MOM5 runs were not well maintained. MOM6 were better maintained and run consistently until gadi. Can get those working again if it is a priority. Was being looked at by GFDL. AH: Might be priority as we want to transition to MOM6.
NH: Don’t have scaling results yet. Will probably be pretty similar to Paul's numbers. Will show you next time. PL: Will update GitHub scaling info. NH: Planning to do some simple plots and tables using AK’s scripts that pull out information from runs.

Bathymetry

AK: Got list of edits from Simon Marsland for original topography. Wanted to get feedback about what should be carried across. Pushed a lot into COSIMA topography_tools repo. Use as a submodule in other repos which create the 1 degree and 0.25 degree topographies. Document the topography with the githash of the repo which created it. Pretty much finished 0.25. Just a little hand editing required. Hoping to get test runs with old and new bathymetry.
AH: KDS50 vertical grid? QA analyses? AK: Partial cells used optimally from the KDS50 grid. Source data is also improved (GEBCO) so no potholes and shelves. AH: Sounds like a nice, well documented process which can be picked up and improved in the future.
AK: The way it is done could be used for other input files: have it all in a git repo and embed metadata into the file linking it exactly to the git hash. Good practice. Could also use manifests to hash inputs? NH: Great, have talked about reproducible inputs for a long time. AH: Hashing output can track back with manifests. Ideally would put hashes in every output. There is an existing feature for unique experiment IDs in payu, but it has not gone further, still think it is a good idea.
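(A sketch of the provenance idea being discussed: stamp the generating repo's git hash into the netCDF file's global attributes. A minimal illustration, not the topography_tools implementation; the output file name is a placeholder:)

    # Record the exact commit of the generating repo in the file's metadata.
    import subprocess
    from netCDF4 import Dataset

    githash = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()
    url = subprocess.check_output(
        ["git", "remote", "get-url", "origin"], text=True).strip()

    with Dataset("topog.nc", "r+") as ds:
        ds.setncattr("history", f"created by {url} at commit {githash}")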
AK: Process can be applied to other inputs. AH: The more you do it, the more sense it makes to create supporting tools to make this all easier.

Jupyterhub on gadi

AH: What is the advantage to using jupyterhub?
JM: The configurable http proxy and jupyterhub forward ports through a single ssh tunnel. If a program crashes and is re-run, it might choose a different port but the script doesn’t know. This is a barrier. Also does persistence, basically like tmux for jupyter processes. AH: Can’t do ramping up and down using a bash script? JM: Could do, that is handled through dask-jobqueue. A bash script could use that too. JM: Long term goal would be a jupyterhub.nci.org.au. Difficult to deploy services at scale. AH: Pawsey and the NZ HPC mob were doing it.
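(A sketch of the dask-jobqueue approach JM mentions for ramping workers up and down from a notebook; the queue/project names and node sizing are placeholders, and newer dask-jobqueue versions use "account" rather than "project":)

    # Scale dask workers up and down via PBS from inside a Jupyter session.
    from dask_jobqueue import PBSCluster
    from dask.distributed import Client

    cluster = PBSCluster(
        queue="normal",            # placeholder queue name
        project="x00",             # placeholder project code
        cores=24, memory="96GB",   # one worker per (illustrative) node
        walltime="02:00:00",
    )
    cluster.scale(jobs=4)          # submit 4 PBS jobs; scale(0) releases them
    client = Client(cluster)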

Technical Working Group Meeting, August 2020

Minutes

Date: 12th August, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (GFDL)

PIO with MOM (and NCAR)

MW: NCAR running global 1/8 CESM with MOM6. Struggling with IO requirements. Worked out they need to use io_layout. Interested in parallel. Our patch is not in FMS, and never will be. They understand but don’t know what to do. Can’t guarantee my patch will work with other models and in the future. Said COSIMA guys are not using it. Using mppnccombine-fast so don’t need it.  Is that working? AK: Yes.
RY: Issue is no compression. Previous PIO did not support compression and the size is huge. Now netCDF supports parallel compression, so maybe look at it again. Haven’t got time to look at it. Should be a better solution for the COSIMA group.
MW: Ideally Ed Hartnett or someone else from NCAR would add PIO to FMS. They have been working on the latest FMS rewrite for more than 2 years. Haven’t finished the API update. FMS API is very high level. They have decided it is too high level to do PIO. FMS is completely rewriting the API. Ed stopped until the FMS update. Added PIO and used direct netCDF IO calls. Bit hard-wired but suitable for MOM-like domains. Option 1 is sit and wait, Option 2 is do their own, Option 3 is do it now and use a fork of FMS. Maybe Option 4 is mppnccombine-fast. What do you think?
AK: Outputting compressed tiles with io_layout and using the fast combine. Potential issue is if io_layout makes small tiles. MW: Chunk size has to match tile size? Do tiles have to be the same size? AH: Yes. Still works but slower, as it has to do a deflate/reflate step. It is fast when it can just copy compressed chunks from one file to another. The limit is only filesystem speed. Still uses much less memory if it has to deflate/reflate. Chooses the first tile size and imposes that on all tiles. If the first tile is not the typical tile size for most files could end up reflating/deflating a lot of the data. Also have to choose processor layout and io_layout carefully. For example 0.25 1920×1080 doesn’t have consistent tile sizes. MW: Trying to figure out if it is worth telling them to reach out to you guys. Sent them a link to the repo. AH: Might be a decent way to keep going until they get a better solution.
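(A quick sketch of the tile-size check implied above: whether a given grid and io_layout give uniform tiles, which is what lets mppnccombine-fast copy compressed chunks without reflating. The split rule and example layouts are assumptions; the grid size is the one quoted in the discussion:)

    # Check whether an io_layout divides the grid into equal-sized tiles.
    def tile_sizes(n, ntiles):
        base, rem = divmod(n, ntiles)
        # FMS-style split (assumed): 'rem' tiles get one extra point
        return sorted({base + 1 if i < rem else base for i in range(ntiles)})

    nx, ny = 1920, 1080            # example quoted in the discussion
    for io_layout in [(8, 6), (14, 10), (16, 7)]:
        xs, ys = tile_sizes(nx, io_layout[0]), tile_sizes(ny, io_layout[1])
        uniform = len(xs) == 1 and len(ys) == 1
        print(io_layout, "uniform" if uniform else f"mixed x={xs} y={ys}")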
MW: Bob had strategy to force modelling services to include PIO support by getting NCAR to use PIO.
NH: Can they use PIO patch with their current version of FMS? MW: They want to get rid of the old functions.
Bad idea to ditch old API, creates a lot of problems. The parallel-IO work is on a branch.
AH: Regional output would be much better. Currently outputs one file per rank. Can aggregate with PIO? NH: One output file. Can set chunking. AH: Not doing regional outputs any more because they are so slow. Would give more flexibility. AK: Slow because of the huge number of files. Chunks are tiny and unusable. Need to use the non-fast combine. AH: I thought it was just that the output is slow. RF: Many processors on the same node will pump output. MW: Many outputs will throttle lustre. Only have a couple of hundred disks. Will get throttled. AH: Another good reason to use it for MOM. MW: Change with io_layout? RF: No, ranks always output for themselves. MW: Wonder how the patch would behave. AH: NCAR is constrained to stay consistent with FMS. MOM5 is not so constrained, should just use it. NH: Should try it if the code is already there. Parallel netCDF is a game changer. AH: I have a long-standing branch to add FMS as a subtree. Should do it that way. Have our own FMS fork with the code changes. MW: Only took 3 years!
AK: Put in a deflate level as a namelist parameter as it defaulted to 5. Used 1 as it is much faster and the compression was the same.

CICE PIO

NH: Solved all known issues. Using PIO output driver. Works well. Can set chunking, do compression, a lot faster. Ready to merge and will do soon. I don’t understand why it is so much better than what we had. I don’t understand the configuration of it very well. Documentation is not great. When they suggested changes they didn’t perform well. Don’t understand why it is working so well as it is, and would like to make it even better.
NH: Converted normal netcdf CICE output driver to use latest parallel netCDF library with compression. So 3 ways, serial, same netCDF with pnetcdf compressed output, or PIO library. netcdf way is redundant as not as fast as PIO. Don’t know why. Should be doing this with MOM as well. Couldn’t recall details of MW and RY previous work. Should think about reviving that. Makes sense for us to do that, and have code already.
MW: Performance difference is concerning. NH: Has another layer of gathers compared to the MPI-IO layer. PIO adds another layer of gathering and buffering. With the messy CICE layout PIO is bringing all those bits it needs and handing them to the lower layer. Maybe a possible reason for the performance difference. RY: PIO does some mapping from the compute to the IO domain. Similar to io_layout in MOM. Doesn’t use all ranks to do IO. Sends more data to a single rank to do IO, saves contention issues. NH: MPI-IO has aggregators? RY: In the library you can select the number of aggregators. Default is 1 aggregator per node. If you use PIO with a single IO rank per node this matches MPI-IO. Did this in the paper where we tested this. If io_layout, aggregator number and lustre striping are consistent you should get good performance.
RY: Tried different compression levels? NH: Just using level 1. Did some testing in the serial case, not much point going higher. Current tests doing all possible outputs. RF: A lot of the compression will be due to empty fields. RY: Compression performance is related to chunk size. NH: Performance difference with chunk size. Too big and too small are slower. Default chunk size is fastest for writing. 360×300 for a 2D field. Might not be ok for reading. RY: Should consider both read and write. Write once and many read patterns. MW: Parallel reads were slower than POSIX reads. AH: What is the dependence of time on chunk size? NH: Depends how many fields we output. Cutting down should be fast for larger chunk size. Is a namelist config currently. Tell it the chunk dimension. RY: Did similar with MOM. AH: CICE mostly 2D, how many have layers? AH: What chunking in layers? NH: No chunking, chunk size is 1, for layers. AH: Have noticed access patterns have changed with extremes. Looking more at time series, and sensitive to time chunking. Time chunking has to be set to 1? NH: With unlimited not sure. RF: Can chunk in time with unlimited, but would be difficult as need to buffer until writing is possible. With ICE data layers/categories are read at once. Usually aggregated, not used as individual layers. Makes more sense to make the category chunk the max size. Still a small cache for each chunk. netCDF 4.7.4 increased default cache size from 4M to 16M.
AH: I thought deflate level 4 or 5 was still worth it. NH: Can give it a try. Don’t really care about deflate level, just getting rid of zeroes.

Masking ICE from MOM – ICE bypass

NH: Chatted with RF on slack. Mask to bypass ICE. Don’t talk to ice in certain areas. Like the idea. Don’t know how much performance improvement. RF: Not sure it would make much difference. Just communication between CICE and MOM. NH: Also get rid of all the CICE ranks that do nothing. RF: Those are hidden away because of round-robin and evenly spread. Layout with no work would make a difference. NH: What motivated me was the IO would be nicer without crazy layouts. If didn’t bother with middle, would do one block per cpu, one in north and south. Would improve comms and IO. If it were easy would try. Maybe lots of work. AK: Using halo masking so no comms with ice-free blocks.
AH: What about ice to ice comms with far flung PEs? RF: Smart enough to know if it needs to MPI or not. Physically co-located rank will not use MPI. AH: Thought it would be easy? NH: Not sure it is justified in terms of performance improvement. With IO tiny blocks were killing performance, so this was a solution to IO problem. MW: Two issues are funny comms patterns and calculations are expensive, but ice distribution is unpredictable. Don’t know which PEs will be busy. Load imbalances will be dynamic in time. Seasonal variation is order 20%. Might improve comms, but that wasn’t the problem. Stress tensor calcs are expensive, so ice regions will do a lot more work. NH: Reason to use small blocks which improves ability to balance load. MW: Alisdair struggling with icebergs. Needs dynamic load balancing. Difficult problem. RF: Small blocks are good. Min max problem. Every rank has same amount of work, not too much or too little. CICE ignores tiles without ice. CICE6 a lot of this can be done dynamically. There is dynamic allocation of memory. AH: Dynamic load balancing? RF: Who knows. Now using OpenMP. AH: Doesn’t seem to make much difference with UM. MW: Uses it a lot with IO as IO is so expensive.
AH: A major reason to pursue masking is it might make it easier when scaling up. If round-robin magically scales well that is ok, but last time there was a lot of analysis with heat maps and discussion about optimal block sizes. Conceptually it might be easier to understand how to best optimise for new config. NH: Does seem to make sense, could simplify some aspects of config. Not sure if it is justified. MW: Easy to look at comms inefficiency. Did this a lot for MOM5, and mostly it wasn’t comms. Sometimes the hardware, or a library not handling the messages well, rather than comms message composition. Bob does this a lot. Sees comms issues, fixes it and doesn’t make a big difference. Definitely worth running the numbers. NH: Andy made the point. This is an architecture thing. Can’t make changes like this unilaterally. Coupled model as well. Fundamentally different architecture could be an issue. MW: Feel like CPUs are the main issue not comms. Could afford to do comms badly. NH: comms and disks seems pretty good on gadi. Not sure we’re pushing the limits of the machine yet. Might have double or triple size of model. AH: Models are starting to diverge in terms of architecture. Coupled model will never have 20K ocean cpus any time soon. NH: Don’t care about ice or ocean performance.
AH: ESM1.5 halved runtime by doubling ocean CPUS. RF: BGC model takes more time. Was probably very tightly tuned for coupled model. 5-6 extra tracers. MW: 3 on by default, triple expensive part of model. UM is way more resources. AH: Did an entire CMIP deck run half as fast as they could have done. My point is that at some point we might not be able to keep infrastructure the same. Also if there is code that needs to move in case we need to do this in the future. NH: Code is more of an ocean calculation anyway? RF: Kind of. Presume there is a separate ice calc. Coupling code taken from gfdl/MOM and put into CICE to bypass FMS and coupler code. From GFDL coupler code. Rather than ocean_solo.f90 goes through coupler.f90. NH: If 10 or 20K cores might revisit some of these ideas. Goal to get to those core counts working, not sure about production.
MW: Still thinking about super high res, like 1/30. OceanMaps people wanted it? More concrete. RF: Some controversy with OceanMaps and BoM. Wanting to go to UM and Nemo. There is a meeting, and a presentation from CLEX. Wondering about opportunity to go to very high core counts (20K/40K). AH: Didn’t GFDL run 60K cores for their ESM model? NH: Never heard about it. Atmosphere more aggressive. RF: Config I got for CM4 was about 10K. 3000 line diag_table. AH: Performance must be IO limited? MW: Not sure. Separated from that group.

New bathymetry

AK: Russ made a script to use GEBCO from scratch. Worked on that to polish it up. Everything so far has been automatic. RF: Always some things that need intervention for MOM that aren’t so much physically realistic but are required for the model. AK: Identified some key straits. Retaining previous land masks so as not to need to redo mapping files. 0.25 needs 3 ocean points removed and 2 points added. The make remap weights scripts are not working on gadi, due to the ESMF install. Just installed the latest esmf locally, 8.0.1, currently running. AH: The ESMF install for WRF doesn’t work? AK: Can’t find opal/MCA MPI error. RF: That is an MPI error.
AH: Sounds like the sort of error that was a random error, but if it is happening deterministically not sure. AK: Might be a library version issue. AH: They have wrappers to guess the MPI library; if the major version is the same it should be fine.
AH: All this is scriptable and can be re-run, right? Bathymetries are intimately tied to the vertical grid, so need to be re-run if that is changed. AK: Vision is certainly for it to be largely automated. Not quite there yet.
NH: I’ll have a quick look too. Noticed there is no module load esmf? AK: Using esmf/nuwrf. I’ll have a look at what esmf was built with. AH: I want esmf installed centrally. We should get more people to ask. NH: I think it is very important. AK: Definitely need it for remapping weights. AH: Other people need it as well.

Technical Working Group Meeting, July 2020

Minutes

Date: 10th June, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)
  • Marshall Ward (GFDL)

Optimisation report

PL: Have a full report, need review before release. This is an excerpt.
PL: Aims for perf tuning and options for configuration. Did a comparison with Marshall’s previous report on raijin.
Testing included MOM-SIS at 0.25 and 1 deg to get idea of MOM scalability stand-alone.
The ACCESS-OM2 at 0.1 deg. Testing with land masking, scaling MOM and CICE proportionally.
Couldn’t repeat Marshall’s exactly. ACCESS-OM2 results based on different configs. Differences:
  1. continuation run
  2. time step 540s v 400s
  3. MOM and CICE were scaled proportionally
  4. Scaling taken to 20k v 16k
MOM-SIS at 0.25 degrees on gadi 25% faster than ACCESS-OM2 on raijin at low end of CPU scaling. Twice as fast for MOM-SIS at 0.1 degrees. Scalability at high end better.
ACCESS-OM2: With 5K MOM cores, MOM is 50-100% faster than MOM on raijin. Almost twice as fast at 16K, scaled out to 20K. CICE: with 2.5K cores CICE on gadi seems 50% faster than CICE on raijin. Scales to 2.5 times as fast at 16K OM2 CPUs.
Days per cpu day. From 799/4358 CICE/MOM cpus does not scale well.
Tried to look at wait time as a fraction of wall time. Waiting is constant for high CICE ncpus, and decreases at high core counts with low CICE ncpus. So at higher core counts it is probably best to reduce CICE ncpus as a proportion. In this case half the usual fraction.
JM: How significant are the results statistically? PL: Expensive. Only ran 3-4 runs. Spread quite low. Waiting time varied the most. Probably not statistically significant due to the small sample size.
MW: Timers in libaccessom2 were better than OASIS timers, which include bootstrapping times which are impossible to remove. Also noisy IO timers. Not sure how long your runs were. Longer would be more accurate. PL: Runs are for 1 calendar month (28 days). MW: oasis_put and get are slightly magical, difficult to know what they’re doing. PL: Still have outputs, could reanalyse.
MW: Speedups seem very high. Must be a configuration thing. PL: Worried it is not a straight comparison. MW: If network time is 15-20%, wouldn’t make a difference. Always been RAM bound, which would be good if that wasn’t an issue now. PL: Very meticulous documentation of configuration, and it is very reproducible. Made a shell script that pulls everything from GitHub. MW: I think your runs are better referenced. While experiments were being released, it seemed some parameters etc were changing as I was testing. That could be the difference. Wish I had documented it better.
AH: All figures independent of ocean timestep? PL: All timesteps at 540s. AK: Production runs are 15-20% faster, but a lot of IO. PL: Switched off IO and only output at the end of the month. It really drags it. Make sure it isn’t IO bound. Probably memory bound. Didn’t do any profiling that was worth presenting. MW: Got a FLOP rate? PL: Yes, but not at my fingertips. If it is around a GFLOP, probably RAM bound. PL: Now profiling ACCESS-CM2 with ARM-map. RY is looking at a micro view, looking at the OpenMP and compilation flag level. RY: Gadi has 4GB/core, raijin has 2GB/core. Not sure about bandwidth. Also 24 cores/node. Much less remote node comms. Maybe a big reduction in MPI overheads. MW: OpenMPI icx stuff helping. RY: Lots of on-node comms. Not sure how much. MW: Believe so at high ranks. At modest resolutions comms is not a huge fraction of run time. Normal scalable configs only about 20%. PL: The way the scaling was done was different. MW scaled components separately. MW: I was using clocks that separated out wait time.
RY: If the config timestep matters, any rule for choosing a good one? AK: Longest timestep that is numerically stable. 540s is stable most of the time.
MW: How have you progressed on the CICE layout stuff? Changed in the last year? I was using sect-robin. RF: sect-robin or round-robin. AK: You used sect-robin; production did use round-robin, but not sect-robin. Less comms overhead, not sure about load balance.
PL: Is there any value in releasing the report? NH: Would be interested in reading it. Looking to get these bigger configurations. AH: Worth documenting the performance at this point. RY: Anything else worth trying? AH: Why the 20K limit? PL: Believe that is a PBS queue limit. Some projects can apply for an exception. RY: For each queue there are limits. Can talk with admin if necessary to increase them. AH: Will bring this up at the Scheme Managers meeting. They should be updated with gadi being a much bigger machine. Would give more flexibility with configurations. Scalability is very encouraging.
RF: BlueLink runs very short jobs, 3 days at a time. Quite a bit of start-up and termination time. How much does that vary with various runs? MW: I did plot init times, it was in proportion to the number of cores. Entirely MPI_Init. It has been a point of research in OpenMPI. Double the ranks, double the boot time. RF: Also initialisation of OASIS, reading and writing of restart files. PL: Information has been collected, but hasn’t been analysed. RF: For Paul Sandery’s runs it is 20% of the time. MW: MPI_Init is brutal, and then the exchange grid. There are obvious improvements. Still doing alltoall when it only needs to query neighbours. Can be sped up by applying geometry. Don’t need to preconnect. That is bad.
PL: At least one case where MPI collectives were being replaced with a loop of point to points. Was the collective unstable? MW: Yes, but also may be there for reproducibility. MW: I re-wrote a lot of those with collectives, but they had no impact. At one time collectives were very unreliable. Probably don’t need to be that way anymore. MW: I doubt that they would be better. Hinges on my assertion that comms are not a major chunk.
AH: MOM is RAM bound or cache-bound? MW: When doing calculations the model is waiting to load data from memory. AH: Memory bandwidth improves all the time. MW: Yes, but increase the number of cores and it’s a wash. It could have improved. AMD is doing better, but Intel knows there is a problem.
AH: To wrap up. Yes, would like the full report. This is useful for NH to work up new configurations, as naive scaling is not the way to go. Also the initialisation numbers RF would like that PL can provide.

ACCESS-OM2 release plan

AH: Are we going to change bathymetry? Consulted with Andy who consulted with the COSIMA group. What is the upshot? AK: Ryan and Abhi want to do some MOM runs. Problems with bathy. Andy wants the run to start, if someone has time to do it, otherwise keep going. Does anyone have some idea how long it would take? RF: 1 deg would be fairly quick. We know where the problems are. Shouldn’t be a big job. Maybe a few days. 1 deg in a day for an experienced person. GFDL has some python tools for adjusting bathymetry on their GitHub. Point and click. Alistair wrote it. Might be in the MOM-SIS examples repo. MW: Don’t know, could ask. RF: Could be something that would make it straightforward.
AK: Will have a look.
MW: topog problems in MOM6 not usually the same as MOM5 due to isopycnal coordinate.
AK: Some specific points that need fixing? RF: I think I put some regions in a GitHub issue. AK: What level to dig into this? RF: Take pits out. Set to min depth in config. Regions which should be min depth and have a hole. Gulf of Carpentaria is trivial. Laptev should all be set to min depth. NH: I did write some python stuff called topog_tools, can feed it a csv of points and it will fix them. Will fill in holes automatically. Also smooths humps. May still have to look at the python and fix stuff up. Another possibility.
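(A sketch of the kind of point-edit workflow mentioned: apply a CSV of (i, j, depth) fixes to topog.nc. This is illustrative only, not NH's topog_tools; the variable and column names are assumptions:)

    # Apply hand-curated bathymetry fixes from a CSV of i, j, new depth rows.
    import csv
    from netCDF4 import Dataset

    with Dataset("topog.nc", "r+") as ds, open("topog_fixes.csv") as f:
        depth = ds.variables["depth"]           # variable name assumed
        for row in csv.DictReader(f):           # columns: i, j, depth
            depth[int(row["j"]), int(row["i"])] = float(row["depth"])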
AK: Another issue is quantisation to the vertical grid. A lot of terracing that has been inherited. RF: Different issue. Generating a new grid. 1 degree not too bad. 0.25 would be weighty still. MC: BGC in 0.25, found a hole off Peru that filled up with nutrients.
AH: Only thing you’re waiting on? AK: Could do an interim release without topog fixes. People want to use it. master is so far behind now. Also updating the wiki which refers to the new config. Might merge ak-dev into master, tag as 1.7, and have wiki instructions up to date with that. AH: After the bathy update what would the version be? AK: 2.0. AH: Just wondering what constitutes a change in model version. AK: Maybe one criterion is if restarts are unusable from one version to the next. Changing bathy would make these incompatible. AH: Good version.
AH: Final word on ice data compression? NH: Decided deflate within the model was too difficult due to bugs. Then Angus recognised the traceback for my segfault which was great. Wasted some time not implementing it correctly. Now working correctly. Using a different IO subsystem, OMPIO rather than ROMIO. Got more segfaults. Traced and figured out. Need to be able to tell netCDF the file view. The view of a file that a rank is responsible for. MPI expects that to be set. One way to set it up is to specify the chunks to something that makes sense. Once I did that the file view was correct. Then ran into bugs in the PIO library. Seem like cut n’ paste mistakes. No test for chunking. Library wrapper is wrong. Fixed that. Now working. Learnt a lot and a satisfying outcome. Significantly faster. Partly parallel, also a different layer does more optimisation. Noticed with 1 degree things are flying. Nothing definitive, but seems a good outcome. Will get some definitive numbers and a better explanation. Will have something to merge. Will have some PRs to add to and some bug reports to PIO. RY: PIO just released a new version yesterday. NH: Didn’t know that. Tracking issues that are relevant to me. Still sitting there. Will try the new version. RY: Happy it is working now. NH: Was getting frustrated with PIO, wondered why not use netCDF directly. For what we do it is a pretty thin wrapper around netCDF. Main advantage is the way it handles the ice mapping. Worth keeping just for that. MW: FMS has most of a PIO wrapper but not the parallel bit.
PL: Does any of that fix need to be pushed upstream? NH: Changes to the CICE code. Will push to upstream CICE. Will be a couple of changes to PIO. AK: Dynamically determines chunking? NH: Need to set that. Dynamically figures out tuneable parameters under the hood, like the number of aggregators. Looks at what each rank is doing. Knows what filesystem it is on. Dependent on how it is installed. Assuming it knows it is on lustre. Can generate optimal settings. Can explain more when I do a summary.
AK: Want to make sure output files are consistently chunked. NH: Using the chunking to set the file view. Another way is to explicitly set the file view using the MPI API. Chunks are the same size as the data that each PE has. In CICE each block is a chunk. MW: These are netCDF chunks? AK: More chunks than cores? NH: Yes. Is that bad or good? This level is perfect for writing. Every rank is chunked on what it knows to do. Not too bad for reading. JM: How large a chunk? NH: In 1 degree every PE has full domain 300 rows x 20 columns. JM: Those are small. Need bigger for reading. AK: For tenth 36×30. Something like 9000 chunks. NH: Might be a problem for reading? RF: Yes for analysis. JM: Fixed cost for every read operation. A lot of network chatter. AH: Is that the ice or the ocean in the tenth? Not sure. Chunk size is 36×30. A lot of that is ice free, 30% is land. MW: Ideal chunks are based on IO buffers in the filesystem. AH: Best chunking depends a lot on access patterns. JM: One chunk should be many minimal file units big. AH: When 0.25 had one file per PE it was horrendously bad for IO. Crippled Ryan’s analysis scripts. If you’re using sect-robin that could make the view complicated? NH: Wasn’t Ryan’s issue that the time dimension was also chunked? AH: He was testing mppnccombine-fast which just copies chunks as-is, which were set by the very small tile sizes. Similar to your problem? NH: Probably worse. Not doing MOM tiles, doing CICE blocks, which are even smaller. Same grid as MOM. RF: Fewer PEs, so blocks are half the size. MOM5 tile size and CICE block size are comparable apart from the 1 degree model.
NH: Will carry on with this. Better than deflating externally, but could run into some problems. The chunks and the file view don’t have to be the same. Will this be really bad for read performance? Gathering that it is. What could be done about it? Limited by what each rank can do. No reason the chunks have to be the same as the file view. Could have multiple processors contribute to a chunk. Can’t do that without a collective. MW: MPI-IO does collectives under the hood. Can configure MPI-IO to build your chunk for you? NH: Currently every rank does its own IO as it was simpler and faster. MW: Can’t it all be configured at the MPI-IO layer? RY: PIO can map the compute domain to an IO domain. Previous work had one IO rank per node. The IO rank collects all data from the node. Set chunking at this level. NH: For example, our chunk size could be 48 times bigger. RY: Yes. Also best performance is a single IO rank per node. PIO does have this function to map from compute to IO domain, and that is why we used it. Can also specify how many aggregators per node. First decide how many IO ranks per node, and how many aggregators per node. Those should match. Can also set the number of stripes and stripe size to match chunk size. IO ranks per node is the most important, as it will set chunk size. MW: Only want the same number of writers as OSTs. RY: With many writers per node, you will easily saturate the network and it will be the bottleneck. AH: Have to go, but definitely need to solve this, as scaling to 20K cores will kill this otherwise.

MW: Will also help RF. If you’re desperate you should look at the patch RY and I did. Will help a lot once you’ve identified your initialisation slow down. RF: Yes, will do once I’ve worked out where the blockages are. Just seen some timings from Paul Sandery, but haven’t looked into it deeply yet. NH: Even with a rubbish config, the model is showing performance improvements. Will continue with that, and will consider the chunk size stuff as an optimisation task. MW: Sounds like you’ve gone from serial write to massively parallel, so inverted the problem from disk bound to network bound within lustre. If you can find a sweet spot in between then you should see another big speed improvement. NH: Config step pretty easy to do with PIO. Will talk to RY about it. RY: Could have a user settable parameter to specify IO writers per node. PL: Need to look into lustre striping size? RY: Currently set to 1GB, so probably ok, but can always tune this. NH: Just getting a small taste of the huge world of IO optimisation. MW: Just interesting to be IO bound again. NH: Heavily impacted by IO performance with daily outputs at tenth degree. MW: IO domain offset the problem. Still there, but could be dealt with in parallel with the next run, so could be sort of ignored.

AK: This is going to speed up the run. Worst case is post-processing to get the chunking sorted out. NH: That leaves us back at the point of having to do another step, which I would like to avoid. Maybe it’s different: before it was a step to deflate, whereas rechunking was maybe always going to be a problem.
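If a post-processing step does turn out to be needed, rechunking (and recompressing) a finished file is straightforward. A minimal sketch assuming xarray with dask; the filename, variable name and chunk sizes are hypothetical and would need to match the actual output:

import xarray as xr

ds = xr.open_dataset("ocean_daily.nc", chunks={})   # lazy open via dask
encoding = {
    "temp": {                                       # hypothetical variable name
        "zlib": True,
        "complevel": 4,
        "chunksizes": (1, 1, 300, 360),             # (time, depth, lat, lon); must match the dims
    }
}
ds.to_netcdf("ocean_daily_rechunked.nc", encoding=encoding)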
AK: Revisiting Issue #212, so we need to change the model ordering. Concerns about YATM and MOM sharing a node and affecting IO bandwidth. Tried this test; there is an assert statement that fails. libaccessom2 demands YATM is the first PE. NH: Will look at why I put the assert there. Weirdly proud there is an assert. RF: I remember this from playing around with OASIS intercommunicators; it might have been that conceptually this was the easiest way to get it to work. MW: I recall insisting on a change to the intercommunicator to get score-p working. AK: Not sure how important this is. NH: There are other things. Maybe something to do with the runoff. The ice model needs to talk to YATM at some point. Maybe it’s also a hacky way of knowing where YATM is in every config. These would be shortcut reasons. PL: Give it its own communicator and use that? NH: Maybe that is what we used to do. Could always go back to what we had before. RF: Just an idea to see if it would have an impact. Could give YATM its own node as a test. MW: Not sure why it is that way. Should be easy to fix. NH: Ok, certain configurations are shared, like timers and coupling fields. Instead of each model having its own understanding, share this information so models check timestep compatibility etc. Using it to share configs. Another way to do that. MW: Doesn’t have to be rank zero. NH: Sure, it is just a hack. MW: libaccessom2 is elegant, can’t say the same for all the components it talks to. RF: There is a hard-wired broadcast from rank zero at the end.
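PL's suggestion amounts to splitting MPI_COMM_WORLD by component so that nothing depends on YATM holding world rank 0; shared configuration can then be broadcast from rank 0 of each component communicator. A minimal sketch in mpi4py; how each executable identifies its component here is purely illustrative, not how libaccessom2 actually does it:

from mpi4py import MPI

world = MPI.COMM_WORLD

# Hypothetical: each executable knows which component it is running.
component = "yatm" if world.Get_rank() == world.Get_size() - 1 else "ocean"
color = {"yatm": 0, "ocean": 1, "ice": 2}[component]

# Every rank in the same component ends up in the same communicator,
# regardless of where it sits in MPI_COMM_WORLD.
comp_comm = world.Split(color, key=world.Get_rank())

# Shared config (timers, coupling fields) is broadcast from the component root,
# not from a fixed world rank.
cfg = {"dt": 5400} if comp_comm.Get_rank() == 0 else None
cfg = comp_comm.bcast(cfg, root=0)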

MOM6

MW: Ever talk about MOM6? AK: Angus is getting a regional circum-Antarctic MOM6 config together. RF: Running an old version of CM4 for the decadal project. PL: Maybe a good topic for next meeting?

Attachments

Technical Working Group Meeting, June 2020

Minutes

Date: 10th June, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)

Zarr

JM: Already have compression with netCDF4. How do they consume the data? Jupyter/dask? AK: Mostly in COSIMA; BoM and CSIRO have their own workflows. JM: Maybe archival netCDF4? As long as writing/parallel IO/combining works, no hurry to move to direct zarr output. JM: Inodes not an issue. Done badly it can be bad. The lustre file system has a block size, so there is a natural minimum size. At least as many inodes as allocatable units on the FS. If it is a problem, wrap the whole thing in uncompressed zarr. RY: Many filters. blosc is pretty good. Can use in netCDF4, but not portable; needs to be compiled into the library. netCDF4 now supports parallel compression. HDF5 supported it a couple of years ago.
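For concreteness, writing a blosc-compressed zarr store from Python is only a few lines. A minimal sketch assuming the zarr (v2 API) and numcodecs packages; the array shape, chunking and store path are illustrative:

import numpy as np
import zarr
from numcodecs import Blosc

compressor = Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE)
z = zarr.open("sst.zarr", mode="w", shape=(365, 300, 360),
              chunks=(1, 300, 360), dtype="f4", compressor=compressor)
z[0] = np.random.rand(300, 360).astype("f4")   # write one time slice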
AH: As we’re in science, want stable well supported software. Unlikely to use bleeding edge right now. Probably won’t output directly from model for the time being. Maybe post process.
NH: What about converting to zarr from uncollated ocean output? Why collate when zarr is uncollated anyway? Also we collate because it is difficult to use uncollated output. How easy is it to go from uncollated output to zarr? JM: Should be pretty straightforward. Write each block directly to its part of the directory tree. Why is collating hard? Can’t you just copy blocks to the appropriate place in the file? AH: Outputs are compressed. They need to be uncompressed and then recompressed. Scott Wales has made a fast collation tool (mppnccombine-fast) that just copies already compressed data. There are subtleties. Your io_layout determines the block size, as the netCDF library chooses chunking automatically. Some of the quarter degree configs had very small tiles, which led to very small chunks and terrible IO. AK: Regional output is one tile per PE and mppnccombine can’t handle the number of files in a glob. AH: Yes, that is disastrous. Not sure it was a good idea to compress all IO.
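JM's point about writing uncollated tiles straight to zarr could look something like the sketch below: each tile is written into its own region of a chunked zarr array, with no collation step. The grid and tile sizes are hypothetical and would come from the io_layout:

import numpy as np
import zarr

ny, nx = 1080, 1440            # hypothetical global grid size
tile_ny, tile_nx = 270, 360    # hypothetical tile size from io_layout

z = zarr.open("eta_t.zarr", mode="w", shape=(ny, nx),
              chunks=(tile_ny, tile_nx), dtype="f4")

def write_tile(j, i, data):
    # Each tile lands in exactly one zarr chunk, so no read-modify-write is needed.
    z[j * tile_ny:(j + 1) * tile_ny, i * tile_nx:(i + 1) * tile_nx] = data

write_tile(0, 0, np.zeros((tile_ny, tile_nx), dtype="f4"))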
JM: Good idea to compress even for intermediate storage. Regional collation: what do we currently collate? The original collation tool? AK: Yes, but we don’t have a solution. JM: Definitely need to combine to get decent chunk sizes. If interested, happy to talk about moving directly to zarr, or parallelising in some way. AK: Would want a uniform approach/format across outputs. JM: Not sure why collate runs out of files. AK: The shell can’t pass that many files to it. AH: My recollection is that it is a limit in mpirun, which is why mppnccombine-fast allows quoting globs to get around this issue. Always interested in hearing about any new approaches to improving IO and processing at scale.

ACCESS-OM2-BGC testing and rollout

AH: Congrats to Russ on getting this all working. How do we roll this out?
AK: All model components are now WOMBAT/BGC versions, but two MOM exes. One with BGC and one without. All in ak-dev. No standard configs that refer to BGC. Need some files. RF: About 1G, maybe 10 files. Climatologies for forcing and 3 restart files. This will work with standard 1 degree test cases. Slight change to a field table, and a different o2i.nc (OASIS restart file). Not much change. Will work on that with Richard and Hakase. Haven’t tested with current version. Maintaining a 1 degree config should be ok. AK: No interest at high res? NH: Yes, but not worth supporting as yet. One degree shows how it is used.
AK: Would BGC be the standard 1 degree config with BGC as an option, or a separate config specialised for BGC? RF: Will get together and give you more info. Probably stands separately as an additional config. Currently set up as a couple of separate input directories. AH: So RF is going to work up a 1 degree BGC config. RF: Yes. AH: So work up the config, make sure it works, and then tell people about it. RF: The people who are mainly interested know about the progress.

ACCESS-OM2 release version plan

AH: Fixing configs, code, bathymetry. Do we need a plan? Need help?
AK: Considering merging all ak-dev configs into master. Was constant albedo, moving to Large and Yeager as RF advised. Didn’t make a big difference. How much do we want to polish? The initial condition is wrong: it is potential temperature rather than conservative temperature. Small compared to differences from WOA and drift. Not sure if worth fixing. Also bathymetry at 1 and 0.25 degree. Not sure who would fix it. AH: Talk to Andy Hogg if unsure about resources? AK: A number of problems could all be fixed in one go.
AK: Not much left to do on code. PIO with CICE. Still issues?
NH: Good news. In theory compression is supported in PIO with the latest netCDF. The PIO library also has to enable those features. There was a GitHub issue that indicated they just needed to change some tests and it would be fine. Not true. There is code in their library that will not let you call deflate. Tried commenting it out, and one of the devs thought that was reasonable. Getting some segfaults at the netCDF definition phase. Want to explore until I decide it is a waste of time; will go back to offline compression if it doesn’t pan out. Done the naive test to see if there is a simple work-around. Looking increasingly like it won’t work easily. Could do valgrind runs etc.
AK: Bleeding edge isn’t best for production runs. NH: Agreed. Will try a few more things before giving up. AK: Offline compression might be safer for production. NH: Agreed, random errors can accumulate. RY: Segfault is with PIO? NH: Using the newly installed netCDF 4.7.4p, and the latest version of the PIO wrapper with some checks commented out. Not complicated, it just calls deflate_var. RY: When you install PIO do you link to the new netCDF library? NH: Yes. Did have that problem before; AG pointed it out, and fixed that. Takes a while to become stable. RY: Do you need to specify a new flag for parallel compression when you open the file with nf_open? NH: Not doing that. Possibly PIO is doing that. RY: Maybe the PIO library has not been updated to use the new flags correctly. NH: If using netCDF directly what would you expect to see? Part of nf_create? RY: Maybe it is hard-wired for serial. NH: Possible they’ve overlooked something like this. Might be worth a little bit of time to look into that. PIO does allow compression on serial output only. Will do some quick checks. Still shouldn’t segfault. Been dragging on, keep thinking a solution is around the corner, but it might be time to give it up. A bit unsatisfying, but need to use my time wisely. AH: It’s not nothing to add a post-processing compression step and then take it out again; it’s not zero work or testing, so out-of-the-box compression would be nice. Major code update you’re waiting on? Need to update the fortran datetime library? Just a couple of PRs from PL and NH. Paul’s looks like a bug fix. PL: Mine is an edge case? AK: Incorrect unit conversion. AH: Don’t know where it is in the code or if it affects us. AK: We’re using the version that fixes that. NH: The guy who wrote that library wrote a book called “Modern Fortran”.
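For comparison, parallel compressed writes do work through the plain netCDF interface when the underlying netCDF-C is 4.7.4 or newer and HDF5 has parallel support, provided the variable is accessed collectively. A minimal sketch using netCDF4-python and mpi4py, separate from PIO; names and sizes are illustrative:

from mpi4py import MPI
import numpy as np
from netCDF4 import Dataset

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

ds = Dataset("parallel_deflate.nc", "w", parallel=True, comm=comm, info=MPI.Info())
ds.createDimension("x", 100 * nranks)
v = ds.createVariable("field", "f4", ("x",), zlib=True, complevel=4)
v.set_collective(True)   # compressed parallel writes must be collective
v[rank * 100:(rank + 1) * 100] = np.full(100, rank, dtype="f4")
ds.close()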

Testing update

AH: MOM travis testing is now properly testing ACCESS-OM2 and ACCESS-OM2-BGC, as I have also updated the libaccessom2 testing so that it creates releases that can be used in the MOM travis testing. Lots of boring/stupid commits to get the testing working. AK: The latest version gets used regardless of what is in the ACCESS-OM2 repo? AH: Not testing the build of ACCESS-OM2, just the MOM5 bit of that build linking to the libaccessom2 library so that it produces an executable and we know it worked successfully. We do compile OASIS, and use just the most recent version. Had intended to do the same thing with OASIS, make it create releases that could be used. Haven’t done it, but it is a relatively fast build. AK: OASIS is now a submodule in the libaccessom2 build. AH: Yes, but we still have a dependency on OASIS. Don’t have a clean dependence on just libaccessom2. Might be possible to refactor, but probably not worth it. So yes, there is a dependence on libaccessom2 *and* OASIS.
AH: Previously travis allowed the ACCESS-OM2 step to fail, and the only way to know it worked correctly was to look at the logs to determine if everything was successful up to the linking step.
AH: Currently fixing up Jenkins automated testing. Starting on libaccessom2 testing. Hopefully won’t be too difficult. NH: Definitely need to have it working.
PL: Writing up testing/scaling tests for ACCESS-OM2.