Technical Working Group Meeting, April 2021

Minutes

Date: 21st April, 2021
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU
  • Angus Gibson (AG) RSES ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Paul Leopardi (PL), Rui Yang (RY) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

ACCESS-OM2 Model Scaling on DUG

MC: ACCESS-OM2 running on KNL and Cascade Lake. Concentrating on KNL.
0.1 degree config not running. Issues with our OpenMPI 4.1.0 upgrade affect this.
OpenMP threading provides no performance gain in CICE or OASIS.
Profiling difficult but possible. Scalability is an issue.
Modifications: added AVX-512 compiler flag plus appropriate array alignment. Added VECTOR pragma in key CICE routines. Intel-specific pragmas needed where necessary; one or two routines in CICE need them. Got them listed.
OpenMP enablement: a bug was stopping OpenMP. Added the OpenMP flag to LDFLAGS in MOM and OASIS. The CONCURRENT compiler directive is not recognised by the latest Intel version (see the sketch below).
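A minimal sketch of the kind of source and build change being described here (the routine, argument names and exact flag placement are assumptions, not the actual CICE/MOM code):

    ! Illustrative only: an Intel-specific vectorisation directive of the kind
    ! added to one or two CICE routines. Routine and argument names are made up.
    subroutine stress_like_kernel(n, a, b, c)
      integer, intent(in)       :: n
      real(kind=8), intent(in)  :: a(n), b(n)
      real(kind=8), intent(out) :: c(n)
      integer :: i
      !DIR$ VECTOR ALWAYS   ! Intel directive; the older CONCURRENT directive is
      do i = 1, n           ! not recognised by recent Intel compilers
         c(i) = a(i)*b(i) + 0.5d0*a(i)
      end do
    end subroutine stress_like_kernel

    ! Assumed build-side changes, shown here as comments:
    !   FFLAGS  += -xMIC-AVX512 -align array64byte   ! AVX-512 plus array alignment
    !   LDFLAGS += -qopenmp                          ! OpenMP flag needed when linking MOM/OASIS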
1 degree model on 68-core KNL nodes. Scaling not great. Tried a number of different configs, processor shapes and domain decompositions. A latitude-oriented CICE decomposition with slenderX1 was best. OASIS doesn't scale.
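For context, the CICE decomposition being compared is chosen via the domain_nml namelist; a hedged sketch (the task count and weighting values here are assumptions for the 1 degree case, not the tested config):

    &domain_nml
      nprocs            = 24          ! assumed CICE task count
      processor_shape   = 'slenderX1' ! latitude-strip layout that performed best
      distribution_type = 'cartesian'
      distribution_wght = 'latitude'
    /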
By component, the stress routine takes the largest share of time. 12 MPI tasks. Not limited by comms. 37% vectorisation; could be non-contiguous data, mallocs, stalls. OpenMP would have the most effect on vectorisation.
OASIS does a lot of memory allocation/deallocation of temporary buffers, a known weakness of KNL: 37% in isend_, and 16% in memory placement with non-contiguous data. Need to keep data in cache/fast RAM. OpenMP enablement is possible in OASIS; has it been explored? Looks like multithreaded reading and multithreaded MPI passing were tried: at best they did nothing, at worst they caused crashes.
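A schematic sketch (not the actual OASIS source) of the per-call allocate/deallocate pattern being described, which is what hurts on KNL:

    ! Schematic only: a temporary coupling buffer allocated and freed on every
    ! send stresses KNL's slow memory allocation; keeping a persistent buffer
    ! resident in cache/fast RAM avoids the repeated malloc/free.
    subroutine couple_send_step(nx, ny, field)
      integer, intent(in)       :: nx, ny
      real(kind=8), intent(in)  :: field(nx, ny)
      real(kind=8), allocatable :: buf(:, :)
      allocate(buf(nx, ny))     ! per-call allocation: the pattern described above
      buf = field
      ! ... hand buf to a non-blocking MPI send (the isend_ hotspot) ...
      deallocate(buf)
    end subroutine couple_send_step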
PL & NH: Never tried.
MC: Could talk to French OASIS team see if they have tried this.
0.25 degree scales OK out to 18 KNL nodes (1224 cores). OASIS scales poorly, and other components don't scale well beyond 18 nodes. MOM scales well by itself at 0.25 degrees; is this all OASIS, or OpenMPI as well? Maybe a little of both, but can't prove it.
New work: OpenMP running. CICE looks promising. Possibly OASIS.
0.1 still not working. UCX driver not working. Input data available due to new transfer tool.
INSIGHT use: real-time analysis and computation is the target.
1 year of 0.1 degree data downloaded for the next prototyping effort, to see if real-time analysis and computation are possible.
NH: Careful not to measure time spent waiting as OASIS time; waiting is mostly what OASIS does. MC: Masking load imbalance? NH: Might be. Just be careful; it is where the model imbalances show up. MC: Did try different layouts and it still shows similar behaviour. NH: Really hard to benchmark a coupled model for this reason. Look at ice and ocean numbers in isolation. MC: Interesting to talk about common approaches to benchmarking. Talked to Cray about this sort of thing; sometimes they separate components and run them in isolation. score-p doesn't work like that. Using gprof.
PL: Anything useful you can say about OpenMP in MOM? MC: Not spent much time on that. PL: Looking at the source, OpenMP isn't present in many places. Maybe not a great fit for KNL without a very good MPI implementation. MC: Two ways to get performance on KNL: exploit on-node parallelisation, and use as many of those cores as possible. If you can split the problem up well with MPI it can work well. Worked OK with MOM by itself on KNL, so not so worried; it has good parallelisation as it is. When linking in OASIS calls via the coupling module inside MOM, if OASIS is built with OpenMP you need that flag in your link statement, otherwise you get a kmp linking error.
AH: Re 0.25 scaling, how many nodes is the last data point? MC: 34. AH: So 2.5K cores. 0.25 stops scaling much above that on conventional Intel cores. MC: Ideally move scalability forward, as KNL CPUs run at about a third of the clock speed, so need more scalability to match that. Would hope OpenMP might help. AH: Scaling is most difficult with load balancing in multi-model coupled models.
NH: Good to compare Gadi scaling results to see how much it can realistically scale, and whether it makes sense to go to KNL. If we can't squeeze more scalability out of it, it's never going to match conventional architecture.
MC: New architectures like Sapphire Rapids are basically a beefed-up KNL: faster cores, but with a big chunk of fast memory. Could look like KNL. Might be a preview of the Intel world.
NH: KNLs are much cheaper to run, so given that discount is it possible to do a cheaper cost calculation? An a priori cost calculation.
MC: Could have done, but that is more like a sales pitch. This is a standard analysis to see the best fit for the code.
RF: Not sure why an atmospheric boundary layer routine popped up in OM2; don't think it should go through that routine. The square processor domain decomposition in 0.25 is massively unbalanced. MC: Tried a number of domain decompositions, layout and type. RF: It varies month by month, depending on the north/south distribution of ice; using one month gives a biased result. MC: Used slender for 1 degree, square for 0.25, using just January. RF: CICE is doing all its work in the northern hemisphere, so a lot of nodes are doing no work. Slender on 1 degree, with a few tiles north and south, should be OK. MC: Maybe a couple of 1 month runs? RF: Often run 1 degree for a whole year, which is reasonably well distributed over the year. Jan & Jul or Mar & Sep would give a good spread, and will give max/min.

DUG copy

MC: Want to move TBs of data into/out of as many sites as possible. Current tools are easy but not quick or efficient. Targets are visualisation and real-time monitoring. The biggest bottleneck is the client endpoint. AARNET gives 2+ MiB; current tools reach <1% of peak speed. Azure and AWS provide their own high-speed data transfer tools.
MC: Currently experimental. CLI. Allows quick transfer of directories, very large files and/or large numbers of files. Creates parallel rsync-like file transfers, 192 by default. All transfers are logged and can be restarted. SSH authentication. Self-contained single binary that runs on CentOS/Ubuntu. Includes verification checks. Part of a utility called dug. HPC users will get a cut-down version.
MC: The CLI looks like scp. Can do offline verification.
MC: Copied 400 GB from Gadi to Perth at over 500 MB/s.
MC: Current challenges are authentication and IT policy limitations at remote sites. Want to go even faster without ridiculous numbers of threads. In development right now. Hope to release to DUG customers in 2021.
AH: Would be great if some part of it could be open source. Could still offer special features for customers but get value from exposure for DUG, and potentially improvements from the community. Ever talked to ESGF? Not sure if they have a decent tool.
MC: Always a balancing act. Don’t want to have two separate projects. Do have some open source projects. Have vigorous discussions about this.

Forcing perturbation feature

NH: Dropped PIO work while working on forcing. Work going well. Currently working on the JSON parsing code, one of the trickiest bits. Split out a separate library for forcing management, including all the perturbation handling and the reading/caching/indexing of netCDF files. A cleaner way to break things out. API to configure forcing. Stand-alone tests. Ready soon. Ryan keen to use it.
AK: Visiting students keen to do perturbation experiments.

Relative humidity forcing of ACCESS-OM2

RF: Not done much. Plan is to develop a completely new module using all the Gill (1982) formulations, rather than work in with the code as-is, which would be a mess of conditional compilation and IF/THENs. Have everything consistent rather than GFDL style; GFDL was developed with an atmospheric model in mind rather than just the boundary layer. Should be clean and compact, just a few elemental functions (see the sketch below). The other module is pretty old (around 2000). When Simon/Dave took the saturation vapour pressure stuff they picked the eyes out of bits of it. There is redundant code. It's a real nightmare.
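An illustrative sketch of the kind of compact elemental function being proposed, using the Gill (1982) saturation vapour pressure formula; the function name and interface are assumptions, not the planned module:

    ! Sketch only: saturation vapour pressure after Gill (1982), Appendix 4.
    elemental function sat_vapour_pressure(t_celsius) result(es_pa)
      real(kind=8), intent(in) :: t_celsius   ! air temperature [degC]
      real(kind=8)             :: es_pa       ! saturation vapour pressure [Pa]
      ! Gill (1982): log10(es[hPa]) = (0.7859 + 0.03477*T) / (1.0 + 0.00412*T)
      es_pa = 100.0d0 * 10.0d0**((0.7859d0 + 0.03477d0*t_celsius) / &
                                 (1.0d0 + 0.00412d0*t_celsius))
    end function sat_vapour_pressure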

CICE Harmonisation

NH: Which way do we go? Who merges into whom? Might be pretty easy for us to look at their changes to begin with if there aren't too many. It would also be a lot easier when they come to grab our stuff.
AH: I took it upon myself to make a PR that isn't just rubbish. The problem is that after the codebases diverged there was a big commit where they dumped *all* the GA7 changes in. It ended up touching a lot of files that NH had also changed, so it is quite messy. Was trying to make something we could compare. Maybe just put that up as a branch to begin with; anything that stops code being hidden in private svn repos. CICE harmonisation is not critical. Can be done at a later stage, with urgency depending on performance at 0.25 degrees.
AH: They have already advertised a postdoc to work on a model that doesn’t exist yet. NH: Good to have a customer first!

COSIMA metadata

AH: Cleaning up the tags in the COSIMA data collection would be good. Can be very explicit about what you want people to use; no point having multiple tags for the same thing. Needs someone to take charge of that. AK: Need a glossary of which tags mean what. AH: Could put it on the database build repo?

Assimilation

RF: Still working with Pavel on a reduced config and restarts for assimilation runs. Some restarts are redundant when changing the ocean state at the beginning. How to override the state and do it consistently? Fewer restarts = faster start-up. Make some restarts optional and allow the user to turn them off. All restarts are needed for a proper run, but not for an assimilation run.

RF: Working on a configuration using fewer nodes. Begin and end time is almost a constant and we only run 3 days, so start/stop is a large percentage of wall time. A lot less cost if using fewer cores. NH: Still some minimal configs on the COSIMA repo. No longer maintained; were being used some time ago for cheap(er) testing. RF: Will take a look, could be a good starting point. AH: It is a 2064-CPU config. RF: 2000 CPUs is perfect. If Pavel runs an ensemble Kalman filter he can run several in parallel; 2K-core jobs are easy to get on the machine. AH: Paul Spence is running a bunch of 2K-core 0.25 degree jobs and getting great throughput.

AH: How does MPI_Init scale with core count? NH: Can't recall off the top of my head. OASIS initialisation is non-linear, but a lot of changes were made to speed it up. Are you using the new OASIS version? Didn't we say that would speed up initialisation time? RF: Thought there was a chance, a potentially good improvement. NH: Will merge that soon.