Technical Working Group Meeting, February 2021

Minutes

Date: 17th February, 2021

Attendees:

Aidan Heerdegen (AH) CLEX ANU
Andrew Kiss (AK) COSIMA ANU
Angus Gibson (AG) RSES ANU
Russ Fiedler (RF) CSIRO Hobart
Paul Leopardi (PL), Rui Yang (RY) NCI
Nic Hannah (NH) Double Precision

DUG and ACCESS-OM2

AH: Been working with Mark Cheeseman from DownUnder GeoSolutions (DUG)

MW: GFDL collaborating with Seattle group Vulcan. Mark Cheeseman was on staff. Now with DUG. NH: Worked at NCI and then Pawsey, saw him there when I gave a talk. Went to US project Vulcan. Then contacted me; now in contact again, working from DUG. Interested in REASP startup, now on ACCESS-OM2 project.

NH: Curious about Vulcan, what is it? MW: Vulcan is a project to wrap a Fortran core with Python. Similar to f2py. Controlled by Python and a domain specific language to put arrays on GPUs. Not my style. GFDL atmosphere group like it. Not sure if they are using it in production. Lucas Harris (head of atmosphere) spruiks it. NH: Vulcan were advertising a job in their team (short term). MW: Vulcan was a Paul Allen start-up. Jeremy McGibbon is a good guy to talk to.

MW: MITgcm was rewritten in Julia. PL: Domain specific languages becoming more popular. NH: Fortran is having a revival as well. MW: Talked to Ondřej Čertík who made LFortran. Wants to connect with flint project. fortran-lang.org is doing good stuff. Vulcan is similar to PSyclone (LFRic). Similar spirit. PL: PSyclone is one of the enabling technologies. LFRic is the application of it to weather modelling. MW: Will be UM15 before LFRic.

PL: UM model has a huge amount that isn’t just the dynamical core that all has to be integrated. Has to be more scientifically valid and performant than current. MW: Everybody is trying to work around compiler/hardware. Seems backwards. Need compilers to do this. Need better libraries. malloc should be GPU aware. Maybe co-arrays should be GPU aware. Seems silly to write programs to write programs.

AH: Agree that compiler should do the work. Was dismayed when talked to Intel guy at training session when he said compiler writers didn’t know or care overly about Fortran. Don’t expect them to optimise language specific features well. If you want performance write it in the code, don’t expect compiler to do it for you.

MW: If can write MPI, surely can write a compiler to work with GPUs. Can figure out language/library level.

AH: DUG contacted us and wanted to move into weather and climate. Met them at the NCI Leadership conference. Stu Midgley told me they had a top 25 compute capability, larger than raijin but now gadi is bigger. Biggest centre is in Texas. Immersed all their compute in oil tanks for cooling. Reckons their compute cycles are half the cost of NCI.

NH: Selling HPC compute cycles: HPC in the cloud. Also recently listed on stock exchange. AH: Floated at $1.50, now down to $1.00 a share. NH: Don’t understand how they can make money. What is so special about HPC cloud? Competing with AWS and NCI and governments. MW: Trying to pull researchers? NH: Getting research codes working. MW: NCI is free, how do you beat that? PL: Free depending on who you are.

AH: Liaised with Mark. Pointed him at repos, got him access to v45 to see how it runs on gadi. He has duplicated the setup on their machines, not using payu. Only running 1 degree so far.

AH: Had a meeting with them and Andy Hogg. Showed us their analysis software. Talked about their business model. Made the point we use a lot of cycles, but produce a lot of data that needs to be accessed and analysed over a long time period. Seemed a difficult sell to us for this reason. They view their analysis software as a “free” add-on to get people to buy cycles. Developed their own blocked-binary storage format similar to zarr. Can utilise multiple compression filters. Will need to make this sort of thing public for serious uptake IMO. Researchers will not want to be locked into a proprietary format. Andy pointed out there were some researchers who can’t get the NCI cycles they need. DUG know of Aus researchers who currently go overseas for compute cycles. Also targeting those who don’t have enough in-house expertise: they will run models for clients. NH: Will run models, not just compute? AH: They’ll do whatever you want. Quite customer focussed. MW: Running a model for you is a valuable service. AH: Would military guys who run BlueLink model be interested in those services RF? RF: Want to keep secure and in house. AH: Big scale and looking for new markets.

PIO update

NH: Task was to improve IO performance of CICE and MOM. Wanted async IO using PIO in CICE. Numbers showed could improve speed 1-2% if not waiting on CICE IO. RF sent out a presentation on PIO, active on GitHub, moving quickly. Software all there to do this. 6 months ago no support for Fortran async IO. First problem: OASIS uses MPI_COMM_WORLD to do everything. With IO tasks can’t use MPI_COMM_WORLD. New version of OASIS-mct v4 makes it easier not to use MPI_COMM_WORLD. Upgraded OASIS. Has new checks in coupling. Historically doing some weird stuff in start-up. New OASIS version didn’t like it. Ocean and Ice both write out coupling restarts at the end of the run. Payu copies both over but read in by just Ice, then sent back to Ocean. Rather than each model writing and reading their own restarts get this weird swap thing. Get unbalanced comms. OASIS has check for this. Could have disabled check, but decided to change coupling as you would expect. Don’t know why it was done this way. RF: Did it this way due to timing. Needs to store an average of the previous step, to do with multiple time steps. Also might have changed the order in the way things were done with the coupling, and how time steps were staggered. From a long time ago. NH: Might also be with dynamic atmosphere, need different coupling strategy. Made change, checked answers are identical. Now we’re using OASIS-mct 4.0. Now integrating the IO server.

NH: Open to advice/ideas. Start off with all MPI processes (world), gets split to compute group and IO groups. New compute group/world fed to OASIS and split into models. payu will see a new “model” which is the IO server. Maybe 10 lines of Fortran, calls into PIO library and sits and waits.
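Illustration of the two-stage split NH describes, as a minimal sketch in plain Python with no MPI involved. It only mimics the grouping that an MPI_Comm_split-style call would produce: ranks with the same colour land in one group, ordered by world rank. The function names, the model names and the rank counts are hypothetical, and putting the IO ranks at the end of the world is an assumption.

```python
from collections import defaultdict


def split_by_colour(colours):
    """Group rank indices by colour, keeping world-rank order within a group."""
    groups = defaultdict(list)
    for rank, colour in enumerate(colours):
        groups[colour].append(rank)
    return dict(groups)


def two_stage_split(model_sizes, n_io):
    """Split world into compute + IO first, then split compute into models.

    model_sizes: dict of model name -> number of ranks, in launch order.
    n_io: number of IO server ranks, assumed to sit at the end of world.
    Returns (io_world_ranks, {model name: world ranks}).
    """
    n_compute = sum(model_sizes.values())
    world_size = n_compute + n_io

    # Stage 1: the very first split, world -> compute group and IO group.
    stage1 = split_by_colour(
        ["compute" if rank < n_compute else "io" for rank in range(world_size)]
    )
    compute_world = stage1.get("compute", [])

    # Stage 2: the coupler only ever sees the compute group and splits
    # that into models, using ranks local to the compute group.
    model_colours = [name for name, n in model_sizes.items() for _ in range(n)]
    stage2_local = split_by_colour(model_colours)

    # Translate compute-local ranks back to world ranks for reporting.
    models = {
        name: [compute_world[i] for i in local]
        for name, local in stage2_local.items()
    }
    return stage1.get("io", []), models


io_ranks, models = two_stage_split({"ocean": 4, "ice": 2}, n_io=2)
print(io_ranks)  # [6, 7]
print(models)    # {'ocean': [0, 1, 2, 3], 'ice': [4, 5]}
```

The point of the ordering is that the models never see the IO ranks: the coupler is handed the compute group in place of the full world.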

NH: Other possibility, could split world into 3 models, then split CICE and IO into separate models. Doesn’t work, have to tell everyone else what the new world is. Async IO has to be the first thing that happens to create separate group. OASIS upgrade was smooth and changed no results. Should have async IO server soon. Could also use it for MOM. Wonder if it will be worthwhile. MW: Be nice. PL: Testing on gadi? MW: Using NCAR PIO? NH: Yes. AH: NCAR PIO guys very responsive and good to work with.

MW: They have been good custodians of netCDF.

PL: Also work with UM atmosphere with CM2? NH: No reason not to. Seems fairly easy to use starting from a netCDF style interface. Library calls look identical. With MOM might be more difficult. netCDF calls buried in FMS. If UM has obvious IO interface using netCDF, should be obvious how to do it.

PL: Wondering about upgrading from OASIS3-mct, might baulk at coupling. NH: Should be ok. Might have problem I saw, not sure if it happens with climate model. Worst case scenario you comment out the check. Just trying to catch ugly user behaviour, not an error condition. AH: Surprised you can’t turn off with a flag. NH: New check they had forgotten to include previously.

AH: UM already has IO server? MW: In the UM, so have to launch a whole UM instance which is an IO server. Much prefer the way NH has done it.

NH: PIO does an INIT and never returns. Must be IO server. MW: Prefer that model with IO outside rest of model code. NH: Designed with idea multiple submodels would use same IO server. MW: When Dale did IO server in UM, launch say 16 ranks, 15 ran model, 16th waiting for data. Always seemed like a waste. Does your IO server have its own MPI PE? Perfect if it was just a threaded process that doesn’t have a dedicated PE. RF: Locked in. Tell how many you want and where they are. Maybe every 48th processor on whole machine or CICE nodes. Doesn’t have to be one on every node. Can send it an array designating IO servers, split etc handled automatically. Also a flag to indicate whether a PE will do IO. NH: I have set it up as a separate executable launched by payu. All IO bunched together. Other option is to do what RF said, part of the CICE executable, starts but never computes CICE code. More flexible? RF: Think so. Better way to do it I think. NH: If were to go to MOM, would want 1 IO server per x number of PEs rather than bunching at end.
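Illustration of the two placement options discussed, IO ranks bunched at the end versus one IO rank every x PEs. This is a minimal sketch in plain Python that only computes rank lists and per-rank flags. The function names are hypothetical, it does not call PIO, and choosing the last rank of each block as the IO rank is an assumption.

```python
def bunched_io_ranks(n_ranks, n_io):
    """IO ranks bunched together at the end of the job."""
    if not 0 <= n_io <= n_ranks:
        raise ValueError("n_io must be between 0 and n_ranks")
    return list(range(n_ranks - n_io, n_ranks))


def strided_io_ranks(n_ranks, stride):
    """One IO rank per block of `stride` ranks (last rank of each full block)."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(stride - 1, n_ranks, stride))


def io_flags(n_ranks, io_ranks):
    """Per-rank flag: True if that rank does IO, False if it computes."""
    io = set(io_ranks)
    return [rank in io for rank in range(n_ranks)]


print(bunched_io_ranks(96, 2))         # [94, 95]
print(strided_io_ranks(96, 48))        # [47, 95]
print(io_flags(4, [3]))                # [False, False, False, True]
```

With 48-core nodes, the strided list puts one IO rank on each node, while the bunched list puts both on the last node.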

AH: Keep local, reducing communications. Sounds difficult to optimise. PL: PE placement script? NH: Could do it all at the library level. Messed around with aggregators and strides and made no difference. Tried an aggregator every node, and then a few over whole job which was the fastest. AH: Didn’t alter Lustre striping? RY: Striping depends on how many nodes you use. AH: That was already a many-dimensional optimisation space, as Rui showed; throwing this in makes it very difficult. Would need to know the breakdown of IO wait time. NH: IO server the best, non-blocking IO sends and you’re done. Doesn’t matter where the IO server PEs are. Unless waiting on IO server. PL: Waiting on send buffers? NH: Why? Work, send IO, go back to work and should be empty when you come back to send more. MW: IO only one directional fire and forget. Should be able to schedule IO jobs and get back to work. As long as IO work doesn’t interfere with computing. PL: Situation where output held up, and send buffers not clean and wind up with gap in output. MW: Possible. Hope to drain buffers. NH: If ever wait on IO server something broken in your system. MW: Always thought optimal would be IO process is 49th process on 48 core node, OS squeezing in IO when there is free time. RY: If bind IO to core always doing IO. MW: Imagine not-bound, but maybe not performant. RY: Normally IO server bound to single node but performance depends on number of nodes when writing to a single file. Can easily saturate single node IO bandwidth. Better to distribute different IO processes around different nodes to utilise multiple nodes’ bandwidth. NH: I think the library makes that ok to do. RY: User has to provide mapping file. NH: Have to tell it which ranks do what. RY: Issues with moving between heterogeneous nodes on gadi. Core number per node will change. We can provide a mapping for acceptable performance. Need lots of benchmarking.
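Illustration of why a fixed stride stops working when the core count per node changes, as a minimal sketch in plain Python. It builds a one-IO-rank-per-node list from a list of per-node core counts. The function name and the core counts are hypothetical, and it assumes ranks fill the nodes in order. It is not the mapping file format RY refers to.

```python
from itertools import accumulate


def one_io_rank_per_node(cores_per_node):
    """Pick the last rank on each node as the IO rank.

    cores_per_node: number of ranks placed on each node, in node order.
    Assumes ranks are assigned to nodes contiguously, node by node.
    """
    if any(cores < 1 for cores in cores_per_node):
        raise ValueError("every node needs at least one rank")
    # The running total gives the first rank of the *next* node,
    # so subtract one to get the last rank on this node.
    return [end - 1 for end in accumulate(cores_per_node)]


print(one_io_rank_per_node([48, 48, 48]))  # [47, 95, 143]
print(one_io_rank_per_node([48, 48, 28]))  # [47, 95, 123]
```

In the first case a stride of 48 gives the same answer. In the second case it would leave the last node without an IO rank, so the mapping has to be built from the actual node layout.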

AH: If IO server works well, it will simplify timings to understand what the model is doing? How much time spent waiting? NH: Looking like MOM spending a lot of time on IO. AK: Daily 3D fields became too slow. So turned them off. AH: So IO is a bottleneck for that.

NH: Any thoughts on PIO in MOM? RY: When I did parallel IO in MOM there already existed a compute and IO domain in MOM. Don’t need to use PIO. Native library supports directly. IO server in OASIS? Depends how MOM couples to OASIS. Can set up PIO as separate layer and forget IO domain and create mapping to PIO and use IO PE to do these things. Easier in MOM if pick up IO domain. MW tried to improve FMS domain settings to pick up IO domain as separate MPI world. NH: Still not going to work with OASIS? RY: Similar to UM. Doesn’t talk to external ranks. MW: Fixed to grid. PL: OASIS does an MPI_Isend wrapped so it waits for buffer before it does the MPI_Isend to make sure buffer is always ready. Sort of blocking non-blocking. Not sure if IO would have to work the same way? NH: Assume so, would have to check request is completed, otherwise overwrite buffers.
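Illustration of the "wait for the buffer before sending again" pattern PL and NH describe, as a minimal sketch using a Python thread in place of a non-blocking MPI send. The class name and the slow_write function are hypothetical, and this is not how OASIS or PIO implement it. The only point is that the previous request is completed before the shared buffer is overwritten.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class GuardedSender:
    """Reuse one send buffer, completing the previous send before refilling it."""

    def __init__(self, send_fn, size):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._send_fn = send_fn
        self._buffer = [0.0] * size
        self._pending = None

    def send(self, values):
        # Equivalent of waiting on the previous request: without this,
        # the buffer could be overwritten while it is still being sent.
        if self._pending is not None:
            self._pending.result()
        self._buffer[:] = values
        # Returns immediately; the caller goes back to computing.
        self._pending = self._pool.submit(self._send_fn, self._buffer)

    def close(self):
        # Drain the last outstanding send before shutting down.
        if self._pending is not None:
            self._pending.result()
        self._pool.shutdown()


received = []


def slow_write(buffer):
    """Stand-in for an IO server that takes a while to consume the data."""
    time.sleep(0.01)
    received.append(list(buffer))


sender = GuardedSender(slow_write, size=3)
for step in range(3):
    sender.send([float(step)] * 3)
sender.close()
print(received)  # [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
```

If the compute step between sends takes longer than the write, the wait returns immediately, which matches NH's point that the buffer should already be empty when you come back to send more.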

FMS MOM6

MW: FMS2 rewrite nearing completion. Pretty severe. Can’t run older models with FMS2. Bob has done a massive rewrite of MOM6. Made an infra layer. Now an abstraction of the interface to FMS from MOM6, can run either FMS or FMS2. MOM6 will continue to have a life with legacy FMS. MOM6 will support any framework you choose, FMS, FMS2, ESMF, our own fork of FMS with PIO. That is now a safe boundary configurable at the MOM6 level. Could be achieved in MOM5, or if we migrate to MOM6 this flexibility is already available. Not sure if it helps, but useful to know about.

MW: MOM6 is a rolling release. Militantly against versioning. Now in CESM, not sure if it is production currently. It is their active model for development.

ACCESS-OM2 tenth

AK: Not currently running. Finished 3rd IAF cycle a month ago. All done with new 12K configuration. Only real instability was a bad node. Hardware fix.

AK: Talk of doing another cycle with BGC at 0.1 deg. Not discussed how to configure. AH: Would need all the BGC fields? AK: Assume so.

AH: A while ago there were parameter and code updates. Are those all included? AK: Yes, using the latest versions. AH: Was the run restarted? AK: Yes. AH: Metrics ok? AMOC collapsing? AK: Not been looked at in detail. Only looked at sea-ice and that has been fine, but mostly result of atmospheric forcing. AH: Almost have an OMIP submission? AK: Haven’t saved the correct outputs. AH: Not sure if need to save for whole run, or just final one.

ACCESS-CM2-025

AH: Continuing with this. Going slowly, not huge resources available at Aspendale. Currently trying to harmonise CICE code bases by getting their version of CICE into git and doing a PR against our version of CICE so we can see what needs updating since the forking of that CICE version.

PL: I am doing performance testing on CM2, how do I keep up to date? AH: I will invite you to the sporadic meetings.