Aidan Heerdegen (AH) CLEX ANU
- Andrew Kiss (AK) COSIMA ANU
- Angus Gibson (AG) RSES ANU
Russ Fiedler (RF) CSIRO Hobart
Paul Leopardi (PL), Rui Yang (RY) NCI
Nic Hannah (NH) Double Precision
NH: Testing spack. On lightly supported cluster. Installed WRF and all dependencies with 2 commands. Only system dependency was compiler and libc. Automatically detects compilers. Can give hints to find others. Tell it compiler to use for build. Can use system modules using configuration files. AH: Directly supported modules based on Lmod. Talked to some of the NCI guys about Lmod, as the raijin version of modules was so out of date. C modules has been updated, so they installed on gadi. Lmod has some nice features, like modules based on compiler toolchain. Avoids problem with Intel/GNU subdirectories that exist on gadi. NCI said they were hoping to support spack, by setting up these configs so users could spack build things. Didn’t happen, but would have been a very nice way to operate to help us.
NH: Primary use case in under supported system where can’t trust anything to work. Just want to get stuff working. Couldn’t find an MPI install using latest/correct compiler. gadi well maintained. See spack as a portability tool. Containment is great.
AH: Was particularly interested in concretisation, id of build, allows reproducibility of build and identification of all components.
NH: Rely on MPI configured for system. Not going to have our own MPI version. AH: Yes. Would be nice if someone like Dale made configs so we could use spack. Everything they think is important to control and configure they can do so. Probably not happy with people building their own SSL libraries. Thought it would improve NCI own processes around building software. Dale said he found the system a but fragile, too easy to break. When building for a large number of users they weren’t happy with that. Thought it was a great idea for NCI, to specify builds, and also easy to create libraries for all compiler toolchains programmatically.
AG: Haven’t tried recently.
RY: Stripe count affects non-compressed more than compressed. PIO doesn’t work perfectly with Lustre, fails with very large stripe count. With large file sizes (2TB) can be faster to write compressed IO due to less IO time.
RY: Early stage of work. Many compression libraries available. Here only used zlib. Other libraries will lead to smaller size and faster compress times. Can be used as external HDF filter. File created like this requires filter to be compiled into library.
NH: How big is measurement variability? RY: Can be very different, took shortest one. Sometimes double. TEST_MPP_IO is much more stable. Real case much less so.
NH: Experiencing similar variability with ACCESS model with CICE IO. Anything we can do? Buffering? RY: Can increase IO data size and see what happens. Thinking it is lustre file system. More stripe counters touches more lustre servers. Limit to performance as increase stripe counters, as increase start to get noise from system. NH: What are the defaults, and how do you set stripe counters? RY: Default is 1, which is terrible. Can set using MPI Hints, or use lfs_setstripe on a directory. Any file created in that directory will use that many stripe counters. OMPIO and ROMIO have different flags for setting hints. Set stripe is persistent between reboots. Use lfs_getstripe to check. AH: Needs to be set to appropriate value for all files written to that directory.
NH: Did you change MPI IO aggregators or aggregator buffer size. RY: Yes. Buffer size doesn’t matter too much. Aggregator does matter. Previous work based on raijin with 16 cores. Now have 48 cores, so experience doesn’t apply to gadi. Aggregator default is 1 per node for ROMIO. Increase aggregator, doesn’t change too much, doesn’t matter for gad. OMPIO can change aggregator, doesn’t change too much.
NH: Why deflate level 4? Tried any others? RY: 4 is default. 1 and 4 doesn’t change too much. Time doesn’t change too much either. Don’t use 5 or 6 unless good reason as big increase in compression time. 4 is good balance between performance and compression ratio.
NH: Using HDR 5 c1.12.x. With previous version of HDF, any performance differences? RY: No performance difference. More features. Just more stable with lustre. Using single stripe counter both work, as soon as increase stripe counter v1.10 crashes. Single stripe counter performance is bad. Built my own v1.12 didn’t have problem.
NH: Will look into using this for CICE5. AH: Won’t work with system HDF5 library?
AH: Special options for building HDF5 v1.12? RY: Only if you need to keep compatibly with v1.10. Didn’t have any issues myself, but apparently not always readable without adding this flag. Very new version of the library.
AH: Will this be installed centrally? RY: Send a request to NCI help. Best for request to come from users.
AH: Worried about the chunk shapes in the file. Best performance with contiguous chunks in one dimension, could lead to slow read access for along other dimensions. RY: If chunks too small number of metadata operations blow out. Very large chunks use more memory and parallel compression is not so efficient. So need best chunk layout. AH: Almost need a mask on optimisation heat map to optimise performance within a useable chunk size regime. RY: Haven’t done this. Parallel decompression is not new, but do need to think about balance between IO and memory operations.
RF: Chunk size 50 in vertical will make it very slow for 2D horizontal slices. A global map would require reading in the entire dataset. RY: For write not an issue, for read yes a big issue. If include z-direction in chunk layout optimisation would mean a large increase in parameter space.
NH: AK has been running 0.1 seeing a lot of variation in run time due to IO performance in CICE. More than half the submits are more than 100%worse than the best ones. Is this system variability we can’t do much about? Also all workers are also doing IO. Don’t have async IO, don’t have IO server. Looking at this with PIO. Have no parallelism in IO so any system problem affects our whole model pipeline. RY: Yes IO server will mean you can send IO and continue calculations. Dedicated PE for IO. UM has IO server. NH: Ok, maybe go down this path AH: Code changes in CICE? NH: Exists in PIO library. Doesn’t exist in fortran API for the version we’re using. Does exist for C code. On their roadmap for the next release. A simple change to INIT call and use IO servers for asynchronous IO. Currently uses a stride to tell it how many IO servers per compute. AH: Are CICE PEs aligned with nodes? Talked about shifting yam, any issues with CICE IO PEs sharing nodes with MOM. NH: Fastest option is every CPU doing it’s own IO. Using stride > 1 doesn’t improve IO time. RY: IO access a single server, doesn’t have to jump to different file system server. There is some overhead when touching multiple file system servers when using striping for example.
AH: Run time instability too large? AK: Variable but satisfactory. High core count for a week. 2 hours for 3 months. AH: Still 3 month submits? AK: Still need to sometimes drop time step. 200KSU/year. Was 190KSU/year, but also turned off 3D daily tracer output. AH: More SUs, not better throughput? AK: Was hitting walltime limits with 3D daily tracer output. Possibly would work to run 3 months/submit with lower core count without daily tracers.
AK: Queue time is negligible. 3 model years/day. Over double previous throughput. Variability of walltime is not too high 1.9-2.1 hours for 3 months. Like 10% variability.
AH: Any more crashes? Previously said 10-15% runs would error but could be resubmitted. AK: Bad node. Ran without a hitch over weekend. NH: x77 scratch still an issue? AK: Not sure. AH: Had issues, thought they were fixed, but still affected x77 and some other projects. Maybe some lustre issues? AK: Did claim it was fixed a number of times, but wasn’t.
AH: Across tripole seam one of the velocity fields wasn’t in the right direction, caused weird flow. AK: Not a crash issue. Just shouldn’t happen, occurs occasionally. The velocity field isn’t affected, seen in some derived terms, or coupling terms. Do sometimes get excess shear along that line. RF: There is some inconsistencies with how some fields are being treated. Should come out ok. Heat fluxes slightly off, using wrong winds. They should be interpolated. What gets sent back to MOM is ok, aligned in the right spot. No anti-symmetry being broken. AK: Also true for CICE? RF: Yeah, winds are being done on u cells correctly. Don’t think CICE sees that. AH: If everything ok, why does it occur? RF: Some other term not being done correctly, either in CICE or MOM. Coupling looks ok. Some other term not being calculated correctly.
AH: How much has our version of CICE changed from the version CSIRO used for ACCESS-ESM-1.5 NH: Our ICE repo has full git history which includes the svn history. Either in the git history or in a file somewhere. Should be able to track everything. Can also do a diff. I don’t know what they’ve done, so can’t comment. Have added tons of stuff for libaccessom2. Have back-ported bug fixes they don’t have. We have newest version of CICE5 up to when development stopped which include bug fixes. As well as CICE6 back ports. AH: Can see you have started on top of Hailin’s changes. NH: They have an older version of CICE5, we have a newer version which includes some bug fixes which affect those older versions.
RF: Also auscom driver vs access driver. Used to be quite similar, ours has diverged a lot with NH work on libaccessom2. We do a lot smarter things with coupling, with orange peel segment thing. There is an apple and an orange. We use the orange. NH: Only CICE layout they use is slender. They don’t use special OASIS magic to suppler that. Definitely improves things a lot in quarter degree. Our quarter degree performance a lot better because of our layout. AH: The also have 1 degree UM, so broadly similar to a quarter degree ocean. NH: Will make a difference to efficiency. AH: Efficiency is probably a second order concern, just get running initially.
AH: Congratulations to work to improve init and termination time. RF: Mostly NH work. I have just timed it. NH: PIO? RF: Mostly down to reading in restart fields on each processor. Knocked off a lot of time. A minute or so. PIO also helped out a lot. Pavel doing a lot of IO with CICE. Timed work with doing all netCDF definitions first and then the writing, taking 14s including to gather on to a single node and write restart file. The i2o.nc could be done easily with PIO. Also implemented same thing for MOM, haven’t submitted that. Taking 4s there. Gathering global fields is just bad. Causes crashes at the end of a run. There are two other files, cicemass and ustar do the same thing, but single file, single variable, so don’t need special treatment.
RF: Setting environment variable turns off UCX messages. Put into payu? Saves thousands of lines in output file.
MW: Will also help RF. If you’re desperate should look at the patch RY and I did. Will help a lot once you’ve identified your initialisation slow down. RF: Yes, will do once I’ve worked out where the blockages are. Just seen some timings from Paul Sandery, but haven’t looked into it deeply yet. NH: Even with rubbish config, model is showing performance improvements. Will continue with that, and will consider the chunk size stuff as an optimisation task. MW: Sounds like you’ve gone from serial write to massively parallel, so inverted the problem, from one disk bound, to network bound within lustre. If you can find a sweet spot in between then should see another big speed improvement. NH: Config step pretty easy to do with PIO. Will talk to RY about it. RY: Could have a user settable parameter to specify IO writers per node. PL: Need to look into lustre striping size? RY: Currently set to 1GB, so probably ok, but can always tune this. NH: Just getting a small taste of the huge world of IO optimisation. MW: Just interesting to be IO bound again. NH: heavily impacted by IO performance with dailies with tenth. MW: IO domain offset the problem. Still there but could be dealt with in parallel with next run so could be sort of ignored.
The COSIMA Cookbook is the recommended, and supported, method for finding and accessing COSIMA datasets.
Currently COSIMA datasets are located in temporary storage under the hh5 project on the /g/data filesystem at NCI. The default COSIMA Cookbook database (/g/data/hh5/tmp/cosima/database/access-om2.db) indexes data in this location.
The COSIMA datasets are being moved to a new project, ik11: dedicated storage provided by an ARC LIEF grant. As part of this transition the default database will change to:
and will index all data in /g/data/ik11/outputs/. The database is updated daily.
This change will take place from Wednesday the 1st of July. To access the old database pass an argument to create_session:
session = cc.database.create_session(db='/g/data/hh5/tmp/cosima/database/access-om2.db')
or set the COSIMA_COOKBOOK_DB environment variable, e.g. for bash
export set COSIMA_COOKBOOK_DB=/g/data/hh5/tmp/cosima/database/access-om2.db
In the same way the new ik11 database can be accessed by using the path to it (/g/data/ik11/databases/cosima_master.db) in the same manner as above.