Technical Working Group Meeting, September 2020

Minutes

Date: 16th August, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD) CSIRO Aspendale

ACCESS-CM2 + 0.25deg Ocean

PD: Dave Bi thinking about 0.25 ocean. Still fairly unfamiliar with MOM5. Trying to keep it as harmonised as possible. Learn from the 1 degree harmonisation. PD: Doing the performance plan for the current financial year, hence raising this now. Asked supervisors what they want and this popped out of that conversation with Dave Bi. AH: Andy Hogg has been pushing CLEX to do the same work. PD: Maybe some of this is already happening, need some extra help from CSIRO? AH: We think it is much more that CSIRO has a lot of experience with model tuning and validation. Not something researchers want to do, but they want to use a validated model and produce research. So a win-win. PD: Validation scripts are something we run all the time, so yes would be a good collaboration. Won’t be any CMIP6 submission with this model. AH: Andy Hogg keen to have a meeting. PD: Agree that sounds like a good idea.
PL: How do we get the baseline parameterisation, ocean mask and all that? Grab from OM2? PD: Yes, grab from OM2. AH: Yes, but still some tuning in the coupled versus forced model.

ACCESS-OM2 and MOM-SIS benchmarking on Broadwell on gadi

PL: Started this week. Not much to report. Running restart based on existing data fell over. Just recreated a new baseline. Done a couple of MOM-SIS runs. Waiting on more results. Anecdotally expecting 20% speed degradation.

Update on PIO in CICE and MOM

AK: Test run with NH exe. Tried to reproduce 3 month production run at 0.1deg. Issue with CPU masking in grid fields. Have an updated exe set running. Some performance figures under CICE issue on GitHub. Speeds things up quite considerably. AH: Not waiting on ice? AK: 75% less waiting on ice.
AH: Nic getting queue limits changed to run up to 32K.
PD: A flagship project such as this should be encouraged. Have heard 70% NCI utilisation? May be able to get more time. RY: No idea about utilisation. Walltime limits can be adjusted. Not sure about the CPU limit. AH: I believe it can be. They just wanted some details. Have brought up this issue at the Scheme Managers meeting. Would like to get CPU limits increased across the board. There is a positive reaction to increasing limits, but no motivation to do so. Need to kick it up to policy/board level to get those changes. Will try and do that.
NH: Some hesitation. Consume 70-80 KSU/hr. Need to be careful. PL: What is the research motivation? NH: Building on previous work of PL and MW. With PIO in CICE can make practical configs with daily ice output and lots more cores. Turning Paul’s scaling work into production configs. Possible due to PL and MW’s work, moving to gadi and having PIO in CICE.
NH: Got 3 new configs. Existing small (5K), plus medium (8K), large (16K) and x-large (32K). MOM doubling each time; CICE doesn’t have to double. Running short runs of the configs to test. PL: 16K is where I stopped. NH: Andy Hogg said it would be good to have a document showing scalability for NCMAS. PL: All up on GitHub. NH: Will take another look. NH: Getting easier and easier to make new configs. CICE load balancing used to be tricky. Current setup seems to work well as cpus increase.
PD: What is the situation with reproducibility? In 1 degree MOM runs 12×8. Would it be the same with 8×12? Or more processors? NH: Possible to make MOM bit-repro with different layouts and processor counts, but not on by default. Can’t do it with CICE, so no big advantage. PL: What if CICE doesn’t change? NH: Should be ok if CICE is kept the same; can change MOM layout with the correct options and get repro. RF: Generally have to knock down optimisation and floating point settings. Once you’re in operational mode do you care? Good to show as a test, but operationally want to get through as fast as possible as long as results are statistically the same. PL: Climatologically the same? RF: Yeah. PL: All the other components, becomes another ensemble member. RF: Exactly. NH: Repro is a good test. Sometimes there are bugs that show up when runs don’t reproduce. That is why those repro options exist in MOM. If you make a change that shouldn’t change answers, a repro test can check that. Without repro you don’t have that option.
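For reference, a minimal sketch of the kind of reproducibility check described above: compare checksums of the variables in two output or restart files that should be bit-identical after a layout change. The md5-of-raw-bytes approach and file paths are assumptions, not an existing COSIMA tool.

```python
# Sketch (assumed approach): checksum every numeric variable in two netCDF
# files and report any that are not bit-identical.
import hashlib
import sys

import numpy as np
import netCDF4


def file_checksums(path):
    """Return {variable_name: md5 of raw data bytes} for numeric variables."""
    sums = {}
    with netCDF4.Dataset(path) as ds:
        for name, var in ds.variables.items():
            try:
                payload = np.asarray(var[:]).tobytes()
            except Exception:
                continue  # skip non-numeric (e.g. string) variables
            sums[name] = hashlib.md5(payload).hexdigest()
    return sums


def compare(run_a, run_b):
    a, b = file_checksums(run_a), file_checksums(run_b)
    mismatches = [v for v in a if v in b and a[v] != b[v]]
    for v in mismatches:
        print(f"NOT bit-reproducible: {v}")
    return not mismatches


if __name__ == "__main__":
    # usage: python repro_check.py run_a/ocean.nc run_b/ocean.nc
    sys.exit(0 if compare(sys.argv[1], sys.argv[2]) else 1)
```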
RF: Working with Pavel Sakov, struggling with some of the configs, updating YATM to the latest version. Moving on to the 0.1 degree model. Hoping to run 96 member ensembles, 3 days or so, with daily output. A lot of start/stop overhead. PIO will help a lot. Maybe look at start-up and tidy-up time. A lot different to 6 month runs. AH: Use ncar_rescale? RF: Just standard config. Not sure if it reads it in or does it dynamically. AH: Worth precalculating if not doing so. Sensitivity to io_layout with start-up? RF: Data assimilation step, restarts may come back joined rather than separate. Thousands of processors trying to read the same file. AH: mppncscatter? RF: Thinking about that. Haven’t looked at timings yet, but there will be some issues. AH: How long does the DA step take? RF: Not sure. Right at the very start of this process. Pavel has had success with 1 degree. Impressed with the quality of results from the model. Especially ice.
AH: Maybe Pavel could present to a COSIMA meeting? RF: Presented to BlueLink last week. AH: Always good to get a different view of the model.

Testing

AH: Trying to get the testing framework NH set up on Jenkins running again. Wanted to use this to check the FMS update hasn’t broken anything. Can then update the FMS fork with Marshall and Rui’s PIO changes.
NH: A couple of months ago got most of the ACCESS-OM2 tests running. The MOM5 runs were not well maintained. MOM6 was better maintained and ran consistently until gadi. Can get those working again if it is a priority. Was being looked at by GFDL. AH: Might be a priority as we want to transition to MOM6.
NH: Don’t have scaling results yet. Will probably be pretty similar to Paul’s numbers. Will show you next time. PL: Will update GitHub scaling info. NH: Planning to do some simple plots and tables using AK’s scripts that pull information out of runs.

Bathymetry

AK: Got list of edits from Simon Marsland for original topography. Wanted to get feedback about what should be carried across. Pushed a lot into COSIMA topography_tools repo. Use as a submodule in other repos which create the 1 degree and 0.25 degree topographies. Document the topography with the githash of the repo which created it. Pretty much finished 0.25. Just a little hand editing required. Hoping to get test runs with old and new bathymetry.
AH: KDS50 vertical grid? QA analyses? AK: Partial cells used optimally from the KDS50 grid. Source data is also improved (GEBCO) so no potholes and shelves. AH: Sounds like a nice, well documented process which can be picked up and improved in the future.
AK: The way it is done could be used for other input files: have it all in a git repo and embed metadata in the file linking it to the exact git hash. Good practice. Could also use manifests to hash inputs? NH: Great, have talked about reproducible inputs for a long time. AH: Hashed output can be tracked back with manifests. Ideally would put hashes in every output. There is an existing feature for unique experiment IDs in payu, but it has not gone further; still think it is a good idea.
AK: Process can be applied to other inputs. AH: The more you do it, the more sense it makes to create supporting tools to make this all easier.
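A minimal sketch of the provenance practice AK describes: record the git commit of the generating repo and a hash of each input file as netCDF global attributes. This is not the topography_tools code; file and attribute names are assumed.

```python
# Sketch (assumed file/attribute names) of embedding provenance metadata
# into a generated netCDF input file.
import hashlib
import subprocess

import netCDF4


def git_hash(repo_dir="."):
    """Current commit of the repo that generated the file."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()


def md5sum(path, blocksize=2**20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()


def tag_output(output_nc, input_files, repo_dir="."):
    """Write the generating commit and input hashes as global attributes."""
    with netCDF4.Dataset(output_nc, "r+") as ds:
        ds.setncattr("generating_commit", git_hash(repo_dir))
        for path in input_files:
            ds.setncattr(f"input_md5_{path.replace('/', '_')}", md5sum(path))


# Hypothetical usage:
# tag_output("topog.nc", ["gebco_2020.nc"], repo_dir="topography_tools")
```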

Jupyterhub on gadi

AH: What is the advantage to using JupyterHub?
JM: The configurable HTTP proxy and JupyterHub forward ports through a single ssh tunnel. If a program crashes and is re-run, it might choose a different port but the script doesn’t know. This is a barrier. Also does persistence, basically like tmux for jupyter processes. AH: Can’t do ramping up and down using a bash script? JM: Could do, that is handled through dask-jobqueue. A bash script could use that too. JM: Long term goal would be a jupyterhub.nci.org.au. Difficult to deploy services at scale. AH: Pawsey and the NZ HPC mob were doing it.
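For context, a hedged sketch of the dask-jobqueue ramp-up/ramp-down JM refers to; queue name, resources and walltime are placeholders, not an NCI recipe.

```python
# Sketch (placeholder resources): scale dask workers up and down via PBS jobs.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    queue="normal",        # assumed queue name
    cores=48,              # cores per PBS job
    memory="190GB",        # memory per PBS job
    walltime="01:00:00",
    # an account/project argument may also be needed depending on the site
)
cluster.scale(jobs=4)      # submit 4 PBS jobs; cluster.scale(0) releases them
client = Client(cluster)   # dask work is now farmed out to those jobs
```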

Technical Working Group Meeting, August 2020

Minutes

Date: 12th August, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (GFDL)

PIO with MOM (and NCAR)

MW: NCAR running a global 1/8 CESM with MOM6. Struggling with IO requirements. Worked out they need to use io_layout. Interested in parallel IO. Our patch is not in FMS, and never will be. They understand but don’t know what to do. Can’t guarantee my patch will work with other models and in the future. Said the COSIMA guys are not using it; using mppnccombine-fast so they don’t need it. Is that working? AK: Yes.
RY: Issue is no compression. Previous PIO did not support compression and the size is huge. Now netCDF supports parallel compression, so maybe look at it again. Haven’t got time to look at it. Should be a better solution for the COSIMA group.
MW: Ideally Ed Hartnett or someone else from NCAR would add PIO to FMS. They have been working on the latest FMS rewrite for more than 2 years. Haven’t finished the API update. The FMS API is very high level; they have decided it is too high level to do PIO. FMS is completely rewriting the API. Ed stopped until the FMS update. He added PIO and used direct netCDF IO calls. Bit hard-wired but suitable for MOM-like domains. Option 1 is sit and wait, Option 2 is do their own, Option 3 is do it now and use a fork of FMS. Maybe Option 4 is mppnccombine-fast. What do you think?
AK: Outputting compressed tiles with io_layout and using the fast combine. Potential issue is if io_layout makes small tiles. MW: Chunk size has to match tile size? Do tiles have to be the same size? AH: Yes. It still works if not, but is slower as it has to do a deflate/reflate step. It is fast when it can just copy compressed chunks from one file to another; the limit is only filesystem speed. Still uses much less memory even if it has to deflate/reflate. It chooses the first tile size and imposes that on all tiles. If the first tile is not the typical tile size for most files it can end up reflating/deflating a lot of the data. Also have to choose processor layout and io_layout carefully. For example 0.25 with 1920×1080 doesn’t have consistent tile sizes. MW: Trying to figure out if it is worth telling them to reach out to you guys. Sent them a link to the repo. AH: Might be a decent way to keep going until they get a better solution.
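To illustrate the consistency check being discussed: whether a domain divides evenly across an io_layout, i.e. whether all IO tiles (and hence compressed chunks) are the same size. The split rule below is only a guess at how FMS distributes remainders, and the numbers are illustrative.

```python
# Illustrative only: check whether an io_layout gives uniform tile sizes.
def tile_sizes(nx, ny, io_nx, io_ny):
    """Assumed split rule: spread any remainder over the leading tiles."""
    xs = [nx // io_nx + (1 if i < nx % io_nx else 0) for i in range(io_nx)]
    ys = [ny // io_ny + (1 if j < ny % io_ny else 0) for j in range(io_ny)]
    return xs, ys

def uniform(nx, ny, io_nx, io_ny):
    xs, ys = tile_sizes(nx, ny, io_nx, io_ny)
    return len(set(xs)) == 1 and len(set(ys)) == 1

# e.g. a 1440 x 1080 grid with an 8 x 6 io_layout is uniform,
# but the same grid with a 7 x 6 io_layout is not:
print(uniform(1440, 1080, 8, 6))   # True
print(uniform(1440, 1080, 7, 6))   # False
```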
MW: Bob had strategy to force modelling services to include PIO support by getting NCAR to use PIO.
NH: Can they use PIO patch with their current version of FMS? MW: They want to get rid of the old functions. Bad idea to ditch old API, creates a lot of problems. The parallel-IO work is on a branch.
AH: Regional output would be much better. Currently it is one file per rank. Can aggregate with PIO? NH: One output file. Can set chunking. AH: Not doing regional outputs any more because so slow. Would give more flexibility. AK: Slow because of the huge number of files. Chunks are tiny and unusable. Need to use non-fast combine. AH: I thought it was just that the output is slow. RF: Many processors on the same node will pump output. MW: Many outputs will throttle lustre. Only have a couple of hundred disks, will get throttled. AH: Another good reason to use it for MOM. MW: Does it change with io_layout? RF: No, regional outputs always write for themselves. MW: Wonder how the patch would behave. AH: NCAR is constrained to stay consistent with FMS. MOM5 is not so constrained, should just use it. NH: Should try it if the code already exists. Parallel netCDF is a game changer. AH: I have a long-standing branch to add FMS as a sub-tree. Should do it that way. Have our own FMS fork with the code changes. MW: Only took 3 years!
AK: Put in deflate level as a namelist parameter as it defaulted to 5. Used 1 as it is much faster but the compression was about the same.

CICE PIO

NH: Solved all known issues. Using PIO output driver. Works well. Can set chunking, do compression, a lot faster. Ready to merge and will do soon. I don’t understand why it is so much better than what we had. I don’t understand the configuration of it very well. Documentation is not great. When they suggested changes they didn’t perform well. Don’t understand why it is working so well as it is, and would like to make it even better.
NH: Converted the normal netCDF CICE output driver to use the latest parallel netCDF library with compression. So 3 ways: serial, the same netCDF driver with parallel compressed output, or the PIO library. The netCDF way is redundant as it is not as fast as PIO. Don’t know why. Should be doing this with MOM as well. Couldn’t recall details of MW and RY’s previous work. Should think about reviving that. Makes sense for us to do that, and we have the code already.
MW: The performance difference is concerning. NH: PIO adds another layer of gathering and buffering on top of the MPI-IO layer. With the messy CICE layout PIO is bringing together all the bits it needs and handing them to the lower layer. Maybe a possible reason for the performance difference. RY: PIO does some mapping from the compute to the IO domain. Similar to io_layout in MOM. Doesn’t use all ranks to do IO. Sends more data to a single rank to do IO, which saves contention issues. NH: MPI-IO has aggregators? RY: In the library you can select the number of aggregators. Default is 1 aggregator per node. If you use PIO with a single IO rank per node this matches MPI-IO. Did this in the paper where we tested this. If the io_layout, aggregator number and lustre striping are consistent you should get good performance.
RY: Tried different compression levels? NH: Just using level 1. Did some testing in the serial case, not much point going higher. Current tests are doing all possible outputs. RF: A lot of the compression will be due to empty fields. RY: Compression performance is related to chunk size. NH: There is a performance difference with chunk size. Too big and too small are both slower. The default chunk size is fastest for writing: 360×300 for a 2D field. Might not be ok for reading. RY: Should consider both read and write. Write once, read many patterns. MW: Parallel reads were slower than POSIX reads. AH: What is the dependence of time on chunk size? NH: Depends how many fields we output. Cutting down should be fast for larger chunk sizes. It is a namelist config currently: tell it the chunk dimensions. RY: Did similar with MOM. AH: CICE is mostly 2D, how many fields have layers? What chunking in layers? NH: No chunking, chunk size is 1 for layers. AH: Have noticed access patterns have changed with extremes work. Looking more at time series, and sensitive to time chunking. Time chunking has to be set to 1? NH: With unlimited, not sure. RF: Can chunk in time with unlimited, but it would be difficult as you need to buffer until writing is possible. With CICE data layers/categories are read all at once. Usually aggregated, not used as individual layers. Makes more sense to make the category chunk the max size. Still a small cache for each chunk. netCDF 4.7.4 increased the default cache size from 4M to 16M.
AH: I thought deflate level 4 or 5 was still worth it. NH: Can give it a try. Don’t really care about deflate level, just getting rid of zeroes.
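For reference, a small netCDF4-python sketch of the chunking/deflate trade-off being discussed. The field size, chunk shape, number of records and deflate levels are illustrative assumptions, not the CICE PIO settings.

```python
# Illustrative write-time comparison of deflate levels for a 2D field with a
# time chunk of 1 (assumed sizes, not the production configuration).
import time

import numpy as np
import netCDF4

ny, nx = 300, 360                      # assumed 2D field size
data = np.random.random((ny, nx)).astype("f4")
data[:100, :] = 0.0                    # mimic large ice-free/land regions

for level in (1, 5):
    t0 = time.perf_counter()
    with netCDF4.Dataset(f"test_l{level}.nc", "w") as ds:
        ds.createDimension("time", None)
        ds.createDimension("y", ny)
        ds.createDimension("x", nx)
        v = ds.createVariable(
            "aice", "f4", ("time", "y", "x"),
            zlib=True, complevel=level,
            chunksizes=(1, ny, nx),    # one chunk per 2D slice, time chunk = 1
        )
        for t in range(50):
            v[t, :, :] = data
    print(f"deflate level {level}: {time.perf_counter() - t0:.3f} s")
```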

Masking ICE from MOM – ICE bypass

NH: Chatted with RF on slack. Mask to bypass ICE. Don’t talk to ice in certain areas. Like the idea. Don’t know how much performance improvement. RF: Not sure it would make much difference. Just communication between CICE and MOM. NH: Also get rid of all the CICE ranks that do nothing. RF: Those are hidden away because of round-robin and evenly spread. Layout with no work would make a difference. NH: What motivated me was the IO would be nicer without crazy layouts. If didn’t bother with middle, would do one block per cpu, one in north and south. Would improve comms and IO. If it were easy would try. Maybe lots of work. AK: Using halo masking so no comms with ice-free blocks.
AH: What about ice-to-ice comms with far flung PEs? RF: Smart enough to know if it needs MPI or not. Physically co-located ranks will not use MPI. AH: Thought it would be easy? NH: Not sure it is justified in terms of performance improvement. With IO the tiny blocks were killing performance, so this was a solution to an IO problem. MW: Two issues are funny comms patterns and expensive calculations, but ice distribution is unpredictable. Don’t know which PEs will be busy. Load imbalances will be dynamic in time. Seasonal variation is order 20%. Might improve comms, but that wasn’t the problem. Stress tensor calcs are expensive, so ice regions will do a lot more work. NH: Reason to use small blocks, which improves ability to balance load. MW: Alistair struggling with icebergs. Needs dynamic load balancing. Difficult problem. RF: Small blocks are good. Min-max problem: every rank has the same amount of work, not too much or too little. CICE ignores tiles without ice. In CICE6 a lot of this can be done dynamically. There is dynamic allocation of memory. AH: Dynamic load balancing? RF: Who knows. Now using OpenMP. AH: Doesn’t seem to make much difference with UM. MW: Uses it a lot with IO as IO is so expensive.
AH: A major reason to pursue masking is it might make it easier when scaling up. If round-robin magically scales well that is ok, but last time there was a lot of analysis with heat maps and discussion about optimal block sizes. Conceptually it might be easier to understand how to best optimise a new config. NH: Does seem to make sense, could simplify some aspects of config. Not sure if it is justified. MW: Easy to look at comms inefficiency. Did this a lot for MOM5, and mostly it wasn’t comms. Sometimes the hardware, or a library not handling the messages well, rather than comms message composition. Bob does this a lot. Sees comms issues, fixes it and it doesn’t make a big difference. Definitely worth running the numbers. NH: Andy made the point that this is an architecture thing. Can’t make changes like this unilaterally. Coupled model as well. A fundamentally different architecture could be an issue. MW: Feel like CPUs are the main issue, not comms. Could afford to do comms badly. NH: Comms and disks seem pretty good on gadi. Not sure we’re pushing the limits of the machine yet. Might have to double or triple the size of the model. AH: Models are starting to diverge in terms of architecture. Coupled model will never have 20K ocean cpus any time soon. NH: Don’t care about ice or ocean performance.
AH: ESM1.5 halved runtime by doubling ocean CPUs. RF: The BGC model takes more time. Was probably very tightly tuned for the coupled model. 5-6 extra tracers. MW: 3 on by default, triples the expensive part of the model. UM uses way more resources. AH: They did an entire CMIP DECK run half as fast as they could have done. My point is that at some point we might not be able to keep the infrastructure the same. Also whether there is code that needs to move in case we need to do this in the future. NH: The code is more of an ocean calculation anyway? RF: Kind of. Presume there is a separate ice calc. Coupling code taken from GFDL/MOM and put into CICE to bypass FMS and the coupler code; it comes from the GFDL coupler code, so rather than ocean_solo.f90 it goes through coupler.f90. NH: If 10 or 20K cores might revisit some of these ideas. Goal is to get those core counts working, not sure about production.
MW: Still thinking about super high res, like 1/30. The OceanMaps people wanted it? Anything more concrete? RF: Some controversy with OceanMaps and BoM. Wanting to go to UM and NEMO. There is a meeting, and a presentation from CLEX. Wondering about the opportunity to go to very high core counts (20K/40K). AH: Didn’t GFDL run 60K cores for their ESM model? NH: Never heard about it. Atmosphere is more aggressive. RF: The config I got for CM4 was about 10K. 3000 line diag_table. AH: Performance must be IO limited? MW: Not sure. Separated from that group.

New bathymetry

AK: Russ made a script to use GEBCO from scratch. Worked on that to polish it up. Everything so far has been automatic. RF: Always some things that need intervention for MOM that aren’t so much physically realistic but are required for the model. AK: Identified some key straits. Retaining previous land masks so as not to need to redo mapping files. 0.25 needs to remove 3 ocean points and add 2 points. The make remap weights scripts are not working on gadi, due to the ESMF install. Just installed the latest ESMF locally, 8.0.1, currently running. AH: The ESMF install for WRF doesn’t work? AK: A “can’t find opal/MCA” MPI error. RF: That is an MPI error.
AH: Sounds like the sort of error that is usually random, but not sure if it is happening deterministically. AK: Might be a library version issue. AH: They have wrappers to guess the MPI library; if the major version is the same it should be fine.
AH: All this is scriptable and can be re-run, right? Bathymetries are intimately tied to the vertical grid, so they need to be re-run if that is changed. AK: Vision is certainly for it to be largely automated. Not quite there yet.
NH: I’ll have a quick look too. Noticed there is no module load esmf? AK: Using esmf/nuwrf. I’ll have a look at what esmf is built with. AH: I want ESMF installed centrally. We should get more people to ask. NH: I think it is very important. AK: Definitely need it for remapping weights. AH: Other people need it as well.

Technical Working Group Meeting, July 2020

Minutes

Date: 10th June, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)
  • Marshall Ward (GFDL)

Optimisation report

PL: Have a full report, need review before release. This is an excerpt.
PL: Aims were performance tuning and options for configuration. Did a comparison with Marshall’s previous report on raijin.
Testing included MOM-SIS at 0.25 and 1 deg to get an idea of MOM scalability stand-alone.
Then ACCESS-OM2 at 0.1 deg. Testing with land masking, scaling MOM and CICE proportionally.
Couldn’t repeat Marshall’s exactly. ACCESS-OM2 results based on different configs. Differences:
  1. continuation run
  2. time step 540s v 400s
  3. MOM and CICE were scaled proportionally
  4. Scaling taken to 20k v 16k
MOM-SIS at 0.25 degrees on gadi 25% faster than ACCESS-OM2 on raijin at low end of CPU scaling. Twice as fast for MOM-SIS at 0.1 degrees. Scalability at high end better.
ACCESS-OM2: With 5K MOM cores, MOM is 50-100% faster than MOM on raijin. Almost twice as fast at 16K, scaled out to 20K. CICE: with 2.5K cores CICE on gadi seems 50% faster than CICE on raijin. Scales to 2.5 times as fast at 16K OM2 CPUs.
Days per CPU day: from 799/4358 CICE/MOM cpus it does not scale well.
Tried to look at wait time as a fraction of wall time. Waiting is constant for high CICE ncpus, and decreases at high core counts for low CICE ncpus. So at higher core counts it is probably best to reduce the CICE ncpus as a proportion, in this case to half the usual fraction.
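For reference, the arithmetic behind these two metrics; the numbers below are hypothetical, not values from PL's report.

```python
# Illustrative only: wait fraction and parallel efficiency from run timers.
def wait_fraction(wait_time, wall_time):
    return wait_time / wall_time

def parallel_efficiency(base_ncpus, base_wall, ncpus, wall):
    # efficiency = ideal walltime at this core count / measured walltime
    return (base_wall * base_ncpus / ncpus) / wall

# hypothetical example: doubling cores from 5000 to 10000
print(parallel_efficiency(5000, 3600.0, 10000, 2000.0))  # 0.9
print(wait_fraction(300.0, 2000.0))                      # 0.15
```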
JM: How statistically significant are the results? PL: Expensive, only ran 3-4 runs. Spread quite low. Waiting time varied the most. Probably not statistically significant due to the small sample size.
MW: Timers in libaccessom2 were better than OASIS timers, which include bootstrapping times that are impossible to remove. Also noisy IO timers. Not sure how long your runs are; longer would be more accurate. PL: Runs are for 1 calendar month (28 days). MW: oasis_put and get are slightly magical, difficult to know what they’re doing. PL: Still have the outputs, could reanalyse.
MW: Speedups seem very high. Must be a configuration thing. PL: Worried it is not a straight comparison. MW: If network time is 15-20%, it wouldn’t make a difference. Always been RAM bound, which would be good if that wasn’t an issue now. PL: Very meticulous documentation of the configuration, and it is very reproducible. Made a shell script that pulls everything from GitHub. MW: I think your runs are better referenced. While the experiments were being released, it seemed some parameters etc. were changing as I was testing. That could be the difference. Wish I had documented it better.
AH: Are all figures independent of ocean timestep? PL: All timesteps at 540s. AK: Production runs are 15-20% faster, but a lot of IO. PL: Switched off IO and only output at the end of the month. It really drags things down. Wanted to make sure it isn’t IO bound. Probably memory bound. Didn’t do any profiling that was worth presenting. MW: Got a FLOP rate? PL: Yes, but not at my fingertips. If it is around a GFLOP, probably RAM bound. PL: Now profiling ACCESS-CM2 with ARM MAP. RY is looking at a micro view, at the OpenMP and compilation flag level. RY: Gadi has 4 GB/core, raijin had 2 GB/core. Not sure about bandwidth. Also 24 cores/node. Much less remote node comms. Maybe a big reduction in MPI overheads. MW: The OpenMPI UCX stuff helping. RY: Lots of on-node comms. Not sure how much. MW: Believe at high ranks. At modest resolutions comms are not a huge fraction of run time. Normal scalable configs only about 20%. PL: The way the scaling was done was different. MW scaled components separately. MW: I was using clocks that separated out wait time.
RY: If the config timestep matters, any rule for choosing a good one? AK: The longest timestep that is numerically stable. 540s is stable most of the time.
MW: How have you progressed on the CICE layout stuff? Changed in the last year? I was using sect-robin. RF: sect-robin or round-robin. AK: You used sect-robin; production did use round-robin, but not sect-robin. Less comms overhead, not sure about load balance.
PL: Is there any value in releasing the report? NH: Would be interested in reading it. Looking to get these bigger configurations going. AH: Worth documenting the performance at this point. RY: Anything else worth trying? AH: Why the 20K limit? PL: Believe that is a PBS queue limit. Some projects can apply for an exception. RY: For each queue there are limits. Can talk with the admins if necessary to increase them. AH: Will bring this up at the Scheme Managers meeting. They should be updated with gadi being a much bigger machine. Would give more flexibility with configurations. Scalability is very encouraging.
RF: BlueLink runs very short jobs, 3 days at a time. Quite a bit of start-up and termination time. How much does that vary between runs? MW: I did plot init times, it was in proportion to the number of cores. Entirely MPI_Init. It has been a point of research in OpenMPI. Double the ranks, double the boot time. RF: Also initialisation of OASIS, reading and writing of restart files. PL: The information has been collected, but hasn’t been analysed. RF: For Paul Sandery’s runs it is 20% of the time. MW: MPI_Init is brutal, and then the exchange grid. There are obvious improvements. Still doing alltoall when it only needs to query neighbours. Can be sped up by applying the geometry. Don’t need to preconnect. That is bad.
PL: At least one case where MPI collectives were being replaced with a loop of point-to-points. Was the collective unstable? MW: Yes, but also may be there for reproducibility. MW: I re-wrote a lot of those with collectives, but they had no impact. At one time collectives were very unreliable. Probably don’t need to be that way anymore. MW: I doubt that they would be better. Hinges on my assertion that comms are not a major chunk.
AH: Is MOM RAM bound or cache bound? MW: When doing calculations the model is waiting to load data from memory. AH: Memory bandwidth improves all the time. MW: Yes, but increase the number of cores and it’s a wash. It could have improved. AMD is doing better, and Intel knows there is a problem.
AH: To wrap up: yes, would like the full report. This is useful for NH to work up new configurations, as naive scaling is not the way to go. RF would also like the initialisation numbers that PL can provide.

ACCESS-OM2 release plan

AH: Are we going to change the bathymetry? Consulted with Andy who consulted with the COSIMA group. What is the upshot? AK: Ryan and Abhi want to do some MOM runs. Problems with the bathy. Andy wants the run to start; if someone has time to do it then fine, otherwise keep going. Does anyone have some idea how long it would take? RF: 1 deg would be fairly quick. We know where the problems are. Shouldn’t be a big job. Maybe a few days; 1 deg in a day for an experienced person. GFDL has some python tools for adjusting bathymetry on their GitHub. Point and click. Alistair wrote it. Might be in the MOM-SIS examples repo. MW: Don’t know, could ask. RF: Could be something that would make it straightforward.
AK: Will have a look.
MW: topog problems in MOM6 not usually the same as MOM5 due to isopycnal coordinate.
AK: Some specific points that need fixing? RF: I think I put some regions in a GitHub issue. AK: What level to dig into this? RF: Take the pits out. Set to the min depth in the config. Regions which should be min depth and have a hole. Gulf of Carpentaria is trivial. Laptev should all be set to min depth. NH: I did write some python stuff called topog_tools; you can feed it a csv of points and it will fix it. Will fill in holes automatically. Also smooths humps. May still have to look at the python and fix stuff up. Another possibility.
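A rough sketch in the spirit of the csv-driven fixes NH mentions; this is not the actual topog_tools code, and the csv columns, variable name and min-depth handling are assumptions.

```python
# Sketch (assumed file layout): apply point edits from a csv to a topography
# file, enforcing a minimum depth so pits are not re-introduced.
import csv

import netCDF4


def apply_edits(topog_nc, edits_csv, min_depth=10.0):
    """Set depth at listed (i, j) points; 0 fills land, shallow values are
    bumped to min_depth."""
    with netCDF4.Dataset(topog_nc, "r+") as ds:
        depth = ds.variables["depth"]          # assumed variable name
        with open(edits_csv) as f:
            for row in csv.DictReader(f):      # assumed columns: i, j, depth
                i, j = int(row["i"]), int(row["j"])
                new = float(row["depth"])
                if 0.0 < new < min_depth:
                    new = min_depth            # enforce model minimum depth
                depth[j, i] = new


# Hypothetical usage:
# apply_edits("topog.nc", "edits.csv", min_depth=10.0)
```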
AK: Another issue is quantisation to the vertical grid. A lot of terracing that has been inherited. RF: Different issue. Generating a new grid. 1 degree not too bad. 0.25 would be weighty still. MC: BGC in 0.25, found a hole off Peru that filled up with nutrients.
AH: Is that the only thing you’re waiting on? AK: Could do an interim release without the topog fixes. People want to use it; master is so far behind now. Also updating the wiki, which refers to the new config. Might merge ak-dev into master, tag as 1.7, and have the wiki instructions up to date with that. AH: After the bathy update what would the version be? AK: 2.0. AH: Just wondering what constitutes a change in model version. AK: Maybe one criterion is if restarts are unusable from one version to the next. Changing bathy would make them incompatible. AH: Good criterion.
AH: Final word on ice data compression? NH: Decided deflate within the model was too difficult due to bugs. Then Angus recognised the traceback for my segfault, which was great. Wasted some time not implementing it correctly. Now working correctly. Using a different IO subsystem, OMPIO rather than ROMIO. Got more segfaults. Traced and figured out. Need to be able to tell netCDF the file view: the view of a file that a rank is responsible for. MPI expects that to be set. One way to set it up is to specify the chunks to something that makes sense. Once I did that the file view was correct. Then ran into bugs in the PIO library. Seemed like cut-and-paste mistakes. No test for chunking. The library wrapper is wrong. Fixed that. Now working. Learnt a lot and a satisfying outcome. Significantly faster. Partly parallel, also a different layer does more optimisation. Noticed with 1 degree things are flying. Nothing definitive, but seems a good outcome. Will get some definitive numbers and a better explanation. Will have something to merge. Will have some PRs to add and some bug reports for PIO. RY: PIO just released a new version yesterday. NH: Didn’t know that. Tracking issues that are relevant to me. Still sitting there. Will try the new version. RY: Happy it is working now. NH: Was getting frustrated with PIO, wondered why not use netCDF directly. For what we do it is a pretty thin wrapper around netCDF. The main advantage is the way it handles the ice mapping. Worth keeping just for that. MW: FMS has most of a PIO wrapper but not the parallel bit.
PL: Any of that fix needs to be pushed upstream? NH: Changes to the CICE code. Will push to upstream CICE. Will be a couple of changes to PIO. AK: Dynamically determines chunking? NH: Need to set that. Dynamically figures out tuneable parameters under the hood about the number of aggregators. Looking at what each rank is doing. Knows what filesystem it is on. Dependent on how it is installed. Assuming it knows it is on lustre. Can generate optimal settings. Can explain more when do a summary.
AK: Want to make sure output files are consistently chunked. NH: Using the chunking to set the file view. Another way is to explicitly set the file view using the MPI API. Chunks are the same size as the data that each PE has. In CICE each block is a chunk. MW: These are netCDF chunks? AK: More chunks than cores? NH: Yes. Is that bad or good? This level is perfect for writing. Every rank is chunked on what it knows to do. Not too bad for reading. JM: How large a chunk? NH: In 1 degree every PE has 300 rows x 20 columns of the full domain. JM: Those are small. Need bigger for reading. AK: For tenth 36×30. Something like 9000 blocks/chunks. NH: Might be a problem for reading? RF: Yes, for analysis. JM: Fixed cost for every read operation. A lot of network chatter. AH: Is that the ice or the ocean in the tenth? Not sure. Chunk size is 36×30. A lot of that is ice free, 30% is land. MW: Ideal chunks are based on IO buffers in the filesystem. AH: Best chunking depends a lot on access patterns. JM: One chunk should be many minimal file units big. AH: When 0.25 had one file per PE it was horrendously bad for IO. Crippled Ryan’s analysis scripts. If you’re using sect-robin that could make the view complicated? NH: Wasn’t Ryan’s issue that the time dimension was also chunked? AH: He was testing mppnccombine-fast which just copies chunks as-is, which were set by the very small tile sizes. Similar to your problem? NH: Probably worse. Not doing MOM tiles, doing CICE blocks, which are even smaller. Same grid as MOM. RF: Fewer PEs, so blocks are half the size. MOM5 tile sizes and CICE block sizes are comparable apart from the 1 degree model.
NH: Will carry on with this. Better than deflating externally, but could run into some problems. The chunks and the file view don’t have to be the same. Will this be really bad for read performance? Gathering that it is. What could be done about it? Limited by what each rank can do. No reason the chunks have to be the same as the file view. Could have multiple processors contribute to a chunk. Can’t do that without collectives. MW: MPI-IO does collectives under the hood. Can configure MPI-IO to build your chunk for you? NH: Currently every rank does its own IO as it was simpler and faster. MW: Can’t it all be configured at the MPI-IO layer? RY: PIO can map the compute domain to an IO domain. Previous work had one IO rank per node. The IO rank collects all data from the node. Set chunking at this level. NH: For example, our chunk size could be 48 times bigger. RY: Yes. Also best performance is a single IO rank per node. PIO does have this function to map from compute to IO domain, and that is why we used it. Can also specify how many aggregators per node. First decide how many IO ranks per node, and how many aggregators per node. Those should match. Can also set the number of stripes and the stripe size to match the chunk size. IO ranks per node is the most important, as it will set chunk size. MW: Only want the same number of writers as OSTs. RY: With many writers per node you will easily saturate the network and it will be the bottleneck. AH: Have to go, but we definitely need to solve this, as scaling to 20K cores will kill this otherwise.
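For a sense of scale, the chunk sizes being discussed, assuming 4-byte reals; the 48x factor is only an illustration of aggregating one node's ranks onto a single IO rank, as suggested above.

```python
# Back-of-envelope chunk sizes (bytes) for the cases mentioned above.
def chunk_bytes(nx, ny, nz=1, itemsize=4):
    return nx * ny * nz * itemsize

print(chunk_bytes(360, 300))      # 432000 bytes: 1 deg, one chunk per 2D field
print(chunk_bytes(36, 30))        # 4320 bytes: 0.1 deg CICE block, tiny for reads
print(chunk_bytes(36, 30) * 48)   # 207360 bytes if ~48 ranks feed one IO rank
```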

MW: Will also help RF. If you’re desperate you should look at the patch RY and I did. Will help a lot once you’ve identified your initialisation slow down. RF: Yes, will do once I’ve worked out where the blockages are. Just seen some timings from Paul Sandery, but haven’t looked into it deeply yet. NH: Even with a rubbish config, the model is showing performance improvements. Will continue with that, and will consider the chunk size stuff as an optimisation task. MW: Sounds like you’ve gone from serial write to massively parallel, so inverted the problem, from disk bound to network bound within lustre. If you can find a sweet spot in between then you should see another big speed improvement. NH: The config step is pretty easy to do with PIO. Will talk to RY about it. RY: Could have a user settable parameter to specify IO writers per node. PL: Need to look into lustre striping size? RY: Currently set to 1GB, so probably ok, but can always tune this. NH: Just getting a small taste of the huge world of IO optimisation. MW: Just interesting to be IO bound again. NH: Heavily impacted by IO performance with dailies at tenth. MW: The IO domain offset the problem. Still there, but could be dealt with in parallel with the next run so could be sort of ignored.

AK: This is going to speed up the run. Worst case is post-processing to get the chunking sorted out. NH: That leaves us back at the point of having to do another step, which I would like to avoid. Maybe different: before it was a step to deflate, maybe rechunking was always going to be a problem.
AK: Revisiting Issue #212, so we need to change model ordering. Concerns about YATM and MOM sharing a node and affecting IO bandwidth. Tried this test, there is an assert statement that fails: libaccessom2 demands YATM is the first PE. NH: Will look at why I put the assert there. Weirdly proud there is an assert. RF: Remember this, when playing around with OASIS intercommunicators; might have been that conceptually this was the easiest way to get it to work. MW: I recall insisting on a change to the intercommunicator to get Score-P working. AK: Not sure how important this is. NH: There are other things. Maybe something to do with the runoff. The ice model needs to talk to YATM at some point. Maybe a scummy way of knowing where YATM is. Every config then knows where YATM is. These would be shortcut reasons. PL: Give it its own communicator and use that? NH: Maybe that is what we used to do. Could always go back to what we had before. RF: Just an idea if it would have an impact. Could give YATM its own node as a test. MW: Not sure why it is that way. Should be easy to fix. NH: Ok, certain configurations are shared, like timers and coupling fields. Instead of each model having their own understanding, share this information. So models check timestep compatibility etc. Using it to share configs. Another way to do that. MW: Doesn’t have to be rank zero. NH: Sure, it is just a hack. MW: libaccessom2 is elegant, can’t say the same for all the components it talks to. RF: There is a hard-wired broadcast from rank zero at the end.

MOM6

MW: Ever talk about MOM6? AK: Angus is getting a regional circumantarctic MOM6 config together. RF: Running old version of CM4 for decadal project. PL: Maybe a good topic for next meeting?

Technical Working Group Meeting, June 2020

Minutes

Date: 10th June, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants
  • Peter Dobrohotoff (CSIRO Aspendale)

Zarr

JM: Already have compression with netCDF4. How do they consume the data? Jupyter/dask? AK: Mostly in COSIMA; BoM and CSIRO have their own workflows. JM: Maybe archival netCDF4? As long as writing/parallel IO/combining works, no hurry to move to direct zarr output. JM: Inodes are not an issue. Done badly it can be bad. The lustre file system has a block size, so there is a natural minimum size. At least as many inodes as allocatable units on the FS. If it is a problem, wrap the whole thing in an uncompressed zip. RY: Many filters. Blosc is pretty good. Can use it in netCDF4, but not portable; needs to be compiled into the library. netCDF4 now supports parallel compression. HDF5 supported it a couple of years ago.
AH: As we’re in science, want stable well supported software. Unlikely to use bleeding edge right now. Probably won’t output directly from model for the time being. Maybe post process.
NH: What about converting to zarr from uncollated ocean output? Why collate when zarr uncollates anyway? Also we collate because it is difficult to use uncollated output. How easy is it to go from uncollated to zarr? JM: Should be pretty straightforward. Write that block directly to part of the directory tree. Why is collating hard? Don’t you just copy blocks to the appropriate place in the file? AH: Outputs are compressed. They need to be uncompressed and then recompressed. Scott Wales has made a fast collation tool (mppnccombine-fast) that just copies already compressed data. There are subtleties. Your io_layout determines block size, as the netCDF library chooses chunking automatically. Some of the quarter degree configs had very small tiles which led to very small chunks and terrible IO. AK: Regional output is one tile per PE and mppnccombine can’t handle the number of files in a glob. AH: Yes, that is disastrous. Not sure it was a good idea to compress all IO.
JM: Good idea to compress even for intermediate storage. Regional collation: what do we currently collate with? The original collation tool? AK: Yes, but we don’t have a solution. JM: Definitely need to combine to get decent chunk sizes. If interested, happy to talk about moving directly to zarr, or parallelising in some way. AK: Would want a uniform approach/format across outputs. JM: Not sure why collate runs out of files. AK: The shell can’t pass that many files to it. AH: My recollection is that it is a limit in mpirun, which is why mppnccombine-fast allows quoting globs to get around this issue. Always interested in hearing about any new approaches to improving IO and processing at scale.
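A rough sketch of the uncollated-netCDF-to-zarr idea being discussed. The file pattern, dimension names and chunk sizes are assumptions, and whether combine="by_coords" stitches a given FMS tile layout correctly would need checking case by case.

```python
# Sketch (assumed paths and dimension names): open uncollated FMS tiles with
# xarray/dask and write a single zarr store with sensible chunks.
import xarray as xr

ds = xr.open_mfdataset(
    "archive/output000/ocean/ocean.nc.*",   # hypothetical uncollated tiles
    combine="by_coords",
    parallel=True,
)
# rechunk to something usable for analysis (names assume MOM5-style dims)
ds = ds.chunk({"time": 1, "yt_ocean": 300, "xt_ocean": 360})
ds.to_zarr("ocean_output000.zarr", mode="w")
```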

ACCESS-OM2-BGC testing and rollout

AH: Congrats on Russ getting this all working. How do we roll this out?
AK: All model components are now WOMBAT/BGC versions, but two MOM exes. One with BGC and one without. All in ak-dev. No standard configs that refer to BGC. Need some files. RF: About 1G, maybe 10 files. Climatologies for forcing and 3 restart files. This will work with standard 1 degree test cases. Slight change to a field table, and a different o2i.nc (OASIS restart file). Not much change. Will work on that with Richard and Hakase. Haven’t tested with current version. Maintaining a 1 degree config should be ok. AK: No interest at high res? NH: Yes, but not worth supporting as yet. One degree shows how it is used.
AK: Would BGC be standard 1 degree with BGC as option, or separate config specialised for BGC? RF: Get together and give you more info. Probably stand separately as an additional config. Currently set it up as a couple of separate input directories. AH: So RF going to work up a 1 degree BGC config. RF: Yes. AH: So work up config, make sure it works, and then tell people about it. RF: The people who are mainly interested know about the progress.

ACCESS-OM2 release version plan

AH: Fixing configs, code, bathymetry. Do we need a plan? Need help?
AK: Considering merging all ak-dev configs into master. Was constant albedo, moving to Large and Yeager as RF advised. Didn’t make a big difference. How much do we want to polish? The initial condition is wrong: it is potential temperature rather than conservative temperature. Small compared to WOA and drift. Not sure if worth fixing. Also bathymetry at 1 and 0.25 degrees. Not sure who would fix it. AH: Talk to Andy Hogg if unsure about resources? AK: A number of problems could all be fixed in one go.
AK: Not much left to do on code. PIO with CICE. Still issues?
NH: Good news: in theory compression is supported in PIO with the latest netCDF. The PIO library also has to enable those features. There was a GitHub issue that indicated they just needed to change some tests and it would be fine. Not true. There is code in their library that will not let you call deflate. Tried commenting it out, and one of the devs thought that was reasonable. Getting some segfaults at the netCDF definition phase. Want to explore until I decide it is a waste of time; will go back to offline compression if it doesn’t pan out. Done the naive test to see if there is a simple work-around. Looking increasingly like it won’t work easily. Could do valgrind runs etc.
AK: Bleeding edge isn’t best for production runs. NH: Agreed. Will try a few more things before giving up. AK: Offline compression might be safer for production. NH: Agreed, random errors can accumulate. RY: The segfault is with PIO? NH: Using the newly installed netCDF 4.7.4p, and the latest version of the PIO wrapper with some checks commented out. Not complicated, it just calls deflate_var. RY: When you install PIO do you link to the new netCDF library? NH: Yes. Did have that problem before; AG pointed it out, and fixed that. Takes a while to become stable. RY: Do you need to specify a new flag for parallel compression when you open the file with nf_open? NH: Not doing that. Possibly PIO is doing that. RY: Maybe the PIO library is not updated to use the new flags correctly. NH: If using netCDF directly what would you expect to see? Part of nf_create? RY: Maybe hard wired for serial. NH: Possible they’ve overlooked something like this. Might be worth a little bit of time to look into that. PIO does allow compression on serial output only. Will do some quick checks. Still shouldn’t segfault. Been dragging on, keep thinking a solution is around the corner, but might be time to give it up. A bit unsatisfying, but need to use my time wisely. AH: It’s not nothing to add a post-processing compression step and then take it out again: it is not zero work or testing, so out-of-the-box compression would be nice. Major code update you’re waiting on? Need to update the fortran datetime library? Just a couple of PRs from PL and NH. Paul’s looks like a bug fix. PL: Mine is an edge case. AK: Incorrect unit conversion. AH: Don’t know where in the code and if it affects us. AK: We’re using the version that fixes that. NH: The guy who wrote that library wrote a book called “Modern Fortran”.
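For reference, a hedged netCDF4-python sketch of the requirement RY is alluding to: with netCDF-C 4.7.4 or later a deflated variable can be written in parallel, but access to that variable must be switched to collective mode. This is the Python API, not the PIO/Fortran code path, and the sizes are illustrative; it needs a parallel-enabled netCDF4/HDF5 build.

```python
# Sketch: parallel write of a compressed variable (collective mode required).
# Run with e.g.: mpirun -n 4 python par_write.py
from mpi4py import MPI
import numpy as np
import netCDF4

comm = MPI.COMM_WORLD
rank, nranks = comm.rank, comm.size

ny, nx = 300, 360   # illustrative sizes
ds = netCDF4.Dataset("par_test.nc", "w", parallel=True,
                     comm=comm, info=MPI.Info())
ds.createDimension("y", ny * nranks)
ds.createDimension("x", nx)
v = ds.createVariable("sst", "f4", ("y", "x"), zlib=True, complevel=1)
v.set_collective(True)   # parallel writes to a deflated variable must be collective

# each rank writes its own row band
v[rank * ny:(rank + 1) * ny, :] = np.full((ny, nx), rank, dtype="f4")
ds.close()
```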

Testing update

AH: MOM Travis testing is now properly testing ACCESS-OM2 and ACCESS-OM2-BGC, as I have also updated libaccessom2 testing so that it creates releases that can be used in the MOM Travis testing. Lots of boring/stupid commits to get the testing working. AK: The latest version gets used regardless of what is in the ACCESS-OM2 repo? AH: Not testing the build of ACCESS-OM2, just the MOM5 bit of that build linking to the libaccessom2 library, so that it can produce an executable and we know it worked successfully. We do compile OASIS, and it uses just the most recent version. Had intended to do the same thing with OASIS, make it create releases that could be used. Haven’t done it, but it is a relatively fast build. AK: OASIS is now a submodule in the libaccessom2 build. AH: Yes, but still have a dependency on OASIS. Don’t have a clean dependence on libaccessom2. Might be possible to refactor, but probably not worth it. So yes, have a dependence on libaccessom2 *and* OASIS.
AH: Previously Travis allowed ACCESS-OM2 to fail, and the only way you knew it worked correctly was to look at the logs to determine whether everything was successful up to the linking step.
AH: Currently fixing up Jenkins automated testing. Starting on libaccessom2 testing. Hopefully won’t be too difficult. NH: Definitely need to have it working.
PL: Writing up testing/scaling tests for ACCESS-OM2.

Technical Working Group Meeting, May 2020

Minutes

Date: 20th May, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • James Munroe (JM) Sweetwater Consultants

ACCESS-OM2-01 scaling experiments

See PL’s associated scaling doc, scaling spreadsheet and python notebook
PL: Scaled MOM5 and CICE5 by the same amount. Based on 01deg_jra55v13_ryf9091. Ran an initial run to get restart output from February 1900. Restart runs for February (28 days). 540s time step, 4480 steps. No diagnostic output. Left ice as is.
PL: CICE ncpus and ntask scaled proportionally. Scaled MOM from 80×75 (4358) to 160×150 (16548). Scaling ok looking just at the ocean timer and ice timer. Didn’t have daily CICE output.
PL: Most efficient at 10K cores with total wall time. Ocean timer shows perfect scaling. CICE only timer also shows good scaling.
NH: Keen to try these configs in production.
PL: Not sure how appropriate for production, no IO. NH: Good place to start, turn on output and see how it goes. Looks well balanced. Somewhat surprised.
PL: Now trying to reproduce Marshall’s figures from the report. Scales ocean and ice separately. Yet to get reproduction runs going. Working through namelist differences. Sometimes get a silent hang. Worth scaling ocean and ice at same time.
AH: Why do both models scale so well individually but not so well when combined? RY: CICE is waiting for MOM. Maybe some more optimal setting for CPU numbers? AH: Seems odd MOM scaling better than CICE, but CICE waiting for MOM.
NH: CICE is waiting less for the ocean as cpus are scaled. oasis_recv is constant, which means MOM is not waiting on CICE. Definitely don’t want MOM waiting for CICE. RY: If we increase MOM and reduce CICE would we get better performance? PL: Not sure. Might be useful to know how I got those numbers: using the log file, figures are total time divided by number of steps. RF: Output from access-om2.out. Just a summary, won’t show the load balance within MOM. PL: Any guidance would be useful. RF: Look in access-om2.out.
AH: Look for MOM timers. Might be some information about range of values, could be some very slow PEs masked by average. RF: the check mask is out of date. Has 12-16 processors which are purely land. Changed land mask and didn’t update mask. Some processors only have values in the halo boundaries. Crashes otherwise.
PL: Regenerated new mask files. Numbers should agree with what was done. Any more advice would be welcome. Send email or talk on slack. RF: I’ll look at the CICE layouts and balance, and masks. CICE is also seasonally dependent.
PL: Moving to a more conventional experiment layout. Will move to a shared location. AH: Could put it in /g/data/v45.
AK: CICE scaling with serial IO. Nic almost finished PIO. Will stop scaling without PIO. Runs much faster with parallel even with monthly outputs. AH: Seems to be scaling ok. AK: Any output written? AH: Running for a month, so should be some output.
AH: Ran from initial conditions? PL: Yes. Ran for 1 month with timestep of 300s. Then ran from those restarts with timestep of 540s. AH: There is an ice climatology? RF: If run for a month, should have generated ice. AK: Ice generated from surface temperature in initial conditions.
RY and PL left meeting.
AH: Maybe a bit more to look at in PL’s runs. NH: May have misunderstood where those numbers came from. RF: Looked like it was scaling nice and linearly. AH: Yes for each model, but together the scaling died going to 20K. RF: Not sure these results are that useful when IO is turned on. Code paths are not currently being exercised without IO: putting stuff on density levels, and a whole lot of globals/collectives that aren’t being done. AH: Encouraging though. NH: In principle it can scale up.

PIO compilation in ACCESS-OM2

NH: Got a reply from NCI. Resistance to having PIO in a module. Best to be self sufficient. If it turns out to be an issue can address later. Will make a submodule. Clean up the build process. Changes to CICE repo. One CICE namelist change, tell it to not explicitly use netCDF for certain things. Bit odd.
NH: Experiment repos will require updates. Maybe AK will report some more realistic performance numbers.
AH: PIO with MOM? NH: Not sure. CICE isn’t doing a great deal in the configuration I am using. Seems to all work inside parallel netCDF as it is doing output from all processors. Can use IO nodes and use comms, but it doesn’t show a performance improvement, and looks worse in many cases. We could configure it in the same way without using PIO. RF: Don’t have much control over where we put processors. CICE is at the end, probably sharing with MOM. Playing with layout might be tricky. NH: At some stage put CICE on all its own nodes. RF: Once YATM is on the first node, it ends up messing things up. NH: Why are we doing that? RF: Something to do with OASIS in the old days. Now have YATM and the root PE of MOM on the same node. Would make sure all root PEs are on their own node. No contention. YATM and MOM are also on the same NUMA partition. NH: We should change that, easy fix. YATM doesn’t do much on 0.1 as the rest of the model takes so long. RF: Two IO processors on the same node: the MOM root PE, used for diagnostics, and the YATM process. NH: If each model is on its own nodes, could make sure each node has a single IO processor. With PIO, if you want 1 IO rank per 16 processors, don’t know if it is talking across nodes.
JM: In terms of PIO are multiple nodes writing to the same file? NH: For CICE every single process is writing to the same file at the same time. Works well. Haven’t looked into it deeply; probably the optimum is something in between. Still a big improvement over serial output. AH: Kaizen (改善): small incremental improvements all the time. Compressed netCDF output? NH: No. PIO GitHub talked about supporting compression. AH: Same as what RY and Marshall did? NH: Yes. Have to wait for a parallel netCDF implementation which supports it. Confusing because there is also p-netCDF. PIO is a wrapper. AH: Yes, it wraps p-netCDF and MPI-NetCDF; p-netCDF is only netCDF3, not based on HDF5. AK: Will need a post-processing compression step. NH: The task is not done until compression is done. AK: Very sparse data, shame not to compress.
AH: xarray is supporting sparse data now. FYI. Can mean a lot less memory use for some data.

Compiling with/without WOMBAT

AH: Any speed/memory use implications to always having it compiled in? RF: Should be separate. Overhead is basically nothing. Will only allocate BGC arrays if they’re in the field table. Should be kept separate like all other BGC packages. I put some lines in the compile scripts, also if you want to compile without ACCESS.
AK: If want to maintain harmony with CM2 want a non-BGC compilation? RF: Yes. AK: From the point of view of OM2 users would be nice to be able to switch BGC on and off just through namelists. RF: Switched on via field_table. Strange design choices years ago. Also need changes in some of the restart files, o2i.nc and i2o.nc. AK: Not something that can be switched on and off? RF: No.

MOM Pull Requests

AH: Guidance for checking? RF: The two main changes in the code are probably fine. Maybe the ACCESS compilation scripts. Unless we want to change that it gets compiled in all the time for ACCESS-OM. AH: Decided not to, I think. RF: Made changes to install.sh to specify the type of model. AH: Separate model designation with WOMBAT? RF: ACCESS-OM-BGC is a new model type. Ran tests, all ok. AH: Do we need any tests to check it hasn’t changed non-BGC results? RF: Shouldn’t be anything that affects a normal run. Code compiled ok on travis. Put in some heat diagnostics, the fluxes from CICE, might be the only thing. AH: Are the Jenkins tests working? NH: The ACCESS-OM2 tests haven’t worked since the move to the new machine. RF: Run a 1 degree model and see how it goes. AH: I’ll do that.
AK: Managing ACCESS-OM2, should the distinction between BGC and non-BGC be in the control directories, so the build script builds both and you choose which in the config, or compile once, supporting both? AH: I don’t think BGC is a supported configuration yet. Needs testing. How it is implemented, shared or separate exes, is just a choice of whatever you decide is best.
AH: Turns out that Geos PR was a mistake. Asked about it, and they closed it.

Bad bathymetry

AH: Any comments? Does it need fixing? RF: Bad bathymetry needs to be fixed, or copy bathymetry from somewhere else. Bad around Australia. Same for CM2. Mentioned it 3-4 years ago, still not updated. Some pits in Gulf of Carpentaria down to 120m in 0.25. 1 degree goes down to 80m. Should be no deeper than 60m. OCCAM created some bad bathymetry in Bass Strait, off coast of China. Russian and Alaskan issues, and White Sea. Remapping indices got mucked up. AH: Wasn’t 0.25 fixed north of Bering Strait? RF: Doesn’t look like it.
JM: Bathymetry files are wrong in certain regions? RF: Came from the Southampton OCCAM model. They ran it with a normal Mercator grid and a transverse Mercator across the top. Remapping onto a spherical grid, the indices got mucked up and gave some strange bathymetry. GFDL inherited it and based a bunch of models on it. Leaked through to the ACCESS models. Was in the US forecast model and they noticed all the stuff around Alaska.
AH: Should be relatively straightforward as this is only ocean bottom cells, and doesn’t touch coasts? RF: Yes. AK: Base it on a coarsened tenth grid? RF: Not a big job, just a few slabs that need smoothing/removing. AH: Does this need to be fixed for the next release of OM2? RF: Yes. AK: No. RF: Get a student to look at it. AK: Also land mask inconsistencies; would be good to have all three models consistent. There are big curvy bits of coastline keeping ocean away from the tripoles. AH: The 1 degree is very much a model that isn’t that realistic. Tenth starts to look much more like real life.

Zarr file format

AH: Wanted to engage JM about zarr. RF: Interested as this is being used in the decadal prediction project. JM: Exactly. Talked today about parallelising output from the model into netCDF, and then post-analysis requires transforming to zarr. Zarr is a distributed file format that stores files in directories; each chunk is a separate file, and parallelisation is handled by the filesystem. Should we write directly into a zarr-like file format? There are file formats like it. netCDF may get a zarr-like back-end. RF: There is some discussion on the netCDF GitHub about zarr, looks like just one person. JM: Unidata is willing to move away from HDF5. Parallelisation of HDF5 has never worked the way it was supposed to. Instead of using parallel IO, just write directly to the format people want to use. AH: Got the impression the netCDF people never got the buy-in from HDF5 that they thought they would get. HDF5 just do their own thing. JM: Still have people using netCDF3. AH: A strength of netCDF, they could change back end again and keep the same interface. JM: Same data model. AH: What is the physical format of a zarr blob? JM: It is a binary blob that supports different filters/compression schemes. AH: Does it do machine independent storage? Bad old days of swapping endianness on binary files. AG: In zarr there are raw data blobs, and associated metadata files that describe the filter/endianness etc.
JM: Inodes not a problem. Still relatively large, on the order of the lustre striping scheme. Can wrap the whole thing inside an uncompressed zip file. Parallelises for reading just fine. Works like a tar, index on where to read, supports multiple reads on same file. AH: Would want to do this when archiving.
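For concreteness, a minimal sketch (hypothetical variable name and chunk sizes) of the layout AG and JM describe: writing an xarray dataset to a zarr store produces one binary blob per chunk plus small JSON metadata files recording shape, chunking, compressor and endianness.

```python
import numpy as np
import xarray as xr

# Hypothetical surface field, 12 monthly means on a 300x360 grid.
ds = xr.Dataset(
    {"sst": (("time", "yt", "xt"), np.random.rand(12, 300, 360).astype("float32"))}
)

# One chunk per time level: each chunk becomes a separate file in the store,
# alongside .zarray/.zattrs JSON files describing dtype, filters and endianness.
ds.to_zarr("sst.zarr", mode="w", encoding={"sst": {"chunks": (1, 300, 360)}})

# For archiving, the whole store can be wrapped in a single uncompressed zip
# (zarr can read from zip stores), as discussed above.
```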
NH: Another one is TileDB, which is a file format. JM: There are other backends, n5/z5. Distributed storage for large data sets.
AH: At one stage we did wonder if collation was even necessary with tools like xarray, but never looked into it. NH: Things have changed a lot. xarray is relatively new. 3-5 years ago it might segfault on tenth model data. So much better now, so many more possibilities.

 

Technical Working Group Meeting, April 2020

Minutes

Date: 29th April, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Andrew Kiss (AK) COSIMA ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Apologies from Peter Dobrohotoff.

JRA55-do v1.4 support

AK: Staged rollout. NH tagged some branches, so existing master is tagged 1.3.0, using old JRA55-do v1.3.1 with NH’s new exes which also support 1.4.
AK: Also working on a new feature branch for 1.4. Same exes configured to use JRA 1.4 version. Seems to run ok. Not looked at output. Will look at that today. Once satisfied that is ok will move into master, tag 1.4.
AK: Also looking at ak-dev branch with a wide variety of changes. Once this is ok will tag with a new ACCESS-OM2 version. Will be new standard for new experiments. Good to make an equivalent point across repos.
AH: COSIMA cookbook hackathon showed value of project boards. Might be a good idea next time something like this attempted. AK: Tried, but it didn’t go anywhere.
NH: Two freshwater fields coming from forcing, liquid and solid. Both go into the ICE model which accepts one new forcing field. Get added together, solid magically becomes liquid without heat changes, passed straight to Ocean. Ocean and Ice models have also been changed to accept liquid part of land/ice melt and heat part of land ice melt. Exist but just pass zeroes. Extra engineering not being used as yet. A harmonisation step which takes us close to CM2 as coupled model uses these fields.
RF: With my WOMBAT updates incorporated this new code, could get rid of ACCESS-CM preprocessor directives.
NH: In the future can put work into calculating those fields correctly in the ice model. Not a huge amount of work. Will then have river runoff, land ice runoff and land ice melt heat.
NH: New executables have another change, support different numbers of coupling fields. Land/ice coupling fields are optional. At runtime figures out what coupling fields used. Dependent on namcouple being consistent. Coded internally as a maximum set of coupling fields. You can take coupling fields out but not add new ones. Possibly useful for others. Not a fully flexible coupling framework.
NH: Working on ak-dev branch. Harmonising namcouple files. Have a lot of configuration fields, but a lot ignored. Could use same namcouple in all configs, but in practice might leave them looking a little different. They include the timestep in them, but ignored. Could set to zero? AH: Or a flag value that is obviously ignored?
NH: Only three variables used in namcouple. Rest ignored, but must parse properly. Needs cruft to make it parse. Never liked namcouple. Completely inflexible, values must be changed in multiple places.
AK: On version Oasis3-mct2, have they improved it in new version?
NH: Can now bunch fields together, pass a single 3d field instead of many 2d fields. Should improve performance. RF: Not through namcouple at all. Just a function call.
MW: What does OASIS do now? NH: Just doing routing. Which is done by MCT anyway. Remapping done by ESMF. Coupler meant to do 3 things, config, remap and routing. Made libaccessom2 do as much as possible automatically. So OASIS does very little. Still using API, so would require effort to remove.
MW: Know about NUOPC? NCAR is using it. NH: Coupling API. If all use the same API then can go plug and play. MW: MOM6 has a NUOPC driver. NH: In the future would to look at OASIS4, but probably just chuck OASIS, use MCT to do the routing and ESMF to do remapping. MW: NCAR dropped MCT. NH: MCT is a small team. AK: Something that would suit ACCESS-CM. Any critical things that rely on OASIS? MW: At mercy of UM. Probably still use OASIS due to Europe. NH: Not using ESMF, so using OASIS a lot more than we are. Might never change because of that. AK: Even moving to v4 would require coordination with CM2. NH: Nicer and cleaner, but no clear benefit.

Updated ACCESS-OM2 model configs

AK: 3 different tags. 1.3.1, 1.4 in the works. ak-dev new tag. 1.4 intended to be minimal other than change in JRA55-do version. ak-dev making more extensive changes. Using mppnccombine-fast for tenth. Output compressed data and use fast collation. Not worthwhile for 1 deg. With 0.25 output uncompressed and use mppnccombine to do compression. Hopefully output will be a reasonable size.
AK: If outputting uncompressed restarts might get large. Might want to collate restarts. Wanted to verify which run is collated: just finished, or the previous run? AH: Yes it is the restarts which are not used in the next run.
AH: Because quarter degree is not compressed won’t get the inconsistent chunk sizes between different sized tiles. Ryan had the problem when he had an io_layout with very small chunk sizes which made his performance very bad. mppnccombine-fast might be faster, will definitely use less memory. Still got compression overhead but memory use much reduced. AK: Not such a big issue as tenth. AH: Paul Spence had some issues with the time to collate his outputs. Maybe because they were compressed. Would recommend using it. AK: Fast version will always be faster? AH: Yes, at least no slower, but definitely uses less memory and will be much faster with compressed output.
MW: No appetite for FMS with parallel IO? AH: Compression? Without it won’t bother probably. RY: Did some tests on parallel IO compression. Can’t recall results. Interested to try again. Requires a bit more memory. gadi has Optane as storage or as memory. Interesting to test. Probably can use that for parallel compression or even just serial compression. Thinking about it, but haven’t started. AH: Please keep us updated.
NH: Anyone have thoughts on CICE? Planning on parallel IO in CICE. Are we going to need a compression step? RF: With daily would like compression. Post processing to do compression on a smaller number of PEs would be fine. Improving IO is critical for Paul Sandery and Pavel. NH: Might need a post processing step similar to MOM. RF: Yes. Getting parallel IO is the most important. Worry about compression later. NH: Did a run yesterday with parallel IO. Completed successfully. Output was garbage. Was expecting to do heaps of work and segfaults. Surprised at that. RF: Misaligned or complete garbage? NH: Default assumption is as bad as can be. Just used the parallel IO output driver in CICE. AK and RF realised daily CICE output was a bottleneck on 0.1 performance. As the model code existed, decided to get it working. RY: Parallel IO needs the mapping set up correctly between compute and IO domains. NH: Should be part of the current implementation. Mapping is a tricky part of CICE. AK: Values out of range, so maybe not just a mapping issue? NH: Completely broken, but not segfaulting. Just getting it building was one hurdle. Also had to call the right initialisation stuff within CICE. Had to rewrite some of it that was depending on another library from one of the NCAR models (CESM). CICE is used with CESM and they had a dependency on another utility library. Changed some code to remove the dependence. Relatively positive. Library under active development and well supported. AH: Did they develop just for their use case, and maybe doesn’t support round-robin? NH: Not sure. We do know it has never been used in any model other than CESM.
MW: Ed Hartnett (PIO) eager to get into FMS. Also lead maintainer of netCDF4.

Status of WOMBAT in ACCESS-OM2

RF: Compiled. Next is testing. Up to current ACCESS-OM2 code changes. Had issues with submodules. AK: Previously libaccessom2 dependencies brought in through CMake, now moved to submodules. If you have an existing repo you will have to initialise submodules to pull in the latest from GitHub.
RF: Made some changes to installation procedures. Can go between BGC version or standard ACCESS-OM. Want it to be different for BGC version. Changes to install scripts and hashexe etc. AH: Good that it is up to date, could have been a messy merge otherwise. RF: Will run tests today or tomorrow.

MOM5 PR from GEOS-ESM

AH: Seen this PR? Seemed a bit odd to me. First idea was to ask them to split the PR into science changes and config changes. RF: Looked like a lot of it was config changes. MW: Adding the GEOS5 stuff, which they shouldn’t. Code changes are challenging. Introduced a generic tracer, not sure what they’re doing with it. AH: Strategy? Ask them to wrap science stuff in preprocessor flags? MW: First step is to get config stuff out. Asked GFDL about it. GEOS are switching from MOM5 to MOM6. This must be associated with that effort to validate their runs. Maybe just giving back what it took to get it to work. Maybe it just makes their build process easier. AH: They have a specific requirement to use the same FMS library. Seems odd, as MOM5 and MOM6 are not likely to share FMS versions in the future. MW: Thorny topic, as it is not clear how FMS compatible MOM6 will be in the future. AH: Using FMS for less and less. MW: The PR needs to be cleaned up. AH: Also put in a CMake build system. MW: They need to explain more.
AK: Has conflicts, so can’t be merged at the moment. AH: Only going to get more conflicted. Which is why I was thinking they could split it up. I have a CMake build system in another branch, but never finished it. If we can use theirs, cool. I’ll engage with them.

Miscellaneous

AH: Been experimenting with graceful error recovery with payu. Can specify a script which can decide if the error is something you can just resubmit after. Mostly of interest to the production guys.
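A rough sketch of the kind of user script being described; how payu invokes it and the error strings it should look for are assumptions here, not payu’s actual interface. The idea is simply that the script inspects the model’s error log and exits 0 if the failure looks transient (so the run can just be resubmitted), non-zero otherwise.

```python
#!/usr/bin/env python3
# Hypothetical resubmit-decision script: exit 0 means "safe to resubmit".
# The log file name and the error patterns below are illustrative only.
import sys
from pathlib import Path

TRANSIENT_PATTERNS = ("MPI_ABORT", "Transport retry count exceeded")

def looks_transient(logfile: Path) -> bool:
    """Return True if the log shows an error we consider transient."""
    if not logfile.exists():
        return False
    text = logfile.read_text(errors="ignore")
    return any(p in text for p in TRANSIENT_PATTERNS)

if __name__ == "__main__":
    log = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("access-om2.err")
    sys.exit(0 if looks_transient(log) else 1)
```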
PL: Scalability testing with land masks, manifests, and payu setup. Supposed to be simpler but taking some time to get used to it. AH: Manifests are relatively new so some of the use cases have not been as well tested. MW: Are not all using manifests? AH: They are, but can be used in different ways. Tracking always works, but there are options to reproduce inputs and runs. Suggested PL could use reproduce to start a run. It was confounded by some restarts being missing, so not quite sure if it works as we would like. This is a very desirable feature, as it makes it very simple to fork off new runs from existing ones as well as making sure the files are consistent. PL: Working now. Next step is to change core counts and look for scalability numbers. AH: When I was doing scalability stuff for MOM-SIS I used input directory categories to isolate processor changes. Not quite doing that same thing anymore, but you can do something similar, but you won’t want to use the reproduce flag if you are changing any of the input files.
AK: Just MOM scaling or CICE as well? PL: Just looking at MOM to begin with to see dependency and wait times. AK: CICE run time is critically dependent on daily outputs. Relevance of scaling data to production output. MW: Make sure your clock can tell them apart. In principle can distinguish compute from IO. AH: Daily output always part of production? AK: Ice modellers want very high temporal output. Ice is very dynamic. Even daily output not enough to resolve some features. Maybe wait for PIO for CICE scaling tests? AH: I thought scaling tests always turned off IO? Can’t properly test scaling with daily output, as it dominates runtime.
NH: Would be nice to look at performance with and without PIO. PL: Will also look at CICE. Start with ocean model. AK: Were you (MW) running models coupled for paper scaling numbers? MW: Coupled. Not sure what IO was set to. Subtracted it and don’t recall it was large. Don’t recall a bottleneck, so might have had it turned off. RF: Wouldn’t be running with daily IO. Monthly IO doesn’t show up. MW: Sounds likely.
AK: For IAF had a lot of daily CICE output. Not complete set of fields.
MW: Starting to run performance tests at GFDL and want to use payu. Has it changed much? Manifest stuff hasn’t made a big difference? Will have to get slurm working. Filesystem will be a nightmare. You moved PBS stuff into a component? AH: No, you did that. Not huge differences. Will be great to have slurm support.

Technical Working Group Meeting, March 2020

Minutes

Date: 18th March, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU
  • Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

Scalability of ACCESS-OM2 on gadi

(Paul’s report is attached at the end)

PL: Looking at scaling. Started with ACCESS-OM2, but went to testing MOM5 directly with MOM-SIS. Using POM25, a global 0.25 model with NYF forcing. The model MW developed for testing scaling prior to ACCESS-OM2. Had to specify min_thickness in ocean_topog_nml.

PL: Tested the scaling of 960/1920/3840/7680/15360, with no masking. Scales well up to some point between 7680 and 15360.

PL: Tested effect of vectorising options (AVX2/AVX512/AVX512-REPRO). Found no difference in runtime with 15360 cores. MW: Probably communication bound at the CPU count. Repro did not change time.

MW: Never seen significant speed up from vectorisation. Typically only a few percent improvement. Code is RAM bound, so cannot provide enough data to make use of vectorisation. Still worth working toward a point where we can take advantage of vectorisation.

PL: Had one “slow” run outlier out of 20 runs. Ran 20% slower. Ran on different nodes to other jobs, not sure if that is significant. MW: IO can cause that. AH: Andy Hogg also had some slow jobs due to a bad node. AK: Job was 20x slower. Also RYF runs became consistently slower a few weeks ago. MW: OpenMPI can prepend timestamps in front of output, can help to identify issues.

PL: Getting some segfaults in ompi_request_wait_completion, caused by pmpi_wait and pmpi_bcast. Both called from the coupler. NH: Could be a bad bit of memory in the buffer, and if it tries to copy it can segfault. PL: Thinking to run again using valgrind, but would require compiling own version of valgrind wrapper for OpenMPI 4.0.2. Would be easier to use Intel MPI, but no-one else has used this. Saw some similar cases when searching which were associated with UCX, but sufficiently different to not be sure. These issues are with highest core count. MW: Often see a lot of problems at high core counts. NH: Finding bugs can be a never ending task. Use time wisely to fix bugs that affect people. MW: Quarter degree at 15K cores would have very small tile sizes. Could be the source of the issue. AH: This is not a configuration that we would use, so it is not worth spending time chasing bugs.

PL: Next testing target is 0.1 degree, but not sure which configuration and forcing data to use. Will not use MOM5-SIS, but will use ACCESS-OM2 for direct comparison purposes. AK: Configurations used in the model description paper have not been ported to gadi. Moving on to a new iteration. Andy Hogg is running a configuration that is quite similar, but moving to new configurations with updated software and forcing. Those are not quite ready.

PL: Need a starting configuration for testing. Want to confine to scalability testing and compiler flags. NH: ACCESS-OM2 is setup to be well balanced for particular configurations. Can’t just double CPUs on all models as load imbalance between submodels will dominate any other performance changes. Makes it a problematic config for clean configurations for things like compiler flags. MW: Useful approach was to check scalability of sub-model components independently. Required careful definition of timers to strategically ignore coupling time. MOM was easy, CICE was more difficult, but work with Nic’s timers helped a lot. Try to time the bits of code that are doing computation and separate from code that waits on other parts. Coupled model is a real challenge to test. Figure out what timers we used and trust those. Can reverse engineer from my old scripts.

PL: Should do MOM-SIS scalability work? MW: Easier task, and some lessons can be learned, but runtime will not match between MOM-SIS and ACCESS-OM2. Would be more of a practice run. PL: Maybe getting out of scope. Would need 0.1 MOM-SIS config. RY: Yes we have that one. If PL wanted to run ACCESS-OM2-01 is there a configuration available? AK: Andy Hogg’s currently running configuration would work. PL: Next quarter need to free up time to do other things.

MW: Might be valuable to get some score-p or similar numbers on current production model. Useful to have a record of those timings to share. Scaling test might be too much, but a profile/timing test is more tractable. RY: Any issues with score-p? Overhead? MW: Typical, 10-20%, so skews numbers but get in-depth view. Can do it one sub-model at a time. Had to hack a lot of scripts, and get NH to rewrite some code to get it to work. score-p is always done at compile time. Doesn’t affect payu. Try building MOM-SIS with score-p, then try MOM within ACCESS-OM2. Then move on to CICE and maybe libaccessom2. PL: Build script does include some score-p hooks. MW: Even without score-p MOM has very good internal timers. Not getting per-rank times. score-p is great for measuring load imbalance. AH: payu has a repeat option, which repeats the same forcing, which removes variability due to forcing. Need to think about what time you want to repeat as far as season. AK: CICE has idealised initial ice, evolves rapidly. MW: My earlier profile runs had no ice, which affects performance. Not sure it is huge, maybe 10-20%, but not huge.

MW: Overall surprised at lack of any speed up with vectorisation, and lack of slow-down with repro. PL: Will verify those numbers with 960 core config.

AH: Surprised how well it scaled. Did it scale that well on raijin? MW: The performance scaling elbow did show up lower. AH: 3x more processors per node has an effect? MW: Yes, big part of it. AH: 0.1 scaled well on raijin, so should scale better on gadi. 1/30th should scale well. Only bottleneck will be if the library can handle that many ranks.

NH: If repro flags don’t change performance that is interesting. Seem to regularly have a “what trade-off do repro flags have?” discussion, would be good to avoid. MW: Probably best to have an automated pipeline calculating these numbers. NH: People have an issue with the fp0 flag. MW: Shouldn’t affect performance. NH: Make sure fp0 is in there. MW: Agree 100%.

ACCESS-OM2 update

AH: Do we have a gadi compatible master branch on gadi? AK: No, not currently. NH: At a previous TWG meeting I self-assigned getting master gadi compatible. Merged all gadi-transition branches and tested, seemed to be working ok. Subsequent meeting AK said there were other changes required, so stopped at that point. gadi-transition branches still exist, but much has already been merged and tested on a couple of configurations. Have since moved to working on other things.

NH: Close if AK has all the things he wants into gadi-transition branch. Previous merge didn’t include all the things AK wanted in there. Happy to spend more time on that after finishing JRA55 v1.4 stuff.

JRA55-do v1.4 update

NH: Made code changes in all the models, but have not checked existing experiments are unchanged with modified code.

NH: v1.4 has a new coupling field, ice calving. Passing this through to CICE as a separate field. In CICE split into two fields, liquid water flux and a heat flux. MOM in ACCESS-CM2 already handles both these fields. Just had to change preprocessor flags to make it work for ACCESS-OM2 as well.

NH: Lots of options. Possible to join liquid and solid ice at the atmosphere, and it becomes the same as we have now. Can join in CICE and have a water flux but not a heat flux.
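As an illustration only (this is not the CICE/MOM code), the bookkeeping for the options NH lists amounts to something like the following, using the standard latent heat of fusion to supply the heat flux when the solid runoff is actually melted rather than just relabelled as liquid:

```python
# Illustrative only; fluxes in kg m-2 s-1, heat flux in W m-2.
LATENT_HEAT_FUSION = 3.34e5  # J/kg, approximate value for ice

def combine_runoff(liquid, solid, include_heat=True):
    """Merge solid (calving) runoff into the liquid runoff flux.

    With include_heat=False the solid part "magically becomes liquid"
    with no heat change, as in the current treatment; with True the
    implied (negative) heat flux of melting is also returned.
    """
    water_flux = liquid + solid
    heat_flux = -LATENT_HEAT_FUSION * solid if include_heat else 0.0
    return water_flux, heat_flux
```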

Strange MOM6 error

AH: A quick update on Navid’s error. Made a little mpi4py script to run before payu to check the status of nodes, and all but the root node had a stale version of the work directory. Like it hadn’t been archived. Link to executable was gone, but everything else was there. Reported to NCI, Ben Menadue does not know why this is happening. Also tried a delay option between runs and this helped somewhat, but also had some strange comms errors trying to connect to exec nodes. Will next try turning off all the input/output I can find in case it is a file lock error. Have been told Lustre cannot be in this state.
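A minimal sketch of the kind of mpi4py check mentioned (the work directory path, executable name and what counts as “stale” are assumptions): each rank inspects the work directory and rank 0 reports any node whose view disagrees with its own.

```python
# Run before payu with e.g.: mpirun -np <nnodes> python check_work.py work
# Hypothetical check of a shared work directory across nodes.
import os
import socket
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD
workdir = sys.argv[1] if len(sys.argv) > 1 else "work"
exe = os.path.join(workdir, "fms_ACCESS-OM.x")  # assumed executable link name

status = {
    "host": socket.gethostname(),
    "contents": sorted(os.listdir(workdir)) if os.path.isdir(workdir) else None,
    "exe_ok": os.path.exists(exe),
}
all_status = comm.gather(status, root=0)

if comm.rank == 0:
    ref = all_status[0]
    for s in all_status[1:]:
        if s["contents"] != ref["contents"] or not s["exe_ok"]:
            print("stale view on", s["host"], "exe_ok:", s["exe_ok"])
```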

MW: In old driver do a lot of moving directories from work to archive, and then relabelling. Is it still moving directories around to archive them? Maybe replace with hard copy of directory to archive. MOM6 driver is the MOM5 driver, so maybe all old drivers are doing this. Definitely worth understanding, but a quick fix to copy rather than move.

NH: Filesystem and symbolic links might be an issue. MW: Maybe symbolic links are an issue on these mounted filesystems. AH: There was a suggestion it might be because it was running on home which is NFS mounted, but that wasn’t the problem. MW: Often with raijin you just got the same nodes back when you resubmit, so maybe some sort of smart caching.

 

Scalability of ACCESS-OM2 on Gadi – Paul Leopardi 18 March 2020

 

 

Technical Working Group Meeting, February 2020

Minutes

Date: 27th February, 2020
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

New installed payu version

Version 1.0.7 is now installed in conda/analysis3-20.01 (analysis3-unstable

AH: payu is now 100% gadi compatible. Default cpus/node is now 48 and memory 192GB/node. Python interpreter, short path and storage flags are automatically determined from the model config and manifests. Using qsub_flags to manually specify storage flags no longer works, as the automatically determined storage flag option is appended and overrides the manually specified one.

RF: Paul Sandery having issues getting 0.1 deg model working. [AH: turns out it was a typo in config.yaml]

AH: No need for the number of cpus in a payu job to be divisible by the number of CPUs in a node. Request however many the job uses, and payu will pad the request to make sure the PBS submission is requesting an integer number of nodes if ncpus is greater than the number in a single node. PL: Rounds up for each model? AH: No, just the total. MW: Will spread models across nodes, so a node can have different models on it.
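For concreteness, the padding described above is just rounding the total request up to whole 48-core nodes; a sketch (not payu’s actual code):

```python
import math

CORES_PER_NODE = 48  # gadi

def padded_request(total_ncpus):
    """Round a multi-node core request up to an integer number of nodes,
    as payu does for the PBS submission; model layouts are unchanged."""
    if total_ncpus <= CORES_PER_NODE:
        return total_ncpus
    return math.ceil(total_ncpus / CORES_PER_NODE) * CORES_PER_NODE

# e.g. a hypothetical 5000-core job would be submitted as 5040 cores (105 nodes)
```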

AH: Andy Hogg ran out 80 odd submits with the tenth model. Occasional hang, resubmit ok. Might be more stable than raijin.

AH: Navid has MOM6 model that cannot run more than a couple of submits without it crashing with an error that it cannot find the executable. Weird error, let me know if you see anything similar.

NH: Caution with disks and where to put things. Reading input files can be very slow sometimes, or not, and then files are not there and turn up later. If the executable is missing, is it running off a disk that is not good? MW: Filesystems are very complicated on gadi? NH: Less certainty of performance with such a different system, with data file systems being mounted separately. I’d look at this.
PD: Good place to look if disk has got caught up doing too many tasks. gdata just hangs, saving text file takes a while. Due to being on login node? Get similar delays with interactive job on execute node.
AH: People reporting issues with login delays. Probably a disk issue? Navid’s job is not being run from gdata, but from scratch. Inclined to blame new system of mounting. Could we use jobfs. MW: Like in the old days when we ran on the node? Good luck! AH: Could just do some tests. NH: Concerning if scratch is slow.
AH: Not sure if filesystems are mounted with NFS. MW: That is what we do on Gaea, and have tons of problems with mount on demand. Biggest frustration with using the GFDL machine. It’s a nightmare. At least NCI have Lustre know-how. AH: Used to have a lot of problems with NFS cache errors in the past, files disappearing and reappearing. Does sound similar to Navid’s problem.
MW: Raijin’s filesystem was quite good. Why the change? AH: Security. Commercial in confidence stuff. I think it is overblown. Can’t see anyone else’s jobs on the queue. Can’t even check if other people are running on the project. Are moving to 2-factor auth also.

What is required to get gadi transition into master for ACCESS-OM2

AH: Andrew Kiss is on personal leave but sent around an email:
re. gadi-transition, we could proceed like so:
– we’ve also been transitioning libaccessom2 to use submodules for its dependencies instead of cmake https://github.com/COSIMA/libaccessom2/issues/29 which would require this commit https://github.com/COSIMA/libaccessom2/tree/53a86efcd01672c655c93f2d68e9f187668159de (not currently in gadi-transition branch)
– get the libaccessom2 tests working https://github.com/COSIMA/libaccessom2/issues/36
– there’s a gadi-transition branch libaccessom2, cice and mom that could be merged into master. They use openMPI4.0.2
– there’s also a gadi-transition branch for all the primary (ie JRA, non-minimal) configurations but the exe paths would need to be updated before merging to master
– the access-om2 gadi-transition branch would then need to be updated to use the correct submodules for model components and configurations. We also want to remove the core and minimal config submodules https://github.com/COSIMA/access-om2/issues/183
also fyi the current gadi build instructions are here
AH: Feels urgent that people can use on gadi. Any comments on Andrew’s email?
PL: Transition to submodules finished? AH: That is on a separate branch. NH: I did that work. Put it in a dev branch. Not intended to be part of the gadi transition, to keep the number of additions to a minimum. AH: Agree if that is the easiest. Master is broken for gadi, so anything that works is an improvement. If there is no feedback can do this offline. Could make a project to be explicit about what is required. NH: Given that gadi-transition does work. Andrew and Andy use it. Wouldn’t hurt to put it in now. Work that PL has done to make sure it does reproduce ticks that box. So ready to go. Able to reproduce if we need to. I’ll merge it and do some interactive testing. Then people can use it and I can do automatic testing.
PL: What branch will it be merged into? A lot of branches in a lot of repos.
NH: Isolate gadi-transition branches and merge into master straight away. Not bother with other development branches at this stage. Want to get something in master that people can use. In future bring everything into dev as discussed, with master staying stable, just bug fixes, until decide to update from dev. I’ll go through the branches and just bring in the gadi transition stuff. PL: So dev will have submodule changes and master will not? NH: For the time being. With previous discussion we’ll be slower moving on master, to make sure it is working. Having dev will allow us to move that more rapidly. People can run off dev at their own risk. AH: Submodules will remain a named feature branch and pulled into dev at some future time. Should discourage having personal development branches on the main repo. If you want to experiment do it on your own fork. Branches on the main repo should be master, dev or named feature to keep it clean and everyone can understand what they mean.

Stack array errors and heap-array option

AH: Apologies minutes from last TWG meeting are not on the COSIMA website. There is an IT issue with the server. We wanted to follow up with stack array errors.
AH: Did we ever test on raijin with the same compiler? Is there any way we can do a comparative test? Use a raijin image? Any more from Dale about this stack stuff? PL: Haven’t heard anything. AH: Last meeting there was some mention of there being a limit on UM stacksize. RY: Already fixed Ilia’s issue. Fixed by making stacksize unlimited. RF: Always run with unlimited stack size. When I had the problem it was only fixed by setting heap-arrays small or zero. When I went into the code and made array allocation from automatic to allocatable the error went away.
MW: If I have an automatic array I get three different heap allocations for three different compilers. RF: This option forces all arrays on to the heap.
AH: This was fixed a while ago Rui? RY: Not clear this is the same problem. Ilia’s issue was the end of 2019 when gadi first on line. Not sure it is the same issue.

BGC Update

AH: Russ forwarded an update to Andy Hogg.
RF: Work was completed on raijin in 2019. BGC code in to MOM and CICE. Required changes in CICE: moving arrays around to different modules due to scope issues which allow optional fields to be sent. Main one is to send 10m winds to ocean, not just the wind stress. Holding off to issue PR until gadi transition done so could go in clearly.
NH: Will be useful for JRA1.4 work.
RF: Hakase will be using it for BGC. Passing algae between ice and ocean components. To add new field, need to add field to code, but don’t have to be passed. Just picked up from namcouple using the flags in OASIS to see if it’s registered.
AH: Can this be the next cab off the rank after gadi-transition, before AK’s science tweaks? Not relying on any changes in Andrew’s branches? RF: Would like to get gadi transition out of the way and then test these changes. Not tested on gadi yet.
How to proceed? Testing?
I’ve held off issuing a pull request until the dust settles wrt the gadi transition. There’s a bit of code rearrangement in order to allow optional fields (10m wind speed but this can be extended) to be passed from CICE.
The flags ACCESS-OM-BGC (tested) and ACCESS-ESM (untested) enable compilation of the BGC code. The 10m winds need to be added to the namcouple files and the MOM coupling fields namelist.

JRA55-do counter-rotating cyclones

RF: Fortunately Paul Sandery’s run started in 1988. Last reverse cyclone in 1987. CAFE60 uses a whole month window, so it is washed out in the average.
One of the RYF runs has a reverse cyclone (83-84). Tell Kial.

Scaling

PL: Thanks to Marshall for getting me up to speed on scaling tests and sharing scripts. Can reproduce diagrams so can compare between raijin and gadi.
AH: Any more performance numbers? PL: Now in a position to answer questions, just need to know what questions to ask.
AH: ACCESS-OM2-01 currently running around 5K cores, would love to be able to scale to 10K, 20K even better. MW: MOM scaled to 50K. AH: CICE doesn’t scale as well. MW: Any work on CICE distributions? RF: Nope. Would need to be done again at higher core counts. MW: Current one working really well. AH: On NH’s to-do list was to experiment with layouts and load balancing. MW: Alistair is very interested in load balancing sea ice models. Particularly icebergs. Has some quasi-Lagrangian code in SIS2 to load balance icebergs. Maybe some ideas will translate or vice versa.
PL: For the moment will just look at MOM and see how it scales at 0.1? AH: Maybe just try doubling everything and see if it scales ok? MW: Used to make those processor heat maps to get the load imbalance of CICE. Would be good to keep an eye on that while working with scaling. Tony Craig (CICE developer) is very interested.

Atmosphere/coupled models

PD: Still using code frozen for CMIP runs. Extending number of runs in ensemble.
AH: People in CLEX are keen to run CM2. PD: Not aware, maybe through someone else, maybe Simon or Martin? CM2 and ESM-1.5 runs have been published under s38 project.
AH: Scott Wales doing an ultra high resolution atmosphere run over Australia, under the STRESS2020 project. PD: Atmosphere only, do you know what resolution? I’ve also done some high res atmosphere only runs. On a project to improve the turbulent kinetic energy spectrum in the UM. Working on code to put stochastic backscatter into the low res N96 (CMIP6) atmosphere. Got some good results injecting turbulent kinetic energy into small scales to improve artificial dissipation associated with the semi-Lagrangian timestep in the UM. The test is to see how the improved N96 results compare to N512 runs, using STRESS2020 resources. Working with Jorgen Frederiksen. Should talk to Scott.
AH: At the moment Scott is targeting 400m over Australia. PL: Convection resolving? AH: Planning a 2 day run to simulate Cyclone Debbie. Nested 400m run for Australia, inside BARRA at 2.2km. 10500×13000. PD: We’re going global. MW: How many levels? Same as global? PD: 85. AH: Major problem is running out of memory. MW: More cores should mean less memory. Maybe their Helmholtz server imposes some memory limit on the ranks. AH: Currently waiting for large memory nodes to come online.

New FMS

MW: New FMS version coming. Targeting autotools and getting rid of mkmf. If you’re on MOM5 you can use your frozen version. Completely rewritten IO in FMS. Now a thin wrapper to netCDF. No more magic functions like save_restart, write_restart. They have been replaced by lower level ops to allow model developers to have more control. Not sure of the significance for MOM5. AH: API compatible? MW: Keep compatible with old API as long as they can. Could dump it in and slowly integrate. Only raising in case you want to do more innovative stuff with IO. PL: Affects MOM6 mainly? MW: MOM6 is one of the main targets. PL: Parallel IO support? MW: Part of the reason. They want parallel IO in the atmosphere model, which NCAR now uses. Now an important model. This implements the hooks for that work. RY: MPI-IO still there or be replaced by PIO? MW: It is. RY: Simpler to do one? MW: They’ve sent a patch to get MOM6 working with that now. Doesn’t work currently. Not sure about the progress, but know you were interested in PIO. RF: We’re interested from the ICE point of view. New version of BRAN will need daily inputs in CICE. Performance is terrible as IO is collected on to one processor. MW: FMS will not help CICE, but it is a test case for whether PIO is a valid solution.

Technical Working Group Meeting, November 2019

Minutes

Date: 27th November, 2019
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) ANU,  Andrew Kiss (AK)  COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

ACCESS-OM2 on gadi

PL: Submodules not updated (#176). Reported bug from CICE5 but not being built. AK: not sure how to release this. Sometimes model components updated but not tested. AH: gadi transition branch? AK: Yes. PL: Science bug.
PL: To test had to copy files around. Needed to update config.yaml and atmosphere.json. Made fork of 1deg_JRA55_RYF for testing. Had to move to non-public places as don’t have access to public places. Will send details in an email.
PL: conda/analysis3-unstable needs to be updated, payu not working on gadi. AH: Did update, still not working. Update only tested on interactive job. PBS job strips out environment. Wanted to consult with Marshall about why payu works as it does currently. Difficult to debug with payu-run as it does not have the same environment as “payu run”. PL: Work-around is to add -V option to qsub_flags in config.yaml. AH: This is what I am considering changing payu to by default. Not sure. Currently looking into this.
PL: nccmp module not on gadi. Been using for reproducibility testing. In backlog. RY: Can install personally, don’t have to wait for system install.
PL: Running on gadi. Got 1 deg RYF55 finished. Did not have mppnccombine compiled. Will have to do this to get this working correctly. Got something for baseline for comparison. Report by the end of the week.
RY: gadi 48 cores. Default based on broadwell (28 cores). Do you have an up to date config? Paul currently changes core count in his config, but is it done in official config?
AH: I was in the process of making an official configuration for gadi. Copied all inputs that were in /short/public to the ik11 project. Once directory structure finalised will make a config that runs, update on GitHub, and look at making the same changes for other configs. Make an exemplar config with those changes. RY: Should work on same configs.
RY: Anyone else running on gadi? AH: No.
AH: What are the impediments to others updating ACCESS-OM2 on GitHub? People not sure if they can? How they should go about it? AK: Put my hand up to do this. Other model components also need updating. AH: Maybe a dev branch that everyone pulls from. Easier to make changes without worrying about breaking. So everyone working from the same version and don’t have to re-fix known bugs.
AH: Environment stuff? MW: Something about the python exec command. Nuance? Wholesale copy everything? Wanted to create idealised processes, rather than depend on what users have set up. payu run submits a job to PBS with a whole new environment. Explicitly give environment variables.
AH: Drawback is payu-run does not use the same environment as payu run. MW: Not launching a process. payu run submits to PBS and starts a posix process with a defined environment. Exception when explicitly given environment variables. AH: One work-around is to make a list of environment variables we want to keep. Losing MODULEPATH variables. PL: The module env being used by payu requires modules 3. Modules 4 works differently. Python code from modules 4 may work better.
MW: Fixed? AH: Thought I had, but was fooled because using payu-run. MW: If you set MODULEPATH locally, it won’t be exported to payu run process.
PL: What is the fix? MW: On raijin there was a bootstrap script in the init dir, which sets everything. I duplicated those commands and put them in the payu module that did the equivalent bootstrap. If moving to gadi and it is different none of that bootstrap script works. PL: Bootstrap script is there, but completely different. MW: Was the old version, and never actually used the bootstrap script. Maybe exec the bootstrap script they provide? AH: Or pass through environment variables that are set already. MW: Do whatever you think is best. Did try and make it so the ‘payu run’ job was clean and always looked the same regardless of who submits. If we take the entire ENV and submit to run, every run will be different. One variable is a controlled solution. The solution should make it possible for the job on the submit node to set it up on its own. Should get it going and not be held up by my purist notions. AH: Try/except blocks can be used to support multiple approaches. MW: Definitely need to bootstrap the modules. PL: Sent through email with details.

OpenMPI/4.0.1 on gadi

AH: Angus reported openmpi/4.0.1 seems broken. Has this been fixed?
AG: Any wrapped commands (mpicc, mpifort) will print whitespace before output. In most cases ok, but can break configure scripts. Ben M knows about it, but not why.
PL: Divide by zero error in MPI_Init. MW: Remember that one UCX back-end, FP exception. Evaluates a log function when evaluating binary tree when working out communication. Ben M told them about it, but got nothing back. We use FP exception checking, but can’t ignore for just MPI. PL: Work-around like turn off UCX? MW: Could turn off FP exceptions. A race condition, so not every job sees it. RY: Can turn off UCX. Can use ob1 instead of UCX. Also try that. PL: Wasn’t sure it would work on gadi.
AH: Maybe 4.0.1 not a good candidate for testing? Get intermittent crashes.

Russ update on model performance on gadi

RF: Been testing OFAM bluelink, compiled as MOM-SIS without doing ice. Performance was fantastic. 2x faster than Sandy Bridge. Don’t get hammered with extra cost on new CPUs. Initialisation was very fast. A lot of files, so might be a low load issue. Dropped from 100s to 8s. Doing data assimilation runs, run 3 days at a time. 25% of the run time was init. Now pretty much zero. MOM5 performance was really good.
RF: Did notice some variation on start up of CM4. Still a lot faster. Reads in a lot more files and a lot more data. Still considerably faster than on raijin. MW: MOM has IO timers, do you have those on? FMS timers. Rui used them a lot. RF: No, didn’t turn them on.
RF: Running CM4 was about 15% faster than Broadwell. Improved but will cost a lot more for decadal prediction. RY: 15% is normal. Martin report UM is 30% quicker. RF: SIS2 load balance is bad. Probably a bunch of things being covered up. Needs more testing.
MW: Bob has never talked about SIS2 load imbalance. Presumably oblivious to them. RF: Would have to be. Regular layout would lead to many redundant processors. MW: Alistair has done some iceberg code load balance improvements. RF: Doesn’t take much time. Had to turn off iceberg stuff on raijin. netcdf stuff broke it. Might turn back on. Time spent in iceberg code minimal.

Stack array errors and heap array option

RF: When compiling need to set heap-arrays option in the compiler, otherwise get segfaults with stack, even when stack set to unlimited. Wasn’t an issue on raijin. Happened for both MOM5 and CM4. PL: Dale mentioned stack size being limited to 8MB. RF: I unlimited stack size, so shouldn’t have been an issue. Got all sorts of issues with unmapped addresses. First one I saw was an automatic array so tried moving to allocatable, which moved the error. Then tried different heap-arrays size options, which moved the error again. MOM5 dropped to heap-arrays 5KB. Same for CM4 but set to zero for SIS2 and it got through. Different models, seems ubiquitous. MW: Intel Fortran?
MW: When compiling and running on Cray machines stack vars use malloc, so they are heap variables not stack. Same model, same compiler on laptop (gcc), same variables are stack variables. Is it possible that moving from raijin to gadi something is different about malloc? RY: CentOS 7 v 8 makes some difference. MW: Is the kernel making some decisions on malloc? RY: Had similar issues with UM. Stacksize unlimited seemed to fix for UM. But Dale talked about this in an ACCESS meeting, the kernel changed something that caused this problem.
NH: Intel compiler has a heap-arrays option to always put arrays on the heap. Useful in some cases. Models can have array bounds overruns, and easier to track when you trash the heap compared to the stack. RY: Slower? NH: Depends. Doesn’t do it for everything, just the larger arrays. RF: If you just set heap-arrays, all are on the heap. Can control it. MW: In MOM6 there are explicit places we declare variables we know we won’t use, contingent on the assumption they are stack vars. Can’t make those assumptions any longer.
NH: Surprised to hear linux kernel. Would think it was Fortran runtime or compiler. MW: runtime or libc. Couldn’t figure out why different results with same compiler on different platforms. NH: Calculating variables addresses, compiler computes stack offsets. Looking at the executable there are static offsets. Needs to be done at compile time. MW: Shouldn’t be running models that need to use heap. Should be resilient to either choice. No? NH: Comes down to algorithms used to manage memory. Heap has algorithm to minimise fragmentation. Don’t have an answer, will need to think about it.
MW: Can you send a bug report for SIS2? RF: Could be everywhere that has run out of stack space. Just the first one I tried to fix this.
AH: What OS are you running on your laptop? MW: Archlinux. Comparing them to the travis VMs. AH: At some point the compiler has to query the system to see what resources are available? MW: The fact that you’re typing stacksize unlimited shows you accessing the kernel. AH: Seems strange, system has plenty of memory. MW: I’m interested in this problem. AH: Problem should be reported to relevant NCI people (Dale/Ben?). Potentially affecting a lot of codes. Not tenable that everyone who has this issue have to debug it themselves. MW: Bad memory explicit in stack, buried in the heap? NH: Can make a huge difference. Layout of memory is different. More likely something on HEAP won’t affect other variables. More fragmented on stack. Heap memory more tightly packed. MW: Fixed a couple of dozen memory access bugs in MOM6 and they take it seriously. RF: Old versions I’m using with CM4 release. Happens with MOM5. Only FMS common. MW: Wondering if this is a bug that is hidden moving from stack to heap.
MW: Using GCC9.0 to find these. Few flags to find stuff. Initialise with NaNs. malloc-perturb is an environment variable you can turn on and that helps. Turns on signalling NaNs. Any FP op generates an error now. Finds a lot of zeroes in bad memory accesses that didn’t trigger errors. Trying to not use valgrind, but that would work also.
RF: Switch in GCC that does something similar to valgrind. Puts in guards around arrays. MW: Don’t know the explicit option, using -Wall, turns it on for me. GCC9.0 is very aggressive at finding issues in a way that 5/6/7 were not.
AH: Same compiler on raijin and gadi, see if gadi only issue. RF: Not sure if it was the same version of 2019 I was using. AG: One overlapping compiler 2019.3. RF: Recently recompiled MOM-SIS build. Will look and see if it is the same. AH: Useful data point if same issue is gadi specific.

Update on BGC

AH: Andy Hogg has asked for an update. People at Melbourne would like to use it. RF: On my desk with Hakase. Been promising. Will prioritise. Almost there for a while. Been distracted with gadi. On to-do list.
MC: Do we know who in Melbourne wants to use it? AH: A student, not sure who.

New projects to support COSIMA and ACCESS-OM2 on gadi

AH: /g/data/ik11 is where inputs that were on /short/public will now live. Not sure exactly how this will be organised. Will most likely have input and output directories. Might be some pre-published COSIMA datasets there. Part of a publishing pipeline. AK: Moving data from scratch to this as a holding area? AH: People were using datasets from hh5 that had no status, not sure how to reference them.
AK: Control directories are separate, and not well connected to the data on hh5. Nice to have ways to link things more firmly. AH: To-do for payu is have experiment tracking IDs. Generate UUIDs as unique identifiers for experiments. Will go in metadata file. Not linked to git hash. If they don’t exist, make new ones. AK: Have data on hh5 and the control directories have been moved or deleted. Lose the git history of the runs that were used to generate the output. AH: Nothing to stop that all being in the same directory. Nic has advocated this for some time. Could change the way we do things. AK: Not sure on solution, but flagging as an issue.
AH: Published dataset from the COSIMA paper is almost ready. New location for COSIMA published data will be cj50. To do this publishing have created a python/xarray tool to create the published dataset from raw model data. Splits data into separate files for each variable, a year per file in most cases. Needs a specific naming convention for THREDDS publishing. Using xarray it doesn’t matter what the temporal range of each model output file is. Uses pandas style resampling to generate outputs. In theory simple, in practice there are many many exceptions and specific tweaks to be standards compliant. The same tool can handle MOM and CICE outputs, which are different models with radically different file metadata and layout. If you have something that you might find it useful for, it is called splitvar. Also made a tool called addmeta for adding metadata. Do the metadata modification as a separate step as it is always fiddly. Uses yaml formatted files to define metadata. The metadata for the COSIMA data publishing is available.
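A stripped-down sketch in the spirit of splitvar (the file patterns are illustrative and this is not the actual tool, which handles many exceptions, CICE output and THREDDS naming conventions): split the raw output into one file per variable per year, regardless of how the raw files were chunked in time.

```python
import xarray as xr

# Illustrative input pattern: raw MOM output spread over many submit directories.
ds = xr.open_mfdataset("output*/ocean/ocean.nc", combine="by_coords")

for name, var in ds.data_vars.items():
    if "time" not in var.dims:
        continue  # skip static fields in this simple sketch
    # One file per variable per calendar year (splitvar uses pandas-style
    # resampling, so other output frequencies are also possible).
    for year, annual in var.groupby("time.year"):
        annual.to_netcdf(f"{name}_{year}.nc")
```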
PL: Published data is netCDF format with all the correct metadata? AH: MOM doesn’t put much metadata in the files. One way to make a better connection between runs and outputs is to insert the experiment tracking id mentioned above into the files. Would be nice to put that into a namelist so that MOM could put it in the file. Best option, and if anyone knows how, would like to know. Another option is a post-processing step, on all the tiled outputs. MOM isn’t the only model we run. Not all output netCDF. Would be nice if there was a consistent way for payu to do this. COSIMA published data should be up before the end of the year.
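A sketch of the post-processing option mentioned (the attribute name, file pattern and where the id comes from are all assumptions): append the experiment tracking id as a global attribute to each existing output file.

```python
import glob
import uuid
import netCDF4

# In practice the id would come from payu's metadata file; generated here
# purely for illustration.
experiment_id = str(uuid.uuid4())

for path in glob.glob("archive/output*/ocean/*.nc"):  # illustrative pattern
    with netCDF4.Dataset(path, "a") as nc:  # append mode: edit attributes in place
        nc.setncattr("experiment_uuid", experiment_id)
```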
PL: Will ik11 replace hh5 and v45? AH: hh5 is storage space that is part of an ARC LIEF grant from the Australian climate community. The COE CMS team was tasked with managing this, and people could ask for temporary storage allocations. In practice it is harder to get people to remove their data. COSIMA was one of the first to ask for an allocation, but it has somewhat outgrown the original intent of hh5, as it has been there for a long time and grown quite large. hh5 might still be used for some model outputs. Not sure. ik11 started because we needed somewhere to put common model inputs/exes because /short/public went away and /scratch/public is ephemeral. /scratch space is difficult to utilise because of the ephemeral nature. NH: Have some experience with /scratch space at Pawsey. Once you lose data you make sure you have a better system to make sure your data is backed up. Possibly a good thing. AH: Doesn’t suit the workflow people currently use, where they come back and run some more of a model after a break. Suits workflows that create large amounts of data and then do a massive reduction and only save the reduced dataset. Maybe suits the ensemble guys. Our models, everything we create we want to keep. NH: Doesn’t all the model output go to scratch? AH: Yes, but model output doesn’t get reduced, so end up having to mirror the data.

Technical Working Group Meeting, September 2019

Minutes

Date: 11th September, 2019
Attendees:

  • Aidan Heerdegen (AH) CLEX ANU,  Andrew Kiss (AK)  COSIMA ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Rui Yang (RY) NCI
  • Nic Hannah (NH), Double Precision

libaccessom2

AK: JRA55 v1.4 splits runoff into liquid and solid. Most elegant way to support? Have a flag in accessom2 namelist to enable combining these runoffs. NH: Is it a problem in terms of physics? Have to melt it? AK: Had previously ignored this anyway, so ok to continue. NH: Backward compatibility!
AK: Some interest in multiplicative scaling and additive perturbations to allow for model perturbation runs. NH: Look at existing code. Might not be too hard. AK: Test framework for libaccessom2? NH: When I did the scaling it took longer to write the test than to make the code change. All there, could use as an example. Worth running the tests, don’t want to get it wrong. AK: Not familiar with pytest. NH: In this case just copy the scaling test, modify it, and get pytest to run just that test. Once you’ve got just that test running and passing you’re done.
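For reference, pytest can run a single test by name with `pytest -k <pattern>`; a purely illustrative test in the spirit of the scaling test NH mentions (the function and behaviour below are assumptions, not the actual libaccessom2 test suite):

```python
# Illustrative pytest example; run just this test with:
#   pytest -k perturbation
import numpy as np

def apply_perturbation(field, scale=1.0, offset=0.0):
    """Multiplicative scaling plus additive perturbation of a forcing field."""
    return scale * field + offset

def test_perturbation():
    field = np.ones((4, 4))
    np.testing.assert_allclose(apply_perturbation(field, 2.0, 0.5), 2.5)
```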
AK: New JRA55 now in Input4MIPS. Used JRA v1.3 from that directory and didn’t reproduce. AH: Correct. Didn’t work out why it wasn’t reproducing. AK: Ingesting the wrong files? Should be identical. AH: Never figured out what was wrong. Didn’t match checksums from historical runs. Next step was to regenerate those checksums to make sure the historical ones were correct. Could have been ok, but didn’t get that far.
AH: JRA55-do is now on the automatic download list, should be kept up to date by NCI. If it isn’t let us know.
NH: Liquid and frozen runoff backwards compat, but what about future? AK: Some desire to perturb solid and liquid separately, and/or distribute solid runoff. NH: Can we just put it somewhere and allow model to deal with it. AK: In terms of distributing it, not sure. Some people are waiting on this for CMIP6 OMIP run. Leave open for the future. NH: MOM5 doesn’t have icebergs? AK: No. Depoorter et al. has written a paper for meltwater distribution. Maybe use a map to distribute. RF: What they use for ACCESS-CM2. Read in from a file.
AK: Naming convention for JRA55 v1.4 has year+1 fields. Put in a PR some time ago. AH: Problem with operator in token? NH: Should be fine as long as within quotes. AK: Just a string search shouldn’t make a difference.
AK: Can’t get libaccessom2 to compile and link to the correct netCDF library. Ben Menadue tried and it worked ok for him. Problem with the FindNetCDF plugin for CMake. Not properly supported on NCI. Edited the CMake file to remove this, could find netCDF, but used different versions for include than linking. Should move to a newer version of netCDF. v4.7.1 has just been released. Have requested this be installed on NCI. NH: Does supported include the CMake infrastructure around the library? If getting FindNetCDF working was NCI responsibility that would be great. Difficult getting system library stuff working properly with CMake. CMake isn’t well supported in HPC environments. AK: Ben suggested adding logic to check and not use on NCI. NH: Definitely upgrade, to 4.7 if they install it.
AH: Didn’t Ben Menadue login as AK and it ran ok? AK: No, he didn’t do that as far as I know. AH: Definitely check there is nothing in .bashrc. Also worth checking if there is a csh login file that is sourced by the csh build scripts.

OpenMPI testing

RY: OpenMPI 2, 3, 4 and Intel 2019. Consistent results for all OpenMPI versions. 1, 0.25 and 0.1. Some differences with Intel 2017, not from the MPI library. Not sure if the difference is acceptable or not? Would like some help to check differences.
Just looking at access-om2.out differences. Maybe need to look at output file like ocean.nc? RF: Need to compile with strict floating point precision to get repro results. MOM is pretty good. Don’t know about CICE. Can’t use standard compilation options. fp-precise at a minimum.
RY: If this difference is not acceptable need to use flags to check difference between 2017 and 2019? RF: Once get a bit change, chaos and get divergence. RY: Intel 2017 still on new system. AH: So not only newest versions of modules on gadi? RY: 2017 will be there, but no system software built with it. AH: Done a lot of testing. Should be possible to just use 1 degree as a test to get 2017 and 2019 to agree. There are repro build targets in some of those build files. Could try and find them. RY: Yes please.
AK: Any difference in performance? RY: No big difference. NH: New machine? RY: No, old machine, with broadwell.
RY: NCI recently sent out gadi update and blog and webpage. 48 cores/node. NH: Did we think it was 64 cores/node? AH: Still 150K cores in gadi, with 30K of broadwell+skylake. Maybe have to change some decompositions. RY: Not the same as any existing processors.
AH: Two week overlap with gadi, then short will be read only on gadi. RF: There was panic in ACCESS due to an email that said short would disappear in mid October. AH: Easy to misread those dates.

accessom2 release strategy

AK: Harmonising accessom2 configurations. Somewhat haphazard release strategy, but not tested. Maybe master branch that is known good, and have a dev branch people can try if they want? Any thoughts?
NH: Good way is really time consuming and labor intensive. Would mean testing every new configuration. Not sure if we can do that. Tried to keep master of parent repo only references master of all the control experiments. Not sure if necessary or desirable? Maybe makes more sense to develop freely on own experiment and keep everything in control stable? Not sure. If all control experiments are stable and working, can be a bit slow to update. Just update your experiment.
AK: Some people are cloning directly from experiment repos, some cloning all of access-om2. Would reduce confusion if control directories under accessom2 are kept up to date with latest known good version. NH: Does make sense I guess. Shame for people to clone something that is broken which has already been fixed. There is some python code in utils directory which can update everything. Builds everything at all resolutions, copies to public space, updates all exes in config.yaml and does something with input directories. AK: I ended up writing up something like that myself.
AH: Should split out control dirs from access-om2 repo. Is a support burden to keep them synched. Not all users need entire repository, as using precompiled binaries. Tends to confuse people. NH: Did need a way for config to reference source code and vice versa. AH: Required to “publish” code? Maybe worth looking into. NH: Ideally from the experiment directories need to know what code you’re using. Probably got that covered. In config.yaml do reference the code and it’s in the executable as well. When run executable it prints out the hash from the source code. Enough to link them?
AH: I recall NH wanted to flip it around and have the source code be part of the experiment. NH: Probably too confusing for users. AH: True, but a useful idea to help refine the goal and the best way to achieve it.
AH: A dev branch is a good idea. Then it is clear this is the version that will replace the current master. Can then possibly entrain others into the testing. Users who want updates can test changes; you can make a PR and detail the testing that has been done.
NH: Good idea. Some documentation that says the experiments have stable and dev branches. When people are aware of that and have a problem, they can go to dev and see if it fixes it. AK: Bug fixes should go into master ASAP. Feature development is not so urgent. A bit grey, as sometimes people need a feature, but they can work off dev. AH: Now have some process for this: hot fixes go straight in; other branches are dev/feature branches. Maybe always accumulate changes into dev. Any organisation helps.
NH: Regarding removing the experiment repositories: the namelists depend on the source code. AK: Covered by the executables defined in config.yaml. NH: Yes, OK.

FAFMIP PR

RF: Did it work? It’s got a lot of merges. Just two lines. Did a merge and pushed it to my branches on GitHub. AH: I’ll merge it in. Just wanted to check. Can always make a new master branch that tracks origin, check that out and pull in code from other branches. RF: I have a lot of other branches. AH: Can get very confusing.

payu restart issue

AH: The issue has resurfaced. I commented on #193, but didn’t look into the source of the problem. Should look into it rather than talk about it here.

FMS subrepo

AH: Still haven’t done the testing on this. Been sick. Will try and get back to it.

Tenth update

AK: Andy has done 50 years with RYF 90/91. Running stably. AH: What timestep? RF: Think he was using 600s. AK: 3 months per submit. Should ask for a longer wall time limit. RF: Depends on how the queues will be on the new machine, what limits and what performance. AH: Talking about high temporal resolution output. AK: Putting out 3D daily prognostic fields. Want it for particle tracking, including vertical velocity. Slowed it down a little bit. RF: More slowdown through ice. AK: No daily outputs from CICE.

CICE PIO

NH: Still in progress. AK: Does it also require a newer version of netCDF? NH: Requires a specific version of netCDF. Needs a parallel build, and there is not a parallel build for every version. AK: There is a parallel build for 4.6.1. RF: There is a bug in the HDF5 library it is linked to, documented in PIO. Probably a bug we’re not going to trip: it occurs when doing a collective write and some of the processors take no part/write no data. Fixed in the next version of HDF5, 1.10.4? AH: So it is not the netCDF version so much as the HDF5 library it links to. RF: Yes. AH: So we should make sure we ask for a netCDF build linked against an HDF5 that doesn’t have this bug? AK: Will add that to the request.
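As an illustration of the kind of version check being requested (the models themselves would need the equivalent check on the C/Fortran netCDF they link against), a sketch using the Python netCDF4 bindings, which expose the linked library versions; whether 1.10.4 is the exact fix version is as uncertain here as it was in the discussion above:

```python
# Sketch only: report the netCDF and HDF5 library versions the Python netCDF4
# bindings are linked against, and warn if HDF5 predates 1.10.4 (the version
# the collective-write fix was tentatively attributed to above).
import netCDF4

def hdf5_is_recent_enough(min_version=(1, 10, 4)):
    hdf5 = tuple(int(p) for p in netCDF4.__hdf5libversion__.split(".")[:3])
    print(f"netCDF library: {netCDF4.__netcdf4libversion__}")
    print(f"HDF5 library  : {netCDF4.__hdf5libversion__}")
    return hdf5 >= min_version

if not hdf5_is_recent_enough():
    print("Warning: this HDF5 may contain the collective-write bug documented in PIO.")
```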
RY: If you want the parallel version, will you use OpenMPI 3 or 4? AH: Good question! RY: All dependencies will be available and very easy to use. AH: Is this using spack? RY: A layer above spack and other tools. Automatic builds with all possible combinations. AH: Are you using it for your builds? RY: We were asked to test it and are now using it. Currently it is difficult to create new versions; during the transition it is difficult, but in the new system it should be fixed quite easily. AH: That should fix the issue of the various versions of OpenMPI with different compilers. RY: Yes. AH: Will there be a compiler/OpenMPI toolchain? RY: It will automatically use the correct MPI and compiler. AH: Any documentation? RY: Some preliminary documentation, but not released. When gadi is up all this should be available.
AK: Should I ask for a specific version of MPI? RY: If you don’t specify, it will be built with 3 or 4. Do you have a preference? AK: No, just want the version with the performance and stability we need. Do we need to use the same MPI version across all components? RY: Not necessarily. Good time to try OpenMPI 3. No performance benefit as the system hardware is still the old hardware.