Technical Working Group Meeting, December 2018

Minutes

Date: 11th December 2018
Attendees:

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) and Andy M Hogg (AMH) CLEX, Andrew Kiss (AK) COSIMA, ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Nic Hannah (NH) Double Precision

COSIMA Models

Profiling

MW: Been profiling CICE, score-p profiling doesn't work. Been timing by time step. Anomalously long time spent at step 72. AH: Could it be the atmosphere being updated? JRA55 is 3 hourly, not sure of the timestep. MW: Seem to have lost my logs. Not sure best way to handle it.

CM2 Harmonisation update

AH: Peter has been testing release candidate. Russ supplied a diag_table which just outputs fields for first 2 time steps which is really good for seeing code issues. Russ found some bugs introduced by me. A couple of logic errors with preprocessor flags and omission of a couple of lines that got lost in translation. Confident latest update has squashed all the bugs. MW: Not old bugs? AH: Did find some old issues. Russ found a stuffed iceberg file. RF: Not related, but is something they were using for CMIP6. AH: Did find some old bugs, had to emulate the lack of reproducibility from the Red Sea salinity fix timing bug to be able to closely reproduce CM2 output. Put a flag in to do the wrong thing to do the same as theirs, will remove before merging. MW: I thought the Red Sea fix had been changed to be faster but not reproducible. RF: That's right, but that's not the issue. This has to do with timing. Aidan fixed it, but not compatible with what they are using. AH: Just need something that reproduces CM2 output.

Narrator: The new way of doing salt fix will reproduce over time steps, but is not bit reproducible with the old algorithm. Don’t see that effect in these tests.

AH: Peter has a test suite which is old CM2, and a copy which uses updated MOM. He compiles the new code manually and runs the two suites side by side. Both use Russ' diag_table. Just find out which fields don't match. Most are the same, few different, seem to be affected by the same issue. Once we're good for a few time steps then maybe look at them after a few months. RF: Once chaos starts, hard to say. As long as nothing gross happening. Unless there is something further on with coupling. AH: Yes, look after a month and check it looks close. MW: Not trying to be bit reproducible? AH: Just want to fix my bugs. RF: Make sure you're getting the same forcing fields. Can see out in the open ocean hardly any change. Just noise. This means we're close. Saw the outline of where the forcing field is supposed to be. The bug in the forcing field data showed up, which indicated the issue. AH: Once we've confirmed fixed, will merge PR and then move on to ESM.

MW: Will the CM2 code remain in step with the MOM5 code? RF: CSIRO Aspendale not doing much code development at the moment. AH: Peter is pulling directly from his GitHub repo, but once it is harmonised they will pull directly from the MOM5 repo. They will want to have a tag and pull from the tag. RF: Yes they will want frozen versions. AH: Should have some automated testing; if we find a bug, should be able to update the CM2 code and confirm it doesn't change important answers.

AH: Short answer: Lots of progress. I made lots of bugs and Russ found them. Thanks Russ. NH: Yes thanks Russ.

Model reproducibility and payu bug

NH: Working on documentation, wiki, tech report and model paper. Like to do more. Wiki doc easier as a brain dump. Made sure ACCESS-OM2 Jenkins tests are passing. Takes time, something always seems to go wrong. Six tests passing and useful. Repro test working and now reproducing across restarts. Wasn't working due to 1. payu bug, 2. Red Sea fix and 3. compiling with repro.
NH: Doing 2 runs with and without that payu bug on 1 and 0.25 degree. Doing 4 years as individual 1 year submits. Make sure bug not too serious. The way the coupling field restarts are done not good. Ocean has to write out a restart for cice (o2i.nc). Copy of restart file missing. Had in the past. Refactor with libaccessom2 and change of payu model driver didn't carry this over. Means the forcing fields the ice model gets at the first coupling step of a new submit are from the beginning of the whole run, not from the end of the previous submit. Ice model is getting the wrong forcing for the first 3 hours.
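A minimal sketch (Python, hypothetical paths, not payu's actual driver code) of the step that was lost in the refactor: carry o2i.nc from the previous run's restart directory into the new work directory so CICE warm-starts from the last coupling state.

    import os
    import shutil

    def stage_coupling_restart(prior_restart_dir, work_dir, fname="o2i.nc"):
        # Copy the ocean-to-ice coupling restart from the previous submit, if present,
        # so the ice model's first coupling step uses the last forcing state rather
        # than the cold-start field.
        src = os.path.join(prior_restart_dir, fname)
        if os.path.exists(src):
            shutil.copy2(src, os.path.join(work_dir, fname))
        else:
            print(f"{fname} not found in {prior_restart_dir}; using cold-start coupling field")

    # hypothetical layout: archive/restartNNN/ice -> work/ice
    stage_coupling_restart("archive/restart003/ice", "work/ice")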
MW: Has it been fixed? All runs affected? AK: Yes fixed now. Scope which runs affected. Only since YATM? NH: Yes. If your run uses YATM it will have this problem. Around the time the bug introduced. Restructured how config.yaml organised. Created libaccessom2 driver, and bug came in at that point. MW: Used to have oasis driver that did that. NH: Restart repro test existed but failing for other reasons, not being kept up to date. If that test was passing and then started failing, then would have been noticed. Doing a post mortem to see if there is anything significant on a 5 year run. Gut feeling, just in the ice. RF: Will just be the SST that it sees. If running a month at a time significant. Yearly not so important. Also depends what was in the initial coupling field. NH: Initial field correct, probably January. RF: Didn’t get updated for changes to landmasks? NH: Land has been eliminated so not necessary. NH: Any run which is a multiple of 1 year, problem is smaller. AH: Quarter and 1 degree aren’t that affected, tenth most affected. NH: Could do 1 month 1 degree runs. AH: Good idea. Don’t forget about runspersub option, could do 50 in a single submit. MW: payu restart flag now works as well. Could be useful for testing reproducibility. NH: This could be a problem in other cases as well. Existing restart is based on a specific time. May be correct for the specific model it was created for. RF: Should be matched to initial condition, with correct fields. MW: This is a cold start? NH: Needs to be created each time based on start time of your forcing. AH: Write code into model to read in IC and write back out to coupling fields? NH: Something like that might be good.
AK: Bunch of fields SST, SSS, SS velocity, SS slope, frazil ice formation energy. RF: SST and SSS only ones not zero in a cold start. AK: Replace by initial condition for entire experiment. NH: There is a single file in the ACCESS-OM2 input directory that all experiments use. NH: Could diff that against what it should have been. MC: That is cold start bug, not so important. Warm start bug fixed? NH: Yes, fixed in latest version 0.11.2. AK: People aren't using that? MW: No, because it was broken. Now fixed. AH: Arguably should delete payu versions with the known warm start bug. Or back port the fix? MW: Don't have framework to back port fix. AH: How many versions affected? NH: Put a warning message/assert in that stops and doesn't let it load. MW: Happy to delete old versions. Some people use a specific payu version. Easy to put warnings in module files. Can also delete old ones. Not a huge problem.
AH: figure out which payu versions affected. Make a decision based on that. MW: Only those with libaccessom2. AH: Don’t delete straight away. Turn off modules first. See if there are people affected. AK: Could be people not using access-om2. AH: yes, but can use new versions. Need to make sure people not using buggy code. AK: Possibly move to new space. AH: yes, but might not be necessary. MW: May be impossible to back port fixes. Driver might not be functional. No problem doing backports, not sure how.
AH/MW: Might not need to back port, should:
  1. Confirm payu/0.11.2 working correctly
  2. Set as default version
  3. Determine which payu versions affected
  4. Turn off affected modules in modulefile and issue message about bug, what module to load and to email climate_help if users still have issues
  5. When people complain, assess individual cases
  6. If necessary move payu module to non-app path
  7. Delete old versions?
2 week time frame.
MW: People shouldn’t be encouraged not to specify module versions.
MW: Make sure 0.11.2 working correctly. Works for NH and AH. AK a good test for it as running. AK: Not running at the moment. Can we use old mppnccombine with payu/0.11.2? AH: Yes. MW: Use whichever you want. AH: Works better for 1 deg in any case.
MW: Added a restart directory feature. Run 0 uses the restart and resets counters back to zero. AK: Had been copying stuff. MW: I've been symlinking and other hideous things. AK: Documents what you did better. AH: Used to have problems with drivers trying to delete symlinks when cleaning up restart directories.
AH: Will finish manifest this week. Chatted with Marshall and reimplementing it a bit differently. Will make NH’s job a lot easier. Run config has all the files, just need to clone and run. NH: awesome.
NH: Want any post-mortem or checking on tenth model for the payu bug? Could do some short 1 month runs. AK: Not sure what we would do with the information. Diagnosis without treatment. Interesting from an academic viewpoint. Planning to do a longer re-run with other changes and will be fixed in that. Interesting to see a couple of months and see scope of issue. Is it negligible? Maybe tell people. AH: Choose a worst case: Southern summer? NH: Ok, might do that.

OpenMPI

MW: Been using OpenMPI/3.0.3. Working well. Speeds same as 1.10. Uses ucx by default. Turn off all flags, except error aggregate if you want. Can try 3.1.3, had some issues. Likely the version on the next machine.
AH: Test on Jenkins with new OpenMPI? MW: Good idea
MW confirmed that using hyperthreading option in payu is harmless (might even be on by default).

COSIMA Models

Bathymetry

RF: Wanted to get rid of Ob river? 1150 looks good. Need an inlet to keep runoff in correct place. See GitHub issue. Plot shows 0.25 degree cell size is cut off.
AMH: Need to get rid of the Ob. Russ' plot at 1150m looks good, maybe smooth out corners. RF: Have to look at index space, straight edges, no inlet, things like that. Depth is minimum depth, 10m, a lot more shallow in actuality. AK: Only real reason to keep it is to have the runoff in the right place. Had to smooth to stop model crashing. Main reason to keep is to make sure runoff is mapped correctly. AMH: Where is runoff coming from? Take it too far up and might get remapped to the wrong embayment. Why I like the minimal change. It is stable. AK: Yes since Russ' fix that stops salinity drop below zero with ice formation. AH: If your map had water at depth zero, as opposed to land, then can follow the water along until it is > 0. Say this is water, use for remapping but not for model. AK: Need a separate file? AH: Not necessarily. Remapping using its own logic anyway. AK: Remapping takes no account for topography. NH: Could make the distance function smarter, use a directional weight, something like AH suggested, or take into account topography. AK: Go downslope.
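An illustrative sketch (Python/numpy, not the actual regridding code) of the kind of logic discussed above: runoff that lands on a dry cell is moved to the nearest wet cell; a smarter version could weight by direction or prefer downslope neighbours.

    import numpy as np
    from scipy.spatial import cKDTree

    def remap_runoff_to_wet(runoff, wet, x, y):
        # runoff, wet, x, y: 2D arrays on the model grid; wet is True for ocean cells.
        # Real code would use great-circle distances and could add a directional or
        # downslope weighting as suggested above.
        remapped = np.where(wet, runoff, 0.0)
        dry_src = (~wet) & (runoff > 0)
        if dry_src.any():
            tree = cKDTree(np.column_stack([x[wet], y[wet]]))
            _, idx = tree.query(np.column_stack([x[dry_src], y[dry_src]]))
            wet_flat = np.flatnonzero(wet.ravel())
            np.add.at(remapped.ravel(), wet_flat[idx], runoff[dry_src])
        return remapped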
RF: Other problem was Southampton Island. Just taking out inlet was sufficient. AMH: Keep Island separated from mainland? RF: Yes. Hasn't been causing problems? AK: No. AMH: Will leave cells smaller than 1150m. AK: Yes, but not too bad. Also an abrupt change in spacing. RF: Yes tripolar grid has discontinuity. AH: Cut off at 1150m, what was it before? AK: 880m. All crashes I had with ice remap error were less than 1100m. Those can be eliminated with closing channels. AMH: Worried about Southampton. AK: Never had issues there. Will be getting new constraints. Had to put damping on Kara Strait, and had issues with seamount off tip of Severny. AMH: Ok, keep it at 1150m and see.
AK: In quarter degree Baffin Island is attached to Canadian mainland. Tenth has much more open water. A lot of it extremely shallow (less than 100m), so unlikely important for sea water transport, but likely important for ice transport. AMH: And therefore fresh water transport. AH: Who will do this? RF: Planning to do it today or tomorrow. AMH: Awesome, thanks.

Profiling

AMH: Getting different numbers between IAF and RYF due to AK needing more ice time steps in IAF case. He can't run with ndtd=2, so load imbalanced to cice. ndtd=2 with minimal. AK: Time difference is due to value of ndtd. Ruth still getting bad departure points with minimal. Reduces ocean time step for a single submit. I reduced ndtd instead. AMH: This has caused a load imbalance. Not the same as our optimisation that NH targeted. NH used ndtd=2 in optimisation. AK using 50% more time.
MW: What optimisation? AMH: When NH looking at load balancing. AK using 50% more time steps, and taking 50% more time.
NH: Now have a rebalanced tenth minimal with ndtd=3. With the bathymetry changes might not need it. AH: Hold off on that until AK can tell if we need it. AK: May still on occasion need to reduce time step every 5 or 10 years, preferable to ndtd=3. IAF variability means can’t guarantee it will work with every year.
MW: OASIS timing issue. Struggling to define main loop time. Looking at 1 deg, outputting time of every time step. Not literally useful due to overhead. AH: Give you scaling? MW: Not sure.
MW: Timing between 170-200ms per step. Step 32 gets a big number. 36s in one, 72s in the other. Is it just waiting? Doing IO? Maybe some sort of OASIS thing happening to bootstrap. Get infrequent huge time steps. Run again and don't get them. Going to remove the largest timestep. Anyone know what is causing this?
NH: What are you profiling? MW: Just the coupling step. Reporting the coupling code.
MW: Does it do a lot of IO on that first coupling step? NH: Yes it does on the first step. What about CICE diagnostics? Are they printing to ice_diag.d? Should be consistent. See if it goes away?
RF: CICE does IO through one PE, so does a global collective. MW: Could be IO and MPI collective issues. Not sure if this is legitimate timing or not?
NH: Not sure what the bigger picture is, but find targeting specific routines to look at load imbalance. NH: definitely look into CICE diagnostics.
MW: Timing so inconsistent. AH: Run a bunch and use the minimum. Turn off all diagnostics. AH: For the paper MOM scales well. Need to say something about CICE scaling. Doesn't need to be the final word. MOM gives some leeway and these are the best configurations …
NH: Happy to help. Can do more fine grained stuff. Do some counting. MW: like score-p but it dies with CICE.

Grid scale noise

RF: Chris Chapman problem with submeso scale stuff (see issue). There is a smoothing feature in submeso but says it doesn't reproduce. Think I found a bug. Does smoothing of mixed layer. Possible to put mixed layer into rock with smoothing, doesn't seem to be any check. Might get some others to look at it. If they agree we might be able to fix it and reduce the checkerboard. AK: This in MOM6? Also in MOM5? RF: There is a namelist parameter that says not to use it because it's not repro, but really it's because it's buggy. No reason it shouldn't reproduce.
MW: Is this filtering a numerical mode? AK: KPP purely numerical, so adjacent columns can decouple. RF: Will point out code and see if people agree. AK: Get fixed and could be good to put in for next tenth degree run.

Technical Working Group Meeting, November 2018

Minutes

Date: 13th November 2018
Attendees:

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) CLEX, and Andrew Kiss (AK) COSIMA, ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Nic Hannah (NH) Double Precision

Payu update

MW: payu is now python3 compatible. Can be run from a .local install. No longer uses modules. Tagged 0.11.1. AH: will also install into conda environment on raijin. NH: sounds great. Have used in conda python27 environment. Good to have python3.

MW: Want to get this to a position where I can leave it to others to support. Might get GFDL interested in it. Have time to wrap up a lot of things.

AK: How rigid is the 3 digit output in archives? MW: Only a print statement. Should work with higher numbers. AH: Won’t list nicely. MW: Had meant to add a format option.

MW: Try out the new payu, want to make it the new one.

TWG Organisation

AH: Peter Dobrohotoff sends his apologies, cannot make meeting.

AH: Do we need to decide on a new chairman? MW: Happy to resign straight away. MW: Considering going to GFDL in February to bootstrap remote working. AH: Nic did you want to take over? NH: No, convinced it’s not a good idea. AH: Sort something out next meeting. Last one of the year?

OM2/CM2 MOM5 Harmonisation

AH: Peter Dobrohotoff sends his apologies, cannot make meeting.

AH: Shame, as PD tested harmonised MOM5 but used incorrect namelist options. Losing some momentum as would like to have that checked off so we could start harmonised ESM.

MC: Wrong namelist options? RF: Didn’t have correct namelist options to include my new mixing scheme.

AH: They were also concerned about background mixing, in case that was having an effect. Could point them to the relevant PR/Issue on GitHub with all the plots and documentation showing it was working correctly. This is a valuable resource and a good way of working.

AH: Richard Matear contacted me and wanted to know status of ESM harmonisation and what he could do to help it progress.

ESM MOM5 Harmonisation

ESM is using a version of MOM5 updated to the beginning of the year with WOMBAT added by MC. It was decided to not continue with ESM harmonisation until CM2 bedded down, as it requires some of the code changes from the CM2 harmonisation.

MW: Is cylc suite used for testing using the MOM repo compilation? AH: I believe Peter is currently turning off the automatic compilation in the suite and using the repo compilation script on the command line to create the MOM executable. MW: I think improving/streamlining and harmonising the cylc suite is as important as harmonising the code. AH: I would have liked a test suite that incorporates this, but this is the way Peter has been comfortable working. Can’t progress ESM until have the all clear that CM2 is working correctly.

MW: Richard & Matt are using the ESM model? Not using GFDL stack? MC: GFDL in decadal project. This harmonisation will get WOMBAT in the MOM5 master branch, which has long been a goal. Decadal project not using UM. Research effort there is data assimilation and reanalysis that they’re running, rather than updating the model. Won’t hear much from them. When AH gets the WOMBAT code in, please contact us. Have experience using payu runs at 0.25 deg. Will fix you up with files and help when it comes to testing. AH: I will mostly be relying on others to run stuff. MC: Not running under payu currently. Out of the loop at the moment. Did run one of Kial’s runs. I need help running with payu. AH: Yes we can do that together. Holger is trying to get payu to run with ESM, making slow progress. MW: Who is supporting the ESM? Surprised Richard will contact AH. AH: Someone told him it was part of this code harmonisation. MC: Richard’s interest is WOMBAT in MOM5. AH: ESM will be the CLEX coupled model. MW: Who will be responsible for ESM? AH: Tilo is doing a CMIP6 submission with ESM. CLEX will be wanting to use all of Tilo’s runs to spin off their own experiments. At that point CLEX CMS will support the model with payu on NCI HPC systems. MC: Tilo is doing the work of the equivalent of the entire CM2 team to get ESM working. We support Tilo somewhat, and there are others who are no longer formally part of the team but contribute. Richard has interests in this space also. AH: Tilo can benefit from work we’re doing. Scott found a 10% improvement in UM speed. MC: Speed/efficiency not a priority right now. Focus on land model, forcing etc. Not #1 priority. AH: If they’re open to that input, they can still get the benefit even if it isn’t their focus.

MC: Have to go to another meeting.

AH: Do we need to move the meeting? MW: Happy to move to another day if it works for people. AH: Doodle poll? RF: Next year. MW: Ok, do something next year.

Minimal 0.1 MOM

AH: Ruth's minimal config is working really well. She would like to get it to run a little faster so she could get 2 months per submit, which would improve her throughput a lot. Does NH have any ideas to speed it up? Seems like MOM model has 10 minutes of spare time. Is it CICE bound? NH: That might be the initialisation time. AH: Don't MOM timings take into account initialisation? NH: Not until recently. Marshall fixed it. MW: Clocks weren't showing MPI initialisation time, just MOM initialisation time. AH: Could be 10 minutes? MW: I would be surprised. Would guess 5 mins, less than 10. Big model, spends a lot of time on field exchange.

AH: Not compiled with AVX2. Will that help? MW: Did AVX and AVX2 test. Could see the difference, wasn’t large enough to bother making non-compatible binaries. NH: Slack me the path and I can take a look. Haven’t spent a lot of time trying to optimise it. MW: Can sometimes improve time by changing layout. GFDL tried very long tiles that means halo updates are only north/south. NH: She might have an older config. I switched to Sandybridge for efficiency. MW: I think you can get 7-8% speed up going from AVX to AVX2. AH: I advised Ruth to use broadwell due to memory requirements making for better throughput. AH: I suggested she have higher diag_steps, but RF pointed out global scalars mean she is doing daily global MPI calls. RF said doesn’t necessarily have to be this way? Could it be changed in FMS? RF: Yeah, all the diagnostic code. In many cases time average can be commuted with area average. Every timestep doing a MPP sum or MPP global sum, can do local sums on local process and call MPP sum when you need to. Could rewrite the MOM code to do cumulative average and do an instantaneous output. Sort of fudge an average. MW: Wouldn’t make much difference to speed. RF: Depends how much those global sums are hurting you if any, but it might be a single global sum acts as a synchronisation point and doesn’t really matter. MW: I told NCI MOM didn’t do collectives because they were so fast they didn’t show up in profile. So unlikely to help a lot. RF: MOM collectives are very simple. If not doing bitwise stuff, just taking a collective of one number. MW: Caveat, only tested at 0.25 deg, so can’t know for sure it is the same at 0.1 deg. Should do it because it will eventually start to bite. Could do a profile?
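A hedged illustration (Python/mpi4py, not MOM/FMS code) of RF's point: instead of a global reduction every timestep, accumulate local partial sums and only call the collective when the diagnostic is actually written.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def per_step_mean(local_sum, local_count):
        # one pair of collectives every timestep: a synchronisation point each step
        return comm.allreduce(local_sum, op=MPI.SUM) / comm.allreduce(local_count, op=MPI.SUM)

    class DeferredMean:
        # accumulate locally; a single pair of collectives at diag_step intervals
        def __init__(self):
            self.local_sum, self.local_count = 0.0, 0

        def add(self, local_sum, local_count):
            self.local_sum += local_sum
            self.local_count += local_count

        def reduce(self):
            total = comm.allreduce(self.local_sum, op=MPI.SUM)
            count = comm.allreduce(self.local_count, op=MPI.SUM)
            self.local_sum, self.local_count = 0.0, 0
            return total / count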

AH: AK only does 2 months/submit, so maybe we would all be better running a minimal config? MW: doesn’t bode well for exascale. AH: So many constraints with PBS etc. AK: Optimised for model+machine+queue constraints not for the model on its own. NH: Could just bump all the cpus by 10%? AH: You’ve got a nice sweet spot there, first try AVX2 and see if we can get the speed up we need.

MW: Vectorisation can help. AH: Would bigger tiles help? MW: So much time moving in and out of L1 cache that it doesn't make much difference. AH: Broadwell got bigger caches? MW: Bigger L3 but 12 more cores. NH: Is she using 600s timestep without crashing? AH: Had an ice remap crash after a couple of years. AK: Unclear if that will happen again. Doing RYF, and I found once the crashes started happening they kept happening. RF: Using latest bathymetry? AK: Yes. RF: Any difference? AK: No idea. NH: I think it is generally more stable.

MW: Can play with barotropic halo. Barotropic solver has halo of 10 so it can do its work every 10 steps. Might be able to get some speed up by playing with that. AH: Put path to Ruth's control directory on TWG slack channel.

NCRIS ACCESS meeting

NCRIS is doing a scoping study to see if it is feasible for a team of 15-20 people to support ACCESS modelling in Australia, which would be used for submission for funding from NCRIS. The meeting was to get feedback to help write the submission.

Some discussion of the experience of the meeting.

Calendar issues

RF: MOM uses Proleptic Gregorian calendar type, but does not use the correct calendar attribute when outputting the file. It sets it as Gregorian instead. So, when using days since 01/01/0001 there is a jump in October 1582 depending on which calendar is used. Get a 2 day offset for IAF files because of this incorrect calendar attribute. Found python netCDF interface uses udunits calendars and has problems. Had to force it to proleptic gregorian to read dates correctly. Big issue when dealing with daily data. Output files need to be fixed. Could change the calendar attribute to Proleptic Gregorian or change units to be days since a year after 1582. MW: GFDL use since 1900? RF: Yes, as this is what Ferret uses.

AH: Had a lot of date issues using python. The numpy date library has a limited date range available due to nanosecond resolution. We often have to do date offsets anyway, so probably don't see this issue as much. Should we put proleptic gregorian into MOM? MW: Shouldn't we change the start date? RF: That is the easiest thing. There is a lot of broken software that doesn't treat these calendars correctly. MW: Should tell GFDL about this. RF: Looked at the code and made some changes, but not uploaded. AK: Is MOM using the correct dates? As with coupling to CICE etc? RF: Works ok internally. AH: Arguably a bug if they're using proleptic and not using correct attribute. RF: Yes. CMIP6 accounts for this. Checks for dates before 1582 and requires using proleptic gregorian. Future runs should have an offset of some later date.
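A small demonstration (Python/cftime; the day count is arbitrary) of the two-day offset RF describes: the same "days since 0001-01-01" value decodes to different dates under the mixed Gregorian ("standard") calendar and the proleptic Gregorian calendar, because of the October 1582 switch and the differing pre-1582 leap rules.

    import cftime

    units = "days since 0001-01-01 00:00:00"
    ndays = 700000.0  # an arbitrary day count well after 1582

    print(cftime.num2date(ndays, units, calendar="standard"))             # mixed Julian/Gregorian
    print(cftime.num2date(ndays, units, calendar="proleptic_gregorian"))  # what MOM counts internally
    # The two dates differ by two days; relabelling the calendar attribute, or using
    # a reference date after 1582, removes the ambiguity.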

RF: Getting huge number of messages from restoring files starting at year 0000. Restoring files on a time modulo axis and created from Ferret, which automatically treats any file with a start year of zero or one as modulo. However year zero does not exist and is incorrect. Just need to change that attribute in the restoring files, won't make any difference to operation but save a huge number of warning messages. MW: I get 482,000 lines of errors. I would be very happy if this is fixed. MW: Someone should change those fields. RF: I don't have access. MW: Should go and edit the public forcing fields. What specific files? AH: If you're talking about the ACCESS-OM2 configs, NH has the most ability to change them. MW: salt_sfc_restore? RF: Yes, and temperature, chlorophyll. Anything seasonal, a restoring. MW: Anything that says "months since 0000"? AH: Yes, change to 0001. RF: Anything that uses that date (zero years) can be changed. AK: Anything that isn't JRA that isn't multiyear? RF: Maybe runoff? Do we use the JRA runoff? The problem really is the stuff MOM reads directly, like sponges. NH: I am happy to look into this. I might be the only one with access, hope not. Have been thinking about this for a while. Changed the OASIS code as well to ensure DEBUG_LEVEL zero does not output anything. Was outputting thousands of lines. Also an Andy Hogg GitHub issue and this was the next one on my list. MW: So far I can only find salt restore and ssw shortwave. RF: Shouldn't be using that. Should be using GFDL formulation which reads in chl.nc.
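A one-off sketch (Python/netCDF4; the file name is just an example) of the attribute fix RF describes: change the time units so the reference year is 0001 rather than the non-existent year 0000. The data are untouched, so operation is unchanged but the warning flood goes away.

    import netCDF4

    with netCDF4.Dataset("salt_sfc_restore.nc", "r+") as ds:
        time = ds.variables["time"]
        if "since 0000" in time.units:
            # e.g. "months since 0000-01-01" -> "months since 0001-01-01"
            time.units = time.units.replace("since 0000", "since 0001", 1)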

AK: Some files in those ACCESS-OM2 input tarballs that aren’t used. Should they be removed? NH: Posted on slack about this? AK: Yes, but not sure they aren’t used by someone. NH: Bit messy how this is done. Should really just have a bunch of files and grab what they need. Would save a lot space. Currently versioning sets of files rather than individual files.

mppnccombine-fast

AK: Some issues on GitHub. RF: Been discussing this with AH and Scott Wales. An attribute needs to be removed. AH: Biggest issue is regional outputs having incorrect dimensioning. Which has been fixed. Also fixed the unlimited dimension getting squashed. Also another issue with passing too many files on the command line due to an MPI issue. Requires a change to payu as globbing is now done internally so any glob needs to be quoted. It’s on my list of tasks.

MW: Original tool used a pattern? AH: Didn’t implement that in mppnccombine-fast, maybe we should? MW: Stopped doing that in payu to support some coupled FMS codes where tiles didn’t start at zero, but could go back to the old way. AH: Does using the pattern work with masked configs when tiles are missing? I can’t recall. MW: Not sure.

ACCESS-OM2 disk usage

AH: AK and I went through some of the 0.1 deg output directories and found we could get significant space savings in the ice diagnostics. AK: Ice outputs are not compressed, daily data is in individual files half of which is grid data. Can get an 8-fold decrease in size. Out of 20TB of total data can save 12TB of space. AH: Want to make a post processing script to run this automatically. AK: Yes, also delete all the zero length log files. AH: This was to clean up for archiving. MW: payu should do this, maybe not looking in the right places. NH: FORTRAN has an option when closing a file to delete if empty, so looking into that. Also some CICE logs just have one line at the top with exactly the same text. AH: Yes we found those, matched the same number of bytes and deleted them. MW: If payu isn't deleting zero length files it's probably not sweeping through submodels. AH: A lot tidier after cleaning. AK: Yes an hour well spent.
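A sketch of the kind of post-processing script discussed (paths, file matching and deflate level are illustrative): recompress uncompressed daily CICE output with the standard nccopy utility and remove zero-length log files.

    import os
    import subprocess

    def compress_netcdf(path, deflate_level=5):
        # rewrite the file with shuffle + deflate compression, then replace the original
        tmp = path + ".tmp"
        subprocess.run(["nccopy", "-s", "-d", str(deflate_level), path, tmp], check=True)
        os.replace(tmp, path)

    def clean_output_dir(outdir):
        for root, _, files in os.walk(outdir):
            for f in files:
                full = os.path.join(root, f)
                if os.path.getsize(full) == 0:
                    os.remove(full)            # zero-length log files
                elif f.endswith(".nc"):
                    compress_netcdf(full)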

Technical Working Group Meeting, October 2018

Minutes

Date: 16th October 2018
Attendees:

  • Marshall Ward (MW) (Chair), Rui Yang (RY), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD), CSIRO Aspendale

TWG Organisation

MW: Taken position at GFDL. Starting in 3-6 months. Need new TWG Chairman. Need to organise meetings. Not much communication with other working groups. AH: Anyone who is interested think about it, we can decide at a subsequent meeting.

MW: As I am leaving, no one left at NCI following ocean model development. NCI will appoint a new person, but RY is attending for some knowledge transfer.

OM2/CM2 MOM5 Harmonisation

AH: there is a cm2_release_candidate branch on MOM5 repository. Contains all substantive code changes from Hailin’s fork on Peter’s repo.

AH: Need a rose suite to support MOM5 compile script. Might get Scott Wales to help make the suite. MW: I might be able to help AH: Used original MOM_compile script? MW: Not sure. AH: Currently pulls in a build script from a totally different svn branch.

PD: Yes MOM5 in git repo. One of the directories (exp) has the same build script as you’re using AH. PD: I cloned your repository, copied over compile script and environment file, pressed go and it compiled. Problem at link time. Don’t have an opinion about build script being in repo. Rose suites do “blossom”. Ok to compile from command line at the moment. Can consult with AH offline.

MW: Are AH and RF happy with the code changes themselves? AH: Last set of changes aren't that crucial. Steve Griffies would have liked more atomic changes. Need to run, see if it is different, if it is, figure out how different and if it is important.

AH: Next harmonisation target is ACCESS-ESM-1.5, adding WOMBAT BGC. This will go into the main MOM5 repo. In theory will also be in the CM2 version of MOM5. It won’t be turned on, but we should check that it doesn’t make a difference to CM2 results.

AH: Seems straightforward, as MC had already put WOMBAT BGC into MOM5, but there have been some changes since then. MC: Pull 3 years of MOM5 changes into my own branch. RF: ocean_sbc is what hooks into WOMBAT. The components we’ve added in, like 10m winds and sea ice coverage is what WOMBAT wants. What we’ve got there now is compatible, except WOMBAT assumes 10m winds aren’t masked, and uses sea ice coverage to do masking. MC: Yep. RF: The way we do it, it is already masked. So might need a change to WOMBAT, or a flag. MC: does multiple masking matter? RF: if it’s multiplying by ice fraction, don’t want to multiply a second time. MC: Around the fringes? RF: No difference to open ocean or full ice coverage. RF: Pretty close to correct. Changed the interfaces. A lot of things in ocean_model can be kept in ocean_sbc. I can go with it with Aidan.
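A toy numbers-only illustration (Python) of RF's double-masking point: if the 10m winds arriving from CICE are already weighted by open-water fraction, weighting them again inside WOMBAT underestimates the forcing everywhere except at zero or full ice cover.

    u10 = 8.0          # unmasked 10m wind speed (m/s)
    aice = 0.5         # ice fraction
    open_frac = 1.0 - aice

    masked_once = u10 * open_frac            # 4.0 - what WOMBAT expects to compute itself
    masked_twice = masked_once * open_frac   # 2.0 - result if the input was already masked

    for a in (0.0, 1.0):
        assert u10 * (1 - a) * (1 - a) == u10 * (1 - a)   # no difference at 0% or 100% ice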

MW: Only time pressure is when adopted in CLEX? AH: No. Some people would like this to be in the ACCESS-ESM-1.5 CMIP runs. I don't know what the political situation is like. MC: Tilo is anxious to get control runs going ASAP. If there is a change to a stable version he will run with it. Catia and Fabio are anxious to get extra diagnostics in for their experiments, but not central to ESM effort. Tilo will start as soon as he has his carbon cycle stuff fixed. MW: Pressure point on RF? RF: I'll look at it. Just need to throw in a couple of the hooks into WOMBAT, but think they're there. Should be straightforward.

AH: made a PR, link on TWG slack channel. Cherry picked out commits that seemed necessary. If make code changes please pull down latest code before submitting changes. Can delete fork if necessary and start again. RF: Yes, done that a few times.

MW: Harmonisation on track? AH: Holger is working on payu version for ACCESS-ESM-1.5. MW: CLEX specific? PD: CLEX is picking up ESM as climate model. We are all working in the same direction. Lots of non-CMIP science coming out of these models. Shouldn’t dismiss payu as something we don’t care about.

COSIMA Models

NH: Running minimal 0.1 degree config. Around 2K cores. Maybe not actual minimum, but decent compromise. Good efficiency. With dt=600s, around 5KSU/month. Models well balanced. Ice model not slowing things down and only using 350 cores. MW: sectrobin? NH: yes but probably doesn’t matter.

NH: Thanks for heads up for NCAR tripolar efficiency fix for CICE. RF: Surprised it makes a difference at low core counts. NH: Not sure it does, just wanted everyone to know it is now in the code. NCAR say they have checked they get identical results, confirmed no difference. One month in 2.5 hours with dt=600s. Can’t squeeze in 2 months/run. AK: What diagnostics? NH: Just monthly. Same as AK’s, changed daily to monthly, just in ice. AK: Currently have 3D daily prognostic fields. NH: Might slow things down a bit. Because this config is small it is nicely balanced. Fitting so much work into each ICE PE, there is more chance they are balanced. Using 8 blocks per core. AK: ndtd=3? NH: no, try with ndtd=2 to begin with, and seems to be going ok.

NH: Currently crashing off tip of Severny Island. High velocities at tip. Crashing after 14 submits (months). Surprised it took so long to crash. Done some work smoothing bathymetry. Doesn’t seem to have helped, now trying Rayleigh damping. RF: What month? NH: October RF: Is there ice there? NH: Don’t think so RF: Had a look at other months. A jet of warm salty water coming up from the south along the coast. Those sea mounts are there. NH: Almost completely levelled them. Still a dip. Cleared seamounts before and in the dip. Velocities are very high there. Highest velocities that far north by a long way. Wondering if it is an extreme situation. AH: I tried the truncate_velocity option north of a certain latitude. Didn’t work, had a temp or salt blow up, so don’t bother. MW: usually a no-no. AH: Had the same issue with MOM-SIS-01 with CORE-II NYF, same crash, same time every year. RF: Interesting that same problem with a different bathymetry. AH: Severny Island pokes a long way north, any flow coming that direction gets funnelled along the coast. Could stop crashes with Rayleigh damping at depth in small area NE of sea mounts. Steve not happy as a solution, but one small spot places ocean timestep limit on the whole global model. I think we should use Rayleigh damping if it stops this. RF, NH: Agreed.

AK: Same crashes in same location when I've attempted 600s timestep, so wound it back. Put Rayleigh drag in Kara Strait. NH: Yes I have those. AK: Can give some idea of scale of drag required. Also that drag might be pushing more water around Severny Island. AH: You already have Rayleigh drag in your model? NH: Yes, all of AK's additions. Understand some of the frustration with this model. Small config, easier to run and test. Want to push timestep as far as possible. AK: Sounds like a good strategy. Though concerned by oscillations in vorticity field in shallow area south of Bering Strait. Some sort of numerical glitch. Goes away with 450s timestep. Seem to get stuff like this when timestep is pushed up. AH: Any idea where it is coming from? AK: Not sure which terms/equations involved. Dispersion gets worse as CFL gets higher. Not sure. NH: Explore some of these things, as MOM-SIS-01 was running at 600s right? AH: Yes with Rayleigh damping. AK: Fanghua was using MOM-SIS-01 with this bathymetry, couldn't go higher than 450s. Added damping and did a lot of work to track down issues. AH: Bathymetry has changed since then? AK: Yes, problem with ocean that shouldn't have been. NH: Didn't realise Fanghua used same bathymetry. AK: Similar. Would have had one full of potholes.

RF: Anyone used new bathymetry I made? Couple of cells filled also, but mostly partial cells. In bathymetry directory, added about a month ago. NH: Will try it.

NH: Want to get recent CICE changes into 6K PE model using one of AK’s restarts. Crashing with ice remap transport errors. MW: Include tripole changes? NH: yes. Also sectrobin code change (also doesn’t change answers). Experimenting with sectrobin and blocks to get a more efficient setup. MW: That is what I am running and trying to understand. If I do a git pull from yours will I expect crashes? NH: Crashes not due to code, just model instability. Tested that code doesn’t change answers. MW: Will try that.

AH: Which is the correct bathymetry file? Some discussion, turns out the new file is

/g/data3/hh5/tmp/cosima/bathymetry/topog_05_09_2018_1m_partial.nc

AK: To overcome ice crashes like that, use ndtd=3 to give ice more time. NH: You haven’t had ice remap crash since using this? AK: Correct. CFL issue, ice moving more than one grid cell per timestep. NH: Ice is going unrealistically fast, 35 m/s. MW: How does it do this? AH: Instability? NH: Yes. AK: Is sea surface slope high? RF: Diagnoses slope, derives slope assuming geostrophic properties. Not passing slope from ocean model. If you do, get checkerboard unless smoothed.

AH: Is ratio of PEs in minimal model same as for large model? NH: In 1/10 ratio is about 1:4 ice:ocean. Minimal model it is 1:5.

RF: Bugfixes found in CICE6 should be back ported. Were using the wrong mask in the EVP solver for updating the halos. Stops bit reproducibility. NH: I saw that bug list. Know where they are. Will bring them across. RF: Found different types in u and t masks (one logical, one 0/1).

MW: Latest profiling shows EVP taking most of the time, and in particular EVP halos. Wonder if these have any effect. RF: Purely a masking issue. Could be the cause of the strange stuff due to tripolar join. Only 5 lines of code. MW: Huge patch? NH: No. Not messy. This is not a big change of code.

AH: With CM2 with old versions of CICE5 with UM hooks etc. How serious an issue before back port to CM2 version? MW: Not time to go into that too far.

NH: Since CICE6 is just incremental improvement of CICE5, maybe we should use that in future?

Miscellaneous

MW: Ben arranging meeting with Team Leaders in this space. Set meeting on Nov 7. NH to be contacted? NH: I think I am going. MW: Discussing infrastructure needs for next 10 years. Would be good to have a consistent view on what is required. Meeting at a high level. MW: RY and I are going.

AH: Doing another payu training for CLEX, covering mppnccombine-fast, file tracking and ACCESS-OM2 configs, how to get them and what to do. Anyone at CSIRO interested?

MW: Will go over more profiling info on slack.

MW: Will merge latest payu versions. Can run without patching python version. AH: Yes can also run in a conda environment, which maybe ticks the portability box for NH

AH: people on payu/dev should move to payu/0.10.

PD: COSIMA meeting where harmonised code delivered. Amazing! Well done.

Actions

New:

  • Check ACCESS-ESM-1.5 PR / WOMBAT integration (RF, AH)
  • Backport CICE6 bugs into CICE5 (NH)
  • Forward training email to PD (AH)

Existing:

  • Create even 5 blocks per PE map for CICE (RF)
  • Update model name list and other configurations on OceansAus repo (AK)
  • Shared google doc on reproducibility strategy (AH)
  • Pull request for WOMBAT changes into MOM5 repo (AH, RF)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)
  • Update ACCESS-OM2 model configs (AK)

Technical Working Group Meeting, September 2018

Minutes

Date: 11th September 2018
Attendees:

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD), CSIRO Aspendale

Clean up Actions list

Finished:

  • Incorporate RF wave mixing update into MOM5 codebase + bug fix (AH)
  • Code harmonisation updates to ACCESS and ESM meetings (PD, RF)
  • Check red sea fix timing is absolute, not relative (AH)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Profile ACCESS-OM2-01 (MW)

Deleted:

  • Follow up with Andy Hogg regarding shared codebase (MW)
  • Nudging code test case (RF)

CICE in ACCESS-OM2

MW: 4 block success. 16 block didn’t work. sectrobin also didn’t work. Limited perspective on problem.

RF: blow out in time with extra blocks was halo updates. Weakness with round robin. A lot of overhead, no local comms. Maybe 8 tiles/processor might work. Marshall’s profiling showed small number of processors dominated run time. Want to minimise the maximum. That is the limiter

AH: Where are the max tiles?

RF: Seasonal ice near Hudson Bay, Sea of Okhotsk and Aleutian Islands.

MW: Nic used total CPU count less than number of blocks

RF: Could run with more, or less. MW: 80 CPUs less, could solve this.

AH: General strategy to concentrate on not assigning CPUs to the low work (blue areas) and let the high work areas take care of themselves?

RF: Only worried about slowest tile. Nice to have even distribution, but hard to achieve that in practice.

AH: Slowest tiles change over time RF: read in a map of expected ice concentration. Or have a heuristic, say weight by latitude. AH: If identify areas that do very little work, say never want to have many processors there, and free up processors for high work areas.

AK: There are five hot stripes and four cold stripes. Some processors have 5 blocks, some have 4. The outlying busiest ranks are on those hot stripes. If we get rid of striping with more even split, that would have maybe a spike on a lower baseline

RF: About half the processors have 5, about half have 4, request a few more PEs and that would come close to balancing this issue.

NH: First attempt 1600 PEs with an even 4 blocks across all. With idealised test case Ocean was not blocking at all. Though could save a couple of hundred PEs, and there was not a big difference. However Andrew’s real world config is behaving differently. Worth going back up to 1600 and doing an even 4 or even 8 blocks. Assumed wanted everything to be even. Seemed roughly the same to have a mix. This profiling shows I was wrong.

RF: Can easily work out to get exactly 5 blocks per PE. AK: If you give me that number I can try it. NH: 5 across the board is better. Don’t want a single PE doing more work. RF: Slowest one kills you.

AH: How does the land masking affect it? A thicker stripe in the northern hemisphere? RF: Yes. Did I post a picture of where tiles are allocated? NH: More blocks means getting rid of more land? RF: Lose with communication cost.

NH: In order to get this working I ran into the raijin problem: messages getting lost and deadlocks. When we got 0.1 deg MOM-SIS working had issues with point to point sends and recvs, and Marshall changed that to proper gather to get initialisation working. The gather inside CICE is implemented with point to point sends and recvs. Assume similar. It is doing a send for every block. MW: Andrew's finished ok? AK: Ran with 30×35. MW: mxm might resolve this problem? NH: Resolved by putting in a barrier after all the sends, otherwise deadlocks. MW: Did you add barriers? NH: Yes to the MPI gather code. MW: Clear that CICE is heavily barriered. NH: Could implement properly with MPI_gather. MW: Caveat didn't work with the global field. NH: Only does a global gather once when writing out restarts. Not too bad. MW: A lot of MPI ranks? NH: 1600 x number of blocks is the number of sends. MW: So number of messages, not number of ranks. MW: Only added barrier for restart? NH: Could have done that, but added in MPI_gather. Maybe that is bad? Actually didn't add, just enabled it by defining a preprocessor flag.
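An illustrative contrast (Python/mpi4py, not the CICE code) between the two approaches discussed: hand-rolled point-to-point sends to a root, one per block, which needed a barrier to avoid lost-message deadlocks at scale, versus a single collective gather.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    blocks = [f"block-{rank}-{i}" for i in range(4)]   # stand-in for local CICE blocks

    # Point-to-point version: one send per block, root posts matching recvs.
    if rank == 0:
        gathered = list(blocks)
        for src in range(1, size):
            for _ in range(4):
                gathered.append(comm.recv(source=src))
    else:
        for b in blocks:
            comm.send(b, dest=0)
    comm.Barrier()   # the kind of barrier added after the sends to avoid deadlocks

    # Collective version: a single gather does the same job.
    all_blocks = comm.gather(blocks, root=0)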

AH: Is there an effect where the grid gets wider in the north, so that you're sampling more ice in those areas?

AH: Should we pull out the slowest blocks and see where all the blocks are that contribute to the slowest processors.

RF: Correspond to areas of highest ice concentration. AH: There is ice in Okhotsk in northern summer? RF: Yes.

MW: Arctic and Antarctic are sharing work. RF: How many for this run? MW: 1385 RF: If you run with 1500 or so get an even distribution.

NH: Should decide what is the next step/run?

MW: Two options, massively increase number of blocks, but this is blowing out with comms time, or even divided 5 blocks. RF: Yes that is the one to do next.

AK: sectrobin should solve the communications issue but couldn’t get it to run. NH: Not sure if code needs to change? RF: Test on 1 degree model.

AK: First step to even up current run with 4 or 5 blocks. MW: Should confirm that many blocks is a comms problem and not a tripole issue for example. But this is a research problem.

AK: Will switch to this for 0.1 deg production as it is already better.

NH: New code 1 block per PE gives identical answers to old code. 4 blocks does not give identical answers to old code. Not sure if I should expect it to be the same. Don't know how CICE works. In terms of coupling it should be the same if you're coupling to individual blocks or multiple blocks. Not ruling out it should be identical and there is something going wrong. AK: What would make it non-identical? Order of summation? NH: Could be something like that. MW: Might be CICE doing a layer calc before doing vertical? Have to know more about CICE. NH: Might be worth looking into further so at least we know that we're not making bugs.

AK: How would I switch to this for the production run? Not bitwise identical? Just check fields look physically reasonable? NH: Hard problem. Can’t see physical difference. Only looking at last few bits of a floating point number. MW: Did an MPI sum on a single rank and it changed the last bit. Found it running the FMS diagnostics and that is why they failed. Don’t fail at GFDL. Scary stuff. NH: Scary and time consuming.
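A tiny example (Python) of why a different block decomposition, and hence a different order of summation, can change the last bits of a result: floating-point addition is not associative.

    a, b, c = 1.0e16, -1.0e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 - same numbers, different grouping, different last bits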

MW: Clear strategy. Get rid of bands. Go with 1600 cores. Have a 16 block job running, will keep everyone updated.

Code Harmonisation

AH: My understanding with the ESM harmonisation is that we’re close, as we haven’t yet put in the coupling changes from CM2 that you had to take out of the ESM code. PD: Dave Bi’s iceberg scheme? AH: If we get the WOMBAT code into MOM5 that would be harmonised I think. PD: Maybe Matt has a better handle?

MC: Are the OM and CM almost harmonised except for iceberg information? Are they almost the same? AH: I believe so. Once we get WOMBAT in there we’re good to go. Russ had a different idea about how to handle the case of different coupling fields.

RF: Have to get rid of ACCESS keyword. In many cases redundant. AH: ACCESS keyword can be replaced by ACCESS_CM or ACCESS_OM. RF: Yes!

RF: On CICE side of things (and probably MOM) coupling fields are currently defined as parameters. Can use calls to PRISM, test return code, put some tests for legal code/parameters for icebergs for example. Don’t need ifdef’s, can test on the fly. A lot easier than recompiling every time.

AH: How do we implement this? Put WOMBAT code in now so we have an ESM harmonised version and then deal with coupling etc as this is ACCESS-CM? RF: Want to bed down ACCESS-CM and OM harmonised first. The WOMBAT stuff will move in quite simply. I’d like to take that on, have been tasked to do this to take some of the load off Matt. Get this first step out of the way and then move on to WOMBAT and ESM. Until the first step done things can be in a state of flux.

MC: Is wind enhanced mixing in ACCESS-OM? RF: Yes. MC: FAMIP in ACCESS-OM? RF: They're in MOM5. MC: They weren't in ACCESS-CM code. AH: That is a 3 year old fork. MC: Can we update ESM from ACCESS-OM? AH: This morning putting WOMBAT changes into MOM5 pull request. Can grab and check if it works. MC: What is the difference in pulling from one direction to the other? AH: ESM is a 3 year old fork with little history in common with current MOM. Couldn't merge code into ESM, it would be too difficult. Cherry picked your changes into the MOM5 code, but wouldn't work the other way. Will liaise with Russ to get ACCESS-CM changes.

AH: Would WOMBAT always be part of MOM5-SIS? MW: Is it big? RF: No, very small. MW: Let's leave it in MOM5. Just executable bloat. RF: Just a few fields. MC: Allocated, so if not turned on, then no issues. RF: WOMBAT wants the 10m winds, but we need that for the wave mixing as well.

Travis CI on MOM5

AH: ACCESS-OM no longer compiles because you need libaccessom2 as well. NH: Same before. Always needed OASIS. AH: I’ve got CM compiling by pulling in OASIS and make it. All the compilation tests are passing. Could pull in the libaccessom2 and compile in a similar way to ACCESS-CM. There is no old ACCESS-OM build anymore. It is ACCESS-OM2. MW: Do we want to do this external to the repo? AH: Nice to have the tests there and passing. OM now has different driver code to CM, so can’t be sure you’ve done it properly without an ACCESS-OM compilation test. NH: There always needs to be a dependency on a coupler. libaccessom2 is more than a coupler. Maybe some of it is undesirable. Not worse than having a dependency on OASIS. AH: Just wanted to make sure there wasn’t an ACCESS-OM that was independent of libaccessom2. MW: Can you provide libaccessom2 as a binary and headers? AH: Yes, that is a possibility. NH: Could just be a .a file. MW: that is how you handle dependencies, as a binary, like libc. MW: Do you call OASIS in MOM? NH: Yes. In yatm don’t directly call OASIS. Could change coupler in future without changing models. MW: No problem with wrapping OASIS. AH: Can do the same thing I did with CM, pulled in OASIS, built it. Pretty straightforward.

Actions

New:

  • Create even 5 blocks per PE map for CICE (RF)
  • Get coupling changes into MOM for harmonisation (RF+AH)

Existing:

  • Update model name list and other configurations on OceansAus repo (AK)
  • Shared google doc on reproducibility strategy (AH)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)
  • Update ACCESS-OM2 model configs (AK)

Technical Working Group Meeting, August 2018

Minutes

Date: 14th August 2018
Attendees:

  • Marshall Ward (MW), NCI
  • Aidan Heerdegen (AH) (Chair) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Nic Hannah (NH) Double Precision

Peter Dobrohotoff (CSIRO Aspendale) gave his apologies that he couldn’t attend.

MPI Coding Update

MW: Fixed constellation of bugs. 1/10th still not working under MPI, looks like a new issue. How do I approach getting this code into repositories? Not invested, but want it working in the long term. Intel v18 has a bug, fixed, but found another. Dale has built an OpenMPI3 library built using Intel v17. Do people want to use it? Or afraid another issue?

AH: What advantages? MW: Hangs in MPI_Init, commsplit, and hangs at random time steps. MXM seems to have solved random hangs. First two still happening. Still getting random fails during initialisation. Betting on newer library to solve them. Do we want to invest in new libraries and hope we get new solutions, or happy with status quo?

AH: Is there another way? Maybe dev branch on MOM5, submit to more testing? MW: Yes. I can add all build stuff to all repos, independent on NCI configs. Optionally turn them on over time. AH: Just using different versions of OpenMPI? MW: Yes, have my FMS changes as well. AH: FMS changes in master branch? MW: Could do multiple ways. FMS not been updated to GFDL version. Could have subtree/submodule or just dump in. Will change everyone’s code, so might not be solution. I want people to start testing soon. AH: The init hangs: critical core count when these occur? MW: Yes. Tenth gets them, not at 1 degree. AH: Don’t have these at a quarter either? MW: don’t run 1/4, but frequency of errors increases with cores. AH: If we can test this by running the model for a single time step lots of times to quantify issue, see how bad it is, and see if we get improvement. MW: Have not seen commsplit hang with newer versions of OpenMPI. AH: Do you get hangs AK? AK: Get hangs at initialisation about 10% of time. Since going to YATM not had these issue. Also much more consistent with timing. Was 1.5-2.5 hours/month. Now 4h5m-4h10m for 3 month submit. MW: When we studied variability always IO issues. AK: Don’t know what is behind it. AH: Has coincided with transition to YATM? AK: Yes.

AH: Just got to a point where the model is running at all. AK: By late today will have 1 year of IAF. Need capacity to shift to new MPI versions. MW: Concerned about next machine. OpenMPI 1.0 will not be available. Not concerned short term. Concerned long term and stability issues. AK: Yet to fail

MW: Just updating. Want a plan. Will submit build changes to all the projects. Will also replace FMS with submodule but no change in code, but can be changed when required. AH: With submodule can easily test changes. Need a plan for implementation, testing, check timings.

NH: YATM makes sure it does all its work before ICE asks for anything. Does reads and regridding and waits until required. Would hide any jitter in this disk access. AK: Preemptive fetching of data. MW: Good case for IO servers.
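An illustrative prefetch pattern (Python; read_and_regrid is a hypothetical stand-in, not YATM code) for the behaviour NH describes: the next forcing interval is read and regridded in the background while the current one is in use, hiding disk jitter from the coupled models.

    from concurrent.futures import ThreadPoolExecutor

    def read_and_regrid(interval):
        # hypothetical: fetch the forcing for this interval and regrid it
        return f"forcing[{interval}]"

    executor = ThreadPoolExecutor(max_workers=1)
    pending = executor.submit(read_and_regrid, 0)
    for nxt in range(1, 12):
        current = pending.result()                        # interval nxt-1, already read
        pending = executor.submit(read_and_regrid, nxt)   # prefetch the next interval
        # ... hand `current` to the ice model via the coupler ...
    last = pending.result()
    executor.shutdown()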

Date handling bug in YATM

AH: We all good now? This fixed? AK: Yes. Just had to tell it to ignore the date in ice restart for the first one. AH: We have a method for this strange restart? Will we need to do this again? NH: Will happen any time you want to use someone else's restart and don't want to use their calendar. AH: Are there code changes we need to support this? AK: No. One off thing. Just need to change a value in CICE restart and then change it back again. AH: Kial got burnt with this once. MW: We could do this at payu level. NH: Little more to it. Also necessary to fix MOM and YATM dates. AK: 3 or 4 things I needed to do to make it go. AH: Think if we need to streamline this? NH: To begin with just document on wiki. AK: I can put it up there.

MOM5 wind langmuir mixing stuff

AH: Fixed?

RF: Should not be able to do Langmuir mixing in an ACCESS-OM run unless using u10 calculated from the empirical formula. Won't break current ACCESS-OM runs. Have to look into CICE5 and work out how to get winds into ACCESS-OM. ACCESS-CM was fine. For OM I thought there was an option to pass winds, but I misread a preprocessor flag.

RF: A couple of other issues. In the order the fields are passed between ice and ocean, the 10m winds are 22nd, and there is another one at 21 which isn't used. They can't be done as a common thing. Strange code. The changes I have done will make it safe for the time being. Have to explicitly say at compile time that you want to use the winds. AH: Under what circumstances can you use Langmuir in ACCESS-OM? RF: Can use the MOM6 style to calculate 10m winds. Need to turn on another namelist option. Currently don't pass winds in the OM models. AH: If you want to use Langmuir do you need to also set the compile flag? RF: At the moment you can't pass 10m winds from CICE to MOM. Defaults will be fine. The ACCESS wind preprocessor flag is for ACCESS-CM. The ACCESS wind flag is a placeholder at the moment. All allocations are made no matter what, which is a problem. Currently initialised to zeroes. AH: A placeholder for the future when we get winds through CICE? RF: Yes. Would like to figure out how they pass 10m winds in the fully coupled model, and whether they mask them with ice or not. Currently not clear. Would like to make them compatible. AH: I will put in those changes, add new changes and submit a PR. MW: Turn off Langmuir by default? It's broken? RF: No, can calculate winds MOM6 style. Not using this at the moment.
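
For illustration, a Python sketch of the kind of empirical 10m wind estimate RF refers to (the MOM6-style calculation from friction velocity). The neutral log-law with a Charnock roughness length used here is an assumed form for the sketch, not necessarily the exact MOM6/CVMix formulation, and the real code is Fortran.

```python
import math

VON_KARMAN = 0.4      # von Karman constant
CHARNOCK   = 0.011    # Charnock coefficient (illustrative value)
GRAV       = 9.80     # gravitational acceleration (m/s^2)
NU_AIR     = 1.5e-5   # kinematic viscosity of air (m^2/s)

def u10_from_ustar(ustar):
    """Estimate the neutral 10 m wind speed (m/s) from the friction velocity."""
    if ustar <= 0.0:
        return 0.0
    # Roughness length: Charnock term for rough flow plus a smooth-flow term.
    z0 = CHARNOCK * ustar**2 / GRAV + 0.11 * NU_AIR / ustar
    return (ustar / VON_KARMAN) * math.log(10.0 / z0)
```

The point of such a fallback is that a run forced with fluxes, where 10m winds are never passed from CICE, can still feed something sensible to the Langmuir scheme.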

MOM5 testing

AH: The above bug brought home the issue of testing on the MOM5 repo. Currently have 3 targets: MOM-SIS, ACCESS-OM, ACCESS-CM. This got through because I only tested MOM-SIS. NH: There is a Jenkins setup which runs every MOM6 test case. Can't remember if it has ACCESS-OM, ACCESS-CM builds/test cases. Spent a lot of time to set up testing but it takes maintenance work. Doesn't run periodically. Love the idea of testing MOM5, please look at what I have already done. Like the idea of production ready, but it takes effort to maintain the system, might not be justified by the number of tests we have to run. If we had weekly PRs it would make sense. If infrequent, need to revisit the testing every time. AH: Idea was to do some simple builds. MW: Build tests on Travis? AH: Yes. MW: Don't have to run, just build. Nic did a lot of work to do runs.

NH: Periodically: ACCESS-OM2 build test, and a fast run test (1 day experiment). There is a lot of stuff being done for MOM5, but no build test. MW: Is this the GFDL tests? AH: Nic runs the GFDL tests. NH: MOM5 runs not run as frequently. Not maintained, going red. Sure, if it's something simple, maybe not worth doing on Jenkins. But definitely take a look. MW: Travis for commits? Weekly Jenkins runs for commits. AH: Can see five MOM5 builds. NH: There is a mom-ocean.org folder on Jenkins. The MOM6 guys get a lot of value from it. AH: Will take a look. NH: If you can't change anything let me know.

ESM 1.5 Repo

MC: Didn’t work with https. AH: Made an ESM1.5 repo on OceansAus for Matt to upload the MOM5+WOMBAT code. Pretty much frozen. Peter wanted somewhere to put this. Should it be possible to use https? RF: Always had to use ssh. NH: Just need to put in a password. MW: https should work, it would help to know the error. AH: Give it another crack. Complain on slack. Get on slack.

AH: ESM1.5 repo on OceansAus won’t change much (now frozen), but we have a goal of getting the WOMBAT code into the main MOM5 repo. MC: Might do that in parallel. Who knows what will happen to ESM1.5. Depends on where ACCESS-CM investment goes. ACCESS-CM2 is quite expensive. In the process of putting WOMBAT into ACCESS-OM-1.0. Going through the steps. Put WOMBAT into it and submit a PR. AH: ESM1.5 is just MOM + WOMBAT? We’re doing the harmonisation, so MOM5 master will have all the important changes. Once we have WOMBAT we have an ESM1.5 equivalent. ESM1.5 will be the workhorse coupled model for the CoE because ACCESS-CM2 is too expensive. Whilst ESM1.5 on OceansAus will be the canonical version, the MOM5 repo will be effectively the same but can include updates to diagnostics etc.

MC: Checked out ACCESS-OM2-1.0, compiled, but falling over on running payu. The config file has changed a lot since I last used it. Want to run a 1 degree RYF model as a basis. MW: Is Matt using the version that isn’t working? Is that what Matt is dealing with? AH: Matt, get on slack and let us know your issues, and we’ll get you going. AK: Looking for a working setup? MC: Yes, 1 deg JRA RYF. AK: Can point you to a working config. MC: A month since I cloned. AK: Yeah, need to update.

AK: Asking for just the config? MC: Yes, but any information useful. AH: Kial has a lot of configs. I cloned one and changed exe paths and was up and running very quickly. MC: I cloned and built, but when I checked out the config it was pointing to common shared exes. AK: Should change that. NH: Maybe the documentation is out of date. Should follow the simple “if you’re a raijin user” instructions. MC: Yes, mostly worked. AH: Get on slack! MC: Browser is out of date. NH: If you do it again, follow the quickstart for raijin users instructions. If that doesn’t work we need to fix stuff. MW: A lot of user problems we don’t know about. We have to think about students who will be coming to run this. If Matt can’t figure it out there is no hope. AK: There is a lot that needs to be updated for the more complex instructions. Also the configurations in the control repos are not what they’re currently using. Could fix that easily.

JRA55-do versioning

AH: Andrew has issues with a ‘latest’ directory that has symlinks pointing to the most recent version. AK: Common use case is perturbation experiments. Go back to a previous restart and branch a new experiment, but need to know what forcing was used. Rather than latest, have a directory which is named for the date it was set up, or the date the forcing was updated. If and when things are changed, make another one. All softlinks. AH: One good thing about latest is you have a config that always works with the most recent version. If a config uses latest, someone starting a new run can be confident that it works. AK: No problem extending forcing, only an issue if old forcing files change. AH: They have versioning issues with CMIP5, have a database. NH: latest is not reproducible. An experiment I ran is not reproducible if latest has since changed. Problem with the old system: every version jumbled in one directory. At times different variables had different versions. Not all variables had the same version. AH: That is correct. NH: If there is a single directory that has all the variables for that version that is fine. AH: In some cases the variables don’t have the same version. I agree this is an issue, but it is best solved with manifests in payu. MW: Filename is not a good system. Filenames change and hashes don’t. AH: If someone has a naming scheme they want, then happy to implement it, but will keep latest, and solve provenance using manifests. NH: Was there a reason to put all versions in the same directory? AH: That is the way the JRA55 people publish it.
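
A minimal sketch of the manifest idea raised above (illustrative Python; the YAML layout is an assumption, not payu's actual manifest format). The experiment records a hash and the resolved path of every forcing file it actually read, so "latest" symlinks can be repointed later without losing provenance.

```python
import hashlib
import pathlib
import yaml

def file_hash(path, algo="md5", chunk=2**20):
    """Hash a file in chunks so large forcing files need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def write_manifest(forcing_files, out="forcing_manifest.yaml"):
    """Record the symlink target actually used and its hash for each file."""
    entries = {}
    for p in map(pathlib.Path, forcing_files):
        entries[str(p)] = {"resolved": str(p.resolve()), "md5": file_hash(p)}
    with open(out, "w") as f:
        yaml.safe_dump(entries, f, default_flow_style=False)
```

Hashes identify the data no matter how directories are renamed or symlinked, which is MW's point about filenames versus hashes.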

AK: Do we care if JRA forcing is extended? Does it affect reproducibility? NH: Not an issue. YATM has no end date for an experiment. You set a forcing start/end date, so no problem.

Misc

RF: Pavel Sakov is running a MOM-only KDS75 tenth model on the OFAM domain (-75 to +75). Running a 600s timestep from the start, hoping to get up to 900s. The problems in the global model are not between ±75. NH: Just the poles messing us up. RF: From a flat surface, huge heave. NH: All those little grid boxes. AK: Yes the tripole is the issue. AH: Redo bathymetry? RF: Did a naive regridding, some issues, potholes etc. Still works. Will be running a 100 member ensemble. AH: What is he trying to find out? RF: Look at some issues with OFAM/BRAN/OceanMaps. Interested to see how much is due to vertical resolution. Also a test for the future. An intermediate model between what we run at the moment, and what Andrew is running. MC: Interested in a figure from Kial at the COSIMA meeting, showing how variability changes with surface resolution. AH: How long will he run? RF: A year or two. Thought you might be interested.

Actions

New:

  • Incorporate RF wave mixing update into MOM5 codebase + bug fix (AH)
  • Code harmonisation updates to ACCESS and ESM meetings (PD, RF)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)
  • Update ACCESS-OM2 model configs (AK)

Existing:

  • Edit tenth bathymetry to remove Cumberland Sound (RF)
  • Update model name list and other configurations on OceansAus repo (AK)
  • Check red sea fix timing is absolute, not relative (AH)
  • Shared google doc on reproducibility strategy (AH)
  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)

Technical Working Group Meeting, July 2018

Minutes

Date: 10th July 2018
Attendees:

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Peter Dobrohotoff (PD), CSIRO Aspendale

Tenth Model Update

AK: Starting from yr 38, new codebase, hangs on MPI_finalize. Hard to trace. padb can’t help.
MW: Ranks not giving traces probably shut down cleanly. Depending on how a process exits it may leave cleanly. Russ added explicit backtrace calls to force a backtrace.
RF: Only if it goes through a particular routine. If it calls a FATAL within MOM. Doesn’t happen if something internal goes haywire.
AH: bug in the code according to Nic? AK: recompiled. Will run this morning. MW: may need to force back traces in some situations?

OM2/CM2 Code Harmonisation

AH: finished wave mixing?
RF: Tested in CM2.5. Happy with results. Can get from my GitHub. MW: happy doing PR? AH: I will take care of the code wrangling.
RF: But not completely happy with it on the MOM side. Using one instantaneous value rather than an average. Should be fine for ACCESS. AH: Your code? RF: Had to add in a field, bit tricky to know what it is doing. Has to get it from CICE, hard to figure out what is going on. Matt fixed it 5 years ago. Getting one value in the coupling stage. One timestep is fine; coupling over several time steps is the issue. Only when MOM-SIS. Running with ACCESS it is fine. MW: Will it produce the same numbers for a MOM-SIS run? RF: Won’t change anything. MW: Change FMS code? RF: No, coupling code. Added in an extra field. In the coupling code, in the flux exchange. The atmosphere supplies wind at its bottom level. Scaling law to calculate 10m winds, which is what ACCESS passes. In normal MOM code, doesn’t have access to that field. Added in something to get that field, pass it to ICE, which then passes it to the ocean. ACCESS code, through OASIS, passes these 10m winds directly. Difference to ACCESS-CM2 code: passed through ocean_sbc now. ACCESS took the ice/ocean boundary field and sent it directly to the KPP scheme. Shouldn’t do that, should be going through ocean_sbc. Now 10m winds are in the velocity derived type. Cleans up the interfaces. Another slight change with the fraction of ice passed in. Had some #ifdef in ocean_model. Now made aice a local variable in ocean_sbc. Doesn’t have to be passed around. Same interface with ACCESS or MOM-SIS.
RF: Now there is a distinction in the KPP code for the ACCESS version and MOM version, just some namelist variables. MW: 10m always passed? RF: There by default. MW: So memory usage is marginally higher? RF: A 2D field. MW: Otherwise no impact? Increased data transfer? MW: Previously MOM-SIS was converting to a stress? RF: Yeah, just have this extra field. MW: Good we now have 10m winds. Was awkward. RF: Also put in an empirical method to calculate 10m winds given the friction velocity, for the case of forcing with fluxes where 10m winds aren’t available and can’t be regenerated. This is copied from MOM6.
RF: Do you mind having elemental functions? AH: No! Love elemental. MW: good, aspirational.
RF: Aidan, look at what I’ve done. MW: This is the bottleneck in the code merge. Awesome!
PD: Russ, thanks for the work on harmonisation. Can we now test on CM2? RF: Yes, depends on Aidan’s updates. Just based on the current version of MOM. Anything else isn’t there. Once Aidan does his stuff it should be fine. AH: Yes I will do this. PD: Timeframe is yesterday. An estimate of the timeframe can be passed on to the ESM meeting and CMIP6 meeting. Other people make decisions. Hopeful to use harmonised code. AH: Want me to attend a meeting? PD: ESM meeting on Friday. Matt, helpful? MC: A report from PD would be enough. PD: If RF is there, that might be enough. PD: 11 am Friday.

Model Reproducibility

PD: Any work on restarts? Working on warm restarts in CM2? AH: More information? MW: 2×1 vs 1×2 day jobs. AH: Not that I know of. Need more information on the stability issue.
PD: The way a scientist sees this, the model is perturbed at every restart. How can you write a paper with this “feature”? Needs to be fixed. MW: MOM5/6 can do this. ACCESS-CM1 can. ACCESS-OM2 cannot. The UM can do this. PD: CMIP5 runs can do this. MW: Tested MOM-SIS and it didn’t work. Steve reckons the settings are correct for reproducibility. Tested a year ago, all differences were in the tripole. Didn’t push this. That was with the GFDL coupler, atm and ice model, not our model. Nic confirmed this issue with OM2. libaccessom2 growing pains have pushed this out. AK: Scientific credibility requires this to be solved. MW: Floating point arithmetic is a perturbation. MW: Get consistency with consistent restart times. PD: If FP errors are of the same magnitude as restart errors, maybe we can say they’re ok. Interesting perspective. AK: Should be able to save the state of the model, reload it and carry on. MW: The order something is handled at init is not the same as during a time step. AK: Fields calculated from saved variables might not match. MW: Need checksums at every step. Needs to be someone’s job. Need to communicate that to the people that control us. AH: Needs to be prioritised by Andy Hogg. AK: Are these differences large, or in the least significant bit? Any perturbation of a stable model will disappear, but maybe a new trajectory due to chaos. If they are of the same order as numerical round-off error, or as differences from compiler or optimisation choices, they may be of the order of perturbations being made all the time. PD: A lot of calculations from one time step to another. When you say how big is this change? Measured at the end of the time step after the restart. AK: The model state at the beginning of the restart must be the point where they are different. When do we measure the difference? When the run restarts and the fields are initialised; at that stage they should match the state when the restarts were written. MW: Hard to define model state: global vars, scratch fields etc. Need to define state, then compare checksums at the end of a run and the beginning of the next time step. After 30 time steps, get checksums, then proceed. Then compare to timesteps with a restart run.
PD: Each processor checksums its array, print that out. AK: Is a specific reproducible order needed for the sum? MW: For MOM or UM to do it safely, need a gather onto one rank. PD: Can we work out what a state might be? MW: Can do this in MOM. Need it for all models. Hard for coupled models. MOM has a framework for this. Could be as simple as OASIS getting out of sync. Depending on configuration it might not be restarting correctly. AH: Nic has tested OASIS field consistency. MW: Volatile time. PD: Lags might be set explicitly for the first time step? MW: Restarts are supposed to handle that. AH: Could we use compiler options to perturb FP operations to get the scale of the differences? MW: Fused multiply-add might not be reproducible. PD: Some clarity about what the model can and cannot do. MW: Push this up to the science leaders. Bob Hallberg did a cool thing in MOM6: converts FP to fixed point, does the global sum, and converts back to FP.
MW: Lack of testing and reproducibility means we can’t confidently change code quickly and easily. AK: Engineering problem. Useful for finding subtle bugs in the way code is written. Hard to know how big this effect might be. MW: The lab can do stuff. PD: Is this a showstopper? MW: Need a conversation at CSIRO wrt CMIP6. They have rules. PD: This isn’t a showstopper for science publishing? AK: Depends on the size of the perturbation. For testing we need to walk all used code branches.
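
A small sketch of the fixed-point idea MW mentions above (illustrative Python, not the MOM6 implementation, which uses multi-word integer accumulators in Fortran). Scaling each value to an integer makes the sum exact, so the result cannot depend on the summation order or on how the domain is split across ranks.

```python
import numpy as np

FRAC_BITS = 40  # assumed precision for the sketch

def to_fixed(values):
    """Convert floats to scaled integers; Python ints sum without round-off,
    so partial sums can be combined in any order."""
    scale = 2 ** FRAC_BITS
    return sum(int(round(float(v) * scale)) for v in values)

def fixed_to_float(total):
    return total / (2 ** FRAC_BITS)

data = np.random.rand(10000)

# A whole-array sum and a "two rank" decomposed sum agree bit for bit,
# unlike a naive float sum where the split and ordering change the last bits.
whole = fixed_to_float(to_fixed(data))
split = fixed_to_float(to_fixed(data[:5000]) + to_fixed(data[5000:]))
assert whole == split
assert whole == fixed_to_float(to_fixed(data[::-1]))
```

This is the property that makes a global diagnostic independent of the processor layout.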

FMS (MPI) updates

MW: Been rewriting the global field function. Done for a while, but concerned about performance: a fair bit slower than the original. Fixed stability. Probed it, and it was MPI alltoallw that was slow. Tested against other MPI libraries. In Intel MPI, alltoallw is a lot faster than point-to-point. OpenMPI is across the board slower than Intel MPI. Whatever I did, it was not a question of my code’s performance. Has anyone been testing Intel MPI? Maybe we have been making our lives hard by sticking with OpenMPI? What do people think? RF: Makes no difference to us. Up to MW. MW: Will keep testing. Intel MPI is not necessarily faster, but it might be smarter about choosing algorithms. Around 1000 ranks OpenMPI makes a bad choice. AH: How size sensitive? MW: Haven’t tested alltoallw at other sizes. Other collectives are faster on Intel MPI, 2x as fast in small tests. AH: A full test with MOM? MW: Years ago, volatile timing. This was Intel MPI 4, when it was sort of bad. Seems to have improved. Intel MPI is MPICH-based.

Actions

New:

  • Incorporate RF wave mixing update into MOM5 codebase (AH)
  • Code harmonisation updates to ACCESS and ESM meetings (PD, RF)

Existing:

  • Edit tenth bathymetry to remove Cumberland Sound (RF)
  • Update model name list and other configurations on OceansAus repo (AK)
  • Check red sea fix timing is absolute, not relative (AH)
  • Shared google doc on reproducibility strategy (AH)
  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)

Technical Working Group Meeting, June 2018

Minutes

Date: 12th June 2018
Attendees:

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Peter Dobrohotoff (PD), CSIRO Aspendale
  • Justin Freeman (JF) BoM Melbourne
  • Nic Hannah (NH) Double Precision

TWG Meeting

JF: Would be able to attend more regularly if there was a calendar invite which would enable him to schedule the meeting. How do we integrate calendars for Justin?

COSIMA Models

AK: Bathymetry error in tenth model in Cumberland Sound, Baffin Island. Causes model blow ups.
RF: Yes, blast it out. Russ will do it today. AH: Do we need any changes to restart/input files? RF: If below zero for eta_t, might have to set it to zero. Otherwise it will complain about penetrating rock.
AK: tenth very unstable over the weekend.
MW: longjmp error means the backtrace is failing. Memory got so severely corrupted that it can’t properly debug.
“nearest_index array must be monotonically increasing error”
AK: Sweep and resubmit and works.
AK: More errors since turned on diagnostics for Adele. RF: are these globals? MW: could be FMS bugs because MPI is being strained and things are out of order.
AK: daily outputs in regional area: temp, salt, uhrho_et, vhrho_nt, rho_dzt. RF: spewing output from a lot of processors as regional outputs do not use io_layout, so every affected processor outputting data. AK: only doing for 2 years and then turn it off. It has slowed it down. Become erratic in timing. RF: Some processors not outputting the field, not sure why it should make it unstable.
AK: Put up as an issue.
AH: What is the current model config for tenth, and performance? AK: 4.5K on MOM, 2K on cice. Runs with 450s timestep, 1.5 hr/mo. Now running at 400s. Crash in Baffin Island goes away with shorter timestep.
AH: Try and get tenth running faster. Ice no longer holding back the timestep. AK: Was running 540s before the Baffin Island issues.
MW: netCDF4 v4.4 has FPE turned on. Built by a different person. Historically always had FPE disabled. AK: 4.2.1.1 in MOM. 4.3.2 in CICE. 4.4.1 in matm. OASIS has default. AK: waiting for yatm build to be signed off. Ben M suggested we should be using openmpi/1.10.7 (optionally with debug). Number of bugs fixed between 1.10.2 and 1.10.7.
AK: Want to try out the orange layout with CICE. Currently 2000 cores with no landmasking and 1 block / processor. Could be run a lot cheaper. Currently MOM bound. Should be able to run well below 2000 cores. Waiting for YATM to be sorted out. Trying some frankenstein builds and back porting to MATM.
AK: Timing is very inconsistent. RF: The ocean eta and pbot diagnose step has a collective. It does a sum. Somewhere it has hung. All depend on this function. MW: Could be load imbalance in CICE.
MW: MPI_Comm_split hangs or fails intermittently.
AK: No stock runs since looking at runtime. MW: thinks his profiling was wrong because of lack of ice. AK: looking at the load imbalance there is ice. MW: ran from rest for 10 days. CICE would normally do work that wasn’t captured. MW: tried to redo profiles and all runs stopped working. Shocking.
MW: moved on to yatm. Putting scorep into yatm had issues, so not redone the profiles with realistic ice.
AK: will spin off run with no diagnostics as point of comparison. MW: at dt=300s, 100s/day seemed reproducible. Andrew’s 50% slower. Maybe more stuff happening. One of two issues that need to be resolved. CICE bound results different, second is MOM slowdown. Matching MOM-SIS important goal.
MC: how much longer running spinup? When switch to IAF? AK: will switch to IAF ASAP. Andy is running RYF @ quarter. Then Paul Spence will run IAF quarter. Currently 34 years of spin-up with 84/85 repeat year. MC: Will start from year zero? AK: there are biases in RYF, so not sure if we should spin off from this run. Might depend on how many years we have to get done.
MC: will there be multiple cycles of IAF? AK: depends. MC: start at WOA or from RYF spin up.
AH: For the model documentation paper there will be the standard 5 x IAF (JRA55) protocol for 1 degree and 0.25 degree. The MOM meeting discussed strategy for 0.1 degree. Andy Hogg thought the tenth was just too expensive to run this protocol and might have to run only one cycle of IAF, or maybe spin up with RYF and then run IAF from 85 onwards. Whatever was done would be repeated in a second quarter degree run to provide a point of comparison between the different resolutions.
RF: interested from 93 once the satellites go up.
JF: wanted to get up to speed. Looks over minutes when they come out, very useful. Mirko has been doing some runs. Will try and join in regularly. BoM will take up ACCESS-OM2 when up to speed. Will be OceanMaps version, used for forecasting.
AH: Andy running KDS50 for 0.25 deg for RYF spin up. Found KDS75 too unstable.
JF: Mirko is testing COSIMA models in back end. Mirko getting up to speed what we’ve done. Need the 75 level (COSIMA) grid. Will do some hindcast runs and compare with OceanMaps. Don’t have experience with sea ice model . Don’t know how it will affect forecasting. Need to look at the ice parameterisation. Also need to look at data assimilation. Will talk to Russ and Matt. At some point will be able to contribute back, will work from GitHub repo, using same codebase.
AK: run parameters and namelists on git repo are a long way out of date. JF: can we make sure these are updated. AK: Still in a state of flux. Still bedding down YATM configurations. Will do best.

ACCESS OM2/CM2 Code Harmonisation

AH: What is the other significant code difference in CM2 that Russ wanted to reimplement? RF: wave mixing scheme. Gets added into KPP. Comes via CVMix package. Two ways to implement. 1. 10m winds to come in via sbc. 2. Can empirically calculate them in MOM6. Russ has implemented this scheme under CM2.5 framework. Run for a while. Had to put in a limiter because it caused too much mixing. Dave reckoned it didn’t make difference. Haven’t looked at the most recent results. Running with CM2.5 coupled model.
RF: Also another scheme Russ wants to implement. Slightly different to ACCESS-CM2. Both schemes already in MOM6. One of them is in CVMix. That is what Dave has implemented in MOM5. Taken routine out of CVMix and plopped it into KPP module to give enhanced mixing. Also need 10m wind information to come in. Need changes in surface flux code. Russ has done this. Russ has implemented same thing, just change in the way winds get through. Not sure why ACCESS-CM2 didn’t see difference.
RF: Occasionally get massive mixing coefficients in KPP so put in a limiter.
RF: will put code changes into master branch. AH: when you have done this I can pull into CM2 and can test. RF: Griffies wants it in MOM5.
PD: Followed along in the slack channel. Not sure about all the technical details. Big difference after 10 days between harmonised code and the CM2 codebase. Has this been solved? How far along are we with this? Spinups will not have harmonised code if we don’t have a frozen version soon. ESM and CM2 groups want to know how close we are. We haven’t helped much to this point. How can I contribute?
PD: Copied the suite. Ran it. Thought I was tracking down a bug but couldn’t find the preprocessed source files. MW: Do we run cpp? I get the right source code lines and don’t see .f90 files. No we don’t … which is why Peter couldn’t find them.
RF: why was red sea fix timing different? CM code has a fix? AH: might be because my fix uses relative time, not absolute model time. RF: timing fix should have absolute origin. AH: I’ll check.
AH: I don’t think there is that much more to go for the harmonisation
PD: when can I run harmonised MOM?
RF: when I can find some time to put in there. Now we have a way forward. Hopefully in a week or two.
PD: will put runs on ASAP. If harmonised code not ready, won’t be in spinups.
AH: Will liaise with Peter and tell him as soon as something is ready.
MW: if there are differences what do  they use? AH: they will use the MOM5 repo as far as I know.

Actions

New:

  • Edit tenth bathymetry to remove Cumberland Sound (RF)
  • Create calendar invites to TWG Meeting (AH)
  • Update model name list and other configurations on OceansAus repo (AK)
  • Check red sea fix timing is absolute, not relative (AH)

Existing:

  • Shared google doc on reproducibility strategy (AH)
  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)

Technical Working Group Meeting, April 2018

Minutes

Date: 10th April 2018
Attendees:

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Peter Dobrohotoff (PD), CSIRO Aspendale

COSIMA Meeting

MW: Will we present something again? Same as last time, a list of achievements? Consensus was yes we should.
MW: Want to make time during the meeting to work together on collective problem? NH: Yes, definitely. Even for just a short time.

Common codebase

AH: Have rebased PDs MOM code on to latest MOM5 source. Wrapped all PDs changes that are incompatible in ifdef ACCESS_CM statements. This is available as branch cm2 on MOM5 GitHub repo. Created a pull request to allow code review (https://github.com/mom-ocean/MOM5/pull/214).

AH: PD has provided a rose suite (u-aw048) for testing. AH successfully copied (u-aw405) and ran this suite. Created another suite (u-aw445) to test that this reproduces first copy with no changes. It does. Created a third suite (u-aw497), changed git URL to point at cm2 branch, but initial compile failed to find the source. Eventually got it to recognise the updated source, now compile fails due to an absence of a main routine. Needs some modification.

PD: Best to make small incremental changes to a suite. In this case just change the fortran files and see if it works. Avoid changing how the compile is done, and definitely avoid changing the rose app conf. Just trying to determine the compile flags used in the UM fcm build from the Met Office was not a trivial task.

MW: cylc is good, small and configurable. rose is difficult and opaque.

AH: Will get some help from others via the cm2-om2-harmonisation slack channel

AH: Next step is to do the same for CICE5 as has been done for MOM5.

NH: CSIRO is getting CICE from the UKMO. There are code changes under src, not just in the drivers.

AH: Does UKMO version of CICE5 have a special licence? Will we be able to host UKMO modifications to CICE5 on OceansAus repo? NH: CICE has a CICE licence

 MW: UKMO runs CICE as a NEMO/CICE5 executable, not linked through OASIS like us.

Models

PD: ACCESS-CM2 doesn’t reproduce over restarts. Would like to run CICE stand alone. Does CICE reproduce over restarts?

NH: cleanest way to test is in the coupling code, before any model sends anything to another model, checksum fields. That is the point you can compare the output of a model. In the case of restarts, can include in checksum what current model run time is.

MW: there are 2 restarts, individual models and oasis restarts. Have to make sure trigger the same number of time steps in models.

NH: OASIS restarts are not a problem. OASIS get tells you if it read out of a restart. Not reading out magically, still working using PUT and GET. GET is from a file instead of from a model.
After each GET and before each PUT, print out time, processor, checksum can see when the checksums diverge. Will be different before the PUT, can then identify the model. If you determine CICE was the culprit can then look at CICE only run.
PD: How to do checksum? NH: Just sum whole array. MW: use MPP_checksum? NH: not in CICE. Our CICE code will output checksums if required. NH has already done this for CICE.
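
A conceptual sketch of the checksum logging described above (Python/numpy for illustration; the actual coupling code is Fortran and the routine name here is made up). Every rank prints a deterministic summary of each field just before it is PUT and just after it is GET, so two runs can be diffed line by line and the first diverging field identified.

```python
import numpy as np

def log_checksum(name, field, rank, model_time):
    """Print a reproducible checksum of a coupling field.

    Summing the flattened (C-order) array makes the value deterministic for a
    given decomposition, so identical runs produce identical log lines and a
    plain diff shows where two runs part ways.
    """
    data = np.ascontiguousarray(field, dtype=np.float64)
    print(f"[chksum] t={model_time} rank={rank} field={name} "
          f"sum={data.sum():.17e} min={data.min():.17e} max={data.max():.17e}")
```

Wrapping each coupling put/get call with something like this narrows a restart mismatch down to the first field and coupling time at which the runs differ, which is the procedure NH outlines.
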
MW: does OM2 reproduce? NH: I spent some time on this. Repro for a couple of coupling time steps, then diverges.
MW: PD has 1×2 days not same as 2×1. Have we done equivalent tests with OM2? NH: yes, not passing.
NH: MOM by itself might not do this anymore? MC: Restarts ok. RF: If you had the red sea bug it wouldn’t reproduce. Old repro results null and void. Models also have coupling code that might cause repro issues.
MW: UM passes, would think GC3 reproduces.
MW: interesting OM2 does not reproduce. Easier platform to test.  ACCESS-om2 needs to reproduce. NH: looking at this. With MATM changes, needs to make sure it works to get others to use it.
MW: Does someone want to check MOM? Restarts, processor layouts. AK: Don’t change layouts often so wouldn’t know if it currently reproduces with layout changes. NH: Non-repro with layout changes indicates a bug. MW: Maybe not a bug, but definitely volatile behaviour, maybe in a collective.
MW: Did find a repro problem with MPP_sum. Ran MPP_sum and MPP_reprosum, and got a difference in one bit. Even something that simple can cause issues. GFDL always matches with the same test. Maybe something we can control with compiler flags.
PD: As voltages go down random errors can occur. MW: Bob found a bug with the tridiagonal solver due to a voltage issue in an Intel chip. Maybe something going on with flags?
RF: GFDL definitely use precise option. Atmospheric model crashes otherwise. MW: MetOffice also uses precise.
MW: Maybe all could look more carefully?
AH: Do we have a reproducibility checklist? Some strategy. Shared google doc?
NH: Starting work on tenth degree performance. Anyone interested in doing some profiling? MW: Andy Hogg is pressuring to do this. Will do it this week, and send to NH. Hope to have a bunch of profiles for the meeting.
NH: MATM is now clean, 100 lines of code, uses CMake. Hoping to start using it. All goes through CICE. Nothing about coupling has changed.
NH: Want to use newest version of OASIS-mct (v3 not v2). Improvement in performance, can collect together MPI comms.

Actions

New:

  • Poll TWG on list of achievements for Meeting presentation (MW)
  • Shared google doc on reproducibility strategy (AH)

Existing:

  • Follow up with Andy Hogg regarding shared codebase (MW)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Profile ACCESS-OM2-01 (MW)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Nudging code test case (RF)
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)

Technical Working Group Meeting, March 2018

Minutes

Date: 13th March 2018
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen and Andrew Kiss (CLEX ANU)
  • Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff, Roger Bodman (CSIRO Aspendale)

Models

ACCESS-OM2-01

Andrew: Stability was bad. A couple of weeks ago Ben Menadue said a dodgy cable was fixed and then it was working fine. Lately maybe 1 of 8 model runs fail. Model stops before time stepping. Does init. No model runtime. All the hangs look the same. Saved most of the outputs. Never timesteps. Hangs Marshall was looking at were OASIS restart writes (OpenMPI 2.1). OpenMPI 1.10.2 fails differently and maybe more randomly. OpenMPI 2.0 was fixing a problem but adding a new problem.
Aidan: Russ’ stuff might be useful to kill the job and resubmit.
Aidan: Should use padb to find where processes hang. Marshall: Can take 5+ mins to produce a trace. Aidan: Need to give Marshall info on crashes.
Aidan: ACCESS-OM2-01 running on only 4.2K processes. Shouldn’t be as fast as MOM-SIS-01 which was running on 5232 (7200 config masked).
Andrew: can’t get model on to queue during business hours. Runs start in the evening.
Aidan: Queue stuff unavoidable. Just very busy with some large jobs queued that Andrew’s job can’t leapfrog.
Andrew: ACCESS-OM2-01 quite variable in runtime. Between 2 and 3 hours per month. Just one month per submit. Variability forces shorter runs. Currently running 360s, every month but august. Currently testing 400s.
Marshall: tried yalla pml? Any effect? Andrew: hard to tell. Marshall: 100% OpenMPI 2.0 hang went away with yalla. Andrew: of first 4 submits, 2 hanged. Seems ok now, but no big improvement.
Marshall: Rewrote global field to use alltoallw (actually using gather). Didn’t fix the hang. 1 and 0.25 deg worked and produced the correct response. Tried throttling, updating a fraction of the domain, and 0.1 ran. The library can’t handle the number of messages.
Russ: global field just for restart? Marshall: No. Yes used to produce OASIS restart. Also used when we use io_layout. Uses function to gather whole io subdomain onto individual master rank. Used inside FMS. Probably don’t fail at the moment. Thought that function was a relic. Still there and still used. Code changes did work, but failed on io servers when I have an io server of 1 (happens with masked runs).
If Marshall can get it working, can chunk alltoallw, free us from magic mpi/accelerator flags. This is a positive thing.
Marshall: Hanging on single tile in masked run is a bug. MOM has some logic checking for single tile and not run stuff, which might make other bits hang if they are expecting some communication.
Marshall: now using MPI types, avoids buffers.
Roadmap for OpenMPI updates:
  1. Resolve issue with io-layout with rank 1 will be working
  2. Clean up MPI code, get out of FMS code
  3. Try chunking/staging in alltoall. Might not need MXM at all
  4. Try to get into FMS and switch to independent FMS module (maybe GFDL)
  5. Address performance issue
  6. Run on OpenMPI 2.0
Marshall: Hope to solve random hangs. Run on supported libraries. Less dependent on magic MPI flags. More resilient for new machine. Maybe wait to submodule until GFDL accepts patch?
Marshall: If MPI issues happens again we have a better strategy. Can’t just replace point to points with collectives. Library has issues. Won’t scale to a new machine.

MOM-SIS-025-WOMBAT

Matt: Paul Spence is happy with WOMBAT 025. Paul isn’t hanging? Aidan: Does Matt see WOMBAT runtime variations? Matt: Not too frequently. Maybe check with Paul? When Matt did initial testing, got identical output and timings. So similar he wasn’t sure if he’d put the new code in!
Marshall: Paul was getting hang in xgrid init. New code required for that. Runtime is weird. Need more info before get worked up.
Marshall: diag_step stuff is really slow. Expensive function. Scales horribly.
Aidan: Matt, have you done a pull request to mom-ocean? Matt: need testing? Marshall: if it works for you, don’t worry. Matt: a number of hooks from tracer package into ocean and ice model. Similar to ACCESS, not quite the same. Few extra variables in boundary package.

Other business

Marshall: Unlimited time axis on sss restoring. Aidan: Yes was an issue, fixed.
Roger: Not that busy in this space. Still wondering about the change from 12×8 to 8×12. Marshall: No expectation for that to be reproducible. Marshall: Restart issues seem more severe. Has it been looked into? Peter: No. Tasked with fixing, not looked at this year.

Common codebase

Peter: Agenda item that Nic and I would bring OM2 and CICE5 to a common codebase. Nic: Doesn’t see this as being feasible anymore because the models have diverged so much.
Marshall: Does he mean he has put in coupling code that has diverged from yours? Peter: not sure. Hoping to talk to Nick today about this. Is the idea really dead? Just taking a lead from Nic as don’t know about the other codebases. Had a couple of meetings with Nic. Nic went and looked at the code, expected the differences to be trivial, but they weren’t.
Peter: Would like to work from a common codebase. Would like to capture the activity on GitHub. Some scientists would like them to be the same, can’t really make the case for that, but that is what they want. Not sure how to proceed. If we don’t share code now, we won’t ever. Do we just drop our MOM5 and grab the GitHub version? Seems like a lot of work at this point.
Matt: can you clarify the relationship between the code bases? Is it closely related to OM, CM etc? Peter: No, can’t give clarity.
Peter: In 2015 Nic and Hailin put together a version of MOM5 that was to be used with GA6. No idea what was specific about Hailin’s version. Not sure why they can’t be brought back together. Can we do some emails/issues to get this moving, because there is only a month.
Russ: I’d like to be brought in on this. Part of my work with decadal work is to couple wave watch 3.0 into ACCESS-CM2 and OM2. Worry that they are diverging so much.
Marshall: Andy would be disappointed about this news. Six years ago aspired to this goal. 3 years passed, nothing happened, ok, but to drop it now is unfortunate. Marshall: difficult for Aspendale, and volatile with runs about to begin. Next CMIP aspiring to do this? Disappointing to the science guys. Can resources be pumped into this?
Andrew: one of the objectives of COSIMA was to avoid duplication of effort. Marshall: doing better than duplication. Some redundancy.
Aidan: I think MOM should be relatively straightforward to get harmonised, CICE is the issue? Russ: Yeah, the problem is with CICE. A lot of the things that are done in individual components should be done in OASIS-mct. Averages and double looping that is really confusing. Using native OASIS calls to do averaging would be much simpler. Old OASIS had to bring it all onto one processor, which was a disaster. Decisions made were sensible then, but coming back to bite.
Russ: Way to run with OASIS is to call it every timestep. Let OASIS decide. Don’t need specialist code in individual components. It is distributed. Time averages can be done on the local processor.
Marshall: I think the problem Nic found was an inefficiency in ocean/ice communication. Maybe that makes the merge undoable. Weird log-jamming of messages. Nic has done this and done it in an ocean/ice context without too much consideration of the atmosphere.
Marshall: Nic and I were going to sit down and look at it.
Russ: would like to get my head around it.
Aidan: can we converge to a common codebase? Maybe CM needs to make these changes anyway?
Peter: Nic said we need to get CM2 and ESM code up to date with MOM5. From the CM perspective, need to be a bit risk averse. Also a risk with not changing. Already doing spin ups for CMIP6. A specific change I am aware of: pull request from Fabio. Ticket #211? Steve Griffies’ patch? Paul Spence convection code changes. Were important to his OM2 model. Haven’t been regression tested. Needed for a student.
Peter: Changes important enough to pull into CMIP6. Conflict with direct merge. Can do by hand. Marshall: hand merge if required at this stage.

Actions

New:

  • Follow up with Andy Hogg regarding shared codebase (Marshall)
  • Marshall liaise with Andrew Kiss about tenth model hangs (Andrew, Marshall)
  • Pull request for WOMBAT changes into MOM5 repo (Matt, Marshall)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (Russ)

Existing:

  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (Marshall)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (Russ)
  • Profile ACCESS-OM2-01 (Marshall)
  • Move FMS to submodule of MOM5 github repo (Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
  • Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).
  • Nudging code test case (Russ)
  • Redo SSS restoring with patch smoothing (Aidan)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (Andrew)

Technical Working Group Meeting, February 2018

Minutes

Date: 13th February 2018
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen (ARCCSS/ARCClEx ANU)
  • James Munroe (Memorial University of Newfoundland)
  • Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff (CSIRO Aspendale)

MPI Errors on raijin

  • Marshall still having MPI problems.
  • Russ is running ~1200 processor jobs, haven’t been running a lot. Haven’t had any big MPI problems.
  • Russ had getcwd() race condition in kernel. Using mom4p1 reading input namelist. Error was “can’t get current working directory”. Hit similar problem 8 years ago. Was a lustre problem then. Run crashed. Only happened once. Same file being read by every processor. Was maybe the issue. Newer MOM does one read and uses MPI to distribute.
  • Marshall found a performance issue in MOM on CRAY system. NFS rather than lustre. Struggled to read ocean_grid.nc. Maybe multiple reads of the same file. Took 40% of run time to open file and read it. Maybe need some testing on different systems.
  • Kial Stewart had some run time blowout issues. Wondered if it could be an MPI problem. Aidan asked if others had seen similar issues. Matt is not running production runs. Not sure if he has had any run time issues. Hasn’t got a clean baseline. Not seeing anything like 2x longer.
  • Marshall: lots of people have hanging jobs issues right now. NCI has had waves of issues over past 5 years where this happens more frequently and then abates. Rhodri at RSES having similar issues, may mean NCI will take more notice.
  • Matt will keep monitoring. Marshall: can switch over to a debug MPI library. Will give better info. Ben M says it will run slower, Marshall hasn’t noticed a slow down. If runtime similar maybe use it. Marshall is learning about MPI debug info, TBs of it. Intel 17 now includes C code in backtrace. Debug gives a bit more about MPI state when bails.
  • Matt using old compilations. In coupled decadal config switched from mom4p1 to mom5 last year, recompiled then.
  • Russ got the openmpi/1.6.3 library path issue, recompiled and was fine.
  • Peter working now. Memory pinning stuffed them up. Perhaps due to resource exhaustion. Had some problems with getting on the queue. Was using up the time too quickly.
  • Marshall: epic issues with MPI 2.x. Running on tenth. Getting random rank fails. Not always the same. When it fails it thrashes the stack and kills the backtrace process. Get a backtrace of the backtrace. Other nodes wipe their stack and stop gracefully. Severe stack overflow, might be in the library itself. Trying valgrind. Failing in collectives. Running access-om2. Test programs are fine. Seems to be a memory thing. Need to run a large model for a decent time. OpenMPI 3.x fails in the same way as OpenMPI 2.x in a memory offset function (longjmp_).

Models

  • Paul’s wombat runs were failing because didn’t have the xgrid alltoall. Matt put the wombat changes into MOM5? Matt: WOMBAT has some ugly hooks into ice/ocean boundary conditions. Putting WOMBAT into access coupled model. Update to MOM5 should be straightforward.
  • Russ has put current MOM5 code into CM2.5 for decadal prediction. Should add in WOMBAT. Try and keep it up to date as much as possible. Kill two birds with one stone.
  • Russ’ monitor code runs on one processor; if the whole job stops it takes it down. All you have to do is instrument MATM. Wrote it so it does not interact with the WORLD communicator from OASIS. It gets spawned as a slave. If it takes a long time between segments it will issue an abort, which will cascade. Only needs to communicate with MATM. Need knowledge of how long to expect things to take. Does an MPI_Comm_spawn, and issues an abort to its parent (MATM) (see the sketch after this list). Marshall: Could we integrate this into MATM? Does it need an extra rank and sub communicator? Russ: Yeah, could do something more complicated. This was easy and stands alone. Not interacting with OASIS. MATM already ends up wasting ranks.
  • CICE ncpus. Russ: Use a halo approach like the MOM barotropic solver? Marshall: Had this suggestion before. CICE’s future was uncertain, never a big bottleneck or resource use, so not a target.
  • Marshall: Profile OM2 with tenth. Russ: Optimise processor layout? Nic did it. Could we improve this? Sub-blocks? 90×90. Marshall: slender2? Russ: No, cartesian, single 90×90 block. Russ: Once you get to large numbers, it starts to fall apart. No sub-blocking. Compiled with 90×90. Marshall: CICE best suited to hybrid. 1 CPU/node, n threads per rank, with load balancing of threads. Long way away from that.
  • Russ looked at some of Andrew’s timings. Hard to make much sense. Without knowing where all the blocking/synchronising is happening.
  • Marshall: fan of score-p. Met the developer. Can make cartesian maps of processor timings. Did it for LIM profiling.
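
A conceptual Python/mpi4py sketch of the watchdog pattern Russ describes above (his actual code is standalone Fortran; the timeout, file name and heartbeat messages here are illustrative assumptions). The instrumented MATM spawns one extra rank and sends it a heartbeat each segment; if the heartbeats stop, the watchdog calls abort, which cascades through the whole job.

```python
# watchdog.py: launched via MPI_Comm_spawn from the instrumented forcing model.
# Parent side (sketch):
#   watchdog = MPI.COMM_SELF.Spawn(sys.executable, args=["watchdog.py"], maxprocs=1)
#   for segment in range(n_segments):
#       run_segment(segment)
#       watchdog.send(segment, dest=0, tag=0)   # heartbeat
#   watchdog.send(None, dest=0, tag=0)          # clean finish
#   watchdog.Disconnect()
import time
from mpi4py import MPI

TIMEOUT = 1800.0                        # seconds allowed between heartbeats (assumed)
parent = MPI.Comm.Get_parent()          # intercommunicator back to the spawning rank
request = parent.irecv(source=0, tag=0)
last_heartbeat = time.time()

while True:
    done, message = request.test()
    if done:
        if message is None:             # parent finished normally
            break
        last_heartbeat = time.time()
        request = parent.irecv(source=0, tag=0)
    elif time.time() - last_heartbeat > TIMEOUT:
        parent.Abort()                  # take the hung job down rather than burn walltime
    time.sleep(5.0)

parent.Disconnect()
```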

MAS Database for COSIMA Cookbook

  • James: has queried the DB and taken a quick look. Marshall: Happy with it? James: useful that a third party doing filesystem crawling. We might need to do some extra crawling for faster update. Their shard approach will scale better than James’. Will go forward, take the schema they have developed and write tools to use schema. Will use MAS and then fallback to local DB when not available. They gave us what we asked for.
  • Aidan still unable to access it due to a low uid number not playing nicely with security settings.
  • MAS will be good for other people might have data they want to use not under our control.
  • James: Regarding the COSIMA Cookbook experiment DB file, if you delete the file and recreate it, the umask will stuff it up. Can put logic in the software to do the permissions checking (see the sketch after this list).
  • With MAS DB, James can keep an eye on how often it is updating with a view to requesting more frequent updates if needed.
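
A small sketch of the permission check mentioned in the DB file point above (illustrative Python; the desired mode is an assumption). After the database file is recreated, group access is forced so a restrictive umask cannot silently lock other users out.

```python
import os
import stat

def ensure_group_access(db_path, want=stat.S_IRGRP | stat.S_IWGRP):
    """Add group read/write to the file if the creator's umask removed it."""
    mode = os.stat(db_path).st_mode
    if (mode & want) != want:
        os.chmod(db_path, mode | want)
```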

MPI-IO

  • Marshall: Rui has made a lot of progress on MPI-IO. Marshall wrote a bunch of fortran hooks into parallel netcdf and put it into FMS. Handed over to Rui. Worked on it a lot. netCDF4 struggles. Takes a long time to open and close files. Serious synching issues around metadata. Bottle-neck in MPI-IO. Tapping into parallel hdf5. Been around for a long time but no one uses it. Rui has a working relationship with the Urbana HDF5 group. Rui has switched to pnetcdf. Works really well. Doesn’t have metadata synch issues. See speed ups over the serial case. Not what we traditionally use. 3-10x faster than the serial case. Writes that take a few hours take an hour. It is usable. Metric shows clear improvement.
  • Downside is netcdf3. Traditionally this isn’t what we usually do. Usually stitch multiple files together. This will eliminate post-processing. Will make a single coherent file. Not sure what to do. Do we want to use this approach? Good enough to put in main MOM code, but turned off by default. Very sensitive to lustre striping, number of writers. Correct lustre settings are essential.
  • If the overhead is a few minutes, it might be convenient to eliminate post processing. Thinking ahead to 1/30th simulations. Aidan: would still need to post-process and convert to compressed netCDF4.
  • James: mppnccombine isn’t based on pnetcdf? Russ: no. Was attempted, but now abandoned. Marshall: Rewrite mppnccombine to use MPI-IO? Yes good idea, but NCI wanted to test it inside a model.
  • Been very instructive. netcdf gets in the way, so hdf5 gets in the way. Not sure what is the best way forward. Might scale with the number of writers. Maybe 1-2 writers per node. Collectives on the nodes, and written by number of writers.
  • Dale has done this with native IO on UM and got 4x speed up. Quite profligate for CPU hours.

Actions

New:

  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (Marshall)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (Russ)
  • Profile ACCESS-OM2-01 (Marshall)

Existing:

  • Move FMS to submodule of MOM5 github repo (Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
  • Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).
  • Nudging code test case (Russ)
  • Redo SSS restoring with patch smoothing (Aidan)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (Andrew)