Technical Working Group Meeting, January 2019

Minutes

Date: 15th January, 2019
Attendees:

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK)  COSIMA, ANU
  • Russ Fiedler (RF) CSIRO Hobart
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD), CSIRO Aspendale

MOM5 CM2 code harmonisation

PD: Stopping an 18 year run with harmonised code. Seems successful. Not losing summer Antarctic sea ice, which had been an issue. Dave Bi gave his approval.

PD: Looking at new bug fixes on GitHub. Didn’t appear to be in the CSIRO code.

RF: The fixes I added weren’t to do with harmonisation. Diagnostic for transport on density levels. PD: Won’t affect our model run? RF: Yes. PD: Maybe should keep the 18 year run going. RF: There is also a fix to a submesoscale smoothing that you’re probably not using. 99% likely you’re not using that. PD: Could check by looking at the namelist? RF: It’s smooth hblt, or something, but also a note in the code specifying the namelist that shouldn’t be used because of this error. PD: Could you send me the namelist value? RF: Should be in GitHub issue/pull-request. AH: I’ll put links from your commits on to the Slack channel.

https://github.com/mom-ocean/MOM5/commit/11f06f989645b1b21aa990ade61440976451bbb6

https://github.com/mom-ocean/MOM5/commit/06a6d0afb55a1f188d0b58b89513d646d47f062f

AH: hblt_smooth

RF: If you applied the smoothing could smooth into rock. PD: Keep the current run going? Want to get an ENSO spectrum. RF: Shouldn’t make a difference, but will get noise due to changes in red sea fix. Statistics will be the same.

AH: In the release candidate code we reproduced a red sea fix timing bug to make the comparison as clean as possible, using a namelist option. That has been stripped out before merging into the master branch. If you continue with this test run and it becomes your spin up run then you will not be able to do a clean comparison if you then change something else from the master branch version. Is this a test run or will it become a spin up?

PD: Sometimes test runs become real runs, but I’ll say this is a test.

AH: If at any point you start a spin up you need to be on a commit on the master branch. Currently running from a commit on a pull request that no longer exists. There is no comparable commit on the MOM master branch repo because of removing the salinity time unfix option, and merging in RF’s bug fixes.

NH: Second that. Also important if we want to continue to be harmonised, this is a divergence and if we carry on with that we’re diverging immediately.

PD: Alright. Will leave this run going to test ENSO spectrum, but will start a new parallel run for safety. Don’t want to be in the position where the test run gets turned into a spin-up because of lack of time.

AH: Calendar time not compute time is your constraint? PD: Yes. AH: Definitely agree with that strategy. PD: Never have enough compute time, but always have that trade off.

AH: Will tag the code with CM2 version which can use to identify the code. PD: Should tag straight away and I will clone and let people know this is the correct code.

NH: Reproducibility is important, and the current MOM code does not reproduce. PD: On restarts. NH: Yes, so would be a shame to lose that by not using the merged code.

AH: Not only is it reproducible, but NH is running tests for this. NH: That’s right. Just a simple 2 day versus 2×1 day runs. To do that test presently turns off red sea fix. Now I can turn it back on? RF: It should now reproduce. AH: Turn it back on and check it reproduces? AH: This is reproducible between runs, not necessarily reproducible to before RF’s fix. PD: Which is reproducible on restarts and which isn’t? AH: Current harmonised code in the MOM5 master is reproducible between restarts, the code from the pull request that PD is currently running is potentially reproducible if you turn off the flag we introduced which emulated the incorrect behaviour of the old MOM5 CM2 code for testing purposes. PD: So the harmonised code in the pull request is different to the master branch? AH: Yes, in that the hack to emulate the incorrect timing behaviour of the red sea fix has been removed, and a couple of RF’s bug fixes added. PD: Going forward will there be a harmonised branch and a main branch? AH: No, there will be a tag identifying where you get your code from. If you need to pull in updates but don’t want all the updates in master, then you might start a new branch, and cherry pick those updates. At that point you might have a different branch, but not currently. In general better off not having a separate branch, as it just starts diverging again. I don’t know how you guys work and I’m guessing you don’t want code changes, but if, for example, you just wanted to add code changes that added diagnostics, you could add those in, and have something to compare against. PD: Just trying to figure out how it will work, if there are different branches, and if down the track we want to develop the harmonised code again. AH: Yes, and if you have some testing then you can add code and test to see if there are differences in the output, and so add code changes with confidence.
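The 2 day versus 2×1 day check NH describes can be sketched as below. This is a toy illustration only, not the actual test harness: `step`, `run` and the list-of-floats state are hypothetical stand-ins for the model and its restart files, and the comparison is made on a hash of the raw bytes so that last-bit differences are caught.

```python
import hashlib
import struct

def step(state):
    """Advance a toy 'model' by one day (hypothetical stand-in for a model run)."""
    # Simple deterministic update mixing each value with its neighbour.
    return [0.5 * x + 0.25 * y for x, y in zip(state, state[1:] + state[:1])]

def run(state, ndays):
    """Run the toy model for ndays and return the final state (the 'restart')."""
    for _ in range(ndays):
        state = step(state)
    return state

def checksum(state):
    """Bit-exact fingerprint of the state, so last-bit differences are detected."""
    return hashlib.sha256(struct.pack(f"{len(state)}d", *state)).hexdigest()

initial = [1.0, 2.0, 3.0, 4.0]

# One continuous 2 day run.
single = run(initial, 2)

# Two 1 day runs, the second started from the first run's restart.
restart = run(initial, 1)
split = run(restart, 1)

reproduces = checksum(single) == checksum(split)
print("restart reproducible:", reproduces)
```

In a real model the interesting failures come from state that is not carried through the restart, which is the kind of thing the red sea fix timing issue discussed above would expose.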

MW: Highlights that we have not been tagging MOM for some time. Maybe we need more regular tagging.

 

ESM code harmonisation

PD: In an ESM meeting on Friday Tilo said it was too late to include changed code for CMIP6.

AH: Whatever we did he wouldn’t put it in CMIP6? PD: Yes. AH: Invested too much time on spin ups? PD: Yes. AH: So no particular rush to do this. PD: I thought it was good to pass that on. AH: Good to know, thanks. MW: There’s ESM for CMIP6, but also in the CoE. Will they use what you’re working on? AH: Yes, but shame, as we’re not then using the same code as the CMIP6 submission. MW: I thought that was your interest in ESM. AH: Yes, not sure.

MOM5 Governance model

MW: Not sure how much we can do without Steve being present, or anyone from GFDL.

AH: Made some notes, hoped to get some feedback and others make changes, additions. MW did add some useful points about defining domain experts.

AH: The MOM5 repository and community is not welcoming to those outside the current clique of COSIMA. Pull requests languish for years without attention because it is no-one’s responsibility, no-one has a defined role. Even if some people put their hands up to monitor pull requests.

Have a contributing.md, to tell people how to contribute to the code. There are users outside this group who use it, see them on the MOM users mailing list, and can use that channel to advertise it.

MW: That was my experience as a grad student. Not sure who is in charge of the code. Tried to contribute code and found it intimidating. Not a fan of overly prescriptive instructions on how pull requests must look, or how code must be written. That has made me less likely to report bugs. Would rather get bad contributions than no contributions. AH: Poor quality contributions require effort from us to work through them. MW: Yes, just want to make sure they know it is ok to screw-up. A lot of projects require a lot of environment information etc, which can be onerous for new users. Now tend to ignore those instructions and wait for devs to ask for it. Do think that governance model is good, but want to encourage contributions and emphasise it is the effort that matters rather than the quality of the contribution.

NH: I think I agree. Want to be more friendly to outsiders, also agree with MW, want to put as few roadblocks as possible. People have little incentive to contribute to the repository. The harder we make it, the less likely they are to contribute. With Pull Requests, in order to merge we need testing, but people aren’t going to do those tests. They test for themselves to satisfy their scientific goals and that’s it. Not sure how to reconcile that.

MW: Automated code coverage tests help. A lot of CI services panic too much when code coverage drops after a contribution. This can be useful if it then spurs contributors to improve unit tests to improve that metric. AH: Currently have 0% code coverage, so it can’t get worse!

MW: Yes, one issue is we have no code coverage presently. But looking more broadly if we have some automated tests that can tell us what is broken that can help a lot. I know we have testing. AH: Only compilation testing.

AH: If we do this, it will be a burden, want to minimise this, hence the document defined some steps for assessing a PR:

  1. PR assigned to committer
  2. Committer checks PR conforms to guidelines
  3. PR passes CI checks
  4. Iterate until correct or time limit passes (say developer does not update PR as needed)
  5. Testing? Do we push that burden on to contributor to demonstrate bit repro? What about performance?
  6. Accept or reject (rejection is automatic if contributor does not meet expectations above and time limit is exceeded, so rejection at this point would be unusual, but might occur if the efficacy or efficiency of the code is questionable)

AH:

MW: James Munroe has experience contributing to dask, managed by experienced software engineers. He had to do a lot of testing, as well as style and engineering changes. They had someone asking him to make changes until it was deemed good enough. We could do this too, but it would require engagement from us. Also, are we overthinking this? We hardly get any pull requests.

AH: Yes, but some of this stuff is useful for us to do too. A lot of it is just good communication. It may sound onerous, but in many ways it makes it easier, so people aren’t trying to guess the right thing to do, e.g. with code style it may be as simple as pointing to a particularly well written section/module and say “seek to emulate this”

MW: GFDL is struggling with this right now. I sent them a beefy FMS patch and they did not know how to handle it. They didn’t want it, but they didn’t want to just reject it. Might be worth figuring out how they are dealing with it. AH: They might copy what we end up doing. MW: Yes, they might end up copying this.

AH: First up, should we consider roles, e.g.

Contributors: Anyone who wants to contribute code

Committer: Anyone with commit access

Maintainers: Committers who check PRs, assign PRs to other committers and maybe do some other admin tasks.

Admins: Do we need another layer? Currently only Steve, Nic and I are Organisation Owners and Admins on the MOM5 repo. Need at least 2 admins at all times (run over by bus scenario).

Sponsoring institutions: Acknowledge role of institutions which provide time for code development?

BDFL: Steve? (Does he even want to be a Benevolent Dictator?)

 

AH: Do we want to do something like this? NH: Yes. It is a great idea. Write it down and we can use it ourselves, and our collaborators.

AH: Are we happy with the roles? NH: Yes, need to look more closely. Looks good. Keep it simple.

MW: Maintainers should be admins, at least for now. I wonder if Steve would prefer to not have a formal role, as he is transitioning out of the coding. AH: Would he want to be BDFL? MW: Would probably accept it, but maybe he doesn’t want to.

AH: I think it is good to keep him there. At some point we will transition to MOM6 and it will no longer be my role to look after this. Others will move on too, so it would be good to have Steve still there in that eventuality.

MW: He could be a decider. AH: Yes, like having a Queen. Just have to host him in visits every now and then. MW: Head of State, but not the Head of Government.

AH: Ok, get rid of admin/maintainer distinction. Just make them Maintainers.

MW: Sponsoring institutions is interesting. AH: It was to acknowledge that institutions that pay people like me might have a reasonable expectation that the people they pay to help maintain the code could have some say in the way it is run. RF: I don’t like that, it is a community code. MW: I agree with RF, but there is some reality. AH: More about expectations. If I leave my job, the CoE might have some expectation that my replacement would be able to become a committer/admin. MW: I worry about sponsoring institutions insisting on having some oversight. AH: I always thought CSIRO was a bit keener on the whole “official stamp of approval” thing, but if we don’t like it I’ll get rid of it. No worries.

AH: Ok. Let’s codify the roles, and look into a contributing.md to sit at the top of the repo to tell people how to contribute code. This all started with wanting to have timely responses to Pull Requests. Who wants to be what?

MW: What is the

NH: Happy to be a maintainer, but COSIMA would have to pay for my time. AH:

MW: Don’t have an explicit list of users.

RF: How far back do you go with sponsoring institutions? Back to 2009?

MW: I would prefer not to involve them.

RF: Just leave it. Keep it simpler.

MW: When I say sponsor, I mean someone who can’t function without this code. Need a maintainer.

AH: It is literally my job, and happy to volunteer as a maintainer.

AH: Maintainers check PRs, and then assign them to people, we need others to be willing to have a PR assigned to them. MW: Happy to do it, but feel like others are more invested. AH: Doesn’t have to be just you. It is just having someone respond to a PR, check progress, feedback what is required to get it merged, close it if it doesn’t meet standards and the submitter doesn’t respond to queries to fix it.

MW: Yes we need that role, at the moment it is AH and NH. If you want to add me I’m happy to do it, just not sure where I will be in the near future.

AH: Ok how do people get these roles? Commit access is currently Steve, NH and AH. Do other people need it, or want it. How do we decide who should have it?

MW: Best to have one person responsible for commits/merges. AH: Steve recently merged a commit that needed to be reverted. MW: Which is what prompted this. AH: Yes, and he wouldn’t have felt the need to do it if he knew it was being handled.

AH: Do we need to do this now? Specify who is on “the committee”?

MW: Contributor is obvious. Pull committers from contributors. Pull maintainers from committers. If you don’t get enough commits, then a project just dies. AH: Don’t need to codify how someone becomes a committer? MW: No. We’re too small. Look on the issues, and PRs, represents very few people. AH: Ok, so no need right now.

MW: Quite premature to codify this.

AH: Need a `contributing.md`, how to contribute.

MW: Any role for governance is getting the word out. Find the user base. There are a lot of MOM users out there. Steve cited a thousand MOM users in 2010. Governance should be about where are these people, and then get the community. AH: You run the danger of “hey guys come and contribute” and not have processes in place. MW: A good problem to have. AH: What happens if you got 10 PRs tomorrow what would you do with them? MW: Deal with them one at a time. AH: But who would deal with it? We have existing PRs that aren’t being dealt with, one has been there for a year. MW: Yeah, but it is a bit stupid. Changed an integer to an 8-byte. RF: I would have rejected it because it used a specific kind. Change was a good idea, but badly implemented. MW: Ok, let’s look at this and learn some lessons and produce a document.

MW: So let’s look at this PR. What are we going to do about it? So who do we assign this to? I would assign to Russ as he has expressed a strong opinion. AH: Ok. MW: What would you say about this PR RF? RF: I would split into two, as they are unrelated. If you do submit these things, try and keep to individual issues. Specifically about this, first change is non-portable, which can easily be fixed. Would we do the fix, or would he do the fix? All it needs is a selected_int_kind. MW: This is a portability issue. RF: Yes, anything from now on that gets contributed should be standard. MW: Lesson should be an evolving document about what we want. RF: Yes, standards compliant, independent commits. MW: Would you be ok with being assigned this? AH: Already done it. RF: Second part could be a real bug in remapping vectors. Maybe get back to him and ask if he can show a test case. AH: Ideally this would be associated with an Issue which explains the problem. AK: Really ideally would include a test case that fails and that subsequently passes with the fix in the PR. MW: I would argue against pedantry. GitHub treats Issues and PRs similarly, so while an Issue with a test would be ideal, shouldn’t be mandatory. If you would rather formalise then fine.

RF: Have to leave

Subsequent discussion:

AK: Who feels they have responsibility, so things don’t languish, and who has authority to say what is going to happen. NH: And who has time. MW: Good not to dump maintainer responsibilities on NH.

AH: The suggestion is prescriptive so as not to waste time wondering how to do things. Not mandatory, but good to give guidance.

MW: Current touchy point is assigning stuff to RF. Had a meeting and asked if it was ok, but won’t scale. Let’s clean up the pull requests.

MW: I have a different idea of the purpose of the Issues page to AH for payu. You see it as a dumping ground for ideas (AH: Yes), and I see it as something to keep as small as possible. Ok for a small group, but with a larger group it can balloon and you lose track of good ideas. MW: Yes, a bad idea, or not enough time, should be a criterion for closing an issue, without losing the idea. Would be great to get issues down to 1-2 pages, would reveal a lot of lessons.

AK: Paul’s PR is pretty substantial. A whole new file, ocean_basal_tracer. MW: But bungled the build script. Also made a timer that does nothing. This is a good issue with a terrible title. This is a good example.

MW: These projects work best when they’re community driven, that we agree to do it. AH: If you put up your hand to be assigned stuff, you will get assigned stuff. If they don’t, then people won’t. AK: If someone has that role you feel ok assigning to them.

MW: When we talk about governance, let’s not worry about users, let’s figure out how to assign issues, that is a great place to start. Not worry so much about how users should make PRs.

NH: I agree, but maybe still work on guidelines for contributors. There was a document called “How to contribute”, based on technical specifications. It was a markdown doc in the original website. AH: Maybe that exists somewhere. NH: It’ll be around somewhere, outlines how to do a PR and things like that. AH: Should definitely start with that.

MW: Make it a goal this week to deal with those two PRs. Assign me Paul’s if you want.

AH: Willing to be a maintainer? NH: Yes, I like being part of that community. Hard to know how to dedicate time to it sometimes. We need to do a certain amount of maintenance. Maybe think of it as pro-bono professional work. Matt England is keen on that sort of stuff, he might be willing to support it/sanction it/pay for it.

Agreed: Maintainers: Aidan, Nic, Marshall. Ask Steve if willing to have BDFL status.

COSIMA Models – ACCESS-OM2-01

AK: Modified topography to remove terraces (bug). Smoothed a seamount near Severny Island, and eliminated very small cells near tripoles. 54KSU / model year. This is repeat year forcing. Time step is 720s. Did a test run with 900s and it crashed. 720s is a factor of 1800 (coupling time step). No factors in between.

AK: MOM was going unstable, generating 18 m/s velocities, CICE was the first thing to crash with ice remapping error. Was using ndtd (number of CICE time steps per MOM baroclinic timestep). When using ndtd=3, making CICE stable with vaguely unstable MOM. At one point in the run I added Rayleigh damping in Kara Strait as I had done in IAF run. MOM was going unstable in Kara Strait, moving ice around too quickly, and CICE crashed. Once I stabilised MOM, could get away with ndtd=2, so now the model is CICE bound rather than MOM bound. AH: What is runtime? AK: 2 hours, so can do 2 months/submit. MW: What were the fails at the beginning? Related? AK: Showing output from a run-summarising tool. Just shows when runs failed. Not necessarily crashes, might have just stopped to change something. AH: So when you say MOM went unstable, not so unstable it crashed, but unphysical velocities? AK: Haven’t looked in detail, but imagine it would be grid scale alternation and that sort of thing. AH: Does that mean, in order to be compatible with CICE, maybe we need to reduce some of MOM’s thresholds for warnings so they are better matched? AK: Yes. CICE has a CFL condition that is tighter than MOM. MW: They are solving different equations. AH: Not saying they should, or could be, exactly the same, but MOM is more permissive than CICE, but the crashes happen in CICE. So if you reduce the limits in MOM will diagnose the problem in the correct place, rather than indirectly through CICE crashing. A new user would then get the right information.

AK: Haven’t examined the fields. It is right on the limit of the barotropic CFL limit. Splitting factor of 80. The outputs show it is right on the limit. MW: Griffies would hate that. AH: Pretty routine to change that when timestep changes. AK: Had to go 100 for dt=900s for it to pass that check. Might be wise to up that number.
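The arithmetic behind the two splitting factors quoted above can be checked quickly. This sketch assumes the barotropic time step is simply the baroclinic time step divided by the splitting factor; the function name is illustrative and not taken from MOM.

```python
def barotropic_dt(dt_baroclinic, split):
    """Barotropic time step in seconds, assuming dt_barotropic = dt / split."""
    return dt_baroclinic / split

# Values quoted in the discussion above.
dt_720 = barotropic_dt(720, 80)    # 720 s step with a splitting factor of 80
dt_900 = barotropic_dt(900, 100)   # 900 s step with a splitting factor of 100

print(dt_720, dt_900)
```

Under that assumption both cases give the same barotropic step, which fits with the run sitting right on the barotropic CFL limit at either baroclinic time step.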

MW: This is tenth degree? Really dropped your CPU count. AK: This is a minimal config. Dropped from 6K to 2K, getting same throughput as larger model. MW: But with lower ndtd and higher timestep. AK: Yes. AH: timestep is like a superpower.

NH: Great! Well done. AK: Thanks for your work. On the number of CPUs, it is good to run minimal configs, as they are inherently more load balanced. Each CPU is doing more work and a greater variety of work. A good reason this minimal config is more efficient. MW: Won’t it run slower if you reduced timestep? Doesn’t make it more efficient necessarily? NH: I think it does.

NH: How does this compare to MOM-SIS? We can’t really compare without running with the same grid? MW: Can usually use the MOM timing and add 15%. I went back and checked these numbers. AH: I think I was getting six months / submit with MOM-SIS at a nominal 7200 core layout, with dt=600s. The larger ACCESS-OM2 config has 6K cores in total, but only 4300 MOM cores.

PD: Have to go now. Bye.

NH: Are you going to try getting the bigger config running? AH: 6 months per submit is very interesting. Could do some decent 100 year runs. Could be worth looking at.

MW: According to the numbers I have been posting, could speed MOM up by a factor of 4 with more cores. CICE ice step is also scaling well. Don’t know if switched to sectrobin, or missing a coupling bottleneck. This model could scale well. Has anyone checked with sectrobin? NH: This is using it. MW: I am talking about much larger, I get improvement up to 12K CICE cores. AH: Load balancing not an issue? MW: Maybe, but I’m not seeing it with the function I am testing. Is this new? Or was this always the case and I’m not seeing what slows the model down? If it is new, maybe we should experiment with throwing more cores at CICE. Even with the low config, maybe you could ramp the CICE cores up to 600? NH: That one is MOM bound. AK: The IAF was using round robin, and dt=450s. MW: Right so, that is your difference right there. AK: Dropping ndtd to 2 makes it MOM bound.

NH: Let me know if there is anything I can do to improve the load balancing in larger config. Pretty sure there is larger config using sectrobin.

MW: I’ve seen consistent improvements in all CICE configs, including 1 deg. Doubling cpus in CICE and MOM individually in 1 degree saw improvement in both, but got errors when I tried to both. Gave up because 1 degree not so important. Will try again in tenth degree.

NH: Sounds like you’re discovering what the best config should be using some evidence based mechanism. The 1 degree config was pulling numbers out of hat that balanced ok with decent efficiency. Maybe want a small/medium/large config, but not for all resolutions. 1 degree maybe just want max throughput. Would be great to capture what you’ve discovered, and actually start to create those configurations.

MW: MOM numbers are clear. Time per step is very informative. Want numbers around 0.1s/step. Allows us to tackle this independent of time step. Surprised CICE is scaling. Reminds me of the last COSIMA talk which everyone said was wrong because I didn’t have enough ice. Will make a figure, which will show strong scaling numbers for CICE. Looks like it is a strong scaling model.
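To see how a time-per-step figure translates into run time for a given time step, a rough conversion is sketched below. It assumes a 365-day model year and counts only time stepping (no initialisation, I/O or coupling overhead); the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86400

def wall_hours_per_model_year(seconds_per_step, dt, days_per_year=365):
    """Wall-clock hours to step through one model year at a fixed cost per step."""
    steps = days_per_year * SECONDS_PER_DAY / dt
    return steps * seconds_per_step / 3600.0

# 0.1 s/step at the two time steps mentioned in these minutes.
for dt in (720, 900):
    print(dt, round(wall_hours_per_model_year(0.1, dt), 2))
```

Keeping the per-step cost separate from `dt` is what lets the scaling question be tackled independently of the time step.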

AK: On CICE, there is a workshop in Hobart in late Feb. It is a CICE modelling workshop, and Elizabeth Hunke is visiting. AH: Are they up to CICE6? MW: Dealing with a fork. Los Alamos has a version no-one is allowed to see. NH: Worth looking at the CICE repo. Our CICE5 version contains all changes up to CICE6 tag. We’re as up to date as you can be on CICE5. AH: Have you back-ported stuff? NH: Yes, it wasn’t very hard. This is CICE6, but it has CICE5 in the repo. AH: They have a cice6 branch. MW: What is ice-pack? NH: They have split out the column physics to make it more portable. It is just code from CICE5 but repackaged. AH: Icepack is included in CICE6 as a submodule.

AH: Ruth has been having queuing issues, anyone else having problems? Might just be someone using broadwell heavily. Do you know if we have to use broadwell for minimal? NH: My runs used to crash. AK: My runs are ok on normal. Running the same as Ruth. AH: Using less than 1GB/processor? NH: I put something in config.yaml so when running in normal it used the high memory nodes. Did you take that out? AK: Yes I removed that. NH: Who knows. So much has changed since then. Memory can spike. AK: Ruth should be able to run on normal without requesting extra memory. MW: What made you think it was a memory issue? NH: Crashed at a particular place on initialisation in MOM where I know there is a big jump in memory usage. I have made an issue about it. It is in FMS. As soon as I went to the high memory node it went away. Doesn’t print out specific memory errors, but as soon as I increased memory it ran fine. Good to go back if it has gone away. It is something that gets done on the MOM root PE on initialisation that increases memory usage a lot. AH: Shame we no longer have access to the PBS logs as there was a lot of information in there on this sort of thing.

MW: Intel 19 gives awesome error messages. Not only are tracebacks more readable, but they give 7 lines of code each side of the error. Now getting MPI tracebacks without even using mpi-debug. Can’t use openmpi/1.10 with Intel 19.

Parameter sweep runs

AH: To use extra time at the end of last quarter suggested we could do a parameter sweep, which would use up a lot of compute time in short order. It didn’t end up happening, but Andy Hogg thought it would be a good capability to develop in case we did want to deploy it. I understand AK has done this a lot? AK: Well sweeps of 2 parameters, yes.

AH: Not a big deal, using payu, maybe have a YAML file specifying parameters and how to sweep them, create a bunch of payu configs programmatically and run them. MW: Always wanted to add this as a feature in payu. Spin off 20 runs with one file. AH: Any particular ideas, or anyone want to do it, feel free, otherwise will go on my very long to-do list. NH: I couldn’t think of anything. I don’t know enough about the science. AH: Me too. I have no idea what parameters I would want to change, so that is where it stopped. The parameter search could mean you always have something to run if there is spare time and add it to the database. If you have a database of runs, you effectively have a parameter search if you change anything.
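A minimal sketch of the sweep idea AH outlines, expanding a parameter specification into one set of values per run. Nothing here is payu functionality: the `sweep` dictionary stands in for whatever a YAML file might contain, and the parameter names, values and `sweep_NNN` run names are hypothetical examples.

```python
import itertools

def expand_sweep(sweep):
    """Return one dict of parameter values per point in the sweep grid."""
    names = sorted(sweep)
    return [dict(zip(names, values))
            for values in itertools.product(*(sweep[n] for n in names))]

# Hypothetical sweep specification; in practice this might be read from YAML.
sweep = {
    "dt": [600, 720],
    "ndtd": [2, 3],
}

runs = expand_sweep(sweep)
for i, params in enumerate(runs):
    # Each entry would become its own run configuration (naming is made up).
    print(f"sweep_{i:03d}", params)
```

Each resulting dictionary could then be written into a copy of a base configuration, which is the "spin off 20 runs with one file" workflow MW mentions.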

AK: Have a current grant to do a study on parameter sensitivity in CICE. This capability could be useful.

NH: I have written something which can divide your grid by an integer number, creates initial conditions and everything. Could be interesting to programmatically create a bunch of configs and run them with something like this.

 

Technical Working Group Meeting, September 2018

Minutes

Date: 11th September 2018
Attendees:

  • Marshall Ward (MW) (Chair), NCI
  • Aidan Heerdegen (AH) and Andrew Kiss (AK), CLEX ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC), CSIRO Hobart
  • Nic Hannah (NH) Double Precision
  • Peter Dobrohotoff (PD), CSIRO Aspendale

Clean up Actions list

Finished:

  • Incorporate RF wave mixing update into MOM5 codebase + bug fix (AH)
  • Code harmonisation updates to ACCESS and ESM meetings (PD, RF)
  • Check red sea fix timing is absolute, not relative (AH)
  • MW liaise with AK about tenth model hangs (AK, MW)
  • Profile ACCESS-OM2-01 (MW)

Deleted:

  • Follow up with Andy Hogg regarding shared codebase (MW)
  • Nudging code test case (RF)

CICE in ACCESS-OM2

MW: 4 block success. 16 block didn’t work. sectrobin also didn’t work. Limited perspective on problem.

RF: blow out in time with extra blocks was halo updates. Weakness with round robin. A lot of overhead, no local comms. Maybe 8 tiles/processor might work. Marshall’s profiling showed small number of processors dominated run time. Want to minimise the maximum. That is the limiter

AH: Where are the max tiles?

RF: Seasonal ice near Hudson Bay, Sea of Okhotsk and Aleutian Islands.

MW: Nic used total CPU count less than number of blocks

RF: Could run with more, or less. MW: 80 CPUs less, could solve this.

AH: General strategy to concentrate on not assigning CPUs to the low work (blue areas) and let the high work areas take care of themselves?

RF: Only worried about slowest tile. Nice to have even distribution, but hard to achieve that in practice.

AH: Slowest tiles change over time RF: read in a map of expected ice concentration. Or have a heuristic, say weight by latitude. AH: If identify areas that do very little work, say never want to have many processors there, and free up processors for high work areas.

AK: There are five hot stripes and four cold stripes. Some processors have 5 blocks, some have 4. The outlying busiest ranks are on those hot stripes. If we get rid of striping with more even split, that would have maybe a spike on a lower baseline

RF: About half the processors have 5 about half have 4, request a few more PEs and that would close to balancing this issue.

NH: First attempt 1600 PEs with an even 4 blocks across all. With idealised test case Ocean was not blocking at all. Thought could save a couple of hundred PEs, and there was not a big difference. However Andrew’s real world config is behaving differently. Worth going back up to 1600 and doing an even 4 or even 8 blocks. Assumed wanted everything to be even. Seemed roughly the same to have a mix. This profiling shows I was wrong.

RF: Can easily work out to get exactly 5 blocks per PE. AK: If you give me that number I can try it. NH: 5 across the board is better. Don’t want a single PE doing more work. RF: Slowest one kills you.
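The block-count reasoning above (some PEs with 5 blocks, some with 4, and the slowest PE setting the pace) can be sketched as follows. The block and PE counts used are toy numbers chosen for illustration, not the real ACCESS-OM2-01 layout, and the function names are hypothetical.

```python
def distribution(n_blocks, n_pes):
    """Return {blocks_per_pe: number_of_pes} when blocks are dealt out evenly."""
    base, extra = divmod(n_blocks, n_pes)
    dist = {}
    if extra:
        dist[base + 1] = extra          # PEs that carry one extra block
    if n_pes - extra:
        dist[base] = n_pes - extra      # PEs that carry the base number
    return dist

def pes_for_even_split(n_blocks, blocks_per_pe):
    """PE count giving exactly blocks_per_pe each, or None if not divisible."""
    if n_blocks % blocks_per_pe:
        return None
    return n_blocks // blocks_per_pe

# Toy case: 90 blocks on 20 PEs leaves half the PEs with 5 and half with 4.
print(distribution(90, 20))
# An exact 5 blocks per PE for the same 90 blocks.
print(pes_for_even_split(90, 5))
```

In the toy case the most loaded PE carries 5 blocks either way, so the even split reaches the same maximum with fewer PEs; this ignores the fact, discussed above, that blocks differ in how much ice work they contain.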

AH: How does the land masking affect it? A thicker stripe in NH? RF: Yes. Did I post a picture of where tiles are allocated? NH: More blocks means getting rid of more land? RF: Lose with communication cost.

NH: In order to get this working I ran into the raijin problem: messages getting lost and deadlocks. When we got 0.1 deg MOM-SIS working had issues with point to point sends and recvs, and Marshall changed that to a proper gather to get initialisation working. The gather inside CICE is implemented with point to point sends and recvs. Assume similar. It is doing a send for every block. MW: Andrew’s finished ok? AK: Ran with 30×35. MW: mxm might resolve this problem? NH: Resolved by putting in a barrier after all the sends, otherwise deadlocks. MW: Did you add barriers? NH: Yes to the MPI gather code. MW: Clear that CICE is heavily barriered. NH: Could implement properly with MPI_gather. MW: Caveat didn’t work with the global field. NH: Only does a global gather once when writing out restarts. Not too bad. MW: A lot of MPI ranks? NH: 1600 x number of blocks is the number of sends. MW: So number of messages, not number of ranks. MW: Only added barrier for restart? NH: Could have done that, but added in MPI_gather. Maybe that is bad? Actually didn’t add, just enabled it by defining a preprocessor flag.

AH: Is there an effect that it gets wider in the north that you’re sampling more ice in those areas?

AH: Should we pull out the slowest blocks and see where all the blocks are that contribute to the slowest processors?

RF: Correspond to areas of highest ice concentration. AH: There is ice in Okhotsk in northern summer? RF: Yes.

MW: Arctic and Antarctic are sharing work. RF: How many for this run? MW: 1385. RF: If you run with 1500 or so you get an even distribution.

NH: Should decide what is the next step/run?

MW: Two options, massively increase number of blocks, but this is blowing out with comms time, or evenly divided 5 blocks. RF: Yes that is the one to do next.

AK: sectrobin should solve the communications issue but couldn’t get it to run. NH: Not sure if code needs to change? RF: Test on 1 degree model.

AK: First step to even up current run with 4 or 5 blocks. MW: Should confirm that many blocks is a comms problem and not a tripole issue for example. But this is a research problem.

AK: Will switch to this for 0.1 deg production as it is already better.

NH: New code 1 block per PE gives identical answers to old code. 4 blocks does not give identical answers to old code. Not sure if I should expect it to be the same. Don’t know how CICE works. In terms of coupling it should be the same if you’re coupling to individual blocks or multiple blocks. Not ruling out it should be identical and there is something going wrong. AK: What would make it non-identical? Order of summation? NH: Could be something like that. MW: Might be CICE doing a layer calc before doing vertical? Have to know more about CICE. NH: Might be worth looking into further so at least we know that we’re not making bugs.

AK: How would I switch to this for the production run? Not bitwise identical? Just check fields look physically reasonable? NH: Hard problem. Can’t see physical difference. Only looking at last few bits of a floating point number. MW: Did an MPI sum on a single rank and it changed the last bit. Found it running the FMS diagnostics and that is why they failed. Don’t fail at GFDL. Scary stuff. NH: Scary and time consuming.
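
The order-of-summation effect raised above can be shown with a small sketch. It assumes nothing about how CICE actually sums over blocks; the three values and the block splits are chosen only to make the last-bit difference visible, and the function names are hypothetical.

```python
def naive_sum(values):
    """Plain left-to-right floating point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def blocked_sum(values, block_sizes):
    """Sum each contiguous block first, then add the partial sums.

    This mimics a decomposition where each block produces a partial
    result and the partials are combined afterwards.
    """
    partials = []
    start = 0
    for size in block_sizes:
        partials.append(naive_sum(values[start:start + size]))
        start += size
    return naive_sum(partials)


values = [0.1, 0.2, 0.3]
one_block = blocked_sum(values, [3])      # (0.1 + 0.2) + 0.3
two_blocks = blocked_sum(values, [1, 2])  # 0.1 + (0.2 + 0.3)
print(one_block == two_blocks)            # False: the results differ in the last bit
```

Both answers are equally valid roundings of the same sum, which is why a change in block count can break bitwise reproducibility without producing a visible physical difference.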

MW: Clear strategy. Get rid of bands. Go with 1600 cores. Have a 16 block job running, will keep everyone updated.

Code Harmonisation

AH: My understanding with the ESM harmonisation is that we’re close, as we haven’t yet put in the coupling changes from CM2 that you had to take out of the ESM code. PD: Dave Bi’s iceberg scheme? AH: If we get the WOMBAT code into MOM5 that would be harmonised I think. PD: Maybe Matt has a better handle?

MC: Are the OM and CM almost harmonised except for iceberg information? Are they almost the same? AH: I believe so. Once we get WOMBAT in there we’re good to go. Russ had a different idea about how to handle the case of different coupling fields.

RF: Have to get rid of ACCESS keyword. In many cases redundant. AH: ACCESS keyword can be replaced by ACCESS_CM or ACCESS_OM. RF: Yes!

RF: On CICE side of things (and probably MOM) coupling fields are currently defined as parameters. Can use calls to PRISM, test return code, put some tests for legal code/parameters for icebergs for example. Don’t need ifdef’s, can test on the fly. A lot easier than recompiling every time.
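
The idea of testing coupling fields on the fly rather than with ifdefs can be sketched as below. Everything here is hypothetical: `define_field`, the status codes and the field names are stand-ins invented for illustration and are not the PRISM/OASIS, CICE or MOM interfaces.

```python
# Hypothetical status codes; a real coupler defines its own.
OK = 0
NOT_AVAILABLE = 1


def define_field(name, offered_by_coupler):
    """Stand-in for a coupler call that reports success via a return code."""
    return OK if name in offered_by_coupler else NOT_AVAILABLE


def select_coupling_fields(required, optional, offered_by_coupler):
    """Decide at run time which coupling fields are active.

    Required fields must be accepted by the coupler. Optional fields
    (for example an iceberg flux) are switched on only if the call
    succeeds, so one executable serves several configurations.
    """
    active = []
    for name in required:
        if define_field(name, offered_by_coupler) != OK:
            raise RuntimeError(f"required coupling field missing: {name}")
        active.append(name)
    for name in optional:
        if define_field(name, offered_by_coupler) == OK:
            active.append(name)
    return active


# Hypothetical configurations: one without and one with an iceberg field.
without_icebergs = select_coupling_fields(["sst"], ["iceberg_melt"], {"sst"})
with_icebergs = select_coupling_fields(["sst"], ["iceberg_melt"], {"sst", "iceberg_melt"})
```

The same code path handles both cases, which is the saving over recompiling with different preprocessor flags.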

AH: How do we implement this? Put WOMBAT code in now so we have an ESM harmonised version and then deal with coupling etc as this is ACCESS-CM? RF: Want to bed down ACCESS-CM and OM harmonised first. The WOMBAT stuff will move in quite simply. I’d like to take that on, have been tasked to do this to take some of the load off Matt. Get this first step out of the way and then move on to WOMBAT and ESM. Until the first step done things can be in a state of flux.

MC: Is wind enhanced mixing in ACCESS-OM? RF: Yes. MC: FAFMIP in ACCESS-OM? RF: They’re in MOM5. MC: They weren’t in ACCESS-CM code. AH: That is a 3 year old fork. MC: Can we update ESM from ACCESS-OM? AH: This morning putting WOMBAT changes into MOM5 pull request. Can grab and check if it works. MC: What is the difference in pulling from one direction to the other? AH: ESM is a 3 year old fork with little history in common with current MOM. Couldn’t code into ESM, would be too difficult. Cherry picked your changes into the MOM5 code, but wouldn’t work the other way. Will liaise with Russ to get ACCESS-CM changes.

AH: Would WOMBAT always be part of MOM5-SIS? MW: Is it big? RF: No, very small. MW: Let’s leave it in MOM5. Just executable bloat. RF: Just a few fields. MC: Allocated, so if not turned on, then no issues. RF: WOMBAT wants the 10m waves, but we need that for the wave mixing as well.

Travis CI on MOM5

AH: ACCESS-OM no longer compiles because you need libaccessom2 as well. NH: Same before. Always needed OASIS. AH: I’ve got CM compiling by pulling in OASIS and making it. All the compilation tests are passing. Could pull in the libaccessom2 and compile in a similar way to ACCESS-CM. There is no old ACCESS-OM build anymore. It is ACCESS-OM2. MW: Do we want to do this external to the repo? AH: Nice to have the tests there and passing. OM now has different driver code to CM, so can’t be sure you’ve done it properly without an ACCESS-OM compilation test. NH: There always needs to be a dependency on a coupler. libaccessom2 is more than a coupler. Maybe some of it is undesirable. Not worse than having a dependency on OASIS. AH: Just wanted to make sure there wasn’t an ACCESS-OM that was independent of libaccessom2. MW: Can you provide libaccessom2 as a binary and headers? AH: Yes, that is a possibility. NH: Could just be a .a file. MW: That is how you handle dependencies, as a binary, like libc. MW: Do you call OASIS in MOM? NH: Yes. In yatm don’t directly call OASIS. Could change coupler in future without changing models. MW: No problem with wrapping OASIS. AH: Can do the same thing I did with CM, pulled in OASIS, built it. Pretty straightforward.

Actions

New:

  • Create even 5 blocks per PE map for CICE (RF)
  • Get coupling changes into MOM for harmonisation (RF+AH)

Existing:

  • Update model name list and other configurations on OceansAus repo (AK)
  • Shared google doc on reproducibility strategy (AH)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • Compare out OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (RF)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • Redo SSS restoring with patch smoothing (AH)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (AK)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)
  • Update ACCESS-OM2 model configs (AK)
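
The reasoning behind the action to add an MPI barrier before the ice halo update timer can be sketched with a toy model. It assumes, as a simplification, that a halo exchange cannot finish until the slowest rank arrives (in practice a rank waits only for its neighbours); the timings and function names are hypothetical.

```python
def halo_timer_without_barrier(compute_times, comm_time):
    """Halo timer per rank when waiting and communication are timed together.

    Fast ranks reach the halo update early and sit waiting, so the
    load imbalance is reported as communication time.
    """
    slowest = max(compute_times)
    return [slowest - t + comm_time for t in compute_times]


def timers_with_barrier(compute_times, comm_time):
    """Barrier timer and halo timer per rank with a barrier before the halo update.

    The barrier absorbs the wait for the slowest rank, leaving the
    halo timer with only the true communication cost.
    """
    slowest = max(compute_times)
    barrier = [slowest - t for t in compute_times]
    halo = [comm_time for _ in compute_times]
    return barrier, halo


# Hypothetical per-rank compute times (arbitrary units): 4 or 5 blocks each.
compute = [4, 5, 5, 4]
mixed_timer = halo_timer_without_barrier(compute, comm_time=1)
barrier_timer, halo_timer = timers_with_barrier(compute, comm_time=1)
```

If the halo timer drops to a uniform value once the barrier is added, the slow timings were load imbalance rather than communication.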

COSIMA 2018 Report

Aims & Goals

The third meeting of the Consortium for Ocean Sea Ice Modelling in Australia (COSIMA) was held in Canberra on 7-8 May 2018. This annual COSIMA workshop aims to:

  • Establish a community around ocean-sea ice modelling in Australia;
  • Discuss recent scientific advances in ocean and sea ice research in a forum that is inclusive and model-agnostic, particularly including observational programs;
  • Agree on immediate next steps in the COSIMA model development plan; and
  • Develop a long-term vision for Australian scientific advances in this area.

Participants

The 2018 workshop was our largest workshop yet, with 30 talks and 49 participants.

Attendees included:

Gary Brassington (Bureau of Meteorology), Matt Chamberlain (CSIRO), Chris Chapman (CSIRO), Fabio Dias (UTAS/CSIRO), Prasanth Divakaran (Bureau of Meteorology), Peter Dobrohotoff (CSIRO), Catia Domingues (UTAS), Matthew England (UNSW), Russ Fiedler (CSIRO), Annie Foppert (CSIRO), Leela Frankcombe (UNSW), Bishakhdatta Gayen (ANU), Angus Gibson (ANU), Stephen Griffies (NOAA/GFDL), Nicholas Hannah (COSIMA), Aidan Heerdegen (ANU/CLEX), Petra Heil (AAD & ACE CRC), Andy Hogg (ANU), Ryan Holmes (UNSW), Shane Keating (UNSW), Andrew Kiss (ANU), Vassili Kitsios (CSIRO), Veronique Lago (UNSW), Clothilde Langlais (CSIRO), Andrew Lenton (CSIRO), Kewei Lyu (CSIRO), Jie Ma (CSIRO), Simon Marsland (CSIRO), Paige Martin (University of Michigan), Josue Martinez Moreno (ANU), Richard Matear (CSIRO), Laurie Menviel (UNSW), Mainak Mondal (ANU), Ruth Moorman (ANU), Adele Morrison (ANU), Terry O’Kane (CSIRO), Peter Oke (CSIRO), Ramkrushnbhai Patel (UTAS), Paul Sandery (CSIRO), Abhishek Savita (UTAS-CSIRO), Kate Snow (NCI), Paul Spence (UNSW), Kial Stewart (ANU/UNSW), Veronica Tamsitt (UNSW/CSIRO), Mirko Velic (Bureau of Meteorology), Marshall Ward (NCI), Luwei Yang (IMAS, UTAS), Rui Yang (NCI), Jan Zika (UNSW)

Status

The workshop was structured to focus on scientific questions on Day 1, particularly in the first two sessions. In these sessions, topics ranged from Antarctic shelf processes to oceanic convection, from reversibility of the Earth system to frictional drag. The final session on Day 1 focussed more on technical issues, including assessment of the optimisation status of existing models. On Day 2, talks focussed more on strategic issues, including an outline of Bluelink, ACCESS, CAFE and coastal programs. These strategic talks transitioned to small-group discussions (see synthesis below). The workshop finished with a tutorial on the COSIMA Cookbook framework for model analysis.

The Australian landscape in ocean-sea ice research involves a number of interleaving programs, each of which was represented at this workshop. The figure below outlines the linkages between these programs:

By way of explanation:

ACCESS-CM2/-ESM1.5 will be Australia’s input to CMIP6, and use MOM5 and CICE at 1°.

CAFE is the decadal prediction system in development, which uses MOM5.

ARCCSS/CLEX, ARC CoE programs, use high-resolution ocean-sea ice models for process studies.

Bluelink/OFAM is the ocean forecasting and reanalysis system which will adopt ACCESS-OM2-01 in future versions.

CSHOR is the Centre for Southern Hemisphere Oceanographic Research; it focuses on observational studies but we hope to establish two-way interactions with this program.

Coastal Modelling includes the Australian coastal oceanography community, as well as Antarctic nearshore programs within AAD and ACE-CRC.

A major theme of the workshop was to review the status of the ACCESS-OM2 model which is the focus of COSIMA. In short, we have had success with model releases at 1° and 0.25° resolution – these models are now actively being used for scientific runs, and are available for download and use by the community. They include a recent upgrade to the file-based atmosphere (YATM) and new JRA55-do forcing datasets. The 0.1° version of the model has progressed significantly in the last year; there are outstanding tasks to evaluate model output and further optimise the model configuration.

The COSIMA Cookbook tutorial was attended by about a third of participants, and some progress was made. The aim of this tutorial was to entrain more active users to the system and encourage input from those users. The Cookbook is similar in style to the analysis system being developed for CAFE and it may be possible to merge elements of each framework at some stage in the future.

Program

Where available, talk files are linked from the presenter’s name.

Monday 7 May
10:00 Arrival & Morning tea
10:30 Session 1 (Chair – Andy Hogg)
Stephen M Griffies (NOAA/GFDL): Understanding and projecting global and regional sea level: More reasons to include refined ocean resolution in global climate models
Andrew Kiss (ANU): Overview of the ACCESS-OM2 model suite
Andrew Lenton (CSIRO): Ocean Reversibility in ACCESS-ESM
Catia Domingues (UTAS): Global and spatial temporal changes in upper-ocean thermometric sea level
Fabio Dias (UTAS/CSIRO): Mean and seasonal states of the ocean heat and salt budgets in ACCESS-OM2
Adele Morrison (ANU): Circumpolar Deep Water transport towards Antarctica driven by dense water export
Jan Zika (UNSW): Getting an ocean model to obey: Prescribing and perturbing exact fluxes of heat and fresh water
12:30 Lunch
13:30 Session 2 (Chair – Clothilde Langlais)
Petra Heil (AAD & ACE CRC): ACCESS-OM2-01 sea ice
Paul Sandery (CSIRO): Sea-ice data assimilation and forecasting using an Ensemble Transform Kalman Filter
Paul Spence (UNSW): Does the Southern Ocean have sleep apnea?
Veronique Lago (UNSW): Impact of projected amplification of Antarctic meltwater on Antarctic Bottom Water formation
Ryan Holmes (UNSW): Numerical Mixing in the COSIMA Models
Luwei Yang (IMAS, UTAS): The impacts of bottom frictional drag on the sensitivity of the Southern Ocean circulation to changing wind
Vassili Kitsios (CSIRO): Stochastic subgrid turbulence parameterisation of eddy-eddy, eddy-topographic, eddy-meanfield and meanfield-meanfield interactions
Matt Chamberlain (CSIRO): Using transport matrices to probe circulation in ocean models
15:30 Afternoon tea
16:00 Session 3 (Chair – Petra Heil)
Nicholas Hannah (COSIMA): ACCESS-OM2 Software Development
Marshall Ward (NCI): ACCESS-OM2 performance analysis
Rui Yang (NCI): Parallel IO in MOM5
Angus Gibson (ANU): Towards an adaptive vertical coordinate in MOM6
Jie Ma (CSIRO): Investigating interannual-decadal variability of Indian Ocean temperature transport in an eddy-resolving model
Paige Martin (University of Michigan): Frequency-domain analysis of energy transfer in an idealized ocean-atmosphere model
17:30 Close
19:00 Workshop dinner (Debacle, 24 Lonsdale St, Braddon)
Tuesday 8 May
9:00 Session 4 (Chair – Andrew Kiss)
Andy Hogg (ANU): Are we Redi for 0.25° ocean-climate models?
Kial Stewart (ANU): The Repeat Year Forcing for JRA55-do
Terry O’Kane (CSIRO): Coupled data assimilation and ensemble initialization with application to multi-year ENSO prediction
Gary Brassington (Bureau of Meteorology): Ocean forecasting status and outlook
Peter Oke (CSIRO): Bluelink activities and plans
Matthew England (UNSW): A proposal for future projection simulations using COSIMA ocean-ice models
Richard Matear (CSIRO): CSIRO Decadal Climate Forecasting, update of the project’s progress
Simon Marsland (CSIRO): Preparing ACCESS for CMIP6
Clothilde Langlais (CSIRO): Downscaling towards the coast – a perspective on where the coastal modelling group would like to go
11:00 Morning tea
11:30 Discussion: COSIMA planning and strategy
13:00 Lunch
14:00 Strategy and planning summary
14:30 COSIMA Cookbook tutorial
16:00 Close

Synthesis of Discussion

Tuesday afternoon included discussions of present and future needs and directions of the COSIMA community, via breakout sessions on the topics Sea Ice, Coastal / Forecasting, Coupled Modelling, Process Modelling, Biogeochemistry, and Technical. The overall threads of these discussions are summarised here.

Open and accessible code, configurations, output and analysis

Transparency, accessibility and reproducibility of model code development, run configurations and output data were named as priorities by many groups. Nic Hannah’s proposed REDB (Reproducible Experiment Database, http://redb.io) was widely supported as a means to tie together and curate the source code, configurations, output and analysis of model experiments. Using consistent shared codebases was also a priority. Containerisation was suggested as a method to make experiments self-contained. Extension of the database to include idealised experiments was also suggested.

Model evaluation

There is a need for more model evaluation against observations. Several groups highlighted the importance of better integration of observations for model validation and a desire for this functionality to be better supported in the COSIMA Cookbook. Comparison of CICE to SIS-1 at 1 and 1/4 deg was also suggested.

Technical validation is also needed – e.g. BGC, bit reproducibility, broadened test suite, regression testing. Model performance and stability priorities include: resolve crashes, balance load, MPI benchmarks and stress testing.

Usability

Suggestions included a glossary for beginners, an online portal for control runs, and to minimise difficulty of running new model configurations. Standardised output files and naming conventions would facilitate analysis. Improved functionality and versatility of the COSIMA Cookbook was also suggested.

Documentation was a priority for many, in particular an ACCESS-OM2 documentation paper, but also open/evolving documentation as the models develop.

Parameter selection was also a concern for many: how to choose appropriate parameters (e.g. for ice or BGC), how to assess model sensitivity to parameters, how to document why parameters were chosen or altered. Data assimilation was suggested as a way to improve ice parameter selection, including assimilation of under-ice observations (e.g. temperature). BGC was suggested as a way to constrain the dynamics.

It was pointed out that the payu run management software underpins model runs, yet formal funding for its continued development is presently lacking.

Model enhancements

Suggestions for enhanced modelling capability included: interannual forcing, WOMBAT BGC, coupling to an atmosphere model, 1-way nesting, coupling to wavewatch, explicit tides, wet/dry cells.

Community coordination, synergies and strategy

Suggestions included a streamlined process for providing community feedback and deciding on priorities, and for community involvement in developing the BGC component. It was also suggested to foster engagement with atmosphere and sea ice specialists, and have a more formalized ice group. The technical team is also seeking more input from scientists, especially regarding sea ice.

Regarding modelling strategy, it was suggested to have intelligent model diversity (not too many versions), a consensus on standard perturbation experiments, and to decide on resources to commit to MOM5 vs. MOM6.

Summary of Priority Tasks

The following list of tasks was identified as a priority for the near term. Volunteers to lead or assist with tasks would be much appreciated.

  1. IAF Runs: With the addition of YATM, we now have the facility to run Interannual Forcing (IAF) runs from the JRA55-do forcing dataset in ACCESS-OM2. Once YATM has been tested, we will conduct IAF runs at all resolutions, starting with 1°.
  2. Model Documentation: Production of a model documentation paper is a high priority for the coming months. This will be achieved by:
    1. Writing a larger technical documentation report (https://github.com/OceansAus/ACCESS-OM2-1-025-010deg-report) that will be stripped down to feed into a paper; and
    2. Inviting community evaluation of existing model output.
  3. Model evaluation and analysis: We propose the COSIMA Cookbook as a framework for users to contribute model analyses. In particular, we encourage observational comparisons with existing model output, and also encourage users to submit bug reports and feature requests via https://github.com/OceansAus/cosima-cookbook/issues
  4. WOMBAT: In the coming months we will look to implement the WOMBAT biogeochemistry model (already running in MOM5) into the ACCESS-OM2 framework.
  5. Capability gaps: The COSIMA community has been able to leverage expertise from a number of different programs. However, our community as a whole remains subcritical in several areas, including sea ice modelling and atmospheric dynamics.
  6. REDB: Nic Hannah proposed a new system for tracking simulations and the output data. This system was identified by many discussion groups as a potential solution to some of our collaboration roadblocks. We will investigate the viability of such a system.
  7. MOM6: Plan is to begin transition to MOM6, building up experience in the latter half of 2018.

Recommendations for COSIMA 2019 workshop

  • Institute a James Munroe award for contributions to COSIMA
  • Extend to a 2.5-day workshop to allow more time for discussion (not extra talks)