Technical Working Group Meeting, February 2019

Minutes

Date: 14th February, 2019
Attendees:

  • Marshall Ward (MW) (Chair) NCI
  • Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK) COSIMA, ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Peter Dobrohotoff (PD), CSIRO Aspendale

TWG Meta Stuff

AH will redo MOM5 governance doc for next meeting.
AH finding minutes a burden, MW suggested exploring other options.
MW: Will leave at the end of March. Will maybe try to attend after that, given time.
Even in anarchy someone has to send out the email.

CICE Meeting

MW: CICE meeting. AK going. Going to Hobart Ocean Workshop? MC & RF not registered, might drop in.
MW: Also a VC: chat with Elizabeth Hunke. Who is going to attend? Just me? AK: Yes. MW: AH come too? Ben Evans asked Rui to come. Not sure about NH. Assume interested.
MW: What to ask her about? Agenda? What motivated it? AK: Just that Elizabeth is around and could chat. AK: Any point me turning up a day before, more talking about Petra, but Petra thought it might be useful. Could show her how we set stuff up, some results?
RF: Anyone from Aspendale coming down? PD: Not sure. MC: Simon Marsland and Siobhan on the attendance list.
MW: Might ask about using latest GitHub branch (cice6). If we were to use it what should we do? Incorporate changes from OM2 codebase? Others more interested in physics?
AK: Might be interested in scaling work. Hoping to put some in my talk. MW: Fine with me.
MW: Not done as much as Tony Craig (?) on load balancing.
Monday 18th @3pm with Elizabeth (2 hours)
AK: Valuable networking opportunity.
MW: Would be great for NH to come.
MW: Maybe AK give a run down of some of the runs, start from there.

MOM5 Pull Requests

MW: RF been busy
RF: Bug in one of those in GW scheme. Was testing temperature in the wrong direction. Also something odd happens to temp rebinning at the bottom of a level compared to density. Missing value is zero. Interpolates first non-zero temperature to below bottom level. Because density in the rock is zero, can’t get a bounding. Problem with the way the diagnostic is originally done.
RF: When calculating transport in density, one doesn’t account for transport in the lower half of the bottom cell, but with temperature remapping you do. MW: Haven’t looked at the patch yet. Is this what Ryan Holmes was asking about? RF: This would speed up Ryan’s remapping. His PR was different. Trying to remap onto different levels. He sort of fudged the code. Took code from remapping onto density levels, and made something of a spoof that pretends neutral density is temp or salt. Don’t like what he’s done. Probably works, but not totally sure, and my optimisations might break some of the things he does. AH: Your optimisations are field dependent? RF: Yes. Assume it is density, with the assumption that density increases as you get deeper. MW: He added a neutral density thing? RF: Trying to trick the code into something else.
RF: Can’t do it on more than one variable.
AH: Might be worth telling Ryan this might break his code.
RF: I thought he had put the commit in there. AH: No, deleted the PR. He doesn’t have commit rights.
MW: Has a hard coded neutral density point that he has defined.
AH: RF still thought worthwhile? RF: Yeah, have a general thing, remap to level? A lot of code would be copy/paste. Could be a lot of work. AH: classes of rebinning?
MW: Not sure I understand exactly what RF’s commit does. Not sure I can add value.
RF: Just a lot faster.
AH: How did you pick up the error? RF: Was worried about it. Hadn’t checked rebinning to temperature. Wasn’t sure I had accounted for reverse in signs. In transport beta _ gm. Neutral physics utilities module. Checking for maximum and minimum temperatures on wrong levels. Hadn’t tested that diagnostic. Missed temperature. When tested failed. Doesn’t alter results of simulation, diagnostic slightly wrong. Other things were bit repro, all checksums were identical.
AH: So when code changes are made to diagnostics, make sure those diagnostics are tested. Make sure we paste in pics of diagnostics. Make sure to use double precision in `diag_table`.
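Illustration of the direction issue RF described: a minimal Python sketch, not RF's code. It assumes a single water column, treats zero as the missing value in rock, and uses a hypothetical `bracket_level` helper to find the pair of wet levels that bound a target value. Density increases with depth while temperature decreases, so the comparison has to be reversed for temperature.

```python
def bracket_level(column, target, increasing=True, missing=0.0):
    """Return (k_upper, k_lower) for the first pair of adjacent wet levels
    whose values bound `target`, or None if no pair bounds it.

    `increasing` says whether the field increases with depth (density)
    or decreases with depth (temperature). Cells equal to `missing`
    are treated as rock and ignored. Hypothetical helper for illustration.
    """
    # Note: a genuine value of exactly `missing` would also be dropped here.
    wet = [(k, v) for k, v in enumerate(column) if v != missing]
    for (k0, v0), (k1, v1) in zip(wet, wet[1:]):
        lo, hi = (v0, v1) if increasing else (v1, v0)
        if lo <= target <= hi:
            return k0, k1
    return None


density = [1025.0, 1026.0, 1027.5, 0.0]   # last cell is rock
temperature = [20.0, 12.0, 4.0, 0.0]      # last cell is rock

print(bracket_level(density, 1027.0))                      # bounded by levels 1 and 2
print(bracket_level(temperature, 8.0, increasing=False))   # bounded by levels 1 and 2
print(bracket_level(temperature, 8.0, increasing=True))    # wrong direction: no bound found
print(bracket_level(density, 1028.0))                      # denser than bottom cell: no bound
```

The last call shows why a target below the bottom wet level cannot be bounded when the rock value is zero.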

MOM5 Governance

Last month agreed to tackle PRs. MW: Paul never answered. Other didn’t answer. AH: AK didn’t answer!
MW: A lot of weird hard constants in FMS. Data structures are weird.
MW: Other PRs when we got no answer? Ask for an update after no interaction within a month? Have some policy? Paul’s looks more valuable. Other one is more FMS. Could call by phone.
General approach for non-responding PRs: Get in contact again. Warn it will be closed. Close and say they can reopen.
MW: Sometimes got good ideas with poor implementation, accepted and completely reimplemented. RF: Short one best to redo a different way, and reject the FMS stuff. Contact Paul and get it done?
MW: No answer after prolonged time, incorporate good ideas in a different branch.
AH: Why coding now? RF: Had these ideas for ages, but noticed low hanging fruit. Remapping and submesoscale. Knew we could make significant time savings. Knew about these ages ago. Similar with tidal mixing. AH: Used MOM timings? RF: It was slow, and looked at it and wondered about looping. MC: With changes what improvements? RF: 20-30% in each module. I run short cases, so data writing might dominate a bit. Will depend on the size of the model. Time spent on each tile proportional to mixed layer. MW: Shallow levels will be a big improvement? RF: Yes. MW: Not iterating where there aren’t values? RF: Yes. Two types of tests: check if entire tile can be stopped, other times if a latitude can be stopped. RF: Did test of 1200 cpu job on OFAM grid, took 30% off those routines.
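Illustration of the two tests RF mentioned (stop a whole tile, or stop a single latitude row): a minimal Python sketch under assumed names, not the MOM5 code. `kmld[j][i]` is a hypothetical count of model levels inside the mixed layer at each point, and `mixed_layer_sum` is a stand-in for a parameterisation that only does work inside the mixed layer.

```python
def mixed_layer_sum(field, kmld, skip_inactive=True):
    """Sum field[k][j][i] over cells inside the mixed layer (k < kmld[j][i]).

    Returns (total, visits), where visits counts the cells the loops touched.
    With skip_inactive=True the level loop stops once the whole tile is done,
    and rows (latitudes) with no remaining mixed-layer cells are skipped.
    """
    nk = len(field)
    row_kmax = [max(row) for row in kmld]   # deepest mixed-layer level per row
    tile_kmax = max(row_kmax)               # deepest mixed-layer level in tile
    total, visits = 0.0, 0
    for k in range(nk):
        if skip_inactive and k >= tile_kmax:
            break                           # nothing left anywhere in the tile
        for j, row in enumerate(kmld):
            if skip_inactive and k >= row_kmax[j]:
                continue                    # nothing left on this latitude
            for i, kbot in enumerate(row):
                visits += 1
                if k < kbot:
                    total += field[k][j][i]
    return total, visits


# 4 levels, 2 rows, 3 columns, field of ones; second row has no mixed layer.
field = [[[1.0] * 3 for _ in range(2)] for _ in range(4)]
kmld = [[1, 2, 1], [0, 0, 0]]

print(mixed_layer_sum(field, kmld, skip_inactive=False))
print(mixed_layer_sum(field, kmld, skip_inactive=True))
```

Both calls return the same total; only the number of cells visited differs, which is consistent with the cost being proportional to mixed layer depth.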
AH: submeso is 10% of total ocean runtime.
RF: Starting a big run, good time to get it in.
MW: Sometimes said MOM was well balanced. Aggressively masks everything.
RF: Imbalance comes through the parameterisation code. KPP, tidal mixing. Found another weird thing in the barotropic routines. Takes a lot of time. eta and pbot diagnose. No reason to diagnose the pressure at bottom on a u cell, except if you’re writing the diagnostic. AH: Standard for the code to check if diagnostic used before calculating? RF: Required for restart file. Check at restart stage and write it out that time. AH: Don’t restarts have to be field_table? RF: No.
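Illustration of the guard AH and RF discussed: a minimal Python sketch with hypothetical names (`DiagTable`, `run`), not the FMS diag manager API. It assumes a registration call that returns a positive id only when the field was requested, and it computes the expensive quantity only when the diagnostic is requested or a restart is being written.

```python
class DiagTable:
    """Hypothetical stand-in for a diagnostic registry."""

    def __init__(self, requested):
        self._requested = set(requested)
        self._registered = []

    def register(self, name):
        """Return a positive id if `name` was requested, otherwise -1."""
        if name not in self._requested:
            return -1
        self._registered.append(name)
        return len(self._registered)


def run(nsteps, diag_id, restart_step, compute):
    """Call `compute` only on steps where its result is actually needed."""
    computed_at = []
    for step in range(1, nsteps + 1):
        if diag_id > 0 or step == restart_step:
            compute()
            computed_at.append(step)
    return computed_at


calls = []
table = DiagTable(requested=["eta_t"])     # bottom pressure not requested
pbot_id = table.register("pbot_u")         # hypothetical field name
print(run(5, pbot_id, restart_step=5, compute=lambda: calls.append(1)))
print(len(calls))
```

With the diagnostic not requested, the computation runs once, at the restart step, instead of on every step.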
AH: If they don’t affect science can add to 0.1 at any time.
Ocean eta and pbot diagnose 10% of runtime.
AH: Should we prioritise any changes? RF: Just the ones I have put in. Others not so much. I’ll fix up the PR. Just got it compiled and am testing.

netCDF Parallel MPI IO

MW: Parallel IO stuff looking good and nearly done. Getting parallel IO without collation. Even restarts. A few masked cases where things look odd with completely missing values.
MW: Fill value versus zero over land? If I run mppnccombine, it intelligently turns zero over land into missing values.
RF: When MOM sends diagnostics sends a mask with the call.
MW: Should land be zero or fill value? RF: should be fill. MC: What about restarts? RF: Used to have zero and then changed. Turned up in the density restarts.
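Illustration of why the choice matters: a minimal Python sketch, assuming a hypothetical fill value of -1.0e20 and a wet mask sent alongside the data. With zero over land, a land cell cannot be told apart from an ocean cell whose value is genuinely zero.

```python
FILL_VALUE = -1.0e20  # hypothetical fill value, for illustration only


def apply_land_mask(values, wet_mask, fill_value=FILL_VALUE):
    """Replace land cells (wet_mask False) with the fill value."""
    return [v if wet else fill_value for v, wet in zip(values, wet_mask)]


def ocean_mean(values, fill_value=FILL_VALUE):
    """Mean over cells that are not the fill value, or None if there are none."""
    ocean = [v for v in values if v != fill_value]
    return sum(ocean) / len(ocean) if ocean else None


values = [2.0, 0.0, 0.0, 4.0]          # second cell is ocean with value zero
wet_mask = [True, True, False, True]   # third cell is land

zero_over_land_mean = sum(values) / len(values)          # land counted as 0.0
masked_mean = ocean_mean(apply_land_mask(values, wet_mask))

print(zero_over_land_mean)   # 1.5
print(masked_mean)           # 2.0
```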
AH: Performance?
MW: As fast as the number of disks. Can be subtle to configure. Have to balance the nodes with io_layout with ncpus on node. Negligible with 0.25 deg. Write speeds at about speed of lustre (half speed x number of disks).
PD: Fan of missing_value stuff. Parallel IO work from Dale.
MW: Rui will know about timing variance. Worried GFDL will find it slow and reject. Rui looked into compressed parallel IO. Interesting results. Reasonably fast. It’s half the speed of non-compressed. What is the serial (offline) compression time? No idea. AK: Is speed MB/s. Or twice as slow for total data file? MW: Twice as slow as the entire dataset.
MW: Currently uncompressed. Can then compress. RF: Need to work for regional output. MW: Do at FMS level. AH: Should test for regional output. RF: Regional output done by geographic rather than index. If by index would make it easier. MW: If you can get that for a test.

Actions

New:

  • Amend MOM5 governance doc (AH)
  • Feedback to RF PRs (MW+AH)
  • Check back on Paul’s PR (MW)

Existing:

  • Shared google doc on reproducibility strategy (AH)
  • Pull request for WOMBAT changes into MOM5 repo (MC, MW)
  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (MW)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (RF)
  • Move FMS to submodule of MOM5 github repo (MW)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (MW and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (NH)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (AH, NH)
  • Add RF ocean bathymetry code to OceansAus repo (RF)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (NH).
  • CICE and MATM need to output namelists for metadata crawling (AK)
  • Provide 1 deg RYF ACCESS-OM-1.0 config to MC (AK)