Technical Working Group Meeting, March 2018

Minutes

Date: 13th March 2018
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen and Andrew Kiss (CLEX ANU)
  • Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff, Roger Bodman (CSIRO Aspendale)

Models

ACCESS-OM2-01

Andrew: stability was bad. A couple of weeks ago Ben Menadue said a dodgy cable had been fixed, and then it was working fine. Lately maybe 1 in 8 model runs fails. The model stops before time stepping: initialisation completes but there is no model runtime and it never timesteps. All the hangs look the same. Saved most of the outputs. The hangs Marshall was looking at were in the OASIS restart writes (OpenMPI 2.1). OpenMPI 1.10.2 fails differently and maybe more randomly. OpenMPI 2.0 fixed one problem but added a new one.
Aidan: Russ's monitoring code might be useful to kill a hung job and resubmit.
Aidan: Should use padb to find where processes hang. Marshall: it can take 5+ minutes to produce a trace. Aidan: Need to give Marshall info on the crashes.
Aidan: ACCESS-OM2-01 is running on only 4.2K processes, so it shouldn't be expected to be as fast as MOM-SIS-01, which was running on 5232 (a 7200-core config, masked).
Andrew: can’t get model on to queue during business hours. Runs start in the evening.
Aidan: Queue stuff unavoidable. Just very busy with some large jobs queued that Andrew’s job can’t leapfrog.
Andrew: ACCESS-OM2-01 is quite variable in runtime, between 2 and 3 hours per month, with just one month per submit; the variability forces shorter runs. Currently running a 360 s timestep for every month but August. Currently testing 400 s.
Marshall: tried the yalla pml? Any effect? Andrew: hard to tell. Marshall: a 100% reproducible OpenMPI 2.0 hang went away with yalla. Andrew: of the first 4 submits, 2 hung. Seems ok now, but no big improvement.
Marshall: rewrote the global field collective to use alltoallw (it is actually a gather). Didn't fix the hang. The 1 and 0.25 degree models worked and produced the correct answers. Tried throttling, updating a fraction of the domain at a time, and 0.1 degree ran. The library can't handle the number of messages.
Russ: is the global field gather just for restarts? Marshall: no and yes; it is used to produce the OASIS restart, but it is also used whenever we use io_layout, via a function that gathers the whole IO subdomain onto an individual master rank inside FMS. Those calls probably don't fail at the moment. Thought that function was a relic, but it is still there and still used. The code changes did work, but failed on IO servers when an IO server has only one rank (which happens with masked runs).
If Marshall can get it working, we can chunk the alltoallw and free ourselves from magic MPI/accelerator flags. This is a positive thing.
Marshall: hanging on a single tile in a masked run is a bug. MOM has some logic that checks for a single tile and skips some code, which might make other ranks hang if they are expecting communication.
Marshall: now using MPI derived types, which avoids intermediate buffers.
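
As a rough illustration of the approach being described (not the MOM/FMS code itself), the sketch below gathers distributed subdomains onto a root rank with MPI_Alltoallw and derived subarray datatypes, so no intermediate pack/unpack buffers are needed. The decomposition, sizes and variable names are all assumptions.

    # Hedged sketch: gather per-rank blocks onto a root rank using Alltoallw
    # with MPI derived (subarray) types, assuming a simple 1-D row decomposition.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nprocs, root = comm.Get_rank(), comm.Get_size(), 0

    ny_local, nx = 4, 8                          # local block shape (assumed)
    local = np.full((ny_local, nx), rank, dtype='d')

    # Each rank sends its whole block, but only to the root.
    sendtypes = [MPI.DOUBLE.Create_contiguous(local.size).Commit()
                 for _ in range(nprocs)]
    sendcounts = [1 if p == root else 0 for p in range(nprocs)]
    senddispls = [0] * nprocs

    # The root receives each block straight into its slot of the global array,
    # described by a subarray type: the derived type replaces a copy buffer.
    glob = np.empty((ny_local * nprocs, nx), 'd') if rank == root else np.empty(1, 'd')
    recvtypes, recvcounts, recvdispls = [], [], [0] * nprocs
    for p in range(nprocs):
        if rank == root:
            recvtypes.append(MPI.DOUBLE.Create_subarray(
                [ny_local * nprocs, nx], [ny_local, nx], [p * ny_local, 0]).Commit())
            recvcounts.append(1)
        else:
            recvtypes.append(MPI.DOUBLE)
            recvcounts.append(0)

    comm.Alltoallw([local, sendcounts, senddispls, sendtypes],
                   [glob, recvcounts, recvdispls, recvtypes])
    if rank == root:
        print(glob[:, 0])                        # first column: 0,0,0,0,1,1,1,1,...

In the real FMS gather the exchange would additionally be chunked or staged so the MPI library is not flooded with messages, which is the throttling discussed above.
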
Roadmap for OpenMPI updates:
  1. Resolve the issue with io_layout when an IO server has a single rank
  2. Clean up MPI code, get out of FMS code
  3. Try chunking/staging in alltoall. Might not need MXM at all
  4. Try to get the changes into FMS and switch to an independent FMS submodule (maybe GFDL's)
  5. Address performance issue
  6. Run on OpenMPI 2.0
Marshall: Hope to solve the random hangs, run on supported libraries, be less dependent on magic MPI flags, and be more resilient for the new machine. Maybe wait to switch to the submodule until GFDL accepts the patch?
Marshall: If MPI issues happen again we have a better strategy. We can't just replace point-to-points with collectives: the library has issues and won't scale to a new machine.

MOM-SIS-025-WOMBAT

Matt: Paul Spence is happy with WOMBAT 025. Paul isn't seeing hangs? Aidan: does Matt see WOMBAT runtime variations? Matt: not too frequently; maybe check with Paul. When Matt did initial testing he got identical output and timings, so similar that he wasn't sure whether he had put the new code in!
Marshall: Paul was getting a hang in the xgrid init; the new code is required for that. The runtime behaviour is weird. Need more info before getting worked up.
Marshall: diag_step stuff is really slow. Expensive function. Scales horribly.
Aidan: Matt, have you done a pull request to mom-ocean? Matt: does it need testing? Marshall: if it works for you, don't worry. Matt: there are a number of hooks from the tracer package into the ocean and ice models. Similar to ACCESS, but not quite the same. A few extra variables in the boundary package.

Other business

Marshall: unlimited time axis on SSS restoring? Aidan: yes, that was an issue; it is fixed.
Roger: not that busy in this space. Still wondering about the change between a 12×8 and an 8×12 layout. Marshall: there is no expectation for that to be reproducible. Marshall: the restart issues seem more severe; have they been looked into? Peter: no; tasked with fixing them, but hasn't looked at them this year.

Common codebase

Peter: the agenda item was that Nic and I would bring OM2 and CICE5 to a common codebase. Nic doesn't see this as feasible anymore because the models have diverged so much.
Marshall: does he mean he has put in coupling code that has diverged from yours? Peter: not sure. Hoping to talk to Nic today about this. Is the idea really dead? Just taking a lead from Nic, as I don't know about the other codebases. Had a couple of meetings with Nic. Nic went and looked at the code, expected the differences to be trivial, but they weren't.
Peter: would like to work from a common codebase, and would like to capture the activity on GitHub. Some scientists would like them to be the same; can't really make the case for that, but it is what they want. Not sure how to proceed. If we don't share code now, we never will. Do we just drop our MOM5 and grab the GitHub version? Seems like a lot of work at this point.
Matt: can you clarify the relationship between the code bases? Is it closely related to OM, CM etc? Peter: No, can’t give clarity.
Peter: in 2015 Nic and Hailin put together a version of MOM5 that was to be used with GA6. No idea what was specific about Hailin's version. Not sure why they can't be brought back together. Can we do some emails/issues to get this moving? There is a month.
Russ: I'd like to be brought in on this. Part of my decadal prediction work is to couple WaveWatch III into ACCESS-CM2 and OM2. Worried that they are diverging so much.
Marshall: Andy would be disappointed about this news. Six years ago we aspired to this goal. Three years passed and nothing happened, ok, but to drop it now is unfortunate. Marshall: it is difficult for Aspendale, and volatile with runs about to begin. Is the next CMIP aspiring to do this? Disappointing for the science people. Can resources be pumped into this?
Andrew: one of the objectives of COSIMA was to avoid duplication of effort. Marshall: we are doing better than duplication; there is some redundancy.
Aidan: I think MOM should be relatively straightforward to harmonise; CICE is the issue? Russ: yeah, the problem is with CICE. A lot of the things that are done in the individual components should be done in OASIS-MCT. There is averaging and double looping that is really confusing; using native OASIS calls to do the averaging would be much simpler. The old OASIS had to bring everything onto one processor, which was a disaster, so the decisions made then were sensible, but they are coming back to bite us.
Russ: the way to run with OASIS is to call it every timestep and let OASIS decide. Then you don't need specialist code in the individual components. It is distributed, and time averages can be done on the local processor.
Marshall: I think the problem Nic found was an inefficiency in the ocean/ice communication, with a weird log-jamming of messages. Maybe that makes a merge undoable. Nic has addressed this, but did it in an ocean/ice context without much consideration of the atmosphere.
Marshall: Nic and I were going to sit down and look at it.
Russ: would like to get my head around it.
Aidan: can we converge to a common codebase? Maybe CM needs to make these changes anyway?
Peter: Nic said we need to get the CM2 and ESM code up to date with MOM5. From the CM perspective we need to be a bit risk averse, but there is also a risk in not changing. Already doing spin-ups for CMIP6. Specific changes I am aware of: the pull request from Fabio (ticket #211?), a patch from Steve Griffies, and Paul Spence's convection code changes, which were important to his OM2 model, haven't been regression tested, and are needed for a student.
Peter: these changes are important enough to pull into CMIP6. They conflict with a direct merge, but can be done by hand. Marshall: hand merge if required at this stage.

Actions

New:

  • Follow up with Andy Hogg regarding shared codebase (Marshall)
  • Marshall to liaise with Andrew Kiss about tenth model hangs (Andrew, Marshall)
  • Pull request for WOMBAT changes into MOM5 repo (Matt, Marshall)
  • Compare OASIS/CICE coupling code in ACCESS-CM2 and ACCESS-OM2 (Russ)

Existing:

  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (Marshall)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (Russ)
  • Profile ACCESS-OM2-01 (Marshall)
  • Move FMS to submodule of MOM5 github repo (Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
  • Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).
  • Nudging code test case (Russ)
  • Redo SSS restoring with patch smoothing (Aidan)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (Andrew)

Technical Working Group Meeting, February 2018

Minutes

Date: 13th February 2018
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen (ARCCSS/ARCClEx ANU)
  • James Munroe (Memorial University of Newfoundland)
  • Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff (CSIRO Aspendale)

MPI Errors on raijin

  • Marshall still having MPI problems.
  • Russ is running ~1200-processor jobs, but hasn't been running a lot and hasn't had any big MPI problems.
  • Russ had a getcwd() race condition in the kernel, using mom4p1 reading the input namelist. The error was "can't get current working directory". Hit a similar problem 8 years ago; it was a Lustre problem then. The run crashed, but it only happened once. The same file being read by every processor was maybe the issue. Newer MOM does one read and uses MPI to distribute (see the sketch after this list).
  • Marshall found a performance issue in MOM on a Cray system using NFS rather than Lustre. It struggled to read ocean_grid.nc, maybe due to multiple reads of the same file; it took 40% of the run time to open and read the file. Maybe we need some testing on different systems.
  • Kial Stewart had some run time blowout issues. Wondered if it could be an MPI problem. Aidan asked if others had seen similar issues. Matt is not running production runs. Not sure if he has had any run time issues. Hasn’t got a clean baseline. Not seeing anything like 2x longer.
  • Marshall: lots of people have hanging jobs issues right now. NCI has had waves of issues over past 5 years where this happens more frequently and then abates. Rhodri at RSES having similar issues, may mean NCI will take more notice.
  • Matt will keep monitoring. Marshall: can switch over to a debug MPI library, which will give better info. Ben M says it will run slower; Marshall hasn't noticed a slowdown, so if the runtime is similar maybe use it. Marshall is learning about MPI debug info, TBs of it. Intel 17 now includes C code in the backtrace. Debug gives a bit more about the MPI state when it bails.
  • Matt using old compilations. In coupled decadal config switched from mom4p1 to mom5 last year, recompiled then.
  • Russ got the openmpi/1.6.3 library path issue, recompiled and was fine.
  • Peter working now. Memory pinning stuffed them up. Perhaps due to resource exhaustion. Had some problems with getting on the queue. Was using up the time too quickly.
  • Marshall: epic issues with MPI 2.x. Running on tenth. Getting random rank fails. Not always the same. When it fails it thrashes the stack and kills the backtrace process. Get a backtrace of the backtrace. Other nodes wipe their stack and stop gracefully. Severe stack overflow, might be in the library itself. Trying valgrind. Failing in collectives. Running access-om2. Test programs are fine. Seems to be a memory thing. Need to run a large model for a decent time. OpenMPI 3.x fails in the same way as OpenMPI 2.x in a memory offset function (longjmp_).
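
A minimal sketch of the read-once pattern mentioned in the getcwd() item above (the file name is assumed, and this is not the actual FMS code): only rank 0 opens the namelist and the text is broadcast, so the filesystem sees one read instead of one per PE.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Only the root PE touches the file; every other PE gets the text via MPI.
    if comm.Get_rank() == 0:
        with open("input.nml") as f:
            nml_text = f.read()
    else:
        nml_text = None
    nml_text = comm.bcast(nml_text, root=0)

    # Each PE can now parse nml_text in memory with no further filesystem traffic.
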

Models

  • Paul's WOMBAT runs were failing because they didn't have the xgrid alltoall. Has Matt put the WOMBAT changes into MOM5? Matt: WOMBAT has some ugly hooks into the ice/ocean boundary conditions. Currently putting WOMBAT into the ACCESS coupled model. The update to MOM5 should be straightforward.
  • Russ has put current MOM5 code into CM2.5 for decadal prediction. Should add in WOMBAT. Try and keep it up to date as much as possible. Kill two birds with one stone.
  • Russ's monitor code runs on one processor; if the whole job stops, it takes it down. All you have to do is instrument MATM. Wrote it so it does not interact with the WORLD communicator from OASIS; it gets spawned as a slave. If it takes a long time between segments it will issue an abort, which will cascade. It only needs to communicate with MATM, but needs knowledge of how long to expect things to take. It does an MPI_Comm_spawn and issues an abort to its parent (MATM); a sketch of the idea follows this list. Marshall: could we integrate this into MATM? Does it need an extra rank and sub-communicator? Russ: yeah, could do something more complicated, but this was easy and stands alone, not interacting with OASIS. MATM already ends up wasting ranks.
  • CICE ncpus. Russ: could CICE use a halo approach like the MOM barotropic solver? Marshall: has had this suggestion before. CICE's future was uncertain and it was never a big bottleneck or resource user, so it wasn't a target.
  • Marshall: will profile OM2 at tenth. Russ: optimise the processor layout? Nic did it. Could we improve this? Sub-blocks? 90×90. Marshall: slender2? Russ: no, cartesian, a single 90×90 block. Russ: once you get to large processor counts it starts to fall apart. No sub-blocking; compiled with 90×90. Marshall: CICE is best suited to hybrid: 1 CPU/node, n threads per rank, with load balancing of threads. We are a long way from that.
  • Russ looked at some of Andrew’s timings. Hard to make much sense. Without knowing where all the blocking/synchronising is happening.
  • Marshall: fan of score-p. Met the developer. Can make cartesian maps of processor timings. Did it for LIM profiling.
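
The watchdog idea from the monitor-code item above, as a toy sketch only (the timeouts, tags and segment lengths are invented, and this is not Russ's actual implementation): the monitor is spawned with MPI_Comm_spawn, listens for heartbeats from its parent, and aborts the parent intercommunicator if a heartbeat is overdue, so a hung job is killed rather than left burning queue time.

    import sys, time
    from mpi4py import MPI

    TIMEOUT = 600.0                              # assumed seconds between segments

    parent = MPI.Comm.Get_parent()
    if parent != MPI.COMM_NULL:
        # Watchdog (spawned) side: wait for heartbeats, abort if they stop.
        last = time.time()
        while True:
            if parent.Iprobe(source=0, tag=0):
                if parent.recv(source=0, tag=0) == "done":
                    break
                last = time.time()               # heartbeat received, reset the clock
            elif time.time() - last > TIMEOUT:
                parent.Abort(1)                  # parent looks hung: bring it down
            time.sleep(1.0)
        parent.Disconnect()
    else:
        # Instrumented MATM side: spawn the watchdog and heartbeat each segment.
        wd = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__], maxprocs=1)
        for segment in range(10):
            wd.send("alive", dest=0, tag=0)
            time.sleep(5.0)                      # stands in for one coupling segment
        wd.send("done", dest=0, tag=0)
        wd.Disconnect()
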

MAS Database for COSIMA Cookbook

  • James: has queried the DB and taken a quick look. Marshall: happy with it? James: it is useful that a third party is doing the filesystem crawling. We might need to do some extra crawling for faster updates. Their shard approach will scale better than James'. Will go forward, take the schema they have developed and write tools to use that schema. Will use MAS and fall back to a local DB when it is not available. They gave us what we asked for.
  • Aidan still unable to access it due to a low uid number not playing nicely with security settings.
  • MAS will be good for other people who might have data they want to use that is not under our control.
  • James: regarding the COSIMA Cookbook experiment DB file, if you delete the file and recreate it the umask will stuff up the permissions. Logic can be put into the software to do the permissions checking (see the sketch after this list).
  • With MAS DB, James can keep an eye on how often it is updating with a view to requesting more frequent updates if needed.
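
A minimal sketch of the permissions check mentioned in the cookbook DB item above (the path is illustrative, not the real location): after (re)creating the database file, make sure the group bits are set rather than trusting the user's umask.

    import os, stat

    DB_PATH = "cosima_cookbook.db"               # illustrative path only

    def ensure_group_access(path, wanted=stat.S_IRGRP | stat.S_IWGRP):
        """Add group read/write bits to the file if the umask stripped them."""
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & wanted != wanted:
            os.chmod(path, mode | wanted)

    ensure_group_access(DB_PATH)
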

MPI-IO

  • Marshall: Rui has made a lot of progress on MPI-IO. Marshall wrote a bunch of Fortran hooks into parallel netCDF and put them into FMS, then handed it over to Rui, who has worked on it a lot. netCDF4 struggles: it takes a long time to open and close files, with serious synching issues around the metadata; that is the bottleneck in MPI-IO when tapping into parallel HDF5, which has been around for a long time but no one uses. Rui has a working relationship with the HDF5 group at Urbana. Rui has switched to PnetCDF, which works really well and doesn't have the metadata synch issues. It is not what we traditionally use, but he sees speed-ups over the serial case, 3-10x faster; writes that took a few hours take an hour. It is usable, and the metric shows a clear improvement (see the sketch after this list).
  • The downside is that it writes netCDF3, which isn't what we usually produce; we usually stitch multiple files together. This will eliminate post-processing and make a single coherent file. Not sure what to do: do we want to use this approach? It is good enough to put into the main MOM code, but turned off by default. It is very sensitive to Lustre striping and the number of writers; correct Lustre settings are essential.
  • If the overhead is a few minutes, it might be convenient to eliminate post processing. Thinking ahead to 1/30th simulations. Aidan: would still need to post-process and convert to compressed netCDF4.
  • James: mppnccombine isn’t based on pnetcdf? Russ: no. Was attempted, but now abandoned. Marshall: Rewrite mppnccombine to use MPI-IO? Yes good idea, but NCI wanted to test it inside a model.
  • This has been very instructive. netCDF gets in the way, and so does HDF5 underneath it. Not sure what the best way forward is. It might scale with the number of writers: maybe 1-2 writers per node, with collectives on the nodes and the output written by that number of writers.
  • Dale has done this with native IO in the UM and got a 4x speed-up, though it is quite profligate with CPU hours.
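
For reference, a hedged illustration of what a parallel write looks like from Python's netCDF4 module (this is not Rui's FMS work; the file names, sizes and choice of CDF-5 format are assumptions, and it requires netCDF/HDF5 or PnetCDF builds with MPI support):

    from mpi4py import MPI
    import numpy as np
    from netCDF4 import Dataset

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()
    nlat, nlon = 64, 128                         # each rank writes one latitude band

    nc = Dataset("ocean_parallel.nc", "w", parallel=True, comm=comm,
                 info=MPI.Info(), format="NETCDF3_64BIT_DATA")
    nc.createDimension("lat", nlat * nprocs)
    nc.createDimension("lon", nlon)
    temp = nc.createVariable("temp", "f8", ("lat", "lon"))
    temp.set_collective(True)                    # collective writes suit Lustre striping

    # Every rank writes its own slab directly: no gather onto a root PE and no
    # mppnccombine step afterwards.
    temp[rank * nlat:(rank + 1) * nlat, :] = np.full((nlat, nlon), rank)
    nc.close()

As noted above, the number of writers and the Lustre striping would still need tuning; a real configuration might restrict writes to one or two ranks per node.
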

Actions

New:

  • After FMS moved to submodule, incorporate MPI-IO changes into FMS (Marshall)
  • Incorporate WOMBAT into CM2.5 decadal prediction codebase and publish to Github (Russ)
  • Profile ACCESS-OM2-01 (Marshall)

Existing:

  • Move FMS to submodule of MOM5 github repo (Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
  • Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).
  • Nudging code test case (Russ)
  • Redo SSS restoring with patch smoothing (Aidan)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (Andrew)

Technical Working Group Meeting, January 2018

Minutes

Date: 10th January 2018
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen, Andrew Kiss (ARCCSS/ARCClEx ANU)
  • Fanghua Wu (National Climate Center, China Meteorological Administration, Visitor ANU)
  • James Munroe (Memorial University of Newfoundland)
  • Russ Fiedler, Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff (CSIRO Aspendale)

MPI Errors on raijin

  • Marshall: MPI updates on raijin to try and improve MPI performance by turning on all mellanox accelerators. Mellanox can be hit and miss. Had to recompile as fabric was updated. OpenMPI has direct calls to mellanox in library?
  • Andrew: tenth job hung. Raised issue. Ben said turn off all mca flags. Then reading the conf file. Lustre issue. Should be fixed.
  • Marshall: fca is now turned off. MOM has few collectives.
  • NCI changed default config settings which may have let other jobs work. Has now turned everything off. Maybe try and get some UM/MOM testing into NCI upgrade testing suite to avoid these issues.
  • Nic: if we can't get tenth running, fairly sure it is the mxm stuff. Marshall: mxm reported a "too many retries" error (jobs become lost). Andrew's jobs are hanging in MPI_Init(). Nic: experienced a hang on a gather in OASIS without mxm.
  • Russ: working on CM2.5. Getting saturation vapour pressure errors in the first timestep, which is why Russ implemented the traceback update in the MOM5 code. Sat-vap errors are written to stdout, but anything written to stdout not on the root PE doesn't get output. FMS has stdlog, stdout and stderr; unless recompiled, output will not make it to stdout, since in FMS most of it gets redirected to /dev/null for every PE except the root PE. This is also a performance issue.
  • mpirun can report which rank produced which line (tag output).
  • Nic: so any "print *" statements will have a major performance hit. There used to be a lot of them in MOM5 (see the sketch after this list).
  • Marshall: will have to assess each accelerator separately from now on. There are 5 accelerators, 2 of which were never implemented. Now all 5 are available, all off by default; initially they were on, but caused instabilities.
  • Marshall: NCI are continuing to push OpenMPI 2/3. We still have speed issues and need to get on top of them. Memory pinning has changed, either explicitly or in the defaults. This is a critical performance issue for us: we need to be able to say why we don't use OpenMPI 2.
  • Peter: need a broader discussion around older versions, compiled against older libraries. Papers have been published with these executables. Do we continue to support software which is still being used for science? Scientists aren't always interested in the same things as HPC people. A bigger discussion needs to be had: to what extent does NCI, as a partner, need to support these deprecated libraries?
  • Marshall: our jobs to communicate our issues to program leaders
  • Peter: is reproducibility off the table once hardware changes? Yes. By 2nd quarter next year clean slate.
  • James: do we need to run these models on other hardware for reproducibility? Marshall: NCI have some secret stuff (other architectures)! Not on the normally accessible queues, but maybe could be used for testing?
  • If we have our jobs in MOM6 tests we get cross platform tests for free.
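
A trivial sketch of the stdout point above: unguarded prints execute on every PE (and in FMS mostly end up in /dev/null anyway), so diagnostics are better guarded by rank.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def log(msg, root=0):
        """Print only on the root PE; the other ranks pay no I/O cost."""
        if comm.Get_rank() == root:
            print(msg, flush=True)

    log(f"running on {comm.Get_size()} PEs")     # written once, not once per PE
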

Action Clean-up

  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic. Can't merge the pull request without testing. Ping Mirko about the pull request? Russ might be able to test Mirko's code. Russ to make a test case.
  • Create document outlining options for configuration sharing? No: configs are now on GitHub. Iterate on that.
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall):  not relevant. Delete.
  • Start a new google doc about coupler issues and MATM (Marshall). Too vague. Delete.
  • Add new access-om2 test cases to Jenkins test suite (Nic). Done and ongoing. Delete.
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall). Marshall raised this in front of NCI Systems. They do not want ANY SERVER OF ANY SORT running on any machine; processes should only talk to files on a filesystem. Could be sticky. What about an IO server of some kind?
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes — no progress to date. Important step.
  • Russ to add all his ocean bathymetry code to OceansAus repo. Not done.
  • Check current sea surface salinity restoring smoothing (Aidan). Russ: can see some strange patterns in the north (Laptev/Kara Sea). Russ to provide images on Slack.
  • Test Andy's 5 year config with different netcdf library versions to check the MATM error is not just a library issue (Aidan). No need. Fixed.
  • Send link to spinup diagnostics spreadsheet to Russ (Andrew Kiss). Done.
  • Follow up with NCI MAS people (Marshall). Need to turn on netcdf crawler on hh5, and need read access to postgres DB. There was some follow up to an email Aidan sent at the end of last year, promising “early January”.
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation? Marshall will do this.
  • Collation errors on regional outputs (Aidan). Fixed on Paul’s newest runs. Unknown why it occurred, possibly mismatched t and u grids.
  • Nic said he would get IAF working. Had to rewrite MATM to fix stuff up.

Actions

New:

  • Nudging code test case (Russ)
  • Redo SSS restoring with patch smoothing (Aidan)
  • Follow up with NCI MAS people. Need something by end of the month (Marshall)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • CICE and MATM need to output namelists for metadata crawling (Andrew)
  • Doodle poll for new meeting time (Marshall)

Existing:

  • Move FMS to submodule of MOM5 github repo (Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes (Nic, Peter)
  • Russ to add all his ocean bathymetry code to OceansAus repo (Russ)
  • Add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation (Nic).

Technical Working Group Meeting, December 2017

Minutes

Date: 14th December 2017
Attendees:

  • Marshall Ward (Chair) (NCI)
  • Aidan Heerdegen, Andrew Kiss (ARCCSS/ARCClEx ANU)
  • Fanghua Wu (National Climate Center, China Meteorological Administration, Visitor ANU)
  • James Munroe (Memorial University of Newfoundland, Visitor ANU)
  • Russ Fiedler (CSIRO Hobart)

 

Output file metadata indexing

  • MAS database at NCI: POSIX info plus an ncdump blob, nodal style. Can put an index on netCDF files and search by them.
  • James did similar thing for COSIMA cookbook running in user space. James has had no action on MAS DB so far.
  • Currently NCI is POSIX crawling hh5. James: need to switch on netcdf for certain directories.
  • James: Ben pitched MAS as a great innovation. Maybe Andy needs to formally ask Ben for this?
  • What can MAS deliver that the existing DB cannot? James: stopped developing his DB because of MAS. SQLite was fine at 40-50K vars/files, but the spinup of the 1 deg model has 1M+ variables/metadata entries and the SQLite DB is already 1-2GB; it only scales to about 1M rows. Can't deploy postgres without admin access. Could host one, but it should live on an NCI resource, so it makes sense to use MAS.
  • James: just a user role in DB and switching on netcdf indexing — should be fine. Marshall will follow up with NCI MAS bods to make sure this happens soon.
  • Andrew was concerned about whether this will have ongoing support. The use of MAS in other high profile projects (Geoscience Australia for example) means it is a critical piece of infrastructure.
  • James: can we just access their schema? They want to open source it, but are not sure how. James: NCI has Confluence; do they have a Bitbucket license? Marshall: no.
  • Need mom.out copied to hh5 as well, to be able to index the important info with f90nml (see the sketch after this list). Russ: the logfile has just the namelist info.
  • Andrew: any equivalent for CICE and MATM? Maybe not? Andrew: need to make CICE and MATM print out namelists.
  • Marshall: get Ben/Andy to endorse official use of MAS by CoE.
  • Marshall: do we need to add attributes to files to accommodate this? James: does the executable spit out a version string? No. Marshall: his build script puts in a version string. Russ: is the version part of FMS? Russ: Marshall took the version out when we moved to the oom version of FMS. See Issue #31 on GitHub (can't find the issue Russ refers to). Russ: we already have a version.c.
  • James: CSIRO wants some of the automated processing for decadal prediction. Can we apply to both?
  • James: make a MOM module? Marshall: make codebase a submodule of payu
  • Aidan talks about reproducible builds using spack. Reproducible builds require a package manager so that it can find and know about all the components of the build.
  • Marshall will put the git hashes into the MOM executable.
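
A hedged sketch of the namelist-indexing idea from the mom.out item above: parse a namelist with the f90nml package and flatten it into (group, variable, value) rows that a metadata crawler could load into a database. The path is illustrative only.

    import f90nml

    nml = f90nml.read("output000/ocean/input.nml")    # illustrative path

    # Flatten group/variable pairs into rows a crawler could insert into a DB.
    rows = [(group, name, repr(value))
            for group, variables in nml.items()
            for name, value in variables.items()]

    for group, name, value in rows[:5]:
        print(f"&{group} {name} = {value}")
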

COSIMA Models

  • Andrew's tenth degree runs: salinity crashes in the Arctic. Recent crash: MPI Abort error code 111. Resubmit? Use Broadwell.
  • Andrew: has added regional runoff caps, with tighter caps on Arctic rivers.
  • Paul Spence issue with regional outputs, had incorrect bounds. Might affect in future. High temporal resolution in small regions.
  • Russ: something weird happened a while ago: mixing velocity and tracer grids in a single file, at least for regional output. Mixing u and t grids? Aidan to look into it.
  • Migrating to FMS submodule. When Marshall updated to oom one of the open boundary cases broke. Took 2-3 weeks of scientific coding to fix.
  • Russ looking at CM2.5 and new FMS. AM4 has been released.
  • Marshall: will make FMS a submodule.  This works for decadal prediction people who will need this work done in any case.
  • COSIMA will do JRA55 IAF tenth run.
  • Wednesday meetings next year, 11.30am.

Actions

New:

  • CICE and MATM need to output namelists for metadata crawling (no-one assigned)
  • Get Ben/Andy to endorse provision of MAS to CoE (no-one assigned)
  • Make MOM (and other models) emit GitHub version hash (Marshall)
  • Collation errors on regional outputs (Aidan)
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Follow up with NCI MAS people (Marshall)

Existing:

  • Send link to spinup diagnostics spreadsheet to Russ (Andrew Kiss)
  • Nic add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation.
  • Test Andy’s 5 year config with different netcdf library versions to check MATM error is not a just a library issue (Aidan)
  • Check current sea surface salinity restoring smoothing (Aidan)
  • Russ to add all his ocean bathymetry code to OceansAus repo.
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Add new test cases to Jenkins test suite (Nic).
  • Start a new google doc about coupler issues and MATM (Marshall)
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Create document outlining options for configuration sharing (?)

Technical Working Group Meeting, November 2017

Minutes

Date: 14th November 2017
Attendees:

  • Aidan Heerdegen (Chair), Andrew Kiss (ARCCSS ANU)
  • Fanghua Wu (National Climate Center, China Meteorological Administration, Visitor ANU)
  • James Munroe (Memorial University of Newfoundland, Visitor ANU)
  • Nicholas Hannah, Anthony (Double Precision)
  • Russ Fiedler and Matt Chamberlain (CSIRO Hobart)

COSIMA Models

  • Discussion around publicising 1/10th model spin up, in case interested parties would like diagnostics saved.
  • Bluelink are interested in a full JRA55-do IAF style spin up, and would want 15-20 years of daily full 3D U, V, T, S and eta fields from it, which is what is required to construct ensembles/climatologies.
  • Nic looking into ACCESS-OM2-01 performance issues. Lots of time is spent in the ice coupling field halo updates, which are done in serial and so hold up the ocean. A definite target for optimisation. Should OASIS be used to fill the halos when it does the coupling step? Russ disagrees: OASIS shouldn't know anything about what goes on in the models. Gridding uses block trains, a 1:1 mapping between grids; if you do this you have a 1:many mapping, and you no longer have identical grids once you put in the halo information, which might break optimisations. When Russ looked at 1/4 deg, the hold-up was due to synchronisation just before that point; not sure about 1/10th. Want a barrier just before calling the clock before the halo update, to see whether it is a synchronisation issue or actual time taken in the halo distribution. 5 halo distributions are being done, and heaps more in CICE itself. Nic: a land imbalance between ice processors? Russ: yes, that's my hunch; load imbalances change a lot with resolution and processor layout. Nic: it is a problem doing halo updates without considering where the field is used. Russ: agreed. Velocities need updating, not sure about tracers.
  • Fanghua has been running the new tenth bathymetry with the MOM-SIS-KDS75 config. With JRA55 RYF forcing the time step is now 450s (from 150s initially). The runoff data is now a problem, with very low salinities in the Arctic at about 7m depth, even with the 150s timestep. Created new runoff data, spread more into the ocean, but still have issues. Russ saw very high salinities in the Arctic (Laptev Sea); might be brine rejection from sea ice forming from an ice-free start. Suggests decreasing the salinity restoring timescale from the current 60 days to 10 days or even 1 day, to get the model over the initialisation. Andrew suggested the issue could be resolved with an initial sea ice climatology, but there were issues with those files and they have not been used for a long time. A recent poster to the mom-users google group has identified some of the problems.
  • Nic’s online runoff redistribution may help, as it is possible to specify maximum runoff per cell, which can help in these areas with very large runoff. Would require ACCESS-OM2-01 config.
  • Nic currently working on getting ACCESS-OM2-01 working with Russ’ new bathymetry. Had a couple of attempts. Getting close, various technical glitches with masks and so on.
  • Andy Hogg has a MATM issue when running ACCESS-OM2-1deg for more than 4 years at a time. There is an error on a netCDF open call, which comes from the HDF layer. Nic ran valgrind, found a bunch of errors, and so recommends everyone update their MATM, but this did not fix the 5 year issue. Determined this was not a memory error, but an HDF library error. Russ suggested using some HDF library calls to try and determine why the crash occurred. Also try different versions of the netcdf library.
  • Nic suggested we could change MATM to make fewer file open calls. Aidan has a new payu feature that allows multiple runs per PBS submission, so it was decided this is not a priority as MATM needs a complete rewrite.
  • Regridding. Nic: need to choose which interpolation schemes to use for which fields. 2nd order conservative for everything? Russ: velocity should not be conservative, since momentum is not conserved. Patch for velocities, T and S; that will give smooth flux fields. Nic: 2nd order conservative will be very smooth. Russ: do whatever is cheaper for T and S; U and V should be as smooth as possible. Patch should be 1st order conservative, possibly 2nd order. AK: 1st order conservative is piecewise constant (bad for wind stress curl); 2nd order is piecewise linear, so similar to bilinear; need to go to patch for smoother. Russ: tried 2nd order conservative and saw problems at corners, nodes and edges with wind stress curl; coarse-to-fine gets artefacts. Patch should work. AK: half of the fields are fluxes, and those should be conservative (2nd order ideally). The remaining fields are not fluxes, and there is no strong argument for conservative. Is there an issue with using different interpolation schemes for different fields? Will the bulk formula at fine scale be an issue? Russ: you will get jumps in some of the calculated fields. Quantities like T and S should be done with patch, and you end up with smooth fluxes. AK: does the surface stress bulk formula take atmospheric stability into account? Any drag coefficient? Russ: it does; it works out a profile. AK: uses SST and 10m T to determine stability? MC: yes. Say warm atmosphere sitting over a cold surface: that's stable, so air would slide over. Daytime, warm surface: near neutral stability, so not so sensitive. Possible for temperature and humidity to have a small effect on the drag coefficient. AK: if we use a different interpolation method for 10m winds/T, will it cause issues? Russ: a small jump in sensible heat maybe? Just go with patch or bilinear for all scalar quantities, and patch for velocity. How will it take into account rotation on the tripolar grid? Presumably it is handled well? AK: only an issue with velocity. Checked with the current forcing fields and it was ok; will check the new fields the same way.
  • AK: Final decision (see the regridding sketch after this list):
    • patch (the smoothest available) for u_10, v_10
    • 2nd order conservative for fluxes (rain, rdls, rsds, runoff_all and snow)
    • patch or bilinear for non-flux scalars (q_10, slp, t_10) suggest trying patch and only using bilinear if performance with patch is bad
  • Nic: what does MOM-SIS do? Aidan: thought Steve said bicubic; they used to use bilinear but it wasn't smooth enough. The smoother the better.
  • AK: should the WOA salinity restoring fields be smoothed in the same way? Nic: what do we currently do? Bilinear? Aidan did it. Russ: not a big issue if the salinity restoring is not too strong.
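
A hedged sketch of the decision above using the xESMF package (the actual weights are generated with the OASIS/ESMF tooling; the grids, resolutions and the use of xESMF's built-in first-order 'conservative' method are all assumptions, whereas the decision calls for 2nd order conservative where the tooling provides it):

    import numpy as np
    import xarray as xr
    import xesmf as xe

    def grid(res):
        """1-D lat/lon cell centres plus the bounds conservative methods need."""
        lat_b, lon_b = np.arange(-90, 90 + res, res), np.arange(0, 360 + res, res)
        return xr.Dataset({"lat": (["lat"], 0.5 * (lat_b[:-1] + lat_b[1:])),
                           "lon": (["lon"], 0.5 * (lon_b[:-1] + lon_b[1:])),
                           "lat_b": (["lat_b"], lat_b),
                           "lon_b": (["lon_b"], lon_b)})

    src, dst = grid(2.0), grid(1.0)              # toy atmosphere and ocean grids

    method_for_field = {"u_10": "patch", "v_10": "patch",          # smoothest
                        "rain": "conservative", "snow": "conservative",
                        "q_10": "bilinear", "slp": "bilinear", "t_10": "bilinear"}
    regridders = {m: xe.Regridder(src, dst, m)
                  for m in set(method_for_field.values())}

    u10 = xr.DataArray(np.random.rand(src.lat.size, src.lon.size),
                       dims=("lat", "lon"), coords={"lat": src.lat, "lon": src.lon})
    u10_fine = regridders[method_for_field["u_10"]](u10)
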

Task follow ups

  • Should be using GFDL FMS code directly. Would work better to collaborate with GFDL. Use same code, submit bug reports easily.
  • Once we have FMS as a submodule, use all the pre/post processing code from GFDL. Makes MOM5 leaner and easier to keep updated. Russ: what is the latest FMS version? Aidan: don't know, and it is hard to tell. Russ: noticed there are new features, like new diagnostic output options, e.g. RMS on the fly and other statistics, so things like diag_manager have been updated. There could be some other powerful tools.
  • Aidan: currently it is a huge step to upgrade. If it were a small step it could be really good. Not sure how Marshall did it, but it was not simple.
  • Nic has updated the access-om2 repo structure. Every single test case/experiment is in its own repo, which makes it easier for users to grab a config without worrying about other configs and source code. OceansAus now has more experiment repos. Aidan: Andy has an issue with git clashes with multiple runs in a single repo; this will fix that.
  • Blog posts?

Actions

New:

  • Will have a December meeting. Tue 12th.
  • Determine if COSIMA intend to do IAF JRA55 spinup of tenth model (Aidan)
  • Send link to spinup diagnostics spreadsheet to Russ (Andrew Kiss)
  • Nic add MPI barrier before ice halo updates timer to check if slow timing issues are just ice load imbalances that appear as longer times due to synchronisation.
  • Test Andy’s 5 year config with different netcdf library versions to check MATM error is not a just a library issue (Aidan)
  • Check current sea surface salinity restoring smoothing (Aidan)

Existing:

  • Russ to add all his ocean bathymetry code to OceansAus repo.
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Add new test cases to Jenkins test suite (Nic).
  • Start a new google doc about coupler issues and MATM (Marshall)
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Create document outlining options for configuration sharing (?)

Technical Working Group Meeting, August 2017

Minutes

Date: 8th August 2017
Attendees:

  • Marshall Ward (NCI, Chair)
  • Aidan Heerdegen (ARCCSS ANU)
  • Nicholas Hannah (ARCCSS/Double Precision)
  • Russ Fiedler and Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff and Roger Bodman (CSIRO Aspendale)

COSIMA Models

  • Nic: Toy model is good for yes/no type tests
  • Invite James Munroe to the TWG meetings. He could be of assistance getting a test suite together.
  • Nic: using the Issues interface on GitHub has been very helpful. Hasn't had to write emails and has answers to problems. Steve and Russ have been super helpful. Marshall: good that users have been using Issues. Nic: every time I go to write an email I ask whether I can make it a GitHub issue instead. Dave Bi and Siobhan have useful input. Marshall: how do we get them onto GitHub too? Russ: Arnold has used it. Hopefully Dave and Siobhan will jump on to it also.
  • Peter: using a different MOM. Nic and Peter to meet to sort this out. Roger: can CICE be in the same repo? Nic and Peter will liaise.
  • Invite Arnold to these meetings.
  • Marshall: bit reproducibility issues. Surprised ACCESS-CM2 has bit reproducibility problems. Nic found 4 issues that affected this: payu was missing restarts, and there were a couple of places with a different code branch for the first coupling time step (the Red Sea fix and another one). Added checksums to all coupling fields and tested all of these (see the sketch after this list). Now restarts from restarts are ok, but after 3 or 4 time steps it starts to diverge, so there is probably still some small issue with restarts. It wasn't just one problem; some were Nic's issues with restarts and payu, and some were particular to Nic's setup. When it is solved we can talk to the CM guys. The general approach is useful though: concentrate on the coupling fields.
  • Pete: combing through log files and looking for checksums and numbers to compare between runs.
  • Nic: once I get through all this I can talk to Peter about the method and use it to work on Peter’s issues.
  • Marshall: was Red Sea confirmed to be a repro issue? Did Russ fix it? Russ: Aspendale code was fixed. Marshall: was fixed in Aspendale but not main repo? Yes. Marshall: Arnold knew about that. Peter: we had the fix in that case and hadn’t shared it.
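
A small sketch of the checksum approach mentioned above (field names and shapes are invented): write a checksum of each coupling field to the log every coupling step, then compare logs between runs (for example one 2-month run versus two 1-month runs) to find the first field and step at which they diverge.

    import hashlib
    import numpy as np

    def field_checksum(field):
        """Hash the exact bytes of the array so any bit-level change shows up."""
        return hashlib.md5(np.ascontiguousarray(field).tobytes()).hexdigest()

    coupling_fields = {"sst": np.zeros((300, 360)), "ice_frac": np.zeros((300, 360))}
    for step in range(2):
        for name, field in coupling_fields.items():
            print(f"step {step} {name} checksum {field_checksum(field)}")
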

CMIP6

  • Peter talked to David Karoly. Wondering about spin-up for CMIP6. 500-1000yr spinup. Is it possible to spin up ocean first by forcing an Ocean/Ice model with JRA? Then add CM when it is ready. Ice issues might make big difference to stratification. Aidan: model drift issues won’t help. Marshall: can you stabilise stratification with OM run? Russ: deep stuff takes thousands of years to get into steady state. Marshall: ask this at a MOM meeting? Russ: ask Ocean modellers.
  • What is the CMIP schedule? Peter and Roger: about six months behind. Start the production run early next year. Still working on the configuration. The coupled CABLE model is running now. The reproducibility issue has turned into several issues which need to be nailed down.
  • Marshall: when do you feel like you need to fix the source/versions? Roger: delaying until the end of the year. How long would a 500 year 1 deg MOM spin up take? Nic: 0.5 hours/year, so about 50 years/day without queuing issues. Have to take crashing into account. Marshall: does 1 deg crash? Nic: probably not.
  • Nic: created an issue recently; wanted to run 50 years in a single submit. Memory leaks limit how long you can run; maybe only 3 years.
  • Aidan: Should add multi-year runs per submission ability to payu.

Bathymetry

  • Aidan: how do you deal with non-advective cells? Russ: they are potholes where no advective velocity is possible. If you allow cells to fill when they're too thin, you can create cells that have no velocities (see the sketch after this list).
  • Russ to add his code to OceansAus repo.
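
An illustrative sketch only (not Russ's bathymetry tooling): flag "potholes", i.e. wet cells none of whose four neighbours are wet, since such cells can carry no advective velocity.

    import numpy as np

    depth = np.array([[0., 0.,  0.,  0.],
                      [0., 50., 0.,  0.],       # isolated wet cell: a pothole
                      [0., 0.,  30., 30.],
                      [0., 0.,  30., 30.]])
    wet = depth > 0

    # A wet cell is a pothole if none of its N/S/E/W neighbours are wet.
    padded = np.pad(wet, 1, constant_values=False)
    neighbour_wet = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
                     padded[1:-1, :-2] | padded[1:-1, 2:])
    potholes = wet & ~neighbour_wet
    print(np.argwhere(potholes))                 # -> [[1 1]]
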

New HPC

  • Marshall: tender for the new machine. Need to understand the current limits of the codes, whether a new machine will work, and what we need to get more performance. Convinced MOM is a RAM-bound code; vectorisation is not making a difference. Want a machine with more RAM bandwidth, not more vectorisation: away from KNL and Skylake, towards IBM Power and AMD.
  • Peter: the Met Office XC40 can run the coupled model with 48×24 processors. They are 32-processor Broadwell nodes. Marshall: maybe they are running more threads? Roger: they run 2 threads.
  • Marshall: bring the errors that stop it working. Roger: ok, will get some info together.
  • Marshall: there is an incorrect understanding of message passing and halos around. The MPI messages in MOM5 are healthy: we get GB/s bandwidth, even at the corners. The problems are related to the library or load imbalance, or maybe CPU throttling. We are doing a reasonable job of MPI. A faster interconnect may be useful; Broadwell is 10-15% faster, as it has a faster interconnect.

Actions

New:

  • Invite Arnold Sullivan and James Munroe to TWG meetings.
  • Add feature request to payu: multiple runs per submission
  • Ask MOM Ocean meeting about 1000yr OM spin-up possibility
  • Russ to add all his ocean bathymetry code to OceansAus repo.

Existing:

  • Aidan investigate tenth degree MOM configs for benchmarks.
  • Possible bench-mark configs (everyone)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall)
  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Test Nic’s access-om model config on OceansAus (All)
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Add new test cases to Jenkins test suite (Nic).
  • Start a new google doc about coupler issues and MATM (Marshall)
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Do longer runs with Nic’s 1 deg and 0.25 deg ACCESS-OM2-JRA55 configs (Andy and Aidan)
  • Try repeat year forcing with Nic’s configurations (Nic and Andy)
  • Create document outlining options for configuration sharing (?)

Technical Working Group Meeting, July 2017

Minutes

Date: 11th July 2017
Attendees:

  • Marshall Ward (NCI, Chair)
  • Aidan Heerdegen (ARCCSS ANU)
  • Nicholas Hannah (ARCCSS/Double Precision)
  • Russ Fiedler and Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff and Roger Bodman (CSIRO Aspendale)

Reproducibility

  • Peter is looking at reproducibility across resubmit periods. ocean_solo.F90 has changed since the old CMIP5 setup: a call to the coupler has been commented out. May that have upset the forcings? CMIP5 worked; it doesn't work on CMIP6.
  • Some discussion ensued about when and who might have changed the file.
  • Roger says Martin Dix sees differences in restarts happening in Red Sea. Nic suggested turning off red sea fix and see if reproducibility issues goes away.
  • Nic: what is plan of action of ocean_solo issues? Peter will send a diff to Nic. Roger will ask Martin about turning off Red Sea Fix.
  • Nic can help Peter with updating their GitHub fork of MOM.
  • Nic wondered if there is a better way to do the Red Sea fix. Use sponges in localised areas? This is the way it is done in other models; our method is not particularly standard: specialised code runs over a specific region doing clamping. Maybe do this more generically? Russ: the reason it is done that way is to conserve the salt; restoration doesn't conserve salt. Nic: open the channels? Russ: don't want to change the land masks. It is not an issue of channel size, there is just not enough mixing, and the mixing locations are far apart. Can't do cross-land mixing as they are not on the same processor. The tenth and quarter degree models don't need this fix.
  • Russ: there might be an issue with when the Red Sea fix is called. It is done a set number of steps after the start of the run, so it sees a different history of the model; maybe it should change to a time-based rather than step-based call. Nic reports that Fabio says the Red Sea fix never runs on the first coupled run. After looking at the code Nic says a single 2-day run will never do the salinity fix, while 2×1-day runs will do the salinity fix twice. The code is not reproducible (see the illustration after this list).
  • Marshall asked Nic if ACCESS-OM2 models are reproducible? Nic: don’t know, haven’t done that yet.
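
A toy illustration of the restart problem described above (the timestep, interval and trigger logic are invented; MOM's actual schedule differs, but the failure mode is the same): if the fix fires off a step counter that restarts at zero each run, the same simulated period applies the fix a different number of times depending on how it is split into submits, whereas a trigger based on elapsed model time would not.

    STEPS_PER_DAY = 48                           # e.g. an 1800 s timestep (assumed)
    FIX_EVERY = 60                               # fix every 60 steps of *this run* (assumed)

    def fixes_in_run(nsteps, every=FIX_EVERY):
        """How many times the fix fires when a per-run step counter is used."""
        return sum(1 for step in range(1, nsteps + 1) if step % every == 0)

    one_two_day_run = fixes_in_run(2 * STEPS_PER_DAY)
    two_one_day_runs = 2 * fixes_in_run(STEPS_PER_DAY)
    print(one_two_day_run, two_one_day_runs)     # 1 vs 0: the split runs diverge
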

COSIMA Models

  • Russ fixed heat budget in ACCESS-OM2.
  • Nic took Russ's offline kd-tree runoff regridding and implemented it online. Without conservation checks it is fast enough to run online. Russ: do you set up the connections and read them in? Nic: the tree is built once at the beginning of the run and searched at the runoff frequency (see the sketch after this list). It is run on the MATM core; hopefully it won't slow the other models, as it runs in parallel while they are working.
  • Nic: this runoff regridding might be relevant to the coupled model. Doesn't know how its runoff works, but believes it relies on the land/sea masks matching as closely as possible; as they're different resolutions it might still lose some. This technique is guaranteed to get all the runoff into the ocean. Also, if you change the ocean mask you currently have to change your atmosphere land/sea mask as well; this would avoid that. Nic asked Peter/Roger if they were 100% certain all runoff goes into the ocean. Peter didn't know for sure. Roger reported that this was anecdotally a problem. Peter will pass the idea on to Dave.
  • Russ: did you just implement nearest neighbour, or a spread? Nic: no spread. Russ: it will blow up near the Amazon. Nic: doing a conservative remapping onto the fine model grid, and will then remap from each land grid cell with runoff, so it will not dump all the runoff in one location. Aidan suggested the river spread module could be used to redistribute runoff, but Russ said it is better not to use river spread if it can be avoided (it can reach across cells and so increase communication and slow the model).
  • Aidan explained Andrea Dittus had salinity issues with her coupled chemistry model that were to do with a bad river routing table. Maybe this approach could help?
  • Aidan explained the JRA55 data set, as a replacement for CORE II and how the RYF data was created as a replacement for CORE NYF. There was interest amongst the group at using JRA55.
  • Matt explained CORE II is a weird reanalysis product which is a mish-mash of other products. Some of the component products have ceased so CORE II also ceased.
  • Aidan explained the JRA55 IAF forcing dataset is incompatible with MOM, as it is split into separate years. Aidan developed some rudimentary code to support time formats in the data_table, but this breaks on time interpolation.
  • Nic thinks we should use OpenDAP to overcome this. OpenDAP access via URLs is fully supported by netCDF library. Should work in MOM. Marshall wonders if it would be too slow. Aidan also pointed out that it would require an OpenDAP/THREDDS server which is not publicly facing as JRA55 has limits on redistribution. Nic made an issue for this on MOM5 repo already.
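
A hedged sketch of the kd-tree runoff idea described above (toy grids and numbers, not Nic's MATM implementation; a real version would also measure distance on the sphere rather than in lon/lat space): the tree of ocean wet-cell coordinates is built once, and each runoff point is sent to its nearest ocean cell at the runoff coupling frequency, conserving the total.

    import numpy as np
    from scipy.spatial import cKDTree

    # Ocean wet-cell coordinates: built once at the start of the run.
    ocean_lon, ocean_lat = np.meshgrid(np.arange(0.5, 360, 1.0),
                                       np.arange(-74.5, 65, 1.0))
    wet = np.ones_like(ocean_lon, dtype=bool)    # toy mask: everything wet
    tree = cKDTree(np.column_stack([ocean_lon[wet], ocean_lat[wet]]))

    def remap_runoff(src_lon, src_lat, src_runoff):
        """Nearest-ocean-cell remapping, queried at the runoff frequency."""
        _, idx = tree.query(np.column_stack([src_lon, src_lat]))
        runoff = np.zeros(int(wet.sum()))
        np.add.at(runoff, idx, src_runoff)       # total runoff is conserved
        return runoff

    dest = remap_runoff(np.array([305.2, 12.0]), np.array([-0.5, 54.3]),
                        np.array([1.0e5, 2.0e4]))
    print(dest.sum())                            # equals the total source runoff

A per-cell cap, as in the online redistribution discussed elsewhere in these minutes, could then be applied to the accumulated field, with any excess pushed to neighbouring cells.
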

Benchmarking

  • Marshall: NCI needs benchmarking code/config ASAP. Want to package MOM benchmarks. Currently packing stock MOM-SIS-025. Can’t choose everything. Will dilute scores.
  • Marshall: Is the Hobart THREDDS data ok? Nic: Put up 2-3 years ago. Maybe worth running through it all to make sure it works ok.
  • Can’t use coupled model due to UM licensing.
  • Wants MOM6. Not sure which.
  • Do we want to include ACCESS-OM2? Nic: yes want OM2 tenth. Marshall: restricted by CPU count. Can’t really bench tenth model. 1000 CPUs was too big for Broadwell expansion. 500 was the limit. 1000 might be pushing it.
  • Aidan had a bunch of tenth configs when checking out optimal configurations for production. Will look into tenth layout configs.
  • Roger: looking at N96 benchmark from MetOffice that doesn’t run. How does Bureau do benchmarking? Marshall: BoM gets vendors to sign confidentiality contracts. Need lawyers but NCI might not.
  • Smallest benchmark. Maybe less than 1000CPUs.

Actions

New:

  • Aidan to tell TWG about JRA55 location.
  • Aidan investigate tenth degree MOM configs for benchmarks.
  • Possible bench-mark configs (everyone)
  • Nic to help Peter get his MOM repo up to date with MOM5 master branch, and then merge changes
  • Look into OpenDAP/THREDDS for use with MOM on raijin (Aidan, Nic, Marshall)

Existing:

  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Test Nic’s access-om model config on OceansAus (All)
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Add new test cases to Jenkins test suite (Nic).
  • Start a new google doc about coupler issues and MATM (Marshall)
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Do longer runs with Nic’s 1 deg and 0.25 deg ACCESS-OM2-JRA55 configs (Andy and Aidan)
  • Try repeat year forcing with Nic’s configurations (Nic and Andy)
  • Create document outlining options for configuration sharing (?)
  • Test OpenDap netcdf (Aidan)

Technical Working Group Meeting, June 2017

Minutes

Date: 13th June 2017
Attendees:

  • Marshall Ward (NCI, Chair)
  • Aidan Heerdegen (ARCCSS ANU)
  • Scott Wales (ARCCSS Melbourne Uni)
  • Nicholas Hannah (ARCCSS/Double Precision)
  • Russ Fiedler and Matt Chamberlain (CSIRO Hobart)
  • Justin Freeman (BoM)
  • Peter Dobrohotoff and Roger Bodman (CSIRO Aspendale)

COSIMA Workshop

  •  Marshall: Good meeting. A couple of main messages:
    1. Must have convergence of source code
    2. System for efficient sharing of model configurations
  • The first item we are well on the way to achieving; the second has never been addressed.

Configuration Sharing

  • Marshall: Should focus our energies on configuration sharing.
  • Marshall: having configurations in OceansAus is close to what we want?
  • Currently when using payu configs are saved into local git repos, free floating. This could work for everyone. Should gather config files for runs.
  • Aidan: can have a GitHub organisation and push configurations there, or push to own GitHub accounts and tag repositories to allow search and discovery. Use OceansAus or dedicated GitHub organisation for more canonical configurations.
  • Peter agreed it is ok to have many thousands of configurations. Scott agreed as long as there is a canonical version that is easily identifiable for others to fork from.
  • Nic felt that unless there was a good pay-off most individuals would not make the effort to push their configurations up to shared spaces, or make them visible to others, but there is a pay-off for everyone if configurations are shared.
  • Nic: how do we create a searchable index or database for others to find the stuff? That may be a bigger problem than storing the data.
  • Marshall: payu users can be forced to do it. Tracking history may not seem such a big deal, but there is a pay-off for users down the track.
  • Pete: if someone wants to use my config I give them my config and people branch off that. rosie-graph will show relationships between them.
  • Marshall: rose solves a lot of these problems already. We don’t benefit from that. Do we need to just use rose or emulate what it does best?
  • Scott: rose setup: metadata file in each repo branch, hook within repo which reads meta data file and adds to database. Currently there is not a lot of meta data in most branches.
  • Justin: does rose use git? Scott: should be back-end agnostic. Nic: possible solution? Use GitHub for configs, but use rose DB to share and index configurations?
  • Marshall: should we draft a plan?
  • Nic: Just try with rose? Gives us DB which is searchable and viewable. Doesn’t stop anyone from working in the way we want?
  • Marshall: do we need accessdev to do this? Scott: I think it is just a sqlite DB, so it can just be on the filesystem. Hooks in a GitHub repo won't work the same way to update it, however.
  • Marshall: just start with something searchable? Nic: yeah, see what people are running.
  • Scott: Just start with a simple list. Nic: would want a GitHub repo and commit hash in meta data as a minimum.
  • Justin: make notes about what we think is possible. Common set of functionality will guide us. Rose sounds good, but may be too large a solution? Want something that won’t get in my way and lightweight. Don’t know too much about rose. Everyone agrees this is a good idea.
  • Scott: can just have a list of experiments. Don’t need the editor. Will send around link so Justin and others can take a look.
  • Marshall: really need a bunch of git repos and a way to organise them. Justin: metadata would be great, to find things without bothering others.
  • Marshall: how do we share input data?
  • Justin: share through RDS? Jingbo working on this.
  • Nic: Justin said all our data sources should be pointing to URLs. Great idea! Never have to care about where data is coming from. Input through THREDDS URIs. Marshall: too ambitious? If you’re using the netCDF interface, the library takes care of the network. Should all happen under netCDF. The model thinks it is a file (see the netCDF sketch after this list).
  • Aidan: slower? Nic: might not be slower?
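
As a concrete illustration of the minimal metadata Nic describes above (a GitHub repo plus a commit hash), here is a sketch in Python; the field names and the shared index file are placeholders, not an agreed format:

    import json
    import subprocess

    def experiment_record(name, repo_url, config_dir="."):
        """Build a minimal metadata record for one experiment (hypothetical schema)."""
        commit = subprocess.check_output(
            ["git", "-C", config_dir, "rev-parse", "HEAD"], text=True
        ).strip()
        return {
            "name": name,        # human-readable experiment name
            "repo": repo_url,    # where the configuration lives
            "commit": commit,    # exact configuration version
            "model": "ACCESS-OM2",
            "contact": "someone@example.edu.au",
        }

    # Append the record to a shared, searchable index file (a stand-in for a rose-style DB).
    record = experiment_record("my-025deg-run", "https://github.com/OceansAus/example-config")
    with open("experiment_index.json", "a") as f:
        f.write(json.dumps(record) + "\n")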
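
To illustrate the "model thinks it is a file" point about THREDDS: a netCDF library built with OPeNDAP support can open a remote URL exactly like a local path. The URL and variable name below are placeholders, and this assumes the library has DAP support enabled:

    import netCDF4

    # Placeholder THREDDS/OPeNDAP endpoint; substitute a real catalogue URL.
    url = "https://example.edu.au/thredds/dodsC/forcing/jra55_example.nc"

    # Only the requested slices are transferred over the network.
    with netCDF4.Dataset(url) as ds:
        print(list(ds.variables))
        sst = ds.variables["sst"][0, :10, :10]  # hypothetical variable name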

COSIMA Models

  • Marshall: talking with Peter and Roger at optimisation meetings, seem to be bit-reproducibility issues with CM2. Has Nic ever looked at bit repro in OM2?
  • Nic: have checksums, and make sure they don’t change with code changes under the same layout etc. Haven’t noticed non-reproducibility in a typical short run. As soon as the processor layout changes you’ll run into issues.
  • Peter: if you run the model for 3 months, versus 1 month at a time, you get different answers. A few degrees over a few months. Restarts are not reproducible. CMIP5 worked ok. Now doesn’t work. Maybe need to check that the ice dumping is ok.
  • Marshall: have faith in MOM5 core. Issues with flux exchange reproducibility. Anyone got issues with that?
  • Russ: Fabio Dias and Russ have been working on an energy leakage issue. MOM-SIS conserves, MOM-CICE has a small leakage. It is a computational issue. ASCII diagnostics don’t close: should close to 10^-12 W/m2, but only close to 10^-6 W/m2.
  • Marshall: Has CM2 been subjected to the same scrutiny for flux exchange? How far has Dave Bi got?
  • Peter: does UM have this issue? Scott: bit repro is heavily tested. Processor decomp and run length shouldn’t be an issue.
  • Peter: maybe CICE is a bit more vulnerable? Nic: if someone hasn’t made an effort to make it reproducible, odds are that it isn’t. As far as Nic knows no checking has been done on the ACCESS-CM/OM specific boundary code. Suspicious this might be the issue.
  • Roger: non-reproducibility tends to show up at the higher latitudes and higher altitudes. A couple of degrees over a few months.
  • Marshall: MOM5 is not reproducible over processor layout changes north of the tripole. There is an expensive operation in the MOM5 flux exchange that isn’t turned on by default. Scott: MPI reduce is not generally reproducible unless special steps are taken (see the toy example after this list).
  • Nic: repro needs to be tested with the correct compiler flags turned on. Normally optimised code will not reproduce. Roger did this, but will double-check it for CICE. Should have -fp-model precise across all 3 models.
  • Russ: fixed ice salinity issue that Nic had heard of before from Dave Bi. Was a default namelist option that people don’t often change. Nic: will forward the conversation to Roger.
  • Marshall: if we routinely shared configs this would not be an issue.
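
As a toy illustration of Scott's point about MPI reductions (plain Python, no MPI): a floating-point sum depends on the order of its operands, and changing the processor decomposition changes that order unless a reproducible-sums option is used:

    import random

    random.seed(1)
    # Values spanning many orders of magnitude, as in a global heat or salt budget.
    values = [random.uniform(-1.0, 1.0) * 10 ** random.randint(-8, 8) for _ in range(10000)]

    forward = sum(values)             # one "decomposition" / summation order
    backward = sum(reversed(values))  # same data, different order

    print(forward, backward)
    print("bitwise identical:", forward == backward)  # usually False for float sums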

Actions

New:

  • Create document outlining options for configuration sharing (?)

Existing:

  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Test Nic’s access-om model config on OceansAus (All)
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Add new test cases to Jenkins test suite (Nic).
  • Start a new google doc about coupler issues and MATM (Marshall)
  • Ask Dale Roberts about effects of OpenMP for Roger (Marshall)
  • Make a proper plan for model release — discuss at COSIMA meeting. Ask students/researchers what they need to get started with a model (Marshall and TWG)
  • Blog post around issues with high core count jobs and mxm mtl (Nic)
  • Do longer runs with Nic’s 1 deg and 0.25 deg ACCESS-OM2-JRA55 configs (Andy and Aidan)
  • Try repeat year forcing with Nic’s configurations (Nic and Andy)

Technical Working Group Meeting, April 2017

Minutes

Date: 11th April 2017
Attendees:

  • Marshall Ward (NCI, Chair)
  • Aidan Heerdegen (ARCCSS)
  • Nicholas Hannah (ARCCSS/Double Precision)
  • Russ Fiedler and Matt Chamberlain (CSIRO Hobart)
  • Justin Freeman and Mirko Velic (BoM)

Updates on previous actions

  • Nic has transferred MOM code repos to a new organisation MOM-ocean
  • Official MOM website URL is now mom-ocean.science
  • MOM5 input files have been moved to a new location, administered by Paola Petrelli (ARCCSS CMS team) portal.sf.utas.edu.au/thredds/catalog/. Licensing has been confirmed as GPL by Stephen Griffies.

COSIMA Workshop

  • Agreement that we want to present some summary of the work of the TWG to date. Important to show what we’ve done: model sharing, grid sharing, important infrastructure activities.
  • Discussion around what to cover in the presentation: What are the most important projects? What is most interesting? What do we want to work on?
  • Some ideas for important topics:
    1. Coherent sharing of experiments with others, as is possible with Rose/cylc
    2. Performance is always important, can motivate others to support our work
    3. Coherence around OASIS.
  • Nic will present a 1/10th coupled model update and results. Should have something decent by then.
  • Need to raise the issue of the coupler, present some options and get some feedback from the stakeholders. Justin is concerned about wave coupling in future. Inform the wider group about what we’re thinking. Judge appetite for changing tools.
  • OASIS is not well liked, is difficult to use and performs poorly.
  • If someone has concerns about the coupler/MATM then we need to back that up with numbers. Marshall will start a new google discussion doc for this.
  • Marshall has CM2 numbers that are difficult to understand. Probably not worth presenting to COSIMA.
  • Aidan is not presenting at present. Russ might talk about grid stuff. Matt might talk about some of the technical aspects of the 0.25 deg run. Maybe parameterisation of bulk formulas.
  • There was a general feeling that the TWG was functioning well and no significant change is required. Matt suggested a face-to-face meet-up might be worthwhile. Mirko pointed out we need some metrics to show others that the TWG is a worthwhile venture and why we’re so pleased with ourselves.

Updates

  • Nic has had a 3 week break. Visited GFDL. Presented his coupler at a coupling workshop. Talked to OASIS, MCT, ESMF and YAC (Yet Another Coupler). ESMF are building 2nd order conservative support into their remapping weight software, which is good for us.
  • Matt: has been working with the GFDL coupled model for Terry O’Kane. Using the ACCESS 1 deg grid in CM2 with the AM2 atmosphere. Several hundred years. Decadal forecasting. Atmosphere is the same as CM2.1, maybe 2.5 deg. Would be nice to have higher resolution, but forecasting needs ensembles, 10–40 ensemble members. The UM is too slow for this sort of work. There is some interest in benchmarking ocean structure, maybe against CM3. Maybe then go to a 0.25 deg ocean, AM3 atmosphere model. Then put it through a data assimilation system. Then put it in forecast mode. Look for predictive skill.

COSIMA Models

  • Not made much progress on ACCESS-OM2-01. Technically coupled. Now trying to speed it up for multi-year runs. Currently running a 600s timestep. Still in January.
  • Not currently JRA55 forced. Can’t talk about tenth ACCESS-OM2 performance yet.
  • CICE has some shocks on spin-up. Was aiming for 600s, but the current operational time step is 450s. Pulling back to 450s will be a lot easier. Goal is to make 600s the production time step. Russ can sometimes run 720s or even 900s for this tenth simulation.
  • Matt pointed out it is difficult to start a simulation with CICE when the initial conditions have no sea ice. Better to create initial conditions from the World Ocean Atlas.
  • Matt is now running COREII IAF. When it blows up, he just drops the time step for a month and, if there is no further issue, doesn’t try to diagnose the problem, just goes back to the normal time step. Matt is running 600s with the tenth MOM/SIS.
  • Aidan talked about issues with the current ACCESS-OM2-025+JRA55: discovered the scripts for MOM5 are not using the same OASIS build as the other components. Trying to work out why the model is running so slowly.

Actions

New:

  • Start a new google doc about coupler issues and MATM (Marshall)

Existing:

  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Add Peter’s CICE5.1 config to OceansAus github repo (Nic and Peter)
  • Port MOM5 build system to cmake (Aidan)
  • Push updated MATM code with JRA-55 support to OceansAus github (Aidan)
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Test Nic’s access-om model config on OceansAus (All)
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Move to CICE5 on OceansAus repo (Nic).
  • Add new test cases to Jenkins test suite (Nic).

Technical Working Group Meeting, March 2017

Minutes

Date: 14th March 2017
Attendees:

  • Aidan Heerdegen, Scott Wales (ARCCSS, Acting Chair)
  • Nicholas Hannah (ARCCSS/Double Precision)
  • Russ Fiedler and Matt Chamberlain (CSIRO Hobart)
  • Peter Dobrohotoff (CSIRO Aspendale)

Updates on previous actions

  • MATM update: Nic will do a code review and document, probably beginning of April.
  • CICE5.1 on repo: This will be part of the ACCESS-OM-0.1 development when this code is incorporated into the model.
  • Matt has 1/10th data, and diagnostics were ok. Needed them to explore how heat is going into the ocean, and how that is affected by bulk formula and forcing fields. Could be useful analyses for changing to JRA55.

COSIMA Models

  • Nic is time-stepping ACCESS-OM-0.1 with CORE forcing. Will not be working on this again for about a month. Is time stepping with CORE forcing and CICE 4.1. Will add in CICE5 and JRA55. Took about as long as he thought, but there were a number of issues that required fixing.
  • Nic wants to clean up the coupling code, remove cruft and commented-out code. Also clean up date handling. There are twenty places where you have to change the time step, for example. Similar for run time. Need to set up regression tests to allow code changes without breaking stuff. In MOM and CICE namelists there are duplicate fields. Need code changes. Maybe C-style macros are a simple approach? (See the templating sketch after this list.)
  • Nic has developed some tools for generating model inputs required for ACCESS-OM. For example, creating the CICE grid out of the ocean grid. Using ESMF for creating grids and restarts. Started a wiki to document the steps to develop a config.
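
One possible lightweight alternative to C-style macros for the duplicated namelist values mentioned above is to render the namelists from a single source of truth. A sketch, using placeholder group and variable names rather than the real MOM/CICE entries:

    import pathlib

    # Single source of truth for values duplicated across model namelists.
    shared = {"timestep": 450, "run_days": 31}

    # Placeholder templates; not the actual MOM or CICE namelist entries.
    templates = {
        "ocean_input.nml": "&ocean_nml\n    dt = {timestep}\n    run_days = {run_days}\n/\n",
        "ice_input.nml": "&ice_nml\n    dt = {timestep}\n    run_days = {run_days}\n/\n",
    }

    # A timestep change is then made in exactly one place.
    for fname, template in templates.items():
        pathlib.Path(fname).write_text(template.format(**shared))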

Updates

  • Russ is working on some mixing models, more complex than KPP. GOTM was patched on to MOM. Have to advect around a couple of extra tracers, but doesn’t seem to impact speed much. Supposed to be a much better way of mixing.
  • Russ has been having issues with “number of retries” time-outs with PBS, occurring in the first few lines of code. Aidan suggested noting the error message and PBS ids and passing them on to NCI Help, as this is often the side effect of a bad node.
  • Nic also having issues with the tenth. Crashes, with lots of MPI output and no error code. A lot of the processors were sitting in an MPP_global call. ACCESS-OM coupling code was making it fall over. Raijin doesn’t like MPP_global operations as implemented by FMS, due to implementation issues: to globally distribute an array in FMS you use MPP_global, which uses multiple MPI sends rather than an MPI broadcast (see the sketch after this list).
  • Peter: status is similar to last month’s update.
  • Russ has found issues with the MLD diagnostic in MOM. If there are only 3 levels it returns zero. Strange issues near the coast, getting zeroes at those locations.
  • Russ will be working on bathymetry for the COSIMA 1/10th configuration. Maybe need different bathymetry for climate and BRAN-type runs? Aidan was adamant the goal was to have the same bathymetry for all COSIMA models.
  • Russ is running his 1/10th degree simulation with a 900s time step, whereas Aidan’s configuration is at 450s due to instability in the Arctic.
  • JRA-55 is now being housed centrally. All those who wish to use it should make their requirements (versions, update frequency) known to the ARCCSS CMS team, and Paola Petrelli in particular.
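
The MPP_global point above can be illustrated with mpi4py (an illustration only, not the FMS code): distributing a global array via repeated point-to-point sends from one rank serialises the communication, while a single collective broadcast lets the MPI library use a tree and scale to high rank counts.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    data = list(range(1000)) if rank == 0 else None

    # Pattern A: root loops over point-to-point sends -- O(size) serialised messages.
    if rank == 0:
        for dest in range(1, size):
            comm.send(data, dest=dest, tag=0)
        global_a = data
    else:
        global_a = comm.recv(source=0, tag=0)

    # Pattern B: one collective broadcast.
    global_b = comm.bcast(data, root=0)

    assert global_a == global_b

Run with, e.g., mpirun -n 4 python bcast_sketch.py.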

MOM5 Repo Move

  • Aidan has had some preliminary discussions with Stephen Griffies and Alistair Adcroft about having a new “official” home for the MOM5 source code repository. They favour a separate repo with the community-supported model, and infrequent but regular updates of the “official GFDL” repo version, which could be used by those needing a badge of officialness.
  • Nic felt that there was little point in having an “official” repo, especially if it just creates more work for little actual gain.
  • Nic was in favour of a standalone MOM github organisation, as the model is bigger than any one of the groups that use it. It could also host all the older versions of MOM. This was considered a good approach. If all partner organisations (COSIMA, ARCCSS, BoM, CSIRO, GFDL?) gave this github organisation their support, similar to the CICE development model, this could tick the boxes for those that require an endorsed software product.

Actions

New:

  • Nic to organise transferring MOM code repos to a new organisation MOM-ocean

Existing:

  • Nic to present MATM code re-write proposal to TWG for feedback before sign-off. Will then be presented to Andy Hogg for approval.
  • Add Peter’s CICE5.1 config to OceansAus github repo (Nic and Peter)
  • Port MOM5 build system to cmake (Aidan)
  • Push updated MATM code with JRA-55 support to OceansAus github (Aidan)
  • Get licensing for MOM5 input files (Marshall)
  • Work on hosting MOM5 input files on NCI THREDDS server (Marshall, Aidan)
  • Nic create a discussion document (on COSIMA?) to document current approaches and strategies for future
  • Move FMS to submodule of MOM5 github repo (Marshall). Liaise with Nic on implementation?
  • Test Nic’s access-om model config on OceansAus (All)
  • Work up test cases to cover the nudging code (Justin, Mirko) and supply them to Nic.
  • Move to CICE5 on OceansAus repo (Nic).
  • Add new test cases to Jenkins test suite (Nic).