Date: 10th April, 2019
- Aidan Heerdegen (AH) CLEX, Andrew Kiss (AK) COSIMA, ANU
- Marshall Ward (MW) GFDL
- Russ Fiedler (RF), Matt Chamberlain(MC) CSIRO Hobart
- Nic Hannah (Double Precision)
Date: 10th April, 2019
Date: 14th February, 2019
Date: 11th December 2018
MW: Been profiling CICE, score-p profiling doesn’t work. Been timing by time step. Anomalously long time spent at step 72. AH: could it be atmosphere being updated. JRA55 is 3 hourly. Not sure timestep. MW: Seem to have lost my logs. Not sure best way to handle it.
AH: Peter has been testing release candidate. Russ supplied a diag_table which just outputs fields for first 2 time steps which is really good for seeing code issues. Russ found some bugs introduced by me. A couple of logic errors with preprocessor flags and omission of a couple of lines that got lost in translation. Confident latest update has squashed all the bugs. MW: Not old bugs? AH: Did find some old issues. Russ found a stuffed iceberg file. RF: Not related, but is something they were using for CMIP6. AH: Did find some old bugs, had to emulate the lack of reproducibility from a the readsea salinity fix timing bug to be able to closely reproduce CM2 output. Put a flag in to do the wrong thing to do the same as theirs, will remove before merging. MW: I thought reds fix had been changed to be faster but not reproducible. RF: That’s right, but not issue. This has to do with timing. Aidan fixed it, but not compatible with what they are using. AH: Just need something that reproduces CM2 output.
Narrator: The new way of doing salt fix will reproduce over time steps, but is not bit reproducible with the old algorithm. Don’t see that effect in these tests.
AH: Peter has a test suite which is old CM2, and a copy which uses updated MOM. He compiles the new code manually and runs the two suites side by side. Both use Russ’ diag_table. Just find out which fields don’t match. Most are the same, few different, seem to be affected by the same issue. Once we’re good for a few time steps then maybe look at them after a few months RF: Once chaos starts, hard to say. As long as nothing gross happening. Unless there is something further on with coupling. AH: Yes, look after a month and check it looks close. MW: Not trying to be bit reproducible? AH: Just want to fix my bugs. RF: Make sure you’re getting the same forcing fields. Can see out in the open ocean hardly any change. Just noise. This means we’re close. Saw the outline of where the forcing field is supposed to be. The bug in the forcing field data showed up, which indicated the issue. AH: Once we’ve confirmed fixed, will merge PR and then move on to ESM.
MW: Will the CM2 code remain in step with the MOM5 code? RF: CSIRO Aspendale not doing much code development at the moment. AH: Peter is pulling directly from his GitHub repo, but once it is harmonised they will pull directly from the MOM5 repo. They will want to have a tag and pull from the tag. RF: Yes they will want frozen versions. AH: Should have some automated tested, if we find a bug, should be able to updated CM2 code and confirm doesn’t change important answers.
AH: Short answer: Lots of progress. I made lots of bugs and Russ found them. Thanks Russ. NH: Yes thanks Russ.
Date: 11th September 2018
MW: 4 block success. 16 block didn’t work. sectrobin also didn’t work. Limited perspective on problem.
RF: blow out in time with extra blocks was halo updates. Weakness with round robin. A lot of overhead, no local comms. Maybe 8 tiles/processor might work. Marshall’s profiling showed small number of processors dominated run time. Want to minimise the maximum. That is the limiter
AH: Where are the max tiles?
RF: Seasonal ice near Hudson Bay, Sea of Okhotsk and Aleutian Islands.
MW: Nic used total CPU count less than number of blocks
RF: Could run with more, or less. MW: 80 CPUs less, could solve this.
AH: General strategy to concentrate on not assigning CPUs to the low work (blue areas) and let the high work areas take care of themselves?
RF: Only worried about slowest tile. Nice to have even distribution, but hard to achieve that in practice.
AH: Slowest tiles change over time RF: read in a map of expected ice concentration. Or have a heuristic, say weight by latitude. AH: If identify areas that do very little work, say never want to have many processors there, and free up processors for high work areas.
AK: There are five hot stripes and four cold stripes. Some processors have 5 blocks, some have 4. The outlying busiest ranks are on those hot stripes. If we get rid of striping with more even split, that would have maybe a spike on a lower baseline
RF: About half the processors have 5 about half have 4, request a few more PEs and that would close to balancing this issue.
NH: First attempt 1600 PEs with an even 4 blocks across all. With idealised test case Ocean was not blocking at all. Though could save a couple of hundred PEs, and there was not a big difference. However Andrew’s real world config is behaving differently. Worth going back up to 1600 and doing an even 4 or even 8 blocks. Assumed wanted everything to be even. Seemed roughly the same to have a mix. This profiling shows I was wrong.
RF: Can easily work out to get exactly 5 blocks per PE. AK: If you give me that number I can try it. NH: 5 across the board is better. Don’t want a single PE doing more work. RF: Slowest one kills you.
AH: How does the land masking affect it? A thicker stripe in NH? RF: Yes. Did I post a picture of where tiles are allocated? NH: More blocks means getting rid of more land? RF: Lose with communication cost.
NH: In order to get this working I ran into the raijin problem: messages getting lost and deadlocks. When we got 0.1 deg MOM-SIS working had issues with point to point sends and recvs, and Marshall change that to proper gather to get initialisation working. The gather inside CICE is implemented with point to point sends and recvs. Assume similar. It is doing a send for every block. MW: Andrew’s finished ok? AK: Ran with 30×35. MW: mxm might resolve this problem? NH: Resolved by putting in a barrier after all the sends, otherwise deadlocks. MW: Did you add barriers? NH: Yes to the MPI gather code. MW: Clear that CICE is heavily barriered. NH: Could implement properly with MPI_gather. MW: Caveat didn’t work with the global field. NH: Only does a global gather once when writing out restarts. Not too bad. MW: A lot of MPI ranks? NH: 1600 x number of blocks is the number of sends. MW: So number of messages, not number of ranks. MW: Only added barrier for restart? NH: Could have done that, but added in MPI_gather. Maybe that is bad? Actually didn’t add, just enabled it by defining a preprocessor flag.
AH: Is there an effect that it gets wider in the north that you’re sampling more ice in those areas?
AH: Should we pull out the slowest blocks and see where all the blocks are that contribute to the slowest processors.
RF: Correspond to areas of highest ice concentration. AH: There is ice in Okhotsk in northern summer? RF: Yes.
MW: Arctic and Antarctic are sharing work. RF: How many for this run? MW: 1385 RF: If you run with 1500 or so get an even distribution.
NH: Should decided what is the next step/run?
MW: Two options, massively increase number of blocks, but this is blowing out with comms time, or even divided 5 blocks. RF: Yes that is the one to do next.
AK: sectrobin should solve the communications issue but couldn’t get it to run. NH: Not sure if code needs to change? RF: Test on 1 degree model.
AK: First step to even up current run with 4 or 5 blocks. MW: Should confirm that many blocks is a comms problem and not a tripole issue for example. But this is a research problem.
AK: Will switch to this for 0.1 deg production as it is already better.
NH: New code 1 block per PE gives identical answers to old code. 4 blocks does not give identical answers to old code. Not sure if I should expect it to be the same. Don’t know how CICE works. In terms of coupling it should be the same if you’re coupling to individual blocks or multiple blocks. Not ruling out it should be identical and there is something going wrong. AK: What would make it non-identical? Order of summation? NH: could be something like that. MW: Might be CIE doing a layer calc before doing vertical? Have to know more about CICE. NH: might be worth looking into further so at least we know that we’re not making bugs.
AK: How would I switch to this for the production run? Not bitwise identical? Just check fields look physically reasonable? NH: Hard problem. Can’t see physical difference. Only looking at last few bits of a floating point number. MW: Did an MPI sum on a single rank and it changed the last bit. Found it running the FMS diagnostics and that is why they failed. Don’t fail at GFDL. Scary stuff. NH: Scary and time consuming.
MW: Clear strategy. Get rid of bands. Go with 1600 cores. Have a 16 block job running, will keep everyone updated.
AH: My understanding with the ESM harmonisation is that we’re close, as we haven’t yet put in the coupling changes from CM2 that you had to take out of the ESM code. PD: Dave Bi’s iceberg scheme? AH: If we get the WOMBAT code into MOM5 that would be harmonised I think. PD: Maybe Matt has a better handle?
MC: Are the OM and CM almost harmonised except for iceberg information? Are they almost the same? AH: I believe so. Once we get WOMBAT in there we’re good to go. Russ had a different idea about how to handle the case of different coupling fields.
RF: Have to get rid of ACCESS keyword. In many cases redundant. AH: ACCESS keyword can be replaced by ACCESS_CM or ACCESS_OM. RF: Yes!
RF: On CICE side of things (and probably MOM) coupling fields are currently defined as parameters. Can use calls to PRISM, test return code, put some tests for legal code/parameters for icebergs for example. Don’t need ifdef’s, can test on the fly. A lot easier than recompiling every time.
AH: How do we implement this? Put WOMBAT code in now so we have an ESM harmonised version and then deal with coupling etc as this is ACCESS-CM? RF: Want to bed down ACCESS-CM and OM harmonised first. The WOMBAT stuff will move in quite simply. I’d like to take that on, have been tasked to do this to take some of the load off Matt. Get this first step out of the way and then move on to WOMBAT and ESM. Until the first step done things can be in a state of flux.
MC: Is wind ehanced mixing in ACCESS-OM? RF: Yes. MC: FAMIP in ACCESS-OM? RF: They’re in MOM5. MC: They weren’t in ACCESS-CM code. AH: That is a 3 year old fork. MC: Can we update ESM from ACCESS-OM? AH: This morning putting WOMBAT changes into MOM5 pull request. Can grab and check if it works. MC: What is the difference in pulling from one direction to the other? AH: ESM is a 3 year old fork with little history in common with current MOM. Couldn’t code into ESM would be too difficult. Cherry picked your changes into the MOM5 code, but wouldn’t work the other way. Will lease with Russ to get ACCESS-CM changes.
AH: Would WOMBAT always be part of MOM5-SIS. MW: Is it big? RF: No, very small. MW: Let’s leave it in MOM5. Just executable bloat. RF: Just a few fields. MC: Allocated, so if not turned on, then no issues. RF: WOMBAT wants the 10m waves, but we need that for the wave mixing as well.
AH: ACCESS-OM no longer compiles because you need libaccessom2 as well. NH: Same before. Always needed OASIS. AH: I’ve got CM compiling by pulling in OASIS and make it. All the compilation tests are passing. Could pull in the libaccessom2 and compile in a similar way to ACCESS-CM. There is no old ACCESS-OM build anymore. It is ACCESS-OM2. MW: Do we want to do this external to the repo? AH: Nice to have the tests there and passing. OM now has different driver code to CM, so can’t be sure you’ve done it properly without an ACCESS-OM compilation test. NH: There always needs to be a dependency on a coupler. libaccessom2 is more than a coupler. Maybe some of it is undesirable. Not worse than having a dependency on OASIS. AH: Just wanted to make sure there wasn’t an ACCESS-OM that was independent of libaccessom2. MW: Can you provide libaccessom2 as a binary and headers? AH: Yes, that is a possibility. NH: Could just be a .a file. MW: that is how you handle dependencies, as a binary, like libc. MW: Do you call OASIS in MOM? NH: Yes. In yatm don’t directly call OASIS. Could change coupler in future without changing models. MW: No problem with wrapping OASIS. AH: Can do the same thing I did with CM, pulled in OASIS, built it. Pretty straightforward.
Date: 14th August 2018
Peter Dobrohotoff (CSIRO Aspendale) gave his apologies that he couldn’t attend.
MW: Fixed constellation of bugs. 1/10th still not working under MPI, looks like a new issue. How do I approach getting this code into respositories? Not invested, but want working in long term. Intel v18 has a bug, fixed, but found another. Dale has built an OpenMPI3 library built using Intel v17. Do people want to use it? Or afraid another issue?
AH: What advantages? MW: Hangs in MPI_Init, commsplit, and hangs at random time steps. MXM seems to have solved random hangs. First two still happening. Still getting random fails during initialisation. Betting on newer library to solve them. Do we want to invest in new libraries and hope we get new solutions, or happy with status quo?
AH: Is there another way? Maybe dev branch on MOM5, submit to more testing? MW: Yes. I can add all build stuff to all repos, independent on NCI configs. Optionally turn them on over time. AH: Just using different versions of OpenMPI? MW: Yes, have my FMS changes as well. AH: FMS changes in master branch? MW: Could do multiple ways. FMS not been updated to GFDL version. Could have subtree/submodule or just dump in. Will change everyone’s code, so might not be solution. I want people to start testing soon. AH: The init hangs: critical core count when these occur? MW: Yes. Tenth gets them, not at 1 degree. AH: Don’t have these at a quarter either? MW: don’t run 1/4, but frequency of errors increases with cores. AH: If we can test this by running the model for a single time step lots of times to quantify issue, see how bad it is, and see if we get improvement. MW: Have not seen commsplit hang with newer versions of OpenMPI. AH: Do you get hangs AK? AK: Get hangs at initialisation about 10% of time. Since going to YATM not had these issue. Also much more consistent with timing. Was 1.5-2.5 hours/month. Now 4h5m-4h10m for 3 month submit. MW: When we studied variability always IO issues. AK: Don’t know what is behind it. AH: Has coincided with transition to YATM? AK: Yes.
AH: Just got to a point where the model is running at all. AK: By late today will have 1 year of IAF. Need capacity to shift to new MPI versions. MW: Concerned about next machine. OpenMPI 1.0 will not be available. Not concerned short term. Concerned long term and stability issues. AK: Yet to fail
MW: Just updating. Want a plan. Will submit build changes to all the projects. Will also replace FMS with submodule but no change in code, but can be changed when required. AH: with submodule can easily test changes. Need a plan to for implementation, testing, check timings.
NH: YATM makes sure it does all it’s work before ICE asks for anything. Does reads and regridding and waiting until required. Would hide any jitter in this disk access. AK: Preemptive fetching data. MW: Good case for IO servers.
AH: We all good now? This fixed? AK: Yes. Just had to tell it to ignore the date in ice restart for the first one. AH: We have a method for this strange restart? Will we need to do this again? NH: Will happen any time you want to use someone else’s restart and don’t want to use their calendar. AH: Are there code changes we need to support this? AK: No. One off thing. Just need change a value in CICE restart and then change back again. AH: Kial got burnt with this once. MW: We could do this at payu level. NH: Little more to it. Also necessary to MOM and YATM date. AK: 3 or 4 things I needed to do to make it go. AH: Think if we need to streamline this? NH: To begin with just document on wiki. AK: I can put it up there.
RF: Should not be able to do an ACCESS-OM run but not do langmuir mixing unless using u10 calculate from empirical formula. Won’t break current ACCESS-OM runs. Have to look into CICE5 and work out how to get winds into ACCESS-OM. ACCESS-CM was fine. OM I thought there was a option to pass winds, misread a preprocessor flag.
RF: A couple of other issues. The order it passes the fields between ice and ocean, the 10m winds are 22nd, another one at 21 which isn’t used. They can’t be done as a common thing. Strange code. The changes I have done will make it safe for the time being. Have to explicitly compile you want to use the winds. AH: Under what circumstances can you use langmuir in ACCESS-OM? RF: Can use MOM6 style to calculate 10m winds. Need to turn on another namelist option. Currently don’t pass winds in OM models. AH: If want to use langmuir do you need to also set the compile flag? RF: At moment can’t pass 10m winds from cice to MOM. Defaults will be fine. ACCESS wind preprocessor flag is for ACCESS CM. ACCESS wind flag is a placeholder at the moment. All allocations made no matter what. Problem all allocations no matter what. Currently initialised to zeroes. AH: A placeholder for the future when get winds through CICE? RF: Yes. Would like to figure out how they pass 10m winds in fully coupled model, and whether they mask them with ice or not. Currently not clear. Would like to make them compatible. AH: I will put in those changes, add new changes and submit a PR. MW: Turn off langmuir by default? It’s broken? RF: No, can calculate winds in MOM6 style. Not using this at the moment.
AH: Above bug brought home issue of testing on MOM5 repo. Currently have 3 targets, MOM-SIS, ACCESS-OM, ACCESS-CM. This got through beause I only tested MOM-SIS. NH: There is a jenkins setup which runs every MOM6 test case. Can’t remember if it has ACCESS-OM, ACCESS-CM builds/test cases. Spent a lot of time to set up testing but it takes maintenance work. Doesn’t run periodically. Love the idea of testing MOM5, please look at what I have already done. Like the idea of production ready, but it takes effort to maintain the system, might not be justified by the number of tests we have to run. If we had weekly PRs would make sense. If infrequent need to revisit the testing every time. AH: Idea was to do some simple builds. MW: build tests on travis? AH: Yes. MW: Don’t have to run, just build. Nic did a lot of work to do runs.
NH: Periodically: ACCESS-OM2 build test, and a fast run test (1 day experiment). There is a lot of stuff being done for MOM5, no build test. MW: Is this the GFDL tests? AH: Nic runs the GFDL tests. NH: MOM5 runs not run as frequently. Not maintained, going red. Sure if something simple, maybe not worth doing on Jenkins. But definitely take a look? MW: Travis for commits? Weekly Jenkins runs for commits. AH: can see five MOM5 builds. NH: folder mom-ocean.org on Jenkins. MOM6 guys get a lot of value from it. AH: will take a look. NH: If you can’t change anything let me know.
MC: Didn’t work with https. AH: Made an ESM1.5 repo on OceansAus for Matt to upload MOM5+Wombat code. Pretty much frozen. Peter wanted somewhere to put this. Should it be possible to https? RF: Always had to use ssh. NH: Just need to put in password. MW: https should work, help to know error. AH: give it another crack. Complain on slack. Get on slack.
AH: ESM1.5 repo on OceansAus won’t change much (now frozen), but we have goal of getting WOMBAT code into main MOM5 repo. MC: Might do that in parallel. Who knows what will happen to ESM1.5. Depends on where investment with ACCESS-CM investment goes. ACCESS-CM2 is quite expensive. In the process of putting WOMBAT into ACCESS-OM-1.0. Going through steps. Put WOMBAT into it and submit PR. AH: ESM1.5 is just MOM+ Wombat? AH: We’re doing the harmonisation, so MOM5 master will have all the important changes. Once we have WOMBAT we have an ESM1.5 equivalent. ESM1.5 will be workhorse coupled model for CoE because ACCESS-CM2 is too expensive. Whilst ESM1.5 on OceansAus will be the canonical version, the MOM5 repo will be effectively the same but can included updates to diagnostics etc.
MC: Checked out ACCESS-OM2-1.0, checked out, compiled, but falling over on running payu. Config file has changed a lot since I last used. Want to run a 1 degree RYF model as basis. MW: Is Matt using the version that isn’t working? Is that what Matt is dealing with? AH: Matt, get on slack and let us know your issues, and we’ll get you going. AK: looking for a working setup? MC: Yes, 1 deg JRA RYF. AK: Can point you to working config. MC: A month since I cloned. AK: Yeah, need to update.
AK: Asking for just config? MC: Yes, but any information useful. AH: Kial has a lot of configs. I cloned one and changed exe paths and was up and running very quickly. MC: I cloned and built, but when I checked out the config it was pointing to common shared exes. AK: should change that. NH: Maybe documentation is out of date. Should follow the simple “if you’re a raijin user” instructions. MC: Yes mostly worked. AH: Get on slack! MC: Browser is out of date. NH: If you do it again, follow the quickstart for raijin users instructions. If that doesn’t work we need to fix stuff. MW: a lot of use problems we don’t know about. We have to think about students who will be coming to run this. If Matt can’t figure it out there is no hope. AK: there is a lot that needs to be updated for the more complex instructions. Also the configurations in control are not what they’re currently using. Could fix that easily.
AH: Andrew has issues with a ‘latest’ directory that has symlinks that point to most recent version. AK: Common use case is perturbation experiments. Go back to previous restart and branch a new experiment, but need to know what forcing was used. Rather than latest, have a directory which is named for the date it was setup, or date forcing was updated. If and when things are changed, make another one. All softlinks. AH: One good thing about latest is you have a config that always works with most recent version. If you have a config with latest, they start a new model and they can be confident that it works. AK: No problem extending forcing, only an issue if old forcing files change. AH: they have versioning issues with CMIP5, have a database. NH: latest is not reproducible. Experiment I ran, but latest is changed. Problem with old system, every version jumbled in one directory. At times there were different variables which had different versions. Not all variables had the same version. AH: That is correct. NH: If there is a single directory that has all the variables for that version that is fine. AH: some cases the variables don’t have the same version. I agree this is an issue, but best solved with manifests in payu. MW: filename is not a good system. Filenames change and hashes don’t. AH: If someone has a naming scheme they want, then happy to implement it, but will keep latest, and solve using manifests. NH: was there a reason to put all versions in same directory? AH: the way the JRA55 people publish it.
AK: Do we care if JRA forcing is extended? Does it affect reproducibility? NH: Not an issue. YATM has no end date for an experiment. You set a forcing start/end date, so no problem.
RF: Pavel Sakov is running a KDS75 MOM only on OFAM -75/+75 tenth model. Running 600s timestep from the start, hoping to get up to 900s. The problems in global model is not between +-75. NH: Just poles messing us up. RF: From a flat surface, huge heave. NH: all those little grid boxes. AK: Yes the tripole is the issue. AH: redo bathymetry? RF: did a naive regridding, some issues, potholes etc. Still works. Will be running a 100 member ensemble. AH: What is he trying to find out? RF: look at some issues with OFAM/BRAN/OceanMaps. Interested to see how much is due to vertical resolution. Also a test for the future. An intermediate model between what we run at the moment, and what Andrew is running. MC: interested in a figure from Kial at the COSIMA meeting, showing how variability changes with surface resolution. AH: how long will he run? RF: A year or two. Thought you might be interested.
Date: 10th July 2018
Date: 10th April 2018
AH: Have rebased PDs MOM code on to latest MOM5 source. Wrapped all PDs changes that are incompatible in ifdef ACCESS_CM statements. This is available as branch cm2 on MOM5 GitHub repo. Created a pull request to allow code review (https://github.com/mom-ocean/MOM5/pull/214).
AH: PD has provided a rose suite (u-aw048) for testing. AH successfully copied (u-aw405) and ran this suite. Created another suite (u-aw445) to test that this reproduces first copy with no changes. It does. Created a third suite (u-aw497), changed git URL to point at cm2 branch, but initial compile failed to find the source. Eventually got it to recognise the updated source, now compile fails due to an absence of a main routine. Needs some modification.
PD: Best to make small incremental changes to a suite. In this case just change the fortran files and see if it works. Avoid changing how the compile is done, definitely avoid changing rose app conf. Was just trying to determine the compile flags use in UM fcm build from Met Office was not a trivial task.
MW: cylc is good, small and configurable. rose is difficult and opaque.
AH: Will get some help from others via the cm2-om2-harmonisation slack channel
AH: Next step is to do the same for CICE5 as has been done for MOM5.
NH: CSIRO is getting CICE from the UKMO. There are code changes under src, not just in the drivers.
AH: Does UKMO version of CICE5 have a special licence? Will we be able to host UKMO modifications to CICE5 on OceansAus repo? NH: CICE has a CICE licence
PD: ACCESS-CM2 doesn’t reproduce over restarts. Would like to run CICE stand alone. Does CICE reproduce over restarts?
NH: cleanest way to test is in the coupling code, before any model sends anything to another model, checksum fields. That is the point you can compare the output of a model. In the case of restarts, can include in checksum what current model run time is.
MW: there are 2 restarts, individual models and oasis restarts. Have to make sure trigger the same number of time steps in models.
Date: 13th March 2018
Date: 13th February 2018
Date: 10th January 2018