Technical Working Group Meeting, November 2019

Minutes

Date: 27th November, 2019
Attendees:
  • Aidan Heerdegen (AH) CLEX ANU, Angus Gibson (AG) ANU, Andrew Kiss (AK) COSIMA ANU
  • Russ Fiedler (RF), Matt Chamberlain (MC) CSIRO Hobart
  • Rui Yang (RY), Paul Leopardi (PL) NCI
  • Nic Hannah (NH) Double Precision
  • Marshall Ward (MW) GFDL

ACCESS-OM2 on gadi

PL: Submodules not updated (#176). A reported bug was fixed in CICE5 but the fix is not being built. AK: not sure how to release this. Sometimes model components are updated but not tested. AH: On the gadi transition branch? AK: Yes. PL: It is a science bug.
PL: To test, had to copy files around. Needed to update config.yaml and atmosphere.json. Made a fork of 1deg_JRA55_RYF for testing. Had to move files to non-public locations as he does not have access to the public ones. Will send details in an email.
PL: conda/analysis3-unstable needs to be updated, payu not working on gadi. AH: Did update, still not working. Update only tested in an interactive job; the PBS job strips out the environment. Wanted to consult with Marshall about why payu works as it does currently. Difficult to debug with payu-run as it does not have the same environment as “payu run”. PL: Work-around is to add the -V option to qsub_flags in config.yaml. AH: I am considering making that the payu default. Not sure. Currently looking into this.
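A minimal sketch of that -V work-around, assuming the config.yaml in the control directory has a string-valued qsub_flags entry (PyYAML and the file location are assumptions):

    import yaml

    # Sketch only: append -V to qsub_flags so the PBS job inherits the
    # submitting shell's environment, as per the work-around above.
    with open("config.yaml") as f:
        config = yaml.safe_load(f) or {}
    config["qsub_flags"] = (config.get("qsub_flags", "") + " -V").strip()
    with open("config.yaml", "w") as f:
        yaml.safe_dump(config, f, default_flow_style=False)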
PL: nccmp module not on gadi. Been using for reproducibility testing. In backlog. RY: Can install personally, don’t have to wait for system install.
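For reference, a hedged example of the kind of comparison nccmp is used for in that reproducibility testing (the restart file names are hypothetical):

    import subprocess

    # Compare data values (-d) in two restart files; nccmp exits non-zero
    # if any field differs, so check=True raises on a mismatch.
    subprocess.run(
        ["nccmp", "-d", "restart_raijin.nc", "restart_gadi.nc"],
        check=True,
    )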
PL: Running on gadi. Got the 1 deg RYF55 run finished. Did not have mppnccombine compiled; will have to do this to get it working correctly. Have something as a baseline for comparison. Will report by the end of the week.
RY: gadi nodes have 48 cores; the defaults are based on Broadwell (28 cores). Do you have an up-to-date config? Paul currently changes the core count in his config, but has this been done in the official config?
AH: I was in the process of making an official configuration for gadi. Copied all inputs that were in /short/public to the ik11 project. Once the directory structure is finalised, will make a config that runs, update it on GitHub, and look at making the same changes for other configs. Will make an exemplar config with those changes. RY: We should work on the same configs.
RY: Anyone else running on gadi? AH: No.
AH: What are the impediments to others updating ACCESS-OM2 on GitHub? People not sure if they can, or how they should go about it? AK: Put my hand up to do this. Other model components also need updating. AH: Maybe a dev branch that everyone pulls from. Easier to make changes without worrying about breaking things, so everyone is working from the same version and doesn't have to re-fix known bugs.
AH: Environment stuff? MW: Something about the python exec command. Nuance? Wholesale copy of everything? Wanted to create idealised processes, rather than depend on what users have set up. payu run submits the job to PBS with a whole new environment, explicitly given environment variables.
AH: Drawback is payu-run does not use the same environment as payu run. MW: Not launching a process. payu run submits to PBS and starts a posix process with a defined environment, except when explicitly given environment variables. AH: One work-around is to make a list of environment variables we want to keep. Currently losing the MODULEPATH variable. PL: The module environment being used by payu requires modules 3. Modules 4 works differently. The python code from modules 4 may work better.
MW: Fixed? AH: Thought I had, but was fooled because I was using payu-run. MW: If you set MODULEPATH locally, it won’t be exported to the payu run process.
PL: What is the fix? MW: On raijin there was a bootstrap script in the init dir, which sets everything up. I duplicated those commands in the payu module to do an equivalent bootstrap. If gadi is different, none of that bootstrap code works. PL: The bootstrap script is there, but completely different. MW: Was an old version, and never actually used the bootstrap script. Maybe exec the bootstrap script they provide? AH: Or pass through environment variables that are set already. MW: Do whatever you think is best. Did try to make it so the ‘payu run’ job was clean and always looked the same regardless of who submits. If we take the entire environment and submit it to the run, every run will be different. Passing one controlled variable is a cleaner solution. Ideally the submitted job should be able to set up its environment on its own node. Should get it going and not be held up by my purist notions. AH: Try/except blocks can be used to support multiple approaches. MW: Definitely need to bootstrap the modules. PL: Sent through an email with details.
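One possible shape for the “list of environment variables we want to keep” idea, sketched in Python (the whitelist and the use of qsub -v are illustrative assumptions, not payu’s actual behaviour):

    import os

    # Pass an explicit whitelist of variables through PBS via qsub -v,
    # rather than exporting the whole submitting environment with -V.
    KEEP = ["MODULEPATH", "MODULESHOME", "LOADEDMODULES", "PROJECT"]
    pairs = ",".join(f"{k}={os.environ[k]}" for k in KEEP if k in os.environ)
    qsub_flags = f"-v {pairs}" if pairs else ""
    print(qsub_flags)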

OpenMPI/4.0.1 on gadi

AH: Angus reported openmpi/4.0.1 seems broken. Has this been fixed?
AG: Any wrapped commands (mpicc, mpifort) will print whitespace before their output. In most cases this is ok, but it can break configure scripts. Ben M knows about it, but not why it happens.
PL: Divide by zero error in MPI_Init. MW: Remember that one: UCX back-end, FP exception. It evaluates a log function when building the binary tree to work out communication. Ben M told them about it, but got nothing back. We use FP exception checking, but can’t disable it for just MPI. PL: Is there a work-around, like turning off UCX? MW: Could turn off FP exceptions. It is a race condition, so not every job sees it. RY: Can turn off UCX and use ob1 instead. Could also try that. PL: Wasn’t sure it would work on gadi.
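A hedged sketch of the ob1 work-around, using OpenMPI’s standard MCA environment-variable mechanism (whether this avoids the gadi crash is untested here):

    import os

    # Select the ob1 point-to-point layer instead of UCX before the model
    # is launched; OMPI_MCA_<param> is how OpenMPI reads MCA parameters
    # from the environment.
    os.environ["OMPI_MCA_pml"] = "ob1"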
AH: Maybe 4.0.1 not a good candidate for testing? Get intermittent crashes.

Russ update on model performance on gadi

RF: Been testing OFAM (Bluelink), compiled as MOM-SIS without ice. Performance was fantastic: 2x faster than Sandy Bridge. Don’t get hammered with extra cost on the new CPUs. Initialisation was very fast. A lot of files, so it might be a low-load issue. Initialisation dropped from 100 s to 8 s. Doing data assimilation runs, run 3 days at a time; 25% of the run time was init, now pretty much zero. MOM5 performance was really good.
RF: Did notice some variation on start up of CM4. Still a lot faster. Reads in a lot more files and a lot more data. Still considerably faster than on raijin. MW: MOM has IO timers, do you have those on? FMS timers. Rui used them a lot. RF: No, didn’t turn them on.
RF: Running CM4 was about 15% faster than Broadwell. Improved, but it will cost a lot more for decadal prediction. RY: 15% is normal. Martin reported the UM is 30% quicker. RF: SIS2 load balance is bad. Probably a bunch of things being covered up. Needs more testing.
MW: Bob has never talked about SIS2 load imbalance. Presumably oblivious to them. RF: Would have to be. Regular layout would lead to many redundant processors. MW: Alistair has done some iceberg code load balance improvements. RF: Doesn’t take much time. Had to turn off iceberg stuff on raijin. netcdf stuff broke it. Might turn back on. Time spent in iceberg code minimal.

Stack array errors and heap array option

RF: When compiling, need to set the heap-arrays option in the compiler, otherwise get segfaults from the stack, even when the stack is set to unlimited. Wasn’t an issue on raijin. Happened for both MOM5 and CM4. PL: Dale mentioned the stack size being limited to 8 MB. RF: I set the stack size to unlimited, so that shouldn’t have been an issue. Got all sorts of issues with unmapped addresses. The first one I saw was an automatic array, so tried moving it to allocatable, which moved the error. Then tried different heap-arrays size options, which moved the error again. For MOM5 dropped the heap-arrays threshold to 5 KB. Same for CM4, but set it to zero for SIS2 and it got through. Different models, seems ubiquitous. MW: Intel fortran?
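For reference, a sketch of the Intel Fortran option being described (the FFLAGS variable names are illustrative; the 5 KB and zero thresholds are the ones Russ mentions above):

    import os

    # -heap-arrays <size>: automatic arrays larger than <size> KB go on
    # the heap; with size 0 (or no size) all automatic arrays go on the heap.
    os.environ["FFLAGS_MOM5"] = "-heap-arrays 5"   # MOM5: 5 KB threshold
    os.environ["FFLAGS_SIS2"] = "-heap-arrays 0"   # SIS2: everything on heap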
MW: When compiled and run on Cray machines, stack variables use malloc, so they are heap variables, not stack. Same model, same compiler (gcc) on a laptop, and the same variables are stack variables. Is it possible that moving from raijin to gadi changed something about malloc? RY: CentOS 7 vs 8 makes some difference. MW: Is the kernel making some decisions on malloc? RY: Had similar issues with the UM. Stacksize unlimited seemed to fix it for the UM. But Dale talked about this in an ACCESS meeting; the kernel changed something that caused this problem.
NH: Intel compiler has an option to always put arrays on the heap. Useful in some cases. Models can have array bounds overruns, and these are easier to track when you trash the heap compared to the stack. RY: Slower? NH: Depends. Doesn’t do it for everything, just the larger arrays. RF: If you just set heap-arrays, everything goes on the heap. Can control it. MW: In MOM6 there are explicit places where we declare variables we know we won’t use, contingent on the assumption they are stack variables. Can’t make those assumptions any longer.
NH: Surprised to hear it is the linux kernel. Would think it was the Fortran runtime or compiler. MW: Runtime or libc. Couldn’t figure out why there are different results with the same compiler on different platforms. NH: When calculating variable addresses, the compiler computes stack offsets. Looking at the executable there are static offsets; this needs to be done at compile time. MW: Shouldn’t be running models that need to use the heap. Should be resilient to either choice. No? NH: Comes down to the algorithms used to manage memory. The heap has an algorithm to minimise fragmentation. Don’t have an answer, will need to think about it.
MW: Can you send a bug report for SIS2? RF: Could be everywhere that has run out of stack space. Just the first one I tried to fix this.
AH: What OS are you running on your laptop? MW: Archlinux. Comparing them to the travis VMs. AH: At some point the compiler has to query the system to see what resources are available? MW: The fact that you’re typing stacksize unlimited shows you are accessing the kernel. AH: Seems strange, the system has plenty of memory. MW: I’m interested in this problem. AH: The problem should be reported to the relevant NCI people (Dale/Ben?). Potentially affecting a lot of codes. Not tenable that everyone who has this issue has to debug it themselves. MW: Is a bad memory access explicit on the stack but buried in the heap? NH: Can make a huge difference. The layout of memory is different. More likely something on the heap won’t affect other variables. More fragmented on the stack; heap memory is more tightly packed. MW: Fixed a couple of dozen memory access bugs in MOM6 and they take it seriously. RF: I’m using old versions with the CM4 release. Happens with MOM5 too; only FMS is common. MW: Wondering if this is a bug that is hidden by moving from stack to heap.
MW: Using GCC 9.0 to find these. A few flags to find stuff: initialise with NaNs; malloc-perturb is an environment variable you can turn on and that helps. Turns on signalling NaNs, so any FP op generates an error now. Finds a lot of zeroes in bad memory accesses that didn’t trigger errors. Trying not to use valgrind, but that would work also.
RF: There is a switch in GCC that does something similar to valgrind; it puts guards around arrays. MW: Don’t know the explicit option; using -Wall turns it on for me. GCC 9.0 is very aggressive at finding issues in a way that 5/6/7 were not.
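A hedged sketch of the sort of GCC/gfortran debugging setup described above (the exact flags are assumptions; they are not named in the discussion):

    import os

    # Initialise reals with signalling NaNs, trap FP exceptions, and add
    # runtime checks (bounds, etc.); MALLOC_PERTURB_ poisons freed heap
    # memory so reads of stale data are more likely to show up.
    os.environ["FCFLAGS"] = (
        "-g -Wall -finit-real=snan -ffpe-trap=invalid,zero,overflow -fcheck=all"
    )
    os.environ["MALLOC_PERTURB_"] = "42"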
AH: Try the same compiler on raijin and gadi to see if it is a gadi-only issue. RF: Not sure if it was the same 2019 version I was using. AG: There is one overlapping compiler, 2019.3. RF: Recently recompiled the MOM-SIS build. Will look and see if it is the same. AH: It would be a useful data point to know whether the issue is gadi specific.

Update on BGC

AH: Andy Hogg has asked for an update. People at Melbourne would like to use it. RF: On my desk with Hakase. Been promising it. Will prioritise. Has been almost there for a while. Been distracted with gadi. On the to-do list.
MC: Do we know who in Melbourne wants to use it? AH: A student, not sure who.

New projects to support COSIMA and ACCESS-OM2 on gadi

AH: /g/data/ik11 is where inputs that were on /short/public will now live. Not sure exactly how this will be organised. Will most likely have input and output directories. There might be some pre-published COSIMA datasets there, as part of a publishing pipeline. AK: Moving data from scratch to this as a holding area? AH: People were using datasets from hh5 that had no formal status, and it is not clear how to reference them.
AK: Control directories are separate, and not well connected to the data on hh5. It would be nice to have ways to link things more firmly. AH: A to-do for payu is to have experiment tracking IDs: generate UUIDs as unique identifiers for experiments. These will go in the metadata file, not linked to the git hash; if they don’t exist, make new ones. AK: We have data on hh5 where the control directories have been moved or deleted, so we lose the git history of the runs that were used to generate the output. AH: Nothing to stop all of that being in the same directory; Nic has advocated this for some time. Could change the way we do things. AK: Not sure of the solution, but flagging it as an issue.
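A minimal sketch of the experiment tracking id idea (the metadata.yaml file name and key are assumptions about how payu might record it, not its current behaviour):

    import uuid
    import yaml

    # Generate a UUID once and keep it in the control directory's metadata
    # file; subsequent runs reuse the existing id rather than minting a new one.
    try:
        with open("metadata.yaml") as f:
            metadata = yaml.safe_load(f) or {}
    except FileNotFoundError:
        metadata = {}
    metadata.setdefault("experiment_uuid", str(uuid.uuid4()))
    with open("metadata.yaml", "w") as f:
        yaml.safe_dump(metadata, f)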
AH: The published dataset from the COSIMA paper is almost ready. The new location for COSIMA published data will be cj50. To do this publishing, have created a python/xarray tool to create a published dataset from raw model data. It splits data into separate files for each variable, a year per file in most cases. A specific naming convention is needed for THREDDS publishing. Using xarray, it doesn’t matter what the temporal range of each model output file is. Uses pandas-style resampling to generate outputs. In theory simple; in practice there are many, many exceptions and specific tweaks to be standards compliant. The same tool can handle MOM and CICE outputs, which are different models with radically different file metadata and layout. If you have something you might find it useful for, it is called splitvar. Also made a tool called addmeta for adding metadata. The metadata modification is done as a separate step as it is always fiddly. It uses yaml formatted files to define metadata. The metadata for the COSIMA data publishing is available.
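Not the actual splitvar code, but a minimal xarray sketch of the splitting it describes (the input path and yearly grouping are illustrative):

    import xarray as xr

    # Write one file per variable per year, regardless of how the raw
    # model output was chunked in time.
    ds = xr.open_mfdataset("output*/ocean/ocean_month.nc", combine="by_coords")
    for name, var in ds.data_vars.items():
        if "time" not in var.dims:
            continue  # skip static fields without a time axis
        for year, yearly in var.groupby("time.year"):
            yearly.to_netcdf(f"{name}_{int(year)}.nc")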
PL: Published data is netCDF format with all the correct metadata? AH: MOM doesn’t put much metadata in the files. One way to make a better connection between runs and outputs is to insert the experiment tracking id mentioned above into the files. Would be nice to put that into a namelist so that MOM could put it in the file; that is the best option, and if anyone knows how, I would like to know. Another option is a post-processing step on all the tiled outputs. MOM isn’t the only model we run, and not all of them output netCDF, so it would be nice if there was a consistent way for payu to do this. COSIMA published data should be up before the end of the year.
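A hedged sketch of the post-processing option just mentioned: stamping the tracking id into an output file as a global netCDF attribute (the file name, attribute name and id value are all hypothetical):

    import netCDF4

    # Append-mode open so the existing data is untouched; only a global
    # attribute is added to the file.
    with netCDF4.Dataset("ocean_daily.nc", "a") as ds:
        ds.setncattr("experiment_uuid", "replace-with-real-uuid")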
PL: Will ik11 replace hh5 and v45? AH: hh5 is storage space that is part of an ARC LIEF grant from the Australian climate community. The COE CMS team was tasked with managing this, and people could ask for temporary storage allocations. In practice it is hard to get people to remove their data. COSIMA was one of the first to ask for an allocation, but it has somewhat outgrown the original intent of hh5, as it has been there for a long time and grown quite large. hh5 might still be used for some model outputs. Not sure. ik11 started because we needed somewhere to put common model inputs/exes, because /short/public went away and /scratch/public is ephemeral. /scratch space is difficult to utilise because of its ephemeral nature. NH: Have some experience with /scratch space at Pawsey. Once you lose data you make sure you have a better system so your data is backed up. Possibly a good thing. AH: Doesn’t suit the workflow people currently use, where they come back and run some more of a model after a break. It suits workflows that create large amounts of data, then do a massive reduction and only save the reduced dataset. Maybe suits the ensemble guys. With our models we want to keep everything we create. NH: Doesn’t all the model output go to scratch? AH: Yes, but model output doesn’t get reduced, so we end up having to mirror the data.