3  Setup and Usage

The evoland-plus HPC pipeline consists of various scripts that can be found in the src directory. For each included step of Figure 2.1, there is a subdirectory in src/steps.

The task of the evoland-plus HPC pipeline is to streamline this process, so that varying the climate scenarios and other parameters can be carried out efficiently. Introducing parallelization through SLURM batch jobs and adding HPC compatibility are the main tasks of the pipeline. At the same time, the pipeline manages the intermediate results, a centralized configuration file, and the execution of each step. Details on the individual steps are given in the following sections.

3.1 Setup

Before you set up the evoland-plus HPC pipeline, make sure a few requirements are satisfied. This section goes over hardware and software requirements and then guides you through the evoland-plus HPC repository setup. The following pages walk through the details of each step in the pipeline before concluding with the execution of the pipeline.

3.1.1 Requirements

We are using a Linux cluster with SLURM as the scheduler. If your cluster uses a different scheduler, check whether it is compatible with the SLURM syntax, or adapt the scripts to your scheduler.

Note 3.1: What is a Linux cluster?

As this pipeline is specifically designed to simulate a large number of scenarios, it has been optimized for high-performance computing (HPC) environments. If you only need to run a few scenarios, it might be easier to run the steps manually. Otherwise, you do need to have access to a Linux cluster with SLURM. Feel free to reach out to a technically savvy colleague or your local HPC support for help.

3.1.1.1 Hardware

The minimum memory and CPU requirements cannot be stated in general, as they depend on the area of interest, the input data, and the number of scenarios. A viable starting point for a country the size of Switzerland, at a resolution of 100 m, is 16 GB of memory and 4 CPUs. This assumes only a few scenarios and no parallelization within the steps. Scaling up to around 1000 scenarios, we suggest at least 128 GB of memory and 16 CPUs to achieve a viable runtime. As this is an estimate, it is essential to monitor runtime before scaling up.

3.1.1.2 Software

Additionally, you need to install the following software:

3.1.1.2.1 Micromamba/Conda

For some pipeline steps, we use conda environments. Conda is a package manager that helps you manage dependencies in isolated environments. We recommend micromamba, which does the same job as Conda but resolves dependencies much faster while offering the flexibility of miniconda (the CLI of Conda). Find the installation instructions for Micromamba here. We have added compatibility for micromamba, mamba, and conda, in this order of preference, but only tested with micromamba.¹

We have chosen conda-forge as the default channel for the conda environments, as it is a single source for our R, Python, and lower-level dependencies (e.g., gdal, proj). This is independent of the modules and applications provided by the HPC environment.

3.1.1.2.2 Apptainer

Running containerized applications on HPCs can be challenging. To simplify the process, we use the Apptainer (formerly Singularity) container runtime. Make sure your HPC environment supports Apptainer, and that you have the necessary permissions to run containers. If this is not the case, contact your HPC support team for help.

3.1.1.2.3 Docker

Building the LULCC container requires Docker² before the image is converted to the Apptainer format. The LULCC container builds on the dinamica-ego-docker container (version 7.5).

This step can be done on a local machine, and will be explained in the LULCC step.

3.1.1.2.4 Dinamica EGO

Dinamica EGO is an environmental modeling platform used in the LULCC step. It is available on the project website. As mentioned above, however, it is used from within the LULCC Docker image, since it is only integrated through the command line interface (CLI), not the usual graphical user interface (GUI).

3.1.1.2.5 YAML Parser yq

For the bash scripts, we use yq to parse the YAML configuration file. yq needs to be available on the shell's PATH. To install the latest version³, run the following command:

bin_dir=/usr/bin && \
wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O "$bin_dir/yq" && \
chmod +x "$bin_dir/yq"

Other installation options and binaries can be found in the repository’s README. Make sure $bin_dir is included in the PATH variable so that yq is available. To check that the parser is installed correctly, run yq --version in the shell.

3.1.1.2.6 LULCC Repository

The version used for evoland-plus HPC is a reduced version of the original model, adapted for containerized execution on HPCs, and can be found on the hpc branch of the repository. Clone the repository to the HPC using git or download the repository as a zip. If you have never used git before, search online for a guide on how to clone a repository.
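For example, cloning onto the HPC could look as follows; the repository URL is a placeholder, so use the address of the LULCC repository referenced above and keep the hpc branch:

# clone the reduced LULCC model and check out the hpc branch
# (replace the URL with the actual LULCC repository address)
git clone --branch hpc https://github.com/<org>/LULCC-CH.git ~/LULCC-CH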

3.1.2 evoland-plus HPC Repository

After you have set up the requirements, you can clone the evoland-plus HPC repository. This repository contains the pipeline and all necessary scripts to run it.

Before you start the pipeline, you need to configure it. The settings are centralized in the config.yml file. There are only a few mandatory changes, which we highlight here; you can find more settings with descriptive names in the file.

src/config.yml
# Bash variables
bash_variables:
  FUTURE_EI_CONFIG_FILE: ~/evoland-plus HPC/src/config.yml
  FUTURE_EI_OUTPUT_DIR: ~/evoland-plus HPC-Output
  ...
  # LULCC HPC version
  LULCC_CH_HPC_DIR: ~/LULCC-CH
  ...
  # Overwrites $TMPDIR if not set by the system. $TMPDIR is used by Dinamica EGO
  # and conda/libmamba
  ALTERNATIVE_TMPDIR: /scratch
...

For each script, src/bash_common.sh is sourced to set the environment variables. First, FUTURE_EI_CONFIG_FILE needs to be set to the absolute path of this configuration file. FUTURE_EI_OUTPUT_DIR is the directory where the outputs of the pipeline will be stored. As the pipeline needs several times more temporary space than the output itself, having a fast and large temporary directory is crucial. If the HPC does not set the $TMPDIR variable, you can point ALTERNATIVE_TMPDIR to a different directory; it is used in the LULCC and NCP steps for temporary files. Finally, LULCC_CH_HPC_DIR is the directory where the LULCC repository cloned in the previous step is stored.
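As a minimal sketch of what this sourcing amounts to, a single entry under bash_variables can be read with yq and exported; the actual logic in src/bash_common.sh may differ:

# read one entry from the bash_variables section of the configuration and export it
export FUTURE_EI_CONFIG_FILE="$HOME/evoland-plus HPC/src/config.yml"
LULCC_CH_HPC_DIR="$(yq '.bash_variables.LULCC_CH_HPC_DIR' "$FUTURE_EI_CONFIG_FILE")"
export LULCC_CH_HPC_DIR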

src/config.yml
# Focal LULC
FocalLULCC:
  ...

# LULC check
CheckLULCC:
  ...

The FocalLULCC and CheckLULCC sections contain settings dedicated to separate steps in the pipeline and are loaded specifically in the respective scripts. We will touch on these settings in the respective steps. To see the current settings (and test yq), print the contents of config.yml as idiomatic YAML to stdout:

yq -P -oy src/config.yml

As a last general note, make sure the scripts are executable. To make all bash scripts in the source tree executable, set the permission as follows:

# possibly activate globstar: shopt -s globstar
chmod +x src/**/*.sh

The next sections will guide you through the setup of each step in the pipeline.

3.2 Steps

3.3 Land Use Land Cover Change

LULCC is a Dinamica EGO (Leite-Filho et al. 2020) model, and makes use of the R (R Core Team 2022) ecosystem, including packages from the Comprehensive R Archive Network (CRAN). You can find the LULCC model, as well as an adapted version for use with evoland-plus HPC, in the LULCC repository (Black 2024), as mentioned in the setup section.

Caution 3.2: Used Versions in this step

The R package versions used with the LULCC Docker container are listed in the footnote.⁴

LULCC needs a variety of inputs. These are set via environment variables in the src/config.yml file; the assumption is that the src directory contains project-specific code and hence also setup details. Here is an excerpt of the file:

src/config.yml
# Bash variables
bash_variables:
  ...
  # Model Variables - from LULCC_CH_HPC root
  LULCC_M_CLASS_AGG: Tools/LULC_class_aggregation.xlsx
  LULCC_M_SPEC: Tools/Model_specs.csv
  LULCC_M_PARAM_GRID: Tools/param-grid.xlsx
  LULCC_M_PRED_TABLE: Tools/Predictor_table.xlsx
  LULCC_M_REF_GRID: Data/Ref_grid.tif
  LULCC_M_CAL_PARAM_DIR: Data/Allocation_parameters/Calibration
  LULCC_M_SIM_PARAM_DIR: Data/Allocation_parameters/Simulation
  LULCC_M_RATE_TABLE_DIR: Data/Transition_tables/prepared_trans_tables
  LULCC_M_SIM_CONTROL_TABLE: ~/LULCC-CH/Tools/Simulation_control.csv
  LULCC_M_SPAT_INTS_TABLE: Tools/Spatial_interventions.csv
  LULCC_M_EI_INTS_TABLE: Tools/EI_interventions.csv
  LULCC_M_SCENARIO_SPEC: Tools/Scenario_specs.csv
  LULCC_M_EI_LAYER_DIR: Data/EI_intervention_layers
  LULCC_M_REMOVE_PRED_PROB_MAPS: True # remove prediction probability maps after
  # simulation if 1, True or TRUE

A relevant parameter to change is the LULCC_M_SIM_CONTROL_TABLE variable. This is the only absolute path, and it should point to the Simulation_control.csv file. All further paths are relative to the LULCC repository root: the files under Tools are configuration files, while the Data directory contains input and working data. For information on the remaining variables, see the LULCC repository and paper (Black 2024).

3.3.1 Simulation Control Table

Simulation_control.csv is a table that controls the scenarios to be simulated, including the data described in Table 3.1. This format extends the original format from the LULCC model.

~/LULCC-CH/Tools/Simulation_control.csv
Simulation_num.,Scenario_ID.string,Simulation_ID.string,Model_mode.string,Scenario_start.real,Scenario_end.real,Step_length.real,Parallel_TPC.string,Pop_scenario.string,Econ_scenario.string,Climate_scenario.string,Spatial_interventions.string,EI_interventions.string,Deterministic_trans.string,Completed.string,EI_ID.string
1,BAU,1,Simulation,2020,2060,5,N,Ref,Ref_Central,rcp45,Y,Y,Y,N,1
217,EINAT,217,Simulation,2020,2060,5,N,Low,Ecolo_Urban,rcp26,Y,Y,Y,N,217
433,EICUL,433,Simulation,2020,2060,5,N,Ref,Ecolo_Central,rcp26,Y,Y,Y,N,433
649,EISOC,649,Simulation,2020,2060,5,N,Ref,Combined_Urban,rcp45,Y,Y,Y,N,649
865,BAU,865,Simulation,2020,2060,5,N,Ref,Ref_Central,rcp85,Y,Y,Y,N,1

Each row describes one scenario to be simulated. This table controls which data is used to simulate the land use changes.

Table 3.1: Description of the columns in the Simulation_control.csv file.

  Column Name                   Description
  Simulation_num.               The number of the simulation.
  Scenario_ID.string            The scenario ID.
  Simulation_ID.string          The simulation ID.
  Model_mode.string             The model mode.
  Scenario_start.real           The start year of the scenario.
  Scenario_end.real             The end year of the scenario.
  Step_length.real              The length of the steps.
  Parallel_TPC.string           Whether the simulation is parallelized.
  Pop_scenario.string           The population scenario.
  Econ_scenario.string          The economic scenario.
  Climate_scenario.string       The climate scenario (e.g., rcp45, rcp26, rcp85).
  Spatial_interventions.string  Whether spatial interventions are used.
  EI_interventions.string       Whether EI interventions are used.
  Deterministic_trans.string    Whether deterministic transitions are used.
  Completed.string              Whether the simulation is completed.
  EI_ID.string                  The EI ID.

3.3.2 Container Setup

For a platform-independent execution of Dinamica EGO, we created a dinamica-ego-docker container. This way, the glibc version is fixed and the container can be used independently of the host system.⁵ This image serves as the base of the LULCC Docker container. Our Dockerfile src/steps/10_LULCC/Dockerfile then adds the necessary R packages for LULCC to the container. The Apptainer Definition File src/steps/10_LULCC/lulcc.def bootstraps the Docker container, mounts the LULCC_CH_HPC_DIR to the /model directory (it is not shipped within the container), and translates the entry point to the Apptainer format. This includes adding the necessary environment variables, connecting the Simulation Control Table, and pointing Dinamica EGO to the correct R binary, among other details found in the Definition File. Figure 3.1 summarizes the levels of wrapping.

flowchart LR
    Dinamica([Dinamica EGO]) --> Docker(dinamica-ego-docker)
    Docker --> LULCC(LULCC docker)
    LULCC --> Apptainer[Apptainer container]

    style Dinamica color:#2780e3, fill:#e9f2fc, stroke:#000000
    style Docker color:#0e7895, fill:#cbf4ff, stroke:#000000
    style LULCC color:#0e7895, fill:#cbf4ff, stroke:#000000
    style Apptainer color:#07946e, fill:#def9f2, stroke:#000000
Figure 3.1: Visualization of how Dinamica EGO is wrapped until it can be used with Apptainer on the HPC.

The LULCC Docker image can be installed or built automatically using the src/steps/10_LULCC/docker_setup.sh script, which uses variables from src/config.yml. If you have Docker installed, the setup script guides you through building, pushing, or pulling the LULCC Docker container. This step can be done on a local machine. Subsequently, with Apptainer installed, the LULCC Docker image can be converted to an Apptainer container. On the HPC, this latter step suffices if you use the pre-configured LULCC_DOCKER_REPO, unless you want to rebuild the container. The decisive line in the script is:

src/steps/10_LULCC/docker_setup.sh (lines 84ff)
apptainer build \
      --build-arg "namespace=$namespace" --build-arg "repo=$repo" \
      --build-arg "version=$version" \
      "$APPTAINER_CONTAINERDIR/${repo}_${version}.sif" "$SCRIPT_DIR/lulcc.def"

Depending on your system, you might want to reconfigure the Apptainer variables:

src/config.yml
# Bash variables
bash_variables:
  ...
  # Apptainer variables for the apptainer container
  APPTAINER_CONTAINERDIR: ~/apptainer_containers
  APPTAINER_CACHEDIR: /scratch/apptainer_cache

APPTAINER_CONTAINERDIR is used to store the Apptainer containers, and APPTAINER_CACHEDIR is used when building them. If your HPC does not have a /scratch directory, you might want to change it to another temporary directory.

After all previous steps are completed, you can test the LULCC model with a few test scenarios in the simulation control table by submitting src/steps/10_LULCC/slurm_job.sh. Before the full, parallelized simulation is started, read the following sections.
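Such a test run, with only a few scenarios listed in the simulation control table, is submitted like any other job script:

sbatch src/steps/10_LULCC/slurm_job.sh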

3.4 Check LULCC

To check the integrity of the LULCC output from the previous step, an intensity analysis is performed. Before that, a simple visual inspection of the output maps is recommended. The intensity analysis then considers the cumulative pixel-wise change in land use and land cover (LULC) classes and computes the contingency table over a time series as a measure of change between the land use classes. These changes should be in a realistic range (e.g., between \(0\%\) and \(5\%\)); otherwise, this can point to issues in the input data or the model itself.

Note 3.3

This step automatically analyzes all LULCC scenarios. However, it is not integrated into the main execution script, as detailed later in the Running the pipeline section.

The configuration section for this step is as follows:

src/config.yml
# LULC check
CheckLULCC:
  InputDir: # keep empty to use FUTURE_EI_OUTPUT_DIR/LULCC_CH_OUTPUT_BASE_DIR
  OutputDir: # keep empty to use FUTURE_EI_OUTPUT_DIR/CHECK_LULCC_OUTPUT_DIR
  BaseName: LULCC_intensity_analysis # Can be used to distinguish different runs
  Parallel: True
  NWorkers: 0  # 0 means use all available cores

This step uses a conda environment with raster~=3.6-26 alongside further R packages. The automatic setup script src/steps/11_CheckLULCC/11_CheckLULCC_setup.sh needs to be executed once; it creates the conda environment check_lulc with the packages listed in 11_checklulcc_env.yml.
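For example, the environment can be created and checked once from the repository root (replace micromamba with your conda CLI if needed):

# create the check_lulc conda environment
bash src/steps/11_CheckLULCC/11_CheckLULCC_setup.sh
# verify that the environment exists
micromamba env list | grep check_lulc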

Running the intensity analysis is as easy as submitting the job script slurm_job.sh with sbatch.

sbatch src/steps/11_CheckLULCC/slurm_job.sh

The sbatch command submits the job to the HPC scheduler with the running options specified in the header of the job script.

src/steps/11_CheckLULCC/slurm_job.sh (lines 1-10)
#!/bin/bash
#SBATCH --job-name="11_check_lulcc"
#SBATCH -n 1                  # Number of cores requested
#SBATCH --cpus-per-task=25    # Number of CPUs per task
#SBATCH --time=4:00:00        # Runtime
#SBATCH --mem-per-cpu=4G      # Memory per cpu in GB (see also --mem)
#SBATCH --tmp=2G              # https://scicomp.ethz.ch/wiki/Using_local_scratch
#SBATCH --output="logs/11_check_lulcc-%j.out"
#SBATCH --error="logs/11_check_lulcc-%j.err"
#SBATCH --mail-type=NONE       # Mail events (NONE, BEGIN, END, FAIL, ALL)

Change these settings according to your needs and the available resources. Monitor the logs in the logs directory to check the progress of the job. If you want to specify more options, refer to the SLURM documentation or your local HPC documentation.

3.5 Focal LULC

This step calculates focal statistics for the land use and land cover change (LULCC) data. The resulting focal windows are used for the N-SDM model (Black 2024). It follows a similar structure to the previous Check LULCC step: it uses its own conda environment, and the task has a separate job script. The configuration section for this step is as follows:

src/config.yml
# Focal LULC
FocalLULCC:
  InputDir: # keep empty to use FUTURE_EI_OUTPUT_DIR/LULCC_CH_OUTPUT_BASE_DIR
  OutputDir: # keep empty to use FUTURE_EI_OUTPUT_DIR/FOCAL_OUTPUT_BASE_DIR
  BaseName: ch_lulc_agg11_future_pixel  # Underscores will be split into folders
  RadiusList: [ 100, 200, 500, 1500, 3000 ]
  WindowType: circle
  FocalFunction: mean
  Overwrite: False # False -> skip if output exists, True -> overwrite
  Parallel: True
  NWorkers: 0  # 0 means use all available cores

This script recursively goes through the input directory and calculates the focal statistics for each scenario. It creates the outputs in a similar structure, inside the output directory, named after the BaseName. For each scenario, the focal statistics by WindowType and FocalFunction are calculated for each radius in RadiusList. For details, consult the docstring of the method 20_focal_statistics::simulated_lulc_to_predictors.

This step uses a conda environment with raster~=3.6-26, terra~=1.7-71 (only used for conversion), and further R packages. The conda environment focal_lulc is set up by executing the setup script
src/steps/20_FocalLULC/20_FocalLULC_setup.sh.
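As before, the setup script only needs to be run once before submitting the job:

bash src/steps/20_FocalLULC/20_FocalLULC_setup.sh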

As for the previous steps, the job script slurm_job.sh needs to be submitted to the HPC scheduler.

sbatch src/steps/20_FocalLULC/slurm_job.sh

Focal LULC output check

The src/steps/20_FocalLULC/show_files.py script checks whether the focal window output files are complete. It verifies the presence of expected files in the output directory structure, calculates the percentage of completed files for each year, and lists any missing files. The script outputs a summary table and saves the missing file names to a text file called missing_files.txt.

3.6 Nature’s Contributions to People

Based on the code written for Külling et al. (2024), we automated the calculation of eight NCP. Note that our study includes more NCP than these eight, as some of them are characterized by the plain focal windows (Black et al. 2025).

In addition to R and CRAN packages, this step uses InVEST via the Python module natcap.invest.

Caution 3.4: Used Versions in this step

Due to previous API changes, the code is compatible with natcap.invest=3.13.0, but not with earlier versions. raster=3.4-13 and terra=1.5-21 are used for the NCP calculation, among other packages.⁶

As in the previous two steps, the conda environment ncps is set up using the src/steps/40_NCPs/40_NCPs_setup.sh script.
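After the setup script has run, a quick check that the environment provides the pinned InVEST version could look like this (using micromamba; adapt to your conda CLI):

bash src/steps/40_NCPs/40_NCPs_setup.sh
micromamba run -n ncps python -c "import natcap.invest; print(natcap.invest.__version__)"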

3.6.1 NCPs

Table 2.1 lists all NCP calculated in the evoland-plus HPC project. Here, we detail the eight NCP calculated in this step. The config.yml file includes a few variables which are automatically used for the NCP calculation.

src/config.yml (54-59)
  # NCP variables
  NCP_PARAMS_YML: ~/evoland-plus HPC/src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
  NCP_RUN_SCENARIO_ID: # Scenario ID, automatically set for each configuration
  NCP_RUN_YEAR: # Year for which to run NCPs, automatically set
  NCP_RUN_OUTPUT_DIR: # Output directory for NCPs, automatically set
  NCP_RUN_SCRATCH_DIR: # Scratch directory for NCPs, automatically set

The more detailed configuration for each NCP is stored in the 40_NCPs_params.yml file. For parallelization purposes, each array job receives a copy of this file with the respective scenario ID and year. The bash variables NCP_RUN_* from the config.yml act as placeholders.

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
# Run Params (are passed when calling the run_all_ncps.py script)
run_params:
  NCP_RUN_SCENARIO_ID:
  NCP_RUN_YEAR:
  NCP_RUN_RCP:  # programmatically set in load_params.py
  NCP_RUN_INPUT_DIR:
  NCP_RUN_OUTPUT_DIR:
  NCP_RUN_SCRATCH_DIR:
  LULCC_M_EI_LAYER_DIR:  # set in load_params.py (uses config.yml)  # SDR

In preparation, it is essential to set the paths to the input data. Some of these are shared among multiple NCP, as noted in the comments. The first three layers are automatically found in the NCP_RUN_INPUT_DIR and depend on the scenario ID and year. These three are constructed from the file name template that the LULCC model produces, as can be seen in the load_params.py script.

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
# Data
data:
  # LULC               - CAR, FF, HAB, NDR, POL, SDR, WY
  lulc: # automatically found in NCP_RUN_INPUT_DIR
  # Rural residential  - HAB
  rur_res: # automatically found in NCP_RUN_INPUT_DIR
  # Urban residential  - HAB
  urb_res: # automatically found in NCP_RUN_INPUT_DIR
  # Production regions - CAR
  prodreg: Data/PRODUCTION_REGIONS/PRODREG.shp
  # DEM                - CAR, NDR
  dem: Data/DEM_mean_LV95.tif
  # DEM filled         - SDR
  dem_filled: Data/DEM_mean_LV95_filled.tif
  # Watersheds         - NDR, SDR, WY
  watersheds: Data/watersheds/watersheds.shp
  # Subwatersheds      - WY
  sub_watersheds: Data/watersheds/Subwatersheds.shp
  # ETO                - WY
  eto: Data/evapotranspiration/
  # PAWC               - WY
  pawc: Data/Water_storage_capacity_100m_reclassified1.tif
  # Erodibility path   - SDR
  erodibility_path: Data/Kst_LV95_ch_nib.tif
  # Erosivity path     - SDR
  erosivity_path: Data/rainfall_erosivity/
  # Precipitation      - WY, NDR
  yearly_precipitation: Data/yearly_prec/
  # Soil depth         - WY
  depth_to_root_rest_layer: Data/rrd_100_mm_rexport.tif
  # Precipitation avgs - FF
  pavg_dir: Data/monthly_prec/
  # Temperature avgs   - FF
  tavg_dir: Data/monthly_temp/
  # Soil texture       - FF
  ph_raster: Data/ch_edaphic_eiv_descombes_pixel_r.tif
  # Distance to lakes  - REC
  distlakes_path: Data/distlakes.tif

# Projection Settings - change for different regions
proj:
  # CRS
  crs: epsg:2056
  # Extent
  ext: [ 2480000, 2840000, 1070000, 1300000 ]
  # Resolution
  res: 100

Caution 3.5: Resolution

Make sure that the resolution of all input data is the same and matches the proj.res setting in the 40_NCPs_params.yml file.
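One way to verify this is to inspect the pixel size of each input raster, for example with gdalinfo from GDAL (available through the conda environment or an HPC module); the path below is one of the inputs listed above:

# the reported pixel size should match proj.res (here 100 m)
gdalinfo Data/DEM_mean_LV95.tif | grep "Pixel Size"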

For each NCP, the configuration is detailed in the following sections.

3.6.1.1 CAR: Regulation of climate

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
CAR:
  # 1_CAR_S_CH.R
  # 2_CAR_S_CH.py
  bp_tables_dir:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/CAR/BPTABLE/
  # 3_CAR_S_CH.R
  # output prefix
  out_prefix: tot_c_cur_

To calculate the carbon stored in biomass and soil, the CAR NCP needs biophysical tables that specify the carbon content of different land use classes. The natcap.invest model Carbon Storage and Sequestration is used for this calculation.

3.6.1.2 FF: Food and feed

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
FF:
  # 0_FF_ecocrop.R
  crops_data:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/FF/crops.txt
  ecocrop_dir: evoland-plus HPC-Output/FF_preprocessing_ecocrop/

The FF NCP calculates the crop production potential using the ecocrop package, which uses a limiting-factor approach (Hackett 1991). This NCP has a data preparation step that needs to be executed once before running the parallelized NCP calculation. It is a single R script that can be triggered by calling src/steps/40_NCPs/NCP_models/prepare_ncps.sh; no SLURM job is needed.
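Since this preparation is not parallelized and runs only once, it can be started directly from a shell, for example:

# one-time FF preprocessing, writing to the ecocrop_dir configured above
bash src/steps/40_NCPs/NCP_models/prepare_ncps.sh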

3.6.1.3 HAB: Habitat creation and maintenance

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
HAB:
  # 0_thread_layers_generation.R
  # 1_HAB_S_CH.py
  half_saturation_constant: 0.075
  bp_table_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/HAB/BPTABLE/
  sensitivity_table_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/HAB/BPTABLE/hab_sensitivity.csv
  threats_table_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/HAB/BPTABLE/threats.csv

The HAB NCP calculates the habitat quality index using another natcap.invest model. Set the three biophysical tables accordingly.

Because of problems with how natcap.invest==3.13.0 handles its threat layer table, we had to introduce a hotfix in its source code to keep compatibility with the existing NCP configuration. When loading the threat layers, natcap.invest converts the column names to lowercase to be case-insensitive; however, the layer paths are also converted to lowercase, while our threat layer paths are case-sensitive. To work around this, we changed the to_lower argument in the execute function of habitat_quality.py and set the column name to match our lowercase column name.

.../ncps/lib/python3.10/site-packages/natcap/invest/habitat_quality.py (line 384)
# Change from:
            args['threats_table_path'], 'THREAT', to_lower=True,
# to:
            args['threats_table_path'], 'threat', to_lower=False,

In later versions, the InVEST developers have changed how these tables are loaded. Compatibility with the latest version of natcap.invest can be added once the breaking changes are also adapted for the other NCP. We want to note that changing the source code is bad practice and should only be considered as a last resort.

To find the corresponding natcap folder, navigate to the environment folder and locate the site-packages directory.

# activate the ncps environment with micromamba or conda
micromamba activate ncps
# find the site-packages folder
python -c "import site; print(site.getsitepackages())"
>>> ['.../micromamba/envs/ncps/lib/python3.10/site-packages']

In this folder, navigate further down to find .../site-packages/natcap/invest/habitat_quality.py.
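Alternatively, with the ncps environment activated, the exact file to patch can be printed directly:

python -c "import natcap.invest.habitat_quality as hq; print(hq.__file__)"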

3.6.1.4 NDR: Nutrient Delivery Ratio

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
NDR:
  # 1_NDR_S_CH.py
  # Biophysical table
  biophysical_table_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/NDR/BPTABLE/ndr_bptable_ds25_futei.csv
  calc_n: true
  calc_p: true
  k_param: 2
  # Suffix for output files
  # Subsurface critical length
  subsurface_critical_length_n: 100
  # Subsurface effective retention
  subsurface_eff_n: 0.75
  # Threshold flow accumulation
  threshold_flow_accumulation: 200

The NDR NCP calculates the Nutrient Delivery Ratio. The biophysical table specifies the nutrient retention by vegetation using various variables, e.g., root depth and more detailed soil properties described in the natcap.invest documentation.

3.6.1.5 POL: Pollination and dispersal of seeds

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
POL:
  # 1_POL_S_CH.py
  # Farm vector path
  farm_vector_path: ''
  # Guild table path
  guild_table_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/POL/BPTABLE/guild.csv
  # Landcover biophysical table path
  landcover_biophysical_table_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/POL/BPTABLE/pollination_bptable_ds25_futei.csv
  # 2_POL_S_CH_aggregating.R

The POL NCP runs the natcap.invest Crop Pollination model, followed by an aggregation step in R.

3.6.1.6 REC: Recreation potential

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
REC:
  # 1_REC.R
  # lulc naturality lookup table
  lutable_nat_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/REC/BPTABLE/lutable_naturality.csv

The REC NCP returns a Recreation Potential (RP) indicator. This is a normalized aggregate of three landscape characteristics maps:

  • Degree of naturalness (DN): Aggregate sum of naturalness scores for each LULC class.
  • Natural protected areas (NP): Binary map of 0=outside protected areas, 1=inside protected areas.
  • Water components (W): Inverse relative distance to lake coasts, with the highest value at the lake coast and a decreasing value for 2 km.

The output is a single map of recreation potential.

3.6.1.7 SDR: Formation, protection and decontamination of soils

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
SDR:
  # 1_SDR_S_CH.py
  # Biophysical table
  biophysical_table_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/SDR/BPTABLE/bptable_SDR_v2_futei.csv
  # Drainage path
  ic_0_param: 0.4
  k_param: 2
  l_max: 100
  # SDR max
  sdr_max: 0.75
  # Threshold flow accumulation
  threshold_flow_accumulation: 200

Sediment export and retention are calculated in the SDR NCP with the Sediment Delivery Ratio model from natcap.invest.

3.6.1.8 WY: Regulation of freshwater quantity, location and timing

src/steps/40_NCPs/NCP_models/40_NCPs_params.yml
WY:
  # 1_WY_S_CH.py
  # Biophysical table
  biophysical_table_path:
    evoland-plus HPC/src/steps/40_NCPs/NCP_models/WY/BPTABLE/wy_bptable_ds25_futei.csv
  # Seasonality constant
  seasonality_constant: 25

Annual Water Yield is the final NCP calculated in this step. The WY NCP calculates the hydropower potential.

3.6.2 Running the NCP calculation

Assuming the ncps environment is set up, all previous configurations are correctly set, the input data is available, and the FF NCP has been prepared using src/steps/40_NCPs/NCP_models/prepare_ncps.sh, the NCP calculation can be started.

To calculate all NCP for one scenario and year, the run_all_ncps.py script bundles the execution of all NCP. It is invoked through the run_all_ncps.sh script as follows:

# Usage: bash run_all_ncps.sh <NCP_RUN_SCENARIO_ID> <NCP_RUN_YEAR> <NCP_RUN_INPUT_DIR> <NCP_RUN_OUTPUT_DIR> <NCP_RUN_SCRATCH_DIR>
bash src/steps/40_NCPs/NCP_models/run_all_ncps.sh 1 2015 /path/to/input_dir /path/to/output_dir /path/to/scratch_dir

The simplified execution of this using the HPC scheduler SLURM is done with sbatch src/steps/40_NCPs/NCP_models/slurm_job.sh. The scenario ID and year are set in the job script.

The full, parallelized execution of the evoland-plus HPC pipeline for all scenarios, with LULCC and NCP calculation, is done with the 10_40_combined_array_job.sh script and SLURM; for this, consult the following Running section.

NCP output check

The src/steps/40_NCPs/show_files.py script shows existing NCP results by scenario. It reads numbers from files to generate a histogram of counts and checks if each expected file is present in each scenario. The script outputs a summary table of file coverage and lists any missing or unexpected files. Missing files are saved to scenarios_with_missing_files.txt, and unexpected files are saved to unexpected_files.txt. Such unexpected files might be intermediate files that are not cleaned up properly and can be deleted.

3.7 Running

The pipeline is executed in three parts, each of which is a separate Slurm job. Remember Figure 2.1 from the Structure section. The most computationally intensive steps, LULCC and NCP, are parallelized and submitted as one Slurm array job. For all of these steps, you need to have followed the previous sections to set up and configure the pipeline. This includes preparing the FF NCP using src/steps/40_NCPs/NCP_models/prepare_ncps.sh and filling the simulation control table with all the scenarios you want to run.

3.7.1 evoland-plus HPC pipeline

Land Use Simulation and NCP Estimation can be calculated separately for one scenario with the jobs src/steps/10_LULCC/slurm_job.sh and src/steps/40_NCPs/slurm_job.sh. The 10_40_combined_array_job.sh Slurm job calculates both steps for all scenarios in parallel; each array job receives a subset of the scenarios to calculate. All scenarios are calculated in parallel with the following Slurm job:

sbatch src/steps/10_40_combined_array_job.sh

This would submit the job to the cluster and start the calculation with the default settings.

src/steps/10_40_combined_array_job.sh
#!/bin/bash
#SBATCH --job-name="10_40_combined_array"
#SBATCH -n 1                  # Number of cores requested
#SBATCH --cpus-per-task=2     # Number of CPUs per task
#SBATCH --time=7-00:00:00     # Runtime in D-HH:MM:SS
#SBATCH --mem-per-cpu=2G
#SBATCH --tmp=2G
#SBATCH --output="logs/10_40_combined_array-%j.out"
#SBATCH --error="logs/10_40_combined_array-%j.err"
#SBATCH --mail-type=NONE      # Mail events (NONE, BEGIN, END, FAIL, ALL)
## Array job
#SBATCH --array=1-216%12      # start-end%num_parallel
#        ! step size needs to be 1

The speed-up of the combined job is achieved by running multiple scenarios in parallel; we do this because the speed-up from assigning more CPUs to one scenario is limited. Each of the 216 array jobs is assigned one task with two CPUs and 4 GB of memory. %12 in the array specification ensures that 12 array jobs run in parallel; when one job finishes, the next one is started. Each array job has a time limit of 7 days.

In our case, we had 1080 scenarios to calculate, so we set the array to 216 jobs, giving five scenarios per array job. Beforehand, we tested with only one scenario in the simulation control table, 10 GB of memory, and #SBATCH --array=1-1 to check that the job runs correctly. Running the Switzerland map at a resolution of 100 m by 100 m, the job took 6:23:06 hours to complete at a CPU efficiency of 75.29% and a memory efficiency of 75.58%. With tail -f logs/10_40_combined_array-*.out logs/10_40_combined_array-*.err it is easy to monitor the progress of the job. For explanations and more details on the sbatch options, see the Slurm documentation.
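The CPU and memory efficiency figures quoted above can be retrieved after a job has finished, for example with the seff utility if your cluster provides it:

# replace <jobid> with the job ID reported by sbatch
seff <jobid>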

When running a large array of scenarios, the array jobs vary in the amount of memory they require and the time they take. It is a valid approach to start with a memory limit that works for the majority of scenarios. Some jobs might fail due to memory issues, but after all array jobs have finished, it is possible to rerun the failed scenarios with a higher memory limit. This works because the LULCC and NCP are only calculated if the corresponding output file is missing, down to the level of each individual NCP.

The cluster might have a limit on the number of array jobs that can be run in parallel. To find out the limit, use scontrol show config | grep MaxArraySize.

To get a simple estimate of how long the job array takes, you can use cross-multiplication, starting with the time it took to calculate one scenario \(t_{\text{one}}\). With the number of scenarios \(n_{\text{all}}\) and the number of scenarios calculated in parallel \(n_{\text{parallel}}\), the time it takes to calculate all scenarios \(t_{\text{all}}\) is:

\[ t_{\text{all}} = \frac{n_{\text{all}}}{n_{\text{parallel}}} \times t_{\text{one}} \]
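For illustration, plugging in the numbers from our setup described above (1080 scenarios, 12 array jobs running in parallel, roughly 6.4 hours per scenario) gives a rough wall-clock estimate:

\[ t_{\text{all}} \approx \frac{1080}{12} \times 6.4\,\text{h} \approx 576\,\text{h} \approx 24\,\text{days} \]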

Tip 3.1: Selective running

If you only want to run either the LULCC or the NCP, you can modify the 10_40_combined_array_job.sh script to run only the respective part. This comes down to commenting out one line in the script.

3.7.2 Check LULCC and Focal LULC

As explained in their respective sections, the steps Check LULCC and Focal LULC can already be run once the LULC layers are present. Both src/steps/11_CheckLULCC/slurm_job.sh and src/steps/20_FocalLULC/slurm_job.sh are also submitted with sbatch. In contrast, these are simple jobs, and their parallelization is achieved by assigning more CPUs to the job and using R’s asynchronous processing via future::plan(future::multisession).

3.7.3 Logging

There are multiple levels of logging in the pipeline. When running the Slurm jobs, the output and error logs are written to the specified files. These come from three main sources: the R scripts, the Python scripts, and the Slurm job scripts. Slurm logs are generally written to the files specified in the job script. For the scripts written for this pipeline, the log level is controlled by FUTURE_EI_LOG_LEVEL in the config.yml file, which can be set to debug, info, warning, or error. The NCP calculation uses natcap.invest, which writes detailed logs to the console. For the LULCC container, Dinamica EGO has more detailed logs of the integrated R scripts; they are written to the mounted LULCC_CH_HPC_DIR directory and do not show up in the Slurm logs. Dinamica EGO has a separate log level that can be set through the DINAMICA_EGO_CLI_LOG_LEVEL environment variable.

4 Further Steps

Upon completing the four steps, we obtain LULC layers, focal windows, and NCP. These outputs can be further analyzed and utilized for additional processes, such as species distribution modeling.

In our case, we have incorporated a fifth step that leverages the same repository structure and configuration as the previous steps, even though its code is located in a different repository; this keeps the workflow consistent and manageable. This additional step allows for more comprehensive analysis and extends the capabilities of the initial four steps, providing a robust framework for further research and application.


  1. The installed CLI is identified via bash variables in src/bash_common.sh. If none is found, an error highlights the issue.↩︎

  2. The Docker version used is 24.0.7, but the container should be compatible with most versions.↩︎

  3. We have used yq v4.40.3, but any version >=4.18.1 should work.↩︎

  4. The versions of the R packages used with LULCC are listed in the note below.

    R Packages

    Versions used with LULCC docker 0.3.0:

    • raster: 3.6-26
    • tidyverse: 2.0.0
    • data.table: 1.15.4
    • randomForest: 4.7-1.1
    • callr: 3.7.6
    • future: 1.33.2
    • future.apply: 1.11.2
    • future.callr: 0.8.2
    • sp: 2.1-3
    • stringi: 1.8.3
    • stringr: 1.5.1
    ↩︎
  5. For more information on the compatibility of Dinamica EGO with Linux, see the Dinamica EGO documentation.↩︎

  6. The versions of the packages used in the NCP calculation are listed in the note below.

    Versions used for the NCP calculation

    R (4.1.3) packages:

    • raster: 3.4-13
    • terra: 1.5-21
    • meteor: 0.4.5
    • Recocrop: 0.4.0
    • rgdal: 1.5-29
    • codetools: 0.2-19
    • data.table: 1.14.8
    • remotes: 2.4.2
    • sf: 1.0-7
    • yaml: 2.3.7

    Python (3.10.13) packages:

    • natcap.invest: 3.13.0
    ↩︎