The covariance_filter returns a set of covariates for land use land cover change
(LULCC) models based on a two-stage variable selection: a first statistical fit
estimates a covariate's quality for a given prediction task. A second step selects
all variables below a given correlation threshold: We iterate over a correlation
matrix ordered in the first step. Starting within the leftmost column, all rows (i.e.
candidates) greater than the given threshold are dropped from the full set of
candidates. This candidate selection is retained and used to select the next column,
until no further columns are left to investigate. The columns that were iterated over
are those returned as a character vector of selected variable names.
Usage
covariance_filter(
data,
result_col = "result",
rank_fun = rank_poly_glm,
weights = compute_balanced_weights(data[[result_col]]),
corcut = 0.7,
...
)
rank_poly_glm(x, y, weights = NULL, ...)
compute_balanced_weights(trans_result, legacy = FALSE)
select_by_correlation(cor_mat, corcut)Arguments
- data
A data.table of target variable and candidate covariates to be filtered; wide format with one predictor per column.
- result_col
Name of the column representing the transition results (0: no trans, 1: trans)
- rank_fun
Optional function to compute ranking scores for each covariate. Should take arguments (x, y, weights, ...) and return a single numeric value (lower = better). Defaults to polynomial GLM p-value ranking.
- weights
Optional weights vector
- corcut
Correlation cutoff threshold
- ...
Additional arguments passed to rank_fun.
- x
A numeric vector representing a single covariate
- y
A binary outcome vector (0/1)
- trans_result
Binary outcome vector (0/1)
- legacy
Bool, use legacy weighting?
- cor_mat
Absolute correlation matrix
Details
The function first ranks covariates using the provided ranking function (default: quasibinomial polynomial GLM). Then, it iteratively removes highly (Pearson) correlated variables based on the correlation cutoff threshold, preserving variables in order of their ranking. See https://github.com/ethzplus/evoland-plus-legacy/blob/main/R/lulcc.covfilter.r for where the concept came from. The original author was Antoine Adde, with edits by Benjamin Black. A similar mechanism is found in https://github.com/antadde/covsel/.
Functions
rank_poly_glm(): Default ranking function using polynomial GLM. Returns the lower p value for each of the polynomial termscompute_balanced_weights(): Compute class-balanced weights for imbalanced binary outcomes; returns a numeric vectorselect_by_correlation(): Implements the iterative selection procedure.