Clean Dataset Based on Missing Data Threshold and MCAR Test

The clean_na function handles missing data in a dataset by excluding variables and observations according to the supplied parameters. Variables can be excluded by matching patterns in their names, and observations are removed when their percentage of missing data exceeds a specified threshold. It can also perform an MCAR (Missing Completely At Random) test if requested.
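The observation filter described above can be illustrated with a minimal base R sketch. The helper name drop_sparse_rows is hypothetical and is not part of the package; it only assumes that rows whose percentage of missing values is strictly above the threshold are dropped, and it is not clean_na's actual implementation.

```r
# Hypothetical helper (not the package's code): keep rows whose share of
# missing values does not exceed `threshold` percent.
drop_sparse_rows <- function(data, threshold = 60) {
  # is.na() on a data frame returns a logical matrix, so rowMeans() gives
  # the fraction of missing cells in each row
  pct_missing <- rowMeans(is.na(data)) * 100
  data[pct_missing <= threshold, , drop = FALSE]
}

df <- data.frame(a = c(1, 2, NA, 4, 5),
                 b = c("one", "two", "three", NA, "five"),
                 c = c(NA, NA, 3, 4, 5))

# Rows 1 to 4 each miss one of three values (about 33%), row 5 misses none
kept <- drop_sparse_rows(df, threshold = 30)
nrow(kept)  # 1
```

With a threshold of 50, no row in this example exceeds the limit, so all five rows would be kept.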

Usage

clean_na(
  data,
  scenario_based_vars = NULL,
  missing_threshold = 60,
  full_names = FALSE,
  MCAR = FALSE,
  main_vars = NULL,
  answers_only = FALSE
)

Arguments

data: The dataset to be cleaned.
scenario_based_vars: Character vector with base names or full names of variables to exclude from the dataset. Can also be a numeric vector with column indices to be removed.
missing_threshold: Percentage threshold for missing data per observation; observations with a higher percentage of missing values are removed (defaults to 60 percent).
full_names: Logical value indicating whether scenario_based_vars are full names of the variables (default is FALSE).
MCAR: Logical value indicating whether to perform an MCAR test (defaults to FALSE).
main_vars: Character vector with base names of variables to include in the MCAR test. If NULL, all variables are included (default is NULL).
answers_only: Logical value indicating whether to exclude non-answer variables, that is, variables whose names contain neither a number nor 'demo' or 'Demo' (default is FALSE). If TRUE, these variables are treated the same way as scenario-based variables.
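The name-based selection rules in the arguments above can be sketched as follows. Both helpers are hypothetical and not taken from the package. match_vars assumes that a base name matches any column whose name contains it, while a full name must match exactly; the source does not say how clean_na matches base names, so treat that part as an assumption. is_non_answer follows the answers_only description directly.

```r
# Hypothetical helper (assumption: base names match as substrings,
# full names match exactly).
match_vars <- function(var_names, patterns, full_names = FALSE) {
  if (full_names) {
    var_names %in% patterns
  } else {
    # OR together one literal substring match per pattern
    Reduce(`|`,
           lapply(patterns, grepl, x = var_names, fixed = TRUE),
           rep(FALSE, length(var_names)))
  }
}

# Hypothetical helper: TRUE for names that contain neither a digit
# nor "demo"/"Demo", i.e. the non-answer variables.
is_non_answer <- function(var_names) {
  !grepl("[0-9]|[Dd]emo", var_names)
}

vars <- c("scen_1", "scen_2", "q1_trust", "Demo_age", "timestamp")

match_vars(vars, "scen")                        # TRUE TRUE FALSE FALSE FALSE
match_vars(vars, "scen_1", full_names = TRUE)   # TRUE FALSE FALSE FALSE FALSE
is_non_answer(vars)                             # FALSE FALSE FALSE FALSE TRUE
```

Columns flagged by either helper could then be dropped with data[, !flag, drop = FALSE] before the observation filter is applied.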

Value

The input dataset with the selected variables removed and with observations above the missing-data threshold dropped.

Examples

if (FALSE) {
# Build a small data frame with missing values
df <- data.frame(a = c(1, 2, NA, 4, 5),
                 b = c("one", "two", "three", NA, "five"),
                 c = c(NA, NA, 3, 4, 5))

# Drop observations with more than 50% missing values and run the MCAR test
clean_na(df, missing_threshold = 50, MCAR = TRUE, answers_only = FALSE)
}