This function fills missing values, using the median for numeric variables and the mode for categorical variables (factors). In addition, outliers in numeric variables are replaced according to the IQR rule, and rare factor levels are merged into a single 'Other' level.

hugo_clean_data(data, prop = 0.05)

Arguments

data

data.frame to clean

prop

threshold proportion of occurrences used to decide which levels of a categorical variable are rare
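
As a rough illustration of how a threshold such as prop can be used, the sketch below merges rare factor levels in base R. The helper merge_rare_levels is hypothetical and not part of the package, and the strict comparison (share < prop) is an assumption; the package may handle the boundary case differently.

```r
# Hypothetical helper (not the package's code): merge levels whose share
# of non-missing observations is below `prop` into one "Other" level.
merge_rare_levels <- function(x, prop = 0.05) {
  x <- as.factor(x)
  shares <- prop.table(table(x))        # share of each level, NAs excluded
  rare <- names(shares)[shares < prop]  # assumption: strict inequality
  # Assigning the same name to several levels merges them
  levels(x)[levels(x) %in% rare] <- "Other"
  x
}

# Toy factor: "c", "d" and "e" each make up 1% of the observations
f <- factor(c(rep("a", 50), rep("b", 47), "c", "d", "e"))
table(merge_rare_levels(f))
```

With the default prop = 0.05, the three levels that each cover 1% of the toy factor are collapsed, while "a" and "b" are kept.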

Value

data.frame that has been cleaned

Examples

# NOT RUN {
# Dataset in base R: airquality
# There are 44 missing values
sum(is.na(airquality))

hugo_clean_data(airquality)
# The data was cleaned.

# Two original rows from the data:

# Ozone Solar.R  Wind Temp Month Day
#     8      19  20.1   61     5   9
#    NA      NA  14.3   56     5   5

# After cleaning:

# Ozone Solar.R  Wind Temp Month Day
#     8      19 17.65   61     5   9
#  31.5     205 14.30   56     5   5

# We can see that the outlier in 'Wind' was
# replaced by the value Q3 + 1.5 * IQR for this column.
# Missing values were replaced with medians.
# }
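
The figures quoted in the example can be checked with base R alone. The sketch below does not call hugo_clean_data(); it only recomputes the upper fence of the IQR rule for 'Wind' and the medians of 'Ozone' and 'Solar.R'. Capping with pmin() is an assumption about how the replacement is done, chosen because it matches the output shown above.

```r
# Base-R check of the values quoted in the example (airquality ships with R)
wind <- airquality$Wind

# Upper fence of the IQR rule: Q3 + 1.5 * IQR
q <- quantile(wind, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
upper_fence

# Assumption: values above the fence are capped at the fence
wind_capped <- pmin(wind, upper_fence)
wind_capped[9]   # row 9 holds the outlier 20.1 in the original data

# Medians that would fill the missing values in row 5
median(airquality$Ozone, na.rm = TRUE)
median(airquality$Solar.R, na.rm = TRUE)
```

Only the upper fence is shown because that is the case the example illustrates; the lower fence, Q1 - 1.5 * IQR, is computed the same way.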