hugo_clean_data.Rd
This function fills missing values: the median is used for numeric variables and the mode for categorical variables (factors). In addition, outliers in numeric variables are replaced according to the IQR rule, and rare factor levels are merged into a single 'Other' level.
hugo_clean_data(data, prop = 0.05)
Argument | Description |
---|---|
data | data.frame to be cleaned |
prop | threshold proportion of occurrences used to decide which levels of a categorical variable are rare (default 0.05) |
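The role of `prop` can be illustrated with a small base R sketch. This is not the package's implementation: `clean_factor_sketch` is a hypothetical helper, and it assumes that a level counts as rare when its share of the non-missing observations is strictly below `prop`, and that missing values are filled with the mode after the rare levels have been merged.

```r
# Hypothetical helper, not part of the package: cleans a single factor.
clean_factor_sketch <- function(f, prop = 0.05) {
  f <- as.factor(f)
  # Share of each level among the non-missing observations.
  shares <- table(f) / sum(!is.na(f))
  # Assumption: a level is rare when its share is strictly below prop.
  rare <- names(shares)[shares < prop]
  # Assigning the same name to several levels merges them into one.
  levels(f)[levels(f) %in% rare] <- "Other"
  # Fill missing values with the most frequent level (the mode).
  counts <- table(f)
  f[is.na(f)] <- names(counts)[which.max(counts)]
  f
}

# Three levels occur once in 100 observed values, so each is below 5%.
f <- factor(c(rep("a", 60), rep("b", 37), "c", "d", "e", NA))
table(clean_factor_sketch(f, prop = 0.05))
```

Raising `prop` makes the merge more aggressive, and `prop = 0` leaves every observed level untouched.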
A data.frame that has been cleaned.
# NOT RUN {
# Dataset in base R: airquality
# There are 44 missing values
sum(is.na(airquality))
hugo_clean_data(airquality)
# The data was cleaned.
# Two original rows from data:
# Ozone Solar.R Wind Temp Month Day
#     8      19 20.1   61     5   9
#    NA      NA 14.3   56     5   5
# After cleaning:
# Ozone Solar.R  Wind Temp Month Day
#     8      19 17.65   61     5   9
#  31.5     205 14.30   56     5   5
# We can see that the outlier in 'Wind' was
# replaced by the value Q3+1.5*IQR for this column.
# Missing values were replaced with medians.
# }
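The numeric rules behind this example can be sketched in base R. This is a minimal illustration rather than the package's code: `clean_numeric_sketch` is a hypothetical helper. It assumes that the quartiles are computed on the observed values with R's default quantile type, and that low outliers are pulled up to Q1 - 1.5*IQR in the same way high outliers are pulled down to Q3 + 1.5*IQR.

```r
# Hypothetical helper, not part of the package: cleans one numeric vector.
clean_numeric_sketch <- function(x) {
  # Quartiles of the observed (non-missing) values.
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  # Fill missing values with the median of the observed values.
  x[is.na(x)] <- median(x, na.rm = TRUE)
  # Assumption: values beyond a fence are replaced by that fence.
  pmin(pmax(x, lower), upper)
}

# Wind has no missing values, so only the outlier rule applies here.
max(clean_numeric_sketch(airquality$Wind))
# Ozone has missing values, which are filled with its median.
sum(is.na(clean_numeric_sketch(airquality$Ozone)))
```

Under these assumptions, the largest cleaned 'Wind' value should agree with the 17.65 shown in the example above.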