Vol. 36 No. 1 (2024)
Articles

A comparative study on univariate outlier winsorization methods in data science context

Ali Abuzaid
Department of Mathematics, Al Azhar University - Gaza, Gaza
Eyad Alkronz
Department of Information Technology, Al Azhar University - Gaza, Gaza

Published 2024-04-16

Keywords

  • capping
  • flooring
  • outlier
  • quantile-based

How to Cite

Abuzaid, A., & Alkronz, E. (2024). A comparative study on univariate outlier winsorization methods in data science context. Italian Journal of Applied Statistics, 36(1), 85–99. https://doi.org/10.26398/IJAS.0036-004

Abstract

Handling outliers is an important step in data analysis and can be approached in three ways: accommodation, omission, or winsorization. This article investigates the impact of four winsorization statistics (mean, median, mode, and quantiles) on parameter estimation through an extensive simulation study. Three probability distributions (normal, negative binomial, and exponential) are considered, each with varying degrees of contamination. The simulation results suggest that winsorization is effective for small contamination levels and large sample sizes. Furthermore, it is recommended to winsorize outliers in symmetric distributions using any of the location parameters. However, for asymmetric distributions, the median should be employed. To illustrate these findings, a real dataset on internet usage session durations for 4,500 users, comprising over 2 million records, is fitted to the exponential distribution. The identified outliers were winsorized using the aforementioned statistics.
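
As a rough illustration of the four winsorization statistics named in the abstract, the following Python sketch replaces flagged outliers with the mean, median, mode, or a quantile-based floor and cap. Several details are assumptions made for this example and are not taken from the article: outliers are flagged with Tukey's 1.5 × IQR fences, the mean, median, and mode are computed from the non-outlying values only, and the quantile variant uses the 5th and 95th percentiles of the full sample. The function names `iqr_fences` and `winsorize` are hypothetical.

```python
import statistics


def iqr_fences(data, k=1.5):
    """Return (lower, upper) Tukey fences; the detection rule is an assumption."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(data, statistic="median", k=1.5):
    """Replace flagged outliers in `data` using the chosen statistic.

    statistic: "mean", "median", "mode", or "quantile".
    Values inside the fences are returned unchanged.
    """
    lower, upper = iqr_fences(data, k)
    inliers = [x for x in data if lower <= x <= upper]

    if statistic == "quantile":
        # Assumed variant: floor low outliers at the 5th percentile and
        # cap high outliers at the 95th percentile of the full sample.
        cuts = statistics.quantiles(data, n=20, method="inclusive")
        floor, cap = cuts[0], cuts[-1]
        return [floor if x < lower else cap if x > upper else x for x in data]

    # Location statistics computed from non-outlying values (an assumption).
    replacement = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[statistic](inliers)
    return [x if lower <= x <= upper else replacement for x in data]


sample = [2, 3, 3, 4, 5, 5, 5, 6, 7, 100]  # 100 is the planted outlier
for name in ("mean", "median", "mode", "quantile"):
    print(name, winsorize(sample, name))
```

In this toy sample the quantile-based cap is itself pulled upward by the outlier, whereas the median replacement is not. This is only a single hand-picked case and does not reproduce the article's simulation study.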