Published 2024-04-16
Keywords
- Capping
- flooring
- outlier
- quantile-based
Copyright (c) 2024 Ali Abuzaid, Eyad Alkronz
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Handling outliers is an important step in data analysis, and it can be approached in three ways: accommodation, omission, or winsorization. This article investigates the impact of four winsorization statistics (mean, median, mode, and quantiles) on parameter estimation through an extensive simulation study. Three probability distributions (normal, negative binomial, and exponential) are considered, each with varying degrees of contamination. The simulation results suggest that winsorization is effective for small contamination levels and large sample sizes. Furthermore, it is recommended to winsorize outliers in symmetric distributions using any of the location parameters; for asymmetric distributions, however, the median should be employed. To illustrate these findings, a real dataset on internet usage session durations for 4,500 users, comprising over 2 million records, is fitted to the exponential distribution, and the identified outliers are winsorized using the aforementioned statistics.
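As a rough illustration of what winsorizing with a chosen statistic can look like, the following Python sketch replaces flagged outliers with the mean, the median, or a quantile of the data. It is not the authors' procedure: the abstract does not state how outliers are identified, so the 1.5 × IQR fence used here, the function name `winsorize_outliers`, and the 5th/95th percentile defaults are all assumptions made for the example. The mode is omitted because estimating it for continuous data requires further choices.

```python
import numpy as np

def winsorize_outliers(x, statistic="median", k=1.5, q_low=5, q_high=95):
    """Replace flagged outliers with a location statistic or a quantile.

    Assumption: outliers are the points outside the k * IQR fences.
    This detection rule is illustrative and not taken from the article.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (x < lower) | (x > upper)
    clean = x[~is_outlier]  # observations not flagged as outliers

    if statistic == "mean":
        # Replace every outlier with the mean of the remaining observations.
        return np.where(is_outlier, clean.mean(), x)
    if statistic == "median":
        # Replace every outlier with the median of the remaining observations.
        return np.where(is_outlier, np.median(clean), x)
    if statistic == "quantile":
        # Flooring and capping: low outliers move up to the q_low percentile,
        # high outliers move down to the q_high percentile (of the full sample).
        lo_q, hi_q = np.percentile(x, [q_low, q_high])
        return np.where(x < lower, lo_q, np.where(x > upper, hi_q, x))
    raise ValueError("statistic must be 'mean', 'median', or 'quantile'")

sample = [1, 2, 3, 4, 5, 100]
winsorized = winsorize_outliers(sample, statistic="median")
```

For the small sample above, only the value 100 falls outside the fences, and with `statistic="median"` it is replaced by 3, the median of the remaining five observations.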