On Binary Classification in Extreme Regions
Open archive
Published by CCSD -
International audience. In pattern recognition, a random label Y is to be predicted based upon observing a random vector X valued in R^d with d ≥ 1 by means of a classification rule with minimum probability of error. In a wide variety of applications, ranging from finance/insurance to environmental sciences through teletraffic data analysis for instance, extreme (i.e. very large) observations X are of crucial importance, yet they contribute in a negligible manner to the (empirical) error, simply because of their rarity. As a consequence, empirical risk minimizers generally perform very poorly in extreme regions. It is the purpose of this paper to develop a general framework for classification in the extremes. Precisely, under non-parametric heavy-tail assumptions for the class distributions, we prove that a natural and asymptotic notion of risk, accounting for predictive performance in extreme regions of the input space, can be defined and show that minimizers of an empirical version of a non-asymptotic approximant of this dedicated risk, based on a fraction of the largest observations, lead to classification rules with good generalization capacity, by means of maximal deviation inequalities in low probability regions. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.
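The idea of training only on a fraction of the largest observations can be illustrated with a minimal sketch. The abstract does not specify the norm, the classifier, or any function names, so everything below is an assumption made for illustration: size is measured by the Euclidean norm, the retained points are rescaled to the unit sphere, and a simple nearest-centroid rule stands in for the empirical risk minimizer. The names `fit_on_extremes` and `predict_extreme` are hypothetical.

```python
import numpy as np


def fit_on_extremes(X, y, fraction=0.1):
    """Fit a nearest-centroid rule on the largest observations only.

    Assumptions (not taken from the abstract): Euclidean norm as the
    measure of size, rescaling to the unit sphere, nearest-centroid
    classifier in place of a general empirical risk minimizer.
    """
    norms = np.linalg.norm(X, axis=1)
    # Number of observations kept: a fixed fraction of the sample.
    k = max(1, int(np.floor(fraction * len(X))))
    # Indices of the k observations with the largest norm.
    idx = np.argsort(norms)[-k:]
    threshold = norms[idx].min()
    # Keep only the direction of each extreme observation.
    directions = X[idx] / norms[idx, None]
    labels = y[idx]
    centroids = {
        c: directions[labels == c].mean(axis=0) for c in np.unique(labels)
    }
    return centroids, threshold


def predict_extreme(centroids, X):
    """Predict labels for points assumed to lie in the extreme region.

    Points must have a nonzero norm; the rule is only meant to be
    applied to observations whose norm exceeds the fitted threshold.
    """
    norms = np.linalg.norm(X, axis=1)
    directions = X / norms[:, None]
    classes = np.array(sorted(centroids))
    # Distance from each direction to each class centroid.
    dists = np.stack(
        [np.linalg.norm(directions - centroids[c], axis=1) for c in classes],
        axis=1,
    )
    return classes[np.argmin(dists, axis=1)]
```

In use, one would call `fit_on_extremes(X_train, y_train, fraction=...)` and apply `predict_extreme` only to test points whose norm exceeds the returned threshold. A standard classifier trained on the full sample would handle the remaining points. The choice of `fraction` trades bias against variance and is left open here.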