18 December 2013
San Francesco - Cappella Guinigi
Text classification, the task of automatically assigning items of natural language text (e.g., newspaper articles, email messages, product reviews, blog posts) to one or more of a predefined set of classes based on their content, is ubiquitous in today's computer science. Modern supervised machine learning techniques now allow text classification to be carried out with high accuracy in many application contexts. However, it has recently been pointed out that, in many such contexts, the real goal is not to classify each individual text correctly, but to estimate correctly the relative frequency (or "prevalence") of each class of interest. For instance, in a stream of consumer reviews of the Kindle ebook reader, the goal is not to decide to which of the classes {very positive, positive, neutral, negative, very negative} the review by John Smith belongs, but to decide how many reviews out of the total belong to each class. In other words, given a set of n classes of interest, the goal is to estimate the distribution of these classes in a set of unlabelled documents. This task is called (text) quantification. In this talk we will discuss a number of new developments that the shift of focus from classification to quantification entails, with special emphasis on new supervised learning models and new measures of accuracy. We will also argue that the advent of "big data" will make classification progressively less important and quantification progressively more important.
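The difference between classifying and quantifying can be illustrated with a minimal Python sketch. It is not one of the models discussed in the talk: it assumes the simplest possible approach, in which the output of some classifier is counted per class, and the function names and toy labels are hypothetical. It shows that a classifier can misclassify individual reviews and still yield the correct class prevalences.

```python
from collections import Counter

# Classes from the review example in the abstract.
CLASSES = ["very positive", "positive", "neutral", "negative", "very negative"]


def prevalence(labels, classes=CLASSES):
    """Return the relative frequency of each class in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in classes}


def accuracy(true_labels, predicted_labels):
    """Return the fraction of items whose predicted label is correct."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical toy data: four reviews, two of them misclassified.
true_labels = ["positive", "negative", "neutral", "positive"]
predicted_labels = ["negative", "positive", "neutral", "positive"]

classification_accuracy = accuracy(true_labels, predicted_labels)
true_prev = prevalence(true_labels)
estimated_prev = prevalence(predicted_labels)

print(classification_accuracy)      # half of the individual labels are wrong
print(estimated_prev == true_prev)  # yet the estimated distribution matches
```

In this toy case the two classification errors cancel out, so the estimated distribution is exact even though only half of the reviews are labelled correctly. In general errors do not cancel so neatly, which is one reason why quantification may call for dedicated learning models and accuracy measures.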
Speaker:
Sebastiani, Fabrizio
Units:
SysMA