Matteo Dell'Amico, Researcher, EURECOM, France
Analyzing large and heterogeneous data is a recurrent problem for analysts and researchers. For example, a dataset of network events may contain information as diverse as text, IP addresses, numeric and categorical fields, DNS names, and log lines from firewalls and inspection tools. All these data can be correlated in non-trivial ways; additionally, some information may be missing for some events, and the reliability of what is collected may vary. The typical machine learning workflow, in which a feature extraction step converts the original data to numeric vectors before further processing, may fall short in these cases, because the conversion to numeric form can lose important information. Moreover, we might not have a reliable “ground truth” to feed a classifier.

We will discuss an alternative approach that does not use a vectorial data representation: relationships between data items are represented by a (dis)similarity function applied to the original data, an arbitrary piece of code written by experts, which gives them complete freedom to encode their domain knowledge. We will introduce clustering algorithms based on this approach that, among other desirable properties, avoid the computational bottleneck of comparing everything against everything. We will also see how this proved useful in practice, through a diverse set of use cases in domains such as text analysis, demographic estimation, and computer security. Finally, we will consider opportunities for future research in this area.
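As a concrete illustration of the idea (a minimal sketch, not the algorithms presented in the talk), the Python snippet below clusters heterogeneous network events with an expert-written dissimilarity function. The field names, penalty weights, and the use of scikit-learn's DBSCAN over a precomputed distance matrix are all illustrative assumptions; the naive all-pairs matrix built here is exactly the quadratic bottleneck the presented algorithms are designed to avoid.

    # Sketch: an expert-written dissimilarity over raw, heterogeneous records
    # (hypothetical field names and weights; no conversion to feature vectors).
    import numpy as np
    from sklearn.cluster import DBSCAN

    def dissimilarity(a, b):
        """Domain-knowledge distance between two raw event records."""
        d = 0.0
        # Categorical field: simple mismatch penalty.
        d += 0.0 if a["protocol"] == b["protocol"] else 1.0
        # IP addresses: penalize sources in different /24 networks.
        d += 0.0 if a["src_ip"].rsplit(".", 1)[0] == b["src_ip"].rsplit(".", 1)[0] else 1.0
        # Free text: Jaccard distance over word sets; tolerate missing values.
        if a.get("log_line") and b.get("log_line"):
            wa, wb = set(a["log_line"].split()), set(b["log_line"].split())
            d += 1.0 - len(wa & wb) / len(wa | wb)
        return d

    events = [
        {"protocol": "tcp", "src_ip": "10.0.0.1", "log_line": "denied inbound tcp"},
        {"protocol": "tcp", "src_ip": "10.0.0.7", "log_line": "denied inbound tcp"},
        {"protocol": "udp", "src_ip": "192.168.1.5", "log_line": None},
    ]

    # Naive all-pairs distance matrix: the O(n^2) comparison step that the
    # algorithms discussed in the talk avoid; shown only for illustration.
    n = len(events)
    dist = np.array([[dissimilarity(events[i], events[j]) for j in range(n)]
                     for i in range(n)])

    labels = DBSCAN(eps=1.0, min_samples=2, metric="precomputed").fit_predict(dist)
    print(labels)  # [ 0  0 -1]: the two similar events cluster; the third is noise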