IMDEA Software

September 20, 2023

IMDEA Software and Norton Research Group present the article: "A Deep Dive into the VirusTotal File Feed"

Researchers Kevin van Liebergen and Juan Caballero (IMDEA Software), together with Platon Kotzias and Chris Gates (Norton Research Group), present the study “A Deep Dive into the VirusTotal File Feed”, a comprehensive analysis of one year of reports produced by the VirusTotal platform that aims to characterize the current state of malware.

Online scanners analyze user-submitted samples with a large number of security tools and provide access to the analysis results. VirusTotal (VT) is the most popular such scanner in the security community. Each file analyzed by VT produces a report containing, among other things, metadata, specific data, and the list of detection tags assigned by the up to 70 antivirus tools used to scan the file.

The study

The dataset comprises 328 million VT reports for 235 million files analyzed between December 21, 2020 and December 20, 2021, together with a dataset from Gen Digital, previously known as NortonLifeLock.


Figure 1: Number of daily VT reports and files collected.

The analyzed reports cover a wide variety of file types; the most frequent are shown in Figure 2. Most of the files analyzed, 220.3 million (66%), are “peexe” files, which include Windows PE files such as EXE, DLL, or CPL, among others. Notably, the top five file types account for 88.4% of the total sample: Windows PE (66%), JavaScript (8.9%), HTML (5.3%), PDF (4.8%), and Android apps (3.4%).


Figure 2: Top 20 file types among all observed files.

On first observation, 53% of the samples receive no malicious detection in VT. This percentage includes benign files as well as malware that goes undetected in the first analysis and may be flagged as malicious in subsequent analyses. The IMDEA Software and Norton Research Group researchers therefore estimate that between 41% and 47% of the files can be considered malicious, depending on how many antivirus engines are required to flag a file before it counts as malicious (between 1 and 4).
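The effect of the detection threshold on that percentage can be illustrated with a small sketch. It assumes each file is summarized by the number of antivirus engines that flagged it; the function name and the toy counts are hypothetical and are not taken from the study or from the VT API.

```python
from typing import Iterable


def malicious_fraction(detection_counts: Iterable[int], threshold: int) -> float:
    """Fraction of files flagged by at least `threshold` antivirus engines."""
    counts = list(detection_counts)
    if not counts:
        raise ValueError("at least one file is required")
    flagged = sum(1 for count in counts if count >= threshold)
    return flagged / len(counts)


# Hypothetical toy data: engines flagging each of ten files (not real VT data).
toy_counts = [0, 0, 0, 1, 2, 3, 5, 12, 40, 0]

for threshold in (1, 4):
    share = malicious_fraction(toy_counts, threshold)
    print(f"threshold >= {threshold}: {share:.0%} of files considered malicious")
```

A stricter threshold can only shrink the flagged set, which is why the estimate is reported as a range rather than a single value.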

The collected reports are diverse in terms of malware families: 33,000 distinct families were found, of which 4,900 are quite prevalent, each appearing in at least 100 files.

It has also been found that 0.3% of the samples, 600,000 files, are originally fully undetected (FUD) files: VT detects no malware in them in the first analysis, but in subsequent analyses at least 4 antivirus engines consider them malicious.
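The FUD criterion described above can be written as a short check. This is a minimal sketch, assuming each file's scan history is available as a chronological list of detection counts; the function name and that input format are illustrative assumptions, not the study's actual pipeline.

```python
from typing import Sequence


def is_originally_fud(detections_over_time: Sequence[int], threshold: int = 4) -> bool:
    """Return True if the first scan has zero detections and a later scan
    reaches at least `threshold` detections.

    `detections_over_time` is a hypothetical input: the number of engines
    flagging the file in each scan, ordered from first to last.
    """
    if not detections_over_time:
        return False
    first, *later = detections_over_time
    return first == 0 and any(count >= threshold for count in later)


print(is_originally_fud([0, 0, 6]))  # clean at first, later flagged by 6 engines
print(is_originally_fud([0, 2]))     # never reaches the threshold
print(is_originally_fud([3, 10]))    # already detected in the first scan
```

Files scanned only once can never satisfy the check, since the criterion needs at least one later analysis to compare against.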

The researchers also compared the millions of VirusTotal reports with the Gen Digital dataset, which covers files found on user devices by its antivirus (AV). Among other things, the comparison shows that VT, despite having a volume 17 times lower than that of the Gen Digital antivirus data, observes 16 times more malware, making it a great platform to search for malware.

Finally, the study shows that the malware families appearing in the two datasets differ. VT shows more malware such as ransomware (data encryption) or banking viruses, among others, whereas users’ files contain more potentially unwanted programs (PUP), which can be spyware (collects keystrokes, activates the camera) or adware (intrusive advertising).