Scientific results
The pre-doctoral researcher of the IMDEA Software Institute, Pedro Valero, explains the work he has done recently that has taken him to Facebook.
“Technological giants generate vast amounts of information constantly and the only way they have to manage this information efficiently is by compressing it. However, they keep this information because they want to process it later, so saving the information in compressed form means that every time they want to query the information it is necessary to decompress it and perform the search on the original data. We specifically have been working with textual data. In this scenario, when compressing the file you need to find parts of the text that are repeated several times. Then the repeated text is written the first time it occurs and, for the rest of the occurrences, we simply indicate that what comes next is the fragment that appeared previously. This idea allows us to compress the file and save space but it also allows us to speed up the searches because, when we are processing the text, each time we find that what comes next has already occurred before, we can reuse the information we had already obtained when processing that same string of text. This idea is conceptually simple, but we have been among the first ones to put it into practice. We have developed an algorithm that allows us to perform these searches and really saves time.
If the file can be well compressed then the search in the compressed text without decompression can be faster than the whole process of decompressing and searching.
The problem with this approach is that not every compression algorithm works. Only a certain set of algorithms that are known as dictionary algorithms, or grammar-based, are worthwhile, and those algorithms are typically not studied within industry since they are generally slightly slower and more expensive, although they can produce good results.
The work that I have been doing with Facebook is to study this type of algorithms and find out in what context they can be used since, as usual, it is not always convenient to use them, but there are some scenarios in which it is worth to spend a little more time to compress the file in exchange for being able to search faster”.