Zearch: Regular Expression Matching on Compressed Text

Tuesday, January 16, 2018

10:45am Lecture hall 1, level B

Pedro Valero, PhD Student, IMDEA Software Institute

Zearch: Regular Expression Matching on Compressed Text

Abstract:

Facebook, Google, Amazon and many other large companies produce huge amounts of data that they need to store and process. From the need for storing the generated information surges the development of compression algorithms such as brotli and zstd. Similarly, the need for processing the stored data results in the development of regular expression (regex) engines such as Hyperscan or RE2 since regex matching is a key operation for handling text files. To this day very efficient solutions have been found these two problems by considering them independently. However, when facing the need to search with a regular expression in a compressed file, the state of the art approach goes through decompressing it and, in parallel, searching on the original text as it is recovered by the decompresser. ZEARCH challenges this standard practice by searching directly on the compressed file while being competitive with the state of the art technologies, even though the current implementation is purely sequential.