Lossless Compression of Textual Data

Tuesday, February 4, 2020

10:45am Meeting room 302 (Mountain View), level 3

Pedro Valero, PhD Student, IMDEA Software Institute

Abstract:

This talk is a combination of what I have learned about grammar-based compression while developing our tool -zearch- for searching on compressed text [1] and what I have learned about LZ77-based compression during my internship at Facebook.

We will begin with an introduction to these two paradigms of compression of textual data: Grammar-based and LZ77-based, followed by an explanation of one algorithm from each family: REcursive PAIRing (Grammar) [2] and zstd (LZ77) [3]. After that, we will compare these two classes of algorithms and discuss some of the advantages of each of them with respect to the other.

Finally, we will talk about dictionary-enhanced compression as a mechanism for efficiently compressing large amounts of small but similar files. This technique has proven so efficient that has been deployed at large scale within Facebook creating the so called “Managed Compression”.

[1] https://ieeexplore.ieee.org/document/8712660 [2] https://ieeexplore.ieee.org/abstract/document/755679 [3] https://facebook.github.io/zstd/