Dictionary Based Transform for Improving English Text Compression (DBT)
Owuor, Paul Otieno
Text compression algorithms reduce the redundancy in data representation to decrease the storage and transmission time required for that data. One way of increasing text compression is to preprocess text before passing it as input to a backend compressor. Several context-based algorithms exist in this area, including Star Encoding (*-encoding), Length Index Preserving Transform (LIPT), Reverse Length Preserving Transform (RLPT), and Shortened Context Length Preserving Transform (SCLPT). However, none of these methods has achieved the theoretical best-case compression ratio, suggesting that better algorithms are possible. This research focused on dictionary-based preprocessing and specialized coding to increase text compression. It developed a lossless dictionary-based text-preprocessing algorithm called the Dictionary Based Transform for improving English text compression (DBT). The DBT approach transforms text by replacing often-used words with special codes from a dictionary agreed in advance. This transformation produces the desirable effect of precompressing the original input text, maintaining some of the original context information at the word level and creating an additional, stronger context in the transformed text. The transformation also ensures that frequently used characters have higher probabilities in the transformed text. The combined effect is that DBT achieves some compression at the preprocessing stage while retaining enough context and providing stronger character predictions for backend algorithms to give better results. The DBT achieved a precompression of about 17%, with an average improvement of about 6.9% over LHA, 3.4% over WINZIP, 3.9% over DMC, and 0.9% over Bzip2 and PPMD, for our test corpus. Since Bzip2 and PPMD are among the best algorithms on the market, they are recommended as the preferred backend algorithms for use with DBT.
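The word-replacement idea described above can be sketched minimally. This is a hypothetical illustration only, not the actual DBT dictionary or code scheme: the `DICTIONARY` mapping and the `*`-prefixed codes are invented here to show how a pre-agreed dictionary yields a lossless, invertible transform that shortens the text before a backend compressor sees it.

```python
# Hypothetical shared dictionary: frequent word -> short code.
# (Illustrative only; the real DBT dictionary and coding are defined in the thesis.)
DICTIONARY = {"the": "*a", "and": "*b", "compression": "*c", "text": "*d"}
REVERSE = {code: word for word, code in DICTIONARY.items()}

def transform(text: str) -> str:
    """Replace dictionary words with their codes (word level, whitespace-split)."""
    return " ".join(DICTIONARY.get(w, w) for w in text.split())

def inverse(coded: str) -> str:
    """Losslessly recover the original text from its coded form."""
    return " ".join(REVERSE.get(t, t) for t in coded.split())

sample = "the text compression and the text"
coded = transform(sample)
assert inverse(coded) == sample      # lossless round trip
assert len(coded) < len(sample)      # some precompression before the backend stage
```

The coded output is what would then be handed to a backend compressor such as Bzip2 or PPMD; the shorter, more regular token stream is what gives the backend stronger contexts to exploit.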
Structurally, this report begins by discussing the problem of text compression, the motivation behind this work, and how the stated aim of optimal textual compression is to be achieved. A review of relevant literature, mainly on statistical and dictionary coding approaches, provides the theoretical context. This is followed by the new algorithm and the methods and designs used to attempt to solve the problem. Finally, the test results, an explanation of those results, and a conclusion, complete with suggestions for further work, are given.