The LZ77 algorithm
Terms used in the algorithm
- Input stream: the sequence of characters to be compressed;
- Character: the basic data element in the input stream;
- Coding position: the position of the character in the input stream that is currently being coded (the beginning of the lookahead buffer);
- Lookahead buffer: the character sequence from the coding position to the end of the input stream;
- The Window of size W contains W characters from the coding position backwards, i.e. the last W processed characters;
- A Pointer points to the match in the window and also specifies its length.
The principle of encoding
The algorithm searches the window for the longest match with the beginning of the lookahead buffer and outputs a pointer to that match. Since it is possible that not even a one-character match can be found, the output cannot contain just pointers. LZ77 solves this problem this way: after each pointer it outputs the first character in the lookahead buffer after the match. If there is no match, it outputs a null-pointer and the character at the coding position.
The encoding algorithm
- Set the coding position to the beginning of the input stream;
- find the longest match in the window for the lookahead buffer;
- output the pair (P,C) with the following meaning:
- P is the pointer to the match in the window;
- C is the first character in the lookahead buffer that didn't match;
- if the lookahead buffer is not empty, move the coding position (and the window) L+1 characters forward and return to step 2.
The encoding process is presented in Table 1.
- The column Step indicates the number of the encoding step. It completes each time the encoding algorithm makes an output. With LZ77 this happens in each pass through the step 3.
- The column Pos indicates the coding position. The first character in the input stream has the coding position 1.
- The column Match shows the longest match found in the window.
- The column Char shows the first character in the lookahead buffer after the match.
- The column Output presents the output in the format (B,L) C:
- (B,L) is the pointer (P) to the Match. This gives the following instruction to the decoder: "Go back B characters in the window and copy L characters to the output";
- C is the explicit Character.
Input stream for encoding:
Table 1: The encoding process
|Step ||Pos ||Match ||Char ||Output |
|5.||7||A B||C||(5,2) C|
The window is maintained the same way as while encoding. In each step the algorithm reads a pair (P,C) from the input. It outputs the sequence from the window specified by P and the character C.
The compression ratio this method achieves is very good for many types of data, but the encoding can be quite time-consuming, since there is a lot of comparisons to perform between the lookahead buffer and the window. On the other hand, the decoding is very simple and fast. Memory requirements are low both for the encoding and the decoding. The only structure held in memory is the window, which is usually sized between 4 and 64 kilobytes.