
Csv File Compression Without Using Existing Libraries In Python

I'm trying to compress a .csv file without using any 3rd party or framework provided compression libraries. I have tried, what I wish to think, everything. I looked at Huffman, but

Solution 1:

It is unclear whether you need to create a generic compression algorithm or a custom one that works reasonably well for this kind of data.

It is also unclear whether the output should be another CSV, a string of printable ASCII characters, or plain binary data.

I'm going to assume that we're talking about a custom algorithm and a CSV output. (The same principles would apply to another output format anyway.)

It appears that your input is well formatted and always repeats the same kinds of fields:

0 '6NH8'     : 4-character code
1 'F'        : character
2 'A'        : character
3 '0'        : integer
4 '60541567' : integer \_ some kind of
5 '60541567' : integer /  timestamps?
6 '78.78'    : float
7 '20'       : integer

Building dictionaries

See how many distinct codes are used in column #0 and how many distinct combinations of 'column #1' + 'column #2' you have.

If the same values are used frequently, then it's definitely worth building dictionaries that will be stored only once and then referenced in the compressed rows.

For instance:

column0_dictionary = ['6NH8', '6AH8', 'QMH8']
column12_dictionary = ['FA', 'FB']

So, 6NH8 would be referenced as 0, 6AH8 as 1, etc.

In the same way, F,A would be referenced as 0 and F,B as 1.
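The dictionary idea above can be sketched as follows (the values come from the answer's example; `build_dictionary` is a hypothetical helper name):

```python
def build_dictionary(values):
    """Map each distinct value to the index of its first appearance."""
    indexes = {}
    for v in values:
        if v not in indexes:
            indexes[v] = len(indexes)
    return indexes

rows = [
    ['6NH8', 'F', 'A'],
    ['6NH8', 'F', 'A'],
    ['6AH8', 'F', 'B'],
    ['QMH8', 'F', 'B'],
]

# One dictionary for column #0, one for the combination of columns #1 + #2.
col0 = build_dictionary(r[0] for r in rows)
col12 = build_dictionary(r[1] + r[2] for r in rows)

# Each row now stores small indexes instead of the original strings.
encoded = [(col0[r[0]], col12[r[1] + r[2]]) for r in rows]
print(encoded)  # [(0, 0), (0, 0), (1, 1), (2, 1)]
```

The dictionaries themselves (`['6NH8', '6AH8', 'QMH8']` and `['FA', 'FB']`) are written once in the header, and decompression is a simple list lookup.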

Encoding timestamps in a shorter format

Assuming that columns #4 and #5 are indeed timestamps, a quick win would be to store the minimum value and subtract it from the actual value in each compressed row.

minimum_timestamp = 60437395

Therefore, 60541569 becomes 60541569 - 60437395 = 104174.
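A minimal sketch of this delta encoding, using timestamps from the answer's example:

```python
# Store the minimum once in the header, then only the offsets per row.
timestamps = [60541567, 60541569, 60541765, 60437395]
minimum = min(timestamps)                    # written once: 60437395
offsets = [t - minimum for t in timestamps]
print(minimum, offsets)  # 60437395 [104172, 104174, 104370, 0]
```

Decompression just adds the minimum back to each offset.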

Example output

Here is what we get when applying these two simple methods to your example input:

# header
6NH8,6AH8,QMH8
FA,FB
60437395
# payload data
0,0,0,104172,104172,78.78,20
0,0,0,104174,104174,78.78,25
1,1,0,104370,104370,90.52,1
2,1,0,0,0,950.5,1

You could also store in column #5 the difference between column #5 and column #4, if it turns out that they correspond to the 'start of something' and 'end of something'.

As is, the compressed payload is about 70% of the size of the original input. (Keep in mind that the size of the header becomes negligible once you have many more rows.)

Your example is too short to detect any other obvious patterns for the remaining fields, but hopefully these examples will give you some ideas.

UPDATE

It turns out that the timestamps are expressed as the number of milliseconds elapsed since midnight. They are therefore probably evenly distributed in 0-86399999, and subtracting a minimum won't help.

These numbers can however be encoded in a more compact manner than the ASCII representation of their decimal value.

The easiest way is to convert them to hexadecimal:

60541567 = 39BCA7F
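A one-line round trip for the hexadecimal idea, using only the standard library:

```python
# Hexadecimal saves one character here: 7 digits instead of 8.
n = 60541567
hex_form = format(n, 'X')   # '39BCA7F'
restored = int(hex_form, 16)
print(hex_form, restored)   # 39BCA7F 60541567
```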

A slightly more complicated way is to encode them in Base64:

  1. Convert the timestamp to its 4-byte representation (all values from 0 to 86399999 fit in 4 bytes).

  2. Encode those 4 bytes in Base64.

For example:

60541567 = 03 9B CA 7F  # in hexadecimal and big-endian order

BASE64(CHR(0x03) + CHR(0x9B) + CHR(0xCA) + CHR(0x7F)) = A5vKfw
# here without the padding characters
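The two steps above can be sketched with `struct` and `base64` from the standard library (function names are illustrative):

```python
import base64
import struct

def encode_timestamp(t):
    """Pack into 4 big-endian bytes, then Base64 without the '=' padding."""
    return base64.b64encode(struct.pack('>I', t)).rstrip(b'=').decode('ascii')

def decode_timestamp(s):
    # Restore the padding before decoding (4 bytes always encode to 6 chars + '==').
    return struct.unpack('>I', base64.b64decode(s + '=='))[0]

print(encode_timestamp(60541567))  # A5vKfw
```

This brings an 8-digit decimal timestamp down to 6 characters per value.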

Solution 2:

Try running your algorithm on the contents of each cell instead of on individual characters, then create a new CSV file from the compressed cell values.

If the data you have provided is representative of the larger file, you may want to run the compression algorithm on each column separately. For example, it may only help to compress columns 0, 4, and 5.

For reading and writing CSV files check out the csv module where you can do things like:

import csv

# Python 3: open in text mode with newline='' (the original 'rb' mode is Python 2)
with open('eggs.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        print(', '.join(row))

Solution 3:

For each line, search for matching substrings in the previous line or lines. For each matching substring (e.g. 6NH8,F,A,0,6054156 or ,78.78,2), emit the length of the match and the distance back to copy from, instead of the literal text. This is LZ77 compression.
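A minimal, unoptimized sketch of the LZ77 idea (a real implementation would use a hash table to find matches; `window` and `min_match` values are arbitrary choices here):

```python
def lz77_tokens(data, window=4096, min_match=4):
    """Greedy LZ77: emit (distance, length) copy tokens or literal characters."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search for the longest match starting in the window.
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))   # copy token
            i += best_len
        else:
            out.append(data[i])                 # literal character
            i += 1
    return out

def lz77_decode(tokens):
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):             # byte-by-byte allows overlaps
                out.append(out[-dist])
        else:
            out.append(t)
    return ''.join(out)

line = '6NH8,F,A,0,60541567,60541567,78.78,20\n'
tokens = lz77_tokens(line * 2)
assert lz77_decode(tokens) == line * 2
print(len(tokens), 'tokens for', len(line * 2), 'characters')
```

Repeated rows collapse into a handful of copy tokens; an actual compressor would then pack the tokens into a compact binary form rather than a Python list.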
