Why Does Saving/loading Data In Python Take A Lot More Space/time Than Matlab?
Solution 1:
Try this:
To save to disk:
import gzip
import pickle

# Compress the pickled bytes before writing them to disk
gz = gzip.open(filename + '.gz', 'wb')
gz.write(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))
gz.close()
To load from disk:
import gzip
import pickle

gz = gzip.open(filename + '.gz', 'rb')
obj = pickle.loads(gz.read())
gz.close()
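The same round trip reads a little more safely with context managers, which close the file even if an error is raised mid-write; this is just a variant of the snippet above:
import gzip
import pickle

# Save: pickle.dump streams straight into the compressed file object
with gzip.open(filename + '.gz', 'wb') as gz:
    pickle.dump(obj, gz, pickle.HIGHEST_PROTOCOL)

# Load
with gzip.open(filename + '.gz', 'rb') as gz:
    obj = pickle.load(gz)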
Solution 2:
MATLAB uses HDF5 with compression to save .mat files (in version 7.3 and later); HDF5 is a format designed for fast access to large amounts of data. Python's pickle saves the information needed to recreate the objects; it is optimized for flexibility, not for speed or size. If you like, use HDF5 from Python as well.
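As a rough way to see the difference yourself, here is a small sketch (the file names are arbitrary) that writes the same NumPy array once with pickle and once with compressed HDF5, then prints both file sizes:
import os
import pickle
import h5py
import numpy as np

a = np.zeros((1000, 1000))  # deliberately compressible test data

# Plain pickle: stores the raw bytes of the array, uncompressed
with open('test.pkl', 'wb') as f:
    pickle.dump(a, f, pickle.HIGHEST_PROTOCOL)

# HDF5 with gzip compression applied to the dataset
with h5py.File('test.h5', 'w') as f:
    f.create_dataset('a', data=a, compression='gzip')

print(os.path.getsize('test.pkl'), os.path.getsize('test.h5'))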
Solution 3:
Well, the issue is with pickle, not Python per se. As others have mentioned, .mat files saved in version 7.3 or higher use the HDF5 format. HDF5 is optimized to efficiently store and retrieve large datasets; pickle handles data differently. You can replicate or even surpass the performance of MATLAB's save function by using the h5py or netCDF4 Python modules; netCDF-4 is built on top of HDF5. For example, using HDF5, you might do:
import h5py
import numpy as np

f = h5py.File('test.hdf5', 'w')          # create (or overwrite) an HDF5 file
a = np.arange(10)
dset = f.create_dataset("init", data=a)  # store the array under the name "init"
f.close()
I'm not sure if doing the equivalent in MATLAB will result in a file of exactly the same size, but it should be close. You can play around with HDF5's compression features to get the results you want.
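For instance, here is a sketch of the knobs you can turn (the option names below are real h5py keywords; the values are just starting points to experiment with):
import h5py
import numpy as np

a = np.arange(1000000, dtype=np.float64)
with h5py.File('compressed.hdf5', 'w') as f:
    # gzip levels run 0-9: higher compresses smaller but writes slower;
    # the shuffle filter reorders bytes so repetitive data compresses better
    f.create_dataset('init', data=a, compression='gzip',
                     compression_opts=9, shuffle=True)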
Edit 1:
To load an HDF5 file, such as a version 7.3 .mat file, you could do something like M2 = h5py.File('file.mat', 'r'). M2 is an HDF5 group, which is kind of like a Python dictionary. Doing M2.keys() gives you the variable names. If one of the variables is an array called "data", you can read it out by doing data = M2["data"][:].
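Put together, a minimal sketch of that workflow ('file.mat' and the variable name 'data' are placeholders). One caveat: MATLAB stores arrays column-major, so h5py typically returns them transposed relative to how they looked in MATLAB.
import h5py

# A version 7.3 .mat file is a regular HDF5 file; open it read-only
with h5py.File('file.mat', 'r') as M2:
    print(list(M2.keys()))   # names of the variables saved in the file
    data = M2['data'][:]     # read the dataset into a NumPy array
    data = data.T            # undo MATLAB's column-major layout if needed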
Edit 2:
To save multiple variables, you can create multiple datasets. The basic syntax is f.create_dataset("variable_name", data=variable); see the h5py documentation for more options. For example:
import h5py
import numpy as np
f = h5py.File('test.hdf5','w')
data1 = np.ones((4,4))
data2 = 2*data1
f.create_dataset("ones", data=data1)
f.create_dataset("twos", data=data2)
f is both a file object and an HDF5 group (the root group), so doing f.keys() gives:
[u'ones', u'twos']
To view what's stored under the 'ones' key, you would do:
f['ones'][:]
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])
You can create as many datasets as you would like. When you're done writing, close the file object: f.close().
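If you'd rather not keep track of f.close() yourself, h5py files also work as context managers, which close the file automatically:
import h5py
import numpy as np

with h5py.File('test.hdf5', 'w') as f:
    f.create_dataset("ones", data=np.ones((4, 4)))
# the file is closed here, even if an error occurred inside the block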
I should add that my approach here only works for array-like datasets. You can save other Python objects, such as lists and dictionaries, but doing so requires a bit more work. I only resort to HDF5 for large numpy arrays. For everything else, pickle works just fine for me.
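As an illustration of that extra work (this layout is just one possibility, not a standard): a flat dictionary of arrays and scalars can be mapped onto HDF5 datasets and attributes by hand:
import h5py
import numpy as np

# A hypothetical record mixing an array with scalar metadata
record = {'signal': np.arange(5), 'rate': 44100, 'label': 'test'}

with h5py.File('record.hdf5', 'w') as f:
    grp = f.create_group('record')
    for key, value in record.items():
        if isinstance(value, np.ndarray):
            grp.create_dataset(key, data=value)  # arrays become datasets
        else:
            grp.attrs[key] = value               # scalars and strings become attributes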