Reduce Memory Usage Of A Line Of Code That Uses Numpy
Solution 1:
If all elements of matrix
are >=0, then you can do:
np.percentile(matrix[np.any(matrix, axis = 1)], p, axis = 0)
This uses the fact that any float or integer other than 0
is interpreted as True
when viewed as a boolean (which np.any
does internally). Saves you from building that big boolean matrix seperately.
Since you're boolean indexing in matrix[...]
, you're creating a temporary copy that you don't really care if it gets overwritten during the percentile
process. Thus you can use overwrite_input = True
to save even more memory.
mat = matrix.copy()
perc = np.percentile(matrix[np.any(matrix, axis = 1)], p, axis = 0, overwrite_input = True)
np.array_equals(mat, matrix) # is `matrix` still the same?True
Finally, depending on your other archetecture, I'd recommend looking into making matrix
some flavor of scipy.sparse
, which should siginficantly reduce your memory usage again (although with some drawbacks depending on the type you use).
Solution 2:
I'm putting this as an answer since there's more than will fit in a comment, although it may not be complete. There's two suspicious things - first off percentile should run fine on a 20Gb matrix if your machine has 200Gb of ram available. That's a lot of memory, so start looking into what else might be using it. Start with top
- is there another process or is your python program using all of that?
The second suspicious thing is that the documentation for utils.percentile
doesn't match it's actual behavior. Here's the relevant bits from the code you've linked to:
defpercentile(matrix, p):
"""
Estimation of percentile without zeros.
....
Returns
-------
float
Calculated percentile.
"""return np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)
What it actually does is return the (columnwise) percentile calculated for rows which are not all zeros. edit That's rows which contain at least one positive element. If values are non-negative that's the same thing, but in general that will be a very different result.
np.any(matrix > 0, axis=1)
returns a boolean array to index rows which are not all zeros. For example
>>> np.any(array([[3, 4], [0, 0]]) > 0, axis=1)
array([ True, False])
>>> np.any(array([[3, 4], [1, 0]]) > 0, axis=1)
array([ True, True])
>>> np.any(array([[3, 0], [1, 0]]) > 0, axis=1)
array([ True, True])
That array is used to index matrix
, which selects only rows which are not all zeros and returns those.
You should read over the numpy docs for indexing if you aren't familiar with that way of indexing.
Calculating that takes a lot of memory - matrix > 0
creates a boolean array of the same dimension as matrix, then the indexing creates a copy of matrix
which probably contains most of the rows.
So, probably 2-4Gb for the boolean array and close to 20Gb for the copy.
That can be reduced,
## Find rows with all zeros, one row at a time to reduce memorymask = [np.any(r > 0) for r in matrix]
## Find percentile for each column, excluding rows with all zerosperc = [np.percentile(c[mask], p) for c in matrix.T]
However, as stated earlier that doesn't match the function documentation.
There may be a reason for this logic, but it is odd.
If you don't know the reason for it you might be fine calling np.percentile
directly - just check that it returns a close value for a smaller subset of your data.
There's also nanpercentile
, which can be used the same way but ignores nan
values.
You can use boolean indexing to replace the values you don't want included with nan
(i.e. matrix[matrix < 0] = np.nan
) and then call that.
Post a Comment for "Reduce Memory Usage Of A Line Of Code That Uses Numpy"