
How To Efficiently Operate A Large Numpy Array

I have a segment of code which is based on a large numpy array and then operates on another array. Because this is a very large array, could you please let me know whether there is an efficient way to do this?

Solution 1:

You are basically losing the efficiency of numpy here by performing the processing in Python. The idea of numpy is to process the items in bulk, since it has efficient algorithms written in C behind the curtains that do the actual processing. You can see the Python end of numpy more as an "interface".
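
To make the contrast concrete, here is a minimal sketch (the variable names are just for illustration) of bulk processing versus per-element processing in Python:

import numpy as np

a = np.random.rand(5)

# Bulk operation: one call, the loop runs in compiled code inside numpy.
doubled_fast = 2.0 * a

# Per-element processing in Python: every iteration pays interpreter overhead.
doubled_slow = np.array([2.0 * x for x in a])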

Now, to answer your question: we can first construct an array of random numbers between 0 and 2 by multiplying by 2 right away:

rand = 2.0 * np.random.rand(N)

Next, we can use np.where(..) [numpy-doc], which acts like a conditional selector. We pass it three "arrays": the first is an array of booleans that encodes the truthiness of the "condition", the second is the array of values to fill in where the condition is true, and the third is the array of values to plug in where the condition is false. We can then write it as:

import numpy as np

N = 1000000000
rand = 2.0 * np.random.rand(N)
beta = np.where(rand < 1.0, rand, 1.0 / (2.0 - rand))
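
As a sanity check (the names r, beta_a and beta_b here are just for illustration), this formulation on samples in [0, 2) gives the same result as the 2*r / 1/(2*(1 - r)) formulation used in the other answers:

import numpy as np

N = 1000
r = np.random.rand(N)       # uniform samples in [0, 1)
rand = 2.0 * r              # the same samples scaled to [0, 2)

beta_a = np.where(rand < 1.0, rand, 1.0 / (2.0 - rand))
beta_b = np.where(r < 0.5, 2.0 * r, 1.0 / (2.0 * (1.0 - r)))

assert np.allclose(beta_a, beta_b)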

Solution 2:

N = 1000000000 caused a MemoryError for me, so I reduced it to 100 for a minimal example. You can use the np.where routine.

In both cases, you are fundamentally iterating over your array and applying a function. However, np.where uses a much faster loop (it is essentially compiled code), while your "python" loop is interpreted and therefore really slow for a big N.

Here's an example implementation.

import numpy as np

N = 100
rand = np.random.rand(N)
beta = np.where(rand < 0.5, 2.0 * rand, 1.0 / (2.0 * (1.0 - rand)))
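
To get a feel for the difference, here is a rough timing sketch; the loop version below is a reconstruction of the kind of element-wise loop being replaced, and absolute numbers will vary by machine:

import timeit
import numpy as np

N = 1_000_000
rand = np.random.rand(N)

def loop_version(arr):
    # Interpreted Python loop over every element.
    out = np.empty(arr.shape[0])
    for i in range(arr.shape[0]):
        if arr[i] < 0.5:
            out[i] = 2.0 * arr[i]
        else:
            out[i] = 1.0 / (2.0 * (1.0 - arr[i]))
    return out

def where_version(arr):
    # One vectorised call; the element loop happens in compiled code.
    return np.where(arr < 0.5, 2.0 * arr, 1.0 / (2.0 * (1.0 - arr)))

print("loop:    ", timeit.timeit(lambda: loop_version(rand), number=3))
print("np.where:", timeit.timeit(lambda: where_version(rand), number=3))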

Solution 3:

As other answers have pointed out, iterating over the elements of a numpy array in a Python loop should (and can) almost always be avoided. In most cases going from a Python loop to an array operation gives a speedup of ~100x.

However, if performance is absolutely critical, you can often squeeze out another factor of between 2x and 10x (in my experience) by using Cython. Here's an example:

%%cython
cimport numpy as np
import numpy as np
cimport cython
from cython cimport floating

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef np.ndarray[floating, ndim=1] beta(np.ndarray[floating, ndim=1] arr):
    cdef:
        Py_ssize_t i
        Py_ssize_t N = arr.shape[0]
        np.ndarray[floating, ndim=1] result = np.zeros(N, dtype=arr.dtype)

    for i in range(N):
        if arr[i] < 0.5:
            result[i] = 2.0*arr[i]
        else:
            result[i] = 1.0/(2.0*(1.0-arr[i]))

    return result

You would then call it as beta(rand). As you can see, this allows you to use your original loop structure, but now using efficient typed native code. I get a speedup of ~2.5x compared to np.where.
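
Since the %%cython cell magic implies a Jupyter/IPython session, the surrounding workflow would look roughly like this (the timing comparison is illustrative):

# Enable the Cython cell magic, then run the %%cython cell above.
%load_ext cython

# In a later cell, call the compiled function and compare against np.where:
import numpy as np

rand = np.random.rand(10_000_000)

%timeit beta(rand)
%timeit np.where(rand < 0.5, 2.0 * rand, 1.0 / (2.0 * (1.0 - rand)))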

It should be noted that in many cases this is not worth the extra effort compared to the one-liner in numpy -- but it may well be worth it where performance is critical.
