Python for Data Analysis Part-4

Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.
Many ufuncs are simple element-wise transformations, like sqrt or exp:

In [1]: arr = np.arange(10)
In [2]: arr
Output: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [3]: np.sqrt(arr)
Output: array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
               2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])
In [4]: np.exp(arr)
Output: array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
               5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
               2.98095799e+03, 8.10308393e+03])
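To make the "vectorized wrapper" idea concrete, here is a minimal sketch comparing np.sqrt with an explicit Python loop over math.sqrt. The variable names looped and vectorized are only illustrative:

```python
import math
import numpy as np

arr = np.arange(10)

# Explicit Python loop: one math.sqrt call per element
looped = np.array([math.sqrt(v) for v in arr])

# Unary ufunc: a single call processes the whole array
vectorized = np.sqrt(arr)

print(np.allclose(looped, vectorized))  # True
print(vectorized.dtype)                 # float64
```

Both approaches give the same values; the ufunc simply moves the loop out of interpreted Python.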

These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:

In [5]: x = np.random.randn(8)
In [6]: y = np.random.randn(8)
In [7]: x
Output: array([-0.02290426, -0.08916755,  0.88118988, -0.52292579,  1.13871   ,
                0.67017463,  0.88709011, -1.37270741])
In [8]: y
Output: array([ 0.45301008,  1.56853366,  1.10285681, -0.25368907,  0.05783312,
                1.77564351, -0.88618735, -0.64202772])
In [9]: np.maximum(x, y)
Output: array([ 0.45301008,  1.56853366,  1.10285681, -0.25368907,  1.13871   ,
                1.77564351,  0.88709011, -0.64202772])

numpy.maximum computed the element-wise maximum of the elements in x and y.
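Because random input makes the result hard to check by eye, the following sketch repeats the idea with small hand-picked arrays (a and b are made-up values). It also shows how np.maximum and np.fmax differ when a NaN is present:

```python
import numpy as np

a = np.array([1.0, 5.0, np.nan])
b = np.array([4.0, 2.0, 3.0])

total = np.add(a, b)                  # same as a + b
largest = np.maximum(a, b)            # NaN propagates into the result
largest_ignoring_nan = np.fmax(a, b)  # NaN is ignored when the other value is a number

print(total)                 # values: 5.0, 7.0, nan
print(largest)               # values: 4.0, 5.0, nan
print(largest_ignoring_nan)  # values: 4.0, 5.0, 3.0
```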
While not common, a ufunc can return multiple arrays. modf is one example: a vectorized counterpart of Python's math.modf (and loosely analogous to the built-in divmod), it returns the fractional and integral parts of a floating-point array:

In [10]: arr = np.random.randn(7) * 5
In [11]: arr
Output: array([-3.06158748,  5.54673478, 12.32597205,  1.44032041,  7.36989174,
               -1.54463045,  0.83924527])
In [12]: remainder, whole_part = np.modf(arr)
In [13]: remainder
Output: array([-0.06158748,  0.54673478,  0.32597205,  0.44032041,  0.36989174,
               -0.54463045,  0.83924527])
In [14]: whole_part
Output: array([-3.,  5., 12.,  1.,  7., -1.,  0.])
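A small deterministic sketch of the same call may help here; the input values are chosen only because they are exactly representable as floats. Note that both returned parts carry the sign of the input, which is different from divmod(x, 1), where the quotient is floored:

```python
import numpy as np

values = np.array([-3.5, 2.25, 7.0])
frac, whole = np.modf(values)

print(frac)   # fractional parts: -0.5, 0.25, 0.0
print(whole)  # integral parts: -3.0, 2.0, 7.0

# Adding the two parts back together recovers the original values
print(np.array_equal(frac + whole, values))  # True
```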

Ufuncs accept an optional out argument that allows them to operate in-place on arrays:

In [15]: arr
Output: array([-3.06158748,  5.54673478, 12.32597205,  1.44032041,  7.36989174,
               -1.54463045,  0.83924527])
In [16]: np.sqrt(arr)
Output: array([       nan, 2.35515069, 3.51083637, 1.2001335 , 2.71475445,
               nan, 0.91610331])
In [17]: np.sqrt(arr, arr)
Output: array([       nan, 2.35515069, 3.51083637, 1.2001335 , 2.71475445,
               nan, 0.91610331])
In [18]: arr
Output: array([       nan, 2.35515069, 3.51083637, 1.2001335 , 2.71475445,
               nan, 0.91610331])
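The out argument can also be passed by keyword, and it does not have to be the input array. The sketch below, using made-up data, first writes into a preallocated array and then overwrites the input in place:

```python
import numpy as np

data = np.array([1.0, 4.0, 9.0])

# Write the result into a preallocated array instead of allocating a new one
result = np.empty_like(data)
np.sqrt(data, out=result)
print(result)  # values: 1.0, 2.0, 3.0

# Passing the input itself as out overwrites it in place
returned = np.sqrt(data, out=data)
print(returned is data)  # True
print(data)              # values: 1.0, 2.0, 3.0
```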

Unary ufuncs

Function: Description
abs, fabs: Compute the absolute value element-wise for integer, floating-point, or complex values
sqrt: Compute the square root of each element (equivalent to arr ** 0.5)
square: Compute the square of each element (equivalent to arr ** 2)
exp: Compute the exponent e^x of each element
log, log10, log2, log1p: Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively
sign: Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative)
ceil: Compute the ceiling of each element (i.e., the smallest integer greater than or equal to that number)
floor: Compute the floor of each element (i.e., the largest integer less than or equal to each element)
rint: Round elements to the nearest integer, preserving the dtype
modf: Return fractional and integral parts of array as separate arrays
isnan: Return boolean array indicating whether each value is NaN (Not a Number)
isfinite, isinf: Return boolean array indicating whether each element is finite (non-inf, non-NaN) or infinite, respectively
cos, cosh, sin, sinh, tan, tanh: Regular and hyperbolic trigonometric functions
arccos, arccosh, arcsin, arcsinh, arctan, arctanh: Inverse trigonometric functions
logical_not: Compute truth value of not x element-wise (equivalent to ~arr for boolean arrays)

Binary universal functions

Function: Description
add: Add corresponding elements in arrays
subtract: Subtract elements in second array from first array
multiply: Multiply array elements
divide, floor_divide: Divide or floor divide (truncating the remainder)
power: Raise elements in first array to powers indicated in second array
maximum, fmax: Element-wise maximum; fmax ignores NaN
minimum, fmin: Element-wise minimum; fmin ignores NaN
mod: Element-wise modulus (remainder of division)
copysign: Copy sign of values in second argument to values in first argument
greater, greater_equal, less, less_equal, equal, not_equal: Perform element-wise comparison, yielding boolean array (equivalent to infix operators >, >=, <, <=, ==, !=)
logical_and, logical_or, logical_xor: Compute element-wise truth value of logical operation (equivalent to infix operators &, |, ^ on boolean arrays)
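
As a quick illustration of a few of the binary ufuncs listed above, here is a short sketch with hand-picked integers (a, b, and signs are made-up values). It shows how floor_divide and mod behave with a negative operand:

```python
import numpy as np

a = np.array([7, -7, 2])
b = np.array([2, 2, 5])

quotients = np.floor_divide(a, b)  # 3, -4, 0 (rounds toward negative infinity)
remainders = np.mod(a, b)          # 1, 1, 2 (result takes the sign of the divisor)

signs = np.array([-1, 1, -1])
flipped = np.copysign(a, signs)    # -7.0, 7.0, -2.0 (magnitude of a, sign of signs)

# On boolean arrays, logical_and matches the infix & operator
mask = np.logical_and(a > 0, b > 2)
print(np.array_equal(mask, (a > 0) & (b > 2)))  # True
```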

Array-Oriented Programming with Arrays

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. Vectorized array operations are often one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact in numerical computations.
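
As a rough sketch of what "replacing explicit loops" means, the two computations below produce the same sum of squares. The array size is an arbitrary choice, and the actual speedup depends on your machine, so no timings are shown:

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

# Pure Python: every element passes through the interpreter
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorized: the loop runs in compiled code inside NumPy
vectorized_total = np.sum(values ** 2)

print(np.isclose(loop_total, vectorized_total))  # True
```

In IPython, wrapping each version in %timeit is one way to measure the difference yourself.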

As a simple example, suppose we wished to evaluate the function sqrt(x^2 + y^2) across a regular grid of values. The np.meshgrid function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of (x, y) in the two arrays:

In [19]: points = np.arange(-5, 5, 0.01) # 1000 equally spaced points
In [20]: xs, ys = np.meshgrid(points, points)
In [21]: ys
Output: array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
               [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
               [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
               ...,
               [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
               [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
               [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])

Now, evaluating the function is a matter of writing the same expression you would write with two points:

In [22]: z = np.sqrt(xs ** 2 + ys ** 2)
In [23]: z
Output: array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
                7.06400028],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815, 7.05692568],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354, 7.04985815],
       ...,
       [7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603, 7.04279774],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354, 7.04985815],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815, 7.05692568]])
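
The 1000 x 1000 grid is too large to inspect by hand, so here is the same pattern on a tiny grid. The inputs x and y are made-up values, chosen so that one point lands on a 3-4-5 triangle:

```python
import numpy as np

x = np.array([0.0, 3.0])
y = np.array([0.0, 4.0, 8.0])

xs, ys = np.meshgrid(x, y)
print(xs.shape, ys.shape)  # (3, 2) (3, 2)

# xs repeats x along each row; ys repeats y down each column
z = np.sqrt(xs ** 2 + ys ** 2)
print(z[1, 1])  # 5.0, the distance from the origin to the point (3, 4)
```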

I use matplotlib to create visualizations of this two-dimensional array:

In [24]: import matplotlib.pyplot as plt
In [25]: plt.imshow(z, cmap=plt.cm.gray); plt.colorbar()
Output: <matplotlib.colorbar.Colorbar at 0x2aa2ce16280>
Plot of function evaluated on grid

Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y. Suppose we had a boolean array and two arrays of values:

In [26]: xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
In [27]: yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
In [28]: cond = np.array([True, False, True, True, False])

Suppose we wanted to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr. A list comprehension doing this might look like:

In [29]: result = [(x if c else y)
         for x, y, c in zip(xarr, yarr, cond)]
In [30]: result
Output: [1.1, 2.2, 1.3, 1.4, 2.5]

This has multiple problems. First, it will not be very fast for large arrays (because all the work is being done in interpreted Python code). Second, it will not work with multidimensional arrays. With np.where you can write this very concisely:

In [31]: result = np.where(cond, xarr, yarr)
In [32]: result
Output: array([1.1, 2.2, 1.3, 1.4, 2.5])
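
Unlike the list comprehension, np.where handles multidimensional input without any changes. A minimal sketch with made-up 2 x 2 arrays:

```python
import numpy as np

cond = np.array([[True, False],
                 [False, True]])
x = np.array([[1, 2],
              [3, 4]])
y = np.array([[10, 20],
              [30, 40]])

# Take from x where cond is True, otherwise from y
picked = np.where(cond, x, y)
print(picked.tolist())  # [[1, 20], [30, 4]]
```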

The second and third arguments to np.where don’t need to be arrays; one or both of them can be scalars. A typical use of where in data analysis is to produce a new array of values based on another array. Suppose you had a matrix of randomly generated data and you wanted to replace all positive values with 2 and all negative values with -2. This is very easy to do with np.where:

In [33]: arr = np.random.randn(4, 4)
In [34]: arr
Output: array([[-0.44193858,  0.07157006,  2.2441608 , -0.37031764],
       [-0.54374974, -2.62563459,  0.35334962,  1.45375499],
       [-0.09094342,  0.79745038,  0.85707545, -0.6020405 ],
       [-0.61344176, -0.92205991, -0.75273804, -1.69501996]])
In [35]: arr > 0
Output: array([[False,  True,  True, False],
       [False, False,  True,  True],
       [False,  True,  True, False],
       [False, False, False, False]])
In [36]: np.where(arr > 0, 2, -2)
Output: array([[-2,  2,  2, -2],
       [-2, -2,  2,  2],
       [-2,  2,  2, -2],
       [-2, -2, -2, -2]])

You can combine scalars and arrays when using np.where. For example, I can replace all positive values in arr with the constant 2 like so:

In [37]: np.where(arr > 0, 2, arr) # set only positive values to 2
Output: array([[-0.44193858,  2.        ,  2.        , -0.37031764],
       [-0.54374974, -2.62563459,  2.        ,  2.        ],
       [-0.09094342,  2.        ,  2.        , -0.6020405 ],
       [-0.61344176, -0.92205991, -0.75273804, -1.69501996]])

The arrays passed to np.where can be more than just equal-sized arrays or scalars.
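
One way to read this is that the arguments to np.where follow NumPy's usual broadcasting rules. In the sketch below (arr and row are made-up values), a one-dimensional array of length 3 is broadcast against a 2 x 3 array:

```python
import numpy as np

arr = np.array([[1.0, -2.0, 3.0],
                [-4.0, 5.0, -6.0]])
row = np.array([10.0, 20.0, 30.0])  # shape (3,), broadcast across both rows

# Keep positive entries of arr; elsewhere use the value from the same column of row
result = np.where(arr > 0, arr, row)
print(result.tolist())  # [[1.0, 20.0, 3.0], [10.0, 5.0, 30.0]]
```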
