Python for Data Analysis Part-5

Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the data along an axis is accessible as methods of the array class. You can use aggregations (often called reductions) like sum, mean, and std (standard deviation) either by calling the array instance method or by using the top-level NumPy function. The examples below assume NumPy has been imported with import numpy as np. Here I generate some normally distributed random data and compute some aggregate statistics:

In [1]: arr = np.random.randn(5, 4)
In [2]: arr
Output: array([[-1.41926999, -0.51497632, -1.5654179 , -0.93438156],
               [-0.23054614, -0.35317017,  1.63842749, -0.05204039],
               [ 0.66333846, -0.44345483, -0.19005873, -1.02629071],
               [-1.06231653,  0.74428324,  0.52912063, -0.84593841],
               [ 0.22920461, -0.40649209,  0.35509243,  1.76018706]])
In [3]: arr.mean()
Output: -0.15623499285157422
In [4]: np.mean(arr)
Output: -0.15623499285157422
In [5]: arr.sum()
Output: -3.1246998570314846

Functions like mean and sum take an optional axis argument that computes the statistic over the given axis, resulting in an array with one fewer dimension:

In [6]: arr.mean(axis=1)
Output: array([-1.1085,  0.2507, -0.2491, -0.1587,  0.4845])
In [7]: arr.sum(axis=0)
Output: array([-1.8196, -0.9738,  0.7672, -1.0985])

Here, arr.mean(1) means “compute the mean across the columns” (one value per row), whereas arr.sum(0) means “compute the sum down the rows” (one value per column).
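
To make the axis convention concrete, here is a minimal sketch on a small array whose values are easy to check by hand. The array and the names row_means and col_sums are made up for illustration; keepdims is a standard optional argument of these reductions.

```python
import numpy as np

# A small 2 x 3 array with easy-to-check values (illustrative data).
arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

# axis=1 collapses the column axis: one mean per row.
row_means = arr.mean(axis=1)   # array([2., 5.])

# axis=0 collapses the row axis: one sum per column.
col_sums = arr.sum(axis=0)     # array([5., 7., 9.])

# Each reduction removes one dimension from the (2, 3) input.
print(row_means.shape, col_sums.shape)   # (2,) (3,)

# keepdims=True keeps the reduced axis with length 1 instead of dropping it.
kept = arr.mean(axis=1, keepdims=True)
print(kept.shape)                        # (2, 1)
```
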
Other methods like cumsum and cumprod do not aggregate, instead producing an array of the intermediate results:

In [8]: arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
In [9]: arr.cumsum()
Output: array([ 0, 1, 3, 6, 10, 15, 21, 28])

In multidimensional arrays, accumulation functions like cumsum return an array of the same size, with the partial aggregates computed along the indicated axis within each lower-dimensional slice:

In [10]: arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
In [11]: arr
Output: array([[0, 1, 2],
               [3, 4, 5],
               [6, 7, 8]])
In [12]: arr.cumsum(axis=0)
Output: array([[ 0,  1,  2],
               [ 3,  5,  7],
               [ 9, 12, 15]], dtype=int32)
In [13]: arr.cumprod(axis=1)
Output: array([[  0,   0,   0],
               [  3,  12,  60],
               [  6,  42, 336]], dtype=int32)
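
One way to sanity-check an accumulation is to compare its last slice with the matching reduction. The sketch below reuses the 3 x 3 array from above; the names running, running_prod, and flat_running are illustrative. It also shows that calling cumsum without an axis works on the flattened array.

```python
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

# Running totals down the rows: the last row holds the column sums.
running = arr.cumsum(axis=0)
print(running[-1])            # [ 9 12 15]
print(arr.sum(axis=0))        # [ 9 12 15]

# Running products across the columns: the last column holds the row products.
running_prod = arr.cumprod(axis=1)
print(running_prod[:, -1])    # [  0  60 336]

# With no axis argument, cumsum accumulates over the flattened array.
flat_running = arr.cumsum()
print(flat_running)           # [ 0  1  3  6 10 15 21 28 36]
```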

Method	Description
sum	Sum of all the elements in the array or along an axis; zero-length arrays have sum 0
mean	Arithmetic mean; zero-length arrays have NaN mean
std, var	Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default denominator n)
min, max	Minimum and maximum
argmin, argmax	Indices of minimum and maximum elements, respectively
cumsum	Cumulative sum of elements starting from 0
cumprod	Cumulative product of elements starting from 1

Basic array statistical methods
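
The degrees of freedom adjustment mentioned in the table is controlled by the ddof argument of std and var. The following sketch uses a small made-up data set to contrast the default (denominator n) with ddof=1 (denominator n - 1), and shows argmin and argmax returning positions rather than values.

```python
import numpy as np

# Illustrative data: mean is 5.0 and the squared deviations sum to 32.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default ddof=0 divides by n (population standard deviation).
pop_std = data.std()            # sqrt(32 / 8) = 2.0

# ddof=1 divides by n - 1 (sample standard deviation).
sample_std = data.std(ddof=1)   # sqrt(32 / 7), roughly 2.138

print(pop_std, sample_std)

# argmin / argmax return the index of the extreme value, not the value itself.
lo_idx = data.argmin()          # 0
hi_idx = data.argmax()          # 7
print(data[lo_idx], data[hi_idx])   # 2.0 9.0
```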

Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array:

In [14]: arr = np.random.randn(100)
In [15]: (arr > 0).sum() 
Output: 55
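
Because True counts as 1, the mean of a boolean array gives the fraction of True values. Here is a small sketch with made-up data; np.count_nonzero is shown as an equivalent way to count.

```python
import numpy as np

# Illustrative data with three positive values out of five.
values = np.array([-1.5, 0.3, 2.1, -0.7, 0.9])
mask = values > 0

n_positive = mask.sum()              # 3: number of True values
frac_positive = mask.mean()          # 0.6: share of True values
n_nonzero = np.count_nonzero(mask)   # 3: same count via a dedicated function

print(n_positive, frac_positive, n_nonzero)
```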

There are two additional methods, any and all, useful especially for boolean arrays. any tests whether one or more values in an array is True, while all checks if every value is True:

In [16]: bools = np.array([False, False, True, False])
In [17]: bools.any()
Output: True
In [18]: bools.all()
Output: False

These methods also work with non-boolean arrays, where non-zero elements evaluate to True.
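
As a brief sketch of both points, the example below applies any and all to an integer array and then to a two-dimensional boolean array along an axis. The arrays nums and grid are made up for illustration.

```python
import numpy as np

# Non-boolean input: non-zero elements are treated as True.
nums = np.array([0, 3, 0, 5])
print(nums.any())   # True: at least one element is non-zero
print(nums.all())   # False: some elements are zero

# Like other reductions, any and all accept an axis argument.
grid = np.array([[True, False],
                 [True, True]])
any_per_row = grid.any(axis=1)   # array([ True,  True])
all_per_col = grid.all(axis=0)   # array([ True, False])
print(any_per_row, all_per_col)
```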

Sorting

Like Python’s built-in list type, NumPy arrays can be sorted in-place with the sort method:

In [19]: arr = np.random.randn(6)
In [20]: arr
Output: array([ 0.73179344, -2.06654121,  0.13480418, -0.53344594,  0.2144884 ,
                -0.60337149])
In [21]: arr.sort()
In [22]: arr
Output: array([-2.06654121, -0.60337149, -0.53344594,  0.13480418,  0.2144884 ,
                0.73179344])

You can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to sort:

In [23]: arr = np.random.randn(5, 3)
In [24]: arr
Output: array([[-2.62539204,  1.84833573,  0.74607911],
               [ 0.06213531, -1.2281643 , -0.6566129 ],
               [ 1.79469638, -1.05991128, -0.38980222],
               [ 0.83709998, -0.00542368, -0.50167048],
               [ 0.77767724,  0.84626049, -0.8792752 ]])
In [25]: arr.sort(1)
In [26]: arr
Output: array([[-2.62539204,  0.74607911,  1.84833573],
               [-1.2281643 , -0.6566129 ,  0.06213531],
               [-1.05991128, -0.38980222,  1.79469638],
               [-0.50167048, -0.00542368,  0.83709998],
               [-0.8792752 ,  0.77767724,  0.84626049]])

The top-level function np.sort returns a sorted copy of an array instead of modifying the array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it and select the value at a particular rank:

In [27]: large_arr = np.random.randn(1000)
In [28]: large_arr.sort()
In [29]: large_arr[int(0.05 * len(large_arr))] # 5% quantile
Output: -1.6229011811372236
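
The difference between the copy and the in-place sort, and the rank-based quantile, can be checked on deterministic data. In this sketch the arrays are made up; np.quantile is included only for comparison, and its result can differ slightly from the rank-based value because it interpolates between neighboring elements by default.

```python
import numpy as np

data = np.array([5, 3, 9, 1, 7])

# np.sort returns a new sorted array and leaves the input untouched.
sorted_copy = np.sort(data)
print(sorted_copy)   # [1 3 5 7 9]
print(data)          # [5 3 9 1 7]

# The sort method works in-place and returns None.
result = data.sort()
print(result, data)  # None [1 3 5 7 9]

# Quick-and-dirty 5% quantile on the integers 100, 99, ..., 1.
values = np.arange(100, 0, -1)
ordered = np.sort(values)
rank_q05 = ordered[int(0.05 * len(ordered))]   # element at index 5, which is 6

# np.quantile interpolates linearly by default, giving about 5.95 here.
interp_q05 = np.quantile(values, 0.05)
print(rank_q05, interp_q05)
```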

NumPy’s sorting methods support more than is shown here, including more advanced techniques such as indirect sorts.
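
As a minimal sketch of an indirect sort, np.argsort returns the indices that would sort an array rather than the sorted values, so the same ordering can be applied to related arrays. The names and ages arrays here are made up for illustration.

```python
import numpy as np

# Two parallel arrays: ages[i] belongs to names[i] (illustrative data).
names = np.array(['Will', 'Bob', 'Joe'])
ages = np.array([42, 35, 27])

# argsort gives the positions that put ages in ascending order.
order = np.argsort(ages)
print(order)          # [2 1 0]

# Indexing with those positions reorders both arrays consistently.
print(ages[order])    # [27 35 42]
print(names[order])   # ['Joe' 'Bob' 'Will']
```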

Unique and Other Set Logic

NumPy has some basic set operations for one-dimensional ndarrays. A commonly used one is np.unique, which returns the sorted unique values in an array:

In [30]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [31]: np.unique(names)
Output: array(['Bob', 'Joe', 'Will'], dtype='<U4')
In [32]: ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
In [33]: np.unique(ints)
Output: array([1, 2, 3, 4])

Contrast np.unique with the pure Python alternative:

In [34]: sorted(set(names))
Output: ['Bob', 'Joe', 'Will']
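
One thing np.unique offers beyond the pure Python version is optional extra output. As a short sketch, passing return_counts=True also returns how many times each unique value occurs; the variable names uniques and counts are illustrative.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

# return_counts=True adds a second array with the frequency of each value.
uniques, counts = np.unique(names, return_counts=True)
print(uniques)   # ['Bob' 'Joe' 'Will']
print(counts)    # [2 3 2]

# Pair each unique value with its count.
for name, count in zip(uniques, counts):
    print(name, count)
```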

Another function, np.in1d, tests membership of the values in one array in another, returning a boolean array (NumPy 2.0 deprecates np.in1d in favor of np.isin, which is called the same way here):

In [35]: values = np.array([6, 0, 0, 3, 2, 5, 6])
In [36]: np.in1d(values, [2, 3, 6])
Output: array([ True, False, False,  True,  True, False,  True])
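
A membership mask like this is typically used to filter an array. The sketch below uses np.isin, the replacement NumPy recommends for np.in1d, on the same values; the names mask, kept, and dropped are illustrative.

```python
import numpy as np

values = np.array([6, 0, 0, 3, 2, 5, 6])

# Boolean mask: True where the element appears in the second argument.
mask = np.isin(values, [2, 3, 6])
print(mask)      # [ True False False  True  True False  True]

# Use the mask to keep the matching elements ...
kept = values[mask]
print(kept)      # [6 3 2 6]

# ... or invert it with ~ to keep everything else.
dropped = values[~mask]
print(dropped)   # [0 0 5]
```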

Method	Description
unique(x)	Compute the sorted, unique elements in x
intersect1d(x, y)	Compute the sorted, common elements in x and y
union1d(x, y)	Compute the sorted union of elements
in1d(x, y)	Compute a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y)	Set difference, elements in x that are not in y
setxor1d(x, y)	Set symmetric differences; elements that are in either of the arrays, but not both

Array set operations
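
The remaining functions in the table can be tried on two small arrays. This is a minimal sketch with made-up inputs x and y; note that each result is sorted and free of duplicates even though x contains a repeated value.

```python
import numpy as np

# Illustrative inputs; x contains a duplicate 4.
x = np.array([1, 2, 3, 4, 4])
y = np.array([3, 4, 5, 6])

common = np.intersect1d(x, y)   # elements in both arrays
combined = np.union1d(x, y)     # elements in either array
only_x = np.setdiff1d(x, y)     # elements in x that are not in y
either = np.setxor1d(x, y)      # elements in exactly one of the arrays

print(common)     # [3 4]
print(combined)   # [1 2 3 4 5 6]
print(only_x)     # [1 2]
print(either)     # [1 2 5 6]
```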
