Python for Data Analysis part-3

Boolean Indexing

Let’s consider an example where we have some data in an array and an array of names with duplicates. Here I use the randn function in numpy.random to generate some normally distributed random data:

In [1]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [2]: names
Output: array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
In [3]: data = np.random.randn(7, 4)
In [4]: data
Output: array([[-0.2713569 ,  0.40268757, -0.2054311 , -1.85882617],
               [-1.41180945,  0.37541494, -0.3877    ,  0.44347866],
               [ 0.85662917,  1.08506375, -0.11029386,  0.0993109 ],
               [-0.64470778,  0.08320262, -0.01493337, -0.68207249],
               [-0.71813298, -0.24733692, -1.34792016,  0.41605464],
               [ 0.57849793,  1.05163699, -0.08045042,  0.84083471],
               [ 0.41226876,  0.35230496, -0.45277177, -1.30446054]])

Suppose each name corresponds to a row in the data array and we want to select all the rows with the corresponding name ‘Bob’. Like arithmetic operations, comparisons (such as ==) with arrays are also vectorized. Thus, comparing names with the string ‘Bob’ yields a boolean array:

In [5]: names == 'Bob'
Output: array([ True, False, False,  True, False, False, False])

This boolean array can be passed when indexing the array:

In [6]: data[names == 'Bob'] 
Output: array([[-0.2713569 ,  0.40268757, -0.2054311 , -1.85882617],
               [-0.64470778,  0.08320262, -0.01493337, -0.68207249]])

The boolean array must be of the same length as the array axis it’s indexing. You can even mix and match boolean arrays with slices or integers (or sequences of integers; more on this later).

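Note: In older versions of NumPy, boolean selection did not fail when the boolean array was the wrong length. Recent versions raise an IndexError instead, but I still recommend checking that the mask length matches the axis you are indexing.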
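To make the "mix and match" idea concrete, here is a small sketch that combines a boolean row mask with a column slice and with a single column index. I use np.arange instead of random data (my own substitution, so the values are predictable), and the explicit length check is just a defensive habit, not something NumPy requires:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)  # stand-in for the random data above

mask = names == 'Bob'
# Defensive check: the mask must have one entry per row of data
assert mask.shape[0] == data.shape[0]

bob_tail = data[mask, 2:]  # rows where mask is True, columns 2 onward
bob_last = data[mask, 3]   # rows where mask is True, column 3 only

print(bob_tail)  # [[ 2  3]
                 #  [14 15]]
print(bob_last)  # [ 3 15]
```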

To select everything except ‘Bob’, you can either use != or negate the condition using ~:

In [7]: names != 'Bob'
Output: array([False,  True,  True, False,  True,  True,  True])

In [8]: data[~(names == 'Bob')]
Output: array([[-1.41180945,  0.37541494, -0.3877    ,  0.44347866],
               [ 0.85662917,  1.08506375, -0.11029386,  0.0993109 ],
               [-0.71813298, -0.24733692, -1.34792016,  0.41605464],
               [ 0.57849793,  1.05163699, -0.08045042,  0.84083471],
               [ 0.41226876,  0.35230496, -0.45277177, -1.30446054]])

The ~ operator can be useful when you want to invert a general condition:

In [9]: cond = names == 'Bob'
In [10]: data[~cond]
Output: array([[-1.41180945,  0.37541494, -0.3877    ,  0.44347866],
               [ 0.85662917,  1.08506375, -0.11029386,  0.0993109 ],
               [-0.71813298, -0.24733692, -1.34792016,  0.41605464],
               [ 0.57849793,  1.05163699, -0.08045042,  0.84083471],
               [ 0.41226876,  0.35230496, -0.45277177, -1.30446054]])
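As a quick sanity check, the two spellings of "not Bob" should produce the same mask. This is only a sketch; the variable names not_bob_a and not_bob_b are mine:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

cond = names == 'Bob'
not_bob_a = names != 'Bob'  # direct comparison
not_bob_b = ~cond           # negation of a stored condition

same = np.array_equal(not_bob_a, not_bob_b)
print(same)          # True
print(names[~cond])  # ['Joe' 'Will' 'Will' 'Joe' 'Joe']
```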

To select two of the three names by combining multiple boolean conditions, use boolean arithmetic operators like & (and) and | (or):

In [11]: mask = (names == 'Bob') | (names == 'Will')
In [12]: mask
Output: array([ True, False,  True,  True,  True, False, False])
In [13]: data[mask]
Output: array([[-0.2713569 ,  0.40268757, -0.2054311 , -1.85882617],
               [ 0.85662917,  1.08506375, -0.11029386,  0.0993109 ],
               [-0.64470778,  0.08320262, -0.01493337, -0.68207249],
               [-0.71813298, -0.24733692, -1.34792016,  0.41605464]])

Selecting data from an array by boolean indexing always creates a copy of the data, even if the returned array is unchanged.
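One way to convince yourself of the copy behavior is to modify the selected array and confirm the original is untouched. The sketch below uses a small made-up array and np.shares_memory as the check:

```python
import numpy as np

data = np.arange(12.0).reshape(3, 4)
mask = np.array([True, False, True])

selected = data[mask]  # boolean indexing returns a new array
selected[:] = -1.0     # changes only the copy

unchanged = bool(data[0, 0] == 0.0)
shares = np.shares_memory(data, selected)
print(unchanged, shares)  # True False
```

Note that this applies to reading with data[mask]. Assigning through a mask, as in data[mask] = value, writes into the original array.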

Note: The Python keywords and and or do not work with boolean arrays. Use & (and) and | (or) instead.
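Here is a short sketch of what goes wrong with the keyword form. Python's or asks for a single truth value for the whole array, which NumPy refuses to give for arrays with more than one element. The parentheses around each comparison matter as well, because & and | bind more tightly than ==:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

try:
    mask = (names == 'Bob') or (names == 'Will')
    raised = False
except ValueError:
    # "The truth value of an array with more than one element is ambiguous"
    raised = True

# The elementwise operators work as intended
either = (names == 'Bob') | (names == 'Will')
neither = (names != 'Bob') & (names != 'Will')

print(raised)   # True
print(either)   # [ True False  True  True  True False False]
print(neither)  # [False  True False False False  True  True]
```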

Setting values with boolean arrays works in a common-sense way. To set all of the negative values in data to 0 we need only do:

In [14]: data[data < 0] = 0
In [15]: data
Output: array([[0.        , 0.40268757, 0.        , 0.        ],
               [0.        , 0.37541494, 0.        , 0.44347866],
               [0.85662917, 1.08506375, 0.        , 0.0993109 ],
               [0.        , 0.08320262, 0.        , 0.        ],
               [0.        , 0.        , 0.        , 0.41605464],
               [0.57849793, 1.05163699, 0.        , 0.84083471],
               [0.41226876, 0.35230496, 0.        , 0.        ]])

Setting whole rows or columns using a one-dimensional boolean array is also easy:

In [16]: data[names != 'Joe'] = 7
In [17]: data
Output: array([[7.        , 7.        , 7.        , 7.        ],
               [0.        , 0.37541494, 0.        , 0.44347866],
               [7.        , 7.        , 7.        , 7.        ],
               [7.        , 7.        , 7.        , 7.        ],
               [7.        , 7.        , 7.        , 7.        ],
               [0.57849793, 1.05163699, 0.        , 0.84083471],
               [0.41226876, 0.35230496, 0.        , 0.        ]])
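The value on the right-hand side does not have to be a scalar. As a rough sketch (with np.arange data of my own choosing so the result is predictable), a list with one value per column is broadcast across every selected row:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(-14.0, 14.0).reshape(7, 4)  # values -14 through 13

data[data < 0] = 0                   # elementwise mask, same shape as data
data[names == 'Joe'] = [1, 2, 3, 4]  # row mask; the list fills each selected row

print(data[1])  # [1. 2. 3. 4.]
print(data[3])  # [0. 0. 0. 1.]
```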

Fancy Indexing

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. Suppose we had an 8 × 4 array:

In [18]: arr = np.empty((8, 4))
In [19]: for i in range(8):
              arr[i] = i
In [20]: arr
Output: array([[0., 0., 0., 0.],
               [1., 1., 1., 1.],
               [2., 2., 2., 2.],
               [3., 3., 3., 3.],
               [4., 4., 4., 4.],
               [5., 5., 5., 5.],
               [6., 6., 6., 6.],
               [7., 7., 7., 7.]])

To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:

In [21]: arr[[4, 3, 0, 6]]
Output: array([[4., 4., 4., 4.],
               [3., 3., 3., 3.],
               [0., 0., 0., 0.],
               [6., 6., 6., 6.]])

Hopefully this code did what you expected! Using negative indices selects rows from the end:

In [22]: arr[[-3, -5, -7]]
Output: array([[5., 5., 5., 5.],
               [3., 3., 3., 3.],
               [1., 1., 1., 1.]])
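A couple of variations are worth trying. In this sketch the index is an ndarray rather than a list, and a second selection repeats a row and mixes in a negative index:

```python
import numpy as np

arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i

idx = np.array([4, 3, 0, 6])  # an ndarray of integers works like a list
picked = arr[idx]

repeated = arr[[2, 2, -1]]    # the same row may be selected more than once
first_col = repeated[:, 0]

print(picked[:, 0])  # [4. 3. 0. 6.]
print(first_col)     # [2. 2. 7.]
```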

Passing multiple index arrays does something slightly different; it selects a one-dimensional array of elements corresponding to each tuple of indices:

In [23]: arr = np.arange(32).reshape((8, 4))
In [24]: arr
Output: array([[ 0,  1,  2,  3],
               [ 4,  5,  6,  7],
               [ 8,  9, 10, 11],
               [12, 13, 14, 15],
               [16, 17, 18, 19],
               [20, 21, 22, 23],
               [24, 25, 26, 27],
               [28, 29, 30, 31]])
In [25]: arr[[1, 5, 7, 2], [0, 3, 1, 2]]
Output: array([ 4, 23, 29, 10])

Here the elements (1, 0), (5, 3), (7, 1), and (2, 2) were selected. Regardless of how many dimensions the array has (here, only 2), the result of fancy indexing with as many one-dimensional integer arrays as there are axes is always one-dimensional.
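One way to think about this pairing is as a loop over zip(rows, cols). The sketch below (the names rows, cols, and looped are mine) checks that the explicit loop gives the same result:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

fancy = arr[rows, cols]
# Equivalent, but slower: pick one element per (row, col) pair
looped = np.array([arr[i, j] for i, j in zip(rows, cols)])

print(fancy)   # [ 4 23 29 10]
print(looped)  # [ 4 23 29 10]
```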
The behavior of fancy indexing in this case is a bit different from what some users might have expected (myself included), which is the rectangular region formed by selecting a subset of the matrix’s rows and columns. Here is one way to get that:

In [26]: arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
Output: array([[ 4,  7,  5,  6],
               [20, 23, 21, 22],
               [28, 31, 29, 30],
               [ 8, 11,  9, 10]])
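Another way to get the same rectangular region, not shown in the text above, is NumPy's np.ix_ helper, which turns the two index lists into a form that selects every row and column combination. A minimal sketch comparing the two approaches:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

chained = arr[rows][:, cols]      # select rows first, then reorder columns
via_ix = arr[np.ix_(rows, cols)]  # one indexing step

print(np.array_equal(chained, via_ix))  # True
print(via_ix[0])                        # [4 7 5 6]
```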

Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.
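The difference is easy to see side by side. In this sketch, a slice and a fancy index select the same two rows, but only the slice shares memory with the original array:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

sliced = arr[1:3]    # slicing: a view on arr
fancy = arr[[1, 2]]  # fancy indexing: a new array

slice_shares = np.shares_memory(arr, sliced)
fancy_shares = np.shares_memory(arr, fancy)

fancy[:] = 0         # does not touch arr
sliced[0, 0] = -1    # writes through to arr[1, 0]

print(slice_shares, fancy_shares)  # True False
print(arr[1])                      # [-1  5  6  7]
```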

Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the transpose method and also the special T attribute:

In [27]: arr = np.arange(15).reshape((3, 5))
In [28]: arr
Output: array([[ 0,  1,  2,  3,  4],
               [ 5,  6,  7,  8,  9],
               [10, 11, 12, 13, 14]])
In [29]: arr.T
Output: array([[ 0,  5, 10],
               [ 1,  6, 11],
               [ 2,  7, 12],
               [ 3,  8, 13],
               [ 4,  9, 14]])
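Because the transpose is a view, writing to it changes the original array. The following sketch demonstrates that, and also checks that calling transpose() with no arguments matches the T attribute:

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))
t = arr.T

same_as_method = np.array_equal(t, arr.transpose())
shares = np.shares_memory(arr, t)

t[0, 1] = 99  # element (0, 1) of the transpose is element (1, 0) of arr

print(t.shape)                 # (5, 3)
print(same_as_method, shares)  # True True
print(arr[1, 0])               # 99
```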

When doing matrix computations, you may do this very often—for example, when computing the inner matrix product using np.dot:

In [30]: arr = np.random.randn(6, 3)
In [31]: arr
Output: array([[ 1.09629077,  0.17855129, -0.55832893],
               [ 0.41781671, -0.5937268 , -1.86360961],
               [ 0.17917602, -1.23890631, -0.55298743],
               [-0.23008981,  1.43399644,  0.46172795],
               [ 0.52366647,  1.05202095,  0.78947106],
               [ 0.01712223,  0.75484918, -0.6522437 ]])
In [32]: np.dot(arr.T, arr)
Output: array([[ 1.73598936, -0.04042232, -1.1938074 ],
               [-0.04042232,  5.65217206,  2.69219495],
               [-1.1938074 ,  2.69219495,  5.35244615]])
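A few properties of this product can be checked without knowing the random values. As a sketch: the result has one row and one column per column of arr, it is symmetric, and the @ operator gives the same answer as np.dot for two-dimensional arrays:

```python
import numpy as np

arr = np.random.randn(6, 3)

gram = np.dot(arr.T, arr)  # (3, 6) times (6, 3) gives (3, 3)
same_as_matmul = np.allclose(gram, arr.T @ arr)
symmetric = np.allclose(gram, gram.T)

print(gram.shape)                 # (3, 3)
print(same_as_matmul, symmetric)  # True True
```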

For higher-dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes (for extra mind bending):

In [33]: arr = np.arange(16).reshape((2, 2, 4))
In [34]: arr
Output: array([[[ 0,  1,  2,  3],
                [ 4,  5,  6,  7]],

               [[ 8,  9, 10, 11],
                [12, 13, 14, 15]]])
In [35]: arr.transpose((1, 0, 2))
Output: array([[[ 0,  1,  2,  3],
                [ 8,  9, 10, 11]],

               [[ 4,  5,  6,  7],
                [12, 13, 14, 15]]])

Here, the axes have been reordered with the second axis first, the first axis second, and the last axis unchanged.
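The permutation is easier to follow when the axes have different lengths. In this sketch I use a (2, 3, 4) array (my own choice, not the array from the example above) and check that element (i, j, k) of the original ends up at (j, i, k):

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))
t = arr.transpose((1, 0, 2))

# Every element moves from (i, j, k) to (j, i, k)
matches = all(
    t[j, i, k] == arr[i, j, k]
    for i in range(2)
    for j in range(3)
    for k in range(4)
)

print(arr.shape, t.shape)  # (2, 3, 4) (3, 2, 4)
print(matches)             # True
```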
Simple transposing with .T is a special case of swapping axes. ndarray has the method swapaxes, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:

In [36]: arr
Output: array([[[ 0,  1,  2,  3],
                [ 4,  5,  6,  7]],

               [[ 8,  9, 10, 11],
                [12, 13, 14, 15]]])
In [37]: arr.swapaxes(1, 2)
Output: array([[[ 0,  4],
                [ 1,  5],
                [ 2,  6],
                [ 3,  7]],

               [[ 8, 12],
                [ 9, 13],
                [10, 14],
                [11, 15]]])

swapaxes similarly returns a view on the data without making a copy.
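To tie swapaxes back to transpose, this sketch checks that swapping axes 1 and 2 is the same as transposing with the axis order (0, 2, 1), and that the result still shares memory with the original array:

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
swapped = arr.swapaxes(1, 2)

same_as_transpose = np.array_equal(swapped, arr.transpose((0, 2, 1)))
shares = np.shares_memory(arr, swapped)

print(swapped.shape)              # (2, 4, 2)
print(same_as_transpose, shares)  # True True
```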
