Python For Data Analysis Part-10

Essential Functionality

Reindexing

An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example:

Input[1]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
Input[2]: obj
Output:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

Input[3]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
Input[4]: obj2
Output:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values:

Input[5]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
Input[6]: obj3
Output:
0     blue
2   purple
4   yellow
dtype: object

Input[7]: obj3.reindex(range(6), method='ffill')
Output:
0     blue
1     blue
2   purple
3   purple
4   yellow
5   yellow
dtype: object

With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:

Input[8]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
Input[9]: frame
Output:
	Ohio	Texas	California
a	0	1	2
c	3	4	5
d	6	7	8

Input[10]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
Input[11]: frame2
Output:
	Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
d	6.0	7.0	8.0

The columns can be reindexed with the columns keyword:

Input[12]: states = ['Texas', 'Utah', 'California']
Input[13]: frame.reindex(columns=states)
Output:
	Texas	Utah	California
a	1	NaN	2
c	4	NaN	5
d	7	NaN	8

See the table below for more about the arguments to reindex.

Argument	Description
index	New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying.
method	Interpolation (fill) method; 'ffill' fills forward, while 'bfill' fills backward.
fill_value	Substitute value to use when introducing missing data by reindexing.
limit	When forward- or backfilling, maximum size gap (in number of elements) to fill.
tolerance	When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches.
level	Match simple Index on level of MultiIndex; otherwise select subset of.
copy	If True, always copy underlying data even if new index is equivalent to old index; if False, do not copy the data when the indexes are equivalent.
reindex function arguments
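To make a few of these arguments concrete, here is a minimal sketch that exercises fill_value, method='bfill', and limit. The Series named colors and wide are illustrative stand-ins, not objects from the examples above.

```python
import pandas as pd

# Sparse, ordered index: only the labels 0, 2 and 4 exist.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# fill_value replaces the NaN that reindex would otherwise introduce.
filled = colors.reindex(range(6), fill_value='missing')

# bfill pulls the next valid observation backward; label 5 has nothing after it.
backfilled = colors.reindex(range(6), method='bfill')

# limit caps how many consecutive new labels a single fill may cover.
wide = pd.Series([1.0, 2.0], index=[0, 4])
limited = wide.reindex(range(6), method='ffill', limit=1)

print(filled.tolist())
print(backfilled.tolist())
print(limited.tolist())
```

With limit=1, only the first new label after each existing one is filled, so longer gaps keep their missing values.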

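You can reindex more succinctly by label-indexing with loc, and many users prefer to use it exclusively. Note that this relies on older pandas behavior: since pandas 1.0, passing loc a list that contains labels missing from the index raises a KeyError, so the output below is reproducible only on earlier versions: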

Input[14]: frame.loc[['a', 'b', 'c', 'd'], states]
Output:
        Texas   Utah    California
a       1.0     NaN     2.0
b       NaN     NaN     NaN
c       4.0     NaN     5.0
d       7.0     NaN     8.0
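Because loc's treatment of missing labels depends on the pandas version, a defensive sketch might try loc first and fall back to reindex. This assumes that recent releases raise KeyError for the absent labels 'b' and 'Utah'. The variable name result is only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# 'b' and 'Utah' are absent from frame, so loc may refuse the lookup.
try:
    result = frame.loc[['a', 'b', 'c', 'd'], states]
except KeyError as exc:
    print('loc raised KeyError:', exc)
    # reindex accepts unknown labels and fills them with NaN.
    result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

print(result)
```

Either branch produces the same table, which is why reindex is the safer choice when some labels may be absent.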

Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

Input: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
Input: obj
Output:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
Input: new_obj = obj.drop('c')
Input: new_obj
Output:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
Input: obj.drop(['d', 'c'])
Output:
a 0.0
b 1.0
e 4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

Input: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
Input: data
Output:
  	       one	two	three	four
Ohio	        0	1	2	3
Colorado	4	5	6	7
Utah	        8	9	10	11
New York	12	13	14	15

Calling drop with a sequence of labels will drop values from the row labels (axis 0):

Input: data.drop(['Colorado', 'Ohio'])
Output:
 	        one	two	three	four
Utah	         8	9	10	11
New York	12	13	14	15

You can drop values from the columns by passing axis=1 or axis=’columns’:

Input: data.drop('two', axis=1)
Output:
	        one	three	four
Ohio	        0	2	3
Colorado	4	6	7
Utah	        8	10	11
New York	12	14	15

Input: data.drop(['two', 'four'], axis='columns')
Output:
	        one	three
Ohio	        0	2
Colorado	4	6
Utah	        8	10
New York	12	14

Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object:

Input: obj.drop('c', inplace=True)
Input: obj
Output:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Be careful with inplace, as it destroys any data that is dropped.
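As a small sketch of the difference, the following compares the default copying behavior of drop with inplace=True. Keeping a copy first (the name backup is just for illustration) is one way to make an in-place drop recoverable.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: drop returns a new Series and leaves obj untouched.
trimmed = obj.drop('c')
print(len(obj), len(trimmed))

# Keeping a copy first makes an in-place drop recoverable.
backup = obj.copy()
obj.drop('c', inplace=True)
print('c' in obj.index, 'c' in backup.index)
```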

Indexing, Selection, and Filtering

Series indexing (obj[…]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

Input: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
Input: obj
Output:
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64
Input: obj['b']
Output: 1.0

Input: obj[1]
Output: 1.0

Input: obj[2:4]
Output:
c    2.0
d    3.0
dtype: float64

Input: obj[['b', 'a', 'd']]
Output:
b    1.0
a    0.0
d    3.0
dtype: float64

Input: obj[[1, 3]]
Output:
b    1.0
d    3.0
dtype: float64

Input: obj[obj < 2]
Output:
a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:

Input: obj['b':'c']
Output:
b    1.0
c    2.0
dtype: float64

Setting using these methods modifies the corresponding section of the Series:

Input: obj['b':'c'] = 5
Input: obj
Output:
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
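The contrast between positional and label-based slicing can be sketched as follows. Here iloc and loc are used to make the intent explicit rather than relying on plain obj[...] indexing.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = obj.iloc[1:2]   # half-open, like a Python list: only 'b'
by_label = obj.loc['b':'c']   # closed interval: both 'b' and 'c'

print(by_position.index.tolist())
print(by_label.index.tolist())

# Assignment through a label slice writes to every label in the closed range.
obj.loc['b':'c'] = 5
print(obj.tolist())
```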

Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:

Input: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
              index=['Ohio', 'Colorado', 'Utah', 'New York'],
              columns=['one', 'two', 'three', 'four'])
Input: data
Output:
	        one	two	three	four
Ohio	         0	1	2	3
Colorado	 4	5	6	7
Utah	         8	9	10	11
New York        12	13	14	15

Input: data['two']
Output:
Ohio        1
Colorado    5
Utah        9
New York    13
Name: two, dtype: int64

Input: data[['three', 'one']]
Output:
	       three	one
Ohio	        2	0
Colorado	6	4
Utah       	10	8
New York	14	12


Indexing like this has a few special cases. First, slicing or selecting data with a boolean array:

Input: data[:2]
Output:
               one	two	three	four
Ohio      	0	1	2	3
Colorado	4	5	6	7

Input: data[data['three'] > 5]
Output:
	       one	two	three	four
Colorado	4	5	6	7
Utah	        8	9	10	11
New York	12	13	14	15

The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.

Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:

Input: data < 5
Output:
	        one	two	three	four
Ohio     	True	True	True	True
Colorado	True	False	False	False
Utah	        False	False	False	False
New York	False	False	False	False

Input: data[data < 5] = 0
Input: data
Output:
	       one	two	three	four
Ohio	        0	0	0	0
Colorado	0	5	6	7
Utah	        8	9	10	11
New York	12	13	14	15

This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.

Selection with loc and iloc

For DataFrame label-indexing on the rows, I introduce the special indexing operators loc and iloc. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).

As a preliminary example, let’s select a single row and multiple columns by label:

Input: data.loc['Colorado', ['two', 'three']]
Output:
two     5
three   6
Name: Colorado, dtype: int64

We’ll then perform some similar selections with integers using iloc:

Input: data.iloc[2, [3, 0, 1]]
Output:
four    11
one      8
two      9
Name: Utah, dtype: int64

Input: data.iloc[2]
Output:
one     8
two     9
three   10
four    11
Name: Utah, dtype: int64

Input: data.iloc[[1, 2], [3, 0, 1]]
Output:
	       four	one	two
Colorado	7	0	5
Utah     	11	8	9

Both indexing functions work with slices in addition to single labels or lists of labels:

Input: data.loc[:'Utah', 'two']
Output:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

Input: data.iloc[:, :3][data.three > 5]
Output:
               one	two	three
Colorado	0	5	6
Utah           	8	9	10
New York	12	13	14

So there are many ways to select and rearrange the data contained in a pandas object.

Type	Notes
df[val]	Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
df.loc[val]	Selects single row or subset of rows from the DataFrame by label
df.loc[:, val]	Selects single column or subset of columns by label
df.loc[val1, val2]	Select both rows and columns by label
df.iloc[where]	Selects single row or subset of rows from the DataFrame by integer position
df.iloc[:, where]	Selects single column or subset of columns by integer position
df.iloc[where_i, where_j]	Select both rows and columns by integer position
df.at[label_i, label_j]	Select a single scalar value by row and column label
df.iat[i, j]	Select a single scalar value by row and column position (integers)
reindex method	Select either rows or columns by labels
get_value, set_value methods	Select single value by row and column label
Indexing options with DataFrame
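Several of these options can be tried side by side. The sketch below rebuilds the unmodified data frame from earlier in this section and selects the same rows and the same cell by label and by position. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

col = data['two']                             # single column -> Series
rows_by_label = data.loc[['Ohio', 'Utah']]    # rows by label
rows_by_pos = data.iloc[[0, 2]]               # the same rows by position
cell_by_label = data.at['Utah', 'three']      # scalar by label
cell_by_pos = data.iat[2, 2]                  # scalar by position

print(rows_by_label.equals(rows_by_pos))
print(cell_by_label, cell_by_pos)
```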

Integer Indexes

Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:

Input: ser = pd.Series(np.arange(3.))
Input: ser
Input: ser[-1]

In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult:

Input: ser
Output:
0   0.0
1   1.0
2   2.0
dtype: float64

On the other hand, with a non-integer index, there is no potential for ambiguity:

Input: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
Input: ser2[-1]
Output: 2.0

To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):

Input: ser[:1]
Output:
0   0.0
dtype: float64

Input: ser.loc[:1]
Output:
0   0.0
1   1.0
dtype: float64

Input: ser.iloc[:1]
Output:
0   0.0
dtype: float64
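A short sketch of the pitfall: with an integer index, ser[-1] is treated as a lookup of the label -1, which does not exist, whereas iloc always indexes by position. The variable outcome is only there to record what happened.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # default integer index 0, 1, 2

try:
    ser[-1]                      # label lookup; -1 is not a label here
    outcome = 'returned a value'
except KeyError:
    outcome = 'KeyError'

last = ser.iloc[-1]              # explicit positional access
print(outcome, last)
```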

Arithmetic and Data Alignment

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let’s look at an example:

Input: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
Input: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
                   index=['a', 'c', 'e', 'f', 'g'])
Input: s1
Output:
a   7.3
c   -2.5
d   3.4
e   1.5
dtype: float64

Input: s2
Output:
a   -2.1
c   3.6
e   -1.5
f   4.0
g   3.1
dtype: float64

Adding these together yields:

Input: s1 + s2
Output:
a   5.2
c   1.1
d   NaN
e   0.0
f   NaN
g   NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations. In the case of DataFrame, alignment is performed on both the rows and the columns:

Input: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                      index=['Ohio', 'Texas', 'Colorado'])
Input: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                       index=['Utah', 'Ohio', 'Texas', 'Oregon'])
Input: df1
Output:
                 b	c	d
Ohio	        0.0	1.0	2.0
Texas	        3.0	4.0	5.0
Colorado	6.0	7.0	8.0

Input: df2
Output:
	 b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:


Input: df1 + df2
Output:
	         b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	        3.0	NaN	6.0	NaN
Oregon	        NaN	NaN	NaN	NaN
Texas	        9.0	NaN	12.0	NaN
Utah	        NaN	NaN	NaN	NaN

Since the ‘c’ and ‘e’ columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:

Input: df1 = pd.DataFrame({'A': [1, 2]})
Input: df2 = pd.DataFrame({'B': [3, 4]})
Input: df1
Output:
        A
0	1
1	2

Input: df2
Output:
	B
0	3
1	4

Input: df1 - df2
Output:
        A	B
0	NaN	NaN
1	NaN	NaN

Arithmetic methods with fill values

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:

Input: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                      columns=list('abcd'))
Input: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                      columns=list('abcde'))
Input: df2.loc[1, 'b'] = np.nan
Input: df1
Output:
        a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	5.0	6.0	7.0
2	8.0	9.0	10.0	11.0

Input: df2
Output:
	a	b	c	d	e
0	0.0	1.0	2.0	3.0	4.0
1	5.0	NaN	7.0	8.0	9.0
2	10.0	11.0	12.0	13.0	14.0
3	15.0	16.0	17.0	18.0	19.0

Adding these together results in NA values in the locations that don’t overlap:

Input: df1 + df2
Output:
	a	b	c	d	e
0	0.0	2.0	4.0	6.0	NaN
1	9.0	NaN	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN

Using the add method on df1, I pass df2 and an argument to fill_value:

Input: df1.add(df2, fill_value=0)
Output:
	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	5.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0

See the table of flexible arithmetic methods below for a listing of Series and DataFrame methods for arithmetic. Each of them has a counterpart, starting with the letter r, that has arguments flipped. So these two statements are equivalent:

Input: 1 / df1
Output:
        a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250	0.200000	0.166667	0.142857
2	0.125	0.111111	0.100000	0.090909

Input: df1.rdiv(1)
Output:
        a	b	c	d
0	inf	1.000000	0.500000	0.333333
1	0.250	0.200000	0.166667	0.142857
2	0.125	0.111111	0.100000	0.090909

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:

Input: df1.reindex(columns=df2.columns, fill_value=0)
Output:
        a	b	c	d	e
0	0.0	1.0	2.0	3.0	0
1	4.0	5.0	6.0	7.0	0
2	8.0	9.0	10.0	11.0	0

Method	Description
add, radd	Methods for addition (+)
sub, rsub	Methods for subtraction (-)
div, rdiv	Methods for division (/)
floordiv, rfloordiv	Methods for floor division (//)
mul, rmul	Methods for multiplication (*)
pow, rpow	Methods for exponentiation (**)
Flexible arithmetic methods
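The following sketch counts the missing values produced by the plain operator and by add with fill_value, and checks one of the r-prefixed methods against its operator form. For simplicity it uses df1 and df2 without the NaN assignment, and rsub is picked arbitrarily from the table.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

plain = df1 + df2                      # NaN wherever a label is one-sided
filled = df1.add(df2, fill_value=0)    # one-sided cells are treated as 0

n_missing_plain = int(plain.isna().sum().sum())
n_missing_filled = int(filled.isna().sum().sum())
print(n_missing_plain, n_missing_filled)

# r-prefixed methods swap the operands: df1.rsub(10) computes 10 - df1.
same = bool((df1.rsub(10) == 10 - df1).all().all())
print(same)
```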

Operations between DataFrame and Series

As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:

Input: arr = np.arange(12.).reshape((3, 4))
Input: arr
Output:
array([[ 0., 1., 2., 3.],
       [ 4., 5., 6., 7.],
       [ 8., 9., 10., 11.]])

Input: arr[0]
Output: array([ 0., 1., 2., 3.])

Input: arr - arr[0]
Output:
array([[ 0., 0., 0., 0.],
       [ 4., 4., 4., 4.],
       [ 8., 8., 8., 8.]])

When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting and is explained in more detail as it relates to general NumPy arrays in Appendix A. Operations between a DataFrame and a Series are similar:

Input: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                          columns=list('bde'),
                          index=['Utah', 'Ohio', 'Texas', 'Oregon'])
Input: series = frame.iloc[0]
Input: frame
Output:
        b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

Input: series
Output:
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows:

Input: frame - series
Output:
	b	d	e
Utah	0.0	0.0	0.0
Ohio	3.0	3.0	3.0
Texas	6.0	6.0	6.0
Oregon	9.0	9.0	9.0

If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:

Input: series2 = pd.Series(range(3), index=['b', 'e', 'f'])
Input: frame + series2
Output:
        b	d	e	f
Utah	0.0	NaN	3.0	NaN
Ohio	3.0	NaN	6.0	NaN
Texas	6.0	NaN	9.0	NaN
Oregon	9.0	NaN	12.0	NaN

If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:

Input: series3 = frame['d']
Input: frame
Output:
        b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

Input: series3
Output:
Utah    1.0
Ohio    4.0
Texas   7.0
Oregon  10.0
Name: d, dtype: float64

Input: frame.sub(series3, axis='index')
Output:
	b	d	e
Utah	-1.0	0.0	1.0
Ohio	-1.0	0.0	1.0
Texas	-1.0	0.0	1.0
Oregon	-1.0	0.0	1.0

The axis that you pass is the axis to match on. In this case we mean to match on the DataFrame's row index (axis='index' or axis=0) and broadcast across the columns.
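A compact sketch of the two matching modes, plus what happens if the axis is forgotten. The names row, col, and mismatched are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row = frame.iloc[0]   # Series indexed by column labels
col = frame['d']      # Series indexed by row labels

by_row = frame - row                    # default: match on columns
by_col = frame.sub(col, axis='index')   # match on the row index instead

# Forgetting axis aligns row labels against column labels: nothing matches,
# so the result is the union of both label sets, filled with NaN.
mismatched = frame - col
print(mismatched.shape)
```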

Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

Input: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                           index=['Utah', 'Ohio', 'Texas', 'Oregon']) 
Input: frame
Output:
                b	        d	       e
Utah	-0.823828	-2.044561	1.198334
Ohio	-1.445452	-0.085084	1.196690
Texas	-1.878601	-0.895981	-1.271459
Oregon	-0.697214	-0.067048	-0.280121

Input: np.abs(frame)
Output:
	       b	       d	       e
Utah	0.823828	2.044561	1.198334
Ohio	1.445452	0.085084	1.196690
Texas	1.878601	0.895981	1.271459
Oregon	0.697214	0.067048	0.280121

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:


Input: f = lambda x: x.max() - x.min()
Input: frame.apply(f)
Output:
b    1.181387
d    1.977513
e    2.469793
dtype: float64

Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series having the columns of frame as its index.

If you pass axis=’columns’ to apply, the function will be invoked once per row instead:


Input: frame.apply(f, axis='columns')
Output:
Utah    3.242895
Ohio    2.642142
Texas   0.982620
Oregon  0.630166
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.

The function passed to apply need not return a scalar value; it can also return a Series with multiple values:


Input: def f(x):
           return pd.Series([x.min(), x.max()], index=['min', 'max'])
Input: frame.apply(f)
Output:
                b	        d	       e
min	-1.878601	-2.044561	-1.271459
max	-0.697214	-0.067048	1.198334

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap:

Input: format = lambda x: '%.2f' % x     
Input: frame.applymap(format)
Output:
	 b	d	e
Utah	-0.82	-2.04	1.20
Ohio	-1.45	-0.09	1.20
Texas	-1.88	-0.90	-1.27
Oregon	-0.70	-0.07	-0.28

The reason for the name applymap is that Series has a map method for applying an element-wise function:

Input: frame['e'].map(format)
Output:
Utah     1.20
Ohio     1.20
Texas   -1.27
Oregon  -0.28
Name: e, dtype: object
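The three styles of function application can be compared in one sketch. Two assumptions are made here. First, deterministic values replace the random data so the results are reproducible. Second, newer pandas releases expose element-wise application as DataFrame.map, so the code falls back to applymap only when map is unavailable. The helper names spread and fmt are illustrative, and fmt avoids shadowing Python's built-in format.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used in the text.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)) / 7,
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def spread(x):
    """Difference between the largest and smallest value of a Series."""
    return x.max() - x.min()

def fmt(x):
    """Format a float with two decimal places."""
    return '%.2f' % x

per_column = frame.apply(spread)                 # one result per column
per_row = frame.apply(spread, axis='columns')    # one result per row

# Prefer DataFrame.map when it exists; otherwise use applymap.
elementwise = getattr(frame, 'map', None) or frame.applymap
formatted = elementwise(fmt)

one_column = frame['e'].map(fmt)                 # Series.map is element-wise
print(per_column)
print(per_row)
print(formatted)
```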

Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

Input: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
Input: obj.sort_index()
Output:
a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame, you can sort by index on either axis:

Input: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                           index=['three', 'one'],
                           columns=['d', 'a', 'b', 'c'])
Input: frame.sort_index()
Output:
	d	a	b	c
one	4	5	6	7
three	0	1	2	3

Input: frame.sort_index(axis=1)
Output:
        a	b	c	d
three	1	2	3	0
one	5	6	7	4

The data is sorted in ascending order by default, but can be sorted in descending order, too:

Input: frame.sort_index(axis=1, ascending=False)
Output:
	d	c	b	a
three	0	3	2	1
one	4	7	6	5

To sort a Series by its values, use its sort_values method:

Input: obj = pd.Series([4, 7, -3, 2])
Input: obj.sort_values()
Output:
2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default:

Input: obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
Input: obj.sort_values()
Output:
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:

Input: frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
Input: frame
Output:
        b	a
0	4	0
1	7	1
2	-3	0
3	2	1

Input: frame.sort_values(by='b')
Output:
	b	a
2	-3	0
3	2	1
0	4	0
1	7	1

To sort by multiple columns, pass a list of names:


Input: frame.sort_values(by=['a', 'b'])
Output:
	b	a
2	-3	0
0	4	0
3	2	1
1	7	1
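Two further options of sort_values are worth a quick sketch: ascending accepts one flag per sort key, and na_position controls where missing values land. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each 'a' group.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)

# Missing values go last by default; na_position='first' moves them to the top.
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
nan_first = obj.sort_values(na_position='first')
print(nan_first)
```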

Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:

Input: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
Input: obj.rank()
Output:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0 
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which they’re observed in the data:

Input: obj.rank(method='first')
Output:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

Here, rather than receiving the average rank 6.5, the entries at labels 0 and 2 have been set to 6 and 7, because label 0 precedes label 2 in the data. You can rank in descending order, too:


# Assign tie values the maximum rank in the group

Input: obj.rank(ascending=False, method='max')
Output:
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

See the table below for a list of the tie-breaking methods available.

Method	Description
'average'	Default: assign the average rank to each entry in the equal group
'min'	Use the minimum rank for the whole group
'max'	Use the maximum rank for the whole group
'first'	Assign ranks in the order the values appear in the data
'dense'	Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group
Table:- Tie-breaking methods with rank
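To compare the tie-breaking methods directly, this sketch ranks the same Series with each method and collects the results as columns of one DataFrame. The name comparison is illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column of ranks per tie-breaking method, aligned on obj's index.
comparison = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(comparison)
```

The rows for labels 0 and 2 (both holding 7) and for labels 3 and 6 (both holding 4) are where the columns differ.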

DataFrame can compute ranks over the rows or the columns:

Input: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                                                        'c': [-2, 5, 8, -2.5]})
Input: frame
Output:
         b	a	c
0	4.3	0	-2.0
1	7.0	1	5.0
2	-3.0	0	8.0
3	2.0	1	-2.5

Input: frame.rank(axis='columns')
Output:
         b	a	c
0	3.0	2.0	1.0
1	3.0	1.0	2.0
2	1.0	2.0	3.0
3	3.0	2.0	1.0

Axis Indexes with Duplicate Labels

Up until now all of the examples we’ve looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

Input: obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
Input: obj
Output:
a    0
a    1
b    2
b    3
c    4
dtype: int64

The index’s is_unique property can tell you whether its labels are unique or not:

Input: obj.index.is_unique
Output: False

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:

Input: obj['a']
Output:
a    0
a    1
dtype: int64

Input: obj['c']
Output: 4

This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not.
The same logic extends to indexing rows in a DataFrame:

Input: df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
Input: df
Output:
               0	      1	               2
a	0.953704	2.376202	-0.158151
a	0.173216	1.101896	1.869667
b	1.181897	-1.909708	1.341146
b	0.229768	-0.683772	1.506388

Input: df.loc['b']
Output:
               0         	1	      2
b	1.181897	-1.909708	1.341146
b	0.229768	-0.683772	1.506388
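One way to cope with the varying return type is sketched below. Passing a list of labels always yields a Series, and Index.duplicated can be used to keep only the first entry per label. Whether dropping the later duplicates is appropriate depends on your data, so treat the deduped step as an assumption rather than a recommendation.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj.loc['a']         # label occurs twice -> Series
single = obj.loc['c']           # label occurs once  -> scalar
always_series = obj.loc[['c']]  # list of labels     -> always a Series

# Boolean mask that is True for every repeat after the first occurrence.
is_repeat = obj.index.duplicated(keep='first')
deduped = obj[~is_repeat]

print(type(repeated).__name__, type(always_series).__name__)
print(deduped)
```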
