Essential Functionality
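The examples throughout this section assume that NumPy and pandas have been imported under their usual aliases:
Input: import numpy as np
Input: import pandas as pd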
Reindexing
An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example:
Input[1]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
Input[2]: obj
Output:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:
Input[3]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
Input[4]: obj2
Output:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values:
Input[5]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
Input[6]: obj3
Output:
0 blue
2 purple
4 yellow
dtype: object
Input[7]: obj3.reindex(range(6), method='ffill')
Output:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
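For comparison, method='bfill' fills backward from the next valid observation instead; here is a small sketch with the same obj3 (the trailing label 5 has nothing after it, so it stays missing):
Input: obj3.reindex(range(6), method='bfill')
Output:
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object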
With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:
Input[8]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California'])
Input[9]: frame
Output:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
Input[10]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
Input[11]: frame2
Output:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
The columns can be reindexed with the columns keyword:
Input[12]: states = ['Texas', 'Utah', 'California']
Input[13]: frame.reindex(columns=states)
Output:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
See the table below for more about the arguments to reindex.
Argument | Description |
index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying. |
method | Interpolation (fill) method; ‘ffill’ fills forward, while ‘bfill’ fills backward. |
fill_value | Substitute value to use when introducing missing data by reindexing. |
limit | When forward- or backfilling, maximum size gap (in number of elements) to fill. |
tolerance | When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches. |
level | Match simple Index on level of MultiIndex; otherwise select a subset of the data. |
copy | If True, always copy underlying data even if new index is equivalent to old index; if False, do not copy the data when the indexes are equivalent. |
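As a quick sketch of fill_value from the table (reusing obj from the start of this section), passing fill_value=0 substitutes 0 rather than NaN for the new 'e' label:
Input: obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
Output:
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0
dtype: float64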
You can reindex more succinctly by label-indexing with loc, and many users prefer to use it exclusively (note, though, that newer pandas versions raise a KeyError when loc is passed labels missing from the index, so reindex remains the way to introduce new labels there):
Input[14]: frame.loc[['a', 'b', 'c', 'd'], states]
Output:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:
Input: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
Input: obj
Output:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
Input: new_obj = obj.drop('c')
Input: new_obj
Output:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
Input: obj.drop(['d', 'c'])
Output:
a 0.0
b 1.0
e 4.0
dtype: float64
With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:
Input: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
Input: data
Output:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Calling drop with a sequence of labels will drop values from the row labels (axis 0):
Input: data.drop(['Colorado', 'Ohio'])
Output:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
You can drop values from the columns by passing axis=1 or axis='columns':
Input: data.drop('two', axis=1)
Output:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
Input: data.drop(['two', 'four'], axis='columns')
Output:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object:
Input: obj.drop('c', inplace=True)
Input: obj
Output:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
Be careful with inplace, as it destroys any data that is dropped.
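If you want to keep the original object intact, a safer pattern is to skip inplace and bind the result to a new name; a minimal sketch continuing with the obj above (which now lacks 'c'):
Input: obj2 = obj.drop('d')   # returns a new Series; obj is unchanged
Input: obj
Output:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64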
Indexing, Selection, and Filtering
Series indexing (obj[…]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:
Input: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
Input: obj
Output:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
Input: obj['b']
Output: 1.0
Input: obj[1]
Output: 1.0
Input: obj[2:4]
Output:
c 2.0
d 3.0
dtype: float64
Input: obj[['b', 'a', 'd']]
Output:
b 1.0
a 0.0
d 3.0
dtype: float64
Input: obj[[1, 3]]
Output:
b 1.0
d 3.0
dtype: float64
Input: obj[obj < 2]
Output:
a 0.0
b 1.0
dtype: float64
Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:
Input: obj['b':'c']
Output:
b 1.0
c 2.0
dtype: float64
Setting using these methods modifies the corresponding section of the Series:
Input: obj['b':'c'] = 5
Input: obj
Output:
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:
Input: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
Input: data
Output:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Input: data['two']
Output:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int64
Input: data[['three', 'one']]
Output:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
Indexing like this has a few special cases. First, slicing or selecting data with a boolean array:
Input: data[:2]
Output:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Input: data[data['three'] > 5]
Output:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.
Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:
Input: data < 5
Output:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
Input: data[data < 5] = 0
Input: data
Output:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.
Selection with loc and iloc
For DataFrame label-indexing on the rows, I introduce the special indexing operators loc and iloc. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).
As a preliminary example, let’s select a single row and multiple columns by label:
Input: data.loc['Colorado', ['two', 'three']]
Output:
two 5
three 6
Name: Colorado, dtype: int64
We’ll then perform some similar selections with integers using iloc:
Input: data.iloc[2, [3, 0, 1]]
Output:
four 11
one 8
two 9
Name: Utah, dtype: int64
Input: data.iloc[2]
Output:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int64
Input: data.iloc[[1, 2], [3, 0, 1]]
Output:
four one two
Colorado 7 0 5
Utah 11 8 9
Both indexing functions work with slices in addition to single labels or lists of labels:
Input: data.loc[:'Utah', 'two']
Output:
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int64
Input: data.iloc[:, :3][data.three > 5]
Output:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
So there are many ways to select and rearrange the data contained in a pandas object. The table below gives a short summary of many of them.
Type | Notes |
df[val] | Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion) |
df.loc[val] | Selects single row or subset of rows from the DataFrame by label |
df.loc[:, val] | Selects single column or subset of columns by label |
df.loc[val1, val2] | Select both rows and columns by label |
df.iloc[where] | Selects single row or subset of rows from the DataFrame by integer position |
df.iloc[:, where] | Selects single column or subset of columns by integer position |
df.iloc[where_i, where_j] | Select both rows and columns by integer position |
df.at[label_i, label_j] | Select a single scalar value by row and column label |
df.iat[i, j] | Select a single scalar value by row and column position (integers) |
reindex method | Select either rows or columns by labels |
get_value, set_value methods | Select single value by row and column label |
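As a quick sketch of the scalar accessors at and iat from the table (using the data DataFrame from above, in which values below 5 were set to 0):
Input: data.at['Utah', 'two']
Output: 9
Input: data.iat[2, 1]
Output: 9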
Integer Indexes
Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:
Input: ser = pd.Series(np.arange(3.))
Input: ser
Input: ser[-1]
In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult:
Input: ser
Output:
0 0.0
1 1.0
2 2.0
dtype: float64
On the other hand, with a non-integer index, there is no potential for ambiguity:
Input: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
Input: ser2[-1]
Output: 2.0
To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):
Input: ser[:1]
Output:
0 0.0
dtype: float64
Input: ser.loc[:1]
Output:
0 0.0
1 1.0
dtype: float64
Input: ser.iloc[:1]
Output:
0 0.0
dtype: float64
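Likewise, iloc gives an unambiguous way to select by position from the end, which is what ser[-1] could not do above; a small sketch:
Input: ser.iloc[-1]
Output: 2.0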
Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let’s look at an example:
Input: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
Input: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
index=['a', 'c', 'e', 'f', 'g'])
Input: s1
Output:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
Input: s2
Output:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
Adding these together yields:
Input: s1 + s2
Output:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations. In the case of DataFrame, alignment is performed on both the rows and the columns:
Input: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
Input: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
Input: df1
Output:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
Input: df2
Output:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:
Input: df1 + df2
Output:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
Since the ‘c’ and ‘e’ columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.
If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:
Input: df1 = pd.DataFrame({'A': [1, 2]})
Input: df2 = pd.DataFrame({'B': [3, 4]})
Input: df1
Output:
A
0 1
1 2
Input: df2
Output:
B
0 3
1 4
Input: df1 - df2
Output:
A B
0 NaN NaN
1 NaN NaN
Arithmetic methods with fill values
In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:
Input: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
Input: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
Input: df2.loc[1, 'b'] = np.nan
Input: df1
Output:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
Input: df2
Output:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
Adding these together results in NA values in the locations that don’t overlap:
Input: df1 + df2
Output:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
Using the add method on df1, I pass df2 and an argument to fill_value:
Input: df1.add(df2, fill_value=0)
Output:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
See the table below for a listing of Series and DataFrame methods for arithmetic. Each of them has a counterpart, starting with the letter r, that has its arguments flipped. So these two statements are equivalent:
Input: 1 / df1
Output:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250 0.200000 0.166667 0.142857
2 0.125 0.111111 0.100000 0.090909
Input: df1.rdiv(1)
Output:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250 0.200000 0.166667 0.142857
2 0.125 0.111111 0.100000 0.090909
Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:
Input: df1.reindex(columns=df2.columns, fill_value=0)
Output:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
Method | Description |
add, radd | Methods for addition (+) |
sub, rsub | Methods for subtraction (-) |
div, rdiv | Methods for division (/) |
floordiv, rfloordiv | Methods for floor division (//) |
mul, rmul | Methods for multiplication (*) |
pow, rpow | Methods for exponentiation (**) |
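As a small sketch of one more method from the table, floordiv performs element-wise floor division (applied here to the df1 defined above):
Input: df1.floordiv(2)
Output:
a b c d
0 0.0 0.0 1.0 1.0
1 2.0 2.0 3.0 3.0
2 4.0 4.0 5.0 5.0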
Operations between DataFrame and Series
As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:
Input: arr = np.arange(12.).reshape((3, 4))
Input: arr
Output:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
Input: arr[0]
Output: array([ 0., 1., 2., 3.])
Input: arr - arr[0]
Output:
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting and is explained in more detail as it relates to general NumPy arrays in Appendix A. Operations between a DataFrame and a Series are similar:
Input: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
Input: series = frame.iloc[0]
Input: frame
Output:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
Input: series
Output:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows:
Input: frame - series
Output:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:
Input: series2 = pd.Series(range(3), index=['b', 'e', 'f'])
Input: frame + series2
Output:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:
Input: series3 = frame['d']
Input: frame
Output:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
Input: series3
Output:
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
Input: frame.sub(series3, axis='index')
Output:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
The axis number that you pass is the axis to match on. In this case we mean to match on the DataFrame's row index (axis='index' or axis=0) and broadcast across.
Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:
Input: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
Input: frame
Output:
b d e
Utah -0.823828 -2.044561 1.198334
Ohio -1.445452 -0.085084 1.196690
Texas -1.878601 -0.895981 -1.271459
Oregon -0.697214 -0.067048 -0.280121
Input: np.abs(frame)
Output:
b d e
Utah 0.823828 2.044561 1.198334
Ohio 1.445452 0.085084 1.196690
Texas 1.878601 0.895981 1.271459
Oregon 0.697214 0.067048 0.280121
Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:
Input: f = lambda x: x.max() - x.min()
Input: frame.apply(f)
Output:
b 1.181387
d 1.977513
e 2.469793
dtype: float64
Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series having the columns of frame as its index.
If you pass axis='columns' to apply, the function will be invoked once per row instead:
Input: frame.apply(f, axis='columns')
Output:
Utah 3.242895
Ohio 2.642142
Texas 0.982620
Oregon 0.630166
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
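For instance, the same column-wise range computed with apply above can be written with the built-in reductions directly; a brief sketch:
Input: frame.max() - frame.min()
Output:
b 1.181387
d 1.977513
e 2.469793
dtype: float64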
The function passed to apply need not return a scalar value; it can also return a Series with multiple values:
Input: def f(x):
           return pd.Series([x.min(), x.max()], index=['min', 'max'])
Input: frame.apply(f)
Output:
b d e
min -1.878601 -2.044561 -1.271459
max -0.697214 -0.067048 1.198334
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap:
Input: format = lambda x: '%.2f' % x
Input: frame.applymap(format)
Output:
b d e
Utah -0.82 -2.04 1.20
Ohio -1.45 -0.09 1.20
Texas -1.88 -0.90 -1.27
Oregon -0.70 -0.07 -0.28
The reason for the name applymap is that Series has a map method for applying an element-wise function:
Input: frame['e'].map(format)
Output:
Utah 1.20
Ohio 1.20
Texas -1.27
Oregon -0.28
Name: e, dtype: object
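One note for newer pandas versions: DataFrame.applymap has since been deprecated in favor of an element-wise DataFrame.map method, so frame.map(format) is the forward-compatible spelling; Series.map, as used here, is unchanged.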
Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
Input: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
Input: obj.sort_index()
Output:
a 1
b 2
c 3
d 0
dtype: int64
With a DataFrame, you can sort by index on either axis:
Input: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
Input: frame.sort_index()
Output:
d a b c
one 4 5 6 7
three 0 1 2 3
Input: frame.sort_index(axis=1)
Output:
a b c d
three 1 2 3 0
one 5 6 7 4
The data is sorted in ascending order by default, but can be sorted in descending order, too:
Input: frame.sort_index(axis=1, ascending=False)
Output:
d c b a
three 0 3 2 1
one 4 7 6 5
To sort a Series by its values, use its sort_values method:
Input: obj = pd.Series([4, 7, -3, 2])
Input: obj.sort_values()
Output:
2 -3
3 2
0 4
1 7
dtype: int64
Any missing values are sorted to the end of the Series by default:
Input: obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
Input: obj.sort_values()
Output:
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:
Input: frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
Input: frame
Output:
b a
0 4 0
1 7 1
2 -3 0
3 2 1
Input: frame.sort_values(by='b')
Output:
b a
2 -3 0
3 2 1
0 4 0
1 7 1
To sort by multiple columns, pass a list of names:
Input: frame.sort_values(by=['a', 'b'])
Output:
b a
2 -3 0
0 4 0
3 2 1
1 7 1
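sort_values also accepts ascending=False when you want the largest values first; a quick sketch with the same frame:
Input: frame.sort_values(by='b', ascending=False)
Output:
b a
1 7 1
0 4 0
3 2 1
2 -3 0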
Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:
Input: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
Input: obj.rank()
Output:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
Ranks can also be assigned according to the order in which they’re observed in the data:
Input: obj.rank(method='first')
Output:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
Here, instead of using the average rank 6.5 for the entries 0 and 2, they instead have been set to 6 and 7 because label 0 precedes label 2 in the data. You can rank in descending order, too:
# Assign tie values the maximum rank in the group
Input: obj.rank(ascending=False, method='max')
Output:
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
See the table below for a list of the tie-breaking methods available.
Method | Description |
'average' | Default: assign the average rank to each entry in the equal group |
'min' | Use the minimum rank for the whole group |
'max' | Use the maximum rank for the whole group |
'first' | Assign ranks in the order the values appear in the data |
'dense' | Like method='min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group |
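To see how 'dense' differs from 'min', here is a small sketch with the same obj: the tied values still share a rank, but the next distinct value always gets the next rank rather than skipping ahead:
Input: obj.rank(method='dense')
Output:
0 5.0
1 1.0
2 5.0
3 4.0
4 3.0
5 2.0
6 4.0
dtype: float64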
DataFrame can compute ranks over the rows or the columns:
Input: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
Input: frame
Output:
b a c
0 4.3 0 -2.0
1 7.0 1 5.0
2 -3.0 0 8.0
3 2.0 1 -2.5
Input: frame.rank(axis='columns')
Output:
b a c
0 3.0 2.0 1.0
1 3.0 1.0 2.0
2 1.0 2.0 3.0
3 3.0 2.0 1.0
Axis Indexes with Duplicate Labels
Up until now all of the examples we’ve looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:
Input: obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
Input: obj
Output:
a 0
a 1
b 2
b 3
c 4
dtype: int64
The index’s is_unique property can tell you whether its labels are unique or not:
Input: obj.index.is_unique
Output: False
Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:
Input: obj['a']
Output:
a 0
a 1
dtype: int64
Input: obj['c']
Output: 4
This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not.
The same logic extends to indexing rows in a DataFrame:
Input: df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
Input: df
Output:
0 1 2
a 0.953704 2.376202 -0.158151
a 0.173216 1.101896 1.869667
b 1.181897 -1.909708 1.341146
b 0.229768 -0.683772 1.506388
Input: df.loc['b']
Output:
0 1 2
b 1.181897 -1.909708 1.341146
b 0.229768 -0.683772 1.506388