Abstract: Learning the pandas sorting methods is a good way to start or practice basic data analysis with Python. Data analysis is most commonly done with spreadsheets, SQL, or pandas. One of the major advantages of pandas is that it can handle large amounts of data and offers high-performance data manipulation capabilities.
This article is shared from Huawei Cloud Community " Pandas Sort: Your Python Data Sorting Guide ", author: Yuchuan.
In this tutorial, you will learn how to use .sort_values() and .sort_index(), which will enable you to effectively sort the data in the DataFrame.
At the end of this tutorial, you will know how to:
• Sort Pandas DataFrame by one or more column values
• Use the ascending parameter to change the sort order
• Sort a DataFrame by its index using .sort_index()
• Organize missing data when sorting values
• Sort a DataFrame in place by setting inplace to True
To follow this tutorial, you need a basic understanding of Pandas DataFrames and a certain understanding of reading data from files.
Getting started with pandas sorting methods
As a quick reminder, a DataFrame is a data structure with labeled axes for both rows and columns. You can sort a DataFrame by row or column values as well as by row or column index.
Both rows and columns have an index, which is a numeric representation of the position of the data in the DataFrame. You can use the index position of the DataFrame to retrieve data from a specific row or column. By default, the index number starts from zero. You can also manually assign your own index.
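As a minimal sketch of the default index and a manually assigned one, the snippet below builds a tiny hand-made DataFrame. The variable names and the labels "a", "b", and "c" are illustrative assumptions, not part of the EPA data set.

```python
import pandas as pd

# Toy data for illustration only (three rows, two columns).
cars = pd.DataFrame({"make": ["Alfa Romeo", "Ferrari", "Dodge"],
                     "city08": [19, 9, 23]})

# Without an explicit index, pandas numbers the rows starting from zero.
default_labels = list(cars.index)

# Retrieve a value by integer position with .iloc.
second_make = cars["make"].iloc[1]

# Assign your own row labels on a copy, leaving the original untouched.
relabeled = cars.copy()
relabeled.index = ["a", "b", "c"]  # hypothetical labels
custom_labels = list(relabeled.index)

print(default_labels, second_make, custom_labels)
```

Position-based lookup with .iloc keeps working after relabeling, while label-based lookup with .loc would then expect the new labels.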
Prepare the data set
In this tutorial, you will use the fuel economy data compiled by the U.S. Environmental Protection Agency (EPA) for vehicles manufactured between 1984 and 2021. The EPA fuel economy data set is great because it contains many different types of information that you can sort on, from text to numeric data types. The data set contains a total of eighty-three columns.
To continue, you need to install the pandas Python library. The code in this tutorial was executed using pandas 1.2.0 and Python 3.9.1.
Note: It may take a minute or two to read the entire fuel economy data set into memory. Limiting the number of rows and columns helps performance, but it still takes a few seconds to download the data.
For analysis purposes, you will view the MPG (miles per gallon) data of a vehicle by make, model, year, and other vehicle attributes. You can specify the columns to be read into the DataFrame. For this tutorial, you only need a subset of the available columns.
The following is the command to read the relevant columns of the fuel economy data set into the DataFrame and display the first five rows:
>>> import pandas as pd
>>> column_subset = [
... "id",
... "make",
... "model",
... "year",
... "cylinders",
... "fuelType",
... "trany",
... "mpgData",
... "city08",
... "highway08"
... ]
>>> df = pd.read_csv(
... "https://www.fueleconomy.gov/feg/epadata/vehicles.csv",
... usecols=column_subset,
... nrows=100
... )
>>> df.head()
city08 cylinders fuelType ... mpgData trany year
0 19 4 Regular ... Y Manual 5-spd 1985
1 9 12 Regular ... N Manual 5-spd 1985
2 23 4 Regular ... Y Manual 5-spd 1985
3 10 8 Regular ... N Automatic 3-spd 1985
4 17 4 Premium ... N Manual 5-spd 1993
[5 rows x 10 columns]
By calling .read_csv() with the data set URL, you load the data into a DataFrame. Limiting the columns results in faster load times and lower memory usage. To further limit memory consumption and get a quick feel for the data, you can use nrows to specify how many rows to load.
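If you want to try usecols and nrows without downloading anything, a rough offline sketch is shown here. It reads from an in-memory string instead of the EPA URL; the CSV text is a hand-typed stand-in with only a few columns, and the names csv_text and small_df are assumptions made for this example.

```python
import io

import pandas as pd

# Hand-typed stand-in for the much larger EPA file (illustration only).
csv_text = """make,model,year,cylinders,city08
Alfa Romeo,Spider Veloce 2000,1985,4,19
Ferrari,Testarossa,1985,12,9
Dodge,Charger,1985,4,23
Dodge,B150/B250 Wagon 2WD,1985,8,10
"""

# usecols keeps only the named columns; nrows stops after that many data rows.
small_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["make", "year", "city08"],
    nrows=3,
)

print(small_df.shape)
```

The same two arguments work unchanged when the first argument is a URL or a file path.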
Get familiar with .sort_values()
You use .sort_values() to sort the values in DataFrame along any axis (column or row). Generally, you want to sort the rows in a DataFrame by the values of one or more columns:
The figure above shows the result of using .sort_values() to sort the rows of the DataFrame according to the values in the highway08 column. This is similar to using columns to sort data in a spreadsheet.
Get familiar with .sort_index()
You use .sort_index() to sort a DataFrame by its row index or column labels. The difference from using .sort_values() is that you sort the DataFrame based on its row index or column names, not on the values in those rows or columns:
The row index of the DataFrame is marked in blue in the figure above. An index is not considered a column, and you typically have only a single row index. The row index can be thought of as a zero-based row number.
Sort DataFrame on a single column
To sort the DataFrame based on the values in a single column, you will use .sort_values(). By default, this will return a new DataFrame sorted in ascending order. It will not modify the original DataFrame.
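The point that .sort_values() returns a new object can be checked with a small sketch. The DataFrame below is toy data, and the names df_demo and result are chosen only for this illustration.

```python
import pandas as pd

# Toy data for illustration only.
df_demo = pd.DataFrame({"city08": [19, 9, 23, 10]})

# The call returns a new, sorted DataFrame...
result = df_demo.sort_values("city08")

# ...while the original keeps its row order.
print(result["city08"].tolist())
print(df_demo["city08"].tolist())
```

Note that the sorted copy keeps the original row labels, so its index is no longer in ascending order.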
Sort by column in ascending order
To use .sort_values(), pass a single argument containing the name of the column you want to sort by. In this example, you sort the DataFrame by the city08 column, which represents the city MPG for fuel-only cars:
>>> df.sort_values("city08")
city08 cylinders fuelType ... mpgData trany year
99 9 8 Premium ... N Automatic 4-spd 1993
1 9 12 Regular ... N Manual 5-spd 1985
80 9 8 Regular ... N Automatic 3-spd 1985
47 9 8 Regular ... N Automatic 3-spd 1985
3 10 8 Regular ... N Automatic 3-spd 1985
.. ... ... ... ... ... ... ...
9 23 4 Regular ... Y Automatic 4-spd 1993
8 23 4 Regular ... Y Manual 5-spd 1993
7 23 4 Regular ... Y Automatic 3-spd 1993
76 23 4 Regular ... Y Manual 5-spd 1993
2 23 4 Regular ... Y Manual 5-spd 1985
[100 rows x 10 columns]
This sorts your DataFrame using the column values in city08, showing the vehicles with the lowest MPG first. By default, .sort_values() sorts the data in ascending order. Although you did not specify a name for the argument you passed to .sort_values(), you actually used the by parameter, which you will see in the next example.
Change the sort order
Another parameter of .sort_values() is ascending. By default, .sort_values() has ascending set to True. If you want the DataFrame to be sorted in descending order, you can pass False to this parameter:
>>> df.sort_values(
... by="city08",
... ascending=False
... )
city08 cylinders fuelType ... mpgData trany year
9 23 4 Regular ... Y Automatic 4-spd 1993
2 23 4 Regular ... Y Manual 5-spd 1985
7 23 4 Regular ... Y Automatic 3-spd 1993
8 23 4 Regular ... Y Manual 5-spd 1993
76 23 4 Regular ... Y Manual 5-spd 1993
.. ... ... ... ... ... ... ...
58 10 8 Regular ... N Automatic 3-spd 1985
80 9 8 Regular ... N Automatic 3-spd 1985
1 9 12 Regular ... N Manual 5-spd 1985
47 9 8 Regular ... N Automatic 3-spd 1985
99 9 8 Premium ... N Automatic 4-spd 1993
[100 rows x 10 columns]
You can reverse the sort order by passing False to ascending. Now your DataFrame is sorted in descending order by the average MPG measured under city conditions. The vehicle with the highest MPG value is in the first row.
Select sorting algorithm
It is worth noting that pandas allows you to choose different sorting algorithms to use with both .sort_values() and .sort_index(). The available algorithms are quicksort, mergesort, and heapsort. For more information about these different sorting algorithms, check out Sorting Algorithms in Python.
The default algorithm used when sorting on a single column is quicksort. To change this to a stable sorting algorithm, use mergesort. You can do that with the kind parameter in either .sort_values() or .sort_index(), like this:
>>> df.sort_values(
... by="city08",
... ascending=False,
... kind="mergesort"
... )
city08 cylinders fuelType ... mpgData trany year
2 23 4 Regular ... Y Manual 5-spd 1985
7 23 4 Regular ... Y Automatic 3-spd 1993
8 23 4 Regular ... Y Manual 5-spd 1993
9 23 4 Regular ... Y Automatic 4-spd 1993
10 23 4 Regular ... Y Manual 5-spd 1993
.. ... ... ... ... ... ... ...
69 10 8 Regular ... N Automatic 3-spd 1985
1 9 12 Regular ... N Manual 5-spd 1985
47 9 8 Regular ... N Automatic 3-spd 1985
80 9 8 Regular ... N Automatic 3-spd 1985
99 9 8 Premium ... N Automatic 4-spd 1993
[100 rows x 10 columns]
With kind, you set the sorting algorithm to mergesort. The previous output used the default quicksort algorithm. Looking at the index values, you can see that the rows are in a different order. This is because quicksort is not a stable sorting algorithm, but mergesort is.
Note: In Pandas, kind will be ignored when you sort multiple columns or labels.
When you sort multiple records with the same key, a stable sorting algorithm will maintain the original order of these records after sorting. Therefore, if you plan to perform multiple sorts, you must use a stable sorting algorithm.
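One way to see why stability matters for repeated sorts is to sort in two passes, first by the secondary key and then by the primary key with a stable algorithm. The sketch below uses made-up names; the DataFrame and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up people data for illustration only.
people = pd.DataFrame({
    "last": ["Smith", "Jones", "Smith", "Jones"],
    "first": ["Zoe", "Bob", "Al", "Amy"],
})

# Pass 1: order by the secondary key.
by_first = people.sort_values("first", kind="mergesort")

# Pass 2: order by the primary key. Because mergesort is stable, rows with
# the same last name keep the first-name order established in pass 1.
two_pass = by_first.sort_values("last", kind="mergesort")

# For comparison: a single call that sorts on both keys at once.
one_pass = people.sort_values(["last", "first"])

print(two_pass)
```

With an unstable algorithm, the second pass would be free to shuffle rows that share a last name, so the two results would not be guaranteed to match.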
Sort DataFrame on multiple columns
In data analysis, you usually want to sort the data based on the values of multiple columns. Imagine you have a data set containing people's first and last names. It makes sense to sort by last name and then by first name, so that people with the same last name will be sorted alphabetically according to their first name.
In the first example, you sorted the DataFrame on a single column named city08. From an analytical point of view, MPG under urban conditions is an important factor in determining the popularity of cars. In addition to MPG in urban conditions, you may also want to view MPG in highway conditions. To sort by two keys, you can pass a list of column names to by:
>>> df.sort_values(
... by=["city08", "highway08"]
... )[["city08", "highway08"]]
city08 highway08
80 9 10
47 9 11
99 9 13
1 9 14
58 10 11
.. ... ...
9 23 30
10 23 30
8 23 31
76 23 31
2 23 33
[100 rows x 2 columns]
By specifying a list of column names city08 and highway08, you can use .sort_values() to sort the DataFrame on the two columns. The next example will explain how to specify the sort order and why it is important to pay attention to the list of column names you use.
Sort by multiple columns in ascending order
To sort a DataFrame on multiple columns, you must provide a list of column names. For example, to sort by make and model, you should create the following list and then pass it to .sort_values():
>>> df.sort_values(
... by=["make", "model"]
... )[["make", "model"]]
make model
0 Alfa Romeo Spider Veloce 2000
18 Audi 100
19 Audi 100
20 BMW 740i
21 BMW 740il
.. ... ...
12 Volkswagen Golf III / GTI
13 Volkswagen Jetta III
15 Volkswagen Jetta III
16 Volvo 240
17 Volvo 240
[100 rows x 2 columns]
Now your DataFrame is sorted in ascending order by make. Where two or more rows share the same make, they are sorted by model. The order in which the column names appear in the list determines how the DataFrame is sorted.
Change the column sort order
Since you use multiple columns for sorting, you can specify the sort order of the columns. If you want to change the logical sort order in the previous example, you can change the order of the column names in the list passed to the by parameter:
>>> df.sort_values(
... by=["model", "make"]
... )[["make", "model"]]
make model
18 Audi 100
19 Audi 100
16 Volvo 240
17 Volvo 240
75 Mazda 626
.. ... ...
62 Ford Thunderbird
63 Ford Thunderbird
88 Oldsmobile Toronado
42 CX Automotive XM v6
43 CX Automotive XM v6a
[100 rows x 2 columns]
Your DataFrame is now sorted in ascending order by model, and then by make wherever two or more rows share the same model. You can see that changing the order of the columns also changes the order in which the values are sorted.
Sort by multiple columns in descending order
So far, you have only sorted multiple columns in ascending order. In the next example, you will sort in descending order based on the make and model columns. To sort in descending order, set ascending to False:
>>> df.sort_values(
... by=["make", "model"],
... ascending=False
... )[["make", "model"]]
make model
16 Volvo 240
17 Volvo 240
13 Volkswagen Jetta III
15 Volkswagen Jetta III
11 Volkswagen Golf III / GTI
.. ... ...
21 BMW 740il
20 BMW 740i
18 Audi 100
19 Audi 100
0 Alfa Romeo Spider Veloce 2000
[100 rows x 2 columns]
The values in the make column are in reverse alphabetical order, and for cars with the same make, the values in the model column are in descending order. With text data, the sort is case-sensitive, which means that capitalized text appears first in ascending order and last in descending order.
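The case-sensitive behavior can be seen with a few mixed-case strings. The sketch below also shows one possible workaround that is not part of the original tutorial: the key parameter of .sort_values(), available in pandas 1.1.0 and later, which transforms the values before they are compared. The data and variable names are made up for illustration.

```python
import pandas as pd

# Made-up, mixed-case labels for illustration only.
names = pd.DataFrame({"model": ["jetta", "Golf", "audi", "Charger"]})

# Default: capitalized strings sort before lowercase ones.
default_order = names.sort_values("model")["model"].tolist()

# Workaround (not from the tutorial): compare lowercased values instead.
folded_order = names.sort_values(
    "model", key=lambda col: col.str.lower()
)["model"].tolist()

print(default_order)
print(folded_order)
```

The key function only affects the comparison; the values stored in the DataFrame keep their original capitalization.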
Sort by multiple columns with different sort orders
You may be wondering if you can use multiple columns for sorting and have these columns use different ascending parameters. With pandas, you can accomplish this with a single method call. If you want to sort some columns in ascending order, and sort some columns in descending order, you can pass a list of boolean values to ascending.
In this example, you arrange the data frame by make, model and city08 columns, with the first two columns in ascending order and city08 in descending order. To do this, you pass a list of column names to by and a list of boolean values to ascending:
>>> df.sort_values(
... by=["make", "model", "city08"],
... ascending=[True, True, False]
... )[["make", "model", "city08"]]
make model city08
0 Alfa Romeo Spider Veloce 2000 19
18 Audi 100 17
19 Audi 100 17
20 BMW 740i 14
21 BMW 740il 14
.. ... ... ...
11 Volkswagen Golf III / GTI 18
15 Volkswagen Jetta III 20
13 Volkswagen Jetta III 18
17 Volvo 240 19
16 Volvo 240 18
[100 rows x 3 columns]
Now your DataFrame is sorted by make and model in ascending order, but by city08 in descending order. This is useful because it groups the cars by make and model and shows the cars with the highest MPG first within each group.
Sort the DataFrame according to the index
Before sorting on the index, it helps to understand what the index represents. A DataFrame has an .index property, which by default is a numeric representation of its row positions. You can think of the index as the row numbers. It helps you look up and identify rows quickly.
Sort by index in ascending order
You can sort a DataFrame by its row index with .sort_index(). Sorting by column values, as in the previous examples, reorders the rows in the DataFrame, so the index becomes disorganized. This can also happen when you filter a DataFrame or when you drop or add rows.
To illustrate the use of .sort_index(), start by creating a new sorted DataFrame using .sort_values():
>>> sorted_df = df.sort_values(by=["make", "model"])
>>> sorted_df
city08 cylinders fuelType ... mpgData trany year
0 19 4 Regular ... Y Manual 5-spd 1985
18 17 6 Premium ... Y Automatic 4-spd 1993
19 17 6 Premium ... N Manual 5-spd 1993
20 14 8 Premium ... N Automatic 5-spd 1993
21 14 8 Premium ... N Automatic 5-spd 1993
.. ... ... ... ... ... ... ...
12 21 4 Regular ... Y Manual 5-spd 1993
13 18 4 Regular ... N Automatic 4-spd 1993
15 20 4 Regular ... N Manual 5-spd 1993
16 18 4 Regular ... Y Automatic 4-spd 1993
17 19 4 Regular ... Y Manual 5-spd 1993
[100 rows x 10 columns]
You have created a DataFrame sorted by multiple values. Note how the row indexes are in no particular order. To restore the new DataFrame to its original order, you can use .sort_index():
>>> sorted_df.sort_index()
city08 cylinders fuelType ... mpgData trany year
0 19 4 Regular ... Y Manual 5-spd 1985
1 9 12 Regular ... N Manual 5-spd 1985
2 23 4 Regular ... Y Manual 5-spd 1985
3 10 8 Regular ... N Automatic 3-spd 1985
4 17 4 Premium ... N Manual 5-spd 1993
.. ... ... ... ... ... ... ...
95 17 6 Regular ... Y Automatic 3-spd 1993
96 17 6 Regular ... N Automatic 4-spd 1993
97 15 6 Regular ... N Automatic 4-spd 1993
98 15 6 Regular ... N Manual 5-spd 1993
99 9 8 Premium ... N Automatic 4-spd 1993
[100 rows x 10 columns]
The index is now sorted in ascending order. Just as in .sort_values(), the default value of ascending in .sort_index() is True, and you can switch to descending order by passing False. Sorting on the index has no effect on the data itself, because the values do not change.
This is particularly useful when you have assigned a custom index with .set_index(). If you want to set a custom index using the make and model columns, you can pass a list to .set_index():
>>> assigned_index_df = df.set_index(
... ["make", "model"]
... )
>>> assigned_index_df
city08 cylinders ... trany year
make model ...
Alfa Romeo Spider Veloce 2000 19 4 ... Manual 5-spd 1985
Ferrari Testarossa 9 12 ... Manual 5-spd 1985
Dodge Charger 23 4 ... Manual 5-spd 1985
B150/B250 Wagon 2WD 10 8 ... Automatic 3-spd 1985
Subaru Legacy AWD Turbo 17 4 ... Manual 5-spd 1993
... ... ... ... ...
Pontiac Grand Prix 17 6 ... Automatic 3-spd 1993
Grand Prix 17 6 ... Automatic 4-spd 1993
Grand Prix 15 6 ... Automatic 4-spd 1993
Grand Prix 15 6 ... Manual 5-spd 1993
Rolls-Royce Brooklands/Brklnds L 9 8 ... Automatic 4-spd 1993
[100 rows x 8 columns]
Using this method, you replace the default integer-based row index with two axis labels. This is known as a MultiIndex or a hierarchical index. Your DataFrame is now indexed by more than one key, and you can sort on those keys with .sort_index():
>>> assigned_index_df.sort_index()
city08 cylinders ... trany year
make model ...
Alfa Romeo Spider Veloce 2000 19 4 ... Manual 5-spd 1985
Audi 100 17 6 ... Automatic 4-spd 1993
100 17 6 ... Manual 5-spd 1993
BMW 740i 14 8 ... Automatic 5-spd 1993
740il 14 8 ... Automatic 5-spd 1993
... ... ... ... ...
Volkswagen Golf III / GTI 21 4 ... Manual 5-spd 1993
Jetta III 18 4 ... Automatic 4-spd 1993
Jetta III 20 4 ... Manual 5-spd 1993
Volvo 240 18 4 ... Automatic 4-spd 1993
240 19 4 ... Manual 5-spd 1993
[100 rows x 8 columns]
First you assign a new index to the DataFrame using the make and model columns, and then you sort the index with .sort_index(). You can read more about using .set_index() in the pandas documentation.
Sort by index in descending order
For the next example, you will sort the DataFrame by its index in descending order. Remember from sorting with .sort_values() that you can reverse the sort order by setting ascending to False. This parameter also works with .sort_index(), so you can sort the DataFrame in reverse order like this:
>>> assigned_index_df.sort_index(ascending=False)
city08 cylinders ... trany year
make model ...
Volvo 240 18 4 ... Automatic 4-spd 1993
240 19 4 ... Manual 5-spd 1993
Volkswagen Jetta III 18 4 ... Automatic 4-spd 1993
Jetta III 20 4 ... Manual 5-spd 1993
Golf III / GTI 18 4 ... Automatic 4-spd 1993
... ... ... ... ...
BMW 740il 14 8 ... Automatic 5-spd 1993
740i 14 8 ... Automatic 5-spd 1993
Audi 100 17 6 ... Automatic 4-spd 1993
100 17 6 ... Manual 5-spd 1993
Alfa Romeo Spider Veloce 2000 19 4 ... Manual 5-spd 1985
[100 rows x 8 columns]
Now your DataFrame is sorted in descending order by its index. One difference between using .sort_index() and .sort_values() is that .sort_index() has no by parameter because it sorts the DataFrame on the row index by default.
Explore advanced index sorting concepts
There are many situations in data analysis where you want to sort the hierarchical index. You have seen how to use make and model in MultiIndex. For this data set, you can also use the id column as an index.
Setting the id column as the index can help when linking related data sets. For example, the EPA's emissions data set also uses id to represent vehicle record IDs. This links the emissions data to the fuel economy data. Sorting the index of both data sets in their DataFrames may speed up other methods such as .merge(). To learn more about combining data in pandas, see Combining Data in Pandas With merge(), .join(), and concat().
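To make the idea of linking two data sets through a shared id index concrete, here is a rough sketch with two tiny hand-made DataFrames. The emissions column name score and all of the values are hypothetical; only the pattern of set_index(), sort_index(), and an index-based merge() is the point.

```python
import pandas as pd

# Hypothetical fuel economy and emissions records sharing an "id" key.
fuel = pd.DataFrame({"id": [3, 1, 2], "city08": [23, 19, 9]}).set_index("id")
emissions = pd.DataFrame({"id": [2, 3, 1], "score": [1, 5, 4]}).set_index("id")

# Sort both indexes so the records line up in the same order.
fuel_sorted = fuel.sort_index()
emissions_sorted = emissions.sort_index()

# Combine the two DataFrames on their indexes.
combined = fuel_sorted.merge(
    emissions_sorted, left_index=True, right_index=True
)

print(combined)
```

Each row of the result carries the columns of both inputs for the same id.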
Sort the columns of the DataFrame
You can also use the column labels of a DataFrame to sort the row values. Using .sort_index() with the optional parameter axis set to 1 sorts the DataFrame by column label. The sorting algorithm is applied to the axis labels rather than to the actual data. This can help with visual inspection of the DataFrame.
Work with the DataFrame axis
When you use .sort_index() without passing any explicit arguments, it uses axis=0 as the default. The axis of a DataFrame refers to either the index (axis=0) or the columns (axis=1). You can use both axes to index and select data in a DataFrame as well as to sort the data.
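The difference between the two axes can be sketched side by side on a two-row DataFrame. The name grid and the deliberately unordered row labels are assumptions made for this illustration.

```python
import pandas as pd

# Two rows with out-of-order labels and unsorted column names.
grid = pd.DataFrame(
    {"year": [1993, 1985], "city08": [17, 19], "make": ["Subaru", "Alfa Romeo"]},
    index=[4, 0],
)

# axis=0 (the default) reorders rows by their index labels.
rows_sorted = grid.sort_index()

# axis=1 reorders columns by their labels and leaves the rows alone.
cols_sorted = grid.sort_index(axis=1)

print(list(rows_sorted.index), list(cols_sorted.columns))
```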
Sort using column labels
You can also use the column labels of a DataFrame as the sort key for .sort_index(). Setting axis to 1 sorts the columns of the DataFrame by column label:
>>> df.sort_index(axis=1)
city08 cylinders fuelType ... mpgData trany year
0 19 4 Regular ... Y Manual 5-spd 1985
1 9 12 Regular ... N Manual 5-spd 1985
2 23 4 Regular ... Y Manual 5-spd 1985
3 10 8 Regular ... N Automatic 3-spd 1985
4 17 4 Premium ... N Manual 5-spd 1993
.. ... ... ... ... ... ... ...
95 17 6 Regular ... Y Automatic 3-spd 1993
96 17 6 Regular ... N Automatic 4-spd 1993
97 15 6 Regular ... N Automatic 4-spd 1993
98 15 6 Regular ... N Manual 5-spd 1993
99 9 8 Premium ... N Automatic 4-spd 1993
[100 rows x 10 columns]
The columns of the DataFrame are sorted alphabetically from left to right in ascending order. If you want to sort the columns in descending order, you can use ascending=False:
>>> df.sort_index(axis=1, ascending=False)
year trany mpgData ... fuelType cylinders city08
0 1985 Manual 5-spd Y ... Regular 4 19
1 1985 Manual 5-spd N ... Regular 12 9
2 1985 Manual 5-spd Y ... Regular 4 23
3 1985 Automatic 3-spd N ... Regular 8 10
4 1993 Manual 5-spd N ... Premium 4 17
.. ... ... ... ... ... ... ...
95 1993 Automatic 3-spd Y ... Regular 6 17
96 1993 Automatic 4-spd N ... Regular 6 17
97 1993 Automatic 4-spd N ... Regular 6 15
98 1993 Manual 5-spd N ... Regular 6 15
99 1993 Automatic 4-spd N ... Premium 8 9
[100 rows x 10 columns]
Using axis=1 in .sort_index(), you sorted the columns of the DataFrame in both ascending and descending order. This may be more useful with other data sets, such as one where the column labels correspond to months of the year. In that case, it makes sense to arrange the data in ascending or descending order by month.
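A small sketch of the month-column case follows. It assumes the columns are labeled with zero-padded year-month strings such as "2021-03", because plain month names would sort alphabetically rather than chronologically. The data and names are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly figures with the columns out of order.
monthly = pd.DataFrame(
    {"2021-03": [30], "2021-01": [10], "2021-02": [20]},
    index=["total"],
)

# Sort the column labels in ascending and descending order.
oldest_first = monthly.sort_index(axis=1)
newest_first = monthly.sort_index(axis=1, ascending=False)

print(list(oldest_first.columns))
print(list(newest_first.columns))
```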
Dealing with missing data when sorting in Pandas
Usually, real-world data has many flaws. Although Pandas has a variety of methods to clean up the data before sorting, sometimes it is good to see the missing data while sorting. You can do this with the na_position parameter.
The fuel economy data subset used in this tutorial has no missing values. To illustrate the use of na_position, you first need to create some missing data. The following code creates a new column based on the existing mpgData column, mapping to True where mpgData equals Y and to NaN where it does not:
>>> df["mpgData_"] = df["mpgData"].map({"Y": True})
>>> df
city08 cylinders fuelType ... trany year mpgData_
0 19 4 Regular ... Manual 5-spd 1985 True
1 9 12 Regular ... Manual 5-spd 1985 NaN
2 23 4 Regular ... Manual 5-spd 1985 True
3 10 8 Regular ... Automatic 3-spd 1985 NaN
4 17 4 Premium ... Manual 5-spd 1993 NaN
.. ... ... ... ... ... ... ...
95 17 6 Regular ... Automatic 3-spd 1993 True
96 17 6 Regular ... Automatic 4-spd 1993 NaN
97 15 6 Regular ... Automatic 4-spd 1993 NaN
98 15 6 Regular ... Manual 5-spd 1993 NaN
99 9 8 Premium ... Automatic 4-spd 1993 NaN
[100 rows x 11 columns]
Now you have a new column named mpgData_ containing both True and NaN values. You will use this column to see the effect of na_position when using these two sorting methods. To learn more about using .map(), you can read the Pandas project: Using Python and Pandas to Make Gradebooks.
Understand the na_position parameter in .sort_values()
.sort_values() accepts a parameter named na_position, which helps organize missing data in the columns you sort. If you sort the columns with missing data, the rows with missing values will appear at the end of the DataFrame. This happens regardless of whether you are sorting in ascending or descending order.
When you sort the column with missing data, your DataFrame looks like this:
>>> df.sort_values(by="mpgData_")
city08 cylinders fuelType ... trany year mpgData_
0 19 4 Regular ... Manual 5-spd 1985 True
55 18 6 Regular ... Automatic 4-spd 1993 True
56 18 6 Regular ... Automatic 4-spd 1993 True
57 16 6 Premium ... Manual 5-spd 1993 True
59 17 6 Regular ... Automatic 4-spd 1993 True
.. ... ... ... ... ... ... ...
94 18 6 Regular ... Automatic 4-spd 1993 NaN
96 17 6 Regular ... Automatic 4-spd 1993 NaN
97 15 6 Regular ... Automatic 4-spd 1993 NaN
98 15 6 Regular ... Manual 5-spd 1993 NaN
99 9 8 Premium ... Automatic 4-spd 1993 NaN
[100 rows x 11 columns]
To change this behavior and have the missing data appear first in your DataFrame, you can set na_position to first. The na_position parameter only accepts the values last, which is the default, and first. Here is how to use na_position in .sort_values():
>>> df.sort_values(
... by="mpgData_",
... na_position="first"
... )
city08 cylinders fuelType ... trany year mpgData_
1 9 12 Regular ... Manual 5-spd 1985 NaN
3 10 8 Regular ... Automatic 3-spd 1985 NaN
4 17 4 Premium ... Manual 5-spd 1993 NaN
5 21 4 Regular ... Automatic 3-spd 1993 NaN
11 18 4 Regular ... Automatic 4-spd 1993 NaN
.. ... ... ... ... ... ... ...
32 15 8 Premium ... Automatic 4-spd 1993 True
33 15 8 Premium ... Automatic 4-spd 1993 True
37 17 6 Regular ... Automatic 3-spd 1993 True
85 17 6 Regular ... Automatic 4-spd 1993 True
95 17 6 Regular ... Automatic 3-spd 1993 True
[100 rows x 11 columns]
Now any missing data in the column you used for sorting will be displayed at the top of the DataFrame. This is useful when you first start analyzing data and are not sure whether there are missing values.
Understand the na_position parameter in .sort_index()
.sort_index() also accepts na_position. Your DataFrame usually does not have NaN values as part of its index, so this parameter is less useful in .sort_index(). However, it is good to know that if your DataFrame does have NaN in the row index or column names, you can quickly spot it using .sort_index() and na_position.
By default, this parameter is set to last, which places NaN values at the end of the sorted result. To change this behavior and have the missing data appear first in your DataFrame, set na_position to first.
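Since the tutorial's data set has no NaN in its index, the sketch below builds a toy DataFrame with one missing row label to show both settings. The index values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy DataFrame whose row index contains a missing label.
indexed = pd.DataFrame({"city08": [19, 9, 23]}, index=[2.0, np.nan, 1.0])

# Default: the row with the NaN label goes to the end.
nan_last = indexed.sort_index()

# na_position="first" moves it to the top instead.
nan_first = indexed.sort_index(na_position="first")

print(nan_last["city08"].tolist())
print(nan_first["city08"].tolist())
```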
Use the sort method to modify your DataFrame
In all the examples you have seen so far, both .sort_values() and .sort_index() have returned the data frame object when you call those methods. This is because sorting in Pandas does not work in place by default. Generally, this is the most common and preferred method of analyzing data using Pandas, because it creates a new DataFrame instead of modifying the original data. This allows you to preserve the state of the data as it was read from the file.
However, you can modify the original DataFrame directly by specifying the optional parameter inplace with the value True. Most pandas methods include the inplace parameter. Below, you will see a few examples of using inplace=True to sort a DataFrame in place.
Use .sort_values() in place
With inplace set to True, you modify the original DataFrame, so the sort methods return None. Sort your DataFrame by the values of the city08 column as in the very first example, but with inplace set to True:
>>> df.sort_values("city08", inplace=True)
Please note how calling .sort_values() does not return a DataFrame. This is what the original df looks like:
>>> df
city08 cylinders fuelType ... trany year mpgData_
99 9 8 Premium ... Automatic 4-spd 1993 NaN
1 9 12 Regular ... Manual 5-spd 1985 NaN
80 9 8 Regular ... Automatic 3-spd 1985 NaN
47 9 8 Regular ... Automatic 3-spd 1985 NaN
3 10 8 Regular ... Automatic 3-spd 1985 NaN
.. ... ... ... ... ... ... ...
9 23 4 Regular ... Automatic 4-spd 1993 True
8 23 4 Regular ... Manual 5-spd 1993 True
7 23 4 Regular ... Automatic 3-spd 1993 True
76 23 4 Regular ... Manual 5-spd 1993 True
2 23 4 Regular ... Manual 5-spd 1985 True
[100 rows x 11 columns]
In the df object, the values are now sorted in ascending order based on the city08 column. Your original DataFrame has been modified, and the changes will persist. It is usually a good idea to avoid inplace=True for analysis, because changes to the DataFrame cannot be undone.
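The return value of an in-place sort can be checked directly. This sketch uses toy data and also keeps a copy before sorting, which is one simple way to hold on to the original state; the names df_demo, backup, and returned are assumptions for this example.

```python
import pandas as pd

# Toy data for illustration only.
df_demo = pd.DataFrame({"city08": [19, 9, 23]})

# Keep a copy first, because the in-place sort overwrites df_demo.
backup = df_demo.copy()

# With inplace=True the method returns None and reorders df_demo itself.
returned = df_demo.sort_values("city08", inplace=True)

print(returned)
print(list(df_demo.index), list(backup.index))
```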
Use .sort_index() in place
The next example shows that inplace also works with .sort_index().
Since the index was created in ascending order when you read the file into the DataFrame, you can modify the df object again to restore its original order. Use .sort_index() with inplace set to True to modify the DataFrame:
>>> df.sort_index(inplace=True)
>>> df
city08 cylinders fuelType ... trany year mpgData_
0 19 4 Regular ... Manual 5-spd 1985 True
1 9 12 Regular ... Manual 5-spd 1985 NaN
2 23 4 Regular ... Manual 5-spd 1985 True
3 10 8 Regular ... Automatic 3-spd 1985 NaN
4 17 4 Premium ... Manual 5-spd 1993 NaN
.. ... ... ... ... ... ... ...
95 17 6 Regular ... Automatic 3-spd 1993 True
96 17 6 Regular ... Automatic 4-spd 1993 NaN
97 15 6 Regular ... Automatic 4-spd 1993 NaN
98 15 6 Regular ... Manual 5-spd 1993 NaN
99 9 8 Premium ... Automatic 4-spd 1993 NaN
[100 rows x 11 columns]
Now your DataFrame has been modified again, this time using .sort_index(). Since your DataFrame still has its default index, sorting it in ascending order puts the data back in its original order.
If you are familiar with Python's built-in functions sort() and sorted(), the inplace parameter available in the pandas sort methods may feel very similar. For more information, you can check out How to Use sorted() and sort() in Python.
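For readers who want the built-in analogy spelled out, this short sketch contrasts sorted(), which returns a new list, with list.sort(), which works in place and returns None. The list values are arbitrary.

```python
# Arbitrary values for illustration.
mpg = [19, 9, 23]

# sorted() returns a new list and leaves the original alone,
# much like the pandas sort methods with their default inplace=False.
new_list = sorted(mpg)
unchanged = list(mpg)

# list.sort() reorders the list itself and returns None,
# much like the pandas sort methods with inplace=True.
returned = mpg.sort()

print(new_list, unchanged, returned, mpg)
```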
Conclusion
You now know how to use the two core methods of the pandas library: .sort_values() and .sort_index(). With this knowledge, you can use DataFrame to perform basic data analysis. Although there are many similarities between the two methods, by looking at the differences between them, you can clearly know which method is used to perform different analysis tasks.
In this tutorial, you learned how to:
• Sort Pandas DataFrame by one or more column values
• Use the ascending parameter to change the sort order
• Sort a DataFrame by its index using .sort_index()
• Organize missing data when sorting values
• Sort a DataFrame in place by setting inplace to True
These methods are an important part of being proficient in data analysis. They will help you build a strong foundation on which you can perform more advanced Pandas operations. If you want to see some examples of more advanced usage of Pandas sorting methods, then the Pandas documentation is a great resource.