Abstract: Learning the Pandas sorting methods is a great way to start or practice basic data analysis with Python. Data analysis is most commonly done with spreadsheets, SQL, or pandas. One of the major advantages of Pandas is that it can handle large amounts of data while providing high-performance data manipulation capabilities.

This article is shared from Huawei Cloud Community: "Pandas Sort: Your Python Data Sorting Guide", author: Yuchuan.

In this tutorial, you will learn how to use .sort_values() and .sort_index(), which will enable you to effectively sort the data in the DataFrame.

At the end of this tutorial, you will know how to:

• Sort Pandas DataFrame by one or more column values
• Use the ascending parameter to change the sort order
• Sort a DataFrame by its index using .sort_index()
• Organize missing data when sorting values
• Sort a DataFrame in place by setting inplace to True

To follow this tutorial, you need a basic understanding of Pandas DataFrames and some familiarity with reading data from files.

Getting started with pandas sorting methods

As a quick reminder, a DataFrame is a data structure with labeled axes for both rows and columns. You can sort a DataFrame by row or column values as well as by row or column index.

Both rows and columns have an index, which is a numeric representation of the position of the data in the DataFrame. You can use the index position of the DataFrame to retrieve data from a specific row or column. By default, the index number starts from zero. You can also manually assign your own index.
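The index behavior described above can be illustrated with a minimal sketch. The three-row DataFrame here is hypothetical toy data, not part of the EPA data set, and the labels "a", "b", and "c" are chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy data, not taken from the EPA data set.
cars = pd.DataFrame({"make": ["Audi", "BMW", "Volvo"], "city08": [17, 14, 18]})

# By default, pandas assigns a zero-based integer index to the rows.
default_labels = list(cars.index)

# Retrieve a row by its index position.
second_row = cars.iloc[1]

# Manually assign your own row labels, then look a row up by label.
cars.index = ["a", "b", "c"]
row_b = cars.loc["b"]
```

Here .iloc selects by position and .loc selects by label, so both lookups return the BMW row.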

Prepare the data set

In this tutorial, you will use the fuel economy data compiled by the U.S. Environmental Protection Agency (EPA) for vehicles manufactured between 1984 and 2021. The EPA fuel economy data set is great because it contains many different types of information that you can sort on, from text to numeric data types. The data set contains a total of eighty-three columns.

To continue, you need to install the pandas Python library. The code in this tutorial was executed using pandas 1.2.0 and Python 3.9.1.

Note: It may take a minute or two to read the entire fuel economy data set into memory. Limiting the number of rows and columns helps performance, but it still takes a few seconds to download the data.

For analysis purposes, you will view the MPG (miles per gallon) data of a vehicle by make, model, year, and other vehicle attributes. You can specify the columns to be read into the DataFrame. For this tutorial, you only need a subset of the available columns.

The following is the command to read the relevant columns of the fuel economy data set into the DataFrame and display the first five rows:

>>> import pandas as pd

>>> column_subset = [
...     "id",
...     "make",
...     "model",
...     "year",
...     "cylinders",
...     "fuelType",
...     "trany",
...     "mpgData",
...     "city08",
...     "highway08"
... ]

>>> df = pd.read_csv(
...     "https://www.fueleconomy.gov/feg/epadata/vehicles.csv",
...     usecols=column_subset,
...     nrows=100
... )

>>> df.head()
   city08  cylinders fuelType  ...  mpgData            trany  year
0      19          4  Regular  ...        Y     Manual 5-spd  1985
1       9         12  Regular  ...        N     Manual 5-spd  1985
2      23          4  Regular  ...        Y     Manual 5-spd  1985
3      10          8  Regular  ...        N  Automatic 3-spd  1985
4      17          4  Premium  ...        N     Manual 5-spd  1993
[5 rows x 10 columns]

By calling .read_csv() with the data set URL, you can load the data into a DataFrame. Narrowing down the columns results in faster load times and lower memory use. To further limit memory consumption and get a quick feel for the data, you can use nrows to specify how many rows to load.
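If you want to try usecols and nrows without downloading anything, the following sketch reads from an in-memory string instead of the EPA URL. The CSV text is a small hypothetical stand-in for the real file, and the id values in it are made up:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file (the id values are made up).
csv_text = """id,make,model,year,city08
1,Alfa Romeo,Spider Veloce 2000,1985,19
2,Ferrari,Testarossa,1985,9
3,Dodge,Charger,1985,23
4,Dodge,B150/B250 Wagon 2WD,1985,10
"""

# usecols keeps only the named columns; nrows stops after two data rows.
small_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["make", "city08"],
    nrows=2,
)
```

Only two of the five columns and two of the four rows end up in small_df.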

Get familiar with .sort_values()

You use .sort_values() to sort the values in a DataFrame along either axis (columns or rows). Typically, you want to sort the rows of a DataFrame by the values of one or more columns:
[Figure: a DataFrame whose rows are sorted by the values in the highway08 column]

The figure above shows the result of using .sort_values() to sort the rows of the DataFrame by the values in the highway08 column. This is similar to sorting data in a spreadsheet by a column.

Get familiar with .sort_index()

You use .sort_index() to sort a DataFrame by its row index or column labels. The difference from .sort_values() is that you sort the DataFrame by its row index or column names, not by the values in those rows or columns:
[Figure: a DataFrame with its row index highlighted in blue]

The row index of the DataFrame is marked in blue in the figure above. An index is not considered a column, and you typically have only a single row index. The row index can be thought of as a zero-based row number.

Sort DataFrame on a single column

To sort the DataFrame based on the values in a single column, you will use .sort_values(). By default, this will return a new DataFrame sorted in ascending order. It will not modify the original DataFrame.

Sort by column in ascending order

To use .sort_values(), pass a single parameter to the method that contains the name of the column you want to sort by. In this example, you sort the DataFrame by the city08 column, which represents the city MPG for fuel-only cars:

>>> df.sort_values("city08")
    city08  cylinders fuelType  ...  mpgData            trany  year
99       9          8  Premium  ...        N  Automatic 4-spd  1993
1        9         12  Regular  ...        N     Manual 5-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
..     ...        ...      ...  ...      ...              ...   ...
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
76      23          4  Regular  ...        Y     Manual 5-spd  1993
2       23          4  Regular  ...        Y     Manual 5-spd  1985
[100 rows x 10 columns]

This sorts your DataFrame by the column values in city08, showing the vehicles with the lowest MPG first. By default, .sort_values() sorts the data in ascending order. Although you did not specify a name for the argument you passed to .sort_values(), you actually used the by parameter, which you will see in the next example.
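The two points above (the positional argument is really by, and the original DataFrame is left untouched) can be checked with a small sketch. The four-row DataFrame is hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the city08 column.
df_demo = pd.DataFrame({"city08": [19, 9, 23, 10]})

# Passing the column name positionally is the same as passing it to by.
result = df_demo.sort_values("city08")
same_call = df_demo.sort_values(by="city08")

# The original DataFrame keeps its row order; only result is sorted.
original_order = list(df_demo["city08"])
sorted_order = list(result["city08"])
```

Note that result keeps the original row labels, so its index is no longer in ascending order.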

Change the sort order

Another parameter of .sort_values() is ascending. By default, .sort_values() has ascending set to True. If you want the DataFrame to be sorted in descending order, you can pass False to this parameter:

>>> df.sort_values(
...     by="city08",
...     ascending=False
... )
    city08  cylinders fuelType  ...  mpgData            trany  year
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
2       23          4  Regular  ...        Y     Manual 5-spd  1985
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
76      23          4  Regular  ...        Y     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
58      10          8  Regular  ...        N  Automatic 3-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

You can reverse the sort order by passing False to ascending. Now your DataFrame is sorted in descending order by the average MPG measured under city conditions. The vehicle with the highest MPG value is in the first row.

Select sorting algorithm

It is worth noting that pandas allows you to choose different sorting algorithms to use with .sort_values() and .sort_index(). The available algorithms are quicksort, mergesort, and heapsort. For more information about these different sorting algorithms, check out Sorting Algorithms in Python.

The default algorithm used when sorting a single column is quicksort. To change it to a stable sorting algorithm, use mergesort. You can do this with the kind parameter in either .sort_values() or .sort_index(), as follows:

>>> df.sort_values(
...     by="city08",
...     ascending=False,
...     kind="mergesort"
... )
    city08  cylinders fuelType  ...  mpgData            trany  year
2       23          4  Regular  ...        Y     Manual 5-spd  1985
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
10      23          4  Regular  ...        Y     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
69      10          8  Regular  ...        N  Automatic 3-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

With kind, you set the sorting algorithm to mergesort. The previous output used the default quicksort algorithm. Looking at the row index values, you can see that the rows are in a different order. This is because quicksort is not a stable sorting algorithm, but mergesort is.

Note: In Pandas, kind will be ignored when you sort multiple columns or labels.

When you sort multiple records with the same key, a stable sorting algorithm will maintain the original order of these records after sorting. Therefore, if you plan to perform multiple sorts, you must use a stable sorting algorithm.
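To make the point about multiple sorts concrete, here is a minimal sketch using hypothetical toy data. It sorts by the secondary key first and then by the primary key with a stable algorithm, and compares the result with a single two-column sort:

```python
import pandas as pd

# Hypothetical toy data with ties in city08.
cars = pd.DataFrame(
    {
        "make": ["Volvo", "Audi", "BMW", "Audi"],
        "city08": [18, 17, 17, 18],
    }
)

# Pass 1: sort by the secondary key (make).
by_make = cars.sort_values("make", kind="mergesort")

# Pass 2: sort by the primary key (city08) with a stable algorithm,
# so rows with equal city08 keep the make order from pass 1.
two_pass = by_make.sort_values("city08", kind="mergesort")

# A single call with both keys gives the same row order.
one_pass = cars.sort_values(["city08", "make"])
```

With an unstable algorithm in the second pass, the make order within each city08 group would not be guaranteed.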

Sort DataFrame on multiple columns

In data analysis, you usually want to sort the data based on the values of multiple columns. Imagine you have a data set containing people's first and last names. It makes sense to sort by last name and then by first name, so that people with the same last name will be sorted alphabetically according to their first name.

In the first example, you sorted the DataFrame on a single column named city08. From an analytical point of view, MPG under urban conditions is an important factor in determining the popularity of cars. In addition to MPG in urban conditions, you may also want to view MPG in highway conditions. To sort by two keys, you can pass a list of column names to by:

>>> df.sort_values(
...     by=["city08", "highway08"]
... )[["city08", "highway08"]]
    city08  highway08
80       9         10
47       9         11
99       9         13
1        9         14
58      10         11
..     ...        ...
9       23         30
10      23         30
8       23         31
76      23         31
2       23         33
[100 rows x 2 columns]

By specifying a list of column names city08 and highway08, you can use .sort_values() to sort the DataFrame on the two columns. The next example will explain how to specify the sort order and why it is important to pay attention to the list of column names you use.

Sort by multiple columns in ascending order

To sort a DataFrame on multiple columns, you must provide a list of column names. For example, to sort by make and model, you should create the following list and then pass it to .sort_values():

>>> df.sort_values(
...     by=["make", "model"]
... )[["make", "model"]]
          make               model
0   Alfa Romeo  Spider Veloce 2000
18        Audi                 100
19        Audi                 100
20         BMW                740i
21         BMW               740il
..         ...                 ...
12  Volkswagen      Golf III / GTI
13  Volkswagen           Jetta III
15  Volkswagen           Jetta III
16       Volvo                 240
17       Volvo                 240
[100 rows x 2 columns]

Now your DataFrame is sorted in ascending order by make. Where two or more rows have the same make, they are sorted by model. The order in which the column names appear in the list determines how the DataFrame is sorted.

Change the column sort order

Since you use multiple columns for sorting, you can specify the sort order of the columns. If you want to change the logical sort order in the previous example, you can change the order of the column names in the list passed to the by parameter:

>>> df.sort_values(
...     by=["model", "make"]
... )[["make", "model"]]
             make        model
18           Audi          100
19           Audi          100
16          Volvo          240
17          Volvo          240
75          Mazda          626
..            ...          ...
62           Ford  Thunderbird
63           Ford  Thunderbird
88     Oldsmobile     Toronado
42  CX Automotive        XM v6
43  CX Automotive       XM v6a
[100 rows x 2 columns]

Your DataFrame is now sorted in ascending order by model, and then by make wherever two or more rows share the same model. You can see that changing the order of the columns also changes the order in which the values are sorted.
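The effect of the key order is easy to see on the first-name and last-name scenario mentioned earlier. This sketch uses three hypothetical people and the assumed column names first and last:

```python
import pandas as pd

# Hypothetical names; the column names "first" and "last" are assumptions.
people = pd.DataFrame(
    {
        "first": ["Bob", "Alice", "Carol"],
        "last": ["Smith", "Smith", "Jones"],
    }
)

# Last name is the primary key, first name breaks ties.
by_last_then_first = people.sort_values(by=["last", "first"])

# First name is the primary key, last name breaks ties.
by_first_then_last = people.sort_values(by=["first", "last"])

order_last_first = list(by_last_then_first["first"])
order_first_last = list(by_first_then_last["first"])
```

The same three rows come back in a different order depending on which column is listed first.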

Sort by multiple columns in descending order

So far, you have only sorted multiple columns in ascending order. In the next example, you will sort in descending order based on the make and model columns. To sort in descending order, set ascending to False:

>>> df.sort_values(
...     by=["make", "model"],
...     ascending=False
... )[["make", "model"]]
          make               model
16       Volvo                 240
17       Volvo                 240
13  Volkswagen           Jetta III
15  Volkswagen           Jetta III
11  Volkswagen      Golf III / GTI
..         ...                 ...
21         BMW               740il
20         BMW                740i
18        Audi                 100
19        Audi                 100
0   Alfa Romeo  Spider Veloce 2000
[100 rows x 2 columns]

The values in the make column are in reverse alphabetical order, and for cars with the same make, the values in the model column are in descending order. With text data, the sort is case-sensitive, which means that uppercase text appears first in ascending order and last in descending order.
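The case sensitivity mentioned above can be seen in a short sketch with hypothetical model names. It also shows one possible way to get a case-insensitive order, using the key parameter that .sort_values() accepts in pandas 1.1.0 and later; this is an illustration, not something the original examples use:

```python
import pandas as pd

# Hypothetical model names with mixed capitalization.
models = pd.DataFrame({"model": ["golf", "Jetta", "Golf", "jetta"]})

# Default behavior: uppercase letters sort before lowercase letters.
case_sensitive = list(models.sort_values("model")["model"])

# One possible workaround: compare lowercased copies of the values.
# A stable algorithm keeps the original order of rows that tie.
case_insensitive = list(
    models.sort_values(
        "model",
        key=lambda col: col.str.lower(),
        kind="mergesort",
    )["model"]
)
```

The key function only affects the comparison; the values stored in the column are returned unchanged.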

Sort by multiple columns with different sort orders

You may be wondering if you can use multiple columns for sorting and have these columns use different ascending arguments. With pandas, you can accomplish this with a single method call. If you want to sort some columns in ascending order and some columns in descending order, you can pass a list of Boolean values to ascending.

In this example, you sort the DataFrame by the make, model, and city08 columns, with the first two columns in ascending order and city08 in descending order. To do this, you pass a list of column names to by and a list of Boolean values to ascending:

>>> df.sort_values(
...     by=["make", "model", "city08"],
...     ascending=[True, True, False]
... )[["make", "model", "city08"]]
          make               model  city08
0   Alfa Romeo  Spider Veloce 2000      19
18        Audi                 100      17
19        Audi                 100      17
20         BMW                740i      14
21         BMW               740il      14
..         ...                 ...     ...
11  Volkswagen      Golf III / GTI      18
15  Volkswagen           Jetta III      20
13  Volkswagen           Jetta III      18
17       Volvo                 240      19
16       Volvo                 240      18
[100 rows x 3 columns]

Now your DataFrame is sorted by make and model in ascending order, but with city08 in descending order. This is useful because it groups the cars in a categorical order and shows the cars with the highest MPG first within each group.
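A compact version of the same idea, on hypothetical toy data with two keys instead of three, makes the grouping effect easier to verify by eye:

```python
import pandas as pd

# Hypothetical toy data: two makes, two cars each.
cars = pd.DataFrame(
    {
        "make": ["Volvo", "Audi", "Volvo", "Audi"],
        "city08": [19, 17, 18, 20],
    }
)

# make ascending, city08 descending within each make.
mixed = cars.sort_values(by=["make", "city08"], ascending=[True, False])

makes_in_order = list(mixed["make"])
mpg_in_order = list(mixed["city08"])
```

The list passed to ascending must have one entry per column named in by.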

Sort the DataFrame according to the index

Before sorting the index, it is best to understand what the index represents. A DataFrame has an .index property, which by default is a numeric representation of its row positions. You can think of the index as the row numbers. It helps you look up and identify rows quickly.

Sort by index in ascending order

You can sort a DataFrame by its row index with .sort_index(). Sorting by column values as in the previous examples reorders the rows in the DataFrame, so the index becomes disorganized. This can also happen when you filter a DataFrame or when you delete or add rows.

To illustrate the use of .sort_index(), start by creating a new sorted DataFrame using .sort_values():

>>> sorted_df = df.sort_values(by=["make", "model"])
>>> sorted_df
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
18      17          6  Premium  ...        Y  Automatic 4-spd  1993
19      17          6  Premium  ...        N     Manual 5-spd  1993
20      14          8  Premium  ...        N  Automatic 5-spd  1993
21      14          8  Premium  ...        N  Automatic 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
12      21          4  Regular  ...        Y     Manual 5-spd  1993
13      18          4  Regular  ...        N  Automatic 4-spd  1993
15      20          4  Regular  ...        N     Manual 5-spd  1993
16      18          4  Regular  ...        Y  Automatic 4-spd  1993
17      19          4  Regular  ...        Y     Manual 5-spd  1993
[100 rows x 10 columns]

You have created a DataFrame sorted by multiple values. Note how the row indexes are in no particular order. To restore the new DataFrame to its original order, you can use .sort_index():

>>> sorted_df.sort_index()
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
2       23          4  Regular  ...        Y     Manual 5-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
4       17          4  Premium  ...        N     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
95      17          6  Regular  ...        Y  Automatic 3-spd  1993
96      17          6  Regular  ...        N  Automatic 4-spd  1993
97      15          6  Regular  ...        N  Automatic 4-spd  1993
98      15          6  Regular  ...        N     Manual 5-spd  1993
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

The index is now sorted in ascending order. Just as in .sort_values(), the default value of ascending in .sort_index() is True, and you can change to descending order by passing False. Sorting the index has no effect on the data itself, because the values do not change.

Sorting by index is particularly useful when you have assigned a custom index with .set_index(). If you want to use the make and model columns to set a custom index, you can pass a list to .set_index():
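The round trip described above (sort by values, then restore the original order by index) can be checked on a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical toy data with the default integer index.
df_demo = pd.DataFrame({"make": ["Volvo", "Audi", "BMW"]})

# Sorting by values scrambles the row labels.
shuffled = df_demo.sort_values("make")
labels_after_value_sort = list(shuffled.index)

# Sorting by index restores the original row order.
restored = shuffled.sort_index()

# ascending=False reverses the index order instead.
reversed_rows = shuffled.sort_index(ascending=False)
```

The restored DataFrame is identical to df_demo, which shows that only the row order changed and not the data.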

>>> assigned_index_df = df.set_index(
...     ["make", "model"]
... )
>>> assigned_index_df
                                  city08  cylinders  ...            trany  year
make        model                                    ...
Alfa Romeo  Spider Veloce 2000        19          4  ...     Manual 5-spd  1985
Ferrari     Testarossa                 9         12  ...     Manual 5-spd  1985
Dodge       Charger                   23          4  ...     Manual 5-spd  1985
            B150/B250 Wagon 2WD       10          8  ...  Automatic 3-spd  1985
Subaru      Legacy AWD Turbo          17          4  ...     Manual 5-spd  1993
                                  ...        ...  ...              ...   ...
Pontiac     Grand Prix                17          6  ...  Automatic 3-spd  1993
            Grand Prix                17          6  ...  Automatic 4-spd  1993
            Grand Prix                15          6  ...  Automatic 4-spd  1993
            Grand Prix                15          6  ...     Manual 5-spd  1993
Rolls-Royce Brooklands/Brklnds L       9          8  ...  Automatic 4-spd  1993
[100 rows x 8 columns]

Using this method, you replace the default integer-based row index with two axis labels. This is known as a MultiIndex or a hierarchical index. Your DataFrame is now indexed by more than one key, and you can sort on those keys with .sort_index():

>>> assigned_index_df.sort_index()
                               city08  cylinders  ...            trany  year
make       model                                  ...
Alfa Romeo Spider Veloce 2000      19          4  ...     Manual 5-spd  1985
Audi       100                     17          6  ...  Automatic 4-spd  1993
           100                     17          6  ...     Manual 5-spd  1993
BMW        740i                    14          8  ...  Automatic 5-spd  1993
           740il                   14          8  ...  Automatic 5-spd  1993
                               ...        ...  ...              ...   ...
Volkswagen Golf III / GTI          21          4  ...     Manual 5-spd  1993
           Jetta III               18          4  ...  Automatic 4-spd  1993
           Jetta III               20          4  ...     Manual 5-spd  1993
Volvo      240                     18          4  ...  Automatic 4-spd  1993
           240                     19          4  ...     Manual 5-spd  1993
[100 rows x 8 columns]

First, you assigned a new index to the DataFrame using the make and model columns, and then you sorted the index using .sort_index(). You can read more about using .set_index() in the pandas documentation.
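The same two steps can be reproduced on a few hypothetical rows. As an extra illustration that is not part of the walkthrough above, the sketch also passes level to .sort_index() to sort primarily on the inner model level:

```python
import pandas as pd

# Hypothetical toy rows; the city08 values are made up.
cars = pd.DataFrame(
    {
        "make": ["Volvo", "BMW", "Audi", "BMW"],
        "model": ["240", "740il", "100", "740i"],
        "city08": [18, 14, 17, 14],
    }
)

# Step 1: replace the integer index with a (make, model) MultiIndex.
indexed = cars.set_index(["make", "model"])

# Step 2: sort on the index keys, outer level first.
sorted_by_index = indexed.sort_index()
index_tuples = list(sorted_by_index.index)

# Extra: sort primarily on the inner level instead.
by_model = indexed.sort_index(level="model")
models_in_order = list(by_model.index.get_level_values("model"))
```

Each entry of a MultiIndex is a tuple, here one (make, model) pair per row.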

Sort by index in descending order

For the next example, you will sort the DataFrame in descending order by index. Remember that when sorting a DataFrame with .sort_values(), you can reverse the sort order by setting ascending to False. This parameter also applies to .sort_index(), so you can sort the DataFrame in reverse order as follows:

>>> assigned_index_df.sort_index(ascending=False)
                               city08  cylinders  ...            trany  year
make       model                                  ...
Volvo      240                     18          4  ...  Automatic 4-spd  1993
           240                     19          4  ...     Manual 5-spd  1993
Volkswagen Jetta III               18          4  ...  Automatic 4-spd  1993
           Jetta III               20          4  ...     Manual 5-spd  1993
           Golf III / GTI          18          4  ...  Automatic 4-spd  1993
                               ...        ...  ...              ...   ...
BMW        740il                   14          8  ...  Automatic 5-spd  1993
           740i                    14          8  ...  Automatic 5-spd  1993
Audi       100                     17          6  ...  Automatic 4-spd  1993
           100                     17          6  ...     Manual 5-spd  1993
Alfa Romeo Spider Veloce 2000      19          4  ...     Manual 5-spd  1985
[100 rows x 8 columns]

Now your DataFrame is sorted in descending order by its index. One difference between using .sort_index() and .sort_values() is that .sort_index() has no by parameter because it sorts the DataFrame on the row index by default.

Explore advanced index sorting concepts

There are many situations in data analysis where you want to sort the hierarchical index. You have seen how to use make and model in MultiIndex. For this data set, you can also use the id column as an index.

Setting the id column as an index may help link related data sets. For example, the EPA's emissions data set also uses id to represent vehicle record IDs. This links the emissions data to the fuel economy data. Sorting the index of both data sets in their DataFrames could speed up other methods such as .merge(). To learn more about combining data in Pandas, see Combining Data in Pandas With merge(), .join(), and concat().
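As a rough sketch of linking two data sets through a shared id index, the example below uses two small hypothetical DataFrames. The score column and all values are invented for illustration and do not come from the EPA emissions data:

```python
import pandas as pd

# Hypothetical fuel economy and emissions records sharing an id column.
fuel = pd.DataFrame({"id": [3, 1, 2], "city08": [23, 19, 9]})
emissions = pd.DataFrame({"id": [2, 3, 1], "score": [1.0, 5.0, 3.0]})

# Use id as the index in both DataFrames and sort each index.
fuel_by_id = fuel.set_index("id").sort_index()
emissions_by_id = emissions.set_index("id").sort_index()

# Combine the two DataFrames by matching their index values.
combined = fuel_by_id.merge(
    emissions_by_id,
    left_index=True,
    right_index=True,
)
```

Each row of combined now holds both the city08 and score values for one id.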

Sort the columns of the DataFrame

You can also sort a DataFrame by its column labels. Calling .sort_index() with the optional parameter axis set to 1 sorts the DataFrame by column label. The sorting algorithm is applied to the axis labels instead of the actual data. This can be helpful for visual inspection of the DataFrame.

Working with the DataFrame axis

When you use .sort_index() without passing any explicit arguments, it uses axis=0 as the default. The axis of a DataFrame refers to either the index (axis=0) or the columns (axis=1). You can use both axes to index and select data in a DataFrame as well as to sort it.

Sort using column labels

You can also use the column labels of a DataFrame as the sort key for .sort_index(). Setting axis to 1 sorts the columns of the DataFrame by their labels:

>>> df.sort_index(axis=1)
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
2       23          4  Regular  ...        Y     Manual 5-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
4       17          4  Premium  ...        N     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
95      17          6  Regular  ...        Y  Automatic 3-spd  1993
96      17          6  Regular  ...        N  Automatic 4-spd  1993
97      15          6  Regular  ...        N  Automatic 4-spd  1993
98      15          6  Regular  ...        N     Manual 5-spd  1993
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

The columns of the DataFrame are sorted alphabetically from left to right in ascending order. If you want to sort the columns in descending order, you can use ascending=False:

>>> df.sort_index(axis=1, ascending=False)
    year            trany mpgData  ... fuelType cylinders  city08
0   1985     Manual 5-spd       Y  ...  Regular         4      19
1   1985     Manual 5-spd       N  ...  Regular        12       9
2   1985     Manual 5-spd       Y  ...  Regular         4      23
3   1985  Automatic 3-spd       N  ...  Regular         8      10
4   1993     Manual 5-spd       N  ...  Premium         4      17
..   ...              ...     ...  ...      ...       ...     ...
95  1993  Automatic 3-spd       Y  ...  Regular         6      17
96  1993  Automatic 4-spd       N  ...  Regular         6      17
97  1993  Automatic 4-spd       N  ...  Regular         6      15
98  1993     Manual 5-spd       N  ...  Regular         6      15
99  1993  Automatic 4-spd       N  ...  Premium         8       9
[100 rows x 10 columns]

Using axis=1 in .sort_index(), you can sort the columns of the DataFrame in ascending or descending order. This may be more useful in other data sets, such as one where the column labels correspond to months of the year. In that case, it makes sense to arrange the data in ascending or descending order by month.
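For the month scenario, a minimal sketch might look like the following. The column labels are hypothetical year-month strings, chosen because they sort chronologically as plain text, and the single row of values is made up:

```python
import pandas as pd

# Hypothetical monthly figures with the columns out of order.
monthly = pd.DataFrame(
    [[30, 10, 20]],
    columns=["2021-03", "2021-01", "2021-02"],
)

# Sort the columns by label, earliest month first.
ascending_cols = monthly.sort_index(axis=1)

# Sort the columns by label, latest month first.
descending_cols = monthly.sort_index(axis=1, ascending=False)
```

Month names such as "Jan" or "Feb" would sort alphabetically rather than chronologically, so this approach relies on labels whose text order matches calendar order.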

Dealing with missing data when sorting in Pandas

Usually, real-world data has many flaws. Although Pandas has a variety of methods to clean up the data before sorting, sometimes it is good to see the missing data while sorting. You can do this with the na_position parameter.

The fuel economy data subset used in this tutorial has no missing values. To illustrate the use of na_position, you first need to create some missing data. The following code creates a new column based on the existing mpgData column, mapping to True where mpgData equals Y and to NaN where it does not:

>>> df["mpgData_"] = df["mpgData"].map({"Y": True})
>>> df
    city08  cylinders fuelType  ...            trany  year mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985     True
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
2       23          4  Regular  ...     Manual 5-spd  1985     True
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
4       17          4  Premium  ...     Manual 5-spd  1993      NaN
..     ...        ...      ...  ...              ...   ...      ...
95      17          6  Regular  ...  Automatic 3-spd  1993     True
96      17          6  Regular  ...  Automatic 4-spd  1993      NaN
97      15          6  Regular  ...  Automatic 4-spd  1993      NaN
98      15          6  Regular  ...     Manual 5-spd  1993      NaN
99       9          8  Premium  ...  Automatic 4-spd  1993      NaN
[100 rows x 11 columns]

Now you have a new column named mpgData_ containing both True and NaN values. You will use this column to see the effect of na_position when using these two sorting methods. To learn more about using .map(), you can read the Pandas project: Using Python and Pandas to Make Gradebooks.

Understand the na_position parameter in .sort_values()

.sort_values() accepts a parameter named na_position, which helps organize missing data in the column you sort on. If you sort on a column with missing data, the rows with missing values appear at the end of the DataFrame. This happens regardless of whether you are sorting in ascending or descending order.

When you sort the column with missing data, your DataFrame looks like this:

>>> df.sort_values(by="mpgData_")
    city08  cylinders fuelType  ...            trany  year mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985     True
55      18          6  Regular  ...  Automatic 4-spd  1993     True
56      18          6  Regular  ...  Automatic 4-spd  1993     True
57      16          6  Premium  ...     Manual 5-spd  1993     True
59      17          6  Regular  ...  Automatic 4-spd  1993     True
..     ...        ...      ...  ...              ...   ...      ...
94      18          6  Regular  ...  Automatic 4-spd  1993      NaN
96      17          6  Regular  ...  Automatic 4-spd  1993      NaN
97      15          6  Regular  ...  Automatic 4-spd  1993      NaN
98      15          6  Regular  ...     Manual 5-spd  1993      NaN
99       9          8  Premium  ...  Automatic 4-spd  1993      NaN
[100 rows x 11 columns]

To change this behavior and have the missing data appear first in your DataFrame, you can set na_position to first. The na_position parameter accepts only the values last, which is the default, and first. Here is how to use na_position in .sort_values():

>>> df.sort_values(
...     by="mpgData_",
...     na_position="first"
... )
    city08  cylinders fuelType  ...            trany  year mpgData_
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
4       17          4  Premium  ...     Manual 5-spd  1993      NaN
5       21          4  Regular  ...  Automatic 3-spd  1993      NaN
11      18          4  Regular  ...  Automatic 4-spd  1993      NaN
..     ...        ...      ...  ...              ...   ...      ...
32      15          8  Premium  ...  Automatic 4-spd  1993     True
33      15          8  Premium  ...  Automatic 4-spd  1993     True
37      17          6  Regular  ...  Automatic 3-spd  1993     True
85      17          6  Regular  ...  Automatic 4-spd  1993     True
95      17          6  Regular  ...  Automatic 3-spd  1993     True
[100 rows x 11 columns]

Now any missing data in the column you used for sorting will be displayed at the top of the DataFrame. This is useful when you first start analyzing data and are not sure whether there are missing values.
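The placement rules for missing values can be verified on a tiny hypothetical column built the same way as mpgData_. The sketch checks only where the NaN rows land, using .isna():

```python
import pandas as pd

# Hypothetical four-row version of the mpgData column.
df_demo = pd.DataFrame({"mpgData": ["Y", "N", "Y", "N"]})
df_demo["mpgData_"] = df_demo["mpgData"].map({"Y": True})

# Default: missing values go last, in ascending and descending order alike.
nan_flags_default = list(df_demo.sort_values("mpgData_")["mpgData_"].isna())
nan_flags_descending = list(
    df_demo.sort_values("mpgData_", ascending=False)["mpgData_"].isna()
)

# na_position="first" moves the missing values to the top.
nan_flags_first = list(
    df_demo.sort_values("mpgData_", na_position="first")["mpgData_"].isna()
)
```

Each list holds True where the sorted row has a missing value in mpgData_.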

Understand the na_position parameter.sort_index()

.sort_index() also accepts na_position. Your DataFrame usually does not have NaN values ​​as part of its index, so this parameter is in .sort_index(). However, it is good to know that if your DataFrame does have NaN in the row index or column name, then you can use it. sort_index() and quickly recognize this na_position.

By default, this parameter is set to last, which places NaN values at the end of the sorted result. To change this behavior and have missing data appear first in your DataFrame, set na_position to first.
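As a minimal sketch of this behavior, the snippet below builds a small DataFrame whose row index contains one NaN label and sorts it both ways. The name toy and its values are hypothetical and are not part of the EPA data set used in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the row index contains one missing label (NaN).
toy = pd.DataFrame(
    {"city08": [19, 9, 23]},
    index=[2.0, np.nan, 1.0],
)

# Default (na_position="last"): the NaN label ends up at the bottom.
nan_last = toy.sort_index()

# na_position="first": the NaN label moves to the top.
nan_first = toy.sort_index(na_position="first")

print(nan_last)
print(nan_first)
```

The labels that are not missing are sorted in ascending order in both results; only the position of the NaN row changes.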

Use the sort method to modify your DataFrame

In all the examples you have seen so far, both .sort_values() and .sort_index() have returned a new DataFrame object when you called them. This is because sorting in Pandas does not work in place by default. Generally, this is the most common and preferred way of analyzing data with Pandas, because it creates a new DataFrame instead of modifying the original data. This allows you to preserve the state of the data as it was read from the file.
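The following sketch illustrates the default behavior on a small, hypothetical DataFrame named toy (a stand-in for the tutorial's df): the sorted result is a separate object, and the original keeps its order.

```python
import pandas as pd

# Hypothetical toy data standing in for the tutorial's df.
toy = pd.DataFrame({"city08": [19, 9, 23], "cylinders": [4, 12, 4]})

# Without inplace, .sort_values() returns a new, sorted DataFrame.
sorted_toy = toy.sort_values("city08")

print(sorted_toy["city08"].tolist())  # sorted copy: [9, 19, 23]
print(toy["city08"].tolist())         # original order kept: [19, 9, 23]
```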

However, you can modify the original DataFrame directly by setting the optional inplace parameter to True. Many Pandas methods accept an inplace parameter. Below, you will see some examples where inplace=True is used to sort a DataFrame in place.

Use .sort_values() in place

With inplace set to True, you modify the original DataFrame, so the sort method returns None. Sort the DataFrame by the values of the city08 column, as in the first example, but with inplace set to True:

>>>
>>> df.sort_values("city08", inplace=True)

Please note how calling .sort_values() does not return a DataFrame. This is what the original df looks like:

>>>
>>> df
    city08  cylinders fuelType  ...            trany  year mpgData_
99       9          8  Premium  ...  Automatic 4-spd  1993      NaN
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
80       9          8  Regular  ...  Automatic 3-spd  1985      NaN
47       9          8  Regular  ...  Automatic 3-spd  1985      NaN
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
..     ...        ...      ...  ...              ...   ...      ...
9       23          4  Regular  ...  Automatic 4-spd  1993     True
8       23          4  Regular  ...     Manual 5-spd  1993     True
7       23          4  Regular  ...  Automatic 3-spd  1993     True
76      23          4  Regular  ...     Manual 5-spd  1993     True
2       23          4  Regular  ...     Manual 5-spd  1985     True
[100 rows x 11 columns]

In the df object, the values are now sorted in ascending order based on the city08 column. Your original DataFrame has been modified, and the changes will persist. It is usually a good idea to avoid inplace=True for analysis, because changes to the DataFrame cannot be undone.
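If you do want to sort in place, one cautious pattern is to keep a copy first. The sketch below uses a hypothetical toy DataFrame and a hypothetical backup variable; it also shows that the call itself returns None.

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's EPA data set.
toy = pd.DataFrame({"city08": [19, 9, 23]})

# Keep a copy in case the original order is needed later.
backup = toy.copy()

# With inplace=True the method returns None and reorders toy itself.
result = toy.sort_values("city08", inplace=True)

print(result)                     # None
print(toy["city08"].tolist())     # [9, 19, 23]
print(backup["city08"].tolist())  # [19, 9, 23]
```

Because the return value is None, assigning it back to the same name, as in toy = toy.sort_values("city08", inplace=True), would replace the DataFrame with None.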

Use .sort_index() in place

The next example shows that inplace also works with .sort_index().

Since the index was created in ascending order when you read the file into the DataFrame, you can modify the df object again to restore its original order. Use .sort_index() with inplace set to True to modify the DataFrame:

>>>
>>> df.sort_index(inplace=True)
>>> df
    city08  cylinders fuelType  ...            trany  year mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985     True
1        9         12  Regular  ...     Manual 5-spd  1985      NaN
2       23          4  Regular  ...     Manual 5-spd  1985     True
3       10          8  Regular  ...  Automatic 3-spd  1985      NaN
4       17          4  Premium  ...     Manual 5-spd  1993      NaN
..     ...        ...      ...  ...              ...   ...      ...
95      17          6  Regular  ...  Automatic 3-spd  1993     True
96      17          6  Regular  ...  Automatic 4-spd  1993      NaN
97      15          6  Regular  ...  Automatic 4-spd  1993      NaN
98      15          6  Regular  ...     Manual 5-spd  1993      NaN
99       9          8  Premium  ...  Automatic 4-spd  1993      NaN
[100 rows x 11 columns]

Now your DataFrame has been sorted again using .sort_index(). Since your DataFrame still has its default index, sorting it in ascending order puts the data back in its original order.
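The round trip can be checked on a small example. This is only a sketch on hypothetical data (the names toy and original are made up for illustration), and it assumes the DataFrame still has its default 0, 1, 2, ... index:

```python
import pandas as pd

# Hypothetical toy data with the default index 0..3.
toy = pd.DataFrame({"city08": [19, 9, 23, 10]})
original = toy.copy()

toy.sort_values("city08", inplace=True)  # index order is now 1, 3, 0, 2
toy.sort_index(inplace=True)             # index order is back to 0, 1, 2, 3

print(toy.equals(original))  # True
```

If the index had not been in ascending order when the data was loaded, sorting by index would not reproduce the original row order.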

If you are familiar with Python's built-in sorted() function and the list method .sort(), the inplace parameter available in the pandas sort methods may feel very similar. For more information, you can see how to use sorted() and sort() in Python.
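To make the analogy concrete, here is a short side-by-side sketch. The list numbers and the DataFrame toy are hypothetical examples, not part of the tutorial's data.

```python
import pandas as pd

numbers = [19, 9, 23]

# sorted() returns a new list, like the default pandas behavior.
new_list = sorted(numbers)

# list.sort() sorts in place and returns None, like inplace=True.
returned = numbers.sort()

toy = pd.DataFrame({"city08": [19, 9, 23]})

# Default: a new DataFrame comes back, as with sorted().
new_frame = toy.sort_values("city08")

# inplace=True: toy itself is reordered and None comes back, as with list.sort().
returned_frame = toy.sort_values("city08", inplace=True)

print(new_list, returned)
print(returned_frame)
```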

Conclusion

You now know how to use two core methods of the pandas library: .sort_values() and .sort_index(). With this knowledge, you can perform basic data analysis with a DataFrame. Although the two methods have much in common, understanding their differences makes it clear which one to use for a given analysis task.

In this tutorial, you learned how to:

• Sort Pandas DataFrame by one or more column values
• Use the ascending parameter to change the sort order
• Sort a DataFrame by its index using .sort_index()
• Organize missing data when sorting values
• Sort a DataFrame in place using inplace set to True

These methods are an important part of being proficient in data analysis. They will help you build a strong foundation on which you can perform more advanced Pandas operations. If you want to see some examples of more advanced usage of Pandas sorting methods, then the Pandas documentation is a great resource.
