Python 数据分析之 pandas 进阶(二)

六、分组

对于“group by”操作，我们通常是指以下一个或多个操作步骤：
（Splitting）按照一些规则将数据分为不同的组
（Applying）对于每组数据分别执行一个函数
（Combining）将结果组合刀一个数据结构中
将要处理的数组是：

    df = pd.DataFrame({
            'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
            'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'C': np.random.randn(8),
            'D': np.random.randn(8)
        })
    df
     
        A    B    C            D
    0    foo    one    0.961295    -0.281012
    1    bar    one    0.901454    0.621284
    2    foo    two    -0.584834    0.919414
    3    bar    three    1.259104    -1.012103
    4    foo    two    0.153107    1.108028
    5    bar    two    0.115963    1.333981
    6    foo    one    1.421895    -1.456916
    7    foo    three    -2.103125    -1.757291

1、分组并对每个分组执行sum函数：

    df.groupby('A').sum()
     
        C            D
    A        
    bar    2.276522    0.943161
    foo    -0.151661    -1.467777

2、通过多个列进行分组形成一个层次索引，然后执行函数：

    df.groupby(['A', 'B']).sum()
     
            C            D
    A    B        
    bar    one    0.901454    0.621284
            three    1.259104        -1.012103
            two    0.115963        1.333981
    foo    one    2.383191    -1.737928
            three    -2.103125    -1.757291
            two    -0.431727    2.027441

七、Reshaping

Stack

    tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                         'foo', 'foo', 'qux', 'qux'],
                        ['one', 'two', 'one', 'two',
                         'one', 'two', 'one', 'two']]))
    tuples
     
    [('bar', 'one'),
     ('bar', 'two'),
     ('baz', 'one'),
     ('baz', 'two'),
     ('foo', 'one'),
     ('foo', 'two'),
     ('qux', 'one'),
     ('qux', 'two')]

    index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
    df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
    df2 = df[:4]
    df2
     
             A            B
    first    second        
    bar    one    -0.907306    -0.009961
            two    0.905177    -2.877961
    baz    one    -0.356070    -0.373447
            two    -1.496644    -1.958782

    stacked = df2.stack()
    stacked 
     
    first  second   
    bar    one     A   -0.907306
                   B   -0.009961
           two     A    0.905177
                   B   -2.877961
    baz    one     A   -0.356070
                   B   -0.373447
           two     A   -1.496644
                   B   -1.958782
    dtype: float64

    stacked.unstack()
     
            A            B
    first    second        
    bar    one    -0.907306    -0.009961
            two    0.905177    -2.877961
    baz    one    -0.356070    -0.373447
            two    -1.496644    -1.958782

    stacked.unstack(1)
     
        second    one           two
    first            
    bar    A    -0.907306    0.905177
            B    -0.009961    -2.877961
    baz    A    -0.356070    -1.496644
            B    -0.373447    -1.958782

八、相关操作

要处理的数组为：

    df
     
                A            B            C            D    F
    2013-01-01    0.000000    0.000000    0.135704    5    NaN
    2013-01-02    0.139027    1.683491    -1.031190    5    1
    2013-01-03    -0.596279    -1.211098    1.169525    5    2
    2013-01-04    0.367213    -0.020313    2.169802    5    3
    2013-01-05    0.224122    1.003625    -0.488250    5    4
    2013-01-06    0.186073    -0.537019    -0.252442    5    5

(一)、统计

1、执行描述性统计：

    df.mean()
     
    A    0.053359
    B    0.153115
    C    0.283858
    D    5.000000
    F    3.000000
    dtype: float64

2、在其他轴上进行相同的操作：

    df.mean(1)
     
    2013-01-01    1.283926
    2013-01-02    1.358266
    2013-01-03    1.272430
    2013-01-04    2.103341
    2013-01-05    1.947899
    2013-01-06    1.879322
    Freq: D, dtype: float64

3、对于拥有不同维度，需要对齐的对象进行操作，pandas会自动的沿着指定的维度进行广播

    dates
    s = pd.Series([1,3,4,np.nan,6,8], index=dates).shift(2)
    s
     
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
     
    2013-01-01   NaN
    2013-01-02   NaN
    2013-01-03     1
    2013-01-04     3
    2013-01-05     4
    2013-01-06   NaN
    Freq: D, dtype: float64

(二)、Apply

对数据应用函数：

    df.apply(np.cumsum)
     
                A            B            C            D    F
    2013-01-01    0.000000    0.000000    0.135704    5    NaN
    2013-01-02    0.139027    1.683491    -0.895486    10    1
    2013-01-03    -0.457252    0.472393    0.274039    15    3
    2013-01-04    -0.090039    0.452081    2.443841    20    6
    2013-01-05    0.134084    1.455706    1.955591    25    10
    2013-01-06    0.320156    0.918687    1.703149    30    15

    df.apply(lambda x: x.max() - x.min())
     
    A    0.963492
    B    2.894589
    C    3.200992
    D    0.000000
    F    4.000000
    dtype: float64

(三)、字符串方法

Series对象在其str属性中配备了一组字符串处理方法，可以很容易的应用到数组中的每个元素。

    s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
    s.str.lower()
     
    0       a
    1       b
    2       c
    3    aaba
    4    baca
    5     NaN
    6    caba
    7     dog
    8     cat
    dtype: object

九、时间序列

1、时区表示：

    rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
    ts = pd.Series(np.random.randn(len(rng)), rng)
    ts
     
    2012-03-06   -0.932261
    2012-03-07   -1.405305
    2012-03-08    0.809844
    2012-03-09   -0.481539
    2012-03-10   -0.489847
    Freq: D, dtype: float64

    ts_utc = ts.tz_localize('UTC')
    ts_utc
     
    2012-03-06 00:00:00+00:00   -0.932261
    2012-03-07 00:00:00+00:00   -1.405305
    2012-03-08 00:00:00+00:00    0.809844
    2012-03-09 00:00:00+00:00   -0.481539
    2012-03-10 00:00:00+00:00   -0.489847
    Freq: D, dtype: float64

2、时区转换

    ts_utc.tz_convert('US/Eastern')
     
    2012-03-05 19:00:00-05:00   -0.932261
    2012-03-06 19:00:00-05:00   -1.405305
    2012-03-07 19:00:00-05:00    0.809844
    2012-03-08 19:00:00-05:00   -0.481539
    2012-03-09 19:00:00-05:00   -0.489847
    Freq: D, dtype: float64

3、时区跨度转换

    rng = pd.date_range('1/1/2012', periods=5, freq='M')
    ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ps = ts.to_period()
    ts
    ps
    ps.to_timestamp()
     
    2012-01-31    0.932519
    2012-02-29    0.247016
    2012-03-31   -0.946069
    2012-04-30    0.267513
    2012-05-31   -0.554343
    Freq: M, dtype: float64
   
  
    2012-01    0.932519
    2012-02    0.247016
    2012-03   -0.946069
    2012-04    0.267513
    2012-05   -0.554343
    Freq: M, dtype: float64
     
    2012-01-01    0.932519
    2012-02-01    0.247016
    2012-03-01   -0.946069
    2012-04-01    0.267513
    2012-05-01   -0.554343
    Freq: MS, dtype: float64

十、画图

    ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
    ts = ts.cumsum()
    ts

图片描述

十一、Categorical

从0.15版本开始，pandas可以在DataFrame中支持Categorical类型的数据。

    df = pd.DataFrame({
            'id':[1,2,3,4,5,6],
            'raw_grade':['a','b','b','a','a','e']
        })
    df
     
        id    raw_grade
    0    1    a
    1    2    b
    2    3    b
    3    4    a
    4    5    a
    5    6    e

1、将原始的grade转换为Categorical数据类型：

    df['grade'] = df['raw_grade'].astype('category', ordered=True)
    df['grade'] 
     
    0    a
    1    b
    2    b
    3    a
    4    a
    5    e
    Name: grade, dtype: category
    Categories (3, object): [a < b < e]

2、将Categorical类型数据重命名为更有意义的名称：

    df['grade'].cat.categories = ['very good', 'good', 'very bad']

3、对类别进行重新排序，增加缺失的类别：

    df['grade'] = df['grade'].cat.set_categories(['very bad', 'bad', 'medium', 'good', 'very good'])
    df['grade']
     
    0    very good
    1         good
    2         good
    3    very good
    4    very good
    5     very bad
    Name: grade, dtype: category
    Categories (5, object): [very bad < bad < medium < good < very good]

4、排序是按照Categorical的顺序进行的而不是按照字典顺序进行：

    df.sort('grade')
     
        id    raw_grade    grade
    5    6    e            very bad
    1    2    b            good
    2    3    b            good
    0    1    a            very good
    3    4    a            very good
    4    5    a            very good

5、对Categorical列进行排序时存在空的类别：

    df.groupby("grade").size()
     
    grade
    very bad     1
    bad          0
    medium       0
    good         2
    very good    3
    dtype: int64

以上代码不想自己试一试吗？
镭矿 raquant提供 jupyter(研究）在线练习学习 python 的机会，无需安装 python 即可运行 python 程序。

Python 数据分析之 pandas 进阶(二)

六、分组

七、Reshaping

八、相关操作

九、时间序列

十、画图

十一、Categorical

raquant

引用和评论

java尝试编写macd，试验顶背离底背离

Anaconda安装教程以及Anaconda和pip配置国内镜像

科学计算编程涉及到的技术栈简介

使用 chardet 判断文件编码需要注意的坑——过大的文件会导致高耗时

Python3 格式化时间（qbit）

manus 的替代品有哪些？使用LLM大模型技术做手机/网页/浏览器自动化操作技术汇总

怎么判断自己下载的 trae 是国际版还是国内版？