itertools.groupby(): a question about the equivalent source code in the Python docs

Documentation: https://docs.python.org/zh-cn...

class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        self.id = object()
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey, self.id))
    def _grouper(self, tgtkey, id):
        while self.id is id and self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)

In _grouper, the loop condition tests while self.id is id. Under what circumstances can self.id is not id actually be true?

2 answers

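groupby itself is an iterator, and the second element of each tuple it yields is a lazy iterator as well (a generator in the pure-Python equivalent). This imposes a design constraint: each group must be consumed in order, before the groupby object is advanced to the next one, as shown below.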

In [1]: from operator import itemgetter

In [2]: from itertools import groupby

In [3]: d1={'name':'zhangsan','age':20,'country':'China'}
   ...: d2={'name':'wangwu','age':19,'country':'USA'}
   ...: d3={'name':'lisi','age':22,'country':'JP'}
   ...: d4={'name':'zhaoliu','age':22,'country':'USA'}
   ...: d5={'name':'pengqi','age':22,'country':'USA'}
   ...: d6={'name':'lijiu','age':22,'country':'China'}
   ...: lst = [d1, d2, d3, d4, d5, d6]

In [4]: lstg = groupby(lst, key=itemgetter('country'))
   ...: for k, gs in lstg:
   ...:     print(k)
   ...:     for g in gs:
   ...:         print(g)
   ...:
   ...:
China
{'name': 'zhangsan', 'age': 20, 'country': 'China'}
USA
{'name': 'wangwu', 'age': 19, 'country': 'USA'}
JP
{'name': 'lisi', 'age': 22, 'country': 'JP'}
USA
{'name': 'zhaoliu', 'age': 22, 'country': 'USA'}
{'name': 'pengqi', 'age': 22, 'country': 'USA'}
China
{'name': 'lijiu', 'age': 22, 'country': 'China'}

Now look at what happens when the groups are not iterated in order:

In [7]: lstg = groupby(lst, key=itemgetter('country'))
   ...: k1, gs1 = next(lstg)
   ...: k2, gs2 = next(lstg)
   ...: list(gs2)
   ...:
Out[7]: [{'name': 'wangwu', 'age': 19, 'country': 'USA'}]

In [8]: list(gs1)
Out[8]: []

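By the time gs2 has been iterated, gs1 is no longer valid (it was invalidated as soon as next(lstg) was called a second time). This is a design constraint, and it is enforced by the first line of _grouper: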

    while self.id is id and self.currkey == tgtkey:

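The check self.currkey == tgtkey alone is not enough to enforce this. Take my data as an example: lst is not sorted by 'country', so keys such as 'China' and 'USA' each appear twice while iterating the groupby object. Without another check, a stale generator for the first 'China' group would see its key match again once the last 'China' group becomes current, and it would wrongly yield that group's items. That is why self.id is needed as a marker.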
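The role of the marker can be checked with a small experiment. The sketch below copies the pure-Python equivalent from the question and adds one thing that is not in the original: a hypothetical check_id switch that turns the identity test off. The class name GroupBySketch, the helper stale_first_group, and the input "aabbAA" are likewise made up for illustration.

```python
class GroupBySketch:
    """Pure-Python groupby modelled on the equivalent code in the question.

    check_id is a hypothetical switch added only for this demonstration.
    """

    def __init__(self, iterable, key=None, check_id=True):
        self.keyfunc = key if key is not None else (lambda x: x)
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
        self.check_id = check_id

    def __iter__(self):
        return self

    def __next__(self):
        # A fresh token per group: groupers handed out earlier hold an old one.
        self.id = object()
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)  # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey, self.id))

    def _grouper(self, tgtkey, id):
        # With check_id=False only the key comparison guards the loop.
        while (not self.check_id or self.id is id) and self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)


def stale_first_group(check_id):
    """Advance to the third group, then iterate the (stale) first group."""
    # Keys are 'A', 'B', 'A': the first and third groups share a key.
    groups = GroupBySketch("aabbAA", key=str.upper, check_id=check_id)
    _, first = next(groups)
    next(groups)  # move on to the 'B' group
    next(groups)  # move on to the second 'A' group
    return list(first)


print(stale_first_group(check_id=True))   # []
print(stale_first_group(check_id=False))  # ['A', 'A']
```

With the identity test in place the stale first group yields nothing. With it disabled, the stale group yields the uppercase items that belong to the third group, because its key compares equal again.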

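In the given snippet, self.id is not updated while a single group is being iterated, so self.id is id stays true for the group most recently returned. It is reassigned to a fresh object() on every call to __next__, however, so the test becomes false for any grouper that was handed out by an earlier call.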
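A quick way to test whether the identity check can ever fail is to advance the groupby object while an earlier group is only partly consumed. The following is a minimal sketch using the standard-library itertools.groupby, on the assumption that CPython's C implementation behaves like the pure-Python equivalent quoted in the question. The input string and variable names are made up for illustration.

```python
from itertools import groupby

groups = groupby("AAABBB")

key1, group1 = next(groups)
first_item = next(group1)    # consume only part of the first group

key2, group2 = next(groups)  # advancing the outer iterator invalidates group1

leftover = list(group1)      # the two remaining 'A' items are not yielded
second = list(group2)        # the current group is still usable

print(first_item, leftover, second)  # A [] ['B', 'B', 'B']
```

If the items of an earlier group are needed later, copy them into a list before calling next on the groupby object again.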
