Preface

Git is a tool we use every day, so everyone in the industry must be very familiar with its basic usage, for example:

  • Use git-blame to find out which colleague introduced the bug in a certain line, and let him take the blame;
  • Use git-merge to merge other people's code into your own flawless branch, and then find that the unit tests no longer pass;
  • Use git-push -f to overwrite all the commits of everyone else on the team.

Beyond that, Git is actually a key-value database with versioning built in:

  • All committed content is stored under the directory .git/objects/ ;
  • There are blob objects that store file content, tree objects that store file metadata, commit objects that store commit records, and so on;
  • Git provides key-value style read and write commands: git-cat-file and git-hash-object .
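
To make the "key" in this key-value store concrete, here is a minimal sketch of how the key of a blob is derived, assuming a repository that uses the default SHA-1 object format. The function name git_blob_id is made up for this illustration; only hashlib from the standard library is used.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content.

    Assumes the default SHA-1 object format (not SHA-256 repositories).
    """
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # The empty blob has a well-known ID in SHA-1 repositories.
    print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The printed value should match what git-hash-object reports for an empty file, and that ID is the key later handed to git-cat-file.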

Friends who have read my previous article "What exactly are we…" should know that if a merge is not a fast-forward, a new commit is generated, and it has two parent commit objects. Take the repository of gin, the well-known Go web framework, as an example. The commit with hash e38955615a14e567811e390c87afe705df957f3a was generated by a merge, so its content has two parent lines:

➜  gin git:(master) git cat-file -p 'e38955615a14e567811e390c87afe705df957f3a'
tree 93e5046e502847a6355ed26223a902b4de2de7c7
parent ad087650e9881c93a19fd8db75a86968aa998cac
parent ce26751a5a3ed13e9a6aa010d9a7fa767de91b8c
author Javier Provecho Fernandez <javiertitan@gmail.com> 1499534953 +0200
committer Javier Provecho Fernandez <javiertitan@gmail.com> 1499535020 +0200

Merge pull request #520 from 178inaba/travis-import_path

Through the parent attribute of each commit, all commit objects form a directed acyclic graph. But, clever as you are, you will have noticed that the output of git-log is linear, so Git must be using some kind of graph traversal algorithm.

Check man git-log and you will find this in the Commit Ordering section:

By default, the commits are shown in reverse chronological order.

Smart as you are, you must already know how to implement this graph traversal algorithm.

Write your own git-log

Parse the commit object

If you want to print commit objects in the correct order, you have to parse them first. We don't need to open the file, read the byte stream, and decompress the content from scratch; just call git-cat-file as above. Two pieces of the content printed by git-cat-file need to be extracted for later use:

  • The lines starting with parent: the hash on each of these lines is used to locate a node in the directed acyclic graph;
  • The line starting with committer: the UNIX timestamp on this line is the sort key that decides which node comes next.

You can write a class in Python to parse a commit object:

import re
import subprocess
from typing import List

import click  # used by the command-line entry point at the end of the script


class CommitObject:
    """The parsed result of a commit object in Git."""
    def __init__(self, *, commit_id: str) -> None:
        self.commit_id = commit_id

        file_content = self._cat_file(commit_id)
        # The metadata lives in the header, which ends at the first blank
        # line; the commit message after it must not be scanned.
        header = file_content.split('\n\n', 1)[0]
        self.parents = self._parse_parents(header)
        self.timestamp = self._parse_commit_timestamp(header)

    def _cat_file(self, commit_id: str) -> str:
        """Return the output of `git cat-file -p` for the given object."""
        cmd = ['git', 'cat-file', '-p', commit_id]
        return subprocess.check_output(cmd).decode('utf-8', errors='replace')

    def _parse_commit_timestamp(self, header: str) -> int:
        """Parse the UNIX timestamp of the commit from the committer line."""
        for line in header.split('\n'):
            if line.startswith('committer '):
                m = re.search(r'committer .*<[^>]*> ([0-9]+)', line.strip())
                if m:
                    return int(m.group(1))
        raise ValueError('no committer timestamp in ' + self.commit_id)

    def _parse_parents(self, header: str) -> List[str]:
        """Collect the hash of every parent line, in order."""
        parents: List[str] = []
        for line in header.split('\n'):
            if line.startswith('parent '):
                parents.append(line[len('parent '):].strip())
        return parents
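
To try the parsing idea without a repository at hand, here is a minimal sketch that applies it to the git-cat-file output shown earlier. The helper parse_commit_text and the constant SAMPLE are hypothetical names introduced for this illustration; they are not part of the script above.

```python
import re
from typing import List, Tuple

# The cat-file output of the gin merge commit quoted earlier in the article.
SAMPLE = """\
tree 93e5046e502847a6355ed26223a902b4de2de7c7
parent ad087650e9881c93a19fd8db75a86968aa998cac
parent ce26751a5a3ed13e9a6aa010d9a7fa767de91b8c
author Javier Provecho Fernandez <javiertitan@gmail.com> 1499534953 +0200
committer Javier Provecho Fernandez <javiertitan@gmail.com> 1499535020 +0200

Merge pull request #520 from 178inaba/travis-import_path
"""


def parse_commit_text(text: str) -> Tuple[List[str], int]:
    """Hypothetical helper: return (parent IDs, committer timestamp)."""
    # The metadata ends at the first blank line; the message follows it.
    header = text.split("\n\n", 1)[0]
    parents = re.findall(r"^parent ([0-9a-f]+)$", header, flags=re.MULTILINE)
    m = re.search(r"^committer .*<[^>]*> ([0-9]+) ", header, flags=re.MULTILINE)
    if m is None:
        raise ValueError("no committer line found")
    return parents, int(m.group(1))


if __name__ == "__main__":
    parents, timestamp = parse_commit_text(SAMPLE)
    print(parents)    # the two parent hashes, in the order they appear
    print(timestamp)  # 1499535020
```

Note that the timestamp comes from the committer line, not the author line, which is why 1499535020 is extracted rather than 1499534953.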

Traverse the directed acyclic graph of commits: the max-heap

Congratulations, the data structures you learned can finally come in handy.

Suppose the CommitObject class above parses the commit in gin with hash e38955615a14e567811e390c87afe705df957f3a ; its parents will then contain two strings:

  • ad087650e9881c93a19fd8db75a86968aa998cac
  • ce26751a5a3ed13e9a6aa010d9a7fa767de91b8c

Of these:

  • The commit with hash ad087650e9881c93a19fd8db75a86968aa998cac was made at Sat Jul 8 12:31:44 ;
  • The commit with hash ce26751a5a3ed13e9a6aa010d9a7fa767de91b8c was made at Jan 28 02:32:44 .

Obviously, if the log is printed in reverse chronological order, the next node to print should be ad087650e9881c93a19fd8db75a86968aa998cac ; you can confirm this with git-log.

After printing ad087650e9881c93a19fd8db75a86968aa998cac , the next commit object to print is selected from among its parent commits and ce26751a5a3ed13e9a6aa010d9a7fa767de91b8c . Obviously, this is a loop:

  1. Among the commit objects waiting to be printed, find the one with the largest commit timestamp;
  2. Print its message;
  3. Put its parent commits into the pool of objects waiting to be printed, and return to the first step.

This process continues until no commit objects are left to print. The commit objects waiting to be printed form a priority queue, which can be implemented with a max-heap.

However, I do not intend to actually implement a heap in this short demonstration; I will use insertion sort instead.

class MyGitLogPrinter():
    def __init__(self, *, commit_id: str, n: int) -> None:
        self.commits: List[CommitObject] = []
        # IDs of every commit enqueued so far, so that a commit reachable
        # through several paths is never printed twice.
        self.seen = set()
        self.times = n

        commit = CommitObject(commit_id=commit_id)
        self._enqueue(commit)

    def run(self):
        i = 0
        while len(self.commits) > 0 and i < self.times:
            # The list is kept sorted, so the head is the newest commit.
            commit = self.commits.pop(0)

            for parent_id in commit.parents:
                if parent_id in self.seen:
                    continue
                parent = CommitObject(commit_id=parent_id)
                self._enqueue(parent)

            print('{} {}'.format(commit.commit_id, commit.timestamp))
            i += 1

    def _enqueue(self, commit: CommitObject):
        if commit.commit_id in self.seen:
            return
        self.seen.add(commit.commit_id)
        # Insertion sort: find the index i where the new node belongs, then
        # insert it there; the elements from i onward shift toward the tail.
        i = 0
        while i < len(self.commits):
            if commit.timestamp > self.commits[i].timestamp:
                break
            i += 1
        self.commits = self.commits[0:i] + [commit] + self.commits[i:]
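
For comparison, the same loop can be sketched with a real heap from Python's standard heapq module. This is only an illustration under stated assumptions: TOY_GRAPH is a made-up in-memory history (the IDs and timestamps are hypothetical, not taken from gin), and log_order is a hypothetical function name.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical toy history: commit ID -> (committer timestamp, parent IDs).
# "M" is a merge commit with parents "A" and "B"; both descend from "R".
TOY_GRAPH: Dict[str, Tuple[int, List[str]]] = {
    "M": (400, ["A", "B"]),
    "A": (300, ["R"]),
    "B": (200, ["R"]),
    "R": (100, []),
}


def log_order(start: str, n: int) -> List[str]:
    """Return up to n commit IDs, largest committer timestamp first."""
    seen = {start}
    # heapq is a min-heap, so timestamps are negated to pop the largest first.
    heap = [(-TOY_GRAPH[start][0], start)]
    order: List[str] = []
    while heap and len(order) < n:
        _, commit_id = heapq.heappop(heap)
        order.append(commit_id)
        for parent_id in TOY_GRAPH[commit_id][1]:
            if parent_id not in seen:  # enqueue each commit at most once
                seen.add(parent_id)
                heapq.heappush(heap, (-TOY_GRAPH[parent_id][0], parent_id))
    return order


if __name__ == "__main__":
    print(log_order("M", 10))  # ['M', 'A', 'B', 'R']
```

In this sketch, commits with equal timestamps are ordered by their ID string, which is a simplification and may not match what git-log prints in that case.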

Finally, provide an entry-point function so that you can try it out:

@click.command()
@click.option('--commit-id', required=True)
@click.option('-n', default=20)
def cli(commit_id: str, n: int):
    MyGitLogPrinter(commit_id=commit_id, n=n).run()


if __name__ == '__main__':
    cli()

The True and False Monkey King, compared

To check whether the commit objects printed by the code above are correct, I first redirect its output to a file:

➜  gin git:(master) python3 ~/SourceCode/python/my_git_log/my_git_log.py --commit-id 'e38955615a14e567811e390c87afe705df957f3a' -n 20 > /tmp/my_git_log.txt

Then print the log in the same format with git-log:

➜  gin git:(master) git log --pretty='format:%H %ct' 'e38955615a14e567811e390c87afe705df957f3a' -n 20 > /tmp/git_log.txt

Finally, let the diff command tell us whether there is any difference between the two files:

➜  gin git:(master) diff /tmp/git_log.txt /tmp/my_git_log.txt
20c20
< 2521d8246d9813d65700650b29e278a08823e3ae 1499266911
\ No newline at end of file
---
> 2521d8246d9813d65700650b29e278a08823e3ae 1499266911

The only difference is that the git-log output has no newline at the end of the file; otherwise the two are exactly the same.

用户bPGfS