This article is from:
Li Zhennan, GitLab R&D Engineer
Warriors, have you ever wondered how Git and GitLab work? Now, grab your beloved IDE and join us on a journey of discovery!
Basic knowledge
Before we set off, we need three minutes to stock up on background knowledge. The clock starts now!
Inside the Git repository
A project that uses Git will have a .git folder (hidden) in its root directory, which carries all the information saved by Git. Here are the parts we focus on this time:
.git
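├── HEAD # the branch (ref) the working tree is currently on
├── objects # Git objects; from these Git can rebuild every commit in the repository and all files as of each commit
│   ├── 20 # loose objects, sharded into folders by the first byte of the object hash so that no single directory holds too many files
│   │   └── 7151a78fb5e2d99f1185db7ebbd7d883ebde6c
│   ├── 43 # another group of loose objects
│   │   └── 49b682aeaf8dc281c7a7c8d8460f443835c0c2
│   └── pack # compressed (packed) objects
└── refs # branches; each file contains a commit hash
    ├── heads
    │   ├── feat
    │   │   └── hello-world # a feature branch
    │   └── main # the main branch
    ├── remotes
    │   └── origin
    │       └── HEAD # the remote branch as recorded locally
    └── tags # tags; each file contains a commit hash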
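To make the objects layout concrete, here is a minimal Python sketch of how a file's content maps to a hash and then to a loose-object path. The function names are made up for illustration, and it assumes the default SHA-1 object format; it only computes names and does not read or write a real repository.

```python
import hashlib
from pathlib import PurePosixPath


def git_blob_hash(content: bytes) -> str:
    """Hash file content the way Git hashes a blob: SHA-1 over a small header plus the data."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_hash: str) -> PurePosixPath:
    """Map an object hash to its loose-object location under .git/objects."""
    # The first byte (two hex characters) names the shard folder,
    # the remaining characters name the file inside it.
    return PurePosixPath(".git/objects") / object_hash[:2] / object_hash[2:]


if __name__ == "__main__":
    digest = git_blob_hash(b"hello\n")
    print(digest, "->", loose_object_path(digest))
```

The two-character prefix is exactly the 20 and 43 folders shown in the tree above.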
Figure: Pro Git on git-scm.com
Legend: The red part comes from refs and the rest comes from objects. A commit object (yellow) points to the tree object (blue) that records the file structure, and the tree object points to the individual file objects (gray).
The Git server stores only the contents of the .git folder (a so-called bare repository). git clone pulls this information from the remote to the local machine and then rebuilds the repository state that HEAD points to; git push sends local refs and their associated commit objects, tree objects, and file objects to the remote.
Git compresses objects as they are transmitted over the network, and the compressed objects are called packfiles.
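As a small illustration, a packfile starts with a fixed 12-byte header: the signature PACK, a 4-byte big-endian version, and a 4-byte big-endian object count. The Python sketch below (the function name is an assumption for illustration) reads just that header. Applied to the start of the packfile captured later in this article, it reports version 2 and 3 objects.

```python
from __future__ import annotations

import struct


def parse_pack_header(data: bytes) -> tuple[int, int]:
    """Return (version, object_count) from the first 12 bytes of a packfile."""
    if len(data) < 12:
        raise ValueError("a packfile header needs at least 12 bytes")
    # ">4sII": 4-byte signature, then two big-endian unsigned 32-bit integers.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile: missing PACK signature")
    return version, count


if __name__ == "__main__":
    print(parse_pack_header(b"PACK\x00\x00\x00\x02\x00\x00\x00\x03"))
```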
Git transport protocol
Let's sort out what happens with git push in chronological order:
- User runs git push on the client
- Git's git-send-pack on the client passes along the repository identifier and invokes the git-receive-pack service on the server
- The server returns the commit hash of each ref in the server-side repository. Each hash is written as 40 characters of hex-encoded text, which looks like this:
001f# service=git-receive-pack
000000c229859bcc73cdab4db2b70ed681077a5885f80134 refs/heads/main\x00report-status report-status-v2 delete-refs side-band-64k quiet atomic ofs-delta push-options object-format=sha1 agent=git/2.37.1.gl1
0000
We can see that the server's main branch is at 29859bcc73cdab4db2b70ed681077a5885f80134 (ignore the protocol framing in front of it: 0000 is a flush packet and 00c2 is the length prefix of the line).
- Based on the returned refs, the client works out which commits it has that the server lacks, and tells the server which refs are about to change:
009f0000000000000000000000000000000000000000 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c refs/heads/feat/hello-world
In the example above, we are pushing a new branch, feat/hello-world, which now points to 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c. Because the branch is new, its previous value is given as the all-zero hash 0000000000000000000000000000000000000000 (40 zeros).
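- The client packs and compresses the relevant commits and their tree objects and file objects into a packfile and sends it to the server. Packfiles are binary (in the capture below, the leading text is the capability list that ends the ref-update line above, followed by a 0000 flush packet; the packfile itself starts at PACK):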
report-status side-band-64k agent=git/2.20.10000PACK\x00\x00\x00\x02\x00\x00\x00\x03\x98\x0cx\x9c\x8d\x8bI
\xc30\x0c\x00\xef~\x85\xee\x85"[^$(\xa5_\x91m\x85\xe6\xe0\xa4\x04\xe7\xff]^\xd0\xcb0\x87\x99y\x98A\x11\xa5\xd8\xab,\xbdSA]Z\x15\xcb(\x94|4\xdf\x88\x02&\x94\xa0\xec^z\xd86!\x08'\xa9\xad\x15j]\xeb\xe7\x0c\xb5\xa0\xf5\xcc\x1eK\xd1\xc4\x9c\x16FO\xd1\xe99\x9f\xfb\x01\x9bn\xe3\x8c\x01n\xeb\xe3\xa7\xd7aw\xf09\x07\xf4\\\x88\xe1\x82\x8c\xe8\xda>\xc6:\xa7\xfd\xdb\xbb\xf3\xd5u\x1a|\xe1\xde\xac\xe29o\xa9\x04x\x9c340031Q\x08rut\xf1u\xd5\xcbMap\xf6\xdc\xd6\xb4n}\xef\xa1\xc6\xe3\xcbO\xdcp\xe3w\xb10=p\xc8\x10\xa2(%\xb1$U\xaf\xa4\xa2\x84\xa1T\xe5\x8eO\xe9\xcf\xd3\x0c\\R\x7f\xcf\xed\xdb\xb9]n\xd1\xea3\xa2\x00\xd3\x86\x1db\xbb\x02x\x9c\x01+\x00\xd4\xff2022\xe5\xb9\xb4 09\xe6\x9c\x88 01\xe6\x97\xa5 \xe6\x98\x9f\xe6\x9c\x9f\xe5\x9b\x9b 15:52:13 CST
\xa4d\x11\xa1\xe8\x86\xdeQ\x90\xb1\xe0Z\xfd\x7f\x91\x90\xc3\xd6\x17\xe8\x02&K\xd0
- The server unpacks the packfile, updates the ref, and returns the processing result:
003a\x01000eunpack ok
0023ok refs/heads/feat/hello-world
The Git transport protocol can be carried by SSH or HTTP(S).
Still pretty straightforward, right?
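The four hex digits at the start of each protocol line above are pkt-line length prefixes: the length covers the prefix itself, and 0000 is a flush packet. The Python sketch below shows how such lines can be framed and split. The function names are made up for illustration, and it skips the error handling a real client needs.

```python
from __future__ import annotations


def encode_pkt_line(payload: bytes) -> bytes:
    """Prefix the payload with its total length (prefix included) as 4 hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def decode_pkt_lines(stream: bytes) -> list[bytes | None]:
    """Split a byte stream into pkt-line payloads; a flush packet becomes None."""
    packets: list[bytes | None] = []
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:  # "0000" is the flush packet
            packets.append(None)
            pos += 4
            continue
        if length < 4:
            raise ValueError("invalid pkt-line length")
        packets.append(stream[pos + 4:pos + length])
        pos += length
    return packets


def parse_ref_advertisement(payload: bytes) -> tuple[str, str, list[str]]:
    """Split '<hash> <ref>NUL<capabilities>' into hash, ref name, and capability list."""
    ref_part, _, caps = payload.rstrip(b"\n").partition(b"\x00")
    commit_hash, _, ref_name = ref_part.partition(b" ")
    return commit_hash.decode(), ref_name.decode(), caps.decode().split()


if __name__ == "__main__":
    print(encode_pkt_line(b"unpack ok\n"))
    print(decode_pkt_lines(b"001f# service=git-receive-pack\n0000"))
```

Encoding "unpack ok" plus a newline yields the 000e prefix seen in the server's reply above.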
Components of GitLab
JiHu GitLab is a widely used Git code hosting service that supports collaborative development, task tracking, CI/CD, and more.
GitLab is not a single monolithic service. Taking major version 15 as an example, the components involved in git push are:
- JiHu GitLab: Written in Ruby and split into two parts: the web/API service (referred to as Rails below) and the task queue/background jobs (referred to as Sidekiq below).
- Gitaly: Written in Go, it is GitLab's Git backend, responsible for storing, reading, and writing Git repositories, and it exposes Git operations as gRPC calls. In the early days Rails operated directly on Git repositories on NFS through the Git command line; as the scale grew, the network I/O latency became painful, so Gitaly was split out.
- Workhorse: Written in Go, it is a front-end proxy for Rails that handles "slow" HTTP requests such as Git push/pull and file download/upload. In the early days these requests were handled by Rails, where they tied up considerable CPU and memory for a long time. To keep the service stable, GitLab had to set the git clone timeout to 1 minute, which in turn caused a usability problem: large repositories could not be cloned completely. Goroutines are much cheaper, which makes Go a good fit for this kind of request.
- JiHu GitLab Shell: Written in Go, it accepts and authenticates Git SSH connections and relays data between the user's Git client and Gitaly.
- JiHu GitLab Runner: Written in Go, it executes CI/CD jobs.
- GitLab's data is stored in Postgres, and Redis is used for caching. Rails and Sidekiq directly connect to the database and cache, and other components read and write data through the API exposed by Rails.
Figure: Simplified GitLab component diagram / GitLab architecture overview on docs.gitlab.cn
Start git push!
Three minutes went by so fast! Now that you have the basics down, let's start the journey!
Do you like SSH?
If your remote address is git@jihulab.example.com:user/repo.git, then you are using SSH to communicate with GitLab. When you run git push, your Git client's send-pack is essentially executing the following command:
ssh -x git@jihulab.example.com "git-receive-pack 'user/repo.git'"
This raises a couple of questions:
- Everyone's username is git, so how does the server tell who is who? (How could anyone tell us apart?)
- It's SSH, so can I run arbitrary commands on the server?
Both problems are solved by gitlab-sshd, part of GitLab Shell. It is a customized SSH daemon that speaks the same SSH protocol as a regular sshd, so the client cannot tell them apart. During the SSH handshake the client presents its public key; gitlab-sshd calls the Rails internal API GET /api/v4/internal/authorized_keys to check whether that key is registered in GitLab and to get the corresponding public key ID (which identifies the user), and then verifies that the handshake signature was produced by the private key matching that public key.
In addition, gitlab-sshd restricts which commands the client can run. In practice it matches the command the user asked for against the methods it implements, and rejects any command that has no corresponding method.
Unfortunately, it seems that we can't run bash or rm -rf / on the GitLab server via SSH. ┑( ̄Д  ̄)┍
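The command matching described above can be pictured as a small allowlist lookup. The Python sketch below is purely illustrative: the handler names and the two-entry table are assumptions, not gitlab-sshd's real code (which is written in Go and supports more commands).

```python
from __future__ import annotations

import shlex

# Hypothetical allowlist: Git transport commands mapped to made-up handler names.
ALLOWED_COMMANDS = {
    "git-receive-pack": "handle_push",
    "git-upload-pack": "handle_fetch",
}


def dispatch_ssh_command(command_line: str) -> tuple[str, str]:
    """Return (handler name, repository path) or reject the command."""
    try:
        parts = shlex.split(command_line)
    except ValueError as exc:  # e.g. unbalanced quotes
        raise PermissionError(f"malformed command: {command_line!r}") from exc
    if len(parts) != 2 or parts[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command_line!r}")
    return ALLOWED_COMMANDS[parts[0]], parts[1]


if __name__ == "__main__":
    print(dispatch_ssh_command("git-receive-pack 'user/repo.git'"))
```

Anything outside the table, such as bash or rm -rf /, never reaches a shell; it is simply refused.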
Interestingly, in the early days GitLab really used sshd to respond to Git requests. In order to solve the above two problems, they wrote authorized_keys like this:
# Managed by gitlab-rails
command="/bin/gitlab-shell key-1",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAiPWx6WM4lhHNedGfBpPJNPpZ7yKu+dnn1SJejgt1016k6YjzGGphH2TUxwKzxcKDKKezwkpfnxPkSMkuEspGRt/aZZ9wa++Oi7Qkr8prgHc4soW6NUlfDzpvZK2H5E7eQaSeP3SAwGmQKUFHCddNaP0L+hM7zhFNzjFvpaMgJw0=
command="/bin/gitlab-shell key-2",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAIEAiPWx6WM4lhHNedGfBpPJNPpZ7yKu+dnn1SJejgt1026k6YjzGGphH2TUxwKzxcKDKKezwkpfnxPkSMkuEspGRt/aZZ9wa++Oi7Qkr8prgHc4soW6NUlfDzpvZK2H5E7eQaSeP3SAwGmQKUFHCddNaP0L+hM7zhFNzjFvpaMgJw0=
Yes, you guessed right: every GitLab user's public key went into this file, which could reach hundreds of MB in size! Plain and simple!
The command option overrides whatever command the SSH client asks to run, so sshd always starts gitlab-shell with the public key ID as its argument. gitlab-shell reads the command the client originally wanted to execute from the SSH_ORIGINAL_COMMAND environment variable set by sshd, and then runs the matching method.
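The following Python sketch shows both halves of that mechanism under stated assumptions: how an authorized_keys entry of the shape shown above could be generated, and what a gitlab-shell-like program sees when sshd launches it. The function names are hypothetical; only the entry format and SSH_ORIGINAL_COMMAND come from the text.

```python
from __future__ import annotations

RESTRICTIONS = "no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty"


def authorized_keys_entry(key_id: int, public_key: str,
                          shell: str = "/bin/gitlab-shell") -> str:
    """Build one authorized_keys line that forces the given shell for this key."""
    return f'command="{shell} key-{key_id}",{RESTRICTIONS} {public_key}'


def read_forced_command_context(argv: list[str], environ: dict) -> tuple[int, str]:
    """Recover (key ID, original client command) inside the forced command."""
    if len(argv) < 2 or not argv[1].startswith("key-"):
        raise ValueError("expected an argument of the form key-<id>")
    key_id = int(argv[1].split("-", 1)[1])
    # sshd stores the command the client actually requested in this variable.
    original = environ.get("SSH_ORIGINAL_COMMAND", "")
    return key_id, original


if __name__ == "__main__":
    print(authorized_keys_entry(1, "ssh-rsa AAAA...example"))
    print(read_forced_command_context(
        ["gitlab-shell", "key-1"],
        {"SSH_ORIGINAL_COMMAND": "git-receive-pack 'user/repo.git'"},
    ))
```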
Because sshd scans authorized_keys linearly, when the file is very large the keys of early-registered users (near the top of the file) are matched much sooner than those of users who registered later. In other words, SSH authentication was faster for old users than for new ones, and noticeably so. (A genuine perk for veteran users.)
Figure: A special perk for veteran users: extra-long git push times / xkcd-excuse.com
Now that gitlab-sshd goes through the Rails API, which is backed by a Postgres index, this bug (feature?) no longer exists.
After authenticating the user, gitlab-sshd checks whether the user has write permission to the target repository (POST /api/v4/internal/allowed), and learns which Gitaly instance hosts the repository, along with the user ID and repository information.
Finally, gitlab-sshd calls the SSHReceivePack method of that Gitaly instance and acts as a relay and translator between the Git client (SSH) and Gitaly (gRPC).
For these last two steps, gitlab-shell behaves the same way as gitlab-sshd.
From a macro perspective, a git push via SSH looks like this:
- User executes git push;
- The Git client connects to gitlab-shell over SSH;
- gitlab-shell uses the client's public key to call GET /api/v4/internal/authorized_keys, obtains the public key ID, and completes the SSH handshake;
- gitlab-shell uses the public key ID and repository address to call POST /api/v4/internal/allowed to confirm that the user has write permission to the repository;
- API returns: Gitaly address and authentication token, repo object, hook callback information (logical username GL_ID, logical project name GL_REPOSITORY);
- gitlab-shell uses the above information to call Gitaly's SSHReceivePack method to become the relay between the client and Gitaly;
- Gitaly runs git-receive-pack in the appropriate working directory, and pre-sets the environment variable GITALY_HOOKS_PAYLOAD, which contains GL_ID, GL_REPOSITORY, etc.;
- Server Git tries to update refs and runs Git hooks;
- Finish.
We'll come back to Gitaly and ref updates later.
Do you prefer HTTP(S)?
HTTP(S) remote addresses take the form https://jihulab.example.com/user/repo.git. Unlike SSH, HTTP is stateless and strictly request/response. When you run git push, the Git client calls two endpoints in order:
- GET https://jihulab.example.com/user/repo.git/info/refs?service=git-receive-pack: the server returns, in the body, the commit hash of each branch in the server-side repository.
- POST https://jihulab.example.com/user/repo.git/git-receive-pack: the client submits, in the body, the branches to update with their old and new commit hashes, along with the required packfile. The server returns the processing result in the body, together with the familiar "To create a merge request" prompt:
003a\x01000eunpack ok
0023ok refs/heads/feat/hello-world
00000085\x02
To create a merge request for feat/hello-world, visit:
https://jihulab.example.com/user/repo/-/merge_requests/new?merge_request%5Bs0029\x02ource_branch%5D=feat%2Fhello-world
0000
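The \x01 and \x02 bytes in the response above are side-band channel markers: with side-band-64k, the first payload byte of each pkt-line selects a channel (1 for the primary data, 2 for progress messages shown to the user, 3 for fatal errors). Here is a minimal Python sketch of splitting a response by channel; the function name is an assumption and error handling is omitted.

```python
from __future__ import annotations


def demux_sideband(stream: bytes) -> dict[int, bytes]:
    """Collect the bytes carried on each side-band channel of a pkt-line stream."""
    channels: dict[int, bytes] = {}
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:  # flush packet: nothing to collect
            pos += 4
            continue
        band = stream[pos + 4]  # first payload byte selects the channel
        data = stream[pos + 5:pos + length]
        channels[band] = channels.get(band, b"") + data
        pos += length
    return channels


if __name__ == "__main__":
    response = (b"003a\x01000eunpack ok\n"
                b"0023ok refs/heads/feat/hello-world\n0000"
                b"0000")
    for band, data in demux_sideband(response).items():
        print(band, data)
```

Channel 1 here carries nested pkt-lines (the unpack ok and ok refs/... report), which is why length prefixes appear inside the outer packet.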
The above two requests are intercepted by Workhorse, which does two things each time:
- It forwards the request to Rails unchanged, and Rails returns the authentication result, the user ID, and the Gitaly instance information for the repository (a bit odd, right? Rails' info/refs and git-receive-pack endpoints are really only used for authentication; I guess there are historical reasons behind this);
- Workhorse establishes a connection with Gitaly based on the information returned by Rails in the previous step, acting as a relay between the client and Gitaly.
To summarize, a git push over HTTP(S) looks like this:
- User executes git push;
- The Git client calls GET https://jihulab.example.com/user/repo.git/info/refs?service=git-receive-pack with the appropriate authorization header;
- Workhorse intercepts the request, forwards it to Rails unchanged, and obtains the authentication result, the user ID, and the Gitaly instance information for the repository;
- Using the information Rails returned in the previous step, Workhorse calls Gitaly's gRPC service InfoRefsReceivePack and acts as a relay between the client and Gitaly;
- Gitaly runs git-receive-pack in the appropriate working directory and returns the refs information;
- The Git client calls POST https://jihulab.example.com/user/repo.git/git-receive-pack;
- Workhorse intercepts the request, forwards it to Rails unchanged, and obtains the authentication result, the user ID, and the Gitaly instance information for the repository;
- Using the information Rails returned in the previous step, Workhorse calls Gitaly's gRPC service PostReceivePack and acts as a relay between the client and Gitaly;
- Gitaly runs git-receive-pack in the appropriate working directory, and pre-sets the environment variable GITALY_HOOKS_PAYLOAD, which contains GL_ID, GL_REPOSITORY, etc.;
- Server Git tries to update refs and runs Git hooks;
- Finish.
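From the client's point of view, the whole exchange above boils down to two HTTP requests derived from the remote address. The Python sketch below only builds that request plan; the function name is made up, and it deliberately sends nothing over the network.

```python
from __future__ import annotations


def receive_pack_requests(remote_url: str) -> list[tuple[str, str]]:
    """Return the (method, URL) pairs a push over HTTP(S) issues, in order."""
    base = remote_url.rstrip("/")
    return [
        # Step 1: ask the server for its current refs.
        ("GET", f"{base}/info/refs?service=git-receive-pack"),
        # Step 2: send the ref updates and the packfile.
        ("POST", f"{base}/git-receive-pack"),
    ]


if __name__ == "__main__":
    for method, url in receive_pack_requests("https://jihulab.example.com/user/repo.git"):
        print(method, url)
```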
Gitaly and Git Hooks
Phew... Having covered the connection layer and access control, we can finally approach the Git core of GitLab: Gitaly.
The name Gitaly is actually a joke: a tribute to Git and to the Russian town of Aly. Aly's resident population in the 2010 Russian census was 0, and Gitaly's engineers hope that most Gitaly operations will likewise need 0 disk I/O.
Software-engineer in-jokes are a bit too hardcore; most people probably won't get them...
Gitaly is responsible for storing and operating on GitLab repositories. It runs the local Git binary via fork/exec and uses cgroups to keep any single Git process from consuming too much CPU and memory. Repositories are stored on local disk at paths such as /var/opt/gitlab/git-data/repositories/@hashed/b1/7e/b17ef6d19c7a5b1ee83b907c595526dcb1eb06db8227d650d5dda0a9f4ce8cd9.git. Early GitLab/Gitaly used the form #{namespace}/#{project_name}.git instead, but both namespace and project_name can be changed by the user, which brought extra runtime overhead.
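The @hashed path above follows a visible pattern: the first two and the next two hex characters of a 64-character hash become folder names, followed by the full hash. The Python sketch below reproduces that pattern. Deriving the hash from the SHA-256 of the project ID is an assumption taken from GitLab's hashed-storage documentation rather than from this article, and the function names are hypothetical.

```python
import hashlib


def hashed_repo_path(disk_hash: str) -> str:
    """Build the relative repository path from a 64-character hex hash."""
    return f"@hashed/{disk_hash[:2]}/{disk_hash[2:4]}/{disk_hash}.git"


def disk_hash_for_project(project_id: int) -> str:
    """Assumption: the hash is the SHA-256 of the project ID written as a decimal string."""
    return hashlib.sha256(str(project_id).encode("ascii")).hexdigest()


if __name__ == "__main__":
    print(hashed_repo_path(disk_hash_for_project(1)))
```

Because the path no longer contains the namespace or project name, renaming either one does not require moving the repository on disk.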
git push corresponds to Gitaly's SSHReceivePack (SSH) and PostReceivePack (HTTPS) methods. Underneath, both run Git's git-receive-pack, so the core ref and object updates are done by the Git binary. git-receive-pack provides hooks that let Gitaly step into this process, which also involves Rails. The request path (responses omitted) is roughly as follows:
When Gitaly starts git-receive-pack, it passes a Base64-encoded JSON document through the environment variable GITALY_HOOKS_PAYLOAD. It includes repository information, the Gitaly Unix socket address and connection token, user information, and which hooks to execute (for git push, always the ones described below). Gitaly also sets Git's core.hooksPath to a temporary folder that it prepares at startup, in which every hook file is a symlink to gitaly-hooks.
Once git-receive-pack starts gitaly-hooks, it reads GITALY_HOOKS_PAYLOAD from the environment, connects back to Gitaly over the Unix socket using gRPC, and tells Gitaly which hook is currently running and what arguments Git passed to it.
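The payload hand-off can be sketched in a few lines of Python: one side serializes a JSON document to Base64 and places it in the environment, and the hook process decodes it again. Only the variable name GITALY_HOOKS_PAYLOAD comes from the text; the JSON field names below are invented for illustration and do not reflect Gitaly's actual schema.

```python
import base64
import json


def encode_hooks_payload(payload: dict) -> str:
    """Serialize a payload to Base64-encoded JSON, suitable for an environment variable."""
    return base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")


def decode_hooks_payload(environ: dict) -> dict:
    """What a hook process could do on startup: read and decode the payload."""
    return json.loads(base64.b64decode(environ["GITALY_HOOKS_PAYLOAD"]))


if __name__ == "__main__":
    # Hypothetical field names and values, for illustration only.
    example = {
        "repository": "user/repo.git",
        "socket": "/tmp/gitaly.example.socket",
        "user": {"gl_id": "user-1"},
        "hooks": ["pre-receive", "update", "post-receive"],
    }
    env = {"GITALY_HOOKS_PAYLOAD": encode_hooks_payload(example)}
    print(decode_hooks_payload(env))
```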
pre-receive hook
This hook is triggered once when Git receives a git push. When it invokes gitaly-hooks, Git writes the change information to the hook's standard input, one line per ref, each meaning "this ref wants to move from commit hash A to commit hash B":
<old commit hash> SP <new commit hash> SP <ref name> LF
where SP is a space and LF is a newline.
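As an illustration, here is how a hook could parse those standard-input lines in Python. The RefChange type and the create/delete/update classification (based on the all-zero hash seen earlier in the push example) are a sketch, not Gitaly's implementation.

```python
from __future__ import annotations

from typing import NamedTuple

ZERO_HASH = "0" * 40  # all-zero hash: the ref does not exist on that side


class RefChange(NamedTuple):
    old: str
    new: str
    ref: str

    @property
    def kind(self) -> str:
        """Classify the change by looking for the all-zero hash."""
        if self.old == ZERO_HASH:
            return "create"
        if self.new == ZERO_HASH:
            return "delete"
        return "update"


def parse_changes(stdin_text: str) -> list[RefChange]:
    """Parse '<old> SP <new> SP <ref> LF' lines as fed to a pre-receive hook."""
    changes = []
    for line in stdin_text.splitlines():
        if not line:
            continue
        old, new, ref = line.split(" ", 2)
        changes.append(RefChange(old, new, ref))
    return changes


if __name__ == "__main__":
    text = f"{ZERO_HASH} 8fa91ae7af0341e6524d1bc2ea067c99dff65f1c refs/heads/feat/hello-world\n"
    for change in parse_changes(text):
        print(change.kind, change.ref)
```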
After the above information is returned to Gitaly, Gitaly will call the two interfaces of Rails in turn:
- POST /api/v4/internal/allowed: This interface was previously called during authentication at the connection layer. This time, additional change information is attached, and Rails can make finer-grained judgments based on it, such as disabling force push and judging whether the branch is protected, etc.
- POST /api/v4/internal/pre_receive: Notifies Rails that the current repository is about to have a write update, and Rails adds 1 to the reference count of this repository, which prevents the Git write operation of the repository from being interrupted by major changes elsewhere.
If POST /api/v4/internal/allowed returns an error, Gitaly passes the error to gitaly-hooks, which writes an error message to standard error and exits with a non-zero exit code. git-receive-pack collects that message and writes it to its own standard error. The non-zero exit code from gitaly-hooks makes git-receive-pack stop processing the current git push and exit, also with a non-zero code. Control then returns to Gitaly, which collects the standard error output of git-receive-pack and sends it in the gRPC response to Workhorse/GitLab Shell.
Attentive readers may ask: by the time the hooks run, the related objects must already have been uploaded to the server. What happens to these dangling objects if the push is rejected?
In fact, objects belonging to a git push that has not yet been accepted are first written to a quarantine area. They are stored separately in a subfolder under objects, with a name like incoming-8G4u9v, so that if the hooks decide there is a problem with the push, the related resources can be cleaned up easily.
update hook
This hook will be triggered just before Git actually updates the ref, once for each ref. The input parameters are passed in from the command line parameters: the ref to be updated, the old commit hash, and the new commit hash. Currently this hook does not interact with Rails.
GitLab also supports custom Git hooks: pre-receive, update, and post-receive. They are run by Gitaly when gitaly-hooks notifies it that the corresponding hook is executing, so this is the point at which a custom update hook is triggered.
Photo/Vishal Jadhav on Unsplash
The hook in the picture has a long-standing connection with computer science... ahem, fine, I can't keep making this up. I was just worried you might be dozing off by now, so here's a picture to help you relax~
post-receive hook
After all refs have been updated, Git executes a post-receive hook that gets the same parameters as the pre-receive hook.
After Gitaly receives the reminder from gitaly-hooks, it will call Rails' POST /api/v4/internal/post_receive, and Rails will do a lot of things at this time:
- Returns a message that reminds the user to create a Merge Request;
- Decrements by 1 the repository reference count mentioned in the pre-receive section;
- Refresh the warehouse cache;
- trigger CI;
- Email if applicable.
Some of these operations are asynchronous and are handed off to Sidekiq for scheduling.
Epilogue
Now that you've gone through git push from client to server, what a great journey!
Warriors, the picture below is your reward for clearing the level!
References
If you want to continue to dig deeper, the following sources are a good starting point: