
GitLab deployments suddenly start failing

One day about three months ago, my apprentice came to me with a frown: Master, bad news. Lately, whenever I use our GitLab to build and release, it keeps failing.

I said: Wasn't it fine before?

He said: Yes, it used to work really well, but I don't know what happened. For a while now it just hasn't worked. It's not that it can't be used at all; it works occasionally, but you have to retry several times, sometimes even a dozen times.

I thought it over carefully. I had not changed anything, except that, for the sake of code security, I had switched GitLab's IP address from the public network to the internal network. But I had updated DNS as well, and nobody had reported any problems.

I said: Try this. First change the build script to ping gitlab.mydomain.com and see whether the ping gets through.

The ping was unstable: sometimes it worked and sometimes it didn't.

This was very strange. The ping resolved the name to the right IP address, so DNS was fine, yet the address itself was not reachable. (At this point I was still blind to the real cause; it had not occurred to me that this was a routing problem. More on that later.)

Three months misled by MTU

The apprentice started narrowing the problem down by elimination and soon made a major discovery.

As long as we did not add the docker:dind service to the job, the network was fine, which meant the problem lay in dind.

dind stands for docker in docker. Our gitlab runner builds inside a docker container, so running docker commands inside that container for packaging depends on this dind service. The dind service is itself a small container that starts another docker daemon, so that the outer container can run docker commands.

Out of habit, I opened Google and started searching for a solution.

Many web pages pointed to a mysterious setting called MTU. They said that the dind container's default MTU of 1500 can, in some cases, cause unstable network transmission, similar to the symptoms we were seeing.

MTU (Maximum Transmission Unit): the largest packet size that can be passed over a given data link layer. The value is usually tied to the communication interface. The Internet Protocol allows IP fragmentation, so a datagram can be broken into pieces small enough to cross links whose MTU is smaller than the datagram's original size. Fragmentation happens at the IP layer, which uses the MTU of the outgoing network interface to decide how large each piece sent on that link may be.
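
To make the definition a little more concrete, here is a minimal sketch of the fragmentation arithmetic in Python. It assumes a 20-byte IPv4 header with no options, and the name fragment_count is made up for this illustration; it is not taken from any tool mentioned in this post.

```python
import math

IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options


def fragment_count(payload_bytes: int, mtu: int) -> int:
    """Return how many IPv4 fragments are needed to carry payload_bytes
    (the data after the IP header) over a link with the given MTU."""
    if payload_bytes <= mtu - IPV4_HEADER_BYTES:
        return 1  # fits in a single packet, no fragmentation needed
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because the fragment offset field counts in 8-byte units.
    per_fragment = (mtu - IPV4_HEADER_BYTES) // 8 * 8
    return math.ceil(payload_bytes / per_fragment)


# A full-size packet built for MTU 1500 carries 1480 bytes after the IP header.
print(fragment_count(1480, 1500))  # one packet on a 1500-byte link
print(fragment_count(1480, 1400))  # must be split on a 1400-byte link
```

When a sender assumes 1500 but some link on the path only carries less, packets either get split like this or, if fragmentation is not allowed, get dropped, which is why MTU mismatches tend to show up as flaky connections.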

The question became: how do we set this MTU? We needed to understand how GitLab configures the service sub-container in a pipeline job. After a lot of reading, we found someone saying that adding the parameter mtu=1400 to the service's command is enough, but in our experiment it still did not work. We changed the job's command line to ifconfig to check the network card parameters directly and found the MTU was still 1500. So we checked the docker:dind source code and found that dind reads an environment variable, DOCKERD_ROOTLESS_ROOTLESSKIT_MTU.

So the question became: how do we set this environment variable so that dind reads it? We tried various methods, but dind still could not read it.

We calmed down and re-read the documentation on GitLab's official website, which describes a parameter called variables:

Additional environment variables that are passed exclusively to the service.

This looked like exactly what we wanted, but it came with a condition: the setting is only available in GitLab 14.5 and above, and our GitLab was still on 13.12.3.

The GitLab upgrade scare

I picked a quiet weekend and set about upgrading GitLab, and nearly walked into a disaster.

I thought upgrading would be simple. For people like us who stick to the rules, an upgrade is just bumping a version number.

docker-compose.yml already had this line, set when GitLab was first installed about a year ago: image: 'gitlab/gitlab-ee:latest'. That is already the latest version, so running docker-compose down and then docker-compose up -d should be enough, right?

No, it was still 13.12.3.

After some reading, I learned that you have to run docker-compose pull first to fetch the latest image.

Fine, I followed the steps.

Uh-oh. Why did the gitlab container keep restarting? I quickly ran docker logs on the container and found a prominent line inside: To upgrade the major version, you must first upgrade to version 14.0!

Frustrated, I finally went and read the GitLab upgrade manual, which lists the upgrade path:

8.11.Z -> 8.12.0 -> 8.17.7 -> 9.5.10 -> 10.8.7 -> 11.11.8 -> 12.0.12 -> 12.1.17 -> 12.10.14 -> 13.0.14 -> 13.1.11 -> 13.8.8 -> 13.12.15 -> 14.0.12 -> 14.9.0 -> latest 14.Y.Z
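
Reading the path by eye is error-prone, so here is a small Python sketch that picks out the stops between two versions. The stop list is copied from the path above (the 8.11.Z and "latest" placeholders are left out); the names REQUIRED_STOPS, parse and plan_upgrade are my own, purely for illustration.

```python
# Required stops copied from the upgrade path quoted above.
REQUIRED_STOPS = [
    "8.12.0", "8.17.7", "9.5.10", "10.8.7", "11.11.8", "12.0.12",
    "12.1.17", "12.10.14", "13.0.14", "13.1.11", "13.8.8", "13.12.15",
    "14.0.12", "14.9.0",
]


def parse(version: str) -> tuple:
    """Turn '13.12.3' into (13, 12, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def plan_upgrade(current: str, target: str) -> list:
    """List the versions to install, in order, to get from current to target."""
    cur, tgt = parse(current), parse(target)
    if tgt <= cur:
        return []  # nothing to do
    stops = [v for v in REQUIRED_STOPS if cur < parse(v) <= tgt]
    if not stops or parse(stops[-1]) != tgt:
        stops.append(target)  # the target itself is not a listed stop
    return stops


print(plan_upgrade("13.12.3", "14.9.0"))
```

This only encodes the listed stops. It says nothing about how long to wait at each one, which turned out to matter a great deal.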

So I changed docker-compose.yml to image: 'gitlab/gitlab-ee:14.0.12', but it reported that the image could not be found. I checked the GitLab image tags and found that there is an extra -ee.0 after the version number. The image name already contains ee, so why write it again in the tag? (ee is short for Enterprise Edition. GitLab recommends that everyone install the Enterprise Edition; if you do not pay, the features that belong to the Enterprise Edition simply cannot be used.)

I changed it to image: 'gitlab/gitlab-ee:14.0.12-ee.0' and tried again. This time it finally succeeded!

Good, onward to 14.9.0. But the gitlab container failed to come up again!

I opened docker logs again. This time text was flying past; it seemed to be running database upgrades, but why did it keep restarting?

Another Google search revealed that, starting with 14.0, GitLab introduced a database migration mechanism: each upgrade must wait until the previous version's migrations have finished before moving on to the next version, and these migrations can take hours or even days!

Disaster. I must have upgraded too fast: I started the next upgrade before the previous version's migrations had finished 😭.

What now? My mind was racing. Report a disaster to the boss? Confess to my colleagues? Tell them I lost all their code?

Calm down. After thinking for five minutes, I decided to see whether I could downgrade back to 14.0.

So I changed docker-compose.yml back to 14.0, restarted, and prayed that no data had been lost.

14.0 finally started, but when I visited the page:

[Screenshot: the GitLab error page]

Big trouble now!

Holding back my grief, I opened docker logs and found no clue; the logs said everything was normal.

But the page was returning 500!

Advice online says that if you hit a 500, don't panic: go into the container and look. So I used docker exec -it to enter the container and ran gitlab-ctl tail to watch the output. When I refreshed the page, the log reported that a database table called services was missing!

Wasn't that just as bad? My database was half upgraded and the code was the old version. What now? I could not go up and I could not go back down. Big trouble indeed.

With no other option, I Googled again and finally found a fellow sufferer. Here is his tale of blood and tears:

I upgraded from 13.12 to 14.0.7. I thought the migration was over and everything was fine, so I stopped the container and upgraded to 14.2, but it would not start. So I went back to 14.0.7, and this time I got a 500 error. The log details are as follows:
ActionView::Template::Error (PG::UndefinedTable: ERROR: relation "services" does not exist
And I don't have a backup.

Exactly the situation I was in. Someone below replied that GitLab automatically backs up the database into the backup folder before each upgrade. The poster said there was no such backup; I went to look too, and there was none.

Fortunately, in the bug report he submitted, a fellow Chinese user had solved the problem. His answer was this:
[Screenshot: the reply describing the fix]

The method is simple: upgrade slowly!

Since upgrading quickly causes problems, upgrade slowly: first upgrade to 14.1.1, then upgrade to 14.2.1 only after the migrations have finished.
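
The rule fits in a few lines of Python. This is only a sketch of the procedure, not a working upgrade tool: deploy and pending_migrations are hypothetical callbacks that you would have to implement yourself (for example by editing docker-compose.yml and by checking the migration count on the admin page), and the function name is my own.

```python
import time
from typing import Callable, Iterable, List


def upgrade_step_by_step(
    versions: Iterable[str],
    deploy: Callable[[str], None],           # hypothetical: switch to this version and restart
    pending_migrations: Callable[[], int],   # hypothetical: number of unfinished migrations
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Install each version in order, never moving on while migrations remain."""
    finished = []
    for version in versions:
        deploy(version)
        # The step I skipped the first time: wait until nothing is left.
        while pending_migrations() > 0:
            sleep(poll_seconds)
        finished.append(version)
    return finished
```

With real callbacks plugged in, calling upgrade_step_by_step(["14.1.1", "14.2.1"], deploy, pending_migrations) would only touch 14.2.1 once 14.1.1 reports zero pending migrations.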

With a last glimmer of hope, I changed docker-compose.yml to 14.1.1 and started it.

After five minutes of waiting, the container finally started.

I opened a browser, visited the page, and prayed there would be no more 500.

Ah. I let out a long sigh of relief: at last, the familiar page.

But I did not dare touch anything. Following the documented steps, I went to the admin monitoring page to check the migration progress. Sure enough, 14 migration tasks were in progress. Only after those 14 migrations had completed did I start thinking about the next step.

Finally found the source of the problem

But I still wanted to get to GitLab 14.5. Since I was already on 14.1.1 and the migrations were complete, the rest should not be too hard. To be on the safe side, I upgraded to 14.2.1 first, and that succeeded too.

I quietly waited for the migrations to complete before going up to 14.9.0. It could actually go to 14.10.0, but I held off for now; 14.9.0 was enough.

So we went back to GitLab, set the environment variable for dind, and built again. No luck: the network MTU was still 1500.

I searched again for the dind MTU problem. The way to set it turned out to be the same as before: the command alone is enough. Then I compared the MTU reported by docker network inspect bridge with the one reported by ifconfig. In the docker network it was already 1400, but ifconfig still showed 1500. What did that mean?

I had a vague feeling that the MTU of docker's bridge network might not need to match the outside. In any case, the inside of the container was now 1400, and it could go even lower, but we still could not reach our server. I also tried ping www.baidu.com, and that worked; only pings to our intranet server failed.

I was too tired, so I went for a nap first.

After waking up, lying in bed, I started thinking: if I don't add the dind service, ping works; if I add it, ping fails. That means the dind service has modified my network configuration.

I noticed that without dind, my container has two network interfaces: eth0 and the loopback interface. In that case everything is normal. But once the dind service is added, the container has three interfaces, with an extra one called docker0. Could it be that my ping requests only work over eth0 and not over docker0, and that once dind is added, all ping requests automatically go out through docker0? Could I force ping to use eth0?

I tried again: ping -I eth0 -c 10 gitlab.mydomain.com. This time it worked!

That showed the problem lay with the docker0 interface.

Because of it, requests were going out through this interface and failing.

So why did the requests go to this interface? Looking closely at its IP settings, I suddenly saw the problem. docker0 is docker's bridge network.

By default, Docker uses 172.17.0.0/16 as the subnet range.

And our intranet address, 172.17.111.27, happens to fall in exactly this network segment!

Now everything made sense. There had been no problem when we used the public IP, and our container had no problem reaching www.baidu.com; only access to our intranet server failed. The intranet server's address happened to fall inside docker's default bridge subnet, so requests to it were sent to docker's bridge network instead of out to the real network, and the server could not be reached!
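
The collision is easy to reproduce with Python's standard ipaddress module. The routing table below is a simplified, hypothetical picture of our build container once dind is running (a default route via eth0 plus the docker0 bridge subnet); the function name pick_interface and the public test address 8.8.8.8 are illustrative only.

```python
import ipaddress

# Hypothetical, simplified routing table of the build container with dind running.
ROUTES = [
    ("0.0.0.0/0", "eth0"),         # default route to the outside world
    ("172.17.0.0/16", "docker0"),  # docker's default bridge subnet
]


def pick_interface(destination: str) -> str:
    """Pick the outgoing interface by longest-prefix match, as a router would."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, device in ROUTES:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, device))
    # The most specific route (longest prefix) wins.
    return max(matches)[1]


print(pick_interface("172.17.111.27"))  # our intranet GitLab server
print(pick_interface("8.8.8.8"))        # some public address
```

The intranet address matches the more specific docker0 route, while public addresses only match the default route, which is consistent with www.baidu.com working while our own server did not.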

The final battle

Changing the intranet server's address is neither feasible nor necessary. What we needed to work out was how to change docker's default subnet.

Every post online says to edit the file /etc/docker/daemon.json, but no such file exists in our container. We are starting dind as a service, so dind itself has to pick up the modified settings, and GitLab exposes very few settings for a service container, so the service's contents cannot easily be changed.

After another intense search, I finally found the answer from another kind soul:

  variables:
    DAEMON_CONFIG: '{"bip": "192.168.123.1/24"}'
  services:
    - name: docker:dind
      entrypoint: ["/bin/sh", "-c", "mkdir -p /etc/docker && echo \"${DAEMON_CONFIG}\" > /etc/docker/daemon.json && exec dockerd-entrypoint.sh"]

The principle is simple: override the entrypoint of the dind service and write the settings we want into daemon.json before the daemon starts. This way the docker0 bridge network segment no longer overlaps with our intranet segment, so it should take effect.
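
Before committing a new bip value, it is worth checking that it really stays clear of the addresses you need to reach. Here is a minimal sketch using only the Python standard library; the function name build_daemon_config is my own, and the addresses are the ones from this post.

```python
import ipaddress
import json
from typing import Iterable


def build_daemon_config(bip: str, must_stay_reachable: Iterable[str]) -> str:
    """Return the JSON text for daemon.json, refusing a bridge subnet
    that would swallow any address we still need to reach."""
    bridge_network = ipaddress.ip_interface(bip).network
    for host in must_stay_reachable:
        if ipaddress.ip_address(host) in bridge_network:
            raise ValueError(f"{host} falls inside bridge subnet {bridge_network}")
    return json.dumps({"bip": bip})


# The value used in DAEMON_CONFIG above, checked against our intranet server.
print(build_daemon_config("192.168.123.1/24", ["172.17.111.27"]))
```

Passing docker's default bridge address 172.17.0.1/16 instead raises ValueError for our server, which is the very conflict that cost us three months.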

After making this change, we ran the build again. Now ping gitlab.mydomain.com gets through directly, without specifying a network interface, which shows the whole network is back to normal.

At this point, the problem that had been bothering us for three months was finally, completely solved: installing gitlab runner in a kubernetes network and running a docker packaging job.


Looking back on the whole process, we overlooked the biggest variable at the very start, the change in network environment, and took many detours. Along the way we learned what MTU is, how to upgrade GitLab, and how Docker's bridge network is configured. We suffered a little, but gained a lot, and from now on we can use GitLab for continuous deployment without any obstacles. 😄


张京

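Currently Vice President of Technology at 北京联云天下科技有限公司. Graduated from Tsinghua University in 1994 with a degree in Computer Science and Technology; more than 20 years of software development and project management experience; previously CTO of 亚洲生活网络公司, QSE Tools Manager at the Motorola Software Center, Technical Director at 融信恒通, 安必信软件公...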