In the previous article, we introduced how TiDB Operator orchestrates component lifecycle management and used the TiDBCluster Controller as an example to explain the implementation of the Controller Manager. The TiDBCluster Controller is responsible for the lifecycle management of the main TiDB components, and the Member Manager of each component encapsulates the corresponding lifecycle management logic. In the last article, we described the abstract implementation of component lifecycle management. In this article, we take PD as an example to walk through the implementation and related code of component lifecycle management in detail, and then, based on the PD introduction, describe the differences in the other components.
PD life cycle management
The main logic of PD lifecycle management is maintained in the PD Member Manager, whose main code is in the pkg/manager/member/pd_member_manager.go file. The logic for scaling, upgrading, and failover is encapsulated in PD Scaler, PD Upgrader, and PD Failover, located in the pd_scaler.go, pd_upgrader.go, and pd_failover.go files respectively.
According to the previous description, the life cycle management of components mainly needs to complete the following processes:
- Synchronize the Service;
- Enter the StatefulSet sync process;
- Synchronize the Status;
- Synchronize the ConfigMap;
- Handle rolling updates;
- Handle scaling (scale out and scale in);
- Handle failover;
- Finally, complete the StatefulSet sync.
Among these, the StatefulSet sync is the main logic of PD lifecycle management. The other sync tasks, such as syncing the Status, syncing the ConfigMap, rolling updates, scaling, and failover, are defined as sub-functions that are called from within the StatefulSet sync. The implementation of these sub-tasks is described in detail after the StatefulSet sync process itself.
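To make the ordering easier to follow, here is a minimal, self-contained sketch of the sync flow. All types and helpers below are stubs standing in for the real TidbCluster, StatefulSet, and Member Manager code; only the call order mirrors the actual implementation.
// Stub types standing in for *v1alpha1.TidbCluster and *apps.StatefulSet.
type tidbCluster struct{ paused bool }
type statefulSet struct{ name string }
// Stub sub-tasks; their real implementations are described in the sections below.
func syncService(tc *tidbCluster) error                                { return nil }
func syncHeadlessService(tc *tidbCluster) error                        { return nil }
func syncStatus(tc *tidbCluster, oldSet *statefulSet) error            { return nil }
func syncConfigMap(tc *tidbCluster, oldSet *statefulSet) error         { return nil }
func handleScaling(tc *tidbCluster, oldSet, newSet *statefulSet) error { return nil }
func handleFailover(tc *tidbCluster) error                             { return nil }
func handleUpgrade(tc *tidbCluster, oldSet, newSet *statefulSet) error { return nil }
func updateStatefulSet(oldSet, newSet *statefulSet) error              { return nil }
// syncPD shows only the order in which the sub-tasks are executed.
func syncPD(tc *tidbCluster, oldSet, newSet *statefulSet) error {
	if err := syncService(tc); err != nil {
		return err
	}
	if err := syncHeadlessService(tc); err != nil {
		return err
	}
	if err := syncStatus(tc, oldSet); err != nil {
		return err
	}
	if tc.paused {
		return nil // a paused cluster skips the rest of the reconcile
	}
	if err := syncConfigMap(tc, oldSet); err != nil {
		return err
	}
	if err := handleScaling(tc, oldSet, newSet); err != nil {
		return err
	}
	if err := handleFailover(tc); err != nil {
		return err
	}
	if err := handleUpgrade(tc, oldSet, newSet); err != nil {
		return err
	}
	return updateStatefulSet(oldSet, newSet)
}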
Synchronize StatefulSet
- Use the StatefulSet Lister to get PD's existing StatefulSet:
oldPDSetTmp, err := m.deps.StatefulSetLister.StatefulSets(ns).Get(controller.PDMemberName(tcName))
if err != nil && !errors.IsNotFound(err) {
return fmt.Errorf("syncPDStatefulSetForTidbCluster: fail to get sts %s for cluster %s/%s, error: %s", controller.PDMemberName(tcName), ns, tcName, err)
}
setNotExist := errors.IsNotFound(err)
oldPDSet := oldPDSetTmp.DeepCopy()
- Use m.syncTidbClusterStatus(tc, oldPDSet) to get the latest status.
if err := m.syncTidbClusterStatus(tc, oldPDSet); err != nil {
klog.Errorf("failed to sync TidbCluster: [%s/%s]'s status, error: %v", ns, tcName, err)
}
- Check whether the TidbCluster is in the Paused state; if so, stop the following reconcile process.
if tc.Spec.Paused {
klog.V(4).Infof("tidb cluster %s/%s is paused, skip syncing for pd statefulset", tc.GetNamespace(), tc.GetName())
return nil
}
- According to the latest tc.Spec, synchronize the ConfigMap.
cm, err := m.syncPDConfigMap(tc, oldPDSet)
- According to the latest tc.Spec, tc.Status, and the ConfigMap obtained in the previous step, generate the latest StatefulSet template.
newPDSet, err := getNewPDSetForTidbCluster(tc, cm)
- If the PD StatefulSet has not been created yet, it is created first in this round of synchronization.
if setNotExist {
if err := SetStatefulSetLastAppliedConfigAnnotation(newPDSet); err != nil {
return err
}
if err := m.deps.StatefulSetControl.CreateStatefulSet(tc, newPDSet); err != nil {
return err
}
tc.Status.PD.StatefulSet = &apps.StatefulSetStatus{}
return controller.RequeueErrorf("TidbCluster: [%s/%s], waiting for PD cluster running", ns, tcName)
}
- If the user has configured a force upgrade via an Annotation, the StatefulSet is set to roll directly in this step. This is used in scenarios where the sync loop is blocked and the cluster cannot otherwise be updated.
if !tc.Status.PD.Synced && NeedForceUpgrade(tc.Annotations) {
tc.Status.PD.Phase = v1alpha1.UpgradePhase
setUpgradePartition(newPDSet, 0)
errSTS := UpdateStatefulSet(m.deps.StatefulSetControl, tc, newPDSet, oldPDSet)
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd needs force upgrade, %v", ns, tcName, errSTS)
}
- To process scaling, call the scaling logic implemented in pd_scaler.go:
if err := m.scaler.Scale(tc, oldPDSet, newPDSet); err != nil {
return err
}
- To handle failover, call the logic in pd_failover.go: first check whether a Recover is required, then check whether all Pods have started and whether all members are healthy, and only then decide whether to enter the failover logic.
if m.deps.CLIConfig.AutoFailover {
if m.shouldRecover(tc) {
m.failover.Recover(tc)
} else if tc.PDAllPodsStarted() && !tc.PDAllMembersReady() || tc.PDAutoFailovering() {
if err := m.failover.Failover(tc); err != nil {
return err
}
}
}
- To process upgrades, call the logic in pd_upgrader.go. If the newly generated PD StatefulSet is inconsistent with the existing PD StatefulSet, or the StatefulSets are identical but tc.Status.PD.Phase records the upgrading state, the Upgrader is entered to handle the rolling update.
if !templateEqual(newPDSet, oldPDSet) || tc.Status.PD.Phase == v1alpha1.UpgradePhase {
if err := m.upgrader.Upgrade(tc, oldPDSet, newPDSet); err != nil {
return err
}
}
- Finally, the PD StatefulSet sync completes by updating the new StatefulSet to the Kubernetes cluster.
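Judging from the force-upgrade branch shown above, this final step is essentially the same UpdateStatefulSet helper call; an illustrative excerpt:
// Apply the newly generated StatefulSet through the StatefulSet control interface.
return UpdateStatefulSet(m.deps.StatefulSetControl, tc, newPDSet, oldPDSet)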
Synchronize Service
PD uses two services, a Service and a Headless Service, which are managed by the syncPDServiceForTidbCluster and syncPDHeadlessServiceForTidbCluster functions.
The Service address is generally used as the PD endpoint configured for TiKV, TiDB, and TiFlash. For example, in the TiDB startup parameters below, the address used by --path=${CLUSTER_NAME}-pd:2379 is the Service address of PD:
ARGS="--store=tikv \
--advertise-address=${POD_NAME}.${HEADLESS_SERVICE_NAME}.${NAMESPACE}.svc \
--path=${CLUSTER_NAME}-pd:2379 \
The Headless Service provides a unique network identity for each Pod. For example, when PD is started with the following parameters, the endpoint that the PD Pod registers in the PD members is "${POD_NAME}.${PEER_SERVICE_NAME}.${NAMESPACE}.svc".
domain="${POD_NAME}.${PEER_SERVICE_NAME}.${NAMESPACE}.svc"
ARGS="--data-dir=/var/lib/pd \
--name=${POD_NAME} \
--peer-urls=http://0.0.0.0:2380 \
--advertise-peer-urls=http://${domain}:2380 \
--client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://${domain}:2379 \
--config=/etc/pd/pd.toml \
"
Synchronize ConfigMap
PD uses a ConfigMap to manage its configuration and startup script. The syncPDConfigMap function calls getPDConfigMap to obtain the latest ConfigMap and then updates it to the Kubernetes cluster. The ConfigMap sync needs to handle the following tasks:
- Obtain the PD Config for the subsequent sync. To stay compatible with TiDB Operator 1.0, which used Helm to maintain the ConfigMap, the ConfigMap is not synchronized when the config object is empty.
config := tc.Spec.PD.Config
if config == nil {
return nil, nil
}
- Adjust the TLS-related configuration. The Dashboard is not supported by PD versions below 4.0, so the Dashboard certificates do not need to be set for those versions.
// override CA if tls enabled
if tc.IsTLSClusterEnabled() {
config.Set("security.cacert-path", path.Join(pdClusterCertPath, tlsSecretRootCAKey))
config.Set("security.cert-path", path.Join(pdClusterCertPath, corev1.TLSCertKey))
config.Set("security.key-path", path.Join(pdClusterCertPath, corev1.TLSPrivateKeyKey))
}
// Versions below v4.0 do not support Dashboard
if tc.Spec.TiDB != nil && tc.Spec.TiDB.IsTLSClientEnabled() && !tc.SkipTLSWhenConnectTiDB() && clusterVersionGE4 {
config.Set("dashboard.tidb-cacert-path", path.Join(tidbClientCertPath, tlsSecretRootCAKey))
config.Set("dashboard.tidb-cert-path", path.Join(tidbClientCertPath, corev1.TLSCertKey))
config.Set("dashboard.tidb-key-path", path.Join(tidbClientCertPath, corev1.TLSPrivateKeyKey))
}
- Convert Config to TOML format for PD use.
confText, err := config.MarshalTOML()
- Use RenderPDStartScript to generate the PD startup script. The startup script template is in the pdStartScriptTpl variable in pkg/manager/member/template.go. The PD startup script is a Bash script; rendering it from the template inserts variables and annotations set on the TidbCluster object, which are needed for normal startup and for PD's debug mode.
- Assemble the PD configuration and the PD startup script generated above into a Kubernetes ConfigMap object and return it to syncPDConfigMap:
cm := &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: controller.PDMemberName(tc.Name),
Namespace: tc.Namespace,
Labels: pdLabel,
OwnerReferences: []metav1.OwnerReference{controller.GetOwnerRef(tc)},
},
Data: map[string]string{
"config-file": string(confText),
"startup-script": startScript,
},
}
Scaling
Scaling is implemented in the pkg/manager/member/pd_scaler.go file, which handles PD scale-out and scale-in. During the StatefulSet sync, the Scale function is called to enter the scaling logic. Both scale-out and scale-in are implemented by setting the StatefulSet replica count, but some pre-operations need to be completed first: when scaling in, the PD leader must be transferred away and the member taken offline, and a deferred-deletion annotation is added to the PVC; when scaling out, the previously retained PVC is deleted automatically. Only after these pre-operations does the operator adjust the StatefulSet replica count, which reduces the impact of scaling on the cluster. These pre-operations can also be extended according to business needs.
func (s *pdScaler) Scale(meta metav1.Object, oldSet *apps.StatefulSet, newSet *apps.StatefulSet) error {
scaling, _, _, _ := scaleOne(oldSet, newSet)
if scaling > 0 {
return s.ScaleOut(meta, oldSet, newSet)
} else if scaling < 0 {
return s.ScaleIn(meta, oldSet, newSet)
}
return s.SyncAutoScalerAnn(meta, oldSet)
}
The Scale function acts as a router: it decides what to do according to the direction, step, and distance of the scaling operation. Currently PD scales one node at a time, so the step is 1, and the direction is determined by the sign of the scaling variable. The two directions are implemented by the ScaleIn and ScaleOut functions.
For PD scale-in, the leader needs to be transferred away proactively so that cluster performance is not affected. Otherwise, if the node being scaled in happens to be the leader, the remaining nodes have to elect a new leader passively after it goes offline, which hurts cluster performance. When transferring the leader proactively, it only needs to be transferred to the member with the smallest ordinal, which guarantees that the PD leader is transferred at most once.
First get the PD Client and the PD leader:
pdClient := controller.GetPDClient(s.deps.PDControl, tc)
leader, err := pdClient.GetPDLeader()
When the leader name equals the name of the member being scaled in, perform the transferLeader operation. If there is only one node, there are not enough PD nodes to complete the transfer, so this operation is skipped.
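A minimal sketch of that pre-operation, assuming the pdapi client exposes a TransferPDLeader method; memberName here stands for the PD member being scaled in, and PdName/GetMinPodOrdinal are the helpers that also appear in the upgrade code later:
// Sketch: before scaling in the leader's Pod, transfer the PD leader to the
// member with the smallest ordinal so the leader moves at most once.
if leader.GetName() == memberName {
	targetName := PdName(tcName, helper.GetMinPodOrdinal(*newSet.Spec.Replicas, newSet), tc.Namespace, tc.Spec.ClusterDomain)
	if err := pdClient.TransferPDLeader(targetName); err != nil {
		return err
	}
}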
After the leader transfer is completed, the ScaleIn function calls PD's DeleteMember API to remove the node from the PD members, taking the node offline, and finally calls setReplicasAndDeleteSlots to adjust the StatefulSet replica count and complete the scale-in.
For PD scale-out, the PVC that was retained earlier for data reliability must be deleted before scaling out so that stale data is not reused. deleteDeferDeletingPVC is therefore called to delete the deferred-deletion PVC; after that, the StatefulSet replica count is adjusted to scale out.
PD scaling is thus mainly accomplished by setting the StatefulSet replica count. Note that when the Advanced StatefulSet is used, the existence of empty slots has to be taken into account when calculating the replica count.
Rolling update
PD upgrades are implemented in pkg/manager/member/pd_upgrader.go. The main approach is to use the StatefulSet's UpdateStrategy to perform rolling updates. While adjusting the StatefulSet UpdateStrategy, PD Upgrader inserts some PD-specific pre-operations to reduce the impact of the upgrade on the PD cluster. For how the StatefulSet UpdateStrategy is controlled in detail, refer to the previous articles.
Before starting the upgrade, you need to complete the following status checks:
- Check whether other operations are in progress, mainly whether TiCDC or TiFlash is upgrading and whether PD is scaling:
if tc.Status.TiCDC.Phase == v1alpha1.UpgradePhase ||
tc.Status.TiFlash.Phase == v1alpha1.UpgradePhase ||
tc.PDScaling()
- As mentioned in the Synchronize StatefulSet section, there are two conditions for entering the Upgrader. One is that the newSet and oldSet Template Specs are inconsistent, which happens at the beginning of an update; in this case, return nil and update the StatefulSet directly without performing the per-Pod checks below. If the Upgrader is entered because tc.Status.PD.Phase == v1alpha1.UpgradePhase, then the newSet and oldSet Template Specs are the same, and the following checks need to continue.
if !templateEqual(newSet, oldSet) {
return nil
}
- Compare tc.Status.PD.StatefulSet.UpdateRevision and tc.Status.PD.StatefulSet.CurrentRevision to get the rolling update status. If the two are equal, the rolling update is complete and the process can exit.
if tc.Status.PD.StatefulSet.UpdateRevision == tc.Status.PD.StatefulSet.CurrentRevision
- Check whether the StatefulSet's UpdateStrategy has been modified manually. If it has, the manually set strategy is used as-is.
if oldSet.Spec.UpdateStrategy.Type == apps.OnDeleteStatefulSetStrategyType || oldSet.Spec.UpdateStrategy.RollingUpdate == nil {
newSet.Spec.UpdateStrategy = oldSet.Spec.UpdateStrategy
klog.Warningf("tidbcluster: [%s/%s] pd statefulset %s UpdateStrategy has been modified manually", ns, tcName, oldSet.GetName())
return nil
}
After these overall checks, each Pod is processed in turn to perform the rolling update:
- Check whether the PD Pod has been updated. By comparing the controller-revision-hash value in the Pod labels with the StatefulSet's UpdateRevision, determine whether the Pod has already been upgraded or is still pending. For a Pod that has been upgraded, check whether its corresponding PD member has become healthy; if not, return an error and wait for the next sync to check again; if it is healthy, start processing the next Pod.
revision, exist := pod.Labels[apps.ControllerRevisionHashLabelKey]
if !exist {
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd pod: [%s] has no label: %s", ns, tcName, podName, apps.ControllerRevisionHashLabelKey)
}
if revision == tc.Status.PD.StatefulSet.UpdateRevision {
if member, exist := tc.Status.PD.Members[PdName(tc.Name, i, tc.Namespace, tc.Spec.ClusterDomain)]; !exist || !member.Health {
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd upgraded pod: [%s] is not ready", ns, tcName, podName)
}
continue
}
- For a Pod whose revision != tc.Status.PD.StatefulSet.UpdateRevision, the Pod has not yet been rolled, so the upgradePDPod function is called to process it. As with the scale-in logic, when the PD leader Pod is processed, the leader is actively transferred away first, and only then is the Pod updated.
if tc.Status.PD.Leader.Name == upgradePdName || tc.Status.PD.Leader.Name == upgradePodName {
var targetName string
targetOrdinal := helper.GetMaxPodOrdinal(*newSet.Spec.Replicas, newSet)
if ordinal == targetOrdinal {
targetOrdinal = helper.GetMinPodOrdinal(*newSet.Spec.Replicas, newSet)
}
targetName = PdName(tcName, targetOrdinal, tc.Namespace, tc.Spec.ClusterDomain)
if _, exist := tc.Status.PD.Members[targetName]; !exist {
targetName = PdPodName(tcName, targetOrdinal)
}
if len(targetName) > 0 {
err := u.transferPDLeaderTo(tc, targetName)
if err != nil {
klog.Errorf("pd upgrader: failed to transfer pd leader to: %s, %v", targetName, err)
return err
}
klog.Infof("pd upgrader: transfer pd leader to: %s successfully", targetName)
return controller.RequeueErrorf("tidbcluster: [%s/%s]'s pd member: [%s] is transferring leader to pd member: [%s]", ns, tcName, upgradePdName, targetName)
}
}
setUpgradePartition(newSet, ordinal)
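The setUpgradePartition call used here and in the force-upgrade branch is small; a sketch of what it does, assuming the apps/v1 StatefulSet API:
// Sketch: only Pods with an ordinal >= upgradeOrdinal are rolled, so lowering
// the partition one step at a time rolls the Pods from the highest ordinal down.
func setUpgradePartition(set *apps.StatefulSet, upgradeOrdinal int32) {
	set.Spec.UpdateStrategy.RollingUpdate = &apps.RollingUpdateStatefulSetStrategy{
		Partition: &upgradeOrdinal,
	}
}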
Failover
PD failover is implemented in pd_failover.go. Unlike the failover logic of other components, PD repairs the cluster by actively deleting the failed Pod. A check is performed before failover: when the PD cluster is unavailable, that is, more than half of the PD members are unhealthy, rebuilding PD nodes will not restore the cluster, so no failover is performed.
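A sketch of that availability pre-check, based on the member health recorded in tc.Status.PD.Members (the real Failover implementation performs an equivalent quorum check before touching any member):
// Sketch: skip failover when PD has lost quorum; recreating members cannot
// bring back an already unavailable cluster.
healthCount := 0
for _, member := range tc.Status.PD.Members {
	if member.Health {
		healthCount++
	}
}
if healthCount < len(tc.Status.PD.Members)/2+1 {
	return fmt.Errorf("PD cluster is not available, healthy members: %d", healthCount)
}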
- Traverse the health status of the PD members obtained from the PD Client. When a member is unhealthy and its LastTransitionTime exceeds the failoverDeadline, it is marked as a failure member, and the Pod and PVC information related to the unhealthy member is recorded in tc.Status.PD.FailureMembers.
for pdName, pdMember := range tc.Status.PD.Members {
podName := strings.Split(pdName, ".")[0]
failoverDeadline := pdMember.LastTransitionTime.Add(f.deps.CLIConfig.PDFailoverPeriod)
_, exist := tc.Status.PD.FailureMembers[pdName]
if pdMember.Health || time.Now().Before(failoverDeadline) || exist {
continue
}
pod, _ := f.deps.PodLister.Pods(ns).Get(podName)
pvcs, _ := util.ResolvePVCFromPod(pod, f.deps.PVCLister)
f.deps.Recorder.Eventf(tc, apiv1.EventTypeWarning, "PDMemberUnhealthy", "%s/%s(%s) is unhealthy", ns, podName, pdMember.ID)
pvcUIDSet := make(map[types.UID]struct{})
for _, pvc := range pvcs {
pvcUIDSet[pvc.UID] = struct{}{}
}
tc.Status.PD.FailureMembers[pdName] = v1alpha1.PDFailureMember{
PodName: podName,
MemberID: pdMember.ID,
PVCUIDSet: pvcUIDSet,
MemberDeleted: false,
CreatedAt: metav1.Now(),
}
return controller.RequeueErrorf("marking Pod: %s/%s pd member: %s as failure", ns, podName, pdMember.Name)
}
- Call the tryToDeleteAFailureMember function to process the FailureMembers: traverse the FailureMembers, and for a member whose MemberDeleted is false, call the PD Client to delete the PD member and try to recover the Pod.
func (f *pdFailover) tryToDeleteAFailureMember(tc *v1alpha1.TidbCluster) error {
ns := tc.GetNamespace()
tcName := tc.GetName()
var failureMember *v1alpha1.PDFailureMember
var failurePodName string
var failurePDName string
for pdName, pdMember := range tc.Status.PD.FailureMembers {
if !pdMember.MemberDeleted {
failureMember = &pdMember
failurePodName = strings.Split(pdName, ".")[0]
failurePDName = pdName
break
}
}
if failureMember == nil {
klog.Infof("No PD FailureMembers to delete for tc %s/%s", ns, tcName)
return nil
}
memberID, err := strconv.ParseUint(failureMember.MemberID, 10, 64)
if err != nil {
return err
}
if err := controller.GetPDClient(f.deps.PDControl, tc).DeleteMemberByID(memberID); err != nil {
klog.Errorf("pd failover[tryToDeleteAFailureMember]: failed to delete member %s/%s(%d), error: %v", ns, failurePodName, memberID, err)
return err
}
klog.Infof("pd failover[tryToDeleteAFailureMember]: delete member %s/%s(%d) successfully", ns, failurePodName, memberID)
...
Delete the faulty Pod:
pod, err := f.deps.PodLister.Pods(ns).Get(failurePodName)
if err != nil && !errors.IsNotFound(err) {
return fmt.Errorf("pd failover[tryToDeleteAFailureMember]: failed to get pod %s/%s for tc %s/%s, error: %s", ns, failurePodName, ns, tcName, err)
}
if pod != nil {
if pod.DeletionTimestamp == nil {
if err := f.deps.PodControl.DeletePod(tc, pod); err != nil {
return err
}
}
} else {
klog.Infof("pd failover[tryToDeleteAFailureMember]: failure pod %s/%s not found, skip", ns, failurePodName)
}
Delete the PVC:
for _, pvc := range pvcs {
_, pvcUIDExist := failureMember.PVCUIDSet[pvc.GetUID()]
// for backward compatibility, if there exists failureMembers and user upgrades operator to newer version
// there will be failure member structures with PVCUID set from api server, we should handle this as pvcUIDExist == true
if pvc.GetUID() == failureMember.PVCUID {
pvcUIDExist = true
}
if pvc.DeletionTimestamp == nil && pvcUIDExist {
if err := f.deps.PVCControl.DeletePVC(tc, pvc); err != nil {
klog.Errorf("pd failover[tryToDeleteAFailureMember]: failed to delete PVC: %s/%s, error: %s", ns, pvc.Name, err)
return err
}
klog.Infof("pd failover[tryToDeleteAFailureMember]: delete PVC %s/%s successfully", ns, pvc.Name)
}
}
Mark the member as deleted in tc.Status.PD.FailureMembers:
setMemberDeleted(tc, failurePDName)
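setMemberDeleted itself is roughly the following sketch: it only flips the MemberDeleted flag in the status so that the next sync can scale out a replacement member.
// Sketch: record in the TidbCluster status that the failure member has been
// removed from the PD cluster.
func setMemberDeleted(tc *v1alpha1.TidbCluster, pdName string) {
	failureMember := tc.Status.PD.FailureMembers[pdName]
	failureMember.MemberDeleted = true
	tc.Status.PD.FailureMembers[pdName] = failureMember
}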
- The desired replica count of the PD StatefulSet is tc.PDStsDesiredReplicas(), which adds the number of deleted FailureMembers to the spec replicas. During the StatefulSet sync this triggers the scale-out logic, adding a new PD Pod to complete the failover.
func (tc *TidbCluster) GetPDDeletedFailureReplicas() int32 {
var deletedReplicas int32 = 0
for _, failureMember := range tc.Status.PD.FailureMembers {
if failureMember.MemberDeleted {
deletedReplicas++
}
}
return deletedReplicas
}
func (tc *TidbCluster) PDStsDesiredReplicas() int32 {
return tc.Spec.PD.Replicas + tc.GetPDDeletedFailureReplicas()
}
Life cycle management of other components
In the previous sections, we used PD to walk through the code of component lifecycle management in detail. The other components, including TiKV, TiFlash, TiDB, Pump, and TiCDC, are managed in a similar way, so we will not repeat the whole process. In the following sections we focus on how their implementations differ from PD.
Life cycle management of TiKV/TiFlash
TiKV and TiFlash lifecycle management are similar, so understanding the TiKV implementation is enough. The differences between TiKV and PD lifecycle management are as follows:
- In the StatefulSet sync process, TiKV Member Manager needs to set the TiKV store labels through setStoreLabelsForTiKV. The setStoreLabelsForTiKV function uses the SetStoreLabels interface of the PD Client to apply the labels of the Kubernetes node to the corresponding TiKV store.
for _, store := range storesInfo.Stores {
nodeName := pod.Spec.NodeName
ls, _ := getNodeLabels(m.deps.NodeLister, nodeName, storeLabels)
if !m.storeLabelsEqualNodeLabels(store.Store.Labels, ls) {
set, err := pdCli.SetStoreLabels(store.Store.Id, ls)
if err != nil {
continue
}
if set {
setCount++
klog.Infof("pod: [%s/%s] set labels: %v successfully", ns, podName, ls)
}
}
}
- In terms of status sync, TiKV Member Manager calls the PD Client's GetStores function to obtain the TiKV store information from PD and classifies the stores for the subsequent sync, as sketched below. This is similar to how the PD status sync calls the GetMembers interface and records the PD member information.
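A sketch of that status sync, assuming the pdapi client's GetStores method and the field names of the StoresInfo structure it returns:
// Sketch: fetch the store list from PD; the member manager then classifies the
// stores (up/down/tombstone) and records them in tc.Status.TiKV.
storesInfo, err := pdCli.GetStores()
if err != nil {
	return err
}
for _, store := range storesInfo.Stores {
	klog.V(4).Infof("tikv store %d is in state %s", store.Store.Id, store.Store.StateName)
}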
- In terms of service synchronization, TiKV Member Manager only creates Headless Service for TiKV Pod DNS resolution.
- In terms of synchronizing ConfigMap, TiKV Member Manager is similar to PD Member Manager. The relevant script template is implemented in the templates.go file. The startup script of TiKV is generated by calling RenderTiKVStartScript, and the configuration file of TiKV is obtained by calling transformTiKVConfigMap.
- In terms of scaling, similar to how PD transfers the leader before scale-in, TiKV lifecycle management needs to take TiKV stores offline safely: TiKV Member Manager calls the PD Client's DeleteStore to delete the store running on the Pod.
- In terms of rolling updates, TiKV needs to ensure that there is no Region leader on the Pod before restarting it. Before starting the update, TiKV Upgrader checks whether the Pod already carries the EvictLeaderBeginTime annotation to determine whether leader eviction has started. If not, it calls the PD Client's BeginEvictLeader function to add an evict-leader scheduler to the TiKV store, which evicts the Region leaders from that store.
_, evicting := upgradePod.Annotations[EvictLeaderBeginTime]
if !evicting {
return u.beginEvictLeader(tc, storeID, upgradePod)
}
In the readyToUpgrade function, when the Region leader count drops to zero, or transferring the Region leaders has taken longer than tc.Spec.TiKV.EvictLeaderTimeout, the partition configuration in the StatefulSet UpdateStrategy is updated to trigger the Pod upgrade, as sketched below. After the Pod is upgraded, endEvictLeaderbyStoreID is called to end the evict-leader operation.
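A sketch of that readiness check, assuming the LeaderCount field in the recorded store status and an RFC3339-formatted EvictLeaderBeginTime annotation value:
// Sketch: a TiKV Pod may be rolled once its store holds no Region leaders, or
// once leader eviction has been running longer than the configured timeout.
func readyToUpgrade(upgradePod *corev1.Pod, store v1alpha1.TiKVStore, evictLeaderTimeout time.Duration) bool {
	if store.LeaderCount == 0 {
		return true
	}
	if beginTimeStr, evicting := upgradePod.Annotations[EvictLeaderBeginTime]; evicting {
		beginTime, err := time.Parse(time.RFC3339, beginTimeStr)
		if err == nil && time.Now().After(beginTime.Add(evictLeaderTimeout)) {
			return true
		}
	}
	return false
}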
- In terms of failover, TiKV Member Manager will record the time of the last Store state change when the state is synchronized.
status.LastTransitionTime = metav1.Now()
if exist && status.State == oldStore.State {
status.LastTransitionTime = oldStore.LastTransitionTime
}
When the store state is v1alpha1.TiKVStateDown and, according to LastTransitionTime, it has stayed Down longer than the failover time limit set in the configuration, the TiKV Pod is added to FailureStores:
if store.State == v1alpha1.TiKVStateDown && time.Now().After(deadline) && !exist {
if tc.Status.TiKV.FailureStores == nil {
tc.Status.TiKV.FailureStores = map[string]v1alpha1.TiKVFailureStore{}
}
if tc.Spec.TiKV.MaxFailoverCount != nil && *tc.Spec.TiKV.MaxFailoverCount > 0 {
maxFailoverCount := *tc.Spec.TiKV.MaxFailoverCount
if len(tc.Status.TiKV.FailureStores) >= int(maxFailoverCount) {
klog.Warningf("%s/%s failure stores count reached the limit: %d", ns, tcName, tc.Spec.TiKV.MaxFailoverCount)
return nil
}
tc.Status.TiKV.FailureStores[storeID] = v1alpha1.TiKVFailureStore{
PodName: podName,
StoreID: store.ID,
CreatedAt: metav1.Now(),
}
msg := fmt.Sprintf("store[%s] is Down", store.ID)
f.deps.Recorder.Event(tc, corev1.EventTypeWarning, unHealthEventReason, fmt.Sprintf(unHealthEventMsgPattern, "tikv", podName, msg))
}
}
When the TiKV StatefulSet is synchronized, the number of FailureStores is added to the replica count, which triggers the scale-out logic and completes the failover:
func (tc *TidbCluster) TiKVStsDesiredReplicas() int32 {
return tc.Spec.TiKV.Replicas + int32(len(tc.Status.TiKV.FailureStores))
}
Life cycle management of TiDB/TiCDC/Pump
The lifecycle management of TiDB, TiCDC, and Pump is similar. Compared with the other components, the rolling update mainly needs to check that the member is healthy before the update is allowed to continue. For scaling, the extra consideration is PVC usage: similar to PD's PVCs, a deferDeleting annotation is added during scale-in to protect data, and the retained PVC is removed during scale-out. In terms of failover, only TiDB currently implements failover logic; TiCDC and Pump have no failover logic for the time being.
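For example, the health check that gates the TiDB rolling update is roughly the following sketch, mirroring the PD member check shown earlier (the field names follow tc.Status.TiDB; error handling in the real upgrader differs slightly):
// Sketch: only continue the rolling update when the already-upgraded TiDB
// member reports healthy again; otherwise requeue and retry on the next sync.
if member, exist := tc.Status.TiDB.Members[podName]; !exist || !member.Health {
	return controller.RequeueErrorf("tidbcluster: [%s/%s]'s tidb upgraded pod: [%s] is not ready", ns, tcName, podName)
}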
Summary
This article introduced the specific implementation of the control loop of the TiDBCluster components. It used the PD component to explain the general logic design described in the previous article and then covered the differences in the other components. Through this article and the previous one, we have learned about the design of the Member Manager of the main TiDB components and how the TiDB lifecycle management process is implemented in TiDB Operator.
If you have any good ideas, feel free to join the TiDB Operator community discussion in #sig-k8s or on pingcap/tidb-operator.