Author: Zhang Luodan

A former member of the Aikesheng DBA team and now a member of the Lufax DBA team, with a persistent passion for technology.

Source of this article: original submission

* Produced by the Aikesheng open source community. Original content may not be used without authorization; for reprinting, please contact the editor and cite the source.


background

During a change, I used MHA to perform a master-slave switchover with the following command:

masterha_master_switch --master_state=alive --conf=/etc/mha/mha3306.cnf --new_master_host=xx.xx.xx.xx --new_master_port=3306 --interactive=0 --orig_master_is_new_slave

However, I encountered an error, as follows:

[error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln1213] XX.XX.XX.XX is bad as a new master!
[error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln232] Failed to get new master!
[error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /bin/masterha_master_switch line 53.

The error says that the designated new master is "bad as a new master".

I was confused by this error: the cluster status check and masterha_check_repl had both passed before the switch. Clearly I still did not understand the principles of MHA well enough.

There was no time to investigate at that moment, so I switched manually. Now let us explore why this error occurred.

For context, the online master-slave cluster looked like this:

| Role | MySQL version |
| :-------- | :--------|
| M | MySQL 5.6.40 |
| S1 | MySQL 5.6.40 |
| S2 | MySQL 5.7.29 |
| S3 | MySQL 5.7.29 |

PS: Why are the master and slave versions inconsistent?
Because an upgrade was in progress. The purpose of this switch was to promote S2 to master and then upgrade the two lower-version instances.

testing scenarios

The manual switchover in production sidestepped the MHA error, so its cause still had to be analyzed. Because the new master in production ran a different version from the old master, I wondered whether MHA simply does not support cross-version switching, something I had never paid attention to before. I therefore ran a round of tests in a test environment. The scenarios and results are listed below; anyone interested can reproduce them:

| Scenario | Original master version | New master version | Other slaves' versions | Switch result |
| :-------- | :-------- | :-------- | :-------- | :-------- |
| Scene 1 | 5.6.40 | 5.7.29 | none | Succeeded |
| Scene 2 | 5.7.29 | 5.6.40 | none | Succeeded |
| Scene 3 | 5.6.40 | 5.7.29 | 5.6.38 | Failed |
| Scene 4 | 5.6.38 | 5.7.29 | 5.6.40 | Failed |
| Scene 5 | 5.6.38 | 5.7.29 | 5.7.29 | Succeeded |

So much for the symptoms. Is it not curious that a cross-version switch succeeds when there is only one slave, while with additional slaves it succeeds in some cases and fails in others? Read on.

problem analysis

I started with Google, searching for the keywords: mha .. is bad as a new master

The results did not contain the answer I wanted, although a few articles had some reference value.

With no other option, I had to read the source code. After all, MHA is an open source tool [you never know how good your English is until you are forced to use it].
The master selection code in MHA first defines several arrays:

  1. slaves array: the alive slaves
  2. latest array: the alive slaves with the most advanced replication position
  3. pref array: slaves with candidate_master set in the configuration file
  4. bad array: explained later

The new master is then elected in the following order:

  1. Elect the designated (highest-priority) slave as the new master; this is the usual case for a manual switch with a specified new master. If that slave cannot be the new master, MHA reports an error and exits. If no new master is designated, for example in a failover, proceed to the following steps
  2. Select a slave that has the latest replication position and is in the pref array. If no such slave exists, continue with the following steps
  3. Select a slave from pref as the new master; if none is selected, continue
  4. Select a slave with the latest replication position as the new master; if none is selected, continue
  5. Select from all slaves
  6. If no master has been selected after the above steps, the election fails

Note: in every step above, the selected new master must not be in the bad array

# Picking up new master
# If preferred node is specified, one of active preferred nodes will be new master.
# If the latest server behinds too much (i.e. stopping sql thread for online backups), we should not use it as a new master, but we should fetch relay log there. Even though preferred master is configured, it does not become a master if it's far behind.
sub select_new_master {
  my $self                    = shift;
  my $prio_new_master_host    = shift;
  my $prio_new_master_port    = shift;
  my $check_replication_delay = shift;
  $check_replication_delay = 1 if ( !defined($check_replication_delay) );

  my $log    = $self->{logger};
  my @latest = $self->get_latest_slaves();
  my @slaves = $self->get_alive_slaves();

  my @pref = $self->get_candidate_masters();
  my @bad =
    $self->get_bad_candidate_masters( $latest[0], $check_replication_delay );

  if ( $prio_new_master_host && $prio_new_master_port ) {
    my $new_master =
      $self->get_alive_server_by_hostport( $prio_new_master_host,
      $prio_new_master_port );
    if ($new_master) {
      my $a = $self->get_server_from_by_id( \@bad, $new_master->{id} );
      unless ($a) {
        $log->info("$prio_new_master_host can be new master.");
        return $new_master;
      }
      else {
        $log->error("$prio_new_master_host is bad as a new master!");
        return;
      }
    }
    else {
      $log->error("$prio_new_master_host is not alive!");
      return;
    }
  }
  $log->info("Searching new master from slaves..");
  $log->info(" Candidate masters from the configuration file:");
  $self->print_servers( \@pref );
  $log->info(" Non-candidate masters:");
  $self->print_servers( \@bad );

  return $latest[0]
    if ( $#pref < 0 && $#bad < 0 && $latest[0]->{latest_priority} );

  if ( $latest[0]->{latest_priority} ) {
    $log->info(
" Searching from candidate_master slaves which have received the latest relay log events.."
    ) if ( $#pref >= 0 );
    foreach my $h (@latest) {
      foreach my $p (@pref) {
        if ( $h->{id} eq $p->{id} ) {
          return $h
            if ( !$self->get_server_from_by_id( \@bad, $p->{id} ) );
        }
      }
    }
    $log->info("  Not found.") if ( $#pref >= 0 );
  }

  #new master is not latest
  $log->info(" Searching from all candidate_master slaves..")
    if ( $#pref >= 0 );
  foreach my $s (@slaves) {
    foreach my $p (@pref) {
      if ( $s->{id} eq $p->{id} ) {
        my $a = $self->get_server_from_by_id( \@bad, $p->{id} );
        return $s unless ($a);
      }
    }
  }
  $log->info("  Not found.") if ( $#pref >= 0 );
  if ( $latest[0]->{latest_priority} ) {
    $log->info(
" Searching from all slaves which have received the latest relay log events.."
    );
    foreach my $h (@latest) {
      my $a = $self->get_server_from_by_id( \@bad, $h->{id} );
      return $h unless ($a);
    }
    $log->info("  Not found.");
  }

  # none of latest servers can not be a master
  $log->info(" Searching from all slaves..");
  foreach my $s (@slaves) {
    my $a = $self->get_server_from_by_id( \@bad, $s->{id} );
    return $s unless ($a);
  }
  $log->info("  Not found.");

  return;
}

Because the error says the new master is bad, let's focus on how a slave is judged to be bad. The function that builds the bad list is get_bad_candidate_masters, shown below. A slave is judged bad in any of the following five situations:

  1. dead servers
  2. {no_master} >= 1 [no_master is set in the configuration file]
  3. log_bin is disabled [the binlog is not enabled]
  4. {oldest_major_version} eq '0' [the MySQL major version is not the lowest among the slaves]
  5. Too much replication delay [the slave's binlog position lags that of the latest slave by more than 100000000]

    # The following servers can not be master:
    # - dead servers
    # - Set no_master in conf files (i.e. DR servers)
    # - log_bin is disabled
    # - Major version is not the oldest
    # - too much replication delay
    sub get_bad_candidate_masters($$$) {
      my $self                    = shift;
      my $latest_slave            = shift;
      my $check_replication_delay = shift;
      my $log                     = $self->{logger};
    
      my @servers     = $self->get_alive_slaves();
      my @ret_servers = ();
      foreach (@servers) {
        if (
             $_->{no_master} >= 1
          || $_->{log_bin} eq '0'
          || $_->{oldest_major_version} eq '0'
          || (
            $latest_slave
            && ( $check_replication_delay
              && $self->check_slave_delay( $_, $latest_slave ) >= 1 )
          )
          )
        {
          push( @ret_servers, $_ );
        }
      }
      return @ret_servers;
    }
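
MHA only logs that a host "is bad as a new master" without saying which condition matched. The following is a minimal sketch, not part of MHA, of how the reason could be spelled out. The helper name bad_reasons and the delayed key are assumptions made for illustration; delayed stands in for the result of check_slave_delay, and the other keys are the fields tested in the function above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MHA): list the reasons a slave would land
# in the bad array, using the same fields that get_bad_candidate_masters tests.
# The "delayed" key is an assumption standing in for check_slave_delay().
sub bad_reasons {
    my ($slave) = @_;
    my @reasons;
    push @reasons, 'no_master is set'    if $slave->{no_master} >= 1;
    push @reasons, 'log_bin is disabled' if $slave->{log_bin} eq '0';
    push @reasons, 'major version is not the lowest'
        if $slave->{oldest_major_version} eq '0';
    push @reasons, 'too much replication delay' if $slave->{delayed};
    return @reasons;
}

# Example slaves modelled on S1 (5.6.40) and S2 (5.7.29) from this article.
my @slaves = (
    { name => 'S1', no_master => 0, log_bin => 1, oldest_major_version => 1, delayed => 0 },
    { name => 'S2', no_master => 0, log_bin => 1, oldest_major_version => 0, delayed => 0 },
);

foreach my $s (@slaves) {
    my @why = bad_reasons($s);
    printf "%s: %s\n", $s->{name}, @why ? join( '; ', @why ) : 'can be new master';
}
```

With these inputs, only S2 gets a reason, and that reason is the version condition.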

Conditions 1-3 and 5 are easy to understand, and monitoring confirmed that none of them applied to the production cluster, so I will focus on how condition 4 is determined.

Find the related function:

sub compare_slave_version($) {
  my $self    = shift;
  my @servers = $self->get_alive_servers();
  my $log     = $self->{logger};
  $log->debug(" Comparing MySQL versions..");
  my $min_major_version;
  foreach (@servers) {
    my $dbhelper = $_->{dbhelper};
    # Skip servers that are dead or are not slaves
    next if ( $_->{dead} || $_->{not_slave} );
    my $parsed_major_version =
      MHA::NodeUtil::parse_mysql_major_version( $_->{mysql_version} );
    if (!$min_major_version
      || $parsed_major_version < $min_major_version )
    {
      $min_major_version = $parsed_major_version;
    }
  }
  foreach (@servers) {
    my $dbhelper = $_->{dbhelper};
    next if ( $_->{dead} || $_->{not_slave} );
    my $parsed_major_version =
      MHA::NodeUtil::parse_mysql_major_version( $_->{mysql_version} );
    if ( $min_major_version == $parsed_major_version ) {
      $_->{oldest_major_version} = 1;
    }
    else {
      $_->{oldest_major_version} = 0;
    }
  }
  $log->debug("  Comparing MySQL versions done.");
}

As you can see, the function first finds the lowest major version among alive_servers, which is min_major_version:

  • A server that is dead or is not a slave is skipped; all others are compared. The key code is next if ( $_->{dead} || $_->{not_slave} );

Next, each server's parsed_major_version [the MySQL major version, for example 5.6 or 5.7] is compared with min_major_version:

  • If parsed_major_version == min_major_version, then oldest_major_version = 1; otherwise oldest_major_version = 0

In summary, a slave can be used as the new master only if its major version is the lowest among all alive slaves.

That gets to the bottom of the problem. Going back to the scenes tested earlier, everything now makes sense:

  • Scene 1 and scene 2 have only one slave, so the cross-version switch succeeds: the major version of that slave is by definition min_major_version
  • Scene 3 and scene 4 fail because the major version of the new master is 5.7 while the lowest major version among all slaves is 5.6
  • Scene 5 succeeds because both slaves run 5.7, so the major version of the new master is also the lowest one
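
The rule can be replayed against the test scenes and the production cluster. This is only a sketch under stated assumptions: major_version is a hypothetical stand-in for MHA::NodeUtil::parse_mysql_major_version, and can_be_new_master is an illustrative name that does not exist in MHA.

```perl
use strict;
use warnings;

# Hypothetical stand-in for MHA::NodeUtil::parse_mysql_major_version:
# reduce "5.7.29" to a comparable number built from major and minor only.
sub major_version {
    my ($version) = @_;
    my ( $major, $minor ) = $version =~ /^(\d+)\.(\d+)/
        or die "unparsable version: $version";
    return $major * 1000 + $minor;
}

# Illustrative model of the rule: the candidate is acceptable only if its
# major version equals the lowest major version among the alive slaves.
# The current master is deliberately not part of the comparison.
sub can_be_new_master {
    my ( $candidate, @other_slaves ) = @_;
    my $min = major_version($candidate);
    foreach my $v (@other_slaves) {
        my $m = major_version($v);
        $min = $m if $m < $min;
    }
    return major_version($candidate) == $min ? 1 : 0;
}

# [ label, new master version, versions of the other slaves ]
my @cases = (
    [ 'Scene 1',    '5.7.29', [] ],
    [ 'Scene 2',    '5.6.40', [] ],
    [ 'Scene 3',    '5.7.29', ['5.6.38'] ],
    [ 'Scene 4',    '5.7.29', ['5.6.40'] ],
    [ 'Scene 5',    '5.7.29', ['5.7.29'] ],
    [ 'Production', '5.7.29', [ '5.6.40', '5.7.29' ] ],
);

foreach my $case (@cases) {
    my ( $label, $new_master, $others ) = @$case;
    printf "%-10s %s\n", $label,
        can_be_new_master( $new_master, @$others )
        ? 'switch allowed'
        : 'bad as a new master';
}
```

The Production row models promoting S2 while S1 (5.6.40) and S3 (5.7.29) are the other slaves, which is the case that produced the error in this article.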

But why is MHA designed this way?

  • Replicating from a lower-version master to a higher-version slave is generally fine in MySQL, but replicating from a higher-version master to a lower-version slave may cause problems. See the official documentation: https://dev.mysql.com/doc/refman/5.7/en/replication-compatibility.html
  • However, MHA leaves the original master out when it determines the lowest version, so a higher-version master replicating to a lower-version slave can still occur after the switch, for example when the original master rejoins as a slave.
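
To make the gap concrete, here is a small sketch that lists the servers which would end up replicating from a master with a higher major version after a switch, the combination the lowest-version rule is meant to prevent. The helper names major_version and downgraded_replicas are hypothetical and not part of MHA; the versions are taken from scene 5.

```perl
use strict;
use warnings;

# Hypothetical helper: comparable number from the major.minor part of a version.
sub major_version {
    my ($version) = @_;
    my ( $major, $minor ) = $version =~ /^(\d+)\.(\d+)/
        or die "unparsable version: $version";
    return $major * 1000 + $minor;
}

# Return the names of replicas whose major version is lower than the new
# master's, i.e. replicas that would replicate from a newer master.
sub downgraded_replicas {
    my ( $new_master_version, %replicas ) = @_;
    my $master = major_version($new_master_version);
    my @names =
      grep { major_version( $replicas{$_} ) < $master } sort keys %replicas;
    return @names;
}

# Scene 5 with --orig_master_is_new_slave: the 5.6.38 master is demoted and
# becomes a replica of the 5.7.29 new master, although MHA allowed the switch.
my @risky = downgraded_replicas(
    '5.7.29',
    old_master => '5.6.38',
    slave      => '5.7.29',
);
print "replicating from a newer version: @risky\n";
```

A check like this could be run by hand before a switch, since MHA itself does not look at the version of the master that is being replaced.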

summary

MHA master selection logic:

  1. Elect the designated (highest-priority) slave as the new master; this is the usual case for a manual switch with a specified new master. If that slave cannot be the new master, MHA reports an error and exits. If no new master is designated, for example in a failover, proceed to the following steps
  2. Select a slave that has the latest replication position and has candidate_master set. If no such slave exists, proceed to the following steps
  3. Select a slave that has candidate_master set as the new master; if none is selected, continue
  4. Select a slave with the latest replication position as the new master; if none is selected, continue
  5. Select from all slaves
  6. If no master has been selected after the above steps, the election fails

Note: in every step above, the selected new master must not be in the bad array

A slave is placed in the bad array in any of the following cases:

  1. dead servers
  2. {no_master} >= 1 [no_master is set in the configuration file]
  3. log_bin is disabled [the binlog is not enabled]
  4. {oldest_major_version} eq '0' [the MySQL major version is not the lowest among the slaves]
  5. Too much replication delay [the slave's binlog position lags that of the latest slave by more than 100000000]

Among them, condition 4 is relatively easy to overlook and deserves attention!

