Abstract: DWS currently supports backup and recovery to NBU media. This article introduces troubleshooting methods for integrating DWS with NBU backup.

This article is shared from HUAWEI CLOUD COMMUNITY "DWS Docking NBU Backup Troubleshooting Guide", author: Tang Bohu lights mosquito coils.

NetBackup (NBU) is a Veritas software product that provides complete and flexible data protection for a variety of platforms, including Microsoft Windows, UNIX, and Linux. With NetBackup, you can back up, archive, and restore files, folders or directories, and volumes or partitions on your computer. DWS currently supports backup and recovery to NBU media, and this article describes how to troubleshoot the integration between DWS and NBU backup.

Deployment method

Take a 3-node DWS cluster as an example. On each node, Roach (the DWS backup tool) sends that node's cluster data over TCP to a remote NBU Media Server. Each NBU Media Server has the NBU Client installed and also runs the Roach client component. The Roach client receives the backup data sent by the Roach process in the cluster and forwards it to the local NBU Client through the XBSA interface, which completes the NBU backup. Recovery works the same way, with the data flowing in the opposite direction.

During a DWS backup, failures generally originate in one of three places:

  • Roach agent: runs on the cluster nodes; check the cluster backup log directly ($GAUSSLOG/roach/)
  • Roach client: this plug-in is mainly responsible for sending and receiving data. Its log path is set by the -l parameter at startup; look in that path
  • NBU software: troubleshoot it with the methods described in the rest of this article

Environment verification

A full cluster backup is too heavy to use as a connectivity check. Before performing an NBU non-intrusive backup, you can first upload a single small file to test the environment and confirm that the NBU configuration is correct:

gs_roach uploadmeta --media-destination 'nbu_policy' --metadata-destination '/home/Ruby/meta' --media-type NBU --backup-key '20200903_164332' --nbu-on-remote --media-server 192.168.243.65 --client-port 9000 

Note:

--media-destination is the name of the NBU policy

--backup-key can be any timestamp you choose

--media-server is the IP address of any media server on which the Roach client plug-in is deployed

--client-port is the port opened by the Roach client

--metadata-destination specifies the path of the file to upload. Here the test file is named metadata.tar.gz and placed in the /home/Ruby directory, not in the /home/Ruby/meta directory

If the upload succeeds, the configuration of the connected media server is correct. If it fails, there is a problem with the NBU configuration, and you need to follow the instructions below to find the cause.
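The connectivity test above can be scripted so that each run gets a fresh backup key. The following is a minimal Python sketch, not part of Roach itself. The helper names make_backup_key, build_uploadmeta_cmd, and run_connectivity_test are hypothetical; the flags are exactly the ones shown in the command above; the key format is inferred from the sample value '20200903_164332'; and treating exit code 0 as success is an assumption.

```python
import subprocess
from datetime import datetime


def make_backup_key(now=None):
    """Return a timestamp key shaped like the sample '20200903_164332'."""
    now = now or datetime.now()
    return now.strftime("%Y%m%d_%H%M%S")


def build_uploadmeta_cmd(policy, metadata_path, media_server, client_port,
                         backup_key=None):
    """Assemble the gs_roach uploadmeta argument list shown in the article."""
    return [
        "gs_roach", "uploadmeta",
        "--media-destination", policy,
        "--metadata-destination", metadata_path,
        "--media-type", "NBU",
        "--backup-key", backup_key or make_backup_key(),
        "--nbu-on-remote",
        "--media-server", media_server,
        "--client-port", str(client_port),
    ]


def run_connectivity_test(cmd):
    """Run the command; True if it exits with code 0 (assumed to mean success)."""
    return subprocess.run(cmd).returncode == 0
```

For example, build_uploadmeta_cmd("nbu_policy", "/home/Ruby/meta", "192.168.243.65", 9000) reproduces the command above with a new key, and the resulting list can be passed to run_connectivity_test on a cluster node.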

Fault definition

The first step in troubleshooting is to define the problem. During the installation, configuration, and operation of an NBU system, any result that differs from the expected correct behavior can be considered a failure. Recognizing one therefore sometimes requires knowing what the correct behavior should be.

Common faults in the delivery and use of NBU fall into two main types:

The first type occurs in the software installation and configuration phase: unsuccessful software installation, unsuccessful integration, an unavailable module function, and so on. Errors at this stage generally have no specific error code and must be debugged using the delivery engineer's experience and the system logs. This kind of failure is a one-time failure; once resolved, it is very unlikely to recur.

The second type occurs after system deployment is complete, when the data backup service is online and backup and recovery tasks report errors: failure to access the client, failure to write data to the storage unit, client server not found, and so on. For this kind of failure the console provides an error code, which maintenance personnel can use for preliminary fault location. This is a day-to-day type of failure that depends on many factors in the environment; even a subtle change in the business environment outside the backup system itself may cause one.

Troubleshooting process

To troubleshoot the problem, you must know what went wrong.

Error messages usually indicate where the failure occurred, so the first thing to do is find the error message. If no error message appears in the interface but you still suspect a problem, check the reports and logs. NetBackup provides a wide range of reporting and logging tools that supply error messages and often point directly to a solution. The logs also show what is working correctly and what NetBackup was doing when the problem occurred.

In summary, the troubleshooting process of NBU backup and recovery is as follows:

1. Confirm that the server and client are running the supported operating system or application version; see the NBU compatibility list for specific information;

2. Reproduce faults and obtain fault information; the channels for obtaining information include error codes, job details, logs, etc.;

3. Locate and resolve the fault based on the information obtained.

Troubleshooting method

Use status code

Each backup or recovery task is an activity that can be monitored in the Activity Monitor. For each task, the monitor shows its ID, the operation performed, the state, the return value (status code), the Server and Client involved, and the Policy and Schedule used to run it.

How long a task stays displayed depends on the settings in the NetBackup global properties. Each task is in one of the following states:

  • Queued: the task is waiting in the queue
  • Active: the task is executing
  • Done: the task has completed

When an activity finishes, its result is reported as a status code: 0 means success, and non-zero means failure. The status code is a very useful value: looking it up in the manual leads to the recommended corrective actions, which helps with both problem checking and performance tuning. In the console, the status code is shown with the task's entry in the Activity Monitor.

The following link provides the NBU backup task status code list:

https://www.veritas.com/content/support/en_US/doc/44037985-127664609-0/v15096675-127664609

The status code usually allows a first, rough localization of the cause of the error.
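The triage rule described in this section (look at the task state first, then treat 0 as success and anything else as a failure to look up) can be sketched in a few lines of Python. This is only an illustration: the function name triage and the KNOWN_CODES table are hypothetical, and the three meanings listed are a small subset that should be confirmed against the status code list linked above.

```python
# Illustrative subset only; confirm every meaning against the status code list.
KNOWN_CODES = {
    58: "can't connect to client",
    59: "access to the client was not allowed",
    219: "the required storage unit is unavailable",
}


def triage(state, status):
    """Summarize a job from its Activity Monitor state and status code."""
    if state != "Done":
        # Queued and Active jobs have not produced a final status yet.
        return f"job is {state.lower()}; no final status yet"
    if status == 0:
        return "success"
    hint = KNOWN_CODES.get(status, "look up this code in the status code list")
    return f"failed with status {status}: {hint}"
```

A call such as triage("Done", 219) returns a one-line summary that can be pasted into a ticket before moving on to the job details.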

Use Job details

Like status codes, job details correspond one-to-one with activities. The difference is that job details carry more information than the status code; for common faults, they are often enough to locate and eliminate the cause.

Double-click an activity and select Detailed Status to see more detailed information in the status column. Find the key error message (usually the text in red, or the lines around it), extract the keywords, and search for them on Google; many identical error scenarios and their solutions are documented on the Internet.

Use logs

Troubleshooting with status codes and job details only goes so far and is usually effective only for simple faults. For complex problems that these methods cannot solve, you need to collect logs for analysis.

NBU has six log levels, numbered 0 to 5. Each level records the following information:

0: A small number of very important diagnostic and debugging messages

1: Adds detailed diagnostic and debugging messages

2: Adds progress messages

3: Adds informational dump messages

4: Adds function entry and exit messages

5: The most detailed level: records all information

The log level can be adjusted in two ways:

1. Adjust it in the console interface.

2. Edit /usr/openv/netbackup/bp.conf (for example with vi) and add the following line at the end:

VERBOSE = 5
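When the level is changed repeatedly during an investigation, appending by hand can leave several VERBOSE lines in bp.conf. The sketch below shows one way to keep a single entry. It is a plain text transformation, not an NBU tool; the function name set_verbose is hypothetical, and it assumes the entry always has the form "VERBOSE = <level>" on a line of its own.

```python
import re


def set_verbose(conf_text, level):
    """Return bp.conf text with exactly one 'VERBOSE = <level>' line."""
    if level not in range(6):
        raise ValueError("log level must be between 0 and 5")
    # Drop any existing VERBOSE entry, then append the new one at the end.
    kept = [line for line in conf_text.splitlines()
            if not re.match(r"\s*VERBOSE\s*=", line)]
    kept.append(f"VERBOSE = {level}")
    return "\n".join(kept) + "\n"
```

To apply it, read /usr/openv/netbackup/bp.conf, pass the text through set_verbose, and write the result back, keeping a backup copy of the original file.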

NBU stores the logs of each process in a separate directory, but these directories are not created by default. To collect the logs, engineers must create the directories manually. The directory format is /usr/openv/netbackup/logs/<process name>. Taking the bpcd program as an example, run the following command to create its subdirectory:

mkdir /usr/openv/netbackup/logs/bpcd

Alternatively, use the batch creation script provided by NBU to create all log directories in one step:

sh /usr/openv/netbackup/logs/mklogdir
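If only a handful of processes are of interest, the directories can also be created selectively. The following Python sketch is an alternative to the two commands above, not a replacement for mklogdir. The function name create_log_dirs is hypothetical, and the default process names are simply the common NBU processes named in this article.

```python
import os

# Process names taken from the article's list of common NBU processes.
NBU_PROCESSES = ["admin", "bpbrm", "bpcd", "bpdm", "bpdbm", "bprd",
                 "vnetd", "bpbackup"]


def create_log_dirs(base="/usr/openv/netbackup/logs", processes=NBU_PROCESSES):
    """Create one log subdirectory per process; existing ones are left alone."""
    created = []
    for name in processes:
        path = os.path.join(base, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

The return value lists only the directories that were newly created, so running the function a second time returns an empty list.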

Because NBU keeps a separate log subdirectory for each process, logs can be analyzed at the process level. To do that, you need to know the processes NBU commonly uses:

admin: Management commands.

bpbrm: NetBackup backup and restore manager.

bpcd: NetBackup client daemon or manager.

bpdm: NetBackup Disk Manager.

bpdbm: NetBackup database manager. This process only runs on the master server.

bprd: NetBackup request manager, which responds to client and management requests such as backup, recovery, and archive.

vnetd: Veritas network daemon.

bpbackup: On a UNIX client, when the user initiates a backup, this program communicates with bprd on the master server.

After obtaining the logs, search each file for keywords such as fail, error, can not, and freeze to locate the cause of the fault.
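The keyword search can be automated across all process subdirectories. Below is a minimal Python sketch under stated assumptions: the function name scan_logs is hypothetical, the keyword list is the one given above, matching is case-insensitive, and the logs are assumed to be text files readable as UTF-8 (undecodable bytes are replaced rather than aborting the scan).

```python
import os

KEYWORDS = ("fail", "error", "can not", "freeze")


def scan_logs(log_root, keywords=KEYWORDS):
    """Yield (path, line_number, line) for each log line containing a keyword."""
    for dirpath, _dirs, files in os.walk(log_root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going if a log has odd bytes.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    lowered = line.lower()
                    if any(word in lowered for word in keywords):
                        yield path, number, line.rstrip("\n")
```

Iterating over scan_logs("/usr/openv/netbackup/logs") and printing each tuple gives a quick index of suspicious lines to read in context.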

NBU common maintenance commands

Start the NetBackup service processes from the command line

/usr/openv/netbackup/bin/bp.start_all

Stop the NetBackup service processes from the command line

/usr/openv/netbackup/bin/bp.kill_all

Clear the host cache from the command line

/usr/openv/netbackup/bin/bpclntcmd -clear_host_cache # clear the host cache
cd /usr/openv/var/host_cache/  # clear the temporary files
rm -rf tmp
mkdir tmp
mv * tmp

Test connectivity between the master server and a client from the command line

/usr/openv/netbackup/bin/admincmd/bptestbpcd -client client_hostname

If the client can be reached, the command returns successfully and prints the connection details:
[Figure: sample bptestbpcd output for a reachable client]

Communication problem between NBU master server and NBU client

On both the client and the master server, telnet to the other side's backup management plane IP on ports 1556, 1372, and 13782 to confirm that the client and the master server can communicate normally. The state of these ports on the local host can be checked with netstat:

netstat -an | grep 1556
netstat -an | grep 1372
netstat -an | grep 13782
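Where telnet is not installed, the same reachability check can be done with a short Python script. This is a minimal sketch: the function names port_open and check_ports are hypothetical, the 3-second timeout is an arbitrary choice, and the default port tuple contains only 1556 and 13782 from the commands above, so add any other ports your deployment uses.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=(1556, 13782)):
    """Map each port to whether it accepted a connection."""
    return {port: port_open(host, port) for port in ports}
```

Run check_ports with the master server's address on the client, and with the client's address on the master server, so that both directions are tested.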

Check NBU service and process

/usr/openv/netbackup/bin/bpps -x

Media server is not a certified host

This is a trust configuration problem: the client does not trust the media server. In the console, click Host Properties > Clients, find the faulty client, and double-click it. In the dialog that opens, select the Servers tab and add the host name of the media server under Additional Servers.

Storage unit is unavailable

The "Storage Unit Unavailable" fault message may appear in the following situations:

1. The storage unit is full

2. Too many backup tasks are queued on this storage unit

3. The client cannot communicate with the media server to which the storage unit belongs
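The first cause can be checked quickly when the storage unit is disk-based. The sketch below assumes the storage unit is backed by a filesystem mounted at a local path on the media server, which may not match your deployment. The function names used_percent and nearly_full and the 90% default threshold are illustrative choices, not NBU settings.

```python
import shutil


def used_percent(path):
    """Return how full the filesystem holding path is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def nearly_full(path, threshold=90.0):
    """True when usage is at or above threshold (90% is an arbitrary default)."""
    return used_percent(path) >= threshold
```

Calling nearly_full on the storage unit's mount point before a large backup window gives an early warning of the first cause; the other two causes still have to be checked in the Activity Monitor and with the connectivity tests described earlier.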

If you want to know more about GaussDB (DWS), search for "GaussDB DWS" on WeChat and follow the official account, which shares the latest and most complete PB-level data warehouse technology. You can also get plenty of learning materials through the account.
