Abstract: The data warehouse service GaussDB(DWS) supports SQL on Hadoop. By establishing a connection between GaussDB(DWS) and MRS, Hive table data can be imported quickly, meeting the needs of big data fusion analysis scenarios.

This article is shared from the HUAWEI CLOUD community post "[Cloud Classroom] EI Lesson 17 Big Data Fusion Analysis: GaussDB(DWS) Easily Imports an MRS-Hive Data Source", original author: Hi,EI.

In the era of big data fusion analysis, how can GaussDB(DWS) access an MRS data source? This cloud lesson opens the door to the MRS data source: we will remotely read an ORC table from Hive on the MRS cluster and import its data into DWS.

Prepare the environment

A DWS cluster has already been created. Ensure that the MRS and DWS clusters are in the same region, availability zone, and VPC subnet so that the two clusters can communicate over the network.

Basic process

Estimated duration of this practice: 1 hour. The basic process is as follows:

1. Create an MRS analysis cluster (choose Hive, Spark, Tez components).

2. Upload the local txt data file to an OBS bucket, import it into Hive from the OBS bucket, and then load the data from the TEXTFILE table into an ORC table.

3. Create MRS data source connection.

4. Create an external server.

5. Create a foreign table.

6. Import the data into a DWS local table through the foreign table.

1. Create an MRS analysis cluster

1. Log in to the Huawei Cloud console, select "EI Enterprise Intelligence > MapReduce Service", click "Buy Cluster", select "Custom Purchase", fill in the software configuration parameters, and click "Next".

2. Fill in the hardware configuration parameters and click "Next".

3. Fill in the advanced configuration parameters, click "Buy Now", and wait about 15 minutes for the cluster to be created successfully.

2. Prepare the ORC table data source on MRS

1. Create a new product_info.txt on the local PC, copy the following data, and save it locally.

100,XHDK-A-1293-#fJ3,2017-09-01,A,2017 Autumn New Shirt Women,red,M,328,2017-09-04,715,good
205,KDKE-B-9947-#kL5,2017-09-01,A,2017 Autumn New Knitwear Women,pink,L,584,2017-09-05,406,very good!
300,JODL-X-1937-#pV7,2017-09-01,A,2017 autumn new T-shirt men,red,XL,1245,2017-09-03,502,Bad.
310,QQPX-R-3956-#aD8,2017-09-02,B,2017 autumn new jacket women,red,L,411,2017-09-05,436,It's really super nice
150,ABEF-C-1820-#mC6,2017-09-03,B,2017 Autumn New Jeans Women,blue,M,1223,2017-09-06,1200,The seller's packaging is exquisite
200,BCQP-E-2365-#qE4,2017-09-04,B,2017 autumn new casual pants men,black,L,997,2017-09-10,301,The clothes are of good quality.
250,EABE-D-1476-#oB1,2017-09-10,A,2017 autumn new dress women,black,S,841,2017-09-15,299,Follow the store for a long time.
108,CDXK-F-1527-#pL2,2017-09-11,A,2017 autumn new dress women,red,M,85,2017-09-14,22,It's really amazing to buy
450,MMCE-H-4728-#nP9,2017-09-11,A,2017 autumn new jacket women,white,M,114,2017-09-14,22,Open the package and the clothes have no odor
260,OCDA-G-2817-#bD3,2017-09-12,B,2017 autumn new woolen coat women,red,L,2004,2017-09-15,826,Very favorite clothes
980,ZKDS-J-5490-#cW4,2017-09-13,B,2017 Autumn New Women's Cotton Clothing,red,M,112,2017-09-16,219,The clothes are small
98,FKQB-I-2564-#dA5,2017-09-15,B,2017 autumn new shoes men,green,M,4345,2017-09-18,5473,The clothes are thick and it's better this winter.
150,DMQY-K-6579-#eS6,2017-09-21,A,2017 autumn new underwear men,yellow,37,2840,2017-09-25,5831,This price is very cost effective
200,GKLW-l-2897-#wQ7,2017-09-22,A,2017 Autumn New Jeans Men,blue,39,5879,2017-09-25,7200,The clothes are very comfortable to wear
300,HWEC-L-2531-#xP8,2017-09-23,A,2017 autumn new shoes women,brown,M,403,2017-09-26,607,good
100,IQPD-M-3214-#yQ1,2017-09-24,B,2017 Autumn New Wide Leg Pants Women,black,M,3045,2017-09-27,5021,very good.
350,LPEC-N-4572-#zX2,2017-09-25,B,2017 Autumn New Underwear Women,red,M,239,2017-09-28,407,The seller's service is very good
110,NQAB-O-3768-#sM3,2017-09-26,B,2017 autumn new underwear women,red,S,6089,2017-09-29,7021,The color is very good 
210,HWNB-P-7879-#tN4,2017-09-27,B,2017 autumn new underwear women,red,L,3201,2017-09-30,4059,I like it very much and the quality is good.
230,JKHU-Q-8865-#uO5,2017-09-29,C,2017 Autumn New Clothes with Chiffon Shirt,black,M,2056,2017-10-02,3842,very good

2. Log in to the OBS console, click "Create Bucket", fill in the bucket parameters, and click "Create Now".

3. Wait for the bucket to be created, click the bucket name, choose "Object > Upload Object", and upload product_info.txt to the OBS bucket.

4. Switch back to the MRS console, click the name of the created MRS cluster, enter the "Overview" tab, click "Sync" in the "IAM User Synchronization" row, and wait about 5 minutes for the synchronization to complete.

5. Back on the MRS cluster page, click "Node Management", click any master node to enter the node page, switch to "Elastic Public Network IP", click "Bind Elastic Public Network IP", select an existing elastic IP (create one if none exists), and click "OK". Record this public IP.

6. Confirm the master node.

  • Use an SSH tool to log in to the node above as the root user (root password: Huawei_12345), then switch to the omm user.
  • su - omm
  • Execute the following command to query the master node. In the output, the node whose HAActive parameter value is "active" is the active master node.
  • sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh

7. Log in to the active master node as the root user, switch to the omm user, and go to the directory where the Hive client is installed.

  • su - omm
  • cd /opt/client

8. Create a table named product_info with the TEXTFILE storage format in Hive.

  • Under the /opt/client path, import environment variables.
  • source bigdata_env
  • Log in to the Hive client.
  • beeline
  • Execute the following SQL statements in sequence to create the demo database and table product_info.
CREATE DATABASE demo;
USE demo;
DROP TABLE IF EXISTS product_info;
CREATE TABLE product_info 
(    
    product_price                int            not null,
    product_id                   char(30)       not null,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    int            ,
    product_comment_time         date           ,
    product_comment_num          int        ,
    product_comment_content      varchar(200)                   
) 
row format delimited fields terminated by ',' 
stored as TEXTFILE;
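
To double-check where Hive stored the table (you will need this HDFS path when importing the file in the next step), you can optionally run a standard Hive statement; with default settings the location should be /user/hive/warehouse/demo.db/product_info:

-- Show table metadata; the "Location" field holds the table's HDFS path
DESCRIBE FORMATTED product_info;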

9. Import the product_info.txt data file into Hive.

  1. Switch back to the MRS cluster, click "File Management", and click "Import Data".
  2. OBS path: Select the OBS bucket name created above, find the product_info.txt file, and click "Yes".
  3. HDFS path: select /user/hive/warehouse/demo.db/product_info/, and click "Yes".
  4. Click "OK" and wait for the import to succeed. At this time, the table data of product_info has been imported successfully.

10. Create an ORC table and import the data into the ORC table.

  • Execute the following SQL statement to create an ORC table.
DROP TABLE IF EXISTS product_info_orc;
CREATE TABLE product_info_orc
(    
    product_price                int            not null,
    product_id                   char(30)       not null,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    int            ,
    product_comment_time         date           ,
    product_comment_num          int            ,
    product_comment_content      varchar(200)                   
) 
row format delimited fields terminated by ',' 
stored as orc;
  • Insert the data of the product_info table into the Hive ORC table product_info_orc.
insert into product_info_orc select * from product_info;
  • Query the ORC table to confirm that the data was imported successfully.

select * from product_info_orc;
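
As an optional sanity check (not part of the original steps), the row count should match the 20 data lines in product_info.txt:

-- Expect 20 rows, one per line of product_info.txt
SELECT COUNT(*) FROM product_info_orc;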

3. Create an MRS data source connection

  1. Log in to the DWS management console and click the created DWS cluster. Make sure that the DWS and MRS clusters are in the same region and availability zone and under the same VPC subnet.
  2. Switch to "MRS Data Source" and click "Create MRS Data Source Connection".
  3. Select the MRS cluster created in the previous steps, name the data source "MRS01", set the user name to admin and the password to Huawei@12345, and click "OK". The connection is created.


4. Create an external server

(1) Use Data Studio to connect to the created DWS cluster.
(2) Create a new user dbuser with permission to create databases:

CREATE USER dbuser WITH CREATEDB PASSWORD "Bigdata@123";

(3) Switch to the newly created dbuser user:
SET ROLE dbuser PASSWORD "Bigdata@123";
(4) Create a new mydatabase database:
CREATE DATABASE mydatabase;
(5) Perform the following steps to switch the connection to the newly created mydatabase database.

  1. In the "Object Browser" window of the Data Studio client, right-click the database connection name and click "Refresh" in the pop-up menu. After refreshing, you can see the newly created database.
  2. Right-click the "mydatabase" database name, and click "Open Connection" in the pop-up menu.
  3. Right-click the "mydatabase" database name and click "Open New Terminal" in the pop-up menu to open the SQL command window connected to the specified database. Please execute all the following steps in this command window.

(6) Grant the permission to create an external server for the dbuser user:

GRANT ALL ON FOREIGN DATA WRAPPER hdfs_fdw TO dbuser;

The FOREIGN DATA WRAPPER name must be hdfs_fdw, and dbuser is the name of the user who creates the SERVER.

(7) Execute the following command to grant the user permission to use foreign tables.
ALTER USER dbuser USEFT;
(8) Switch back to the postgres system database and query the external server that the system automatically created when the MRS data source connection was created.
SELECT * FROM pg_foreign_server;
The returned result is similar to the following:


                     srvname                      | srvowner | srvfdw | srvtype | srvversion | srvacl |                                                     srvoptions
--------------------------------------------------+----------+--------+---------+------------+--------+---------------------------------------------------------------------------------------------------------------------
 gsmpp_server                                     |       10 |  13673 |         |            |        |
 gsmpp_errorinfo_server                           |       10 |  13678 |         |            |        |
 hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca |    16476 |  13685 |         |            |        | {"address=192.168.1.245:9820,192.168.1.218:9820",hdfscfgpath=/MRS/8f79ada0-d998-4026-9020-80d6de2692ca,type=hdfs}
(3 rows)
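
To extract just the values you need to copy into the CREATE SERVER statement in step (10), a small convenience query (an optional addition) filters pg_foreign_server by name:

-- List only the auto-created HDFS server(s) and their connection options
SELECT srvname, srvoptions FROM pg_foreign_server WHERE srvname LIKE 'hdfs_server%';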

(9) Switch to the mydatabase database and switch to the dbuser user.
SET ROLE dbuser PASSWORD "Bigdata@123";
(10) Create an external server.
The SERVER name, address, and configuration path must match the values returned in step (8).

CREATE SERVER hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca FOREIGN DATA WRAPPER HDFS_FDW 
OPTIONS 
(
address '192.168.1.245:9820,192.168.1.218:9820',   -- Private IPs of the active/standby master nodes on the MRS management plane; they must be reachable from DWS.
hdfscfgpath '/MRS/8f79ada0-d998-4026-9020-80d6de2692ca',
type 'hdfs'
);

(11) View the external server.

SELECT * FROM pg_foreign_server WHERE srvname='hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca';

The returned result is as follows, indicating that it has been created successfully:


                     srvname                      | srvowner | srvfdw | srvtype | srvversion | srvacl |                                                     srvoptions
--------------------------------------------------+----------+--------+---------+------------+--------+---------------------------------------------------------------------------------------------------------------------
 hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca |    16476 |  13685 |         |            |        | {"address=192.168.1.245:9820,192.168.1.218:9820",hdfscfgpath=/MRS/8f79ada0-d998-4026-9020-80d6de2692ca,type=hdfs}
(1 row)

5. Create a foreign table

1. Obtain the HDFS path of Hive's product_info_orc table.

  1. Log in to the MRS management console.
  2. Select "Cluster List> Existing Cluster", click the name of the cluster you want to view, and enter the basic cluster information page.
  3. Click "File Management" and select "HDFS File List".
  4. Enter the storage directory of the data you want to import into the GaussDB (DWS) cluster, and record the path.

Figure 1 View the data storage path on MRS

2. Create the foreign table. For SERVER, fill in the name of the external server created in step (10); for foldername, fill in the path recorded in step 1.

DROP FOREIGN TABLE IF EXISTS foreign_product_info;
CREATE FOREIGN TABLE foreign_product_info
(
    product_price                integer        not null,
    product_id                   char(30)       not null,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    integer        ,
    product_comment_time         date           ,
    product_comment_num          integer        ,
    product_comment_content      varchar(200)                      
) SERVER hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca 
OPTIONS (
format 'orc', 
encoding 'utf8',
foldername '/user/hive/warehouse/demo.db/product_info_orc/'
) 
DISTRIBUTE BY ROUNDROBIN;
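
Before running the full import, a quick read through the foreign table confirms that DWS can actually reach the Hive ORC files on MRS (an optional check, not in the original):

-- Remotely read a few rows from the ORC files via the foreign table
SELECT * FROM foreign_product_info LIMIT 5;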

6. Perform the data import

1. Create a local target table.

DROP TABLE IF EXISTS product_info;
CREATE TABLE product_info
(
    product_price                integer        not null,
    product_id                   char(30)       not null,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    integer        ,
    product_comment_time         date           ,
    product_comment_num          integer        ,
    product_comment_content      varchar(200)                   
) 
with (
orientation = column,
compression=middle
) 
DISTRIBUTE BY HASH (product_id);

2. Import the data from the foreign table into the target table.

INSERT INTO product_info SELECT * FROM foreign_product_info;

3. Query the import result.

SELECT * FROM product_info;
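
For a final consistency check (an optional addition), compare the row counts of the local table and the foreign table; with this sample file both should return 20:

-- Both counts should match (20 rows for the sample data)
SELECT COUNT(*) FROM product_info;
SELECT COUNT(*) FROM foreign_product_info;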
That's it. Practice it yourself and get started with the data warehouse service quickly~
