Introduction: This article explains how open source ETL tools connect to MaxCompute.


This talk covers five topics:

01 Enter the MaxCompute ecosystem

02 Open source ETL tools

03 Client Introduction

04 Introduction to PyODPS

05 Practical demonstration

1. Enter the MaxCompute ecosystem

First, look at the product ecosystem, which can be roughly divided into business intelligence, development management, transmission and scheduling, and programming interfaces. This talk focuses on the business intelligence (BI) tools. MaxCompute officially integrates with Tableau, FineReport, FineBI, and Quick BI. Of these, specific versions of Tableau, FineBI, and FineReport ship with a built-in MaxCompute driver; to connect through JDBC, you still need to load the MaxCompute JDBC driver manually. Quick BI, as an Alibaba Cloud product, can connect directly with an Alibaba Cloud account and AccessKey (AK) information, and Yonghong Desktop 8.6 and later can also connect through a built-in driver. Among open source BI tools, Superset and Davinci can likewise connect to MaxCompute.

The development management part was covered in our second lecture, including DBeaver, DataGrip, and SQL Workbench/J.

Our product also integrates with the open source engines Kafka and Flink. The supported open source ETL tools include Kettle, Airflow, and Azkaban, which are the focus of this talk. The supported programming interfaces are Python, JDBC, and SQLAlchemy.

Beyond supporting external tools, MaxCompute also has an open ecosystem of its own, including the built-in open source engine Spark, the migration tool MMA, the development ecosystem (PyODPS, Mars), and the tool ecosystem (Web-Console). MaxCompute has also built a rich ecosystem of solutions and data applications together with other Alibaba Cloud products.


2. Open source ETL tools

This section introduces how the open source ETL tools Airflow, Azkaban, and Kettle connect to MaxCompute.

First, Airflow. Airflow is a scheduling tool written in Python that ships with a Python Operator, a Bash Operator, and various other Operators, and it also supports custom plug-ins. Through a command Operator, Airflow can drive the MaxCompute client to submit SQL tasks from the command line. With the Python SDK, tasks can be submitted by running a .py file, and with the Java SDK by running java -jar. Because Airflow supports the Python Operator, it can also integrate PyODPS directly, letting you write Python code inline. The second tool is Azkaban, which mainly submits tasks through commands; SQL tasks can be submitted through the programming interfaces provided by MaxCompute. Finally, Kettle can connect directly to MaxCompute through JDBC.
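As a minimal sketch of the command-line route described above, a small helper can build the shell command that an Airflow BashOperator or an Azkaban "command" job would run. The client installation path here is a hypothetical example, not a standard location.

```python
# Hedged sketch: compose the shell commands that a scheduler (Airflow
# BashOperator, Azkaban command job) could run to submit SQL through the
# MaxCompute client. "/opt/odpscmd" is a hypothetical install path.

def odpscmd_run_sql(sql, client="/opt/odpscmd/bin/odpscmd"):
    """Build a shell command that submits one SQL statement via odpscmd -e."""
    return f'{client} -e "{sql}"'

def odpscmd_run_file(sql_file, client="/opt/odpscmd/bin/odpscmd"):
    """Build a shell command that runs a SQL script file via odpscmd -f."""
    return f"{client} -f {sql_file}"

# Either string could be handed to a BashOperator's bash_command parameter,
# or written into an Azkaban .job file's "command" property.
print(odpscmd_run_sql("SELECT 1;"))
```
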


3. Introduction to MaxCompute CLI Client

The MaxCompute client runs on Linux, macOS, and Windows.

Installation

• JDK 1.8 or higher.

• A MaxCompute project has been created, and you have an account with permissions on that project.

Configuration

• Modify the `odps_config.ini` file in the conf folder

• Fill in the AccessKey (AK) information, project name, and endpoint
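A sketch of what the `odps_config.ini` entries might look like. The endpoint below is an assumed regional value; substitute your own AK information and project name.

```ini
project_name=your_project_name
access_id=your-access-id
access_key=your-secret-access-key
end_point=http://service.cn-hangzhou.maxcompute.aliyun.com/api
```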

Usage

• On Linux/macOS, execute odpscmd in the bin directory; on Windows, execute odpscmd.bat

• Supports executing a single SQL statement, executing SQL files, uploading resources, uploading and downloading data (Tunnel), authorization, and more


4. Introduction to MaxCompute Python SDK (PyODPS)

Installation

• On a PC, installation requires a Python environment; execute pip install pyodps

• DataWorks has built-in PyODPS support; submit Python tasks by creating a PyODPS node

PyODPS initialization

```python
from odps import ODPS

o = ODPS('your-access-id', 'your-secret-access-key',
         project='your-project', endpoint='your-end-point')
```

PyODPS interfaces

• Table interfaces: `o.get_table`, `o.create_table`

• SQL interfaces: `o.run_sql` (asynchronous execution), `o.execute_sql` (synchronous execution)

• PyODPS DataFrame: `DataFrame(o.get_table(...))`, `o.get_table(...).to_df()`

• Data upload and download: `create_upload_session()`, `create_download_session()`
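A brief sketch of how these interfaces fit together, assuming pyodps is installed and `o` is an initialized ODPS entry object as shown above; the SQL and table names are illustrative only.

```python
# Hedged sketch of the PyODPS interfaces listed above. Assumes `o` is an
# initialized odps.ODPS object; table/SQL names are illustrative only.

def fetch_rows(o, sql):
    """Run SQL synchronously with execute_sql and read back the result set."""
    instance = o.execute_sql(sql)           # blocks until the job finishes
    with instance.open_reader() as reader:  # stream the result records
        return [list(record.values) for record in reader]

def table_to_dataframe(o, table_name):
    """Wrap an existing table as a PyODPS DataFrame for further analysis."""
    return o.get_table(table_name).to_df()
```

For asynchronous execution, `o.run_sql` returns an instance immediately that can be polled or waited on instead of blocking the caller.
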


5. Practical demonstration

Airflow demonstration

Please see the video for the hands-on portion.

Azkaban demonstration

Please see the video for the hands-on portion.

Kettle demonstration

Please see the video for the hands-on portion.


Author: 阿里云开发者 (Alibaba Cloud Developer) — Alibaba's official technical account, presenting technical innovation, hands-on experience, and growth insights from across the Alibaba ecosystem.