Introduction: This article explains how open source ETL tools connect to MaxCompute.
For the live session, please see the video replay.
This talk covers five topics:
01 Enter the MaxCompute ecosystem
02 Open source ETL tools
03 Client Introduction
04 Introduction to PyODPS
05 Practical demonstration
1. Enter the MaxCompute ecosystem
First, look at the product ecosystem, which can be roughly divided into business intelligence, development management, transmission and scheduling, and programming interfaces. This talk focuses on the business intelligence (BI) tools. MaxCompute officially integrates Tableau, FineReport, FineBI, and Quick BI. Specific versions of Tableau, FineBI, and FineReport ship with a built-in MaxCompute driver; if you connect to MaxCompute through JDBC, you still need to load the MaxCompute JDBC driver manually. Quick BI, as an Alibaba Cloud product, can connect directly with an Alibaba Cloud account and AccessKey (AK) information, and Yonghong Desktop 8.6 and above can also connect to MaxCompute through a built-in driver. On the open source side, the BI tools Superset and Davinci can connect to MaxCompute as well.
The development management tools were covered in our second lecture and include DBeaver, DataGrip, and SQL Workbench/J.
MaxCompute also integrates the open source engines Kafka and Flink. The supported open source ETL tools are Kettle, Airflow, and Azkaban, which are the focus of this talk. The supported programming interfaces are Python, JDBC, and SQLAlchemy.
In addition to supporting external tools, MaxCompute has an open ecosystem of its own, including the built-in open source engine Spark, the migration tool MMA, the development ecosystem PyODPS and Mars, and tools such as Web-Console. MaxCompute has also built a rich ecosystem of solutions and data applications together with other Alibaba Cloud products.
2. Open source ETL tools
This section introduces how the open source ETL tools Airflow, Azkaban, and Kettle connect to MaxCompute.
First, Airflow. Airflow is a scheduling tool written in Python that ships with a Bash Operator, a Python Operator, and many other Operators, and it also supports custom plug-ins. Through the command (Bash) Operator, Airflow can drive the MaxCompute client to submit SQL tasks on the command line; Python SDK tasks can be submitted by running a .py file, and Java SDK tasks with java -jar. Because Airflow supports the Python Operator, it can also integrate PyODPS directly, letting you write Python code in the DAG itself (see the sketch below). The second tool is Azkaban. Azkaban mainly submits tasks through commands, so SQL tasks can be submitted through the programming interfaces that MaxCompute provides. Finally, Kettle can connect to MaxCompute directly through JDBC.
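As an illustration, here is a minimal Airflow DAG sketch that submits a MaxCompute SQL statement two ways: through the Bash Operator calling the odpscmd client, and through the Python Operator using PyODPS. The DAG id, client path, credentials, and SQL below are placeholders, and the Operator import paths assume Airflow 2.x.

```python
# A minimal sketch (Airflow 2.x assumed); IDs, paths, and credentials are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_pyodps_sql():
    # Submit SQL through PyODPS inside a Python Operator task.
    from odps import ODPS
    o = ODPS('your-access-id', 'your-secret-access-key',
             project='your-project', endpoint='your-end-point')
    o.execute_sql('SELECT 1;')  # synchronous execution


with DAG(dag_id='maxcompute_demo',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:

    # Drive the MaxCompute CLI client from the command line.
    sql_via_cli = BashOperator(
        task_id='sql_via_odpscmd',
        bash_command='/path/to/odpscmd -e "SELECT 1;"',
    )

    # Submit the same SQL through PyODPS.
    sql_via_pyodps = PythonOperator(
        task_id='sql_via_pyodps',
        python_callable=run_pyodps_sql,
    )

    sql_via_cli >> sql_via_pyodps
```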
3. Introduction to MaxCompute CLI Client
The MaxCompute client runs on three systems: Linux, macOS, and Windows.
Install
• JDK 1.8 or higher is required.
• A MaxCompute project has been created, and you have an account with permissions on that project.
Configuration
• Modify the odps_config.ini file in the conf folder.
• Fill in the AccessKey (AK) information, project name, and endpoint (see the sample below).
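A minimal odps_config.ini sketch; the key names below follow the template shipped with the client, and all values (including the region-specific endpoint) are placeholders:

```ini
# Placeholders only -- fill in your own project and AK information.
project_name=your-project
access_id=your-access-id
access_key=your-secret-access-key
end_point=https://service.cn-hangzhou.maxcompute.aliyun.com/api
```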
Use
• On Linux/macOS, execute odpscmd in the bin directory; on Windows, execute odpscmd.bat.
• Supports executing a single SQL statement, running SQL files, uploading resources, uploading and downloading data (Tunnel), authorization, and more (see the examples below).
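For example, the client can run a statement inline or execute a script file; the -e and -f flags below follow common odpscmd usage, and the table and file names are placeholders:

```bash
# Run a single SQL statement (placeholder table name).
./bin/odpscmd -e "SELECT COUNT(*) FROM my_table;"

# Execute the statements in a SQL file.
./bin/odpscmd -f my_script.sql
```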
4. Introduction to MaxCompute Python SDK (PyODPS)
Install
• Installing on a PC requires a Python environment; execute pip install pyodps.
• DataWorks has built-in PyODPS support; submit Python tasks by creating a PyODPS node.
PyODPS initialization
from odps import ODPS
o = ODPS('your-access-id', 'your-secret-access-key', project='your-project', endpoint='your-end-point')
PyODPS interface
• Table interfaces: o.get_table, o.create_table
• SQL interfaces: o.run_sql (asynchronous execution), o.execute_sql (synchronous execution)
• PyODPS DataFrame: DataFrame(o.get_table(...)), o.get_table(...).to_df()
• Upload and download data: create_upload_session(), create_download_session()
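Below is a small sketch that strings these interfaces together: create a table, run SQL synchronously, and read the table back as a DataFrame. The table name, schema, and SQL are placeholders, and the tunnel step assumes the TableTunnel class from odps.tunnel.

```python
from odps import ODPS
from odps.tunnel import TableTunnel

o = ODPS('your-access-id', 'your-secret-access-key',
         project='your-project', endpoint='your-end-point')

# Table interface: create a table from a string schema (placeholder name/columns).
t = o.create_table('demo_table', 'id bigint, name string', if_not_exists=True)

# SQL interface: execute_sql blocks until the instance finishes.
o.execute_sql("INSERT INTO demo_table VALUES (1, 'a');")

# DataFrame interface: wrap the table and pull a few rows locally.
df = o.get_table('demo_table').to_df()
print(df.head(10))

# Tunnel interface: create upload/download sessions for bulk data movement.
tunnel = TableTunnel(o)
download = tunnel.create_download_session('demo_table')
print(download.count)  # number of records available for download
```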
5. Practical demonstration
Airflow practical demonstration
Please see the video for the hands-on walkthrough.
Azkaban practical demonstration
Please see the video for the hands-on walkthrough.
Kettle practical demonstration
Please see the video for the hands-on walkthrough.