Building Apache Superset on Amazon ECS Fargate

Summary:
Apache Superset is an open source business intelligence (BI) platform for data visualization and analysis, built on mainstream cloud-native technologies. It gives users a lightweight, intuitive, and customizable interface for connecting to a variety of data sources and for querying, orchestrating, and visualizing data. By combining Amazon Elastic Container Service (Amazon ECS), Amazon Cloud Map, and other managed services, we can quickly deploy Apache Superset onto a managed container cluster without installing, operating, or scaling any additional container orchestration or cluster management infrastructure. IT staff, data analysts, and other roles can then focus on the business itself and move more efficiently from data-driven insight to data-driven decision-making.

Apache Superset:
https://superset.apache.org/

Keywords:
Business Intelligence (BI), container technology

Key Services:
Amazon Elastic Container Service (ECS), Amazon Cloud Map, Amazon Elastic File System (EFS)

Preface

Today's BI platforms are evolving in two directions: data and analysis. On the data side, the platform connects to data sources through SQL interfaces such as ODBC and JDBC, processes the data with ETL, and loads it into a data warehouse so that cloud and on-premises data are managed together. On the analysis side, big data, AI/ML, NLP, and related technologies provide intelligent querying, in-depth analysis, and knowledge graph capabilities. As cloud computing has matured, platforms built on the managed services of cloud providers have gained clear advantages over traditional BI in business reliability, application flexibility, and integration with existing services. One-stop BI platforms built on cloud infrastructure are gradually becoming the trend.

For data ETL, data warehousing, data mining, and data visualization, Amazon Cloud Technology provides mature and reliable managed services (such as Amazon Glue, Amazon Redshift, and Amazon EMR) that help customers quickly build fully automated data processing pipelines and move rapidly from raw data acquisition to final business decisions. To meet the trend toward more intelligent and automated BI platforms, Amazon Cloud Technology also offers a full range of AI/ML products, from SaaS applications down to infrastructure, to better support vertical industries such as finance, e-commerce, and manufacturing and their specific applications.

Architecture overview

The functional modules of Apache Superset are developed and iterated independently in a loosely coupled manner. Communication between modules is implemented through Celery, and support for container technologies such as Docker and Kubernetes is relatively mature.

Celery:
https://docs.celeryproject.org/en/stable/index.html

The main modules and the technology stack used are as follows:

Web server (Gunicorn, Nginx, Apache)
Metadata database engine (MySQL, Postgres, MariaDB, etc.)
Message queue (Celery, Redis, RabbitMQ, SQS, etc.)
Results backend (S3, Redis, Memcached, etc.)
Caching layer (Memcached, Redis, etc.)
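As a rough illustration of how these modules are wired together, the sketch below shows the kind of settings a superset_config.py file might carry when Redis backs both the cache and the Celery queue. The host names, environment variable names, and default credentials are hypothetical placeholders and are not taken from the solution template; the setting names (SQLALCHEMY_DATABASE_URI, CACHE_CONFIG, CELERY_CONFIG) are the ones Superset reads.

```python
# superset_config.py (sketch). Host names, environment variable names, and
# default credentials are hypothetical placeholders, not values from the
# solution's CloudFormation template.
import os

REDIS_HOST = os.environ.get("REDIS_HOST", "redis.superset.local")
REDIS_PORT = int(os.environ.get("REDIS_PORT", "6379"))

# Metadata database engine: where Superset stores dashboards, users, etc.
SQLALCHEMY_DATABASE_URI = os.environ.get(
    "SUPERSET_DB_URI",
    "postgresql+psycopg2://superset:change-me@db.superset.local:5432/superset",
)

# Caching layer: Flask-Caching configuration backed by Redis.
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_DEFAULT_TIMEOUT": 300,
    "CACHE_KEY_PREFIX": "superset_",
    "CACHE_REDIS_URL": f"redis://{REDIS_HOST}:{REDIS_PORT}/1",
}


# Message queue: Celery uses Redis as broker and as task result store here.
class CeleryConfig:
    broker_url = f"redis://{REDIS_HOST}:{REDIS_PORT}/0"
    result_backend = f"redis://{REDIS_HOST}:{REDIS_PORT}/0"


CELERY_CONFIG = CeleryConfig
```

Swapping the queue for another broker would, under this layout, mostly come down to changing broker_url; the rest of the file stays the same.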

In keeping with cloud-native design, users can customize the back-end implementation to suit their needs. For the message queue, for example, users can keep the default Redis or connect to Amazon SQS for a more economical, reliable, and flexible queue. The container version currently provided by the community runs on a single node and uses host volumes for application initialization, data mounting, and similar functions, as shown in the following figure:

To stay as close as possible to the original Apache Superset architecture, the basic idea of the migration to Amazon Cloud Technology is to run the platform's relatively independent functional modules as services on Amazon ECS, with Amazon ECS Fargate handling resource scheduling and service health checks. The Amazon ECS services discover, address, and communicate with each other through the private DNS created by Amazon Cloud Map, and they share stored data through Amazon EFS for better availability, flexibility, and lower cost. For network planning, we follow Amazon Cloud Technology best practices. Inbound user traffic reaches the Superset service in the Amazon ECS cluster through an Amazon Application Load Balancer. Outbound traffic from the Superset service, such as connections to external data sources or downloads of sample data, goes through an Amazon NAT Gateway. Amazon VPC security groups control which ports are open to network traffic (for example, the Superset default port 8088). The software architecture of the overall solution is shown in the following figure:

Compared with the Apache community version, Apache Superset running on Amazon Cloud Technology has the following advantages:

Core modules (Superset, cache, database) are highly available.
Business data (metadata, query data, interaction data) is persisted.
Platform resources scale elastically, and users do not need to manage the underlying resource scheduling.
Drivers for SQL, PostgreSQL, Redshift, Athena, and ClickHouse data sources are pre-installed, so you can connect to existing data right after deployment.
A time series forecasting algorithm is pre-installed for predicting future trends from imported data.
Visual dashboards monitor Amazon cloud service metrics and overall application metrics in real time.

Deployment steps

All functional modules of Apache Superset in this solution are created and started through a predefined Amazon CloudFormation template. Use one of the links below to open the Amazon CloudFormation console (Beijing Region) and deploy the whole solution in one click. The code and implementation details are available in the GitHub repository.

Amazon CloudFormation:
https://aws.amazon.com/cn/cloudformation/
GitHub repository:
https://github.com/aws-quickstart/quickstart-aws-superset

Deploy to an existing Amazon VPC:
https://cn-north-1.console.amazonaws.cn/cloudformation/home?region=cn-north-1#/stacks/create/template?stackName=SupersetOnAWS&templateURL=https://aws-cn-quickstart-cn-north-1.s3.cn-north-1.amazonaws.com.cn/quickstart-aws-superset/templates/superset-entrypoint-existing-vpc.template.yaml

Deploy to a new Amazon VPC:
https://cn-north-1.console.amazonaws.cn/cloudformation/home?region=cn-north-1#/stacks/create/template?stackName=SupersetOnAWS&templateURL=https://aws-cn-quickstart-cn-north-1.s3.cn-north-1.amazonaws.com.cn/quickstart-aws-superset/templates/superset-entrypoint-new-vpc.template.yaml
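The two deployment links share one structure: a console address for the Region, a stack name, and the S3 location of the template. The sketch below rebuilds them from those parts, which can be handy if you want a different stack name. The function name quick_create_url is hypothetical; the Region, bucket path, template file names, and default stack name are copied from the links above.

```python
# Rebuild the one-click deployment links from their parts.
# quick_create_url is a hypothetical helper name; the values below are
# copied from the two links in this article.
REGION = "cn-north-1"
TEMPLATE_BASE = (
    "https://aws-cn-quickstart-cn-north-1.s3.cn-north-1.amazonaws.com.cn"
    "/quickstart-aws-superset/templates"
)


def quick_create_url(template_file: str, stack_name: str = "SupersetOnAWS") -> str:
    """Return the CloudFormation console link that pre-fills the stack form."""
    console = f"https://{REGION}.console.amazonaws.cn/cloudformation/home"
    template_url = f"{TEMPLATE_BASE}/{template_file}"
    return (
        f"{console}?region={REGION}#/stacks/create/template"
        f"?stackName={stack_name}&templateURL={template_url}"
    )


existing_vpc_link = quick_create_url("superset-entrypoint-existing-vpc.template.yaml")
new_vpc_link = quick_create_url("superset-entrypoint-new-vpc.template.yaml")
print(existing_vpc_link)
print(new_vpc_link)
```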

In the configuration options, set the username and password used to log in to the Superset console. "Pre-populate example dashboard" controls whether the official built-in sample datasets and dashboards are loaded, and "Install Prophet library" controls whether the Prophet package is installed to enable online forecasting on your data.

After the stack is created, go to the Amazon ECS console to check that all Superset components are running normally. As shown in the figure below, all services are in the Active state.
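The same check can be scripted instead of read off the console. The sketch below is one possible approach, assuming boto3 is installed and credentials are configured; the helper names and the sample service names are hypothetical. The first function works on the response shape returned by the ECS describe_services call, so it can be tried without any AWS access.

```python
from typing import List


def unhealthy_services(describe_response: dict) -> List[str]:
    """Return names of services that are not ACTIVE or are below desired count."""
    problems = []
    for svc in describe_response.get("services", []):
        active = svc.get("status") == "ACTIVE"
        converged = svc.get("runningCount", 0) >= svc.get("desiredCount", 0)
        if not (active and converged):
            problems.append(svc.get("serviceName", "<unknown>"))
    return problems


def check_cluster(cluster: str, region: str = "cn-north-1") -> List[str]:
    """Query ECS for the given cluster (hypothetical helper, needs boto3)."""
    import boto3  # imported lazily so unhealthy_services works without boto3

    ecs = boto3.client("ecs", region_name=region)
    # First page only; use a paginator for clusters with many services.
    arns = ecs.list_services(cluster=cluster)["serviceArns"]
    problems = []
    # describe_services accepts at most 10 services per call.
    for i in range(0, len(arns), 10):
        resp = ecs.describe_services(cluster=cluster, services=arns[i:i + 10])
        problems.extend(unhealthy_services(resp))
    return problems


# Offline example with made-up service names and counts.
sample = {
    "services": [
        {"serviceName": "superset", "status": "ACTIVE", "desiredCount": 1, "runningCount": 1},
        {"serviceName": "redis", "status": "ACTIVE", "desiredCount": 1, "runningCount": 0},
    ]
}
print(unhealthy_services(sample))
```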

Click the service whose name contains SupersetService. In the Load Balancing section, you can view the target group of the attached Amazon ALB. The default open port is 8088, and all later access to the Superset pages goes through this port. In the Network Access section, you can view basic information about the Amazon VPC where Superset runs, including subnets and security groups; in the data connection step that follows, the Amazon Redshift cluster must be created in this same Amazon VPC. In the Service discovery section, you can view the internal DNS name of the service. Interested readers can open the Amazon Route 53 console and look at the corresponding domain name and record name to see how each service is discovered, addressed, and reached.
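To make the service discovery step concrete: in an Amazon Cloud Map private DNS namespace, each service is reachable under a name of the form service.namespace, and other tasks in the Amazon VPC find it with an ordinary DNS lookup. The sketch below shows that lookup; the namespace superset.local and the service name redis are hypothetical examples, not the names the template creates, and the lookup only succeeds from inside the VPC.

```python
import socket
from typing import List


def discovery_name(service: str, namespace: str) -> str:
    """Build the private DNS name of a service in a Cloud Map namespace."""
    return f"{service}.{namespace}"


def resolve(name: str, port: int) -> List[str]:
    """Return the IP addresses behind a DNS name, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except OSError:
        return []
    return sorted({info[4][0] for info in infos})


# Hypothetical names; replace with the values shown in the ECS console.
redis_name = discovery_name("redis", "superset.local")
print(redis_name)
```

Calling resolve(redis_name, 6379) from a task in the same Amazon VPC would return the private IP addresses currently registered for that service.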

Next, open a browser and enter the Superset login address shown on the stack's Outputs tab.

Enter the username and password that you configured when you created the stack, and you can start using Superset for data analysis.
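If the login page does not load, a quick scripted probe can tell whether the web server is answering at all. Superset serves a lightweight /health endpoint for this purpose. The sketch below uses only the standard library; the helper names and the example address are hypothetical, so substitute the address from the stack's Outputs tab.

```python
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Append Superset's health endpoint to the base address."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


# Hypothetical address; use the one from the CloudFormation Outputs tab.
print(health_url("http://example-alb.cn-north-1.elb.amazonaws.com.cn:8088/"))
```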

Connecting data

Next, create an Amazon Redshift cluster. Creating the cluster, importing data, and related steps are not covered here; see the official documentation for details. Note that the Amazon Redshift cluster must be in the same Amazon VPC as the Amazon ECS cluster running Superset, so that Superset can reach it over the internal Amazon VPC network without public access being enabled on Amazon Redshift.

Official documentation:
https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html

After the Amazon Redshift cluster is created, note its endpoint address.

Next, switch to the Superset interface, click Data, and select Databases from the drop-down menu to add a connection. The URL format is redshift+psycopg2://<username>:<password>@<endpoint>:5439/<database>. For the URL formats of other data sources, see the official Superset documentation.

Superset official documentation:
https://superset.apache.org/docs/databases/redshift
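Passwords that contain characters such as @ or / will break the connection URL unless they are percent-encoded. The sketch below assembles the Redshift URL with the username and password encoded; the function name and all sample values (user, password, endpoint, database) are hypothetical.

```python
from urllib.parse import quote_plus


def redshift_uri(user: str, password: str, endpoint: str,
                 database: str, port: int = 5439) -> str:
    """Build a SQLAlchemy URL for Redshift with credentials percent-encoded."""
    return (
        f"redshift+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{endpoint}:{port}/{database}"
    )


# Hypothetical sample values for illustration only.
uri = redshift_uri(
    user="awsuser",
    password="p@ss/word",
    endpoint="example-cluster.abc123.cn-north-1.redshift.amazonaws.com.cn",
    database="dev",
)
print(uri)
```

The resulting string is what goes into the SQLAlchemy URI field of the Superset database form.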

After the connection succeeds, click Datasets and check that the corresponding dataset is listed. The example dataset here is "daily".

Click the dataset to edit it. For Visualization Type, select Time-series Chart; for Time Range, select No filter; under Query, set Column to the value to be displayed and Aggregate to Sum. Click Run, and the data is plotted as a time series, while the Data section shows a sample of the data. Adjust these settings to match your actual data. Note that the last date on the horizontal time axis is 2014-02-22.

If you set Install Prophet library to yes when creating Superset, you can also try Superset's built-in time series forecasting. Click Predictive Analytics, check Enable Forecast, leave the other options at their defaults, and click Run again.

Superset has now predicted the trend for the next 10 days from the earlier data (the last date on the horizontal time axis is 2014-03-04) and drawn the confidence interval. The purple dots are the original discrete time series points, the solid purple line is the fitted value, and the light-colored band around the solid line is the confidence interval, that is, the plausible upper and lower bounds. The inputs to this operation are raw data containing timestamps and values plus the length of the period to forecast; the outputs are the future time series trend and the corresponding confidence interval.
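The input and output shape of this operation can be illustrated with a much simpler model than the one Superset uses. The sketch below is not Prophet: it fits a straight line by least squares and draws a crude band of plus or minus 1.96 residual standard deviations around it. It is only meant to show timestamps and values going in and dated forecasts with lower and upper bounds coming out. The sample values are made up; only the end date 2014-02-22 and the 10-day horizon come from the example above.

```python
from datetime import date, timedelta
from statistics import mean, stdev
from typing import List, Tuple

Point = Tuple[date, float]
Forecast = Tuple[date, float, float, float]  # (day, yhat, lower, upper)


def forecast_linear(history: List[Point], periods: int, z: float = 1.96) -> List[Forecast]:
    """Extend a daily series with a least-squares line and a crude error band."""
    n = len(history)
    xs = list(range(n))
    ys = [value for _, value in history]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    # Spread of the historical points around the fitted line.
    sigma = stdev(y - (intercept + slope * x) for x, y in zip(xs, ys))
    last_day = history[-1][0]
    result = []
    for step in range(1, periods + 1):
        yhat = intercept + slope * (n - 1 + step)
        result.append((last_day + timedelta(days=step), yhat, yhat - z * sigma, yhat + z * sigma))
    return result


# Made-up daily values ending on 2014-02-22, the last date in the chart.
start = date(2014, 2, 13)
values = [10, 12, 11, 13, 14, 13, 15, 16, 15, 17]
history = [(start + timedelta(days=i), v) for i, v in enumerate(values)]

forecast = forecast_linear(history, periods=10)
for day, yhat, lower, upper in forecast:
    print(day.isoformat(), round(yhat, 2), round(lower, 2), round(upper, 2))
```

The last forecast row falls on 2014-03-04, ten days after the final observation, which matches the end of the time axis in the forecast chart.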

Conclusion

With Amazon Elastic Container Service (Amazon ECS) and its serverless compute engine, Fargate, we can smoothly migrate existing container workloads or newly developed cloud-native workloads to the Amazon Cloud Technology platform. Isolating each software module by design improves the security and reliability of the service, with no servers to provision or manage. Amazon Route 53, Amazon Cloud Map, and Amazon Elastic File System handle service communication and data storage, and Amazon CloudFormation handles deployment and scaling. Together they lower the barrier to using the BI platform, so that IT staff, data analysts, and other roles can focus on the business itself and move more efficiently from data-driven insight to data-driven decision-making.

Author of this article

Amazon Cloud Technology Solutions Architect

An enthusiast of open source projects and emerging technologies, responsible for consulting on, building, and implementing Amazon Cloud Technology solutions. He has nearly ten years of experience in R&D and technical team management, and his technical fields include serverless, containers, and AI/ML.
