Quick Start: Set Up and Run Hive Metastore and Presto on Your Computer in 10 Minutes

Author: Bin Fan (范斌), founding member of Alluxio and VP of Open Source

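For beginners:
This tutorial walks beginners through setting up Presto and Hive Metastore on a local machine to query data stored in S3.
Presto is the SQL engine that plans and executes queries, S3 stores the table partition files, and Hive Metastore is the catalog service that gives Presto access to table schemas and location information.
The tutorial shows, step by step, how to install and configure Presto and Hive Metastore so that you can query data stored in a public S3 bucket.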

Step 1: Download and start Hive Metastore

This tutorial uses apache-hive-2.3.7-bin.tar.gz. Download the Hive binary tarball and unpack it.

$ cd /path/to/tutorial/root
$ wget https://downloads.apache.org/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz
$ tar -zxf apache-hive-2.3.7-bin.tar.gz
$ cd apache-hive-2.3.7-bin
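Before unpacking a downloaded archive, it can be worth checking its integrity. The sketch below is an optional addition, not part of the original steps: verify_sha256 is a hypothetical helper that compares a file's SHA-256 digest with an expected value that you obtain yourself from the project's download page. It assumes the sha256sum utility (GNU coreutils) is installed.

```shell
# Hypothetical helper: succeed only if FILE's SHA-256 digest equals EXPECTED.
# Assumes sha256sum (GNU coreutils) is available on the PATH.
verify_sha256() {
  file=$1
  expected=$2
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  actual=$(sha256sum "$file" | awk '{print $1}')
  # Normalize both digests to lowercase, since published digests vary in case.
  actual=$(printf '%s' "$actual" | tr 'A-F' 'a-f')
  expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

A call would look like verify_sha256 apache-hive-2.3.7-bin.tar.gz "$EXPECTED_DIGEST", where EXPECTED_DIGEST is a placeholder for the value you looked up.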

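We only need to start Hive Metastore, which serves catalog information such as table schemas and partition locations to Presto.

If this is the first time you start Hive Metastore, prepare the configuration files and environment, and initialize a new Metastore: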

$ export HIVE_HOME=`pwd`
$ cp conf/hive-default.xml.template conf/hive-site.xml
$ mkdir -p hcatalog/var/log/
$ bin/schematool -dbType derby -initSchema

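Hive must be configured to access S3, which you can do by adding the following lines to conf/hive-env.sh. Hive needs the corresponding jars to read files with "s3a://" addresses, and it needs AWS credentials to access an S3 bucket (even a public one). The paths below assume that HADOOP_HOME points to your Hadoop installation.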

export HIVE_AUX_JARS_PATH=${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-core-1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-s3-1.10.6.jar:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-aws-2.8.4.jar
export AWS_ACCESS_KEY_ID=<Your AWS Access Key>
export AWS_SECRET_ACCESS_KEY=<Your AWS Secret Key>
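Because HIVE_AUX_JARS_PATH is a colon-separated list, a single mistyped path is easy to miss. The following sketch, assuming a POSIX shell such as sh or bash, uses a hypothetical helper named missing_jars to print every entry that does not exist as a file.

```shell
# Hypothetical helper: print each entry of a colon-separated jar list
# that is not a regular file; return non-zero if any entry is missing.
missing_jars() {
  rc=0
  old_ifs=$IFS
  IFS=:
  # With IFS set to ":", the unquoted expansion splits the list on colons.
  for jar in $1; do
    if [ ! -f "$jar" ]; then
      printf 'missing: %s\n' "$jar"
      rc=1
    fi
  done
  IFS=$old_ifs
  return $rc
}
```

After sourcing conf/hive-env.sh, running missing_jars "$HIVE_AUX_JARS_PATH" lists the jars that still need to be downloaded.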

If your Hadoop distribution does not include the jars listed in HIVE_AUX_JARS_PATH, you can download them from Maven Central:

aws-java-sdk-core-1.10.6.jar, aws-java-sdk-s3-1.10.6.jar, hadoop-aws-2.8.4.jar

Start Hive Metastore. It runs in the background and listens on port 9083 (the default port).

$ hcatalog/sbin/hcat_server.sh start
Started metastore server init, testing if initialized correctly...
Metastore initialized successfully on port[9083].

To verify that the Metastore is running, check the Hive Metastore logs under hcatalog/var/log/.
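One quick way to look at the log is to print the tail of the most recently modified file in that directory. This is a minimal sketch: the log file names are not given in this tutorial, so the hypothetical helper latest_log simply picks the newest entry.

```shell
# Hypothetical helper: print the path of the most recently modified
# entry in a log directory (prints nothing if the directory is empty).
latest_log() {
  dir=$1
  # ls -t sorts by modification time, newest first.
  newest=$(ls -t "$dir" | head -n 1)
  if [ -n "$newest" ]; then
    printf '%s/%s\n' "$dir" "$newest"
  fi
}

# Example use from the Hive directory (commented out so that the
# snippet has no side effects when sourced):
# tail -n 50 "$(latest_log hcatalog/var/log)"
```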

Step 2: Download and start the Presto server

This tutorial uses Presto server version 0.237.1 as an example. Download the prebuilt server tarball and unpack it.

$ cd /path/to/tutorial/root
$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.237.1/presto-server-0.237.1.tar.gz
$ tar -zxf presto-server-0.237.1.tar.gz
$ cd presto-server-0.237.1

Create a configuration file, etc/config.properties, with the basic Presto configuration:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080

Create etc/jvm.config with the following JVM settings:

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Create etc/node.properties with the following lines:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/presto/data
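The three files above can also be written in one go from a script. The sketch below is one way to do that with here-documents. The function name write_presto_config is hypothetical, and the file contents are copied from the listings above.

```shell
# Hypothetical helper: write the three Presto config files shown above
# into the directory given as the first argument (for example "etc").
write_presto_config() {
  etc_dir=$1
  mkdir -p "$etc_dir" || return 1

  # Single-node setup: this server is both coordinator and worker.
  cat > "$etc_dir/config.properties" <<'EOF'
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
EOF

  # JVM options, one per line.
  cat > "$etc_dir/jvm.config" <<'EOF'
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
EOF

  # Node identity and data directory.
  cat > "$etc_dir/node.properties" <<'EOF'
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/presto/data
EOF
}
```

Run from the presto-server-0.237.1 directory, write_presto_config etc creates the same files as the manual steps.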

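Finally, configure the Presto Hive connector in etc/catalog/hive.properties so that it points to the Hive Metastore service you just started. The AWS credentials must be entered here again so that Presto can read the input files from S3.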

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Access Key>
hive.s3.aws-secret-key=<Your AWS Secret Key>
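To avoid typing the credentials a second time, the catalog file could be generated from the same environment variables that were exported for Hive. This is a sketch under that assumption: write_hive_catalog is a hypothetical helper, and it expects AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to be set in the calling shell.

```shell
# Hypothetical helper: write etc/catalog/hive.properties under the Presto
# directory given as the first argument, taking credentials from the
# environment instead of hard-coding them.
write_hive_catalog() {
  catalog_dir=$1/etc/catalog
  # Refuse to write a file with empty credentials.
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo 'AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set' >&2
    return 1
  fi
  mkdir -p "$catalog_dir" || return 1
  # Unquoted EOF so that the shell expands the two variables.
  cat > "$catalog_dir/hive.properties" <<EOF
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=$AWS_ACCESS_KEY_ID
hive.s3.aws-secret-key=$AWS_SECRET_ACCESS_KEY
EOF
  # The file holds a secret, so restrict it to the current user.
  chmod 600 "$catalog_dir/hive.properties"
}
```

Called as write_hive_catalog . from the Presto server directory, it produces the same four lines as the listing above, with the placeholders filled in.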

Start the Presto server in the background:

$ ./bin/launcher start

To verify that the Presto server is running, open http://localhost:8080 in a browser and check the server status in the web UI.
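The same check can be scripted without a browser. The sketch below assumes that the server exposes a JSON status document at /v1/info containing a "starting" field, and that curl is installed. Both function names are hypothetical, and you should confirm the endpoint against the documentation for your Presto version.

```shell
# Succeeds if the JSON on stdin reports that startup has finished,
# i.e. it contains "starting": false (an assumed field name).
info_says_ready() {
  grep -Eq '"starting"[[:space:]]*:[[:space:]]*false'
}

# Hypothetical helper: fetch <base_url>/v1/info and check the field.
# Fails if the server is unreachable or still starting. Requires curl.
presto_ready() {
  base_url=${1:-http://localhost:8080}
  curl --silent --fail --max-time 5 "$base_url/v1/info" | info_says_ready
}
```

A startup script could then loop on presto_ready with a short sleep until it succeeds.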

Step 3: Start the Presto CLI (the Presto command-line tool) and run queries

Download the Presto command-line tool, Presto CLI, which is a single standalone executable file.

$ cd /path/to/tutorial/root
$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.237.1/presto-cli-0.237.1-executable.jar
$ mv presto-cli-0.237.1-executable.jar presto
$ chmod +x presto

Connect to the Presto server started in the previous step:

$ ./presto --server localhost:8080  --catalog hive --debug

Use the default schema:

presto> use default;
USE

Create a new table in the default schema based on the files in S3. The table metadata is sent to Hive Metastore.

presto:default> CREATE TABLE reason (
  r_reason_sk integer,
  r_reason_id varchar,
  r_reason_desc varchar
) WITH (
  external_location = 's3a://apc999/presto-tutorial/example-reason',
  format = 'PARQUET'
);
CREATE TABLE

Scan the newly created table:

presto:default> SELECT * FROM reason limit 3;
 r_reason_sk |   r_reason_id    |     r_reason_desc      
-------------+------------------+------------------------
           1 | AAAAAAAABAAAAAAA | Package was damaged    
           2 | AAAAAAAACAAAAAAA | Stopped working        
           3 | AAAAAAAADAAAAAAA | Did not get it on time 
(3 rows)

Query 20200703_074406_00011_8vq8w, FINISHED, 1 node
http://localhost:8080/ui/query.html?20200703_074406_00011_8vq8w
Splits: 18 total, 18 done (100.00%)
CPU Time: 0.5s total,     6 rows/s, 2.06KB/s, 27% active
Per Node: 0.1 parallelism,     0 rows/s,   279B/s
Parallelism: 0.1
Peak User Memory: 0B
Peak Total Memory: 219B
Peak Task Total Memory: 219B
0:04 [3 rows, 1002B] [0 rows/s, 279B/s]
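The same query can also be run non-interactively, which is convenient for scripts. The sketch below wraps the CLI's --execute option in a hypothetical helper named run_query. PRESTO_CLI and PRESTO_SERVER are assumed variable names that default to the values used in this tutorial.

```shell
# Hypothetical helper: run one SQL statement through the Presto CLI and exit.
# PRESTO_CLI and PRESTO_SERVER are assumed names; override them as needed.
run_query() {
  "${PRESTO_CLI:-./presto}" \
    --server "${PRESTO_SERVER:-localhost:8080}" \
    --catalog hive \
    --schema default \
    --execute "$1"
}
```

For example, run_query 'SELECT * FROM reason LIMIT 3' sends the query from above to the server and returns to the shell when it finishes.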

Step 4: Stop the servers

$ cd /path/to/tutorial/root
$ presto-server-0.237.1/bin/launcher stop
$ apache-hive-2.3.7-bin/hcatalog/sbin/hcat_server.sh stop

Summary:

In this tutorial, we showed how to set up Presto and Hive Metastore to run SQL queries on data stored in a public S3 bucket. We hope you find it useful.
