A Complete Reference of Common Commands for Big Data Development

Linux (vi/vim)
Normal mode

Insert mode

Command-line mode

Compression and decompression
gzip/gunzip
(1) Compresses files only, not directories.

(2) Does not keep the original file.

Compress with gzip: gzip hello.txt

Decompress with gunzip: gunzip hello.txt.gz
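
The two properties above can be checked with a short round trip. This is a minimal sketch that assumes gzip is installed; the scratch directory and the file content are made up for illustration.

```shell
#!/bin/bash
# Work in a throwaway directory (hypothetical demo data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "hello" > hello.txt

gzip hello.txt                 # creates hello.txt.gz and removes hello.txt
[ -e hello.txt ] || echo "hello.txt was replaced by hello.txt.gz"

gunzip hello.txt.gz            # restores hello.txt and removes hello.txt.gz
cat hello.txt

# To keep the original, write the compressed stream to stdout instead.
gzip -c hello.txt > hello.txt.gz
ls
```

Writing to stdout with -c is one way to keep the source file; it works because gzip then never touches hello.txt itself.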

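zip/unzip
Can compress directories (with -r) and keeps the source files.

Compress with zip (packs hello.txt and world.txt into hello.zip): zip hello.zip hello.txt world.txt

Extract with unzip: unzip hello.zip

Extract into a specific directory: unzip hello.zip -d /opt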

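tar
Archive and compress several files: tar -zcvf hello.tar.gz hello.txt world.txt

Archive and compress a directory: tar -zcvf hello.tar.gz opt/

Extract into the current directory: tar -zxvf hello.tar.gz

Extract into a specific directory: tar -zxvf hello.tar.gz -C /opt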
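
The tar commands can be tried end to end without touching /opt. The sketch below is illustrative only: the scratch directory, the out directory and the file contents are assumptions, not part of the original commands.

```shell
#!/bin/bash
# Hypothetical demo files in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "hello" > hello.txt
echo "world" > world.txt

# -z gzip, -c create, -f archive name; the archive name must come right after -f.
tar -zcf hello.tar.gz hello.txt world.txt

# -t lists the archive contents without extracting anything.
tar -ztf hello.tar.gz

# -C extracts into another directory, which must already exist.
mkdir out
tar -zxf hello.tar.gz -C out
cat out/hello.txt out/world.txt
```

Listing with -t before extracting is a cheap way to confirm what an archive contains.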

RPM
Query installed packages: rpm -qa | grep firefox

Uninstall a package:

rpm -e xxxxxx

rpm -e --nodeps xxxxxx (skip dependency checks)

Install a package:

rpm -ivh xxxxxx.rpm

rpm -ivh --nodeps xxxxxx.rpm (--nodeps skips dependency checks)

Shell
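Input/output redirection

Writing scripts

Hadoop
Startup commands

hadoop fs / hdfs dfs commands

yarn commands

ZooKeeper
Startup commands

Basic operations

Four-letter-word commands

Kafka
Note: only one host is written in each command here. You can also invoke the scripts as ./bin/xx.sh (for example ./bin/kafka-topics.sh).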

List all topics on the current server
kafka-topics --zookeeper xxxxxx:2181 --list --exclude-internal

Notes:

--exclude-internal: excludes Kafka's internal topics

Example: --exclude-internal --topic "test_.*"
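Create a topic
kafka-topics --zookeeper xxxxxx:2181 --create \
--replication-factor 1 \
--partitions 1 \
--topic topic_name

Notes:

--topic: the topic name

--replication-factor: the number of replicas (1 here; it cannot exceed the number of brokers)

--partitions: the number of partitions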
Delete a topic
Note: delete.topic.enable=true must be set in server.properties; otherwise the topic is only marked for deletion.

kafka-topics --zookeeper xxxxxx:2181 --delete --topic topic_name
Producer
kafka-console-producer --broker-list xxxxxx:9092 --topic topic_name

Optional: --property parse.key=true (send messages with keys)
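Consumer
kafka-console-consumer --bootstrap-server xxxxxx:9092 --topic topic_name

Optional flags:

--from-beginning: read all existing data in the topic from the start

--whitelist '.*': consume from all topics (used in place of --topic)

--property print.key=true: print the key along with each message

--partition 0: consume from a specific partition

--offset: start consuming from a specific offset (requires --partition)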
Describe a topic
kafka-topics --zookeeper xxxxxx:2181 --describe --topic topic_name
Change the number of partitions (it can only be increased)
kafka-topics --zookeeper xxxxxx:2181 --alter --topic topic_name --partitions 6
Describe a consumer group
kafka-consumer-groups --bootstrap-server xxxxxx:9092 --describe --group group_name
Delete a consumer group
kafka-consumer-groups --bootstrap-server xxxxxx:9092 --delete --group group_name
Reset offsets
kafka-consumer-groups --bootstrap-server xxxxxx:9092 --group group_name \
--reset-offsets --all-topics --to-latest --execute
Leader re-election
Re-elect the leader of one partition of one topic using the PREFERRED (preferred replica) election type

kafka-leader-election --bootstrap-server xxxxxx:9092 \
--topic topic_name --election-type PREFERRED --partition 0
Re-elect the leaders of all partitions of all topics using the PREFERRED (preferred replica) election type

kafka-leader-election --bootstrap-server xxxxxx:9092 \
--election-type PREFERRED --all-topic-partitions
Show the Kafka version
kafka-configs --bootstrap-server xxxxxx:9092 \
--describe --version
Add, modify and delete configuration

Add or modify dynamic topic configuration

kafka-configs --bootstrap-server xxxxxx:9092 \
--alter --entity-type topics --entity-name topic_name \
--add-config file.delete.delay.ms=222222,retention.ms=999999
Delete dynamic topic configuration

kafka-configs --bootstrap-server xxxxxx:9092 \
--alter --entity-type topics --entity-name topic_name \
--delete-config file.delete.delay.ms,retention.ms
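Pull messages continuously in batches
Consume at most 10 messages in this run (without --max-messages, consumption continues indefinitely)

kafka-verifiable-consumer --bootstrap-server xxxxxx:9092 \
--group group_name \
--topic topic_name --max-messages 10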
Delete messages from a specific partition
Delete the messages in one partition of a topic that come before offset 1024

JSON file offset-json-file.json:

{
    "partitions": [
        {
            "topic": "topic_name",
            "partition": 0,
            "offset": 1024
        }
    ],
    "version": 1
}

kafka-delete-records --bootstrap-server xxxxxx:9092 \
--offset-json-file offset-json-file.json
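
When the topic, partition and offset change from run to run, the JSON file can be generated rather than edited by hand. The following is only a sketch: the variable names and the scratch directory are assumptions, and the values are the placeholders used above.

```shell
#!/bin/bash
# Hypothetical values; replace them with the real topic, partition and offset.
topic="topic_name"
partition=0
offset=1024

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write the JSON document that kafka-delete-records reads via --offset-json-file.
cat > offset-json-file.json <<EOF
{
    "partitions": [
        { "topic": "${topic}", "partition": ${partition}, "offset": ${offset} }
    ],
    "version": 1
}
EOF

cat offset-json-file.json
```

The generated file can then be passed to kafka-delete-records exactly as in the command above.
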
Show broker disk usage
Show disk usage for specific topics

kafka-log-dirs --bootstrap-server xxxxxx:9092 \
--describe --topic-list topic1,topic2
Show disk usage for a specific broker

kafka-log-dirs --bootstrap-server xxxxxx:9092 \
--describe --topic-list topic1 --broker-list 0
Hive
Startup

Script to start the Hive services (metastore and hiveserver2) and shut them down gracefully

Start: hive.sh start
Stop: hive.sh stop
Restart: hive.sh restart
Status: hive.sh status
The script is as follows

#!/bin/bash

HIVE_LOG_DIR=$HIVE_HOME/logs

mkdir -p $HIVE_LOG_DIR

# Check whether a process is running normally: $1 is the process name, $2 is the port it listens on
function check_process()
{
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}

function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    cmd=$cmd" sleep 4; hdfs dfsadmin -safemode wait >/dev/null 2>&1"
    [ -z "$metapid" ] && eval $cmd || echo "Metastore service is already running"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2 service is already running"
}

function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    [ "$metapid" ] && kill $metapid || echo "Metastore service is not running"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill $server2pid || echo "HiveServer2 service is not running"
}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    check_process HiveMetastore 9083 >/dev/null && echo "Metastore service is running normally" || echo "Metastore service is not running normally"
    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 service is running normally" || echo "HiveServer2 service is not running normally"
    ;;
*)
    echo Invalid Args!
    echo 'Usage: '$(basename $0)' start|stop|restart|status'
    ;;
esac
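Common interactive commands

SQL (special cases)

Built-in functions
(1) NVL

Substitutes a value for NULL. The format is NVL(value, default_value): if value is NULL, NVL returns default_value; otherwise it returns value. If both arguments are NULL, it returns NULL.

select nvl(col, 0) from xxx;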
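(2) Rows to columns

(3) Column to rows (one column into multiple rows)

SPLIT(str, separator): splits a string on the given separator and returns an array of strings.

EXPLODE(col): expands a complex array or map in a Hive column into multiple rows.

LATERAL VIEW

Usage:

LATERAL VIEW udtf(expression) tableAlias AS columnAlias
Explanation: lateral view is used together with UDTFs such as explode, often applied to the result of split. It turns one row of data into multiple rows, and the expanded rows can then be aggregated.

lateral view first applies the UDTF to each row of the base table, and the UDTF expands that row into one or more rows. lateral view then joins the results back to the source row, producing a virtual table that can be given a table alias.

Prepare test data

SQL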

SELECT movie, category_name
FROM movie_info
LATERAL VIEW
explode(split(category, ",")) movie_info_tmp AS category_name;

Result

《功勋》    documentary
《功勋》    drama
《战狼2》   war
《战狼2》   action
《战狼2》   disaster
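
To see what explode(split(...)) does without a Hive cluster, the same transformation can be imitated with awk. This is a rough sketch, not Hive itself: the function name explode_categories is hypothetical, and the two input rows are an assumption modeled on the result above, since the test data is not shown.

```shell
#!/bin/bash
# Emulate LATERAL VIEW explode(split(category, ",")): one output row per category.
# Input: tab-separated movie and comma-separated category list.
explode_categories() {
  awk -F'\t' '{
    n = split($2, parts, ",")          # split(category, ",")
    for (i = 1; i <= n; i++)           # explode: one row per array element
      print $1 "\t" parts[i]
  }'
}

# Assumed source rows, two movies with 2 and 3 categories.
printf '%s\t%s\n' \
  "《功勋》" "documentary,drama" \
  "《战狼2》" "war,action,disaster" | explode_categories
```

Each input row is repeated once per element of its category list, which is the join that lateral view performs between the source row and the UDTF output.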

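Window functions
(1) OVER()

Specifies the data window over which the analytic function operates. The window can change from row to row.

(2) CURRENT ROW (the current row)

n PRECEDING: the n rows before the current row

n FOLLOWING: the n rows after the current row
(3) UNBOUNDED (no boundary)

UNBOUNDED PRECEDING: no boundary before, meaning the window starts at the first row of the partition

UNBOUNDED FOLLOWING: no boundary after, meaning the window ends at the last row of the partition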
SQL example: aggregate from the start of the partition to the current row

select
sum(money) over(partition by user_id order by pay_time rows between UNBOUNDED PRECEDING and current row)
from or_order;
SQL example: aggregate the current row and the previous row

select
sum(money) over(partition by user_id order by pay_time rows between 1 PRECEDING and current row)
from or_order;
SQL example: aggregate the current row, the previous row and the next row

select
sum(money) over(partition by user_id order by pay_time rows between 1 PRECEDING and 1 FOLLOWING)
from or_order;
SQL example: aggregate the current row and all following rows

select
sum(money) over(partition by user_id order by pay_time rows between current row and UNBOUNDED FOLLOWING)
from or_order;
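
The first of these frames (from the start of the partition to the current row) is a per-user running total. The awk sketch below imitates it outside Hive; the function name running_sum and the sample rows are hypothetical.

```shell
#!/bin/bash
# Running total per user, emulating
#   sum(money) over (partition by user_id order by pay_time
#                    rows between unbounded preceding and current row)
# Input columns, comma-separated: user_id, pay_time, money.
running_sum() {
  sort -t, -k1,1 -k2,2 | awk -F, -v OFS=, '{
    total[$1] += $3                    # accumulate within the partition
    print $1, $2, $3, total[$1]
  }'
}

# Hypothetical sample data, deliberately out of order.
printf '%s\n' \
  "u1,2021-11-02,20" \
  "u1,2021-11-01,10" \
  "u2,2021-11-01,5" \
  "u1,2021-11-03,30" | running_sum
```

The sort step plays the role of partition by user_id order by pay_time, and the per-user accumulator plays the role of the frame.
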
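(4) LAG(col, n, default_val)

The value from the n-th row before the current row, or default_val if there is no such row.

(5) LEAD(col, n, default_val)

The value from the n-th row after the current row, or default_val if there is no such row.

SQL example: list each user's purchases together with the previous and next purchase times

select
user_id, pay_time, money,
lag(pay_time, 1, '1970-01-01') over(partition by user_id order by pay_time) prev_time,
lead(pay_time, 1, '1970-01-01') over(partition by user_id order by pay_time) next_time
from or_order;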
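
The behavior of LAG and LEAD at partition boundaries is easier to see on a few rows. The following awk sketch imitates both with n = 1; the function name lag_lead and the sample rows are hypothetical, and the default value is the one used in the query above.

```shell
#!/bin/bash
# Emulate lag(pay_time, 1, default) and lead(pay_time, 1, default) per user.
# Input columns, comma-separated: user_id, pay_time.
lag_lead() {
  sort -t, -k1,1 -k2,2 | awk -F, -v OFS=, -v def="1970-01-01" '
    { user[NR] = $1; ts[NR] = $2 }
    END {
      for (i = 1; i <= NR; i++) {
        # Look one row back and one row ahead, but only inside the same partition.
        prev = (i > 1  && user[i-1] == user[i]) ? ts[i-1] : def
        nxt  = (i < NR && user[i+1] == user[i]) ? ts[i+1] : def
        print user[i], ts[i], prev, nxt
      }
    }'
}

# Hypothetical sample data: two purchases for u1, one for u2.
printf '%s\n' \
  "u1,2021-11-05" \
  "u1,2021-11-01" \
  "u2,2021-11-03" | lag_lead
```

The first row of each user falls back to the default for the previous time, and the last row falls back to it for the next time.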

(6) FIRST_VALUE(col, true/false)

The first value in the current window. If the second argument is true, NULL values are skipped.

(7) LAST_VALUE(col, true/false)

The last value in the current window. If the second argument is true, NULL values are skipped.

SQL example: find each user's first and last purchase time in each month

select
FIRST_VALUE(pay_time) over(
    partition by user_id, month(pay_time) order by pay_time
    rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
) first_time,
LAST_VALUE(pay_time) over(
    partition by user_id, month(pay_time) order by pay_time
    rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
) last_time
from or_order;

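(8) NTILE(n)

Distributes the rows of an ordered window into the specified number of groups. The groups are numbered starting from 1, and for each row NTILE returns the number of the group it belongs to. (In other words, it cuts the ordered data into n slices and returns the slice number of the current row.)

SQL example: find the orders in the earliest 25% by time

select * from (
select user_id, pay_time, money,
ntile(4) over(order by pay_time) sorted
from or_order
) t
where sorted = 1;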
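
How NTILE assigns bucket numbers can be sketched in awk. This is an approximation under one stated assumption: when the rows do not divide evenly, the first buckets each receive one extra row, which is the standard SQL behavior and is assumed here to match Hive. The function name ntile and the sample values are hypothetical.

```shell
#!/bin/bash
# Emulate ntile(n) over (order by pay_time): split the ordered rows into n buckets.
# Output: each input row followed by a comma and its bucket number.
ntile() {
  local n=$1
  sort | awk -v n="$n" '
    { row[NR] = $0 }
    END {
      base = int(NR / n); extra = NR % n
      i = 1
      for (b = 1; b <= n; b++) {
        size = base + (b <= extra ? 1 : 0)   # first buckets absorb the remainder
        for (k = 0; k < size; k++) { print row[i] "," b; i++ }
      }
    }'
}

# Hypothetical sample: eight pay_time values split into four buckets of two rows.
printf '2021-11-0%s\n' 1 2 3 4 5 6 7 8 | ntile 4
```

Filtering the output on bucket 1 corresponds to the where sorted = 1 condition in the query.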

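The four "By" clauses
(1) Order By

Global ordering; uses a single reducer.

(2) Sort By

Ordering within each partition, that is, within each reducer.

(3) Distribute By

Similar to the Partitioner in MapReduce: it decides which partition each row goes to. Used together with Sort By.

(4) Cluster By

When Distribute By and Sort By use the same columns, Cluster By can be used instead. Cluster By combines the function of Distribute By with that of Sort By, but the sort is ascending only; ASC or DESC cannot be specified.

In production, Order By is used relatively rarely because it easily leads to OOM.

In production, Sort By + Distribute By is used often.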

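Ranking functions
(1) RANK()

Equal values share the same rank and the ranks that follow are skipped, so the last rank still equals the row count.

1
1
3
3
5
(2) DENSE_RANK()

Equal values share the same rank and no ranks are skipped, so the last rank is smaller than the row count.

1
1
2
2
3
(3) ROW_NUMBER()

Numbers the rows consecutively in order, even when values are equal.

1
2
3
4
5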
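
The three numbering schemes can be compared side by side on the same data. The awk sketch below imitates them for a descending order; the function name rank_all and the five sample scores are hypothetical, chosen so that the output reproduces the three sequences listed above.

```shell
#!/bin/bash
# Emulate rank(), dense_rank() and row_number() over (order by score desc).
# Input: one numeric score per line.
rank_all() {
  sort -rn | awk -v OFS='\t' '
    BEGIN { print "score", "rank", "dense_rank", "row_number" }
    {
      if (NR == 1 || $1 != prev) {     # a new distinct value
        rank = NR                      # rank() jumps to the row position
        dense++                        # dense_rank() advances by one
      }
      print $1, rank, dense, NR        # row_number() is just the row position
      prev = $1
    }'
}

# Hypothetical sample scores with two ties.
printf '%s\n' 90 90 80 80 70 | rank_all
```

Reading down the columns gives 1 1 3 3 5 for rank, 1 1 2 2 3 for dense_rank and 1 2 3 4 5 for row_number.
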
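Date functions
datediff: returns the number of days obtained by subtracting the start date from the end date

datediff(string enddate, string startdate)

select datediff('2021-11-20','2021-11-22');    -- returns -2
date_add: returns the date that is days days after startdate

date_add(string startdate, int days)

select date_add('2021-11-20',3);    -- returns 2021-11-23
date_sub: returns the date that is days days before startdate

date_sub(string startdate, int days)

select date_sub('2021-11-22',3);    -- returns 2021-11-19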
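
For a quick check of such date arithmetic outside Hive, the three calls can be approximated in the shell. This sketch assumes GNU date (the -d option is GNU-specific), and the shell function names simply mirror the Hive function names for readability.

```shell
#!/bin/bash
# Approximate the three Hive date functions with GNU date, working in UTC
# so that daylight-saving changes cannot distort the day count.
datediff() {   # datediff END START: days from START to END
  echo $(( ( $(date -u -d "$1" +%s) - $(date -u -d "$2" +%s) ) / 86400 ))
}
date_add() {   # date_add START DAYS
  date -u -d "$1 +$2 days" +%F
}
date_sub() {   # date_sub START DAYS
  date -u -d "$1 -$2 days" +%F
}

datediff 2021-11-20 2021-11-22    # prints -2
date_add 2021-11-20 3             # prints 2021-11-23
date_sub 2021-11-22 3             # prints 2021-11-19
```

As with the Hive function, datediff takes the end date first, so an end date earlier than the start date yields a negative number.
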
Redis
Startup
key

String

List

Set

Hash

zset(Sorted set)

Flink
Start the cluster
./bin/start-cluster.sh
run
./bin/flink run [OPTIONS]

./bin/flink run -m yarn-cluster -c com.wang.flink.WordCount /opt/app/WordCount.jar

info
./bin/flink info [OPTIONS]

list
./bin/flink list [OPTIONS]

stop
./bin/flink stop [OPTIONS] <Job ID>

cancel (de-emphasized; stop is generally preferred)
./bin/flink cancel [OPTIONS] <Job ID>

savepoint
./bin/flink savepoint [OPTIONS] <Job ID>

Original author: 王了个博