Configuring the Maven Project

Add the packages required for Spark development to the pom.xml configuration file. Pick the artifact that matches your Spark version from the Maven Central repository:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
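If the jar will be submitted to a cluster that already ships Spark, it is common to mark the Spark dependency as provided so the shade plugin does not bundle it; a minimal sketch of the same dependency with that scope:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.1</version>
    <!-- provided: the cluster supplies Spark at runtime, keeping the built jar small -->
    <scope>provided</scope>
</dependency>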
Build Methods

- Configure Artifacts to build the jar (IntelliJ IDEA's built-in packaging)
- Configure Maven to build the jar
- To build the jar with Maven, just add the following plugin (maven-shade-plugin) to pom.xml; set <mainClass> to your own application's entry point (the cn.mucang.sensor.SensorMain value below is specific to the original project):
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.1</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/spring.handlers</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>META-INF/spring.schemas</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>cn.mucang.sensor.SensorMain</mainClass>
                    </transformer>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>
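With the plugin bound to the package phase as above, the shade goal runs automatically during a normal build; from the project root:

mvn clean package

The shaded jar is left under the project's target/ directory.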
Sample Scala Code
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object InfoOutput {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("NginxLog")
    val sc = new SparkContext(sparkConf)
    val fd = sc.textFile("hdfs:///xxx/logs/access.log")
    // Keep only the .baidu.com entries and split each log line into fields
    val logRDD = fd.filter(_.contains(".baidu.com")).map(_.split(" "))
    logRDD.persist(StorageLevel.DISK_ONLY)
    // Count occurrences of field 2 (the client IP in this log format),
    // then sort descending by count and keep the top 10
    val ipTop = logRDD.map(v => v(2)).countByValue().toSeq.sortBy(-_._2).take(10)
    ipTop.foreach(println)
  }
}
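One caveat: countByValue() returns the entire count map to the driver, which is fine for a modest number of distinct IPs but can become a bottleneck for very large key sets. A distributed alternative for the top-10 line above, sketched under the same log layout:

// Aggregate counts on the executors, then fetch only the top 10 to the driver.
val ipTop = logRDD
  .map(v => (v(2), 1L))              // (ip, 1) pairs
  .reduceByKey(_ + _)                // per-IP counts, computed in parallel
  .sortBy(_._2, ascending = false)   // order by count, highest first
  .take(10)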
Uploading the Jar

- Use scp to upload the jar to the spark-submit server; the jar is in the project's out directory (or target/ if built with Maven).
- Since the job has no third-party dependencies, the built jar is very small. Submit the job with spark-submit:
spark-submit --class InfoOutput --verbose --master yarn --deploy-mode cluster nginxlogs.jar
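Note that the example code hard-codes setMaster("local[*]"). Properties set on SparkConf in code take precedence over spark-submit flags, so --master yarn would be ignored; when submitting to a cluster, leave the master out of the code and let spark-submit supply it:

// Let spark-submit's --master flag decide where the job runs.
val sparkConf = new SparkConf().setAppName("NginxLog")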