This article walks through installing spark-2.4.0-bin-hadoop2.7.
Spark can be installed separately from Hadoop; the setup below keeps the Spark cluster and the Hadoop cluster on separate machines.
No. | IP | hostname | Spec |
---|---|---|---|
1 | 192.168.0.5 | spark-master | 40 GB disk, 4 cores / 8 GB RAM |
2 | 192.168.0.6 | spark-slaver-1 | 40 GB disk, 4 cores / 8 GB RAM |
3 | 192.168.0.7 | spark-slaver-2 | 40 GB disk, 4 cores / 8 GB RAM |
4 | 192.168.0.8 | spark-slaver-3 | 40 GB disk, 4 cores / 8 GB RAM |
On centos6/centos7/alios6/alios7 you can set the hostname with:
hostname spark-master   (takes effect immediately, but is lost after a reboot)
or persistently in /etc/sysconfig/network   (takes effect after a reboot):
HOSTNAME=spark-master   # takes effect after a reboot
Also delete the old hostname's mapping from /etc/hosts.
Edit /etc/hosts on all four machines:
127.0.0.1 localhost
::1 localhost
192.168.0.5 spark-master
192.168.0.6 spark-slaver-1
192.168.0.7 spark-slaver-2
192.168.0.8 spark-slaver-3
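A quick sanity check, assuming the mappings above are in place on every machine:
# run on spark-master; every hostname should resolve and answer
for h in spark-slaver-1 spark-slaver-2 spark-slaver-3; do ping -c 1 $h; done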
On centos7/alios7 the hostname can also be set persistently with hostnamectl set-hostname <name>.
Disable the firewall.
centos7/alios7:
sudo systemctl disable firewalld   (takes effect after a reboot)
sudo systemctl stop firewalld      (stops it immediately)
centos6/alios6:
service iptables stop    (stops the firewall until the next reboot)
chkconfig iptables off   (takes effect after a reboot)
Disable SELinux:
setenforce 0   (takes effect immediately)
Set SELINUX=disabled in /etc/selinux/config   (takes effect after a reboot)
Verify with:
getenforce
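To double-check that the firewall is really off as well:
# centos7/alios7
sudo systemctl status firewalld   # should show inactive (dead)
# centos6/alios6
sudo service iptables status      # should show that the firewall is not running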
Switch all four machines to the admin user.
For passwordless SSH it is enough that the master can log in to any machine in the cluster without a password; the slavers do not need passwordless access to the other machines.
ssh-keygen -t rsa -P ''
Then, on the master, run:
ssh-copy-id admin@spark-slaver-1
ssh-copy-id admin@spark-slaver-2
ssh-copy-id admin@spark-slaver-3
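Passwordless login can then be verified from the master; each command should print the remote hostname without asking for a password:
ssh admin@spark-slaver-1 hostname
ssh admin@spark-slaver-2 hostname
ssh admin@spark-slaver-3 hostname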
If you are running on AWS (Security Groups) or Alibaba Cloud (security groups), note that
SPARK_MASTER_PORT=8077
SPARK_MASTER_WEBUI_PORT=8088
both need to be reachable from outside the VPC, so open them in the security group.
Install JDK 1.8 ([see my earlier JDK 1.8 installation guide][8]) and configure the environment variables, preferably in ~/.bashrc or /etc/profile:
export JAVA_HOME=/usr/java/jdk1.8.0_131
export PATH=$JAVA_HOME/bin:$PATH
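After re-sourcing the file, confirm the JDK is picked up:
source ~/.bashrc
java -version   # should report java version "1.8.0_131"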
# Download and install Scala
wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
tar -xzvf scala-2.11.8.tgz
sudo cp -r scala-2.11.8 /usr/share/scala   # so SCALA_HOME below points at the right place
export SCALA_HOME=/usr/share/scala
export PATH=$SCALA_HOME/bin:$PATH
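Verify Scala is on the PATH:
source ~/.bashrc
scala -version   # should report 2.11.8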
Hadoop and Spark can be kept separate, i.e. Spark does not have to be installed on the Hadoop cluster machines, but some Hadoop libraries are still needed, so configure the Hadoop environment variables as well:
cd /opt/app
wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
tar -zxvf hadoop-2.8.5.tar.gz
# Hadoop runtime environment, in ~/.bashrc
export HADOOP_HOME=/opt/app/hadoop-2.8.5
export PATH=$PATH:$HADOOP_HOME/sbin
export PATH=$PATH:$HADOOP_HOME/bin
# Hadoop native lib environment, in ~/.bashrc
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
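Confirm the Hadoop client works; the classpath printed here is what SPARK_DIST_CLASSPATH will be built from later in spark-env.sh:
source ~/.bashrc
hadoop version     # should report Hadoop 2.8.5
hadoop classpath   # prints the jar/lib paths Spark will pick up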
cd /opt/app
wget http://mirrors.shu.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz
# Environment variables, in ~/.bashrc
export SPARK_HOME=/opt/app/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
# Set this to the actual IP of the machine you are configuring (it differs per node), in ~/.bashrc
export SPARK_LOCAL_IP=192.168.0.6
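At this point the Spark client scripts should already run locally:
source ~/.bashrc
spark-submit --version   # should report version 2.4.0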
# Edit spark-env.sh
cd $SPARK_HOME
cp conf/spark-env.sh.template conf/spark-env.sh
vi conf/spark-env.sh
source ~/.bashrc
export JAVA_HOME=${JAVA_HOME}
export SCALA_HOME=${SCALA_HOME}
export HADOOP_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_MASTER_IP=192.168.0.5
export SPARK_MASTER_HOST=spark-master
export SPARK_MASTER_PORT=8077
export SPARK_MASTER_WEBUI_PORT=8088
export SPARK_WORKER_MEMORY=6g
export SPARK_WORKER_CORES=4
export SPARK_HOME=${SPARK_HOME}
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=8099 -Dspark.history.retainedApplications=10 -Dspark.history.fs.logDirectory=hdfs://hmaster1:8102/spark-log"
export SPARK_LOCAL_DIRS=/data/spark/tmp
export SPARK_WORKER_DIR=/data/spark/work
export SPARK_LOG_DIR=/data/spark/logs
export SPARK_PID_DIR=/data/spark/tmp
export SPARK_DAEMON_JAVA_OPTS="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
export SPARK_MASTER_OPTS="-Dcom.sun.management.jmxremote.port=11061"
export SPARK_WORKER_OPTS="-Dcom.sun.management.jmxremote.port=11062"
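Note that SPARK_HISTORY_OPTS above only configures the history server; for finished applications to actually appear there, event logging must be enabled and the log directory must exist. A minimal sketch (the spark.eventLog.* settings and the commands below are additions, not part of the original config; the HDFS path is the one used in spark-env.sh):
cp conf/spark-defaults.conf.template conf/spark-defaults.conf
# add to conf/spark-defaults.conf:
#   spark.eventLog.enabled  true
#   spark.eventLog.dir      hdfs://hmaster1:8102/spark-log
hdfs dfs -mkdir -p hdfs://hmaster1:8102/spark-log   # run where the Hadoop/HDFS client is configured
sbin/start-history-server.sh                        # history UI on port 8099 as configured above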
# slaves file
cp conf/slaves.template conf/slaves
# list the worker hostnames in conf/slaves:
spark-slaver-1
spark-slaver-2
spark-slaver-3
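The same Spark and Hadoop directories, conf files, and ~/.bashrc entries (with SPARK_LOCAL_IP adjusted per machine) have to be present on every node before starting the cluster; one way to push them out, assuming the /opt/app layout and admin user used above and that /opt/app is writable by admin on the slavers:
for h in spark-slaver-1 spark-slaver-2 spark-slaver-3; do
  rsync -a /opt/app/spark-2.4.0-bin-hadoop2.7 /opt/app/hadoop-2.8.5 admin@$h:/opt/app/
done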
# hbase-site.xml: if Spark needs to access an HBase cluster, copy hbase-site.xml from the HBase conf directory into Spark's conf directory
sbin/start-all.sh
Running jps on the master should show a Master process:
3998 Master
Running jps on each slaver should show a Worker process:
15859 Worker
Logs go to SPARK_LOG_DIR (/data/spark/logs as configured above), e.g.:
spark-admin-org.apache.spark.deploy.worker.Worker-1-spark-2.out
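The master web UI is another quick health check (port 8088 as set by SPARK_MASTER_WEBUI_PORT above):
curl -s http://spark-master:8088/   # the page should report the master as ALIVE with 3 alive workers
# or open http://spark-master:8088 in a browser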
To test the cluster, build a small WordCount job:
mkdir -p sparkapp2/src/main/java/
vi sparkapp2/src/main/java/WordCount.java
Write the following into it:
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public final class WordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: JavaWordCount <file>");
            System.exit(1);
        }

        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file and split each line into words
        JavaRDD<String> lines = sc.textFile(args[0], 1);
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterator<String> call(String s) {
                return Arrays.asList(SPACE.split(s)).iterator();
            }
        });

        // Map each word to (word, 1) and sum the counts per word
        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        // Collect the results to the driver and print them
        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }
        sc.stop();
    }
}
vi sparkapp2/pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.moheqionglin</groupId>
    <artifactId>word-count</artifactId>
    <name>Simple Project</name>
    <packaging>jar</packaging>
    <version>1.0</version>
    <repositories>
        <repository>
            <id>Akka repository</id>
            <url>http://repo.akka.io/releases</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency> <!-- Spark dependency; match the Spark 2.4.0 / Scala 2.11 cluster -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
mvn clean package
Then copy target/word-count-1.0.jar to /home/admin on every machine in the Spark cluster,
and submit it:
./spark-submit --master spark://spark-master:8077 --deploy-mode cluster --class "WordCount" --executor-memory 1G --total-executor-cores 2 /home/admin/word-count-1.0.jar file:///opt/app/spark-2.4.0-bin-hadoop2.7/README.md
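In cluster mode the driver runs on one of the workers, so the counts printed by the job end up in that driver's stdout under SPARK_WORKER_DIR (/data/spark/work) rather than on your terminal. To see the output directly, the same jar can be submitted in client mode from the master:
./spark-submit --master spark://spark-master:8077 --deploy-mode client --class "WordCount" --executor-memory 1G --total-executor-cores 2 /home/admin/word-count-1.0.jar file:///opt/app/spark-2.4.0-bin-hadoop2.7/README.md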