spark-2.4.0 Installation Tutorial and Pitfall Guide

This article walks through the installation steps for spark-2.4.0-bin-hadoop2.7.

1. Pre-installation Preparation

1.1 Provisioning Machines

Spark can be installed separately from Hadoop. Below, the Spark cluster and the Hadoop cluster are kept on separate machines.

No.   IP            hostname         Spec
1     192.168.0.5   spark-master     40 GB disk, 4 cores / 8 GB RAM
2     192.168.0.6   spark-slaver-1   40 GB disk, 4 cores / 8 GB RAM
3     192.168.0.7   spark-slaver-2   40 GB disk, 4 cores / 8 GB RAM
4     192.168.0.8   spark-slaver-3   40 GB disk, 4 cores / 8 GB RAM

1.2 Hostname Configuration

On CentOS 6/7 and AliOS 6/7 you can use (run the matching hostname on each machine):

hostname spark-master    (takes effect immediately, but is lost after reboot)

/etc/sysconfig/network (takes effect after reboot):
HOSTNAME=spark-master    # takes effect after reboot

In /etc/hosts, remove the mapping for the old hostname.
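
On CentOS 7 / AliOS 7 the same change can be made persistent in one step with hostnamectl; a minimal sketch (use the matching name on each machine):

sudo hostnamectl set-hostname spark-master
hostnamectl status    # verify the static hostname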

1.3 /etc/hosts Configuration

Edit /etc/hosts on all four machines (a sketch for pushing the file out to the slavers follows the listing):

127.0.0.1       localhost
::1             localhost
192.168.0.5     spark-master
192.168.0.6     spark-slaver-1
192.168.0.7     spark-slaver-2
192.168.0.8     spark-slaver-3
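
To avoid editing the file by hand on every node, one option is to push it out from the master; a sketch that assumes the admin user can SSH and sudo on each slaver (you will be prompted for passwords until section 1.6 is done):

for ip in 192.168.0.6 192.168.0.7 192.168.0.8; do
    scp /etc/hosts admin@$ip:/tmp/hosts
    ssh -t admin@$ip "sudo mv /tmp/hosts /etc/hosts"
done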

1.4 Disable the Firewall

CentOS 7 / AliOS 7:

sudo systemctl disable firewalld
sudo systemctl stop firewalld

CentOS 6 / AliOS 6:

service iptables stop     (stops the firewall immediately; temporary)

chkconfig iptables off    (takes effect after reboot)
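
A quick way to confirm the firewall is really off (standard CentOS service names assumed):

# CentOS 7 / AliOS 7
sudo systemctl status firewalld
sudo firewall-cmd --state     # should print "not running"

# CentOS 6 / AliOS 6
sudo service iptables status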

1.5 Disable SELinux

First check the current status:

getenforce

  • If the result is Permissive or Disabled, SELinux is already off; skip this step and continue with the rest of the configuration.
  • If the result is Enforcing, continue with the next two steps.
  • Open /etc/selinux/config (on some Linux distributions it is /etc/sysconfig/selinux) and change SELINUX=enforcing to SELINUX=permissive, or SELINUX=disabled to turn it off entirely, then save (takes effect after reboot).
  • Reboot the system, then run setenforce 0 (takes effect immediately).
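
If you prefer not to edit the file by hand, the same change can be scripted; a minimal sketch assuming the config file lives at /etc/selinux/config:

sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
getenforce    # should now print Permissive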

1.6 Passwordless SSH

Switch to the admin user on all four machines.
For passwordless SSH it is enough that the master can log in to any machine in the cluster without a password; the slavers do not need passwordless access to the other machines.

ssh-keygen -t rsa -P ''

Then, on the master machine, run:

ssh-copy-id    admin@spark-slaver-1
ssh-copy-id    admin@spark-slaver-2
ssh-copy-id    admin@spark-slaver-3
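
To confirm the keys were copied correctly, run a quick loop from the master; it should print each slaver's hostname without asking for a password:

for h in spark-slaver-1 spark-slaver-2 spark-slaver-3; do
    ssh admin@$h hostname
done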

1.7 Ports

If you are using AWS (Security Groups) or Alibaba Cloud (security groups), note that

SPARK_MASTER_PORT=8077
SPARK_MASTER_WEBUI_PORT=8088

both need to be visible from outside the VPC.
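
Once the master is running (section 2.3), reachability can be checked from outside the cluster with nc (install it via yum if missing; telnet works just as well); a sketch:

nc -vz 192.168.0.5 8077
nc -vz 192.168.0.5 8088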

1.8 JDK, Scala and Environment Variables

Install JDK 1.8 ([see my earlier JDK 1.8 installation tutorial][8]) and configure the environment variables, preferably in ~/.bashrc or /etc/profile:

export JAVA_HOME=/usr/java/jdk1.8.0_131
export PATH=$JAVA_HOME/bin:$PATH

# Download and install Scala
wget https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz

tar -xzvf scala-2.11.8.tgz
sudo cp -r scala-2.11.8 /usr/share/scala

export SCALA_HOME=/usr/share/scala
export PATH=$SCALA_HOME/bin:$PATH
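
After reloading the shell (source ~/.bashrc or /etc/profile), verify both toolchains are on the PATH:

java -version     # should report 1.8.0_131
scala -version    # should report 2.11.8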

1.9 Hadoop Environment Variables

Hadoop and Spark can be kept separate, that is, Spark can be installed on machines that are not part of the Hadoop cluster, but some of the Hadoop libraries are still needed, so configure the Hadoop environment variables:

cd /opt/app

wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
tar -xzvf hadoop-2.8.5.tar.gz

# Hadoop runtime environment    ~/.bashrc
export HADOOP_HOME=/opt/app/hadoop-2.8.5
export PATH=$PATH:$HADOOP_HOME/sbin
export PATH=$PATH:$HADOOP_HOME/bin

# Hadoop lib environment    ~/.bashrc
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
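
A quick sanity check that Hadoop is usable from the shell (its classpath is consumed later via SPARK_DIST_CLASSPATH in spark-env.sh):

source ~/.bashrc
hadoop version      # should report 2.8.5
hadoop classpath    # the jars Spark will pick up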

2. Installation

2.1 Download the Package

cd /opt/app
wget http://mirrors.shu.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz

# Environment variables    ~/.bashrc

export SPARK_HOME=/opt/app/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

# Note: fill in the machine's actual IP here.    ~/.bashrc
export SPARK_LOCAL_IP=192.168.0.6
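
After reloading ~/.bashrc, a quick check that the Spark binaries resolve (the cluster does not need to be running for this):

source ~/.bashrc
spark-submit --version    # should report version 2.4.0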

2.2 Configuration

# Edit spark-env.sh (run from $SPARK_HOME)
cd $SPARK_HOME
cp conf/spark-env.sh.template conf/spark-env.sh

# vi conf/spark-env.sh

source ~/.bashrc
export JAVA_HOME=${JAVA_HOME}
export SCALA_HOME=${SCALA_HOME}
export HADOOP_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_MASTER_IP=192.168.0.5
export SPARK_MASTER_HOST=spark-master
export SPARK_MASTER_PORT=8077
export SPARK_MASTER_WEBUI_PORT=8088
export SPARK_WORKER_MEMORY=6g
export SPARK_WORKER_CORES=4
export SPARK_HOME=${SPARK_HOME}
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=8099 -Dspark.history.retainedApplications=10 -Dspark.history.fs.logDirectory=hdfs://hmaster1:8102/spark-log"
export SPARK_LOCAL_DIRS=/data/spark/tmp
export SPARK_WORKER_DIR=/data/spark/work
export SPARK_LOG_DIR=/data/spark/logs
export SPARK_PID_DIR=/data/spark/tmp
export SPARK_DAEMON_JAVA_OPTS="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
export SPARK_MASTER_OPTS="-Dcom.sun.management.jmxremote.port=11061"
export SPARK_WORKER_OPTS="-Dcom.sun.management.jmxremote.port=11062"


# slaves file
cp conf/slaves.template conf/slaves

# vi conf/slaves and list the worker hostnames:

spark-slaver-1
spark-slaver-2
spark-slaver-3

# hbase-site.xml: if Spark needs to access an HBase cluster, copy hbase-site.xml from the HBase installation into the conf directory
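
The same installation and configuration must be present on every node. Assuming the identical /opt/app layout and an admin-writable /opt/app on each slaver, a sketch for syncing from the master (rsync works just as well):

for h in spark-slaver-1 spark-slaver-2 spark-slaver-3; do
    scp -r /opt/app/spark-2.4.0-bin-hadoop2.7 admin@$h:/opt/app/
    scp ~/.bashrc admin@$h:~/.bashrc    # remember to adjust SPARK_LOCAL_IP on each node
done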

2.3 Start the Cluster

    sbin/start-all.sh

2.4 Logs and Processes

Master process (from jps):

3998 Master

Slaver process (from jps):

15859 Worker

Logs:

spark-admin-org.apache.spark.deploy.worker.Worker-1-spark-2.out
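
To check the daemons and follow the logs on any node (SPARK_LOG_DIR was set to /data/spark/logs above; the exact file name varies with user and hostname):

jps | grep -E 'Master|Worker'
tail -f /data/spark/logs/spark-admin-org.apache.spark.deploy.*.out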

2.5 Web UI

http://192.168.0.5:8088/

2.6 Submit a Test Job

mkdir -p sparkapp2/src/main/java/
vi sparkapp2/src/main/java/WordCount.java

Write the following content into the file:

import java.util.*;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.List;

public final class WordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: JavaWordCount <file>");
            System.exit(1);
        }
        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile(args[0],1);
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {

            private static final long serialVersionUID = 1L;


            @Override
            public Iterator<String> call(String s) {

                return Arrays.asList(SPACE.split(s)).iterator();
            }
        });

        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }

        sc.stop();
    }
}

vi sparkapp2/pom.xml

<project>
    <groupId>com.moheqionglin</groupId>
    <artifactId>simple-project</artifactId>
    <modelVersion>4.0.0</modelVersion>
    <name>Simple Project</name>
    <packaging>jar</packaging>
    <version>1.0</version>
    <repositories>
        <repository>
            <id>Akka repository</id>
            <url>http://repo.akka.io/releases</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
mvn clean package

Then copy the jar from the target/ directory to /home/admin on every machine in the Spark cluster.

Then run:

./spark-submit --master spark://spark-master:8077   --deploy-mode cluster --class "WordCount" --executor-memory 1G --total-executor-cores 2 /home/admin/simple-project-1.0.jar file:///opt/app/spark-2.4.0-bin-hadoop2.7/README.md
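
With --deploy-mode cluster the driver runs on one of the workers, so the word counts printed by the program end up in that worker's work directory rather than on your terminal. To see the output directly on the console, a client-mode run is handy (a sketch):

./spark-submit --master spark://spark-master:8077 --deploy-mode client --class "WordCount" --executor-memory 1G --total-executor-cores 2 /home/admin/simple-project-1.0.jar file:///opt/app/spark-2.4.0-bin-hadoop2.7/README.md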