
Apache Spark Single-Machine Installation Tutorial

This walkthrough uses a Windows environment as the example. First, download Spark:

http://spark-project.org/download/spark-0.8.0-incubating.tgz 

Then extract the archive and build it. The output when the build completes looks like this:

C:\Develop\Source\Spark\spark-0.8.0-incubating>sbt\sbt.cmd assembly
(omitted)
[info] Done packaging.
[info] Packaging C:\Develop\Source\Spark\spark-0.8.0-incubating\examples\target\
scala-2.9.3\spark-examples-assembly-0.8.0-incubating.jar ...
[info] Done packaging.
[success] Total time: 1265 s, completed 2013/11/04 21:36:04

Running the Spark Shell

Start the Spark shell as follows:

C:\Develop\Source\Spark\spark-0.8.0-incubating>spark-shell.cmd
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 0.8.0
      /_/
(omitted)
13/11/04 21:45:18 INFO ui.SparkUI: Started Spark Web UI at http://haumea:4040
Spark context available as sc.
Type in expressions to have them evaluated.
Type :help for more information.

scala>

The Web UI (started at http://haumea:4040 in the log above) can now be opened in a browser to confirm that the shell is running.

Run the following commands in the shell:

scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> textFile.count()
res1: Long = 111

scala> textFile.first()
res2: String = # Apache Spark

scala> textFile.foreach(println(_))

After running the above commands, the results also appear in the Web UI.

Next, run the following commands:

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = FilteredRDD[2] at filter at <console>:14

scala> linesWithSpark.foreach(println(_))
# Apache Spark
You can find the latest Spark documentation, including a programming
Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The project is
Spark and its example programs, run:
Once you've built Spark, the easiest way to start using it is the shell:
Spark also comes with several sample programs in the `examples` directory.
./run-example org.apache.spark.examples.SparkLR local[2]
All of the Spark samples take a `<master>` parameter that is the cluster URL
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
Hadoop, you must build Spark against the same version that your cluster runs.
when building Spark.
When developing a Spark application, specify the Hadoop version by adding the
in the online documentation for an overview on how to configure Spark.
Apache Spark is an effort undergoing incubation at The Apache Software
## Contributing to Spark
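
The filter and the count can also be chained into a single expression, a small variation not shown in the transcript above. Run in the same shell session, it returns the number of matching lines directly instead of printing them (it should match the number of lines printed by linesWithSpark.foreach above):

scala> textFile.filter(line => line.contains("Spark")).count()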

scala> exit

Next, create a TextCount directory and create the following files under it:

TextCount/src/main/scala/TextCountApp.scala
TextCount/count.sbt

The contents of TextCount/src/main/scala/TextCountApp.scala are as follows:

/*** TextCountApp.scala ***/
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TextCountApp {
  def main(args: Array[String]) {
    val logFile = "C:/Develop/Source/Spark/spark-0.8.0-incubating/README.md"
    // master URL, application name, Spark home directory, and the application's dependency JARs
    val sc = new SparkContext("local", "TextCountApp", "C:/Develop/Source/Spark/spark-0.8.0-incubating",
      List("target/scala-2.9.3/count-project_2.9.3-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    // count the lines that contain "a", "b", and "Spark" respectively
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    val numSparks = logData.filter(line => line.contains("Spark")).count()
    println("Lines with a: %s, Lines with b: %s, Lines with Spark: %s".format(numAs, numBs, numSparks))
  }
}

The four parameters of the SparkContext constructor mean the following (see the annotated sketch after this list):

First parameter: master URL ("local" for single-machine mode)
Second parameter: application name
Third parameter: Spark installation (home) directory
Fourth parameter: libraries (JARs) the application depends on
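
Putting these together, the constructor call in TextCountApp.scala can be read as the following annotated sketch (the values are the same as in the source above):

val sc = new SparkContext(
  "local",                                                  // 1st parameter: master URL (local single-machine mode)
  "TextCountApp",                                           // 2nd parameter: application name
  "C:/Develop/Source/Spark/spark-0.8.0-incubating",         // 3rd parameter: Spark installation directory
  List("target/scala-2.9.3/count-project_2.9.3-1.0.jar")    // 4th parameter: JARs the application depends on
)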

Create TextCount/count.sbt with the following content:

name := "Count Project"

version := "1.0"

scalaVersion := "2.9.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
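
Note that the "%%" operator tells sbt to append the configured scalaVersion ("2.9.3" above) to the artifact name, so the dependency should resolve to the Scala 2.9.3 build of spark-core. An explicit equivalent, shown only as a sketch and not needed in the actual build file, would be:

libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating"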

Run the following command to compile the project and generate the JAR file:

C:\Develop\Source\Spark\spark-0.8.0-incubating\TextCount>..\sbt\sbt.cmd package
(omitted)
[info] Packaging C:\Develop\Source\Spark\spark-0.8.0-incubating\TextCount\target\scala-2.9.3\count-project_2.9.3-1.0.jar ...
[info] Done packaging.
[success] Total time: 7 s, completed 2013/11/04 22:29:24

Now run the application; the output looks like this:

C:\Develop\Source\Spark\spark-0.8.0-incubating\TextCount>..\sbt\sbt.cmd run
[info] Set current project to Count Project (in build file:/C:/Develop/Source/Spark/spark-0.8.0-incubating/TextCount/)
[info] Running TextCountApp
13/11/04 22:33:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/11/04 22:33:27 WARN snappy.LoadSnappy: Snappy native library not loaded
13/11/04 22:33:27 INFO mapred.FileInputFormat: Total input paths to process : 1
Lines with a: 66, Lines with b: 35, Lines with Spark: 15 // execution result
[success] Total time: 6 s, completed 2013/11/04 22:33:28

That's it: we can now run Spark on a single machine.
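
To use more than one core while still staying on a single machine, the master string can be changed to "local[2]" (the same form used by the run-example command in the README output above), which runs the application with two local worker threads. A minimal sketch, changing only the first SparkContext parameter of TextCountApp:

val sc = new SparkContext("local[2]", "TextCountApp",
  "C:/Develop/Source/Spark/spark-0.8.0-incubating",
  List("target/scala-2.9.3/count-project_2.9.3-1.0.jar"))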

 
