Spark builds on the Hadoop ecosystem: it reuses Hadoop's scalable, flexible, fault-tolerant and cost-effective storage and processing implementations while providing its own cluster management solution. Intermediate results are not written to disk but kept in distributed memory, which makes Spark computations a lot faster.
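For example, once Spark is installed (section 2), an intermediate result can be cached in distributed memory so later actions reuse it instead of recomputing it. A minimal sketch, using a hypothetical throw-away DataFrame of squared numbers:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Hypothetical intermediate result: the squares of the numbers 0..999999
squares = spark.range(1000000).withColumn("square", F.col("id") * F.col("id"))

squares.cache()                       # keep the intermediate result in distributed memory
squares.count()                       # the first action computes and caches it
squares.agg(F.sum("square")).show()   # later actions reuse the cached copy instead of recomputing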
2. Installation
java -version

# If java 8 is not installed
sudo apt install --no-install-recommends openjdk-8-jre-headless -y

# Download spark from https://spark.apache.org/downloads.html
wget http://apache.40b.nl/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
gunzip -c spark-2.4.0-bin-hadoop2.7.tgz | tar xvf -
rm spark-2.4.0-bin-hadoop2.7.tgz

# Install spark
sudo mv spark-2.4.0-bin-hadoop2.7/ /usr/local/spark
Add to PATH:
echo "# For spark!" >> ~/.bashrc
echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
echo 'export PATH=$PATH:/usr/local/spark/bin' >> ~/.bashrc
source ~/.bashrc
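In the interactive pyspark shell a SparkSession is already available as the variable spark. In a standalone script it has to be created first; a minimal sketch (the application name is arbitrary):

from pyspark.sql import SparkSession

# Entry point for the DataFrame API; reuses an existing session if one is already running
spark = SparkSession.builder \
    .appName("heroes") \
    .getOrCreate()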
# Supported source formats: 'csv', 'jdbc', 'json', 'orc', 'parquet', 'text'
df = (
    spark.read
    .format("csv")                  # how the data is stored
    .option("header", "true")       # how the data should be parsed
    .option("inferSchema", "true")
    .option("nanValue", "NA")
    .csv("data/heroes.csv")         # where the data is stored
)
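Assuming data/heroes.csv exists and has a header row, a quick way to check what was loaded:

df.printSchema()   # column names and the types picked by inferSchema
df.show(5)         # first five rows

Note that inferSchema makes Spark pass over the data an extra time to guess the column types, which is convenient for small files but can be slow for large ones.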