Spark
1. Introduction
Spark is based on the Hadoop ecosystem, using Hadoops scalable, flexible, fault-tolerant and cost effective storage and processing implementations while having its own cluster management solution. The intermediate results are also not stored on disk, but rather in distributed memory, making Spark computations a lot faster.
2. Installation
java -version
# If java 8 is not installed
sudo apt install --no-install-recommends openjdk-8-jre-headless -y
# Download spark from https://spark.apache.org/downloads.html
wget http://apache.40b.nl/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
gunzip -c spark-2.4.0-bin-hadoop2.7.tgz | tar xvf -
rm spark-2.4.0-bin-hadoop2.7.tgz
# Install spark
sudo mv spark-2.4.0-bin-hadoop2.7/ /usr/local/sparkAdd to PATH:
echo "# For spark!" >> ~/.bashrc
echo "JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> ~/.bashrc
echo "export PATH=$PATH:/usr/local/spark/bin" >> ~/.bashrc
source ~/.bashrcVerify the installation:
3. Testing
https://github.com/holdenk/spark-testing-base
4. Pyspark
4.1 Setup
Import types and functions:
Start spark session:
Read data:
Output:
4.2 Create DataFrame
Specify data and schema:
Add column:
5. Docker Image
Last updated
Was this helpful?