Setting Up Pyspark on AWS EC2 and Common Problems

What Is Pyspark?

Apache Spark is a popular open-source framework that processes data at lightning speed by distributing the workload across many computers. The idea is that buying and running a single supercomputer is very expensive, while buying several adequate machines and splitting the job between them is much cheaper. Spark's core abstraction for this is the Resilient Distributed Dataset (RDD): a collection of data partitioned across the machines in a cluster so that computations run on all of them in parallel. PySpark is the Python API for Apache Spark, created to bring more Python users to the framework. Spark itself is written in Scala, which runs on the Java Virtual Machine (JVM). With the help of the popular Py4J library, someone versed in Python can drive code that would otherwise require writing Java or Scala.
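
As a quick taste, here is a minimal local sketch (an assumption on my part: it runs Spark on your own machine via pip3 install pyspark, not the EC2 setup described below). It parallelizes a list into an RDD and sums the squares across all local cores:

# minimal local example, assuming pyspark is installed (pip3 install pyspark)
from pyspark.sql import SparkSession

# "local[*]" runs Spark on all local cores; on a cluster this would point at the cluster manager
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()

# an RDD is split into partitions; each partition is processed in parallel
rdd = spark.sparkContext.parallelize(range(1, 1_000_001), numSlices=8)
total = rdd.map(lambda x: x * x).sum()
print(total)

spark.stop()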

What Is AWS EC2?

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It makes cloud computing straightforward in terms of usability and price transparency: you can increase or decrease capacity on demand, which gives users flexibility in both pricing and timing. The larger instance types offer on the order of 160 GiB of RAM.

Instructions To Start Pyspark and Amazon EC2

A couple of things to note: AWS is free up to a certain level of usage. You will need a credit card to sign up, but you will only be charged if your usage goes above the free tier. The first step is to go to https://aws.amazon.com/free, click “Create an AWS Account”, sign up with an email address, and follow the prompts.

Once you are set up, you can head directly to the Amazon EC2 page through the products tab or by searching for “EC2” in the search bar. Click “Launch Instance” and pick Ubuntu Server; it should say “Free Tier Eligible” under the logo. For the instance type, choose micro (it will also say free tier eligible). Prices depend on how much RAM you want: a project with a lot of data that needs to be done quickly can go up to 160 GiB of RAM.

On step 3, the free tier covers a single instance, so make sure the number of instances is set to 1. Leave everything the same until you get to Tag Instance, where you are asked for a Key and a Value. Important: write these two values down! I used “myspark” for the Key and “mymachine” for the Value. For the security group, choose “All Traffic” under Type. If this is for personal use, feel free to ignore the warning about accessibility, since we want the instance to be easy to reach.

When you get to the option of creating a key pair, choose “Create a new key pair”, name it something you’ll remember (feel free to use “newspark”), and click Download Key Pair. Make sure you end up with the .pem file; you’ll need to locate it and store it somewhere easy to find, such as your home folder.
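
If you prefer scripting to clicking through the console, roughly the same launch can be done with boto3, the AWS SDK for Python. This is only a sketch: it assumes boto3 is installed and your AWS credentials are configured locally, and the region and AMI ID are placeholders you would replace with the Ubuntu Server AMI for your region.

# hypothetical scripted version of the console steps above (pip3 install boto3)
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumption: adjust to your region

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # placeholder: the Ubuntu Server AMI ID for your region
    InstanceType="t2.micro",           # free tier eligible instance type
    KeyName="newspark",                # the key pair created above
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "myspark", "Value": "mymachine"}],
    }],
)
print(response["Instances"][0]["InstanceId"])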

Installing Spark on AWS Console

You will need to move your .pem file (“newspark.pem”) to the desktop and open up your terminal. In the terminal, navigate to the desktop and make your key private. If you skip this step, SSH will refuse to use the key and you will get a permissions (public key) error when connecting to the EC2 instance.

#turn our key private
chmod 400 newspark.pem

#copy your instance's Public DNS and paste it after the .pem file (Ubuntu AMIs use the "ubuntu" user)
ssh -i "newspark.pem" ubuntu@***-**-***-***-***.compute-1.amazonaws.com

Type yes when prompted. You will know you are connected when your prompt changes from your regular local prompt to one showing the ubuntu user. Congratulations, you are officially connected to your AWS EC2 instance. Since we are running everything in the cloud, we’ll need to install programs on the instance to run our data modeling.

On our EC2

#update the package index so we can install pip3
sudo apt-get update

#install pip3
sudo apt install python3-pip

#install jupyter notebook
pip3 install jupyter

#install Java (Scala runs on the JVM)
sudo apt-get install default-jre

#click y to continue

# confirm java was installed
java -version

# install scala
sudo apt-get install scala

#confirm scala was installed
scala -version

#install py4j, which lets Python code talk to the JVM
pip3 install py4j

#download tgz file for apache
wget http://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

#extract Spark (prebuilt for Hadoop 2.7)
sudo tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz

#findspark will help us locate and connect to Spark easily
pip3 install findspark

The coding block above shows the commands you will have to enter in your EC2 instance’s terminal. The reason we are installing everything again is that we technically aren’t using our own computer for processing; AWS is doing the work. Spark is written in Scala, which runs on the JVM, but thankfully for Python users the Py4J module bridges our Python code to the JVM in the backend.
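
If you are curious, you can even peek at that bridge yourself. The sketch below assumes PySpark is available (for example, in the Jupyter notebook we set up later) and pokes at internal, underscore-prefixed attributes, so treat it as an illustration rather than a supported API:

# illustration only: PySpark talks to the JVM through a Py4J gateway
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("py4j-peek").getOrCreate()

sc = spark.sparkContext
print(type(sc._gateway))                              # the Py4J JavaGateway itself
print(sc._jvm.java.lang.System.currentTimeMillis())   # calling a Java method from Python

spark.stop()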

#reload the shell configuration
source .bashrc

#check that Python 3 is available
python3

#quit python
quit()

Creating Config File

The config file will hold our login details and the .pem file location so that we can connect to our AWS instance using our local tools. Open another terminal window and create a local file with those login details so we can connect to the AWS instance by name.

#open another terminal and go to the home directory
cd

# create config file
vim .ssh/config

# what to edit
# press "i" to insert information
#after Hostname we paste our EC2 public DNS (without any user@ prefix)
#IdentityFile should point at wherever you stored newspark.pem
Host ec2
   Hostname ***-**-***-***-***.compute-1.amazonaws.com
   User ubuntu
   IdentityFile ~/newspark.pem

#to save: Press "esc" key and type ":wq" and press "enter"

Now that we have created our config file, we can run a Jupyter notebook on our EC2 instance.

On the EC2 command line (the prompt should show the ubuntu user), type the command below. It will print a link to the Jupyter notebook.

#on the ec2 command line
jupyter notebook --no-browser

We go back to our computer’s terminal and type:

ssh -NfL 9999:localhost:8888 ec2

This forwards local port 9999 to port 8888 on the EC2 instance, so we connect on 9999 instead of the often used 8888. If we are already running a Jupyter notebook locally on 8888, we will not have to shut it down to run PySpark. Now simply copy the link printed on the EC2 command line into Chrome, but change the 8888 in the URL to 9999. From there, we’ll see the familiar Jupyter notebook where we can run our code.
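
If you want to sanity check the tunnel without opening a browser, a quick Python snippet on your local machine (a sketch that assumes the notebook server and the tunnel are both running) should print a 200 status:

# quick local check that the SSH tunnel is forwarding port 9999 to the remote notebook
import urllib.request

resp = urllib.request.urlopen("http://localhost:9999")
print(resp.status)  # 200 means the remote Jupyter server answered through the tunnel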

FileZilla

We don’t yet have any data on our EC2 instance. We can use a program like FileZilla to get our local files onto AWS. After downloading FileZilla, you can connect to your EC2 instance over SFTP using the instance’s public DNS, the ubuntu user, and the same .pem key.
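
If you would rather script the transfer than use a GUI, here is a hedged sketch using the paramiko library (pip3 install paramiko on your local machine); the DNS and file names are placeholders you would swap for your own:

# hypothetical scripted alternative to FileZilla, using paramiko's SFTP support
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # accept the host key on first connect
client.connect(
    hostname="***-**-***-***-***.compute-1.amazonaws.com",  # your instance's public DNS
    username="ubuntu",
    key_filename="newspark.pem",
)

sftp = client.open_sftp()
sftp.put("mydata.csv", "/home/ubuntu/mydata.csv")  # placeholder file names
sftp.close()
client.close()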

Once you have the files you want on AWS, the only thing left to do is start a new notebook and use Spark for your calculations. We still need to locate the Spark folder that we downloaded earlier, and that is when we call upon findspark: findspark.init() has to be given the location of the Spark folder. If you unpacked it somewhere else, point findspark.init() at that path instead.

# how to find spark in ec2 instance
import findspark
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')

import pyspark
from pyspark.sql import SparkSession
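
From there, starting a session and reading the data you uploaded looks roughly like the sketch below; the CSV path is a placeholder for whatever file you transferred:

# start a session and explore the uploaded file (the CSV path is a placeholder)
from pyspark.sql import SparkSession  # already imported above; repeated here for completeness

spark = SparkSession.builder.appName("myproject").getOrCreate()

# read the file transferred with FileZilla; adjust the path to your own data
df = spark.read.csv("/home/ubuntu/mydata.csv", header=True, inferSchema=True)
df.printSchema()
print(df.count())   # row count, computed by the Spark workers
df.show(5)          # first five rows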

Closing The Instance

After you have run everything to your heart’s content, it’s time to stop the instance. Remember that you pay per use, and staying connected to your AWS EC2 instance keeps consuming resources. Save your Jupyter notebook, then go to your EC2 homepage, right-click your instance, and choose “Stop”. Then press Ctrl+C to shut down the notebook server in the Ubuntu terminal, and use the commands below to clear the SSH tunnel from your local terminal.

#9999 is the local port of the tunnel you want to kill
lsof -i :9999

#or try
lsof -ti:9999 | xargs kill


#to reconnect later, open the tunnel again
ssh -NfL 9999:localhost:8888 ec2
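
Stopping the instance can also be scripted instead of using the console; here is a hedged boto3 sketch, where the region and instance ID are placeholders:

# hypothetical scripted way to stop the instance instead of using the console
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")        # assumption: same region as the launch
ec2.stop_instances(InstanceIds=["i-xxxxxxxxxxxxxxxxx"])   # placeholder: your instance ID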
