GPU621/Apache Spark

== Setup ==
# Run '''Dataproc''' Hadoop MapReduce and Spark jobs to count the number of words in large text files and compare the performance of Hadoop and Spark in execution time.
=== Setting up Dataproc and Google Cloud Storage ===
Using a registered Google account, navigate to the Google Cloud Console at https://console.cloud.google.com/ and activate the free-trial credits.
[[File:Googlecloud-setup-2.jpg]]
[[File:Googlecloud-setup-6b.jpg]]
'''We will create 5 worker nodes and 1 master node using the N1 series general-purpose machine type with 4 vCPUs, 15 GB of memory, and a disk size of 32-50 GB for all nodes. The console shows the hourly cost of your machine configuration; machines with more memory, compute power, or disk cost more per hour of use.'''
[[File:Googlecloud-dataproc-1.jpg]]
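An equivalent cluster can also be created from Cloud Shell with the gcloud CLI. This is a minimal sketch only: the cluster name spark-vs-hadoop and the region us-central1 are placeholders, and n1-standard-4 corresponds to the 4 vCPU / 15 GB machine described above.
gcloud dataproc clusters create spark-vs-hadoop --region=us-central1 --master-machine-type=n1-standard-4 --master-boot-disk-size=50GB --num-workers=5 --worker-machine-type=n1-standard-4 --worker-boot-disk-size=50GB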
'''Allow API access to all Google Cloud services in the project.'''
[[File:Googlecloud-setup-9.jpg]]
To view the individual nodes in the cluster, go to '''Menu -> Virtual Machines -> VM Instances'''.
[[File:Googlecloud-setup-11b.jpg]]
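The cluster's VM instances can also be listed from Cloud Shell with the gcloud CLI:
gcloud compute instances list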
'''Ensure that the Dataproc, Compute Engine, and Cloud Storage APIs are all enabled by going to Menu -> APIs & Services -> Library. Search for each API by name and enable it if it is not already enabled.'''
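The same APIs can be enabled from Cloud Shell; the service names below are the standard API identifiers for Dataproc, Compute Engine, and Cloud Storage:
gcloud services enable dataproc.googleapis.com compute.googleapis.com storage.googleapis.com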
Create a Cloud Storage bucket by going to '''Menu -> Storage -> Browser -> Create Bucket'''.
Make a note of the bucket name.
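Alternatively, the bucket can be created from Cloud Shell. A minimal sketch, assuming the bucket name rinsereduce used later in this walkthrough and a US multi-region location:
gsutil mb -l US gs://rinsereduce/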
 
Copy the Hadoop wordcount example, available on every Dataproc cluster, from the master node VM to our Cloud Storage bucket:
# Open a Secure Shell (SSH) session from the VM Instances list: '''Menu -> Compute -> Compute Engine'''.
# Enter the following command in the shell:
gsutil cp /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar gs://rinsereduce/
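With the jar in the bucket, the Hadoop wordcount job can then be submitted to the cluster with the gcloud CLI. This is a sketch only; the cluster name, region, and input/output paths below are placeholders that must match your own setup:
gcloud dataproc jobs submit hadoop --cluster=spark-vs-hadoop --region=us-central1 --jar=gs://rinsereduce/hadoop-mapreduce-examples.jar -- wordcount gs://rinsereduce/input/ gs://rinsereduce/output/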
=== Results ===