GPU621/Apache Spark

== Setup ==
# Run '''Dataproc''' Hadoop MapReduce and Spark jobs to count the number of words in large text files and compare the performance of Hadoop and Spark in execution time.
=== Setting up Dataproc and Google Cloud Storage ===
Using a registered Google account, navigate to the Google Cloud Console at https://console.cloud.google.com/ and activate the free-trial credits.
[[File:Googlecloud-setup-2.jpg]]
[[File:Googlecloud-setup-6b.jpg]]
'''We will create 5 worker nodes and 1 master node using the N1 series general-purpose machine type with 4 vCPUs, 15 GB of memory, and a disk size of 32-50 GB for all nodes. The console shows the hourly cost of your machine configuration; machines with more memory, compute power, or disk cost more per hour of use.'''
[[File:Googlecloud-dataproc-1.jpg]]
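An equivalent cluster can also be created from Cloud Shell with the gcloud CLI. This is a minimal sketch only: the cluster name spark-vs-hadoop and the region us-central1 are placeholders, and n1-standard-4 corresponds to the 4 vCPU / 15 GB machine described above.
gcloud dataproc clusters create spark-vs-hadoop --region=us-central1 --master-machine-type=n1-standard-4 --master-boot-disk-size=50GB --num-workers=5 --worker-machine-type=n1-standard-4 --worker-boot-disk-size=50GB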
'''Allow API access to all Google Cloud services in the project.'''
[[File:Googlecloud-setup-9.jpg]]
To view the individual nodes in the cluster, go to '''Menu -> Virtual Machines -> VM Instances'''.
[[File:Googlecloud-setup-11b.jpg]]
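The cluster's VM instances can also be listed from Cloud Shell with the gcloud CLI:
gcloud compute instances list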
'''Ensure that the Dataproc, Compute Engine, and Cloud Storage APIs are all enabled by going to Menu -> APIs & Services -> Library. Search for each API by name and enable it if it is not already enabled.'''
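The same APIs can be enabled from Cloud Shell; the service names below are the standard API identifiers for Dataproc, Compute Engine, and Cloud Storage:
gcloud services enable dataproc.googleapis.com compute.googleapis.com storage.googleapis.com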
Create a Cloud Storage bucket by going to '''Menu -> Storage -> Browser -> Create Bucket'''.
Make a note of the bucket name.
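Alternatively, the bucket can be created from Cloud Shell. A minimal sketch, assuming the bucket name rinsereduce used later in this walkthrough and a US multi-region location:
gsutil mb -l US gs://rinsereduce/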
 
Copy the Hadoop wordcount example, available on every Dataproc cluster, from the master node VM to our Cloud Storage bucket:
# Open a Secure Shell (SSH) session from the VM Instances list: '''Menu -> Compute -> Compute Engine'''.
# Enter the following command in the shell:
gsutil cp /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar gs://rinsereduce/
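With the jar in the bucket, the Hadoop wordcount job can then be submitted to the cluster with the gcloud CLI. This is a sketch only; the cluster name, region, and input/output paths below are placeholders that must match your own setup:
gcloud dataproc jobs submit hadoop --cluster=spark-vs-hadoop --region=us-central1 --jar=gs://rinsereduce/hadoop-mapreduce-examples.jar -- wordcount gs://rinsereduce/input/ gs://rinsereduce/output/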
=== Results ===