GPU621/Apache Spark

Spark vs Hadoop Wordcount Performance
[[File:Googlecloud-setup-11b.jpg]]
'''Ensure that the Dataproc, Compute Engine, and Cloud Storage APIs are all enabled'''
# Go to '''Menu -> APIs & Services -> Library'''.
# Search for each API by name and enable it if it is not already enabled.
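The same APIs can also be enabled from Cloud Shell with the gcloud CLI. A minimal sketch; the service names below are assumptions, so verify them with <Code>gcloud services list --available</Code> if the command fails:
<Code> gcloud services enable dataproc.googleapis.com compute.googleapis.com storage.googleapis.com </Code>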
'''Create a Cloud Storage bucket''' by going to '''Menu -> Storage -> Browser -> Create Bucket'''.
Make a note of the bucket name.
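Alternatively, the bucket can be created from Cloud Shell. A minimal sketch, using the placeholder <myBucketName> from the rest of this walkthrough (bucket names must be globally unique):
<Code> gsutil mb gs://<myBucketName>/ </Code>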
'''Copy the Hadoop wordcount example, available on every Dataproc cluster, from the master node VM to the Cloud Storage bucket'''
# Open a Secure Shell (SSH) session from the VM Instances list: '''Menu -> Compute -> Compute Engine'''.
# To copy the example jar from the VM's local disk to the Cloud Storage bucket, enter the following command in the shell:
<Code> gsutil cp /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar gs://<myBucketName>/ </Code>
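To confirm the jar arrived, the bucket contents can be listed from the same shell:
<Code> gsutil ls gs://<myBucketName>/ </Code>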
'''Save the Spark wordcount example into the Cloud Storage bucket by dragging and dropping it into the storage browser'''
# To open the browser: '''Menu -> Storage -> Browser'''
# Drag and drop the word-count.py below into the browser, or use 'UPLOAD FILES' to upload.
<Code>
import sys
from pyspark import SparkContext

# Read the input text (first argument) and split each line into words
sc = SparkContext(appName="wordCount")
words = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split())

# Pair each word with 1, sum the counts per word, and write the result to the output path (second argument)
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda count1, count2: count1 + count2)
wordCounts.saveAsTextFile(sys.argv[2])
</Code>
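Instead of drag and drop, the script can also be copied into the bucket from Cloud Shell; a sketch assuming word-count.py has been saved to the current working directory:
<Code> gsutil cp word-count.py gs://<myBucketName>/ </Code>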
 
'''Finally, add the input files containing the text the word count jobs will be processing'''
* Go to Cloud Storage Bucket: '''Menu -> Storage -> Browser'''
* Create a new folder 'input' and open it
* Drag and drop the input files, use 'UPLOAD FILES' or 'UPLOAD FOLDER', or copy them from the command line, as shown below
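For many files, or for the larger files in this data set, uploading from Cloud Shell may be more reliable than the browser. A sketch assuming the text files sit in a local folder named input; gsutil's -m flag parallelizes the copy:
<Code> gsutil -m cp -r input gs://<myBucketName>/ </Code>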
 
For this analysis we are using archived text files of game walkthroughs from https://gamefaqs.gamespot.com/. The files range in size from 4 MB to 2.8 GB, for a total of 7.7 GB of plain text.
[[File:Googlecloud-wordcountfiles.jpg]]
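As a quick sanity check before running the jobs, the total input size can be confirmed from Cloud Shell, assuming the input folder created above:
<Code> gsutil du -s -h gs://<myBucketName>/input </Code>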
 