== Spark vs Hadoop Wordcount Performance ==
=== Running the Hadoop MapReduce Wordcount Job in Dataproc ===
Now that we have our project code, input files, and Dataproc cluster set up, we can proceed to run the Hadoop MapReduce and Apache Spark wordcount jobs.
[[File:Dataproc-hadoop-2.jpeg]]
 
=== Running the Apache Spark Wordcount Job in Dataproc ===
 
'''Create and Submit Dataproc Job'''
# Go to '''Menu -> Big Data -> Dataproc -> Jobs'''
# Select 'SUBMIT JOB' and give the job an ID
# Choose the region that the cluster was created in
# Select your cluster
# Specify PySpark as the Job Type
# Specify the .py file which contains the Apache Spark wordcount algorithm
# Give 2 arguments to word-count.py: the input folder and the output folder (a minimal sketch of this script is shown after the job parameters below)
word-count.py:
gs://<myBucketName>/word-count.py
2 arguments:
gs://<myBucketName>/inputFolder gs://<myBucketName>/output
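
For reference, the sketch below shows one way word-count.py could be written: a plain RDD-based PySpark wordcount that reads the input folder and writes the counts to the output folder given as the two job arguments. The exact contents of the course's word-count.py may differ; this is only an illustrative sketch.

<syntaxhighlight lang="python">
# Illustrative sketch of a PySpark wordcount script (word-count.py).
# The two command-line arguments are the input and output folders,
# matching the job arguments above.
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("word-count").getOrCreate()

    counts = (spark.sparkContext.textFile(input_path)
              .flatMap(lambda line: line.split())   # split each line into words
              .map(lambda word: (word, 1))          # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))     # sum the counts per word

    counts.saveAsTextFile(output_path)              # write results to the output folder
    spark.stop()
</syntaxhighlight>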
 
 
'''Note: Running the job will create the output folder. However, for subsequent jobs be sure to delete the output folder first, otherwise Hadoop or Spark will refuse to run the job. This behaviour prevents existing output from being accidentally overwritten.'''
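
One convenient way to clear the output folder between runs is with the Cloud Storage Python client. The sketch below assumes the placeholder bucket name used above, an output prefix of ''output/'', and that the ''google-cloud-storage'' package is installed; deleting the folder from the Cloud Storage browser works just as well.

<syntaxhighlight lang="python">
# Sketch: remove every object under the output/ prefix before resubmitting a job.
# Assumes the placeholder bucket name used above; adjust names to your own setup.
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("<myBucketName>", prefix="output/"):
    blob.delete()
</syntaxhighlight>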
=== Results ===