GPU621/Apache Spark
'''Note: Running the job will create the output folder. However, before any subsequent job, be sure to delete the output folder, otherwise Hadoop or Spark will refuse to run. This limitation exists to prevent existing output from being overwritten.'''
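Since the data in this setup lives in a Google Cloud Storage bucket, the old output "folder" can be removed before resubmitting the job. The sketch below uses the google-cloud-storage Python client; the bucket name and output prefix are placeholders, not the exact paths used in these tests.

<syntaxhighlight lang="python">
# Minimal sketch: delete a previous job's output "folder" from a GCS bucket
# before re-running the Hadoop or Spark job. The bucket name and prefix are
# placeholders (assumptions), not the actual paths used in this project.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-dataproc-bucket")  # placeholder bucket name

# GCS has no real directories; deleting every object under the prefix
# removes the "output" folder so the next job can create it again.
for blob in bucket.list_blobs(prefix="wordcount/output/"):
    blob.delete()
</syntaxhighlight>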
=== Results ===
==== Hadoop Counters ====
* Number of splits: 66
* Total input files to process: 8
[[File:Results.jpg]]
=== Conclusion ===
Using the same hardware (RAM, CPUs, HDD) across a 6-node cluster and processing the same data (8 .txt files totalling 7.77 GB), we see only an approximately 12% performance improvement between Hadoop MapReduce and Spark using a word count algorithm. This falls far short of the advertised 10 times faster on disk and 100 times faster in-memory. Further testing and analysis of Spark's internal data could be done to determine whether any bottlenecks exist that limit Spark's performance. For example, how well is the cluster utilizing the hardware, namely the RAM? One possible explanation is the use of a Google Cloud Storage bucket to store the data rather than the Hadoop Distributed File System (HDFS). Both jobs operate directly on data in Cloud Storage rather than in HDFS. This may be reducing data access time for Hadoop MapReduce, or adding data access time for Apache Spark, compared to having the input data stored directly on the VM data nodes.

[[File:Googlecloud-hdfs.jpg]]
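For reference, the comparison above is based on a standard word count. A minimal PySpark sketch of such a job is shown below; the application name and the gs:// input and output paths are placeholders, not the exact job submitted to Dataproc in these tests.

<syntaxhighlight lang="python">
# Minimal PySpark word count sketch. The bucket paths below are
# placeholders (assumptions), not the exact ones used in this comparison.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Read the input .txt files directly from the Cloud Storage bucket.
lines = sc.textFile("gs://my-dataproc-bucket/wordcount/input/*.txt")

# Classic word count: split lines into words, map each word to 1,
# then sum the counts per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Write results back to the bucket; this creates the output folder
# that must be deleted before the job is run again.
counts.saveAsTextFile("gs://my-dataproc-bucket/wordcount/output")

spark.stop()
</syntaxhighlight>

A script like this can be submitted to the cluster with <code>gcloud dataproc jobs submit pyspark</code>, with the cluster name and region specified for the job.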
# Nov 28, 2020 - Added Google Cloud account setup steps
# Nov 29, 2020 - Added testing setup and job execution for Dataproc Hadoop and Spark
# Nov 30, 2020 - Added testing results and conclusion