Changes

GPU621/Apache Spark

259 bytes added, 16:26, 30 November 2020

→‎Apache Hadoop

= Apache Hadoop =

[https://hadoop.apache.org/ ''' Apache Hadoop'''] is an open-source framework that allows for the storage and distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is an implementation of MapReduce, an application programming model developed by Google. Hadoop is built in Java and accessible through many languages for writing MapReduce ~~has three basic operations: Map~~code including Python through a Thrift client. Hadoop can process both structured and unstructured data, ~~Shuffle~~ and ~~Reduce. Map, where each worker node applies~~ scale up reliably from a ~~map function~~ single server to the local data and writes the output to temporary storage. Shuffle, where worker nodes redistribute data based on output keys such that all data belonging to one key is located on the same worker node. Finally reduce, where each worker node processes each group thousands of ~~output in parallel~~machines.

== Architecture ==

=== Hadoop MapReduce ===

The processing component of Hadoop ecosystem. It assigns the data fragments from the HDFS to separate map tasks in the cluster and processes the chunks in parallel to combine the pieces into the desired result. MapReduce has three basic operations: Map, Shuffle and Reduce. Map, where each worker node applies a map function to the local data and writes the output to temporary storage. Shuffle, where worker nodes redistribute data based on output keys such that all data belonging to one key is located on the same worker node. Finally reduce, where each worker node processes each group of output in parallel.

== Applications ==

Abalachandran7

73

edits

Changes

GPU621/Apache Spark

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

get involved with CDOT

courses

course projects

links

Tools