Changes

GPU621/Spark

5 bytes removed, 18:20, 24 November 2016

→‎RDD Essential Properties

These notes supplement the presentation for those who want an in-depth walkthrough of the concepts and code. The presentation introduced the technical and practical aspects of Spark, and these notes focus on the same.

=== Spark ===

Spark is a Big Data framework for large-scale data processing. It provides an API centred on a data structure called the Resilient Distributed Dataset (RDD), a read-only, fault-tolerant multiset of data items distributed over a cluster of machines. High-level APIs are available for Scala, Java, Python, and R. This tutorial focuses on Python code for its simplicity and popularity.

=== History ===

Spark was developed in 2009 at UC Berkeley's AMPLab. It was open sourced in 2010 under the BSD license. As of this writing (November 2016), it is at version 2.0.2.

An RDD has three essential properties and two optional properties.

* A list of parent RDDs, i.e. the dependencies the RDD relies on for its records.
* An array of partitions that the dataset is divided into.
* A compute function that performs a computation on a partition.
* An optional partitioner that defines how keys are hashed and how the pairs are partitioned (for key-value RDDs).
* Optional preferred locations (also known as locality info), i.e. the hosts for a partition where the data will have been loaded.
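The five properties can be made concrete with a small, single-process model. The sketch below is plain Python and does not use Spark; `ToyRDD`, `parallelize`, and every method name here are hypothetical stand-ins chosen for illustration, not Spark's actual classes. It assumes partitions can be identified by an integer index and that a "shuffle" can be simulated by scanning all parent partitions.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional


class ToyRDD:
    """Toy, in-memory stand-in for an RDD (an illustration, not Spark's class)."""

    def __init__(
        self,
        parents: List["ToyRDD"],
        num_partitions: int,
        compute: Callable[[int], Iterator[Any]],
        partitioner: Optional[Callable[[Any], int]] = None,
        preferred_locations: Optional[Dict[int, List[str]]] = None,
    ) -> None:
        self.parents = parents                # 1. dependencies (parent RDDs)
        self.num_partitions = num_partitions  # 2. partitions, identified by index
        self._compute = compute               # 3. compute function for one partition
        self.partitioner = partitioner        # 4. optional: key -> partition index
        self._locations = preferred_locations or {}  # 5. optional locality info

    def compute(self, index: int) -> Iterator[Any]:
        """Produce the records of one partition, recomputing from parents."""
        return self._compute(index)

    def preferred_locations(self, index: int) -> List[str]:
        """Hosts where this partition's data is assumed to live (may be empty)."""
        return self._locations.get(index, [])

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # The child keeps a reference to its parent and derives each
        # partition from the parent's partition with the same index.
        return ToyRDD(
            parents=[self],
            num_partitions=self.num_partitions,
            compute=lambda i: (fn(x) for x in self.compute(i)),
        )

    def partition_by(self, num_partitions: int) -> "ToyRDD":
        # Simulated shuffle for key-value records: partition i receives
        # every pair whose key hashes to i.
        def part(key: Any) -> int:
            return hash(key) % num_partitions

        def compute(i: int) -> Iterator[Any]:
            return (
                kv
                for p in range(self.num_partitions)
                for kv in self.compute(p)
                if part(kv[0]) == i
            )

        return ToyRDD([self], num_partitions, compute, partitioner=part)

    def collect(self) -> List[Any]:
        """Gather all records, partition by partition, into one list."""
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


def parallelize(data: Iterable[Any], num_partitions: int) -> ToyRDD:
    """Create a root ToyRDD (no parents) by slicing data into contiguous chunks."""
    items = list(data)
    n = len(items)

    def compute(i: int) -> Iterator[Any]:
        return iter(items[i * n // num_partitions:(i + 1) * n // num_partitions])

    return ToyRDD(parents=[], num_partitions=num_partitions, compute=compute)


base = parallelize(range(1, 7), num_partitions=3)
scaled = base.map(lambda x: x * 10)
print(scaled.collect())  # [10, 20, 30, 40, 50, 60]
```

Because `scaled` holds only a reference to `base` and a compute function, any partition can be rebuilt from its parent on demand. That is the idea behind the fault tolerance mentioned earlier, shown here in miniature rather than as Spark implements it.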

== RDD Functions ==

Nascherman
