GPU621/Spark

4 bytes added, 20:56, 24 November 2016
History
=== History ===
Spark was developed in 2009 at UC Berkeley's AMPLab. It was open sourced in 2010 under the BSD license. As of this writing (November 2016), it is at version 2.0.2.
Spark is one of the most active projects in the Apache Software Foundation and one of the most popular open source big data projects overall; it had over 1000 contributors in 2015.
{| class="wikitable"
|-
! Original release !! Latest version !! Release date
|-
| 0.5 || 2.0.2 || November 2016
|}
==== R (Resilient) ====
Describes how an RDD is fault tolerant: each RDD keeps track of the lineage of transformations that produced it, so a lost or damaged partition can be recomputed from its parent RDDs rather than replicated.
[https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-lineage.html Source]
==== D (Distributed) ====
[[File: spark-distribution.png| 600px]]
Describes how data resides on multiple nodes in a cluster, across a network of machines. Data can be read from and written to distributed storage such as HDFS or S3, and, most importantly, can be cached in the memory of worker nodes for immediate reuse. Spark is designed as a framework that operates over a network infrastructure, so tasks are divided and executed across multiple nodes within a Spark context.
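A minimal sketch of this arrangement, assuming Spark 2.x and a standalone cluster whose master URL (spark://master:7077) is hypothetical: the driver's SparkContext splits the job into tasks that execute on the worker nodes.

<source lang="scala">
// Minimal sketch: a driver program connecting to a (hypothetical) standalone cluster.
import org.apache.spark.{SparkConf, SparkContext}

object DistributedSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("distributed-sketch")
      .setMaster("spark://master:7077") // hypothetical cluster manager URL
    val sc = new SparkContext(conf)

    // Eight partitions: up to eight tasks run in parallel across the workers.
    val rdd = sc.parallelize(1L to 1000000L, numSlices = 8)
    println(rdd.map(_ * 2).sum()) // work is distributed; the result returns to the driver
    sc.stop()
  }
}
</source>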
==== D (Dataset) ====
[[File: partition-stages.png| 600px]]
Describes the collection of partitioned data records that an RDD holds. Each partition is a logical chunk of the full dataset and is Spark's unit of parallelism: the scheduler launches one task per partition.
[https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html Source]
=== RDD Essential Properties ===
An RDD has three essential properties and two optional properties (a code sketch follows the list).
# A list of parent RDDs, i.e. the dependencies an RDD relies on for its records.
# An array of partitions that the dataset is divided into.
# A compute function that performs a computation on each partition.
# An optional partitioner that defines how keys are hashed and how key-value pairs are partitioned (for key-value RDDs).
# Optional preferred locations (aka locality info), i.e. hosts for a partition where the data will have been loaded.
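A minimal sketch, assuming Spark 2.x running in local mode, of how these properties surface in the public RDD API (the application name and numbers are arbitrary):

<source lang="scala">
// Minimal sketch (Spark 2.x, local mode) inspecting the five RDD properties.
import org.apache.spark.{SparkConf, SparkContext}

object RddPropertiesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-properties").setMaster("local[*]"))

    val pairs   = sc.parallelize(1 to 100, numSlices = 4).map(n => (n % 10, n))
    val reduced = pairs.reduceByKey(_ + _)

    println(reduced.dependencies)      // 1. parent RDDs this RDD depends on
    println(reduced.partitions.length) // 2. the array of partitions
    println(reduced.collect().toList)  // 3. the compute function runs when an action forces evaluation
    println(reduced.partitioner)       // 4. Some(HashPartitioner) for this key-value RDD
    // 5. preferred locations (locality info) for the first partition;
    //    empty here, since parallelize has no storage-backed locality.
    println(reduced.preferredLocations(reduced.partitions(0)))

    sc.stop()
  }
}
</source>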
=== RDD Functions ===
RDD supports two kinds of operations: ''transformations'' and ''actions''.
The essential idea is that the programmer specifies a transformation, or a series of transformations, to perform on a data set, and then performs an action that returns the new data produced by those transformations. That result can then be used for analysis or fed into further transformations. Transformations can be thought of as the start of a parallel region of code, and the action as its end. Everything in Spark is designed to be as simple as possible, so partitions, threads, and the like are generated automatically.
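A minimal sketch of this model, assuming a SparkContext named sc (as provided by spark-shell) and a hypothetical HDFS path; each transformation is lazy, and only the final action triggers execution:

<source lang="scala">
// Transformations build up a lineage lazily; the action at the end runs the job.
val lines  = sc.textFile("hdfs:///data/sample.txt")    // hypothetical input path
val words  = lines.flatMap(_.split("\\s+"))            // transformation: nothing executes yet
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // transformation
val result = counts.collect()                          // action: the job actually runs here
</source>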
=== Advantages ===
The main advantage of Spark is that data partitions are stored in memory, so access is much faster than retrieving the same data from a hard disk. In some cases this is also a disadvantage, since keeping large datasets in memory requires a correspondingly large amount of physical RAM.
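A minimal sketch of that trade-off, again assuming a SparkContext named sc and a hypothetical input path: persist() keeps the computed partitions in worker memory, so the second action is served from RAM instead of re-reading the file.

<source lang="scala">
import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///data/large.txt") // hypothetical input path
  .map(_.toLowerCase)
  .persist(StorageLevel.MEMORY_ONLY)             // keep partitions in RAM; needs enough memory

println(data.count()) // first action: reads from disk, computes, then caches
println(data.count()) // second action: served from the in-memory cache
</source>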