Difference between revisions of "SLEEPy"

From CDOT Wiki
Jump to: navigation, search
(Batch Sorting)
(Introduction)
Line 20: Line 20:
  
 
== Introduction ==
 
== Introduction ==
 +
=== What is Big Data ===
 +
Big Data is data that is so big that traditional methods of data processing fail to keep up. A lot of the time this type of data is related to human behavior measurable by computers. This maybe trends, web analytics etc. With so many people using technology, data gathered this way is becoming massive. This is one of the reasons for the rise of Big Data.
 +
 +
[[File:Viegas-UserActivityonWikipedia.gif|center|500px|Visualization of daily Wikipedia edits created by IBM. At multiple terabytes in size, the text and images of Wikipedia are an example of big data.]]
 +
 +
=== What is DAAL ===
 
DAAL is a C++ & Java / Scala library for data analytics. It's similar to MKL(Math Kernel Library) with some differences:
 
DAAL is a C++ & Java / Scala library for data analytics. It's similar to MKL(Math Kernel Library) with some differences:
 
* MKL focuses on computation. DAAL focuses on the entire data flow (aquisition, transformation, processing).
 
* MKL focuses on computation. DAAL focuses on the entire data flow (aquisition, transformation, processing).

Revision as of 16:29, 11 April 2016


GPU621/DPS921 | Participants | Groups and Projects | Resources | Glossary

Intel Data Analytics Acceleration Library (DAAL)

Team Member

  1. Luong Chuong
  2. Jacky Siu
  3. Woodson Delhia

Intro OLD

Local DAAL Examples Location: C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016\windows\daal\examples

Data: http://open.canada.ca/data/en/dataset/cad804cd-454e-4bd7-9f22-fcee64f60719

New Data: http://open.canada.ca/data/en/dataset/be3880f2-0d04-4583-8265-611b231ebce8

Parser code: https://software.intel.com/en-us/node/610127

Low Order Moments: https://software.intel.com/en-us/node/599561

Our goal is to parse & process this crime data and to add more meaning to said data. Using various parallel techniques taught in the course and comparing them via the DAAL library.

Introduction

What is Big Data

Big Data is data that is so big that traditional methods of data processing fail to keep up. A lot of the time this type of data is related to human behavior measurable by computers. This maybe trends, web analytics etc. With so many people using technology, data gathered this way is becoming massive. This is one of the reasons for the rise of Big Data.

What is DAAL

DAAL is a C++ & Java / Scala library for data analytics. It's similar to MKL(Math Kernel Library) with some differences:

  • MKL focuses on computation. DAAL focuses on the entire data flow (aquisition, transformation, processing).
  • Optimized for all kinds of Intel based devices (from data center to home computers)


DAAL supports 3 processing modes

  • Offline Processing (Batch) - Data can fit in memory, data can be processed all at once.
  • Online Processing (Streaming) - Data is too big for memory, DAAL processes the data in chunks and combine the partial results for the final result.
  • Distributed processing - Distributes data processing. DAAL has not bound the communication method and leaves it to the developer (Hadoop, Spark, MPI etc).


DAAL Data Flow.

Parallel

Code Examples

Batch Sorting

CSV Data:

-55.558252,63.051427,-27.793776, -75.622534,61.212279,-16.283311, -86.747071,-28.132241,-17.824316, -34.172101,-51.404172,14.670925, -61.506308,48.248030,-99.235341, 9.746765,-89.879258,55.561778, 48.896723,-32.648097,48.313603, -15.346015,9.769776,-33.483281, 56.726081,-87.272631,8.724224, -1.926802,54.960580,-78.723429, 45.237223,-79.764218,-47.271926, 84.138339,11.547818,-92.962952, 46.711824,-42.623510,-34.664673, 55.813112,19.803475,4.807766, -55.474098,-72.163755,89.425736, -7.566596,-77.829218,58.630172, -76.081937,-12.089445,-44.065054, -25.888944,46.425499,-37.515164, -30.201387,-16.237217,-50.716591, -88.085869,60.136249,54.812866

Code:

/* file: sorting_batch.cpp
* Copyright 2014-2016 Intel Corporation All Rights Reserved.*/

#include "daal.h"
#include "service.h"

using namespace daal;
using namespace daal::algorithms;
using namespace daal::data_management;
using namespace std;

/* Input data set parameters */
string datasetFileName = "../data/batch/sorting.csv";

int main(int argc, char *argv[])
{
    checkArguments(argc, argv, 1, &datasetFileName);

    /* Initialize FileDataSource<CSVFeatureManager> to retrieve the input data from a .csv file */
    FileDataSource<CSVFeatureManager> dataSource(datasetFileName, DataSource::doAllocateNumericTable, DataSource::doDictionaryFromContext);

    /* Retrieve the data from the input file */
    dataSource.loadDataBlock();

    /* Create algorithm objects to sort data using the default (radix) method */
    sorting::Batch<> algorithm;

    /* Print the input observations matrix */
    printNumericTable(dataSource.getNumericTable(), "Initial matrix of observations:");

    /* Set input objects for the algorithm */
    algorithm.input.set(sorting::data, dataSource.getNumericTable());

    /* Sort data observations */
    algorithm.compute();

    /* Get the sorting result */
    services::SharedPtr<sorting::Result> res = algorithm.getResult();

    printNumericTable(res->get(sorting::sortedData), "Sorted matrix of observations:");

    return 0;
}

Results:

DAAL Sort Batch.

The data is sorted from smallest to largest per column.

Data Blocks:

dataSource.loadDataBlock(5);
//dataSource.loadDataBlock(5);
dataSource.loadDataBlock(5);
dataSource.loadDataBlock(5);
DAAL Sort Block. DAAL Sort Block.

"Blocks" of data are being loaded 5 rows at a time. This allows us to easily section off data to process. This is also one way of distributing data to MPI etc.

Useful Link

  1. https://software.intel.com/en-us/daal