GPU621/Jedd Clan

4,372 bytes added, 18:03, 11 August 2021

== Background ==

The Intel® Data Analytics Acceleration Library (DAAL) is used by data scientists to analyze big data, that is, data that has become too large, too fast, or too complex to process by traditional means. Data sources are also becoming more complex than traditional ones; in some cases they are driven by artificial intelligence, the Internet of Things, and mobile devices. Data from devices, sensors, web and social media, and log files is generated at large scale and in real time. Intel first released DAAL on August 25, 2015 and launched its successor, oneDAL, in December 2020. The library is bundled with the Intel oneAPI Base Toolkit, runs on Windows, Linux, and macOS, and provides interfaces for languages including C++, Python, and Java.

To install the Intel DAAL library, follow the [https://software.intel.com/en-us/get-started-with-daal-for-linux instructions].

== Data Analytics Pipeline ==

*The Intel® Data Analytics Acceleration Library provides optimized building blocks for the various stages of data analysis

[[File:DataAnalyticsStages.jpg| 900px]]

== Data Management ==

Data management refers to a set of operations that work on the data and are distributed between the stages of the data analytics pipeline. The data management flow is shown in the figure below. It starts with raw data acquisition: the first step is to transfer out-of-memory data, whose source could be files, databases, or remote storage, into an in-memory representation.

Once the data is in memory, it can be prepared in many ways. DAAL supports various in-memory data formats, such as array of structures and compressed sparse row (CSR). It can also convert data into a numeric representation, filter and normalize data, compute statistical metrics for numerical data such as the mean, variance, and covariance, and compress and decompress data.
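To make the preparation step more concrete, here is a minimal sketch in plain C++ of two of the operations mentioned above: computing the mean and variance of one feature column and normalizing it with a z-score. It does not use the DAAL API; <code>ColumnStats</code>, <code>computeStats</code>, and <code>zScore</code> are hypothetical names chosen for illustration, and the use of the population variance (dividing by n) is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of DAAL): statistics for one feature column.
struct ColumnStats {
    double mean;
    double variance;  // population variance (divides by n); an assumption of this sketch
};

// Compute the mean and population variance of a non-empty column.
ColumnStats computeStats(const std::vector<double>& column) {
    assert(!column.empty());
    const double n = static_cast<double>(column.size());
    double sum = 0.0;
    for (double v : column) sum += v;
    const double mean = sum / n;
    double squaredDeviations = 0.0;
    for (double v : column) squaredDeviations += (v - mean) * (v - mean);
    return {mean, squaredDeviations / n};
}

// Z-score normalization: shift the column to zero mean and scale it to unit variance.
// A constant column (zero variance) is mapped to all zeros.
std::vector<double> zScore(const std::vector<double>& column) {
    const ColumnStats stats = computeStats(column);
    const double stdDev = std::sqrt(stats.variance);
    std::vector<double> normalized;
    normalized.reserve(column.size());
    for (double v : column) {
        normalized.push_back(stdDev > 0.0 ? (v - stats.mean) / stdDev : 0.0);
    }
    return normalized;
}
```

A library such as DAAL performs these steps with optimized kernels; the sketch only shows what the operations compute.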

The third step is to stream the in-memory numerical data to the algorithm.

In complex usage scenarios the data passes through these three stages back and forth. For example, if the data isn’t fully available at the start of the computation, it can be sent in chunks, which is an advantage of DAAL.

*Raw Data Acquisition

*Data preparation

*Algorithm computation

[[File:ManagemenFlowDal.jpg|900px]]

A data set is a tabular view of data in which the columns represent features, which are properties or qualities of a real object or event, and the rows represent observations, which are feature vectors that encode information about a real object or event. This is used in our machine learning regression example. The data set is used across all stages of the data analytics pipeline: during acquisition it’s downloaded into local memory, during preparation it’s converted to a numerical representation, and during computation it’s used with an algorithm as an input or result.

[[File:DataSet.jpg]]
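The tabular layout described above can be sketched in plain C++ as follows. <code>SimpleTable</code> is a hypothetical stand-in, not a DAAL class, and the row-major storage order is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a numeric table (not a DAAL class).
// Rows are observations, columns are features, and values are stored
// contiguously in row-major order.
class SimpleTable {
public:
    SimpleTable(std::size_t numFeatures, std::vector<double> values)
        : numFeatures_(numFeatures), values_(std::move(values)) {
        assert(numFeatures_ > 0 && values_.size() % numFeatures_ == 0);
    }

    std::size_t numRows() const { return values_.size() / numFeatures_; }
    std::size_t numFeatures() const { return numFeatures_; }

    // Value of feature `col` for observation `row`.
    double at(std::size_t row, std::size_t col) const {
        assert(row < numRows() && col < numFeatures_);
        return values_[row * numFeatures_ + col];
    }

    // Pointer to the feature vector of observation `row`.
    const double* rowPtr(std::size_t row) const {
        assert(row < numRows());
        return values_.data() + row * numFeatures_;
    }

private:
    std::size_t numFeatures_;
    std::vector<double> values_;
};
```

With two features, the values <code>{1, 2, 3, 4, 5, 6}</code> form three observations, and <code>rowPtr(1)</code> points at the feature vector <code>(3, 4)</code>.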

== Building Blocks ==

DAAL provides building blocks that help with aspects of data analytics, from the tools used for managing data to computational algorithms.

[[File:BuildingBlocks.jpg| 900px]]

== Computations ==

*An algorithm must be chosen for the application

*Below are the different types of algorithms that DAAL currently has for analysis, training, and prediction

[[File:Algorithims.jpg| 900px]]

*Modes of Computation

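**Batch Mode - the simplest mode; the algorithm works on a single data set that is available in full

**Online Mode - the data arrives in multiple blocks (for example, several training sets), and the result is updated as each block is processed

**Distributed Mode - partial results are computed on separate data sets and then combined into the final result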

[[File:ComputationMode.jpg| 900px]]
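The difference between the three modes can be illustrated with a deliberately simple computation, the mean of a set of values. The sketch below is plain C++ and does not use the DAAL API; <code>PartialSum</code> and the three functions are hypothetical names, and the "nodes" in the distributed case are simulated by a loop in one process.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical partial result: enough state to finish or merge a mean.
struct PartialSum {
    double sum = 0.0;
    std::size_t count = 0;

    // Fold one block of values into the running state.
    void update(const std::vector<double>& block) {
        for (double v : block) sum += v;
        count += block.size();
    }

    // Combine the partial result computed elsewhere.
    void merge(const PartialSum& other) {
        sum += other.sum;
        count += other.count;
    }

    double finalize() const {
        assert(count > 0);
        return sum / static_cast<double>(count);
    }
};

// Batch mode: the whole data set is available at once.
double batchMean(const std::vector<double>& data) {
    PartialSum state;
    state.update(data);
    return state.finalize();
}

// Online mode: blocks arrive one after another and update a single running state.
double onlineMean(const std::vector<std::vector<double>>& blocks) {
    PartialSum state;
    for (const auto& block : blocks) state.update(block);
    return state.finalize();
}

// Distributed mode: each (simulated) node computes its own partial result,
// and a master step merges them into the final result.
double distributedMean(const std::vector<std::vector<double>>& nodeData) {
    PartialSum master;
    for (const auto& local : nodeData) {
        PartialSum partial;
        partial.update(local);
        master.merge(partial);
    }
    return master.finalize();
}
```

All three functions return the same mean for the same values; what changes is when the data becomes available and where the intermediate state lives.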

== How To Use Intel DAAL ==

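<source lang="cpp">
// Descriptor that will receive the requested block of rows
BlockDescriptor<double> block;

// offset: index of the first row to read; numRows: number of rows to read;
// readWrite: access mode (read/write permissions); block: the object being written to.
// This provides more control than just defining an array.
dataTable->getBlockOfRows(offset, numRows, readWrite, block);

// Raw pointer to the values in the block
double* rawData = block.getBlockPtr();

// Release the block when finished so that changes are written back to the table
dataTable->releaseBlockOfRows(block);
</source>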

== How to create the Training Model ==

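<source lang="cpp">
// The two template parameters are algorithmFPType and method:
// TYPE is the floating-point type (float or double) and MTHD is the computation method.
training::Batch<TYPE, MTHD> algorithm;
</source>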
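To show what training and prediction compute in a regression setting, here is a minimal sketch of ordinary least squares for a single feature, written in plain C++. It is not DAAL's implementation and does not use its API; <code>LinearModel</code>, <code>train</code>, and <code>predict</code> are hypothetical names, and restricting the model to one feature is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model type: y is approximated by intercept + slope * x.
struct LinearModel {
    double intercept;
    double slope;
};

// Fit the model by ordinary least squares (closed-form solution for one feature).
LinearModel train(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sumX = 0.0, sumY = 0.0, sumXX = 0.0, sumXY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumX += x[i];
        sumY += y[i];
        sumXX += x[i] * x[i];
        sumXY += x[i] * y[i];
    }
    const double denominator = n * sumXX - sumX * sumX;
    assert(denominator != 0.0);  // the x values must not all be identical
    const double slope = (n * sumXY - sumX * sumY) / denominator;
    const double intercept = (sumY - slope * sumX) / n;
    return {intercept, slope};
}

// Use the trained model to predict the response for a new observation.
double predict(const LinearModel& model, double x) {
    return model.intercept + model.slope * x;
}
```

The split into a training step that produces a model and a prediction step that consumes it mirrors the structure of this page: the model is built once from training data and then reused for new observations.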

== Creating the Prediction Model ==

[[File:TableValueGraph.jpg]]

The information plotted graphically

== Testing Quality Of Model ==

Finally, DAAL can test a model against a set of ground-truth values to see how accurate its predictions actually are.

One just needs to load the values, obtain the prediction results, and then run the quality metric to compute the error rates.
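As a rough illustration of what an error computation looks like, the sketch below compares predictions with ground-truth values in plain C++. The metrics chosen here, mean absolute error and root mean squared error, are assumptions for the example and are not necessarily the ones DAAL's quality metric reports; <code>ErrorMetrics</code> and <code>evaluate</code> are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch (not a DAAL class).
struct ErrorMetrics {
    double meanAbsoluteError;
    double rootMeanSquaredError;
};

// Compare predictions with ground-truth values of the same length.
ErrorMetrics evaluate(const std::vector<double>& predicted,
                      const std::vector<double>& groundTruth) {
    assert(predicted.size() == groundTruth.size() && !predicted.empty());
    const double n = static_cast<double>(predicted.size());
    double absoluteSum = 0.0;
    double squaredSum = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        const double error = predicted[i] - groundTruth[i];
        absoluteSum += std::fabs(error);
        squaredSum += error * error;
    }
    return {absoluteSum / n, std::sqrt(squaredSum / n)};
}
```

Smaller values mean that the predictions are closer to the ground truth; a value of zero means that they match exactly.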

[[File:QualityModel.jpg|900px]]

As the results show, the model has an error rate of 0.004, which suggests that the algorithm is quite proficient at making predictions for these yachts.

[[File:QualityResults.jpg | 700px]]

== Sources ==

*https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onedal.html#gs.8qq3cz

*https://software.intel.com/content/www/us/en/develop/documentation/onedal-developer-guide-and-reference/top.html

*https://www.codeproject.com/Articles/1151612/A-Performance-Library-for-Data-Analytics-and-Machi

*https://colfaxresearch.com/intro-to-daal-1/

Jchionglo1
