J&J

From CDOT Wiki
Jump to: navigation, search


Intel Threading Blocks


Introduction to Intel Threading Building Blocks

Intel Threading Building Blocks offers a rich and complete approach to expressing parallelism in a C++ program. It is a library that helps you leverage multi-core processor performance without having to be a threading expert. Threading Building Blocks is not just a threads-replacement library; it represents a higher-level, taskbased parallelism that abstracts platform details and threading mechanisms for performance and scalability.Intel® Threading Building Blocks (Intel® TBB) makes parallel performance and scalability easily accessible to software developers who are writing loop and task based applications. Developers can build robust applications that abstract platform details and threading mechanisms while achieving performance that scales with increasing core count.

Why Use It:Intel® Threading Building Blocks (Intel® TBB) lets you easily write parallel C++ programs that take full advantage of multicore performance, that are portable and composable, and that have future-proof scalability.

What is it:Widely used C++ template library for task parallelism.

Primary Features: - Parallel algorithms and data structures and Scalable memory allocation and task scheduling.

Reason to Use: - Rich feature set for general purpose parallelism. C++; Windows*, Linux*, OS X* and other OSes

Key Benefits of Using Intel TBB

Intel TBB differs from typical threading packages in the following ways:

Enables you to specify logical parallelism instead of threads.

Intel TBB has a runtime library that automatically maps logical parallelism onto threads in a way that makes efficient use of processor resources―thereby making it less tedious and more efficient.

Targets threading for performance.

Intel TBB focuses on the particular goal of parallelizing computationally intensive work, delivering higher-level, simpler solutions

Compatible with other threading packages.

Intel TBB can coexist seamlessly with other threading packages, giving you the flexibility to not touch your legacy code but still use Intel TBB for new implementations.

Emphasizes scalable, data parallel programming.

Intel TBB emphasizes data-parallel programming, enabling multiple threads to work on different parts of a collection. Data-parallel programming scales well to larger numbers of processors by dividing the collection into smaller pieces. With data-parallel programming, program performance increases as you add processors.

Relies on generic programming.

Intel TBB uses generic programming. The essence of generic programming is writing the best possible algorithms with the fewest constraints. The C++ Standard Template Library (STL) is a good example of generic programming in which the interfaces are specified by requirements on types.

Threading Building Blocks enables you to specify tasks instead of threads

Most threading packages require you to create, join, and manage threads. Programming directly in terms of threads can be tedious and can lead to inefficient programs because threads are low-level, heavy constructs that are close to the hardware. Direct programming with threads forces you to do the work to efficiently map logical tasks onto threads. In contrast, the Threading Building Blocks runtime library automatically schedules tasks onto threads in a way that makes efficient use of processor resources. The runtime is very effective at loadbalancing the many tasks you will be specifying. By avoiding programming in a raw native thread model, you can expect better portability, easier programming, more understandable source code, and better performance and scalability in general. Indeed, the alternative of using raw threads directly would amount to programming in the assembly language of parallel programming. It may give you maximum flexibility, but with many costs.

Threading Building Blocks targets threading for performance

Most general-purpose threading packages support many different kinds of threading, such as threading for asynchronous events in graphical user interfaces. As a result, general-purpose packages tend to be low-level tools that provide a foundation, not a solution. Instead, Threading Building Blocks focuses on the particular goal of parallelizing computationally intensive work, delivering higher-level, simpler solutions.

Threading Building Blocks is compatible with other threading packages

Threading Building Blocks can coexist seamlessly with other threading packages. This is very important because it does not force you to pick among Threading Building Blocks, OpenMP, or raw threads for your entire program. You are free to add Threading Building Blocks to programs that have threading in them already. You can also add an OpenMP directive, for instance, somewhere else in your program that uses Threading Building Blocks. For a particular part of your program, you will use one method, but in a large program, it is reasonable to anticipate the convenience of mixing various techniques. It is fortunate that Threading Building Blocks supports this. Using or creating libraries is a key reason for this flexibility, particularly because libraries are often supplied by others. For instance, Intel’s Math Kernel Library (MKL) and Integrated Performance Primitives (IPP) library are implemented internally using OpenMP. You can freely link a program using Threading Building Blocks with the Intel MKL or Intel IPP library.

Threading Building Blocks emphasizes scalable, data-parallel programming

Breaking a program into separate functional blocks and assigning a separate thread to each block is a solution that usually does not scale well because, typically, the number of functional blocks is fixed. In contrast, Threading Building Blocks emphasizes data-parallel programming, enabling multiple threads to work most efficiently together. Data-parallel programming scales well to larger numbers of processors by dividing a data set into smaller pieces. With dataparallel programming, program performance increases (scales) as you add processors. Threading Building Blocks also avoids classic bottlenecks, such as a global task queue that each processor must wait for and lock in order to get a new task.

Threading Building Blocks relies on generic programming

Traditional libraries specify interfaces in terms of specific types or base classes. Instead, Threading Building Blocks uses generic programming, which is defined in Chapter 12. The essence of generic programming is to write the best possible algorithms with the fewest constraints. The C++ Standard Template Library (STL) is a good example of generic programming in which the interfaces are specified by requirements on types. For example, C++ STL has a template function that sorts a sequence abstractly, defined in terms of iterators on the sequence. Generic programming enables Threading Building Blocks to be flexible yet efficient. The generic interfaces enable you to customize components to your specific needs.

Initializing and Terminating the Library

Intel Threading Building Blocks components are defined in the tbb namespace. The using directive in the example enables you to use the library identifiers without having to write out the namespace prefix tbb before each identifier.

#include "tbb/task_scheduler_init.h"
using namespace tbb;
int main( ) {
task_scheduler_init init;
...
return 0;
}

Rich Feature Set for Parallelism

Intel TBB includes a rich set of components for threading performance and productivity.

Parallel algorithms and data structures

Generic Parallel Algorithms

An efficient, scalable way to exploit the power of multi-core without having to start from scratch.

Flow Graph

A set of classes to express parallelism as a graph of compute dependencies and/or data flow.

Concurrent Containers

Concurrent access, and a scalable alternative to containers. that are externally locked for thread-safety.

Memory allocation and task scheduling

Task Scheduler

Sophisticated work scheduling engine that empowers parallel algorithms and the flow graph.

Memory Allocation

Scalable memory manager and false-sharing free allocators

Threads and synchronization

Synchronization Primitives

Atomic operations, a variety of mutexes with different properties, condition variables

Timers and Exceptions

Thread-safe timers and exception classes

Threads

OS API wrappers

Thread Local Storage

Efficient implementation for an unlimited number of thread-local variables.

Conditional Numerical Reproducibility

Ensure deterministic associativity for floating-point arithmetic results with the new Intel TBB template function ‘parallel_deterministic_reduce’.

Supports C++11 Lambda

Intel TBB can be used with C++11 compilers and supports lambda expressions. For developers using parallel algorithms, lambda expressions reduce the time and code needed by removing the requirement for separate objects or classes. Flow Graph Designer

Computing systems are becoming increasingly heterogeneous. And developing for heterogeneous computing systems can often be challenging because of divergent programming models and tools. Intel TBB flow graph provides a single interface that enables intra-node distributed memory programming, thereby simplifying communication and load balancing across heterogeneous devices.

It does this in two ways:

1. As an analyzer, it provides capabilities to collect and visualize execution traces from Intel TBB flow graph applications. From Flow Graph Designer, users can explore the topology of their graphs, interact with a timeline of node executions, and project important statistics on to the nodes of their graphs. 2. As a designer, it provides the ability to visually create Intel TBB flow graph diagrams and then generate C++ stubs as a starting point for further development.




Loop Parallelization

The simplest form of scalable parallelism is a loop of iterations that can each run simultaneously without interfering with each other. Such as parallel_for, parallel_reduce, parallel_scan.

parallel_for and parallel_reduce

Load-balanced, parallel execution of a fixed number of independent loop iterations

parallel_scan

A template function that computes a parallel prefix (y[i] = y[i-1] op x[i])


parallel_reduce

Applying a function such as sum, max, min, or logical AND across all the members of a group is called a reduction operation. Doing a reduction in parallel can yield a different answer from a serial reduction because of rounding. For instance, A+B+C+D+E+F may be evaluated in serial as (((((A+B)+C)+D)+E)+F), whereas the parallel version may compute ((A+B)+((C+D)+(E+F))). Ideally, the results would be the same, but if rounding can occur, the answers will differ. Traditional C++ programs perform reductions in loops

parallel_scan

A parallel_scan computes a parallel prefix, also known as a parallel scan. This computation is an advanced concept in parallel computing that is sometimes useful in scenarios that appear to have inherently serial dependencies.

Partitioner Concept

Requirements for a type that decides whether a range should be operated on by a task body or further split.The partitioner implements rules for deciding when a given range should no longer be subdivided, but should be operated over as a whole by a task’s body. The default behavior of the algorithms parallel_for, parallel_reduce, and parallel_scan is to recursively split a range until no subrange remains that is divisible, as decided by the function is_divisible of the Range Concept. The Partitioner Concept models rules for the early termination of the recursive splitting of a range, providing the ability to change the default behavior. A Partitioner object’s decision making is implemented using two functions: a splitting constructor and the function should_execute_range.


Advanced Algorithms

parallel_while

A parallel_while<Body> performs parallel iteration over items. The processing to be performed on each item is defined by a function object of type Body. The items are specified in two ways:

• An initial stream of items

• Additional items that are added while the stream is being processed

Working on the Assembly Line: Pipeline

Pipelining is a common parallel pattern that mimics a traditional manufacturing assembly line. Data flows through a series of pipeline stages and each stage processes the data in some way. Given an incoming stream of data, some of these stages can operate in parallel, and others cannot.

Pipelined processing is common in multimedia and signal processing applications. A classical thread-per-stage implementation suffers two problems:

• The speedup is limited to the number of stages.

• When a thread finishes a stage, it must pass its data to another thread.


parallel_sort

parallel_sort is a comparison sort with an average time complexity O(n log n). When worker threads are available, parallel_sort creates subtasks that may be executed concurrently. This sort provides an unstable sort of the sequence [begin1, end1). Being an unstable sort means that it might not preserve the relative ordering of elements with equal keys.The sort is deterministic; sorting the same sequence will produce the same result each time. The requirements on the iterator and sequence are the same as for std::sort.


Containers

Intel Threading Building Blocks provides highly concurrent containers that permit multiple threads to invoke a method simultaneously on the same container. At this time, a concurrent queue, vector, and hash map are provided. All of these highly concurrent containers can be used with this library, OpenMP, or raw threads.

concurrent_queue

The template class concurrent_queue<T> implements a concurrent queue with values of type T. Multiple threads may simultaneously push and pop elements from the queue. In a single-threaded program, a queue is a first-in first-out structure. But if multiple threads are pushing and popping concurrently, the definition of first is uncertain. The only guarantee of ordering offered by concurrent_queue is that if a thread pushes multiple values, and another thread pops those same values, they will be popped in the same order that they were pushed.

Pushing is provided by the push method. There are blocking and nonblocking flavors of pop:

pop_if_present

This method is nonblocking: it attempts to pop a value, and if it cannot because the queue is empty, it returns anyway.

pop

This method blocks until it pops a value. If a thread must wait for an item to become available and it has nothing else to do, it should use pop(item) and not while(!pop_if_present(item)) continue; because pop uses processor resources more efficiently than the loop.

concurrent_vector

A concurrent_vector<T> is a dynamically growable array of items of type T for which it is safe to simultaneously access elements in the vector while growing it. However, be careful not to let another task access an element that is under construction or is otherwise being modified. For safe concurrent growing, concurrent_vector has two methods for resizing that support common uses of dynamic arrays: grow_by and grow_to_at_least. The index of the first element is 0. The method grow_by(n) enables you to safely append n consecutive elements to a vector, and returns the index of the first appended element. Each element is initialized with T( ).

Memory Allocators

Threading Building Blocks offers two choices, both similar to the STL template class,

std::allocator: scalable_allocator

This template offers just scalability, but it does not completely protect against false sharing. Memory is returned to each thread from a separate pool, which helps protect against false sharing if the memory is not shared with other threads.

cache_aligned_allocator

This template offers both scalability and protection against false sharing. It addresses false sharing by making sure each allocation is done on a cache line.

Mutual Exclusion

One of the key advantages of threading is the capability of tasks to share data and other resources, such as open files, network connections, and the user interface. But concurrency creates some uncertainty over when each task reads or writes resources, leading to the need for mutual exclusion.

Threading Building Blocks offers two kinds of mutual exclusion:

Mutexes

These will be familiar to anyone who has used locks in other environments, and they include common variants such as reader-writer locks.

Atomic operations

These are based on atomic operations offered by hardware processors, and they provide a solution that is simpler and faster than mutexes in a limited set of situations. These should be preferred when the circumstances make their use possible.

Mutexes

A mutex is a global variable that multiple tasks can access. Before entering code that you do not want to execute concurrently, a task should request a lock on the mutex. If the mutex is already locked, the task is stalled waiting. Once the lock is granted, the task can proceed. The task should proceed quickly and release the lock on the mutex promptly to avoid unnecessary stalls in other tasks. If a mutex creates a lot of lock requests, it is considered highly contested. Highly contested locks will result in many tasks being stalled and a sharp reduction in program scalability. Avoiding highly contested locks through careful program design is important. A mutex, used properly, ensures that no task reads or writes a variable or other resource when another task is writing it. Intel Threading Building Blocks mutexes work with generic programming in C++, even in the presence of exceptions. Meeting all these requirements is no small feat and takes some consideration to ensure their proper usage. In Threading Building Blocks, mutual exclusion is implemented by classes known as mutexes and locks. A mutex is an object on which a task can acquire a lock. Only one task at a time can have a lock on a mutex; other tasks have to wait their turn. Mutual exclusion controls how many tasks can simultaneously run a region of code. In general, you protect a region of code from concurrency when that code reads or writes a small amount of memory, which is generally interrelated by a particular operation you want to effect.


The following is a summary of mutex behaviors:

• A spin_mutex is nonscalable, unfair, nonreentrant, and spins in user space. It would seem to be the worst of all possible worlds, except that it is very fast in lightly contended situations. If you can design your program so that contention is somehow spread out among many spin mutexes, you can improve performance over other kinds of mutexes. If a mutex is heavily contended, your algorithm will not scale anyway. Consider redesigning the algorithm instead of looking for a more efficient lock.

• A queuing_mutex is scalable, fair, nonreentrant, and spins in user space. Use it when scalability and fairness are important.

• A spin_rw_mutex and a queuing_rw_mutex are similar to spin_mutex and queuing_ mutex , but they additionally support reader locks. spin_mutex is very fast in lightly contended situations; use it if you need to protect a small section of code.

• A mutex is a wrapper around the system’s native mutual exclusion mechanism. On Windows systems, it is implemented on top of a CRITICAL_SECTION . On Linux systems, it is implemented on top of a pthread mutex . The advantages of using the wrapper are that it adds an exception-safe interface and provides an interface identical to the other mutexes in Threading Building Blocks, which makes it easy to swap in a different kind of mutex later if warranted by performance measurements.Future versions of Threading Building Blocks may be able to put tasks to sleep more often, which can be very desirable.

Atomic Operations

Atomic operations are a fast and relatively easy alternative to mutexes. They do not suffer from the deadlock and convoying problems described earlier, in the section “Lock Pathologies.” The main limitation of atomic operations is that they are limited in current computer systems to fairly small data sizes: the largest is usually the size of the largest scalar, often a double-precision floating-point number.

Atomic operations are also limited to a small set of operations supported by the underlying hardware processor. But sophisticated algorithms can accomplish a lot with these operations; this section shows a couple of examples. You should not pass up an opportunity to use an atomic operation in place of mutual exclusion. The class atomic<T> implements atomic operations with C++ style. A classic use of atomic operations is for thread-safe reference counting. Suppose x is a reference count of type int and the program needs to take some action when the reference count becomes zero. In single-threaded code, you could use a plain int for x , and write --x;if(x==0) action( ) . But this method might fail for multithreaded code because two tasks might interleave their operations


Timing

Intel Threading Building Blocks provides a thread-safe and portable method to compute elapsed time.

Many operating systems provide a fine-grained method to track the accounting of central processing unit (CPU) time, and most provide a way to track elapsed/wallclock time. The method of extracting an elapsed measurement with subsecond accuracy varies a great deal. The class tick_count in Threading Building Blocks provides a simple interface for measuring wall-clock time.


Task Scheduler

The library provides a task scheduler, which is the engine that drives the algorithm templates. You may also call it directly. This is worth considering if your application meets the criteria described earlier that make the default task scheduler inefficient. Tasks are logical units of computation. The scheduler maps these onto physical threads. The mapping is non-preemptive. Each thread has an execute( ) method. Once a thread starts running execute( ) , the task is bound to that thread until execute( ) returns. During that time, the thread services other tasks only when it waits on child tasks, at which time it may run the child tasks or—if there are no pending child tasks—service tasks created by other threads. The task scheduler is intended for parallelizing computationally intensive work. Since task objects are not scheduled preemptively, they should not make calls that might block for long periods because, meanwhile, the blocked thread (and its associated processor) are precluded from servicing other tasks.

Potential parallelism is typically generated by a split/join pattern. Two basic patterns of split/join are supported. The most efficient is continuation passing , in which the programmer constructs an explicit “continuation” task rather than leaving it to the default scheduler to determine the next task.

Making Best Use of the Scheduler

This section explains useful programming techniques for scheduling tasks.

Recursive Chain Reaction

The scheduler works best with tree-structured task graphs, because that is where the strategy of breadth-first theft and depth-first work applies very well. Also, tree structured task graphs allow fast creation of many tasks. For example, if a master task tries to create n children directly, it will take O(n) steps. But with tree-structured forking, it takes only O(log n) steps because some of the tasks created can go on to create subtasks. Often, domains are not obviously tree-structured, but you can easily map them to trees. For example, parallel_for works over an iteration space such as a sequence of integers. The template function parallel_for uses that definition to recursively map the iteration space onto a binary tree.

Continuation Passing

The spawn_and_wait_for_all method is a convenient way to wait for child tasks, but it incurs some inefficiency if a thread becomes idle. The idle thread attempts to keep busy by stealing tasks from other threads. The scheduler limits possible victim tasks to those deeper than the waiting task. This limit modifies the policy that the shallowest task should be chosen. The limit restrains memory demands in worst-case scenarios.A way around the constraint is for the parent not to wait, but simply to spawn both children and return. The children are allocated not as children of the parent, but as children of the parent’s continuation task, which is a task that runs when both children complete.

task_scheduler_init

A task_scheduler_init is either active or inactive. Each thread that uses a task should have one active task_scheduler_init object that stays active over the duration that the thread uses task objects. A thread may have more than one active task_scheduler_init at any given moment.

Task Scheduler Summary

The task scheduler works most efficiently for fork-join parallelism with lots of forks so that the task stealing can cause sufficient breadth-first behavior to occupy threads, which then conduct themselves in a depth-first manner until they need to steal more work.The task scheduler is not the simplest-possible scheduler because it is designed for speed. If you need to use it directly, it may be best to hide it behind a higher-level interface, such as the templates parallel_for, parallel_reduce, and so on. Some of the details to remember are:

• Always use new(allocation_method) T to allocate a task, where allocation_ method is one of the allocation methods of the class task. Do not create local or file-scope instances of a task.

• Allocate all siblings before any of them start to run, unless you are using allocate_additional_child_of.

• Exploit continuation passing, scheduler bypass, and task recycling to squeeze out maximum performance.

• If a task completes and was not marked for reexecution, it is automatically destroyed. Also, its dependent’s reference count is decremented, and if it hits 0, the dependent is automatically spawned.