From CDOT Wiki



Project Name


Group Members

Adam Stinziani

Project Description

This project introduces NVIDIA’s CUDA, starting with an overview of the history and current state of CPUs and GPUs, including an analysis of the architectural differences between these computing units and between different manufacturers of the same kind of unit. It also analyzes how software programming models, frameworks, and toolkits maximize the potential of these architectures. CUDA application domains are then analyzed, and use cases for parallel computing with GPUs are both analyzed and implemented. Finally, the performance of CUDA is tested against OpenCL and OpenMP.

What is CUDA?

NVIDIA released the Compute Unified Device Architecture (CUDA) in 2007. CUDA exposes the GPU for general-purpose computing, so CUDA-enabled Graphics Processing Units can be used for more than graphics. CUDA has since evolved into an extensive platform comprising NVIDIA’s proprietary computing platform and programming model for parallel computation, specialized hardware, software APIs and frameworks, compilers, and extensive documentation in NVIDIA’s Developer Zone. Developers can download and install the CUDA Toolkit and learn how to use the CUDA APIs and development tools from the in-depth documentation on NVIDIA’s website. For the research in this assignment, a CUDA-enabled GPU is used to implement use cases and to test the performance of CUDA against other parallel programming platforms. Just as MPI can be deployed to manage a distributed memory system containing thousands of Central Processing Units, CUDA can be deployed in a distributed memory context to manage cloud installations with thousands of Graphics Processing Units. CUDA helps graphics processing for scientific applications, including visualization of galaxies in astronomy and molecular visualization for biology and medicine. CUDA-enabled devices range from embedded systems to cloud and data center hardware. (CUDA Toolkit)

History & Current State of Central & Graphics Processing Units

History of GPUs

“The graphics processing unit (GPU), first invented by NVIDIA in 1999, is the most pervasive parallel processor to date.” (Source) The GeForce 256 was marketed as the world’s first GPU. Designed as a single chip, this product was a champion in its time, delivering “480 million 8-sample fully filtered pixels per second.” (Source) It featured hardware Transform & Lighting and Cube Environment Mapping, which were graphics rendering methods of the day. It supported Direct3D 7.0, which runs on Windows, and OpenGL 1.2, which is cross-platform and runs on both Windows and Linux.

Current State of GPUs

Today, graphics processing units (GPUs) exist in many different forms:

• Dedicated or discrete graphics cards are the most powerful form and are usually connected through a PCIe slot. Although these units have their own dedicated memory, as well as access to memory shared with the CPU, they cannot replace the CPU. A CPU paired with such a card also does not constitute a distributed memory system, which is a system of multiple CPUs that each have their own memory. Discrete graphics cards are found in PCs that require extensive dedicated graphics rendering power, such as workstations for scientific visualization and gaming PCs.

• Integrated graphics processing units (iGPUs) share RAM with the CPU and sit on the CPU die or near it, depending on the architecture. Integrated GPUs usually ship with notebook-style laptops to handle everyday graphics rendering tasks. At 2.6 trillion floating-point operations per second, the integrated graphics of Apple’s recently released M1 chip were described by Apple at launch as the fastest integrated graphics in a personal computer.

• Hybrid graphics processing units have earned their name by scoring in between dedicated and integrated GPUs in price and performance. An example is NVIDIA’s TurboCache, which shares main system memory.

• General Purpose Graphics Processing Units (GPGPUs) are GPUs used for general-purpose computation. They behave much like wide vector processors and run compute kernels designed for high throughput, which lets them support APIs such as CUDA and OpenCL, as well as OpenMP through its device offloading directives. CUDA was released on June 23, 2007, and NVIDIA’s G8x series and onwards (all their GPUs released after November 8, 2006) are GPGPU capable.

• External GPUs (eGPUs), as the name suggests, sit outside the housing of the computer in their own enclosure with their own power supply. They connect to the computer’s PCIe lanes, typically through a Thunderbolt 3 or 4 port. So, if you have a compatible computer and are looking for more graphics processing power, an eGPU can be a cost-saving solution.

History and Current State of CUDA

Arch NVIDIA.jpg

NVIDIA’s first CUDA-capable architecture was Tesla, introduced with the G80 series. Its successor, Fermi, is described by NVIDIA as follows: “NVIDIA’s Fermi GPU architecture consists of multiple streaming multiprocessors (SMs), each consisting of 32 cores, each of which can execute one floating-point or integer instruction per clock. The SMs are supported by a second-level cache, host interface, GigaThread scheduler, and multiple DRAM interfaces.” NVIDIA Fermi

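A CUDA core is only loosely comparable to a CPU core: it is a much simpler execution unit that performs the arithmetic for one thread at a time and relies on its streaming multiprocessor for instruction scheduling and control.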

At the time of writing, NVIDIA’s latest CUDA architecture is Ampere, as implemented in the GA102 GPU: “Each SM in GA10x GPUs contain 128 CUDA Cores, four third-generation Tensor Cores, a 256 KB Register File, four Texture Units, one second-generation Ray Tracing Core, and 128 KB of L1/Shared Memory, which can be configured for differing capacities depending on the needs of the compute or graphics workloads. The memory subsystem of GA102 consists of twelve 32-bit memory controllers (384-bit total). 512 KB of L2 cache is paired with each 32-bit memory controller, for a total of 6144 KB on the full GA102 GPU.” NVIDIA Ampere GA102

As the processing power of GPUs grew beyond what was needed to render pixels on the screen, a method was needed to put the otherwise idle computing power to use. General Purpose Graphics Processing Units can perform both general-purpose and graphics computations. With the release of CUDA, all CUDA-enabled GPUs can perform general-purpose processing, with double-precision floating-point supported since compute capability 1.3.

History & Current State of CPUs

Winding back time to 1971, the Intel 4004 spawned the microprocessor revolution. Moore’s law had been coined a few years earlier, in 1965, and later came the move to multiple cores and the need for multi-threading on CPUs. These topics are all too familiar, so let us continue with brief details. The 4004 had 5120 bits of RAM and 32768 bits of ROM, and could add two eight-digit numbers in about half of a second. Many architectural leaps and bounds later, Intel’s Willow Cove features 80 KiB of L1 cache per core, 1.25 MB of L2 cache per core, and 3 MB of L3 cache per core. The L1 and L2 caches are private to each core, while the L3 cache is shared among the cores and sits between them and main memory. Having the correct driver on the operating system allows the CPU to talk to the GPU and assign it tasks as the user needs.

Architectural Differences

CPUs vs. GPUs

CPUs have faster clock speeds and can execute a wide array of general-purpose instructions used for controlling computers. At the time of their creation, GPUs could not do this. GPUs focus on having as many cores as possible, with a limited instruction set, to provide sheer computing power in specialized scenarios. This higher core count allows modern GPUs to achieve more floating-point operations per second than modern CPUs. It is highly debated whether CPUs will replace GPUs or the other way around; I believe neither will happen, because of the different roles these units take in any given system. A CPU is the physical and logical component that any operating system uses to run programs and manage hardware. If a computer has a screen, it needs a GPU. Which GPU should be used depends on the depth and number of layers of graphics rendering required.


The main architectural differences between AMD and NVIDIA GPUs are the architecture of the execution units, the cache hierarchy, and the graphics pipelines. Graphics pipelines are conceptual models used by GPUs to render 3D images on a 2D screen. Conceptually and theoretically these devices are quite different, but the practical uses and block diagrams of modern GPUs from both companies are quite similar. NVIDIA and AMD have different product lineups, audiences, and business visions. AMD offers CPUs and GPUs, while NVIDIA focuses on GPUs. Although AMD was founded in 1969, more than two decades before NVIDIA (1993), its market capitalization at the time of writing is 28% of its competitor’s. Intel’s market cap is 43% of NVIDIA’s, making NVIDIA the largest of the three big chip makers by market capitalization. Traditionally in the world of PC building, NVIDIA and Intel release the more expensive but highest-performing hardware, while AMD provides the best price-to-performance value.




Software Development Strategies

The following platforms consist of software programming models, frameworks, toolkits, and compilers released by major tech companies such as NVIDIA and Intel to maximize the performance of their hardware. The focus will be on NVIDIA’s CUDA. By creating the architectures, building hardware on those architectures, and abstracting software development through platforms that maximize the potential of that hardware, these companies have created complete computing systems that are extremely powerful.

Software Programming Models

Architectural models we have covered in this course include shared and distributed memory models; implementation models include SPMD and MPMD. These models abstract away from hardware and programming languages, resulting in hardware-agnostic and language-agnostic models that let developers focus on designing solutions that should work on various systems in various environments. Programming models are required in parallel programming to understand which solution to choose for a given problem. The complexities and dependencies in the solution help determine which programming model to use and, where feasible, which hardware is best suited. The diverse array of hardware available for parallel computing, and the complex systems constructed for various use cases, in turn require parallel programming models to be equally diverse and complex.

oneAPI is branded by Intel as “A Unified X-Architecture Programming Model”. Its applications for the CPU were covered in this course, but oneAPI is designed for diverse architectures and can be used to simplify programming across GPUs as well. The Intel oneAPI DPC++ (Data Parallel C++) compiler includes CUDA backend support, which was initially released almost two years ago. Considering that Intel oneAPI was officially released in December 2020, oneAPI seems to live up to its brand name.

Source Source

Software Programming Frameworks

Software programming frameworks are structures in which libraries reside. APIs expose the functionality of those libraries to application code.


oneAPI Threading Building Blocks

CUDA API: an extension of the C and C++ programming languages that allows for thread-level parallelism. It is a high-level abstraction that lets developers easily maximize the potential of CUDA-enabled devices. Through the CUDA API, parallel blocks of code are identified as kernels. A kernel is executed in parallel by CUDA threads. CUDA threads are grouped into blocks, and blocks are organized into grids.

CUDA code.png

The parameters inside CUDA’s special <<<…>>> syntax form the execution configuration of a kernel invocation: the number of blocks (here 1) and the number of threads per block (here N).
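The thread hierarchy and the launch configuration can be illustrated with a small CPU-side emulation. The sketch below is plain Python, not CUDA: `launch` and `vec_add` are hypothetical names, the index arguments mimic CUDA's built-in `blockIdx.x`, `blockDim.x`, and `threadIdx.x`, and the loops run one after another where a GPU would run the threads in parallel.

```python
def vec_add(block_idx, block_dim, thread_idx, a, b, c):
    """Body of one emulated 'thread': adds a single pair of elements."""
    # Same global index formula a CUDA kernel would typically use:
    # blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(c):  # guard: the last block may have spare threads
        c[i] = a[i] + b[i]


def launch(kernel, num_blocks, threads_per_block, *args):
    """Emulates kernel<<<num_blocks, threads_per_block>>>(args) sequentially."""
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


n = 8
a = list(range(n))
b = [10 * x for x in range(n)]

# One block of N threads, matching a <<<1, N>>> configuration.
c_single = [0] * n
launch(vec_add, 1, n, a, b, c_single)

# The same work split across 3 blocks of 3 threads (9 threads, 1 idle).
c_multi = [0] * n
launch(vec_add, 3, 3, a, b, c_multi)

print(c_single)
print(c_multi)
```

Both launches produce the same result, which is the point of the hierarchy: the grid can be reshaped to suit the hardware without changing the per-thread code, as long as the bounds guard is kept.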

Source Source

Common libraries contained in the CUDA Framework, accessible through the CUDA API:

cuBLAS - “the CUDA Basic Linear Algebra Subroutine library.”

cuFFT - “the CUDA Fast Fourier Transform library.”

Software Programming Toolkits

Where Frameworks are the structures with libraries containing code we access, toolkits are software programs written to help other software developers to develop programs and maintain code.

oneAPI HPC Toolkit

CUDA Toolkit: available for download from NVIDIA’s website. It includes extensions for IDEs that provide features such as syntax highlighting and easy creation of CUDA projects, similar to OpenMP, TBB, and MPI. Also included in the CUDA Toolkit are the CUDA compiler, the framework, the Nsight profiler for Visual Studio for monitoring application performance, and much more.

CUDA Application Domains

CUDA App Domains.jpg

“CUDA and Nvidia GPUs have been adopted in many areas that need high floating-point computing performance, as summarized pictorially in the image above. A more comprehensive list includes:

1. Computational finance

2. Climate, weather, and ocean modeling

3. Data science and analytics

4. Deep learning and machine learning

5. Defense and intelligence

6. Manufacturing/AEC (Architecture, Engineering, and Construction): CAD and CAE (including computational fluid dynamics, computational structural mechanics, design and visualization, and electronic design automation)

7. Media and entertainment (including animation, modeling, and rendering; color correction and grain management; compositing; finishing and effects; editing; encoding and digital distribution; on-air graphics; on-set, review, and stereo tools; and weather graphics)

8. Medical imaging

9. Oil and gas

10. Research: Higher education and supercomputing (including computational chemistry and biology, numerical analytics, physics, and scientific visualization)

11. Safety and security

12. Tools and management”


Use Cases for Parallel Computing with GPUs

Video Processing

My favourite use case to fall under this bucket: video games! Originally, the exclusive use case for the Graphics Processing Unit was… graphics processing. Otherwise known as pushing pixels to the screen, this is the behaviour gaming graphics cards have been optimized for since their inception. Interestingly, one of the most important calculations in graphics is matrix multiplication. Matrices are generic operators used to process graphic transformations such as translation, rotation, scaling, reflection, and shearing. Of course, video rendering techniques have evolved greatly in the last couple of decades, from 2D to 3D rendering methods and additional layers of graphical computation, resulting in continuously increasing realism.
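As a rough illustration of how matrices act as transformation operators, the sketch below builds 2D translation, scaling, and rotation matrices in homogeneous coordinates and composes them with matrix multiplication. It is plain Python running on the CPU; the helper names (`mat_mul`, `translation`, `scaling`, `rotation`, `apply`) are made up for this example and do not come from any graphics library.

```python
import math


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def translation(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]


def scaling(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]


def apply(transform, x, y):
    """Apply a 3x3 homogeneous transform to the 2D point (x, y)."""
    col = mat_mul(transform, [[x], [y], [1]])
    return col[0][0], col[1][0]


# Compose: scale by 2, then rotate 90 degrees, then translate by (5, 0).
# The rightmost matrix is applied to the point first.
combined = mat_mul(translation(5, 0),
                   mat_mul(rotation(math.pi / 2), scaling(2, 2)))

x, y = apply(combined, 1, 0)
print(round(x, 6), round(y, 6))
```

Because the three transformations collapse into one matrix, every vertex in a scene needs only a single matrix-vector product, and those products are independent of each other, which is the kind of workload a GPU parallelizes well.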

An average screen is 1920 by 1080 pixels and an average refresh rate is 60 frames per second, so updating every pixel on the screen takes 1920 × 1080 × 60 = 124,416,000 pixel updates per second, multiplied by the number of operations needed per pixel. From a computation standpoint that is not a very big number, especially considering that a 2 gigahertz CPU core can perform on the order of two billion operations per second, so that raises the question: why GPUs? The answer is that in intense graphics rendering scenarios each pixel needs many operations, so the total can far exceed two billion operations per second. However, these are simple operations such as matrix multiplication. This makes the role of the GPU clear: to perform massive amounts of simple operations in parallel. Because a GPU can perform so many calculations per clock cycle, its clock speeds do not need to be as high. Source Source Source
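The arithmetic behind these figures can be checked with a few lines of Python. The two-billion-operations figure is the rough single-core estimate used in the text, and the per-pixel budget is a hypothetical quantity introduced here only to show how quickly a single core runs out of headroom.

```python
WIDTH, HEIGHT = 1920, 1080          # screen resolution from the text
REFRESH_HZ = 60                     # frames per second from the text
CPU_OPS_PER_SECOND = 2_000_000_000  # rough 2 GHz single-core estimate

pixels_per_frame = WIDTH * HEIGHT
pixel_updates_per_second = pixels_per_frame * REFRESH_HZ

# Hypothetical budget: how many operations per pixel a single 2 GHz core
# could afford before it falls behind the refresh rate.
ops_budget_per_pixel = CPU_OPS_PER_SECOND / pixel_updates_per_second

print(pixels_per_frame)                # 2073600
print(pixel_updates_per_second)        # 124416000
print(round(ops_budget_per_pixel, 1))  # 16.1
```

A budget of about 16 operations per pixel is used up by even a modest amount of per-pixel work, so a heavy rendering workload quickly exceeds what one CPU core can deliver.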

Modern GPUs have dedicated physical units for tasks such as 3D rendering, copying, video encoding and decoding, and more. As you may expect, 3D rendering is the process of converting 3D models into 2D images on a computer; video encoding occurs on outbound data and video decoding on inbound data. For small tasks such as video calls and watching YouTube videos, integrated graphics suffice, which is why they usually ship with notebook-style laptops. For larger tasks such as 4K video editing or 4K gaming, dedicated hardware is required.

Real Time

3D rendering often occurs in real time, especially in video games and simulations such as those that occur in software development for robotic applications.

Non-Real Time or Pre-Rendering

Any time we watch or upload a video through a computer, whether it is stored on the device or being streamed over a network, data containing the video must be encoded before it is stored or transmitted and decoded before it can be displayed. This is where pre-rendering occurs.

Machine Learning

Machine learning has gained extreme popularity in recent years due to a combination of factors including big data, computational advances, and cloud business models. Every industry can benefit from machine learning and artificial intelligence technologies, whether through data analysis to improve operational quality or through embedded robotic systems that automate physical tasks.



Science & Research

CUDA helps graphics processing for scientific applications including visualization of galaxies in astronomy, and molecular visualization for biology and medicine.

Use Case Implementation

Machine Learning (TensorFlow)

The following models were tested on a CUDA-enabled GPU and on an Apple M1 with its Neural Engine. The Python TensorFlow implementation abstracts away from hardware and allows the same code to be run on various devices such as these. The code for these models resides in the TensorFlow Model Garden.

TensorFlow Machine Learning Algorithms Tested:

BERT (Bidirectional Encoder Representations from Transformers) Models Test

Orbit Standard Runner Test

Image Classifier Trainer Util Test


“NVIDIA's GTX 1080 does around 8.9 teraflops” Source

Apple M1 does around 2.6 teraflops Source

8.9 / 2.6 ≈ 3.42

1.136666667 / 0.142666667 ≈ 7.97

Although the GTX 1080 delivers around 9 teraflops, about three and a half times the peak throughput of the Apple M1, the M1 consistently performed about 8 times faster on the TensorFlow machine learning tests. This is most likely because the GTX 1080 is a Graphics Processing Unit meant for gaming, while the Apple M1 chip has a dedicated 16-core Neural Engine for machine learning. The test results suggest that this dedicated neural component greatly accelerates machine learning, as Apple claims. Unfortunately, an NVIDIA Volta or Turing GPU was not available for testing in this scenario. Volta and Turing are NVIDIA’s architectures with machine learning hardware, and NVIDIA offers a wide range of devices for machine learning, from embedded devices to datacenter and cloud. These architectures include Tensor Cores, which are dedicated to machine learning workloads and greatly increase efficiency.
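The gap between the peak-throughput prediction and the measurement can be made explicit with a short calculation. This is only a sketch: the two measured values are the ones whose ratio is reported above, and treating them as averaged timings (larger means slower) is an assumption of this example, as are all variable names.

```python
GTX_1080_TFLOPS = 8.9   # peak figure quoted above
M1_TFLOPS = 2.6         # peak figure quoted above

# The two values whose ratio is reported above; treated here as averaged
# timings (assumption), so a larger number means a slower device.
GTX_1080_MEASURED = 1.136666667
M1_MEASURED = 0.142666667

peak_ratio = GTX_1080_TFLOPS / M1_TFLOPS
measured_ratio = GTX_1080_MEASURED / M1_MEASURED

# How far the measurement departs from what peak FLOPS alone would predict.
swing = peak_ratio * measured_ratio

print(f"GTX 1080 peak advantage: {peak_ratio:.2f}x")
print(f"M1 measured advantage:   {measured_ratio:.2f}x")
print(f"Prediction vs. result:   {swing:.1f}x apart")
```

The size of the swing is a reminder that peak teraflops describe general floating-point throughput, not how well a device runs a particular machine learning workload.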

Source Source

CUDA Performance Testing


The following CUDA Matrix Multiplication Code was used for all CUDA Matrix Multiplication tests:


CUDA Host Code.png


CUDA Device Header Code.png


CUDA Device Code.png

OpenCL Code

The following OpenCL Matrix Multiplication Code was used for all Matrix Multiplication tests on an OpenCL compatible GPU (same GPU for CUDA tests):


OpenCL Complier Directives.png


OpenCL Host Code.png


OpenCL Device Code.png

Matrix Multiplication – CUDA vs. OpenMP

The OpenMP Matrix Multiplication solution from Workshop 3 was used for testing in this scenario.


For this test, the CUDA code was run on a CUDA-enabled GPU and the OpenMP code was run on the CPU. A truer test of CUDA vs. OpenMP would be to have the OpenMP code run on the GPU as well. Given that matrix multiplication is well suited to GPUs and that CUDA is optimized for NVIDIA GPUs, this is an unfair test, with CUDA crushing OpenMP.
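Why matrix multiplication suits the GPU so well can be sketched in a few lines. The code below is plain Python, not the CUDA or OpenMP code that was tested, and the function names are made up for this example. It separates the work of a single output element from the loops that visit every element, which is the split a GPU kernel makes when it assigns one element to one thread.

```python
def matmul_element(a, b, row, col):
    """Work of one emulated thread: a single dot product."""
    return sum(a[row][k] * b[k][col] for k in range(len(b)))


def matmul(a, b):
    """Naive matrix multiplication built from independent element tasks."""
    rows, cols = len(a), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    # Each (row, col) pair is independent of every other pair, so on a GPU
    # each could be handled by its own thread; here they run one by one.
    for row in range(rows):
        for col in range(cols):
            c[row][col] = matmul_element(a, b, row, col)
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))  # [[19, 22], [43, 50]]
```

For an N by N result there are N² such independent tasks, which a GPU can spread across thousands of threads, while a CPU has to share them among a handful of cores.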

Matrix Multiplication - CUDA vs OpenCL


As covered earlier, CUDA is proprietary to NVIDIA and only works on CUDA-enabled NVIDIA GPUs. This is not the case for OpenCL, which is an open standard and runs on a wide variety of GPUs and CPUs. The expected result is therefore that CUDA will outperform OpenCL on any given CUDA-enabled device. Interestingly, with a small array size OpenCL seems to outperform CUDA, presumably because of parallel overhead. Memory access optimization is always critical in High Performance Computing and must always be considered: memory access can make or break the efficiency of an algorithm.

CUDA Limitations

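Main limitations of CUDA: Recursive functions are not supported on the earliest devices (recursion in device code requires compute capability 2.0 or higher). Threads are scheduled in warps of 32, so the smallest efficient unit of work is a block of 32 threads.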
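The 32-thread granularity comes from the warp, the group of threads that the hardware schedules together. A minimal sketch of its effect on block sizes follows; the function names are hypothetical and the warp size of 32 is the value NVIDIA GPUs have used to date.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs to date


def warps_needed(threads_per_block):
    """Number of warps a block occupies (ceiling division)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def idle_lanes(threads_per_block):
    """Warp lanes allocated to the block but left without a thread."""
    return warps_needed(threads_per_block) * WARP_SIZE - threads_per_block


for size in (1, 32, 33, 256):
    print(size, warps_needed(size), idle_lanes(size))
```

A block of 33 threads occupies two warps and leaves 31 lanes idle, which is why block sizes are usually chosen as multiples of 32.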


Because CUDA is proprietary, it is limited in the number of devices it supports and the systems it can extend to. This same proprietary nature allows NVIDIA to tune performance fully for its own hardware, which is why CUDA is expected to outperform OpenCL on any given CUDA-enabled device.


CUDA pt 1.gif CUDA pt 2.gif CUDA pt 3.gif