
CDOT Wiki β

GPU621/Threadless Horsemen

Revision as of 18:09, 25 November 2018 by Tsarkarcd (talk | contribs) (Comparing Multi-threading in Julia vs OpenMP)

Comparing Multi-threading in Julia vs OpenMP

Group Members

  1. Tanvir Sarkar
  2. Nathan Misener

Introduction: The Julia Programming Language

Bit of History

  • It's a relatively new language and runtime (development started in 2009; it launched publicly in 2012).
  • The 1.0 release happened in August 2018.
  • It's ambitious in the sense that it aims to be very good in several areas at once.
  • It was originally conceived by Stefan Karpinski, a CS grad student working on simulating wireless networks.

" STEFAN KARPINSKI WAS building a software tool that could simulate the behavior of wireless networks, and his code was a complete mess. But it wasn't his fault. ... He knew how to build software. The problem was that in order to build his network simulation tool, he needed four different programming languages. No single language was suited to the task at hand, but using four languages complicated everything from writing the code to debugging it and patching it. ... Today's languages were each designed with different goals in mind. Matlab was built for matrix calculations, and it's great at linear algebra. The R language is meant for statistics. Ruby and Python are good general purpose languages, beloved by web developers because they make coding faster and easier. But they don't run as quickly as languages like C and Java. What we need, Karpinski realized after struggling to build his network simulation tool, is a single language that does everything well."

More info: the developers' blog post "Why We Created Julia" explains their motivation.


  • MATLAB is often used for matrix calculations and linear algebra
  • R is used for statistical computing
  • Python is used for data science, scripting, web development
  • These languages provide great libraries and developer efficiency thanks to their design, but they run slower than C.

Julia's syntax resembles Python's, and it is JIT-compiled to native machine code thanks to LLVM (Low-Level Virtual Machine). It aims to provide great developer efficiency as well as fast execution times. Within the Julia community, the language is sometimes referred to as a potential MATLAB killer: it's open source, its programs execute faster, and its syntax stays simple.
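As a quick illustration of the JIT pipeline (a minimal sketch; any Julia 1.x works), you can ask Julia to show the LLVM IR it generates for a function:

```julia
# Define a tiny function; Julia compiles a specialized native version
# the first time it's called with a given argument type.
f(x) = 2x + 1

f(3)                 # first call triggers JIT compilation for Int
@code_llvm f(3)      # prints the LLVM IR Julia generated for f(::Int)
```

Calling `@code_llvm f(2.5)` instead would show a different specialization, compiled for `Float64` — this per-type specialization is what lets dynamically-typed Julia code run at near-C speeds.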

Use Cases

Various organizations have used Julia in their projects, in a variety of fields (Machine Learning, Bioinformatics, Astronomy, Insurance, Energy, Robotics, etc.).


Small Excerpt from Racefox (digital sports coaching company):

"Racefox’s classification and motion analysis algorithms were first implemented in C++ to ensure fast execution. Initially, the team planned to send raw accelerometer data from the mobile device to the cloud for processing. ... Modifying the existing C++ code was proving too slow owing to the brittleness of the language, verbosity, and the need to go through lengthy modify-compile-run cycles. Racefox’s initial attempt to reimplement their algorithms in Matlab did not work very well. It involved a great deal of signal processing with for-loops and conditionals, something that Matlab was particularly bad at. The whole modify-run-analyze cycle was incredibly slow on Matlab, limiting the scope of experimentation. ... In about a week, they had a Julia implementation up and running that was orders of magnitude faster than their Matlab implementation. This was back in the Julia 0.2 days. Even as a young language, Julia was proving indispensable. Racefox was able to (very happily) abandon their C++ implementation and focus on one code base for both experimentation and production."

More Use Cases:

Julia's Forms of Parallelism


Multi-Threading (Experimental)

  • Experimental in Julia (as of 1.0)
  • Implements the fork-join pattern (the same pattern we've used throughout the course)
  • We focused on this in our quantitative testing, since at the time of writing the code we only had experience with OpenMP
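A minimal fork-join sketch using Julia's threading macro (this assumes JULIA_NUM_THREADS was set before launching Julia):

```julia
# Each loop iteration is distributed across the available threads;
# the loop acts as the fork, and execution joins when all iterations finish.
ids = zeros(Int, 8)
Threads.@threads for i in 1:8
    ids[i] = Threads.threadid()   # record which thread ran this iteration
end
println(ids)
```

With JULIA_NUM_THREADS=4, the printed array shows iterations split into contiguous chunks, one chunk per thread id.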


Multi-core Or Distributed Processing

"Most modern computers possess more than one CPU, and several computers can be combined together in a cluster. ... Julia provides a multiprocessing environment based on message passing to allow programs to run on multiple processes in separate memory domains at once."

  • Implements message passing
  • Similar to MPI, but with some key differences

Message passing in Julia is one-sided: the programmer explicitly manages only one process in a two-process operation.

"Distributed programming in Julia is built on two primitives: remote references and remote calls. A remote reference is an object that can be used from any process to refer to an object stored on a particular process. A remote call is a request by one process to call a certain function on certain arguments on another (possibly the same) process."

Unlike explicitly sending and receiving messages as in MPI, in Julia you invoke a remote call, which is asynchronous.

  • you pass in the function, the id of the process to execute it, and the function's arguments
  • a Future is returned, which is a remote reference to the memory location associated with the process executing the function
  • you can call fetch on the Future to get the result (Introduction to Julia and its Parallel Implementation, 2:00)
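The steps above can be sketched with Julia's standard Distributed API (`addprocs`, `remotecall`, `fetch`); the arithmetic here is just a placeholder workload:

```julia
using Distributed

addprocs(1)                  # spawn one worker process (it gets id 2)

# remotecall is asynchronous: it returns a Future immediately,
# while worker 2 computes 1 + 2 in its own memory domain.
fut = remotecall(+, 2, 1, 2)

result = fetch(fut)          # block until the result is available
println(result)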

Coroutines (Green Threads)


  • OS processes are instances of a program being executed; each one has its own address space
  • OS threads are essentially lightweight processes, which share their process's address space but maintain their own stacks (a process can have many threads)
  • Fibers are a newer concept: essentially lightweight threads which have their own stacks and start/yield (stop) during thread execution (a thread can have many fibers)
  • Coroutines and fibers are functionally equivalent


"Fibers implement user space co-operative multitasking, rather than kernel level pre-emptive one. Thus, a fiber can not be forcefully pre-empted by the OS kernel ... Fibers do have their own stacks, but the fiber switching is done in user space by the execution environment, not the OS kernel generic scheduler. ... Fibers are a concurrency concept, but are not truly parallel. Typically, each fiber has a parent thread, just as each thread belongs to a process. Multiple fibers from the same thread can not run simultaneously. In other words, multiple execution paths can coexists in the form of fibers (i.e. concurrency) but they can not actually run at the same time (parallelism). ... Using a fiber library can be cumbersome, and thus some programming languages introduce coroutines. The two concepts are functionally equivalent. However, coroutines are implemented with specific syntax on the programming language level."

OpenMP vs Julia Code

For Multi-threading

  • You need to install the Julia runtime and set the JULIA_NUM_THREADS environment variable, just like you set OMP_NUM_THREADS
  • Julia code is more concise than C/C++ (it's also simpler to initialize arrays, and there's no manual memory management)
  • Note that the OpenMP code uses loop interchange (ikj) and the Julia code doesn't (ijk), which is explained in the next section
OpenMP:
#include <omp.h>
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <chrono>

#define N 512
typedef double array[N];

// matmul returns the product c = a * b
void matmul(const double a[][N], const double b[][N], double c[][N], int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

// checksum returns the sum of the coefficients in matrix x[N][N]
double checksum(const double x[][N]) {
    double sum = 0.0;

    // reduction avoids a data race on the shared accumulator
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            sum += x[i][j];
        }
    }
    return sum;
}

// initialize initializes matrix a[N][N]
void initialize(double a[][N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i][j] = static_cast<double>(i * j) / (N * N);
        }
    }
}

void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(span);
    std::cout << msg << ": " <<
        ms.count() << " milliseconds" << std::endl;
}

int main(int argc, char** argv) {
    if (argc != 1) {
        std::cerr << "*** Incorrect number of arguments ***\n";
        std::cerr << "Usage: " << argv[0] << "\n";
        return 1;
    }
    std::cout << std::fixed << std::setprecision(2);

    // allocate memory for matrices
    double* a = new double[N*N];
    double* b = new double[N*N];
    double* c = new double[N*N];
    array* aa = (array*)a;
    array* bb = (array*)b;
    array* cc = (array*)c;
    std::cout << "Matrix Multiplication " << N << " by " << N << std::endl;

    // initialize a and b; zero out c
    std::chrono::steady_clock::time_point ts, te;
    ts = std::chrono::steady_clock::now();
    initialize(aa);
    initialize(bb);
    double* t = c;
    for (int i = 0; i < N * N; i++)
        *t++ = 0.0;
    te = std::chrono::steady_clock::now();
    reportTime("initialization", te - ts);

    // multiply matrices a and b
    ts = std::chrono::steady_clock::now();
    matmul(aa, bb, cc, N);
    te = std::chrono::steady_clock::now();
    reportTime("execution", te - ts);
    std::cout << "checksum = " << checksum(cc) << std::endl;

    // deallocate memory
    delete[] a;
    delete[] b;
    delete[] c;

    std::cout << "max threads: " << omp_get_max_threads() << std::endl;
    return 0;
}
Julia Multi-threading:

N = 512

function matmul(a, b, c, n)
    Threads.@threads for i = 1:n
        for j = 1:n
            for k = 1:n
                c[i,j] += a[i,k] * b[k,j]
            end
        end
    end
    return c
end

function checksum(x)
    # per-thread partial sums avoid a data race on a shared accumulator
    partial = zeros(Float64, Threads.nthreads())

    Threads.@threads for i = 1:N
        for j = 1:N
            partial[Threads.threadid()] += x[i,j]
        end
    end
    return sum(partial)
end

# creates and returns a 2d array with a[i,j] = (i-1)(j-1) / N^2
function initialize()
    a = zeros(Float64, N, N)

    Threads.@threads for i = 1:N
        for j = 1:N
            a[i,j] = convert(Float64, (i-1) * (j-1)) / convert(Float64, N * N)
        end
    end
    return a
end

a = initialize();
b = initialize();
c = zeros(Float64, N, N);   # zero-initialized, matching the OpenMP version

start_t = time()
d = matmul(a, b, c, N)
end_t = time()
run_time = end_t - start_t

chk = checksum(d)
println("checksum = $chk")
println("matmul func time (s): $run_time")
println("num threads: $(Threads.nthreads())")

More code here:

OpenMP vs Julia Results

  • add graphs
  • recap loop interchange benefits for OpenMP (locality of reference)
  • discuss Julia storing arrays as column-major; loop interchange was worse for Julia
  • discuss different levels of optimization
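The column-major point can be demonstrated with a small sketch (the function names are ours; timings vary by machine, so this only contrasts the two loop orders):

```julia
# Julia stores 2-D arrays column-major: A[i, j] and A[i+1, j] are adjacent
# in memory. Making the ROW index the inner loop keeps accesses contiguous.
A = rand(500, 500)

function sum_col_major(A)   # j outer, i inner: sequential memory access
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

function sum_row_major(A)   # i outer, j inner: strided access, worse locality
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end
```

This is the mirror image of C/C++, where arrays are row-major — which is why the ikj interchange that helped the OpenMP version hurt the Julia version.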


Conclusion

  • Julia is a very new language compared to C++, MATLAB, and Python
  • It's mostly used in scientific computing, but designed to be good at many things
  • It's JIT-compiled to native code thanks to LLVM: faster than interpreted languages like Python, slower than ahead-of-time-compiled languages like C++
  • Although slower than C++, it offers simpler syntax (similar to Python's)
  • The compiler takes care of optimization tasks for you, so you worry less about locality of reference (loop interchange) or vectorization
  • Multi-threading is still experimental; distributed processing or coroutines (green threads) are the recommended forms of parallelism