PIL Cuda

From CDOT Wiki
Revision as of 15:24, 10 December 2014 by Gabriel Castro Londono (talk | contribs) (Pillow and Guassian Blur in CUDA)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search


GPU610/DPS915 | Student List | Group and Project Index | Student Resources | Glossary

PIL Cuda

Team Members

  1. Gabriel Castro

Pillow and Guassian Blur in CUDA

Pillow

Pillow is an imaging processing is a a library written in python with much of the processing done in C. Pillow has many filters that can be applied to an image to change or distort it. For this project I chose to optimize the Gaussian Blur filter, because after some profiling I determined that it is the single filter with the highest run time.

sample

#!/usr/bin/env python
from PIL import Image, ImageFilter
im = Image.open("test.png")
filtered = im.filter(ImageFilter.GaussianBlur(5))
filtered.save("out.png")

The Gaussian Blur

The Gaussian Blur filter is an algorithm applied to images to make them blurry and/or reducing noise. The algorithm works on every pixel and colour channel (RGBA) in the image individually. For any given pixel (x,y) take the weighted average of all the pixels in a given radius (r).

Given an image of size `x * y` with `c` channels and a blur radius of `r` pixels.

There are `x * y * c * (r * r + 1)` floating point operations.

This means a change in any dimension will result in an exponential increase.

ex. a 800x600 RGB image with a 5px blur radius will have 600*800*3*(5*5+1) = 37,440,000 floating point ops

optimization 1

We can actually reduce the number of radius based operations by process sing the image in 2 steps.

1. Blur every pixel but only along the `x` axis, and store in a temporary buffer.

2. Blur every pixel in the temporary buffer along the `y` axis, store in result buffer.


CUDA with Python setup tools

Since Pillow is a python library that uses cython built via setup tools we need to find a way to inject the nvcc compiler for CUDA. I was able to get everything working by modifying a script I found online, see the final version here

and adding the this code to the existing build script


_LIB_IMAGING_CUDA = ["UnsharpMaskCuda"]
have_cuda = False
CUDA = None
try:
    import setup_cuda
    CUDA = setup_cuda.CUDA
    have_cuda = True
except EnvironmentError as e:
    print('CUDA not found')
    print(e)
if have_cuda:
    defs.append(("HAVE_CUDA", 1))
    setup_cuda.customize_compiler_for_nvcc(self.compiler)
    for f in _LIB_IMAGING_CUDA:
        files.append('libImaging/' + f + '.cu')
    libs.append('cudart')
    libs.append('stdc++')
    _add_directory(self.compiler.include_dirs, 'libImaging/')


NOTE.
This requires $CUDAHOME to be set to /usr/local/cuda or appropriate
 and PATH to include $CUDAHOME/bin
 and LD_LIBRARY_PATH to include $CUDAHOME/lib64
 * tested on Ubuntu 14.04 with CUDA 6.5


Pillow C Api

The Pillow library has a C api that can for the purposes of this post be described with the following code.

struct _Imaing {
    int xsize;
    int ysize;
    // y-sized array of pointers to x-sized arrays of (UINT8) pixel data
    UINT8** image8;
}
typedef _Imaging* Imaging;
Imaging gblur(Imaging im, Imaging imOut, float floatRadius);
This is rather simplified, as the `Imaging` struct contains many more properties and we're taking advantage of the fact that the library separates an incoming image into separate `Imaging` structs for each colour channel.

Gaussian Blur in CUDA

While all the code is available on github https://github.com/GabrielCastro/Pillow this is a simplified psudo code version of the blur process

__global__ static void blurRows(const px8_t* __restrict__ in, float* __restrict__ buff,
    const size_t xSize, const size_t ySize,
    const float* __restrict__ mask,  const size_t radius) {

    const size_t y = blockIdx.y * blockDim.y + threadIdx.y;
    const size_t x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= ySize || x >= xSize) {
        return;
    }
    const size_t buffIdx = y * xSize + x;

    float sum = 0;

    for (size_t p = 0; p < radius; ++p) {
        float maskVal = mask[p];
        int offset = (int)(-((float)radius / 2.0) + (float)p + 0.5);
        int xOff = x + offset;
        if (xOff < 0) {
            offset = -x;
        } else if (xOff >= xSize) {
            offset = xSize - x - 1;
        }
        size_t pxIndex = buffIdx + offset;
        sum += in[pxIndex] * maskVal;
    }

    buff[buffIdx] = sum;
}


__global__ static void blurCols(const float* __restrict__ buff, px8_t* __restrict__ out,
    const size_t xSize, const size_t ySize,
    const float* __restrict__ mask,  const size_t radius) {

    const size_t y = blockIdx.y * blockDim.y + threadIdx.y;
    const size_t x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= ySize || x >= xSize) {
        return;
    }

    size_t outIdx = y * xSize + x;
    float sum = 0;

    for (size_t p = 0; p < radius; ++p) {
        float maskVal = mask[p];
        int offset = (int)(-((float)radius / 2.0) + (float)p + 0.5);
        int lOff = y + offset;
        if (lOff < 0) {
            offset = -y;
        } else if (lOff >= ySize) {
            offset = ySize - y - 1;
        }

        size_t buffIdx = outIdx + offset * xSize;
        sum += buff[buffIdx] * maskVal;
    }
    out[outIdx] = (px8_t)CLIP(sum);
}


gblur(Imaging in, Imagin out, int radius) {
    UINT8* d_img =  // allocate in->xsize * in->ysize on device
    float* d_buff = // allocate a tmp buffer on device of size in->ysize * in->xsize
    float* d_mask = // allocate and create a guassian blur mask on device

    // copy the image into the device
    size_t rowSize = in->xsize * sizeof(UINT8);
    for (int i = 0; i < in->ysize; ++i) {
        cudaMemcpyAsync(d_img + i * rowSize, in->image8[i], rowSize, cudaMemcpyHostToDevice, 0);
    }
    size_t xBlocks = (in->xsize + ntpb - 1) / ntpb;
    size_t yBlocks = (in->ysize + ntpb - 1) / ntpb;

    dim3 grid(xBlocks, yBlocks);
    dim3 block(ntpb, ntpb);

    blurRows<<<grid,block>>>(d_img, d_buff, in->xsize, in->ysize, d_mask, radius);
    blurCols<<<grid,block>>>(d_buff, d_img, in->xsize, in->ysize, d_mask, radius);

    for (int i = 0; i < in->ysize; ++i) {
        cudaMemcpyAsync(out->image8[i], d_img + i * rowSize, rowSize, cudaMemcpyDeviceToHost, 0);
    }
    cudaDeviceSynchronize();
    cudaFree(d_img);
    cudaFree(d_buff);
    cudaFree((void*) d_mask);
}