Sirius

=== Vehicle detection and tracking (Rosario A. Cali) ===
For me, the most important thing is to solve the problem regardless of the tools used, and I think that reimplementing everything from scratch using OpenCV and CUDA is a viable solution.
=== Source Code for Vehicle Detection ===
<syntaxhighlight lang="cpp">
void detect_vehicles()
{
    for (unsigned int i = 0; i < files.size(); i++)
    {
        // Load one image at a time and display it
        load_image(img, files[i]);
        win.set_image(img);

        // Run the detector on the image and show the output
        for (auto&& d : net(img))
        {
            auto fd = sp(img, d);
            rectangle rect;

            for (unsigned long j = 0; j < fd.num_parts(); ++j)
                rect += fd.part(j);

            if (d.label == "rear")
                win.add_overlay(rect, rgb_pixel(255, 0, 0), d.label);
            else
                win.add_overlay(rect, rgb_pixel(255, 255, 0), d.label);
        }

        // Clear the overlay
        dlib::sleep(1000);
        win.clear_overlay();
    }
}
</syntaxhighlight>
=== Box Blur on an Image Using the OpenCV C++ Library (Max Fainshtein) ===
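As a reference point for the serial version of the box blur, a minimal OpenCV implementation might look like the sketch below. The file names and the 7x7 kernel size are illustrative assumptions, not the project's actual parameters.
<syntaxhighlight lang="cpp">
#include <opencv2/opencv.hpp>
#include <iostream>

int main(int argc, char** argv)
{
    // Load the input image (the default path is a placeholder for illustration)
    cv::Mat src = cv::imread(argc > 1 ? argv[1] : "input.png");
    if (src.empty())
    {
        std::cerr << "Could not open the input image" << std::endl;
        return 1;
    }

    // Box blur: every output pixel is the mean of a k x k neighbourhood
    const int k = 7;  // assumed kernel size
    cv::Mat dst;
    cv::blur(src, dst, cv::Size(k, k));

    cv::imwrite("blurred.png", dst);
    return 0;
}
</syntaxhighlight>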
=== Assignment 3 ===
Upon using Nvidia's Visual Profiler, we realized that our kernel implementation had made massive improvements compared to the serial version, but after analyzing the Assignment 2 version we noticed that we could still improve our kernel even further.
<br><br>
Problem:
----
Nvidia's Visual Profiler showed that the kernels were executing concurrently, but the percentage of concurrency was quite low and the Streaming Multiprocessors were not being used to their maximum capability.
<br><br>
Solution:
----
One way to address low compute utilization is to increase the occupancy of each SM. According to CUDA's occupancy calculator, the machine we were using for testing had a compute capability of 6.1, which means each SM supports 32 resident blocks and 2048 resident threads. To achieve maximum occupancy we would therefore use 2048 / 32 = 64 threads per block. To determine an appropriate grid size, we divide the total number of pixels by the 64 threads per block. This lets us initiate the thread count based on the compute capability of the CUDA device and size the grid dynamically depending on the image passed in.
<br><br>
<syntaxhighlight lang="cpp">
int iDevice;
cudaDeviceProp prop;
cudaGetDevice(&iDevice);
cudaGetDeviceProperties(&prop, iDevice);

// Resident threads and blocks per SM depend on the device's compute capability
int resident_threads = prop.maxThreadsPerMultiProcessor;
int resident_blocks = 8;
if (prop.major >= 3 && prop.major < 5)
{
    resident_blocks = 16;
}
else if (prop.major >= 5 && prop.major <= 6)
{
    resident_blocks = 32;
}

// Determine the number of threads per block
dim3 blockDims(resident_threads / resident_blocks, 1, 1);

// Calculate the grid size to cover the whole image (rounded up so no pixels are missed)
dim3 gridDims((pixels + blockDims.x - 1) / blockDims.x);
</syntaxhighlight>
This resulted in a compute utilization increase from 33% to close to 43%, but unfortunately it did not yield much of an improvement.
<br><br>
The number of blocks in the grid was then recalculated based on the size of the image and the new threads-per-block value, so the grid scales with the image passed in.
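As a rough sketch of how this launch configuration could come together, the example below pairs the device query with a hypothetical single-channel box blur kernel. The kernel name, signature, and data layout are assumptions for illustration, not the project's actual code, and the grid size is rounded up so the whole image is covered.
<syntaxhighlight lang="cpp">
#include <cuda_runtime.h>

// Hypothetical single-channel box blur kernel, used only to illustrate the
// launch configuration; the project's real kernel differs.
__global__ void boxBlurKernel(unsigned char* out, const unsigned char* in,
                              int width, int height, int radius)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int pixels = width * height;
    if (idx >= pixels) return;

    int x = idx % width;
    int y = idx / width;
    int sum = 0, count = 0;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx)
        {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height)
            {
                sum += in[ny * width + nx];
                ++count;
            }
        }
    out[idx] = static_cast<unsigned char>(sum / count);
}

void launchBoxBlur(unsigned char* d_out, const unsigned char* d_in,
                   int width, int height, int radius)
{
    // Query the device to size the blocks as described above
    int iDevice;
    cudaDeviceProp prop;
    cudaGetDevice(&iDevice);
    cudaGetDeviceProperties(&prop, iDevice);

    int resident_threads = prop.maxThreadsPerMultiProcessor;
    int resident_blocks = 8;
    if (prop.major >= 3 && prop.major < 5)
        resident_blocks = 16;
    else if (prop.major >= 5 && prop.major <= 6)
        resident_blocks = 32;

    int pixels = width * height;
    dim3 blockDims(resident_threads / resident_blocks, 1, 1);
    // Ceiling division so the grid still covers images whose pixel count is
    // not an exact multiple of the block size
    dim3 gridDims((pixels + blockDims.x - 1) / blockDims.x, 1, 1);

    boxBlurKernel<<<gridDims, blockDims>>>(d_out, d_in, width, height, radius);
    cudaDeviceSynchronize();
}
</syntaxhighlight>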