
Sirius

1,698 bytes added, 10:39, 9 April 2018
Assignment 3
For me, the most important thing is to solve the problem regardless of the tools used, and I think that reimplementing everything from scratch using OpenCV and CUDA is a viable solution.
=== Source Code for Vehicle Detection ===
<syntaxhighlight lang="cpp">
void detect_vehicles()
{
    for (unsigned int i = 0; i < files.size(); i++)
    {
        // Load one image at a time and display it
        load_image(img, files[i]);
        win.set_image(img);

        // Run the detector on the image and show the output
        for (auto&& d : net(img))
        {
            auto fd = sp(img, d);
            rectangle rect;

            for (unsigned long j = 0; j < fd.num_parts(); ++j)
                rect += fd.part(j);

            if (d.label == "rear")
                win.add_overlay(rect, rgb_pixel(255, 0, 0), d.label);
            else
                win.add_overlay(rect, rgb_pixel(255, 255, 0), d.label);
        }

        // Clear the overlay
        dlib::sleep(1000);
        win.clear_overlay();
    }
}
</syntaxhighlight>
=== Box Blur on an image using opencv C++ Library (Max Fainshtein) ===
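The OpenCV source for this section is not included in this revision. As a sketch of what a box blur computes, here is a plain C++ equivalent for a single-channel image: each output pixel is the unweighted average of a 3x3 window around it. This version clamps coordinates at the image borders for simplicity; OpenCV's cv::blur uses reflected borders by default, so results differ slightly at the edges.

```cpp
#include <vector>
#include <algorithm>

// 3x3 box blur on a single-channel image stored row-major in a vector.
// Border pixels are handled by clamping coordinates into the image
// (cv::blur's default border mode is reflection, not clamping).
std::vector<float> boxBlur3x3(const std::vector<float>& src, int w, int h)
{
    std::vector<float> dst(src.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int sx = std::clamp(x + dx, 0, w - 1);
                    int sy = std::clamp(y + dy, 0, h - 1);
                    sum += src[sy * w + sx];
                }
            }
            dst[y * w + x] = sum / 9.0f;  // average of the 3x3 window
        }
    }
    return dst;
}
```

With OpenCV itself the whole loop collapses to a single call such as cv::blur(src, dst, cv::Size(3, 3)), which is also the form that maps naturally onto a CUDA kernel: one thread per output pixel, each computing the same windowed average.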
=== Assignment 3 ===
Upon using Nvidia's Visual Profiler, it was evident that our kernel made massive improvements compared to the serial version, but after analyzing the Assignment 2 version we noticed that we could try to improve our kernel even further.
<br><br>
Problem:
----
The kernels had been executing concurrently, but Nvidia's Visual Profiler showed that we were not using the Streaming Multiprocessors to their maximum capability; the percentage of compute utilization was quite low.
<br><br>
Solution:
----
One way to address low compute utilization is to increase the occupancy of each SM. According to CUDA's Occupancy Calculator, the machine we were using for testing had a compute capability of 6.1. This means that each SM supports 32 resident blocks and 2048 resident threads. To achieve maximum occupancy, you would use 2048 / 32 = 64 threads per block. To determine an appropriate grid size, we divide the total number of pixels by the 64 threads per block. This allows us to use dynamic grid sizing depending on the compute capability of the CUDA device and the size of the image passed in.
<br><br>
<syntaxhighlight lang="cpp">
int iDevice;
cudaDeviceProp prop;
cudaGetDevice(&iDevice);
cudaGetDeviceProperties(&prop, iDevice);

int resident_threads = prop.maxThreadsPerMultiProcessor;
int resident_blocks = 8;
if (prop.major >= 3 && prop.major < 5) {
    resident_blocks = 16;
}
else if (prop.major >= 5 && prop.major <= 6) {
    resident_blocks = 32;
}

// Threads per block, calculated from the resident threads and
// resident blocks per SM
dim3 blockDims(resident_threads / resident_blocks, 1, 1);
// Grid size to cover the whole image, rounded up so no pixels are
// missed when the pixel count is not a multiple of the block size
dim3 gridDims((pixels + blockDims.x - 1) / blockDims.x);
</syntaxhighlight>
This resulted in a compute utilization increase from 33% to close to 43%, but unfortunately this did not yield much improvement.
<br><br>
The number of blocks in the grid had been recalculated to incorporate the size of the image and the new threads-per-block count.