Sirius

=== Vehicle detection and tracking (Rosario A. Cali) ===
For me, the most important thing is to solve the problem regardless of the tools used, and I think that reimplementing everything from scratch using OpenCV and CUDA is a viable solution.
=== Source Code for Vehicle Detection ===
<syntaxhighlight lang="cpp">
void detect_vehicles()
{
    for (unsigned int i = 0; i < files.size(); i++)
    {
        // Load one image at a time and display it
        load_image(img, files[i]);
        win.set_image(img);

        // Run the detector on the image and show the output
        for (auto&& d : net(img))
        {
            auto fd = sp(img, d);
            rectangle rect;

            for (unsigned long j = 0; j < fd.num_parts(); ++j)
                rect += fd.part(j);

            if (d.label == "rear")
                win.add_overlay(rect, rgb_pixel(255, 0, 0), d.label);
            else
                win.add_overlay(rect, rgb_pixel(255, 255, 0), d.label);
        }

        // Clear the overlay
        dlib::sleep(1000);
        win.clear_overlay();
    }
}
</syntaxhighlight>
=== Box Blur on an Image Using the OpenCV C++ Library (Max Fainshtein) ===
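As a reference point for the serial version of the box blur, a minimal OpenCV implementation might look like the sketch below. The file names and the 7x7 kernel size are illustrative assumptions, not the project's actual parameters.
<syntaxhighlight lang="cpp">
#include <opencv2/opencv.hpp>
#include <iostream>

int main(int argc, char** argv)
{
    // Load the input image (the default path is a placeholder for illustration)
    cv::Mat src = cv::imread(argc > 1 ? argv[1] : "input.png");
    if (src.empty())
    {
        std::cerr << "Could not open the input image" << std::endl;
        return 1;
    }

    // Box blur: every output pixel is the mean of a k x k neighbourhood
    const int k = 7;  // assumed kernel size
    cv::Mat dst;
    cv::blur(src, dst, cv::Size(k, k));

    cv::imwrite("blurred.png", dst);
    return 0;
}
</syntaxhighlight>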
=== Assignment 3 ===
Upon using Nvidia's Visual Profiler, we realized that our kernel implementation had made massive improvements compared to the serial version, but after analyzing the Assignment 2 version we noticed that we could still improve our kernel even further.
<br><br>
Problem:
----
Nvidia's Visual Profiler showed that the kernels were executing concurrently, but the percentage of concurrency was quite low and the Streaming Multiprocessors were not being used to their maximum capability.
<br><br>
Solution:
----
One way to address low compute utilization is to increase the occupancy of each SM. According to CUDA's occupancy calculator, the machine we were using for testing had a compute capability of 6.1, which means each SM supports 32 resident blocks and 2048 resident threads. To achieve maximum occupancy we would therefore use 2048 / 32 = 64 threads per block. To determine an appropriate grid size, we divide the total number of pixels by the 64 threads per block. This lets us initiate the thread count based on the compute capability of the CUDA device and size the grid dynamically depending on the image passed in.
<br><br>
<syntaxhighlight lang="cpp">
int iDevice;
cudaDeviceProp prop;
cudaGetDevice(&iDevice);
cudaGetDeviceProperties(&prop, iDevice);

// Resident threads and blocks per SM depend on the device's compute capability
int resident_threads = prop.maxThreadsPerMultiProcessor;
int resident_blocks = 8;
if (prop.major >= 3 && prop.major < 5)
{
    resident_blocks = 16;
}
else if (prop.major >= 5 && prop.major <= 6)
{
    resident_blocks = 32;
}

// Determine the number of threads per block
dim3 blockDims(resident_threads / resident_blocks, 1, 1);

// Calculate the grid size to cover the whole image (rounded up so no pixels are missed)
dim3 gridDims((pixels + blockDims.x - 1) / blockDims.x);
</syntaxhighlight>
This resulted in a compute utilization increase from 33% to close to 43%, but unfortunately it did not yield much of an improvement.
<br><br>
The number of blocks in the grid was then recalculated based on the size of the image and the new threads-per-block value, so the grid scales with the image passed in.
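As a rough sketch of how this launch configuration could come together, the example below pairs the device query with a hypothetical single-channel box blur kernel. The kernel name, signature, and data layout are assumptions for illustration, not the project's actual code, and the grid size is rounded up so the whole image is covered.
<syntaxhighlight lang="cpp">
#include <cuda_runtime.h>

// Hypothetical single-channel box blur kernel, used only to illustrate the
// launch configuration; the project's real kernel differs.
__global__ void boxBlurKernel(unsigned char* out, const unsigned char* in,
                              int width, int height, int radius)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int pixels = width * height;
    if (idx >= pixels) return;

    int x = idx % width;
    int y = idx / width;
    int sum = 0, count = 0;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx)
        {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height)
            {
                sum += in[ny * width + nx];
                ++count;
            }
        }
    out[idx] = static_cast<unsigned char>(sum / count);
}

void launchBoxBlur(unsigned char* d_out, const unsigned char* d_in,
                   int width, int height, int radius)
{
    // Query the device to size the blocks as described above
    int iDevice;
    cudaDeviceProp prop;
    cudaGetDevice(&iDevice);
    cudaGetDeviceProperties(&prop, iDevice);

    int resident_threads = prop.maxThreadsPerMultiProcessor;
    int resident_blocks = 8;
    if (prop.major >= 3 && prop.major < 5)
        resident_blocks = 16;
    else if (prop.major >= 5 && prop.major <= 6)
        resident_blocks = 32;

    int pixels = width * height;
    dim3 blockDims(resident_threads / resident_blocks, 1, 1);
    // Ceiling division so the grid still covers images whose pixel count is
    // not an exact multiple of the block size
    dim3 gridDims((pixels + blockDims.x - 1) / blockDims.x, 1, 1);

    boxBlurKernel<<<gridDims, blockDims>>>(d_out, d_in, width, height, radius);
    cudaDeviceSynchronize();
}
</syntaxhighlight>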