== Assignment 2 - V1 Parallelization ==
CPU run time graph:<br />
[[File:Pycpu.PNG]]<br />
Here is the code that calculates each pixel. It is a good candidate for the GPU because each pixel is computed independently.
 for (int y = 0; y < N; ++y) {
     for (int x = 0; x < N; ++x) {
         // ... trace the ray for pixel (x, y) into pix_col ...
         pixs[3 * (y * N + x)]     = (int)pix_col.x;
         pixs[3 * (y * N + x) + 1] = (int)pix_col.y;
         pixs[3 * (y * N + x) + 2] = (int)pix_col.z;
     }
 }
 
 for (int y = 0; y < N; ++y) {
     for (int x = 0; x < N; ++x) {
         out << pixs[3 * (y * N + x)] << ' '
             << pixs[3 * (y * N + x) + 1] << ' '
             << pixs[3 * (y * N + x) + 2] << '\n';
     }
 }
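For context, a plain-text PPM file needs a small header written before the loop above; a minimal sketch, using the same out stream and N from the original code:
 // P3 = plain-text PPM; then width, height, and the maximum channel value
 out << "P3\n" << N << ' ' << N << "\n255\n";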
Output result (converted to PNG format):
[[File:GpuassOutput.PNG]]
 
Run time graph:
[[File:Pygpu2.PNG]]
CPU code, the most expensive part of the program:
 
 for (int y = 0; y < N; ++y) {
     for (int x = 0; x < N; ++x) {
         // ... trace the ray for pixel (x, y) into pix_col ...
         clamp255(pix_col);
         // Store RGB to array
         pixs[3 * (y * N + x)]     = (int)pix_col.x;
         pixs[3 * (y * N + x) + 1] = (int)pix_col.y;
         pixs[3 * (y * N + x) + 2] = (int)pix_col.z;
     }
 }
GPU main code in the .cu file:
# Allocate memory on the device.
# Run the kernel (ntpb = 1024 threads per block).
# Copy the resulting pixel data back out to the host.
 int size = N * N;
 int nblocks = (size + ntpb - 1) / ntpb;
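A minimal host-side sketch of those three steps, assuming the three-array V1 kernel shown below and hypothetical host buffers h_pixs_x, h_pixs_y, h_pixs_z (error checking omitted):
 int* d_pixs_x; int* d_pixs_y; int* d_pixs_z;
 cudaMalloc(&d_pixs_x, size * sizeof(int));  // 1. allocate device memory
 cudaMalloc(&d_pixs_y, size * sizeof(int));
 cudaMalloc(&d_pixs_z, size * sizeof(int));
 // 2. run the kernel
 kernel_tray<<<nblocks, ntpb>>>(pix_col, N, d_pixs_x, d_pixs_y, d_pixs_z);
 // 3. copy the results back to the host, then release device memory
 cudaMemcpy(h_pixs_x, d_pixs_x, size * sizeof(int), cudaMemcpyDeviceToHost);
 cudaMemcpy(h_pixs_y, d_pixs_y, size * sizeof(int), cudaMemcpyDeviceToHost);
 cudaMemcpy(h_pixs_z, d_pixs_z, size * sizeof(int), cudaMemcpyDeviceToHost);
 cudaFree(d_pixs_x); cudaFree(d_pixs_y); cudaFree(d_pixs_z);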
Kernel: the nested pixel loops are replaced by a per-thread index.

Before:
 for (int y = 0; y < N; ++y)
     for (int x = 0; x < N; ++x)
After:
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 int x = idx / N;
 int y = idx % N;
 __global__ void kernel_tray(Vec3 pix_col, int N, int* pixs_x, int* pixs_y, int* pixs_z) {
     int idx = blockIdx.x * blockDim.x + threadIdx.x;
     if (idx >= N * N) return;  // nblocks is rounded up, so guard the tail threads
     int x = idx / N;
     int y = idx % N;
     // ... trace the ray for pixel (x, y) into pix_col ...
     clamp255(pix_col);
     // Store RGB to arrays
     pixs_x[y * N + x] = (int)pix_col.x;
     pixs_y[y * N + x] = (int)pix_col.y;
     pixs_z[y * N + x] = (int)pix_col.z;
 }
 
Profile on nvvp:
[http://matrix.senecac.on.ca/~zzha1/Capture.PNG Capture.PNG]
== Assignment 3 - Optimization ==
=== V2 -- One array ===
PPM file output: <br />
[[File:Txt.PNG]]<br />
In V1 we allocated three arrays to store all the results, so each pixel is split across three arrays, which is slow.
Instead of three arrays, we allocate one bigger array and store all the pixels in it, with each pixel's channels kept together.
For the first pixel:
1st: R _ _ _ _ _ _ _
2nd: G _ _ _ _ _ _ _
3rd: B _ _ _ _ _ _ _
 
new array: R G B _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
 
Before:
 int* d_pixs_x;
 int* d_pixs_y;
 int* d_pixs_z;
After:
 int* d_pixs;  // one array holding R, G, B for every pixel
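With one array, the kernel's store changes accordingly; a sketch, assuming the hypothetical d_pixs name above and the V1 index mapping:
 // Channels of pixel (x, y) now sit at three consecutive offsets
 d_pixs[3 * (y * N + x)]     = (int)pix_col.x;
 d_pixs[3 * (y * N + x) + 1] = (int)pix_col.y;
 d_pixs[3 * (y * N + x) + 2] = (int)pix_col.z;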
----
=== V2 -> V3 -- Occupancy ===
If we use 1024 threads per block, we only get 50% occupancy. However, if we change it to 640, we get 60%. <br />
Before:
 const int ntpb = 1024;
After:
 const int ntpb = 640;
[[File:pyoccu.PNG]]
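The occupancy can also be estimated in code with the CUDA occupancy API (an addition for illustration, not part of the original program); a sketch for this kernel:
 int maxBlocksPerSM = 0;
 cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, kernel_tray, ntpb, 0);
 cudaDeviceProp prop;
 cudaGetDeviceProperties(&prop, 0);
 // theoretical occupancy = active warps per SM / maximum warps per SM
 float occupancy = (maxBlocksPerSM * ntpb / 32.0f)
                 / (prop.maxThreadsPerMultiProcessor / 32.0f);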
----
=== V3 -> V4 -- Coalescence ===
Before this modification, here is our array: consecutive threads write pixels that are N entries apart, so the writes do not coalesce.
 R1 G1 B1 _ _ _ _ _ _ _ _ _ ... R2 G2 B2 _ _ _ _ _ _ _ _ _ ...
After we switch x and y, consecutive threads write adjacent pixels:
 R1 G1 B1 R2 G2 B2 _ _ _ _ _ _ _ _ _ ...
Before:
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 int x = idx / N;
 int y = idx % N;
After:
 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 int x = idx % N;
 int y = idx / N;
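With the swapped mapping, y * N + x equals idx, so the one-array store from V2 becomes fully coalesced; a sketch using the hypothetical d_pixs array:
 // The 32 threads of a warp write 32 adjacent 3-int groups
 d_pixs[3 * idx]     = (int)pix_col.x;
 d_pixs[3 * idx + 1] = (int)pix_col.y;
 d_pixs[3 * idx + 2] = (int)pix_col.z;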
----
=== V4 -> V5 -- Double -> float ===
Before:
 struct Vec3 { double x, y, z; ... };
After:
 struct Vec3 { float x, y, z; ... };
'''What problems does it solve?''' <br />
1. Using too many registers. To reach 100% occupancy, each thread has to use fewer than 32 registers; changing from double to float reduces the count from 44 to 29.
[[File:Pyoccu2.PNG]]
2. Double-precision arithmetic is very slow on GeForce devices. <br />
[[File:pyfloat.PNG]]
 
==Links==
References: https://www.youtube.com/watch?v=ARn_yhgk7aE
 
PPT: https://docs.google.com/presentation/d/10Cr_zIDUultkQLzdyC3_3B-GKO_bl6RJFHpWNg72tRk/edit#slide=id.g20678afd80_0_1313