Changes

UnknownX

1,356 bytes added, 05:50, 13 April 2017
Assignment 2 - V1 Parallelization
[[File:Pycpu.PNG]]<br />
Here is the code that calculates each pixel. Because every pixel is computed independently of the others, the work maps well onto the GPU.
for (int y = 0; y < N; ++y) {
for (int x = 0; x < N; ++x) {
== Assignment 2 - V1 Parallelization ==
Output result (converted to PNG format):
[[File:GpuassOutput.PNG]]
 
Run time graph:
[[File:Pygpu2.PNG]]
CPU code:
 
The most expensive part of the program:
 
for (int y = 0; y < N; ++y) {
for (int x = 0; x < N; ++x) {
clamp255(pix_col);
}
//Store RGB to array
pixs[3 * (y * N + x)] = (int)pix_col.x;
pixs[3 * (y * N + x) + 1] = (int)pix_col.y;
pixs[3 * (y * N + x) + 2] = (int)pix_col.z;
}
GPU main code in the .cu file:
1. Allocate memory on the device.
2. Run the kernel with ntpb = 1024.
3. Copy the key data out.
int size = N * N;
int nblocks = (size + ntpb - 1) / ntpb;
Kernel:
before:
for (int y = 0; y < N; ++y)
for (int x = 0; x < N; ++x)
after:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int x = idx / N;
int y = idx % N;
__global__ void kernel_tray(Vec3 pix_col, int N, int* pixs_x, int* pixs_y, int* pixs_z) {
clamp255(pix_col);
}
//Store RGB to arrays
pixs_x[y * N + x] = (int)pix_col.x;
pixs_y[y * N + x] = (int)pix_col.y;
pixs_z[y * N + x] = (int)pix_col.z;
}
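The launch configuration and index arithmetic above can be checked on the host. The following is a minimal sketch, not the project's code: the helper names blocksFor and coversEveryPixelOnce are hypothetical, and the idx >= size guard is an assumption, since the kernel body shown here is abbreviated. It emulates blockIdx.x * blockDim.x + threadIdx.x with two nested loops and confirms that every pixel is visited exactly once.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover `size` threads, rounding up.
int blocksFor(int size, int ntpb) {
    return (size + ntpb - 1) / ntpb;
}

// Emulates the kernel's index arithmetic on the host and counts how many
// times each pixel is visited. Returns true if every pixel is hit once.
bool coversEveryPixelOnce(int N, int ntpb) {
    const int size = N * N;
    const int nblocks = blocksFor(size, ntpb);
    std::vector<int> hits(size, 0);
    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < ntpb; ++thread) {
            // Same formula as blockIdx.x * blockDim.x + threadIdx.x.
            int idx = block * ntpb + thread;
            // Assumed guard: the last block may be only partially used.
            if (idx >= size) continue;
            int x = idx / N;
            int y = idx % N;
            ++hits[y * N + x];
        }
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

For example, blocksFor(500 * 500, 1024) gives 245 blocks. The last block has threads with idx beyond N * N, which is why a bounds check of this kind is usually needed inside the kernel.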
 
Profile on nvvp:
[[File:matrix.senecac.on.ca/~zzha1/Capture.PNG]]
== Assignment 3 - Optimization ==
=== V3 -- Occupancy ===
With 1024 threads per block, occupancy is only 50%. Reducing the block size to 640 threads raises it to 60%. <br />
before:
const int ntpb = 1024;
=== V4 -- Coalescence ===
 
Before this modification, the writes from neighbouring threads land far apart in the array:
R1 G1 B1 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ R2 G2 B2 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
After we switch x and y, neighbouring threads write to adjacent positions:
R1 G1 B1 R1 G1 B1 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
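The effect of the swap can be checked by computing the distance between the elements written by two neighbouring threads. This is a sketch under assumptions: it reads "switch the x and y" as changing x = idx / N, y = idx % N into x = idx % N, y = idx / N, it uses the y * N + x offset from the V1 kernel, and the function names are hypothetical.

```cpp
#include <cstdlib>

// Offset written by thread `idx` with the original mapping
// (x = idx / N, y = idx % N).
int offsetBefore(int idx, int N) {
    int x = idx / N;
    int y = idx % N;
    return y * N + x;
}

// Offset written by thread `idx` after swapping x and y
// (assumed reading of the change described in the text).
int offsetAfter(int idx, int N) {
    int x = idx % N;
    int y = idx / N;
    return y * N + x;
}

// Distance, in elements, between the writes of threads idx and idx + 1.
int strideBefore(int idx, int N) {
    return std::abs(offsetBefore(idx + 1, N) - offsetBefore(idx, N));
}

int strideAfter(int idx, int N) {
    return std::abs(offsetAfter(idx + 1, N) - offsetAfter(idx, N));
}
```

With N = 512, threads 0 and 1 write 512 elements apart under the original mapping and 1 element apart after the swap. After the swap the offset is simply idx, so the threads of a warp touch consecutive addresses, which is the access pattern the hardware can coalesce.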
 
 
Before
int idx = blockIdx.x * blockDim.x + threadIdx.x;
'''What problems does it solve?''' <br />
1. Using too many registers
To reach 100% occupancy, each thread has to use fewer than 32 registers. Changing the calculations from double to float reduces register usage from 44 to 29.
[[File:Pyoccu2.PNG]]
2. Double-precision arithmetic is very slow on GeForce devices. <br />
[[File:pyfloat.PNG]]
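The relationship between block size, register count, and occupancy can be estimated with a few lines of arithmetic. This is a rough sketch, not the profiler's calculation: the limits of 2,048 resident threads and 65,536 registers per multiprocessor are assumptions about the device, which the text does not name, and the function name estimateOccupancy is hypothetical. It also ignores register allocation granularity, shared memory, and the limit on resident blocks.

```cpp
#include <algorithm>

// Assumed per-multiprocessor limits (hypothetical device, not stated in the text).
const int kMaxThreadsPerSM = 2048;
const int kRegistersPerSM  = 65536;

// Rough theoretical occupancy: resident threads / maximum resident threads.
double estimateOccupancy(int ntpb, int regsPerThread) {
    // Blocks that fit when only the thread limit is considered.
    int byThreads = kMaxThreadsPerSM / ntpb;
    // Blocks that fit when only the register file is considered.
    int byRegs = kRegistersPerSM / (regsPerThread * ntpb);
    int blocks = std::min(byThreads, byRegs);
    return static_cast<double>(blocks * ntpb) / kMaxThreadsPerSM;
}
```

Under these assumptions, 44 registers per thread allow only one block of 1024 threads per multiprocessor (50%), while two blocks of 640 threads fit (62.5%, close to the 60% reported). At 29 registers per thread, two blocks of 1024 threads fit and the estimate reaches 100%.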
== References and Links ==
Reference: https://www.youtube.com/watch?v=ARn_yhgk7aE

PPT: https://docs.google.com/presentation/d/10Cr_zIDUultkQLzdyC3_3B-GKO_bl6RJFHpWNg72tRk/edit#slide=id.g20678afd80_0_1313