Latest revision as of 21:20, 5 December 2014

Calculations of Pi

Team Member

Tony Yu

Progress

Assignment 1

Brief Overview

The Monte Carlo approach to calculating Pi involves a circle inscribed in a square. Dots are drawn at random positions on the square. The number of dots that land inside the circle, divided by the total number of dots and multiplied by 4, gives a value close to Pi. The more dots drawn, the more accurate the estimate tends to become.
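As a minimal serial sketch of this idea (not the code from the linked site), the estimate can be written in a few lines of C++. The names estimate_pi and seed are illustrative assumptions. The sketch samples the unit square and tests against a quarter circle of radius 1, which gives the same inside-to-total ratio of Pi/4 as a full circle inscribed in a square.

```cpp
#include <cstdint>
#include <random>

// Estimate pi by drawing `dots` random points in the unit square and
// counting how many land inside the quarter circle of radius 1.
// `estimate_pi` and `seed` are hypothetical names used for illustration.
double estimate_pi(std::uint64_t dots, std::uint32_t seed = 12345) {
    if (dots == 0) return 0.0;  // avoid dividing by zero
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::uint64_t inside = 0;
    for (std::uint64_t i = 0; i < dots; ++i) {
        const double x = dist(gen);
        const double y = dist(gen);
        if (x * x + y * y <= 1.0) ++inside;  // dot is within the circle
    }
    // The ratio inside/dots approximates pi/4, so multiply by 4.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(dots);
}
```

Calling estimate_pi(1000000) should return a value near 3.14; the error shrinks roughly with the square root of the number of dots.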

Findings

~~The code used is taken from the following site http://www.cplusplus.com/forum/beginner/1149/ with some changes to the code.~~

~~This program will calculate pi to a precision based on the value entered by the user. Currently it displays to a precision of 9 decimal places.~~

I will attempt to figure out a way to calculate to as many decimal places as possible, which should drastically increase the time it takes to run the program. The current possible solution is to use the BigNumber library.

(Updated)

The code used is taken from the following site https://helloacm.com/cc-coding-exercise-finding-approximation-of-pi-using-monto-carlo-algorithm/ with some changes to it.

This program calculates pi using the Monte Carlo approach. The accuracy of the result depends on the number of dots, which is the value entered by the user.

Value of 1 Million

Value of 10 Million

Value of 100 Million

Value of 1 Billion

Assignment 2 and 3

I have combined the 2nd and 3rd parts of the assignment, since I had some issues with the kernel.

The following results compare the upgraded code with the original and show a significant increase in speed.

(Note)

For some reason the code crashes my graphics driver past 8000000 (8 million) dots. Even at 8 million it crashes most of the time, but the value it produces is still correct.

Approach

Instead of doing everything within main, I created a separate function for the calculation. All the random numbers are generated within the kernel using cuRAND. The kernel is also responsible for all the calculations and uses shared memory across the threads of each block to obtain a partial sum. Here are some snippets of the code.

Some Code Snippets

If the dot is within the circle, the kernel sets the tid (threadIdx.x) index of the temp array in shared memory to 1 and then syncs the threads. It then sums all the 1s in the temp array for that specific block and writes the result into another array.
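The original snippet was posted as an image and is not reproduced here. As a rough CPU-only C++ stand-in for the same idea (an assumption-laden sketch, not the actual kernel), each "block" below fills a temp array with one 0 or 1 per "thread" and then writes a single partial sum. The names block_partial_sums, blocks, and threads_per_block are hypothetical, and std::mt19937 stands in for the cuRAND generator used on the GPU.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// CPU emulation of the per-block partial sum. All names here are
// illustrative; the real version runs as a CUDA kernel.
std::vector<unsigned> block_partial_sums(std::size_t blocks,
                                         std::size_t threads_per_block,
                                         unsigned seed = 1) {
    std::mt19937 gen(seed);  // stand-in for cuRAND
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<unsigned> partial(blocks, 0u);
    std::vector<unsigned> temp(threads_per_block);  // plays the role of shared memory

    for (std::size_t b = 0; b < blocks; ++b) {
        // Each "thread" tests one dot and records 1 (inside) or 0 (outside).
        for (std::size_t tid = 0; tid < threads_per_block; ++tid) {
            const float x = dist(gen);
            const float y = dist(gen);
            temp[tid] = (x * x + y * y <= 1.0f) ? 1u : 0u;
        }
        // On the GPU, a __syncthreads() barrier belongs here so that every
        // thread has written temp[tid] before the sum is taken.
        unsigned sum = 0u;
        for (std::size_t tid = 0; tid < threads_per_block; ++tid) sum += temp[tid];
        partial[b] = sum;  // one value per block is passed out
    }
    return partial;
}
```

Passing out one count per block rather than one flag per dot keeps the array copied back to the host small.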

After copying the results from the device to the host, the program obtains the total by looping over every index of the array of per-block partial sums and adding the values together. This total is then used to calculate the value of pi.

Execution Times for Values of 1, 5 and 8 Million

Comparison Chart

Issues

The main issue for me was figuring out how to use the kernel for this approach. At first, each thread produced a value of 1 or 0 depending on whether its dot landed within the circle, and I tried to pass each of those values out into an array individually. Later on, Chris gave me the idea of computing a partial sum over all the threads within each block and passing that out instead, which is a much better approach.

Another big issue was the graphics driver crashing. If the program took more than 3 seconds to execute, the driver would crash. Even after I changed the registry to allow 15 seconds before the driver timed out, it still crashed at 3 seconds.

For optimization, I tried using reduction; however, it didn't seem to speed up the program.
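The reduction code itself is not shown on this page, so the following is only a generic sketch of a tree-style reduction, written as CPU-only C++. It assumes the temp array length is a power of two, as is common for CUDA block sizes. The name tree_reduce is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Pairwise (tree) sum of a block's temp array. Assumes temp.size() is a
// power of two. `tree_reduce` is an illustrative name only.
unsigned tree_reduce(std::vector<unsigned> temp) {
    if (temp.empty()) return 0u;
    // Halve the active range each step: element tid absorbs element tid + stride.
    for (std::size_t stride = temp.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            temp[tid] += temp[tid + stride];
        }
        // On the GPU the inner loop runs as parallel threads, and a
        // __syncthreads() barrier is needed here before the next step.
    }
    return temp[0];  // the block's partial sum
}
```

A reduction like this takes log2(n) steps instead of n sequential additions per block. When the run time is dominated by random number generation rather than by the summation, the overall gain may be small, which would be consistent with the result reported above.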

Different Approach

Another approach is to use a different algorithm, such as the one I used at first. However, that program only goes up to 9 significant digits, since anything more exceeds what a float can represent. It shows an execution time of 0.05 seconds for all values entered by the user, but it would need the BigNumber library or something similar in order to show more significant digits.
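The 9-digit ceiling mentioned above matches std::numeric_limits<float>::max_digits10 on platforms with IEEE 754 single precision. Whether that is exactly the limit the original program ran into is an assumption. The small sketch below, with the hypothetical helper pi_as_text, prints a pi constant at the maximum useful precision of a given floating-point type.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Format a pi constant using as many significant digits as type T can
// round-trip. `pi_as_text` is an illustrative helper, not original code.
template <typename T>
std::string pi_as_text() {
    const T pi = static_cast<T>(3.14159265358979323846L);
    std::ostringstream out;
    // max_digits10 is 9 for an IEEE 754 float and 17 for a double.
    out << std::setprecision(std::numeric_limits<T>::max_digits10) << pi;
    return out.str();
}
```

Comparing pi_as_text<float>() with pi_as_text<double>() shows that the float version agrees with pi only in roughly its first 7 digits, so switching to double or to an arbitrary-precision library is needed before more digits become meaningful.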
