Another optimization would be to optimize the original algorithm itself. The pow() function itself very expensive because it is software implemented. A better alternative would be to use hardware functions. In this case, it would be quicker to use the (*) operator and multiply the values with themselves as opposed to setting them to the power of 2.
After modifying the code to reflect this change, both the GPU and CPU methods are roughly equal to each other (in terms of execution time) when the value of n is 800.