I have recently been interested in the properties of Bose-Einstein condensation in two-dimensional hydrodynamic turbulence. Without the Ekman term (linear friction), the time to reach a saturated state is proportional to the Reynolds number, so it can be very long. I therefore want to improve my GPU code to use adaptive time stepping.

The idea of adaptive time stepping is simple. One finds the maximum flow speed and applies the Courant–Friedrichs–Lewy (CFL) stability condition to adjust the step size. For the convection term, the condition is

    \[ \Delta t \le C_\mathrm{CFL}\frac{\Delta x}{u_\max}, \]

where u_\max is the maximum flow speed and the CFL constant, C_\mathrm{CFL}, depends on the details of the integration scheme.
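
To make this concrete, here is a minimal sketch of a step-size update based on this condition; the function name new_dt and the cap dt_max are illustrative, not part of sg2:

/* Minimal sketch: CFL-limited time step.  The cap dt_max keeps the
   step finite when u_max is close to zero.  All names here are
   illustrative, not taken from sg2. */
double new_dt(const double cfl, const double dx,
              const double u_max, const double dt_max)
{
  const double dt = cfl * dx / u_max;
  return dt < dt_max ? dt : dt_max;
}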

This technique is almost standard in computational fluid dynamics codes written for CPUs. However, on GPUs things are a little more complicated. GPUs are massively parallel processors. Using a naive loop to find the maximum,

/* A serial scan over all n1 * n2 elements; it assumes u[] is
   non-negative, which holds for flow speeds. */
double find_max(const double *u, const int n1, const int n2)
{
  double max = 0.0;
  for(int i = 0; i < n1 * n2; ++i) if(max < u[i]) max = u[i];
  return max;
}

is inefficient. Indeed, this is the reason I haven't put adaptive time stepping into sg2 yet. An idea that arises naturally is to employ a tree-based algorithm. Ideally, the comparisons at the same level of the tree are done in parallel, which reduces the step complexity to \mathcal{O}(\ln N). In reality, however, \mathcal{O}(\ln N) steps with N threads mean \mathcal{O}(N \ln N) cost complexity on a parallel machine. This is again inefficient.

The real solution is to mix serial and parallel algorithms. Applying Brent's theorem, we should launch only \mathcal{O}(N/\ln N) threads, each of which does \mathcal{O}(\ln N) sequential work. Their results are then combined by the tree-based algorithm. The total cost is only \mathcal{O}(N).
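
To put in numbers, consider a 1024 × 1024 grid, so N ≈ 10^6. A pure tree reduction launches about 10^6 threads through roughly 20 pairwise steps, costing about 2 × 10^7 thread-steps. The mixed algorithm launches about N/\ln N ≈ 7 × 10^4 threads, each doing about 14 serial comparisons, followed by a short tree over the partial results, for a total cost of order 10^6, an order of magnitude less work.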

The CUDA SDK comes with a reduction example. On Mac OS X, it is located in the directory /Developer/GPU Computing/C/src/reduction. The rationale behind the implementation is documented in the accompanying white paper. However, I found the highly optimized code quite ugly. Because the step size computation is only a small part of the code, I can afford to use a less optimized but easier to understand version. In the kernel below, each thread serially scans one column of n1 elements, which plays the role of the sequential part, and a shared-memory tree then combines the per-thread results within each block:

#define TIDE 512 /* number of threads per block */

__global__ void kern(double *max, const double *u, const int n1, const int n2)
{
  __shared__ double smax[TIDE]; /* per-block scratch for the reduction */

  const int t = threadIdx.x;
  const int j = blockDim.x * blockIdx.x + t;
  int i;

  smax[t] = 0.0;

  /* Serial part of the algorithm: each thread scans one column */
  if(j < n2) for(i = 0; i < n1; ++i) {
    const int h = i * n2 + j;
    if(smax[t] < u[h]) smax[t] = u[h];
  }
  __syncthreads();

  /* Parallel tree-based reduction within the block */
  for(i = TIDE / 2; i > 0; i /= 2) {
    if(t < i && smax[t] < smax[t + i]) smax[t] = smax[t + i];
    __syncthreads();
  }

  /* Output is an array with gsz elements, see below */
  if(t == 0) max[blockIdx.x] = smax[0];
}

Note that each block has one output, so max[] contains gridDim.x elements. Within a single kernel launch we cannot reduce them to one number because blocks cannot communicate through shared memory. In principle, we could call the reduction kernel a second time with a single block, but it is easier to do this final reduction on the CPU. Here is the host (driver) function:

/* buff is a preallocated device buffer and host a preallocated host
   buffer, each with at least gsz elements */
double find_max(const double *u, const int n1, const int n2)
{
  const int bsz = TIDE;
  const int gsz = (n2 - 1) / TIDE + 1;
  double max = 0.0;

  kern<<<gsz, bsz>>>(buff, u, n1, n2);
  cudaMemcpy(host, buff, sizeof(double) * gsz, cudaMemcpyDeviceToHost);

  for(int i = 0; i < gsz; ++i)
    if(max < host[i]) max = host[i];
  return max;
}
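
With find_max() in place, wiring adaptive time stepping into the main loop is straightforward. The following sketch combines it with the new_dt() helper from above; step(), t_end, and the other variable names are placeholders, not the actual sg2 interface:

/* Hedged sketch of an adaptive time stepping loop; step() and the
   variable names are placeholders, not the actual sg2 interface. */
while(t < t_end) {
  const double u_max = find_max(u, n1, n2);         /* GPU reduction  */
  const double dt = new_dt(cfl, dx, u_max, dt_max); /* CFL condition  */
  step(dt);                               /* advance the flow by dt   */
  t += dt;
}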
 
