NVIDIA GPUs currently run on two types of architectures – Fermi and Kepler. The Fermi architecture has been around for several years. The new Kepler architecture was announced a few months ago. For CUDA software developers, one very nice feature in Kepler is Dynamic Parallelism. The figure above shows data traffic between CPUs and GPUs across the PCIe bus in the older Fermi architecture (left) and the new Kepler architecture (right).
In Fermi, data transits from CPU to GPU across the PCIe bus, a single CUDA kernel executes on the GPU, and then data returns across the bus back to the CPU. This round-trip data traffic occurs for each and every kernel launch. Unfortunately the PCIe bus is slow, much slower than CPU and GPU speeds, and so we try to avoid CPU-to-GPU data transfers whenever possible. In fact, the PCIe bus can become a severe bottleneck for many types of algorithms.
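As a rough sketch of this Fermi-era pattern, the hypothetical host function below (the names `stageKernel` and `runStages` are invented for illustration) crosses the PCIe bus twice for every kernel launch, which is exactly the traffic the figure's left-hand side depicts:

```cuda
#include <cuda_runtime.h>

// A placeholder processing stage.
__global__ void stageKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Each stage is bracketed by host-to-device and device-to-host copies,
// so the slow PCIe bus is traversed twice per launch.
void runStages(float *host, int n, int stages) {
    float *dev;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev, bytes);
    for (int s = 0; s < stages; ++s) {
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // across PCIe
        stageKernel<<<(n + 255) / 256, 256>>>(dev, n);
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // back across PCIe
    }
    cudaFree(dev);
}
```

In practice a careful Fermi program would keep intermediate data resident on the GPU between launches where it could, but whenever the CPU must inspect results to decide the next launch, this round trip is unavoidable.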
In Kepler, data transits from CPU to GPU across the PCIe bus, but with Dynamic Parallelism one kernel can then spawn one or more additional kernels without any data transfers back to the CPU. All of the kernels and their associated datasets remain on the GPU, avoiding further PCIe bus traffic. Only when a required synchronization point is reached do the GPU kernels finally terminate and data transfer back to the CPU. This form of Dynamic Parallelism can result in considerable speedup for many types of software applications.
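A minimal sketch of a nested launch, assuming hypothetical kernel names `parentKernel` and `childKernel`: the parent launches the child directly from device code, so the intermediate data never leaves the GPU. Dynamic Parallelism requires compute capability 3.5 or higher and relocatable device code (e.g. `nvcc -arch=sm_35 -rdc=true nested.cu -lcudadevrt`):

```cuda
#include <cuda_runtime.h>

// Second processing stage, launched from the GPU itself.
__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parentKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;  // first processing stage

    // One thread spawns the next stage; no round trip to the CPU.
    if (i == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();  // wait for the child before the parent returns
    }
}
```

The host still performs one copy in and one copy out, but everything between the two stages stays on the GPU.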
Kepler’s Dynamic Parallelism will allow GPU software developers to create nested kernels, minimize CPU-to-GPU communication, and improve software performance dramatically.