Dynamic Parallelism is an extension to CUDA that enables kernels to directly launch other kernels. Earlier versions of CUDA only allowed kernels to be launched from host code. When we studied the parallel scan pattern, for example, the segmented approach required multiple kernel calls.
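As a minimal sketch of the idea (the kernel names below are made up for illustration), a parent kernel can launch a child grid directly from device code; compiling this requires relocatable device code, e.g. nvcc -rdc=true.

```cpp
// Illustrative sketch of dynamic parallelism: a kernel launching another kernel.
// Compile with relocatable device code enabled (nvcc -rdc=true).
__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder work on one element
}

__global__ void parentKernel(float *data, int n) {
    // A single thread launches the child grid to avoid redundant launches.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```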
Table of Contents
- What is cuDNN?
- Setting up cuDNN
- Handling Errors
- Representing Data
- Dense Layers
- Activation Functions
- Loss Functions
- Convolutions
- Pooling

What is cuDNN?
NVIDIA cuDNN provides optimized implementations of core operations used in deep learning. It is designed to be integrated into higher-level machine learning frameworks, such as TensorFlow, PyTorch, and Caffe.
Setting up cuDNN
To use cuDNN, each program needs to establish a handle to the cuDNN library. This is done by creating a cudnnHandle_t object and initializing it with cudnnCreate.
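A minimal sketch of creating and destroying a handle (error handling kept brief):

```cpp
// Minimal sketch: initialize the cuDNN library context and release it when done.
#include <cudnn.h>
#include <stdio.h>

int main(void) {
    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);  // create the library handle
    if (status != CUDNN_STATUS_SUCCESS) {
        fprintf(stderr, "cudnnCreate failed: %s\n", cudnnGetErrorString(status));
        return 1;
    }
    // ... pass `handle` to subsequent cuDNN calls ...
    cudnnDestroy(handle);  // release library resources
    return 0;
}
```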
Table of Contents
- Introduction
- Parallelization over vertices
- Parallelization over edges
- Improving work efficiency
- Privatization to reduce contention
- Additional Optimizations

Introduction
For an introduction to basic graph theory and traversal algorithms, see these notes.
Parallelization over vertices
When it comes to parallel graph algorithms, the first question is whether threads should operate on the vertices or the edges. Either way, such algorithms will require communication between threads due to dependencies in the graph and in the traversal itself.
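As a rough sketch of the vertex-centric approach, the kernel below performs one level of a breadth-first search, assuming the graph is stored in CSR form (rowPtr, dst) and unvisited vertices have level -1; all names are illustrative.

```cpp
// Sketch: one BFS level with one thread per vertex (vertex-centric, push style).
// Launched once per level from the host until `changed` stays 0.
__global__ void bfsVertexLevel(const int *rowPtr, const int *dst, int *level,
                               int numVertices, int curLevel, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < numVertices && level[v] == curLevel) {
        // This thread owns a frontier vertex; relax its outgoing edges.
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = dst[e];
            if (level[u] == -1) {      // neighbor not visited yet
                level[u] = curLevel + 1;
                *changed = 1;          // another level is needed
            }
        }
    }
}
```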
Table of Contents
- Radix Sort
- Optimizing Memory Access Efficiency
- Choosing a different Radix value

Radix Sort
For a background on Radix Sort, see these notes on Sorting in Linear Time.
Radix sort applies a counting sort for each digit position, and each pass must finish before the next one can begin. The parallel solution will not attempt to remove this sequential dependency between passes. Instead, we will focus on parallelizing the counting sort step within each pass.
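As an illustration, here is a rough sketch of a single 1-bit (radix-2) split pass, assuming a separate exclusive scan over the extracted bits has already produced bitScan and the total count numOnes; all names are illustrative.

```cpp
// Sketch: scatter elements by one bit, keeping the relative order of equal keys.
// Zeros are placed at the front, ones after all of the zeros (a stable split).
__global__ void radixSplitPass(const unsigned int *in, unsigned int *out,
                               const unsigned int *bitScan, unsigned int numOnes,
                               unsigned int n, unsigned int shift) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int bit = (in[i] >> shift) & 1u;
        unsigned int onesBefore  = bitScan[i];        // exclusive scan of the bits
        unsigned int zerosBefore = i - onesBefore;
        unsigned int dst = bit ? (n - numOnes) + onesBefore  // after all the zeros
                               : zerosBefore;
        out[dst] = in[i];
    }
}
```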
Table of Contents
- Introduction
- Coordinate List Format (COO)
- Compressed Sparse Row Format (CSR)
- ELL Format
- ELL-COO Format
- Jagged Diagonal Storage Format (JDS)

Introduction
Sparse matrices are matrices with mostly zero elements. They are common in scientific computing, machine learning, and other fields. They are important to study in the context of GPU computing because they can be very large and require a lot of memory. Efficiently representing and computing with sparse matrices provides a substantial benefit to many applications.
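For a concrete sense of why the storage format matters, here is a minimal sketch of sparse matrix-vector multiplication (y = Ax) using the CSR format with one thread per row; the array names are illustrative.

```cpp
// Sketch: CSR SpMV with one thread per row.
// rowPtr has numRows + 1 entries; colIdx/values hold the nonzero elements.
__global__ void spmvCsr(const int *rowPtr, const int *colIdx, const float *values,
                        const float *x, float *y, int numRows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < numRows) {
        float sum = 0.0f;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j) {
            sum += values[j] * x[colIdx[j]];  // accumulate this row's nonzeros
        }
        y[row] = sum;
    }
}
```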
Table of Contents
- Key Concepts and Challenges
- Introduction
- The Merge Operation
- Tiled Merge
- Circular Buffers
- Thread Coarsening

Key Concepts and Challenges
- Dynamic input data identification
- Data locality
- Buffer management schemes

Introduction
The merge operation takes two sorted subarrays and combines them into a single sorted array. You may be familiar with this approach from studying Divide and Conquer Algorithms. Parallelizing the merge operation is a non-trivial task and will require the use of a few new techniques.
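For reference, here is a minimal sequential merge, which the parallel techniques in these notes reorganize across threads:

```cpp
// Sequential merge of sorted arrays A (m elements) and B (n elements) into C.
void mergeSequential(const int *A, int m, const int *B, int n, int *C) {
    int i = 0, j = 0, k = 0;
    while (i < m && j < n) {
        C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];  // take the smaller head element
    }
    while (i < m) C[k++] = A[i++];  // copy whatever remains of A
    while (j < n) C[k++] = B[j++];  // or whatever remains of B
}
```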
Table of Contents
- What is it?
- Naive Parallel Reduction
- Kogge-Stone Algorithm
- Brent-Kung Algorithm
- Adding Coarsening
- Segmented Parallel Scan
- Optimizing Memory Efficiency
- The Takeaway

What is it?
- Parallelizes seemingly sequential problems.
- Works with computations that can be described in terms of a recursion.
- Used as a primitive operation for sorting, tree operations, and recurrences.
- Studying this pattern also reveals how parallelization can increase the complexity beyond that of a traditional sequential approach.

Example: Inclusive Scan
Given an array of numbers, the inclusive scan computes the sum of all elements up to and including a given index. For example, given the array [1, 2, 3, 4, 5], the inclusive scan produces [1, 3, 6, 10, 15]. You could solve this recursively, but it would be horribly inefficient. A sequential solution is achievable with dynamic programming, and a parallel solution can be much faster still.
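For reference, a minimal sequential inclusive scan matching the example above:

```cpp
// Sequential inclusive scan: out[i] = in[0] + in[1] + ... + in[i].
// For [1, 2, 3, 4, 5] this produces [1, 3, 6, 10, 15].
void inclusiveScanSequential(const float *in, float *out, int n) {
    if (n == 0) return;
    out[0] = in[0];
    for (int i = 1; i < n; ++i) {
        out[i] = out[i - 1] + in[i];  // carry the running sum forward
    }
}
```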
Table of Contents
- Introduction
- Reduction Trees
- A Simple Kernel
- Minimizing Control Divergence
- Memory Divergence of Reduction
- Reducing the number of global memory requests
- Hierarchical Reduction
- Thread Coarsening - Back Again

The following notes follow Chapter 10 of Programming Massively Parallel Processors (Hwu, Kirk, and El Hajj 2022).
Introduction
Given a set of input values, a reduction produces a single output value. It is an important part of many parallel algorithms, including MapReduce. Other patterns that we have studied, such as GPU Pattern: Parallel Histogram, can also be viewed as reductions. Implementing this pattern efficiently requires careful consideration of thread communication, which will be the focus of these notes.
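As a rough sketch, the kernel below performs a per-block tree reduction (sum) in shared memory, assuming blockDim.x is a power of two equal to BLOCK_DIM; the per-block partial sums would still need to be combined in a second step. Names are illustrative.

```cpp
// Sketch: each block reduces BLOCK_DIM input elements to one partial sum.
#define BLOCK_DIM 256

__global__ void blockSumReduce(const float *in, float *partialSums, int n) {
    __shared__ float sdata[BLOCK_DIM];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // load one element per thread
    __syncthreads();
    // Reduction tree: halve the number of active threads each iteration.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) partialSums[blockIdx.x] = sdata[0];
}
```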
Table of Contents
- Histograms
- Latency of Atomic Operations
- Privatization
- Coarsening
- Aggregation
- The Takeaway

These notes follow the presentation of the parallel histogram pattern in the book Programming Massively Parallel Processors: A Hands-on Approach (Hwu, Kirk, and El Hajj 2022).
Histograms
Examples of histograms include:
- Frequency of words in a document
- Distribution of pixel intensities in an image
- Distribution of particle energies in a physics simulation
- Distribution of thread block execution times in a GPU kernel

Consider a sequential program that computes a histogram of the letters in a string, where the input is assumed to be lower case. Since it executes sequentially, there is no risk of multiple threads writing to the same memory location at the same time.
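A minimal sketch of such a program, assuming one bin per letter from 'a' to 'z' (the exact binning could differ):

```cpp
// Sequential letter histogram: safe without synchronization because only one
// "thread" (the host) updates the bins.
void histogramSequential(const char *data, int length, unsigned int *histo) {
    for (int b = 0; b < 26; ++b) histo[b] = 0;
    for (int i = 0; i < length; ++i) {
        int bin = data[i] - 'a';  // map 'a'..'z' to bins 0..25
        if (bin >= 0 && bin < 26) {
            histo[bin]++;
        }
    }
}
```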
Table of Contents
- Differential Equations
- Stencils
- Example: Basic Stencil
- Tiled Stencil
- Thread Coarsening
- Register Tiling
- Summary
- Questions

Key points:
- Used in differential equations
- Frequently use higher precision
- Some similarity to convolutions

Differential Equations
Solving a continuous problem numerically requires discretizing the data or equations. This is fundamental in numerical analysis, where differential equations must be approximated.
Structured grids are used in finite-difference methods for solving partial differential equations (PDEs). Approximate derivatives can be computed point-wise by considering the neighbors on the grid. As we saw earlier in this course, a grid representation is a natural way to think about data parallelism.
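As a small illustration (names are made up), the kernel below applies a point-wise central-difference stencil on a 1D grid with spacing h, approximating f'(x_i) as (f[i+1] - f[i-1]) / (2h):

```cpp
// Sketch: 1D central-difference stencil; each thread updates one interior point
// using only its immediate grid neighbors. Boundary points are left unchanged.
__global__ void centralDifference(const float *f, float *df, int n, float h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        df[i] = (f[i + 1] - f[i - 1]) / (2.0f * h);
    }
}
```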
Table of Contents
- Convolution
- Properties of Convolutions
- Implementing a Convolution Kernel
- Constant Memory and Caching
- Tiled Convolutions
- Caching the Halo Cells

This pattern involves tiling and input data staging.
Recall from Lab 1, where we implemented a kernel to blur an image. That kernel computed each output pixel as the weighted average of the input pixels in a window centered on the output pixel's location. When we implemented it, we set the weight of every pixel to 1. Whether you were aware of it or not, you implemented a convolution between an input image and a kernel of weights.
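As a rough sketch of the general case, the kernel below computes a 2D convolution where each thread produces one output pixel; FILTER_RADIUS and the constant-memory filter F are illustrative choices, and making every weight equal recovers the blur from Lab 1 up to normalization.

```cpp
// Sketch: basic 2D convolution with a (2r+1) x (2r+1) filter in constant memory.
#define FILTER_RADIUS 1
__constant__ float F[2 * FILTER_RADIUS + 1][2 * FILTER_RADIUS + 1];

__global__ void convolution2D(const float *in, float *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        float sum = 0.0f;
        for (int fr = -FILTER_RADIUS; fr <= FILTER_RADIUS; ++fr) {
            for (int fc = -FILTER_RADIUS; fc <= FILTER_RADIUS; ++fc) {
                int r = row + fr;
                int c = col + fc;
                if (r >= 0 && r < height && c >= 0 && c < width) {  // skip ghost cells
                    sum += F[fr + FILTER_RADIUS][fc + FILTER_RADIUS] * in[r * width + c];
                }
            }
        }
        out[row * width + col] = sum;
    }
}
```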
Table of Contents
- Overview of Nsight
- Getting Started with Nsight
- Case Study: Matrix Multiplication
- Tips and Best Practices
- OCL Notes

Overview of Nsight
NVIDIA Nsight Compute is a profiling tool for CUDA kernels. It features an expert system that can help you identify performance bottlenecks in your code, and it is essential for methodically optimizing your code. These notes will cover the basics of using Nsight Compute to profile your CUDA applications.