Computer Science

Priority Queues

Table of Contents: Quick Facts · Introduction · Implementation · Exercises

Quick Facts
- Time Complexity: \(O(\lg n)\)
- Space Complexity: \(O(n)\)

Introduction
Besides being the primary data structure for Heapsort, a heap is also used to implement a priority queue. A priority queue is a key-value data structure in which the keys determine the priority of each element in the queue. There are two variants, maximum and minimum, and they support operations such as inserting an element, reading the maximum (or minimum), extracting it, and increasing (or decreasing) a key.
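As a concrete sketch (my own illustration, not code from these notes), the following array-backed binary max-heap is the standard structure behind a max-priority queue; both operations shown run in \(O(\lg n)\), matching the Quick Facts above.

```cuda
#include <cstdio>
#include <utility>
#include <vector>

// Minimal binary max-heap backed by a std::vector.
struct MaxHeap {
    std::vector<int> a;

    // Insert: append the key, then sift it up toward the root -- O(lg n).
    void insert(int key) {
        a.push_back(key);
        size_t i = a.size() - 1;
        while (i > 0 && a[(i - 1) / 2] < a[i]) {
            std::swap(a[(i - 1) / 2], a[i]);
            i = (i - 1) / 2;
        }
    }

    // Extract-max: move the last leaf to the root, then sift down -- O(lg n).
    int extract_max() {
        int max = a.front();
        a.front() = a.back();
        a.pop_back();
        size_t i = 0;
        for (;;) {
            size_t l = 2 * i + 1, r = 2 * i + 2, largest = i;
            if (l < a.size() && a[l] > a[largest]) largest = l;
            if (r < a.size() && a[r] > a[largest]) largest = r;
            if (largest == i) break;
            std::swap(a[i], a[largest]);
            i = largest;
        }
        return max;
    }
};

int main() {
    MaxHeap pq;
    int keys[] = {3, 9, 1, 7};
    for (int k : keys) pq.insert(k);
    printf("%d\n", pq.extract_max());  // prints 9
}
```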

Master Theorem

Table of Contents: Example: Merge Sort · Example: Matrix Multiplication · Example: Median Finding · Example: Cormen et al. Exercise 4.5-2

In the study of Divide and Conquer Algorithms, a recurrence tree can be used to determine the runtime complexity. These notes focus on the master theorem, a blueprint for solving any recurrence of the form \[ T(n) = aT(n/b) + f(n). \]
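As a quick worked illustration (my own addition, not text from the notes), consider merge sort's recurrence:

\[
T(n) = 2T(n/2) + \Theta(n), \qquad a = 2,\; b = 2,\; f(n) = \Theta(n).
\]

Since \(f(n) = \Theta(n^{\log_b a}) = \Theta(n^{\log_2 2}) = \Theta(n)\), the second case of the master theorem applies, giving

\[
T(n) = \Theta(n^{\log_b a} \lg n) = \Theta(n \lg n).
\]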

Divide and Conquer Algorithms

Table of Contents: Definition · Solving Recurrences · Example: Merge Sort · Example: Multiplying Square Matrices · Example: Convex Hull · Example: Median Search

Definition
Divide and conquer algorithms are a class of algorithms that solve a problem by breaking it into smaller subproblems, solving the subproblems recursively, and then combining the solutions to the subproblems to form a solution to the original problem. Problems that can be solved in this manner are typically highly parallelizable. These notes investigate a few examples of classic divide and conquer algorithms and their analysis.
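To make the divide, conquer, and combine steps concrete, here is a minimal merge sort sketch (my own illustration; the function names are arbitrary): the divide step splits the range in half, the recursion solves each half, and the merge step combines the two sorted halves.

```cuda
#include <cstdio>
#include <vector>

// Combine: merge the two sorted halves a[lo..mid) and a[mid..hi).
void merge(std::vector<int>& a, int lo, int mid, int hi) {
    std::vector<int> tmp;
    tmp.reserve(hi - lo);
    int i = lo, j = mid;
    while (i < mid && j < hi)
        tmp.push_back(a[i] <= a[j] ? a[i++] : a[j++]);
    while (i < mid) tmp.push_back(a[i++]);
    while (j < hi)  tmp.push_back(a[j++]);
    for (int k = lo; k < hi; ++k) a[k] = tmp[k - lo];
}

// Divide the range in half, conquer each half recursively, then combine.
void merge_sort(std::vector<int>& a, int lo, int hi) {
    if (hi - lo < 2) return;          // base case: 0 or 1 elements
    int mid = lo + (hi - lo) / 2;     // divide
    merge_sort(a, lo, mid);           // conquer left half
    merge_sort(a, mid, hi);           // conquer right half
    merge(a, lo, mid, hi);            // combine
}

int main() {
    std::vector<int> a = {5, 2, 8, 1, 9, 3};
    merge_sort(a, 0, (int)a.size());
    for (int x : a) printf("%d ", x);  // 1 2 3 5 8 9
    printf("\n");
}
```

Note that the two recursive calls are independent, which is exactly why this class of algorithms parallelizes well.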

GPU Pattern: Stencils

Table of Contents: Differential Equations · Stencils · Example: Basic Stencil · Tiled Stencil · Thread Coarsening · Register Tiling · Summary · Questions

- Used in differential equations
- Frequently use higher precision
- Some similarity to convolutions

Differential Equations
Any computational problem requires discretization of data or equations so that it can be solved numerically. This is fundamental in numerical analysis, where differential equations need to be approximated.
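As an illustrative sketch (assuming a simple 1D three-point stencil; the kernel name and coefficients are my own, not from the notes), each thread computes one output point from a small neighborhood of input points. The coefficients \((1, -2, 1)\) in the example launch approximate a second derivative, the building block of many finite-difference PDE solvers.

```cuda
// 1D three-point stencil: out[i] = c0*in[i-1] + c1*in[i] + c2*in[i+1].
// Boundary points are left untouched; a real solver would apply
// boundary conditions there instead.
__global__ void stencil1D(const float* in, float* out, int n,
                          float c0, float c1, float c2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < n - 1) {
        out[i] = c0 * in[i - 1] + c1 * in[i] + c2 * in[i + 1];
    }
}

// Example host-side launch for n points:
// int threads = 256;
// int blocks = (n + threads - 1) / threads;
// stencil1D<<<blocks, threads>>>(d_in, d_out, n, 1.0f, -2.0f, 1.0f);
```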

GPU Pattern: Convolution

Table of Contents: Convolution · Properties of Convolutions · Implementing a Convolution Kernel · Constant Memory and Caching · Tiled Convolutions · Caching the Halo Cells

This pattern involves tiling and input data staging. Recall Lab 1, where we implemented a kernel to blur an image. This kernel produced each output pixel by computing the weighted average of the input pixels centered on that output pixel's location. When we implemented this, we set the weight of every pixel to 1. Whether you were aware of it or not, you implemented a convolution between an input image and a weight kernel.
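A minimal, unoptimized sketch of such a kernel (my own illustration; the names and the 3x3 filter size are assumptions, not the notes' exact code) might look like the following. Setting every filter weight to \(1/9\) would reproduce the Lab 1 box blur.

```cuda
#define FILTER_RADIUS 1  // 3x3 filter, chosen for illustration

// Naive 2D convolution: each thread computes one output pixel as the
// weighted sum of the input pixels under the filter window.
__global__ void conv2D(const float* in, const float* filter, float* out,
                       int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    float sum = 0.0f;
    for (int fr = -FILTER_RADIUS; fr <= FILTER_RADIUS; ++fr) {
        for (int fc = -FILTER_RADIUS; fc <= FILTER_RADIUS; ++fc) {
            int r = row + fr, c = col + fc;
            // Skip ghost cells that fall outside the image.
            if (r >= 0 && r < height && c >= 0 && c < width) {
                sum += in[r * width + c] *
                       filter[(fr + FILTER_RADIUS) * (2 * FILTER_RADIUS + 1) +
                              (fc + FILTER_RADIUS)];
            }
        }
    }
    out[row * width + col] = sum;
}
```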

GPU Performance Basics

Table of Contents: Memory Coalescing · Hiding Memory Latency · Thread Coarsening · Optimization Checklist · Identifying Bottlenecks · The Takeaway

These notes are on “Chapter 6: Performance Considerations” from the book Programming Massively Parallel Processors (Hwu, Kirk, and El Hajj 2022).

Memory Coalescing
Global memory accesses are one of the largest bottlenecks in GPU applications. DRAM has high latency by design: each cell consists of a transistor and a capacitor, and a charged capacitor represents a 1. Detecting the charge in these cells takes on the order of tens of nanoseconds. DRAM can, however, read consecutive groups of cells via bursts, so data that is stored consecutively can be accessed within the same burst. Contrast that with random access, where the DRAM must issue multiple bursts to read the required data. Memory coalescing refers to optimizing our global memory accesses to take advantage of DRAM bursts.
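As a hedged sketch of the difference (the kernel names are mine), compare a copy kernel whose per-warp accesses fall within one burst against one whose accesses are strided:

```cuda
// Coalesced: consecutive threads in a warp access consecutive addresses,
// so a single DRAM burst can serve the whole warp.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: consecutive threads access addresses `stride` elements
// apart, so the hardware needs multiple bursts to serve one warp.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (int)(((long long)i * stride) % n);  // scatter the reads
    if (i < n) out[i] = in[j];
}
```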

CUDA Memory Architecture

Table of Contents: Introduction · Memory Access · Memory Types · Tiling · Example: Tiled Matrix Multiplication · Boundary Checking · Memory Use and Occupancy · Dynamically Changing the Block Size · The Takeaway

Introduction
So far, the kernels we have used assume everything is in global memory. Even though there are thousands of cores that can effectively hide the latency of transferring data to and from global memory, we will see that this latency becomes a bottleneck in many applications. These notes explore the different types of memory available on the GPU and how to use them effectively.
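As a preview of the tiled matrix multiplication example, here is a standard shared-memory tiling sketch (my own version, assuming square matrices and a 16x16 tile, not necessarily the notes' exact listing). Each element of A and B is read from global memory once per tile rather than once per thread that needs it.

```cuda
#define TILE_WIDTH 16  // tile size chosen for illustration

// Tiled matrix multiply C = A * B for width x width matrices.
__global__ void tiledMatMul(const float* A, const float* B, float* C,
                            int width) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (width + TILE_WIDTH - 1) / TILE_WIDTH; ++t) {
        // Cooperatively stage one tile of A and B into shared memory,
        // with boundary checks for widths that are not tile multiples.
        int aCol = t * TILE_WIDTH + threadIdx.x;
        int bRow = t * TILE_WIDTH + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < width && aCol < width) ? A[row * width + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < width && col < width) ? B[bRow * width + col] : 0.0f;
        __syncthreads();  // wait until the tile is fully staged

        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before overwriting the tile
    }
    if (row < width && col < width) C[row * width + col] = sum;
}
```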

CUDA Architecture

Table of Contents: Architecture · Block Scheduling · Synchronization · Warps · Control Divergence · Warp Scheduling · Resource Partitioning · Dynamic Launch Configurations · The Takeaway

Architecture
A GPU consists of a chip composed of several streaming multiprocessors (SMs). Each SM has a number of cores that execute instructions in parallel. The H100, seen below, has 144 SMs (you can actually count them by eye). Each SM has 128 FP32 cores, for a total of 18,432 cores. Historically, CUDA devices have used GDDR memory, but newer architectures use high-bandwidth memory (HBM), which is closely integrated with the GPU for faster data transfer.
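As a small illustration (my own sketch, not from the notes), the CUDA runtime can report these per-device figures at runtime, which is also what a dynamic launch configuration would query:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query device 0 for its SM count and related limits.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Device:                %s\n", prop.name);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```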

Multidimensional Grids and Data

Table of Contents: Summary · Multidimensional Grid Organization · Example: Color to Grayscale · No longer embarrassing: overlapping data · Matrix Multiplication · What’s Next?

Summary
The CUDA programming model allows us to organize our data in a multidimensional grid. The purpose of this is primarily for our own convenience, but it also allows us to take advantage of the GPU’s memory hierarchy. In Lab 0, we only required a single dimension for our grid, as well as for each block, since the input was a vector. When performing computations on multidimensional data like matrices, we can match the dimensions of our launch configuration to the dimensions of our data.
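A minimal sketch of this idea (my own; the kernel and its parameters are illustrative) maps a 2D launch configuration onto a matrix so that each thread handles one element:

```cuda
// Each thread scales one element of a height x width matrix; the 2D
// launch configuration mirrors the 2D layout of the data.
__global__ void scale2D(float* M, int width, int height, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        M[row * width + col] *= s;  // row-major linearized index
}

// Host-side launch: round the grid up so every element is covered.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x,
//           (height + block.y - 1) / block.y);
// scale2D<<<grid, block>>>(d_M, width, height, 2.0f);
```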

Heterogeneous Data Parallel Computing

Table of Contents: Key Concepts · Summary · CUDA C Programs · Example: Vector Addition · Error Checking

Key Concepts
- Task Parallelism vs. Data Parallelism
- kernels
- threads
- grids
- blocks
- global memory
- data transfer
- error checking
- compilation of CUDA programs

Summary
This topic introduces the basics of data parallelism and CUDA programming. The most important concept is that data parallelism is achieved through independent computations on each sample or group of samples. The basic structure of a CUDA C program consists of writing a kernel that is executed independently on many threads. Memory must be allocated on the GPU device before transferring the data from the host machine (CPU). Upon completion of the kernel, the results must be transferred back to the host.
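A minimal sketch of that structure, along the lines of the section's vector addition example (my own code, not necessarily the notes' exact listing):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host data.
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes),
          *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // 1. Allocate device memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // 2. Transfer inputs host -> device.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the kernel.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 4. Transfer the result device -> host (synchronizes with the kernel).
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```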

MapReduce

Table of Contents: What is MapReduce? · Hadoop Distributed File System (HDFS) · MapReduce Overview · Hadoop v2 AKA YARN · Summary

These are my personal notes from the book Fundamentals of Database Systems by Elmasri and Navathe (2015). I highly recommend reading the original source material. This article should serve only as a brief overview of the topic.

Distributed Databases

Table of Contents: Overview · Data Fragmentation · Data Replication · Data Concurrency

Distributed systems excel at partitioning large problems into smaller chunks that can be processed in parallel. This requires parallel thinking instead of serial thinking, and some serial algorithms and solutions adapt to parallel applications more easily than others. Distributed solutions are the natural next step in scaling up a system. In the context of databases, the main challenges relate to data distribution, replication, distributed transactions, distributed metadata management, and distributed query processing.