Why a GPU rasterizer matters for computational lithography in both performance and precision at scale

By Loay Hegazy, Mohamed Taher, Sherif Hammouda

Rasterization—converting continuous geometric shapes into discrete pixel grids—is fundamental to computational lithography. In optical proximity correction (OPC) and mask synthesis workflows, rasterization must achieve both speed and nanometer-scale precision.

Traditional CPU-based rasterizers process workloads sequentially, creating a bottleneck in the design cycle. GPUs offer massive parallelism, but applying them to lithography rasterization presents challenges: maintaining floating-point precision, preserving sub-pixel connectivity and handling irregular memory access patterns.

Based on work originally presented at SC25, the International Conference for High Performance Computing, Networking, Storage and Analysis, this article describes a GPU rasterizer designed specifically for computational lithography and presents benchmark results and practical implications for mask synthesis workflows.

Fractional pixel coverage: the lithography rasterization challenge

Rasterization in graphics typically uses a binary coverage model: pixels are either fully covered or not. This suffices for visual rendering, but lithography demands fractional pixel coverage (figure 1). Light intensity and resist behavior depend on the precise area of each pixel occupied by a polygon—a requirement that becomes critical at nanometer scales.

An image showing the conversion of a solid black parallelogram on a grid into a pixel-based representation. An arrow points from the smooth parallelogram to a pixelated version, where some pixels are fully black, and boundary pixels are shaded in varying degrees of gray to represent fractional coverage
Figure 1. Example of converting a polygon into a pixel-based representation through rasterization

Consider a polygon edge that crosses a pixel diagonally, splitting it in half. A binary model must mark the pixel as either fully covered or empty, and either choice is off by 50% of the pixel area. A fractional model computes the exact overlap area, capturing the physics of the lithographic process. For thin features and sub-wavelength structures, this precision directly impacts manufacturability.
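To make the diagonal-edge case concrete, here is a minimal CPU-side sketch in Python. It assumes unit-sized pixels and computes exact coverage by clipping the polygon to the pixel (Sutherland-Hodgman) and applying the shoelace formula. The function names are illustrative, and this is not necessarily the method used in the GPU rasterizer described here.

```python
def clip_to_pixel(poly, x0, y0, x1, y1):
    """Clip a polygon (list of (x, y)) to an axis-aligned pixel (Sutherland-Hodgman)."""
    def clip(points, inside, intersect):
        out = []
        for i, cur in enumerate(points):
            prev = points[i - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def x_cut(x):
        return lambda p, q: (x, p[1] + (q[1] - p[1]) * (x - p[0]) / (q[0] - p[0]))

    def y_cut(y):
        return lambda p, q: (p[0] + (q[0] - p[0]) * (y - p[1]) / (q[1] - p[1]), y)

    pts = list(poly)
    for inside, cut in (
        (lambda p: p[0] >= x0, x_cut(x0)),
        (lambda p: p[0] <= x1, x_cut(x1)),
        (lambda p: p[1] >= y0, y_cut(y0)),
        (lambda p: p[1] <= y1, y_cut(y1)),
    ):
        pts = clip(pts, inside, cut)
    return pts


def shoelace_area(pts):
    """Unsigned polygon area via the shoelace formula."""
    twice_area = 0.0
    for i in range(len(pts)):
        xa, ya = pts[i - 1]
        xb, yb = pts[i]
        twice_area += xa * yb - xb * ya
    return abs(twice_area) / 2.0


def pixel_coverage(poly, col, row):
    """Fraction of the unit pixel at (col, row) covered by the polygon."""
    return shoelace_area(clip_to_pixel(poly, col, row, col + 1, row + 1))


# A triangle whose hypotenuse crosses pixel (0, 0) from corner to corner.
triangle = [(0, 0), (2, 0), (2, 2)]
print(pixel_coverage(triangle, 0, 0))  # half-covered pixel
print(pixel_coverage(triangle, 1, 0))  # fully covered pixel
```

A binary model can only report 0 or 1 for pixel (0, 0), whereas the fractional result keeps the half-covered area.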

The challenge is computing fractional coverage for billions of pixels while maintaining connectivity between sub-pixel geometries—ensuring that thin features do not fragment into disconnected regions during pixelization.

GPU rasterizer algorithm: five-stage parallel architecture

Our GPU rasterizer decomposes the problem into five sequential stages, each designed for parallel execution on thousands of GPU threads.

Stage 1: Initialization and polygon assignment

The algorithm begins by zeroing all pixels in the output grid and reserving shared memory for polygon data. Polygons are assigned to thread blocks based on spatial location, enabling simultaneous processing of multiple polygons across the GPU.

A coarse-grained bounding box approach identifies which polygons might overlap specific regions. This pruning step is essential: it prevents threads from evaluating pixel-polygon pairs that cannot possibly intersect, reducing wasted computation.
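The pruning idea can be sketched on the CPU as a simple spatial binning pass. The Python below is an assumption about how such a step might look, not the authors' implementation: `tile_size` and the dictionary of bins are hypothetical stand-ins for whatever region-to-thread-block mapping the real rasterizer uses.

```python
from collections import defaultdict


def bounding_box(poly):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a vertex list."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def bin_polygons(polygons, tile_size):
    """Map each tile (tx, ty) to the indices of polygons whose bounding box touches it.

    The test is conservative: a polygon may be listed for a tile it does not
    actually cover, but it is never missing from a tile it does cover.
    """
    bins = defaultdict(list)
    for index, poly in enumerate(polygons):
        xmin, ymin, xmax, ymax = bounding_box(poly)
        for ty in range(int(ymin // tile_size), int(ymax // tile_size) + 1):
            for tx in range(int(xmin // tile_size), int(xmax // tile_size) + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


polygons = [
    [(1, 1), (3, 1), (3, 3), (1, 3)],    # fits inside tile (0, 0)
    [(6, 1), (10, 1), (10, 3), (6, 3)],  # straddles tiles (0, 0) and (1, 0)
]
for tile, members in sorted(bin_polygons(polygons, tile_size=8).items()):
    print(tile, members)
```

Tiles that appear in no bin never have to be visited, which is where the saved work comes from.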

Stage 2: Bounding box calculation

For each assigned polygon, a precise bounding box is computed from vertex coordinates. This defines the minimal region requiring detailed rasterization, limiting computation to relevant pixels.

The bounding box approach introduces a tradeoff: it may include extra area outside the polygon (as shown in figure 2), creating a collision zone that requires additional processing. However, this overhead is acceptable because it enables efficient spatial pruning and coalesced memory access.

A diagram illustrating an irregular five-sided polygon enclosed within a dashed rectangular bounding box
Figure 2. Example of computing the bounding box for a polygon
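The tradeoff can be quantified with a short sketch. The Python below assumes a pixel pitch of `pixel_size` and a grid of `grid_w` by `grid_h` pixels (both hypothetical parameters). It snaps the bounding box outward to whole pixels, then compares the polygon area with the box area to show how much of the scanned region lies outside the shape.

```python
import math


def pixel_bbox(poly, pixel_size, grid_w, grid_h):
    """Pixel-aligned bounding box as (col0, row0, col1, row1), upper bounds exclusive."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    col0 = max(0, math.floor(min(xs) / pixel_size))
    row0 = max(0, math.floor(min(ys) / pixel_size))
    col1 = min(grid_w, math.ceil(max(xs) / pixel_size))
    row1 = min(grid_h, math.ceil(max(ys) / pixel_size))
    return col0, row0, col1, row1


def polygon_area(poly):
    """Unsigned polygon area via the shoelace formula."""
    twice_area = 0.0
    for i in range(len(poly)):
        xa, ya = poly[i - 1]
        xb, yb = poly[i]
        twice_area += xa * yb - xb * ya
    return abs(twice_area) / 2.0


# An irregular five-sided polygon on an 8x8 grid of unit pixels (illustrative values).
pentagon = [(1.2, 0.5), (6.8, 1.0), (7.5, 4.2), (4.0, 6.6), (0.6, 3.9)]
col0, row0, col1, row1 = pixel_bbox(pentagon, pixel_size=1.0, grid_w=8, grid_h=8)
box_pixels = (col1 - col0) * (row1 - row0)
fill_ratio = polygon_area(pentagon) / box_pixels
print((col0, row0, col1, row1), box_pixels, round(fill_ratio, 3))
```

A fill ratio well below 1 means many scanned pixels end up classified as outside, which is the overhead the bounding box approach accepts in exchange for simple, regular access.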

Stage 3: Thread-pixel allocation

GPU threads are dynamically allocated to individual pixels or small groups of pixels within each polygon’s bounding box. This fine-grained parallelism is the key to GPU efficiency. Each thread independently determines its pixel’s contribution to coverage, with minimal inter-thread synchronization.

A fixed-size thread block processes pixels in segments, scanning the bounding box in a coalesced memory access pattern. For example, a 1D thread block of 4 threads processes the first 4 pixels, then shifts to the next 4, continuing until the entire bounding box is scanned (figure 3). This pattern maximizes memory bandwidth utilization, a critical factor for GPU performance.

A two-part image showing the rasterization of an L-shaped polygon. The left side shows an 8x8 grid with a bounding box around an L-shaped polygon. Above the polygon are boxes that represent thread blocks
Figure 3. Example of the output grid when rasterizing a simple L-shape using a block of threads
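The segment-by-segment scan can be simulated on the CPU. The Python sketch below assumes a row-major sweep of the bounding box and a 1D block of `block_dim` threads. The tuple layout and function name are illustrative, and a real kernel would run the inner loop as parallel threads rather than sequentially.

```python
def block_scan(bbox, block_dim=4):
    """Yield (step, thread_id, col, row) for a 1D thread block sweeping a bounding box.

    bbox is (col0, row0, col1, row1) with exclusive upper bounds. In each step,
    consecutive thread ids touch consecutive pixels, which is the pattern that
    lets a GPU coalesce the memory accesses.
    """
    col0, row0, col1, row1 = bbox
    width = col1 - col0
    total = width * (row1 - row0)
    for base in range(0, total, block_dim):   # one iteration per segment
        for tid in range(block_dim):          # parallel threads on a real GPU
            linear = base + tid
            if linear < total:                # guard for the final partial segment
                yield base // block_dim, tid, col0 + linear % width, row0 + linear // width


for step, tid, col, row in block_scan((1, 1, 6, 4)):
    print(f"step {step}: thread {tid} -> pixel ({col}, {row})")
```

The guard on the last segment matters because the bounding box size is rarely a multiple of the block size.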

Stage 4: Pixel classification

For each pixel, the algorithm classifies it as inside, outside, or on the boundary of the polygon:

  • Outside pixels remain at their initialized value of zero.
  • Inside pixels are set to 1.0, indicating full coverage.
  • Boundary pixels require detailed computation. The polygon edge intersecting the pixel is analyzed to calculate the trapezoidal area formed by the edge and pixel boundary. This area, computed using floating-point arithmetic, represents the fractional coverage.
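The three-way classification above can be sketched as follows. This Python version assumes unit pixels and uses a segment-clipping test (Liang-Barsky style) to detect boundary pixels, plus an even-odd ray cast at the pixel center for the rest. These are standard techniques chosen for illustration, not necessarily those of the GPU implementation. The trapezoid helper covers only the simplest case: an edge that crosses both vertical sides of the pixel with the polygon below it.

```python
def edge_crosses_interior(p, q, x0, y0, x1, y1):
    """True if segment p-q passes through the open interior of the pixel."""
    t0, t1 = 0.0, 1.0
    for delta, lo, hi, start in (
        (q[0] - p[0], x0, x1, p[0]),
        (q[1] - p[1], y0, y1, p[1]),
    ):
        if delta == 0:
            if not (lo < start < hi):
                return False
        else:
            ta, tb = (lo - start) / delta, (hi - start) / delta
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
    return t0 < t1


def point_in_polygon(poly, x, y):
    """Even-odd ray casting test."""
    inside = False
    for i in range(len(poly)):
        (xa, ya), (xb, yb) = poly[i - 1], poly[i]
        if (ya > y) != (yb > y) and x < xa + (xb - xa) * (y - ya) / (yb - ya):
            inside = not inside
    return inside


def classify_pixel(poly, col, row):
    """Return 'inside', 'outside' or 'boundary' for the unit pixel at (col, row)."""
    for i in range(len(poly)):
        if edge_crosses_interior(poly[i - 1], poly[i], col, row, col + 1, row + 1):
            return "boundary"
    return "inside" if point_in_polygon(poly, col + 0.5, row + 0.5) else "outside"


def trapezoid_coverage(h_left, h_right):
    """Coverage of a unit pixel below an edge that enters the left side at height
    h_left and leaves the right side at height h_right (both in [0, 1])."""
    return 0.5 * (h_left + h_right)


# Quadrilateral whose top edge rises from (0, 1) to (4, 2), i.e. y = 1 + x / 4.
quad = [(0, 0), (4, 0), (4, 2), (0, 1)]
print(classify_pixel(quad, 1, 0))      # below the sloped edge
print(classify_pixel(quad, 1, 1))      # cut by the sloped edge
print(classify_pixel(quad, 1, 2))      # above the sloped edge
print(trapezoid_coverage(0.25, 0.5))   # edge heights in pixel (1, 1)
```

Edges that run exactly along a pixel border do not count as crossing the interior, so grid-aligned Manhattan shapes produce only inside and outside pixels in this sketch.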

Stage 5: Atomic operations and connectivity preservation

When multiple polygons overlap a single pixel, atomic operations ensure correct accumulation of coverage values, a crucial capability for achieving nanometer-scale accuracy and smooth sub-pixel rendering. Our algorithm uses floating-point atomics to handle concurrent writes from different thread blocks, maintaining precision and preventing race conditions.
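The role of the atomic add can be illustrated with a CPU analogy. In the Python sketch below, a lock-protected `atomic_add` method stands in for a GPU floating-point atomic, and two worker threads stand in for thread blocks rasterizing different polygons. The class and its methods are hypothetical and exist only for this illustration.

```python
import threading


class CoverageGrid:
    """Row-major coverage grid whose updates are serialized per call."""

    def __init__(self, width, height):
        self.width = width
        self.data = [0.0] * (width * height)
        self._lock = threading.Lock()

    def atomic_add(self, col, row, value):
        # The read-modify-write below must not interleave with another writer,
        # otherwise one contribution could overwrite the other.
        with self._lock:
            self.data[row * self.width + col] += value

    def value(self, col, row):
        return self.data[row * self.width + col]


def deposit(grid, contributions):
    """Accumulate (col, row, coverage) contributions from one polygon."""
    for col, row, coverage in contributions:
        grid.atomic_add(col, row, coverage)


grid = CoverageGrid(4, 4)
polygon_a = [(1, 1, 0.25), (2, 1, 1.0)]   # both polygons touch pixel (1, 1)
polygon_b = [(1, 1, 0.5), (1, 2, 1.0)]
workers = [threading.Thread(target=deposit, args=(grid, c)) for c in (polygon_a, polygon_b)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(grid.value(1, 1))
```

One general property is worth keeping in mind: floating-point addition is not associative, so the order in which concurrent contributions arrive can change the last bits of a sum even when no update is lost.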

Connectivity preservation is achieved through careful handling of sub-pixel geometries. By computing exact fractional coverage rather than rounding to binary coverage, the algorithm prevents thin features from fragmenting during rasterization.
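A small experiment shows why fractional coverage helps here. The Python sketch below assumes unit pixels and models a thin wire, 0.3 pixels tall, with a small vertical jog halfway along its length. All coordinates are illustrative. It compares binary center sampling with exact rectangle overlap.

```python
def rect_coverage(rect, col, row):
    """Exact overlap area between an axis-aligned rectangle and a unit pixel."""
    x0, y0, x1, y1 = rect
    overlap_w = max(0.0, min(x1, col + 1) - max(x0, col))
    overlap_h = max(0.0, min(y1, row + 1) - max(y0, row))
    return overlap_w * overlap_h


def center_sample(rect, col, row):
    """Binary model: 1.0 if the pixel center lies inside the rectangle, else 0.0."""
    x0, y0, x1, y1 = rect
    return 1.0 if x0 <= col + 0.5 <= x1 and y0 <= row + 0.5 <= y1 else 0.0


# A 0.3-pixel-tall wire with a jog at x = 3; the two rectangles do not overlap.
wire = [(0.0, 1.3, 3.0, 1.6), (3.0, 1.1, 6.0, 1.4)]

binary_row = [max(center_sample(r, col, 1) for r in wire) for col in range(6)]
fractional_row = [sum(rect_coverage(r, col, 1) for r in wire) for col in range(6)]
print(binary_row)
print([round(v, 3) for v in fractional_row])
```

With center sampling, the second half of the wire disappears from row 1 and the feature is cut in two. The fractional row keeps a nonzero value in every pixel the wire touches.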

NVIDIA CUDA implementation and memory optimization

The implementation targets NVIDIA CUDA, leveraging GPU architecture for performance:

  • Memory optimization: Polygon vertices and edges are stored in data structures designed for coalesced memory access. Shared memory is sized to hold the polygon with the maximum vertex count, minimizing global memory latency for frequently accessed data.
  • Kernel design: Multiple CUDA kernels handle distinct pipeline stages. One kernel computes bounding boxes; another performs pixel classification and coverage calculation. This modular design allows independent optimization of each stage.
  • Load balancing: Dynamic load balancing mechanisms ensure GPU cores remain busy despite irregular polygon distributions. Work queues and adaptive thread allocation distribute computation evenly across the GPU.
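The article does not specify the data layout, so the Python below is only one plausible sketch of the memory-optimization bullet: a structure-of-arrays packing with an offsets table, plus a helper that sizes a shared-memory buffer from the largest polygon. The 4-byte coordinate size and all names are assumptions.

```python
def pack_polygons(polygons):
    """Flatten polygons into contiguous coordinate arrays plus an offsets table.

    Vertices of polygon i live at xs[offsets[i]:offsets[i + 1]] (same for ys).
    Keeping x and y in separate contiguous arrays lets neighbouring threads read
    neighbouring addresses.
    """
    xs, ys, offsets = [], [], [0]
    for poly in polygons:
        for x, y in poly:
            xs.append(float(x))
            ys.append(float(y))
        offsets.append(len(xs))
    max_vertices = max(
        (offsets[i + 1] - offsets[i] for i in range(len(polygons))), default=0
    )
    return xs, ys, offsets, max_vertices


def shared_bytes_needed(max_vertices, bytes_per_coord=4):
    """Bytes needed to stage the largest polygon (x and y arrays) in shared memory."""
    return 2 * max_vertices * bytes_per_coord


polygons = [
    [(0, 0), (1, 0), (0, 1)],
    [(0, 0), (2, 0), (2, 2), (1, 3), (0, 2)],
]
xs, ys, offsets, max_vertices = pack_polygons(polygons)
print(offsets, max_vertices, shared_bytes_needed(max_vertices))
```

Sizing the buffer from the largest polygon means a single allocation works for every polygon in the batch, at the cost of unused space for smaller ones.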

GPU rasterization benchmark results: 290x speedup for Manhattan shapes

The GPU rasterizer was evaluated on NVIDIA H100 GPUs against highly optimized CPU implementations. Test cases included both Manhattan (rectilinear) and curvilinear polygon datasets.

Manhattan geometries

For Manhattan shapes—axis-aligned polygons common in standard cell layouts and routing layers—the GPU rasterizer achieved speedups up to 290x compared to CPU implementations, as shown in figure 4.

Bar chart comparing CPU and GPU runtimes for different CPU:GPU configurations. For all configurations, GPU time is significantly lower than CPU time
Figure 4. CPU and GPU runtimes for Manhattan datasets. GPU achieved large speedups with pixel errors under 1% against CPU results

This substantial acceleration reflects the regularity of Manhattan geometries, whose axis-aligned edges keep per-pixel classification and coverage arithmetic simple.

Curvilinear geometries

For curvilinear shapes—arbitrary polygons with non-axis-aligned edges—the GPU rasterizer delivered speedups up to 45x, as shown in figure 5.

A bar chart comparing CPU and GPU runtimes in milliseconds (log scale) for different CPU:GPU configurations. Light blue bars represent CPU time and orange bars represent GPU time. For all configurations, GPU time is significantly lower than CPU time
Figure 5. CPU and GPU runtimes for curvilinear datasets. GPU achieved large speedups with pixel errors under 1% against CPU results

The lower speedup compared to Manhattan geometries reflects increased computational complexity: the edge equations are more complex, and evaluating whether a pixel lies inside a curvilinear polygon requires more arithmetic operations. However, a speedup of up to 45x remains substantial and demonstrates that GPU rasterization is effective even for intricate geometries.

Accuracy

Across all test cases, the GPU rasterizer achieved less than 1% absolute error compared to reference CPU calculations. This low error rate confirms that aggressive parallelization does not compromise precision—a critical requirement for nanometer-scale manufacturing.
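The article does not define the error metric precisely. One plausible reading, sketched in Python below, is the maximum absolute difference between corresponding pixels of the GPU and CPU coverage grids, with coverage values in [0, 1] so that 1% corresponds to 0.01. The grids here are made-up illustrative values, not benchmark data.

```python
def max_abs_error(reference, candidate):
    """Largest absolute per-pixel difference between two equally sized grids."""
    return max(
        abs(ref - cand)
        for ref_row, cand_row in zip(reference, candidate)
        for ref, cand in zip(ref_row, cand_row)
    )


def within_tolerance(reference, candidate, tolerance=0.01):
    """True if every pixel differs by less than the tolerance (assumed to be 1%)."""
    return max_abs_error(reference, candidate) < tolerance


# Illustrative values only.
cpu_grid = [[0.0, 0.375, 1.0], [0.0, 0.5, 1.0]]
gpu_grid = [[0.0, 0.3751, 1.0], [0.0, 0.4998, 1.0]]
print(max_abs_error(cpu_grid, gpu_grid), within_tolerance(cpu_grid, gpu_grid))
```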

The error analysis validates the floating-point approach: despite the complexity of parallel atomic operations and rounding in floating-point arithmetic, the algorithm maintains accuracy within acceptable tolerances for lithography simulation.

Practical benefits: GPU rasterization in OPC workflows

The performance gains translate directly to reduced turnaround time in OPC and mask synthesis:

Iteration speed: OPC is inherently iterative. Engineers adjust polygon edges, re-rasterize, simulate and analyze results. A 45–290x speedup in rasterization reduces the cycle time for each iteration, enabling more design variations to be explored within fixed time windows.

Scalability: Modern masks contain millions of polygons. The GPU rasterizer’s parallel architecture scales with polygon count and raster resolution, maintaining performance as designs grow in complexity.

Precision preservation: The <1% error rate ensures that GPU-accelerated rasterization can replace CPU implementations without sacrificing accuracy. This is essential for adoption in production workflows, where mask fidelity directly impacts yield.

Limitations and future directions for GPU rasterization deployment

While the benchmark results are compelling, several practical constraints merit consideration before deploying GPU rasterization in production. GPU memory is finite, and very large rasters or polygon datasets may exceed available capacity, necessitating tiling or out-of-core processing strategies that introduce additional complexity and potential performance overhead.
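A tiling strategy of the kind mentioned above might start from a planner like the following Python sketch. It assumes a fixed number of bytes per pixel and a memory budget for the output raster only, both hypothetical parameters. Polygon storage and any overlap margin between tiles are ignored for simplicity.

```python
import math


def plan_tiles(width, height, bytes_per_pixel, budget_bytes):
    """Split a width x height raster into tiles that each fit the memory budget.

    Returns a list of (x0, y0, tile_w, tile_h) in row-major order. Polygons that
    straddle a tile edge would need to be rasterized in every tile they touch.
    """
    max_pixels = budget_bytes // bytes_per_pixel
    if max_pixels < 1:
        raise ValueError("budget is smaller than one pixel")
    side = math.isqrt(max_pixels)          # roughly square tiles
    tile_w, tile_h = min(width, side), min(height, side)
    tiles = []
    for y0 in range(0, height, tile_h):
        for x0 in range(0, width, tile_w):
            tiles.append((x0, y0, min(tile_w, width - x0), min(tile_h, height - y0)))
    return tiles


for tile in plan_tiles(width=10, height=6, bytes_per_pixel=4, budget_bytes=64):
    print(tile)
```

Each tile can then be rasterized independently, which bounds peak memory at the price of extra launches and repeated work on polygons that span several tiles.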

The performance advantage also varies with geometry type. Designs with high curvilinear content—increasingly common in advanced nodes—will see lower acceleration, potentially limiting the benefit for certain mask types.

Additionally, the benchmarks measure rasterization time in isolation, which does not account for integration overhead in complete workflows. Data movement between CPU and GPU, kernel launch overhead and synchronization with other OPC tasks can reduce effective speedup when the GPU rasterizer is embedded in a larger design flow.
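The effect of embedding a fast kernel in a larger flow follows Amdahl's law. The Python sketch below assumes rasterization takes a fraction `raster_fraction` of the original iteration time, and that integration costs (transfers, launches, synchronization) add `overhead_fraction` of that original time. Both fractions are hypothetical inputs, not measured values.

```python
def end_to_end_speedup(raster_fraction, raster_speedup, overhead_fraction=0.0):
    """Overall speedup of a workflow when only the rasterization part is accelerated.

    raster_fraction: share of the original runtime spent rasterizing (0..1).
    raster_speedup: speedup of the rasterization step in isolation.
    overhead_fraction: added integration cost, as a share of the original runtime.
    """
    new_time = (1.0 - raster_fraction) + raster_fraction / raster_speedup + overhead_fraction
    return 1.0 / new_time


# Hypothetical workflow in which rasterization is half of each iteration.
print(end_to_end_speedup(0.5, 290))
print(end_to_end_speedup(0.5, 290, overhead_fraction=0.05))
```

Under these assumptions the overall gain is capped by the part of the flow that is not accelerated, no matter how large the isolated speedup is.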

We identify several avenues for future work, including integration into existing OPC and mask synthesis tools, combining CPU and GPU processing and extending rasterization for advanced lithography and 3D ICs.

Conclusion: making the GPU rasterizer the standard for advanced node mask synthesis

This work demonstrates that GPU acceleration is viable for computational lithography rasterization, achieving speedups of up to 290x for Manhattan geometries and up to 45x for curvilinear shapes while maintaining <1% error. The GPU-friendly algorithm, which combines bounding box pruning, fine-grained thread allocation and floating-point precision, addresses the core challenges of parallelizing rasterization.

For IC manufacturers, the practical implication is clear: GPU-accelerated rasterization can significantly reduce OPC turnaround time without sacrificing mask accuracy. As GPU hardware continues to evolve and integration into production tools matures, this approach is likely to become standard in advanced node mask synthesis.

The work also highlights the importance of algorithm design for GPU execution. Naive parallelization often fails; success requires careful attention to memory access patterns, load balancing and precision. This lesson extends beyond rasterization to other computationally intensive EDA tasks.

Ready to accelerate your mask synthesis? Siemens EDA’s GPU-powered OPC solution is available now. Learn more in our technical paper, A massively parallel GPU rasterizer for next-generation computational lithography.

This article first appeared on the Siemens Digital Industries Software blog at https://blogs.sw.siemens.com/calibre/2026/04/13/why-a-gpu-rasterizer-matters-for-computational-lithography-in-both-performance-and-precision-at-scale/