Built for 4K at 60fps

PaintFE is engineered from the ground up for speed at high resolution. Every architectural decision, from memory layout to GPU compute pipelines, was made so that a brush stroke on a 4096×4096 canvas feels the same as painting on a tiny thumbnail.

~1MB: readback per brush stroke at 4K (vs ~33MB for a full copy)
~0: cost to snapshot a 4K canvas with COW tiles (~36KB instead of ~64MB)
0: per-pixel sqrt() calls during brush strokes (a LUT replaces them all)
GPU: blur, gradients, liquify, and mesh warp are all compute-shader accelerated

Hybrid CPU + GPU

PaintFE uses a hybrid rendering strategy. Interactive brush previews stay on the CPU for zero-latency responsiveness. Heavy compositing and filter ops move to GPU compute shaders via wgpu (a WebGPU-compatible API with native backends).

Per-Frame Compositing Path
Layer Store (TiledImage/COW) → GPU Upload (wgpu texture) → WGSL Compositor (blend + merge) → Dirty Readback (sub-region only) → bytemuck cast (zero-copy) → Partial Upload (set_partial()) → Display
Brush Stroke Path
Input Event → LUT Alpha (no sqrt()) → Chunk-level COW (Arc::make_mut()) → Incremental Cache (dirty rect only) → set_partial() (~6KB, not ~33MB) → Display

Core Technologies

GPU
wgpu 0.20 (WebGPU)
GPU compositing and compute shaders using WGSL. Runs on Vulkan, Metal, and DirectX 12, with an OpenGL ES fallback. All blend modes, gradient rasters, liquify warp, and mesh warp run as compute dispatches.
CPU
rayon 1.7 (CPU Parallelism)
Composite operations, filter cores, and flip/rotate are all parallelized via rayon. A row-level parallel composite on a 4K canvas uses all available CPU cores simultaneously without extra allocations.
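A minimal sketch of the row-parallel idea, using std::thread::scope as a stand-in for rayon's parallel iterators (the real code uses rayon; the blend here is a placeholder, not one of PaintFE's blend modes):

```rust
use std::thread;

/// Row-parallel composite sketch: each thread owns a disjoint band of
/// rows, so no locking is needed and nothing is allocated per pixel.
fn composite_parallel(dst: &mut [u8], src: &[u8], row_bytes: usize, threads: usize) {
    assert_eq!(dst.len(), src.len());
    let rows = dst.len() / row_bytes;
    let band = rows.div_ceil(threads) * row_bytes;
    thread::scope(|s| {
        for (d, sr) in dst.chunks_mut(band).zip(src.chunks(band)) {
            s.spawn(move || {
                // "Blend": max() as a placeholder for a real blend mode.
                for (a, b) in d.iter_mut().zip(sr) {
                    *a = (*a).max(*b);
                }
            });
        }
    });
}

fn main() {
    let mut dst = vec![10u8; 4 * 4 * 4]; // 4×4 RGBA background
    let src = vec![200u8; 4 * 4 * 4];
    composite_parallel(&mut dst, &src, 4 * 4, 4);
    assert!(dst.iter().all(|&b| b == 200));
}
```

With rayon the band loop becomes a par_chunks_mut call; the ownership structure (one thread, one band) is the same.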
0cpy
bytemuck (Zero-Copy Casting)
Raw bytes from GPU readback are cast directly to Color32 slices via bytemuck::cast_slice. At 4K (8.3M pixels), that is a pointer reinterpretation with effectively zero CPU cost, versus a full per-pixel conversion loop.
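A hand-rolled illustration of what bytemuck::cast_slice does (bytemuck performs the same reinterpretation behind a safe, checked API; the Color32 type here is a stand-in, not PaintFE's actual pixel type):

```rust
/// Stand-in pixel type: 4 RGBA bytes, size 4, alignment 1, so a byte
/// slice can be reinterpreted in place with no copy.
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(transparent)]
struct Color32([u8; 4]);

/// Reinterpret readback bytes as pixels without copying.
fn bytes_as_pixels(bytes: &[u8]) -> &[Color32] {
    assert_eq!(bytes.len() % 4, 0, "readback length must be whole pixels");
    // SAFETY: Color32 is repr(transparent) over [u8; 4] (align 1), so
    // every 4-byte window of the input is a valid Color32.
    let (head, pixels, tail) = unsafe { bytes.align_to::<Color32>() };
    debug_assert!(head.is_empty() && tail.is_empty());
    pixels
}

fn main() {
    // Two RGBA pixels straight from a (pretend) GPU readback buffer.
    let raw = [255u8, 0, 0, 255, 0, 255, 0, 128];
    let pixels = bytes_as_pixels(&raw);
    assert_eq!(pixels.len(), 2);
    assert_eq!(pixels[0], Color32([255, 0, 0, 255]));
}
```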
COW
Copy-on-Write Tiled Storage
Images are stored as a grid of tiles, each wrapped in Arc. Cloning a layer costs only pointer copies, about 36KB at 4K. Actual chunk memory is duplicated only when that chunk is written, giving undo history delta-like efficiency automatically.
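The COW mechanics can be sketched in a few lines of Rust (the type and method names here are illustrative, not PaintFE's actual API):

```rust
use std::sync::Arc;

/// Minimal COW tiled storage: each tile is shared via Arc, so a
/// snapshot clones pointers and a write copies only the touched tile.
#[derive(Clone)]
struct TiledImage {
    tiles: Vec<Arc<Vec<u8>>>, // one RGBA buffer per tile
}

impl TiledImage {
    fn new(tile_count: usize, tile_bytes: usize) -> Self {
        let blank = Arc::new(vec![0u8; tile_bytes]);
        Self { tiles: vec![blank; tile_count] }
    }

    /// Snapshot: just Arc pointer copies (the derived Clone).
    fn snapshot(&self) -> Self { self.clone() }

    /// Mutation: Arc::make_mut duplicates the tile only if it is shared.
    fn write_byte(&mut self, tile: usize, offset: usize, value: u8) {
        Arc::make_mut(&mut self.tiles[tile])[offset] = value;
    }
}

fn main() {
    let mut canvas = TiledImage::new(4, 16);
    let undo = canvas.snapshot();      // near-free: 4 pointer copies
    canvas.write_byte(2, 0, 255);      // duplicates tile 2 only
    assert_eq!(canvas.tiles[2][0], 255);
    assert_eq!(undo.tiles[2][0], 0);   // snapshot untouched
    // Unwritten tiles are still physically shared with the snapshot.
    assert!(Arc::ptr_eq(&canvas.tiles[0], &undo.tiles[0]));
    assert!(!Arc::ptr_eq(&canvas.tiles[2], &undo.tiles[2]));
}
```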

The B-Series Optimizations

PaintFE's major performance work follows a numbered plan tracked in the codebase. These are the architectural changes that made 4K interactive painting viable.

B1
Async GPU Readback
Double-buffered staging buffers ping-pong between frames. While frame N is drawn, frame N−1's data is already mapped to CPU, eliminating GPU stalls during interactive previews.
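The ping-pong scheduling reduces to index arithmetic. A toy sketch (the struct is illustrative; real code holds two wgpu staging buffers and their map callbacks, all elided here):

```rust
/// Double-buffered staging indices: while frame N renders into one
/// buffer, frame N-1's buffer is already mapped and readable on the CPU.
struct StagingPair {
    frame: u64,
}

impl StagingPair {
    fn new() -> Self { Self { frame: 0 } }
    /// Buffer the GPU copies into this frame.
    fn write_index(&self) -> usize { (self.frame % 2) as usize }
    /// Buffer the CPU reads this frame (filled last frame, already mapped).
    fn read_index(&self) -> usize { ((self.frame + 1) % 2) as usize }
    fn end_frame(&mut self) { self.frame += 1; }
}

fn main() {
    let mut staging = StagingPair::new();
    let written = staging.write_index();
    staging.end_frame();
    // Next frame reads exactly the buffer the previous frame wrote,
    // so the CPU never waits on the buffer the GPU is still filling.
    assert_eq!(staging.read_index(), written);
    assert_ne!(staging.write_index(), staging.read_index());
}
```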
B2
Partial Texture Upload
TextureHandle::set_partial() uploads only the dirty rectangle, typically just the brush footprint. A 40×40 brush writes ~6KB to the GPU instead of ~33MB for a full-canvas upload.
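The arithmetic behind those numbers, assuming 4 bytes per RGBA pixel and taking "4K" as 3840×2160 (which matches the ~33MB and 8.3M-pixel figures elsewhere on this page):

```rust
/// Bytes uploaded for a w×h region at 4 bytes per pixel.
fn upload_bytes(w: u32, h: u32) -> u64 {
    w as u64 * h as u64 * 4
}

fn main() {
    let brush = upload_bytes(40, 40);    // dirty rect around a 40×40 brush
    let full = upload_bytes(3840, 2160); // full 4K frame
    assert_eq!(brush, 6_400);            // ~6KB
    assert_eq!(full, 33_177_600);        // ~33MB
    assert!(full / brush > 5_000);       // >5000× less traffic per stroke
}
```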
B5
COW Arc Tiles
Each tile is an Arc&lt;RgbaImage&gt;. A snapshot is a pointer copy; a mutation lazily copies only the touched tile via Arc::make_mut(). The undo stack gets snapshot history essentially for free.
B6
Brush Alpha LUT
A pre-computed lookup table maps dist² / radius² to alpha, eliminating all per-pixel sqrt() calls. The LUT is rebuilt only when brush size or hardness changes.
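A sketch of the LUT idea (table size and the linear hardness falloff are illustrative assumptions, not PaintFE's exact curve): the sqrt() moves into the one-time table build, and the per-pixel path indexes by squared distance only.

```rust
/// Brush alpha LUT indexed by normalized squared distance in [0, 1].
struct AlphaLut {
    table: Vec<f32>,
}

impl AlphaLut {
    const SIZE: usize = 1024;

    /// Rebuilt only when radius or hardness change, never per pixel.
    fn build(hardness: f32) -> Self {
        let table = (0..Self::SIZE)
            .map(|i| {
                let d2 = i as f32 / (Self::SIZE - 1) as f32; // dist²/r²
                let d = d2.sqrt(); // sqrt happens here, once per LUT entry
                ((1.0 - d) / (1.0 - hardness).max(1e-6)).clamp(0.0, 1.0)
            })
            .collect();
        Self { table }
    }

    /// Per-pixel lookup: squared distance only, no sqrt().
    fn alpha(&self, dx: f32, dy: f32, radius: f32) -> f32 {
        let d2 = (dx * dx + dy * dy) / (radius * radius);
        if d2 >= 1.0 {
            return 0.0;
        }
        self.table[(d2 * (Self::SIZE - 1) as f32) as usize]
    }
}

fn main() {
    let lut = AlphaLut::build(0.5);
    assert_eq!(lut.alpha(0.0, 0.0, 20.0), 1.0);  // center: fully opaque
    assert_eq!(lut.alpha(30.0, 0.0, 20.0), 0.0); // outside the radius
    let mid = lut.alpha(15.0, 0.0, 20.0);
    assert!(mid > 0.0 && mid < 1.0);             // soft falloff in between
}
```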
B3/B4
Superseded by B5
Explicit delta undo (B3) and async effect commits (B4) became unnecessary once COW tiles made snapshots near-free and filter jobs were already on rayon threads.
GPU
Compute Pipelines
Gradient rasterization, liquify warp, mesh warp displacement, Gaussian blur, and HSL adjustments all have dedicated WGSL compute shaders with CPU fallbacks.

Incremental Wins (A-Series)

Alongside the major B-series, a parallel track of smaller, lower-risk improvements targeted specific hot paths.

A1: Zero-copy swap for GPU readback display. Eliminates a redundant 33MB clone at 4K on every composite.
A2: Cached staging buffer. Reuses the GPU readback buffer across frames; no reallocation.
A3: Cached blend uniform slots. GPU blend buffers are reused via queue.write_buffer(); no GPU allocations.
A5: Chunk-level prefetch in composite_partial. Pre-fetches all layer chunk data per column; CPU-cache friendly.
A8: Selective LOD invalidation. Only the active layer's LOD is rebuilt when dirty, not every layer's.
A9: VecDeque history stacks. O(1) undo-history pruning plus a cached memory_usage counter.
A14: SingleLayerSnapshotCommand. Dialog commits save only the affected layer, costing 1/N of the memory per undo step.
A16: Chunk-level flip/rotate. Block transforms operate on whole tiles with par_iter, avoiding per-pixel paths.
A17: Flat visited array for flood fill. Vec&lt;bool&gt; replaces a HashMap, 10–20× faster for large fills.
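The A17 change is worth seeing concretely. A minimal 4-connected fill using a flat visited mask indexed by y * width + x (a sketch of the technique, not PaintFE's actual fill, which also handles tolerance and selection masks):

```rust
use std::collections::VecDeque;

/// Flood fill with a flat Vec<bool> visited mask instead of a
/// HashMap<(x, y), bool>: one cache-friendly array access per pixel.
fn flood_fill(pixels: &mut [u32], width: usize, height: usize,
              start: (usize, usize), new: u32) {
    let target = pixels[start.1 * width + start.0];
    if target == new {
        return;
    }
    let mut visited = vec![false; width * height]; // flat mask, no hashing
    let mut queue = VecDeque::from([start]);
    while let Some((x, y)) = queue.pop_front() {
        let i = y * width + x;
        if visited[i] || pixels[i] != target {
            continue;
        }
        visited[i] = true;
        pixels[i] = new;
        if x > 0 { queue.push_back((x - 1, y)); }
        if x + 1 < width { queue.push_back((x + 1, y)); }
        if y > 0 { queue.push_back((x, y - 1)); }
        if y + 1 < height { queue.push_back((x, y + 1)); }
    }
}

fn main() {
    // 4×2 canvas: left half 0, right half 9; fill the left region with 5.
    let mut px = vec![0, 0, 9, 9,
                      0, 0, 9, 9];
    flood_fill(&mut px, 4, 2, (0, 0), 5);
    assert_eq!(px, vec![5, 5, 9, 9,
                        5, 5, 9, 9]);
}
```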

GPU Compute Pipelines

Where CPU parallelism has limits, PaintFE dispatches to GPU compute shaders. Each pipeline is self-contained with CPU fallback paths for systems without adequate GPU support.

GpuGradientPipeline
Rasterizes linear, reflected, radial, and diamond gradients. Uses a color stop LUT (256×4 RGBA storage buffer) and a params uniform buffer, both cached across frames. Output texture is reused on same dimensions.
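The color-stop LUT idea, sketched on the CPU (function name and the plain linear interpolation are illustrative; the real pipeline bakes this table into a 256×4 storage buffer for the WGSL shader):

```rust
/// Bake gradient stops (position in [0, 1], RGBA color) into a fixed
/// 256-entry table, so the per-pixel path is one indexed lookup.
fn bake_stop_lut(stops: &[(f32, [f32; 4])]) -> Vec<[f32; 4]> {
    (0..256)
        .map(|i| {
            let t = i as f32 / 255.0;
            // Find the surrounding pair of stops and lerp between them.
            let mut prev = stops[0];
            for &stop in stops {
                if stop.0 >= t {
                    let span = (stop.0 - prev.0).max(1e-6);
                    let f = (t - prev.0) / span;
                    let mut c = [0.0; 4];
                    for ch in 0..4 {
                        c[ch] = prev.1[ch] + (stop.1[ch] - prev.1[ch]) * f;
                    }
                    return c;
                }
                prev = stop;
            }
            stops[stops.len() - 1].1 // past the last stop: clamp
        })
        .collect()
}

fn main() {
    // Black at t=0 to white at t=1.
    let lut = bake_stop_lut(&[(0.0, [0.0; 4]), (1.0, [1.0, 1.0, 1.0, 1.0])]);
    assert_eq!(lut.len(), 256);
    assert_eq!(lut[0], [0.0; 4]);
    assert_eq!(lut[255], [1.0, 1.0, 1.0, 1.0]);
    assert!((lut[128][0] - 128.0 / 255.0).abs() < 1e-6);
}
```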
GpuLiquifyPipeline
Bilinear-interpolated displacement warp on GPU. The source snapshot stays constant during a stroke (invalidated only when invalidate_source() is called); only the displacement field is re-uploaded each frame as a storage buffer.
GpuMeshWarpDisplacementPipeline
Evaluates the Catmull-Rom bicubic spline surface on GPU. Uploads deformed grid points (~200 bytes), dispatches 16×16 workgroups, reads back a pixel-sized displacement field for use by GpuLiquifyPipeline.
Compositor
Owns all WGPU render passes. Constructs per-layer bind groups, caches blend uniform buffers, runs a single render pass over all visible layers each composite frame. All 25 blend modes implemented in WGSL.

Smart Memory Usage

Deep undo histories at 4K are expensive, unless your undo system is built on the same COW structure as the canvas.

Undo Memory: 4K Canvas, 20-Step History
Full snapshot undo (naive, before COW): ~1.2 GB
COW Arc tiles undo (current, per modified chunk): ~40 MB
PixelPatch brush undo (stroke delta only): ~5 MB

Estimates for a typical painting workflow on a 4096×4096 RGBA canvas with 4 layers.
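The naive figure is roughly the obvious arithmetic, assuming each undo step stores one full 4096×4096 RGBA layer:

```rust
fn main() {
    let layer_bytes = 4096u64 * 4096 * 4;     // one full RGBA layer: 64 MiB
    assert_eq!(layer_bytes, 67_108_864);
    let naive_history = 20 * layer_bytes;     // 20 full snapshots
    assert_eq!(naive_history, 1_342_177_280); // ~1.25 GiB: the ~1.2 GB row
}
```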

TexturePool
GPU textures are not freed between frames. A pool recycles textures of matching dimensions, eliminating the allocation–deallocation cycle that causes frame stutters.
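A sketch of a dimension-keyed pool (illustrative names; the real pool holds wgpu::Texture handles rather than byte buffers):

```rust
use std::collections::HashMap;

/// Textures are recycled by (width, height) instead of freed, so
/// steady-state frames allocate nothing.
struct TexturePool {
    free: HashMap<(u32, u32), Vec<Vec<u8>>>, // byte buffers stand in for textures
    allocations: usize,
}

impl TexturePool {
    fn new() -> Self {
        Self { free: HashMap::new(), allocations: 0 }
    }

    fn acquire(&mut self, w: u32, h: u32) -> Vec<u8> {
        if let Some(tex) = self.free.get_mut(&(w, h)).and_then(Vec::pop) {
            return tex; // reuse: no allocation
        }
        self.allocations += 1;
        vec![0u8; (w * h * 4) as usize]
    }

    fn release(&mut self, w: u32, h: u32, tex: Vec<u8>) {
        self.free.entry((w, h)).or_default().push(tex);
    }
}

fn main() {
    let mut pool = TexturePool::new();
    for _ in 0..100 {
        // Per-frame: acquire, draw, release, always the same size.
        let tex = pool.acquire(256, 256);
        pool.release(256, 256, tex);
    }
    assert_eq!(pool.allocations, 1); // allocated once, recycled 99 times
}
```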
Reusable Compute Buffers
GPU readback staging buffers, displacement Vecs, preview flat buffers, and per-frame pixel caches all persist across frames. Near-zero heap allocation per frame at steady state.

See it run.

Download PaintFE and feel the difference at full resolution.

Download Free · View Source on GitHub