Built for 4K at 60fps

PaintFE is engineered from the ground up for speed at high resolution. Every architectural decision, from memory layout to GPU compute pipelines, was made so that a brush stroke on a 4096×4096 canvas feels the same as painting on a tiny thumbnail.

~1MB: readback per brush stroke at 4K (vs ~33MB for a full copy)
~0: cost to snapshot a 4K canvas with COW tiles (~36KB instead of ~64MB)
0: per-pixel sqrt() calls during brush strokes (a LUT replaces them all)
GPU: blur, gradients, liquify, and mesh warp are all compute-shader accelerated

Hybrid CPU + GPU

PaintFE uses a hybrid rendering strategy. Interactive brush previews stay on the CPU for zero-latency responsiveness. Heavy compositing and filter ops move to GPU compute shaders via wgpu (a WebGPU-compatible API with native backends).

Per-Frame Compositing Path
Layer Store (TiledImage/COW) → GPU Upload (wgpu texture) → WGSL Compositor (blend + merge) → Dirty Readback (sub-region only) → bytemuck cast (zero-copy) → Partial Upload (set_partial()) → Display
Brush Stroke Path
Input Event → LUT Alpha (no sqrt()) → Chunk-level COW (Arc::make_mut()) → Incremental Cache (dirty rect only) → set_partial() (~6KB, not ~33MB) → Display

Core Technologies

GPU
wgpu 0.20 (WebGPU)
GPU compositing and compute shaders using WGSL. Runs on Vulkan, Metal, and DirectX 12, with an OpenGL ES fallback. All blend modes, gradient rasters, liquify warp, and mesh warp run as compute dispatches.
CPU
rayon 1.7 (CPU Parallelism)
Composite operations, filter cores, and flip/rotate are all parallelized via rayon. A row-level parallel composite on a 4K canvas uses all available CPU cores simultaneously without extra allocations.
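A minimal sketch of the row-parallel idea, using std::thread::scope as a stand-in for rayon's parallel iterators (the real code uses rayon; the blend here is a placeholder, not one of PaintFE's blend modes):

```rust
use std::thread;

/// Row-parallel composite sketch: each thread owns a disjoint band of
/// rows, so no locking is needed and nothing is allocated per pixel.
fn composite_parallel(dst: &mut [u8], src: &[u8], row_bytes: usize, threads: usize) {
    assert_eq!(dst.len(), src.len());
    let rows = dst.len() / row_bytes;
    let band = rows.div_ceil(threads) * row_bytes;
    thread::scope(|s| {
        for (d, sr) in dst.chunks_mut(band).zip(src.chunks(band)) {
            s.spawn(move || {
                // "Blend": max() as a placeholder for a real blend mode.
                for (a, b) in d.iter_mut().zip(sr) {
                    *a = (*a).max(*b);
                }
            });
        }
    });
}

fn main() {
    let mut dst = vec![10u8; 4 * 4 * 4]; // 4×4 RGBA background
    let src = vec![200u8; 4 * 4 * 4];
    composite_parallel(&mut dst, &src, 4 * 4, 4);
    assert!(dst.iter().all(|&b| b == 200));
}
```

With rayon the band loop becomes a par_chunks_mut call; the ownership structure (one thread, one band) is the same.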
0cpy
bytemuck (Zero-Copy Casting)
Raw bytes from GPU readback are cast directly to Color32 slices via bytemuck::cast_slice. At 4K (8.3M pixels), that is a pointer reinterpretation with effectively zero CPU cost, versus a full per-pixel conversion loop.
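A hand-rolled illustration of what bytemuck::cast_slice does (bytemuck performs the same reinterpretation behind a safe, checked API; the Color32 type here is a stand-in, not PaintFE's actual pixel type):

```rust
/// Stand-in pixel type: 4 RGBA bytes, size 4, alignment 1, so a byte
/// slice can be reinterpreted in place with no copy.
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(transparent)]
struct Color32([u8; 4]);

/// Reinterpret readback bytes as pixels without copying.
fn bytes_as_pixels(bytes: &[u8]) -> &[Color32] {
    assert_eq!(bytes.len() % 4, 0, "readback length must be whole pixels");
    // SAFETY: Color32 is repr(transparent) over [u8; 4] (align 1), so
    // every 4-byte window of the input is a valid Color32.
    let (head, pixels, tail) = unsafe { bytes.align_to::<Color32>() };
    debug_assert!(head.is_empty() && tail.is_empty());
    pixels
}

fn main() {
    // Two RGBA pixels straight from a (pretend) GPU readback buffer.
    let raw = [255u8, 0, 0, 255, 0, 255, 0, 128];
    let pixels = bytes_as_pixels(&raw);
    assert_eq!(pixels.len(), 2);
    assert_eq!(pixels[0], Color32([255, 0, 0, 255]));
}
```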
COW
Copy-on-Write Tiled Storage
Images are stored as a grid of tiles, each wrapped in Arc. Cloning a layer costs only pointer copies, about 36KB at 4K. Actual chunk memory is duplicated only when that chunk is written, giving undo history delta-like efficiency automatically.
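The COW mechanics can be sketched in a few lines of Rust (the type and method names here are illustrative, not PaintFE's actual API):

```rust
use std::sync::Arc;

/// Minimal COW tiled storage: each tile is shared via Arc, so a
/// snapshot clones pointers and a write copies only the touched tile.
#[derive(Clone)]
struct TiledImage {
    tiles: Vec<Arc<Vec<u8>>>, // one RGBA buffer per tile
}

impl TiledImage {
    fn new(tile_count: usize, tile_bytes: usize) -> Self {
        let blank = Arc::new(vec![0u8; tile_bytes]);
        Self { tiles: vec![blank; tile_count] }
    }

    /// Snapshot: just Arc pointer copies (the derived Clone).
    fn snapshot(&self) -> Self { self.clone() }

    /// Mutation: Arc::make_mut duplicates the tile only if it is shared.
    fn write_byte(&mut self, tile: usize, offset: usize, value: u8) {
        Arc::make_mut(&mut self.tiles[tile])[offset] = value;
    }
}

fn main() {
    let mut canvas = TiledImage::new(4, 16);
    let undo = canvas.snapshot();      // near-free: 4 pointer copies
    canvas.write_byte(2, 0, 255);      // duplicates tile 2 only
    assert_eq!(canvas.tiles[2][0], 255);
    assert_eq!(undo.tiles[2][0], 0);   // snapshot untouched
    // Unwritten tiles are still physically shared with the snapshot.
    assert!(Arc::ptr_eq(&canvas.tiles[0], &undo.tiles[0]));
    assert!(!Arc::ptr_eq(&canvas.tiles[2], &undo.tiles[2]));
}
```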

The B-Series Optimizations

PaintFE's major performance work follows a numbered plan tracked in the codebase. These are the architectural changes that made 4K interactive painting viable.

B1
Async GPU Readback
Double-buffered staging buffers ping-pong between frames. While frame N is drawn, frame N−1's data is already mapped to CPU, eliminating GPU stalls during interactive previews.
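The ping-pong scheduling reduces to index arithmetic. A toy sketch (the struct is illustrative; real code holds two wgpu staging buffers and their map callbacks, all elided here):

```rust
/// Double-buffered staging indices: while frame N renders into one
/// buffer, frame N-1's buffer is already mapped and readable on the CPU.
struct StagingPair {
    frame: u64,
}

impl StagingPair {
    fn new() -> Self { Self { frame: 0 } }
    /// Buffer the GPU copies into this frame.
    fn write_index(&self) -> usize { (self.frame % 2) as usize }
    /// Buffer the CPU reads this frame (filled last frame, already mapped).
    fn read_index(&self) -> usize { ((self.frame + 1) % 2) as usize }
    fn end_frame(&mut self) { self.frame += 1; }
}

fn main() {
    let mut staging = StagingPair::new();
    let written = staging.write_index();
    staging.end_frame();
    // Next frame reads exactly the buffer the previous frame wrote,
    // so the CPU never waits on the buffer the GPU is still filling.
    assert_eq!(staging.read_index(), written);
    assert_ne!(staging.write_index(), staging.read_index());
}
```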
B2
Partial Texture Upload
TextureHandle::set_partial() uploads only the dirty rectangle, typically just the brush footprint. A 40×40 brush writes ~6KB to the GPU instead of ~33MB for a full-canvas upload.
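The arithmetic behind those numbers, assuming 4 bytes per RGBA pixel and taking "4K" as 3840×2160 (which matches the ~33MB and 8.3M-pixel figures elsewhere on this page):

```rust
/// Bytes uploaded for a w×h region at 4 bytes per pixel.
fn upload_bytes(w: u32, h: u32) -> u64 {
    w as u64 * h as u64 * 4
}

fn main() {
    let brush = upload_bytes(40, 40);    // dirty rect around a 40×40 brush
    let full = upload_bytes(3840, 2160); // full 4K frame
    assert_eq!(brush, 6_400);            // ~6KB
    assert_eq!(full, 33_177_600);        // ~33MB
    assert!(full / brush > 5_000);       // >5000× less traffic per stroke
}
```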
B5
COW Arc Tiles
Each tile is an Arc&lt;RgbaImage&gt;. A snapshot is a pointer copy; a mutation lazily copies only the touched tile via Arc::make_mut(). The undo stack gets snapshot history essentially for free.
B6
Brush Alpha LUT
A pre-computed lookup table maps dist² / radius² to alpha, eliminating all per-pixel sqrt() calls. The LUT is rebuilt only when brush size or hardness changes.
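A sketch of the LUT idea (table size and the linear hardness falloff are illustrative assumptions, not PaintFE's exact curve): the sqrt() moves into the one-time table build, and the per-pixel path indexes by squared distance only.

```rust
/// Brush alpha LUT indexed by normalized squared distance in [0, 1].
struct AlphaLut {
    table: Vec<f32>,
}

impl AlphaLut {
    const SIZE: usize = 1024;

    /// Rebuilt only when radius or hardness change, never per pixel.
    fn build(hardness: f32) -> Self {
        let table = (0..Self::SIZE)
            .map(|i| {
                let d2 = i as f32 / (Self::SIZE - 1) as f32; // dist²/r²
                let d = d2.sqrt(); // sqrt happens here, once per LUT entry
                ((1.0 - d) / (1.0 - hardness).max(1e-6)).clamp(0.0, 1.0)
            })
            .collect();
        Self { table }
    }

    /// Per-pixel lookup: squared distance only, no sqrt().
    fn alpha(&self, dx: f32, dy: f32, radius: f32) -> f32 {
        let d2 = (dx * dx + dy * dy) / (radius * radius);
        if d2 >= 1.0 {
            return 0.0;
        }
        self.table[(d2 * (Self::SIZE - 1) as f32) as usize]
    }
}

fn main() {
    let lut = AlphaLut::build(0.5);
    assert_eq!(lut.alpha(0.0, 0.0, 20.0), 1.0);  // center: fully opaque
    assert_eq!(lut.alpha(30.0, 0.0, 20.0), 0.0); // outside the radius
    let mid = lut.alpha(15.0, 0.0, 20.0);
    assert!(mid > 0.0 && mid < 1.0);             // soft falloff in between
}
```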
B3/B4
Superseded by B5
Explicit delta undo (B3) and async effect commits (B4) became unnecessary once COW tiles made snapshots near-free and filter jobs were already on rayon threads.
GPU
Compute Pipelines
Gradient rasterization, liquify warp, mesh warp displacement, Gaussian blur, and HSL adjustments all have dedicated WGSL compute shaders with CPU fallbacks.

Incremental Wins (A-Series)

Alongside the major B-series, a parallel track of smaller, lower-risk improvements targeted specific hot paths.

A1: Zero-copy swap for GPU readback display. Eliminates a redundant 33MB clone at 4K on every composite.
A2: Cached staging buffer. Reuses the GPU readback buffer across frames; no reallocation.
A3: Cached blend uniform slots. GPU blend buffers are reused via queue.write_buffer(); no GPU allocations.
A5: Chunk-level prefetch in composite_partial. Pre-fetches all layer chunk data per column; CPU-cache friendly.
A8: Selective LOD invalidation. Only the active layer's LOD is rebuilt when dirty, not every layer's.
A9: VecDeque history stacks. O(1) undo-history pruning plus a cached memory_usage counter.
A14: SingleLayerSnapshotCommand. Dialog commits save only the affected layer, costing 1/N of the memory per undo step.
A16: Chunk-level flip/rotate. Block transforms operate on whole tiles with par_iter, avoiding per-pixel paths.
A17: Flat visited array for flood fill. Vec&lt;bool&gt; replaces a HashMap, 10–20× faster for large fills.
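The A17 change is worth seeing concretely. A minimal 4-connected fill using a flat visited mask indexed by y * width + x (a sketch of the technique, not PaintFE's actual fill, which also handles tolerance and selection masks):

```rust
use std::collections::VecDeque;

/// Flood fill with a flat Vec<bool> visited mask instead of a
/// HashMap<(x, y), bool>: one cache-friendly array access per pixel.
fn flood_fill(pixels: &mut [u32], width: usize, height: usize,
              start: (usize, usize), new: u32) {
    let target = pixels[start.1 * width + start.0];
    if target == new {
        return;
    }
    let mut visited = vec![false; width * height]; // flat mask, no hashing
    let mut queue = VecDeque::from([start]);
    while let Some((x, y)) = queue.pop_front() {
        let i = y * width + x;
        if visited[i] || pixels[i] != target {
            continue;
        }
        visited[i] = true;
        pixels[i] = new;
        if x > 0 { queue.push_back((x - 1, y)); }
        if x + 1 < width { queue.push_back((x + 1, y)); }
        if y > 0 { queue.push_back((x, y - 1)); }
        if y + 1 < height { queue.push_back((x, y + 1)); }
    }
}

fn main() {
    // 4×2 canvas: left half 0, right half 9; fill the left region with 5.
    let mut px = vec![0, 0, 9, 9,
                      0, 0, 9, 9];
    flood_fill(&mut px, 4, 2, (0, 0), 5);
    assert_eq!(px, vec![5, 5, 9, 9,
                        5, 5, 9, 9]);
}
```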

GPU Compute Pipelines

Where CPU parallelism has limits, PaintFE dispatches to GPU compute shaders. Each pipeline is self-contained with CPU fallback paths for systems without adequate GPU support.

GpuGradientPipeline
Rasterizes linear, reflected, radial, and diamond gradients. Uses a color stop LUT (256×4 RGBA storage buffer) and a params uniform buffer, both cached across frames. Output texture is reused on same dimensions.
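The color-stop LUT idea, sketched on the CPU (function name and the plain linear interpolation are illustrative; the real pipeline bakes this table into a 256×4 storage buffer for the WGSL shader):

```rust
/// Bake gradient stops (position in [0, 1], RGBA color) into a fixed
/// 256-entry table, so the per-pixel path is one indexed lookup.
fn bake_stop_lut(stops: &[(f32, [f32; 4])]) -> Vec<[f32; 4]> {
    (0..256)
        .map(|i| {
            let t = i as f32 / 255.0;
            // Find the surrounding pair of stops and lerp between them.
            let mut prev = stops[0];
            for &stop in stops {
                if stop.0 >= t {
                    let span = (stop.0 - prev.0).max(1e-6);
                    let f = (t - prev.0) / span;
                    let mut c = [0.0; 4];
                    for ch in 0..4 {
                        c[ch] = prev.1[ch] + (stop.1[ch] - prev.1[ch]) * f;
                    }
                    return c;
                }
                prev = stop;
            }
            stops[stops.len() - 1].1 // past the last stop: clamp
        })
        .collect()
}

fn main() {
    // Black at t=0 to white at t=1.
    let lut = bake_stop_lut(&[(0.0, [0.0; 4]), (1.0, [1.0, 1.0, 1.0, 1.0])]);
    assert_eq!(lut.len(), 256);
    assert_eq!(lut[0], [0.0; 4]);
    assert_eq!(lut[255], [1.0, 1.0, 1.0, 1.0]);
    assert!((lut[128][0] - 128.0 / 255.0).abs() < 1e-6);
}
```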
GpuLiquifyPipeline
Bilinear-interpolated displacement warp on GPU. The source snapshot stays constant during a stroke (invalidated only when invalidate_source() is called); only the displacement field is re-uploaded each frame as a storage buffer.
GpuMeshWarpDisplacementPipeline
Evaluates the Catmull-Rom bicubic spline surface on GPU. Uploads deformed grid points (~200 bytes), dispatches 16×16 workgroups, reads back a pixel-sized displacement field for use by GpuLiquifyPipeline.
Compositor
Owns all WGPU render passes. Constructs per-layer bind groups, caches blend uniform buffers, runs a single render pass over all visible layers each composite frame. All 25 blend modes implemented in WGSL.

Smart Memory Usage

Deep undo histories at 4K are expensive, unless your undo system is built on the same COW structure as the canvas.

Undo Memory: 4K Canvas, 20-Step History
Full snapshot undo (naive, before COW): ~1.2 GB
COW Arc tiles undo (current, per modified chunk): ~40 MB
PixelPatch brush undo (stroke delta only): ~5 MB

Estimates for a typical painting workflow on a 4096×4096 RGBA canvas with 4 layers.
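The naive figure is roughly the obvious arithmetic, assuming each undo step stores one full 4096×4096 RGBA layer:

```rust
fn main() {
    let layer_bytes = 4096u64 * 4096 * 4;     // one full RGBA layer: 64 MiB
    assert_eq!(layer_bytes, 67_108_864);
    let naive_history = 20 * layer_bytes;     // 20 full snapshots
    assert_eq!(naive_history, 1_342_177_280); // ~1.25 GiB: the ~1.2 GB row
}
```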

TexturePool
GPU textures are not freed between frames. A pool recycles textures of matching dimensions, eliminating the allocation–deallocation cycle that causes frame stutters.
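A sketch of a dimension-keyed pool (illustrative names; the real pool holds wgpu::Texture handles rather than byte buffers):

```rust
use std::collections::HashMap;

/// Textures are recycled by (width, height) instead of freed, so
/// steady-state frames allocate nothing.
struct TexturePool {
    free: HashMap<(u32, u32), Vec<Vec<u8>>>, // byte buffers stand in for textures
    allocations: usize,
}

impl TexturePool {
    fn new() -> Self {
        Self { free: HashMap::new(), allocations: 0 }
    }

    fn acquire(&mut self, w: u32, h: u32) -> Vec<u8> {
        if let Some(tex) = self.free.get_mut(&(w, h)).and_then(Vec::pop) {
            return tex; // reuse: no allocation
        }
        self.allocations += 1;
        vec![0u8; (w * h * 4) as usize]
    }

    fn release(&mut self, w: u32, h: u32, tex: Vec<u8>) {
        self.free.entry((w, h)).or_default().push(tex);
    }
}

fn main() {
    let mut pool = TexturePool::new();
    for _ in 0..100 {
        // Per-frame: acquire, draw, release, always the same size.
        let tex = pool.acquire(256, 256);
        pool.release(256, 256, tex);
    }
    assert_eq!(pool.allocations, 1); // allocated once, recycled 99 times
}
```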
Reusable Compute Buffers
GPU readback staging buffers, displacement Vecs, preview flat buffers, and per-frame pixel caches all persist across frames. Near-zero heap allocation per frame at steady state.

See it run.

Download PaintFE and feel the difference at full resolution.

Download Free · View Source on GitHub