Built for 4K at 60fps
PaintFE is engineered from the ground up for speed at high resolution. Every architectural decision, from memory layout to GPU compute pipelines, is made so that a brush stroke on a 4096×4096 canvas feels the same as painting on a tiny thumbnail.
Brush upload: ~6KB vs ~33MB for a full copy
Layer clone: ~36KB instead of ~64MB
Per-pixel sqrt() calls: a LUT replaces them all
Hybrid CPU + GPU
PaintFE uses a hybrid rendering strategy. Interactive brush previews stay on CPU for zero-latency responsiveness. Heavy compositing and filter ops move to GPU compute shaders via wgpu (WebGPU-compatible, native backend).
[Diagram: TiledImage/COW, wgpu texture, Blend + merge. Labels: sub-region only, zero-copy, set_partial(), no sqrt(), Arc::make_mut(), dirty rect only, ~6KB not ~33MB.]
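The "~6KB not ~33MB" figure is plain arithmetic on the dirty rectangle. The sketch below is an illustration only: `Rect`, `dab_rect` and `extract_region` are hypothetical helpers, not PaintFE's API, and it assumes a tightly packed RGBA8 buffer. It computes a brush dab's bounding box and copies just those rows out of the full frame.

```rust
/// Bytes per RGBA8 pixel (assumed pixel format).
const BYTES_PER_PIXEL: usize = 4;

/// Axis-aligned pixel rectangle (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: usize,
    y: usize,
    w: usize,
    h: usize,
}

impl Rect {
    /// Number of bytes an upload of this rectangle would carry.
    fn byte_len(&self) -> usize {
        self.w * self.h * BYTES_PER_PIXEL
    }
}

/// Bounding box of a brush dab centred at (cx, cy), clamped to the canvas.
fn dab_rect(cx: usize, cy: usize, diameter: usize, canvas_w: usize, canvas_h: usize) -> Rect {
    let r = diameter / 2;
    let x0 = cx.saturating_sub(r);
    let y0 = cy.saturating_sub(r);
    let x1 = (cx + diameter - r).min(canvas_w);
    let y1 = (cy + diameter - r).min(canvas_h);
    Rect { x: x0, y: y0, w: x1.saturating_sub(x0), h: y1.saturating_sub(y0) }
}

/// Copy only the rows covered by `rect` out of a full-frame RGBA8 buffer.
fn extract_region(pixels: &[u8], canvas_w: usize, rect: Rect) -> Vec<u8> {
    let row_bytes = rect.w * BYTES_PER_PIXEL;
    let mut out = Vec::with_capacity(rect.byte_len());
    for row in rect.y..rect.y + rect.h {
        let start = (row * canvas_w + rect.x) * BYTES_PER_PIXEL;
        out.extend_from_slice(&pixels[start..start + row_bytes]);
    }
    out
}
```

For a 40×40 dab, `byte_len()` is 40 × 40 × 4 = 6,400 bytes, while a full 3840×2160 RGBA8 frame is 33,177,600 bytes. The extracted region is the kind of payload a partial texture upload would receive.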
Core Technologies
Pixel buffers are reinterpreted as Color32 slices via bytemuck::cast_slice.
At 4K (8.3M pixels) this costs effectively zero CPU cycles, versus a full per-pixel conversion loop.
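To show why a slice cast costs nothing per pixel, here is a standard-library-only sketch of the same idea. `Rgba8` is a hypothetical stand-in for Color32 and `as_pixels` is an illustrative helper. In real code, bytemuck::cast_slice does the equivalent reinterpretation safely and panics on a size mismatch instead of returning `None`.

```rust
/// Four-byte RGBA pixel (stand-in for a Color32-like type).
/// `#[repr(C)]` fixes the field order; size is 4, alignment is 1, no padding.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rgba8 {
    r: u8,
    g: u8,
    b: u8,
    a: u8,
}

/// View a byte buffer as pixels without copying.
/// Returns `None` if the length is not a whole number of pixels.
fn as_pixels(bytes: &[u8]) -> Option<&[Rgba8]> {
    let px_size = std::mem::size_of::<Rgba8>();
    if bytes.len() % px_size != 0 {
        return None;
    }
    let len = bytes.len() / px_size;
    // SAFETY: Rgba8 is repr(C) with four u8 fields, so it has alignment 1,
    // no padding, and every bit pattern is valid. The length was checked
    // above and the returned slice borrows from `bytes`.
    Some(unsafe { std::slice::from_raw_parts(bytes.as_ptr().cast::<Rgba8>(), len) })
}
```

The only work is one length check and a pointer cast, regardless of how many pixels the buffer holds. The returned slice points at the same memory as the input.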
Each image chunk is wrapped in an Arc.
Cloning a layer costs only pointer copies, about 36KB at 4K. Actual chunk memory is duplicated only
when that chunk is written, giving undo history delta-like efficiency automatically.
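A minimal sketch of this copy-on-write behaviour, using only the standard library. `TiledLayer`, the 64-pixel tile edge and the `Vec<u8>` tile storage are assumptions made for illustration, not PaintFE's actual types.

```rust
use std::sync::Arc;

/// Tile edge length in pixels (hypothetical value for this sketch).
const TILE: usize = 64;

/// A layer stored as a grid of reference-counted RGBA8 tiles.
#[derive(Clone)]
struct TiledLayer {
    tiles_x: usize,
    tiles: Vec<Arc<Vec<u8>>>,
}

impl TiledLayer {
    fn new(width: usize, height: usize) -> Self {
        let tiles_x = (width + TILE - 1) / TILE;
        let tiles_y = (height + TILE - 1) / TILE;
        let tiles = (0..tiles_x * tiles_y)
            .map(|_| Arc::new(vec![0u8; TILE * TILE * 4]))
            .collect();
        Self { tiles_x, tiles }
    }

    /// Index of the tile holding (x, y) and the byte offset inside it.
    fn locate(&self, x: usize, y: usize) -> (usize, usize) {
        let idx = (y / TILE) * self.tiles_x + x / TILE;
        let off = ((y % TILE) * TILE + x % TILE) * 4;
        (idx, off)
    }

    /// Write one pixel. `Arc::make_mut` clones the tile only if it is shared.
    fn put_pixel(&mut self, x: usize, y: usize, rgba: [u8; 4]) {
        let (idx, off) = self.locate(x, y);
        let tile = Arc::make_mut(&mut self.tiles[idx]);
        tile[off..off + 4].copy_from_slice(&rgba);
    }

    fn get_pixel(&self, x: usize, y: usize) -> [u8; 4] {
        let (idx, off) = self.locate(x, y);
        let mut out = [0u8; 4];
        out.copy_from_slice(&self.tiles[idx][off..off + 4]);
        out
    }

    /// Count tiles whose memory is still shared with `other`.
    fn shared_tiles(&self, other: &Self) -> usize {
        self.tiles
            .iter()
            .zip(&other.tiles)
            .filter(|(a, b)| Arc::ptr_eq(a, b))
            .count()
    }
}
```

Calling `layer.clone()` copies only the vector of `Arc` pointers. After a single `put_pixel`, exactly one tile has been duplicated and every other tile is still shared with the snapshot.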
The B-Series Optimizations
PaintFE's major performance work follows a numbered plan tracked in the codebase. These are the architectural changes that made 4K interactive painting viable.
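One of these changes replaces per-pixel sqrt() with a lookup keyed on dist² / radius². The sketch below shows the general technique under stated assumptions: `BrushLut`, the 256-entry table and the linear hardness falloff are all hypothetical choices, not PaintFE's actual brush model.

```rust
/// Number of table entries (hypothetical resolution).
const LUT_SIZE: usize = 256;

/// Brush falloff table indexed by squared-distance ratio.
struct BrushLut {
    radius_sq: f32,
    table: [f32; LUT_SIZE],
}

impl BrushLut {
    /// Build the table. `radius` must be > 0; `hardness` is clamped to [0, 1].
    /// sqrt() runs LUT_SIZE times here and never in the per-pixel path.
    fn new(radius: f32, hardness: f32) -> Self {
        let hardness = hardness.clamp(0.0, 1.0);
        let mut table = [0.0f32; LUT_SIZE];
        for (i, slot) in table.iter_mut().enumerate() {
            let ratio_sq = i as f32 / (LUT_SIZE - 1) as f32;
            let t = ratio_sq.sqrt(); // normalised distance in [0, 1]
            *slot = if t <= hardness {
                1.0 // fully opaque core
            } else {
                1.0 - (t - hardness) / (1.0 - hardness) // linear falloff to the edge
            };
        }
        Self { radius_sq: radius * radius, table }
    }

    /// Alpha for a pixel offset (dx, dy) from the brush centre. No sqrt here.
    fn alpha(&self, dx: f32, dy: f32) -> f32 {
        let dist_sq = dx * dx + dy * dy;
        if dist_sq >= self.radius_sq {
            return 0.0;
        }
        let idx = (dist_sq / self.radius_sq * (LUT_SIZE - 1) as f32) as usize;
        self.table[idx.min(LUT_SIZE - 1)]
    }
}
```

The table depends only on radius and hardness, so it needs rebuilding only when one of those changes. The per-pixel cost is two multiplies, an add, a divide and an index.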
TextureHandle::set_partial() uploads only the dirty rectangle, typically the size of the brush. A 40×40 brush writes ~6KB to the GPU instead of ~33MB for a full clone.
Tiles are stored as Arc<RgbaImage>. A snapshot is a pointer copy. A mutation lazily copies only the touched tile at Arc::make_mut(), so unchanged tiles cost the undo stack nothing extra.
A lookup table (LUT) maps dist² / radius² to alpha, eliminating all per-pixel sqrt() calls. The LUT is rebuilt only when brush size or hardness changes.
Incremental Wins (A-Series)
Alongside the major B-series, a parallel track of smaller, lower-risk improvements targeted specific hot paths.
| ID | Optimization | Impact |
|---|---|---|
| A1 | Zero-copy swap for GPU readback display | Eliminates a redundant 33MB clone at 4K on every composite |
| A2 | Cached staging buffer | Reuses GPU readback buffer across frames, no realloc |
| A3 | Cached blend uniform slots | GPU blend buffers reused via queue.write_buffer(), no GPU allocs |
| A5 | Chunk-level prefetch in composite_partial | Pre-fetches all layer chunk data per column, CPU cache friendly |
| A8 | Selective LOD invalidation | Only active layer's LOD is rebuilt on dirty, not all layers |
| A9 | VecDeque history stacks | O(1) undo history prune + cached memory_usage counter |
| A14 | SingleLayerSnapshotCommand | Dialog commits save only the affected layer, 1/N memory per undo step |
| A16 | Chunk-level flip/rotate | Block transforms operate on whole tiles with par_iter, avoids per-pixel paths |
| A17 | Flat visited array for flood fill | Vec<bool> replaces HashMap, 10–20× faster for large fills |
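
As an illustration of the A17 row, here is a minimal flood fill that tracks visited pixels in a flat `Vec<bool>` indexed by `y * width + x`. The function name, the `u32` pixel type and the 4-connected neighbourhood are assumptions made for this sketch.

```rust
/// Return a mask of every pixel 4-connected to (sx, sy) with the same value.
/// `pixels` must hold exactly `width * height` entries in row-major order.
fn flood_fill_mask(pixels: &[u32], width: usize, height: usize, sx: usize, sy: usize) -> Vec<bool> {
    assert_eq!(pixels.len(), width * height);
    // One flag per pixel: a single allocation, indexed directly, no hashing.
    let mut visited = vec![false; width * height];
    if sx >= width || sy >= height {
        return visited;
    }
    let target = pixels[sy * width + sx];
    let mut stack = vec![(sx, sy)];
    visited[sy * width + sx] = true;

    while let Some((x, y)) = stack.pop() {
        // wrapping_sub turns 0 - 1 into usize::MAX, which fails the bounds check.
        let neighbours = [
            (x.wrapping_sub(1), y),
            (x + 1, y),
            (x, y.wrapping_sub(1)),
            (x, y + 1),
        ];
        for (nx, ny) in neighbours {
            if nx < width && ny < height {
                let i = ny * width + nx;
                if !visited[i] && pixels[i] == target {
                    visited[i] = true;
                    stack.push((nx, ny));
                }
            }
        }
    }
    visited
}
```

Each visited check is a single array index, with no hashing and no per-entry allocation, which is where the advantage over a hash map comes from on large fills.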
GPU Compute Pipelines
Where CPU parallelism has limits, PaintFE dispatches to GPU compute shaders. Each pipeline is self-contained with CPU fallback paths for systems without adequate GPU support.
For displacement-based effects, the source image is uploaded once and reused (until invalidate_source()); only the displacement field is re-uploaded each frame as a storage buffer.
Smart Memory Usage
Deep undo histories at 4K are expensive, unless your undo system is built on the same COW structure as the canvas.
Estimates for a typical painting workflow on a 4096×4096 RGBA canvas with 4 layers.
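Keeping such numbers cheap to track at run time is what the A9 row above describes: a VecDeque-backed history with a cached byte counter. The sketch below is hypothetical: `History`, its byte budget and the use of `Vec<u8>` as a stand-in for one undo step's retained data are illustrative choices, not PaintFE's types.

```rust
use std::collections::VecDeque;

/// Bounded undo history with a running memory total.
struct History {
    entries: VecDeque<Vec<u8>>, // stand-in for the data each undo step retains
    bytes: usize,               // cached total, updated on every push and pop
    budget: usize,              // prune oldest entries above this many bytes
}

impl History {
    fn new(budget: usize) -> Self {
        Self { entries: VecDeque::new(), bytes: 0, budget }
    }

    /// Record a step, then drop the oldest steps until back under budget.
    /// The newest step is always kept, even if it alone exceeds the budget.
    fn push(&mut self, step: Vec<u8>) {
        self.bytes += step.len();
        self.entries.push_back(step);
        while self.bytes > self.budget && self.entries.len() > 1 {
            if let Some(old) = self.entries.pop_front() {
                self.bytes -= old.len(); // pop_front on a VecDeque is O(1)
            }
        }
    }

    /// Take the most recent step off the history.
    fn undo(&mut self) -> Option<Vec<u8>> {
        let step = self.entries.pop_back()?;
        self.bytes -= step.len();
        Some(step)
    }

    /// O(1): reads the cached counter instead of walking the entries.
    fn memory_usage(&self) -> usize {
        self.bytes
    }
}
```

With a `Vec`, removing the oldest entry shifts every remaining element. A `VecDeque` pops from the front in constant time, and the cached counter avoids re-summing the history each time the budget is checked.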
See it run.
Download PaintFE and feel the difference at full resolution.