Select a neural network type to begin exploration
Compare f(x) and f′(x) — the derivative is what backprop actually multiplies through each layer
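As a rough illustration of why the derivative matters, the sketch below multiplies f′(z) across a stack of layers for sigmoid and ReLU. It assumes every weight is 1 and every layer sees the same pre-activation z, which is a simplification for illustration and not how the page's plot is computed. The function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x: float) -> float:
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x: float) -> float:
    return max(0.0, x)


def relu_prime(x: float) -> float:
    # Subgradient convention: 0 at x <= 0, 1 otherwise
    return 1.0 if x > 0 else 0.0


def chain_gradient(pre_activations, f_prime) -> float:
    """Product of f'(z) over layers (weights assumed to be 1 for illustration)."""
    grad = 1.0
    for z in pre_activations:
        grad *= f_prime(z)
    return grad
```

Under these assumptions, ten sigmoid layers at z = 0 scale the gradient by 0.25¹⁰ (roughly 1e-6), while ten ReLU layers at any positive z leave it at 1.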
Drag layers from the palette to assemble a custom neural network architecture
Drag layers here to build your network
Start with an Input layer, add hidden layers, end with Output
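To get a feel for how an Input, hidden, Output stack translates into trainable parameters, here is a minimal sketch. It assumes every layer is fully connected with a bias term, which the builder itself may not require; `count_dense_parameters` is a hypothetical helper, not part of the lab.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases for a fully connected stack.

    layer_sizes lists the width of each layer in order,
    e.g. [4, 8, 3] = Input(4) -> Hidden(8) -> Output(3).
    """
    if len(layer_sizes) < 2:
        raise ValueError("need at least an input and an output layer")
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total
```

For `[4, 8, 3]` this gives (4·8 + 8) + (8·3 + 3) = 67 parameters.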
Schematic train/val curves showing typical convergence and generalization gap — not empirical data.
Normalized 0–100 across 6 axes. CNN = EfficientNet-B7 (Tan & Le 2019); Transformer = BERT-L / ViT-L (Vaswani et al. 2017; Dosovitskiy et al. 2021); LSTM = AWD-LSTM (Merity et al. 2018). Speed and memory scores reflect architectural properties, not a specific hardware setup.
* Task Performance: normalized 0–100 score relative to each architecture's primary domain (ImageNet for vision, GLUE/PTB for sequence, FID-derived for generative). Sources: He et al. 2015, Tan & Le 2019, Vaswani et al. 2017, Merity et al. 2018, Ho et al. 2020, Gu & Dao 2023, Karras et al. 2020.
On Long Range Arena (Tay et al. 2021), attention-based models outperform recurrent baselines by 15–40% depending on task, with the gap widening at sequences beyond 1,000 tokens. On WMT 2014 EN-DE, the Transformer surpassed the best previously reported models, including ensembles, by more than 2 BLEU at a fraction of their training cost (Vaswani et al. 2017).
CoAtNet (Dai et al. 2021) combines depthwise convolution with self-attention, outperforming pure ViT-L by 2.3 pp at comparable compute. Its 90.88% ImageNet Top-1 result, the state of the art at the time, relied on JFT-3B pre-training; trained on ImageNet-1K alone it reaches 86.0%.
Understanding gradient flow and optimization dynamics is crucial for effective neural network design and training.
How neural networks actually learn: first- and second-moment gradient methods, adaptive step sizes, and the geometry of the loss landscape.
SGD
When: Large datasets, near-convex problems. Baseline for every comparison.
Params: η (learning rate), batch size.
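A minimal sketch of the plain gradient-descent update these two parameters control. The toy loss (w − 3)² and the function names are assumptions for illustration; in real training the gradient would be averaged over a minibatch of the chosen batch size rather than computed exactly.

```python
def sgd_step(params, grads, lr):
    """One SGD update: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]


def minimize_quadratic(lr=0.1, steps=50):
    """Minimize the toy loss (w - 3)^2 starting from w = 0."""
    w = [0.0]
    for _ in range(steps):
        grads = [2.0 * (w[0] - 3.0)]  # exact gradient of the toy loss
        w = sgd_step(w, grads, lr)
    return w[0]
```

With η = 0.1 the error shrinks by a factor of 0.8 per step, so 50 steps land very close to w = 3.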
Momentum
When: Deep networks with noisy gradients. β = 0.9 is a reliable default; the Nesterov variant evaluates the gradient at the look-ahead point rather than at the current weights.
Params: η, β (momentum coefficient).
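The sketch below shows one common way to write the momentum update, with an optional Nesterov look-ahead. It uses the classical formulation (velocity accumulates −η·gradient); frameworks often use an algebraically different but related form. `momentum_step` and `quadratic_grad` are hypothetical names.

```python
def quadratic_grad(point):
    """Gradient of the toy loss sum((w - 3)^2)."""
    return [2.0 * (w - 3.0) for w in point]


def momentum_step(params, velocity, grad_fn, lr, beta=0.9, nesterov=False):
    """One (Nesterov) momentum update in the classical formulation."""
    if nesterov:
        # Look ahead to where the current velocity would carry the weights
        point = [p + beta * v for p, v in zip(params, velocity)]
    else:
        point = params
    grads = grad_fn(point)
    new_velocity = [beta * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity
```

Usage: start with `velocity = [0.0] * len(params)` and call `momentum_step` in a loop, carrying both returned lists forward.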
RMSprop
When: RNNs and non-stationary objectives. Divides the learning rate by a running root-mean-square of each parameter's gradients, which equalizes step sizes across parameters.
Params: η, ρ ≈ 0.9, ε ≈ 1e-8.
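A minimal sketch of the RMSprop update with the parameters listed above. It follows the common form without momentum or centering; `rmsprop_step` is a hypothetical name.

```python
import math


def rmsprop_step(params, sq_avg, grads, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update; sq_avg is the running mean of squared gradients."""
    new_sq_avg = [rho * s + (1.0 - rho) * g * g for s, g in zip(sq_avg, grads)]
    new_params = [
        p - lr * g / (math.sqrt(s) + eps)
        for p, g, s in zip(params, grads, new_sq_avg)
    ]
    return new_params, new_sq_avg
```

Because each gradient is divided by its own running scale, a parameter with gradient 100 and one with gradient 0.01 take nearly the same first step.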
Adam
When: Default for most deep learning. Combines momentum with per-parameter adaptive scaling, with bias correction for the early steps.
Params: η ≈ 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
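A minimal scalar sketch of the Adam update with the defaults listed above, following the standard formulation with bias-corrected first and second moments. `adam_step` is a hypothetical name, and `t` is the 1-based step count.

```python
import math


def adam_step(p, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t starts at 1."""
    m = b1 * m + (1.0 - b1) * g        # first moment (momentum)
    v = b2 * v + (1.0 - b2) * g * g    # second moment (gradient scale)
    m_hat = m / (1.0 - b1 ** t)        # bias correction for zero initialization
    v_hat = v / (1.0 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

On the first step the bias correction makes the update approximately η in magnitude regardless of the gradient's scale.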
AdamW
When: Transformers and large pretrained models. Decouples weight decay from the adaptive gradient scale, which is critical for regularization to work correctly.
Params: Same as Adam + λ (weight decay ≈ 0.01–0.1).
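To make the decoupling concrete, the sketch below contrasts decoupled weight decay with an L2 penalty folded into the gradient. It is a scalar illustration under the defaults named above; `adam_like_step` and its `decoupled` flag are hypothetical and not a library API.

```python
import math


def adam_like_step(p, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                   weight_decay=0.01, decoupled=True):
    """One Adam-style update with either decoupled decay or an L2 penalty."""
    if not decoupled:
        # L2 penalty: the decay term passes through the adaptive scaling
        g = g + weight_decay * p
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled decay: applied directly to the weight, outside the scaling
        update += weight_decay * p
    return p - lr * update, m, v
```

With a zero loss gradient, p = 1, and λ = 0.1, the decoupled step shrinks the weight by η·λ = 1e-4, whereas the L2 variant is rescaled by the adaptive denominator and shrinks it by roughly η = 1e-3.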
Lion
When: Large vision / multimodal training. Tracks only one moment (vs. Adam's two), so optimizer-state memory is halved. Use ~3–10× smaller η than Adam.
Params: η, β₁ = 0.9, β₂ = 0.99, λ (weight decay).
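A minimal scalar sketch of the Lion update with the parameters listed above. It follows the sign-of-interpolated-momentum rule as commonly described for Lion; `lion_step` and `sign` are hypothetical names, and the default η here is only a placeholder.

```python
def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def lion_step(p, m, g, lr=1e-4, b1=0.9, b2=0.99, weight_decay=0.0):
    """One Lion update for a single parameter."""
    c = b1 * m + (1.0 - b1) * g                  # interpolate momentum and gradient
    p = p - lr * (sign(c) + weight_decay * p)    # step uses only the sign of c
    m = b2 * m + (1.0 - b2) * g                  # the single tracked moment
    return p, m
```

Because the step uses only the sign, every parameter moves by exactly η (plus decay) per step, which is one reason a smaller η than Adam's is recommended.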
| Optimizer | Extra memory | Adaptive LR | Handles anisotropy | Best for |
|---|---|---|---|---|
| SGD | 0 | No | Poorly | CV with carefully tuned schedule; matches Adam with effort |
| Momentum | 1 buffer | No | Better | Deep CNNs, ResNets — standard for ImageNet training |
| RMSprop | 1 buffer | Yes | Yes | RNNs, RL with non-stationary reward signals |
| Adam | 2 buffers | Yes | Yes | Default — NLP, GANs, generative models |
| AdamW | 2 buffers | Yes | Yes | Transformers and LLMs — decoupled weight decay is essential |
| Lion | 1 buffer | Implicit | Partial | Large vision / multimodal training where memory is tight |
Surface: f(x,y) = 0.05x² + 5y² — an elongated bowl with extreme curvature anisotropy (100× steeper in y than x). This exposes the core failure mode of fixed-LR SGD. Adjust the learning rate and observe the different trajectories.
Click Run to start.
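The failure mode on this surface can be reproduced in a few lines. The sketch below runs fixed-learning-rate gradient descent on f(x,y) = 0.05x² + 5y²; the start point (10, 1) and step count are assumptions, not the values the demo uses.

```python
def grad(x: float, y: float):
    """Gradient of f(x, y) = 0.05*x**2 + 5*y**2."""
    return 0.1 * x, 10.0 * y


def run_gd(lr: float, steps: int = 100, start=(10.0, 1.0)):
    """Fixed-LR gradient descent; returns the final (x, y)."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx
        y -= lr * gy
    return x, y
```

Each step multiplies y by (1 − 10η) and x by (1 − 0.1η), so the steep y direction is stable only for η < 0.2. At η = 0.19, y oscillates toward zero while x has barely moved after 100 steps; at η = 0.21, y diverges.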
Plain-language definitions for the core terms used across the lab. Hover a dotted term anywhere on the page for a quick definition.