Neural Network Theory Laboratory

Core Architectures: 7
First Perceptron: 1957
Mathematical Depth
Modern Architectures: 2024

Neural Network Architectures

Interactive Neural Network Lab

Select a Network Type

Parameters

Activation Function Explorer

Compare f(x) and f′(x): the derivative is the factor that backpropagation multiplies into the gradient at each layer.

f(x) — Activation output

f′(x) — Gradient signal through this layer
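A minimal sketch of the f(x) versus f′(x) comparison, in plain Python. The function names are illustrative and not taken from the lab. It evaluates the derivatives of three common activations, then raises each derivative to the power of the depth to suggest how the gradient signal scales through a stack of layers. Unit weights and an identical pre-activation in every layer are simplifying assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # maximum 0.25, reached at x = 0

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # maximum 1.0, reached at x = 0

def d_relu(x):
    return 1.0 if x > 0 else 0.0    # convention: derivative 0 at x <= 0

def gradient_scale(derivative, x, depth):
    # Product of f'(x) over `depth` layers. Assumes unit weights and the
    # same pre-activation x in every layer, purely for illustration.
    return derivative(x) ** depth

for name, d in [("sigmoid", d_sigmoid), ("tanh", d_tanh), ("relu", d_relu)]:
    print(f"{name:8s} f'(0.5) = {d(0.5):.4f}   "
          f"10-layer scale = {gradient_scale(d, 0.5, 10):.2e}")
```

Because the sigmoid derivative never exceeds 0.25, ten stacked layers shrink the signal by at least a factor of 0.25¹⁰, while ReLU passes it through unchanged wherever the unit is active.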

Properties

Build Your Own Network

Drag layers from the palette to assemble a custom neural network architecture

Layer Palette

Input
Dense
Conv2D
MaxPool
LSTM
Attention
Dropout
BatchNorm
Flatten
Output

Network Architecture

Start with an Input layer, add hidden layers, end with Output

Network Preview

Architecture Summary

Layers: 0
Parameters (est.): 0
Status: No layers
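How the lab estimates the parameter count is not shown on the page. As a rough sketch, the standard closed-form counts for a few palette layers look like this; the helper names and the example stack are hypothetical.

```python
def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out

def conv2d_params(kernel_h, kernel_w, c_in, c_out):
    # one kernel per (input channel, output channel) pair, plus biases
    return kernel_h * kernel_w * c_in * c_out + c_out

def lstm_params(n_in, n_hidden):
    # four gates, each with input weights, recurrent weights, and one bias
    return 4 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

def batchnorm_params(n_features):
    # trainable scale and shift per feature
    return 2 * n_features

# Hypothetical stack: Input(784) -> Dense(128) -> BatchNorm -> Dense(10)
total = dense_params(784, 128) + batchnorm_params(128) + dense_params(128, 10)
print("estimated parameters:", total)
```

Frameworks differ in details. PyTorch's LSTM, for instance, keeps two bias vectors per gate, so its count is slightly higher than this formula gives. MaxPool, Dropout, and Flatten contribute no trainable parameters.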

Performance Comparison

Loss Convergence (illustrative)

Schematic train/val curves showing typical convergence and generalization gap — not empirical data.

Architecture Comparison

Normalized 0–100 across 6 axes. CNN = EfficientNet-B7 (Tan & Le 2019); Transformer = BERT-L / ViT-L (Vaswani et al. 2017; Dosovitskiy et al. 2021); LSTM = AWD-LSTM (Merity et al. 2018). Speed and memory scores describe the architecture itself, not measurements on specific hardware.

Performance Matrix

* Task Performance: normalized 0–100 score relative to each architecture's primary domain (ImageNet for vision, GLUE/PTB for sequence, FID-derived for generative). Sources: He et al. 2015, Tan & Le 2019, Vaswani et al. 2017, Merity et al. 2018, Ho et al. 2020, Gu & Dao 2023, Karras et al. 2020.

Research Insights

Pattern Discovery

On Long Range Arena (Tay et al. 2021), attention-based models outperform recurrent baselines by 15–40% depending on task, with the gap widening at sequences beyond 1,000 tokens. On WMT 2014 EN-DE, the Transformer reached 28.4 BLEU, improving on the best previously reported results, ensembles included, by more than 2 BLEU at a fraction of the training cost (Vaswani et al. 2017).

Hybrid Opportunity

CoAtNet (Dai et al. 2021) combines depthwise convolution with self-attention. It reports 86.0% ImageNet top-1 accuracy without extra data, and 90.88% when scaled up with JFT-3B pretraining, a state-of-the-art result at the time of publication.

Mathematical Foundations

Understanding gradient flow and optimization dynamics is crucial for effective neural network design and training.

Evolution Timeline

Optimization Theory

How neural networks actually learn: first- and second-moment gradient methods, adaptive step sizes, and the geometry of the loss landscape.

SGD

1951

When: Large datasets, near-convex problems. Baseline for every comparison.

Params: η (learning rate), batch size.
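A minimal sketch of minibatch SGD with the two parameters listed above, fitting a one-weight linear model. The toy dataset (y = 2x), the function names, and the chosen η and batch size are all assumptions made for illustration.

```python
import random

def sgd_step(params, grads, lr):
    # w <- w - lr * g, applied to each parameter
    return [p - lr * g for p, g in zip(params, grads)]

def minibatch_grad(w, batch):
    # Loss is mean((w * x - y)^2); its derivative in w is mean(2 * (w * x - y) * x).
    return [sum(2.0 * (w[0] * x - y) * x for x, y in batch) / len(batch)]

random.seed(0)
data = [(0.1 * i, 2.0 * 0.1 * i) for i in range(1, 21)]  # toy data, y = 2x

w = [0.0]
for step in range(200):
    batch = random.sample(data, 4)            # batch size 4
    w = sgd_step(w, minibatch_grad(w, batch), lr=0.1)

print("learned weight:", w[0])
```

Each step sees only four samples, so the gradient is a noisy estimate of the full-data gradient; on this near-ideal convex problem the noise does not prevent convergence toward w = 2.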

SGD + Momentum

1964

When: Deep networks with noisy gradients. β = 0.9 is a reliable default; Nesterov looks ahead before computing the gradient.

Params: η, β (momentum coefficient).
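The difference between classical and Nesterov momentum can be sketched in a few lines. This uses the velocity form v ← βv − ηg, which is one of several equivalent conventions; the toy objective 0.5·w² and the names below are assumptions.

```python
def momentum_step(w, v, grad_fn, lr=0.1, beta=0.9, nesterov=False):
    # Nesterov evaluates the gradient at the look-ahead point w + beta * v;
    # classical momentum evaluates it at the current w.
    g = grad_fn(w + beta * v) if nesterov else grad_fn(w)
    v = beta * v - lr * g      # velocity: decayed history plus the new step
    return w + v, v

def minimize(nesterov, steps=200):
    grad_fn = lambda w: w      # gradient of the toy objective 0.5 * w**2
    w, v = 5.0, 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, grad_fn, nesterov=nesterov)
    return w

print("classical:", minimize(False))
print("nesterov: ", minimize(True))
```

Both variants reach the minimum at w = 0 here; on this quadratic the look-ahead gradient damps the oscillation somewhat faster.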

RMSprop

2012

When: RNNs and non-stationary objectives. Divides the learning rate by a running root-mean-square of each parameter's recent gradients, which equalizes step sizes across parameters.

Params: η, ρ ≈ 0.9, ε ≈ 1e-8.
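A sketch of one RMSprop step, assuming the common formulation with ε added outside the square root. The two gradients below differ by four orders of magnitude to show the equalizing effect; the variable names and values are illustrative.

```python
import math

def rmsprop_step(w, s, g, lr=0.01, rho=0.9, eps=1e-8):
    # s tracks a running average of squared gradients, per parameter
    s = [rho * si + (1 - rho) * gi * gi for si, gi in zip(s, g)]
    # each parameter's step is scaled by the root of its own average
    w = [wi - lr * gi / (math.sqrt(si) + eps) for wi, gi, si in zip(w, g, s)]
    return w, s

w, s = [0.0, 0.0], [0.0, 0.0]
g = [100.0, 0.01]                  # very different gradient magnitudes
w_new, s = rmsprop_step(w, s, g)
steps = [abs(a - b) for a, b in zip(w_new, w)]
print("step sizes:", steps)
```

Despite the 10,000× gap in gradient magnitude, both parameters move by nearly the same amount, because each gradient is divided by its own scale.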

Adam

2014

When: Default for most deep learning. Combines momentum with per-parameter adaptive scaling; bias-corrected at early steps.

Params: η ≈ 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
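The update can be sketched for a single scalar parameter as follows, using the defaults listed above. The function name and the example gradients are assumptions; the bias-correction terms divide by 1 − βᵗ so that the zero-initialized averages are not underestimated during the first steps.

```python
import math

def adam_step(w, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first moment: momentum-like average
    v = b2 * v + (1 - b2) * g * g      # second moment: squared-gradient average
    m_hat = m / (1 - b1 ** t)          # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w_big, _, _ = adam_step(1.0, 0.0, 0.0, g=50.0, t=1)
w_small, _, _ = adam_step(1.0, 0.0, 0.0, g=0.002, t=1)
print(w_big, w_small)
```

On the first step the corrected moments reduce to g and g², so the parameter moves by roughly η regardless of whether the gradient is 50 or 0.002.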

AdamW

2017

When: Transformers and large pretrained models. Decouples weight decay from the adaptive gradient scale — critical for regularization to work correctly.

Params: Same as Adam + λ (weight decay ≈ 0.01–0.1).
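To make the decoupling concrete, the sketch below contrasts Adam with an L2 penalty folded into the gradient against AdamW-style decay applied directly to the weight. It assumes the common convention of scaling the decay by the learning rate; the function names and the test values are illustrative.

```python
import math

def adam_direction(g, m, v, t, b1=0.9, b2=0.999, eps=1e-8):
    # Shared Adam machinery: returns the normalized update direction.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(w, m, v, g, t, lr=1e-3, wd=0.01):
    # L2 penalty: wd * w joins the gradient and is rescaled with it.
    u, m, v = adam_direction(g + wd * w, m, v, t)
    return w - lr * u, m, v

def adamw_step(w, m, v, g, t, lr=1e-3, wd=0.01):
    # Decoupled decay: wd * w bypasses the adaptive scaling entirely.
    u, m, v = adam_direction(g, m, v, t)
    return w - lr * u - lr * wd * w, m, v

# Zero loss gradient isolates the effect of the decay term on w = 10.
w_l2, _, _ = adam_l2_step(10.0, 0.0, 0.0, g=0.0, t=1)
w_dec, _, _ = adamw_step(10.0, 0.0, 0.0, g=0.0, t=1)
print(w_l2, w_dec)
```

With the L2 form, the penalty is normalized like any other gradient, so the weight shrinks by about η no matter what λ is. With the decoupled form, the shrinkage is η·λ·w, which is the proportional decay that λ is meant to control.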

Lion

2023

When: Large vision / multimodal training. Tracks only one moment buffer (vs. Adam's two), so optimizer-state memory is halved. Use a ~3–10× smaller η than you would for Adam.

Params: η, β₁ = 0.9, β₂ = 0.99, λ (weight decay).
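A sketch of the Lion update for one scalar parameter, written from the published update rule. The function names and the default η are assumptions. The step direction is the sign of an interpolation between the momentum buffer and the current gradient, and the buffer itself is updated with a separate coefficient.

```python
def sign(x):
    return (x > 0) - (x < 0)           # -1, 0, or 1

def lion_step(w, m, g, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    c = b1 * m + (1 - b1) * g          # interpolate buffer and gradient
    w = w - lr * (sign(c) + wd * w)    # sign update plus decoupled decay
    m = b2 * m + (1 - b2) * g          # the single state buffer
    return w, m

w_big, _ = lion_step(1.0, 0.0, g=123.0)
w_small, _ = lion_step(1.0, 0.0, g=-0.001)
print(w_big, w_small)
```

Every coordinate moves by exactly η per step, whatever the gradient magnitude, which is one reason a smaller learning rate than Adam's is typically used.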

Optimizer Comparison

Optimizer | Extra memory | Adaptive LR | Handles anisotropy | Best for
SGD | 0 | No | Poorly | CV with carefully tuned schedule; matches Adam with effort
Momentum | 1 buffer | No | Better | Deep CNNs, ResNets; standard for ImageNet training
RMSprop | 1 buffer | Yes | Yes | RNNs, RL with non-stationary reward signals
Adam | 2 buffers | Yes | Yes | Default: NLP, GANs, generative models
AdamW | 2 buffers | Yes | Yes | Transformers and LLMs; decoupled weight decay is essential
Lion | 1 buffer | Implicit | Partial | Large vision / multimodal training where memory is tight

Loss Landscape Explorer

Surface: f(x,y) = 0.05x² + 5y² — an elongated bowl with extreme curvature anisotropy (100× steeper in y than x). This exposes the core failure mode of fixed-LR SGD. Adjust the learning rate and observe the different trajectories.
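The failure mode can be reproduced outside the explorer with plain gradient descent. In this sketch the starting point, step count, and learning rates are assumptions. The gradient is (0.1x, 10y), so each step multiplies y by (1 − 10η) and x by (1 − 0.1η): the y direction is stable only for η < 0.2, and at that ceiling x still barely moves.

```python
def grad(x, y):
    # gradient of f(x, y) = 0.05 * x**2 + 5 * y**2
    return 0.1 * x, 10.0 * y

def run_gd(lr, steps=100, start=(10.0, 1.0)):
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

for lr in (0.01, 0.19, 0.21):
    x, y = run_gd(lr)
    print(f"lr={lr:<5} final x={x:.4g}  final y={y:.4g}")
```

Just below the stability limit, y oscillates toward zero while x has covered only part of the distance after 100 steps. Just above it, y diverges. No single fixed learning rate suits both axes, which is the gap that momentum and per-parameter scaling address.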

Optimizers: SGD, Momentum (β=0.9), RMSprop, Adam

Concepts Glossary

Plain-language definitions for the core terms used across the lab.