Neural Network Theory Laboratory

Neural Network Architectures

Interactive Neural Network Lab

Select a Network Type

Mathematical Foundation

Interactive Dataset Playground

Data Points

Instructions

• Click / tap to add points
• Left click → Class A (blue)
• Right click → Class B (red)
• Mobile: use A/B toggle above
• Load preset patterns above

Parameters

Learning Rate: 0.01

Epochs: 100

Batch Size: 32

Activation Function

Optimizer

Activation Function Explorer

Compare f(x) and f′(x) — the derivative is what backprop actually multiplies through each layer

f(x) — Activation output

f′(x) — Gradient signal through this layer
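
As a rough illustration of why f′(x) matters, the sketch below multiplies the local derivatives along a chain of layers. It ignores the weight matrices entirely and assumes every layer sees the same pre-activation value, so it shows only the activation's contribution to the gradient signal. The helper names (`sigmoid_grad`, `tanh_grad`, `relu_grad`, `chain_gradient`) are illustrative and not taken from the lab.

```python
import math

def sigmoid(x):
    """Logistic sigmoid f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """f'(x) = f(x) * (1 - f(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """f'(x) = 1 - tanh(x)^2; its maximum is 1 at x = 0."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    """f'(x) is 1 for positive inputs and 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chain_gradient(grad_fn, pre_activations):
    """Product of local derivatives, one factor per layer (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

depth = 10
sigmoid_signal = chain_gradient(sigmoid_grad, [0.0] * depth)  # 0.25 ** 10
relu_signal = chain_gradient(relu_grad, [1.0] * depth)        # stays at 1.0
print(sigmoid_signal, relu_signal)
```

Even at its best-case input, the sigmoid shrinks the signal by a factor of four per layer, while an active ReLU passes it through unchanged.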

Build Your Own Network

Drag layers from the palette to assemble a custom neural network architecture

Layer Palette

Input

Dense

Conv2D

MaxPool

LSTM

Attention

Dropout

BatchNorm

Flatten

Output

Network Architecture

Start with an Input layer, add hidden layers, end with Output
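
For a stack built this way, the parameter count can be estimated by hand. The sketch below covers only the simplest case, a chain of Dense layers with biases; Conv2D, LSTM, Attention and BatchNorm layers follow different formulas, and how the lab computes its own estimate is not stated. The function name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers.

    layer_sizes lists the width of each layer, input first and output last.
    Each Dense layer contributes n_in * n_out weights and n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# 2 inputs -> 8 hidden -> 8 hidden -> 1 output
print(dense_param_count([2, 8, 8, 1]))  # (16 + 8) + (64 + 8) + (8 + 1) = 105
```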

Network Preview

Performance Comparison

Loss Convergence (illustrative)

Schematic train/val curves showing typical convergence and generalization gap — not empirical data.

Architecture Comparison

Normalized 0–100 across 6 axes. CNN = EfficientNet-B7 (Tan & Le 2019); Transformer = BERT-L / ViT-L (Devlin et al. 2019, Dosovitskiy et al. 2021; architecture from Vaswani et al. 2017); LSTM = AWD-LSTM (Merity et al. 2018). Speed and memory reflect the architecture, not any specific hardware.

Performance Matrix

* Task Performance: normalized 0–100 score relative to each architecture's primary domain (ImageNet for vision, GLUE/PTB for sequence, FID-derived for generative). Sources: He et al. 2015, Tan & Le 2019, Vaswani et al. 2017, Merity et al. 2018, Ho et al. 2020, Gu & Dao 2023, Karras et al. 2020.

Research Insights

Pattern Discovery

On Long Range Arena (Tay et al. 2021), attention-based models outperform recurrent baselines by 15–40% depending on task, with the gap widening at sequences beyond 1,000 tokens. On WMT 2014 EN-DE, the big Transformer reached 28.4 BLEU, more than 2 BLEU above the best previously reported results, ensembles included, at a fraction of the training cost (Vaswani et al. 2017).

Hybrid Opportunity

CoAtNet (Dai et al. 2021) combines depthwise convolution with self-attention to achieve 90.9% ImageNet Top-1 accuracy with JFT-3B pre-training, outperforming pure ViT-L by 2.3 pp at comparable compute and setting the state of the art at the time. Without extra data, the same model family reaches 86.0%.

Mathematical Foundations

Understanding gradient flow and optimization dynamics is crucial for effective neural network design and training.

Evolution Timeline

Optimization Theory

How neural networks actually learn: first- and second-moment gradient methods, adaptive step sizes, and the geometry of the loss landscape.

SGD

1951

When: Large datasets, near-convex problems. Baseline for every comparison.

Params: η (learning rate), batch size.
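
A minimal sketch of the update, w ← w − η·g, where g is the gradient averaged over one mini-batch. The toy problem (fitting y = w·x by least squares) and the names `sgd_step` and `batch_gradient` are assumptions made for illustration. Batches are taken in a fixed order here so the run is reproducible; real training normally shuffles the data each epoch.

```python
def sgd_step(w, grad, lr):
    """One SGD update: move against the gradient, scaled by the learning rate."""
    return w - lr * grad

def batch_gradient(w, batch):
    """Mean gradient of 0.5 * (w*x - y)^2 over a mini-batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
w, lr, batch_size = 0.0, 0.05, 2

for epoch in range(100):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        w = sgd_step(w, batch_gradient(w, batch), lr)

print(w)  # approaches the true slope, 2.0
```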

SGD + Momentum

1964

When: Deep networks with noisy gradients. β = 0.9 is a reliable default; Nesterov looks ahead before computing the gradient.

Params: η, β (momentum coefficient).
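
One common way to write the update is v ← β·v + g followed by w ← w − η·v. The sketch below uses that form on a single scalar parameter, with an optional Nesterov look-ahead that evaluates the gradient at w − η·β·v instead of at w. The function name `momentum_step` and the toy objective f(w) = w² are assumptions; libraries differ in details such as dampening and where the learning rate enters.

```python
def momentum_step(w, v, grad_fn, lr=0.1, beta=0.9, nesterov=False):
    """One heavy-ball update; with nesterov=True the gradient is taken
    at the look-ahead point the velocity is about to carry us to."""
    lookahead = w - lr * beta * v if nesterov else w
    g = grad_fn(lookahead)
    v = beta * v + g          # velocity accumulates past gradients
    return w - lr * v, v

def grad_fn(w):
    """Gradient of the toy objective f(w) = w^2."""
    return 2.0 * w

w, v = 1.0, 0.0
for _ in range(2):
    w, v = momentum_step(w, v, grad_fn)

w_nag, v_nag = 1.0, 0.0
for _ in range(2):
    w_nag, v_nag = momentum_step(w_nag, v_nag, grad_fn, nesterov=True)

print(w, w_nag)
```

After two steps the plain variant has moved further (w ≈ 0.46) than the Nesterov variant (w ≈ 0.496), because the look-ahead gradient is smaller and damps the velocity earlier.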

RMSprop

2012

When: RNNs and non-stationary objectives. Divides the learning rate by a running root-mean-square of each parameter's recent gradients, so parameters with large and small gradients take steps of similar size.

Params: η, ρ ≈ 0.9, ε ≈ 1e-8.
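
The scaling can be sketched for a single scalar parameter as s ← ρ·s + (1 − ρ)·g², then w ← w − η·g / (√s + ε). The function name `rmsprop_step` is hypothetical, and the placement of ε (outside the square root here) varies between implementations.

```python
import math

def rmsprop_step(w, s, g, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter.

    s is the running average of squared gradients; dividing by its
    square root normalizes the step by the recent gradient magnitude.
    """
    s = rho * s + (1.0 - rho) * g * g
    return w - lr * g / (math.sqrt(s) + eps), s

# Two parameters whose gradients differ by a factor of 100
w_small, _ = rmsprop_step(1.0, 0.0, g=2.0)
w_large, _ = rmsprop_step(1.0, 0.0, g=200.0)
print(1.0 - w_small, 1.0 - w_large)  # nearly identical step sizes
```

On this first step both parameters move by about η / √(1 − ρ), regardless of the raw gradient size.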

Adam

2014

When: Default for most deep learning. Combines momentum with per-parameter adaptive scaling; bias-corrected at early steps.

Params: η ≈ 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
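
A minimal scalar sketch of the update with the defaults listed above. The function name `adam_step` and the toy objective f(w) = w² are assumptions for illustration; t is the 1-based step counter used by the bias correction.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment: running mean of g
    v = beta2 * v + (1.0 - beta2) * g * g    # second moment: running mean of g^2
    m_hat = m / (1.0 - beta1 ** t)           # undo the bias toward the zero init
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    g = 2.0 * w                              # gradient of f(w) = w^2
    w, m, v = adam_step(w, g, m, v, t)
print(w)
```

Because of the bias correction, the very first step has magnitude close to η whatever the gradient's scale: m_hat equals g and √v_hat equals |g|, so their ratio is ±1.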

AdamW

2017

When: Transformers and large pretrained models. Decouples weight decay from the adaptive gradient scale — critical for regularization to work correctly.

Params: Same as Adam + λ (weight decay ≈ 0.01–0.1).
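
To make the decoupling concrete, the sketch below runs one step with a zero loss gradient, so the only thing acting on the weight is the decay term. The function name `adam_like_step` and its `decoupled` flag are hypothetical. With `decoupled=False` the decay is added to the gradient (classic L2 regularization) and passes through Adam's normalization; with `decoupled=True` it is applied directly to the weight.

```python
import math

def adam_like_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01, decoupled=True):
    """One scalar Adam step with either decoupled decay (AdamW style)
    or L2 regularization folded into the gradient."""
    if not decoupled:
        g = g + weight_decay * w             # L2 term joins the gradient
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * weight_decay * w       # decay bypasses the adaptive scale
    return w_new, m, v

w_decoupled, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, 1, decoupled=True)
w_l2, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, 1, decoupled=False)
print(1.0 - w_decoupled, 1.0 - w_l2)
```

The decoupled step shrinks the weight by η·λ·w = 1e-5, as intended. The L2 variant shrinks it by roughly η = 1e-3, because the adaptive normalization rescales the small decay gradient to unit size, which makes the effective regularization strength depend on gradient history rather than on λ alone.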

Lion

2023

When: Large vision / multimodal training. Tracks only one moment (vs. Adam's two), so the optimizer state takes half the memory. Use an η roughly 3–10× smaller than Adam's.

Params: η, β₁ = 0.9, β₂ = 0.99, λ (weight decay).
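
A scalar sketch of the update, assuming the usual formulation: the step direction is the sign of an interpolation between the momentum and the current gradient (weighted by β₁), and the momentum itself is then updated with β₂. The names `sign` and `lion_step` are illustrative.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a scalar parameter; m is the only state kept."""
    update = sign(beta1 * m + (1.0 - beta1) * g)   # direction only, no magnitude
    w = w - lr * (update + weight_decay * w)       # decoupled weight decay
    m = beta2 * m + (1.0 - beta2) * g              # momentum refreshed afterwards
    return w, m

w_a, m_a = lion_step(1.0, g=2.0, m=0.0)
w_b, m_b = lion_step(1.0, g=200.0, m=0.0)
print(w_a, w_b)  # same step despite a 100x larger gradient
```

Every parameter moves by exactly η per step (before weight decay), which is one reason a smaller learning rate than Adam's is advised.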

Optimizer Comparison

Optimizer	Extra memory	Adaptive LR	Handles anisotropy	Best for
SGD	0	No	Poorly	CV with carefully tuned schedule; matches Adam with effort
Momentum	1 buffer	No	Better	Deep CNNs, ResNets — standard for ImageNet training
RMSprop	1 buffer	Yes	Yes	RNNs, RL with non-stationary reward signals
Adam	2 buffers	Yes	Yes	Default — NLP, GANs, generative models
AdamW	2 buffers	Yes	Yes	Transformers and LLMs — decoupled weight decay is essential
Lion	1 buffer	Implicit	Partial	Large vision / multimodal training where memory is tight

Loss Landscape Explorer

Surface: f(x,y) = 0.05x² + 5y² — an elongated bowl with extreme curvature anisotropy (100× steeper in y than x). This exposes the core failure mode of fixed-LR SGD. Adjust the learning rate and observe the different trajectories.

Learning rate 0.10

SGD Momentum (β=0.9) RMSprop Adam
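
The failure mode can be reproduced outside the explorer with plain fixed-step gradient descent on the same surface. The gradient is (0.1x, 10y), so each step multiplies y by (1 − 10η) and x by (1 − 0.1η). The y direction is stable only for η < 0.2, and at that limit x still shrinks by at most about 2% per step. The function names and the starting point (10, 1) are assumptions for illustration.

```python
def grad(x, y):
    """Gradient of f(x, y) = 0.05 x^2 + 5 y^2."""
    return 0.1 * x, 10.0 * y

def run_gd(x, y, lr, steps):
    """Plain gradient descent with a single fixed learning rate."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

x_ok, y_ok = run_gd(10.0, 1.0, lr=0.19, steps=50)     # just inside the stable range
x_bad, y_bad = run_gd(10.0, 1.0, lr=0.21, steps=50)   # just outside it
print(x_ok, y_ok)
print(x_bad, y_bad)
```

With η = 0.19 the steep y direction has settled after 50 steps while x has covered less than two thirds of the distance to the minimum; with η = 0.21 the y coordinate oscillates with growing amplitude. Adaptive methods avoid the trade-off by scaling each coordinate separately.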

Concepts Glossary

Plain-language definitions for the core terms used across the lab. Hover a dotted term anywhere on the page for a quick definition.