Deep Learning Fundamentals

30 essential concepts in deep learning and neural networks

What You'll Learn

Master deep learning with 30 comprehensive flashcards covering neural network architecture, activation functions, backpropagation, optimization, regularization, RNNs, LSTMs, CNNs, and deep RL for trading applications.

Key Topics

  • Neural network architecture: layers, weights, biases, activations
  • Activation functions: Sigmoid, ReLU, Softmax with pros/cons
  • Training algorithms: Forward/backpropagation, gradient descent, SGD
  • Regularization techniques: L1/L2, dropout, early stopping, batch norm
  • Advanced architectures: RNNs, LSTMs, CNNs for time series
  • Deep RL and transfer learning for trading applications

Looking for more machine learning resources? Visit the Explore page to browse related decks or use the Create Your Own Deck flow to customize this set.

How to study this deck

Start with a quick skim of the questions, then launch study mode to flip cards until you can answer each prompt without hesitation. Revisit tricky cards using shuffle or reverse order, and schedule a follow-up review within 48 hours to reinforce retention.

Preview: Deep Learning Fundamentals

Question

What are the main components of a Neural Network architecture?

Answer

1. Input Layer: Receives raw features
2. Hidden Layers: Intermediate processing layers
3. Output Layer: Produces predictions
4. Weights: Learnable parameters connecting neurons
5. Biases: Learnable offset parameters
6. Activation Functions: Non-linear transformations

Question

What is a neuron (node) in a neural network?

Answer

Basic computational unit that:
1. Receives inputs (x₁, x₂, ..., xₙ)
2. Computes a weighted sum: z = Σ(wᵢ × xᵢ) + b
3. Applies an activation function: a = f(z)
4. Passes its output to the next layer
Mimics the behavior of a biological neuron.
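
A minimal NumPy sketch of this computation; the input values, weights, and bias below are made-up numbers for illustration, and the activation is assumed to be ReLU.

    import numpy as np

    def neuron(x, w, b):
        """Single neuron: weighted sum of inputs plus bias, then an activation."""
        z = np.dot(w, x) + b          # z = Σ(wᵢ × xᵢ) + b
        return np.maximum(0.0, z)     # a = f(z), here f = ReLU

    # Hypothetical example with 3 input features
    x = np.array([0.5, -1.2, 3.0])
    w = np.array([0.4, 0.1, -0.2])
    b = 0.1
    print(neuron(x, w, b))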

Question

What is the Sigmoid activation function and its properties?

Answer

σ(z) = 1 / (1 + e⁻ᶻ)
Properties:
- Output range: (0, 1)
- S-shaped curve
- Smooth and differentiable
- Interpretable as a probability
Problems:
- Vanishing gradient for large |z|
- Not zero-centered
- Computationally expensive
Used mainly in the output layer for binary classification.

Question

What is the ReLU activation function and its advantages?

Answer

ReLU(z) = max(0, z)
Advantages:
✓ Simple computation
✓ No vanishing gradient for z > 0
✓ Sparse activation (many zeros)
✓ Faster training than sigmoid/tanh
✓ Biological plausibility
Disadvantages:
✗ Dead neurons (always output 0 if z < 0)
✗ Not differentiable at z = 0
The most popular choice for hidden layers.

Question

What is the Softmax activation function and when is it used?

Answer

softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ
Properties:
- Converts logits to probabilities
- All outputs sum to 1
- Output range: (0, 1)
Used for:
- Multi-class classification output layers
- Whenever a probability distribution over classes is needed
Example: Classify stock movement into {Strong Down, Down, Neutral, Up, Strong Up}
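
The three activation functions from the last few cards, sketched in NumPy; the logits at the end are made-up values for the five-class movement example above.

    import numpy as np

    def sigmoid(z):
        """σ(z) = 1 / (1 + e⁻ᶻ): squashes any real number into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    def relu(z):
        """ReLU(z) = max(0, z)."""
        return np.maximum(0.0, z)

    def softmax(z):
        """Converts a vector of logits into probabilities that sum to 1."""
        e = np.exp(z - np.max(z))   # subtract the max for numerical stability
        return e / e.sum()

    # Hypothetical logits for {Strong Down, Down, Neutral, Up, Strong Up}
    logits = np.array([0.2, 1.5, -0.3, 2.1, 0.0])
    print(softmax(logits))          # five probabilities summing to 1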

Question

What is forward propagation in neural networks?

Answer

Process of computing the output from the input:
1. The input layer receives the features
2. For each layer l:
   z⁽ˡ⁾ = W⁽ˡ⁾ × a⁽ˡ⁻¹⁾ + b⁽ˡ⁾
   a⁽ˡ⁾ = f(z⁽ˡ⁾)
3. The output layer produces the prediction
Data flows forward through the network. Used during both training and inference.
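
A minimal forward pass for one hidden layer and one output layer, assuming ReLU in the hidden layer; the shapes and random weights are illustrative only.

    import numpy as np

    def forward(x, params):
        """Forward pass: layer-by-layer weighted sums followed by activations."""
        W1, b1, W2, b2 = params
        z1 = W1 @ x + b1            # z⁽¹⁾ = W⁽¹⁾ a⁽⁰⁾ + b⁽¹⁾
        a1 = np.maximum(0.0, z1)    # a⁽¹⁾ = ReLU(z⁽¹⁾)
        z2 = W2 @ a1 + b2           # z⁽²⁾ = W⁽²⁾ a⁽¹⁾ + b⁽²⁾
        return z2                   # raw output, e.g. a regression prediction

    # Hypothetical shapes: 4 input features, 8 hidden units, 1 output
    rng = np.random.default_rng(0)
    params = (rng.normal(size=(8, 4)), np.zeros(8),
              rng.normal(size=(1, 8)), np.zeros(1))
    print(forward(rng.normal(size=4), params))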

Question

What is a loss function and what are common types?

Answer

Measures the difference between the prediction and the actual value.
Regression:
- Mean Squared Error (MSE): (1/n) Σ(ŷ - y)²
- Mean Absolute Error (MAE): (1/n) Σ|ŷ - y|
Classification:
- Binary Cross-Entropy: -Σ[y log(ŷ) + (1-y) log(1-ŷ)]
- Categorical Cross-Entropy: -Σ y log(ŷ)
Goal: Minimize the loss during training.
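
The most common losses above, sketched in NumPy; the predictions and targets are made-up values.

    import numpy as np

    def mse(y_hat, y):
        """Mean Squared Error: (1/n) Σ(ŷ - y)²"""
        return np.mean((y_hat - y) ** 2)

    def mae(y_hat, y):
        """Mean Absolute Error: (1/n) Σ|ŷ - y|"""
        return np.mean(np.abs(y_hat - y))

    def binary_cross_entropy(y_hat, y, eps=1e-12):
        """Binary Cross-Entropy; predictions are clipped to avoid log(0)."""
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    # Hypothetical predictions vs. targets
    y = np.array([1.0, 0.0, 1.0])
    y_hat = np.array([0.9, 0.2, 0.6])
    print(mse(y_hat, y), mae(y_hat, y), binary_cross_entropy(y_hat, y))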

Question

What is backpropagation?

Answer

Algorithm for computing the gradients of the loss with respect to all weights:
1. Forward pass: Compute predictions
2. Compute the loss
3. Backward pass: Propagate the error backwards
4. Use the chain rule to compute ∂Loss/∂W for each weight
5. Update the weights using gradient descent
Enables efficient training of deep networks. The core algorithm behind neural network learning.
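
A minimal sketch of one backpropagation step through a single ReLU hidden layer with a squared-error loss; the data, shapes, and learning rate are made-up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=4), 1.0                 # one sample, scalar target
    W1, b1 = 0.1 * rng.normal(size=(8, 4)), np.zeros(8)
    W2, b2 = 0.1 * rng.normal(size=(1, 8)), np.zeros(1)

    # 1. Forward pass
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)
    y_hat = (W2 @ a1 + b2)[0]
    loss = (y_hat - y) ** 2                        # 2. compute the loss

    # 3.-4. Backward pass: chain rule, layer by layer
    d_yhat = 2 * (y_hat - y)                       # ∂Loss/∂ŷ
    dW2 = d_yhat * a1[None, :]                     # ∂Loss/∂W2
    db2 = np.array([d_yhat])
    da1 = d_yhat * W2[0]                           # propagate the error to the hidden layer
    dz1 = da1 * (z1 > 0)                           # ReLU derivative
    dW1 = np.outer(dz1, x)                         # ∂Loss/∂W1
    db1 = dz1

    # 5. Gradient descent update
    lr = 0.01
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2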

Question

What is Gradient Descent and how does it work?

Answer

Optimization algorithm to minimize the loss:
1. Initialize the weights randomly
2. Compute the loss on the training data
3. Compute the gradients: ∂Loss/∂W
4. Update the weights: W = W - η × ∂Loss/∂W
5. Repeat until convergence
η (eta) = learning rate. Iteratively moves the weights toward lower loss.
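
A toy gradient descent loop on the one-dimensional loss f(w) = (w - 3)², an assumed example rather than anything from the deck; the minimum is at w = 3.

    w = 0.0                     # 1. initialize
    eta = 0.1                   # learning rate
    for step in range(50):
        grad = 2 * (w - 3)      # 3. ∂Loss/∂w
        w = w - eta * grad      # 4. update step
    print(w)                    # ≈ 3: the loop has walked downhill to the minimum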

Question

What is the learning rate (η) and how does it affect training?

Answer

The learning rate controls the step size in gradient descent.
Too large (η high):
- Fast initially
- May overshoot the minimum
- Training becomes unstable or diverges
Too small (η low):
- Stable but very slow
- May get stuck in a local minimum
- Takes very long to converge
Typical values: 0.001 - 0.1
Often combined with learning rate schedules (decay over time).

Question

What is Stochastic Gradient Descent (SGD)?

Answer

Variant of gradient descent using mini-batches:
- Batch GD: Uses the entire dataset per update (slow)
- Stochastic GD: Uses one sample per update (noisy)
- Mini-batch SGD: Uses a small batch (e.g., 32 samples)
Advantages:
✓ Faster updates
✓ Can escape local minima (the noise helps)
✓ Memory efficient
✓ Online learning possible
The most common choice in practice.
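
A sketch of one mini-batch SGD epoch on a plain linear model with MSE loss; the data, model, and hyperparameters are made-up stand-ins, not part of the deck.

    import numpy as np

    def minibatch_sgd_epoch(X, y, w, lr=0.01, batch_size=32):
        """One epoch: shuffle the samples, then take one gradient step per mini-batch."""
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient on this batch only
            w = w - lr * grad                              # one iteration = one update
        return w

    # Hypothetical data: 1000 samples, 5 features
    X = np.random.randn(1000, 5)
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(1000)
    w = minibatch_sgd_epoch(X, y, np.zeros(5))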

Question

What is overfitting in neural networks and how do you detect it?

Answer

The model learns the training data too well and fails to generalize.
Signs:
- Training loss keeps decreasing
- Validation loss increases or plateaus
- Large gap between training and validation accuracy
- The model memorizes noise, not patterns
Causes:
- Too many parameters
- Too complex a network
- Training for too long
- Insufficient training data

Question

What is L2 Regularization (Ridge) in neural networks?

Answer

Adds a penalty for large weights to the loss function:
Loss_total = Loss_original + λ × Σ(W²)
λ = regularization strength
Effect:
- Penalizes large weights
- Encourages smaller, distributed weights
- Reduces model complexity
- Prevents overfitting
Also called weight decay. The most common regularization technique.
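
A minimal sketch of adding the L2 penalty to an existing loss and gradient; the weights and the data-term loss/gradient below are hypothetical placeholders.

    import numpy as np

    def l2_regularized(w, data_loss, data_grad, lam=1e-3):
        """Add λ × Σ(W²) to the loss and the matching 2λW term to the gradient."""
        loss = data_loss + lam * np.sum(w ** 2)
        grad = data_grad + 2 * lam * w
        return loss, grad

    # Hypothetical weights and data-term gradient
    w = np.array([0.5, -1.5, 2.0])
    loss, grad = l2_regularized(w, data_loss=0.8, data_grad=np.array([0.1, -0.2, 0.3]))
    print(loss, grad)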

Question

What is L1 Regularization (Lasso) and how does it differ from L2?

Answer

L1: Loss_total = Loss_original + λ × Σ|W|
Differences from L2:
- L1 produces sparse weights (many exactly 0)
- L1 performs feature selection
- L2 produces small but non-zero weights
- L2 is generally preferred for neural networks
In short: L1 = feature selection, L2 = weight shrinkage.

Question

What is Dropout regularization?

Answer

Randomly drop (set to 0) neurons during training:
- Each neuron has probability p of being dropped (typically p = 0.5)
- Different neurons are dropped on each iteration
- All neurons are active during testing, with activations scaled so their expected values match training
Effects:
✓ Prevents co-adaptation of neurons
✓ Forces redundant representations
✓ Acts like an ensemble of many networks
✓ Reduces overfitting
Very effective and widely used.
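
A sketch of "inverted" dropout, the common implementation that does the scaling during training so nothing changes at test time; the activations below are made-up values.

    import numpy as np

    def dropout(a, p=0.5, training=True):
        """Zero out each activation with probability p and scale the survivors by 1/(1-p)."""
        if not training:
            return a                                    # all neurons active at test time
        mask = (np.random.rand(*a.shape) >= p) / (1 - p)
        return a * mask

    # Hypothetical hidden-layer activations
    a = np.array([0.2, 1.5, 0.7, 2.1])
    print(dropout(a, p=0.5))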

Question

What is Early Stopping?

Answer

Regularization technique that stops training when validation performance stops improving:
1. Monitor the validation loss during training
2. If the validation loss doesn't improve for N epochs (the "patience")
3. Stop training and keep the best weights
Prevents overfitting by:
- Not training for too long
- Finding the sweet spot before memorization
- Automatically determining when to stop
Simple but effective.
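
A minimal early-stopping loop. The helpers train_one_epoch, validation_loss, get_weights, and set_weights are hypothetical placeholders standing in for whatever training code you already have.

    def fit_with_early_stopping(model, patience=5, max_epochs=200):
        best_loss, best_weights, epochs_without_improvement = float("inf"), None, 0
        for epoch in range(max_epochs):
            train_one_epoch(model)                  # hypothetical helper
            val_loss = validation_loss(model)       # hypothetical helper
            if val_loss < best_loss:
                best_loss, best_weights = val_loss, model.get_weights()
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                           # no improvement for `patience` epochs
        model.set_weights(best_weights)             # restore the best checkpoint
        return model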

Question

What is Batch Normalization?

Answer

Normalizes the inputs to each layer during training. For each mini-batch:
1. Normalize: x̂ = (x - μ_batch) / √(σ²_batch + ε)
2. Scale and shift: y = γx̂ + β (γ and β are learnable)
Benefits:
✓ Faster training
✓ Higher learning rates possible
✓ Less sensitive to initialization
✓ Acts as a regularizer
✓ Reduces internal covariate shift
Used in most modern architectures.
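
A minimal sketch of the batch-norm forward pass (training-time statistics only, ignoring the running averages used at inference); the mini-batch below is made-up data.

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """Normalize a mini-batch per feature, then scale and shift with learnable γ, β."""
        mu = x.mean(axis=0)                         # per-feature batch mean
        var = x.var(axis=0)                         # per-feature batch variance
        x_hat = (x - mu) / np.sqrt(var + eps)       # normalize
        return gamma * x_hat + beta                 # scale and shift

    # Hypothetical mini-batch of 32 samples with 4 features
    x = np.random.randn(32, 4) * 3 + 10
    out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ≈0 mean, ≈1 std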

Question

What is a Deep Neural Network (DNN)?

Answer

A neural network with multiple (2+) hidden layers:
Input → Hidden₁ → Hidden₂ → ... → Hiddenₙ → Output
Deep = many layers.
Advantages:
- Can learn hierarchical features
- More expressive than shallow networks
- Better for complex tasks
Challenges:
- Harder to train (vanishing gradients)
- More data needed
- More computation required

Question

What is the vanishing gradient problem?

Answer

Gradients become extremely small in the early layers of deep networks.
Causes:
- The chain rule multiplies many small numbers
- Sigmoid/tanh saturate (gradient near 0)
- Deep networks compound the problem
Effects:
- Early layers learn very slowly
- The network fails to train properly
Solutions:
- ReLU activation
- Batch normalization
- Residual connections (ResNet)
- Better initialization

Question

What is the exploding gradient problem?

Answer

Gradients become extremely large during backpropagation.
Causes:
- The chain rule multiplies many large numbers
- Poor weight initialization
- Deep networks
Effects:
- Weights update by huge amounts
- Training becomes unstable
- Loss diverges (NaN)
Solutions:
- Gradient clipping (cap the maximum gradient)
- Better initialization (Xavier, He)
- Batch normalization
- Lower learning rate
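
A sketch of gradient clipping by norm, the standard fix listed above; the example gradient values are made-up.

    import numpy as np

    def clip_by_norm(grad, max_norm=1.0):
        """Rescale a gradient so its L2 norm never exceeds max_norm; direction is preserved."""
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    # Hypothetical exploding gradient
    g = np.array([120.0, -80.0, 300.0])
    print(clip_by_norm(g, max_norm=5.0))   # same direction, norm capped at 5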

Question

What are Recurrent Neural Networks (RNNs) and when are they used?

Answer

Neural networks with loops that maintain a hidden state:
h_t = f(W_h × h_{t-1} + W_x × x_t + b)
Properties:
- Process sequential data
- Maintain a memory of previous inputs
- Share weights across time steps
Use cases:
- Time series forecasting (stock prices)
- Natural language processing
- Speech recognition
- Any sequential data
Challenge: Vanishing gradients over long sequences.
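
A sketch of the recurrence above for a simple (Elman-style) RNN with a tanh activation; the sequence, sizes, and random weights are illustrative only.

    import numpy as np

    def rnn_forward(xs, W_h, W_x, b):
        """Run the RNN over a sequence, reusing the same weights at every time step."""
        h = np.zeros(W_h.shape[0])
        states = []
        for x_t in xs:
            h = np.tanh(W_h @ h + W_x @ x_t + b)    # h_t = f(W_h h_{t-1} + W_x x_t + b)
            states.append(h)
        return np.array(states)

    # Hypothetical sequence: 10 time steps, 3 features each, hidden size 5
    rng = np.random.default_rng(0)
    xs = rng.normal(size=(10, 3))
    states = rnn_forward(xs, 0.1 * rng.normal(size=(5, 5)),
                         0.1 * rng.normal(size=(5, 3)), np.zeros(5))
    print(states.shape)   # (10, 5): one hidden state per time step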

Question

What is an LSTM (Long Short-Term Memory)?

Answer

Advanced RNN designed to address the vanishing gradient problem.
Components:
1. Forget gate: What to forget from memory
2. Input gate: What new information to store
3. Output gate: What to output
4. Cell state: Long-term memory
Advantages over a plain RNN:
✓ Learns long-term dependencies
✓ Greatly mitigates the vanishing gradient problem
✓ Better for long sequences
Used for: Stock price prediction, language models

Question

What are Convolutional Neural Networks (CNNs)?

Answer

Neural networks built on convolution operations.
Key components:
1. Convolutional layers: Extract local features
2. Pooling layers: Downsample, reduce dimensions
3. Fully connected layers: Final classification
Properties:
- Translation invariance (a pattern is detected regardless of where it appears)
- Parameter sharing
- Hierarchical feature learning
Use cases:
- Image recognition
- Pattern recognition in trading charts
- Technical analysis automation
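
A minimal sketch of the sliding-window idea behind a 1D convolutional layer, applied to a made-up price series; real convolutional layers learn the kernel weights instead of hand-picking them.

    import numpy as np

    def conv1d(series, kernel):
        """Slide a small kernel over the series and take a dot product at each position
        (valid mode, no padding; like deep-learning libraries, no kernel flipping)."""
        k = len(kernel)
        return np.array([series[i:i + k] @ kernel
                         for i in range(len(series) - k + 1)])

    # Hypothetical price series and a 3-tap difference filter
    prices = np.array([100., 101., 103., 102., 105., 107., 106.])
    kernel = np.array([-1.0, 0.0, 1.0])
    print(conv1d(prices, kernel))   # one local feature per window position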

Question

How are neural networks used for time series forecasting in trading?

Answer

Approaches:
1. Feed-forward NN:
   - Input: Lagged prices, indicators
   - Output: Next-day return
2. RNN/LSTM:
   - Input: Sequence of prices/indicators
   - Output: Future price movement
3. CNN:
   - Input: Chart patterns as images
   - Output: Buy/sell signals
Challenges:
- Financial data is noisy
- Non-stationary distributions
- High risk of overfitting
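
A sketch of preparing lagged inputs for the feed-forward approach above: each row of X holds the last few returns and y is the following return. The prices and window length are made-up.

    import numpy as np

    def make_lagged_dataset(prices, n_lags=5):
        """Turn a price series into (X, y): X = the last n_lags returns, y = the next return."""
        returns = np.diff(prices) / prices[:-1]
        X = np.array([returns[i:i + n_lags] for i in range(len(returns) - n_lags)])
        y = returns[n_lags:]
        return X, y

    # Hypothetical daily closing prices
    prices = np.array([100., 101., 99., 102., 104., 103., 106., 108., 107., 110.])
    X, y = make_lagged_dataset(prices, n_lags=3)
    print(X.shape, y.shape)   # (6, 3) and (6,)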

Question

What is Deep Reinforcement Learning (Deep RL)?

Answer

Combines deep learning with reinforcement learning. Instead of a Q-table:
- A neural network approximates Q(s, a)
- Input: State (market conditions)
- Output: Q-values for each action
Advantages:
✓ Handles high-dimensional states
✓ Generalizes to unseen states
✓ Learns complex patterns
Examples:
- Deep Q-Networks (DQN)
- AlphaGo
- Algorithmic trading agents

Question

What is transfer learning in neural networks?

Answer

Using a pre-trained model for a new but related task.
Process:
1. Take a model trained on a large dataset
2. Remove the last layer(s)
3. Add new layers for your task
4. Fine-tune on your data
Benefits:
✓ Requires less training data
✓ Faster training
✓ Better performance
✓ Leverages learned features
Trading example: Train a model on all stocks, then adapt it to a specific stock.

Question

What is the difference between epochs, batches, and iterations?

Answer

Epoch: One complete pass through the entire training dataset.
Batch: Subset of the training data used in one gradient update.
- Batch size: Number of samples per batch (e.g., 32)
Iteration: One gradient update (processing one batch).
Relationship: Iterations per epoch = Training samples / Batch size
Example: 1000 samples with batch size 32 → 31 full batches per epoch (32 iterations if the final partial batch is used).
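
The arithmetic from the example as a two-line check; the sample count and batch size are just the numbers used above.

    import math

    n_samples, batch_size = 1000, 32
    print(n_samples // batch_size)              # 31 full batches
    print(math.ceil(n_samples / batch_size))    # 32 updates if the last partial batch is used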

Question

What are common optimizers beyond basic SGD?

Answer

Advanced optimizers:
1. Momentum: Accumulates past gradients
   - Accelerates in consistent directions
   - Dampens oscillations
2. Adam (Adaptive Moment Estimation):
   - Adaptive learning rates per parameter
   - Combines momentum + RMSProp
   - The most popular optimizer
3. RMSProp: Adapts the learning rate based on recent gradients
4. AdaGrad: Larger updates for infrequent features
Adam is typically the best default choice.
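
A sketch of a single Adam update, showing how the momentum-style first moment and the RMSProp-style second moment combine; the parameter vector, gradient, and hyperparameters are illustrative defaults.

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update with bias-corrected first and second moments."""
        m = beta1 * m + (1 - beta1) * grad           # first moment (running mean of gradients)
        v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (running mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)                 # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
        return w, m, v

    # Hypothetical parameter vector and gradient, first step (t = 1)
    w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
    w, m, v = adam_step(w, np.array([0.5, -1.0, 0.2]), m, v, t=1)
    print(w)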

Question

What is the Universal Approximation Theorem?

Answer

A neural network with:
- A single hidden layer
- Sufficiently many neurons
- A non-linear activation
can approximate any continuous function (on a bounded input domain) to arbitrary accuracy.
Implications:
- NNs are extremely powerful
- Theoretical guarantee of expressiveness
- But: may require a huge network and be hard to train
- Deep networks are often more practical

Question

What are the main challenges of using neural networks for trading?

Answer

1. Data limitations:
   - Limited historical data
   - Non-stationarity (market regimes change)
2. Overfitting:
   - Easy to memorize patterns
   - Hard to validate properly
3. Computational cost:
   - Training time
   - Infrastructure needs
4. Interpretability:
   - Black-box models
   - Hard to understand decisions
5. Transaction costs:
   - The model doesn't account for execution
6. Market efficiency:
   - If a pattern is obvious, it has already been exploited