Deep Learning Fundamentals
30 essential concepts in deep learning and neural networks
What You'll Learn
Master deep learning with 30 comprehensive flashcards covering neural network architecture, activation functions, backpropagation, optimization, regularization, RNNs, LSTMs, CNNs, and deep RL for trading applications.
Key Topics
- Neural network architecture: layers, weights, biases, activations
- Activation functions: Sigmoid, ReLU, Softmax with pros/cons
- Training algorithms: Forward/backpropagation, gradient descent, SGD
- Regularization techniques: L1/L2, dropout, early stopping, batch norm
- Advanced architectures: RNNs, LSTMs, CNNs for time series
- Deep RL and transfer learning for trading applications
Looking for more machine learning resources? Visit the Explore page to browse related decks or use the Create Your Own Deck flow to customize this set.
How to study this deck
Start with a quick skim of the questions, then launch study mode to flip cards until you can answer each prompt without hesitation. Revisit tricky cards using shuffle or reverse order, and schedule a follow-up review within 48 hours to reinforce retention.
Preview: Deep Learning Fundamentals
Question
What are the main components of a Neural Network architecture?
Answer
1. Input Layer: Receives raw features
2. Hidden Layers: Intermediate processing layers
3. Output Layer: Produces predictions
4. Weights: Learnable parameters connecting neurons
5. Biases: Learnable offset parameters
6. Activation Functions: Non-linear transformations
Question
What is a neuron (node) in a neural network?
Answer
Basic computational unit that:
1. Receives inputs (x₁, x₂, ..., xₙ)
2. Computes weighted sum: z = Σ(wᵢ × xᵢ) + b
3. Applies activation function: a = f(z)
4. Passes output to next layer
Mimics biological neuron behavior.
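To make the weighted-sum-plus-activation step concrete, here is a minimal NumPy sketch of a single neuron; the names and values are illustrative, not taken from any particular library.

import numpy as np

def neuron(x, w, b):
    # Weighted sum of inputs plus bias: z = sum(w_i * x_i) + b
    z = np.dot(w, x) + b
    # Non-linear activation, here ReLU: a = max(0, z)
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.2])   # learnable weights
b = 0.1                          # learnable bias
print(neuron(x, w, b))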
Question
What is the Sigmoid activation function and its properties?
Answer
σ(z) = 1 / (1 + e⁻ᶻ)
Properties:
- Output range: (0, 1)
- S-shaped curve
- Smooth and differentiable
- Interpretable as a probability
Problems:
- Vanishing gradient for large |z|
- Not zero-centered
- More expensive to compute than ReLU
Used mainly in the output layer for binary classification.
Question
What is the ReLU activation function and its advantages?
Answer
ReLU(z) = max(0, z)
Advantages:
✓ Simple computation
✓ No vanishing gradient for z > 0
✓ Sparse activation (many zeros)
✓ Faster training than sigmoid/tanh
✓ Biological plausibility
Disadvantages:
✗ Dead neurons (always output 0 if z < 0)
✗ Not differentiable at z = 0
Most popular for hidden layers.
Question
What is the Softmax activation function and when is it used?
Answer
softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ
Properties:
- Converts logits to probabilities
- All outputs sum to 1
- Output range: (0, 1)
Used for:
- Multi-class classification output layers
- When a probability distribution over classes is needed
Example: Classify stock movement into {Strong Down, Down, Neutral, Up, Strong Up}
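A minimal NumPy sketch of the three activations discussed on these cards; subtracting the maximum logit in softmax is a standard numerical-stability trick, and the logits are made up for illustration.

import numpy as np

def sigmoid(z):
    # Maps any real z to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the max for numerical stability; outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(sigmoid(logits), relu(logits), softmax(logits))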
Question
What is forward propagation in neural networks?
Answer
Process of computing output from input:
1. Input layer receives features
2. For each layer l:
   z⁽ˡ⁾ = W⁽ˡ⁾ × a⁽ˡ⁻¹⁾ + b⁽ˡ⁾
   a⁽ˡ⁾ = f(z⁽ˡ⁾)
3. Output layer produces prediction
Data flows forward through the network. Used during both training and inference.
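A minimal sketch of the forward pass as a loop over layers, assuming small hand-built weight matrices; for simplicity the same ReLU activation is applied at every layer, including the output.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    # Repeatedly apply z = W a + b, then a = f(z), layer by layer
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = relu(z)
    return a

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, 0.5, -0.2]), weights, biases))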
Question
What is a loss function and what are common types?
Answer
Measures the difference between prediction and actual value.
Regression:
- Mean Squared Error (MSE): (1/n)Σ(ŷ - y)²
- Mean Absolute Error (MAE): (1/n)Σ|ŷ - y|
Classification:
- Binary Cross-Entropy: -Σ[y log(ŷ) + (1-y)log(1-ŷ)]
- Categorical Cross-Entropy: -Σ(y × log(ŷ))
Goal: Minimize loss during training.
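Illustrative NumPy implementations of three of these losses; the clipping constant in the cross-entropy is a common guard against log(0), not part of the definition.

import numpy as np

def mse(y, y_hat):
    return np.mean((y_hat - y) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y_hat - y))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(mse(y, y_hat), mae(y, y_hat), binary_cross_entropy(y, y_hat))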
Question
What is backpropagation?
Answer
Algorithm for computing gradients of the loss with respect to all weights:
1. Forward pass: Compute predictions
2. Compute loss
3. Backward pass: Propagate error backwards
4. Use chain rule to compute ∂Loss/∂W for each weight
5. Update weights using gradient descent
Enables efficient training of deep networks. Core algorithm for neural network learning.
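A minimal sketch of one forward and backward pass through a one-hidden-layer regression network, showing how the chain rule produces ∂Loss/∂W for each layer; all shapes and data are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))             # 8 samples, 3 features
y = rng.normal(size=(8, 1))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

# Forward pass
z1 = X @ W1 + b1
a1 = np.maximum(0.0, z1)                # ReLU
y_hat = a1 @ W2 + b2
loss = np.mean((y_hat - y) ** 2)        # MSE

# Backward pass (chain rule, layer by layer)
d_yhat = 2 * (y_hat - y) / len(X)       # dLoss/dy_hat
dW2 = a1.T @ d_yhat
db2 = d_yhat.sum(axis=0)
d_a1 = d_yhat @ W2.T
d_z1 = d_a1 * (z1 > 0)                  # ReLU derivative
dW1 = X.T @ d_z1
db1 = d_z1.sum(axis=0)
print(loss, dW1.shape, dW2.shape)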
Question
What is Gradient Descent and how does it work?
Answer
Optimization algorithm to minimize the loss:
1. Initialize weights randomly
2. Compute loss on training data
3. Compute gradients: ∂Loss/∂W
4. Update weights: W = W - η × ∂Loss/∂W
5. Repeat until convergence
η (eta) = learning rate
Iteratively moves the weights toward lower loss.
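A minimal sketch of the loop on a toy one-parameter loss L(w) = (w - 3)², whose minimum is at w = 3; the learning rate is an arbitrary illustrative choice.

w = 0.0          # step 1: initialize
eta = 0.1        # learning rate
for step in range(100):
    grad = 2 * (w - 3.0)   # step 3: compute gradient dL/dw
    w = w - eta * grad     # step 4: update weights
print(w)  # converges toward 3.0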
Question
What is the learning rate (η) and how does it affect training?
Answer
The learning rate controls the step size in gradient descent.
Too large (η high):
- Fast initially
- May overshoot the minimum
- Training unstable/diverges
Too small (η low):
- Stable but very slow
- May get stuck in a local minimum
- Takes very long to converge
Typical values: 0.001 - 0.1
Learning rate schedules (decay over time) are often used.
Question
What is Stochastic Gradient Descent (SGD)?
Answer
Variant of gradient descent that updates on subsets of the data:
- Batch GD: Uses the entire dataset per update (slow)
- Stochastic GD: Uses one sample per update (noisy)
- Mini-batch SGD: Uses a small batch (e.g., 32 samples)
Advantages:
✓ Faster updates
✓ Can escape local minima (noise helps)
✓ Memory efficient
✓ Online learning possible
Mini-batch SGD is the most common choice in practice.
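A minimal sketch of a mini-batch SGD loop on a toy linear-regression problem; the batch size, learning rate, and synthetic data are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
eta, batch_size = 0.01, 32
for epoch in range(10):
    order = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # MSE gradient on the batch
        w -= eta * grad                              # one iteration = one update
print(w)  # approaches the true coefficients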
Question
What is overfitting in neural networks and how do you detect it?
Answer
Model learns the training data too well and fails to generalize.
Signs:
- Training loss keeps decreasing
- Validation loss increases or plateaus
- Large gap between training and validation accuracy
- Model memorizes noise, not patterns
Causes:
- Too many parameters
- Too complex a network
- Training too long
- Insufficient training data
Question
What is L2 Regularization (Ridge) in neural networks?
Answer
Adds a penalty for large weights to the loss function:
Loss_total = Loss_original + λ × Σ(W²)
λ = regularization strength
Effect:
- Penalizes large weights
- Encourages smaller, distributed weights
- Reduces model complexity
- Prevents overfitting
Also called weight decay. The most common regularization technique.
Question
What is L1 Regularization (Lasso) and how does it differ from L2?
Answer
L1: Loss_total = Loss_original + λ × Σ|W|
Differences from L2:
- L1 produces sparse weights (many exactly 0)
- L1 performs feature selection
- L2 produces small but non-zero weights
- L2 generally preferred for neural networks
L1: Feature selection
L2: Weight shrinkage
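A minimal sketch of how the two penalties (and their gradient contributions) would be added to an existing loss; the function names and λ value are illustrative, not from any library.

import numpy as np

def regularized_loss(base_loss, W, lam, kind="l2"):
    # Adds an L1 or L2 penalty on the weights to the original loss.
    if kind == "l2":
        return base_loss + lam * np.sum(W ** 2)    # Ridge / weight decay
    return base_loss + lam * np.sum(np.abs(W))     # Lasso (encourages sparsity)

def penalty_grad(W, lam, kind="l2"):
    # Gradient contribution of the penalty term alone.
    return 2 * lam * W if kind == "l2" else lam * np.sign(W)

W = np.array([0.5, -0.2, 0.0, 1.5])
print(regularized_loss(1.0, W, lam=0.01, kind="l2"), penalty_grad(W, 0.01, "l1"))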
Question
What is Dropout regularization?
Answer
Randomly drops (sets to 0) neurons during training:
- Each neuron has probability p of being dropped (typically p = 0.5)
- Different neurons are dropped each iteration
- All neurons are active during testing (outputs scaled to compensate, or inverted dropout scales during training)
Effects:
✓ Prevents co-adaptation of neurons
✓ Forces redundant representations
✓ Acts like an ensemble of many networks
✓ Reduces overfitting
Very effective, widely used technique.
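A minimal sketch of inverted dropout, the common formulation in which activations are scaled at training time so that no scaling is needed at test time.

import numpy as np

def dropout(a, p_drop, training, rng):
    # Inverted dropout: zero out a fraction p_drop of activations during
    # training and scale the survivors so the expected value is unchanged.
    # At test time the layer passes activations through unchanged.
    if not training:
        return a
    mask = (rng.random(a.shape) >= p_drop).astype(a.dtype)
    return a * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
activations = np.ones(10)
print(dropout(activations, p_drop=0.5, training=True, rng=rng))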
Question
What is Early Stopping?
Answer
Regularization technique that stops training when validation performance stops improving:
1. Monitor validation loss during training
2. If validation loss doesn't improve for N epochs (the patience)
3. Stop training and use the best weights
Prevents overfitting by:
- Not training too long
- Finding the sweet spot before memorization
- Automatically determining when to stop
Simple but effective method.
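A minimal, self-contained sketch of the patience logic; the val_losses list stands in for the validation loss your real training loop would produce at the end of each epoch.

val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.47, 0.48, 0.50, 0.51, 0.52, 0.53]

best_val, best_epoch = float("inf"), -1
patience, wait = 3, 0
for epoch, val_loss in enumerate(val_losses):
    # ... one epoch of training would happen here ...
    if val_loss < best_val:
        best_val, best_epoch, wait = val_loss, epoch, 0   # checkpoint best weights here
    else:
        wait += 1
        if wait >= patience:          # no improvement for `patience` epochs
            break
print(f"stopped after epoch {epoch}, best epoch was {best_epoch} (val loss {best_val})")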
Question
What is Batch Normalization?
Answer
Normalizes the inputs to each layer during training. For each mini-batch:
1. Normalize: x̂ = (x - μ_batch) / σ_batch
2. Scale and shift: y = γx̂ + β
Benefits:
✓ Faster training
✓ Higher learning rates possible
✓ Less sensitive to initialization
✓ Acts as regularization
✓ Reduces internal covariate shift
Used in most modern architectures.
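A minimal sketch of the training-time forward computation; the running averages used at inference time are omitted, and the batch data is synthetic.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then scale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(32, 4))
out = batch_norm(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(6))  # approximately 0 and 1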
Question
What is a Deep Neural Network (DNN)?
Answer
Neural network with multiple (2+) hidden layers:
Input → Hidden₁ → Hidden₂ → ... → Hiddenₙ → Output
Deep = many layers
Advantages:
- Can learn hierarchical features
- More expressive than shallow networks
- Better for complex tasks
Challenges:
- Harder to train (vanishing gradients)
- More data needed
- More computation required
Question
What is the vanishing gradient problem?
Answer
Gradients become extremely small in the early layers of deep networks.
Causes:
- Chain rule multiplies many small numbers
- Sigmoid/tanh saturate (gradient near 0)
- Deep networks compound the problem
Effects:
- Early layers learn very slowly
- Network fails to train properly
Solutions:
- ReLU activation
- Batch normalization
- Residual connections (ResNet)
- Better initialization
Question
What is the exploding gradient problem?
Answer
Gradients become extremely large during backpropagation.
Causes:
- Chain rule multiplies many large numbers
- Poor weight initialization
- Deep networks
Effects:
- Weights update by huge amounts
- Training becomes unstable
- Loss diverges (NaN)
Solutions:
- Gradient clipping (cap the maximum gradient)
- Better initialization (Xavier, He)
- Batch normalization
- Lower learning rate
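A minimal sketch of gradient clipping by global norm, one common form of the clipping mentioned above; the threshold and example gradients are illustrative.

import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients if their combined norm exceeds max_norm.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([30.0, -40.0]), np.array([5.0])]
print(clip_by_global_norm(grads, max_norm=5.0))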
Question
What are Recurrent Neural Networks (RNNs) and when are they used?
Answer
Neural networks with loops, maintaining a hidden state:
h_t = f(W_h × h_{t-1} + W_x × x_t + b)
Properties:
- Process sequential data
- Maintain memory of previous inputs
- Share weights across time steps
Use cases:
- Time series forecasting (stock prices)
- Natural language processing
- Speech recognition
- Any sequential data
Challenge: Vanishing gradients over long sequences.
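A minimal NumPy sketch of the recurrence, reusing the same weights at every time step; the weight scaling and random sequence are arbitrary illustrative values.

import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    # Vanilla RNN: the same weights are reused at every time step,
    # and the hidden state carries information forward.
    h = np.zeros(W_h.shape[0])
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
    return h  # final hidden state summarizes the sequence

rng = np.random.default_rng(0)
xs = rng.normal(size=(20, 3))            # sequence of 20 steps, 3 features each
print(rnn_forward(xs, rng.normal(size=(8, 8)) * 0.1,
                  rng.normal(size=(8, 3)) * 0.1, np.zeros(8)))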
Question
What is an LSTM (Long Short-Term Memory)?
Answer
Advanced RNN designed to address the vanishing gradient problem.
Components:
1. Forget gate: What to forget from memory
2. Input gate: What new information to store
3. Output gate: What to output
4. Cell state: Long-term memory
Advantages over a plain RNN:
✓ Learns long-term dependencies
✓ Greatly mitigates the vanishing gradient problem
✓ Better for long sequences
Used for: Stock price prediction, language models
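A minimal NumPy sketch of a single LSTM time step with the three gates and the cell state; packing all four gate pre-activations into one matrix W is an implementation convenience, and all sizes are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM time step. W maps [h_prev; x_t] to the four gate pre-activations.
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    g = np.tanh(g)                                  # candidate cell update
    c = f * c_prev + i * g                          # long-term cell state
    h = o * np.tanh(c)                              # hidden state / output
    return h, c

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = rng.normal(size=(4 * hidden, hidden + inputs)) * 0.1
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inputs), h, c, W, np.zeros(4 * hidden))
print(h, c)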
Question
What are Convolutional Neural Networks (CNNs)?
Answer
Neural networks built around convolution operations.
Key components:
1. Convolutional layers: Extract local features
2. Pooling layers: Downsample, reduce dimensions
3. Fully connected layers: Final classification
Properties:
- Translation invariance (a pattern is detected regardless of where it appears)
- Parameter sharing
- Hierarchical feature learning
Use cases:
- Image recognition
- Pattern recognition in trading charts
- Technical analysis automation
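A minimal sketch of a 1-D convolution over a price series, illustrating parameter sharing: one small kernel is slid along the whole sequence. The kernel values and prices are made up for illustration.

import numpy as np

def conv1d(x, kernel):
    # Slide the same small kernel along the series: parameter sharing means
    # one set of weights detects the same local pattern everywhere.
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

prices = np.array([10.0, 10.5, 10.2, 11.0, 11.4, 11.1, 12.0])
edge_detector = np.array([-1.0, 0.0, 1.0])   # responds to local up/down moves
print(conv1d(prices, edge_detector))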
Question
How are neural networks used for time series forecasting in trading?
Answer
Common approaches:
1. Feed-forward NN:
- Input: Lagged prices, indicators
- Output: Next-day return
2. RNN/LSTM:
- Input: Sequence of prices/indicators
- Output: Future price movement
3. CNN:
- Input: Chart patterns as images
- Output: Buy/sell signals
Challenges:
- Financial data is noisy
- Non-stationary distributions
- High risk of overfitting
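A minimal sketch of the standard sliding-window framing used by the feed-forward and recurrent approaches above; the lookback length and synthetic prices are illustrative.

import numpy as np

def make_windows(series, lookback):
    # Turn a 1-D price series into (lagged inputs, next-step target) pairs,
    # the usual supervised framing for feed-forward or recurrent models.
    X, y = [], []
    for t in range(lookback, len(series)):
        X.append(series[t - lookback:t])
        y.append(series[t])
    return np.array(X), np.array(y)

prices = np.linspace(100, 110, 50) + np.random.default_rng(0).normal(scale=0.5, size=50)
X, y = make_windows(prices, lookback=10)
print(X.shape, y.shape)   # (40, 10) inputs, (40,) targets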
Question
What is Deep Reinforcement Learning (Deep RL)?
Answer
Combines Deep Learning with Reinforcement Learning. Instead of a Q-table:
- Use a neural network to approximate Q(s, a)
- Input: State (market conditions)
- Output: Q-values for each action
Advantages:
✓ Handles high-dimensional states
✓ Generalizes to unseen states
✓ Learns complex patterns
Examples:
- Deep Q-Networks (DQN)
- AlphaGo
- Algorithmic trading agents
Question
What is transfer learning in neural networks?
Answer
Using a pre-trained model for a new but related task.
Process:
1. Take a model trained on a large dataset
2. Remove the last layer(s)
3. Add new layers for your task
4. Fine-tune on your data
Benefits:
✓ Requires less training data
✓ Faster training
✓ Better performance
✓ Leverages learned features
Trading example: Model trained on all stocks, adapted to a specific stock.
Question
What is the difference between epochs, batches, and iterations?
Answer
Epoch: One complete pass through the entire training dataset
Batch: Subset of training data used in one gradient update
- Batch size: Number of samples per batch (e.g., 32)
Iteration: One gradient update (processing one batch)
Relationship: Iterations per epoch = Training samples / Batch size (rounded up if the final partial batch is used)
Example: 1000 samples, batch size 32 → 31 full batches, i.e., 32 iterations per epoch when the final partial batch of 8 samples is included
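The arithmetic from the example, with the rounding made explicit:

import math

samples, batch_size = 1000, 32
iterations_per_epoch = math.ceil(samples / batch_size)   # partial final batch counts
print(iterations_per_epoch)   # 32 (31 full batches + 1 batch of 8 samples)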
Question
What are common optimizers beyond basic SGD?
Answer
Advanced optimizers:
1. Momentum: Accumulates past gradients
- Accelerates in consistent directions
- Dampens oscillations
2. Adam (Adaptive Moment Estimation):
- Adaptive learning rates per parameter
- Combines momentum + RMSProp
- Most popular optimizer
3. RMSProp: Adapts the learning rate based on recent gradients
4. AdaGrad: Larger updates for infrequent features
Adam is typically the best default choice.
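A minimal NumPy sketch of a single Adam update applied repeatedly to a toy bowl-shaped loss; the hyperparameter defaults shown are the commonly cited ones, and the toy gradient is for illustration only.

import numpy as np

def adam_update(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step: momentum-style first moment plus RMSProp-style second moment,
    # with bias correction for the early steps.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    grad = 2 * w                      # gradient of the simple loss sum(w^2)
    w, m, v = adam_update(w, grad, m, v, t)
print(w)  # each step moves w toward the minimum at [0, 0]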
Question
What is the Universal Approximation Theorem?
Answer
A neural network with:
- A single hidden layer
- Sufficiently many neurons
- A non-linear activation
can approximate any continuous function on a bounded (compact) domain to arbitrary accuracy.
Implications:
- NNs are extremely powerful
- Theoretical guarantee of expressiveness
- But: may need a huge network and be hard to train
- Deep networks are often more practical
Question
What are the main challenges of using neural networks for trading?
Answer
1. Data limitations:
- Limited historical data
- Non-stationary (market regimes change)
2. Overfitting:
- Easy to memorize patterns
- Hard to validate properly
3. Computational cost:
- Training time
- Infrastructure needs
4. Interpretability:
- Black-box models
- Hard to understand decisions
5. Transaction costs:
- Model doesn't account for execution
6. Market efficiency:
- If a pattern is obvious, it has likely already been exploited