Reinforcement Learning
30 essential concepts in reinforcement learning for trading
What You'll Learn
Master reinforcement learning with 30 flashcards covering RL fundamentals, Q-learning, policies, value functions, discount factors, exploration vs exploitation, and RL for algorithmic trading applications.
Key Topics
- RL components: agent, environment, state, action, reward
- Policy (π) and value functions (V, Q)
- Bellman equation and optimal policies
- Discount factor (γ) and temporal credit assignment
- Exploration vs exploitation tradeoff
- Q-Learning algorithm and convergence
Looking for more machine learning resources? Visit the Explore page to browse related decks or use the Create Your Own Deck flow to customize this set.
How to study this deck
Start with a quick skim of the questions, then launch study mode to flip cards until you can answer each prompt without hesitation. Revisit tricky cards using shuffle or reverse order, and schedule a follow-up review within 48 hours to reinforce retention.
Preview: Reinforcement Learning
Question
What are the five core components of a Reinforcement Learning problem?
Answer
1) Agent: The learner/decision maker, 2) Environment: The world the agent interacts with, 3) State: Current situation/observation, 4) Action: Choices available to agent, 5) Reward: Feedback signal from environment indicating action quality.
Question
What is a policy (π) in reinforcement learning?
Answer
A policy π is a mapping from states to actions. It defines the agent's behavior by specifying which action to take in each state. It can be deterministic, π(s) = a, or stochastic, π(a|s) = probability of taking action a in state s.
Question
What is the Value Function V^π(s)?
Answer
V^π(s) is the expected cumulative discounted reward starting from state s and following policy π. It represents how 'good' it is to be in state s under policy π. V^π(s) = E[R_t + γR_{t+1} + γ²R_{t+2} + ... | s_t = s, π]
Question
What is the Q-value Function Q^π(s,a)?
Answer
Q^π(s,a) is the expected cumulative discounted reward of taking action a in state s, then following policy π. It represents how 'good' it is to take action a in state s. Q^π(s,a) = E[R_t + γR_{t+1} + γ²R_{t+2} + ... | s_t = s, a_t = a, π]
Question
What is the discount factor (γ) and what does it control?
Answer
γ (gamma) ∈ [0,1] balances immediate vs future rewards. γ close to 0: myopic (heavily favors immediate rewards). γ close to 1: farsighted (weighs future rewards nearly as much as immediate ones). γ = 0: only the immediate reward matters. γ = 1: all future rewards count equally.
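A quick numeric illustration of how γ reweights a hypothetical reward stream:

```python
# Hypothetical reward sequence; illustrates how gamma reweights future rewards.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]

def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k, the return G_t used in the V and Q definitions."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))   # 1.0    -> only the immediate reward
print(discounted_return(rewards, 0.5))   # 1.9375 -> myopic weighting
print(discounted_return(rewards, 0.95))  # ~4.52  -> close to the full sum of 5.0
```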
Question
What is the transition function T(s,a,s')?
Answer
T(s,a,s') = P(s'|s,a) is the probability of transitioning to state s' given current state s and action a. Defines the environment dynamics. In model-based RL, this is learned. In model-free RL, this is not explicitly represented.
Question
What is Q-Learning?
Answer
Q-Learning is an off-policy, model-free RL algorithm that learns optimal Q-values Q*(s,a) directly from experience without needing a model of environment dynamics. It learns the optimal policy by iteratively updating Q-values using the Bellman equation.
Question
What is the Q-Learning update rule?
Answer
Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]. Where: α = learning rate, r = immediate reward, γ = discount factor, s' = next state, max_{a'} Q(s',a') = best Q-value in next state.
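A minimal tabular sketch of this update rule (the state/action counts and hyperparameters are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 3
alpha, gamma = 0.2, 0.9
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_prime):
    """One Q-Learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * np.max(Q[s_prime])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error

# Example transition: took action 1 in state 2, got reward 1.0, landed in state 5.
q_learning_update(s=2, a=1, r=1.0, s_prime=5)
```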
Question
What does 'off-policy' mean in Q-Learning?
Answer
Off-policy means the algorithm learns about the optimal policy while following a different behavior policy (e.g., ε-greedy). Q-Learning updates use max_{a'} Q(s',a') regardless of which action was actually taken, learning optimal Q* even while exploring.
Question
What is the learning rate (α) in Q-Learning?
Answer
α ∈ (0,1] controls how much new information overrides old information. Small α: slow, stable learning. Large α: fast learning but unstable, sensitive to noise. Typically starts higher and decays over time. α = 1 means only most recent experience matters.
Question
What is the exploration vs exploitation tradeoff?
Answer
Exploration: Try new actions to discover potentially better strategies. Exploitation: Use current best-known actions to maximize reward. Too much exploration wastes time; too much exploitation may miss better options. Balance is crucial for learning optimal policy.
Question
What is ε-greedy exploration strategy?
Answer
With probability ε, choose a random action (explore). With probability 1-ε, choose the action with highest Q-value (exploit). Common approach: start with high ε (e.g., 0.3) and decay it over time as learning progresses.
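A minimal sketch of ε-greedy selection with a decay schedule, assuming a tabular Q as in the earlier sketch (the decay constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # random action (explore)
    return int(np.argmax(Q[s]))                # best-known action (exploit)

# Typical schedule: start high, multiply by a decay factor each episode.
epsilon, epsilon_min, decay = 0.3, 0.01, 0.995
for episode in range(1000):
    # ... run the episode, calling epsilon_greedy(Q, s, epsilon) at each step ...
    epsilon = max(epsilon_min, epsilon * decay)
```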
Question
What is maximization bias in Q-Learning?
Answer
Using max_{a'} Q(s',a') in updates can overestimate Q-values because the same values are used for both selecting and evaluating actions. This positive bias accumulates, leading to suboptimal policies. Solution: Double Q-Learning uses separate Q-functions for selection and evaluation.
Question
What is Double Q-Learning and why use it?
Answer
Double Q-Learning maintains two Q-functions: Q1 and Q2. Randomly choose which to update. Use one to select action, the other to evaluate it: Q1(s,a) ← Q1(s,a) + α[r + γ Q2(s', argmax_a Q1(s',a)) - Q1(s,a)]. This reduces maximization bias.
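A minimal tabular sketch of the Double Q-Learning update (sizes and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 3
alpha, gamma = 0.2, 0.9
Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_prime):
    """Randomly update Q1 or Q2; one table selects the action, the other evaluates it."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_prime]))   # select with Q1 ...
        Q1[s, a] += alpha * (r + gamma * Q2[s_prime, a_star] - Q1[s, a])  # ... evaluate with Q2
    else:
        a_star = int(np.argmax(Q2[s_prime]))   # select with Q2 ...
        Q2[s, a] += alpha * (r + gamma * Q1[s_prime, a_star] - Q2[s, a])  # ... evaluate with Q1
```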
Question
What is a Deep Q-Network (DQN)?
Answer
DQN uses a neural network to approximate Q(s,a) instead of a table. Enables learning in large/continuous state spaces where tabular Q-Learning fails. Takes state as input, outputs Q-values for all actions. Crucial innovation for applying RL to complex problems.
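A minimal PyTorch sketch of this architecture, assuming PyTorch is available (layer sizes and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state vector to one Q-value per action (sizes are illustrative)."""
    def __init__(self, state_dim=8, n_actions=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # Q(s, a) for every action in one forward pass
        )

    def forward(self, state):
        return self.net(state)

q_net = DQN()
state = torch.randn(1, 8)               # a batch with one example state
q_values = q_net(state)                 # shape (1, 3): one Q-value per action
greedy_action = q_values.argmax(dim=1)  # exploit: pick the highest Q-value
```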
Question
What is experience replay in DQN?
Answer
Store agent's experiences (s,a,r,s') in a replay buffer. Sample random mini-batches from buffer for training. Benefits: breaks correlation between consecutive samples, improves data efficiency by reusing experiences, stabilizes training.
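A minimal sketch of a replay buffer using Python's standard library (the buffer and batch sizes are illustrative):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # oldest experiences are evicted automatically

def store(s, a, r, s_prime, done):
    """Append one transition to the buffer."""
    replay_buffer.append((s, a, r, s_prime, done))

def sample(batch_size=32):
    """Draw a random mini-batch, breaking correlation between consecutive steps."""
    return random.sample(replay_buffer, batch_size)
```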
Question
What is the target network in DQN?
Answer
A separate network with frozen weights used to generate target Q-values for training. Main network's weights updated every step, target network updated periodically (every N steps). Prevents the 'moving target' problem that destabilizes training.
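A minimal sketch of the periodic sync, reusing the hypothetical DQN class from the sketch above (the update interval is illustrative):

```python
import copy

policy_net = DQN()                       # trained every step
target_net = copy.deepcopy(policy_net)   # frozen copy used to compute TD targets

TARGET_UPDATE_EVERY = 1000
for step in range(50_000):
    # ... train policy_net on mini-batches, using target_net for r + gamma * max_a' Q_target(s', a') ...
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())  # periodic sync
```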
Question
What are policy-based methods in RL?
Answer
Instead of learning value functions, directly optimize policy parameters θ to maximize expected return. Policy π_θ(a|s) outputs probability distribution over actions. Gradient ascent on expected reward. Examples: REINFORCE, PPO, TRPO.
Question
What is the REINFORCE algorithm?
Answer
A policy gradient method that updates policy parameters in the direction that increases the probability of actions that led to high returns. Update: θ ← θ + α ∇_θ log π_θ(a_t|s_t) G_t, where G_t is the return from time t. High variance but unbiased.
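A minimal sketch of one REINFORCE update for a linear-softmax policy (the feature dimension, action count, and learning rate are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Linear-softmax policy: logits = theta @ state, pi = softmax(logits).
n_actions, state_dim, alpha = 3, 4, 0.01
theta = np.zeros((n_actions, state_dim))

def reinforce_episode_update(states, actions, rewards, gamma=0.99):
    """After one episode, push up log-probabilities of actions weighted by their return G_t."""
    G, returns = 0.0, []
    for r in reversed(rewards):            # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for s, a, G_t in zip(states, actions, returns):
        pi = softmax(theta @ s)
        grad_log_pi = np.outer(-pi, s)     # d log pi(a|s) / d theta for a softmax-linear policy
        grad_log_pi[a] += s
        theta += alpha * G_t * grad_log_pi
```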
Question
What are Actor-Critic methods?
Answer
Combine policy-based (actor) and value-based (critic) methods. Actor: policy network that selects actions. Critic: value network that estimates V(s) or Q(s,a). Critic guides actor's learning, reducing variance. Examples: A2C, A3C, PPO.
Question
What is the advantage function A(s,a)?
Answer
A(s,a) = Q(s,a) - V(s). Measures how much better action a is compared to average action in state s. Positive advantage: better than average. Negative: worse than average. Reduces variance in policy gradient methods while maintaining correct gradient direction.
Question
Model-based vs Model-free RL?
Answer
Model-based: Learns the transition function T(s,a,s') and reward function R(s,a), then plans using the model. More sample efficient but requires an accurate model. Model-free: Learns values/policies directly from experience without building an environment model. Examples: Q-Learning (model-free), Dyna-Q (model-based planning layered on Q-Learning).
Question
On-policy vs Off-policy methods?
Answer
On-policy: Learn about the policy being followed (behavior policy = target policy). Example: SARSA. Off-policy: Learn about optimal policy while following different policy. Example: Q-Learning. Off-policy can learn from old data and exploratory actions.
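A side-by-side sketch of the two update targets on a tabular Q (sizes and hyperparameters are illustrative):

```python
import numpy as np

alpha, gamma = 0.2, 0.9
Q = np.zeros((10, 3))

def sarsa_update(s, a, r, s_prime, a_prime):
    """On-policy: the target uses the action a' the behavior policy actually chose next."""
    Q[s, a] += alpha * (r + gamma * Q[s_prime, a_prime] - Q[s, a])

def q_learning_update(s, a, r, s_prime):
    """Off-policy: the target uses max over a', regardless of the action taken next."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])
```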
Question
In trading RL: What is a typical state representation?
Answer
State can include: technical indicators (RSI, Bollinger Bands, Momentum, MACD), current position (long/short/neutral), portfolio value, recent price movements, volatility measures. States are often discretized or normalized for tabular Q-Learning.
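A minimal sketch of turning a few indicators plus the current position into a single discrete state index for tabular Q-Learning (the indicator choice, bin edges, and position encoding are illustrative):

```python
import numpy as np

def discretize(value, bins):
    """Map a continuous indicator value to an integer bin index."""
    return int(np.digitize(value, bins))

def make_state(rsi, bb_pct, momentum, holding):
    """Combine discretized indicators and the current position into one state index.
    Bin edges and the position encoding (0=short, 1=flat, 2=long) are illustrative."""
    rsi_bin = discretize(rsi, bins=[30, 70])             # 0, 1, or 2
    bb_bin = discretize(bb_pct, bins=[0.2, 0.8])         # 0, 1, or 2
    mom_bin = discretize(momentum, bins=[-0.02, 0.02])   # 0, 1, or 2
    return ((rsi_bin * 3 + bb_bin) * 3 + mom_bin) * 3 + holding  # index in 0..80
```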
Question
In trading RL: What are typical actions?
Answer
Common action spaces: 1) Discrete: {Buy, Sell, Hold} or {Long, Short, Cash}, 2) Continuous: position size from -1000 to +1000 shares, 3) Combined: discrete action type + continuous amount. Constraints: position limits, transaction costs.
Question
In trading RL: What are typical reward functions?
Answer
Rewards can be: 1) Daily return (portfolio value change), 2) Sharpe ratio (risk-adjusted return), 3) Profit minus transaction costs, 4) Combination with penalties for excessive trading or risk. Reward engineering is critical for learning effective trading policies.
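A minimal sketch of a daily-return reward with a transaction-cost penalty (the commission and impact figures are illustrative, not prescriptive):

```python
def step_reward(port_value, prev_port_value, trade_size, commission=9.95, impact=0.005):
    """Daily portfolio return minus a simple transaction-cost penalty.
    commission is a flat fee per trade; impact scales with trade size."""
    daily_return = (port_value / prev_port_value) - 1.0
    cost = 0.0
    if trade_size != 0:
        cost = commission + impact * abs(trade_size)   # penalize trading to discourage churn
    return daily_return - cost / prev_port_value       # keep both terms on a return scale
```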
Question
Why use querySetState() instead of query() in testPolicy()?
Answer
In a Q-Learner, query() updates Q-values (learning mode), while querySetState() only retrieves the best action without updating (testing mode). testPolicy() must not train/update the learner. Consecutive testPolicy() calls must produce identical results.
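A minimal sketch of the intended usage pattern, assuming a learner object that exposes query(state, reward) and querySetState(state) as described on this card; the surrounding function names and loops are illustrative:

```python
def train_learner(learner, states, rewards):
    """Training loop: query() both returns an action and updates Q-values."""
    action = learner.querySetState(states[0])     # set the initial state, no update
    for s, r in zip(states[1:], rewards[1:]):
        action = learner.query(s, r)              # learning mode: updates Q(s, a)

def test_policy(learner, states):
    """Evaluation loop: querySetState() only reads the greedy action, so repeated
    calls on the same data return identical trades."""
    return [learner.querySetState(s) for s in states]
```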
Question
What is the Bellman Equation for Q-values?
Answer
Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a'). States that optimal Q-value equals immediate reward plus discounted value of best future action. This recursive relationship forms the basis for Q-Learning updates.
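A compact sketch that applies this equation directly when the model (P and R) is known; this fixed point is what Q-Learning approximates from samples (array shapes are assumptions):

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, iters=100):
    """Apply the Bellman optimality equation to a known model.
    P: transition probabilities, shape (S, A, S'); R: expected rewards, shape (S, A)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)          # max_a' Q*(s', a')
        Q = R + gamma * P @ V      # R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')
    return Q
```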
Question
How does γ (discount factor) affect trading strategy?
Answer
Low γ (e.g., 0.5): Short-term trading, focuses on immediate profits, more responsive to recent changes. High γ (e.g., 0.95): Long-term investing, considers distant future rewards, smoother strategy. For day trading: lower γ. For swing/position trading: higher γ.
Question
What improvements does Rainbow DQN include?
Answer
Rainbow combines: 1) Double DQN (reduces overestimation), 2) Dueling DQN (separate value/advantage streams), 3) Prioritized replay (samples important transitions more often), 4) Multi-step learning, 5) Distributional RL, 6) Noisy networks. State-of-the-art DQN variant.