Reinforcement Learning
30 essential concepts in reinforcement learning for trading
What You'll Learn
Master reinforcement learning with 30 flashcards covering RL fundamentals, Q-learning, policies, value functions, discount factors, exploration vs exploitation, and RL for algorithmic trading applications.
Key Topics
- RL components: agent, environment, state, action, reward
- Policy (π) and value functions (V, Q)
- Bellman equation and optimal policies
- Discount factor (γ) and temporal credit assignment
- Exploration vs exploitation tradeoff
- Q-Learning algorithm and convergence
Looking for more machine learning resources? Visit the Explore page to browse related decks or use the Create Your Own Deck flow to customize this set.
How to study this deck
Start with a quick skim of the questions, then launch study mode to flip cards until you can answer each prompt without hesitation. Revisit tricky cards using shuffle or reverse order, and schedule a follow-up review within 48 hours to reinforce retention.
Preview: Reinforcement Learning
Question
What are the five core components of a Reinforcement Learning problem?
Answer
1) Agent: The learner/decision maker, 2) Environment: The world the agent interacts with, 3) State: Current situation/observation, 4) Action: Choices available to agent, 5) Reward: Feedback signal from environment indicating action quality.
Question
What is a policy (π) in reinforcement learning?
Answer
A policy π is a mapping from states to actions. It defines the agent's behavior by specifying which action to take in each state. It can be deterministic, π(s) = a, or stochastic, π(a|s) = probability of taking action a in state s.
Question
What is the Value Function V^π(s)?
Answer
V^π(s) is the expected cumulative discounted reward starting from state s and following policy π. It represents how 'good' it is to be in state s under policy π. V^π(s) = E[R_t + γR_{t+1} + γ²R_{t+2} + ... | s_t = s, π]
Question
What is the Q-value Function Q^π(s,a)?
Answer
Q^π(s,a) is the expected cumulative discounted reward of taking action a in state s, then following policy π. It represents how 'good' it is to take action a in state s. Q^π(s,a) = E[R_t + γR_{t+1} + γ²R_{t+2} + ... | s_t = s, a_t = a, π]
Question
What is the discount factor (γ) and what does it control?
Answer
γ (gamma) ∈ [0,1] balances immediate vs future rewards. γ close to 0: myopic (heavily favors immediate rewards). γ close to 1: farsighted (weighs future rewards nearly as much as immediate ones). γ = 0: only the immediate reward matters. γ = 1: all future rewards count equally.
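A quick numeric illustration of how γ reweights a hypothetical reward stream:

```python
# Hypothetical reward sequence; illustrates how gamma reweights future rewards.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]

def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k, the return G_t used in the V and Q definitions."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))   # 1.0    -> only the immediate reward
print(discounted_return(rewards, 0.5))   # 1.9375 -> myopic weighting
print(discounted_return(rewards, 0.95))  # ~4.52  -> close to the full sum of 5.0
```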
Question
What is the transition function T(s,a,s')?
Answer
T(s,a,s') = P(s'|s,a) is the probability of transitioning to state s' given current state s and action a. Defines the environment dynamics. In model-based RL, this is learned. In model-free RL, this is not explicitly represented.
Question
What is Q-Learning?
Answer
Q-Learning is an off-policy, model-free RL algorithm that learns optimal Q-values Q*(s,a) directly from experience without needing a model of environment dynamics. It learns the optimal policy by iteratively updating Q-values using the Bellman equation.
Question
What is the Q-Learning update rule?
Answer
Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') - Q(s,a)]. Where: α = learning rate, r = immediate reward, γ = discount factor, s' = next state, max_{a'} Q(s',a') = best Q-value in next state.
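A minimal tabular sketch of this update rule (the state/action counts and hyperparameters are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 3
alpha, gamma = 0.2, 0.9
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_prime):
    """One Q-Learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * np.max(Q[s_prime])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error

# Example transition: took action 1 in state 2, got reward 1.0, landed in state 5.
q_learning_update(s=2, a=1, r=1.0, s_prime=5)
```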
Question
What does 'off-policy' mean in Q-Learning?
Answer
Off-policy means the algorithm learns about the optimal policy while following a different behavior policy (e.g., ε-greedy). Q-Learning updates use max_{a'} Q(s',a') regardless of which action was actually taken, learning optimal Q* even while exploring.
Question
What is the learning rate (α) in Q-Learning?
Answer
α ∈ (0,1] controls how much new information overrides old information. Small α: slow, stable learning. Large α: fast learning but unstable, sensitive to noise. Typically starts higher and decays over time. α = 1 means only most recent experience matters.
Question
What is the exploration vs exploitation tradeoff?
Answer
Exploration: Try new actions to discover potentially better strategies. Exploitation: Use current best-known actions to maximize reward. Too much exploration wastes time; too much exploitation may miss better options. Balance is crucial for learning optimal policy.
Question
What is ε-greedy exploration strategy?
Answer
With probability ε, choose a random action (explore). With probability 1-ε, choose the action with highest Q-value (exploit). Common approach: start with high ε (e.g., 0.3) and decay it over time as learning progresses.
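A minimal sketch of ε-greedy selection with a decay schedule, assuming a tabular Q as in the earlier sketch (the decay constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # random action (explore)
    return int(np.argmax(Q[s]))                # best-known action (exploit)

# Typical schedule: start high, multiply by a decay factor each episode.
epsilon, epsilon_min, decay = 0.3, 0.01, 0.995
for episode in range(1000):
    # ... run the episode, calling epsilon_greedy(Q, s, epsilon) at each step ...
    epsilon = max(epsilon_min, epsilon * decay)
```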
Question
What is maximization bias in Q-Learning?
Answer
Using max_{a'} Q(s',a') in updates can overestimate Q-values because the same values are used for both selecting and evaluating actions. This positive bias accumulates, leading to suboptimal policies. Solution: Double Q-Learning uses separate Q-functions for selection and evaluation.
Question
What is Double Q-Learning and why use it?
Answer
Double Q-Learning maintains two Q-functions: Q1 and Q2. Randomly choose which to update. Use one to select action, the other to evaluate it: Q1(s,a) ← Q1(s,a) + α[r + γ Q2(s', argmax_a Q1(s',a)) - Q1(s,a)]. This reduces maximization bias.
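A minimal tabular sketch of the Double Q-Learning update (sizes and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 3
alpha, gamma = 0.2, 0.9
Q1 = np.zeros((n_states, n_actions))
Q2 = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_prime):
    """Randomly update Q1 or Q2; one table selects the action, the other evaluates it."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_prime]))   # select with Q1 ...
        Q1[s, a] += alpha * (r + gamma * Q2[s_prime, a_star] - Q1[s, a])  # ... evaluate with Q2
    else:
        a_star = int(np.argmax(Q2[s_prime]))   # select with Q2 ...
        Q2[s, a] += alpha * (r + gamma * Q1[s_prime, a_star] - Q2[s, a])  # ... evaluate with Q1
```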
Question
What is a Deep Q-Network (DQN)?
Answer
DQN uses a neural network to approximate Q(s,a) instead of a table. Enables learning in large/continuous state spaces where tabular Q-Learning fails. Takes state as input, outputs Q-values for all actions. Crucial innovation for applying RL to complex problems.
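A minimal PyTorch sketch of this architecture, assuming PyTorch is available (layer sizes and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state vector to one Q-value per action (sizes are illustrative)."""
    def __init__(self, state_dim=8, n_actions=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # Q(s, a) for every action in one forward pass
        )

    def forward(self, state):
        return self.net(state)

q_net = DQN()
state = torch.randn(1, 8)               # a batch with one example state
q_values = q_net(state)                 # shape (1, 3): one Q-value per action
greedy_action = q_values.argmax(dim=1)  # exploit: pick the highest Q-value
```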
Question
What is experience replay in DQN?
Answer
Store agent's experiences (s,a,r,s') in a replay buffer. Sample random mini-batches from buffer for training. Benefits: breaks correlation between consecutive samples, improves data efficiency by reusing experiences, stabilizes training.
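A minimal sketch of a replay buffer using Python's standard library (the buffer and batch sizes are illustrative):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # oldest experiences are evicted automatically

def store(s, a, r, s_prime, done):
    """Append one transition to the buffer."""
    replay_buffer.append((s, a, r, s_prime, done))

def sample(batch_size=32):
    """Draw a random mini-batch, breaking correlation between consecutive steps."""
    return random.sample(replay_buffer, batch_size)
```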
Question
What is the target network in DQN?
Answer
A separate network with frozen weights used to generate target Q-values for training. Main network's weights updated every step, target network updated periodically (every N steps). Prevents the 'moving target' problem that destabilizes training.
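A minimal sketch of the periodic sync, reusing the hypothetical DQN class from the sketch above (the update interval is illustrative):

```python
import copy

policy_net = DQN()                       # trained every step
target_net = copy.deepcopy(policy_net)   # frozen copy used to compute TD targets

TARGET_UPDATE_EVERY = 1000
for step in range(50_000):
    # ... train policy_net on mini-batches, using target_net for r + gamma * max_a' Q_target(s', a') ...
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())  # periodic sync
```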
Question
What are policy-based methods in RL?
Answer
Instead of learning value functions, directly optimize policy parameters θ to maximize expected return. Policy π_θ(a|s) outputs probability distribution over actions. Gradient ascent on expected reward. Examples: REINFORCE, PPO, TRPO.
Question
What is the REINFORCE algorithm?
Answer
A policy gradient method that updates policy parameters in the direction that increases the probability of actions that led to high returns. Update: θ ← θ + α ∇_θ log π_θ(a_t|s_t) G_t, where G_t is the return from time t. High variance but unbiased.
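A minimal sketch of one REINFORCE update for a linear-softmax policy (the feature dimension, action count, and learning rate are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Linear-softmax policy: logits = theta @ state, pi = softmax(logits).
n_actions, state_dim, alpha = 3, 4, 0.01
theta = np.zeros((n_actions, state_dim))

def reinforce_episode_update(states, actions, rewards, gamma=0.99):
    """After one episode, push up log-probabilities of actions weighted by their return G_t."""
    G, returns = 0.0, []
    for r in reversed(rewards):            # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for s, a, G_t in zip(states, actions, returns):
        pi = softmax(theta @ s)
        grad_log_pi = np.outer(-pi, s)     # d log pi(a|s) / d theta for a softmax-linear policy
        grad_log_pi[a] += s
        theta += alpha * G_t * grad_log_pi
```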
Question
What are Actor-Critic methods?
Answer
Combine policy-based (actor) and value-based (critic) methods. Actor: policy network that selects actions. Critic: value network that estimates V(s) or Q(s,a). Critic guides actor's learning, reducing variance. Examples: A2C, A3C, PPO.
Question
What is the advantage function A(s,a)?
Answer
A(s,a) = Q(s,a) - V(s). Measures how much better action a is compared to average action in state s. Positive advantage: better than average. Negative: worse than average. Reduces variance in policy gradient methods while maintaining correct gradient direction.
Question
Model-based vs Model-free RL?
Answer
Model-based: Learns the transition function T(s,a,s') and reward function R(s,a), then plans using the model. More sample efficient but requires an accurate model. Model-free: Learns values/policies directly from experience without building an environment model. Examples: Q-Learning (model-free), Dyna-Q (model-based planning layered on Q-Learning).
Question
On-policy vs Off-policy methods?
Answer
On-policy: Learn about the policy being followed (behavior policy = target policy). Example: SARSA. Off-policy: Learn about optimal policy while following different policy. Example: Q-Learning. Off-policy can learn from old data and exploratory actions.
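A side-by-side sketch of the two update targets on a tabular Q (sizes and hyperparameters are illustrative):

```python
import numpy as np

alpha, gamma = 0.2, 0.9
Q = np.zeros((10, 3))

def sarsa_update(s, a, r, s_prime, a_prime):
    """On-policy: the target uses the action a' the behavior policy actually chose next."""
    Q[s, a] += alpha * (r + gamma * Q[s_prime, a_prime] - Q[s, a])

def q_learning_update(s, a, r, s_prime):
    """Off-policy: the target uses max over a', regardless of the action taken next."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_prime]) - Q[s, a])
```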
Question
In trading RL: What is a typical state representation?
Answer
State can include: technical indicators (RSI, Bollinger Bands, Momentum, MACD), current position (long/short/neutral), portfolio value, recent price movements, volatility measures. States are often discretized or normalized for tabular Q-Learning.
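A minimal sketch of turning a few indicators plus the current position into a single discrete state index for tabular Q-Learning (the indicator choice, bin edges, and position encoding are illustrative):

```python
import numpy as np

def discretize(value, bins):
    """Map a continuous indicator value to an integer bin index."""
    return int(np.digitize(value, bins))

def make_state(rsi, bb_pct, momentum, holding):
    """Combine discretized indicators and the current position into one state index.
    Bin edges and the position encoding (0=short, 1=flat, 2=long) are illustrative."""
    rsi_bin = discretize(rsi, bins=[30, 70])             # 0, 1, or 2
    bb_bin = discretize(bb_pct, bins=[0.2, 0.8])         # 0, 1, or 2
    mom_bin = discretize(momentum, bins=[-0.02, 0.02])   # 0, 1, or 2
    return ((rsi_bin * 3 + bb_bin) * 3 + mom_bin) * 3 + holding  # index in 0..80
```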
Question
In trading RL: What are typical actions?
Answer
Common action spaces: 1) Discrete: {Buy, Sell, Hold} or {Long, Short, Cash}, 2) Continuous: position size from -1000 to +1000 shares, 3) Combined: discrete action type + continuous amount. Constraints: position limits, transaction costs.
Question
In trading RL: What are typical reward functions?
Answer
Rewards can be: 1) Daily return (portfolio value change), 2) Sharpe ratio (risk-adjusted return), 3) Profit minus transaction costs, 4) Combination with penalties for excessive trading or risk. Reward engineering is critical for learning effective trading policies.
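A minimal sketch of a daily-return reward with a transaction-cost penalty (the commission and impact figures are illustrative, not prescriptive):

```python
def step_reward(port_value, prev_port_value, trade_size, commission=9.95, impact=0.005):
    """Daily portfolio return minus a simple transaction-cost penalty.
    commission is a flat fee per trade; impact scales with trade size."""
    daily_return = (port_value / prev_port_value) - 1.0
    cost = 0.0
    if trade_size != 0:
        cost = commission + impact * abs(trade_size)   # penalize trading to discourage churn
    return daily_return - cost / prev_port_value       # keep both terms on a return scale
```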
Question
Why use querySetState() instead of query() in testPolicy()?
Answer
In a Q-Learner, query() updates Q-values (learning mode), while querySetState() only retrieves the best action without updating (testing mode). testPolicy() must not train/update the learner. Consecutive testPolicy() calls must produce identical results.
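A minimal sketch of the intended usage pattern, assuming a learner object that exposes query(state, reward) and querySetState(state) as described on this card; the surrounding function names and loops are illustrative:

```python
def train_learner(learner, states, rewards):
    """Training loop: query() both returns an action and updates Q-values."""
    action = learner.querySetState(states[0])     # set the initial state, no update
    for s, r in zip(states[1:], rewards[1:]):
        action = learner.query(s, r)              # learning mode: updates Q(s, a)

def test_policy(learner, states):
    """Evaluation loop: querySetState() only reads the greedy action, so repeated
    calls on the same data return identical trades."""
    return [learner.querySetState(s) for s in states]
```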
Question
What is the Bellman Equation for Q-values?
Answer
Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a'). States that optimal Q-value equals immediate reward plus discounted value of best future action. This recursive relationship forms the basis for Q-Learning updates.
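A compact sketch that applies this equation directly when the model (P and R) is known; this fixed point is what Q-Learning approximates from samples (array shapes are assumptions):

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, iters=100):
    """Apply the Bellman optimality equation to a known model.
    P: transition probabilities, shape (S, A, S'); R: expected rewards, shape (S, A)."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)          # max_a' Q*(s', a')
        Q = R + gamma * P @ V      # R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')
    return Q
```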
Question
How does γ (discount factor) affect trading strategy?
Answer
Low γ (e.g., 0.5): Short-term trading, focuses on immediate profits, more responsive to recent changes. High γ (e.g., 0.95): Long-term investing, considers distant future rewards, smoother strategy. For day trading: lower γ. For swing/position trading: higher γ.
Question
What improvements does Rainbow DQN include?
Answer
Rainbow combines: 1) Double DQN (reduces overestimation), 2) Dueling DQN (separate value/advantage streams), 3) Prioritized replay (samples important transitions more often), 4) Multi-step learning, 5) Distributional RL, 6) Noisy networks. State-of-the-art DQN variant.