Reinforcement Learning And Q Learning
31 essential Q-Learning concepts for algorithmic trading
What You'll Learn
Master Q-Learning implementation with 31 flashcards covering the Q-table, the Q-update equation, query() vs querySetState(), epsilon-greedy exploration, the learning rate, state discretization, and Q-Learning for trading strategies.
Key Topics
- Q-table structure and Q-value updates
- Q-update equation with learning rate and discount factor
- query() vs querySetState() for training vs testing
- Epsilon-greedy exploration strategy
- State discretization for continuous spaces
- Q-Learning for trading: state/action/reward design
Looking for more machine learning resources? Visit the Explore page to browse related decks or use the Create Your Own Deck flow to customize this set.
How to study this deck
Start with a quick skim of the questions, then launch study mode to flip cards until you can answer each prompt without hesitation. Revisit tricky cards using shuffle or reverse order, and schedule a follow-up review within 48 hours to reinforce retention.
Preview: Reinforcement Learning And Q Learning
Question
What are the 5 core components of a Reinforcement Learning problem?
Answer
1. Agent (decision maker)
2. Environment (world agent interacts with)
3. State (current situation)
4. Action (choices available to agent)
5. Reward (feedback signal)
Question
What is a Policy (π) in Reinforcement Learning?
Answer
A mapping from states to actions. It defines what action the agent should take in each state. Can be deterministic π(s) = a or stochastic π(a|s).
Question
What is the Value Function V^π(s)?
Answer
The expected cumulative discounted reward starting from state s and following policy π. Represents how good it is to be in a particular state.
Question
What is the Q-value Function Q^π(s,a)?
Answer
The expected cumulative discounted reward starting from state s, taking action a, then following policy π. Represents how good it is to take a specific action in a specific state.
Question
What is the discount factor (γ) and what does it control?
Answer
γ (gamma) is between 0 and 1. It balances immediate vs. future rewards. γ close to 0: values mostly immediate rewards. γ close to 1: values future rewards nearly as much as immediate rewards.
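The numeric effect is easy to see with a short reward sequence. A minimal Python sketch (the reward stream and γ values are illustrative, not taken from the deck):

```python
# Hypothetical reward stream: a large reward arrives three steps in the future.
rewards = [1.0, 1.0, 1.0, 10.0]

def discounted_return(rewards, gamma):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.1))  # 1.12 -> the late reward is nearly ignored
print(discounted_return(rewards, 0.9))  # 10.0 -> the late reward dominates the return
```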
Question
What is the Q-Learning update rule?
Answer
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
where:
- α = learning rate
- r = immediate reward
- γ = discount factor
- s' = next state
- max_a' Q(s',a') = best Q-value in next state
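As a concrete illustration, here is a minimal tabular implementation of the update above (the array sizes and sample numbers are illustrative assumptions):

```python
import numpy as np

def q_update(Q, s, a, r, s_prime, alpha=0.2, gamma=0.9):
    """Apply one Q-Learning update to Q[s, a] in place."""
    td_target = r + gamma * np.max(Q[s_prime])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])

# Illustrative numbers: 5 states, 3 actions, Q initialized to zero.
Q = np.zeros((5, 3))
q_update(Q, s=2, a=1, r=1.0, s_prime=3)
print(Q[2, 1])  # 0.2, i.e. alpha * r, since Q(s', a') is still all zeros
```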
Question
What does the learning rate (α) control in Q-Learning?
Answer
α controls how much new information overrides old information. α = 0: agent learns nothing (only uses prior knowledge). α = 1: agent only considers most recent information.
Question
What is the exploration vs exploitation tradeoff?
Answer
Exploration: trying new actions to discover potentially better rewards. Exploitation: using known good actions to maximize immediate reward. Must balance both for effective learning.
Question
What is the ε-greedy strategy?
Answer
With probability ε: choose random action (explore).
With probability 1-ε: choose action with highest Q-value (exploit).
ε typically decays over time as agent learns.
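A minimal sketch of ε-greedy selection with a multiplicative decay schedule (the decay constants are illustrative assumptions, not values required by the deck):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit

# Illustrative decay schedule: shrink epsilon a little after every step.
epsilon, decay, epsilon_min = 1.0, 0.995, 0.01
Q = np.zeros((10, 3))
for step in range(1_000):
    a = epsilon_greedy(Q, s=0, epsilon=epsilon)
    epsilon = max(epsilon_min, epsilon * decay)
```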
Question
Is Q-Learning on-policy or off-policy? What does this mean?
Answer
Q-Learning is OFF-POLICY. It learns the optimal policy while following a different policy (e.g., ε-greedy). It updates Q-values using the maximum Q-value of next state, regardless of action actually taken.
Question
What is the Transition Function P(s'|s,a)?
Answer
The probability of transitioning to state s' given current state s and action a. Defines the dynamics of the environment. In model-free RL (like Q-Learning), we don't need to know this explicitly.
Question
What is the difference between model-based and model-free RL?
Answer
Model-Based: Learns a model of environment (transition function P(s'|s,a) and reward function). Can plan ahead. Model-Free: Learns value function or policy directly without learning environment model. Q-Learning is model-free.
Question
What is maximization bias in Q-Learning?
Answer
Q-Learning tends to overestimate action values due to using max operator in update rule. The same values are used for both selecting and evaluating actions, leading to positive bias.
Question
How does Double Q-Learning solve maximization bias?
Answer
Uses two separate Q-functions (Q1 and Q2). One selects the best action, the other evaluates it. This decouples selection from evaluation, reducing overestimation.
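A tabular sketch of that idea; which table gets updated is chosen at random each step, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(Q1, Q2, s, a, r, s_prime, alpha=0.2, gamma=0.9):
    """Double Q-Learning: one table selects the action, the other evaluates it."""
    if rng.random() < 0.5:
        best_a = int(np.argmax(Q1[s_prime]))       # Q1 selects ...
        target = r + gamma * Q2[s_prime, best_a]   # ... Q2 evaluates
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        best_a = int(np.argmax(Q2[s_prime]))       # Q2 selects ...
        target = r + gamma * Q1[s_prime, best_a]   # ... Q1 evaluates
        Q2[s, a] += alpha * (target - Q2[s, a])

Q1, Q2 = np.zeros((5, 3)), np.zeros((5, 3))
double_q_update(Q1, Q2, s=0, a=1, r=1.0, s_prime=2)
```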
Question
What is a Deep Q-Network (DQN)?
Answer
Uses a neural network to approximate the Q-function instead of a Q-table. Allows handling high-dimensional state spaces. Key innovations: experience replay and target network.
Question
What is experience replay in DQN?
Answer
Stores past experiences (s, a, r, s') in a replay buffer. Randomly samples mini-batches for training. Benefits: breaks correlation between consecutive samples, improves data efficiency, stabilizes training.
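A minimal replay-buffer sketch (the capacity and batch size are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_prime, done) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop off automatically

    def add(self, s, a, r, s_prime, done):
        self.buffer.append((s, a, r, s_prime, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

# Illustrative usage with dummy transitions:
buf = ReplayBuffer()
for t in range(100):
    buf.add(s=t, a=0, r=0.0, s_prime=t + 1, done=False)
batch = buf.sample(32)
```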
Question
What is a target network in DQN?
Answer
A separate copy of the Q-network that is updated less frequently. Used to generate target Q-values during training. Prevents unstable feedback loops and improves convergence.
Question
What is the REINFORCE algorithm?
Answer
A policy gradient method that directly optimizes the policy. Updates policy parameters in direction that increases expected reward. Monte Carlo approach - waits until episode ends to update.
Question
What are Actor-Critic methods?
Answer
Combine value-based and policy-based methods. Actor: learns policy (what to do). Critic: learns value function (how good are actions). Examples: A2C, TRPO, PPO.
Question
What is the advantage function in Actor-Critic?
Answer
A(s,a) = Q(s,a) - V(s)
Measures how much better action a is compared to the average action in state s. Reduces variance in policy gradient estimates.
Question
In Q-Learning for trading, what might states represent?
Answer
Discretized market conditions based on technical indicators (e.g., Bollinger Band position, momentum level, RSI value). State captures current market situation relevant for trading decisions.
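One common (but not mandated) way to build such a state is to bin each indicator into equal-frequency buckets and combine the bins into a single integer, as in this sketch (the indicator names, bin count, and random data are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def discretize(series, bins=10):
    """Map a continuous indicator to integer bins 0..bins-1 (equal-frequency)."""
    return pd.qcut(series, q=bins, labels=False, duplicates="drop")

# Hypothetical indicator values for 100 trading days, each already scaled to [0, 1].
rng = np.random.default_rng(0)
indicators = pd.DataFrame({
    "bbp":      rng.random(100),   # Bollinger Band %B
    "momentum": rng.random(100),
    "rsi":      rng.random(100),
})

bins = 10
state = (discretize(indicators["bbp"], bins) * bins * bins
         + discretize(indicators["momentum"], bins) * bins
         + discretize(indicators["rsi"], bins))   # one integer per day in 0..999
```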
Question
In Q-Learning for trading, what are typical actions?
Answer
Buy (go long), Sell (go short), or Hold (maintain current position). May also include position sizes (e.g., buy 1000 shares, sell 1000 shares).
Question
In Q-Learning for trading, how are rewards typically structured?
Answer
Based on returns achieved: profit from trades, portfolio value change, or risk-adjusted returns (Sharpe ratio). May include penalties for transaction costs or excessive trading.
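A hedged sketch of one simple reward definition along those lines (the transaction-cost figure is an illustrative assumption):

```python
def daily_reward(prev_value, value, traded, cost_bps=10):
    """One-step reward: daily portfolio return minus a penalty if a trade occurred."""
    daily_return = value / prev_value - 1.0
    penalty = cost_bps / 10_000.0 if traded else 0.0
    return daily_return - penalty

# Illustrative: the portfolio gains 1% but a trade incurred a 10 bps penalty.
print(daily_reward(100_000, 101_000, traded=True))   # 0.01 - 0.001 = 0.009
```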
Question
Why must testPolicy() produce identical output for consecutive calls in Project 8?
Answer
testPolicy() should only execute the learned policy, not train/update the learner. Deterministic behavior ensures reproducibility and proper separation of training (addEvidence) from testing (testPolicy).
Question
What method should be called in testPolicy() for Q-Learner: query() or querySetState()?
Answer
querySetState() should be used in testPolicy(). It queries the learned policy without updating Q-values. query() updates Q-values and should only be used during training (addEvidence).
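To make the split concrete, here is a minimal tabular learner exposing both methods. The class is a sketch for illustration only; method names follow the card, but the course-provided implementation may differ in its exact signatures and defaults:

```python
import numpy as np

class QLearner:
    """Minimal tabular Q-Learner illustrating query() vs querySetState()."""

    def __init__(self, num_states=1000, num_actions=3,
                 alpha=0.2, gamma=0.9, rar=0.5, radr=0.99):
        self.Q = np.zeros((num_states, num_actions))
        self.alpha, self.gamma = alpha, gamma
        self.rar, self.radr = rar, radr        # random-action rate and its decay
        self.s, self.a = 0, 0
        self.rng = np.random.default_rng(0)

    def querySetState(self, s):
        """Set the state and return the greedy action WITHOUT updating Q.
        Deterministic given a learned Q-table, so it belongs in testPolicy()."""
        self.s = s
        self.a = int(np.argmax(self.Q[s]))
        return self.a

    def query(self, s_prime, r):
        """Update Q for the last (s, a), then return the next (epsilon-greedy)
        action. Updates the learner, so it belongs in addEvidence() only."""
        td_target = r + self.gamma * np.max(self.Q[s_prime])
        self.Q[self.s, self.a] += self.alpha * (td_target - self.Q[self.s, self.a])
        if self.rng.random() < self.rar:
            a_prime = int(self.rng.integers(self.Q.shape[1]))
        else:
            a_prime = int(np.argmax(self.Q[s_prime]))
        self.rar *= self.radr                  # decay exploration as training proceeds
        self.s, self.a = s_prime, a_prime
        return a_prime
```

In this sketch, a training loop would call querySetState() once at the start of each pass over the data and query() on every subsequent day, while testPolicy() would call only querySetState() day by day.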
Question
How does discount factor γ affect trading strategy in RL?
Answer
Low γ (near 0): agent focuses on immediate profits, may lead to day-trading behavior. High γ (near 1): agent considers long-term returns, may hold positions longer. Choice depends on trading horizon and strategy goals.
Question
What is the curse of dimensionality in RL?
Answer
As state/action space grows, the number of state-action pairs grows exponentially. Q-table becomes huge and sparse. Most states never visited during training. Solution: function approximation (e.g., neural networks in DQN).
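A quick back-of-the-envelope calculation shows how fast a discretized trading state space grows (the bin and action counts are illustrative assumptions):

```python
# Each indicator discretized into 10 bins; 3 trading actions assumed.
bins, actions = 10, 3
for indicators in (2, 3, 5, 8):
    states = bins ** indicators
    print(f"{indicators} indicators -> {states:,} states, {states * actions:,} Q-values")
```

With only a few thousand trading days of data, most of those state-action pairs are never visited.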
Question
What is the Bellman Equation for Q-values?
Answer
Q*(s,a) = E[r + γ max_a' Q*(s',a')]
Optimal Q-value equals expected immediate reward plus discounted maximum future Q-value. Foundation for the Q-Learning update rule.
Question
What is the difference between Q-Learning and SARSA?
Answer
Q-Learning (off-policy): updates using max_a' Q(s',a'), so it learns the optimal policy.
SARSA (on-policy): updates using Q(s',a') where a' is the action actually taken, so it learns the policy being followed.
Q-Learning is more aggressive; SARSA is more conservative.
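The difference shows up directly in the update target. A small sketch (the Q-table values are made up to show how the targets diverge when the behaviour policy explores):

```python
import numpy as np

def q_learning_target(Q, r, s_prime, gamma=0.9):
    """Off-policy target: uses the best action in s', even if it won't be taken."""
    return r + gamma * np.max(Q[s_prime])

def sarsa_target(Q, r, s_prime, a_prime, gamma=0.9):
    """On-policy target: uses the action a' the behaviour policy actually chose."""
    return r + gamma * Q[s_prime, a_prime]

# If exploration picks a non-greedy a' in s', the two targets differ.
Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])
print(q_learning_target(Q, r=1.0, s_prime=1))          # 1 + 0.9 * 5.0 = 5.5
print(sarsa_target(Q, r=1.0, s_prime=1, a_prime=0))    # 1 + 0.9 * 1.0 = 1.9
```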
Question
What are the three main improvements in Rainbow DQN?
Answer
1. Double DQN: reduces overestimation bias
2. Dueling DQN: separates state value and action advantages
3. Prioritized Experience Replay: samples important experiences more frequently
(Plus: Distributional RL, Noisy Nets, Multi-step learning)
Question
What is the role of the reward signal in shaping agent behavior?
Answer
Reward defines the goal. Agent learns to maximize cumulative reward. Poorly designed rewards lead to unintended behavior. In trading: reward = returns might encourage risky behavior; reward = Sharpe ratio encourages risk-adjusted performance.