Reinforcement Learning And Q Learning
31 essential Q-Learning concepts for algorithmic trading
What You'll Learn
Master Q-Learning implementation with 31 flashcards covering the Q-table, the Q-update equation, query() vs querySetState(), epsilon-greedy exploration, the learning rate, state discretization, and Q-Learning for trading strategies.
Key Topics
- Q-table structure and Q-value updates
- Q-update equation with learning rate and discount factor
- query() vs querySetState() for training vs testing
- Epsilon-greedy exploration strategy
- State discretization for continuous spaces
- Q-Learning for trading: state/action/reward design
Looking for more machine learning resources? Visit the Explore page to browse related decks or use the Create Your Own Deck flow to customize this set.
How to study this deck
Start with a quick skim of the questions, then launch study mode to flip cards until you can answer each prompt without hesitation. Revisit tricky cards using shuffle or reverse order, and schedule a follow-up review within 48 hours to reinforce retention.
Preview: Reinforcement Learning And Q Learning
Question
What are the 5 core components of a Reinforcement Learning problem?
Answer
1. Agent (decision maker)
2. Environment (world agent interacts with)
3. State (current situation)
4. Action (choices available to agent)
5. Reward (feedback signal)
Question
What is a Policy (π) in Reinforcement Learning?
Answer
A mapping from states to actions. It defines what action the agent should take in each state. Can be deterministic π(s) = a or stochastic π(a|s).
Question
What is the Value Function V^π(s)?
Answer
The expected cumulative discounted reward starting from state s and following policy π. Represents how good it is to be in a particular state.
Question
What is the Q-value Function Q^π(s,a)?
Answer
The expected cumulative discounted reward starting from state s, taking action a, then following policy π. Represents how good it is to take a specific action in a specific state.
Question
What is the discount factor (γ) and what does it control?
Answer
γ (gamma) is between 0 and 1. It balances immediate vs. future rewards. γ close to 0: values mostly immediate rewards. γ close to 1: values future rewards nearly as much as immediate rewards.
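The numeric effect is easy to see with a short reward sequence. A minimal Python sketch (the reward stream and γ values are illustrative, not taken from the deck):

```python
# Hypothetical reward stream: a large reward arrives three steps in the future.
rewards = [1.0, 1.0, 1.0, 10.0]

def discounted_return(rewards, gamma):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.1))  # 1.12 -> the late reward is nearly ignored
print(discounted_return(rewards, 0.9))  # 10.0 -> the late reward dominates the return
```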
Question
What is the Q-Learning update rule?
Answer
Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
where:
- α = learning rate
- r = immediate reward
- γ = discount factor
- s' = next state
- max_a' Q(s',a') = best Q-value in next state
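As a concrete illustration, here is a minimal tabular implementation of the update above (the array sizes and sample numbers are illustrative assumptions):

```python
import numpy as np

def q_update(Q, s, a, r, s_prime, alpha=0.2, gamma=0.9):
    """Apply one Q-Learning update to Q[s, a] in place."""
    td_target = r + gamma * np.max(Q[s_prime])   # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])

# Illustrative numbers: 5 states, 3 actions, Q initialized to zero.
Q = np.zeros((5, 3))
q_update(Q, s=2, a=1, r=1.0, s_prime=3)
print(Q[2, 1])  # 0.2, i.e. alpha * r, since Q(s', a') is still all zeros
```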
Question
What does the learning rate (α) control in Q-Learning?
Answer
α controls how much new information overrides old information. α = 0: agent learns nothing (only uses prior knowledge). α = 1: agent only considers most recent information.
Question
What is the exploration vs exploitation tradeoff?
Answer
Exploration: trying new actions to discover potentially better rewards. Exploitation: using known good actions to maximize immediate reward. Must balance both for effective learning.
Question
What is the ε-greedy strategy?
Answer
With probability ε: choose random action (explore).
With probability 1-ε: choose action with highest Q-value (exploit).
ε typically decays over time as agent learns.
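A minimal sketch of ε-greedy selection with a multiplicative decay schedule (the decay constants are illustrative assumptions, not values required by the deck):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit

# Illustrative decay schedule: shrink epsilon a little after every step.
epsilon, decay, epsilon_min = 1.0, 0.995, 0.01
Q = np.zeros((10, 3))
for step in range(1_000):
    a = epsilon_greedy(Q, s=0, epsilon=epsilon)
    epsilon = max(epsilon_min, epsilon * decay)
```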
Question
Is Q-Learning on-policy or off-policy? What does this mean?
Answer
Q-Learning is OFF-POLICY. It learns the optimal policy while following a different policy (e.g., ε-greedy). It updates Q-values using the maximum Q-value of next state, regardless of action actually taken.
Question
What is the Transition Function P(s'|s,a)?
Answer
The probability of transitioning to state s' given current state s and action a. Defines the dynamics of the environment. In model-free RL (like Q-Learning), we don't need to know this explicitly.
Question
What is the difference between model-based and model-free RL?
Answer
Model-Based: Learns a model of environment (transition function P(s'|s,a) and reward function). Can plan ahead. Model-Free: Learns value function or policy directly without learning environment model. Q-Learning is model-free.
Question
What is maximization bias in Q-Learning?
Answer
Q-Learning tends to overestimate action values due to using max operator in update rule. The same values are used for both selecting and evaluating actions, leading to positive bias.
Question
How does Double Q-Learning solve maximization bias?
Answer
Uses two separate Q-functions (Q1 and Q2). One selects the best action, the other evaluates it. This decouples selection from evaluation, reducing overestimation.
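A tabular sketch of that idea; which table gets updated is chosen at random each step, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def double_q_update(Q1, Q2, s, a, r, s_prime, alpha=0.2, gamma=0.9):
    """Double Q-Learning: one table selects the action, the other evaluates it."""
    if rng.random() < 0.5:
        best_a = int(np.argmax(Q1[s_prime]))       # Q1 selects ...
        target = r + gamma * Q2[s_prime, best_a]   # ... Q2 evaluates
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        best_a = int(np.argmax(Q2[s_prime]))       # Q2 selects ...
        target = r + gamma * Q1[s_prime, best_a]   # ... Q1 evaluates
        Q2[s, a] += alpha * (target - Q2[s, a])

Q1, Q2 = np.zeros((5, 3)), np.zeros((5, 3))
double_q_update(Q1, Q2, s=0, a=1, r=1.0, s_prime=2)
```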
Question
What is a Deep Q-Network (DQN)?
Answer
Uses a neural network to approximate the Q-function instead of a Q-table. Allows handling high-dimensional state spaces. Key innovations: experience replay and target network.
Question
What is experience replay in DQN?
Answer
Stores past experiences (s, a, r, s') in a replay buffer. Randomly samples mini-batches for training. Benefits: breaks correlation between consecutive samples, improves data efficiency, stabilizes training.
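A minimal replay-buffer sketch (the capacity and batch size are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_prime, done) transitions."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop off automatically

    def add(self, s, a, r, s_prime, done):
        self.buffer.append((s, a, r, s_prime, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

# Illustrative usage with dummy transitions:
buf = ReplayBuffer()
for t in range(100):
    buf.add(s=t, a=0, r=0.0, s_prime=t + 1, done=False)
batch = buf.sample(32)
```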
Question
What is a target network in DQN?
Answer
A separate copy of the Q-network that is updated less frequently. Used to generate target Q-values during training. Prevents unstable feedback loops and improves convergence.
Question
What is the REINFORCE algorithm?
Answer
A policy gradient method that directly optimizes the policy. Updates policy parameters in direction that increases expected reward. Monte Carlo approach - waits until episode ends to update.
Question
What are Actor-Critic methods?
Answer
Combine value-based and policy-based methods. Actor: learns policy (what to do). Critic: learns value function (how good are actions). Examples: A2C, TRPO, PPO.
Question
What is the advantage function in Actor-Critic?
Answer
A(s,a) = Q(s,a) - V(s)
Measures how much better action a is compared to the average action in state s. Reduces variance in policy gradient estimates.
Question
In Q-Learning for trading, what might states represent?
Answer
Discretized market conditions based on technical indicators (e.g., Bollinger Band position, momentum level, RSI value). State captures current market situation relevant for trading decisions.
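One common (but not mandated) way to build such a state is to bin each indicator into equal-frequency buckets and combine the bins into a single integer, as in this sketch (the indicator names, bin count, and random data are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def discretize(series, bins=10):
    """Map a continuous indicator to integer bins 0..bins-1 (equal-frequency)."""
    return pd.qcut(series, q=bins, labels=False, duplicates="drop")

# Hypothetical indicator values for 100 trading days, each already scaled to [0, 1].
rng = np.random.default_rng(0)
indicators = pd.DataFrame({
    "bbp":      rng.random(100),   # Bollinger Band %B
    "momentum": rng.random(100),
    "rsi":      rng.random(100),
})

bins = 10
state = (discretize(indicators["bbp"], bins) * bins * bins
         + discretize(indicators["momentum"], bins) * bins
         + discretize(indicators["rsi"], bins))   # one integer per day in 0..999
```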
Question
In Q-Learning for trading, what are typical actions?
Answer
Buy (go long), Sell (go short), or Hold (maintain current position). May also include position sizes (e.g., buy 1000 shares, sell 1000 shares).
Question
In Q-Learning for trading, how are rewards typically structured?
Answer
Based on returns achieved: profit from trades, portfolio value change, or risk-adjusted returns (Sharpe ratio). May include penalties for transaction costs or excessive trading.
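A hedged sketch of one simple reward definition along those lines (the transaction-cost figure is an illustrative assumption):

```python
def daily_reward(prev_value, value, traded, cost_bps=10):
    """One-step reward: daily portfolio return minus a penalty if a trade occurred."""
    daily_return = value / prev_value - 1.0
    penalty = cost_bps / 10_000.0 if traded else 0.0
    return daily_return - penalty

# Illustrative: the portfolio gains 1% but a trade incurred a 10 bps penalty.
print(daily_reward(100_000, 101_000, traded=True))   # 0.01 - 0.001 = 0.009
```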
Question
Why must testPolicy() produce identical output for consecutive calls in Project 8?
Answer
testPolicy() should only execute the learned policy, not train/update the learner. Deterministic behavior ensures reproducibility and proper separation of training (addEvidence) from testing (testPolicy).
Question
What method should be called in testPolicy() for Q-Learner: query() or querySetState()?
Answer
querySetState() should be used in testPolicy(). It queries the learned policy without updating Q-values. query() updates Q-values and should only be used during training (addEvidence).
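To make the split concrete, here is a minimal tabular learner exposing both methods. The class is a sketch for illustration only; method names follow the card, but the course-provided implementation may differ in its exact signatures and defaults:

```python
import numpy as np

class QLearner:
    """Minimal tabular Q-Learner illustrating query() vs querySetState()."""

    def __init__(self, num_states=1000, num_actions=3,
                 alpha=0.2, gamma=0.9, rar=0.5, radr=0.99):
        self.Q = np.zeros((num_states, num_actions))
        self.alpha, self.gamma = alpha, gamma
        self.rar, self.radr = rar, radr        # random-action rate and its decay
        self.s, self.a = 0, 0
        self.rng = np.random.default_rng(0)

    def querySetState(self, s):
        """Set the state and return the greedy action WITHOUT updating Q.
        Deterministic given a learned Q-table, so it belongs in testPolicy()."""
        self.s = s
        self.a = int(np.argmax(self.Q[s]))
        return self.a

    def query(self, s_prime, r):
        """Update Q for the last (s, a), then return the next (epsilon-greedy)
        action. Updates the learner, so it belongs in addEvidence() only."""
        td_target = r + self.gamma * np.max(self.Q[s_prime])
        self.Q[self.s, self.a] += self.alpha * (td_target - self.Q[self.s, self.a])
        if self.rng.random() < self.rar:
            a_prime = int(self.rng.integers(self.Q.shape[1]))
        else:
            a_prime = int(np.argmax(self.Q[s_prime]))
        self.rar *= self.radr                  # decay exploration as training proceeds
        self.s, self.a = s_prime, a_prime
        return a_prime
```

In this sketch, a training loop would call querySetState() once at the start of each pass over the data and query() on every subsequent day, while testPolicy() would call only querySetState() day by day.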
Question
How does discount factor γ affect trading strategy in RL?
Answer
Low γ (near 0): agent focuses on immediate profits, may lead to day-trading behavior. High γ (near 1): agent considers long-term returns, may hold positions longer. Choice depends on trading horizon and strategy goals.
Question
What is the curse of dimensionality in RL?
Answer
As state/action space grows, the number of state-action pairs grows exponentially. Q-table becomes huge and sparse. Most states never visited during training. Solution: function approximation (e.g., neural networks in DQN).
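A quick back-of-the-envelope calculation shows how fast a discretized trading state space grows (the bin and action counts are illustrative assumptions):

```python
# Each indicator discretized into 10 bins; 3 trading actions assumed.
bins, actions = 10, 3
for indicators in (2, 3, 5, 8):
    states = bins ** indicators
    print(f"{indicators} indicators -> {states:,} states, {states * actions:,} Q-values")
```

With only a few thousand trading days of data, most of those state-action pairs are never visited.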
Question
What is the Bellman Equation for Q-values?
Answer
Q*(s,a) = E[r + γ max_a' Q*(s',a')]
Optimal Q-value equals expected immediate reward plus discounted maximum future Q-value. Foundation for the Q-Learning update rule.
Question
What is the difference between Q-Learning and SARSA?
Answer
Q-Learning (off-policy): updates using max_a' Q(s',a'), so it learns the optimal policy.
SARSA (on-policy): updates using Q(s',a') where a' is the action actually taken, so it learns the policy being followed.
Q-Learning is more aggressive; SARSA is more conservative.
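The difference shows up directly in the update target. A small sketch (the Q-table values are made up to show how the targets diverge when the behaviour policy explores):

```python
import numpy as np

def q_learning_target(Q, r, s_prime, gamma=0.9):
    """Off-policy target: uses the best action in s', even if it won't be taken."""
    return r + gamma * np.max(Q[s_prime])

def sarsa_target(Q, r, s_prime, a_prime, gamma=0.9):
    """On-policy target: uses the action a' the behaviour policy actually chose."""
    return r + gamma * Q[s_prime, a_prime]

# If exploration picks a non-greedy a' in s', the two targets differ.
Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])
print(q_learning_target(Q, r=1.0, s_prime=1))          # 1 + 0.9 * 5.0 = 5.5
print(sarsa_target(Q, r=1.0, s_prime=1, a_prime=0))    # 1 + 0.9 * 1.0 = 1.9
```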
Question
What are the three main improvements in Rainbow DQN?
Answer
1. Double DQN: reduces overestimation bias
2. Dueling DQN: separates state value and action advantages
3. Prioritized Experience Replay: samples important experiences more frequently
(Plus: Distributional RL, Noisy Nets, Multi-step learning)
Question
What is the role of the reward signal in shaping agent behavior?
Answer
Reward defines the goal. Agent learns to maximize cumulative reward. Poorly designed rewards lead to unintended behavior. In trading: reward = returns might encourage risky behavior; reward = Sharpe ratio encourages risk-adjusted performance.