# Trading Environments
QuantRL-Lab wraps financial markets as Gymnasium environments. The agent receives an observation of the market and its portfolio state, chooses an action, and receives a scalar reward — the standard RL loop.
## Available Environments

| Environment | Status | Description |
|---|---|---|
| `SingleStockTradingEnv` | Stable | Trades a single asset. Full order-type support. |
| `MultiStockTradingEnv` | Planned | Portfolio of multiple assets with cross-asset observations. |
| Futures / Options envs | Roadmap | Contract expiry, margin, and derivatives semantics. |
## SingleStockTradingEnv
The core environment. It delegates all algorithmic decisions to three injected strategy objects, keeping the environment itself thin and stable.
### Constructor

```python
SingleStockTradingEnv(
    data: pd.DataFrame | np.ndarray,
    config: SingleStockEnvConfig,
    action_strategy: BaseActionStrategy,
    reward_strategy: BaseRewardStrategy,
    observation_strategy: BaseObservationStrategy,
    price_column: str | int | None = None,  # auto-detected if None
)
```
### Configuration

See Configuration for the full parameter reference. The key fields:

```python
from quantrl_lab.environments.stock.components.config import SingleStockEnvConfig, SimulationConfig

config = SingleStockEnvConfig(
    initial_balance=100_000,
    window_size=20,
    simulation=SimulationConfig(transaction_cost_pct=0.001, slippage=0.001),
)
```
### Minimal working example

```python
import pandas as pd
from stable_baselines3 import PPO

from quantrl_lab.data.sources import YFinanceDataLoader
from quantrl_lab.data.processing.processor import DataProcessor
from quantrl_lab.environments.stock.single import SingleStockTradingEnv
from quantrl_lab.environments.stock.components.config import SingleStockEnvConfig
from quantrl_lab.environments.stock.strategies.actions.standard import StandardActionStrategy
from quantrl_lab.environments.stock.strategies.observations.feature_aware import FeatureAwareObservationStrategy
from quantrl_lab.environments.stock.strategies.rewards.portfolio_value import PortfolioValueChangeReward

# 1. Get data
loader = YFinanceDataLoader()
raw_df = loader.get_historical_ohlcv_data(["AAPL"], start="2021-01-01", end="2024-01-01")
processor = DataProcessor(ohlcv_data=raw_df)
df, _ = processor.data_processing_pipeline(indicators=["SMA", "RSI", "MACD"])

# 2. Build environment
config = SingleStockEnvConfig(initial_balance=100_000, window_size=20)
env = SingleStockTradingEnv(
    data=df,
    config=config,
    action_strategy=StandardActionStrategy(),
    reward_strategy=PortfolioValueChangeReward(),
    observation_strategy=FeatureAwareObservationStrategy(),
)

# 3. Train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)
```
## Action Space
The action space is defined by the action strategy. Two strategies are available.
### StandardActionStrategy — 3-element Box

The default strategy. Produces a `Box(3,)` action:

| Dimension | Range | Meaning |
|---|---|---|
| `action_type` | [-1, 1] → mapped to [0, 6] | Which order type to place |
| `amount` | [-1, 1] → mapped to [0, 1] | Fraction of available balance/shares |
| `price_modifier` | [0.9, 1.1] | Limit price as a multiple of market price |
The symmetric [-1, 1] range for action_type and amount is intentional — an untrained agent outputs values near 0, which maps to the middle of the action range rather than always choosing "Hold" or 0% amount.
Action types (`Actions` enum):

| Int | Name | Description |
|---|---|---|
| 0 | Hold | No action |
| 1 | Buy | Market buy `amount`% of available cash |
| 2 | Sell | Market sell `amount`% of held shares |
| 3 | LimitBuy | Place limit buy order at price × `price_modifier` |
| 4 | LimitSell | Place limit sell order at price × `price_modifier` |
| 5 | StopLoss | Place stop-loss order at price × `price_modifier` |
| 6 | TakeProfit | Place take-profit order at price × `price_modifier` |
```python
import numpy as np

from quantrl_lab.environments.stock.strategies.actions.standard import StandardActionStrategy

action_strategy = StandardActionStrategy()
print(action_strategy.define_action_space())
# Box([-1.  -1.   0.9], [1.  1.  1.1], (3,), float32)

# Manual action examples:
# Market buy 50% of cash:
action = np.array([-0.333, 0.0, 1.0])   # action_type → 1 (Buy), amount → 0.5
# Limit sell 75% of shares at 3% above market:
action = np.array([0.333, 0.5, 1.03])   # action_type → 4 (LimitSell), amount → 0.75
# Stop-loss 100% of shares at 5% below market:
action = np.array([0.667, 1.0, 0.95])   # action_type → 5 (StopLoss), amount → 1.0
```
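The conversions in the comments above are consistent with a simple linear rescale of each dimension. The decode function below is a hypothetical sketch for intuition only — the actual mapping lives inside `StandardActionStrategy`:

```python
import numpy as np

def decode_standard_action(action: np.ndarray) -> tuple[int, float, float]:
    """Hypothetical decode of a Box(3,) action, assuming linear rescaling.

    action_type: [-1, 1] -> rounded to an integer in [0, 6]
    amount:      [-1, 1] -> fraction in [0, 1]
    price_modifier is passed through unchanged.
    """
    raw_type, raw_amount, price_modifier = action
    action_type = int(round((raw_type + 1.0) / 2.0 * 6))  # 7 discrete order types
    amount = (raw_amount + 1.0) / 2.0                     # fraction of cash/shares
    return action_type, float(amount), float(price_modifier)

# Limit sell 75% of shares at 3% above market:
print(decode_standard_action(np.array([0.333, 0.5, 1.03])))  # (4, 0.75, 1.03)
```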
### TimeInForceActionStrategy — 4-element Box

An extended strategy that adds explicit Time-In-Force (TIF) control as a 4th dimension. Useful when the agent needs to differentiate between persistent and short-lived limit orders.

| Dimension | Range | Meaning |
|---|---|---|
| `action_type` | [-1, 1] → [0, 6] | Same as Standard |
| `amount` | [0, 1] | Fraction of available balance/shares |
| `price_modifier` | [0.9, 1.1] | Limit price multiplier |
| `tif_type` | [-1, 1] → {0, 1, 2} | Order lifetime policy |
TIF types:

| Int | Name | Behaviour |
|---|---|---|
| 0 | GTC (Good Till Cancelled) | Order persists until filled or cancelled |
| 1 | IOC (Immediate or Cancel) | Fills immediately or is cancelled |
| 2 | TTL (Time To Live) | Expires after `order_expiration_steps` steps |
```python
from quantrl_lab.environments.stock.strategies.actions.time_in_force import TimeInForceActionStrategy

action_strategy = TimeInForceActionStrategy()
print(action_strategy.define_action_space())
# Box([-1.   0.   0.9 -1. ], [1.  1.  1.1 1. ], (4,), float32)
```
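One plausible way the continuous 4th dimension could be binned into the three TIF types is an even split of [-1, 1] — a hypothetical sketch, not the library's actual decode logic:

```python
def decode_tif(raw_tif: float) -> int:
    """Hypothetical decode of the tif_type dimension, assuming even [-1, 1] binning.

    -1.0 -> 0 (GTC), 0.0 -> 1 (IOC), +1.0 -> 2 (TTL)
    """
    return int(round((raw_tif + 1.0) / 2.0 * 2))

print([decode_tif(t) for t in (-1.0, -0.2, 0.0, 0.6, 1.0)])  # [0, 1, 1, 2, 2]
```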
## Observation Space

The default observation strategy is `FeatureAwareObservationStrategy`. It produces a flat `Box(N,)` vector.

Total size: `window_size × num_features + 9`
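For concreteness, a quick check of the size formula (the feature count here is illustrative):

```python
window_size = 20
num_features = 12  # e.g. OHLCV + 7 indicator columns (illustrative)

# 9 portfolio features are always appended after the flattened market window
obs_size = window_size * num_features + 9
print(obs_size)  # 249
```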
### Market window

A rolling window of `window_size` steps (oldest → newest) of the full feature matrix. Columns are normalised by type:

| Column type | Normalisation | Examples |
|---|---|---|
| Price-like | Divided by first value in window (relative return scale) | open, high, low, close, SMA, EMA, BB_upper |
| Oscillators 0–100 | Divided by 100 | RSI, STOCH, MFI, ADX |
| Williams %R (−100–0) | (x + 100) / 100 | WILLR |
| CCI (unbounded, ~±200) | Divided by 200 | CCI |
| ATR | ATR / close (% volatility) | ATR |
| MACD | MACD / close (scale-free) | MACD, MACD_signal |
| OBV | Z-scored within the window | OBV |
| Sentiment / analyst / sector | Passed through as-is | sentiment_score, sector_change |
Detection is keyword-based on column names — no manual labelling needed.
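The per-type rules above can be sketched as a single dispatch function. This is a simplified illustration of the idea (the function name and exact keyword matching are assumptions, not the library's implementation):

```python
import numpy as np

def normalise_column(name: str, window: np.ndarray, close: np.ndarray) -> np.ndarray:
    """Sketch of keyword-based per-type normalisation of one feature column."""
    key = name.upper()
    if any(k in key for k in ("SENTIMENT", "ANALYST", "SECTOR")):
        return window                                 # passed through as-is
    if any(k in key for k in ("RSI", "STOCH", "MFI", "ADX")):
        return window / 100.0                         # 0-100 oscillators
    if "WILLR" in key:
        return (window + 100.0) / 100.0               # Williams %R: [-100, 0] -> [0, 1]
    if "CCI" in key:
        return window / 200.0                         # roughly bounded around +/-200
    if "ATR" in key or "MACD" in key:
        return window / close                         # scale-free relative to price
    if "OBV" in key:
        return (window - window.mean()) / (window.std() + 1e-9)  # z-score in window
    return window / (window[0] + 1e-9)                # price-like: relative to window start

print(normalise_column("RSI_14", np.array([30.0, 50.0, 70.0]), np.ones(3)))  # [0.3 0.5 0.7]
```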
### Portfolio features (always appended, 9 values)

| Feature | Description |
|---|---|
| `portfolio_balance_ratio` | cash / initial_balance |
| `position_size_ratio` | (shares × price) / portfolio_value |
| `unrealized_pl_pct` | (current_price − avg_entry) / avg_entry |
| `price_pos_in_range` | Position of current price within recent high/low range, in [0, 1] |
| `recent_volatility` | Std-dev of returns over `volatility_lookback` steps |
| `recent_trend` | Linear slope of price over `trend_lookback` steps |
| `risk_reward_ratio` | (avg_take_profit − price) / (price − avg_stop_loss) |
| `dist_to_stop_loss` | (price − avg_stop_loss) / price |
| `dist_to_take_profit` | (avg_take_profit − price) / price |
```python
from quantrl_lab.environments.stock.strategies.observations.feature_aware import FeatureAwareObservationStrategy

obs_strategy = FeatureAwareObservationStrategy(
    volatility_lookback=10,     # steps for recent volatility calc
    trend_lookback=10,          # steps for recent trend calc
    normalize_stationary=True,  # scale oscillators like RSI, ADX to [0, 1]
)

# After env is built:
feature_names = obs_strategy.get_feature_names(env)
print(feature_names[:5])
# ['open_t-19', 'high_t-19', 'low_t-19', 'close_t-19', 'volume_t-19']
print(feature_names[-9:])
# ['portfolio_balance_ratio', 'position_size_ratio', ...]
```
## Reward Shaping

Rewards are pluggable. Every strategy inherits `BaseRewardStrategy` and implements `calculate_reward(env) -> float`. Strategies can also implement `on_step_end(env)` for stateful updates and `reset()` for between-episode cleanup.
### Available reward strategies
#### PortfolioValueChangeReward

The simplest baseline. Reward = % change in portfolio value since the previous step.

```python
from quantrl_lab.environments.stock.strategies.rewards.portfolio_value import PortfolioValueChangeReward

reward_strategy = PortfolioValueChangeReward()
# reward = (current_value - prev_value) / prev_value
```
#### DifferentialSharpeReward

Dense step-level Sharpe signal. Divides the current excess return by the running volatility, so a large return in a low-volatility regime scores higher than the same return in a volatile one.

```python
from quantrl_lab.environments.stock.strategies.rewards.sharpe import DifferentialSharpeReward

reward_strategy = DifferentialSharpeReward(
    risk_free_rate=0.0,  # per-step risk-free rate (usually 0 for daily)
    decay=0.99,          # EMA decay for running mean/variance; closer to 1 = longer memory
)
# reward = excess_return / running_std_dev
```
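The running statistics can be pictured as exponential moving averages updated with the `decay` factor. The class below is a minimal sketch under that assumption (names and the epsilon guard are mine, not the library's):

```python
class RunningSharpe:
    """Sketch: EMA mean/variance of returns; reward = excess return / running std."""

    def __init__(self, risk_free_rate: float = 0.0, decay: float = 0.99):
        self.rf = risk_free_rate
        self.decay = decay
        self.mean = 0.0
        self.var = 0.0

    def step(self, ret: float) -> float:
        excess = ret - self.rf
        # EMA updates: weight (1 - decay) on the newest observation
        self.mean = self.decay * self.mean + (1 - self.decay) * excess
        self.var = self.decay * self.var + (1 - self.decay) * (excess - self.mean) ** 2
        return excess / (self.var ** 0.5 + 1e-9)

rs = RunningSharpe()
rewards = [rs.step(r) for r in (0.01, -0.005, 0.02, 0.0)]
```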
#### DifferentialSortinoReward

Like Sharpe but penalises only downside volatility (returns below `target_return`). Upside volatility is not penalised, so the agent is not discouraged from large positive moves.

```python
from quantrl_lab.environments.stock.strategies.rewards.sortino import DifferentialSortinoReward

reward_strategy = DifferentialSortinoReward(
    target_return=0.0,  # minimum acceptable return (MAR)
    decay=0.99,
)
# reward = current_return / running_downside_deviation
```
#### DrawdownPenaltyReward

Continuous penalty proportional to the current drawdown from the episode's high-water mark. Keeps persistent pressure on the agent to recover from losses.

```python
from quantrl_lab.environments.stock.strategies.rewards.drawdown import DrawdownPenaltyReward

reward_strategy = DrawdownPenaltyReward(penalty_factor=1.0)
# reward = -(drawdown_pct * penalty_factor)
```
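The drawdown formula is easy to verify by hand. A minimal sketch (the function name is mine; the real strategy tracks the peak incrementally inside the env loop):

```python
def drawdown_penalty(values: list[float], penalty_factor: float = 1.0) -> float:
    """Sketch: penalty proportional to drawdown from the running peak."""
    peak = max(values)            # episode high-water mark so far
    current = values[-1]
    drawdown_pct = (peak - current) / peak
    return -drawdown_pct * penalty_factor

print(drawdown_penalty([100_000, 110_000, 99_000]))  # -0.1
```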
#### TurnoverPenaltyReward

Penalises excessive trading by amplifying the transaction costs already embedded in portfolio value. Discourages "churn" trades whose P&L is swamped by fees.

```python
from quantrl_lab.environments.stock.strategies.rewards.turnover import TurnoverPenaltyReward

reward_strategy = TurnoverPenaltyReward(penalty_factor=2.0)
# reward = -(fees_paid_this_step * penalty_factor)
# penalty_factor=1.0 doubles the cost impact; 5.0 aggressively punishes churning
```
#### InvalidActionPenalty

Fixed penalty when the agent tries an illegal action — e.g. selling with no shares held. Teaches the agent feasibility faster than relying on P&L alone.

```python
from quantrl_lab.environments.stock.strategies.rewards.invalid_action import InvalidActionPenalty

reward_strategy = InvalidActionPenalty(penalty=-0.5)
# reward = -0.5 if action was invalid, else 0.0
```
#### BoredomPenaltyReward

Penalises holding a position beyond a grace period without meaningful price movement. Encourages timely entries and exits rather than indefinite holding.

```python
from quantrl_lab.environments.stock.strategies.rewards.boredom import BoredomPenaltyReward

reward_strategy = BoredomPenaltyReward(
    penalty_per_step=-0.001,  # small penalty per step after grace period
    grace_period=10,          # steps before penalty kicks in
    min_profit_pct=0.005,     # unrealized profit % that would "reset" the timer
)
```
#### LimitExecutionReward

Bonus when a limit order fills at a better price than the prevailing market price at the time it was placed. Rewards the agent for using limit orders effectively rather than always paying market.

```python
from quantrl_lab.environments.stock.strategies.rewards.execution_bonus import LimitExecutionReward

reward_strategy = LimitExecutionReward(improvement_multiplier=10.0)
# reward = price_improvement_pct * improvement_multiplier
# e.g. limit buy 2% below market → +0.20 bonus
```
#### OrderExpirationPenaltyReward

Fixed penalty per expired pending order. Discourages "order spam" — placing unrealistic limit orders that never fill and clog the system until TTL.

```python
from quantrl_lab.environments.stock.strategies.rewards.expiration import OrderExpirationPenaltyReward

reward_strategy = OrderExpirationPenaltyReward(penalty_per_order=-0.1)
# reward = num_expired_orders * penalty_per_order
```
### Combining with CompositeReward

Most real experiments combine multiple components. `CompositeReward` handles weighting and optional per-component normalisation.

```python
from quantrl_lab.environments.stock.strategies.rewards.composite import CompositeReward
from quantrl_lab.environments.stock.strategies.rewards.sortino import DifferentialSortinoReward
from quantrl_lab.environments.stock.strategies.rewards.drawdown import DrawdownPenaltyReward
from quantrl_lab.environments.stock.strategies.rewards.turnover import TurnoverPenaltyReward
from quantrl_lab.environments.stock.strategies.rewards.invalid_action import InvalidActionPenalty

reward_strategy = CompositeReward(
    strategies=[
        DifferentialSortinoReward(decay=0.99),
        DrawdownPenaltyReward(penalty_factor=0.5),
        TurnoverPenaltyReward(penalty_factor=1.0),
        InvalidActionPenalty(penalty=-0.5),
    ],
    weights=[0.6, 0.2, 0.1, 0.1],
    normalize_weights=True,  # weights are normalised to sum to 1 (default)
    auto_scale=False,        # if True, each component is z-scored before weighting
)
```
- `normalize_weights=True` (default): weights are scaled so they always sum to 1 even if they don't already.
- `auto_scale=True`: each component is standardised to N(0, 1) via Welford's online algorithm before being weighted. Use this when components have very different natural scales and you can't hand-tune weights easily. Running stats persist across episodes for stability.
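Welford's algorithm updates a running mean and variance one sample at a time without storing history. A minimal sketch of the z-scoring idea (class and method names are mine, not the library's):

```python
class WelfordScaler:
    """Sketch of online z-scoring via Welford's algorithm (stats persist across calls)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def scale(self, x: float) -> float:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if self.n < 2:
            return 0.0  # not enough data to estimate variance yet
        std = (self.m2 / (self.n - 1)) ** 0.5
        return (x - self.mean) / (std + 1e-9)

scaler = WelfordScaler()
scaled = [scaler.scale(v) for v in (1.0, 2.0, 3.0, 4.0)]
```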
## Full End-to-End Example

A complete pipeline: fetch data → process → build env → train → evaluate.

```python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

from quantrl_lab.data.sources import YFinanceDataLoader
from quantrl_lab.data.processing.processor import DataProcessor
from quantrl_lab.environments.stock.single import SingleStockTradingEnv
from quantrl_lab.environments.stock.components.config import (
    SingleStockEnvConfig, SimulationConfig, RewardConfig
)
from quantrl_lab.environments.stock.strategies.actions.standard import StandardActionStrategy
from quantrl_lab.environments.stock.strategies.observations.feature_aware import FeatureAwareObservationStrategy
from quantrl_lab.environments.stock.strategies.rewards.composite import CompositeReward
from quantrl_lab.environments.stock.strategies.rewards.sortino import DifferentialSortinoReward
from quantrl_lab.environments.stock.strategies.rewards.drawdown import DrawdownPenaltyReward
from quantrl_lab.environments.stock.strategies.rewards.turnover import TurnoverPenaltyReward
from quantrl_lab.environments.stock.strategies.rewards.invalid_action import InvalidActionPenalty

# ── 1. Data ──────────────────────────────────────────────────────────────────
loader = YFinanceDataLoader()
raw_df = loader.get_historical_ohlcv_data(["AAPL"], start="2019-01-01", end="2024-01-01")
processor = DataProcessor(ohlcv_data=raw_df)
splits, meta = processor.data_processing_pipeline(
    indicators=["RSI", {"SMA": {"window": 50}}, {"EMA": {"window": 20}}, "ATR", "MACD"],
    split_config={"train": 0.8, "test": 0.2},
)
train_df, test_df = splits["train"], splits["test"]

# ── 2. Strategies ────────────────────────────────────────────────────────────
action_strategy = StandardActionStrategy()
obs_strategy = FeatureAwareObservationStrategy(volatility_lookback=10, trend_lookback=10)
reward_strategy = CompositeReward(
    strategies=[
        DifferentialSortinoReward(decay=0.99),
        DrawdownPenaltyReward(penalty_factor=0.5),
        TurnoverPenaltyReward(penalty_factor=1.0),
        InvalidActionPenalty(penalty=-0.5),
    ],
    weights=[0.6, 0.2, 0.1, 0.1],
)

# ── 3. Config ────────────────────────────────────────────────────────────────
config = SingleStockEnvConfig(
    initial_balance=100_000,
    window_size=20,
    simulation=SimulationConfig(transaction_cost_pct=0.001, slippage=0.001),
    rewards=RewardConfig(clip_range=(-1.0, 1.0)),
)

# ── 4. Train environment ─────────────────────────────────────────────────────
def make_train_env():
    return SingleStockTradingEnv(
        data=train_df, config=config,
        action_strategy=action_strategy,
        reward_strategy=reward_strategy,
        observation_strategy=obs_strategy,
    )

train_env = DummyVecEnv([make_train_env])
model = PPO("MlpPolicy", train_env, verbose=1, n_steps=2048, batch_size=64)
model.learn(total_timesteps=200_000)

# ── 5. Evaluate on test set ──────────────────────────────────────────────────
test_env = SingleStockTradingEnv(
    data=test_df, config=config,
    action_strategy=StandardActionStrategy(),
    reward_strategy=DifferentialSortinoReward(),  # use clean strategy for eval
    observation_strategy=FeatureAwareObservationStrategy(),
)

obs, _ = test_env.reset()
portfolio_values = []
while True:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = test_env.step(action)
    portfolio_values.append(info["portfolio_value"])
    if terminated or truncated:
        break

final_value = portfolio_values[-1]
total_return = (final_value - 100_000) / 100_000
print(f"Final portfolio value: ${final_value:,.2f} ({total_return:.1%})")
```
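Beyond total return, the collected `portfolio_values` series supports richer summary statistics. A minimal sketch (the helper name is mine; the per-step Sharpe here is a naive, unannualised ratio):

```python
import numpy as np

def summarise(values: list[float]) -> dict:
    """Total return, max drawdown, and a naive per-step Sharpe from a value series."""
    v = np.asarray(values, dtype=float)
    returns = np.diff(v) / v[:-1]
    running_peak = np.maximum.accumulate(v)
    max_drawdown = float(((running_peak - v) / running_peak).max())
    sharpe = float(returns.mean() / (returns.std() + 1e-9))
    return {
        "total_return": float(v[-1] / v[0] - 1.0),
        "max_drawdown": max_drawdown,
        "per_step_sharpe": sharpe,
    }

stats = summarise([100_000, 103_000, 101_000, 108_000])
print(stats)
```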
## Custom Reward Strategy

Extend `BaseRewardStrategy` to implement any reward logic:

```python
from quantrl_lab.environments.core.interfaces import BaseRewardStrategy


class CalmarRatioReward(BaseRewardStrategy):
    """Reward = cumulative return / max drawdown (Calmar-inspired)."""

    def __init__(self, penalty_factor: float = 1.0):
        super().__init__()
        self.penalty_factor = penalty_factor
        self._peak_value = 0.0
        self._max_drawdown = 1e-9
        self._cumulative_return = 0.0
        self._step = 0

    def calculate_reward(self, env) -> float:
        price = env._get_current_price()
        value = env.portfolio.get_value(price)
        if self._peak_value < value:
            self._peak_value = value
        dd = (self._peak_value - value) / (self._peak_value + 1e-9)
        self._max_drawdown = max(self._max_drawdown, dd)
        ret = (value - env.prev_portfolio_value) / (env.prev_portfolio_value + 1e-9)
        self._cumulative_return += ret
        self._step += 1
        calmar = self._cumulative_return / (self._max_drawdown * self.penalty_factor)
        return float(calmar / (self._step + 1))  # normalise by episode length

    def reset(self):
        self._peak_value = 0.0
        self._max_drawdown = 1e-9
        self._cumulative_return = 0.0
        self._step = 0
```
## Roadmap

| Feature | Status | Notes |
|---|---|---|
| `MultiStockTradingEnv` | In development | Portfolio over N assets; cross-sectional observations |
| Continuous position sizing | Planned | Fractional shares, no rounding |
| Futures environment | Planned | Margin, leverage, contract expiry |
| Options environment | Exploratory | Greeks, IV surface as observations |
| Live trading integration | Partial | `AlpacaTrader` in `deployment/trading/` handles live execution |