Trading Environments

QuantRL-Lab wraps financial markets as Gymnasium environments. The agent receives an observation of the market and its portfolio state, chooses an action, and receives a scalar reward — the standard RL loop.

Available Environments

Environment	Status	Description
`SingleStockTradingEnv`	Stable	Trades a single asset. Full order-type support.
`MultiStockTradingEnv`	Planned	Portfolio of multiple assets with cross-asset observations.
Futures / Options envs	Roadmap	Contract expiry, margin, and derivatives semantics.

`SingleStockTradingEnv`

The core environment. It delegates all algorithmic decisions to three injected strategy objects, keeping the environment itself thin and stable.

Constructor

SingleStockTradingEnv(
    data: pd.DataFrame | np.ndarray,
    config: SingleStockEnvConfig,
    action_strategy: BaseActionStrategy,
    reward_strategy: BaseRewardStrategy,
    observation_strategy: BaseObservationStrategy,
    price_column: str | int | None = None,   # auto-detected if None
)

Configuration

See Configuration for the full parameter reference. The key fields:

from quantrl_lab.environments.stock.components.config import SingleStockEnvConfig, SimulationConfig

config = SingleStockEnvConfig(
    initial_balance=100_000,
    window_size=20,
    simulation=SimulationConfig(transaction_cost_pct=0.001, slippage=0.001),
)

Minimal working example

import pandas as pd
from stable_baselines3 import PPO

from quantrl_lab.data.sources import YFinanceDataLoader
from quantrl_lab.data.processing.processor import DataProcessor
from quantrl_lab.environments.stock.single import SingleStockTradingEnv
from quantrl_lab.environments.stock.components.config import SingleStockEnvConfig
from quantrl_lab.environments.stock.strategies.actions.standard import StandardActionStrategy
from quantrl_lab.environments.stock.strategies.observations.feature_aware import FeatureAwareObservationStrategy
from quantrl_lab.environments.stock.strategies.rewards.portfolio_value import PortfolioValueChangeReward

# 1. Get data
loader = YFinanceDataLoader()
raw_df = loader.get_historical_ohlcv_data(["AAPL"], start="2021-01-01", end="2024-01-01")

processor = DataProcessor(ohlcv_data=raw_df)
df, _ = processor.data_processing_pipeline(indicators=["SMA", "RSI", "MACD"])

# 2. Build environment
config = SingleStockEnvConfig(initial_balance=100_000, window_size=20)

env = SingleStockTradingEnv(
    data=df,
    config=config,
    action_strategy=StandardActionStrategy(),
    reward_strategy=PortfolioValueChangeReward(),
    observation_strategy=FeatureAwareObservationStrategy(),
)

# 3. Train
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

Action Space

The action space is defined by the action strategy. Two strategies are available.

`StandardActionStrategy` — 3-element Box

The default strategy. Produces a Box(3,) action:

action = [action_type, amount, price_modifier]

Dimension	Range	Meaning
`action_type`	`[-1, 1]` → mapped to `[0, 6]`	Which order type to place
`amount`	`[-1, 1]` → mapped to `[0, 1]`	Fraction of available balance/shares
`price_modifier`	`[0.9, 1.1]`	Limit price as a multiple of market price

The symmetric [-1, 1] range for action_type and amount is intentional. An untrained policy outputs values near 0, which land in the middle of each mapped range; with a [0, 1] range the same outputs would always select "Hold" and a 0% amount.

Action types (Actions enum):

Int	Name	Description
0	`Hold`	No action
1	`Buy`	Market buy `amount%` of available cash
2	`Sell`	Market sell `amount%` of held shares
3	`LimitBuy`	Place limit buy order at `price × price_modifier`
4	`LimitSell`	Place limit sell order at `price × price_modifier`
5	`StopLoss`	Place stop-loss order at `price × price_modifier`
6	`TakeProfit`	Place take-profit order at `price × price_modifier`
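
The mapping from raw policy outputs to an order type and an amount can be sketched as follows. This is a minimal sketch, assuming a linear rescale followed by rounding to the nearest integer; `decode_action` is a hypothetical helper, and the library's own decoding may round or bin differently.

```python
from enum import IntEnum


class Actions(IntEnum):
    """Order types, numbered as in the table above."""
    Hold = 0
    Buy = 1
    Sell = 2
    LimitBuy = 3
    LimitSell = 4
    StopLoss = 5
    TakeProfit = 6


def _clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def decode_action(action_type_raw: float, amount_raw: float) -> tuple[Actions, float]:
    """Hypothetical decoder: rescale two [-1, 1] outputs to (order type, fraction)."""
    # [-1, 1] -> [0, 6], then round to the nearest order type (assumed rule).
    scaled = (_clip(action_type_raw, -1.0, 1.0) + 1.0) / 2.0 * (len(Actions) - 1)
    action_type = Actions(int(round(scaled)))
    # [-1, 1] -> [0, 1]
    amount = (_clip(amount_raw, -1.0, 1.0) + 1.0) / 2.0
    return action_type, amount


# A near-zero output lands mid-range instead of on Hold / 0%.
print(decode_action(0.0, 0.0))
print(decode_action(-1.0, -1.0))
```

Under this assumed mapping, order type k corresponds to a raw value of about k / 3 - 1, so Buy (1) sits near -0.667 and LimitSell (4) near 0.333.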

from quantrl_lab.environments.stock.strategies.actions.standard import StandardActionStrategy
import numpy as np

action_strategy = StandardActionStrategy()
print(action_strategy.define_action_space())
# Box([-1.  -1.   0.9], [1.  1.  1.1], (3,), float32)

# Manual action examples:
# Market buy 50% of cash:
action = np.array([-0.667, 0.0, 1.0])   # action_type → 1 (Buy), amount → 0.5

# Limit sell 75% of shares at 3% above market:
action = np.array([0.333, 0.5, 1.03])   # action_type → 4 (LimitSell), amount → 0.75

# Stop-loss 100% of shares at 5% below market:
action = np.array([0.667, 1.0, 0.95])   # action_type → 5 (StopLoss), amount → 1.0

`TimeInForceActionStrategy` — 4-element Box

An extended strategy that adds explicit Time-In-Force (TIF) control as a 4th dimension. Useful when the agent needs to differentiate between persistent and short-lived limit orders.

action = [action_type, amount, price_modifier, tif_type]

Dimension	Range	Meaning
`action_type`	`[-1, 1]` → `[0, 6]`	Same as Standard
`amount`	`[0, 1]`	Fraction of available balance/shares
`price_modifier`	`[0.9, 1.1]`	Limit price multiplier
`tif_type`	`[-1, 1]` → `{0, 1, 2}`	Order lifetime policy

TIF types:

Int	Name	Behaviour
0	`GTC` (Good Till Cancelled)	Order persists until filled or cancelled
1	`IOC` (Immediate or Cancel)	Fills immediately or is cancelled
2	`TTL` (Time To Live)	Expires after `order_expiration_steps` steps

from quantrl_lab.environments.stock.strategies.actions.time_in_force import TimeInForceActionStrategy

action_strategy = TimeInForceActionStrategy()
print(action_strategy.define_action_space())
# Box([-1.   0.   0.9 -1. ], [1.  1.  1.1  1. ], (4,), float32)
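
The three lifetime policies can be illustrated with a small stand-alone sketch. It assumes the same linear rescale-and-round mapping for `tif_type` and a simple age-based expiry rule; `decode_tif` and `has_expired` are hypothetical helpers, not library functions, and `age_steps` is assumed to count steps elapsed since the order was placed.

```python
from enum import IntEnum


class TIF(IntEnum):
    """Time-in-force policies, numbered as in the table above."""
    GTC = 0
    IOC = 1
    TTL = 2


def decode_tif(tif_raw: float) -> TIF:
    """Hypothetical decoder: rescale [-1, 1] to {0, 1, 2} and round."""
    clipped = max(-1.0, min(1.0, tif_raw))
    return TIF(int(round((clipped + 1.0) / 2.0 * (len(TIF) - 1))))


def has_expired(tif: TIF, age_steps: int, order_expiration_steps: int) -> bool:
    """Return True if an unfilled order of this age should be cancelled."""
    if tif is TIF.GTC:
        return False                      # persists until filled or cancelled
    if tif is TIF.IOC:
        return age_steps >= 1             # only fillable on the step it was placed
    return age_steps >= order_expiration_steps   # TTL


print(decode_tif(0.0).name)
print(has_expired(TIF.TTL, age_steps=5, order_expiration_steps=5))
```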

Observation Space

The default observation strategy is FeatureAwareObservationStrategy. It produces a flat Box(N,) vector:

observation = [market_window (flattened), portfolio_features]

Total size: window_size × num_features + 9
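
As a worked example of the size formula, the sketch below flattens a toy window and appends the portfolio block. It assumes row-major flattening (all features of the oldest step first), which matches the feature-name order shown later in this section; `build_observation` and `observation_size` are hypothetical helpers.

```python
def observation_size(window_size: int, num_features: int, num_portfolio_features: int = 9) -> int:
    """Length of the flat observation vector."""
    return window_size * num_features + num_portfolio_features


def build_observation(window: list[list[float]], portfolio_features: list[float]) -> list[float]:
    """Flatten the window row by row (oldest step first), then append portfolio features."""
    flat = [value for row in window for value in row]
    return flat + list(portfolio_features)


# 20 steps of 5 features (e.g. OHLCV) plus 9 portfolio features.
window = [[0.0] * 5 for _ in range(20)]
obs = build_observation(window, [0.0] * 9)
print(len(obs), observation_size(20, 5))  # 20 * 5 + 9 = 109 for both
```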

Market window

A rolling window of window_size steps (oldest → newest) of the full feature matrix. Columns are normalised by type:

Column type	Normalisation	Examples
Price-like	Divided by first value in window (relative return scale)	`open`, `high`, `low`, `close`, `SMA`, `EMA`, `BB_upper`
Oscillators 0–100	Divided by 100	`RSI`, `STOCH`, `MFI`, `ADX`
Williams %R (−100–0)	`(x + 100) / 100`	`WILLR`
CCI (unbounded ~±200)	Divided by 200	`CCI`
ATR	`ATR / close` (% volatility)	`ATR`
MACD	`MACD / close` (scale-free)	`MACD`, `MACD_signal`
OBV	Z-scored within the window	`OBV`
Sentiment / analyst / sector	Passed through as-is	`sentiment_score`, `sector_change`

Detection is keyword-based on column names — no manual labelling needed.
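
The per-type rules in the table can be sketched as a single dispatch function. This is an illustration only: the keyword lists, the order of the checks, and the pass-through treatment of any unmatched column (such as `volume`) are assumptions, and `normalise_column` is a hypothetical helper rather than the library's implementation.

```python
from statistics import mean, pstdev

PRICE_KEYWORDS = ("open", "high", "low", "close", "sma", "ema", "bb_")
OSCILLATOR_KEYWORDS = ("rsi", "stoch", "mfi", "adx")


def normalise_column(name: str, values: list[float], close: list[float]) -> list[float]:
    """Normalise one column of the window based on keywords in its name."""
    key = name.lower()
    if "willr" in key:                        # Williams %R: [-100, 0] -> [0, 1]
        return [(v + 100.0) / 100.0 for v in values]
    if "cci" in key:                          # roughly +/-200 -> roughly +/-1
        return [v / 200.0 for v in values]
    if "atr" in key or "macd" in key:         # express relative to the close price
        return [v / c for v, c in zip(values, close)]
    if "obv" in key:                          # z-score within the window
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    if any(k in key for k in OSCILLATOR_KEYWORDS):   # 0-100 oscillators
        return [v / 100.0 for v in values]
    if any(k in key for k in PRICE_KEYWORDS):        # relative to first value in window
        return [v / values[0] for v in values]
    return list(values)                       # unmatched columns pass through unchanged


close = [100.0, 110.0, 121.0]
print(normalise_column("close", close, close))
print(normalise_column("RSI", [30.0, 50.0, 70.0], close))
```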

Portfolio features (always appended, 9 values)

Feature	Description
`portfolio_balance_ratio`	`cash / initial_balance`
`position_size_ratio`	`(shares × price) / portfolio_value`
`unrealized_pl_pct`	`(current_price − avg_entry) / avg_entry`
`price_pos_in_range`	Position of current price within recent high/low range [0, 1]
`recent_volatility`	Std-dev of returns over `volatility_lookback` steps
`recent_trend`	Linear slope of price over `trend_lookback` steps
`risk_reward_ratio`	`(avg_take_profit − price) / (price − avg_stop_loss)`
`dist_to_stop_loss`	`(price − avg_stop_loss) / price`
`dist_to_take_profit`	`(avg_take_profit − price) / price`
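
The ratio-style features above can be reproduced from a handful of portfolio numbers. The sketch below assumes `portfolio_value = cash + shares × price` and returns 0.0 whenever a denominator is zero (for example with no open position or no stop-loss order); both choices are assumptions, and `portfolio_ratios` is a hypothetical helper.

```python
def _safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator != 0 else 0.0


def portfolio_ratios(
    cash: float,
    shares: float,
    price: float,
    initial_balance: float,
    avg_entry: float,
    avg_stop_loss: float,
    avg_take_profit: float,
) -> dict[str, float]:
    """Compute the formula-based portfolio features from the table."""
    portfolio_value = cash + shares * price
    return {
        "portfolio_balance_ratio": _safe_div(cash, initial_balance),
        "position_size_ratio": _safe_div(shares * price, portfolio_value),
        "unrealized_pl_pct": _safe_div(price - avg_entry, avg_entry),
        "risk_reward_ratio": _safe_div(avg_take_profit - price, price - avg_stop_loss),
        "dist_to_stop_loss": _safe_div(price - avg_stop_loss, price),
        "dist_to_take_profit": _safe_div(avg_take_profit - price, price),
    }


features = portfolio_ratios(
    cash=50_000, shares=500, price=100.0, initial_balance=100_000,
    avg_entry=80.0, avg_stop_loss=90.0, avg_take_profit=120.0,
)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

With these inputs half the portfolio is in cash, the position is up 25%, and the take-profit is twice as far from the price as the stop-loss, giving a risk/reward ratio of 2.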

from quantrl_lab.environments.stock.strategies.observations.feature_aware import FeatureAwareObservationStrategy

obs_strategy = FeatureAwareObservationStrategy(
    volatility_lookback=10,   # steps for recent volatility calc
    trend_lookback=10,        # steps for recent trend calc
    normalize_stationary=True # scale oscillators like RSI, ADX to [0,1]
)

# After env is built:
feature_names = obs_strategy.get_feature_names(env)
print(feature_names[:5])
# ['open_t-19', 'high_t-19', 'low_t-19', 'close_t-19', 'volume_t-19']
print(feature_names[-9:])
# ['portfolio_balance_ratio', 'position_size_ratio', ...]

Reward Shaping

Rewards are pluggable. Every strategy inherits BaseRewardStrategy and implements calculate_reward(env) -> float. Strategies can also implement on_step_end(env) for stateful updates and reset() for between-episode cleanup.

Available reward strategies

`PortfolioValueChangeReward`

The simplest baseline. Reward = % change in portfolio value since the previous step.

from quantrl_lab.environments.stock.strategies.rewards.portfolio_value import PortfolioValueChangeReward

reward_strategy = PortfolioValueChangeReward()
# reward = (current_value - prev_value) / prev_value

`DifferentialSharpeReward`

Dense step-level Sharpe signal. Divides the current excess return by a running estimate of return volatility, so a given return scores higher in a low-volatility regime than in a volatile one.

from quantrl_lab.environments.stock.strategies.rewards.sharpe import DifferentialSharpeReward

reward_strategy = DifferentialSharpeReward(
    risk_free_rate=0.0,   # per-step risk-free rate (usually 0 for daily)
    decay=0.99,           # EMA decay for running mean/variance; closer to 1 = longer memory
)
# reward = excess_return / running_std_dev
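
To make the role of `decay` concrete, here is a stand-alone sketch of an exponentially weighted mean and variance feeding a step-level ratio. It assumes the reward is computed from the statistics as they stood before the current return is folded in, and that the reward is 0.0 until some variance has been observed; `EmaSharpeSketch` is a hypothetical class and may differ from the library's internals.

```python
import math


class EmaSharpeSketch:
    """Step reward = excess return / running std-dev (EMA estimates)."""

    def __init__(self, risk_free_rate: float = 0.0, decay: float = 0.99, eps: float = 1e-8):
        self.risk_free_rate = risk_free_rate
        self.decay = decay
        self.eps = eps
        self.mean = 0.0
        self.var = 0.0

    def step(self, ret: float) -> float:
        excess = ret - self.risk_free_rate
        std = math.sqrt(self.var)
        reward = excess / std if std > self.eps else 0.0
        # Fold the new observation into the running mean and variance.
        alpha = 1.0 - self.decay
        delta = excess - self.mean
        self.mean += alpha * delta
        self.var = self.decay * (self.var + alpha * delta * delta)
        return reward


calm, wild = EmaSharpeSketch(), EmaSharpeSketch()
for r in [0.001, -0.001] * 50:
    calm.step(r)
for r in [0.02, -0.02] * 50:
    wild.step(r)
# The same +1% return earns more after a quiet history than after a volatile one.
print(calm.step(0.01) > wild.step(0.01))
```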

`DifferentialSortinoReward`

Like Sharpe but penalises only downside volatility (returns below target_return). Upside volatility is not penalised, so the agent is not discouraged from large positive moves.

from quantrl_lab.environments.stock.strategies.rewards.sortino import DifferentialSortinoReward

reward_strategy = DifferentialSortinoReward(
    target_return=0.0,   # minimum acceptable return (MAR)
    decay=0.99,
)
# reward = current_return / running_downside_deviation
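
The difference from the Sharpe variant is that only shortfalls below the target feed the denominator. A minimal sketch, assuming an exponentially weighted downside variance and a reward of 0.0 until the first shortfall has been observed; `EmaSortinoSketch` is a hypothetical class, not the library's implementation.

```python
import math


class EmaSortinoSketch:
    """Step reward = return / running downside deviation (EMA estimate)."""

    def __init__(self, target_return: float = 0.0, decay: float = 0.99):
        self.target_return = target_return
        self.decay = decay
        self.downside_var = 0.0

    def step(self, ret: float) -> float:
        downside_dev = math.sqrt(self.downside_var)
        reward = ret / downside_dev if downside_dev > 0 else 0.0
        # Only returns below the target contribute; gains add zero shortfall.
        shortfall = min(ret - self.target_return, 0.0)
        self.downside_var = (
            self.decay * self.downside_var + (1.0 - self.decay) * shortfall ** 2
        )
        return reward


sortino = EmaSortinoSketch()
for r in [0.01, 0.02, -0.01, 0.01]:
    print(round(sortino.step(r), 3))
```

Large gains leave the downside variance untouched apart from its decay, so the denominator does not grow when the agent makes money.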

`DrawdownPenaltyReward`

Continuous penalty proportional to the current drawdown from the episode's high-water mark. Keeps persistent pressure on the agent to recover from losses.

from quantrl_lab.environments.stock.strategies.rewards.drawdown import DrawdownPenaltyReward

reward_strategy = DrawdownPenaltyReward(penalty_factor=1.0)
# reward = -(drawdown_pct * penalty_factor)
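
The high-water-mark bookkeeping behind this penalty is short enough to sketch directly. `DrawdownTracker` is a hypothetical stand-alone class that takes the portfolio value as an argument instead of reading it from the environment.

```python
class DrawdownTracker:
    """Penalty proportional to the current drawdown from the running peak."""

    def __init__(self, penalty_factor: float = 1.0):
        self.penalty_factor = penalty_factor
        self.peak = 0.0

    def step(self, portfolio_value: float) -> float:
        # Raise the high-water mark whenever a new peak is reached.
        self.peak = max(self.peak, portfolio_value)
        drawdown_pct = (self.peak - portfolio_value) / self.peak if self.peak > 0 else 0.0
        return -(drawdown_pct * self.penalty_factor)

    def reset(self) -> None:
        self.peak = 0.0


tracker = DrawdownTracker(penalty_factor=0.5)
for value in [100.0, 110.0, 99.0, 104.5, 112.0]:
    print(value, round(tracker.step(value), 4))
```

The penalty stays negative at 104.5 even though the portfolio is recovering, and only returns to zero once a new peak (112.0) is set.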

`TurnoverPenaltyReward`

Penalises excessive trading by amplifying the transaction costs already embedded in portfolio value. Discourages "churn" trades whose P&L is swamped by fees.

from quantrl_lab.environments.stock.strategies.rewards.turnover import TurnoverPenaltyReward

reward_strategy = TurnoverPenaltyReward(penalty_factor=2.0)
# reward = -(fees_paid_this_step * penalty_factor)
# Fees already reduce portfolio value once, so penalty_factor=1.0 roughly doubles their impact; 5.0 aggressively punishes churning

`InvalidActionPenalty`

Fixed penalty when the agent tries an illegal action — e.g. selling with no shares held. Teaches the agent feasibility faster than relying on P&L alone.

from quantrl_lab.environments.stock.strategies.rewards.invalid_action import InvalidActionPenalty

reward_strategy = InvalidActionPenalty(penalty=-0.5)
# reward = -0.5 if action was invalid, else 0.0

`BoredomPenaltyReward`

Penalises holding a position beyond a grace period without meaningful price movement. Encourages timely entries and exits rather than indefinite holding.

from quantrl_lab.environments.stock.strategies.rewards.boredom import BoredomPenaltyReward

reward_strategy = BoredomPenaltyReward(
    penalty_per_step=-0.001,  # small penalty per step after grace period
    grace_period=10,          # steps before penalty kicks in
    min_profit_pct=0.005,     # unrealized profit % that would "reset" the timer
)
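
How the three parameters interact can be sketched with a simple counter. The rule below is an assumption made for illustration: the timer resets when the position is closed or when unrealised profit reaches `min_profit_pct`, and a flat penalty applies on every step past the grace period. `BoredomTimer` is a hypothetical class and the library's exact rule may differ.

```python
class BoredomTimer:
    """Flat per-step penalty once a stagnant position outlives the grace period."""

    def __init__(self, penalty_per_step: float = -0.001, grace_period: int = 10,
                 min_profit_pct: float = 0.005):
        self.penalty_per_step = penalty_per_step
        self.grace_period = grace_period
        self.min_profit_pct = min_profit_pct
        self.steps_held = 0

    def step(self, has_position: bool, unrealized_pl_pct: float) -> float:
        # No position, or a position that is paying off: restart the clock.
        if not has_position or unrealized_pl_pct >= self.min_profit_pct:
            self.steps_held = 0
            return 0.0
        self.steps_held += 1
        return self.penalty_per_step if self.steps_held > self.grace_period else 0.0


timer = BoredomTimer(grace_period=3)
rewards = [timer.step(True, 0.001) for _ in range(5)]
print(rewards)  # three free steps, then the penalty on steps four and five
```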

`LimitExecutionReward`

Bonus when a limit order fills at a better price than the prevailing market price at the time it was placed. Rewards the agent for using limit orders effectively rather than always paying market.

from quantrl_lab.environments.stock.strategies.rewards.execution_bonus import LimitExecutionReward

reward_strategy = LimitExecutionReward(improvement_multiplier=10.0)
# reward = price_improvement_pct * improvement_multiplier
# e.g. limit buy 2% below market → +0.20 bonus

`OrderExpirationPenaltyReward`

Fixed penalty per expired pending order. Discourages "order spam" — placing unrealistic limit orders that never fill and clog the system until TTL.

from quantrl_lab.environments.stock.strategies.rewards.expiration import OrderExpirationPenaltyReward

reward_strategy = OrderExpirationPenaltyReward(penalty_per_order=-0.1)
# reward = num_expired_orders * penalty_per_order

Combining with `CompositeReward`

Most real experiments combine multiple components. CompositeReward handles weighting and optional per-component normalisation.

from quantrl_lab.environments.stock.strategies.rewards.composite import CompositeReward
from quantrl_lab.environments.stock.strategies.rewards.sortino import DifferentialSortinoReward
from quantrl_lab.environments.stock.strategies.rewards.drawdown import DrawdownPenaltyReward
from quantrl_lab.environments.stock.strategies.rewards.turnover import TurnoverPenaltyReward
from quantrl_lab.environments.stock.strategies.rewards.invalid_action import InvalidActionPenalty

reward_strategy = CompositeReward(
    strategies=[
        DifferentialSortinoReward(decay=0.99),
        DrawdownPenaltyReward(penalty_factor=0.5),
        TurnoverPenaltyReward(penalty_factor=1.0),
        InvalidActionPenalty(penalty=-0.5),
    ],
    weights=[0.6, 0.2, 0.1, 0.1],
    normalize_weights=True,   # weights are normalised to sum to 1 (default)
    auto_scale=False,         # if True, each component is z-scored before weighting
)

normalize_weights=True (default): weights are scaled so they always sum to 1 even if they don't already.

auto_scale=True: each component is standardised to N(0,1) via Welford's online algorithm before being weighted. Use this when components have very different natural scales and you can't hand-tune weights easily. Running stats persist across episodes for stability.

Full End-to-End Example

A complete pipeline: fetch data → process → build env → train → evaluate.

import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

from quantrl_lab.data.sources import YFinanceDataLoader
from quantrl_lab.data.processing.processor import DataProcessor
from quantrl_lab.environments.stock.single import SingleStockTradingEnv
from quantrl_lab.environments.stock.components.config import (
    SingleStockEnvConfig, SimulationConfig, RewardConfig
)
from quantrl_lab.environments.stock.strategies.actions.standard import StandardActionStrategy
from quantrl_lab.environments.stock.strategies.observations.feature_aware import FeatureAwareObservationStrategy
from quantrl_lab.environments.stock.strategies.rewards.composite import CompositeReward
from quantrl_lab.environments.stock.strategies.rewards.sortino import DifferentialSortinoReward
from quantrl_lab.environments.stock.strategies.rewards.drawdown import DrawdownPenaltyReward
from quantrl_lab.environments.stock.strategies.rewards.turnover import TurnoverPenaltyReward
from quantrl_lab.environments.stock.strategies.rewards.invalid_action import InvalidActionPenalty

# ── 1. Data ──────────────────────────────────────────────────────────────────
loader = YFinanceDataLoader()
raw_df = loader.get_historical_ohlcv_data(["AAPL"], start="2019-01-01", end="2024-01-01")

processor = DataProcessor(ohlcv_data=raw_df)
splits, meta = processor.data_processing_pipeline(
    indicators=["RSI", {"SMA": {"window": 50}}, {"EMA": {"window": 20}}, "ATR", "MACD"],
    split_config={"train": 0.8, "test": 0.2},
)
train_df, test_df = splits["train"], splits["test"]

# ── 2. Strategies ─────────────────────────────────────────────────────────────
action_strategy  = StandardActionStrategy()
obs_strategy     = FeatureAwareObservationStrategy(volatility_lookback=10, trend_lookback=10)
reward_strategy  = CompositeReward(
    strategies=[
        DifferentialSortinoReward(decay=0.99),
        DrawdownPenaltyReward(penalty_factor=0.5),
        TurnoverPenaltyReward(penalty_factor=1.0),
        InvalidActionPenalty(penalty=-0.5),
    ],
    weights=[0.6, 0.2, 0.1, 0.1],
)

# ── 3. Config ─────────────────────────────────────────────────────────────────
config = SingleStockEnvConfig(
    initial_balance=100_000,
    window_size=20,
    simulation=SimulationConfig(transaction_cost_pct=0.001, slippage=0.001),
    rewards=RewardConfig(clip_range=(-1.0, 1.0)),
)

# ── 4. Train environment ───────────────────────────────────────────────────────
def make_train_env():
    return SingleStockTradingEnv(
        data=train_df, config=config,
        action_strategy=action_strategy,
        reward_strategy=reward_strategy,
        observation_strategy=obs_strategy,
    )

train_env = DummyVecEnv([make_train_env])
model = PPO("MlpPolicy", train_env, verbose=1, n_steps=2048, batch_size=64)
model.learn(total_timesteps=200_000)

# ── 5. Evaluate on test set ───────────────────────────────────────────────────
test_env = SingleStockTradingEnv(
    data=test_df, config=config,
    action_strategy=StandardActionStrategy(),
    reward_strategy=DifferentialSortinoReward(),   # use clean strategy for eval
    observation_strategy=FeatureAwareObservationStrategy(),
)

obs, _ = test_env.reset()
portfolio_values = []

while True:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = test_env.step(action)
    portfolio_values.append(info["portfolio_value"])
    if terminated or truncated:
        break

final_value = portfolio_values[-1]
total_return = (final_value - config.initial_balance) / config.initial_balance
print(f"Final portfolio value: ${final_value:,.2f}  ({total_return:.1%})")
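
Total return alone hides how rough the path was. A small helper such as the following, which is a hypothetical addition and not part of the library, summarises a recorded list of portfolio values with both total return and maximum drawdown.

```python
def summarise(portfolio_values: list[float], initial_balance: float) -> dict[str, float]:
    """Total return and maximum peak-to-trough drawdown of an equity curve."""
    peak = initial_balance
    max_drawdown = 0.0
    for value in portfolio_values:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, (peak - value) / peak)
    total_return = (portfolio_values[-1] - initial_balance) / initial_balance
    return {"total_return": total_return, "max_drawdown": max_drawdown}


# Toy equity curve: up 10%, down to 99k, then up to 120k.
stats = summarise([100_000.0, 110_000.0, 99_000.0, 120_000.0], initial_balance=100_000.0)
print(f"Return: {stats['total_return']:.1%}, max drawdown: {stats['max_drawdown']:.1%}")
```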

Custom Reward Strategy

Extend BaseRewardStrategy to implement any reward logic:

from quantrl_lab.environments.core.interfaces import BaseRewardStrategy

class CalmarRatioReward(BaseRewardStrategy):
    """Reward = mean per-step return / max drawdown (Calmar-inspired, not annualised)."""

    def __init__(self, penalty_factor: float = 1.0, min_drawdown: float = 0.01):
        super().__init__()
        self.penalty_factor = penalty_factor
        # Floor on the denominator; keeps the ratio bounded before any drawdown has occurred
        self.min_drawdown = min_drawdown
        self.reset()

    def calculate_reward(self, env) -> float:
        price = env._get_current_price()
        value = env.portfolio.get_value(price)

        # Track the episode's high-water mark
        self._peak_value = max(self._peak_value, value)

        # Current drawdown as a fraction of the peak, and the worst seen so far
        dd = (self._peak_value - value) / (self._peak_value + 1e-9)
        self._max_drawdown = max(self._max_drawdown, dd)

        ret = (value - env.prev_portfolio_value) / (env.prev_portfolio_value + 1e-9)
        self._cumulative_return += ret
        self._step += 1

        calmar = self._cumulative_return / (self._max_drawdown * self.penalty_factor)
        return float(calmar / self._step)   # normalise by steps elapsed this episode

    def reset(self):
        self._peak_value = 0.0
        self._max_drawdown = self.min_drawdown
        self._cumulative_return = 0.0
        self._step = 0
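
A custom strategy can be checked in isolation before training by handing it a stub object that exposes only the attributes it reads. The sketch below assumes the attribute names used in the example above (`_get_current_price`, `portfolio.get_value`, `prev_portfolio_value`). `make_stub_env` and `PctChangeReward` are hypothetical, and `PctChangeReward` deliberately does not inherit from `BaseRewardStrategy` so that the snippet runs without the library installed.

```python
from types import SimpleNamespace


class PctChangeReward:
    """Toy strategy with the same calculate_reward(env) shape as above."""

    def calculate_reward(self, env) -> float:
        price = env._get_current_price()
        value = env.portfolio.get_value(price)
        return float((value - env.prev_portfolio_value) / env.prev_portfolio_value)


def make_stub_env(price: float, cash: float, shares: float, prev_portfolio_value: float):
    """Build a minimal stand-in for the environment (hypothetical helper)."""
    portfolio = SimpleNamespace(get_value=lambda p: cash + shares * p)
    return SimpleNamespace(
        _get_current_price=lambda: price,
        portfolio=portfolio,
        prev_portfolio_value=prev_portfolio_value,
    )


# 1,000 shares marked at 105 against a previous value of 100,000 is a 5% gain.
env = make_stub_env(price=105.0, cash=0.0, shares=1_000, prev_portfolio_value=100_000.0)
reward = PctChangeReward().calculate_reward(env)
print(f"{reward:.4f}")
```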

Roadmap

Feature	Status	Notes
`MultiStockTradingEnv`	In development	Portfolio over N assets; cross-sectional observations
Continuous position sizing	Planned	Fractional shares, no rounding
Futures environment	Planned	Margin, leverage, contract expiry
Options environment	Exploratory	Greeks, IV surface as observations
Live trading integration	Partial	`AlpacaTrader` in `deployment/trading/` handles live execution
