QuantRL-Lab Architecture Guide

This document covers the key architectural decisions and design patterns in QuantRL-Lab. For practical usage, see the User Guide.

System Layers

QuantRL-Lab is organized into four layers:

Layer	Responsibility	Key Components
Data	Fetch, normalize, and engineer features from market data	`DataSource`, `DataProcessor`, `DataPipeline`
Environment	Gymnasium-compatible trading simulation	`SingleStockTradingEnv`, pluggable strategies
Experiment	Train, tune, and evaluate RL agents	`BacktestRunner`, `ExperimentJob`, `OptunaRunner`
Utilities	Feature selection, logging	`IndicatorRegistry`, `AgentExplainer`

Data flows top-to-bottom: raw market data → processed features → environment → RL agent → evaluation results.

Strategy Pattern (Dependency Injection)

Strategy injection is the core design principle. SingleStockTradingEnv accepts three pluggable strategies at construction time and never hardcodes how actions, observations, or rewards work:

env = SingleStockTradingEnv(
    data=df,
    config=config,
    action_strategy=...,       # How raw agent output maps to market orders
    observation_strategy=...,  # What state the agent sees
    reward_strategy=...,       # What scalar signal the agent optimizes
)

Each strategy type has an abstract base class in quantrl_lab.environments.core.interfaces:

BaseActionStrategy

def define_action_space(self) -> gym.spaces.Space: ...
def handle_action(self, env: TradingEnvProtocol, action: Any) -> Tuple[Any, Dict]: ...

BaseObservationStrategy

def define_observation_space(self, env: TradingEnvProtocol) -> gym.spaces.Space: ...
def build_observation(self, env: TradingEnvProtocol) -> np.ndarray: ...
def get_feature_names(self, env: TradingEnvProtocol) -> List[str]: ...

BaseRewardStrategy

def calculate_reward(self, env: TradingEnvProtocol) -> float: ...
def on_step_end(self, env: TradingEnvProtocol): ...  # optional hook for stateful strategies

Strategies are injected, not inherited. The environment calls handle_action(), calculate_reward(), and build_observation() each step — it doesn't care about the implementation. This makes it trivial to swap reward functions without touching environment code.
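To make the swap concrete, here is a minimal, self-contained sketch of injecting two different reward strategies into the same environment class. It does not import quantrl_lab: BaseRewardStrategy below is a stand-in that mirrors the two methods listed above, and ToyEnv, PortfolioReturnReward, DrawdownPenaltyReward, and the env.portfolio_value attribute are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseRewardStrategy(ABC):
    """Stand-in mirroring the interface listed above (not the library class)."""

    @abstractmethod
    def calculate_reward(self, env: Any) -> float: ...

    def on_step_end(self, env: Any) -> None:
        """Optional hook; stateless strategies can leave it as a no-op."""


class PortfolioReturnReward(BaseRewardStrategy):
    """Hypothetical reward: fractional change in portfolio value over the step."""

    def calculate_reward(self, env: Any) -> float:
        prev = env.prev_portfolio_value
        return (env.portfolio_value - prev) / prev if prev else 0.0


class DrawdownPenaltyReward(BaseRewardStrategy):
    """Hypothetical stateful reward: penalise distance below the running peak."""

    def __init__(self) -> None:
        self.peak = 0.0

    def calculate_reward(self, env: Any) -> float:
        peak = max(self.peak, env.portfolio_value)
        return -(peak - env.portfolio_value) / peak if peak else 0.0

    def on_step_end(self, env: Any) -> None:
        # State is updated in the hook, after the reward has been computed.
        self.peak = max(self.peak, env.portfolio_value)


class ToyEnv:
    """Minimal stand-in for the environment: it only knows the interface."""

    def __init__(self, reward_strategy: BaseRewardStrategy) -> None:
        self.reward_strategy = reward_strategy
        self.prev_portfolio_value = 100.0
        self.portfolio_value = 100.0  # assumed attribute, for illustration only

    def step(self, new_value: float) -> float:
        self.prev_portfolio_value = self.portfolio_value
        self.portfolio_value = new_value
        reward = self.reward_strategy.calculate_reward(self)
        self.reward_strategy.on_step_end(self)
        return reward


# Same environment class, two different reward signals, no environment changes.
return_env = ToyEnv(PortfolioReturnReward())
drawdown_env = ToyEnv(DrawdownPenaltyReward())
```

Because ToyEnv only calls calculate_reward() and on_step_end(), any object that provides those two methods can be passed in without editing the environment.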

Step Execution Order

Each env.step(action) call follows this exact sequence (see single.py):

Store prev_portfolio_value (for reward calculations)
portfolio.process_open_orders() — fill or expire pending limit/stop orders
action_strategy.handle_action(env, action) — decode and execute the new order
Advance current_step, check terminated / truncated
reward_strategy.calculate_reward(env) — compute scalar reward
Clip reward to reward_clip_range
reward_strategy.on_step_end(env) — stateful hook (e.g. update running stats)
observation_strategy.build_observation(env) — construct next state vector
Return (observation, reward, terminated, truncated, info)

Note: reward is computed before observation. Both strategies receive the environment instance (env) so they can access env.portfolio, env.data, env.current_step, env.action_type, etc.
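The sequence above can be sketched as a toy step() method. This is not the code in single.py: Recorder and ToyEnv are hypothetical stand-ins, the termination rule is an assumption, and the stub reward of 5.0 is chosen only to show the clipping step taking effect.

```python
from typing import Any, Dict, List, Tuple


class Recorder:
    """Stub collaborator that logs each call name into a shared list."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def process_open_orders(self) -> None:
        self.log.append("process_open_orders")

    def handle_action(self, env: Any, action: Any) -> None:
        self.log.append("handle_action")

    def calculate_reward(self, env: Any) -> float:
        self.log.append("calculate_reward")
        return 5.0  # deliberately outside the clip range used below

    def on_step_end(self, env: Any) -> None:
        self.log.append("on_step_end")

    def build_observation(self, env: Any) -> List[float]:
        self.log.append("build_observation")
        return [float(env.current_step)]


class ToyEnv:
    def __init__(self, n_steps: int, clip: Tuple[float, float] = (-1.0, 1.0)) -> None:
        self.log: List[str] = []
        stub = Recorder(self.log)
        # One stub plays all four roles so the call order lands in one log.
        self.portfolio = stub
        self.action_strategy = stub
        self.reward_strategy = stub
        self.observation_strategy = stub
        self.n_steps = n_steps
        self.reward_clip_range = clip
        self.current_step = 0
        self.portfolio_value = 100.0
        self.prev_portfolio_value = 100.0

    def step(self, action: Any) -> Tuple[List[float], float, bool, bool, Dict]:
        self.prev_portfolio_value = self.portfolio_value         # 1. store previous value
        self.portfolio.process_open_orders()                     # 2. fill or expire pending orders
        self.action_strategy.handle_action(self, action)         # 3. decode and execute the action
        self.current_step += 1                                   # 4. advance, then check end of episode
        terminated = self.current_step >= self.n_steps - 1       #    (assumed termination rule)
        truncated = False
        reward = self.reward_strategy.calculate_reward(self)     # 5. reward sees the new current_step
        low, high = self.reward_clip_range
        reward = max(low, min(high, reward))                     # 6. clip
        self.reward_strategy.on_step_end(self)                   # 7. stateful hook
        obs = self.observation_strategy.build_observation(self)  # 8. next observation
        return obs, reward, terminated, truncated, {}            # 9. Gymnasium-style 5-tuple


env = ToyEnv(n_steps=3)
obs, reward, terminated, truncated, info = env.step(action=0)
```

After one call, env.log holds the five collaborator calls in the order listed above, and the returned reward is the clipped value rather than the raw 5.0.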

Data Processing Pipeline

Raw OHLCV data passes through a builder-pattern pipeline (DataPipeline) composed of ProcessingStep units:

DataPipeline
├── TechnicalIndicatorStep   (SMA, EMA, RSI, MACD, ...)
├── AnalystEstimatesStep     (price targets, ratings — requires FMP API key)
├── MarketContextStep        (sector/industry performance)
├── SentimentEnrichmentStep  (news sentiment via HuggingFace — optional)
├── NumericConversionStep    (cast columns to float32)
└── ColumnCleanupStep        (drop non-numeric, rename)

DataProcessor.data_processing_pipeline() wraps this into a single call. Every step is recorded in the returned ProcessingMetadata object (which columns were dropped, which indicators were added, etc.).

ProcessingStep is a Protocol (structural typing) rather than an ABC — any class with a process(df, metadata) method qualifies. This makes it easy to add custom steps without modifying the pipeline itself.
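As a rough illustration of structural typing combined with a builder, the sketch below defines a protocol with a process() method and two custom steps that satisfy it without inheriting from anything. To stay dependency-free it uses a list of dicts in place of a DataFrame and a plain dict in place of ProcessingMetadata. The Pipeline class, its add_step() and run() methods, the two step classes, and the return type of process() are all assumptions, not the library's API.

```python
from typing import Any, Dict, List, Protocol, Tuple

Rows = List[Dict[str, Any]]  # stand-in for a DataFrame, to keep the sketch stdlib-only


class ProcessingStep(Protocol):
    """Anything with a matching process() method qualifies as a step."""

    def process(self, rows: Rows, metadata: Dict[str, Any]) -> Rows: ...


class DailyRangeStep:
    """Hypothetical custom step: no base class, just a process() method."""

    def process(self, rows: Rows, metadata: Dict[str, Any]) -> Rows:
        metadata.setdefault("columns_added", []).append("range")
        return [{**r, "range": r["high"] - r["low"]} for r in rows]


class DropColumnsStep:
    """Hypothetical cleanup step that records what it removed."""

    def __init__(self, columns: List[str]) -> None:
        self.columns = columns

    def process(self, rows: Rows, metadata: Dict[str, Any]) -> Rows:
        metadata.setdefault("columns_dropped", []).extend(self.columns)
        return [{k: v for k, v in r.items() if k not in self.columns} for r in rows]


class Pipeline:
    """Hypothetical builder: add_step() returns self so calls can be chained."""

    def __init__(self) -> None:
        self._steps: List[ProcessingStep] = []

    def add_step(self, step: ProcessingStep) -> "Pipeline":
        self._steps.append(step)
        return self

    def run(self, rows: Rows) -> Tuple[Rows, Dict[str, Any]]:
        metadata: Dict[str, Any] = {}
        for step in self._steps:
            rows = step.process(rows, metadata)
        return rows, metadata


rows = [{"symbol": "ABC", "high": 11.0, "low": 9.5}]
out, meta = (
    Pipeline()
    .add_step(DailyRangeStep())
    .add_step(DropColumnsStep(["symbol"]))
    .run(rows)
)
```

The pipeline never checks the type of a step; a static type checker would flag a class whose process() signature does not match the protocol.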

Protocol Pattern for Data Sources

Data sources use a hybrid of ABC + Protocols. DataSource (ABC) provides shared infrastructure (source_name, connect(), supported_features). Optional capabilities are expressed as runtime-checkable Protocols — a loader satisfies a protocol simply by having the required methods, with no inheritance needed:

Protocol	Key Methods
`HistoricalDataCapable`	`get_historical_ohlcv_data()`
`LiveDataCapable`	`get_latest_quote()`, `get_latest_trade()`
`StreamingCapable`	`subscribe_to_updates()`, `start_streaming()`, `stop_streaming()`
`NewsDataCapable`	`get_news_data()`
`FundamentalDataCapable`	`get_fundamental_data()`
`AnalystDataCapable`	`get_historical_grades()`, `get_historical_rating()`
`SectorDataCapable`	`get_historical_sector_performance()`, `get_historical_industry_performance()`
`CompanyProfileCapable`	`get_company_profile()`

This avoids forcing every loader to inherit from a sprawling ABC chain. Adding a new capability means defining a new protocol and implementing the method on the relevant loaders — the class hierarchy doesn't change. Capabilities are checked at runtime via isinstance(loader, SomeProtocol), and supported_features auto-discovers which protocols a loader implements.
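The capability check can be sketched with typing.Protocol and runtime_checkable from the standard library. The protocol and method names below come from the table above, but the method parameters, the CAPABILITIES tuple, ToyLoader, and the way supported_features is computed are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class HistoricalDataCapable(Protocol):
    def get_historical_ohlcv_data(self, symbol: str) -> Any: ...


@runtime_checkable
class NewsDataCapable(Protocol):
    def get_news_data(self, symbol: str) -> Any: ...


# Hypothetical list of known capabilities used for auto-discovery.
CAPABILITIES = (HistoricalDataCapable, NewsDataCapable)


class DataSource(ABC):
    """Stand-in base class: shared infrastructure only, no capability methods."""

    @property
    @abstractmethod
    def source_name(self) -> str: ...

    @property
    def supported_features(self) -> List[str]:
        # isinstance() against a runtime-checkable protocol tests method presence.
        return [p.__name__ for p in CAPABILITIES if isinstance(self, p)]


class ToyLoader(DataSource):
    """Hypothetical loader: gains a capability just by defining the method."""

    @property
    def source_name(self) -> str:
        return "toy"

    def get_historical_ohlcv_data(self, symbol: str) -> list:
        return []


loader = ToyLoader()
features = loader.supported_features
```

One caveat of this technique: a runtime isinstance() check against a protocol only verifies that the methods exist, not that their signatures match.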

Registry Pattern for Technical Indicators

Technical indicators are registered via a decorator on module import, eliminating hardcoded lists:

@IndicatorRegistry.register("RSI")
def compute_rsi(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    ...

The registry provides:

IndicatorRegistry.list_all(): discover all available indicators dynamically
IndicatorRegistry.apply("RSI", df, window=14): apply any indicator by name

This is what makes the feature selection workflow (notebooks/feature_selection.ipynb) possible: the notebook can iterate over all registered indicators programmatically, compute them all, and rank by correlation with returns — without maintaining a hardcoded list anywhere.

Adding a new indicator is a one-step operation: decorate the function with @IndicatorRegistry.register("NAME"). It becomes immediately available to DataProcessor and the feature selection pipeline.
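A decorator-based registry of this kind can be sketched in a few lines. ToyIndicatorRegistry is a hypothetical stand-in rather than the library's IndicatorRegistry, and the two indicators operate on plain lists of prices instead of DataFrames so that the example has no dependencies.

```python
from typing import Callable, Dict, List


class ToyIndicatorRegistry:
    """Hypothetical stand-in showing the register / list_all / apply trio."""

    _indicators: Dict[str, Callable[..., List[float]]] = {}

    @classmethod
    def register(cls, name: str):
        def decorator(func):
            cls._indicators[name] = func
            return func  # the function stays usable under its own name

        return decorator

    @classmethod
    def list_all(cls) -> List[str]:
        return sorted(cls._indicators)

    @classmethod
    def apply(cls, name: str, data: List[float], **kwargs) -> List[float]:
        if name not in cls._indicators:
            raise KeyError(f"Unknown indicator: {name!r}")
        return cls._indicators[name](data, **kwargs)


@ToyIndicatorRegistry.register("SMA")
def compute_sma(prices: List[float], window: int = 3) -> List[float]:
    """Simple moving average over each full trailing window."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


@ToyIndicatorRegistry.register("MOMENTUM")
def compute_momentum(prices: List[float], window: int = 1) -> List[float]:
    """Price change over `window` bars."""
    return [prices[i] - prices[i - window] for i in range(window, len(prices))]


# Iterate over everything that has been registered, with no hardcoded list.
prices = [1.0, 2.0, 3.0, 4.0]
results = {
    name: ToyIndicatorRegistry.apply(name, prices)
    for name in ToyIndicatorRegistry.list_all()
}
```

Registration happens when the decorated function's module is imported, so a module that is never imported contributes nothing to list_all().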

Non-Obvious Behaviours

Things that aren't apparent from the API surface and tend to cause confusion:

current_step advances after the action and before the reward is computed. Reward is calculated on the post-action state at the new current_step, so env.data[env.current_step] inside calculate_reward() is the bar after the one the trade was placed on.
action_type is set by handle_action() and read by reward strategies. If your reward strategy inspects env.action_type, it always sees the current step's action, because handle_action() stores it on the environment before calculate_reward() is called.
The portfolio resets on reset() and retains no episode history. env.reset() returns balance, shares, and open orders to their initial state, and also clears the portfolio's transaction log. Episode data is only preserved if you collect it externally (e.g. via BacktestRunner).
Price column is auto-detected. The environment searches for columns named close, Close, or adj_close; falls back to the 4th column (index 3) if none match. Pass price_column= explicitly to override.
window_size affects observation padding, not data slicing. The full dataset is always available via env.data. window_size controls how many bars the observation strategy uses for its rolling window, with zero-padding at the start of an episode.
CompositeReward weights are not automatically normalised. Pass normalize_weights=True to normalise them to sum to 1. auto_scale=True is separate — it standardises each component to N(0,1) before weighting, which is generally recommended when mixing strategies with different magnitudes.
n_envs > 1 in ExperimentJob uses vectorised environments for training only. Evaluation always runs on a single environment regardless of n_envs.
Limit and stop orders persist across steps. An unexecuted order placed at step t stays in portfolio.open_orders and is checked at the start of every subsequent step via process_open_orders() until it fills or the episode ends. Orders placed with OrderTIF.TTL additionally expire after SimulationConfig.order_expiration_steps steps (default 5).
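The last point above can be illustrated with a toy portfolio that re-checks its open orders at the start of each step. This is a sketch under stated assumptions, not the library's implementation: ToyPortfolio, LimitBuy, the OrderTIF.GTC member, the fill rule (price at or below the limit), and the expiry boundary (age >= order_expiration_steps) are all hypothetical. Only OrderTIF.TTL, open_orders, process_open_orders, and order_expiration_steps are names taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class OrderTIF(Enum):
    GTC = "gtc"  # hypothetical member: stays open until filled or episode end
    TTL = "ttl"  # named in the text above: expires after a fixed number of steps


@dataclass
class LimitBuy:
    limit_price: float
    placed_step: int
    tif: OrderTIF


class ToyPortfolio:
    def __init__(self, order_expiration_steps: int = 5) -> None:
        self.order_expiration_steps = order_expiration_steps
        self.open_orders: List[LimitBuy] = []
        self.filled: List[LimitBuy] = []
        self.expired: List[LimitBuy] = []

    def process_open_orders(self, current_step: int, price: float) -> None:
        """Fill, expire, or carry forward every pending order."""
        still_open: List[LimitBuy] = []
        for order in self.open_orders:
            age = current_step - order.placed_step
            if price <= order.limit_price:  # assumed fill rule
                self.filled.append(order)
            elif order.tif is OrderTIF.TTL and age >= self.order_expiration_steps:
                self.expired.append(order)  # assumed expiry boundary
            else:
                still_open.append(order)    # persists to the next step
        self.open_orders = still_open


portfolio = ToyPortfolio(order_expiration_steps=2)
portfolio.open_orders += [
    LimitBuy(limit_price=95.0, placed_step=0, tif=OrderTIF.GTC),
    LimitBuy(limit_price=90.0, placed_step=0, tif=OrderTIF.TTL),
]

open_counts = []
for step, price in enumerate([100.0, 99.0, 98.0, 94.0], start=1):
    portfolio.process_open_orders(step, price)
    open_counts.append(len(portfolio.open_orders))
```

In this run the TTL order expires at step 2 without filling, while the other order stays open until the price drops to its limit at step 4.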