QuantRL-Lab Architecture Guide
This document covers the key architectural decisions and design patterns in QuantRL-Lab. For practical usage, see the User Guide.
Table of Contents
- System Layers
- Strategy Pattern (Dependency Injection)
- Step Execution Order
- Data Processing Pipeline
- Protocol Pattern for Data Sources
- Registry Pattern for Technical Indicators
- Non-Obvious Behaviours
System Layers
QuantRL-Lab is organized into four layers:
| Layer | Responsibility | Key Components |
|---|---|---|
| Data | Fetch, normalize, and engineer features from market data | DataSource, DataProcessor, DataPipeline |
| Environment | Gymnasium-compatible trading simulation | SingleStockTradingEnv, pluggable strategies |
| Experiment | Train, tune, and evaluate RL agents | BacktestRunner, ExperimentJob, OptunaRunner |
| Utilities | Feature selection, logging | IndicatorRegistry, AgentExplainer |
Data flows top-to-bottom: raw market data → processed features → environment → RL agent → evaluation results.
Strategy Pattern (Dependency Injection)
Dependency injection of strategies is the core design principle. SingleStockTradingEnv accepts three pluggable strategies at construction time and never hardcodes how actions, observations, or rewards work:
env = SingleStockTradingEnv(
    data=df,
    config=config,
    action_strategy=...,       # How raw agent output maps to market orders
    observation_strategy=...,  # What state the agent sees
    reward_strategy=...,       # What scalar signal the agent optimizes
)
Each strategy type has an abstract base class in quantrl_lab.environments.core.interfaces:
BaseActionStrategy
def define_action_space(self) -> gym.spaces.Space: ...
def handle_action(self, env: TradingEnvProtocol, action: Any) -> Tuple[Any, Dict]: ...
BaseObservationStrategy
def define_observation_space(self, env: TradingEnvProtocol) -> gym.spaces.Space: ...
def build_observation(self, env: TradingEnvProtocol) -> np.ndarray: ...
def get_feature_names(self, env: TradingEnvProtocol) -> List[str]: ...
BaseRewardStrategy
def calculate_reward(self, env: TradingEnvProtocol) -> float: ...
def on_step_end(self, env: TradingEnvProtocol): ... # optional hook for stateful strategies
Strategies are injected, not inherited. The environment calls handle_action(), calculate_reward(), and build_observation() each step — it doesn't care about the implementation. This makes it trivial to swap reward functions without touching environment code.
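To make the injection idea concrete, here is a minimal, self-contained sketch. It does not import QuantRL-Lab: the BaseRewardStrategy class below is a stand-in that mirrors the two methods listed above, and ToyEnv, PortfolioReturnReward, and DrawdownPenaltyReward are hypothetical names invented for illustration. The point is only that the environment code stays identical while the reward logic is swapped.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class BaseRewardStrategy(ABC):
    """Stand-in for the interface described above (not the library class)."""

    @abstractmethod
    def calculate_reward(self, env) -> float: ...

    def on_step_end(self, env) -> None:
        """Optional hook; stateless strategies can leave this as a no-op."""


class PortfolioReturnReward(BaseRewardStrategy):
    """Hypothetical strategy: relative change in portfolio value."""

    def calculate_reward(self, env) -> float:
        return (env.portfolio_value - env.prev_portfolio_value) / env.prev_portfolio_value


class DrawdownPenaltyReward(BaseRewardStrategy):
    """Hypothetical stateful strategy: penalise distance below the running peak."""

    def __init__(self) -> None:
        self.peak = 0.0

    def calculate_reward(self, env) -> float:
        peak = max(self.peak, env.portfolio_value)
        return -(peak - env.portfolio_value) / peak

    def on_step_end(self, env) -> None:
        # State is updated in the hook, after the reward has been computed.
        self.peak = max(self.peak, env.portfolio_value)


@dataclass
class ToyEnv:
    """Toy environment: it only knows the strategy interface, not the implementation."""

    reward_strategy: BaseRewardStrategy
    prev_portfolio_value: float = 100.0
    portfolio_value: float = 100.0

    def step(self, new_value: float) -> float:
        self.prev_portfolio_value = self.portfolio_value
        self.portfolio_value = new_value
        reward = self.reward_strategy.calculate_reward(self)
        self.reward_strategy.on_step_end(self)
        return reward
```

Swapping the reward function is then a one-argument change at construction time, for example ToyEnv(PortfolioReturnReward()) versus ToyEnv(DrawdownPenaltyReward()), with no edits to ToyEnv.step().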
Step Execution Order
Each env.step(action) call follows this exact sequence (see single.py):
- Store prev_portfolio_value (for reward calculations)
- portfolio.process_open_orders(): fill or expire pending limit/stop orders
- action_strategy.handle_action(env, action): decode and execute the new order
- Advance current_step, check terminated/truncated
- reward_strategy.calculate_reward(env): compute scalar reward
- Clip reward to reward_clip_range
- reward_strategy.on_step_end(env): stateful hook (e.g. update running stats)
- observation_strategy.build_observation(env): construct next state vector
- Return (observation, reward, terminated, truncated, info)
Note: the reward is computed before the observation is built. Both the reward and observation strategies receive the environment instance (env), so they can access env.portfolio, env.data, env.current_step, env.action_type, etc.
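The sequence above can be sketched as a toy step() method. This is not the code in single.py: ToyEnv and RecordingStub are hypothetical, a single stub object plays the role of the portfolio and all three strategies so that the call order can be recorded, and the termination rule (stop at the last bar) is an assumption made only to keep the example runnable.

```python
from typing import Any, List, Tuple


class RecordingStub:
    """Plays portfolio and all three strategies; logs each call in order."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def process_open_orders(self) -> None:
        self.log.append("process_open_orders")

    def handle_action(self, env: Any, action: Any) -> None:
        self.log.append("handle_action")

    def calculate_reward(self, env: Any) -> float:
        self.log.append("calculate_reward")
        return 5.0  # deliberately outside the clip range used below

    def on_step_end(self, env: Any) -> None:
        self.log.append("on_step_end")

    def build_observation(self, env: Any) -> List[float]:
        self.log.append("build_observation")
        return [float(env.current_step)]


class ToyEnv:
    def __init__(self, n_steps: int, reward_clip_range: Tuple[float, float] = (-1.0, 1.0)) -> None:
        self.log: List[str] = []
        stub = RecordingStub(self.log)
        self.portfolio = self.action_strategy = stub
        self.reward_strategy = self.observation_strategy = stub
        self.n_steps = n_steps
        self.reward_clip_range = reward_clip_range
        self.current_step = 0
        self.portfolio_value = 100.0
        self.prev_portfolio_value = 100.0

    def step(self, action: Any):
        self.prev_portfolio_value = self.portfolio_value          # store previous value
        self.portfolio.process_open_orders()                      # pending orders first
        self.action_strategy.handle_action(self, action)          # then the new order
        self.current_step += 1                                    # advance the clock
        terminated = self.current_step >= self.n_steps - 1        # assumed rule
        truncated = False
        reward = self.reward_strategy.calculate_reward(self)      # reward on the new step
        low, high = self.reward_clip_range
        reward = max(low, min(high, reward))                      # clip
        self.reward_strategy.on_step_end(self)                    # stateful hook
        obs = self.observation_strategy.build_observation(self)   # observation last
        return obs, reward, terminated, truncated, {}
```

Inspecting env.log after one call to step() shows the five delegated calls in the documented order, and the raw reward of 5.0 comes back clipped to the upper bound of the range.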
Data Processing Pipeline
Raw OHLCV data passes through a builder-pattern pipeline (DataPipeline) composed of ProcessingStep units:
DataPipeline
├── TechnicalIndicatorStep (SMA, EMA, RSI, MACD, ...)
├── AnalystEstimatesStep (price targets, ratings — requires FMP API key)
├── MarketContextStep (sector/industry performance)
├── SentimentEnrichmentStep (news sentiment via HuggingFace — optional)
├── NumericConversionStep (cast columns to float32)
└── ColumnCleanupStep (drop non-numeric, rename)
DataProcessor.data_processing_pipeline() wraps this into a single call. All steps are tracked in the returned ProcessingMetadata object (which columns were dropped, which indicators were added, etc.).
ProcessingStep is a Protocol (structural typing) rather than an ABC — any class with a process(df, metadata) method qualifies. This makes it easy to add custom steps without modifying the pipeline itself.
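A rough sketch of what structural typing buys here, under several assumptions: the Protocol below is a stand-in written for this example, a plain dict of lists stands in for the DataFrame, metadata is a plain dict rather than ProcessingMetadata, and the choice to return the new frame while updating metadata in place is a guess at the contract. LogReturnStep and run_pipeline are hypothetical names.

```python
import math
from typing import Any, Dict, List, Protocol, Sequence, Tuple, runtime_checkable

Frame = Dict[str, List[float]]  # stand-in for a DataFrame: column name -> values


@runtime_checkable
class ProcessingStep(Protocol):
    """Stand-in protocol: anything with a process(df, metadata) method qualifies."""

    def process(self, df: Frame, metadata: Dict[str, Any]) -> Frame: ...


class LogReturnStep:
    """Hypothetical custom step. Note that it inherits from nothing."""

    def __init__(self, column: str = "close") -> None:
        self.column = column

    def process(self, df: Frame, metadata: Dict[str, Any]) -> Frame:
        prices = df[self.column]
        # First bar has no predecessor, so its return is set to 0.0.
        returns = [0.0] + [math.log(b / a) for a, b in zip(prices, prices[1:])]
        out = dict(df)  # shallow copy so the input frame is left untouched
        out["log_return"] = returns
        metadata.setdefault("columns_added", []).append("log_return")
        return out


def run_pipeline(
    df: Frame, steps: Sequence[Any], metadata: Dict[str, Any] = None
) -> Tuple[Frame, Dict[str, Any]]:
    """Apply each step in order, rejecting objects that lack a process() method."""
    metadata = {} if metadata is None else metadata
    for step in steps:
        if not isinstance(step, ProcessingStep):
            raise TypeError(f"{type(step).__name__} has no process() method")
        df = step.process(df, metadata)
    return df, metadata
```

Because the check is structural, LogReturnStep can be handed to run_pipeline without the pipeline module ever importing it or knowing about it in advance.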
Protocol Pattern for Data Sources
Data sources use a hybrid of ABC + Protocols. DataSource (ABC) provides shared infrastructure (source_name, connect(), supported_features). Optional capabilities are expressed as runtime-checkable Protocols — a loader satisfies a protocol simply by having the required methods, with no inheritance needed:
| Protocol | Key Methods |
|---|---|
| HistoricalDataCapable | get_historical_ohlcv_data() |
| LiveDataCapable | get_latest_quote(), get_latest_trade() |
| StreamingCapable | subscribe_to_updates(), start_streaming(), stop_streaming() |
| NewsDataCapable | get_news_data() |
| FundamentalDataCapable | get_fundamental_data() |
| AnalystDataCapable | get_historical_grades(), get_historical_rating() |
| SectorDataCapable | get_historical_sector_performance(), get_historical_industry_performance() |
| CompanyProfileCapable | get_company_profile() |
This avoids forcing every loader to inherit from a sprawling ABC chain. Adding a new capability means defining a new protocol and implementing the method on the relevant loaders — the class hierarchy doesn't change. Capabilities are checked at runtime via isinstance(loader, SomeProtocol), and supported_features auto-discovers which protocols a loader implements.
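The hybrid ABC plus Protocol arrangement might look roughly like the following. Everything here is a simplified stand-in rather than the library's code: only two of the protocols are reproduced, the method signatures (a single symbol argument) are assumed, connect() is omitted, and the CAPABILITIES mapping and ToyPriceLoader are hypothetical. It relies only on typing.runtime_checkable, which makes isinstance() check for the presence of the protocol's methods.

```python
from abc import ABC, abstractmethod
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class HistoricalDataCapable(Protocol):
    def get_historical_ohlcv_data(self, symbol: str) -> list: ...  # signature assumed


@runtime_checkable
class NewsDataCapable(Protocol):
    def get_news_data(self, symbol: str) -> list: ...  # signature assumed


# Hypothetical lookup table used for auto-discovery in this sketch.
CAPABILITIES = {"historical": HistoricalDataCapable, "news": NewsDataCapable}


class DataSource(ABC):
    """Stand-in base class: shared infrastructure only, no capability methods."""

    @property
    @abstractmethod
    def source_name(self) -> str: ...

    @property
    def supported_features(self) -> List[str]:
        # A loader "supports" a feature if it structurally satisfies the protocol.
        return [name for name, proto in CAPABILITIES.items() if isinstance(self, proto)]


class ToyPriceLoader(DataSource):
    """Hypothetical loader: gains a capability just by defining the method."""

    @property
    def source_name(self) -> str:
        return "toy"

    def get_historical_ohlcv_data(self, symbol: str) -> list:
        return [{"symbol": symbol, "close": 100.0}]
```

In this sketch, adding news support to ToyPriceLoader would mean adding a get_news_data() method and nothing else; supported_features would pick it up without any change to the base class.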
Registry Pattern for Technical Indicators
Technical indicators are registered via a decorator on module import, eliminating hardcoded lists:
@IndicatorRegistry.register("RSI")
def compute_rsi(df: pd.DataFrame, window: int = 14) -> pd.DataFrame:
...
The registry provides:
- IndicatorRegistry.list_all() — discover all available indicators dynamically
- IndicatorRegistry.apply("RSI", df, window=14) — apply any indicator by name
This is what makes the feature selection workflow (notebooks/feature_selection.ipynb) possible: the notebook can iterate over all registered indicators programmatically, compute them all, and rank by correlation with returns — without maintaining a hardcoded list anywhere.
Adding a new indicator is a one-step operation: decorate the function with @IndicatorRegistry.register("NAME"). It becomes immediately available to DataProcessor and the feature selection pipeline.
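For readers unfamiliar with decorator-based registries, the mechanism can be sketched in a few lines. ToyIndicatorRegistry below is a stand-in written for this guide, not the real IndicatorRegistry; it mirrors the register(), list_all(), and apply() calls shown above, uses a dict of lists in place of a DataFrame, and the duplicate-name check and the SMA_{window} output column name are assumptions.

```python
from typing import Callable, Dict, List

Frame = Dict[str, List[float]]  # stand-in for a DataFrame: column name -> values


class ToyIndicatorRegistry:
    """Stand-in registry: maps indicator names to the functions that compute them."""

    _indicators: Dict[str, Callable[..., Frame]] = {}

    @classmethod
    def register(cls, name: str) -> Callable:
        def decorator(func: Callable[..., Frame]) -> Callable[..., Frame]:
            if name in cls._indicators:
                raise ValueError(f"indicator {name!r} is already registered")
            cls._indicators[name] = func
            return func  # the function itself is returned unchanged

        return decorator

    @classmethod
    def list_all(cls) -> List[str]:
        return sorted(cls._indicators)

    @classmethod
    def apply(cls, name: str, df: Frame, **kwargs) -> Frame:
        return cls._indicators[name](df, **kwargs)


@ToyIndicatorRegistry.register("SMA")
def compute_sma(df: Frame, window: int = 3) -> Frame:
    """Simple moving average; early bars average over the bars available so far."""
    closes = df["close"]
    sma = []
    for i in range(len(closes)):
        chunk = closes[max(0, i - window + 1): i + 1]
        sma.append(sum(chunk) / len(chunk))
    return {**df, f"SMA_{window}": sma}
```

Registration happens as a side effect of the decorator running when the module is imported, which is why no central list needs editing: a loop over ToyIndicatorRegistry.list_all() sees every indicator whose module has been loaded.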
Non-Obvious Behaviours
Things that aren't apparent from the API surface and tend to cause confusion:
- current_step advances after the action and before the reward is computed. Reward is computed on the post-action state at the new current_step, so env.data[env.current_step] inside calculate_reward() is the bar after the trade was placed, not the bar it was placed on.
- action_type is set by handle_action() and read by reward strategies. If your reward strategy inspects env.action_type, it will always see the current step's action, because handle_action() stores it before calculate_reward() is called.
- The portfolio resets on reset() and retains no episode history. env.reset() resets balance, shares, and open orders to their initial state. The portfolio's transaction log is also cleared. Episode data is only preserved if you collect it externally (e.g. via BacktestRunner).
- Price column is auto-detected. The environment searches for columns named close, Close, or adj_close, and falls back to the 4th column (index 3) if none match. Pass price_column= explicitly to override.
- window_size affects observation padding, not data slicing. The full dataset is always available via env.data. window_size controls how many bars the observation strategy uses for its rolling window, with zero-padding at the start of an episode.
- CompositeReward weights are not automatically normalised. Pass normalize_weights=True to normalise them to sum to 1. auto_scale=True is separate: it standardises each component to N(0,1) before weighting, which is generally recommended when mixing strategies with different magnitudes.
- n_envs > 1 in ExperimentJob uses vectorised environments for training only. Evaluation always runs on a single environment regardless of n_envs.
- Limit and stop orders persist across steps. An unexecuted order placed at step t stays in portfolio.open_orders and is checked at the start of every subsequent step via process_open_orders() until it fills or the episode ends. Orders placed with OrderTIF.TTL additionally expire after SimulationConfig.order_expiration_steps steps (default 5).
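Three of the behaviours above (price-column detection, window zero-padding, and weight normalisation) are easy to misread, so here is an illustrative sketch of each. These helper functions are hypothetical and are not part of QuantRL-Lab; the candidate search order, left-side padding, and the choice to include the current bar in the window are assumptions consistent with the descriptions above, not confirmed implementation details.

```python
from typing import List, Optional, Sequence


def detect_price_column(columns: Sequence[str], price_column: Optional[str] = None) -> str:
    """Explicit override wins; otherwise try known names, then fall back to index 3."""
    if price_column is not None:
        return price_column
    for candidate in ("close", "Close", "adj_close"):  # search order assumed
        if candidate in columns:
            return candidate
    return columns[3]  # the 4th column


def padded_window(values: Sequence[float], current_step: int, window_size: int) -> List[float]:
    """Last window_size values up to current_step, zero-padded on the left."""
    start = max(0, current_step - window_size + 1)
    window = list(values[start: current_step + 1])
    return [0.0] * (window_size - len(window)) + window


def normalize_weights(weights: Sequence[float]) -> List[float]:
    """Rescale weights so they sum to 1, leaving their ratios unchanged."""
    total = sum(weights)
    return [w / total for w in weights]
```

The fallback in detect_price_column is the case most likely to bite: a frame whose columns are in an unusual order and whose price column has a non-standard name will silently use whatever sits at index 3, which is a good reason to pass price_column= explicitly.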