Batch Processing¶
Parallel multi-replay parsing via ProcessPoolExecutor.
Each worker process parses one replay independently, so performance scales with CPU cores.
Memory
parse_many_to_parquet writes and discards each replay immediately, keeping memory
usage flat regardless of batch size. parse_many_to_dataframe holds all results in
memory until concatenation — prefer parse_many_to_parquet for large batches.
Parquet dependency
Parquet output requires an optional engine. Install pyarrow (recommended): `pip install pyarrow`.
gem.batch
¶
Bulk replay parsing — process many .dem files in parallel.
Provides three public functions:

- `parse_many` — parse a list/folder of replays, return `list[ParseResult]`.
- `parse_many_to_dataframe` — same, but concatenate all successful results into a `dict[str, DataFrame]` (one row-set per table, with a `match_path` column added for provenance).
- `parse_many_to_parquet` — parse and write each replay into its own subdirectory under `output_dir`, one `.parquet` file per table. Replays are processed and discarded one at a time to keep memory bounded.
All three functions use ProcessPoolExecutor for true parallelism (CPU-bound
work) and display a Rich progress bar by default.
ParseResult
dataclass
¶
Outcome of parsing a single replay.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Absolute path to the `.dem` replay file. | *required* |
| `match` | `ParsedMatch \| None` | Populated `ParsedMatch` when parsing succeeded, `None` otherwise. | *required* |
| `error` | `Exception \| None` | Exception raised during parsing, or `None` on success. | *required* |
Source code in src/gem/batch.py
ok: bool
property
¶
Return True when parsing succeeded.
parse_many(source: str | Path | Sequence[str | Path], *, workers: int | None = None, recursive: bool = False, progress: bool = True, timeout: float | None = None) -> list[ParseResult]
¶
Parse multiple replays in parallel and return a result per replay.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str \| Path \| Sequence[str \| Path]` | Either a directory path (all `.dem` files inside are parsed) or an explicit sequence of replay paths. | *required* |
| `workers` | `int \| None` | Number of worker processes. Defaults to the number of CPU cores. | `None` |
| `recursive` | `bool` | When `source` is a directory, scan subdirectories too. | `False` |
| `progress` | `bool` | Show a Rich progress bar while parsing. | `True` |
| `timeout` | `float \| None` | Per-replay timeout in seconds. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[ParseResult]` | List of `ParseResult` objects, one per input replay. |
Source code in src/gem/batch.py
parse_many_to_dataframe(source: str | Path | Sequence[str | Path], *, workers: int | None = None, recursive: bool = False, progress: bool = True, timeout: float | None = None) -> dict[str, pd.DataFrame]
¶
Parse multiple replays and concatenate results into per-table DataFrames.
Each DataFrame gets a match_path column added so rows can be traced
back to their source replay.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str \| Path \| Sequence[str \| Path]` | Directory path or explicit list of replay paths. | *required* |
| `workers` | `int \| None` | Number of worker processes (default: number of CPU cores). | `None` |
| `recursive` | `bool` | Scan subdirectories when `source` is a directory. | `False` |
| `progress` | `bool` | Show a Rich progress bar while parsing. | `True` |
| `timeout` | `float \| None` | Per-replay timeout in seconds. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict[str, DataFrame]` | Mapping of table name to DataFrame, with the rows of all successfully parsed replays concatenated together. |
Source code in src/gem/batch.py
parse_many_to_parquet(source: str | Path | Sequence[str | Path], output_dir: str | Path, *, workers: int | None = None, recursive: bool = False, progress: bool = True, timeout: float | None = None, index: bool = False) -> list[Path]
¶
Parse multiple replays and write each to its own parquet subdirectory.
Each replay is written and discarded immediately to keep memory usage bounded regardless of batch size. The output layout is:

    output_dir/
        <replay_stem>/
            players.parquet
            combat_log.parquet
            ...
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `str \| Path \| Sequence[str \| Path]` | Directory path or explicit list of replay paths. | *required* |
| `output_dir` | `str \| Path` | Root directory to write parquet subdirectories into. | *required* |
| `workers` | `int \| None` | Number of worker processes (default: number of CPU cores). | `None` |
| `recursive` | `bool` | Scan subdirectories when `source` is a directory. | `False` |
| `progress` | `bool` | Show a Rich progress bar while parsing. | `True` |
| `timeout` | `float \| None` | Per-replay timeout in seconds. | `None` |
| `index` | `bool` | Whether to include the DataFrame index in parquet output. | `False` |

Returns:

| Type | Description |
|---|---|
| `list[Path]` | List of all parquet file paths written. |