BagelQuant Data is a Polars-native, source-agnostic local data lake framework for quantitative research. It stores canonical research data in Parquet, stores operational state in SQLite, and exposes a stable Python API centered on DataLake.open(...).
The framework is intentionally built around two ideas:
- Source adapters fetch data, but they do not define the core lake architecture.
- Dataset specifications define behavior, so normal datasets can be added without changing storage, query, finance, or management code.
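As a rough illustration of this split, the sketch below uses plain-Python stand-ins rather than the framework's real classes. The names `SourceAdapter`, `MiniSpec`, `InMemoryAdapter`, and `ingest` are hypothetical and exist only for this example; the point is that the adapter only fetches rows, while the spec alone decides how they are canonicalized.

```python
from dataclasses import dataclass
from typing import Protocol


class SourceAdapter(Protocol):
    """Hypothetical adapter interface: fetch raw rows, nothing more."""

    def fetch(self, source_dataset: str) -> list[dict]: ...


@dataclass(frozen=True)
class MiniSpec:
    """Hypothetical, heavily reduced stand-in for a dataset specification."""

    name: str
    source_dataset: str
    asset_column: str
    time_column: str


class InMemoryAdapter:
    """Toy adapter that serves rows from a dict instead of a remote API."""

    def __init__(self, tables: dict[str, list[dict]]) -> None:
        self._tables = tables

    def fetch(self, source_dataset: str) -> list[dict]:
        return list(self._tables[source_dataset])


def ingest(adapter: SourceAdapter, spec: MiniSpec) -> list[dict]:
    """Generic ingest path: only the spec decides how columns are renamed."""
    renamed = (spec.asset_column, spec.time_column)
    canonical = []
    for row in adapter.fetch(spec.source_dataset):
        out = {k: v for k, v in row.items() if k not in renamed}
        out["asset_id"] = row[spec.asset_column]
        out["time"] = row[spec.time_column]
        canonical.append(out)
    return canonical


adapter = InMemoryAdapter(
    {"daily": [{"ts_code": "000001.SZ", "trade_date": "20250102", "close": 11.25}]}
)
spec = MiniSpec(
    name="daily",
    source_dataset="daily",
    asset_column="ts_code",
    time_column="trade_date",
)
rows = ingest(adapter, spec)
print(rows)
```

In this toy version, supporting another dataset means writing another `MiniSpec`, and supporting another source means writing another adapter; `ingest` itself does not change.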
Documentation Map
- Architecture: system layers, storage zones, metadata, canonical fields, and point-in-time semantics.
- Configuration: lake root, credentials, package extras, and local environment setup.
- Management API: source registration, dataset registration, status, inspection, and destructive operations.
- Extraction API: raw records, single-field long panels, multiple fields, reference data, and observation grids.
- Financial API: point-in-time alignment and generic composable financial transforms.
- Incremental Updates: update flow, staging, canonical commits, manifests, and current V1 limitations.
- Tushare Source: bundled Tushare adapter, credentials, dataset specs, and canonical time mappings.
- Adding A Source: how to implement a new source adapter.
- Adding A Dataset: how to write dataset YAML and choose strategies.
- Implementation Plan: design goals and roadmap notes.
First Example
import polars as pl

from bagelquant_data import DataLake, DatasetSpec

# Open (or create) a lake rooted at the local "data" directory.
lake = DataLake.open("data")

# Describe the dataset: source columns, canonical keys, and storage strategies.
spec = DatasetSpec(
    name="daily",
    source="custom",
    source_dataset="daily",
    category="market",
    field_mapping={"ts_code": "ts_code", "trade_date": "trade_date"},
    required_columns=("asset_id", "time"),
    primary_key=("asset_id", "time"),
    asset_column="ts_code",
    time_column="trade_date",
    partition_strategy="year_month",
    deduplication="primary_key_last",
    sort_columns=("time", "asset_id"),
)

# Ingest a small in-memory frame under that spec.
lake.ingest_frame(
    spec,
    pl.DataFrame(
        {
            "trade_date": ["20250102", "20250103"],
            "ts_code": ["000001.SZ", "000001.SZ"],
            "open": [11.20, 11.31],
            "close": [11.25, 11.37],
        }
    ),
)

# Read one field back as a long panel.
close = lake.query.field("daily", "close", source="custom", collect=True)
print(close)
The output is a long panel with exactly three columns:
time        asset_id   close
2025-01-02  000001.SZ  11.25
2025-01-03  000001.SZ  11.37
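The spec in this example names a `primary_key_last` deduplication strategy and a `year_month` partition strategy. The sketch below shows one plausible reading of those names in plain Python: keep the last row seen for each primary key, then group rows by calendar month. This is an assumption based on the strategy names, not the framework's actual implementation; the helper functions and the sample rows (including the February row) are made up for illustration.

```python
from datetime import date


def dedupe_primary_key_last(rows, primary_key):
    """Keep only the last row seen for each primary-key tuple."""
    latest = {}
    for row in rows:
        # Later rows overwrite earlier ones that share the same key.
        latest[tuple(row[k] for k in primary_key)] = row
    return list(latest.values())


def year_month_partition(day):
    """Map a date to a hypothetical partition label such as '2025-01'."""
    return f"{day.year:04d}-{day.month:02d}"


rows = [
    {"asset_id": "000001.SZ", "time": date(2025, 1, 2), "close": 11.20},
    {"asset_id": "000001.SZ", "time": date(2025, 1, 2), "close": 11.25},  # re-delivered row
    {"asset_id": "000001.SZ", "time": date(2025, 2, 3), "close": 11.37},
]

# Deduplicate on the primary key, then sort like sort_columns=("time", "asset_id").
deduped = dedupe_primary_key_last(rows, ("asset_id", "time"))
deduped.sort(key=lambda r: (r["time"], r["asset_id"]))

# Group the surviving rows by their year-month partition label.
partitions = {}
for row in deduped:
    partitions.setdefault(year_month_partition(row["time"]), []).append(row)

for label, members in partitions.items():
    print(label, [m["close"] for m in members])
```

Under this reading, re-ingesting an overlapping date range replaces earlier rows with the newer ones instead of duplicating them.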
Design Boundaries
BagelQuant Data does not preserve compatibility with old lake layouts or old public APIs; the current framework is a clean-slate API and storage design. It also does not provide hardcoded financial indicators such as eps_ttm() or roe_ttm(). Instead, it provides generic building blocks such as trailing aggregation, ratio calculation, stock averaging, and point-in-time alignment, which users compose into the indicators they need.
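To make the idea of composable building blocks concrete, here is a minimal standard-library sketch of how a trailing-twelve-month EPS could be assembled from a trailing sum, a ratio, and a point-in-time lookup. The functions `trailing_sum`, `ratio`, and `as_of` are hypothetical names written for this example and are not the Financial API's real functions; the quarterly figures are made up.

```python
from bisect import bisect_right


def trailing_sum(values, window):
    """Sum of the last `window` values at each position; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]))
    return out


def ratio(numerators, denominators):
    """Element-wise ratio that propagates None and avoids division by zero."""
    return [
        None if n is None or d is None or d == 0 else n / d
        for n, d in zip(numerators, denominators)
    ]


def as_of(available_times, values, query_time):
    """Point-in-time lookup: latest value whose availability time is <= query_time."""
    idx = bisect_right(available_times, query_time)
    return values[idx - 1] if idx else None


# Made-up quarterly figures for one asset, keyed by announcement date (ISO strings sort correctly).
announce_dates = ["2024-04-30", "2024-08-30", "2024-10-30", "2025-04-28"]
net_income = [10.0, 12.0, 11.0, 15.0]
shares = [100.0, 100.0, 100.0, 100.0]

ttm_income = trailing_sum(net_income, window=4)  # [None, None, None, 48.0]
eps_ttm = ratio(ttm_income, shares)              # [None, None, None, 0.48]

# Only information announced on or before the query date is visible.
print(as_of(announce_dates, eps_ttm, "2025-01-02"))  # None: four quarters not yet available
print(as_of(announce_dates, eps_ttm, "2025-05-06"))  # 0.48
```

The same three pieces could be recombined for other indicators, for example a trailing sum of net income divided by an averaged equity balance, without adding indicator-specific functions.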
Development Checks
Run the local verification suite with:
uv run pytest
uv run pyright