bagelquant-data is a Polars-native, source-agnostic data lake framework for quantitative research.

  • Polars is the dataframe engine.
  • Parquet is the canonical analytical storage format.
  • SQLite stores mutable metadata, manifests, run state, and source/dataset registration.
  • Tushare is implemented as the first source adapter under bagelquant_data.sources.tushare.
  • Research extraction for non-reference data returns one field at a time, in long format with columns time | asset_id | value.
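
A minimal example that registers a dataset spec, ingests a one-row frame, and reads one field back:

import polars as pl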

from bagelquant_data import DataLake, DatasetSpec

lake = DataLake.open("data")
spec = DatasetSpec(
    name="daily",
    source="custom",
    source_dataset="daily",
    category="market",
    field_mapping={"ts_code": "ts_code", "trade_date": "trade_date"},
    required_columns=("asset_id", "time"),
    primary_key=("asset_id", "time"),
    asset_column="ts_code",
    time_column="trade_date",
    partition_strategy="year_month",
    deduplication="primary_key_last",
    sort_columns=("time", "asset_id"),
)
lake.ingest_frame(
    spec,
    pl.DataFrame(
        {
            "trade_date": ["2024-01-02"],
            "ts_code": ["000001.SZ"],
            "close": [100.0],
        }
    ),
)

close = lake.query.field("daily", "close", source="custom", collect=True)
print(close)  # time, asset_id, close

Documentation is available in two languages:

  • English: docs/en/index.md
  • Chinese: docs/cn/index.md

Development

Run the test suite with uv:

uv run pytest