BagelQuant Data Documentation

BagelQuant Data is a Polars-native, source-agnostic local data lake framework for quantitative research. It stores canonical research data in Parquet and operational state in SQLite, and it exposes a stable Python API centered on DataLake.open(...).

The framework is intentionally built around two ideas:

Source adapters fetch data, but they do not define the core lake architecture.
Dataset specifications define behavior, so normal datasets can be added without changing storage, query, finance, or management code.
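
The split between the two ideas can be pictured with a small standard-library sketch. It is not BagelQuant Data's adapter interface: SourceAdapter, MiniSpec, InMemoryAdapter, and load are hypothetical names invented for illustration, and the assumption is only that an adapter returns raw rows while a spec tells generic code how to shape them.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class SourceAdapter(Protocol):
    """Hypothetical adapter interface: it only knows how to fetch raw rows."""

    def fetch(self, source_dataset: str) -> Iterable[dict]: ...


@dataclass(frozen=True)
class MiniSpec:
    """Hypothetical, reduced stand-in for a dataset specification."""

    name: str
    source_dataset: str
    asset_column: str
    time_column: str


class InMemoryAdapter:
    """Toy adapter backed by a dict; a real one would call a vendor API."""

    def __init__(self, tables: dict[str, list[dict]]) -> None:
        self._tables = tables

    def fetch(self, source_dataset: str) -> Iterable[dict]:
        return self._tables[source_dataset]


def load(adapter: SourceAdapter, spec: MiniSpec) -> list[dict]:
    """Generic step: the spec, not the adapter, decides the canonical column names."""
    rows = []
    for raw in adapter.fetch(spec.source_dataset):
        # Copy every column except the two that get canonical names.
        row = {k: v for k, v in raw.items() if k not in (spec.asset_column, spec.time_column)}
        row["asset_id"] = raw[spec.asset_column]
        row["time"] = raw[spec.time_column]
        rows.append(row)
    return rows


adapter = InMemoryAdapter(
    {"daily": [{"ts_code": "000001.SZ", "trade_date": "20250102", "close": 11.25}]}
)
spec = MiniSpec(name="daily", source_dataset="daily", asset_column="ts_code", time_column="trade_date")
print(load(adapter, spec))
```

In this sketch, swapping InMemoryAdapter for another adapter leaves load untouched, and adding a dataset means writing another MiniSpec rather than new pipeline code.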

Documentation Map

Architecture: system layers, storage zones, metadata, canonical fields, and point-in-time semantics.
Configuration: lake root, credentials, package extras, and local environment setup.
Management API: source registration, dataset registration, status, inspection, and destructive operations.
Extraction API: raw records, single-field long panels, multiple fields, reference data, and observation grids.
Financial API: point-in-time alignment and generic composable financial transforms.
Incremental Updates: update flow, staging, canonical commits, manifests, and current V1 limitations.
Tushare Source: bundled Tushare adapter, credentials, dataset specs, and canonical time mappings.
Adding A Source: how to implement a new source adapter.
Adding A Dataset: how to write dataset YAML and choose strategies.
Implementation Plan: design goals and roadmap notes.

First Example

import polars as pl

from bagelquant_data import DataLake, DatasetSpec

lake = DataLake.open("data")

spec = DatasetSpec(
    name="daily",
    source="custom",
    source_dataset="daily",
    category="market",
    field_mapping={"ts_code": "ts_code", "trade_date": "trade_date"},
    required_columns=("asset_id", "time"),
    primary_key=("asset_id", "time"),
    asset_column="ts_code",
    time_column="trade_date",
    partition_strategy="year_month",
    deduplication="primary_key_last",
    sort_columns=("time", "asset_id"),
)

# Ingest a frame that still uses the source's own column names (ts_code, trade_date).
lake.ingest_frame(
    spec,
    pl.DataFrame(
        {
            "trade_date": ["20250102", "20250103"],
            "ts_code": ["000001.SZ", "000001.SZ"],
            "open": [11.20, 11.31],
            "close": [11.25, 11.37],
        }
    ),
)

# Query a single field as a long panel with columns time, asset_id, and close.
close = lake.query.field("daily", "close", source="custom", collect=True)
print(close)

The output is a long panel with exactly three columns:

time        asset_id    close
2025-01-02  000001.SZ   11.25
2025-01-03  000001.SZ   11.37
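
The spec fields in the example hint at how source rows become this canonical panel. The following standard-library sketch is one reading of those fields, not the framework's implementation. It assumes that time_column and asset_column are renamed to time and asset_id, that YYYYMMDD strings are parsed into dates, that deduplication="primary_key_last" keeps the last row seen per primary key, and that partition_strategy="year_month" groups rows by calendar month. The helpers canonicalize and partition_key are hypothetical, and the third input row is made up to show deduplication.

```python
from datetime import date, datetime


def canonicalize(rows, asset_column, time_column):
    """Rename source columns to canonical names, dedupe, and sort.

    Assumed semantics (not taken from the library):
    - `time_column` holds YYYYMMDD strings and becomes a `time` date.
    - `asset_column` becomes `asset_id`.
    - Duplicate (asset_id, time) keys keep the last row seen.
    """
    latest = {}
    for row in rows:
        out = {k: v for k, v in row.items() if k not in (asset_column, time_column)}
        out["asset_id"] = row[asset_column]
        out["time"] = datetime.strptime(row[time_column], "%Y%m%d").date()
        # Later rows overwrite earlier ones that share the same primary key.
        latest[(out["asset_id"], out["time"])] = out
    # Mirrors sort_columns=("time", "asset_id") from the spec.
    return sorted(latest.values(), key=lambda r: (r["time"], r["asset_id"]))


def partition_key(day: date) -> str:
    """Hypothetical year_month partition label built from a date."""
    return f"{day.year:04d}-{day.month:02d}"


rows = [
    {"trade_date": "20250102", "ts_code": "000001.SZ", "open": 11.20, "close": 11.25},
    {"trade_date": "20250103", "ts_code": "000001.SZ", "open": 11.31, "close": 11.37},
    # Made-up re-delivery of 2025-01-02; under "last wins" it replaces the first row.
    {"trade_date": "20250102", "ts_code": "000001.SZ", "open": 11.20, "close": 11.26},
]

panel = canonicalize(rows, asset_column="ts_code", time_column="trade_date")
for r in panel:
    print(r["time"], r["asset_id"], r["close"], partition_key(r["time"]))
```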

Design Boundaries

BagelQuant Data does not preserve old lake layouts or old public APIs; the current framework is a clean redesign of both the API and the storage layout. It also does not provide hardcoded financial indicators such as eps_ttm() or roe_ttm(). Instead, it provides generic building blocks (trailing aggregation, ratio calculation, stock averaging, and point-in-time alignment) from which such indicators can be composed.
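
To illustrate what composing generic blocks can look like, here is a standard-library sketch that builds a trailing-twelve-month EPS from a rolling sum, a ratio, and a point-in-time lookup. The function names trailing_sum, ratio, and as_of are hypothetical and are not BagelQuant Data's Financial API, and the quarterly figures are made up for the example.

```python
from bisect import bisect_right
from datetime import date


def trailing_sum(values, window):
    """Rolling sum over the last `window` observations; None until the window fills."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]))
    return out


def ratio(numerator, denominator):
    """Element-wise ratio that propagates missing values and skips zero denominators."""
    return [
        None if n is None or d is None or d == 0 else n / d
        for n, d in zip(numerator, denominator)
    ]


def as_of(announce_dates, values, query_date):
    """Point-in-time lookup: the latest value announced on or before `query_date`.

    `announce_dates` must be sorted in ascending order.
    """
    i = bisect_right(announce_dates, query_date)
    return values[i - 1] if i > 0 else None


# Made-up quarterly figures for a single asset (illustration only).
announce_dates = [date(2024, 4, 25), date(2024, 8, 28), date(2024, 10, 30), date(2025, 3, 20)]
net_income = [10.0, 12.0, 11.0, 15.0]
shares = [100.0, 100.0, 100.0, 120.0]

# A trailing-twelve-month EPS composed from the generic pieces above.
eps_ttm = ratio(trailing_sum(net_income, window=4), shares)

print(eps_ttm)
print(as_of(announce_dates, eps_ttm, date(2025, 1, 2)))   # before the fourth announcement
print(as_of(announce_dates, eps_ttm, date(2025, 3, 20)))  # on the fourth announcement date
```

The as_of step is what keeps the result free of look-ahead: a query dated before an announcement only sees values that were public at that time.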

Development Checks

Run the local test suite and static type checks with:

uv run pytest
uv run pyright