The primary configuration is the lake root path passed to DataLake.open(...).

from bagelquant_data import DataLake

lake = DataLake.open("data")

Opening the lake creates the storage zones under that root if they do not already exist.

Default Local Layout

data/
    lake/
    staging/
    rejected/
    metadata/lake.db
    tmp/
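
As a quick sanity check, the layout above can be verified with the standard library alone. This is a minimal sketch: missing_zones is a hypothetical helper, not part of bagelquant_data, and the zone names are copied from the layout shown above.

```python
from __future__ import annotations

from pathlib import Path

# Zone directories copied from the default layout above, relative to the lake root.
EXPECTED_ZONES = ("lake", "staging", "rejected", "metadata", "tmp")


def missing_zones(root: str | Path) -> list[str]:
    """Return the expected zone directories that do not exist under root."""
    root_path = Path(root)
    return [zone for zone in EXPECTED_ZONES if not (root_path / zone).is_dir()]
```

Since opening the lake creates the zones, missing_zones("data") should return an empty list after DataLake.open("data") has run. A non-empty result suggests the root path points somewhere other than the intended lake.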

The repository also includes a sample configuration file:

config/bagelquant-data.toml

At this stage, the Python API is the source of truth. The TOML file documents the intended local defaults and may be used by future CLI or application integrations.

Dependencies

Core dependencies:

  • polars
  • pyarrow

Optional Tushare dependencies:

  • pandas
  • tushare

Install or sync with the project tooling:

uv sync
uv sync --extra tushare
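
To confirm that the optional Tushare dependencies are importable in the current environment, a small standard-library check can help. This is a sketch only: missing_optional is a hypothetical helper, and the package names are the ones listed above.

```python
from importlib.util import find_spec

# Optional packages listed above for the Tushare extra.
TUSHARE_EXTRA = ("pandas", "tushare")


def missing_optional(packages=TUSHARE_EXTRA):
    """Return the names of top-level packages that cannot be found for import."""
    return [name for name in packages if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_optional()
    if missing:
        print("Missing optional packages:", ", ".join(missing))
        print("Try: uv sync --extra tushare")
    else:
        print("Tushare extra is available.")
```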

Credentials

Credentials are configured at runtime. They should not be written into dataset YAML, Parquet files, committed configuration files, or documentation examples.

For Tushare, either pass a token:

from bagelquant_data.sources.tushare import TushareSource

source = TushareSource(token="...")

Or configure the token after registering the source. This saves the token in the local lake metadata DB, so future runs can register TushareSource() and run updates without passing the token again:

lake.sources.register(TushareSource())
lake.sources.configure_tushare(token="...")

Or set an environment variable:

export TUSHARE_TOKEN="..."
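
A script that accepts a token from more than one place may want to report where the token came from before registering the source. The sketch below is illustrative only: resolve_token is a hypothetical helper, and the precedence it uses (explicit argument first, then the TUSHARE_TOKEN environment variable) is an assumption, not documented library behavior.

```python
import os


def resolve_token(explicit=None, env=None):
    """Return (token, origin), or (None, None) when no token is available.

    The precedence here is an assumption made for illustration:
    an explicit argument wins over the TUSHARE_TOKEN environment variable.
    """
    env = os.environ if env is None else env
    if explicit:
        return explicit, "argument"
    value = env.get("TUSHARE_TOKEN")
    if value:
        return value, "environment"
    return None, None
```

When the result is (None, None), the script can fall back to registering TushareSource() without a token and rely on any token previously saved in the lake metadata DB.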

Saved source options are local to the lake root, stored under data/metadata/lake.db, and redacted from source listings.
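
The same redaction idea is worth applying in your own scripts whenever source options are logged or printed. The following is a minimal sketch of that pattern, not the library's implementation: redact_options, SECRET_KEYS, and the mask string are all hypothetical names chosen for illustration.

```python
# Option keys treated as secrets in this sketch (an assumption, not a library list).
SECRET_KEYS = frozenset({"token"})


def redact_options(options, secret_keys=SECRET_KEYS, mask="***"):
    """Return a copy of options with non-empty secret values replaced by mask."""
    return {
        key: (mask if key in secret_keys and value else value)
        for key, value in options.items()
    }
```

The function returns a new dict and leaves the input untouched, so the real options can still be passed to the source while only the redacted copy is logged.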

Local Development

Run tests:

uv run pytest

Run static checking:

uv run pyright

Run the thin CLI:

uv run bagelquant-data --root data status
uv run bagelquant-data --root data source-list
uv run bagelquant-data --root data dataset-list --source tushare

Reusable setup and update scripts are documented in scripts/README.md.

What Not To Configure

Do not configure old lake layouts, legacy migration paths, or backward-compatibility switches. The framework intentionally targets only the new canonical design.