BagelQuant Data is organized as a source-agnostic framework. Tushare is the first bundled source, but the core modules do not import Tushare-specific code.
Layer Overview
The package is split into clear layers:
core: dataset specifications, source protocols, registries, request context, normalization contracts, partitioning, deduplication, validation, hashing, and exceptions.sources: source adapters.sources/tushareis the first implementation.storage: local filesystem layout, Parquet writes, atomic replacement, SQLite metadata, staging, rejected records, and manifests.pipeline: ingestion, update orchestration, validation, and commit logic.query: raw canonical records, single-field panels, reference data, records inspection, and observation grids.finance: generic point-in-time and financial transformations.management: publicDataLakefacade and managers for sources, datasets, status, and updates.cli: a thin command-line wrapper around the Python API.
Storage Zones
Opening a lake creates the following layout:
data/
lake/
staging/
rejected/
metadata/
lake.db
tmp/
data/lake contains validated canonical Parquet files. Only canonical lake data is exposed through lake.query and lake.finance.
data/staging contains temporary source responses written during ingestion. Staging files are allowed to be small because they are not the analytical storage format. Successful commits remove staging fragments for the run.
data/rejected contains records that cannot safely enter canonical storage. Examples include missing canonical identifiers, invalid dates, or malformed source responses.
data/metadata/lake.db is SQLite operational state. It stores registered sources, registered datasets, ingestion runs, partition manifests, rejected summaries, asset state, and partition locks. WAL mode is enabled by the metadata store.
data/tmp is reserved for future build fragments, compaction work, validation, and partition replacement tasks.
Canonical Records
Canonical Parquet records are row-oriented. They may contain many fields from the source response, plus canonical fields added by normalization.
All non-reference datasets must have:
asset_id: canonical asset identifier.time: the observation time or the information availability time.
Point-in-time financial datasets also use:
period: the economic or accounting period represented by the record.
Operational fields may include:
sourcesource_datasetingested_atrecord_hash
The original source columns are preserved when possible. For example, financial statement records may keep ann_date, f_ann_date, and end_date while also exposing canonical time and period.
Public Output Contract
Canonical storage is not the same as public research output.
lake.query.raw(...) returns canonical row-oriented records. This is useful for inspection, debugging, and advanced workflows.
lake.query.field(...) returns one research field at a time as a long panel:
time | asset_id | requested_value_column
The API does not pivot into wide tables. A multi-field request returns a dictionary of independent long panels:
ohlcv = lake.query.fields(
"daily",
["open", "high", "low", "close", "vol"],
source="tushare",
)
Each returned frame has exactly three columns.
Point-In-Time Semantics
time and period are intentionally separate.
time is when information became available to a researcher.
period is the economic or accounting period represented by the record.
For daily market data, time is usually the trade date. For financial statements, time is usually the announcement or final announcement date, while period is the statement end date.
The finance API must never expose a financial record at an observation date earlier than its canonical time.
Partitioning
Partitioning is dataset-driven. Initial built-in strategies are:
single_file: one canonical file for the dataset.year_month:year=YYYY/month=MM/data.parquet.year_bucket:year=YYYY/bucket=BB/data.parquet, where bucket is a stable hash ofasset_id.
The stable bucket algorithm uses Blake2b rather than Python’s built-in hash(), because Python hash values are intentionally process-randomized.
Metadata Manifest
Every canonical partition write updates SQLite manifest rows with:
- source
- dataset
- partition path
- partition values
- row count
- file size
- minimum and maximum
time - content hash
- schema hash
- update time
Normal status calls use the manifest and avoid rescanning every Parquet file.
Atomic Writes
Canonical Parquet replacement follows this pattern:
- Write a temporary file on the same filesystem.
- Read it back with Polars.
- Validate the row count.
- Atomically replace the destination path.
- Update the partition manifest.
This prevents incomplete files from becoming visible through the query API.