Concepts

DataSource

DataSource isolates provider access behind read, exists, and describe. Adapters live in bagelquant_data.datasource and consume DataRequest objects so callers do not depend on provider-specific client APIs.

Loader

Loader coordinates requests and returns standardized LoadedDataset objects. When configured with a lake, it reads local lake snapshots first and only hits a provider for bootstrap or explicit refresh.

Data Lake Manager

DataLakeManager owns add, edit, delete, list, and manual provider updates for the local lake.

Each source’s first configured table is its universe-like reference table. For Tushare, this is stock_basic, refreshed from listed, delisted, and paused stocks to avoid survivorship bias.

Transform

Transforms are stateless DataFrame operations that can be chained with Transform.

Metadata

Metadata exists independently of data. Contracts describe dataset identity, schema, freshness, ownership, version, and lineage.

Data Lake

Lake storage is separated by data source, table, year, and month. Writes create immutable snapshots and update latest pointers at the table and partition level. Every stored table has a date index plus create_time and delete_flag columns. The lake also maintains source-local asset and field id tables. Reference tables that are not panel-like, such as stock_basic, keep their ordinary row index while still receiving lifecycle columns.

Use LocalDataLake.read for direct table reads, read_panel_field for date-by-asset panels, fields for field catalogs, and asset_ids for source asset catalogs.

Cache

Cache interfaces are optional and should not change dataset identity or reproducibility guarantees.