Datasets are defined by a DatasetSpec or a YAML file. Adding a normal dataset should not require changes to the core framework.
Minimal YAML Shape
name: daily
source: tushare
source_dataset: daily
category: market
field_mapping:
  ts_code: ts_code
  trade_date: trade_date
required_columns: [asset_id, time]
primary_key: [asset_id, time]
asset_column: ts_code
time_column: trade_date
request_planner: snapshot
normalizer: standard
deduplication: primary_key_last
partition_strategy: year_month
update_mode: upsert
sort_columns: [time, asset_id]
reference: false
Register it:
lake.datasets.add_from_yaml("path/to/dataset.yaml")
Required Fields
Every spec needs:
- name: canonical dataset name.
- source: registered source name.
- source_dataset: provider endpoint or table name.
- category: descriptive group such as market, financial_statement, financial_event, or reference.
- field_mapping: source-to-canonical rename map used by the normalizer.
- required_columns: columns that must exist after normalization.
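The required-field rule is easy to check mechanically before registration. The sketch below is a minimal illustration, assuming the spec has already been loaded into a plain dict; the helper name missing_required_fields is hypothetical and not part of the framework.

```python
# Hypothetical helper for illustration; not part of the framework's API.
REQUIRED_FIELDS = (
    "name",
    "source",
    "source_dataset",
    "category",
    "field_mapping",
    "required_columns",
)


def missing_required_fields(spec: dict) -> list[str]:
    """Return the required spec fields that are absent (or set to None)."""
    return [field for field in REQUIRED_FIELDS if spec.get(field) is None]


spec = {
    "name": "daily",
    "source": "tushare",
    "source_dataset": "daily",
    "category": "market",
    "field_mapping": {"ts_code": "ts_code", "trade_date": "trade_date"},
}
print(missing_required_fields(spec))  # ['required_columns']
```

Running a check like this before calling add_from_yaml gives an early, readable error instead of a failure during normalization.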
Canonical Time Mapping
For non-reference datasets, set:
- asset_column
- time_column
For point-in-time financial datasets, also set:
- period_column
- point_in_time: true
Example:
asset_column: ts_code
time_column: f_ann_date
period_column: end_date
point_in_time: true
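The two rules above can be expressed as a small consistency check. This is a sketch only, assuming the spec is available as a plain dict; time_mapping_problems is a hypothetical helper name, not a framework function.

```python
def time_mapping_problems(spec: dict) -> list[str]:
    """Collect violations of the canonical time mapping rules (hypothetical helper)."""
    problems = []
    # Non-reference datasets need both an asset column and a time column.
    if not spec.get("reference", False):
        for key in ("asset_column", "time_column"):
            if not spec.get(key):
                problems.append(f"non-reference dataset is missing {key}")
    # Point-in-time datasets additionally need a period column.
    if spec.get("point_in_time", False) and not spec.get("period_column"):
        problems.append("point_in_time dataset is missing period_column")
    return problems


incomplete = {
    "asset_column": "ts_code",
    "time_column": "f_ann_date",
    "point_in_time": True,
}
print(time_mapping_problems(incomplete))
# ['point_in_time dataset is missing period_column']
```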
Keys
Use primary_key when records should be unique by a specific set of columns.
Use business_key when records may have revisions or multiple valid versions, but still share a logical business identity.
Do not assume asset_id + time is always unique.
Request Planner
Initial planners:
- snapshot: one request for the dataset or date range.
- by_asset: one request per asset when assets are supplied.
Future planners may include by_trade_date, by_period, by_date_range, paged, or custom registered planners.
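The difference between the two initial planners can be sketched as follows. The function name plan_requests and the request dict shape (start, end, asset) are assumptions made for illustration, as is the fallback to a single request when by_asset receives no assets.

```python
from typing import Optional, Sequence


def plan_requests(
    planner: str,
    start: Optional[str] = None,
    end: Optional[str] = None,
    assets: Optional[Sequence[str]] = None,
) -> list[dict]:
    """Return one parameter dict per request (hypothetical request shape)."""
    base = {"start": start, "end": end}
    if planner == "snapshot":
        # One request covering the whole dataset or date range.
        return [base]
    if planner == "by_asset":
        if not assets:
            # Assumption: with no assets supplied, fall back to a single request.
            return [base]
        # One request per supplied asset.
        return [{**base, "asset": asset} for asset in assets]
    raise ValueError(f"unknown planner: {planner}")


requests = plan_requests("by_asset", "20240101", "20240131", ["000001.SZ", "600000.SH"])
print(len(requests))  # 2
```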
Normalizer
The standard normalizer maps the configured fields and derives the canonical columns:
- asset_id
- time
- period, when configured
- source
- source_dataset
Use a custom normalizer only when a source response needs specialized parsing.
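To make the derivation concrete, here is a minimal sketch of what a standard-style normalizer might do for a single non-reference row. It assumes that asset_column, time_column, and period_column name columns as they appear after renaming, and it copies values without any type conversion; normalize_row is a hypothetical name, not the framework's implementation.

```python
def normalize_row(row: dict, spec: dict) -> dict:
    """Rename source fields and derive canonical columns (illustrative sketch)."""
    mapping = spec.get("field_mapping", {})
    # Rename mapped fields; unmapped fields keep their source names.
    out = {mapping.get(key, key): value for key, value in row.items()}
    # Assumption: these spec entries refer to column names after renaming.
    out["asset_id"] = out[spec["asset_column"]]
    out["time"] = out[spec["time_column"]]
    if spec.get("period_column"):
        out["period"] = out[spec["period_column"]]
    out["source"] = spec["source"]
    out["source_dataset"] = spec["source_dataset"]
    return out


daily_spec = {
    "source": "tushare",
    "source_dataset": "daily",
    "field_mapping": {"ts_code": "ts_code", "trade_date": "trade_date"},
    "asset_column": "ts_code",
    "time_column": "trade_date",
}
row = {"ts_code": "000001.SZ", "trade_date": "20240102", "close": 10.5}
normalized = normalize_row(row, daily_spec)
print(normalized["asset_id"], normalized["time"])  # 000001.SZ 20240102
```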
Deduplication
Initial strategies:
- none
- exact_record_hash
- primary_key_last
Choose primary_key_last for daily market data where the latest record should replace the old row for the same key.
Choose exact_record_hash for financial records where multiple versions can be valid and only exact duplicates should be dropped.
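The following sketch contrasts the two strategies on the same input. The function names and the use of a SHA-256 digest over a JSON serialization are assumptions chosen for illustration; the framework's own hashing may differ.

```python
import hashlib
import json


def dedupe_primary_key_last(rows: list[dict], primary_key: list[str]) -> list[dict]:
    """Keep only the last row seen for each primary key value."""
    latest = {}
    for row in rows:
        key = tuple(row[col] for col in primary_key)
        latest[key] = row  # a later row replaces the earlier one
    return list(latest.values())


def dedupe_exact_record_hash(rows: list[dict]) -> list[dict]:
    """Drop only rows that are identical in every field."""
    seen = set()
    kept = []
    for row in rows:
        payload = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept


rows = [
    {"asset_id": "000001.SZ", "time": "20240102", "close": 10.0},
    {"asset_id": "000001.SZ", "time": "20240102", "close": 10.0},  # exact duplicate
    {"asset_id": "000001.SZ", "time": "20240102", "close": 10.2},  # revision
]
print(len(dedupe_primary_key_last(rows, ["asset_id", "time"])))  # 1
print(len(dedupe_exact_record_hash(rows)))  # 2
```

With primary_key_last the revision overwrites the earlier value, while exact_record_hash keeps both distinct versions and removes only the verbatim repeat.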
Partition Strategy
Initial strategies:
- single_file: reference or small datasets.
- year_month: dense market time series.
- year_bucket: financial statements and sparse event records.
For year_bucket, configure bucket_count:
partition_strategy: year_bucket
partition_options:
  bucket_count: 32
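As a rough illustration of how the three strategies could map a record to a partition, consider the sketch below. The path layout, the function name partition_key, and the choice of a CRC32 hash of asset_id modulo bucket_count are all assumptions; the document does not specify how buckets are assigned.

```python
import zlib
from datetime import date


def partition_key(strategy: str, asset_id: str, when: date, bucket_count: int = 32) -> str:
    """Return a partition path fragment for one record (hypothetical layout)."""
    if strategy == "single_file":
        # Everything lives in one file.
        return "data"
    if strategy == "year_month":
        return f"year={when.year}/month={when.month:02d}"
    if strategy == "year_bucket":
        # Assumption: a stable hash of asset_id selects the bucket.
        bucket = zlib.crc32(asset_id.encode("utf-8")) % bucket_count
        return f"year={when.year}/bucket={bucket:02d}"
    raise ValueError(f"unknown partition strategy: {strategy}")


print(partition_key("year_month", "000001.SZ", date(2024, 3, 15)))  # year=2024/month=03
```

A stable hash such as CRC32 is used here instead of Python's built-in hash(), which is randomized per process for strings and would scatter the same asset across buckets between runs.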
Reference Datasets
Reference datasets set:
reference: true
partition_strategy: single_file
They are read with lake.query.reference(...) and are exempt from the single-value panel contract.
Validation Checklist
Before adding a dataset, confirm:
- canonical asset_id and time can be derived for non-reference data
- period is present for point-in-time financial data
- required source columns are preserved
- the partition strategy matches access patterns
- deduplication does not discard meaningful revisions
- sort order supports common scans