How to Build a Point-In-Time (PIT) Equity Database
A comprehensive, production-grade guide for constructing a fully leakage-free, survivorship-free, revision-aware PIT database for equity research and quantitative trading.
Why PIT Matters
Every equity dataset suffers from three major biases:
- Lookahead bias: future information appears too early.
- Survivorship bias: dead/delisted firms vanish.
- Revision bias: datasets overwrite historical values with corrected ones.
A Point‑In‑Time (PIT) database solves all three by preserving exactly what was known on each historical date.
The Three Time Dimensions of PIT
Every data point has three timestamps:
Event Date
When economic activity occurred (e.g., fiscal period end).
Announcement / Filing Date
When information became public (rdq, report_dt).
PIT Load Date (pdate)
When the vendor first loaded the data into the database.
A true PIT signal must satisfy:
pdate ≤ t and announce_date ≤ t
Compustat PIT Essentials
Compustat-specific fields for PIT reconstruction:
pdate: first appearancepitem: version identifierrdq: earnings announcement datereport_dt: SEC filing dateindfmt,datafmt,popsrc,consol: statement type filtersdatadate: fiscal period end (not a PIT date)
Compustat overwrites history by default, so pdate is critical.
CRSP PIT Considerations
CRSP is partially PIT:
- Prices are PIT (never revised)
- Shares outstanding can be revised
- Delisting returns must be included to avoid survivorship bias
- Use the CCM link table properly:
linktype ∈ {LU, LC}linkprim ∈ {P, C}
Architecture of a PIT Equity Database
Recommended structure:
RAW LAYER → STAGING LAYER → PIT LAYER
(Vendor files) (Clean + linked) (Daily valid data)
- Raw layer: immutable vendor dumps\
- Staging layer: cleaning, linking, normalization\
- PIT layer: effective intervals with daily validity
PIT Table Schema
A PIT table must contain:
- permno\
- gvkey\
- item value\
- event_date\
- announce_date\
- pdate\
- start_date = max(announce_date, pdate)\
- end_date = next_start_date - 1
Query rule:
SELECT *
FROM pit_table
WHERE start_date <= t AND t <= end_date
Constructing Effective Intervals
Step 1 – Sort by announce date
df = df.sort_values(["permno", "announce"])
Step 2 – Next announce & next pdate
df["next_announce"] = df.groupby("permno")["announce"].shift(-1)
df["next_pdate"] = df.groupby("permno")["pdate"].shift(-1)
Step 3 – Compute start_date
df["start_date"] = df[["announce", "pdate"]].max(axis=1)
Step 4 – Compute end_date
df["next_start"] = df.groupby("permno")["start_date"].shift(-1)
df["end_date"] = df["next_start"] - pd.Timedelta(days=1)
df["end_date"] = df["end_date"].fillna(pd.Timestamp.today())
Daily PIT Joins Using merge_asof
daily = daily.sort_values(["permno", "date"])
joined = pd.merge_asof(
daily,
df.sort_values(["permno", "start_date"]),
by="permno",
left_on="date",
right_on="start_date",
direction="backward"
)
joined = joined[joined["date"] <= joined["end_date"]]
Handling Vendor Edge Cases
Missing announcement dates
Fallback order: 1. rdq 2. report_dt 3. Estimate from next quarter
Preliminary vs. final filings
Prefer earliest complete filing.
Restatements
Use earliest pdate per pitem.
Multi‑class shares, ADRs
Filter CCM links carefully.
Performance Optimization for Large PIT Systems
- Use Parquet for columnar storage\
- Use Polars for 5–10× faster interval joins\
- Partition by permno / year\
- Cache PIT tables\
- Minimize repeated interval computations
Unit Tests for PIT Validation
No backward leakage
assert (joined["date"] >= joined["announce"]).all()
assert (joined["date"] >= joined["pdate"]).all()
Interval integrity
assert (df["start_date"] <= df["end_date"]).all()
No gaps / overlaps
df["gap"] = df.groupby("permno")["start_date"].diff() > pd.Timedelta(days=1)
assert not df["gap"].any()
Example Timeline
Quarter Fiscal End Filed Effective Interval
--------- ------------ ------------ --------------------
Q4 2023 2023‑12‑31 2024‑02‑02 Feb 02 → May 02
Q1 2024 2024‑03‑31 2024‑05‑03 May 03 → Aug 01
Q2 2024 2024‑06‑30 2024‑08‑02 Aug 02 → Nov 01
Daily alignment becomes realistic and leak‑free.
Summary Checklist
✔ Track three timestamps: event, announce, pdate
✔ Compute start_date = max(announce, pdate)
✔ Build effective intervals per permno
✔ Join via backward merge_asof
✔ Include CRSP delisting returns
✔ Validate intervals & PIT logic
✔ Expect IC & Sharpe to drop
✔ Cache PIT tables for speed