# FinLab Data Flow In-Depth

This document provides a detailed explanation of FinLab's data architecture, data sources, update frequency, caching mechanism, and query best practices.
## Data Architecture Overview

```mermaid
graph TB
    A[Data sources] --> B[GCS Bucket]
    B --> C[BigQuery]
    C --> D[finlab.data.get]
    D --> E{Local cache?}
    E -->|Hit| F[Return cached data]
    E -->|Miss| G[Download from BigQuery]
    G --> H[Save to cache]
    H --> F
    F --> I[FinlabDataFrame]
    I --> J[Strategy development]
    style A fill:#e1f5ff
    style D fill:#fff4e1
    style F fill:#e8f5e9
```
## Data Sources

### Taiwan Stock Data
| Category | Data Source | Update Frequency | Data Range |
|---|---|---|---|
| Price | TWSE/TPEx | Daily at 18:00 | 2007 to present |
| Financial Statements | MOPS (Market Observation Post System) | After each quarterly report | 2000 to present |
| Monthly Revenue | MOPS | After the 10th of each month | 2000 to present |
| Institutional Trading | TWSE | Daily at 18:00 | 2010 to present |
| Technical Indicators | Calculated by FinLab | Real-time computation | Depends on price data |
### US Stock Data
| Category | Data Source | Update Frequency | Data Range |
|---|---|---|---|
| Price | Yahoo Finance | Daily | 2010 to present |
| Financial Statements | SEC EDGAR | Quarterly | 2010 to present |
## Data Update Schedule

```mermaid
gantt
    title Taiwan stock daily data update schedule
    dateFormat HH:mm
    axisFormat %H:%M
    section Price data
    Market close           :done, 13:30, 1m
    Crawler starts         :done, 18:00, 10m
    Data upload            :done, 18:10, 20m
    Available for download :crit, 18:30, 1m
    section Chip data
    Crawler starts         :done, 18:00, 30m
    Data upload            :done, 18:30, 30m
    Available for download :crit, 19:00, 1m
    section Financial statement data
    Announcement           :done, 08:00, 1m
    Crawler starts         :done, 09:00, 60m
    Data upload            :done, 10:00, 30m
    Available for download :crit, 10:30, 1m
```
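Given the schedule above, an evening script can check whether the latest expected trading date has already landed in its local copy before running a strategy. A minimal sketch in plain pandas (the `is_up_to_date` helper, the 19:00 cutoff, and the synthetic DataFrame are illustrative assumptions, not FinLab API; weekends and holidays are ignored):

```python
import pandas as pd

def is_up_to_date(df: pd.DataFrame, now: pd.Timestamp, ready_hour: int = 19) -> bool:
    """Return True if df already contains the latest expected trading date.

    Before `ready_hour` (19:00 here, once the 18:30 price upload has finished),
    the previous day is the most recent date we can expect; afterwards, today.
    """
    expected = now.normalize()
    if now.hour < ready_hour:
        expected -= pd.Timedelta(days=1)
    return df.index[-1] >= expected

# Synthetic price cache whose last row is 2024-01-10
idx = pd.date_range('2024-01-08', '2024-01-10')
df = pd.DataFrame({'2330': [593.0, 595.0, 601.0]}, index=idx)

print(is_up_to_date(df, pd.Timestamp('2024-01-10 20:00')))  # True: after 19:00, the 10th is there
print(is_up_to_date(df, pd.Timestamp('2024-01-11 09:00')))  # True: morning still expects the 10th
print(is_up_to_date(df, pd.Timestamp('2024-01-11 20:00')))  # False: the 11th is missing
```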
## Data Caching Mechanism

### Local Cache Directory

```python
import finlab

# Default cache location
print(finlab.get_data_dir())
# Output: /Users/username/.finlab/data

# Custom cache location
finlab.set_data_dir('/path/to/custom/cache')
```
### Cache Update Logic

```mermaid
flowchart TD
    A[Call data.get] --> B{Local cache exists?}
    B -->|No| C[Download from BigQuery]
    B -->|Yes| D{Cache expired?}
    D -->|Yes| C
    D -->|No| E[Use cached data]
    C --> F[Save locally]
    F --> E
    E --> G[Return DataFrame]
```
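The flowchart above can be sketched as a simple mtime-based file cache. This is a simplified illustration of the idea, not FinLab's actual implementation; the `cached_get` function, its `fetch` callback, and the `max_age` parameter are all hypothetical:

```python
import os
import pickle
import tempfile
import time

def cached_get(key, fetch, cache_dir, max_age=86400):
    """Return data for `key`, re-fetching only when the local copy is
    missing or older than `max_age` seconds (mirrors the flowchart)."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key.replace(':', '_') + '.pkl')

    # Cache hit: the file exists and is still fresh
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age:
        with open(path, 'rb') as f:
            return pickle.load(f)

    # Cache miss or expired: fetch, then persist for next time
    data = fetch(key)
    with open(path, 'wb') as f:
        pickle.dump(data, f)
    return data

calls = []
def fake_fetch(key):
    calls.append(key)          # count how many real "downloads" happen
    return {'key': key, 'rows': 100}

cache_dir = tempfile.mkdtemp()
first = cached_get('price:close', fake_fetch, cache_dir)   # downloads
second = cached_get('price:close', fake_fetch, cache_dir)  # served from cache
print(len(calls))  # 1
```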
### Cache Management

```python
from finlab import data

# Clear all cached data
data.clear_cache()

# Clear the cache for a specific dataset
data.clear_cache('price:收盤價')

# Force a re-download (ignore the cache)
close = data.get('price:收盤價', force_download=True)
```
## Data Query Best Practices

### 1. Batch Load Data

Bad practice: calling `data.get()` repeatedly inside a loop.

```python
# Running 100 backtests, reloading the data each time (slow!)
for param in range(1, 101):
    close = data.get('price:收盤價')  # Redundant loading
    position = close > close.average(param)
    report = sim(position)
```

Good practice: load once, reuse many times.

```python
from finlab.backtest import sim

# Load only once
close = data.get('price:收盤價')

# Run 100 backtests
for param in range(1, 101):
    position = close > close.average(param)
    report = sim(position)
```
### 2. Use Date Filtering

```python
# Keep only data from 2020 onward
close = data.get('price:收盤價')
close_recent = close[close.index >= '2020-01-01']

# Or use loc
close_recent = close.loc['2020-01-01':]
```
### 3. Use Stock Filtering

```python
# Work with specific stocks only
close = data.get('price:收盤價')
close_subset = close[['2330', '2317', '2454']]

# Or mask by a filter condition
market_cap = data.get('etl:market_value')
large_cap = market_cap > 100_000_000_000  # Market cap > 100 billion TWD
close_large = close[large_cap]
```
### 4. Avoid Redundant Computation

Bad practice:

```python
for stock in ['2330', '2317', '2454']:
    close = data.get('price:收盤價')[stock]
    ma20 = close.rolling(20).mean()  # Reloaded and recomputed per stock
```

Good practice:

```python
close = data.get('price:收盤價')[['2330', '2317', '2454']]
ma20 = close.rolling(20).mean()  # Compute all stocks at once
```
## Data Quality & Processing

### Handling Missing Values

```python
from finlab import data

close = data.get('price:收盤價')

# Inspect the missing-value ratio per stock
missing_ratio = close.isna().sum() / len(close)
print(f"Missing-value ratio:\n{missing_ratio[missing_ratio > 0.1]}")

# Method 1: forward fill (recommended)
close_filled = close.ffill()

# Method 2: drop stocks (columns) with any missing values
close_clean = close.dropna(axis=1, how='any')

# Method 3: drop dates (rows) with any missing values
close_clean = close.dropna(axis=0, how='any')
```
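The three methods trade off differently: forward fill keeps every stock and date, dropping columns discards entire stocks, and dropping rows discards entire trading days. A small synthetic example in plain pandas (no FinLab login required) makes the shapes concrete:

```python
import numpy as np
import pandas as pd

# 4 trading days x 3 stocks, with one missing price for '2317'
close = pd.DataFrame(
    {'2330': [600.0, 601.0, 603.0, 605.0],
     '2317': [100.0, np.nan, 101.0, 102.0],
     '2454': [900.0, 905.0, 910.0, 915.0]},
    index=pd.date_range('2024-01-02', periods=4),
)

# Forward fill: the gap is replaced by the last known price
filled = close.ffill()
print(filled.loc['2024-01-03', '2317'])  # 100.0

# Dropping columns removes '2317' entirely
print(close.dropna(axis=1, how='any').columns.tolist())  # ['2330', '2454']

# Dropping rows removes only 2024-01-03
print(len(close.dropna(axis=0, how='any')))  # 3
```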
### Handling Outliers

```python
# Mask out prices for limit-up locked stocks
close = data.get('price:收盤價')
open_price = data.get('price:開盤價')

# Limit-up locked: opened at the limit and stayed there (daily gain near +10%)
limit_up = close / close.shift() - 1 > 0.095
locked = (close == open_price) & limit_up

# Replace locked entries with NaN (boolean masking keeps the DataFrame shape)
close_filtered = close[~locked]
```
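The mask can be verified on synthetic data (plain pandas, no FinLab login; the prices are made up for illustration):

```python
import pandas as pd

# Two days of one stock: day 2 opens at the +10% limit and never trades lower
close = pd.DataFrame({'2330': [100.0, 110.0]},
                     index=pd.date_range('2024-01-02', periods=2))
open_price = pd.DataFrame({'2330': [99.0, 110.0]}, index=close.index)

limit_up = close / close.shift() - 1 > 0.095   # day 2: +10% gain > 9.5% threshold
locked = (close == open_price) & limit_up      # day 2: open == close at the limit
close_filtered = close[~locked]                # locked entry becomes NaN

print(close_filtered['2330'].isna().tolist())  # [False, True]
```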
## Common Data Table Reference

### Price Data

```python
# Adjusted close price (accounting for dividends and stock splits)
adj_close = data.get('etl:adj_close')

# Raw close price
raw_close = data.get('price:收盤價')

# Volume (shares traded)
volume = data.get('price:成交股數')
```
### Financial Statement Data

```python
# Earnings per share
eps = data.get('financial_statement:每股盈餘')

# Return on equity
roe = data.get('fundamental_features:股東權益報酬率')

# Operating margin
operating_margin = data.get('fundamental_features:營業利益率')
```
### Monthly Revenue Data

```python
# Monthly revenue
rev = data.get('monthly_revenue:當月營收')

# Revenue year-over-year growth rate
rev_yoy = data.get('monthly_revenue:去年同月增減(%)')
```
### Institutional Trading Data

```python
# Investment trust net buy/sell
trust = data.get('institutional_investors_trading_summary:投信買賣超股數')

# Foreign investor net buy/sell
foreign = data.get('institutional_investors_trading_summary:外資買賣超股數')

# Margin utilization rate
margin_ratio = data.get('margin_transactions:融資使用率')
```
## Data Storage & Sharing

### Save Data to CSV

```python
close = data.get('price:收盤價')

# Save the entire DataFrame
close.to_csv('close_prices.csv')

# Save specific stocks only
close[['2330', '2317']].to_csv('selected_stocks.csv')
```
### Load Data from CSV

```python
import pandas as pd
from finlab.dataframe import FinlabDataFrame

# Load the CSV with a datetime index
close = pd.read_csv('close_prices.csv', index_col=0, parse_dates=True)

# Convert to FinlabDataFrame to regain FinLab methods
close = FinlabDataFrame(close)

# Now FinLab methods work again
ma20 = close.average(20)
```
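The `index_col=0` and `parse_dates=True` arguments matter: without them the dates come back as plain strings and date slicing breaks. A round-trip check with a synthetic table (plain pandas, written to a temporary directory so it is safe to run anywhere):

```python
import os
import tempfile

import pandas as pd

# Synthetic price table standing in for a downloaded close-price DataFrame
close = pd.DataFrame({'2330': [600.0, 601.0], '2317': [100.0, 101.0]},
                     index=pd.date_range('2024-01-02', periods=2))
close.index.name = 'date'

path = os.path.join(tempfile.mkdtemp(), 'close_prices.csv')
close.to_csv(path)

# index_col=0 + parse_dates=True restores the DatetimeIndex exactly
restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(restored.index.equals(close.index))  # True
print(restored['2330'].tolist())           # [600.0, 601.0]
```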