# FinLab Data Flow In-Depth

This document provides a detailed explanation of FinLab's data architecture, data sources, update frequency, caching mechanism, and query best practices.
## Data Architecture Overview

```mermaid
graph TB
    A[Data sources] --> B[GCS Bucket]
    B --> C[BigQuery]
    C --> D[finlab.data.get]
    D --> E{Local cache?}
    E -->|Hit| F[Return cached data]
    E -->|Miss| G[Download from BigQuery]
    G --> H[Save to cache]
    H --> F
    F --> I[FinlabDataFrame]
    I --> J[Strategy development]
    style A fill:#e1f5ff
    style D fill:#fff4e1
    style F fill:#e8f5e9
```
## Data Sources

### Taiwan Stock Data
| Category | Data Source | Update Frequency | Data Range |
|---|---|---|---|
| Price | TWSE/TPEx | Daily at 18:00 | 2007 to present |
| Financial Statements | MOPS (Market Observation Post System) | After each quarterly report | 2000 to present |
| Monthly Revenue | MOPS | After the 10th of each month | 2000 to present |
| Institutional Trading | TWSE | Daily at 18:00 | 2010 to present |
| Technical Indicators | Calculated by FinLab | Real-time computation | Depends on price data |
### US Stock Data
| Category | Data Source | Update Frequency | Data Range |
|---|---|---|---|
| Price | Yahoo Finance | Daily | 2010 to present |
| Financial Statements | SEC EDGAR | Quarterly | 2010 to present |
## Data Update Schedule

```mermaid
gantt
    title Taiwan stock daily data update schedule
    dateFormat HH:mm
    axisFormat %H:%M
    section Price data
    Market close           :done, 13:30, 1m
    Crawler starts         :done, 18:00, 10m
    Data upload            :done, 18:10, 20m
    Available for download :crit, 18:30, 1m
    section Chip data
    Crawler starts         :done, 18:00, 30m
    Data upload            :done, 18:30, 30m
    Available for download :crit, 19:00, 1m
    section Financial statement data
    Announcement           :done, 08:00, 1m
    Crawler starts         :done, 09:00, 60m
    Data upload            :done, 10:00, 30m
    Available for download :crit, 10:30, 1m
```
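Given the schedule above, an evening script can check whether the latest expected trading date has already landed in its local copy before running a strategy. A minimal sketch in plain pandas (the `is_up_to_date` helper, the 19:00 cutoff, and the synthetic DataFrame are illustrative assumptions, not FinLab API; weekends and holidays are ignored):

```python
import pandas as pd

def is_up_to_date(df: pd.DataFrame, now: pd.Timestamp, ready_hour: int = 19) -> bool:
    """Return True if df already contains the latest expected trading date.

    Before `ready_hour` (19:00 here, once the 18:30 price upload has finished),
    the previous day is the most recent date we can expect; afterwards, today.
    """
    expected = now.normalize()
    if now.hour < ready_hour:
        expected -= pd.Timedelta(days=1)
    return df.index[-1] >= expected

# Synthetic price cache whose last row is 2024-01-10
idx = pd.date_range('2024-01-08', '2024-01-10')
df = pd.DataFrame({'2330': [593.0, 595.0, 601.0]}, index=idx)

print(is_up_to_date(df, pd.Timestamp('2024-01-10 20:00')))  # True: after 19:00, the 10th is there
print(is_up_to_date(df, pd.Timestamp('2024-01-11 09:00')))  # True: morning still expects the 10th
print(is_up_to_date(df, pd.Timestamp('2024-01-11 20:00')))  # False: the 11th is missing
```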
## Data Caching Mechanism

### Local Cache Directory

```python
import finlab

# Default cache location
print(finlab.get_data_dir())
# Output: /Users/username/.finlab/data

# Custom cache location
finlab.set_data_dir('/path/to/custom/cache')
```
### Cache Update Logic

```mermaid
flowchart TD
    A[Call data.get] --> B{Local cache exists?}
    B -->|No| C[Download from BigQuery]
    B -->|Yes| D{Cache expired?}
    D -->|Yes| C
    D -->|No| E[Use cached data]
    C --> F[Save locally]
    F --> E
    E --> G[Return DataFrame]
```
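The flowchart above can be sketched as a simple mtime-based file cache. This is a simplified illustration of the idea, not FinLab's actual implementation; the `cached_get` function, its `fetch` callback, and the `max_age` parameter are all hypothetical:

```python
import os
import pickle
import tempfile
import time

def cached_get(key, fetch, cache_dir, max_age=86400):
    """Return data for `key`, re-fetching only when the local copy is
    missing or older than `max_age` seconds (mirrors the flowchart)."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key.replace(':', '_') + '.pkl')

    # Cache hit: the file exists and is still fresh
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age:
        with open(path, 'rb') as f:
            return pickle.load(f)

    # Cache miss or expired: fetch, then persist for next time
    data = fetch(key)
    with open(path, 'wb') as f:
        pickle.dump(data, f)
    return data

calls = []
def fake_fetch(key):
    calls.append(key)          # count how many real "downloads" happen
    return {'key': key, 'rows': 100}

cache_dir = tempfile.mkdtemp()
first = cached_get('price:close', fake_fetch, cache_dir)   # downloads
second = cached_get('price:close', fake_fetch, cache_dir)  # served from cache
print(len(calls))  # 1
```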
### Cache Management

```python
from finlab import data

# Clear all cached data
data.clear_cache()

# Clear the cache for a specific dataset
data.clear_cache('price:收盤價')

# Force a re-download (ignore the cache)
close = data.get('price:收盤價', force_download=True)
```
## Data Query Best Practices

### 1. Batch Load Data

Bad practice: calling `data.get()` repeatedly inside a loop.

```python
# Running 100 backtests, reloading the data each time (slow!)
for param in range(1, 101):
    close = data.get('price:收盤價')  # Redundant loading
    position = close > close.average(param)
    report = sim(position)
```

Good practice: load once, reuse many times.

```python
from finlab.backtest import sim

# Load only once
close = data.get('price:收盤價')

# Run 100 backtests
for param in range(1, 101):
    position = close > close.average(param)
    report = sim(position)
```
### 2. Use Date Filtering

```python
# Keep only data from 2020 onward
close = data.get('price:收盤價')
close_recent = close[close.index >= '2020-01-01']

# Or use loc
close_recent = close.loc['2020-01-01':]
```
### 3. Use Stock Filtering

```python
# Work with specific stocks only
close = data.get('price:收盤價')
close_subset = close[['2330', '2317', '2454']]

# Or mask by a filter condition
market_cap = data.get('etl:market_value')
large_cap = market_cap > 100_000_000_000  # Market cap > 100 billion TWD
close_large = close[large_cap]
```
### 4. Avoid Redundant Computation

Bad practice:

```python
for stock in ['2330', '2317', '2454']:
    close = data.get('price:收盤價')[stock]
    ma20 = close.rolling(20).mean()  # Reloaded and recomputed per stock
```

Good practice:

```python
close = data.get('price:收盤價')[['2330', '2317', '2454']]
ma20 = close.rolling(20).mean()  # Compute all stocks at once
```
## Data Quality & Processing

### Handling Missing Values

```python
from finlab import data

close = data.get('price:收盤價')

# Inspect the missing-value ratio per stock
missing_ratio = close.isna().sum() / len(close)
print(f"Missing-value ratio:\n{missing_ratio[missing_ratio > 0.1]}")

# Method 1: forward fill (recommended)
close_filled = close.ffill()

# Method 2: drop stocks (columns) with any missing values
close_clean = close.dropna(axis=1, how='any')

# Method 3: drop dates (rows) with any missing values
close_clean = close.dropna(axis=0, how='any')
```
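The three methods trade off differently: forward fill keeps every stock and date, dropping columns discards entire stocks, and dropping rows discards entire trading days. A small synthetic example in plain pandas (no FinLab login required) makes the shapes concrete:

```python
import numpy as np
import pandas as pd

# 4 trading days x 3 stocks, with one missing price for '2317'
close = pd.DataFrame(
    {'2330': [600.0, 601.0, 603.0, 605.0],
     '2317': [100.0, np.nan, 101.0, 102.0],
     '2454': [900.0, 905.0, 910.0, 915.0]},
    index=pd.date_range('2024-01-02', periods=4),
)

# Forward fill: the gap is replaced by the last known price
filled = close.ffill()
print(filled.loc['2024-01-03', '2317'])  # 100.0

# Dropping columns removes '2317' entirely
print(close.dropna(axis=1, how='any').columns.tolist())  # ['2330', '2454']

# Dropping rows removes only 2024-01-03
print(len(close.dropna(axis=0, how='any')))  # 3
```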
### Handling Outliers

```python
# Mask out prices for limit-up locked stocks
close = data.get('price:收盤價')
open_price = data.get('price:開盤價')

# Limit-up locked: opened at the limit and stayed there (daily gain near +10%)
limit_up = close / close.shift() - 1 > 0.095
locked = (close == open_price) & limit_up

# Replace locked entries with NaN (boolean masking keeps the DataFrame shape)
close_filtered = close[~locked]
```
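The mask can be verified on synthetic data (plain pandas, no FinLab login; the prices are made up for illustration):

```python
import pandas as pd

# Two days of one stock: day 2 opens at the +10% limit and never trades lower
close = pd.DataFrame({'2330': [100.0, 110.0]},
                     index=pd.date_range('2024-01-02', periods=2))
open_price = pd.DataFrame({'2330': [99.0, 110.0]}, index=close.index)

limit_up = close / close.shift() - 1 > 0.095   # day 2: +10% gain > 9.5% threshold
locked = (close == open_price) & limit_up      # day 2: open == close at the limit
close_filtered = close[~locked]                # locked entry becomes NaN

print(close_filtered['2330'].isna().tolist())  # [False, True]
```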
## Common Data Table Reference

### Price Data

```python
# Adjusted close price (accounting for dividends and stock splits)
adj_close = data.get('etl:adj_close')

# Raw close price
raw_close = data.get('price:收盤價')

# Volume (shares traded)
volume = data.get('price:成交股數')
```
### Financial Statement Data

```python
# Earnings per share
eps = data.get('financial_statement:每股盈餘')

# Return on equity
roe = data.get('fundamental_features:股東權益報酬率')

# Operating margin
operating_margin = data.get('fundamental_features:營業利益率')
```
### Monthly Revenue Data

```python
# Monthly revenue
rev = data.get('monthly_revenue:當月營收')

# Revenue year-over-year growth rate
rev_yoy = data.get('monthly_revenue:去年同月增減(%)')
```
### Institutional Trading Data

```python
# Investment trust net buy/sell
trust = data.get('institutional_investors_trading_summary:投信買賣超股數')

# Foreign investor net buy/sell
foreign = data.get('institutional_investors_trading_summary:外資買賣超股數')

# Margin utilization rate
margin_ratio = data.get('margin_transactions:融資使用率')
```
## Data Storage & Sharing

### Save Data to CSV

```python
close = data.get('price:收盤價')

# Save the entire DataFrame
close.to_csv('close_prices.csv')

# Save specific stocks only
close[['2330', '2317']].to_csv('selected_stocks.csv')
```
### Load Data from CSV

```python
import pandas as pd
from finlab.dataframe import FinlabDataFrame

# Load the CSV with a datetime index
close = pd.read_csv('close_prices.csv', index_col=0, parse_dates=True)

# Convert to FinlabDataFrame to regain FinLab methods
close = FinlabDataFrame(close)

# Now FinLab methods work again
ma20 = close.average(20)
```
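The `index_col=0` and `parse_dates=True` arguments matter: without them the dates come back as plain strings and date slicing breaks. A round-trip check with a synthetic table (plain pandas, written to a temporary directory so it is safe to run anywhere):

```python
import os
import tempfile

import pandas as pd

# Synthetic price table standing in for a downloaded close-price DataFrame
close = pd.DataFrame({'2330': [600.0, 601.0], '2317': [100.0, 101.0]},
                     index=pd.date_range('2024-01-02', periods=2))
close.index.name = 'date'

path = os.path.join(tempfile.mkdtemp(), 'close_prices.csv')
close.to_csv(path)

# index_col=0 + parse_dates=True restores the DatetimeIndex exactly
restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(restored.index.equals(close.index))  # True
print(restored['2330'].tolist())           # [600.0, 601.0]
```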