FinLab Data Flow In-Depth

Data Architecture Overview

graph TB
    A[Data Sources] --> B[GCS Bucket]
    B --> C[BigQuery]
    C --> D[finlab.data.get]
    D --> E{Local cache?}
    E -->|Yes| F[Return cached data]
    E -->|No| G[Download from BigQuery]
    G --> H[Store to cache]
    H --> F
    F --> I[FinlabDataFrame]
    I --> J[Strategy Development]

    style A fill:#e1f5ff
    style D fill:#fff4e1
    style F fill:#e8f5e9

Data Sources

Taiwan Stock Data

| Category | Data Source | Update Frequency | Data Range |
|----------|-------------|------------------|------------|
| Price | TWSE/TPEx | Daily at 18:00 | 2007 to present |
| Financial Statements | MOPS (Market Observation Post System) | After each quarterly report | 2000 to present |
| Monthly Revenue | MOPS | After the 10th of each month | 2000 to present |
| Institutional Trading | TWSE | Daily at 18:00 | 2010 to present |
| Technical Indicators | Calculated by FinLab | Real-time computation | Depends on price data |

US Stock Data

| Category | Data Source | Update Frequency | Data Range |
|----------|-------------|------------------|------------|
| Price | Yahoo Finance | Daily | 2010 to present |
| Financial Statements | SEC EDGAR | Quarterly | 2010 to present |

Data Update Schedule

gantt
    title Taiwan Stock Data Daily Update Timeline
    dateFormat HH:mm
    axisFormat %H:%M

    section Price Data
    Market close    :done, 13:30, 1m
    Scraper starts  :done, 18:00, 10m
    Data uploaded   :done, 18:10, 20m
    Available       :crit, 18:30, 1m

    section Chip Data
    Scraper starts  :done, 18:00, 30m
    Data uploaded   :done, 18:30, 30m
    Available       :crit, 19:00, 1m

    section Financial Statements
    Disclosure      :done, 08:00, 1m
    Scraper starts  :done, 09:00, 60m
    Data uploaded   :done, 10:00, 30m
    Available       :crit, 10:30, 1m
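Because different tables land at different times, a live strategy should verify that the latest trading day has actually arrived before acting on it. A minimal sketch with plain pandas; the `is_up_to_date` helper and the synthetic index are illustrative, not part of the FinLab API:

```python
import pandas as pd

def is_up_to_date(df: pd.DataFrame, as_of: pd.Timestamp) -> bool:
    """Return True if the table already contains the given trading day."""
    return len(df.index) > 0 and df.index[-1] >= as_of

# Synthetic price table indexed by trading day
close = pd.DataFrame(
    {"2330": [600.0, 605.0, 610.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

print(is_up_to_date(close, pd.Timestamp("2024-01-04")))  # True
print(is_up_to_date(close, pd.Timestamp("2024-01-05")))  # False: wait for the 18:30 upload
```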

Data Caching Mechanism

Local Cache Directory

import finlab

# Default cache location
print(finlab.get_data_dir())
# Output: /Users/username/.finlab/data

# Custom cache location
finlab.set_data_dir('/path/to/custom/cache')

Cache Update Logic

flowchart TD
    A[Call data.get] --> B{Local cache exists?}
    B -->|No| C[Download from BigQuery]
    B -->|Yes| D{Cache expired?}
    D -->|Yes| C
    D -->|No| E[Use cached data]
    C --> F[Store locally]
    F --> E
    E --> G[Return DataFrame]
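The flowchart boils down to a three-way decision: cache miss, cache expired, or cache hit. A minimal sketch of that logic in plain Python; `download`, `CACHE`, and `CACHE_TTL` are illustrative stand-ins, not FinLab internals:

```python
CACHE = {}        # key -> (timestamp, data)
CACHE_TTL = 3600  # pretend cached tables expire after an hour
calls = {"n": 0}  # count downloads so the cache behavior is observable

def download(key):
    """Stand-in for the BigQuery download step."""
    calls["n"] += 1
    return f"data for {key}"

def get(key, now):
    """Mirror the flowchart: miss -> download, expired -> download, else reuse."""
    if key in CACHE:
        ts, data = CACHE[key]
        if now - ts < CACHE_TTL:
            return data            # cache hit
    data = download(key)           # cache miss or expired
    CACHE[key] = (now, data)       # store locally
    return data

print(get("price:close", now=0))     # downloads
print(get("price:close", now=100))   # served from cache
print(get("price:close", now=7200))  # expired, downloads again
```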

Cache Management

from finlab import data

# Clear all cache
data.clear_cache()

# Clear cache for specific data
data.clear_cache('price:收盤價')

# Force re-download (ignore cache)
close = data.get('price:收盤價', force_download=True)

Data Query Best Practices

1. Batch Load Data

Bad practice: Calling data.get() repeatedly

# Running 100 backtests, loading data each time (slow!)
for param in range(1, 101):
    close = data.get('price:收盤價')  # Redundant loading
    position = close > close.average(param)
    report = sim(position)

Good practice: Load once, reuse multiple times

from finlab import data
from finlab.backtest import sim

# Load only once
close = data.get('price:收盤價')

# Run 100 backtests, reusing the same DataFrame
for param in range(1, 101):
    position = close > close.average(param)
    report = sim(position)

2. Use Date Filtering

# Load only data after 2020
close = data.get('price:收盤價')
close_recent = close[close.index >= '2020-01-01']

# Or use loc
close_recent = close.loc['2020-01-01':]

3. Use Stock Filtering

# Load only specific stocks
close = data.get('price:收盤價')
close_subset = close[['2330', '2317', '2454']]

# Or mask by a condition (non-matching entries become NaN)
market_cap = data.get('etl:market_value')
large_cap = market_cap > 100_000_000_000  # Market cap > 100 billion TWD
close_large = close[large_cap]
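Note that indexing with a boolean DataFrame does not drop rows or columns: pandas keeps the frame's shape and replaces non-matching entries with NaN. A small demonstration with synthetic data:

```python
import pandas as pd

close = pd.DataFrame({"A": [10.0, 11.0], "B": [5.0, 6.0]})
mask = pd.DataFrame({"A": [True, True], "B": [False, True]})

# Equivalent to close.where(mask): same shape, NaN where mask is False
masked = close[mask]
print(masked)
#       A    B
# 0  10.0  NaN
# 1  11.0  6.0
```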

4. Avoid Redundant Computation

Bad practice

for stock in ['2330', '2317', '2454']:
    close = data.get('price:收盤價')[stock]  # Re-fetches the table every iteration
    ma20 = close.rolling(20).mean()  # Recomputed one stock at a time

Good practice

close = data.get('price:收盤價')[['2330', '2317', '2454']]
ma20 = close.rolling(20).mean()  # Compute all stocks at once

Data Quality & Processing

Handling Missing Values

from finlab import data

close = data.get('price:收盤價')

# Check missing values
missing_ratio = close.isna().sum() / len(close)
print(f"Missing-value ratio:\n{missing_ratio[missing_ratio > 0.1]}")

# Method 1: Forward fill (recommended)
close_filled = close.ffill()

# Method 2: Remove stocks with missing values
close_clean = close.dropna(axis=1, how='any')

# Method 3: Remove dates with missing values
close_clean = close.dropna(axis=0, how='any')

Handling Outliers

# Remove data for limit-up locked stocks
close = data.get('price:收盤價')
open_price = data.get('price:開盤價')

# A stock is locked at limit-up when it opens at the limit and stays there:
# open == close, and the day-over-day gain exceeds ~9.5%
limit_up = close / close.shift() - 1 > 0.095
locked = (close == open_price) & limit_up

# Mask locked entries (they become NaN instead of being dropped)
close_filtered = close[~locked]
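The same masking behavior applies here: locked entries turn into NaN rather than being removed. A toy check with synthetic prices, where day 1 jumps ~10% and opens at the same price it closes:

```python
import pandas as pd

close = pd.DataFrame({"X": [100.0, 110.0, 111.0]})
open_price = pd.DataFrame({"X": [99.0, 110.0, 110.5]})

limit_up = close / close.shift() - 1 > 0.095  # day-over-day gain above ~9.5%
locked = (close == open_price) & limit_up     # opened and closed at the limit

# Day 1 is locked, so its close becomes NaN; the other days are untouched
close_filtered = close[~locked]
print(close_filtered)
```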

Common Data Table Reference

Price Data

# Adjusted close price (accounting for dividends and stock splits)
adj_close = data.get('etl:adj_close')

# Raw close price
raw_close = data.get('price:收盤價')

# Volume
volume = data.get('price:成交股數')

Financial Statement Data

# Earnings per share
eps = data.get('financial_statement:每股盈餘')

# Return on equity
roe = data.get('fundamental_features:股東權益報酬率')

# Operating margin
operating_margin = data.get('fundamental_features:營業利益率')

Monthly Revenue Data

# Monthly revenue
rev = data.get('monthly_revenue:當月營收')

# Revenue year-over-year growth rate
rev_yoy = data.get('monthly_revenue:去年同月增減(%)')

Institutional Trading Data

# Investment trust net buy/sell
trust = data.get('institutional_investors_trading_summary:投信買賣超股數')

# Foreign investor net buy/sell
foreign = data.get('institutional_investors_trading_summary:外資買賣超股數')

# Margin utilization rate
margin_ratio = data.get('margin_transactions:融資使用率')

Data Storage & Sharing

Save Data to CSV

close = data.get('price:收盤價')

# Save the entire DataFrame
close.to_csv('close_prices.csv')

# Save specific stocks
close[['2330', '2317']].to_csv('selected_stocks.csv')

Load Data from CSV

import pandas as pd

# Load CSV
close = pd.read_csv('close_prices.csv', index_col=0, parse_dates=True)

# Convert to FinlabDataFrame
from finlab.dataframe import FinlabDataFrame
close = FinlabDataFrame(close)

# Now you can use FinLab methods
ma20 = close.average(20)

Reference Resources