Machine Learning Strategy Complete Workflow
This document provides the complete development workflow for machine learning quantitative strategies, from feature engineering, label generation, and model training to backtest validation and live deployment.
ML Strategy Development -- Recommended with AI Assistant
ML strategy development involves feature engineering, label design, model training, and more. After installing FinLab Skill, the AI coding assistant can provide code examples and debugging help at every stage.
Workflow Overview
graph TD
A[Raw data] --> B[Feature engineering<br/>finlab.ml.feature]
B --> C[Label generation<br/>finlab.ml.label]
C --> D[Dataset splitting]
D --> E[Model training<br/>finlab.ml.qlib]
E --> F[Predicted position weights]
F --> G[Backtest validation]
G --> H{Performance satisfactory?}
H -->|No| I[Adjust features/model]
I --> B
H -->|Yes| J[Out-of-sample test]
J --> K{Passed validation?}
K -->|No| I
K -->|Yes| L[Live deployment]
Stage 1: Feature Engineering
Feature engineering is the single biggest determinant of an ML strategy's success. FinLab provides powerful feature engineering tools.
1.1 Load and Merge Fundamental Features
from finlab import data
from finlab.ml import feature as mlf
# Load fundamental data
pb_ratio = data.get('price_earning_ratio:股價淨值比')
pe_ratio = data.get('price_earning_ratio:本益比')
roe = data.get('fundamental_features:股東權益報酬率')
roa = data.get('fundamental_features:資產報酬率')
# Combine into feature set
fundamental_features = mlf.combine({
'pb': pb_ratio,
'pe': pe_ratio,
'roe': roe,
'roa': roa
}, resample='W')  # resample to weekly data
print(fundamental_features.head())
# Output:
# pb pe roe roa
# (2010-01-04 00:00:00, '1101') 1.47 18.85 7.80 3.21
# (2010-01-04 00:00:00, '1102') 1.44 14.58 9.87 4.15
1.2 Add Technical Indicator Features
# Method 1: Use random technical indicators (explore best indicators)
ta_features = mlf.combine({
    'talib': mlf.ta(mlf.ta_names(n=5))  # randomly generate 5 parameter configurations per indicator
}, resample='W')
# Method 2: Specify particular technical indicators
from finlab import data
close = data.get('price:收盤價')
volume = data.get('price:成交股數')
specific_ta = mlf.combine({
'rsi': close.ta.RSI(timeperiod=14),
'macd': close.ta.MACD(),
'bbands': close.ta.BBANDS(timeperiod=20),
'obv': volume.ta.OBV(close)
}, resample='W')
print(f"Number of TA features: {ta_features.shape[1]}")  # e.g., 450 features
1.3 Add Custom Features
# Revenue-related features
rev = data.get('monthly_revenue:當月營收')
rev_yoy = data.get('monthly_revenue:去年同月增減(%)')
custom_features = mlf.combine({
    'rev_ma3': rev.average(3),    # 3-month revenue moving average
    'rev_ma12': rev.average(12),  # 12-month revenue moving average
    'rev_momentum': rev.average(3) / rev.average(12),  # revenue momentum
    'rev_yoy': rev_yoy,  # revenue year-over-year growth
}, resample='W')
print(custom_features.head())
1.4 Merge All Features
# Merge all feature sets
all_features = mlf.combine({
'fundamental': fundamental_features,
'technical': ta_features,
'custom': custom_features
}, resample='W')
print(f"Total features: {all_features.shape[1]}")  # e.g., 470 features
print(f"Number of rows: {all_features.shape[0]}")  # e.g., 150,000 rows
# Check missing values
missing_ratio = all_features.isna().sum() / len(all_features)
print(f"Missing-value ratios:\n{missing_ratio[missing_ratio > 0.5]}")  # show features with > 50% missing
# Remove features with too many missing values
all_features = all_features.loc[:, missing_ratio < 0.5]
print(f"Features after filtering: {all_features.shape[1]}")
Stage 2: Label Generation
Labels define our prediction target. finlab.ml.label provides various label generation functions, all accepting features.index (MultiIndex) as the first argument.
2.1 Predict Future Returns
from finlab.ml import label as mll
# Predict 1-week future return (most common)
label = mll.return_percentage(all_features.index, resample='W', period=1)
print(label.head())
# Output:
# datetime instrument
# 2010-01-04 00:00:00 1101 0.032
# 1102 -0.015
# 1103 0.021
# dtype: float64
# Check label distribution
print(label.describe())
# Output:
# count 150000.00
# mean 0.005
# std 0.087
# min -0.450
# 25% -0.042
# 50% 0.002
# 75% 0.051
# max 0.520
2.2 Excess Return Labels
# Excess return over market median for the same period
label_excess_median = mll.excess_over_median(all_features.index, resample='W', period=1)
# Excess return over market mean for the same period
label_excess_mean = mll.excess_over_mean(all_features.index, resample='W', period=1)
print(label_excess_median.describe())
2.3 Other Label Types
# Day trading return (open-to-close change)
label_daytrading = mll.daytrading_percentage(all_features.index)
# Risk metric: Maximum Adverse Excursion (max decline during holding period)
label_mae = mll.maximum_adverse_excursion(all_features.index, period=5)
# Risk metric: Maximum Favorable Excursion (max gain during holding period)
label_mfe = mll.maximum_favorable_excursion(all_features.index, period=5)
# Multi-period prediction (predict different time horizons)
label_1w = mll.return_percentage(all_features.index, resample='W', period=1)
label_2w = mll.return_percentage(all_features.index, resample='W', period=2)
label_4w = mll.return_percentage(all_features.index, resample='W', period=4)
Label Selection Recommendations
- return_percentage: most commonly used; directly predicts returns
- excess_over_median: predicts relative performance, reducing the impact of market-wide moves
- daytrading_percentage: suited to day-trading strategies
- maximum_adverse_excursion: suited to risk-management models
- period should match the strategy's rebalancing frequency: e.g., resample='W', period=1 predicts the 1-week return
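Conceptually, a forward-return label with period=1 is just the next resampled price divided by the current one. The pandas sketch below illustrates the idea on hypothetical weekly closes; it is not the FinLab implementation, only the shape and arithmetic the labels follow:

```python
import pandas as pd

# Hypothetical weekly closes for two stocks
close = pd.DataFrame(
    {"1101": [50.0, 51.0, 49.5, 52.0], "2330": [600.0, 612.0, 606.0, 630.0]},
    index=pd.date_range("2023-01-06", periods=4, freq="W-FRI"),
)

# 1-period forward return: what a period=1 return label predicts
fwd_ret = close.shift(-1) / close - 1

# Stack into the (datetime, instrument) MultiIndex shape that labels use;
# the final date has no forward return, so its rows drop out
label = fwd_ret.stack()
print(label.head())
```

Note that the last date necessarily has no label, which is why long prediction horizons leave recent rows unlabeled.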
Stage 3: Dataset Preparation & Splitting
3.1 Align Features and Labels
Features (all_features) and labels (label) share the same MultiIndex (datetime, instrument) and can be split directly.
# Select label
label = mll.return_percentage(all_features.index, resample='W', period=1)
# Check alignment
print(f"Feature rows: {len(all_features)}")
print(f"Label rows: {len(label)}")
print(f"Label NaN ratio: {label.isna().mean():.2%}")
3.2 Split Training and Test Sets
# Use time-based splitting (strictly avoid data leakage)
is_train = all_features.index.get_level_values('datetime') < '2023-01-01'
X_train = all_features[is_train]
y_train = label[is_train]
X_test = all_features[~is_train]
print(f"Training set: {len(X_train)} rows ({X_train.index.get_level_values(0).min()} ~ {X_train.index.get_level_values(0).max()})")
print(f"Test set: {len(X_test)} rows ({X_test.index.get_level_values(0).min()} ~ {X_test.index.get_level_values(0).max()})")
print(f"Training feature shape: {X_train.shape}")
Time-Based Splitting is Critical
You must use time-based splitting, not random splitting. Random splitting causes data leakage -- the model sees future data to predict the past, leading to artificially inflated backtest performance.
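The same principle extends from a single cutoff to walk-forward (expanding-window) validation, where each test window strictly follows its training window. A minimal sketch on a toy MultiIndex (walk_forward is a hypothetical helper, not a FinLab API):

```python
import numpy as np
import pandas as pd

# Toy (datetime, instrument) MultiIndex standing in for all_features.index
dates = pd.date_range("2020-01-03", periods=8, freq="QS")
idx = pd.MultiIndex.from_product(
    [dates, ["1101", "2330"]], names=["datetime", "instrument"]
)

def walk_forward(index, n_splits=3):
    """Yield (train_mask, test_mask) boolean arrays; training always
    ends strictly before the test window begins."""
    dt = index.get_level_values("datetime")
    folds = np.array_split(dt.unique().sort_values().to_numpy(), n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = folds[i - 1][-1]  # expanding training window
        yield dt <= train_end, dt.isin(folds[i])

splits = list(walk_forward(idx))
for train_mask, test_mask in splits:
    dt = idx.get_level_values("datetime")
    assert dt[train_mask].max() < dt[test_mask].min()  # no leakage
```

Each fold retrains on all data up to the cutoff and evaluates on the next block, which also mimics how the model would be retrained in production.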
Stage 4: Model Training
4.1 Using the LightGBM Model
import finlab.ml.qlib as q
# Create and train LightGBM model
model = q.LGBModel()
model.fit(X_train, y_train)
print("Training complete!")
4.2 Using Other Models
# XGBoost
model_xgb = q.XGBModel()
model_xgb.fit(X_train, y_train)
# CatBoost
model_cat = q.CatBoostModel()
model_cat.fit(X_train, y_train)
# Linear model (fast validation, less prone to overfitting)
model_linear = q.LinearModel()
model_linear.fit(X_train, y_train)
# Deep learning
model_dnn = q.DNNModel()
model_dnn.fit(X_train, y_train)
# List all available models
models = q.get_models()
print(list(models.keys()))
4.3 Multi-Model Comparison
import finlab.ml.qlib as q
from finlab.backtest import sim
# Quick comparison of multiple models
results = {}
for name, ModelClass in [('LightGBM', q.LGBModel), ('XGBoost', q.XGBModel), ('Linear', q.LinearModel)]:
    model = ModelClass()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    position = y_pred.is_largest(30)
    report = sim(position, resample='W', name=f"ML {name}", upload=False)
    results[name] = report
# Compare performance
for name, report in results.items():
    stats = report.get_stats()
    print(f"{name}: annualized return {stats['daily_mean']:.2%}, Sharpe {stats['daily_sharpe']:.2f}")
Stage 5: Prediction & Position Weight Generation
model.predict() returns a FinlabDataFrame (index = dates, columns = stock symbols), which can directly use FinlabDataFrame methods to convert into positions.
5.1 Generate Predictions
# Predict on test set
y_pred = model.predict(X_test)
print(y_pred.head())
# Output (FinlabDataFrame, index=dates, columns=stock symbols):
# 1101 1102 1103 1216 2330
# 2023-01-06 0.032 0.015 -0.008 0.045 0.023
# 2023-01-13 0.018 0.027 0.003 0.012 0.041
# Check prediction distribution
print(y_pred.stack().describe())
5.2 Convert to Positions
from finlab.backtest import sim
# Method 1: Top N stock selection (buy the N stocks with highest predictions)
position_topn = y_pred.is_largest(30)
# Method 2: Top 20% selection (cross-sectional: compare each date's row to its own quantile)
position_quantile = y_pred.ge(y_pred.quantile(0.8, axis=1), axis=0)
# Method 3: Allocate weights by prediction value (long-only, normalized per date)
position_weighted = y_pred.clip(lower=0)
position_weighted = position_weighted.div(position_weighted.sum(axis=1), axis=0)
print(f"Top 30 strategy average holdings: {position_topn.sum(axis=1).mean():.1f}")
print(f"Top 20% strategy average holdings: {position_quantile.sum(axis=1).mean():.1f}")
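Weighting by raw predictions needs two refinements: negative scores should be excluded for a long-only book, and weights must be normalized per date (each row), not per stock. A minimal pandas sketch with hypothetical predictions:

```python
import pandas as pd

# Hypothetical predictions (dates x symbols), including a negative score
y_pred = pd.DataFrame(
    {"1101": [0.03, 0.01], "1102": [-0.02, 0.02], "2330": [0.05, 0.04]},
    index=pd.to_datetime(["2023-01-06", "2023-01-13"]),
)

# Long-only weights: drop negative scores, then normalize each date (row)
scores = y_pred.clip(lower=0)
weights = scores.div(scores.sum(axis=1), axis=0)

assert (abs(weights.sum(axis=1) - 1) < 1e-9).all()  # each date sums to 1
print(weights.round(3))
```

Dividing by the column-wise sum instead (the pandas default) would normalize each stock across time, which is not a valid portfolio weight.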
Stage 6: Backtest Validation
6.1 Run Backtest
from finlab.backtest import sim
# Backtest Top 30 strategy
report_topn = sim(
position_topn,
resample='W',
name="ML Top 30 Strategy",
upload=False
)
# Backtest weighted strategy
report_weighted = sim(
position_weighted,
resample='W',
name="ML Weighted Strategy",
upload=False
)
# Display performance
report_topn.display()
6.2 Performance Comparison
import pandas as pd
stats_topn = report_topn.get_stats()
stats_weighted = report_weighted.get_stats()
comparison = pd.DataFrame({
    'Top 30 strategy': [
        stats_topn['daily_mean'],
        stats_topn['daily_sharpe'],
        stats_topn['max_drawdown'],
        stats_topn['win_ratio']
    ],
    'Weighted strategy': [
        stats_weighted['daily_mean'],
        stats_weighted['daily_sharpe'],
        stats_weighted['max_drawdown'],
        stats_weighted['win_ratio']
    ]
}, index=['Annualized return', 'Sharpe ratio', 'Max drawdown', 'Win ratio'])
print(comparison)
6.3 In-Depth Analysis
# Liquidity analysis
report_topn.run_analysis('LiquidityAnalysis', required_volume=100000)
# MAE/MFE analysis
report_topn.display_mae_mfe_analysis()
# Period stability
report_topn.run_analysis('PeriodStatsAnalysis')
# Alpha/Beta
report_topn.run_analysis('AlphaBetaAnalysis')
Stage 7: Feature Engineering Iteration & Optimization
7.1 Reduce Feature Count
# Strategy 1: Use fewer technical indicators
features_small = mlf.combine({
'pb': pb_ratio,
'pe': pe_ratio,
'roe': roe,
'talib': mlf.ta(mlf.ta_names(n=1)[:20]) # Only take the first 20 indicators
}, resample='W')
label_small = mll.return_percentage(features_small.index, resample='W', period=1)
is_train_small = features_small.index.get_level_values('datetime') < '2023-01-01'
model_v2 = q.LGBModel()
model_v2.fit(features_small[is_train_small], label_small[is_train_small])
y_pred_v2 = model_v2.predict(features_small[~is_train_small])
position_v2 = y_pred_v2.is_largest(30)
report_v2 = sim(position_v2, resample='W', name="ML V2 compact features", upload=False)
report_v2.display()
7.2 Adjust Label Prediction Period
# Test different prediction periods
for period in [1, 2, 4]:
    label_n = mll.return_percentage(all_features.index, resample='W', period=period)
    model_n = q.LGBModel()
    model_n.fit(X_train, label_n[is_train])
    y_pred_n = model_n.predict(X_test)
    position_n = y_pred_n.is_largest(30)
    report_n = sim(position_n, resample='W', name=f"ML {period}-week horizon", upload=False)
    stats = report_n.get_stats()
    print(f"{period}-week horizon: annualized return {stats['daily_mean']:.2%}, Sharpe {stats['daily_sharpe']:.2f}")
7.3 Try Different Label Types
# Compare return vs excess return labels
label_return = mll.return_percentage(all_features.index, resample='W', period=1)
label_excess = mll.excess_over_median(all_features.index, resample='W', period=1)
for label_name, label_data in [('Return', label_return), ('Excess return', label_excess)]:
    model_cmp = q.LGBModel()
    model_cmp.fit(X_train, label_data[is_train])
    y_pred_cmp = model_cmp.predict(X_test)
    position_cmp = y_pred_cmp.is_largest(30)
    report_cmp = sim(position_cmp, resample='W', name=f"ML {label_name}", upload=False)
    stats = report_cmp.get_stats()
    print(f"{label_name}: annualized return {stats['daily_mean']:.2%}, Sharpe {stats['daily_sharpe']:.2f}")
Stage 8: Live Deployment
8.1 Build a Real-Time Prediction Pipeline
from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q
import pickle
# 1. Train the full model (using all historical data)
features = mlf.combine({
'pb': data.get('price_earning_ratio:股價淨值比'),
'pe': data.get('price_earning_ratio:本益比'),
'roe': data.get('fundamental_features:股東權益報酬率'),
'talib': mlf.ta(mlf.ta_names(n=1)[:20])
}, resample='W')
label = mll.return_percentage(features.index, resample='W', period=1)
model = q.LGBModel()
model.fit(features, label)
# 2. Save the model
with open('ml_model.pkl', 'wb') as f:
    pickle.dump(model, f)
# 3. Load the model and predict latest positions
with open('ml_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
y_pred = loaded_model.predict(features)
# 4. Get latest positions
position = y_pred.is_largest(30)
latest_position = position.iloc[-1]
latest_position = latest_position[latest_position > 0].sort_values(ascending=False)
print("Latest holding suggestions:")
print(latest_position)
8.2 Automated Trading Setup
from finlab.backtest import sim
from finlab.online.sinopac_account import SinopacAccount
from finlab.online.order_executor import OrderExecutor
# Create a script to run weekly (every Monday)
def weekly_rebalance():
    # Recalculate features
    features = mlf.combine({
        'pb': data.get('price_earning_ratio:股價淨值比'),
        'pe': data.get('price_earning_ratio:本益比'),
        'roe': data.get('fundamental_features:股東權益報酬率'),
        'talib': mlf.ta(mlf.ta_names(n=1)[:20])
    }, resample='W')
    # Load the model and predict
    with open('ml_model.pkl', 'rb') as f:
        model = pickle.load(f)
    y_pred = model.predict(features)
    position = y_pred.is_largest(30)
    # Use sim to generate the report
    report = sim(position, resample='W', upload=False)
    # Execute orders: convert the report into target holdings, then place orders
    from finlab.online.order_executor import Position
    account = SinopacAccount(simulation=False)
    target = Position.from_report(report, fund=1000000)
    executor = OrderExecutor(target, account=account)
    executor.create_orders()
# Use cron or a scheduling tool to run weekly_rebalance() periodically
Complete Code Summary
# =============================================================================
# Machine Learning Strategy Complete Example
# =============================================================================
from finlab import data
from finlab.ml import feature as mlf
from finlab.ml import label as mll
import finlab.ml.qlib as q
from finlab.backtest import sim
# 1. Feature Engineering
close = data.get('price:收盤價')
pb = data.get('price_earning_ratio:股價淨值比')
pe = data.get('price_earning_ratio:本益比')
rev = data.get('monthly_revenue:當月營收')
features = mlf.combine({
'pb': pb,
'pe': pe,
'rev_ma3': rev.average(3),
'rev_ma12': rev.average(12),
'talib': mlf.ta(mlf.ta_names(n=1)[:20])
}, resample='W')
# 2. Label Generation
label = mll.return_percentage(features.index, resample='W', period=1)
# 3. Data Splitting
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train = features[is_train]
y_train = label[is_train]
X_test = features[~is_train]
# 4. Model Training
model = q.LGBModel()
model.fit(X_train, y_train)
# 5. Prediction & Positions
y_pred = model.predict(X_test)
position = y_pred.is_largest(30)
# 6. Backtest
report = sim(position, resample='W', name="ML Strategy", upload=False)
report.display()
# 7. Analysis
report.run_analysis('LiquidityAnalysis')
report.display_mae_mfe_analysis()
print("Done!")
Key Takeaways
Feature Engineering Stage
- Use diverse feature sources (fundamental, technical, custom)
- Use mlf.combine() for all merging, ensuring MultiIndex alignment
- Check and handle missing values
- Control the feature count (too many features invites overfitting)
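One simple way to control the feature count is to drop near-duplicate features before training. A sketch on toy data using pairwise correlation (drop_correlated is a hypothetical helper, not a FinLab function):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy feature matrix; 'pe_dup' is deliberately near-identical to 'pe'
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["pb", "pe", "roe"])
X["pe_dup"] = X["pe"] + rng.normal(scale=0.01, size=200)

def drop_correlated(df, threshold=0.95):
    """Drop the later of any feature pair whose |correlation| > threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

X_pruned = drop_correlated(X)
print(list(X_pruned.columns))  # 'pe_dup' removed
```

Redundant features add training cost and overfitting risk without adding information, so pruning them is usually a free win.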
Label Generation Stage
- Use mll.return_percentage() and similar functions, passing features.index
- The resample parameter should match the features
- The prediction period (period) should be reasonable (too short = noisy, too long = hard to predict)
Model Training Stage
- Use wrapper classes like q.LGBModel(), with fit() + predict()
- Split train/test by time (never randomly)
- Start with a simple model (LinearModel) to establish a baseline
Backtest Validation Stage
- predict() returns a FinlabDataFrame; use is_largest() to convert to positions
- Out-of-sample testing is mandatory
- Run in-depth analysis (liquidity, MAE/MFE)
Live Deployment Stage
- Retrain the model periodically (e.g., quarterly)
- Monitor divergence between live and backtest performance
- Set up performance alert mechanisms
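A performance alert can be as simple as comparing rolling means of live returns against the backtest expectation. A sketch on hypothetical return series (drift_alert, the 20-day window, and the max_gap threshold are all illustrative choices):

```python
import pandas as pd

# Hypothetical daily returns: backtest expectation vs live results
backtest_ret = pd.Series([0.004, -0.002, 0.003, 0.005, -0.001] * 12)
live_ret = pd.Series([0.001, -0.006, 0.000, 0.002, -0.004] * 12)

def drift_alert(live, expected, window=20, max_gap=0.002):
    """Alert when the rolling mean of live returns trails the
    backtest expectation by more than max_gap per day."""
    gap = expected.rolling(window).mean() - live.rolling(window).mean()
    return gap.iloc[-1] > max_gap

if drift_alert(live_ret, backtest_ret):
    print("ALERT: live performance trails backtest; consider retraining")
```

In production the same check would read live returns from the broker account and could page you instead of printing.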
Common Error Handling Checklist
During ML strategy development, the following are key error checkpoints:
Stage 1: Feature Engineering
Common Errors:
- resample mismatch between features and labels
- Too many missing values resulting in insufficient training data
- Look-ahead bias (using future data to predict the past)
Validation Methods:
try:
    # 1. Build features
    features = mlf.combine({
        'pb': pb,
        'pe': pe,
        'rev_ma3': rev.average(3)
    }, resample='W')
    if features.empty:
        raise ValueError("❌ Feature DataFrame is empty")
    # 2. Check missing-value ratio
    missing_ratio = features.isna().sum() / len(features)
    high_missing_cols = missing_ratio[missing_ratio > 0.3].index.tolist()
    if high_missing_cols:
        print(f"⚠️ Warning: these features have > 30% missing values: {high_missing_cols}")
        print("Suggestion: drop these features or use forward fill")
    # 3. Check date range
    print(f"Feature date range: {features.index.get_level_values(0).min()} ~ {features.index.get_level_values(0).max()}")
    # 4. Check feature count
    num_features = features.shape[1]
    if num_features > 500:
        print(f"⚠️ Warning: too many features ({num_features}); this invites overfitting")
        print("Suggestion: fewer than 200 features is preferable")
    print(f"✅ Feature engineering done: {num_features} features, {len(features)} rows")
except KeyError as e:
    print(f"❌ Wrong dataset name: {e}")
    print("See https://ai.finlab.tw/database for the correct names")
except ValueError as e:
    print(f"❌ Feature validation failed: {e}")
Detailed Error Handling: See Data Download Error Handling
Stage 2: Label Generation
Common Errors:
- resample mismatch between labels and features
- Passing incorrect index (should pass features.index)
- Unreasonable prediction period setting
Validation Methods:
# Generate labels
label = mll.return_percentage(features.index, resample='W', period=1)
# Check label distribution
print("Label statistics:")
print(label.describe())
# Check label missing values
nan_ratio = label.isna().mean()
if nan_ratio > 0.1:
    print(f"⚠️ Warning: label NaN ratio {nan_ratio:.1%} > 10%")
    print("Likely cause: prediction horizon too long, so recent rows have no label")
print(f"✅ Label generation done: {len(label)} rows")
Stage 3: Model Training
Common Errors:
- Insufficient training data (< 1000 rows)
- Using a random split instead of a time-based split
- Overfitting (test-set performance much worse than training set)
Validation Methods:
# Split train/test sets
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train = features[is_train]
y_train = label[is_train]
X_test = features[~is_train]
# 1. Check data volume
print(f"Training set: {len(X_train)} rows")
print(f"Test set: {len(X_test)} rows")
if len(X_train) < 1000:
    print("⚠️ Warning: insufficient training data (< 1000 rows)")
    print("Suggestion: extend the historical range or lower the resample frequency")
if len(X_test) < 100:
    print("⚠️ Warning: too little test data (< 100 rows)")
# 2. Check date ordering
train_last = X_train.index.get_level_values(0).max()
test_first = X_test.index.get_level_values(0).min()
if train_last >= test_first:
    raise ValueError(
        f"❌ Training and test set dates overlap!\n"
        f"  Last training date: {train_last}\n"
        f"  First test date: {test_first}\n"
        f"  This causes data leakage"
    )
print("✅ Data split is correct")
# 3. Model training
try:
    model = q.LGBModel()
    model.fit(X_train, y_train)
    print("✅ Model training complete")
except Exception as e:
    print(f"❌ Model training failed: {e}")
    print("Check that:")
    print("1. Features contain no NaN or Inf")
    print("2. Labels are numeric")
    print("3. Required packages are installed (pip install lightgbm / xgboost)")
    raise
Stage 4: Prediction & Backtesting
Common Errors:
- Prediction results are all NaN
- Position DataFrame format is incorrect
- Backtest has no trade records
Validation Methods:
# 1. Prediction
y_pred = model.predict(X_test)
if y_pred.isna().all().all():
    raise ValueError("❌ All predictions are NaN")
# Check prediction distribution
print(f"Prediction range: {y_pred.min().min():.4f} ~ {y_pred.max().max():.4f}")
print(f"Prediction mean: {y_pred.stack().mean():.4f}")
# 2. Generate positions
position = y_pred.is_largest(30)
if position.empty:
    raise ValueError("❌ Position DataFrame is empty")
holding_count = position.sum(axis=1).mean()
if holding_count < 10:
    print(f"⚠️ Warning: average holdings {holding_count:.1f} < 10, possibly too few")
print(f"✅ Positions generated: {holding_count:.1f} holdings on average")
# 3. Backtest
try:
    report = sim(position, resample='W', name="ML Strategy", upload=False)
    print("✅ Backtest succeeded")
    stats = report.get_stats()
    print(f"  Annualized return: {stats['daily_mean']:.2%}")
    print(f"  Sharpe ratio: {stats['daily_sharpe']:.2f}")
except Exception as e:
    print(f"❌ Backtest failed: {e}")
    print("Check that:")
    print("1. position's index is a DatetimeIndex")
    print("2. position's columns are stock symbols")
    raise
Risks Specific to ML Strategies
Compared to traditional strategies, ML strategies require extra attention to:
- Data leakage -- using future data to predict the past
- Overfitting -- test set performance much worse than training set
- Model decay -- live performance degrades over time
Recommendations:
- Strictly use time-series splitting (not random splitting)
- Retrain the model periodically (quarterly or monthly)
- Monitor live vs backtest divergence and set up alert mechanisms
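A retraining cadence can be enforced with a small staleness check before each rebalance. This sketch assumes the ml_model.pkl path used earlier in this document; model_is_stale is a hypothetical helper, and the 90-day threshold is an arbitrary example:

```python
import datetime as dt
import pathlib

MODEL_PATH = pathlib.Path("ml_model.pkl")  # path used earlier in this doc
MAX_AGE_DAYS = 90                          # retrain roughly quarterly

def model_is_stale(path=MODEL_PATH, max_age_days=MAX_AGE_DAYS, today=None):
    """True when the saved model file is missing or older than max_age_days."""
    if not path.exists():
        return True
    mtime = dt.date.fromtimestamp(path.stat().st_mtime)
    today = today or dt.date.today()
    return (today - mtime).days > max_age_days

if model_is_stale():
    print("Model is stale: retrain with the latest data before trading")
```

Running this at the top of the weekly rebalance script turns "retrain periodically" from a habit into a guardrail.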