Machine Learning Strategy Complete Workflow
This document provides the complete development workflow for machine learning quantitative strategies, from feature engineering, label generation, and model training to backtest validation and live deployment.
ML Strategy Development -- Recommended with AI Assistant
ML strategy development involves feature engineering, label design, model training, and more. After installing FinLab Skill, the AI coding assistant can provide code examples and debugging help at every stage.
Workflow Overview
graph TD
A[Raw Data] --> B[Feature Engineering<br/>finlab.ml.feature]
B --> C[Label Generation<br/>finlab.ml.label]
C --> D[Dataset Splitting]
D --> E[Model Training<br/>finlab.ml.qlib]
E --> F[Predict Position Weights]
F --> G[Backtest Validation]
G --> H{Performance Acceptable?}
H -->|No| I[Adjust Features / Model]
I --> B
H -->|Yes| J[Out-of-Sample Testing]
J --> K{Validation Passed?}
K -->|No| I
K -->|Yes| L[Deploy to Live Trading]
Stage 1: Feature Engineering
Feature engineering is often the single biggest determinant of an ML strategy's success. FinLab provides powerful feature engineering tools.
1.1 Load and Merge Fundamental Features
from finlab import data
from finlab.ml import feature as mlf
# Load fundamental data
pb_ratio = data.get('price_earning_ratio:股價淨值比')
pe_ratio = data.get('price_earning_ratio:本益比')
roe = data.get('fundamental_features:股東權益報酬率')
roa = data.get('fundamental_features:資產報酬率')
# Combine into feature set
fundamental_features = mlf.combine({
'pb': pb_ratio,
'pe': pe_ratio,
'roe': roe,
'roa': roa
}, resample='W') # Resample to weekly frequency
print(fundamental_features.head())
# Output:
# pb pe roe roa
# (2010-01-04 00:00:00, '1101') 1.47 18.85 7.80 3.21
# (2010-01-04 00:00:00, '1102') 1.44 14.58 9.87 4.15
1.2 Add Technical Indicator Features
# Method 1: Use random technical indicators (explore best indicators)
ta_features = mlf.combine({
'talib': mlf.ta(mlf.ta_names(n=5)) # Randomly generate 5 parameter configurations per indicator
}, resample='W')
# Method 2: Specify particular technical indicators
from finlab import data
close = data.get('price:收盤價')
volume = data.get('price:成交股數')
specific_ta = mlf.combine({
'rsi': close.ta.RSI(timeperiod=14),
'macd': close.ta.MACD(),
'bbands': close.ta.BBANDS(timeperiod=20),
'obv': volume.ta.OBV(close)
}, resample='W')
print(f"Technical indicator feature count: {ta_features.shape[1]}") # e.g., 450 features
1.3 Add Custom Features
# Revenue-related features
rev = data.get('monthly_revenue:當月營收')
rev_yoy = data.get('monthly_revenue:去年同月增減(%)')
custom_features = mlf.combine({
'rev_ma3': rev.average(3), # Trailing 3-month revenue moving average
'rev_ma12': rev.average(12), # Trailing 12-month revenue moving average
'rev_momentum': rev.average(3) / rev.average(12), # Revenue momentum
'rev_yoy': rev_yoy, # Revenue YoY growth
}, resample='W')
print(custom_features.head())
1.4 Merge All Features
# Merge all feature sets
all_features = mlf.combine({
'fundamental': fundamental_features,
'technical': ta_features,
'custom': custom_features
}, resample='W')
print(f"Total feature count: {all_features.shape[1]}") # e.g., 470 features
print(f"Row count: {all_features.shape[0]}") # e.g., 150,000 rows
# Check missing values
missing_ratio = all_features.isna().sum() / len(all_features)
print(f"Missing value ratio:\n{missing_ratio[missing_ratio > 0.5]}") # Show features with > 50% missing
# Remove features with too many missing values
all_features = all_features.loc[:, missing_ratio < 0.5]
print(f"Filtered feature count: {all_features.shape[1]}")
Stage 2: Label Generation
Labels define our prediction target. finlab.ml.label provides various label generation functions, all accepting features.index (MultiIndex) as the first argument.
2.1 Predict Future Returns
from finlab.ml import label as mll
# Predict 1-week future return (most common)
label = mll.return_percentage(all_features.index, resample='W', period=1)
print(label.head())
# Output:
# datetime instrument
# 2010-01-04 00:00:00 1101 0.032
# 1102 -0.015
# 1103 0.021
# dtype: float64
# Check label distribution
print(label.describe())
# Output:
# count 150000.00
# mean 0.005
# std 0.087
# min -0.450
# 25% -0.042
# 50% 0.002
# 75% 0.051
# max 0.520
2.2 Excess Return Labels
# Excess return over market median for the same period
label_excess_median = mll.excess_over_median(all_features.index, resample='W', period=1)
# Excess return over market mean for the same period
label_excess_mean = mll.excess_over_mean(all_features.index, resample='W', period=1)
print(label_excess_median.describe())
2.3 Other Label Types
# Day trading return (open-to-close change)
label_daytrading = mll.daytrading_percentage(all_features.index)
# Risk metric: Maximum Adverse Excursion (max decline during holding period)
label_mae = mll.maximum_adverse_excursion(all_features.index, period=5)
# Risk metric: Maximum Favorable Excursion (max gain during holding period)
label_mfe = mll.maximum_favorable_excursion(all_features.index, period=5)
# Multi-period prediction (predict different time horizons)
label_1w = mll.return_percentage(all_features.index, resample='W', period=1)
label_2w = mll.return_percentage(all_features.index, resample='W', period=2)
label_4w = mll.return_percentage(all_features.index, resample='W', period=4)
Label Selection Recommendations
- return_percentage: Most commonly used, directly predicts returns
- excess_over_median: Predicts relative performance, reduces market-wide movement impact
- daytrading_percentage: Suitable for day trading strategies
- maximum_adverse_excursion: Suitable for risk management models
- period should match the strategy rebalancing frequency: e.g., resample='W', period=1 predicts 1-week return
Stage 3: Dataset Preparation & Splitting
3.1 Align Features and Labels
Features (all_features) and labels (label) share the same MultiIndex (datetime, instrument) and can be split directly.
# Select label
label = mll.return_percentage(all_features.index, resample='W', period=1)
# Check alignment
print(f"Feature row count: {len(all_features)}")
print(f"Label row count: {len(label)}")
print(f"Label NaN ratio: {label.isna().mean():.2%}")
3.2 Split Training and Test Sets
# Use time-based splitting (strictly avoid data leakage)
is_train = all_features.index.get_level_values('datetime') < '2023-01-01'
X_train = all_features[is_train]
y_train = label[is_train]
X_test = all_features[~is_train]
print(f"Training set: {len(X_train)} rows ({X_train.index.get_level_values(0).min()} ~ {X_train.index.get_level_values(0).max()})")
print(f"Test set: {len(X_test)} rows ({X_test.index.get_level_values(0).min()} ~ {X_test.index.get_level_values(0).max()})")
print(f"Training feature shape: {X_train.shape}")
Time-Based Splitting is Critical
You must use time-based splitting, not random splitting. Random splitting causes data leakage -- the model sees future data to predict the past, leading to artificially inflated backtest performance.
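A quick way to go beyond a single cutoff date is walk-forward validation: several folds where every training date strictly precedes every test date. The sketch below is plain pandas, not a FinLab API; the helper `walk_forward_splits` and its fold sizes are illustrative:

```python
import pandas as pd

# Illustrative weekly dates; in practice these come from features.index
dates = pd.date_range('2020-01-06', periods=10, freq='W-MON')

def walk_forward_splits(dates, n_folds=3, test_size=2):
    """Yield (train_dates, test_dates) pairs where every train date
    strictly precedes every test date -- no leakage by construction."""
    dates = pd.DatetimeIndex(sorted(set(dates)))
    for k in range(n_folds):
        test_start = len(dates) - (n_folds - k) * test_size
        yield dates[:test_start], dates[test_start:test_start + test_size]

for train_d, test_d in walk_forward_splits(dates):
    assert train_d.max() < test_d.min()  # time ordering holds in every fold
    print(len(train_d), 'train weeks ->', len(test_d), 'test weeks')
```

Each fold's date sets can then drive the same fit/predict/backtest loop used in Stage 4, giving several out-of-sample estimates instead of one.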
Stage 4: Model Training
4.1 Using the LightGBM Model
import finlab.ml.qlib as q
# Create and train LightGBM model
model = q.LGBModel()
model.fit(X_train, y_train)
print("Training complete!")
4.2 Using Other Models
# XGBoost
model_xgb = q.XGBModel()
model_xgb.fit(X_train, y_train)
# CatBoost
model_cat = q.CatBoostModel()
model_cat.fit(X_train, y_train)
# Linear model (fast validation, less prone to overfitting)
model_linear = q.LinearModel()
model_linear.fit(X_train, y_train)
# Deep learning
model_dnn = q.DNNModel()
model_dnn.fit(X_train, y_train)
# List all available models
models = q.get_models()
print(list(models.keys()))
4.3 Multi-Model Comparison
import finlab.ml.qlib as q
from finlab.backtest import sim
# Quick comparison of multiple models
results = {}
for name, ModelClass in [('LightGBM', q.LGBModel), ('XGBoost', q.XGBModel), ('Linear', q.LinearModel)]:
model = ModelClass()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
position = y_pred.is_largest(30)
report = sim(position, resample='W', name=f"ML {name}", upload=False)
results[name] = report
# Compare performance
for name, report in results.items():
stats = report.get_stats()
print(f"{name}: annualized return {stats['daily_mean']:.2%}, Sharpe ratio {stats['daily_sharpe']:.2f}")
Stage 5: Prediction & Position Weight Generation
model.predict() returns a FinlabDataFrame (index = dates, columns = stock symbols), which can directly use FinlabDataFrame methods to convert into positions.
5.1 Generate Predictions
# Predict on test set
y_pred = model.predict(X_test)
print(y_pred.head())
# Output (FinlabDataFrame, index=dates, columns=stock symbols):
# 1101 1102 1103 1216 2330
# 2023-01-06 0.032 0.015 -0.008 0.045 0.023
# 2023-01-13 0.018 0.027 0.003 0.012 0.041
# Check prediction distribution
print(y_pred.stack().describe())
5.2 Convert to Positions
from finlab.backtest import sim
# Method 1: Top N stock selection (buy the N stocks with highest predictions)
position_topn = y_pred.is_largest(30)
# Method 2: Top 20% selection
position_quantile = y_pred > y_pred.quantile(0.8)
# Method 3: Allocate weights in proportion to prediction values
# (clip negatives, then normalize each date's row so weights sum to 1)
position_weighted = y_pred.clip(lower=0)
position_weighted = position_weighted.div(position_weighted.sum(axis=1), axis=0)
print(f"Top 30 strategy average holdings: {position_topn.sum(axis=1).mean():.1f}")
print(f"Top 20% strategy average holdings: {position_quantile.sum(axis=1).mean():.1f}")
Stage 6: Backtest Validation
6.1 Run Backtest
from finlab.backtest import sim
# Backtest Top 30 strategy
report_topn = sim(
position_topn,
resample='W',
name="ML Top 30 Strategy",
upload=False
)
# Backtest weighted strategy
report_weighted = sim(
position_weighted,
resample='W',
name="ML Weighted Strategy",
upload=False
)
# Display performance
report_topn.display()
6.2 Performance Comparison
import pandas as pd
stats_topn = report_topn.get_stats()
stats_weighted = report_weighted.get_stats()
comparison = pd.DataFrame({
'Top 30 Strategy': [
stats_topn['daily_mean'],
stats_topn['daily_sharpe'],
stats_topn['max_drawdown'],
stats_topn['win_ratio']
],
'Weighted Strategy': [
stats_weighted['daily_mean'],
stats_weighted['daily_sharpe'],
stats_weighted['max_drawdown'],
stats_weighted['win_ratio']
]
}, index=['Annualized Return', 'Sharpe Ratio', 'Max Drawdown', 'Win Rate'])
print(comparison)
6.3 In-Depth Analysis
# Liquidity analysis
report_topn.run_analysis('LiquidityAnalysis', required_volume=100000)
# MAE/MFE analysis
report_topn.display_mae_mfe_analysis()
# Period stability
report_topn.run_analysis('PeriodStatsAnalysis')
# Alpha/Beta
report_topn.run_analysis('AlphaBetaAnalysis')
Stage 7: Feature Engineering Iteration & Optimization
7.1 Reduce Feature Count
# Strategy 1: Use fewer technical indicators
features_small = mlf.combine({
'pb': pb_ratio,
'pe': pe_ratio,
'roe': roe,
'talib': mlf.ta(mlf.ta_names(n=1)[:20]) # Only take the first 20 indicators
}, resample='W')
label_small = mll.return_percentage(features_small.index, resample='W', period=1)
is_train_small = features_small.index.get_level_values('datetime') < '2023-01-01'
model_v2 = q.LGBModel()
model_v2.fit(features_small[is_train_small], label_small[is_train_small])
y_pred_v2 = model_v2.predict(features_small[~is_train_small])
position_v2 = y_pred_v2.is_largest(30)
report_v2 = sim(position_v2, resample='W', name="ML V2 Reduced Features", upload=False)
report_v2.display()
7.2 Adjust Label Prediction Period
# Test different prediction periods
for period in [1, 2, 4]:
label_n = mll.return_percentage(all_features.index, resample='W', period=period)
model_n = q.LGBModel()
model_n.fit(X_train, label_n[is_train])
y_pred_n = model_n.predict(X_test)
position_n = y_pred_n.is_largest(30)
report_n = sim(position_n, resample='W', name=f"ML Predict {period}W", upload=False)
stats = report_n.get_stats()
print(f"Predict {period}W: annualized return {stats['daily_mean']:.2%}, Sharpe ratio {stats['daily_sharpe']:.2f}")
7.3 Try Different Label Types
# Compare return vs excess return labels
label_return = mll.return_percentage(all_features.index, resample='W', period=1)
label_excess = mll.excess_over_median(all_features.index, resample='W', period=1)
for label_name, label_data in [('Return', label_return), ('Excess Return', label_excess)]:
model_cmp = q.LGBModel()
model_cmp.fit(X_train, label_data[is_train])
y_pred_cmp = model_cmp.predict(X_test)
position_cmp = y_pred_cmp.is_largest(30)
report_cmp = sim(position_cmp, resample='W', name=f"ML {label_name}", upload=False)
stats = report_cmp.get_stats()
print(f"{label_name}: annualized return {stats['daily_mean']:.2%}, Sharpe ratio {stats['daily_sharpe']:.2f}")
Stage 8: Live Deployment
8.1 Build a Real-Time Prediction Pipeline
from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q
import pickle
# 1. Train the full model (using all historical data)
features = mlf.combine({
'pb': data.get('price_earning_ratio:股價淨值比'),
'pe': data.get('price_earning_ratio:本益比'),
'roe': data.get('fundamental_features:股東權益報酬率'),
'talib': mlf.ta(mlf.ta_names(n=1)[:20])
}, resample='W')
label = mll.return_percentage(features.index, resample='W', period=1)
model = q.LGBModel()
model.fit(features, label)
# 2. Save the model
with open('ml_model.pkl', 'wb') as f:
pickle.dump(model, f)
# 3. Load model and predict latest positions
with open('ml_model.pkl', 'rb') as f:
loaded_model = pickle.load(f)
y_pred = loaded_model.predict(features)
# 4. Get latest positions
position = y_pred.is_largest(30)
latest_mask = position.iloc[-1]  # boolean row: which symbols are held this week
# Sort the held symbols by predicted score (the mask itself is only True/False)
latest_position = y_pred.iloc[-1][latest_mask].sort_values(ascending=False)
print("Latest position recommendations:")
print(latest_position)
8.2 Automated Trading Setup
from finlab.backtest import sim
from finlab.online.sinopac_account import SinopacAccount
from finlab.online.order_executor import OrderExecutor
# Create a script to run weekly (every Monday)
def weekly_rebalance():
# Recalculate features
features = mlf.combine({
'pb': data.get('price_earning_ratio:股價淨值比'),
'pe': data.get('price_earning_ratio:本益比'),
'roe': data.get('fundamental_features:股東權益報酬率'),
'talib': mlf.ta(mlf.ta_names(n=1)[:20])
}, resample='W')
# Load model and predict
with open('ml_model.pkl', 'rb') as f:
model = pickle.load(f)
y_pred = model.predict(features)
position = y_pred.is_largest(30)
# Use sim to generate report
report = sim(position, resample='W', upload=False)
# Execute orders
account = SinopacAccount(simulation=False)
executor = OrderExecutor(report=report, account=account, fund=1000000)
executor.execute()
# Use cron or a scheduling tool to run weekly_rebalance() periodically
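If cron is unavailable, the same weekly cadence can be driven from plain Python. This is a minimal stdlib sketch of the timing logic only; `run_forever` blocks forever, and the Monday-09:00 target is an assumption, not a FinLab requirement:

```python
import datetime
import time

def next_monday_9am(now: datetime.datetime) -> datetime.datetime:
    """Next Monday 09:00 strictly after `now` (Monday == weekday 0)."""
    days_ahead = (0 - now.weekday()) % 7
    candidate = (now + datetime.timedelta(days=days_ahead)).replace(
        hour=9, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += datetime.timedelta(days=7)
    return candidate

def run_forever(job):
    """Blocking loop: sleep until the next Monday 09:00, then run `job`."""
    while True:
        target = next_monday_9am(datetime.datetime.now())
        time.sleep((target - datetime.datetime.now()).total_seconds())
        job()  # e.g. weekly_rebalance()

print(next_monday_9am(datetime.datetime(2024, 5, 1, 12, 0)))  # 2024-05-06 09:00:00
```

A process supervisor (systemd, supervisord) around `run_forever(weekly_rebalance)` is still advisable so the loop restarts after crashes.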
Complete Code Summary
# =============================================================================
# Machine Learning Strategy Complete Example
# =============================================================================
from finlab import data
from finlab.ml import feature as mlf
from finlab.ml import label as mll
import finlab.ml.qlib as q
from finlab.backtest import sim
# 1. Feature Engineering
close = data.get('price:收盤價')
pb = data.get('price_earning_ratio:股價淨值比')
pe = data.get('price_earning_ratio:本益比')
rev = data.get('monthly_revenue:當月營收')
features = mlf.combine({
'pb': pb,
'pe': pe,
'rev_ma3': rev.average(3),
'rev_ma12': rev.average(12),
'talib': mlf.ta(mlf.ta_names(n=1)[:20])
}, resample='W')
# 2. Label Generation
label = mll.return_percentage(features.index, resample='W', period=1)
# 3. Data Splitting
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train = features[is_train]
y_train = label[is_train]
X_test = features[~is_train]
# 4. Model Training
model = q.LGBModel()
model.fit(X_train, y_train)
# 5. Prediction & Positions
y_pred = model.predict(X_test)
position = y_pred.is_largest(30)
# 6. Backtest
report = sim(position, resample='W', name="ML Strategy", upload=False)
report.display()
# 7. Analysis
report.run_analysis('LiquidityAnalysis')
report.display_mae_mfe_analysis()
print("Done!")
Key Takeaways
Feature Engineering Stage
- Use diverse feature sources (fundamental, technical, custom)
- Use mlf.combine() to unify merging, ensuring MultiIndex alignment
- Check and handle missing values
- Control feature count (too many leads to overfitting)
Label Generation Stage
- Use mll.return_percentage() and similar functions, passing features.index
- The resample parameter should match the features
- The prediction period (period) should be reasonable (too short = noisy, too long = hard to predict)
Model Training Stage
- Use wrapper classes like q.LGBModel(), with fit() + predict()
- Time-based train/test split (not random split)
- Start with a simple model (LinearModel) to establish a baseline
Backtest Validation Stage
- predict() returns a FinlabDataFrame; use is_largest() to convert to positions
- Out-of-sample testing is mandatory
- Run in-depth analysis (liquidity, MAE/MFE)
Live Deployment Stage
- Retrain the model periodically (e.g., quarterly)
- Monitor divergence between live and backtest performance
- Set up performance alert mechanisms
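The retraining cadence above can be enforced mechanically by checking the saved model's file age before each prediction run. The 90-day threshold and the `ml_model.pkl` filename follow this document's earlier examples, but `model_is_stale` is our own helper, not a FinLab function:

```python
import os
import time

def model_is_stale(path: str, max_age_days: int = 90) -> bool:
    """True if the saved model file is missing or older than max_age_days."""
    if not os.path.exists(path):
        return True
    age_days = (time.time() - os.path.getmtime(path)) / 86400
    return age_days > max_age_days

# Guard a prediction run: refuse to trade on an outdated model
if model_is_stale('ml_model.pkl'):
    print('Model stale or missing -- retrain before predicting')
```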
Common Error Handling Checklist
During ML strategy development, the following are key error checkpoints:
Stage 1: Feature Engineering
Common Errors:
- resample mismatch between features and labels
- Too many missing values resulting in insufficient training data
- Look-ahead bias (using future data to predict the past)
Validation Methods:
try:
# 1. Build features
features = mlf.combine({
'pb': pb,
'pe': pe,
'rev_ma3': rev.average(3)
}, resample='W')
if features.empty:
raise ValueError("❌ Feature DataFrame is empty")
# 2. Check missing value ratio
missing_ratio = features.isna().sum() / len(features)
high_missing_cols = missing_ratio[missing_ratio > 0.3].index.tolist()
if high_missing_cols:
print(f"⚠️ Warning: the following features have > 30% missing values: {high_missing_cols}")
print("Suggestion: drop these features or apply forward fill")
# 3. Check date range
print(f"Feature date range: {features.index.get_level_values(0).min()} ~ {features.index.get_level_values(0).max()}")
# 4. Check feature count
num_features = features.shape[1]
if num_features > 500:
print(f"⚠️ Warning: too many features ({num_features}), may cause overfitting")
print("Suggestion: prefer < 200 features")
print(f"✅ Feature engineering complete: {num_features} features, {len(features)} rows")
except KeyError as e:
print(f"❌ Invalid data table name: {e}")
print("Please visit https://ai.finlab.tw/database to confirm the correct name")
except ValueError as e:
print(f"❌ Feature validation failed: {e}")
Detailed Error Handling: See Data Download Error Handling
Stage 2: Label Generation
Common Errors:
- resample mismatch between labels and features
- Passing incorrect index (should pass features.index)
- Unreasonable prediction period setting
Validation Methods:
# Generate labels
label = mll.return_percentage(features.index, resample='W', period=1)
# Check label distribution
print("Label statistics:")
print(label.describe())
# Check label missing values
nan_ratio = label.isna().mean()
if nan_ratio > 0.1:
print(f"⚠️ Warning: label missing ratio {nan_ratio:.1%} > 10%")
print("Likely cause: prediction period too long, recent rows have no label")
print(f"✅ Label generation complete: {len(label)} rows")
Stage 3: Model Training
Common Errors:
- Insufficient training data (< 1000 rows)
- Using random split instead of time-based split
- Overfitting (test set performance much worse than training set)
Validation Methods:
# Split train/test sets
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train = features[is_train]
y_train = label[is_train]
X_test = features[~is_train]
# 1. Check data volume
print(f"Training set: {len(X_train)} rows")
print(f"Test set: {len(X_test)} rows")
if len(X_train) < 1000:
print("⚠️ Warning: insufficient training data (< 1000 rows)")
print("Suggestion: extend the historical range or lower the resample frequency")
if len(X_test) < 100:
print("⚠️ Warning: too few test rows (< 100)")
# 2. Check date ordering
train_last = X_train.index.get_level_values(0).max()
test_first = X_test.index.get_level_values(0).min()
if train_last >= test_first:
raise ValueError(
f"❌ Training and test set dates overlap!\n"
f" Training set last date: {train_last}\n"
f" Test set first date: {test_first}\n"
f" This will cause data leakage"
)
print(f"✅ Dataset split is correct")
# 3. Model training
try:
model = q.LGBModel()
model.fit(X_train, y_train)
print(f"✅ Model training complete")
except Exception as e:
print(f"❌ Model training failed: {e}")
print("Please check:")
print("1. Whether features contain NaN or Inf")
print("2. Whether labels are numeric")
print("3. Whether related packages are installed correctly (pip install lightgbm / xgboost)")
raise
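The overfitting symptom listed above (test performance far below training performance) can be quantified by comparing rank IC, the per-date Spearman correlation between predictions and realized labels, on the training versus test sets. The helper below is a plain-pandas sketch on tiny synthetic data, not a FinLab function:

```python
import pandas as pd

def mean_rank_ic(pred: pd.Series, actual: pd.Series) -> float:
    """Average per-date Spearman rank correlation (rank IC) between predicted
    and realized values on a (datetime, instrument) MultiIndex."""
    df = pd.concat({'pred': pred, 'actual': actual}, axis=1).dropna()
    daily_ic = df.groupby(level=0).apply(
        lambda g: g['pred'].corr(g['actual'], method='spearman'))
    return float(daily_ic.mean())

# Tiny synthetic example: predictions rank stocks exactly like realized returns
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2023-01-06', '2023-01-13']), ['1101', '1102', '1103']],
    names=['datetime', 'instrument'])
pred = pd.Series([0.03, 0.01, -0.02, 0.02, 0.04, 0.00], index=idx)
actual = pd.Series([0.06, 0.02, -0.04, 0.04, 0.08, 0.00], index=idx)
print(mean_rank_ic(pred, actual))  # 1.0 -- identical ranking on both dates
```

In practice, compute `mean_rank_ic` once on training predictions versus `y_train` and once on test predictions versus the test labels; a large gap between the two numbers is a concrete overfitting signal.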
Stage 4: Prediction & Backtesting
Common Errors:
- Prediction results are all NaN
- Position DataFrame format is incorrect
- Backtest has no trade records
Validation Methods:
# 1. Prediction
y_pred = model.predict(X_test)
if y_pred.isna().all().all():
raise ValueError("❌ All predictions are NaN")
# Check prediction distribution
print(f"Prediction range: {y_pred.min().min():.4f} ~ {y_pred.max().max():.4f}")
print(f"Prediction mean: {y_pred.stack().mean():.4f}")
# 2. Generate positions
position = y_pred.is_largest(30)
if position.empty:
raise ValueError("❌ Position DataFrame is empty")
holding_count = position.sum(axis=1).mean()
if holding_count < 10:
print(f"⚠️ Warning: average holding count {holding_count:.1f} < 10, possibly too few")
print(f"✅ Positions generated successfully: average {holding_count:.1f} holdings")
# 3. Backtest
try:
report = sim(position, resample='W', name="ML Strategy", upload=False)
print(f"✅ Backtest succeeded")
stats = report.get_stats()
print(f" Annualized return: {stats['daily_mean']:.2%}")
print(f" Sharpe ratio: {stats['daily_sharpe']:.2f}")
except Exception as e:
print(f"❌ Backtest failed: {e}")
print("Please check:")
print("1. Whether position.index is a DatetimeIndex")
print("2. Whether position.columns are stock symbols")
raise
Risks Specific to ML Strategies
Compared to traditional strategies, ML strategies require extra attention to:
- Data leakage -- using future data to predict the past
- Overfitting -- test set performance much worse than training set
- Model decay -- live performance degrades over time
Recommendations:
- Strictly use time-series splitting (not random splitting)
- Retrain the model periodically (quarterly or monthly)
- Monitor live vs backtest divergence and set up alert mechanisms
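The divergence monitoring recommended above can be made concrete as a rolling tracking error between live and backtest daily returns. `tracking_alert`, the 20-day window, and the 2% threshold are all illustrative assumptions, shown here on synthetic numbers:

```python
import numpy as np
import pandas as pd

def tracking_alert(live: pd.Series, backtest: pd.Series,
                   window: int = 20, threshold: float = 0.02) -> bool:
    """Alert when the rolling std of (live - backtest) daily returns
    exceeds `threshold` over the last `window` days."""
    diff = (live - backtest).dropna()
    tracking_error = diff.rolling(window).std().iloc[-1]
    return bool(tracking_error > threshold)

# Synthetic example: live tracks backtest closely -> no alert
rng = np.random.default_rng(0)
idx = pd.date_range('2024-01-01', periods=60, freq='B')
backtest = pd.Series(rng.normal(0.0005, 0.01, 60), index=idx)
live = backtest + rng.normal(0, 0.001, 60)  # small execution noise only
print(tracking_alert(live, backtest))  # False
```

When the alert fires, a reasonable response is to pause new entries and investigate (fills, slippage, data staleness) before retraining.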