Machine Learning Strategy Complete Workflow
This document provides the complete development workflow for machine learning quantitative strategies, from feature engineering, label generation, and model training to backtest validation and live deployment.
ML Strategy Development -- Recommended with AI Assistant
ML strategy development involves feature engineering, label design, model training, and more. After installing FinLab Skill, the AI coding assistant can provide code examples and debugging help at every stage.
Workflow Overview
graph TD
A[Raw data] --> B[Feature engineering<br/>finlab.ml.feature]
B --> C[Label generation<br/>finlab.ml.label]
C --> D[Dataset splitting]
D --> E[Model training<br/>finlab.ml.qlib]
E --> F[Predicted position weights]
F --> G[Backtest validation]
G --> H{Performance satisfactory?}
H -->|No| I[Adjust features/model]
I --> B
H -->|Yes| J[Out-of-sample test]
J --> K{Passed validation?}
K -->|No| I
K -->|Yes| L[Live deployment]
Stage 1: Feature Engineering
Feature engineering is the single biggest determinant of an ML strategy's success. FinLab provides powerful feature engineering tools.
1.1 Load and Merge Fundamental Features
from finlab import data
from finlab.ml import feature as mlf
# Load fundamental data
pb_ratio = data.get('price_earning_ratio:股價淨值比')
pe_ratio = data.get('price_earning_ratio:本益比')
roe = data.get('fundamental_features:股東權益報酬率')
roa = data.get('fundamental_features:資產報酬率')
# Combine into feature set
fundamental_features = mlf.combine({
'pb': pb_ratio,
'pe': pe_ratio,
'roe': roe,
'roa': roa
}, resample='W')  # resample to weekly data
print(fundamental_features.head())
# Output:
# pb pe roe roa
# (2010-01-04 00:00:00, '1101') 1.47 18.85 7.80 3.21
# (2010-01-04 00:00:00, '1102') 1.44 14.58 9.87 4.15
1.2 Add Technical Indicator Features
# Method 1: Use random technical indicators (explore best indicators)
ta_features = mlf.combine({
    'talib': mlf.ta(mlf.ta_names(n=5))  # randomly generate 5 parameter configurations per indicator
}, resample='W')
# Method 2: Specify particular technical indicators
from finlab import data
close = data.get('price:收盤價')
volume = data.get('price:成交股數')
specific_ta = mlf.combine({
'rsi': close.ta.RSI(timeperiod=14),
'macd': close.ta.MACD(),
'bbands': close.ta.BBANDS(timeperiod=20),
'obv': volume.ta.OBV(close)
}, resample='W')
print(f"Number of TA features: {ta_features.shape[1]}")  # e.g., 450 features
1.3 Add Custom Features
# Revenue-related features
rev = data.get('monthly_revenue:當月營收')
rev_yoy = data.get('monthly_revenue:去年同月增減(%)')
custom_features = mlf.combine({
    'rev_ma3': rev.average(3),    # 3-month revenue moving average
    'rev_ma12': rev.average(12),  # 12-month revenue moving average
    'rev_momentum': rev.average(3) / rev.average(12),  # revenue momentum
    'rev_yoy': rev_yoy,  # revenue year-over-year growth
}, resample='W')
print(custom_features.head())
1.4 Merge All Features
# Merge all feature sets
all_features = mlf.combine({
'fundamental': fundamental_features,
'technical': ta_features,
'custom': custom_features
}, resample='W')
print(f"Total features: {all_features.shape[1]}")  # e.g., 470 features
print(f"Number of rows: {all_features.shape[0]}")  # e.g., 150,000 rows
# Check missing values
missing_ratio = all_features.isna().sum() / len(all_features)
print(f"Missing-value ratios:\n{missing_ratio[missing_ratio > 0.5]}")  # show features with > 50% missing
# Remove features with too many missing values
all_features = all_features.loc[:, missing_ratio < 0.5]
print(f"Features after filtering: {all_features.shape[1]}")
Stage 2: Label Generation
Labels define our prediction target. finlab.ml.label provides various label generation functions, all accepting features.index (MultiIndex) as the first argument.
2.1 Predict Future Returns
from finlab.ml import label as mll
# Predict 1-week future return (most common)
label = mll.return_percentage(all_features.index, resample='W', period=1)
print(label.head())
# Output:
# datetime instrument
# 2010-01-04 00:00:00 1101 0.032
# 1102 -0.015
# 1103 0.021
# dtype: float64
# Check label distribution
print(label.describe())
# Output:
# count 150000.00
# mean 0.005
# std 0.087
# min -0.450
# 25% -0.042
# 50% 0.002
# 75% 0.051
# max 0.520
2.2 Excess Return Labels
# Excess return over market median for the same period
label_excess_median = mll.excess_over_median(all_features.index, resample='W', period=1)
# Excess return over market mean for the same period
label_excess_mean = mll.excess_over_mean(all_features.index, resample='W', period=1)
print(label_excess_median.describe())
2.3 Other Label Types
# Day trading return (open-to-close change)
label_daytrading = mll.daytrading_percentage(all_features.index)
# Risk metric: Maximum Adverse Excursion (max decline during holding period)
label_mae = mll.maximum_adverse_excursion(all_features.index, period=5)
# Risk metric: Maximum Favorable Excursion (max gain during holding period)
label_mfe = mll.maximum_favorable_excursion(all_features.index, period=5)
# Multi-period prediction (predict different time horizons)
label_1w = mll.return_percentage(all_features.index, resample='W', period=1)
label_2w = mll.return_percentage(all_features.index, resample='W', period=2)
label_4w = mll.return_percentage(all_features.index, resample='W', period=4)
Label Selection Recommendations
- return_percentage: most commonly used; directly predicts returns
- excess_over_median: predicts relative performance, reducing the impact of market-wide moves
- daytrading_percentage: suited to day-trading strategies
- maximum_adverse_excursion: suited to risk-management models
- period should match the strategy's rebalancing frequency: e.g., resample='W', period=1 predicts the 1-week return
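Conceptually, a forward-return label with period=1 is just the next resampled price divided by the current one. The pandas sketch below illustrates the idea on hypothetical weekly closes; it is not the FinLab implementation, only the shape and arithmetic the labels follow:

```python
import pandas as pd

# Hypothetical weekly closes for two stocks
close = pd.DataFrame(
    {"1101": [50.0, 51.0, 49.5, 52.0], "2330": [600.0, 612.0, 606.0, 630.0]},
    index=pd.date_range("2023-01-06", periods=4, freq="W-FRI"),
)

# 1-period forward return: what a period=1 return label predicts
fwd_ret = close.shift(-1) / close - 1

# Stack into the (datetime, instrument) MultiIndex shape that labels use;
# the final date has no forward return, so its rows drop out
label = fwd_ret.stack()
print(label.head())
```

Note that the last date necessarily has no label, which is why long prediction horizons leave recent rows unlabeled.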
Stage 3: Dataset Preparation & Splitting
3.1 Align Features and Labels
Features (all_features) and labels (label) share the same MultiIndex (datetime, instrument) and can be split directly.
# Select label
label = mll.return_percentage(all_features.index, resample='W', period=1)
# Check alignment
print(f"Feature rows: {len(all_features)}")
print(f"Label rows: {len(label)}")
print(f"Label NaN ratio: {label.isna().mean():.2%}")
3.2 Split Training and Test Sets
# Use time-based splitting (strictly avoid data leakage)
is_train = all_features.index.get_level_values('datetime') < '2023-01-01'
X_train = all_features[is_train]
y_train = label[is_train]
X_test = all_features[~is_train]
print(f"Training set: {len(X_train)} rows ({X_train.index.get_level_values(0).min()} ~ {X_train.index.get_level_values(0).max()})")
print(f"Test set: {len(X_test)} rows ({X_test.index.get_level_values(0).min()} ~ {X_test.index.get_level_values(0).max()})")
print(f"Training feature shape: {X_train.shape}")
Time-Based Splitting is Critical
You must use time-based splitting, not random splitting. Random splitting causes data leakage -- the model sees future data to predict the past, leading to artificially inflated backtest performance.
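The same principle extends from a single cutoff to walk-forward (expanding-window) validation, where each test window strictly follows its training window. A minimal sketch on a toy MultiIndex (walk_forward is a hypothetical helper, not a FinLab API):

```python
import numpy as np
import pandas as pd

# Toy (datetime, instrument) MultiIndex standing in for all_features.index
dates = pd.date_range("2020-01-03", periods=8, freq="QS")
idx = pd.MultiIndex.from_product(
    [dates, ["1101", "2330"]], names=["datetime", "instrument"]
)

def walk_forward(index, n_splits=3):
    """Yield (train_mask, test_mask) boolean arrays; training always
    ends strictly before the test window begins."""
    dt = index.get_level_values("datetime")
    folds = np.array_split(dt.unique().sort_values().to_numpy(), n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = folds[i - 1][-1]  # expanding training window
        yield dt <= train_end, dt.isin(folds[i])

splits = list(walk_forward(idx))
for train_mask, test_mask in splits:
    dt = idx.get_level_values("datetime")
    assert dt[train_mask].max() < dt[test_mask].min()  # no leakage
```

Each fold retrains on all data up to the cutoff and evaluates on the next block, which also mimics how the model would be retrained in production.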
Stage 4: Model Training
4.1 Using the LightGBM Model
import finlab.ml.qlib as q
# Create and train LightGBM model
model = q.LGBModel()
model.fit(X_train, y_train)
print("Training complete!")
4.2 Using Other Models
# XGBoost
model_xgb = q.XGBModel()
model_xgb.fit(X_train, y_train)
# CatBoost
model_cat = q.CatBoostModel()
model_cat.fit(X_train, y_train)
# Linear model (fast validation, less prone to overfitting)
model_linear = q.LinearModel()
model_linear.fit(X_train, y_train)
# Deep learning
model_dnn = q.DNNModel()
model_dnn.fit(X_train, y_train)
# List all available models
models = q.get_models()
print(list(models.keys()))
4.3 Multi-Model Comparison
import finlab.ml.qlib as q
from finlab.backtest import sim
# Quick comparison of multiple models
results = {}
for name, ModelClass in [('LightGBM', q.LGBModel), ('XGBoost', q.XGBModel), ('Linear', q.LinearModel)]:
    model = ModelClass()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    position = y_pred.is_largest(30)
    report = sim(position, resample='W', name=f"ML {name}", upload=False)
    results[name] = report
# Compare performance
for name, report in results.items():
    stats = report.get_stats()
    print(f"{name}: annualized return {stats['daily_mean']:.2%}, Sharpe {stats['daily_sharpe']:.2f}")
Stage 5: Prediction & Position Weight Generation
model.predict() returns a FinlabDataFrame (index = dates, columns = stock symbols), which can directly use FinlabDataFrame methods to convert into positions.
5.1 Generate Predictions
# Predict on test set
y_pred = model.predict(X_test)
print(y_pred.head())
# Output (FinlabDataFrame, index=dates, columns=stock symbols):
# 1101 1102 1103 1216 2330
# 2023-01-06 0.032 0.015 -0.008 0.045 0.023
# 2023-01-13 0.018 0.027 0.003 0.012 0.041
# Check prediction distribution
print(y_pred.stack().describe())
5.2 Convert to Positions
from finlab.backtest import sim
# Method 1: Top N stock selection (buy the N stocks with highest predictions)
position_topn = y_pred.is_largest(30)
# Method 2: Top 20% selection (cross-sectional: compare each date's row to its own quantile)
position_quantile = y_pred.ge(y_pred.quantile(0.8, axis=1), axis=0)
# Method 3: Allocate weights by prediction value (long-only, normalized per date)
position_weighted = y_pred.clip(lower=0)
position_weighted = position_weighted.div(position_weighted.sum(axis=1), axis=0)
print(f"Top 30 strategy average holdings: {position_topn.sum(axis=1).mean():.1f}")
print(f"Top 20% strategy average holdings: {position_quantile.sum(axis=1).mean():.1f}")
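Weighting by raw predictions needs two refinements: negative scores should be excluded for a long-only book, and weights must be normalized per date (each row), not per stock. A minimal pandas sketch with hypothetical predictions:

```python
import pandas as pd

# Hypothetical predictions (dates x symbols), including a negative score
y_pred = pd.DataFrame(
    {"1101": [0.03, 0.01], "1102": [-0.02, 0.02], "2330": [0.05, 0.04]},
    index=pd.to_datetime(["2023-01-06", "2023-01-13"]),
)

# Long-only weights: drop negative scores, then normalize each date (row)
scores = y_pred.clip(lower=0)
weights = scores.div(scores.sum(axis=1), axis=0)

assert (abs(weights.sum(axis=1) - 1) < 1e-9).all()  # each date sums to 1
print(weights.round(3))
```

Dividing by the column-wise sum instead (the pandas default) would normalize each stock across time, which is not a valid portfolio weight.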
Stage 6: Backtest Validation
6.1 Run Backtest
from finlab.backtest import sim
# Backtest Top 30 strategy
report_topn = sim(
position_topn,
resample='W',
name="ML Top 30 Strategy",
upload=False
)
# Backtest weighted strategy
report_weighted = sim(
position_weighted,
resample='W',
name="ML Weighted Strategy",
upload=False
)
# Display performance
report_topn.display()
6.2 Performance Comparison
import pandas as pd
stats_topn = report_topn.get_stats()
stats_weighted = report_weighted.get_stats()
comparison = pd.DataFrame({
    'Top 30 strategy': [
        stats_topn['daily_mean'],
        stats_topn['daily_sharpe'],
        stats_topn['max_drawdown'],
        stats_topn['win_ratio']
    ],
    'Weighted strategy': [
        stats_weighted['daily_mean'],
        stats_weighted['daily_sharpe'],
        stats_weighted['max_drawdown'],
        stats_weighted['win_ratio']
    ]
}, index=['Annualized return', 'Sharpe ratio', 'Max drawdown', 'Win ratio'])
print(comparison)
6.3 In-Depth Analysis
# Liquidity analysis
report_topn.run_analysis('LiquidityAnalysis', required_volume=100000)
# MAE/MFE analysis
report_topn.display_mae_mfe_analysis()
# Period stability
report_topn.run_analysis('PeriodStatsAnalysis')
# Alpha/Beta
report_topn.run_analysis('AlphaBetaAnalysis')
Stage 7: Feature Engineering Iteration & Optimization
7.1 Reduce Feature Count
# Strategy 1: Use fewer technical indicators
features_small = mlf.combine({
'pb': pb_ratio,
'pe': pe_ratio,
'roe': roe,
'talib': mlf.ta(mlf.ta_names(n=1)[:20]) # Only take the first 20 indicators
}, resample='W')
label_small = mll.return_percentage(features_small.index, resample='W', period=1)
is_train_small = features_small.index.get_level_values('datetime') < '2023-01-01'
model_v2 = q.LGBModel()
model_v2.fit(features_small[is_train_small], label_small[is_train_small])
y_pred_v2 = model_v2.predict(features_small[~is_train_small])
position_v2 = y_pred_v2.is_largest(30)
report_v2 = sim(position_v2, resample='W', name="ML V2 compact features", upload=False)
report_v2.display()
7.2 Adjust Label Prediction Period
# Test different prediction periods
for period in [1, 2, 4]:
    label_n = mll.return_percentage(all_features.index, resample='W', period=period)
    model_n = q.LGBModel()
    model_n.fit(X_train, label_n[is_train])
    y_pred_n = model_n.predict(X_test)
    position_n = y_pred_n.is_largest(30)
    report_n = sim(position_n, resample='W', name=f"ML {period}-week horizon", upload=False)
    stats = report_n.get_stats()
    print(f"{period}-week horizon: annualized return {stats['daily_mean']:.2%}, Sharpe {stats['daily_sharpe']:.2f}")
7.3 Try Different Label Types
# Compare return vs excess return labels
label_return = mll.return_percentage(all_features.index, resample='W', period=1)
label_excess = mll.excess_over_median(all_features.index, resample='W', period=1)
for label_name, label_data in [('Return', label_return), ('Excess return', label_excess)]:
    model_cmp = q.LGBModel()
    model_cmp.fit(X_train, label_data[is_train])
    y_pred_cmp = model_cmp.predict(X_test)
    position_cmp = y_pred_cmp.is_largest(30)
    report_cmp = sim(position_cmp, resample='W', name=f"ML {label_name}", upload=False)
    stats = report_cmp.get_stats()
    print(f"{label_name}: annualized return {stats['daily_mean']:.2%}, Sharpe {stats['daily_sharpe']:.2f}")
Stage 8: Live Deployment
8.1 Build a Real-Time Prediction Pipeline
from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q
import pickle
# 1. Train the full model (using all historical data)
features = mlf.combine({
'pb': data.get('price_earning_ratio:股價淨值比'),
'pe': data.get('price_earning_ratio:本益比'),
'roe': data.get('fundamental_features:股東權益報酬率'),
'talib': mlf.ta(mlf.ta_names(n=1)[:20])
}, resample='W')
label = mll.return_percentage(features.index, resample='W', period=1)
model = q.LGBModel()
model.fit(features, label)
# 2. Save the model
with open('ml_model.pkl', 'wb') as f:
    pickle.dump(model, f)
# 3. Load the model and predict latest positions
with open('ml_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
y_pred = loaded_model.predict(features)
# 4. Get latest positions
position = y_pred.is_largest(30)
latest_position = position.iloc[-1]
latest_position = latest_position[latest_position > 0].sort_values(ascending=False)
print("Latest holding suggestions:")
print(latest_position)
8.2 Automated Trading Setup
from finlab.backtest import sim
from finlab.online.sinopac_account import SinopacAccount
from finlab.online.order_executor import OrderExecutor
# Create a script to run weekly (every Monday)
def weekly_rebalance():
    # Recalculate features
    features = mlf.combine({
        'pb': data.get('price_earning_ratio:股價淨值比'),
        'pe': data.get('price_earning_ratio:本益比'),
        'roe': data.get('fundamental_features:股東權益報酬率'),
        'talib': mlf.ta(mlf.ta_names(n=1)[:20])
    }, resample='W')
    # Load the model and predict
    with open('ml_model.pkl', 'rb') as f:
        model = pickle.load(f)
    y_pred = model.predict(features)
    position = y_pred.is_largest(30)
    # Use sim to generate the report
    report = sim(position, resample='W', upload=False)
    # Execute orders: convert the report into target holdings, then place orders
    from finlab.online.order_executor import Position
    account = SinopacAccount(simulation=False)
    target = Position.from_report(report, fund=1000000)
    executor = OrderExecutor(target, account=account)
    executor.create_orders()
# Use cron or a scheduling tool to run weekly_rebalance() periodically
Complete Code Summary
# =============================================================================
# Machine Learning Strategy Complete Example
# =============================================================================
from finlab import data
from finlab.ml import feature as mlf
from finlab.ml import label as mll
import finlab.ml.qlib as q
from finlab.backtest import sim
# 1. Feature Engineering
close = data.get('price:收盤價')
pb = data.get('price_earning_ratio:股價淨值比')
pe = data.get('price_earning_ratio:本益比')
rev = data.get('monthly_revenue:當月營收')
features = mlf.combine({
'pb': pb,
'pe': pe,
'rev_ma3': rev.average(3),
'rev_ma12': rev.average(12),
'talib': mlf.ta(mlf.ta_names(n=1)[:20])
}, resample='W')
# 2. Label Generation
label = mll.return_percentage(features.index, resample='W', period=1)
# 3. Data Splitting
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train = features[is_train]
y_train = label[is_train]
X_test = features[~is_train]
# 4. Model Training
model = q.LGBModel()
model.fit(X_train, y_train)
# 5. Prediction & Positions
y_pred = model.predict(X_test)
position = y_pred.is_largest(30)
# 6. Backtest
report = sim(position, resample='W', name="ML Strategy", upload=False)
report.display()
# 7. Analysis
report.run_analysis('LiquidityAnalysis')
report.display_mae_mfe_analysis()
print("Done!")
Key Takeaways
Feature Engineering Stage
- Use diverse feature sources (fundamental, technical, custom)
- Use mlf.combine() for all merging, ensuring MultiIndex alignment
- Check and handle missing values
- Control the feature count (too many features invites overfitting)
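One simple way to control the feature count is to drop near-duplicate features before training. A sketch on toy data using pairwise correlation (drop_correlated is a hypothetical helper, not a FinLab function):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy feature matrix; 'pe_dup' is deliberately near-identical to 'pe'
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["pb", "pe", "roe"])
X["pe_dup"] = X["pe"] + rng.normal(scale=0.01, size=200)

def drop_correlated(df, threshold=0.95):
    """Drop the later of any feature pair whose |correlation| > threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

X_pruned = drop_correlated(X)
print(list(X_pruned.columns))  # 'pe_dup' removed
```

Redundant features add training cost and overfitting risk without adding information, so pruning them is usually a free win.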
Label Generation Stage
- Use mll.return_percentage() and similar functions, passing features.index
- The resample parameter should match the features
- The prediction period (period) should be reasonable (too short = noisy, too long = hard to predict)
Model Training Stage
- Use wrapper classes like q.LGBModel(), with fit() + predict()
- Split train/test by time (never randomly)
- Start with a simple model (LinearModel) to establish a baseline
Backtest Validation Stage
- predict() returns a FinlabDataFrame; use is_largest() to convert to positions
- Out-of-sample testing is mandatory
- Run in-depth analysis (liquidity, MAE/MFE)
Live Deployment Stage
- Retrain the model periodically (e.g., quarterly)
- Monitor divergence between live and backtest performance
- Set up performance alert mechanisms
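A performance alert can be as simple as comparing rolling means of live returns against the backtest expectation. A sketch on hypothetical return series (drift_alert, the 20-day window, and the max_gap threshold are all illustrative choices):

```python
import pandas as pd

# Hypothetical daily returns: backtest expectation vs live results
backtest_ret = pd.Series([0.004, -0.002, 0.003, 0.005, -0.001] * 12)
live_ret = pd.Series([0.001, -0.006, 0.000, 0.002, -0.004] * 12)

def drift_alert(live, expected, window=20, max_gap=0.002):
    """Alert when the rolling mean of live returns trails the
    backtest expectation by more than max_gap per day."""
    gap = expected.rolling(window).mean() - live.rolling(window).mean()
    return gap.iloc[-1] > max_gap

if drift_alert(live_ret, backtest_ret):
    print("ALERT: live performance trails backtest; consider retraining")
```

In production the same check would read live returns from the broker account and could page you instead of printing.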
Common Error Handling Checklist
During ML strategy development, the following are key error checkpoints:
Stage 1: Feature Engineering
Common Errors:
- resample mismatch between features and labels
- Too many missing values resulting in insufficient training data
- Look-ahead bias (using future data to predict the past)
Validation Methods:
try:
    # 1. Build features
    features = mlf.combine({
        'pb': pb,
        'pe': pe,
        'rev_ma3': rev.average(3)
    }, resample='W')
    if features.empty:
        raise ValueError("❌ Feature DataFrame is empty")
    # 2. Check missing-value ratio
    missing_ratio = features.isna().sum() / len(features)
    high_missing_cols = missing_ratio[missing_ratio > 0.3].index.tolist()
    if high_missing_cols:
        print(f"⚠️ Warning: these features have > 30% missing values: {high_missing_cols}")
        print("Suggestion: drop these features or use forward fill")
    # 3. Check date range
    print(f"Feature date range: {features.index.get_level_values(0).min()} ~ {features.index.get_level_values(0).max()}")
    # 4. Check feature count
    num_features = features.shape[1]
    if num_features > 500:
        print(f"⚠️ Warning: too many features ({num_features}); this invites overfitting")
        print("Suggestion: fewer than 200 features is preferable")
    print(f"✅ Feature engineering done: {num_features} features, {len(features)} rows")
except KeyError as e:
    print(f"❌ Wrong dataset name: {e}")
    print("See https://ai.finlab.tw/database for the correct names")
except ValueError as e:
    print(f"❌ Feature validation failed: {e}")
Detailed Error Handling: See Data Download Error Handling
Stage 2: Label Generation
Common Errors:
- resample mismatch between labels and features
- Passing incorrect index (should pass features.index)
- Unreasonable prediction period setting
Validation Methods:
# Generate labels
label = mll.return_percentage(features.index, resample='W', period=1)
# Check label distribution
print("Label statistics:")
print(label.describe())
# Check label missing values
nan_ratio = label.isna().mean()
if nan_ratio > 0.1:
    print(f"⚠️ Warning: label NaN ratio {nan_ratio:.1%} > 10%")
    print("Likely cause: prediction horizon too long, so recent rows have no label")
print(f"✅ Label generation done: {len(label)} rows")
Stage 3: Model Training
Common Errors:
- Insufficient training data (< 1000 rows)
- Using a random split instead of a time-based split
- Overfitting (test-set performance much worse than training set)
Validation Methods:
# Split train/test sets
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train = features[is_train]
y_train = label[is_train]
X_test = features[~is_train]
# 1. Check data volume
print(f"Training set: {len(X_train)} rows")
print(f"Test set: {len(X_test)} rows")
if len(X_train) < 1000:
    print("⚠️ Warning: insufficient training data (< 1000 rows)")
    print("Suggestion: extend the historical range or lower the resample frequency")
if len(X_test) < 100:
    print("⚠️ Warning: too little test data (< 100 rows)")
# 2. Check date ordering
train_last = X_train.index.get_level_values(0).max()
test_first = X_test.index.get_level_values(0).min()
if train_last >= test_first:
    raise ValueError(
        f"❌ Training and test set dates overlap!\n"
        f"  Last training date: {train_last}\n"
        f"  First test date: {test_first}\n"
        f"  This causes data leakage"
    )
print("✅ Data split is correct")
# 3. Model training
try:
    model = q.LGBModel()
    model.fit(X_train, y_train)
    print("✅ Model training complete")
except Exception as e:
    print(f"❌ Model training failed: {e}")
    print("Check that:")
    print("1. Features contain no NaN or Inf")
    print("2. Labels are numeric")
    print("3. Required packages are installed (pip install lightgbm / xgboost)")
    raise
Stage 4: Prediction & Backtesting
Common Errors:
- Prediction results are all NaN
- Position DataFrame format is incorrect
- Backtest has no trade records
Validation Methods:
# 1. Prediction
y_pred = model.predict(X_test)
if y_pred.isna().all().all():
    raise ValueError("❌ All predictions are NaN")
# Check prediction distribution
print(f"Prediction range: {y_pred.min().min():.4f} ~ {y_pred.max().max():.4f}")
print(f"Prediction mean: {y_pred.stack().mean():.4f}")
# 2. Generate positions
position = y_pred.is_largest(30)
if position.empty:
    raise ValueError("❌ Position DataFrame is empty")
holding_count = position.sum(axis=1).mean()
if holding_count < 10:
    print(f"⚠️ Warning: average holdings {holding_count:.1f} < 10, possibly too few")
print(f"✅ Positions generated: {holding_count:.1f} holdings on average")
# 3. Backtest
try:
    report = sim(position, resample='W', name="ML Strategy", upload=False)
    print("✅ Backtest succeeded")
    stats = report.get_stats()
    print(f"  Annualized return: {stats['daily_mean']:.2%}")
    print(f"  Sharpe ratio: {stats['daily_sharpe']:.2f}")
except Exception as e:
    print(f"❌ Backtest failed: {e}")
    print("Check that:")
    print("1. position's index is a DatetimeIndex")
    print("2. position's columns are stock symbols")
    raise
Risks Specific to ML Strategies
Compared to traditional strategies, ML strategies require extra attention to:
- Data leakage -- using future data to predict the past
- Overfitting -- test set performance much worse than training set
- Model decay -- live performance degrades over time
Recommendations:
- Strictly use time-series splitting (not random splitting)
- Retrain the model periodically (quarterly or monthly)
- Monitor live vs backtest divergence and set up alert mechanisms
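A retraining cadence can be enforced with a small staleness check before each rebalance. This sketch assumes the ml_model.pkl path used earlier in this document; model_is_stale is a hypothetical helper, and the 90-day threshold is an arbitrary example:

```python
import datetime as dt
import pathlib

MODEL_PATH = pathlib.Path("ml_model.pkl")  # path used earlier in this doc
MAX_AGE_DAYS = 90                          # retrain roughly quarterly

def model_is_stale(path=MODEL_PATH, max_age_days=MAX_AGE_DAYS, today=None):
    """True when the saved model file is missing or older than max_age_days."""
    if not path.exists():
        return True
    mtime = dt.date.fromtimestamp(path.stat().st_mtime)
    today = today or dt.date.today()
    return (today - mtime).days > max_age_days

if model_is_stale():
    print("Model is stale: retrain with the latest data before trading")
```

Running this at the top of the weekly rebalance script turns "retrain periodically" from a habit into a guardrail.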