finlab.ml
Machine learning module providing feature engineering, label generation, and model training integration.
Use Cases
- Build feature sets for ML stock selection strategies
- Generate technical indicators (TA-Lib integration)
- Design training labels (returns, risk metrics)
- Integrate with the qlib framework for model training
- Predict future stock performance
Quick Examples
Feature Engineering
from finlab import data
from finlab.ml import feature as mlf
# Combine fundamental features
features = mlf.combine({
'pb': data.get('price_earning_ratio:股價淨值比'),
'pe': data.get('price_earning_ratio:本益比'),
'roe': data.get('fundamental_features:股東權益報酬率')
}, resample='W')
# Add technical indicators
features_ta = mlf.combine({
'fundamental': features,
'technical': mlf.ta(mlf.ta_names(n=10))
}, resample='W')
Label Generation
from finlab.ml import label as mll
# Generate future 1-week return labels
label = mll.daytrading_percentage(
features.index,
period=1,
resample='W'
)
qlib Model Training
import finlab.ml.qlib as q
# Split train/test sets
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]
# Train LightGBM model
model = q.LGBModel()
model.fit(X_train, y_train)
# Predict and convert to position weights
pred = model.predict(X_test)
position = pred.is_largest(30) # Buy top 30
Detailed Guide
See Machine Learning Strategy Development for:
- Complete ML strategy development workflow
- Feature engineering best practices
- Label design techniques
- Model training and optimization
- Overfitting prevention
API Reference
finlab.ml.feature
Feature engineering module for combining and processing various features.
combine()
finlab.ml.feature.combine
The combine function takes a dictionary of features as input and combines them into a single pandas DataFrame.
| PARAMETER | DESCRIPTION |
|---|---|
| features | A dictionary whose values are DataFrames (datetime index, one column per instrument) or callables that return such DataFrames. |
| resample | Optional frequency used to resample the data in the features. Defaults to None. |
| sample_filter | A boolean mask, indexed by date with one column per instrument, that filters the feature samples. |
| **kwargs | Additional keyword arguments passed to the resampler function. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | A pandas DataFrame containing all the input features combined. |
Examples:
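This example shows how to use the finlab.ml.feature and finlab.data modules to combine features such as RSI and the price-to-book ratio. f.combine does the merging: each dictionary key is a feature name and each value is the corresponding data. The 'rsi' feature comes from data.indicator('RSI'), which computes the Relative Strength Index, and the 'pb' feature comes from data.get('price_earning_ratio:股價淨值比'), which fetches the price-to-book ratio. The result is a single DataFrame containing these features.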
from finlab import data
import finlab.ml.feature as f
import finlab.ml.qlib as q
features = f.combine({
    # Fetch a feature directly with data.get
    'pb': data.get('price_earning_ratio:股價淨值比'),
    # Build a technical-indicator feature with data.indicator
    'rsi': data.indicator('RSI'),
    # Enumerate many talib indicators with f.ta
    'talib': f.ta(f.ta_names()),
    # Build technical features with qlib Alpha158 (run q.init() and q.dump() first)
    'qlib158': q.alpha('Alpha158')
})
features.head()
| datetime | instrument | rsi | pb |
|---|---|---|---|
| 2020-01-01 | 1101 | 0 | 2 |
| 2020-01-02 | 1102 | 100 | 3 |
| 2020-01-03 | 1108 | 100 | 4 |
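The output shape above can be reproduced with plain pandas. The sketch below is not finlab's implementation. It only assumes, as the sample output suggests, that each input is a wide table (dates as index, instruments as columns) and that the result is indexed by (datetime, instrument) with one column per feature. The helper name `stack_features` and the toy values are hypothetical.

```python
import pandas as pd

def stack_features(features: dict) -> pd.DataFrame:
    """Stack wide (date x instrument) frames into one long frame.

    Hypothetical helper for illustration only; not part of finlab.
    """
    columns = {}
    for name, wide in features.items():
        # Turn each wide frame into a Series indexed by (datetime, instrument).
        long = wide.stack()
        long.index = long.index.set_names(["datetime", "instrument"])
        columns[name] = long
    # Concatenate along columns; rows align on the shared MultiIndex.
    return pd.concat(columns, axis=1)

dates = pd.to_datetime(["2020-01-01", "2020-01-02"])
rsi = pd.DataFrame({"1101": [30.0, 55.0], "1102": [70.0, 65.0]}, index=dates)
pb = pd.DataFrame({"1101": [2.0, 2.1], "1102": [3.0, 3.2]}, index=dates)

combined = stack_features({"rsi": rsi, "pb": pb})
print(combined)
```

Each row is one (date, instrument) sample, which is the layout the train/test split in the quick example relies on when it calls `features.index.get_level_values('datetime')`.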
Usage Examples:
from finlab import data
from finlab.ml import feature as mlf
# Example 1: Combine fundamental features
features = mlf.combine({
'pb': data.get('price_earning_ratio:股價淨值比'),
'pe': data.get('price_earning_ratio:本益比'),
'roe': data.get('fundamental_features:股東權益報酬率')
}, resample='W')
# Example 2: Combine technical indicators
features = mlf.combine({
'talib': mlf.ta(['talib.RSI__period14__', 'talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__'])
}, resample='D')
# Example 3: Mix multiple feature types
# (pb, pe and custom_feature_df are assumed to be DataFrames you have already loaded)
features = mlf.combine({
'fundamental': mlf.combine({'pb': pb, 'pe': pe}),
'technical': mlf.ta(mlf.ta_names(n=5)),
'custom': custom_feature_df
}, resample='W')
resample Parameter
- 'D': Daily
- 'W': Weekly (Friday)
- 'M': Monthly (month-end)
- Features and labels must use the same resample frequency!
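To see what weekly resampling does to a daily table, here is a minimal pandas sketch. It assumes the resampler keeps the last observation of each period and uses the pandas alias 'W-FRI' to mimic the Friday week-end described above. finlab's own resampler may behave differently.

```python
import pandas as pd

# Ten consecutive business days starting on Monday 2024-01-01.
dates = pd.bdate_range("2024-01-01", periods=10)
daily = pd.DataFrame({"2330": list(range(10))}, index=dates, dtype=float)

# 'W-FRI' closes each weekly bin on Friday; .last() keeps the final value in the bin.
weekly = daily.resample("W-FRI").last()
print(weekly)
```

The ten daily rows collapse into two weekly rows dated on Fridays, which is why a label built at a different frequency can no longer be paired with these features.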
ta()
finlab.ml.feature.ta
ta(feature_names, factories=None, resample=None, start_time=None, end_time=None, adj=False, cpu=-1, **kwargs)
Calculate technical indicator values for a list of feature names.
| PARAMETER | DESCRIPTION |
|---|---|
| feature_names | A list of technical indicator feature names. Defaults to None. |
| factories | A dictionary of factories to generate technical indicators. Defaults to {"talib": TalibIndicatorFactory()}. |
| resample | The frequency to resample the data to. Defaults to None. |
| start_time | The start time of the data. Defaults to None. |
| end_time | The end time of the data. Defaults to None. |
| **kwargs | Additional keyword arguments to pass to the resampler function. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | pd.DataFrame: technical indicator feature names and their corresponding values. |
Technical Indicator Computation:
from finlab.ml import feature as mlf
# Compute specific indicators
indicators = mlf.ta([
'talib.RSI__period14__',
'talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__',
'talib.BBANDS__timeperiod20_nbdevup2_nbdevdn2__upperband__'
], resample='W')
# Auto-generate random indicator combinations
random_indicators = mlf.ta(mlf.ta_names(n=10), resample='W')
ta_names()
finlab.ml.feature.ta_names
Generate a list of technical indicator feature names.
| PARAMETER | DESCRIPTION |
|---|---|
| lb | The lower bound of the multiplier of the default parameter for the technical indicators. |
| ub | The upper bound of the multiplier of the default parameter for the technical indicators. |
| n | The number of random samples for each technical indicator. |
| factory | A factory object to generate technical indicators. Defaults to TalibIndicatorFactory. |

| RETURNS | DESCRIPTION |
|---|---|
| List[str] | A list of technical indicator feature names. |
Examples:
import finlab.ml.feature as f
# method 1: generate each indicator with random parameters
features = f.ta()
# method 2: generate specific indicator
feature_names = ['talib.MACD__macdhist__fastperiod__52__slowperiod__212__signalperiod__75__']
features = f.ta(feature_names, resample='W')
# method 3: generate some indicator
feature_names = f.ta_names()
features = f.ta(feature_names)
Generate Indicator Name List:
from finlab.ml import feature as mlf
# Generate all TA-Lib indicators with 10 random parameter sets each
indicator_names = mlf.ta_names(n=10)
print(f"Total {len(indicator_names)} indicators")
# View examples
for name in indicator_names[:5]:
    print(name)
# Output:
# talib.RSI__period14__
# talib.RSI__period7__
# talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__
# ...
Recommended n Values
- n=1: Default parameters for each indicator (~100 indicators)
- n=5: 5 random parameter sets per indicator (~500 indicators)
- n=10: More variety but longer computation time (~1000 indicators)
- Start with n=1 to test feasibility, then increase
Notes
- Too many indicators can lead to:
- Long computation time
- High memory usage
- Increased overfitting risk
- Recommend using feature selection to reduce indicator count
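One simple form of feature selection is to drop indicators that are nearly duplicates of ones already kept. The sketch below is a generic pandas approach, not a finlab API. The helper `drop_correlated`, the 0.95 threshold, and the synthetic column names are all hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Greedily keep a column only if its absolute correlation with
    every previously kept column is at most `threshold`."""
    corr = features.corr().abs()
    kept = []
    for col in features.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return features[kept]

rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "rsi_14": base,
    "rsi_15": base + rng.normal(scale=0.01, size=200),  # near-duplicate of rsi_14
    "macd": rng.normal(size=200),                       # independent of the others
})

selected = drop_correlated(demo)
print(list(selected.columns))
```

Randomly parameterized indicators from ta_names() often differ only slightly in their lookback window, so this kind of pruning can shrink the feature set considerably before training.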
finlab.ml.label
Label generation module for designing machine learning target variables.
daytrading_percentage()
finlab.ml.label.daytrading_percentage
Calculate the percentage change of market prices over a given period.
| PARAMETER | DESCRIPTION |
|---|---|
| index | A multi-level index of datetime and instrument. |
| resample | The resample frequency for the output data. Defaults to None. |
| period | The number of periods to calculate the percentage change over. Defaults to 1. |
| trade_at_price | The price for execution. Defaults to |
| **kwargs | Additional arguments to be passed to the resampler function. |

| RETURNS | DESCRIPTION |
|---|---|
| Series | pd.Series: A pd.Series containing the percentage change of stock prices. |
Predict Future N-Period Returns:
from finlab.ml import feature as mlf, label as mll
# Build features
features = mlf.combine({...}, resample='W')
# Generate label: predict next 1-week return
label = mll.daytrading_percentage(
features.index,
period=1,
resample='W'
)
# Generate label: predict next 4-week return
label_4w = mll.daytrading_percentage(
features.index,
period=4,
resample='W'
)
Choosing the period Parameter
- period=1: Short-term prediction (suitable for weekly/monthly rebalancing)
- period=4: Medium-term prediction (suitable for quarterly rebalancing)
- Larger period: More stable prediction but slower signal
- Recommendation: Match with strategy rebalancing frequency
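The effect of period is easiest to see on a toy price table. The sketch below computes a generic N-period forward return with pandas. It is not finlab's implementation (which also takes trade_at_price into account), and the helper `forward_return` and the prices are hypothetical.

```python
import pandas as pd

def forward_return(close: pd.DataFrame, period: int = 1) -> pd.DataFrame:
    """Return close[t + period] / close[t] - 1 for every date t.

    The last `period` rows are NaN because their future price is unknown.
    """
    return close.shift(-period) / close - 1

dates = pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-19", "2024-01-26"])
close = pd.DataFrame({"2330": [100.0, 110.0, 99.0, 108.9]}, index=dates)

label_1 = forward_return(close, period=1)  # next-week return
label_2 = forward_return(close, period=2)  # two-week return
print(label_1)
print(label_2)
```

A longer period smooths out single-week swings, but it also leaves more trailing rows without a label, so fewer recent samples are available for training.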
maximum_adverse_excursion()
finlab.ml.label.maximum_adverse_excursion
Calculate the maximum adverse excursion of market prices over a given period.
| PARAMETER | DESCRIPTION |
|---|---|
| index | A multi-level index of datetime and instrument. |
| resample | The resample frequency for the output data. Defaults to None. |
| period | The number of periods to calculate the percentage change over. Defaults to 1. |
| trade_at_price | The price for execution. Defaults to |
| **kwargs | Additional arguments to be passed to the resampler function. |

| RETURNS | DESCRIPTION |
|---|---|
| Series | pd.Series: A pd.Series containing the percentage change of stock prices. |
Maximum Adverse Excursion (Risk Metric):
from finlab.ml import label as mll
# Compute maximum drawdown over next N periods
mae_label = mll.maximum_adverse_excursion(
features.index,
period=5
)
# Can be used to train risk prediction models
# More negative values indicate higher risk
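As a rough illustration of the metric, the sketch below computes the worst return relative to the entry price over the next `period` bars. This is one common definition of maximum adverse excursion and is an assumption here. finlab's exact calculation may differ, and the helper name and prices are hypothetical.

```python
import pandas as pd

def max_adverse_excursion(close: pd.Series, period: int) -> pd.Series:
    """Worst return versus the entry price over the next `period` bars."""
    # One column per look-ahead horizon: return after 1 bar, 2 bars, ...
    future = pd.concat(
        {k: close.shift(-k) / close - 1 for k in range(1, period + 1)}, axis=1
    )
    # Take the minimum across horizons, keeping only rows with a full window.
    return future.min(axis=1).where(future.notna().all(axis=1))

close = pd.Series([100.0, 95.0, 105.0, 90.0, 110.0])
mae = max_adverse_excursion(close, period=2)
print(mae)
```

The last two rows are NaN because fewer than two future bars exist for them, so they would be dropped before training.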
finlab.ml.qlib
qlib framework integration module providing multiple machine learning models for stock prediction.
Model Classes
finlab.ml.qlib provides the following models:
finlab.ml.qlib.LGBModel
finlab.ml.qlib.XGBModel
finlab.ml.qlib.CatBoostModel
finlab.ml.qlib.LinearModel
Basic Usage Example:
from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q
# 1. Prepare features
features = mlf.combine({
'pb': data.get('price_earning_ratio:股價淨值比'),
'pe': data.get('price_earning_ratio:本益比')
}, resample='W')
# 2. Prepare labels
label = mll.return_percentage(features.index, resample='W', period=1)
# 3. Split train/test sets
is_train = features.index.get_level_values('datetime') < '2020-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]
# 4. Build and train model
model = q.XGBModel() # Options: q.LGBModel(), q.CatBoostModel(), q.LinearModel()
model.fit(X_train, y_train)
# 5. Predict
y_pred = model.predict(X_test)
Supported Model Types:
| Model | Description | Pros | Cons |
|---|---|---|---|
| LGBModel() | LightGBM | Fast, good performance, low memory usage | Requires lightgbm |
| XGBModel() | XGBoost | Stable, high interpretability | Slower training |
| CatBoostModel() | CatBoost | Handles categorical features well, low overfitting risk | High memory usage |
| LinearModel() | Linear Regression | Simple, fast, low overfitting risk | Usually lower performance |
Model Selection Recommendations
- Beginners: LGBModel() (balanced speed and performance)
- Stability-focused: XGBModel()
- Categorical features: CatBoostModel()
- Quick idea validation: LinearModel()
- Avoid overfitting: Start with LinearModel() as a baseline, then try complex models
Common Mistakes
- Data leakage: Ensure train and test sets do not overlap in time (use a temporal split, not a random split)
- Overfitting: Test set performance is far worse than training set performance (an information coefficient (IC) below 0.02 may indicate overfitting)
- Look-ahead bias: Build labels from the period after the features (for example with .shift(-1) or period=1) so that the features never contain information from the label window
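The checks above can be sketched with plain pandas. The example below performs a temporal split and computes a per-date rank IC (Spearman correlation between predictions and realized labels). It assumes predictions and labels are Series indexed by (datetime, instrument). The helper `rank_ic` and the toy values are hypothetical and not part of finlab.

```python
import pandas as pd

def rank_ic(pred: pd.Series, label: pd.Series) -> pd.Series:
    """Per-date Spearman correlation between predictions and labels."""
    df = pd.DataFrame({"pred": pred, "label": label}).dropna()
    # Rank within each date, then take the Pearson correlation of the ranks.
    ranks = df.groupby(level="datetime").rank()
    return ranks.groupby(level="datetime").apply(
        lambda g: g["pred"].corr(g["label"])
    )

dates = pd.to_datetime(["2022-12-30", "2023-01-06"])
index = pd.MultiIndex.from_product(
    [dates, ["1101", "1102", "2330"]], names=["datetime", "instrument"]
)
label = pd.Series([0.01, 0.02, 0.03, 0.03, 0.02, 0.01], index=index)
pred = pd.Series([0.1, 0.2, 0.3, 0.1, 0.2, 0.3], index=index)

# Temporal split: everything before the cutoff trains, the rest tests.
is_train = index.get_level_values("datetime") < "2023-01-01"
ic_train = rank_ic(pred[is_train], label[is_train])
ic_test = rank_ic(pred[~is_train], label[~is_train])
print(ic_train.mean(), ic_test.mean())
```

In this toy data the ranking is perfect before the cutoff and inverted after it, which is the pattern (strong in-sample, weak out-of-sample) that the overfitting warning describes.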
Complete Example
Building an ML Stock Selection Strategy
from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q
from finlab.backtest import sim
# Step 1: Build feature set
print("Building features...")
features = mlf.combine({
# Fundamental features
'pb': data.get('price_earning_ratio:股價淨值比'),
'pe': data.get('price_earning_ratio:本益比'),
'roe': data.get('fundamental_features:股東權益報酬率'),
# Technical indicators (use few to avoid overfitting)
'technical': mlf.ta(mlf.ta_names(n=1)[:20]) # Only first 20
}, resample='W')
# Step 2: Generate labels
print("Generating labels...")
label = mll.daytrading_percentage(
features.index,
period=1,
resample='W'
)
# Step 3: Split train/test sets
print("Splitting data...")
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]
# Step 4: Train model
print("Training model...")
model = q.LGBModel()
model.fit(X_train, y_train)
# Step 5: Predict
print("Generating predictions...")
pred = model.predict(X_test)
# Step 6: Convert to trading signals
print("Generating trading signals...")
position = pred.is_largest(30) # Buy stocks with highest predicted returns
# Step 7: Backtest
print("Running backtest...")
report = sim(position, resample='W')
report.display()
# Step 8: Analyze feature importance (for tree-based models)
if hasattr(model, 'feature_importances_'):
    import pandas as pd
    feature_importance = pd.Series(
        model.feature_importances_,
        index=features.columns.get_level_values(1).unique()
    ).sort_values(ascending=False)
    print("\nTop 10 important features:")
    print(feature_importance.head(10))
FAQ
Q: What happens if feature and label resample frequencies don't match?
Mismatched frequencies cause date misalignment, preventing correct feature-label pairing:
# Wrong: Inconsistent resample
features = mlf.combine({...}, resample='W') # Weekly
label = mll.daytrading_percentage(features.index, period=1, resample='D') # Daily
# -> Shape mismatch error
# Correct: Consistent resample
features = mlf.combine({...}, resample='W')
label = mll.daytrading_percentage(features.index, period=1, resample='W')
Q: How do I prevent overfitting?
# Method 1: Reduce feature count
features = mlf.ta(mlf.ta_names(n=1)[:20]) # Only 20 indicators
# Method 2: Use regularization (reduce model complexity)
# LightGBM supports L1/L2 regularization, tree depth limits, etc.
# Method 3: Use a simple model
model = q.LinearModel() # Linear model is less prone to overfitting
# Method 4: Out-of-sample testing
# Ensure test set performance is close to training set (no more than 50% gap)
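For Method 4, a single train/test cutoff can be extended to several out-of-sample windows. The sketch below is a generic walk-forward split with an expanding training window, written in plain Python. The helper `walk_forward_splits` and the period labels are hypothetical and not part of finlab.

```python
def walk_forward_splits(dates, n_folds, min_train):
    """Yield (train_dates, test_dates) pairs with an expanding training window."""
    dates = sorted(dates)
    test_size = (len(dates) - min_train) // n_folds
    for fold in range(n_folds):
        split = min_train + fold * test_size
        # Train on everything before the split, test on the next block.
        yield dates[:split], dates[split:split + test_size]

weeks = [f"2023-W{w:02d}" for w in range(1, 13)]  # 12 weekly periods
folds = list(walk_forward_splits(weeks, n_folds=3, min_train=6))
for train, test in folds:
    print(len(train), "training periods, test on", test)
```

Every test block lies strictly after its training block, so comparing the metric across folds shows whether performance holds up on data the model has never seen.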
Q: Training is too slow, what can I do?
# Method 1: Reduce data range
data.truncate_start = '2020-01-01'
# Method 2: Reduce feature count
features = mlf.ta(mlf.ta_names(n=1)[:10]) # Only 10 indicators
# Method 3: Use resample='W' or 'M' (instead of 'D')
features = mlf.combine({...}, resample='W') # Weekly data is faster
Q: How do I handle missing values?
# First check the missing value situation
print(f"Feature missing ratio: {features.isna().sum().sum() / features.size:.2%}")
print(f"Label missing ratio: {label.isna().sum() / len(label):.2%}")
# Method 1: Drop samples with missing values
features_clean = features.dropna()
label_clean = label.dropna()
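# Method 2: Forward fill within each instrument (suitable for time series)
# Grouping by instrument keeps values from carrying over between stocks
features_filled = features.groupby(level='instrument').ffill()
label_filled = label.groupby(level='instrument').ffill()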
# Method 3: Fill with 0 (suitable for technical indicators)
features_zero = features.fillna(0)
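When dropping missing values, features and labels need to be filtered with the same mask, otherwise their rows no longer line up. The sketch below shows this with plain pandas on toy data. The (datetime, instrument) index follows the layout used throughout this page, and the values are made up.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2024-01-05", "2024-01-12"]), ["1101", "2330"]],
    names=["datetime", "instrument"],
)
features = pd.DataFrame(
    {"pb": [1.2, np.nan, 1.3, 2.1], "pe": [10.0, 15.0, 11.0, 16.0]}, index=index
)
label = pd.Series([0.02, 0.01, np.nan, -0.01], index=index)

# A row is usable only if every feature and the label are present.
valid = features.notna().all(axis=1) & label.notna()
X, y = features[valid], label[valid]
print(X)
print(y)
```

Calling dropna() on features and labels separately would have kept three rows in each, but not the same three.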
Q: How do I convert prediction results to trading signals?
# Method 1: Buy top N
position = pred.is_largest(30)
# Method 2: Set threshold
position = pred.rank(axis=1, pct=True) > 0.8 # Buy the top 20% on each date
# Method 3: Long-short strategy
long_position = pred.is_largest(20) # Long top 20
short_position = pred.is_smallest(20) # Short bottom 20
position = long_position.astype(int) - short_position.astype(int)
# Method 4: Allocate weights based on predicted values
position = pred.div(pred.sum(axis=1), axis=0) # Proportional allocation on each date
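The selection logic can be checked on a small table without finlab. The sketch below assumes predictions form a wide table (dates as rows, instruments as columns) and uses plain pandas ranking as a stand-in for is_largest. The helper `top_n` and the scores are hypothetical.

```python
import pandas as pd

pred = pd.DataFrame(
    {"1101": [0.03, -0.01], "1102": [0.01, 0.02], "2330": [0.02, 0.04]},
    index=pd.to_datetime(["2024-01-05", "2024-01-12"]),
)

def top_n(scores: pd.DataFrame, n: int) -> pd.DataFrame:
    """True for the n highest scores on each date (row)."""
    return scores.rank(axis=1, ascending=False, method="first") <= n

position = top_n(pred, n=2)

# Equal-weight the selected stocks so each row sums to 1.
selected = position.astype(float)
weights = selected.div(selected.sum(axis=1), axis=0)
print(position)
print(weights)
```

Ranking along axis=1 compares stocks against each other on the same date, so the signal for a given day never depends on predictions from later days.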
Resources
- Machine Learning Strategy Workflow - End-to-end development guide
- Machine Learning Strategy Development - Detailed tutorial
- qlib Official Documentation - qlib framework docs
- TA-Lib Indicator List - All technical indicators