finlab.ml

Machine learning module providing feature engineering, label generation, and model training integration.

Use Cases

Build feature sets for ML stock selection strategies
Generate technical indicators (TA-Lib integration)
Design training labels (returns, risk metrics)
Integrate with the qlib framework for model training
Predict future stock performance

Quick Examples

Feature Engineering

from finlab import data
from finlab.ml import feature as mlf

# Combine fundamental features
features = mlf.combine({
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比'),
    'roe': data.get('fundamental_features:股東權益報酬率')
}, resample='W')

# Add technical indicators
features_ta = mlf.combine({
    'fundamental': features,
    'technical': mlf.ta(mlf.ta_names(n=10))
}, resample='W')

Label Generation

from finlab.ml import label as mll

# Generate future 1-week return labels
label = mll.daytrading_percentage(
    features.index,
    period=1,
    resample='W'
)

qlib Model Training

import finlab.ml.qlib as q

# Split train/test sets
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]

# Train LightGBM model
model = q.LGBModel()
model.fit(X_train, y_train)

# Predict and convert to position weights
pred = model.predict(X_test)
position = pred.is_largest(30)  # Buy top 30

Detailed Guide

See Machine Learning Strategy Development for:
- Complete ML strategy development workflow
- Feature engineering best practices
- Label design techniques
- Model training and optimization
- Overfitting prevention

API Reference

finlab.ml.feature

Feature engineering module for combining and processing various features.

combine()

finlab.ml.feature.combine

combine(features, resample=None, sample_filter=None, **kwargs)

The combine function takes a dictionary of features as input and combines them into a single pandas DataFrame.

PARAMETER	DESCRIPTION
`features`	A dictionary whose values are DataFrames (datetime index, instrument columns) or callables that return such DataFrames. TYPE: `Dict[str, DataFrame \| Callable]`
`resample`	Optional frequency used to resample the feature data. TYPE: `str` DEFAULT: `None`
`sample_filter`	A boolean DataFrame (date index, instrument columns) that filters which feature samples are kept. TYPE: `DataFrame` DEFAULT: `None`
`**kwargs`	Additional keyword arguments passed to the resampler function. DEFAULT: `{}`

RETURNS	DESCRIPTION
	A pandas DataFrame containing all the input features combined.

Examples:

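This example shows how to use finlab.ml.feature and finlab.data to combine features, here RSI and the price-to-book ratio. f.combine performs the merge: each dictionary key is a feature name and each value is the corresponding data. The 'rsi' feature comes from data.indicator('RSI'), which computes the Relative Strength Index. The 'pb' feature comes from data.get('price_earning_ratio:股價淨值比'), which fetches the price-to-book ratio. The result is a single DataFrame containing these features.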

from finlab import data
import finlab.ml.feature as f
import finlab.ml.qlib as q

features = f.combine({

    # Fetch a feature directly with data.get
    'pb': data.get('price_earning_ratio:股價淨值比'),

    # Build a technical-indicator feature with data.indicator
    'rsi': data.indicator('RSI'),

    # Enumerate many TA-Lib indicators with f.ta
    'talib': f.ta(f.ta_names()),

    # Build qlib Alpha158 features (run q.init() and q.dump() first)
    'qlib158': q.alpha('Alpha158')

    })

features.head()

datetime	instrument	rsi	pb
2020-01-01	1101	0	2
2020-01-02	1102	100	3
2020-01-03	1108	100	4
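
The long (datetime, instrument) layout of the result can be illustrated in plain pandas. The sketch below is not finlab's implementation; `stack_features` is a hypothetical helper and the toy values are made up. It only shows how wide DataFrames (dates as rows, instruments as columns) can be stacked into one column per feature.

```python
import pandas as pd

# Two toy wide-format features: rows are dates, columns are instruments.
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
rsi = pd.DataFrame({"1101": [30.0, 55.0, 70.0], "1102": [40.0, 45.0, 50.0]}, index=dates)
pb = pd.DataFrame({"1101": [1.2, 1.3, 1.1], "1102": [2.0, 2.1, 2.2]}, index=dates)


def stack_features(wide_features):
    """Hypothetical helper: stack each wide frame into one column of a long frame."""
    columns = {}
    for name, frame in wide_features.items():
        stacked = frame.stack()
        stacked.index.names = ["datetime", "instrument"]
        columns[name] = stacked
    # The Series align on the shared MultiIndex; a pair missing in one
    # feature would show up as NaN in that feature's column.
    return pd.DataFrame(columns)


long_features = stack_features({"rsi": rsi, "pb": pb})
print(long_features.head())
```

Each row of `long_features` is one (date, instrument) sample, which is the shape that the model examples on this page index with `get_level_values('datetime')`.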

Usage Examples:

from finlab import data
from finlab.ml import feature as mlf

# Example 1: Combine fundamental features
features = mlf.combine({
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比'),
    'roe': data.get('fundamental_features:股東權益報酬率')
}, resample='W')

# Example 2: Combine technical indicators
features = mlf.combine({
    'talib': mlf.ta(['talib.RSI__period14__', 'talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__'])
}, resample='D')

# Example 3: Mix multiple feature types
# (pb, pe, and custom_feature_df are DataFrames assumed to be defined earlier)
features = mlf.combine({
    'fundamental': mlf.combine({'pb': pb, 'pe': pe}),
    'technical': mlf.ta(mlf.ta_names(n=5)),
    'custom': custom_feature_df
}, resample='W')

resample Parameter

'D' - Daily
'W' - Weekly (Friday)
'M' - Monthly (month-end)
Features and labels must use the same resample frequency!
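
Why the frequencies must match can be seen with plain pandas. This is a sketch on toy prices, not finlab's resampler. Note that pandas' own "W" alias anchors weeks on Sunday, so the sketch uses "W-FRI" explicitly to mirror the Friday convention listed above.

```python
import pandas as pd

# Ten business days of toy closing prices (2024-01-01 is a Monday).
days = pd.bdate_range("2024-01-01", periods=10)
close = pd.Series(range(100, 110), index=days, dtype=float)

# Weekly bars anchored on Friday: keep the last observation of each week.
weekly = close.resample("W-FRI").last()

print(weekly)
# The daily index has 10 rows and the weekly index has 2, so weekly
# features cannot be paired row-for-row with daily labels.
print(len(close), len(weekly))
```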

ta()

finlab.ml.feature.ta

ta(feature_names, factories=None, resample=None, start_time=None, end_time=None, adj=False, cpu=-1, **kwargs)

Calculate technical indicator values for a list of feature names.

PARAMETER	DESCRIPTION
`feature_names`	A list of technical indicator feature names. Defaults to None. TYPE: `Optional[List[str]]`
`factories`	A dictionary of factories to generate technical indicators. Defaults to {"talib": TalibIndicatorFactory()}. TYPE: `Optional[Dict[str, TalibIndicatorFactory]]` DEFAULT: `None`
`resample`	The frequency to resample the data to. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`start_time`	The start time of the data. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`end_time`	The end time of the data. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`**kwargs`	Additional keyword arguments to pass to the resampler function. DEFAULT: `{}`

RETURNS	DESCRIPTION
`DataFrame`	A DataFrame of technical indicator values, one column per feature name.

Technical Indicator Computation:

from finlab.ml import feature as mlf

# Compute specific indicators
indicators = mlf.ta([
    'talib.RSI__period14__',
    'talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__',
    'talib.BBANDS__timeperiod20_nbdevup2_nbdevdn2__upperband__'
], resample='W')

# Auto-generate random indicator combinations
random_indicators = mlf.ta(mlf.ta_names(n=10), resample='W')

ta_names()

finlab.ml.feature.ta_names

ta_names(lb=1, ub=10, n=1, factory=None)

Generate a list of technical indicator feature names.

PARAMETER	DESCRIPTION
`lb`	The lower bound of the multiplier of the default parameter for the technical indicators. TYPE: `int` DEFAULT: `1`
`ub`	The upper bound of the multiplier of the default parameter for the technical indicators. TYPE: `int` DEFAULT: `10`
`n`	The number of random samples for each technical indicator. TYPE: `int` DEFAULT: `1`
`factory`	A factory object to generate technical indicators. Defaults to TalibIndicatorFactory. TYPE: `IndicatorFactory` DEFAULT: `None`

RETURNS	DESCRIPTION
`List[str]`	A list of technical indicator feature names.

Examples:

import finlab.ml.feature as f

# Method 1: generate every indicator with random parameters
features = f.ta()

# Method 2: compute one specific indicator
feature_names = ['talib.MACD__macdhist__fastperiod__52__slowperiod__212__signalperiod__75__']
features = f.ta(feature_names, resample='W')

# Method 3: generate a list of indicator names, then compute them
feature_names = f.ta_names()
features = f.ta(feature_names)

Generate Indicator Name List:

from finlab.ml import feature as mlf

# Generate all TA-Lib indicators with 10 random parameter sets each
indicator_names = mlf.ta_names(n=10)
print(f"Total {len(indicator_names)} indicators")

# View examples
for name in indicator_names[:5]:
    print(name)
# Output:
# talib.RSI__period14__
# talib.RSI__period7__
# talib.MACD__fastperiod12_slowperiod26_signalperiod9__macd__
# ...

Recommended n Values

n=1: One random parameter set per indicator (~100 indicators)
n=5: 5 random parameter sets per indicator (~500 indicators)
n=10: 10 random parameter sets per indicator; more variety but longer computation time (~1000 indicators)
Start with n=1 to test feasibility, then increase

Notes

Too many indicators can lead to:
- Long computation time
- High memory usage
- Increased overfitting risk
Consider using feature selection to reduce the number of indicators
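
One simple form of feature selection is to drop indicators that are nearly duplicates of ones already kept. The sketch below uses plain pandas on synthetic data; `drop_correlated`, the 0.95 threshold, and the column names are illustrative assumptions, not part of finlab.

```python
import numpy as np
import pandas as pd


def drop_correlated(features, threshold=0.95):
    """Keep a column only if its |correlation| with every kept column is <= threshold."""
    corr = features.corr().abs()
    kept = []
    for column in features.columns:
        if all(corr.loc[column, other] <= threshold for other in kept):
            kept.append(column)
    return features[kept]


rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "rsi_14": base,
    "rsi_15": base + rng.normal(scale=0.01, size=200),  # near-duplicate of rsi_14
    "macd": rng.normal(size=200),                       # independent of the others
})

selected = drop_correlated(toy)
print(list(selected.columns))
```

The filter is greedy and order-dependent: the first column of a correlated group is the one that survives. Fit it on the training period only so that the selection itself does not leak test-period information.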

finlab.ml.label

Label generation module for designing machine learning target variables.

daytrading_percentage()

finlab.ml.label.daytrading_percentage

daytrading_percentage(index, **kwargs)

Calculate the percentage change of market prices over a given period.

PARAMETER	DESCRIPTION
`index`	A multi-level index of datetime and instrument. TYPE: `Index`
`resample`	The resample frequency for the output data. Defaults to None. TYPE: `Optional[str]`
`period`	The number of periods to calculate the percentage change over. Defaults to 1. TYPE: `int`
`trade_at_price`	The price for execution. Defaults to `close`. TYPE: `str`
`**kwargs`	Additional arguments to be passed to the resampler function. DEFAULT: `{}`

RETURNS	DESCRIPTION
`Series`	A pd.Series containing the percentage change of stock prices.

Predict Future N-Period Returns:

from finlab.ml import feature as mlf, label as mll

# Build features
features = mlf.combine({...}, resample='W')

# Generate label: predict next 1-week return
label = mll.daytrading_percentage(
    features.index,
    period=1,
    resample='W'
)

# Generate label: predict next 4-week return
label_4w = mll.daytrading_percentage(
    features.index,
    period=4,
    resample='W'
)

Choosing the period Parameter

period=1: Short-term prediction (suitable for weekly/monthly rebalancing)
period=4: Medium-term prediction (suitable for quarterly rebalancing)
Larger period: More stable prediction but slower signal
Recommendation: Match with strategy rebalancing frequency
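
The effect of period can be illustrated with a generic forward-return calculation in plain pandas. This is a sketch on toy weekly prices; `forward_return` is a hypothetical helper, and the exact price convention finlab uses for its labels may differ.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-19", "2024-01-26"])
close = pd.DataFrame(
    {"1101": [100.0, 110.0, 99.0, 99.0], "1102": [50.0, 50.0, 55.0, 44.0]},
    index=dates,
)


def forward_return(prices, period=1):
    """Return from each row to `period` rows later; the last `period` rows are NaN."""
    return prices.shift(-period) / prices - 1


label_1 = forward_return(close, period=1)
label_2 = forward_return(close, period=2)
print(label_1)
print(label_2)
```

A larger period leaves more trailing rows without a label, because the future price they need does not exist yet.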

maximum_adverse_excursion()

finlab.ml.label.maximum_adverse_excursion

maximum_adverse_excursion(index, period=1, trade_at_price='close')

Calculate the maximum adverse excursion of market prices over a given period.

PARAMETER	DESCRIPTION
`index`	A multi-level index of datetime and instrument. TYPE: `Index`
`resample`	The resample frequency for the output data. Defaults to None. TYPE: `Optional[str]`
`period`	The number of periods over which to calculate the maximum adverse excursion. Defaults to 1. TYPE: `int` DEFAULT: `1`
`trade_at_price`	The price for execution. Defaults to `close`. TYPE: `str` DEFAULT: `'close'`
`**kwargs`	Additional arguments to be passed to the resampler function.

RETURNS	DESCRIPTION
`Series`	A pd.Series containing the maximum adverse excursion of stock prices.

Maximum Adverse Excursion (Risk Metric):

from finlab.ml import label as mll

# Compute the maximum adverse excursion over the next 5 periods
# (features is the combined feature DataFrame built earlier)
mae_label = mll.maximum_adverse_excursion(
    features.index,
    period=5
)

# Can be used to train risk prediction models
# More negative values indicate higher risk
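
The idea behind this label can be sketched for a single instrument in plain pandas. This is an illustration under assumptions, not finlab's implementation: it measures the lowest closing price within the next `period` rows relative to the current close, whereas the library may use other prices or cap the value at zero.

```python
import numpy as np
import pandas as pd


def max_adverse_excursion(prices, period=1):
    """Worst return reached within the next `period` rows, relative to the current price."""
    out = pd.Series(np.nan, index=prices.index)
    values = prices.to_numpy()
    for i in range(len(values) - period):
        window = values[i + 1 : i + 1 + period]  # strictly future prices
        out.iloc[i] = window.min() / values[i] - 1
    return out


close = pd.Series(
    [100.0, 95.0, 105.0, 90.0, 110.0],
    index=pd.date_range("2024-01-01", periods=5),
)
mae = max_adverse_excursion(close, period=2)
print(mae)
```

The last `period` rows stay NaN because their future window is incomplete.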

finlab.ml.qlib

qlib framework integration module providing multiple machine learning models for stock prediction.

Model Classes

finlab.ml.qlib provides the following models:

finlab.ml.qlib.LGBModel

LGBModel()

LGBModel is a wrapper around a LightGBM model.

import finlab.ml.qlib as q

# build X_train, y_train, X_test

model = q.LGBModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

finlab.ml.qlib.XGBModel

XGBModel()

XGBModel is a wrapper around an XGBoost model.

import finlab.ml.qlib as q

# build X_train, y_train, X_test

model = q.XGBModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

finlab.ml.qlib.CatBoostModel

CatBoostModel()

CatBoostModel is a wrapper around a CatBoost model.

import finlab.ml.qlib as q

# build X_train, y_train, X_test

model = q.CatBoostModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

finlab.ml.qlib.LinearModel

LinearModel()

LinearModel is a wrapper around a linear model.

import finlab.ml.qlib as q

# build X_train, y_train, X_test

model = q.LinearModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Basic Usage Example:

from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q

# 1. Prepare features
features = mlf.combine({
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比')
}, resample='W')

# 2. Prepare labels
label = mll.return_percentage(features.index, resample='W', period=1)

# 3. Split train/test sets
is_train = features.index.get_level_values('datetime') < '2020-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]

# 4. Build and train model
model = q.XGBModel()  # Options: q.LGBModel(), q.CatBoostModel(), q.LinearModel()
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

Supported Model Types:

Model	Description	Pros	Cons
`LGBModel()`	LightGBM	Fast, good performance, low memory usage	Requires lightgbm
`XGBModel()`	XGBoost	Stable, high interpretability	Slower training
`CatBoostModel()`	CatBoost	Handles categorical features well, low overfitting risk	High memory usage
`LinearModel()`	Linear Regression	Simple, fast, low overfitting risk	Usually lower performance

Model Selection Recommendations

Beginners: LGBModel() (balanced speed and performance)
Stability-focused: XGBModel()
Categorical features: CatBoostModel()
Quick idea validation: LinearModel()
Avoid overfitting: Start with LinearModel() as baseline, then try complex models

Common Mistakes

Data leakage: Ensure train and test sets do not overlap in time (use temporal split, not random split)
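Overfitting: Test-set performance far worse than training-set performance (a test-set information coefficient, IC, below about 0.02 may indicate overfitting)
Look-ahead bias: Labels must be computed from prices after the feature date (for example with .shift(-1) or period=1), so that features never include information from the label window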
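
The temporal split and the IC check can be sketched on synthetic data in plain pandas. Everything here is illustrative: the data is random, `rank_ic` is a hypothetical helper, and IC is taken to mean the average per-date Spearman (rank) correlation between predictions and realized labels.

```python
import numpy as np
import pandas as pd

# Toy long-format data: 40 weekly dates x 20 instruments.
dates = pd.date_range("2022-01-07", periods=40, freq="W-FRI")
instruments = [f"S{i:02d}" for i in range(20)]
index = pd.MultiIndex.from_product([dates, instruments], names=["datetime", "instrument"])

rng = np.random.default_rng(42)
signal = pd.Series(rng.normal(size=len(index)), index=index)
label = 0.5 * signal + pd.Series(rng.normal(size=len(index)), index=index)

# Temporal split: every training date precedes every test date.
cutoff = pd.Timestamp("2022-07-01")
is_train = index.get_level_values("datetime") < cutoff
train_dates = index[is_train].get_level_values("datetime")
test_dates = index[~is_train].get_level_values("datetime")


def rank_ic(pred, target):
    """Mean per-date Spearman correlation, computed as Pearson correlation of ranks."""
    frame = pd.DataFrame({"pred": pred, "target": target})
    per_date = frame.groupby(level="datetime").apply(
        lambda g: g["pred"].rank().corr(g["target"].rank())
    )
    return per_date.mean()


ic_test = rank_ic(signal[~is_train], label[~is_train])
print(f"test IC: {ic_test:.3f}")
```

Comparing this value on the training and test periods gives a quick read on how much of the fitted signal survives out of sample.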

Complete Example

Building an ML Stock Selection Strategy

from finlab import data
from finlab.ml import feature as mlf, label as mll
import finlab.ml.qlib as q  # the steps below refer to the module as q
from finlab.backtest import sim

# Step 1: Build feature set
print("Building features...")
features = mlf.combine({
    # Fundamental features
    'pb': data.get('price_earning_ratio:股價淨值比'),
    'pe': data.get('price_earning_ratio:本益比'),
    'roe': data.get('fundamental_features:股東權益報酬率'),

    # Technical indicators (use few to avoid overfitting)
    'technical': mlf.ta(mlf.ta_names(n=1)[:20])  # Only first 20
}, resample='W')

# Step 2: Generate labels
print("Generating labels...")
label = mll.daytrading_percentage(
    features.index,
    period=1,
    resample='W'
)

# Step 3: Split train/test sets
print("Splitting data...")
is_train = features.index.get_level_values('datetime') < '2023-01-01'
X_train, y_train = features[is_train], label[is_train]
X_test = features[~is_train]

# Step 4: Train model
print("Training model...")
model = q.LGBModel()
model.fit(X_train, y_train)

# Step 5: Predict
print("Generating predictions...")
pred = model.predict(X_test)

# Step 6: Convert to trading signals
print("Generating trading signals...")
position = pred.is_largest(30)  # Buy stocks with highest predicted returns

# Step 7: Backtest
print("Running backtest...")
report = sim(position, resample='W')
report.display()

# Step 8: Analyze feature importance (for tree-based models)
if hasattr(model, 'feature_importances_'):
    import pandas as pd
    feature_importance = pd.Series(
        model.feature_importances_,
        index=features.columns
    ).sort_values(ascending=False)

    print("\nTop 10 important features:")
    print(feature_importance.head(10))

FAQ

Q: What happens if feature and label resample frequencies don't match?

Mismatched frequencies cause date misalignment, preventing correct feature-label pairing:

# Wrong: Inconsistent resample
features = mlf.combine({...}, resample='W')  # Weekly
label = mll.daytrading_percentage(features.index, period=1, resample='D')  # Daily
# -> Dates misalign, so features and labels cannot be paired correctly

# Correct: Consistent resample
features = mlf.combine({...}, resample='W')
label = mll.daytrading_percentage(features.index, period=1, resample='W')

Q: How do I prevent overfitting?

# Method 1: Reduce feature count
features = mlf.ta(mlf.ta_names(n=1)[:20])  # Only 20 indicators

# Method 2: Use regularization (reduce model complexity)
# LightGBM supports L1/L2 regularization, tree depth limits, etc.

# Method 3: Use a simple model
model = q.LinearModel()  # Linear model is less prone to overfitting

# Method 4: Out-of-sample testing
# Ensure test set performance is close to training set (no more than 50% gap)

Q: Training is too slow, what can I do?

# Method 1: Reduce data range
data.truncate_start = '2020-01-01'

# Method 2: Reduce feature count
features = mlf.ta(mlf.ta_names(n=1)[:10])  # Only 10 indicators

# Method 3: Use resample='W' or 'M' (instead of 'D')
features = mlf.combine({...}, resample='W')  # Weekly data is faster

Q: How do I handle missing values?

# First check the missing value situation
print(f"Feature missing ratio: {features.isna().sum().sum() / features.size:.2%}")
print(f"Label missing ratio: {label.isna().sum() / len(label):.2%}")

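# Method 1: Drop samples with missing values
# (use one shared mask so features and labels stay aligned)
valid = features.notna().all(axis=1) & label.notna()
features_clean = features[valid]
label_clean = label[valid]

# Method 2: Forward fill within each instrument (suitable for time series)
features_filled = features.groupby(level='instrument').ffill()
label_filled = label.groupby(level='instrument').ffill()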

# Method 3: Fill with 0 (suitable for technical indicators)
features_zero = features.fillna(0)

Q: How do I convert prediction results to trading signals?

# Method 1: Buy top N
position = pred.is_largest(30)

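# Method 2: Set threshold (assumes pred is a dates x instruments DataFrame)
threshold = pred.quantile(0.8, axis=1)  # Per-date 80th percentile
position = pred.gt(threshold, axis=0)   # Buy top 20% on each date

# Method 3: Long-short strategy
long_position = pred.is_largest(20)   # Long top 20
short_position = pred.is_smallest(20)  # Short bottom 20
position = long_position.astype(int) - short_position.astype(int)

# Method 4: Allocate weights based on predicted values
position = pred.div(pred.sum(axis=1), axis=0)  # Proportional allocation on each date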
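
The top-N selection can also be written in plain pandas, which helps when checking what a position table contains. This is a sketch under the assumption that predictions form a wide table (dates as rows, instruments as columns); `top_n` and `equal_weight` are hypothetical helpers, and tie handling may differ from `is_largest`.

```python
import pandas as pd

# Toy predictions: rows are dates, columns are instruments.
pred = pd.DataFrame(
    {"1101": [0.03, -0.01], "1102": [0.01, 0.04], "1108": [-0.02, 0.02], "2330": [0.05, 0.00]},
    index=pd.to_datetime(["2024-01-05", "2024-01-12"]),
)


def top_n(scores, n):
    """True for the n highest scores in each row."""
    return scores.rank(axis=1, ascending=False, method="first") <= n


def equal_weight(mask):
    """Spread a total weight of 1.0 evenly over the selected names in each row."""
    return mask.astype(float).div(mask.sum(axis=1), axis=0)


position = top_n(pred, 2)
weights = equal_weight(position)
print(weights)
```

Ranking row by row keeps the selection cross-sectional: each date picks its own best names, independent of how the scores are scaled on other dates.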

Resources

Machine Learning Strategy Workflow - End-to-end development guide
Machine Learning Strategy Development - Detailed tutorial
qlib Official Documentation - qlib framework docs
TA-Lib Indicator List - All technical indicators