
Machine Learning

Library Installation

Required Installation

Install core packages (finlab, TA-Lib):

pip install finlab
pip install ta-lib

Model Libraries (Optional)

finlab.ml.qlib supports various models (LightGBM, XGBoost, CatBoost, PyTorch, TensorFlow, etc.). Install them as needed following their official documentation; most are pre-installed in Colab.

  1. Install LightGBM
  2. Install XGBoost
  3. Install CatBoost
  4. Install PyTorch
  5. Install TensorFlow

ML workflows are complex -- consider using an AI assistant

From feature engineering to model training, ML strategies involve many steps. After installing FinLab Skill, the AI coding assistant can help you select features, split datasets, train models, and interpret backtest results.

Feature Processing

Using the Combine Function to Merge Features

finlab.ml.feature.combine merges features from multiple sources (technical, fundamental, custom) with resampling support. Examples:

  1. Merge P/B ratio and P/E ratio into one feature set:
from finlab import data
from finlab.ml import feature as mlf
features = mlf.combine({
  'pb': data.get('price_earning_ratio:股價淨值比'),
  'pe': data.get('price_earning_ratio:本益比')
}, resample='W')

features.head()
                                              pb     pe
(Timestamp('2010-01-04 00:00:00'), '1101')  1.47  18.85
(Timestamp('2010-01-04 00:00:00'), '1102')  1.44  14.58
(Timestamp('2010-01-04 00:00:00'), '1103')  0.79  40.89
(Timestamp('2010-01-04 00:00:00'), '1104')  0.92   73.6

This merges P/B and P/E into a single feature set.

  2. Merge technical indicators into one feature set:
from finlab.ml import feature as mlf
mlf.combine({
  'talib': mlf.ta(mlf.ta_names(n=1))
})
                                            talib.HT_DCPERIOD__real__  talib.HT_DCPHASE__real__  talib.HT_PHASOR__quadrature__
(Timestamp('2024-04-01 00:00:00'), '9951')                    23.4372                   122.135                     -0.0107087
(Timestamp('2024-04-01 00:00:00'), '9955')                    18.4416                   68.0654                     -0.0168584
(Timestamp('2024-04-01 00:00:00'), '9958')                    30.1035                  -10.7866                       0.159777
(Timestamp('2024-04-01 00:00:00'), '9960')                    17.5025                   94.0009                     0.00310615
(Timestamp('2024-04-01 00:00:00'), '9962')                    23.2931                   90.0781                     -0.0145453

In this example, mlf.ta_names(n=1) generates one randomly parameterized name for each TA-Lib indicator, mlf.ta computes the values for those names, and combine merges the results into a single DataFrame indexed by date and stock ID.

Together, the two examples show that combine accepts both fundamental data and technical indicators and returns them as one feature set.
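To make the output shape concrete, here is a minimal sketch in plain pandas of the kind of merge that combine performs. It is not finlab's implementation: combine_sketch is a hypothetical helper, the input data is synthetic, and it assumes wide-format inputs (dates as rows, stock IDs as columns) with the last observation of each period kept when resampling.

```python
import numpy as np
import pandas as pd

# Synthetic wide-format inputs: one row per date, one column per stock ID.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
pb = pd.DataFrame(
    {"1101": np.arange(10.0), "1102": np.arange(10.0, 20.0)}, index=dates
)
pe = pb * 10


def combine_sketch(frames, resample=None):
    """Hypothetical helper: stack wide frames into one long feature frame."""
    columns = {}
    for name, frame in frames.items():
        if resample is not None:
            # Assumption: keep the last observation of each resampling period.
            frame = frame.resample(resample).last()
        # stack() turns (date x stock) into a Series indexed by (date, stock).
        columns[name] = frame.stack()
    combined = pd.concat(columns, axis=1)
    combined.index.names = ["datetime", "instrument"]
    return combined


features = combine_sketch({"pb": pb, "pe": pe}, resample="W")
print(features)
```

Each row of the result is one (date, stock) pair and each column is one feature, which matches the layout shown in the outputs above.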

Using TA-Lib to Generate Technical Indicators

Technical indicators are a common source of features when developing quantitative strategies with finlab. The ta_names and ta functions generate indicator names and compute the corresponding values.

ta_names Function

The ta_names function generates a list of TA-Lib technical indicator names. Each name encodes the indicator's computation method and parameters, so you can try different indicator configurations when searching for a useful feature combination.

  • n parameter: In ta_names, the n parameter specifies how many random parameter configurations are generated for each indicator. For example, if n=10, then for each TA-Lib indicator, ta_names will generate 10 different parameter configurations. This allows you to select from many different settings to explore the relationship between data and strategy performance.

from finlab.ml import feature as mlf
mlf.ta_names(n=1)
['talib.HT_DCPERIOD__real__',
 'talib.HT_DCPHASE__real__',
 'talib.HT_PHASOR__quadrature__',
 'talib.HT_PHASOR__inphase__',
 'talib.HT_SINE__sine__',
 'talib.HT_SINE__leadsine__',
 ...
 ]

ta Function

Once you have a list of indicator names (for example from ta_names), use the ta function to compute the indicator values.

  • Functionality: ta accepts one or more indicator names generated by ta_names and computes their values, which can then serve as model features.

  • Flexibility: Because names and values are generated separately, you can test different indicator configurations across time periods and market conditions.

  • resample parameter: ta also accepts a resample parameter that resamples the computed indicator values to the specified frequency (for example 'W' for weekly).

from finlab.ml import feature as mlf
mlf.ta(['talib.HT_DCPERIOD__real__',
 'talib.HT_DCPHASE__real__',
 'talib.HT_PHASOR__quadrature__'], resample='W')
                                            talib.HT_DCPERIOD__real__  talib.HT_DCPHASE__real__
(Timestamp('2024-04-07 00:00:00'), '9951')                    23.4372                   122.135
(Timestamp('2024-04-07 00:00:00'), '9955')                    18.4416                   68.0654
(Timestamp('2024-04-07 00:00:00'), '9958')                    30.1035                  -10.7866
(Timestamp('2024-04-07 00:00:00'), '9960')                    17.5025                   94.0009
(Timestamp('2024-04-07 00:00:00'), '9962')                    23.2931                   90.0781

In summary, ta_names and ta are the two finlab functions for generating technical indicator features: ta_names produces indicator names with randomly chosen parameter settings (controlled by n), and ta computes the indicator values for those names.
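The effect of resampling an indicator can be illustrated without TA-Lib. The sketch below swaps in a simple moving average computed with pandas (not a TA-Lib call), evaluates it on daily data, and then keeps the last value of each week. sma_feature and the price data are hypothetical, and whether ta uses the same "last value per period" rule is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices in wide format (dates x stock IDs).
dates = pd.date_range("2024-01-01", periods=14, freq="D")
close = pd.DataFrame(
    {"1101": np.arange(1.0, 15.0), "1102": np.arange(14.0, 0.0, -1.0)},
    index=dates,
)


def sma_feature(close, window, resample=None):
    """Hypothetical helper: moving average per stock, optionally resampled."""
    # The indicator is always computed on the daily series first.
    sma = close.rolling(window).mean()
    if resample is not None:
        # Assumption: the last daily value in each period represents it.
        sma = sma.resample(resample).last()
    out = sma.stack().dropna()
    out.index.names = ["datetime", "instrument"]
    return out.rename(f"sma_{window}")


weekly = sma_feature(close, window=3, resample="W")
print(weekly)
```

Computing on daily data and resampling afterwards keeps the indicator's window in trading days; resampling the prices first would instead give a three-week average.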

Label Generation

Using the Label Function to Generate Labels

finlab.ml.label provides various return/risk label computations for training prediction models.

Predicting daytrading_percentage

This function computes the percentage change in price from open to close within each period.

  • resample: Must match the resample used in combine to align time periods.
  • period: Number of future periods (defined by resample) for computing change.
from finlab.ml import feature as mlf
from finlab.ml import label as mll
feature = mlf.combine(...)
label = mll.daytrading_percentage(feature.index)
datetime    instrument
2007-04-23  0015          0.000000
            0050          0.003454
            0051          0.004874
            0052          0.006510
            01001T        0.001509
dtype: float64

Predicting N-day Future Returns

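Computes the percentage price change over the next period resampling intervals (for example, one week ahead with resample='W' and period=1). It is suited to analyzing medium- to long-term performance.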

label = mll.return_percentage(feature.index, resample='W', period=1)
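As a rough illustration of what a forward-return label looks like, the following pandas sketch computes the return from each weekly close to the close period weeks later. forward_return is a hypothetical helper on synthetic prices; finlab's return_percentage may use different entry and exit prices, so treat the exact definition here as an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic daily closes: one rising stock and one flat stock.
dates = pd.date_range("2024-01-01", periods=21, freq="D")
close = pd.DataFrame(
    {"1101": np.linspace(100.0, 120.0, 21), "1102": 50.0}, index=dates
)


def forward_return(close, resample="W", period=1):
    """Hypothetical helper: return over the next `period` resampled bars."""
    sampled = close.resample(resample).last()
    # shift(-period) brings the future price back to the current row.
    future = sampled.shift(-period)
    ret = future / sampled - 1
    # The final `period` rows have no future price yet, so they are dropped.
    out = ret.stack().dropna()
    out.index.names = ["datetime", "instrument"]
    return out


label = forward_return(close, resample="W", period=1)
print(label)
```

Because the label looks into the future, the most recent rows never have a value, which is why they must be excluded from training.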

Maximum Adverse Excursion

MAE: Maximum adverse movement during the holding period (currently does not support resample).

label = mll.maximum_adverse_excursion(feature.index, period=1)

Maximum Favorable Excursion

MFE: Maximum favorable movement during the holding period (currently does not support resample).

label = mll.maximum_favorable_excursion(feature.index, period=1)
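The two excursion labels can be sketched together. The version below measures, for each entry day, the worst and best relative move of the close over the next period days. excursions is a hypothetical helper; using closing prices and excluding the entry day itself are assumptions, and finlab's definitions may differ (for example by using intraday highs and lows).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5, freq="D")
close = pd.DataFrame({"1101": [100.0, 90.0, 110.0, 105.0, 120.0]}, index=dates)


def excursions(close, period):
    """Hypothetical helper: (MAE, MFE) over the next `period` bars."""
    # Relative move from the entry bar to each of the following bars.
    moves = [close.shift(-k) / close - 1 for k in range(1, period + 1)]
    mae, mfe = moves[0], moves[0]
    for move in moves[1:]:
        # np.minimum / np.maximum propagate NaN, so entries without a full
        # holding window stay NaN instead of using a partial window.
        mae = np.minimum(mae, move)
        mfe = np.maximum(mfe, move)
    return mae, mfe


mae, mfe = excursions(close, period=2)
print(mae["1101"].round(4).tolist())
print(mfe["1101"].round(4).tolist())
```

In this sketch a positive MAE simply means the price never fell below the entry price during the holding window.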

Excess Over Median

Excess return relative to the market-wide median return for the same period.

label = mll.excess_over_median(feature.index, resample='M', period=1)

Excess Over Mean

Excess return relative to the market-wide mean return for the same period.

label = mll.excess_over_mean(feature.index, resample='M', period=1)
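Both excess-return labels amount to subtracting a cross-sectional statistic from each stock's return on the same date. The sketch below shows that arithmetic on a small hand-made table of period returns; the numbers are illustrative, and it assumes the "market" is simply every stock in the table.

```python
import numpy as np
import pandas as pd

# Period returns in wide format: one row per period, one column per stock.
dates = pd.to_datetime(["2024-01-31", "2024-02-29"])
returns = pd.DataFrame(
    {"1101": [0.01, 0.04], "1102": [0.03, -0.02], "1103": [0.08, 0.01]},
    index=dates,
)

# Subtract each row's cross-sectional median or mean from every stock.
excess_median = returns.sub(returns.median(axis=1), axis=0)
excess_mean = returns.sub(returns.mean(axis=1), axis=0)

print(excess_median.round(4))
print(excess_mean.round(4))
```

A label built this way removes the market-wide move, so the model is trained to rank stocks against each other rather than to predict market direction.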

Pass the same feature.index to each label function and check that the market setting is correct; the resulting labels then align with the features and can be used directly for model training.

Model Training with Qlib

This section shows how to use various machine learning models through finlab's Qlib wrappers for quantitative strategy development. WrapperModel is a wrapper class for initializing and fitting different ML models, including LightGBM, XGBoost, CatBoost, linear models, TabNet, and deep neural networks. It gives these models a single, consistent fit/predict interface.

Below is a brief introduction and example usage for each model wrapper:

LGBModel

Wraps the LightGBM model.

import finlab.ml.qlib as q

# Construct X_train, y_train, X_test

model = q.LGBModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

How to construct X_train, y_train, X_test?

Example using data before 2020 as the training set, where features comes from combine and labels from one of the label functions above:

is_train = features.index.get_level_values('datetime') < '2020-01-01'
X_train = features[is_train]
y_train = labels[is_train]
X_test = features[~is_train]
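The same split can be tried end to end on synthetic data. The sketch below builds a small (datetime, instrument) MultiIndex by hand and additionally drops training rows whose label is missing; that filtering step is an addition for illustration, not something the original snippet does.

```python
import numpy as np
import pandas as pd

# Synthetic long-format data indexed by (datetime, instrument).
index = pd.MultiIndex.from_product(
    [
        pd.to_datetime(["2019-06-28", "2019-12-31", "2020-01-02", "2020-06-30"]),
        ["1101", "1102"],
    ],
    names=["datetime", "instrument"],
)
features = pd.DataFrame(
    {"pb": np.arange(8.0), "pe": np.arange(8.0) * 2}, index=index
)
labels = pd.Series(np.linspace(-0.03, 0.04, 8), index=index)
labels.iloc[1] = np.nan  # hypothetical gap in the training labels

# Boolean masks as plain arrays so they combine position by position.
is_train = features.index.get_level_values("datetime") < "2020-01-01"
has_label = labels.notna().to_numpy()

# Train only on rows that have a label; predict on every later row.
X_train = features[is_train & has_label]
y_train = labels[is_train & has_label]
X_test = features[~is_train]

print(len(X_train), len(y_train), len(X_test))
```

Splitting by date rather than at random keeps every training row strictly earlier than every test row, which avoids leaking future information into the model.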

XGBModel

Wraps the XGBoost model.

model = q.XGBModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

DEnsmbleModel

Wraps the Double Ensemble model.

model = q.DEnsmbleModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

CatBoostModel

Wraps the CatBoost model.

model = q.CatBoostModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

LinearModel

Wraps a linear model.

model = q.LinearModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

TabnetModel

Wraps the TabNet model.

model = q.TabnetModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

DNNModel

Wraps a deep neural network model.

model = q.DNNModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

The wrappers let you focus on features and strategy without getting bogged down in model details.

get_models

get_models quickly retrieves the list of available models and initializes them, facilitating multi-model experimentation.

import finlab.ml.qlib as q

# Get all available models
models = q.get_models()

# Print all model names
print(list(models.keys()))

# Select and instantiate a model, e.g., LightGBM
model = models['LGBModel']()

# Assuming X_train, y_train, X_test are already prepared

# Train model
model.fit(X_train, y_train)

# Predict using the trained model
y_pred = model.predict(X_test)

The above demonstrates how to list models, create an LGBModel instance, and train/predict.
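Since get_models returns a mapping from model name to model class, several models can be compared in one loop. The sketch below shows that pattern with two hypothetical stand-in classes (MeanModel and LeastSquaresModel) in place of the real wrappers, so it runs without finlab. It assumes only the fit/predict interface shown above, and the data and the error metric are illustrative.

```python
import numpy as np
import pandas as pd


class MeanModel:
    """Hypothetical stand-in: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return pd.Series(self.mean_, index=X.index)


class LeastSquaresModel:
    """Hypothetical stand-in: ordinary least squares with an intercept."""

    def fit(self, X, y):
        design = np.column_stack([np.ones(len(X)), X.to_numpy()])
        self.coef_, *_ = np.linalg.lstsq(design, np.asarray(y), rcond=None)
        return self

    def predict(self, X):
        design = np.column_stack([np.ones(len(X)), X.to_numpy()])
        return pd.Series(design @ self.coef_, index=X.index)


# Same shape as the mapping described above: name -> class.
models = {"MeanModel": MeanModel, "LeastSquaresModel": LeastSquaresModel}

# Synthetic features with a known linear signal plus noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = 0.5 * X["f1"] - 0.2 * X["f2"] + rng.normal(scale=0.1, size=200)
X_train, y_train = X.iloc[:150], y.iloc[:150]
X_test, y_test = X.iloc[150:], y.iloc[150:]

scores = {}
for name, model_class in models.items():
    model = model_class()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Mean squared error on the held-out rows; lower is better.
    scores[name] = float(np.mean((y_pred - y_test) ** 2))

print(scores)
```

With the real wrappers, the loop body stays the same and only the models mapping changes.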

Running Backtests

Use sim to backtest, computing strategy returns and risk metrics:

from finlab.backtest import sim

position = y_pred.is_largest(50)

sim(position, resample='4W')

The above uses model rankings to generate position, specifies backtesting frequency with resample, and executes the backtest.
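To show what a top-N selection does to predictions, here is a minimal pandas sketch. top_n is a hypothetical helper, not finlab's is_largest. It assumes the predictions are in wide format (dates as rows, stock IDs as columns) and breaks ties by column order; is_largest may treat ties and missing values differently.

```python
import pandas as pd


def top_n(scores, n):
    """Hypothetical helper: True for the n highest scores in each row."""
    # Rank within each date; rank 1 is the highest predicted value.
    ranks = scores.rank(axis=1, ascending=False, method="first")
    return ranks <= n


dates = pd.to_datetime(["2024-01-05", "2024-01-12"])
y_pred = pd.DataFrame(
    {"1101": [0.02, -0.01], "1102": [0.05, 0.03], "1103": [-0.01, 0.04]},
    index=dates,
)

# Boolean position table: True means the stock is held on that date.
position = top_n(y_pred, 2)
print(position)
```

The boolean table has the same shape as the predictions, with exactly two stocks selected on each date.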