This series of articles describes how to use Dolphindb in an elegant and efficient way to measure back trading strategies. This article will introduce a backtest of the multi-factor Alpha strategy that is widely used on Wall Street. Multi-factor model is the most important type of model in quantitative trading stock selection. The basic idea is to find some indexes that are most relevant to the rate of return and build a stock portfolio according to these indexes (long the stocks with positive correlation and short the stocks with negative correlation). In the multi-factor model, the individual stock weight of a single factor generally achieves long-short equilibrium (market neutral), and the position without market risk exposure (beta is 0, so it is called alpha strategy) can achieve absolute return. Multiple factors are orthogonal to each other to facilitate strategy configuration and achieve optimal control of returns and risks. In addition, compared with arbitrage strategies (which usually achieve higher Sharpe ratios but have a poor scale), multi-factor alpha strategies have a good scale and can allocate large amounts of funds. Multifactor Alpha strategies are common among hedge funds.

1. Generation factor

The focus of this article is on the backtest framework for implementing the multi-factor Alpha strategy. Factors are not the focus, this part is usually done by financial engineers or strategists. In order to facilitate understanding, we take four common risk factors, momentum factor, beta factor, scale factor, and volatility factor, as examples to illustrate how to measure back into the Dolphindb database by multiple factors.

The input data sheet inData contains 6 fields: SYM, DATE, CLOSE, RET, MV, VOL.

def genSignals(inData){ USstocks = select sym, date, close, RET, MV from inData where weekday(date) between 1:5, close>5, VOL>0, MV>100000 order by sym, date update USstocks set prevMV=prev(MV), cumretIndex=cumprod(1+RET), signal_size=-sqrt(MV), signal_vol=-mstd(RET, 21)*sqrt(252) context by sym update USstocks set mRet = wavg(RET, prevMV) context by date update USstocks set signal_mom = move(cumretIndex,21)/move(cumretIndex,252)-1, signal_beta=mbeta(RET, mRet, 63) from USstocks context by sym return select sym, date, close, RET as ret, signal_size, Signal_beta, signal_vol, signal_mom from USstocks where date>=1991.01.01}

Dolphindb function description:

_abs: _ takes the absolute value.

_prev: _ moves all the elements in the vector one position to the right.

_cumprod: _ Calculates the cumulative product.

_sqrt: _ calculates the square root.

_mstd(X, k) : _ Calculates the moving standard deviation.

_wavg(X, k) : _ Calculate the weighted average.

_move(X, k) : _ If k is positive, move all elements of the vector k to the right, and if k is negative, move all elements of the vector k to the left.

_mbeta(X, Y, k) : _ Calculates the coefficient estimates of ordinary least squares regression.

genSignalsFunction description:

Firstly, data filtering is carried out to select the data of stocks with higher market value during the trading day. The filtered data were then used to calculate the four risk factors:

  • Size factor (signal_size) : Negative of the square root of MV
  • Volatility factor (signal_vol) : The negative of the volatility of stock prices over the past month
  • Momentum factor (signal_mom) : Momentum factor for the last 12 months (excluding the last month)
  • Beta factor (signal_beta) : Calculates the beta of individual stocks and markets using data from the past three months

2. Backtest framework

The backtest framework for the multi-factor Alpha strategy consists of three parts. The first is to generate the weight of each stock on each strategy for each historical period. All positions in a historical period can be called a Tranche. Then, based on Tranche’s holding time, it generates the daily position and profit and loss of each stock on each strategy of each Tranche. Finally, the performance of each strategy and all strategies is statistically analyzed.

2.1 Calculate the investment position of the historical period

We start by defining a function, formPeridate Port, to calculate the stock position for one period (one day). The parallel computation is then used to obtain the investment position for each historical period.

2.1.1 Calculate the stock portfolio for one day

The input of this step is the value of each stock on different factors, and the output is the investment weight of each stock on each factor. The importance of stock rights meets two conditions :(1) the weight sum of all stocks in a factor is zero, that is to say, long-short equilibrium. (2) Different factors are orthogonal to each other, that is to say, the inner product of the weight of the ith factor Wi and the value of the JTH factor Sj is 0 (I <>j). To achieve this goal, we introduced the factor matrix (each column of the matrix represents a factor, each row represents a stock) and added the unit factor (all elements are 1) to the factor matrix.

In practice, another consideration is to get rid of lightly weighted stocks. There are thousands of stocks in a pool, and most of them are given negligible weight. We define a nested function f to adjust the weighting of stocks in a single factor.

The formPeriodPort function takes three input parameters:

  • Signals is a data table generated by the GenSignals function with eight fields: stock symbol, date, closing price, rate of return, and four factors.
  • SignalNames is the name of all the factors, represented by a vector.
  • StockPercentile is used to control the number of stocks.

The output of the function is a data table that stores a day’s stock portfolio with four fields: Trche, Sym, Signalidx, Exposure.

def formPeriodPort(signals, signalNames, stockPercentile){ stockCount = signals.size() signalCount = signalNames.size() tranche = //demean all signals and add a unit column to the signal matrix sigMat = matrix(take(1, stockCount), each(x->x - avg(x), signals[signalNames])) //form weight matrix. transSigMat = sigMat.transpose() weightMat =[1:] /* form exposures. allocate two dollars on each signal, one for long and one for short trim small weights. In practice, we don't want to trade too many stocks */ f = def(sym, tranche, stockPercentile, signalVec, signalIdx){ t = table(sym, signalVec as exposure, iif(signalVec > 0, 1, -1) as sign) update t set exposure = exposure * (abs(exposure) < percentile(abs(exposure), stockPercentile)) context by sign update t set exposure = exposure / sum(exposure).abs() context by sign return select tranche as tranche, sym, signalIdx as signalIdx, exposure from t where exposure ! = 0 } return loop(f{signals.sym, tranche, stockPercentile}, weightMat, 1.. signalCount - 1).unionAll(false) }

Dolphindb function description:

_size: _ Returns the number of elements in the vector

_first: _ returns the first element

_matrix: _ builds the matrix

_transpose: _ matrix transpose

Dot: inner product of a matrix or vector

_inv: _ matrix inverse

_iif(condition, trueResult, falsereResult) : _ return trueResult if condition meets, falsereResult otherwise. This is equivalent to running if… on each element separately. The else statement.

_loop(func,args) : _ high-value template function that applies the function func to each element of the argument args and aggregates the result into a tuple. If args contains three k arguments, each of length n, then the loop will run n times.

_unionAll: _ Merge multiple tables

2.1.2 Calculate the stock portfolio for the past day

The amount of data used during the backtest was very large, so we put the data into an in-memory partitioned database and then used parallel computation. For more information on partitioned databases, refer to the Dolphindb partitioned database tutorial. We save the data generated by the GenSignals function to the partition table PartSignals, where one partition represents one day. Next, create a partitioned table Ports to hold the calculated stock portfolio, with one partition representing one year. Then, using the map-reduce function, apply the formperiodPort function to each day and merge each result into the partitioned table ports.

def formPortfolio(signals, signalNames, stockPercentile){ dates = (select count(*) from signals group by date having count(*)>1000).date.sort() db = database("", VALUE, dates) partSignals = db.createPartitionedTable(signals, "signals", `date).append! (signals) db = database("", RANGE, datetimeParse(string(year(dates.first()) .. (year(dates.last()) + 1)) + ".01.01", "yyyy.MM.dd")) symType = (select top 10 sym from signals).sym.type() ports = db.createPartitionedTable(table(1:0, `tranche`sym`signalIdx`exposure, [DATE,symType,INT,DOUBLE]), "", `tranche) return mr(sqlDS(<select * from partSignals>), formPeriodPort{,signalNames,stockPercentile},,unionAll{,ports}) }

Dolphindb function description:

_sort: _ Sorts the elements in the vector

_database(directory, [partitionType], [partitionScheme], [locations]) : _ Create database. If Directory is empty, an in-memory database is created.

_createPartitionedTable(DBHandle, Table, [TableName], PartitionColumns) : _ Create a partitioned table in the database.

_datatimeParse(X, format) : _ Converts a string to a datatype in DolphinDB.

_unionAll: _ merge tables

_type: _ Returns the ID of the data type.

_mr(ds, mapFunc, [reduceFunc], [finalFunc], [parallel=true]) : _map-reduce function

2.2 Calculate positions, profits and losses

The task of this step is to generate the position and profit and loss of each stock on each factor of each Tranche for each day according to the position held and the time held. First define a nested function f to calculate the profit and loss of some stock investment positions, then apply the nested function to all stock portfolios (using MR function), calculate the profit and loss of all stock portfolios, and save the results in the partition table PNLS.

The input parameters to the function CaclStockPNL include:

  • Ports: Portfolio table for each day, including 4 fields Trche, Sym, Signalidx, Exposure
  • DailyRTN: Daily return of the stock, including 3 fields date, sym, ret
  • HoldingDays: The number of days a stock is held

The output of the function is a breakdown of the profit and loss of the stock, including 8 fields Date, SYM, Signalidx, Trche, Age, RET, Exposure, PNL

def calcStockPnL(ports, dailyRtn, holdingDays){ ages = table(1.. holdingDays as age) dates = sort exec distinct(tranche) from ports dictDateIndex = dict(dates, 1.. dates.size()) dictIndexDate = dict(1.. dates.size(), dates) lastDaysTable = select max(date) as date from dailyRtn group by sym lastDays = dict(lastDaysTable.sym, // define a anonymous function to calculate the pnl for a part of the porfolios. f = def(ports, dailyRtn, holdingDays, ages, dictDateIndex, dictIndexDate,lastDays){ pos = select dictIndexDate[dictDateIndex[tranche]+age] as date, sym, signalIdx, tranche, age, Take (0.0,size age) as ret, exposure, Take (0.0,size age) as dictdictdate from cj(ports,ages) where isValid(dictindex [tranche]+age]), dictIndexDate[dictDateIndex[tranche]+age]<=lastDays[sym] update pos set ret = dailyRtn.ret from ej(pos, dailyRtn,`date`sym) update pos set exposure = exposure*cumprod(1+ret) from pos context by tranche, signalIdx, sym update pos set pnl = exposure*ret/(1+ret) return pos } // calculate pnls for all portfolios and save the result to a  partitioned in-memory table pnls db = database("", RANGE, datetimeParse(string(year(dates.first()) .. (year(dates.last()) + 1)) + ".01.01", "yyyy.MM.dd")) symType = (select top 10 sym from ports).sym.type() modelPnls = table(1:0, `date`sym`signalIdx`tranche`age`ret`exposure`pnl, [DATE,symType,INT,DATE,INT,DOUBLE,DOUBLE,DOUBLE]) pnls = db.createPartitionedTable(modelPnls, "", `tranche) return mr(sqlDS(<select * from ports>), f{,dailyRtn,holdingDays,ages,dictDateIndex, dictIndexDate,lastDays},,unionAll{,pnls}) }

Dolphindb function description:

_dict(key, value) : _ Create a dictionary.

_cj(leftTable, rightTable) : _ Cross join two tables.

_isValid: _ checks if the element is NULL. If it is not NULL, it returns 1; if it is NULL, it returns 0.

_ej(leftTable, rightTable, matchingCols, [rightMatchingCols]) : _ equivalently join two tables.

3. Run the instance

Using the US stock market as an example, we run a multi-factor Alpha strategy back test. The input daily data sheet USPRICES contains six fields: SYM (ticker symbol), DATE (date), CLOSE (closing price), RET (daily return), MV (market value) and VOL (trading volume).

// Load data usPrices =... HoldingDays = 5 stockPercentile = 20 SignalNames = 'signal_mom' signal_vol 'signal_beta' signal_size // Ports = FormPortfolio (SignalNames, SignalNames, Signals, Signals, Signals, Signals, Signals, Signals, Signals, Signals, Signals, Signals, Signals, Signals DailyRtn = select sym,date,ret from signals pos = calcstockPnl (ports, dailyRtn) PNLS = select sum(PNL) as PNL from pos group by date, sum(PNL) as PNL from pos group by date signalIdx factorPnl = select pnl from pnls pivot by date, signalIdx plot(each(cumsum,factorPnl[`C0`C1`C2`C3]).rename! (signalNames),, PNLS = select sum(Pnl) as Pnl from pos where signalIdx=0  group by date, age momAgePnl = select pnl from pnls pivot by date, age plot(each(cumsum,momAgePnl[`C1`C2`C3`C4`C5]).rename! (`C1`C2`C3`C4`C5),

Cumulative profit and loss chart of 4 factors

The cumulative gain and loss chart of the momentum factor

Dolphindb is a general-purpose distributed time-series database, but was developed very efficiently because of its extremely efficient built-in multiparadigm programming language for quantifying transactions. The above multi-factor backtest framework uses only three custom functions and more than 50 lines of code. Dolphindb is even more efficient, a daily backtracking of 25 years’ worth of US stocks that produces a profit and loss statement of more than 100m entries. Such a complex amount of calculation, on a single machine (4 cores) only 50 seconds to execute.

4. Discuss

The previous backtest framework addresses only part of the problem of a multi-factor strategy, that is, the allocation of stocks in a single factor. We still have two important problems to solve :(1) How to allocate the weight among multiple factors to balance the return and risk of investment. (2) Does a new factor bring additional Alpha? In other words, can a new factor be represented by multiple existing factors? If so, it may not be necessary for this new factor to exist. In our next article we’ll look at DolphinDB in response to these two questions.