Historical Documentation

[Historical Documentation] Factor Construction and Labeling Example: Alpha Factor Construction and Factor Testing

This article corresponds to the legacy platform and legacy resources. Its content no longer applies to the latest platform; please refer to the documentation for the new platform.

New quantitative development IDE (AIStudio):

https://bigquant.com/wiki/doc/aistudio-aiide-NzAjgKapzW

New template strategies:

https://bigquant.com/wiki/doc/demos-ecdRvuM1TU

New data platform:

https://bigquant.com/wiki/doc/dai-PLSbc1SbZX

New expression operators:

https://bigquant.com/wiki/doc/dai-sql-Rceb2JQBdS

New factor platform:

https://bigquant.com/wiki/doc/bigalpha-EOVmVtJMS5

Introduction

This article explains how to use bigexpr expressions to construct the 101 alphas published by WorldQuant and how to test them.

Background

In the paper "101 Formulaic Alphas" published by WorldQuant, 101 alpha factors are given in formulaic form. Unlike traditional approaches, these alphas were constructed through data mining; reportedly about 80% of them are still effective and are used in live trading projects.

On the BigQuant strategy research platform, factors can be constructed and data can be labeled quickly with expressions, with no need to write lengthy code by hand.

Introduction to Expressions

In machine learning and deep learning, factors (also called features) are a core concept, and feature selection is key to developing AI algorithms. A simple basic factor such as the 5-day return, close_0/close_5-1, is easy to construct. But a factor such as the correlation between daily returns and trading volume over the past 5 days is trickier and would otherwise require a fair amount of code to compute. For this reason we designed the bigexpr expression engine.

bigexpr is the expression evaluation engine developed by BigQuant. By writing simple expressions you can perform arbitrary computations on the data without writing program code.

bigexpr is used widely across the platform. Both M.advanced_auto_labeler and M.derived_feature_extractor are driven by bigexpr, so you can define labeling targets and perform feature extraction with expressions alone.

As just mentioned, the factor measuring the correlation between daily returns and trading volume over the past 5 days can be defined as follows:
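
The original page rendered this expression as an image that is not preserved here. A plausible form, reconstructed from the explanation that follows (a reconstruction, not the original image), is:

correlation(close_0/shift(close_0,1)-1, volume_0, 5)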

Here, correlation computes the correlation coefficient, close_0 is the current day's close, shift(close_0,1) is the previous day's close, and volume_0 is the current day's trading volume. As you can see, there is no need to write a lot of code to compute this factor; an expression constructs it quickly.

Function Reference

The expression engine ships with a number of simple functions. Some of them are explained below, and a short usage sketch follows the list:

  • Functions fall into two broad categories, cross-sectional functions and time-series functions; time-series function names usually start with ts_
  • abs(x) and log(x) return the absolute value and the natural logarithm of x, respectively
  • rank(x) gives a stock's ascending rank of x within the cross-section, normalized to the closed interval [0, 1]
  • delay(x,d) is the value of x as of d days ago
  • delta(x,d) is the latest value of x minus its value d days ago
  • correlation(x,y,d) and covariance(x,y,d) are the Pearson correlation coefficient and the covariance of x and y over a rolling window of length d
  • ts_min(x,d), ts_max(x,d), ts_argmax(x,d), ts_argmin(x,d), ts_rank(x,d), sum(x,d), stddev(x,d), and similar functions do what their names suggest
  • group_mean(key, x) averages x grouped by date and key; for example, group_mean(industry_sw_level1_0, pe_ttm_0) computes the simple average PE of each industry
  • ta_sma(x, timeperiod) computes the simple moving average over timeperiod periods
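
To make the reference above concrete, here are a few illustrative factor expressions built from these functions, written in the same name = expression form used for the 101 alphas later in this article (the names ret_5d, vol_price_corr, pe_vs_industry and ma_gap are made up for illustration and are not built-in factors):

# 5-day return
ret_5d = close_0 / delay(close_0, 5) - 1
# 5-day correlation between daily return and trading volume
vol_price_corr = correlation(close_0/shift(close_0,1)-1, volume_0, 5)
# PE relative to the industry average PE
pe_vs_industry = pe_ttm_0 / group_mean(industry_sw_level1_0, pe_ttm_0)
# gap between the close and its 10-day simple moving average
ma_gap = close_0 / ta_sma(close_0, 10) - 1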

Factor Reference

The BigQuant platform provides more than 2,000 built-in factors, covering basic information factors, price/volume factors, valuation factors, financial statement factors, technical indicator factors, and more. A few of them are listed below.

Basic Information Factors

Selected factors:

  • list_days # number of days since listing
  • list_board_0 # listing board
  • company_found_date_0 # days since the company was founded
  • industry_sw_level1_0 # Shenwan level-1 industry
  • st_status_0 # ST status
  • in_sse50_0 # whether the stock is a constituent of the SSE 50 index
  • in_csi300_0 # whether the stock is a constituent of the CSI 300 index

Price/Volume Factors

Selected factors:

  • open_0 # current day's open
  • open_1 # previous day's open
  • close_0 # current day's close
  • high_0 # current day's high
  • low_0 # current day's low
  • volume_0 # current day's trading volume
  • amount_0 # current day's turnover
  • adjust_factor_0 # adjustment factor

Valuation Factors

Selected factors:

  • market_cap_0 # total market capitalization
  • rank_market_cap_0 # rank of total market capitalization
  • pe_ttm_0 # price-to-earnings ratio (TTM)
  • rank_pe_ttm_0 # ascending percentile rank of PE (TTM)
  • pe_lyr_0 # price-to-earnings ratio (LYR)
  • pb_lf_0 # price-to-book ratio (LF)
  • ps_ttm_0 # price-to-sales ratio (TTM)

Financial Statement Factors

Selected factors:

  • fs_net_profit_0 # net profit attributable to shareholders of the parent company
  • fs_net_profit_yoy_0 # year-over-year growth of net profit attributable to the parent company
  • fs_net_profit_qoq_0 # quarter-over-quarter growth of net profit attributable to the parent company
  • fs_roe_0 # return on equity
  • fs_roa_0 # return on assets
  • fs_gross_profit_margin_0 # gross profit margin
  • fs_net_profit_margin_0 # net profit margin
  • fs_eps_0 # earnings per share
  • fs_bps_0 # net assets per share
  • fs_cash_ratio_0 # cash ratio

Data Labeling

Like factor construction, data labeling is a very important part of a machine learning workflow. More detailed documentation: Custom Labeling.

Before expressions were available, data labeling was mainly done with fast_auto_label; since expressions were introduced, it is mainly done with M.advanced_auto_labeler. The labeling logic lives in the label expression: use the M.instruments module to get the list of instruments, then write the label expression in the M.advanced_auto_labeler module, as shown in the code below.

Code:

m1 = M.instruments.v2(
    start_date='2014-01-01',
    end_date='2015-01-01',
    market='CN_STOCK_A',
    instrument_list='',
    max_count=0
)
m2 = M.advanced_auto_labeler.v2(
    instruments=m1.data,
    label_expr="""# Lines starting with # are comments
# 0. One expression per line, executed in order; from the second line on, the label field can be used
# 1. For available data fields, see https://bigquant.com/docs/data_history_data.html
#    Add the benchmark_ prefix to use the corresponding benchmark data
# 2. For available operators and functions, see `expression engine <https://bigquant.com/docs/big_expr.html>`_

# Compute the return: the close 5 days ahead (as the sell price) divided by the next day's open (as the buy price)
shift(close, -5) / shift(open, -1)

# Outlier handling: clip at the 1% and 99% quantiles
clip(label, all_quantile(label, 0.01), all_quantile(label, 0.99))

# Map the values to classes; 20 classes are used here
all_wbins(label, 20)

# Filter out one-tick limit-up cases (set label to NaN; NaN labels are ignored in later processing and training)
where(shift(high, -1) == shift(low, -1), NaN, label)
""",
    start_date='',
    end_date='',
    benchmark='000300.SHA',
    drop_na_label=True,
    cast_label_int=True
)

Next, let's walk through the example code:

  • label_expr contains four expressions (one per line) that together define the labeling steps; for details, see the expression engine documentation
  • The raw label is the relative return over a future window, computed with a bigexpr expression, which makes labeling fast
  • clip and all_quantile are used to handle outliers
  • The raw values are then discretized; either equal-width or equal-frequency binning can be used, each with its own trade-offs (see the sketch after this list)
  • The where function filters out samples that hit a one-tick limit-up
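
As an aside on the discretization step, the following is a minimal pandas sketch contrasting equal-width and equal-frequency binning; it uses pandas rather than the platform's built-ins, and all_wbins in the label expression above appears to perform the equal-width variant:

import numpy as np
import pandas as pd

# toy label values standing in for the raw future-return labels
label = pd.Series(np.random.randn(1000) * 0.02)

# equal-width binning: 20 bins spanning equal value ranges (conceptually what all_wbins(label, 20) does)
equal_width = pd.cut(label, bins=20, labels=False)

# equal-frequency binning: 20 bins holding roughly equal sample counts
equal_freq = pd.qcut(label, q=20, labels=False)

# equal-width bins keep a uniform value scale but can be nearly empty in the tails;
# equal-frequency bins balance class sizes but their edges depend on the sample
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())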

List of the 101 Alphas

['alpha_001 = (rank(ts_argmax(signedpower(where(((close_0/shift(close_0,1)-1) < 0), std((close_0/shift(close_0,1)-1), 20), close_0), 2), 5)) -0.5)',
'alpha_002 = (-1 * correlation(rank(delta(log(volume_0), 2)), rank(div((close_0 - open_0), open_0)), 6))',
'alpha_003 = (-1 * correlation(rank(open_0), rank(volume_0), 10))',
'alpha_004 = (-1 * ts_rank(rank(low_0), 9))',
'alpha_005 = (rank((open_0 - (sum(((high_0+low_0+open_0+close_0)*0.25), 10) / 10))) * (-1 * abs(rank((close_0 - ((high_0+low_0+open_0+close_0)*0.25))))))',
'alpha_006 = (-1 * correlation(open_0, volume_0, 10))',
'alpha_007 = where((mean(volume_0,20) < volume_0), ((-1 * ts_rank(abs(delta(close_0, 7)), 60)) * sign(delta(close_0, 7))), (-1* 1))',
'alpha_008 = (-1 * rank(((sum(open_0, 5) * sum((close_0/shift(close_0,1)-1), 5)) - delay((sum(open_0, 5) * sum((close_0/shift(close_0,1)-1), 5)),10))))',
'alpha_009 = where((0 < ts_min(delta(close_0, 1), 5)), delta(close_0, 1), where((ts_max(delta(close_0, 1), 5) < 0), delta(close_0, 1), (-1 * delta(close_0, 1))))',
'alpha_010 = rank(where((0 < ts_min(delta(close_0, 1), 4)), delta(close_0, 1), where((ts_max(delta(close_0, 1), 4) < 0), delta(close_0, 1), (-1 * delta(close_0, 1)))))',
'alpha_011 = ((rank(ts_max((((high_0+low_0+open_0+close_0)*0.25) - close_0), 3)) + rank(ts_min((((high_0+low_0+open_0+close_0)*0.25) - close_0), 3))) *rank(delta(volume_0, 3)))',
'alpha_012 = (sign(delta(volume_0, 1)) * (-1 * delta(close_0, 1)))',
'alpha_013 = (-1 * rank(covariance(rank(close_0), rank(volume_0), 5)))',
'alpha_014 = ((-1 * rank(delta((close_0/shift(close_0,1)-1), 3))) * correlation(open_0, volume_0, 10))',
'alpha_015 = (-1 * sum(rank(correlation(rank(high_0), rank(volume_0), 3)), 3))',
'alpha_016 = (-1 * rank(covariance(rank(high_0), rank(volume_0), 5)))',
'alpha_017 = (((-1 * rank(ts_rank(close_0, 10))) * rank(delta(delta(close_0, 1), 1))) *rank(ts_rank((div(volume_0, mean(volume_0,20))), 5)))',
'alpha_018 = (-1 * rank(((std(abs((close_0 - open_0)), 5) + (close_0 - open_0)) + correlation(close_0, open_0,10))))',
'alpha_019 = ((-1 * sign(((close_0 - delay(close_0, 7)) + delta(close_0, 7)))) * (1 + rank((1 + sum((close_0/shift(close_0,1)-1),250)))))',
'alpha_020 = (((-1 * rank((open_0 - delay(high_0, 1)))) * rank((open_0 - delay(close_0, 1)))) * rank((open_0 -delay(low_0, 1))))',
'alpha_021 = where((((sum(close_0, 8) / 8) + std(close_0, 8)) < (sum(close_0, 2) / 2)), (-1 * 1), where(((sum(close_0,2) / 2) < ((sum(close_0, 8) / 8) - std(close_0, 8))), 1, where(((1 < div(volume_0, mean(volume_0,20))) | (div(volume_0, mean(volume_0,20)) == 1)), 1, (-1 * 1))))',
'alpha_022 = (-1 * (delta(correlation(high_0, volume_0, 5), 5) * rank(std(close_0, 20))))',
'alpha_023 = where(((sum(high_0, 20) / 20) < high_0), (-1 * delta(high_0, 2)), 0)',
'alpha_024 = where((((delta((sum(close_0, 100) / 100), 100) / delay(close_0, 100)) < 0.05) | ((delta((sum(close_0, 100) / 100), 100) / delay(close_0, 100)) == 0.05)), (-1 * (close_0 - ts_min(close_0,100))), (-1 * delta(close_0, 3)))',
'alpha_025 = rank(((((-1 * (close_0/shift(close_0,1)-1)) * mean(volume_0,20)) * ((high_0+low_0+open_0+close_0)*0.25)) * (high_0 - close_0)))',
'alpha_026 = (-1 * ts_max(correlation(ts_rank(volume_0, 5), ts_rank(high_0, 5), 5), 3))',
'alpha_027 = where((0.5 < rank((sum(correlation(rank(volume_0), rank(((high_0+low_0+open_0+close_0)*0.25)), 6), 2) / 2.0))), (-1 * 1), 1)',
'alpha_028 = scale(((correlation(mean(volume_0,20), low_0, 5) + ((high_0 + low_0) / 2)) - close_0))',
'alpha_029 = (min(product(rank(rank(scale(log(sum(ts_min(rank(rank((-1 * rank(delta((close_0 - 1),5))))), 2), 1))))), 1), 5) + ts_rank(delay((-1 * (close_0/shift(close_0,1)-1)), 6), 5))',
'alpha_030 = div(((1.0 - rank(((sign((close_0 - delay(close_0, 1))) + sign((delay(close_0, 1) - delay(close_0, 2)))) +sign((delay(close_0, 2) - delay(close_0, 3)))))) * sum(volume_0, 5)), sum(volume_0, 20))',
'alpha_031 = ((rank(rank(rank(decay_linear((-1 * rank(rank(delta(close_0, 10)))), 10)))) + rank((-1 *delta(close_0, 3)))) + sign(scale(correlation(mean(volume_0,20), low_0, 12))))',
'alpha_032 = (scale(((sum(close_0, 7) / 7) - close_0)) + (20 * scale(correlation(((high_0+low_0+open_0+close_0)*0.25), delay(close_0, 5),230))))',
'alpha_033 = rank((-1 * ((1 - (open_0 / close_0))**1)))',
'alpha_034 = rank(((1 - rank(div(std((close_0/shift(close_0,1)-1), 2), std((close_0/shift(close_0,1)-1), 5)))) + (1 - rank(delta(close_0, 1)))))',
'alpha_035 = ((ts_rank(volume_0, 32) * (1 - ts_rank(((close_0 + high_0) - low_0), 16))) * (1 -ts_rank((close_0/shift(close_0,1)-1), 32)))',
'alpha_036 = (((((2.21 * rank(correlation((close_0 - open_0), delay(volume_0, 1), 15))) + (0.7 * rank((open_0- close_0)))) + (0.73 * rank(ts_rank(delay((-1 * (close_0/shift(close_0,1)-1)), 6), 5)))) + rank(abs(correlation(((high_0+low_0+open_0+close_0)*0.25),mean(volume_0,20), 6)))) + (0.6 * rank((((sum(close_0, 200) / 200) - open_0) * (close_0 - open_0)))))',
'alpha_037 = (rank(correlation(delay((open_0 - close_0), 1), close_0, 200)) + rank((open_0 - close_0)))',
'alpha_038 = ((-1 * rank(ts_rank(close_0, 10))) * rank((close_0 / open_0)))',
'alpha_039 = ((-1 * rank((delta(close_0, 7) * (1 - rank(decay_linear(div(volume_0, mean(volume_0,20)), 9)))))) * (1 +rank(sum((close_0/shift(close_0,1)-1), 250))))',
'alpha_040 = ((-1 * rank(std(high_0, 10))) * correlation(high_0, volume_0, 10))',
'alpha_041 = (((high_0 * low_0)**0.5) - ((high_0+low_0+open_0+close_0)*0.25))',
'alpha_042 = div(rank((((high_0+low_0+open_0+close_0)*0.25) - close_0)), rank((((high_0+low_0+open_0+close_0)*0.25) + close_0)))',
'alpha_043 = (ts_rank(div(volume_0, mean(volume_0,20)), 20) * ts_rank((-1 * delta(close_0, 7)), 8))',
'alpha_044 = (-1 * correlation(high_0, rank(volume_0), 5))',
'alpha_045 = (-1 * ((rank((sum(delay(close_0, 5), 20) / 20)) * correlation(close_0, volume_0, 2)) *rank(correlation(sum(close_0, 5), sum(close_0, 20), 2))))',
'alpha_046 = where((0.25 < (((delay(close_0, 20) - delay(close_0, 10)) / 10) - ((delay(close_0, 10) - close_0) / 10))), (-1 * 1), where(((((delay(close_0, 20) - delay(close_0, 10)) / 10) - ((delay(close_0, 10) - close_0) / 10)) < 0), 1, ((-1 * 1) * (close_0 - delay(close_0, 1)))))',
'alpha_047 = ((div((rank((1 / close_0)) * volume_0), mean(volume_0,20)) * ((high_0 * rank((high_0 - close_0))) / (sum(high_0, 5) /5))) - rank((((high_0+low_0+open_0+close_0)*0.25) - delay(((high_0+low_0+open_0+close_0)*0.25), 5))))',
'alpha_049 = where(((((delay(close_0, 20) - delay(close_0, 10)) / 10) - ((delay(close_0, 10) - close_0) / 10)) < (-1 *0.1)), 1, ((-1 * 1) * (close_0 - delay(close_0, 1))))',
'alpha_050 = (-1 * ts_max(rank(correlation(rank(volume_0), rank(((high_0+low_0+open_0+close_0)*0.25)), 5)), 5))',
'alpha_051 = where(((((delay(close_0, 20) - delay(close_0, 10)) / 10) - ((delay(close_0, 10) - close_0) / 10)) < (-1 *0.05)), 1, ((-1 * 1) * (close_0 - delay(close_0, 1))))',
'alpha_052 = ((((-1 * ts_min(low_0, 5)) + delay(ts_min(low_0, 5), 5)) * rank(((sum((close_0/shift(close_0,1)-1), 240) -sum((close_0/shift(close_0,1)-1), 20)) / 220))) * ts_rank(volume_0, 5))',
'alpha_053 = (-1 * delta(div(((close_0 - low_0) - (high_0 - close_0)), (close_0 - low_0)), 9))',
'alpha_054 = div((-1 * ((low_0 - close_0) * (open_0**5))), ((low_0 - high_0) * (close_0**5)))',
'alpha_055 = (-1 * correlation(rank(div((close_0 - ts_min(low_0, 12)), (ts_max(high_0, 12) - ts_min(low_0,12)))), rank(volume_0), 6))',
'alpha_056 = (0 - (1 * (rank(div(sum((close_0/shift(close_0,1)-1), 10), sum(sum((close_0/shift(close_0,1)-1), 2), 3))) * rank(((close_0/shift(close_0,1)-1) * market_cap_0)))))',
'alpha_057 = (0 - (1 * div((close_0 - ((high_0+low_0+open_0+close_0)*0.25)), decay_linear(rank(ts_argmax(close_0, 30)), 2))))',
'alpha_060 = (0 - (1 * ((2 * scale(rank((div(((close_0 - low_0) - (high_0 - close_0)), (high_0 - low_0)) * volume_0)))) -scale(rank(ts_argmax(close_0, 10))))))',
# 'alpha_061 = where(rank((((high_0+low_0+open_0+close_0)*0.25) - ts_min(((high_0+low_0+open_0+close_0)*0.25), 16.1219))) < rank(correlation(((high_0+low_0+open_0+close_0)*0.25), mean(volume_0,180), 17.9282)), 1, -1)',
# 'alpha_062 = ((rank(correlation(((high_0+low_0+open_0+close_0)*0.25), sum(mean(volume_0,20), 22.4101), 9.91009)) < rank(((rank(open_0) +rank(open_0)) < (rank(((high_0 + low_0) / 2)) + rank(high_0))))) * -1)',
# 'alpha_064 = (where(rank(correlation(sum(((open_0 * 0.178404) + (low_0 * (1 - 0.178404))), 12.7054),sum(mean(volume_0,120), 12.7054), 16.6208)) < rank(delta(((((high_0 + low_0) / 2) * 0.178404) + (((high_0+low_0+open_0+close_0)*0.25) * (1 -0.178404))), 3.69741)), 1, -1) * -1)',
# 'alpha_065 = (where(rank(correlation(((open_0 * 0.00817205) + (((high_0+low_0+open_0+close_0)*0.25) * (1 - 0.00817205))), sum(mean(volume_0,60),8.6911), 6.40374)) < rank((open_0 - ts_min(open_0, 13.635))), 1, -1) * -1)',
# 'alpha_066 = ((rank(decay_linear(delta(((high_0+low_0+open_0+close_0)*0.25), 3.51013), 7.23052)) + ts_rank(decay_linear(div((((low_0* 0.96633) + (low_0 * (1 - 0.96633))) - ((high_0+low_0+open_0+close_0)*0.25)), (open_0 - ((high_0 + low_0) / 2))), 11.4157), 6.72611)) * -1)',
# 'alpha_068 = (where(ts_rank(correlation(rank(high_0), rank(mean(volume_0,15)), 8.91644), 13.9333) <rank(delta(((close_0 * 0.518371) + (low_0 * (1 - 0.518371))), 1.06157)), 1, -1) * -1)',
# 'alpha_071 = max(ts_rank(decay_linear(correlation(ts_rank(close_0, 3.43976), ts_rank(mean(volume_0,180),12.0647), 18.0175), 4.20501), 15.6948), ts_rank(decay_linear((rank(((low_0 + open_0) - (((high_0+low_0+open_0+close_0)*0.25) +((high_0+low_0+open_0+close_0)*0.25))))**2), 16.4662), 4.4388))',
# 'alpha_072 = div(rank(decay_linear(correlation(((high_0 + low_0) / 2), mean(volume_0,40), 8.93345), 10.1519)), rank(decay_linear(correlation(ts_rank(((high_0+low_0+open_0+close_0)*0.25), 3.72469), ts_rank(volume_0, 18.5188), 6.86671),2.95011)))',
# 'alpha_073 = (max(rank(decay_linear(delta(((high_0+low_0+open_0+close_0)*0.25), 4.72775), 2.91864)),ts_rank(decay_linear((div(delta(((open_0 * 0.147155) + (low_0 * (1 - 0.147155))), 2.03608), ((open_0 *0.147155) + (low_0 * (1 - 0.147155)))) * -1), 3.33829), 16.7411)) * -1)',
# 'alpha_074 = (where(rank(correlation(close_0, sum(mean(volume_0,30), 37.4843), 15.1365)) <rank(correlation(rank(((high_0 * 0.0261661) + (((high_0+low_0+open_0+close_0)*0.25) * (1 - 0.0261661)))), rank(volume_0), 11.4791)), 1, -1)* -1)',
# 'alpha_075 = where(rank(correlation(((high_0+low_0+open_0+close_0)*0.25), volume_0, 4.24304)) < rank(correlation(rank(low_0), rank(mean(volume_0,50)),12.4413)), 1, -1)',
# 'alpha_077 = min(rank(decay_linear(((((high_0 + low_0) / 2) + high_0) - (((high_0+low_0+open_0+close_0)*0.25) + high_0)), 20.0451)),rank(decay_linear(correlation(((high_0 + low_0) / 2), mean(volume_0,40), 3.1614), 5.64125)))',
# 'alpha_078 = (rank(correlation(sum(((low_0 * 0.352233) + (((high_0+low_0+open_0+close_0)*0.25) * (1 - 0.352233))), 19.7428),sum(mean(volume_0,40), 19.7428), 6.83313))**rank(correlation(rank(((high_0+low_0+open_0+close_0)*0.25)), rank(volume_0), 5.77492)))',
# 'alpha_081 = (where(rank(log(product(rank((rank(correlation(((high_0+low_0+open_0+close_0)*0.25), sum(mean(volume_0,10), 49.6054),8.47743))**4)), 14.9655))) < rank(correlation(rank(((high_0+low_0+open_0+close_0)*0.25)), rank(volume_0), 5.07914)), 1, -1) * -1)',
'alpha_083 = div((rank(delay(div((high_0 - low_0), (sum(close_0, 5) / 5)), 2)) * rank(rank(volume_0))), div(((high_0 -low_0) / (sum(close_0, 5) / 5)), (((high_0+low_0+open_0+close_0)*0.25) - close_0)))',
# 'alpha_084 = signedpower(ts_rank((((high_0+low_0+open_0+close_0)*0.25) - ts_max(((high_0+low_0+open_0+close_0)*0.25), 15.3217)), 20.7127), delta(close_0,4.96796))',
# 'alpha_085 = (rank(correlation(((high_0 * 0.876703) + (close_0 * (1 - 0.876703))), mean(volume_0,30),9.61331))**rank(correlation(ts_rank(((high_0 + low_0) / 2), 3.70596), ts_rank(volume_0, 10.1595),7.11408)))',
# 'alpha_086 = (where(ts_rank(correlation(close_0, sum(mean(volume_0,20), 14.7444), 6.00049), 20.4195) < rank(((open_0+ close_0) - (((high_0+low_0+open_0+close_0)*0.25) + open_0))), 1, -1) * -1)',
# 'alpha_088 = min(rank(decay_linear(((rank(open_0) + rank(low_0)) - (rank(high_0) + rank(close_0))),8.06882)), ts_rank(decay_linear(correlation(ts_rank(close_0, 8.44728), ts_rank(mean(volume_0,60),20.6966), 8.01266), 6.65053), 2.61957))',
# 'alpha_092 = min(ts_rank(decay_linear(where((((high_0 + low_0) / 2) + close_0) < (low_0 + open_0), 1, -1), 14.7221),18.8683), ts_rank(decay_linear(correlation(rank(low_0), rank(mean(volume_0,30)), 7.58555), 6.94024),6.80584))',
# 'alpha_094 = ((rank((((high_0+low_0+open_0+close_0)*0.25) - ts_min(((high_0+low_0+open_0+close_0)*0.25), 11.5783)))**ts_rank(correlation(ts_rank(((high_0+low_0+open_0+close_0)*0.25),19.6462), ts_rank(mean(volume_0,60), 4.02992), 18.0926), 2.70756)) * -1)',
# 'alpha_095 = where(rank((open_0 - ts_min(open_0, 12.4105))) < ts_rank((rank(correlation(sum(((high_0 + low_0)/ 2), 19.1351), sum(mean(volume_0,40), 19.1351), 12.8742))**5), 11.7584), 1, -1)',
# 'alpha_096 = (max(ts_rank(decay_linear(correlation(rank(((high_0+low_0+open_0+close_0)*0.25)), rank(volume_0), 3.83878),4.16783), 8.38151), ts_rank(decay_linear(ts_argmax(correlation(ts_rank(close_0, 7.45404),ts_rank(mean(volume_0,60), 4.13242), 3.65459), 12.6556), 14.0365), 13.4143)) * -1)',
# 'alpha_098 = (rank(decay_linear(correlation(((high_0+low_0+open_0+close_0)*0.25), sum(mean(volume_0,5), 26.4719), 4.58418), 7.18088)) -rank(decay_linear(ts_rank(ts_argmin(correlation(rank(open_0), rank(mean(volume_0,15)), 20.8187), 8.62571),6.95668), 8.07206)))',
# 'alpha_099 = (where(rank(correlation(sum(((high_0 + low_0) / 2), 19.8975), sum(mean(volume_0,60), 19.8975), 8.8136)) <rank(correlation(low_0, volume_0, 6.28259)), 1, -1) * -1)',
'alpha_101 = div((close_0 - open_0), ((high_0 - low_0) + 0.001))']
# The commented-out alphas above currently have issues and cannot be extracted yet

The list above shows the 101 alphas published by WorldQuant together with their expressions. If you are interested, you can experiment with the single-factor test code below; the only change needed is to swap in the specific factor. We hope you can develop strategies that earn stable profits and discover new alphas.

Note: some of these factors are boolean, taking only the values 1 or -1; passing such a factor alone to StockRanker may raise an error and cause model training to fail. Long expression factors also need to be renamed with the short-alias approach used in the test example, and the alias conversion must be applied before the data is passed to the training model: the model reads factor data from the table using the specified feature names as column names, and after aliasing only the short-name columns remain, so passing the full expression as the feature name would fail to find the factor data and training would fail.
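
To illustrate the alias conversion mentioned in the note above, here is a minimal pandas sketch; the factor_map dictionary and the factor_df DataFrame are hypothetical stand-ins for the extracted factor table, not the platform's API:

import pandas as pd

# hypothetical mapping from short alias to the full factor expression
factor_map = {
    'alpha_101': 'div((close_0 - open_0), ((high_0 - low_0) + 0.001))',
}

# suppose factor_df is the extracted factor table, with the full expression as a column name
factor_df = pd.DataFrame({
    'div((close_0 - open_0), ((high_0 - low_0) + 0.001))': [0.12, -0.03, 0.45],
})

# rename the expression column to its short alias so that a model configured with the
# feature name 'alpha_101' can find the matching column in the table
factor_df = factor_df.rename(columns={expr: alias for alias, expr in factor_map.items()})
print(list(factor_df.columns))  # ['alpha_101']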

Single-Factor Test

Here we take the factor 'shift(close_0,15) / close_0' as an example to show how to run a single-factor test and develop an AI strategy based on a single factor.

https://bigquant.com/experimentshare/8f9be143a866441180d2a112528b2236

Related Reading

Summary: with the methods above, you can quickly construct factors and label data with expressions on the strategy research platform.

Tags

Data mining

Comments

  • Thanks for sharing; I used to work out the 101 alpha definitions on my own. Hope the commented-out part can be completed later.
  • Where can I look up the built-in system factors?
  • I looked them up from , but it doesn't feel like there are 2,000+.
  • A pitfall I hit: alpha_005 = (rank((open_0 - (sum(((high_0+low_0+open_0+close_0)*0.25), 10) / 10))) * (-1 * abs(rank((close_0 - ((high_0+low_0+open_0+close_0)*0.25)))))) In this factor there must be no space after alpha_005; the = sign must follow it directly. Otherwise, backtesting with the StockRanker from the "Single-Factor Test" above raises an error.
  • The formulas above all have a problem because they do not account for price adjustment. Turnover and volume are the real (unadjusted) traded figures, while open/close/high/low are backward-adjusted prices, so most factor values computed from these formulas are off. For example, 'alpha_006 = (-1 * correlation(open_0, volume_0, 10))' is, by the original formula's intent, the correlation between the unadjusted open price and volume; but because the price data is adjusted, what is actually computed here is the correlation between the backward-adjusted open and the unadjusted volume, which is not correct.