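I actually have two questions.
First: in the code below I train on instruments. The auto-labeling expression was written for me by one of your staff, and I added some feature factors of my own. The vast majority of the resulting label scores come out at the maximum. Why does this happen?
Please see the code below and the screenshot of the auto-labeling results: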
class conf:
    start_date = '2016-01-01'
    end_date = '2017-07-01'
    # Data before split_date is used for training; data after it is used for evaluation
    split_date = '2017-01-01'
    # D.instruments: https://bigquant.com/docs/data_instruments.html
    instruments = D.instruments(start_date, split_date)
    # Stock pool
    stockPool = ['601116.SHA','600760.SHA','600230.SHA','000856.SZA','300308.SZA','002302.SZA','600874.SHA',
                 '002647.SZA','002307.SZA','600679.SHA','002679.SZA','000877.SZA','300268.SZA','000615.SZA',
                 '000912.SZA','002346.SZA','000923.SZA','000401.SZA','300304.SZA','300355.SZA','300137.SZA',
                 '000635.SZA','600698.SHA','002602.SZA','300174.SZA','600516.SHA','002695.SZA','002379.SZA',
                 '000651.SZA','600321.SHA','002300.SZA','300176.SZA','000885.SZA','000818.SZA','000507.SZA',
                 '300402.SZA','002122.SZA','603616.SHA','002722.SZA','600069.SHA','002408.SZA','000933.SZA',
                 '600647.SHA','000959.SZA','002342.SZA','000916.SZA','300344.SZA','603969.SHA','002146.SZA',
                 '601992.SHA','002265.SZA','600693.SHA','000709.SZA','300334.SZA','002762.SZA','600545.SHA',
                 '600917.SHA','002457.SZA','000510.SZA','600477.SHA','600758.SHA','002620.SZA','600309.SHA',
                 '600215.SHA','002389.SZA','600291.SHA','300381.SZA','600148.SHA','300136.SZA','300175.SZA',
                 '600971.SHA','600008.SHA','600550.SHA','002515.SZA','000672.SZA','000717.SZA','000488.SZA',
                 '002785.SZA','002113.SZA','300225.SZA','600507.SHA','600068.SHA','601668.SHA','603800.SHA',
                 '600506.SHA','000022.SZA','002323.SZA','000605.SZA','002651.SZA','002415.SZA','002282.SZA',
                 '002703.SZA','002158.SZA','000409.SZA','002542.SZA','002755.SZA','002460.SZA','000678.SZA',
                 '000607.SZA','000921.SZA','600793.SHA','002050.SZA','600295.SHA','000567.SZA','002233.SZA',
                 '300215.SZA','603300.SHA','600326.SHA','600717.SHA','600592.SHA','300044.SZA','600802.SHA',
                 '002271.SZA','600620.SHA','603315.SHA','002732.SZA','600801.SHA','600421.SHA','000822.SZA',
                 '603799.SHA','002730.SZA','300072.SZA','002735.SZA','002780.SZA','000514.SZA','600487.SHA',
                 '601000.SHA','600816.SHA']
    # Labeling function for the machine-learning target
    # The labeling function below is equivalent to min(max((return over the holding period * 100), -20), 20) + 20 (M.fast_auto_labeler rounds to an integer afterwards)
    # Note: max/min clamp the label score to the range [-20, 20]; +20 makes it non-negative (StockRanker requires non-negative integer label scores)
    #label_expr = ['return * 100', 'where(label > {0}, {0}, where(label < -{0}, -{0}, label)) + {0}'.format(20)]
    label_expr = ['(high_price-buy_price) / buy_price*100', 'where(label > {0}, {0}, where(label < -{0}, -{0}, label)) + {0}'.format(20)]
    # Holding period in days, used to compute the return value in label_expr
    hold_days = 60
    # Features: https://bigquant.com/docs/data_features.html (you can construct any feature with an expression)
    features = [
        'mf_net_amount_0',  # net active buy amount
        'sh_holder_avg_pct_0',  # average shareholding ratio per account
        'turn_0',
        'amount_0 / market_cap_0',  # trading amount as a share of market cap
        'amount_0 / market_cap_float_0',  # trading amount as a share of float market cap
        'sh_holder_avg_pct_3m_chng_0'
    ]
# Label the data: give each row (sample) a score; in general, a higher score is better
m1 = M.fast_auto_labeler.v8(
    instruments=conf.instruments, start_date=conf.start_date, end_date=conf.split_date,
    label_expr=conf.label_expr, hold_days=conf.hold_days,
    benchmark='000300.SHA', sell_at='open', buy_at='open')
# Compute the feature data
m2 = M.general_feature_extractor.v5(
    instruments=conf.instruments, start_date=conf.start_date, end_date=conf.split_date,
    features=conf.features)
# Preprocessing: handle missing data and normalize; T.get_stock_ranker_default_transforms preprocesses data for the StockRanker model
m3 = M.transform.v2(
    data=m2.data, transforms=T.get_stock_ranker_default_transforms(),
    drop_null=True, astype='int32', except_columns=['date', 'instrument'],
    clip_lower=0, clip_upper=200000000)
# Join the label data and the feature data
m4 = M.join.v2(data1=m1.data, data2=m3.data, on=['date', 'instrument'], sort=True)
# Train the StockRanker machine-learning model
m5 = M.stock_ranker_train.v3(training_ds=m4.data, features=conf.features)
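To check my own reading of the labeling expression, here is a small plain-Python sketch (it does not use BigQuant). It rests on two assumptions I have not confirmed: that high_price is the highest price reached during the holding window, and that the labeler truncates the score to an integer. The helper names clipped_label and max_gain_pct are made up for this illustration.

```python
def clipped_label(pct_gain, bound=20):
    """Clamp a percentage gain to [-bound, bound], then shift it to be non-negative."""
    clipped = min(max(pct_gain, -bound), bound)
    # Assumption: the labeler truncates to an integer; the real rounding rule may differ.
    return int(clipped + bound)


def max_gain_pct(buy_price, highs):
    """Percentage gain from the buy price to the highest price in the window (assumed meaning)."""
    return (max(highs) - buy_price) / buy_price * 100


# Hypothetical percentage gains, not real market data.
gains = [-35.0, -5.0, 0.0, 12.4, 20.0, 55.0]
labels = [clipped_label(g) for g in gains]
print(labels)
```

If that reading is right, every sample whose highest price within the 60-day holding window is at least 20% above the buy price gets the top score of 40, which may be related to what I am seeing.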
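Second: in the code above I defined a variable called stockPool. My original idea was to use this stock pool to find what these stocks have in common across the feature factors in the features variable, but I don't know how to write that. For one thing, the example strategies all train on every stock, so they have plenty of data, whereas my pool has only a little over 100 stocks, so I can't simply follow that template. For another, in a tutorial article on writing strategies I saw code that sets up the training and test datasets, as follows: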
# Select the training set
m5_training = M.filter.v1(data=m4.data, expr='date < "2016-01-02"')
# Select the validation data
m5_training_test = M.filter.v1(data=m4.data,
                               expr='"2014-01-01" <= date < "2016-01-01"')
# Select the test data (out of sample)
m5_evaluation = M.filter.v1(data=m4.data, expr='"2016-01-01" <= date')
# Train the model
m6 = M.stock_ranker_train.v1(
    training_ds=m5_training.data, test_ds=m5_training_test.data,
    features=conf.features,
    number_of_leaves=30, minimum_docs_per_leaf=1000,
    number_of_trees=20, learning_rate=0.1)
# Predict with the model
m7 = M.stock_ranker_predict.v1(model_id=m6.model_id,
                               data=m5_evaluation.data)
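However, in the example above the training data and the validation data overlap, and it is still based on all stocks.
So I would like to ask how best to implement my model.
The goal of my model is to find out whether the stocks in the pool above have anything in common, in the two phases just before a rally and during a rally, in their turnover rate, trading amount as a share of total market cap, net active buy amount, and average shareholding ratio per account. Put simply, I feed in a batch of strong performers ("bull stocks"); after training, the model should recognize the features these stocks show when they start to move, and using that model for prediction should then have a good chance of finding stocks that are rising or about to rise.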
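To make the data setup I have in mind concrete, here is a rough plain-Python sketch, not BigQuant code. It keeps only rows whose instrument is in the pool and splits them at a single cutoff date so that the training and evaluation sets cannot overlap. The function name split_pool_rows and the toy rows are made up for illustration; how to express the same thing with the platform's modules is exactly what I am unsure about.

```python
from datetime import date


def split_pool_rows(rows, pool, split_date):
    """Keep rows for pool instruments and split them at split_date without overlap."""
    pool_set = set(pool)
    in_pool = [r for r in rows if r["instrument"] in pool_set]
    # Strictly before the cutoff goes to training; the cutoff day and later go to evaluation.
    train = [r for r in in_pool if r["date"] < split_date]
    evaluation = [r for r in in_pool if r["date"] >= split_date]
    return train, evaluation


# Toy rows with made-up feature values; '000001.SZA' is deliberately outside the pool.
rows = [
    {"date": date(2016, 6, 1), "instrument": "601116.SHA", "turn_0": 1.2},
    {"date": date(2016, 12, 30), "instrument": "600760.SHA", "turn_0": 3.4},
    {"date": date(2017, 1, 3), "instrument": "601116.SHA", "turn_0": 2.2},
    {"date": date(2017, 3, 1), "instrument": "600760.SHA", "turn_0": 0.9},
    {"date": date(2016, 6, 1), "instrument": "000001.SZA", "turn_0": 5.0},
]
train, evaluation = split_pool_rows(rows, ["601116.SHA", "600760.SHA"], date(2017, 1, 1))
print(len(train), len(evaluation))
```

With the cutoff set to the same value as conf.split_date, each pool row lands in exactly one of the two sets.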