How do I fix unreasonable auto-labeling results?


(xinyan) #1

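I actually have two questions.
First: in the code below I train on instruments. The auto-labeling expression was written for me by one of your staff, and I added some feature factors of my own. The labeling result gives the maximum score to the vast majority of samples. Why does this happen?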

Please see the code below and the chart of the auto-labeling results:


class conf:
    start_date = '2016-01-01'
    end_date='2017-07-01'
    # Data before split_date is used for training; data after it is used for evaluation
    split_date = '2017-01-01'
    # D.instruments: https://bigquant.com/docs/data_instruments.html
    instruments = D.instruments(start_date, split_date)
    # Stock pool
    stockPool = ['601116.SHA','600760.SHA','600230.SHA','000856.SZA','300308.SZA','002302.SZA','600874.SHA',
                 '002647.SZA','002307.SZA','600679.SHA','002679.SZA','000877.SZA','300268.SZA','000615.SZA',
                 '000912.SZA','002346.SZA','000923.SZA','000401.SZA','300304.SZA','300355.SZA','300137.SZA',
                 '000635.SZA','600698.SHA','002602.SZA','300174.SZA','600516.SHA','002695.SZA','002379.SZA',
                 '000651.SZA','600321.SHA','002300.SZA','300176.SZA','000885.SZA','000818.SZA','000507.SZA',
                 '300402.SZA','002122.SZA','603616.SHA','002722.SZA','600069.SHA','002408.SZA','000933.SZA',
                 '600647.SHA','000959.SZA','002342.SZA','000916.SZA','300344.SZA','603969.SHA','002146.SZA',
                 '601992.SHA','002265.SZA','600693.SHA','000709.SZA','300334.SZA','002762.SZA','600545.SHA',
                 '600917.SHA','002457.SZA','000510.SZA','600477.SHA','600758.SHA','002620.SZA','600309.SHA',
                 '600215.SHA','002389.SZA','600291.SHA','300381.SZA','600148.SHA','300136.SZA','300175.SZA',
                 '600971.SHA','600008.SHA','600550.SHA','002515.SZA','000672.SZA','000717.SZA','000488.SZA',
                 '002785.SZA','002113.SZA','300225.SZA','600507.SHA','600068.SHA','601668.SHA','603800.SHA',
                 '600506.SHA','000022.SZA','002323.SZA','000605.SZA','002651.SZA','002415.SZA','002282.SZA',
                 '002703.SZA','002158.SZA','000409.SZA','002542.SZA','002755.SZA','002460.SZA','000678.SZA',
                 '000607.SZA','000921.SZA','600793.SHA','002050.SZA','600295.SHA','000567.SZA','002233.SZA',
                 '300215.SZA','603300.SHA','600326.SHA','600717.SHA','600592.SHA','300044.SZA','600802.SHA',
                 '002271.SZA','600620.SHA','603315.SHA','002732.SZA','600801.SHA','600421.SHA','000822.SZA',
                 '603799.SHA','002730.SZA','300072.SZA','002735.SZA','002780.SZA','000514.SZA','600487.SHA',
                 '601000.SHA','600816.SHA']

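    # Labeling function for the machine learning target
    # The expressions below are equivalent to min(max(raw_label, -20), 20) + 20
    # (M.fast_auto_labeler rounds the result to an integer afterwards).
    # raw_label is the holding-period return * 100 in the commented-out default,
    # and (high_price - buy_price) / buy_price * 100 in the active version.
    # Note: max/min clips the score to [-20, 20]; +20 makes it non-negative
    # (StockRanker requires labels to be non-negative integers)
    #label_expr = ['return * 100', 'where(label > {0}, {0}, where(label < -{0}, -{0}, label)) + {0}'.format(20)]
    label_expr = ['(high_price-buy_price) / buy_price*100', 'where(label > {0}, {0}, where(label < -{0}, -{0}, label)) + {0}'.format(20)]
    # Holding period in days, used to compute the return value in label_expr
    hold_days = 60

    # Features: https://bigquant.com/docs/data_features.html (you can build any feature from expressions)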
    features = [
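        'mf_net_amount_0',  # net active buying amount
        'sh_holder_avg_pct_0',  # average shareholding percentage per shareholder account
        'turn_0',
        'amount_0 / market_cap_0',   # trading amount as a share of total market cap
        'amount_0 / market_cap_float_0',    # trading amount as a share of float market cap
        'sh_holder_avg_pct_3m_chng_0'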
        
    ]

# Label the data: score every row (sample); a higher score generally means better
m1 = M.fast_auto_labeler.v8(
    instruments=conf.instruments, start_date=conf.start_date, end_date=conf.split_date,
    label_expr=conf.label_expr, hold_days=conf.hold_days,
    benchmark='000300.SHA', sell_at='open', buy_at='open')
# Compute the feature data
m2 = M.general_feature_extractor.v5(
    instruments=conf.instruments, start_date=conf.start_date, end_date=conf.split_date,
    features=conf.features)
# Preprocessing: handle missing data and normalize; T.get_stock_ranker_default_transforms preprocesses data for the StockRanker model
m3 = M.transform.v2(
    data=m2.data, transforms=T.get_stock_ranker_default_transforms(),
    drop_null=True, astype='int32', except_columns=['date', 'instrument'],
    clip_lower=0, clip_upper=200000000)
# Join the label data with the feature data
m4 = M.join.v2(data1=m1.data, data2=m3.data, on=['date', 'instrument'], sort=True)
# Train the StockRanker model
m5 = M.stock_ranker_train.v3(training_ds=m4.data,  features=conf.features)


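Second: in the code above I defined a variable called stockPool. My intention was to use this stock pool to find what these stocks have in common across the feature factors listed in features, but I don't know how to write that. For one thing, the example strategies all train on every stock, which gives plenty of data, whereas my pool has only a little over 100 stocks, so I can't simply follow that template. For another, in one of the strategy-writing tutorials I saw code that sets up the training and test datasets, as follows: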

# Select the training set
m5_training = M.filter.v1(data=m4.data, expr='date < "2016-01-02"')
# Select the validation data
m5_training_test = M.filter.v1(data=m4.data, 
                   expr='"2014-01-01" <= date < "2016-01-01"')
# Select the test data (out of sample)
m5_evaluation = M.filter.v1(data=m4.data, expr='"2016-01-01" <= date')
# Train the model
m6 = M.stock_ranker_train.v1(
         training_ds=m5_training.data, test_ds=m5_training_test.data,
         features=conf.features,
         number_of_leaves=30, minimum_docs_per_leaf=1000, 
         number_of_trees=20, learning_rate=0.1)
# Predict with the model
m7 = M.stock_ranker_predict.v1(model_id=m6.model_id, 
         data=m5_evaluation.data)

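However, in the example above the training data and the test data overlap, and it is still based on all stocks.
So I would like to ask how I should implement my model.
The goal of my model is to find out whether the stocks in the pool above share anything in common, during the two phases "just before the rally" and "during the rally", in their turnover rate, trading amount as a share of total market cap, net active buying amount, and average shareholding percentage per account. Put simply: I feed in a batch of strong performers, the trained model should learn to recognize the features these stocks show when they take off, and I then use that model for prediction, with a good chance of finding strong stocks that are rising or about to rise.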


(小马哥) #2

Nice idea.

  • Question 1:


 label_expr = ['(high_price-buy_price) / buy_price*100', 'where(label > {0}, {0}, where(label < -{0}, -{0}, label)) + {0}'.format(20)]

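The label distribution shows far too many samples with a score of 40. The reason is that your holding period is very long (hold_days = 60), so a great many stocks gain more than 20% within 60 days. (Note: the labeling function adds 20, so a gain of 20% or more maps to a score of 40.) A simple fix is to adjust the labeling function, for example by changing 20 to 45.
Beyond that, you could use a more elaborate labeling function.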

label_expr = ['(high_price-buy_price) / buy_price*100', 'where(label > {0}, {0}, where(label < -{0}, -{0}, label)) + {0}'.format(20)]
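
To make the saturation concrete, here is a minimal sketch in plain NumPy rather than the BigQuant API. The helper clip_label and the sample gains are made up for illustration, and it assumes the labeler truncates to an integer (the original comment only says it converts to an integer).

```python
import numpy as np

def clip_label(raw_pct, bound):
    """Mimic where(label > b, b, where(label < -b, -b, label)) + b,
    then truncate to an integer (assumed rounding behaviour)."""
    clipped = np.clip(raw_pct, -bound, bound) + bound
    return clipped.astype(int)

# Hypothetical 60-day gains in percent, i.e. (high_price - buy_price) / buy_price * 100
raw = np.array([3.2, 18.7, 25.0, 41.3, 60.0, 22.4])

labels_20 = clip_label(raw, 20)  # bound used in the question
labels_45 = clip_label(raw, 45)  # wider bound suggested above

print(labels_20)
print(labels_45)
```

With a bound of 20, every gain of 20% or more collapses into the single score 40; with a bound of 45 the same samples spread over several distinct scores, which gives the ranker more to learn from.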

  • Question 2:

m5_training = M.filter.v1(data=m4.data, expr='date < "2016-01-02"')
# Select the validation data
m5_training_test = M.filter.v1(data=m4.data, 
                   expr='"2014-01-01" <= date < "2016-01-01"')

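As the code shows, the training set and the test set do not overlap. Rather, part of the training set is also used as the validation set. I think that should be fine!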
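
The relationship between the three filters can be checked with a small pandas sketch. This is only an illustration: the DataFrame is a made-up stand-in for m4.data, and the date conditions are copied from the quoted filter expressions.

```python
import pandas as pd

# Hypothetical stand-in for m4.data: one row per (date, instrument)
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-06-03", "2014-03-03", "2015-09-01", "2016-02-01", "2016-06-01"]),
    "instrument": ["000651.SZA"] * 5,
})

# Same date conditions as the three M.filter expressions quoted above
train = df[df["date"] < pd.Timestamp("2016-01-02")]
valid = df[(df["date"] >= pd.Timestamp("2014-01-01"))
           & (df["date"] < pd.Timestamp("2016-01-01"))]
test = df[df["date"] >= pd.Timestamp("2016-01-01")]

# Row indices shared between the sets
overlap_train_valid = train.index.intersection(valid.index)
overlap_train_test = train.index.intersection(test.index)

print(list(overlap_train_valid))  # validation rows that are also training rows
print(list(overlap_train_test))   # rows shared by training and test
```

The validation rows are a subset of the training rows, while the test set shares none. If a validation set disjoint from training is preferred, one option is to cut the training filter at the validation start date (date < "2014-01-01"). Note also that both the training and test conditions admit 2016-01-01 itself; that date is a public holiday, so it most likely contributes no rows.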


(xinyan) #3

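Thanks for the detailed answer. I still have some questions:
Can I replace instruments with stockPool? Would the small amount of data make the model unreliable?
Also, is there anything wrong with the strategy as I have written it, and can it achieve what I have in mind?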
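
One way to gauge the data-volume concern before changing the pipeline is to count how many samples remain once the table is limited to the pool. The sketch below uses plain pandas, not the BigQuant API: the DataFrame is a made-up stand-in for the joined label/feature table, and stock_pool holds only three entries of stockPool for brevity.

```python
import pandas as pd

# Three entries taken from stockPool, for illustration only
stock_pool = ["601116.SHA", "600760.SHA", "000651.SZA"]

# Hypothetical stand-in for the joined label/feature table (m4.data)
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-04"] * 4 + ["2016-01-05"] * 4),
    "instrument": ["601116.SHA", "600760.SHA", "000651.SZA", "000001.SZA"] * 2,
    "turn_0": [1.2, 3.4, 0.8, 0.5, 1.1, 2.9, 0.9, 0.6],
})

# Keep only rows whose instrument is in the pool
pool_df = df[df["instrument"].isin(stock_pool)]

# Number of samples per stock and in total
rows_per_stock = pool_df.groupby("instrument").size()
print(rows_per_stock)
print(len(pool_df))
```

The total can then be compared against tree settings such as minimum_docs_per_leaf=1000 in the tutorial code quoted earlier; if the pool yields only a few tens of thousands of rows, such a setting leaves room for very few leaves.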