Custom labels: runtime error "['return_90/return_10'] not in index"


(woody) #1
In [72]:
import numpy as np  # needed by atr() below

# Basic parameter configuration
class conf:
    start_date = '2017-01-01'
    end_date='2017-07-10'
    split_date = '2017-05-01'
    instruments = D.instruments(start_date, end_date)
    hold_days = 3
    features = ['return_90/return_10']

# Compute the labels manually
df = D.history_data(conf.instruments, start_date=conf.start_date,end_date=conf.end_date,fields=['close','open','high','low','amount'])
 
# Helper that adds a column to a DataFrame
def add_column(df, series, name):
    df[name] = series
    return df

# ATR (average true range) over a rolling window
def atr(high,low,close,window):
    a=high-low
    b=np.abs(close.shift(1)-high)
    c=np.abs(close.shift(1)-low)
    tr=a.where(a>b,b)
    tr=tr.where(tr>c,c)
    return tr.rolling(window).mean()

# Compute the ATR per instrument
df = df.groupby('instrument').apply(lambda x:add_column(x,atr(x.high,x.low,x.close,50),'ATR'))
# Compute the label; 'close' could be tried here instead of 'open'
df = df.groupby('instrument').apply(lambda x:add_column(x,(x.open.shift(-4)-x.open.shift(-1))/x.ATR,'label'))

# Transform the label and clamp it to lower and upper bounds
df['label']=df.label*5+10
df.label=df.label.where(df.label<20,20)
df.label=df.label.where(df.label>0,0)
 
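# Drop the columns that are no longer needed
df.drop(['low','high','amount','close','ATR','open'],axis=1,inplace=True)
# df['label'] = df['label']*10 # if the labels are mostly small decimals, they become nearly identical after the integer cast, which can make model training fail
# Labels must be integers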
df.label = df.label.astype('int')
label_ds = DataSource.write_df(df)


# Compute the feature data
m2 = M.general_feature_extractor.v5(
    instruments=conf.instruments, start_date=conf.start_date, end_date=conf.end_date,
    features=conf.features)
# Data preprocessing: handle missing data and normalize; T.get_stock_ranker_default_transforms provides the preprocessing for the StockRanker model
m3 = M.transform.v2(
    data=m2.data, transforms=T.get_stock_ranker_default_transforms(),
    drop_null=True, astype='int32', except_columns=['date', 'instrument'],
    clip_lower=0, clip_upper=200000000)
# Join the labels with the feature data
m4 = M.join.v2(data1=label_ds, data2=m3.data, on=['date', 'instrument'], sort=True)
# Training dataset
m5_training = M.filter.v2(data=m4.data, expr='date < "%s"' % conf.split_date)
[2017-12-29 17:08:35.436123] INFO: bigquant: general_feature_extractor.v5 started running..
[2017-12-29 17:08:46.189837] INFO: general_feature_extractor: year 2017, featurerows=367259
[2017-12-29 17:08:46.197806] INFO: general_feature_extractor: total feature rows: 367259
[2017-12-29 17:08:46.206346] INFO: bigquant: general_feature_extractor.v5 finished [10.770288s].
[2017-12-29 17:08:46.218764] INFO: bigquant: transform.v2 started running..
[2017-12-29 17:08:46.928719] INFO: transform: transformed /y_2017, 344370/367259
[2017-12-29 17:08:46.938596] INFO: transform: transformed rows: 344370/367259
[2017-12-29 17:08:46.957543] INFO: bigquant: transform.v2 finished [0.738646s].
[2017-12-29 17:08:46.966465] INFO: bigquant: join.v2 started running..
[2017-12-29 17:08:47.691002] INFO: join: /y_2017, rows=344370/344370, timetaken=0.604629s
[2017-12-29 17:08:47.743226] INFO: join: total result rows: 344370
[2017-12-29 17:08:47.745854] INFO: bigquant: join.v2 finished [0.77933s].
[2017-12-29 17:08:47.755572] INFO: bigquant: filter.v2 started running..
[2017-12-29 17:08:47.761287] INFO: filter: filter with expr date < "2017-05-01"
[2017-12-29 17:08:48.026656] INFO: filter: filter /y_2017, 209857/344370
[2017-12-29 17:08:48.042154] INFO: bigquant: filter.v2 finished [0.28659s].
In [73]:
m6 = M.stock_ranker_train.v5(training_ds=m5_training.data,features=conf.features)
[2017-12-29 17:09:02.056764] INFO: bigquant: stock_ranker_train.v5 started running..
[2017-12-29 17:09:02.218865] INFO: df2bin: prepare bins ..
[2017-12-29 17:09:02.223193] INFO: df2bin: prepare data: training ..
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-73-2e869703c129> in <module>()
----> 1 m6 = M.stock_ranker_train.v5(training_ds=m5_training.data,features=conf.features)

KeyError: "['return_90/return_10'] not in index"

(iQuant) #2

The error says that there is no factor named 'return_90/return_10'. The cause is that the step that extracts derived factors is missing.
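
To make the idea of a derived factor concrete, here is a minimal sketch in plain pandas. It is not the platform's implementation: the toy DataFrame, its values, and the use of `DataFrame.eval` are assumptions for illustration. The base extractor is assumed to supply only the columns `return_90` and `return_10`; the derived step evaluates the expression and stores the result in a column named after the expression itself, which is the column name the trainer looks up.

```python
import pandas as pd

# Toy stand-in (hypothetical values) for the output of the base feature
# extractor: it holds the base factors, not the expression itself.
base = pd.DataFrame({
    "date": ["2017-01-03", "2017-01-04"],
    "instrument": ["000001.SZA", "000001.SZA"],
    "return_90": [1.10, 1.21],
    "return_10": [1.00, 1.10],
})
features = ["return_90/return_10"]

# Without a derived step the feature name is not a column,
# so selecting it raises a KeyError.
missing = [f for f in features if f not in base.columns]
try:
    base[features]
    raised = False
except KeyError:
    raised = True

# Illustrative "derived" step: evaluate each expression on the base
# columns and store the result under the expression's own name.
derived = base.copy()
for expr in features:
    derived[expr] = derived.eval(expr)

still_missing = [f for f in features if f not in derived.columns]
```

The same membership check could be run on the real training data before calling the trainer, assuming `m5_training.data.read_df()` returns a DataFrame in the same way `m6.feature_gains.read_df()` does.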

In [1]:
import numpy as np  # needed by atr() below

# Basic parameter configuration
class conf:
    start_date = '2017-01-01'
    end_date='2017-07-10'
    split_date = '2017-05-01'
    instruments = D.instruments(start_date, end_date)
    hold_days = 3
    features = ['return_90/return_10']

# Compute the labels manually
df = D.history_data(conf.instruments, start_date=conf.start_date,end_date=conf.end_date,fields=['close','open','high','low','amount'])
 
# Helper that adds a column to a DataFrame
def add_column(df, series, name):
    df[name] = series
    return df

# ATR (average true range) over a rolling window
def atr(high,low,close,window):
    a=high-low
    b=np.abs(close.shift(1)-high)
    c=np.abs(close.shift(1)-low)
    tr=a.where(a>b,b)
    tr=tr.where(tr>c,c)
    return tr.rolling(window).mean()

# Compute the ATR per instrument
df = df.groupby('instrument').apply(lambda x:add_column(x,atr(x.high,x.low,x.close,50),'ATR'))
# Compute the label; 'close' could be tried here instead of 'open'
df = df.groupby('instrument').apply(lambda x:add_column(x,(x.open.shift(-4)-x.open.shift(-1))/x.ATR,'label'))

# Transform the label and clamp it to lower and upper bounds
df['label']=df.label*5+10
df.label=df.label.where(df.label<20,20)
df.label=df.label.where(df.label>0,0)
 
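# Drop the columns that are no longer needed
df.drop(['low','high','amount','close','ATR','open'],axis=1,inplace=True)
# df['label'] = df['label']*10 # if the labels are mostly small decimals, they become nearly identical after the integer cast, which can make model training fail
# Labels must be integers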
df.label = df.label.astype('int')
label_ds = DataSource.write_df(df)


# Compute the feature data
m2 = M.general_feature_extractor.v5(
    instruments=conf.instruments, start_date=conf.start_date, end_date=conf.end_date,
    features=conf.features)
# Compute the derived features (this is the missing step that caused the error)
m2_1 = M.derived_feature_extractor.v2(input_data=m2.data,features=conf.features)
# Data preprocessing: drop rows with missing values; no transforms are applied here (transforms=None)
m3 = M.transform.v2(
    data=m2_1.data, transforms=None,
    drop_null=True)
# Join the labels with the feature data
m4 = M.join.v2(data1=label_ds, data2=m3.data, on=['date', 'instrument'], sort=True)
# Training dataset
m5_training = M.filter.v2(data=m4.data, expr='date < "%s"' % conf.split_date)
# Train the model
m6 = M.stock_ranker_train.v5(training_ds=m5_training.data,features=conf.features)
[2017-12-29 17:43:22.861105] INFO: bigquant: general_feature_extractor.v5 started running..
[2017-12-29 17:43:22.867981] INFO: bigquant: cache hit
[2017-12-29 17:43:22.872026] INFO: bigquant: general_feature_extractor.v5 finished [0.010956s].
[2017-12-29 17:43:22.912974] INFO: bigquant: derived_feature_extractor.v2 started running..
[2017-12-29 17:43:22.920413] INFO: bigquant: cache hit
[2017-12-29 17:43:22.926662] INFO: bigquant: derived_feature_extractor.v2 finished [0.013687s].
[2017-12-29 17:43:22.955324] INFO: bigquant: transform.v2 started running..
[2017-12-29 17:43:22.968784] INFO: bigquant: cache hit
[2017-12-29 17:43:22.970315] INFO: bigquant: transform.v2 finished [0.015012s].
[2017-12-29 17:43:22.997013] INFO: bigquant: join.v2 started running..
[2017-12-29 17:43:24.447635] INFO: join: /y_2017, rows=344370/344370, timetaken=1.225579s
[2017-12-29 17:43:24.500044] INFO: join: total result rows: 344370
[2017-12-29 17:43:24.508772] INFO: bigquant: join.v2 finished [1.511736s].
[2017-12-29 17:43:24.588813] INFO: bigquant: filter.v2 started running..
[2017-12-29 17:43:24.597374] INFO: filter: filter with expr date < "2017-05-01"
[2017-12-29 17:43:25.154565] INFO: filter: filter /y_2017, 209857/344370
[2017-12-29 17:43:25.182152] INFO: bigquant: filter.v2 finished [0.593352s].
[2017-12-29 17:43:25.211222] INFO: bigquant: stock_ranker_train.v5 started running..
[2017-12-29 17:43:25.512080] INFO: df2bin: prepare bins ..
[2017-12-29 17:43:25.541785] INFO: df2bin: prepare data: training ..
[2017-12-29 17:43:28.723615] INFO: stock_ranker_train: b7394b80 preparing training: 209857 rows
[2017-12-29 17:44:04.317550] INFO: bigquant: stock_ranker_train.v5 finished [39.106295s].
In [2]:
m6.feature_gains.read_df()
Out[2]:
               feature      gain
0  return_90/return_10  5.542992
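
A side note on the label code, independent of the KeyError: `Series.where(cond, other)` replaces every row where `cond` is False, and a comparison with NaN is always False. Rows whose label could not be computed (the first rows of the 50-day ATR window and the last rows affected by `shift(-4)`) therefore end up with the top label 20 instead of being dropped. Whether that is intended is not stated in the thread. The sketch below uses made-up numbers to show the effect, followed by one possible alternative that drops NaN rows before clipping; the alternative is a suggestion, not the author's method.

```python
import numpy as np
import pandas as pd

# Hypothetical raw labels; NaN marks a row whose label could not be computed.
raw = pd.Series([np.nan, -3.0, 0.4, 2.5])
scaled = raw * 5 + 10  # same scaling as in the thread

# Pattern used in the thread: NaN < 20 is False, so NaN is replaced by 20.
capped = scaled.where(scaled < 20, 20)
capped = capped.where(capped > 0, 0)

# Alternative sketch: drop NaN rows first, then clip to [0, 20] and cast.
clean = scaled.dropna().clip(lower=0, upper=20).astype("int")

print(capped.tolist())
print(clean.tolist())
```

Dropping rows changes the row counts, so a run with this alternative would not reproduce the logs above exactly.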

(woody) #3

Got it, thanks. The concept of derived factors is much clearer to me now.