Dividend data feature extraction example

Strategy Sharing

(达达) #1

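The platform's new data API exposes a lot of financial data. This unstructured financial data is usually published at irregular intervals rather than on a fixed schedule. Taking the cash dividend ratio as an example, this post extracts one feature: the rolling one-year sum of cash dividend ratios.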

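The documentation lists the fields of the stock dividend and allotment table (股票分红送配). There we can see that the cash dividend ratio corresponds to the field XJFH, and likewise that the ex-rights/ex-dividend date is the field CQCXR.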


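1. First, use the data source module to specify the table to extract, dividend_send_CN_STOCK_A; use the instrument list module to specify the stock universe and the start and end dates; and use the input feature list module to specify the required fields, here CQCXR and XJFH.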

2. Next, compute the yearly dividend ratio as follows:

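  • Keep only the rows where CQCXR is not '-', which indicates that the ex-rights/ex-dividend record was actually executed.
  • The date column holds the most recent financial report period the announcement belongs to (June or December). We extract its month and, grouping by stock, compare the month of the current record with that of the previous announcement to decide whether the two should be summed into the yearly cash dividend ratio (the code below adds the previous record's ratio when the two months are equal).
  • Use the ex-rights/ex-dividend date CQCXR as the date index, replacing the original date column.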
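
The filtering and summing rule in step 2 can be sketched on a toy table. This is a minimal pandas sketch, not the platform's implementation: the rows in `raw` are made-up values and the helper name `yearly_dividend` is hypothetical; only the column names (`instrument`, `date`, `CQCXR`, `XJFH`) come from this post. It mirrors the rule used in the strategy code later in the post, which adds the previous record's ratio when both records fall in the same report month.

```python
import pandas as pd

# Toy records; all values are made up for illustration only.
raw = pd.DataFrame({
    "instrument": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2012-12-31", "2013-12-31", "2014-06-30", "2013-12-31"]),
    "CQCXR": ["2013-06-10", "2014-06-12", "-", "2014-05-20"],
    "XJFH": ["1.0", "1.5", "0.5", "-"],
})


def yearly_dividend(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sum a record with the previous one for the same stock."""
    # Keep only records whose ex-rights/ex-dividend date is set.
    df = df[df["CQCXR"] != "-"].copy()
    # Treat the '-' placeholder as 0.0 and convert the ratio to float.
    df["XJFH"] = df["XJFH"].replace("-", "0.0").astype(float)
    df["CQCXR"] = pd.to_datetime(df["CQCXR"])
    df["month"] = df["date"].dt.month

    grouped = df.groupby("instrument")
    prev_month = grouped["month"].shift(1)
    prev_xjfh = grouped["XJFH"].shift(1)

    # Add the previous ratio only when both records share the same report month.
    same_month = prev_month == df["month"]
    df["yearly_XJFH"] = df["XJFH"] + prev_xjfh.where(same_month, 0.0)

    # Re-index the result on the ex-rights/ex-dividend date.
    df = df.drop(columns=["date"]).rename(columns={"CQCXR": "date"})
    return df[["date", "instrument", "yearly_XJFH"]].reset_index(drop=True)


result = yearly_dividend(raw)
print(result)
```

In this toy table the third row for stock A is dropped because its CQCXR is '-', and the second row for A picks up the previous record's ratio because both report periods fall in December.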

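3. Use a data source module to extract the daily bar data and merge it with the computed yearly dividend ratio. Because other financial-report information may be published on non-trading days, the merge uses the outer method to keep the workflow uniform and to make sure no financial data is lost.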
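
To illustrate why step 3 uses an outer merge, here is a small pandas sketch with made-up rows. It assumes the join behaves like `pandas.merge` on the `date` and `instrument` keys; the platform's join module may differ internally. The frames `dividends` and `trading_days` are hypothetical stand-ins for the two inputs.

```python
import pandas as pd

# One dividend record dated on a Saturday (not a trading day); values are made up.
dividends = pd.DataFrame({
    "date": pd.to_datetime(["2014-06-07"]),
    "instrument": ["A"],
    "yearly_XJFH": [2.5],
})

# Trading calendar reduced to the date and instrument columns.
trading_days = pd.DataFrame({
    "date": pd.to_datetime(["2014-06-06", "2014-06-09", "2014-06-10"]),
    "instrument": ["A", "A", "A"],
})

# An outer merge keeps the Saturday record even though no bar exists for that day.
merged = pd.merge(dividends, trading_days, on=["date", "instrument"], how="outer")
merged = merged.sort_values(["instrument", "date"]).reset_index(drop=True)
print(merged)
```

With an inner merge at this point the Saturday record would be discarded before it could be propagated to the neighbouring trading days.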

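4. Use the missing-data fill module, grouping by instrument as the key, to fill forward and then backward, so that the financial data is extended onto every trading day.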
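
The per-instrument fill in step 4 can be sketched in pandas as a forward fill followed by a back fill inside each group. This is an assumption about what the fill module does, based on its option name (fill down, then up); the rows below are made up.

```python
import numpy as np
import pandas as pd

# Merged table after the outer join; NaN marks days without a dividend record.
merged = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-06-06", "2014-06-07", "2014-06-09", "2014-06-06", "2014-06-09"]
    ),
    "instrument": ["A", "A", "A", "B", "B"],
    "yearly_XJFH": [np.nan, 2.5, np.nan, 1.0, np.nan],
})

# Filling is order-dependent, so sort by date within each instrument first.
merged = merged.sort_values(["instrument", "date"]).reset_index(drop=True)

filled = merged.copy()
# Forward-fill, then back-fill the remaining leading gaps, one instrument at a time.
filled["yearly_XJFH"] = merged.groupby("instrument")["yearly_XJFH"].transform(
    lambda s: s.ffill().bfill()
)
print(filled)
```

Note that the back fill copies a later value onto earlier dates (stock A on 2014-06-06 here), which is worth keeping in mind if the factor feeds a backtest.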

5. Merge the filled data with the daily bar data once more, this time with the inner method, to remove the non-trading-day rows.
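
A minimal sketch of this final trimming step, again assuming the join behaves like `pandas.merge` and using made-up rows: the inner merge against the trading calendar drops every row whose date is not a trading day.

```python
import pandas as pd

# Filled table that still contains a Saturday row; values are made up.
filled = pd.DataFrame({
    "date": pd.to_datetime(["2014-06-06", "2014-06-07", "2014-06-09"]),
    "instrument": ["A", "A", "A"],
    "yearly_XJFH": [2.5, 2.5, 2.5],
})

# Trading calendar reduced to the date and instrument columns.
trading_days = pd.DataFrame({
    "date": pd.to_datetime(["2014-06-06", "2014-06-09"]),
    "instrument": ["A", "A"],
})

# The inner merge keeps only the (date, instrument) pairs present in both tables.
factor = pd.merge(filled, trading_days, on=["date", "instrument"], how="inner")
print(factor)
```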

After these steps the rolling one-year dividend ratio factor is complete and can later be merged with other factor data.

    In [61]:
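    # This code was auto-generated by the visual strategy environment on 2019-03-06 16:36
    # This code cell can only be edited in visual mode. You can also copy the code, paste it into a new code cell or strategy, and then modify it.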
    
    
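    import pandas as pd

    # Python entry function: input_1/2/3 map to the three input ports, data_1/2/3 to the three output ports
    def m5_run_bigquant_run(input_1, input_2, input_3):
        df = input_1.read_df()
        # Keep only records with an ex-rights/ex-dividend date; copy so the assignments below act on a new frame
        df = df[df['CQCXR'] != '-'].copy()
        # Treat the '-' placeholder as 0.0 and convert the cash dividend ratio to float
        df['XJFH'] = df['XJFH'].replace('-', '0.0').astype(float)
        df['CQCXR'] = pd.to_datetime(df['CQCXR'])
        # Month of the report period the announcement belongs to
        df['month'] = df['date'].apply(lambda x: x.month)
        # Previous record's month and ratio for the same stock
        df['last_month'] = df.groupby('instrument')['month'].shift(1)
        df['last_XJFH'] = df.groupby('instrument')['XJFH'].shift(1)
        # signal is 1 when the previous record falls in the same report month, else 0
        df['signal'] = (df['last_month'] == df['month']).astype(int)
        df['last_XJFH'] = df['signal'] * df['last_XJFH']
        df['yearly_XJFH'] = df['last_XJFH'] + df['XJFH']
        # The first record of each stock has no predecessor, so fall back to its own ratio
        df['yearly_XJFH'] = df['yearly_XJFH'].fillna(df['XJFH'])
        # Use the ex-rights/ex-dividend date as the date column
        df = df.drop(['date'], axis=1).rename(columns={'CQCXR': 'date'})
        data_1 = DataSource.write_df(df[['date', 'instrument', 'yearly_XJFH']])
        return Outputs(data_1=data_1)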
    
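    # Post-processing function, optional. Its input is the main function's output; you can process the data here or return a friendlier outputs format. This function's output is not cached.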
    def m5_post_run_bigquant_run(outputs):
        return outputs
    
    
    m3 = M.instruments.v2(
        start_date='2010-01-01',
        end_date='2015-01-01',
        market='CN_STOCK_A',
        instrument_list='',
        max_count=0
    )
    
    m6 = M.use_datasource.v1(
        instruments=m3.data,
        datasource_id='bar1d_CN_STOCK_A',
        start_date='',
        end_date=''
    )
    
    m8 = M.select_columns.v3(
        input_ds=m6.data,
        columns='date,instrument',
        reverse_select=False
    )
    
    m4 = M.input_features.v1(
        features="""
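    # Lines starting with # are comments; a comment must be on its own line
    # Multiple features, one per line; base and derived features are both allowed, and they must be features of this platform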
    XJFH
    CQCXR"""
    )
    
    m2 = M.use_datasource.v1(
        instruments=m3.data,
        features=m4.data,
        datasource_id='dividend_send_CN_STOCK_A',
        start_date='',
        end_date=''
    )
    
    m5 = M.cached.v3(
        input_1=m2.data,
        run=m5_run_bigquant_run,
        post_run=m5_post_run_bigquant_run,
        input_ports='',
        params='{}',
        output_ports=''
    )
    
    m7 = M.join.v3(
        data1=m5.data_1,
        data2=m8.data,
        on='date,instrument',
        how='outer',
        sort=False
    )
    
    m1 = M.fill_nan.v1(
        input_1=m7.data,
        columns_input=['yearly_XJFH'],
        group_key=['instrument'],
        method='向下向上填充'  # platform option: fill down, then fill up
    )
    
    m10 = M.join.v3(
        data1=m1.data,
        data2=m8.data,
        on='date,instrument',
        how='inner',
        sort=False
    )
    
    In [62]:
    m10.data.read_df().tail()
    
    Out[62]:
                   date  instrument  yearly_XJFH
    2790757  2014-12-31  603369.SHA          1.0
    2790758  2014-12-31  002562.SZA          1.0
    2790759  2014-12-31  300370.SZA          2.5
    2790760  2014-12-31  000850.SZA          1.0
    2790761  2014-12-31  300213.SZA          1.1

(ZjFy) #2

The data in this table seems to have changed: the CQCXR feature is no longer there. How should I handle this?