Taking on the Virtual Stock Prediction Challenge with AI on BigQuant (code included)

Virtual Stock Trend Prediction
Sinovation Ventures
Toutiao
Sogou
challengerai

(iQuant) #1

[Update 2017/10/31] Added automatic submission of results to challenger.ai; see the code toward the end of this post.

Sinovation Ventures, Sogou, and Toutiao have jointly launched a world-class AI competition. BigQuant, as a professional machine-learning platform, provides the competition data and AI algorithms to help you enter with ease, compete for generous cash prizes, and win the job and internship opportunities offered by the organizers.

AI Challenger: the Global AI Challenge

Sinovation Ventures, Sogou, and Toutiao have jointly announced the AI Challenger global AI competition. Together, the three aim to build China's largest research dataset collection and a world-class AI competition platform, driving research and innovation in Chinese artificial intelligence.

BigQuant

BigQuant helps you do quantitative investing with AI. Our newly launched BigStudio visual strategy editor lets you build machine-learning and deep-learning experiments faster and more simply, iterate on them quickly, and take part in the global AI challenge with ease!

BigStudio provides a WYSIWYG strategy-development environment with a rich set of modules covering data input and output, data transformation, model training, prediction, and quantitative trading. Just drag in data and modules, connect them, and configure parameters to build an AI strategy, so you can spend your creativity where it matters most.

Let's start the challenge

0. First, create a blank visual strategy

New > Visual Strategy - Blank Strategy

[image]

1. Data

BigQuant is an open platform that can quickly ingest any public or private data. The competition data is already available, so start by dragging it in.

Static Datasets > Public Data > challenger.ai > Virtual Stock Trend Prediction

2. Format conversion

The raw data is in CSV format. On the BigQuant platform we recommend, and default to, the higher-performance HDF format, so the first step is a format conversion. Then run it.

Data Processing > add two "Convert CSV to HDF" modules > connect them as shown below > Run All
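Under the hood the conversion is conceptually simple; here is a minimal pandas sketch of the same idea (the file names are hypothetical, and the actual module is the convert_csv_to_hdf.v1 call you will see in the generated code later in this post):

    import pandas as pd

    # One-time conversion: parse the CSV once, store it as HDF5 (needs PyTables).
    df = pd.read_csv('trainingset.csv')
    df.to_hdf('trainingset.h5', key='data', mode='w')

    # Subsequent loads from HDF are much faster than re-parsing the CSV.
    df = pd.read_hdf('trainingset.h5', key='data')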

3. Viewing the code

BigQuant blends the simple visual mode with the flexible coding mode. Switch to code mode to see the code that is generated and executed behind the visual graph, and to add more code cells of your own.

Switch to the code view, shown below:

4. Inspecting the data

The m1 and m2 labels on the modules are their variable names, which can be referenced directly in code.

Insert two code cells below (shortcut key: b) > add the following code > Run

Of course, you can, and very likely will need to, do more analysis and research on the data; all of it can be done in code cells.
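For example, two quick-look cells might read as follows (read_df() is the same accessor used in the code later in this post):

    # Cell 1: load the converted training data as a pandas DataFrame
    df = m1.data.read_df()
    df.shape

    # Cell 2: per-column summary statistics
    df.describe()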

5. Feature selection and extraction

Add an "Input Feature List" and a "Derived Feature Extractor" as shown (enter the base features you want to use, then extract derived features). Configure the features as follows:

If the feature-input editor on the right feels too small, click the button next to it to pop out a larger editor window, as shown:
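Each derived-feature expression is just a column-wise formula over the base features. A rough pandas equivalent of the expressions configured above (an illustration, not the platform's implementation):

    # Derived features corresponding to the configured expressions
    df['feature0/feature9'] = df.eval('feature0 / feature9')
    df['feature10+feature11'] = df.eval('feature10 + feature11')
    df['feature20/feature19'] = df.eval('feature20 / feature19')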

6. Splitting the dataset

To evaluate the model, we need to split the data into training and validation sets. For time-series data, a random split is not recommended; the organizers suggest splitting on the era column, which appears to encode time.

> For cross-validation, we recommend randomly drawing one or several whole eras from the training data's era column, rather than randomly sampling across all training examples: the latter causes severe overfitting, and avoiding it is the main reason the era column was added.

Take a look at the distribution of era:

[image]

We'll try using eras 1-15 for training and eras 16-20 for validation.

Add two "Data Filter" modules > set their conditions to era <= 15 and era > 15 respectively
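In plain pandas, the equivalent split is one line each (the filter modules do the same thing on the platform):

    train_df = df[df['era'] <= 15]   # eras 1-15: training
    valid_df = df[df['era'] > 15]    # eras 16-20: validation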

Here the feature construction is deliberately simple, for teaching purposes. Data and features determine the ceiling of machine learning; models and algorithms merely approach that ceiling. Features (or factors) are therefore a crucial part of any machine-learning experiment and deserve deep thought and study.

7. Training: model selection and parameters

BigQuant provides many optimized machine-learning algorithms. Here we use GBDT for the demonstration.

GBDT stands for Gradient Boosted Decision Tree. In recent years, the great majority of winning solutions on the data-science competition site Kaggle have used GBDT-style algorithms from the XGBoost package, so we encourage you to experiment with GBDT in this competition.

More on the GBDT algorithm

GBDT is a widely used algorithm for both classification and regression, and it performs well on many kinds of data. It also goes by other names, such as MART (Multiple Additive Regression Tree), GBRT (Gradient Boosted Regression Tree), and TreeNet.

The core idea of GBDT is that each iteration tries to reduce the residual left by the previous one, and to do so it builds a new model along the gradient direction in which the residual decreases. In other words, each new model in GBDT pushes the ensemble's residual down along the gradient.

The GBDT algorithm, in outline:
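In standard gradient-boosting notation (textbook form, not specific to this platform's implementation), step $m$ fits a tree $h_m$ to the pseudo-residuals of the loss $L$ and adds it with a learning rate $\nu$ (the eta parameter of the training module below):

    $$r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$$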

8. Evaluating the model

Once the model has been trained on the training set, run it on the validation set and judge its quality by the gap between the predictions and the ground truth.

The evaluation metric

[image]
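Written out, the metric is a weighted logloss, which matches exactly what the m9 cell below computes:

    $$\mathrm{logloss} = -\frac{\sum_i w_i \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)}{\sum_i w_i}$$

where $w_i$ is the sample weight, $y_i$ the true label, and $p_i$ the predicted probability of label 1.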

Model evaluation takes two steps:

  • Predict on the validation set: module m11 runs the trained GBDT model on the validation data

  • Compute the evaluation metrics: that is what modules m8 and m9 do

The strategy run log then shows the model's accuracy and logloss on the validation set.

9. Predicting on the test set

If the model passes evaluation, use it to predict on the test set. In this example, read the test data, extract the derived features, and feed them to the trained model to get its test-set predictions.

10. Downloading the results and submitting

  • Save the data to your user directory

The custom module m13 saves the test-set predictions to your user directory (a hand-rolled pandas alternative is sketched after this list).

You can find the saved dataset, in csv format, in the directory pane on the left.

  • Download locally and submit

      Select the csv file > right-click > Download
    

[image]
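If you would rather assemble the submission file by hand in a code cell, it is only a couple of lines of pandas (a sketch: test_ids and pred_probs are placeholder variables, and the exact column names should follow the submission format on the contest page):

    import pandas as pd

    # One row per test id, with the predicted probability of label 1.
    ret_df = pd.DataFrame({'id': test_ids, 'proba': pred_probs})
    ret_df.set_index('id').to_csv('result.csv')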

11. How to improve

  • More data analysis and research: from a code cell you can reference any module in the visual graph and pull out its data
  • More models and parameter tuning
  • Feature engineering: feature extraction, feature selection, feature construction
  • Cross-validation to avoid overfitting (see the sketch after this list)
  • And more
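For the era-based cross-validation the organizers recommend, here is a minimal sketch using scikit-learn's GroupKFold with the era column as the group key, so whole eras are held out together (an illustration, not a built-in platform module; feature_cols is a placeholder for your chosen feature columns):

    from sklearn.model_selection import GroupKFold

    X = df[feature_cols].values
    y = df['label'].values
    groups = df['era'].values   # samples from one era never straddle a split

    for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups):
        X_tr, X_va = X[train_idx], X[valid_idx]
        y_tr, y_va = y[train_idx], y[valid_idx]
        # fit the model on (X_tr, y_tr) and score it on (X_va, y_va)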

Related links

  1. Virtual Stock Trend Prediction competition home page
  2. Why Kaggle competitors favor XGBoost
  3. XGBoost primer, part 1
  4. Where is the AI "stock god"? Sinovation Ventures is splashing out on a competition just to find you

Complete sample code

Clone strategy

    In [18]:
    # This code was auto-generated by the visual strategy environment on 2017-10-30 18:02
    # This code cell can only be edited in visual mode. You can also copy the code into a new code cell or strategy and modify it there.
    
    
    m1 = M.convert_csv_to_hdf.v1(
        input_ds=DataSource('bigquant-challengerai-trendsense-w9-trainingset')
    )
    
    m4 = M.input_features.v1(
        features="""feature10
    feature1
    feature2
    feature0/feature9
    feature10+feature11
    feature20/feature19
    """
    )
    
    m5 = M.derived_feature_extractor.v2(
        input_data=m1.data,
        features=m4.data,
        date_col='id',
        instrument_col='group'
    )
    
    m3 = M.filter.v3(
        input_data=m5.data,
        expr='era>=16',
        output_left_data=False
    )
    
    m2 = M.filter.v3(
        input_data=m5.data,
        expr='era<16',
        output_left_data=False
    )
    
    m12 = M.GBDT_train.v1(
        training_ds=m2.data,
        features=m4.data,
        num_boost_round=120,
        early_stopping_rounds=20,
        objective='binary:logistic',
        eval_metric='error',
        booster='gbtree',
        eta=0.1,
        gamma=0.0001,
        _lambda=0,
        lambda_bias=0,
        alpha=0,
        max_depth=6,
        max_leaf_nodes=30,
        subsample=0.8
    )
    
    m11 = M.GBDT_predict.v1(
        model=m12.model,
        data=m3.data,
        date_col='id',
        instrument_col='',
        sort=False
    )
    
    m8 = M.join.v3(
        data1=m11.predictions,
        data2=m3.data,
        on='id',
        how='inner',
        sort=False
    )
    
    # Python entry function; input_1/2/3 map to the three input ports, data_1/2/3 to the three output ports
    def m9_run_bigquant_run(input_1, input_2, input_3):
        df = input_1.read_df()
        weight = df.weight.values
        yt = df.label.values
        yp = df.pred_prob.values
        loss = -np.sum(weight * (yt * np.log(yp) + (1 - yt) * np.log(1 - yp))) / np.sum(weight)
        print('accuracy: %s, logloss: %s' % ((df.label==df.pred_label).sum()/df.shape[0], loss))
        return Outputs()
    
    m9 = M.cached.v3(
        input_1=m8.data,
        run=m9_run_bigquant_run,
        m_cached=False
    )
    
    m6 = M.convert_csv_to_hdf.v1(
        input_ds=DataSource('bigquant-challengerai-trendsense-w9-testset')
    )
    
    m7 = M.derived_feature_extractor.v2(
        input_data=m6.data,
        features=m4.data,
        date_col='id',
        instrument_col='group'
    )
    
    m10 = M.GBDT_predict.v1(
        model=m12.model,
        data=m7.data,
        date_col='id',
        instrument_col='',
        sort=False
    )
    
    m13 = M.challenger_submit.v1(
        prediction_ds=m10.predictions,
        account='your username',
        password='your password',
        challenge_round='9',
        challenge_type='trendsense',
        challenge_cid='6'
    )
    
    [2017-10-30 17:57:05.409680] INFO: bigquant: convert_csv_to_hdf.v1 starting..
    [2017-10-30 17:57:05.415118] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.417821] INFO: bigquant: convert_csv_to_hdf.v1 finished [0.008199s].
    [2017-10-30 17:57:05.432046] INFO: bigquant: input_features.v1 starting..
    [2017-10-30 17:57:05.437528] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.439244] INFO: bigquant: input_features.v1 finished [0.007191s].
    [2017-10-30 17:57:05.511033] INFO: bigquant: derived_feature_extractor.v2 starting..
    [2017-10-30 17:57:05.517053] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.519771] INFO: bigquant: derived_feature_extractor.v2 finished [0.008762s].
    [2017-10-30 17:57:05.609674] INFO: bigquant: filter.v3 starting..
    [2017-10-30 17:57:05.654177] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.655547] INFO: bigquant: filter.v3 finished [0.045901s].
    [2017-10-30 17:57:05.709954] INFO: bigquant: GBDT_train.v1 starting..
    [2017-10-30 17:57:05.715272] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.716599] INFO: bigquant: GBDT_train.v1 finished [0.00667s].
    [2017-10-30 17:57:05.726564] INFO: bigquant: convert_csv_to_hdf.v1 starting..
    [2017-10-30 17:57:05.729682] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.731066] INFO: bigquant: convert_csv_to_hdf.v1 finished [0.0045s].
    [2017-10-30 17:57:05.745411] INFO: bigquant: derived_feature_extractor.v2 starting..
    [2017-10-30 17:57:05.793214] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.794987] INFO: bigquant: derived_feature_extractor.v2 finished [0.049533s].
    [2017-10-30 17:57:05.806251] INFO: bigquant: GBDT_predict.v1 starting..
    [2017-10-30 17:57:05.810289] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.811781] INFO: bigquant: GBDT_predict.v1 finished [0.005566s].
    [2017-10-30 17:57:05.820234] INFO: bigquant: challenger_submit.v1 starting..
    [2017-10-30 17:57:05.823985] INFO: bigquant: cache hit
    [2017-10-30 17:57:05.825335] INFO: bigquant: challenger_submit.v1 finished [0.005143s].
    

    (bluexxxx) #2

    Since my background is in machine learning, I'm quite interested in this competition.

    My experiment differs in a few ways:

    • Different algorithm: since this is a classification problem, I used logistic regression

    • Different feature selection: I used principal component analysis (PCA)

    • Batch experiments to find the parameter combination with the smallest logloss

    • Refit the model on the full training data before predicting on the test set

    Competition ranking:

    I didn't spend much time building this experiment, and it currently ranks 33rd. Hoping to keep improving!

    The code is below:

    Clone strategy

    AI Challenge

    In [140]:
    # Import the required packages
    from sklearn.linear_model import LogisticRegression
    from sklearn.decomposition import PCA
    import numpy as np
    import pandas as pd
    
    # Read the training data
    m1 = M.convert_csv_to_hdf.v1(
        input_ds=DataSource('bigquant-challengerai-trendsense-w2-trainingset'),
        m_cached=False
    )
    raw_df = m1.data.read_df()
    # Feature columns
    features = [x for x in raw_df.columns if 'feature' in x] + ['group']
    
    [2017-09-12 20:46:32.658635] INFO: bigquant: convert_csv_to_hdf.v1 starting..
    [2017-09-12 20:46:49.182362] INFO: bigquant: convert_csv_to_hdf.v1 finished [16.523683s].
    

    A function to compute logloss on the validation set

    In [141]:
    def calcu_logloss(p, raw_df, valid_list):  # compute logloss on a validation split
        
        train_list = [i for i in range(1, 21) if i not in valid_list]
        
        training_df = raw_df.set_index('era').loc[train_list].reset_index()
        validation_df = raw_df.set_index('era').loc[valid_list].reset_index()
        
        train_X = np.array(training_df[features])
        train_y = np.array(training_df['label'])
        
        pca = PCA(n_components=p)
        reduced_train_X = pca.fit_transform(train_X)
        # Choose the algorithm
        classifier = LogisticRegression()
        classifier.fit(reduced_train_X, train_y)
    
        valid_X = np.array(validation_df[features])
        pca = PCA(n_components=p)
        reduced_valid_X = pca.fit_transform(valid_X)
    
        predictions_proba = classifier.predict_proba(reduced_valid_X)
        predictions = classifier.predict(reduced_valid_X)
    
        tmp = pd.DataFrame(predictions_proba, columns=['be_0_prob','be_1_prob'], index=validation_df.index)
        # Accuracy on the validation set
        validation_df['pred'] = predictions
        validation_df['be_0_prob'] = tmp['be_0_prob']
        validation_df['be_1_prob'] = tmp['be_1_prob']
    
        weight = validation_df['weight']
        yt = validation_df['label']
        yp = validation_df['be_1_prob']
        loss = - np.sum(weight * (yt * np.log(yp) + (1 - yt) * np.log(1 - yp))) / np.sum(weight)
        print('n_components: ', p, 'validation set', valid_list, 'accuracy: %s, log_loss: %s' % ((validation_df.label==validation_df.pred).sum()/validation_df.shape[0], loss))
    

    Finding the best parameters

    In [145]:
    valid_sets = [   # candidate validation splits
        [1,2,3,4],
        [5,6,7,8],
        [9,10,11,12],
        [13,14,15,16],
        [17,18,19,20],
    ]
    # Batch experiments: find the best number of principal components and validation split
    for i in range(3,20):
        for j in valid_sets:
            calcu_logloss(i,raw_df,j)
    
    n_components:  3 validation set [1, 2, 3, 4] accuracy: 0.519196061162, log_loss: 0.7011856602896425
    n_components:  3 validation set [5, 6, 7, 8] accuracy: 0.544761771104, log_loss: 0.6844322930285454
    n_components:  3 validation set [9, 10, 11, 12] accuracy: 0.535759116549, log_loss: 0.6870945247947803
    n_components:  3 validation set [13, 14, 15, 16] accuracy: 0.511108181159, log_loss: 0.6906682584905697
    n_components:  3 validation set [17, 18, 19, 20] accuracy: 0.560061860328, log_loss: 0.682055041537043
    n_components:  4 validation set [1, 2, 3, 4] accuracy: 0.492080451638, log_loss: 0.7141821768168399
    n_components:  4 validation set [5, 6, 7, 8] accuracy: 0.547231773209, log_loss: 0.6835586788366568
    n_components:  4 validation set [9, 10, 11, 12] accuracy: 0.522555017081, log_loss: 0.6889722561205334
    n_components:  4 validation set [13, 14, 15, 16] accuracy: 0.511322433911, log_loss: 0.6902455790049705
    n_components:  4 validation set [17, 18, 19, 20] accuracy: 0.565518080245, log_loss: 0.6810604489293196
    n_components:  5 validation set [1, 2, 3, 4] accuracy: 0.490983121327, log_loss: 0.7139404658280754
    n_components:  5 validation set [5, 6, 7, 8] accuracy: 0.548087853484, log_loss: 0.683274156153444
    n_components:  5 validation set [9, 10, 11, 12] accuracy: 0.540017478351, log_loss: 0.68788993514858
    n_components:  5 validation set [13, 14, 15, 16] accuracy: 0.509608411893, log_loss: 0.6912304834736015
    n_components:  5 validation set [17, 18, 19, 20] accuracy: 0.568593720134, log_loss: 0.6800703124360576
    n_components:  6 validation set [1, 2, 3, 4] accuracy: 0.490189000707, log_loss: 0.7144278243793236
    n_components:  6 validation set [5, 6, 7, 8] accuracy: 0.54478983931, log_loss: 0.6864969150412317
    n_components:  6 validation set [9, 10, 11, 12] accuracy: 0.539842694844, log_loss: 0.6879523146724824
    n_components:  6 validation set [13, 14, 15, 16] accuracy: 0.513316632606, log_loss: 0.6915922245976434
    n_components:  6 validation set [17, 18, 19, 20] accuracy: 0.5642322195, log_loss: 0.68184983797627
    n_components:  7 validation set [1, 2, 3, 4] accuracy: 0.490174562151, log_loss: 0.7153626320612305
    n_components:  7 validation set [5, 6, 7, 8] accuracy: 0.545042453161, log_loss: 0.6843228063383348
    n_components:  7 validation set [9, 10, 11, 12] accuracy: 0.531405418289, log_loss: 0.6884388571234378
    n_components:  7 validation set [13, 14, 15, 16] accuracy: 0.51356384732, log_loss: 0.6907951627710498
    n_components:  7 validation set [17, 18, 19, 20] accuracy: 0.567116717927, log_loss: 0.6816182454296765
    n_components:  8 validation set [1, 2, 3, 4] accuracy: 0.481006078632, log_loss: 0.7146384869717017
    n_components:  8 validation set [5, 6, 7, 8] accuracy: 0.546165181391, log_loss: 0.6843033744185745
    n_components:  8 validation set [9, 10, 11, 12] accuracy: 0.507317073171, log_loss: 0.691315912343831
    n_components:  8 validation set [13, 14, 15, 16] accuracy: 0.511701496473, log_loss: 0.6925149030419511
    n_components:  8 validation set [17, 18, 19, 20] accuracy: 0.567742271803, log_loss: 0.6816113620772606
    n_components:  9 validation set [1, 2, 3, 4] accuracy: 0.486882571218, log_loss: 0.7117058448059521
    n_components:  9 validation set [5, 6, 7, 8] accuracy: 0.542586485159, log_loss: 0.6865093768544934
    n_components:  9 validation set [9, 10, 11, 12] accuracy: 0.499912608247, log_loss: 0.6925101354415926
    n_components:  9 validation set [13, 14, 15, 16] accuracy: 0.515113059529, log_loss: 0.6907441433909828
    n_components:  9 validation set [17, 18, 19, 20] accuracy: 0.567533753845, log_loss: 0.6817106781563446
    n_components:  10 validation set [1, 2, 3, 4] accuracy: 0.491979381741, log_loss: 0.7087270095648637
    n_components:  10 validation set [5, 6, 7, 8] accuracy: 0.538474493018, log_loss: 0.6870488785370759
    n_components:  10 validation set [9, 10, 11, 12] accuracy: 0.504218638278, log_loss: 0.6908617601346231
    n_components:  10 validation set [13, 14, 15, 16] accuracy: 0.512245368844, log_loss: 0.692349934445094
    n_components:  10 validation set [17, 18, 19, 20] accuracy: 0.545326591253, log_loss: 0.6889196681492644
    n_components:  11 validation set [1, 2, 3, 4] accuracy: 0.494520567724, log_loss: 0.7083524607063616
    n_components:  11 validation set [5, 6, 7, 8] accuracy: 0.541463756929, log_loss: 0.6841816857841457
    n_components:  11 validation set [9, 10, 11, 12] accuracy: 0.493350282037, log_loss: 0.6920299703964798
    n_components:  11 validation set [13, 14, 15, 16] accuracy: 0.51059727075, log_loss: 0.692919939927919
    n_components:  11 validation set [17, 18, 19, 20] accuracy: 0.541920797929, log_loss: 0.6890899805369095
    n_components:  12 validation set [1, 2, 3, 4] accuracy: 0.490824297203, log_loss: 0.7085094649664783
    n_components:  12 validation set [5, 6, 7, 8] accuracy: 0.541604097958, log_loss: 0.6842461010173783
    n_components:  12 validation set [9, 10, 11, 12] accuracy: 0.49157066815, log_loss: 0.6924926905231099
    n_components:  12 validation set [13, 14, 15, 16] accuracy: 0.506987935922, log_loss: 0.6944596116265332
    n_components:  12 validation set [17, 18, 19, 20] accuracy: 0.53773306226, log_loss: 0.6903908645218868
    n_components:  13 validation set [1, 2, 3, 4] accuracy: 0.487618937611, log_loss: 0.7088646131759345
    n_components:  13 validation set [5, 6, 7, 8] accuracy: 0.540621710757, log_loss: 0.684462406302658
    n_components:  13 validation set [9, 10, 11, 12] accuracy: 0.490585524748, log_loss: 0.6931757019623108
    n_components:  13 validation set [13, 14, 15, 16] accuracy: 0.509344716198, log_loss: 0.6942936978904388
    n_components:  13 validation set [17, 18, 19, 20] accuracy: 0.542042433405, log_loss: 0.6892824771585596
    n_components:  14 validation set [1, 2, 3, 4] accuracy: 0.487055833899, log_loss: 0.7088738829870604
    n_components:  14 validation set [5, 6, 7, 8] accuracy: 0.541084836152, log_loss: 0.6843596553605866
    n_components:  14 validation set [9, 10, 11, 12] accuracy: 0.491109875268, log_loss: 0.6931465622070953
    n_components:  14 validation set [13, 14, 15, 16] accuracy: 0.509097501483, log_loss: 0.6945166589548791
    n_components:  14 validation set [17, 18, 19, 20] accuracy: 0.542250951363, log_loss: 0.6894132224209096
    n_components:  15 validation set [1, 2, 3, 4] accuracy: 0.484702349153, log_loss: 0.7092001595826248
    n_components:  15 validation set [5, 6, 7, 8] accuracy: 0.546038874465, log_loss: 0.6806409829820432
    n_components:  15 validation set [9, 10, 11, 12] accuracy: 0.48829744975, log_loss: 0.6959239002696228
    n_components:  15 validation set [13, 14, 15, 16] accuracy: 0.509163425407, log_loss: 0.6956392215220123
    n_components:  15 validation set [17, 18, 19, 20] accuracy: 0.540912961129, log_loss: 0.6899660625285362
    n_components:  16 validation set [1, 2, 3, 4] accuracy: 0.483633895956, log_loss: 0.7088118160736112
    n_components:  16 validation set [5, 6, 7, 8] accuracy: 0.545126657778, log_loss: 0.6808061221644308
    n_components:  16 validation set [9, 10, 11, 12] accuracy: 0.489060141416, log_loss: 0.6956385474908143
    n_components:  16 validation set [13, 14, 15, 16] accuracy: 0.50852066715, log_loss: 0.6952506019911886
    n_components:  16 validation set [17, 18, 19, 20] accuracy: 0.538671393074, log_loss: 0.6922335453215452
    n_components:  17 validation set [1, 2, 3, 4] accuracy: 0.483099669357, log_loss: 0.7083831703002457
    n_components:  17 validation set [5, 6, 7, 8] accuracy: 0.546221317802, log_loss: 0.6805937840513746
    n_components:  17 validation set [9, 10, 11, 12] accuracy: 0.490204178915, log_loss: 0.6949199303891808
    n_components:  17 validation set [13, 14, 15, 16] accuracy: 0.507795503988, log_loss: 0.6952805385820434
    n_components:  17 validation set [17, 18, 19, 20] accuracy: 0.536551460495, log_loss: 0.689491227761706
    n_components:  18 validation set [1, 2, 3, 4] accuracy: 0.481381481107, log_loss: 0.7088544691470082
    n_components:  18 validation set [5, 6, 7, 8] accuracy: 0.544803873412, log_loss: 0.6819461175672578
    n_components:  18 validation set [9, 10, 11, 12] accuracy: 0.49044252006, log_loss: 0.6948392509978059
    n_components:  18 validation set [13, 14, 15, 16] accuracy: 0.507515327312, log_loss: 0.694759523225226
    n_components:  18 validation set [17, 18, 19, 20] accuracy: 0.532328971833, log_loss: 0.6920144303338489
    n_components:  19 validation set [1, 2, 3, 4] accuracy: 0.492181521535, log_loss: 0.7052300553686336
    n_components:  19 validation set [5, 6, 7, 8] accuracy: 0.5437653498, log_loss: 0.682143741471743
    n_components:  19 validation set [9, 10, 11, 12] accuracy: 0.503678398348, log_loss: 0.6926776104130488
    n_components:  19 validation set [13, 14, 15, 16] accuracy: 0.508734919902, log_loss: 0.6953286148619514
    n_components:  19 validation set [17, 18, 19, 20] accuracy: 0.532867643226, log_loss: 0.6910801972592988
    

    Test-set prediction

    In [143]:
    # Choose the validation split
    valid_list = [17,18,19,20]  # experiments show this validation split works well
    n_component = 5  # accuracy is relatively high with 5 principal components
    classifier = LogisticRegression()
    
    train_list = [i for i in range(1,21) if i not in valid_list]
    training_df = raw_df.set_index('era').loc[train_list].reset_index()
    validation_df = raw_df.set_index('era').loc[valid_list].reset_index()
    
    # train_X = np.array(training_df[features])  # the idea: fit the final model on all the data, not just the training split
    # train_y = np.array(training_df['label'])
    
    train_X = np.array(raw_df[features])
    train_y = np.array(raw_df['label'])
    
    # Refit on the full training data
    pca = PCA(n_components=n_component)
    reduced_train_X = pca.fit_transform(train_X)
    classifier.fit(reduced_train_X, train_y)
    
    Out[143]:
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
    In [144]:
    # Predict on the test set
    m2 = M.convert_csv_to_hdf.v1(
        input_ds=DataSource('bigquant-challengerai-trendsense-w2-testset'),
        m_cached=False
    )
    test_df = m2.data.read_df()
    
    test_X = np.array(test_df[features])
    
    # Reduce dimensionality
    pca = PCA(n_components=n_component)
    reduced_test_X = pca.fit_transform(test_X)
    
    # Predict and write the submission file
    ret_df = pd.DataFrame({'id':test_df.id})
    ret_df[['be_0_proba','be_1_prob']] = pd.DataFrame(classifier.predict_proba(reduced_test_X),index=ret_df.index)
    ret_df = ret_df.rename(columns={'be_1_prob':'proba'}).drop('be_0_proba',axis=1)
    ret_df.set_index('id').to_csv('result_2_%s.csv'%n_component)
    
    [2017-09-12 20:48:42.520596] INFO: bigquant: convert_csv_to_hdf.v1 starting..
    [2017-09-12 20:48:50.796533] INFO: bigquant: convert_csv_to_hdf.v1 finished [8.275907s].
    

    (koukezhiyu) #4

    Is there a high barrier to entry for AI Challenger?


    (小马哥) #5

    How to put it? Entering the competition isn't hard, but winning a prize will still take real time and research.


    (oversky2003) #6

    The training and test data I downloaded from the competition site are several hundred MB each, but the datasets on bigquant are under 2 MB, and only rounds 1 and 2 are there. I tried to upload a new dataset myself, but it says uploads can't exceed 15 MB.
    [image]


    (小Q) #7

    Hello.
    We currently offer two solutions:

    • The latest competition data has been added
    • Support for uploading larger datasets

    (oversky2003) #8

    I just refreshed several times and don't see any new competition data. Also, the competition data you provide should be complete, right? The sub-2 MB size shown is just a display error?
    [image]


    (小Q) #9

    The data is complete, rest assured. The latest datasets are coming shortly!


    (bluexxxx) #10

    The screenshot I just posted shows multiple datasets, more than just these few.


    (oversky2003) #11

    A plain refresh doesn't show the newly added data (I'm on Chrome); you have to force a hard refresh (with dev tools open, see below) before the new data appears. You could fix this, for example by adding a version parameter to the data-loading API, so a normal refresh picks up the new data.
    [image]
    [image]


    (小Q) #12

    Got it, thanks for the feedback!


    (Henry) #13

    小Q... the data on the competition platform is already at round 6; please keep up.


    (royburns) #14


    Cloned the OP's strategy as-is and got this error. Things change fast; after just a month it no longer runs.


    (小Q) #15

    OK, as long as there's no problem.


    (Henry) #16

    Remove the single quotes around 'era>=16' so it becomes era>=16 and it will run.


    (royburns) #17

    Thanks, I'll give it a try.


    (lzl) #18

    Honestly, even as someone who has dabbled in Kaggle, I couldn't follow this tutorial.
    It's really not beginner-friendly at all.


    (神龙斗士) #19

    @lzl This one you wrote is good, very beginner-friendly:

    https://community.bigquant.com/t/零基础《AI挑战虚拟股票预测大赛》入门教程/3514


    (feixiong) #20

    Has the round 8 data been imported? @bigquant


    (iQuant) #21

    The data is there. @feixiong