Reproducing the XGBoost Algorithm on the BigQuant Platform


(lpl22) #1

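XGBoost concepts

Tree models

Machine learning models can be roughly divided into traditional machine learning models and deep learning models. Traditional models can in turn be divided, by the form of the model, into tree models and linear models.

Boosting models

Tree models take the decision tree as their base learner, and many algorithms are derived from it. From an ensemble-learning perspective, tree ensembles fall into Bagging and Boosting models. Like Bagging, Boosting improves accuracy by combining weak learners. Unlike Bagging, Boosting trains each new model according to how the previous models performed, adjusting the weights of the training samples and the way the weak learners are combined, and so arrives at a final strong learner.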
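To make the re-weighting idea concrete, here is a minimal sketch of one round of the classic AdaBoost weight update. AdaBoost is only one Boosting scheme and is chosen here for illustration; the function name and the toy labels are assumptions, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights).

    weights: current sample weights, summing to 1
    y_true, y_pred: labels and weak-learner predictions in {-1, +1}
    """
    # Weighted error rate of the weak learner (assumed to be in (0, 1))
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Vote of this weak learner in the final ensemble
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples become heavier, correctly classified ones lighter
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # the weak learner gets the last sample wrong
weights = [0.2] * 5
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.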

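GBDT

For simple loss functions such as exponential loss and squared loss, each boosting step is easy to optimize. For general loss functions (such as absolute loss), optimization is much harder. GBDT therefore uses the negative gradient of the loss function, evaluated at the current model, as an approximation of the residual in a boosting tree, and fits a regression tree to it.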
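A small sketch of why the negative gradient works as a residual: for squared loss it is exactly the residual, while for absolute loss it keeps only the sign of the residual. The function names and the toy numbers are illustrative assumptions.

```python
def neg_gradient_squared(y, f):
    """Negative gradient of L = 0.5 * (y - f)^2 with respect to f."""
    return y - f

def neg_gradient_absolute(y, f):
    """Negative (sub)gradient of L = |y - f| with respect to f, i.e. sign(y - f)."""
    return (y > f) - (y < f)

targets = [3.0, -1.0, 2.5]   # y_i
current = [2.0, 0.5, 2.5]    # current model output f(x_i)

residuals = [y - f for y, f in zip(targets, current)]
pseudo_sq = [neg_gradient_squared(y, f) for y, f in zip(targets, current)]
pseudo_abs = [neg_gradient_absolute(y, f) for y, f in zip(targets, current)]
print(residuals, pseudo_sq, pseudo_abs)
```

The next tree is fitted to `pseudo_sq` or `pseudo_abs` instead of the raw targets, which is what lets the same procedure handle any differentiable loss.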

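XGBoost

XGBoost is an improvement on the GBDT algorithm. Compared with traditional GBDT, XGBoost improves the loss function, regularization, split-point search and parallel design, which makes it more than 5 times faster than commonly used toolkits. Like GBDT, XGBoost still trains the n-th tree on the result of the first n-1 trees, so the trees themselves are built in sequence. The parallelism comes from inside each tree: XGBoost takes a second-order Taylor expansion of the objective, so the final objective depends only on the first and second derivatives of the loss at each data point. These are fixed while a tree is being built, so candidate splits can be scored independently and in parallel. Figure 7 shows the XGBoost workflow; mathematically it differs from GBDT in the objective function used to train each weak learner. The mathematics is not the focus of this post, so we leave it at that.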
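As a quick numerical check of the second-order Taylor idea, the sketch below compares the exact logistic loss after a small change in the raw score with its quadratic approximation built from the first derivative `g` and the second derivative `h`. Logistic loss is an assumption made for illustration; the strategy in this post uses a ranking objective.

```python
import math

def logloss(y, z):
    """Logistic loss for label y in {0, 1} and raw score z."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def grad_hess(y, z):
    """First and second derivatives of logloss with respect to z."""
    p = 1.0 / (1.0 + math.exp(-z))
    return p - y, p * (1.0 - p)

y, z, step = 1, 0.3, 0.2          # label, current score, change added by a new tree
g, h = grad_hess(y, z)
exact = logloss(y, z + step)
approx = logloss(y, z) + g * step + 0.5 * h * step ** 2
print(exact, approx)
```

Once `g` and `h` are computed for every sample, the approximate objective no longer needs the loss function itself, which is why any loss with first and second derivatives can be plugged in.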

This post reproduces an XGBoost model on the BigQuant platform and explains how to tune its parameters. The strategy is given below:

    In [2]:
    # This code was auto-generated by the visual strategy environment on 2019-01-17 09:30
    # This code cell can only be edited in visual mode. You can also copy the code into a new code cell or strategy and modify it there.


    # Backtest engine: daily data handler, executed once per day
    def m19_handle_data_bigquant_run(context, data):
        # Filter the predictions by date to get today's predictions
        ranker_prediction = context.ranker_prediction[
            context.ranker_prediction.date == data.current_dt.strftime('%Y-%m-%d')]

        # 1. Capital allocation
        # The average holding period is hold_days and stocks are bought every day, so each day is expected to use 1/hold_days of the capital
        # In practice the buy amounts are not exact, so during the first hold_days days an equal share is used; afterwards, use as much of the remaining cash as possible (capped here at 1.5x the equal share)
        is_staging = context.trading_day_index < context.options['hold_days'] # whether we are still building positions (first hold_days days)
        cash_avg = context.portfolio.portfolio_value / context.options['hold_days']
        cash_for_buy = min(context.portfolio.cash, (1 if is_staging else 1.5) * cash_avg)
        cash_for_sell = cash_avg - (context.portfolio.cash - cash_for_buy)
        positions = {e.symbol: p.amount * p.last_sale_price
                     for e, p in context.perf_tracker.position_tracker.positions.items()}

        # 2. Generate sell orders: selling starts only after hold_days days; held stocks are eliminated from the bottom of the predicted ranking
        if not is_staging and cash_for_sell > 0:
            equities = {e.symbol: e for e, p in context.perf_tracker.position_tracker.positions.items()}
            instruments = list(reversed(list(ranker_prediction.instrument[ranker_prediction.instrument.apply(
                    lambda x: x in equities and not context.has_unfinished_sell_order(equities[x]))])))
            # print('rank order for sell %s' % instruments)
            for instrument in instruments:
                context.order_target(context.symbol(instrument), 0)
                cash_for_sell -= positions[instrument]
                if cash_for_sell <= 0:
                    break

        # 3. Generate buy orders: buy the top stock_count stocks in the predicted ranking
        buy_cash_weights = context.stock_weights
        buy_instruments = list(ranker_prediction.instrument[:len(buy_cash_weights)])
        max_cash_per_instrument = context.portfolio.portfolio_value * context.max_cash_per_instrument
        for i, instrument in enumerate(buy_instruments):
            cash = cash_for_buy * buy_cash_weights[i]
            if cash > max_cash_per_instrument - positions.get(instrument, 0):
                # Make sure the position in a stock does not exceed the maximum capital allowed per stock
                cash = max_cash_per_instrument - positions.get(instrument, 0)
            if cash > 0:
                context.order_value(context.symbol(instrument), cash)

    # Backtest engine: data preparation, executed only once
    def m19_prepare_bigquant_run(context):
        pass

    # Backtest engine: initialization function, executed only once
    def m19_initialize_bigquant_run(context):
        # Load the prediction data: it is passed in via options and read into memory as a DataFrame with read_df
        context.ranker_prediction = context.options['data'].read_df()

        # The system already sets default commission and slippage; use the call below to change the commission
        context.set_commission(PerOrder(buy_cost=0.0003, sell_cost=0.0013, min_cost=5))
        # Number of stocks to buy: the top 5 in the predicted ranking
        stock_count = 5
        # Weight of each stock; this allocation gives higher-ranked stocks a bit more capital: [0.339160, 0.213986, 0.169580, ..]
        context.stock_weights = T.norm([1 / math.log(i + 2) for i in range(0, stock_count)])
        # Maximum fraction of the portfolio that a single stock may take
        context.max_cash_per_instrument = 0.2
        # Holding period in days
        context.options['hold_days'] = 5
    
    
    m1 = M.instruments.v2(
        start_date='2010-01-01',
        end_date='2015-01-01',
        market='CN_STOCK_A',
        instrument_list='',
        max_count=0
    )
    
    m2 = M.advanced_auto_labeler.v2(
        instruments=m1.data,
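        label_expr="""# Lines starting with # are comments
    # 0. One expression per line, executed in order; from the second one on, the label field can be used
    # 1. For available data fields see https://bigquant.com/docs/data_history_data.html
    #   Add the benchmark_ prefix to use the corresponding benchmark data
    # 2. For available operators and functions see `expression engine <https://bigquant.com/docs/big_expr.html>`_

    # Compute the return: the close 5 days ahead (sell price) divided by tomorrow's open (buy price)
    shift(close, -5) / shift(open, -1)

    # Outlier handling: clip at the 1% and 99% quantiles
    clip(label, all_quantile(label, 0.01), all_quantile(label, 0.99))

    # Map the score to classes; 20 classes are used here
    all_wbins(label, 20)

    # Filter out stocks locked at the limit-up price all day (set label to NaN; NaN labels are ignored in later processing and training)
    where(shift(high, -1) == shift(low, -1), NaN, label)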
    """,
        start_date='',
        end_date='',
        benchmark='000300.SHA',
        drop_na_label=True,
        cast_label_int=True
    )
    
    m3 = M.input_features.v1(
        features="""# Lines starting with # are comments
    # Multiple features, one per line; both base features and derived features are allowed
    return_5
    return_10
    return_20
    avg_amount_0/avg_amount_5
    avg_amount_5/avg_amount_20
    rank_avg_amount_0/rank_avg_amount_5
    rank_avg_amount_5/rank_avg_amount_10
    rank_return_0
    rank_return_5
    rank_return_10
    rank_return_0/rank_return_5
    rank_return_5/rank_return_10
    pe_ttm_0
    """
    )
    
    m15 = M.general_feature_extractor.v7(
        instruments=m1.data,
        features=m3.data,
        start_date='',
        end_date='',
        before_start_days=0
    )
    
    m16 = M.derived_feature_extractor.v3(
        input_data=m15.data,
        features=m3.data,
        date_col='date',
        instrument_col='instrument',
        drop_na=False,
        remove_extra_columns=False
    )
    
    m7 = M.join.v3(
        data1=m2.data,
        data2=m16.data,
        on='date,instrument',
        how='inner',
        sort=False
    )
    
    m13 = M.dropnan.v1(
        input_data=m7.data
    )
    
    m9 = M.instruments.v2(
        start_date=T.live_run_param('trading_date', '2015-01-01'),
        end_date=T.live_run_param('trading_date', '2017-01-01'),
        market='CN_STOCK_A',
        instrument_list='',
        max_count=0
    )
    
    m17 = M.general_feature_extractor.v7(
        instruments=m9.data,
        features=m3.data,
        start_date='',
        end_date='',
        before_start_days=0
    )
    
    m18 = M.derived_feature_extractor.v3(
        input_data=m17.data,
        features=m3.data,
        date_col='date',
        instrument_col='instrument',
        drop_na=False,
        remove_extra_columns=False
    )
    
    m14 = M.dropnan.v1(
        input_data=m18.data
    )
    
    m20 = M.xgboost.v1(
        training_ds=m13.data,
        features=m3.data,
        predict_ds=m14.data,
        num_boost_round=30,
        objective='排序(pairwise)',
        booster='gbtree',
        max_depth=6,
        key_cols='date,instrument',
        group_col='date',
        other_train_parameters={}
    )
    
    m19 = M.trade.v4(
        instruments=m9.data,
        options_data=m20.predictions,
        start_date='',
        end_date='',
        handle_data=m19_handle_data_bigquant_run,
        prepare=m19_prepare_bigquant_run,
        initialize=m19_initialize_bigquant_run,
        volume_limit=0.025,
        order_price_field_buy='open',
        order_price_field_sell='close',
        capital_base=1000000,
        auto_cancel_non_tradable_orders=True,
        data_frequency='daily',
        price_type='后复权',
        product_type='股票',
        plot_charts=True,
        backtest_only=False,
        benchmark='000300.SHA'
    )
    
    [2019-01-17 09:27:20.968538] INFO: bigquant: instruments.v2 started..
    [2019-01-17 09:27:20.973919] INFO: bigquant: cache hit
    [2019-01-17 09:27:20.974851] INFO: bigquant: instruments.v2 finished [0.00635s].
    [2019-01-17 09:27:20.989079] INFO: bigquant: advanced_auto_labeler.v2 started..
    [2019-01-17 09:27:20.993137] INFO: bigquant: cache hit
    [2019-01-17 09:27:20.993985] INFO: bigquant: advanced_auto_labeler.v2 finished [0.004922s].
    [2019-01-17 09:27:20.995805] INFO: bigquant: input_features.v1 started..
    [2019-01-17 09:27:20.999204] INFO: bigquant: cache hit
    [2019-01-17 09:27:20.999812] INFO: bigquant: input_features.v1 finished [0.004007s].
    [2019-01-17 09:27:21.004609] INFO: bigquant: general_feature_extractor.v7 started..
    [2019-01-17 09:27:21.008995] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.009972] INFO: bigquant: general_feature_extractor.v7 finished [0.005372s].
    [2019-01-17 09:27:21.012100] INFO: bigquant: derived_feature_extractor.v3 started..
    [2019-01-17 09:27:21.016345] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.017353] INFO: bigquant: derived_feature_extractor.v3 finished [0.00526s].
    [2019-01-17 09:27:21.034377] INFO: bigquant: join.v3 started..
    [2019-01-17 09:27:21.041464] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.042490] INFO: bigquant: join.v3 finished [0.008131s].
    [2019-01-17 09:27:21.045199] INFO: bigquant: dropnan.v1 started..
    [2019-01-17 09:27:21.050653] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.051493] INFO: bigquant: dropnan.v1 finished [0.00632s].
    [2019-01-17 09:27:21.053725] INFO: bigquant: instruments.v2 started..
    [2019-01-17 09:27:21.057062] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.057773] INFO: bigquant: instruments.v2 finished [0.004049s].
    [2019-01-17 09:27:21.062142] INFO: bigquant: general_feature_extractor.v7 started..
    [2019-01-17 09:27:21.065649] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.066318] INFO: bigquant: general_feature_extractor.v7 finished [0.004178s].
    [2019-01-17 09:27:21.068107] INFO: bigquant: derived_feature_extractor.v3 started..
    [2019-01-17 09:27:21.071114] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.071762] INFO: bigquant: derived_feature_extractor.v3 finished [0.003653s].
    [2019-01-17 09:27:21.073503] INFO: bigquant: dropnan.v1 started..
    [2019-01-17 09:27:21.076426] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.077030] INFO: bigquant: dropnan.v1 finished [0.003532s].
    [2019-01-17 09:27:21.419519] INFO: bigquant: xgboost.v1 started..
    [2019-01-17 09:27:21.440029] INFO: bigquant: cache hit
    [2019-01-17 09:27:21.441190] INFO: bigquant: xgboost.v1 finished [0.02168s].
    [2019-01-17 09:27:21.599274] INFO: bigquant: backtest.v8 started..
    [2019-01-17 09:27:21.603900] INFO: bigquant: cache hit
    
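    • Total return: 419.55%
    • Annualized return: 134.18%
    • Benchmark return: -6.33%
    • Alpha: 0.92
    • Beta: 0.92
    • Sharpe ratio: 2.24
    • Win rate: 0.64
    • Profit/loss ratio: 0.95
    • Return volatility: 40.49%
    • Information ratio: 0.21
    • Maximum drawdown: 50.32%
    [2019-01-17 09:27:22.987381] INFO: bigquant: backtest.v8 finished [1.388072s].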
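The weight list quoted in the `initialize` comment can be checked by hand. The sketch below assumes that `T.norm` simply rescales a list so that it sums to 1; the `normalize` helper is a stand-in for it, not the platform function. The first-day budget uses the `capital_base=1000000` and `hold_days = 5` values from the strategy.

```python
import math

def normalize(values):
    """Scale values so they sum to 1 (assumed behaviour of T.norm)."""
    total = sum(values)
    return [v / total for v in values]

stock_count = 5
stock_weights = normalize([1 / math.log(i + 2) for i in range(stock_count)])

# On the first trading day: cash_avg = portfolio_value / hold_days
capital_base = 1_000_000
hold_days = 5
cash_for_buy = capital_base / hold_days
first_day_orders = [cash_for_buy * w for w in stock_weights]

print([round(w, 6) for w in stock_weights])
print([round(c) for c in first_day_orders])
```

The first three weights come out as 0.339160, 0.213986 and 0.169580, matching the comment in the strategy code. Each first-day order stays well under the 20% per-stock cap.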
    
    In [8]:
    m20.feature_gains()
    
    Out[8]:
        feature                               gain
    10  return_20                             258
    2   return_5                              213
    1   avg_amount_5/avg_amount_20            177
    0   avg_amount_0/avg_amount_5             167
    3   rank_avg_amount_0/rank_avg_amount_5   159
    9   rank_return_5                         127
    5   rank_avg_amount_5/rank_avg_amount_10  121
    7   rank_return_5/rank_return_10          115
    8   pe_ttm_0                              112
    11  rank_return_10                        111
    12  rank_return_0                         110
    4   return_10                             101
    6   rank_return_0/rank_return_5           97

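How XGBoost works

GBDT

As described above, GBDT uses the negative gradient of the loss function at the current model as an approximation of the residual and fits a regression tree to it. In its standard form the algorithm is as follows, where $L$ is the loss function and $(x_i, y_i)$, $i = 1, 2, \dots, M$, are the training samples:

1. Initialize the model:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{M} L(y_i, c)$$

2. Train $K$ models in sequence, for $k = 1, 2, \dots, K$:

(1) Compute the negative gradient for $i = 1, 2, \dots, M$:

$$r_{ki} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{k-1}}$$

(2) Fit a regression tree to the negative gradients $r_{ki}$ to obtain the leaf regions $R_{kj}$, $j = 1, 2, \dots, J$, of the $k$-th tree.

(3) For $j = 1, 2, \dots, J$, compute:

$$c_{kj} = \arg\min_{c} \sum_{x_i \in R_{kj}} L(y_i, f_{k-1}(x_i) + c)$$

(4) Update the model:

$$f_k(x) = f_{k-1}(x) + \sum_{j=1}^{J} c_{kj}\, I(x \in R_{kj})$$

3. Obtain the final model:

$$f_K(x) = f_0(x) + \sum_{k=1}^{K} \sum_{j=1}^{J} c_{kj}\, I(x \in R_{kj})$$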
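The three steps of the GBDT procedure can be sketched in plain Python for the simplest case: squared loss, one feature, and depth-1 trees (stumps). Everything here is an illustrative assumption: the helper names, the toy data, and the absence of a learning rate. For squared loss the negative gradient is the residual, and the best leaf value is the mean residual in that leaf.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree by least squares.

    Returns (threshold, left_value, right_value).
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:       # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_gbdt(xs, ys, n_rounds):
    """GBDT for squared loss L = 0.5 * (y - f)^2."""
    f0 = sum(ys) / len(ys)                       # step 1: best constant model
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_rounds):                    # step 2: K rounds
        residuals = [y - p for y, p in zip(ys, preds)]   # (1) negative gradient
        stump = fit_stump(xs, residuals)                 # (2) regions, (3) leaf values
        stumps.append(stump)
        preds = [p + predict_stump(stump, x)             # (4) update the model
                 for p, x in zip(preds, xs)]
    return f0, stumps

def predict_gbdt(model, x):
    f0, stumps = model                           # step 3: sum of all trees
    return f0 + sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - predict_gbdt(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
errors = [mse(fit_gbdt(xs, ys, k), xs, ys) for k in (0, 1, 10)]
print(errors)
```

The training error does not increase as rounds are added. This is also why too many rounds overfit: the model keeps chasing the noise in the training set.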
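XGBoost is an efficient Boosting trainer that can do everything GBDT does. Unlike ordinary GBDT, XGBoost approximates the loss function with its second-order Taylor expansion and adds a penalty term to the loss. In the standard notation, the objective at round $t$ is:

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{M} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss $l$ with respect to the previous prediction $\hat{y}_i^{(t-1)}$.

Taking a tree model as the weak learner, the tree is built differently from GBDT. For a tree with $T$ leaves and leaf weights $w_j$, the penalty is $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$.

First, substitute the expressions for $f_t$ and $\Omega$ into the objective. Dropping the constant term and writing $I_j$ for the set of samples that fall in leaf $j$, the objective takes the following form:

$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{T} \left[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \right] + \gamma T$$

From this, the minimizer and the corresponding minimum are:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \mathrm{Obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$

Each time a split is considered for an existing leaf, the following gain decides whether to make the split:

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$

where $I_L$ and $I_R$ are the sample sets of the left and right child nodes after the split, and $I = I_L \cup I_R$.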
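The split rule can be tried on a few numbers. This sketch uses the standard XGBoost expressions: with G and H the sums of first and second derivatives in a node, the optimal leaf weight is -G / (H + lambda), and the gain of a split is half of (left score + right score - parent score) minus gamma, where a node's score is G^2 / (H + lambda). The helper names, the toy targets and the choice of squared loss are assumptions made for illustration.

```python
def leaf_weight(grads, hess, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -sum(grads) / (sum(hess) + lam)

def leaf_score(grads, hess, lam):
    """G^2 / (H + lambda), the quality term of one node."""
    return sum(grads) ** 2 / (sum(hess) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of splitting a node into a left and a right child."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma

# Squared loss 0.5 * (y - pred)^2 with current prediction 0: g_i = pred - y_i, h_i = 1
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
grads = [0.0 - y for y in ys]
hess = [1.0] * len(ys)

# Candidate split: first three samples go left, last three go right
gain = split_gain(grads[:3], hess[:3], grads[3:], hess[3:], lam=1.0, gamma=0.0)
w_left = leaf_weight(grads[:3], hess[:3], lam=1.0)
w_right = leaf_weight(grads[3:], hess[3:], lam=1.0)
print(gain, w_left, w_right)
```

The split is kept only if the gain is positive, so a larger `gamma` prunes more aggressively, and a larger `lam` shrinks the leaf weights toward zero.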
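Compared with ordinary GBDT, XGBoost has the following advantages:

1. Traditional GBDT uses only gradient (first-order) information during optimization. XGBoost takes a second-order Taylor expansion of the loss function, so it also uses second-derivative information, and it allows custom loss functions tailored to the data, as long as the first and second derivatives exist.
2. XGBoost adds a penalty term to the loss function, which effectively reduces overfitting.
3. XGBoost supports not only tree models as weak learners but also linear models, Logistic models and others, so a weak learner suited to the data can be chosen.
4. XGBoost borrows ideas from Bagging: in each Boosting round it can randomly subsample the training samples and the feature set, which helps prevent overfitting.
5. XGBoost supports parallel computation over child nodes while each tree is being built, and speeds up the split search through pre-sorting.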

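XGBoost has the following drawbacks:

1. When choosing a split point for a tree, it has to traverse all feature values.
2. Storing the pre-sorted results consumes a lot of memory.

Microsoft's LightGBM addresses both of these problems. LightGBM has drawbacks of its own, but more and more Kaggle solutions include it. Overall, LightGBM and XGBoost each do better on different problems, but LightGBM is faster in many cases, so it has also become a standard tool for data scientists building GBDT models.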

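Changing the number of iterations

As the number of iterations increases, run time grows and the model overfits more; conversely, with too few iterations the model underfits. The default is 30.

[Figure: 30 iterations]

[Figure: 15 iterations]

[Figure: 1 iteration]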

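Changing the loss function type

In the XGBoost module, the regression objective does not work with this strategy as is; the modules would need to be modified first. The ranking objective 排序(pairwise), which is the default, performed best.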
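For readers who know the open-source XGBoost library, the module settings used in this post correspond roughly to the native parameter names below. The mapping from the BigQuant option labels to `rank:pairwise`, `rank:map` and `rank:ndcg` is an assumption based on the label text, not something confirmed by the platform. The helper `native_params` is hypothetical.

```python
# Assumed mapping from the BigQuant option labels to native XGBoost objective names
OBJECTIVES = {
    "排序(pairwise)": "rank:pairwise",
    "排序(map)": "rank:map",
    "排序(ndcg)": "rank:ndcg",
}

def native_params(objective_label, max_depth=6, booster="gbtree"):
    """Build a parameter dict using native XGBoost parameter names."""
    return {
        "objective": OBJECTIVES[objective_label],
        "booster": booster,
        "max_depth": max_depth,
    }

params = native_params("排序(pairwise)")
num_boost_round = 30   # in the native API this is an argument of xgboost.train, not a params entry
print(params, num_boost_round)
```

Ranking objectives need the samples grouped into queries. In the strategy, `group_col='date'` plays that role, so stocks are ranked against each other within the same trading day.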
[Figure: 排序(map)]

[Figure: 排序(ndcg)]

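Maximum tree depth

As the maximum tree depth increases, run time grows markedly and the model overfits more; conversely, if the maximum depth is too small the model underfits. The default is 6.

[Figure: maximum depth = 3]

[Figure: maximum depth = 10]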

Model and factor performance

Running the code

    m20.feature_gains()
    

shows the score (usage) of each factor in the XGBoost model.

In module M2 of this strategy, the labeling conditions can be changed as needed.
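Before changing the labeling conditions it helps to see what the default expressions in M2 compute. The sketch below re-creates them on a toy price series in plain Python. The helper names are illustrative. Reading `all_wbins` as equal-width binning is an assumption, and the clip bounds are passed in by hand, whereas the strategy takes them from the 1% and 99% quantiles of all samples.

```python
def forward_return(opens, closes, hold=5):
    """shift(close, -hold) / shift(open, -1): buy at tomorrow's open, sell at the close `hold` days ahead."""
    n = len(closes)
    out = []
    for t in range(n):
        if t + hold < n:
            out.append(closes[t + hold] / opens[t + 1])
        else:
            out.append(None)          # not enough future data for a label
    return out

def clip(values, low, high):
    """Limit every value to the range [low, high]."""
    return [min(max(v, low), high) for v in values]

def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 (assumed reading of all_wbins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

opens = [10.0, 10.2, 10.1, 10.4, 10.6, 10.5, 10.8, 11.0]
closes = [10.1, 10.0, 10.3, 10.5, 10.4, 10.7, 10.9, 11.2]

returns = [r for r in forward_return(opens, closes) if r is not None]
clipped = clip(returns, 1.05, 1.078)      # stand-in for the 1% / 99% quantile bounds
labels = equal_width_bins(clipped, 20)
print(returns, labels)
```

Changing the holding horizon, for example, only means changing the `-5` in `shift(close, -5)`. The `hold_days` setting in the trading code should then be changed to match.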

In M3, factors can be added as needed, and poorly performing factors can be removed.
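One simple way to pick candidates for removal is to rank the factors by the scores from `m20.feature_gains()`. The sketch below uses the values from the output shown earlier in this post. Treating the three lowest scores as review candidates is an arbitrary illustrative choice, not a rule from the platform.

```python
# Values copied from the feature_gains() output shown in this post
feature_gains = {
    "return_20": 258,
    "return_5": 213,
    "avg_amount_5/avg_amount_20": 177,
    "avg_amount_0/avg_amount_5": 167,
    "rank_avg_amount_0/rank_avg_amount_5": 159,
    "rank_return_5": 127,
    "rank_avg_amount_5/rank_avg_amount_10": 121,
    "rank_return_5/rank_return_10": 115,
    "pe_ttm_0": 112,
    "rank_return_10": 111,
    "rank_return_0": 110,
    "return_10": 101,
    "rank_return_0/rank_return_5": 97,
}

total = sum(feature_gains.values())
shares = {name: gain / total for name, gain in feature_gains.items()}
ranked = sorted(feature_gains, key=feature_gains.get, reverse=True)
weakest = ranked[-3:]   # candidates to review, not an automatic drop list

for name in ranked:
    print(f"{name:38s} {feature_gains[name]:4d} {shares[name]:6.1%}")
print("review:", weakest)
```

The scores here are fairly flat: the top factor accounts for about 14% of the total and the weakest for about 5%. It is therefore worth re-running the backtest after removing a factor rather than relying on the ranking alone.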


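Summary of the workflow for implementing stock-selection strategies with different algorithms using visual templates

(zhudan) #2

The strategy link won't open.


(iQuant) #3

Hello, it has been fixed. Please try again now.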