XGBoost


In [1]:
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_iris()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the data to DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set the parameters
params = {
    'objective': 'multi:softmax',
    'num_class': 3,
    'eta': 0.1,
    'max_depth': 3
}

# Train the model
model = xgb.train(params, dtrain, num_boost_round=100)

# Predict on the test set
y_pred = model.predict(dtest)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Logistic regression

$$P(y\geq 0)=\frac{1}{1+\exp\{-f(x)\}}$$

Cross-entropy

Cross-entropy measures the distance between two probability distributions. It is a non-negative real number, and the smaller its value, the closer the two distributions are.
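To make the "distance" intuition concrete, here is a minimal numpy sketch (the distributions below are made up purely for illustration): the closer the predicted distribution is to the true one, the smaller the cross-entropy.

```python
import numpy as np

def cross_entropy_dist(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i * log(q_i) between two discrete distributions."""
    return -np.sum(p * np.log(q))

p = np.array([1.0, 0.0])        # true distribution: the label is class 1 with certainty
q_close = np.array([0.9, 0.1])  # a prediction close to p
q_far = np.array([0.5, 0.5])    # a prediction far from p

# The closer q is to p, the smaller the cross-entropy
print(cross_entropy_dist(p, q_close))  # -log(0.9) ≈ 0.105
print(cross_entropy_dist(p, q_far))    # -log(0.5) ≈ 0.693
```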

Take a sample $(x_i, label_i)$ as an example. Suppose its label is $label_i=1$, so $P(label_i=1)=100\%$. Now we model this data point with a model $f(\cdot)$ and apply a sigmoid activation: $$P(pre\_label_i=1)=\frac{1}{1+\exp\{-f(x_i)\}}$$ where $pre\_label$ denotes the predicted label. Since the true label is 1, we only need to keep updating the model $f(\cdot)$ so that $P(pre\_label_i=1)$ gets as close to 1 as possible (ideally equal to 1).

This is where the cross-entropy comes in:

$$L_i=-P(label_i=1)\times\log\{P(pre\_label_i=1)\}-(1-P(label_i=1))\times\log\{1-P(pre\_label_i=1)\}$$

Next, introduce an intermediate variable $y_i$ (typically a continuous label) with $label_i=sgn(y_i)$, where $sgn(x)$ is the sign function that equals 1 when $x\geq 0$ and -1 otherwise. Substituting the probability and the intermediate variable into the cross-entropy and simplifying yields exactly the form in the Huatai research report: $$L_i=\log\{1+\exp\{-sgn(y_i)f(x_i)\}\}$$ This works because one of $P(label_i=1)$ and $1-P(label_i=1)$ is always 0!
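The collapse of the two-term cross-entropy into the compact form can be checked numerically; a self-contained sketch with randomly generated scores and signs:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
f = rng.normal(size=1000)                        # stand-in model outputs f(x_i)
s = np.where(rng.normal(size=1000) >= 0, 1, -1)  # stand-in labels sgn(y_i)

# Two-term cross-entropy: P(label_i=1) is 1 or 0 depending on the sign
p_true = (s == 1).astype(float)
p_pred = sigmoid(f)
two_term = -p_true * np.log(p_pred) - (1 - p_true) * np.log(1 - p_pred)

# Compact form from the report: log(1 + exp(-sgn(y_i) * f(x_i)))
compact = np.log(1 + np.exp(-s * f))

print(np.allclose(two_term, compact))  # True
```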

In [2]:
import numpy as np

def sgn(x):
    """
    Sign function: returns 1 where x >= 0, -1 otherwise
    """
    return np.where(x>=0, 1, -1)

# Create a dataset of 10 samples with 5 features each
x = np.random.randn(10, 5)
y = np.random.randn(10)

# Define a linear model
def model(x):
    """
    Assume we already have a linear model with known parameters (drawn randomly here as a stand-in)
    """
    beta = np.random.randn(x.shape[1], 1)
    alpha = np.random.randn(1)
    return np.reshape(x@beta + alpha, (1, -1))[0]

# Compute the cross-entropy
def cross_entropy(ypre, ytrue):
    """
    :param ypre: predicted values
    :param ytrue: true values
    """
    # Cross-entropy for each sample
    out = np.log(1+np.exp(-sgn(ytrue)*ypre))

    # Sum over all samples
    out = np.sum(out)
    return out

cross_entropy(model(x), y)
Out[2]:
4.8110564559766305

Derivatives of the cross-entropy

For the probability produced by the model, we can give a general formula: $$P(pre\_label_i)=\frac{1}{1+\exp\{-sgn(y_i)f(x_i)\}}=\frac{1}{1+\exp\{-label_i\times f(x_i)\}}$$ When $label_i=1$ this matches $P(pre\_label_i=1)$ exactly, and it also matches when $label_i=-1$.

First derivative: $$\frac{\partial L_i}{\partial f(x_i)}=\frac{-sgn(y_i)\exp\{-sgn(y_i)f(x_i)\}}{1+\exp\{-sgn(y_i)f(x_i)\}}=-sgn(y_i)(1-P(pre\_label_i))=-label_i\times (1-P(pre\_label_i))$$

Second derivative: $$\frac{\partial^2 L_i}{\partial f(x_i)^2}=\frac{\exp\{-sgn(y_i)f(x_i)\}}{(1+\exp\{-sgn(y_i)f(x_i)\})^2}=P(pre\_label_i)(1-P(pre\_label_i))$$ Why the second derivative: XGBoost's objective uses it later on.

In [3]:
import numpy as np

x = np.random.randn(10, 5)
y = np.random.randn(10)

# Define a linear model
def model(x):
    """
    Assume we already have a linear model with known parameters (drawn randomly here as a stand-in)
    """
    beta = np.random.randn(x.shape[1], 1)
    alpha = np.random.randn(1)
    return np.reshape(x@beta + alpha, (1, -1))[0]

def grad(x, y):
    prob = 1 / (1+np.exp(-sgn(y)*model(x)))
    out = -sgn(y)*(1-prob)
    return out

print('First-order gradients for all samples:', grad(x, y))
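Both derivative formulas can be sanity-checked against finite differences. A numpy-only sketch with random stand-in scores and labels:

```python
import numpy as np

def sgn(x):
    return np.where(x >= 0, 1, -1)

def loss(f, y):
    """Per-sample cross-entropy: log(1 + exp(-sgn(y) * f))."""
    return np.log(1 + np.exp(-sgn(y) * f))

def analytic_grad(f, y):
    prob = 1 / (1 + np.exp(-sgn(y) * f))
    return -sgn(y) * (1 - prob)

def analytic_hess(f, y):
    prob = 1 / (1 + np.exp(-sgn(y) * f))
    return prob * (1 - prob)

rng = np.random.default_rng(1)
f = rng.normal(size=8)  # stand-in model outputs
y = rng.normal(size=8)  # stand-in continuous labels

# Central differences for the first and second derivatives w.r.t. f
eps = 1e-6
num_grad = (loss(f + eps, y) - loss(f - eps, y)) / (2 * eps)
h = 1e-4
num_hess = (loss(f + h, y) - 2 * loss(f, y) + loss(f - h, y)) / h**2

print(np.allclose(num_grad, analytic_grad(f, y), atol=1e-6))  # True
print(np.allclose(num_hess, analytic_hess(f, y), atol=1e-5))  # True
```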

Custom loss functions in XGBoost

Let's classify a binary dataset using the cross-entropy loss.

First, one thing to clarify: 1. A binary dataset means we need to convert the labels into the 1/-1 format.

In [46]:
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
x = data.data
y = data.target
y = np.where(y==1, 1, -1)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

# Custom loss function
def logistic_obj(pred, dtrain):
    # Get the labels
    label = dtrain.get_label()
    prob_pre_label = 1 / (np.exp(-label * pred) + 1)

    # First derivative
    grad = -label * (1 - prob_pre_label)

    # Second derivative
    hessian = prob_pre_label * (1 - prob_pre_label)
    return grad, hessian

dtrain = xgb.DMatrix(xtrain, ytrain)
dtest = xgb.DMatrix(xtest, ytest)

# Custom evaluation metric (MSE is clearly unsuitable for evaluating a classifier)
def metric(pred, y):
    """
    The labels here are discrete
    """
    pre_label = 1 / (1+np.exp(-pred))
    y = y.get_label()
    acc = np.sum(np.where(pre_label>0.5, 1, -1)==y) / len(y)
    return 'acc', acc


print("---------logistic loss-----------")
params = {'tree_method': 'hist'}
model = xgb.train(params=params, dtrain=dtrain, num_boost_round=5, early_stopping_rounds=50,
            evals=[(dtrain, 'train'), (dtest, 'test')], verbose_eval=1, obj=logistic_obj, feval=metric)
In [47]:
# Let's compute the accuracy
pred = model.predict(dtest)
p = 1 / (1+np.exp(-pred))
acc = np.sum(np.where(p>0.5, 1, -1)==ytest) / len(ytest)
print('Test accuracy: ', acc)

Ordinal regression, Huatai research report version

$$L_i=\sum_{j=1}^K\log\{1+\exp\{-sgn(y_i-c_j)(f(x_i)-c_j)\}\}$$

where $c_j$ are the threshold points. We can understand this from a multi-task perspective: there are K tasks and hence K loss functions, which are summed so that all tasks share a single set of model parameters.

Take the threshold $c_j$ as an example. Here $y_i$ is a continuous label, so we first binarize it: samples with $y_i$ greater than $c_j$ get label 1, and the rest get label -1. This is clearly a binary classification task, with loss: $$loss_i = \log\{1+\exp\{-sgn(y_i-c_j)(f(x_i)-c_j)\}\}$$ By the earlier result, this is a cross-entropy, so all of its properties carry over directly. A model trained this way predicts whether a sample exceeds the threshold $c_j$.

Now string all the tasks together. For given thresholds $c_1 < c_2 < ... < c_K$, the ideal model obtained by minimizing the loss assigns every sample correctly to its corresponding interval.

For every $x_i$, $i=1,\dots,n$, there exists $f(x_i)$ s.t. $$f(x_i)=\mathop{argmin}\limits_{f(x_i)}\log\{1+\exp\{-sgn(y_i)f(x_i)\}\}$$ If the label $y_i$ of sample $x_i$ lies between $c_1$ and $c_2$, the model output $f(x_i)$ falls exactly between $c_1$ and $c_2$.
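In other words, once trained, the model's score can be bucketed directly by the thresholds. A minimal sketch (the thresholds -0.09 and 0.09 match those used later; the scores are illustrative):

```python
import numpy as np

# Thresholds c_1 < c_2 and some hypothetical model outputs f(x_i)
cut_points = np.array([-0.09, 0.09])
f_x = np.array([-0.50, 0.00, 0.30])

# A sample is assigned to the interval its score falls into:
# bin 0: f < c_1, bin 1: c_1 <= f < c_2, bin 2: f >= c_2
bins = np.searchsorted(cut_points, f_x, side='right')
print(bins)  # [0 1 2]
```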


The report uses the 10-day return as the label; binary classification on that works too well here, so I will try the 1-day return instead.

In [3]:
import dai

sql = """
WITH main_table AS (
    SELECT dt AS date, instrument, 
    MAX(down_vol_perc) AS down_vol_perc,
    MAX(volume_perc2) AS volume_perc2, 
    MAX(volume_perc3) AS volume_perc3, 
    MAX(volume_perc4) AS volume_perc4, 
    MAX(volume_perc5) AS volume_perc5, 
    MAX(volume_perc6) AS volume_perc6, 
    MAX(volume_perc7) AS volume_perc7,
    MAX(down_single_amt_perc) AS down_single_amt_perc, 
    MAX(corr_ret_lastret) AS corr_ret_lastret, 
    MAX(corr_close_nextopen) AS corr_close_nextopen
    FROM (
        SELECT date AS _date, instrument, 
        -- Keep only the date part (year-month-day)
        DATETRUNC('DAY', _date) AS dt, 
        LOG(close / (LAG(close, 1) OVER (PARTITION BY instrument ORDER BY _date))) AS ret, 

        -- Return from two minutes earlier
        LAG(ret, 2) OVER (PARTITION BY instrument ORDER BY _date) AS ret_before, 

        -- Previous minute's close (paired later with the next minute's open)
        LAG(close, 1) OVER (PARTITION BY instrument ORDER BY _date) AS pre_close, 

        -- Extract hour and minute to mark the n-th half hour
        EXTRACT(HOUR FROM _date) AS _hour, 
        EXTRACT(MINUTE FROM _date) AS _min, 

        -- Denominator
        SUM(ret*ret) OVER (PARTITION BY instrument, dt) AS _low, 

        -- Numerator
        IF(ret>0, 0, 1) AS _down,
        ret * _down AS _down_ret,
        SUM(_down_ret*_down_ret) OVER (PARTITION BY instrument, dt) AS _up,

        -- Downside volatility ratio
        _up / _low AS down_vol_perc, 

        -- Total volume for the day
        SUM(volume) OVER (PARTITION BY instrument, dt) AS total_volume, 

        -- Volume in the second half hour (10:00-10:30)
        IF((_hour=10 AND _min<=30), 1, 0) AS _second_half, 
        (SUM(volume * _second_half) OVER (PARTITION BY instrument, dt)) / total_volume AS volume_perc2, 

        -- Volume in the third half hour (10:30-11:00)
        IF(((_hour=10 AND _min>=30) OR (_hour=11 AND _min=0)), 1, 0) AS _third_half, 
        (SUM(volume * _third_half) OVER (PARTITION BY instrument, dt)) / total_volume AS volume_perc3, 

        -- Volume in the fourth half hour (11:00-11:30)
        IF((_hour=11 AND _min<=30), 1, 0) AS _forth_half, 
        (SUM(volume * _forth_half) OVER (PARTITION BY instrument, dt)) / total_volume AS volume_perc4, 

        -- Volume in the fifth half hour (13:00-13:30)
        IF((_hour=13 AND _min<=30), 1, 0) AS _fifth_half, 
        (SUM(volume * _fifth_half) OVER (PARTITION BY instrument, dt)) / total_volume AS volume_perc5, 

        -- Volume in the sixth half hour (13:30-14:00)
        IF(((_hour=13 AND _min>=30) OR (_hour=14 AND _min=0)), 1, 0) AS _sixth_half, 
        (SUM(volume * _sixth_half) OVER (PARTITION BY instrument, dt)) / total_volume AS volume_perc6, 

        -- Volume in the seventh half hour (14:00-14:30)
        IF((_hour=14 AND _min<=30), 1, 0) AS _seventh_half, 
        (SUM(volume * _seventh_half) OVER (PARTITION BY instrument, dt)) / total_volume AS volume_perc7, 

        -- Mark minutes with negative returns
        IF(ret<0, 1, 0) AS neg_ret, 
        (SUM(amount * neg_ret) OVER (PARTITION BY instrument, dt)) / (SUM(amount) OVER (PARTITION BY instrument, dt)) AS down_single_amt_perc, 

        -- Correlation between consecutive minute returns
        CORR(ret, ret_before) OVER (PARTITION BY instrument, dt) AS corr_ret_lastret, 

        -- Correlation between the previous minute's close and the next minute's open
        CORR(pre_close, open) OVER (PARTITION BY instrument, dt) AS corr_close_nextopen

        FROM cn_stock_bar1m
    )
    GROUP BY date, instrument
),

-- Skewness of late-session returns
skew_table AS (
    SELECT dt AS date, instrument, MAX(late_skew_yet) AS late_skew_yet FROM (
        SELECT date AS _date, instrument,

        -- Keep only the date part (year-month-day)
        DATETRUNC('DAY', _date) AS dt, 

        -- Extract hour and minute to select the late session
        EXTRACT(HOUR FROM _date) AS _hour, 
        EXTRACT(MINUTE FROM _date) AS _min,
        RANK() OVER (PARTITION BY instrument ORDER BY _date) AS _rank, 

        -- Minute returns
        LOG(close / (LAG(close, 1) OVER (PARTITION BY instrument ORDER BY _date))) AS ret, 

        -- Late-session skewness
        SKEWNESS(ret) OVER (PARTITION BY instrument, dt)  AS late_skew_yet
        FROM cn_stock_bar1m
        WHERE _hour >= 14 AND (_min>=30 OR _min=0)
        QUALIFY _rank > 1
        ORDER BY _date, instrument
    )
    GROUP BY date, instrument
),

-- Correlation between volume and returns in the opening session
v_r_corr AS (
    SELECT dt AS date, instrument, MAX(early_corr_volume_ret) AS early_corr_volume_ret FROM (
        SELECT date AS _date, instrument, 
        -- Keep only the date part (year-month-day)
        DATETRUNC('DAY', _date) AS dt, 
        LOG(close / (LAG(close, 1) OVER (PARTITION BY instrument ORDER BY _date))) AS ret, 

        -- Extract hour and minute to mark the first half hour
        EXTRACT(HOUR FROM _date) AS _hour, 
        EXTRACT(MINUTE FROM _date) AS _min,

        -- First half hour of trading
        CORR(volume, ret) OVER (PARTITION BY instrument, dt) AS early_corr_volume_ret

        FROM cn_stock_bar1m
        WHERE (_hour=9 AND _min>=30) OR (_hour=10 AND _min=0)
    )
    GROUP BY date, instrument
)

SELECT * FROM main_table
INNER JOIN skew_table USING (date, instrument)
INNER JOIN v_r_corr USING (date, instrument)
"""
df = dai.query(sql, filters={'date': ['2019-01-03', '2021-12-31']}).df()
df
Out[3]:
date instrument down_vol_perc volume_perc2 volume_perc3 volume_perc4 volume_perc5 volume_perc6 volume_perc7 down_single_amt_perc corr_ret_lastret corr_close_nextopen late_skew_yet early_corr_volume_ret
0 2019-10-15 000001.SZ 0.539515 0.214917 0.128182 0.118862 0.075754 0.061916 0.108976 0.457985 -0.043431 0.994796 -3.470539 -0.503233
1 2019-03-05 000004.SZ 0.645884 0.103265 0.035770 0.068662 0.327641 0.113724 0.043934 0.387980 -0.088504 0.996456 2.735970 0.067655
2 2021-02-01 000004.SZ 0.792507 0.180732 0.128799 0.063749 0.059569 0.054194 0.038582 0.635766 0.075548 0.995083 -5.556083 0.561315
3 2020-03-27 000005.SZ 0.512168 0.094045 0.058622 0.153418 0.036490 0.196251 0.104884 0.492958 -0.109199 0.980899 -0.473182 0.033952
4 2021-05-13 000005.SZ 0.339658 0.479132 0.161791 0.018791 0.030074 0.017920 0.014165 0.051582 0.098094 0.998050 5.656854 -0.059713
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2881540 2021-08-04 002765.SZ 0.403733 0.148881 0.092451 0.060624 0.096380 0.083139 0.082648 0.274642 0.072857 0.982972 4.777574 0.295311
2881541 2019-08-19 000613.SZ 0.326471 0.070075 0.088555 0.085479 0.086885 0.084754 0.190941 0.204833 -0.082299 0.992689 4.599876 0.183446
2881542 2021-09-10 300053.SZ 0.422111 0.144055 0.194629 0.147497 0.071457 0.056340 0.058383 0.475469 -0.151962 0.996309 4.085515 -0.352655
2881543 2021-12-30 600020.SH 0.476835 0.273167 0.118520 0.046573 0.075835 0.040301 0.067392 0.158887 0.034594 0.863257 0.371447 0.234023
2881544 2021-02-24 603612.SH 0.506851 0.120937 0.140500 0.075234 0.128402 0.061115 0.071510 0.500009 0.055143 0.998926 -4.772271 -0.447285

2881545 rows × 14 columns

In [13]:
# Get the labels (next-day returns)
sql = """
SELECT
    -- Next-day return: the close one day ahead divided by the open one day ahead, minus 1. m_lead fetches a column's value n rows ahead.
    -- _future_return is an intermediate name; alias columns starting with _ are not returned in the final result
    m_lead(close, 1) / m_lead(open, 1)-1 AS _future_return,

    -- 1% quantile of the next-day return; c_quantile_cont computes the cross-sectional quantile of a column
    c_quantile_cont(_future_return, 0.01) AS _future_return_1pct,

    -- 99% quantile of the next-day return
    c_quantile_cont(_future_return, 0.99) AS _future_return_99pct,

    -- Winsorize the next-day return: values between the 1% and 99% quantiles are kept, values outside are clipped to the boundary
    clip(_future_return, _future_return_1pct, _future_return_99pct) AS _clipped_return,

    -- Use the clipped return as the label, our prediction target
    _clipped_return AS _label, 

    -- Standardize the label
    normalize(_label) AS label, 

    -- Date: one row per stock per day
    date,

    -- Stock code
    instrument
-- Read from cn_stock_bar1d, the daily bar table
FROM cn_stock_bar1d

-- Keep rows whose label is not null and that are not limit-up/down (next day's high != low)
QUALIFY label is NOT NULL AND m_lead(high, 1) != m_lead(low, 1)

ORDER BY instrument,date
"""

label = dai.query(sql, filters={'date': ['2019-01-03', '2021-12-31']}).df()
label
Out[13]:
label date instrument
0 0.953949 2019-01-03 000001.SZ
1 -1.322541 2019-01-04 000001.SZ
2 -0.423739 2019-01-07 000001.SZ
3 1.083972 2019-01-08 000001.SZ
4 1.287934 2019-01-09 000001.SZ
... ... ... ...
2871764 -0.639590 2021-12-24 872925.BJ
2871765 0.811220 2021-12-27 872925.BJ
2871766 2.075497 2021-12-28 872925.BJ
2871767 3.847799 2021-12-29 872925.BJ
2871768 -1.925005 2021-12-30 872925.BJ

2871769 rows × 3 columns

We take -0.09 and 0.09 as the thresholds.

In [5]:
label['label'].hist()
Out[5]:
<AxesSubplot:>
In [14]:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
data = pd.merge(df, label, on=['date', 'instrument'], how='inner')
data
Out[14]:
date instrument down_vol_perc volume_perc2 volume_perc3 volume_perc4 volume_perc5 volume_perc6 volume_perc7 down_single_amt_perc corr_ret_lastret corr_close_nextopen late_skew_yet early_corr_volume_ret label
0 2019-10-15 000001.SZ 0.539515 0.214917 0.128182 0.118862 0.075754 0.061916 0.108976 0.457985 -0.043431 0.994796 -3.470539 -0.503233 -1.155403
1 2019-03-05 000004.SZ 0.645884 0.103265 0.035770 0.068662 0.327641 0.113724 0.043934 0.387980 -0.088504 0.996456 2.735970 0.067655 -0.800943
2 2021-02-01 000004.SZ 0.792507 0.180732 0.128799 0.063749 0.059569 0.054194 0.038582 0.635766 0.075548 0.995083 -5.556083 0.561315 -1.918556
3 2020-03-27 000005.SZ 0.512168 0.094045 0.058622 0.153418 0.036490 0.196251 0.104884 0.492958 -0.109199 0.980899 -0.473182 0.033952 -0.203948
4 2021-05-13 000005.SZ 0.339658 0.479132 0.161791 0.018791 0.030074 0.017920 0.014165 0.051582 0.098094 0.998050 5.656854 -0.059713 -1.059168
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2868164 2021-08-04 002765.SZ 0.403733 0.148881 0.092451 0.060624 0.096380 0.083139 0.082648 0.274642 0.072857 0.982972 4.777574 0.295311 0.768907
2868165 2019-08-19 000613.SZ 0.326471 0.070075 0.088555 0.085479 0.086885 0.084754 0.190941 0.204833 -0.082299 0.992689 4.599876 0.183446 0.507129
2868166 2021-09-10 300053.SZ 0.422111 0.144055 0.194629 0.147497 0.071457 0.056340 0.058383 0.475469 -0.151962 0.996309 4.085515 -0.352655 -1.017288
2868167 2021-12-30 600020.SH 0.476835 0.273167 0.118520 0.046573 0.075835 0.040301 0.067392 0.158887 0.034594 0.863257 0.371447 0.234023 -0.058512
2868168 2021-02-24 603612.SH 0.506851 0.120937 0.140500 0.075234 0.128402 0.061115 0.071510 0.500009 0.055143 0.998926 -4.772271 -0.447285 -0.875156

2868169 rows × 15 columns

In [15]:
# Use the first 60% of the data (by date) as the training set
import numpy as np

date = data['date'].unique()
date.sort()
index = int(len(date)*0.6)
split = pd.to_datetime(date[index]).strftime('%Y-%m-%d')
train_data = data[data['date']<split]
test_data = data[data['date']>=split]

# Build the arrays
xtrain = np.array(train_data.drop(['date', 'instrument', 'label'], axis=1))
ytrain = np.array(train_data['label'])
xtest = np.array(test_data.drop(['date', 'instrument', 'label'], axis=1))
ytest = np.array(test_data['label'])

Gradient linearity: $$\frac{\partial L_i}{\partial f(x_i)}=\frac{\partial}{\partial f(x_i)}\sum_{j=1}^K\log\{1+\exp\{-sgn(y_i-c_j)(f(x_i)-c_j)\}\}=\sum_{j=1}^K\frac{\partial}{\partial f(x_i)}\log\{1+\exp\{-sgn(y_i-c_j)(f(x_i)-c_j)\}\}$$ The second derivative works the same way.
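The summed gradient can be verified against a finite difference of the full multi-threshold loss before handing it to xgboost. A numpy-only sketch using the two thresholds above:

```python
import numpy as np

def sgn(x):
    return np.where(x >= 0, 1, -1)

cuts = np.array([-0.09, 0.09])  # the thresholds used in this notebook

def ordinal_loss(f, y):
    """Per-sample sum over thresholds of log(1 + exp(-sgn(y - c_j) * (f - c_j)))."""
    total = np.zeros_like(f)
    for c in cuts:
        total += np.log(1 + np.exp(-sgn(y - c) * (f - c)))
    return total

def ordinal_grad(f, y):
    """Sum of per-threshold gradients -sgn(y - c_j) * (1 - P_j)."""
    g = np.zeros_like(f)
    for c in cuts:
        prob = 1 / (1 + np.exp(-sgn(y - c) * (f - c)))
        g += -sgn(y - c) * (1 - prob)
    return g

rng = np.random.default_rng(2)
f = rng.normal(size=6)  # stand-in model outputs
y = rng.normal(size=6)  # stand-in continuous labels

# Central difference of the summed loss matches the summed analytic gradient
eps = 1e-6
num_grad = (ordinal_loss(f + eps, y) - ordinal_loss(f - eps, y)) / (2 * eps)
print(np.allclose(num_grad, ordinal_grad(f, y), atol=1e-6))  # True
```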

In [17]:
import numpy as np
import xgboost as xgb

def sgn(x):
    return np.where(x>=0, 1, -1)

# Custom loss function
def logistic_obj(pred, dtrain):
    # Get the labels
    label = dtrain.get_label()

    # Threshold -0.09
    prob_pre_label = 1 / (np.exp(-sgn(label + 0.09) * (pred + 0.09)) + 1)
    grad_1 = -sgn(label + 0.09) * (1 - prob_pre_label)
    hessian_1 = prob_pre_label * (1 - prob_pre_label)

    # Threshold 0.09
    prob_pre_label = 1 / (np.exp(-sgn(label - 0.09) * (pred - 0.09)) + 1)
    grad_2 = -sgn(label - 0.09) * (1 - prob_pre_label)
    hessian_2 = prob_pre_label * (1 - prob_pre_label)

    # The gradient of the sum equals the sum of the gradients
    grad = grad_1 + grad_2
    hessian = hessian_1 + hessian_2
    return grad, hessian

dtrain = xgb.DMatrix(xtrain, ytrain)
dtest = xgb.DMatrix(xtest, ytest)


# Custom evaluation (MSE is clearly unsuitable here; we report the cross-entropy loss itself)
def metric(pred, y):
    """
    The labels here are continuous
    """
    y = np.reshape(y.get_label(), (-1, 1))
    pred = np.reshape(pred, (-1, 1))
    cut_point = np.array([[-0.09, 0.09]])
    temp = sgn(y-cut_point) * (pred - cut_point)
    temp = np.log(1 + np.exp(-temp))  # the loss is log(1 + exp(-sgn(y-c)(f-c)))
    out = np.sum(temp)
    return 'loss', out


print("---------ordinal regression loss-----------")
params = {'tree_method': 'hist'}
model = xgb.train(params=params, dtrain=dtrain, num_boost_round=20, early_stopping_rounds=50,
            evals=[(dtrain, 'train'), (dtest, 'test')], verbose_eval=1, obj=logistic_obj, feval=metric)
In [18]:
pre_data = test_data[['date', 'instrument']].copy()  # .copy() avoids SettingWithCopyWarning
pre_data['pre_label'] = model.predict(dtest)
In [21]:
pre_data.sort_values('date')
Out[21]:
date instrument pre_label
571374 2020-10-23 300048.SZ -0.392531
2198098 2020-10-23 603528.SH -0.161591
801071 2020-10-23 000980.SZ -0.267982
2198363 2020-10-23 300715.SZ -0.101751
1608128 2020-10-23 600232.SH -0.362535
... ... ... ...
2104209 2021-12-30 688609.SH -0.343264
1909501 2021-12-30 603655.SH -0.173466
2459684 2021-12-30 301025.SZ -0.408208
2460752 2021-12-30 601279.SH -0.283394
618921 2021-12-30 000534.SZ -0.236235

1254135 rows × 3 columns

In [22]:
from bigdatasource.api import DataSource
from bigdata.api.datareader import D
from biglearning.api import M
from biglearning.api import tools as T
from biglearning.module2.common.data import Outputs
 
import pandas as pd
import numpy as np
import math
import dai
import warnings
import datetime
from datetime import timedelta
 
from zipline.finance.commission import PerOrder
from zipline.api import get_open_orders
from zipline.api import symbol
 
from bigtrader.sdk import *
from bigtrader.utils.my_collections import NumPyDeque
from bigtrader.constant import OrderType
from bigtrader.constant import Direction

# Set up the backtest
instruments = {'market': 'CN_STOCK_A', 'instruments': list(pre_data.instrument.unique()), 'start_date': '2020-10-23', 'end_date': '2021-12-30'}
instruments = DataSource.write_pickle(instruments)

df = DataSource.write_df(pre_data)
In [24]:
# Trading engine: initialization, runs once
def m4_initialize_bigquant_run(context):
    # Load the predictions
    context.df = context.options['data'].read_df()

# Trading engine: called once before each trading session.
def m4_before_trading_start_bigquant_run(context, data):
    # Pre-market processing, e.g. subscribing to quotes
    pass

# Trading engine: tick handler, runs once per tick
def m4_handle_tick_bigquant_run(context, tick):
    pass

# Trading engine: bar handler, runs once per bar
def m4_handle_data_bigquant_run(context, data):
    dt = data.current_dt.strftime('%Y-%m-%d')

    # Get today's predictions
    df = context.df[context.df['date']==dt].sort_values('pre_label', ascending=False)
    instruments = list(df[df['pre_label']>0].instrument)[:10]
    
    # Current positions
    holding = context.get_account_positions()
    holding_list = list(holding.keys())

    # Sell holdings no longer in the buy list
    for ins in holding_list:
        if ins not in instruments and data.can_trade(ins):
            context.order_target(ins, 0)
            holding_list.remove(ins)
    
    # Buy stocks not yet held
    for ins in instruments:
        if ins not in holding_list and data.can_trade(ins) and len(holding_list)<10:
            context.order_target_percent(ins, 1/10)
            holding_list.append(ins)


# Trading engine: trade-report handler, runs once per fill
def m4_handle_trade_bigquant_run(context, trade):
    pass

# Trading engine: order-report handler, runs on each order update
def m4_handle_order_bigquant_run(context, order):
    pass

# Trading engine: post-market handler, runs once after each session
def m4_after_trading_bigquant_run(context, data):
    pass


m4 = M.hftrade.v2(
    instruments=instruments,
    options_data=df,
    start_date='',
    end_date='',
    initialize=m4_initialize_bigquant_run,
    before_trading_start=m4_before_trading_start_bigquant_run,
    handle_tick=m4_handle_tick_bigquant_run,
    handle_data=m4_handle_data_bigquant_run,
    handle_trade=m4_handle_trade_bigquant_run,
    handle_order=m4_handle_order_bigquant_run,
    after_trading=m4_after_trading_bigquant_run,
    capital_base=1000000,
    frequency='daily',
    price_type='真实价格',
    product_type='股票',
    before_start_days='0',
    volume_limit=1,
    order_price_field_buy='open',
    order_price_field_sell='open',
    benchmark='000300.SH',
    plot_charts=True,
    disable_cache=False,
    replay_bdb=False,
    show_debug_info=False,
    backtest_only=False
)
  • Return 37.72%
  • Annualized return 30.38%
  • Benchmark return 4.3%
  • Alpha 0.31
  • Beta 0.27
  • Sharpe ratio 1.11
  • Win rate 0.51
  • Profit/loss ratio 1.22
  • Return volatility 25.05%
  • Information ratio 0.06
  • Max drawdown 17.07%

Let's tweak the loss function and turn the problem into a binary classification task that directly predicts up or down.

In [25]:
import numpy as np
import xgboost as xgb

# Custom loss function
def logistic_obj(pred, dtrain):
    # Get the labels and binarize them to ±1
    label = dtrain.get_label()
    label = np.where(label>0, 1, -1)
    prob_pre_label = 1 / (np.exp(-label * pred) + 1)

    # First derivative
    grad = -label * (1 - prob_pre_label)

    # Second derivative
    hessian = prob_pre_label * (1 - prob_pre_label)
    return grad, hessian

dtrain = xgb.DMatrix(xtrain, ytrain)
dtest = xgb.DMatrix(xtest, ytest)

# Custom evaluation metric (MSE is clearly unsuitable for a classifier)
def metric(pred, y):
    pre_label = 1 / (1+np.exp(-pred))
    y_ = y.get_label()
    y_ = np.where(y_>0, 1, -1)
    acc = np.sum(np.where(pre_label>0.5, 1, -1)==y_) / len(y_)
    return 'acc', acc


print("---------logistic loss-----------")
params = {'tree_method': 'hist'}
model = xgb.train(params=params, dtrain=dtrain, num_boost_round=5, early_stopping_rounds=50,
            evals=[(dtrain, 'train'), (dtest, 'test')], verbose_eval=1, obj=logistic_obj, feval=metric)
In [26]:
prob = 1 / (1 + np.exp(-model.predict(dtest)))
np.sum(np.where(prob>0.5, 1, -1) == np.where(ytest>0, 1, -1)) / (len(ytest))
Out[26]:
0.5549322840045131
In [27]:
pre_data_ = test_data[['date', 'instrument']].copy()  # .copy() avoids SettingWithCopyWarning
pre_data_['pre_label'] = prob
In [29]:
pre_data_.sort_values('date')
Out[29]:
date instrument pre_label
571374 2020-10-23 300048.SZ 0.451978
2198098 2020-10-23 603528.SH 0.480584
801071 2020-10-23 000980.SZ 0.458659
2198363 2020-10-23 300715.SZ 0.495811
1608128 2020-10-23 600232.SH 0.446237
... ... ... ...
2104209 2021-12-30 688609.SH 0.456730
1909501 2021-12-30 603655.SH 0.482958
2459684 2021-12-30 301025.SZ 0.448454
2460752 2021-12-30 601279.SH 0.471264
618921 2021-12-30 000534.SZ 0.466919

1254135 rows × 3 columns

In [30]:
from bigdatasource.api import DataSource
from bigdata.api.datareader import D
from biglearning.api import M
from biglearning.api import tools as T
from biglearning.module2.common.data import Outputs
 
import pandas as pd
import numpy as np
import math
import dai
import warnings
import datetime
from datetime import timedelta
 
from zipline.finance.commission import PerOrder
from zipline.api import get_open_orders
from zipline.api import symbol
 
from bigtrader.sdk import *
from bigtrader.utils.my_collections import NumPyDeque
from bigtrader.constant import OrderType
from bigtrader.constant import Direction

# Set up the backtest
instruments = {'market': 'CN_STOCK_A', 'instruments': list(pre_data_.instrument.unique()), 'start_date': '2020-10-23', 'end_date': '2021-12-30'}
instruments = DataSource.write_pickle(instruments)

df = DataSource.write_df(pre_data_)
In [31]:
# Trading engine: initialization, runs once
def m4_initialize_bigquant_run(context):
    # Load the predictions
    context.df = context.options['data'].read_df()

# Trading engine: called once before each trading session.
def m4_before_trading_start_bigquant_run(context, data):
    # Pre-market processing, e.g. subscribing to quotes
    pass

# Trading engine: tick handler, runs once per tick
def m4_handle_tick_bigquant_run(context, tick):
    pass

# Trading engine: bar handler, runs once per bar
def m4_handle_data_bigquant_run(context, data):
    dt = data.current_dt.strftime('%Y-%m-%d')

    # Get today's predictions
    df = context.df[context.df['date']==dt].sort_values('pre_label', ascending=False)
    instruments = list(df[df['pre_label']>0].instrument)[:10]
    
    # Current positions
    holding = context.get_account_positions()
    holding_list = list(holding.keys())

    # Sell holdings no longer in the buy list
    for ins in holding_list:
        if ins not in instruments and data.can_trade(context.symbol(ins)):
            context.order_target(ins, 0)
            holding_list.remove(ins)
    
    # Buy stocks not yet held
    for ins in instruments:
        if ins not in holding_list and data.can_trade(context.symbol(ins)) and len(holding_list)<10:
            context.order_target_percent(ins, 1/10)
            holding_list.append(ins)


# Trading engine: trade-report handler, runs once per fill
def m4_handle_trade_bigquant_run(context, trade):
    pass

# Trading engine: order-report handler, runs on each order update
def m4_handle_order_bigquant_run(context, order):
    pass

# Trading engine: post-market handler, runs once after each session
def m4_after_trading_bigquant_run(context, data):
    pass


m4 = M.hftrade.v2(
    instruments=instruments,
    options_data=df,
    start_date='',
    end_date='',
    initialize=m4_initialize_bigquant_run,
    before_trading_start=m4_before_trading_start_bigquant_run,
    handle_tick=m4_handle_tick_bigquant_run,
    handle_data=m4_handle_data_bigquant_run,
    handle_trade=m4_handle_trade_bigquant_run,
    handle_order=m4_handle_order_bigquant_run,
    after_trading=m4_after_trading_bigquant_run,
    capital_base=1000000,
    frequency='daily',
    price_type='真实价格',
    product_type='股票',
    before_start_days='0',
    volume_limit=1,
    order_price_field_buy='open',
    order_price_field_sell='open',
    benchmark='000300.SH',
    plot_charts=True,
    disable_cache=False,
    replay_bdb=False,
    show_debug_info=False,
    backtest_only=False
)
  • Return 67.84%
  • Annualized return 53.6%
  • Benchmark return 4.3%
  • Alpha 0.56
  • Beta 0.14
  • Sharpe ratio 1.8
  • Win rate 0.49
  • Profit/loss ratio 1.34
  • Return volatility 24.85%
  • Information ratio 0.09
  • Max drawdown 10.95%