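DNN Rolling Training for 5-Day Stock Selection
Created by bq93t66l, last updated by bq93t66l
1. Strategy Overview
This strategy uses a DNN model that is retrained on a rolling basis for each year from 2018 to 2025. Each training set covers the preceding five years and each test set covers the following year; for 2018, for example, the training set spans 2013-01-01 to 2017-12-31 and the test set spans 2018-01-01 to 2018-12-31. The data are full-market data for each period, focused on price-volume fields and features derived from them, such as the ratio of the 5-day mean to the current value, 5-day price-volume correlations, and cross-sectional percentile ranks. The label is the cross-sectional percentile rank of the 5-day forward return. In the backtest, the 200 stocks with the highest predicted scores are selected, and the portfolio is rebalanced every 5 days.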
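The rolling schedule described above can be sketched in a few lines. The helper name `rolling_windows` is hypothetical and does not appear in the original post; it only reproduces the date arithmetic of five calendar years of training followed by one test year.

```python
def rolling_windows(first_test_year=2018, last_test_year=2025, train_years=5):
    """Yield (train_start, train_end, test_start, test_end) for each test year."""
    for year in range(first_test_year, last_test_year + 1):
        yield (
            f"{year - train_years}-01-01",  # training starts five years earlier
            f"{year - 1}-12-31",            # training ends the day before the test year
            f"{year}-01-01",
            f"{year}-12-31",
        )


windows = list(rolling_windows())
for window in windows:
    print(window)
```

Each tuple maps onto one retraining run, so the 2018 to 2025 backtest consists of eight independently trained models.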
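2. Data Processing
The data come from the cn_stock_bar1d table. The raw price-volume fields and the constructed factors together give 53 model input features. Only the training-set features are winsorized at 4 standard deviations, and the training-set labels are winsorized at the 1st and 99th percentiles; the test-set features are left untouched to avoid leaking future information into the backtest. No standardization is applied during data preparation; it is handled inside the model by BatchNorm1d. The implementation is as follows: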
import dai  # BigQuant data API; assumed to be available in the platform environment
import polars as pl


def get_data(start_date, end_date, is_train=True):
    # Feature query: 53 features built from daily price-volume fields
    sql1 = """
    WITH feature_table AS (
        /* Base features */
        SELECT date, instrument, close close_0, open open_0, high high_0, low low_0, amount amount_0, turn * 100 turn_0, change_ratio + 1 return_0,
        /* 5-day moving averages relative to the current value */
        m_AVG(close,5)/close ma_close_5,
        m_AVG(turn * 100,5)/turn ma_turn_5,
        m_AVG(amount,5)/amount ma_amount_5,
        m_AVG(change_ratio + 1, 5)/(change_ratio + 1) ma_cr_5,
        /* 5-day standard deviations */
        m_STDDEV(close, 5) std_close_5,
        m_STDDEV(turn * 100,5) std_turn_5,
        m_STDDEV(amount,5) std_amount_5,
        m_STDDEV(change_ratio + 1,5) std_cr_5,
        /* 5-day rolling rank as a fraction */
        m_rolling_rank(close, 5)/5 rank_close_5,
        m_rolling_rank(low, 5)/5 rank_low_5,
        m_rolling_rank(open, 5)/5 rank_open_5,
        m_rolling_rank(high, 5)/5 rank_high_5,
        m_rolling_rank(turn * 100, 5)/5 rank_turn_5,
        m_rolling_rank(amount, 5)/5 rank_amount_5,
        m_rolling_rank(change_ratio+1, 5)/5 rank_cr_5,
        /* 5-day correlations */
        m_CORR(volume, change_ratio+1, 5) corr_vcr,
        m_CORR(volume, close, 5) corr_vc,
        m_CORR(volume, turn * 100, 5) corr_vt,
        m_CORR(change_ratio+1, close, 5) corr_crc,
        m_CORR(change_ratio+1, turn, 5) corr_crt,
        m_CORR(high, low, 5) corr_hl,
        m_CORR(high, close, 5) corr_hc,
        m_CORR(high, open, 5) corr_ho,
        m_CORR(low, close, 5) corr_lc,
        m_CORR(low, open, 5) corr_lo,
        m_CORR(close, open, 5) corr_co,
        m_CORR(close, turn * 100, 5) corr_ct,
        /* Cross-sectional features */
        c_pct_rank(turn) cross_turn,
        c_pct_rank(change_ratio + 1) cross_change_ratio,
        c_pct_rank(ma_close_5) cross_ma_close_5,
        c_pct_rank(ma_turn_5) cross_ma_turn_5,
        c_pct_rank(ma_amount_5) cross_ma_amount_5,
        c_pct_rank(ma_cr_5) cross_ma_cr_5,
        c_pct_rank(std_close_5) cross_std_close_5,
        c_pct_rank(std_turn_5) cross_std_turn_5,
        c_pct_rank(std_amount_5) cross_std_amount_5,
        c_pct_rank(std_cr_5) cross_max_cr_r,
        c_pct_rank(rank_close_5) cross_rank_close_5,
        c_pct_rank(rank_turn_5) cross_rank_turn_5,
        c_pct_rank(rank_amount_5) cross_rank_amount_5,
        c_pct_rank(rank_cr_5) cross_rank_cr_5,
        c_pct_rank(corr_vcr) cross_corr_vcr,
        c_pct_rank(corr_vc) cross_corr_vc,
        c_pct_rank(corr_vt) cross_corr_vt,
        c_pct_rank(corr_crc) cross_corr_crc,
        c_pct_rank(corr_crt) cross_corr_crt,
        FROM cn_stock_bar1d
        QUALIFY COLUMNS(*) IS NOT NULL
    )
    """
    if is_train:
        print('Extracting training data')
        sql2 = """
        /* Label */
        ,
        label_table AS (
            SELECT date, instrument,
            m_lead(close, 5) / m_lead(open, 1) - 1 AS _future_return,
            all_quantile_cont(_future_return, 0.01) AS _future_return_1pct,
            all_quantile_cont(_future_return, 0.99) AS _future_return_99pct,
            clip(_future_return, _future_return_1pct, _future_return_99pct) AS _label,
            c_pct_rank(_label) as label,
            FROM cn_stock_bar1d
            QUALIFY COLUMNS(*) IS NOT NULL AND m_lead(high, 1) != m_lead(low, 1)
        )
        -- Feature standardization removed
        SELECT date, instrument, label, COLUMNS(feature_table.* EXCLUDE (date, instrument)) FROM feature_table
        INNER JOIN label_table USING (date, instrument)
        ORDER BY date, instrument;
        """
    else:
        print('Extracting test data')
        sql2 = """
        /* Data extraction */
        SELECT feature_table.* FROM feature_table
        ORDER BY date, instrument
        """
    sql = sql1 + sql2
    df = dai.query(sql, filters={'date': [start_date, end_date]}).df()
    df = pl.from_pandas(df)
    # Treat NaN as missing, forward-fill within each instrument, then zero-fill what remains
    df = df.fill_nan(None)
    df = df.select(pl.all().forward_fill().over('instrument'))
    df = df.fill_null(0)
    if is_train:
        # Winsorize every non-key column at mean +/- 4 standard deviations (training data only)
        df = df.with_columns(pl.exclude('date','instrument').clip(
            pl.exclude('date','instrument').mean()-4*pl.exclude('date','instrument').std(),
            pl.exclude('date','instrument').mean()+4*pl.exclude('date','instrument').std()
        ))
    # df = df.with_columns((pl.col('label')-pl.col('label').mean())/(pl.col('label').std()+1e-6))
    return df
def get_train_test(start_year: str = '2023'):
    '''
    Default: 5 years of training data and 1 year of test data.
    start_year: first year of the test set; the training set is automatically the preceding 5 years.
    Periods run from January 1 to December 31.
    '''
    train_start_date = str(int(start_year) - 5) + '-01-01'
    train_end_date = str(int(start_year) - 1) + '-12-31'
    test_start_date = start_year + '-01-01'
    test_end_date = start_year + '-12-31'
    train_df = get_data(train_start_date, train_end_date, is_train=True)
    test_df = get_data(test_start_date, test_end_date, is_train=False)
    return train_df, test_df
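The `label_table` query relies on platform SQL functions (`m_lead`, `clip`, `c_pct_rank`), so a plain-Python sketch of the same three steps may help: a forward return from the next day's open to the close five days ahead, clipping to fixed bounds, and a cross-sectional rank. The helper names are hypothetical, and the exact scaling and tie handling of `c_pct_rank` are assumptions here (ranks are scaled to [0, 1] and ties are not treated specially). The query additionally drops rows where the next day's high equals its low, presumably to exclude days on which the stock could not realistically be traded.

```python
def future_return(opens, closes, horizon=5):
    """close[t + horizon] / open[t + 1] - 1 per day; None where the future is unavailable."""
    n = len(closes)
    return [
        closes[t + horizon] / opens[t + 1] - 1 if t + horizon < n else None
        for t in range(n)
    ]


def clip(value, lower, upper):
    """Winsorize one value to [lower, upper]; in the query the bounds are the 1% and 99% quantiles."""
    return max(lower, min(upper, value))


def pct_rank(values):
    """Rank values within one cross-section, scaled to [0, 1] (assumed scaling, no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = position / (len(values) - 1) if len(values) > 1 else 0.0
    return ranks


# Toy data: one stock's opens and closes over seven days
opens = [10, 10, 11, 12, 12, 13, 13]
closes = [10, 11, 12, 12, 13, 13, 14]
returns = future_return(opens, closes)

# Toy cross-section: clipped forward returns of three stocks on the same date
cross_section = [clip(r, -0.10, 0.10) for r in (0.02, -0.01, 0.25)]
labels = pct_rank(cross_section)
print(returns[:2], labels)
```

Because the label is a rank rather than a raw return, the model is trained to order stocks within each day, which matches the top-200 selection used in the backtest.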
3. Model
The model is a DNN (multilayer perceptron) with four fully connected layers. The code is as follows:
import torch.nn as nn


class DNN(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.pipe = nn.Sequential(
            nn.BatchNorm1d(input_dim),  # standardizes the raw input features inside the model
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 1),  # single output: the predicted label score
        )

    def forward(self, x):
        y = self.pipe(x)
        return y
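To give a sense of the model's size, the following sketch counts the trainable parameters of the architecture above for the 53 input features mentioned in the data section. It is plain arithmetic rather than a PyTorch call, and the helper names are hypothetical; it assumes the default `BatchNorm1d` with a learnable scale and shift per feature.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def dnn_param_count(input_dim, hidden=(256, 128, 64), output_dim=1):
    """Trainable parameters of BatchNorm1d(input_dim) followed by the linear stack."""
    total = 2 * input_dim  # BatchNorm1d affine scale and shift
    dims = (input_dim, *hidden, output_dim)
    for n_in, n_out in zip(dims[:-1], dims[1:]):
        total += linear_params(n_in, n_out)
    return total


print(dnn_param_count(53))  # 55147
```

At roughly 55 thousand parameters the network is small relative to several years of full-market daily rows.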
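4. Training
During training, the training set is further split 4:1 into a training subset and a validation subset. The loss function is MSE and the optimizer is Adam. The hyperparameters are as follows: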
| Parameter | Value |
|---|---|
| batch size | 512 |
| learning rate | 0.001 |
| max_epochs | 50 |
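The post does not say whether the 4:1 split is random or chronological. The sketch below assumes a chronological split, in which the earliest 80% of trading dates train the model and the latest 20% validate it. The helper `split_by_date` and the `HPARAMS` dictionary are hypothetical names used only for illustration.

```python
# Hyperparameters as listed in the table above
HPARAMS = {"batch_size": 512, "learning_rate": 1e-3, "max_epochs": 50}


def split_by_date(dates, train_fraction=0.8):
    """Split unique dates so the earliest 80% train and the latest 20% validate (assumed scheme)."""
    unique_dates = sorted(set(dates))
    cut = int(len(unique_dates) * train_fraction)
    return unique_dates[:cut], unique_dates[cut:]


# ISO-formatted date strings sort chronologically
dates = [f"2017-01-{day:02d}" for day in range(1, 11)]
train_dates, valid_dates = split_by_date(dates)
print(len(train_dates), len(valid_dates), valid_dates[0])
```

A chronological split keeps validation data strictly later than training data; a random split would satisfy the 4:1 ratio as well but mixes adjacent days across the two subsets.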
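5. Results
The backtest rebalances every 5 days, each time selecting the 200 stocks with the highest predicted scores. The benchmark is the CSI 300.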
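The selection rule can be sketched as follows. This is not the platform backtest code: the function names are hypothetical, and starting the rebalance cycle on the first trading day is an assumption, since the post does not specify it.

```python
def rebalance_days(trading_days, period=5):
    """Every `period`-th trading day, starting with the first one (assumed start)."""
    return trading_days[::period]


def select_top_k(scores, k=200):
    """Return the k instruments with the highest predicted score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [instrument for instrument, _ in ranked[:k]]


# Toy example with twelve trading days and four instruments
days = [f"d{i}" for i in range(12)]
scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.7}
schedule = rebalance_days(days)
picks = select_top_k(scores, k=2)
print(schedule, picks)
```

In the actual strategy `k` is 200 and the scores are the model's predictions for that day's cross-section.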
Backtest performance by year
| Year | Benchmark (CSI 300) return | Strategy cumulative return | Excess return |
|---|---|---|---|
| 2018 | -27.63% | -3.01% | +34.02% |
| 2019 | +34.42% | +44.41% | +7.43% |
| 2020 | +26.72% | +11.26% | -12.2% |
| 2021 | -10.1% | +23.83% | +37.73% |
| 2022 | -20.07% | -2.1% | +22.48% |
| 2023 | -14.5% | +1.45% | +18.65% |
| 2024 | +19.75% | +49.46% | +24.81% |
| 2025 | +22.9% | +33.34% | +8.49% |
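The excess-return column is not the simple difference between strategy and benchmark returns (for 2018 that difference would be 24.62 points, not 34.02). The figures are instead consistent with a geometric definition, (1 + strategy) / (1 + benchmark) - 1. The check below recomputes the column from the other two under that assumption; the formula is inferred from the numbers and is not stated in the original post.

```python
# year: (benchmark %, strategy %, reported excess %), copied from the table above
rows = {
    2018: (-27.63, -3.01, 34.02),
    2019: (34.42, 44.41, 7.43),
    2020: (26.72, 11.26, -12.2),
    2021: (-10.1, 23.83, 37.73),
    2022: (-20.07, -2.1, 22.48),
    2023: (-14.5, 1.45, 18.65),
    2024: (19.75, 49.46, 24.81),
    2025: (22.9, 33.34, 8.49),
}


def geometric_excess(strategy_pct, benchmark_pct):
    """Excess return in percent under the assumed geometric definition."""
    return ((1 + strategy_pct / 100) / (1 + benchmark_pct / 100) - 1) * 100


gaps = {}
for year, (bench, strat, reported) in rows.items():
    computed = geometric_excess(strat, bench)
    gaps[year] = abs(computed - reported)
    print(year, round(computed, 2), reported)
```

All eight years agree with the reported values to within rounding of the inputs.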
Backtest curve, 2018-2025