Deep Learning Dialogue Systems in Practice: the Legacy tf.contrib.legacy_seq2seq API and a Walkthrough of Its Source Code

Created by ypyu; last updated by ypyu on 2023-06-14 03:02.

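In the previous article we analyzed various seq2seq models and built up a theoretical understanding of them. Next we look at how these models are implemented in TensorFlow. A solid understanding of the code should help a lot when we build dialogue system models later.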

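After a TensorFlow version upgrade, the code that used to live in tf.nn.seq2seq was moved to tf.contrib.legacy_seq2seq. This API will probably be dropped eventually as well, because a newer and more flexible API has been developed under tf.contrib.seq2seq. However, most of the code and implementations that can currently be found online still use legacy_seq2seq, so we start by analyzing what these functions do and how they are implemented. All the functions covered here are defined in [python/ops/seq2seq.py](https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py).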

Let's first look at how this file is organized. Grouped by call relationship and purpose, its main functions form the following structure:

model_with_buckets
1, seq2seq functions
    basic_rnn_seq2seq
        rnn_decoder
    tied_rnn_seq2seq
    embedding_tied_rnn_seq2seq
    embedding_rnn_seq2seq
        embedding_rnn_decoder
    embedding_attention_seq2seq
        embedding_attention_decoder
            attention_decoder
                attention
    one2many_rnn_seq2seq
2, loss functions
    sequence_loss_by_example
    sequence_loss

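Here I will mainly cover the most fully featured functions, which are enough to implement a dialogue system based on a seq2seq model. Let's go through them one by one, following the call order: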

The model_with_buckets() function

We start with the top-level function, model_with_buckets(), which is defined as follows:

   def model_with_buckets(encoder_inputs,
                          decoder_inputs,
                          targets,
                          weights,
                          buckets,
                          seq2seq,
                          softmax_loss_function=None,
                          per_example_loss=False,
                          name=None):

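The purpose of this function is to reduce the amount of computation and speed up the model. This code is fairly old, so you will notice that parts of it still use functions such as static_rnn(); newer versions of TensorFlow introduced dynamic_rnn, which makes this approach unnecessary. It is still worth analyzing, and the idea is simple: input lengths are divided into several ranges, so that each example only needs to be padded to the length of its bucket rather than to the global maximum length. Take buckets = [(5, 10), (10, 20), (20, 30), ...]. In each bucket the first number is the padded source length and the second is the padded target length. For example, the pair '我爱你' --> 'I love you' would be assigned to the first bucket, so '我爱你' is padded to a sequence of length 5 and 'I love you' to a sequence of length 10. In effect each bucket describes the shape configuration of one model. A model is built for every bucket, each training step feeds sequences of the matching length, and all of these models share their parameters. A comparison with dynamic_rnn helps here: with dynamic_rnn each batch only needs to be padded to the longest example in that batch, whereas bucketing groups examples by length during data preprocessing. With the principle clear, let's look at the function's parameters and internals: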

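   encoder_inputs: the encoder inputs, a list of tensors. Each item in the list holds the tokens of one encoder time step for the whole batch.
   decoder_inputs: the decoder inputs, in the same format
   targets:        the target values, int32; typically the same sequence as decoder_inputs shifted by one time step (decoder_inputs starts with the GO symbol, targets ends with <EOS>)
   weights:        mask over the target sequence: weight=0 at padding positions, weight=1 otherwise
   buckets:        the bucket definitions, a list such as [(5, 10), (10, 20), (20, 30), ...]
   seq2seq:        the seq2seq model function, called as seq2seq(encoder_inputs, decoder_inputs); it can wrap embedding_attention_seq2seq, embedding_rnn_seq2seq, basic_rnn_seq2seq and so on, all described below
   softmax_loss_function: the function used to compute the loss, with signature (labels, logits) -> loss-batch; defaults to sparse_softmax_cross_entropy_with_logits
   per_example_loss: if True, sequence_loss_by_example is called and the result holds one loss value per example in the batch; if False, sequence_loss is called and a single scalar loss is returned for the whole batch (averaged over the batch by default); see the analysis below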
   name: Optional name for this operation, defaults to "model_with_buckets".
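
To make the bucketing and the targets/weights conventions more tangible, here is a minimal plain-Python sketch of how one training pair could be prepared. The special-token ids (GO_ID, EOS_ID, PAD_ID), the helper names pick_bucket and make_example, and the toy token ids are assumptions for illustration, not part of the library. It handles a single example, whereas model_with_buckets expects time-major lists of batch-sized tensors, and real pipelines may differ in details.

```python
GO_ID, EOS_ID, PAD_ID = 1, 2, 0  # hypothetical special-token ids
BUCKETS = [(5, 10), (10, 20), (20, 30)]


def pick_bucket(source_len, target_len, buckets=BUCKETS):
    """Return the index of the first bucket that fits both sequences."""
    for bucket_id, (source_size, target_size) in enumerate(buckets):
        # The target needs two extra slots for GO and EOS.
        if source_len <= source_size and target_len + 2 <= target_size:
            return bucket_id
    raise ValueError("pair does not fit into any bucket")


def make_example(source_ids, target_ids, buckets=BUCKETS):
    """Pad one (source, target) pair to its bucket and build the mask."""
    bucket_id = pick_bucket(len(source_ids), len(target_ids), buckets)
    source_size, target_size = buckets[bucket_id]
    encoder_inputs = source_ids + [PAD_ID] * (source_size - len(source_ids))
    decoder_inputs = [GO_ID] + target_ids + [EOS_ID]
    decoder_inputs += [PAD_ID] * (target_size - len(decoder_inputs))
    # Targets are the decoder inputs shifted left by one step.
    targets = decoder_inputs[1:] + [PAD_ID]
    # Weight 1.0 on real target tokens (including EOS), 0.0 on padding.
    weights = [0.0 if t == PAD_ID else 1.0 for t in targets]
    return bucket_id, encoder_inputs, decoder_inputs, targets, weights


# [11, 12, 13] stands in for '我爱你', [21, 22, 23] for 'I love you'.
bucket_id, enc, dec, tgt, w = make_example([11, 12, 13], [21, 22, 23])
print(bucket_id, enc)
print(dec)
print(tgt)
print(w)
```

The pair lands in the (5, 10) bucket, so the source is padded to 5 ids and the decoder side to 10, and the weights mask out every padded target position.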

I won't paste all of the internal code here, only the key parts:

   # Losses and outputs, one entry per bucket
   all_inputs = encoder_inputs + decoder_inputs + targets + weights
   losses = []
   outputs = []
   with ops.name_scope(name, "model_with_buckets", all_inputs):
     # Build a model for every bucket
     for j, bucket in enumerate(buckets):
       # Parameters are shared across buckets: variables are reused from the second bucket on
       with variable_scope.variable_scope(variable_scope.get_variable_scope(), reuse=True if j > 0 else None):

         # Call seq2seq to get the outputs. Note that encoder_inputs and decoder_inputs are predefined placeholders,
         # each a list as long as the largest bucket; with the buckets above they have 20 and 30 entries respectively.
         # For each bucket only the matching number of placeholders is used, e.g. the first 5 / 10 for bucket (5, 10)
         bucket_outputs, _ = seq2seq(encoder_inputs[:bucket[0]], decoder_inputs[:bucket[1]])
         outputs.append(bucket_outputs)
         # With per_example_loss, call sequence_loss_by_example; losses gets a tensor of shape [batch_size]
         if per_example_loss:
           losses.append(
               sequence_loss_by_example(
                   outputs[-1],
                   targets[:bucket[1]],
                   weights[:bucket[1]],
                   softmax_loss_function=softmax_loss_function))
         # Otherwise call sequence_loss, which reduces the per-example losses; losses gets a scalar
         else:
           losses.append(
               sequence_loss(
                   outputs[-1],
                   targets[:bucket[1]],
                   weights[:bucket[1]],
                   softmax_loss_function=softmax_loss_function))

   return outputs, losses

The function returns outputs and losses; the shapes of these tensors are described in the comments above.

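The embedding_attention_seq2seq() function

The function above calls a seq2seq function to do the actual encoding and decoding. Here we take the most fully featured implementation as an example of how a seq2seq model is built. As its name suggests, it implements both embedding and attention, where the attention follows the definition in the paper "Neural Machine Translation by Jointly Learning to Align and Translate":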

   def embedding_attention_seq2seq(encoder_inputs,
                                   decoder_inputs,
                                   cell,
                                   num_encoder_symbols,
                                   num_decoder_symbols,
                                   embedding_size,
                                   num_heads=1,
                                   output_projection=None,
                                   feed_previous=False,
                                   dtype=None,
                                   scope=None,
                                   initial_state_attention=False):

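In the rest of this walkthrough, parameters that were already described for an earlier function and are defined identically here, such as encoder_inputs and decoder_inputs, are not described again. The remaining parameters mean the following: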

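   cell:                an RNNCell; any of the common RNNCell implementations can be used.
   num_encoder_symbols: vocab_size of the source side, used to define the encoder embedding matrix
   num_decoder_symbols: vocab_size of the target side, used to define the decoder embedding matrix
   embedding_size:      dimension of the embedding vectors
   num_heads:           number of attention heads, i.e. how many separately parameterized attention weightings are computed, each producing its own attention vector
   output_projection:   the output projection layer. The decoder output has dimension output_size, so mapping it to one of the num_decoder_symbols words needs an extra projection with parameters W and b, W: [output_size, num_decoder_symbols], b: [num_decoder_symbols]
   feed_previous:       whether to feed the output of the previous step as the input of the next step. Usually set to True at test time, in which case only the first element of decoder_inputs is used.
   initial_state_attention: defaults to False, meaning the initial attention is all zeros; if True, the initial attention is computed from the initial state and the attention states.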

Now let's look at a few key code fragments:

   # Encoder. Deep-copy the cell first: the encoder and the decoder have the same structure but do not
   # share parameters, so they need two separate RNNCell objects
   encoder_cell = copy.deepcopy(cell)
   # Embed the encoder inputs by wrapping the RNNCell in an EmbeddingWrapper
   encoder_cell = core_rnn_cell.EmbeddingWrapper(encoder_cell,
       embedding_classes=num_encoder_symbols,
       embedding_size=embedding_size)
   # static_rnn is still used here to build the RNN
   encoder_outputs, encoder_state = rnn.static_rnn(encoder_cell, encoder_inputs, dtype=dtype)

   # First calculate a concatenation of encoder outputs to put attention on.
   # Turn the list of encoder outputs into one tensor of shape [batch_size, encoder_input_length, output_size],
   # which can then be used as the input of the attention mechanism
   top_states = [array_ops.reshape(e, [-1, 1, cell.output_size]) for e in encoder_outputs]
   attention_states = array_ops.concat(top_states, 1)
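
As a rough illustration of what the reshape and concat at the end of this snippet do to the data layout, here is a plain-Python sketch. Nested lists stand in for tensors, and the helper name stack_time_major and the numbers are made up for illustration; this is not the TensorFlow code itself.

```python
# Per-step encoder outputs: a list of length T, each entry [batch_size][output_size].
encoder_outputs = [
    [[0.1, 0.2], [0.3, 0.4]],  # step 0
    [[0.5, 0.6], [0.7, 0.8]],  # step 1
    [[0.9, 1.0], [1.1, 1.2]],  # step 2
]


def stack_time_major(step_outputs):
    """Turn T x [batch][size] into [batch][T][size]."""
    batch_size = len(step_outputs[0])
    return [[step[b] for step in step_outputs] for b in range(batch_size)]


attention_states = stack_time_major(encoder_outputs)
# Layout is now [batch_size][encoder_input_length][output_size].
print(len(attention_states), len(attention_states[0]), len(attention_states[0][0]))
print(attention_states[1])
```

After the transformation, attention_states[b][t] holds the output of example b at time step t, which is the layout the attention mechanism expects.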

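The code above runs the encoder with embedding and yields encoder_outputs, the hidden representation at every time step. The per-step outputs are then reshaped and concatenated into a single tensor of shape [batch_size, encoder_input_length, output_size], which makes it easy to compute the context vector Ci for each decoding step. Next comes the decoder code: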

   # Decoder.
   output_size = None
   # If no output_projection is given, wrap the RNNCell in an OutputProjectionWrapper so that its output is projected to num_decoder_symbols dimensions
   if output_projection is None:
     cell = core_rnn_cell.OutputProjectionWrapper(cell, num_decoder_symbols)
     output_size = num_decoder_symbols

   # If feed_previous is a Python bool, call embedding_attention_decoder directly
   if isinstance(feed_previous, bool):
     return embedding_attention_decoder(
         decoder_inputs,
         encoder_state,
         attention_states,
         cell,
         num_decoder_symbols,
         embedding_size,
         num_heads=num_heads,
         output_size=output_size,
         output_projection=output_projection,
         feed_previous=feed_previous,
         initial_state_attention=initial_state_attention)

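The RNNCell is first wrapped in an OutputProjectionWrapper to project the output layer (only when no output_projection was passed in), and then embedding_attention_decoder is called directly to decode. When feed_previous is not a Python bool but a tensor, the following logic runs instead: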

   # If feed_previous is a tensor, build both decoder variants in the graph and select one with tf.cond
   def decoder(feed_previous_bool):
     # This function is called twice: decoder(True) creates the variables (no reuse), decoder(False) reuses them
     reuse = None if feed_previous_bool else True
     with variable_scope.variable_scope(variable_scope.get_variable_scope(), reuse=reuse):
       outputs, state = embedding_attention_decoder(
           decoder_inputs,
           encoder_state,
           attention_states,
           cell,
           num_decoder_symbols,
           embedding_size,
           num_heads=num_heads,
           output_size=output_size,
           output_projection=output_projection,
           feed_previous=feed_previous_bool,
           update_embedding_for_previous=False,
           initial_state_attention=initial_state_attention)
       state_list = [state]
       if nest.is_sequence(state):
         state_list = nest.flatten(state)
       return outputs + state_list

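   # The value of feed_previous is only known at run time, so cond picks one of the two branches when the graph is executed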
   outputs_and_state = control_flow_ops.cond(feed_previous,
                                             lambda: decoder(True),
                                             lambda: decoder(False))
   outputs_len = len(decoder_inputs)  # Outputs length same as decoder inputs.
   state_list = outputs_and_state[outputs_len:]
   state = state_list[0]
   if nest.is_sequence(encoder_state):
     state = nest.pack_sequence_as(
         structure=encoder_state, flat_sequence=state_list)
   return outputs_and_state[:outputs_len], state

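A few words on how to read this code. tf.cond is essentially an if/else control-flow construct: if feed_previous evaluates to true, the result of decoder(True) is used, otherwise that of decoder(False). The decoder function itself simply calls embedding_attention_decoder and appends the flattened state to the list of outputs; the caller then splits the combined list back into outputs and state. Functionally this is the same as the bool case above. The difference is that a tensor-valued feed_previous (for example a boolean placeholder) has no known value while the graph is being built, so both variants have to be constructed and the choice is deferred to run time, which lets one graph serve both training and inference. Outputs and state are packed into a single flat list because cond expects both branches to return tensors in the same structure, and a flat list is the simplest form that satisfies this.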
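
The flatten, slice and repack bookkeeping around cond can be mimicked in plain Python. The sketch below is only an analogue: the helpers flatten and pack_like are hypothetical stand-ins for nest.flatten and nest.pack_sequence_as, strings stand in for tensors, and an ordinary Python conditional replaces tf.cond (which, unlike this sketch, builds both branches in the graph).

```python
def flatten(structure):
    """Flatten nested tuples/lists into a flat list, depth first."""
    if isinstance(structure, (list, tuple)):
        flat = []
        for item in structure:
            flat.extend(flatten(item))
        return flat
    return [structure]


def pack_like(structure, flat_values):
    """Rebuild the nesting of `structure` from a flat list of values."""
    values = iter(flat_values)

    def build(node):
        if isinstance(node, (list, tuple)):
            return tuple(build(child) for child in node)
        return next(values)

    return build(structure)


def decoder(feed_previous_bool):
    """Toy branch: three outputs plus a two-layer LSTM-like state."""
    tag = "fed" if feed_previous_bool else "teacher"
    outputs = ["%s_out%d" % (tag, i) for i in range(3)]
    state = (("c0", "h0"), ("c1", "h1"))
    return outputs + flatten(state)  # one flat list, as in the excerpt


encoder_state = (("ec0", "eh0"), ("ec1", "eh1"))  # only its nesting is used
feed_previous = True  # stands in for the run-time value of the tensor
outputs_and_state = decoder(True) if feed_previous else decoder(False)
outputs_len = 3
outputs = outputs_and_state[:outputs_len]
state = pack_like(encoder_state, outputs_and_state[outputs_len:])
print(outputs)
print(state)
```

The first outputs_len entries are the decoder outputs, and the remaining entries are restored to the same nesting as encoder_state.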

1.1, the embedding_attention_decoder function

embedding_attention_seq2seq calls this function directly when decoding, so let's look at its definition:

   def embedding_attention_decoder(decoder_inputs,
                                   initial_state,
                                   attention_states,
                                   cell,
                                   num_symbols,
                                   embedding_size,
                                   num_heads=1,
                                   output_size=None,
                                   output_projection=None,
                                   feed_previous=False,
                                   update_embedding_for_previous=True,
                                   dtype=None,
                                   scope=None,
                                   initial_state_attention=False):

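Most of the arguments are passed straight through from the previous function, so their meaning should be easy to follow. Only the important ones are described here:

   initial_state:    2D Tensor [batch_size x cell.state_size], the initial state of the RNN
   attention_states: 3D Tensor [batch_size x attn_length x attn_size], the encoder hidden vectors computed above
   num_symbols:      vocab_size of the decoder
   update_embedding_for_previous: Boolean; only has an effect when feed_previous is True. If it is False, only the embedding of the first decoder input (the 'GO' symbol) is updated by backpropagation, and the embeddings of symbols generated by the decoder itself are left unchanged.

This function first defines the embedding matrix of the decoder stage, which is used either to turn the decoder output into the input vector of the next step or to turn decoder_inputs into the corresponding word vectors. It then calls attention_decoder to enter the attention decoding stage.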

     with variable_scope.variable_scope(scope or "embedding_attention_decoder", dtype=dtype) as scope:
       # Embedding matrix of the decoder stage
       embedding = variable_scope.get_variable("embedding", [num_symbols, embedding_size])
       # Only used with feed_previous: apply output_projection to the previous cell output, take the argmax symbol and embed it to get the input of the current step
       loop_function = _extract_argmax_and_embed(embedding, output_projection, update_embedding_for_previous) if feed_previous else None
       # Embed decoder_inputs into word vectors (with feed_previous only the first one, GO, is actually used)
       emb_inp = [embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs]
       return attention_decoder(
           emb_inp,
           initial_state,
           attention_states,
           cell,
           output_size=output_size,
           num_heads=num_heads,
           loop_function=loop_function,
           initial_state_attention=initial_state_attention)
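
To make the role of loop_function more concrete, here is a rough plain-Python sketch of what _extract_argmax_and_embed does at each step when feed_previous is set: optionally project the previous output to vocabulary logits, take the argmax symbol, and look up its embedding. The helper name make_argmax_and_embed, the nested-list representation and the toy numbers are assumptions for illustration. The real function works on tensors and additionally stops gradients from flowing into the embedding when update_embedding_for_previous is False, which this sketch cannot show.

```python
def project(row, weight, bias):
    """Compute row * W + b for one example; W is [output_size][vocab]."""
    return [sum(row[k] * weight[k][j] for k in range(len(row))) + bias[j]
            for j in range(len(bias))]


def make_argmax_and_embed(embedding, output_projection=None):
    """Return a loop_function(prev, i) in the style of the legacy API."""

    def loop_function(prev, _step):
        # prev: [batch_size][output_size], the previous decoder outputs.
        if output_projection is not None:
            weight, bias = output_projection
            prev = [project(row, weight, bias) for row in prev]
        # Pick the highest-scoring symbol for every example in the batch.
        symbols = [max(range(len(logits)), key=lambda j: logits[j])
                   for logits in prev]
        # Embed the chosen symbols to get the next decoder input.
        return [embedding[s] for s in symbols]

    return loop_function


embedding = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # toy vocab of 3 symbols
loop_fn = make_argmax_and_embed(embedding)
next_inputs = loop_fn([[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]], 1)
print(next_inputs)

# Same idea with an explicit output projection (W, b).
weight = [[0.0, 0.0, 5.0], [1.0, 0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
projected_fn = make_argmax_and_embed(embedding, (weight, bias))
projected_inputs = projected_fn([[1.0, 0.0]], 1)
print(projected_inputs)
```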

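1.1.1, the attention_decoder() function

This can be considered the core function of attention-based seq2seq. Both the attention computation and the decoder loop are implemented here, so this is where the formulas from the paper show up in code: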

   def attention_decoder(decoder_inputs,
                         initial_state,
                         attention_states,
                         cell,
                         output_size=None,
                         num_heads=1,
                         loop_function=None,
                         dtype=None,
                         scope=None,
                         initial_state_attention=False):

   loop_function: If not None, this function will be applied to i-th output
     in order to generate i+1-th input, and decoder_inputs will be ignored,
     except for the first element ("GO" symbol). Signature: loop_function(prev, i) = next
       * prev is a 2D Tensor of shape [batch_size x output_size],
       * i is an integer, the step number (when advanced control is needed),
       * next is a 2D Tensor of shape [batch_size x input_size].

Now let's look at the actual implementation:

   # To calculate W1 * h_t we use a 1-by-1 convolution, need to reshape before.
   # To apply a 1x1 convolution, reshape attention_states into a 4-D tensor [batch_size, attn_length, 1, attn_size];
   # the fourth dimension, attn_size, plays the role of the input channels
   hidden = array_ops.reshape(attention_states, [-1, attn_length, 1, attn_size])

   # Per-head values for the num_heads attention heads: hidden_features holds W * h_j and v holds the vector v; every head has its own parameters
   hidden_features = []
   v = []
   # ---- Next, compute v * tanh(W * h_j + U * z_i) as the relevance score between h_j and z_i ----
   attention_vec_size = attn_size  # Size of query vectors for attention.
   # Compute W * h_j for every encoder hidden vector
   for a in xrange(num_heads):
     # The kernel is 1x1 with attn_size input channels and attention_vec_size filters
     k = variable_scope.get_variable("AttnW_%d" % a, [1, 1, attn_size, attention_vec_size])
     # The convolution output has shape [batch_size, attn_length, 1, attention_vec_size]
     hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
     v.append(variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
   state = initial_state

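The code above precomputes W * h_j for all encoder hidden vectors. Next a function is defined to evaluate the formula v * tanh(W * h_j + U * z_i). Every decoding step supplies its own query vector, namely the hidden state of the decoder RNN, so wrapping the computation in a function is a natural choice.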

   def attention(query):
     """Put attention masks on hidden using hidden_features and query."""
     ds = []  # Results of attention reads will be stored here.

     # If query is a tuple, flatten it and concatenate the parts into one 2-D tensor
     if nest.is_sequence(query):  # If the query is a tuple, flatten it.
       query_list = nest.flatten(query)
       for q in query_list:  # Check that ndims == 2 if specified.
         ndims = q.get_shape().ndims
         if ndims:
           assert ndims == 2
       query = array_ops.concat(query_list, 1)

     for a in xrange(num_heads):
       with variable_scope.variable_scope("Attention_%d" % a):
         # Compute U * z_i and reshape it to [batch_size, 1, 1, attention_vec_size]
         y = Linear(query, attention_vec_size, True)(query)
         y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
         # Attention mask is a softmax of v^T * tanh(...).
         # Compute v * tanh(W * h_j + U * z_i).
         # hidden_features[a] + y has shape [batch_size, attn_length, 1, attention_vec_size]; multiplying by v ([attention_vec_size]) keeps that shape.
         # reduce_sum over dimensions 2 and 3 gives a [batch_size, attn_length] tensor, i.e. one score per hidden vector
         s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y), [2, 3])
         # Normalize the scores with softmax
         a = nn_ops.softmax(s)
         # Now calculate the attention-weighted vector d.
         # Weighted sum over all hidden vectors
         d = math_ops.reduce_sum(array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
         ds.append(array_ops.reshape(d, [-1, attn_size]))
     return ds
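
The arithmetic inside attention() can be followed by hand with a small plain-Python sketch for one example and one head: scores s_j = sum(v * tanh(W*h_j + U*z_i)), weights a = softmax(s), context d = sum_j a_j * h_j. The function name additive_attention and the toy values are assumptions for illustration; the projections W*h_j and U*z_i are passed in precomputed, as hidden_features and y are in the excerpt.

```python
import math


def additive_attention(hidden, hidden_features, query_features, v):
    """Single-example, single-head additive attention.

    hidden:          [attn_length][attn_size], the encoder vectors h_j
    hidden_features: [attn_length][vec_size], precomputed W * h_j
    query_features:  [vec_size], precomputed U * z_i
    v:               [vec_size]
    """
    scores = [
        sum(v_k * math.tanh(f_k + y_k)
            for v_k, f_k, y_k in zip(v, feats, query_features))
        for feats in hidden_features
    ]
    # Numerically stable softmax over the attn_length scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the encoder vectors.
    attn_size = len(hidden[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden))
               for d in range(attn_size)]
    return weights, context


hidden = [[1.0, 0.0], [0.0, 1.0]]
# Identical features give identical scores, hence uniform weights.
uniform_w, uniform_ctx = additive_attention(
    hidden, [[0.0, 0.0], [0.0, 0.0]], [0.3, -0.2], [1.0, 1.0])
# A larger score for the first position shifts weight towards it.
skewed_w, skewed_ctx = additive_attention(hidden, [[1.0], [-1.0]], [0.0], [1.0])
print(uniform_w, uniform_ctx)
print(skewed_w, skewed_ctx)
```

With several heads, the library repeats this computation once per head with separate parameters and returns the list of context vectors.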

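With the attention function defined, the next step is to loop over the inputs and compute the output of each decoding step in turn. Note that attention returns a list with one result per attention head. At every step this list is combined with the current decoder input, projected by a linear layer to the input size, and fed to the RNNCell for decoding. The code is as follows: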

   # Earlier in the function, prev is initialised to None and attns to all-zero vectors.
   # With initial_state_attention=True the initial attention is computed from initial_state instead of staying all zeros
   if initial_state_attention:
     attns = attention(initial_state)
   # Loop over all decoder_inputs and decode step by step
   for i, inp in enumerate(decoder_inputs):
     if i > 0:
       # For i > 0, reuse the parameters of the decoder RNN
       variable_scope.get_variable_scope().reuse_variables()
     # If loop_function is set, we use it instead of decoder_inputs.
     # To feed the previous output as the current input, override inp with the result of loop_function
     if loop_function is not None and prev is not None:
       with variable_scope.variable_scope("loop_function", reuse=True):
         inp = loop_function(prev, i)
     # Merge input and previous attentions into one vector of the right size.
     input_size = inp.get_shape().with_rank(2)[1]
     if input_size.value is None:
       raise ValueError("Could not infer input size from input: %s" % inp.name)

     # Combine inp with attns and project the result to input_size before feeding it to the RNN cell
     inputs = [inp] + attns
     x = Linear(inputs, input_size, True)(inputs)
     # Run the RNN.
     cell_output, state = cell(x, state)
     # Run the attention mechanism.
     # Compute the attention vectors for the next step
     if i == 0 and initial_state_attention:
       with variable_scope.variable_scope(variable_scope.get_variable_scope(), reuse=True):
         attns = attention(state)
     else:
       attns = attention(state)
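
The control flow of this loop, in particular how loop_function overrides the decoder inputs, can be seen in a stripped-down plain-Python sketch. Attention and the linear projections are left out on purpose, and run_decoder, toy_cell and toy_loop are hypothetical names with toy arithmetic chosen only to make the data flow visible.

```python
def run_decoder(decoder_inputs, initial_state, cell, loop_function=None):
    """Step through the inputs; optionally feed back the previous output."""
    outputs, state, prev = [], initial_state, None
    for i, inp in enumerate(decoder_inputs):
        # From the second step on, loop_function replaces the given input.
        if loop_function is not None and prev is not None:
            inp = loop_function(prev, i)
        output, state = cell(inp, state)
        outputs.append(output)
        if loop_function is not None:
            prev = output
    return outputs, state


def toy_cell(inp, state):
    """Toy recurrent cell: the new state is the running sum of inputs."""
    new_state = state + inp
    return new_state, new_state  # (output, state)


def toy_loop(prev, _step):
    """Toy loop function: the next input is twice the previous output."""
    return prev * 2


teacher_forced, _ = run_decoder([1, 100, 100], 0, toy_cell)
fed_back, _ = run_decoder([1, 100, 100], 0, toy_cell, toy_loop)
fed_back_other, _ = run_decoder([1, -5, 7], 0, toy_cell, toy_loop)
print(teacher_forced)
print(fed_back)
print(fed_back_other)
```

With a loop function, only the first decoder input (the GO position) influences the result, which is why the last two runs produce the same outputs.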

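This completes the walkthrough of the code for the attention seq2seq model. The remaining seq2seq models are functional subsets of this one, so they are not covered in detail. Next, let's look at the code that computes the loss:

2, loss computation functions

We start with the first function, sequence_loss_by_example. The code is fairly simple: it measures the difference between the decoder outputs and targets. Note that it returns a 1-D tensor of shape [batch_size], in which each value is the loss of one example: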

   def sequence_loss_by_example(logits,
                                targets,
                                weights,
                                average_across_timesteps=True,
                                softmax_loss_function=None,
                                name=None):
     log_perp_list = []
     # Compute the loss of every time step and append it to log_perp_list
     for logit, target, weight in zip(logits, targets, weights):
       # If no softmax_loss_function is given, fall back to sparse_softmax_cross_entropy_with_logits
       if softmax_loss_function is None:
         target = array_ops.reshape(target, [-1])
         crossent = nn_ops.sparse_softmax_cross_entropy_with_logits(
             labels=target, logits=logit)
       else:
         crossent = softmax_loss_function(labels=target, logits=logit)
       # weight is 0 or 1 and marks padding: at a padded position it is 0, so that position contributes no loss
       log_perp_list.append(crossent * weight)
     # Sum the losses over all time steps; add_n adds up the tensors in a list element-wise
     log_perps = math_ops.add_n(log_perp_list)
     # When averaging over time steps, divide by the sum of the weights, not by the number of steps
     if average_across_timesteps:
       total_size = math_ops.add_n(weights)
       total_size += 1e-12  # Just to avoid division by 0 for all-0 weights.
       log_perps /= total_size
     return log_perps
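
The same computation can be checked by hand with a small plain-Python sketch. Nested lists stand in for tensors in the time-major layout used by the API, and the names softmax_cross_entropy and loss_by_example are made up for this illustration; it mirrors the default branch (sparse softmax cross-entropy) only.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example given unnormalized logits."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[label]


def loss_by_example(logits, targets, weights, average_across_timesteps=True):
    """logits: [T][batch][vocab]; targets and weights: [T][batch]."""
    batch_size = len(targets[0])
    log_perps = [0.0] * batch_size
    totals = [0.0] * batch_size
    for step_logits, step_targets, step_weights in zip(logits, targets, weights):
        for b in range(batch_size):
            crossent = softmax_cross_entropy(step_logits[b], step_targets[b])
            log_perps[b] += crossent * step_weights[b]  # padding adds 0
            totals[b] += step_weights[b]
    if average_across_timesteps:
        # Divide by the number of real (non-padded) steps of each example.
        log_perps = [lp / (t + 1e-12) for lp, t in zip(log_perps, totals)]
    return log_perps


# Two time steps, batch of two, vocabulary of two, uniform logits.
logits = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
targets = [[0, 1], [1, 0]]
weights = [[1.0, 1.0], [1.0, 0.0]]  # second example is padded at step 1
averaged = loss_by_example(logits, targets, weights)
summed = loss_by_example(logits, targets, weights, average_across_timesteps=False)
print(averaged)
print(summed)
```

Because the average divides by the sum of the weights, the padded example gets the same per-step loss as the unpadded one here, rather than being diluted by its padding.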

Next comes sequence_loss. It is very simple: it calls the function above and then sums or averages the losses over the examples in the batch. The result is a scalar.

   def sequence_loss(logits,
                     targets,
                     weights,
                     average_across_timesteps=True,
                     average_across_batch=True,
                     softmax_loss_function=None,
                     name=None):
     with ops.name_scope(name, "sequence_loss", logits + targets + weights):
       # Sum the losses over the examples in the batch
       cost = math_ops.reduce_sum(
           sequence_loss_by_example(
               logits,
               targets,
               weights,
               average_across_timesteps=average_across_timesteps,
               softmax_loss_function=softmax_loss_function))
       # To average over the batch, divide by batch_size
       if average_across_batch:
         batch_size = array_ops.shape(targets[0])[0]
         return cost / math_ops.cast(batch_size, cost.dtype)
       else:
         return cost
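
The batch reduction is simple enough to state in a few lines of plain Python. The function name batch_loss and the input values are assumptions for illustration; the input plays the role of the per-example losses returned by sequence_loss_by_example.

```python
def batch_loss(per_example_losses, average_across_batch=True):
    """Reduce per-example losses to one scalar, like sequence_loss does."""
    cost = sum(per_example_losses)
    if average_across_batch:
        return cost / len(per_example_losses)
    return cost


mean_loss = batch_loss([1.0, 3.0])
total_loss = batch_loss([1.0, 3.0], average_across_batch=False)
print(mean_loss, total_loss)
```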

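That concludes our analysis of the seq2seq code in TensorFlow. After reading it you should have a general picture of how it works and how to call these functions. The next post will analyze the code of an actual dialogue system. I also plan to study tf.contrib.seq2seq, the newest seq2seq API in TensorFlow, and use it to build seq2seq models with more idiomatic code.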