Two separate embedding layers: one for the tokens and one for the token index (positions).
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
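As a quick sanity check, the layer can be applied to a batch of integer token IDs; this is an illustrative sketch, and the maxlen, vocab_size, and embed_dim values here are placeholder choices, not values fixed by the text above.

# Hypothetical usage sketch: embed a batch of integer token IDs.
# Shapes: input (batch, maxlen) -> output (batch, maxlen, embed_dim).
embedding_layer = TokenAndPositionEmbedding(maxlen=200, vocab_size=20000, embed_dim=32)
dummy_tokens = tf.random.uniform((4, 200), minval=0, maxval=20000, dtype=tf.int32)
embedded = embedding_layer(dummy_tokens)
print(embedded.shape)  # (4, 200, 32)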
3 Download and prepare dataset
vocab_size = 20000  # Only consider the top 20k words
maxlen = 200        # Only consider the first 200 words of each movie review

(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
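Note that load_data only limits the vocabulary; maxlen is not applied yet, so the reviews typically still need to be padded or truncated to that length before batching. A minimal sketch, assuming the standard keras.preprocessing.sequence.pad_sequences utility:

# Pad/truncate every review to exactly maxlen tokens so batches have a fixed shape.
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)
print(x_train.shape, x_val.shape)  # e.g. (25000, 200) (25000, 200)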
The Transformer layer outputs one vector for each time step of the input sequence. Here, we take the mean across all time steps and use a feed-forward network on top of it to classify the text.
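A minimal sketch of that classification head, assuming the TokenAndPositionEmbedding layer above and a TransformerBlock layer with signature TransformerBlock(embed_dim, num_heads, ff_dim) defined elsewhere in this text; the layer sizes here are illustrative choices, not prescribed by the passage.

# Build the classifier: embed -> transformer -> mean over time steps -> feed-forward head.
embed_dim = 32   # illustrative embedding size
num_heads = 2    # illustrative number of attention heads
ff_dim = 32      # illustrative hidden size inside the transformer block

inputs = layers.Input(shape=(maxlen,))
x = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)  # one vector per time step
x = layers.GlobalAveragePooling1D()(x)                 # mean across all time steps
x = layers.Dense(20, activation="relu")(x)             # feed-forward classifier head
outputs = layers.Dense(2, activation="softmax")(x)     # positive / negative review

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])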