IMDB Sentiment Analysis (Warm-Start From Pretrained Embedding)

Econ 425T / Biostat 203B

Author

Dr. Hua Zhou @ UCLA

Published

February 26, 2023

Display system information for reproducibility.

import IPython
print(IPython.sys_info())
{'commit_hash': 'add5877a4',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Users/huazhou/opt/anaconda3/lib/python3.9/site-packages/IPython',
 'ipython_version': '8.8.0',
 'os_name': 'posix',
 'platform': 'macOS-10.16-x86_64-i386-64bit',
 'sys_executable': '/Users/huazhou/opt/anaconda3/bin/python3',
 'sys_platform': 'darwin',
 'sys_version': '3.9.12 (main, Apr  5 2022, 01:56:13) \n[Clang 12.0.0 ]'}
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        here_1.0.1        lattice_0.20-45   png_0.1-8        
 [5] withr_2.5.0       rprojroot_2.0.3   digest_0.6.29     grid_4.2.2       
 [9] jsonlite_1.8.0    magrittr_2.0.3    evaluate_0.15     rlang_1.0.6      
[13] stringi_1.7.8     cli_3.4.1         rstudioapi_0.13   Matrix_1.5-1     
[17] reticulate_1.27   rmarkdown_2.14    tools_4.2.2       stringr_1.4.0    
[21] htmlwidgets_1.6.1 xfun_0.31         yaml_2.3.5        fastmap_1.1.0    
[25] compiler_4.2.2    htmltools_0.5.4   knitr_1.39       

Load libraries.

# Plotting tool
import matplotlib.pyplot as plt
# Load Tensorflow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_hub as hub
library(keras)
library(tfhub)

Source: https://tensorflow.rstudio.com/tutorials/keras/text_classification_with_hub

1 Prepare data

Different from the earlier experiment of fitting an LSTM to the preprocessed IMDB data, here we start from the original raw text of the IMDB reviews and warm-start from a pretrained text embedding.

We download the IMDB dataset from a static URL (if it is not already in the cache):

if (dir.exists("aclImdb/"))
  unlink("aclImdb/", recursive = TRUE)
url <- "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset <- get_file(
  "aclImdb_v1",
  url,
  untar = TRUE,
  cache_dir = '.',
  cache_subdir = ''
)
unlink("aclImdb/train/unsup/", recursive = TRUE)

We can then create TensorFlow datasets from the directory structure using the text_dataset_from_directory function. Note that the training and validation calls below must use the same seed so that the 80/20 split does not overlap between the two subsets:

batch_size = 512
seed = 425

train_data = keras.utils.text_dataset_from_directory(
  'aclImdb/train',
  batch_size = batch_size,
  validation_split = 0.2,
  subset = 'training',
  seed = seed
)
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
validation_data = keras.utils.text_dataset_from_directory(
  'aclImdb/train',
  batch_size = batch_size,
  validation_split = 0.2,
  subset = 'validation',
  seed = seed
)
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
test_data = keras.utils.text_dataset_from_directory(
  'aclImdb/test',
  batch_size = batch_size
)
Found 25000 files belonging to 2 classes.
batch_size <- 512
seed <- 425

train_data <- text_dataset_from_directory(
  'aclImdb/train',
  batch_size = batch_size,
  validation_split = 0.2,
  subset = 'training',
  seed = seed
)
validation_data <- text_dataset_from_directory(
  'aclImdb/train',
  batch_size = batch_size,
  validation_split = 0.2,
  subset = 'validation',
  seed = seed
)
test_data <- text_dataset_from_directory(
  'aclImdb/test',
  batch_size = batch_size
)

Let’s take a moment to understand the format of the data. Each example is the raw text of a movie review together with a corresponding label. The text is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 indicates a negative review and 1 a positive review.
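
As a quick sanity check, here is a minimal sketch that inspects one batch: the dataset yields a string tensor of reviews and an integer tensor of labels, each of length batch_size.

# Peek at one training batch
for text_batch, label_batch in train_data.take(1):
  # expected: shape (512,), dtype string
  print(text_batch.shape, text_batch.dtype)
  # expected: shape (512,), dtype int32
  print(label_batch.shape, label_batch.dtype)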

Let’s print the first example in a training batch.

batch = list(train_data.as_numpy_iterator())[0]
batch[0][0]
b"Let me start out by saying I'm a big Carrey fan. Although I'll admit I haven't seen all of his movies *cough*the magestic*cough*. Bruce Almighty was enjoyable. None of the other reviews have really gone into how cheesy it gets towards the end, I dont know what the writers were thinking. Somehow I couldn't help but feel like this movie was a poor attempt at re-creating Liar Liar.<br /><br />On a positive note, The Daily Show's Steve Correl is HILARIOUS and so is the rest of the cast. See Bruce Almighty if you're a big Jim Carrey fan, or if you just want to see a light-hearted (que soft piano music) somewhat funny comedy."

The first 10 labels:

batch[1][0:10]
array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0], dtype=int32)
batch <- train_data %>%
  reticulate::as_iterator() %>%
  reticulate::iter_next()

batch[[1]][1]
tf.Tensor(b"Let me start out by saying I'm a big Carrey fan. Although I'll admit I haven't seen all of his movies *cough*the magestic*cough*. Bruce Almighty was enjoyable. None of the other reviews have really gone into how cheesy it gets towards the end, I dont know what the writers were thinking. Somehow I couldn't help but feel like this movie was a poor attempt at re-creating Liar Liar.<br /><br />On a positive note, The Daily Show's Steve Correl is HILARIOUS and so is the rest of the cast. See Bruce Almighty if you're a big Jim Carrey fan, or if you just want to see a light-hearted (que soft piano music) somewhat funny comedy.", shape=(), dtype=string)

Let’s also print the first 10 labels.

batch[[2]][1:10]
tf.Tensor([1 1 1 0 0 1 0 0 0 0], shape=(10), dtype=int32)

2 Build model

Let’s first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is: (num_examples, embedding_dimension).

embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(
  handle = embedding, 
  # Enable fine-tuning (takes longer)
  trainable = True
  )
# Embed the first training texts
hub_layer(batch[0][0:1])
<tf.Tensor: shape=(1, 50), dtype=float32, numpy=
array([[ 0.76370066,  0.183997  , -0.12718768,  0.70335877,  0.07327911,
        -0.03899736,  0.10226654, -0.30235237, -0.50765055,  0.62317777,
         0.22796501, -0.09297381, -0.1462293 ,  0.20359744, -0.50289273,
         0.12905747, -0.46563283,  0.46837053,  0.3183409 , -0.53685   ,
         0.02151133, -0.40384126,  0.18346405,  0.21639028, -0.3739372 ,
         0.17969884, -1.0825881 , -0.08053909,  0.5606583 , -0.32753116,
        -0.7381755 ,  0.07553624,  0.28268006, -0.14106293, -0.40518084,
         0.27209735,  0.4923586 , -0.09804886,  0.2137844 , -0.45998612,
         0.289826  ,  0.12571187, -0.26875192,  0.02086616, -0.43353546,
        -0.11499331, -0.6014056 , -0.2741146 ,  0.04519763, -0.06563535]],
      dtype=float32)>
embedding <- "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer <- tfhub::layer_hub(
  handle = embedding, 
  # Enable fine-tuning (takes longer)
  trainable = TRUE
  )
# Embed the first training texts
hub_layer(batch[[1]][1:2])
tf.Tensor(
[[ 0.76370066  0.183997   -0.12718768  0.70335877  0.07327911 -0.03899736
   0.10226654 -0.30235237 -0.50765055  0.62317777  0.22796501 -0.09297381
  -0.1462293   0.20359744 -0.50289273  0.12905747 -0.46563283  0.46837053
   0.3183409  -0.53685     0.02151133 -0.40384126  0.18346405  0.21639028
  -0.3739372   0.17969884 -1.0825881  -0.08053909  0.5606583  -0.32753116
  -0.7381755   0.07553624  0.28268006 -0.14106293 -0.40518084  0.27209735
   0.4923586  -0.09804886  0.2137844  -0.45998612  0.289826    0.12571187
  -0.26875192  0.02086616 -0.43353546 -0.11499331 -0.6014056  -0.2741146
   0.04519763 -0.06563535]
 [ 0.9949969   0.5141704   0.16449574  0.8084587  -0.20159443 -0.50888604
   0.38217273 -0.12473521 -1.0824962   0.7963115  -0.5053254   0.13061531
   0.02384013 -0.05730984 -0.53146046 -0.40029854 -0.78000605 -0.22182897
   0.48937032 -1.3286033   0.06303684 -0.16473001  1.3556808  -0.23764718
  -0.49831617  0.63716036 -1.8083394   0.09699536  0.24990597 -1.0124916
  -0.5400294   0.5142796   1.0795236  -0.64328635 -0.76760125  0.46185046
   0.34145948 -0.41720492  0.38070628 -1.0945162  -0.12662072  0.37232855
  -0.5240889   0.7304352  -0.21560599 -0.26014465 -0.54552066 -1.1023974
   0.16730602 -0.00775222]], shape=(2, 50), dtype=float32)
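
To confirm that the embedding size does not depend on the input length, here is a small sketch that embeds two made-up strings of very different lengths (the sentences are hypothetical examples, not drawn from the dataset):

# Both inputs map to a 50-dimensional vector regardless of text length
toy_examples = tf.constant([
  "Great movie.",
  "This film was far longer than it needed to be, but the cast was superb."
])
print(hub_layer(toy_examples).shape)  # expected: (2, 50)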

Let’s now build and compile the full model:

model = keras.Sequential([
  hub_layer,
  layers.Dense(units = 16, activation = 'relu'),
  layers.Dense(units = 1, activation = 'sigmoid')
]
)
model.compile(
  optimizer = 'adam',
  loss = "binary_crossentropy",
  metrics = 'accuracy'  
)
model <- keras_model_sequential() %>%
  hub_layer() %>%
  layer_dense(16, activation = 'relu') %>%
  layer_dense(1)

Compile the model. The R model above outputs raw logits (no sigmoid activation on the final layer), so we use loss_binary_crossentropy(from_logits = TRUE); this is equivalent to the sigmoid-output model compiled with binary cross-entropy in the Python code.

model %>% compile(
  optimizer = 'adam',
  loss = loss_binary_crossentropy(from_logits = TRUE),
  metrics = 'accuracy'
)

3 Training

Fit the model for 10 epochs, monitoring loss and accuracy on the validation set:

history = model.fit(
  train_data,
  epochs = 10,
  validation_data = validation_data,
  verbose = 2
)
Epoch 1/10
40/40 - 26s - loss: 0.6573 - accuracy: 0.6324 - val_loss: 0.5966 - val_accuracy: 0.7352 - 26s/epoch - 652ms/step
Epoch 2/10
40/40 - 25s - loss: 0.5138 - accuracy: 0.7992 - val_loss: 0.4478 - val_accuracy: 0.8186 - 25s/epoch - 631ms/step
Epoch 3/10
40/40 - 25s - loss: 0.3577 - accuracy: 0.8723 - val_loss: 0.3529 - val_accuracy: 0.8574 - 25s/epoch - 634ms/step
Epoch 4/10
40/40 - 25s - loss: 0.2550 - accuracy: 0.9133 - val_loss: 0.3138 - val_accuracy: 0.8680 - 25s/epoch - 636ms/step
Epoch 5/10
40/40 - 26s - loss: 0.1888 - accuracy: 0.9384 - val_loss: 0.2967 - val_accuracy: 0.8728 - 26s/epoch - 638ms/step
Epoch 6/10
40/40 - 25s - loss: 0.1403 - accuracy: 0.9597 - val_loss: 0.2927 - val_accuracy: 0.8802 - 25s/epoch - 624ms/step
Epoch 7/10
40/40 - 27s - loss: 0.1034 - accuracy: 0.9743 - val_loss: 0.2998 - val_accuracy: 0.8798 - 27s/epoch - 664ms/step
Epoch 8/10
40/40 - 26s - loss: 0.0763 - accuracy: 0.9836 - val_loss: 0.3120 - val_accuracy: 0.8768 - 26s/epoch - 640ms/step
Epoch 9/10
40/40 - 26s - loss: 0.0558 - accuracy: 0.9909 - val_loss: 0.3195 - val_accuracy: 0.8802 - 26s/epoch - 661ms/step
Epoch 10/10
40/40 - 27s - loss: 0.0409 - accuracy: 0.9948 - val_loss: 0.3334 - val_accuracy: 0.8784 - 27s/epoch - 673ms/step

model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 keras_layer (KerasLayer)    (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 16)                816       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 48,191,433
Trainable params: 48,191,433
Non-trainable params: 0
_________________________________________________________________
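
Nearly all of the roughly 48 million parameters sit in the pretrained embedding, because the hub layer was created with trainable = True and is fine-tuned together with the classifier head. For comparison, a minimal sketch of the frozen warm start, in which only the two dense layers (833 parameters) would be trained:

# A sketch: keep the pretrained embedding frozen instead of fine-tuning it
frozen_hub_layer = hub.KerasLayer(embedding, trainable = False)
frozen_model = keras.Sequential([
  frozen_hub_layer,
  layers.Dense(units = 16, activation = 'relu'),
  layers.Dense(units = 1, activation = 'sigmoid')
])
frozen_model.compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = 'accuracy'
)
# frozen_model.summary() would report 48,190,600 non-trainable parameters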

Visualize the training process:

plt.figure()
plt.ylabel("Loss (training and validation)")
plt.xlabel("Training Epoches")
plt.ylim([0, 1])
(0.0, 1.0)
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.show()

plt.figure()
plt.ylabel("Accuracy (training and validation)")
plt.xlabel("Training Epoches")
plt.ylim([0, 1])
(0.0, 1.0)
plt.plot(history.history["accuracy"])
plt.plot(history.history["val_accuracy"])
plt.show()
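
The curves suggest mild overfitting: the validation loss bottoms out around epoch 6 and then drifts upward while the training accuracy approaches 1. One common remedy is to stop on the validation loss; below is a sketch with an illustrative (untuned) patience value.

# A sketch: stop training once validation loss stops improving
# (the patience value is an illustrative assumption, not tuned)
early_stop = keras.callbacks.EarlyStopping(
  monitor = "val_loss",
  patience = 2,
  restore_best_weights = True
)
# Pass callbacks = [early_stop] to model.fit() to activate it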

system.time({
history <- model %>% fit(
  train_data,
  epochs = 10,
  validation_data = validation_data,
  verbose = 2
)
})
    user   system  elapsed 
1367.804  403.264  265.980 
summary(model)
Model: "sequential_1"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 keras_layer_1 (KerasLayer)         (None, 50)                      48190600    
 dense_3 (Dense)                    (None, 16)                      816         
 dense_2 (Dense)                    (None, 1)                       17          
================================================================================
Total params: 48,191,433
Trainable params: 48,191,433
Non-trainable params: 0
________________________________________________________________________________

Visualize the training process:

plot(history)

4 Testing

Finally, evaluate the trained model on the held-out test set:

results = model.evaluate(test_data, verbose = 2)
49/49 - 5s - loss: 0.3634 - accuracy: 0.8601 - 5s/epoch - 110ms/step
results
[0.36342179775238037, 0.8601199984550476]
results <- model %>% evaluate(test_data, verbose = 2)
results
     loss  accuracy 
0.3853427 0.8546000
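
Because the hub layer consumes raw strings, the trained Python model can score new reviews directly. A minimal sketch (the two reviews below are made-up examples, not from the dataset):

# Predicted probability of a positive review for new raw-text inputs
new_reviews = tf.constant([
  "An absolute delight from start to finish.",
  "Dull, predictable, and far too long."
])
print(model.predict(new_reviews))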