LSTM and GRU

Dhiraj Patra
Oct 11, 2024

Long Short-Term Memory (LSTM) Networks

LSTMs are a type of Recurrent Neural Network (RNN) designed to handle sequential data with long-term dependencies.

Key Features:

Cell State: Preserves information over long periods.

Gates: Control information flow (input, output, and forget gates).

Hidden State: Temporary memory for short-term information.

Related Technologies:

Recurrent Neural Networks (RNNs): Basic architecture for sequential data.

Gated Recurrent Units (GRUs): Simplified version of LSTMs.

Bidirectional RNNs/LSTMs: Process input sequences in both directions.

Encoder-Decoder Architecture: Used for sequence-to-sequence tasks.

Real-World Applications:

Language Translation

Speech Recognition

Text Generation

Time Series Forecasting

Gated Recurrent Unit (GRU) Networks

GRUs are an alternative to LSTMs, designed to be faster and more efficient while still capturing long-term dependencies.

Key Differences from LSTMs:

Simplified Architecture: Fewer gates (update and reset) and fewer state vectors.

Faster Computation: Reduced number of parameters.
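The parameter saving can be made concrete with a quick count. The sketch below assumes the textbook formulation in which every gate or candidate transform has one input weight matrix, one recurrent weight matrix, and one bias vector. The function names are illustrative, and library implementations may count slightly differently (some keep a second bias vector per transform).

```python
def lstm_param_count(input_dim: int, hidden_dim: int) -> int:
    """Parameters of one LSTM layer: 4 transforms (input, forget, cell candidate, output)."""
    per_transform = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return 4 * per_transform


def gru_param_count(input_dim: int, hidden_dim: int) -> int:
    """Parameters of one GRU layer: 3 transforms (update, reset, hidden candidate)."""
    per_transform = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return 3 * per_transform


# Compare the two counts for a few (input_dim, hidden_dim) pairs.
for n_in, n_hidden in [(1, 2), (1, 50), (128, 256)]:
    lstm = lstm_param_count(n_in, n_hidden)
    gru = gru_param_count(n_in, n_hidden)
    print(f"input={n_in:>3} hidden={n_hidden:>3}  LSTM={lstm:>7}  GRU={gru:>7}  ratio={gru / lstm:.2f}")
```

Under this assumption a GRU has three quarters of the parameters of an LSTM with the same input and hidden sizes.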

Technical Details for LSTMs and GRUs:

LSTM Mathematical Formulation:

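Let x_t be the input at time t, h_t be the hidden state, and c_t be the cell state. Here W * x and U * h are matrix-vector products, while sigmoid, tanh, and the products between gate and state vectors are applied element-wise.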

Input Gate: i_t = sigmoid(W_i * x_t + U_i * h_(t-1) + b_i)

Forget Gate: f_t = sigmoid(W_f * x_t + U_f * h_(t-1) + b_f)

Cell State Update: c_t = f_t * c_(t-1) + i_t * tanh(W_c * x_t + U_c * h_(t-1) + b_c)

Output Gate: o_t = sigmoid(W_o * x_t + U_o * h_(t-1) + b_o)

Hidden State Update: h_t = o_t * tanh(c_t)

Parameters:

W_i, W_f, W_c, W_o: Input weight matrices (applied to x_t) for the input gate, forget gate, cell candidate, and output gate.

U_i, U_f, U_c, U_o: Recurrent weight matrices (applied to h_(t-1)) for the same four transforms.

b_i, b_f, b_c, b_o: Bias vectors.
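To see how these equations fit together, here is a minimal NumPy sketch of a single LSTM step. The helper names (`lstm_step`, `init_params`), the parameter dictionary, the dimensions, and the random initialization are illustrative assumptions, not part of any library. To illustrate the role of the cell state, the forget-gate bias is pushed high and the input-gate bias low, so the cell state should pass through ten steps almost unchanged.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names such as 'W_i' to arrays."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    g_t = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate cell values
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate
    c_t = f_t * c_prev + i_t * g_t   # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def init_params(input_dim, hidden_dim, rng):
    """Small random weights and zero biases (an illustrative choice)."""
    p = {}
    for gate in "ifco":
        p[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        p[f"U_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        p[f"b_{gate}"] = np.zeros(hidden_dim)
    return p


rng = np.random.default_rng(0)
params = init_params(input_dim=3, hidden_dim=4, rng=rng)

# Push the forget gate towards 1 and the input gate towards 0:
# the cell state should then be carried over almost unchanged.
params["b_f"] += 20.0
params["b_i"] -= 20.0

h = np.zeros(4)
c = np.array([0.5, -1.0, 2.0, 0.0])
c_start = c.copy()
for x_t in rng.normal(size=(10, 3)):   # a sequence of 10 input vectors
    h, c = lstm_step(x_t, h, c, params)

print("cell state at start:      ", c_start)
print("cell state after 10 steps:", np.round(c, 4))
```

With ordinary biases the gates sit somewhere between 0 and 1, and the cell state blends old and new information at every step.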

GRU Mathematical Formulation:

Let x_t be the input at time t and h_t be the hidden state. As before, W * x and U * h are matrix-vector products, and all other products are element-wise.

Update Gate: z_t = sigmoid(W_z * x_t + U_z * h_(t-1) + b_z)

Reset Gate: r_t = sigmoid(W_r * x_t + U_r * h_(t-1) + b_r)

Hidden State Update: h_t = (1 - z_t) * h_(t-1) + z_t * tanh(W_h * x_t + U_h * (r_t * h_(t-1)) + b_h)

Parameters:

W_z, W_r, W_h: Input weight matrices (applied to x_t) for the update gate, reset gate, and candidate hidden state.

U_z, U_r, U_h: Recurrent weight matrices (applied to h_(t-1)) for the same three transforms.

b_z, b_r, b_h: Bias vectors.
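The effect of the update gate is easiest to see at its two extremes. The following NumPy sketch implements one GRU step with the update rule written above; the helper names, dimensions, and random weights are illustrative assumptions. A very negative update-gate bias drives z_t towards 0 and keeps the old state, while a very positive one drives z_t towards 1 and replaces the state with the candidate.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, p):
    """One GRU step using h_t = (1 - z_t) * h_prev + z_t * h_cand."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    return (1.0 - z_t) * h_prev + z_t * h_cand


def init_params(input_dim, hidden_dim, rng):
    """Small random weights and zero biases (an illustrative choice)."""
    p = {}
    for name in "zrh":
        p[f"W_{name}"] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        p[f"U_{name}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        p[f"b_{name}"] = np.zeros(hidden_dim)
    return p


x_t = np.random.default_rng(0).normal(size=3)
h_prev = np.array([0.2, -0.4, 0.6, 0.0])

keep = init_params(3, 4, np.random.default_rng(1))
keep["b_z"] -= 20.0        # z_t close to 0: keep the previous state
h_keep = gru_step(x_t, h_prev, keep)

overwrite = init_params(3, 4, np.random.default_rng(1))
overwrite["b_z"] += 20.0   # z_t close to 1: replace the state with the candidate
h_new = gru_step(x_t, h_prev, overwrite)

print("previous state:", h_prev)
print("z_t near 0:    ", np.round(h_keep, 4))
print("z_t near 1:    ", np.round(h_new, 4))
```

Note that some references and libraries swap the roles of z_t and (1 - z_t); the sketch follows the convention used in this article.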

Here’s a small mathematical example for an LSTM network:

Example:

Suppose we have an LSTM network with:

Input dimension: 1

Hidden dimension: 2

Output dimension: 2 (the output is taken to be the hidden state, y_t = h_t)

Input at time t (x_t)

x_t = 0.5

Previous Hidden State (h_(t-1)) and Cell State (c_(t-1))

h_(t-1) = [0.2, 0.3]

c_(t-1) = [0.4, 0.5]

Weight Matrices and Bias Vectors

Because the input dimension is 1 and the hidden dimension is 2, each W is a 2x1 matrix and each U is a 2x2 matrix.

W_i = [[0.1], [0.3]]

W_f = [[0.5], [0.7]]

W_c = [[0.9], [1.1]]

W_o = [[1.3], [1.5]]

U_i = [[1.7, 1.8], [1.9, 2.0]]

U_f = [[2.1, 2.2], [2.3, 2.4]]

U_c = [[2.5, 2.6], [2.7, 2.8]]

U_o = [[2.9, 3.0], [3.1, 3.2]]

b_i = [0.1, 0.2]

b_f = [0.3, 0.4]

b_c = [0.5, 0.6]

b_o = [0.7, 0.8]

Calculations

All values below are rounded to three decimals; each step is computed from the unrounded intermediate results.

Input Gate

i_t = sigmoid(W_i * x_t + U_i * h_(t-1) + b_i)

= sigmoid([[0.1], [0.3]] * 0.5 + [[1.7, 1.8], [1.9, 2.0]] * [0.2, 0.3] + [0.1, 0.2])

= sigmoid([0.05 + 0.88 + 0.1, 0.15 + 0.98 + 0.2])

= sigmoid([1.03, 1.33])

= [0.737, 0.791]

Forget Gate

f_t = sigmoid(W_f * x_t + U_f * h_(t-1) + b_f)

= sigmoid([[0.5], [0.7]] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * [0.2, 0.3] + [0.3, 0.4])

= sigmoid([0.25 + 1.08 + 0.3, 0.35 + 1.18 + 0.4])

= sigmoid([1.63, 1.93])

= [0.836, 0.873]

Cell State Update

c_t = f_t * c_(t-1) + i_t * tanh(W_c * x_t + U_c * h_(t-1) + b_c)

= [0.836, 0.873] * [0.4, 0.5] + [0.737, 0.791] * tanh([[0.9], [1.1]] * 0.5 + [[2.5, 2.6], [2.7, 2.8]] * [0.2, 0.3] + [0.5, 0.6])

= [0.334, 0.437] + [0.737, 0.791] * tanh([0.45 + 1.28 + 0.5, 0.55 + 1.38 + 0.6])

= [0.334, 0.437] + [0.737, 0.791] * tanh([2.23, 2.53])

= [0.334, 0.437] + [0.737, 0.791] * [0.977, 0.987]

= [0.334, 0.437] + [0.720, 0.781]

= [1.055, 1.217]

Output Gate

o_t = sigmoid(W_o * x_t + U_o * h_(t-1) + b_o)

= sigmoid([[1.3], [1.5]] * 0.5 + [[2.9, 3.0], [3.1, 3.2]] * [0.2, 0.3] + [0.7, 0.8])

= sigmoid([0.65 + 1.48 + 0.7, 0.75 + 1.58 + 0.8])

= sigmoid([2.83, 3.13])

= [0.944, 0.958]

Hidden State Update

h_t = o_t * tanh(c_t)

= [0.944, 0.958] * tanh([1.055, 1.217])

= [0.944, 0.958] * [0.784, 0.839]

= [0.740, 0.804]

Output

y_t = h_t

= [0.740, 0.804]

This completes the LSTM calculation for one time step.
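Hand calculations like this are easy to get wrong, so it is worth checking one step numerically. The NumPy sketch below uses the example's x_t, h_(t-1), c_(t-1), recurrent weights, and biases. It assumes that each input weight matrix has shape 2x1, which an input dimension of 1 and a hidden dimension of 2 require; the variable names are illustrative.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


x_t = np.array([0.5])            # input dimension 1
h_prev = np.array([0.2, 0.3])    # hidden dimension 2
c_prev = np.array([0.4, 0.5])

# Assumption: input weights are 2x1 columns because the input is a scalar.
W_i = np.array([[0.1], [0.3]])
W_f = np.array([[0.5], [0.7]])
W_c = np.array([[0.9], [1.1]])
W_o = np.array([[1.3], [1.5]])

# Recurrent weights (2x2) and biases.
U_i = np.array([[1.7, 1.8], [1.9, 2.0]])
U_f = np.array([[2.1, 2.2], [2.3, 2.4]])
U_c = np.array([[2.5, 2.6], [2.7, 2.8]])
U_o = np.array([[2.9, 3.0], [3.1, 3.2]])
b_i = np.array([0.1, 0.2])
b_f = np.array([0.3, 0.4])
b_c = np.array([0.5, 0.6])
b_o = np.array([0.7, 0.8])

i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)   # input gate
f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)   # forget gate
g_t = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)   # candidate cell values
c_t = f_t * c_prev + i_t * g_t                  # new cell state
o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)   # output gate
h_t = o_t * np.tanh(c_t)                        # new hidden state

for name, value in [("i_t", i_t), ("f_t", f_t), ("c_t", c_t), ("o_t", o_t), ("h_t", h_t)]:
    print(f"{name} = {np.round(value, 3)}")
```

Because the weights in this toy example are large and positive, every gate ends up well above 0.5; trained networks typically show a wider spread of gate values.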

Here’s a small mathematical example for a GRU (Gated Recurrent Unit) network:

Example:

Suppose we have a GRU network with:

Input dimension: 1

Hidden dimension: 2

Input at time t (x_t)

x_t = 0.5

Previous Hidden State (h_(t-1))

h_(t-1) = [0.2, 0.3]

Weight Matrices and Bias Vectors

Because the input dimension is 1 and the hidden dimension is 2, each W is a 2x1 matrix and each U is a 2x2 matrix.

W_z = [[0.1], [0.3]]

W_r = [[0.5], [0.7]]

W_h = [[0.9], [1.1]]

U_z = [[1.3, 1.4], [1.5, 1.6]]

U_r = [[1.7, 1.8], [1.9, 2.0]]

U_h = [[2.1, 2.2], [2.3, 2.4]]

b_z = [0.1, 0.2]

b_r = [0.3, 0.4]

b_h = [0.5, 0.6]

Calculations

All values below are rounded to three decimals; each step is computed from the unrounded intermediate results.

Update Gate

z_t = sigmoid(W_z * x_t + U_z * h_(t-1) + b_z)

= sigmoid([[0.1], [0.3]] * 0.5 + [[1.3, 1.4], [1.5, 1.6]] * [0.2, 0.3] + [0.1, 0.2])

= sigmoid([0.05 + 0.68 + 0.1, 0.15 + 0.78 + 0.2])

= sigmoid([0.83, 1.13])

= [0.696, 0.756]

Reset Gate

r_t = sigmoid(W_r * x_t + U_r * h_(t-1) + b_r)

= sigmoid([[0.5], [0.7]] * 0.5 + [[1.7, 1.8], [1.9, 2.0]] * [0.2, 0.3] + [0.3, 0.4])

= sigmoid([0.25 + 0.88 + 0.3, 0.35 + 0.98 + 0.4])

= sigmoid([1.43, 1.73])

= [0.807, 0.849]

Candidate Hidden State

h~_t = tanh(W_h * x_t + U_h * (r_t * h_(t-1)) + b_h)

= tanh([[0.9], [1.1]] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * ([0.807, 0.849] * [0.2, 0.3]) + [0.5, 0.6])

= tanh([[0.9], [1.1]] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * [0.161, 0.255] + [0.5, 0.6])

= tanh([0.45 + 0.900 + 0.5, 0.55 + 0.983 + 0.6])

= tanh([1.850, 2.133])

= [0.952, 0.972]

Hidden State

h_t = (1 - z_t) * h_(t-1) + z_t * h~_t

= (1 - [0.696, 0.756]) * [0.2, 0.3] + [0.696, 0.756] * [0.952, 0.972]

= [0.304, 0.244] * [0.2, 0.3] + [0.663, 0.735]

= [0.061, 0.073] + [0.663, 0.735]

= [0.723, 0.808]

This completes the GRU calculation for one time step.
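The same kind of numerical check works for the GRU step. This NumPy sketch reuses the example's x_t, h_(t-1), recurrent weights, and biases, and again assumes 2x1 input weight matrices to match an input dimension of 1; the variable names are illustrative.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


x_t = np.array([0.5])            # input dimension 1
h_prev = np.array([0.2, 0.3])    # hidden dimension 2

# Assumption: input weights are 2x1 columns because the input is a scalar.
W_z = np.array([[0.1], [0.3]])
W_r = np.array([[0.5], [0.7]])
W_h = np.array([[0.9], [1.1]])

# Recurrent weights (2x2) and biases.
U_z = np.array([[1.3, 1.4], [1.5, 1.6]])
U_r = np.array([[1.7, 1.8], [1.9, 2.0]])
U_h = np.array([[2.1, 2.2], [2.3, 2.4]])
b_z = np.array([0.1, 0.2])
b_r = np.array([0.3, 0.4])
b_h = np.array([0.5, 0.6])

z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate
r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate
h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate hidden state
h_t = (1.0 - z_t) * h_prev + z_t * h_cand                  # new hidden state

for name, value in [("z_t", z_t), ("r_t", r_t), ("h_cand", h_cand), ("h_t", h_t)]:
    print(f"{name} = {np.round(value, 3)}")
```

Since z_t is well above 0.5 in both components, the new hidden state sits much closer to the candidate than to the previous state.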

Here are Keras examples of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, each trained to forecast the next value of a noisy sine wave:

LSTM Example

Python

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Generate sample dataset (noisy sine wave)
np.random.seed(0)
time_steps = 500   # long enough that the 20% test split is larger than the window
window_size = 30   # number of past values used to predict the next one
data = np.sin(np.linspace(0, 10 * np.pi, time_steps)) + 0.2 * np.random.normal(0, 1, time_steps)

# Plot original data
plt.figure(figsize=(10, 6))
plt.plot(data)
plt.title('Original Data')
plt.show()

# Split data into training and testing sets (chronological, no shuffling)
train_size = int(0.8 * len(data))
train_data = data[:train_size].reshape(-1, 1)
test_data = data[train_size:].reshape(-1, 1)

# Scale data; fit the scaler on the training split only to avoid leakage
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)

# Split data into X (input windows) and y (the value after each window)
def split_data(data, window_size):
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i + window_size])
        y.append(data[i + window_size])
    return np.array(X), np.array(y)

X_train, y_train = split_data(train_data, window_size)
X_test, y_test = split_data(test_data, window_size)

# Reshape data for LSTM input: (samples, time steps, features)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Build LSTM model (default tanh activation, matching the equations above)
model = Sequential()
model.add(Input(shape=(window_size, 1)))
model.add(LSTM(50, return_sequences=True))
model.add(LSTM(50))
model.add(Dropout(0.2))
model.add(Dense(1))

# Compile model
model.compile(optimizer='adam', loss='mean_squared_error')

# Early stopping callback (monitors val_loss by default)
early_stopping = EarlyStopping(patience=5, min_delta=0.001)

# Train model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])

# Make predictions (still in the scaled range)
predictions = model.predict(X_test)

# Plot predictions
plt.figure(figsize=(10, 6))
plt.plot(y_test, label='Actual')
plt.plot(predictions, label='Predicted')
plt.legend()
plt.title('Predictions')
plt.show()

GRU Example

Python

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, GRU, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Generate sample dataset (noisy sine wave)
np.random.seed(0)
time_steps = 500   # long enough that the 20% test split is larger than the window
window_size = 30   # number of past values used to predict the next one
data = np.sin(np.linspace(0, 10 * np.pi, time_steps)) + 0.2 * np.random.normal(0, 1, time_steps)

# Plot original data
plt.figure(figsize=(10, 6))
plt.plot(data)
plt.title('Original Data')
plt.show()

# Split data into training and testing sets (chronological, no shuffling)
train_size = int(0.8 * len(data))
train_data = data[:train_size].reshape(-1, 1)
test_data = data[train_size:].reshape(-1, 1)

# Scale data; fit the scaler on the training split only to avoid leakage
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)

# Split data into X (input windows) and y (the value after each window)
def split_data(data, window_size):
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i + window_size])
        y.append(data[i + window_size])
    return np.array(X), np.array(y)

X_train, y_train = split_data(train_data, window_size)
X_test, y_test = split_data(test_data, window_size)

# Reshape data for GRU input: (samples, time steps, features)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Build GRU model (default tanh activation, matching the equations above)
model = Sequential()
model.add(Input(shape=(window_size, 1)))
model.add(GRU(50, return_sequences=True))
model.add(GRU(50))
model.add(Dropout(0.2))
model.add(Dense(1))

# Compile model
model.compile(optimizer='adam', loss='mean_squared_error')

# Early stopping callback (monitors val_loss by default)
early_stopping = EarlyStopping(patience=5, min_delta=0.001)

# Train model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])

# Make predictions (still in the scaled range)
predictions = model.predict(X_test)

# Plot predictions
plt.figure(figsize=(10, 6))
plt.plot(y_test, label='Actual')
plt.plot(predictions, label='Predicted')
plt.legend()
plt.title('Predictions')
plt.show()
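Both scripts depend on the same sliding-window split, so it helps to see what it produces on a tiny series. The sketch below uses a hypothetical helper, `make_windows`, that mirrors `split_data`, and a made-up series of ten numbers; it only illustrates the array shapes that the recurrent layers expect.

```python
import numpy as np


def make_windows(series, window_size):
    """Turn an (n, 1) series into inputs of shape (samples, window_size, 1) and next-step targets."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])   # window_size consecutive values
        y.append(series[i + window_size])     # the value right after the window
    return np.array(X), np.array(y)


series = np.arange(10, dtype=float).reshape(-1, 1)   # 0, 1, ..., 9 as a column
X, y = make_windows(series, window_size=3)

print("X shape:", X.shape, "y shape:", y.shape)
print("first window:", X[0].ravel(), "-> target:", y[0])
print("last window: ", X[-1].ravel(), "-> target:", y[-1])

# A series that is no longer than the window yields no samples at all.
X_empty, _ = make_windows(series[:3], window_size=3)
print("samples from a 3-point series:", len(X_empty))
```

The last line shows why the test split has to be longer than the window: a split of 20 points with a window of 30 would produce no test samples.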

Key Differences:

Architecture:

LSTM has three gates (input, forget, and output) and two state vectors (cell state and hidden state).

GRU has two gates (update and reset) and a single state vector (hidden state).

Computational Complexity:

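LSTM is computationally more expensive: it learns four sets of weights (three gates plus the cell candidate) and maintains a separate cell state.

GRU learns three sets of weights (two gates plus the hidden candidate), about 25% fewer parameters for the same hidden size, so it is usually faster to train.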

Performance:

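LSTM is often preferred for tasks with very long-term dependencies, where its separate cell state can help.

GRU often matches LSTM on shorter sequences and smaller datasets. Neither is consistently better, so it is worth trying both on your task.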

Use Cases:

LSTM:

Language modeling

Text generation

Speech recognition

GRU:

Time series forecasting

Speech recognition

Machine translation

These examples demonstrate basic LSTM and GRU architectures. Depending on your specific task, you may need to adjust parameters, add layers, or experiment with different optimizers and loss functions.
