Activation Function in Machine Learning

Dec 25, 2023

In machine learning, activation functions are crucial components of artificial neural networks. They introduce non-linearity into the network, enabling it to learn and represent complex patterns in data. Here’s a breakdown of the concept and examples of common activation functions:

1. What is an Activation Function?

Purpose: Introduces non-linearity into a neural network, allowing it to model complex relationships. Without it, any stack of layers collapses into a single linear transformation, no matter how deep the network is.
Position: Located within each neuron of a neural network, applied to the weighted sum of inputs (plus a bias term) before the output is passed to the next layer.
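
As a rough illustration of where the activation sits, here is a minimal sketch of a single neuron. The function names, the bias term, and all numbers are assumptions chosen for the example, and ReLU stands in for any activation function.

```python
def relu(z):
    """ReLU, used here only as an example activation."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical numbers, chosen only for illustration.
inputs = [1.0, 2.0]
weights = [0.5, -0.75]
bias = 0.25

# The value the activation receives: 0.5*1.0 + (-0.75)*2.0 + 0.25
pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
output = neuron(inputs, weights, bias, relu)
print(pre_activation, output)
```

Swapping `relu` for another function changes only the last step; the weighted sum stays the same.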

2. Common Activation Functions and Examples:

a. Sigmoid:

Output: S-shaped curve between 0 and 1.
Use Cases: Output layer for binary classification; historically also used in the hidden layers of early neural networks.
Example: Predicting if an image contains a cat (output close to 1) or not (output close to 0).

b. Tanh (Hyperbolic Tangent):

Output: S-shaped curve between -1 and 1.
Use Cases: Similar to sigmoid, often preferred for its centred output.
Example: Sentiment analysis, classifying text as positive (close to 1), neutral (around 0), or negative (close to -1).

c. ReLU (Rectified Linear Unit):

Output: 0 for negative inputs, x for positive inputs (x = input value).
Use Cases: Very popular in deep learning, helps mitigate the vanishing gradient problem.
Example: Image recognition, detecting edges and features in images.

d. Leaky ReLU:

Output: Small, non-zero slope for negative inputs, x for positive inputs.
Use Cases: Variation of ReLU, addresses potential “dying ReLU” issue.
Example: Natural language processing, capturing subtle relationships in text.

e. Softmax:

Output: Probability distribution over multiple classes (sums to 1).
Use Cases: Multi-class classification, where it is usually the final layer of the network.
Example: Image classification, assigning probabilities to each possible object in an image.

f. PReLU (Parametric ReLU):

Concept: Like Leaky ReLU, passes positive inputs through unchanged and multiplies negative inputs by a small slope, but the slope (α) is a learnable parameter rather than a fixed constant.
Benefits: Addresses the “dying ReLU” issue, where neurons become inactive because ReLU outputs 0 (with zero gradient) for all negative inputs.
Drawbacks: Increases model complexity due to the additional parameter to learn.
Example: Speech recognition tasks, where capturing subtle variations in audio tones might be crucial.

g. SELU (Scaled Exponential Linear Unit):

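Concept: An exponential linear unit (ELU) multiplied by a fixed scaling factor. The constants are chosen so that activations self-normalize toward zero mean and unit variance, reducing the need for explicit normalization techniques.
Benefits: Can improve gradient flow and convergence speed and helps counter vanishing gradients. The self-normalizing property assumes a matching weight initialization (LeCun normal in the original SELU paper).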
Drawbacks: Slightly more computationally expensive than Leaky ReLU due to the exponential calculation.
Example: Computer vision tasks where consistent and stable activations are important, like image classification or object detection.

h. SoftPlus:

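Concept: A smooth approximation of ReLU based on a log function. It approaches 0 for large negative inputs and approaches x for large positive inputs, avoiding the sharp corner of ReLU at 0.
Benefits: Differentiable everywhere (ReLU is not differentiable at 0), with a non-zero gradient for every input; offers smoother outputs for regression tasks.
Drawbacks: Costs an exponential and a logarithm per activation, never outputs exact zeros (so it does not promote sparsity), and its gradient shrinks toward 0 for large negative inputs.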
Example: Regression tasks where predicting smooth outputs with continuous changes is important, like stock price prediction or demand forecasting.

Formulas for the activation functions above

1. Sigmoid:

Formula: f(x) = 1 / (1 + exp(-x))
Output: S-shaped curve between 0 and 1, with a steep transition around 0.
Use Cases: Early neural networks, binary classification, logistic regression.
Pros: Smooth and differentiable, provides probabilities in binary classification.
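Cons: Saturates for inputs far from 0, so it suffers from vanishing gradients in deeper networks; its output is not zero-centred, and the exponential makes it more expensive to compute than ReLU.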
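
To make the vanishing-gradient point concrete, here is a small sketch of the sigmoid and its derivative using only the standard library. The function names and the two-branch formulation (which avoids overflow in `exp` for large negative inputs) are choices made for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(-x) can overflow, so use exp(x) / (1 + exp(x)) instead.
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The derivative never exceeds 0.25 and is close to 0 away from the origin, which is why gradients shrink as they are multiplied back through many sigmoid layers.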

2. Tanh (Hyperbolic Tangent):

Formula: f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
Output: S-shaped curve between -1 and 1, centered around 0.
Use Cases: Similar to sigmoid, often preferred for its centred output.
Pros: Zero-centred output and a steeper gradient around 0 than sigmoid.
Cons: Saturates at both extremes, so it is still susceptible to vanishing gradients in deep networks; slightly more expensive to compute than ReLU.
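
The formula can be checked against Python's built-in `math.tanh`, and the derivative shows the saturation at the extremes. This is an illustrative sketch; the helper names are assumptions, and the hand-written formula is only suitable for moderate inputs because `exp` overflows for very large ones.

```python
import math

def tanh_from_formula(x):
    """tanh written exactly as in the formula above (moderate x only)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which equals 1 at x = 0."""
    return 1.0 - math.tanh(x) ** 2

for x in (-5.0, 0.0, 0.5, 5.0):
    print(x, tanh_from_formula(x), math.tanh(x), tanh_grad(x))
```

In practice the library function is the safer choice; the hand-written version is here only to mirror the formula.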

3. ReLU (Rectified Linear Unit):

Formula: f(x) = max(0, x)
Output: Clips negative inputs to 0 and passes positive inputs through unchanged.
Use Cases: Popular choice in deep learning, image recognition, and natural language processing.
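Pros: Mitigates the vanishing gradient problem (the gradient is 1 for all positive inputs), is computationally efficient, and promotes sparsity.
Cons: “Dying ReLU” issue: a neuron whose inputs stay negative outputs 0 with zero gradient and stops learning; the function is also not differentiable at exactly 0.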
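
A minimal sketch of ReLU and its gradient follows. The sample values are hypothetical, and treating the gradient at exactly 0 as 0 is a convention assumed here, since the derivative is undefined at that point.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

# Hypothetical pre-activation values for illustration.
pre_activations = [-3.0, -0.5, 0.0, 2.0]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]
print(outputs)
print(grads)
```

Every negative input yields both a zero output and a zero gradient, which is the mechanism behind the “dying ReLU” issue.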

4. Leaky ReLU:

Formula: f(x) = max(α * x, x) for some small α with 0 < α < 1 (e.g., 0.01)
Output: Similar to ReLU, but allows a small positive slope for negative inputs.
Use Cases: Addresses ReLU’s “dying” issue, natural language processing, and audio synthesis.
Pros: Combines benefits of ReLU with slight negative activation, helps prevent dying neurons.
Cons: Introduces another hyperparameter to tune (α), slightly less computationally efficient than ReLU.
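
Here is a short sketch of Leaky ReLU in piecewise form, which matches the max-based formula whenever 0 < α < 1. The default of 0.01 is a common choice assumed for this example, not a required value.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 on the positive side, alpha on the negative side."""
    return 1.0 if x >= 0 else alpha

for x in (-2.0, -0.1, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the gradient on the negative side is α rather than 0, a neuron with negative inputs still receives a small update.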

5. Softmax:

Formula: f_i(x) = exp(x_i) / sum_j exp(x_j), where the sum runs over all classes j
Output: Probability distribution over multiple classes (sums to 1).
Use Cases: Multi-class classification, final layer in multi-class neural networks.
Pros: Provides normalized probabilities for each class, and allows for confidence estimation.
Cons: Sensitive to scale changes in inputs, computationally expensive compared to other options.
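
The sketch below implements softmax over a list of scores and illustrates the sensitivity to input scale. Subtracting the maximum score before exponentiating is a standard numerical-stability step added for this example; it does not change the result. The function name and the scores are assumptions.

```python
import math

def softmax(logits):
    """Softmax over a list of scores, shifted by the max to avoid overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
# Same scores multiplied by 10: the ordering is unchanged, but the
# distribution becomes much more peaked.
probs_scaled = softmax([10.0, 20.0, 30.0])
print(probs)
print(probs_scaled)
```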

6. PReLU (Parametric ReLU):

Formula: f(x) = x if x ≥ 0 else αx (equivalent to max(αx, x) when α ≤ 1)
Explanation:
For x ≥ 0, the output is simply x (same as ReLU).
For x < 0, the output is αx, where α is a learnable parameter that adjusts the slope of negative values.
The parameter α is initialized to a small value (0.25 in the original PReLU paper; some implementations start lower) and learned during training, allowing the model to determine the optimal slope for negative inputs.
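
What makes α learnable is that the output has a gradient with respect to α. The sketch below performs one hand-written gradient-descent step on a squared error for a single input. The starting α of 0.25, the learning rate, the input, and the target are all hypothetical values chosen for illustration.

```python
def prelu(x, alpha):
    """PReLU: x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x

def prelu_grad_alpha(x):
    """Derivative of the output with respect to alpha: x if x < 0, else 0."""
    return x if x < 0 else 0.0

alpha = 0.25           # assumed starting slope
learning_rate = 0.05   # assumed step size
x, target = -2.0, -1.0  # hypothetical training example

prediction = prelu(x, alpha)                        # 0.25 * -2.0
loss = (prediction - target) ** 2
# Chain rule: dLoss/dalpha = 2 * (prediction - target) * dprediction/dalpha
grad_alpha = 2.0 * (prediction - target) * prelu_grad_alpha(x)
new_alpha = alpha - learning_rate * grad_alpha
print(prediction, loss, grad_alpha, new_alpha)
```

Positive inputs contribute nothing to this gradient, so α is shaped only by the examples that land on the negative side.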

7. SELU (Scaled Exponential Linear Unit):

Formula: f(x) = lambda * x if x >= 0 else lambda * alpha * (exp(x) - 1)
Explanation:
For x ≥ 0, the output is lambda * x, where lambda is a scaling factor (usually around 1.0507).
For x < 0, the output is lambda * alpha * (exp(x) - 1), where alpha is a fixed parameter (usually 1.67326).
The scaling and exponential terms help normalize the activations and improve gradient flow, often leading to faster and more stable training.
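
The self-normalizing behaviour can be checked empirically. The sketch below uses the rounded constants quoted above, feeds standard-normal samples through SELU, and prints the mean and variance of the outputs, which should stay close to 0 and 1. The sample size, the seed, and the helper names are assumptions made for the example.

```python
import math
import random

SELU_LAMBDA = 1.0507   # scaling factor quoted above (rounded)
SELU_ALPHA = 1.67326   # fixed parameter quoted above (rounded)

def selu(x, lam=SELU_LAMBDA, alpha=SELU_ALPHA):
    """SELU: lam * x for x >= 0, lam * alpha * (exp(x) - 1) otherwise."""
    if x >= 0:
        return lam * x
    # math.expm1(x) computes exp(x) - 1 accurately for small |x|.
    return lam * alpha * math.expm1(x)

# Feed standard-normal samples through SELU and inspect the output statistics.
rng = random.Random(0)  # fixed seed so the run is repeatable
outputs = [selu(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(outputs) / len(outputs)
variance = sum((y - mean) ** 2 for y in outputs) / len(outputs)
print(f"mean={mean:.3f} variance={variance:.3f}")
```

This checks a single activation applied to ideal inputs; whether a full network stays normalized also depends on the weight initialization and architecture.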

8. SoftPlus:

Formula: f(x) = ln(1 + exp(x))
Explanation:
Approaches 0 for large negative inputs and approaches x for large positive inputs, giving a smooth, continuous curve.
Replaces the sharp corner of ReLU at 0 with a smooth transition.
Its derivative is the sigmoid function, so the gradient is non-zero for every input, which suits tasks where small, continuous variations in the input matter.
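
A direct translation of ln(1 + exp(x)) overflows for large x, so the sketch below uses an algebraically equivalent rearrangement. The function names are assumptions for this example, and the gradient helper relies on the fact that the derivative of SoftPlus is the sigmoid function.

```python
import math

def softplus(x):
    """ln(1 + exp(x)), rearranged as max(x, 0) + ln(1 + exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_grad(x):
    """Derivative of softplus, which is the sigmoid function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

for x in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(x, softplus(x), softplus_grad(x))
```

For large positive inputs the output tracks x itself, and for large negative inputs it flattens toward 0, mirroring the shape of ReLU without the corner.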

Key points to remember:

The choice of activation function significantly impacts a neural network’s performance and training dynamics.
Experimenting with different activation functions and evaluating their performance on your specific task is crucial for finding the best fit.
Consider the problem type, network architecture, desired properties (e.g., smoothness, non-linearity, normalization), and computational cost when selecting an activation function.

Choosing the right activation function among these options depends on your specific needs. Consider factors like:

Problem type: Is it classification, regression, or something else?
Network architecture: How deep is the network, and what other activation functions are used?
Performance considerations: Do you prioritize faster training or better accuracy?

Treat the activation function as a hyperparameter: shortlist a few candidates using the factors above, then compare them on your own dataset before committing to one.

Written by Think Different - Dhiraj Patra