Automatic differentiation 📈

Derivatives, specifically gradients (derivatives in spaces of more than one dimension) and Hessians (second derivatives), have become a fundamental tool of machine learning. The gradient is a vector that points in the direction of steepest ascent of a function at the evaluated point. This is important for moving around the function space to find relative minima or maxima (to minimize or maximize the function). The Hessian gives information about the concavity and convexity of the function; some algorithms use it to improve the exploration of the function space and to find minima and maxima faster. Automatic Differentiation (AD) is a procedure for computing derivatives of numerical functions, represented as computer programs, both efficiently and accurately. As already mentioned, derivatives are very important for model optimization: from a mathematical point of view a model is nothing more than a function that you want to minimize or maximize, and derivatives are the tool to do it.
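
For reference, for a function $f : \mathbb{R}^n \to \mathbb{R}$ evaluated at a point $x$, these two objects are defined as:

$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right), \qquad H_f(x)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$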

There are several methods for computing derivatives with a computer:

  • Numerical differentiation. This method uses the definition of the derivative to approximate it from samples of the original function (see the sketch right after this list). In this way we can approximate each component of the gradient $\nabla f$ as:
    $\frac{\partial f(x)}{\partial x_i} \approx \frac{f(x + h e_i) - f(x)}{h}$
    where $e_i$ is the $i$-th unit vector and $h > 0$ is the step size of the approximation.
  • Symbolic differentiation. It consists of the automatic manipulation of mathematical expressions to obtain their derivatives (similar to what we did at school), and it requires implementing the derivation rules. The problem with this kind of differentiation is that it can produce long symbolic expressions which are expensive to evaluate.
  • Automatic Differentiation. It is based on the fact that every function can be decomposed into a finite number of elementary operations whose derivatives are known. By combining these derivatives, the derivative of the original function can be computed: applying the chain rule to each elementary operation yields the trace used to calculate the derivative of the whole function.
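
As a minimal illustration of the first two methods (the helper numeric_diff and the test functions are just examples for this post), compare a finite-difference approximation with SymPy's exact symbolic derivative:

import sympy as sp

# Numerical differentiation: forward finite difference (an approximation)
def numeric_diff(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

f = lambda x: x**3 + 2*x            # true derivative: f'(x) = 3x^2 + 2
print(numeric_diff(f, 2.0))         # ~14.000006, close to f'(2) = 14

# Symbolic differentiation: exact, but expressions can grow quickly
x = sp.Symbol('x')
expr = sp.sin(x**2) * sp.exp(x)
print(sp.diff(expr, x))             # 2*x*exp(x)*cos(x**2) + exp(x)*sin(x**2)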

The following image shows the differences between the three methods.

Derivative strategies
Methodologies for calculating derivatives.

There are two modes of Automatic Differentiation (AD): forward mode and reverse mode. On the one hand, forward mode evaluates the parts of the function forward and, at the same time, propagates the derivative of each part until the derivative of the whole function is obtained. On the other hand, reverse mode first evaluates the parts of the function forward and then, starting from the derivative of the output, works backwards to obtain the partial derivatives. This is how the backpropagation method works in neural networks, which needs the partial derivatives to update the weights of each of the neural network layers. Reverse mode reuses computations that have already been performed, which makes it possible to calculate derivatives very efficiently. A minimal forward-mode sketch is shown after the schemas below.

Automatic differentiation forward mode
AD forward mode schema.
Automatic differentiation reverse mode
AD reverse mode schema.
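
To make forward mode concrete, here is a minimal sketch (a toy illustration, not how Tensorflow or Autograd actually implement it) based on dual numbers: each value carries its derivative along with it, and every elementary operation applies its differentiation rule.

class Dual:
    """Dual number: 'value' is f(x), 'deriv' carries df/dx forward."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other, 0.0)
        # Product rule: (u * v)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def f(x):
    return 3 * x * x + 2 * x + 1  # f'(x) = 6x + 2


# Seed the input derivative with 1 and evaluate forward
x = Dual(2.0, 1.0)
y = f(x)
print(y.value, y.deriv)  # 17.0 14.0 -> f(2) = 17, f'(2) = 14

Reverse mode, the variant used by Tensorflow and Autograd in the examples below, instead records the operations during the forward evaluation and then propagates derivatives backwards from the output, which is cheaper when the function has many inputs and a single output (the typical case of a cost function).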

Derivative tools

In this post we will focus on optimizing function parameters using Automatic Differentiation. Some software packages to calculate derivatives and gradients are listed below:

Tensorflow

It is an open source library developed by Google for numerical computation using data flow graphs. Before executing a program, Tensorflow builds a flow graph where nodes represent mathematical operations and edges represent the multidimensional data arrays, also called tensors, that flow between them. Building this graph allows Tensorflow to get the most out of the CPUs and GPUs of the system where the program is executed: completely transparently to the programmer, Tensorflow parallelizes everything it can across the resources at its disposal.

This library was originally designed for deep learning, the machine learning branch that studies neural networks. Tensorflow makes it easy to implement Deep Neural Networks (DNN), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). However, recent versions have focused on satisfying the rest of the machine learning community, trying to turn the library into a standard for programming models of all branches. Specifically, they have developed a module called TFLearn which offers a set of ready-to-use models, and they have updated its syntax with the intention of bringing it closer to that of Scikit-learn, one of the most popular and important machine learning libraries.

One of the most interesting aspects of this library is that it implements reverse-mode AD in a very elegant way. The coder defines a model, indicating its parameters as variables, and practically automatically, after specifying the optimization algorithm, Tensorflow takes charge of calculating the gradients and applying them in the optimization procedure.
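
For instance, with the same graph-based (TF1-style) API used in the examples below, tf.gradients asks Tensorflow for the gradient of one node of the graph with respect to another; a minimal sketch:

import tensorflow as tf

x = tf.placeholder('float')
w = tf.Variable(2.0)
y = w * x * x  # y = w * x^2, so dy/dw = x^2

# Tensorflow adds the gradient computation to the graph
grad_w = tf.gradients(y, [w])[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_w, feed_dict={x: 3.0}))  # 9.0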

Usage examples

Below is the code to optimize the parameters of a linear regression model with Tensorflow and with Autograd (both use reverse-mode AD to get the gradients). A linear regression model is defined by the equation:

$y = w \cdot x + b$

where $w$ represents the weight and $b$ the bias. AD will find values for these parameters that minimize the Mean Squared Error (MSE).
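
For reference, the cost function and the gradients that AD computes behind the scenes in this example (dividing by $2n$, as in the code below) are:

$\mathrm{MSE}(w, b) = \frac{1}{2n} \sum_{i=1}^{n} (w x_i + b - y_i)^2$

$\frac{\partial \mathrm{MSE}}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)\, x_i, \qquad \frac{\partial \mathrm{MSE}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)$

and gradient descent updates the parameters with a learning rate $\eta$: $w \leftarrow w - \eta \frac{\partial \mathrm{MSE}}{\partial w}$, $b \leftarrow b - \eta \frac{\partial \mathrm{MSE}}{\partial b}$.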

The model is defined in code in the following way. In Tensorflow the parameters of a function to be optimized are defined as variables. Then a cost function based on these parameters is defined, the Mean Squared Error, and the optimization algorithm is specified, in this case Gradient Descent. Finally we write the code to train the model (last lines). In each iteration this loop takes a sample from the dataset and differentiates the cost function to obtain the direction (gradient vector) towards a local minimum, in other words, the direction that reduces the Mean Squared Error. With this gradient vector the weight and bias parameters are updated (transparently to the programmer). In this way, after a sufficient number of iterations, values for the parameters that minimize the cost function will have been obtained (a local minimum will have been found).

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

rng = np.random

# Parameters
learning_rate = 0.01
training_epochs = 100

# Training data
train_X = np.asarray([3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182, 7.59,
	  2.167, 7.042, 10.791, 5.313, 7.997, 5.654, 9.27, 3.1])
train_Y = np.asarray([1.7, 2.76, 2.09, 3.19, 1.694, 1.573, 3.366, 2.596, 2.53,
	  1.221, 2.827, 3.465, 1.65, 2.904, 2.42, 2.94, 1.3])
n_samples = train_X.shape[0]

# Graph input data
X = tf.placeholder('float')
Y = tf.placeholder('float')

# Optimizable parameters with random initialization
weight = tf.Variable(rng.randn(), name='weight')
bias = tf.Variable(rng.randn(), name='bias')

# Linear model
predictions = (X * weight) + bias

# Loss function: Mean Squared Error
loss = tf.reduce_sum(tf.pow(predictions-Y, 2))/(2*n_samples)

# Gradient descent optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            # Each run computes the gradients and updates weight and bias
            sess.run(optimizer, feed_dict={X: x, Y: y})
    train_error = sess.run(loss, feed_dict={X: train_X, Y: train_Y})
    print('Train error={}'.format(train_error))

    # Test error
    test_X = np.asarray([6.83, 4.668, 8.9, 7.91, 5.7, 8.7, 3.1, 2.1])
    test_Y = np.asarray([1.84, 2.273, 3.2, 2.831, 2.92, 3.24, 1.35, 1.03])
    test_error = sess.run(
        tf.reduce_sum(tf.pow(predictions - Y, 2)) / (2 * test_X.shape[0]),
        feed_dict={X: test_X, Y: test_Y})
    print('Test error={}'.format(test_error))

    print('Weight={} Bias={}'.format(sess.run(weight), sess.run(bias)))

    # Graphic display
    plt.plot(train_X, train_Y, 'ro', label='Original data')
    plt.plot(train_X, sess.run(weight) * train_X + sess.run(bias),
             label='Fitted line')
    plt.legend()
    plt.show()
Learning weight and bias parameters of a linear regression model with Tensorflow.
Linear regression with Tensorflow
Optimization results of the parameters of a linear regression model using Tensorflow.

With Autograd everything is more explicit than in Tensorflow. A cost function is defined over the model parameters, and in each iteration the gradients are obtained and used to update the weight and bias parameters.

import autograd.numpy as np
import matplotlib.pyplot as plt
from autograd import grad

rng = np.random

# Parameters
learning_rate = 0.01
training_epochs = 100

# Training data
train_X = np.array([3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182, 7.59,
	2.167, 7.042, 10.791, 5.313, 7.997, 5.654, 9.27, 3.1])
train_Y = np.array([1.7, 2.76, 2.09, 3.19, 1.694, 1.573, 3.366, 2.596, 2.53,
	1.221, 2.827, 3.465, 1.65, 2.904, 2.42, 2.94, 1.3])
n_samples = train_X.shape[0]


def loss(params):
    """ Loss function: Mean Squared Error """
    weight, bias = params
    predictions = (train_X * weight) + bias
    return np.sum(np.power(predictions - train_Y, 2) / (2 * n_samples))

# Function that returns the gradient of the loss function
gradient_fun = grad(loss)

# Optimizable parameters with random initialization
weight = rng.randn()
bias = rng.randn()

for epoch in range(training_epochs):
    # Compute the gradients of the loss and take a gradient descent step
    gradients = gradient_fun((weight, bias))
    weight -= gradients[0] * learning_rate
    bias -= gradients[1] * learning_rate
print('Train error={}'.format(loss((weight, bias))))

# Test error
test_X = np.array([6.83, 4.668, 8.9, 7.91, 5.7, 8.7, 3.1, 2.1])
test_Y = np.array([1.84, 2.273, 3.2, 2.831, 2.92, 3.24, 1.35, 1.03])
predictions = (test_X * weight) + bias
print('Test error={}'.format(
    np.sum(np.power(predictions - test_Y, 2) / (2 * test_X.shape[0]))))

print('Weight={} Bias={}'.format(weight, bias))

# Graphic display
plt.plot(train_X, train_Y, 'ro', label='Original data')
plt.plot(train_X, weight * train_X + bias, label='Fitted line')
plt.legend()
plt.show()
Learning weight and bias parameters of a linear regression model with Autograd.
Linear regression with Autograd
Optimization results of the parameters of a linear regression model using Autograd.

The main objective of this post was to open up a bit the black box behind the optimization of models with tools such as Tensorflow, Theano, PyTorch, etc.

References

  • Automatic differentiation in machine learning: a survey
    Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind
