
Deep Learning with Python: Book Notes

June 19, 2020



First things first: this book is highly recommended for deep learning enthusiasts!

Ch 1. Why deep learning ?

  • Deep learning methods have multiple representation layers, whereas shallow networks and classical ML algorithms have a single representation layer [a big limitation]. This leads to superior performance over ML algorithms when more data is available.
  • It doesn't require feature engineering. Deep learning methods learn features by themselves, which makes them easy to apply.
  • It supports techniques such as transfer learning and fine-tuning, which makes it useful for problems where data is scarce.

Deep learning is the future! One popular way to perform deep learning is through neural networks.

Ch 3. Neural Networks

This chapter provides definitions of basic deep learning (DL) terms :

  • Neural Network Layers: The building blocks of DL. Each layer computes a new representation of the data using a different mathematical function. We use different layers for different use cases: convolution layers are efficient representations for images, while dense layers work better for plain numeric data.
  • Loss Functions: Functions that measure how wrong our predictions are, e.g. binary cross-entropy, categorical cross-entropy, sparse categorical cross-entropy, MAE. We pick one depending on the use case.
  • Optimization Algorithms: Once we know the error, we need a general mechanism to efficiently reduce it, e.g. Adam, RMSprop.
  • Activation Functions: These introduce non-linearity into the representations a layer can learn. Why activation functions? Without them, each layer would compute only a linear transformation, so the network could not capture complex relationships.
    e.g.: relu, tanh (see the Keras sketch after the side note below)

Side note : Mini-batch or batch — A small set of samples (typically between 8 and 128) that are processed simultaneously by the model. The number of samples is often a power of 2, to facilitate memory allocation on GPU.
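
A minimal Keras sketch tying the terms above together: layers with activations, a loss function, an optimizer, and a mini-batch size passed to fit(). The data shapes and layer sizes here are made-up toy values for illustration.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    # Hypothetical toy data: 1000 samples, 20 numeric features, binary labels.
    x_train = np.random.random((1000, 20))
    y_train = np.random.randint(0, 2, size=(1000,))

    model = keras.Sequential([
        keras.Input(shape=(20,)),
        layers.Dense(64, activation="relu"),      # dense layer + relu activation
        layers.Dense(1, activation="sigmoid"),    # sigmoid output for a binary target
    ])

    model.compile(
        optimizer="rmsprop",                      # optimization algorithm
        loss="binary_crossentropy",               # loss function for binary targets
        metrics=["accuracy"],
    )

    model.fit(x_train, y_train, epochs=5, batch_size=32)   # mini-batches of 32 (a power of 2)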

Ch 4: ML General Guidelines

This book talks about a lot of good practices and general guidelines for approaching ML problems. I found it informative even after having developed numerous models.

First, verify that the given data is sufficient and informative enough to solve the given problem:

  • first develop a model better than a basic statistical model

  • make sure the model beats a naive baseline

Ideal capacity of network : First start simple and small, then keep adding layers, and finally, when the model starts overfitting, regularize it. Basically, first underfit, then overfit, then regularize.

Dropout : Introduces noise by randomly dropping units during training, so the network never trusts a single neuron too much; the units keep rotating.
The importance of having sufficiently large intermediate layers.
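
A minimal sketch of dropout in Keras (layer sizes are assumptions): the Dropout layer randomly zeroes a fraction of the previous layer's activations during training, which acts like injected noise and stops the network from relying on any single unit.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(20,)),
        layers.Dense(128, activation="relu"),     # a sufficiently large intermediate layer
        layers.Dropout(0.5),                      # drop 50% of activations, at training time only
        layers.Dense(1, activation="sigmoid"),
    ])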

You should never use in your workflow any quantity computed on the test data, even for something as simple as data normalization.

Importance of test set and validation set

You may ask, why not have two sets: a training set and a test set? 
You’d train on the training data and evaluate on the test data. 
Much simpler!

The reason is that developing a model always involves tuning its configuration:
for example, choosing the number of layers or the size of the layers 
(called the hyperparameters of the model, to distinguish them from the parameters, 
which are the network’s weights). You do this tuning by using as a feedback signal 
the performance of the model on the validation data. In essence, this tuning is a 
form of learning: a search for a good configuration in some parameter space. 
As a result, tuning the configuration of the model based on its performance on the 
validation set can quickly result in overfitting to the validation set, even though
your model is never directly trained on it.

It also discusses topics like :

  • Correct CV scheme - k-fold, and iterated k-fold for small datasets
  • Correct splitting of data - stratified split, temporal split
  • Checking for redundancy in the data (no duplicate samples leaking between train and test splits)
  • Normalization and scaling of the data, which is important for neural networks
  • Regularization techniques - Dropout, L1 regularization, L2 regularization
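
A minimal sketch of the normalization point (array shapes are made up): compute the mean and standard deviation on the training data only, then reuse those statistics for the test data, in line with the quote above about never using quantities computed on the test set.

    import numpy as np

    # Hypothetical feature arrays.
    x_train = np.random.random((1000, 20))
    x_test = np.random.random((200, 20))

    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)

    x_train = (x_train - mean) / std
    x_test = (x_test - mean) / std    # reuse the training-set statistics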

The most common ways to prevent overfitting in neural networks:

  • Get more training data.
  • Reduce the capacity of the network.
  • Add weight regularization.
  • Add dropout.
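
A minimal sketch of weight regularization (sizes are assumptions): an L2 penalty on the layer weights plus dropout, two of the overfitting remedies listed above.

    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    model = keras.Sequential([
        keras.Input(shape=(20,)),
        layers.Dense(16, activation="relu",
                     kernel_regularizer=regularizers.l2(0.001)),  # penalize large weights
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
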
Developing a model that does better than a baseline
Your goal at this stage is to achieve statistical power: 
that is, to develop a small model that is capable of beating a dumb baseline.
Once you’ve developed a satisfactory model configuration, you can train your final
production model on all the available data (training and validation) and evaluate it
one last time on the test set. If it turns out that performance on the 
test set is significantly worse than the performance measured on 
the validation data, this may mean either that your validation procedure
wasn’t reliable after all,  or that you began overfitting to the validation 
data while tuning  the parameters of the model. In this case, you may want to switch to
a more reliable evaluation protocol (such as iterated K-fold validation). 

Chapter 5. Convnet

Convnets are much better than plain dense networks for images because the convolution operation is translation invariant: a spatial pattern learned in one place in an image can be recognized anywhere else in it.

  • Convolution Operation - A set of filters that detect patterns in an image. So basically, it represents the image in different ways and finds the representation that best solves the problem at hand.

  • Why max pooling - It aggressively downsamples the feature maps, which reduces the number of parameters to train and lets successive conv layers look at progressively larger parts of the image. Without it we would need many more conv layers and the number of parameters would be humongous [other techniques like strided convolutions and average pooling don't work as well in practice].

  • Low-level features are learned in the first layers; later layers store higher-level, more abstract features related to the problem you are solving.
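
A minimal sketch of a small convnet (input size and filter counts are assumptions): stacked Conv2D + MaxPooling2D blocks, where early layers pick up low-level patterns and later layers build more abstract features.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(150, 150, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),   # filters detecting local patterns
        layers.MaxPooling2D((2, 2)),                    # downsample the feature maps
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])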

Techniques for training convnets on small datasets :

  • Data augmentation
  • Feature extraction - reuse the representations (pretrained weights) of a pretrained convnet
  • Fine-tuning is a great application - two flavours : train only the new dense layers, or also unfreeze and train the last conv layers
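
A minimal sketch of feature extraction and fine-tuning with a pretrained VGG16 base (image size and dense-layer sizes are assumptions): first freeze the pretrained weights and train only the new dense layers, then unfreeze just the last conv block for fine-tuning.

    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.applications import VGG16

    conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(150, 150, 3))
    conv_base.trainable = False          # feature extraction: reuse the pretrained weights as-is

    model = keras.Sequential([
        conv_base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

    # Fine-tuning flavour: unfreeze only the last conv block of the base
    # (recompile with a low learning rate before training further).
    conv_base.trainable = True
    for layer in conv_base.layers:
        if not layer.name.startswith("block5"):
            layer.trainable = False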

Visualizing convnets :

  • Visualize intermediate layer activations
  • Visualize filters in each convnet layer
  • Visualize Class activation heatmap
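
A minimal sketch of the first technique, visualizing intermediate activations (the toy convnet and random input stand in for a trained model and a real image): build a second model that returns every layer's feature maps for a given input.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    # Toy convnet and a random "image" batch, for illustration only.
    model = keras.Sequential([
        keras.Input(shape=(64, 64, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
    ])
    x = np.random.random((1, 64, 64, 3))

    # A model that outputs the feature maps of every layer.
    activation_model = keras.Model(inputs=model.input,
                                   outputs=[layer.output for layer in model.layers])
    activations = activation_model.predict(x)    # one activation tensor per layer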

Chapter 6. Sequence Networks

The first part discusses two methods to process text data, namely one-hot encoding & embeddings [the hashing trick is another one referenced]. One-hot encoding is memory intensive.

This chapter talks about word embeddings in a bit of depth. Why they are useful :

  • They are not naive representations
  • Vector operations on embeddings are meaningful
  • They have a spatial meaning as well, e.g. the vector v = king - queen is very similar to v' = man - woman. Example embeddings : GloVe, Word2Vec.
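
A minimal sketch of an Embedding layer (vocabulary size and dimensions are assumptions): it maps integer word indices to dense vectors that are learned together with the rest of the model.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Embedding(input_dim=10000, output_dim=64),  # 10,000-word vocabulary, 64-d vectors
        layers.GlobalAveragePooling1D(),                    # average the word vectors of a sample
        layers.Dense(1, activation="sigmoid"),
    ])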

RNN : A type of neural layer which is useful in processing sequential data.

Bidirectional RNN : Another flavour; it processes the sequence both forwards and backwards to generate its output.

Simple RNNs and their bidirectional versions have the problem of vanishing gradients [the learning signal from timesteps far in the past fades as it passes through the many steps in between]. Basically, they don't work effectively on long sequences.
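
A minimal sketch (sizes are assumptions) of a simple RNN and its bidirectional wrapper, which reads the sequence both forwards and backwards:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Embedding(10000, 32),
        layers.Bidirectional(layers.SimpleRNN(32)),  # forward and backward passes over the sequence
        layers.Dense(1, activation="sigmoid"),
    ])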

1D conv : More efficient than an RNN, and competitive on tasks where the global order of the input doesn't matter much, since convolutions only look at local patterns.

LSTM - A flavour of RNN.

  • Has an additional carry line (cell state) that carries past state information forward.
  • Works better when a long context of the input matters, e.g. in tasks like translation; it isn't as useful when the recent past is the best indicator, as in some forecasting problems.
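
A minimal sketch (sizes are assumptions) with an LSTM layer in place of a SimpleRNN; its carry (cell) state is what lets it keep information from much earlier in the sequence.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Embedding(10000, 32),
        layers.LSTM(32),                     # carry state mitigates the vanishing-gradient problem
        layers.Dense(1, activation="sigmoid"),
    ])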

A good blog on lstm : http://blog.echen.me/2017/05/30/exploring-lstms/

GRU - Another flavour of RNN.

Advanced uses of rnn

To be continued..

Great Insight from the book :

Because RNNs are extremely expensive for processing very long sequences, but 1D convnets are cheap, it can be a good idea to use a 1D convnet as a preprocessing step before an RNN, shortening the sequence and extracting useful representations for the RNN to process.

One strategy to combine the speed and lightness of convnets with the order-sensitivity
of RNNs is to use a 1D convnet as a preprocessing step before an RNN (see figure 6.30).
This is especially beneficial when you’re dealing with sequences that are so long they can’t
realistically be processed with RNNs, such as sequences with thousands of steps.
The convnet will turn the long input sequence into much shorter (downsampled) sequences of
higher-level features. This sequence of extracted features then becomes the input to
the RNN part of the network. This technique isn’t seen often in research papers and 
practical applications, possibly because it isn’t well known
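
A minimal sketch of this combination (feature counts and layer sizes are assumptions): 1D conv and pooling layers shorten the long sequence before a recurrent layer processes it.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(None, 10)),             # long sequences, 10 features per timestep
        layers.Conv1D(32, 5, activation="relu"),   # extract local features
        layers.MaxPooling1D(3),                    # downsample / shorten the sequence
        layers.Conv1D(32, 5, activation="relu"),
        layers.GRU(32),                            # RNN over the much shorter sequence
        layers.Dense(1),
    ])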

Ch 7 : Advanced deep learning techniques

Introduction to the functional API

In the functional API, you directly manipulate tensors, and you use layers as functions that take tensors and return tensors (hence, the name functional API):
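
A minimal sketch of the functional API (layer sizes are assumptions): an Input tensor is passed through layers called as functions, and the resulting tensors are wrapped into a Model.

    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(64,))                      # a symbolic input tensor
    x = layers.Dense(32, activation="relu")(inputs)        # a layer used as a function on a tensor
    outputs = layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs, outputs)                   # ties input and output tensors together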

To be continued..

Ch 8: Future of Deep Learning

As a machine-learning practitioner, always be mindful of this,
and never fall into the trap of believing that neural networks
understand the task they perform—they don’t, at least not in a 
way that would make sense to us. They were trained on a different, 
far narrower task than the one we wanted to teach them: that of 
mapping training inputs to training targets, point by point. 
Show them anything that deviates from their training data, and 
they will break in absurd ways. 

To be continued..

