Training Image Classification/Recognition models based on Deep Learning & Transfer Learning with ML.NET


Context and background for ‘Image Classification’, ‘training vs. scoring’ and ML.NET

Image Classification, Object Detection and Text Analysis are probably the most common tasks in Deep Learning, which is a subset of Machine Learning.
However, this blog post focuses only on Image Classification/Recognition and the multiple approaches you can take in ML.NET to train a Deep Learning model for it.

Run/score a pre-trained model vs. train a custom model

Before getting into the specific subject of this blog post, which is “training a model”, I also want to highlight that in ML.NET you can do the simplest thing, which is to run/score an already pre-trained deep learning model just to make predictions. Those pre-trained models (also called ‘architectures’) are the culmination of many deep neural network (DNN) architecture ideas developed by multiple researchers over the years, and they are usually trained on very large datasets with many millions of images (such as the ImageNet dataset). That kind of large-scale training would require more specialized resources than most developers, or even most organizations, have available.

You can see a list of the most common pre-trained models (such as Inception v3, ResNet v2 101, YOLO, etc.) at http://modelzoo.co and, in particular for computer vision (Image Classification and Object Detection), here: https://modelzoo.co/category/computer-vision

Here’s a summary of existing architectures (pre-trained models):

Those pre-trained models are implemented and trained on a particular deep learning framework/library such as TensorFlow, PyTorch, Caffe, etc., and might also be exported to the ONNX format (a standard model format across frameworks).

As of today, ML.NET supports TensorFlow and ONNX, while PyTorch is on our long-term roadmap.

Therefore, the simplest approach you can take with any of those pre-trained models is to simply use them to make predictions, in this case, to classify or identify images, such as in the following illustration:
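
For illustration only (remember, this scoring scenario is not the focus of this post), here’s a minimal sketch of what scoring a pre-trained ONNX model can look like in ML.NET using the ApplyOnnxModel transform from the Microsoft.ML.OnnxTransformer package. The model file name and tensor names below are placeholders; every pre-trained model defines its own tensor names and expected image pre-processing:

    using Microsoft.ML;

    var mlContext = new MLContext();

    // "model.onnx", "input" and "output" are placeholders that depend on
    // the specific pre-trained model you downloaded.
    var scoringPipeline = mlContext.Transforms.ApplyOnnxModel(
        modelFile: "model.onnx",
        outputColumnNames: new[] { "output" },
        inputColumnNames: new[] { "input" });

You would then Fit() that pipeline over an IDataView whose “input” column contains the image pixels pre-processed the way the model expects, and read the predicted scores from the “output” column.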

You can see some ML.NET sample apps scoring/running pre-trained TensorFlow or ONNX models here:

However, as mentioned, that scenario (simply scoring/running a pre-trained DNN model) and those samples are NOT the goal of this blog post.

The goal for this blog post is to explain how you can train your own custom Deep Learning model with ML.NET for the Image Classification task in particular.

Why would you want to train your own custom model?

Making predictions with the previously mentioned pre-trained models can be enough if your scenario is very generic. For instance, if you want to recognize/classify a photo as a ‘person’, a ‘cat’, a ‘dog’ or a ‘flower’, then some of those pre-trained models will be enough. But what if you have your own business domain with its own, more particular image classes (for instance, being able to differentiate between different types of flowers or different types of dogs)? And going even further, what if you want to recognize your own entities or objects (such as very specific industrial objects, which are not generic objects)? For that, you will need to train a custom model with your own images and classify across your own image classes.

For instance, you might want to create your own custom image classifier model with your own images so instead of identifying a photo as “a flower” it’d be able to classify across multiple flower types.

Image classifier scenario – Train your own custom deep learning model with ML.NET

 

Possible ways of training an Image Classifier model in ML.NET

Currently (mid-2019), there are three possible ways to train an Image Classifier model in ML.NET:

  1. Native Deep Learning model training (TensorFlow) for Image Classification (easy-to-use high-level API – in Preview)
  2. Model composition of a pre-trained TensorFlow model working as image featurizer plus an ML.NET trainer as the model’s algorithm
  3. Model composition of a pre-trained ONNX model working as image featurizer plus an ML.NET trainer as the model’s algorithm

As highlighted above, the first approach is the easiest to use and the one we’re currently investing in the most, although as of today it is in Preview state. The other two approaches are also possible, and I will explain them in this blog post as well, but I want to highlight that the first approach is not only the simplest to use but also the most flexible and powerful of the three, for the reasons explained below.

A. Native Deep Learning model training (TensorFlow) for Image Classification in ML.NET

Even though this is the newest implementation we’re currently doing in ML.NET (as of Sept. 2019, in Preview), I’m covering it first: if you don’t read the whole blog post, at least read this part, because this approach is the most flexible and powerful of the three listed above and it’ll be our long-term approach. Therefore, it is also the recommended path for anyone using ML.NET.

The internal architecture stack

In order to use TensorFlow, ML.NET internally takes a dependency on the Tensorflow.NET library.

The Tensorflow.NET library is an open source, low-level API library that provides the .NET Standard bindings for TensorFlow. That library is part of the SciSharp stack of libraries.

Microsoft (the ML.NET team) is working closely with the TensorFlow.NET library team, not just to provide higher-level APIs for ML.NET users (such as our new ImageClassification API) but also to help improve and evolve the Tensorflow.NET library as an open source project.

The stack diagram below shows how ML.NET implements these new DNN training features:

What’s highlighted in yellow is precisely this ‘Image Classification’ feature, which we first released in ML.NET 1.4-Preview and will keep evolving by adding DNN architectures in addition to Inception v3 and ResNet v2 101. In upcoming releases, we’ll also add Object Detection model training support, meaning native training (transfer learning) with TensorFlow.

Easy to use high level API

We aim to provide an easy-to-use, high-level API that is also ‘task oriented’, meaning that each API targets a different task, such as Image Classification or Object Detection, instead of being a more complex API that could train any kind of deep learning model.

As a comparison, a code example doing transfer learning directly with TensorFlow.NET needs hundreds of lines of code, whereas our high-level API in ML.NET needs only a couple of lines, and we’ll simplify it even further in regard to hyper-parameter and architecture selection:
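
As a rough sketch of what “a couple of lines” means (the full pipeline is shown later in this post; this is the ML.NET 1.4-Preview API shape, which may still evolve while in Preview):

    var pipeline = mlContext.Transforms.Conversion
            .MapValueToKey(outputColumnName: "LabelAsKey", inputColumnName: "Label")
        .Append(mlContext.Model.ImageClassification("ImagePath", "LabelAsKey",
                    arch: ImageClassificationEstimator.Architecture.InceptionV3));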

Note, however, that ML.NET uses TensorFlow.NET under the covers as the low level .NET bindings for TensorFlow.

 

Based on Transfer Learning

Deriving from pre-trained models (DNN architectures) when doing Transfer Learning

As previously mentioned, full training from scratch of deep learning models is hard and expensive.

Specifically for predictive image classification with images as input, there are publicly available base pre-trained models (also called DNN architectures) under permissive licenses for reuse, such as Google Inception v3, NASNet, Microsoft ResNet v2 101, etc., which took a lot of effort from those organizations to implement and train.

These models can be downloaded and incorporated directly into new models that expect image data as input. This technique is named ‘Transfer Learning‘: it allows you to take a model pre-trained on images comparable to the custom images you want to use, and reuse that pre-trained model’s “knowledge” for your new custom deep learning model trained on your new images, as illustrated in the following image:

Here’s the definition of ‘transfer learning’ at Wikipedia:

Transfer learning is a machine learning method where a model developed for an original task is reused as the starting point for a model on a second different but related task. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.

 

Benefits of Native DNN Transfer Learning in ML.NET

The main benefit provided by the ‘Transfer Learning’ approach is:

Full optimization power within the DNN framework: because Transfer Learning happens within the TensorFlow DNN model, the ML.NET team will be able to optimize the retraining process with many improvements, such as retraining one or more layers within the DNN graph, plus any other tuning within the TensorFlow graph.

Here’s a simplified diagram of how transfer learning happens under the covers when using the ML.NET ImageClassification estimator. These graph diagrams are simplified versions taken from screenshots in the Netron tool after opening the serialized TensorFlow .pb models:

Benefit 1: Simple API encapsulating DNN transfer learning

The first main benefit of this new ImageClassification API in ML.NET is simplicity.

It is not just a scenario-oriented API for image classification/recognition. We are pushing the limits: we basically shrank hundreds of lines of code using the TensorFlow.NET bindings for C# into a very simple, easy-to-use API for Image Classification, meaning that in a couple of lines you can implement your model training, which internally performs a native TensorFlow training, as illustrated in the following diagram:

Benefit 2: Trains natively on TensorFlow, producing a TensorFlow frozen graph/model (.pb) in addition to an ML.NET model

Flexibility and performance: Since ML.NET internally retrains natively on TensorFlow layers, the ML.NET team will be able to optimize further and take multiple approaches, like training on the last layer or training on multiple layers across the TensorFlow model, to achieve better quality levels.

A second benefit of this approach, which trains natively in TensorFlow, is that you not only get an ML.NET model that you can consume from .NET in order to predict image classifications, but you also get a native TensorFlow model (a frozen graph as a .pb file) that, if you want, you can also consume from any other platform/language that supports TensorFlow (i.e. Python, Java/Android, Go, etc.). The following screenshot shows an example of the generated TensorFlow .pb model after you train with your image set:

In the screenshot below you can see that retrained TensorFlow model (custom_retrained_model_based_on_InceptionV3.meta.pb) opened in Netron, since it is a native TensorFlow model.

Note that you don’t need to understand/learn or even open the DNN model/graph with Netron in order to use it with the ML.NET API; I’m just showing it for folks familiar with TensorFlow, to prove that the generated model is a native TensorFlow model:

Then, the generated ML.NET model .zip file that you use in C# is just like a wrapper around the new native retrained TensorFlow model. See the ML.NET model .zip file in Visual Studio:

It must be highlighted, though, that the ML.NET model file (the .zip file) is self-sufficient, meaning that it also includes the serialized TensorFlow .pb model inside the .zip file, so when deploying into a .NET application you only need the ML.NET model .zip file.

Implementing ‘ML.NET model training C# code’ using the ImageClassification API

Let’s stop talking about how this new feature was designed and internally implemented, and let me show you how easy it is to use.

The sample training app I’m showing below is publicly available at the ML.NET GitHub repo here:

Image Classification Model Training sample with ML.NET

The dataset (Imageset)

First things first: in order to train your own deep learning model you need to provide the images to train on. For this example, you need the images distributed across multiple folders, where each folder’s name is a different label (also called a class).

In this example, I’m using an imageset with 200 image files that you can download from here. It is a simplified imageset derived from the original one with 3,600 files available from TensorFlow here.

The important point is that you must have a balanced dataset, meaning the same (or a very similar) number of images per image class. In my simplified dataset of 200 images I have 5 image classes with 40 images per class, as shown below:

The name of each sub-folder is important because in this example that’ll be the name of each class/label the model is going to use to classify the images.

 

The data class

You need a data class with the schema so you can use it when loading the data, such as the following simple class:
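
Here’s a minimal sketch of such a data class, matching the ImagePath/Label columns used throughout this sample:

    public class ImageData
    {
        public ImageData(string imagePath, string label)
        {
            ImagePath = imagePath;
            Label = label;
        }

        // Path of the image file on disk.
        public readonly string ImagePath;

        // The image class, taken from the name of the folder containing the image.
        public readonly string Label;
    }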

The boilerplate code: Code for downloading the image set, loading it into an IDataView, and splitting it into train/test datasets

The following code uses custom methods to download the dataset files, unzip them, and finally load them into an IDataView, using each folder’s name as the image class name for each image:
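
Here’s a sketch of that code. Note that DownloadImageSet() and LoadImagesFromDirectory() are custom helper methods from the sample, not ML.NET APIs:

    // Download and unzip the image set (custom helper method from the sample).
    string fullImagesetFolderPath = DownloadImageSet(imagesDownloadFolderPath);

    // Load the image files, using each sub-folder's name as the image class/label
    // (custom helper method from the sample).
    IEnumerable<ImageData> images = LoadImagesFromDirectory(
        folder: fullImagesetFolderPath, useFolderNameAsLabel: true);

    // Load into an IDataView and shuffle the rows.
    IDataView fullImagesDataset = mlContext.Data.LoadFromEnumerable(images);
    IDataView shuffledFullImagesDataset = mlContext.Data.ShuffleRows(fullImagesDataset);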

You can review those custom methods (boilerplate code) in the sample.

In the last line of that code I’m shuffling the rows so the dataset is better balanced (an even distribution of rows per image class) before being split into the two datasets later (train/test datasets).

Now, the dataset is split into two datasets: one for training and a second one for testing/validating the quality of the model.
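
That split is done with the standard ML.NET TrainTestSplit API; here’s a sketch assuming the 80/20 split used in the sample:

    // Split the shuffled dataset: 80% for training, 20% for testing/validation.
    var trainTestData = mlContext.Data.TrainTestSplit(shuffledFullImagesDataset, testFraction: 0.2);
    IDataView trainDataView = trainTestData.TrainSet;
    IDataView testDataView = trainTestData.TestSet;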

 

THE IMPORTANT CODE: Simple pipeline defining the model with the new ImageClassification API

As the most important step, you define the model’s training pipeline, where you can see how easily you can train a new TensorFlow model which, under the covers, is based on transfer learning from a selected architecture (pre-trained model) such as Inception v3 or ResNet v2 101.
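
Here’s a sketch of that pipeline, based on the ML.NET 1.4-Preview API shape (the hyper-parameter values are simply the ones used in the sample, and the exact signature may evolve while in Preview):

    var pipeline = mlContext.Transforms.Conversion
            .MapValueToKey(outputColumnName: "LabelAsKey", inputColumnName: "Label")
        .Append(mlContext.Model.ImageClassification("ImagePath", "LabelAsKey",
                    // Pick the DNN architecture (pre-trained model) from the catalog.
                    arch: ImageClassificationEstimator.Architecture.InceptionV3,
                    epoch: 100,
                    batchSize: 30,
                    // Callback to monitor accuracy/cross-entropy while training.
                    metricsCallback: (metrics) => Console.WriteLine(metrics)));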

The first line, with ‘MapValueToKey()‘, is needed so the labels (such as roses, tulips, etc.) are converted to numeric keys. Everything in Machine Learning is, at the end of the day, based on math and statistics, so all values (text or categorical values, like in this case) need to be converted to numeric values.

The second line is precisely the new ImageClassification API, where you simply need to provide the column containing the image, the column containing the related label (image class), plus the following configuration values, which depend on the nature of your imageset (most of all, the type and number of images):

  • The DNN architecture (pre-trained model), such as Inception v3 or ResNet v2 101: You can simply try any of the DNN architectures (pre-trained models) available in our API and use the one that gets better accuracy for your dataset. That will depend on how your images compare to the images used when training the original pre-trained model; for instance, a base model trained with photos of objects, animals, plants and people versus a base model trained with black & white images or even digits (such as MNIST). In the current version (ML.NET 1.4-Preview) we only have Inception v3 and ResNet v2 101, but we’ll add more in the future.
  • Epoch: An epoch is one learning cycle where the learner sees the whole training data set. The more learning cycles, the more accuracy you’ll get, up to a point where additional cycles no longer help. But the more learning cycles you run, the longer training takes.
  • BatchSize: It sets the number of images fed to the model at a time. It needs to divide the training set evenly, or the remaining part won’t be used for training. If this value is very small, the model may overfit (model the training data too well), and therefore, when predicting on new data, some images might be recognized incorrectly because the model tries to match the learned data too closely, which negatively impacts its ability to generalize. On the other hand, if the batch size is too large, the model might underfit (neither model the training data nor generalize to new data). This parameter also depends on how many images you have for training (tens vs. hundreds vs. thousands vs. millions).

Pros, cons, and areas of improvement for the ImageClassification API

Pros:

  • Simplicity: Even if you don’t know what the mentioned parameters are (DNN architecture, epoch, batchSize), this is very much simplified compared to the low-level TensorFlow API. The fact that you simply select a DNN architecture from our ‘catalog’ means that internally ML.NET makes the needed image transformations (such as image resizing, normalization, etc.) for you, depending on that DNN architecture. If, on the other hand, you were providing the DNN architecture file (pre-trained model) yourself, you’d need to know the image size each DNN architecture expects, plus the additional configuration needed (as is the case for the other methods explained later).

Cons:

  • Limited collection of DNN architectures in the catalog: Derived precisely from the simplicity goal and the fact that we’re doing the ‘hard work’ for you depending on the selected DNN architecture, you can only select/use an architecture provided by our catalog, at least with this high-level API. We’ll have other, more flexible APIs where you’ll be able to provide your own DNN architecture, but in that case you’ll need to know many more details about it and provide those parameters yourself.

Current areas of improvement (Post ML.NET 1.4-Preview release):

  • AutoML: Meaning “Find those hyper-parameters for me!“… Right? 😉 In upcoming releases and based on AutoML approaches, you won’t need to manually try different DNN architectures and hyper-parameters. With AutoML approaches we will intelligently generate many combinations of DNN architectures and hyper-parameters and find high-quality models for you.
  • In-memory images as both the input for training and for scoring/consuming the ML.NET model. This is currently ‘work in progress’ for the next release.
  • Hyper-parameter simplification: Refactor and simplify hyper-parameters such as Epoch, BatchSize, and others. Also ‘work in progress’ for the next release.

The rest of the steps for training, evaluating and consuming your model

The rest of the steps, such as training by calling trainedModel.Fit(), evaluating the model with the quality metrics, and trying/consuming the model with some predictions, are pretty similar to the way you do it for any other ML.NET model, so I’m not going to explain them here. You can learn about them in the training sample app itself, here:

Image Classification Model Training sample with ML.NET

See it working!

When running the sample above, the console app will automatically download the image set, unpack it, train the model with those images, validate the quality of the model by making many predictions with the test dataset (the split set of images not used for training), and show the metrics/accuracy:

And finally, it’ll show you all the test predictions used for calculating the accuracy/metrics and even a further single try/prediction with another image not used for training:

At this point, I have covered the main approach we currently recommend for Image Classification model training in ML.NET, and the one where we’ll keep investing to improve it, so you can stop reading the blog post here if you want. But you might also want to know about the other possible ways of training a model for image classification, based on a different type of transfer learning which is NOT native TensorFlow DNN training (it doesn’t create a new TensorFlow model), because it uses an ML.NET trainer “on top” of the base DNN model, which only works as a featurizer. Keep reading if you want to know more about it… 😉

B. Model composition of a pre-trained TensorFlow model working as image featurizer plus an ML.NET trainer as the model’s algorithm

This method or approach has been available in ML.NET since v1.0. It is also described in detail in this tutorial/walkthrough:

Tutorial: Retrain a TensorFlow image classifier with ‘Model Composition’ transfer learning and ML.NET

Plus we also have this sample in the ML.NET GitHub repo:

Sample training app: Image Classification Training (Model composition using TensorFlow Featurizer Estimator)

Since you have a detailed step-by-step guide in those resources above, what I’m going to do in this blog post is highlight what this approach does under the covers and the main issues and complex details about the TensorFlow pre-trained model (DNN architecture) that the user needs to know about, which is why we’re working on the previous approach: trying to simplify the ‘Computer Vision’ scenarios in ML.NET while providing native DNN power and flexibility.

The problem to solve

The problem is ‘Image Classification’, the same problem as the one targeted by the previous approach.

Nothing new here.

The approach

This approach mixes a pre-trained Deep Learning model (DNN architecture), used simply to generate features from all the images, with a traditional ML.NET algorithm (a multi-class classification trainer such as LbfgsMaximumEntropy).

In more detail, you use the Inception model as a featurizer. This means the model processes input images through the neural network and then uses the output of the tensor which precedes the classification layer. This tensor contains the image features, which allow an image to be identified.

Finally, these image features are fed into the LbfgsMaximumEntropy algorithm/trainer, which learns how to classify different sets of image features.

You can see that approach in a visual illustration below:

You can see that the process and the produced assets are different compared to approach #1, where we train a new TensorFlow model.

In this case, we are not training in TensorFlow but simply using a TensorFlow pre-trained model as a featurizer to feed a regular ML.NET algorithm; therefore, the only thing produced is an ML.NET model, not a new retrained TensorFlow model.

The code

Let’s see the code of the mentioned sample training app (Image Classification Training – Model composition using the TensorFlow Featurizer Estimator), section by section.

The boilerplate code: Code for downloading the image set, loading it into an IDataView, and splitting it into train/test datasets

That code is almost exactly the same as in the native DNN Transfer Learning approach explained at the beginning of the blog post, so nothing new here. You can see it here:

The complex code: Code with ‘magic’ names and settings related to the TensorFlow model being used

The first thing you’ll notice when reviewing this code is that there are quite a few configuration settings that might make you wonder “How would I find out these parameters?“, such as in this code:
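
Here’s a sketch of that settings code, with the Inception v3 values in the fields and the Inception v1 values in the comments:

    private struct ImageSettingsForTFModel
    {
        public const int imageHeight = 299;     // 224 for InceptionV1
        public const int imageWidth = 299;      // 224 for InceptionV1
        public const float mean = 117;          // offset used when extracting pixels
        public const float scale = 1 / 255f;    // 1 for InceptionV1 (no re-scaling)
        public const bool channelsLast = true;  // interleave pixel colors (RGB order)
    }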

In fact, those values usually depend on the pre-trained TensorFlow model you are using. For instance, the values shown in the struct are the right ones when using the Inception v3 pre-trained model, and the values commented on the right are the ones needed when using the Inception v1 pre-trained model: basically, a different image size, a different re-scale value, etc.

The way to find out those configuration values is not straightforward: you need to research the requirements of the pre-trained TensorFlow model, probably by investigating some other sample using the same model in Python, or through whatever documentation is available for that DNN architecture. Definitely not straightforward! 😉

Then comes the ‘fun part’, which is the pipeline definition for the image transformations, as shown in this code:
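
A sketch of that pipeline, using the Inception v3 settings and tensor names discussed below (the intermediate column names are just the ones used in the sample):

    var dataProcessPipeline = mlContext.Transforms.Conversion
            .MapValueToKey(outputColumnName: "LabelAsKey", inputColumnName: "Label")
        .Append(mlContext.Transforms.LoadImages(outputColumnName: "image_object",
                    imageFolder: imagesFolderPath,
                    inputColumnName: nameof(ImageData.ImagePath)))
        .Append(mlContext.Transforms.ResizeImages(outputColumnName: "image_object_resized",
                    imageWidth: ImageSettingsForTFModel.imageWidth,
                    imageHeight: ImageSettingsForTFModel.imageHeight,
                    inputColumnName: "image_object"))
        .Append(mlContext.Transforms.ExtractPixels(outputColumnName: "input",
                    inputColumnName: "image_object_resized",
                    interleavePixelColors: ImageSettingsForTFModel.channelsLast,
                    offsetImage: ImageSettingsForTFModel.mean,
                    scaleImage: ImageSettingsForTFModel.scale))
        .Append(mlContext.Model.LoadTensorFlowModel(inputTensorFlowModelFilePath)
            .ScoreTensorFlowModel(
                outputColumnNames: new[] { "InceptionV3/Predictions/Reshape" },
                inputColumnNames: new[] { "input" },
                addBatchDimensionInput: false));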

The actions (method names) transforming the images look logical, although a bit verbose:

  • Load Images
  • ResizeImages
  • ExtractPixels
  • LoadTensorFlowModel
  • ScoreTensorFlowModel

But couldn’t a higher-level API do all those steps for me? (That’s what we’re currently doing in the previous approach. 😉)

And most of all, how can you find out those additional “magic strings”, such as the following?

Values for Inception V3:

  • outputColumnNames: new[] { "InceptionV3/Predictions/Reshape" }
  • inputColumnNames: new[] { "input" }
  • addBatchDimensionInput: false

Values for Inception V1:

  • outputColumnNames: new[] { "softmax2_pre_activation" }
  • inputColumnNames: new[] { "input" }
  • addBatchDimensionInput: true

Well, it turns out that those “magic strings” are precisely the names of the input tensor and of the output tensor of the penultimate layer (the layer that generates the image features), as named within the specific pre-trained TensorFlow model you are using (InceptionV3, InceptionV1, ResNet, etc.). If you open the TensorFlow frozen graph file (.pb file) with Netron, you can see them, as shown in the following illustration (note that the illustration shows the values needed for InceptionV1):

Then, the rest of the code is about adding the regular ML.NET multi-class classification trainer (in this case LbfgsMaximumEntropy), training the model by running Fit(), evaluating the model, and checking the metrics such as accuracy, the same way you’d do with any other ML.NET model, as in the following code:
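
A sketch of that final part of the pipeline, again using the Inception v3 tensor name as the features column for the trainer:

    // The TensorFlow featurizer's output column is used as the features column.
    var trainer = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
        labelColumnName: "LabelAsKey",
        featureColumnName: "InceptionV3/Predictions/Reshape");

    var trainingPipeline = dataProcessPipeline.Append(trainer)
        .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

    // Train the model.
    ITransformer model = trainingPipeline.Fit(trainDataView);

    // Evaluate the model on the test dataset and check the quality metrics.
    IDataView predictionsDataView = model.Transform(testDataView);
    var metrics = mlContext.MulticlassClassification.Evaluate(
        predictionsDataView, labelColumnName: "LabelAsKey");
    Console.WriteLine($"Macro-accuracy: {metrics.MacroAccuracy}");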

Notice how you need to specify the name of the output tensor providing the image features (InceptionV3/Predictions/Reshape when using InceptionV3) as the feature column name for the ML.NET trainer/algorithm.

Finally, I also want to highlight that in this approach the only output produced by the training is the ML.NET model (.zip file), since we are not retraining a new TensorFlow model as in approach #1 of this blog post, but simply using the image features produced by the TensorFlow model to train an ML.NET model, as shown in the following illustration:

So, yeah, this approach is pretty flexible: you can use any pre-trained TensorFlow model (DNN architecture) you’d like. But from a usage and simplicity point of view it is far from ideal, right?

That’s why we are improving the API experience with simpler approaches such as approach #1, currently in ML.NET 1.4-Preview.

C. Model composition of a pre-trained ONNX model working as image featurizer plus an ML.NET trainer as the model’s algorithm

 

TBD – Will continue working on this section pretty soon! 🙂

 

Takeaways

TBD

Cesar De la Torre

Principal Program Manager, .NET
