Comparing Image-Classification Systems: Custom Vision Service vs. Inception

Clemens Wolff

We recently worked with Reverb, an online marketplace for music gear. Musicians and music retailers use Reverb’s platform to buy and sell items such as guitars, keyboards and effects pedals. A key element that aids discovery of items listed on the Reverb site is accurate metadata. For example, the manufacturer’s official name for the finish of a guitar (e.g., “honey burst” or “fiesta red”) is an important criterion for surfacing relevant guitars in user searches. However, the finish name is sometimes omitted or entered inaccurately when sellers list a guitar for sale. To solve the problem of missing or mislabeled manufacturer finish metadata, we developed a set of image classification models with Reverb that help resolve manufacturer finish names based on product images.

In this code story, we will cover how to build image classification models in Python with Custom Vision Service and compare the results with popular TensorFlow-based models in terms of accuracy, prediction speed, training speed and setup complexity.

Getting started with the Custom Vision Service

Custom Vision Service is a managed service that enables developers to easily customize state-of-the-art deep learning image classification models using transfer learning. You can read more about the service in this code story. We decided to use this service for our solution to classify the manufacturer finish names because it provides a very quick getting-started experience and enables us to build image classification models without having to manage and scale deep learning infrastructure.

A simple way to train models via the Custom Vision Service is to use its web portal and create classifiers via a drag-and-drop user interface to upload and tag training images. However, this approach is inefficient when dealing with large amounts of training data and when iterating on new models and ideas, since the process cannot be automated. As a result, we created a Python SDK for the Custom Vision Service. Using this SDK, it is straightforward to train new models quickly with large amounts of training data.

To get started, install the Custom Vision Python SDK and its dependencies:

pip install custom_vision_client

We will assume your image data is laid out on disk using directories as labels like so:

├── label1
│   ├── image1.jpg
│   └── image2.jpg
└── label2
    ├── image3.jpg
    └── image4.jpg
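Before uploading anything, it can be useful to check that the directory layout is what you expect. The following is a minimal sketch using only the Python standard library; the helper name `collect_labeled_images` is hypothetical and not part of the Custom Vision SDK.

```python
from pathlib import Path


def collect_labeled_images(data_directory, pattern="*.jpg"):
    """Map each label (sub-directory name) to its sorted image paths.

    Hypothetical helper for illustration: assumes one sub-directory
    per label, as in the layout shown above.
    """
    labeled_images = {}
    for label_directory in sorted(Path(data_directory).iterdir()):
        if not label_directory.is_dir():
            continue  # ignore stray files next to the label directories
        images = sorted(str(path) for path in label_directory.glob(pattern))
        if images:  # skip labels that have no matching images
            labeled_images[label_directory.name] = images
    return labeled_images
```

Printing the number of images per label from the returned dictionary is a quick way to spot empty or misspelled label directories before training.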

Now we’re ready to build a model:

from glob import glob
from os import listdir
from os.path import join

from custom_vision_client import TrainingClient, TrainingConfig

azure_region = "southcentralus"
training_key = "the-training-key"  # from settings pane on
project_name = "my-classifier"
data_directory = "/path/to/training-data"

training_client = TrainingClient(TrainingConfig(azure_region, training_key))

# create a new project, if the name already exists, a suffix will
# be added to the name in order to make it unique
project_id = training_client.create_project(project_name).Id

# each sub-directory of data_directory is named after a label
for label in listdir(data_directory):
    images = glob(join(data_directory, label, '*.jpg'))
    # register a new label if it doesn't already exist in the project
    # also register training images, duplicates get ignored automatically
    training_client.create_tag(project_id, label)
    training_client.add_training_images(project_id, images, label)

model_id = training_client.trigger_training(project_id).Id

print('{},{}'.format(project_id, model_id))  # record these for prediction

With Reverb, our objective was to classify an image into several categories of manufacturer finish names. These finish names roughly corresponded to colors and as such can be grouped naturally into distinct super-families like “the reds,” “the greens,” “the multi-colors/sunburst,” etc. We decided to utilize this natural structure of our data and build a hierarchical image classification model. First, we built a model to distinguish between the color super-families. Next, within each family, we built a second model to differentiate between the more detailed nuances of the finishes in that color family.

The figure below shows a flow-chart diagram of the model:

Figure: Hierarchical model for classifying color finish names for guitars.

First, we predicted high-level color families such as red, green or sunburst. Within each family we then trained an additional model to differentiate nuances; for example, to tell the difference between “bordeaux-metallic” and “fiesta-red” within the “red” family.

The intention behind this model was to maximize the amount of training data available to the first classifier and enable the second level of classifiers to learn more discriminatory features within each color family instead of having a single model learning features that span the entire label space. For the Reverb data set, this approach increased precision to 87% and recall to 84%, compared to a standard multi-label model with precision at 79% and recall at 75%. The downside of the tiered approach is that we now require two classifications per image at prediction time, which doubles latency. A single prediction takes on the order of 200ms-400ms, so the latency for the hierarchical model is on the order of 400ms-800ms.
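The two-stage prediction described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the project: `family_classifier` and the values of `finish_classifiers` are hypothetical callables that take an image and return a `(label, probability)` tuple, standing in for requests to the trained models. Multiplying the two probabilities is just one simple way to combine them. Because the second call can only start once the first has returned, the two latencies add up.

```python
def classify_hierarchically(image, family_classifier, finish_classifiers):
    """Two-stage prediction: color family first, then finish within it.

    Assumption: each classifier is a callable returning a
    (label, probability) tuple for the given image.
    """
    # first request: which color super-family does the image belong to?
    family, family_probability = family_classifier(image)

    finish_classifier = finish_classifiers.get(family)
    if finish_classifier is None:
        # no second-level model for this family, report the family only
        return family, None, family_probability

    # second request: which finish within that family?
    finish, finish_probability = finish_classifier(image)

    # combined score (an assumption made for this sketch)
    return family, finish, family_probability * finish_probability
```

With stub classifiers, `classify_hierarchically("guitar.jpg", lambda image: ("red", 0.5), {"red": lambda image: ("fiesta-red", 0.5)})` returns `("red", "fiesta-red", 0.25)`.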

Figure: Precision and recall of the Custom Vision Service as the amount of training data available to the model increases. The service does well even with relatively limited amounts of training data.
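For readers less familiar with the two metrics, the sketch below shows how precision and recall can be computed per label from a list of true labels and a list of predicted labels. It is a generic illustration: the function name is hypothetical, and the article does not state how the reported figures were aggregated across labels.

```python
from collections import Counter


def precision_recall_per_label(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label seen."""
    actual_counts = Counter(true_labels)
    predicted_counts = Counter(predicted_labels)
    true_positives = Counter(
        truth for truth, prediction in zip(true_labels, predicted_labels)
        if truth == prediction
    )

    metrics = {}
    for label in set(actual_counts) | set(predicted_counts):
        hits = true_positives[label]
        # precision: of all images predicted as `label`, how many were right?
        precision = hits / predicted_counts[label] if predicted_counts[label] else 0.0
        # recall: of all images that truly are `label`, how many were found?
        recall = hits / actual_counts[label] if actual_counts[label] else 0.0
        metrics[label] = (precision, recall)
    return metrics
```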

In summary, getting started building image classification models with the Custom Vision Service is very simple. Uploading images and building models can be done in a matter of minutes. No deep learning hardware needs to be set up or managed to train models. The Custom Vision Service leverages many data sets under the hood for transfer learning, so the performance of the trained models is strong, even with limited training data (see chart above). At prediction time, the latency for a request to the Custom Vision Service is on the order of 200ms to 400ms (750ms to 1400ms if the caller is not in the same Azure data center), which is reasonable for many non-real-time applications. Note that the Custom Vision Service recently added support for exporting models to enable prediction to be run locally on edge devices, so this latency can be further reduced.

Getting started with Inception V3

In order to get an idea of how Custom Vision Service compares in both prediction accuracy and speed to other popular convolutional networks, we built a pair of transfer learning implementations based on Inception V3 and MobileNet. For simplicity, the examples below are based on the command line tools included in TensorFlow’s examples.

Both Inception V3 and MobileNet networks were retrained using the tensorflow/tensorflow:1.3.0-devel-py3 Docker image. All experiments were done in a CPU-centric environment to mirror a production service deployment environment (that is, without Nvidia-Docker or GPU support due to the cost of running these in production at scale). From a working directory which includes the training images, you can start by running the following Docker container:

docker run --rm -it -v `pwd`:/c13n -w /c13n tensorflow/tensorflow:1.3.0-devel-py3 bash

Assuming the training images are laid out as described above, training can be done using the script like so:

python3 /tensorflow/tensorflow/examples/image_retraining/ \
    --architecture inception_v3 \
    --image_dir /c13n/training-data \
    --output_graph /c13n/index.pb \
    --output_labels /c13n/index.txt

After a few minutes, the script should complete with output similar to the following:

INFO:tensorflow: Step 3999: Train accuracy = 79.0%
INFO:tensorflow: Step 3999: Cross entropy = 1.548623
INFO:tensorflow: Step 3999: Validation accuracy = 61.0% (N=100)
INFO:tensorflow:Final test accuracy = 65.6% (N=570)

As you can see from the example, at 61.0% validation accuracy, the performance of Inception V3 trained on 5000 images is comparable to that of the Custom Vision Service trained on 200 images. Retraining the Inception V3 neural network can take between 6 and 15 minutes per model, whereas the Custom Vision Service takes between 10 seconds and 5 minutes to train, depending on the dataset size (timed on a 2.9 GHz Intel Core i7 machine with 16GB of RAM).

In order to run a prediction using this newly-trained model we can run the following command from TensorFlow’s examples:

python3 /tensorflow/tensorflow/examples/label_image/ \
    --image /path/to/image.jpg \
    --input_width 128 \
    --input_height 128 \
    --graph /c13n/red.pb \
    --labels /c13n/red.txt \
    --input_layer DecodeJpeg \
    --output_layer final_result

Running a prediction using a local Inception V3 and a local image takes significantly longer (6000ms to 7000ms) than running the equivalent request on Custom Vision’s REST endpoint (750ms to 1400ms). We can shave about 1000ms off the prediction time by wrapping the Inception prediction code in a REST service (e.g. using hug, shown in this sample code) and only loading the Inception model once at service start instead of for every prediction.
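The "load once, predict many times" idea can be sketched independently of any particular web framework. In the sketch below, `model_loader` is a hypothetical callable that performs the expensive loading step and returns a callable model; neither name comes from TensorFlow or hug. In a real service, the `PredictionService` instance would be created once when the REST application starts, and each request handler would call `predict`.

```python
class PredictionService:
    """Loads a model once at start-up and reuses it for every request."""

    def __init__(self, model_loader):
        # expensive step: runs a single time, when the service starts
        self._model = model_loader()

    def predict(self, image):
        # cheap step: runs per request against the in-memory model
        return self._model(image)
```

The design choice is simply to move the model load out of the request path, which is where the saving described above comes from.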

Getting started with MobileNet

The steps for retraining and using a MobileNet model are almost identical to those for Inception V3. The only difference is the value passed through the --architecture command line parameter:

python3 /tensorflow/tensorflow/examples/image_retraining/ \
    --architecture mobilenet_0.25_128_quantized \
    --image_dir /c13n/training-data \
    --output_graph /c13n/index.pb \
    --output_labels /c13n/index.txt

Interestingly, when trained on the same set of 5,500 training images of size 128×128, the MobileNet model reaches an accuracy of 72.5%, higher than that of Inception V3 (likely with some overfitting in MobileNet, as hinted at by the very high training accuracy).

INFO:tensorflow: Step 3999: Train accuracy = 99.0%
INFO:tensorflow: Step 3999: Cross entropy = 0.112992
INFO:tensorflow: Step 3999: Validation accuracy = 75.0% (N=100)
INFO:tensorflow:Final test accuracy = 72.5% (N=570)

However, based on the wider community’s empirical experience, it’s likely that the performance of a MobileNet model on other data sets will be lower than that of Inception V3, since MobileNet is optimized for speed whereas Inception aims for accuracy.

Another noteworthy difference between Inception and MobileNet is the large saving in model size: 900KB for MobileNet versus 84MB for Inception V3. You can experiment further by switching between variants of MobileNet. For instance, using mobilenet_1.0_128 as the base model increases the model size to 17MB but also increases accuracy to 80.9%.

INFO:tensorflow: Step 3999: Train accuracy = 100.0%
INFO:tensorflow: Step 3999: Cross entropy = 0.034551
INFO:tensorflow: Step 3999: Validation accuracy = 87.0% (N=100)
INFO:tensorflow:Final test accuracy = 80.9% (N=570)

It’s also worth mentioning that this smaller model size is accompanied by a shorter prediction runtime of 950ms, versus roughly 5000ms with Inception V3. If we wrap the MobileNet model in a REST service and load the model only once during startup, we can achieve prediction times of around 400ms.

For running prediction, the script can be invoked for a MobileNet model as shown below. (Note that the value of the input_layer argument differs from the one used for Inception V3.)

time python3 /tensorflow/tensorflow/examples/label_image/ \
    --image /path/to/image.jpg \
    --input_width 128 \
    --input_height 128 \
    --graph /c13n/red.mobilenet.pb \
    --labels /c13n/red.mobilenet.txt \
    --input_layer input \
    --output_layer final_result

bordeaux metallic 0.999991
burgundy mist metallic 8.92877e-06
burgundy mist 1.87918e-07
dakota red 4.31156e-09

real 0m0.955s
user 0m0.840s
sys 0m0.190s
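If the command-line output is consumed by another program, for example to pick the top prediction, the label and score lines shown above can be parsed with a few lines of Python. This is a minimal sketch that assumes each prediction is printed as a label followed by a space and a numeric score, as in the sample output; the function name is hypothetical.

```python
def parse_label_scores(output):
    """Parse lines of the form '<label> <score>' into (label, score) tuples."""
    predictions = []
    for line in output.splitlines():
        # the label may itself contain spaces, so split on the last one
        label, separator, score = line.strip().rpartition(" ")
        if not separator:
            continue  # blank line or a line without a score
        try:
            predictions.append((label, float(score)))
        except ValueError:
            continue  # not a prediction line, e.g. timing output
    return predictions
```

The most likely finish is then `max(predictions, key=lambda prediction: prediction[1])`.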

Framework comparison

The figures below show a comparison of the predictions of the Custom Vision Service, Inception and MobileNet models on a variety of sample guitar images with diverse rotations, backgrounds, lighting and cropping factors. Overall, MobileNet is the fastest model at prediction time and the Custom Vision Service is the most accurate while having an acceptable runtime.

Figure: Sample guitar images alongside their color finish name as predicted by the TensorFlow-based MobileNet and Inception models and by the Custom Vision Service. The Custom Vision Service does well across a wide range of guitar images, rotations, partial views, lighting changes, etc.

Figure: Charts comparing the TensorFlow-based models MobileNet and Inception with the Custom Vision Service in terms of accuracy, model size, training time and prediction time. The Custom Vision Service has the best aggregate performance across the categories.


In this code story, we’ve shown how to tackle a custom image classification task via transfer learning on the Custom Vision Service and TensorFlow. Overall, we found the Custom Vision Service to have the best performance on our task. Due to the high-end models backing the service, we expect this strong performance to generalize to other image classification tasks, such as identifying sign language utterances in a video stream or classifying skin disease based on smartphone camera pictures.

The Custom Vision Service offers a simple getting-started scenario enabling us to build models in minutes via a simple Python SDK without having to set up a complicated TensorFlow stack. If you’re interested in trying out the Custom Vision Service to get started with your image classification project, take a look at the Python SDK and get in touch with us on GitHub! We’d love to hear your feedback.



1 comment


  • Clemens Wolff (Microsoft employee)

    The performance metrics can be looked up using the get_iteration_performance method on the CustomVisionTrainingClient class in the Custom Vision Python SDK.
