Building Smart Apps with Microsoft Cognitive Services

Machine learning is a hot topic for developers these days. Many developers, though, have been deterred from integrating machine learning into their applications because it appears to demand an advanced understanding of math and theory. Earlier this year, Microsoft announced the availability of a set of 21 new APIs called Microsoft Cognitive Services. Rather than making you deal with the complexities that come with machine learning, Cognitive Services provides simple APIs that handle common use cases, such as recognizing speech or performing facial recognition on an image. These APIs can be broken down into five main categories: vision, speech, language, knowledge, and search.

In this blog post, I’m going to explore the Computer Vision API, which returns rich information about visual content found in an image, to build a language learning app similar to Rosetta Stone where users take photos of objects based on the word presented to them.

Exploring the API

The Computer Vision API has many features that help developers building apps with an image component. These features include the ability to analyze a picture to understand its content, create smart thumbnails (to ensure you never crop the region of interest), as well as OCR and adult content detection.

Adult content detection is one feature of the Vision APIs that I believe will change the way we think about storing our images. I can think of few situations where, as a developer, I’d be comfortable with end users storing adult content on my servers, especially when those images are shared with the public. With one API call, I can get a probability score for inappropriate content in the picture and decide what action to take (most likely ensuring the image is never uploaded to blob storage), stopping inappropriate content from reaching my users before it’s even stored in the cloud.
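To make that decision step concrete, here is a minimal sketch of how an app might act on the score. The UploadPolicy class, its method name, and the 0.5 threshold are all hypothetical choices of mine rather than part of the Cognitive Services SDK; the score itself would come from the analysis result when the Adult feature is requested.

```csharp
using System;

public static class UploadPolicy
{
    // Hypothetical cut-off; tune it for your own audience and risk tolerance.
    public const double DefaultThreshold = 0.5;

    // Returns true when the image should be rejected before it is stored.
    // adultScore is expected to be a probability between 0 and 1.
    public static bool ShouldRejectUpload(double adultScore, double threshold = DefaultThreshold)
    {
        if (adultScore < 0 || adultScore > 1)
            throw new ArgumentOutOfRangeException(nameof(adultScore), "Score must be between 0 and 1.");

        return adultScore >= threshold;
    }
}
```

Keeping the threshold as a parameter means it could be tightened later, for example from the backend, without touching the upload code.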

Creating a Spelling App with Computer Vision APIs

When I first heard that the Vision APIs were available for general consumption, I had an idea for a language learning game. The idea would be to present the gamer with a word (in a foreign language), and ask them to go find the item described and take a photo. I’d then use Cognitive Services to analyze the image and confirm or reject the image as having the correct content. If we take Dutch as an example, the screen would show Kat and the user would need to take a photo of a cat in order to unlock an achievement.

Registering for a Cognitive Service API Key

To get started using all of the Cognitive Services APIs, all you have to do is head over to the Cognitive Services website and sign up for an account. This free account allows you to create an API key for consuming computer vision APIs as well as any of the other 20 APIs available as part of the service.

Once an API key was registered, I jumped over to Azure to set up my backend infrastructure. In this case, I created an App Service using EasyTables. This way I could dynamically push new vocabulary and corrections to all devices without requiring an App Store update.

To populate my App Service with data, I created an Excel spreadsheet of both the English and Dutch words and exported it as a CSV file. Setting up the backend infrastructure took no more than 5 minutes, which I think is a testament to the power of PaaS.
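EasyTables imports the CSV for you, so no parsing code is needed on the backend. Purely to illustrate the shape of the data, here is a rough sketch that reads the same kind of file into a local list, which can be handy for testing offline. The Word class and the WordCsv helper are hypothetical; I only assume a two-column file with a header row and no quoted fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model; the property names mirror the ones used later in this post.
public class Word
{
    public string English { get; set; }
    public string Translation { get; set; }
}

public static class WordCsv
{
    // Parses lines of the form "english,translation".
    // Blank lines and lines without exactly two non-empty columns are skipped.
    public static List<Word> Parse(IEnumerable<string> lines, bool hasHeader = true)
    {
        return lines
            .Skip(hasHeader ? 1 : 0)
            .Select(line => line.Split(','))
            .Where(parts => parts.Length == 2
                && !string.IsNullOrWhiteSpace(parts[0])
                && !string.IsNullOrWhiteSpace(parts[1]))
            .Select(parts => new Word
            {
                English = parts[0].Trim(),
                Translation = parts[1].Trim()
            })
            .ToList();
    }
}
```

Calling WordCsv.Parse(File.ReadAllLines(path)) would be one way to feed it a real file.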

Taking a Photo

Naturally, for any vision-based app to work, we’ll need something to look at. In this case, I want to take a photo and upload it. The Media Plugin for Xamarin and Windows is a great way to accomplish this if you’re doing cross-platform development, but each platform (in this case iOS) also exposes a rich API for taking photos:

var imagePicker = new UIImagePickerController();
imagePicker.SourceType = UIImagePickerControllerSourceType.Camera;
PresentViewController(imagePicker, true, null);
imagePicker.Canceled += async delegate {
    await imagePicker.DismissViewControllerAsync(true);
};

imagePicker.FinishedPickingMedia += async (object s, UIImagePickerMediaPickedEventArgs e) => {
    //Insert code here for upload to Cognitive Services
};

The images that my iPhone takes are huge, normally over 2500px x 3000px, which is a little large for uploading to Cognitive Services (in fact, we’ll get an InvalidImageSize exception if we upload the image without first scaling it). To scale the image, we can use the following iOS snippet:

UIImage ScaledImage(UIImage image, nfloat maxWidth, nfloat maxHeight)
{
    var maxResizeFactor = Math.Min(maxWidth / image.Size.Width, maxHeight / image.Size.Height);
    var width = maxResizeFactor * image.Size.Width;
    var height = maxResizeFactor * image.Size.Height;
    return image.Scale(new CoreGraphics.CGSize(width, height));
}
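The arithmetic behind that snippet is easier to check without UIKit in the way. Below is a sketch of the same calculation on plain doubles. The ImageSizing class is a hypothetical helper, and capping the factor at 1 (so small images are never enlarged) is my own addition, not something the snippet above does.

```csharp
using System;

public static class ImageSizing
{
    // Returns the largest size that fits inside maxWidth x maxHeight while
    // preserving the aspect ratio. The factor is capped at 1 so that images
    // already within the limits keep their original size.
    public static (double Width, double Height) FitWithin(
        double width, double height, double maxWidth, double maxHeight)
    {
        if (width <= 0 || height <= 0)
            throw new ArgumentException("Image dimensions must be positive.");

        var factor = Math.Min(1.0, Math.Min(maxWidth / width, maxHeight / height));
        return (width * factor, height * factor);
    }
}
```

For a 2500 x 3000 photo and a 1024 x 1024 limit, the factor is min(0.4096, 0.3413), so the result is roughly 853 x 1024: the taller side hits the limit and the width shrinks in proportion.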

Recognizing Image Content with the Computer Vision APIs

All Cognitive Services APIs are consumable via beautiful cross-platform libraries distributed via NuGet, although REST APIs are also available if you wish to use those. To add the vision APIs to the iOS app, add the Microsoft.ProjectOxford.Vision NuGet package and create a CognitiveService service class that interacts with the Cognitive Services APIs.

In just two lines of code, it’s possible to analyze the image and get detailed information about its content, without requiring any knowledge about machine learning:

public async Task<AnalysisResult> GetImageDescription(Stream imageStream)
{
    VisualFeature[] features = { VisualFeature.Tags, VisualFeature.Categories, VisualFeature.Description};
    return await visionClient.AnalyzeImageAsync(imageStream, features.ToList(), null);
}

To get the most out of the Cognitive Services Vision APIs, provide a list of the features you wish to enable in the analysis process. You can then pass this list into the AnalyzeImageAsync method. The options available to you are:

ImageType
Color
Faces
Adult
Categories
Tags
Description

Validating the Photo Against the Word

The AnalysisResult object that we get back from the VisionServiceClient has a number of properties depending on the VisualFeatures requested. In this case, all I really care about are description tags. The tags list all of the items that Cognitive Services believes to be in the image or related to items it thinks are in the image. When taking a photo of a cat in a basket, it will commonly add tags for things like the cat, the floor, the rug it sits on, etc.

All that’s required is to loop through the tags of the description and see if they match the original word provided:

var selectedWord = Words.FirstOrDefault();
foreach (var tag in result.Description.Tags)
{
    if (tag == selectedWord.English.ToLower())
    {
        Acr.UserDialogs.UserDialogs.Instance.ShowSuccess("Correct!");
        Words.Remove(selectedWord);
        if (Words.Count > 0)
            lblWord.Text = Words.FirstOrDefault().Translation;
        else
        {
            lblWord.Text = "Game Over";
            btnSkip.Alpha = 0;
            btnSnapPhoto.Alpha = 0;
        }
        return;
    }
}
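The comparison at the heart of that loop can also be pulled out into a small function that is easy to test away from the UI. This is only a sketch: TagMatcher is a hypothetical helper, and I have swapped the ToLower() check for a case-insensitive ordinal comparison, which avoids culture-specific lowercasing surprises.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagMatcher
{
    // Returns true when any tag equals the expected English word,
    // ignoring case and surrounding whitespace on the expected word.
    public static bool ContainsWord(IEnumerable<string> tags, string english)
    {
        if (tags == null || string.IsNullOrWhiteSpace(english))
            return false;

        var target = english.Trim();
        return tags.Any(tag => string.Equals(tag, target, StringComparison.OrdinalIgnoreCase));
    }
}
```

With this in place, the loop body would only run when TagMatcher.ContainsWord(result.Description.Tags, selectedWord.English) returns true.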

Wrapping Up

Microsoft Cognitive Services allows developers to build incredibly complex features into apps using only a couple of lines of code. Features that historically would have required mathematicians, statisticians, and machine learning experts, as well as a huge amount of annotated data, can now be implemented by any developer, as we saw above.

There is a wealth of information on how to get started with Cognitive Services, both in the language learning app featured in this post and in the official Cognitive Services documentation portal. Be sure to try out our code challenge based on the Emotion APIs for a hands-on look at the power of Microsoft Cognitive Services.