Launchpad.ai
Posted on December 17, 2020

A case for custom deep learning solutions

By Arshak N., Vaishak K.

AWS Rekognition and the Azure Computer Vision module are two popular and widely used deep learning platforms for computer vision. Both can identify a wide range of objects in an image, including natural objects, human bodies and faces, household items, animals, birds, vehicles, gadgets, food items and more, with very good accuracy. Their ability to identify a variety of natural and man-made objects at decent accuracy, with models that work right out of the box without any training, makes them a good choice for many deep learning computer vision applications. But are they good enough for all kinds of real-world applications, or do specialized deep learning solutions have a case?

Figure (i) - A few examples of images in the dataset.

In this article we try to answer this question in the context of identifying raw food items placed inside an oven. Identifying raw food items is one thing; identifying them inside an oven can be a whole different ball game. Artefacts like the oven grill, the peculiar angle of view and the specific lighting conditions all add to the challenge for the classifier.

The dataset used for the evaluation had 665 images of food items placed in an oven, spread over 46 different classes. All the images were collected from the open internet, either through Google image search or from YouTube videos. The dataset was carefully curated to make it as representative as possible: each image was selected such that it contains a food item placed in an oven, with oven lighting and at a constant viewing angle. A few examples of images in the dataset are shown in figure (i).

Microsoft Azure Computer Vision

The Azure Computer Vision module has an image classifier that can identify multiple objects contained in the input image. The model is apparently pre-trained on a huge dataset and can successfully identify a large number of natural and man-made objects in images. It is capable of identifying many food items as well, but it is not clear exactly which food items it can recognize. From our experiments it was evident that Azure’s model couldn’t identify some of the classes in our dataset. To add to its woes, some of our classes were too specific for it: for example, our dataset has chicken breast, chicken thighs, chicken wings and whole chicken as different classes, while Azure has just a generic ‘chicken’ class.
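The class mismatch described above can be checked mechanically. The sketch below is only an illustration, not the procedure used in the experiments: since the service's label vocabulary is not published, it assumes the vocabulary is approximated by the set of labels observed across responses. The function name `split_by_coverage` and the example labels other than ‘chicken’ are hypothetical.

```python
def split_by_coverage(dataset_classes, observed_labels):
    """Split dataset classes into those the service can name and those it cannot.

    Assumption: observed_labels is the set of labels seen in the service's
    responses, used as a stand-in for its unpublished vocabulary. Matching is
    exact after lower-casing, so a generic label does not cover a specific class.
    """
    vocab = {label.strip().lower() for label in observed_labels}
    covered = [c for c in dataset_classes if c.strip().lower() in vocab]
    missing = [c for c in dataset_classes if c.strip().lower() not in vocab]
    return covered, missing


# The four chicken classes come from the article; the observed labels are invented.
dataset_classes = ["chicken breast", "chicken thighs", "chicken wings", "whole chicken"]
observed_labels = ["Chicken", "Food", "Oven", "Indoor"]
covered, missing = split_by_coverage(dataset_classes, observed_labels)
print("covered:", covered)
print("missing:", missing)
```

With exact matching, the generic ‘chicken’ label covers none of the four specific classes, which is the granularity problem in miniature.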

Besides, the model is a multi-label classifier that returns more than one label for each image, ideally one label for each object in the image. In reality the results are far from ideal, as demonstrated in figure (ii).

Figure (ii) - Examples of multi-label outputs of Azure CV module.

For the first image the model returns labels that are quite accurate and, more importantly, it successfully identifies the presence of chicken in the image, which can be considered a positive outcome. For the second image the model totally fails to recognize the chicken, which is a negative outcome. After feeding all 665 images in the dataset to the model, it returned an accuracy of just 3.9%. Even if we consider the fact that Azure doesn’t have 27 out of the 46 labels in its arsenal, 3.9% is a very poor accuracy. The precision and recall scores of the model, shown in table (i), further establish its weakness: even though the precision is very high for most classes, the recall is quite poor for all of them. In practice this has a major impact. For example, when the model says that an image contains chicken, it is very often true, but when there is chicken in an image, the model very rarely says so.
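The high-precision, low-recall pattern can be made concrete with a small sketch. It assumes each image has one ground-truth class and that a class counts as predicted when its name appears among the labels returned for that image. The article does not state the exact scoring rule, so this rule, the function name and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def per_class_precision_recall(samples, classes):
    """Compute (precision, recall) per class from multi-label outputs.

    samples: list of (true_class, predicted_labels) pairs, one per image.
    Assumption: a class is "predicted" for an image when its name appears
    among the returned labels (case-insensitive exact match).
    """
    tp = defaultdict(int)  # class predicted and actually present
    fp = defaultdict(int)  # class predicted but not present
    fn = defaultdict(int)  # class present but not predicted
    for true_class, predicted_labels in samples:
        predicted = {label.lower() for label in predicted_labels}
        for cls in classes:
            hit = cls.lower() in predicted
            is_true = cls.lower() == true_class.lower()
            if hit and is_true:
                tp[cls] += 1
            elif hit:
                fp[cls] += 1
            elif is_true:
                fn[cls] += 1
    scores = {}
    for cls in classes:
        predicted_total = tp[cls] + fp[cls]
        actual_total = tp[cls] + fn[cls]
        # None marks an undefined score (class never predicted or never present).
        precision = tp[cls] / predicted_total if predicted_total else None
        recall = tp[cls] / actual_total if actual_total else None
        scores[cls] = (precision, recall)
    return scores


# Toy data invented for illustration, not taken from the experiments.
toy_samples = [
    ("chicken", ["chicken", "food", "oven"]),
    ("chicken", ["food", "indoor"]),
    ("chicken", ["oven", "appliance"]),
    ("chicken", ["indoor"]),
    ("pizza", ["pizza", "food"]),
]
toy_scores = per_class_precision_recall(toy_samples, ["chicken", "pizza"])
print(toy_scores)
```

In the toy data the ‘chicken’ label is never wrong when it appears (precision 1.0) but appears for only one of the four chicken images (recall 0.25), which mirrors the behaviour described above.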


AWS Rekognition

The AWS Rekognition service is similar to the Azure CV module in many respects - it is a multi-label image classifier pre-trained on a huge dataset, enabling it to identify many real-world objects with decent accuracy right out of the box, without any model training. In contrast to the Azure model, the Rekognition model has 31 out of our 46 classes in its list of objects, which apparently enabled it to perform much better on our dataset: it returned an accuracy of 23%, which is still quite low for practical applications.
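For readers who want to try this themselves, here is a minimal sketch of querying Rekognition for the labels of one local image through boto3's `detect_labels` call. The article does not say which settings were used, so the `MaxLabels` and `MinConfidence` values, the helper names and the sample response values are assumptions. Running the network call requires the boto3 package plus configured AWS credentials and region; the parsing helper works on its own.

```python
def labels_from_response(response, min_confidence=50.0):
    """Extract (name, confidence) pairs from a DetectLabels response dict."""
    return [
        (label["Name"], label["Confidence"])
        for label in response.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]


def detect_labels_for_file(image_path, max_labels=10, min_confidence=50.0):
    """Send one local image to Rekognition and return its labels.

    The thresholds here are illustrative defaults, not the article's settings.
    """
    import boto3  # imported here so the parsing helper works without boto3

    client = boto3.client("rekognition")
    with open(image_path, "rb") as image_file:
        response = client.detect_labels(
            Image={"Bytes": image_file.read()},
            MaxLabels=max_labels,
            MinConfidence=min_confidence,
        )
    return labels_from_response(response, min_confidence)


# Hand-written response in the DetectLabels shape; the values are invented.
sample_response = {
    "Labels": [
        {"Name": "Food", "Confidence": 98.1},
        {"Name": "Pizza", "Confidence": 91.4},
        {"Name": "Oven", "Confidence": 42.0},
    ]
}
sample_labels = labels_from_response(sample_response)
print(sample_labels)
```

The returned label names can then be compared against the dataset's class names to score each image.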

Table (ii) shows the precision and recall scores of the model for the 31 classes it could identify. As with the Azure model, precision is quite high for most of the classes, but unlike Azure, Rekognition also has high recall for many of the classes, further establishing itself as the better model for this specific use case.

Conclusion

Popular deep learning solutions like the Microsoft Azure Computer Vision module and the AWS Rekognition service can be a good choice for rapid deployment of many real-world computer vision applications. But certain specialized use cases, like identifying raw food items placed in an oven, can make them sweat. In our experiments the Azure CV module and the AWS Rekognition service scored 3.9% and 23% accuracy, respectively, on our raw-food dataset. In contrast, CNN models that we trained ourselves on food images collected from the internet yielded up to 65% accuracy on the same dataset we used to test Azure CV and AWS Rekognition. Going for a custom solution provides the opportunity to better curate the dataset, apply task-specific data augmentations and control other aspects of model training, which could substantially improve performance.
