Reimagining Online Shopping Using Multimodal Search Engines
The ultimate search engine would basically understand everything in the world, and it would always give you the right thing. And we're a long, long ways from that. - Larry Page
Most search engines today enable a user to look for results based on a "Text" query or an "Image" query. The problem with these standalone approaches is that they tend to limit user expression. A text query or an image query might not always help capture a user's intuition. Let's take the following image as an example:
Traditional search engines can either help me find shirts that look very similar to the above image or use product descriptions such as "full-sleeved blue shirt with stripes" in a text query to find matches. But I cannot use a combination of both "Text" and "Image" to convey what I am looking for. For example, I can't say 'remove stripes' or 'add pocket' while looking for products similar to the shirt in the above image.
As search engines evolve, it is important that they are able to combine information from multiple sources (e.g. images, text, audio, video) to help any given user capture their intuition. This would bring us a step closer to the "Ultimate Search Engine" that Larry Page talks about.
Our team at fellowship.ai decided to prototype a multimodal search solution that would enable a user to simultaneously make use of both visual and textual information to navigate through a catalog of clothes. This project was completed over a period of about 5-6 weeks and makes use of a custom dataset that we curated by scraping a variety of online apparel stores. Our solution is captured by the following demo loop:
We have restricted the scope of our search engine to an e-commerce use case: online apparel shopping. Apparel products have strong visual and stylistic components that are difficult to capture in words, which makes them a good use case for a multimodal search engine, as textual descriptions alone cannot capture all of the information in a product.
The upcoming section explains the shortcomings of existing e-commerce search engines and how our multimodal solution uses 'Joint Embeddings' to address these shortcomings.
Existing approaches and the need for Joint Embeddings
Product search on most e-commerce platforms is driven by two major approaches:
- Text based: This approach tries to match the user's query against product titles/descriptions.
- Product similarity: This approach looks for similar products based on image similarity and/or similar "product descriptions".
The problem with the above approaches is that the user query is either "Text" or "Image". There is no way for us to transfer the information contained in an arbitrary wild image into "product descriptions". Similarly, finding similar products based on "product descriptions" depends entirely on the quality of those descriptions: mistakes in product descriptions lead to incorrect search results. This calls for a solution that enables users to dynamically search for products by combining information from both "Image" and "Text".
We therefore propose a solution based on the Amazon paper "Joint Visual-Textual Embedding for Multimodal Style Search" (https://arxiv.org/pdf/1906.06620.pdf). This solution involves training a neural embedding that can simultaneously represent information from an "Image" and its associated "Product Description".
Let's take an example to make things clearer.
We want to build a joint embedding that can deduce a product description (i.e. shirt, blue, stripes, full-sleeved) from this image. The same joint embedding can also accept a product description and find an appropriate product that matches it. All of this happens dynamically, unlike the earlier methods, where one always had to have reliable product descriptions to find appropriate matches.
Having a joint embedding also enables query arithmetic to add/subtract custom features.
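Query arithmetic amounts to adding or subtracting attribute vectors in the shared embedding space and then retrieving the nearest catalog item. The sketch below illustrates the idea with a toy 4-dimensional space and hand-picked vectors; in the real system the catalog and attribute embeddings come from the trained joint embedder, and the dimensionality and weighting are implementation choices, not values from the paper.

```python
import numpy as np

# Toy catalog of product embeddings (hypothetical 4-dim space).
# In practice these come from the trained joint embedder.
catalog = {
    "striped blue shirt": np.array([0.9, 0.1, 0.4, 0.0]),
    "plain blue shirt":   np.array([0.9, 0.1, 0.0, 0.0]),
    "striped red shirt":  np.array([0.1, 0.9, 0.4, 0.0]),
}
catalog = {k: v / np.linalg.norm(v) for k, v in catalog.items()}

# Attribute direction from the textual side of the joint embedding (assumed).
stripes = np.array([0.0, 0.0, 1.0, 0.0])

# Query: "the striped blue shirt in this image, but remove the stripes".
query = catalog["striped blue shirt"] - 0.4 * stripes
query /= np.linalg.norm(query)

# Nearest neighbour by cosine similarity.
best = max(catalog, key=lambda k: catalog[k] @ query)
print(best)  # -> plain blue shirt
```

The same mechanism with a `+` implements "add pocket": move the query toward the attribute's direction instead of away from it.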
Understanding the Architecture of the Joint Embedding
The above architecture consists of two stages: the Joint Embedder and the Attribute Extraction branch.
The Joint Embedder can accept either a 'Product Image' or a 'Product Description' as input, generating a vector fI or fT that uniquely represents any given product.
- We use a ResNet-18 CNN to extract image encodings from a product's display image.
- These encodings are then sent to a fully connected layer to project them into the vector space of Textual Embeddings.
- This final projection of visual features is denoted by fI.
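The visual branch above can be sketched as follows. The 512-dim feature size matches ResNet-18's pooled output; the 300-dim target space and the random weights are placeholders for a trained projection layer (the paper's actual embedding dimension may differ), and the CNN features are faked with random numbers to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 512-dim pooled output of a ResNet-18 backbone.
# In a real pipeline this comes from the pretrained CNN, not random numbers.
cnn_features = rng.standard_normal(512)

# Fully connected layer projecting visual features into the (assumed)
# 300-dim space shared with the textual embeddings.
W = rng.standard_normal((300, 512)) * 0.02
b = np.zeros(300)

f_I = W @ cnn_features + b           # projected visual embedding
f_I = f_I / np.linalg.norm(f_I)      # L2-normalize for cosine comparison

print(f_I.shape)  # (300,)
```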
- Text from product descriptions is treated as a bag of words.
- This bag of words is preprocessed to remove stop words, deduplicated (by stemming), and filtered to remove words that do not contain any information relevant to the product.
- We now transform these words from product descriptions into word embeddings using a language model such as BERT and sum them to get a final vector fT.
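The textual branch can be sketched end to end as below. A static lookup table stands in for the language-model embeddings (the post mentions BERT; a fixed table keeps the sketch self-contained), and the suffix-stripping "stemmer", stop-word list, and 300-dim size are toy placeholders, not the actual preprocessing pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pretrained word-embedding table: vocab word -> 300-dim vector.
vocab = ["shirt", "blue", "stripe", "sleeve", "pocket"]
embed = {w: rng.standard_normal(300) for w in vocab}

stop_words = {"a", "with", "the", "and"}

def text_embedding(description: str) -> np.ndarray:
    """Bag of words -> preprocessed words -> summed embeddings -> f_T."""
    words = []
    for tok in description.lower().split():
        tok = tok.rstrip("sd")        # toy stemmer: "stripes" -> "stripe"
        if tok not in stop_words and tok in embed:
            words.append(tok)
    words = sorted(set(words))        # deduplicate; order does not matter
    f_T = np.sum([embed[w] for w in words], axis=0)
    return f_T / np.linalg.norm(f_T)  # L2-normalize

f_T = text_embedding("a blue shirt with stripes")
print(f_T.shape)  # (300,)
```

Because the words are summed as a bag, "blue shirt" and "shirt blue" yield the same vector, which is exactly the order-insensitivity the bag-of-words treatment implies.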
The objective of our training is to minimize the distance between fI and fT. This is represented by the Mini-Batch Match Retrieval objective, LMBMR.
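One common formulation of an in-batch match-retrieval objective along these lines is a softmax cross-entropy over the similarity matrix: each image embedding should score its own product description higher than the other descriptions in the mini-batch, and vice versa. The sketch below is an illustrative numpy version under that assumption (the temperature value and the symmetric averaging are choices made here, not taken from the paper).

```python
import numpy as np

def mbmr_loss(f_I, f_T, temperature=0.1):
    """In-batch match-retrieval loss over (image, text) pairs.
    Row i of f_I and row i of f_T belong to the same product."""
    # Cosine similarities between every image and every text in the batch.
    f_I = f_I / np.linalg.norm(f_I, axis=1, keepdims=True)
    f_T = f_T / np.linalg.norm(f_T, axis=1, keepdims=True)
    sim = (f_I @ f_T.T) / temperature          # shape (B, B)

    # Softmax cross-entropy with the matching pair on the diagonal.
    def xent(logits):
        logits = logits - logits.max(axis=1, keepdims=True)
        log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.diag(log_p).mean()

    # Symmetric: image-to-text retrieval and text-to-image retrieval.
    return (xent(sim) + xent(sim.T)) / 2

rng = np.random.default_rng(2)
B, D = 8, 300
f_T = rng.standard_normal((B, D))
f_I = f_T + 0.05 * rng.standard_normal((B, D))  # well-aligned pairs
print(round(float(mbmr_loss(f_I, f_T)), 4))     # near zero for aligned pairs
```

Minimizing this loss pulls each fI toward its matching fT while pushing it away from the other descriptions in the batch.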
Attribute Extraction consists of a fully connected layer that projects visual embeddings from the Joint Embedder into a vector of fixed dimension |V|, followed by a sigmoid. Here, |V| is the size of the vocabulary, predetermined from the textual product-description metadata in the training set.
The outputs of this sigmoid branch, Pw(I), are approximations of the probability that each word "w" in the vocabulary belongs to image "I". These predictions are then used to generate the product description for the product image.
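The attribute-extraction head can be sketched as below. The five-word vocabulary, the 0.5 threshold for emitting a word, and the random weights are illustrative placeholders; a trained system would learn W and b and pick the threshold on validation data.

```python
import numpy as np

rng = np.random.default_rng(3)

vocab = ["shirt", "blue", "stripes", "pocket", "sleeve"]  # toy vocabulary, |V| = 5
D = 300                                                   # assumed embedding size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fully connected layer: visual embedding (D) -> per-word logits (|V|),
# followed by a sigmoid to get P_w(I) for each vocabulary word.
W = rng.standard_normal((len(vocab), D)) * 0.05
b = np.zeros(len(vocab))

f_I = rng.standard_normal(D)        # stand-in visual embedding
p = sigmoid(W @ f_I + b)            # P_w(I), one probability per word

# Generate a description by keeping words above a probability threshold.
description = [w for w, prob in zip(vocab, p) if prob > 0.5]
print(p.shape, description)
```

Because each word gets an independent sigmoid (rather than a softmax over the vocabulary), several attributes can fire at once, which is what a multi-word product description requires.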
The current multimodal search solution works better on a fixed vocabulary than on wild text queries. This fixed vocabulary is derived from the dataset that our model is trained on. We aim to incorporate an auto-suggest mechanism that can capture a user's intuition and suggest appropriate attributes from our vocabulary. This remains an active area of research in our fellowship.
Our multimodal search solution is currently limited to apparel, though we have also made efforts to replicate it on other categories such as shoes, watches and handbags. Do reach out to us at firstname.lastname@example.org for a demo or if you feel that our solution can benefit you in any manner.
About the Author and Fellowship.AI:
My name is Divyam Shah and I worked on this project alongside my teammate Shashank Pathak as part of my externship with Launchpad.AI (the commercial arm of fellowship.ai). Prior to joining fellowship.ai, I worked at a telematics startup named Zendrive, where I worked on scaling backend systems. Fellowship.ai enabled me to gain valuable experience in applying cutting-edge machine learning research to a variety of practical problems, which helped me transition my career to data science. I am also deeply interested in subjects such as History, Public Policy and Politics; these help me understand problems from a broader perspective. I hope to someday use my technical expertise to solve a significant societal problem.
Fellowship.ai brings together people from diverse backgrounds and gives them an opportunity to apply cutting-edge machine learning research to a variety of practical problems. This helps fellows kick-start their careers as data scientists. Do apply to get involved in such projects!