Product Recommendation by Image Similarity

Hello everyone.

This is a basic but useful technique.

This post shows how comfortable and practical it was to build product recommendations from image similarity using Python 3.6 and a few libraries.

The examples use Fashion MNIST, a fashion dataset in the same format as the MNIST digits dataset.

I liked the results and used the same approach in production at work, for the same goal as this post: product recommendation by image similarity distance.

See my GitHub repository for the sources.

Intro

To make it easier to follow, here are the steps this application goes through:

Create Model / Train Model / Create Recommendations

Feature extraction
- Using ResNet50
Training
- Using all extracted features
- Create a train set of features
Model prediction
- Using the whole test set
Save the predictions
- Create a recommendation set

Inference

Feature extraction
- Using ResNet50
Model prediction
- Load the model and weights
Predictions
- Using the inference image
Recommendation
- Get the nearest features from the recommendation set

I tested it by training the categorical model on sets of 3, 12, 50 and 100 thousand CGI product images, and the recommendations were good and accurate.

What you’ll see in this post:

Fashion MNIST dataset;
Model creation and feature extraction using ResNet50 with ImageNet;
One Hot Encoding for products’ classes;
Handling an imbalanced dataset;
Model creation, classification and feature extraction using a Sequential model;
Hyperparameters and training;
Inference and distance similarity of features.

Along the way, you’ll also see normalization, loss and accuracy curves, plots of the inference image and its results, and more.

This application was created and validated using an Nvidia GTX 1050 Ti GPU with 4 GB of memory and an Nvidia Tesla K80 with 24 GB of memory.

Imports

Here are all the imports used in this example.

from time import time
import matplotlib.pyplot as plt
from matplotlib import cm
import json

import numpy as np
from PIL import Image
from tqdm import tqdm

import tensorflow as tf
from keras.callbacks import TensorBoard

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.utils.class_weight import compute_class_weight
from scipy.spatial import distance

from keras import datasets
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout
from keras.preprocessing import image
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.activations import relu, softmax
from keras.optimizers import Adam
from keras.losses import categorical_crossentropy
from keras import utils as kutils
from keras import backend as K

I used the following versions:

sklearn: 0.19.2
scipy: 1.1.0
keras: 2.2.2
tensorflow-gpu: 1.10.1
CUDA: 9.0

Dataset

Below you can see how to get the data using Keras datasets.

I’m using Keras in this case, but you can download or create your own dataset for this post.

This Keras dataset is already split into train and test samples: 60k train images and 10k test images.

(x_train, y_train), (x_test, y_test) = datasets.fashion_mnist.load_data()

It’s an MNIST-based dataset, so you can run this same example with the MNIST digits: just replace the code above with the line below.

(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()

The whole example will then run with digit images instead of fashion images.

Image Feature Extractor Model

Next, we extract features from the dataset images so that they can later be classified.
But first, let’s create our ResNet50 feature extractor model.

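# Create the TensorFlow session configuration (GPU options can be set on it)
config = tf.ConfigProto()
tf.Session(config=config)

# ResNet50 without the classification top; average pooling gives one vector per image
model_fe = ResNet50(
	weights='imagenet',
	pooling='avg',
	include_top=False
)

def feature_extract(img):
	# Grayscale array -> image through the gray colormap, resized to 224x224 RGB
	img = Image.fromarray(cm.gray(img, bytes=True))
	img = img.resize((224, 224))
	img = img.convert('RGB')

	# Add a batch dimension and apply the ResNet50 preprocessing
	x = image.img_to_array(img)
	x = np.expand_dims(x, axis=0)
	x = preprocess_input(x)

	# Run the extractor and return the single feature vector
	feature = model_fe.predict(x)[0]
	return feature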

First, prepare the TensorFlow configuration.

Inside this config, you can set some parameters for TensorFlow GPU Configuration.

With the configuration in place, I created a ResNet50 model using weights pre-trained on ImageNet, which I saw as my best and fastest option for extracting product image features. I also wrote a function that receives a product image (from our training set, test set or inference), extracts its features, and returns them.

This function returns the output of the last ResNet50 layer kept in this model (the global average pooling layer, since the classification top is excluded), which is a vector with 2048 extracted features.
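To make the 2048-value output more concrete, here is a minimal NumPy sketch of what `pooling='avg'` does. It assumes the last convolutional block produces a 7 x 7 x 2048 feature map for a 224 x 224 input, and it uses random numbers in place of real activations.

```python
import numpy as np

# Stand-in for the last convolutional feature map of one image
# (random values; the 7 x 7 x 2048 shape is an assumption for 224 x 224 inputs).
rng = np.random.RandomState(0)
feature_map = rng.rand(7, 7, 2048)

# Global average pooling: average over the two spatial axes,
# leaving one value per channel.
pooled = feature_map.mean(axis=(0, 1))

print(pooled.shape)
```

Each of the 2048 values summarizes one channel over the whole image, which is why the vector works as a compact description of the product picture.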

A quick aside: many tutorials stop at this point and make recommendations using only these extracted features. Some of them use other feature extractors, such as VGG16 or VGG19.

That works, but sometimes a categorical model is needed on top to reduce the features and get closer, more accurate matches. I used VGG16 in my first test, but when I ran inference with some square light pendants, the results were square sofas (funny, no?!). It actually makes sense, because those sofas really did look like the light pendants, so I switched to ResNet50.

After that, I improved things further by creating a model that classifies my features into my classes, to prevent this kind of error (results coming from other classes).

After creating the feature extractor model, I stored the features and labels (classes) in new arrays, to be used later to train my classification model.

PS: You can save them to one or more files using Pickle (not recommended) or HDF5 (recommended), see more about HDF5 here.

In this case, I’m keeping the extracted features in memory (as in the example below).

x_data = []
y_data = []

for x, y in tqdm(list(zip(x_train, y_train))[:15000]):
	x_data.append(feature_extract(x))
	y_data.append(y)

PS: I only extracted features from the first 15k images, to be used later to train the categorical model.

Normalization

Even with the categorical model, I had problems: inference showed terrible results. So I decided to apply normalization (standardization).

scaler = StandardScaler()
x_data = scaler.fit_transform(x_data)

It worked much better (as if the model had started wearing glasses).
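For reference, standardization rescales each feature column to zero mean and unit variance. The sketch below reproduces that arithmetic with plain NumPy on made-up data; it is only an illustration of the idea, not the StandardScaler implementation.

```python
import numpy as np

# Three toy samples with two features on very different scales.
x = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
mean = x.mean(axis=0)
std = x.std(axis=0)
x_scaled = (x - mean) / std

print(x_scaled)
```

After this step both columns live on the same scale, so no single feature dominates the training or the distance computation just because its raw values are larger.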

One Hot Encoding

In the Fashion MNIST and Digits MNIST datasets, all the classes are numeric.

In this example, I run the labels through the same encoding path I would use for text classes, so that text classes can be tested too: LabelEncoder maps each class to an integer, which is then converted to a binary (one-hot) vector.

int_y_data = LabelEncoder().fit_transform(y_data)
unique_int_y_data = np.unique(int_y_data)
num_classes = len(unique_int_y_data)
y_data = kutils.to_categorical(int_y_data, num_classes=num_classes)

The process is explained below.
Suppose we have three text categories:

category_1
category_2
category_3

Using LabelEncoder, these three categories are mapped to integers:

0 – for “category_1”
1 – for “category_2”
2 – for “category_3”

As a final step, we create binary (one-hot) classes with the Keras to_categorical utility, and the results are:

[1, 0, 0] – for class 0, which is “category_1”
[0, 1, 0] – for class 1, which is “category_2”
[0, 0, 1] – for class 2, which is “category_3”

If your classes are already integers, you can use them as they are.
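The two encoding steps can be sketched with plain Python and NumPy. This is a hand-written illustration on made-up labels that mimics what LabelEncoder and to_categorical produce; it is not their implementation.

```python
import numpy as np

labels = ['category_2', 'category_1', 'category_3', 'category_1']

# Step 1: map each distinct label to an integer, in sorted order.
classes = sorted(set(labels))
class_to_int = {name: i for i, name in enumerate(classes)}
int_labels = np.array([class_to_int[name] for name in labels])

# Step 2: turn each integer into a one-hot row.
one_hot = np.eye(len(classes))[int_labels]

print(int_labels)
print(one_hot)
```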

Imbalanced Datasets

In my own datasets, I had a problem with class imbalance. I found a good solution: class weights calculated from the proportion of each class in the dataset.

The code below does the job well. It computes one weight per class from the class frequencies in your dataset.

# Keras fit() expects a dict mapping each class index to its weight
class_weight = dict(zip(unique_int_y_data, compute_class_weight('balanced', unique_int_y_data, int_y_data)))

Here are some outputs from one of my tests, showing the weight computed for each class:

[(0, 1.0940919037199124),
 (1, 0.8992805755395683),
 (2, 0.9920634920634921),
 (3, 0.998003992015968),
 (4, 1.0245901639344261),
 (5, 1.0141987829614605),
 (6, 1.0141987829614605),
 (7, 0.9765625),
 (8, 1.0204081632653061),
 (9, 0.9881422924901185)]
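The 'balanced' mode follows the heuristic documented by scikit-learn: n_samples / (n_classes * samples_in_class). The sketch below reproduces it with NumPy on a small made-up label array, so the numbers are illustrative only.

```python
import numpy as np

# Toy integer labels: class 0 is over-represented.
y = np.array([0, 0, 0, 0, 1, 1, 2, 2])

n_samples = len(y)
counts = np.bincount(y)      # samples per class
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * samples_in_class)
weights = n_samples / (n_classes * counts)

# Keras takes class weights as a dict of {class_index: weight}.
class_weight = dict(enumerate(weights))
print(class_weight)
```

Rare classes get a weight above 1 and frequent classes a weight below 1, so every class contributes about equally to the loss.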

Feature Categorize Model

After all the extraction, I created a simple classifier fitted to my dataset.

In this case, I’m using a Keras Sequential model.

mid_layer_name = 'dense_mid'

model_c = Sequential()
model_c.add(Dense(1024, input_dim=2048, activation=relu))
model_c.add(Dropout(.7))
model_c.add(Dense(512, activation=relu, name=mid_layer_name))
model_c.add(Dropout(.7))
model_c.add(Dense(num_classes, activation=softmax))

adam_optimizer = Adam(lr=.0004)
model_c.compile(
	loss=categorical_crossentropy,
	optimizer=adam_optimizer,
	metrics=['accuracy']
)

As shown in the code above, I created a Sequential model with:

Dense input layer, activated by ReLU, which receives the features extracted from ResNet50, with input size 2048 and output size 1024.
Dropout layer to prevent overfitting.
Dense middle layer, activated by ReLU, with input size 1024 and output size 512. This layer is named (you’ll see why later).
Dropout layer to prevent overfitting.
Dense layer, activated by Softmax, with input size 512 and the number of classes as output size.

The model was created to classify the features extracted from ResNet50 (and thus adapt them to my dataset) while also performing a dimensionality reduction, from 2048 features to 512 in the middle layer (in this case I kept the features in memory to be practical).
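To get a feeling for the size of this classifier, the parameters of each Dense layer can be counted by hand. The sketch below assumes the 10 classes of Fashion MNIST; the helper `dense_params` is hypothetical and only does the arithmetic (Dropout layers add no parameters).

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

num_classes = 10  # assumption: Fashion MNIST has 10 classes

layers = [
    ('dense input', 2048, 1024),
    ('dense middle', 1024, 512),
    ('dense output', 512, num_classes),
]

total = 0
for name, n_in, n_out in layers:
    params = dense_params(n_in, n_out)
    total += params
    print('{}: {} -> {} ({} parameters)'.format(name, n_in, n_out, params))

print('total trainable parameters:', total)
```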

After creating the model, it has to be compiled. You can use different losses and optimizers; in this case, I used categorical crossentropy as the loss function and Adam as the optimizer. I monitored accuracy to judge the performance of the model (if your dataset is imbalanced, you might also want to track other metrics).

If you are using integer classes, you don’t need to convert them to binary classes: use the integer labels directly with sparse categorical crossentropy as the loss.
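The two losses compute the same value and differ only in the label format they accept. Here is a small NumPy check on made-up softmax outputs; it illustrates the formulas and is not the Keras implementation.

```python
import numpy as np

# Softmax outputs for two samples over three classes (made-up values).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

int_labels = np.array([0, 2])      # integer classes
one_hot = np.eye(3)[int_labels]    # the same classes, one-hot encoded

# Categorical crossentropy: -sum(one_hot * log(probs)) for each sample.
categorical = -(one_hot * np.log(probs)).sum(axis=1)

# Sparse categorical crossentropy: -log of the true class probability.
sparse = -np.log(probs[np.arange(len(int_labels)), int_labels])

print(categorical, sparse)
```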

Why is Adam’s learning rate set to 0.0004? Because I did a non-exhaustive hyperparameter search and landed on that particular value.

By default, Adam uses a learning rate of 0.001. In the next topic, you’ll see more about the learning rate and training.

Training

Now comes the magic step. After creating the model and setting some magic numbers for the optimizer, it’s time to train the model with the features extracted from the images, as described earlier.

model_history = model_c.fit(
	x_data,
	y_data,
	epochs=32,
	batch_size=64,
	validation_split=.2,
	class_weight=class_weight,
	verbose=0
)

Fitting is the process in which the model learns from the training data; how well it generalizes to similar data is then measured on the held-out validation samples.

This function receives the data (the features extracted from ResNet50, in our case) and the category label for each sample of the dataset.

It was set to 32 epochs and a batch size of 64 samples. To understand this, suppose a dataset has 10 samples, the batch size is 2, and you ask for 3 epochs. In each epoch you’ll have 5 batches (10 / 2 = 5), and each batch is passed through the algorithm once, so you have 5 iterations per epoch. With 3 epochs, that is a total of 15 iterations (5 * 3 = 15), or steps, of training.
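The same arithmetic, written as a small helper. The function name `training_iterations` is hypothetical, and the second call assumes that 12,000 of the 15k extracted samples remain for training after a 20% validation split.

```python
import math

def training_iterations(n_samples, batch_size, epochs):
    """Return (batches per epoch, total iterations) for a training run."""
    # The last batch may be smaller than the others, so round up.
    per_epoch = math.ceil(n_samples / batch_size)
    return per_epoch, per_epoch * epochs

# The toy example from the text: 10 samples, batch size 2, 3 epochs.
toy = training_iterations(10, 2, 3)
print(toy)

# The settings of this post, under the 12,000-sample assumption above.
post = training_iterations(12000, 64, 32)
print(post)
```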

I chose to train the model on part of the Fashion MNIST train split (only 15k samples were used, with 20% of them held out to evaluate the model) and to use part of the test split (5k samples) as the products for recommendations. PS: Take a look at the process steps at the beginning of the post.
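According to the Keras documentation, validation_split holds out the last fraction of the samples, taken before any shuffling. The sketch below only illustrates which index ranges end up in each part for the numbers used in this post; it does not call Keras.

```python
n_samples = 15000
validation_split = 0.2

# Index where the held-out tail starts (the last samples are used
# for validation, without shuffling them first).
split_at = int(n_samples * (1.0 - validation_split))

train_idx = range(0, split_at)
val_idx = range(split_at, n_samples)

print(len(train_idx), len(val_idx))
```

If your own dataset is ordered by class, it may be worth shuffling it before calling fit, otherwise the validation part could contain only the last classes.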

To handle the imbalanced dataset, I passed the class weights (as seen before) as a parameter to this function.

The result of fit is a history object that records the training loss and metric values at successive epochs, as well as the validation loss and validation metric values (if applicable). The metric is accuracy in this case.

PS: In this case, I’m not saving the model or its weights, but you can read more about the save and load functions for models and weights here.

I used the history object to plot the accuracy and loss curves, as seen below:

plt.plot(model_history.history['acc'])
plt.plot(model_history.history['val_acc'])
plt.title('Model Accuracy Curve')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='center right')
plt.show()

plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('Model Loss Curve')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='center right')
plt.show()

You can read more about learning rate and performance improvement here. I like the explanation, which is easy to understand.

Predict Data

After training the model, I used the output of the middle layer to create the predicted dataset, which is later used with a distance metric to rank items by semantic similarity.

I created a new model whose inputs and outputs are based on the categorical model above. It uses the middle layer because the last layer of the model, activated by Softmax, only returns the class probabilities. I also created a helper function that returns the prediction of this middle-layer model for a feature vector.

model_c_mid_layer = Model(
	inputs=model_c.inputs,
	outputs=model_c.get_layer(mid_layer_name).output
)

def mid_l_predict(feature):
	predict = model_c_mid_layer.predict(np.asarray([feature]))[0]
	return predict

In the code below, I create three lists to keep the middle-layer predictions, the class labels, and the images used to create each prediction.

After creating these auxiliary lists, I iterate over the test dataset (data and labels) to build the predicted dataset. In each iteration, I extract the features with the ResNet50 model, compute the middle-layer features for them, and add the result to the predicted dataset.

reco_predicts = []
reco_label = []
reco_img = []

for x_t, y_t in tqdm(list(zip(x_test, y_test))[:5000]):
	fe = feature_extract(x_t)
	reco_predicts.append(mid_l_predict(fe))
	reco_label.append(y_t)
	reco_img.append(x_t)

PS: I’m using the first 5k images from the test split to create the dataset used as recommendations in the inference step.

With the predicted dataset ready, I normalize all the data, as seen in the normalization topic.

scaler = StandardScaler()
reco_predicts = scaler.fit_transform(reco_predicts)
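One detail worth double-checking in the code of this post: the classifier is trained on standardized ResNet50 features, while the recommendation loop and the inference step pass the raw output of feature_extract to the model, and the `scaler` variable is reassigned here to a scaler fitted on the middle-layer features. A possible variant, sketched below under stated assumptions, keeps one scaler per stage under separate names. The names `feature_scaler`, `mid_scaler` and `fake_mid_layer` are hypothetical, the tiny `Standardizer` class only mimics the fit/transform idea of StandardScaler with NumPy, and random data stands in for real features.

```python
import numpy as np

class Standardizer(object):
    """Minimal stand-in for StandardScaler (hypothetical helper)."""

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        self.std_ = x.std(axis=0)
        return self

    def transform(self, x):
        return (x - self.mean_) / self.std_

rng = np.random.RandomState(0)
train_features = rng.rand(100, 8) * 50   # stand-in for ResNet50 train features
reco_features = rng.rand(20, 8) * 50     # stand-in for recommendation images
query_feature = rng.rand(1, 8) * 50      # stand-in for the inference image

def fake_mid_layer(x):
    # Stand-in for the trained middle-layer model (keeps the first 4 columns).
    return x[:, :4]

# Scaler 1: fitted once on the training features and reused everywhere.
feature_scaler = Standardizer().fit(train_features)

# Scaler 2: fitted on the middle-layer outputs of the recommendation set.
reco_mid = fake_mid_layer(feature_scaler.transform(reco_features))
mid_scaler = Standardizer().fit(reco_mid)
reco_predicts = mid_scaler.transform(reco_mid)

# Inference applies the same two scalers, in the same order.
query_mid = fake_mid_layer(feature_scaler.transform(query_feature))
query_predict = mid_scaler.transform(query_mid)

print(reco_predicts.shape, query_predict.shape)
```

The point of the design is that every vector reaching the classifier or the distance computation has gone through exactly the transformations used when that stage was fitted.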

Inference

Now let’s run some inference to see how useful this code is. Take a look here to understand the difference between training and inference.

First, extract the ResNet50 features, predict the class for these features, then extract the middle-layer features (from the trained model) and normalize them so they can be compared by distance.

With this data in hand, I measured the distances from the middle-layer features of the inferred image to every item in the predicted dataset and took the indexes of the six nearest predictions.

Why do I keep the indexes? To show the images in plots afterwards. In other cases, you might use some other identifier, such as product IDs or a database index.

test_idx = 5033

fe = feature_extract(x_test[test_idx])
predict_class = model_c.predict_classes(np.asarray([fe]))
predict_mid = mid_l_predict(fe)
predict_mid = scaler.transform([predict_mid])[0]

dists = [(i, distance.euclidean(reco_predicts[i], predict_mid)) for i in range(len(reco_predicts))]
dists.sort(key=lambda x: x[1])
dists = dists[:6]
dists_idxs = [d[0] for d in dists]
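For larger recommendation sets, the same nearest-neighbor lookup can be done without a Python loop. The sketch below is one possible vectorized variant using NumPy; it reuses the variable names from the code above, but random arrays stand in for the real middle-layer features.

```python
import numpy as np

rng = np.random.RandomState(0)
reco_predicts = rng.rand(1000, 512)   # stand-in for the recommendation set
predict_mid = rng.rand(512)           # stand-in for the query features

# Euclidean distance from the query to every row, in a single call.
dists = np.linalg.norm(reco_predicts - predict_mid, axis=1)

# Indexes of the six smallest distances, nearest first.
dists_idxs = np.argsort(dists)[:6]

print(dists_idxs, dists[dists_idxs])
```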

Now I plot the inferred image and the recommended results, along with their test dataset indexes.

print('Test Item: #{}'.format(test_idx))
print('Label Class: {}\nPredict Class: {}'.format(
	y_test[test_idx],
	predict_class[0]
))

plt.imshow(x_test[test_idx], cmap=plt.get_cmap('gray'))
plt.show()

print('Recommended results: #{}'.format(dists_idxs))
print('Classes: #{}'.format([reco_label[i] for i in dists_idxs]))

for index, item_index in enumerate(dists_idxs):
	plt.subplot(2, 3, index + 1)
	plt.imshow(reco_img[item_index], cmap=plt.get_cmap('gray'))
plt.show()

The inferred index, class, and image were:

Test Item: #5033
Label Class: 9
Predict Class: 9

The resulting indexes, classes, and images were:

Recommended results: #[3317, 1017, 940, 2405, 181, 3184]
Classes: #[9, 9, 9, 9, 9, 9]

Conclusion

I’m trying to help the community with this post. I couldn’t find any complete post that brought all these subjects together, so I had the idea of writing one.

As I said at the beginning, it’s basic and easy, and it works well. I’m very proud of the results.

All the code was researched, developed and tested in about a month and a half.

There are many other subjects that I didn’t cover in this post, such as using K-Fold cross-validation to validate the accuracy of the categorical model, which is a more reliable way to check how well the model fits.

I’m very grateful to Christian Perone, a professional I admire and my brother, who helped me improve my knowledge, gave me tips to achieve these excellent results, and encouraged me to start this blog.

See you in the next post.

Regards, Perone.