Hello!
I’m using the EfficientNet model and its 1280-dimensional vector output for image similarity. It’s working great and, from my testing, it’s the best model I’ve found for the amount of data being used.
Sometimes it just does weird things though.
Here are the images in question (Imgur link).
The first image is the input, the second is the one that should be found, and the third is the one that is actually returned as the closest match.
I’m using a compare script, and these are the results for those images:
input against image that should be found (img1 vs img2)
I don’t understand how img3 can be closer to the input than img2. It works most of the time; this is a weird outlier.
Any ideas how this can be solved?
No, I’m using pre-trained weights. I would train it, but I’m not sure what the correct course of action is or what the benefit would be. The image database is filled with only one class, namely stamps, so any classification approach goes out the window.
Thank you very much for your suggestions!
I’d stumbled upon metric learning before, but how to implement it in my case went over my head. I’ve figured it out now; it wasn’t too difficult once it clicked with the link you provided, and the tests I’ve done show good results. The distance is much, much closer now, sometimes nearly 0, which is a big win!
I’m still struggling to understand what’s really going on here and how this works. From my understanding, every layer has weights that can be tuned, and in fine-tuning you freeze most of the pre-trained weights. I’m not freezing any right now, and I wonder if that’s the best thing to do.
My intuition says to freeze every layer except the last one I use, the avg_pool, but that layer and the ones just before it don’t hold many weights. I fear I’d skew the weights too much with my limited dataset.
Any suggestions on this, or do you think it’s alright?
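For reference, this is roughly the setup I’m experimenting with (just a sketch; the layer-name prefixes come from the Keras EfficientNetB0 implementation):

```python
import tensorflow as tf

base = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg")

# Freeze everything first, then unfreeze only the top block and the final
# conv/pooling stage, so the limited data can't skew the early layers.
base.trainable = True
for layer in base.layers:
    layer.trainable = layer.name.startswith(("block7", "top"))
    # Keeping BatchNormalization layers frozen is the usual advice when
    # fine-tuning with little data.
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False
```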
The goal is to take a camera shot of a real stamp and find the correct match in a database of 350k+ unique stamp images.
Most of the results are pretty accurate and good, but it could be better, which is why I’m looking to train the model further.
For a lot of stamps I have real camera photos, 10-20 or more, which I could compile, but:
With metric learning I’m running into the problem that a label is expected, and since I technically have over 350k labels, I don’t know how to deal with that. In principle I think metric learning is the right solution, but I’m a little stuck right now.
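To make the label question concrete, this is the sort of setup I mean (a sketch assuming TensorFlow Addons’ TripletSemiHardLoss; the sizes and names are illustrative):

```python
import tensorflow as tf
import tensorflow_addons as tfa

base = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg")

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs)
# L2-normalize so the loss works on unit-length embeddings. There is no
# classification head here; labels are just integer stamp IDs, and the
# loss only compares the labels that co-occur within a batch.
outputs = tf.keras.layers.Lambda(
    lambda t: tf.math.l2_normalize(t, axis=1))(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=tfa.losses.TripletSemiHardLoss())
# Batches need several photos of the same stamp ID together so the
# semi-hard mining can form triplets.
```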
If you need to retrieve the image from a real camera picture and you don’t have too many real camera images in the wild in your training dataset, you will probably need to care about augmentations.
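Something along these lines, using the Keras preprocessing layers (a sketch; the exact ranges are guesses you’d want to tune for stamps):

```python
import tensorflow as tf

# Simulate camera-shot conditions (slight rotation, zoom, shift, and
# lighting changes) so clean database scans match real photos at query time.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.03),        # roughly ±10 degrees
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomTranslation(0.05, 0.05),
    tf.keras.layers.RandomBrightness(0.2),       # needs TF >= 2.9
    tf.keras.layers.RandomContrast(0.2),
])

# Applied only during training, e.g.:
# train_ds = train_ds.map(lambda img, lbl: (augment(img, training=True), lbl))
```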
How should I handle 350k+ classes? Overall I would need at least 1 million, because that’s roughly the number of unique stamps that exist.
If I try with 1 million labels I run into OOM errors. This conceptual problem keeps me from doing any meaningful training.
For the augmentation I was only talking about this:
For a lot of stamps I have real camera photos, 10-20
If you are really going to have a large-scale classification problem, it is going to be quite similar to the proposed solutions for large-scale face recognition.
E.g. the Glint360K dataset has 360k identities, so quite similar to your 350k+, but with other tricks you can also scale to 10M, 20M, 30M, 100M. See:
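The core trick in those papers is a margin-based softmax head such as ArcFace; here is a minimal sketch of the idea (the Partial FC variant used with Glint360K additionally samples only a subset of the class centers per step, which is what makes tens of millions of classes fit in memory):

```python
import tensorflow as tf

class ArcMarginHead(tf.keras.layers.Layer):
    """ArcFace-style additive angular margin head (illustrative sketch)."""

    def __init__(self, num_classes, s=64.0, m=0.5, **kwargs):
        super().__init__(**kwargs)
        self.num_classes, self.s, self.m = num_classes, s, m

    def build(self, input_shape):
        # One learned "center" vector per class, compared by cosine similarity.
        self.w = self.add_weight(name="class_centers",
                                 shape=(input_shape[-1], self.num_classes),
                                 initializer="glorot_uniform")

    def call(self, embeddings, labels):
        x = tf.math.l2_normalize(embeddings, axis=1)
        w = tf.math.l2_normalize(self.w, axis=0)
        cos = tf.matmul(x, w)  # cosine similarity to every class center
        theta = tf.acos(tf.clip_by_value(cos, -1.0 + 1e-7, 1.0 - 1e-7))
        target = tf.one_hot(labels, self.num_classes)
        # Add the angular margin m only to the ground-truth class, then
        # scale; train with plain softmax cross-entropy on these logits.
        logits = tf.where(target > 0, tf.cos(theta + self.m), cos)
        return logits * self.s
```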
I see that you are using the ImageNet pre-trained weights of EfficientNet for feature extraction.
From what I see, either or both of the following issues may be contributing to the weird result:
The official Keras implementation of EfficientNet expects un-normalized inputs in the range 0-255. So, if you are normalizing the input images before feeding them into the network, that may lead to issues.
Quote from Documentation:
EfficientNet models expect their inputs to be float tensors of pixels with values in the [0-255] range.
Source: Keras EfficientNet Documentation
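In other words, feed raw pixel values; something like this (the file name is just a placeholder):

```python
import tensorflow as tf

# The Keras EfficientNet models carry their own normalization layers,
# so images go in as raw float32 values in [0, 255] -- no /255 rescaling.
model = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg")

img = tf.keras.utils.load_img("stamp.jpg", target_size=(224, 224))
x = tf.expand_dims(tf.keras.utils.img_to_array(img), 0)  # values in [0, 255]
embedding = model(x)  # shape (1, 1280)
```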
The alternative issue (and the most likely one) is that the network was pre-trained on ImageNet, which does not contain examples similar to your query and target images, so the feature vectors for these images may come out nearly identical, which leads to the error in the distance calculations. The solution in this case would be to train/fine-tune your model on your relevant dataset to get a more discriminative feature vector.