From mass surveillance to fashion advice - can consumer AI benefit from surveillance research?
We've recently published a preprint of our paper answering that question.
In the last 3 years, there's been a flood of research papers in AI and Machine Learning published by Chinese universities and companies. Strong incentives from government and industry, combined with the challenges of scale (almost 1.4Bn citizens) push the frontier of research in China forward at an impressive pace.
Although the topics are quite diverse, one category of papers is overrepresented compared to the output of Western research institutions. These papers cover various subjects, but they share a common potential application: mass surveillance. [10],[9],[2],[6],[1],[5],[8]
One of the most popular applications of AI is Person Re-Identification - linking photos/videos from CCTV cameras to the identities of citizens. As you can imagine, the task is quite challenging, due to various factors:
- cameras have different quality
- angles, lighting & visibility conditions vary
- people can be partially occluded
- people change clothes, put on hats, can wear sunglasses etc.
- people travel between different areas, so location information is of limited usability
- human body is capable of many different poses and movements
- much more variety is present in the real world compared to synthetic datasets
Tracking the progress made by China is scary yet fascinating - but can their state-sponsored research be re-purposed to directly benefit the end-user?
In our paper [7] we propose an approach adapted from mass surveillance which, with some modifications, outperforms all prior research in fashion visual search.
The problem of fashion retrieval / visual search sounds simple - given a user-made photo of a clothing item, automatically pick the most similar clothes from a store's assortment. The user may take a photo of a friend, a photo of an item in a store, or upload a photo found on the Internet.
We decided to build a visual search product back in 2019, after having updated our visually-similar recommendation model (the problem of visually-similar recommendations is simpler, because it only considers stock photos with little variation). When analyzing the state-of-the-art methods & models, we noticed that the problem of fashion retrieval is very similar to Person Re-Identification for mass surveillance.
The fundamental problem being solved in both areas is that of representation learning - we want to encode images with vectors (called embeddings), such that:
- identical or very similar objects have very similar vectors
- very different objects have very different vectors
The 2 conditions above must hold despite changes in angle, lighting, object deformation, occlusion, crop & other confounding factors. There is more to representation learning in general than just images, but our problem is in the visual domain.
Intuitively, representation learning should "distill the essence of visual identity and similarity" of objects, and disregard all modifications & transformations of input, which do not change similarity or identity.
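The two conditions above can be made concrete with cosine similarity, a standard way to compare embeddings. The sketch below is purely illustrative - the 4-dimensional vectors are made up, whereas a real encoder would be a trained deep network producing hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (in practice these come from a trained encoder;
# the values here are made up for illustration).
sweater_photo_1 = np.array([0.9, 0.1, 0.0, 0.4])
sweater_photo_2 = np.array([0.8, 0.2, 0.1, 0.5])   # same sweater, different photo
red_dress       = np.array([0.0, 0.9, 0.8, 0.1])   # a different item

same_item = cosine_similarity(sweater_photo_1, sweater_photo_2)
diff_item = cosine_similarity(sweater_photo_1, red_dress)

print(f"same item: {same_item:.2f}, different item: {diff_item:.2f}")
```

A well-trained encoder makes the first score approach 1.0 and keeps the second one low, which is exactly the property both fashion retrieval and Person ReID depend on.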
Some examples of transformations which do not change visual identity/similarity:
- facial expressions - while they can deform the face, they do not change a person's identity
- clothes deformability - a crumpled sweater is still the same sweater
- lighting, brightness, contrast, etc. - objects remain the same, while they look different
- view angles, rotations, focal length, image resolution - they change the photo, but have no effect on the objects themselves
(Image source: facial expressions from the BU-3DFE dataset)
There are a lot more real-world transformations which can confuse ML models, but are naturally disregarded by people when evaluating "identity" or "similarity", e.g. weather conditions, mechanical transformations etc. Visual representation learning aims to be resistant to these transformations.
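A common way to train embeddings that ignore such transformations - widely used in Person ReID, though the exact losses and training setup in our paper may differ - is metric learning with a triplet loss: pull together two views of the same item, push apart views of different items. A minimal sketch with hypothetical 2-D embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Penalize triplets where the positive (same item) is not closer
    to the anchor than the negative (different item) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical embeddings of three photo crops.
anchor        = np.array([0.6, 0.8])     # a sweater, stock photo
positive      = np.array([0.64, 0.77])   # same sweater, crumpled, user photo
negative      = np.array([0.9, -0.43])   # a clearly different item
hard_negative = np.array([0.5, 0.87])    # a different item that looks similar

loss_easy = triplet_loss(anchor, positive, negative)       # well separated -> 0
loss_hard = triplet_loss(anchor, positive, hard_negative)  # violates margin -> > 0
print(f"easy: {loss_easy:.3f}, hard: {loss_hard:.3f}")
```

During training, only the "hard" triplets produce gradient, which is why ReID pipelines put so much effort into hard-example mining.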
When it comes to fashion, we're interested in vector representations of clothes, where the same "fashion item" gets the same or similar vectors, regardless of where, when and how the photo was taken. In contrast, for mass surveillance we'd be interested in vector representations of people, where the same person gets the same or similar vectors, regardless of where, when and how the photo was taken, what the person was wearing and what pose they were photographed in.
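In both cases, once every catalog photo (or CCTV crop) is embedded, answering a query reduces to nearest-neighbor search in the embedding space. A minimal sketch with random stand-in embeddings - a real system would use a trained encoder and, at our catalog sizes, an approximate-nearest-neighbor index rather than a brute-force scan:

```python
import numpy as np

# Stand-in embeddings: a small catalog of 1000 items, 128-dim each.
# In a real system both catalog and query come from the same trained encoder.
rng = np.random.default_rng(0)
catalog = rng.normal(size=(1000, 128))
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)  # L2-normalize rows

# Simulate a user photo of item 42: the same embedding plus a little noise.
query = catalog[42] + 0.05 * rng.normal(size=128)
query /= np.linalg.norm(query)

# With unit vectors, cosine similarity is just a dot product.
scores = catalog @ query
top5 = np.argsort(scores)[::-1][:5]
print("top-5 catalog items:", top5)
```

L2-normalizing the embeddings is what lets a plain dot product act as cosine similarity here, and it is also the form most ANN libraries expect for cosine-based search.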
The two use-cases sound so much alike that it's quite surprising how little intellectual cross-pollination has happened between these areas until now.
In our paper [7], we identify the similarities and differences between fashion and mass surveillance in depth. Then we successfully transfer the latest research from Person Re-Identification to fashion retrieval for visual search.
While the Person ReID models require some adjustments to work well on fashion datasets, the final results are quite extraordinary. Our best approach outperforms all prior published research in fashion retrieval and establishes new state-of-the-art results on two commonly used datasets - DeepFashion and Street2Shop. The best model described in the paper is the foundation of our Visual Search product at Synerise, trained on our massive proprietary datasets.
What's especially worth noting is that our strong baseline model is much simpler than some of the recent fashion-specific approaches. The simplicity is apparent with regard to architecture, training procedure and computational resources required. This should serve as a reminder that good foundations, proper abstractions and picking the right problem to solve are often key to unlocking significant progress in research. As unlikely as it sounds, fashion and surveillance have a lot in common when viewed through the framework of representation learning.
Here are some example results of our best model:
For more details & nice pictures, check out our paper with the appendix: [7].
Jacek Dąbrowski / Jarek Krolewski
[1] Dong, C. et al. 2019. DeepMEF: A Deep Model Ensemble Framework for Video Based Multi-modal Person Identification. Proceedings of the 27th ACM International Conference on Multimedia (Nice, France, Oct. 2019), 2531–2534.
[2] Guo, Y. et al. 2019. Multi-Scale Convolutional Recurrent Neural Network with Ensemble Method for Weakly Labeled Sound Event Detection. 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (Sep. 2019), 1–5.
[3] Kuang, Z. et al. 2019. Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid. arXiv:1908.11754 [cs]. (Aug. 2019).
[4] Kucer, M. and Murray, N. 2019. A Detect-Then-Retrieve Model for Multi-Domain Fashion Item Retrieval. CVPR Workshops. 10.
[5] Nie, J. et al. 2019. Understanding personality of portrait by social embedding visual features. Multimedia Tools and Applications. 78, 1 (Jan. 2019), 727–746.
[6] Song, W. et al. 2019. Partial Attribute-Driven Video Person Re-Identification. 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI) (Nov. 2019), 539–546.
[7] Wieczorek, M. et al. 2020. A Strong Baseline for Fashion Retrieval with Person Re-Identification Models. arXiv:2003.04094 [cs]. (Mar. 2020).
[8] Wu, L. et al. 2019. A Neural Influence Diffusion Model for Social Recommendation. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France, Jul. 2019), 235–244.
[9] Zhang, X. et al. 2019. TVV: Real-Time Visual Identity and Tracking with Edge Computing. Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks (Beijing, China, Mar. 2019), 419–424.
[10] Zhang, Z. et al. 2018. Billion-Scale Network Embedding with Iterative Random Projection. 2018 IEEE International Conference on Data Mining (ICDM) (Nov. 2018), 787–796.