This post describes how a set of images can be represented as points on a multi-dimensional manifold. Such a representation is particularly useful for clustering visually similar images, or for building a dynamic indexing structure for large volumes of visual data.

### The idea

The idea behind representing images as points is to express the visual similarities between images as Euclidean distances between points on a multi-dimensional manifold onto which the images are projected. So, assuming that we have at our disposal a dataset of \(N\) images, we start by estimating a visual similarity metric between every pair of images and then find a way to relate this metric to the Euclidean distances between points that lie on the manifold.
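As a minimal sketch of this second step, the snippet below embeds points from a pairwise dissimilarity matrix via classical multidimensional scaling (MDS). The post does not specify an embedding method, so treat this as one illustrative choice; the `classical_mds` function and the toy matrix `D` are my own, not from the post.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed n points in `dim` dimensions from an n x n pairwise
    distance matrix D via classical (Torgerson) MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:dim]    # keep the largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[idx], 0.0))
    return eigvecs[:, idx] * scale           # n x dim coordinates

# Toy example: 4 "images" whose pairwise dissimilarities we embed in 2-D.
D = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 2.0],
              [2.0, 1.0, 0.0, 1.0],
              [3.0, 2.0, 1.0, 0.0]])
X = classical_mds(D, dim=2)
print(X.shape)  # → (4, 2)
```

The Euclidean distances between the rows of `X` then approximate the entries of `D`, which is exactly the relation between similarity metric and point distances that the manifold representation is after.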

### Estimation of visual similarity metric

A common way to check how visually similar two images are is to count their corresponding points. For this reason, the estimation of the visual similarity metric is based on local visual descriptors. In this case, I chose the ORB descriptor [1], but the SURF [2] and SIFT [3] descriptors can be used as well. ORB describes the visual content of an image by detecting \(K\) keypoints and characterizing each of them with a binary feature vector. Let us denote by \(k_{i,(A)}\) the \(i\)-th keypoint of image \(A\) extracted via the ORB algorithm. Its corresponding keypoint \(k_{j_i,(B)}\) in image \(B\) can then be found by a nearest-neighbor keypoint-matching algorithm.
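The matching step above can be sketched as follows. Since ORB descriptors are binary, nearest neighbors are found under the Hamming distance; in practice one would extract the descriptors with a library such as OpenCV, so the random descriptors, the `hamming_nn_matches` helper, and the ratio-test threshold below are illustrative assumptions, not the post's implementation.

```python
import numpy as np

def hamming_nn_matches(desc_a, desc_b, ratio=0.8):
    """Match binary descriptors (bit-packed uint8 rows) from image A to
    image B by Hamming-distance nearest neighbor, with a ratio test that
    keeps a match only if the best neighbor clearly beats the second."""
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]      # pairwise XOR
    dist = np.unpackbits(xor, axis=2).sum(axis=2)      # popcount = Hamming
    matches = []
    for i, row in enumerate(dist):
        order = np.argsort(row)
        best, second = order[0], order[1]
        if row[best] < ratio * row[second]:            # ratio test
            matches.append((i, int(best)))             # (i, j_i) pair
    return matches

# Toy example: two identical sets of five 256-bit descriptors,
# standing in for the ORB descriptors of images A and B.
rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)
desc_b = desc_a.copy()
matches = hamming_nn_matches(desc_a, desc_b)
print(len(matches))  # → 5, one match per descriptor
```

The number of surviving matches (here `len(matches)`) is then the raw quantity from which the visual similarity metric between the two images is derived.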