# Interlude: Distances between point sets

Before I keep writing about the visualization of trajectory data, I want to show something cool about inner products that I came across quite by accident today.

Imagine you have two sets of points in $R^d$, $a$ and $b$, and you want to measure the distance between the two sets of points globally. This comes up all the time when doing clustering. One way of doing this is to consider the distance between the two point sets to be the minimum distance between any point in $a$ and any other point in $b$: $d_M(a, b) = min_{i,j} d(a_i, b_j)$. This is fine, but it is not a metric: two point sets might be different and have distance zero. You can also look at the max-min distance: $d_H(a,b) = max_j min_i d(a_i, b_j)$. This is not symmetric, but we can then use the symmetric version $d_S(a, b) = max(d_H(a, b), d_H(b, a))$. This is fine, but it’s not what I want to write about today.

Yet another idea of computing the distance between clusters is to somehow take the average distance between points in $a$ and $b$. We’ll look at the sum for now, but all of this applies to the average up to a constant factor, so $d(a, b) = \sum_i \sum_j d(a_i, b_j)$. This seems like a reasonable way of computing the distance between clusters, but it takes quadratic time to compute. Can we fix this? Pretty much, and this is what we will do now.

Let’s start thinking about a slightly different distance function, namely, $d(a,b)^2 = \sum_i \sum_j d(a_i, b_j)^2$. It turns out that this sum can be evaluated exactly in linear time, and it’s very simple to show that.

First, recall that $||x||^2 = \langle x, x \rangle$, and so $||a - b||^2 = \langle a - b, a - b \rangle = \langle a, a \rangle - 2\langle a, b \rangle + \langle b, b \rangle$. Then,

$d(a,b)^2 = \sum_i \sum_j \langle a_i - b_j, a_i - b_j \rangle$

$d(a,b)^2 = \sum_i \sum_j \langle a_i, a_i \rangle + \sum_i \sum_j \langle b_j, b_j \rangle - 2 \sum_i \sum_j \langle a_i, b_j \rangle$

The first two terms of the right hand side can be evaluated in linear time, and what’s interesting about the last sum is that it turns out to be exactly zero. To see that, we first write $a_i$ and $b_j$ in terms of cluster mean and their displacement: $a_i = \mu_a + \delta_{ai}$. The trick we will use is that $\sum_i \delta_{ai} = 0$, and, in particular, $\sum_i \langle k, \delta_{ai} \rangle = \langle k, \sum_i \delta_{ai} \rangle = \langle k, 0 \rangle = 0$. With this in mind everything else is trivial:

$\sum_i \sum_j \langle a_i, b_j \rangle = \sum_i \sum_j \langle \mu_a + \delta_{ai}, \mu_b + \delta_{bj} \rangle$
$\sum_i \sum_j \langle a_i, b_j \rangle = \sum_i \sum_j \langle \mu_a , \mu_b \rangle + \langle \mu_a, \delta_{bj} \rangle + \langle \mu_b, \delta_{ai} \rangle + \langle \delta{ai}, \delta_{bj} \rangle$

The last three terms are zero, and so

$d(a,b)^2 = |b| \sum_i \langle a_i, a_i \rangle + |a| \sum_j \langle b_j, b_j \rangle - |a| |b| \langle \mu_a, \mu_b \rangle$

With a little bit of rearranging, you can also write this as

$d(a,b)^2 = |b| \sum_i ||\delta_{ai}||^2 + |a| \sum_j ||\delta_{bj}||^2 + |a| |b| d^2(\mu_a, \mu_b)$

This is kind of neat: even though the original sum had a quadratic number of terms, most of them cancel out nicely. However, we can go further. This function we have defined is not really a metric, because $d(a, a) \neq 0$. We can, in fact, fix this, and it turns out that the natural fix to make it a metric will end up looking very familiar. Looking at that function above, it seems like what we want to do is to somehow get rid of the first couple of terms: they do not actually measure any distance from $a$ to $b$ at all.

We were trying to define the distance between two point sets by summing over the distance of the points. We know, however, how to compute distances from inner products. So let’s instead define an inner product between two point sets, and see what kind of distance function falls out of it. In particular, let’s just use an inner product in perfect analogy with the way we were trying to define distances: $\langle a, b \rangle = \sum_i \sum_j \langle a_i, b_j \rangle$. Now, we look at what happens with $d(a, b)^2 = \langle a, a \rangle + \langle b, b \rangle - 2\langle a, b \rangle$:

$d(a, b)^2 = \sum_i \sum_j \langle a_i, a_j \rangle + \langle b_i, b_j \rangle - 2 \langle a_i, b_j \rangle$

We do the same trick of writing the elements in terms of their means and displacements:

$d(a, b)^2 = \sum_i \sum_j \langle \mu_a + \delta_{ai}, \mu_a + \delta_{aj} \rangle + \langle \mu_b + \delta_{bi}, \mu_b + \delta_{bj} \rangle - 2 \langle \mu_a + \delta_{ai}, \mu_b + \delta_{bj} \rangle$

Now the biggest difference between this expression and the one before it is that all of the inner products have $\delta$ terms in them, which all cancel out! So the whole thing, in the end, is:

$d(a, b)^2 = \sum_i \sum_j \langle \mu_a, \mu_a \rangle + \langle \mu_b, \mu_b \rangle - 2 \langle \mu_a, \mu_b \rangle = \sum_i \sum_j d(\mu_a, \mu_b)^2$

So this fancy-looking inner product between clusters we were trying to define turns out to define a distance that is exactly the distance between the two cluster centroids! This was a surprise to me. In a sense, this is a long-winded justification for using the distance between cluster centroids as a distance measure between the entire clusters.

Wait a minute, you say! This means that two clusters will have zero distance even though they are not the same! Yes, this is because that inner product defined about defines, in fact, a semi-Hilbert space, instead of a Hilbert space. This is an inner product space where some nonzero vectors are allowed to have length zero. This happens, essentially, because if you squint hard enough, you can think of the inner product that we wrote as $a^T M b$, where $M$ is a matrix entirely composed of ones. This means we can also write that inner product as $a^T M_1 M_2^T b = (M_1^T a)^T (M_2^T b)$, where $M_1$ and $M_2$ are vectors entirely composed of ones. Notice that the whole inner product becomes an inner product of the sums of the points, which is exactly where the centroids come in. If this matrix $M$ were full rank, then our point-set inner product would, in fact, be a true Hilbert space. More about that in future posts.