Back to Writing
November 5, 2022

K-Means From Scratch

Generate Dataset

We need to create a dataset. Since the same dataset can be used, I used his and removed some unneeded print statements. See the plot for what the data looks like.

Loading...
Loading...
Loading...
Loading...
Output

K Means

K-means is a clustering algorithm. There's 4 main steps to the process:

Loading...

Once you have those steps, you can repeat the last 3 until your centroids no longer move.

Calculate Distance

In order to initialize our centroids we need to be able to calculate distances, so let's do that first.

Given a tensor of centroid coordinates and a tensor of data coordinates we calculate distance by:

  • Subtract centroids coordinates from data points coordinates
  • Take absolute value of distances
  • Pythagorean Calculation
    • Square coordinates
    • Add them together
    • Take the Square Root

That gives us the euclidean distance between each data point and each centroid.

Loading...

Initialize Centroids

Where we initialize our centroids is really important. If we don't have good initialization we are very likely to get stuck in a local optimum. Especially with 6 centroids. One option is to run the algorithm many times and pick the best solution, but it's a much better idea to try to have good initializations.

We pick centroid locations in the following way:

  • Pick a random data point and use those coordinates as the first centroid
  • Loop to create remaining centroids
    • Calculate the distance between existing centroids and data points.
    • Get the distance from each data point to it's closest centroid
    • Place the next centroid at the point with the max distance from previous step

This ensures we get initialization that are nice and far away from each other and spread out amonth the data, minimizing the risk of hitting local optimums.

Loading...

Classify Data Points

Once we have centroids (or updated centroids), we need to assign a centroid to each data point. We do this by calculating the distance between each data point and each centroid, and assigning each datapoint to it's closes centroid.

Loading...

Update Centroids

To update the centroid locations, we take the mean of all the data point assigned to that centroid. We make the new centroid that point.

Loading...

Training

Here we put it all together and train our K-means model. As you can see it fits this dataset very quickly (it's a simple dataset).

Loading...
OutputOutputOutput

Stay up to date

I send useful notes when I have something worth sending: lessons from building with clients, new public posts, talks, tools, mistakes, and questions about retrieval, evals, agents, and AI product workflow.

5,000+ readers