K-Means From Scratch
Generate Dataset
We need to create a dataset. Since the same dataset can be used, I used his and removed some unneeded print statements. See the plot for what the data looks like.
Loading...
Loading...
Loading...
Loading...
K Means
K-means is a clustering algorithm. There's 4 main steps to the process:
Loading...
Once you have those steps, you can repeat the last 3 until your centroids no longer move.
Calculate Distance
In order to initialize our centroids we need to be able to calculate distances, so let's do that first.
Given a tensor of centroid coordinates and a tensor of data coordinates we calculate distance by:
- Subtract centroids coordinates from data points coordinates
- Take absolute value of distances
- Pythagorean Calculation
- Square coordinates
- Add them together
- Take the Square Root
That gives us the euclidean distance between each data point and each centroid.
Loading...
Initialize Centroids
Where we initialize our centroids is really important. If we don't have good initialization we are very likely to get stuck in a local optimum. Especially with 6 centroids. One option is to run the algorithm many times and pick the best solution, but it's a much better idea to try to have good initializations.
We pick centroid locations in the following way:
- Pick a random data point and use those coordinates as the first centroid
- Loop to create remaining centroids
- Calculate the distance between existing centroids and data points.
- Get the distance from each data point to it's closest centroid
- Place the next centroid at the point with the max distance from previous step
This ensures we get initialization that are nice and far away from each other and spread out amonth the data, minimizing the risk of hitting local optimums.
Loading...
Classify Data Points
Once we have centroids (or updated centroids), we need to assign a centroid to each data point. We do this by calculating the distance between each data point and each centroid, and assigning each datapoint to it's closes centroid.
Loading...
Update Centroids
To update the centroid locations, we take the mean of all the data point assigned to that centroid. We make the new centroid that point.
Loading...
Training
Here we put it all together and train our K-means model. As you can see it fits this dataset very quickly (it's a simple dataset).
Loading...