Overview
Roughly 30% of harvested fruit is wasted because sorting is subjective. Two workers looking at the same apple disagree on ripeness, timing slips by a day or two, and the fruit either ships overripe or rots before distribution. Agriculture has been waiting for automation here, but most solutions aimed at the wrong layer: they reached for deep learning first and never got to interpretability.
This was my thesis. The goal was an apple ripeness classifier that could split fruit into five ripeness stages (20%, 40%, 60%, 80%, 100%) with accuracy high enough for production sorting, but with a decision process a farmer could understand. That last constraint is what ruled out a CNN from day one.
Why classical ML, not deep learning
CNNs would have been the easy paper to write. They would also have been a black box at harvest time, when the person checking the sorter needs to know why a specific apple landed in the 80% bin instead of 100%.
K-Nearest Neighbors is boring and old, but it has three properties CNNs don't: it's interpretable (you can literally point to the three neighbors driving a decision), it's distribution-agnostic (no assumption about the shape of feature space), and it's small enough to run on a campus server without a GPU.
The real question became: which features do you feed KNN so it can match a CNN's accuracy? That answer is Haralick texture features.
Haralick features explained briefly
Ripening changes apple skin texture before it finishes changing color. Haralick features, derived from the Gray Level Co-occurrence Matrix (GLCM), capture statistical patterns in how pixel intensities relate to their neighbors: contrast, homogeneity, energy, entropy, correlation. Each feature is parameterized by two knobs: the pixel-pair distance d (how far apart the compared pixels are) and the angle θ (the direction along which they are compared).
For each (d, θ) pair the pipeline computes five Haralick statistics, flattens them into a feature vector, and feeds that vector to KNN. The combinations that lost information got dropped during feature selection.
The parameter tuning journey
Three knobs mattered: the Haralick distance d, the Haralick angle θ, and the KNN k value. I ran a grid search across the full parameter space (d × θ × k for k across odd values from 3 to 25) and let cross-validation pick the winners.
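A minimal sketch of that sweep (illustrative names, not the thesis code; `extract(img, d, theta)` is an assumed per-image Haralick extractor): because d and θ change the features themselves, the outer loop must re-extract features, and only the inner loop varies k:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sweep(images, labels, extract, distances=(1, 2, 3),
          angles=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4),
          ks=range(3, 26, 2)):
    """Grid search over (d, theta, k), scored by mean 5-fold CV
    accuracy. Returns the best triple and its score."""
    best, best_score = None, -1.0
    for d in distances:
        for theta in angles:
            # Features depend on (d, theta), so rebuild X here
            X = np.array([extract(img, d, theta) for img in images])
            for k in ks:
                score = cross_val_score(
                    KNeighborsClassifier(n_neighbors=k), X, labels, cv=5
                ).mean()
                if score > best_score:
                    best, best_score = (d, theta, k), score
    return best, best_score
```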
The optimum converged quickly: d=1, θ=45°, k=3. Shorter pixel distances preserved fine-grained texture. Diagonal orientation captured anisotropy in apple skin that horizontal/vertical scans missed. Small k mattered because ripeness class boundaries are sharp (texture shifts faster than color near class boundaries), so averaging over many neighbors blurred the decision.
The dashboard: making KNN trustable
The model is useless if the supervisor at the sorting line can't explain a misclassification to the owner. That meant the dashboard was as important as the classifier.
Three Streamlit tabs, each targeted at a different audience:
- Summary Report. Single-screen overview: best parameters, headline accuracy, per-class F1. For the five-minute look.
- Detailed Results. Confusion matrix, per-class precision/recall, and the signature PCA visualization: high-dimensional Haralick space projected to 2D with the test sample highlighted next to its three neighbors. This is where "why" questions get answered.
- Cross Validation. Radar charts + parallel-coordinates plots over the 5-fold CV splits, so model robustness is visually checkable without reading tables.
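The PCA-plus-neighbors view in the Detailed Results tab can be sketched roughly like this (hypothetical helper name; assumes scikit-learn). The key design choice is that neighbors are found in the original high-dimensional Haralick space, where the KNN vote actually happens, while PCA is used only for display:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def neighbor_projection(X_train, x_test, k=3):
    """Project training features and one test sample to 2D with PCA,
    and return the indices of the test sample's k nearest neighbors
    computed in the ORIGINAL feature space (the distances that drove
    the classification, not the projected ones)."""
    pca = PCA(n_components=2).fit(X_train)
    pts2d = pca.transform(X_train)
    test2d = pca.transform(x_test.reshape(1, -1))[0]
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_test.reshape(1, -1))
    # Plot pts2d, then highlight test2d and pts2d[idx[0]]
    return pts2d, test2d, idx[0]
```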
The first version was unusable because every parameter change recomputed everything. Strategic caching (cache Haralick feature matrices, recompute only KNN predictions when k changed) cut interaction latency from ~15 seconds to near-instant.
Numbers
Measured across the 500-image dataset (100 images per ripeness class), 5-fold cross-validated, overall accuracy came out at 96%.
The confusion matrix was informative in a useful way. The 20% and 60% classes were near-perfect. The most common misclassifications were between 80% and 100%, which makes sense: those two stages are visually the most similar even to trained human sorters. The model was honest about where it struggled, and that honesty was the reason stakeholders trusted the overall number.
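That adjacency pattern shows up directly in a confusion matrix. With made-up labels that reproduce the pattern described above (illustrative only, not the thesis data), the only off-diagonal mass sits between the 80% and 100% classes:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical sorter output: 20/40/60 perfect, 80 and 100 confused once each
classes = [20, 40, 60, 80, 100]
y_true = [20] * 4 + [40] * 4 + [60] * 4 + [80] * 4 + [100] * 4
y_pred = ([20] * 4 + [40] * 4 + [60] * 4
          + [80] * 3 + [100]            # one 80% apple called 100%
          + [80] + [100] * 3)           # one 100% apple called 80%
cm = confusion_matrix(y_true, y_pred, labels=classes)
# Rows are true classes, columns predicted; the errors cluster in the
# bottom-right 2x2 block, matching the "visually similar stages" story.
```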
What I learned
- Simple algorithms + good features beat complexity. Haralick + KNN at 96% isn't a compromise against a CNN. It's a better fit for the problem because it's explainable.
- Interpretability is a deployment feature, not a research concession. People deciding whether to trust a model want to see its work.
- Dashboard design is model development. The PCA + nearest-neighbor view turned out to be the most important thing I built, even though it's technically "just visualization."
- Parameter grids are worth running properly. The d=1, θ=45° result wasn't obvious up front. A sloppier sweep would have picked a suboptimal pair and I'd have shipped a 91% model.
- Cache aggressively in interactive dashboards. The difference between 15-second interactions and instant interactions is the difference between "toy" and "tool."
Full write-up with the GLCM derivation and per-class confusion analysis is on Medium. Source on GitHub.
