An interactive version of this article is available at the Juniper SecIntel blog:
The demo above is just one example of the mistakes one can make when training machine learning classifiers. There are many more, and they can occur in any type of machine learning, not just classifiers.
When implementing machine learning algorithms or training and evaluating statistical models, it is easy to make mistakes that, unlike typical software engineering bugs, are subtle and hard to identify. This is because with implicit directions come implicit assumptions that may not be immediately obvious. The resulting problems may not surface until much later (as with Linus’s email fiasco and the issues with Microsoft’s Tay chat bot). So how do we avoid these mistakes and test for them?
Machine learning and statistics pitfalls fall into two broad categories:
1. Not understanding the domain and
2. Not understanding the algorithms.
Not understanding the domain:
- The garbage in, garbage out phenomenon.
This is the mistake we saw in the demo above. The data scientist may not be familiar with different types of oranges, so they trained their model on a huge number of one type of orange.
This type of mistake can produce excellent results during the evaluation stage but disastrous results in the wild, and unreliable performance overall. Another example is inferring the size of shipping boxes from the color combinations, saturation, and intensity of the boxes. This might work if you only receive large UPS boxes and small FedEx boxes. If the company suddenly switches entirely to UPS, or changes the camera used to photograph the boxes, performance would deteriorate. One can extract millions of such color features, none of which would be helpful. With domain knowledge of the shipping industry, all these color features could probably be expressed more usefully by 1-3 features, saving a lot of engineering effort otherwise spent pushing useless color-related data around.
- Data scientists should know exactly how the data was acquired. Noisy data with errors may require a different approach. Without such knowledge, a model can perform well during evaluation but unexpectedly poorly on real-life datasets. Additionally, when there are many variables and few samples, mistakes in the training data have an even bigger effect on performance on test data. This is because relevant features may not be selected when irrelevant features happen to separate the small dataset better.
- Unexpected behavior of the variables that the algorithm relies upon. This would result in unexpectedly poor performance in the wild. A famous example is an algorithm that attempts to separate vehicles into cars and trucks in images. If all the images of trucks are taken at night and images of cars are taken during the day, the algorithm would determine that any image of a vehicle taken at night must be a truck.
- Lack of interpretability when something goes wrong. If algorithm performance drops dramatically when running in the wild and the system inputs are opaque to the statistician/data scientist, it would be difficult to make modifications to the algorithm/training data.
- Training on a data sample that is not representative of the test population (the real-life situations that the dataset pertains to). An example is training on all oranges when separating apples from oranges, when the business requirement is to separate one type of orange from apples.
- Functional duplicates in the dataset. For example, when separating cars and trucks in images, you can have millions of examples of one model of car or truck and no examples of other models. Any other model may then be classified incorrectly.
- Failure to account for intelligent attackers. A recent example is Microsoft’s Tay.
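The point above about many variables and few samples can be illustrated with a minimal sketch. It assumes a purely synthetic setup that is not from the original demo: every feature is a random bit with no relation to the label, and the function name and parameter values are hypothetical. Picking the feature that best matches a tiny training set looks like a success, yet the same feature is no better than a coin flip on fresh data.

```python
import random


def best_noise_feature(n_train=8, n_test=1000, n_features=5000, seed=0):
    """Select the pure-noise feature that best matches the training labels.

    Every feature is a random bit unrelated to the label, so any apparent
    predictive power is a coincidence of the small training set.
    """
    rng = random.Random(seed)
    y_train = [rng.randint(0, 1) for _ in range(n_train)]

    best_train_acc = 0.0
    for _ in range(n_features):
        x_train = [rng.randint(0, 1) for _ in range(n_train)]
        # Rule under test: "predict the label to be the feature's value".
        train_acc = sum(x == y for x, y in zip(x_train, y_train)) / n_train
        best_train_acc = max(best_train_acc, train_acc)

    # On fresh samples the selected feature is still just noise, so its
    # values are drawn independently of the new labels.
    y_test = [rng.randint(0, 1) for _ in range(n_test)]
    x_test = [rng.randint(0, 1) for _ in range(n_test)]
    test_acc = sum(x == y for x, y in zip(x_test, y_test)) / n_test
    return best_train_acc, test_acc


train_acc, test_acc = best_noise_feature()
print(f"best training accuracy: {train_acc:.2f}")
print(f"accuracy on fresh data: {test_acc:.2f}")
```

With thousands of candidate features and only eight training samples, the best training accuracy is typically at or near 1.0, while accuracy on fresh data stays close to 0.5. Adding more training samples, or considering fewer candidate features, shrinks the gap.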
Lack of algorithm understanding:
- People sometimes use their existing “Big Data” store to extract all kinds of features, also known as independent variables (IVs). Dynamic extraction usually yields a large number of these features, which leads to the so-called curse of dimensionality. Unless care is taken to select pertinent features (e.g., with aggressive L1/L2 regularization), the more features you have, the more samples you need to achieve good predictive power on unseen data.
- Lack of understanding of algorithm assumptions. The algorithm might assume that features/IVs are normally distributed, independent of one another, or have a specific variance. In practice, many variables can be measuring the same signal, which violates an independence assumption.
- Not knowing how to modify machine learning algorithms to target business objectives, such as low false positive or low false negative rates, can result in much poorer performance on these metrics than is achievable.
- Not understanding the inner workings of ML algorithms can lead to picking the wrong algorithm for a problem. For example, if the data is expected to change quickly, one should consider an “online” algorithm. On the other hand, such algorithms are more easily influenced by intelligent attackers, so care must be taken to avoid training on maliciously fed data. Trying just one or two algorithms at random usually leads to suboptimal performance.
- When feeding data to the algorithm, the simplest approach is to split the data into two sets: a training set and a testing set. This approach is almost always inadequate except for toy problems, because tuning against the testing set leaks information about it into the model. Usually one needs to do a lot more data munging and manipulation to do cross-validation correctly and optimize algorithm hyperparameters. A fast and loose training schedule can lead to overfitting and unexpected performance in the wild.
- Sometimes algorithm hyperparameters have a meaning in the specific domain, and it may be important to adjust them based on prior knowledge (in the Bayesian sense) of the data. For example, it may be possible to avoid overfitting when training a linear SVM on large feature sets by lowering the value of the C hyperparameter, which strengthens regularization. However, this suggestion is highly dataset specific.
- Picking an algorithm whose asymptotic complexity is incompatible with the dataset size and available computing resources. Some algorithms take a lot of computing effort to evaluate, and the deep learning algorithms with many hidden layers that are now in common use are the biggest offenders. Mistakes here can be very costly in both computing time and the engineering time needed to set up such long-running or massively parallel jobs. Sometimes the data and downstream processing choices allow some steps to be skipped entirely. For example, computing only the top k eigenvectors rather than all of them, when only the top k are of interest, can make PCA feasible on large datasets where it otherwise wouldn’t be. Another example is using small input/output images instead of large ones when evaluating generative deep learning algorithms.
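The point above about going beyond a single training/testing split can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended pipeline: the one-feature synthetic data, the k-nearest-neighbours model, the candidate values, and all function names are hypothetical. The hyperparameter is chosen by k-fold cross-validation on a development set, and the held-out test set is touched only once at the end.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest 1-D training points (k should be odd)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation points that k-NN labels correctly."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(eval_x, eval_y))
    return hits / len(eval_x)


def cv_accuracy(xs, ys, k, n_folds=5):
    """Mean validation accuracy of k-NN across the folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), n_folds):
        scores.append(accuracy([xs[i] for i in train_idx],
                               [ys[i] for i in train_idx],
                               [xs[i] for i in val_idx],
                               [ys[i] for i in val_idx], k))
    return mean(scores)


# Hypothetical data: one noisy feature whose mean depends on the class.
rng = random.Random(1)
ys = [rng.randint(0, 1) for _ in range(300)]
xs = [rng.gauss(1.5 * y, 1.0) for y in ys]

# Hold out a final test set that is never used while tuning.
dev_x, dev_y, test_x, test_y = xs[:200], ys[:200], xs[200:], ys[200:]

candidates = [1, 5, 15]  # candidate neighbourhood sizes (assumed values)
cv_scores = {k: cv_accuracy(dev_x, dev_y, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Single, final evaluation on data the tuning loop never saw.
test_acc = accuracy(dev_x, dev_y, test_x, test_y, best_k)
print(f"chosen k={best_k}, held-out accuracy={test_acc:.2f}")
```

The design choice worth noting is the separation of roles: the folds decide the hyperparameter, and the held-out set only reports the final number. Reusing the held-out set to pick among candidates would make that number optimistic.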
So what are the solutions?
The general strategy is to visualize the data in various ways, acquire some domain knowledge, and understand the algorithm being used and the assumptions it makes.
Once ready, build a prototype classifier on some sample data.
If the prototype is relatively successful, effort can then be put into data engineering, additional feature extraction, and algorithm optimization.
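One way to judge whether a prototype is "relatively successful" is to compare it against a trivial baseline on held-out data. The sketch below is only an illustration under assumptions: the synthetic two-feature sample, the nearest-centroid classifier, and every name in it are hypothetical stand-ins for whatever sample data and prototype are actually at hand.

```python
import random
from collections import Counter


def make_sample(n, seed=0):
    """Hypothetical sample: one informative feature plus one noise feature."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        informative = rng.gauss(3.0 * label, 1.0)  # mean shifts with the class
        noise = rng.gauss(0.0, 1.0)                # unrelated to the class
        rows.append(((informative, noise), label))
    return rows


def fit_centroids(rows):
    """Return the per-class mean of each feature."""
    sums, counts = {}, Counter()
    for features, label in rows:
        counts[label] += 1
        totals = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value
    return {label: [t / counts[label] for t in totals]
            for label, totals in sums.items()}


def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


rows = make_sample(600)
train, test = rows[:400], rows[400:]

# Prototype: nearest-centroid classifier fitted on the training rows only.
centroids = fit_centroids(train)
prototype_acc = sum(predict(centroids, f) == y for f, y in test) / len(test)

# Baseline: always predict the most common training label.
majority_label = Counter(y for _, y in train).most_common(1)[0][0]
baseline_acc = sum(y == majority_label for _, y in test) / len(test)

print(f"prototype accuracy: {prototype_acc:.2f}")
print(f"majority baseline:  {baseline_acc:.2f}")
```

If the prototype does not clearly beat the baseline, that is a signal to revisit the data and features before investing in heavier engineering.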