Pete Warden writes convincingly about computer scientists’ focus on improving machine learning algorithms, to the exclusion of improving the training data that the algorithms interpret, and how that focus has slowed the progress of machine learning.
The problem is as old as data-processing itself: garbage in, garbage out. Assembling the large, well-labeled datasets needed to train machine learning systems is a tedious job (indeed, the whole point and promise of machine learning is to teach computers to do this work, which humans are generally not good at and do not enjoy). The shortcuts we take to produce datasets come with steep costs that are not well-understood by the industry.
For example, in order to teach a model to recognize attractive travel photos, Jetpac paid low-waged Southeast Asian workers to label pictures. These workers had a very different idea of a nice holiday than the wealthy people who would use the service they were helping to create: for them, conference reception photos of people in suits drinking wine in air-conditioned international hotels were an aspirational ideal — imagine that for some of these people, the beach and sea connoted grueling work fishing or clearing brush, rather than relaxing on a sun-lounger.
Warden says that people who are trying to improve vision systems for drones and other robots run into problems using the industry standard Imagenet dataset, because those images were taken by humans, not drones, and humans take pictures in ways that are significanty different from the way that machines do — different lenses, framing, subjects, vantage-points, etc.
Warden’s advice is for machine learning researchers to sit with their training data: sift through it, hand-code it, review it and review it again. Do the hard, boring work of making sure that PNGs aren’t labeled as JPGs, retrieve the audio samples that were classified as “other” and listen to them to see why the classifier barfed on them.
It’s an important lesson for product design, but even more important when considering machine learning’s increasing role in adversarial uses like predictive policing, sentencing recommendations, parole decisions, lending decisions, hiring decisions, etc. These datasets are just as noisy and faulty and unfit for purpose as the datasets Warden cites, but their garbage out problem ruins peoples’ lives or gets them killed.
Here’s an example that stuck with me, from a conversation with Patrick Ball, whose NGO did a study of predictive policing. The police are more likely to discover and arrest perpetrators of domestic violence who live in row-houses, semi-detached homes and apartment buildings, because the most common way for domestic violence to come to police attention is when a neighbor phones in a complaint. Abusers who live in detached homes get away with it more than their counterparts in homes with a party wall.
Train a machine learning system with police data, and it will overpolice people in homes with shared walls (who tend to be poorer), and underpolice people in detached homes (who tend to be richer). No one benefits from that situation.
There are almost always model errors that have bigger impacts on your application’s users than the loss function captures. You should think about the worst possible outcomes ahead of time and try to engineer a backstop to the model to avoid them. This might just be a blacklist of categories you never want to predict, because the cost of a false positive is so high, or you might have a simple algorithmic set of rules to ensure that the actions taken don’t exceed some boundary parameters you’ve decided. For example, you might keep a list of swear words that you never want a text generator to output, even if they’re in the training set, because it wouldn’t be appropriate in your product.
It’s not always so obvious ahead of time what the bad outcomes might be though, so it’s essential to learn from your mistakes in the real world. One of the simplest ways to do this, once you have a half-decent product/market fit, is to use bug reports. When people use your application, and they get a result they don’t like from the model, make it easy for them to tell you. If possible get the full input to the model but if it’s sensitive data, just knowing what the bad output was can be helpful to guide your investigation. These categories can be used to choose where you gather more data, and which classes you explore to understand their current label quality. Once you have a new revision of your model, have a set of inputs that previously produced bad results and run a separate evaluation on those, in addition to the normal test set. This rogues gallery works a bit like a regression test, and gives you a way to track how well you’re improving the user experience, since a single model accuracy metric will never fully capture everything that people care about. By looking at a small number of examples that prompted a strong reaction in the past, you’ve got some independent evidence that you’re actually making things better for your users. If you can’t capture the input data to your model in these cases because it’s too sensitive, use dogfooding or internal experimentation to figure out what inputs you do have access to produce these mistakes, and substitute those in your regression set instead.