Can you see the blind spots in your machine learning?

Looking back at the pace of innovation over the last 10 years, it becomes increasingly difficult to imagine a world without artificial intelligence (AI). From product recommendation and self-driving cars to automatically understanding your receipts, AI – and specifically machine learning – has made huge strides into mainstream technology. One of the techniques with the biggest impact has been the convolutional neural network, or CNN.

By stacking layers that each extract progressively more general descriptive features of an item, CNNs have achieved significant improvements over previous techniques. Since the neural network breakthroughs in the ImageNet challenge back in 2012, machine learning has been applied to generic classification problems ranging from telling cats from dogs to spotting cancer, in some cases with accuracy that matches or exceeds that of humans.

AI has been an important area of research and investment. The next challenge for AI, though, is to go from general applications to more niche, deep models, seeking to give one company the ultimate competitive advantage over its peers.

While the rewards have the potential to be significant, the task is not easy. First, developing machine learning models relies heavily on the work of talented individuals, and there is a significant shortage of experts with hands-on machine learning experience. The challenge for the industry is to lower the barrier to entry, automate as much of the process as possible, and create tools that act as a force multiplier.

Second, machine learning models require huge amounts of data to train. Training is the process of showing the model an example of what you would like it to learn, letting it guess what it might be, and then updating its internal parameters (called weights) based on whether it got the answer right or not. This process is repeated many times with many different examples until the model reaches a point where it can no longer learn anything new, and training is stopped. Reaching a suitable level of accuracy can often mean feeding the model thousands to millions of examples. This is where the problem occurs, as finding and labelling that much data is a mammoth task. And not just any data will do: it needs to be clean, representative of the problem and labelled correctly.
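The guess, check and update cycle described above can be sketched in a few lines of Python. This is a minimal illustration using a perceptron, a far simpler model than a CNN, trained on a made-up four-example data set; the function names, the learning rate and the data are all assumptions chosen for the example.

```python
def predict(weights, bias, features):
    """Guess a label (1 or 0) from the current weights."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(examples, learning_rate=1.0, max_epochs=100):
    """Perceptron-style training over (features, label) pairs.

    Returns the learned weights, the bias and the number of passes used.
    """
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for features, label in examples:
            guess = predict(weights, bias, features)
            error = label - guess  # 0 when right, +1 or -1 when wrong
            if error != 0:
                mistakes += 1
                # Nudge each weight in the direction that reduces the error.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if mistakes == 0:
            # A full pass with no mistakes: nothing new left to learn.
            return weights, bias, epoch
    return weights, bias, max_epochs


# Toy data, invented for illustration: learn the logical OR of two inputs.
EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias, epochs = train(EXAMPLES)
print(f"stopped after {epochs} passes with weights={weights}, bias={bias}")
```

Real models stop on a held-out validation score rather than on zero training mistakes, and need vastly more examples, but the shape of the loop is the same.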

Good quality data is critical to how well the resulting machine learning model makes decisions when it sees new information. A simple example of the impact of poor quality, unrepresentative data comes from a group of researchers who wanted to build a system to tell dogs from wolves. As you would expect, the training data consisted of lots of pictures of dogs and wolves in their natural habitat: dogs on grass and wolves in the snow. However, what they ended up building was a snow classifier – a machine learning model that was able to tell whether there was snow in the image or not. In a way, what the model learnt was very smart: by picking up that snow was the common denominator, it found the shortest path to identifying wolves in images. However, as we all know, just because a dog is in the snow does not make it a wolf.
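The shortcut the model took can be reproduced on a toy scale. The sketch below is not the researchers' method; it assumes each photo has already been reduced to two invented yes/no features and uses the simplest possible learner, one that picks the single feature that best separates the training labels.

```python
# Hypothetical, hand-made features for six training photos (all values invented).
TRAIN = [
    ({"snow": 1, "long_muzzle": 1}, "wolf"),
    ({"snow": 1, "long_muzzle": 1}, "wolf"),
    ({"snow": 1, "long_muzzle": 0}, "wolf"),
    ({"snow": 0, "long_muzzle": 0}, "dog"),
    ({"snow": 0, "long_muzzle": 1}, "dog"),
    ({"snow": 0, "long_muzzle": 0}, "dog"),
]


def rule_accuracy(feature, data):
    """Accuracy of the rule 'feature present means wolf' on the data."""
    hits = sum((label == "wolf") == bool(feats[feature]) for feats, label in data)
    return hits / len(data)


def fit_stump(data):
    """Pick the single feature whose presence best predicts 'wolf'."""
    return max(data[0][0].keys(), key=lambda f: rule_accuracy(f, data))


best_feature = fit_stump(TRAIN)


def classify(feats):
    return "wolf" if feats[best_feature] else "dog"


# A dog photographed in the snow, which the training set never contained.
dog_in_snow = {"snow": 1, "long_muzzle": 0}
print(best_feature, classify(dog_in_snow))
```

Because snow separates the training labels perfectly, the learner chooses it over the animal's own features, and the dog in the snow is labelled a wolf.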

That’s where data visualisation can step in to help. By being able to better understand and visualise training data, machine learning specialists can produce better machine learning models with less effort. As a tool that readily combines data and images from multiple sources, both public and private, Zegami can make a huge difference in the quality of life for a data scientist.

By presenting vast quantities of structured and unstructured data within a single field of view, Zegami allows users to quickly see correlations, outliers, patterns and relationships. This makes it the perfect tool for data preparation and building training data sets as it becomes simple to see the overall shape of a data set, and to identify potential biases that may cause underperformance or misclassification in a machine learning model. In the case of wolves in the snow, by displaying all the images side by side in Zegami the pattern of white amongst wolves would have been immediately obvious.
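The same kind of bias can also be surfaced numerically. The following is a rough sketch, not Zegami's API: it assumes each image already carries a hypothetical background tag alongside its label, and flags any background where a single label dominates. The records, the `find_confounds` helper and the 0.8 threshold are all invented for illustration.

```python
from collections import Counter

# Hypothetical (label, background) metadata for a training set; counts are invented.
RECORDS = (
    [("wolf", "snow")] * 9 + [("dog", "snow")] * 1
    + [("dog", "grass")] * 9 + [("wolf", "grass")] * 1
    + [("wolf", "forest")] * 2 + [("dog", "forest")] * 2
)


def find_confounds(records, threshold=0.8):
    """Return backgrounds where one label makes up at least `threshold` of images."""
    per_background = Counter(background for _, background in records)
    per_pair = Counter(records)
    flagged = {}
    for (label, background), count in per_pair.items():
        share = count / per_background[background]
        if share >= threshold:
            flagged[background] = (label, share)
    return flagged


for background, (label, share) in sorted(find_confounds(RECORDS).items()):
    print(f"{background}: {share:.0%} of images are labelled {label}")
```

A background flagged this way is a hint that the model could learn the setting instead of the subject, which is the point at which looking at the images themselves becomes worthwhile.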

As they say when it comes to working with any data: garbage in, garbage out. This is especially true of machine learning. Spending the time to build and curate high-quality, well-labelled data sets is the only way to build a model of any real value.


Article by Roger Noble, Co-founder & CTO at Zegami. Zegami will be exhibiting on stand 142 at Big Data LDN 2019.
