Machine learning is an umbrella term for an array of statistics-based approaches to analyzing data, in which mathematical models are developed from the data to describe what things are or how they work. That’s pretty broad, but these techniques find use in all sorts of fields!
Classification is one important application of machine learning. In classification problems, an object or process 1) belongs to one of a limited number of groups or classes, and 2) has several measurable features that help determine which group it belongs to. One of the most widely used examples for presenting the concept of classification with machine learning is Fisher’s Iris dataset. It contains measurements from three species of iris plants (classes):
- Iris setosa
- Iris virginica
- Iris versicolor
Four measurements are taken for each iris sample (features):
- sepal length (sepals are the green structures that support the petals when in bloom and help protect the flower)
- sepal width
- petal length
- petal width
Together, these four measurements form what machine learning language calls a feature vector. Based on it, a wide range of algorithms can be employed to predict which species (class) a sample belongs to; that predicted class is the response. Some important algorithms that can be used for classification problems include Linear or Quadratic Discriminant Analysis (LDA, QDA), Support Vector Machines, Naïve Bayes, Neighbourhood Component Analysis (NCA), k-Nearest Neighbours (k-NN), Decision Trees, and Neural Networks.
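To make the idea of a feature vector concrete, here is a minimal sketch of one very simple classifier, a nearest-neighbour vote, written in plain Python. It is an illustration rather than a recommended implementation: the function name `classify`, the choice of `k=3`, and the handful of sample rows (transcribed from the commonly distributed Iris data and used here purely as examples) are all assumptions made for this sketch.

```python
import math
from collections import Counter

# Labelled samples: (sepal length, sepal width, petal length, petal width) in cm.
# A tiny illustrative subset; a real model would use far more samples.
TRAINING_SAMPLES = [
    ((5.1, 3.5, 1.4, 0.2), "Iris setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Iris setosa"),
    ((4.7, 3.2, 1.3, 0.2), "Iris setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "Iris versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "Iris versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris virginica"),
    ((5.8, 2.7, 5.1, 1.9), "Iris virginica"),
    ((7.1, 3.0, 5.9, 2.1), "Iris virginica"),
]


def classify(feature_vector, samples=TRAINING_SAMPLES, k=3):
    """Return the majority class among the k samples closest to feature_vector."""
    # Sort the labelled samples by Euclidean distance to the query.
    by_distance = sorted(samples, key=lambda s: math.dist(feature_vector, s[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k nearest neighbours wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0]


# Two hypothetical flowers, each described by its four measurements.
print(classify((5.0, 3.4, 1.5, 0.2)))  # prints "Iris setosa"
print(classify((6.0, 2.9, 4.5, 1.5)))  # prints "Iris versicolor"
```

Notice that nothing in the code describes what makes a setosa a setosa. The prediction comes entirely from comparing the new feature vector against labelled examples.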
One of the key things that makes machine learning models incredibly useful is that fundamental equations like Newton’s Laws of Motion are often not required or involved at all. Indeed, for cases like the classification of iris species, no such simple “equation” exists. The same could be said for identifying the make and model of a wind turbine based on wind speed data and gearbox acoustic signature, or for determining credit ratings based on outstanding debt, industry, or working capital.
Text analysis of movie reviews is another example. Individual words are categorized as “positive”, “negative”, or “neutral” and converted to a machine-friendly format (numbers), and the review is run through a model. The model classifies the review as “good” or “bad”, but it is understood that such a rating is a statistical estimate and not necessarily definitive. As with the other examples, it isn’t based on any kind of closed-form mathematical equation.
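The review example can be sketched in a few lines of Python. This is a deliberately crude stand-in for a trained model: the word list `WORD_SCORES` is hypothetical and hand-written, and squashing the summed score through a logistic function is just one assumed way to turn it into a probability-like number.

```python
import math
import re

# Hypothetical word scores: +1 for positive words, -1 for negative words.
# Any word not listed here is treated as neutral (0).
WORD_SCORES = {
    "wonderful": 1, "moving": 1, "great": 1,
    "boring": -1, "dull": -1, "terrible": -1,
}


def review_to_numbers(review):
    """Convert a review into a list of numbers, one per word."""
    words = re.findall(r"[a-z']+", review.lower())
    return [WORD_SCORES.get(word, 0) for word in words]


def probability_good(review):
    """Map the summed word scores to a value between 0 and 1."""
    total = sum(review_to_numbers(review))
    return 1.0 / (1.0 + math.exp(-total))


def classify_review(review):
    """Label a review; a value of exactly 0.5 falls to "good" in this sketch."""
    return "good" if probability_good(review) >= 0.5 else "bad"


print(review_to_numbers("A wonderful, moving film"))    # prints [0, 1, 1, 0]
print(classify_review("A wonderful, moving film"))      # prints "good"
print(classify_review("Dull plot and terrible acting")) # prints "bad"
```

The output of `probability_good` is what makes the rating an estimate rather than a verdict: a review scoring close to 0.5 is one the model is unsure about.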
Machine learning models are mathematical structures with parameters that can be tuned via a process called training. Training samples, such as individual iris flowers, are presented to the algorithm; the features of each flower are the sepal and petal measurements. Each feature set is accompanied by a tag telling the model what the corresponding species (class) is. This is the ground truth, and it is usually prepared by a person. As more examples are presented to the model, the tuneable parameters are updated, gradually yielding better predictions.
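As a rough illustration of what "updating tuneable parameters" can look like, the sketch below trains a perceptron, one of the simplest trainable models, to separate two iris species using only the two petal measurements. The perceptron is chosen here for brevity and is an assumption of this example, as are the function names and the small set of sample rows.

```python
# (petal length, petal width) in cm, tagged with the ground truth:
# -1 for Iris setosa, +1 for Iris versicolor. A tiny illustrative subset.
TRAINING_SET = [
    ((1.4, 0.2), -1), ((1.4, 0.2), -1), ((1.3, 0.2), -1),
    ((4.7, 1.4), +1), ((4.5, 1.5), +1), ((4.9, 1.5), +1),
]


def predict(weights, bias, features):
    """Return +1 or -1 depending on which side of the line the sample falls."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1


def train(samples, max_epochs=20):
    """Tune the weights and bias, one training sample at a time."""
    weights, bias = [0.0, 0.0], 0.0
    mistakes_per_epoch = []
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in samples:
            if predict(weights, bias, features) != label:
                # Wrong prediction: nudge the parameters toward the correct tag.
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
                mistakes += 1
        mistakes_per_epoch.append(mistakes)
        if mistakes == 0:
            break  # every training sample is now classified correctly
    return weights, bias, mistakes_per_epoch


weights, bias, history = train(TRAINING_SET)
print("mistakes in each pass over the data:", history)
print("learned parameters:", weights, bias)
```

The mistake count does not have to fall on every pass, but for data that a straight line can separate, as is the case for these six samples, it reaches zero after a finite number of updates.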
Once trained, the model is checked against test data: tagged samples that it did not see during training. If the model misclassifies a large number of the test data points, more training data are collected, and the model is updated again. Indeed, the model is “learning”, just like people do! When we are children, our parents show us pictures of dogs and cats (data points, each a feature vector in the form of an image) and tell us what they are (tags: “dog” or “cat”). Our neurons build new connections (update parameters) and after repeating the process many times, eventually we can tell the difference on our own.
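Checking a model against samples it has not seen can be sketched as follows. The nearest-centroid classifier, the helper names, and the choice to hold out the last sample of each species are all assumptions made for this example, and with only three test samples the resulting accuracy figure says very little on its own.

```python
import math

# (sepal length, sepal width, petal length, petal width) in cm, grouped by tag.
# A tiny illustrative subset; a real evaluation needs far more data.
LABELLED = {
    "Iris setosa": [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2)],
    "Iris versicolor": [(7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5)],
    "Iris virginica": [(6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1)],
}

# Hold out the last sample of each species; the model never trains on these.
train_set = {species: rows[:-1] for species, rows in LABELLED.items()}
test_set = [(rows[-1], species) for species, rows in LABELLED.items()]


def fit_centroids(train):
    """'Train' by averaging each feature over the samples of each species."""
    return {
        species: tuple(sum(column) / len(rows) for column in zip(*rows))
        for species, rows in train.items()
    }


def predict(centroids, features):
    """Pick the species whose average flower is closest to the sample."""
    return min(centroids, key=lambda species: math.dist(features, centroids[species]))


def accuracy(centroids, test):
    """Fraction of held-out samples whose predicted tag matches the ground truth."""
    correct = sum(predict(centroids, features) == label for features, label in test)
    return correct / len(test)


centroids = fit_centroids(train_set)
print(f"test accuracy: {accuracy(centroids, test_set):.2f}")
```

If the accuracy on the held-out samples came out low, that would be the cue to gather more tagged examples, add them to `train_set`, and fit again.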