A Brief Introduction to Machine Learning
- Posted by Yomna Anwar
- On September 5, 2021
“Machine Learning is the study of computer algorithms that improve automatically through experience.”
— Machine Learning, Tom Mitchell, McGraw Hill, 1997.
Machine learning is a branch of artificial intelligence that focuses on using algorithms to find patterns in data and better understand it. It is based on building models from data so that predictions can be made about future data without the system being explicitly programmed to do so.
Today it is used in a wide variety of applications, ranging from Google Maps to email spam filters.
So how do we create this “model” that will do all the magic? Well, it is not that simple. Based on the experience of Data Scientists and Machine Learning Engineers, the most efficient workflow is:
1. Data collection: you must first collect the data that the algorithm will learn from.
2. Data preparation: this could be done by selecting the features that you want the algorithm to use (feature selection), fixing missing data (data cleansing), randomizing the data, and splitting the data into training, testing, and validation sets.
3. Choosing a model: simply choose the algorithm you want; this will be discussed in a bit more detail later.
4. Train the model: run the model on your training data and let it try to find a pattern in the data.
5. Evaluate the model: use a preferred metric and evaluate the performance of the model. Does it achieve the accuracy you would like? Does it correctly identify what you want?
6. Hyperparameter tuning: tune your model’s hyperparameters, as this usually improves the model’s performance.
7. Make predictions: everyone’s favourite step. The model is done; now is the time to use it. The model is run on the test data to ensure that it works correctly on data it has not been trained on.
Throughout this blog post, each step will be explained using a real-life example that we worked on, EEHC (Egypt’s Electricity Holding Companies). EEHC was a project that aimed to create one billing system that unifies all payments related to electricity bills throughout Egypt. We worked on two business problems: predicting the monthly electricity usage, and predicting the monthly number of new electricity meters that will be installed.
I. Data Collection
As easy as it sounds, you must keep in mind that the quality of the collected data will greatly affect the model. Quality here means that missing values and outliers should be as few as possible, and the data should be reliable, consistent, accurate, and complete.
Working on EEHC meant we had access to millions of records, from which we were then able to extract the relevant data that we would use in the next step. All the data was saved in comma-separated values (CSV) format and was then imported to perform the next step in the pipeline. Below is an example of the data.
Figure 1: Example of the data used for the EEHC project, for the meter counts use case.
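As a rough sketch of what this import step can look like in Python, the snippet below reads such a CSV export with pandas. The file name and column name are hypothetical placeholders for illustration, not the actual EEHC files.

```python
# A minimal sketch of importing collected CSV data with pandas.
# The file name and the date column are hypothetical placeholders.
import pandas as pd

# Load the exported records into a DataFrame
df = pd.read_csv("eehc_meter_counts.csv", parse_dates=["Billing_Month"])

# Quick sanity checks on what was collected
print(df.shape)         # number of records and columns
print(df.dtypes)        # data type of each column
print(df.head())        # first few rows
print(df.isna().sum())  # missing values per column
```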
II. Data Preparation
Finding the perfect dataset is not an easy thing to do; therefore, this step is extremely important, and contrary to popular belief, it usually takes the most time and effort. In this step, you should work on removing duplicates, dealing with missing data, randomizing the data, and normalizing and scaling it.
Another important part of this step is choosing the features that will be used as input to your model; however, this is not straightforward. Feature selection is often considered an art form, since there is no fixed set of steps guaranteed to yield the best features, but there are some helpful indicators. The following could be taken into consideration to help with feature selection:
• The percentage of missing values
• Amount of variation
• The correlation with the target variable
• Pairwise correlation
In our project, the missing data were the NULL values in the ‘Closed_Meters’ column. Fortunately, in this case we knew that the null values represented a 0, so this was easy to fix. We then randomized the data, meaning we shuffled the rows to remove any effect the order of the data might have on the algorithm. Finally, we normalized and scaled the data, in other words changed its range, mean, and standard deviation, to avoid the scenario where a value in the 1000s gets more weight than a value in the 0.1s just because it is larger.
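A minimal sketch of those cleaning steps with pandas and scikit-learn might look like the following. The DataFrame `df` and the ‘Closed_Meters’ column are the only assumptions carried over from the description above; this is not the actual EEHC code.

```python
# A sketch of the cleaning steps described above, assuming a pandas DataFrame `df`.
from sklearn.preprocessing import MinMaxScaler

# Fix missing data: here NULLs are known to mean zero closed meters
df["Closed_Meters"] = df["Closed_Meters"].fillna(0)

# Remove exact duplicate records
df = df.drop_duplicates()

# Randomize (shuffle) the rows so their order carries no signal
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Scale numeric features so large-valued columns do not dominate
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
```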
One of the ways to help with data preparation is visualizing the data; plotting some graphs really helps. For example, a box plot can help with outliers. Figure 2 shows a boxplot representing the ages of students in a university. As you can see, most students lie between the 17-18 and 23-24 ages; however, there are also students aged 15 and 27. This does not mean that those values are necessarily wrong, it just shows the ranges. However, there are one or more students aged 50+, and the boxplot shows this in the form of an outlier (the circle). Depending on the business problem, such a value could be removed; in this case, it is highly unlikely that there is a 50-year-old student in the university, so this record would most likely be treated as an error and handled.
Figure 2: A boxplot of students’ ages in a university.
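To illustrate, a boxplot like Figure 2 can be produced with a few lines of matplotlib; the ages below are made up for the example, not real student records.

```python
# A small example of spotting outliers with a boxplot, similar to Figure 2.
import matplotlib.pyplot as plt

ages = [17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 24, 15, 27, 52]

plt.boxplot(ages)
plt.ylabel("Age")
plt.title("Ages of students in a university")
plt.show()  # the 52-year-old shows up as a circle (outlier) above the whiskers
```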
The final step in data preparation is splitting the data into training, validation, and test sets. The training set, as its name suggests, is used to train the model. The validation set is used to tune the model: after training we must make sure that the model works on new data, so we run it on the validation set and tune its parameters to improve its performance. Finally, we run the model on the test set, fresh data it has never seen before (neither in training nor when tuning), to ensure that the model is able to generalize to new data.
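One common way to obtain such a split is scikit-learn's train_test_split applied twice. The 60/20/20 proportions below are only an example, and for time series data (like the EEHC series) a chronological split is usually used instead of a random one.

```python
# A sketch of a train/validation/test split with scikit-learn.
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set
train_val, test = train_test_split(df, test_size=0.20, random_state=42)

# Split the remaining 80% into training and validation (60% / 20% overall)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)
```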
In our project, we removed any duplicate records and outliers; since there was a huge amount of data and the corrupted records were very few in comparison, this did not cause an issue. We used the data collected over 5 years (from 2014 to 2019), which covered around 790k+ meters (for the meter count use case), and around 75% of the data was used for training. After that, we extracted more of the data that would be used as input to the model and applied some transformations to it.
III. Choosing a model
There are 3 main categories of machine learning:
a. Supervised learning:
In supervised learning, we have two main sets of variables: the features (what will be used to make the prediction) and the label (the prediction itself). An example of this is giving the machine pictures of cats and dogs while letting it know which is which; the machine then learns patterns from the given data that help it differentiate between cats and dogs.
There are two types of supervised learning models: classification and regression. Classification models try to classify the input; as in the cats and dogs example, whatever the input is, the output is always one of a set of discrete values. Regression, however, is when your prediction is a continuous value. An example of this is predicting the price of a house given some features (like the number of rooms and the size of the land); the price is not limited to a set of discrete values, in fact, it could be any value at all.
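As a quick illustration of the two settings, the sketch below fits a classifier and a regressor on scikit-learn's built-in toy datasets; it is unrelated to the EEHC data.

```python
# Classification vs. regression on scikit-learn toy datasets.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the label is one of a discrete set of classes
X_cls, y_cls = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))   # predicted class indices

# Regression: the label is a continuous value
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))   # predicted continuous values
```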
b. Unsupervised learning:
In unsupervised learning, there is no label, and the goal is to find some sort of pattern in the data. An example of this is customer segmentation: you have some data about 100,000 customers and you want to find which customers behave similarly; this would be done using an unsupervised learning algorithm.
Again, there are two main types of unsupervised learning: clustering and association. Clustering is simply grouping similar data points together, like the customer segmentation example. Association is what powers recommender algorithms; it finds patterns that associate information between data points. For example, if customer 1 watched movies A, B, and C and liked all of them, and another customer, customer 2, watched movies A and B, then there is a good chance that customer 2 will like movie C.
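A tiny clustering sketch in the spirit of the customer segmentation example, using made-up customer features and scikit-learn's KMeans:

```python
# Grouping hypothetical customers into clusters by two features.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [monthly_spend, visits_per_month]
customers = np.array([
    [20, 1], [25, 2], [22, 1],     # low spend, rare visits
    [200, 8], [210, 9], [190, 7],  # high spend, frequent visits
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers)
print(kmeans.labels_)  # which cluster each customer was assigned to
```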
c. Reinforcement learning:
In reinforcement learning, the machine is taught a series of actions where each action has a reward; the machine is then tasked with reaching a certain goal, taking each action with the reward from the previous step as feedback. An example of reinforcement learning is self-driving cars.
Figure 3: A simple diagram to help differentiate between the main learning categories.
There are different algorithms for each category; however, they will not be discussed within the scope of this blog.
In EEHC we had our data, and it was labelled, so which category do you think this business problem falls under? If you said Supervised Learning, you are absolutely correct! At this stage our data was ready, and since we were predicting either the number of meters or the amount of electricity used, and we had those values for historical data, our problem was a Supervised Learning problem.
IV. Training the model:
Given the features and what you want to predict, the algorithm assigns each feature a weight; the larger the weight, the more important the feature. The algorithm runs for a set number of iterations, and on each run it adjusts the weights until it reaches the best set of weights it can find within the given number of iterations. This results in an equation consisting of the features and their weights that will be used to predict future values.
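To make the "weights adjusted over iterations" idea concrete, here is a bare-bones gradient descent loop on a linear model with synthetic data; it only illustrates the mechanism, and is not the training procedure used in EEHC.

```python
# Repeatedly adjust feature weights to reduce the prediction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)                          # start with all weights at zero
learning_rate = 0.1
for _ in range(200):                     # fixed number of iterations
    predictions = X @ w
    gradient = 2 * X.T @ (predictions - y) / len(y)
    w -= learning_rate * gradient        # nudge the weights to reduce the error

print(w)  # ends up close to the true weights [2.0, -1.0, 0.5]
```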
In EEHC we formatted the data in a way that enabled us to use a time series analysis model, and we chose SARIMA (Seasonal Autoregressive Integrated Moving Average), an algorithm used for time series forecasting that is implemented in the pmdarima library. Since we already know the value we are trying to predict for the historical data, meaning our data is labelled, this is considered a Supervised Learning approach.
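A hedged sketch of what fitting such a model with pmdarima can look like is shown below; `monthly_counts` is a placeholder for a monthly series such as the meter counts, and the settings are illustrative rather than the exact ones used in the project.

```python
# Fitting a seasonal ARIMA model with pmdarima's auto_arima, which searches
# for reasonable (p, d, q)(P, D, Q, m) orders automatically.
import pmdarima as pm

model = pm.auto_arima(
    monthly_counts,        # placeholder: a monthly time series
    seasonal=True,
    m=12,                  # 12 observations per seasonal cycle (monthly data)
    stepwise=True,
    suppress_warnings=True,
)

# Forecast the next 6 months
forecast = model.predict(n_periods=6)
print(forecast)
```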
V. Evaluate the model:
At this stage, you start running the model on the validation set and start considering a good business metric to ensure your model satisfies the business needs. Checking whether your model is good varies based on your business problem. For example, if you are building a model to identify cancer cells, then you might be a lot more careful about marking a malignant tumor as benign, since you do not want to miss the opportunity to provide early treatment to a patient.
Before we dive deeper into the most common ways of evaluating a model, we need to define a couple of terms that will be used later.
True positives (TP): Values that are predicted to be true and are true. For example, predicting that an image is of a cat and is indeed a cat.
False positives (FP): Values that are predicted to be true, however, are false. For example, predicting that an image is of a cat, and it turns out it is a dog.
True negatives (TN): Values that are predicted to be false and are false. For example, predicting that an image is not a cat, and it is indeed not a cat.
False negatives (FN): Values that are predicted to be false, however, are true. For example, predicting that an image is not a cat, and it turns out it is a cat.
To better represent the values above, we use something called a confusion matrix:
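For reference, scikit-learn can compute a confusion matrix directly; the labels below are made up for illustration (1 = cat, 0 = not a cat).

```python
# A quick way to compute a confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_true, y_pred))
# [[TN FP]
#  [FN TP]]
```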
Now let us start explaining what the most common ways of evaluating a model are:
Accuracy: The most commonly used metric. Its equation is: (TP+TN)/(TP+FP+TN+FN)
Accuracy, however, is not always favoured, especially if we care about false negatives. For example, suppose we are trying to detect whether a patient has cancer, and out of 100 samples, 90 do not have cancer. If an algorithm is (accidentally) built that always predicts false (always votes that a person does not have cancer), its accuracy would be 90%, even though in this case it is extremely important to account for false negatives: people who the algorithm said do not have cancer while they actually do!
Precision: Simply the percentage of actual positives out of everything the algorithm predicted to be positive. Its equation is: TP/(TP+FP)
Recall: Similar to precision; however, this time we calculate the percentage of correctly predicted positives out of the actual positives. Its equation is:
TP/(TP+FN)
F1 score: A metric that considers both precision and recall by taking their harmonic mean. Its equation is:
(2*precision*recall)/(precision+recall)
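The same metrics are available in scikit-learn, so you rarely need to compute them by hand; the toy labels below are the same made-up ones used for the confusion matrix above.

```python
# Computing the common evaluation metrics with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```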
From this stage onwards in our machine learning pipeline, we iterated: we went back to choosing a model and experimented with different algorithms, re-training and re-evaluating until we found one that best suited the requirements and helped us achieve satisfactory evaluation scores.
VI. Hyperparameter tuning:
Each algorithm has hyperparameters that slightly alter the way it behaves, and this step refers to tuning them. Which hyperparameters exist varies depending on the algorithm you chose; common examples include the batch size and the learning rate.
There are multiple ways to tune the hyperparameters of a model; the two most commonly used methods are described below, with a small code sketch after them:
Grid search: create a list of values that you want to try for each hyperparameter, then build and evaluate a model for every possible combination of values from the different lists.
Random search: like grid search; however, instead of defining lists for each hyperparameter, you define ranges from which the algorithm samples random values and tries them.
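Here is a small sketch of both approaches with scikit-learn; the random forest, parameter names, and ranges are illustrative choices, not the hyperparameters tuned in EEHC, and `X_train`/`y_train` are placeholders for your training data.

```python
# Grid search vs. random search with scikit-learn.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint

model = RandomForestRegressor(random_state=42)

# Grid search: try every combination from explicit lists of values
grid = GridSearchCV(
    model,
    param_grid={"n_estimators": [100, 200], "max_depth": [5, 10, None]},
    cv=3,
)

# Random search: sample a fixed number of combinations from ranges/distributions
rand = RandomizedSearchCV(
    model,
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(3, 15)},
    n_iter=10,
    cv=3,
    random_state=42,
)

# grid.fit(X_train, y_train); rand.fit(X_train, y_train)
# print(grid.best_params_, rand.best_params_)
```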
VII. Making predictions:
At this point in our pipeline, we start using unseen data, i.e. the data we withheld from the model: the testing data. Now is the time to show our model entirely new data and let it make its predictions. This step is extremely important, as it acts as a good indicator of how the model will behave on production data.
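Continuing the earlier pmdarima sketch, evaluating on the test set could look like this; `model` and `test_counts` are placeholders for the fitted model and the withheld months.

```python
# Comparing forecasts against the held-out test months.
from sklearn.metrics import mean_absolute_error

test_forecast = model.predict(n_periods=len(test_counts))
print(mean_absolute_error(test_counts, test_forecast))  # average error on unseen months
```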
VIII. Conclusion:
In conclusion, Machine Learning is not simply choosing an algorithm and passing some data through it; it is a much larger pipeline. Fortunately, each step on its own is not that difficult.
Make sure to always have a large enough dataset so that your model sees enough data and is able to make much more accurate predictions. You then have to make sure that your data is clean and that your final dataset has the features you wish to pass to the model. After that, you must choose an algorithm that best fits your business problem; note that you do not need to choose the most complex algorithm, and there is no single ‘algorithm-fits-all’. Then you train your model on your training data, ensure it performs well on your evaluation metrics, and tune its hyperparameters. Finally, your Machine Learning model is ready for production! By now you hopefully have a basic understanding of what each step in the pipeline is and are able to dive deeper into whichever step interests you.