Machine Learning

General Description

Machine Learning (ML) is a field that builds algorithms from data to improve or automate decision making. It is regarded as part of Artificial Intelligence (AI), although many ML applications are simple prediction tasks that would not usually be described as AI. Today the two are closely related: ML gives devices the ability to improve from experience without being explicitly programmed. Reinforcement Learning, in which a system learns from the rewards it receives for its actions, is one specific approach within ML.

For example, you rely on Machine Learning when you use a navigation app to find the fastest route while driving, or when you use an app to convert your voice into a text file. Machine Learning is part of our daily lives.

In practice, Machine Learning is largely about prediction: the "Machine" predicts results based on incoming data. Take the navigation example above. The app can suggest the best route because it collects data from other users of navigation and maps apps. That data (average speed, route chosen, location), combined with algorithms, makes prediction possible.

Machine Learning, together with computer vision, has improved many fields, from helping with disease diagnosis to analyzing thousands of financial transactions per second in search of fraud. Overall, it is useful both in very specific fields of research and in tasks that humans cannot perform because of their volume or complexity.

It can be used with several approaches:

Supervised learning

Supervised learning trains ML models on data where both the input and the expected output are known, so the model's predictions can be compared with the known outputs. The data scientist prepares and configures the model in an iterative process until its predictions are as accurate as possible with respect to the expected output.

This approach is well suited to prediction tasks (regression and classification).

Unsupervised learning

In this case the output is unknown, so the model is trained with no expected value to compare against; instead, it looks for structure and correlations in the data. It is useful when the human expert does not know what to look for in the data, so it is used for pattern detection and descriptive modeling. In the typical ML process it is used for clustering and feature reduction.

Semi-supervised learning

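This approach trains on a small amount of labeled data together with a larger amount of unlabeled data. It exploits the idea that, even though the group memberships of the unlabeled data are unknown, this data still carries important information about the group parameters.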

Tasks

Classification

Classification is used when the outcome must come from a limited set of categories; the model outputs the predicted category. It can be binary, with two output categories (such as True or False), or multi-class, where it predicts one of several categories (such as car types).
One example would be finding whether a received email is spam or not.
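
As a minimal sketch of binary classification, the snippet below labels an email as spam or not with a k-nearest-neighbors vote. The choice of k-nearest neighbors, the two features (number of links and number of exclamation marks), and all data values are assumptions made for illustration only.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort the training examples by Euclidean distance to x.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features per email: (number of links, number of exclamation marks).
train_X = [(0, 0), (1, 0), (0, 1), (7, 5), (8, 6), (6, 9)]
train_y = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(knn_predict(train_X, train_y, (7, 7)))  # a link-heavy email -> spam
print(knn_predict(train_X, train_y, (1, 1)))  # a plain email -> ham
```

Adding more label values to train_y would turn the same function into a multi-class classifier.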

Regression

This task predicts the numeric value of a label from a set of related features. For example, it can predict house prices based on house attributes such as number of bedrooms, location, or size.
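
The house price example can be sketched with the simplest form of regression, a straight line fitted by ordinary least squares on a single feature. The data below is made up and deliberately perfectly linear so the result is easy to check by hand; real data would be noisy and would usually involve several features.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: house size in square meters -> price in thousands.
sizes = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 100 + intercept  # predicted price for a 100 m2 house
print(slope, intercept, predicted)   # 3.0 0.0 300.0
```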

Clustering

Clustering organizes data into groups based on shared characteristics. It is an unsupervised learning technique, so the training process uses no output information. One example is identifying segments of hotel guests based on their habits and the characteristics of the hotels they choose.
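
One common clustering algorithm is k-means. The sketch below is a deliberately small one-dimensional version; the feature (nights stayed per year), the data, and the initial centers are hypothetical, and a real implementation would handle several features and choose its starting centers more carefully.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data; `centers` holds the initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical feature: nights stayed per year for each hotel guest.
nights = [1, 2, 2, 3, 20, 22, 25]
centers, clusters = kmeans_1d(nights, centers=[0, 10])
print(clusters)  # [[1, 2, 2, 3], [20, 22, 25]]
```

No labels are passed in: the two guest segments (occasional and frequent) emerge from the data alone.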

Tools and Techniques

Neural networks

An artificial neural network, also called a neural network or simply a neural net, uses a network of functions to translate input data of one form into a desired output, usually in another form. In general, it does not need to be programmed with specific rules that define what to expect from the input; instead, the algorithm learns from experience. The more examples and the greater the variety of inputs available, the more accurate the results typically become. Because of this flexibility, the technique can be applied in a very wide range of areas, including trajectory prediction for self-driving vehicles, cancer research, and object detection.
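
To illustrate learning from examples rather than from programmed rules, the sketch below trains a single artificial neuron (a perceptron) to reproduce logical AND. This is only the smallest building block of a neural network, not a multi-layer network, and the learning rate and number of epochs are arbitrary choices for this toy problem.

```python
def step(z):
    """Activation function: the neuron fires (1) only for positive input."""
    return 1 if z > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Learn two weights and a bias from ((x1, x2), target) examples."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            prediction = step(weights[0] * x1 + weights[1] * x2 + bias)
            error = target - prediction
            # Nudge the weights toward the correct answer; nothing changes if error == 0.
            weights[0] += learning_rate * error * x1
            weights[1] += learning_rate * error * x2
            bias += learning_rate * error
    return weights, bias

# Training examples for logical AND; the rule itself is never written in the code.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [step(weights[0] * x1 + weights[1] * x2 + bias)
               for (x1, x2), _ in and_samples]
print(predictions)  # [0, 0, 0, 1]
```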

Decision Trees

A decision tree can be used to represent decisions and decision making visually and explicitly, or to build a model that predicts the value of a target variable. As the name suggests, it uses a tree-like model of decisions in which each internal node tests an attribute and each leaf node corresponds to a class label. Decision trees belong to the supervised learning family, and the approach is often referred to as CART (Classification and Regression Trees). Their advantages include being simple to understand, interpret, and visualize; they also implicitly perform variable screening (feature selection) and can handle both numerical and categorical data. On the other hand, decision-tree learners can create over-complex trees that do not generalize well. They can also be unstable, because small variations in the data might result in a completely different tree being generated; this sensitivity is known as high variance.
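
A full tree is built by repeating one basic operation: choosing the attribute test that best separates the classes. The sketch below performs that operation once, on one numeric attribute, scoring each candidate threshold with Gini impurity (the measure commonly associated with CART). The loan data is hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels are identical."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return the threshold with the lowest weighted Gini impurity, and that impurity."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must put at least one sample on each side
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical data: applicant age -> loan decision.
ages = [22, 25, 30, 35, 40, 52]
approved = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, approved)
print(threshold, impurity)  # 30 0.0
```

The resulting internal node reads "age <= 30?", with a "no" leaf on one side and a "yes" leaf on the other. Changing a single label can move the threshold, which is the instability described above.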

Preparation and Tuning

Introduction

Good data preparation often takes more time than any other part of the process, but done well it leaves more time to test, tune, and optimize models to create greater value. The step before data preparation is defining the problem. Once that is decided, there are six critical steps to follow: data collection; exploration and profiling; formatting; improving data quality; feature engineering; and splitting the data into training and evaluation sets. This includes ensuring that data is formatted in a way that best fits the Machine Learning model; defining a strategy for dealing with erroneous data, missing values, and outliers; transforming input data into new features that better represent the business or reduce complexity; and finally splitting the input data into three datasets: one for training the algorithm, one for evaluation, and one for testing.
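
The final step, splitting into three datasets, can be sketched as follows. The 60/20/20 proportions and the fixed seed are assumptions chosen for illustration; the text above does not prescribe any ratio.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_validation],
            shuffled[n_train + n_validation:])  # the remainder is the test set

rows = list(range(10))  # stand-in for ten prepared data records
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))  # 6 2 2
```

Shuffling before splitting avoids a split that simply follows the order in which the data was collected.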

Normalization
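
The term normalization is used in more than one way. A minimal sketch, assuming it refers here to min-max scaling, which rescales a feature so that its values fall between 0 and 1; the sample values are made up.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant feature: there is no spread to rescale
    return [(v - lowest) / (highest - lowest) for v in values]

sizes = [50, 70, 90, 110]  # made-up house sizes in square meters
normalized = min_max_normalize(sizes)
print(normalized)
```

Rescaling like this keeps features with large numeric ranges from dominating features with small ones in distance-based models.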

Standardization
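
A minimal sketch, assuming standardization refers here to the usual z-score transformation: subtract the mean of the feature and divide by its standard deviation, so the result has mean 0 and standard deviation 1. The sample values are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: every z-score is 0
    return [(v - mu) / sigma for v in values]

sizes = [50, 70, 90, 110]  # made-up house sizes in square meters
standardized = standardize(sizes)
print(standardized)
```

Unlike min-max scaling, the result is not confined to a fixed range, so a single extreme value compresses the other values less.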

Hyperparameters

Tuning consists of maximizing a model’s performance, without overfitting or creating too much variance, by selecting appropriate “hyperparameters”. These must be set before training, either manually or by an automated search, unlike other model parameters, which are learned during training. Many types of hyperparameters are available, and some depend on the technique used. For now we will just retain the concept that a hyperparameter is a configuration value that controls the training and evaluation process.
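
A minimal sketch of tuning one hyperparameter by trying a few candidate values and keeping the one that scores best on held-out validation data. The model (k-nearest neighbors, where k is the hyperparameter), the candidate values, and the data, including one deliberately mislabeled training point, are all assumptions for illustration.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (one numeric feature)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, validation, k):
    """Share of validation points that are predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

# Made-up data: class "a" below 5, class "b" above; (3, "b") is a noisy label.
train = [(1, "a"), (2, "a"), (3, "b"), (4, "a"), (6, "b"), (7, "b"), (8, "b")]
validation = [(2.9, "a"), (3.2, "a"), (6.5, "b"), (1.5, "a")]

candidates = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])  # first best value wins ties
print(scores, best_k)
```

With k = 1 the model copies the noisy label and gets half of the validation points wrong; k = 3 outvotes it. The value of k is never learned from the training data itself, which is what makes it a hyperparameter.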

Note: Later we will update our Library with several hyperparameters and techniques.

Model Evaluation and Quality

Cross-validation

Cross-validation is a technique to evaluate and test a Machine Learning model that is easy to understand and easy to implement. It consists of holding back part of the data, training the model on the rest, and comparing the model's predictions on the held-back part with the real results to evaluate accuracy and quality. It is used in the supervised learning approach, for categorical predictions as well as numeric ones.

There are many CV methods, such as K-Fold and Leave-P-Out, but the simplest one, and one of the most common, is Hold-out.
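
Hold-out uses a single split into training and test data. K-Fold repeats the idea k times so that every sample is used for testing exactly once. A minimal sketch of how the K-Fold splits can be generated; the function name is hypothetical, and a real workflow would normally shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

folds = list(k_fold_indices(n_samples=6, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

The model is trained and scored once per fold, and the k scores are averaged to give the final estimate.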

Note: Later we will update our Library with explanations of these different types.

Mean squared error

Mean squared error (MSE) measures how far a model's predictions are from the real values. It is a specific type of loss function, calculated as the average (specifically the mean) of the squared errors, where each error is the difference between a predicted value and the real value. Its usefulness comes from the fact that errors are squared before they are averaged: squared numbers are never negative, so positive and negative errors cannot cancel each other out, and large errors are penalized more heavily. The mean squared error is also often used to get a better picture of the variance and bias of a model's predictions.
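
The calculation can be written in a few lines. The values below are made up so that the arithmetic is easy to follow by hand.

```python
def mean_squared_error(actual, predicted):
    """MSE = (1 / n) * sum((actual_i - predicted_i) ** 2)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# Errors are 1, 0, -2, -1; squared they are 1, 0, 4, 1; their mean is 6 / 4.
mse = mean_squared_error(actual, predicted)
print(mse)  # 1.5
```

Without squaring, the errors 1, 0, -2 and -1 would average to -0.5 and hide most of the inaccuracy.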

Overfitting

Cross-validation is very useful for assessing the effectiveness of a model, particularly when overfitting needs to be detected and mitigated. Overfitting occurs when a statistical model fits its training data too closely: the error rate on the training data is low, but the error rate on the test data is high. One way to reduce it is data augmentation. While it is better to add clean, relevant data, noisy data is sometimes added to make a model more stable. This method should be used sparingly, though.
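
The symptom of overfitting, low training error combined with high test error, can be reproduced on a toy problem. In the sketch below, one model memorizes the training data (a nearest-neighbor lookup) and another fits a straight line. Both models and the data, which follows y = 2x with some noise added to the training values, are assumptions chosen for illustration.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def nearest_neighbor_model(xs, ys):
    """Memorize the training data: predict the y of the closest training x."""
    def predict(x):
        closest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[closest]
    return predict

def linear_model(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

train_x = [1, 2, 3, 4, 5]
train_y = [2.5, 3.5, 6.5, 7.5, 10.5]  # y = 2x with alternating noise of +0.5 / -0.5
test_x = [1.4, 2.6, 3.4, 4.6]
test_y = [2.8, 5.2, 6.8, 9.2]         # y = 2x without noise

for name, build in [("memorizer", nearest_neighbor_model), ("line", linear_model)]:
    model = build(train_x, train_y)
    train_error = mse(train_y, [model(x) for x in train_x])
    test_error = mse(test_y, [model(x) for x in test_x])
    print(f"{name}: train MSE = {train_error:.2f}, test MSE = {test_error:.2f}")
```

On this data the memorizer reaches a training error of zero because it reproduces the noise exactly, yet its test error is far higher than that of the line, which accepts a small training error and generalizes better.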