Here is a summary of the section ‘Machine Learning’ from Reading 8 of the CFA Level II curriculum.
Section No. 7
In real time, huge amounts of data (commonly called big data) are being created by institutions, businesses, governments, financial markets, individuals, and sensors (e.g. satellite imaging). Investors generally use big data to find better investment opportunities.
Big data covers data from traditional and non-traditional sources. Analysis of big data is challenging because:
1. non-traditional data sources are often unstructured; and
2. traditional statistical methods do not perform well in establishing relationships among the data at such a massive scale.
Machine learning (advanced computer techniques), computer algorithms, and adaptable models are used to study relationships among the data. Extracting information from big data using such techniques is called data analysis (a.k.a. ‘data analytics’).
Major Focuses of Data Analytics
Six focuses of data analytics include:
Determining synchronous relationships between variables, i.e. how variables tend to covary.
Identifying variables that can help predict the value of a variable of interest.
Making causal inferences
Causal inference focuses on determining whether an independent variable causes changes in the dependent variable. Causal inference implies a stronger relationship between variables than correlation or prediction does.
However, in real-world situations, estimating causal effects in the presence of confounding variables (variables that influence both the dependent and independent variables) is challenging.
Classifying data into categories
Classification focuses on sorting observations into distinct categories. Variables can be continuous (such as time or weight) or categorical (countable, distinct groups). When the target is a categorical variable, the econometric model is called a classifier.
Many classification models are binary classifiers (two possible values, 0 or 1); others are multi-category classifiers (ordinal or nominal). Ordinal variables follow some natural order (small/medium/large, or low-to-high ratings). Nominal variables do not follow any natural order (e.g. equity, fixed income, alternative investments).
Sorting data into clusters
Clustering focuses on sorting observations into groups based on similar attributes or a set of criteria that may or may not be pre-specified.
Reducing the dimension of data
Dimension reduction is the process of reducing the number of independent variables while retaining the variation across observations.
When applied to data with a large number of attributes, dimension reduction makes it easier to visualize the data on computer screens.
For out-of-sample forecasting, simpler models often perform better than complex models.
In quantitative investment and risk management, dimension reduction improves performance by focusing on the major factors that drive asset price movements.
All these problems (prediction, clustering, dimension reduction, classification etc.) are often solved by machine learning methods.
What is Machine Learning?
Machine learning (ML) is a subset of artificial intelligence (AI).
Machine learning (ML) uses statistical techniques that give computer systems the ability to learn from data without being explicitly programmed.
The ML program uses inputs from historical databases, trends, and relationships to discover hidden insights and patterns in the data.
Types of Machine Learning
Two broad categories of ML techniques are:
1. Supervised learning
Supervised learning uses labeled training data (a set of inputs, each paired with a known output, supplied to the program) and processes that information to learn how inputs map to outputs. Supervised learning follows the logic of ‘X leads to Y’.
For example, consider a ML program that predicts whether credit card transactions are fraudulent or not.
This is a binary classifier where the transaction is either fraudulent (value = 1) or non-fraudulent (value = 0).
The ML program collects input from the growing database of credit card transactions labeled ‘fraudulent’ or ‘non-fraudulent’ and learns the relationship from experience.
The performance is measured by the percentage of transactions accurately predicted.
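The fraud example above can be sketched in Python with scikit-learn. The feature names, synthetic data, and choice of algorithm (logistic regression) are illustrative assumptions; the curriculum does not specify any of them.

```python
# Illustrative sketch only: synthetic labeled transactions with invented
# features (amount, hour of day). Not the curriculum's actual method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
amount = rng.exponential(scale=100, size=n)   # transaction size
hour = rng.integers(0, 24, size=n)            # hour of the day
X = np.column_stack([amount, hour])
# Label each transaction: 1 = fraudulent, 0 = non-fraudulent
y = (((amount > 200) & (hour < 6)) | (rng.random(n) < 0.02)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Performance: percentage of transactions predicted correctly
accuracy = clf.score(X_test, y_test)
print(f"accuracy: {accuracy:.2%}")
```

As more labeled transactions accumulate, the model is simply refit on the larger dataset; this is the “learning from experience” described above.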
2. Unsupervised learning
Unsupervised learning does not make use of labeled training data and does not follow the logic of ‘X leads to Y’. There are no outcomes to match to; instead, the input data is analyzed and the program discovers structure within the data itself.
One application of unsupervised learning is ‘clustering’, where the program identifies similarities among data points and automatically splits the data into groups based on shared attributes.
Some additional ML categories are ‘deep learning’ (an ML program using a neural network with many hidden layers) and ‘reinforcement learning’ (an ML program that learns through trial and error from interacting with its environment, for example by playing against itself).
Machine Learning Vocabulary
ML terminology differs from the terms used in statistical modeling.
The Y variable (the dependent variable in regression analysis) is called the target variable (or label) in ML.
The X variables (the independent variables in regression analysis) are known as features in ML.
In ML terminology, organizing features for ML processing is called feature engineering.
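A minimal sketch of feature engineering, using pandas and hypothetical column names (both are assumptions for illustration):

```python
# Hypothetical example: turning raw columns into model-ready features.
import pandas as pd

raw = pd.DataFrame({
    "price": [10.0, 12.5, 11.0, 9.5],
    "volume": [100, 150, 120, 90],
    "sector": ["equity", "fixed income", "equity", "alternative"],
})

# Engineered numeric feature combining two raw columns
features = pd.DataFrame({"dollar_volume": raw["price"] * raw["volume"]})

# A nominal variable has no natural order, so it is encoded as
# one-hot (dummy) columns rather than as arbitrary numbers
features = features.join(pd.get_dummies(raw["sector"], prefix="sector"))
print(features.columns.tolist())
```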
Machine Learning Algorithms
The following sections provide description of some important models and procedures categorized under supervised and unsupervised learning.
Neural networks are commonly included under supervised learning, but they are also important in reinforcement learning, which is usually treated as a separate category of ML rather than as part of unsupervised learning.
Supervised learning is divided into two classes based on the nature of the Y variable: regression and classification. The two classes use different ML techniques.
When the Y variable is continuous, supervised ML techniques include linear and non-linear models, often used for prediction problems.
When the Y variable is categorical or ordinal, classification techniques include CART (classification and regression trees), random forests, and neural networks.
Penalized regression is a computationally efficient method used to solve prediction problems. Penalized regression (which imposes a penalty on the size of the regression coefficients) improves prediction in large datasets by shrinking the number of independent variables and managing model complexity.
Penalized regression and other forms of linear regression, such as multiple regression, can be classified as special cases of the generalized linear model (GLM).
A GLM is a linear regression whose specification can be changed based on two choices:
1. the maximum number of independent variables the researcher wants to use; and
2. how good a model fit is required.
In large datasets, an algorithm may begin modeling unnecessarily complex relationships among the many variables and produce estimates that do not perform well on new data.
This problem is called overfitting. Penalized regression addresses overfitting through regularization (penalizing the number and magnitude of the model’s coefficients). In prediction, parsimonious models (those with fewer parameters) are less subject to overfitting.
Penalized regression is similar to ordinary linear regression but adds a penalty that increases as the number of variables increases.
The purpose is to regularize the model so that only variables that help explain Y remain in it. Penalized regressions therefore involve a trade-off between a variable’s contribution to model fit and the penalty it incurs.
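The fit-versus-penalty trade-off can be sketched with an L1-penalized (lasso) regression on synthetic data. The data, penalty value, and use of scikit-learn are illustrative assumptions, not the curriculum's method.

```python
# Sketch: 20 candidate features, but only the first 3 truly drive y.
# The L1 penalty shrinks the irrelevant coefficients toward exactly zero.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)   # alpha controls the penalty size

# OLS keeps every variable; the penalized model keeps only those whose
# contribution to fit outweighs the penalty
print("nonzero OLS coefficients:  ", int(np.sum(np.abs(ols.coef_) > 1e-6)))
print("nonzero lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))
```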
Classification and Regression Trees
CART is a common supervised ML method that can be used for both classification and regression (prediction) problems.
CART model is:
- computationally efficient
- adaptable to complex datasets
- usually applied where the target is binary
- useful for interpreting how observations are classified
The CART model is represented by a binary tree (two-way branches). CART works on pre-classified training data. Each node represents a single input variable (X).
A classification tree is formed by splitting each node into two distinct subsets, and the process of splitting the derived subsets is repeated recursively.
The process ends when further splitting is not possible (observations cannot be divided into two distinct groups). The final node, called a terminal node, holds a category based on the attributes shared by the observations at that node.
The chosen cut-off value (the value at which observations are split into two groups) is the one that minimizes classification error; therefore, observations in each subsequent division have lower error within their group.
Some parts of the tree may turn out to be denser (a greater number of splits) and others simpler (fewer splits).
Classification tree vs Regression Tree
A classification tree is used when the target variable is categorical.
A regression tree is used when the target variable is continuous (numeric).
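The distinction can be sketched with scikit-learn's tree estimators on synthetic data (the data-generating rules below are invented for illustration):

```python
# Same CART idea, two flavors: a categorical target gives a classification
# tree; a numeric target gives a regression tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(300, 2))

# Categorical (binary) target -> classification tree
y_class = (X[:, 0] + X[:, 1] > 10).astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_class)

# Continuous target -> regression tree
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y_reg)

print("classification accuracy:", round(clf.score(X, y_class), 2))
print("regression R^2:", round(reg.score(X, y_reg), 2))
```

Each fitted tree recursively splits on cut-off values of X, ending in terminal nodes that hold either a category (classifier) or a numeric prediction (regressor).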
Random Forest
Random forest is an ML technique that ensembles many decision trees, each built from a random selection of features, to produce accurate and stable predictions.
Splitting a node in a random forest is based on the best feature from a random subset of n features. Therefore, each tree varies marginally from the other trees.
The power of this model is based on the idea of ‘wisdom of crowd’ and ensemble learning (using numerous algorithms to improve prediction).
Each classification tree votes, and any new observation is classified by majority vote. The use of random subsets across the pool of classification trees prevents overfitting and also reduces the ratio of noise to signal.
CART and random forest techniques are useful for solving classification problems in investment and risk management (such as predicting IPO performance or classifying text by positive or negative sentiment).
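The “wisdom of crowds” effect can be sketched by comparing a single tree with a forest on the same synthetic classification problem (the data and settings are illustrative assumptions):

```python
# A random forest's prediction is the majority vote of many trees, each
# trained with a random subset of features at every split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X_tr, y_tr)

print("single tree accuracy:", round(tree.score(X_te, y_te), 2))
print("forest accuracy:     ", round(forest.score(X_te, y_te), 2))
```

`max_features="sqrt"` is what makes each split consider only a random subset of features, so the trees differ from one another and their vote averages out individual errors.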
Neural networks are also known as artificial neural networks (ANNs). They are appropriate for nonlinear statistical data and for data with complex interactions among variables. Neural networks contain nodes connected by links (arrows).
ANNs have three types of interconnected layers:
- an input layer
- hidden layers
- an output layer
Input layer consists of nodes, and the number of nodes in the input layer represents the number of features used for prediction.
For example, consider a neural network that has an input layer with three nodes representing three features used for prediction, two hidden layers with four and three hidden nodes respectively, and an output layer with one node.
For this sample network, the four numbers (3, 4, 3, and 1) are hyperparameters (variables set by humans that determine the network structure).
Sample: A Neural Network with Two Hidden Layers
Links (arrows) are used to transmit values from one node to the other. Nodes of the hidden layer(s) are called neurons because they process information.
Nodes assign weights to each connection depending on the strength and value of the information received, and the weights typically vary as the process advances.
At each neuron, a formula called an activation function, which is generally nonlinear, is applied to the inputs. This allows the modeling of complex non-linear functions. Learning (improvement) happens through neurons applying better weights.
Better weights are identified by improvement in some performance measure (e.g. lower errors). The final hidden layer feeds the output layer.
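A minimal forward pass through the 3-4-3-1 network described above can be sketched in plain NumPy. The weights here are random placeholders; in practice they would be adjusted during training to reduce a performance measure such as the error.

```python
# Forward pass through a 3-4-3-1 network with random placeholder weights.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Activation function: nonlinear, applied at each hidden neuron
    return np.maximum(0.0, z)

# Hyperparameters: layer sizes set by the human designer
sizes = [3, 4, 3, 1]   # input, hidden, hidden, output
weights = [rng.normal(size=(m, k)) for m, k in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=k) for k in sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)               # hidden layers transform the inputs
    return a @ weights[-1] + biases[-1]   # output layer (linear here)

x = np.array([0.5, -1.2, 3.0])            # three input features
print("network output:", forward(x))
```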
Deep learning nets (DLNs) are neural networks with many hidden layers (often more than 20). Advanced DLNs are used for speech recognition and image or pattern detection.
In unsupervised ML, we have only input variables; there is no target (corresponding output variable) to match the feature set against. Unsupervised ML algorithms are typically used for dimension reduction and data clustering.
Clustering algorithms discover the inherent groupings in the data without any predefined class labels. Clustering is different from classification. Classification uses predefined class labels assigned by the researcher.
Two common clustering approaches are:
i) Bottom-up clustering:
Each observation starts in its own cluster; clusters are then merged progressively, based on some criterion, in a non-overlapping manner.
ii) Top-down clustering:
All observations begin as one cluster, and then split into smaller and smaller clusters gradually.
The selection of clustering approach depends on the nature of the data or the purpose of the analysis. These approaches are evaluated by various metrics.
K-means Algorithm: An example of a Clustering Algorithm
K-means is a clustering algorithm in which data is partitioned into k clusters using two geometric ideas: the ‘centroid’ (the average position of the points in a cluster) and ‘Euclidean distance’ (the straight-line distance between two points). The number of clusters, k, must be specified in advance.
Suppose an analyst wants to divide a group of 100 firms into 5 clusters based on two numerical metrics of corporate governance quality.
The algorithm works iteratively to assign each data point to a suitable group (centroid) based on the similarity of the provided features (in this case, the two corporate governance metrics). There are five centroids, initially placed at random positions.
Step 1. First step involves assigning each data point its nearest Centroid based on the squared Euclidian distance.
Step 2. The centroids are then recomputed based on the mean location of all assigned data points in each cluster.
The algorithm repeats steps 1 and 2 until the centroids stop moving and the sum of squared distances from the points to their centroids is minimized. The five clusters for the 100 firms are considered optimal when the average squared straight-line distance between the data points and their centroids is at a minimum.
However, the final result may depend on the initial positions chosen for the centroids. This problem can be addressed by running the algorithm many times with different initial centroid positions and then selecting the clustering with the best fit.
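The 100-firm example above can be sketched with scikit-learn's `KMeans`. The governance scores are synthetic, and `n_init=10` implements the “run many times from different initial centroids and keep the best fit” idea.

```python
# Sketch: 100 hypothetical firms scored on two governance metrics,
# partitioned into k = 5 clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
scores = rng.uniform(0, 100, size=(100, 2))   # two governance-quality metrics

# n_init=10: rerun from 10 random centroid starts, keep the lowest
# total within-cluster squared distance (inertia)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scores)

print("cluster sizes:", np.bincount(km.labels_))
print("total within-cluster squared distance:", round(km.inertia_, 1))
```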
Clustering is a valuable ML technique used for many portfolio management and diversification functions.
Dimension reduction is another unsupervised ML technique that reduces the number of random variables for complex datasets while keeping as much of the variation in the dataset as possible.
Principal component analysis (PCA) is an established method for dimension reduction. PCA reduces highly correlated data variables into fewer, uncorrelated composite variables (variables that combine two or more of the highly correlated variables).
The first principal component accounts for the largest share of the variation in the data; each succeeding principal component captures the largest share of the remaining variation, subject to the constraint that it is uncorrelated with the preceding components.
Each subsequent component has a lower information-to-noise ratio. PCA has been applied to stock market returns and yield curve dynamics.
Dimension reduction techniques are applicable to numerical, textual or visual data.
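A sketch of PCA on synthetic correlated returns (the one-factor data-generating process is an assumption chosen so the effect is visible):

```python
# Returns for 10 assets driven mainly by one common factor, so they are
# highly correlated; the first principal component captures most variation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
factor = rng.normal(size=(250, 1))                       # common driver
loadings = rng.normal(size=(1, 10))                      # asset exposures
returns = factor @ loadings + 0.3 * rng.normal(size=(250, 10))

pca = PCA(n_components=3).fit(returns)
evr = pca.explained_variance_ratio_
print("variance explained by each component:", np.round(evr, 3))
```

The printed ratios decrease: the first component carries most of the information, and each subsequent component adds less, which is why a handful of components can stand in for many correlated variables.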
Supervised Machine Learning: Training
The process to train ML models includes the following simple steps.
- Define the ML algorithm.
- Specify the hyperparameters used in the ML technique. This may involve several training cycles.
- Divide the dataset into two major groups:
Training sample (the dataset the model actually learns from).
Validation sample (used to validate the model’s performance and evaluate its fit on out-of-sample data).
- Evaluate model-fit through validation sample and tune the model’s hyperparameters.
- Repeat the training cycles for some given number of times or until the required performance level is achieved.
The output of the training process is the ‘ML model’.
The model may overfit or underfit depending on the number of training cycles; for example, overfitting (from excessive training cycles) results in poor out-of-sample predictive performance.
In the dataset-division step, the process randomly and repeatedly partitions the data into training and validation samples.
As a result, an observation may fall in the training sample in one split and in the validation sample in another. This process, called ‘cross-validation’, controls biases in the training data and improves the model’s predictions.
Note: Smaller datasets call for more cross-validation, whereas bigger datasets require less.
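The repeated partitioning described above can be sketched with k-fold cross-validation in scikit-learn (the dataset and model are illustrative assumptions):

```python
# k-fold cross-validation: the data are repeatedly split into training and
# validation folds, so each observation is used for validation exactly once.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

# cv=5 -> five splits, five out-of-sample accuracy estimates
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", np.round(scores, 2))
print("mean CV accuracy:", round(scores.mean(), 2))
```

Averaging the five fold scores gives a more stable estimate of out-of-sample performance than any single training/validation split.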
Now you should practice the end-of-chapter (EOC) questions.