A simple introduction to multivariate analysis

Data is everywhere now, with ‘big data’ methods becoming increasingly popular in business. In this article, we are going to...

Introduction

Data is everywhere now, with ‘big data’ methods becoming increasingly popular in business. In this article, we are going to look at a simplified example of multivariate analysis, which forms the basis for many machine learning methods. Multivariate analysis can be applied to a multitude of different problems, from identifying patterns in data to increase understanding of a topic, making comparisons or discarding unwanted information quickly, to identify the most relevant information needed from a large volume of (often messy) data. 

Types of multivariate analysis

Multivariate analysis (i.e. lots of different variables) is a simple extension of univariate analysis (i.e. one variable), and there are a range of different types of multivariate analyses:

  • Principal component analysis - a method for reducing the dimensions of large datasets
  • Multivariate regression - used to predict the behaviour of the response variable based on its corresponding predictor variables.
  • Cluster analysis - a method of grouping similar observations in such a way that members in the same group are more similar 
  • ANOVA - a statistical method that separates observed variance data into different components for additional analyses

The focus of this article is going to be multivariate regression, one of the most powerful tools in statistics (when used correctly). 

Multivariate regression

Multivariate regression is used to determine the relationship between  a dependent variable and one (univariate) or more (multivariate) independent variable. To best understand this, we first consider the univariate case. Consider an ice cream shop, who wants to know how much ice cream they will need on a given day. For the last two weeks, they have recorded the temperature that day, and the number of ice creams they have sold that day. If we hypothesise that the number of ice creams sold are dependent on the temperature that day, the dependent variable is the number of ice creams sold, and the independent variable is the temperature.

 Univariate statistics is essentially just putting these two variables on a graph, and drawing a line that best fits to the points, by minimising the distance from each point to the line. Multivariate statistics is a generalisation of this, for example, the amount of rainfall may also have been considered, or the wind speed, which also may impact ice cream sales. Categorical or indicator variables could also be considered, such as if it is a holiday. In multivariate regression, the distance between the variable and the line is minimised, for all variables.

When considering three variables (in this case the length, weight and height of objects), it becomes harder to visualise. Below is a three-dimensional plot, from which it is much harder to identify the relationship between variables.

Assumptions

Ensuring that data satisfies assumptions made by statistical models is very important, and so should always be checked. For example, for multivariate regression, there are three main assumptions which are made about variables, which must be satisfied for methods to be deemed appropriate, which are outlined below.

  1. The relationship between the dependent variable and independent variables are linear (CHECK: look at the data, does the relationship look linear?).
  2. The errors between observed and predicted values i.e. the distance on the graph between observations and the fitted line are normally distributed (CHECK: calculate residuals, do they have a symmetrical histogram? Carry out a goodness-of-fit test).
  3. Independent variables are not highly correlated with each other (CHECK: look at scatterplots between independent variables, are there any strong correlations?).