We often encounter data where two variables (observable measurements) change in predictable ways with one another. We say that these variables covary. The simplest mathematical relation between two variables is a linear relation, where the change in one variable is proportional to the change in the other variable. If \(x\) and \(y\) are the variables, then the mathematical expression that captures this relation is \[y-y_0 = m(x-x_0)\] where \((x_0,y_0)\) is the original state of the variables, \((x,y)\) is the new state of the variables, and \(m\) is the slope or rate of change for the relation.
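For example, a line with slope \(m = 3\) through the point \((x_0, y_0) = (2, 5)\) gives \(y - 5 = 3(x - 2)\); at \(x = 4\) the relation predicts \(y = 5 + 3(4-2) = 11\).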
In practice, collected data rarely satisfy a linear relation exactly; that would require all of the data points to be exactly collinear. Errors in measurement and random variation in the two variables are both likely. Furthermore, the true relation between the variables may not be linear, even when a linear relation provides a very good approximation.
Consequently, it is often desirable to find the equation of a linear relation between two variables even when the data do not exactly satisfy such a relation. In order to find the equation for the relation, we need to establish a criterion for what it means to be a good approximation.
The most common way to identify the best linear approximation is to minimize a measure of the distance between the data points and the line defined by the relation. In this discussion, we will focus on two such measures of distance.
The first distance is the absolute error, which measures the vertical distance from a data point to the line. If we think of \(y\) as a function of \(x\), then the linear relation would mean \[y = y_0 + m(x-x_0)\] should be a perfect prediction of \(y\) knowing the value of \(x\). In reality, this would not be observed, and a given point \((x_i,y_i)\) would require a correction, or residual, \(\epsilon_i\) to reach the observed point, \[y_i = y_0 + m(x_i-x_0) + \epsilon_i.\] The magnitude of this residual is the absolute error.
A second distance is the geometric distance, which measures the distance from a data point to the line along a perpendicular to the line. This distance is less than or equal to the magnitude of the residual, with the ratio depending on the slope of the line. Using a geometric argument, it can be shown that the distance is computed as \[d_i = \frac{|\epsilon_i|}{\sqrt{m^2+1}}.\]
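To make the two distances concrete, here is a minimal sketch in Python; the function names, line parameters, and sample data are illustrative assumptions, not part of the applet.

```python
import numpy as np

def residuals(x, y, m, x0, y0):
    """Vertical corrections epsilon_i = y_i - (y0 + m*(x_i - x0))."""
    return y - (y0 + m * (x - x0))

def geometric_distances(x, y, m, x0, y0):
    """Perpendicular distances d_i = |epsilon_i| / sqrt(m^2 + 1)."""
    return np.abs(residuals(x, y, m, x0, y0)) / np.sqrt(m**2 + 1)

# Illustrative data and line parameters (not the applet's defaults).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(residuals(x, y, m=2.0, x0=2.5, y0=5.0))            # approx. [ 0.1 -0.1  0.2 -0.2]
print(geometric_distances(x, y, m=2.0, x0=2.5, y0=5.0))  # each is |eps_i| / sqrt(5)
```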
The total measure for a set of data is found by adding the distances, either the absolute errors or the geometric distances, between each data point and the line. Alternatively, we can add the squares of the distances; using squared distances smooths the profile of the measure and leads to results with a cleaner mathematical solution. Linear regression uses the sum of the squared errors to determine the best line. For the purpose of this discussion, we will refer to any such sum of distances or sum of squared distances as a measure of discrepancy.
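A sketch of the four resulting measures, under the same illustrative assumptions as above:

```python
import numpy as np

def discrepancies(x, y, m, x0, y0):
    """Compute all four measures of discrepancy for the line
    y = y0 + m*(x - x0) against data arrays x, y."""
    eps = y - (y0 + m * (x - x0))          # residuals (absolute error = |eps|)
    d = np.abs(eps) / np.sqrt(m**2 + 1)    # perpendicular (geometric) distances
    return {
        "sum of absolute errors":  np.sum(np.abs(eps)),
        "sum of squared errors":   np.sum(eps**2),  # minimized by linear regression
        "sum of distances":        np.sum(d),
        "sum of squared distances": np.sum(d**2),
    }
```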
In this section, you can explore how these four measures of discrepancy depend on the slope and height of a linear relation. A collection of data points is automatically provided, although you can enter your own data if you wish.
The panel on the left plots the points (marked 'x') and shows the current line. Line segments are drawn from the points to the line, either vertically or perpendicular to the line, depending on which measure of discrepancy is used. The \(x\)-position of the line's reference point is always the average of the \(x\)-values of the data points.
The two panels on the right have sliders allowing you to select the slope (top-right) and the height of the reference point (bottom-right). Additionally, these panels graph the measure of discrepancy as it depends on the control parameter, holding the other parameter fixed. Sliding the controls to reach the lowest possible measure of discrepancy allows you to find the best linear approximation.
Enter the data here. List x- and y-values separated by a comma.
Separate data points with a semi-colon.
Choose which measure of discrepancy you wish to minimize:
Of course, it is impractical to adjust the parameters (slope and height) manually to find the optimal line. Instead, computer software can find the optimal line from the data, usually using linear regression. Linear regression minimizes the sum of the squared errors and has explicit formulas that can be evaluated rapidly to find the model parameters. In some software (e.g., Microsoft Excel), this is called finding a trend line for the data.
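As a sketch of these explicit formulas (the function name and sample data are illustrative assumptions): the least-squares slope is \(m = \sum (x_i-\bar{x})(y_i-\bar{y}) / \sum (x_i-\bar{x})^2\), and because the least-squares line always passes through \((\bar{x}, \bar{y})\), taking \(x_0 = \bar{x}\) as above gives the optimal height \(y_0 = \bar{y}\).

```python
import numpy as np

def regression_line(x, y):
    """Closed-form least-squares fit: returns the slope m and the height y0
    of the reference point at x0 = mean(x), matching the applet's setup."""
    xbar, ybar = np.mean(x), np.mean(y)
    m = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
    return m, ybar  # with x0 = xbar, the optimal height is y0 = ybar

# Illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
m, y0 = regression_line(x, y)
print(m, y0)  # for comparison, np.polyfit(x, y, 1) returns [m, y0 - m*np.mean(x)]
```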