
Section 2.2 Variables and Relationships


In this section, we introduce the idea that different variables can be related to one another. We start from the view that variables represent measurements and that a complete description of all measurements represents the state of a system. A scatter plot provides a graphical way to identify possible patterns between variables by plotting points representing pairs of variables from the state.

We then transition from an experimental point of view of related variables to the mathematical way that such relationships are expressed, namely through equations. Equations provide a formal and exact way to state relationships between variables. A theoretical state of a system consists only of values that make the equation true. Observations generally do not work out so exactly, so we look for equations that serve as models to approximate the pattern observed in measured values.

Subsection 2.2.2 Systems, States and Variables

In the course of an experiment, or even just in observation, many different quantities typically are covarying, or changing with one another. For example, an object in motion has changing position, changing velocity, and changing forces. In the course of a chemical reaction, there are changing concentrations of the different reactants and products as well as possibly changing temperature, pH, and volume, for example. While observing a changing population, there could be changing population numbers, total biomass, birth and death rates, consumption of resources and production of products and waste. Mathematically, the system consists of all possible observable quantities associated with the experiment or observed physical system. The state of the system refers to the collection of instantaneous values of all such quantities at a particular instant or configuration of the system. A variable represents a single quantity that is or could be observed in the system.


Consider the following data about the population, births and deaths in the United States. To conserve space, the data are given using scientific notation expressed in the standard machine form where the power of 10 follows the letter E, so that \(2.521 \times 10^8\) would be written 2.521E8.

Year Population Births Deaths Year Population Births Deaths
1991 2.521E8 4.111E6 2.170E6 2001 2.850E8 4.026E6 2.416E6
1992 2.550E8 4.065E6 2.176E6 2002 2.876E8 4.022E6 2.443E6
1993 2.577E8 4.000E6 2.269E6 2003 2.901E8 4.090E6 2.448E6
1994 2.602E8 3.953E6 2.279E6 2004 2.928E8 4.112E6 2.397E6
1995 2.628E8 3.900E6 2.312E6 2005 2.955E8 4.138E6 2.448E6
1996 2.652E8 3.891E6 2.315E6 2006 2.984E8 4.266E6 2.426E6
1997 2.677E8 3.881E6 2.314E6 2007 3.012E8 4.316E6 2.424E6
1998 2.703E8 3.942E6 2.337E6 2008 3.041E8 4.248E6 2.472E6
1999 2.727E8 3.959E6 2.391E6 2009 3.068E8 4.131E6 2.437E6
2000 2.822E8 4.059E6 2.403E6 2010 3.094E8 3.999E6 2.468E6

Each row (corresponding to a given year) represents a distinct state of the system. The system is characterized by the observed values, which consist of the year itself, the total population, the total number of births in the year, and the total number of deaths in the year. The population, births, and deaths are the measured variables of the system. The year should also be considered one of the variables (an independent variable) because we think of the other variables as changing with respect to time.
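The idea that each row is one state can be sketched in code. The Python snippet below is only an illustration: the names `State` and `parse_row` are hypothetical choices made for this example, and only the first three rows of the table are included. It relies on the fact that Python's `float` accepts the machine form of scientific notation used in the table.

```python
from typing import NamedTuple


class State(NamedTuple):
    """One state of the system: the values of all four variables in one year."""
    year: int
    population: float
    births: float
    deaths: float


def parse_row(text: str) -> State:
    """Convert a table row such as '1991 2.521E8 4.111E6 2.170E6' to a State."""
    year, population, births, deaths = text.split()
    # float() understands the E form of scientific notation (2.521E8).
    return State(int(year), float(population), float(births), float(deaths))


# First three rows of the table (illustrative subset).
rows = [
    "1991 2.521E8 4.111E6 2.170E6",
    "1992 2.550E8 4.065E6 2.176E6",
    "1993 2.577E8 4.000E6 2.269E6",
]
states = [parse_row(row) for row in rows]

for state in states:
    print(state)
```

Each `State` holds the instantaneous values of every variable, so a list of them plays the same role as the rows of the table.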

A symbol is often used to represent, or name, a variable. The symbol is often a letter, but could also be a Greek letter, an abbreviation, or a word. The choice of symbol should generally be related to the meaning of the variable. In the previous example, the population variable might be represented by the symbol \(P\) while the births and deaths might be represented by the symbols \(B\) and \(D\). The year might be represented by the symbol \(Y\). Uppercase and lowercase letters are different symbols and should not be interchanged with one another. An important part of communication is in stating clearly the variables of a system and identifying the symbols that are chosen to represent them.


In biology, scientists run electrophoresis gels to determine the size of polymers, such as proteins or DNA strands. The gel provides a porous structure for the polymers to travel through while an electric potential (voltage) creates a force that pulls the polymers through the gel. Polymers of different sizes travel at different speeds. The experiment is set up with all polymers starting at one end of the gel; the voltage is turned on for a certain amount of time and then disconnected. Clusters of similarly sized polymers are identified visually as bands on the gel, with smaller polymers traveling a greater distance. The following paragraph illustrates how variables might be introduced with their chosen symbols.

The image below represents an electrophoresis gel run on a standardized collection of DNA of fixed sizes. Because the image does not show a length scale, the distances traveled by the different lengths are measured in image pixels and recorded in the table below. The variables for the experiment are the length of DNA segments and the distance traveled through the gel. Let \(L\) represent the length of the segment (in nucleotides) and let \(D\) represent the distance traveled (in pixels), measured from the center of the starting well to the center of the corresponding band in the image. Each row represents a single state \((L,D)\) of the system.

\(L\) (nts) \(D\) (px)
100 342
200 327
300 312
400 299
500 288
600 278
700 270
800 263
900 256
1000 249

Without the earlier descriptive paragraph introducing the variables, this table would have no context and would not be helpful to a reader.

Subsection 2.2.3 Relationships Between Variables

The primary motivation for collecting data regarding different variables in the state of a system is to determine relationships between those variables. One of the ways that we look for relationships is using a scatter plot. A scatter plot is a graph showing the relationship between two variables. Suppose the two variables use symbols \(x\) and \(y\). For each state of the system, there will have been observed values for both \(x\) and \(y\). The graph will include points for each pair \((x,y)\).

Spreadsheets (like Microsoft Excel, Apple Numbers or Google Sheets) are a common tool to generate scatter plots. The data are first put in a table. The first column of data will correspond to the variable used for the horizontal axis (\(x\)) and the second column of data will correspond to the variable for the vertical axis (\(y\)). Select the two columns at the same time and add a chart to your spreadsheet, choosing the scatter plot style of graph. You should become familiar with how to create a scatter plot. Always be sure that you label your axes.

The following figure shows two different scatter plots for the electrophoresis gel data above. One plot is based on the pairs \((L,D)\) whereas the other is based on the pairs \((D,L)\). These graphs contain the same information but viewed from reversed perspectives.

[Figure: two scatter plots of the electrophoresis gel data, one of the pairs \((L,D)\) and one of the pairs \((D,L)\).]
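The two sets of points can be built from the same data, which may make it clearer why the plots carry the same information. The following Python sketch uses the values from the electrophoresis table; the variable names are illustrative choices, not part of the original text.

```python
# Electrophoresis data from the table above.
lengths = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]   # L, nucleotides
distances = [342, 327, 312, 299, 288, 278, 270, 263, 256, 249]  # D, pixels

# Each state of the system pairs one length with one distance.
points_LD = list(zip(lengths, distances))  # L horizontal, D vertical

# Swapping the coordinates of every point gives the reversed plot.
points_DL = [(distance, length) for (length, distance) in points_LD]

print(points_LD[0])
print(points_DL[0])
```

Either list could be handed to a plotting tool; the choice only decides which variable goes on the horizontal axis.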

When a system has a state defined by more than two variables, scatter plots can be defined for each pair of state variables. For example, the population data has four state variables, \((Y,P,B,D)\). Three scatter plots can be formed by plotting the population, the total births and the total deaths versus the year, giving graphs of points \((Y,P)\), \((Y,B)\), \((Y,D)\). Because the births and deaths are on the same scale, we can combine the plots as one. We could also plot the inverse relationships \((P,Y)\), \((B,Y)\) and \((D,Y)\), but these really contain the same information from a different view.

[Figure: scatter plots of the population data versus year, showing \((Y,P)\) and the combined \((Y,B)\) and \((Y,D)\).]
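To see how many distinct pairings four state variables allow, one can list every unordered pair. The short Python sketch below does this with the standard library; treating \((P,Y)\) as the same pairing as \((Y,P)\) is an assumption of the example, in keeping with the remark that the reversed plot contains the same information.

```python
from itertools import combinations

variables = ["Y", "P", "B", "D"]

# Every unordered pair of state variables is a candidate scatter plot.
pairs = list(combinations(variables, 2))

print(len(pairs))  # 6
for horizontal, vertical in pairs:
    print(f"({horizontal},{vertical})")
```

Three of the six pairs involve the year; the other three relate the remaining variables directly to one another.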

We can also look at relationships between other pairs of variables. For example, we can look at how the number of births or deaths relates to the population, plotting \((P,B)\) and \((P,D)\), or how the number of births relates to the number of deaths with \((B,D)\). The graph showing the relation of births and deaths to time (above) is very similar to the graph showing the relation of births and deaths to population (below). However, the relation between births and deaths illustrates that sometimes variables do not show a clear relation.

[Figure: scatter plots of births and deaths versus population, \((P,B)\) and \((P,D)\), and of deaths versus births, \((B,D)\).]

Subsection 2.2.4 Equations as Relations

An equation gives an abstract representation of a relationship between variables. An expression is any formula involving numbers and variables. An equation is a statement that two expressions are equal. For some values of the variables, the equation may be false; for other values, the equation will be true. A solution to the equation is a set of values for the variables in the equation that makes the statement true. The solution set of an equation is the set of all possible solutions.

Just as the state of an experimental system is defined by the value of the variables defining the state, an equation can be considered as a mathematical way to define relationships between variables of an abstract system. A scatter plot of data is generalized for an equation as a graph of all solutions. If we choose an ordering for the variables (e.g., alphabetical), the values for the variables can be conveniently listed as an ordered list. When two variables are involved in an equation, the ordered list is called an ordered pair or point, like \((x,y)\), and the graph of the equation is typically a curve in the plane.


The equation \begin{equation*}2x+3y = 12\end{equation*} involves two variables, \(x\) and \(y\). The expressions in the equation are \(2x+3y\) and \(12\). The values \(x=3\) and \(y=2\), corresponding to the ordered pair \((x,y)=(3,2)\), provide one solution because for those values, \(2x+3y = 2(3)+3(2)=12\), so that the equation is true. On the other hand, \((x,y)=(4,1)\) is not a solution because for that state, \(2x+3y = 2(4)+3(1)=11\) and \(11 \ne 12\). Some other solutions include the points \((6,0)\) and \((0,4)\).

You should have recognized that this equation is an equation of a line. (See Appendix A.2.1.) That is, the solutions we identified above of \((3,2)\), \((6,0)\) and \((0,4)\), along with all other solutions, will lie on the same line in the plane.

[Figure: graph of the line \(2x+3y=12\).]
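Checking whether a point is a solution amounts to substituting its values and testing whether the statement is true. A minimal Python sketch follows; the function name `is_solution` is a hypothetical choice, and exact equality is used only because these test points have integer coordinates.

```python
def is_solution(x: float, y: float) -> bool:
    """Return True when (x, y) makes the equation 2x + 3y = 12 true."""
    # Exact comparison is safe for integers; decimal values would
    # call for a small tolerance instead.
    return 2 * x + 3 * y == 12


for point in [(3, 2), (4, 1), (6, 0), (0, 4)]:
    print(point, is_solution(*point))
```

Only \((4,1)\) fails the test, which matches the computation \(2(4)+3(1)=11\) in the example.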


The equation \begin{equation*}u^2+v^2=16+6u\end{equation*} also involves two variables, \(u\) and \(v\). The expressions in the equation are \(u^2+v^2\) and \(16+6u\). Using ordered pairs \((u,v)\), the points \((3,5)\) and \((3,-5)\) are solutions. That is, if \((u,v)=(3,5)\), the expressions have the same value: \begin{align*} u^2+v^2 &= 3^2+5^2=9+25 = 34,\\ 16+6u &= 16+6(3) = 16+18 = 34. \end{align*} It is possible to show that the graph of solutions for this equation is a circle centered at \((3,0)\) with radius 5. Other points on this circle include such points as \((-2,0)\), \((-1,3)\), and \((6,-4)\). You should verify that these are also solutions, at least for one or two points to reinforce the idea that a solution makes the statement of the equation true.

[Figure: graph of the circle \(u^2+v^2=16+6u\).]
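One way to see the circle is to move \(6u\) to the left side and complete the square: \(u^2-6u+9+v^2 = 16+9\), that is, \((u-3)^2+v^2=25\). The Python sketch below checks the listed points against both forms of the equation; the function names are illustrative only.

```python
def satisfies_original(u: int, v: int) -> bool:
    """Test the equation as stated: u^2 + v^2 = 16 + 6u."""
    return u**2 + v**2 == 16 + 6 * u


def on_circle(u: int, v: int) -> bool:
    """Test the center-radius form: (u - 3)^2 + v^2 = 5^2."""
    return (u - 3) ** 2 + v**2 == 5**2


points = [(3, 5), (3, -5), (-2, 0), (-1, 3), (6, -4)]
for u, v in points:
    print((u, v), satisfies_original(u, v), on_circle(u, v))
```

Every listed point passes both tests, as expected when the two forms describe the same solution set.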

Mathematical equations are exact. Real data exhibit uncertainty and randomness. Although they do not capture the uncertainty of data, equations can be used to model the trend or average presented by the data. A trend line or trend curve (if not linear) is a model that captures the general behavior of the data and stays close to the scatter points. Most spreadsheets and graphing calculators have an option to show a line of best fit for a scatter plot, which is an example of a trend line. The process these programs use is called regression. They usually report the equation of the regression curve using the generic variable symbols \(x\) and \(y\), so it is the researcher's responsibility to interpret the equation in terms of the true variables.
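As a rough sketch of what a linear regression computes, the Python code below fits a line by ordinary least squares. That a given spreadsheet uses exactly this method is an assumption here, although it is a common choice. The function name `fit_line` is hypothetical, and the data are the population and deaths columns from the table earlier in this section. Because those values are rounded to four significant digits, the coefficients printed here may differ from those a spreadsheet reports from its own data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Population P and deaths D for 1991 through 2010, from the table.
population = [2.521e8, 2.550e8, 2.577e8, 2.602e8, 2.628e8,
              2.652e8, 2.677e8, 2.703e8, 2.727e8, 2.822e8,
              2.850e8, 2.876e8, 2.901e8, 2.928e8, 2.955e8,
              2.984e8, 3.012e8, 3.041e8, 3.068e8, 3.094e8]
deaths = [2.170e6, 2.176e6, 2.269e6, 2.279e6, 2.312e6,
          2.315e6, 2.314e6, 2.337e6, 2.391e6, 2.403e6,
          2.416e6, 2.443e6, 2.448e6, 2.397e6, 2.448e6,
          2.426e6, 2.424e6, 2.472e6, 2.437e6, 2.468e6]

slope, intercept = fit_line(population, deaths)
# Report the model with the problem's own symbols rather than x and y.
print(f"D = {slope:.6g} P + {intercept:.6g}")
```

Writing the result in terms of \(P\) and \(D\) is the interpretation step that the regression program leaves to the researcher.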

A trend line or a trend curve allows us to predict values where there are no observed data. When the prediction occurs between observed data, such prediction is called interpolation. If the prediction occurs beyond the extremes of the data, such prediction is called extrapolation. Often, a formula may not describe all of the data but provides a good approximation for certain values. Interpolation is usually safer than extrapolation, which might attempt to use the formula in a region where the approximation is not good.
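The distinction can be expressed as a simple range check. In the Python sketch below, the function name `prediction_type` is a hypothetical choice, and a value equal to the smallest or largest observation is counted as interpolation by convention of this example.

```python
def prediction_type(value, observed):
    """Classify a prediction at `value` relative to the observed inputs."""
    smallest, largest = min(observed), max(observed)
    if smallest <= value <= largest:
        return "interpolation"
    return "extrapolation"


# Distances D (in pixels) observed in the electrophoresis table.
distances = [342, 327, 312, 299, 288, 278, 270, 263, 256, 249]

print(prediction_type(282, distances))  # inside the observed range
print(prediction_type(200, distances))  # beyond the observed range

# Smallest and largest populations in the population table.
population_range = [2.521e8, 3.094e8]
print(prediction_type(3.0e8, population_range))
print(prediction_type(3.5e8, population_range))
```

A check like this does not say how good a prediction is; it only flags when a model is being used outside the region the data cover.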


Consider the population example with the scatter plot of the number of deaths plotted with respect to the total population size, and predict the number of deaths in a year if the population were 300 million.

A spreadsheet reported the trend line of this data set with an equation \begin{equation*}y=0.0049x+992711.\end{equation*} Because the scatter plot had \(P\) on the horizontal axis and \(D\) on the vertical axis, the more appropriate equation would be \begin{equation*}D=0.0049P+992711.\end{equation*} A spreadsheet may not give enough precision in the model equation. Note that the equation reported above has only two significant digits in the slope value but an apparent six significant digits in the intercept. A more precise report of the trend line would be \begin{equation*} D = 4.86666 \times 10^{-3} P + 9.92711 \times 10^5. \end{equation*} Depending on the values of the data, the greater precision might make a significant difference.

[Figure: scatter plot of \((P,D)\) with the trend line.]

Let us compare the two models with \(P=300\times 10^6\). The first model, which has only two significant digits in the first coefficient, gives \begin{equation*}D = 0.0049(3\times 10^8) + 992711 = 2462711.\end{equation*} Because one coefficient has only two significant digits, we can only expect the first two digits to be accurate, giving \(D=2.5\) million deaths in the year. Using the second model with six digits of accuracy in both coefficients, we find \begin{equation*}D = (4.86666 \times 10^{-3})(3 \times 10^{8}) + 992711 = 2452709.\end{equation*} With six digits of accuracy, this predicts \(D=2{,}452{,}710\) deaths in the year. As approximations, neither model claims to predict the number of deaths exactly, nor does either specify how far from the prediction an observed value could reasonably fall (which is a statistics question).
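The arithmetic for the two models can be repeated in a few lines of Python. The function names below are illustrative only; the coefficients are the ones quoted in the text.

```python
def deaths_two_digit(P: float) -> float:
    """Model with the slope reported to two significant digits."""
    return 0.0049 * P + 992711


def deaths_six_digit(P: float) -> float:
    """Model with both coefficients reported to six significant digits."""
    return 4.86666e-3 * P + 9.92711e5


P = 300e6  # population of 300 million
d1 = deaths_two_digit(P)
d2 = deaths_six_digit(P)

# Rounding removes tiny floating-point artifacts in the products.
print(round(d1))       # 2462711
print(round(d2))       # 2452709
print(round(d1 - d2))  # 10002
```

The two predictions differ by about ten thousand deaths, which comes entirely from the digits dropped in the slope.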


Consider the electrophoresis gel scatter plot with the length of the DNA \(L\) graphed with respect to the distance traveled in the gel \(D\). The data appear to follow a nice curve without a lot of uncertainty. Using a polynomial trend, a spreadsheet reports the following equation for the data: \begin{equation*} y = 0.0573x^2-43.381x+8241.6.\end{equation*} Using the appropriate variables for the problem, this equation should be rewritten as \begin{equation*} L = 0.0573D^2-43.381D+8241.6.\end{equation*} The graph of the data with the trend curve is shown below.

[Figure: scatter plot of \((D,L)\) with the trend curve.]

Knowing a model equation that conveniently characterizes a data set allows us to use that equation to predict values that do not occur within the data set itself. For example, suppose we had another DNA sample of unknown length that traveled a distance of \(D=282\) pixels. Using our value for \(D\), we can find the value of \(L\) using the model, \begin{equation*}L = 0.0573(282)^2 - 43.381(282)+8241.6 = 564.8832.\end{equation*} Since our original data had 3 significant digits, we would estimate the length of the DNA in question as \(L \approx 565\) nucleotides.
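The same prediction can be carried out in code. This Python sketch simply evaluates the quadratic model quoted above; the function name `predicted_length` is a hypothetical choice.

```python
def predicted_length(D: float) -> float:
    """Estimate DNA length L (nucleotides) from distance D (pixels)."""
    return 0.0573 * D**2 - 43.381 * D + 8241.6


L = predicted_length(282)
print(L)         # close to 564.8832, up to floating-point rounding
print(round(L))  # 565
```

Because \(D=282\) lies between the observed distances 278 and 288, this use of the model is an interpolation.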

Again, you should note that the number of significant digits reported is not the same as the uncertainty in the prediction. The degree to which the original data vary around the trend curve leads to uncertainty in the coefficients of the regression model and, in turn, uncertainty in the trend curve itself. However, analysis of this uncertainty is a topic for statistics and is outside the scope of this text.