
11.4: Plotting Data


Here are some preliminary commands to run if they haven't been run yet.

And for this section, we will need to let Julia know to use the Plots package:

We will want to produce plots of datasets, so we'll start with some random data. The following produces the values 1 to 10 for x and random integers between 1 and 10 for y:

The following will produce a scatter plot of the data, where each data pair is drawn as a point.

If we want the data to be plotted with lines connecting the points, we use the same plot command as we did above.

And we can plot both points and lines with:


Chapter 11.4: Vapor Pressure

  • Contributed by Anonymous
  • LibreTexts

Learning Objectives

Nearly all of us have heated a pan of water with the lid in place and shortly thereafter heard the sounds of the lid rattling and hot water spilling onto the stovetop. When a liquid is heated, its molecules obtain sufficient kinetic energy to overcome the forces holding them in the liquid and they escape into the gaseous phase. By doing so, they generate a population of molecules in the vapor phase above the liquid that produces a pressure: the vapor pressure of the liquid, which is the pressure created over a liquid by the molecules that have enough kinetic energy to escape to the vapor phase. In the situation we described, enough pressure was generated to move the lid, which allowed the vapor to escape. If the vapor is contained in a sealed vessel, however, such as an unvented flask, and the vapor pressure becomes too high, the flask will explode (as many students have unfortunately discovered). In this section, we describe vapor pressure in more detail and explain how to quantitatively determine the vapor pressure of a liquid.


11.2 Fitting a Linear Model in R

It turns out that a very common way to choose the best \(\alpha\) and \(\beta\) is to minimize the sum of squared distances between the data points and the model predictions. Suppose we have a model with \(N\) data points \((x_1,y_1), (x_2,y_2), \ldots, (x_N,y_N)\). Then we can measure the cost of the model for one data point \(y_j\) by finding the squared distance between this data point and the predicted value \(\hat{y}(x_j)=\alpha+\beta x_j\). Summing up all these squared errors, or residuals \(r_j = y_j-\hat{y}(x_j)\), gives us a measure of how well the model describes the data. \[ \text{Cost}(\alpha, \beta)=\sum_{j=1}^N r_j^2=\sum_{j=1}^N [y_j-(\alpha+\beta x_j)]^2 \]
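As a minimal sketch of the cost calculation described above, the Python function below sums the squared residuals for a candidate intercept and slope. The data values are hypothetical (they are not the flight data), and the names `cost`, `alpha`, and `beta` are chosen here purely for illustration.

```python
def cost(alpha, beta, xs, ys):
    """Sum of squared residuals for the line y = alpha + beta * x."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only (not the flight data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # these points lie exactly on y = 1 + 2x

print(cost(1.0, 2.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 2.0, xs, ys))  # every residual is 1, so the cost is 4.0
```

A smaller cost means the line passes closer to the data points overall.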

The plot below shows the residuals as green arrows for a guess of \(\alpha=10\), \(\beta=1.5\) for the flight model. The total cost is also printed below for this parameter choice. Note that I reduced the number of data points (circles) in this plot just so that the green arrows (residuals) are easier to see.

Now I want to show you how to use R to fit a linear model and view the results. Here is the command to build a linear model for our flying data and view a summary of the results.

We will learn what all this output (stats poop) means later. Let’s see what our residual plot looks like for these optimal values:

Notice that the cost (the sum of all the squared residuals) has decreased by quite a bit from our initial guess. These are the best (optimal) values of \(\alpha\) and \(\beta\) we could possibly choose. Any other choice of \((\alpha, \beta)\) would give a larger cost value. Now we can look at the estimates for the \(\alpha\) and \(\beta\) parameters that R finds:
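As an illustration of why no other choice of parameters gives a smaller cost, here is a small Python sketch using the standard closed-form least-squares solution for a straight line. This is the textbook formula rather than R's own fitting routine, and the data and the helper names `fit_line` and `cost` are hypothetical.

```python
from statistics import mean


def cost(alpha, beta, xs, ys):
    """Sum of squared residuals for the line y = alpha + beta * x."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for y = alpha + beta * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Hypothetical data, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.9, 5.2, 6.8, 9.1]

alpha, beta = fit_line(xs, ys)
best = cost(alpha, beta, xs, ys)

# Nudging either parameter away from the fitted values increases the cost.
print(best < cost(alpha + 0.5, beta, xs, ys))
print(best < cost(alpha, beta + 0.1, xs, ys))
```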

The \(\beta\) slope parameter is what is most important for our flying model. The best point estimate for \(\beta\) is \(\approx 0.126\). In the context of the model this means that for every 1 mile increase in distance we should expect the flying time to increase by about \(0.13\) minutes. We can see how well the value of \(\beta\) is determined by the data by finding the confidence interval for \(\beta\):

We can also make a plot of the line that R fit to our flying data. We can see that the line captures some of the big picture trends in the data.

The \(\alpha\) term (y-intercept) here tells us that flights which go no distance at all (0 miles) should be expected to take somewhere between 17 and 19 minutes. This is a bit more difficult to interpret, as presumably nobody is booking flights which take off and go nowhere. However, we could regard this value as a measurement of the inevitable inefficiency of airports, where planes must take turns to take off and land and can only approach from particular directions. This effect generally adds something like twenty minutes to flights out of NYC.

11.2.0.1 House Sales Price vs Square Footage

Let’s consider a more interesting problem. In this section we will use linear regression to understand the relationship between the sales price of a house and the square footage of that house. Intuitively, we expect these two variables to be related, as bigger houses typically sell for more money. The data set comes from Ames, Iowa house sales from 2006-2010. First, let’s read this data in and make a scatter plot of the sales price versus the square footage.

We can see this has the log10 of the selling price, the square footage and the number of bathrooms in the house.

As expected, we can see from the plot that square footage is somewhat important in determining the sales price of the house, but we can see that there is significant variation in the sales price for any given sqft size. Let’s try to build a linear model for the relationship between the sqft of the houses and the sales price.

Let’s look to see if the slope we found is significant (relative to a slope of zero):

We can say that the slope is significantly greater than zero with a significance level \(\alpha=0.01\) since this 99% confidence interval doesn’t include zero. Finally, let’s plot our regression line on the scatter plot:

Note that since we are dealing with the logarithms of the price and square footage here, these results tell us to expect a 1% increase in the square footage of the house to increase the sales price by about 1% as well. In terms of the untransformed variables, our model looks like \[Price=\alpha_0(Sqft)^{\beta}.\] By taking the logarithm of both sides of this we get a linear equation \[\log(Price)=\log(\alpha_0)+\beta \log(Sqft)\]
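The step from the power-law form to the linear equation can be checked numerically. The Python sketch below generates hypothetical prices from an assumed power law (the constants 200 and 0.9 are made up and are not estimates from the Ames data), takes base-10 logarithms of both variables, and recovers the exponent as the slope of a least-squares line.

```python
import math
from statistics import mean

# Assumed power law, for illustration only: price = alpha0 * sqft ** beta
alpha0, beta = 200.0, 0.9
sqft = [800.0, 1200.0, 1600.0, 2000.0, 2400.0]
price = [alpha0 * s ** beta for s in sqft]

# Taking logs turns the power law into a straight line.
log_x = [math.log10(s) for s in sqft]
log_y = [math.log10(p) for p in price]

# Least-squares slope and intercept of log_y against log_x.
x_bar, y_bar = mean(log_x), mean(log_y)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(log_x, log_y))
         / sum((x - x_bar) ** 2 for x in log_x))
intercept = y_bar - slope * x_bar

print(round(slope, 6))            # recovers beta
print(round(10 ** intercept, 3))  # recovers alpha0
```

Because the slope of the log-log line is the exponent, a slope near 1 corresponds to price growing roughly in proportion to square footage.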


11.2 Correlation

The correlation \(\rho\) of random variables \(X\) and \(Y\) is a number between -1 and 1 that measures the strength of the linear relationship between \(X\) and \(Y\). It is positive when \(X\) and \(Y\) tend to be large together and small together, and it is negative when large \(X\) values tend to accompany small \(Y\) values and vice versa. A correlation of 0 indicates no linear relationship, and a correlation of \(\pm 1\) is achieved only when \(X\) and \(Y\) have an exact linear relationship.

If \(X\) and \(Y\) have means and standard deviations \(\mu_X\), \(\sigma_X\) and \(\mu_Y\), \(\sigma_Y\) respectively, then \[ \rho_{X,Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}. \]

Many times, we are not able to calculate the exact correlation between two random variables \(X\) and \(Y\), and we will want to estimate it from a random sample. Given a sample \((x_1, y_1),\ldots, (x_n, y_n)\), we define the correlation coefficient \(r\) as follows:

The sample correlation coefficient is \[ r = \frac{1}{n-1} \sum^n_{i=1} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right) \] where \(\bar{x}\), \(\bar{y}\) are the sample means and \(\sigma_x\), \(\sigma_y\) are the sample standard deviations.

The \(i^{\text{th}}\) term in the sum for \(r\) will be positive whenever:

  • Both \(x_i\) and \(y_i\) are larger than their means \(\bar{x}\) and \(\bar{y}\).
  • Both \(x_i\) and \(y_i\) are smaller than their means \(\bar{x}\) and \(\bar{y}\).

It will be negative whenever

  • \(x_i\) is larger than \(\bar{x}\) while \(y_i\) is smaller than \(\bar{y}\).
  • \(x_i\) is smaller than \(\bar{x}\) while \(y_i\) is larger than \(\bar{y}\).

Since \(r\) is a sum of these terms, \(r\) will tend to be positive when \(x_i\) and \(y_i\) are large and small together, and \(r\) will tend to be negative when large values of \(x_i\) accompany small values of \(y_i\) and vice versa.
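A minimal Python sketch of the sample correlation coefficient, written directly from the sum-of-products description above. It assumes the \(1/(n-1)\) normalization together with sample standard deviations; the data and the function name `sample_corr` are hypothetical.

```python
from statistics import mean, stdev


def sample_corr(xs, ys):
    """Sum of products of standardized x and y values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)   # sample standard deviations (n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]      # exact increasing linear relationship
down = [10, 8, 6, 4, 2]    # exact decreasing linear relationship

print(round(sample_corr(xs, up), 6))    # 1.0
print(round(sample_corr(xs, down), 6))  # -1.0
```

Each term in the sum is positive when both values sit on the same side of their means and negative otherwise, which is why the two exact linear examples land at the extremes.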

For the rest of this chapter, when we refer to the correlation or the sample correlation between two random variables, we will mean the sample correlation coefficient.

The sample correlation coefficient is symmetric in \(x\) and \(y\), and is not dependent on the assignment of explanatory and response to the variables.

The correlation between carb and optden in the Formaldehyde data set is \(r = 0.9995232\), which is quite close to 1. The plot showed these data points were almost perfectly on a line.

Figure 11.6 shows the relationship between flipper length and body mass for all three penguin species.

Figure 11.6: Body mass and flipper length for three penguin species.

We compute the sample correlation coefficient \(r\) for each species of penguin.

Gentoo penguins have the strongest linear relationship between flipper length and body mass, with \(r = 0.703\). The Adelie penguins have the weakest, with \(r = 0.468\). The difference is visible in the plots, where the points for Adelie penguins have a looser clustering. All three scatterplots do exhibit a clear linear pattern.

In the child tasks study, the sample correlation between child’s age and time on the STT trail B is \(r = -0.593\).

The negative correlation indicates that older children post faster times on the test. This is visible in the scatterplot as a downward trend as you read the plot from left to right.

Note that correlation is a unitless quantity. The term \((x_i - \bar{x})/\sigma_x\) has the same units (those of \(x\)) in the numerator and denominator, so they cancel, and the \(y\) term is similar. This means that a linear change of units will not affect the correlation coefficient:
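A quick numerical check of this invariance, sketched in Python. The measurements are hypothetical (they are not the child tasks data), the helper `sample_corr` is defined here for illustration, and both unit changes use positive scale factors.

```python
from statistics import mean, stdev


def sample_corr(xs, ys):
    """Sample correlation coefficient with the n - 1 normalization."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (
        (len(xs) - 1) * s_x * s_y)


# Hypothetical measurements, for illustration only.
age_years = [6.0, 7.5, 8.0, 9.5, 11.0, 12.5]
time_seconds = [95.0, 80.0, 84.0, 60.0, 55.0, 41.0]

# Linear changes of units: years -> months, seconds -> minutes.
age_months = [12 * a for a in age_years]
time_minutes = [t / 60 for t in time_seconds]

r_original = sample_corr(age_years, time_seconds)
r_rescaled = sample_corr(age_months, time_minutes)
print(abs(r_original - r_rescaled) < 1e-12)  # the two values agree
```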

It is clear (from experience, not from a statistical point of view) that there is a causal relationship between a child’s age and their ability to connect dots quickly. As children age, they get better at most things. However, correlation is not causation. There are many reasons why two variables might be correlated, and \(x\) causes \(y\) is only one of them.

As a simple example, the size of children’s shoes is correlated with their reading ability. However, you cannot buy a child bigger shoes and expect that to make them a better reader. The correlation between shoe size and reading ability is due to a common cause, age. In this example, age is a lurking variable, important to our understanding of both shoe size and reading ability, but not included in the correlation.


11.4: Plotting Data

After you import data into the MATLAB® workspace, it is a good idea to plot the data so that you can explore its features. An exploratory plot of your data enables you to identify discontinuities and potential outliers, as well as the regions of interest.

The MATLAB figure window displays plots. See Types of MATLAB Plots for a full description of the figure window. It also discusses the various interactive tools available for editing and customizing MATLAB graphics.

Load and Plot Data from Text File

This example uses sample data in count.dat, a space-delimited text file. The file consists of three sets of hourly traffic counts, recorded at three different town intersections over a 24-hour period. Each data column in the file represents data for one intersection.

Load the count.dat Data

Import data into the workspace using the load function.

Loading this data creates a 24-by-3 matrix called count in the MATLAB workspace.


Plotting data

This page has been made largely redundant by the plotting options in JASP and JAMOVI - my statistical packages of choice. But if you do want to use SPSS for your stats, you shouldn't feel tethered to the (very poor) options that package provides for plotting your data.

Controversially, I like to plot my data in Excel. I like to do things this way for several reasons. First, data plotted in an Excel sheet can easily be viewed and edited by anyone, from the crustiest old scientist to the greenest undergraduate student (remember programming whizkids - the vast majority of STEM scientists can do very little programming - if you aren't trained early in your career, you'll likely never get the hang of it). Second, number crunching in Excel is very easy and visual - good luck trying to wade through someone's R code if you want to make sure they haven't made a mistake. Finally, I think having all your data laid out in Excel is a good way to have an automatic double check for outliers and data entry errors. Another lesser-known feature of Excel is that you can File-->Export the figure in vector format (.pdf or .xps), which will save you having to worry about quality/DPI issues when submitting figures to journals. You'll typically need to trim the excess white space off these files, which can be easily accomplished using a small program called Briss.

In terms of what sort of thing I like to plot, here's an example of a typical bar graph from one of my papers.


Plot a Bar Chart using Pandas

Bar charts are used to display categorical data. Let’s now see how to plot a bar chart using Pandas.

Step 1: Prepare your data

As before, you’ll need to prepare your data. Here, the following dataset will be used to create the bar chart:

Step 2: Create the DataFrame

Create the DataFrame as follows:

You’ll then get this DataFrame:

Step 3: Plot the DataFrame using Pandas

Finally, add the following syntax to the Python code:

In this case, set the kind = ‘bar’ to plot the bar chart.
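Putting the three steps together, a minimal sketch might look like the following. The category labels and values are hypothetical placeholders, since the original dataset is not reproduced here, and the `Agg` backend is selected only so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Step 1: hypothetical categorical data (placeholder values).
data = {
    "category": ["A", "B", "C", "D"],
    "value": [45, 30, 60, 25],
}

# Step 2: create the DataFrame.
df = pd.DataFrame(data)

# Step 3: plot it; kind="bar" selects a vertical bar chart.
ax = df.plot(x="category", y="value", kind="bar")
ax.set_ylabel("value")
plt.tight_layout()
plt.savefig("bar_chart.png")  # or plt.show() in an interactive session
```

Each row of the DataFrame becomes one bar, with the `x` column supplying the labels along the horizontal axis.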

And the complete Python code is:

Run the code and you’ll get this bar chart:


Go Math Grade 7 Answer Key Chapter 11 Analyzing and Comparing Data

Every student has a chance to know how to analyze and compare the data. Get the solutions with step by step explanation from our Go Math Answer Key for Grade 7 Chapter 11 Analyzing and Comparing Data. So, before you start your preparation go through the topics given below.

Chapter 11 – Lesson: 1

Chapter 11 – Lesson: 2

Chapter 11 – Lesson: 3

Chapter 11 – Comparing Data Displayed in Dot Plots

Guided Practice – Page No. 338

The dot plots show the number of miles run per week for two different classes. For 1–5, use the dot plots shown.

Question 1.
Compare the shapes of the dot plots.

Answer: In Class A the dot plot is clustered around two areas and in Class B the dot plot is clustered in the middle.

Question 2.
Compare the centers of the dot plots.

Answer: In Class A the data is centered around 4 miles and 13 miles and in Class B the data is centered around 7 miles.

Question 3.
Compare the spreads of the dot plots.

Answer: In class A the spread of the dot plot is 4 miles to 14 miles and in Class B the spread is 3 miles to 9 miles.

Question 4.
Calculate the medians of the dot plots.

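Answer: The median of the dot plots for Class A and Class B is 6.

Explanation: The Class A data is 4,4,4,4,4,5,5,5,6,6,12,13,13,13,13,14,14. There are 17 values, so the median is the 9th value, which is 6.
The Class B data is 3,4,4,4,5,5,5,5,6,6,7,7,7,7,7,8,8,9. There are 18 values, so the median is the mean of the 9th and 10th values:
= (6+6)/2
= 12/2
= 6.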
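The two medians can be checked with a short Python sketch using the standard library's `statistics.median`, which handles both the odd and the even case. The variable names are chosen here for illustration.

```python
from statistics import median

class_a = [4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 12, 13, 13, 13, 13, 14, 14]
class_b = [3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 9]

# 17 values: the median is the 9th value in sorted order.
print(len(class_a), median(class_a))  # 17 6
# 18 values: the median is the mean of the 9th and 10th values.
print(len(class_b), median(class_b))  # 18 6.0
```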

Question 5.
Calculate the ranges of the dot plots.

Answer: The range of the dot plot For Class A is 10 and Class B is 6.

Explanation: For Class A the range is 14-4= 10.
For Class B the range is 9-3= 6.

Essential Question Check-In

Question 6.
What do the medians and ranges of two dot plots tell you about the data?

Answer: The median tells you where the values of each dot plot are centered, so you can see which dot plot has the greater values. The range tells you how spread out the values in each dot plot are. The smaller the range, the closer together the values are.

Independent Practice – Page No. 339

The dot plot shows the number of letters in the spellings of the 12 months. Use the dot plot for 7–10.

Question 7.
Describe the shape of the dot plot.

Answer: The values are spread fairly evenly from 3 to 9, with a slight peak at 8.

Question 8.
Describe the center of the dot plot.

Answer: The center of the dot plot is 6.

Question 9.
Describe the spread of the dot plot.

Answer: The spread of the dot plot is from 3 to 9

Question 10.
Calculate the mean, median, and range of the data in the dot plot.

Answer:
The mean of the dot plot is 6.17.
The median of the dot plot is 6.5.
The range of the dot plot is 6.

Explanation: The data is 3,4,4,5,5,6,7,7,8,8,8,9.
The mean of the dot plot is (3+4+4+5+5+6+7+7+8+8+8+9)/12
= 74/12
≈ 6.17.
The median of the dot plot is (6+7)/2
= 13/2
= 6.5.
The range of the dot plot is 9-3= 6.
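The same three measures can be computed in a few lines of Python, shown here as a sketch. The list holds the values read from the dot plot, and the variable name `letters` is chosen for illustration.

```python
from statistics import mean, median

letters = [3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 8, 9]  # values read from the dot plot

print(round(mean(letters), 2))       # 6.17
print(median(letters))               # 6.5
print(max(letters) - min(letters))   # 6
```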

The dot plots show the mean number of days with rain per month for two cities.

Question 11.
Compare the shapes of the dot plots.

Answer: In Montgomery, most months have more than 8 days with rain. In Lynchburg, every month has 12 or fewer days with rain.

Question 12.
Compare the centers of the dot plots.

Answer: In Montgomery, the center of the dot plot is around 9 days. And in Lynchburg, the center of the dot plot is around 10 days.

Question 13.
Compare the spreads of the dot plots.

Answer: In Montgomery, the spread of the dot plot is from 1 to 12 days, and 1 is an outlier. In Lynchburg, the spread of the dot plot is from 8 to 12 days.

Question 14.
What do the dot plots tell you about the two cities with respect to their average monthly rainfall?

Answer: Since the center for Lynchburg is greater than the center for Montgomery, the average monthly rainfall for Lynchburg is greater than the average monthly rainfall for Montgomery.

Page No. 340

The dot plots show the shoe sizes of two different groups of people.

Question 15.
Compare the shapes of the dot plots.

Answer: In Group A the shoe sizes are mostly less than 9. And in group B all the shoe sizes are 11.5 or less.

Question 16.
Compare the medians of the dot plots.

Answer:
The median of Group A is 8.
The median of Group B is 9.5.

Explanation: 6.5,7,7,7.5,7.5,7.5,8,8,8,8,8,8.5,8.5,9,13
The median of Group A is 8.
8.5,9,9,9,9,9.5,9.5,9.5,9.5,10,10,10.5,10.5,10.5,11.5
The median of Group B is 9.5.

Question 17.
Compare the ranges of the dot plots (with and without the outliers).

Answer:
Group A: the range with the outlier is 13-6.5= 6.5.
Group A: the range without the outlier is 9-6.5= 2.5.
Group B: the range is 11.5-8.5= 3.

Explanation: The outlier in Group A is 13
The range with the outlier is 13-6.5= 6.5.
The range without the outlier is 9-6.5= 2.5.
There is no outlier in Group B, so the range is 11.5-8.5= 3.
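The effect of the outlier on Group A can be sketched in Python. The helper name `value_range` is chosen here for illustration, and the median is printed alongside the range to show that it barely reacts to the outlier.

```python
from statistics import median

group_a = [6.5, 7, 7, 7.5, 7.5, 7.5, 8, 8, 8, 8, 8, 8.5, 8.5, 9, 13]
without_outlier = [size for size in group_a if size != 13]


def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


print(value_range(group_a), value_range(without_outlier))  # 6.5 2.5
print(median(group_a), median(without_outlier))            # 8 8.0
```

Removing the single value 13 shrinks the range from 6.5 to 2.5 while the median stays at 8.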

Question 18.
Make A Conjecture
Provide a possible explanation for the results of the dot plots.

Answer: Group A could be girls and Group B could be boys, because boys tend to have larger feet than girls.

Focus on Higher Order Thinking

Question 19.
Analyze Relationships
Can two dot plots have the same median and range but have completely different shapes? Justify your answer using examples.

Answer: Yes, it is possible to have the same median and range with different shapes.

Explanation: For example, consider two dot plots with the following data.
Dot plot 1 data: 1,2,2,3,3,3,4,4,5.
The median of dot plot 1 is 3.
Dot plot 2 data: 2,2,2,2,3,3,4,4,5,5,6.
The median of dot plot 2 is 3.
The range of dot plot 1 is 5-1= 4.
The range of dot plot 2 is 6-2= 4.
Both have a median of 3 and a range of 4, but dot plot 1 is symmetric with a peak at 3, while dot plot 2 is bunched up at 2.

Question 20.
Draw Conclusions
What value is most affected by an outlier, the median or the range? Explain. Can you see these effects in a dot plot?

Answer: The range is most affected by an outlier. An outlier changes the largest or smallest value, which increases the range, while the median sits in the middle of the data and is barely affected. Yes, both effects can be seen in a dot plot.

Guided Practice – Page No. 344

For 1–3, use the box plot Terrence created for his math test scores. Find each value.

Question 1.
Minimum = _____ Maximum = _____

Answer:
Minimum = 72.
Maximum = 88.

Explanation: The minimum value is the smallest value in the box plot, so the minimum value is 72, and the maximum value is the largest value in the box plot which is 88

Question 2.
Median = _____

Answer:
Median = 79.

Explanation:
The data is 72,75,79,85,88.
The median is 79.

Question 3.
Range = _____ IQR = _____

Answer:
The range is 16.
The IQR is 10.

Explanation:
The range is 88-72= 16
The IQR is the difference between the upper quartile and the lower quartile, so 85-75= 10.
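As a small sketch, the range and IQR follow directly from the five values a box plot displays. The variable names below are chosen for illustration.

```python
# Five values read from the box plot: minimum, Q1, median, Q3, maximum.
minimum, q1, med, q3, maximum = 72, 75, 79, 85, 88

value_range = maximum - minimum   # spread of all the data
iqr = q3 - q1                     # spread of the middle half of the data

print(value_range, iqr)  # 16 10
```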

For 4–7, use the box plots showing the distribution of the heights of hockey and volleyball players.

Question 4.
Which group has a greater median height?
_____

Answer:
Volleyball players have the greater median height, 74 in.

Explanation:
Hockey players data is 64,66,70,76,78.
The median height of hockey players is 70 in.
Volleyball players data is 67,68,74,78,85
The median height of the Volleyball player is 74 in.

Question 5.
Which group has the shortest player?
_____

Answer:
Hockey players have the shortest player with 64 in.

Explanation:
The minimum height of the hockey players is 64 in.
The minimum height of the Volleyball players is 67 in.

Question 6.
Which group has an interquartile range of about 10?
_____

Answer: The IQR for Hockey players and Volleyball players is 10.

Explanation:
The IQR for Hockey players is 76-66= 10.
The IQR for Volleyball players is 78-68= 10.

Essential Question Check-In

Question 7.
What information can you use to compare two box plots?

Answer: To compare two box plots we can use the minimum and maximum values, the median, the range, and the IQR.

Independent Practice – Page No. 345

For 8–11, use the box plots of the distances traveled by two toy cars that were jumped from a ramp.

Question 8.
Compare the minimum, maximum, and median of the box plots.

Answer:
The minimum of Car A (165) is greater than the minimum of Car B (160).
The maximum of Car A (210) is greater than the maximum of Car B (205).
The median of Car A (180) is less than the median of Car B (185).

Explanation:
The data of Car A is 165,170,180,195,210.
The data of Car B is 160,175,185,200,205.

Question 9.
Compare the ranges and interquartile ranges of the data in box plots.

Answer:
The range of Car A is 45.
The range of Car B is 45.
The IQR of Car A is 25.
The IQR of Car B is 25.

Explanation:
The range of Car A is 210-165= 45.
The range of Car B is 205-160= 45.
The IQR of Car A is 195-170= 25.
The IQR of Car B is 200-175= 25.

Question 10.
What do the box plots tell you about the jump distances of two cars?

Answer: The box plot tells about the minimum and the maximum jump distance, the median jump distance, and the spread of the jump distance.

Question 11.
Critical Thinking
What do the whiskers tell you about the two data sets?

Answer: The whiskers show the spread of the bottom 25% and the top 25% of the data, out to the minimum and maximum values.

For 12–14, use the box plots to compare the costs of leasing cars in two different cities.

Question 12.
In which city could you spend the least amount of money to lease a car? The greatest?
______

Answer: City B has both the least and the greatest lease cost.

Explanation:
The data set of City A is $425,$450,$475,$550,$600.
The data set of City B is $400,$425,$450,$475,$625.
The minimum cost of City A is $425 and the maximum is $600.
The minimum cost of City B is $400 and the maximum is $625.
So both the least amount ($400) and the greatest amount ($625) are in City B.

Question 13.
Which city has a higher median price? How much higher is it?
______

Answer: City A has the higher median price, $475, which is $25 higher.

Explanation:
The median of City A is $475 and the median of City B is $450.
So the difference is $475-$450= $25.

Question 14.
Make a Conjecture
In which city is it more likely to choose a car at random that leases for less than $450? Why?
______

Answer: $450 corresponds to the first quartile of City A, which means 25% of the cars cost less than $450. $450 corresponds to the median for City B, which means 50% of the cars cost less than $450. So City B is more likely to have a car chosen randomly that costs less than $450.

Page No. 346

Question 15.
Summarize
Look back at the box plots for 12–14 on the previous page. What do the box plots tell you about the costs of leasing cars in those two cities?

Answer: City A has a smaller range than City B, but it has a greater IQR. Four of City B’s five key values (the minimum, first quartile, median, and third quartile) are less than those of City A, which means leasing a car is generally cheaper in City B.

Focus on Higher Order Thinking

Question 16.
Draw Conclusions
Two box plots have the same median and equally long whiskers. If one box plot has a longer box than the other box plot, what does this tell you about the difference between the data sets?

Answer: If two box plots have the same median and equally long whiskers and one box is longer than the other, that means the box plot with the larger box has a greater range and IQR.

Question 17.
Communicate Mathematical Ideas
What you can learn about a data set from a box plot? How is this information different from a dot plot?

Answer: From a box plot we can learn the minimum and maximum values, the median, the range, the IQR, and the range of each 25% of the data. A dot plot shows every individual data value, which a box plot does not.

Question 18.
Analyze Relationships
In mathematics, central tendency is the tendency of data values to cluster around some central value. What does a measure of variability tell you about the central tendency of a set of data? Explain.

Answer: If the range and IQR are small, the values are clustering around some central values.

Guided Practice – Page No. 350

The tables show the numbers of miles run by the students in two classes. Use the tables in 1–2.

Question 1.
For each class, what is the mean? What is the mean absolute deviation?
Class 1 mean: __________
Class 2 mean: __________
Class 1 MAD: __________
Class 2 MAD: __________

Answer:
Class 1 mean: 6
Class 2 mean: 11
Class 1 MAD: 3.067
Class 2 MAD: 3.067

Explanation:
The mean of Class 1 is (12+6+1+10+1+2+3+10+3+8+3+9+8+6+8)/15
= 90/15
= 6
The mean of Class 2 is (11+14+11+13+6+7+8+6+8+13+8+15+13+17+15)/15
= 165/15
= 11
The mean absolute deviation of Class 1 is
|12-6| = 6
|6-6| = 0
|1-6| = 5
|10-6| = 4
|1-6| = 5
|2-6| = 4
|3-6| = 3
|10-6| = 4
|3-6| = 3
|8-6| = 2
|3-6| = 3
|9-6| = 3
|8-6| = 2
|6-6| = 0
|8-6| = 2
The mean absolute deviation of Class 1 is (6+0+5+4+5+4+3+4+3+2+3+3+2+0+2)/15
= 46/15
≈ 3.067

The mean absolute deviation of Class 2 is
|11-11| = 0
|14-11| = 3
|11-11| = 0
|13-11| = 2
|6-11| = 5
|7-11| = 4
|8-11| = 3
|6-11| = 5
|8-11| = 3
|13-11| = 2
|8-11| = 3
|15-11| = 4
|13-11| = 2
|17-11| = 6
|15-11| = 4
The mean absolute deviation of Class 2 is (0+3+0+2+5+4+3+5+3+2+3+4+2+6+4)/15
= 46/15
≈ 3.067
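The calculation above can be sketched as a short Python function. The name `mean_absolute_deviation` is chosen here for illustration; it averages the distances of the values from their mean.

```python
from statistics import mean


def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


class_1 = [12, 6, 1, 10, 1, 2, 3, 10, 3, 8, 3, 9, 8, 6, 8]
class_2 = [11, 14, 11, 13, 6, 7, 8, 6, 8, 13, 8, 15, 13, 17, 15]

for name, miles in (("Class 1", class_1), ("Class 2", class_2)):
    print(name, mean(miles), round(mean_absolute_deviation(miles), 3))
# Class 1 6 3.067
# Class 2 11 3.067
```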

Question 2.
The difference of the means is about _____ times the mean absolute deviations.
_____

Answer: About 1.67.

Explanation: The difference of the means is 11-6= 5, and each mean absolute deviation is about 3, so the difference of the means is about
5/3 ≈ 1.67 times the mean absolute deviation.

Question 3.
Mark took 10 random samples of 10 students from two schools. He asked how many minutes they spend per day going to and from school. The tables show the medians and the means of the samples. Compare the travel times using distributions of the medians and means.

Essential Question Check-In

Question 4.
Why is it a good idea to use multiple random samples when making comparative inferences about two populations?

Answer: It’s important to use multiple random samples so that you can draw more reliable inferences about the populations. The more samples you use, the more convincing the arguments you can make about the distributions.

Independent Practice – Page No. 351

Josie recorded the average monthly temperatures for two cities in the state where she lives. Use the data for 5–7.

Question 5.
For City 1, what is the mean of the average monthly temperatures? What is the mean absolute deviation of the average monthly temperatures?
Mean: __________
MAD: __________

Answer:
Mean: 50 °F.
MAD: 13 °F.

Question 6.
What is the difference between each average monthly temperature for City 1 and the corresponding temperature for City 2?
_______ °F

Answer: The difference between each average monthly temperature for City 1 and the corresponding temperature for City 2 is 15 °F

Explanation:
|23-8|= 15
|38-23|= 15
|39-24|= 15
|48-33|= 15
|55-40|= 15
|56-41|= 15
|71-56|= 15
|86-71|= 15
|57-42|= 15
|53-38|= 15
|43-28|= 15
|31-16|= 15
The difference between each average monthly temperature for City 1 and the corresponding temperature for City 2 is 15 °F

Question 7.
Draw Conclusions
Based on your answers to Exercises 5 and 6, what do you think the mean of the average monthly temperatures for City 2 is? What do you think the mean absolute deviation of the average monthly temperatures for City 2 is? Give your answers without actually calculating the mean and the mean absolute deviation. Explain your reasoning.
Mean = __________ °F
MAD __________ °F

Answer:
Mean =35 °F
MAD = 13°F

Explanation: Since all the values of City 2 are 15 below the values of City 1, the mean of City 2 will be 15 less than the mean of City 1, which means 50-15= 35. All of City 2’s values deviate from the mean the same way City 1’s values do, which means that the mean absolute deviation is also 13.
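The reasoning can be confirmed numerically with a short Python sketch. The temperature lists are taken from the differences listed in Exercise 6 (the first number in each pair is City 1), and the helper name `mean_absolute_deviation` is chosen for illustration.

```python
from statistics import mean


def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


# City 1 average monthly temperatures in °F, from the pairs in Exercise 6.
city_1 = [23, 38, 39, 48, 55, 56, 71, 86, 57, 53, 43, 31]
city_2 = [t - 15 for t in city_1]  # every City 2 value is 15 °F lower

print(mean(city_1), mean_absolute_deviation(city_1))  # 50 13
print(mean(city_2), mean_absolute_deviation(city_2))  # 35 13
```

Subtracting the same amount from every value shifts the mean by that amount but leaves every distance from the mean unchanged, so the MAD stays the same.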

Question 8.
What is the difference in the means as a multiple of the mean absolute deviations?
_______ (MAD)

Answer: About 1.15 MAD.

Explanation:
(50-35)/13
= 15/13
≈ 1.15.
The difference in the means is about 1.15 times the mean absolute deviation.

Question 9.
Make a Conjecture
The box plots show the distributions of mean weights of 10 samples of 10 football players from each of two leagues, A and B. What can you say about any comparison of the weights of the two populations? Explain.

Answer: Both leagues have a lot of variability, since the ranges and IQRs are both very large. The middle halves overlap entirely. The variation and overlap in the distributions make it hard to make any convincing comparison.

Page No. 352

Question 10.
Justify Reasoning
Statistical measures are shown for the ages of middle school and high school teachers in two states.
State A: Mean age of middle school teachers = 38, mean age of high school teachers = 48, mean absolute deviation for both = 6
State B: Mean age of middle school teachers = 42, mean age of high school teachers = 50, mean absolute deviation for both = 4
In which state is the difference in ages between members of the two groups more significant? Support your answer.
_____________

Answer: The difference in ages between members of the two groups is more significant in State B.

Explanation:
For State A, the difference in the means as a multiple of the mean absolute deviation is (48-38)/6
= 10/6
≈ 1.67.
For State B, it is (50-42)/4
= 8/4
= 2.
Since State B has the larger multiple, the difference in ages between members of the two groups is more significant there.
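The comparison can be written as a one-line calculation. The Python sketch below uses a hypothetical helper name, `difference_in_mads`, and assumes both groups share the same mean absolute deviation, as they do in this exercise.

```python
def difference_in_mads(mean_a, mean_b, mad):
    """Difference of two means, expressed as a multiple of a shared MAD."""
    return abs(mean_a - mean_b) / mad


state_a = difference_in_mads(38, 48, 6)
state_b = difference_in_mads(42, 50, 4)

print(round(state_a, 2), state_b)  # 1.67 2.0
```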

Question 11.
Analyze Relationships
The tables show the heights in inches of all the adult grandchildren of two sets of grandparents, the Smiths and the Thompsons. What is the difference in the medians as a multiple of the ranges?

______ x range

Answer: The difference in the medians is 1.75 times the range.

Explanation:
Smith: 64,65,65,66,66,67,68,68,69,70.
The Median is (66+67)/2
= 133/2
= 66.5.
The range is 70-64= 6.
Thompsons: 74,75,75,76,77,77,78,79,79,80.
The Median is (77+77)/2
= (154)/2
= 77.
The range is 80-74= 6.
The difference in the median is (77-66.5)/6
= 10.5/6
= 1.75.

Focus on Higher Order Thinking

Question 12.
Critical Thinking
Jill took many samples of 10 tosses of a standard number cube. What might she reasonably expect the median of the medians of the samples to be? Why?
Median of the medians: ______

Answer:
Median of the medians: 3.5.

Explanation: The possible outcomes of a number cube are 1,2,3,4,5,6. So the median is
= (3+4)/2
= 7/2
= 3.5
The median of the medians should be close to the median of the population, so it will also be about 3.5.

Question 13.
Analyze Relationships
Elly and Ramon are both conducting surveys to compare the average numbers of hours per month that men and women spend shopping. Elly plans to take many samples of size 10 from both populations and compare the distributions of both the medians and the means. Ramon will do the same, but will use a sample size of 100. Whose results will probably produce more reliable inferences? Explain.
_____________

Answer: The larger the sample size, the less variability there should be in the distributions of the medians and means. Ramon will most likely produce more reliable inferences, since he is using a much larger sample size.
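The effect of sample size on variability can be illustrated with a simulation. Everything here is assumed for the sake of the sketch: the population of monthly shopping hours is made up (uniform between 0 and 20), as are the function name and the number of repeated samples.

```python
import random
from statistics import mean, pstdev

def spread_of_sample_means(population, sample_size, num_samples, rng):
    """Standard deviation of the means of repeated random samples."""
    sample_means = [
        mean(rng.sample(population, sample_size)) for _ in range(num_samples)
    ]
    return pstdev(sample_means)

rng = random.Random(1)
# Hypothetical population of monthly shopping hours.
population = [rng.uniform(0, 20) for _ in range(5000)]

spread_10 = spread_of_sample_means(population, 10, 500, rng)    # Elly's sample size
spread_100 = spread_of_sample_means(population, 100, 500, rng)  # Ramon's sample size

print("Spread of sample means, n = 10: ", round(spread_10, 2))
print("Spread of sample means, n = 100:", round(spread_100, 2))
```

The sample means for the larger sample size cluster more tightly, which is why inferences drawn from them are more reliable.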

Question 14.
Counterexamples
Seth believes that it is always possible to compare two populations of numerical values by finding the difference in the means of the populations as a multiple of the mean absolute deviations. Describe a situation that explains why Seth is incorrect.

Answer: In order to compare two populations by finding the difference in the means of the populations as a multiple of the mean absolute deviations, the mean absolute deviations of both populations need to be about the same. If the mean absolute deviations are significantly different, such as 5 and 10, we cannot compare the populations this way.

11.1 Comparing Data Displayed in Dot Plots – Page No. 353

The two dot plots show the number of miles run by 14 students at the start and at the end of the school year. Compare each measure for the two dot plots. Use the data for 1–3.

Question 1.
Means
Start: _________
End: _________

Answer:
Mean
Start: 7.5 miles.
End: 8.2 miles.

Explanation:
The data for the start of the school year is 5,6,6,7,7,7,7,8,8,8,8,9,9,10.
The mean is \(\frac{5+6+6+7+7+7+7+8+8+8+8+9+9+10}{14} = \frac{105}{14}\)
= 7.5 miles.
The data for the end of the school year is 6,6,7,7,8,8,8,8,9,9,9,10,10,10.
The mean is \(\frac{6+6+7+7+8+8+8+8+9+9+9+10+10+10}{14} = \frac{115}{14}\)
≈ 8.2 miles.

Question 2.
Medians
Start: _________
End: _________

Answer:
Median
Start: 7.5 miles.
End: 8 miles.

Explanation:
The median for the start of the school year is
= (7+8)/2
= 15/2
= 7.5 miles.
The median for the end of the school year is
= (8+8)/2
= 16/2
= 8 miles.

Question 3.
Ranges
Start: _________
End: _________

Answer:
Ranges
Start: 5 miles.
End: 4 miles.

Explanation:
The range for the Start of the school year is 10-5= 5 miles.
The range for the end of the school year is 10-6= 4 miles.
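The three measures for the two dot plots can be recomputed with Python's `statistics` module. The `summarize` helper is a name chosen for this sketch; the data lists are the ones listed in the explanations above.

```python
from statistics import mean, median

start = [5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10]
end = [6, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10, 10]

def summarize(data):
    """Return the mean (rounded to one decimal place), median and range."""
    return {
        "mean": round(mean(data), 1),
        "median": median(data),
        "range": max(data) - min(data),
    }

print("Start:", summarize(start))
print("End:  ", summarize(end))
```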

11.2 Comparing Data Displayed in Box Plots

The box plots show lengths of flights in inches flown by two model airplanes. Use the data for 4–5.

Question 4.
Which has a greater median flight length?
_____________

Answer:
Airplane A has the greater median flight length, 210 in.

Explanation:
The median of Airplane A is 210 in and the median of Airplane B is 204 in. So Airplane A has the greater median flight length.

Question 5.
Which has a greater interquartile range?
_____________

Answer: The greater IQR is Airplane B with 35 in.

Explanation:
The IQR for Airplane A is 225-208= 17 in and the IQR for Airplane B is 230-195= 35 in. So Airplane B has the greater IQR.

11.3 Using Statistical Measures to Compare Populations

Question 6.
Roberta grows pea plants, some in shade and some in sun. She picks 8 plants of each type at random and records the heights.

Express the difference in the means as a multiple of their ranges.
______

Answer: The difference in the means is 2.4 times the range.

Explanation:
The mean of Shade plant heights is \(\frac{7+11+11+12+9+12+8+10}{8} = \frac{80}{8}\)
= 10 in.
The range of Shade plant heights is 12-7= 5 in.
The mean of Sun plant heights is \(\frac{21+24+19+19+22+23+24+24}{8} = \frac{176}{8}\)
= 22 in.
The range of Sun plant heights is 24-19= 5 in.
The difference in the means as a multiple of their ranges is (22-10)/5
= 12/5
= 2.4. This value is a ratio, so it has no units.

Essential Question

Question 7.
How can you use and compare data to solve real-world problems?

Answer: We can use and compare data to solve real-world problems by determining if one set is larger than the other set in terms of values, means, and medians.

Selected Response – Page No. 354

Question 1.
Which statement about the data is true?

Options:
a. The difference between the medians is about 4 times the range.
b. The difference between the medians is about 4 times the IQR.
c. The difference between the medians is about 2 times the range.
d. The difference between the medians is about 2 times the IQR.

Explanation:
Set 1 median is 60 and Set 2 median is 76
The range of Set 1 is 68-55= 13
The range of Set 2 is 80-65= 15
The IQR of Set 1 is 63-59= 4
The IQR of Set 2 is 77-73= 4
The difference in medians is 76-60= 16, So the difference between the medians is about 4 times the IQR.

Question 2.
Which is a true statement based on the box plots below?

Options:
a. The data for City A has a greater range.
b. The data for City B is more symmetric.
c. The data for City A has a greater interquartile range.
d. The data for City B has a greater median.

Explanation: The length of the box for City A is much larger than for City B, so IQR for City A is greater.

Question 3.
What is −3\(\frac{1}{2}\) written as a decimal?
Options:
a. -3.5
b. -3.05
c. -0.35
d. -0.035

Question 4.
Which is a true statement based on the dot plots below?

Options:
a. Set A has the lesser range
b. Set B has a greater median.
c. Set A has the greater mean.
d. Set B is less symmetric than Set A.

Answer: c is a true statement.

Explanation:
The median of Set A is 30 and the median of Set B is 40, so Set A has the greater mean.

Question 5.
The dot plots show the lengths of a random sample of words in a fourth-grade book and a seventh-grade book.

a. Compare the shapes of the plots.

Answer:
For the fourth grade, most of the words have a length of 6 or less, with two outliers at 9 and 10.
For the seventh grade, most of the words have a length of 8 or less, with 5 exceptions.

Question 5.
b. Compare the ranges of the plots. Explain what your answer means in terms of the situation.

Answer:
The Seventh grade has a larger range, so it has more variability.

Explanation:
The range for the fourth grade is 10-1=9.
The range for the seventh grade is 14-2= 12.
As the Seventh grade has a larger range it has more variability.

EXERCISES – Page No. 356

Question 1.
Molly uses the school directory to select, at random, 25 students from her school for a survey on which sports people like to watch on television. She calls the students and asks them, “Do you think basketball is the best sport to watch on television?”
a. Did Molly survey a random sample or a biased sample of the students at her school?
_____________

Answer: Molly surveyed a random sample, because she selected 25 students at random from a directory of the entire student population of her school.

Question 1.
b. Was the question she asked an unbiased question? Explain your answer.
_____________

Answer: No, the question is not unbiased. The question is biased because it assumes the person watches basketball on television.

Question 2.
There are 2,300 licensed dogs in Clarkson. A random sample of 50 of the dogs in Clarkson shows that 8 have ID microchips implanted. How many dogs in Clarkson are likely to have ID microchips implanted?
______ dogs

Explanation: Let X be the number of dogs in Clarkson that have ID microchips, so
X/2300 = 8/50
X= (8×2300)/50
X= 18,400/50
X= 368.
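The proportion above generalizes to any sample. Here is a minimal sketch; the function name `estimate_population_count` is hypothetical, and it multiplies before dividing so that the arithmetic matches the steps shown above.

```python
def estimate_population_count(sample_hits, sample_size, population_size):
    """Scale the rate seen in a random sample up to the whole population."""
    return sample_hits * population_size / sample_size

# 8 of 50 sampled dogs have microchips; Clarkson has 2,300 licensed dogs.
dogs_with_chips = estimate_population_count(8, 50, 2300)
print("Estimated dogs with microchips:", dogs_with_chips)
```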

Question 3.
A store gets a shipment of 500 MP3 players. Twenty-five of the players are defective, and the rest are working. A graphing calculator is used to generate 20 random numbers to simulate a random sample of the players.
A list of 20 randomly generated numbers representing MP3 players is :

a. Let numbers 1 to 25 represent players that are _____
_____________

Answer: As there are twenty-five defective players, let the numbers 1 to 25 represent players that are defective.

Question 3.
b. Let numbers 26 to 500 represent players that are _____
_____________

Answer: Let the numbers 26 to 500 represent players that are working.

Question 3.
c. How many players in this sample are expected to be defective?
______ players

Answer: Two of the generated numbers, 5 and 9, are from 1 to 25, so 2 players in this sample are expected to be defective.

Question 3.
d. If 300 players are chosen at random from the shipment, how many are expected to be defective based on the sample? Does the sample provide a reasonable inference? Explain.
______ players

Explanation:
X/300 = 2/20
X = (2×300)/20
X = 600/20
X = 30.
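Based on the sample, 30 of the 300 players are expected to be defective. However, 25 out of 500, or 5%, of the players in the shipment are defective, so we would expect only 15 of the 300 players to be defective. The sample therefore does not provide a reasonable inference.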
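The gap between the sample-based estimate and the value expected from the shipment can be laid out side by side. This is an illustrative sketch; the variable names are chosen here and are not part of the original problem.

```python
sample_defective, sample_size = 2, 20          # from the random-number simulation
shipment_defective, shipment_size = 25, 500    # known contents of the shipment
chosen = 300                                   # players picked at random

# Multiply before dividing so the results stay exact.
estimate_from_sample = sample_defective * chosen / sample_size
expected_from_shipment = shipment_defective * chosen / shipment_size

print("Estimate based on the sample:", estimate_from_sample)
print("Expected from the shipment:  ", expected_from_shipment)
print("Sample overestimates by a factor of", estimate_from_sample / expected_from_shipment)
```

A sample of only 20 numbers is small, so an estimate that is double the true rate is not surprising.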

EXERCISES – Page No. 357

The dot plots show the number of hours a group of students spends online each week, and how many hours they spend reading. Compare the dot plots visually.

Question 1.
Compare the shapes, centers, and spreads of the dot plots.

Answer:
Shape:
Time spent online- Most of the students spend 4 hours or more.
Time spent reading- The students spent a maximum of 6 hours.
Centers:
The number of hours spent online is centered around 6 hours.
The number of hours spent reading is centered around 5 hours.
Spread:
The range for time spent online is 7-0=7.
The range for time spent reading is 6-0=6.

Question 2.
Calculate the medians of the dot plots.
Time online: __________
Time reading: __________

Answer:
Time online: 6 hours.
Time reading: 5 hours.

Explanation:
The data of time online is 0,4,4,5,5,6,6,6,6,6,6,7,7,7,7
The Median is 6 hours.
The data of time reading is 0,0,0,0,1,1,2,5,5,5,6,6,6,6,6
The Median is 5 hours.

Question 3.
Calculate the ranges of the dot plots.
Time online: __________
Time reading: __________

Answer:
Time online: 7 hours.
Time reading: 6 hours.

Explanation:
The range of time online is 7-0= 7.
The range of time reading is 6-0= 6.

Page No. 358

Question 4.
The average times (in minutes) a group of students spend studying and watching TV per school day are given.
Studying: 25, 30, 35, 45, 60, 60, 70, 75
Watching TV: 0, 35, 35, 45, 50, 50, 70, 75
a. Find the mean times for studying and for watching TV.
Studying: __________
Watching TV: __________

Answer:
Studying: 50.
Watching TV: 45.

Question 4.
b. Find the mean absolute deviations (MADs) for each data set.
Studying: __________
Watching TV: __________

Answer:
Studying: 16.25
Watching TV: 16.25

Explanation:
|25-50|= 25
|30-50|= 20
|35-50|= 15
|45-50|= 5
|60-50|= 10
|60-50|= 10
|70-50|= 20
|75-50|= 25
The mean absolute deviation for studying is \(\frac{25+20+15+5+10+10+20+25}{8} = \frac{130}{8}\)
= 16.25.
|0-45|= 45
|35-45|= 10
|35-45|= 10
|45-45|= 0
|50-45|= 5
|50-45|= 5
|70-45|= 25
|75-45|= 30
The mean absolute deviation for watching TV is \(\frac{45+10+10+0+5+5+25+30}{8} = \frac{130}{8}\)
= 16.25.

Question 4.
c. Find the difference of the means as a multiple of the MAD, to two decimal places.
_____

Explanation: (50-45)/16.25 = 5/16.25
= 0.31.
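Parts a to c can be verified together in Python. The function `mean_absolute_deviation` is written out here as a sketch, since the standard `statistics` module does not provide one; the data are the times given in the question.

```python
from statistics import mean

studying = [25, 30, 35, 45, 60, 60, 70, 75]
watching_tv = [0, 35, 35, 45, 50, 50, 70, 75]

def mean_absolute_deviation(data):
    """Average distance of each value from the mean of the data."""
    center = mean(data)
    return mean([abs(x - center) for x in data])

mad_studying = mean_absolute_deviation(studying)
mad_tv = mean_absolute_deviation(watching_tv)

# The two MADs are equal, so either one can be used as the unit.
multiple = (mean(studying) - mean(watching_tv)) / mad_studying

print("Means:", mean(studying), mean(watching_tv))
print("MADs: ", mad_studying, mad_tv)
print("Difference of the means as a multiple of the MAD:", round(multiple, 2))
```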

Unit 5 Performance Tasks

Question 5.
Entomologist
An entomologist is studying how two different types of flowers appeal to butterflies. The box-and-whisker plots show the number of butterflies that visited one of two different types of flowers in a field. The data were collected over a two-week period, for one hour each day.

a. Find the median, range, and interquartile range for each data set.

Answer:
Type A:
The median is 11.5
The range is 4
The IQR is 3
Type B:
The median is 11
The range is 10
The IQR is 2

Explanation:
Type A:
The median is (11+12)/2
= 23/2
= 11.5
The range is 13-9= 4
The IQR is 12-9= 3
Type B:
The median is 11
The range is 17-7= 10
The IQR is 12-10= 2

Question 5.
b. Which measure makes it appear that flower type A had a more consistent number of butterfly visits? Which measure makes it appear that flower type B did? If you had to choose one flower as having the more consistent visits, which would you choose? Explain your reasoning.

Answer: Type A has the smaller range, so the range makes it appear that type A had a more consistent number of butterfly visits. Type B has the smaller IQR, so the IQR makes it appear that type B had a more consistent number of butterfly visits. We would choose type A as having the more consistent visits because it has a much smaller range: the range of the fourth quartile for type B is larger than the range for the entire data set of type A.

Selected Response – Page No. 359

Question 1.
Which is a true statement based on the dot plots below?

Options:
a. Set B has a greater range.
b. Set B has a greater median.
c. Set B has the greater mean.
d. Set A is less symmetric than Set B.

Explanation:
Set A has a range of 60-20= 40
Set B has a range of 60-10= 50.
So Set B has a greater range.

Question 2.
Which is a solution to the equation 7g − 2 = 47?
Options:
a. g = 5
b. g = 6
c. g = 7
d. g = 8

Explanation:
7g-2= 47
7g= 47+2
7g= 49
g= 49/7
g= 7.

Question 3.
Which is a true statement based on the box plots below?

Options:
a. The data for Team B has a greater range.
b. The data for Team A is more symmetric.
c. The data for Team B has a greater interquartile range.
d. The data for Team A has a greater median.

Explanation: The box of Team B is much larger than the box of Team A, so the data for Team B have the greater interquartile range.

Question 4.
Which is the best way to choose a random sample of people from a sold-out movie audience for a survey?
Options:
a. Survey all audience members who visit the restroom during the movie.
b. Assign each seat a number, write each number on a slip of paper, and then draw several slips from a hat. Survey the people in those seats.
c. Survey all of the audience members who sit in the first or last row of seats in the movie theater.
d. Before the movie begins, ask for volunteers to participate in a survey. Survey the first twenty people who volunteer.

Explanation:
A is not random because the people chosen are all surveyed in one place.
B is random as all members of the population can be chosen and each member has an equal chance of being selected.
C may not give every member of the population an equal chance of being chosen, since the first or last rows may have more or fewer seats than the other rows.
D is not random because participants are self-selecting to do the survey.

Question 5.
Find the percent change from 84 to 63.
Options:
a. 30% decrease
b. 30% increase
c. 25% decrease
d. 25% increase

Explanation:
(84-63)/84 = 21/84
= 0.25
= 25% decrease

Question 6.
A survey asked 100 students in a school to name the temperature at which they feel most comfortable. The box plot below shows the results for temperatures in degrees Fahrenheit. Which could you infer based on the box plot below?

Options:
a. Most students prefer a temperature less than 65 degrees.
b. Most students prefer a temperature of at least 70 degrees.
c. Almost no students prefer a temperature of fewer than 75 degrees.
d. Almost no students prefer a temperature of more than 65 degrees.

Explanation: The upper half of the data is about 73-85, which means 50% of the students prefer a temperature above 73. This means that most students prefer a temperature of at least 70 degrees, since more than 50% of the box plot is at 70 degrees or more.

Page No. 360

Question 7.
The box plots below show data from a survey of students under 14 years old. They were asked on how many days in a month they read and draw. Based on the box plots, which is a true statement about students?

Options:
a. Most students draw at least 12 days a month.
b. Most students read less than 12 days a month.
c. Most students read more often than they draw.
d. Most students draw more often than they read.

Explanation: 4 out of 5 key values for reading are greater than the corresponding values for drawing, which means most of the students read more often than they draw.

Question 8.
Which describes the relationship between ∠NOM and ∠JOK in the diagram?

Options:
a. adjacent angles
b. complementary angles
c. supplementary angles
d. vertical angles

Explanation: ∠NOM and ∠JOK are vertical angles.

Question 9.
The tables show the typical number of minutes spent exercising each week for a group of fourth-grade students and a group of seventh-grade students.

a. What is the mean number of minutes spent exercising for fourth graders? For seventh graders?
4th grade: __________
7th grade: __________

Answer:
4th grade: 129
7th grade: 221

Question 9.
b. What is the mean absolute deviation of each data set?
4th grade: __________
7th grade: __________

Answer:
4th grade: 66.6
7th grade: 68

Explanation:
|120-129|= 9
|75-129|= 54
|30-129|= 99
|30-129|= 99
|240-129|=111
|90-129|= 39
|100-129|= 29
|180-129|= 51
|125-129|= 4
|300-129|= 171
The mean absolute deviation for fourth grade is \(\frac{9+54+99+99+111+39+29+51+4+171}{10} = \frac{666}{10}\)
= 66.6
|410-221|= 189
|145-221|= 76
|240-221|= 19
|250-221|= 29
|125-221|= 96
|95-221|= 126
|210-221|= 11
|190-221|= 31
|245-221|= 24
|300-221|= 79
The mean absolute deviation for seventh grade is \(\frac{189+76+19+29+96+126+11+31+24+79}{10} = \frac{680}{10}\)
= 68

Question 9.
c. Compare the two data sets with respect to their measures of center and their measures of variability.

Answer: The center for fourth grade is much smaller than the center for seventh grade, and the range is also smaller for fourth grade. This means that fourth graders spend less time exercising and have less variability in the number of minutes that they exercise.

Explanation:
The data of fourth grade is 30,30,75,90,100,120,125,180,240,300
Median is (100+120)/2
= 220/2
= 110
The range is 300-30= 270
The data of seventh grade is 95,125,145,190,210,240,245,250,300,410
Median is (210+240)/2
= 450/2
= 225
The range is 410-95= 315.
The center for fourth grade is much smaller than the center for seventh grade, and the range is also smaller for fourth grade. This means that fourth graders spend less time exercising and have less variability in the number of minutes that they exercise.

Question 9.
d. How many times the MADs is the difference between the means, to the nearest tenth?
_______

Answer: Because the MADs are not the same, we average them and then divide the difference of the means by that average. The difference between the means is about 1.4 times the MAD.

Explanation:
(66.6+68)/2
= 134.6/2
= 67.3
(221-129)/67.3
= 92/67.3
= 1.37, which is 1.4 to the nearest tenth.

Guided Practice – Page No. 371

Question 1.
In a hat, you have index cards with the numbers 1 through 10 written on them. Order the events from least likely to happen (1) to most likely to happen (8) when you pick one card at random. In the boxes, write a number from 1 to 8 to order the eight different events.
You pick a number greater than 0. __________
You pick an even number. __________
You pick a number that is at least 2. __________
You pick a number that is at most 0. __________
You pick a number divisible by 3. __________
You pick a number divisible by 5. __________
You pick a prime number. __________
You pick a number less than the greatest prime number. __________

Explanation:
There are 10 cards numbered 1 to 10, so there are 10 possible outcomes. The favorable outcomes for each event are:
Greater than 0: 1,2,3,4,5,6,7,8,9,10.
Even: 2,4,6,8,10.
At least 2: 2,3,4,5,6,7,8,9,10.
At most 0: none, as none of the integers from 1 to 10 is at most 0.
Divisible by 3: 3,6,9.
Divisible by 5: 5,10.
Prime: 2,3,5,7.
Less than the greatest prime number: 1,2,3,4,5,6, as 7 is the greatest prime number from 1 to 10.
The more favorable outcomes an event has, the more likely it is to happen. Thus picking a number that is at most 0 is the least likely event and picking a number greater than 0 is the most likely.
The order of the events from least likely (1) to most likely (8) is:
You pick a number greater than 0: 8
You pick an even number: 5
You pick a number that is at least 2: 7
You pick a number that is at most 0: 1
You pick a number divisible by 3: 3
You pick a number divisible by 5: 2
You pick a prime number: 4
You pick a number less than the greatest prime number: 6
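The ordering can be reproduced by counting favorable outcomes for each event. This is a sketch: the event labels, the `is_prime` helper and the dictionary layout are choices made for illustration.

```python
cards = range(1, 11)

def is_prime(n):
    """Trial-division primality test, adequate for small numbers."""
    return n > 1 and all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

greatest_prime = max(n for n in cards if is_prime(n))

events = {
    "greater than 0": lambda n: n > 0,
    "even": lambda n: n % 2 == 0,
    "at least 2": lambda n: n >= 2,
    "at most 0": lambda n: n <= 0,
    "divisible by 3": lambda n: n % 3 == 0,
    "divisible by 5": lambda n: n % 5 == 0,
    "prime": is_prime,
    "less than the greatest prime": lambda n: n < greatest_prime,
}

# Count the favorable outcomes, then rank from least likely (1) to most likely (8).
counts = {name: sum(1 for n in cards if test(n)) for name, test in events.items()}
ranking = sorted(counts, key=counts.get)
ranks = {name: position for position, name in enumerate(ranking, start=1)}

for name in events:
    print(f"{name}: {counts[name]} favorable outcomes, rank {ranks[name]}")
```

No two events have the same number of favorable outcomes here, so the ranking is unambiguous.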



4.8 Samples, populations and sampling

Remember, the role of descriptive statistics is to concisely summarize what we do know. In contrast, the purpose of inferential statistics is to “learn what we do not know from what we do”. What kinds of things would we like to learn about? And how do we learn them? These are the questions that lie at the heart of inferential statistics, and they are traditionally divided into two “big ideas”: estimation and hypothesis testing. The goal in this chapter is to introduce the first of these big ideas, estimation theory, but we’ll talk about sampling theory first because estimation theory doesn’t make sense until you understand sampling. So, this chapter divides into sampling theory, and how to make use of sampling theory to discuss how statisticians think about estimation. We have already done lots of sampling, so you are already familiar with some of the big ideas.

Sampling theory plays a huge role in specifying the assumptions upon which your statistical inferences rely. And in order to talk about “making inferences” the way statisticians think about it, we need to be a bit more explicit about what it is that we’re drawing inferences from (the sample) and what it is that we’re drawing inferences about (the population).

In almost every situation of interest, what we have available to us as researchers is a sample of data. We might have run an experiment with some number of participants; a polling company might have phoned some number of people to ask questions about voting intentions; and so on. Regardless, the data set available to us is finite and incomplete. We can’t possibly get every person in the world to do our experiment, and a polling company doesn’t have the time or the money to ring up every voter in the country. In our earlier discussion of descriptive statistics, this sample was the only thing we were interested in. Our only goal was to find ways of describing, summarizing and graphing that sample. This is about to change.

4.8.1 Defining a population

A sample is a concrete thing. You can open up a data file, and there’s the data from your sample. A population, on the other hand, is a more abstract idea. It refers to the set of all possible people, or all possible observations, that you want to draw conclusions about, and is generally much bigger than the sample. In an ideal world, the researcher would begin the study with a clear idea of what the population of interest is, since the process of designing a study and testing hypotheses about the data that it produces does depend on the population about which you want to make statements. However, that doesn’t always happen in practice: usually the researcher has a fairly vague idea of what the population is and designs the study as best he/she can on that basis.

Sometimes it’s easy to state the population of interest. For instance, in the “polling company” example, the population consisted of all voters enrolled at the time of the study – millions of people. The sample was a set of 1000 people who all belong to that population. In most cases, however, things are much less simple. In a typical psychological experiment, determining the population of interest is a bit more complicated. Suppose I run an experiment using 100 undergraduate students as my participants. My goal, as a cognitive scientist, is to try to learn something about how the mind works. So, which of the following would count as “the population”:

All of the undergraduate psychology students at the University of Adelaide?

Undergraduate psychology students in general, anywhere in the world?

Australians currently living?

Australians of similar ages to my sample?

Any human being, past, present or future?

Any biological organism with a sufficient degree of intelligence operating in a terrestrial environment?

Each of these defines a real group of mind-possessing entities, all of which might be of interest to me as a cognitive scientist, and it’s not at all clear which one ought to be the true population of interest.

4.8.2 Simple random samples

Irrespective of how we define the population, the critical point is that the sample is a subset of the population, and our goal is to use our knowledge of the sample to draw inferences about the properties of the population. The relationship between the two depends on the procedure by which the sample was selected. This procedure is referred to as a sampling method, and it is important to understand why it matters.

To keep things simple, imagine we have a bag containing 10 chips. Each chip has a unique letter printed on it, so we can distinguish between the 10 chips. The chips come in two colors, black and white.

Figure 4.9: Simple random sampling without replacement from a finite population

This set of chips is the population of interest, and it is depicted graphically on the left of Figure 4.9.

As you can see from looking at the picture, there are 4 black chips and 6 white chips, but of course in real life we wouldn’t know that unless we looked in the bag. Now imagine you run the following “experiment”: you shake up the bag, close your eyes, and pull out 4 chips without putting any of them back into the bag. First out comes the a chip (black), then the c chip (white), then j (white) and then finally b (black). If you wanted, you could then put all the chips back in the bag and repeat the experiment, as depicted on the right hand side of Figure 4.9. Each time you get different results, but the procedure is identical in each case. Because the same procedure can lead to different results each time, we refer to it as a random process. However, because we shook the bag before pulling any chips out, it seems reasonable to think that every chip has the same chance of being selected. A procedure in which every member of the population has the same chance of being selected is called a simple random sample. The fact that we did not put the chips back in the bag after pulling them out means that you can’t observe the same thing twice, and in such cases the observations are said to have been sampled without replacement.
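The bag-of-chips experiment is easy to imitate in Python. What follows is a minimal sketch, not a definitive implementation. The colors of chips a, b, c and j come from the text; the colors of the other six chips are my assumption, chosen only so that the bag holds 4 black and 6 white chips. The function name and the seed are also made up for the example.

```python
import random

# a, b, c and j are colored as in the text; the rest are assumed.
chips = {
    "a": "black", "b": "black", "c": "white", "d": "black", "e": "black",
    "f": "white", "g": "white", "h": "white", "i": "white", "j": "white",
}

def draw_without_replacement(population, k, rng):
    """Simple random sample: every chip is equally likely, no chip is drawn twice."""
    return rng.sample(sorted(population), k)

rng = random.Random(2024)  # fixed seed so the run is repeatable
for trial in range(3):
    drawn = draw_without_replacement(chips, 4, rng)
    print(trial, [(letter, chips[letter]) for letter in drawn])
```

Each pass through the loop plays the role of shaking the bag and drawing four chips: the procedure is identical every time, but the chips that come out differ.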

To help make sure you understand the importance of the sampling procedure, consider an alternative way in which the experiment could have been run. Suppose that my 5-year old son had opened the bag, and decided to pull out four black chips without putting any of them back in the bag. This biased sampling scheme is depicted in Figure 4.10.

Figure 4.10: Biased sampling without replacement from a finite population

Now consider the evidentiary value of seeing 4 black chips and 0 white chips. Clearly, it depends on the sampling scheme, does it not? If you know that the sampling scheme is biased to select only black chips, then a sample that consists of only black chips doesn’t tell you very much about the population! For this reason, statisticians really like it when a data set can be considered a simple random sample, because it makes the data analysis much easier.

A third procedure is worth mentioning. This time around we close our eyes, shake the bag, and pull out a chip. This time, however, we record the observation and then put the chip back in the bag. Again we close our eyes, shake the bag, and pull out a chip. We then repeat this procedure until we have 4 chips. Data sets generated in this way are still simple random samples, but because we put the chips back in the bag immediately after drawing them it is referred to as a sample with replacement. The difference between this situation and the first one is that it is possible to observe the same population member multiple times, as illustrated in Figure 4.11.
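Sampling with replacement can be sketched the same way. Again, the function name, the seed and the number of repetitions (200) are assumptions made for illustration. Each draw picks from the full bag, so the same chip can turn up more than once.

```python
import random

letters = list("abcdefghij")  # the ten chips, identified by letter

def draw_with_replacement(population, k, rng):
    """Draw k chips, returning each chip to the bag before the next draw."""
    return [rng.choice(population) for _ in range(k)]

rng = random.Random(7)  # fixed seed so the run is repeatable
samples = [draw_with_replacement(letters, 4, rng) for _ in range(200)]

# A sample contains a repeat when it has fewer than 4 distinct letters.
repeats = sum(1 for s in samples if len(set(s)) < 4)
print(repeats, "of", len(samples), "samples contain a repeated chip")
```

With only 10 chips in the bag, a sizeable share of the samples should contain a repeat, which is exactly what cannot happen when sampling without replacement.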

Figure 4.11: Simple random sampling with replacement from a finite population

Most psychology experiments tend to be sampling without replacement, because the same person is not allowed to participate in the experiment twice. However, most statistical theory is based on the assumption that the data arise from a simple random sample with replacement. In real life, this very rarely matters. If the population of interest is large (e.g., has more than 10 entities!) the difference between sampling with- and without- replacement is too small to be concerned with. The difference between simple random samples and biased samples, on the other hand, is not such an easy thing to dismiss.
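To see why the distinction rarely matters for large populations, we can compute the probability that a sample drawn with replacement contains no repeats at all. When that probability is close to 1, sampling with replacement behaves almost exactly like sampling without replacement. The function name and the population sizes below are my own choices for this sketch.

```python
def prob_all_distinct(population_size, draws):
    """Probability that `draws` picks with replacement are all different members."""
    p = 1.0
    for i in range(draws):
        # The (i+1)-th draw must avoid the i members already seen.
        p *= (population_size - i) / population_size
    return p

small = prob_all_distinct(10, 4)            # the bag of 10 chips, 4 draws
large = prob_all_distinct(1_000_000, 100)   # a large population, 100 draws

print("10 chips, 4 draws:      ", round(small, 3))
print("1,000,000 people, n=100:", round(large, 3))
```

For the tiny bag, repeats are common, but for a population of a million a sample of 100 almost never contains the same member twice.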

4.8.3 Most samples are not simple random samples

As you can see from looking at the list of possible populations that I showed above, it is almost impossible to obtain a simple random sample from most populations of interest. When I run experiments, I’d consider it a minor miracle if my participants turned out to be a random sampling of the undergraduate psychology students at Adelaide university, even though this is by far the narrowest population that I might want to generalize to. A thorough discussion of other types of sampling schemes is beyond the scope of this book, but to give you a sense of what’s out there I’ll list a few of the more important ones:

Stratified sampling. Suppose your population is (or can be) divided into several different sub-populations, or strata. Perhaps you’re running a study at several different sites, for example. Instead of trying to sample randomly from the population as a whole, you instead try to collect a separate random sample from each of the strata. Stratified sampling is sometimes easier to do than simple random sampling, especially when the population is already divided into the distinct strata. It can also be more efficient than simple random sampling, especially when some of the sub-populations are rare. For instance, when studying schizophrenia it would be much better to divide the population into two strata (schizophrenic and not-schizophrenic), and then sample an equal number of people from each group. If you selected people randomly, you would get so few schizophrenic people in the sample that your study would be useless. This specific kind of stratified sampling is referred to as oversampling because it makes a deliberate attempt to over-represent rare groups.

Snowball sampling is a technique that is especially useful when sampling from a “hidden” or hard to access population, and is especially common in social sciences. For instance, suppose the researchers want to conduct an opinion poll among transgender people. The research team might only have contact details for a few trans folks, so the survey starts by asking them to participate (stage 1). At the end of the survey, the participants are asked to provide contact details for other people who might want to participate. In stage 2, those new contacts are surveyed. The process continues until the researchers have sufficient data. The big advantage to snowball sampling is that it gets you data in situations that might otherwise be impossible to get any. On the statistical side, the main disadvantage is that the sample is highly non-random, and non-random in ways that are difficult to address. On the real life side, the disadvantage is that the procedure can be unethical if not handled well, because hidden populations are often hidden for a reason. I chose transgender people as an example here to highlight this: if you weren’t careful you might end up outing people who don’t want to be outed (very, very bad form), and even if you don’t make that mistake it can still be intrusive to use people’s social networks to study them. It’s certainly very hard to get people’s informed consent before contacting them, yet in many cases the simple act of contacting them and saying “hey we want to study you” can be hurtful. Social networks are complex things, and just because you can use them to get data doesn’t always mean you should.

Convenience sampling is more or less what it sounds like. The samples are chosen in a way that is convenient to the researcher, and not selected at random from the population of interest. Snowball sampling is one type of convenience sampling, but there are many others. A common example in psychology are studies that rely on undergraduate psychology students. These samples are generally non-random in two respects: firstly, reliance on undergraduate psychology students automatically means that your data are restricted to a single sub-population. Secondly, the students usually get to pick which studies they participate in, so the sample is a self selected subset of psychology students not a randomly selected subset. In real life, most studies are convenience samples of one form or another. This is sometimes a severe limitation, but not always.
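Of the schemes listed above, stratified sampling with oversampling is the easiest to sketch in code. The example below is illustrative only: the population is made up (10,000 people, of whom 1% belong to a rare group), and the group labels, function name and sample sizes are all assumptions.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Draw a separate simple random sample of equal size from each stratum."""
    return {name: rng.sample(members, per_stratum) for name, members in strata.items()}

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical population: person ids 0-99 are in the rare group.
population = [("rare" if person_id < 100 else "common", person_id)
              for person_id in range(10_000)]

# Split the population into strata by group label.
strata = {}
for group, person_id in population:
    strata.setdefault(group, []).append(person_id)

sample = stratified_sample(strata, per_stratum=50, rng=rng)
print({name: len(members) for name, members in sample.items()})
```

A simple random sample of 100 people from this population would contain about one member of the rare group on average, whereas the stratified sample contains 50 by design.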

4.8.4 How much does it matter if you don’t have a simple random sample?

Okay, so real world data collection tends not to involve nice simple random samples. Does that matter? A little thought should make it clear to you that it can matter if your data are not a simple random sample: just think about the difference between Figures 4.9 and 4.10. However, it’s not quite as bad as it sounds. Some types of biased samples are entirely unproblematic. For instance, when using a stratified sampling technique you actually know what the bias is because you created it deliberately, often to increase the effectiveness of your study, and there are statistical techniques that you can use to adjust for the biases you’ve introduced (not covered in this book!). So in those situations it’s not a problem.

More generally though, it’s important to remember that random sampling is a means to an end, not the end in itself. Let’s assume you’ve relied on a convenience sample, and as such you can assume it’s biased. A bias in your sampling method is only a problem if it causes you to draw the wrong conclusions. When viewed from that perspective, I’d argue that we don’t need the sample to be randomly generated in every respect: we only need it to be random with respect to the psychologically-relevant phenomenon of interest. Suppose I’m doing a study looking at working memory capacity. In study 1, I actually have the ability to sample randomly from all human beings currently alive, with one exception: I can only sample people born on a Monday. In study 2, I am able to sample randomly from the Australian population. I want to generalize my results to the population of all living humans. Which study is better? The answer, obviously, is study 1. Why? Because we have no reason to think that being “born on a Monday” has any interesting relationship to working memory capacity. In contrast, I can think of several reasons why “being Australian” might matter. Australia is a wealthy, industrialized country with a very well-developed education system. People growing up in that system will have had life experiences much more similar to the experiences of the people who designed the tests for working memory capacity. This shared experience might easily translate into similar beliefs about how to “take a test”, a shared assumption about how psychological experimentation works, and so on. These things might actually matter. For instance, a “test taking” style might have taught the Australian participants to direct their attention exclusively to fairly abstract test materials, in a way that people who haven’t grown up in a similar environment would not, leading to a misleading picture of what working memory capacity is.
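The study 1 versus study 2 argument can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population, the 5% subgroup, and the 5-point score advantage are all hypothetical numbers chosen for illustration, not estimates of anything real. The point it tries to show is that a restriction unrelated to the measured trait (birth weekday) leaves the sample mean close to the population mean, while a restriction that is related to the trait does not.

```python
import random
import statistics

random.seed(1)

# Hypothetical population. Each person has a birth weekday (0 = Monday),
# a flag for membership in one particular subgroup, and a test score.
# ASSUMPTION (illustration only): subgroup members score 5 points higher
# on average; birth weekday has no relationship to the score at all.
population = []
for _ in range(200_000):
    weekday = random.randrange(7)
    in_subgroup = random.random() < 0.05
    score = random.gauss(100, 15) + (5 if in_subgroup else 0)
    population.append((weekday, in_subgroup, score))

true_mean = statistics.fmean(p[2] for p in population)

# Study 1: random sample restricted to people born on a Monday.
monday_born = [p[2] for p in population if p[0] == 0]
study1 = random.sample(monday_born, 1000)

# Study 2: random sample restricted to the subgroup.
subgroup = [p[2] for p in population if p[1]]
study2 = random.sample(subgroup, 1000)

# How far each study's sample mean lands from the population mean.
error1 = abs(statistics.fmean(study1) - true_mean)
error2 = abs(statistics.fmean(study2) - true_mean)

print(f"population mean:        {true_mean:.2f}")
print(f"study 1 error (Monday): {error1:.2f}")
print(f"study 2 error (group):  {error2:.2f}")
```

Both samples are "biased" in the sense of being non-random with respect to the whole population, but only the second restriction is connected to the thing being measured, so only the second one distorts the estimate.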

There are two points hidden in this discussion. Firstly, when designing your own studies, it’s important to think about what population you care about, and try hard to sample in a way that is appropriate to that population. In practice, you’re usually forced to put up with a “sample of convenience” (e.g., psychology lecturers sample psychology students because that’s the least expensive way to collect data, and our coffers aren’t exactly overflowing with gold), but if so you should at least spend some time thinking about what the dangers of this practice might be.

Secondly, if you’re going to criticize someone else’s study because they’ve used a sample of convenience rather than laboriously sampling randomly from the entire human population, at least have the courtesy to offer a specific theory as to how this might have distorted the results. Remember, everyone in science is aware of this issue, and does what they can to alleviate it. Merely pointing out that “the study only included people from group BLAH” is entirely unhelpful, and borders on being insulting to the researchers, who are aware of the issue. They just don’t happen to be in possession of the infinite supply of time and money required to construct the perfect sample. In short, if you want to offer a responsible critique of the sampling process, then be helpful. Rehashing the blindingly obvious truisms that I’ve been rambling on about in this section isn’t helpful.

4.8.5 Population parameters and sample statistics

Okay. Setting aside the thorny methodological issues associated with obtaining a random sample, let’s consider a slightly different issue. Up to this point we have been talking about populations the way a scientist might. To a psychologist, a population might be a group of people. To an ecologist, a population might be a group of bears. In most cases the populations that scientists care about are concrete things that actually exist in the real world.

Statisticians, however, are a funny lot. On the one hand, they are interested in real world data and real science in the same way that scientists are. On the other hand, they also operate in the realm of pure abstraction in the way that mathematicians do. As a consequence, statistical theory tends to be a bit abstract in how a population is defined. In much the same way that psychological researchers operationalize our abstract theoretical ideas in terms of concrete measurements, statisticians operationalize the concept of a “population” in terms of mathematical objects that they know how to work with. You’ve already come across these objects: they’re called probability distributions (remember, the place where data comes from).

The idea is quite simple. Let’s say we’re talking about IQ scores. To a psychologist, the population of interest is a group of actual humans who have IQ scores. A statistician “simplifies” this by operationally defining the population as the probability distribution depicted in Figure 4.12a.

Figure 4.12: The population distribution of IQ scores (panel a) and two samples drawn randomly from it. In panel b we have a sample of 100 observations, and in panel c we have a sample of 10,000 observations.

IQ tests are designed so that the average IQ is 100, the standard deviation of IQ scores is 15, and the distribution of IQ scores is normal. These values are referred to as the population parameters because they are characteristics of the entire population. That is, we say that the population mean \(\mu\) is 100, and the population standard deviation \(\sigma\) is 15.

Now suppose we collect some data. We select 100 people at random and administer an IQ test, giving a simple random sample from the population. The sample would consist of a collection of numbers like this:

106 101 98 80 74 ... 107 72 100

Each of these IQ scores is sampled from a normal distribution with mean 100 and standard deviation 15. So if I plot a histogram of the sample, I get something like the one shown in Figure 4.12b. As you can see, the histogram is roughly the right shape, but it’s a very crude approximation to the true population distribution shown in Figure 4.12a. The mean of the sample is fairly close to the population mean of 100, but not identical. In this case, it turns out that the people in the sample have a mean IQ of 98.5, and the standard deviation of their IQ scores is 15.9. These sample statistics are properties of the data set, and although they are fairly similar to the true population values, they are not the same. In general, sample statistics are the things you can calculate from your data set, and the population parameters are the things you want to learn about. Later on in this chapter we’ll talk about how you can estimate population parameters using your sample statistics and how to work out how confident you are in your estimates, but before we get to that there are a few more ideas in sampling theory that you need to know about.
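To make the distinction between population parameters and sample statistics concrete, here is a minimal Python sketch that draws a simple random sample from a normal distribution with mean 100 and standard deviation 15 and then computes the sample mean and standard deviation. The seed, the rounding to whole IQ points, and the helper name `draw_iq_sample` are my own assumptions for illustration, so the output will not reproduce the 98.5 and 15.9 quoted above.

```python
import random
import statistics

random.seed(2024)  # arbitrary seed, only so that reruns give the same sample

# Population parameters, as stated in the text.
MU = 100
SIGMA = 15

def draw_iq_sample(n):
    """Draw n simulated IQ scores from Normal(MU, SIGMA), rounded to integers."""
    return [round(random.gauss(MU, SIGMA)) for _ in range(n)]

sample = draw_iq_sample(100)

# Sample statistics: things we can calculate from the data we collected.
sample_mean = statistics.fmean(sample)
# statistics.stdev divides by n - 1; using this divisor is an assumption here.
sample_sd = statistics.stdev(sample)

print(f"population mean = {MU}, sample mean = {sample_mean:.1f}")
print(f"population sd   = {SIGMA}, sample sd   = {sample_sd:.1f}")

# A much larger sample, as in panel c of the figure.
big_sample = draw_iq_sample(10_000)
big_mean = statistics.fmean(big_sample)
print(f"mean of 10,000 observations = {big_mean:.1f}")
```

Rerunning with a different seed gives different sample statistics each time, while the population parameters stay fixed at 100 and 15.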


Statistics Calculator: Scatter Plot

Use this page to generate a scatter diagram for a set of data:

    Enter the x and y data in the text box above. Data can be entered in two different formats:

comma or space separated x values in the first line and comma or space separated y values in the second line, or
individual x,y values (again, separated by commas or spaces) on each line.

Individual values within a line may be separated by commas, tabs or spaces. This flexibility in the input format should make it easier to paste data taken from other applications or from text books.

For the scatter plot to be displayed the number of x-values must equal the number of y-values.
To clear the scatter graph and enter a new data set, press "Reset".
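The two input formats described above could be handled along the following lines. This is a hypothetical sketch, not the calculator's actual code: the function name `parse_xy` is invented, and the rule for ambiguous input (when every line holds exactly two numbers, the lines are read as x,y pairs) is an assumption of mine.

```python
import re

def parse_xy(text):
    """Parse scatter-plot input in either of the two formats described above.

    Returns (xs, ys) as two lists of floats of equal length.
    ASSUMPTION: if every non-empty line holds exactly two numbers, the lines
    are treated as individual x,y pairs rather than as an x row and a y row.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Values may be separated by commas, tabs or spaces.
        tokens = [tok for tok in re.split(r"[,\s]+", line.strip()) if tok]
        rows.append([float(tok) for tok in tokens])

    if not rows:
        raise ValueError("no data entered")

    if all(len(row) == 2 for row in rows):
        # One x,y pair per line.
        xs = [row[0] for row in rows]
        ys = [row[1] for row in rows]
    elif len(rows) == 2:
        # First line holds the x values, second line the y values.
        xs, ys = rows
    else:
        raise ValueError("input matches neither supported format")

    if len(xs) != len(ys):
        raise ValueError("the number of x-values must equal the number of y-values")
    return xs, ys

print(parse_xy("1, 2, 3\n4 5 6"))
print(parse_xy("1,4\n2\t5\n3 6"))
```

Both calls describe the same three points, once as an x line followed by a y line and once as one pair per line.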

What is a scatter plot

A scatter plot (or scatter diagram) is a two-dimensional graphical representation of a set of data. Each pair of x and y values is represented on the graph as a dot or a cross.

This type of chart can be used to visually describe relationships (correlation) between two numerical parameters or to represent distributions.
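The relationship that a scatter plot shows visually can also be summarized with a number. The sketch below computes Pearson's correlation coefficient from the same x and y lists that would be plotted. Pearson's r is my choice of measure here, not something this page specifies, and the function name `pearson_r` and the example data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("the number of x-values must equal the number of y-values")
    if n < 2:
        raise ValueError("at least two points are needed")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations and of cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up example data: y tends to rise with x, but not perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
print(f"r = {pearson_r(xs, ys):.3f}")
```

Values of r near +1 or -1 correspond to points lying close to a rising or falling straight line, and values near 0 to a cloud with no linear trend.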

Excel is often used to generate scatter plots on a personal computer. See this article for a full explanation on producing a plot from a spreadsheet table.

