PLC Statistics: Plans for Thursday sessions

Possible activity related to estimation of mean from a sample: Walk the Line

May 29

Discuss stats in the big wide world
Projects?
Review Ho and Ha for two independent samples; another option is to state them in terms of how the pop means compare

Ho: μ1 = μ2
Ha: μ1 ≠ μ2

Review ideas for ANOVA

extension of two independent groups to 3 or more

called analysis of variance....ANOVA
are the explanatory and response variables categorical or quantitative?
display image from OLI, p. 243

Hypotheses

H0:μ1=μ2=...=μk
Ha:not all the μ's are equal.

means and sds for each group
histograms by group
side-by-side boxplots

Test statistic

F-test
compares variation among groups to variation within group (between divided by within)
comparison of boxplots with large and small within group variation, OLI, p. 245
F = variation among groups / variation within groups; when the null is true, the F statistic follows an F distribution

determine what the F distribution must look like

will have two degrees of freedom, one for num and one for den, different F distribution for each possibility
generally, 4 or greater is a large F statistic
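A minimal sketch of how the F statistic is assembled, assuming three small made-up groups (the data and the function name `f_statistic` are illustrative only and are not the Smiles and Leniency data):

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F: variation among group means / variation within groups."""
    k = len(groups)                                  # number of groups
    n_total = sum(len(g) for g in groups)            # total observations
    grand_mean = mean(x for g in groups for x in g)

    # Among-groups sum of squares, with k - 1 degrees of freedom (numerator)
    ss_among = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares, with n_total - k degrees of freedom (denominator)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_num, df_den = k - 1, n_total - k
    f = (ss_among / df_num) / (ss_within / df_den)
    return f, df_num, df_den

# Hypothetical data: group means 5, 7, 10 with the same spread in each group
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f, df_num, df_den = f_statistic(groups)
print(f"F = {f:.1f} on ({df_num}, {df_den}) degrees of freedom")
```

Here the group means are far apart relative to the spread inside each group, so F comes out well above the rule-of-thumb value of 4.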

Conditions for safe use of F test

samples are independent
response variable varies Normally in each population being compared
response variable has the same sd in each population; rule of thumb: largest sample sd no more than twice as big as the smallest
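The equal-spread rule of thumb is easy to check by machine. A small sketch, assuming made-up groups and a hypothetical helper name `spreads_similar`:

```python
from statistics import stdev

def spreads_similar(groups, max_ratio=2.0):
    """Rule of thumb: largest group sd is no more than max_ratio times the smallest."""
    sds = [stdev(g) for g in groups]
    return max(sds) <= max_ratio * min(sds)

# Hypothetical groups
print(spreads_similar([[4, 5, 6], [6, 7, 8], [9, 10, 11]]))   # sds are 1, 1, 1
print(spreads_similar([[4, 5, 6], [6, 7, 8], [6, 10, 14]]))   # sds are 1, 1, 4
```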

Example: Smiles and Leniency

review research scenario

experimental or observational
id expl and resp variables

Step 1: Ho and Ha
Step 2: EDA and test statistic, check conditions

see pdf of spss output

Step 3: Evaluate p-value
Step 4: Conclusion....about populations
But NO information about why Ho is rejected, about which means are significantly different from the others. Use comparison methods for this

May 22

Discuss stats in the big wide world - Trenton Times article
Projects?
Quick review of t vs. z distribution, 4 steps of hypothesis testing
Introduce idea of inference for relationships--display role-type classifications.png
Review ideas for two independent samples

Show graphic Case I - designs.png
Explain how explanatory and response variables are used in Ho and Ha. (Ho: Explanatory variable is not related to response; Ha: Explanatory is related to response)
Conditions under which two sample t-test can be used? (independence, sample sizes are both large or have normal populations)
Structure of t-test is similar to structure of previous tests:

$t = \frac{\text{sample estimate} - \text{null value}}{\text{standard error}} = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{SE}$

One issue is we have two populations, each with its own sigma, which we estimate with the sample sd....the resulting statistic does not have an exact t dist; can approximate it with a t-dist using an approximate degrees of freedom.
What does a p-value mean? (assuming the null hypothesis is true, this is how likely it is to get results at least as extreme as these; p-value = .15 means somewhat probable; p-value = .001 means very unlikely; p-value = .061 means ???)
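A sketch of the two-sample t statistic from summary statistics. The notes do not say which degrees-of-freedom approximation the course uses; the Welch-Satterthwaite formula below is one common choice and is an assumption here, as are the function name and the example numbers.

```python
from math import sqrt

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for Ho: mu1 = mu2, with approximate (Welch) degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2      # estimated variance of each sample mean
    se = sqrt(v1 + v2)                         # standard error of x-bar1 - x-bar2
    t = ((mean1 - mean2) - 0) / se             # (sample estimate - null value) / SE
    # Welch-Satterthwaite approximation (assumed, not taken from the notes)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical summary statistics for two independent groups
t, df = two_sample_t(mean1=3.8, sd1=1.0, n1=100, mean2=3.5, sd2=1.2, n2=100)
print(f"t = {t:.2f}, approx df = {df:.1f}")
```

The approximate df always falls between the smaller sample size minus 1 and n1 + n2 - 2, which gives a quick sanity check on the calculation.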

Example: drive thru speaker clarity

QSRMagazine.com surveyed 689 adults on their drive-thru window experience at fast food restaurants. One question was "Thinking about your most recent drive-thru experience, please rate how satisfied you were with the clarity of communication through the speaker." Responses ranged from Very Dissatisfied (1) to Very Satisfied (5).
work through 4 steps; use speakerclarity.ods (remember to set alpha in step 1).

May 15

Discuss stats in the big wide world - Facebook statistics
Quick review of hypothesis testing for pop mean (4 steps, test statistic is a z-score)
Discuss how to deal with unknown σ.

display t-dist vs. z-dist to show how the t-dist varies with different sample size and how with larger sample size it approaches the z distribution.
display z-vs-t.png to show the additional source of variation in the t statistic, so t distribution is more approp.
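If a plot is not handy, the same comparison can be made numerically. This sketch evaluates the t density from its standard formula (the function name `t_pdf` and the chosen degrees of freedom are illustrative) and compares it with the standard Normal density at the center and in the tail.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_density = (lgamma((df + 1) / 2) - lgamma(df / 2)
                   - 0.5 * log(df * pi)
                   - (df + 1) / 2 * log(1 + x * x / df))
    return exp(log_density)

z = NormalDist()  # standard Normal, mean 0 and sd 1

# Height at the center (x = 0) and in the tail (x = 3)
for df in (2, 5, 30, 1000):
    print(f"t, df={df:>4}: center {t_pdf(0, df):.4f}, tail {t_pdf(3, df):.4f}")
print(f"z         : center {z.pdf(0):.4f}, tail {z.pdf(3):.4f}")
```

Small df gives a lower peak and heavier tails; as df grows, both columns move toward the Normal values.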

Example - facebook friends

work through 4 steps for facebook friends dataset, facebookfriends.xls or facebookfriends.ods (worked in sheet 2)
note: dist is skewed, so n=30 not large enough to be confident that t-distribution is approp.
set alpha = .05....fail to reject Ho.

May 8

Discuss stats in the big wide world - catherine's huff post article

Present and discuss student examples of research question and Ho and Ha.

Continue with hypothesis testing steps 2-4

Example (continued): The researcher obtains a random sample of 50 college students currently taking 15 credits and collects the number of hours they study per week: x-bar = 27 and sd = 5 hours per week. What evidence was collected? How might we summarize the evidence against H₀? (Step 2)

evidence is the sample mean and sd.

we can compare the sample results to the hypothesized value

For this example, we will employ a z statistic to compare the sample mean to the hypothesized value. How does this work? What do we call this type of statistic, generally speaking?

assume standard deviation of estimate is .7

called a test statistic....this is a very typical form

To assess the evidence we ask the question: how likely is it to get data like that observed when H₀ is true? What do we need to answer this question? (Step 3)

the probability of obtaining this value or one more extreme, if the population parameter in H₀ is true.

called the p-value

if very small, then unlikely to observe this value or one more extreme if H₀ is true.

if large, then not surprising to see a value like this, if H₀ is true. Could have happened by chance

What can we use to give us this probability....of observing a particular value or one more extreme given the population parameter provided in H₀.

sampling distribution (ask if it is safe to use the sampling distribution of the mean for this example.....yes, n>40)

(draw sampling distribution of x-bar for μ=30, σ=.7)

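the z-statistic, -2.86, is from a Normal distribution representing the sampling distribution of the mean. (check the numbers before class: (28 - 30)/.7 = -2.86, whereas x-bar = 27 as given above would give z = -4.29)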

(draw Normal distribution, shade areas outside +/-2.86)

for a two sided H_a, P(Z <= -2.86 or Z >= 2.86)...sum the lower and upper tails: .0021*2=.0042

A p-value of p=.0042 is pretty small, but is it small enough to decide against H₀ (that the population mean is 30)? How can we decide? (Step 4)

compare our result to a threshold value

pre-determined

called significance level, α.

What are some common significance levels? What do they mean?

α=.05, α=.01, α=.1

if we find a result that would occur less than 5% (1%, 10%) of the time, when H₀ is true, then we decide to reject H₀ and accept H_a.

our result is statistically significant at level α.

What do we conclude if the p value is not smaller than α?

we decide that our data do not provide enough evidence to reject Ho

we can also say that the data do not provide enough evidence to accept Ha.

we cannot say that the data support H₀, or that we accept H₀.

What do we conclude in our example for p=.0042?
H₀: The average time full-time Rutgers students study outside of class is 30 hours per week.
H_a: The average time full-time Rutgers students study outside of class is not 30 hours per week.
assume we set α=.05 before the data were collected
p=.0042

our result is statistically significant

we reject H₀ and conclude that the average time full-time Rutgers students study outside of class is not 30 hours per week.

What if before we collected the data we suspected that Rutgers students, on average, study less than 30 hours per week. How does this change how we assess the evidence and what we conclude?

(draw Normal distribution with z=-2.86 and shade only area below)

our p-value is smaller....p=.0021....more powerful test because we have prior information about direction of difference

our conclusion doesn't change.
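The tail areas used in this example can be reproduced without a table. A minimal sketch, taking z = -2.86 and α = .05 as given in the notes (the function name `p_value` and its `alternative` argument are illustrative):

```python
from statistics import NormalDist

def p_value(z, alternative="two-sided"):
    """Tail area under the standard Normal curve for a z statistic."""
    cdf = NormalDist().cdf
    if alternative == "less":        # Ha: mu < mu0, lower tail only
        return cdf(z)
    if alternative == "greater":     # Ha: mu > mu0, upper tail only
        return 1 - cdf(z)
    return 2 * cdf(-abs(z))          # Ha: mu != mu0, both tails

z = -2.86      # test statistic from the notes
alpha = 0.05   # significance level, set before the data were collected

for alternative in ("two-sided", "less"):
    p = p_value(z, alternative)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{alternative}: p = {p:.4f}, {decision}")
```

The one-sided p-value is exactly half the two-sided one, and both fall below α = .05, matching the conclusion above.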

Who are we making a conclusion about?

The population of Rutgers students from which our sample was drawn.

Discuss ideas for projects

May 1

Discuss stats in the big wide world - Statisticians slam use of teacher evaluation scheme
Continue reviewing confidence intervals...continue with notes below.
Introduce idea of hypothesis testing:

Consider a situation where a student is brought before an academic committee with a claim that she cheated. The committee assumes that the student is innocent until proven guilty. The instructor presents convincing evidence of cheating. What should the committee decide as to whether or not the student cheated?

committee should find evidence convincing and decide that student cheated.

Tests of significance work similarly. Identify the following in the cheating story.

Identify two opposing claims
- student claims innocence (claim 1); instructor claims she cheated (claim 2).
- claim 1 is challenged by claim 2
- begin with assumption that claim 1 is true.
Collect evidence
- instructor provides evidence against claim 1
- observations in sample will serve as evidence against claim 1
Assess evidence
- committee evaluates evidence: how likely (probability based) to observe this evidence if student is innocent
- evaluate sample statistics in context of sampling distribution; determine how likely to observe this result if it were to have occurred by chance.
Make a decision
- If very unlikely that student could be innocent (claim 1) given evidence (strong evidence against claim 1), then reject claim 1 and decide for claim 2
- If likely that student could be innocent (claim 1) given evidence (weak evidence against claim 1), then stay with claim 1 (cannot reject claim 1 in favor of claim 2). Note: we do not say we accept claim one, we just don't have anything better to conclude.

What do we call the two claims in tests of significance?

Null hypothesis (H_o), claim 1

typically statement of no effect, no difference; the assumed usual state

Alternative hypothesis (H_a), claim 2
- statement that disagrees with H_o, specifying what we think might be going on; written as an "opposite" of null hypothesis.

Example: Traditional practice suggests that college students should study 2 hours for every 1 hour of classroom time. Using this rule, a student with 15 hours of classroom time per week (i.e., 15 credit hours, denoted full-time) should study on average 30 hours per week. A researcher is interested in whether this rule applies at Rutgers University. What are the null and alternative hypotheses for this study? (Step 1)
- H₀: The average time full-time Rutgers students study outside of class is 30 hours per week.
- H_a: The average time full-time Rutgers students study outside of class is not 30 hours per week.
When wording the hypotheses (claims) who are they about?
- the population
- the population in our example is the university students
If we suspected that full-time Rutgers students study less than 30 hours per week, how would we have stated H_a?
- H_a: The average time full-time Rutgers students study outside of class is less than 30 hours per week.
- One-sided alternative, could be greater than or less than
- Two-sided alternative, population could differ in either direction.
- Must have a specific direction firmly in mind (without looking at the data) to choose one-sided

Have students work in pairs to come up with research question and Ho/Ha.
Discuss possibilities for individual stats projects

Apr 24

Discuss stats in the big wide world - homeschoolers have better diets; are thinner, leaner
German tank problem

Display student simulations (Margaret, Ethan, ??)

discuss desirability of point estimators that are unbiased (on average, they hit the population parameter being estimated) and have small variability.

Display histograms of simulations of other possible estimators. Which is the best?
Explain the estimator chosen by Allied Statisticians: max + average gap
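Written out, assuming a sample of $n$ serial numbers whose largest value is $m$ (symbols chosen here, not taken from the notes), the average gap between observed serial numbers is $(m - n)/n$, so the estimator is

$$\hat{N} = m + \frac{m - n}{n} = m\left(1 + \frac{1}{n}\right) - 1$$

For a hypothetical sample of $n = 5$ tanks with largest serial number $m = 100$, this gives $\hat{N} = 100 + 95/5 = 119$.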

Review concepts related to point estimation and intro to confidence intervals

Why are the statistics p-hat and x-bar good estimators for their respective population parameters? (because as long as the sample is taken at random, the sampling distribution of the sample mean or sample proportion is exactly centered at the population parameter)
What does it mean to be an unbiased estimator? (bias is the difference between the expected value of the estimator and the true value of the population parameter being estimated; unbiased means this difference is 0; the estimate is not systematically too low or too high)
What is required when designing a study to be confident that the resulting estimator is not biased? (sample is random, design is not flawed in some way)
How can the estimator's accuracy (for predicting parameter) be improved? (larger sample size...the bigger the sample, the more of the pop that's been included, the closer the estimate...larger sample means smaller sd, narrower distribution, better estimate)
Are p-hat and x-bar the only point estimators? Are there others? (there are lots of others, that's what statistics is all about, recall german tank problem, # of iphones sold)
What's the downside of a point estimator? (It's so often wrong, maybe not by a lot, but enough to make you wonder)
What can we use to bolster the point estimate? (an interval estimate)
What does an interval estimate tell us? (the size of the error attached to the point estimate)
What is the term for an interval estimate? (confidence interval)
Why is it called that...what two elements are included? (the interval itself, and how confident we are that the parameter is in the given interval)
If we want to be 95% confident that the interval contains the population mean, how would we construct the interval? (+-2 sd around the sample mean, with sd being the standard deviation of the sampling distribution)

Review of concepts related to confidence intervals

Confidence intervals for population mean

What words do we use to explain a specific confidence interval? (We are XX% confident that the unknown population parameter is in the interval (A, B))
How is a confidence interval created? (point estimate +- z*(σ/sqrt(n)) )
What are the numbers that are multiplied by the sd of point estimator? (standard deviation units...z*)
What are three common values for z*? (1.645 for 90% confident; 2, or more precisely 1.96, for 95% confident; 2.576 for 99% confident)
Why does the interval get wider as we move from being 90% confident to 99% confident? (need to include more of the possibilities to be more confident)
What is the tradeoff when deciding how confident we want to be with our interval (higher confidence means less precision....larger range)
What example might help us understand this? (To be 100% confident, we'd have to include all possible values...which is really unhelpful if the point was to provide data to help understand or make a decision...need to narrow it down a bit...be a little more precise about what we think)
What is the margin of error? (the amount that is added or subtracted (given a certain level of confidence) to create the confidence interval)
....mathematically? ( m = z*(σ/sqrt(n)) )
how do we use the margin of error? (show image confidence interval structure.png)
How can we reduce the confidence interval (margin of error) but keep the same level of confidence (get a bigger sample size...reduces sd....reduces m)
Why does increasing the sample size reduce the margin of error (the sampling distribution is more narrow with a larger sample size, so more sure of point estimate)
So if it's so easy to be more precise (larger sample) why don't more studies take advantage of this? (cost, availability...)
What do we do if we want to determine what sample size to use for a given margin of error and level of confidence (solve for n, n = (z* x σ/m)^2 )
What do we do if we calculate a fractional value for n? why? ( round up; larger sample size is more "conservative" )
What underlying condition supports the development of a confidence interval? (sampling distribution of estimator, x-bar in this instance, is normal...central limit theorem)
What conditions must we have when creating confidence intervals? (the sample must be random, sample is large (n>30) or if sample is smaller that the variable is normally distributed -- show image confidence interval assumptions.png)
What's wrong with our calculations so far, practically speaking? (σ, the population standard deviation, is often not known!!)
How do we solve this? (use the t distribution rather than the z distribution to determine the multiplier)
What is the formula for the margin of error when σ is unknown? (substitute sd for σ and t* for z*... m = t*(sd/sqrt(n)) )
How is t* different from z*? (depends on sample size as well as confidence level...degrees of freedom)
What do we call this new formula, in which sd is substituted for σ? (standard error...of x-bar; show SAT score reports, discuss standard error used to create score range, what does it mean, think about the range of scores from one test to the next)
Under what conditions can we use the sd version of this formula? (same as for pop version...random sample, n>30 unless pop dist is distributed normally)
When can we use the z*-based calculation to get a pretty good estimate of the t*-based calculation? (large values of n...rule of thumb is >30...interesting discussion of percent error)
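The z-based formulas in this list can be sketched in a few lines. The function names are illustrative, and the example values (x-bar = 27, σ = 5, n = 50, desired margin of 1) are made up and treat σ as known:

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_star(confidence):
    """Multiplier that leaves (1 - confidence)/2 in each tail of the Normal curve."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def margin_of_error(sigma, n, confidence=0.95):
    """m = z* (sigma / sqrt(n))"""
    return z_star(confidence) * sigma / sqrt(n)

def confidence_interval(x_bar, sigma, n, confidence=0.95):
    """point estimate +- margin of error"""
    m = margin_of_error(sigma, n, confidence)
    return x_bar - m, x_bar + m

def sample_size(sigma, m, confidence=0.95):
    """n = (z* sigma / m)^2, rounded up to be conservative."""
    return ceil((z_star(confidence) * sigma / m) ** 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confident: z* = {z_star(level):.3f}")

low, high = confidence_interval(x_bar=27, sigma=5, n=50)
print(f"95% CI: ({low:.2f}, {high:.2f})")
print("n needed for m = 1:", sample_size(sigma=5, m=1))
```

Raising the confidence level raises z* and widens the interval; raising n shrinks σ/sqrt(n) and narrows it.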

Confidence intervals for population proportion p

In what situations does it make sense to study the population proportion? (categorical variable)
What is the point estimator for population proportion? (p-hat)
What is the general form of a confidence interval? (estimate +/- margin of error)
To create the margin of error, we need a few things:
multiplier? (z*...1.645, 2, and 2.576 at the 90%, 95% and 99% confidence levels)
sd of sampling distribution for p-hat? sqrt(p(1-p)/n)... but we don't know p...that's what we are trying to estimate...solution: sqrt(p-hat(1-p-hat)/n)
What do we call this sd of p-hat when the p-hat is substituted for p? (standard error)
If we want to be more confident that the interval contains the population parameter, what do we have to 'give' on? (precision, narrowness of interval)
What else can we do to increase precision, for a fixed level of confidence? (increase sample size)
What practical problem arises when calculating a desired sample size, given a confidence level? (formula uses p-hat, but that's what we want to estimate with the sample...)
How can we overcome this problem? (use a conservative value for p-hat...one that will make the largest standard error...this is always p-hat=.5...have the students confirm that this is true)
What is the formula for the conservative estimate of n given m at a 95% confidence level? (n=1/m^2)
What is the formula for the conservative estimate of m given n at a 95% confidence level? When is this useful? (m=1/sqrt(n); useful for reporting a conservative result that applies to all questions, even though the questions have varying values of p-hat)
Under what conditions is it safe to use the methods described to create a confidence interval for pop prop? (sampling distribution of p-hat must be approximately normal, therefore n * p-hat >= 10; n(1-p-hat) >= 10)
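A short sketch of the proportion formulas above, using the rounded 95% multiplier z* = 2 from the notes. The function names and the example numbers are illustrative. The last lines let students confirm that p-hat = .5 gives the largest standard error.

```python
from math import ceil, sqrt

def standard_error(p_hat, n):
    """sqrt(p-hat (1 - p-hat) / n)"""
    return sqrt(p_hat * (1 - p_hat) / n)

def proportion_ci(p_hat, n, z_star=2.0):
    """estimate +/- margin of error, with z* = 2 as the rounded 95% multiplier"""
    m = z_star * standard_error(p_hat, n)
    return p_hat - m, p_hat + m

def conditions_met(p_hat, n):
    """n * p-hat >= 10 and n(1 - p-hat) >= 10"""
    return n * p_hat >= 10 and n * (1 - p_hat) >= 10

def conservative_margin(n):
    """m = 1/sqrt(n): the 95% margin when p-hat = .5"""
    return 1 / sqrt(n)

def conservative_sample_size(m):
    """n = 1/m^2, rounded up"""
    return ceil(1 / m ** 2)

# Hypothetical poll: p-hat = .40 from n = 400
print(proportion_ci(0.40, 400), conditions_met(0.40, 400))
print(conservative_margin(400), conservative_sample_size(0.03))

# Which p-hat on a grid from .01 to .99 makes p-hat(1 - p-hat) largest?
grid = [i / 100 for i in range(1, 100)]
worst_p = max(grid, key=lambda p: p * (1 - p))
print("largest standard error at p-hat =", worst_p)
```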

Apr 10

Discuss stats in the big wide world - sam's club point estimate
Activity: Point estimation - German tank problem

Form kids into groups
Part 1 - 10 min

Groups work on designing estimate, and use N=5 sample to calculate estimate
Share results
Discuss: How do we decide the "usefulness" of an estimator. Do the concepts of bias and variability play a role?
Discuss: How can we evaluate an estimator for bias and variability?

Part 2 - 30 min

Use a spreadsheet or other program to simulate N=5 sampling distribution for your estimate. Assume population N is 122.
Create histogram showing results.
Discuss bias and variability.
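For groups that prefer a program to a spreadsheet, here is a minimal simulation sketch. It assumes tanks numbered 1 to 122 and samples of 5 drawn without replacement. The "twice the mean" estimator is included only as an example of something a group might propose; both function names are made up.

```python
import random
from statistics import mean, stdev

def max_plus_average_gap(sample):
    """Largest serial number plus the average gap between observed numbers."""
    m, n = max(sample), len(sample)
    return m + (m - n) / n

def twice_mean(sample):
    """A competing estimator a group might try: 2 * (sample mean) - 1."""
    return 2 * mean(sample) - 1

def simulate(estimator, population_size=122, n=5, reps=10_000, seed=1):
    """Apply the estimator to many random samples drawn without replacement."""
    rng = random.Random(seed)
    tanks = range(1, population_size + 1)
    return [estimator(rng.sample(tanks, n)) for _ in range(reps)]

for estimator in (max_plus_average_gap, twice_mean):
    results = simulate(estimator)
    # Center near 122 suggests little bias; smaller sd means less variability
    print(f"{estimator.__name__}: mean {mean(results):.1f}, sd {stdev(results):.1f}")
```

A histogram of each `results` list shows the same comparison visually: where the estimates are centered relative to 122, and how widely they scatter.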

Apr 3

Discuss stats in the big wide world
Review of concepts related to sample distributions

In a sampling distribution of x-bar, what does each value in the distribution represent? (the mean of a random sample from a particular population)
What else do we need to know about that sample to make sense of the sampling distribution? (how big each sample is: n)
How many samples are in the distribution? (the assumption is an infinite number....n is NOT the number of samples, but the number of observations in EACH sample)
What happens if the sample size for all of these means is large, say n=50? (the distribution of x-bar will be approximately normal)
What if the sample size is small, say n=3? (the distribution may or may not be normal, can't assume so)
What happens to the standard deviation of the sampling distribution when n is small? (larger) when n is large? (smaller)
What is the standard deviation of the sampling distribution of the mean? (if you know the population standard deviation, then you can calculate it: σ/sqrt(n))
What is the central limit theorem? (it states that given a distribution with a mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with a mean (μ) and a variance σ²/N as N, the sample size, increases.)
What is amazing about this theorem? (regardless of the shape of the population distribution, averages that have a large enough sample will have a normal distribution)
How large must the sample size be for the central limit theorem to kick in? (larger than 30 works for nearly every population distribution; smaller sample sizes will work for pop dists that are somewhat normal.)
Is there a different sample size rule for when the population distribution has a normal distribution? (Averages based on any sample size will be normally distributed)
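If the CLT simulator is unavailable, a rough stand-in can be sketched as follows. It assumes a strongly right-skewed population (exponential with mean 1 and sd 1, chosen only for illustration) and compares the sd of the simulated sample means with σ/sqrt(n).

```python
import random
from math import sqrt
from statistics import mean, stdev

def sample_means(n, reps=5_000, seed=2):
    """Means of `reps` random samples of size n from a right-skewed population."""
    rng = random.Random(seed)
    # Exponential population with rate 1: mean 1, sd 1, far from Normal
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

for n in (3, 50):
    means = sample_means(n)
    print(f"n={n}: mean of x-bar = {mean(means):.3f}, "
          f"sd of x-bar = {stdev(means):.3f}, sigma/sqrt(n) = {1 / sqrt(n):.3f}")
```

A histogram of `sample_means(3)` should still look skewed, while `sample_means(50)` should look close to bell-shaped and much narrower.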

Play with CLT simulator
Intro to Inference

In statistics, what is meant by the term inference? (inferring something about the population based on what is measured in the sample)
What is point estimation? (estimating an unknown population parameter by a single number calculated from the sample data)
What is a confidence interval? (an estimate of an unknown population parameter by an interval of values, calculated from the sample data, that is likely to contain the true value of that parameter along with an indication of how confident we are that this interval indeed captures the true value of the parameter)
What is hypothesis testing? (for a stated claim about the population, a decision whether or not the data obtained from the sample provide evidence against this claim.)
When the variable of interest is categorical, what population parameter do we make inferences about? (population proportion, p)
When the variable of interest is quantitative, what population parameter do we make inferences about? (population mean, μ)
In the context of statistical inference, what is an estimator? (a statistic used to estimate a pop parameter)
..., what is an estimate? (the value of the statistic that is used as the point estimate for the parameter)
What do we use as the point estimator for μ? (the sample statistic x-bar)

Begin Point estimation - German tank problem

Mar 27

Review of concepts related to sampling distributions
- What is sampling variability? (the idea that the characteristics of a sample, from a given population, vary from one sample to another)
- What is a parameter? (a number that describes the population)
- What is a statistic? (a number that is computed from the sample)
- What symbol is used to denote a population proportion? ( $p$ ) a sample proportion? (p-hat)
- What symbol is used to denote a population mean? (μ) a sample mean? (x-bar)
- What symbol is used to denote a population standard deviation? (σ) a sample standard deviation? (s, or sd)
- Why are parameters generally unknown, as compared to the ease of calculating statistics? (usually impractical or impossible to know the value of a variable for every individual in the population)
- Sampling distribution of p-hat:
Creating a sampling distribution

Let's consider the counts by color for the fun-size bags of M&M's in the M&M spreadsheet that you created in the first few weeks of this course to represent a sampling distribution of p-hat for each color.

--Issue with using bags of M&M's to represent independent samples in a sampling distribution from the population (N varies from bag to bag; np and n(1-p) may be less than 10).

Using the spreadsheet do the following:
- Convert the count for each bag into a proportion.
- Calculate the mean of p-hat. (to estimate the population proportion)
- Calculate the standard deviation of p-hat.
- Plot the distribution of p-hat as a histogram.
Discuss the shape of the distribution. Is it normal looking? If not, consider reasons why it might not be.
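The same steps can be sketched outside the spreadsheet. The bag counts below are invented for illustration and are not the class data; each pair is (number of blue candies, total candies in the bag).

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical (blue count, bag total) pairs, one per fun-size bag
bags = [(4, 17), (3, 18), (6, 17), (2, 16), (5, 18)]

# Convert the count for each bag into a proportion
p_hats = [blue / total for blue, total in bags]

center = mean(p_hats)    # estimate of the population proportion
spread = stdev(p_hats)   # standard deviation of p-hat across bags
print(f"mean of p-hat = {center:.3f}, sd of p-hat = {spread:.3f}")

# Crude text histogram with bins of width .05
bins = Counter(int(p * 20) for p in p_hats)
for b in sorted(bins):
    print(f"{b / 20:.2f}-{(b + 1) / 20:.2f}: {'*' * bins[b]}")
```

With real class data, the mean of p-hat for blue can be compared with the 24% reported for milk chocolate M&M's.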

Proportion of M&M's by color, provided in a letter from Mars Snackfood US:
- M&M'S MILK CHOCOLATE: 24% cyan blue, 20% orange, 16% green, 14% bright yellow, 13% red, 13% brown.
- M&M'S PEANUT: 23% cyan blue, 23% orange, 15% green, 15% bright yellow, 12% red, 12% brown.
- M&M'S PEANUT BUTTER and ALMOND: 20% cyan blue, 20% orange, 20% green, 20% bright yellow, 10% red, 10% brown.

Mar 20

Stats in big outside world: homeschoolers have better diets; are thinner, leaner
Review of concepts related to a Normal distribution

What is a normal distribution? (a bell-shaped probability density curve that can describe many natural phenomena)
What is the standard deviation rule? (percentages of observations in a bell-shaped curve that fall within 1, 2, and 3 standard deviations of the mean: 68%, 95%, and 99.7%)
How does the standard deviation rule (learned in the first part of this course) relate to our understanding of a normal distribution as a probability density curve? (68% of observations fall between +-1 sd of the mean becomes P(μ-σ < X < μ+σ) = .68)
What do we do if we are interested in probabilities that aren't related to the sd? (need a way to estimate the area under the curve)
What does it mean to standardize a normal value? (determine how many standard deviations it is away from the mean; (x-μ)/σ)
What is this standardized normal value called? (z-score, a value of the standard normal random variable Z)
What kind of values will a z-score take? (z-scores from about -3.5 to +3.5 correspond to >0 probabilities)
What is a normal table? (a table of z-scores listing the probability of obtaining a value less than, to the left of, that value)
Show a normal table with *.* in the rows and .** in the columns.

What is the probability of a normal random variable taking a value less than 1.33 standard deviations above the mean? ( .9082)
What are the two methods for solving a problem that asks for the probability of a value greater than the z-score? (1. use the symmetry of the table and solve for less than the negative of the value; 2. find the probability for less than the value and then subtract from 1)
How can you solve for probabilities between two values? (Find the probabilities below each value and subtract the smaller one from the larger)

Note that a sketch will often help you keep the calculations straight.

How do we find the corresponding number of standard deviations away from the mean when given a probability value describing an area of the curve above, below or between? (for below, find the probability value in the table and note the corresponding z-score; for above and between, use the symmetric or adds-to-1 properties of the normal distribution to transform the probability into a less-than value.)
What information do you need to calculate the probability of an observation greater than some value? (use mean and sd of distribution to convert value to z-score--compute probability; also need some assurance that using a normal distribution is appropriate)
How do we notate these transformations? (P(X > 13) = P(Z > 1.33) = P(Z < -1.33) = .0918)

Feel comfortable working from probability through z-score to observed value?

Besides proportion or percent, what other term is used to express probabilities in a normal distribution? (percentile; 25% of scores fall below (or at or below) the 25th percentile)
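The table lookups in this review can be checked with Python's standard library, where `NormalDist().cdf` plays the role of the normal table (area to the left of z) and `inv_cdf` reads the table backwards. A minimal sketch using the z = 1.33 example:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, sd 1

below = Z.cdf(1.33)                      # P(Z < 1.33), read straight from the table
above_by_symmetry = Z.cdf(-1.33)         # method 1: P(Z > 1.33) = P(Z < -1.33)
above_by_complement = 1 - Z.cdf(1.33)    # method 2: 1 - P(Z < 1.33)
between = Z.cdf(1.33) - Z.cdf(-1.33)     # P(-1.33 < Z < 1.33)
z_25th = Z.inv_cdf(0.25)                 # z-score at the 25th percentile

print(f"below {below:.4f}, above {above_by_symmetry:.4f}, between {between:.4f}")
print(f"25th percentile is at z = {z_25th:.2f}")
```

Both methods for the "greater than" probability give the same answer, which is a useful check when students work from a printed table.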

m&m data

look at variables collected and pick out those that may be Normally distributed
what properties would these variables have? (mean and median are similar....not skewed, bell-shaped distribution, no outliers)
have students work in groups to evaluate one of the identified variables

calculate mean and median
create histogram (or stem&leaf)
convert 3 values to z-scores

Mar 13

Stats in big outside world: homeschoolers have better diets; are thinner, leaner
Calculate mean in coins, dice, cards activity
Review of concepts related to continuous random variables

What is a continuous random variable? (a variable that can take on any value in a given range)
What is a probability density function? (a formula that can be used to compute the probabilities of a range of outcomes for a continuous random variable)
How is a probability density function different from a histogram displaying the probabilities of a discrete random variable? (total area under the curve = 1 vs. sum of heights = 1)
How do you use a probability density function to determine probabilities? (determine the appropriate area under the curve).
What is the probability of a particular single outcome? (zero, because there is no area associated with a single value, the exact value to many decimal places--there are an infinite number of values for any continuous random variable)
When does it matter if we write P(X<=3) vs. P(X<3)? Why? (Need to specify whether the value is included or not for a discrete variable; for a continuous variable, can specify up to the edge of the value because no area associated with a particular value)
What shape is a density curve? (any shape--only restriction is the area under the curve = 1)
How can we find a specified area under a probability density curve? (calculus, specifically integration; we will use tables and spreadsheet functions)
What is a normal distribution? (a bell-shaped probability density curve that can describe many natural phenomena)
What is the standard deviation rule? (percentages of observations in a bell-shaped curve that fall within 1, 2, and 3 standard deviations of the mean: 68%, 95%, and 99.7%)
How does the standard deviation rule (learned in the first part of this course) relate to our understanding of a normal distribution as a probability density curve? (68% of observations fall between +-1 sd of the mean becomes P(μ-σ < X < μ+σ) = .68)
What do we do if we are interested in probabilities that aren't related to the sd? (need a way to estimate the area under the curve)
What does it mean to standardize a normal value? (determine how many standard deviations it is away from the mean; (x-μ)/σ)
What is this standardized normal value called? (z-score, a value of the standard normal random variable Z)
What kind of values will a z-score take? (z-scores from about -3.5 to +3.5 correspond to >0 probabilities)
What is a normal table? (a table of z-scores listing the probability of obtaining a value less than, to the left of, that value)
Show a normal table with *.* in the rows and .** in the columns.

What is the probability of a normal random variable taking a value less than 1.33 standard deviations above the mean? ( .9082)
What are the two methods for solving a problem that asks for the probability of a value greater than the z-score? (1. use the symmetry of the table and solve for less than the negative of the value; 2. find the probability for less than the value and then subtract from 1)
How can you solve for probabilities between two values? (Find the probabilities below each value and subtract the smaller one from the larger)

Note that a sketch will often help you keep the calculations straight.

How do we find the corresponding number of standard deviations away from the mean when given a probability value describing an area of the curve above, below or between? (for below, find the probability value in the table and note the corresponding z-score; for above and between, use the symmetric or adds-to-1 properties of the normal distribution to transform the probability into a less-than value.)
What information do you need to calculate the probability of an observation greater than some value? (use mean and sd of distribution to convert value to z-score--compute probability; also need some assurance that using a normal distribution is appropriate)
How do we notate these transformations? (P(X > 13) = P(Z > 1.33) = P(Z < -1.33) = .0918)

Feel comfortable working from probability through z-score to observed value?

Besides proportion or percent, what other term is used to express probabilities in a normal distribution? (percentile; 25% of scores fall below (or at or below) the 25th percentile)

Mar 6

Stats in big outside world: homeschoolers have better diets; are thinner, leaner

RV Problems (see google doc)

(from Collaborative Statistics, Discrete Random Variables Homework)
Exercise 2. Suppose that you are offered the following “deal.” You roll a die. If you roll a 6, you win $10. If you roll a 4 or 5, you win $5. If you roll a 1, 2, or 3, you pay $6.

What are you ultimately interested in here (the value of the roll or the money you win)?
In words, define the Random Variable X.
List the values that X may take on: x=....($10, $5, -$6)
Construct a probability distribution.
Over the long run of playing this game, what are your expected average winnings per game? (this question foreshadows future learning of expected value--mean of the prob dist)
Based on numerical values, should you take the deal?
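
For the instructor's reference, the long-run average for this deal can be worked out directly from the probability distribution. A minimal sketch, assuming a fair six-sided die; the dictionary name is illustrative.

```python
from fractions import Fraction

# Winnings X in dollars and their probabilities, read off the deal:
# roll a 6 -> +10, roll a 4 or 5 -> +5, roll a 1, 2, or 3 -> -6
distribution = {
    10: Fraction(1, 6),
    5: Fraction(2, 6),
    -6: Fraction(3, 6),
}

# A valid probability distribution must sum to 1
total_probability = sum(distribution.values())

# Long-run average winnings per game: sum of value * probability
expected_winnings = sum(x * p for x, p in distribution.items())

print(total_probability, expected_winnings, float(expected_winnings))
```

A positive result favors taking the deal in the long run, even though half of the individual rolls lose money.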

Exercise 6. Suppose that the probability distribution for the number of years it takes to earn a Bachelor of Science (B.S.) degree is as shown in Bachelor degree year probdist.png.

In words, define the Random Variable X
What does it mean that the values 0, 1, and 2 are not included for X on the PDF?
What is the probability of completing a bachelors degree in at most 4 years?
Of those that take longer than 4 years, what is the probability of completing in more than 6 years?

Discussion of OLI content

What is another term for the mean of a random variable? (expected value)
OLI chooses not to use the term expected value when referring to the mean of a random variable. Why? (For some distributions the expected value is not a possible outcome, so how is it the most expected value)
Why is the mean of a random variable denoted as μX rather than X-bar? (mean of the population)
Describe how to calculate the mean of a random variable using the probability distribution.
How can we apply the concept of a probability dist to calculate the mean of a sample? (for each value in the sample, determine its prob of occurring, use the probdist to calc mean)
What is the standard deviation of a random variable and how is it denoted? (average deviation from the mean denoted σX)
Describe how to calculate it (subtract each value from the mean, square it, multiply by probability and add them all together, take the square root)
Why is the deviation squared in the calculation?
What is the variance of a random variable (the average squared deviation)
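
The calculation steps described above can be sketched as a short function. The example distribution (one roll of a fair six-sided die) is an assumption chosen for illustration, not an OLI example.

```python
from fractions import Fraction
from math import sqrt

def mean_var_sd(dist):
    """dist maps each value x to its probability P(X = x)."""
    # Mean: each value weighted by its probability
    mu = sum(x * p for x, p in dist.items())
    # Variance: squared deviations from the mean, weighted by probability
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    # Standard deviation: square root of the variance
    return mu, var, sqrt(var)

# Assumed example: one roll of a fair six-sided die
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu, var, sd = mean_var_sd(die)
print(mu, var, round(sd, 3))
```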

Transformations....

How does adding a constant affect the mean and standard deviation of a probability distribution? (mean + constant = new mean, sd is unchanged)
How does multiplying by a constant affect the mean and standard deviation of a probability distribution? (mean is multiplied by the constant, sd by its absolute value -- relocates and expands or contracts the distribution)
How does multiplying by a constant affect the variance? (constant^2 times the original variance)
How can you summarize both of these changes into one concept? (linear transformation a+bμ, b^2*σ^2)
What is the mean of a random variable that is the sum of two random variables? (meanX + meanY)
What is the standard deviation of a random variable that is the sum of two random variables? (sqrt of σX^2 + σY^2)
On what condition? (independence)
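
These rules can be checked exactly on a small distribution. A sketch under assumptions: X and Y are two independent rolls of a fair die, and the constants a = 10 and b = 3 are arbitrary; none of these values come from OLI.

```python
from fractions import Fraction
from itertools import product

def mean_var(dist):
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var

def transform(dist, a, b):
    """Distribution of a + b*X."""
    out = {}
    for x, p in dist.items():
        out[a + b * x] = out.get(a + b * x, 0) + p
    return out

def sum_independent(dist_x, dist_y):
    """Distribution of X + Y, valid only when X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
mu, var = mean_var(die)

a, b = 10, 3  # arbitrary constants for illustration
mu_t, var_t = mean_var(transform(die, a, b))
mu_s, var_s = mean_var(sum_independent(die, die))

print(mu_t == a + b * mu, var_t == b**2 * var)  # linear transformation rules
print(mu_s == mu + mu, var_s == var + var)      # sum of independent variables
```

Working with variances and taking the square root at the end mirrors the sqrt of σX^2 + σY^2 rule; standard deviations themselves do not add.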

Group the students in groups of 2-3. Have each group complete the Coins, Dice, Cards lab (station 5 in CasinoLab.pdf). Will need 4 coins, pair of dice, and deck of cards for each group.

16 ways to flip 4 coins, 1 each for 0 and 4 heads, 4 ways to have 1 and 3 heads, 6 ways to have 2 heads
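
The count of 16 can be confirmed by listing every sequence. A minimal sketch using only the standard library; names are illustrative.

```python
from collections import Counter
from itertools import product

# Every equally likely sequence of 4 flips, e.g. ('H', 'T', 'T', 'H')
flips = list(product("HT", repeat=4))

# Number of sequences for each possible count of heads
heads_counts = Counter(seq.count("H") for seq in flips)

print(len(flips))
for heads in sorted(heads_counts):
    print(heads, "heads:", heads_counts[heads], "ways")
```

Dividing each count by 16 gives the probability distribution for the number of heads, which groups can compare with their lab results.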

Feb 27

Stats in big outside world: homeschoolers have better diets; are thinner, leaner

Review of probability concepts -- discuss the meaning and nuance of each of the following:

What is the Law of Total Probability? (P(B)=P(A and B) + P(not A and B)=P(A)*P(B|A) + P(not A)*P(B|notA))
How can we understand the Law of Total Probability conceptually? (Here's how to talk about it: I want to calculate the probability of B, if A happens with P(A), then B occurs with probability P(B|A) and if A doesn't happen with P(not A), then B occurs with probability P(B|not A); also suggest a Venn diagram where A, not A describe the universe and B occurs within that universe)
How can we use gen mult rule (P(A and B)=P(A)*P(B|A)) and law of total prob to reformulate the equation for P(A|B)? (display the 3 equation derivation from the bottom of this OLI page)
When is it useful to use this formulation? (when you know more about the second event than the first.)
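
A numeric walk-through of the chain above, from the general multiplication rule through the Law of Total Probability to P(A|B). The three input probabilities are made-up values for illustration only.

```python
from fractions import Fraction

# Hypothetical inputs (not from OLI)
p_a = Fraction(3, 10)             # P(A)
p_b_given_a = Fraction(4, 5)      # P(B|A)
p_b_given_not_a = Fraction(1, 10) # P(B|not A)

p_not_a = 1 - p_a

# General multiplication rule, applied to each branch
p_a_and_b = p_a * p_b_given_a
p_not_a_and_b = p_not_a * p_b_given_not_a

# Law of Total Probability
p_b = p_a_and_b + p_not_a_and_b

# Reformulated conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(p_b, p_a_given_b)
```

Note that every input is conditioned on A, yet the result is conditioned on B, which is exactly the situation where this formulation is useful.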

Review of random variables -- discuss the meaning and nuance of each of the following:

What is a random variable? (a variable whose values are numerical results of a random experiment)
What are the two kinds of random variables that we care about? (discrete and continuous)
What characterizes a discrete random variable? (possible values are a list of distinct values)
What are examples of discrete random variables in disguise? (rounded values)
What are examples of discrete random variables that are better dealt with as continuous variables? (a variable with a lot of possible values--test scores)
How do we write the probability of a particular value of the random variable? (P(X=x))
What is a probability distribution? (The collection of values and probabilities associated with a particular random variable)
What properties must a probability distribution of a random variable fulfill? (1. each probability is between 0 and 1, inclusive, 2. the sum of the probabilities =1)
What tool can we use to summarize a probability distribution? (a table -- show and discuss image of table from OLI, probdist table.png)
How is a probability histogram different from the distribution histogram we talked about earlier (y axis represents probabilities)
Display obama election results probdist.png (explained here) - discuss how this graph is a probability distribution (random variable is results of state polls; y-axis is in percent not proportion; seems like it could add to 1; discrete altho a lot of values...10,000 simulated results)

Conditional probability lab

Calculate a selection of probabilities based on the contents of a regular bag of m&ms (about 40). Use lab directions at Connexion's Probability Topics: M&M Lab. Consider how to make this activity better. (simulations of more trials)

RV Problems

(from Collaborative Statistics, Discrete Random Variables Homework)
Exercise 2. Suppose that you are offered the following “deal.” You roll a die. If you roll a 6, you win $10. If you roll a 4 or 5, you win $5. If you roll a 1, 2, or 3, you pay $6.

What are you ultimately interested in here (the value of the roll or the money you win)?
In words, define the Random Variable X.
List the values that X may take on: x=....($10, $5, -$6)
Construct a probability distribution.
Over the long run of playing this game, what are your expected average winnings per game? (this question foreshadows future learning of expected value--mean of the prob dist)
Based on numerical values, should you take the deal?

Exercise 6. Suppose that the probability distribution for the number of years it takes to earn a Bachelor of Science (B.S.) degree is as shown in Bachelor degree year probdist.png.

In words, define the Random Variable X
What does it mean that the values 0, 1, and 2 are not included for X on the PDF?
What is the probability of completing a bachelors degree in at most 4 years?
Of those that take longer than 4 years, what is the probability of completing in more than 6 years?

Feb 20

Stats in big outside world: homeschoolers have better diets; are thinner, leaner

Review of conditional probability concepts -- discuss the meaning and nuance of each of the following:

When is the conditional probability formula, P(B|A), undefined and how should we think about that situation? (when P(A) = 0; can't find a probability of something given an impossible event)
How can we implement the complement rule with conditional probabilities? (most important to use complement ONLY when conditioned on same event P(B|A) = 1 - P(not B|A))
How can we use the conditional probability to check whether two events are independent? (independent if P(A|B) = P(A))
Can we use P(A|B) and the P(A|not B) to check for independence? Why or why not? (Yes, because if they are the same, then whether or not B happens is irrelevant, so the events are independent.)
How can we use the multiplication rule to check for independence? (If P(A and B), e.g., computed from a two-way table, equals P(A) * P(B), then events A and B are independent)

Review of probability concepts -- discuss the meaning and nuance of each of the following:

What is the General Multiplication Rule? (P(A and B)=P(A)*P(B|A))
Can the general rule be used with independent events? Why, what happens?
How can we understand the general multiplication rule conceptually? (solve conditional probability formula for P(A and B), or realize that second probability is dependent on the first--uses the conditional probability)
What is the Law of Total Probability? (P(B)=P(A and B) + P(not A and B)=P(A)*P(B|A) + P(not A)*P(B|notA))
How can we understand the Law of Total Probability conceptually? (Here's how to talk about it: I want to calculate the probability of B, if A happens with P(A), then B occurs with probability P(B|A) and if A doesn't happen with P(not A), then B occurs with probability P(B|not A); also suggest a Venn diagram where A, not A describe the universe and B occurs within that universe)
How can we use these two rules to reformulate the equation for P(A|B)? (display the 3 equation derivation from the bottom of this OLI page)
When is it useful to use this formulation? (when you know more about the second event than the first.)

Conditional probability lab

Calculate a selection of probabilities based on the contents of a regular bag of m&ms (about 40). Use lab directions at Connexion's Probability Topics: M&M Lab.

Feb 13: Snow Day

Feb 6

Stats in big outside world: homeschoolers have better diets; are thinner, leaner

Review of concepts in probability -- discuss the meaning and nuance of each of the following:

What does conditional probability mean? (the probability of a second event conditioned on, given, a prior event)
How do we interpret P(B|A)? (the likelihood that a chosen A is also B; probability of B given A)
If you had a two-way table of counts for A, not A and B, not B, how would you calculate P(A|B)?
What is the formula, using other probabilities, for calculating P(B|A)? (P(B|A) = P(A and B)/P(A))
How does this formula reduce to the same formula used in the calculation based on the two-way table? (denominators in both probability fractions are the same, and cancel out leaving the counts from the two-way table)
When is the conditional probability formula undefined and how should we think about that situation? (when P(A) = 0; can't find a probability of something given an impossible event)
How can we implement the complement rule with conditional probabilities? (most important to use complement ONLY when conditioned on same event P(B|A) = 1 - P(not B|A))
What does it mean for 2 events to be independent?
How can we use the conditional probability to check whether two events are independent? (independent if P(A|B) = P(A))
Can we use P(A|B) and the P(A|not B) to check for independence? Why or why not? (Yes, because if they are the same, then whether or not B happens is irrelevant, so the events are independent.)
How can we use the multiplication rule to check for independence? (If P(A and B), e.g., computed from a two-way table, equals P(A) * P(B), then events A and B are independent)
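
The two-way table questions above can be sketched with a small table of counts. The counts are invented for illustration; the point is that the count-based and formula-based versions of P(A|B) agree, and that the three independence checks give the same verdict.

```python
from fractions import Fraction

# Hypothetical two-way table of counts (rows: A / not A, columns: B / not B)
counts = {
    ("A", "B"): 20, ("A", "not B"): 30,
    ("not A", "B"): 40, ("not A", "not B"): 10,
}
total = sum(counts.values())
n_a = counts[("A", "B")] + counts[("A", "not B")]
n_b = counts[("A", "B")] + counts[("not A", "B")]
n_not_b = total - n_b

# P(A|B) straight from the table: restrict attention to the B column
p_a_given_b_counts = Fraction(counts[("A", "B")], n_b)

# P(A|B) from the formula: the shared denominator (total) cancels
p_a = Fraction(n_a, total)
p_b = Fraction(n_b, total)
p_a_and_b = Fraction(counts[("A", "B")], total)
p_a_given_b_formula = p_a_and_b / p_b

p_a_given_not_b = Fraction(counts[("A", "not B")], n_not_b)

# Three equivalent independence checks
checks = [
    p_a_given_b_counts == p_a,              # P(A|B) = P(A)?
    p_a_given_b_counts == p_a_given_not_b,  # P(A|B) = P(A|not B)?
    p_a_and_b == p_a * p_b,                 # P(A and B) = P(A) * P(B)?
]
print(p_a_given_b_counts, p_a_given_b_formula, checks)
```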

Group problem solving

Instruct the students to work each probability problem as presented. Then ask a student to present and explain the answer. Use the following problems from Collaborative Statistics' probability homework:

#1, answers--
 a. {G1, G2, G3, G4, G5, Y1, Y2, Y3} 
 b. 5/8
 c. 2/3
 d. 2/8
 e. 6/8
 f. No, P(G and E) is not equal to 0

#2, answers--
 a. skip
 b. 5/8 * 5/8 = 25/64
 c. 1-P(Y1 and Y2) = 1 - (3/8 * 3/8) = 55/64
 d. P(G1 and G2)/P(G1) = 25/64 / 5/8 = 5/8; of course it's equal to P(G2) because the events are independent
 e. yes, because P(G2|G1) = P(G2), so it doesn't matter whether or not G1 occurred

#4, answers--
 a.

	1	2	3	4	5	6
1	1,1	1,2	1,3	1,4	1,5	1,6
2	2,1	2,2	2,3	2,4	2,5	2,6
3	3,1	3,2	3,3	3,4	3,5	3,6
4	4,1	4,2	4,3	4,4	4,5	4,6
5	5,1	5,2	5,3	5,4	5,5	5,6
6	6,1	6,2	6,3	6,4	6,5	6,6

 b. P(A) = 2/6 * 3/6 = 6/36 = 1/6
 c. P(B) = 21/36
 d. P(A|B) = P(A and B)/P(B) = 3/36 / 21/36 = 3/21 = 1/7
 e. no, both A and B can occur, e.g., 3,4; P(A and B)=3/36
 f. no, knowing whether B occurred or not affects the probability of A. P(A|B) is not equal to P(A).
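
The #4 answers can be verified by enumerating the sample space. The exercise text is not reproduced in these notes, so the event definitions below are assumptions inferred from the answers: A is "a 3 or 4 on the first roll followed by an even number" and B is "the sum of the two rolls is at most 7".

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling two dice
sample_space = list(product(range(1, 7), repeat=2))

# Assumed event definitions (inferred from the answers, not quoted)
A = {(first, second) for first, second in sample_space
     if first in (3, 4) and second % 2 == 0}
B = {(first, second) for first, second in sample_space
     if first + second <= 7}

def prob(event):
    return Fraction(len(event), len(sample_space))

p_a, p_b, p_a_and_b = prob(A), prob(B), prob(A & B)
p_a_given_b = p_a_and_b / p_b

mutually_exclusive = p_a_and_b == 0
independent = p_a_and_b == p_a * p_b

print(p_a, p_b, p_a_and_b, p_a_given_b)
print(mutually_exclusive, independent)
```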

#8, answers--
 a. P(C and D) = P(C|D) * P(D) = .6 * .5 = .3
 b. no, because P(C and D) is greater than 0, they can both happen
 c. no 1) P(C|D) not equal to P(C); 2)P(C) * P(D) = .2 which is not equal to P(C and D) 3)we could calculate P(C|not D) to compare to P(C|D), suggest creating the two-way table.
 d. P(D|C) = P(D and C) / P(C) = .3 / .4 = .75

#9-12, answers--
  P(E|F) = 0
  P(J) = .3 because the events are independent. What happened with K doesn't matter.
  P(R) = P(Q and R)/P(Q) = .1/.4 = .25
  P(U and V) = 0 (U and V cannot both occur)
  P(U|V) = 0 (If V has occurred then U cannot occur)
  P(U or V) = .63

#19, answers--
 a. iii
 b. i
 c. iv
 d. ii

Jan 30

Discuss the meaning and nuance of each of the following rules

Prep for Rule 5:

Ask each student to provide an example of two independent and two dependent events. For each, specify whether the events are disjoint (events cannot both occur) or not disjoint (events could both occur).

Ask 1-2 students to describe two events that are disjoint -- discuss whether they are independent or dependent. (always dependent)

Rule 5: The Multiplication Rule for Independent Events: If A and B are two independent events, then P(A AND B) = P(A) * P(B) (Must be careful to use this only for independent events; discuss how joint probability would be affected for dependent events identified earlier.)

give example of how it works. Pick random person....P(Female) = .5, P(dark brown eyes) = 25% (according to amer ophalm assoc), so would expect 1/4 of 1/2 (=1/8) to be female with dark brown eyes. But if dependent, then might be another outcome

What does it mean to determine the P(at least one of). What is the difficulty of solving this kind of problem (prob of 1 + prob of 2 + ... + prob of all)? Is there a way around this? (complement rule)

In the statement "what is the probability of at least 1 of 10 monkeys scoring in the video game (by chance)", what is the complement? (none of the monkeys score)

The probability that a monkey will score is .1. How do we use the complement rule to calculate the prob of at least 1 monkey scoring? (1-.9^10 = .65)
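
To show why the complement rule is the easier route, the sketch below computes the answer both ways. It assumes the 10 monkeys score independently of one another; the "long way" uses binomial counts via `math.comb`, a technique not covered in these notes.

```python
from math import comb

n, p = 10, 0.1  # 10 monkeys, each scores with probability .1

# Complement rule: 1 - P(none of the monkeys score)
p_at_least_one = 1 - (1 - p) ** n

# The long way: P(exactly 1) + P(exactly 2) + ... + P(all 10)
p_long_way = sum(
    comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(1, n + 1)
)

print(round(p_at_least_one, 2), round(p_long_way, 2))
```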

Rule 6: The General Addition Rule: For any two events A and B, P(A or B) = P(A) + P(B) - P(A and B) (Is there anything that you need to be aware of when using this formula? Be careful to only multiply P(A) and P(B) to get P(A and B) when A and B are independent. What if A and B are dependent? Need to obtain P(A and B) by observation.)

Random fun

Have each student flip a coin 14 times and record the outcome. Ask each student to read out their sequence and tell how many H and how many T. Ask the students to compare their result with getting exactly HTHT..., or all H's, or all T's.

Which is more/less likely?

What is the probability of any one of these sequences? (all the same - .5^14)

What do people do when they want something to appear random? Why do you think people think "spread evenly" when they want random?
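
A short calculation can back up the point that every exact sequence of 14 flips is equally likely, while the number of heads is not. A minimal sketch assuming a fair coin; `math.comb` counts how many sequences share a given number of heads.

```python
from fractions import Fraction
from math import comb

n = 14

# Any one exact sequence (HTHT..., all H's, or a student's own result)
p_specific_sequence = Fraction(1, 2) ** n

# Counting heads is a different question: many sequences share a count
p_seven_heads = comb(n, 7) * p_specific_sequence
p_all_heads = comb(n, n) * p_specific_sequence

print(p_specific_sequence, float(p_seven_heads), p_all_heads)
```

The contrast helps explain why an "evenly spread" result feels more random: it belongs to a large group of sequences, even though no single sequence in that group is more likely than all heads.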

(optional) Random explained on Numb3rs, also

"Burn Rate" (season 3), at 24:30 or closer at 27:30.

"Traffic" (season 3), at 1:30

"Spree" (season 3), not sure where

Are we coins? from NPRs Radiolab

Discuss the concept of streaks in sports (or in gambling). How can we understand streaks in the context of an event having a given probability of occurring.

Play this radio essay, "Are We Coins" from NPR's Radiolab, about probability and simulations and randomness.

Jan 23

Finish T/F test simulation instructions
Stats in the big wide world

Discuss the meaning and nuance of each of the following rules

Prep for rule 4

What does 'OR' mean in probability?
What does disjoint mean? (2 events that cannot both occur at the same time) How else might it be termed? (mutually exclusive)
Have each student provide an example of two disjoint events, and two non-disjoint events.

Rule 4: The Addition Rule for Disjoint Events: If A and B are two disjoint events, then P(A or B) = P(A) + P(B) (why do you think this rule only applies to disjoint events? draw a picture of non-disjoint events, overlapping circles, to motivate understanding)
Prep for Rule 5:

What does P(A and B) mean? Describe a Venn diagram that demonstrates this statement.
What is P(A and B) if A & B are disjoint? (0)
What does it mean for two events to be independent? (Two events A and B are said to be independent if knowing whether one event has occurred does not affect the probability that the other event occurs.)
What is the converse of independent? (dependent)
Motivating example: P(person saw the space babies commercial during 2013 super bowl) and P(person buys a Kia Sorento) - dependent and not disjoint

Random fun

Ask the students to spread themselves into a random pattern.
While standing in their spots, watch the "what's random" segment from the pilot of Numb3rs, (at 17:50, play on netflix).

Random explained on Numb3rs, also

"Burn Rate" (season 3), at 24:30 or closer at 27:30.
"Traffic" (season 3), at 1:30
"Spree" (season 3), not sure where

Jan 16

Reminder to think up a random experiment/game to share.
Stats in the big wide world

Time magazine, size of bubbles not representative of number
Gender wage gap

Analysis results for PLC census?

Review of concepts in probability

Ask the students to spread themselves into a random pattern.
While standing in their spots, watch the "too perfect" segment from "Burn Rate".
Have each student flip a coin 14 times and record the outcome. Ask each student to read out their sequence and tell how many H and how many T. Ask the students to compare their result with getting exactly HTHT..., or all H's, or all T's.
Which is more/less likely?
What is the probability of any one of these sequences? (all the same - .5^14)
How does this compare with what Charlie is telling us?
What do people do when they want something to appear random? Why do you think people think "spread evenly" when they want random?
What is relative frequency? (empirical probability; measure the proportion of times an event occurs in a large number of trials)
What did people calculate as the relative frequency of having 2 or more of the same birthdays at a party of 30 people? (one .7, two .9, one .5)
What does it mean for outcomes to be equally likely? In what kind of situations does this occur? What're the outcomes in a basketball free throw? (success failure) Are they equally likely? (depends on player's skill)
Have each student choose a marble from a bag containing different sized marbles. Does each marble have the same chance of being chosen?
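
The birthday relative frequencies the students reported can be compared with a larger simulation. A sketch under simplifying assumptions: 365 equally likely birthdays, no leap years, and a fixed random seed so the run is repeatable. The exact value is included for comparison and is computed as 1 minus the probability that everyone has a different birthday.

```python
import random

def shared_birthday_frequency(group_size=30, trials=10_000, seed=1):
    """Relative frequency of at least one shared birthday."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(group_size)]
        if len(set(birthdays)) < group_size:  # a repeat occurred
            hits += 1
    return hits / trials

def shared_birthday_exact(group_size=30):
    """1 - P(all birthdays different)."""
    p_all_different = 1.0
    for k in range(group_size):
        p_all_different *= (365 - k) / 365
    return 1 - p_all_different

simulated = shared_birthday_frequency()
exact = shared_birthday_exact()
print(round(simulated, 3), round(exact, 3))
```

Rerunning with a small number of trials (say 10) shows how widely the relative frequency can swing, which fits the spread in the students' results.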

Discuss the meaning and nuance of each of the following rules

Rule 1: For any event A, 0 ≤ P(A) ≤ 1 (if a probability is greater than 1, something is wrong)
Rule 2: P(S)=1; that is, the sum of the probabilities of all possible outcomes is 1. (we can figure out missing probabilities; if the sum of the probabilities in the sample space is greater than 1, something is wrong)
Rule 3: The Complement Rule: P(not A) = 1 - P(A) (sometimes it's a lot easier to calculate the probability that something doesn't occur than that it does)
What is the complement of the birthday problem? (everyone in the group has a different birthday)
Prep for rule 4
What does 'OR' mean in probability?
What does disjoint mean? (2 events that cannot both occur at the same time) How else might it be termed? (mutually exclusive)
Have each student provide an example of two disjoint events, and two non-disjoint events.
Rule 4: The Addition Rule for Disjoint Events: If A and B are two disjoint events, then P(A or B) = P(A) + P(B) (why do you think this rule only applies to disjoint events? draw a picture of non-disjoint events, overlapping circles, to motivate understanding)

Simulating guessing on a true-false test

See instructions for true-false test activity here.

Jan 9

Stats in the big wide world

Time magazine infographic on % of catholics worldwide, AAFT infographic about housing

Discuss data collection for surveys.

Experience collecting data

Plans for data analysis?

Review of concepts in probability

How intuitive are we at guessing probability? (display BD response graph.png)

What is a random experiment? (an experiment with an unknown outcome)

What is an outcome? (the situation that results from an experiment)

What does the capital letter S stand for? (sample space -- enumeration of possible outcomes)

What's an event? (a statement describing a collection of outcomes in the sample space)

What's the probability of an event? (a number that tells us how likely an event is to occur)

How are probabilities expressed? (as proportions)

Have each student describe a random experiment, identify its sample space, and name a relevant event (give them 2+ min to prepare). Decide whether order matters in the outcomes. Can we calculate the probability of the event? Why or why not? (discuss connection to ideas of relative frequency and equally likely).

What is relative frequency? (empirical probability; measure the proportion of times an event occurs in a large number of trials)

What did people calculate as the relative frequency of having 2 or more of the same birthdays at a party of 30 people? (one .7, two .9, one .5)

What does it mean for outcomes to be equally likely? In what kind of situations does this occur? What're the outcomes in a basketball free throw? (success failure) Are they equally likely? (depends on player's skill)

Have each student choose a marble from a bag containing different sized marbles. Does each marble have the same chance of being chosen?

Data simulation: Counting successes

Use the activity on p. 220 at counting successes. Rather than require use of a random number table, have the students design their own random number generator (encourage use of a spreadsheet program or a TI-83+ calculator).

Discuss questions b.i. and b.ii. as a group

Dec 19

Conduct survey studies

Shopping center soda aisle (Ethan, Cammy)

PLC census on post-PLC plans (Margaret, Bobby)

PC vs Mac among PLCers (Baird, Ashlin)

Dec 12

Stats in the big wide world

Time magazine - infographic that uses improperly scaled images: % of male vs. female college students in different countries

Continue with planning survey studies

Dec 5

Stats in the big wide world

wealth distribution

Turning to a discussion of surveys.

What is an open question? Why is this a problem in a survey?
What needs to be considered when writing a closed question? (all options are understandable and represented; offers an option to not say, or other)
What is meant by unbalanced response options? (more +/- responses, no middle choice)
What are leading questions? example?
Why is the order of questions important?
How is the idea of randomized response used to obtain honest responses to a sensitive question? (coin flip)
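
One common version of the coin-flip idea can be simulated to show how the true rate is recovered. This is a sketch of a single assumed design, not necessarily the one OLI describes: each respondent flips a coin in private, answers truthfully on heads, and answers "yes" regardless on tails. The true rate of 0.2 is made up.

```python
import random

def simulate_survey(true_rate, n, seed=0):
    """Return the observed proportion of 'yes' answers."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        has_trait = rng.random() < true_rate
        heads = rng.random() < 0.5
        # Heads: honest answer. Tails: forced "yes".
        answer_yes = has_trait if heads else True
        yes += answer_yes
    return yes / n

def estimate_true_rate(observed_yes_rate):
    # P(yes) = 0.5 * p + 0.5, so p = 2 * P(yes) - 1
    return 2 * observed_yes_rate - 1

observed = simulate_survey(true_rate=0.2, n=100_000)
print(round(observed, 3), round(estimate_true_rate(observed), 3))
```

No individual "yes" reveals anything, because the interviewer cannot tell a forced answer from an honest one, yet the group-level rate is still estimable.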

Design a survey study

Groups collaborate to design a two-question survey of week-day afternoon Princeton Shopping Center patrons.

Create a sampling plan, consider the following in your plan:

minimize bias
at least 15 participants, at least 10% of patrons at any given time
include an element of random selection, if possible

Develop one-two questions to ask each patron.
You will implement your data collection plan for statistics next week

Nov 21

Review terminology in sampling and observational study design

Sampling

What's the purpose of creating a sample? (want to know something about a population, but it's too big to study all individuals. can do study on a sample and then make inference to the population) - show The big picture.png
What is the most important overall characteristic of a sample? (representative of population)
What problem does a sample have if it systematically under- or over- estimates the values of a variable? (bias)
Sampling methods (discuss how each works, examples, problems):

volunteer sample -- (study participants volunteer to be in the study; guaranteed to be biased)
convenience sample -- (study participants chosen because right time/place for researcher; susceptible to bias because certain types of individuals are more likely to be selected than others)
narrowly defined sample -- (study participants chosen from defined subgroup of population; subgroup may be systematically different from population)
systematic sample -- (study participants chosen based on non-random, systematic method; predetermines who can participate, not as safe as random sampling)
simple random sample -- (study participants chosen at random from the population; volunteer response, that is, non-response, can be a problem when we can't make individuals participate)

Including random selection in a sampling plan -- probability sampling plan (discuss how each works, examples, benefits):

Simple random sample (SRS)
Cluster sampling (include all members of randomly selected intact groups, families, classes)
Stratified sampling (choose a SRS within groups, strata; ensures representativeness for chosen factor)
Multi-stage sampling (stratify into successively smaller groups)
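
The four probability sampling plans above can be sketched in code to make the differences concrete. Everything here is hypothetical: the population of 12 classes of 25 students, the field names, and the sample sizes. The multi-stage version shown picks random clusters and then an SRS within each, which is one common form of the plan.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 classes (clusters) of 25 students,
# each student tagged with a grade level (stratum)
population = [
    {"id": c * 25 + i, "class": c, "grade": 9 + c % 4}
    for c in range(12)
    for i in range(25)
]

def simple_random_sample(pop, n):
    return rng.sample(pop, n)

def stratified_sample(pop, key, n_per_stratum):
    # SRS within every stratum
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    return [p for group in strata.values()
            for p in rng.sample(group, n_per_stratum)]

def cluster_sample(pop, key, n_clusters):
    # Everyone in a random selection of intact groups
    clusters = sorted({person[key] for person in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [p for p in pop if p[key] in chosen]

def multistage_sample(pop, key, n_clusters, n_per_cluster):
    # Random clusters first, then an SRS inside each chosen cluster
    clusters = sorted({person[key] for person in pop})
    sample = []
    for c in rng.sample(clusters, n_clusters):
        members = [p for p in pop if p[key] == c]
        sample.extend(rng.sample(members, n_per_cluster))
    return sample

srs = simple_random_sample(population, 20)
strat = stratified_sample(population, "grade", 5)
clus = cluster_sample(population, "class", 3)
multi = multistage_sample(population, "class", 3, 5)
print(len(srs), len(strat), len(clus), len(multi))
```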

Does sample size matter? Give an example where it does and where it doesn't (better rep of pop vs. pilot, see if something might be useful)
What is the trade-off when considering a small vs. large sample size? (generalizability and time/money)

Study Design

What is study design?
What designs are presented in OLI Statistics? (observational study, survey, experiment)
Ask students for example of each kind of study.

Some examples for discussion

In a large midwestern university with 30 different departments, the university is considering eliminating standardized scores from their admission requirements. The university wants to find out whether the students agree with this plan. They decide to randomly select 100 students from each department, send them a survey, and follow up with a phone call if they do not return the survey within a week. What kind of sampling plan did they use?

(a) Stratified random sampling
(b) Simple random sampling
(c) Cluster sampling
(d) Multi-stage sampling

On October 20, 1993, the San Francisco Chronicle reported on a survey of top high-school students in the U.S. According to the survey: "Cheating is pervasive. Nearly 90 percent admitted some dishonesty, such as copying someone’s homework or cheating on an exam. The survey was sent last spring to 5,000 of the nearly 700,000 high achievers included in the 1993 edition of Who's Who Among American High School Students. The results were based on the 1,957 completed surveys that were returned."

Is this survey representative of all teenagers? What is the population represented in this survey?

In a study of Wikipedia editor behavior, the researchers randomly selected 22 of the 40 most active WikiProjects. 125 editors were then randomly selected from the group of editors participating in these 22 projects. What kind of sampling plan did they use?

(a) Stratified random sampling
(b) Simple random sampling
(c) Cluster sampling
(d) Multi-stage sampling

A radio talk show invites listeners to enter a dispute about a proposed salary increase for city council members. The host says, "What annual salary do you think council members should get? Call us with your number." In all, 958 people call. The mean of all the salaries they suggest is $9,740 per year, and the standard deviation of the responses is $1,125. Which of the following statements applies to this situation? Should the results be used to inform the debate on council members' salaries? Why or why not?

The Democratic National Committee would like to collect the opinions of members of local Democratic groups on a few issues. There are many thousands of local Democratic groups, some large and some small. The researchers hired to do the study decide to focus on groups with over 1000 members, because they want the opinions of members in well-established groups. Of all of the groups with 1000+ members, thirty groups are randomly selected. The survey is sent to all of the members in each of these groups. What kind of sampling plan did they use?

(a) Stratified random sampling
(b) Simple random sampling
(c) Cluster sampling
(d) Multi-stage sampling

Suppose two researchers wanted to determine if aspirin reduced the chance of a heart attack. Researcher 1 studied the medical records of 500 patients. For each patient, he recorded whether the person took aspirin every day and if the person had ever had a heart attack. Then he reported the percentage of heart attacks for the patients who took aspirin every day and for those who did not take aspirin every day.

Researcher 2 also studied 500 people. He randomly assigned half of the patients to take aspirin every day and the other half to take a placebo every day. After a certain length of time, he reported the percentage of heart attacks for the patients who took aspirin every day and for those who did not take aspirin every day. Suppose that both researchers found that there is a statistically significant difference in the heart attack rates for the aspirin users and the non-aspirin users and that aspirin users had a lower rate of heart attacks.
What is the design of each study? Can researcher 1 conclude that aspirin caused the reduction in rate of heart attacks? Why or why not?

Nov 14

Statistics in the big wide world

measuring the tallest building in the US

continuation of linear relationships (see Nov 7)

review of lurking variables and causation (see Nov 7)

ID variables (cat vs. quant; expl vs. resp) for two variable relationships

Next time: review sampling and design a survey-based research study

Nov 7

Statistics in the big wide world

Freakonomics experiment
Dartmouth ending credit for AP - 90% of AP Psychology test-takers who scored a 5 failed a condensed final exam from the Dartmouth Psych class.

Have students work in groups to identify a research question with two variables

ID variables as quantitative or categorical
Assign to explanatory and response
Suggest methods for analysis

Topics from linear relationships and causation
- What is a prediction tool for two variables that are linearly related (linear regression equation)
- What is regression? (An analysis tool that can predict the value of the response variable given a value for the explanatory variable.)
- What sort of relationship must be present to use linear regression? (linear....show slides for correlations with curvilinear rel. and outlier influence)
- How is the linear regression equation calculated (least squares computation; find the line that has the smallest sum of squared vertical deviations--display Least squares concept.png)
- History of term regression: originally named by Francis Galton, cousin of Charles Darwin, to describe the phenomenon that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean)
- What is the formula for the linear regression equation? (Y = a + bX)
- (show slide of best fit line # boats and manatee deaths) What would we predict in manatee deaths if we were to limit power boats to 500,000 (y-hat = .125(500) - 41.4 = 62.5 - 41.4 = 21.1......about 21 deaths).
- Is it fair to make this conclusion? (probably not, it's not clear that it's a causal relationship)
- For this graph, how does the number of manatee deaths vary with the number of power boats (for each increase of one [thousand] boats the manatee deaths increases by .125)
- (show xkcd extrapolating) What is wrong with this logic?
- What is extrapolation?
- Read warning statement (p.5) about predicting future results from Vanguard annual report. What are they warning people not to do? (extrapolate)
- (Display olympics.ods.) Consider what the predicted Olympic 1500 m time would be for more recent Olympics (suggest using the equation -- use a calculator to determine the predicted time). What's the issue with this prediction? (actual times: 2008--3:33.11, 2012--3:34.08)
- predicting height of boys vs age....will level off
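A minimal Python sketch of the regression items above. The slope (0.125) and intercept (-41.4) are the values quoted in the manatee item; the function names and the four check points are assumptions made up for illustration, not data from the slide.

```python
# Fitted line quoted in the notes: y-hat = 0.125 * x - 41.4,
# where x is power boats (in thousands) and y is manatee deaths.
SLOPE = 0.125
INTERCEPT = -41.4


def predict_deaths(boats_thousands):
    """Predicted manatee deaths for a given number of boats (in thousands)."""
    return INTERCEPT + SLOPE * boats_thousands


def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the
    sum of squared vertical deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Prediction for 500 thousand boats, as in the notes.
print(round(predict_deaths(500), 1))

# Made-up points lying exactly on y = 1 + 2x, to check the formulas.
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Plugging in a boat count far outside the range of the observed data would be extrapolation, which is the concern raised in the xkcd and Olympics items.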
Moving on to causation & lurking variables:
- (Display xkcd correlation) What is the principle that OLI lists on every page of this section? (Association does not imply causation!)
- Examples of lurking variables in OLI. Discuss the types of variables included and name the lurking variable in each -- show graphs
  - firefighters related to fire damage (lurking variable is seriousness of fire)
  - nationality predicts SAT score (lurking variable is educational level)
  - amount of light at night related to nearsightedness in children (parent's eyesight)
  - death rates for particular hospitals (severity of illness)
  - % taking the SAT in a state is related to median math score (prevalence of ACT or SAT in a state)
- What is Simpson's paradox? (adding a lurking variable causes us to rethink the direction of a relationship)
- Anyone have an example of Simpson's paradox?
- Other examples
  - Amt of salt on the road is related to number of accidents (really obvious that relationship is not causal)
  - Discuss issues with the Newsweek Back Story: "Can you cheat death". What is the article suggesting? Read some of the factors listed including "You have less than 12 years of education" and discuss which are not likely causal.
  - Discuss Parade article about better outcomes in hospitals on a weekday than on a weekend
- Any other questions?
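One way to make the Simpson's paradox item concrete is a small numeric sketch in the spirit of the hospital death rate example. The counts below are hypothetical, chosen only to produce the reversal; they are not from OLI or from any real hospital.

```python
# Hypothetical (deaths, patients) counts by hospital and patient condition.
data = {
    "A": {"good": (6, 600), "poor": (57, 1500)},
    "B": {"good": (8, 600), "poor": (8, 200)},
}


def rate(deaths, patients):
    """Death rate as a proportion."""
    return deaths / patients


def overall_rate(groups):
    """Death rate ignoring the lurking variable (patient condition)."""
    deaths = sum(d for d, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return deaths / patients


for hospital, groups in data.items():
    print(hospital, "overall:", round(overall_rate(groups), 3))
    for condition, (deaths, patients) in groups.items():
        print("  ", condition, round(rate(deaths, patients), 3))
```

With these made-up counts, hospital A has the higher overall death rate but the lower rate within each condition group, because A treats far more patients in poor condition.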

Oct 31

Statistics in the big wide world

Groups present results: data displays of EDA on m&m data

Review

What is a variable? (something that can take on different values)

What is a statistic? (a quantity calculated from the sample data)

Topics from scatterplots and correlation

When is it appropriate to graph data using a scatterplot?

Which variable should go on the x-axis? (explanatory, if there's a clear distinction)

When interpreting a scatterplot, what elements do we look at? (Pattern: direction, form, strength; Deviations: outliers) -- show OLI scatterplot 2 of 5

What is a correlation? (measure of strength and direction of linear relationship between two quant variables)

What does a correlation look like?

Relating height and weight

(display height.ods) what's the form of this relationship?

is it appropriate to report a correlation (yes, for m/f separately)

will the m/f correlations be similar or different (ask each for a prediction -- check that not mixing up slope with correlation)

When is it inappropriate to use a correlation as a measure of strength? (when the relationship is not linear -- need to LOOK at the data)

What are the units of measure for a correlation? (unitless)

How do we handle changing the units for one of the variables?

How can outliers affect a correlation? (single points can substantially strengthen or weaken it)

OLI included lots of tools to help you learn to interpret a correlation. Which helped you the most?
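The questions above about units can be checked with a short sketch. The height and weight values are invented for illustration (they are not the contents of height.ods), and `correlation` is a hypothetical helper written out by hand so the formula is visible.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up heights (inches) and weights (pounds).
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 120, 140, 155, 160, 180]

# The same measurements expressed in centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_imperial = correlation(heights_in, weights_lb)
r_metric = correlation(heights_cm, weights_kg)
print(round(r_imperial, 4), round(r_metric, 4))
```

Both calls give the same r (up to floating-point rounding): the correlation has no units, so rescaling either variable leaves it unchanged.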

Relating gestation period and longevity

assign explanatory and response (actually could go either way)

(display animals.ods) what's the effect of the elephant? (strengthens correlation)

should we drop the elephant (outlier) from the dataset or not? If we were the researcher trying to explain this relationship, what would we do? (consider getting more really big animals, such as a whale, into the dataset)
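The elephant's effect can be sketched by computing r with and without one extreme point. All values below are made up for illustration and are not the contents of animals.ods; the last pair simply plays the role of an elephant-like outlier.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical gestation (days) and longevity (years) for six animals.
gestation = [30, 60, 100, 150, 200, 250]
longevity = [8, 5, 12, 7, 15, 10]

# One extreme, elephant-like point (also hypothetical).
outlier_gestation, outlier_longevity = 645, 40

r_without = correlation(gestation, longevity)
r_with = correlation(gestation + [outlier_gestation],
                     longevity + [outlier_longevity])
print("without outlier:", round(r_without, 2))
print("with outlier:   ", round(r_with, 2))
```

In this made-up example a single point that follows the overall direction but sits far from the rest raises r noticeably, which is why the notes suggest collecting more very large animals rather than simply dropping the point.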

Any other questions?

Oct 24

Statistics in the big wide world

Explore m&m data

have students pair up or work independently to address one of the questions we came up with about the data (e.g., relationship between dudes and dots, distribution of colors)

Oct 17

Use of graphs to communicate statistics....if any students have brought in a graph

Aesthetics: is the graph pleasing to look at?

Communication: what does the graph tell us about the data? Could a table be used just as effectively?

Other possibilities for data display: other graph choices? table?

Possibilities for misinterpretation: what does the graph assume about the reader? Does the graph distort the data in any way?

Debrief on m&m data collections

Discuss results of M&M fun pack variable definition, measurement, and measurement process.

What did you learn? (finished last time)

Identify each variable's type. (finished last time)

What questions do you have about the variables? (add them to a separate sheet in the spreadsheet file)

Ask for ideas about what to do next (data cleaning, exploratory data analysis)

Issues with graphs

Pie charts

Mercer Co. stimulus funds (Feb 2009)

too many slices (ideas for a better graph?)

funky orientation

what is the "Multiple" category

3D angled pie chart

distorts sizes

pie chart vs. bar chart

wp opinion

Bad graphs.ppt

Oct 10

Discuss concepts in EDA:

mean, average, median

histogram (show example histograms on ideas page, discuss/demonstrate bin labeling)

stemplot (show example two-sided stemplot)

skewness (any trouble understanding/remembering?)

boxplot (show example boxplot on ideas page, any issues?)

1969 draft lottery (2:40), video of news report, wikipedia graph

quartiles (ask for example use -- collegeboard.com range of middle 50% of SAT scores by measure)

standard deviation (what is this? why don't we just sum up the deviations, what about summing absolute value of deviations)
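The standard deviation questions above can be explored with a few lines of Python. The six data values are made up for illustration; the variable names are assumptions, not anything from OLI.

```python
from math import sqrt

# Made-up data values.
data = [2, 4, 4, 5, 7, 8]
n = len(data)
mean = sum(data) / n

# Deviations from the mean: positives and negatives cancel out.
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)

# Alternative 1: average the absolute deviations.
mean_abs_deviation = sum(abs(d) for d in deviations) / n

# Alternative 2 (the usual choice): square, average over n - 1, take the root.
sample_sd = sqrt(sum(d ** 2 for d in deviations) / (n - 1))

print("sum of deviations:", sum_of_deviations)
print("mean absolute deviation:", round(mean_abs_deviation, 2))
print("sample standard deviation:", round(sample_sd, 2))
```

The raw deviations always sum to zero, so they cannot measure spread on their own; taking absolute values or squaring removes the cancellation.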

Oct 3

Ask how people are making out with OLI work

Statistics in the big outside world

Anyone have something to share?

Time magazine: Energy graphs, dense and informative

Continue/finish with m&m data collection

Sept 26

Discuss how people are making out with OLI work.

Statistics in the big outside world

Anyone have something to share?

Infographic: How the Primetime Emmys were consumed

Have students plan and implement m&m data collection for the variables identified at the last meeting.

Sept 18

Discuss where people stand with getting started with OLI statistics

Quizzes - not available in free edition

Ask for people to pay attention to the use of statistics in daily life...statistics in the big outside world

Consider setting up Notebook/Journal

Take notes from OLI work that you do

Record examples of statistics you come across in daily life:

2 Sep 2009: I heard an interesting statistic earlier today on the BBC morning news: "a life is lost to suicide every 30 seconds" (BBC News, 2/9/09, 14:15). I wonder how big this number is annually: 365 days x 24 hours per day x 60 min per hour x 2 half-min per min = 1,051,200 half-minutes. That's about 1 million suicides per year, let's assume globally, although the news reporter didn't say. Does this make sense?

Estimated world population: 6.8 billion, CIA World Factbook, estimated for July 2009, accessed 2 Sept 2009.

Estimated death rate: 8.2 deaths/1,000 population, CIA World Factbook, 2009 est., accessed 2 Sept 2009.

Estimated current annual deaths: 55.6 million

OK, compared to 55.6 million total deaths per year, I can believe that 1 million of these deaths could be due to suicide. Interesting that it seems like a smaller number when compared with the total than when cited on a per 30-second basis.

Discuss concept of variables, observations, dataset
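The back-of-the-envelope check in the journal entry above can be reproduced in a few lines. The inputs are the rounded figures quoted in the entry (6.8 billion people, 8.2 deaths per 1,000); with those rounded inputs the annual death total comes out near 55.8 million, slightly above the 55.6 million recorded, presumably because of rounding. The variable names are assumptions for illustration.

```python
# One death every 30 seconds, counted over a 365-day year.
half_minutes_per_year = 365 * 24 * 60 * 2

# Rounded figures quoted in the journal entry (CIA World Factbook, 2009 est.).
world_population = 6.8e9
death_rate_per_1000 = 8.2

# Estimated total deaths per year from all causes.
annual_deaths = world_population * death_rate_per_1000 / 1000

# Share of all deaths implied by the "every 30 seconds" statistic.
share = half_minutes_per_year / annual_deaths

print("deaths implied by the 30-second rate:", half_minutes_per_year)
print("estimated annual deaths (millions):", round(annual_deaths / 1e6, 1))
print("implied share of all deaths (%):", round(share * 100, 1))
```

Expressing the same count as a share of all deaths, rather than as a rate per 30 seconds, is what makes it feel smaller.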

Begin M&M data collection

Equipment needed:

50+ packs of fun-size M&Ms

Scale and/or balance

Internet access to Google spreadsheets

Procedure:

What variables related to fun size packs of M&Ms might be interesting to study? Brainstorm a list. (Record list in text document.)

Discuss idea of defining a variable. Use "homelessness" as an example to discuss the issues of definition.

Define each M&M variable. (Record definition in text document.)

Discuss idea of measuring a variable. Can all variables be measured equally well? Ask for examples.

Define measurement method for each M&M variable. (Record measurement method in text document).

Discuss idea of measurement process, for example census, civilian deaths in Iraq.

Collaboratively design a process for measuring and recording the identified M&M variables. (List the steps in a text document.)

Implement the measurement process as designed.

Sept 12

Introductions

What do you expect to learn about statistics?

Review structure of the course (see Statistics doc)

Review goals and pre-reqs

Spend some time looking at the OLI course

Discuss stats in big wide world (example growth of Learning Co. graph and Time magazine...in 1980 European men were 11 cm taller on average as compared to 1871)

Get commitment from students...will create course and send you invitation to register.

During class next week we will do M&M data collection

Plans from 2012-13 --------------------

Nov 20

Statistics in the big wide world
Understanding conditional percents
EDA on m&m data

students present analyses completed independently; interpret
identify one question related to the m&m data collected.
perform the necessary data analysis to answer the question.

Nov 14 - independent work on OLI...to get them moving into next section.

Nov 6

Online work

Status
Concept review

Discuss results of M&M fun pack variable definition, measurement, and measurement process.

What did you learn?
Identify each variable's type.
What questions do you have about the variables? (add them to a separate sheet in the spreadsheet file)
Ask for ideas about what to do next (data cleaning, exploratory data analysis).

M&M Fun pack weight distribution activity

Fun Size Bags of M&Ms are sold as a part of a larger package. According to the Fair Packaging and Labeling Act, the law does not require each Fun Size Bag to carry a label weight; only the larger package must contain a weight of the contents in the package. If we wanted to sell Fun Size Bags of M&Ms separately, we would need to determine a label weight to place on each individual bag. Can we use our data to determine what weight to use as a label?

Decide which weight measurement to use to model weight distribution of fun size bags
Create a graph and descriptive statistics for package weight
See the statement from the Fair Packaging and Labeling Act to inform decision as to what weight to put on the label.
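A minimal sketch of the descriptive statistics step, using Python's standard `statistics` module. The bag weights below are hypothetical placeholders, not the class's measurements, and the choice of which summary to use for the label is left to the discussion of the Act's statement.

```python
import statistics

# Hypothetical fun-size bag weights in grams (not the class data).
weights_g = [17.2, 18.1, 16.9, 17.8, 18.4, 17.5, 16.7, 18.0, 17.6, 17.3]

# Summary measures that could inform the choice of a label weight.
summary = {
    "n": len(weights_g),
    "mean": statistics.mean(weights_g),
    "median": statistics.median(weights_g),
    "sd": statistics.stdev(weights_g),  # sample standard deviation
    "min": min(weights_g),
    "max": max(weights_g),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The real spreadsheet data would replace `weights_g`; a histogram or boxplot of the same values would cover the graph part of the task.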

Oct 30 - No meeting....Hurricane Sandy

Oct 23

Discuss standard deviation and standard deviation rule
Statistics in the big wide world -- Time magazine, higher ed data display
M&M data collection -- continue with measurement/data collection of fun packs

Oct 16

Check in as to progress in OLI
Statistics in the big wide world
M&M data collection -- continue with measurement/data collection of fun packs

Oct 9

Check in as to progress in OLI
Statistics in the big wide world

pie chart from Trenton Times

perspective...
organization of sections

Edward Tufte's book The Visual Display of Quantitative Information
M&M data collection -- continue with measurement/data collection of fun packs

Oct 2

How are people doing with OLI?
Statistics in the big wide world...for each item discuss

Variable: what is being measured?
Measurement: how is it being measured?
Precision: is the statistic appropriately precise given its purpose? too general? too precise?
Ambiguity: what don't we know about this statistic?
Example:

Time magazine: 35% less body fat gained among children who consumed sugar-free beverages...(Oct 8, 2012).....researcher site

M&M data collection

Equipment needed:

50+ packs of fun-size M&Ms
Scale and/or balance
Internet access to Google spreadsheets

Procedure (continued)

Discuss idea of measurement process, for example census, civilian deaths in a war.
Collaboratively design a process for measuring and recording the identified M&M variables. (List the steps in a text document.)
Implement the measurement process as designed.


	1	2	3	4	5	6
1	1,1	1,2	1,3	1,4	1,5	1,6
2	2,1	2,2	2,3	2,4	2,5	2,6
3	3,1	3,2	3,3	3,4	3,5	3,6
4	4,1	4,2	4,3	4,4	4,5	4,6
5	5,1	5,2	5,3	5,4	5,5	5,6
6	6,1	6,2	6,3	6,4	6,5	6,6
