Statistics of small samples: hypothesis testing and correlation
Nov 2005/07/08/11
This set of notes focuses on hypothesis testing and correlation with small-sized samples. We assume you are generally familiar with topics in probability and statistics, including:
¶ Definition of probability
¶ Probability distribution, probability density function = pdf
¶ Random variable
¶ Independent events vs "chains of events": states of a baseball
game...
¶ Expectation...average?
¶ permutations vs combinations
¶ Number of combinations of N things taken r at a time
¶ Normal distribution, binomial distribution, Poisson distribution
¶ Variance and standard deviation
¶ normalizing a data point x to z by subtracting the mean and dividing by the
standard deviation
Binomial distribution example in Matlab and EXCEL:
Suppose the probability that someone is left-handed is p = 0.1.
What is the probability that NO people out of a SAMPLE of 10 are left-handed?
0.9^10 ≈ 35%
In general, if r is the number of left-handers in a sample of N, type on the Matlab command line:
>> r=0; N = 10; p = 0.1; q = 1-p;
>> probL = q^(N-r)*p^r*nchoosek(N, r)
or EXCEL function =BINOMDIST(r, N, p, 0);
Knowing from EXCEL help that
BINOMDIST(number_s, trials, probability_s, cumulative)
type into an EXCEL cell
=BINOMDIST(0, 10, 0.1, 0)
to see the answer 0.34867844 ~ 35%
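The same calculation can be sketched in plain Python (standard library only); this is an illustrative analogue of the Matlab/EXCEL calls above, not part of the original notes.

```python
# Binomial probability of exactly r successes in n trials, success prob p.
from math import comb

def binom_pmf(r, n, p):
    """P(exactly r successes in n trials)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Probability that NO people out of a sample of 10 are left-handed (p = 0.1):
print(round(binom_pmf(0, 10, 0.1), 8))   # 0.34867844, i.e. ~35%
```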
The standard deviation of a binomial distribution = sqrt(p*q*n), where n = number of trials (coin flips)
Computing factorial and number of combinations:
EXCEL: FACT, COMBIN...Matlab: factorial(N), nchoosek(N, r)
FACT(200) is too big for EXCEL, but COMBIN(200, 20) is not...
However, COMBIN(2000, 500) is too big for EXCEL
365 is such a large number...
for thinking about the same-birthday problem:
any 2 of N people have a 1/365 chance of sharing the same day...
combin(N, 2) = 190 pairs for N = 20
different answer of 23 from
http://en.wikipedia.org/wiki/Birthday_problem
human birth season, twins...
result for EN123 student pop?
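The Wikipedia answer of 23 can be checked exactly with a short Python sketch (assuming 365 equally likely birthdays, i.e. ignoring leap years, twins, and birth-season effects, which the notes above flag as complications):

```python
# Exact probability that at least two of n people share a birthday,
# assuming 365 equally likely birthdays.
def p_shared_birthday(n):
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

print(round(p_shared_birthday(22), 3))  # 0.476
print(round(p_shared_birthday(23), 3))  # 0.507 -- first N past 50%
```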
The Poisson distribution follows from the binomial and is appropriate for very low p.
See the application in Katz, 1959, work that contributed to his Nobel Prize in 1970.
epsp = "excitatory postsynaptic potential".
78 spontaneous epsps were observed, and had an average peak of 0.4 mV. In the
link above see Figs 5 and 6; the histograms there are the result of 198 single
nerve impulses (presynaptic) spaced several seconds apart. Of the 198 impulses,
19 resulted in NO epsp (failure). Fig 5 is reproduced in Katz' book Nerve,
Muscle and Synapse (1966, McGraw-Hill) on page 137; on page 135 he compiles
the following numbers by adding up the 3 histogram bars near each of the peaks:
18   failures
44   0.4 mV peak
55   0.8 mV peak
36   1.2 mV
25   1.6 mV
12   2.0 mV
 5   2.4 mV
 2   2.8 mV
 1   3.2 mV
 0   3.6 mV
These data are well fit by a Poisson distribution with n*p = 2.3,
the only free parameter in the equation. What is n in n*p? Not 198, the number
of shocks to the presynaptic axon. It turns out n is the number of vesicles
in the presynaptic synapse. See figure below, from basal ganglia (research of
Tepper, from Rutgers...)
How many vesicles are there per synapse? Anywhere from hundreds to thousands.
One estimate says 987 vesicles per cubic micron. There are docking vesicles,
ready to be released, and reserve vesicles, recently reconstituted, away from
the membrane facing the synaptic cleft. At any rate, p is the probability that
one vesicle will be released (due to one stimulation shock...)
2011: What happens in a direct test of the Katz data
with the binomial distribution? n is the number of vesicles in the synapse and
p is the probability of release of one vesicle from one presynaptic "shock."
We want n*p = 2.3. Use EXCEL to find
=binomdist(0, 100, 0.023, 0) = .0976015 ; *198 = 19 (rounded off...)
=binomdist(0, 200, 0.0115, 0) = .0988315
=binomdist(0, 400, 0.00575, 0) = .0995955
as long as n*p = 2.3 it is basically verified that the Poisson approx is correct...
who needs Poisson?
try = BINOMDIST(2, 100, 0.023, 0) * 198 = 53, close to 55
Is such a skewed distribution (long tail) normal? No.
Only when p = 0.5 is the binomial symmetric; for small n*p, as here, it is skewed, and it approaches a normal shape only when both n*p and n*q are large.
The Katz results, plus electron micrographic images of
vesicles, are taken as solid evidence that there is a quantal nature to the
release of neurotransmitter from vesicles, each vesicle containing nearly the
same number of transmitter molecules.
Example in EXCEL: 2.33/198 = .0117 and
=POISSON(0, 0.0117*198, 1) gives 0.097, and 0.097*198 = 19
=BINOMDIST(0, 198, .0117, 1) gives 0.097 also.
Since we're looking at 0 events, =BINOMDIST(0, 198, .0117, 0) gives
the same answer; the 4th input argument is a flag for cumulative (1) or
not (0)...
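The binomial-to-Poisson convergence claimed above can be checked with a stdlib-Python sketch (an illustrative analogue of the EXCEL calls, not part of the original notes):

```python
# As n grows with n*p held at 2.3, the binomial P(0) approaches the
# Poisson P(0) = exp(-2.3).
from math import comb, exp, factorial

def binom_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(r, lam):
    return exp(-lam) * lam**r / factorial(r)

for n in (100, 200, 400):
    # compare with the EXCEL values 0.0976015, 0.0988315, 0.0995955 above
    print(n, round(binom_pmf(0, n, 2.3 / n), 7))
print(round(poisson_pmf(0, 2.3), 7))   # 0.1002588, the n -> infinity limit
```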
% 
If p is closer to 0.5, then the normal distribution is a good approximation to the binomial (for moderately large n).
2008: Nassim Nicholas Taleb: Black Swan events;
see page 235 of the book: What is the underlying power law?
Suppose 100 people own land in Italy, and you order and number them from least to most.
A Matlab script (at the end of these notes) gives the result that the first 80% of owners hold only about 21% of the land
(yes, the integral of the power function could be used instead...)
Occupy Italy?
2012: Men's height: over 7 feet... Women's height: shorter than normal

Some of the material for this file has been inspired by reading Chapters 7-10 of Loftus & Loftus, Essence of Statistics, 2nd Ed (Knopf, 1988).
Hypothesis Testing: Before we look at the details of calculating probabilities and making decisions, examine the table below to appreciate the 4 possibilities relating reality to prediction. The null hypothesis is the status quo; it means the new idea is wrong, the defendant is innocent. No news. The alternative hypothesis is your new idea, an explanation for data that defies the conventional wisdom.
                                Reality
                                null hypothesis true        alternative hypothesis true

Judgment      accept null       Correct, but just the       false negative: Wrong;
(prediction)  hypothesis        status quo                  missed opportunity;
              (Not guilty)                                  guilty man set free

              accept            false positive: Wrong,      Correct, and Big News!
              alternative       Bad News: project started
              hypothesis        with no hope of success;
              (Guilty)          innocent man sent to jail
Mistakenly rejecting the null hypothesis (false positive) is the worst mistake; it means you judge that some "new" cause-effect relationship has been discovered, and you're likely to inspire other researchers to follow up. Like finding an innocent man guilty.
More detail: Some independent variable has been changed in the experiment
and the data is the dependent variable. The null hypothesis is that the independent
variable has no effect on the dependent variable. You want to test that hypothesis.
Rejecting the null hypothesis means that your experiment supports the idea that the independent variable affects the dependent variable. Say the independent variable is blood-alcohol concentration (BAC) and the dependent variable is max frequency of rotation of a knob. Each subject is tested sober and inebriated, resulting in two lists of numbers, to be processed by a t-test, with a criterion for significance to meet.
2011: Signal detection theory (SDT); started by Green and Swets (1966)
http://wise.cgu.edu/sdtmod/index.asp
An appropriate topic for EN123, as SDT considers the detection of faint signals.
http://teachline.ls.huji.ac.il/72633/SDT_intro.pdf
Heeger (NYU) Lecture notes:
Consider a difficult-to-detect "signal" (say a tumor in a CT scan) for
a radiologist/detector: 4 possibilities:
Information vs Criterion: "...some doctors may feel that missing an opportunity for early diagnosis may mean the difference between life and death. A false alarm, on the other hand, may result only in a routine biopsy operation. They may choose to err toward yes (tumor present) decisions. Other doctors, however, may feel that unnecessary surgeries (even routine ones) are very bad (expensive, stress, etc.). They may choose to be more conservative and say no (no tumor) more often. They will miss more tumors, but they will be doing their part to reduce unnecessary surgeries. And they may feel that a tumor, if there really is one, will be picked up at the next checkup. These arguments are not about information. Two doctors, with equally good training, looking at the same CT scan, will have the same information. But they may have a different bias/criterion."
External noise vs internal noise: noise internal to the detector/observer.
"...the proportions of Hits and FAs reflect the effect of two underlying parameters: the first one reflects the separation between the signal and the noise (d') and the second one the strategy of the participant (C)." (Abdi)
Decision strategy: pick a criterion along the detector-response axis (signal + noise); responses above that criterion are called Hits. A low criterion will result in more false positives.
Receiver Operating Characteristic (ROC): the "discriminability" d' curves, with false-alarm rate on the horizontal axis and hit rate on the vertical axis.
d' = separation / spread
One goal of SDT: estimate d' and criterion from hits and false alarm rates. see wise.cgu.edu site.
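A minimal stdlib-Python sketch of that goal, using the standard equal-variance Gaussian SDT formulas d' = z(H) - z(FA) and c = -(z(H) + z(FA))/2; the hit and false-alarm rates below are invented for illustration:

```python
# Recover d' (discriminability) and criterion c from hit/false-alarm rates.
from statistics import NormalDist

z = NormalDist().inv_cdf   # inverse of the standard normal cdf

hit_rate, fa_rate = 0.84, 0.16   # hypothetical observer performance
d_prime = z(hit_rate) - z(fa_rate)
criterion = -(z(hit_rate) + z(fa_rate)) / 2

print(round(d_prime, 2))    # ~1.99: about 2 std devs of separation
print(round(criterion, 2))  # ~0.0: no lean toward "yes" or "no"
```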
Legal example: As a defense lawyer you'd like the jury to return a "not guilty" verdict, implying you'd like to seat conservative jurors who require considerable evidence of guilt before "detecting the guilty signal."
College example: Cynical students may select professors who require little evidence that someone deserves an A. A low "criterion value for detecting that A aura" signal from a student...

The other kind of sampling:
What it means to "draw a sample from a distribution": the
(probability) distribution represents the population as a whole from which the
sample is drawn. Ideally you would know the mean and standard deviation of the
population. Your concern is to make the size of the sample large enough to test your
hypothesis, but not so large as to be too expensive or time-consuming. (How
many people to call overnight in a political poll...)
The 2-tail/1-tail "paradox". Suppose you are testing the hypothesis
that guesses of my correct weight may be either too small or too large. I
know my exact weight, but don't know the standard deviation of all guesses,
only the s.t.d. of, say, a sample of 6 guesses. Form a data set of the differences
(positive or negative) between the guesses and my actual weight. Estimate the
variance of the population from
    est. variance = SUM( (x_i - mean)^2 ) / (N - 1)
Next divide the est. variance by N = 6 to find the est. variance of the means
of samples of size 6.
The null hypothesis, at a 95% criterion, would be that the mean guess falls between the 2.5% and 97.5% points of the distribution of size-6 means. Suppose the result is that in the one sample you have, the mean is large enough that it is greater than 97% of size-6 guess means. By the 2-tail criterion the alt. hypothesis fails: it would need to exceed the 97.5% point. But suppose you now change your hypothesis to say that guesses of my weight will be greater than actual. Now your alt. hypothesis succeeds at the 5% level, since the one-tail threshold is the 95% point. But you have changed your mind after seeing the data...something that perhaps wasn't known by a reviewer approving your process.
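The one-tail vs two-tail cutoffs behind the paradox can be made concrete with a stdlib-Python sketch (illustrative, not part of the original notes):

```python
# At the 5% significance level, a two-tailed test needs the 97.5th
# percentile of the normal, a one-tailed test only the 95th.
from statistics import NormalDist

z = NormalDist().inv_cdf
one_tail_cutoff = z(0.95)    # 1.645
two_tail_cutoff = z(0.975)   # 1.960
print(round(one_tail_cutoff, 3), round(two_tail_cutoff, 3))
# A sample mean at the 97th percentile clears 1.645 but not 1.960:
print(round(z(0.97), 2))     # 1.88
```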
Example: The school that wants to admit students with SATs above one level but below another level (to ensure good students on one hand and to have a good "selectivity" rating on the other hand: don't admit students who are just going to go somewhere else...)
The paradox is resolved by being clear about what the alt. hypothesis is before starting the analysis. n.b.: one section of the SAT is arranged to have a mean of 500 and a standard deviation of 100. Grading on the curve: a score 2 std_dev above average receives an A; 1 std_dev above, a B.
Blind experimenters: We the scientific establishment don't want you the experimenter testing subjects you know to be in the "alt hypothesis" group; you would be too biased in favor of your hypothesis: better for a third party to select your subjects and give them either the drug or the placebo, or put them in the normal environment or the all-vertical-striped drum...
Are you smarter than a 10th grader? Sample of one: suppose you score
600 on the SAT math test, knowing the average 10th grader scores 500. What's the
probability that you're smarter than a 10th grader? Well, you did get a higher
score, by one standard deviation, and
=NORMDIST(600, 500, 100, 1) = 0.84, and 1 - 0.84 = 16% is the one-tailed probability
that someone will score 600 or more. You're in the 84th percentile.
That's all that can be said. One POV: it would not make much sense for you to take the test again; you might do better the second time just because of the practice.
1. Considering whether a sample has been drawn from the distribution of a population. The population mean and variance are known. For example, from the population of all current Brown Univ students we may know that the average SAT verbal score was 700 with a standard deviation of 60. We have a sample of 20 left-handed Brown students who have an average SAT verbal score of 720.
What we are really concerned with is the distribution of means of all samples
of size 20 drawn from the population. It turns out that the variance of
the sample-size-N distribution can be calculated (more precisely, the formula
below can be proved as a theorem), and the answer is
    σ_M^2 = σ^2 / N
where N is the size of the sample, σ^2 is the population variance, and σ_M^2 is the variance of the distribution
of means of samples of size N drawn from the population.
Sanity checks: if the sample size were N = 1, then the "sample-means" σ would equal the population σ. Now suppose the size of the samples were equal to the entire population; then the variance of the means would be zero, because all means would be the same. So the idea seems reasonable, that σ_M shrinks as sqrt(N) grows.
Once we have the variance of the means-of-samples-size-N distribution, we can calculate
a z term,
    z = (M - μ) / σ_M
where M is the sample mean and μ is the population mean.
In the old days the value of z would be looked up in a table of the normalized normal distribution; the table would tell what the integral from -∞ to z is, and we would draw conclusions from that reading.
Nowadays we can use the EXCEL function NORMDIST: NORMDIST(samp_mean, pop_mean, std_dev_of_means, cumulative=TRUE).
We don't really need to calculate z; just give NORMDIST the sample mean, the population
mean, and the standard deviation of the sample means, and say you want the cumulative integral as the answer (not
the "point-on-curve" calculation).
For the example above, type into an EXCEL cell =NORMDIST(720, 700, 60/sqrt(20), TRUE) hit enter, and see the answer 0.932 appear.
What does the answer mean? That 93% of samples-of-size-20 drawn from the Brown SAT-verbal distribution have a mean less than 720. But if your criterion for accepting a significant difference is 95% (or 5% in the tail) then you would need a larger sample to confirm...what? Your alternative hypothesis that left-handed students at Brown are drawn from a different distribution, one that has significantly higher SAT verbal scores. Another way of stating the result: you are 93% confident that left-handers at Brown have higher SAT-verbal scores. Yet another way of interpreting the 93% answer: at the 7% level you reject the null hypothesis that there is no difference between the population as a whole and left-handed students in regard to SAT-verbal score.
How large a sample would be needed to reach the 95% confidence level, assuming the sample average stays at 720? Answer, by trial-and-error in EXCEL: N = 25.
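The trial-and-error can be sketched in stdlib Python (an illustrative analogue of the EXCEL search, not part of the original notes):

```python
# Confidence that a sample mean this high came from the population,
# as a function of sample size N (population mean 700, std dev 60).
from statistics import NormalDist
from math import sqrt

def confidence(sample_mean, N, pop_mean=700, pop_sd=60):
    return NormalDist(pop_mean, pop_sd / sqrt(N)).cdf(sample_mean)

print(round(confidence(720, 20), 3))   # 0.932 -- the example above
print(round(confidence(720, 24), 3))   # just under 0.95
print(round(confidence(720, 25), 3))   # 0.952 -- first N past 95%
```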
% 
Digression for election day: the margin of error in polling:
A qualifier seen in news articles about political polling:
"margin of error: 3.1%". Suppose X voters out of N sampled
will vote for your candidate. What is the number of voters N needed in a sample
to ensure that 95% of the time the actual percentage of voters underlying your
candidate's percentage X/N will be within
±3.1 percent of X/N? This question, whose answer is N = 1000 and whose
derivation is here, is different
from the question (answered below): what
should N be so that you're confident at the 95% level that the range of poll
percentages is ±3
percent of X/N if you repeated the poll many times?
The result can be tested by using matlab to simulate many polls of size 4270
and looking at the range of percentages, then finding out what fraction of the
many polls are within 3% of X/N.
See [signif, mat_sort] = SampSizeTest05(samp_num, test_num)
in work\fold23
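The Matlab SampSizeTest05 function is not reproduced here; a rough stdlib-Python analogue of the same simulation idea, assuming a true 50/50 race and poll size N = 1000:

```python
# Simulate many polls of size N = 1000 with a true 50/50 race, and count
# how often the result lands within the quoted margin of error.
# The 3.1% comes from 1.96 * sqrt(0.5*0.5/1000) = 0.031.
import random
from math import sqrt

random.seed(1)
N, true_p, polls = 1000, 0.5, 2000
margin = 1.96 * sqrt(true_p * (1 - true_p) / N)   # ~0.031
within = 0
for _ in range(polls):
    votes = sum(random.random() < true_p for _ in range(N))
    if abs(votes / N - true_p) <= margin:
        within += 1
print(round(margin, 3))   # 0.031
print(within / polls)     # close to 0.95
```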
See NormVoteStat08.xls for a direct test that candidate C has 95% confidence that he will win. What polling percentage will he need? It will depend on how many are in the sample...
%  
2. Comparing two variants of a population: We looked at whether an experimental group within a population is significantly different... Now what about comparing two experimental groups of a known population with each other? Left-handed males and left-handed females, for example.
You will form a z term of the difference of means:
    z = (M1 - M2) / sqrt( σ^2/N1 + σ^2/N2 )
where M1 and M2 are the means of the two samples S1 and S2 to be compared, and N1 and N2 are their sizes. Notice the two estimated
variances of the means are added together.
Suppose it's known that the average area of maple leaves
on the ground in October is 25 cm^2, with a standard deviation of 5 cm^2.
A sample of 12 Japanese maple leaves has an average area of 34 cm^2. Someone
else comes in with a sample of 18 big leaf maple leaves with an average area
of 38 cm^2. Is it significant at the 5% level that big leaf maple leaves are
larger than Japanese maple leaves?
From the above formula, the estimated std_dev of the difference is sqrt(25/12 + 25/18) = 1.86, z = (38 - 34)/1.86 = 2.14, and =NORMDIST(4, 0, 1.86, 1) = 0.984,
so the significance is 1.6% < 5%; answer: yes, the difference is significant.
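The maple-leaf arithmetic can be checked with a stdlib-Python sketch (illustrative, not part of the original notes):

```python
# z test for the difference of two sample means, population sigma = 5.
from statistics import NormalDist
from math import sqrt

pop_sd = 5.0
sd_diff = sqrt(pop_sd**2 / 12 + pop_sd**2 / 18)   # std dev of diff of means
print(round(sd_diff, 2))                          # 1.86
p = NormalDist(0, sd_diff).cdf(38 - 34)           # one-tailed
print(round(1 - p, 3))                            # 0.016, i.e. 1.6% < 5%
```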
Big Leaf Maple leaves
3. If population statistics are not available:
Now suppose, as in the case of the Lab 9 knob-rotation experiment with 11 subjects,
that population statistics are not known.
What can be done with the s.t.d. of the samples?
In their chapter 9, on parameter estimation, Loftus & Loftus say that with sample
variance S^2, the population variance is estimated by
    est. σ^2 = S^2 * n/(n - 1)
(equivalently, divide the sum of squared deviations by n - 1 instead of n).
The resulting standard deviation can be divided into a difference
greater than zero, but we should not call the result a z variable; it is a t variable.
What is t? From Adler & Roessler, Intro to Prob & Stat, 5th Ed, Freeman (1972), p. 156,
    t = (xbar - m) / ( est_σ / sqrt(n) )
where xbar is the sample mean and m is the "null hypothesis mean" of the population.
The distribution of t depends on n and is available in tables, or
kept as a calculation in EXCEL as TDIST(t, degrees_of_freedom, tails),
expressed as a cumulative probability.
For example, suppose
a sample of 9 femurs from ring-tailed lemurs shows their mean length to be 20
cm, and that the variance of the length is 9 cm^2. What is the probability that
the ring-tailed lemur femurs are NOT from a population of mean length 24 cm?
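Python's standard library has no t distribution, so here is an illustrative sketch that integrates the t density numerically. Assumption (not stated in the notes): the quoted variance of 9 cm^2 is taken as the corrected estimate, so s = 3, t = (20 - 24)/(3/sqrt(9)) = -4 with 8 degrees of freedom.

```python
# One-tailed t probability by Simpson's-rule integration of the t density.
from math import gamma, sqrt, pi

def t_pdf(x, df):
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_tail(t, df, steps=20000, far=60.0):
    """One-tailed P(T >= t), integrating the density on [t, far]."""
    h = (far - t) / steps
    s = t_pdf(t, df) + t_pdf(far, df)
    for i in range(1, steps):
        s += t_pdf(t + i * h, df) * (4 if i % 2 else 2)
    return s * h / 3

p = t_tail(4.0, 8)   # P(t >= 4) = P(t <= -4) by symmetry
print(round(p, 4))   # ~0.002: very unlikely the population mean is 24 cm
```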
ringtailed lemur from Madagascar
4. Comparing two samples of different size, without
knowing population stats (like the knob-rotation problem): in this case
the degrees of freedom is n1 + n2 - 2, where n1 is the size of sample 1
and n2 is the size of sample 2. The variance of the difference of sample means for this problem
is the sum of the two variances of the two means. The term m in the
t-test formula above is normally set to 0, so
    t = (M1 - M2) / sqrt( S_M1^2 + S_M2^2 )
where the variances S_M1^2 and S_M2^2 are variances of the sample means.
(2007) Two small samples of different sizes, with different variances: Example 10-5 from L&L: In the notes handed out, you can study an example whose data set is reproduced below in a screen shot of an EXCEL spreadsheet. There are two samples, one of 5 subjects and the other of 7 subjects. As you can see, the L&L analysis ends up with a t value of 3.18. If you look under EXCEL/Tools/Data Analysis, near the bottom of the menu are two choices, "t-Test: Two-Sample Assuming Equal Variances" and "t-Test: Two-Sample Assuming Unequal Variances." If you run these tools on the Example 10-5 data, only the "equal variances" choice comes up with the same t answer of 3.18. As you follow the Loftus and Loftus method, you see they combine two unequal variances into a weighted single variance.
Notice that in the cell highlighted below, TDIST
is used to calculate the one-tailed probability associated with the t-value
of 3.18 and 10 degrees of freedom.
For the knob-rotation lab you can use the "t-Test: Paired Two Sample
for Means" tool.
Creating your own data sample of given mean and standard deviation: The cdf, cumulative distribution function, is the integral of the pdf, the probability density function. What happens if you sample at equal intervals of cumulative probability increase, pick off the associated z values, then un-normalize them? See test code in folder fold23, function [pdfx, pdfy, samp] = pdf_tst08(div, mu, stdv) (2008). The result will be a sample with a smaller variance than the underlying population, as mentioned above.
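pdf_tst08 is Matlab; here is a rough stdlib-Python analogue of the same idea, with made-up parameters mu = 100, sd = 15 for illustration:

```python
# Sample a normal inverse cdf at equally spaced cumulative probabilities.
from statistics import NormalDist, stdev, mean

def make_sample(div, mu, sd):
    """div points at equal steps of cumulative probability (0 and 1 excluded)."""
    nd = NormalDist(mu, sd)
    return [nd.inv_cdf((i + 0.5) / div) for i in range(div)]

samp = make_sample(20, 100.0, 15.0)
print(round(mean(samp), 1))    # 100.0: the mean is reproduced exactly
print(round(stdev(samp), 1))   # ~14.9: smaller spread than the population's 15
```

The shrinkage happens because equal cumulative-probability steps never reach far into the tails.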
Correlation. In hypothesis testing with the t-test we
are interested in how different one group is from another. Correlation looks
at how similar one set of data is to another. It requires "paired data".
Example: In the knob rotation lab we record from the same subject the max rotation frequency for the dominant and the nondominant hands. Results vary from subject to subject. Are the results correlated? (is the Dom/ND ratio always the same for each subject?)
Definition of correlation: assume we have a set of M paired data points { x, y }. Standardize each list (subtract its mean, divide by its standard deviation) to get z_x and z_y; then
    r = SUM( z_x * z_y ) / (M - 1)
Example: a list of numbers correlated with itself will have an r value
of 1.0. Each product is z^2; some of the products will be greater than 1, but because
the z's have mean 0, most will be less than 1. The sum of the products will be M - 1
(when the standard deviation uses the M - 1 denominator), and the r value is (M - 1)/(M - 1) = 1.
The correlation coefficient will be between -1 and +1.
A correlation of zero means the two variables are uncorrelated (which does not
by itself prove they are independent). A correlation
near -1 means the two variables are related with a negative slope, Y = m*X with m < 0...
A threshold can be set to determine whether the correlation meets a criterion: greater than 0.5, for example.
Graphics: plot y data vs x data and look for the points to lie on a straight line for perfect correlation, either plus or minus.
In EXCEL you can find the function PEARSON; from the EXCEL Help menu we learn
that it
"Returns the Pearson product moment correlation coefficient,
r, a dimensionless index that ranges from -1.0 to 1.0 inclusive and reflects
the extent of a linear relationship between two data sets."
Syntax
PEARSON(array1,array2)
Array1 is a set of independent values.
Array2 is a set of dependent values.
A B
9 10
7 6
5 1
3 5
1 3
Formula Description (Result)
=PEARSON(A2:A6,B2:B6)
Pearson product moment correlation coefficient for the data sets above (0.699379)
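The same r can be computed by hand in stdlib Python, confirming the EXCEL result (an illustrative sketch, not part of the original notes):

```python
# Pearson product-moment correlation, from the definition.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

A = [9, 7, 5, 3, 1]
B = [10, 6, 1, 5, 3]
print(round(pearson(A, B), 6))   # 0.699379, matching =PEARSON(A2:A6,B2:B6)
```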
Extra: shifting time series to match up cause and effect; cross-correlation;
the special case of autocorrelation. See the image below, from Wikipedia, of a time
series with a hidden sine wave, and the autocorrelation that reveals the hidden
pattern...
Glossary/Summary:
null hypothesis: 2x2 matrix of decision vs reality
number of combinations of N things taken r at a time
binomial distribution
normal (gaussian, bell curve) distribution
Poisson distribution: release of synaptic vesicles example
standard deviation of samples of size N
estimating variance when population stats not available
when to use ttest and t tables
EXCEL functions NORMDIST, BINOMDIST, TDIST, POISSON, COMBIN
use of EXCEL Data Analysis Toolbox for statistics
Correlation, cross correlation
extra:
Fuzzy logic, fuzzy membership in fuzzy sets: to what degree is someone in the fuzzy set of tall people?
Curse of Dimensionality
Baseball batting order optimization (Markov chain, conditional probabilities)
See RS vs TB sheet
states: 0, 1, 2 outs; runners on: nowhere, 1, 1-2, 1-3, 2-3, 1-2-3 implies 18 states.
What's the probability of changing from one state to another, and is a run scored
during that state transition?
for nn = 1:100
    ara(nn) = nn^6;   % the power is 6
end
tot = sum(ara);
tot80 = sum(ara(1:80));
frac = tot80/tot
but
tot98 = sum(ara(1:98))
and
tot98/tot = 87%, so the top 2% do not own 50% of the land...
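A stdlib-Python check of the same land-ownership power-law calculation (illustrative, not part of the original notes):

```python
# 100 landowners, wealth growing as nn^6: what fraction do the first
# (poorest) 80% and 98% hold?
ara = [nn ** 6 for nn in range(1, 101)]
tot = sum(ara)
frac80 = sum(ara[:80]) / tot    # land held by the first 80% of owners
frac98 = sum(ara[:98]) / tot    # land held by the first 98%
print(round(frac80 * 100))      # ~21%
print(round(frac98 * 100))      # ~87%: the top 2% own about 13%, not 50%
```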