Statistical Analysis of Data, especially in the Service of Hypothesis Testing and Decision Making

update: Nov. 2002, 2004

¶ Probability, fuzzy sets, estimation, and decision-making

¶ Counting ordered combinations segues to binomial distribution

¶ Binomial distribution segues to the normal distribution

¶ hypothesis testing: the null hypothesis

¶ correlation

¶ Markov chain: conditional probabilities (coming soon)

**Reading:** Chapters 7-10 of Loftus & Loftus, *Essence of Statistics,*
with emphasis on 7 & 10.

**Example: **Consider statistical analysis for
hypothesis testing. Suppose you have an explanation--a hypothesis--for some observation
of nature: Example (from visual physiology): that **direction
selective neurons** recorded from the vertebrate retina require excitatory AND
connections in their dendritic trees, with one input coming through a delay line
from a neighboring photoreceptor. Suppose you design an experiment
to test your hypothesis: record proximal signals from a dendritic bifurcation while
controlling its two distal inputs. Two inputs are given uncorrelated noise of 50%
duty cycle square waves, and the percentage of time OUT is HI is recorded. 25% of
the time you expect the two input signals to be simultaneously HI, driving the proximal
dendrite HI. You measure the actual % time the proximal dendrite is HI. To the extent
it is near 25% your "alternative" hypothesis looks good. Can you do better?
Yes. (1) Measure from many bifurcations. 100? (2) Keep track of stimulation and
response: Is the output HI **correlated** with inputs being simultaneously HI?
What about correlating current output with input from some msec ago?

**Probability:** the study of probability grew of an interest in understanding
various forms of gambling. Much progress was made in mid-17th-century France, during
a period when high-stakes gambling flourished in the and around the Royal Court.
Pascal (1623-62) made the most progress. See Chpt 1 of *Physics With Illustrative
Examples from Biology and Medicine Vol 2, Statistical Physics, *by Villars &
Benedek, A-W (1974). **
Probability: definition:
**Expectation is average, or mean, in typical examples.

**Probability: Example: baseball: **Suppose
you've collected hitting statistics (probability of single, double, homerun,
out, etc) on batters of team RS. You have also written a baseball simulator
that uses a random number generator to decide what happens to each batter in
each inning and keeps track of the runs scored for all probabilities above a
threshold. For a given lineup against a given pitcher, you run the simulator
1000 times, and generate a table of percentages of runs/game that team RS scored.
Let's say it's a table of probabilities for 9 entries, from 0 RPG to 8 RPG.
You do a similar set of simulations for team NY lineup against a RS pitcher.
What is the average score per game for team RS? What is the probability that
team RS will defeat team NY? The probability that the game will go to extra
innings?

ANS: Multiply runs/**inning** vectors with itself 9 times to generate the Runs/**Game**
vector for each team. Multiply the two team's RPG vectors together (outer product)
to form a 9 x 9 matrix of all probabilities of RPG clashes. The spreadsheet below
shows in color what sections map out Tampa, Red Sox and Extra Inning probabilities.

link to spreadsheet showing the probability matrix for all combinations of RPG of the two teams

**Counting: Permutations vs combinations
**Combinations are arrangements without regard to order; in permutations, order
counts How many permutations of N things taken N at a time? N!. [For the first position
there are N possibilities, for the second, N-1, etc.] How many

**The binomial distribution:
**The formula above, for the probability of exactly r out of N, describes the
binomial distribution. Example: What is the distribution of the number of heads
in a 10-coin flip? There could be anywhere from 0 to 10 heads up in such an event.
What are the number of combinations of 10 things r at a time, where r will be the
number of heads? Consider expansion of ,
the coefficients of the expansion will be the numbers of successes for each of the
powers of p, for 2^10 = 1024 trials. The sum of the coefficients will equal 1024.
And can be seen as the coefficients of Pascal's triangle. The standard deviation
of a binomial distribution is .

Examples of binomial distributions.

*Another possible FTQ: *Suppose 10% of the population
is left handed, (for people under the age of 50). In a nursing home you interview
201 residents in their 80's, and find that 12 are left-handed. What is the chance
that 80 year olds are from the same distribution as the under 50-population?

Use EXCEL function BINOMDIST(success, trials, prob of success,
cumulative=1),

type =BINOMDIST(12, 201, 0.1, 1) in an EXCEL cell and find the answer, 3%.

Thus at the 5% level it would appear that lefties die young.

**Election 2004:** Suppose before the election you ask
9 of your friends who they will vote for, Bush or Kerry. 5 say B and 4 say K. None
are undecided, and none named a 3-rd party candidate. In fact B 5 and K 4 is one
of 10 possible outcomes, from B 0 to B 9. Using BINOMDIST with EXCEL you can calculate
the probabilities of each of the 10 possibilities.

Say the null hypothesis is that will each candidate will
receive the same number of votes: that the probability of any of the 9 friends voting
for Bush is assumed to be 0.5, then what is the signifcance of 5/9 actually choosing
Bush?

**The Poisson distribution:
**Suppose p << q , implying rare events. A recast of the binomial expansion
results in a Poisson distribution. The probability of n events in time interval
Δ t is

The arrival of raindrops on a roof tile per second, or the release of neurotransmitter vesicles at a synapse, within 100 msec of an action potential, can be described by Poisson statistics. The key number is many "Poisson experiments" is how many failures there are: how many instance of zero raindrops, or zero vesicles released? orzero hurricanes striking the USA during a season?

Katz, 1959, for Nobel Prize in 1970

-------------------------------------------------------------------------------------------

As part of the knob rotation lab, you'll learn to use EXCEL for statistical
analysis, particularly the t-test in Tools/Analysis. Also see normhist
function for calculating from the normal distribution.

* Normalizing Data: *Look at the exponent of e in the definition of
the normal distribution. L

*Example 1 of FTQ for Lab 9:*

Suppose the population of college students is tested on how many digits each can
remember. Say the average is 7 and the variance of the population is 4. Now say
a sample of 12 Brown students can remember a average of 8 digits. Is is significant
at the 5% level that Brown students are better at remembering digits than the average
college student?

**Comparing two variants of a population: **The example above investigates whether
an experimental group within a a population is significantly different. What about
comparing two experimental groups of a known population with each other? You will
form a z term of

Where S1 and S2 are the two samples to be compared.

Suppose it's known that the average area of maple leaves on the ground in October is 25 cm-sq, with a standard deviation of 5 cm-sq. A sample of 22 Japanese maple leaves has an average area of 32 cm-sq. Someone else comes in with a sample of 18 bigleaf maple leaves with an average area of 38 cm-sq. Is it significant that bigleaf maple leaves are larger than Japanese maple leaves?

Seen in every news article about political polling: **margin
of error:** 3% margin of error: *if you asked a poll question 100 times, 95
of those times the percentage of people giving a particular answer would be within
±3 percent of those who gave that same answer to the poll question.*

To find a situation where 1000 people would satisfy the margin
of error above, assume the population of the state is 1,000,000 and that the probability
of choosing the candidate is 0.5. then

See websites for calculator**,** definition, etc:**
**http://www.americanresearchgroup.com/moe.html

which calculates the same answer as above, for population and sample size.

http://whyfiles.org/009poll/math_primer.html

**If population
statistics are not available:
**Now
suppose, like in the case of the knob rotation experiment, that population statistics
are not known.

What can be done with s.t.d. of the samples?

Chapter 10 of L&L tells you about the need for **t-tables**
when the underlying population statistics are not known, as is the case for our
knob rotation experiment. You can estimate population statistics as given by the
examples in chpt 10, or go directly to the EXCEL t-test analysis tool. What you
need to know about the EXCEL tool is how to interpret the results it will present.
Study what it means to be a one-tailed or two-tailed test, and what the calculated
probabilities mean in terms of hypothesis testing.

**(2004) What is t?** From
Adler & Roessler, *Intro to Prob & Stat 5th Ed*, Freeman (1972), p.
156,

Where m is the "null hypothesis mean" of the population.

The distribution of t depends on n and is tabulated in tables or

kept as a calculation in EXCEL of TDIST(t, degrees of freedom, tails)

expressed as a cumulative probability.

**For example, **suppose
a sample of 9 femurs from lemurs show their mean length to be 20 cm, And that the
variance of the length is 9 cm. What is the probability that the lemur femurs are
NOT from a population of mean length 24 cm?

**Comparing two samples of different size, without knowing
population stats:** (like the knob rotation problem) In this case the degrees
of freedom is n1 + n2 - 2, where n1 is the size of sample 1 and n2 is the size
of sample 2. The variance of the sample means for this problem is the sum of
the two sample variances of the two means. Then term m in the t formula above
is normally set to 0, so

where the variances are of the sample **means. **

**Correlation. **

In the knob rotation lab we record from the same subject the
max rotation frequency for the dominant and the non-dominant hands. Results vary
from subject to subject. Are the results correlated? (is the Dom/ND ratio always
the same for each subject?)

Definition of correlation: Assume we have a set of paired data { x, y }

The correlation coefficient will be between -1 and +1.

In EXCEL you can find the operation Pearson: from the EXCEL Help menu we learn
that it

Returns the Pearson product moment correlation coefficient,
r, a dimensionless index that ranges from -1.0 to 1.0 inclusive and reflects the
extent of a linear relationship between two data sets.

Syntax

PEARSON(array1,array2)

Array1 is a set of independent values.

Array2 is a set of dependent values.

A B

9 10

7 6

5 1

3 5

1 3

Formula Description (Result)

=PEARSON(A2:A6,B2:B6)

Pearson product moment correlation coefficient for the data sets above (0.699379)

**Glossary: **