v1 · canonical standard

Lesson 103 — Hypothesis testing: structure and logic

Formal structure of hypothesis testing: H0 vs H1, test statistic, p-value, significance level, Type I and II errors, and test power.

Used in: 3rd year of upper secondary school (ages 17-18) · Equiv. German Stochastik LK · Equiv. Japanese Math B · Singaporean H2 Statistics

$p\text{-value} = P(T \geq t_{\mathrm{obs}} \mid H_0) \leq \alpha \Rightarrow \text{reject } H_0$

Rigorous notation, full derivation, hypotheses

Rigorous definition

The five elements of a hypothesis test

"The null hypothesis $H_0$ represents a claim of skepticism. It is the status quo that would be maintained unless there is sufficient evidence against it." — OpenIntro Statistics, §5.1

Errors and test power

Definition · Type I Error, Type II Error, and Power

Decision	$H_0$ true	$H_0$ false
Reject $H_0$	Type I Error ($\alpha$)	Correct decision (power $= 1-\beta$)
Do not reject $H_0$	Correct decision	Type II Error ($\beta$)

Type I Error (false positive): rejecting $H_0$ when it is true. Its probability is the significance level $\alpha$.
Type II Error (false negative): failing to reject $H_0$ when it is false. Its probability is $\beta$, which depends on the true value under $H_1$, on $\sigma$, on $n$, and on $\alpha$.
Power $= 1 - \beta$: probability of rejecting $H_0$ when it is false, that is, of detecting a real effect.

For a fixed sample size, decreasing $\alpha$ increases $\beta$ (trade-off). To increase power without raising $\alpha$, increase $n$.
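The trade-off can be made concrete with a small numerical sketch. Assuming a two-tailed z-test with known $\sigma$, the Python snippet below computes power for a few combinations of $\alpha$ and $n$. The function name `z_test_power` and the scenario values ($\delta = 3$, $\sigma = 10$) are illustrative assumptions, not taken from the cited sources.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_test_power(delta, sigma, n, alpha):
    """Power of a two-tailed z-test when the true mean differs from mu_0 by delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # reject H0 when |Z| > z_crit
    shift = delta * n ** 0.5 / sigma            # mean of Z under the alternative
    # Under H1, Z ~ Normal(shift, 1); add the probabilities of both rejection tails.
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical scenario (assumed values): delta = 3, sigma = 10.
for alpha in (0.05, 0.01):
    for n in (50, 100):
        power = z_test_power(delta=3, sigma=10, n=n, alpha=alpha)
        print(f"alpha={alpha:<5} n={n:<4} power={power:.3f} beta={1 - power:.3f}")
```

For a fixed $n$, the rows with $\alpha = 0.01$ should show lower power (higher $\beta$) than those with $\alpha = 0.05$, and for a fixed $\alpha$ the larger $n$ should show higher power.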

Formal definition of the p-value

"The p-value measures how consistent the data are with $H_0$ . A small p-value indicates that the data are incompatible with $H_0$ — not that $H_0$ is false with probability $1-p$ ." — OpenIntro Statistics, §5.1

Types of alternative hypothesis

Solved examples

Example 103.1 · Two-tailed z-test for mean (basic)

Problem. A company claims the average weight of its coffee bags is $\mu_0 = 500$ g. A sample of $n = 36$ bags gives $\bar X = 492$ g with $\sigma = 24$ g (known). At level $\alpha = 0.05$, do the data contradict the claim?

Strategy. $H_0: \mu = 500$, $H_1: \mu \neq 500$ (two-tailed). Use the z-statistic since $\sigma$ is known.

Resolution.

$Z = \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} = \frac{492 - 500}{24/\sqrt{36}} = \frac{-8}{4} = -2.00$

Two-tailed p-value: $p = 2\,P(Z \leq -2.00) = 2 \times 0.0228 = 0.0456$.

Since $p = 0.0456 < \alpha = 0.05$, we reject $H_0$. The data contradict the company's claim at the 5% level.

Verification. The critical value for a two-tailed test with $\alpha = 0.05$ is $z_{0.025} = 1.960$. Since $|{-}2.00| = 2.00 > 1.960$, the critical-value rule also rejects $H_0$, in agreement with the p-value. Consistent.
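The same calculation can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function name `two_tailed_z_test` is an illustrative choice, and the inputs are the numbers from the problem statement.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(x_bar, mu_0, sigma, n):
    """Return (z, p_value) for H0: mu = mu_0 against H1: mu != mu_0, sigma known."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    p_value = 2 * NormalDist().cdf(-abs(z))  # area in both tails beyond |z|
    return z, p_value


# Data from the coffee bag example.
z, p_value = two_tailed_z_test(x_bar=492, mu_0=500, sigma=24, n=36)
alpha = 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```

The computed p-value is about 0.0455. The hand calculation gives 0.0456 because the table value 0.0228 is already rounded; the conclusion is the same.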

Source. OpenStax Statistics, §9.2, Example 9.3 — CC-BY.

Example 103.2 · Identifying Type I and II errors (conceptual)

Problem. A quality test checks whether a batch of medication has an average active ingredient concentration of 50 mg ($H_0: \mu = 50$ mg). At the 5% level, the batch is approved or rejected. (a) What constitutes a Type I Error in this context? (b) And a Type II Error? (c) Which is more serious?

Strategy. Map formal definitions to the specific context.

Resolution.

(a) Type I Error: rejecting $H_0$ when $\mu = 50$ mg — that is, rejecting a batch that is actually compliant. Consequence: waste of good product, re-work cost.

(b) Type II Error: failing to reject $H_0$ when $\mu \neq 50$ mg — that is, approving a batch out of specification. Consequence: underdosed or overdosed medication reaches the patient.

(c) In most pharmaceutical contexts, the Type II Error is more serious: medication out of specification can harm the patient. This is why such studies keep $\alpha$ small but also require high power (low $\beta$), which is achieved by increasing $n$.

Verification. The asymmetry between errors justifies calibrating $\alpha$ and $\beta$ differently by context. In medicine, for example, a stricter level such as $\alpha = 0.01$ is often paired with 80-90% power.

Source. OpenIntro Statistics, §5.2, Example 5.4 — CC-BY-SA.

Example 103.3 · Power and sample size calculation (intermediate)

Problem. A researcher wants to detect that the average service time at a clinic changed from $\mu_0 = 30$ min to $\mu_1 = 27$ min ($\delta = 3$ min), with $\sigma = 10$ min, $\alpha = 0.05$ (two-tailed) and 80% power. What is the minimum $n$?

Strategy. Apply the sample size formula for power: $n = (z_{\alpha/2} + z_\beta)^2 \sigma^2/\delta^2$, where $z_{\alpha/2}$ and $z_\beta$ are upper-tail quantiles of the standard normal distribution.

Resolution.

$z_{0.025} = 1.960$, $z_{0.20} = 0.842$ (80% power $\Rightarrow \beta = 0.20$).

$n = \frac{(1.960 + 0.842)^2 \times 10^2}{3^2} = \frac{(2.802)^2 \times 100}{9} = \frac{7.851 \times 100}{9} \approx 87.2$

Round up: $n = 88$ visits.

Verification. If $\delta = 6$ min (effect doubled): $n = 7.851 \times 100/36 \approx 21.8$, so $n = 22$. Doubling the effect cuts the required sample to about a quarter, as expected from the $1/\delta^2$ dependence. Coherent.
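The sample size formula is easy to wrap in a helper. The sketch below assumes a two-tailed z-test with known $\sigma$ and uses `statistics.NormalDist` from the Python standard library for the quantiles; the function name `min_sample_size` and its default arguments are illustrative.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(delta, sigma, alpha=0.05, power=0.80):
    """Smallest integer n with n >= (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # upper quantile for beta = 1 - power
    n_exact = (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return ceil(n_exact)                           # always round up


print(min_sample_size(delta=3, sigma=10))  # clinic example
print(min_sample_size(delta=6, sigma=10))  # doubled effect
```

With the clinic data this returns 88, and with the doubled effect it returns 22, matching the hand calculation.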

Source. OpenIntro Statistics, §5.3, Example 5.7 — CC-BY-SA.

Example 103.4 · One-tailed test: medication (intermediate)

Problem. A new anticoagulant claims to reduce the average clotting time from $\mu_0 = 12$ s to below that. A sample of $n = 20$ patients gives $\bar X = 11.2$ s and $s = 2$ s. At level $\alpha = 0.05$, is the medication effective?

Strategy. $H_0: \mu \geq 12$, $H_1: \mu < 12$ (left-tailed). Since $\sigma$ is unknown and estimated by $s$, use the t-statistic with $n - 1 = 19$ degrees of freedom.

Resolution.

$T = \frac{11.2 - 12}{2/\sqrt{20}} = \frac{-0.8}{0.4472} = -1.789$

For $H_1: \mu < 12$, p-value $= P(t_{19} \leq -1.789)$. From the t-table: $P(t_{19} \leq -1.729) = 0.05$ and $P(t_{19} \leq -2.093) = 0.025$. Since $-1.789$ lies between these two values, closer to $-1.729$, interpolation gives $p \approx 0.045$.

Since $p \approx 0.045 < 0.05$, we reject $H_0$. There is evidence that the medication reduces clotting time.

Verification. The left-tailed critical value is $-t_{0.05, 19} = -1.729$. Since $T = -1.789 < -1.729$, the critical-value rule also rejects $H_0$. Consistent.
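The table lookup can be cross-checked numerically. Python's standard library has no Student's t distribution, so the sketch below integrates the t density with Simpson's rule; the helper names `t_pdf` and `t_cdf` are illustrative, and a statistics library would normally be used instead.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t), via Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |t|)
    # The density is symmetric around 0, so P(T <= 0) = 0.5.
    return 0.5 + area if t >= 0 else 0.5 - area


# Data from the clotting time example.
n, x_bar, s, mu_0 = 20, 11.2, 2.0, 12.0
t_stat = (x_bar - mu_0) / (s / sqrt(n))
p_value = t_cdf(t_stat, df=n - 1)  # left-tailed test: P(t_19 <= T)
print(f"T = {t_stat:.3f}, p-value = {p_value:.4f}")
```

The numerical p-value should fall slightly below 0.05, in line with the interpolated value from the table.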

Source. OpenStax Statistics, §9.4, Example 9.8 — CC-BY.

Example 103.5 · Misinterpretation of p-value and correction (advanced)

Problem. A researcher obtains $p = 0.03$ in a test of $H_0: \mu = 0$ and states: "There is a 97% probability that the effect is real." Identify the error and formulate the correct interpretation.

Strategy. Apply the formal definition of p-value and distinguish probabilities about data from probabilities about hypotheses.

Resolution.

The statement is incorrect for two reasons:

The p-value is a probability about the data (given $H_0$), not about the hypotheses: $P(\text{data} \mid H_0) \neq P(H_0 \mid \text{data})$. Confusing the two is the fallacy of the transposed conditional, which is closely related to base rate neglect.
$1 - p = 0.97$ is not the probability of $H_1$. To obtain $P(H_1 \mid \text{data})$, one would need Bayes' Theorem with a prior over the hypotheses.

Correct interpretation: "If $H_0$ were true, there would be only a 3% probability of observing an effect at least as large as the one observed. The data are statistically incompatible with $H_0$ at the 5% level."

Verification. Two independent studies with $p = 0.04$ each do not guarantee that a third study will also give $p = 0.04$, and their evidence is not combined by multiplying the p-values. Combining evidence is the job of meta-analysis.
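To see why a prior is needed, the following sketch applies Bayes' Theorem to the event "the test rejected $H_0$ at level $\alpha$". All inputs are hypothetical assumptions chosen for illustration: the prior probabilities of a real effect, the 80% power, and the function name `prob_h0_given_significant`. It conditions on a rejection at $\alpha = 0.05$ rather than on the exact value $p = 0.03$, so it is a simplification.

```python
def prob_h0_given_significant(prior_h1, power, alpha):
    """P(H0 is true | the test rejected H0), by Bayes' Theorem."""
    true_positive = power * prior_h1          # P(reject and H1 true)
    false_positive = alpha * (1 - prior_h1)   # P(reject and H0 true)
    return false_positive / (true_positive + false_positive)


# Hypothetical priors for "the effect is real" (assumed values).
for prior_h1 in (0.5, 0.1, 0.01):
    posterior_h0 = prob_h0_given_significant(prior_h1, power=0.80, alpha=0.05)
    print(f"P(H1) = {prior_h1:<5} P(H0 | significant) = {posterior_h0:.3f}")
```

The same significant result leaves very different probabilities for $H_0$ depending on the prior, which is why $1 - p$ cannot be read as the probability that the effect is real.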

Source. OpenIntro Statistics, §5.1, Section "Interpreting p-values" — CC-BY-SA.

Exercise list

26 exercises · 6 with worked solution (23%)

Application 18 · Understanding 4 · Modeling 2 · Challenge 1 · Proof 1

Ex. 103.1 · Application · Answer key
Formulate the hypotheses $H_0$ and $H_1$ for the following scenario: a consumer protection agency wants to verify whether the average weight of a 500 g package of flour complies with the declared weight.
Ex. 103.2 · Application
Researchers want to verify whether teenagers sleep less than the recommended 8 hours per night. Formulate $H_0$ and $H_1$.
Ex. 103.3 · Application
$H_0: \mu = 50$, $H_1: \mu \neq 50$. Data: $n = 25$, $\bar X = 52$, $\sigma = 10$ (known). Calculate the z-statistic and the p-value. Conclude for $\alpha = 0.05$.
Ex. 103.4 · Application
A manufacturer claims its bulbs last on average 1000 h. A sample of $n = 64$ bulbs gives $\bar X = 985$ h with $\sigma = 50$ h (known). At the 5% level, is the average lifespan less than claimed?
Ex. 103.5 · Application
In a criminal trial, $H_0$ is "the defendant is innocent" and $H_1$ is "the defendant is guilty". Describe Type I and Type II Errors in this context. Which is considered more serious in the legal system? Why?
Ex. 103.6 · Understanding
A test results in $p = 0.03$. Which of the statements below is correct?
Ex. 103.7 · Understanding
A test with $n = 10$ results in $p = 0.12$. The researcher concludes "the effect does not exist". What might be wrong?
Ex. 103.8 · Application
A school implemented a new methodology. The historical average grade is $\mu_0 = 35$ points. After the intervention, $n = 40$ students had $\bar X = 37$ and $\sigma = 8$ (known). At the 5% level, did the grade improve?
Ex. 103.9 · Application
A clinic wants to detect a 5 min reduction in service time ($\delta = 5$, $\sigma = 10$). With $\alpha = 0.05$ and 90% power, what is the minimum $n$?
Ex. 103.10 · Application · Answer key
A coin is flipped 100 times and lands heads 60 times. At the 5% level, is the coin fair?
Ex. 103.11 · Application
A researcher changes the significance level from $\alpha = 0.05$ to $\alpha = 0.01$ while keeping $n$ fixed. Explain the effect on Type II Error and test power.
Ex. 103.12 · Application · Answer key
Normal fasting blood glucose: $\mu_0 = 120$ mg/dL. A sample of $n = 50$ diabetics gives $\bar X = 128$ mg/dL with $\sigma = 20$ mg/dL. At the 1% level, is average blood glucose elevated?
Ex. 103.13 · Understanding
A result is "statistically significant at 5%". What does this correctly mean?
Ex. 103.14 · Application
A company wants to detect if the average weight of its products dropped from $\mu_0 = 250$ g to $\mu_1 = 245$ g, with $\sigma = 20$ g, $\alpha = 0.05$ and 80% power. What is the minimum $n$?
Ex. 103.15 · Application
A genomics study performs 1000 simultaneous tests with $\alpha = 0.05$. All tested genes are null (no real effect). How many false positives are expected? If 60 genes are "significant", what is the estimated false discovery rate?
Ex. 103.16 · Application
A coin is flipped 800 times and lands heads 384 times. At the 5% level, is the coin fair?
Ex. 103.17 · Application · Answer key
A survey with $n = 30$ teenagers recorded an average sleep of $\bar X = 7.5$ h with $\sigma = 1.5$ h (from previous studies). At the 5% level, do they sleep less than 8 hours?
Ex. 103.18 · Understanding · Answer key
Which of the statements about statistical significance is correct?
Ex. 103.19 · Modeling
A clinical trial tests 20 endpoints simultaneously with $\alpha = 0.05$. What is the probability of at least one false positive without correction? Describe how the Bonferroni correction addresses the problem and discuss its limitation.
Ex. 103.20 · Application
The historical ENEM approval rate of a school is 30%. After a new methodology, 38 out of 100 students passed. At the 5% level, did the rate improve?
Ex. 103.21 · Application
Test $H_0: \mu = 50$ vs $H_1: \mu \neq 50$ with $\sigma = 10$ and $\bar X = 51$. Calculate the p-value for $n = 10$ and $n = 10000$. What does this reveal about the p-value and effect size?
Ex. 103.22 · Application · Answer key
Normal systolic pressure: $\mu_0 = 120$ mmHg. Sample of $n = 60$ sedentary adults: $\bar X = 125$ mmHg, $\sigma = 15$ mmHg. At the 1% level, is average pressure elevated?
Ex. 103.23 · Application
A veterinary study wants to detect that the average weight of pigs of a breed changed from 125 kg to 120 kg ($\delta = 5$, $\sigma = 15$). With $\alpha = 0.05$ two-tailed and 80% power, how many animals are needed?
Ex. 103.24 · Modeling
A school's ENEM scores have mean $\bar X = 52$ points against a state average of $\mu_0 = 50$, with $s = 10$ and $n = 10000$ students. The result is "highly significant" ($p < 0.001$). Calculate Cohen's effect size $d$. Is the 2-point difference educationally relevant? Discuss.
Ex. 103.25 · Challenge
Show that, when $H_0$ is true, the p-value has a Uniform(0,1) distribution for continuous test statistics. Use this result to verify that $P(\text{reject } H_0 \mid H_0) = \alpha$.
Ex. 103.26 · Proof
Use the Neyman-Pearson Lemma to show that the one-tailed z-test (reject if $\bar X > c$) is the most powerful level $\alpha$ test for $H_0: \mu = \mu_0$ vs $H_1: \mu = \mu_1 > \mu_0$ with normal data and known $\sigma$.

Sources

OpenIntro Statistics (4th ed.) — Diez, Çetinkaya-Rundel, Barr · CC-BY-SA. Sections §5.1–5.3 (test structure, p-value, power, sample size).
Statistics (OpenStax) — Illowsky, Dean · CC-BY. Chapter 9 (null and alternative hypotheses, Type I and II errors, complete examples with z).
Statistical Thinking for the 21st Century — Russell Poldrack · CC-BY-NC. Chapters 10–11 (replicability crisis, responsible use of p-value, FDR, effect size).