statanalysis.hyp_testi_md package

Submodules

utils - In a test, H0 is the pessimistic hypothesis

so enough evidence (p<0.05) is needed to reject it

todo - refactor output (last lines) - use “alternative” instead of “tail” - use kwargs format while calling functions - reorder fcts attributes

statanalysis.hyp_testi_md.hp_estimators.HPE_FROM_P_VALUE(tail: str | None = None, p_value=None, t_stat=None, p_hat=None, p0=None, std_stat_eval=None, alpha=None, test='z_test', ddl=0, onetail=False)

_summary_

Parameters:

tail (str, optional) – “middle” or “left” or “right”
p_value (_type_, optional) – _description_. Defaults to None.
t_stat (_type_, optional) – _description_. Defaults to None.
p_hat (_type_, optional) – _description_. Defaults to None.
p0 (_type_, optional) – _description_. Defaults to None.
std_stat_eval (_type_, optional) – _description_. Defaults to None.
alpha (_type_, optional) – _description_. Defaults to None.
test (str, optional) – _description_. Defaults to “z_test”.
ddl (int, optional) – _description_. Defaults to 0.
onetail (bool, optional) – if tail=“middle”, return one_tail_cf_p_value instead of the 2tail_2cf_p_value. Defaults to False.

Returns:

_description_

Return type:

_type_
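A minimal sketch of how a p-value can be derived from a test statistic for each tail, assuming a z-test (standard normal distribution) and the tail names “left”, “right” and “middle” listed above. The helper name `p_value_from_z` is hypothetical and is not part of the package; it only illustrates the idea.

```python
from statistics import NormalDist


def p_value_from_z(z_stat: float, tail: str = "middle") -> float:
    """Hypothetical helper: p-value of a z statistic for a given tail."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        # P(Z <= z): small values of the statistic count against H0
        return std_normal.cdf(z_stat)
    if tail == "right":
        # P(Z >= z): large values of the statistic count against H0
        return 1.0 - std_normal.cdf(z_stat)
    if tail == "middle":
        # two-sided: both extremes count against H0
        return 2.0 * (1.0 - std_normal.cdf(abs(z_stat)))
    raise ValueError(f"unknown tail: {tail!r}")


alpha = 0.05
p_two_sided = p_value_from_z(1.96, tail="middle")
reject_null = p_two_sided < alpha  # reject H0 only when the p-value is below alpha
```

For a t-test the same logic applies with a Student t distribution and the relevant degrees of freedom in place of the standard normal.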

statanalysis.hyp_testi_md.hp_estimators.HPE_MEAN_MANY(*samples, alpha=None)

check whether the mean is equal across many samples

Hypotheses - H0: mean1 = mean2 = mean3 = … - H1: at least one mean is different

Hyp - each sample is

simple random

normal

independent from the others

same variance
- if added, the “same variance test” should use the Levene test, which is [more robust than Fisher or Bartlett to non-normal data](https://fr.wikipedia.org/wiki/Test_de_Bartlett)

Fisher test

The F Distribution is also called the Snedecor’s F, Fisher’s F or the Fisher–Snedecor distribution [1](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) [2](https://blog.minitab.com/fr/comprendre-lanalyse-de-la-variance-anova-et-le-test-f)

Returns:

stat: (float) F
p_value: (float)
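As an illustration of the test described above, here is a small sketch using `scipy.stats.f_oneway` (the function referenced in link [1]) together with an optional Levene check of the equal-variance assumption. The sample values are made up for the example, and this is not necessarily how HPE_MEAN_MANY is implemented internally.

```python
from scipy import stats

# Illustrative data: three small samples, assumed independent and roughly normal
sample_a = [20.1, 21.3, 19.8, 22.0, 20.7]
sample_b = [20.4, 21.0, 20.2, 21.7, 20.9]
sample_c = [25.2, 26.1, 24.8, 25.9, 25.5]

# Optional check of the "same variance" assumption (H0: variances are equal)
levene_stat, levene_p = stats.levene(sample_a, sample_b, sample_c)

# One-way ANOVA, H0: mean_a == mean_b == mean_c
f_stat, p_value = stats.f_oneway(sample_a, sample_b, sample_c)

alpha = 0.05
reject_null = p_value < alpha  # True means at least one mean differs
```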

statanalysis.hyp_testi_md.hp_estimators.HPE_MEAN_ONE(alpha, p0, mean_dist, n, std_sample, tail='tail-right')

get the mean of a population from a sample (no sign problem) using a T-statistic (always T for a mean, unless you are comparing a sample with a population of known std)
- alpha: significance level
- n: number of observations == len(sample)
- mean_dist: the mean measured on the sample == mean(sample)
- std_sample: std of the sample == std(sample). You should use an unbiased estimate (ddof=1, i.e. divide by n-1)

Hyp - simple random sample - the population should preferably follow a normal distribution; otherwise use a large sample (>10)

Alternative to normality: Wilcoxon Signed Rank Test

Theo - read [here](https://en.wikipedia.org/wiki/Student’s_t-distribution#How_Student’s_distribution_arises_from_sampling)
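A minimal sketch of a one-sample t-test computed from summary statistics, under the assumptions listed above. The helper `t_test_one_mean` and the tail names “right”, “left” and “middle” are hypothetical choices for this example, not the package's own API.

```python
import math

from scipy import stats


def t_test_one_mean(mean_sample, p0, std_sample, n, tail="right"):
    """Hypothetical helper: one-sample t-test from summary statistics."""
    standard_error = std_sample / math.sqrt(n)  # std_sample computed with ddof=1
    t_stat = (mean_sample - p0) / standard_error
    dof = n - 1  # degrees of freedom of the t-distribution
    if tail == "right":
        p_value = stats.t.sf(t_stat, dof)  # P(T >= t)
    elif tail == "left":
        p_value = stats.t.cdf(t_stat, dof)  # P(T <= t)
    else:
        p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided
    return t_stat, p_value


# Example: sample mean 82 vs. hypothesised mean 80, std 5, 25 observations
t_stat, p_value = t_test_one_mean(mean_sample=82.0, p0=80.0, std_sample=5.0, n=25)
reject_null = p_value < 0.05
```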

statanalysis.hyp_testi_md.hp_estimators.HPE_MEAN_TWO_NOTPAIRED(alpha, diff_mean, N1, N2, std_sample_1, std_sample_2, pool=False, tail='tail-middle')

check the difference in mean of two populations, based on a sample from each (sign(diff_mean) => no sign problem)
- alpha: significance level
- diff_mean: the difference of the means measured on the samples == mean(sample1) - mean(sample2)
- N1: number of observations == len(sample1)
- N2: number of observations == len(sample2)
- std_sample_1: std of the sample == std(sample1)
- std_sample_2: std of the sample == std(sample2)
- pool: default False

True

if we assume that our population variances are equal

we use a t-distribution with (N1+N2-2) degrees of freedom

False

if we assume that our population variances are not equal

we use a t-distribution with min(N1, N2)-1 degrees of freedom

Hyp - both populations follow a normal distribution, or the samples are large (>10) - the populations are independent from each other - use simple random samples - for pool=True, the variances must be the same

to test that, you can

use the Levene test, which is [more robust than Fisher or Bartlett to non-normal data](https://fr.wikipedia.org/wiki/Test_de_Bartlett)

H0: the variances are equal; H1: they are not

scipy.stats.levene(liste1, liste2, center='mean')

solution = "no equality" if p_value < 0.05 else "equality"

or check whether the IQRs are the same

IQR = quantile(75%) - quantile(25%)
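The two checks above can be sketched as follows. The data in `liste1` and `liste2` are made up, and the `iqr` helper is a hypothetical convenience function written for this example.

```python
import numpy as np
from scipy import stats

# Illustrative data (assumed independent samples)
liste1 = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
liste2 = [12.4, 11.5, 12.9, 11.2, 12.6, 11.9, 12.8, 11.4]

# Levene test, H0: the variances are equal
levene_stat, p_value = stats.levene(liste1, liste2, center="mean")
solution = "no equality" if p_value < 0.05 else "equality"


def iqr(values):
    """Interquartile range: quantile(75%) - quantile(25%)."""
    q75, q25 = np.percentile(values, [75, 25])
    return q75 - q25


# Rough alternative: compare the interquartile ranges of both samples
iqr1, iqr2 = iqr(liste1), iqr(liste2)
```

Comparing IQRs is only an informal check; the Levene test gives an actual p-value for the equal-variance hypothesis.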

Theo - read [here](https://en.wikipedia.org/wiki/Student’s_t-distribution#How_Student’s_distribution_arises_from_sampling)
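A minimal sketch of the two-sample computation from summary statistics, for the two-sided case. The helper `t_test_two_means` is hypothetical. The pooled branch uses the textbook N1+N2-2 degrees of freedom, which is an assumption about the standard formula rather than a statement about the package's internals; the unpooled branch uses the conservative min(N1, N2)-1 described above.

```python
import math

from scipy import stats


def t_test_two_means(diff_mean, n1, n2, std1, std2, pool=False):
    """Hypothetical helper: two-sided t-test on mean1 - mean2."""
    if pool:
        # pooled variance, assuming equal population variances
        pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2)
        standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
        dof = n1 + n2 - 2
    else:
        # separate variances, conservative degrees of freedom
        standard_error = math.sqrt(std1**2 / n1 + std2**2 / n2)
        dof = min(n1, n2) - 1
    t_stat = diff_mean / standard_error
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value


t_pooled, dof_pooled, p_pooled = t_test_two_means(1.5, 16, 16, 2.0, 2.0, pool=True)
t_unpooled, dof_unpooled, p_unpooled = t_test_two_means(1.5, 16, 16, 2.0, 2.0, pool=False)
```

With equal sample sizes and equal standard deviations the two statistics coincide, and only the degrees of freedom (hence the p-value) differ.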

statanalysis.hyp_testi_md.hp_estimators.HPE_MEAN_TWO_PAIRED(alpha, mean_diff_sample, n, std_diff_sample, tail='tail-middle')

get the difference of mean between two paired lists (no sign problem) using a T-statistic (always T for a mean, unless you are comparing a sample with a population of known std)
- alpha: significance level
- mean_diff_sample: the mean measured on the sample == mean(sample)
- n: number of observations == len(sample) == n1 == n2
- std_diff_sample: std of the sample == std(sample). You should use an unbiased estimate (ddof=1, i.e. divide by n-1)
- tail: default=Tails.middle to test the equality (mean_diff=0). We can also test mean_diff>0 (right) or mean_diff<0 (left)

Hyp - simple random sample - preferably the differences of the samples (sample1 - sample2) follow a normal distribution; otherwise use a large sample (>10) - std_diff_sample is a good data-based estimate [use (n-1) instead of n]. Example: np.std(sample1 - sample2, ddof=1) is better than ddof=0 (default)

Hypotheses - H0: p1 - p2 = 0 - H1:

p1 - p2 != 0 for tail=middle

p1 - p2 > 0 for tail=right

p1 - p2 < 0 for tail=left
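The paired computation can be sketched from the summary statistics named above (`mean_diff_sample`, `n`, `std_diff_sample`), here for the two-sided case. The data are invented for the example, and the cross-check against `scipy.stats.ttest_rel` is only there to show that the hand computation matches a standard paired t-test.

```python
import numpy as np
from scipy import stats

# Illustrative paired measurements (same individuals, two conditions)
sample1 = np.array([72.0, 68.5, 80.1, 75.3, 69.9, 77.2])
sample2 = np.array([70.2, 67.9, 78.0, 74.8, 68.1, 75.5])

diff = sample1 - sample2
n = len(diff)
mean_diff_sample = diff.mean()
std_diff_sample = np.std(diff, ddof=1)  # ddof=1 divides by n-1

# T-statistic and two-sided p-value (tail=middle, H1: mean_diff != 0)
t_stat = mean_diff_sample / (std_diff_sample / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_stat), n - 1)

# Cross-check against SciPy's paired t-test
scipy_t, scipy_p = stats.ttest_rel(sample1, sample2)
```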

statanalysis.hyp_testi_md.hp_estimators.HPE_PROPORTION_ONE(alpha, p0, proportion, n, tail='tail-right')

check the proportion of an attribute value (e.g. male gender) in a population based on a sample (no sign problem) using a Z-statistic
- alpha: p_value_max, the significance level
- p0: proportion under the null
- proportion: measurement
- n: number of observations == len(sample)
- tail:

right: check if p>p0

left: check if p<p0

middle: check if p==p0

Hyp - simple random sample - large sample (np>10)

Hypotheses - H0: proportion = p0 - H1:

tail==right => proportion > p0

tail==left => proportion < p0

tail==middle => proportion != p0

Detail - use a normal distribution (Z-statistic)

Result (ex:tail=right) - if reject==True

There is sufficient evidence to conclude that the population proportion of {….} is greater than p0
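A minimal sketch of a one-proportion z-test, assuming the standard error is computed under H0 (that is, from p0). The helper `z_test_one_proportion` and the tail names are hypothetical and only illustrate the description above.

```python
import math
from statistics import NormalDist


def z_test_one_proportion(proportion, p0, n, tail="right"):
    """Hypothetical helper: one-proportion z-test."""
    # standard error under H0 (proportion == p0)
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z_stat = (proportion - p0) / standard_error
    cdf = NormalDist().cdf  # standard normal CDF
    if tail == "right":
        p_value = 1 - cdf(z_stat)  # H1: proportion > p0
    elif tail == "left":
        p_value = cdf(z_stat)  # H1: proportion < p0
    else:
        p_value = 2 * (1 - cdf(abs(z_stat)))  # H1: proportion != p0
    return z_stat, p_value


alpha = 0.05
z_stat, p_value = z_test_one_proportion(proportion=0.56, p0=0.50, n=400)
reject = p_value < alpha  # sufficient evidence that the proportion exceeds p0
```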

statanalysis.hyp_testi_md.hp_estimators.HPE_PROPORTION_TW0(alpha, p1, p2, n1, n2, tail='tail-middle', evcpp=False)

check the difference of proportion between 2 populations based on a sample from each population (p1-p2) #p1-p2# using a Z-statistic (always used for a difference of estimates). There are also the Fisher and chi-square tests
- alpha: level of significance
- p1: proportion of liste1
- p2: proportion of liste2
- n1: len(liste1)
- n2: len(liste2)
- evcpp: bool (default=False) (True -> Estimate of the variance of the combined population proportion)

Hyp - two independent samples - two random samples - large enough data

Hypotheses - H0: p1 - p2 = 0 - H1: p1 - p2 != 0

Detail - use a normal distribution (Z-statistic)
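A minimal sketch of a two-sided two-proportion z-test. The `pooled` flag is a hypothetical parameter that loosely mirrors what `evcpp=True` is described as (using the combined population proportion to estimate the variance); the actual behaviour of the package may differ.

```python
import math
from statistics import NormalDist


def z_test_two_proportions(p1, p2, n1, n2, pooled=True):
    """Hypothetical helper: two-sided z-test on p1 - p2."""
    if pooled:
        # combined proportion across both samples, used under H0: p1 == p2
        p_combined = (p1 * n1 + p2 * n2) / (n1 + n2)
        standard_error = math.sqrt(p_combined * (1 - p_combined) * (1 / n1 + 1 / n2))
    else:
        # each sample contributes its own variance estimate
        standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_stat = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p_value


z_stat, p_value = z_test_two_proportions(p1=0.60, p2=0.50, n1=200, n2=200)
reject = p_value < 0.05
```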

statanalysis.hyp_testi_md.hp_estimators.print(*args)

utils

In a test, H0 is the pessimistic hypothesis
- so enough evidence (p<0.05) is needed to reject it
- this puts a low upper bound on the type 1 error (rejecting H0 when it is true)

Some defs

parameter: A quantifiable characteristic of a population (baseline)
alpha: level of significance = type1_error = proba(reject_null;when null is True)

todo - docstrings for the fcts here

todo - use kwargs format while calling functions - reorder fcts attributes

note - mean_two_paired(Sample1, Sample2) <==> mean_one(Sample1 - Sample2)
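The equivalence in the note above can be checked with SciPy's own tests: a paired t-test on two samples gives the same statistic and p-value as a one-sample t-test on their differences against 0. The data are invented for the example.

```python
import numpy as np
from scipy import stats

# Illustrative paired data
sample1 = np.array([5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.2])
sample2 = np.array([4.7, 4.9, 5.4, 5.1, 4.6, 5.0, 5.3])

# Paired test on the two samples
paired = stats.ttest_rel(sample1, sample2)

# One-sample test on the differences, H0: mean difference == 0
one_sample = stats.ttest_1samp(sample1 - sample2, popmean=0.0)

same_statistic = np.isclose(paired.statistic, one_sample.statistic)
same_p_value = np.isclose(paired.pvalue, one_sample.pvalue)
```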

statanalysis.hyp_testi_md.hypothesis_testing.HP_MEAN_MANY(*samples)

statanalysis.hyp_testi_md.hypothesis_testing.HP_MEAN_ONE(p0: float, alpha: float, sample: list, symb='p>p0')

ONE MEAN: We need the spread (std): We will use an estimation

Data

alpha:..
sample: value…

Method - Use t-distribution to calculate few

Hypothesis - the sample follows a normal distribution (or is large enough to bypass this assumption) => the sample mean follows a t-distribution

statanalysis.hyp_testi_md.hypothesis_testing.HP_MEAN_TWO_NOTPAIR(alpha, sample1, sample2, symb='p!=p0', pool=False)

TWO MEANS FOR UNPAIRED DATA: We have estimated a parameter p on two populations (1, 2). How to check p1-p2 != 0? #p1-p2#

Construction - It is like,

checking if a feature magnitude changes when going from one category to another

Example_context

having a dataframe df, with 4 columns [name, score, equipe, role]

equipe: “1” or “2”

role: df.role.nunique = 11 => len(df)==22

Now there is a battle: for a “same role” fight, which team is the best?

Example_question

Is there a mean difference between the education levels based on gender?

if education levels are generally equal -> mean difference is 0

if education levels are unequal -> mean difference is not 0

So, look for 0 in the range of reasonable values

We need the spread (std): We will use an estimation

Data

alpha:..
Sample1: list: values…
Sample2: list: (same len) values…
pool: default False
- True
  
  if we assume that our population variances are equal
  
  we use a t-distribution with (N1+N2-2) degrees of freedom
- False
  
  if we assume that our population variances are not equal
  
  we use a t-distribution with min(N1, N2)-1 degrees of freedom

Method - Use t-distribution to calculate few - create joint alpha

Hypothesis - a random sample - the samples follow a normal distribution (or are large enough to bypass this assumption: 10 per category) => the means of these samples follow a t-distribution

description
With a significance level of {alpha}, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
if the whole interval is above 0, the difference is significant

statanalysis.hyp_testi_md.hypothesis_testing.HP_MEAN_TWO_PAIR(alpha, sample1, sample2, symb='p!=p0')

TWO MEANS FOR PAIRED DATA: We have estimated a parameter p on two populations (1, 2). How to estimate p1-p2? #p1-p2#

What is paired data:

measurements taken on individuals (people, homes, any object)
technicality:
- When in a dataset (len = n) there is a column df.a whose values each appear exactly twice (=> df.a.nunique = n/2)
- we can do a plot(x=feature1, y=feature2)
examples
- Each home needs a cabinet quote from two suppliers => we want to know if there is an average difference in nb_quotes between these two suppliers
- In a blind taste test to compare two new juice flavors, grape and apple, consumers were given a sample of each flavor and the results will be used to estimate the percentage of all such consumers who prefer the grape flavor to the apple flavor.
Construction
- It is like,
  
  checking if a feature magnitude changes when going from one category to another, each pair being split across the two categories
  
  Example_context
  
  having a dataframe df, with 4 columns [name, score, equipe, role]
  
  equipe: “1” or “2”
  
  role: df.role.nunique = 11 => len(df)==22
  
  Now there is a battle: for a “same role” fight, which team is the best?
  
  Example_question
  
  Is there a mean difference between the education levels of twins?
  
  if education levels are generally equal -> mean difference is 0
  
  if education levels are unequal -> mean difference is not 0
  
  So, look for 0 in the range of reasonable values

We need the spread (std): We will use an estimation

Equivalent - estimate_population_mean(alpha, sample1 - sample2)

Data

alpha:..
Sample1: list: values…
Sample2: list: (same len) values…

Method - Use t-distribution to calculate few - create joint alpha

Hypothesis - a random sample of identical twin sets - the samples follow a normal distribution (or are large enough to bypass this assumption, e.g. 20 twins) => the means of these samples follow a t-distribution

description
With a significance level of {alpha}, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
if the whole interval is above 0, the difference is significant
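The interval mentioned in the description can be sketched as a t-based confidence interval on the paired differences. The team scores are invented, and the variable names (`first_team`, `second_team`, `interval`) are hypothetical choices echoing the description rather than the package's actual attributes.

```python
import numpy as np
from scipy import stats

# Illustrative scores for the same roles in two teams (paired by role)
first_team = np.array([10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 11.0])
second_team = np.array([12.0, 13.0, 11.0, 12.0, 15.0, 12.0, 13.0, 13.0])

diff = second_team - first_team
n = len(diff)
alpha = 0.05

# Two-sided critical value of the t-distribution with n-1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
margin = t_crit * np.std(diff, ddof=1) / np.sqrt(n)
interval = (diff.mean() - margin, diff.mean() + margin)

# The difference is significant when 0 lies outside the interval
significant = interval[0] > 0 or interval[1] < 0
```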

statanalysis.hyp_testi_md.hypothesis_testing.HP_PROPORTION_ONE(sample_size: int, parameter: float, p0: float, alpha: float, symb='p>p0')

ONE PROPORTION: interval estimate at level alpha after a statistical test - input

sample_size: int: sample size (more than 10 to use this method)

parameter: float: the measurement on the sample

alpha: float: significance level (between 0 and 1). The smaller the alpha (i.e. the higher the confidence level 1 - alpha), the wider the interval

method: str: either “classic” (default) or “conservative”.

Example:
- how many men in the entire population with a con ?
- a form filled in by 300 people shows that only 120 are men => p = 120/300; N = 300
Hypothesis
- the sample has more than 10 observations for each of the categories in place => the normal approximation applies (central limit theorem)
- the sample proportion comes from data that is considered a simple random sample
Idea
- let P: the real proportion in the population
- let S: Size of each sample == nb of observations per sample
- For many samples, we calculate proportions per sample: ex: for N samples of size S => N proportions values
- (p - P) / sqrt( p*(1-p)/S ) approximately follows a standard normal distribution
Descriptions:
- For a given population and a parameter P to find, if we repeated this study many times, each time drawing a new sample (of the same size {res.sample_size==S}) from which a (1 - {res.alpha}) confidence interval is computed, then a fraction (1 - {res.alpha}) of the resulting intervals would be expected to contain the true value P
- If the entire interval verifies a property, then it is reasonable to say that the parameter verifies that property
Result
- at significance level {res.alpha}, we estimate that the population proportion who are men is between {res.left_tail} and {res.right_tail}
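A minimal sketch of the interval for the example above (120 men out of 300). The helper `proportion_interval` is hypothetical. The “conservative” branch assumes the usual worst-case standard error 1 / (2 * sqrt(n)), reached at p = 0.5; whether the package defines “conservative” the same way is an assumption.

```python
import math
from statistics import NormalDist


def proportion_interval(parameter, sample_size, alpha=0.05, method="classic"):
    """Hypothetical helper: (1 - alpha) confidence interval for a proportion."""
    # two-sided critical value of the standard normal distribution
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    if method == "conservative":
        # worst-case standard error, reached at p = 0.5 (assumed definition)
        standard_error = 1 / (2 * math.sqrt(sample_size))
    else:
        standard_error = math.sqrt(parameter * (1 - parameter) / sample_size)
    margin = z_crit * standard_error
    return parameter - margin, parameter + margin


left_tail, right_tail = proportion_interval(120 / 300, 300, alpha=0.05)
cons_left, cons_right = proportion_interval(120 / 300, 300, alpha=0.05, method="conservative")
```

The conservative interval is never narrower than the classic one, since p * (1 - p) is at most 0.25.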

statanalysis.hyp_testi_md.hypothesis_testing.HP_PROPORTION_TWO(alpha, p1, p2, N1, N2, symb='p!=p0', evcpp=False)

TWO PROPORTIONS: We have estimated a parameter p on two populations (1, 2). How to estimate p1-p2? #p1-p2#

Method

create joint alpha
evcpp: bool (default=False) (True -> Estimate of the variance of the combined population proportion)

Construction - Comparison

Hypotheses - two independent random samples - large enough sample sizes: 10 per category (e.g. 10 yes, 10 no)

statanalysis.hyp_testi_md.hypothesis_testing.print(*args)

Module contents