statanalysis.conf_inte_md package

Submodules

todo
- refactor output (last lines)
- use “alternative” instead of “tail”
- use kwargs format while calling functions
- reorder function attributes

statanalysis.conf_inte_md.ci_estimators.CIE_MEAN_ONE(n, mean_dist, std_sample, t_etoile=None, cf: float | None = None)

Get_interval_mean: estimate the mean of a population from a sample (no sign problem).
- cf: confidence level (or coverage probability)
- n: number of observations == len(sample)
- mean_dist: the mean measured on the sample == mean(sample)
- std_sample: standard deviation of the sample == std(sample)
- t_etoile: if set, cf is ignored.

Hypotheses
- the population should preferably follow a normal distribution; otherwise use a large sample (>10)

Alternative to normality: Wilcoxon Signed Rank Test

Theory
- read [here](https://en.wikipedia.org/wiki/Student’s_t-distribution#How_Student’s_distribution_arises_from_sampling)
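As a rough illustration of the computation described above, the sketch below builds the interval `mean_dist ± t_etoile * std_sample / sqrt(n)` from summary statistics. The helper name `mean_interval` is hypothetical and is not part of the package, and the multiplier is passed in explicitly (as `t_etoile` allows) rather than derived from `cf`.

```python
import math


def mean_interval(n, mean_dist, std_sample, t_etoile):
    """Return (lower, upper) for a one-sample t-interval.

    Hypothetical helper, not the package's implementation: the caller
    supplies the t multiplier directly instead of a confidence level.
    """
    if n < 2:
        raise ValueError("need at least two observations")
    standard_error = std_sample / math.sqrt(n)
    margin = t_etoile * standard_error
    return mean_dist - margin, mean_dist + margin


# 25 observations, sample mean 10, sample std 2, multiplier 2.064
# (the usual table value for 95% confidence and 24 degrees of freedom).
lower, upper = mean_interval(n=25, mean_dist=10.0, std_sample=2.0, t_etoile=2.064)
print(f"[{lower:.4f}, {upper:.4f}]")
```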

statanalysis.conf_inte_md.ci_estimators.CIE_MEAN_TWO(N1, N2, diff_mean, std_sample_1, std_sample_2, t_etoile=None, pool=False, cf: float | None = None)

Get_interval_diff_mean: estimate the difference in means of two populations from their samples (sign(diff_mean) => no sign problem).
- cf: confidence level (or coverage probability)
- N1: number of observations == len(sample1)
- N2: number of observations == len(sample2)
- diff_mean: the difference between the means measured on the two samples
- std_sample_1: standard deviation of the first sample == std(sample1)
- std_sample_2: standard deviation of the second sample == std(sample2)
- t_etoile: if set, cf is ignored.
- pool: default False
  - True: we assume that the population variances are equal, and we use a t-distribution with (N1+N2-1) degrees of freedom
  - False: we assume that the population variances are not equal, and we use a t-distribution with min(N1, N2)-1 degrees of freedom

Hypotheses
- both populations follow a normal distribution, or the samples are large (>10)
- the populations are independent from each other
- the samples are simple random samples
- for pool=True, the variances are assumed to be equal

To test that, you can

use the Levene test, which is [more robust than the Fisher or Bartlett tests when the data are not normal](https://fr.wikipedia.org/wiki/Test_de_Bartlett)

H0: the variances are equal; H1: they are not

`stat, p_value = scipy.stats.levene(liste1, liste2, center='mean'); solution = "no equality" if p_value < 0.05 else "equality"`

or check whether the IQRs are the same

IQR = quantile(75%) - quantile(25%)
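The IQR comparison can be sketched with the standard library alone. Here `iqr` is a hypothetical helper, the two lists are made-up illustrative data, and the "inclusive" quantile method is an assumption; other quantile conventions give slightly different values.

```python
import statistics


def iqr(values):
    """Interquartile range: quantile(75%) - quantile(25%)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Made-up data: liste2 is visibly more spread out than liste1.
liste1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
liste2 = [1, 3, 5, 7, 9, 11, 13, 15, 17]

print(iqr(liste1), iqr(liste2))
```

Very different IQRs suggest that pool=False is the safer choice; this is an informal check, not a replacement for the Levene test.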

Equivalent
- scipy.stats.ttest_ind(liste1, liste2, equal_var=False) or scipy.stats.ttest_ind(liste1, liste2, equal_var=True)

Equivalent pointwise estimation
- Assume diff_mean = 82
- Result: diff_mean in CI = [77.33, 87.63]
- If we test H0: p=80 vs H1: p>80, we fail to reject the null because the data do not support H1 here
- As a matter of fact, some values in the CI are below 80, which is not compatible with H1 => the test does not give enough evidence to reject H0

Theory
- read [here](https://en.wikipedia.org/wiki/Student’s_t-distribution#How_Student’s_distribution_arises_from_sampling)
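To make the effect of `pool` concrete, here is a minimal sketch of the two textbook standard errors for a difference of means. `diff_mean_interval` is a hypothetical helper, not the package's code; it uses the standard pooled-variance estimator (divisor N1+N2-2), which may differ from what the package does, and it takes the multiplier as an argument so that no degrees-of-freedom convention has to be assumed.

```python
import math


def diff_mean_interval(n1, n2, diff_mean, std_1, std_2, t_etoile, pool=False):
    """Return (lower, upper) for a difference of two means.

    Hypothetical helper using the textbook standard errors; the
    package's own implementation may differ.
    """
    if pool:
        # Equal variances assumed: combine both samples into one estimate.
        pooled_var = ((n1 - 1) * std_1**2 + (n2 - 1) * std_2**2) / (n1 + n2 - 2)
        standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    else:
        # Unequal variances: each sample contributes its own variance.
        standard_error = math.sqrt(std_1**2 / n1 + std_2**2 / n2)
    margin = t_etoile * standard_error
    return diff_mean - margin, diff_mean + margin


# Illustrative numbers only; t_etoile=2.0 is an arbitrary multiplier.
print(diff_mean_interval(10, 20, 5.0, 3.0, 1.0, t_etoile=2.0, pool=True))
print(diff_mean_interval(10, 20, 5.0, 3.0, 1.0, t_etoile=2.0, pool=False))
```

With unequal sample sizes and spreads, the two settings give visibly different widths, which is why the equal-variance assumption is worth checking first.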

statanalysis.conf_inte_md.ci_estimators.CIE_ONE_PROPORTION(proportion, n, method, cf: float | None = None)

Get_interval_simple: estimate the proportion of an attribute value (e.g. male gender) in a population from a sample (no sign problem).
- cf: confidence level (or coverage probability)
- proportion: the measurement on the sample
- n: number of observations == len(sample)
- method: “classic” or “conservative”

Hypotheses
- the population should preferably follow a normal distribution; otherwise use a large sample (>10)
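The difference between the two `method` values can be sketched as follows. This is an illustration under stated assumptions, not the package's source: `proportion_interval` is a hypothetical name, the default `cf=0.95` is chosen only for the example, and the formulas are the usual normal-approximation ones (the "conservative" variant replaces p by its worst case 0.5).

```python
import math
from statistics import NormalDist


def proportion_interval(proportion, n, method="classic", cf=0.95):
    """Return (lower, upper) for a one-proportion interval.

    Hypothetical helper; the formulas are assumed, not taken from
    the package's source.
    """
    z = NormalDist().inv_cdf(1 - (1 - cf) / 2)  # two-sided multiplier
    if method == "classic":
        standard_error = math.sqrt(proportion * (1 - proportion) / n)
    elif method == "conservative":
        # Worst case p = 0.5 gives the largest possible standard error.
        standard_error = 1 / (2 * math.sqrt(n))
    else:
        raise ValueError(f"unknown method: {method!r}")
    margin = z * standard_error
    return proportion - margin, proportion + margin


print(proportion_interval(0.4, 300, method="classic"))
print(proportion_interval(0.4, 300, method="conservative"))
```

The conservative interval is never narrower than the classic one, since p*(1-p) is at most 0.25.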

statanalysis.conf_inte_md.ci_estimators.CIE_PROPORTION_TWO(p1, p2, n1, n2, cf: float | None = None)

Get_interval_diff: estimate the difference in means (here, proportions) between two populations, p1-p2, based on a sample from each population.
- cf: confidence level (or coverage probability)
- p1: mean of liste1
- p2: mean of liste2
- n1: len(liste1)
- n2: len(liste2)

Hypotheses
- the populations should preferably follow a normal distribution; otherwise use large samples (>10)
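A minimal sketch of such an interval, assuming the usual unpooled normal-approximation standard error; `diff_proportion_interval` and the default `cf=0.95` are hypothetical choices made for the example, not the package's API.

```python
import math
from statistics import NormalDist


def diff_proportion_interval(p1, p2, n1, n2, cf=0.95):
    """Return (lower, upper) for p1 - p2 (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - (1 - cf) / 2)  # two-sided multiplier
    # Each sample contributes its own binomial variance term.
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z * standard_error
    diff = p1 - p2
    return diff - margin, diff + margin


lower, upper = diff_proportion_interval(0.5, 0.4, 100, 100)
print(f"[{lower:.4f}, {upper:.4f}]")
```

In this made-up case the interval contains 0, so the data are compatible with no difference between the two proportions.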

statanalysis.conf_inte_md.ci_estimators.get_min_sample(moe: float, p=None, method=None, cf: float | None = None)

Get_min_sample: get the minimum sample size to use.

input
- cf: confidence level (or coverage probability), between 0 and 1
- moe: margin of error
- method (optional): “conservative” (default)
- p: not used if method == “conservative”

Hypotheses
- the population should preferably follow a normal distribution; otherwise use a large sample (>10)
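One common way to obtain such a minimum is to invert the margin-of-error formula for a proportion, n >= z^2 * p * (1 - p) / moe^2. The sketch below assumes that formula; `min_sample_size` and its defaults are hypothetical and may not match the package's behaviour.

```python
import math
from statistics import NormalDist


def min_sample_size(moe, p=None, method="conservative", cf=0.95):
    """Smallest n whose margin of error is at most moe (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - (1 - cf) / 2)  # two-sided multiplier
    if method == "conservative":
        p = 0.5  # worst case: p * (1 - p) is largest at 0.5
    elif p is None:
        raise ValueError("p is required unless method is 'conservative'")
    return math.ceil(z**2 * p * (1 - p) / moe**2)


print(min_sample_size(0.03))                           # conservative
print(min_sample_size(0.03, p=0.2, method="classic"))  # uses the given p
```

Supplying a realistic `p` away from 0.5 lowers the required sample size, which is the practical reason to prefer the non-conservative variant when a prior estimate exists.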

statanalysis.conf_inte_md.ci_estimators.print(*args)

Some definitions
- parameter: a quantifiable characteristic of a population
- confidence interval: a range of reasonable values for the parameter

todo
- use kwargs format while calling functions
- reorder function attributes

statanalysis.conf_inte_md.confidence_interval.IC_MEAN_ONE(sample: list, t_etoile=None, confidence: float | None = None)

Estimate_population_mean (ONE MEAN): we need the spread (std) of the population; we will use an estimate of it.

Data

confidence:..
sample: value…

Method - Use t-distribution to calculate few

Hypothesis
- the sample follows a normal distribution (or is large enough to bypass this assumption) => the sample mean follows a t-distribution

statanalysis.conf_inte_md.confidence_interval.IC_MEAN_TWO_NOTPAIR(sample1, sample2, pool=False, confidence: float | None = None)

Difference_population_means_for_nonpaired_data (TWO MEANS FOR NON-PAIRED DATA): we have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?

Construction - It is like,

checking whether the magnitude of a feature changes when going from one category to another

Example_context

having a dataframe df with 4 columns [name, score, equipe, role]

equipe: “1” or “2”

role: df.role.nunique() == 11 => len(df) == 22

Now there is a battle: in a “same role” fight, which team is the best?

Example_question

Is there a mean difference between the education levels based on gender?

if education levels are generally equal -> the mean difference is 0

if education levels are unequal -> the mean difference is not 0

So, look for 0 in the range of reasonable values

We need the spread (std): we will use an estimate

Args

confidence:..
Sample1: list: values…
Sample2: list: (same len) values…
pool: default False
- True
  
  if we assume that the population variances are equal
  
  we use a t-distribution with (N1+N2-1) degrees of freedom
- False
  
  if we assume that the population variances are not equal
  
  we use a t-distribution with min(N1, N2)-1 degrees of freedom

Method
- use the t-distribution to calculate few
- create a joint confidence interval

Hypothesis
- a random sample
- the samples follow a normal distribution (or are large enough to bypass this assumption: 10 per category) => the means of these samples follow a t-distribution

Notes
- With {cf} confidence, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
- if all values in the interval are above 0, the difference is significant

statanalysis.conf_inte_md.confidence_interval.IC_MEAN_TWO_PAIR(sample1, sample2, t_etoile=None, confidence: float | None = None)

Difference_population_means_for_paired_data (TWO MEANS FOR PAIRED DATA): we have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?

What is paired data?

measurements taken on individuals (people, homes, any object)

technicality:

when in a dataset (len = n) there is a column df.a whose values each appear exactly twice (=> df.a.nunique() == n/2)

we can do a plot(x=feature1, y=feature2)

examples

Each home needs a cabinet quote from two suppliers => we want to know whether there is an average difference in nb_quotes between these two suppliers

In a blind taste test to compare two new juice flavors, grape and apple, consumers were given a sample of each flavor and the results will be used to estimate the percentage of all such consumers who prefer the grape flavor to the apple flavor.

Construction

It is like,

checking whether the magnitude of a feature changes when going from one category to another, where each pair is split across the two categories

Example_context

having a dataframe df with 4 columns [name, score, equipe, role]

equipe: “1” or “2”

role: df.role.nunique() == 11 => len(df) == 22

Now there is a battle: in a “same role” fight, which team is the best?

Example_question

Is there a mean difference between the education levels of twins?

if education levels are generally equal -> the mean difference is 0

if education levels are unequal -> the mean difference is not 0

So, look for 0 in the range of reasonable values

We need the spread (std): we will use an estimate

Equivalent - IC_MEAN_ONE(sample=sample1 - sample2, confidence=confidence)

Data

confidence:..
Sample1: list: values…
Sample2: list: (same len) values…

Method
- use the t-distribution to calculate few
- create a joint confidence interval

Hypothesis
- a random sample of identical twin sets
- the samples follow a normal distribution (or are large enough to bypass this assumption, e.g. 20 twins) => the means of these samples follow a t-distribution

Notes
- With {cf} confidence, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]}
- if all values in the interval are above 0, the difference is significant
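The equivalence with the one-mean case can be sketched directly: take the pairwise differences, then build a one-sample interval on them. `paired_mean_interval` is a hypothetical helper, the scores are made-up data, and the multiplier is passed in explicitly rather than derived from a confidence level.

```python
import math
import statistics


def paired_mean_interval(sample1, sample2, t_etoile):
    """Return (lower, upper) for the mean of sample1 - sample2."""
    if len(sample1) != len(sample2):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for a, b in zip(sample1, sample2)]
    mean_diff = statistics.mean(differences)
    std_diff = statistics.stdev(differences)  # sample std (n - 1 divisor)
    margin = t_etoile * std_diff / math.sqrt(len(differences))
    return mean_diff - margin, mean_diff + margin


# Made-up scores for five pairs; 2.776 is the usual table value for
# 95% confidence with 4 degrees of freedom.
first = [12, 15, 11, 14, 13]
second = [10, 14, 12, 11, 10]
lower, upper = paired_mean_interval(first, second, t_etoile=2.776)
print(f"[{lower:.3f}, {upper:.3f}]", "contains 0" if lower <= 0 <= upper else "excludes 0")
```

Working on the differences removes the between-pair variability, which is why paired data should not be analysed with the non-paired function.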

statanalysis.conf_inte_md.confidence_interval.IC_PROPORTION_ONE(sample_size: int, parameter: float, confidence: float | None = None, method: str | None = None)

Confidence_interval (ONE PROPORTION): confidence interval computation after a statistical test

input

sample_size: int: sample size (more than 10 to use this method)

parameter: float: the measurement on the sample

confidence: float: confidence level (between 0 and 1). The greater the confidence, the wider the interval

method: str: either “classic” (default) or “conservative”.

Example:
- how many men in the entire population with a con ?
- a form filled in by 300 people shows that there are only 120 men => p = (120/300); N=300

Hypothesis
- the sample has more than 10 observations in each of the categories => we use the “Law of Large Numbers”
- the sample proportion comes from data that is considered a simple random sample

Idea
- let P be the real proportion in the population
- let S be the size of each sample == number of observations per sample
- for many samples, we calculate the proportion of each sample, e.g. N samples of size S => N proportion values
- (p - P) / sqrt(p*(1-p)/S) approximately follows a standard normal distribution

Descriptions:
- For a given population and a parameter P to find, if we repeated this study many times, each time producing a new sample (of the same size {res.sample_size==S}) from which a {res.confidence} confidence interval is computed, then {res.confidence} of the resulting confidence intervals would be expected to contain the true value P
- If the entire interval satisfies a property, then it is reasonable to say that the parameter satisfies that property

Result
- with {res.confidence} confidence, we estimate that the population proportion who are men is between {res.left_tail} and {res.right_tail}
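The repeated-sampling reading of the confidence level can be checked with a small simulation. This is only a sketch: `coverage` is a hypothetical helper, the true proportion 0.4 and sample size 300 echo the example above, and the interval uses the usual normal-approximation formula, which is assumed rather than taken from the package.

```python
import math
import random
from statistics import NormalDist


def coverage(true_p, sample_size, confidence, repetitions, seed=0):
    """Fraction of simulated intervals that contain true_p."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(repetitions):
        # Draw one sample and compute its proportion of "successes".
        successes = sum(rng.random() < true_p for _ in range(sample_size))
        p_hat = successes / sample_size
        margin = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / repetitions


print(coverage(true_p=0.4, sample_size=300, confidence=0.95, repetitions=2000))
```

The printed fraction should land close to the requested confidence level, with some simulation noise.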

statanalysis.conf_inte_md.confidence_interval.IC_PROPORTION_TWO(p1, p2, N1, N2, confidence: float | None = None)

Difference_population_proportion (TWO PROPORTIONS): we have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?

Method

create a joint confidence interval

Construction - Comparison

Hypotheses
- two independent random samples
- large enough sample sizes: 10 per category (e.g. 10 yes, 10 no)

statanalysis.conf_inte_md.confidence_interval.print(*args)

Module contents