statanalysis.conf_inte_md package
Submodules
todo - refactor output (last lines) - use “alternative” instead of “tail” - use kwargs format when calling functions - reorder function arguments
- statanalysis.conf_inte_md.ci_estimators.CIE_MEAN_ONE(n, mean_dist, std_sample, t_etoile=None, cf: float | None = None)
Get_interval_mean: estimate the mean of a population from a sample (no sign problem) - cf: confidence level (or coverage_probability) - n: number of observations == len(sample) - mean_dist: the mean measured on the sample == mean(sample) - std_sample: standard deviation of the sample == std(sample) - t_etoile: if set, cf is ignored.
Hyp - the population should preferably follow a normal distribution; otherwise use a large sample (>10)
Alternative when normality does not hold: Wilcoxon signed-rank test
Theo - read [here](https://en.wikipedia.org/wiki/Student’s_t-distribution#How_Student’s_distribution_arises_from_sampling)
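The computation described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `mean_interval` and its argument names are hypothetical, and it assumes `std_sample` is the sample standard deviation (n - 1 denominator) and that the critical value comes from a t-distribution with n - 1 degrees of freedom.

```python
from math import sqrt

from scipy import stats


def mean_interval(n, mean_sample, std_sample, cf=0.95, t_star=None):
    """Hypothetical helper: t-based confidence interval for one population mean."""
    if t_star is None:
        # Two-sided critical value of the t-distribution with n - 1 degrees of freedom.
        t_star = stats.t.ppf(1 - (1 - cf) / 2, df=n - 1)
    # Margin of error = critical value * estimated standard error of the mean.
    margin = t_star * std_sample / sqrt(n)
    return mean_sample - margin, mean_sample + margin


# Made-up summary statistics, for illustration only.
low, high = mean_interval(n=25, mean_sample=82.0, std_sample=15.0, cf=0.95)
print(f"95% CI for the mean: [{low:.2f}, {high:.2f}]")
```

Passing `t_star` bypasses `cf`, which mirrors the documented behaviour of `t_etoile`.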
- statanalysis.conf_inte_md.ci_estimators.CIE_MEAN_TWO(N1, N2, diff_mean, std_sample_1, std_sample_2, t_etoile=None, pool=False, cf: float | None = None)
Get_interval_diff_mean: estimate the difference in means of two populations from their samples (the sign of diff_mean is kept => no sign problem) - cf: confidence level (or coverage_probability) - N1: number of observations == len(sample1) - N2: number of observations == len(sample2) - diff_mean: the difference between the means measured on the two samples - std_sample_1: standard deviation of the first sample == std(sample1) - std_sample_2: standard deviation of the second sample == std(sample2) - t_etoile: if set, cf is ignored. - pool: default False
- True
if we assume that the population variances are equal
we use a t-distribution with (N1+N2-1) degrees of freedom
- False
if we assume that the population variances are not equal
we use a t-distribution with min(N1, N2)-1 degrees of freedom
Hyp - both populations follow a normal distribution, or the samples are large (>10) - the populations are independent of each other - simple random samples are used - for pool=True, the variances are assumed to be equal
- to test that, you can
- use the Levene test, [more robust than the Fisher or Bartlett tests when the data are not normal](https://fr.wikipedia.org/wiki/Test_de_Bartlett)
H0: the variances are equal; H1: they are not
`_, p_value = scipy.stats.levene(liste1, liste2, center='mean'); solution = "no equality" if p_value < 0.05 else "equality"`
- or check whether the IQRs are similar
IQR = quantile(75%) - quantile(25%)
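Both checks can be run side by side, as in the sketch below. The data are made up, the 0.05 threshold is the one quoted above, and `iqr` is a hypothetical helper written for this example.

```python
import numpy as np
from scipy import stats

# Made-up samples, for illustration only.
liste1 = [48.2, 51.5, 49.9, 53.1, 47.6, 50.4, 52.3, 49.1]
liste2 = [50.8, 55.2, 46.3, 58.9, 44.1, 52.7, 49.5, 57.0]

# Levene test: H0 = the variances are equal.
statistic, p_value = stats.levene(liste1, liste2, center="mean")
solution = "no equality" if p_value < 0.05 else "equality"


def iqr(values):
    """Interquartile range: quantile(75%) - quantile(25%)."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 - q25


print(f"Levene p-value: {p_value:.3f} -> {solution}")
print(f"IQR liste1: {iqr(liste1):.2f}, IQR liste2: {iqr(liste2):.2f}")
```

The IQR comparison is only an informal check; how close two IQRs must be to count as "the same" is a judgment call.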
Eqvl - scipy.stats.ttest_ind(liste1, liste2, equal_var=False | True)
Eqvl_pointWise estimation - Assume diff_mean = 82 - Result: diff_mean in CI = [77.33, 87.63] - If we test H0: p=80 vs H1: p>80, we fail to reject the null because H1 is not supported here - As a matter of fact, some values in the CI are below 80, which is not compatible with H1 => the test does not give enough evidence to reject H0
Theo - read [here](https://en.wikipedia.org/wiki/Student’s_t-distribution#How_Student’s_distribution_arises_from_sampling)
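A minimal sketch of the two variants, working from summary statistics. `diff_mean_interval` is a hypothetical name and this is not the package's code. The unpooled branch uses min(N1, N2) - 1 degrees of freedom as stated above; the pooled branch uses the textbook N1 + N2 - 2, which is an assumption of this sketch.

```python
from math import sqrt

from scipy import stats


def diff_mean_interval(n1, n2, diff_mean, std1, std2, cf=0.95, pool=False):
    """Hypothetical helper: t-based interval for a difference of two means."""
    if pool:
        # Equal-variance assumption: combine both spreads into one pooled estimate.
        pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2)
        se = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)
        df = n1 + n2 - 2  # textbook value, assumed here
    else:
        # Unequal variances: each sample contributes its own variance term.
        se = sqrt(std1**2 / n1 + std2**2 / n2)
        df = min(n1, n2) - 1  # conservative choice
    t_star = stats.t.ppf(1 - (1 - cf) / 2, df=df)
    return diff_mean - t_star * se, diff_mean + t_star * se


# Made-up summary statistics, for illustration only.
unpooled = diff_mean_interval(20, 20, diff_mean=3.0, std1=10.0, std2=10.0)
pooled = diff_mean_interval(20, 20, diff_mean=3.0, std1=10.0, std2=10.0, pool=True)
print("unpooled:", unpooled)
print("pooled:  ", pooled)
```

With equal sample sizes and equal spreads the standard errors coincide, so the pooled interval is narrower only because it has more degrees of freedom.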
- statanalysis.conf_inte_md.ci_estimators.CIE_ONE_PROPORTION(proportion, n, method, cf: float | None = None)
Get_interval_simple: estimate the proportion of an attribute value (e.g. male gender) in a population based on a sample (no sign problem) - cf: confidence_level (or coverage_probability) - proportion: the proportion measured on the sample - n: number of observations == len(sample) - method: “classic” or “conservative”
Hyp - the population should preferably follow a normal distribution; otherwise use a large sample (>10)
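The difference between the two methods can be sketched as below. This assumes a normal critical value and that “conservative” means replacing the estimated standard error by its worst case (p = 0.5); `proportion_interval` is a hypothetical helper, not the package's function.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(proportion, n, method="classic", cf=0.95):
    """Hypothetical helper: normal-approximation interval for one proportion."""
    z_star = NormalDist().inv_cdf(1 - (1 - cf) / 2)
    if method == "conservative":
        # Worst-case standard error, reached at p = 0.5.
        se = 1 / (2 * sqrt(n))
    else:
        # Standard error estimated from the observed proportion.
        se = sqrt(proportion * (1 - proportion) / n)
    return proportion - z_star * se, proportion + z_star * se


# 120 men among 300 respondents.
classic = proportion_interval(120 / 300, 300, method="classic")
conservative = proportion_interval(120 / 300, 300, method="conservative")
print("classic:     ", classic)
print("conservative:", conservative)
```

The conservative interval is never narrower than the classic one, since p(1 - p) is at most 0.25.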
- statanalysis.conf_inte_md.ci_estimators.CIE_PROPORTION_TWO(p1, p2, n1, n2, cf: float | None = None)
Get_interval_diff: estimate the difference in proportion (p1-p2) between two populations, based on a sample from each population - cf: confidence_level (or coverage_probability) - p1: proportion measured on liste1 (mean of liste1) - p2: proportion measured on liste2 (mean of liste2) - n1: len(liste1) - n2: len(liste2)
Hyp - the populations should preferably follow a normal distribution; otherwise use large samples (>10)
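A sketch of the usual normal-approximation interval for p1 - p2, assuming independent samples and an unpooled standard error. `diff_proportion_interval` is a hypothetical name and the figures are made up.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_interval(p1, p2, n1, n2, cf=0.95):
    """Hypothetical helper: interval for the difference p1 - p2."""
    z_star = NormalDist().inv_cdf(1 - (1 - cf) / 2)
    # Variances of two independent sample proportions add up.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se


low, high = diff_proportion_interval(p1=0.55, p2=0.40, n1=200, n2=250)
print(f"95% CI for p1 - p2: [{low:.3f}, {high:.3f}]")
```

If the whole interval lies above 0, the data are compatible with p1 being larger than p2 at the chosen confidence level.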
- statanalysis.conf_inte_md.ci_estimators.get_min_sample(moe: float, p=None, method=None, cf: float | None = None)
Get_min_sample: get the minimum sample size to use for a - input
cf: confidence (or coverage_probability): between 0 and 1
moe: margin of error
method (optional): “conservative” (default)
p: not used if method==”conservative”
Hyp - the population should preferably follow a normal distribution; otherwise use a large sample (>10)
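One common way to obtain such a minimum size is to invert the margin-of-error formula for a proportion, moe = z* · sqrt(p(1-p)/n). The sketch below assumes that formula and a normal critical value; `min_sample_size` is a hypothetical helper and may differ from what the package computes.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(moe, p=None, method="conservative", cf=0.95):
    """Hypothetical helper: smallest n whose margin of error is at most `moe`."""
    z_star = NormalDist().inv_cdf(1 - (1 - cf) / 2)
    if method == "conservative" or p is None:
        # Worst case: p(1 - p) is largest at p = 0.5, so p is not needed.
        p = 0.5
    # Solve moe = z* * sqrt(p(1-p)/n) for n and round up.
    return ceil(z_star**2 * p * (1 - p) / moe**2)


print(min_sample_size(moe=0.03))                           # conservative
print(min_sample_size(moe=0.03, p=0.1, method="classic"))  # uses the given p
```

A prior guess for p far from 0.5 reduces the required size noticeably, at the cost of relying on that guess.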
- statanalysis.conf_inte_md.ci_estimators.print(*args)
Some defs - parameter: A quantifiable characteristic of a population - confidence interval: range of reasonable values for the parameter
todo - use kwargs format when calling functions - reorder function arguments
- statanalysis.conf_inte_md.confidence_interval.IC_MEAN_ONE(sample: list, t_etoile=None, confidence: float | None = None)
Estimate_population_mean(ONE MEAN): we need the spread (std) of the population; since it is unknown, we use an estimate of it
- Data
confidence:..
sample: value…
Method - Use t-distribution to calculate few
Hypothesis - the samples follow a normal distribution (or are large enough to bypass this assumption) => the means of these samples follow a t-distribution
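Starting from raw data rather than summary statistics, the same kind of interval can be obtained with `scipy.stats.t.interval`. This is a sketch with made-up values, assuming the spread is estimated by the standard error of the mean (`scipy.stats.sem`, which uses the n - 1 denominator).

```python
import numpy as np
from scipy import stats

# Made-up sample, for illustration only.
sample = [72.0, 85.5, 90.0, 66.5, 78.0, 81.0, 94.5, 70.0, 88.0, 76.5]

n = len(sample)
mean = np.mean(sample)
# Estimated standard error of the mean: std(sample, ddof=1) / sqrt(n).
sem = stats.sem(sample)

# First positional argument is the confidence level.
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```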
- statanalysis.conf_inte_md.confidence_interval.IC_MEAN_TWO_NOTPAIR(sample1, sample2, pool=False, confidence: float | None = None)
Difference_population_means_for_nonpaired_data(TWO MEANS FOR NON-PAIRED DATA): We have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?
Construction - It is like
checking whether the magnitude of a feature changes when going from one category to another
- Example_contexte
- having a dataframe df, with 4 columns [name, score, equipe, role]
equipe: “1” or “2”
role: df.role.nunique = 11 => len(df)==22
Now there is a battle: for a “same role” fight, which team is the best?
- Example_question
- Is there a mean difference in education level based on gender?
if education levels are generally equal -> mean difference is 0
if education levels are unequal -> mean difference is not 0
So, look for 0 in the range of reasonable values
We need the spread (std): We will use an estimation
- Args
confidence:..
Sample1: list: values…
Sample2: list: (same len) values…
- pool: default False
- True
if we assume that the population variances are equal
we use a t-distribution with (N1+N2-1) degrees of freedom
- False
if we assume that the population variances are not equal
we use a t-distribution with min(N1, N2)-1 degrees of freedom
Method - Use t-distribution to calculate few - create joint confidence interval
Hypothesis - a random sample - the samples follow a normal distribution (or are large enough to bypass this assumption: 10 per category) => the means of these samples follow a t-distribution
Notes - With {cf} confidence, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]} - if all values are above 0, the difference is significant
- statanalysis.conf_inte_md.confidence_interval.IC_MEAN_TWO_PAIR(sample1, sample2, t_etoile=None, confidence: float | None = None)
Difference_population_means_for_paired_data(TWO MEANS FOR PAIRED DATA): We have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?
What is paired data ?
measurements taken on the same individuals (people, homes, any object)
- technicality:
When a dataset (len = n) has a column df.a whose values each appear exactly twice (=> df.a.nunique = n/2)
we can do a plot(x=feature1, y=feature2)
- examples
Each home needs a cabinet quote from two suppliers => we want to know if there is an average difference in quotes between these two suppliers
In a blind taste test to compare two new juice flavors, grape and apple, consumers were given a sample of each flavor and the results will be used to estimate the percentage of all such consumers who prefer the grape flavor to the apple flavor.
- Construction
- It is like,
checking whether the magnitude of a feature changes when going from one category to another, where each pair is split across the two categories
- Example_contexte
- having a dataframe df, with 4 columns [name, score, equipe, role]
equipe: “1” or “2”
role: df.role.nunique = 11 => len(df)==22
Now there is a battle: for a “same role” fight, which team is the best?
- Example_question
- Is there a mean difference between the education levels of twins?
if education levels are generally equal -> mean difference is 0
if education levels are unequal -> mean difference is not 0
So, look for 0 in the range of reasonable values
We need the spread (std): We will use an estimation
Eqvl - IC_MEAN_ONE(sample=sample1 - sample2, confidence=confidence), where the difference is taken element-wise
- Data
confidence:..
Sample1: list: values…
Sample2: list: (same len) values…
Method - Use t-distribution to calculate few - create joint confidence interval
Hypothesis - a random sample of identical twin sets - the samples follow a normal distribution (or are large enough to bypass this assumption, e.g. 20 twins) => the means of these samples follow a t-distribution
Notes - With {cf} confidence, the population mean difference of the (second_team - first_team) attribute is estimated to be between {data.interval[0]} and {data.interval[1]} - if all values are above 0, the difference is significant
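The reduction of the paired case to a one-mean interval on the differences can be sketched as follows. `paired_mean_interval` is a hypothetical helper, the data are made up, and the direction sample1 - sample2 is an assumption of this example.

```python
import numpy as np
from scipy import stats


def paired_mean_interval(sample1, sample2, confidence=0.95):
    """Hypothetical helper: t interval on the element-wise differences."""
    diffs = np.asarray(sample1, dtype=float) - np.asarray(sample2, dtype=float)
    n = diffs.size
    t_star = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    # One-sample interval applied to the n paired differences.
    margin = t_star * diffs.std(ddof=1) / np.sqrt(n)
    return diffs.mean() - margin, diffs.mean() + margin


# Made-up paired measurements: the i-th values belong to the same pair.
sample1 = [12, 15, 11, 14, 13]
sample2 = [10, 14, 9, 13, 10]
low, high = paired_mean_interval(sample1, sample2)
print(f"95% CI for the mean difference: [{low:.2f}, {high:.2f}]")
```

Because pairing removes the between-pair variability, this interval is typically narrower than the one obtained by treating the two samples as independent.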
- statanalysis.conf_inte_md.confidence_interval.IC_PROPORTION_ONE(sample_size: int, parameter: float, confidence: float | None = None, method: str | None = None)
Confidence_interval(ONE PROPORTION): confidence interval computation after a statistical test - input
sample_size: int: sample size (more than 10 to use this method)
parameter: float: the measurement on the sample
confidence: float: confidence level (between 0 and 1). The greater the confidence, the wider the interval
method: str: either “classic” (default) or “conservative”.
- Example:
how many men in the entire population with a con ?
a form filled in by 300 people shows that there are only 120 men => p = (120/300); N=300
- Hypothesis
the sample has more than 10 observations in each of the categories => we use the “Law of Large Numbers”
the sample proportion comes from data that is considered a simple random sample
- Idea
let P: the real proportion in the population
let S: Size of each sample == nb of observations per sample
For many samples, we calculate proportions per sample: ex: for N samples of size S => N proportions values
(p - P) / sqrt( p*(1-p)/S ) approximately follows a standard normal distribution
- Descriptions:
For a given population and a parameter P to find: if we repeated this study many times, each time drawing a new sample (of the same size {res.sample_size==S}) from which a {res.confidence} confidence interval is computed, then {res.confidence} of the resulting confidence intervals would be expected to contain the true value P
If the entire interval verifies a property, then it is reasonable to say that the parameter verifies that property
- Result
with {res.confidence} confidence, we estimate that the population proportion of men is between {res.left_tail} and {res.right_tail}
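The repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes a known true proportion (0.4, chosen for illustration), the classic normal-approximation interval, and a hypothetical `coverage` helper; it counts how often the interval contains the true value.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(true_p=0.4, sample_size=300, confidence=0.95, n_studies=2000, seed=0):
    """Hypothetical helper: share of simulated intervals that contain true_p."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(n_studies):
        # Draw one sample of `sample_size` yes/no answers.
        successes = sum(rng.random() < true_p for _ in range(sample_size))
        p_hat = successes / sample_size
        margin = z_star * sqrt(p_hat * (1 - p_hat) / sample_size)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / n_studies


observed = coverage()
print(f"share of intervals containing the true proportion: {observed:.3f}")
```

The observed share should land close to the nominal confidence level; it will not match it exactly because the number of simulated studies is finite and the normal approximation is itself approximate.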
- statanalysis.conf_inte_md.confidence_interval.IC_PROPORTION_TWO(p1, p2, N1, N2, confidence: float | None = None)
Difference_population_proportion(TWO PROPORTIONS): We have estimated a parameter p on two populations (1, 2). How do we estimate p1-p2?
- Method
create joint confidence interval
Construction - Comparison
Hypotheses - two independent random samples - large enough sample sizes: 10 per category (ex 10 yes, 10 no)
- statanalysis.conf_inte_md.confidence_interval.print(*args)