statanalysis.mdl_esti_md package

Submodules

class statanalysis.mdl_esti_md.hp_estimators_regression.ComputeRegression(logit=False, fit_intercept=True, alpha=None, debug=False)

Bases: object

_estimate_log_reg_coeff_std()

_summary_

Info
- useful: [web.stanford.edu](https://web.stanford.edu/class/archive/stats/stats200/stats200.1172/Lecture26.pdf)
- another (not used): [stats.stackexchange.com](https://stats.stackexchange.com/questions/60723/bias-of-maximum-likelihood-estimators-for-logistic-regression)

Returns:: _description_
Return type:: _type_

_estimate_logit_reg_coeffs(num_iterations: int | None = None, learning_rate: float | None = None, verbose: bool = True)

Some documentation:
- [implementation - github.com/susanli2016 - ipynb](https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Logistic%20Regression%20in%20Python%20-%20Step%20by%20Step.ipynb)
- [implementation - github.com/aihubprojects - ipynb](https://github.com/aihubprojects/Logistic-Regression-From-Scratch-Python/blob/master/LogisticRegressionImplementation.ipynb)
- [MLE - arunaddagatla.medium.com](https://arunaddagatla.medium.com/maximum-likelihood-estimation-in-logistic-regression-f86ff1627b67)
- [R2 interpretation - stats.stackexchange.com](https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation)
- [bias in logistic regression - stats.stackexchange](https://stats.stackexchange.com/questions/113766/omitted-variable-bias-in-logistic-regression-vs-omitted-variable-bias-in-ordina)
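A minimal sketch of the kind of gradient-descent fit that the `num_iterations` and `learning_rate` parameters suggest. The helper `fit_logit_gd`, its default values and the synthetic data are hypothetical, and this is not necessarily how the package estimates the coefficients.

```python
import numpy as np


def sigmoid(z):
    # Clip the input so np.exp does not overflow for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def fit_logit_gd(X, y, num_iterations=5000, learning_rate=0.1):
    """Hypothetical helper: batch gradient descent on the mean log-loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.zeros(X.shape[1])
    for _ in range(num_iterations):
        proba = sigmoid(X @ coeffs)
        gradient = X.T @ (proba - y) / len(y)  # gradient of the mean log-loss
        coeffs -= learning_rate * gradient
    return coeffs


# Synthetic data (made up): one noisy predictor plus an intercept column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
y = (x + 0.5 * rng.normal(size=200) > 0).astype(float)

coeffs = fit_logit_gd(X, y)
accuracy = np.mean((sigmoid(X @ coeffs) >= 0.5) == y)
```

The intercept is handled here by an explicit column of ones, which is one possible reading of `fit_intercept=True`.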

_pred_target(X)

Apply the sigmoid and return the predicted probabilities.

Parameters:: X (_type_) – _description_
Returns:: _description_
Return type:: _type_

count_nan_and_inf(arr)

fit(X, y, nb_iter: float | None = None, learning_rate: float | None = None)

_summary_

Parameters:

X (2-dim array) – list of columns (including slope) (n,nb_params)
y (1-dim array) – observations (n,)
alpha (_type_, optional) – _description_. Defaults to None.
debug (bool, optional) – _description_. Defaults to False.

Raises:

Exception – _description_

Returns:

_description_

Return type:

_type_

get_regression_results()

predict(X_test: ndarray, lim=0.5)

predict_proba(X_test: ndarray)

statanalysis.mdl_esti_md.hp_estimators_regression.log_loss(yp, y, min_tol: float | None = None)

statanalysis.mdl_esti_md.hp_estimators_regression.sigmoid(z)

Author: Susan Li source: LogisticRegressionImplementation.ipynb - github.com/aihubprojects

class statanalysis.mdl_esti_md.log_reg_example.LogisticRegression(learning_rate=0.01, num_iterations=50000, fit_intercept=True, verbose=False)

Bases: object

fit(X, y)

predict(X)

predict_prob(X)

We know why Student's t is useful; what about chi-squared? Do we know Fisher? Yes, F.

Add a function to predict
- pay attention to extrapolation (unseen data) vs interpolation
Add another for the curve showing the std
- the interval should get narrower as X approaches the sample mean
A good list of reminders about regression: [here - sites.ualberta.ca - pdf](https://sites.ualberta.ca/~lkgray/uploads/7/3/6/2/7362679/slides_-_multiplelinearregressionaic.pdf)
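The remark about the interval narrowing near the sample mean can be checked on a straight-line fit, where the standard error of the fitted mean at `x0` is `s * sqrt(1/n + (x0 - x_bar)**2 / Sxx)`. The sketch below uses made-up data, and the helper `std_of_fitted_mean` is hypothetical, not part of the package.

```python
import numpy as np

# Synthetic data (made up) for a straight-line fit.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=40)
y = 3.0 + 0.8 * x + rng.normal(scale=1.0, size=x.size)

n = x.size
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
slope = np.sum((x - x_bar) * (y - y.mean())) / sxx
intercept = y.mean() - slope * x_bar
residuals = y - (intercept + slope * x)
s_hat = np.sqrt(residuals @ residuals / (n - 2))  # residual standard deviation


def std_of_fitted_mean(x0):
    """Standard error of y_hat at x0 for a straight-line least-squares fit."""
    return s_hat * np.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)


inside = std_of_fitted_mean(x_bar)           # narrowest point of the band
outside = std_of_fitted_mean(x.max() + 5.0)  # extrapolation beyond the sample
```

The band is narrowest at `x_bar` and widens with the distance from it, which is why extrapolated predictions deserve more caution than interpolated ones.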

statanalysis.mdl_esti_md.model_estimator.ME_Normal_dist(sample: list, alpha=None, debug=False)

estimate a normal distribution from a sample

visualisation:
- check if normal:

    sns.distplot(data.X)

- check if the Q-Q plot is linear: [en.wikipedia.org](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot)

    from statsmodels.graphics.gofplots import qqplot
    from matplotlib import pyplot
    qqplot(sample, line='s')
    pyplot.show()

hypothesis - X = m + N(0,s**2)
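Under this hypothesis, `m` and `s` can be estimated from the sample mean and the sample standard deviation. A small sketch with made-up numbers, using only the standard library; the variable names are illustrative and not the function's actual outputs.

```python
import math
import statistics

sample = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.0]  # made-up measurements

m_hat = statistics.fmean(sample)            # estimate of m
s_hat = statistics.stdev(sample)            # sqrt of the (n - 1) sample variance
std_error = s_hat / math.sqrt(len(sample))  # standard error of m_hat
```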

check normal hypothesis: [machinelearningmastery](https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/)

length - you may need data over 1000 samples to get

statanalysis.mdl_esti_md.model_estimator.ME_Regression(x: list, y: list, degre: int, logit=False, fit_intercept=True, debug=False, alpha: float = 0.05, nb_iter: int = 100000, learning_rate: float = 0.1)

estimate a regression model from two samples

prediction
- predict Y conditional on X assuming that Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + N(0,s**2)
- Y is a dependent variable
- x, s are independent ones => predictors of the dependent variable
- If there is a time stamp of measures (or paired data), please add them as independent variables pr[0] + var_exp_1*G +

visualisation: - sns.scatterplot(X,Y)

hypothesis
- Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + err
- err ~ N(0,s**2)
- variance(error)==s**2 is the same across the data
- var(Y|X)==s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3
- pr[i] constant
- pr[i] not null => I add a hypothesis test (to reject the null H0:coeff==0 against H1:coeff!=0), not a confidence interval (to check that 0 is not in it)

prediction - each pr[i] has a mean and a std based on a normal distribution - Y too =>

Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*X^2 + pr_h[3]*X^3

Some models can predict quantile(Y, 95%), but I will just add std(y_hat) later. Hmm, isn't that s?
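The cubic model above can be fitted by ordinary least squares on the columns 1, X, X^2, X^3. A sketch on synthetic data; `true_pr` and the noise level are made up, and this may differ from what `ME_Regression` does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
degre = 3
x = np.linspace(-2.0, 2.0, 50)
design = np.vander(x, degre + 1, increasing=True)   # columns 1, X, X^2, X^3

true_pr = np.array([1.0, -2.0, 0.5, 0.3])           # made-up pr[0] .. pr[3]
y = design @ true_pr + rng.normal(scale=0.1, size=x.size)

pr_h, *_ = np.linalg.lstsq(design, y, rcond=None)   # least-squares estimates
y_hat = design @ pr_h                               # estimated Mean(Y) given X
residuals = y - y_hat
# Estimate of s, dividing by n minus the number of fitted coefficients.
s_hat = np.sqrt(residuals @ residuals / (x.size - (degre + 1)))
```

Here `s_hat` estimates the noise level `s`; the standard deviation of `y_hat` itself is a different quantity that depends on where X lies.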

predictors - pr[i], s**2

length - you may need data over 1000 samples to get

Others - Don't forget about the errors! Predictions carry uncertainty => [poorer fitted model => larger uncertainty]

utils - [standard error of the intercept - stats.stackexchange](https://stats.stackexchange.com/questions/173271/what-exactly-is-the-standard-error-of-the-intercept-in-multiple-regression-analy)

statanalysis.mdl_esti_md.model_estimator.ME_logistic_regression(X: list, y: list, debug=False, alpha=None)

statanalysis.mdl_esti_md.model_estimator.ME_multiple_regression(X: list, y: list, logit=False, fit_intercept=True, debug=False, alpha: float = 0.05, nb_iter: int = 100000, learning_rate: float = 0.1)

_summary_

Parameters:

X (list) – _description_
y (list) – _description_
debug (bool, optional) – _description_. Defaults to False.
alpha (_type_, optional) – _description_. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.

estimate a regression model from two samples

prediction
- predict Y conditional on X, B, G, … assuming that Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + N(0,s**2)
- Y is a dependent variable
- X, B, G, …, s are independent ones => predictors of the dependent variable
- If there is a time stamp of measures (or paired data), please add them as independent variables pr[0] + pr[4]*T1 +pr[5]*T2 +

=> The correlation of the repeated measures needs to be taken into account, and time since administration needs to be added to the model as an independent variable.

Questions of interest - Are you interested in establishing a relationship? - Are you interested in which predictors are driving that relationship?

visualisation:
- sns.scatterplot(X[i],y) for i in range(len(X))
- check for: form (linear or not); direction (positive or negative); strength of the collinearity; outliers

hypothesis
- Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + err
- err ~ N(0,s**2)
- variance(error)==s**2 is the same across the data
- var(Y|X)==s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G
- pr[i] constant
- pr[i] not null => I add a hypothesis test (to reject the null H0:coeff==0 against H1:coeff!=0), not a confidence interval (to check that 0 is not in it)
- no collinearity (a.k.a. multicollinearity) between predictors

a correlation will be computed

Anyway, it changes neither the predictive power nor the efficiency of the model

Also, I guess AIC selection removes one of them, right?

But the information about the coefficients is not reliable because there is repetition

Regression trees can handle correlated data well
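One simple way to look for collinearity is the pairwise correlation matrix of the predictors. The sketch below uses made-up predictors and an arbitrary 0.9 cut-off; both are assumptions, and the package may compute its correlation differently.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
B = 2.0 * X + rng.normal(scale=0.05, size=200)  # nearly a multiple of X
G = rng.normal(size=200)                         # unrelated predictor

corr = np.corrcoef(np.vstack([X, B, G]))         # 3 x 3 correlation matrix
threshold = 0.9                                  # arbitrary cut-off (assumption)
pairs = [
    (i, j)
    for i in range(3)
    for j in range(i + 1, 3)
    if abs(corr[i, j]) > threshold
]
```

With these data only the pair (X, B), that is indices (0, 1), is flagged.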

prediction - each pr[i] has a mean and a std based on a normal distribution - Y too =>

Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*B + pr_h[3]*G

Some models can predict quantile(Y, 95%), but I will just add std(y_hat) later. Hmm, isn't that s?
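Under the hypotheses listed above, the standard deviation of each estimated coefficient follows from `s**2 * inv(X'X)`, and dividing a coefficient by its standard deviation gives the statistic for testing H0:coeff==0. A sketch on synthetic data; the true coefficients and noise level are made up, and this is not claimed to be the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
X, B, G = rng.normal(size=(3, n))                 # three made-up predictors
y = 1.0 + 2.0 * X - 1.0 * B + 0.5 * G + rng.normal(scale=0.3, size=n)

design = np.column_stack([np.ones(n), X, B, G])   # intercept column first
pr_h, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ pr_h
residuals = y - y_hat

nb_param = design.shape[1]                        # includes the intercept
s2_hat = residuals @ residuals / (n - nb_param)   # estimate of s**2
cov = s2_hat * np.linalg.inv(design.T @ design)   # covariance matrix of pr_h
coeffs_std = np.sqrt(np.diag(cov))                # std of each pr_h[i]
t_stats = pr_h / coeffs_std                       # statistic for H0: coeff == 0
```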

predictors - pr[i], s**2

length - you may need data over 1000 samples to get

Others - Don't forget about the errors! Predictions carry uncertainty => [poorer fitted model => larger uncertainty]

Raises:: Exception – _description_
Returns:: _description_
Return type:: _type_

class statanalysis.mdl_esti_md.prediction_metrics.PredictionMetrics(y_true: list, y_pred_proba: list, binary: bool)

Bases: object

_log_likelihood_lin_reg(std_eval: float, debug=False)

_summary_

Parameters:

y (list) – _description_
self.y_pred (list) – _description_
std_eval (float) – _description_
debug (bool, optional) – _description_. Defaults to False.

Utils

[mle regression - cs.princeton.edu - pdf](https://www.cs.princeton.edu/courses/archive/fall18/cos324/files/mle-regression.pdf)

Returns:: _description_
Return type:: log_likelihood

compute_log_likelihood(std_eval: float | None = None, debug=False, min_tol: float = True)

_summary_

Parameters:

std_eval (float, optional) – (ignored if self.binary=True). Defaults to None.
debug (bool, optional) – _description_. Defaults to False.
min_tol (float, optional) – (ignored if self.binary=False). Defaults to None.

Returns:

_description_

Return type:

_type_
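The two cases described by the parameters can be sketched as a Gaussian log-likelihood (uses `std_eval`) and a Bernoulli log-likelihood (uses `min_tol`). Both helpers are hypothetical, and treating `min_tol` as a clipping tolerance on the probabilities is an assumption.

```python
import math


def gaussian_log_likelihood(y_true, y_pred, std_eval):
    """Log-likelihood of the residuals under N(0, std_eval**2)."""
    n = len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return -0.5 * n * math.log(2 * math.pi * std_eval ** 2) - sse / (2 * std_eval ** 2)


def bernoulli_log_likelihood(y_true, y_pred_proba, min_tol=1e-15):
    """Log-likelihood of 0/1 labels given predicted probabilities."""
    total = 0.0
    for yt, p in zip(y_true, y_pred_proba):
        p = min(max(p, min_tol), 1 - min_tol)  # keep log() away from 0 (assumed role of min_tol)
        total += yt * math.log(p) + (1 - yt) * math.log(1 - p)
    return total
```

Clipping keeps the binary log-likelihood finite even when a predicted probability is exactly 0 or 1.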

compute_mae()

get_binary_accuracy()

get_binary_regression_res()

get_confusion_matrix()

get_f1_score()

get_precision_score()

get_recall_score()

log_loss(min_tol: float = True)

log_loss_flat(min_tol: float | None = None)

statanalysis.mdl_esti_md.prediction_metrics.compute_aic_bic(dfr: int, n: int, llh: float, method: str = 'basic')

_summary_

Utils

It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model.
[aic and bic in python - medium.com/analytics-vidhya](https://medium.com/analytics-vidhya/probabilistic-model-selection-with-aic-bic-in-python-f8471d6add32)

Parameters:

dfr (int) – number of predictors (not including the intercept)
n (int) – number of observations
llh (float) – log likelihood

Question: what about mixed models ?

Returns:: aic
Return type:: float
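For reference, the textbook definitions are AIC = 2k - 2*llh and BIC = k*ln(n) - 2*llh, where k is the number of estimated parameters. The sketch below assumes k = dfr + 1 (predictors plus intercept); the helper `aic_bic` is hypothetical, and the package's `method` options may count parameters differently.

```python
import math


def aic_bic(dfr, n, llh):
    """Textbook AIC and BIC from a log-likelihood."""
    k = dfr + 1                        # assumption: predictors plus the intercept
    aic = 2 * k - 2 * llh
    bic = k * math.log(n) - 2 * llh
    return aic, bic


aic, bic = aic_bic(dfr=2, n=100, llh=-50.0)        # aic is 106.0
aic_more, _ = aic_bic(dfr=3, n=100, llh=-50.0)     # one more predictor, same llh
```

With the same log-likelihood, the extra predictor raises the AIC by 2, which is the penalty mentioned above.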

statanalysis.mdl_esti_md.prediction_metrics.compute_kurtosis(arr, residuals=None)

_summary_

Parameters:: y (list|array-like) – _description_

Utils - [kurtosis and skewness - spcforexcel.com](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics)

Returns:: _description_
Return type:: _type_

statanalysis.mdl_esti_md.prediction_metrics.compute_skew(arr)

_summary_

Parameters:: y (_type_) – _description_

Utils - [skewness and kurtosis - spcforexcel.com](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics) - [skewness - thoughtco.com](https://www.thoughtco.com/what-is-skewness-in-statistics-3126242)

Returns:: _description_
Return type:: _type_
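As a reminder of what skewness and kurtosis measure, here is a moment-based sketch. The helpers are hypothetical, use population (divide by n) moments, and report excess kurtosis (0 for a normal distribution); the package functions may use other conventions.

```python
def central_moment(arr, order):
    """Population central moment of the given order."""
    mean = sum(arr) / len(arr)
    return sum((v - mean) ** order for v in arr) / len(arr)


def skewness(arr):
    # Third standardized moment: 0 for symmetric data, > 0 for a long right tail.
    return central_moment(arr, 3) / central_moment(arr, 2) ** 1.5


def excess_kurtosis(arr):
    # Fourth standardized moment minus 3, so a normal distribution gives 0.
    return central_moment(arr, 4) / central_moment(arr, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 10]
```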

statanalysis.mdl_esti_md.prediction_results.HPE_REGRESSION_FISHER_TEST(y: list, y_hat: list, nb_param: int, alpha: float | None = None)

check if mean is equal across many samples

Args

y (list) – array-like of 1 dim
y_hat (list) – array-like of 1 dim
nb_param (int) – number of parameters in the regression (including the intercept). ex: for 6 independent variables, nb_param=7
alpha (float, optional) – _description_. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.

Hypothesis

H0: β1 = β2 = … = βk-1 = 0; k=nb_param
H1: βj ≠ 0, for at least one value of j

Assumptions

each sample is
- simple random
- normal
- independent of the others
same variance
- attention: use the Levene test (more robust than Fisher or Bartlett to non-normality of the data) (https://fr.wikipedia.org/wiki/Test_de_Bartlett)

Fisher test

The F Distribution is also called the Snedecor’s F, Fisher’s F or the Fisher–Snedecor distribution
[f_oneway - docs.scipy.org/doc](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html)
[anova and f test - blog.minitab.com](https://blog.minitab.com/fr/comprendre-lanalyse-de-la-variance-anova-et-le-test-f)
[f-test-reg - facweb.cs.depaul.edu/sjost](http://facweb.cs.depaul.edu/sjost/csc423/documents/f-test-reg.htm)
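The overall regression F statistic for H0 above compares the explained and unexplained sums of squares. A sketch with a hypothetical helper `regression_f_statistic`; it assumes the fit includes an intercept and was obtained by least squares, and it is not the package's `RegressionFisherTestData` computation.

```python
import numpy as np


def regression_f_statistic(y, y_hat, nb_param):
    """F = (explained SS / (k - 1)) / (residual SS / (n - k)), k = nb_param."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    sse = np.sum((y - y_hat) ** 2)       # unexplained (residual) sum of squares
    sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
    df_model = nb_param - 1              # slopes only, intercept excluded
    df_resid = n - nb_param
    return ((sst - sse) / df_model) / (sse / df_resid)


# Tiny made-up example: sst = 5, sse = 1, so F = (4 / 1) / (1 / 2) = 8.
f_stat = regression_f_statistic([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5], nb_param=2)
```

If SciPy is available, `scipy.stats.f.sf(f_stat, df_model, df_resid)` gives the corresponding p-value to compare with `alpha`.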

Returns:: data
Return type:: RegressionFisherTestData

class statanalysis.mdl_esti_md.prediction_results.RegressionResultData(y: numpy.ndarray, y_hat: numpy.ndarray, nb_obs: int, nb_param: int, alpha: float, coeffs: numpy.ndarray, list_coeffs_std: numpy.ndarray)

Bases: object

alpha: float

coeffs: ndarray

list_coeffs_std: ndarray

nb_obs: int

nb_param: int

residu_std: float

residuals: ndarray

y: ndarray

y_hat: ndarray

statanalysis.mdl_esti_md.prediction_results.compute_linear_regression_results(crd: RegressionResultData, debug: bool = False)

statanalysis.mdl_esti_md.prediction_results.compute_logit_regression_results(crd: RegressionResultData, debug: bool = False)

_summary_

Parameters:

crd (RegressionResultData) – _description_
debug (bool, optional) – _description_. Defaults to False.

Info
- [understand R's outputs - stats.stackexchange.com](https://stats.stackexchange.com/questions/86351/interpretation-of-rs-output-for-binomial-regression)
- [pseudo R-squared - stats.stackexchange.com](https://stats.stackexchange.com/questions/3559/which-pseudo-r2-measure-is-the-one-to-report-for-logistic-regression-cox-s)

Returns:: _description_
Return type:: _type_
