statanalysis.mdl_esti_md package

Submodules

class statanalysis.mdl_esti_md.hp_estimators_regression.ComputeRegression(logit=False, fit_intercept=True, alpha=None, debug=False)

Bases: object

_estimate_log_reg_coeff_std()

_summary_

Info:

  • cool: [web.stanford.edu](https://web.stanford.edu/class/archive/stats/stats200/stats200.1172/Lecture26.pdf)

  • another (not used): [stats.stackexchange.com](https://stats.stackexchange.com/questions/60723/bias-of-maximum-likelihood-estimators-for-logistic-regression)

Returns:

_description_

Return type:

_type_

_estimate_logit_reg_coeffs(num_iterations: int | None = None, learning_rate: float | None = None, verbose: bool = True)

Some documentation:

  • [implementation - github.com/susanli2016 - ipynb](https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Logistic%20Regression%20in%20Python%20-%20Step%20by%20Step.ipynb)

  • [implementation - github.com/aihubprojects - ipynb](https://github.com/aihubprojects/Logistic-Regression-From-Scratch-Python/blob/master/LogisticRegressionImplementation.ipynb)

  • [MLE - arunaddagatla.medium.com](https://arunaddagatla.medium.com/maximum-likelihood-estimation-in-logistic-regression-f86ff1627b67)

  • [R2 interpretation - stats.stackexchange.com](https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation)

  • [bias in logistic regression - stats.stackexchange](https://stats.stackexchange.com/questions/113766/omitted-variable-bias-in-logistic-regression-vs-omitted-variable-bias-in-ordina)
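A minimal sketch of what the references above describe: logistic-regression coefficients estimated by gradient descent on the mean log-loss. The helper name `fit_logit_gd`, its defaults and the simulated data are assumptions made for illustration; this is not the package's own code.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit_gd(X, y, num_iterations=5000, learning_rate=0.1):
    """Hypothetical helper: gradient descent on the mean log-loss.

    X is (n, nb_params) and is assumed to already hold a column of ones
    when an intercept is wanted; y is (n,) with values in {0, 1}.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.zeros(X.shape[1])
    for _ in range(num_iterations):
        p = sigmoid(X @ coeffs)
        gradient = X.T @ (p - y) / len(y)   # gradient of the mean log-loss
        coeffs -= learning_rate * gradient
    return coeffs

# Simulated data with a known positive slope (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
y = (rng.random(500) < sigmoid(0.5 + 2.0 * x)).astype(float)
coeffs = fit_logit_gd(X, y)
```

The returned array holds the intercept first and the slope second, in the column order of `X`.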

_pred_target(X)

Apply the sigmoid to the linear predictor and return the predicted probabilities.

Parameters:

X (_type_) – _description_

Returns:

_description_

Return type:

_type_
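A short sketch of the probability step and of the thresholding done with a cut-off such as `lim=0.5`. The standalone functions and the explicit `coeffs` argument are assumptions for illustration; the class presumably keeps its coefficients internally.

```python
import numpy as np

def predict_proba(X, coeffs):
    """Sigmoid of the linear predictor X @ coeffs, i.e. P(y = 1 | X)."""
    z = np.asarray(X, dtype=float) @ np.asarray(coeffs, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, coeffs, lim=0.5):
    """Label an observation 1 when its probability reaches the threshold lim."""
    return (predict_proba(X, coeffs) >= lim).astype(int)

X = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 3.0]])  # first column is the intercept
coeffs = np.array([0.0, 1.0])
proba = predict_proba(X, coeffs)
labels = predict(X, coeffs)
```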

count_nan_and_inf(arr)
fit(X, y, nb_iter: float | None = None, learning_rate: float | None = None)

_summary_

Parameters:
  • X (2-dim array) – list of columns (including slope) (n,nb_params)

  • y (1-dim array) – observations (n,)

  • alpha (_type_, optional) – _description_. Defaults to None.

  • debug (bool, optional) – _description_. Defaults to False.

Raises:

Exception – _description_

Returns:

_description_

Return type:

_type_

get_regression_results()
predict(X_test: ndarray, lim=0.5)
predict_proba(X_test: ndarray)
statanalysis.mdl_esti_md.hp_estimators_regression.log_loss(yp, y, min_tol: float | None = None)
statanalysis.mdl_esti_md.hp_estimators_regression.sigmoid(z)

Author: Susan Li source: LogisticRegressionImplementation.ipynb - github.com/aihubprojects

class statanalysis.mdl_esti_md.log_reg_example.LogisticRegression(learning_rate=0.01, num_iterations=50000, fit_intercept=True, verbose=False)

Bases: object

fit(X, y)
predict(X)
predict_prob(X)

We know why Student's t is useful; what about chi-squared? Do we know Fisher? Yes, F.

  • add a function to predict
    • pay attention to extrapolation (unseen data) vs interpolation

  • another one for the curve showing the std
    • the interval should be narrower when X reaches the sample mean

  • a good list of reminders about regression: [here - sites.ualberta.ca - pdf](https://sites.ualberta.ca/~lkgray/uploads/7/3/6/2/7362679/slides_-_multiplelinearregressionaic.pdf)
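The note about the interval being narrowest at the sample mean can be checked with a small sketch. For a simple linear regression, the standard error of the fitted mean at a point x0 is s * sqrt(1/n + (x0 - mean(x))^2 / Sxx). The helper name and the simulated data are assumptions for illustration.

```python
import numpy as np

def mean_prediction_std(x, y, x_new):
    """Standard error of the fitted mean at x_new for y = b0 + b1*x + err."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)               # highest degree first
    residuals = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(residuals**2) / (n - 2))    # residual std, 2 fitted params
    sxx = np.sum((x - x.mean()) ** 2)
    return s * np.sqrt(1.0 / n + (x_new - x.mean()) ** 2 / sxx)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=50)
# The last grid point lies outside the observed range (extrapolation)
grid = np.array([x.min(), x.mean(), x.max(), x.max() + 5.0])
std_at_grid = mean_prediction_std(x, y, grid)
```

The smallest value is reached at the sample mean, and the extrapolated point gets the largest one, which is why extrapolation deserves extra care.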

statanalysis.mdl_esti_md.model_estimator.ME_Normal_dist(sample: list, alpha=None, debug=False)

estimate a normal distribution from a sample

visualisation: - check if normal:

  • sns.distplot(data.X)

  • check if qq-plot is linear [en.wikipedia.org](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot)

        from statsmodels.graphics.gofplots import qqplot
        from matplotlib import pyplot
        qqplot(sample, line='s')
        pyplot.show()

hypothesis - X = m + N(0,s**2)

length - you may need data over 1000 samples to get
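A sketch of the estimation under the hypothesis X = m + N(0,s**2): point estimates of m and s, plus a t-based confidence interval for m. The helper name `estimate_normal`, the default `alpha=0.05` and the returned tuple are assumptions for illustration, not the output format of `ME_Normal_dist`.

```python
import numpy as np
from scipy import stats

def estimate_normal(sample, alpha=0.05):
    """Hypothetical helper: estimates of m and s and a t-interval for m."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    m_hat = sample.mean()
    s_hat = sample.std(ddof=1)   # ddof=1 gives the unbiased variance estimate
    half_width = stats.t.ppf(1 - alpha / 2, df=n - 1) * s_hat / np.sqrt(n)
    return m_hat, s_hat, (m_hat - half_width, m_hat + half_width)

rng = np.random.default_rng(2)
sample = rng.normal(loc=10.0, scale=3.0, size=2000)
m_hat, s_hat, ci = estimate_normal(sample)
```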

statanalysis.mdl_esti_md.model_estimator.ME_Regression(x: list, y: list, degre: int, logit=False, fit_intercept=True, debug=False, alpha: float = 0.05, nb_iter: int = 100000, learning_rate: float = 0.1)

estimate a regression model from two samples

prediction - predict Y conditional on X, assuming that Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + N(0,s**2)

  • Y is the dependent variable

  • X and s are independent ones => predictors of the dependent variable

  • If there is a time stamp of measures (or paired data), please add them as independent variables: pr[0] + var_exp_1*G +

visualisation: - sns.scatterplot(X,Y)

hypothesis

  • Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + err

  • err ~ N(0,s**2)

  • variance(error) == s**2 is the same across the data

  • var(Y|X) == s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3

  • pr[i] constant

  • pr[i] not null => I add a hypothesis test (to reject the null H0: coeff == 0 against H1: coeff != 0), not a confidence interval (to check whether 0 lies outside it)

prediction - each pr[i] has a mean and a std based on a normal distribution - Y too =>

  • Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*X^2 + pr_h[3]*X^3

  • Some models can predict quantile(Y, 95%), but I will just add std(y_hat) later (isn't that s?).

predictors - pr[i], s**2

length - you may need data over 1000 samples to get

Others - Don't forget about the errors! Predictions carry uncertainty => [poorer fitted model => larger uncertainty]

utils - [standard error of the intercept - stats.stackexchange](https://stats.stackexchange.com/questions/173271/what-exactly-is-the-standard-error-of-the-intercept-in-multiple-regression-analy)
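A sketch of the polynomial model described above: ordinary least squares for the pr[i], their standard errors, and the t-test of H0: pr[i] == 0. The helper name `fit_polynomial` and its return values are assumptions for illustration; `ME_Regression` may organise its results differently.

```python
import numpy as np
from scipy import stats

def fit_polynomial(x, y, degre):
    """Hypothetical helper: OLS fit of Y = pr[0] + pr[1]*X + ... + pr[degre]*X^degre."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.vander(x, N=degre + 1, increasing=True)    # columns 1, x, x^2, ...
    pr, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ pr
    dof = len(y) - X.shape[1]
    s2 = residuals @ residuals / dof                  # estimate of s**2
    pr_std = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    p_values = 2 * stats.t.sf(np.abs(pr / pr_std), df=dof)   # H0: pr[i] == 0
    return pr, pr_std, p_values

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=300)
y = 1.0 - 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=300)
pr, pr_std, p_values = fit_polynomial(x, y, degre=2)
```

A small p-value for pr[i] rejects H0: pr[i] == 0 at the chosen alpha, which is the test mentioned in the hypothesis list.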

statanalysis.mdl_esti_md.model_estimator.ME_logistic_regression(X: list, y: list, debug=False, alpha=None)
statanalysis.mdl_esti_md.model_estimator.ME_multiple_regression(X: list, y: list, logit=False, fit_intercept=True, debug=False, alpha: float = 0.05, nb_iter: int = 100000, learning_rate: float = 0.1)

_summary_

Parameters:
  • X (list) – _description_

  • y (list) – _description_

  • debug (bool, optional) – _description_. Defaults to False.

  • alpha (_type_, optional) – _description_. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.

estimate a regression model from two samples

prediction - predict Y conditional on X, B, G, … assuming that Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + N(0,s**2)

  • Y is the dependent variable

  • X, B, G, … and s are independent ones => predictors of the dependent variable

  • If there is a time stamp of measures (or paired data), please add them as independent variables: pr[0] + pr[4]*T1 + pr[5]*T2 +

=> The correlation of the repeated measures needs to be taken into account, and time since administration needs to be added to the model as an independent variable.

Questions of interest

  • Are you interested in establishing a relationship?

  • Are you interested in which predictors are driving that relationship?

visualisation

  • sns.scatterplot(X[i],y) for i in range(len(X))

  • check for form (linear or not), direction (positive or negative), strength of the collinearity, and outliers

hypothesis

  • Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + err

  • err ~ N(0,s**2)

  • variance(error) == s**2 is the same across the data

  • var(Y|X) == s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G

  • pr[i] constant

  • pr[i] not null => I add a hypothesis test (to reject the null H0: coeff == 0 against H1: coeff != 0), not a confidence interval (to check whether 0 lies outside it)

  • no collinearity between predictors (a.k.a. multicollinearity)
    • a correlation will be computed
    • anyway, it changes neither the predictive power nor the efficiency of the model
    • also, I guess AIC selection removes one of the correlated predictors, right?
    • but the information about the coefficients is unreliable because the predictors repeat each other
    • regression trees can handle correlated data well
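A sketch of one way to check for multicollinearity: the correlation matrix of the predictors and the variance inflation factor (VIF) of each one. The VIF is a swapped-in diagnostic, not something the docstring says the package computes, and the helper name is an assumption.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X (predictors only, no intercept column).

    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing column j
    on the remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ beta) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        vifs[j] = ss_tot / ss_res            # equals 1 / (1 - R2_j)
    return vifs

rng = np.random.default_rng(4)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)      # nearly a copy of a
g = rng.normal(size=200)                     # unrelated predictor
X = np.column_stack([a, b, g])
corr = np.corrcoef(X, rowvar=False)
vifs = variance_inflation_factors(X)
```

A large VIF for a and b flags the pair as redundant, while g stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.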

prediction - each pr[i] has a mean and a std based on a normal distribution - Y too =>

  • Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*B + pr_h[3]*G

  • Some models can predict quantile(Y, 95%), but I will just add std(y_hat) later (isn't that s?).

predictors - pr[i], s**2

length - you may need data over 1000 samples to get

Others - Don't forget about the errors! Predictions carry uncertainty => [poorer fitted model => larger uncertainty]

Raises:

Exception – _description_

Returns:

_description_

Return type:

_type_

class statanalysis.mdl_esti_md.prediction_metrics.PredictionMetrics(y_true: list, y_pred_proba: list, binary: bool)

Bases: object

_log_likelihood_lin_reg(std_eval: float, debug=False)

_summary_

Parameters:
  • y (list) – _description_

  • self.y_pred (list) – _description_

  • std_eval (float) – _description_

  • debug (bool, optional) – _description_. Defaults to False.

Utils
Returns:

_description_

Return type:

log_likelihood
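A sketch of the Gaussian log-likelihood that a linear regression implies when the errors are independent N(0, std_eval**2). The standalone function with explicit `y` and `y_pred` arguments is an assumption; the method itself reads them from the instance.

```python
import numpy as np

def log_likelihood_lin_reg(y, y_pred, std_eval):
    """Log-likelihood of y under y ~ N(y_pred, std_eval**2), independent errors."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y)
    sse = np.sum((y - y_pred) ** 2)
    return -0.5 * n * np.log(2 * np.pi * std_eval**2) - sse / (2 * std_eval**2)

y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
llh = log_likelihood_lin_reg(y, y_pred, std_eval=0.5)
```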

compute_log_likelihood(std_eval: float | None = None, debug=False, min_tol: float = True)

_summary_

Parameters:
  • std_eval (float, optional) – (ignored if self.binary=True). Defaults to None.

  • debug (bool, optional) – _description_. Defaults to False.

  • min_tol (float, optional) – (ignored if self.binary=False). Defaults to None.

Returns:

_description_

Return type:

_type_

compute_mae()
get_binary_accuracy()
get_binary_regression_res()
get_confusion_matrix()
get_f1_score()
get_precision_score()
get_recall_score()
log_loss(min_tol: float = True)
log_loss_flat(min_tol: float | None = None)
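A sketch of the binary log-loss (mean cross-entropy). Here `min_tol` is assumed to be a clipping tolerance that keeps the probabilities away from 0 and 1 so the logarithm stays finite; the meaning of `min_tol` in the package is not documented above, so treat this as a guess.

```python
import numpy as np

def log_loss(y_true, y_pred_proba, min_tol=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred_proba, dtype=float), min_tol, 1 - min_tol)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = [1, 0, 1, 0]
loss_good = log_loss(y_true, [0.9, 0.1, 0.8, 0.2])
loss_bad = log_loss(y_true, [0.0, 1.0, 0.5, 0.5])   # finite thanks to clipping
```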
statanalysis.mdl_esti_md.prediction_metrics.compute_aic_bic(dfr: int, n: int, llh: float, method: str = 'basic')

_summary_

Utils
Parameters:
  • dfr (int) – nb_predictors(not including the intercept)

  • n (int) – nb of observations

  • llh (float) – log likelihood

Question

what about mixed models ?

Returns:

aic

Return type:

float
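A sketch of the usual definitions, AIC = 2k - 2*llh and BIC = k*ln(n) - 2*llh. It assumes the parameter count is k = dfr + 1 (predictors plus intercept); some conventions also count the error variance. The `method` argument of the real function is not covered, and returning both values as a tuple is an assumption.

```python
import math

def compute_aic_bic(dfr, n, llh):
    """AIC and BIC from the log-likelihood llh.

    Assumption: the parameter count is dfr + 1 (predictors plus intercept).
    """
    k = dfr + 1
    aic = 2 * k - 2 * llh
    bic = k * math.log(n) - 2 * llh
    return aic, bic

aic, bic = compute_aic_bic(dfr=3, n=100, llh=-120.0)
```

Lower values are better for both criteria. BIC penalises extra parameters more heavily than AIC as soon as ln(n) exceeds 2, i.e. for n of 8 or more.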

statanalysis.mdl_esti_md.prediction_metrics.compute_kurtosis(arr, residuals=None)

_summary_

Parameters:

y (list|array-like) – _description_

Utils - [kurtosis and skewness - spcforexcel.com](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics)

Returns:

_description_

Return type:

_type_

statanalysis.mdl_esti_md.prediction_metrics.compute_skew(arr)

_summary_

Parameters:

y (_type_) – _description_

Utils:

  • [skewness and kurtosis - spcforexcel.com](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics)

  • [skewness - thoughtco.com](https://www.thoughtco.com/what-is-skewness-in-statistics-3126242)

Returns:

_description_

Return type:

_type_
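A sketch of moment-based skewness and excess kurtosis, using the biased (population) form of the moments. Whether the package uses this form or a bias-corrected one is not stated above, so the formulas and the name `compute_excess_kurtosis` are assumptions.

```python
import numpy as np

def compute_skew(arr):
    """Sample skewness: third standardized central moment (biased form)."""
    arr = np.asarray(arr, dtype=float)
    centered = arr - arr.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

def compute_excess_kurtosis(arr):
    """Fourth standardized central moment minus 3; about 0 for a normal sample."""
    arr = np.asarray(arr, dtype=float)
    centered = arr - arr.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0

rng = np.random.default_rng(5)
normal_sample = rng.normal(size=20000)
skewed_sample = rng.exponential(size=20000)   # right-skewed distribution
```

Applied to regression residuals, values far from 0 suggest the normal-error hypothesis is doubtful.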

statanalysis.mdl_esti_md.prediction_results.HPE_REGRESSION_FISHER_TEST(y: list, y_hat: list, nb_param: int, alpha: float | None = None)

check if mean is equal across many samples

Args

  • y (list): array-like of 1 dim

  • y_hat (list): array-like of 1 dim

  • nb_param (int): number of parameters in the regression (including the intercept). ex: for 6 independent variables, nb_param=7

  • alpha (float, optional): _description_. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.

Hypothesis

  • H0: β1 = β2 = … = βk-1 = 0; k=nb_param

  • H1: βj ≠ 0, for at least one value of j

Fisher test
Returns:

(RegressionFisherTestData)

Return type:

data
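A sketch of the overall F-test behind these hypotheses, F = (SS_model / (k-1)) / (SS_resid / (n-k)) with k = nb_param. The function name, the default `alpha=0.05` and the returned tuple are assumptions; the real function returns a `RegressionFisherTestData` object whose fields are not listed above.

```python
import numpy as np
from scipy import stats

def regression_fisher_test(y, y_hat, nb_param, alpha=0.05):
    """Overall F-test of H0: all slopes are 0 (nb_param includes the intercept)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    df_model = nb_param - 1
    df_resid = n - nb_param
    ss_resid = np.sum((y - y_hat) ** 2)
    ss_model = np.sum((y_hat - y.mean()) ** 2)   # valid for OLS with an intercept
    f_stat = (ss_model / df_model) / (ss_resid / df_resid)
    p_value = stats.f.sf(f_stat, df_model, df_resid)
    return f_stat, p_value, p_value < alpha

rng = np.random.default_rng(6)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)
slope, intercept = np.polyfit(x, y, deg=1)
f_stat, p_value, reject_h0 = regression_fisher_test(y, intercept + slope * x, nb_param=2)
```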

class statanalysis.mdl_esti_md.prediction_results.RegressionResultData(y: numpy.ndarray, y_hat: numpy.ndarray, nb_obs: int, nb_param: int, alpha: float, coeffs: numpy.ndarray, list_coeffs_std: numpy.ndarray)

Bases: object

alpha: float
coeffs: ndarray
list_coeffs_std: ndarray
nb_obs: int
nb_param: int
residu_std: float
residuals: ndarray
y: ndarray
y_hat: ndarray
statanalysis.mdl_esti_md.prediction_results.compute_linear_regression_results(crd: RegressionResultData, debug: bool = False)
statanalysis.mdl_esti_md.prediction_results.compute_logit_regression_results(crd: RegressionResultData, debug: bool = False)

_summary_

Parameters:
  • crd (RegressionResultData) – _description_

  • debug (bool, optional) – _description_. Defaults to False.

Info:

  • [understand R's outputs - stats.stackexchange.com](https://stats.stackexchange.com/questions/86351/interpretation-of-rs-output-for-binomial-regression)

  • [pseudo R-squared - stats.stackexchange.com](https://stats.stackexchange.com/questions/3559/which-pseudo-r2-measure-is-the-one-to-report-for-logistic-regression-cox-s)

Returns:

_description_

Return type:

_type_
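A sketch of one pseudo R-squared discussed in the links above, McFadden's: 1 - llh(model) / llh(null), where the null model predicts the observed base rate for every observation. Which pseudo R-squared the package actually reports is not stated here, so this choice and both helper names are assumptions.

```python
import numpy as np

def binary_log_likelihood(y, p, min_tol=1e-15):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), min_tol, 1 - min_tol)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def mcfadden_pseudo_r2(y, y_pred_proba):
    """1 - llh(model) / llh(null); the null model predicts the base rate."""
    y = np.asarray(y, dtype=float)
    llh_model = binary_log_likelihood(y, y_pred_proba)
    llh_null = binary_log_likelihood(y, np.full(len(y), y.mean()))
    return 1.0 - llh_model / llh_null

y = np.array([0, 0, 1, 1, 1, 0])
r2_informative = mcfadden_pseudo_r2(y, [0.1, 0.2, 0.9, 0.8, 0.7, 0.3])
r2_null = mcfadden_pseudo_r2(y, np.full(6, 0.5))   # base rate here is 0.5
```

A model no better than the base rate scores 0, and the score moves toward 1 as the predicted probabilities approach the observed labels.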

Module contents