statanalysis.mdl_esti_md package
Submodules
- class statanalysis.mdl_esti_md.hp_estimators_regression.ComputeRegression(logit=False, fit_intercept=True, alpha=None, debug=False)
Bases: object
- _estimate_log_reg_coeff_std()
_summary_
Info
  - cool: [web.stanford.edu](https://web.stanford.edu/class/archive/stats/stats200/stats200.1172/Lecture26.pdf)
  - another (not used): [stats.stackexchange.com](https://stats.stackexchange.com/questions/60723/bias-of-maximum-likelihood-estimators-for-logistic-regression)
- Returns:
_description_
- Return type:
_type_
- _estimate_logit_reg_coeffs(num_iterations: int | None = None, learning_rate: float | None = None, verbose: bool = True)
some documentation:
  - [implementation - github.com/susanli2016 - ipynb](https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Logistic%20Regression%20in%20Python%20-%20Step%20by%20Step.ipynb)
  - [implementation - github.com/aihubprojects - ipynb](https://github.com/aihubprojects/Logistic-Regression-From-Scratch-Python/blob/master/LogisticRegressionImplementation.ipynb)
  - [MLE - arunaddagatla.medium.com](https://arunaddagatla.medium.com/maximum-likelihood-estimation-in-logistic-regression-f86ff1627b67)
  - [R2 interpretation - stats.stackexchange.com](https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation)
  - [bias in logistic regression - stats.stackexchange](https://stats.stackexchange.com/questions/113766/omitted-variable-bias-in-logistic-regression-vs-omitted-variable-bias-in-ordina)
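A minimal sketch of the kind of estimation the linked notebooks describe: maximum-likelihood logistic regression fitted by batch gradient descent. The function name `fit_logit_gd`, the default values and the toy data are assumptions made for illustration; this is not the package's actual implementation.

```python
import math

def sigmoid(z):
    """Logistic function in its plain form (fine for moderate |z|)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit_gd(X, y, num_iterations=5000, learning_rate=0.1):
    """Estimate logistic-regression coefficients by batch gradient descent.

    X is a list of rows (include a column of 1.0 for the intercept);
    y is a list of 0/1 labels. Returns the list of coefficients.
    """
    n, p = len(X), len(X[0])
    coeffs = [0.0] * p
    for _ in range(num_iterations):
        # Predicted probabilities under the current coefficients.
        probs = [sigmoid(sum(c * v for c, v in zip(coeffs, row))) for row in X]
        # Gradient of the mean negative log-likelihood: X^T (p - y) / n.
        grad = [sum((probs[i] - y[i]) * X[i][j] for i in range(n)) / n
                for j in range(p)]
        coeffs = [c - learning_rate * g for c, g in zip(coeffs, grad)]
    return coeffs

# Toy data (hypothetical): intercept column plus one feature.
# The classes overlap, so the maximum-likelihood estimate is finite.
X_demo = [[1.0, x] for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
y_demo = [0, 0, 1, 0, 1, 1]
coeffs_demo = fit_logit_gd(X_demo, y_demo)
```

With perfectly separable classes the coefficients would keep growing with the number of iterations, which is one reason an iteration cap and a learning rate are exposed as parameters.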
- _pred_target(X)
Apply the sigmoid and return the predicted probabilities.
- Parameters:
X (_type_) – _description_
- Returns:
_description_
- Return type:
_type_
- count_nan_and_inf(arr)
- fit(X, y, nb_iter: float | None = None, learning_rate: float | None = None)
_summary_
- Parameters:
X (2-dim array) – list of columns (including slope) (n,nb_params)
y (1-dim array) – observations (n,)
alpha (_type_, optional) – _description_. Defaults to None.
debug (bool, optional) – _description_. Defaults to False.
- Raises:
Exception – _description_
- Returns:
_description_
- Return type:
_type_
- get_regression_results()
- predict(X_test: ndarray, lim=0.5)
- predict_proba(X_test: ndarray)
- statanalysis.mdl_esti_md.hp_estimators_regression.log_loss(yp, y, min_tol: float | None = None)
- statanalysis.mdl_esti_md.hp_estimators_regression.sigmoid(z)
Author: Susan Li source: LogisticRegressionImplementation.ipynb - github.com/aihubprojects
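The sketch below shows one way a sigmoid and a log loss can be written so that they stay finite on extreme inputs. It assumes that `min_tol` is a clipping tolerance applied to the predicted probabilities; the source does not document that parameter, and the names `sigmoid_stable` and `log_loss_clipped` are hypothetical.

```python
import math

def sigmoid_stable(z):
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def log_loss_clipped(yp, y, min_tol=1e-15):
    """Mean binary cross-entropy of predicted probabilities yp against labels y.

    Assumption: min_tol clips probabilities to [min_tol, 1 - min_tol].
    """
    total = 0.0
    for p, t in zip(yp, y):
        p = min(max(p, min_tol), 1.0 - min_tol)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y)
```

Without clipping, a predicted probability of exactly 0 or 1 on the wrong label would make the loss infinite.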
- class statanalysis.mdl_esti_md.log_reg_example.LogisticRegression(learning_rate=0.01, num_iterations=50000, fit_intercept=True, verbose=False)
Bases: object
- fit(X, y)
- predict(X)
- predict_prob(X)
Notes: we know why Student's t is useful; what about chi-squared? And Fisher's F? Yes, F as well.
- add a function to predict
pay attention to extrapolation (unseen data) vs interpolation
- another for the curve showing the std
the interval should get narrower as X approaches the sample mean
A good list of reminders about regression: [here - sites.ualberta.ca - pdf](https://sites.ualberta.ca/~lkgray/uploads/7/3/6/2/7362679/slides_-_multiplelinearregressionaic.pdf)
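The remark about the interval width can be checked with the textbook standard error of the fitted mean in simple linear regression, s * sqrt(1/n + (x0 - x_bar)^2 / Sxx). The sketch below is an illustration under that assumption (one predictor, ordinary least squares); `mean_prediction_std` and the demo data are hypothetical names and values, not part of the package.

```python
import math

def mean_prediction_std(x, y, x0):
    """Standard error of the fitted mean at x0 for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard deviation with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    return s * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)

x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [1.1, 1.9, 3.2, 3.8, 5.1]
at_mean = mean_prediction_std(x_demo, y_demo, 3.0)    # x0 equals the sample mean
far_away = mean_prediction_std(x_demo, y_demo, 10.0)  # extrapolation
```

The term (x0 - x_bar)^2 vanishes at the sample mean, so `at_mean` is the smallest value the function can return for this data, and the error grows as x0 moves into extrapolation territory.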
- statanalysis.mdl_esti_md.model_estimator.ME_Normal_dist(sample: list, alpha=None, debug=False)
estimate a normal distribution from a sample
visualisation:
  - check whether the sample looks normal: sns.histplot(data.X, kde=True) (sns.distplot is deprecated since seaborn 0.11)
  - check whether the Q-Q plot is linear: [en.wikipedia.org](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot)

        from statsmodels.graphics.gofplots import qqplot
        from matplotlib import pyplot
        qqplot(sample, line='s')
        pyplot.show()
hypothesis - X = m + N(0,s**2)
check normal hypothesis: [machinelearningmastery](https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/)
length - you may need data over 1000 samples to get
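As an illustration of the model X = m + N(0, s**2), the sketch below estimates m and s from a sample and builds a two-sided confidence interval for m at level 1 - alpha. It uses the normal quantile, which is an approximation suited to large samples (a Student's t quantile would be more appropriate for small ones). `estimate_normal` and the default alpha of 0.05 are assumptions, not the behaviour of `ME_Normal_dist`.

```python
import math
from statistics import NormalDist, mean, stdev

def estimate_normal(sample, alpha=0.05):
    """Return (m, s, (low, high)): mean, std and an approximate CI for the mean."""
    n = len(sample)
    m = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 denominator)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided normal quantile
    half_width = z * s / math.sqrt(n)
    return m, s, (m - half_width, m + half_width)
```

For example, `estimate_normal([1.0, 2.0, 3.0, 4.0, 5.0])` returns a mean of 3.0 with an interval centred on it.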
- statanalysis.mdl_esti_md.model_estimator.ME_Regression(x: list, y: list, degre: int, logit=False, fit_intercept=True, debug=False, alpha: float = 0.05, nb_iter: int = 100000, learning_rate: float = 0.1)
estimate a regression model from two samples
prediction
  - predict Y conditional on X, assuming that Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + N(0,s**2)
  - Y is the dependent variable
  - x, s are independent ones => predictors of the dependent variable
  - if the measures carry a time stamp (or are paired data), add them as independent variables: pr[0] + var_exp_1*G +
visualisation: - sns.scatterplot(X,Y)
hypothesis
  - Y = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3 + err
  - err ~ N(0,s**2)
  - variance(error) == s**2 is the same across the data
  - var(Y|X) == s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*X^2 + pr[3]*X^3
  - pr[i] constant
  - pr[i] non-zero => a hypothesis test is added (to reject the null H0: coeff == 0 against H1: coeff != 0), rather than a confidence interval (to check whether 0 lies outside it)
prediction
  - each pr[i] has a mean and a std based on the normal distribution
  - Y too => Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*X^2 + pr_h[3]*X^3
  - some models can predict quantile(Y, 95%), but only std(y_hat) will be added later (open question: isn't that s?)
predictors - pr[i], s**2
length - you may need data over 1000 samples to get
Others - Don't forget about the errors! Predictions carry uncertainty: the more poorly fitted the model, the larger the uncertainty.
utils - [standard error of the intercept - stats.stackexchange](https://stats.stackexchange.com/questions/173271/what-exactly-is-the-standard-error-of-the-intercept-in-multiple-regression-analy)
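The polynomial model Y = pr[0] + pr[1]*X + ... + pr[degre]*X^degre can be fitted by least squares. The sketch below solves the normal equations with a small Gaussian elimination; it is only an illustration under that assumption (`fit_polynomial` and `solve_linear_system` are hypothetical helpers, and the source does not say how `ME_Regression` computes its coefficients).

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_polynomial(x, y, degre):
    """Least-squares coefficients pr[0..degre] of y ~ sum_k pr[k] * x**k."""
    p = degre + 1
    # Normal equations: (V^T V) pr = V^T y, with V[i][k] = x[i]**k.
    vtv = [[sum(xi ** (j + k) for xi in x) for k in range(p)] for j in range(p)]
    vty = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(p)]
    return solve_linear_system(vtv, vty)

x_demo = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_demo = [1.0 + 2.0 * xi - 0.5 * xi ** 2 for xi in x_demo]  # noiseless quadratic
pr_demo = fit_polynomial(x_demo, y_demo, degre=2)
```

The normal equations become ill-conditioned as the degree grows, so a QR- or SVD-based solver such as `numpy.linalg.lstsq` is usually preferable outside toy examples.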
- statanalysis.mdl_esti_md.model_estimator.ME_logistic_regression(X: list, y: list, debug=False, alpha=None)
- statanalysis.mdl_esti_md.model_estimator.ME_multiple_regression(X: list, y: list, logit=False, fit_intercept=True, debug=False, alpha: float = 0.05, nb_iter: int = 100000, learning_rate: float = 0.1)
_summary_
- Parameters:
X (list) – _description_
y (list) – _description_
debug (bool, optional) – _description_. Defaults to False.
alpha (_type_, optional) – _description_. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.
estimate a regression model from two samples
prediction
  - predict Y conditional on X, B, G, … assuming that Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + N(0,s**2)
  - Y is the dependent variable
  - x, B, G, …, s are independent ones => predictors of the dependent variable
  - if the measures carry a time stamp (or are paired data), add them as independent variables: pr[0] + pr[4]*T1 + pr[5]*T2 +
    => The correlation of the repeated measures needs to be taken into account, and time since administration needs to be added to the model as an independent variable.
Questions of interest
  - Are you interested in establishing a relationship?
  - Are you interested in which predictors are driving that relationship?
visualisation:
  - sns.scatterplot(X[i], y) for i in range(len(X))
  - check for: form (linear or not); direction (positive or negative); strength of the collinearity; outliers
hypothesis
  - Y = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G + err
  - err ~ N(0,s**2)
  - variance(error) == s**2 is the same across the data
  - var(Y|X) == s**2 ; E(Y|X) = pr[0] + pr[1]*X + pr[2]*B + pr[3]*G
  - pr[i] constant
  - pr[i] non-zero => a hypothesis test is added (to reject the null H0: coeff == 0 against H1: coeff != 0), rather than a confidence interval (to check whether 0 lies outside it)
  - no collinearity between predictors (a.k.a. no multicollinearity)
    - a correlation will be computed
    - in any case, collinearity changes neither the predictive power nor the efficiency of the model
    - AIC selection presumably removes one of the collinear predictors (to be confirmed)
    - however, the information about the coefficients is unreliable because the predictors are redundant
    - regression trees can handle correlated data well
prediction
  - each pr[i] has a mean and a std based on the normal distribution
  - Y too => Mean(Y) = y_hat = pr_h[0] + pr_h[1]*X + pr_h[2]*B + pr_h[3]*G
  - some models can predict quantile(Y, 95%), but only std(y_hat) will be added later (open question: isn't that s?)
predictors - pr[i], s**2
length - you may need data over 1000 samples to get
Others - Don't forget about the errors! Predictions carry uncertainty: the more poorly fitted the model, the larger the uncertainty.
- Raises:
Exception – _description_
- Returns:
_description_
- Return type:
_type_
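The collinearity check mentioned in the notes for `ME_multiple_regression` can be sketched as a pairwise Pearson correlation between predictor columns. The example assumes X is a list of columns (as the `sns.scatterplot(X[i], y)` note suggests); the 0.9 threshold and the names `pearson` and `collinear_pairs` are hypothetical choices, not taken from the package.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def collinear_pairs(X, threshold=0.9):
    """Return (i, j, r) for every pair of columns with |r| >= threshold."""
    flagged = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            r = pearson(X[i], X[j])
            if abs(r) >= threshold:
                flagged.append((i, j, r))
    return flagged

X_demo = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 7.8, 10.1],  # roughly twice the first column
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
pairs_demo = collinear_pairs(X_demo)  # only columns 0 and 1 are flagged
```

Pairwise correlation only detects collinearity between two predictors at a time; a predictor that is a combination of several others would need something like a variance inflation factor.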
- class statanalysis.mdl_esti_md.prediction_metrics.PredictionMetrics(y_true: list, y_pred_proba: list, binary: bool)
Bases: object
- _log_likelihood_lin_reg(std_eval: float, debug=False)
_summary_
- Parameters:
y (list) – _description_
self.y_pred (list) – _description_
std_eval (float) – _description_
debug (bool, optional) – _description_. Defaults to False.
- Utils
[mle regression - cs.princeton.edu - pdf](https://www.cs.princeton.edu/courses/archive/fall18/cos324/files/mle-regression.pdf)
- Returns:
_description_
- Return type:
log_likelihood
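Under the linear-regression assumption that residuals are N(0, std_eval**2), the log-likelihood has a closed form. The sketch below writes it out; `gaussian_log_likelihood` is a hypothetical name and the function takes y_true and y_pred explicitly, whereas the method above reads them from the instance.

```python
import math

def gaussian_log_likelihood(y_true, y_pred, std_eval):
    """Log-likelihood of y_true given y_pred with i.i.d. N(0, std_eval**2) errors.

    llh = -n/2 * log(2*pi*std^2) - SSE / (2*std^2)
    """
    n = len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    variance = std_eval ** 2
    return -0.5 * n * math.log(2.0 * math.pi * variance) - sse / (2.0 * variance)
```

For a fixed `std_eval`, larger residuals always lower the log-likelihood, which is why maximising it is equivalent to minimising the sum of squared errors.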
- compute_log_likelihood(std_eval: float | None = None, debug=False, min_tol: float = True)
_summary_
- Parameters:
std_eval (float, optional) – (ignored if self.binary=True). Defaults to None.
debug (bool, optional) – _description_. Defaults to False.
min_tol (float, optional) – (ignored if self.binary=False). Defaults to None.
- Returns:
_description_
- Return type:
_type_
- compute_mae()
- get_binary_accuracy()
- get_binary_regression_res()
- get_confusion_matrix()
- get_f1_score()
- get_precision_score()
- get_recall_score()
- log_loss(min_tol: float = True)
- log_loss_flat(min_tol: float | None = None)
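The binary metrics listed above all derive from the confusion matrix. The sketch below computes them from true labels and predicted probabilities; the 0.5 threshold, the [[tn, fp], [fn, tp]] layout (the convention scikit-learn uses) and the name `binary_metrics` are assumptions, since the source does not document how `PredictionMetrics` defines them.

```python
def binary_metrics(y_true, y_pred_proba, lim=0.5):
    """Confusion matrix, accuracy, precision, recall and F1 for 0/1 labels."""
    y_pred = [1 if p >= lim else 0 for p in y_pred_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators (no positive predictions or labels).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

demo = binary_metrics([1, 0, 1, 1, 0, 0], [0.9, 0.2, 0.4, 0.8, 0.6, 0.1])
```

In this toy example one positive is missed (0.4 falls below the threshold) and one negative is misclassified (0.6 falls above it), so precision, recall and F1 all equal 2/3.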
- statanalysis.mdl_esti_md.prediction_metrics.compute_aic_bic(dfr: int, n: int, llh: float, method: str = 'basic')
_summary_
- Utils
It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model.
[aic and bic in python - medium.com/analytics-vidhya](https://medium.com/analytics-vidhya/probabilistic-model-selection-with-aic-bic-in-python-f8471d6add32)
- Parameters:
dfr (int) – nb_predictors(not including the intercept)
n (int) – nb of observations
llh (float) – log likelihood
- Question
what about mixed models ?
- Returns:
aic
- Return type:
float
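The standard definitions are AIC = 2k - 2*llh and BIC = k*ln(n) - 2*llh, where k is the number of estimated parameters. The sketch below assumes k = dfr + 1 (the predictors plus the intercept); some conventions also count the error variance, and the source does not say which one the `'basic'` method uses. `aic_bic_sketch` is a hypothetical name.

```python
import math

def aic_bic_sketch(dfr, n, llh):
    """Return (aic, bic) for a model with dfr predictors fitted on n observations."""
    k = dfr + 1  # assumption: the intercept counts as an estimated parameter
    aic = 2.0 * k - 2.0 * llh
    bic = k * math.log(n) - 2.0 * llh
    return aic, bic
```

For instance, `aic_bic_sketch(2, 100, -50.0)` gives an AIC of 106.0; the BIC is larger because ln(100) exceeds 2, so BIC penalises each extra parameter more heavily once n is above about 7.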
- statanalysis.mdl_esti_md.prediction_metrics.compute_kurtosis(arr, residuals=None)
_summary_
- Parameters:
y (list|array-like) – _description_
Utils - [kurtosis and skewness - spcforexcel.com](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics)
- Returns:
_description_
- Return type:
_type_
- statanalysis.mdl_esti_md.prediction_metrics.compute_skew(arr)
_summary_
- Parameters:
y (_type_) – _description_
Utils - [skewness and kurtosis - spcforexcel.com](https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics) - [skewness - thoughtco.com](https://www.thoughtco.com/what-is-skewness-in-statistics-3126242)
- Returns:
_description_
- Return type:
_type_
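Skewness and kurtosis can be written in terms of central moments. The sketch below uses the population (biased) estimators and reports excess kurtosis (0 for a normal distribution), which match the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis`; whether `compute_skew` and `compute_kurtosis` follow the same convention is not stated in the source, so the function names here are hypothetical.

```python
def central_moment(arr, order):
    """Population central moment of the given order."""
    n = len(arr)
    m = sum(arr) / n
    return sum((v - m) ** order for v in arr) / n

def skewness(arr):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(arr, 3) / central_moment(arr, 2) ** 1.5

def excess_kurtosis(arr):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(arr, 4) / central_moment(arr, 2) ** 2 - 3.0
```

Applied to regression residuals, values far from 0 for either statistic suggest that the normality assumption on the errors does not hold.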
- statanalysis.mdl_esti_md.prediction_results.HPE_REGRESSION_FISHER_TEST(y: list, y_hat: list, nb_param: int, alpha: float | None = None)
check if the mean is equal across many samples
- Args
y (list): array-like of 1 dim
y_hat (list): array-like of 1 dim
nb_param (int): number of parameters in the regression (including the intercept). ex: for 6 independent variables, nb_params=7
alpha (float, optional): _description_. Defaults to COMMON_ALPHA_FOR_HYPH_TEST.
- Hypothesis
H0: β1 = β2 = … = βk-1 = 0; k=nb_params
H1: βj ≠ 0, for at least one value of j
- Assumptions
  - each sample is
    - simple random
    - normal
    - independent from the others
  - same variance
    note: use Levene's test (more robust than Fisher or Bartlett to non-normal data): https://fr.wikipedia.org/wiki/Test_de_Bartlett
- Fisher test
The F Distribution is also called the Snedecor’s F, Fisher’s F or the Fisher–Snedecor distribution
[f_oneway - docs.scipy.org/doc](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html)
[anova and f test - blog.minitab.com](https://blog.minitab.com/fr/comprendre-lanalyse-de-la-variance-anova-et-le-test-f)
[f-test-reg - facweb.cs.depaul.edu/sjost](http://facweb.cs.depaul.edu/sjost/csc423/documents/f-test-reg.htm)
- Returns:
data
- Return type:
RegressionFisherTestData
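For the overall regression test H0: β1 = … = βk-1 = 0, the F statistic compares the explained and residual mean squares: F = (SSR / (k - 1)) / (SSE / (n - k)). The sketch below computes the statistic and its degrees of freedom only; it assumes y_hat comes from a least-squares fit with an intercept, and `regression_f_statistic` is a hypothetical name rather than the function documented above.

```python
def regression_f_statistic(y, y_hat, nb_param):
    """Return (F, df_model, df_resid) for the overall regression F-test."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((f - y_bar) ** 2 for f in y_hat)         # explained sum of squares
    sse = sum((o - f) ** 2 for o, f in zip(y, y_hat))  # residual sum of squares
    df_model = nb_param - 1   # slopes only, the intercept is excluded
    df_resid = n - nb_param
    f_stat = (ssr / df_model) / (sse / df_resid)
    return f_stat, df_model, df_resid

f_demo, dfm_demo, dfr_demo = regression_f_statistic(
    [1.0, 2.0, 3.0, 4.0, 5.0], [1.2, 1.9, 3.0, 4.1, 4.8], nb_param=2
)
```

The p-value is the upper tail of the F distribution, which can be obtained with `scipy.stats.f.sf(f_stat, df_model, df_resid)`; H0 is rejected when it falls below alpha.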
- class statanalysis.mdl_esti_md.prediction_results.RegressionResultData(y: numpy.ndarray, y_hat: numpy.ndarray, nb_obs: int, nb_param: int, alpha: float, coeffs: numpy.ndarray, list_coeffs_std: numpy.ndarray)
Bases: object
- alpha: float
- coeffs: ndarray
- list_coeffs_std: ndarray
- nb_obs: int
- nb_param: int
- residu_std: float
- residuals: ndarray
- y: ndarray
- y_hat: ndarray
- statanalysis.mdl_esti_md.prediction_results.compute_linear_regression_results(crd: RegressionResultData, debug: bool = False)
- statanalysis.mdl_esti_md.prediction_results.compute_logit_regression_results(crd: RegressionResultData, debug: bool = False)
_summary_
- Parameters:
crd (RegressionResultData) – _description_
debug (bool, optional) – _description_. Defaults to False.
Info
  - [understand R's outputs - stats.stackexchange.com](https://stats.stackexchange.com/questions/86351/interpretation-of-rs-output-for-binomial-regression)
  - [pseudo-R2 - stats.stackexchange.com](https://stats.stackexchange.com/questions/3559/which-pseudo-r2-measure-is-the-one-to-report-for-logistic-regression-cox-s)
- Returns:
_description_
- Return type:
_type_
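One of the pseudo-R2 measures discussed in the links is McFadden's: 1 - llh_model / llh_null, where the null model predicts the base rate for every observation. The sketch below illustrates it; the source does not state which pseudo-R2 `compute_logit_regression_results` reports, so this choice and the helper names are assumptions. Probabilities must lie strictly between 0 and 1, and both classes must be present in y.

```python
import math

def bernoulli_log_likelihood(y, probs):
    """Log-likelihood of 0/1 labels y under predicted probabilities probs."""
    return sum(t * math.log(p) + (1 - t) * math.log(1.0 - p)
               for t, p in zip(y, probs))

def mcfadden_r2(y, y_hat):
    """McFadden's pseudo-R2: 1 - llh(model) / llh(intercept-only model)."""
    llh_model = bernoulli_log_likelihood(y, y_hat)
    base_rate = sum(y) / len(y)  # the intercept-only model predicts this everywhere
    llh_null = bernoulli_log_likelihood(y, [base_rate] * len(y))
    return 1.0 - llh_model / llh_null
```

A model that does no better than the base rate scores 0, and the value approaches 1 as the predicted probabilities approach the true labels.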