regresní analýza

Inbigdataanalysis,regressionanalysisisapredictivemodelingtechniquethatstudiestherelationshipbetweenthedependentvariable(target)andtheindependentvariable(predictor).Thistechniqueiscommonlyusedforpredictiveanalysis,timeseriesmodeling,anddiscoveryofcausalrelationshipsbetweenvariables.Forexample,thebestwaytostudytherelationshipbetweendriver'srecklessdrivingandthenumberofroadtrafficaccidentsisregression.

Metody

Therearevariousregressiontechniquesforprediction.Thesetechniquesmainlyhavethreemeasures(thenumberofindependentvariables,thetypeofdependentvariable,andtheshapeoftheregressionline).

1.Lineární regrese

Itisoneofthemostwell-knownmodelingtechniques.Linearregressionisusuallyoneofthepreferredtechniqueswhenpeoplelearnpredictivemodels.Inthistechnique,thedependentvariableiscontinuous,theindependentvariablecanbecontinuousordiscrete,andthenatureoftheregressionlineislinear.

Linearregressionusesthebestfittedstraightline(alsoknownastheregressionline)toestablisharelationshipbetweenthedependentvariable(Y)andoneormoreindependentvariables(X).

Vícenásobnou lineární regresi lze vyjádřit jako Y=a+b1*X+b2*X2+e, kde představuje průsečík, představuje sklon přímky a je chybovým termínem. Vícenásobná lineární regrese může předpovědět hodnotu cílové proměnné na základě daných proměnných (proměnných).

2.Logistická regreseLogistická regrese

Logistická regrese se používá k výpočtu pravděpodobnosti "události=úspěchu"a"události=selhání". Když je typ závislé proměnnéabinární(1/0,pravda/nepravda,ano/ne)proměnná,měla by se použít logistickáregrese.Zde,hodnotaYje0nebo1,ato může býtvyjádřenorovnicí

odds=p/(1-p)=pravděpodobnost výskytu události/pravděpodobnost výskytu události

ln(pravděpodobnost)=ln(p/(1–p))

logit(p)=ln(p/(1-p))=b0+b1X1+b2X2+b3X3....+bkXk

Intheaboveformula,theexpressionphasTheprobabilityofacertainfeature.Youshouldaskthequestion:"Whyuselogarithmintheformula?".

Becausethebinomialdistribution(dependentvariable)isusedhere,itisnecessarytochoosealinkfunctionthatisbestforthisdistribution.ItistheLogitfunction.Intheaboveequation,theparametersareselectedbyobservingthemaximumlikelihoodestimatesofthesample,ratherthanminimizingthesumofsquareerror(asusedinordinaryregression).

3. Polynomiální regrese

Foraregressionequation,iftheindexoftheindependentvariableisgreaterthan1,thenitisapolynomialregressionequation.Asshowninthefollowingequation:

y=a+b*x^2

Inthisregressiontechnique,thebestfitlineisnotastraightline.Itisacurveusedtofitthedatapoints.

4.StepwiseRegrese

Thisformofregressioncanbeusedwhendealingwithmultipleindependentvariables.Inthistechnique,theselectionofindependentvariablesisdoneinanautomaticprocess,includingnon-humanoperations.

Thisfeatistoidentifyimportantvariablesbyobservingstatisticalhodnotas,suchasR-square,t-statsandAICindicators.Stepwiseregressionfitsthemodelbyadding/removingcovariatesbasedonspecifiedcriteriaatthesametime.Someofthemostcommonlyusedstepwiseregressionmethodsarelistedbelow:

Thestandardstepwiseregressionmethoddoestwothings.Thatis,thepredictionrequiredforeachstepisaddedanddeleted.

Theforwardselectionmethodstartswiththemostsignificantpredictioninthemodel,andthenaddsvariablesforeachstep.

Thebackwardeliminationmethodstartsatthesametimeasallpredictionsofthemodel,andtheneliminatestheleastsignificantvariableateachstep.

Thepurposeofthismodelingtechniqueistousetheleastnumberofpredictorstomaximizepredictivepower.Thisisalsooneofthewaystodealwithhigh-dimensionaldatasets.2

5.Regrese hřebene

Whenthereismultiplecollinearity(independentvariablesarehighlycorrelated)betweenthedata,ridgeregressionanalysisisrequired.Inthepresenceofmulticollinearity,althoughtheestimatedhodnotameasuredbytheleastsquaremethod(OLS)doesnothaveabias,theirvariancewillbelarge,whichmakestheobservedhodnotaveryfarfromthetruehodnota.Ridgeregressionreducesthestandarderrorbyaddingadeviationhodnotatotheregressionestimate.

Inthelinearequation,thepredictionerrorcanbedividedinto2components,oneiscausedbybiasandtheotheriscausedbyvariance.Thepredictionerrormaybecausedbyeitherorbothofthese.Here,theerrorcausedbyvariancewillbediscussed.

Ridgeregressionsolvestheproblemofmulticollinearitythroughtheshrinkageparameterλ(lambda).Considerthefollowingequation:

L2=argmin||y=xβ||

+λ||β||

Inthisformula,Therearetwocomponents.Thefirstistheleastsquareterm,andtheotherisλtimesβ-square,whereβisthecorrelationcoefficientvector,whichisaddedtotheleastsquaretermtogetherwiththeshrinkageparametertogetaverylowvariance.

6.LassoRegrese

Itissimilartoridgeregression.Lasso(LeastAbsoluteShrinkageandSelectionOperator)willalsogiveapenaltyhodnotatotheregressioncoefficientvector.Inaddition,itcanreducethedegreeofvariationandimprovetheaccuracyofthelinearregressionmodel.Takealookatthefollowingformula:

L1=agrmin||y-xβ||

+λ||β||

LassoregressionandRidgeregressionhaveOnedifferenceisthatthepenaltyfunctionitusesistheL1norm,nottheL2norm.Thisleadstoapenalty(orequaltothesumoftheabsolutehodnotaoftheconstraintestimate)hodnotathatmakessomeparameterestimatesequaltozero.Thelargerthepenaltyhodnotais,thefurtherestimationwillmakethereductionhodnotaclosertozero.Thiswillresultintheselectionofvariablesfromthegivennvariables.

Ifthepredictedsetofvariablesishighlycorrelated,Lassowillselectoneofthevariablesandshrinktheotherstozero.

7.Elastická netregrese

ElasticNetisamixtureofLassoandRidgeregressiontechniques.ItusesL1fortrainingandL2firstastheregularizationmatrix.ElasticNetisusefulwhentherearemultiplerelatedfeatures.Lassowillpickoneofthematrandom,whileElasticNetwillchoosetwo.

ThepracticaladvantagebetweenLassoandRidgeisthatitallowsElasticNettoinheritsomeofthestabilityofRidgeintheloopstate.

Dataexplorationisaninevitablepartofbuildingapredictivemodel.Itshouldbethefirststepwhenchoosingasuitablemodel,suchasidentifyingtherelationshipandinfluenceofvariables.Moresuitablefortheadvantagesofdifferentmodels,youcananalyzedifferentindexparameters,suchasstatisticallysignificantparameters,R-square,AdjustedR-square,AIC,BIC,anderrorterms.TheotheristheMallows’Cpcriterion.Thisismainlybycomparingthemodelwithallpossiblesub-models(orchoosingthemcarefully)andcheckingforpossibledeviationsinyourmodel.

Crossvalidationisthebestwaytoevaluatepredictivemodels.Here,divideyourdatasetintotwo(onefortrainingandoneforvalidation).Useasimplemeansquareerrorbetweentheobservedhodnotaandthepredictedhodnotatomeasureyourpredictionaccuracy.

Ifyourdatasetismultiplemixedvariables,thenyoushouldnotchoosetheautomaticmodelselectionmethod,becauseyoushouldnotwanttoputallthevariablesinthesamemodelatthesametime.

Itwillalsodependonyourpurpose.Itmayhappenthatalesspowerfulmodeliseasiertoimplementthanahighlystatisticallysignificantmodel.Regressionregularizationmethods(Lasso,RidgeandElasticNet)workwellinthecaseofhigh-dimensionalandmulticollinearitybetweendatasetvariables.3

Předpoklady a obsah

Indataanalysis,someconditionalassumptionsaregenerallyrequiredforthedata:

Homogenita odchylky

LinearityRelations

Akumulační efekty

Proměnná chyba měření jedu

Proměnné sledují vícerozměrné normální rozdělení

Dodržujte nezávislost

Themodeliscomplete(novariablesthatshouldnotbeentered,andnovariablesthatshouldbeenteredarenotincluded)

Chyba je nezávislá a je (0,1) normální rozdělení.

Realisticdataoftencannotfullycomplywiththeaboveassumptions.Therefore,statisticianshavedevelopedmanyregressionmodelstosolvetheconstraintsoftheassumedprocessoflinearregressionmodels.

Hlavním obsahem regresní analýzy je:

①Startingfromasetofdata,determinethequantitativerelationshipbetweencertainvariables,thatis,establishamathematicalmodelandestimatetheunknownparameters.Thecommonmethodofestimatingparametersistheleastsquaresmethod.

②Otestujte důvěryhodnost těchto vztahů.

③Intherelationshipwheremanyindependentvariablesaffectadependentvariabletogether,determinewhich(orwhich)independentvariableshavesignificanteffects,andwhichindependentvariableshaveinsignificanteffects,willaffectsignificantTheindependentvariablesareaddedtothemodel,andtheinsignificantvariablesareeliminated,usuallybystepwiseregression,forwardregression,andbackwardregression.

④Usetherequiredrelationshiptopredictorcontrolacertainproductionprocess.Theapplicationofregressionanalysisisveryextensive,andthestatisticalsoftwarepackagemakesthecalculationofvariousregressionmethodsveryconvenient.

Inregressionanalysis,variablesaredividedintotwocategories.Onetypeisdependentvariables,whichareusuallyatypeofindexthatisconcernedinactualproblems,usuallyrepresentedbyY;andtheothertypeofvariablethataffectsthehodnotaofthedependentvariableiscalledindependentvariable,whichisrepresentedbyX.

Hlavní problémy výzkumu regresní analýzy jsou:

(1)DeterminethequantitativerelationshipexpressionbetweenYandX,thisexpressioniscalledregressionequation;

(2)Testthereliabilityoftheobtainedregressionequation;

(3)DeterminewhethertheindependentvariableXhasaneffectonthedependentvariableY;

(4)Použijte získanou regresní rovnici k předpovědi a řízení.4

aplikace

Correlationanalysisstudiesthecorrelationbetweenphenomena,thedirectionandclosenessofcorrelation,andgenerallydoesnotdistinguishbetweenindependentvariablesordependentvariables.Regressionanalysisistoanalyzethespecificformsofcorrelationbetweenphenomena,determinethecausalrelationship,andusemathematicalmodelstoexpressthespecificrelationship.Forexample,itcanbeknownfromcorrelationanalysisthat"quality"and"usersatisfaction"variablesarecloselyrelated,butwhichvariablebetweenthesetwovariablesisaffectedbywhichvariable,andthedegreeofinfluence,requiresregressionanalysis.tomakesure.1

Generallyspeaking,regressionanalysisistodeterminethecausalrelationshipbetweendependentvariablesandindependentvariables,establisharegressionmodel,andsolvetheparametersofthemodelbasedonthemeasureddata,andthenevaluatetheregressionmodelWhetheritcanfitthemeasureddatawell;ifitcanfitwell,youcanmakefurtherpredictionsbasedontheindependentvariables.

Forexample,ifyouwanttostudythecausalrelationshipbetweenqualityandusersatisfaction,inapracticalsense,productqualitywillaffectusersatisfaction,sosetusersatisfactionasthedependentvariableandrecorditasY;Qualityistheindependentvariable,denotedasX.Thefollowinglinearrelationshipcanusuallybeestablished:Y=A+BX+§

where:AandBareundeterminedparameters,Aistheinterceptoftheregressionline;Bistheslopeoftheregressionline,whichmeansthatXchangesbyoneInunit,theaveragechangeofY;§isarandomerroritemthatdependsonusersatisfaction.

Pro empirickou regresní rovnici:y=0,857+0,836x

Theinterceptoftheregressionlineonthey-axisis0.857andtheslopeis0.836,whichmeansthatforeverypointinquality,usersatisfactionAnaverageincreaseof0.836points;inotherwords,thecontributionofa1pointimprovementinqualitytousersatisfactionis0.836points.

Theexampleshownaboveisasimplelinearregressionproblemofoneindependentvariable.Duringdataanalysis,thiscanalsobeextendedtomultipleregressionofmultipleindependentvariables.PleaserefertothespecificregressionprocessandvýznamRefertorelevantstatisticsbooks.Inaddition,intheSPSSresultoutput,R2,FtesthodnotaandTtesthodnotacanalsobereported.R2isalsocalledthecoefficientofdeterminationoftheequation,whichindicatesthedegreeofinterpretationofthevariableXtoYintheequation.ThehodnotaofR2isbetween0and1.Thecloserto1,thestrongertheinterpretationabilityofXtoYintheequation.R2isusuallymultipliedby100%toexpressthepercentageofYchangeexplainedbytheregressionequation.TheFtestisoutputthroughtheanalysisofvariancetable,andthesignificancelevelisusedtotestwhetherthelinearrelationshipoftheregressionequationissignificant.Generallyspeaking,significancelevelsabove0.05arevýznamful.WhentheFtestpasses,itmeansthatatleastoneoftheregressioncoefficientsintheequationissignificant,butnotallregressioncoefficientsaresignificant,soaTtestisneededtoverifythesignificanceoftheregressioncoefficients.Similarly,theTtestcanbedeterminedbythesignificanceleveloralook-uptable.Intheexampleshownabove,thevýznamofeachparameterisshowninthetablebelow.

Test lineární regrese

regression analysis

index

hodnota

Úroveň významnosti

Význam

R2

0,89

"Kvalita" vysvětluje 89% stupně změny "Spokojenost uživatelů"

F

276,82

0,001

Thelinearrelationshipoftheregressionequationissignificant

T

16,64

0,001

Koeficient regresní rovnice je významný

SamplelinearregressionanalysisofSIMmobilephoneusersatisfactionandrelatedvariables

TakethelinearregressionanalysisofSIMmobilephoneusersatisfactionandrelatedvariablesasanexampletofurtherillustrateaplikaceoflinearregression.Inapracticalsense,mobilephoneusersatisfactionshouldberelatedtoproductquality,price,andimage.Therefore,“usersatisfaction”isusedasthedependentvariable,and“quality”,“image”and“price”areindependentvariables.regressionanalysis.UsingtheregressionanalysisofSPSSsoftware,theregressionequationisobtainedasfollows:

Uživatelská spokojenost=0,008×obrázek+0,645×kvalita+0,221×cena

ForSIMmobilephones,thequalityisThecontributionofusersatisfactionisrelativelylarge.Forevery1pointincreaseinquality,usersatisfactionwillincreaseby0.645points;followedbyprice.Forevery1pointincreaseintheevaluationofpricesbyusers,theirsatisfactionwillincreaseby0.221points;andtheimageissatisfiedwiththeproductusers.Thecontributionofdegreeisrelativelysmall,andforevery1pointincreaseinimage,usersatisfactiononlyincreasesby0,008points.

Thetestindicatorsandtheirvýznamsoftheequationareasfollows:

p>

Index

Úroveň významnosti

význam

R2

0,89

89% spokojenosti "stupeň změny".

F

248,53

0,001

Thelinearrelationshipoftheregressionequationissignificant

T (obrázek)

0,00

1 000

The"image"variablehardlycontributestotheregressionequation

T (kvalita)

13,93

0,001

"Quality"hasagreatcontributiontotheregressionequation

T (cena)

5,00

0,001

"Price"hasagreatcontributiontotheregressionequation

Z hlediska testovacích ukazatelů rovnice "obrázek" nepřispívá příliš k celé regresní rovnici a měl by být odstraněn. "Spokojenost uživatelů" a "spokojenost uživatelů" by tedy měla být odstraněna. 2 x 4 x 5 x kvalita: 1 x 5 cen + ce

Everytimeauser’sevaluationofthepriceincreasesby1point,hissatisfactionwillincreaseby0.221points(inthisexampleIn,“image”hasalmostnocontributiontotheequation,sotheequationobtainedissimilartothecoefficientsofthepreviousregressionequation).

Thetestindicatorsandvýznamsoftheequationareasfollows:

Index

Úroveň významnosti

Význam

R2

0,89

89%Stupeň změny"spokojenosti uživatelů"

F

374,69

0,001

Thelinearrelationshipoftheregressionequationissignificant

T (kvalita)

15.15

0,001

"Quality"hasagreatcontributiontotheregressionequation

T (cena)

5.06

0,001

"Price"hasagreatcontributiontotheregressionequation

Kroky k určení proměnných

Clarifythespecifictargetoftheprediction,andalsodeterminethedependentvariable.Ifthespecifictargetforforecastingisthesalesvolumeofthenextyear,thenthesalesvolumeYisthedependentvariable.Throughmarketresearchanddatareview,findtherelevantinfluencingfactorsoftheforecasttarget,thatis,independentvariables,andselectthemaininfluencingfactorsfromthem.

Zavedení prediktivního modelu

Calculatebasedonhistoricalstatisticaldataofindependentvariablesanddependentvariables,andestablishregressionanalysisequations,thatis,regressionanalysispredictivemodels.

Provádění korelační analýzy

Regressionanalysisisthemathematicalstatisticalanalysisandprocessingofcausalinfluencingfactors(independentvariables)andpredictionobjects(dependentvariables).Onlywhentheindependentvariableandthedependentvariabledohaveacertainrelationship,theestablishedregressionequationisvýznamful.Therefore,whetherthefactorastheindependentvariableisrelatedtothepredictedobjectasthedependentvariable,thedegreeofcorrelation,andthedegreeofcertaintyinjudgingthedegreeofsuchcorrelation,havebecomeproblemsthatmustbesolvedinregressionanalysis.Forcorrelationanalysis,correlationisgenerallyrequired,andthedegreeofcorrelationbetweentheindependentvariableandthedependentvariableisjudgedbythesizeofthecorrelationcoefficient.

Vypočítejte chybu předpovědi

Whethertheregressionpredictionmodelcanbeusedforactualpredictiondependsonthetestoftheregressionpredictionmodelandthecalculationofthepredictionerror.Onlywhentheregressionequationpassesvarioustestsandthepredictionerrorissmall,cantheregressionequationbeusedasapredictionmodelforprediction.

Determinethepredictedhodnota

Usingtheregressionpredictionmodeltocalculatethepredictedhodnota,andcomprehensivelyanalyzethepredictedhodnotatodeterminethefinalpredictedhodnota.

Věnujte pozornost problému

Whenapplyingtheregressionpredictionmethod,firstdeterminewhetherthereisacorrelationbetweenthevariables.Ifthereisnocorrelationbetweenthevariables,applyingregressionforecastingmethodstothesevariableswillgivewrongresults.

Payattentiontothecorrectapplicationofregressionanalysisandprediction:

①Použijte kvalitativní analýzu k určení závislosti mezi jevy;

②Vyhněte se bitrální extrapolaci předpovědi regrese;

③Použijte vhodná data;

Související články
HORNÍ