Random forest

Definition

In machine learning, a random forest is a classifier that contains multiple decision trees; its output class is the mode of the classes output by the individual trees. Leo Breiman and Adele Cutler developed an algorithm for inducing random forests, and "Random Forests" is their trademark. The term derives from the random decision forests proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bootstrap aggregating" idea with Ho's "random subspace method" to build a set of decision trees.

Learning algorithm

Build each tree according to the following algorithm:

Let N be the number of training cases (samples) and M the number of features.
Choose the number of features m used to decide the split at a node of the decision tree; m should be much smaller than M.
Sample N times with replacement from the N training cases (samples) to form a training set (i.e. bootstrap sampling), and use the cases (samples) that were not selected to make predictions and estimate the error.
For each node, randomly select m features; the decision at each node of the tree is based on these features. Compute the best split from these m features.
Grow each tree fully, without pruning (pruning is something that may be applied after an ordinary tree classifier has been built).
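
The bootstrap step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name bootstrap_sample, the seed and the value N = 10 are hypothetical choices made for the example, not part of the original algorithm description.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement.

    Returns (in_bag, out_of_bag): the sampled indices, and the indices
    that were never drawn and can be used to estimate the error.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(0)   # fixed seed so the sketch is repeatable
N = 10                   # illustrative number of training cases
in_bag, out_of_bag = bootstrap_sample(N, rng)
print("in-bag indices:    ", in_bag)
print("out-of-bag indices:", out_of_bag)
```

Because sampling is done with replacement, some indices usually appear more than once in the in-bag list while others do not appear at all; the latter are the unselected cases the algorithm uses to estimate the error.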

Advantages

Advantages of random forests:

1) For many kinds of data, it can produce a highly accurate classifier;

2) It can handle a large number of input variables;

3) It can evaluate the importance of variables when determining the class;

4) While building the forest, it can internally produce an unbiased estimate of the generalization error;

5) It includes a good method for estimating missing data, and accuracy can still be maintained even if a large portion of the data is missing;

6) It provides an experimental method for detecting variable interactions;

7) For imbalanced classification datasets, it can balance the error;

8) It computes proximities between cases, which is very useful for data mining, detecting outliers and visualizing data;

9) Building on the above, it can be extended to unlabeled data, typically for unsupervised clustering; it can also detect outliers and visualize the data;

10) The learning process is very fast.

Related concepts

1. Split: During the training of a decision tree, the training dataset is repeatedly split into two sub-datasets. This process is called splitting.

2. Features: In a classification problem, the data input to the classifier are called features. In a stock price forecasting problem, for example, the features might be the previous day's trading volume and closing price.

3. Features to be selected (candidate features): While constructing a decision tree, features are selected from the full set in some order. The candidate features at a given step are the features that have not been selected before that step. For example, if the full set of features is A, B, C, D, E, then the candidates in the first step are A, B, C, D, E; if C is selected in the first step, the candidates in the second step are A, B, D, E.

4. Split feature: Following on from the definition of candidate features, each feature that is selected is a split feature. In the example above, the first split feature is C. These selected features divide the dataset into disjoint parts, hence the name.

Constructing a decision tree

To talk about random forests, we must first talk about decision trees. A decision tree is a basic classifier that generally divides data into two classes according to their features (decision trees can also be used for regression, but this article does not cover that for now). The constructed decision tree has a tree structure and can be viewed as a collection of if-then rules. Its main advantages are that the model is readable and classification is fast.

We use the process of choosing a quantitative trading tool to illustrate how a decision tree is constructed. Suppose we want to choose a good quantitative tool to help us trade stocks better. How do we choose?

Step 1: Check whether the data provided by the tool is comprehensive. If the data is not comprehensive, do not use it.

Step 2: Check whether the API provided by the tool is easy to use. If the API is not easy to use, do not use it.

Step 3: Check whether the tool's backtesting process is reliable. If the backtesting is not reliable, do not use it.

Step 4: Check whether the tool supports simulated trading. Backtesting only tells you whether a strategy would have worked historically; at least a period of simulated trading is needed before going live.

In this way, the quantitative tools on the market are labeled "use" or "do not use" according to "whether the data is comprehensive", "whether the API is easy to use", "whether backtesting is reliable" and "whether simulated trading is supported".
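
Since a decision tree can be read as a collection of if-then rules, the four checks above can be written out directly as nested conditions. The following is a minimal sketch; the function name should_use_tool and its boolean parameters are hypothetical names introduced only for this illustration.

```python
def should_use_tool(data_comprehensive, api_easy, backtest_reliable, supports_simulation):
    """Walk the four checks in order, as a hand-written decision tree."""
    if not data_comprehensive:       # Step 1: is the data comprehensive?
        return "do not use"
    if not api_easy:                 # Step 2: is the API easy to use?
        return "do not use"
    if not backtest_reliable:        # Step 3: is backtesting reliable?
        return "do not use"
    if not supports_simulation:      # Step 4: is simulated trading supported?
        return "do not use"
    return "use"

print(should_use_tool(True, True, True, True))
print(should_use_tool(True, False, True, True))
```

Each early return corresponds to a leaf of the tree, so checking the features in a different order would give a differently shaped tree even though, in this simple case, the final labels stay the same.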

The above is the construction of a decision tree, and its logic can be represented as in Figure 1.

In Figure 1, "data", "API", "backtest" and "simulated trading" in the green boxes are the features of this decision tree. If the features are taken in a different order, the decision tree built from the same dataset may be different. Here the order of the features is "data", "API", "backtest", "simulated trading"; if we instead chose a different order, for example checking "simulated trading" before "API" and "backtest", the constructed decision tree would be completely different.

As we can see, the main job of a decision tree is to select features to divide the dataset and finally assign each item one of two labels. How do we choose the best feature? Take the example of choosing a quantitative tool again: suppose there are 100 quantitative tools on the market as the training dataset, and these tools have already been labeled "use" or "do not use".

We first try dividing the dataset into two groups by "is the API easy to use". We find that 90 tools have an easy-to-use API and 10 do not. Among the 90, 40 are labeled "use" and 50 are labeled "do not use". So "is the API easy to use" does not classify the data particularly well: given a new quantitative tool, even if its API is easy to use, you still cannot confidently label it "use".

Now suppose the same 100 tools are divided into two groups by "does it support simulated trading". One group contains 40 tools, all of which support simulated trading and all of which are labeled "use". The other group contains 60 tools, none of which support simulated trading, and all of which are labeled "do not use". For a new quantitative tool, knowing whether it supports simulated trading is enough to judge whether it can be used. We therefore consider "whether it supports simulated trading" a very effective feature for classifying the data.

In real-world applications, datasets rarely separate as cleanly as in the "whether simulated trading is supported" example, so different criteria are used to measure how much a feature contributes. Three mainstream criteria are: the ID3 algorithm (proposed by J. Ross Quinlan in 1986) selects the feature with the largest information gain; the C4.5 algorithm (proposed by J. Ross Quinlan in 1993) selects features by information gain ratio; the CART algorithm (proposed by Breiman et al. in 1984) selects features by minimizing the Gini index.
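
To make these criteria concrete, the sketch below computes information gain and the weighted Gini index for the two example splits of the 100 tools. One count is an assumption: the text does not say how the 10 tools with a hard-to-use API are labeled, so the code assumes 0 "use" and 10 "do not use", which is what follows if the 40/60 totals from the simulated-trading split apply to the same 100 tools. The helper names are hypothetical.

```python
from math import log2

def gini(counts):
    """Gini index of a group given its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a group given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def weighted_impurity(groups, impurity):
    """Size-weighted average impurity of the groups produced by a split."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * impurity(g) for g in groups)

# Counts are [use, do not use].
parent = [40, 60]                 # all 100 tools
api_split = [[40, 50], [0, 10]]   # easy API / hard API (second group is assumed)
sim_split = [[40, 0], [0, 60]]    # supports simulated trading / does not

for name, split in [("API easy to use", api_split), ("simulated trading", sim_split)]:
    gain = entropy(parent) - weighted_impurity(split, entropy)
    g = weighted_impurity(split, gini)
    print(f"{name}: information gain = {gain:.3f}, weighted Gini = {g:.3f}")
```

Under these assumed counts, the simulated-trading split leaves both groups pure, so its weighted Gini index is 0, while the API split only lowers it from 0.48 to 4/9 (about 0.444). Both criteria therefore prefer "simulated trading", matching the informal argument above.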

Building a random forest

A decision tree is like a single master who classifies new data using the knowledge learned from the dataset. But as the saying goes, three cobblers with their wits combined can match one Zhuge Liang. A random forest is an algorithm that builds many such cobblers in the hope that their combined classification outperforms a single master.

How is a random forest built? There are two aspects: random selection of data, and random selection of candidate features.

1. Random selection of data

First, sample with replacement from the original dataset to construct a sub-dataset. The sub-dataset has the same size as the original dataset. Elements may be repeated across different sub-datasets and also within the same sub-dataset. Second, use each sub-dataset to build a sub-decision tree; data to be classified is fed to every sub-decision tree, and each sub-decision tree outputs a result. Finally, when new data needs to be classified by the random forest, the forest's output is obtained by voting over the results of the sub-decision trees. As shown in Figure 3, suppose the random forest has 3 sub-decision trees: if 2 of them classify the input as type A and 1 as type B, then the random forest's classification result is type A.
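
The voting step can be sketched as a simple majority count over the trees' outputs. The function name forest_predict is a hypothetical name used only for this illustration.

```python
from collections import Counter

def forest_predict(tree_predictions):
    """Return the class predicted by the most trees (the mode of the votes)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three sub-decision trees: two vote for type A, one for type B.
print(forest_predict(["A", "A", "B"]))  # prints A
```

With an even number of trees a tie is possible; this sketch simply returns whichever tied class Counter lists first, and a real implementation would need an explicit tie-breaking rule.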

2. Random selection of candidate features

Similar to the random selection of data, each split in a subtree of the random forest does not consider all the candidate features. Instead, a certain number of features are randomly selected from all the candidate features, and the optimal feature is then chosen from among those randomly selected ones. In this way the decision trees in the random forest differ from one another, which increases the diversity of the system and thereby improves classification performance.
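
A minimal sketch of this per-split feature sampling follows. The feature scores are invented numbers standing in for whatever criterion is used (information gain, Gini reduction and so on), and the function name choose_split_feature is hypothetical.

```python
import random

def choose_split_feature(candidates, score, m, rng):
    """Sample m candidate features at random, then pick the best-scoring one.

    Returns (best_feature, sampled_subset).
    """
    subset = rng.sample(candidates, m)
    return max(subset, key=score), subset

# Hypothetical usefulness scores for five candidate features.
scores = {"A": 0.10, "B": 0.35, "C": 0.80, "D": 0.20, "E": 0.55}

rng = random.Random(42)  # fixed seed so the sketch is repeatable
best, subset = choose_split_feature(list(scores), scores.get, m=2, rng=rng)
print("sampled:", subset, "-> split on", best)
```

A single decision tree would always split on C here because it has the highest score overall. A subtree in the forest only sees the sampled subset, so it sometimes splits on a weaker feature, which is what makes the trees differ from one another.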

In Figure 4, the blue squares represent all the features available for selection, that is, the candidate features. The yellow square is the split feature. On the left is the feature selection process of a single decision tree: the split is made by choosing the optimal split feature from all the candidate features (recall the ID3, C4.5 and CART algorithms mentioned above). On the right is the feature selection process of a subtree in a random forest.