Случайна гора

Определение

Inmachinelearning,randomforestisaclassifierthatcontainsmultipledecisiontrees,andtheoutputcategoriesareoutputbyindividualtreesDependingonthemodeofthecategory.LeoBreimanandAdeleCutlerdevelopedanalgorithmtodeducerandomforests.And"RandomForests"istheirtrademark.ThistermisderivedfromrandomdecisionforestsproposedbyTinKamHoofBellLabsin1995.ThismethodcombinesBreimans'"Bootstrapaggregating"ideaandHo's"randomsubspacemethod"tobuildasetofdecisiontrees.

Алгоритъм за обучение

Изградете всяко дърво според следния алгоритъм:

  1. UseNtorepresenttrainingThenumberofusecases(samples),Mrepresentsthenumberoffeatures.Random forest

  2. Enterthenumberoffeaturesmtodeterminethedecisionresultofanodeonthedecisiontree;wheremshouldbemuchsmallerthanM.

  3. FromNtrainingcases(samples)withreplacementsampling,takesamplesNtimestoformAtrainingset(iebootstrapsampling),anduseun-selectedusecases(samples)tomakepredictionstoevaluatetheerror.

  4. Foreachnode,randomlyselectmfeatures,andthedecisionofeachnodeonthedecisiontreeisdeterminedbasedonthesefeatures.Accordingtothesemcharacteristics,calculatethebestsplitmethod.

  5. Eachtreewillgrowcompletelywithoutpruning,whichmaybeusedafteranormaltree-likeclassifierisbuilt).

Предимства

Предимствата на случайната гора са:

1) За много видове данни може да генерира класификатор с висока точност;

2) Може да обработва голям брой фин въведени променливи;

3)Itcanevaluatetheimportanceofvariableswhendeterminingthecategory;

4)Whenbuildingtheforest,itcanproduceanunbiasedestimateofthegeneralizederrorinternally;

5)Itcontainsagoodmethodtoestimatethemissingdata,andifthereisalargepartoftheIfthedataismissing,theaccuracycanstillbemaintained;

6)Itprovidesanexperimentalmethodtodetectvariableinteractions;

7) За небалансирани набори от данни за класификация може да балансира грешки;

8)Itcalculatestheclosenessineachcase,whichisveryusefulfordatamining,detectingoutliersandvisualizingdata;

9)Usetheabove.Itcanbeextendedtounlabeleddata,whichusuallyusesunsupervisedclustering.Itcanalsodetectdeviatorsandwatchdata;

10) Процесът на обучение е много бърз.

Свързани понятия

1.Split:Inthetrainingprocessofthedecisiontree,thetrainingdatasetneedstobesplitintotwosub-datasetsagainandagain.Thisprocessiscalledsplitting.

2.Features:Inaclassificationproblem,thedatainputintotheclassifieriscalledafeature.Taketheabovestockpriceforecastingproblemasanexample.Thecharacteristicisthetradingvolumeandclosingpriceofthepreviousday.

3.Featurestobeselected:Intheprocessofconstructingthedecisiontree,itisnecessarytoselectfeaturesfromallthefeaturesinacertainorder.Thefeaturestobeselectedarethesetoffeaturesthathavenotbeenselectedbeforethestep.Forexample,ifallthefeaturesareABCDE,inthefirststep,thecandidatefeatureisABCDE,andinthefirststep,Cisselected,theninthesecondstep,thecandidatefeatureisABDE.

4.Splitfeature:Thedefinitionofthereceptionselectionfeature.Eachselectedfeatureisthesplitfeature.Forexample,intheaboveexample,thefirstsplitfeatureisC.Becausetheseselectedfeaturesdividethedatasetintodisjointparts,theyarecalledsplitfeatures.

Дърво на решения

Totalkaboutrandomforest,wemustfirsttalkaboutdecisiontrees.Decisiontreeisabasicclassifier,whichgenerallydividesfeaturesintotwocategories(decisiontreecanalsobeusedforregression,butthisarticlewillnotshowitforthetimebeing).Theconstructeddecisiontreehasatreestructure,whichcanbeconsideredasacollectionofif-thenrules.Themainadvantageisthatthemodelisreadableandtheclassificationspeedisfast.

Weusetheprocessofselectingquantitativetoolstovisualizetheconstructionofthedecisiontree.Supposewewanttochooseanexcellentquantitativetooltohelpusbetterstocks,howtochoose?

Thefirststep:seeifthedataprovidedbythetoolisverycomprehensive,don’tuseitifthedataisnotcomprehensive.

Стъпка 2: Проверете дали API, предоставен от инструмента, е лесен за използване. Ако API не е добър, не го използвайте.

Step3:Checkwhetherthebacktestingprocessofthetoolisreliable,andthestrategiesthatarenotreliablebacktestingarenotused.

Step4:Checkwhetherthetoolsupportssimulatedtrading.Backtestingonlyallowsyoutojudgewhetherthestrategyisusefulinhistory.Atleastasimulateddiskisneededbeforetheformaloperation.

По този начин количествените инструменти на пазара ще бъдат обозначени с „дали данните са изчерпателни“, „дали API е лесен за използване“, „дали обратният тест е надежден“ и „дали се поддържа симулирана търговия“. „Използвайте“ и „Не използвайте“.

Theaboveistheconstructionofadecisiontree,andthelogiccanberepresentedinFigure1:

На Фигура 1 "данните", "API" и "backtest" в зеленото цветно поле "Симулирана търговия" е функция в това дърво на решения. Ако редът на функциите е различен, дървото на вземане на решения, конструирано от същия набор от данни, може също да бъде различен. Редът на функциите е "данни", "API", "backtest" и "симулирана транзакция". "API" и "backtest", тогава конструираното дърво на решенията е напълно различно.

Itcanbeseenthatthemainjobofthedecisiontreeistoselectfeaturestodividethedataset,andfinallyputthedataontwodifferenttypesoflabels.Howtochoosethebestfeature?Alsousetheexampleofselectingquantizationtoolsabove:supposethereare100quantizationtoolsonthemarketasthetrainingdataset,andthesequantizationtoolshavebeenlabeled"available"and"unavailable".

Wefirsttriedtodividethedatasetintotwocategoriesby"IstheAPIeasytouse";wefoundthattheAPIsof90quantitativetoolsareeasytouse,andtheAPIsof10quantitativetoolsarenoteasytouse.Amongthe90quantitativetools,40arelabeledas"available"and50arelabeledas"unavailable".Then,theclassificationeffectofthe"APIiseasytouse"onthedataisnotEspeciallygood.Because,givenyouanewquantitativetool,evenifitsAPIiseasytouse,youstillcannotlabelitas"used"well.

Assumeagainthatthesame100quantitativetoolscanbedividedintotwocategoriesby"Doyousupportsimulatedtrading".Onecategoryhas40quantitativetooldata,andall40quantitativetoolssupportSimulatedtransactionswereeventuallylabeled"used".Anothercategoryhad60quantitativetools,noneofwhichsupportedsimulatedtransactions,andtheywereeventuallylabeled"notused".Ifanewquantitativetoolsupportssimulatedtrading,youcanjudgewhetherthequantitativetoolcanbeused.Webelievethattheclassificationofdataby"whetheritsupportssimulatedtrading"isveryeffective.

Inreal-worldapplications,datasetsoftenfailtoachievetheabove-mentionedclassificationeffectof"whethersimulatedtradingissupported".Soweusedifferentcriteriatomeasurethecontributionoffeatures.Threemainstreamcriteriaarelisted:ID3algorithm(proposedbyJ.RossQuinlanin1986)usesthefeaturewiththelargestinformationgain;C4.5algorithm(proposedbyJ.RossQuinlanin1993)usestheinformationgainratiotoselectfeatures;CARTalgorithm(Breimanetal.(proposedin1984)usetheGiniindexminimizationcriterionforfeatureselection.

Случайна горска конструкция

Thedecisiontreeisequivalenttoamaster,classifyingnewdatathroughtheknowledgelearnedinthedataset.Butasthesayinggoes,oneZhugeLiangcan'tplaywiththreeheads.Randomforestisanalgorithmthathopestobuildmultipleheadsandhopesthatthefinalclassificationeffectcanexceedasinglemaster.

Howtobuildarandomforest?Therearetwoaspects:randomselectionofdata,andrandomselectionoffeaturestobeselected.

1. Случаен избор на данни:

First,takeasamplewithreplacementfromtheoriginaldatasettoconstructasub-dataset.Thedatavolumeofthesub-datasetisthesameastheoriginaldata.Setthesame.Elementsindifferentsub-datasetscanberepeated,andelementsinthesamesub-datasetcanalsoberepeated.Second,usesub-datasetstoconstructsub-decisiontrees,putthisdataineachsub-decisiontree,andeachsub-decisiontreeoutputsaresult.Finally,ifthereisnewdatathatneedstobeclassifiedthroughtherandomforest,theoutputresultoftherandomforestcanbeobtainedbyvotingonthejudgmentresultsofthesub-decisiontree.AsshowninFigure3,assumingthatthereare3sub-decisiontreesintherandomforest,theclassificationresultof2sub-treesistypeA,andtheclassificationresultof1sub-treeistypeB,thentheclassificationresultoftherandomforestistypeA.

2. Случаен избор на функциите, които да бъдат избрани

Similartorandomselectionofdatasets,eachsplittingprocessofthesubtreeintherandomforestdoesnotuseallthefeaturestobeselected,Butrandomlyselectacertainfeaturefromallthefeaturestobeselected,andthenselecttheoptimalfeaturefromtherandomlyselectedfeatures.Inthisway,thedecisiontreesintherandomforestcanbedifferentfromeachother,andthediversityofthesystemisimproved,therebyimprovingtheclassificationperformance.

InFigure4,thebluesquaresrepresentallthefeaturesthatcanbeselected,thatis,thefeaturestobeselected.Theyellowsquareisasplitfeature.Ontheleftisthefeatureselectionprocessofadecisiontree.Splittingiscompletedbyselectingtheoptimalsplitfeaturefromthefeaturestobeselected(don'tforgettheID3algorithm,C4.5algorithm,CARTalgorithm,etc.mentionedabove).Ontherightisthefeatureselectionprocessofasubtreeinarandomforest.

Related Articles
TOP