ID3 algorithm - techintroduce

Background knowledge

The ID3 algorithm was first proposed by J. Ross Quinlan at the University of Sydney in 1975 as a classification and prediction algorithm. The core of the algorithm is information entropy. ID3 calculates the information gain of each attribute and treats an attribute with high information gain as a good attribute. At each step the attribute with the highest information gain is selected as the splitting criterion, and the process is repeated until a decision tree that classifies the training examples perfectly has been generated.

A decision tree classifies data in order to make predictions. The decision tree method first builds a tree from the training set. If the tree cannot classify every object correctly, some of the exceptions are added to the training set and the process is repeated until a correct decision set is formed. The decision tree is the tree-structured representation of that decision set.

A decision tree is composed of decision nodes, branches, and leaves. The topmost node is the root node, and each branch leads either to a new decision node or to a leaf of the tree. Each decision node represents a question or decision and usually corresponds to an attribute of the object being classified. Each leaf node represents a possible classification result. When the tree is traversed from top to bottom, a test is applied at each node, and the different outcomes of that test lead to different branches until a leaf node is finally reached. This is the process of classifying with a decision tree: several variables are used to determine the category an object belongs to.
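The top-to-bottom traversal described above can be sketched in a few lines of Python. The tree below is a hypothetical hand-written example, not one taken from this article: internal nodes are assumed to be dicts with an "attribute" key and a "branches" mapping, and leaves are plain class labels.

```python
# Hypothetical tree (illustrative only): a decision node is a dict holding the
# attribute it tests and one branch per attribute value; a leaf is a label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rain": {
            "attribute": "wind",
            "branches": {"strong": "no", "weak": "yes"},
        },
    },
}


def classify(node, example):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while isinstance(node, dict):          # still at a decision node
        value = example[node["attribute"]]  # outcome of the test at this node
        node = node["branches"][value]      # follow the matching branch
    return node                             # a leaf: the predicted class


print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))
```

An attribute value with no matching branch raises a KeyError in this sketch; a real classifier would need a policy for unseen values.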

ID3 algorithm

The ID3 algorithm was first proposed by Quinlan. It is based on information theory and uses information entropy and information gain as its measures in order to carry out inductive classification of data. The following are some basic concepts from information theory:

Definition 1: If there are n messages with equal probability, the probability of each message is 1/n, and the amount of information conveyed by a message is -log2(1/n).

Definition 2: If there are n messages and the given probability distribution is P = (p1, p2, ..., pn), then the amount of information conveyed by the distribution is called the entropy of P, written

I(P) = -(p1 log2(p1) + p2 log2(p2) + ... + pn log2(pn))
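As a small numeric illustration of Definitions 1 and 2, the sketch below assumes the usual Shannon entropy, -Σ p_i log2(p_i), measured in bits. The function names `information` and `entropy` are chosen here for illustration only.

```python
import math


def information(n):
    """Definition 1: bits conveyed by one of n equally likely messages."""
    return -math.log2(1 / n)


def entropy(probabilities):
    """Definition 2: entropy of a distribution P = (p1, ..., pn), in bits.

    Terms with p == 0 are skipped, following the convention 0 * log2(0) = 0.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(information(8))          # one of 8 equally likely messages
print(entropy([0.5, 0.5]))     # a fair coin
print(entropy([0.9, 0.1]))     # a heavily biased coin carries less information
```

For a uniform distribution over n messages the two definitions agree: the entropy equals log2(n).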

Definition 3: If a set of records T is partitioned into independent categories C1, C2, ..., Ck according to the value of the category attribute, the amount of information needed to identify which category an element of T belongs to is Info(T) = I(P), where P is the probability distribution of C1, C2, ..., Ck, that is, P = (|C1|/|T|, ..., |Ck|/|T|).

Definition 4: If T is first divided into sets T1, T2, ..., Tn according to the value of a non-category attribute X, then the amount of information needed to determine the class of an element of T is the weighted average of the amounts needed for each Ti, that is, the weighted average of Info(Ti):

Info(X, T) = sum over i = 1..n of (|Ti| / |T|) * Info(Ti)

Definition 5: Information gain is the difference between two amounts of information: the information needed to identify the class of an element of T, and the information needed to identify the class of an element of T once the value of attribute X is known. The information gain formula is:

Gain(X, T) = Info(T) - Info(X, T)
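Definitions 3 to 5 translate fairly directly into code. The following is a minimal sketch, assuming each record is a Python dict and the category attribute is named by a `target` key. The tiny four-row dataset and the names `info`, `info_x`, and `gain` are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def info(labels):
    """Info(T): entropy of the class distribution (|C1|/|T|, ..., |Ck|/|T|)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def info_x(rows, attribute, target):
    """Info(X, T): weighted average of Info(Ti) over the subsets Ti
    obtained by splitting T on the values of attribute X."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    return sum(len(g) / len(rows) * info(g) for g in groups.values())


def gain(rows, attribute, target):
    """Gain(X, T) = Info(T) - Info(X, T)."""
    return info([row[target] for row in rows]) - info_x(rows, attribute, target)


# Illustrative data only.
rows = [
    {"wind": "weak", "play": "yes"},
    {"wind": "weak", "play": "yes"},
    {"wind": "strong", "play": "no"},
    {"wind": "strong", "play": "yes"},
]
print(round(gain(rows, "wind", "play"), 4))
```

Here Info(T) is about 0.811 bits (three "yes", one "no"), the "weak" subset is pure, and the "strong" subset has entropy 1, so Info(X, T) = 0.5 and the gain is about 0.311 bits.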

The ID3 algorithm calculates the information gain of each attribute and selects the attribute with the highest gain as the test attribute for the given set. It creates a node for the selected test attribute, labels the node with that attribute, creates a branch for each value of the attribute, and partitions the samples accordingly.
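The selection and partitioning steps above can be put together into a recursive sketch. This is one possible reading of the procedure, not a reference implementation: the stopping rules (a pure subset, or no attributes left, in which case the majority class is used) and the dict-based node layout are assumptions made here, and the dataset is invented.

```python
import math
from collections import Counter


def info(labels):
    """Entropy of the class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain(rows, attribute, target):
    """Info(T) minus the weighted entropy after splitting on `attribute`."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * info(subset)
    return info([row[target] for row in rows]) - remainder


def id3(rows, attributes, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:            # pure subset: make a leaf
        return labels[0]
    if not attributes:                   # assumption: fall back to majority class
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute with the highest information gain as the test.
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # One branch per observed value of the chosen attribute.
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return {"attribute": best, "branches": branches}


# Illustrative data only.
rows = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
]
tree = id3(rows, ["outlook", "wind"], "play")
print(tree)
```

On this data "outlook" has the larger gain (about 0.571 bits against about 0.420 for "wind"), so it becomes the root; only the "rain" branch needs a further test on "wind".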

Data description

The sample data used must meet certain requirements. For ID3 these are:

Attribute-value description: the same attributes must describe each example, and each attribute must have a fixed number of values.

Predefined classes: the class attribute of each example must already be defined, that is, it is not learned by ID3.

Discrete classes: classes must be sharply delineated and distinct. Breaking continuous classes down into fuzzy categories (such as metals being "hard, tough, flexible, delicate, and soft") is not reliable.

Sufficient examples: because inductive generalization is used (that is, its conclusions cannot be proven), enough test cases must be selected to distinguish valid patterns from chance coincidences.

Attribute selection

ID3 has to determine which attribute is best. A statistical property called information gain is used for this: it measures, by means of entropy, how well a given attribute separates the training examples into the target classes. The attribute with the highest information gain (that is, the one most useful for classification) is selected. To define gain, we first borrow a definition from information theory called entropy. Every attribute has an entropy.