Technický proces
Consideringthedataitself,dataminingusuallyrequires8stepsincludingdatacleaning,datatransformation,dataminingimplementationprocess,patternevaluationandknowledgerepresentation.
(1)Informationcollection:Abstractthecharacteristicinformationneededindataanalysisaccordingtothedetermineddataanalysisobject,thenselecttheappropriateinformationcollectionmethod,andstorethecollectedinformationinthedatabase.Formassivedata,choosingasuitabledatawarehousefordatastorageandmanagementiscrucial.
(2)Dataintegration:Logicallyorphysicallycentralizedatafromdifferentsources,formats,andcharacteristics,soastoprovideenterpriseswithcomprehensivedatasharing.
(3)Dataprotocol:Ittakesalongtimetoexecutemostdataminingalgorithmsevenonasmallamountofdata,andtheamountofdataisoftenverylargewhendoingbusinessoperationdatamining.Dataspecificationtechnologycanbeusedtoobtainthespecificationrepresentationofthedataset,whichismuchsmaller,butstillclosetomaintainingtheintegrityoftheoriginaldata,andtheresultsofdataminingafterthespecificationarethesameoralmostthesameastheresultsbeforethespecification.
(4)Datacleaning:someofthedatainthedatabaseisincomplete(someattributesofinterestaremissingattributevalues),noisy(includingwrongattributevalues),andareinconsistent(Thesameinformationisexpressedindifferentways),sodatacleaningisrequiredtostorecomplete,correct,andconsistentdatainformationinthedatawarehouse.
(5)Datatransformation:Transformdataintoaformsuitablefordataminingthroughsmoothaggregation,datageneralization,andstandardization.Forsomereal-numbereddata,itisalsoanimportantsteptotransformthedatathroughconceptualstratificationanddiscretizationofthedata.
(6)Dataminingprocess:Accordingtothedatainformationinthedatawarehouse,selecttheappropriateanalysistools,applystatisticalmethods,case-basedreasoning,decisiontrees,rule-basedreasoning,fuzzysets,evenneuralnetworks,geneticsThealgorithmicmethodprocessesinformationandobtainsusefulanalysisinformation.
(7)Modelevaluation:Fromabusinessperspective,industryexpertsverifythecorrectnessofthedataminingresults.
(8)Knowledgerepresentation:theanalysisinformationobtainedbydataminingispresentedtousersinavisualmanner,orstoredasnewknowledgeintheknowledgebaseforusebyotherapplications.
Thedataminingprocessisaniterativeprocess.Ifeachstepdoesnotachievetheexpectedgoal,youneedtogobacktothepreviousstep,re-adjustandexecuteit.Noteverydataminingjobrequireseverysteplistedhere.Forexample,whentherearenomultipledatasourcesinacertainjob,thestep(2)dataintegrationstepcanbeomitted.
Krok 3
Operační metoda
Nervová síť
Becauseofitsgoodrobustness,self-organizationandadaptability,parallelprocessing,distributedstorage,andhighfaulttolerance,neuralnetworksareverysuitableforsolvingdataminingproblems.Theyareusedforclassification,Thefeedforwardneuralnetworkmodelforpredictionandpatternrecognition;representedbyHopfield'sdiscretemodelandcontinuousmodel,thefeedbackneuralnetworkmodelusedforassociativememoryandoptimizationcalculations;representedbytheartmodelandtheKoholonmodel,usingSelf-organizingmappingmethodforclustering.Thedisadvantageoftheneuralnetworkmethodisthe"blackbox"nature,anditisdifficultforpeopletounderstandthelearninganddecision-makingprocessofthenetwork.
Genetický algoritmus
Genetický algoritmusisarandomsearchalgorithmbasedonbiologicalnaturalselectionandgeneticmechanism.Theimplicitparallelism,easyintegrationwithothermodelsandotherpropertiesofgeneticalgorithmmakeitbeusedindatamining.
Sunilhassuccessfullydevelopedadataminingtoolbasedongeneticalgorithm,usingthistooltoconductdataminingexperimentsontherealdatabasesoftwoplanecrashes,theresultsshowthatgeneticalgorithmisaneffectivemethodfordataminingOneof[4].Theapplicationofgeneticalgorithmisalsoreflectedinthecombinationwithneuralnetwork,roughsetandothertechnologies.Forexample,thegeneticalgorithmisusedtooptimizethestructureoftheneuralnetwork,andtheredundantconnectionsandhiddenunitsaredeletedwithoutincreasingtheerrorrate;thegeneticalgorithmandthebpalgorithmareusedtotraintheneuralnetwork,andthentherulesareextractedfromthenetwork.However,thealgorithmofgeneticalgorithmismorecomplicated,andtheproblemofearlyconvergenceinthelocalminimumhasnotbeensolvedyet.
Metoda rozhodovacího stromu
Decisiontreeisanalgorithmcommonlyusedinpredictivemodels.Itcanfindsomevaluableandpotentialinformationfromalargeamountofdatabypurposefullyclassifyingit.Itsmainadvantagesaresimpledescriptionandfastclassificationspeed,whichisespeciallysuitableforlarge-scaledataprocessing.Themostinfluentialandearliestdecisiontreemethodisthefamousid3algorithmbasedoninformationentropyproposedbyquinlan.Itsmainproblemsare:id3isanon-incrementallearningalgorithm;id3decisiontreeisaunivariatedecisiontree,itisdifficulttoexpresscomplexconcepts;therelationshipbetweenthesamesexisnotemphasizedenough;noiseresistanceispoor.Inresponsetotheaboveproblems,manybetterimprovedalgorithmshaveemerged.Forexample,Schlimmerandfisherdesignedtheid4incrementallearningalgorithm;ZhongMingandChenWenweiproposedtheiblealgorithm.
Hrubá metoda
Roughsettheoryisamathematicaltoolforstudyinginaccurateanduncertainknowledge.Theroughsetmethodhasseveraladvantages:noadditionalinformationisrequired;theexpressionspaceoftheinputinformationissimplified;thealgorithmissimpleandeasytooperate.Theobjectofroughsetprocessingisaninformationtablesimilartoatwo-dimensionalrelationaltable.However,themathematicalbasisofroughsetsissettheory,anditisdifficulttodirectlydealwithcontinuousattributes.Thecontinuousattributesintheactualinformationtableareuniversal.Therefore,thediscretizationofcontinuousattributesisthedifficultythatrestrictsthepracticalapplicationofroughsettheory.
Methodofcoveringpositiveexamplesandrejectingcounterexamples
Itusestheideaofcoveringallpositiveexamplesandrejectingallcounterexamplestofindrules.First,chooseaseedfromthesetofpositiveexamplesandcomparethemonebyoneinthesetofnegativeexamples.Ifitiscompatiblewiththeselectorformedbythefieldvalue,itwillbediscarded,otherwise,itwillberetained.Accordingtothisthought,allpositiveexampleseedsarelooped,andtheruleofpositiveexample(theconjunctiveofselector)willbeobtained.ThemoretypicalalgorithmsareMichalski'saq11method,HongJiarong'simprovedaq15methodandhisae5method.
Metoda statistické analýzy
Therearetworelationshipsbetweendatabasefielditems:functionalrelationship(deterministicrelationshipthatcanbeexpressedbyfunctionformula)andcorrelationrelationship(notexpressedbyfunctionformula),Butitisstillarelevantdeterministicrelationship).Statisticalmethodscanbeusedfortheiranalysis,thatis,theuseofstatisticalprinciplestoanalyzetheinformationinthedatabase.Commonstatistics(seekingthemaximum,minimum,sum,average,etc.inalargeamountofdata),regressionanalysis(usingregressionequationstoexpressthequantitativerelationshipbetweenvariables),correlationanalysis(usingcorrelationcoefficientstomeasurethecorrelationbetweenvariables)Degree),differenceanalysis(fromthevalueofthesamplestatisticstodeterminewhetherthereisadifferencebetweentheoverallparameters),etc.
Fuzzyset metoda
Thefuzzysettheoryisusedtoperformfuzzyevaluation,fuzzydecision-making,fuzzypatternrecognitionandfuzzyclusteranalysisonpracticalproblems.Thehigherthecomplexityofthesystem,thestrongerthefuzziness.Generally,fuzzysettheoryusesthedegreeofmembershiptodescribethefuzzythings.Onthebasisoftraditionalfuzzytheoryandprobabilityandstatistics,LiDeyiandothersproposedaqualitativeandquantitativeuncertaintyconversionmodel-thecloudmodel,andformedthecloudtheory.
Těžební předměty
Accordingtotheinformationstorageformat,theobjectsusedforminingincluderelationaldatabases,object-orienteddatabases,datawarehouses,textdatasources,multimediadatabases,spatialdatabases,andtemporaldatabases,Heterogeneousdatabasesandinternet,etc.
Software pro dolování dat
SASEM
Modelář společnosti IBMPSPS
K-Minerof Shenzhou General Company
TempoofMerrillLynchDataTechnologyCo.,Ltd.