Technical process
Considering the data itself, data mining usually involves eight steps: information collection, data integration, data reduction, data cleaning, data transformation, the data mining process itself, model evaluation, and knowledge representation.
(1) Information collection: Abstract the characteristic information needed for data analysis according to the chosen analysis object, then select an appropriate information collection method and store the collected information in a database. For massive data, choosing a suitable data warehouse for data storage and management is crucial.
(2) Data integration: Logically or physically centralize data from different sources, formats, and characteristics, so as to provide enterprises with comprehensive data sharing.
(3) Data reduction: Most data mining algorithms take a long time to execute even on a small amount of data, and the amount of data is often very large when mining business operation data. Data reduction techniques can be used to obtain a reduced representation of the data set that is much smaller but still closely maintains the integrity of the original data, so that the results of mining the reduced data are the same or almost the same as the results before reduction.
(4) Data cleaning: Some of the data in the database is incomplete (some attributes of interest are missing values), noisy (containing wrong attribute values), or inconsistent (the same information is expressed in different ways), so data cleaning is required in order to store complete, correct, and consistent data in the data warehouse.
(5) Data transformation: Transform data into a form suitable for data mining through smoothing, aggregation, data generalization, and normalization. For some real-valued data, transforming the data through concept hierarchies and discretization is also an important step.
(6) Data mining process: According to the data in the data warehouse, select appropriate analysis tools and apply statistical methods, case-based reasoning, decision trees, rule-based reasoning, fuzzy sets, or even neural networks and genetic algorithms to process the information and obtain useful analysis results.
(7) Model evaluation: From a business perspective, industry experts verify the correctness of the data mining results.
(8) Knowledge representation: The analysis information obtained by data mining is presented to users in a visual manner, or stored as new knowledge in the knowledge base for use by other applications.
Data mining is an iterative process. If a step does not achieve the expected goal, you need to go back to the previous step, adjust it, and execute it again. Not every data mining job requires every step listed here. For example, when a job does not involve multiple data sources, step (2), data integration, can be omitted.
Steps (3) data reduction, (4) data cleaning, and (5) data transformation are together called data preprocessing. In data mining, at least 60% of the cost may be spent in step (1), the information collection stage, and at least 60% of the energy and time is spent on data preprocessing.
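The three preprocessing steps can be illustrated with a minimal sketch in plain Python. Everything here is an assumption made for illustration: the records, the field names, the choice of mean imputation for cleaning, min-max scaling and fixed interval bins for transformation, and simple random sampling for reduction. Real projects choose these techniques per data set.

```python
import random
from statistics import mean

# Hypothetical raw records; None marks a missing attribute value.
records = [
    {"age": 23, "income": 3200.0},
    {"age": 35, "income": None},      # incomplete record
    {"age": 41, "income": 8800.0},
    {"age": 58, "income": 6100.0},
]

def clean(rows, field):
    """Data cleaning: fill missing values of `field` with the mean of the known values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def min_max(rows, field):
    """Data transformation: rescale `field` linearly to the range [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in rows]

def discretize(rows, field, bins):
    """Data transformation: replace a numeric value by the label of its interval.

    `bins` is a list of (upper_bound, label) pairs in ascending order.
    """
    def label(value):
        for upper, name in bins:
            if value <= upper:
                return name
        return bins[-1][1]  # values above the last bound fall into the last bin
    return [{**r, field: label(r[field])} for r in rows]

def reduce_by_sampling(rows, k, seed=0):
    """Data reduction: keep a simple random sample of k records."""
    return random.Random(seed).sample(rows, k)

cleaned = clean(records, "income")
scaled = min_max(cleaned, "income")
binned = discretize(scaled, "age", [(30, "young"), (50, "middle"), (200, "senior")])
sample = reduce_by_sampling(binned, 2)
```

Each function returns new records instead of modifying its input, so the steps can be reordered or skipped, which matches the iterative nature of the process described above.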
Operation method
Neural network
Because of their good robustness, self-organization and adaptability, parallel processing, distributed storage, and high fault tolerance, neural networks are very suitable for solving data mining problems. Typical models fall into three categories: feedforward neural network models, used for classification, prediction, and pattern recognition; feedback neural network models, represented by Hopfield's discrete and continuous models, used for associative memory and optimization calculations; and self-organizing map methods, represented by the ART model and the Kohonen model, used for clustering. The disadvantage of the neural network method is its "black box" nature: it is difficult for people to understand the learning and decision-making process of the network.
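As an illustration of associative memory in a feedback network, the following is a minimal sketch of a discrete Hopfield network in plain Python. It assumes bipolar states (+1/-1), Hebbian weights with a zero diagonal, and asynchronous updates; the stored pattern and the function names are hypothetical.

```python
def train_hopfield(patterns):
    """Hebbian learning: w[i][j] is the sum over patterns of p[i] * p[j], with w[i][i] = 0."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, max_sweeps=10):
    """Update units one at a time until a full sweep changes nothing."""
    s = list(state)
    n = len(s)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(w[i][j] * s[j] for j in range(n))
            new_value = 1 if field >= 0 else -1
            if new_value != s[i]:
                s[i] = new_value
                changed = True
        if not changed:
            break
    return s

# Store one hypothetical 8-unit pattern, then present a copy with one unit flipped.
stored = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([stored])
noisy = list(stored)
noisy[0] = -noisy[0]
recovered = recall(weights, noisy)  # settles back onto the stored pattern
```

With a single stored pattern, every unit receives a field whose sign agrees with the stored value, so the corrupted unit is repaired in the first sweep.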
Genetic algorithm
A genetic algorithm is a random search algorithm based on biological natural selection and genetic mechanisms. Its implicit parallelism, easy integration with other models, and other properties have led to its use in data mining.
Sunil has successfully developed a data mining tool based on a genetic algorithm and used it to conduct data mining experiments on the real databases of two plane crashes; the results show that the genetic algorithm is one of the effective methods for data mining [4]. The application of genetic algorithms is also reflected in their combination with neural networks, rough sets, and other technologies. For example, a genetic algorithm can be used to optimize the structure of a neural network, deleting redundant connections and hidden units without increasing the error rate; a genetic algorithm and the BP (backpropagation) algorithm can be used together to train a neural network, after which rules are extracted from the network. However, genetic algorithms are relatively complicated, and the problem of premature convergence to a local minimum has not yet been solved.
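The selection, crossover, and mutation cycle can be sketched on a toy problem. The example below is not the tool mentioned above; it is an illustrative genetic algorithm that maximizes the number of 1 bits in a bit string, and the parameter values, tournament selection, one-point crossover, and elitism are all assumptions chosen to keep the sketch short.

```python
import random

def fitness(bits):
    """Toy objective: the number of 1 bits in the individual."""
    return sum(bits)

def tournament(pop, rng):
    """Selection: pick two random individuals and keep the fitter one."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def evolve(n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in pop)]
    for _ in range(generations):
        # Elitism: carry the best individual over unchanged.
        children = [list(max(pop, key=fitness))]
        while len(children) < pop_size:
            mother, father = tournament(pop, rng), tournament(pop, rng)
            cut = rng.randint(1, n_bits - 1)          # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        history.append(max(fitness(ind) for ind in pop))
    return max(pop, key=fitness), history

best, history = evolve()
```

Because the best individual is always copied into the next generation, the best fitness recorded in `history` never decreases. Without elitism, or with a mutation rate that is too low, the population can stall on a sub-optimal solution, which is the premature convergence problem noted above.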
Decision tree method
A decision tree is an algorithm commonly used in predictive models. It can find valuable, latent information in a large amount of data by purposefully classifying it. Its main advantages are simple description and fast classification, which make it especially suitable for large-scale data processing. The earliest and most influential decision tree method is the well-known ID3 algorithm, based on information entropy and proposed by Quinlan. Its main problems are: ID3 is a non-incremental learning algorithm; the ID3 decision tree is a univariate decision tree, which makes it difficult to express complex concepts; the relationships between attributes are not emphasized enough; and its noise resistance is poor. In response to these problems, many improved algorithms have emerged. For example, Schlimmer and Fisher designed the ID4 incremental learning algorithm, and Zhong Ming and Chen Wenwei proposed the IBLE algorithm.
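The entropy-based criterion that ID3 uses to choose a splitting attribute can be sketched as follows. This shows only the attribute selection step, not a full tree builder, and the toy data set and attribute names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """ID3 splits on the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical training examples.
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "rain",     "windy": "no",  "play": "yes"},
    {"outlook": "rain",     "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
split = best_attribute(rows, ["outlook", "windy"], "play")
```

In this data set, splitting on `outlook` leaves only the `rain` branch mixed, so its gain (2/3 bit) is much larger than that of `windy`. Each split tests a single attribute, which is the univariate limitation mentioned above.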
Rough set method
Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. The rough set method has several advantages: no additional information is required; the representation space of the input information is simplified; and the algorithm is simple and easy to operate. The object processed by rough sets is an information table similar to a two-dimensional relational table. However, the mathematical basis of rough sets is set theory, so it is difficult to deal with continuous attributes directly, and continuous attributes are common in real information tables. Therefore, the discretization of continuous attributes is the difficulty that restricts the practical application of rough set theory.
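The core rough set construction, the lower and upper approximations of a target set over an information table, can be sketched in a few lines. The table, attribute names, and target set below are hypothetical, and all attributes are assumed to be already discrete.

```python
def equivalence_classes(table, attributes):
    """Group objects that are indiscernible on the given attributes."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

def approximations(table, attributes, target):
    """Return (lower, upper) approximations of `target`.

    lower: union of classes entirely inside the target (certain members)
    upper: union of classes that overlap the target (possible members)
    """
    lower, upper = set(), set()
    for block in equivalence_classes(table, attributes):
        if block <= target:
            lower |= block
        if block & target:
            upper |= block
    return lower, upper

# Hypothetical information table: object name -> attribute values.
table = {
    "x1": {"headache": "yes", "temperature": "high"},
    "x2": {"headache": "yes", "temperature": "high"},
    "x3": {"headache": "no",  "temperature": "high"},
    "x4": {"headache": "no",  "temperature": "normal"},
}
flu = {"x1", "x3"}  # the concept to approximate
lower, upper = approximations(table, ["headache", "temperature"], flu)
```

Here `x1` and `x2` have identical attribute values but only `x1` belongs to the target, so both land in the upper approximation and neither in the lower one. The grouping step relies on exact equality of attribute values, which is why continuous attributes have to be discretized first.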
Method of covering positive examples and rejecting counterexamples
It uses the idea of covering all positive examples and rejecting all counterexamples to find rules. First, choose a seed from the set of positive examples and compare it one by one with the examples in the set of negative examples. If a selector formed from a field value is compatible with a negative example, it is discarded; otherwise, it is retained. Following this idea, looping over all positive seeds yields the rules for the positive examples (conjunctions of selectors). Typical algorithms include Michalski's AQ11 method, Hong Jiarong's improved AQ15 method, and his AE5 method.
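The covering idea can be sketched with a much simplified greedy procedure. This is not AQ11, AQ15, or AE5; it is an illustrative stand-in that assumes discrete attributes, builds one conjunctive rule per seed by adding the seed's selectors until every negative example is rejected, and repeats until all positive examples are covered. The examples and attribute names are hypothetical.

```python
def covers(rule, example):
    """A rule is a conjunction of selectors, stored as {attribute: required_value}."""
    return all(example[attr] == value for attr, value in rule.items())

def rule_for_seed(seed, negatives, attributes):
    """Add selectors taken from the seed until no negative example is covered."""
    rule = {}
    remaining = list(negatives)  # negatives still covered by the rule so far
    while remaining:
        candidates = [a for a in attributes if a not in rule]
        if not candidates:
            raise ValueError("seed cannot be separated from a negative example")
        # Greedy choice: the selector that rejects the most remaining negatives.
        best = max(candidates, key=lambda a: sum(n[a] != seed[a] for n in remaining))
        rule[best] = seed[best]
        remaining = [n for n in remaining if covers(rule, n)]
    return rule

def learn_rules(positives, negatives, attributes):
    """Loop over seeds until every positive example is covered by some rule."""
    rules, uncovered = [], list(positives)
    while uncovered:
        rule = rule_for_seed(uncovered[0], negatives, attributes)
        rules.append(rule)
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules

attributes = ["shape", "color"]
positives = [
    {"shape": "round",  "color": "red"},
    {"shape": "round",  "color": "green"},
    {"shape": "square", "color": "red"},
]
negatives = [
    {"shape": "square",   "color": "green"},
    {"shape": "triangle", "color": "green"},
]
rules = learn_rules(positives, negatives, attributes)
```

The result reads as a disjunction of rules, here "shape is round" or "color is red". Each rule covers its seed and rejects every negative example, and together the rules cover all positive examples.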
Statistical analysis method
There are two kinds of relationships between database field items: functional relationships (deterministic relationships that can be expressed by a function formula) and correlation relationships (relationships that cannot be expressed by a function formula, yet in which the fields are still definitely related). Statistical methods can be used to analyze them, that is, statistical principles are applied to the information in the database. Available techniques include common statistics (finding the maximum, minimum, sum, average, etc. of a large amount of data), regression analysis (using regression equations to express the quantitative relationship between variables), correlation analysis (using correlation coefficients to measure the degree of correlation between variables), difference analysis (using the values of sample statistics to determine whether there is a difference between population parameters), and so on.
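Three of these techniques, common statistics, correlation analysis, and regression analysis, can be sketched in plain Python. The example assumes two numeric fields, uses the Pearson correlation coefficient and ordinary least squares for a single predictor, and works on hypothetical values constructed to follow y = 2x + 1 exactly.

```python
from math import sqrt

def summary(values):
    """Common statistics: minimum, maximum, sum, and average."""
    return {
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "mean": sum(values) / len(values),
    }

def pearson(xs, ys):
    """Correlation analysis: Pearson correlation coefficient of two fields."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(sxx * syy)

def linear_regression(xs, ys):
    """Regression analysis: least-squares slope and intercept of y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical field values.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

stats = summary(sales)
r = pearson(ad_spend, sales)
slope, intercept = linear_regression(ad_spend, sales)
```

Because the sample data lie on a straight line, the correlation coefficient is 1 and the fitted equation recovers the slope 2 and intercept 1. With real data the coefficient falls between -1 and 1, and the regression equation describes only the average trend.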
Fuzzy set method
Fuzzy set theory is used to perform fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis on practical problems. The higher the complexity of a system, the stronger its fuzziness. Generally, fuzzy set theory uses the degree of membership to describe fuzzy things. On the basis of traditional fuzzy theory and probability and statistics, Li Deyi and others proposed a model for converting between qualitative and quantitative uncertainty, the cloud model, and formed cloud theory.
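The degree of membership can be made concrete with a small sketch. It assumes triangular membership functions and the common min/max/complement operators for fuzzy intersection, union, and negation; the temperature sets and their breakpoints are hypothetical, and the cloud model is not covered here.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at or outside a and c, rising linearly to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_and(mu1, mu2):
    """Intersection of two fuzzy sets (minimum operator)."""
    return min(mu1, mu2)

def fuzzy_or(mu1, mu2):
    """Union of two fuzzy sets (maximum operator)."""
    return max(mu1, mu2)

def fuzzy_not(mu):
    """Complement of a fuzzy set."""
    return 1.0 - mu

# Hypothetical fuzzy sets over temperature in degrees Celsius.
def warm(t):
    return triangular(t, 15, 25, 35)

def hot(t):
    return triangular(t, 25, 35, 45)

t = 30
warm_and_hot = fuzzy_and(warm(t), hot(t))
warm_or_hot = fuzzy_or(warm(t), hot(t))
```

A temperature of 30 degrees belongs to both `warm` and `hot` with degree 0.5, instead of being forced into exactly one category as a crisp set would require.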
Mining objects
According to the information storage format, the objects used for mining include relational databases, object-oriented databases, data warehouses, text data sources, multimedia databases, spatial databases, temporal databases, heterogeneous databases, and the Internet.
Data mining software
SAS EM
Modeler from IBM SPSS
K-Miner from Shenzhou General Company
Tempo from Merrill Lynch Data Technology Co., Ltd.