Технически процес
Consideringthedataitself,dataminingusuallyrequires8stepsincludingdatacleaning,datatransformation,dataminingimplementationprocess,patternevaluationandknowledgerepresentation.
(1)Informationcollection:Abstractthecharacteristicinformationneededindataanalysisaccordingtothedetermineddataanalysisobject,thenselecttheappropriateinformationcollectionmethod,andstorethecollectedinformationinthedatabase.Formassivedata,choosingasuitabledatawarehousefordatastorageandmanagementiscrucial.
(2)Dataintegration:Logicallyorphysicallycentralizedatafromdifferentsources,formats,andcharacteristics,soastoprovideenterpriseswithcomprehensivedatasharing.
(3)Dataprotocol:Ittakesalongtimetoexecutemostdataminingalgorithmsevenonasmallamountofdata,andtheamountofdataisoftenverylargewhendoingbusinessoperationdatamining.Dataspecificationtechnologycanbeusedtoobtainthespecificationrepresentationofthedataset,whichismuchsmaller,butstillclosetomaintainingtheintegrityoftheoriginaldata,andtheresultsofdataminingafterthespecificationarethesameoralmostthesameastheresultsbeforethespecification.
(4)Datacleaning:someofthedatainthedatabaseisincomplete(someattributesofinterestaremissingattributevalues),noisy(includingwrongattributevalues),andareinconsistent(Thesameinformationisexpressedindifferentways),sodatacleaningisrequiredtostorecomplete,correct,andconsistentdatainformationinthedatawarehouse.
(5)Datatransformation:Transformdataintoaformsuitablefordataminingthroughsmoothaggregation,datageneralization,andstandardization.Forsomereal-numbereddata,itisalsoanimportantsteptotransformthedatathroughconceptualstratificationanddiscretizationofthedata.
(6)Dataminingprocess:Accordingtothedatainformationinthedatawarehouse,selecttheappropriateanalysistools,applystatisticalmethods,case-basedreasoning,decisiontrees,rule-basedreasoning,fuzzysets,evenneuralnetworks,geneticsThealgorithmicmethodprocessesinformationandobtainsusefulanalysisinformation.
(7)Modelevaluation:Fromabusinessperspective,industryexpertsverifythecorrectnessofthedataminingresults.
(8)Knowledgerepresentation:theanalysisinformationobtainedbydataminingispresentedtousersinavisualmanner,orstoredasnewknowledgeintheknowledgebaseforusebyotherapplications.
Thedataminingprocessisaniterativeprocess.Ifeachstepdoesnotachievetheexpectedgoal,youneedtogobacktothepreviousstep,re-adjustandexecuteit.Noteverydataminingjobrequireseverysteplistedhere.Forexample,whentherearenomultipledatasourcesinacertainjob,thestep(2)dataintegrationstepcanbeomitted.
Стъпка (3) Спецификация на данните (4) Почистване на данни (5) Трансформацията на данни се нарича още предварителна обработка на данни. Информационното повлияване, най-малко 60% от разходите може да бъде изразходвано в стъпка (1) етап на събиране на информация и най-малко 60% от енергията и времето са изразходвани за предварителна обработка на данни
Операционен метод
Невронна мрежа
Becauseofitsgoodrobustness,self-organizationandadaptability,parallelprocessing,distributedstorage,andhighfaulttolerance,neuralnetworksareverysuitableforsolvingdataminingproblems.Theyareusedforclassification,Thefeedforwardneuralnetworkmodelforpredictionandpatternrecognition;representedbyHopfield'sdiscretemodelandcontinuousmodel,thefeedbackneuralnetworkmodelusedforassociativememoryandoptimizationcalculations;representedbytheartmodelandtheKoholonmodel,usingSelf-organizingmappingmethodforclustering.Thedisadvantageoftheneuralnetworkmethodisthe"blackbox"nature,anditisdifficultforpeopletounderstandthelearninganddecision-makingprocessofthenetwork.
Генетичен алгоритъм
Генетичен алгоритъмisarandomsearchalgorithmbasedonbiologicalnaturalselectionandgeneticmechanism.Theimplicitparallelism,easyintegrationwithothermodelsandotherpropertiesofgeneticalgorithmmakeitbeusedindatamining.
Sunilhassuccessfullydevelopedadataminingtoolbasedongeneticalgorithm,usingthistooltoconductdataminingexperimentsontherealdatabasesoftwoplanecrashes,theresultsshowthatgeneticalgorithmisaneffectivemethodfordataminingOneof[4].Theapplicationofgeneticalgorithmisalsoreflectedinthecombinationwithneuralnetwork,roughsetandothertechnologies.Forexample,thegeneticalgorithmisusedtooptimizethestructureoftheneuralnetwork,andtheredundantconnectionsandhiddenunitsaredeletedwithoutincreasingtheerrorrate;thegeneticalgorithmandthebpalgorithmareusedtotraintheneuralnetwork,andthentherulesareextractedfromthenetwork.However,thealgorithmofgeneticalgorithmismorecomplicated,andtheproblemofearlyconvergenceinthelocalminimumhasnotbeensolvedyet.
Метод на дървото на решенията
Decisiontreeisanalgorithmcommonlyusedinpredictivemodels.Itcanfindsomevaluableandpotentialinformationfromalargeamountofdatabypurposefullyclassifyingit.Itsmainadvantagesaresimpledescriptionandfastclassificationspeed,whichisespeciallysuitableforlarge-scaledataprocessing.Themostinfluentialandearliestdecisiontreemethodisthefamousid3algorithmbasedoninformationentropyproposedbyquinlan.Itsmainproblemsare:id3isanon-incrementallearningalgorithm;id3decisiontreeisaunivariatedecisiontree,itisdifficulttoexpresscomplexconcepts;therelationshipbetweenthesamesexisnotemphasizedenough;noiseresistanceispoor.Inresponsetotheaboveproblems,manybetterimprovedalgorithmshaveemerged.Forexample,Schlimmerandfisherdesignedtheid4incrementallearningalgorithm;ZhongMingandChenWenweiproposedtheiblealgorithm.
Груб метод
Roughsettheoryisamathematicaltoolforstudyinginaccurateanduncertainknowledge.Theroughsetmethodhasseveraladvantages:noadditionalinformationisrequired;theexpressionspaceoftheinputinformationissimplified;thealgorithmissimpleandeasytooperate.Theobjectofroughsetprocessingisaninformationtablesimilartoatwo-dimensionalrelationaltable.However,themathematicalbasisofroughsetsissettheory,anditisdifficulttodirectlydealwithcontinuousattributes.Thecontinuousattributesintheactualinformationtableareuniversal.Therefore,thediscretizationofcontinuousattributesisthedifficultythatrestrictsthepracticalapplicationofroughsettheory.
Methodofcoveringpositiveexamplesandrejectingcounterexamples
Itusestheideaofcoveringallpositiveexamplesandrejectingallcounterexamplestofindrules.First,chooseaseedfromthesetofpositiveexamplesandcomparethemonebyoneinthesetofnegativeexamples.Ifitiscompatiblewiththeselectorformedbythefieldvalue,itwillbediscarded,otherwise,itwillberetained.Accordingtothisthought,allpositiveexampleseedsarelooped,andtheruleofpositiveexample(theconjunctiveofselector)willbeobtained.ThemoretypicalalgorithmsareMichalski'saq11method,HongJiarong'simprovedaq15methodandhisae5method.
Метод за статистически анализ
Therearetworelationshipsbetweendatabasefielditems:functionalrelationship(deterministicrelationshipthatcanbeexpressedbyfunctionformula)andcorrelationrelationship(notexpressedbyfunctionformula),Butitisstillarelevantdeterministicrelationship).Statisticalmethodscanbeusedfortheiranalysis,thatis,theuseofstatisticalprinciplestoanalyzetheinformationinthedatabase.Commonstatistics(seekingthemaximum,minimum,sum,average,etc.inalargeamountofdata),regressionanalysis(usingregressionequationstoexpressthequantitativerelationshipbetweenvariables),correlationanalysis(usingcorrelationcoefficientstomeasurethecorrelationbetweenvariables)Degree),differenceanalysis(fromthevalueofthesamplestatisticstodeterminewhetherthereisadifferencebetweentheoverallparameters),etc.
Fuzzyset метод
Thefuzzysettheoryisusedtoperformfuzzyevaluation,fuzzydecision-making,fuzzypatternrecognitionandfuzzyclusteranalysisonpracticalproblems.Thehigherthecomplexityofthesystem,thestrongerthefuzziness.Generally,fuzzysettheoryusesthedegreeofmembershiptodescribethefuzzythings.Onthebasisoftraditionalfuzzytheoryandprobabilityandstatistics,LiDeyiandothersproposedaqualitativeandquantitativeuncertaintyconversionmodel-thecloudmodel,andformedthecloudtheory.
Минни обекти
Accordingtotheinformationstorageformat,theobjectsusedforminingincluderelationaldatabases,object-orienteddatabases,datawarehouses,textdatasources,multimediadatabases,spatialdatabases,andtemporaldatabases,Heterogeneousdatabasesandinternet,etc.
Софтуер за данни
САСЕМ
Моделист на IBMMSPSSCompany
K-MinerofShenzhouGeneralCompany
TempoofMerrillLynchDataTechnologyCo., Ltd.