Технология за извличане на данни

Технически процес

Consideringthedataitself,dataminingusuallyrequires8stepsincludingdatacleaning,datatransformation,dataminingimplementationprocess,patternevaluationandknowledgerepresentation.

(1)Informationcollection:Abstractthecharacteristicinformationneededindataanalysisaccordingtothedetermineddataanalysisobject,thenselecttheappropriateinformationcollectionmethod,andstorethecollectedinformationinthedatabase.Formassivedata,choosingasuitabledatawarehousefordatastorageandmanagementiscrucial.

(2)Dataintegration:Logicallyorphysicallycentralizedatafromdifferentsources,formats,andcharacteristics,soastoprovideenterpriseswithcomprehensivedatasharing.

(3)Dataprotocol:Ittakesalongtimetoexecutemostdataminingalgorithmsevenonasmallamountofdata,andtheamountofdataisoftenverylargewhendoingbusinessoperationdatamining.Dataspecificationtechnologycanbeusedtoobtainthespecificationrepresentationofthedataset,whichismuchsmaller,butstillclosetomaintainingtheintegrityoftheoriginaldata,andtheresultsofdataminingafterthespecificationarethesameoralmostthesameastheresultsbeforethespecification.

(4)Datacleaning:someofthedatainthedatabaseisincomplete(someattributesofinterestaremissingattributevalues),noisy(includingwrongattributevalues),andareinconsistent(Thesameinformationisexpressedindifferentways),sodatacleaningisrequiredtostorecomplete,correct,andconsistentdatainformationinthedatawarehouse.

(5)Datatransformation:Transformdataintoaformsuitablefordataminingthroughsmoothaggregation,datageneralization,andstandardization.Forsomereal-numbereddata,itisalsoanimportantsteptotransformthedatathroughconceptualstratificationanddiscretizationofthedata.

(6)Dataminingprocess:Accordingtothedatainformationinthedatawarehouse,selecttheappropriateanalysistools,applystatisticalmethods,case-basedreasoning,decisiontrees,rule-basedreasoning,fuzzysets,evenneuralnetworks,geneticsThealgorithmicmethodprocessesinformationandobtainsusefulanalysisinformation.

(7)Modelevaluation:Fromabusinessperspective,industryexpertsverifythecorrectnessofthedataminingresults.

(8)Knowledgerepresentation:theanalysisinformationobtainedbydataminingispresentedtousersinavisualmanner,orstoredasnewknowledgeintheknowledgebaseforusebyotherapplications.

Thedataminingprocessisaniterativeprocess.Ifeachstepdoesnotachievetheexpectedgoal,youneedtogobacktothepreviousstep,re-adjustandexecuteit.Noteverydataminingjobrequireseverysteplistedhere.Forexample,whentherearenomultipledatasourcesinacertainjob,thestep(2)dataintegrationstepcanbeomitted.

Стъпка (3) Спецификация на данните (4) Почистване на данни (5) Трансформацията на данни се нарича още предварителна обработка на данни. Информационното повлияване, най-малко 60% от разходите може да бъде изразходвано в стъпка (1) етап на събиране на информация и най-малко 60% от енергията и времето са изразходвани за предварителна обработка на данни

Операционен метод

Невронна мрежа

Becauseofitsgoodrobustness,self-organizationandadaptability,parallelprocessing,distributedstorage,andhighfaulttolerance,neuralnetworksareverysuitableforsolvingdataminingproblems.Theyareusedforclassification,Thefeedforwardneuralnetworkmodelforpredictionandpatternrecognition;representedbyHopfield'sdiscretemodelandcontinuousmodel,thefeedbackneuralnetworkmodelusedforassociativememoryandoptimizationcalculations;representedbytheartmodelandtheKoholonmodel,usingSelf-organizingmappingmethodforclustering.Thedisadvantageoftheneuralnetworkmethodisthe"blackbox"nature,anditisdifficultforpeopletounderstandthelearninganddecision-makingprocessofthenetwork.

Генетичен алгоритъм

Генетичен алгоритъмisarandomsearchalgorithmbasedonbiologicalnaturalselectionandgeneticmechanism.Theimplicitparallelism,easyintegrationwithothermodelsandotherpropertiesofgeneticalgorithmmakeitbeusedindatamining.

Sunilhassuccessfullydevelopedadataminingtoolbasedongeneticalgorithm,usingthistooltoconductdataminingexperimentsontherealdatabasesoftwoplanecrashes,theresultsshowthatgeneticalgorithmisaneffectivemethodfordataminingOneof[4].Theapplicationofgeneticalgorithmisalsoreflectedinthecombinationwithneuralnetwork,roughsetandothertechnologies.Forexample,thegeneticalgorithmisusedtooptimizethestructureoftheneuralnetwork,andtheredundantconnectionsandhiddenunitsaredeletedwithoutincreasingtheerrorrate;thegeneticalgorithmandthebpalgorithmareusedtotraintheneuralnetwork,andthentherulesareextractedfromthenetwork.However,thealgorithmofgeneticalgorithmismorecomplicated,andtheproblemofearlyconvergenceinthelocalminimumhasnotbeensolvedyet.

Метод на дървото на решенията

Decisiontreeisanalgorithmcommonlyusedinpredictivemodels.Itcanfindsomevaluableandpotentialinformationfromalargeamountofdatabypurposefullyclassifyingit.Itsmainadvantagesaresimpledescriptionandfastclassificationspeed,whichisespeciallysuitableforlarge-scaledataprocessing.Themostinfluentialandearliestdecisiontreemethodisthefamousid3algorithmbasedoninformationentropyproposedbyquinlan.Itsmainproblemsare:id3isanon-incrementallearningalgorithm;id3decisiontreeisaunivariatedecisiontree,itisdifficulttoexpresscomplexconcepts;therelationshipbetweenthesamesexisnotemphasizedenough;noiseresistanceispoor.Inresponsetotheaboveproblems,manybetterimprovedalgorithmshaveemerged.Forexample,Schlimmerandfisherdesignedtheid4incrementallearningalgorithm;ZhongMingandChenWenweiproposedtheiblealgorithm.

Груб метод

Roughsettheoryisamathematicaltoolforstudyinginaccurateanduncertainknowledge.Theroughsetmethodhasseveraladvantages:noadditionalinformationisrequired;theexpressionspaceoftheinputinformationissimplified;thealgorithmissimpleandeasytooperate.Theobjectofroughsetprocessingisaninformationtablesimilartoatwo-dimensionalrelationaltable.However,themathematicalbasisofroughsetsissettheory,anditisdifficulttodirectlydealwithcontinuousattributes.Thecontinuousattributesintheactualinformationtableareuniversal.Therefore,thediscretizationofcontinuousattributesisthedifficultythatrestrictsthepracticalapplicationofroughsettheory.

Methodofcoveringpositiveexamplesandrejectingcounterexamples

Itusestheideaof​​coveringallpositiveexamplesandrejectingallcounterexamplestofindrules.First,chooseaseedfromthesetofpositiveexamplesandcomparethemonebyoneinthesetofnegativeexamples.Ifitiscompatiblewiththeselectorformedbythefieldvalue,itwillbediscarded,otherwise,itwillberetained.Accordingtothisthought,allpositiveexampleseedsarelooped,andtheruleofpositiveexample(theconjunctiveofselector)willbeobtained.ThemoretypicalalgorithmsareMichalski'saq11method,HongJiarong'simprovedaq15methodandhisae5method.

Метод за статистически анализ

Therearetworelationshipsbetweendatabasefielditems:functionalrelationship(deterministicrelationshipthatcanbeexpressedbyfunctionformula)andcorrelationrelationship(notexpressedbyfunctionformula),Butitisstillarelevantdeterministicrelationship).Statisticalmethodscanbeusedfortheiranalysis,thatis,theuseofstatisticalprinciplestoanalyzetheinformationinthedatabase.Commonstatistics(seekingthemaximum,minimum,sum,average,etc.inalargeamountofdata),regressionanalysis(usingregressionequationstoexpressthequantitativerelationshipbetweenvariables),correlationanalysis(usingcorrelationcoefficientstomeasurethecorrelationbetweenvariables)Degree),differenceanalysis(fromthevalueofthesamplestatisticstodeterminewhetherthereisadifferencebetweentheoverallparameters),etc.

Fuzzyset метод

Thefuzzysettheoryisusedtoperformfuzzyevaluation,fuzzydecision-making,fuzzypatternrecognitionandfuzzyclusteranalysisonpracticalproblems.Thehigherthecomplexityofthesystem,thestrongerthefuzziness.Generally,fuzzysettheoryusesthedegreeofmembershiptodescribethefuzzythings.Onthebasisoftraditionalfuzzytheoryandprobabilityandstatistics,LiDeyiandothersproposedaqualitativeandquantitativeuncertaintyconversionmodel-thecloudmodel,andformedthecloudtheory.

Минни обекти

Accordingtotheinformationstorageformat,theobjectsusedforminingincluderelationaldatabases,object-orienteddatabases,datawarehouses,textdatasources,multimediadatabases,spatialdatabases,andtemporaldatabases,Heterogeneousdatabasesandinternet,etc.

Софтуер за данни

САСЕМ

Моделист на IBMMSPSSCompany

K-MinerofShenzhouGeneralCompany

TempoofMerrillLynchDataTechnologyCo., Ltd.

Related Articles
TOP