Statistical linguistics
Research fields
The research fields of statistical linguistics currently include mainly the following aspects:
① Counting the frequency of linguistic units, such as statistical research on the frequency of words, phonemes, and morphemes.
② Compiling statistics on a writer's word frequencies, word-length distribution, and sentence-length distribution in order to characterize the writer's style; the same method can also be used to identify the author of an anonymous article.
③ Calculating the absolute age of a language and the date at which related languages diverged from their common proto-language. Research in this area is called glottochronology (linguistic chronology), also known as lexicostatistics. In addition, the grammatical and phonetic systems of related languages can be counted and compared statistically.
④ Using the methods of information theory to study the entropy and redundancy of language. The entropy of a language is the degree of uncertainty about which linguistic symbol will appear in the process of communication: the greater the uncertainty, the higher the entropy. Once the receiver has received a symbol, the uncertainty is eliminated and the entropy falls to zero. The amount of information obtained by the receiver during communication is therefore exactly equal to the entropy eliminated. Redundancy is the proportion by which the amount of information in the language exceeds the minimum amount required. Under normal circumstances, people provide much more information than is strictly needed in order to ensure that the other party can understand. Both written and spoken language therefore have a degree of redundancy.
⑤ Exploring the general statistical laws of language. For example, in a frequency dictionary arranged in descending order of frequency, the larger a word's rank, the lower its frequency. The relationship between rank and frequency can be expressed by a mathematical formula as a statistical law. This law is called Zipf's law, after one of its researchers, the American philologist G. K. Zipf.
⑥ Using the theory of stochastic processes to study language, treating language as a sequence of interrelated letters. Each letter conditions the appearance of the next, so a chain of letters is formed. Such a chain is called a Markov chain, after its earliest researcher, the Russian mathematician A. A. Markov.
⑦ Studying the spacing between two words, between two grammatical categories, between two semantic categories, or between two syntactic types in an article to reveal the syntactic or semantic features of the article.
⑧ Studying the relationship between vocabulary size and article length to reveal the richness and variety of the vocabulary in the article.
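Several of the quantities listed above can be computed in a few lines. The following is a minimal sketch in Python, not taken from any of the works cited here. It assumes whitespace-separated word tokens, a unigram (single-word) probability model for entropy, redundancy defined as 1 - H / H_max with H_max = log2 of the number of distinct words, and the type-token ratio as a simple measure of vocabulary richness. The function names and the sample sentence are illustrative only.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) tuples, most frequent word first.

    Ties in count are broken alphabetically so the order is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def entropy_bits(tokens):
    """Shannon entropy H = -sum(p * log2(p)) of the unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(tokens):
    """Redundancy R = 1 - H / H_max, with H_max = log2(number of distinct symbols).

    With fewer than two distinct symbols H_max is zero and R is undefined;
    this sketch returns 0.0 in that case (an arbitrary convention).
    """
    distinct = len(set(tokens))
    if distinct < 2:
        return 0.0
    return 1.0 - entropy_bits(tokens) / math.log2(distinct)


def type_token_ratio(tokens):
    """Number of distinct words divided by the total number of words."""
    return len(set(tokens)) / len(tokens)


# Illustrative sample text (an assumption, not data from the source).
tokens = "the cat sat on the mat and the dog sat on the rug".split()

for rank, word, count in rank_frequency(tokens):
    print(rank, word, count)
print("entropy (bits):", entropy_bits(tokens))
print("redundancy:", redundancy(tokens))
print("type-token ratio:", type_token_ratio(tokens))
```

On a real corpus, plotting the logarithm of the count against the logarithm of the rank is the usual way to inspect how closely the data follow Zipf's law. The type-token ratio depends strongly on text length, so it should only be compared between texts of similar size.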
Development history
Statistical linguistics has a relatively long history within mathematical linguistics. When studying the Vedas, ancient Indian grammarians compiled statistics on the number of words and syllables. In 1851, the English mathematician A. De Morgan (1806-1871) used word length as a feature of writing style in statistical research. The Scottish scholar L. Campbell in 1867 and the German scholar W. Dittenberger in 1881 both used statistical methods to determine the period in which Plato's works were written. In 1887, the American scholar T. C. Mendenhall conducted a statistical analysis of English literary works, especially those of Shakespeare. In 1913, Markov studied the generation of letter sequences in Russian and proposed the theory of Markov random processes. In 1935, Zipf published Zipf's law. In 1944, the British mathematician G. U. Yule made extensive use of probability and statistical methods to study language in his book "The Statistical Study of Literary Vocabulary". In 1950, the American scholar M. Swadesh conducted research on glottochronology. In 1951, the American mathematician C. Shannon used the methods of information theory to study the entropy and redundancy of written English, and the American scholar V. Yngve analyzed syntactic phenomena. In 1954, the French scholar P. Guiraud proposed the concept of vocabulary richness, based on the frequency distribution of words in an article. In 1956, the British scholar G. Herdan published the book "Language as Choice and Chance", which systematically summarized the research results of statistical linguistics. In the past 30 years, the increasing use of computers in language statistics has gradually replaced the traditional methods of manual frequency counting and improved the efficiency and accuracy of the statistics.
Main research value
Main research in statistical linguistics
The frequency of language units
The frequency of the words used by a writer, and the distributions of word length and sentence length, to determine the writer's style
Calculating the absolute age of a language and the date at which related languages diverged from their common proto-language
Book introduction
Statistical Linguistics is an interdisciplinary subject involving linguistics, computer science, and mathematics, and it covers a wide range of topics. This book sets out the fundamentals of language statistics, the implementation of language statistics in the R language, the visual display of statistical results, and the linguistic analysis of those results. It mainly introduces the basic statistics of linguistics, parametric hypothesis testing, non-parametric hypothesis testing, analysis of variance, text clustering, text classification, and the combined application of these statistical methods.
The book is complete in structure and clearly organized, which makes it convenient for teaching and self-study. It can be used as a textbook for senior undergraduates and postgraduates in Chinese, foreign languages, computer science, and other majors. It can also be used as a reference for researchers engaged in language statistics and quantitative analysis.
Foreword
Statistical linguistics is the study of how to use probability theory, mathematical statistics, information theory, and other statistical, non-discrete mathematical methods, together with computers, to carry out statistics on and analysis of natural language. Natural language is the object of the statistics and analysis; statistical knowledge such as probability theory and mathematical statistics is the theoretical basis; and computers are the tools that make the statistics feasible. Performing statistics on language therefore requires knowledge not only of linguistics but also of mathematics and computer science.
This book is divided into 9 chapters, which explain in detail how to combine knowledge of linguistics, mathematics, and computing to carry out statistics on and analysis of language.
Chapter 1 introduces the basic concepts of statistical linguistics; clarifies how statistical linguistics differs from corpus linguistics, quantitative linguistics, and computational linguistics; describes its research content and application fields; and gives the research steps of statistical linguistics and the more detailed research content of this book. It serves as a summary of the chapters that follow.
Chapter 2 introduces corpora. It explains the definition and characteristics of a corpus and the classification of corpora according to different criteria, and gives a detailed introduction to important corpora at home and abroad, together with their processing and application.
Chapter 3 introduces the basic statistics used in language research, including some basic knowledge of probability theory and statistics, variance, standard deviation, mean, frequency, probability, mutual information, the Dice coefficient, the log-likelihood ratio, the N-gram model, Chinese character entropy, Zipf's law, the Z score, the Yule graph, the Fuchs formula, the usage and universality of words, and so on.
Chapter 4 introduces the hypothesis tests widely used in language research. Depending on whether the population under study is normally distributed, hypothesis testing is divided into parametric and non-parametric testing. The chapter discusses the U test, t test, F test, and χ² test among the parametric tests, and the χ² test and rank-sum test among the non-parametric tests. The conditions, formulas, and application fields of the different tests are compared in detail.
Chapter 5 introduces the analysis of variance, which is mainly used to compare the differences among three or more populations. It discusses one-way analysis of variance, two-way analysis of variance without replication, two-way analysis of variance with replication, and multiple comparisons for a single factor.
Chapter 6 introduces a machine learning method commonly used in language research: text clustering. The process and main algorithms of text clustering are introduced in detail, with emphasis on hierarchical clustering and k-means clustering.
Chapter 7 introduces another machine learning method commonly used in language research: text classification. It introduces the process of text classification and the main classification models in detail, including the naive Bayes model, KNN, and the support vector machine.
Chapter 8 introduces a programming language that is often used in language research, the R language, which has powerful functions for statistical analysis and plotting. The chapter focuses on the basic operations of R, the main plotting functions, and the R implementation of the statistical methods for language research used in this book.
Chapter 9 discusses computational stylistics. It gives a comprehensive account of the language features used in computational stylistics research in terms of characters, vocabulary, sentences, parts of speech, phrases, and paragraphs. Taking six novels by Mo Yan and Yu Hua as examples, it uses basic statistics, hypothesis testing, text clustering, and text classification to analyze the writing styles of the two authors systematically across these same features. The statistics for these features are mainly computed in the R language. Chapter 9 can therefore be seen as an example that combines the contents of the other chapters of the book.
This book can be used as an undergraduate textbook for seniors in Chinese, foreign languages, computer science, and other majors. The teaching time can be 32 to 64 hours. Students who have mastered the relevant linguistics and basic statistical theory, and who can use the R language to implement the statistical models introduced in this book, will have a solid foundation for using computers to carry out statistics on and analysis of natural language.
In writing this book, I have tried to make it easy to understand. All the statistics use a corpus of real novels for examples and analysis. Readers who have some knowledge of probability and statistics and of R programming can freely apply the statistical knowledge in this book to language processing. Readers who also have further programming knowledge (databases, Java, or C) can easily extend the content of this book and carry out more extensive language statistics and analysis.
The writing of this book draws on the papers and works of many scholars, and its publication is closely tied to their work. I would like to express my sincere thanks to them.
Owing to the limits of my ability and of time, this book inevitably has omissions and deficiencies. Readers are welcome to offer criticism and corrections.
Liu Ying
July 15, 2014
Book catalog
Chapter 1 Introduction
1.1 Statistical linguistics
1.2 Statistical linguistics and other subjects
1.2.1 Quantitative linguistics
1.2.2 Computational linguistics
1.2.3 Corpus linguistics
1.2.4 Relations to and differences from the three disciplines
1.3 Linguistic features studied using statistical methods
1.4 Basic research methods of statistical linguistics
1.5 Steps in statistical linguistics research
1.6 Linguistic applications of statistics
Chapter 2 Corpus
2.1 Definition of a corpus
2.2 Types of corpora
2.2.1 Spoken corpus and written corpus
2.2.2 Monolingual corpus, bilingual corpus and multilingual corpus
2.2.3 General corpus and special corpus
2.2.4 Synchronic corpus and diachronic corpus
2.2.5 Dynamic corpus and static corpus
2.2.6 Homogeneous corpus and heterogeneous corpus
2.2.7 Raw corpus and annotated corpus
2.3 Main domestic and foreign corpora
2.3.1 Foreign corpora
2.3.2 Domestic corpora
2.4 Summary of this chapter
Chapter 3 Basic application of statistics in language research
3.1 Basic concepts of statistics
3.1.1 Population, individual, sample
3.1.2 Parameters and statistics
3.1.3 Constants and variables
3.1.4 Actual and observed values
3.2 Means
3.2.1 Simple arithmetic mean
3.2.2 Weighted arithmetic mean
3.3 Variance and standard deviation
3.3.1 Variance and standard deviation of ungrouped data
3.3.2 Variance and standard deviation of grouped data
3.4 Frequency count, relative frequency, probability, conditional probability, Bayes' theorem
3.4.1 Common concepts in probability theory
3.4.2 Probability
3.4.3 Independence
3.4.4 Bayes' theorem
3.4.5 Frequency count and relative frequency
3.5 Mutual information
3.6 Z score
3.7 Dice coefficient
3.8 Phi-square coefficient (Φ²)
3.9 Log-likelihood ratio
3.10 N-gram model
3.10.1 N-grams
3.10.2 N-gram model
3.11 Three statistical laws of linguistics
3.11.1 Zipf's law
3.11.2 Menzerath-Altmann law
3.11.3 Piotrowski-Altmann law
3.12 Entropy
3.12.1 Static average information entropy
3.12.2 Limit entropy
3.13 Yule graph
3.14 Fuchs formula
3.15 Usage and versatility
3.15.1 Usage
3.15.2 Versatility
3.16 Summary of this chapter
Chapter 4 Hypothesis testing
4.1 Related concepts of hypothesis testing
4.1.1 Basic principles of hypothesis testing
4.1.2 Classification of hypotheses
4.1.3 Test statistics and critical values
4.1.4 Two-tailed test and one-tailed test
4.1.5 General steps of hypothesis testing
4.1.6 Two types of errors in hypothesis testing
4.2 Parametric hypothesis testing
4.2.1 Normal distribution
4.2.2 U test
4.2.3 t test
4.2.4 χ² test
4.2.5 F test
4.2.6 Comparison of parametric hypothesis tests
4.3 Nonparametric hypothesis testing
4.3.1 χ² test
4.3.2 Rank-sum test
4.3.3 Comparison of nonparametric hypothesis tests
4.4 Summary of this chapter
Chapter 5 Analysis of variance
5.1 Definition and basic ideas of analysis of variance
5.1.1 Definition of analysis of variance
5.1.2 The basic idea of analysis of variance
5.2 Basic concepts and conditions of use of analysis of variance
5.2.1 Basic concepts of analysis of variance
5.2.2 Conditions for using analysis of variance
5.3 Types of analysis of variance and general steps
5.3.1 Types of analysis of variance
5.3.2 General steps of analysis of variance
5.4 One-way analysis of variance
5.4.1 Equal sample sizes across factor levels
5.4.2 Unequal sample sizes across factor levels
5.4.3 Multiple comparisons in analysis of variance
5.5 Two-way analysis of variance
5.5.1 Two-way analysis of variance without replication
5.5.2 Two-way analysis of variance with replication
5.6 Summary of this chapter
Chapter 6 Text clustering
6.1 Overview of text clustering
6.1.1 Definition of text clustering
6.1.2 Process of text clustering
6.2 Data in text clustering
6.2.1 Data structures used in cluster analysis
6.2.2 Data normalization
6.3 Similarity calculation
6.3.1 Calculation of text similarity
6.3.2 Calculation of feature similarity
6.4 Clustering algorithms
6.4.1 Hierarchical clustering
6.4.2 Partition clustering
6.4.3 Relations and differences between partition clustering and hierarchical clustering
6.5 Performance evaluation of text clustering
6.5.1 Purity
6.5.2 Normalized mutual information
6.5.3 Precision
6.5.4 F value
6.6 Summary of this chapter
Chapter 7 Text classification
7.1 Definition of text classification
7.2 Classification methods
7.2.1 Methods based on knowledge engineering
7.2.2 Methods based on machine learning
7.3 Classification steps and process
7.4 Text representation and feature selection
7.4.1 Feature item selection
7.4.2 Bag-of-words model
7.4.3 Vector space model
7.4.4 Feature selection and weighting
7.5 Vector similarity measurement
7.6 Classification models
7.6.1 Naive Bayes
7.6.2 k-nearest neighbor (kNN)
7.6.3 Support vector machines
7.7 Evaluation of text classification
7.7.1 Precision and recall
7.7.2 Accuracy and error rate
7.7.3 F value
7.7.4 Micro-average and macro-average
7.8 Summary of this chapter
Chapter 8 Introduction to the R language
8.1 R language help files
8.1.1 Online help for basic R knowledge
8.1.2 Online help for key characters and functions in R programs
8.2 R packages
8.2.1 Package installation
8.2.2 Package loading
8.3 R language data structures and basic functions
8.3.1 R language object types
8.3.2 R language object creation
8.3.3 Commonly used statistical functions for numeric vectors
8.4 Data reading and storage
8.4.1 Data reading
8.4.2 Data storage
8.5 Basic plotting in R
8.5.1 Pie plot
8.5.2 Bar plot
8.5.3 Histogram
8.5.4 Line plot (matplot)
8.5.5 Box plot
8.5.6 Scatter diagram
8.5.7 Scatter diagram matrix (scatterplot matrices)
8.6 Hypothesis testing
8.6.1 Parametric hypothesis testing
8.6.2 Nonparametric hypothesis testing
8.7 Analysis of variance
8.7.1 Test for homogeneity of variance
8.7.2 One-way analysis of variance
8.7.3 Two-way analysis of variance
8.8 Summary of this chapter
Chapter 9 Research on computational stylistics
9.1 Language features used in research on computational stylistics
9.1.1 Character aspects
9.1.2 Vocabulary aspects
9.1.3 Sentence aspects
9.1.4 Parts of speech
9.1.5 Phrases and grammatical structure
9.1.6 Paragraph aspects
9.2 Methods often used in the study of computational stylistics
9.3 Research on the computational stylistics of Mo Yan's and Yu Hua's novels
9.3.1 Style analysis based on frequency
9.3.2 Style analysis based on hypothesis testing
9.3.3 Style analysis based on text clustering
9.3.4 Style analysis based on text classification
9.3.5 Summary
9.4 Summary of this chapter
Appendix: Tables of commonly used statistics
Attached Table 1 Standard normal distribution function values
Attached Table 2 Values of the coefficients ai(n) of the normality test statistic W
Attached Table 3 α quantiles Wα of the normality test statistic W
Attached Table 4 α quantiles Yα of the normality test statistic Y
Attached Table 5 t test critical values
Attached Table 6 χ² test critical values
Attached Table 7 F test critical values
Attached Table 8 Wilcoxon rank-sum test critical values
Attached Table 9 Quantiles H1-α(r, f) of the statistic H
Attached Table 10 Multiple comparison q1-α(r, f) values