Statistical linguistics

honggarae 24/03/2022 1182


Research fields

The research fields of statistical linguistics currently include the following main aspects:

① Counting how often linguistic units occur, for example statistical research on the frequency of words, phonemes and morphemes.

② Compiling statistics on a writer's word frequency, word-length distribution and sentence-length distribution in order to characterize the writer's style; this method can also be used to identify the author of an anonymous text.

③ Calculating the absolute age of a language and the date at which related languages diverged from their common proto-language. Research in this area is called glottochronology, also known as lexicostatistics. In addition, the grammatical and phonetic systems of related languages can be counted and compared.

④ Using the methods of information theory to study the entropy and redundancy of language. The entropy of a language is the degree of uncertainty about which linguistic sign will appear during communication. Once the receiver has received the sign, the uncertainty is eliminated and the entropy falls to zero. The amount of information the receiver obtains in the communication process is therefore exactly equal to the entropy eliminated. Redundancy refers to the proportion of the information in the language that exceeds the minimum amount required. Under normal circumstances people provide much more information than is strictly needed, to ensure that the other party can understand. Consequently, both written and spoken language have a degree of redundancy.

⑤ Exploring the general statistical laws of language. For example, in a frequency dictionary arranged in descending order of frequency, the larger the rank of a word, the lower its frequency. The relationship between rank and frequency can be described by a mathematical formula. This statistical law is called Zipf's law, named after one of its researchers, the American philologist G. K. Zipf.

⑥ Using the theory of stochastic processes to study language, treating language as a sequence of interrelated letters. Each letter conditions the appearance of the next, so a chain of letters is formed. This is called a Markov chain, named after its earliest researcher, the Russian mathematician A. A. Markov.

⑦ Studying the spacing between two words, between two grammatical categories, between two semantic categories, or between two syntactic types in a text, to reveal the syntactic or semantic features of the text.

⑧ Studying the relationship between vocabulary size and text length, to reveal the richness and diversity of the vocabulary in the text.
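Items ①, ④ and ⑤ above can be illustrated with a small sketch. The function names and the sample sentence are hypothetical, and the formulas are the standard textbook ones (Shannon entropy of the unigram distribution, and redundancy as 1 - H / H_max), which the source does not spell out. Treat it as one possible reading, not a definitive implementation.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first.

    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def entropy_bits(tokens):
    """Shannon entropy H = -sum(p * log2(p)) of the unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(tokens):
    """Redundancy R = 1 - H / H_max, with H_max = log2(number of distinct symbols).

    Assumption: with fewer than two distinct symbols the ratio is undefined,
    so this sketch simply returns 0.0 in that case.
    """
    distinct = len(set(tokens))
    if distinct < 2:
        return 0.0
    return 1.0 - entropy_bits(tokens) / math.log2(distinct)


# Hypothetical toy sample; a real study would use a large corpus.
sample = "the cat saw the dog and the dog saw the cat".split()
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
print("entropy (bits per word):", entropy_bits(sample))
print("redundancy:", redundancy(sample))
```

On a large corpus, plotting the logarithm of the count against the logarithm of the rank from `rank_frequency` is the usual way to inspect how closely the data follow Zipf's law.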

Development history

Statistical linguistics has a relatively long history within mathematical linguistics. When studying the Vedas, ancient Indian grammarians compiled statistics on the number of words and syllables. In 1851, the English mathematician A. De Morgan (1806-1871) used word length as a stylistic feature in statistical research. The Scottish scholar L. Campbell in 1867 and the German scholar W. Dittenberger in 1881 both used statistical methods to date Plato's works. In 1887, the American scholar T. C. Mendenhall conducted a statistical analysis of English literary works, especially those of Shakespeare. In 1913, Markov studied the generation of letter sequences in Russian and proposed the theory of Markov random processes. In 1935, Zipf published Zipf's law. In 1944, the British mathematician G. U. Yule made extensive use of probability and statistical methods to study language in his book "The Statistical Study of Literary Vocabulary". In 1950, the American scholar M. Swadesh conducted research on glottochronology. In 1951, the American mathematician C. Shannon used the methods of information theory to study the entropy and redundancy of written English, and the American scholar V. Yngve analyzed syntactic phenomena. In 1954, the French scholar P. Guiraud proposed the concept of vocabulary richness, based on the frequency distribution of words in a text. In 1956, the British scholar G. Herdan published the book "Language as Choice and Chance", which systematically summarized the research results of statistical linguistics. In recent decades, the increasing use of computers in language statistics has gradually replaced the traditional methods of manual frequency counting and improved the efficiency and accuracy of the statistics.

Main research value

Main research topics in statistical linguistics

The frequency of language units

The frequency of the words a writer uses, together with word-length and sentence-length distributions, to determine the writer's style

Calculating the absolute age of a language and the date at which related languages diverged from their common proto-language

Book Introduction

Statistical linguistics is an interdisciplinary subject involving linguistics, computer science and mathematics, and it covers a wide range of topics. This book explains the fundamentals of language statistics, the implementation of language statistics in the R language, the visual display of statistical results, and the linguistic analysis of those results. It mainly introduces the basic statistics of linguistics, parametric hypothesis testing, non-parametric hypothesis testing, analysis of variance, text clustering, text classification, and the combined application of these statistical methods.

The book is complete in structure and clearly organized, which makes it convenient for both teaching and self-study. It can be used as a textbook for senior undergraduates and postgraduates majoring in Chinese, foreign languages, computer science and related subjects. It can also serve as a reference for researchers engaged in language statistics and quantitative analysis.

Foreword

Statistical linguistics studies how to use probability theory, mathematical statistics, information theory and other statistical, non-discrete mathematical methods, together with computers, to count and analyze natural language. Natural language is the object of its statistics and analysis; statistical knowledge such as probability theory and mathematical statistics is its theoretical basis; and computers are the tools that carry out the statistics. Performing statistics on language therefore requires knowledge not only of linguistics but also of mathematics and computer science.

This book is divided into 9 chapters, which explain in detail how to combine knowledge of linguistics, mathematics and computing to carry out the statistics and analysis of language.

Chapter 1 introduces the basic concepts of statistical linguistics; clarifies how statistical linguistics differs from corpus linguistics, quantitative linguistics and computational linguistics; describes its research content and fields of application; and sets out the research steps of statistical linguistics and the more detailed research content of this book. It serves as an overview of the chapters that follow.

Chapter 2 introduces the corpus. It explains the definition and characteristics of corpora and their classification according to different criteria, and gives a detailed introduction to major corpora in China and abroad, including their processing and application.

Chapter 3 introduces the basic statistics used in language research, including some basic knowledge of probability theory and statistics: variance, standard deviation, mean, frequency, probability, mutual information, the Dice coefficient, the log-likelihood ratio, the N-gram model, Chinese character entropy, Zipf's law, the Z score, the Yule diagram, the Fuchs formula, and the usage and versatility of words.
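Two of the association measures listed for Chapter 3 can be sketched as follows. This is a minimal illustration that assumes the common collocation-style definitions, namely pointwise mutual information log2(P(x, y) / (P(x) P(y))) and the Dice coefficient 2 f(x, y) / (f(x) + f(y)); the book's own formulas may differ in detail. The function names and the counts are hypothetical.

```python
import math


def pointwise_mutual_information(f_xy, f_x, f_y, n):
    """PMI in bits, with probabilities estimated as counts divided by n.

    f_xy: number of times x and y occur together
    f_x, f_y: number of times x and y occur at all
    n: total number of observations (for example, bigram positions)
    Assumes all counts are positive.
    """
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))


def dice_coefficient(f_xy, f_x, f_y):
    """Dice coefficient: 2 * f(x, y) / (f(x) + f(y)), ranging from 0 to 1."""
    return 2 * f_xy / (f_x + f_y)


# Hypothetical counts for a word pair in a 1000-position corpus.
pmi = pointwise_mutual_information(f_xy=10, f_x=20, f_y=40, n=1000)
dice = dice_coefficient(f_xy=10, f_x=20, f_y=40)
print("PMI (bits):", pmi)
print("Dice:", dice)
```

A PMI near zero suggests the two words co-occur about as often as independence would predict, while larger positive values suggest a stronger association.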

Chapter 4 introduces the hypothesis tests widely used in language research. Depending on whether the population under study is normally distributed, these are divided into parametric and non-parametric hypothesis tests. The chapter discusses the U test, t test, F test and χ² test among the parametric tests, and the χ² test and rank-sum test among the non-parametric tests. The conditions, formulas and fields of application of the different tests are compared in detail.
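As an illustration of the χ² test mentioned for Chapter 4, the sketch below computes the Pearson χ² statistic for a 2×2 table, for instance to compare how often a word occurs in two texts. The function name and the counts are hypothetical, no continuity correction is applied, and all expected counts are assumed to be positive.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    Example layout (an assumption made for this sketch):
        a = occurrences of the word in text 1, b = all other tokens in text 1
        c = occurrences of the word in text 2, d = all other tokens in text 2
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected count of each cell under independence: row total * column total / n
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts. With 1 degree of freedom, the critical value
# at the 0.05 significance level is 3.841.
statistic = chi_square_2x2(10, 20, 20, 10)
print("chi-square:", statistic)
print("significant at 0.05:", statistic > 3.841)
```

When the statistic exceeds the critical value, the hypothesis that the word is used at the same rate in both texts is rejected at that significance level.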

Chapter 5 introduces the analysis of variance, which is mainly used to compare differences among three or more populations. It discusses one-way analysis of variance, two-way analysis of variance without replication, two-way analysis of variance with replication, and multiple comparisons for a single factor.

Chapter 6 introduces text clustering, a machine learning method commonly used in language research. The process and main algorithms of text clustering are presented in detail, with emphasis on hierarchical clustering and k-means clustering.

Chapter 7 introduces text classification, another machine learning method commonly used in language research. It presents the process of text classification and the main classification models in detail, including the naive Bayes model, KNN and support vector machines.

Chapter 8 introduces the R language, a programming language often used in language research, which has powerful functions for statistical analysis and plotting. The chapter focuses on the basic operations of R, its main plotting functions, and the R implementation of the statistical methods used in this book.

Chapter 9 discusses computational stylistics. It gives a comprehensive account of the linguistic features used in computational stylistics research at the levels of characters, vocabulary, sentences, parts of speech, phrases and paragraphs. Taking six novels by Mo Yan and Yu Hua as examples, it applies basic statistics, hypothesis testing, text clustering and text classification to these features in order to analyze the two authors' writing styles systematically. The statistics for these features are computed mainly in R. Chapter 9 can therefore be regarded as a worked example that combines the contents of the other chapters of the book.

This book can be used as a textbook for senior undergraduates majoring in Chinese, foreign languages, computer science and related subjects. The teaching time can be 32 to 64 hours. If students have mastered the relevant linguistics and basic statistical theory, and can use the R language to implement the statistical models introduced in this book, they will have a solid foundation for the computer-based statistics and analysis of natural language.

In writing this book, I have tried to make it easy to understand. All statistical examples and analyses use a corpus of real novels. Readers who have some knowledge of probability, statistics and R programming can freely apply the statistical methods in this book to language processing. Readers who also have broader programming knowledge (databases, Java or C) can easily extend the existing content of this book and carry out more extensive language statistics and analysis.

The writing of this book drew on the papers and works of many scholars, and its publication owes much to their work. I would like to express my sincere thanks to them.

Owing to the limits of my ability and time, this book inevitably contains omissions and shortcomings. Readers are welcome to offer criticism and corrections.

Liu Ying

July15,2014

Book Catalog

Chapter 1 Introduction

1.1 Statistical linguistics

1.2 Statistical linguistics and other subjects

1.2.1 Quantitative linguistics

1.2.2 Computational linguistics

1.2.3 Corpus linguistics

1.2.4 Relations and differences with the three disciplines

1.3 Linguistic features studied using statistical methods

1.4 Basic research methods of statistical linguistics

1.5 Steps in statistical linguistics research

1.6 Linguistic applications of statistics

Chapter 2 Corpus

2.1 Definition of a corpus

2.2 Types of corpora

2.2.1 Spoken corpus and written corpus

2.2.2 Monolingual corpus, bilingual corpus and multilingual corpus

2.2.3 General corpus and special corpus

2.2.4 Synchronic corpus and diachronic corpus

2.2.5 Dynamic corpus and static corpus

2.2.6 Homogeneous corpus and heterogeneous corpus

2.2.7 Raw corpus and annotated corpus

2.3 Main domestic and foreign corpora

2.3.1 Foreign corpora

2.3.2 Domestic corpora

2.4 Summary of this chapter

Chapter 3 Basic application of statistics in language research

3.1 Basic concepts of statistics

3.1.1 Population, individual, sample

3.1.2 Parameters and statistics

3.1.3 Constants and variables

3.1.4 Actual and observed values

3.2 Means

3.2.1 Simple arithmetic mean

3.2.2 Weighted arithmetic mean

3.3 Variance and standard deviation

3.3.1 Variance and standard deviation of ungrouped data

3.3.2 Variance and standard deviation of grouped data

3.4 Frequency count, relative frequency, probability, conditional probability, Bayes' theorem

3.4.1 Common concepts in probability theory

3.4.2 Probability

3.4.3 Independence

3.4.4 Bayes' theorem

3.4.5 Frequency count and relative frequency

3.5 Mutual information

3.6 Z score

3.7 Dice coefficient

3.8 Phi-square coefficient (Φ²)

3.9 Log-likelihood ratio

3.10 N-gram model

3.10.1 N-grams

3.10.2 N-gram model

3.11 Three statistical laws of linguistics

3.11.1 Zipf's law

3.11.2 Menzerath-Altmann law

3.11.3 Piotrowski-Altmann law

3.12 Entropy

3.12.1 Static average information entropy

3.12.2 Limit entropy

3.13 Yule diagram

3.14 Fuchs formula

3.15 Usage and versatility

3.15.1 Usage

3.15.2 Versatility

3.16 Summary of this chapter

Chapter 4 Hypothesis testing

4.1 Related concepts of hypothesis testing

4.1.1 Basic principles of hypothesis testing

4.1.2 Classification of hypotheses

4.1.3 Test statistics and critical values

4.1.4 Two-tailed test and one-tailed test

4.1.5 The general steps of a hypothesis test

4.1.6 Two types of errors in hypothesis testing

4.2 Parametric hypothesis testing

4.2.1 Normal distribution

4.2.2 U test

4.2.3 t test

4.2.4 χ² test

4.2.5 F test


4.2.6 Comparison of parametric hypothesis tests

4.3 Non-parametric hypothesis testing

4.3.1 χ² test

4.3.2 Rank-sum test

4.3.3 Comparison of non-parametric hypothesis tests

4.4 Summary of this chapter

Chapter 5 Analysis of variance

5.1 Definition and basic ideas of analysis of variance

5.1.1 Definition of analysis of variance

5.1.2 The basic idea of analysis of variance

5.2 Basic concepts and conditions of use of analysis of variance

5.2.1 Basic concepts of analysis of variance

5.2.2 Conditions for using analysis of variance

5.3 Types of analysis of variance and general steps

5.3.1 Types of analysis of variance

5.3.2 General steps of analysis of variance

5.4 One-way analysis of variance

5.4.1 Equal sample sizes across factor levels

5.4.2 Unequal sample sizes across factor levels

5.4.3 Multiple comparisons in analysis of variance

5.5 Two-way analysis of variance

5.5.1 Two-way analysis of variance without replication

5.5.2 Two-way analysis of variance with replication

5.6 Summary of this chapter

Chapter 6 Text clustering

6.1 Overview of text clustering

6.1.1 Definition of text clustering

6.1.2 Process of text clustering

6.2 Data in text clustering

6.2.1 Data structures used in cluster analysis

6.2.2 Data normalization

6.3 Similarity calculation

6.3.1 Calculation of text similarity

6.3.2 Calculation of feature similarity

6.4 Clustering algorithms

6.4.1 Hierarchical clustering

6.4.2 Partition clustering

6.4.3 Relations and differences between partition clustering and hierarchical clustering

6.5 Performance evaluation of text clustering

6.5.1 Purity

6.5.2 Normalized mutual information

6.5.3 Precision

6.5.4 F value

6.6 Summary of this chapter

Chapter 7 Text classification

7.1 Definition of text classification

7.2 Classification methods

7.2.1 Methods based on knowledge engineering

7.2.2 Methods based on machine learning

7.3 Classification steps and process

7.4 Text representation and feature selection

7.4.1 Feature item selection

7.4.2 Bag-of-words model

7.4.3 Vector space model

7.4.4 Feature selection and weighting

7.5 Vector similarity measurement

7.6 Classification models

7.6.1 Naive Bayes

7.6.2 k-nearest neighbor (k-Nearest Neighbor)

7.6.3 Support vector machines

7.7 Evaluation of text classification

7.7.1 Precision and recall

7.7.2 Accuracy and error rate

7.7.3 F value

7.7.4 Micro-average and macro-average

7.8 Summary of this chapter

Chapter 8 Introduction to the R language

8.1 R language help files

8.1.1 Online help for basic R knowledge

8.1.2 Online help for key characters and functions in R programs

8.2 R packages

8.2.1 Package installation

8.2.2 Package loading

8.3 R language data structures and basic functions

8.3.1 R language object types

8.3.2 R language object creation

8.3.3 Commonly used statistical functions for numeric vectors

8.4 Data reading and storage

8.4.1 Data reading

8.4.2 Data storage

8.5 Basic drawing in R

8.5.1 Pie plot

8.5.2 Bar plot

8.5.3 Histogram (Hist)

8.5.4 Line plot (Matplot)

8.5.5 Box plot

8.5.6 Scatter diagram

8.5.7 Scatter diagram matrix (Scatterplot matrices)

8.6 Hypothesis testing

8.6.1 Parametric hypothesis testing

8.6.2 Non-parametric hypothesis testing

8.7 Analysis of variance

8.7.1 Test for homogeneity of variance

8.7.2 One-way analysis of variance

8.7.3 Two-way analysis of variance

8.8 Summary of this chapter

Chapter 9 Research on computational stylistics

9.1 Language features used in research on computational stylistics

9.1.1 Character aspects

9.1.2 Vocabulary aspects

9.1.3 Sentence aspects

9.1.4 Parts of speech

9.1.5 Phrases and grammatical structure

9.1.6 Paragraph aspects

9.2 Methods often used in the study of computational stylistics

9.3 Research on the computational stylistics of Mo Yan's and Yu Hua's novels

9.3.1 Style analysis based on frequency

9.3.2 Style analysis based on hypothesis testing

9.3.3 Style analysis based on text clustering

9.3.4 Style analysis based on text classification

9.3.5 Summary

9.4 Summary of this chapter

Appendix: Tables of commonly used statistics

Attached Table 1 Numerical table of the standard normal distribution function

Attached Table 2 Values of the coefficients ai(n) of the normality test statistic W

Attached Table 3 Table of the α quantiles Wα of the normality test statistic W

Attached Table 4 Table of the α quantiles Yα of the normality test statistic Y

Attached Table 5 t test critical value table

Attached Table 6 χ² test critical value table

Attached Table 7 F test critical value table

Attached Table 8 Wilcoxon rank-sum test critical value table

Attached Table 9 Table of the quantiles H1-α(r, f) of the statistic H

Attached Table 10 Multiple comparison q1-α(r, f) table
