
Thought this might be of interest...



Forwarded message:
From owner-linguist@TAMVM1.TAMU.EDU Sat Oct  1 17:53:29 1994
Message-Id: <9410012253.AA04342@astrid.ling.nwu.edu>
Date:         Sat, 1 Oct 1994 17:44:07 -0500
Reply-To: The Linguist List <linguist@tamsun.tamu.edu>
Sender: The LINGUIST Discussion List <LINGUIST@tamvm1.tamu.edu>
From: The Linguist List <linguist@tamsun.tamu.edu>
Subject:      5.1067 Sum: Comparing texts for authorship
To: Multiple recipients of list LINGUIST <LINGUIST@tamvm1.tamu.edu>

----------------------------------------------------------------------
LINGUIST List:  Vol-5-1067. Sat 01 Oct 1994. ISSN: 1068-4875. Lines: 661

Subject: 5.1067 Sum: Comparing texts for authorship

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar@tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry@emunix.emich.edu>

Asst. Editors: Ron Reck <rreck@emunix.emich.edu>
               Ann Dizdar <dizdar@tamsun.tamu.edu>
               Ljuba Veselinova <lveselin@emunix.emich.edu>
               Liz Bodenmiller <lbodenmi@emunix.emich.edu>

-------------------------Directory-------------------------------------

1)
Date: Thu, 29 Sep 1994 09:50:14 -0400
From: "William J. Rapaport" <rapaport@cs.Buffalo.EDU>
Subject: comparing texts for authorship -- summary

-------------------------Messages--------------------------------------
1)
Date: Thu, 29 Sep 1994 09:50:14 -0400
From: "William J. Rapaport" <rapaport@cs.Buffalo.EDU>
Subject: comparing texts for authorship -- summary

Last June, I posted the following query about "comparing 2 texts":

A colleague in our Classics dept. wants to be able to compare 2 texts
to see if they were written by the same author, or by different authors.
Presumably, this would be done by some combination of a stylistic and a
statistical analysis.

(As I recall, this sort of technique has been used by folks who try to
figure out if Shakespeare really wrote Shakespeare's plays.)

What she needs are pointers to the literature, especially information
on how reliable such arguments are.

Appended is a summary of the replies.  Thanks to all of you!

                        William J. Rapaport
                        Associate Professor of Computer Science
                        and
                        Center for Cognitive Science

Dept. of Computer Science | (716) 645-3180 x 112
SUNY Buffalo              | fax:  (716) 645-3464
Buffalo, NY 14260         | rapaport@cs.buffalo.edu

                    *************

   Date: Fri, 3 Jun 1994 14:18:46 --100
   From: Ken.Beesley@xerox.fr (Ken Beesley)

   Some important work on authorship was done by Michaelson & Morton in
   Edinburgh, Scotland.

   The Rev. A.Q. Morton
   The Abbey Manse
   Culross
   Dunfermline, Fife KY128JD

   Newmills 880-231

   Prof. S. Michaelson
   Computer Science
   JCMB
   Kings Buildings
   University of Edinburgh
   Edinburgh, Scotland

   There was also some work by a couple of statisticians at Brigham Young
   University in Provo, Utah, USA.  Their names escape me now.
                            *************
   Date: Fri, 3 Jun 1994 13:24:12 -0500
   From: hrubin@stat.purdue.edu (Herman Rubin)

   Probably the soundest work from a statistical standpoint is:

    Mosteller, Frederick and David L. Wallace. 1964.
    _Inference_and_Disputed_Authorship:_The_Federalist_.
    Reading, Mass.: Addison-Wesley.

   This book considered the Federalist papers, which were written by
   three known authors, with different papers written by different ones
   of the three.  One of their conclusions was that analysis by
   contextual vocabulary and other similar features, much used by those
   who assigned authorship in the past, did not work; the only approach
   that did work for this problem was the analysis of connectives
   (function words).
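
   [As a rough illustration, a minimal Python sketch of this kind of
   function-word counting, assuming plain-text input; the marker list
   below is a small assumed sample, not Mosteller and Wallace's actual
   variable set (though "upon" vs. "on" was among their discriminators).]

       import re
       from collections import Counter

       # A few connectives/function words of the general kind used in
       # such studies; this particular list is illustrative only.
       MARKERS = ["upon", "on", "while", "whilst", "by", "also"]

       def rates_per_1000(text, markers=MARKERS):
           """Occurrences of each marker word per 1000 tokens."""
           tokens = re.findall(r"[a-z']+", text.lower())
           counts = Counter(tokens)
           n = max(len(tokens), 1)
           return {w: 1000.0 * counts[w] / n for w in markers}

       # Usage: compute profiles for texts of known and disputed
       # authorship, and see which known profile the disputed one fits.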

   There is an article in the _Journal of Applied Probability_
   on the type-token relationship in Shakespeare's plays.
   A cursory glance at the data indicates that one cannot treat these as a
   sample from a single population, even if the comedies, tragedies,
   and historical plays are separated; there is a definite effect of the
   individual work. Similar things can be noticed in the attempts of other
   statisticians to do this, such as the writings of Yule.

   I did look at some of the data; I have not published on this.  It is quite
   dangerous to say on the basis of a statistical test that two works are by
   different authors.
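
   [For the curious, a minimal Python sketch of the type-token
   relationship itself - the growth of distinct words (types) as tokens
   accumulate - which is what such comparisons are computed over:]

       import re

       def type_token_curve(text, step=1000):
           """(tokens seen, distinct types seen) after every `step` tokens."""
           tokens = re.findall(r"[a-z']+", text.lower())
           seen, curve = set(), []
           for i, tok in enumerate(tokens, 1):
               seen.add(tok)
               if i % step == 0:
                   curve.append((i, len(seen)))
           return curve
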
                          -----------------

   Date: Mon, 6 Jun 94 11:16:55 -0700
   From: jtang@cogsci.Berkeley.EDU (Joyce Tang Boyland)

   [Re: Mosteller and Wallace:]
   ...
   The 1984 edition [see reply below] has a much more informative table of
   contents than the original 1964 version published by Addison-Wesley.

                         ------------------

   Date: Tue, 7 Jun 1994 10:33:39 --100
   From: Gregory.Grefenstette@xerox.fr (Gregory Grefenstette)

   Frederick Mosteller and David L. Wallace,
   "Applied Bayesian and Classical Inference: The Case of the Federalist
   Papers", 2nd edition of "Inference and Disputed Authorship: The
   Federalist" (Springer-Verlag).

   This book gives statistical methods for deciding which of the
   disputed Federalist papers were written by Hamilton and which by
   Madison.

                         ------------------
   From: Robert.Sigley@vuw.ac.nz
   Date: Sat, 04 Jun 1994 13:02:58 +1200

   I've just finished reading (and returned, unfortunately) a collection of
   papers which I can thoroughly recommend to anyone trying to identify an
   author by style.

   It's called "Statistics and Style" and appeared in 1968. ... On checking
   the library online catalogue, I find that no extra information is given
   for it, so I won't be able to confirm this identification until it's
   reshelved (within 2 days, I hope). As far as I remember, the editor had a
   Slavic-type name beginning with `G'; a grep of the library's entire
   author index makes J. Gvozdanovic the most likely option (listed for
   another book in the general area of linguistics, but unconnected to the
   topic under discussion), but this identification is tentative for now.

   Analyses covered include comparisons of
   (1) word-length spectra (and a number of statistics calculated from
   them; a sketch of computing (1) and (2) follows this list);

   (2) sentence-length spectra (which are found to follow a log-normal
   distribution for any particular text by a single author. Author behaviour
   is reasonably consistent, but there is considerable overlap between
   different authors in the same genre);

   (3) use of certain vocabulary items previously identified as `typical' of
   the candidate author(s) on the basis of uncontested works;

   (4) use of certain grammatical constructions;

   (5) counts of certain grammatical classes (eg noun/verb ratios, or
   adjective/verb ratios).
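
   [Both spectra are simple to compute today. A minimal Python sketch,
   assuming plain text and a crude split of sentences on terminal
   punctuation:]

       import re
       from collections import Counter

       def word_length_spectrum(text):
           """Frequency of each word length - analysis (1) above."""
           words = re.findall(r"[A-Za-z']+", text)
           return Counter(len(w) for w in words)

       def sentence_length_spectrum(text):
           """Frequency of each sentence length in words - analysis (2)."""
           sentences = re.split(r"[.!?]+", text)
           return Counter(len(s.split()) for s in sentences if s.strip())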

   The final paper in the collection is perhaps the most important, as it deals
   with the general question of reliability.

   Overall, it has to be said that the crude general statistics above are not
   useful for deciding questions of authorship unless:

   (i) the number of possible candidate authors is small;
   (ii) we have a large body of work from each of the candidate authors;
   (iii) this corpus covers the entire span of their career (or at least shows
   little change over time); and
   (iv) the corpus is in similar genres to the contested item.

   In short, the rewards of such analysis are mostly not worth the
   considerable time it used to take to compute the statistics. The results
   are, I'm afraid, especially indecisive in answering questions in classics
   (where this volume of work, and the historical information about potential
   alternative authors, is often lacking).

   But if there is only one candidate author, with a large known corpus, and
   the exercise is simply to determine how similar to that author's style
   the unknown text is, then it can still be attempted.

   (1) and (2) above are now relatively quick and easy to calculate
   with most concordance programs - providing the text is in machine-readable
   form to start with! But they are the least author-specific methods.

   (4) and (5) could be useful, but are still very time-consuming to
   calculate, and require a whole lotta manual tagging of the texts. Best
   avoided.

   So (3) is probably going to be of most use in identifying a specific
   author. The best approach I can think of would be to construct a
   concordance (using OCP or similar) for a large corpus (20000 words minimum)
   of the candidate author, and then do the same for a similar-sized
   matched-genre corpus from the author's contemporaries. (If the text's
   general *date* is in doubt, you may as well give up now.)

   Then you compare the frequency ratios of common vocabulary items (ie
   frequency in candidate corpus/ frequency in mixed-contemporary corpus).

   This will identify a number of vocabulary items which are used
   proportionately much more or much less by the candidate author, and so can
   be used as `characteristic' of that author. Discard items which are linked
   in any literal sense to the text topic. Ignore very rare items (eg with
   frequency less than 5 over 20000 words). To save yourself time, and to
   maximise the sensitivity of your tests, look at only the 10 or so items with
   the largest differential frequency.

   Now calculate the frequencies of the remaining items in the contested text.
   Compare these with both the candidate-author and contemporary corpora
   frequencies.

   Finally, conduct a series of statistical tests to determine whether any
   differences you find can reasonably be attributed to chance. The best
   method will depend on the frequencies you get at the end of all this; ask a
   friendly statistician.
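
   [A minimal Python sketch of the frequency-ratio procedure just
   described, using the thresholds given above (ignore items with total
   frequency under 5; keep the 10 or so most differential items). The
   add-one smoothing is my own assumption, added so that a zero count
   doesn't divide by zero; discarding topic-linked items, and the final
   significance tests, are still left to you and the friendly
   statistician.]

       import math
       import re
       from collections import Counter

       def freqs(text):
           """Token counts and corpus size for a plain-text corpus."""
           tokens = re.findall(r"[a-z']+", text.lower())
           return Counter(tokens), len(tokens)

       def marker_items(candidate, contemporaries, min_freq=5, top=10):
           """The `characteristic' items: words used proportionately
           much more or much less by the candidate author."""
           c_counts, c_n = freqs(candidate)
           m_counts, m_n = freqs(contemporaries)
           score = {}
           for w in set(c_counts) | set(m_counts):
               if c_counts[w] + m_counts[w] < min_freq:
                   continue                      # ignore very rare items
               ratio = ((c_counts[w] + 1) / (c_n + 1)) / \
                       ((m_counts[w] + 1) / (m_n + 1))
               score[w] = abs(math.log(ratio))   # large either way = distinctive
           return sorted(score, key=score.get, reverse=True)[:top]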

   Hope this helps. I'll mail back when I find the book again to confirm
   its identity. I should add though that there's been considerable progress
   in text manipulation on computers since its publication, so it's out of
   date in some areas; however, this is more or less made up for by falling
   interest and a lack of progress in statistical style analysis since 1970.

                            ---------------------

   From: Robert.Sigley@vuw.ac.nz
   Date: Wed, 08 Jun 1994 16:34:09 +1200

   The reference I mentioned is actually:

   Lubomir Dolezel & Richard W. Bailey (eds) 1969. _Statistics_and_Style_. New
   York: American Elsevier Publishing Company.

   I shall try to give a brief description of the more important collected
   papers, with original references where possible.
   Page references for quotes are from the collection, though.

   Vocabulary Measures:

   Paul Bennett. The Statistical Measurement of a Stylistic Trait in
   _Julius_Caesar_ and _As_You_Like_It_. (from _Shakespeare_Quarterly_ VIII
   (1957): 33-50)
       Bennett applies Yule's characteristic (a measure of vocabulary
       repetitiveness) to two very different plays by Shakespeare. Done
       with a card-sorting technique, the analysis was very time-consuming;
       it would be much quicker today! He finds that the characteristic is
       a useful measure of style - it varies from act to act in a way
       predictable from the plays' structures - but he "should not care to
       suggest that the characteristic is going to provide an infallible
       test of authorship" (p40).
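
   [Yule's characteristic is quick to compute now. A Python sketch using
   the standard formula K = 10^4 * (sum of f_i^2 - N) / N^2, where the
   f_i are the type frequencies and N the token count:]

       import re
       from collections import Counter

       def yules_k(text):
           """Yule's K; higher values mean a more repetitive vocabulary."""
           tokens = re.findall(r"[a-z']+", text.lower())
           counts = Counter(tokens)
           n = len(tokens)
           s2 = sum(f * f for f in counts.values())
           return 10000 * (s2 - n) / (n * n)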

   Charles Muller. Lexical Distribution Reconsidered: The Waring-Herdan
   Formula.
   (from _Cahiers_de_Lexicologie_ VI (1965): 35-53.)
       Muller tests a rather complicated formula designed to predict the
       word-frequency spectrum of a text. It works reasonably well on material
       from a variety of texts in several languages.
   [The shape of the frequency distribution is therefore of little use in
   author attribution. This formula has recently surfaced again in Baayen's
   (1990, 1991) work on morphological productivity.-RJS]

   Friederike Antosch. The Diagnosis of Literary Style with the Verb-Adjective
   Ratio. (translated from German original.)
       Antosch analyses a number of plays by Grillparzer, Goethe and
       Anzengruber, in terms of the verb/adjective ratio. She finds that this
       is extremely sensitive to elements of genre (eg dialogue/ monologue;
       and novels vs. academic writings) and characterisation (eg lower-class/
       upper-class). The V/A ratio may show local maxima within a play at
       points of rising action and climactic scenes, and so is a potentially
       useful stylistic indicator.
   [Corollary: it's of very limited use for comparing authors unless these
   factors can be controlled.]
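
   [Given part-of-speech-tagged text, the ratio itself is trivial. A
   Python sketch over (word, tag) pairs; the Penn-style VB*/JJ* tag
   names are an assumption - substitute whatever your tagger emits:]

       def verb_adjective_ratio(tagged):
           """Antosch's V/A ratio over a list of (word, tag) pairs."""
           verbs = sum(1 for _, t in tagged if t.startswith("VB"))
           adjs = sum(1 for _, t in tagged if t.startswith("JJ"))
           return verbs / adjs if adjs else float("inf")

       def windowed_ratios(tagged, window=500):
           """The ratio over successive windows, to look for the local
           maxima Antosch reports at climactic scenes."""
           return [verb_adjective_ratio(tagged[i:i + window])
                   for i in range(0, len(tagged) - window + 1, window)]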

   See also:
   G. Udny Yule. 1944. _The_Statistical_Study_of_Literary_Vocabulary_.
   Cambridge.

   Sentence-level Measures:

   C.B. Williams. A Note on the Statistical Analysis of Sentence-Length as a
   Criterion of Literary Style.
       Williams compares works by Chesterton, Wells and Shaw with respect to
       their sentence-length frequency spectra. He finds that these spectra
       are reasonably well modelled by a log-normal distribution (that is, the
       log of the sentence length has a normal distribution), and that the
       three books studied have significantly different mean sentence lengths
       - though the significance is marginal between Shaw and Chesterton.
       Williams uses samples of 600 sentences (approx 15000 words) from each
       book; this is a minimum sample size for work of this nature!
   [NB we can't conclude from this that we have identified any characteristic
   of the *authors*. -RJS]
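
   [A minimal Python sketch of Williams' comparison: take logs of the
   sentence lengths (approximately normal, per the above) and compare
   two texts with a two-sample t statistic. The crude sentence split is
   an assumption.]

       import math
       import re

       def log_lengths(text):
           """Natural logs of sentence lengths, in words."""
           sentences = re.split(r"[.!?]+", text)
           return [math.log(len(s.split())) for s in sentences if s.split()]

       def mean_sd(xs):
           m = sum(xs) / len(xs)
           return m, math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

       def t_statistic(xs, ys):
           """Welch's t for the difference in mean log sentence length."""
           mx, sx = mean_sd(xs)
           my, sy = mean_sd(ys)
           return (mx - my) / math.sqrt(sx**2 / len(xs) + sy**2 / len(ys))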

   Kai Rander Buch. A Note on Sentence-Length as Random Variable.
       Buch comments on Williams' paper, presenting (with fearsome maths) a
       statistical analysis of two works by the same author, and concluding
       that the author's style has changed over time to such an extent that
       the texts are significantly different under Williams' test.

   See also:
   C.B. Williams. 1956. Studies in the History of Probability and
   Statistics IV. A Note on an Early Statistical Study of Literary Style.
   _Biometrika_ XLIII (1956): 248-256.

   G. Udny Yule. 1938. On Sentence-Length as a Statistical Characteristic
   of Style in Prose, with Application to Two Cases of Disputed
   Authorship. _Biometrika_ XXX (1938-39): 363-390.

   [Hence gross sentence-length measures are of little use for author
   attributions: they can return non-significant differences between different
   authors, and significant differences between texts by the same author. They
   simply aren't specific enough. -RJS]

   Curtis W. Hayes. A Study in Prose Styles: Edward Gibbon and Ernest
   Hemingway.
   (from _Texas_Studies_in_Literature_and_Language_ VII (1966): 371-386.)
       Hayes avoids the above problem by taking a more detailed
       transformational analysis of passages of Gibbon & Hemingway. He finds a
       variety of grammatical patterns which show highly significant
       differences between the two authors - in particular, passives,
       doublets, infinitival nominals, and relative clauses are far commoner
       in Gibbon.
   [This is a valuable stylistic measure, though not a method I would have the
   patience to use myself! But it doesn't serve to identify the authors,
   so much as the very different genres they write in. -RJS]

   Studies of Individual Author Styles:

   John B. Carroll. Vectors of Prose Style.
   (from Thomas A. Sebeok (ed) 1960. _Style_In_Language_. MIT Press: 283-292.)
       This is an interesting use of factor analysis to determine the
       linguistic correlates of literary judgements.

   George M. Landon. The Quantification of Metaphoric Language in the Verse of
   Wilfred Owen.
       Least said the better.

   Frederick L. Burwick. Stylistic Continuity and Change in the Prose of
   Thomas Carlyle.
       The mutant offspring of an entropic study of 5-word wordclass
       sequences, and a more traditional literary analysis. The latter wins
       out, but is not easily applicable to other authors.

   Karl Kroeber. Perils of Quantification: The Exemplary Case of Jane Austen's
   _Emma_.
       Kroeber undertakes a detailed analysis of the vocabulary of Austen,
       Eliot, Dickens and [E.] Bronte. While many of the restrictions he
       places on his samples are arbitrary, this is potentially a useful
       direction for author comparison and attribution (see below).

   See also the case studies:
   Alvar Ellegard. 1962. _A_Statistical_Method_for_Determining_Authorship:_The_
   Junius_Letters,_1769-1772_. Gothenburg Studies in English 13.

   Ivor S. Francis. 1966. An Exposition of a Statistical Approach to the
   Federalist Dispute, in Jacob Leed (ed) _The_Computer_and_Literary_Style_.
   Ohio.

   Survey of the field:

   Richard W. Bailey. Statistics and Style: A Historical Survey. (pp217-236)
       This deals with the general question of reliability:
       "What is wanted... is a litmus test by which the critic can decide
       whether or not two given texts were written by the same author. Though
       some attempts have been made to formulate such a test, they have been
       almost wholly unsuccessful." (p222)

   Some other surveys cited by Bailey:
   William J. Paisley. 1964. Identifying the Unknown Communicator [...]
   _The_Journal_of_Communication_, XIV (1964): 219-237.

   Rebecca Posner. 1963. The Use and Abuse of Stylistic Statistics.
   _Archivum_Linguisticum_ XV (1963): 111-119.

   In short, as in my earlier message: the crude general statistics above
   are of little use for deciding questions of authorship unless the pool
   of candidate authors is small and each is well documented in genres
   matching the contested item. The general vocabulary and sentence-length
   measures are now quick to compute and may serve as a preliminary check,
   but the step-by-step `specific vocabulary' procedure outlined in that
   message remains the most practical approach; detailed grammatical
   analysis is still best avoided.

   Hope this helps. I should add, though, that there's been considerable
   progress in text manipulation on computers since 1970, so the book is
   out of date in some areas; however, this is more or less made up for by
   falling interest and a lack of progress in statistical style analysis
   since then.

                                   ----------------

   Date: Sat, 4 Jun 1994 11:51:03 -0600
   From: nostler@crl.nmsu.edu (Nick Ostler)

   Your colleague should look at a work by AJP Kenny on assessing the
   authorship of Aristotle's Eudemian Ethics: "The Aristotelian ethics:
   a study of the relationship..." Oxford: Clarendon Press, 1978.

   [also recommended by
   Virginia Knight <ZZAASVK@cms.manchester-computing-centre.ac.uk>]

                               ------------------

   Date: Mon, 6 Jun 94 13:58:20 +0200
   From: monique@gia.univ-mrs.fr (Monique Rolbert)

   We are a textual-databases/NLP team developing a query language for
   textual databases (starting from an SGML-type format), and one question
   is what kinds of operators it would be worthwhile to put at the
   disposal of a user who wants to do natural-language processing on
   texts.
   I would be very interested to hear about the kinds of needs you have
   for the sort of comparison you want to make (statistical-stylistic).
   Thanks in advance.

   Monique Rolbert
   monique.rolbert@gia.univ-mrs.fr

                                -----------------
   Date: Mon, 6 Jun 1994 23:34:23 +1000
   From: sussex@lingua.cltr.uq.oz.au (Prof. Roly Sussex)

   John Burrows (LCJFB@cc.newcastle.edu.au) at the University of Newcastle,
   Australia, has done important work on text analysis and authorship.
   You could email him direct.

   Roly Sussex
   Director
   Centre for Language Teaching and Research
       and
   Language and Technology Centre of the National Languages and Literacy
           Institute of Australia
   University of Queensland
   Queensland 4072
   Australia

   email:  sussex@lingua.cltr.uq.oz.au
   phone:  +61 7 365-6896 (work)
           +61 7 300-2942 (home)
   fax:    +61 7 365-7077

                            -------------------

   Date: Mon, 6 Jun 94 15:37:19 -0700
   From: edwards@cogsci.Berkeley.EDU (Jane A. Edwards)

   Your query reminded me of a recent exchange regarding stylistic
   analysis, though in different context.  Hope this is of use.  -Jane Edwards
   | ------------------------
   | Date: Mon, 31 Jan 1994 11:59:00 -0500
   | From: neff@watson.ibm.com (Mary Neff)
   | To: FL-LIST@BHAM.AC.UK
   | Cc: neff@watson.ibm.com
   | Subject: The Case of the Plagiarized Patent
   |
   | A few months back I was buttonholed at a party by the owner of a company
   | in the middle of a patent infringement case.  He wanted to know if, as a
   | linguist, I might have anything useful to offer.  Not a lot, it's not my
   | field, but I just found this list, and one of YOU might. It seems that his
   | company had signed a contract with another one that included giving them
   | access to his design documentation and his patent applications.  Some
   | time later, he discovered that the other company was siphoning off his
   | business and making a product too similar to his to be accidental, and
   | had filed patents also (I think in other countries).  His question to me
   | was whether it were possible to study and compare the two patents by
   | structure, language, etc. to determine whether there might have been
   | any plagiarism involved. I later looked at the patents and decided that
   | it was perhaps not a wild idea, but that any investigation would also
   | have to take into account the general "formula" of a patent, which might
   | account for a lot of similarity.  Who are the experts on this sort of
   | thing?  What are some of the other issues involved?  It's not so often
   | that I get approached at a party for some free advice as a linguist;
   | usually it's the doctors and the lawyers that encounter that sort of
   | thing!
   |
   | Interestingly, I read something in this month's DISCOVER magazine that
   | mentions a couple of guys who designed a computer program to snoop for
   | plagiarism in books.

                              ------------------

   Date:         Tue, 01 Feb 94 10:03:54 EST
   From: Larry Horn <LHORN@YaleVM.CIS.Yale.edu>

   Thanks for the postings.  The lawyer has settled on one of my earlier
   respondents, Gerry McMenamin of Fresno, who wrote a book on authorship
   determination.  Apparently computers are indeed much used in these matters,
   but I don't know whether his samples (from his client and another man) are
   generous enough to allow for statistical significance.
   I guess McMenamin will help him decide.
                            ------------------

   From: "Richard Hamilton-Williams" <RJHW@registry.cit.ac.nz>
   Date:          Wed, 8 Jun 1994 13:29:26 GMT+1200

   Long ago, but not so far away, I studied Middle High German and wrote
   a bit of a thesis on the transmission of MHG texts.

   It wasn't very popular with a lot of people because it made little
   reference to "taste" and was based, rather, on a statistical analysis
   of variance between texts.  My professor at the time got me interested
   in this and he in turn had got it from a book called, I think, "The
   Calculus of Variants".  I've an idea it was written by E H
   Greig(Gregg?) in the 1920s or 1930s.  In any case, I think I have a
   copy at home and will send you the details tomorrow.

   I made use of a fairly crude algorithm which established a model as if
   the transmission of texts were known, and then measured actual
   variation against this.  My professor died, nothing to do with me I
   hope, and although I completed my degree I went on to other things,
   so I can't claim that I know what goes on in the field nowadays.  I
   imagine, however, that analysis is much more sophisticated now - I
   used punchcards to enter data on a mainframe - although the concepts
   should be very similar.
   Richard Hamilton-Williams
   Central Institute of Technology, Wellington, New Zealand
   04 527-6397 x6982
   Private Bag 39807
   Wellington Mail Centre
   New Zealand

   From: "Richard Hamilton-Williams" <RJHW@registry.cit.ac.nz>
   Date:          Thu, 9 Jun 1994 08:03:41 GMT+1200

   The reference is:

   Greg, W. W.  The Calculus of Variants, An Essay on Textual Criticism
   (Oxford, 1927)

   Greg wrote a number of other things and edited works on the basis of
   his theories on textual transmission.
   Richard Hamilton-Williams
   Central Institute of Technology, Wellington, New Zealand
   04 527-6397 x6982
   Private Bag 39807
   Wellington Mail Centre
   New Zealand

                           ------------------

   From: h9290030@hkuxa.hku.hk (R.Y.L. TANG)
   Subject: Authorship identification

   In David Crystal's _The Cambridge Encyclopedia of Language_ (Cambridge
   UP, 1987), there is a very succinct account of the use of statistics in
   stylistic analysis and authorship identification (Chapter 12).
                                   -----------

   From: Brett.Baker@linguistics.su.edu.au (Brett Baker)
   Date:        Wed, 15 Jun 1994 15:41:12 +1000

   ... I don't know if this will be much use to your colleague, but
   she could do worse than have a look at a new monograph by John Myhill called
   'Typological Discourse Analysis', published by Blackwell in 1992. Apart from
   loads of interesting stuff about analysing texts quantitatively, it also has
   references for analyses that have been done on written texts which sound
   like the kind of thing you want. Much of the purpose of this kind of
   analysis is to show up regularities of expression type and
   stylistic/grammatical function. Good luck.
                                   -----------

   Date: Thu, 16 Jun 1994 00:24:45 -0500 (CDT)
   From: Kristin E Hiller <hill0087@gold.tc.umn.edu>

   This is in response to the query you posted on Linguist (on behalf of
   your colleague).  I'm sorry it's taken me so long to respond.
   Stylostatistical studies abound concerning cases of disputed authorship.
   You mention the Shakespeare/Marlowe controversy.  I'll name a few others:

   1)  One of the most often cited cases of disputed authorship is that of
   the _Federalist Papers_.  Of the 85 papers, the authorship of twelve was
   in question (having been written by either Madison or Hamilton).

   2)  Several anonymous articles appeared in the journals _Vremja_ (_Time_)
   and _`Epoxa_ (_Epoch_), which were both edited by Dostoevsky.  Some of
   the articles have been attributed to Dostoevsky.

   3)  The authorship of _The Junius Letters_, not known for certain, has
   often been attributed to Sir Philip Francis (although some 40 others
   were considered at one time or another).

   4)  Gustaf Adlerfeld's _The Military History of Charles XII_ was
   anonymously translated from the French.  Henry Fielding is considered by
   some to have been the translator.

   5)  Some scholars maintain that Sholoxov did not actually write all of
   _Tixij Don_, but plagiarized Krjukov's manuscripts.

   I have only recently begun reading about this field and have already
   come across many references to the work done on (1) by Frederick Mosteller
   and David Wallace _Inference and Disputed Authorship: "The Federalist"_
   (Reading, MA: Addison-Wesley, 1964) and a less statistic-laden work,
   Francis, Ivor S. "An Exposition of the Statistical Approach to the
   _Federalist_ Dispute," in _The Computer and Literary Style_, ed. Jacob
   Leed (Kent: Kent State U. Press, 1966).

   Geir Kjetsaa tackles (2) in his book (written in Russian) _Prinadlezhnost'
   Dostoevskomu: K voprosu ob atribucii F.M. Dostoevskomu anonimnyx statej v
   zhurnalax "Vremja" i "Epoxa"_ (Oslo: Solum Forlag, 1986).

   Michael and Jill Farringdon address (4) in "A computer-aided study of the
   prose style of Henry Fielding and its support for his translation of the
   Military History of Charles XII", in _Advances in Computer-aided Literary
   and Linguistic Research: Proceedings of the Fifth International Symposium
   on Computers in Literary and Linguistic Research_, D.E. Ager, F.E. Knowles,
   Joan Smith, eds. (Birmingham: AMLC, 1979).

   Rudall, B.H. and T.N. Corns, _Computers and Literature: A practical guide_
   (Cambridge, MA: Abacus Press, 1987) contains a chapter on "Author
   identification and canonical investigation."

   With all the literature out there I could continue listing references
   until my fingers ached from typing.  Instead I'll just list two more:

   Kenny, Anthony, _The Computation of Style: An introduction to statistics
   for students of literature and humanities_ (Oxford: Pergamon Press, 1982).
   A great book -- the title says it all.

   Feldman, Paula R. and Buford Norman. _The Wordworthy Computer: Classroom
   and research applications in language and literature_ (NY: Random House,
   c.1987).  The best part of this book is its HUGE bibliography.  A very
   good starting point.

--------------------------------------------------------------------------
LINGUIST List: Vol-5-1067.