Thought this might be of interest...
Forwarded message:
From owner-linguist@TAMVM1.TAMU.EDU Sat Oct 1 17:53:29 1994
Message-Id: <9410012253.AA04342@astrid.ling.nwu.edu>
Date: Sat, 1 Oct 1994 17:44:07 -0500
Reply-To: The Linguist List <linguist@tamsun.tamu.edu>
Sender: The LINGUIST Discussion List <LINGUIST@tamvm1.tamu.edu>
From: The Linguist List <linguist@tamsun.tamu.edu>
Subject: 5.1067 Sum: Comparing texts for authorship
To: Multiple recipients of list LINGUIST <LINGUIST@tamvm1.tamu.edu>
----------------------------------------------------------------------
LINGUIST List: Vol-5-1067. Sat 01 Oct 1994. ISSN: 1068-4875. Lines: 661
Subject: 5.1067 Sum: Comparing texts for authorship
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar@tam2000.tamu.edu>
Helen Dry: Eastern Michigan U. <hdry@emunix.emich.edu>
Asst. Editors: Ron Reck <rreck@emunix.emich.edu>
Ann Dizdar <dizdar@tamsun.tamu.edu>
Ljuba Veselinova <lveselin@emunix.emich.edu>
Liz Bodenmiller <lbodenmi@emunix.emich.edu>
-------------------------Directory-------------------------------------
1)
Date: Thu, 29 Sep 1994 09:50:14 -0400
From: "William J. Rapaport" <rapaport@cs.Buffalo.EDU>
Subject: comparing texts for authorship -- summary
-------------------------Messages--------------------------------------
1)
Date: Thu, 29 Sep 1994 09:50:14 -0400
From: "William J. Rapaport" <rapaport@cs.Buffalo.EDU>
Subject: comparing texts for authorship -- summary
Last June, I posted the following query about "comparing 2 texts":
A colleague in our Classics dept. wants to be able to compare 2 texts
to see if they were written by the same author, or by different authors.
Presumably, this would be done by some combination of a stylistic and a
statistical analysis.
(As I recall, this sort of technique has been used by folks who try to
figure out if Shakespeare really wrote Shakespeare's plays.)
What she needs are pointers to the literature, especially information
on how reliable such arguments are.
Appended is a summary of the replies. Thanks to all of you!
William J. Rapaport
Associate Professor of Computer Science
and
Center for Cognitive Science
Dept. of Computer Science | (716) 645-3180 x 112
SUNY Buffalo | fax: (716) 645-3464
Buffalo, NY 14260 | rapaport@cs.buffalo.edu
*************
Date: Fri, 3 Jun 1994 14:18:46 --100
From: Ken.Beesley@xerox.fr (Ken Beesley)
Some important work on authorship was done by Michaelson & Morton in
Edinburgh, Scotland.
The Rev. A.Q. Morton
The Abbey Manse
Culross
Dunfermline, Fife KY128JD
Newmills 880-231
Prof. S. Michaelson
Computer Science
JCMB
Kings Buildings
University of Edinburgh
Edinburgh, Scotland
There was also some work by a couple of statisticians at Brigham Young
University in Provo, Utah, USA. Their names escape me now.
*************
Date: Fri, 3 Jun 1994 13:24:12 -0500
From: hrubin@stat.purdue.edu (Herman Rubin)
Probably the best sound work from a statistical basis is
Mosteller, Frederick and David L. Wallace. 1964. _Inference and Disputed
Authorship: The Federalist_. Reading, Mass.: Addison-Wesley.
This book considered the Federalist papers, written by three known authors,
with different papers by different authors. One of their conclusions was that
analysis by contextual vocabulary and other similar measures, much used by
those who assigned authorship in the past, did not work; the only approach
that did work for this problem was the use of connectives.
There is an article in the _Journal of Applied Probability_
on the type-token relationship in Shakespeare's plays.
A cursory glance at the data indicates that one cannot treat these as a
sample from a single population, even if the comedies, tragedies,
and historical plays are separated; there is a definite effect of the
individual work. Similar things can be noticed in the attempts of other
statisticians to do this, such as the writings of Yule.
I did look at some of the data; I have not published on this. It is quite
dangerous to say on the basis of a statistical test that two works are by
different authors.
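The connective idea above can be sketched roughly in code. This is not
Mosteller and Wallace's actual Bayesian machinery; the marker-word list,
the smoothing constant, and the simple log-likelihood comparison are all
illustrative assumptions:

```python
import math
from collections import Counter

# Illustrative marker words only; Mosteller & Wallace used function words
# such as "while"/"whilst" and "upon", but this list is an assumption.
MARKERS = ["upon", "while", "whilst", "enough", "also"]

def marker_rates(tokens, markers=MARKERS, smooth=0.5):
    """Smoothed per-token rate of each marker word in a token list."""
    counts = Counter(t.lower() for t in tokens)
    n = len(tokens)
    return {m: (counts[m] + smooth) / (n + smooth * len(markers))
            for m in markers}

def log_likelihood(tokens, rates, markers=MARKERS):
    """Log-probability of the disputed text's marker counts under one
    author's rate model (independence assumed, for illustration)."""
    counts = Counter(t.lower() for t in tokens)
    return sum(counts[m] * math.log(rates[m]) for m in markers)

def attribute(disputed, corpus_a, corpus_b):
    """Return 'A' or 'B' according to whose marker rates make the
    disputed text more likely. A sketch, not a full analysis."""
    la = log_likelihood(disputed, marker_rates(corpus_a))
    lb = log_likelihood(disputed, marker_rates(corpus_b))
    return "A" if la > lb else "B"
```

As the surrounding discussion warns, a likelihood comparison of this kind
says which candidate fits better, not whether either attribution is safe.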
-----------------
Date: Mon, 6 Jun 94 11:16:55 -0700
From: jtang@cogsci.Berkeley.EDU (Joyce Tang Boyland)
[Re: Mosteller and Wallace:]
...
The 1984 edition [see reply below] has a much more informative table of
contents than the original 1964 version published by Addison-Wesley.
------------------
Date: Tue, 7 Jun 1994 10:33:39 --100
From: Gregory.Grefenstette@xerox.fr (Gregory Grefenstette)
Frederick Mosteller and David L. Wallace. _Applied Bayesian and Classical
Inference: The Case of the Federalist Papers_, 2nd edition of _Inference and
Disputed Authorship: The Federalist_ (Springer-Verlag).
This book gives statistical methods for deciding which of the disputed
Federalist papers were written by Hamilton and which by Madison.
------------------
From: Robert.Sigley@vuw.ac.nz
Date: Sat, 04 Jun 1994 13:02:58 +1200
I've just finished reading (and returned, unfortunately) a collection of
papers which I can thoroughly recommend to anyone trying to identify an
author by style.
It's called "Statistics and Style" and appeared in 1968. ... on checking
the library online catalogue, I find that no extra information is given for
it, so I won't be able to confirm this identification until it's reshelved
(within 2 days, I hope). As far as I remember, the editor had a Slavic-type
name beginning with `G'; a grep of the library's entire author index makes
J. Gvozdanovic the most likely option (listed for another book in the
general area of linguistics, but unconnected to the topic under
discussion), but this id is tentative for now.
Analyses covered include comparisons of
(1) word-length spectra (and a number of statistics calculated from them);
(2) sentence-length spectra (which are found to follow a log-normal
distribution for any particular text by a single author. Author behaviour
is reasonably consistent, but there is considerable overlap between
different authors in the same genre);
(3) use of certain vocabulary items previously identified as `typical' of
the candidate author(s) on the basis of uncontested works;
(4) use of certain grammatical constructions;
(5) counts of certain grammatical classes (eg noun/verb ratios, or
adjective/verb ratios).
The final paper in the collection is perhaps the most important, as it deals
with the general question of reliability.
Overall, it has to be said that the crude general statistics above are not
useful for deciding questions of authorship unless:
(i) the number of possible candidate authors is small;
(ii) we have a large body of work from each of the candidate authors;
(iii) this corpus covers the entire span of their career (or at least shows
little change over time); and
(iv) the corpus is in similar genres to the contested item.
In short, the rewards of such analysis are mostly not worth the
considerable time it used to take to compute the statistics. The results
are, I'm afraid, especially indecisive in answering questions in classics
(where this volume of work, and the historical information about potential
alternative authors, is often lacking).
But if there is only one candidate author, with a large known corpus, and
the exercise is simply to determine how similar to that author's style
the unknown text is, then it can still be attempted.
(1) and (2) above are now relatively quick and easy to calculate
with most concordance programs - providing the text is in machine-readable
form to start with! But they are the least author-specific methods.
(4) and (5) could be useful, but are still very time-consuming to
calculate, and require a whole lotta manual tagging of the texts. Best
avoided.
So (3) is probably going to be of most use in identifying a specific
author. The best approach I can think of would be to construct a
concordance (using OCP or similar) for a large corpus (20000 words minimum)
of the candidate author, and then do the same for a similar-sized
matched-genre corpus from the author's contemporaries. (If the text's
general *date* is in doubt, you may as well give up now.)
Then you compare the frequency ratios of common vocabulary items (ie
frequency in candidate corpus/ frequency in mixed-contemporary corpus).
This will identify a number of vocabulary items which are used
proportionately much more or much less by the candidate author, and so can
be used as `characteristic' of that author. Discard items which are linked
in any literal sense to the text topic. Ignore very rare items (eg with
frequency less than 5 over 20000 words). To save yourself time, and to
maximise the sensitivity of your tests, look at only the 10 or so items with
the largest differential frequency.
Now calculate the frequencies of the remaining items in the contested text.
Compare these with both the candidate-author and contemporary corpora
frequencies.
Finally, conduct a series of statistical tests to determine whether any
differences you find can reasonably be attributed to chance. The best
method will depend on the frequencies you get at the end of all this; ask a
friendly statistician.
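The frequency-ratio procedure described above can be sketched as follows.
This is a minimal illustration of the posting's recipe, not an
implementation of any published method; the smoothing constant and the
ranking-by-log-ratio step are my own assumptions:

```python
import math
from collections import Counter

def characteristic_items(candidate, contemporaries, min_freq=5, top_n=10):
    """Vocabulary items used proportionately much more or much less by the
    candidate author than by a matched-genre corpus of contemporaries.
    Both arguments are token lists; min_freq drops very rare items, as
    the posting suggests, and top_n keeps the largest differentials."""
    ca, co = Counter(candidate), Counter(contemporaries)
    na, no = len(candidate), len(contemporaries)
    ratios = {}
    for w in set(ca) | set(co):
        if ca[w] + co[w] < min_freq:
            continue  # ignore very rare items
        # smoothed rate ratio; > 1 means the candidate over-uses w
        ratios[w] = ((ca[w] + 0.5) / na) / ((co[w] + 0.5) / no)
    # rank by how far the ratio departs from 1 in either direction
    return sorted(ratios, key=lambda w: abs(math.log(ratios[w])),
                  reverse=True)[:top_n]
```

One would then count these items in the contested text and, as the posting
says, hand the resulting frequencies to a friendly statistician; items tied
to the text's topic should be discarded by hand first.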
Hope this helps. I'll mail back when I find the book again to confirm
its identity. I should add though that there's been considerable progress
in text manipulation on computers since its publication, so it's out of
date in some areas; however, this is more or less made up for by falling
interest and a lack of progress in statistical style analysis since 1970.
---------------------
From: Robert.Sigley@vuw.ac.nz
Date: Wed, 08 Jun 1994 16:34:09 +1200
The reference I mentioned is actually:
Lubomir Dolezel & Richard W. Bailey (eds) 1969. _Statistics_and_Style_. New
York: American Elsevier Publishing Company.
I shall try to give a brief description of the more important collected
papers, with original references where possible.
Page references for quotes are from the collection, though.
Vocabulary Measures:
Paul Bennett. The Statistical Measurement of a Stylistic Trait in
_Julius_Caesar_ and _As_You_Like_It_. (from _Shakespeare_Quarterly_ VIII
(1957): 33-50)
Bennett applies Yule's characteristic (a measure of vocabulary
repetitiveness) to two very different plays by Shakespeare. Using a
card-sorting technique, this was very time-consuming; it would be much
quicker today! He finds that the characteristic is a useful measure of
style - it varies from act to act in a way predictable from the plays'
structures - but he "should not care to suggest that the characteristic
is going to provide an infallible test of authorship" (p40).
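Yule's characteristic, mentioned above, is straightforward to compute
today (no card-sorting required). A minimal sketch, using the usual
formulation K = 10^4 * (S2 - S1) / S1^2, where S1 is the token count and
S2 the sum of squared type frequencies:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's characteristic K, a measure of vocabulary repetitiveness
    that is roughly independent of text length: higher K means a more
    repetitive vocabulary."""
    freqs = Counter(t.lower() for t in tokens)
    s1 = sum(freqs.values())              # total tokens
    s2 = sum(f * f for f in freqs.values())  # sum of squared frequencies
    return 1e4 * (s2 - s1) / (s1 * s1)
```

A text of all-distinct words gives K = 0; heavy repetition drives K up.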
Charles Muller. Lexical Distribution Reconsidered: The Waring-Herdan
Formula.
(from _Cahiers_de_Lexicologie_ VI (1965): 35-53.)
Muller tests a rather complicated formula designed to predict the
word-frequency spectrum of a text. It works reasonably well on material
from a variety of texts in several languages.
[The shape of the frequency distribution is therefore of little use in
author attribution. This formula has recently surfaced again in Baayen's
(1990, 1991) work on morphological productivity.-RJS]
Friederike Antosch. The Diagnosis of Literary Style with the Verb-Adjective
Ratio. (translated from German original.)
Antosch analyses a number of plays by Grillparzer, Goethe and
Anzengruber, in terms of the verb/adjective ratio. She finds that this
is extremely sensitive to elements of genre (eg dialogue/ monologue;
and novels vs. academic writings) and characterisation (eg lower-class/
upper-class). The V/A ratio may show local maxima within a play at
points of rising action and climactic scenes, and so is a potentially
useful stylistic indicator.
[Corollary: it's of very limited use for comparing authors unless these
factors can be controlled.]
See also:
G. Udny Yule. 1944. _The_Statistical_Study_of_Literary_Vocabulary_.
Cambridge.
Sentence-level Measures:
C.B. Williams. A Note on the Statistical Analysis of Sentence-Length as a
Criterion of Literary Style.
Williams compares works by Chesterton, Wells and Shaw with respect to
their sentence-length frequency spectra. He finds that these spectra
are reasonably well modelled by a log-normal distribution (that is, the
log of the sentence length has a normal distribution), and that the
three books studied have significantly different mean sentence lengths
- though the significance is marginal between Shaw and Chesterton.
Williams uses samples of 600 sentences (approx 15000 words) from each
book; this is a minimum sample size for work of this nature!
[NB we can't conclude from this that we have identified any characteristic
of the *authors*. -RJS]
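Since Williams models sentence lengths as log-normal, the natural
comparison is between mean log lengths. A rough sketch (Welch's t
statistic on the log lengths is my choice here, not Williams'; as the
posting below notes, such a test can separate texts without separating
authors):

```python
import math
from statistics import mean, stdev

def log_length_summary(sentence_lengths):
    """Mean and s.d. of log sentence length - the natural parameters
    if lengths follow a log-normal distribution."""
    logs = [math.log(n) for n in sentence_lengths]
    return mean(logs), stdev(logs)

def welch_t(lengths_a, lengths_b):
    """Welch's t statistic on log sentence lengths: a rough test of
    whether two texts differ in mean (log) sentence length."""
    ma, sa = log_length_summary(lengths_a)
    mb, sb = log_length_summary(lengths_b)
    na, nb = len(lengths_a), len(lengths_b)
    return (ma - mb) / math.sqrt(sa**2 / na + sb**2 / nb)
```

Williams' sample of 600 sentences per book should be regarded as a
minimum; with small samples the statistic is close to meaningless.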
Kai Rander Buch. A Note on Sentence-Length as Random Variable.
Buch comments on Williams' paper, presenting (with fearsome maths) a
statistical analysis of two works by the same author, and concluding
that the author's style has changed over time to such an extent that
the texts are significantly different under Williams' test.
See also:
C.B. Williams. 1956. Studies in the History of Probability and Statistics IV:
A Note on an Early Statistical Study of Literary Style. _Biometrika_ XLIII
(1956): 248-256.
G. Udny Yule. 1938. On Sentence-Length as a Statistical Characteristic of
Style in Prose, with Application to Two Cases of Disputed Authorship.
_Biometrika_ XXX (1938-39): 363-390.
[Hence gross sentence-length measures are of little use for author
attributions: they can return non-significant differences between different
authors, and significant differences between texts by the same author. They
simply aren't specific enough. -RJS]
Curtis W. Hayes. A Study in Prose Styles: Edward Gibbon and Ernest
Hemingway.
(from _Texas_Studies_in_Literature_and_Language_ VII (1966): 371-386.)
Hayes avoids the above problem by taking a more detailed
transformational analysis of passages of Gibbon & Hemingway. He finds a
variety of grammatical patterns which show highly significant
differences between the two authors - in particular, passives,
doublets, infinitival nominals, and relative clauses are far commoner
in Gibbon.
[This is a valuable stylistic measure, though not a method I would have the
patience to use myself! But it doesn't serve to identify the authors,
so much as the very different genres they write in. -RJS]
Studies of Individual Author Styles:
John B. Carroll. Vectors of Prose Style.
(from Thomas A. Sebeok (ed) 1960. _Style_In_Language_. MIT Press: 283-292.)
This is an interesting use of factor analysis to determine the
linguistic correlates of literary judgements.
George M. Landon. The Quantification of Metaphoric Language in the Verse of
Wilfred Owen.
The less said, the better.
Frederick L. Burwick. Stylistic Continuity and Change in the Prose of Thomas
Carlyle.
The mutant offspring of an entropic study of 5-word wordclass
sequences, and a more traditional literary analysis. The latter wins
out, but is not easily applicable to other authors.
Karl Kroeber. Perils of Quantification: The Exemplary Case of Jane Austen's
_Emma_.
Kroeber undertakes a detailed analysis of the vocabulary of Austen,
Eliot, Dickens and [E.] Bronte. While many of the restrictions he
places on his samples are arbitrary, this is potentially a useful
direction for author comparison and attribution (see below).
See also the case studies:
Alvar Ellegard. 1962. _A_Statistical_Method_for_Determining_Authorship:_The_
Junius_Letters,_1769-1772_. Gothenburg Studies in English 13.
Ivor S. Francis. 1966. An Exposition of a Statistical Approach to the
Federalist Dispute, in Jacob Leed (ed) _The_Computer_and_Literary_Style_.
Ohio.
Survey of the field:
Richard W. Bailey. Statistics and Style: A Historical Survey. (pp217-236)
This deals with the general question of reliability:
"What is wanted... is a litmus test by which the critic can decide
whether or not two given texts were written by the same author. Though
some attempts have been made to formulate such a test, they have been
almost wholly unsuccessful." (p222)
Some other surveys cited by Bailey:
William J. Paisley. 1964. Identifying the Unknown Communicator [...]
_The_Journal_of_Communication_, XIV (1964): 219-237.
Rebecca Posner. 1963. The Use and Abuse of Stylistic Statistics.
_Archivum_Linguisticum_ XV (1963): 111-119.
Hope this helps. I should add though that there's been considerable
progress in text manipulation on computers since 1970, so it's out of
date in some areas; however, this is more or less made up for by falling
interest and a lack of progress in statistical style analysis since then.
----------------
Date: Sat, 4 Jun 1994 11:51:03 -0600
From: nostler@crl.nmsu.edu (Nick Ostler)
Your colleague should look at a work by AJP Kenny on assessing the
authorship of Aristotle's Eudemian Ethics: "The Aristotelian ethics:
a study of the relationship..." Oxford: Clarendon Press, 1978.
[also recommended by
Virginia Knight <ZZAASVK@cms.manchester-computing-centre.ac.uk>]
------------------
Date: Mon, 6 Jun 94 13:58:20 +0200
From: monique@gia.univ-mrs.fr (Monique Rolbert)
We are a text-database/NLP team developing a query language for textual
databases (working from an SGML-type format), and one question is what
kinds of operators it would be useful to offer a user who wants to do
natural-language processing on texts.
I would be very interested to hear what sorts of needs you have for the
kind of comparison you want to make (statistical-stylistic).
Thanks in advance.
Monique Rolbert
monique.rolbert@gia.univ-mrs.fr
-----------------
Date: Mon, 6 Jun 1994 23:34:23 +1000
From: sussex@lingua.cltr.uq.oz.au (Prof. Roly Sussex)
John Burrows (LCJFB@cc.newcastle.edu.au) at the University of Newcastle,
Australia, has done important work on text analysis and authorship.
You could email him direct.
Roly Sussex
Director
Centre for Language Teaching and Research
and
Language and Technology Centre of the National Languages and Literacy
Institute of Australia
University of Queensland
Queensland 4072
Australia
email: sussex@lingua.cltr.uq.oz.au
phone: +61 7 365-6896 (work)
+61 7 300-2942 (home)
fax: +61 7 365-7077
-------------------
Date: Mon, 6 Jun 94 15:37:19 -0700
From: edwards@cogsci.Berkeley.EDU (Jane A. Edwards)
Your query reminded me of a recent exchange regarding stylistic
analysis, though in different context. Hope this is of use. -Jane Edwards
| ------------------------
| Date: Mon, 31 Jan 1994 11:59:00 -0500
| From: neff@watson.ibm.com (Mary Neff)
| To: FL-LIST@BHAM.AC.UK
| Cc: neff@watson.ibm.com
| Subject: The Case of the Plagiarized Patent
|
| A few months back I was buttonholed at a party by the owner of a company
| in the middle of a patent infringement case. He wanted to know if, as a
| linguist, I might have anything useful to offer. Not a lot, it's not my
| field, but I just found this list, and one of YOU might. It seems that his
| company had signed a contract with another one that included giving them
| access to his design documentation and his patent applications. Some
| time later, he discovered that the other company was siphoning off his
| business and making a product too similar to his to be accidental, and
| had filed patents also (I think in other countries). His question to me
| was whether it was possible to study and compare the two patents by
| structure, language, etc. to determine whether there might have been
| any plagiarism involved. I later looked at the patents and decided that
| it was perhaps not a wild idea, but that any investigation would also
| have to take into account the general "formula" of a patent, which might
| account for a lot of similarity. Who are the experts on this sort of
| thing? What are some of the other issues involved? It's not so often
| that I get approached at a party for some free advice as a linguist;
| usually it's the doctors and the lawyers that encounter that sort of
| thing!
|
| Interestingly, I read something in this month's DISCOVER magazine that
| mentions a couple of guys who designed a computer program to snoop for
| plagiarism in books.
------------------
Date: Tue, 01 Feb 94 10:03:54 EST
From: Larry Horn <LHORN@YaleVM.CIS.Yale.edu>
Thanks for the postings. The lawyer has settled on one of my earlier
respondents, Gerry McMenamin of Fresno, who wrote a book on authorship
determination. Apparently computers are indeed much used in these matters,
but I don't whether his samples (from his client and another man) are
generous enough to allow for statistical significance.
I guess McMenamin will help him decide.
------------------
From: "Richard Hamilton-Williams" <RJHW@registry.cit.ac.nz>
Date: Wed, 8 Jun 1994 13:29:26 GMT+1200
Long ago, but not so far away, I studied Middle High German and wrote
a bit of a thesis on the transmission of MHG texts.
It wasn't very popular with a lot of people because it made little
reference to "taste" and was based, rather, on a statistical analysis
of variance between texts. My professor at the time got me interested
in this and he in turn had got it from a book called, I think, "The
Calculus of Variants". I've an idea it was written by E H
Greig (Gregg?) in the 1920s or 1930s. In any case, I think I have a
copy at home and will send you the details tomorrow.
I made use of a fairly crude algorithm which established a model as if
the transmission of texts were known, and then measured actual
variation against this. My professor died, nothing to do with me I
hope, and although I completed my degree I went on to other things,
so I can't claim that I know what goes on in the field nowadays. I
imagine, however, that analysis is much more sophisticated now - I
used punchcards to enter data on a mainframe - although the concepts
should be very similar.
Richard Hamilton-Williams
Central Institute of Technology, Wellington, New Zealand
04 527-6397 x6982
Private Bag 39807
Wellington Mail Centre
New Zealand
From: "Richard Hamilton-Williams" <RJHW@registry.cit.ac.nz>
Date: Thu, 9 Jun 1994 08:03:41 GMT+1200
The reference is:
Greg, W. W. The Calculus of Variants, An Essay on Textual Criticism
(Oxford, 1927)
Greg wrote a number of other things and edited works on the basis of
his theories on textual transmission.
Richard Hamilton-Williams
Central Institute of Technology, Wellington, New Zealand
04 527-6397 x6982
Private Bag 39807
Wellington Mail Centre
New Zealand
------------------
From: h9290030@hkuxa.hku.hk (R.Y.L. TANG)
Subject: Authorship identification
In David Crystal's _The Cambridge Encyclopedia of Language_ (Cambridge
UP, 1987), there is a very succinct account of the use of statistics in
stylistic analysis and authorship identification (Chapter 12).
-----------
From: Brett.Baker@linguistics.su.edu.au (Brett Baker)
Date: Wed, 15 Jun 1994 15:41:12 +1000
... I don't know if this will be much use to your colleague, but
she could do worse than have a look at a new monograph by John Myhill called
'Typological Discourse Analysis' published by Blackwell 1992. Apart from
loads of interesting stuff about analysing texts quantitatively, it also has
references for analyses that have been done on written texts which sound
like the kind of thing you want. Much of the purpose of this kind of
analysis is to show up regularities of expression type and
stylistic/grammatical function. Good luck.
-----------
Date: Thu, 16 Jun 1994 00:24:45 -0500 (CDT)
From: Kristin E Hiller <hill0087@gold.tc.umn.edu>
This is in response to the query you posted on Linguist (on behalf of
your colleague). I'm sorry it's taken me so long to respond.
Stylostatistical studies abound concerning cases of disputed authorship.
You mention the Shakespeare/Marlowe controversy. I'll name a few others:
1) One of the most often cited cases of disputed authorship is that of
the _Federalist Papers_. Of the 85 papers, the authorship of twelve was
in question (having been written by either Madison or Hamilton).
2) Several anonymous articles appeared in the journals _Vremja_ (_Time_)
and _`Epoxa_ (_Epoch_), which were both edited by Dostoevsky. Some of the
articles have variously been attributed to Dostoevsky.
3) The authorship of _The Junius Letters_, not known for certain, has
often been attributed to Sir Philip Francis (although some 40 others were
considered at one time or another).
4) Gustave Adlerfeld's _The Military History of Charles XII_ was
anonymously translated from the French. Henry Fielding is considered by
some to have been the translator.
5) Some scholars maintain that Sholoxov did not actually write all of
_Tixij Don_, but plagiarized Krjuchkov's manuscripts.
I have only recently begun reading about this field and have already
come across many references to the work done on (1) by Frederick Mosteller
and David Wallace _Inference and Disputed Authorship: "The Federalist"_
(Reading, MA: Addison-Wesley, 1964) and a less statistic-laden work,
Francis, Ivor S. "An Exposition of the Statistical Approach to the
_Federalist_ Dispute," in _The Computer and Literary Style_, ed. Jacob
Leed (Kent: Kent State U. Press, 1966).
Geir Kjetsaa tackles (2) in his book (written in Russian) _Prinadlezhnost'
Dostoevskomu: K voprosu ob atribucii F.M. Dostoevskomu anonimnyx statej v
zhurnalax "Vremja" i "Epoxa"_ (Oslo: Solum Forlag, 1986).
Michael and Jill Farringdon address (4) in "A computer-aided study of the
prose style of Henry Fielding and its support for his translation of the
Military History of Charles XII", in _Advances in Computer-aided Literary
and Linguistic Research: Proceedings of the Fifth International Symposium
on Computers in Literary and Linguistic Research_, D.E. Ager, F.E. Knowles,
Joan Smith, eds. (Birmingham: AMLC, 1979).
Rudall, B.H. and T.N. Corns, _Computers and Literature: A practical guide_
(Cambridge, MA: Abacus Press, 1987) contains a chapter on "Author
identification and canonical investigation."
With all the literature out there I could continue listing references
until my fingers ached from typing. Instead I'll just list two more:
Kenny, Anthony, _The Computation of Style: An introduction to statistics
for students of literature and humanities_ (Oxford: Pergamon Press, 1982).
A great book -- the title says it all.
Feldman, Paula R. and Buford Norman. _The Wordworthy Computer: Classroom
and research applications in language and literature_ (NY: Random House,
c.1987). The best part of this book is its HUGE bibliography. A very
good starting point.
--------------------------------------------------------------------------
LINGUIST List: Vol-5-1067.