Re: minimum sample

From: Jonathan Robie (jonathan@texcel.no)
Date: Fri Mar 05 1999 - 07:51:04 EST


At 08:16 AM 3/5/99 +0100, Daniel Riaño wrote:
 
> I am not sure I understand what you call
>descriptive statistics, either because of my scarce knowledge of statistics
>in general or because I can't find the Spanish translation. Could you give
>me an example? Feel free to answer to the list if you think it can clarify
>something.

Descriptive statistics describe the distribution of a known universe. The
goal of descriptive statistics is description, not prediction. For
instance, if I have 10 marbles, and 6 are black, I can say that 60% of
these marbles are black. The US Census Bureau's statistics are a good
example of descriptive statistics.
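
Just to make the marble example concrete, here is a tiny sketch in a
present-day language (Python); the numbers are simply the toy figures from
above:

    # Descriptive statistics over a fully known universe: 10 marbles, 6 black.
    marbles = ["black"] * 6 + ["white"] * 4

    total = len(marbles)
    black = marbles.count("black")

    # This describes *these* marbles; it predicts nothing about other marbles.
    print(f"{black}/{total} marbles are black ({black / total:.0%})")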

Exploratory data analysis techniques are useful for describing data in a
known universe. I'm 15 years behind the times on this subject, but
clustering and scaling techniques were what we used back then - at the
time, we had to write our own programs to do this stuff, but there's
commercial software that does this now, with nice graphics. I also don't
have time to check all this right now, so I'm probably getting some of this
wrong, but here goes...

> I'll try to make myself clearer (though I am translating the
>Spanish terms for chart drawing): we agree (do we?) that there is a set of
>rules that govern any natural language's syntax.

Yes, we do agree on this.

>You are limited to the testimony of written
>text: you don't have a 2,000-year-old Greek speaker to test his linguistic
>competence.

Bummer.

>Anyway, you could draw a graph where you represent on the
>horizontal axis the volume of your corpus (the number of words), and, with
>a higher degree of conventionality, you could represent on the vertical
>axis the completeness of your description of the system of Greek syntax in
>the first century AD, the norm in Palestine at the time, or the use of the
>language by a given author. Now, the curve that represents the (foreseeable)
>growth of your descriptions will rise as you parse text, but at some
>moment the line that represents your description of the system will start
>to rise much more slowly, and at some point it will be flat (congratulations:
>you are now in the history books), and some time later the slope of the
>line that represents the accuracy of your description of some author's
>linguistic usage will tend to zero. And my question is: is there a
>statistical method to calculate where the inflection point of that line is?
>Is there a "minimum sample" for linguistic studies over text corpora?

First off, I should mention that I'm somewhat uncomfortable answering this
in the abstract, since my answer would depend to a great degree on the
extent to which a particular phenomenon can be clearly stated and
objectively tested. For instance, consider a study of the use of accents
and punctuation. For this kind of study it is fairly easy to state claims
in black and white, and it would be easy to draw such a curve objectively.
The distribution of particular verb forms can also be treated in this way,
e.g. the use of the perfect. In either case, the claims that can be made
are fairly broad and simple: the use of accents increased over time, and a
curve can be drawn to show that; the use of the perfect decreased over
time, and a similar curve can be drawn to show that. In either case, the
real number can be known, and exploratory data analysis brought to bear.
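
To give a rough idea of what drawing such a curve involves, here is a
sketch; the texts, dates, and counts are invented purely for illustration,
and the counting itself would of course come from your parsed corpus:

    # Sketch: relative frequency of a clearly testable feature (say, perfect
    # verb forms) against the date of each text. Since the whole corpus is
    # counted, this is description, not estimation.
    texts = [
        {"name": "text A", "date": -50, "perfects": 120, "verbs": 2400},
        {"name": "text B", "date":  60, "perfects":  90, "verbs": 2600},
        {"name": "text C", "date": 150, "perfects":  40, "verbs": 2200},
    ]

    for t in sorted(texts, key=lambda t: t["date"]):
        rate = t["perfects"] / t["verbs"]
        print(f'{t["name"]}: year {t["date"]}, {rate:.1%} of verb forms are perfect')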

Using a clustering or scaling technique, I can bring in more dimensions in
a way that can increase my insight. For instance, I might want to add
geographical location, genre of literature, etc. I can throw in any number
of factors, and I will get a plot or a graph that groups texts according to
common characteristics - it's like letting the system decide what the X and
Y coordinates are, and telling the human being which factors were most
helpful in grouping the data.
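
As a sketch of what that looks like with today's off-the-shelf software
(scikit-learn here; the feature values are invented, and choosing which
features to measure is exactly the part the human still has to do):

    import numpy as np
    from sklearn.manifold import MDS

    # Rows are texts; columns are features you chose to measure (rate of
    # perfects, accentuation practice, genre coded as a number, and so on).
    # All values below are invented for illustration.
    features = np.array([
        [0.05, 0.30, 1.0],
        [0.04, 0.32, 1.0],
        [0.02, 0.55, 0.0],
        [0.01, 0.60, 0.0],
    ])
    labels = ["text A", "text B", "text C", "text D"]

    # Multidimensional scaling lets the data pick the X and Y coordinates:
    # texts with similar profiles land near each other on the plot.
    coords = MDS(n_components=2, random_state=0).fit_transform(features)
    for name, (x, y) in zip(labels, coords):
        print(f"{name}: ({x:.2f}, {y:.2f})")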

With clustering and scaling, there are no confidence levels, and therefore
no minimum sample size. There is a technique called multidimensional factor
analysis that *does* allow you to make statistical assertions about
goodness of fit. I don't remember much about multidimensional factor
analysis; it's something I once knew.
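
For what it's worth, a rough way to poke at fit with today's tools is
something like the following (scikit-learn's FactorAnalysis; the data are
invented, and the average log-likelihood it reports is only a crude fit
measure, not a formal goodness-of-fit test):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    # Invented data: rows are texts, columns are measured features.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(30, 5))

    # Compare how well models with different numbers of factors describe the
    # data; a higher average log-likelihood means a better fit on this data.
    for k in (1, 2, 3):
        fa = FactorAnalysis(n_components=k, random_state=0).fit(features)
        print(f"{k} factor(s): average log-likelihood {fa.score(features):.2f}")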

For clustering and scaling, what is more relevant is the number of
observations that don't fit into groups that you can describe cleanly. What
matters is not the average, but the ability to identify the factors that
describe the data best. And it is really important to bring in enough
factors to consider the broader universe of what is being studied. I prefer
to study complex phenomena by first doing exploratory data analysis, then
identifying specific hypotheses, and finally verifying them with sampling
statistics.
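
One way to put a number on "observations that don't fit their group
cleanly", with today's tools, is a silhouette score; this sketch assumes
scikit-learn and uses invented data:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples

    # Invented data: two fairly tight groups plus a few texts that may not
    # belong cleanly to either.
    rng = np.random.default_rng(1)
    features = np.vstack([
        rng.normal(0.0, 0.3, size=(10, 3)),
        rng.normal(2.0, 0.3, size=(10, 3)),
        rng.normal(1.0, 1.5, size=(5, 3)),
    ])

    groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # Silhouette values near or below zero flag observations that sit poorly
    # in the group they were assigned to.
    scores = silhouette_samples(features, groups)
    poor_fits = int(np.sum(scores < 0.1))
    print(f"{poor_fits} of {len(features)} observations do not fit their group cleanly")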

For text corpora, the problem with that final sampling-statistics step is
that it can't really be applied with full validity, because we already know
the entire universe: the corpus is not a sample drawn from a larger
population, it *is* the population.

I don't know how clear this is... I hope it gives an idea of what I was
trying to say.

Jonathan
___________________________________________________________________________

Jonathan Robie jwrobie@mindspring.com

Little Greek Home Page: http://metalab.unc.edu/koine
Little Greek 101: http://metalab.unc.edu/koine/greek/lessons
B-Greek Home Page: http://metalab.unc.edu/bgreek
B-Hebrew Home Page: http://metalab.unc.edu/bhebrew
