THE PLACE OF LATIN LANGUAGES AND CULTURES ON THE INTERNET
4.3 Statistical methodology
The 90% and 99% confidence intervals were established using
Student's t-distribution [4], under the hypothesis of a
normal distribution.
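As a minimal sketch of the kind of interval described above, the following Python computes a two-sided Student t interval around a sample mean. The sample values are invented for illustration and the sample size of 10 is an assumption; the critical values 1.833 (90%) and 3.250 (99%) are the standard table entries for 9 degrees of freedom. The function name is hypothetical and is not taken from the study.

```python
import math
from statistics import mean, stdev


def t_confidence_interval(sample, t_critical):
    """Return (low, high) for mean +/- t * s / sqrt(n).

    t_critical must be the two-sided critical value of Student's t
    for len(sample) - 1 degrees of freedom.
    """
    n = len(sample)
    centre = mean(sample)
    half_width = t_critical * stdev(sample) / math.sqrt(n)
    return centre - half_width, centre + half_width


# Invented percentages (NOT data from the study), n = 10.
sample = [3.1, 4.2, 2.8, 5.0, 3.6, 4.4, 2.9, 3.8, 4.1, 3.5]

T_90_DF9 = 1.833  # two-sided 90%, 9 degrees of freedom
T_99_DF9 = 3.250  # two-sided 99%, 9 degrees of freedom

low90, high90 = t_confidence_interval(sample, T_90_DF9)
low99, high99 = t_confidence_interval(sample, T_99_DF9)
print(f"90%: [{low90:.2f}, {high90:.2f}]")
print(f"99%: [{low99:.2f}, {high99:.2f}]")
```

The 99% interval is always wider than the 90% one for the same sample, which is why a widely dispersed sample yields intervals too large to be useful.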
4.3.1 Results of the measurements on the WWW with the HotBot search engine
The table in Appendix 7 gives the number of occurrences
of each term on the WWW for each language. The occurrences were counted with the HotBot engine,
so every occurrence appearing on any web page
indexed by the engine is taken into account.
4.3.2 Statistical calculations on the WWW
Here are the average percentages representing
the presence of Latin languages in comparison with English. Appendix 8 (Statistical calculations in the WWW
area) gives the detailed word-by-word table.
| | Spanish | French | Italian | Portuguese | Rumanian |
| AVERAGE | 3,37% | 3,75% | 2,00% | 1,09% | 0,20% |
| Standard deviation | 3,07% | 1,78% | 1,76% | 0,99% | 0,33% |
| Margin of variance | 0,96 | 0,69 | 0,94 | 0,95 | 1,27 |
The margin of variance is the square root of
the squared standard deviation divided by the squared mean. A value greater than 1
indicates a wide dispersion and, therefore, an unreliable mean. A value below 1
indicates a narrow dispersion: the lower the value, the higher the
reliability of the result.
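The wording of this definition can be read in more than one way, so the sketch below computes two readings from the averages and standard deviations of the WWW table above. The function names are hypothetical. The first follows the sentence literally; the second (square root of the plain ratio) is an assumption, included only because it seems to land closer to the rounded table entries, for instance 0,69 for French. This is an observation about the published figures, not a statement of the authors' method.

```python
import math


def margin_as_worded(mean, std_dev):
    """Literal reading: sqrt(std_dev^2 / mean^2), i.e. std_dev / mean."""
    return math.sqrt(std_dev ** 2 / mean ** 2)


def margin_sqrt_of_ratio(mean, std_dev):
    """Alternative reading (an assumption): sqrt(std_dev / mean)."""
    return math.sqrt(std_dev / mean)


# (average %, standard deviation %) copied from the WWW table above.
www = {
    "Spanish": (3.37, 3.07),
    "French": (3.75, 1.78),
    "Italian": (2.00, 1.76),
    "Portuguese": (1.09, 0.99),
    "Rumanian": (0.20, 0.33),
}

for language, (avg, sd) in www.items():
    literal = round(margin_as_worded(avg, sd), 2)
    alternative = round(margin_sqrt_of_ratio(avg, sd), 2)
    print(f"{language}: literal {literal}, sqrt of ratio {alternative}")
```

Under either reading, a value above 1 means the standard deviation exceeds the mean, which is the case for Rumanian.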
4.3.3 Results of the measurements on Usenet with the DejaNews search engine
The table in Appendix 9 gives the number of
occurrences of each term on Usenet for each language, counted with the DejaNews engine.
4.3.4 Statistical calculations on Usenet
Here are the average percentages representing
the presence of Latin languages in comparison with English. Appendix 10 (Statistical calculations in
the Usenet area) gives the detailed word-by-word table.
| | Spanish | French | Italian | Portuguese | Rumanian |
| AVERAGE | 2,41% | 1,44% | 2,54% | 1,12% | 0,14% |
| Standard deviation | 1,37% | 1,65% | 2,74% | 5,47% | 0,48% |
| Margin of variance | 0,75 | 1,07 | 1,04 | 2,21 | 1,83 |
4.4 Comparison with other studies
4.4.1 Comparison with the previous studies
Between the first study and this one, the
English/French, French/Spanish and English/Spanish ratios have evolved in the following way:
| | English / French | French / Spanish | English / Spanish |
| March 1996 (L1) | 21,91 | 2,40 | 52,58 |
| March 1997 (L2) | 19,99 | 1,92 | 38,38 |
| March 1998 (L3) | 17,60 | 1,33 | 23,32 |
| Sept. 1998 (L4) | 35,59 | 1,11 | 39,53 |
Does this mean that the Latin languages are in
decline compared with previous years? Not at all! This evolution has
two main causes:
- Changes introduced in the statistical method. For this study,
the authors focused on the ratio of French to English rather than the reverse.
This was done in order to obtain a normalized distribution (that is to say, values
between 0 and 1).
- A different reference sample.
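To see why the first change alone can shift the published ratios, consider this small sketch. The word counts are invented for illustration (they are not data from the study) and the variable names are hypothetical. Averaging per-word English/French ratios does not give the same figure as averaging the French/English ratios and inverting the result.

```python
from statistics import mean

# Invented occurrence counts per word pair: (english, french).
counts = [(1000, 50), (800, 20), (1200, 30)]

en_over_fr = [en / fr for en, fr in counts]  # unbounded ratios
fr_over_en = [fr / en for en, fr in counts]  # ratios between 0 and 1

mean_en_fr = mean(en_over_fr)                 # average of English/French
inverse_of_mean_fr_en = 1 / mean(fr_over_en)  # inverted average of French/English

print(f"mean(EN/FR)     = {mean_en_fr:.2f}")
print(f"1 / mean(FR/EN) = {inverse_of_mean_fr_en:.2f}")
```

The two figures differ even though they are built from the same counts, so ratios computed under the two conventions cannot be compared directly.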
There is no doubt that the nature of the reference sample
heavily influences the average and the margin of variance. Almost none
of the original L1 samples would have satisfied all the linguistic filter criteria
rigorously established for the L4 study! If the statistical methodology of the present
study had been applied to the original sample, the margin of variance would have been
significantly greater than 1 and the confidence intervals would have been very wide.
The linguistic work has revealed the very high
probability of homography between the Latin languages. The L1 sample, defined without
any particular linguistic pretension, thus favoured the Latin languages because of
the homography phenomena already mentioned. Moreover, the choice of terms was not
sufficiently "culturally neutral".
It is therefore very difficult to link this study, which is truly
rigorous from the linguistic point of view, to the previous studies and to extrapolate
trends. However, a scientific analysis of the evolution can begin now, starting from this
newly created sample.
One of the conclusions of the present study is that it is
impossible to compare English with a single Latin language, because of the high
probability of homography between the Latin languages: the occurrences of a
given form could be attributed to a single language when they should be distributed
among several languages (for example, the form "familia" means "family"
in Spanish, Portuguese and Rumanian alike).
All this again points to the need for an association between
the Agence de la Francophonie and the Union Latine for the supervision of this study.
4.4.2 Comparison with Alis and AltaVista
During the L3 study, we undertook a
comparison with the Alis Technologies study, questioning the Alis results
and their overestimation of the presence of English. The results presented here, now
considered to be reliable, show that French was overestimated by a
large margin (around 100%). Does this mean that the Alis numbers were closer to
reality than they seemed to be? Not exactly. If this comparison were
made again today, in the light of the more rigorous results that we have obtained, we
would get numbers closer to those suggested by the AltaVista language recognition
algorithm and still far from the Alis results, which, according to our study, are
consistently biased towards the English language.
The Alis numbers are those
published on the Web, which had not been updated at the time of our study. The
AltaVista figures are obtained by the "empty set complement" method described
in the L3 study. The comparisons
are made on the hypothesis of an identical percentage for English.
Table 7: Comparison with AltaVista and Alis results
| | | ALTAVISTA % WITHOUT CORRECTION | ALTAVISTA % WITH CORRECTION (*) | ALIS % WITHOUT CORRECTION | ALIS % WITH CORRECTION | ACCT / UL / FUNREDES comparison with AltaVista | ACCT / UL / FUNREDES comparison with Alis |
| ANY | 107,958,869 | | | | | | |
| ENGLISH | 70,065,677 | 64,90% | 76,35% | 84,00 | 82,30 | 76,35% | 82,30 |
| JAPANESE | 4,369,675 | 4,05% | 4,76% | 3,10 | 1,6 | | |
| GERMAN | 4,009,554 | 3,71% | 4,37% | 4,50 | 4,00 | | |
| FRENCH | 1,951,446 | 1,81% | 2,13% | 1,8 | 1,5 | 2,86 | 3,08 |
| SPANISH | 1,495,195 | 1,38% | 1,63% | 1,20 | 1,10 | 2,57 | 2,77 |
| ITALIAN | 1,490,109 | 1,38% | 1,62% | 1,00 | 0,80 | 1,53 | 1,65 |
| PORTUGUESE | 905,676 | 0,84% | 0,99% | 0,70 | 0,70 | 0,83 | 0,90 |
| RUMANIAN | 28,052 | 0,03% | 0,03% | | | 0,15 | |
| THE REST | 23,643,485 | | 25,77% | | Multilingual websites | | |
| THE REST CORRECTED | 7,449,655 | | 8,12% | | 15% | | |
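The arithmetic behind Table 7 can be sketched as follows. Two steps here are assumptions inferred from the published figures rather than stated by the authors: that the corrected total replaces "THE REST" with "THE REST CORRECTED", and that the study's comparison column multiplies each WWW average of section 4.3.2 by the English share, in line with the hypothesis of an identical percentage for English. The variable names are hypothetical.

```python
# AltaVista page counts copied from Table 7.
ANY = 107_958_869           # pages in any language
REST = 23_643_485           # "THE REST"
REST_CORRECTED = 7_449_655  # "THE REST CORRECTED"
counts = {
    "English": 70_065_677,
    "French": 1_951_446,
    "Spanish": 1_495_195,
    "Italian": 1_490_109,
    "Portuguese": 905_676,
    "Rumanian": 28_052,
}

# Average presence relative to English on the WWW (section 4.3.2), in percent.
www_average = {
    "Spanish": 3.37,
    "French": 3.75,
    "Italian": 2.00,
    "Portuguese": 1.09,
    "Rumanian": 0.20,
}

# Assumption: the corrected total swaps the raw remainder for the corrected one.
corrected_total = ANY - REST + REST_CORRECTED

without_correction = {lang: 100 * n / ANY for lang, n in counts.items()}
with_correction = {lang: 100 * n / corrected_total for lang, n in counts.items()}

# Assumption: each study figure is the WWW average scaled by the English share.
english_share = with_correction["English"]
study_share = {lang: english_share * pct / 100 for lang, pct in www_average.items()}

for lang in www_average:
    print(f"{lang}: AltaVista {with_correction[lang]:.2f}%, study {study_share[lang]:.2f}%")
```

Under these assumptions the sketch reproduces the English shares (64,90% and 76,35%) and the "comparison with AltaVista" entries of the table.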
Comparison with the numbers obtained from
AltaVista
In comparison with our study results:
- AltaVista figures for English seem reliable.
- AltaVista figures for French are underestimated by 35%.
- AltaVista figures for Spanish are underestimated by 58%.
- AltaVista figures for Italian are overestimated by 6%.
- AltaVista figures for Portuguese are overestimated by 16%.
- AltaVista figures for Rumanian are underestimated by 403%.
Comparison with the numbers published by Alis Technologies
In comparison with our study results:
- Alis figures for English seem too high to us.
- Alis figures for French are underestimated by 106%.
- Alis figures for Spanish are underestimated by 152%.
- Alis figures for Italian are underestimated by 106% (0,80 for Alis against 1,65 in Table 7).
- Alis figures for Portuguese are underestimated by 28% (0,70 for Alis against 0,90 in Table 7).
- Alis does not consider Rumanian.
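As a rough check of the Alis percentages, the sketch below computes the gap between the study's figure and the Alis figure, expressed as a percentage of the Alis figure. It assumes the inputs are the "ALIS % WITH CORRECTION" and "comparison with Alis" entries of Table 7; the function name is hypothetical. Because the table entries are rounded, the results match the quoted percentages only to within about one point.

```python
# (Alis % with correction, study figure compared with Alis), from Table 7.
table7 = {
    "French": (1.5, 3.08),
    "Spanish": (1.10, 2.77),
    "Italian": (0.80, 1.65),
    "Portuguese": (0.70, 0.90),
}


def relative_gap(other, study):
    """Gap between the study's figure and another source, in % of that source."""
    return 100 * (study - other) / other


gaps = {lang: relative_gap(alis, study) for lang, (alis, study) in table7.items()}

for lang, gap in gaps.items():
    # A positive gap means the study's figure is higher than the Alis figure.
    print(f"{lang}: {gap:+.0f}%")
```

All four gaps come out positive, meaning that the study's figure is higher than the Alis figure for each of these languages.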
Table 8: Comparative synthesis of the four methods
| | EN/FR | FR/SP | EN/SP |
| ALTAVISTA METHOD «EMPTY SET COMPLEMENT» | 35,90 | 1,31 | 46,86 |
| ALIS METHOD | 46,67 | 1,36 | 63,64 |
| APPROX. FUNREDES METHOD | 17,60 | 1,33 | 23,32 |
| FUNREDES/UL/ACCT METHOD | 35,59 | 1,11 | 39,53 |
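The AltaVista row of Table 8 can be recomputed directly from the page counts of Table 7, as this short sketch shows. It assumes that each ratio is simply the quotient of the two page counts; the variable names are hypothetical.

```python
# AltaVista page counts copied from Table 7.
counts = {
    "English": 70_065_677,
    "French": 1_951_446,
    "Spanish": 1_495_195,
}

en_fr = counts["English"] / counts["French"]
fr_sp = counts["French"] / counts["Spanish"]
en_sp = counts["English"] / counts["Spanish"]

print(f"EN/FR = {en_fr:.2f}")
print(f"FR/SP = {fr_sp:.2f}")
print(f"EN/SP = {en_sp:.2f}")
```

By construction, EN/SP equals EN/FR multiplied by FR/SP, which gives a quick consistency check for any row of the table.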
5. Prospects for an observation follow-up
It is now possible to repeat the measurements with the
same linguistic sample at regular intervals. This makes it possible to estimate the respective
evolution of the different Latin languages, both in comparison with English and among
themselves. To this end, it would be desirable to automate the measurements and the
production of results.
6. Internet references to related works
Concerning the weight of languages on the Web in
general, the only available reference is the already mentioned «Web
Languages Hit Parade» by Alis Technologies: http://babel.alis.com:8080/palmares.en.html
It is also worth mentioning a web page (in
English) that publishes statistics on the distribution of Internet users according to their
mother tongue, "Global statistics by language": http://www.euromktg.com/globstats/
For some linguistic areas, groups or individuals are
gathering the existing information and/or commenting on it:
- For the French-speaking area, the CIDIF has created, with the
support of the Agence de la Francophonie, and runs "L'état du développement
et de l'utilisation de l'inforoute dans l'espace francophone":
http://www1.cidif.org/franco
- For the Spanish-speaking world, two researchers are working on the theme of
the Internet and Spanish:
There is also a regularly updated inventory of
statistical data on the Internet in Latin America and the Caribbean: http://www.cr/latstat/ These figures are
based on the usual reference source for this sort of statistics: Network Wizards (http://www.nw.com).
Finally, there are the general references on Internet
statistics, which for now do not include any special section on languages or
cultures:
4 John E. Freund,
"Mathematical Statistics", 2nd edition, 1972, Prentice Hall International.
Chapter 9, "Estimation".
5 Who translated the L1, C1 and L2 studies into
English and published them in the Matrix News review.