THE PLACE OF LATIN LANGUAGES AND CULTURES ON THE INTERNET

4.3 Statistical methodology

The 90% and 99% confidence intervals were established using Student's t distribution⁴, under the hypothesis of a normal distribution.
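As an illustration of this kind of interval, here is a minimal sketch in Python. The sample values, the sample size of 10 and the function name `t_interval` are assumptions made for the example, not data or procedures from the study. The critical values are the usual two-sided table values of Student's t for 9 degrees of freedom.

```python
import math
import statistics

# Two-sided critical values of Student's t for 9 degrees of freedom,
# taken from a standard t table (a sample of 10 terms is an assumption).
T_CRIT_DF9 = {0.90: 1.833, 0.99: 3.250}


def t_interval(sample, t_crit):
    """Return (low, high) bounds of the confidence interval for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    sem = statistics.stdev(sample) / math.sqrt(n)
    return mean - t_crit * sem, mean + t_crit * sem


# Hypothetical per-term ratios (language occurrences / English occurrences).
ratios = [0.031, 0.042, 0.028, 0.055, 0.036, 0.047, 0.025, 0.039, 0.044, 0.033]

ci90 = t_interval(ratios, T_CRIT_DF9[0.90])
ci99 = t_interval(ratios, T_CRIT_DF9[0.99])
print("90% interval:", ci90)
print("99% interval:", ci99)
```

The 99% interval is always wider than the 90% interval for the same sample, since a larger critical value multiplies the same standard error.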

4.3.1 Results of the measurements on the WWW with the HotBot search engine

The table in Appendix 7 gives the number of occurrences of the terms on the WWW for each language. The occurrences were counted with the HotBot engine.

In this way, all the occurrences that appear on all the web pages indexed by the engine are taken into account.

4.3.2 Statistical calculations on the WWW

Here are the average percentages representing the presence of the Latin languages in comparison with English. Appendix 8 (Statistical calculations in the WWW area) gives the detailed word-by-word table.

	Spanish	French	Italian	Portuguese	Rumanian
AVERAGE	3,37%	3,75%	2,00%	1,09%	0,20%
*Standard Deviation*	*3,07%*	*1,78%*	*1,76%*	*0,99%*	*0,33%*
*Margin of variance*	*0,96*	*0,69*	*0,94*	*0,95*	*1,27*

The margin of variance is the square root of the squared standard deviation divided by the squared mean. A value greater than 1 indicates a high dispersion and, therefore, an unreliable mean. A value below 1 indicates a low dispersion; the lower the value, the more reliable the result.
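The definition can be checked against the WWW figures with a short Python sketch. Read literally, it gives the standard deviation divided by the mean. With the rounded values published in the table, however, the tabulated margins sit much closer to the square root of that ratio. Which of the two the authors actually computed is not stated, so the second column below is an observation on the numbers, not a confirmed formula. The function name is hypothetical.

```python
import math

# (mean %, standard deviation %, published margin of variance), WWW table.
www = {
    "Spanish": (3.37, 3.07, 0.96),
    "French": (3.75, 1.78, 0.69),
    "Italian": (2.00, 1.76, 0.94),
    "Portuguese": (1.09, 0.99, 0.95),
    "Rumanian": (0.20, 0.33, 1.27),
}


def literal_margin(mean, sd):
    """Literal reading of the text: sqrt(sd^2 / mean^2), i.e. sd / mean."""
    return math.sqrt(sd ** 2 / mean ** 2)


for language, (mean, sd, published) in www.items():
    literal = literal_margin(mean, sd)
    # Assumption: the square root of the ratio, which tracks the table more closely.
    rooted = math.sqrt(literal)
    print(f"{language:<11} literal={literal:.2f} sqrt={rooted:.2f} published={published:.2f}")
```

Under either reading the threshold of 1 separates the same cases, because the square root of a number is greater than 1 exactly when the number itself is.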

4.3.3 Results of the measurements on Usenet with the DejaNews search engine

The table in Appendix 9 gives the number of occurrences of the terms on Usenet for each language, counted with the DejaNews engine.

4.3.4 Statistical calculations on the Usenet

Here are the average percentages representing the presence of the Latin languages in comparison with English. Appendix 10 (Statistical calculations in the Usenet area) gives the detailed word-by-word table.

	Spanish	French	Italian	Portuguese	Rumanian
AVERAGE	2,41%	1,44%	2,54%	1,12%	0,14%
*Standard Deviation*	*1,37%*	*1,65%*	*2,74%*	*5,47%*	*0,48%*
*Margin of variance*	*0,75*	*1,07*	*1,04*	*2,21*	*1,83*

4.4 Comparison with other studies

4.4.1 Comparison with the previous studies

Between the first study and this one, the English/French and French/Spanish ratios have evolved as follows:

	English / French	French / Spanish	English / Spanish
March 1996 (L1)	21,91	2,40	52,58
March 1997 (L2)	19,99	1,92	38,38
March 1998 (L3)	17,60	1,33	23,32
Sept. 1998 (L4)	35,59	1,11	39,53

Does this mean that the Latin languages are in decline compared with previous years? Not at all! This evolution is due to two main reasons:

Changes introduced in the statistical method. For this study, the authors focused on the ratio of French to English, and not the reverse. This was done in order to have a normalized distribution (that is to say, values between 0 and 1).
A different reference sample.
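The first reason can be illustrated with a short Python sketch: averaging per-term ratios in one direction and then inverting does not give the same figure as averaging the inverse ratios directly. The three values below are hypothetical and chosen only to make the effect visible.

```python
import statistics

# Hypothetical per-term French/English ratios, all between 0 and 1.
fr_over_en = [0.02, 0.05, 0.10]

# Method A: average the French/English ratios, then invert the mean.
en_fr_from_mean = 1 / statistics.mean(fr_over_en)

# Method B: invert each ratio first, then average the English/French ratios.
en_fr_mean_of_inverses = statistics.mean(1 / r for r in fr_over_en)

print(f"inverse of the mean ratio : {en_fr_from_mean:.2f}")
print(f"mean of the inverse ratios: {en_fr_mean_of_inverses:.2f}")
```

For positive ratios that are not all equal, the mean of the inverses is always larger than the inverse of the mean, so a change of direction alone shifts the reported English/French figure.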

There is no doubt that the nature of the reference sample heavily influences the average and the margin of variance. Almost none of the terms in the original L1 sample would have satisfied all the linguistic filter criteria rigorously established for the L4 study! If the statistical methodology of the present study had been applied to the original sample, the margin of variance would have been significantly greater than 1 and the confidence intervals would have been very wide.

The linguistic work has revealed the very high probability of homography between the Latin languages. The L1 sample, defined without any particular linguistic pretension, thus favoured the Latin languages because of the homography phenomena already mentioned. Moreover, the choice of terms was not sufficiently "culturally neutral".

Therefore, it is very difficult to link this study, which is truly rigorous from the linguistic point of view, to the previous studies and to extrapolate trends. However, a scientific analysis of the evolution can begin from now on, based on this newly created sample.

One of the conclusions of the present study is that it is impossible to compare English with a single Latin language, because of the high probability of homography between the Latin languages: the occurrences of one form could be attributed to a single language whereas they should be distributed among several (thus, the form "familia" means "family" in Spanish, in Portuguese and in Rumanian alike).

All this again points to the need for an association between the Agence de la Francophonie and the Union Latine for the supervision of this study.

4.4.2 Comparison with Alis and AltaVista

During the L3 study, we undertook a comparison with the Alis Technologies study, questioning the Alis results and their overestimation of the presence of English. The results presented here, now considered reliable, show that the L3 study overestimated French by a large margin (around 100%). Does this mean that the Alis numbers were closer to reality than they seemed? Not exactly. If this comparison were made again today, in the light of the more rigorous results we have obtained, we would get numbers closer to those suggested by the AltaVista language recognition algorithm and still far from the Alis results, which, according to our study, are consistently biased towards English.

The Alis numbers are those published on the Web, which had not been updated at the time of our study. The AltaVista figures were obtained by the "empty set complement" method described in the L3 study. The comparisons assume an identical percentage for English.

**Table 7: Comparison with the AltaVista and Alis results**
LANGUAGE	ALTAVISTA PAGES	*ALTAVISTA* % WITHOUT CORRECTION	*ALTAVISTA* % WITH CORRECTION (*)	*ALIS* % WITHOUT CORRECTION	*ALIS* % WITH CORRECTION	ACCT / UL / FUNREDES comparison with AltaVista	ACCT / UL / FUNREDES comparison with Alis
ANY	107,958,869
ENGLISH	70,065,677	64,90%	76,35%	84,00	82,30	*76,35%*	82,30
JAPANESE	4,369,675	4,05%	4,76%	3,10	1,6
GERMAN	4,009,554	3,71%	4,37%	4,50	4,00
FRENCH	1,951,446	1,81%	2,13%	1,8	1,5	2,86	3,08
SPANISH	1,495,195	1,38%	1,63%	1,20	1,10	2,57	2,77
ITALIAN	1,490,109	1,38%	1,62%	1,00	0,80	1,53	1,65
PORTUGUESE	905,676	0,84%	0,99%	0,70	0,70	0,83	0,90
RUMANIAN	28,052	0,03%	0,03%			0,15

THE REST	23,643,485		25,77%		Multilingual websites
THE REST CORRECTED	7,449,655		8,12%		15%
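The AltaVista percentage columns of Table 7 can be reproduced with the Python sketch below. The reading of the correction used here (replace the "rest" count by the "rest corrected" count in the total before dividing) is inferred from the numbers in the table and is an assumption, not a procedure spelled out in the text. The variable and function names are hypothetical.

```python
# AltaVista page counts from Table 7.
pages = {
    "ENGLISH": 70_065_677,
    "FRENCH": 1_951_446,
    "SPANISH": 1_495_195,
    "ITALIAN": 1_490_109,
    "PORTUGUESE": 905_676,
}
total = 107_958_869          # row "ANY"
rest = 23_643_485            # row "THE REST"
rest_corrected = 7_449_655   # row "THE REST CORRECTED"

# Assumption: the corrected total swaps the raw rest for the corrected rest.
corrected_total = total - rest + rest_corrected


def percent(count, base):
    """Share of `count` in `base`, as a percentage."""
    return 100 * count / base


for language, count in pages.items():
    without = percent(count, total)
    with_correction = percent(count, corrected_total)
    print(f"{language:<10} {without:6.2f}% {with_correction:6.2f}%")
```

Under this reading, the pages removed from the total amount to about 15% of it, which matches the share shown for multilingual websites.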

Comparison with the numbers obtained from AltaVista

Compared with the results of our study:

AltaVista figures for English seem reliable.
AltaVista figures for French are underestimated by 35%.
AltaVista figures for Spanish are underestimated by 58%.
AltaVista figures for Italian are overestimated by 6%.
The AltaVista result for Portuguese is overestimated by 16%.
The AltaVista result for Rumanian is underestimated by 403%.
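These percentages can be approximated from Table 7 with a short Python sketch. It assumes that each gap is the difference between the study figure and the corrected AltaVista figure, divided by the AltaVista figure. That choice of denominator is inferred from the numbers, not stated in the text, and the function name is hypothetical.

```python
# (AltaVista % with correction, study % in the AltaVista comparison), Table 7.
figures = {
    "French": (2.13, 2.86),
    "Spanish": (1.63, 2.57),
    "Italian": (1.62, 1.53),
    "Portuguese": (0.99, 0.83),
    "Rumanian": (0.03, 0.15),
}


def relative_gap(reference, study):
    """Gap of the study figure from the reference, in % of the reference.

    Positive: the reference underestimates. Negative: it overestimates.
    """
    return (study - reference) / reference * 100


gaps = {lang: relative_gap(ref, study) for lang, (ref, study) in figures.items()}
for language, gap in gaps.items():
    verdict = "underestimated" if gap > 0 else "overestimated"
    print(f"{language:<11} {verdict} by {abs(gap):.0f}%")
```

Because the table values are rounded, the results can differ from the percentages quoted in the text by a point or so (for example for French and Rumanian).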

Comparison with the numbers published by Alis Technologies

Compared with the results of our study:

Alis figures for English seem too high to us.
Alis figures for French are underestimated by 106%.
Alis figures for Spanish are underestimated by 152%.
Alis figures for Italian are underestimated by 106%.
The Alis result for Portuguese is underestimated by 28%.
Alis does not consider Rumanian.

**Table 8 : Comparative synthesis of the four methods**
	EN/FR	FR/SP	EN/SP
ALTAVISTA METHOD «EMPTY SET COMPLEMENT»	35,90	1,31	46,86
ALIS METHOD	46,67	1,36	63,64
APPROX. FUNREDES METHOD	17,60	1,33	23,32
FUNREDES/UL/ACCT METHOD	35,59	1,11	39,53
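The AltaVista row of this synthesis can be recomputed from the page counts of Table 7, as in the Python sketch below. Only that row is reproduced. How the other rows were derived from their source percentages is not detailed in the text, so they are left out.

```python
# AltaVista page counts taken from Table 7.
english = 70_065_677
french = 1_951_446
spanish = 1_495_195

# Each ratio is a plain quotient of page counts.
en_fr = english / french
fr_sp = french / spanish
en_sp = english / spanish

print(f"EN/FR = {en_fr:.2f}")
print(f"FR/SP = {fr_sp:.2f}")
print(f"EN/SP = {en_sp:.2f}")
```

Since the correction applied to the AltaVista percentages rescales every language by the same total, these ratios are the same with or without it.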

5. Prospects for an observation follow-up

It is now possible to repeat the measurements with the same linguistic sample at regular intervals. This makes it possible to estimate the respective evolution of the different Latin languages, both in comparison with English and among themselves. To this end, it would be desirable to automate the measurement and the production of results.

6. Internet references of related works

Concerning the weight of languages on the Web in general, the only available reference is the Alis Technologies study already mentioned, «Web Languages Hit Parade»: http://babel.alis.com:8080/palmares.en.html

It is also important to mention a web page (in English) that publishes statistics on the distribution of Internet users according to their mother tongue, "Global statistics by language": http://www.euromktg.com/globstats/

For some linguistic areas, groups or individuals are gathering the existing information and/or commenting on it:

For the French-speaking area, the CIDIF has created, with the support of the Agence de la Francophonie, and maintains "L'état du développement et de l'utilisation de l'inforoute dans l'espace francophone":

http://www1.cidif.org/franco

For the Spanish-speaking world, two researchers are working on the theme of the Internet and Spanish:

Mr. José Millan, who has published various articles available at:

http://ourworld.compuserve.com/homepages/JAMillan/josemill.htm

The Cervantes Institute, which manages a "Spanish observatory of the language industries":

The institute is at: http://www.cervantes.es/

There is also a regularly updated inventory of statistical data on the Internet in Latin America and in the Caribbean: http://www.cr/latstat/ These figures are based on the usual reference source for this sort of statistics: Network Wizard (http://www.nw.com).

There remain the general references on Internet statistics, which for now do not include any special section on languages or cultures:

Matrix News, which carries out demographic studies of the Internet:

http://mids.org ⁵

Another «classic» is Georgia Tech University, which conducts very rigorous surveys of Web users:

http://www.gvu.gatech.edu/user_surveys/

4 John E. Freund "Mathematical Statistics". 2nd edition, 1972, Prentice Hall International. Chapter 9 "Estimation".

5 Which has translated the L1, C1 and L2 studies into English and published them in the Matrix News review.