THE PLACE OF LATIN LANGUAGES AND CULTURES ON THE INTERNET

4. Details of the results

4.1 Internet methodology

The search engines freely available on the Internet (AltaVista, Hotbot, Excite, DejaNews, Veronica, FtpSearch…) are very powerful tools, as they index a large part of the information available in the different Internet domains (web pages, newsgroup postings, Gopher menus or documents, files accessible by FTP). They were designed to search for words or expressions in the areas they cover. In addition, some of them report quantitative data on the number of occurrences of the searched terms. It is this side effect that the authors have used to measure the respective place of the Latin languages and cultures within the different cultural categories. They decided to concentrate on the Web and Usenet, as these best represent the evolution of the Internet² and have highly efficient search tools.

Further information on the Internet methodology can be found in the previous research. More precisely, it was briefly presented in the first edition, L1, while more details (comparison of the different search engines, etc.) can be found in the L3 version of last March.

4.2 Linguistic methodology

The results of the following methodology are explained in Appendix 5 (List of the reference terms of the sample).

4.2.1 Graphic equivalences and distinctions

The authors decided to process the terms linguistically before processing them computationally and statistically with search engines. A linguistic team therefore worked independently, building series of six terms or "functions" (one in each chosen language). These terms are at once equivalent (in meaning, semantic impact, syntax, and frequency of use) and distinct from one another; that is, they are not homographs³ of each other, of other terms in the targeted languages, or of other forms in languages common on the Internet.

In practice, the authors systematically tried to avoid homographs with only one language that was not part of this research and whose presence on the networks is probably considerable: German. Forms of fewer than four letters were rejected in order to avoid other possible homographs (especially, but not only, with acronyms), while the remaining cases were handled statistically whenever significant differences appeared.

Homographs between at least two of the studied languages were very frequent, especially (but not only) between Spanish and Portuguese. Other problems included the Latin origin of some English words, loanwords, etc.
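The two filters described above (rejecting forms of fewer than four letters and detecting forms shared by several languages) can be sketched as follows. This is only a minimal illustration, not the authors' actual procedure: the function names, the language codes and the tiny sample are assumptions introduced for the example.

```python
from collections import defaultdict

MIN_LENGTH = 4  # the text rejects forms of fewer than four letters


def find_homographs(forms_by_language):
    """Return the forms written identically in at least two languages,
    mapped to the set of languages in which they appear."""
    languages_by_form = defaultdict(set)
    for language, forms in forms_by_language.items():
        for form in forms:
            languages_by_form[form.lower()].add(language)
    return {form: langs for form, langs in languages_by_form.items() if len(langs) > 1}


def usable_forms(forms_by_language):
    """Keep only forms that are long enough and not shared between languages."""
    homographs = find_homographs(forms_by_language)
    return {
        language: [
            form for form in forms
            if len(form) >= MIN_LENGTH and form.lower() not in homographs
        ]
        for language, forms in forms_by_language.items()
    }


# Illustrative sample only (hypothetical selection of forms).
sample = {
    "es": ["uniformidades", "igualdad"],
    "pt": ["uniformidades", "igualdade"],
    "ro": ["joi", "lung"],
}
shared = find_homographs(sample)
kept = usable_forms(sample)
```

In the study itself, shared forms were not always discarded; some were kept and their counts divided or corrected, as described in section 4.2.4.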

4.2.2 Words and variations

Each compared word or "function" includes or can include different types of variations:

Variations without diacritics and other "incorrect" elements. In the five Latin languages that include diacritical marks (accents, cedillas, or other marks), two variations have always been included, one with all these elements and one without. The second version is "incorrect" indeed, but also very frequent on the Internet. Moreover, we have taken into consideration forms not accepted by or absent from some dictionaries, again "incorrect" but significantly present on the Web. The statistical analysis was responsible for confirming the plausibility of this presence.
In the case of the pluricentric languages, that is to say languages that have more than one normative centre (for example Spanish from Spain and from different Latin American countries, Portuguese from Portugal and from Brazil), synonymic, lexical and orthographical variations have been considered when necessary.
In at least one case, two words of comparable root do not match up with the same meaning in different languages, although the combination of both is equivalent. These two forms have been included, as appropriate quasi-synonymic variations of the same word: parity / equality (en), paridad / igualdad (sp), parité / égalité (fr), parità / uguaglianza / eguaglianza (it)...
In order to increase the number of searched forms, the authors have sometimes included morphological variations of number (singular or plural). In addition, various Rumanian nouns have forced us to include some morphological variations of number, gender and case (and also the determinate / non-determinate distinction) in each language that has these variations.
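The first type of variation listed above (one form with diacritics, one without) can be generated mechanically. The following is a minimal sketch, assuming Unicode decomposition is an acceptable way to remove accents and cedillas; the function names are hypothetical and this is not presented as the tool the authors used.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks (accents, cedillas, etc.) from a word.

    Works for letters that decompose into base letter + mark, which covers
    the accented letters of the Latin languages discussed here.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_variants(word):
    """Return the form with diacritics and, if different, the form without."""
    plain = strip_diacritics(word)
    return [word] if plain == word else [word, plain]


variants_fr = search_variants("égalité")   # form with and without accents
variants_es = search_variants("igualdad")  # no diacritics: a single form
```

Each returned variant would then be submitted to the search engine as a separate query, and the counts added to the same "function".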

4.2.3 Treatment of the relevant typological differences

The six studied languages present typological differences. This research focused on the morphosyntactic ones. Apart from the differences in gender, number or case variation mentioned above, it is important to recall that in English (a language as synthetic as a language can be) a single form can have two syntactic values (noun and verb). Such a form therefore corresponds to different morphological variations in the other languages and can be translated by an excessive number of forms in them, which makes the comparison impossible or unnecessarily costly.

Function words such as prepositions or pronouns have very different functions in the compared languages, but were generally excluded by the rule against terms of fewer than four letters.

See Appendix 6 for all the aspects concerning the criteria applied to the selection of studied words.

4.2.4 Treatment of the remaining homographies

In spite of our efforts, some homographs remain. In order to avoid distortions in the results, we had to treat them as exceptions.

The most frequent ones are the "-IDADES" plurals, common to Spanish and Portuguese, and corresponding to the French "-ités" ("uniformités", "uniformidades"). The authors had to search for them in the plural, since the Italian form "-ità" corresponds both to the singular and to the plural. When the count for an "-idades" form was low (below 50), it was divided between Spanish and Portuguese automatically, based on the results. In the other cases, we divided the count between Spanish and Portuguese on the basis of the proportions indicated by the results of AltaVista's per-language search/counting algorithm.
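The proportional division used for the higher counts can be illustrated as follows. This is a sketch under stated assumptions: the function name and the figures are hypothetical, the two weights stand for whatever per-language proportions the search engine reported, and the rule applied to counts below 50 is not modelled because the text does not specify it in enough detail.

```python
def split_shared_count(total, weight_es, weight_pt):
    """Divide the count of a form shared by Spanish and Portuguese
    in proportion to two non-negative weights."""
    if weight_es < 0 or weight_pt < 0 or weight_es + weight_pt == 0:
        raise ValueError("weights must be non-negative and not both zero")
    share_es = weight_es / (weight_es + weight_pt)
    count_es = round(total * share_es)
    # Assign the remainder to Portuguese so the two parts sum to the total.
    return count_es, total - count_es


# Hypothetical figures: 1200 pages contain the shared form; the per-language
# searches report 300 Spanish pages and 100 Portuguese pages.
spanish_part, portuguese_part = split_shared_count(1200, 300, 100)
```

With these invented figures, three quarters of the shared count go to Spanish and one quarter to Portuguese.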

There is one case of homography that comes from Rumanian (CAL and CAI for "horse": homographs of other words in Spanish, Italian, Portuguese…). The CAL and CAI forms have therefore not been counted, which penalizes Rumanian. Moreover, the CAII form has also been eliminated because it is a homograph of acronyms that are frequent on the Internet.

LUNG means «long» in Rumanian. The effect, really marginal in English, has not been corrected.

FACA and FACAS mean, respectively, "knife" and "knives" in Portuguese, but they are also two conjugated forms of the verb "to do" (faça and faças, in their variants without diacritics). In order not to penalise the indicated result, it was counted a posteriori, taking the general Portuguese mean as a reference.

The form MALADIE ("ILLNESS" in French) exists in Rumanian with the same meaning but is rarely used. The effect is marginal. The form BOLI (a Rumanian case variation corresponding to the French word MALADIE) is a very frequent abbreviation of bolígrafo ("pen") in Spanish and has been excluded from the count.

JOI ("Thursday" in Rumanian) is a three-letter word, so it is susceptible to homography with acronyms. JOIA is a homograph of the Portuguese jóia ("jewel") written without diacritics. The score has been calculated by extrapolating from the mean.

MARTI is a homograph of the name of a famous person (José Martí) written without diacritics, and its score has not been counted for the Rumanian "Tuesday".

The score for the French MARDI ("Tuesday") has been reduced by the result for MARDI GRAS, in order not to count this English form.
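Two of the corrections described in this section, excluding a form altogether and subtracting the count of an expression that contains it, amount to simple arithmetic on the raw counts. The sketch below is illustrative only: the function name and all figures are invented, and only the list of excluded forms is taken from the text.

```python
# Forms the text reports as not counted because of homography.
EXCLUDED_FORMS = {"cal", "cai", "caii", "boli", "marti"}


def adjust_counts(raw_counts, phrase_counts):
    """Drop excluded forms and subtract, for each remaining form, the count
    of an expression containing it that belongs to another language."""
    adjusted = {}
    for form, count in raw_counts.items():
        if form in EXCLUDED_FORMS:
            continue
        adjusted[form] = count - phrase_counts.get(form, 0)
    return adjusted


# Hypothetical raw counts returned by a search engine.
raw = {"mardi": 1000, "lundi": 800, "cal": 5000}
# Hypothetical count for the expression "mardi gras", to be removed from "mardi".
phrases = {"mardi": 250}
adjusted = adjust_counts(raw, phrases)
```

The scores obtained by extrapolating from a language mean (FACA, JOIA) are not modelled here, since the text does not give the exact calculation.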

4.2.5 Treatment of the other non-equivalent significations

The filtering work, along with the insertion of semantically equivalent forms, has almost eliminated the risk of undetected non-equivalent meanings (which we labeled "semantical collisions" in the first study).

It remains that "knife" is sometimes used as a verb, and this word therefore favors English.

The Portuguese names of the days of the week follow the pattern of "quarta-feira", where the first term indicates the day number. The days are sometimes indicated by the first term alone (quarta for quarta-feira). This simple form has not been considered, in order to avoid confusion with the ordinal "fourth". This decision penalizes Portuguese for the five selected days (particularly on Usenet, where abbreviations are frequent).

Methodological note: the number of websites indexed by Hotbot seems to vary significantly from one month to another. As the terms were not all measured at the same time, comparisons between them may be slightly uncertain. On the other hand, for our subject, the relative weight of the languages, the proportions are much the same whatever the size of the sample indexed by Hotbot.
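The reasoning behind this note is that relative shares do not change when every count is multiplied by the same factor. The following sketch shows this with invented figures; the function name, the language codes and the counts are assumptions, and real index growth would not be perfectly uniform across languages.

```python
def language_shares(counts):
    """Convert absolute counts per language into relative shares."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one count must be non-zero")
    return {language: count / total for language, count in counts.items()}


# Hypothetical counts for one term measured on a smaller index...
month_a = {"en": 800, "fr": 120, "es": 80}
# ...and on an index that has grown uniformly by 50 percent.
month_b = {language: count * 3 // 2 for language, count in month_a.items()}

shares_a = language_shares(month_a)
shares_b = language_shares(month_b)
```

Under this idealised assumption the two sets of shares coincide, which is why the absolute size of the indexed sample matters less than it would for a comparison of raw counts.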

4.2.6 Other linguistic elements taken into consideration during the study

One of the most frustrating elements of the study was the failure to expand the sample using expressions rather than simple terms. The linguistic team produced a table of compound words and idiomatic expressions on the basis of terminological dictionaries. In this way a new sample of more than sixty terms was created (from the initial 400). However, first measurements have shown a very strong dispersion of the results, especially for the first sample. The results appear to be less coherent, probably because of chaotic phenomena. More detailed research is necessary and will be conducted as part of future updates.

2 The area of file names (FTP) does not present the required characteristics: file names may be correlated with language, but too occasionally to be significant. The Gopher area, historically closely linked to the university world, stopped growing a few years ago.

3 Unless otherwise stated, we are talking about cross-linguistic homographs: homographs within the same language will be considered, theoretically, as one and the same (graphic) word.