4.2.1 Graphic equivalences and distinctions
The authors decided to treat the terms linguistically before processing them statistically with search engines. A linguistic team therefore worked independently, elaborating series of six terms or "functions" (one in each chosen language). These terms are at the same time equivalent, as far as meaning, semantic impact, syntax and frequency of use are concerned, and distinct from one another; that is to say, they are not homographs (3) of one another, of other terms in the targeted languages, or of forms in other languages common on the Internet. In practice, the authors systematically avoided homographs with only one of the languages outside this research whose presence on the networks is probably considerable: German. Forms of fewer than four letters were rejected in order to avoid other possible homographs (especially, but not only, with acronyms), while the remaining cases were treated statistically whenever significant differences appeared. Homographs between at least two of the studied languages were very frequent, especially (but not only) between Spanish and Portuguese. Other problems included the Latin origin of some English words, loanwords, etc.

4.2.2 Words and variations
Each compared word or "function" includes or can include different types of variations :
4.2.3 Treatment of the relevant typological differences
The six studied languages present typological variations. This research focused on the morphosyntactic variations. Apart from the differences of variation in gender, number or case mentioned above, it is important to recall that in English (as synthetic a language as there can be) a single form can have two syntactic values (noun and verb). Such a form therefore corresponds to several morphological variations in the other languages and may be translated by an excessive number of forms in them, which makes the comparison impossible or unnecessarily costly. Function words such as prepositions or pronouns have very different functions in the compared languages, but were generally excluded by the rule that avoids terms of fewer than four letters. See Appendix 6 for all aspects of the criteria applied to the selection of the studied words.

4.2.4 Treatment of the remaining homographies
In spite of our efforts, some homographies remain. In order to avoid distortions in the results, we have to treat them as exceptions. The most frequent ones are the "-IDADES" plurals, common to Spanish and Portuguese and corresponding to the French "-ités" ("uniformités", "uniformidades"). The authors had to search for them in the plural, since the Italian form "-ità" serves for both the singular and the plural. When the count for an "-idades" form was low (below 50), the division of the count between Spanish and Portuguese was automatic and based on the results. In the other cases, we divided the count between Spanish and Portuguese on the basis of the proportions indicated by the AltaVista per-language search/count algorithm. There is one case of homography that comes from Rumanian (CAL and CAI for "horse": homographs of other words in Spanish, Italian and Portuguese). That is why the CAL and CAI forms have not been counted, and this penalizes Rumanian.
Moreover, the CAII form has also been eliminated because it is a homograph of frequent acronyms on the Internet. LUNG means "long" in Rumanian. The effect of its homography with English, which is really marginal, has not been corrected. FACA and FACAS mean, respectively, "knife" and "knives" in Portuguese, but they are also two conjugated forms of the verb "to do" (faça and faças, written without diacritics). In order not to penalize the indicated result, it was counted a posteriori, taking the Portuguese general mean as a reference. The form MALADIE ("illness" in French) exists in Rumanian with the same meaning but is rarely used. The effect is marginal. The form BOLI (a Rumanian case variation of the equivalent of the French MALADIE) is a very frequent abbreviation of bolígrafo ("pen") in Spanish and has been eliminated from the count. JOI ("Thursday" in Rumanian) is a three-letter word, so it is susceptible to homography with acronyms. JOIA is a homograph of the Portuguese jóia ("jewel") written without diacritics. Its score has been estimated by extrapolating from the mean. MARTI is a homograph of the name of a famous person (José Martí) written without diacritics, and its score has not been counted for the Rumanian Tuesday. The score of the French MARDI ("Tuesday") has been reduced by the result for MARDI GRAS, in order not to count this English expression.

4.2.5 Treatment of the other non-equivalent significations
The work of filtering, along with the insertion of semantically equivalent forms, has almost eliminated the risk of failing to detect non-equivalent significations (which we labeled "semantical collisions" in the first study). It remains that "knife" is sometimes used as a verb, and this word therefore favors English. The Portuguese names for the days of the week are of the type "quarta-feira", where the first term indicates the day number. The days are sometimes indicated by the first term alone (quarta for quarta-feira).
This short form has not been considered, in order to avoid confusion with the ordinal "fourth". This decision penalizes Portuguese for the five selected days (particularly on Usenet, where abbreviations are frequent).

Methodological note: the number of websites indexed by Hotbot seems to vary significantly from one month to another. As the terms were not all measured at the same time, the comparisons between them might be slightly uncertain. On the other hand, for our subject, the relative weight of the languages, the proportions remain much the same whatever the size of the Hotbot indexed sample.

4.2.6 Other linguistic elements taken into consideration during the study
One of the most frustrating elements of the study was the
failure to expand the sample using expressions rather than simple terms. The linguistic team produced a table of compound words and idiomatic expressions on the basis of terminological dictionaries. In this way a new sample of more than sixty terms was created (from the initial 400). However, the first measurements showed a very strong dispersion of the results, especially for the first sample. The results appear to be less coherent, probably because of chaotic phenomena. Detailed research is necessary and will be conducted as part of future updates.

2 The area of file names (FTP) does not present the required characteristics: file names may be correlated with language, but too occasionally to be significant. The Gopher area, historically closely linked to the university world, stopped growing a few years ago.

3 Unless otherwise stated, we are talking about cross-linguistic homographies: homographs within the same language will be considered, theoretically, as one and the same (graphic) word.
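
Illustrative sketch. The filtering rule of 4.2.1 (no forms of fewer than four letters, no homographs once diacritics are dropped) and two of the count adjustments of 4.2.4 (dividing the count of a shared form such as "-idades" between two languages, and reducing the MARDI score by the MARDI GRAS result) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the function names, the sample figures and the shares are hypothetical and are not taken from the study, and the counts per form are assumed to have been obtained beforehand from a search engine.

```python
import unicodedata


def strip_diacritics(word):
    """Upper-case a word and drop its combining accents (faça -> FACA)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).upper()


def is_usable(term, foreign_forms, min_len=4):
    """Keep a term only if it has at least min_len letters and is not a
    homograph of a known form from another language (hypothetical helper)."""
    key = strip_diacritics(term)
    return len(key) >= min_len and key not in foreign_forms


def split_shared_count(total, shares):
    """Divide the count of a form shared by several languages in proportion
    to per-language shares (assumed to come from a per-language count)."""
    whole = sum(shares.values())
    return {lang: total * share / whole for lang, share in shares.items()}


def remove_phrase(word_count, phrase_count):
    """Reduce a word count by the count of a phrase that contains it,
    for example MARDI reduced by MARDI GRAS."""
    return max(word_count - phrase_count, 0)
```

With invented figures, `split_shared_count(1000, {"es": 3, "pt": 1})` attributes 750 occurrences to Spanish and 250 to Portuguese, and `is_usable("joi", set())` rejects the three-letter form. The sketch does not reproduce the special treatment applied when the count is below 50, nor the extrapolation from the mean, since the text does not specify them in enough detail.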
Copyright © 1996-1999 AGENCE DE LA FRANCOPHONIE, UNION LATINE, FUNREDES. Created: 5 X 1998. Last modified: 02 VII 1999.