The Calimera Project is funded under the European Commission, IST Programme

 

 
Calimera Guidelines

 

 

Cultural Applications:

Local Institutions Mediating Electronic Resources

 

 

 

Multilingualism

SCOPE

 

Issues dealt with in this guideline include:

European languages

Social inclusion

Sign languages

Information retrieval

Multilingual thesauri

Multilingual websites

Scripts

Fonts and keyboards

Transliteration, transcription and authority files

Machine translation

Voice to voice translation

 

POLICY ISSUES

 

“Language is the foundation of communication between people and is also part of their cultural heritage. For many, language has far-reaching emotive and cultural associations and values rooted in their literary, historical, philosophical and educational heritage. For this reason the users’ language should not be an obstacle to accessing the multicultural heritage available in cyberspace. The harmonious development of the information society is therefore only possible if the availability of multilingual and multicultural information is encouraged.” [1]

 

Article 12 of the European Charter for Regional or Minority Languages [2] deals specifically with cultural activities and facilities – “especially libraries, video libraries, cultural centres, museums, archives, academies, theatres and cinemas, as well as literary work and film production, vernacular forms of cultural expression, festivals and the culture industries, including inter alia the use of new technologies”. The signatories to the Charter (i.e. the member states of the Council of Europe) “undertake to make appropriate provision… for regional or minority languages and the cultures they reflect”.


Cultural institutions should aim to reach as wide an audience as possible. Websites can reach a global audience, and there are estimated to be over 6,000 languages in the world. The EU is committed to integration among its member states but also promotes the linguistic and cultural diversity of its peoples by promoting the teaching and learning of languages, including minority and regional languages. The Action Plan on Language Learning and Linguistic Diversity for 2004 – 2006 [3] states that “language learning is for all citizens, throughout their lives. Being aware of other languages, hearing other languages, teaching and learning other languages: these things need to happen in every home and every street, every library and cultural centre, as well as in every education or training institution and every business”. 2001 was designated the European Year of Languages [4] and its activities continue annually through the celebration of the European Day of Languages on 26 September [5].

 

Museums, libraries and archives will need to consider providing services in:

·        the official EU languages;

·        minority indigenous languages;

·        the languages of immigrants;

·        non-European languages – to some extent this will depend on the nature of their collections and whether there is likely to be interest outside Europe;

·        sign languages.

 

European languages

There are 20 official languages in the EU – Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Slovak, Slovenian, Spanish and Swedish. The term “official language” is defined as a language that can be used in dealings with public authorities and in official documents, including commercial documents. A citizen may write to an EU institution in any of these languages and must receive a reply in the same language.

 

Information about the regional and/or minority languages of the European Union can be found on the website of Mercator [6], a research network and information service set up with the support of the European Commission. It is estimated that there are over 150 minority indigenous or autochthonous languages within the EU, not including dialects of any of the official languages, or any of the languages spoken by immigrant communities. The European Bureau for Lesser Used Languages (EBLUL) [7] estimates that over 40 million people in the EU speak a language which is not the official language of their country of origin.

 

Some minority languages are afforded some sort of recognition in Europe, or within the country where they are spoken, but not all. The official status of the minority languages of Europe can be found on the Mercator website [6]. There are three accepted categories of regional and minority languages:

·        languages specific to a region which may be wholly or partially in one or more member states. This would cover languages like Basque, Breton, Catalan, Frisian, Sardinian, Welsh and so on;

·        languages spoken by a minority in one state but which are official languages in another EU country. This definition covers, for example, German in southern Denmark, French in the Val d’Aoste in northern Italy, Hungarian in Slovakia, etc.;

·        non-territorial languages such as those of Roma or Jewish communities (Romany and Yiddish), or Armenian.

 

Some minority languages are fully developed languages of culture taught in schools, with established orthographies, extensive literatures and a considerable amount of publishing. Others may lack some or even all of these attributes, and it may be difficult to make provision for them. Indigenous linguistic minorities, however, tend not to present the same challenges as immigrants do. For example:

·        they are often fully bilingual and do not require instruction in the majority language or culture;

·        there is no doubt about their numbers or permanence or socio-economic circumstances.

 

There are also many non-indigenous languages spoken in Europe mainly by immigrants. These include:

·        Turkish (mainly in Belgium, Germany and the Netherlands);

·        Maghreb Arabic (mainly in France and Belgium);

·        Urdu, Bengali and Hindi (mainly in the United Kingdom);

·        Balkan languages (spoken in many parts of the EU by migrants and refugees who have left the region as a result of recent wars and unrest).

 

Social inclusion

No official EU protection is afforded to these languages, but heritage institutions will need to be socially inclusive (see the guidelines on Social inclusion and on Cultural identity and cohesion) and so will need to consider the language issue. Established ethnic minorities may well be bilingual or even, in the case of second and third generations, monolingual in the majority language. Recent immigrants present more of a challenge. Museums, libraries and archives must be aware of the languages used in their communities. In some large cities with a rapidly changing population this might involve regular monitoring of the linguistic profile. As well as providing a service in any relevant minority languages, they must also recognise their responsibility to document and preserve the cultural identity of all members of their communities, which could involve collecting materials and creating content in several languages. Services to immigrants could involve:

·        recruiting staff who speak the language(s), preferably as native speakers;

·        ensuring all leaflets, signs and publicity are available in all relevant languages;

·        providing reading materials and audio-visual materials in all relevant languages;

·        providing word processing facilities in all relevant languages;

·        providing a translation service;

·        designing websites in more than one language.

 

Sign languages

Sign languages must not be overlooked. There are many different sign languages and, although they are not recognised as official EU languages, the Council of Europe's Recommendation 1598 (2003), Protection of sign languages in the member states of the Council of Europe, encourages member states “to give the sign languages used in their territory formal recognition” [8]. Some countries, including Denmark, Finland, Portugal, Sweden and the UK, do afford official recognition to a sign language.

 

GOOD PRACTICE GUIDELINES

 

These guidelines focus on multilingualism in the digital arena. In practice, organisations have faced a number of difficulties in creating and maintaining multilingual digital content and pan-European products and services for the global networks. Some of these difficulties are technical and some relate to the costs and difficulties of translation. In recognition of this, the EC has created an action line to address multilingual issues under the strategically important e-Content programme [9].

 

 

 

Information retrieval

As more and more cultural resources are digitised, access is extended to a global audience. The challenge for museums, libraries and archives is to ensure access to these resources while at the same time respecting cultural and linguistic diversity. Various projects have been set up to work in this area, including MACS (Multilingual Access to Subjects) [10].

 

Multilingual thesauri

A thesaurus is a set of controlled terms for the detailed subject indexing of (originally) printed documents. A thesaurus will show relationships such as hierarchy and equivalence between the terms it uses. A major problem in the construction of thesauri in more than one language is that terms in one language may not cover the same semantic fields as terms in another. For example, the English term “teenager” covers a narrower semantic field than the French “adolescent”.
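
The hierarchy and equivalence relationships described above can be illustrated with a minimal Python sketch. The `Concept` class, its field names, the "exact"/"partial" labels and the broader term "young people"/"jeunes" are assumptions made for illustration only; they are not taken from any thesaurus standard or product.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus concept with a preferred term per language (hypothetical model)."""
    preferred: dict                                  # language code -> preferred term
    broader: list = field(default_factory=list)      # hierarchy: broader concepts
    # Degree of equivalence per language pair, e.g. ("en", "fr") -> "partial".
    # Pairs that are not listed are assumed to be exact equivalents.
    equivalence: dict = field(default_factory=dict)

    def translate(self, source, target):
        """Return the target-language term and the recorded degree of equivalence."""
        term = self.preferred[target]
        degree = self.equivalence.get((source, target), "exact")
        return term, degree


# Hypothetical broader concept, included only to show the hierarchy link.
young_people = Concept(preferred={"en": "young people", "fr": "jeunes"})

# Example from the text: "teenager" and "adolescent" do not cover the same
# semantic field, so the link is recorded as a partial equivalence.
teenager = Concept(
    preferred={"en": "teenager", "fr": "adolescent"},
    broader=[young_people],
    equivalence={("en", "fr"): "partial", ("fr", "en"): "partial"},
)

print(teenager.translate("en", "fr"))
```

Recording the degree of equivalence alongside the term lets a retrieval system warn the user, or widen the search, when a translated term is only an approximate match.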

 

There are standards for the compilation of thesauri and equivalent terms across languages (see ISO 5964:1985, “Guidelines for the establishment and development of multilingual thesauri” [11]). This standard is an adjunct to ISO 2788:1986 [12], which covers monolingual thesauri, and so is not complete in itself: many of the problems in the construction of thesauri are common to monolingual and multilingual thesauri. A revision of both standards is currently in progress. A new standard, BS 8723: Structured vocabularies for information retrieval - guide, is planned to cover both monolingual and multilingual thesauri. It will be in five parts, as follows:

Part 1: Definitions, symbols and abbreviations (draft published Nov. 2004);

Part 2: Thesauri (draft published Nov. 2004);

Part 3: Vocabularies other than thesauri;

Part 4: Interoperation between multiple vocabularies;

Part 5: Interoperation between vocabularies and other components of information storage and retrieval systems [13].

The Getty Information Institute has produced Guidelines for Forming Language Equivalents: A Model based on the Art and Architecture Thesaurus [14] (http://www.chin.gc.ca/Resources/Publications/Guidelines/English/). The chapter on multilingual thesaurus construction in Jean Aitchison, Alan Gilchrist [and] David Bawden: Thesaurus construction and use: a practical manual. 4th ed. ASLIB: London, 2000. ISBN 0-85142-446-5 [15] is also very useful.

 

Multilingual websites

“A quality website must be aware of the importance of multilinguality by providing a minimum level of access in more than one language.” [16] The structure of a multilingual or bilingual website should be carefully considered from the outset so that multilingualism is an essential part of it and not just an afterthought. The MINERVA Project has suggested some criteria to define a multilingual website [17], the degree of multilinguality being dependent on the number of these which are met. They are:

·        some content should be available in more than one language;

·        some content should be available in sign language;

·        some content should be available in non-EU immigrant languages;

·        site identity and profile should be available in more than one language;

·        core functionality of the site (searching, navigation) should be available in more than one language;

·        static content (images, descriptions etc.) should be available in more than one language;

·        switching between languages should be easy;

·        site structure and user interface language should be logically separate so that layout does not vary with the language;

·        multilinguality should be driven by a formal multilinguality policy;

·        the website should be reviewed against this policy.

 

In some cases a bilingual, as opposed to a multilingual, website will be appropriate. Bilingual websites may be used:

·        in countries or regions where there is one main minority language, e.g. Wales;

·        to address a readership which can be expected to consist of bilingual individuals;

·        to address individuals who may speak one or other of two languages;

·        to make a social or political point by reminding members of the majority community of the existence of a minority.

Multilingual websites will be needed:

·        in countries where there are a number of minority indigenous languages;

·        to address ethnic minorities, including immigrants and asylum seekers,  in their own languages;

·        if the content of the website is likely to be of interest to a pan-European or global audience.

 

There are various policy decisions to be made which have far-reaching effects on the appearance of the website:

·         frames may be difficult in a multilingual context;

·         multilingual pages are likely to have a lot of text on them and may have a formidable appearance;

·         some fonts are more appropriate for one language than another, and it is preferable to use the same font throughout rather than to appear to make one language more legible than others;

·         the language of logos must be sensitively chosen. The use of a majority language in a logo can alienate minority language users;

·         there may be a role for touchscreen technology in the design of multilingual websites.

Remember also that a multilingual website is not a cheap option. Like other websites, it will require updating, and this will be a more complex matter than updating a monolingual site.

 

There are a number of ways in which a multilingual website may be arranged:

·        users may be offered a once and for all choice of which language they wish to use on the first page, and if they want to change, may be forced to return to that page. This may be appropriate in certain settings e.g. in a country in which two languages are used but by no means everyone is bilingual, for example Belgium and Switzerland;

·        they may be offered a choice of language on each page of the site. This may be by means of a button or filing tab, conventions familiar to most Internet users. Language links should be at the top of the page rather than at the bottom, as that is the part of the page displayed by default, and the link should take the user to exactly the same page but in the other language - not to another part of the site. The language should be given its native name e.g. French should be called Français;

·        all pages may offer the same text in all languages. Be aware that the same text in different languages may take up different amounts of space; typically an original text will be shorter than a translation;

·        sites may be asymmetrical, for example some information may be relevant to speakers of only one language e.g. a social club for Welsh people may have its membership form in Welsh only but in other respects may be bilingual.
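
The advice above on per-page language links (same page, other language, native language name) can be sketched in Python as follows. The URL scheme, in which the first path segment is a language code, and the `NATIVE_NAMES` table are assumptions for illustration; a real site may organise its language versions differently.

```python
# Hypothetical table of language codes and the languages' native names.
NATIVE_NAMES = {
    "en": "English",
    "fr": "Français",
    "cy": "Cymraeg",
    "de": "Deutsch",
}


def language_links(path, languages=NATIVE_NAMES):
    """Return (native name, URL path) pairs for the other language versions.

    Assumes paths of the form /<language code>/<rest of path>, so that each
    link leads to exactly the same page in another language.
    """
    parts = path.strip("/").split("/")
    current, rest = parts[0], parts[1:]
    if current not in languages:
        raise ValueError(f"unknown language code: {current!r}")
    links = []
    for code, name in languages.items():
        if code == current:
            continue  # no link needed to the language already displayed
        links.append((name, "/" + "/".join([code] + rest)))
    return links


for name, url in language_links("/en/collections/maps"):
    print(name, url)
```

Because the rest of the path is preserved, a reader who switches language stays on the equivalent page rather than being sent back to the home page.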

 

The choice of arrangement may be affected by:

·        the type of audience being addressed - individuals speaking more than one language or individuals speaking only one language. Bilingual individuals may want to be able to see two languages much of the time as a means of double-checking that they understand the text correctly;

·        on a bilingual site, how different from each other the two languages are. Some languages are mutually comprehensible to a degree, e.g. Spanish and Catalan, whereas some are not, e.g. English and Welsh. Also, some concepts do not appear at all in some languages.

 

The website of the Welsh Language Board [18] contains advice from the School of Education, University of Wales, Bangor and Escola Superior Politecnica, Universitat Pompeu Fabra, Barcelona on the design of bilingual websites. This includes recommendations on the best ways of incorporating bilingualism into the design of a website without giving undue prominence to one language over another. It also advises against giving offence by using emotive or politically charged symbols such as flags to represent languages, since many languages are spoken in more than one country and many countries are bilingual.

 

Scripts

Computers store letters and other characters by assigning a number to each one. The enormous diversity of languages and scripts led to hundreds of different encoding systems for assigning these numbers. Then in the mid-1980s Unicode [19] began to be developed. It assigns a unique binary code number to every character in every language, no matter what the platform, program or language. The Unicode Consortium is a non-profit making organisation founded to develop, extend and promote the use of the standard. Unicode is continually being expanded, nowadays even to include archaic scripts such as Ogham (an ancient Celtic script) and cuneiform, and can cope with numbers, symbols, punctuation, Braille patterns etc.

Although Unicode Standard Version 4.0 [20] and ISO/IEC 10646:2003 [21] are not the same thing, the sets of characters, names, and coded representations they contain are identical. Unicode Version 4.0 covers over 96,000 characters from the world's scripts. Although by no means the only standard in the field, it is favoured by the IT industry, as the adoption of one method has obvious advantages for worldwide communication, software availability, data interchange and publishing. ISO/IEC 10646:2003 has been widely adopted in new Internet and W3C protocols and mark up languages such as XML and HTML, and implemented in modern operating systems and computer programming languages.
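
As a small illustration of the idea that every character has its own number and name, the Python sketch below looks up a few characters in the Unicode character database that ships with Python. The helper name `describe` is an assumption, and the UTF-8 byte count is included only as an example of one common way the code number is stored; UTF-8 is not discussed in this guideline.

```python
import unicodedata


def describe(char):
    """Return the code point, official character name and UTF-8 length of one character."""
    code_point = f"U+{ord(char):04X}"          # the number assigned to the character
    name = unicodedata.name(char)              # the name listed in the character database
    utf8_length = len(char.encode("utf-8"))    # bytes needed to store it in UTF-8
    return code_point, name, utf8_length


# An extended Roman letter, a Cyrillic letter, an Ogham letter and a Braille pattern.
for ch in ["ð", "\u0427", "\u1681", "\u2801"]:
    print(describe(ch))
```

The same lookup works regardless of platform or program, which is the practical benefit of a single character set.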

 

Fonts and keyboards

Small key caps can be bought to cover the keys of a normal keyboard as an aid to typing languages that use extended versions of the Roman script, e.g. ð å þ ñ ç æ ć ł etc. This simple method can even enable the Kanji script of Japanese to be word-processed.


Soft keyboards, or keyboards displayed on a touch screen, may be a flexible way of dealing with some of the problems of non-Roman or exotic scripts.

Languages with thousands of characters, like Chinese, require special software before they can benefit from electronic word-processing. For Chinese, a normal keyboard is used to enter a phonetic spelling of a Chinese word according to the Pinyin system of transliteration, and the software displays those characters which are pronounced in that way – there may be as many as ten or so. The correct character or characters are chosen and entered in the document. The wrong choice would be the Chinese equivalent of a spelling mistake. This system is very adaptable, enabling both the traditional and the simplified Chinese characters to be word-processed. The use of the Pinyin system does, however, mean that the operator needs to know the Mandarin or Pekingese pronunciation of Chinese, which it is not necessary to know in order to write Chinese by hand. It is, however, possible to buy software based on the Cantonese pronunciation [22]. The software takes up more of the PC’s memory than word-processing software for a language written in the Roman alphabet, but in cities where there are considerable numbers of Chinese people it could be justifiable to buy this software and make it available on a dedicated machine. Arabic scripts present less of a problem as specially adapted keyboards are available.
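
The Pinyin entry method described above can be sketched in a few lines of Python. The tiny `CANDIDATES` table and the function names are assumptions for illustration; real input software uses very large dictionaries and ranks candidates by frequency and context.

```python
# Hypothetical mini-dictionary: Pinyin syllable (without tone marks) -> characters
# pronounced that way. Real software lists many more candidates.
CANDIDATES = {
    "ma": ["妈", "麻", "马", "骂", "吗"],
    "hao": ["好", "号", "毫"],
}


def candidates(syllable):
    """Return the characters offered for a typed Pinyin syllable."""
    return CANDIDATES.get(syllable.lower(), [])


def choose(syllable, index):
    """Return the character the operator picks from the candidate list."""
    options = candidates(syllable)
    if not 0 <= index < len(options):
        raise IndexError(f"no candidate {index} for {syllable!r}")
    return options[index]


print(candidates("ma"))   # the list shown to the operator
print(choose("ma", 2))    # the operator selects the third candidate
```

Picking the wrong index here corresponds to the "spelling mistake" mentioned in the text: the pronunciation is right but the character is wrong.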

 

It is worth considering commercial fonts, software, and keyboards for multilingual computing such as those sold by Fingertip Software [23] which are based on Unicode.

 

Transliteration, transcription and authority files

In many cases, e.g. for the production of catalogues, indexes, toponymic lists and other works of a bibliographic nature which are meant to be used by people who can only be expected to be familiar with the Latin alphabet, or for typographical reasons, it will not be possible or practical to use the characters of a non-Latin script. In that case, transcription or transliteration will be necessary.

Transliteration is the process by which the letters of an alphabetic writing system are converted into the symbols of another alphabetic system e.g. Cyrillic or Greek into the Latin alphabet. There are problems caused by alternative systems of transliteration e.g. Чехов can be transliterated Tchehov or Chekhov.
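
The effect of alternative transliteration systems can be shown with a short Python sketch. The two mapping tables below are made up for illustration, cover only the five letters of the example name, and do not reproduce any published transliteration standard.

```python
# The example name from the text, Чехов, written with explicit code points
# so that Cyrillic letters cannot be confused with look-alike Latin ones.
CHEKHOV = "\u0427\u0435\u0445\u043e\u0432"

# Two hypothetical letter-by-letter tables (Cyrillic letter -> Latin letters).
SYSTEM_A = {"\u0427": "Ch", "\u0435": "e", "\u0445": "kh", "\u043e": "o", "\u0432": "v"}
SYSTEM_B = {"\u0427": "Tch", "\u0435": "e", "\u0445": "h", "\u043e": "o", "\u0432": "v"}


def transliterate(text, table):
    """Replace each letter using the table; letters not in the table pass through."""
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate(CHEKHOV, SYSTEM_A))
print(transliterate(CHEKHOV, SYSTEM_B))
```

The same source name yields two different Latin spellings, so a catalogue search for one form will miss records entered under the other unless the forms are linked.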


Transcription is the process by which the sounds of a language are converted into the symbols of another language. Transcription may in principle be used for the sounds of any language, but it is the only system which can be used to convert the sounds of languages with non-alphabetic scripts, such as Chinese, into the symbols of the Latin or some other alphabetic system.


Clearly, transliteration and transcription give rise to problems of standardisation. Different systems or variations in practice cause difficulties in searching databases. At the moment there is no standardised name record format relevant to the needs of European cultural institutions, but a prototype has been developed by the LEAF project (Linking and Exploring Authority Files) [24], funded by the EC from March 2001 to 2004. The project results will be implemented by extending MALVINE, an online search service for post-medieval manuscripts, into a global multilingual information service about persons and corporate bodies [25].
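
The role of an authority file in overcoming variant name forms can be sketched as follows. The record structure, the identifier "A0001" and the function names are assumptions for illustration; they do not represent the LEAF name record format.

```python
# Hypothetical authority records: one identifier, one preferred form, many variants.
RECORDS = {
    "A0001": {
        "preferred": "Chekhov, Anton Pavlovich",
        "variants": [
            "Tchehov, Anton",
            "Čechov, Anton Pavlovič",
            "Чехов, Антон Павлович",
        ],
    },
}


def build_index(records):
    """Map every known name form, case-folded, to its authority record identifier."""
    index = {}
    for record_id, record in records.items():
        for form in [record["preferred"], *record["variants"]]:
            index[form.casefold()] = record_id
    return index


def find(name, index):
    """Return the record identifier for any known form of the name, or None."""
    return index.get(name.strip().casefold())


INDEX = build_index(RECORDS)
print(find("tchehov, anton", INDEX))
print(find("Чехов, Антон Павлович", INDEX))
```

A search under any recorded form leads to the same record, so users need not know which transliteration system a particular catalogue adopted.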

 

International standards are being developed for the transliteration of a variety of languages. For example there is a standard for the transliteration of Indic scripts ISO 15919:2001, Transliteration of Devanagari and related Indic scripts into Latin characters [26].

 

Machine Translation (MT)

At one time great hopes were entertained of MT but in view of the effort expended on it over the last fifty years the results may be seen as disappointing. The kind of problems which are encountered and which have so far proved impossible to solve are, for example:

·        ambiguities in the meanings of words;

·        differences in word order;

·        the lack, as yet, of any way to give computers knowledge of the real world, context or readership.

 

The effectiveness of MT systems is dependent on a number of factors e.g. documents must be free of any typographic or grammatical errors, words not in the dictionary of the system, or complex sentence structures.

 

MT is the application of computers to the task of translating texts from one natural language to another and nowadays includes software ranging from simple dictionary lookup programs used as word-processor add-ons to sophisticated batch-translation systems. Viable applications for MT include:

·        content scanning, that is using a translation system simply to obtain a rough draft so as to be able to get the general gist of a text;

·        screening large numbers of documents to identify those warranting human translation;

·        assisting human translators - computer-aided translation (CAT) software uses a variety of linguistic tools to improve the productivity of translators, particularly when translating highly repetitive texts such as technical documentation.

 

There are a number of websites offering both free and charged translation services on the World Wide Web, for example AltaVista Babelfish, Google Language Tools, World Lingo, Free Translation, and Systran. If a URL (Uniform Resource Locator) is entered, MT software can translate a webpage, and documents can be translated automatically. These sites often also offer translation by human beings. The Yahoo Language Translation and Interpretation Resources page is a useful source of information about MT sites [27]. A gateway to a number of web-based translation services, including Internet search engines, is Babblefish [28].

 

For more information about MT see the website of the European Association for Machine Translation [29].

 

FUTURE AGENDA

 

Work is already underway to establish a multilingual portal to the cultural heritage of Europe. MICHAEL, the Multilingual Inventory of Cultural Heritage in Europe [30], is a spin-off from the MINERVA project [31]. It will develop a trans-European inventory of the digital cultural heritage of Italy, France and the UK which will be made available to the public, utilising an open source platform which will allow extension to other countries.

 

The European Library [32], developed by the TEL project, will be launched in 2005 as a portal offering access to the combined resources of the 43 national libraries of Europe. It will use MACS to allow cross-language retrieval. This type of service provides a platform for research into multilingual access issues.

 

It would be useful to have more central resources of materials in minority languages along the lines of Denmark’s Indvandrerbiblioteket, a national resource centre for books and other media in foreign languages for ethnic minorities in Denmark. A central resource is especially useful where the minority language speakers are well dispersed throughout society and not concentrated in particular places, making local provision uneconomic.

 

The EC Joint Interpreting and Conference Service (SCIC - from the French acronym) [33] has as one of its objectives to exploit the possibilities offered by new technologies. It has set up a unit consisting of members of staff who test new communication tools and search for ways of bringing multilingualism to channels of communication such as multilingual chats on the Internet, multilingual communication in the media, and multilingual virtual conferences.

 

The Cross-Language Evaluation Forum (CLEF) and CLEF 2004 [34] have carried out considerable research into multilingual information retrieval. It is to be hoped that such work will form the basis for future developments.

 

Although machine translation is imperfect in many ways, it would be useful to have some form of it for minority languages, not just for the major languages of Europe.

 

Voice-to-voice translation

Voice-to-voice translation, that is a machine which translates spoken language from one language to another, is still science fiction but might be developed in the comparatively distant future. Such a device would involve the perfection of a number of complex technologies, each of which at present has many shortcomings, including:

· &n