Monday, April 12, 2010

A translated corpus of 30,000 French SMS

By Cedrick Fairon, Sebastien Paumier

This article presents a corpus of 30,000 French SMS that is unique in its quality, its size, and the fact that the SMS were translated into “standard” French. It also describes the collection process and the details of the translation process.

Sociologists and linguists have started to describe how language adapts to new written forms such as chat, forums and SMS, and how users play with it to “make sense” faster with fewer words and characters.

The shortage of reference corpora, especially for SMS, which are difficult to collect, has made it hard for researchers to study these new forms of written language. In the collections carried out so far, students gathered the messages by copying them manually from phone screens.

This approach has two important limitations:
i. the corpora come from a restricted group of SMS users
ii. manual copying introduces typing mistakes or voluntary corrections

“Give your SMS to Science”

An SMS collection was organized in the French-speaking part of Belgium:
- to facilitate data collection, a toll-free short code was set up
- a call for participation was broadcast
- participants were invited to send copies of their SMS and to fill in an online sociolinguistic form
- from Oct. 2004 to Dec. 2004, more than 75,000 SMS were sent by more than 3,200 people
- 2,500 people aged from 12 to 65 answered the form: 1,200 men and 1,500 women

Goal: to build a reference corpus as a solid base for linguistic studies

Preprocessing the corpus
73,127 raw SMS were received.

Two operations:
1. the first: reassemble messages of more than 160 characters that had been split into several SMS, and remove unusable SMS (non-French SMS, graphical SMS, duplicates, etc.)
2. the second: remove personal information
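
The notes do not say how these two operations were implemented, so the following is only a minimal Python sketch under stated assumptions. The `RawSms` record, the 60-second gap, the "previous segment is exactly 160 characters" heuristic, and the `[PHONE]` / `[EMAIL]` placeholders are all hypothetical names and choices made for illustration, not the authors' method. Filtering of non-French and graphical SMS is left out.

```python
import re
from dataclasses import dataclass

SMS_LIMIT = 160        # length of one full SMS segment
MAX_GAP_SECONDS = 60   # assumption: segments of one message arrive close together


@dataclass
class RawSms:
    sender: str      # hypothetical sender identifier
    timestamp: int   # hypothetical arrival time, in seconds
    text: str


def reassemble(messages):
    """Join consecutive segments from one sender when the previous segment is full."""
    merged = []
    last_len = 0  # length of the most recent segment added to merged[-1]
    for msg in sorted(messages, key=lambda m: (m.sender, m.timestamp)):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.sender == msg.sender
                and last_len == SMS_LIMIT
                and msg.timestamp - prev.timestamp <= MAX_GAP_SECONDS):
            # Continuation of a split message: append the text to the previous one.
            merged[-1] = RawSms(prev.sender, msg.timestamp, prev.text + msg.text)
        else:
            merged.append(RawSms(msg.sender, msg.timestamp, msg.text))
        last_len = len(msg.text)
    return merged


def drop_duplicates(messages):
    """Keep the first copy of each (sender, text) pair."""
    seen = set()
    kept = []
    for msg in messages:
        key = (msg.sender, msg.text)
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept


# Illustrative patterns only; real anonymisation needs far more than this.
PHONE_RE = re.compile(r"\+?\d[\d ./-]{7,}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymise(text):
    """Replace phone-like digit runs and e-mail addresses with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

A pipeline along these lines would call `reassemble`, then `drop_duplicates`, then apply `anonymise` to each remaining text. Personal names and addresses cannot be caught by simple patterns like these and would need manual review.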

Translating the corpus
Why translate? (Motivations)
- to “translate” or “transliterate” the corpus into “standardized” French, producing what can be called a bilingual corpus
- the corpus holds both each SMS and its translation in standardized French

1. readability: SMS are hard to read because of missing spaces, mixed upper-case and lower-case letters, non-standard abbreviations and text transformations, codes, usages and habits of SMS writers, and sequence errors
2. usability: a translation makes the messages easier to explore

Translation protocol

Translation rules
1. IdSMS – Index of the SMS in the database
2. User – number standing for a GSM number
3. Sex – to check gender agreements, in particular for past participles
4. Flag – message annotations
5. Message – original SMS (already anonymised)
6. Trans. – translation in “standard” French
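
The six fields listed above map naturally onto a small record type. The sketch below is only an illustration in Python: the field types, the `"F"`/`"M"` encoding for Sex, the empty Flag, and the sample message are assumptions invented for this example and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class TranslationRecord:
    id_sms: int    # IdSMS: index of the SMS in the database
    user: int      # User: number standing for a GSM number
    sex: str       # Sex: used to check gender agreements (assumed "F" or "M")
    flag: str      # Flag: message annotations (format assumed)
    message: str   # Message: original, already anonymised SMS
    trans: str     # Trans.: translation in "standard" French


# Invented example record, for illustration only.
example = TranslationRecord(
    id_sms=1,
    user=42,
    sex="F",
    flag="",
    message="slt cv?",
    trans="Salut, ça va ?",
)
```

Making the record immutable is one way to guard the original message against accidental edits while the translation is being prepared.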

Two general rules:
1. original SMS: must not be modified
2. protocol: must be strictly observed in both the “standard” French and the original messages

Specific sub-rules cover foreign words, punctuation marks, mathematical symbols, abbreviations, smileys, spaces and new lines, acronyms and sigla, letter repetitions, phonetic transformations, onomatopoeia and interjections, proper names, numbers, neologisms, obvious errors, unexpected or incomprehensible symbols, character case, typing errors, and missing words.

The corpus
In the end, 30,000 SMS were translated.

1. messages selected at random among those with a sociolinguistic profile – from 1,736 authors
2. 11% of the SMS have no associated profile, to avoid any bias – from 799 authors
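
As a rough illustration of this split, 11% of 30,000 is 3,300 messages without a profile, leaving 26,700 with one. The Python sketch below draws such a sample. The function name, its parameters, and the assumption that the messages without a profile were also drawn at random are hypothetical; the notes do not describe the actual selection procedure.

```python
import random


def select_for_translation(with_profile, without_profile,
                           total=30_000, no_profile_share=0.11, seed=0):
    """Draw a random sample with a fixed share of messages lacking a profile."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n_without = round(total * no_profile_share)
    n_with = total - n_without
    # random.Random.sample raises ValueError if a pool is smaller than requested.
    return rng.sample(with_profile, n_with) + rng.sample(without_profile, n_without)
```

Both pools are expected to be lists of message identifiers, each at least as large as the number of items drawn from it.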

Published on CD-ROM

The corpus is distributed as a database linked to a graphical interface for searching and sorting original and translated messages as well as author profiles.

Conclusion
This SMS corpus is unique in its size and accuracy, the number of contributors and the amount of metadata. It has also been translated manually, yielding a bilingual corpus that gives access to both standard French and the SMS variants.
The translation opens new perspectives for studies of SMS language and adds considerable value to the corpus.

http://www.sms4science.org/userfiles/A%20translated%20corpus.pdf

1 comment:

  1. this is interesting, but what did you take away from it in relation to Twitter research? Was it the processing of the data? How would you use this with our data?
