Invited speakers / Plénières

The DiDi Project: Collecting, Annotating, and Analysing South Tyrolean Data of Computer-mediated Communication.

Egon W. Stemle, Institute for Specialised Communication and Multilingualism at the European Academy of Bozen/Bolzano (EURAC).

Slides from talk available at this link

Taking a sociolinguistic, user-based perspective on language data, the DiDi project investigated the linguistic strategies employed by South Tyrolean users on Facebook. South Tyrol is a multilingual region (Italian, German, and Ladin are official languages) where the South Tyrolean dialect of German is frequently used in different communicative contexts. Thus, regional and social codes are often also used in written communication and in computer-mediated communication. With a research focus on users with L1 German living in South Tyrol, the main research question was whether people of different ages use language in a similar way or in an age-specific manner. The project ran for two years (June 2013 to May 2015).

We created a corpus of Facebook communication that can be linked to other user-based data such as age, web experience, and communication habits. We gathered socio-demographic information through an online questionnaire and collected language data from the entire range of social interactions, i.e. publicly accessible data as well as non-public conversations (status updates and comments, private messages, and chat conversations) written and published just for friends or a limited audience. About 150 users took part in the data acquisition: they interacted with the app, granted access to their language data, and answered the questionnaire.

In this talk, I will present the project, its data acquisition app and text annotation processes (automatic, semi-automatic, and manual), discuss their strengths and limitations, and present results from our data analyses.

Egon Stemle is a Cognitive Scientist: he studies skills like perception, thinking, learning, motor function, and language by combining the humanistic and analytical methods of the arts and the formal sciences.

His research focus lies in the area where Computational Linguistics and Artificial Intelligence converge. He works on the computer-aided fabrication of ontologies from large document repositories, its technological feasibility, and the use of cross-linked structured data in applications, as well as on tools for editing, processing, and annotating linguistic data. He is driven by the question of why humans handle incomplete and, more often than not, inconsistent structured concepts just fine, whereas computational processes are often of little avail or fail completely.

Wikipedia as a corpus resource for linguistic research

Angelika Storrer, University of Mannheim, Germany

Wikipedia is already known to be a valuable resource for many research fields. Until now, most linguistic studies have focused on article pages and on the content encoded in the written language. In my talk, I will demonstrate with examples how linguistics can profit from three additional perspectives on Wikipedia as a corpus resource, namely:

(1) Wikipedia as a social media corpus: in this perspective not only are the article pages relevant, but also the interaction between Wikipedia authors on talk pages and other communication channels.

(2) Wikipedia as a multimodal corpus: in this perspective not only is written text the object of analysis, but also the integration of media objects in the article pages.

(3) Wikipedia as a multilingual corpus: Wikipedia articles of different language versions are interconnected through interlanguage links opening up innovative options for contrastive and cross-lingual research.

I will report on studies that used Wikipedia article and talk pages to test hypotheses about language style and register variation. On this basis I want to discuss (a) which linguistic and interactional features are most relevant for investigating Wikipedia as a social media corpus, and (b) how these features may be annotated in accordance with the TEI Special Interest Group on Computer-Mediated Communication.

Angelika Storrer is head of the Department of German Linguistics at the University of Mannheim. Her current research interests cover corpus-based research on language use in social media, multimodal hypertext analysis, and e-lexicography. She is active in various interdisciplinary research projects and research networks on digital humanities and corpus linguistics, where she cooperates with partners from Information Sciences and Computational Linguistics. She is a member of the board of directors of the German Society of Computational Linguistics (GSCL), and a member of the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW).

Annotation des corpus plurilingues — l’expérience CLAPOTY

Pascal Vaillant, Université Paris 13, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé

NB. Keynote presented in French with slides in English.

Les méthodes de traitement automatique de corpus se sont jusqu’à présent plus intéressées aux corpus multilingues (textes de différentes langues portant sur un même thème) qu’aux corpus plurilingues (corpus présentant une pluralité linguistique interne). Ceci est dû au fait qu’elles ont surtout émergé dans le domaine du traitement automatique des langues, dans des applications pratiques portant sur des textes écrits, et non dans le domaine de la linguistique, qui s’intéresse aux manifestations spontanées et non-normées, où des phénomènes d’utilisation combinée de plusieurs langues sont fréquents.

L’observation, et la compréhension, de phénomènes de contact de langues, suscite pourtant un intérêt accru non seulement de la part des spécialistes de linguistique, mais également de la part de tous ceux qui s’intéressent aux corpus d’oral ou de genres textuels non-normés.

Dans le cadre du projet ANR CLAPOTY, une équipe de linguistes et d’informaticiens s’est intéressée à la représentation et à l’encodage de transcriptions d’oral présentant différentes situations de contact de langues, mettant au total en contact 40 langues de différentes aires et de profils typologiques variés.

Le choix effectué, pour rendre l’exploitation de ces corpus possible sans perdre la complexité des phénomènes réels, a été d’annoter avec précision toutes les données linguistiques des unités observées, sans les classer a priori dans des catégories descriptives dont la définition fait souvent encore débat (comme emprunt, calque, ou alternance de code).

À cette fin, l’équipe de CLAPOTY a développé un schéma d’annotation conforme aux normes les plus actuelles en matière de transcription (Unicode), et d’encodage des annotations (XML). Ce schéma s’inscrit dans le cadre de l’initiative TEI (Text Encoding Initiative), dont il constitue une extension. Dans ce modèle, les unités linguistiques, à tous les niveaux, peuvent être décrites comme relevant d’une langue ou d’une autre, voire de plusieurs à la fois. Ce modèle permet de rendre compte de la richesse et de la flexibilité des manifestations linguistiques spontanées, où il arrive que les pratiques langagières des locuteurs « flottent » entre deux langues.

Abstract (eng)

Methods in corpus processing have until recently been more focused on multilingual corpora (texts in different languages about the same domain) than on plurilingual corpora (corpora with an internal linguistic heterogeneity). This may be due to the fact that they have emerged in natural language processing contexts, mostly in practical applications to written texts, and not in the field of applied linguistics, where the focus is rather on spontaneous, genuine utterances of non-standard speech, and where phenomena of combined use of different languages are not rare.

However, observing and understanding language contact phenomena has a growing appeal not only to specialists in linguistics, but also to all those who have an interest in mining corpora of spoken language or non-standard written language.

Within the framework of the ANR CLAPOTY project, a team of linguists and computer scientists has worked on the representation and encoding of oral transcripts displaying different situations of language contact (with a total of 40 languages from different linguistic areas and various typological profiles).

The choice made, in order to allow automatic mining of the corpora without losing the complexity of real-world linguistic phenomena, was to annotate precisely all the linguistic data on the observed units, without classifying them a priori into descriptive categories whose exact definition is still often debated (e.g. borrowing, calque, code switching).

To this end, the CLAPOTY team has developed an annotation schema that complies with the latest standards for transcription (Unicode) and markup (XML). This schema builds on the TEI (Text Encoding Initiative), extending it where needed (namely, for the annotation of language plurality). In this model, linguistic units (at all levels) may be described as pertaining to one language or another, or even to several languages at the same time. The model is able to represent the richness and versatility of spontaneous linguistic utterances, where speakers often “float” between two languages.

Pascal Vaillant is a senior lecturer in computer science in the LIMICS research unit, and his field is computational linguistics. He graduated as a telecommunication engineer from France’s National Telecommunication Institute (INT) in 1992, and received a PhD in Cognitive Science from the University of Paris-Sud (Orsay, France) in 1997, on the subject of the comparative semiotics of image and language (with a computer application to communication aids for the language impaired). He worked at Thomson (now Thales) in France, at the Humboldt University in Berlin, at the French Telecommunication Engineering School, and at the University of the French West Indies (Martinique and Guiana), before joining the University Paris 13 in 2009. His main research topic is information mining in semi-structured or non-structured texts, especially in heterogeneous, plurilingual, or non-standard communication settings.
