FACHBEREICH 7: Sprach- und Literaturwissenschaft

Institut für Romanistik und Latinistik

Digital Humanities

My work belongs to the digital humanities, in both research and teaching. It is based on digital language corpora, to which I apply qualitative and quantitative methods of analysis. I focus on the multimodal analysis of spoken language corpora covering various contexts (e.g. casual conversations, instructional settings, interviews and media broadcasts). In addition, I work on written text corpora, including historical corpora and data from computer-mediated communication.

A central part of my work concerns building digital corpora and further developing methods for their analysis. Further down this page you will find information about the corpus tools that I am developing and the spoken language corpora that I have created. An adaptation of the international transcription conventions GAT for Spanish can be found here.

In the past I have been involved in several digital humanities projects. The international "ciel-f" project built an ecological corpus of world French, the "Corpus International Écologique de la Langue Française". The RomWeb project analyzed language use in computer-mediated communication, taking into account conditions of language contact and globalization. [moca] is an online database system for the administration and analysis of large multimodal oral corpora. The Research Training Group GRK 1624 on "Frequency effects in language" applied usage-based models to the analysis of language change, language processing and language acquisition.

Corpus tools

The following corpus tools are freely available.

act - Aligned Corpus Toolkit for R

The Aligned Corpus Toolkit (act) is designed for linguists who work with time-aligned transcription data. It offers functions to import and export various annotation file formats ('ELAN' .eaf, 'EXMARaLDA' .exb and 'Praat' .TextGrid files), create print transcripts in the style of conversation analysis, search transcriptions (span searches across single annotations, searches in normalized annotations, concordances etc.), export and re-import search results (.csv and 'Excel' .xlsx format), create cuts for the search results (print transcripts, audio/video cuts using 'FFmpeg' and video subtitles in 'SubRip' .srt format), modify the data in a corpus (search/replace, delete, filter etc.), interact with 'Praat' using 'Praat' scripts, and exchange data with the 'rPraat' package. The package itself is written in R and may be extended by other users.

  • Manual: Download (PDF file)
  • Example data set: direct download (ZIP file) from this site; alternative download from GitHub
  • How to cite: Ehmer, Oliver (2021). act: Aligned Corpus Toolkit. R package version 1.0.2 https://CRAN.R-project.org/package=act
  • Installation: In R, type install.packages("act").
  • CRAN: Official site of the package on CRAN, since October 26, 2020
  • GitHub: Development versions are available on GitHub.
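
To illustrate the kind of conversion behind the subtitle export mentioned in the description above, the following Python sketch turns a few time-aligned annotations into SubRip (.srt) cues. It is not part of act and does not use its API. The names Annotation, srt_timestamp and to_srt are hypothetical and chosen for this example only, and the sketch assumes that each annotation carries a start time, an end time (both in seconds), a tier name and a text.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One time-aligned annotation (hypothetical structure for this sketch)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    tier: str     # tier or speaker name
    text: str     # transcription text


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as a SubRip timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(annotations: list[Annotation]) -> str:
    """Render annotations as numbered SubRip cues, ordered by start time."""
    cues = []
    ordered = sorted(annotations, key=lambda a: a.start)
    for index, ann in enumerate(ordered, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(ann.start)} --> {srt_timestamp(ann.end)}\n"
            f"{ann.tier}: {ann.text}\n"
        )
    # Cues are separated by one blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    example = [
        Annotation(2.0, 3.5, "B", "hola"),
        Annotation(0.0, 1.25, "A", "buenas"),
    ]
    print(to_srt(example))
```

Sorting by start time keeps the cues in playback order even when the annotations come from several tiers. Prefixing the text with the tier name is one possible design choice for showing who is speaking.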

The Transformer - A corpus tool on Windows

The Transformer is a corpus tool for scientists who work with time-aligned transcribed linguistic data. It addresses conversation analysts, phoneticians, anthropologists, and other social scientists who want to analyze language in digital audio or video data. The Transformer manages and converts transcribed, time-aligned linguistic data. It is not itself an annotation tool, but it lets you change the format of your data and save it in a variety of output formats. In addition, The Transformer offers functions for searching and organizing corpora.
For more information, visit the separate web site.

TextGrid to Transcript - Converting TextGrids to print transcripts in Praat

"TextGrid to Transcript" is a tool to generate print transcripts in the style of conversation analysis based on Praat TextGrids. "TextGrid to Transcript" is a script that runs within Praat. It offers basic possibilities to modify the layout of a transcript, such as insertion of line numbers, selection of the tiers to be exported, formatting of the tier names/speakers and adjusting the width of the transcript.

  • Script (right click and select "Save as..."): Download
  • Instructions for using the script in English: Download
  • Instructions for using the script in German: Download
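
The layout options described above (line numbers, speaker labels, transcript width) can be sketched in a few lines of Python. This is not the Praat script itself, only a minimal illustration under simple assumptions: the input is a list of (speaker, text) pairs that is already ordered in time, every printed line receives its own number, and the speaker label appears only on the first line of a turn. The function name print_transcript and its parameters are hypothetical.

```python
import textwrap


def print_transcript(turns, width=60, label_width=4):
    """Lay out (speaker, text) turns as a numbered, wrapped print transcript.

    width       -- maximum number of characters of transcription text per line
    label_width -- width of the column that holds the speaker label
    """
    lines = []
    number = 1
    for speaker, text in turns:
        # Wrap long turns; keep one (empty) line for turns without text.
        wrapped = textwrap.wrap(text, width=width) or [""]
        for i, chunk in enumerate(wrapped):
            # Show the speaker label only on the first line of a turn.
            label = f"{speaker}:" if i == 0 else ""
            lines.append(f"{number:02d}  {label:<{label_width}}{chunk}")
            number += 1
    return "\n".join(lines)


if __name__ == "__main__":
    example = [("A", "hola que tal"), ("B", "bien")]
    print(print_transcript(example, width=8))
```

Numbering every printed line rather than every turn makes it possible to refer to a specific line of a long turn in an analysis.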

[moca] multimodal oral corpus administration

I have been involved in the development and redesign of [moca] – an online system for multimodal oral corpus administration. [moca] stores audio and/or video recordings and their accompanying transcription files. Transcription files are aligned, providing speaker information and the temporal blueprint of the transcription, in addition to the transcription itself. This allows for accessing the media file at individual points in a transcription file directly through an internet browser.
For more information, visit the separate web site.
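
How time alignment makes individual points of a recording addressable from a browser can be illustrated with a small Python sketch. It does not show how [moca] is implemented. It only assumes that each transcription segment stores its start and end time in seconds, and it uses the temporal syntax of W3C Media Fragments URIs (#t=start,end) as one possible way to link to that stretch of a media file. The function name media_fragment_url and the example URL are hypothetical.

```python
def media_fragment_url(media_url: str, start: float, end: float) -> str:
    """Build a link to the stretch of a media file covered by one segment.

    Uses the temporal media fragment syntax '#t=start,end' (times in seconds).
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return f"{media_url}#t={start:.3f},{end:.3f}"


if __name__ == "__main__":
    # Hypothetical recording URL and segment times, for illustration only.
    print(media_fragment_url("https://example.org/recording.mp4", 12.5, 15.0))
```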

Exchange format for multimodal annotations

An early proposal for an exchange format for multimodal annotations has been made in the following publication:
Schmidt, Thomas/ Duncan, Susan/ Ehmer, Oliver/ Hoyt, Jeffrey/ Kipp, Michael/ Loehr, Dan/ Magnusson, Magnus/ Rose, Travis/ Sloetjes, Han (2009): An exchange format for multimodal annotations. In: Kipp, Michael/ Martin, Jean-Claude, et al. (eds.): Multimodal Corpora. From Models of Natural Interaction to Systems and Applications. Berlin/Heidelberg: Springer, 207–222. Publisher (open access)


Forschungs- und Lehrkorpus gesprochenes Spanisch

Project leaders: Oliver Ehmer, Ignacio Satti. Funding: Supported by a grant from the Studierendenrat of the Albert-Ludwigs-University of Freiburg (Studierendenvorschlagsbudget).
More information coming soon.

ICAS - Instructing Corporeal Arts and Skills

ICAS is a corpus of spoken Spanish that documents authentic Instructions of Corporeal Arts and Skills. The corpus focuses on instructional classes in dance (Argentine tango, Latin dance), but also comprises sports classes (e.g. aikido, surfing), medical instructions (e.g. first aid, physical rehabilitation) and vocational training (e.g. construction, welding). All classes have been recorded with a dual camera set-up and body microphones on the teachers. The transcriptions are time-aligned and therefore compatible with tools like Praat, ELAN, EXMARaLDA, etc.

This corpus is currently being built within the project "Body knowledge. Multimodal practices for instructing corporeal-performative knowledge in interaction" (for further details, visit http://www.body-knowledge.org).

Size: to date ~130,000 words, total length ~76 hours, 60 recordings (the corpus is constantly growing)

Funding: Supported by a grant from the Ministry of Science, Research and the Arts of Baden-Württemberg and the Albert-Ludwigs-University of Freiburg.

cespla - Corpus de Conversaciones ESPontáneas PLAtenses

cespla is a linguistic corpus of everyday conversations from the region along the River Plate (Argentina and Uruguay). It consists mainly of dinner conversations among friends and family, most of which were recorded in Buenos Aires and La Plata. Most of the recordings are audio only; some are video recordings. All transcriptions are time-aligned.

For more information, visit the separate web site.

Size: ~385,000 words transcribed, total length ~164 hours, 60 recordings

Funding: Wissenschaftliche Gesellschaft (Freiburg im Breisgau), Verein für Gesprächsforschung e.V. (Prize for the best PhD project)

TTI. Tango Teacher Interviews

This corpus of spoken Spanish is a collection of interviews with teachers of Argentine tango that were video recorded in Buenos Aires and La Plata (Argentina). All transcriptions are time-aligned.

Size: ~55,000 words transcribed, total length ~10 hours, 32 recordings

escucho. Radio call-in program "Te escucho" by Luisa Delfino

"Te escucho" is a famous Argentine call-in radio program hosted by journalist Luisa Delfino. The program is oriented towards advice-giving and life coaching. The corpus consists of audio recordings of the internet broadcasts and time-aligned transcriptions.

Size: ~40,000 words transcribed, total length ~13 hours, 18 recordings

SCHALL - SprachCorpus Heutiger ALLtagsgespräche

Corpus of spontaneous everyday conversations in German. All recordings are audio only. Some transcriptions are time-aligned, while others are running text. The corpus has been created in collaboration with Stefan Pfänder.

For more information, see www.sprachcorpus.de.