MERLIN for research

This section provides in-depth information about the design of the MERLIN corpus and the annotation layers. You will learn which measures have been taken to strictly link the MERLIN texts to the CEFR and to ensure reliability at all stages of the corpus compilation process. The last part provides a rationale for the use of the corpus in different research contexts.

Linking the MERLIN texts to the CEFR


The MERLIN texts are the writing sections of CEFR-related, standardized, high-quality tests from telc (Frankfurt/Main, Italian and German tests) and ÚJOP (Prague, Czech tests). These institutions are ALTE-audited (ALTE). The tasks were in use until 2013 and are now freely available on the platform. However, to obtain explicit and direct information about the CEFR profiles of the written productions themselves (and not only of the tests as a whole), all MERLIN texts were re-rated independently by two professional raters per language. The reliability of the re-ratings was examined with the help of Classical Test Theory and a Multi-Facet Rasch analysis. The latter is a probabilistic statistical procedure often used in language testing; it corrects for rating tendencies (e.g., leniency/harshness) and makes it possible to arrive at a fair average rating for each text. Intra-rater and inter-rater reliability were generally very high in MERLIN, with some exceptions for Italian. Therefore, the whole re-rating process was repeated for Italian, resulting in satisfactory rating quality.

In MERLIN, the fair average is calculated based on a holistic scale (see 1.2 rating instruments).
If you are interested in more details regarding the quality of the ratings and the difficulty of the single rating criteria, please consult the technical report.
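For orientation, the Multi-Facet Rasch analysis mentioned above is usually written in the following generic form. This is the textbook formulation, not a description of the exact facets modelled in MERLIN (those are documented in the technical report); all symbols are introduced here for illustration only.

```latex
% Generic many-facet Rasch model (rating scale formulation).
% P_{nijk}: probability that text n receives category k (rather than k-1)
%           on criterion i from rater j.
\[
  \log \frac{P_{nijk}}{P_{nij(k-1)}} \;=\; \theta_n - \delta_i - \alpha_j - \tau_k
\]
% \theta_n : proficiency reflected in text n
% \delta_i : difficulty of rating criterion i
% \alpha_j : severity of rater j (leniency/harshness)
% \tau_k   : threshold between categories k-1 and k
```

Under this kind of model, a fair average can be understood as the rating a text would be expected to receive from a rater of average severity, which is how individual leniency or harshness is compensated.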

Rating instruments

Two rating instruments were used: an assessor-oriented version (Alderson 1991) of the holistic scale (page 2 of the MERLIN rating grid) for "General Linguistic Range" (Chapter 5, CEFR) was accompanied by an analytical rating grid (page 3 of the MERLIN rating grid) that is closely connected to Table 3 of the CEFR (CoE 2001). This table was of great importance in the process of scaling the CEFR descriptors (North 2005, 2000). The MERLIN version includes six rating criteria (vocabulary range | vocabulary control | grammatical accuracy | coherence & cohesion | orthography | sociolinguistic appropriateness). These criteria stem from scales in Chapter 5 of the CEFR, which specifies aspects of communicative L2 competence. For the construction of the grid, descriptors of these scales were modified in an assessor-oriented way. Plus-levels (A2+, B1+) were excluded as the CEFR does not specify descriptors for these levels for all rating criteria. The rating instruments were piloted before their implementation in the MERLIN project.

Preparing the data


The hand-written original learner texts were transcribed in an XML-based editor (XMLmind©) inside the testing institutions (telc and ÚJOP). The transcribers followed transcription guidelines, and the reliability of the transcripts was checked, initially for a sample of 5% of the texts per CEFR level. As many transcription errors were detected, in the end almost all texts had to undergo a revision stage.
The transcription guidelines included tags (inline annotation) for basic textual features such as unreadable or ambiguous stretches of language, foreign language words, emoticons, images, paragraphs, words copied from the rubrics, or greeting formulae. The anonymization (names, places) was part of the transcription process and was carried out according to the guidelines.

Tools & formats

Once the transcriptions were available, all data was converted to PAULA, a standoff XML format designed as an exchange format for linguistic annotation.

Further manual annotations were carried out with two tools: MMAX2 and the Falko Excel Add-in. MMAX2 is a text annotation tool that allows multi-layered annotation. It was used for the annotation of learner language features (see 2.3.1). The Falko Add-in was used for annotating both target hypothesis 1 and 2.

Automatic annotation made use of the UIMA framework. UIMA allows a modular integration of a wide range of NLP tools such as part-of-speech taggers and parsers. For the advanced search functions, the open source web-browser based search and visualization architecture ANNIS is used.


Manual annotations available for the whole corpus


Minimal target hypotheses / target hypotheses 1 (TH1)

All annotation is necessarily based on human interpretation of what the person who produced the text might have had on his/her mind. It is important to make this interpretation explicit so that MERLIN users can understand the annotations better. Therefore, the MERLIN corpus contains rule-based target hypotheses that suggest a corrected version of the learner texts.
In the main phase of annotation, an orthographically and grammatically correct version of the learner text was created (target hypotheses 1, TH1) for the whole corpus. The annotator was allowed to make as few interventions as possible. In this table, you find a simple example:

The following example by the same learner shows that in TH1, errors from other linguistic areas were ignored. There are content and technical reasons for this.

While the orthographical (capitalization error, word boundary error, missing hyphen) and grammatical (missing article) errors are corrected in the TH1, the lexically erroneous form *Reisespass (instead of “Reisepass”) was not substituted by another lexeme. Phenomena like this are annotated in the MERLIN core corpus (for definitions of the errors see MERLIN annotation scheme).

The team followed the target hypotheses rules developed for the Falko corpus and adapted them to the project needs where necessary (cf. Reznicek/Lüdeling et al. 2012; see annotation manual). In some cases, annotators agreed upon annotation rules on a very fine-grained level. For example, it was decided that in German, the final double <ss> instead of standard German spelling <ß> was not changed in texts in which it might be possible that the learner consistently used the Swiss spelling, which does not use the <ß>. Annotation decisions have been documented consistently and are available upon request.

TH1 were compiled for the whole MERLIN corpus. They were written in Excel with the help of the Falko Add-in, and the procedure was piloted before the actual annotation took place.

Manual annotation of grammatical and orthographical learner language features – error annotation 1 (EA1)

Building on the target hypotheses 1, almost all MERLIN texts (for details see The MERLIN corpus in figures) were annotated with grammatical and orthographical language features from various sources (EA1 = error annotation 1). You can find a complete list of the features (“tags”) with examples in MERLIN annotations, while the annotation scheme gives you full access to the definitions of each learner language feature and additional examples.

The MERLIN annotation tags for EA1 and EA2 were derived from …

  1. CEFR scales: some tags were chosen to support research about the empirical validity of the CEFR scales underlying the MERLIN analytical rating grid (chapter 5 of the CEFR, CoE 2001). They can help to control whether the predictions of selected CEFR descriptors correspond to learner behaviour, e.g.: intelligibility, use of idioms, content jumps (see 3.2 MERLIN for scale validation).
  2. issues in current SLA research, e.g. grammatical aspects such as verb valency, word order, negation, or lexical aspects, e.g. the use of formulaic sequences (references)
  3. features reported to the MERLIN team by testers, teachers and teacher trainers in a questionnaire study and in expert interviews as being relevant for assessing language mastery at certain levels, e.g. the verbal aspect in Italian and Czech
  4. textbook and language test analyses, which revealed further recurrent topics, some of which were included in the MERLIN annotation scheme, e.g. German modal verbs
  5. learner text analyses carried out in a random sample of MERLIN texts (5% per test level/language), e.g. use of articles and clitics

The annotation scheme specifies to which group(s) the single learner language features belong.

Furthermore, most error-related MERLIN tags (EA1 & EA2) incorporate the widely used ‘target language modification’ dimension (cf. Díaz-Negrillo/Fernández-Domínguez 2006). This dimension specifies the type of error: an element might have been omitted, changed, added, repositioned, merged with, or split from another element. You can find details about this in the annotation scheme.

Manual annotations in the MERLIN core corpus

The structure of the MERLIN core corpus

For a small pilot sample (the MERLIN core corpus), further linguistic dimensions are taken into consideration in addition to grammar and orthography. The MERLIN core corpus consists of texts that received fair averages of either A2 or B2. Thus, two groups of learners with clearly distinct levels of proficiency can be compared. It is important to note that the ratings the learners received do not necessarily correspond to the CEFR level of the test they decided to take.

Many outperformed the targeted CEFR levels, while others’ performances were rated lower than the learners would have expected. An extreme case is the Italian corpus, where only two texts actually received a B2 level, while many more students took B2 tests. Here, the MERLIN core corpus incorporates the 100 texts that were placed highest on the Rasch logit scale (technical report).
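The selection rule described for the Italian core corpus (take the texts placed highest on the Rasch logit scale) can be sketched as follows. This is only an illustration: the record type, the field names and the sample values are hypothetical and do not come from the MERLIN data.

```python
from typing import NamedTuple


class RatedText(NamedTuple):
    text_id: str   # hypothetical text identifier
    logit: float   # hypothetical Rasch measure of the text


def select_highest(texts, n=100):
    """Return the n texts with the highest logit values, best first."""
    return sorted(texts, key=lambda t: t.logit, reverse=True)[:n]


# Invented sample values, for illustration only.
sample = [
    RatedText("it_001", 0.4),
    RatedText("it_002", 2.1),
    RatedText("it_003", -1.3),
    RatedText("it_004", 1.7),
]

top_two = select_highest(sample, n=2)
print([t.text_id for t in top_two])  # ['it_002', 'it_004']
```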

Extended target hypotheses / target hypotheses 2 (TH2)  

Target hypotheses 2 (TH2) aim at creating an acceptable version of the learner text. This process involves more subjectivity and makes reliable decisions more difficult, which is why it was separated from target hypotheses 1, as in the Falko project, with which MERLIN cooperated closely. The aim of TH2 is to capture the acceptability of the learner text (not, as for TH1, its correctness); TH2 are therefore an extension of TH1. To this end, the learner text was still only minimally modified, while at the same time its reconstruction comes close to what a native speaker utterance would look like. This reconstruction concerns semantic and lexical aspects, pragmatics, and sociolinguistics. Unlike in TH1, phenomena that span sentence boundaries and that are determined by the context are modified, too.

Annotations of sociolinguistic, pragmatic, lexical, and other learner language features  (error annotation 2, EA2)

For a part of the MERLIN core corpus, many tags from various linguistic perspectives were added to the grammatical and orthographical learner language features annotated in the main stage of the project. These tags stem from the same sources as the EA1 annotations.

You can find detailed information about the individual tags in the annotation scheme. They include, for example, the speech act REQUEST, the use of language with an inappropriate level of formality, the use of structures that pertain to spoken language variants, and reference problems.

Again, the MERLIN tags incorporate the widely used ‘target language modification’ dimension (cf. Díaz-Negrillo/Fernández-Domínguez 2006) which yields information about the type of the learner language feature (an element might have been omitted, changed, added, repositioned, merged with, or split from another element).

Quality control aspects of the annotation process

It was important to make sure that the annotations in the MERLIN corpus are as consistent as possible, even if a certain degree of subjectivity is unavoidable. To this end, the MERLIN project took a number of measures:

  • All instruments (TH1 & TH2 rules, annotation scheme for EA1 and EA2) were piloted before their implementation. This made it possible to detect potentially problematic aspects and to correct them before the annotations started.
  • All annotations are based on guidelines (annotation manual, Falko-Handbuch).  The guidelines were enriched by fine-grained decisions on single aspects of annotation.
  • The quality of the annotations was assured by comprehensive documentation of annotation decisions, also to guarantee the consistency of the annotations for the three project languages.

Last but not least, the reliability of annotations was controlled for 5% of the texts on each test level for target hypotheses (1 & 2) and error annotation (1 & 2). Different methods were applied:

  • In a qualitative approach, half of the files were annotated independently by the coders and then discussed jointly with the aim of arriving at a consensus. Annotation was done level by level, and this step took place before the annotation of each level started. The texts served as a reference throughout the annotation process.
  • The second half of the files checked for reliability was annotated by all coders without their knowledge. This quantitative, double-blind procedure makes it possible to check for intra-coder reliability (the consistency of one and the same annotator) and inter-coder reliability (the degree of agreement between different annotators).
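As an illustration of how inter-coder agreement of this kind can be quantified, the sketch below computes Cohen's kappa for two coders who labelled the same items. The source does not state which agreement statistic was used in MERLIN, so kappa is an assumption made here for the example, and the labels are invented.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Proportion of items on which both coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label distribution.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six tokens (not actual MERLIN tags).
coder_1 = ["article", "article", "capitalization", "none", "none", "article"]
coder_2 = ["article", "none", "capitalization", "none", "none", "article"]

print(round(cohens_kappa(coder_1, coder_2), 3))  # 0.739
```

The same function can be applied to two annotation passes by one coder to estimate intra-coder consistency.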

Although EA2 annotations underwent these quality control measures as well, they are of an explorative pilot character. Therefore, users are asked to analyse EA2 annotations with caution.

Consistency and interference of annotation layers

From a technical perspective, it was complex to integrate and harmonize the different annotation formats in MERLIN without losing information or creating imprecisions.
At the same time, on a content level, contradictions between the different annotation levels (TH1-EA1-TH2-EA2) were to be avoided.
TH1 and EA1 are closely connected. If there is a change of the learner text on TH1, there ought to be a tag on EA1 that makes the learner language feature explicit in detail.
Also, all EA2 annotations are reflected in TH2. The opposite, however, is not necessarily true: there might be TH2 modifications that are needed to arrive at an acceptable version of the learner text and that are not part of the MERLIN annotation scheme. The MERLIN team might not have included a phenomenon if it was not considered relevant and/or feasible.
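The expected link between TH1 and EA1 (every change on TH1 should be accompanied by an EA1 tag) can be checked mechanically. The sketch below assumes a simplified, token-aligned representation in which a missing token is an empty string; this data layout, the function name and the tag label are hypothetical and are not the MERLIN file format.

```python
def unexplained_changes(learner_tokens, th1_tokens, ea1_tags):
    """Return positions where TH1 differs from the learner text without an EA1 tag.

    All three lists are assumed to be aligned token by token; ea1_tags holds
    one set of tag labels per position (an empty set means no tag).
    """
    if not (len(learner_tokens) == len(th1_tokens) == len(ea1_tags)):
        raise ValueError("the three layers must be aligned and of equal length")
    return [
        i
        for i, (orig, target, tags) in enumerate(
            zip(learner_tokens, th1_tokens, ea1_tags)
        )
        if orig != target and not tags
    ]


# Invented example: the inserted article at position 2 carries no EA1 tag.
learner = ["ich", "habe", "", "Hund"]
th1 = ["Ich", "habe", "einen", "Hund"]
ea1 = [{"capitalization"}, set(), set(), set()]

print(unexplained_changes(learner, th1, ea1))  # [2]
```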

Automatic annotations in MERLIN

In MERLIN, a combination of automatic and manual annotation procedures was used in order to prepare learner texts for integration into the platform. We have applied existing automatic annotation tools developed for the target languages in order to expand the range of available linguistic annotation beyond what would have been possible with time-consuming and expensive manual annotation. However, it is important to keep in mind that automatic annotation is particularly challenging for learner language, since learner language often deviates considerably from the target language across all levels of linguistic analysis, from spelling to semantics.

The following tools were used for all three MERLIN languages:

Texts were tokenized using the tokenizer for Indo-European languages from LingPipe and the resulting tokenization was then corrected by hand.
Sentences were annotated with the OpenNLP sentence segmenter.
Repetitions were identified using the Saphre library on the basis of the automatic part-of-speech and lemma annotation described below.

Language-Specific Tools

MERLIN contains part-of-speech tags (tok_pos), lemmas (tok_lemma), and dependency parses (dependencies) for all three languages. Additional part-of-speech tags, lemmas, and morphological analyses from alternate tools are included where available. Details about the annotation tools and annotation schemes are provided for each language individually below.


Czech

Part-of-speech tags and lemmas (tok_pos and tok_lemma):

MorphoDiTa was used to annotate POS tags and lemmas according to the Prague Dependency Treebank guidelines. There are 12 basic POS tags (seen in the first character of each tag) and more than 4000 possible detailed morphosyntactic tags in the full tag set.
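Because the basic POS category is encoded in the first character of each tag, a coarse category can be read off without parsing the full tag. The sketch below is a minimal illustration; the category names follow the Prague positional tag set as commonly documented and should be checked against the tag set documentation, and the function name is made up for this example.

```python
# First-position codes of the positional tag set (assumed from the
# commonly documented Prague Dependency Treebank conventions).
COARSE_POS = {
    "A": "adjective",
    "C": "numeral",
    "D": "adverb",
    "I": "interjection",
    "J": "conjunction",
    "N": "noun",
    "P": "pronoun",
    "R": "preposition",
    "T": "particle",
    "V": "verb",
    "X": "unknown",
    "Z": "punctuation",
}


def coarse_pos(tag):
    """Map a detailed positional tag to its basic POS category."""
    if not tag:
        raise ValueError("empty tag")
    return COARSE_POS.get(tag[0], "unrecognized")


print(coarse_pos("NNFS1-----A----"))  # noun
print(coarse_pos("VB-S---3P-AA---"))  # verb
```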

Dependency parses:

The joint tagger and parser from Bernd Bohnet et al. (2013) was trained on data from the Prague Dependency Treebank. The parser also provides basic POS tags (tok_pos_bohnet) and morphological analyses (tok_morph_bohnet).


German

Part-of-speech tags and lemmas (tok_pos and tok_lemma):

TreeTagger was used to annotate POS tags and lemmas using the Stuttgart-Tübingen tag set, which contains 54 tags.

Dependency parses:

The joint tagger and parser from Bernd Bohnet et al. (2013) was trained on a dependency conversion of the Tiger Treebank with additional data from the SMOR morphological analyzer.
Bernd Bohnet kindly provided a version of the German parsing model customized for the MERLIN data. The parser also provides basic POS tags (tok_pos_bohnet), lemmas (tok_lemma_bohnet), and morphological analyses (tok_morph_bohnet).

T-units (tunit and complextunit):

T-units and complex t-units were identified using the algorithms presented in Julia Hancke's 2013 master's thesis "Automatic Prediction of CEFR Proficiency Levels Based on Linguistic Features of Learner Language", which relies on automatic parses produced by the Stanford parser. The parses are not presented in the MERLIN corpus, but the POS tags from the Stanford parser, which uses the same German tag set as TreeTagger (STTS), are shown for reference in tok_pos_stanford.


Italian

Part-of-speech tags and lemmas (tok_pos and tok_lemma):

TreeTagger was used to annotate POS tags and lemmas. The POS tag set developed by Achim Stein contains 38 tags.

Dependency parses:

The joint tagger and parser from Bernd Bohnet et al. (2013) was trained with data from the Italian Stanford Dependency Treebank. Additional POS tags and morphological analysis provided by the parser are included as tok_pos_bohnet and tok_morph_bohnet.

Available complexity measures

For the German part of the MERLIN corpus a large set of accuracy and complexity measures has been computed. They reflect complexity features in different linguistic domains, e.g.:

Lexical complexity

  • Lexical types per lexical tokens
  • Nouns per lexical tokens
  • Verbs per noun

Morphological complexity

  • Average compound depth
  • Finite verbs per verb

Phrasal and clausal complexity

  • Words per sentence
  • T-units per sentence
  • Verbs per t-unit
  • Average noun phrase length
  • Average verb phrase length

Discourse complexity

  • Connectives per sentence
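To make the ratio-based measures listed above concrete, the following sketch computes four of them from sentences of (word, STTS tag) pairs. It is not the implementation used for MERLIN: which tags count as "lexical" or as "verbs", and the use of lowercased word forms rather than lemmas for types, are assumptions made for this example.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def is_noun(tag):
    return tag in {"NN", "NE"}          # common nouns and proper nouns


def is_verb(tag):
    return tag.startswith("V")          # all STTS verb tags (VV*, VA*, VM*)


def is_lexical(tag):
    # Assumption: lexical words are nouns, full verbs, adjectives and adverbs.
    return (
        is_noun(tag)
        or tag.startswith("VV")
        or tag.startswith("ADJ")
        or tag == "ADV"
    )


def complexity_measures(sentences):
    """Compute a few ratio measures from sentences of (word, STTS tag) pairs."""
    # Flatten, lowercase, and skip punctuation (STTS tags starting with "$").
    words = [
        (form.lower(), tag)
        for sentence in sentences
        for form, tag in sentence
        if not tag.startswith("$")
    ]
    lexical = [form for form, tag in words if is_lexical(tag)]
    nouns = sum(1 for _, tag in words if is_noun(tag))
    verbs = sum(1 for _, tag in words if is_verb(tag))
    return {
        "lexical_types_per_lexical_token": ratio(len(set(lexical)), len(lexical)),
        "nouns_per_lexical_token": ratio(nouns, len(lexical)),
        "verbs_per_noun": ratio(verbs, nouns),
        "words_per_sentence": ratio(len(words), len(sentences)),
    }


# Invented two-sentence example.
sentences = [
    [("Ich", "PPER"), ("habe", "VAFIN"), ("einen", "ART"), ("Hund", "NN"), (".", "$.")],
    [("Der", "ART"), ("Hund", "NN"), ("schläft", "VVFIN"), ("gern", "ADV"), (".", "$.")],
]

print(complexity_measures(sentences))
```

With this toy input the lexical tokens are "hund", "hund", "schläft" and "gern", so the type/token ratio is 0.75 and there are 4 words per sentence.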

In addition, accuracy-related measures are provided, e.g.:

  • number of error-free sentences
  • overall error counts

You can extract these and many more measures for single learner texts or a collection of learner texts from the MERLIN meta-data file (*.csv).
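A minimal sketch of such an extraction with Python's standard csv module is shown below. The column names (text_id, cefr_level, words_per_sentence) and the values are placeholders invented for this example; the actual header of the MERLIN meta-data file has to be looked up in the file itself.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Stand-in for the meta-data file; column names and values are hypothetical.
SAMPLE_CSV = """text_id,cefr_level,words_per_sentence
de_001,A2,8.5
de_002,A2,9.5
de_003,B2,14.0
"""


def mean_by_level(csv_file, level_column, measure_column):
    """Average one numeric measure per level from an open CSV file object."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[level_column]].append(float(row[measure_column]))
    return {level: mean(values) for level, values in groups.items()}


result = mean_by_level(io.StringIO(SAMPLE_CSV), "cefr_level", "words_per_sentence")
print(result)  # {'A2': 9.0, 'B2': 14.0}
```

For a real file, the io.StringIO object would be replaced by a file opened with open(path, newline="", encoding="utf-8"); delimiter and encoding may need to be adjusted.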

An explanation and specification of the computed complexity measures is provided in: Weiss, Z., Meurers, D. (2021). Analyzing the linguistic complexity of German learner language in a reading comprehension task: Using proficiency classification to investigate short answer data, cross-data. International Journal of Learner Corpus Research 7 (1), 83-130.

Using MERLIN for research purposes

The main aim of MERLIN is not research-oriented: the platform was developed for practitioners who need empirical illustrations of rated CEFR levels for Czech, Italian, and German. An increasing number of initiatives have started to collect authentic learner language rated according to CEFR levels. Some of them pertain to the Reference Level Descriptions (RLD) initiative, i.e. a specification of the CEFR levels for single languages (the most prominent example is the English Profile Project, other projects are ASK for Norwegian, Carlsen 2013, or the Profilo della lingua italiana, Spinelli/Parizzi 2010). The Council of Europe encourages the development of RLDs (CoE 2005, see CoE website for Reference Level Descriptions).
From corpora like these, features that characterize CEFR levels (sometimes called criterial features, Hawkins/Filipović 2012) can be extracted. This process helps to deepen the understanding of what CEFR-related ratings mean and to build the use of the CEFR on firmer, empirical grounds. MERLIN contributes to the empirically based exploration of the CEFR for German, Italian, and Czech. It differs from most existing initiatives in that all data, including full texts, test tasks and annotations, are fully and freely available online.
Apart from this major practical aim, MERLIN is relevant for research purposes from various perspectives:

Validating CEFR scales with MERLIN

The Council of Europe effort of scaling the CEFR descriptors (CoE 2001; North 2000; Schneider/North 2000) has led to immense improvements in standardization and transparency in language learning, teaching, and testing. Important decisions about language learners' lives are taken with reference to the CEFR levels. In many ways, it seems as if the scales have acquired a life of their own; often, they are over-estimated, misunderstood and applied in ways in which they were not meant to be used (North 2000). One crucial aspect that is still insufficiently understood is the empirical validity of the CEFR scales (Fulcher 2004; Hulstijn 2007): if scales are used to describe or rate learner language, they must reflect what learners actually do (Alderson 1991). In spite of this, to date there is almost no research that examines the power of the CEFR descriptors to capture the language learners actually produce (Wisniewski 2014). MERLIN makes it possible to analyze directly the relationship between selected CEFR descriptors (such as "circumlocutions" or "content jumps", which were operationalized and annotated; see MERLIN annotation scheme) and learner language without having to rely on ratings.

MERLIN and second language acquisition studies

Many studies from the area of second language acquisition (SLA) refer to proficiency levels when describing the development and the variation of learner language. However, in many cases the proficiency classification is not yet based on procedures that comply with the strict standards that need to be met from the perspective of research-based, high-quality language testing (see for example AERA/APA/NCME; ALTE 2001; Bachman/Palmer 1996; EALTA code of practice). There is a particular lack of strict testing procedures and easily accessible empirical data for languages other than English when it comes to CEFR-based proficiency classifications. Although MERLIN is small in size, its reliable relationship to the CEFR makes it a valuable resource for future SLA studies. It can also be used for triangulating and validating data from many existing studies.

MERLIN to advance NLP of learner language

The MERLIN corpus provides valuable data for the development and evaluation of natural language processing tools for learner language (Meurers 2012). The corpus and its meta-information on learners and ratings readily support research on automatic native language identification, enabling such research to go beyond the current English learner focus. In a similar vein, the corpus has already been used for research on automatic proficiency classification for German (Hancke 2013). The MERLIN corpus also provides richly annotated learner data for the development and adaptation of NLP tools and applications that assist language learners in improving their vocabulary usage, coherence, spelling and grammatical accuracy.


[ALTE 2001] = ALTE Working Group on the Code of Practice: Principles of Good Practice for ALTE Examinations. Revised Draft., October 2013.
[Consiglio d'Europa 2004a] = Trim, J./North, B./Coste, D.: Quadro comune europeo di riferimento per le lingue: apprendimento, insegnamento, valutazione. La Nuova Italia: Oxford.- A cura del Consiglio d'Europa.
[Council of Europe 1975] = Van Ek, J. A.: The Threshold Level in a European unit/credit system for modern language learning by adults.  Strasbourg: Council of Europe.
[Council of Europe 1994a] = North, B.: Scales of language proficiency: a survey of some existing systems. Strasbourg: Council of Europe, CC-Lang (94) 24.
[Council of Europe 1994b [1981]] = Galli de' Paratesi, N.: Livello Soglia per l'insegnamento dell'italiano come lingua straniera. Strasbourg: Edizioni del Consiglio d'Europa.
[Council of Europe 1999 [1980]] = Baldegger, M./Müller, M./Schneider, G. (1999): Kontaktschwelle Deutsch als Fremdsprache.  4. Auflage. Berlin u.a.: Langenscheidt.- Herausgegeben vom Europarat.
[Council of Europe 2001] = Trim, J./North, B./Coste, D.: Common European Framework of Reference for Languages: Learning, teaching, assessment. -Edited by the Council of Europe. Online-Dokument:, Oktober 2013.
[Europarat 2001] = Trim, J./North, B./Coste, D.: Gemeinsamer europäischer Referenzrahmen für Sprachen: lernen, lehren, beurteilen. Berlin u.a.: Langenscheidt.- Herausgegeben vom Europarat, Online-Dokument:, Oktober 2013.
[Europarat 2004b] = Takala, S./Kaftandjieva, F./Verhelst, N./Banerjee, J./Eckes, T./van der Schoot, F.: Reference Supplement to the Preliminary Pilot Version of the Manual for Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment.- Edited by the Council of Europe, Online-Dokument:, Oktober 2013.
[Europarat 2009 [2003]] = North, B./Figueras, N./Takala, S./Van Avermaet, P./Verhelst, N.: Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Manual. Preliminary Pilot Version.- Edited by the Council of Europe, Online-Dokument:, Oktober 2013.
Abel, A.; Wisniewski, K.; Nicolas, L.; Boyd, A.; Hana, J.; Meurers, D. (2014): A Trilingual Learner Corpus illustrating European Reference Levels. In: Ricognizioni – Rivista di Lingue, Letterature e Culture Moderne 2 (1), 111-126. 
Alderson, J.C. (2007): The CEFR and the need for more research. In: The Modern Language Journal 91, 658-662.
Alderson, J. C./Figueras, N./Kuijper, H./Nold, G./Takala, S./Tardieu, C. (2006): Analysing Tests of Reading and Listening in Relation to the Common European Framework of Reference: The Experience of the Dutch CEFR Construct Project. In: Language Assessment Quarterly 3(1), 3-30.
AERA/APA/NCME (1999): Standards for educational and psychological testing. Washington: AERA.
Alderson, J.C. (1991): Bands and scores. In: Alderson, J.C./North, B. (eds.): Language testing in the 1990s. London: British Council/Macmillan, 71-86.
Arnaud, P. J. L. (1984): The lexical richness of L2 written productions and the validity of vocabulary tests. In: Culhane, T./Klein-Braley, C./Stevenson, D. K. (eds.): Practice and Problems in Language
Arras, U. (2010): Subjektive Theorien als Faktor bei der Beurteilung fremdsprachlicher Kompetenzen. In: Berndt, A./Kleppin, K. (eds.): Sprachlehrforschung: Theorie und Empirie - Festschrift für Rüdiger Grotjahn. Frankfurt: Lang, 169-179.
Bachman, L.F. (2004): Statistical analyses for language assessment. Cambridge: CUP.
Bachmann, T. (2002): Kohäsion und Kohärenz: Indikatoren für Schreibentwicklung: Zum Aufbau kohärenzstiftender Strukturen in instruktiven Texten von Kindern und Jugendlichen. Innsbruck: Studienverlag.
Bausch, K.-R./Christ, H./Königs, F.G./Krumm, H.-J. (eds.) (2003): Der Gemeinsame Europäische Referenzrahmen für Sprachen in der Diskussion. Arbeitspapiere der 15. Frühjahrskonferenz zur Erforschung des Fremdsprachenunterrichts. Tübingen: Narr.
Bardovi-Harlig, K. (2009): Conventional Expressions as a Pragmalinguistic Resource: Recognition and Productions of Conventional Expressions in L2 Pragmatics. In: Language Learning 59 (4), 755-795.
Bestgen, Y./Granger, S. (2011): Categorising spelling errors to assess L2 writing. In: International Journal of Continuing Engineering Education and Life Long Learning, 21 (2), 235-252.
Bond, T. G./Fox, C. M. (2007): Applying the Rasch model: Fundamental measurement in human sciences. Mahwah, NJ: Lawrence Erlbaum.
Bulté, B./Housen, A. (2012): Defining and operationalising L2 complexity. In: Housen, A./Kuiken, F./Vedder, I. (eds.): Dimensions of L2 Performance and Proficiency: Complexity, Accuracy and Fluency in SLA. Amsterdam: Benjamins, 21-46.
Burger, H. (2007): Phraseologie. Eine Einführung am Beispiel des Deutschen. (3. Aufl.). Berlin: Erich Schmidt Verlag.
Carlsen, C. (ed.) (2013): Norsk Profil. Det felles europeiske rammeverket spesifisert for norsk. Et første steg. Oslo: Novus.
Carlsen, C. (2010): Discourse connectives across CEFR levels: A corpus-based study. In: Bartning, I./Martin, M./Vedder, I. (eds.): Communicative Proficiency and Linguistic Development: intersections between SLA and language testing research (Eurosla). 191-210.
Christ, O. (1994). A modular and flexible architecture for an integrated corpus query system. arXiv preprint cmp-lg/9408005.
Corder, S. P. (1993 [1973]): Introducing Applied Linguistics. Harmondsworth: Pelican.
Dallapiazza, R.M./von Jan, E., Schönherr, T. (1998) (eds.): Tangram: Deutsch als Fremdsprache. Kurs- und Arbeitsbuch 1 A. Munich: Hueber.
Daller, H./van Hout, R./Treffers-Daller, J. (2003): Lexical richness in spontaneous speech of bilinguals. In: Applied Linguistics 24, 197-222.
Dewaele, J.-M. (2004): Individual differences in the use of colloquial vocabulary. The effects of sociobiographical and psychological factors. In: Bogaards, P./Laufer, B. (eds.): Vocabulary in a second language. Amsterdam: John Benjamins, 127-154.
Díaz-Negrillo, A./Fernández-Domínguez, J. (2006): Error-coding systems for learner corpora. In: RESLA 19, 83-102.
Eckes, T. (2008): Rater types in writing performance assessments: A classification approach to rater variability. In: Language Testing 25 (2) 155-185.
Eckes, T. (2009): Reference Supplement to the Manual for Relating Language Examinations to the Common European Framework of Reference for Languages: Learning, Teaching, Assessment.  Section H: Many-Facet Rasch Measurement. (, January 2014.)
Eisenberg, P. (2007): Sprachliches Wissen im Wörterbuch der Zweifelsfälle. Über die Rekonstruktion einer Gebrauchsnorm. In: Aptum. Zeitschrift für Sprachkritik und Sprachkultur 3/2007, 209-228.
Ellis, R. (1994): The study of Second Language Acquisition. Oxford: Oxford University Press.
Fulcher, G. (2004): Deluded by Artifices? The Common European Framework and Harmonization. In: Language Assessment Quarterly 1 (4), 253-266.
Fulcher, G./Davidson, F. (2007): Language Testing and Assessment. London/New York: Routledge.
Gould, S.J. (1996): The mismeasure of man. London: Penguin.
Glaznieks A./Nicolas L./Stemle E./Lyding V./Abel A. (2012): Establishing a Standardised Procedure for Building Learner Corpora. In: Apples - Journal of Applied Language Studies. Special Issue: Proceedings of LLLC2012.
Granger, S. (2003): Error-tagged learner corpora and CALL: a promising synergy. In: CALICO Journal 20 (3). Special issues on error analysis and error correction in computer-assisted language learning, 465-480.
Granger, S. (2008): Learner corpora. In: Lüdeling, A. / Kytö, M. (eds.): Corpus linguistics: an international handbook (Handbooks of linguistics and communication science; 29.1_ 29.2). Berlin - New York: de Gruyter. 259-275.
Granger, S. (2002): A Bird's-eye view of learner corpus research. In: Granger, S./Hung, J./Petch-Tyson, S. (eds.): Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching. Amsterdam: John Benjamins, 3-33.
Halliday, M. A. K. /Hasan, R. (1989): Language, context and text: a social semiotic perspective. Oxford: Oxford University Press.
Hancke, J. (2013): Automatic Prediction of CEFR Proficiency Levels Based on Linguistic Features of Learner Language. Master's thesis, Universität Tübingen, April 2013.
Hancke, J./Meurers, D./Vajjala, S. (2012): Readability Classification for German using lexical, syntactic, and morphological features. In: Proceedings of the 24th International Conference on Computational Linguistics (COLING), 1063-1080.
Hasil, J./Hájková, E./Hasilová, H. (2007): Brána jazyka českého otevřená. Prague: Karolinum.
Hawkey, R./Barker, F. (2004): Developing a Common Scale for the Assessment of Writing. In: Assessing Writing 9, 122-159.
Hawkins, J. A./Filipović, L. (2012): Criterial features in L2 English: Specifying the reference levels of the Common European Framework. Cambridge: Cambridge University Press.
Housen, A./Kuiken, F. (2009): Complexity, Accuracy, and Fluency in Second Language Acquisition. In: Applied Linguistics 30 (4), 461-473.
Hulstijn, J. H. (2007): The shaky ground beneath the CEFR: Quantitative and qualitative dimensions of language proficiency. In: The Modern Language Journal 91, 663-667.
Hulstijn, J. H./Alderson, C./Schoonen, R. (2010): Developmental stages in second-language acquisition and levels of second-language proficiency: Are there links between them? In: Bartning, I./Martin, M./Vedder, I. (eds.): Communicative Proficiency and Linguistic Development: intersections between SLA and language testing research. Eurosla Monograph Series.
Johns, T. (1988): Whence and whither classroom concordancing? In: Bongaarts, T./de Haan, P./Lobbe, S./Wekker, H. (eds.): Computer Applications in Language Learning. Dordrecht: Foris, 9-33.
Johns, T. (1997): Contexts: The Background, Development and Trialling of a Concordance-based CALL Program. In: Wichmann, A./Fligelstone, S./McEnery, T./Knowles, G. (eds.): Teaching and Language Corpora. London: Longman, 100-115.
Laufer, B./Nation, P. (1995): Vocabulary size and use: lexical richness in L2 written production. In: Applied Linguistics 16, 307-322.
Little, D. (2007): The Common European Framework of Reference for Languages: Perspectives on the Making of Supranational Language Education Policy. In: The Modern Language Journal 91, 645-655.
Lu, X. (2011): A corpus-based evaluation of syntactic complexity measures as indices of College-level ESL writers' language development. In: TESOL Quarterly 45 (1), 36-62.
Lu, X. (2010): Automatic analysis of syntactic complexity in second language writing. In: International Journal of Corpus Linguistics 15 (4), 474-496.
Lüdeling, A. (2008): Mehrdeutigkeiten und Kategorisierung: Probleme bei der Annotation von Lernerkorpora. In: Walter, M./Grommes, P. (eds.): Fortgeschrittene Lernervarietäten: Korpuslinguistik und Zweitsprachenerwerbsforschung. Tübingen: Niemeyer, 119-140.
Lüdeling, A./Walter, M./Kroymann, E./Adolphs, P. (2005): Multi-level Error Annotation in Learner Corpora. In: Hunston, S./Danielsson, P. (eds.): Proceedings from the Corpus Linguistics Conference Series (Corpus Linguistics 2005, Birmingham, 14-15 July 2005).
Malvern, D./Richards, B./Chipere, N./Durán, P. (2008): Lexical Diversity and Language Development. Quantification and Assessment. New York: Palgrave Macmillan.
Mellor, A. (2011): Essay Length, Lexical Diversity and Automatic Essay Scoring. In: Memoirs of the Osaka Institute of Technology, Series B Vol. 55, No. 2 (2011), 1-14.
Meurers, D. (2012): Natural Language Processing and Language Learning. Encyclopedia of Applied Linguistics. Blackwell.
Mezzadri, M. (2000): Rete! Book 1. Perugia: Guerra Edizioni.
Müller, C./Strube, M. (2006): Multi-Level Annotation of Linguistic Data with MMAX2. In: Braun, S./Kohn, K./Mukherjee, J. (eds.): Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods. Frankfurt: Peter Lang, 197-214.
Nation, P. (2001): Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nation, P. (2007): Fundamental issues in modelling and assessing vocabulary knowledge. In: Daller, H./Milton, J./Treffers-Daller, J. (eds.): Modelling and Assessing Vocabulary Knowledge. Cambridge: Cambridge University Press.
Nesselhauf, N. (2005): Collocations in a Learner Corpus. Amsterdam: John Benjamins.
North, B. (2000): The Development of a Common Framework Scale of Language Proficiency. Oxford: Peter Lang.
O'Loughlin, K. (1995): Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. In: Language Testing 12 (2), 217-237.
Ortega, L. (2003): Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. In: Applied Linguistics 24 (4), 492-518.
Paquot, M./Granger, S. (2012): Formulaic language in Learner Corpora. In: Annual Review of Applied Linguistics 32, 130-149.
Pollitt, A./Murray, N. L. (1996): What raters really pay attention to. In: Milanovic, M./Saville, N. (eds.): Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium. Cambridge: Cambridge University Press, 74-91.
Read, J./Nation, P. (2004): Measurement of formulaic sequences. In: Schmitt, N. (ed.): Formulaic sequences: Acquisition, processing and use. Amsterdam: John Benjamins, 23-35.
Read, J. (2000): Assessing vocabulary. Cambridge: Cambridge University Press.
Reznicek, M./Lüdeling, A./Krummes, C./Schwantuschke, F./Walter, M./Schmidt, K./Hirschmann, H./Andreas, T. (2012): Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 2.01. HU Berlin.
Reznicek, M./Lüdeling, A./Hirschmann, H. (in print): Competing Target Hypotheses in the Falko Corpus. A Flexible Multi-Layer Corpus Architecture. In: Díaz-Negrillo, A./Ballier, N./Thompson, P. (eds.): Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: John Benjamins (Series Studies in Corpus Linguistics).
Rimrott, A./Heift, T. (2008): Evaluating automatic detection of misspellings in German. In: Language Learning & Technology 11 (3), 73-92.
Römer, U. (2010): Using general and specialized corpora in English language teaching: past, present and future. In: Campoy-Cubillo, M. et al. (eds.): Corpus-based approaches to English Language Teaching. London: Continuum, 18-38.
Römer, U. (2008): Corpora and language teaching. In: Lüdeling, A./Kytö, M. (eds.): Corpus Linguistics. An International Handbook (volume 1). [HSK series] Berlin: Mouton de Gruyter, 112-130.
Römer, U. (2006): Pedagogical applications of corpora: some reflections on the current scope and a wish list for future developments. In: Zeitschrift für Anglistik und Amerikanistik 54 (2), 121-134.
Schmitt, N./Carter, N. (2004): Formulaic sequences in action: An Introduction. In: Schmitt, N. (ed.): Formulaic sequences: Acquisition, processing and use. Amsterdam: John Benjamins, 1-21.
Schneider, J. G. (2013): Sprachliche ‚Fehler‘ aus sprachwissenschaftlicher Sicht. In: Sprachreport 1-2/2013, 30-37.
Spinelli, B./Parizzi, F. (eds.) (2010): Profilo della lingua italiana. Firenze: La Nuova Italia.
Stede, M. (2007): Korpusgestützte Textanalyse. Grundzüge der Ebenen-orientierten Textlinguistik. Tübingen: Narr.
Trosborg, A. (1995): Interlanguage Requests and Apologies. Berlin: de Gruyter.
Vajjala, S./Meurers, D. (2012): On Improving the Accuracy of Readability Classification using Insights from Second Language Acquisition. In: Tetreault, J./Burstein, J./Leacock, C. (eds.): Proceedings of the 7th Workshop on Innovative Use of NLP for Building Educational Applications (BEA7) at NAACL-HLT. Montreal, Canada: Association for Computational Linguistics, 163-173.
Vaughan, C. (1991): Holistic assessment: What goes on in the rater's mind? In: Hamp-Lyons, L. (ed.): Assessing Second Language Writing in Academic Contexts. Norwood: Ablex, 111-125.
Wisniewski, K. (2013): The empirical validity of the CEFR fluency scale: the A2 level description. In: Galaczi, E.D./Weir, C.J. (eds.): Exploring Language Frameworks: Proceedings of the ALTE Krakow Conference. Cambridge: Cambridge University Press, 253-272. Studies in Language Testing.
Wisniewski, K. (2014): Die Validität der Skalen des Gemeinsamen europäischen Referenzrahmens für Sprachen. Eine empirische Untersuchung der Flüssigkeits- und Wortschatzskalen des GeRS am Beispiel des Italienischen und des Deutschen. Frankfurt: Peter Lang. Language Testing and Evaluation Series, 33.
Wisniewski, K./Schöne, K./Nicolas, L./Vettori, C./Boyd, A./Meurers, D./Abel, A./Hana, J. (2013): MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data. In: ICT for Language Learning, Conference Proceedings 2013. Edizioni.
Wisniewski, K./Abel, A. (2012): Die Sprachkompetenzerhebung: Theorie, Methoden, Qualitätssicherung. In: Abel, A./Vettori, C./Wisniewski, K. (eds.): Gli studenti altoatesini e la seconda lingua: indagine linguistica e psicosociale. / Die Südtiroler SchülerInnen und die Zweitsprache: eine linguistische und sozialpsychologische Untersuchung. Volume 1 - Band 1. Bolzano - Bozen: Eurac, 13-64.
Wolfe-Quintero, K./Inagaki, S./Kim, H.-Y. (1998): Second Language Development in Writing: Measures of Fluency, Accuracy & Complexity. Honolulu: Second Language Teaching & Curriculum Center, University of Hawaii at Manoa.
Yang, W./Sun, Y. (2012): The use of cohesive devices in argumentative writing by Chinese EFL learners at different proficiency levels. In: Linguistics and Education, 23 (1), 31-48.
Wray, A. (2002): Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.
Zeldes, A./Ritz, J./Lüdeling, A. et al. (2009): Annis: A search tool for multi-layer annotated corpora. In: Proceedings of Corpus Linguistics, July 20-23. Liverpool.
Zipser, F./Romary, L. et al. (2010): A model oriented approach to the mapping of annotation formats using standards. In: Workshop on Language Resource and Language Technology Standards, LREC 2010.