Talks, presentations, case studies

Blurbs of the Workshop Talks

Keywords and beyond (Paul Baker, Lancaster University, UK)

Keywords are words which are statistically significantly more frequent in one corpus when compared against another. This talk discusses how they can act as signposts, funnelling a corpus analysis down to a manageable set of salient items and identifying concepts that analysts may not have guessed were salient. To an extent, they can help to introduce a level of objectivity into research. Using examples from my own work on diachronic change in British English, the representation of Muslims in the British press, and parliamentary debates on fox hunting, I develop the notion of keyness to consider key lexical bundles and tags. I also discuss how keywords can be derived when working with more than two corpora.
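As a concrete illustration (not drawn from the talk itself), the sketch below computes keyness with the log-likelihood (G²) statistic, one of the standard measures; the toy corpora and the function name keyness_g2 are hypothetical.

```python
import math
from collections import Counter

def keyness_g2(word, target_counts, target_total, ref_counts, ref_total):
    """Log-likelihood (G2) keyness of `word` in the target corpus
    relative to the reference corpus."""
    a = target_counts[word]  # observed frequency in the target corpus
    b = ref_counts[word]     # observed frequency in the reference corpus
    # Expected frequencies under the null hypothesis that the word
    # is equally likely in both corpora.
    e1 = target_total * (a + b) / (target_total + ref_total)
    e2 = ref_total * (a + b) / (target_total + ref_total)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Hypothetical toy corpora, already tokenised into word lists.
target = "the fox hunt ban divides the house".split()
reference = "the weather in the house was mild".split()
tc, rc = Counter(target), Counter(reference)
for w in sorted(set(target)):
    print(w, round(keyness_g2(w, tc, len(target), rc, len(reference)), 2))
```

Words scoring highest are the keywords of the target corpus; in practice a significance cut-off (and often a minimum frequency threshold) is applied to the ranked list.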

The Theory behind Keyword Analysis (Václav Cvrček, Charles University in Prague, Czech Republic)

Corpus-based discourse analysis – an umbrella term for a bundle of methods – has seen increasing use in recent years because of its ability to reduce researcher bias. The use of large language corpora in text analysis and interpretation facilitates maximally objective observations of patterns that might otherwise remain hidden. One of the most widely used corpus-based discourse-analytical methods is keyword analysis, which provides the researcher with a set of prominent words (keywords) by contrasting the text under study with a reference corpus. The resulting list of keywords can subsequently inform the interpretation of a text, corpus, or discourse. This talk will discuss methods of obtaining prominent words from a text as well as the basic statistical background for measuring “keyness”. It will also introduce KWords, an online tool for keyword analysis of Czech and English texts created through collaboration between the Institute of the Czech National Corpus and Brown University.
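For reference, one standard measure of keyness (the blurb does not specify which statistics the talk will cover) is the log-likelihood statistic, also used in the sketch above. It compares a word's observed frequencies O_text and O_ref in the text and in the reference corpus against the frequencies expected if the word were equally likely in both:

\[
G^2 = 2 \sum_{i \in \{\text{text},\,\text{ref}\}} O_i \ln \frac{O_i}{E_i},
\qquad
E_i = N_i \cdot \frac{O_\text{text} + O_\text{ref}}{N_\text{text} + N_\text{ref}}
\]

where N_text and N_ref are the sizes of the two corpora. The larger G² is, the stronger the word's keyness.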

Topic Modeling (Brian Croxall, Brown University) (Notes for the presentation are forthcoming; in the meantime, if you have questions, please write to the author.)

Imagine that you want to know what sort of topics show up in a novel or a newspaper. You could just read the book or the paper, and then you would be able to tell someone what subjects it discussed. But how would you do this with several decades' worth of newspapers or the entire output of an author over the course of her life?
Topic modeling is a statistical method for doing precisely this work. In this presentation, we will cover the basics of topic modeling, including several software packages for implementing the method. I will draw from my own work modeling the complete fiction of Ernest Hemingway as a way to find patterns across his four decades of writing.
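As a minimal illustration of the kind of workflow such a presentation might demonstrate, the sketch below fits a latent Dirichlet allocation (LDA) model with scikit-learn; the three-sentence corpus, the choice of two topics, and the use of scikit-learn rather than any package actually covered in the talk are all assumptions for the sake of the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in corpus: one string per document
# (in real use, e.g. a chapter, a newspaper issue, a short story).
docs = [
    "the bulls ran through the streets of pamplona",
    "the old man fished alone in the gulf stream",
    "the soldiers retreated through the rain toward the river",
]

# Bag-of-words counts; English stop words are removed before modeling.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit an LDA model with an (here arbitrary) choice of 2 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the highest-weighted words for each inferred topic.
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(vocab[i] for i in top))
```

Each "topic" is simply a probability distribution over the vocabulary; interpreting those word clusters (e.g. as themes recurring across decades of writing) is the analyst's job, which is the part the presentation addresses.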

The Czech National Corpus (Václav Cvrček, Charles University in Prague, Czech Republic)

Case Studies

What is corpus data an index of? A comparison with elicited data (or, a wug experiment for Czech) (Neil Bermel, Sheffield University, UK)

The Texas Czech Legacy Project (Lida Cope, East Carolina University)

What an inflectional language can tell us about texts (Masako Fidler, Brown University)