About the project


The Needle-in-a-Haystack project examines subtle changes in the perception of texts over time in two languages: Czech and English. It serves as a tool to enhance the interpretation of texts by providing objective, reproducible outputs. One of the most relevant applications of the method is the study of historical documents and of how audiences from different periods might have interpreted them. This process is often akin to finding a needle in a haystack. Because our goal is to test the applicability and capability of the method, we combine quantitative and qualitative analysis and examine how well the two approaches complement each other.

NHM and text analysis

The principal investigators study the extent to which KWords, combined with appropriate interpretive steps, aids the understanding of texts. Our hypothesis is that the method can help answer questions in the following areas:

  • Reconstruction of "the historical reader": How did a reader living under the totalitarian regime read texts published in his or her time? How does that reading differ from today's interpretation of the same texts?
  • Comparison between the original and its adaptation (e.g. comparison between the original fairy tale and a sanitized version for children; comparison of revisions made to a text by the author/censors)
  • Different views of the same event (e.g. textbook descriptions from different times)
  • Detection of sociocultural shifts reflected in a set of texts published over a stretch of time (e.g. Do presidential New Year's addresses, even when they are ritualistic and contain no apparent changes, reflect subtle changes in society?)
  • Inter-language reception of a text: effects of a text in the original language vs. effects of its translation (Czech and English)

Current phase of the project

The current phase of the NHM project focuses on developing and testing the software application for keyword analysis. We have been examining the extent to which the data from KWords are useful for the analysis of texts in different genres; this involves both quantitative and qualitative analysis of texts. We plan to expand the project beyond Czech and English and to strengthen the applicability of the method in English.

The software application in use, KWords (public domain), has so far been equipped with the following functions:

  • Extraction of keywords and thematic concentration words. Keywords are word forms that make the target text unique in contrast to the general pattern of language (represented by large reference corpora). Keywords are closely connected to what the text is about (Scott and Tribble 2006), i.e. topics of the text.
  • Ranking of keywords according to effect size
  • Distribution plot of keywords. The plot maps out how the keywords are distributed in the text, e.g. clustered in one place or scattered throughout. It suggests how topics may be spread out or concentrated in a text.
  • Visualization of keyword links. KWords identifies and counts the links between each keyword and the other keywords and visualizes these connections by means of the D3.js library. Keyword links suggest how topics are connected.
  • Concordance of keywords shows what words co-occur in the proximity of each keyword.
  • Comparison of texts. KWords allows simultaneous analysis of up to twenty texts. The application shows similarities and differences in their keyword inventories.
  • Flexibility and customization. KWords provides a wide range of parameters, including the reference corpus, test statistic, significance level, and stop list (a list of words to exclude from the analysis).
  • Language and reference corpus. KWords currently allows analysis of texts written in Czech and English. Several reference corpora are available for both languages, including COCA, BNC, InterCorp, and SYN2010.
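
The keyword extraction, effect-size ranking, and distribution functions listed above can be illustrated with a minimal Python sketch. It is not the KWords implementation. It assumes the log-likelihood test (G2) as the test statistic and the log ratio of relative frequencies as the effect size, both common choices in corpus linguistics; the statistics actually used by KWords may differ. All function names, the default critical value of 3.84 (p < 0.05 at one degree of freedom), and the 0.5 smoothing for words absent from the reference corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """G2 statistic for a word occurring a times in n1 target tokens
    and b times in n2 reference tokens (assumed test statistic)."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in the target text
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def log_ratio(a, b, n1, n2):
    """Effect size: binary log of the ratio of relative frequencies.
    A count of 0.5 stands in for words missing from the reference
    corpus (an illustrative smoothing choice)."""
    rel_target = a / n1
    rel_ref = (b if b > 0 else 0.5) / n2
    return math.log2(rel_target / rel_ref)


def extract_keywords(target_tokens, reference_counts, reference_size,
                     critical=3.84, stop_list=frozenset()):
    """Return (word, effect_size, g2) tuples for words that are
    significantly over-represented in the target text, ranked by
    effect size in descending order."""
    counts = Counter(target_tokens)
    n1 = len(target_tokens)
    keywords = []
    for word, a in counts.items():
        if word in stop_list:
            continue
        b = reference_counts.get(word, 0)
        g2 = log_likelihood(a, b, n1, reference_size)
        effect = log_ratio(a, b, n1, reference_size)
        if g2 >= critical and effect > 0:
            keywords.append((word, effect, g2))
    return sorted(keywords, key=lambda k: k[1], reverse=True)


def keyword_positions(target_tokens, keyword):
    """Relative positions (0 to 1) of a keyword in the text, i.e. the
    raw data behind a distribution plot."""
    n = len(target_tokens)
    return [i / n for i, tok in enumerate(target_tokens) if tok == keyword]


# Toy example with invented reference counts (not real corpus data).
tokens = "the needle is in the haystack and the needle is sharp".split()
reference = {"the": 60000, "is": 10000, "in": 20000, "and": 30000,
             "needle": 5, "haystack": 2, "sharp": 50}
ranked = extract_keywords(tokens, reference, reference_size=1_000_000,
                          stop_list={"the", "is", "in", "and"})
for word, effect, g2 in ranked:
    print(f"{word}\tlog ratio={effect:.2f}\tG2={g2:.2f}")
```

Ranking by effect size rather than by the test statistic keeps very frequent words from dominating the list merely because large counts yield large G2 values. The positions returned by `keyword_positions` show whether a keyword clusters in one part of the text or recurs throughout it.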