11/7/2005

Search Engine

Filed under: — Massimo @ 9:53 am

I agree with Mike that the Balzac example is an interesting one, although it clearly applies to a corpus (oeuvre) by a single author. From this point of view, let’s remember that VHL is not a “single author” project - I find more affinities with the WWP or the EEBO. Of course, the search engine is a valuable tool for annotating. However, what our search engine should be able to do, eventually, is to maximize the possibilities embedded in our “differentiated” encoding. For example: cross-referencing names, places and dates, visualizing text strings and paragraphs etc., but also allowing us to perform more sophisticated searches for authorial, thematic and semantic/rhetorical structures as we identify and encode them in the various texts (what fields would be appropriate for these other tasks?). Our goal is to enable a comparative and explorative approach to texts that belong to the same cultural context but also to different typologies of writing and rhetorical genres (we have chosen these texts precisely because of the wide spectrum they represent). How does the search engine help us reach that goal? Another question raised by Mike: keeping commentary and text separate is OK, but isn’t encoding a form of embedded commentary? Does Mike mean annotations? Will we be able to search annotations as well - in relation to text - once we have a significant amount of annotations? I suppose we can proceed by stages and add functionality and power to our engine as we progress in the encoding and annotating process. However, in designing it, one of the fundamental prerequisites we should keep in mind is its “expandability” - to keep it open to the possibilities that lie ahead of us, including potential applications in the seminar room.

10/17/2005

Villani: next steps

Filed under: — vika @ 7:51 pm

I’ve been thinking about what to do next with the Villani text. Considering we’ve been dealing in detail with the Esposizioni, the work we need to do between now and putting Villani online should be simple and quite similar to what we’ve done before, shouldn’t it? Well, yes and no: our two Villani editors-and-encoders are both at institutions other than Brown, not particularly close to each other. I’ll be getting a bit more involved in Villani as well, documenting the process as thoroughly as possible. All of this points to the blog as a good, systematic communication system that hopefully opens us up to critique and other helpful input from colleagues we don’t even know we have.

So, this is a first attempt at a to-do list for Villani. It’s tentative, but I’m pretty sure we have to do the first two bullet points before moving on to other things.

  • Go through DTD and correct the typos: I’ve used the power of software to create a DTD (Document Type Definition, where the formal rules for encoding are stored). We won’t actually be using the DTD to prescribe encoding, but making it automatically has helpfully revealed some typos in the XML, which need to be corrected. More on this below.
  • Debug attributes: A bunch of unconventional (to me) stuff was done with the encoding; some attributes have more than one value and we decided to use blank spaces to represent that (for example, <person role="duke captain">). But we must go through all the attributes again and delete any stray spaces: for example, it would be confusing to write <person role="duke captain ruler of the people"> because “of” is not on the same semantic level as “duke”. I will leave that to Matt and Rala, and will help as needed.
  • Run text by Paul: I’m pretty sure the blank space thing is going to be all right, but I’m not comfortable enough with how XML interacts with other languages we’re using in the back end (notably PHP). So I’ll show it to Paul and ask him whether our mark-up is unnecessarily clunky in a glaring way. Anything that needs fixing won’t need re-encoding from scratch; we’ll find a way to automate as much of the transformation as possible, if it comes to that.
  • Put the text online: …and debug until it works. :)
  • Blog encoding principles: Matt and Rala will post here a completist account of their encoding principles, perhaps with a chart or table of all elements and attributes accompanied by the reasoning behind them. This will be the beginning of a document that will be publicly linked from the main VHL site, and that will serve as a starting point for introducing other scholars to our work.
  • Move on!: Then the real fun begins: I will perhaps do a first pass at indexes (*cough* with a little help from my friends…) while the Villani scholars annotate?
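
To make the "debug attributes" step concrete, here is a minimal sketch in Python of how space-separated attribute values could be checked automatically. It is an illustration only, not the project's actual tooling: the KNOWN_ROLES vocabulary, the find_suspect_roles helper and the sample text are all hypothetical, and the real list of permitted roles would come from the encoders.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary; the real list of roles is not given here.
KNOWN_ROLES = {"duke", "captain"}


def find_suspect_roles(xml_text):
    """Return (role attribute, unknown tokens) for each <person> with odd values."""
    root = ET.fromstring(xml_text)
    problems = []
    for person in root.iter("person"):
        # A blank space separates multiple values, so split on whitespace.
        tokens = person.get("role", "").split()
        unknown = [t for t in tokens if t not in KNOWN_ROLES]
        if unknown:
            problems.append((person.get("role"), unknown))
    return problems


# Placeholder content ("A", "B") stands in for real names from the text.
SAMPLE = """<text>
  <person role="duke captain">A</person>
  <person role="duke captain ruler of the people">B</person>
</text>"""

for role, unknown in find_suspect_roles(SAMPLE):
    print(f"check role={role!r}: unrecognized tokens {unknown}")
```

Splitting on whitespace treats every word as a separate value, which is exactly why a phrase like "ruler of the people" shows up as four suspect tokens rather than one role.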

Comments? Additions? Subtractions?

3/3/2005

New semester, new update.

Filed under: — vika @ 5:47 pm

Okay, so it’s well into the new semester. But, although things have been slow here on the weblog, they’ve been moving at a pretty exciting speed here at VHL HQ.

In a nutshell:

  • Paul has created the annotation engine! It was tested with a dummy text, and next we’ll be testing it with real ones.
  • Part of the Esposizioni, just over 2000 of its 5000 paragraphs, is now ready to be put up, tested and cleaned up. Segments of Villani to follow imminently.
  • It looks like several of us are presenting VHL in many different venues; as such, at a meeting held last weekend we decided it would be great to have some business cards. I’m looking into that.
  • This semester will be dedicated to testing the annotation engine, beginning to actually annotate the texts we’re putting up, and hopefully gaining some momentum in terms of public interest by contacting people who might want to annotate specifically the texts we’re putting up. Italianists, medievalists – anyone reading this want to play? Please let us know in comments if you do.
  • We’re leaving the search engine and generating indices mostly for the second year of the grant. That’ll be a big project.

Aside from writing the papers I’ll be presenting in March, June and July, as well as checking in on the progress of various project components, I’ll be concentrating on getting the texts to play nicely with the code. Having just spent several hours cleaning up the Esposizioni chunk a bit (nothing major, just structural well-formedness), I’m excited to get my hands on actual code again. It’s so… logical. XML is a great toy, and tool.

Guyda and I have been asked for a few screenshots of the annotation engine, for the Digital Medievalist article; I’ll post them here when I make them.

My colleagues can update the world about their own work in more detail. Undoubtedly I’m missing something(s). Feel free to supplement this short report, o Team.

1/6/2005

a small query about quotation marks

Filed under: — mike @ 2:13 pm

While going through canto XIII, I noticed that there are some odd inconsistencies in the use of quotation marks (odd in the sense that they are replaced by italics sometimes in our text, but not always). For the sake of documentation, the paragraphs in question are:

XIII.let.8, 9, 12, 13, 14, 22, 23, 24, 30, 33, 57, 59, 60, 61, 64, 84, 86 (twice), 87, 90, 93, 102, 105, 107, 111 and 112.

86 is a good example. You can see the inconsistencies quite clearly there. What’s our position on this?

M

p.s. Paragraph 18 was missing from the text. I have it, however, if anyone wants it…

9/30/2004

Roberto’s update on the encoding

Filed under: — roberto @ 6:57 pm

I am encoding glosses, names and a few themes in the Esposizioni:

a) Glosses.
I am glossing two kinds of terms that Boccaccio himself explains and defines in the text. In our editions, these terms are either:
- in quotation marks
- in italics
For Boccaccio’s explanations of terms that do not fall into either of the above categories, I am leaving comments for Cristiana and all other collaborators: later we will have to decide whether these terms also deserve a gloss.

b) Names.
I am encoding four types of names, with or without glosses explaining their meanings in Boccaccio’s own words. Names can refer to a:
- Person. Yes/no subcategories: collective (ex. “fiorentini”, “centauri”), mythological, biblical (Vika and I have decided to consider “biblical” names as part of the “mythical” category as well; “mythical” are also fictional characters, for example Dante-personaggio as opposed to Dante-autore, when the distinction is applicable). For the moment, I have decided not to encode all the different names for “God” (“Dio”, “Creatore” etc.).
- Place. Yes/no subcategories: mythological (non-“real” biblical place names included)
- Myth-entity. Mythical characters that are also names for places (ex. “Oceano”)
- Work (of art/literature). Ex. Eneide, Timeo… With specification of the author. I am encoding the references to “Divina Commedia” as “comedia”, and I am not considering the three separate “cantiche” as works in themselves (thus I am not encoding the occurrences of “inferno”, “purgatorio”, “paradiso” as parts of the “Divina Commedia”, but I do encode them when they are names of places).
For the moment, I am not encoding names of places, persons etc. when they appear in citations in languages other than Italian (for ex. Latin).
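
As a rough illustration of how the four name types could later feed an index, here is a small Python sketch. The element names (person, place, mythentity, work), the attributes and the build_name_index helper are assumptions made for this example; they are not the markup actually used in the Esposizioni files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical element names for the four name types; the real markup may differ.
NAME_ELEMENTS = ("person", "place", "mythentity", "work")


def build_name_index(xml_text):
    """Group the text of every name element by its type, sorted alphabetically."""
    root = ET.fromstring(xml_text)
    index = defaultdict(set)
    for elem in root.iter():
        if elem.tag in NAME_ELEMENTS:
            name = "".join(elem.itertext()).strip()
            index[elem.tag].add(name)
    return {kind: sorted(names) for kind, names in index.items()}


# Sample names are taken from the examples above; the attributes are illustrative.
SAMPLE = """<p>
  <person collective="yes">fiorentini</person>
  <place>inferno</place>
  <mythentity>Oceano</mythentity>
  <work author="Virgilio">Eneide</work>
  <person collective="yes">centauri</person>
</p>"""

for kind, names in build_name_index(SAMPLE).items():
    print(kind, names)
```

Using a set per type means a name that occurs many times appears once in the index; a fuller version would also record where each occurrence sits in the text.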

c) Themes.
I began encoding some passages according to very general abstract theme-categories such as: time, sexuality, health etc. I have labelled these categories with the mark “temrob” (= “tema-roberto”), that is: those are themes identified by Roberto, all of you are free to add others or discuss changes, sub-divisions etc. A theme suggested by Cristiana, for example, could be marked “temcri”, and the whole matter be discussed in a meeting later on (but I think it is important to start identifying these themes now, just to have a general idea and classification…)
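
The labelling convention above (“tem” plus a short tag for the person who proposed the theme) lends itself to simple grouping when the themes come up for discussion. The following Python sketch assumes the themes are available as (label, theme) pairs; that data shape and the group_by_proposer helper are hypothetical, not part of the current encoding.

```python
from collections import defaultdict

PREFIX = "tem"  # as in "temrob" = "tema-roberto"


def group_by_proposer(tagged_themes):
    """Group theme names by the tag that follows the 'tem' prefix in each label."""
    groups = defaultdict(list)
    for label, theme in tagged_themes:
        if not label.startswith(PREFIX) or len(label) == len(PREFIX):
            raise ValueError(f"unexpected theme label: {label!r}")
        groups[label[len(PREFIX):]].append(theme)
    return dict(groups)


# Illustrative data only: which themes carry which label is an assumption.
EXAMPLE = [("temrob", "time"), ("temcri", "health"), ("temrob", "sexuality")]

for proposer, themes in group_by_proposer(EXAMPLE).items():
    print(proposer, themes)
```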

As you can see, this work of encoding already raises some issues to discuss and choices to make. It goes without saying that your opinion is more than welcome. Other issues and problems I have encountered so far, and on which I would like to hear your comments, are the following:
- I wrote “check” in those cases when I am not 100% sure of a piece of information, for example the author of a certain work, whether a character is real or mythological etc.
- When Boccaccio refers to biblical psalms, he uses expressions such as “il Salmista scrive”. This problem could be solved in many different ways. I have chosen to encode “il Salmista” not as a “work” but as a person (mythological, biblical).

9/10/2004

Esposizioni work has started: an update

Filed under: — vika @ 2:41 pm

Thanks to Ethan Fremen’s help in setting up our versioning system, we have started work on the Esposizioni. At the moment, our list of points to focus on includes the following:

  • basic structural elements, such as chapter divisions, the divisions between literal and allegorical exposition in each chapter, paragraphs and milestones (following the numeration established in the critical edition);
  • proper names, along with contextual information about the people, places and entities they represent;
  • citations from other authors quoted by Boccaccio and notes as to the erroneous nature of some of these citations (with information about their authors and the works in which they appear – in some cases, the quotation is anonymous, and we have found that these tend to be either proverbial sayings or Boccaccian constructions intended to give personal statements a more general import);
  • Greek and Latin terms and their meanings, to be indexed;
  • the many terms, Italian as well as foreign, that Boccaccio treats as lemmas, explaining their meaning in detail to his audience, and the definitions themselves; these lemmas will also be indexed, and the glosses cross-referenced to them – particularly useful in cases where Boccaccio defines terms more than once;
  • rhetorical devices explicitly and implicitly used by Boccaccio;
  • the complex rhetorical structure that Boccaccio lays out every so often, and whether he follows through;
  • digressions, in which Boccaccio stops addressing Dante’s text directly and spends, at times, entire pages addressing a broad topic of interest to him (for example, poetry)

Guyda has agreed to tackle Boccaccio’s citations of other authors. Roberto has started on the encoding of terms for the index. Cristiana is researching and annotating Boccaccio’s rhetorical devices, as well as the highly irregular rhetorical structure of the work. The main focus right now is the Accessus and the two expositions of Canto I.

What’re we missing? Are there other elements of the work that will be absolutely necessary to encode on a first pass, that are missing from the above bulleted list? What about a wish list, things we could keep in mind for encoding in the future?