DIR2006 - Short trip report by Michiel

Dutch-Belgian Information Retrieval Workshop

12-13 March 2006, Delft

DIR2006
Author: Michiel
CWI participants: Michiel, INS1: Arjen, Roberto, Thijs, Georgina
# participants: 30

Overall impression

About 30 attendees, mainly Dutch PhD students, a few foreign students and a handful of senior researchers. Two interesting keynotes were scheduled. Both gave an overview of the speaker's research without going into much detail. There were 9 paper presentations, some on master's thesis projects but mostly on PhD work. Below I have only written about the keynotes and the most relevant talks.

Thoughts on Information Retrieval

There is a tendency to use some structure. Wikipedia seems popular because it is a semi-structured format. Research in information retrieval includes both raw text processing and structural querying. Large pieces of natural text have the advantage that they are easy to write and contain many words that can be used for keyword matching. I noticed in the e-culture demo that relevant results frequently come from a long literal such as a comment or a descriptive note: there is simply more text in which to find matches. One way to overcome the sparsity of literal text on the semantic web is to use synonyms, hypernyms, etc. These relations could come from WordNet, for example. In IR this seems to be advocated as well. Another approach, probably to be used in combination, is to use the structure of a document to find and explore related pieces of text. To determine what is related, both statistical techniques (on large training sets) and more logical tactics are used.
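The synonym/hypernym idea can be sketched in a few lines. This is only an illustration under my own assumptions: the two small dictionaries stand in for WordNet and their entries are made up, and the function names expand_query and matches are hypothetical.

```python
# Toy lexicon standing in for WordNet; all entries are invented for illustration.
SYNONYMS = {"painting": {"picture", "canvas"}}
HYPERNYMS = {"painting": {"artwork"}}


def expand_query(terms):
    """Return the query terms plus their synonyms and hypernyms, lowercased."""
    expanded = set()
    for term in terms:
        key = term.lower()
        expanded.add(key)
        expanded |= SYNONYMS.get(key, set())
        expanded |= HYPERNYMS.get(key, set())
    return expanded


def matches(literal, terms):
    """True if the literal shares at least one word with the expanded query."""
    words = set(literal.lower().split())
    return bool(words & expand_query(terms))


# A short literal that never mentions "painting" still matches via a synonym.
print(matches("A small canvas by Vermeer", ["painting"]))
```

With only exact keyword matching the short literal above would be missed, which is the sparsity problem in a nutshell.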

Monday

Keynote: Opportunities and Challenges in Applying IR Techniques to Bioinformatics

ChengXiang Zhai, University of Illinois

He started with a basic introduction to cell and molecular biology, focusing on DNA, transcription and protein creation. He made a nice analogy to object-oriented programming: both a cell and a program contain individually operating objects.

Then he switched to showing existing online tools that biologists frequently use, such as those of the National Center for Biotechnology Information (NCBI). NCBI gives access to literature and genome resources, and it provides various tools to search, mine and align these resources. I think many fields of science can learn from this.

Back to biology and the role of IR in it. The human genome has been completely sequenced. The next two secrets researchers are trying to uncover are:

When and how do genes become (in)active?
What are the active substances in a protein?

Microarrays are used to find out which genes are expressed. IR can help to find patterns of DNA switches that are shared between the expressed genes. He did not go into much detail about this. He mentioned Hidden Markov Models. Most of these things I had already heard from a friend who is a PhD student at the cancer institute.

Which motifs are responsible for a specific function of a protein? Many are discovered and published every day. They use a gene ontology to find functional descriptions of genes. Many genes are discovered in different settings, and people therefore use different names and terms for them, so the use of an ontology doesn't pay off completely. This sounds like an open issue for good search and navigation techniques.

Vague Element Selection and Query Rewriting for XML Retrieval

Vojkan Mihajlović, University of Twente

XPath plus vague keyword search. The first step is to add a search query to an XPath element, which makes it possible to select XML elements that contain given keywords. In addition they use an operator to indicate a vague element selection. A query without a vague operator does a static search on one specific XML element. A query with a vague operator can do a dynamic search on a group of "related" XML elements. The relations between elements are hand-coded in an expansion list.
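To make the difference between strict and vague selection concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. It does not reproduce the syntax or the scoring from the talk: the EXPANSION table, the select function and its vague flag are hypothetical names of mine, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-coded expansion list: element name -> "related" element names.
EXPANSION = {"title": ["title", "caption", "heading"]}


def select(root, tag, keyword, vague=False):
    """Return elements that contain the keyword.

    Strict mode only looks at elements named `tag`; vague mode also looks at
    the related element names from the expansion list.
    """
    tags = EXPANSION.get(tag, [tag]) if vague else [tag]
    hits = []
    for element in root.iter():  # document order, root included
        if element.tag in tags:
            text = "".join(element.itertext()).lower()
            if keyword.lower() in text:
                hits.append(element)
    return hits


# Invented sample document.
DOC = ET.fromstring(
    "<article><title>XML retrieval</title>"
    "<section><heading>Vague XML queries</heading><p>Body text</p></section>"
    "</article>"
)

print([e.tag for e in select(DOC, "title", "xml")])
print([e.tag for e in select(DOC, "title", "xml", vague=True)])
```

The strict query only finds the title element, while the vague query also returns the heading because the expansion list declares it related.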

Nice to know that with the semweb library from SWI, with its new keyword indexing, we can do all this: get the triples with a matching literal value. Relevance tags correspond somehow to our weighted nodes in a graph.

Tuesday

Keynote: Facing restrictions in question answering

Maarten de Rijke, UvA

In a restricted domain, question answering cannot make use of (IR's favorite) redundancy of information. More and deeper levels of annotation are needed. They use XIRAF to combine different language annotation tools. The result is a text file with multiple XML annotation trees. The question is how to combine multiple annotation types. Annotations can overlap or be contained in one another, and there are the inverse relations of these.

Based on these relations they defined 4 new axes to move between XML trees. Now they can answer a question such as "who killed Kennedy" by querying multiple annotation files as follows (in words): find a sentence with head "killed" that contains the string Kennedy, in this sentence find a noun phrase, and from this select the name of a person that is contained (along the axis that spans multiple annotation files) within the NP.
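The query in words can be imitated with stand-off annotations, where each annotation layer stores character offsets into the same text. This is only a sketch under my own assumptions and not how XIRAF works internally: the Span class, the layer names and the function names are hypothetical, the "head" test is simplified to a substring check, and the already known name from the question is filtered out by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str  # which (hypothetical) annotation tool produced the span
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def contains(outer, inner):
    """Containment relation: inner lies completely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a, b):
    """Overlap relation: the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


TEXT = "Oswald killed Kennedy."
SPANS = [
    Span("parser", "sentence", 0, 22),
    Span("parser", "np", 0, 6),
    Span("parser", "np", 14, 21),
    Span("ner", "person", 0, 6),
    Span("ner", "person", 14, 21),
]


def who(text, spans, verb, known):
    """Persons inside a noun phrase of a sentence mentioning verb and known."""
    answers = []
    for sent in (s for s in spans if s.label == "sentence"):
        sent_text = text[sent.start:sent.end]
        if verb not in sent_text or known not in sent_text:
            continue
        for np in (s for s in spans if s.label == "np" and contains(sent, s)):
            # This step crosses from the parser layer to the NER layer.
            for person in (s for s in spans
                           if s.label == "person" and contains(np, s)):
                name = text[person.start:person.end]
                if name != known:
                    answers.append(name)
    return answers


print(who(TEXT, SPANS, "killed", "Kennedy"))
```

The containment test between an np span from one layer and a person span from another layer plays the role of the axis that spans multiple annotation files.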

The need for additional axes in XML bears a small resemblance to the problem that we face when finding relevant resources in an RDF graph: which axes can you explore? However, I have no idea if and how this research could be useful.

Next Maarten showed two projects that involve Wikipedia. In one project they scrape Wikipedia for time data, create a structure out of this and publish it on wikitimeline.net. Judging by the documentation on the website, this looks like a student project. In the other project Wikipedia is used to determine the relevance of target sentences compared to reference sentences. They use a word overlap technique. It would be worth the time to look into some other techniques from IR for measuring relevance, to see what could be applied to RDF graphs.
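The talk did not say which word overlap measure was used. As a rough illustration I assume Jaccard overlap on lowercased, whitespace-separated words; the function names and the example sentences are my own.

```python
def word_overlap(target, reference):
    """Jaccard overlap between the word sets of two sentences (0.0 to 1.0)."""
    t = set(target.lower().split())
    r = set(reference.lower().split())
    if not t or not r:
        return 0.0
    return len(t & r) / len(t | r)


def rank(targets, references):
    """Sort target sentences by their best overlap with any reference sentence."""
    def best(target):
        return max(word_overlap(target, ref) for ref in references)
    return sorted(targets, key=best, reverse=True)


references = ["rembrandt painted the night watch"]
targets = [
    "the museum opens at nine",
    "rembrandt painted the night watch in 1642",
]
print(rank(targets, references))
```

A measure like this ignores word order and word importance, which is one reason to look at other IR relevance measures as well.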

Optimal link categorization for minimal retrieval effort

Vera Hollink, UvA

Dynamically create a hierarchical structure from a keyword search result set, using keywords from the HTML meta tags. The interesting part is that they try to find the optimal categorization based on the number of clicks needed to find the right result page.
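To get a feeling for what "optimal" could mean here, the following is a toy cost model of my own and not the model from the talk. It assumes every result page is equally likely to be the target, every step costs one click, and (as an extra assumption) scanning each link on a page costs a fraction scan_cost of a click. All names are hypothetical.

```python
def leaf_efforts(node, scan_cost=0.0, so_far=0.0):
    """Effort needed to reach each result page below node.

    node is a list whose items are either page names (str) or nested lists
    (categories).
    """
    # One click on this page, plus the assumed cost of scanning all its links.
    step = 1 + scan_cost * len(node)
    efforts = []
    for child in node:
        if isinstance(child, str):
            efforts.append(so_far + step)
        else:
            efforts.extend(leaf_efforts(child, scan_cost, so_far + step))
    return efforts


def expected_effort(tree, scan_cost=0.0):
    """Average effort over all result pages, assuming a uniform target."""
    efforts = leaf_efforts(tree, scan_cost)
    return sum(efforts) / len(efforts)


flat = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9"]
grouped = [["p1", "p2", "p3"], ["p4", "p5", "p6"], ["p7", "p8", "p9"]]

for cost in (0.0, 0.5):
    print(cost, expected_effort(flat, cost), expected_effort(grouped, cost))
```

Counting clicks alone always favors the flat list, so in this toy model a hierarchy only pays off once scanning long link lists has a cost. An optimizer would search over candidate groupings for the one with the lowest expected effort.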

I asked her what she thought of displaying the number of result pages (as is done in many faceted browsers). She replied that this is already included in the calculation of the optimal hierarchy structure. This definitely makes sense.

Is this useful for e-culture? In an ontology the hierarchy is already given. Should we change this? Probably a better solution for optimization on the semantic web would be to skip categories and change the order, layout and visualization (size, color, ...).