DIR2006
Author: Michiel
CWI participants: Michiel, INS1: Arjen, Roberto, Thijs, Georgina
# participants: 30
The talk started with a basic introduction to cell and molecular biology, focusing on DNA, transcription and protein creation. He made a nice analogy to object-oriented programming: both a cell and a program contain individually operating objects.
Then he switched to showing existing online tools that biologists frequently use, in particular those of the National Center for Biotechnology Information (NCBI). NCBI gives access to literature and genome resources, and it provides various tools to search, mine and align these resources. I think many science fields can learn from this.
Back to biology and the role of IR in it. The human genome has been completely sequenced. The next two secrets researchers try to uncover are:
Which motifs are responsible for a specific function of a protein. Many are discovered and published every day. They use a gene ontology to find functional descriptions of genes. Many genes are discovered in different settings, so they end up with varying names, and people keep using different names and terms. So the use of an ontology doesn't pay off completely. Sounds like an open issue for good search and navigation techniques.
XPath plus vague keyword search. The first step is to add a search query to an XPath element, which makes it possible to select the XML elements that contain given keywords. In addition they use an operator to indicate a vague element selection. A query without the vague operator does a static search on one specific XML element. A query with the vague operator can do a dynamic search on a group of "related" XML elements. The relations between elements are hand-coded in an expansion list.
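To make the difference between the static and the vague selection concrete, here is a minimal sketch in Python. It is not the syntax or implementation shown in the talk: the function name `select`, the `EXPANSION` table and the sample document are all made up for illustration, and "contains the keywords" is simplified to a case-insensitive substring match.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-coded expansion list: element name -> "related" element names.
EXPANSION = {"title": ["title", "subtitle", "heading"]}


def select(root, tag, keywords, vague=False):
    """Return elements whose text contains all keywords.

    Without `vague`, only elements named exactly `tag` are considered.
    With `vague`, every element name listed for `tag` in EXPANSION is allowed.
    """
    tags = EXPANSION.get(tag, [tag]) if vague else [tag]
    hits = []
    for elem in root.iter():  # document order
        if elem.tag not in tags:
            continue
        text = "".join(elem.itertext()).lower()
        if all(k.lower() in text for k in keywords):
            hits.append(elem)
    return hits


# Made-up sample document.
DOC = ET.fromstring(
    "<article>"
    "<title>Gene ontology search</title>"
    "<heading>Vague search in XML</heading>"
    "<body>Plain keyword search</body>"
    "</article>"
)

strict = select(DOC, "title", ["search"])             # static: only <title>
vague = select(DOC, "title", ["search"], vague=True)  # dynamic: <title> and related
```

With this sample, the static query only matches the `title` element, while the vague query also picks up `heading` because the expansion list declares it related to `title`.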
Nice to know that with the new keyword indexing in the semweb library from SWI we can do all this: get the triples with a matching literal value. Relevance tags correspond somehow to our weighted nodes in a graph.
In a restricted domain, question answering cannot make use of (IR's favorite) redundancy of information. More and deeper levels of annotation are needed. They use XIRAF to combine different language annotation tools. The result is a text file with multiple XML annotation trees. How to combine multiple annotation types? Annotations can overlap or can be contained in one another, and there are the inverses of these relations.
Based on these relations they defined 4 new axes to move between XML trees. Now they can answer a question such as "who killed Kennedy" by using multiple annotation files as follows (in words): find a sentence with head "killed" that contains the string Kennedy, in this sentence find a noun phrase, and from this select the name of a person that is contained (along the axis that spans multiple annotation files) within the NP.
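The query described in words above can be sketched with stand-off annotations, i.e. annotations stored as character offsets into one shared text. This is only an illustration under my own assumptions, not XIRAF's actual data model or query language: the `Ann` record, the layer names, the sample sentence and the final filtering step are all invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ann:
    layer: str      # which (hypothetical) annotation tool produced it
    label: str
    start: int      # character offset into TEXT
    end: int        # exclusive
    head: str = ""


def contains(a, b):
    """True if annotation b lies completely within annotation a."""
    return a.start <= b.start and b.end <= a.end


def overlaps(a, b):
    """True if a and b share characters but neither contains the other."""
    shares = a.start < b.end and b.start < a.end
    return shares and not contains(a, b) and not contains(b, a)


TEXT = "Oswald killed Kennedy in Dallas."

ANNS = [
    Ann("syntax", "s", 0, 32, head="killed"),
    Ann("syntax", "np", 0, 6),
    Ann("syntax", "np", 14, 21),
    Ann("syntax", "np", 25, 31),
    Ann("entities", "person", 0, 6),
    Ann("entities", "person", 14, 21),
    Ann("entities", "location", 25, 31),
]


def text_of(a):
    return TEXT[a.start:a.end]


def persons_in_sentences(anns, head, keyword):
    """Sentence with the given head containing keyword -> NP -> person inside the NP."""
    names = []
    for s in anns:
        if s.label != "s" or s.head != head or keyword not in text_of(s):
            continue
        for phrase in anns:
            if phrase.label != "np" or not contains(s, phrase):
                continue
            for p in anns:
                # Crossing from the syntax layer to the entity layer.
                if p.label == "person" and p.layer != phrase.layer and contains(phrase, p):
                    names.append(text_of(p))
    return names


candidates = persons_in_sentences(ANNS, "killed", "Kennedy")
# Extra step, my assumption: drop the name that was already in the question.
answers = [name for name in candidates if name != "Kennedy"]
```

The point of the sketch is that containment across layers is just a comparison of offsets, so the "person inside NP" step works even though the two annotations live in different trees.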
The need for additional axes in XML has a small resemblance to the problem that we face while finding relevant resources in an RDF graph: which axes can you explore? However, I have no idea if and how this research could be useful.
Next Maarten showed two projects that involve Wikipedia. In one project they scrape Wikipedia for time data, create a structure out of this and publish it on wikitimeline.net. Judging by the documentation on the website, this looks like a student project. In the other project Wikipedia is used to determine the relevance of target sentences compared to reference sentences. They use a word overlap technique. It would be worth the time to look into some other techniques from IR for measuring relevance, to see what could be applied to RDF graphs.
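For reference, a word overlap measure between a target sentence and a set of reference sentences could look like the sketch below. The talk did not specify the exact measure, so the choice of Jaccard overlap on lowercased word sets, the max over references, and the function names are my assumptions.

```python
import re


def words(sentence):
    """Lowercased set of alphanumeric tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def overlap(target, reference):
    """Jaccard overlap: shared words divided by all distinct words."""
    t, r = words(target), words(reference)
    if not t or not r:
        return 0.0
    return len(t & r) / len(t | r)


def relevance(target, references):
    """Score a target sentence by its best overlap with any reference sentence."""
    return max((overlap(target, ref) for ref in references), default=0.0)
```

A measure like this ignores word order and synonyms, which is exactly where other IR techniques might do better.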
Dynamically create a hierarchy structure from a keyword search result set, using keywords from the HTML meta tag. The interesting part is that they try to find the optimal categorization based on the number of clicks needed to find the right result page.
I asked her what she thought of displaying the number of result pages (as is done in many faceted browsers). She replied that this is already included in the calculation of the optimal hierarchy structure. This definitely makes sense.
Is this useful for e-culture? In an ontology the hierarchy is already given. Should we change this? Probably a better solution for optimization on the semantic web would be to skip categories and change the order, layout and visualization (size, color, ...).