The role of semantics in information access, a survey

Keyword search is the method of choice for end users to access information on the current web. Several attempts have been made to provide keyword search access to resources described on the semantic web as well as existing web documents enriched with semantic annotations. We will refer to both these approaches as semantic search. The systems that provide some form of semantic search show great variation. Our aim is to give an overview of the different notions of semantic search. In particular we are interested in the role of explicit semantics in the search process.

Within this survey we define keyword search as the process in which a user submits one or more search terms in free text, with optional control structures, and the system returns a set of result items, which can be organized in various ways. The systems are analyzed in four stages of the search process: input, processing, user feedback and search results. For each stage we consider the functionality that the system provides and how this is made available through the graphical user interface.

We try to give a complete overview of the different notions of keyword search that have been applied to semantic web data. We believe that, in pursuing this, it is not necessary to analyze all systems that support a form of semantic search. Instead it suffices to thoroughly analyze the best representative of each type of semantic search. In total we considered xx systems, all of which are listed in the system overview. A thorough analysis is performed on 15 applications. Additionally, we included a couple of systems that do not (yet) support keyword search, but for which adding it would clearly be of value.

Prerequisites

Several applications support operations to create the semantic web data to search on, such as named entity extraction, annotation, crawling and indexing. This lies outside the scope of this survey.

Many applications allow the user to view a specific resource and its direct metadata values: a local view. Some also allow browsing the semantic graph by making the metadata values hyperlinks. This is a useful technique to explore the RDF graph (see also browsers such as Tabulator and Disco). This functionality lies outside the scope of this survey. Unless additional features are presented in the local view, we do not mention this functionality.

System Overview

Name | Purpose | Users | Collection scope | Store/Index
Autofocus | Search engine, Browser | End users (medium-expert) | RDFized text documents | Sesame and application specific index
DBin | Information management | Developers. End users (medium-expert) | ? | ?
e-Culture | Portal, Search engine | End users (novice-expert) | Multiple data collections and thesauri in RDFS. Thesauri mappings in OWL | SWI Prolog Semantic Web Library triple store and literal index (Porter stemming)
Flink | Browser | End users (medium-expert) | RDFized documents | Sesame
Haystack | Information management | Developers and end users (medium-expert) | Semantic web |
Hybrid Search | Search engine | End users | Single collection in RDF | Application specific triple store. Lucene index
KIM | Information management, Search engine, Faceted browser | Developers and end users (medium-expert) | Text documents and an upper level ontology in OWL | Sesame triple store, Lucene indexing
Longwell | Faceted browser | End users (medium-expert) | Single data collection in RDF | Sesame
mspace | Browser | End users (medium-expert) | Single data collection in RDF |
Museum Finland | Portal, Faceted browser, Search engine | End users (novice-expert) | Multiple data collections in RDF and 1 upper ontology in OWL | Ontogator
OpenAcademia | Search engine, Browser | End users (novice-expert) | Multiple collections in RDF. Instance mappings in OWL | Sesame. Application specific index
OWLIR | Search engine | End users | Text documents + extracted RDF triples + additional scraped triples | Store: DAMLJessKB. WONDIR IR engine
QuizRDF | Search engine | End users (novice-expert) | Text documents. RDFS ontologies | Application specific index
SemSearch | Search engine | End users (medium-expert) | Single data collection in RDF | Sesame triple store and Lucene index
Slashfacet | Faceted browser | Developers. End users (medium-expert) | Multiple collections and thesauri in RDF | SWI Prolog Semantic Web Library
Squiggle | Search engine | End users (novice-medium) | RDFized image metadata and RDF thesaurus | Sesame triple store, Lucene literal index on record data and on literal values from triple store
Swoogle | Ontology search engine | Developers and software agents | Semantic web |
Tap: Semantic Search | Context based search engine | End users | Text documents and RDF ontologies | Tap framework

Excluded systems
Ontokhoj | Search engine | Developers | Semantic web | ?
OntoSearch | Search engine | Developers | Semantic web | ?
SHOE search tool | Search engine | Developers | Text documents with SHOE annotations | Application specific
InWiss Search | Search engine | End users | Data collections in RDF | Sesame. Application specific index
DOSE | Search engine | End users | Web documents and domain ontology | ?

System analysis

Search Input

Functionality
Interface

Processing

Functionality
Interface

User Feedback

Functionality
Interface

Search Results

Functionality
Interface
Search Input Processing User Feedback Search Results
Autofocus Functionality
  • Keyword search with multiple terms
  • Value selection from predefined facets
  • Literal matching: subword match on extracted terms
  • Retrieval (keyword search): resources with matching literal value
  • Retrieval (value selection): resources with selected value as metadata
  • Refinement: add new search term or facet value. In contrast to faceted browsing, the available values for refinement are not restricted to the current selection. This makes it possible to create multiple different intersections.
  • Set of items grouped by relation to constraints
Interface
  • Keyword search: text entry box and keyword suggestion list
  • Search options: checkboxes for specific metadata fields to search in
  • Value selection: selectable facets with grouped value list
  • Number of results per selected value or search term
  • Similar to initial search
  • Table with item metadata. Cluster map visualization
DBin Functionality
Interface
e-Culture basic Functionality
  • Keyword search with single search term
  • Literal matching: minimal letter distance on stemmed index
  • Retrieval: backwards graph search with weighted relations. Weights are manually assigned by relation type
  • Disambiguation: result clusters grouped by result path
  • Set of items grouped by search path
  • Clusters: search path
  • Ranking: clusters are ranked by path length. Items within a cluster are ranked by score (= literal match * total path weight).
Interface
  • Text entry box
  • Number of total results and number of results per cluster
  • Hyperlinks for cluster headers to zoom in on this cluster
  • Thumbnails with selected metadata
Flink Functionality
Interface
Haystack Functionality Browse concepts Manually defined virtual properties.
Interface
Hybrid Search Functionality
  • Keyword search with multiple search terms
  • Literal matching: ?
  • Retrieval: Spread Activation algorithm. Weights are determined by similarity and specificity measure plus manually assigned by relation type.
  • Refinement: Related keywords
  • Set of items clustered by type.
  • Ranking based on activation.
Interface
  • Text entry box
-
  • list of keywords
  • Item presented by title and visually grouped by type
KIM Functionality
  • Keyword search with multiple terms and Lucene operators
  • Pattern search consisting of a structured query and a search term
  • Value selection from facets
  • Facet value autocompletion search
  • Literal Match: string match on extracted entities and metadata
  • Retrieval (keyword search): resources with matching literal value
  • Retrieval (pattern search): resources with matching literal value and exact query match
  • Retrieval (value selection): resources with selected value as metadata
  • Refinement (keyword search): add search term for other metadata field
  • Refinement (value selection): Add new facet value or related concept
  • Set of items
Interface
  • Keyword search: form with text entry boxes for title, keyword, author and content
  • Pattern search: a complex form representing the structure of the query
  • Value selection: facets with value list and text entry box for autocompletion. The facets that are shown in the interface can be manually selected.
  • Number of matching documents.
  • Selected terms
  • Similar to initial search, but with the available facet values updated to the current selection.
  • Item presented by title and date
Longwell Functionality
  • Value selection from facets
  • Facet value autocompletion search
  • Keyword search with single search term
  • Literal matching (autocompletion): prefix
  • Literal matching (keyword search): subword
  • Retrieval (keyword search): resources with matching literal value
  • Retrieval (value selection): resources with selected value as metadata for specific facet
  • Facet values are updated to current selection
  • Add new facet value or search term
  • Set of items
Interface
  • Keyword search: text entry box
  • Value selection: facets with value list and text entry box for autocompletion. All facets are shown, but the value list can be hidden.
  • Loading message at every click
  • Number of total results
  • Number of results for each facet value
  • Selected facet values
  • Similar to initial search
  • Fresnel
mspace Functionality
  • Value selection from facets
  • Facet value autocompletion search
  • Literal match (autocompletion): prefix
  • Retrieval: results related to selected value by predefined graph paths.
  • Refinement: select new value from facet
  • Change order of facets to construct different view
  • Selected item
Interface
  • Value selection: facets with value list and text entry box for autocompletion. Visible facets can be manually selected.
  • Selected facet values are highlighted
  • Facets are draggable to change order
  • All related values of the result item are shown.
Museum Finland Functionality
  • Value selection from facets
  • Keyword search with single search term
  • Literal matching: subword
  • Retrieval (keyword search): resources with matching literal value
  • Retrieval (value selection): resources with the selected value, or a narrower concept, as metadata for specific facet
  • Disambiguation: keyword search matches by use (facet in which they occur as a value)
  • Refinement: add value from new facet or select more specific value from active facet
  • Exploration (if a single result is selected): related results have similar values for predefined properties or paths
  • Set of items
Interface
  • Keyword search: text entry box
  • Value selection: facets with value list
  • Number of results for each result cluster
  • Number of results for each facet value
  • Selected facet values
  • Similar to initial search
  • Thumbnail with selected metadata
Open Academia Functionality
  • Keyword search with single search term for metadata field
  • Value selection from metadata fields
  • Literal matching: subword
  • Retrieval (keyword search): Resources with matching literal for specified metadata field
  • Retrieval (value selection): resources with selected value as metadata for specified field
  • Values in metadata fields are updated to selection
  • Add new search term or metadata field
  • Set of items
Interface
  • Keyword search: search form
  • Value selection: drop down lists for fixed set of fields
  • Number of total results
  • Processing time
  • Same as initial search
  • Different visualization tools: tag cloud, topic graph, social net, timeline, cluster map and relation graph
OWLIR Functionality
Interface
QuizRDF Functionality
  • Keyword search with multiple search terms
  • Search options: Class of the provided input. Options for literal match, case, exact match, only in title.
  • Literal matching: defined by match options
  • Retrieval: documents with literal match on index table. Index table of a document contains the literal values from all direct annotations.
  • Disambiguation: select class and values for metadata fields used for instances of this class
  • Set of items
  • Ranking by a variant of tf.idf
Interface
  • Keyword search: text entry field
  • Options: Drop down menu with classes
  • Options: checkboxes for search options
  • Number of results
  • Possible classes of the input are updated to result set
  • Drop down for classes
  • Search form for properties with a literal value range
  • Title of document + all metadata
SemSearch Functionality
  • Keyword search with multiple search terms
  • Structure: Boolean operators AND/OR. Search engine specific operator ":" to indicate the result target type
  • Literal matching: subword
  • Interpretation: Based on the sets of resources matching the input a formal query is constructed
  • Retrieval: resources matching the constructed query. RDFS reasoning over class and property hierarchy.
  • Disambiguation/Refinement: Deselect class/property/instance of matching search terms
  • Set of items
  • Ranking on literal match
Interface
  • Keyword search: Text entry box
  • Number of total results
  • Processing time
  • Form with the matches for each keyword. Checkboxes to toggle them
  • Title + the entities that matched the query + the relation from the keyword matches to the result
Slashfacet Functionality
  • Value selection from facets
  • Facet value autocompletion search
  • Global facet autocompletion search
  • Literal matching (autocompletion): prefix
  • Retrieval (value selection): resources with selected value as metadata for specific facet or with a narrower concept
  • Facet values are updated to current selection
  • Complex query paths can be constructed through interactive interface
  • Disambiguation (global facet search): by use of value (facet in which the value occurs)
  • Disambiguation (in facet search): by location in the value hierarchy
  • Refinement: Add a new facet value
  • Set of items
  • Clustered by manually selected property
Interface
  • Value selection: facets with value list and text entry box for autocompletion. All facets are shown, but the value list can be hidden.
  • Global facet search: text entry box with autocompletion dropdown list
  • Loading message at every click
  • Number of results per cluster
  • Number of results for each facet value
  • Selected facet values are highlighted
  • Disambiguation (global search): grouped by class
  • Disambiguation (in facet search): value in hierarchy shown as unfolded tree
  • Refinement: similar to initial search
  • Thumbnail with selected metadata
Squiggle Functionality
  • Keyword search with multiple search terms
  • Literal matching: Lucene search engine.
  • Retrieval: resources with matching literal value. After disambiguation by selecting a concept, resources are matched to all literal values known for the selected concept.
  • Multiple terms in a query are interpreted disjunctively. Conjunctive queries on concepts can be made by selecting multiple values from the suggestions.
  • Disambiguation: by matching URI and by rdf:type
  • Exploration: related concepts grouped by rdf:type
  • Set of items
Interface
  • Keyword search: text entry box
  • Total number of results
  • Processing time
  • Hits per matching literal
  • Disambiguation: List of concepts with checkbox
  • Exploration: List of concepts
  • Thumbnail or title + selected metadata
Swoogle Functionality
  • Keyword search: Search term or URI
  • Structure: Boolean operators AND, OR. Specific constructs to indicate the domain for literal match: in URI, namespace, local name, literal values
  • Literal match: subword
  • Retrieval (ontology): contains resource with matching literal value
  • Retrieval (term): resource with matching literal value
-
  • Set of items
  • Ranking: OntoRank [explain] and TermRank [explain]
Interface
  • Keyword search: Text entry box
  • Options: result type (document, ontology, term)
  • Number of total results
  • Processing time
- rdfs:label for terms and URI for documents + selected metadata
Tap: Semantic Search Functionality
  • Keyword search with at most two search terms
  • Literal matching: subword
  • Retrieval: Full graph search. Restricted to manually assigned properties for each class
  • Exploration: the semantic search results augment the results from a traditional search engine
  • Set of items
  • Clustering: by type
Interface
  • Keyword search: Text entry box of host search engine
- (see results)
  • Results are presented alongside traditional search results
  • Template for each result class
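
Several of the systems above combine value selection from facets with feedback on the number of results per facet value. The following is a minimal Python sketch of that combination, not the implementation of any surveyed system. The toy collection `ITEMS` and the function names `select` and `facet_counts` are hypothetical and chosen for illustration only.

```python
from collections import Counter

# Hypothetical toy collection: item id -> {facet: set of values}.
ITEMS = {
    "painting1": {"creator": {"Rembrandt"}, "material": {"oil", "canvas"}},
    "painting2": {"creator": {"Vermeer"}, "material": {"oil", "canvas"}},
    "etching1": {"creator": {"Rembrandt"}, "material": {"paper"}},
}


def select(items, constraints):
    """Return the ids of items that carry every selected (facet, value) pair."""
    return {
        item_id
        for item_id, facets in items.items()
        if all(value in facets.get(facet, set()) for facet, value in constraints)
    }


def facet_counts(items, result_ids, facet):
    """Count, per value of `facet`, how many items in the result set carry it."""
    counts = Counter()
    for item_id in result_ids:
        counts.update(items[item_id].get(facet, set()))
    return counts


# Initial selection, followed by a refinement that adds a second facet value.
results = select(ITEMS, [("creator", "Rembrandt")])
print(sorted(results))
print(facet_counts(ITEMS, results, "material"))
print(select(ITEMS, [("creator", "Rembrandt"), ("material", "oil")]))
```

Computing the counts over the current result set corresponds to the behaviour noted for Longwell and Slashfacet, where facet values are updated to the current selection. Computing them over all items instead would leave the refinement values unrestricted, as noted for Autofocus.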

Overview

Search Input Processing User Feedback Search Results
Functionality
  1. Free text input of one or more search terms. Optionally control structures, such as boolean operators or application specific operators.
  2. Controlled input of search terms. Controlled can mean that the search term is restricted to a list of predefined values and/or that the search term is restricted to a certain value range. For example, in a search form the value range is restricted for each field (title, author, etc.) while the search term is unrestricted. When searching within a facet, both the search term and the value range are restricted. [List advantages of controlled input. No dead ends.]
  1. Literal match. Limit the discussion to matches on literal values of RDF resources. Indexing of documents lies outside the scope. Direct hits: match on literal attributes, or on normal attributes for which the label field matches. There is no difference between a literal value and a resource with a literal value as a label.
  2. Input interpretation. This only applies if multiple keywords are given. [SemSearch and Hybrid search support this]
  3. Query extension. (see discussion below)
  4. Related results. (see discussion below)
  1. Disambiguation
  2. Refinement
Interface It is often mentioned that the interface for search terms should be simple. The Google interface is regarded very positively. Forms and facets [should we discuss interface issues]. Other options: type of literal match selection (prefix, substring). Loading message, number of results, processing time, warning messages (no results, result limit reached, etc.). Semantics play an important role here. The properties, classes and concepts allow the system to explain the possible interpretations for disambiguation and the possible dimensions for refinement. This allows precise feedback. Trade-off between search and browsing. Problems arise with ontological resources that do not make sense to the user. There is a need for an interface ontology (natural categories). Semantics play an important role in clustering. The semantics provide the dimensions for grouping the results as well as the explanation of the groups.
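
The literal match types and the controlled input mentioned in this overview can be illustrated with a small Python sketch. It assumes that "subword" matching means a plain substring test, which the analyzed systems may implement differently. The function names `literal_match` and `autocomplete` are hypothetical.

```python
def literal_match(term, literal, mode="subword", case_sensitive=False):
    """Return True if `term` matches `literal` under the given match mode."""
    if not case_sensitive:
        term, literal = term.lower(), literal.lower()
    if mode == "exact":
        return term == literal
    if mode == "prefix":
        return literal.startswith(term)
    if mode == "subword":
        # Assumption: a subword match is treated as a substring match.
        return term in literal
    raise ValueError(f"unknown match mode: {mode}")


def autocomplete(typed, allowed_values):
    """Controlled input: suggest only predefined values that start with `typed`."""
    return sorted(v for v in allowed_values if literal_match(typed, v, "prefix"))


print(literal_match("brandt", "Rembrandt", mode="subword"))
print(literal_match("brandt", "Rembrandt", mode="prefix"))
print(autocomplete("ve", ["Vermeer", "Rembrandt", "Velazquez"]))
```

Because `autocomplete` can only return values from the predefined list, every suggestion leads to at least one known value, which is one way to read the "no dead ends" advantage of controlled input.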

Discussion

Semantic Search

What do we consider to be semantic search? [This can also explain why certain systems are chosen and others are not. Why do we not consider the SeRQL or SPARQL query languages, or natural-language-based interfaces? I am not sure yet. The underspecified input of keyword search has something to do with it. In NL the focus is on the interpretation of the expression. Once the interpretation is made, it is clear how the database should be queried. For an underspecified query it is uncertain what, and how much, should be retrieved from the database. The SPARQL DESCRIBE construct is a similar notion of vagueness. It is exactly the part that is underspecified in the SPARQL spec.]

I think we should focus on systems in which the goal is to find instances. The target group typically consists of end users. This excludes systems such as Swoogle, Swangler and OntoSearch, where the target type is a semantic web document, an ontology, or a resource used in an ontology (a class or predicate). The target group for these systems consists of developers and knowledge engineers.

General remarks about analysis of systems

Choose a focus on certain aspects. Much more semantic processing could be done in the background (integration, smushing, extraction, etc.). We only look at the semantics used in the actual search process.

General remarks about the use of semantics

  1. use metadata to extend literal match.
  2. disambiguation of input, refinement of input and complex query patterns.
  3. complex paths to find related resources.

For the first there is some consensus. The properties used are synonyms, sameAs and narrower. The expansion can be done on the fly, or the additional terms can be added to the index offline. This increases recall, simply because the number of available "meaningful" terms is increased. On how to do disambiguation (the second) there is also some agreement: rdf:type and the property between the value and the search result are typically used. Facets are very well suited for refinement and for constructing complex (union/intersection) queries. The third is the most exciting one, and here we see various solutions. Rules: Ontogator/MuseumFinland uses logical rules, Squiggle uses predefined paths, and Tap Semantic Search uses predefined properties for each class. User controlled: KIM has a complex search query form, /facet has cross relations. Graph search: e-Culture uses a weighted graph algorithm.
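
The first use, extending the literal match through synonym, sameAs and narrower relations, can be sketched in Python as follows. This is an illustration under assumptions rather than the method of any particular system. The concept identifiers, the dictionaries `LABELS`, `NARROWER` and `SAME_AS`, and the function `expand` are all hypothetical, and sameAs is followed in one direction only to keep the sketch short.

```python
# Hypothetical toy thesaurus. A second label on a concept plays the role of a synonym.
LABELS = {
    "ex:painting": {"painting", "picture"},
    "ex:oil_painting": {"oil painting"},
    "ex:watercolour": {"watercolour", "aquarelle"},
    "aat:painting": {"paintings"},
}
NARROWER = {"ex:painting": {"ex:oil_painting", "ex:watercolour"}}
SAME_AS = {"ex:painting": {"aat:painting"}}


def expand(term):
    """Return the search term plus all labels reachable via narrower/sameAs links."""
    term = term.lower()
    todo = [concept for concept, labels in LABELS.items() if term in labels]
    seen = set()
    while todo:
        concept = todo.pop()
        if concept in seen:
            continue
        seen.add(concept)
        todo.extend(NARROWER.get(concept, ()))
        todo.extend(SAME_AS.get(concept, ()))
    terms = {term}
    for concept in seen:
        terms |= LABELS.get(concept, set())
    return terms


print(sorted(expand("painting")))
print(sorted(expand("aquarelle")))
```

Calling `expand` at query time corresponds to the on-the-fly variant. For the offline variant, the same expanded labels would instead be written into the literal index of every item annotated with the concept.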

Can we describe the need for complex paths? Which additional results can we find with them? We can make a link to relation search here. If the focus is not the item itself but the relation between two or more items, the complex paths are themselves search targets. \cite{Seth Work}.

I think we can distinguish two types of complex paths. The first serves for query extension in order to find more results. The second allows the system to suggest related concepts, which allows further exploration.
  1. Complex path is introduced by modeling decisions. In other words the value could have been modeled as a direct value. The use of blank nodes introduces this complexity. Another example would be the complex annotation in the multimedia ontology. Dealing with this type of complex paths is not a semantic issue. If the system is aware of which complex paths to use or if some mapping exists at the data level the issue of relevance does apply. .. the value is now a direct value of the item.
  2. Complex path in which the occurrence of one or more concepts is crucial to the relation between the items. In other words, the relation exists by virtue of an additional concept. In this case the system has to decide which concepts and relations are relevant. Many factors play a role here.
Interpretation of the input in advance (SemSearch) or presentation techniques that allow refinement and browsing on and from the result set. In SemSearch it is also issue iii) that is problematic. In this case the problem is approached from the other side: which combinations of search terms are possible according to the data?
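
As a rough illustration of how a system might decide which complex paths are relevant, the following Python sketch performs a backwards graph search with manually weighted relations, loosely modelled on the e-Culture description in the system analysis. It is not the e-Culture implementation. The graph `EDGES`, the weights in `WEIGHTS`, the `work:` prefix convention for result items, and the choice to multiply the weights along a path are all assumptions made for this example.

```python
# Hypothetical toy graph of (subject, predicate, object) triples.
EDGES = [
    ("work:nightwatch", "dc:creator", "person:rembrandt"),
    ("work:selfportrait", "dc:creator", "person:rembrandt"),
    ("person:bol", "ex:studentOf", "person:rembrandt"),
    ("work:portrait_bol", "dc:creator", "person:bol"),
]
# Assumed manual weights per relation type.
WEIGHTS = {"dc:creator": 0.9, "ex:studentOf": 0.5}


def backward_search(start, literal_score, max_depth=3, threshold=0.1):
    """Walk edges from object to subject, starting at the resource that matched.

    Returns {work: (score, path)}, where score is the literal match score
    multiplied by the weights of the relations on the path (an assumption).
    """
    results = {}
    frontier = [(start, literal_score, ())]
    while frontier:
        node, score, path = frontier.pop()
        if node.startswith("work:"):
            # Keep the best scoring path per result item.
            if score > results.get(node, (0.0, ()))[0]:
                results[node] = (score, path)
            continue
        if len(path) >= max_depth:
            continue
        for subject, predicate, obj in EDGES:
            if obj == node:
                new_score = score * WEIGHTS.get(predicate, 0.0)
                if new_score >= threshold:
                    frontier.append((subject, new_score, path + (predicate,)))
    return results


for work, (score, path) in sorted(backward_search("person:rembrandt", 1.0).items()):
    print(work, round(score, 2), " -> ".join(path))
```

The path stored with each result can serve as the cluster key, so that results found via `dc:creator` are presented separately from those found via `ex:studentOf` followed by `dc:creator`. The threshold and depth limit are one simple way to cut off paths that are unlikely to be relevant.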

Search Results

In the presentation of the results almost all applications use a hand-configured template form. Fresnel \cite{fresnel05} could easily be applied for this purpose. Most applications support some sort of local view on an individual search result. Often this also allows some form of browsing. This connects to browsers such as Tabulator and Disco. Ranking of results is an open issue; the added value of semantics is not clear. Clustering is often applied; here the added value of semantics lies in meaningful explanations of the clusters.
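
The role of semantics in clustering can be sketched as grouping results by rdf:type and using the label of the type as the explanation of the cluster. The Python sketch below is hypothetical throughout: the result records, `TYPE_LABELS` and `cluster_by_type` are illustrative names, and ordering the clusters by size is an arbitrary choice, not something the surveyed systems are known to do.

```python
from collections import defaultdict

# Hypothetical search results with an rdf:type and a relevance score.
RESULTS = [
    {"uri": "ex:nightwatch", "type": "ex:Painting", "score": 0.9},
    {"uri": "ex:rembrandt", "type": "ex:Person", "score": 0.8},
    {"uri": "ex:selfportrait", "type": "ex:Painting", "score": 0.95},
]
# Human-readable labels for the types, used to explain each cluster.
TYPE_LABELS = {"ex:Painting": "Paintings", "ex:Person": "People"}


def cluster_by_type(results):
    """Group results by rdf:type and return a list of (cluster label, items)."""
    clusters = defaultdict(list)
    for result in results:
        clusters[result["type"]].append(result)
    labelled = []
    for rdf_type, items in clusters.items():
        # Within a cluster, the highest scoring items come first.
        items.sort(key=lambda item: item["score"], reverse=True)
        labelled.append((TYPE_LABELS.get(rdf_type, rdf_type), items))
    # Assumption: larger clusters are shown first.
    labelled.sort(key=lambda cluster: len(cluster[1]), reverse=True)
    return labelled


for label, items in cluster_by_type(RESULTS):
    print(label, [item["uri"] for item in items])
```

Replacing the type with the search path as the grouping key would give clusters of the kind described for e-Culture.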

User feedback

Disambiguation, refinement and suggestion of related items

Related Work

A Categorization Scheme for Semantic Web Search Engines, 2006. Kyumars Sheykh Esmaili, Hassan Abolhassani. Ontology search engines (meta search, crawler based) and semantic search engines (context based, evolutionary, semantic association). The set of systems that they cover seems to be very complete. The categorization is straightforward. I guess this means we can leave such a categorization out of our paper and just refer to this one. That leaves more room to focus on the role of semantics.

Evaluation

Field experimenting with semantic web tools in a virtual organization, 2003. Victor Iosif, Peter Mika et al. How do we test SW tools? Design considerations for Semantic Web field experiments. Description of an experimental setup with the SW applications QuizRDF (search) and Spectacle (browse), and the traditional free text search of EnerSearch.