Faceted Browsing
Michiel Hildebrand, $Date: 2006/04/18 12:01:52 $
Considerations and decisions in the implementation of a faceted browser
ToDo:
http://media.cwi.nl/hildebra/blog/2006/03/02/facet-browser-todo/
0. Intro
Faceted browsers allow navigation of a repository along the conceptual dimensions of its metadata. A single facet consists of hierarchically structured concepts that are used as metadata for a set of instances. In a faceted browser users can select and combine concepts from multiple hierarchies. In fact, a user creates a complex query by simple clicking.
The interaction in faceted browsing follows a step by step process of specifying, and possibly generalizing, the constraints of a result set. At each stage the current children and, additionally, the number of instances it annotates are displayed. Concepts with no instances (count 0) are hidden to protect users from oblivious navigation of empty branches.
1. Transitive Closure
At every stage the system has to calculate for all direct children the number of instances that are annotated with that concept. We say that an instance is annotated with a concept if there exists a triple (subject,annotation_property, concept). The concept can be part of a hierarchy. In this hierarchy the concept's parents are indirectly used for annotation of an instance as well. In other words annotations occur as direct properties or through a transitive relation. A good strategy is required to efficiently handle the transitive relations.
How can we do efficient computation with entailed relations?There are at least two approaches to tackle this:
- Let a query processor infer new statements as needed per query.
- Compute and store the closure of the given graph as a basis for querying. (cited in reversed order from [2])
For example, the semantics of the rdfs:subClassOf relation state that the instances of the subject class are also instances of the value class. In the semweb package of SWI prolog the predicate rdfs_individual_of uses the subClassOf relation to find individuals (instances) of a class [1]. This is approach 1. Sesame uses approach 2. It computes the closure and adds all triples in advance.
My first naive implementation used the first approach. The algorithm searched for all instances top down. Get all direct instances of a concept and recursively get the instances of a child of that concept. This strategy turned out to be very expensive, because a parent can have multiple children, and most of them will not lead to an annotation triple. For each subfacet the algorithm has to search the whole subtree.
A bottom up approach is not much better because in this case all instances have to be checked for each subfacet, and most of them are not being used in that facet. Although the path from instance to parent is most of the time non branching (some concepts can have more than one parent) the set with all instances is usually very big.
1.1. Storage vs. computation
In a revised implementation (thanks to Jan) I followed the second approach. In an initialization stage I first compute the transitive closure and add additional triples for the entailed relations between an instance and a concept. In the new extended triple set if a concept is used to annotate an instance there must exist a triple (instance,annotation_property,concept).
1.2. Reduction of additional triples
If instances share a metadata concept additional derived triples are needed from both instances to the concept's parents. In fact the whole transitive closure for that concept is stored twice, once for each instance. The number of derived triples for the transitive closure can be reduced if the closure is calculated solely at the concept level. See image below, if derive the closure from the instance level (instance1, instance2, instance3) to the top node we need 6 additional triples, from the all instances to concept2 and from all instances to top. If we start from the concept level (concept1) we need only 1 triple, from concept1 to top.
Of course the hierarchies do not always have this shape. However, it is very common that multiple instances share 1 metadata concept. In the worst case each instance has one or more concepts as metadata and each concept is used only once. In this situation there is no reduction of triples, but there isn't an increase either.
2. Going multiple
In progress..2.1. Multiple facets of the same dimension
Each facet describes the hierarchy in one dimension. However, the concepts from one dimension can be used for different properties of an instance. For example, the place of birth of an artist and the place describing the subject of a painting both use a hierarchy of geographical locations. (Would be nice to find a good visualization for this). The same dimension can also be used for the same type of instance. For example concept from a material hierarchy can be used as metadata for the medium used in an artwork (oil paint) or for support (canvas).
There is no one to one correspondence between a hierarchy and a facet. To make this explicit the definition of a facet has to specify, besides the hierarchy, also the property between the instance and its metadata concept. I call this the annotation property. Note, that the same hierarchy for different facets does not have to be exactly the same because empty branches are hidden.
As a result facets can be (partially) overlapping. Most implementations of faceted browsers require different facets to be orthogonal sets. (eg. Flamenco, more??)
2.2. Instances from multiple classes
All implementations of faceted browsers I found are directed towards a specific type of resource. For example books at amazon.com and artworks at MuseumFinland and Flamenco Search. The E-Culture data set, and the Semantic Web in general, contains meta-data about all kinds of resources. I try to exploit and visualize the multiple type of instances in the faceted browser.
Faceted browsing with multiple types of classes allows the creation of more complex queries by combining constraints from facets of different types of resources. For example, in the E-Culture demo multi-type faceted browsing allows the user to find specific artworks through the properties of its creator (artworks created by female artists that were born in Amsterdam). This is also useful to get a view of the other type of resources once a constraint is made. Selecting a constraint on artworks automatically adapts the facets about persons.
2.3 Class Tree
The different types of resources are themselves part of a hierarchy, with rdfs:Resource as the root node. Besides the facets for a specific type of resource the interface of the faceted browser also displays the class tree. It's use is two fold; it allows to select (and display the facets of) a specific type of resource, person, artwork, and you can browse the tree, as with the other facets, to specify/generalize the type.
From a performance perspective facets from all types of resources can't be shown, and thus computed, all at once. Only the facets for one type of resources are displayed. The class tree allows the user to switch between the types of resources and its facets.
3. Cross relations
Handle multiple types of instances need to define how they are related. This is a tricky subject. To start I choose to hard-code them by hand.
Notes With only 1 cross relation (work-creator-person) many questions arise, especially, in combination with the class facet. What to do when a specific class is selected? We could argument that the user specifies which class he wants to browse so the system should only show instances of this class. But what if a constraint is already given e.g. artwork material, and next the class person is selected. The instances are are those person that made artwork constructed with the previously selected material. All good. But if all artwork facets are hidden (including the material facet) it is not possible to further specialize the material query.
This is not necessarily a bad thing. It's a choice! I does make some sense, if you want to be able to further specify the artwork's material don't select the person type. But at the other hand it is somewhat strange that you selected a constraint from the material facet, and after selecting the person type the material facet disappears but the constraint remains. Could it be a solution to keep only the material facet (Note the facets from which previous constraint have been selected) and hide the rest???
4. Trees everywhere
In the initial idea of a faceted browsing there is a collection of instances and multiple facets that allow browsing of the this collection along its metadata. The implementation of the cross relations showed that a facet can also be used for the the type tree. The type tree allows browsing and switching between different type of instances. During the implementation of the automatic derivation of facets a third type of facet hierarchy came up, the facets themselves.
A facet is defined for a specific resource property on a specific type of resource. Properties can themselves be hierarchically ordered, typically, through the subPropertyOf relation. For example, in ulan nationallityPreferred and nationalityNonPreferred are both subproperties of nationality. In vra Material.Medium and Material.Support are both subproperties of Material.
Now we can define three types of hierarchy trees. The class tree on subjects, the property tree, and the concept trees. This corresponds to the tree elements (subject, predicate, object) of RDF's basic unit, the triple. Implementation results in a universal facet browser for RDF, that gives the user control over the entire graph. The advantage of the faceted browsing paradigm is that we don't visualize this entire graph. Instead it is cut up in separate trees that are each individually easy to understand and from the system's perspective easy to visualize.
Graphs versus Trees
RDF is a directed acyclic graph, DAG. DAGs can be considered to be a generalization of trees in which certain subtrees can be shared by different parts of the tree. Transforming the graph into a tree as the following consequences: The same node can occur at different levels of the tree.
5. Combined with keyword search
Navigation through facet hierarchies is the core strategy of faceted browsing. Though this can be a useful method to find resources, it is not always straightforward how to find the right concepts in the hierarchy. In a facet, only the concepts at top of the hierarchy are directly accessible. Concepts inside the hierarchy are only reachable by navigating through the tree. Navigation can take quite some time, and may require domain specific knowledge. For example, to reach the concept bronze the user has to traverse the following path in the material facet:
materials by composition > inorganic material > metal and metal products > metal > metal by composition or origin > nonferrous metal > copper and copper alloy > copper alloy > bronze
If a user is looking for something specific, and knows the name of this thing, he has to be able to go there directly. To minimize superfluous navigation within facets I explore the integration with keyword search. There are two approaches to combine keyword search with faceted browsing that I want to explore.
- Keyword auto-completion for individual facets. Find concepts within a specific facet.
- Top level full keyword search. Facets provide a method for disambiguation of the keyword.
5.1. Facet auto-completion
In the interface a search field is added for each facet. AJAX technology is used to do the client-server communication. The communication initiates by a request from the client for each key press in the search field. A list of matching concepts is returned by the server. A matching concept has the given substring as a label, is part of the hierarchy, and is used as metadata on an instance of the current type. Finally a javascript function on the client side updates the facet tree with the new found matching concepts.
I decided to use a prefix search to find the matching labels. A full substring search would take the server to much time (could be added as on option). The combination of prefix search on the rdfs labels and the AJAX technology results in tree based auto-completion.
The literal values of the resources are indexed on each word separately. This allows searching for "Vincent van Gogh" with substrings of either 'vincent', 'van' and 'gogh'.
Many resources from the getty thesauri have both preferred and non preferred labels. Both are a subclass of rdfs:label, and therefore, both are used in the prefix search. This allows to search for 'Paris' with a substring of 'paris', as well as a substring of 'lutetia' (the roman name for Paris).
5.2. Top level keyword search
6. Facet Store
A facet links a collection of resources to a taxonomy. Need four properties to define this connection.
6.1. Facet definition
- Resource Type: A facet is defined for a specific type of resource. Resources from a different class can have the same facet dimension. The concepts in this facet can vary.
- Resource Property: A facet connects a resource to a concept from the taxonomy through a resource property. Same type of resource and same taxonomy, but different resource property.
- Hierarchy Top Concept: The lowest common top concept of the taxonomy that connects all resources defined by the facet. Hence, it does not have to be the top as designed by the modeler.
- Hierarchy Relation: The concepts that are part of the taxonomy have a relation to the hierarchy top concept through the defined hierarchy relation.
6.2. Subject, predicate and object facets
Subject | Property | Object | |
---|---|---|---|
Description | Class hierarchy of instances | Hierarchy of properties, that are used in the object facets | Hierarchy of concepts used as metadata |
Resource Type | rdfs:Class | rdf:Property | subclass of rdfs:Class |
Resource Property | rdf:type | - | subproperty of rdf:Property |
Hierarchy Top | rdfs:Resource | rdf:Property | subclass of rdfs:Class |
Hierarchy Relation | rdfs:subClassOf | rdfs:subPropertyOf | Transitive Relation or rdf:type |
The subject and property facet have a fixed definition. (hierarchy relations subPropertyOf and subClassOf are at the moment fixed. In theory this could be a different relation or a set or relations.)
Need extra prolog predicate to find direct subclassOf and subPropertyOf relations. SWI built in predicates are defined transitive.
6.3. Derive facets
Three ways to define object facets:- Manually defined in RDF
- Defined by modeler in OWL specification
- Extract from RDF graph
6.3.1 Facet declarations in RDF
Store a facet description in RDF. Currently I created all resources that I needed under the iface namespace.
6.3.2 OWL restrictions
6.3.3 Facet extraction
State
References
RDF stores- [1] SWI Prolog semweb package documentation
- [2] Jeen Broekstra , Arjohn Kampman , and Frank van Harmelen Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema
- [3] Ka-Ping Yee (et al). Faceted Metadata for Image Search and Browsing
- Flamenco
- Aqua Browser
- mSpace
- MuseoSuomi
- Aduna Spectacle
- Stitch (Aligned multiple collections)