Relation Search

Michiel Hildebrand, $Date: 2007/02/23 20:18:13 $

Roadmap

Relation search is concerned with the paths, the relation, between two resources, in our cases two resources in a RDF graph. A brute force path find algorithm fails both on performance and on precision. Why? We perform an experiment with brute force graph traversal to get more insight into what causes the combinatorial explosion. Can we point out specific problematic properties, patterns or patterns of specific properties? Next, we will analyze the successful/meaningful paths. What role does the domain specific knowledge play in these paths? Are there generic patterns? Statistics can help to get a quantitative measure on the meaningfulness of relations. Which statistics can we compute with a generic method? From this point on we should have a enough knowledge of the properties of RDF graphs to step up one level...

Brute Force Experiment

As a first naive try we generate all paths between two resources with a brute force graph traversal. We use the length of the path as the cutoff procedure. The experiment will incrementally increase the path length until it fails to finish within a reasonable amount of time. We test the brute force method on various triple stores and with different data sets.

For each path-length we generate a set of SPRQL queries. The number of queries is exponential to the path-length. The global schema is (the queries are still in SeRQL):

select * from {From} property {To}
select * from {To} property {From}

For path length 4 one of the 16 queries looks as follows:

select * from {subj1} p1 {From}, {subj1} p2 {obj1}, {obj1} p3 {obj2}, {To} p4 {obj2}

I have written a small prolog program that generates the SPRQL queries for two URIs and a any path-length. In SWI I can use the queries directly to get all results. For the other triple stores I output the queries and (hopefully) pipe time to their query interface. For each path length I make one big query through the UNION of all individual queries for that length.

Query Restriction

With the unrestricted queries mentioned above the search space and the number of results is going to be huge. There are a few properties and patterns responsible for this. A good example of the problem is the comparison of two artworks. With path length 2 the search space is not problematically big, but there are already irrelevant paths. Two artworks are always related by their type, namely vra:Work. In this artchive collection they are also related by their material.medium and material.support, since these are the same for all artworks. With path length 4 it becomes really problematic. The following query is responsible for this: From-P1-V1, S-P2-V1, S-P3-V2, To-P4-V2. In words this means that From and To both have the same value has some other resource S. This pattern occurs very often, namely with Pi is rdf:type, vra:material.medium etc. and all combinations of these properties. In fact, this means that two artworks are related through a random third artwork.

Cycle Detection I

A first try to fix this problem is cycle detection on the resources. We add a where part to the query with all equalities that are not allowed (I assume that there exist no triples where the Subject and Value are the same).

select * from {From} p1 {V1}, {V1} p2 {V2}, {V2} p3 {V3}, {V3} p4 {To} where From != V2 AND From != V31 AND V1 != V3 AND V1 != To AND V2 != To

The cycle detection removes all results where all Pi are of the same. It does not disallow, for example a combination rdf:type and vra:material.medium, because the values of these properties are not the same. What to do? Hand-coded restrictions on certain properties. Most obvious is rdf:type.

Cycle Detection II

The cycle detection above succeeds in preventing explicit cycles within a path. However, a path in an RDF graph may also contain implicit cycles. The cycles are implicit because we can't directly exclude them from the path. An example: Work1 P Work2 creator Person1. There is an implicit cycle in this path if the relation Work1 creator Person1 exists. We can exclude implicit cycles by extending the where part of the query with all predicate-value pairs from the visited URIs.

select * from {u1} p1 {i1}, {i1} p2 {u2}
where not exists ( SELECT * FROM {u1} p2 {u2} )

In words, where there is not a path directly using the same property.

Precision and Recall

We need a measure to express what meaningful results are.

Datasets

test set 1: Cultural heritage
RDF files:
  1. rdfs.rdfs,
  2. dc.rdfs,
  3. vracore3.rdfs,
  4. vp.rdfs,
  5. aat.rdfs,
  6. ulan.rdfs,
  7. tgn.rdfs,
  8. aat.rdf,
  9. ulan_meta_relations.rdf,
  10. ULAN-Events.rdf,
  11. ULAN-Nationalities.rdf,
  12. ULAN-Relations.rdf,
  13. ULAN-Roles.rdf,
  14. ULAN-Top.rdf,
  15. ULAN-painter.rdf,
  16. TGN-Top.rdf,
  17. TGN-PlaceTypes.rdf,
  18. TGN-Other.rdf,
  19. TGN-Europe.rdf,
  20. artchive.rdf,
  21. artchive_annotations.rdf,
  22. aul.rdfs,
  23. aul.rdf,
  24. aul9styles.rdf
Statistics: Concepts > 100 subjects

We need a notion of connectedness. How much potential properties are available for the relation search? The default branching factor? - average subject branch factor. The subject branch factor is the average number of subjects for each unique value of a specific predicate. The average subject branch factor is computed over all properties. - average object branch factor. The average number of values for a unique subject.

Results

Test machine is AMD dual-core 3800+ with 3gb of DRAM pc400.

Examples:
  1. almond_blossom.jpg (Painting of Vincent van Gogh) - Vincent van Gogh
  2. almond_blossom.jpg (Painting of Vincent van Gogh) - Anton Mauve
  3. Vincent van Gogh - Paul Gauguin
Without property constraints
Example Length Time (ms) Count Class-Count Pred-Count Class-Pred-Count
2 100000
2104133
31201761313
480002451657176180
5-----
With property constraint (rdf:type)
Example Length Time (ms) Count Class-Count Pred-Count Class-Pred-Count
2 100000
204133
31201661212
485201935552154158
5-----
With property constraints (rdf:type, ulan:role, ulan:gender, vp:parentPreferred)
Example Length Time (ms) Count Class-Count Pred-Count Class-Pred-Count
2 100000
204133
313010388
49000952329094
5-----
Why does the exclusion of these properties have such a drastic drop in results? Do we loose any valuable results or is at all junk? The biggest clusters at class-pred level are:

What happens in these patterns? In words we can say that the work is related to all persons with the french nationality, who in turn are all related to any other person because they are both painters (role(Non)Preferred) or because persons have a preferred parent (Facet Person).

work(W)&culture(W,N)&Person(P1)&nationality(P1,N)&role(P,R)&Person(P,R) Can we say why this is a useless pattern?
  1. Gauguin already has the french nationality himself why relate it to another person????check this
  2. Is culture - nationalityPreferred a useful pattern?
  3. role and parent are statistically "verzadigd". To many persons have this property.

In words. It could be the first part of the path, thus that the french nationality is not a good way to relate a painting to a person. Or it could be the second part of the path in which two persons are related by a property that almost all persons have in common. What about the combination?

More useless clusters.

Only a few of the examples, all have a common structure. The artwork is related to a Person A because it is related to another artwork which is created by a Person who is related to Person A. What happens is that van Gogh is directly related to Anton Mauve in 4 ways; Van Gogh created 58 artworks; All the artworks of van Gogh are related to each other in different ways, for example because they are made of the same material (oil paint). So any artwork of van Gogh is related to Anton Mauve by all the other artworks of van Gogh.

work(W)&material(W,M)&Work(W1,M)&creator(W1,P1)&Person(P1)&r(P1,P)&Person(P)

A possible solution would be to disallow the generic pattern: U1 -P2- I -P1- U2 if U1 -P1- U2 already exists. Should we allow extended paths between two resources if a shorter one between these resources exists? What would such an extended path add?

I think the problem can be formulated more general. We are not interested in random sequences of paths. The parts of the path should be related to each other too. A path where he relation of the first part has nothing to do with the relation in the second part is not meaningful. Can we express this dependency?

A meaningful example:
Work -creator- van Gogh -style- Impressionism -style- Gauguin

Why is this a meaningful path? There are two properties used, creator and hasStyle (or the inverse hasArtist). We could say that for an artist to have a certain style means the act of creation is influenced by this style. The two properties are not independent. How can we express this dependency.

A useless example:
Work -Material.Support- raw canvas -Material.Support- mountain.jpg -creator- Gogh, Vincent van -assistant of- Mauve Anton

The assistant of relation has in this case no influence on the relation that both paintings are made on raw canvas, nor does the creator relation influence this. Why is this the case? One reason would again be statistically, van Gogh painted all his painting on canvas. But, would it be a meaningful path of the paintings were both painted on wooden panels, and van Gogh only used this material for those two paintings. It only is meaningful in this path if it also relates to Anton Mauve. This could for example be the case if he introduced painting on wooden panels. How could we know something like this?

Predicate Statistics

Compute in advance data set specific statistics about the value distribution for the predicates. For each Class and each Predicate of this class find all value distributions that are above a certain threshold. The distribution is computed for each predicate-value pair. The distribution is the number of triples that a value occurs in divided by the total number of triples for that predicate. All value distributions above the threshold are returned. Below are the class-predicate-value pairs with a value distribution above 0.5.


Class = 'http://www.getty.edu/vocabularies/tgn#AdministrativePlace'
Predicate = 'http://www.getty.edu/vocabularies/tgn#placeTypePreferred'
Value = 'http://www.getty.edu/vocabularies/tgn#83002'
Count = 46947
Percentage = 0.898311 ;

Class = 'http://www.getty.edu/vocabularies/tgn#PlaceType'
Predicate = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'
Value = 'http://www.getty.edu/vocabularies/tgn#PlaceType'
Count = 1718
Percentage = 1 ;

Class = 'http://www.getty.edu/vocabularies/ulan#Person'
Predicate = 'http://www.getty.edu/vocabularies/ulan#gender'
Value = 'http://www.getty.edu/vocabularies/ulan#Male'
Count = 38354
Percentage = 0.906007 ;

Class = 'http://www.getty.edu/vocabularies/ulan#Person'
Predicate = 'http://www.getty.edu/vocabularies/ulan#master_of'
Value = 'http://www.getty.edu/vocabularies/ulan#500117045'
Count = 21
Percentage = 0.619048 ;

Class = 'http://www.getty.edu/vocabularies/ulan#Person'
Predicate = 'http://www.getty.edu/vocabularies/ulan#roleNonPreferred'
Value = 'http://www.getty.edu/vocabularies/ulan#31261'
Count = 48909
Percentage = 0.536731 ;

Class = 'http://www.getty.edu/vocabularies/ulan#Person'
Predicate = 'http://www.getty.edu/vocabularies/ulan#rolePreferred'
Value = 'http://www.getty.edu/vocabularies/ulan#31100'
Count = 39579
Percentage = 0.628616 ;

Class = 'http://www.getty.edu/vocabularies/ulan#Person'
Predicate = 'http://www.getty.edu/vocabularies/vp#parentPreferred'
Value = 'http://www.getty.edu/vocabularies/ulan#500000002'
Count = 39579
Percentage = 1 ;

Class = 'http://www.multimedian.nl/projects/n9c/eculture#Work'
Predicate = 'http://www.vraweb.org/vracore/vracore3#culture'
Value = 'http://www.getty.edu/vocabularies/ulan#901600'
Count = 30
Percentage = 0.866667 ;

Class = 'http://www.multimedian.nl/projects/n9c/eculture#Work'
Predicate = 'http://www.vraweb.org/vracore/vracore3#material.medium'
Value = 'http://www.getty.edu/vocabularies/aat#300015050'
Count = 2229
Percentage = 0.895469 ;

Class = 'http://www.multimedian.nl/projects/n9c/eculture#Work'
Predicate = 'http://www.vraweb.org/vracore/vracore3#material.support'
Value = 'http://www.getty.edu/vocabularies/aat#300238097'
Count = 2028
Percentage = 0.998028 ;

The properties cover all intuitively chosen constraints from the previous strategy. Additionally, it indicates for which value and class the properties should not be used. For the two material properties this is important, because a constraint on only these properties would exclude, for example, the relation between to artworks that are made on wooden panels (which is very unique in this dataset).

How can we make use of the statistics to constrain the results of the queries?