Exactly what is it about The Fine Young
Cannibals that is similar to Roxy Music?
Many of today's online retailers use the opinions of lots of people
to suggest selections that are similar to music a customer has purchased.
The challenge of automating the process comes down to teaching a computer
how to tell that a piece of music written by Mozart sounds like music
written by Mozart.
Researchers from the Center for Mathematics and Computer Science
(CWI) and the University of Amsterdam in the Netherlands have tapped computational
biology and information theory to find a way to automatically compare
music files to determine how similar they are.
The work is the latest application of a universal similarity metric
that the researchers have used to construct evolutionary trees using mitochondrial
genes of different animal species, put together language trees of 52 Eurasian
tongues, and detect plagiarism in student programming assignments.
The researchers' latest application shows that it could be used
to categorize music, and to determine the true origin of music authorship.
"Our method works well on abstract symbolic data, that is, data without
a direct physical interpretation," said Rudi Cilibrasi, a researcher at
The method uses a universal similarity metric that measures the
information distance, or differences between two files, as a number between
zero and one. "Zero means they're the same file, and one means they have
no relation whatsoever. Most file pairs fall somewhere in between," said
The universal metric measures how easily one file can be compressed
using the information contained in the second file. Compression algorithms
are regularly used to remove redundant information from files to make
the files smaller. "Two files are closer if one can be compressed [more
fully] given the information in the other," said Cilibrasi.
This method takes into account all similarities between a pair
of sequences, according to Paul Vitanyi, a fellow at CWI and professor
of computer science at the University of Amsterdam. "It is universal in
that it discovers all effective similarities," he said.
A human expert will categorize pieces of music by searching for
specific similarities related to pitch, rhythm or harmony, according to
Vitanyi. Existing systems that automatically classify music do the same,
looking, for instance, for similar rhythms.
The CWI researchers' universal metric approach does not look for
particular features, according to Vitany. This means the clustering takes
into account features that are not apparent, said Vitanyi. "Existing systems
tend to use... very particular features to do the clustering," he said.
"Our theory tells us we -- ideally -- account... for all features [including]
the many arbitrary ones we don't know and even cannot know," he said.
The method could be used to classify music in commercial applications,
said Cilibrasi. It could "enable better cataloging and indexing based
on people's preferences... without the help of pre-existing purchase records
or other human input," he said.
The researchers' method has two steps. The researchers first compared
a group of MIDI files and calculated the information distance between
every pair of files.
Then they used a hill-climbing algorithm to organize the differences
among files into a branching tree, with closer files situated closer on
the tree, said Cilibrasi. "If two files have a short distance in the [distance]
matrix, then they should also be close together in the tree," he said.
A hill-climbing algorithm starts at a random place and aims to
improve the solution at each step. A possible solution to a problem can
be thought of as a hill, and many possible, unconnected solutions a group
of hills. In order to keep the hill-climbing algorithm from getting stuck
on a small hill, the researchers added a randomized escape sequence.
To speed the process, they modified the hill-climbing algorithm
so that it examined many files in parallel. This enabled them to categorize
larger groups of music than was previously possible, said Cilibrasi. The
method was able to handle 60 items, but works best with sets of 20 to
30 items Vitanyi said.
The researchers tested the method by having it categorize increasingly
similar files. The program correctly classified a group of 12 very different
types of files: gene sequences, excerpts from a novel, classical MIDI
files, jazz MIDI files, and linux and java computer code.
The algorithm was 85.8 percent accurate in distinguishing the
genres of 36 classical, jazz and rock pieces. The program placed some
of the rock music in a separate group near the main rock music group,
and classified a pair of Bach preludes and a Chopin piece in the other
The apparent misclassifications could be errors of the program,
or could reveal less apparent likenesses among specific pieces in different
genres, according to Cilibrasi.
The method was more accurate at sorting music by composer within
genres, especially with smaller groups of music. It was 95.8 percent accurate
in sorting 12 Bach, Chopin and Debussy pieces; 89.5 percent accurate in
sorting 32 pieces from Bach, Chopin, Debussy and Haydn; 84.4 percent accurate
when 28 pieces from Beethoven, Buxtehude and Mozart pieces were added
to the mix; and 86 percent accurate in sorting 34 symphonies from Haydn,
Mozart, Beethoven, Schubert and Saint-Saens.
Applying methods from computational biology and information theory
to music is not new, said Elaine Chew, an assistant professor of industrial
and systems engineering at the University of Southern California. The
challenge, however, is finding a practical way to implement the theory,
The particular metrics the researchers chose have had previous
success in building phylogeny trees to help in classifying mitochondrial
genes and Eurasian languages, Chew added. "What is new is the application
of these methods to music classification," she said.
Further work needs to be done in order to better understand what
it means for two pieces to be close according to the researchers' metrics,
said Chew. In addition, it appears that the method doesn't work well with
larger data sets, she said. This "would limit its applicability to the
classification of today's digital music libraries," she said.
This type of research, however, does have potential practical
applications, said Chew. "Similarity metrics are a critical part of query-by-humming
and... query-by-example systems," she said. These types of metrics also
form the core of music search and recommendations systems for personalized
music applications, she said.
The basic software works to detect patterns that appear meaningful,
said Cilibrasi. "We still lack... a clear idea of the best way to practically
use these techniques," he said. "We think [they] need to be applied to
a larger variety of subject areas."
The researchers are currently working on making the algorithm
more consistent, and developing better methods of visualizing the information
that the algorithm brings to light, said Cilibrasi. "A binary tree is
not the only way to visualize this information. We may consider using
arbitrary trees or other pictures," he said.
The basic method could also be applied to disciplines like art
history and forensics, said Cilibrasi.
Cilibrasi and Vitanyi's research colleague was Ronald de Wolf at
CWI. The research was funded by the Netherlands Organization for Scientific
TRN Categories: Databases and Information Retrieval
Story Type: News
Related Elements: Technical paper, "Algorithmic Clustering
of Music," posted on the Computer Research Repository (CoRR) at arxiv.org/abs/cs.SD/0303025;
"The Similarity Metric," presented at the 14th Association for Computing
Machinery-Society for Industrial and Applied Mathematics (ACM-SIAM) Symposium
on Discrete Algorithms, January 12-14, 2003 and posted at www.cwi.nl/~paulv/selection.html