Software sorts tunes TRN 042303


	Software sorts tunes		April 23/30, 2003
	By Kimberly Patch, Technology Research News		Page One Nanocomputer skips clock DNA motor keeps cranking Software sorts tunes Silver bits channel nano light News briefs: Tiny drug capsules shine Degree of difference sorts data Casting yields non-carbon nanotubes Material makes backwards lens Juiced liquid jolts metal into shapes Nanotube web could mimic brain
	By Kimberly Patch, Technology Research News
	Exactly what is it about The Fine Young Cannibals that is similar to Roxy Music? Many of today's online retailers use the opinions of lots of people to suggest selections that are similar to music a customer has purchased. The challenge of automating the process comes down to teaching a computer how to tell that a piece of music written by Mozart sounds like music written by Mozart. Researchers from the Center for Mathematics and Computer Science (CWI) and the University of Amsterdam in the Netherlands have tapped computational biology and information theory to find a way to automatically compare music files to determine how similar they are. The work is the latest application of a universal similarity metric that the researchers have used to construct evolutionary trees using mitochondrial genes of different animal species, put together language trees of 52 Eurasian tongues, and detect plagiarism in student programming assignments. The researchers' latest application shows that it could be used to categorize music, and to determine the true origin of music authorship. "Our method works well on abstract symbolic data, that is, data without a direct physical interpretation," said Rudi Cilibrasi, a researcher at CWI. The method uses a universal similarity metric that measures the information distance, or differences between two files, as a number between zero and one. "Zero means they're the same file, and one means they have no relation whatsoever. Most file pairs fall somewhere in between," said Cilibrasi. The universal metric measures how easily one file can be compressed using the information contained in the second file. Compression algorithms are regularly used to remove redundant information from files to make the files smaller. "Two files are closer if one can be compressed [more fully] given the information in the other," said Cilibrasi. This method takes into account all similarities between a pair of sequences, according to Paul Vitanyi, a fellow at CWI and professor of computer science at the University of Amsterdam. "It is universal in that it discovers all effective similarities," he said. A human expert will categorize pieces of music by searching for specific similarities related to pitch, rhythm or harmony, according to Vitanyi. Existing systems that automatically classify music do the same, looking, for instance, for similar rhythms. The CWI researchers' universal metric approach does not look for particular features, according to Vitany. This means the clustering takes into account features that are not apparent, said Vitanyi. "Existing systems tend to use... very particular features to do the clustering," he said. "Our theory tells us we -- ideally -- account... for all features [including] the many arbitrary ones we don't know and even cannot know," he said. The method could be used to classify music in commercial applications, said Cilibrasi. It could "enable better cataloging and indexing based on people's preferences... without the help of pre-existing purchase records or other human input," he said. The researchers' method has two steps. The researchers first compared a group of MIDI files and calculated the information distance between every pair of files. Then they used a hill-climbing algorithm to organize the differences among files into a branching tree, with closer files situated closer on the tree, said Cilibrasi. "If two files have a short distance in the [distance] matrix, then they should also be close together in the tree," he said. A hill-climbing algorithm starts at a random place and aims to improve the solution at each step. A possible solution to a problem can be thought of as a hill, and many possible, unconnected solutions a group of hills. In order to keep the hill-climbing algorithm from getting stuck on a small hill, the researchers added a randomized escape sequence. To speed the process, they modified the hill-climbing algorithm so that it examined many files in parallel. This enabled them to categorize larger groups of music than was previously possible, said Cilibrasi. The method was able to handle 60 items, but works best with sets of 20 to 30 items Vitanyi said. The researchers tested the method by having it categorize increasingly similar files. The program correctly classified a group of 12 very different types of files: gene sequences, excerpts from a novel, classical MIDI files, jazz MIDI files, and linux and java computer code. The algorithm was 85.8 percent accurate in distinguishing the genres of 36 classical, jazz and rock pieces. The program placed some of the rock music in a separate group near the main rock music group, and classified a pair of Bach preludes and a Chopin piece in the other genres. The apparent misclassifications could be errors of the program, or could reveal less apparent likenesses among specific pieces in different genres, according to Cilibrasi. The method was more accurate at sorting music by composer within genres, especially with smaller groups of music. It was 95.8 percent accurate in sorting 12 Bach, Chopin and Debussy pieces; 89.5 percent accurate in sorting 32 pieces from Bach, Chopin, Debussy and Haydn; 84.4 percent accurate when 28 pieces from Beethoven, Buxtehude and Mozart pieces were added to the mix; and 86 percent accurate in sorting 34 symphonies from Haydn, Mozart, Beethoven, Schubert and Saint-Saens. Applying methods from computational biology and information theory to music is not new, said Elaine Chew, an assistant professor of industrial and systems engineering at the University of Southern California. The challenge, however, is finding a practical way to implement the theory, she said. The particular metrics the researchers chose have had previous success in building phylogeny trees to help in classifying mitochondrial genes and Eurasian languages, Chew added. "What is new is the application of these methods to music classification," she said. Further work needs to be done in order to better understand what it means for two pieces to be close according to the researchers' metrics, said Chew. In addition, it appears that the method doesn't work well with larger data sets, she said. This "would limit its applicability to the classification of today's digital music libraries," she said. This type of research, however, does have potential practical applications, said Chew. "Similarity metrics are a critical part of query-by-humming and... query-by-example systems," she said. These types of metrics also form the core of music search and recommendations systems for personalized music applications, she said. The basic software works to detect patterns that appear meaningful, said Cilibrasi. "We still lack... a clear idea of the best way to practically use these techniques," he said. "We think [they] need to be applied to a larger variety of subject areas." The researchers are currently working on making the algorithm more consistent, and developing better methods of visualizing the information that the algorithm brings to light, said Cilibrasi. "A binary tree is not the only way to visualize this information. We may consider using arbitrary trees or other pictures," he said. The basic method could also be applied to disciplines like art history and forensics, said Cilibrasi. Cilibrasi and Vitanyi's research colleague was Ronald de Wolf at CWI. The research was funded by the Netherlands Organization for Scientific Research (NWO). Timeline: unknown Funding: Government TRN Categories: Databases and Information Retrieval Story Type: News Related Elements: Technical paper, "Algorithmic Clustering of Music," posted on the Computer Research Repository (CoRR) at arxiv.org/abs/cs.SD/0303025; "The Similarity Metric," presented at the 14th Association for Computing Machinery-Society for Industrial and Applied Mathematics (ACM-SIAM) Symposium on Discrete Algorithms, January 12-14, 2003 and posted at www.cwi.nl/~paulv/selection.html
	Page One Archive Glossary Resources Research Directory Newsletters and Reports TRN Store Feedback Letters Comments Find out about TRN Services for Web sites and print publications. About TRN

	For permission to reprint or republish this article, please email trn@trnmag.com. © Copyright Technology Research News, LLC 2000-2003. All rights reserved.