======================
=    MAR Workshop    =
=     20/07/2007     =
= Glasgow University =
======================

1) Keynote - Alan Smeaton (DCU)
-------------------------------

Title: Video Summarisation: A New Research Challenge?

Content summarisation requires content analysis, indexing, comparison/clustering, highlight detection, duplicate and near-duplicate detection, redundancy elimination ... and then it requires creation/packaging of the output!

Video summarisation: keyframes play a key role!
- a movie trailer is one kind of summary
- a summary of a sports event is another kind, maintaining chronological sequences
- ...
BUT: what motivates the inclusion of a fragment in the summary depends heavily on the video genre: sports, movie/TV, news, rushes, personal content.

Some commercial summarisation: Mitsubishi with a PVR.

TRECVid summarisation: the goal is to compare systems on the same dataset.
- Formal workshop at ACM MM this year.
- 100 hours of rushes.
- Task: create a summary of at most 4% of the original duration, no interaction, just playback; eliminate redundancy and maximise viewers' efficiency at recognising objects and events as quickly as possible.

Evaluating summaries: how do you evaluate a summary?
- ground truth made for the 42 summaries, normalised from 24 on average down to 12 important items to detect per summary
- 3 assessors looking at each summary and comparing it with the ground truth
- measures:
  . fraction of the (12 items of) ground truth found;
  . ease of use and amount of near-redundancy, as judged by assessors;
  . assessment time taken;
  . summary duration;
  . summary creation compute time;
- 22 groups from 13 countries completed submissions; system papers were received 10 days ago.

Final results: it seems that systems using really complex techniques perform worse than the naive ones (such as "take a frame every 24" :-) ... a sketch of that kind of baseline follows below.

Conclusion:
- this is the first large-scale, multi-participant evaluation of video summarisation
- the summarisation is tied to the nature of the data, i.e. TV series rushes ...
- ... BUT the techniques are not generalisable to other kinds of rushes or to non-rushes
- conclusion: as for text, simple baselines work as well as sophisticated techniques for video too!
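
(Aside: a minimal sketch of what such a naive uniform-sampling baseline could look like, in Python with OpenCV. The file name and the sampling rate are placeholders for illustration, not details from the talk.)

    # Naive uniform-sampling summariser: keep one frame out of every N.
    # A sketch only; "rushes.mpg" and SAMPLE_EVERY are placeholder values.
    import cv2

    SAMPLE_EVERY = 24  # keep one frame out of every 24

    def uniform_keyframes(path):
        cap = cv2.VideoCapture(path)
        keyframes = []
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:                     # end of the video stream
                break
            if index % SAMPLE_EVERY == 0:  # uniform sampling, no analysis
                keyframes.append(frame)
            index += 1
        cap.release()
        return keyframes

    frames = uniform_keyframes("rushes.mpg")
    print(len(frames), "keyframes kept")
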
2) Provocative Talk - Steve Renals (Uni. of Edinburgh)
------------------------------------------------------

Title: Communication Scene Analysis / Interpreting Communication Scenes

A tag cloud of core technologies is displayed: the techniques that must be improved for TRECVid are "speech transcription", "summarization", "focus of attention", "social networks", "video editing", "hotspot detection", "decision point detection", etc.

Challenges: processing archives, interactions in (close-to) real time, representing the content of a communicative interaction (traditional linguistic stuff, interactive and social signals, social dynamics, the environment).

See: http://corpus.amiproject.org

3) Provocative Talk - Ebroul Izquierdo (QMUL)
---------------------------------------------

Title: Face detection and recognition: achievements and open research challenges

Visual information retrieval is too broad and too difficult! Constraining the application domain usually leads to better results. However, how do you measure the relevance of an application domain, and how do you find the killer application?

Nice presentation of how humans versus machines perform at recognising faces:
- humans recognise cartoons, sketches ... machines don't
- machines do not care if the image is reversed ... humans do
- humans do not have memory ... machines do
- humans need 5 ms to identify a face; machines identify 5 faces in 1 ms

Consequently, the challenges are:
- detection: is it the most difficult challenge? (a sketch of the classical approach follows below)
- recognition (matching faces): is it easier?

Where are we now?
- false positives are close to 0% in good conditions
- false negatives are < 9% in good conditions
- detection failure rates are still higher than 12%

Why are the results of detection/recognition presented in PAMI 10, 15 and 20 years ago very similar to those reported in 2007? By adding more and more complex (and incremental) models, are we getting closer to the solution?
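
(Aside: the classical route into the detection challenge is a Viola-Jones style cascade; a minimal sketch using OpenCV's bundled Haar cascade. The image name and parameter values are placeholders, and this illustrates the general technique, not any system from the talk.)

    # Classical face detection (Viola-Jones Haar cascade) with OpenCV.
    # A sketch only; "photo.jpg" is a placeholder image.
    import cv2

    # The frontal-face cascade ships with OpenCV installations.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("photo.jpg")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # scaleFactor / minNeighbors trade recall against false positives,
    # the same trade-off behind the numbers quoted above.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        print("face at", (x, y), "size", (w, h))
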
4) Provocative Talk - Stefan Ruger (KMI, Open University)
---------------------------------------------------------

Title: Can Multimedia Retrieval get away with NOT using metadata?

Everything is in the NOT!
- Not provocative for the multimedia analysis community: they don't care about metadata!
- Provocative for librarians: they spend their lives annotating!

The IR community distinguishes metadata from annotations.

Experiments on Flickr photos show that visual analysis + tags gives better results.

5) Provocative Talk - Alex Hauptmann (CMU, USA)
-----------------------------------------------

Title: Challenges in Cross-modal Analysis

Broad view of cross-modal analysis: not only audio, video, text ... but also background knowledge, any kind of sensor data, etc.

Really interesting:
- Event Search from R. Jain !!!
- To look for: nice interface for displaying news items:
  . Time: calendar + now (big, middle of the screen), recent past (smaller fonts, bottom of the screen), future (smaller font, top of the screen)
  . Place: world map
  . Results clustered by general topic ...

Challenges:
- Exploiting the next level of scale
- Issue: knowing when we don't know an answer

Break-out Session - Semantics and Interaction:
----------------------------------------------

Chair: Raphael
Participants: Joemon Jose, Georgina, Stefan Ruger, Simon Tucker, Steve Renals, Thierry Declerk

We are looking at the problem from a content analysis and retrieval perspective: semantic interaction is commonly put forward as a solution to the semantic gap, or to the generic CBIR problem. So basically what we should do is discuss its role, research challenges, issues, etc.

Questions:
- How can user interaction (browsing? relevance feedback?) help to infer more high-level semantics for indexing multimedia?
- In the TRECVid Interactive Search Task, how many systems learn from the user interaction?
  . the user sees the results and changes the query ... the user is learning, but the system is not, in the interactive search task
  . challenge: mine the user interaction
- Conversely, how can high-level semantics enhance the interaction with multimedia?
  . it seems easier to adapt interfaces when the semantics of the multimedia content is known

Challenge 1:
- Guess the information-seeking goal of the user: change the presentation of the results = display a summary, a longer summary, a ranked list of results, a map, etc.
  . Examples: Google has pre-processed queries, Google "I'm Feeling Lucky"

Challenge 2:
- Use semantics for dividing the search space and clustering the search results
  . auto-completion: find the type of an object (e.g. is "apple" a fruit or a computer?)
  . suggestion: expand the query with other terms

Challenge 3:
- How to log and mine the user interaction
  . explicit relevance feedback vs implicit (active learning)
  . not successful for text, but it might be different for multimedia, as the results are really bad without feedback

Challenge 4:
- Set up evaluation campaigns, for example for really assessing the performance of face recognition
- The problem with user-satisfaction evaluation is that it is difficult to repeat and to compare results between systems
  . too many variables; user interface evaluation is too expensive!
  . we need other ways of measuring, beyond TRECVid

Personal photo management:
- Multimodality: GPS, timestamps, any kind of contextual information
- Face recognition: you tag perhaps 20 names, and these people will keep showing up in your photos; the system just has to find the similar faces (see the sketch below)
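
(Aside: a minimal sketch of that find-the-similar-faces idea as nearest-neighbour matching over face descriptors. The random vectors below stand in for real face embeddings; the descriptor method itself is assumed, not taken from the discussion.)

    # Nearest-neighbour face tagging: given a handful of tagged faces,
    # label a new face by cosine similarity to the tagged descriptors.
    # A sketch only: the 128-d random vectors are placeholders for real
    # face descriptors from whatever embedding method is available.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def tag_face(query, tagged):
        """tagged: list of (name, descriptor) pairs from the user's ~20 labels."""
        return max(tagged, key=lambda nv: cosine(query, nv[1]))[0]

    # Usage with placeholder descriptors:
    rng = np.random.default_rng(0)
    tagged = [("alice", rng.normal(size=128)), ("bob", rng.normal(size=128))]
    print(tag_face(rng.normal(size=128), tagged))
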