Market-Watch
Information and overviews on new products and advances in multimedia and multilingual content processing systems.
NTT creates World-Wide Media Browser: Multilingual Audio-visual Content Retrieval and Browsing System
This article from NTT's Website describes the World-Wide Media Browser (WWMB) being developed at NTT Communication Science Laboratories. The aim of their system is to provide users with easy access to a large amount of multilingual audio/visual content via translingual techniques. If it works as well as they hope, it stands to be a great leap forward. (excerpts, and a screen shot of the system with descriptions superimposed)

In recent years, it has become much easier for people to listen to music and watch videos around the world through the Internet. Since many Internet video services provide movies and TV shows by using streaming technology, users can watch the videos they want to see anywhere and at any time. Furthermore, video-sharing sites such as YouTube™ not only provide a lot of videos, but also give end users the opportunity to publish their own videos. Such user-generated videos are being uploaded every day by a large but unspecified number of people, so the number of publicly available videos is increasing rapidly. However, it is not easy for ordinary viewers to find a specific scene among these millions of videos. For example, in most video-sharing sites, the clue to finding a scene is only a few representative terms attached to each file by the file’s contributor. When users try to find a desired scene, they type some keywords into the corresponding search engine, but they usually receive a long list of video files whose pre-attached terms match the keywords. Finally, they have to waste time playing each video to check if the target scene is included. At the same time, the wide range of languages is also a large problem. Although most people can access videos from all over the world, it is very difficult to find and watch videos in foreign languages even if the content is of interest because the keywords do not match any representative terms in foreign languages.
The prototype can currently handle Japanese and English content. By using the browser interface, for example, a Japanese user can find and watch videos in English with Japanese queries and subtitles. When the user enters a (Japanese) keyword or key phrase on the query input form, scene candidates that match the query are listed as the search results, showing where the query term or its (English) translation is spoken in each listed scene. The user can play a video that has a matching scene by selecting it from the list. The media player plays the video with multilingual (English and Japanese) subtitles, which have already been provided automatically by the speech recognition and machine translation subsystems. Thus, users can easily find, watch, and understand foreign-language videos in their own language.
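The query flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not NTT's implementation: the bilingual lexicon `JA_EN`, the scene records in `SCENES`, and the functions `expand_query` and `find_scenes` are all hypothetical, and plain substring matching stands in for the real search server.

```python
# Hypothetical bilingual lexicon; the real system uses machine translation.
JA_EN = {
    "音声認識": ["speech recognition"],
    "機械翻訳": ["machine translation"],
}

# Hypothetical scene records: recognized English speech plus its Japanese subtitle.
SCENES = [
    {"video": "lecture01", "start": 12.0,
     "en": "Speech recognition converts audio into text.",
     "ja": "音声認識は音声をテキストに変換します。"},
    {"video": "lecture01", "start": 47.5,
     "en": "Next we look at machine translation.",
     "ja": "次に機械翻訳を見ていきます。"},
]


def expand_query(query):
    """Return the query itself plus any translations found in the lexicon."""
    return [query] + JA_EN.get(query, [])


def find_scenes(scenes, query):
    """List (video, start time, matched variants) for scenes that contain
    the query or one of its translations."""
    variants = [v.lower() for v in expand_query(query)]
    hits = []
    for scene in scenes:
        text = (scene["en"] + " " + scene["ja"]).lower()
        matched = [v for v in variants if v in text]
        if matched:
            hits.append((scene["video"], scene["start"], matched))
    return hits
```

Calling `find_scenes(SCENES, "音声認識")` returns the scene starting at 12.0 seconds, matched both by the Japanese term and by its English translation, which is the information a result list needs in order to show where the term is spoken.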
The browser also provides a list of named entities (NEs) that are spoken in the video. NEs are keywords such as proper names. The user can check the meaning of the spoken NEs via their hyperlinks to a web search engine. The WWMB assumes that all videos have been processed in advance by our technologies. Its content processing part consists of video content collection, speech recognition, language processing, and machine translation. The language processing module includes sentence boundary detection and NE extraction from speech recognition results. The search server provides fast retrieval of video files and scenes.
Speech recognition and machine translation results are stored in the annotation database together with the timestamps for inserting subtitles during video playback. An index table is also constructed for the annotation data so that the content can be searched efficiently with complex queries. Important keywords (NEs, etc.) are also stored and used to characterize each video scene. These technologies are explained in more detail in the next section.
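As a rough illustration of how timestamped annotations and an index table fit together, here is a minimal sketch. The `Segment` schema, the sample data, and the function names are assumptions made for this example; the article does not describe the actual database layout. The Japanese sample text is pre-segmented with spaces because the toy tokenizer only splits on whitespace.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One subtitle-sized unit of annotation (hypothetical schema)."""
    video_id: str
    start: float          # seconds from the start of the video
    end: float
    source_text: str      # speech recognition output
    translated_text: str  # machine translation output


def tokenize(text):
    # Naive whitespace tokenizer; real Japanese text needs a morphological analyzer.
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]


def build_index(segments):
    """Map each term to the positions of the segments that contain it."""
    index = defaultdict(set)
    for pos, seg in enumerate(segments):
        for term in tokenize(seg.source_text) + tokenize(seg.translated_text):
            index[term].add(pos)
    return index


def search(index, segments, query):
    """Return segments containing every query term, ordered by video and start time."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted((segments[i] for i in hits), key=lambda s: (s.video_id, s.start))


def subtitle_at(segments, video_id, t):
    """Return the (source, translation) pair to show at playback time t, or None."""
    for seg in segments:
        if seg.video_id == video_id and seg.start <= t < seg.end:
            return seg.source_text, seg.translated_text
    return None


# Hypothetical sample annotations.
SEGMENTS = [
    Segment("lec1", 0.0, 4.0, "Today we study Fourier analysis.",
            "今日 は フーリエ 解析 を 学び ます"),
    Segment("lec1", 9.5, 13.0, "The Fourier transform is linear.",
            "フーリエ 変換 は 線形 です"),
    Segment("lec2", 3.0, 6.0, "Linear systems are everywhere.",
            "線形 システム は どこ に でも あり ます"),
]
INDEX = build_index(SEGMENTS)
```

Because the lookup intersects the posting sets of all query terms, a multi-term query such as `search(INDEX, SEGMENTS, "Fourier linear")` narrows the results to the one segment that mentions both, and `subtitle_at` uses the same timestamps to pick the subtitle pair for a given playback position.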
Core technologies in WWMB
The World-Wide Media Browser essentially requires highly accurate methods for speech recognition, language processing, and machine translation because a few speech recognition errors induce language processing errors, and these errors result in more errors in machine translation; that is, even a few errors have a big impact on the final system output. In addition, a large amount of data should be processed efficiently in a short computation time.
Automatic speech recognition (ASR) technology has many applications including the control of consumer electronic devices, an interactive voice response interface in telephone services, and automatic generation of meeting minutes. To make ASR effective for various applications, we are working to improve speech analysis, model training, search, and backend processing algorithms.
The language processing module in the WWMB detects sentence boundaries and extracts NEs. Sentence boundary detection (SBD) is indispensable for passing speech recognition results to subsequent language processing, such as machine translation, because most natural language processing methods assume that input text arrives sentence by sentence. Since the punctuation that appears in written text does not exist in speech recognition results, the boundaries must be detected automatically. However, since sentence boundaries are usually ambiguous in spoken language, it is not easy to find the correct ones.
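To make the problem concrete, here is a deliberately simple sketch that places a boundary wherever the silence between two recognized words is long enough. The article does not say how the WWMB detects boundaries, so the pause heuristic, the `min_pause` threshold, and the input format of `(token, start, end)` tuples are all assumptions chosen for illustration.

```python
def split_sentences(words, min_pause=0.5):
    """Group recognized words into sentences using inter-word pauses.

    words: list of (token, start_sec, end_sec) tuples in spoken order.
    A boundary is placed after a word when the gap before the next word
    is at least min_pause seconds, and after the final word.
    """
    sentences, current = [], []
    for i, (token, _start, end) in enumerate(words):
        current.append(token)
        is_last = i == len(words) - 1
        if is_last or words[i + 1][1] - end >= min_pause:
            sentences.append(current)
            current = []
    return sentences
```

A speaker who pauses mid-sentence, or who runs two sentences together without a pause, defeats this heuristic, which is one way to see why boundaries in spoken language are ambiguous.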
NE extraction identifies keywords from a document, such as proper names and expressions for dates, times, and quantities. NEs hold important information that is used for metadata, information extraction, and question answering. NE extraction can be regarded as the problem of classifying a word or a compound word into an NE class (person, location, date, time, etc.) or a not-NE class.
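The classification view of NE extraction can be sketched with a toy example. The word lists and regular expressions below are hypothetical stand-ins for a classifier trained on annotated data, and nothing here reflects the method actually used in the WWMB; the label "O" plays the role of the not-NE class.

```python
import re

# Hypothetical toy gazetteers; a real system would learn these distinctions.
PERSON = {"suzuki", "smith"}
LOCATION = {"tokyo", "boston"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
YEAR_RE = re.compile(r"\d{4}")
TIME_RE = re.compile(r"\d{1,2}:\d{2}")


def classify(token):
    """Assign one token to an NE class, or 'O' for the not-NE class."""
    t = token.lower()
    if t in PERSON:
        return "PERSON"
    if t in LOCATION:
        return "LOCATION"
    if t in MONTHS or YEAR_RE.fullmatch(t):
        return "DATE"
    if TIME_RE.fullmatch(t):
        return "TIME"
    return "O"


def extract_entities(tokens):
    """Classify each token and merge adjacent tokens of the same class
    into one compound entity, returned as (text, label) pairs."""
    entities = []
    prev_label = "O"
    for token in tokens:
        label = classify(token)
        if label != "O":
            if label == prev_label:
                text, _ = entities[-1]
                entities[-1] = (text + " " + token, label)
            else:
                entities.append((token, label))
        prev_label = label
    return entities
```

For the made-up token sequence `"Suzuki arrived in Boston in March 2008 at 10:30".split()`, the sketch yields a person, a location, the compound date "March 2008", and a time, while the remaining words fall into the not-NE class.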
This article introduced the World-Wide Media Browser (WWMB) being developed at NTT Communication Science Laboratories. This system provides users with easy access to a large amount of multilingual audio/visual content by using translingual techniques. We evaluated the system using real lecture videos recorded at MIT and obtained good ASR and statistical machine translation (SMT) results that could be used to help Japanese viewers understand the gist of English lectures. Future work involves improving ASR, SBD, and SMT, and closely coupling these techniques to further improve the WWMB functions.