JDIM

Volume 3 Issue 1 March 2005

Automatically Building a Stopword List for an Information Retrieval System

Rachel Tsz-Wai Lo, Ben He, Iadh Ounis


Abstract Words that occur frequently in a document but are meaningless in terms of Information Retrieval (IR) are called stopwords. It is widely claimed that stopwords contribute nothing to the content or information of documents, and that an IR system should therefore remove them during indexing as well as before querying. However, using a single fixed stopword list across different document collections can be detrimental to retrieval effectiveness. This paper presents different methods for deriving a stopword list automatically for a given collection and evaluates them on four standard TREC collections. In particular, a new approach, called term-based random sampling, is introduced, based on the Kullback-Leibler divergence measure. This approach determines how informative a term is and hence enables us to derive a stopword list automatically. The new approach is then compared to various classical approaches based on Zipf's law, which we use as our baselines. Results show that the stopword lists derived by the Zipf's-law-based methods are reliable, but these methods are very expensive to run. In contrast, the computational effort needed to derive stopword lists with the new approach is minimal compared to the baselines, while achieving comparable performance. Finally, we show that a more effective stopword list can be obtained by merging the classical stopword list with the stopword lists generated by either the baselines or the newly proposed approach.
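The term-based random sampling idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: document sampling is simplified to a uniform random draw over the whole collection, and the function names, parameters, and threshold choices (`kl_term_weights`, `stopword_candidates`, `sample_size`, `k`) are all invented here.

```python
import math
import random
from collections import Counter

def kl_term_weights(collection, sample_size=3, seed=0):
    """Weight each term by a Kullback-Leibler-style divergence between
    its distribution in a random document sample and its distribution
    in the whole collection. Terms whose sample distribution matches
    the collection distribution carry little information and are
    therefore stopword candidates."""
    rng = random.Random(seed)
    coll_counts = Counter(t for doc in collection for t in doc)
    coll_total = sum(coll_counts.values())
    sample = rng.sample(collection, min(sample_size, len(collection)))
    samp_counts = Counter(t for doc in sample for t in doc)
    samp_total = sum(samp_counts.values())
    weights = {}
    for term, n in samp_counts.items():
        p_x = n / samp_total                   # P(term | sampled documents)
        p_c = coll_counts[term] / coll_total   # P(term | whole collection)
        weights[term] = p_x * math.log2(p_x / p_c)
    return weights

def stopword_candidates(collection, k=5):
    """Return the k least informative (lowest-weighted) sampled terms."""
    w = kl_term_weights(collection)
    return [t for t, _ in sorted(w.items(), key=lambda kv: kv[1])][:k]
```

When the sample covers the whole collection, every term's sample and collection distributions coincide and all weights collapse to zero; with smaller samples, content-bearing terms that are over-represented in the sample receive higher weights than ubiquitous function words.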


Blueprint of a Cross-Lingual Web Retrieval Collection

Börkur Sigurbjörnsson, Jaap Kamps, Maarten de Rijke


Abstract The world wide web is a natural setting for cross-lingual information retrieval; web content is essentially multilingual, and web searchers are often polyglots. Even though English has emerged as the lingua franca of the web, planning a business trip or holiday usually involves digesting pages in a foreign language. The same holds for searching for information about European culture, sports, economy, or politics. This paper discusses the blueprint of the WebCLEF track, a new evaluation activity addressing cross-lingual web retrieval within the Cross-Language Evaluation Forum in 2005.


Extracting Temporal Information from Open Domain Text: A Comparative Exploration

David Ahn, Sisay Fissaha Adafre, Maarten de Rijke


Abstract The utility of data-driven techniques in the end-to-end problem of temporal information extraction is unclear. Recognition of temporal expressions yields readily to machine learning, but normalization seems to call for a rule-based approach. We explore two aspects of the (potential) utility of data-driven methods in the temporal information extraction task. First, we look at whether improving recognition beyond the rule base used by a normalizer has an effect on normalization performance, comparing normalizer performance when fed by several recognition systems. We also perform an error analysis of our normalizer’s performance to uncover aspects of the normalization task that might be amenable to data-driven techniques.
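As a toy illustration of the rule-based side of such a pipeline, the sketch below recognizes "Month YYYY" temporal expressions with a regular expression and normalizes them to ISO-8601. It is a deliberate simplification, not the recognizers or normalizer evaluated in the paper; the function name and pattern are invented here.

```python
import re

# Map month names to their numbers for normalization.
MONTHS = {m: i for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

def recognize_and_normalize(text):
    """Recognize 'Month YYYY' expressions and normalize each one to an
    ISO-8601 'YYYY-MM' value; returns (surface form, normalized) pairs."""
    out = []
    for m in re.finditer(r"\b([A-Z][a-z]+)\s+(\d{4})\b", text):
        month = MONTHS.get(m.group(1).lower())
        if month:  # skip capitalized words that are not month names
            out.append((m.group(0), f"{m.group(2)}-{month:02d}"))
    return out
```

A real normalizer must also handle relative expressions ("last Tuesday", "two weeks ago"), which require an anchoring document date; that context dependence is one reason normalization resists straightforward machine learning.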


Hierarchical topic detection in large digital news archives: Exploring a sample-based approach

Dolf Trieschnigg, Wessel Kraaij


Abstract Hierarchical topic detection (HTD) is a new task in the TDT 2004 evaluation program, which aims to organize a collection of unstructured news data in a directed acyclic graph (DAG) structure reflecting the topics discussed in the collection, ranging from rather coarse, category-like nodes to fine singular events. The HTD task poses interesting challenges, since its evaluation metric combines a travel cost component, reflecting the time to find the node of interest starting from the top node, with a quality cost component, determined by the quality of the selected node. We present a scalable architecture for HTD and compare several alternative choices for agglomerative clustering and DAG optimization in order to minimize the HTD cost metric. The alternatives are evaluated on the TDT3 and TDT5 test collections.
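The agglomerative clustering step can be illustrated with the naive sketch below: a greedy merge of bag-of-words clusters by cosine similarity until no pair is similar enough. This is not the paper's scalable architecture (the naive algorithm is cubic in the number of documents, which is exactly why sampling is needed at scale), and the function names and threshold are assumptions made here.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def agglomerative(docs, threshold=0.2):
    """Greedy agglomerative clustering: repeatedly merge the two most
    similar clusters (represented by summed term vectors) until no
    pair's similarity exceeds the threshold. Returns lists of the
    document indices in each cluster."""
    clusters = [(Counter(d), [i]) for i, d in enumerate(docs)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i][0], clusters[j][0])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best[0] < threshold:
            break  # no remaining pair is similar enough to merge
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return [members for _, members in clusters]
```

Stopping the merges at different thresholds yields the different levels of granularity, from singular events up to coarse categories, that the DAG structure is meant to capture.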


A Ground Truth For Half A Million Musical Incipits

Rainer Typke, Marc den Hoed, Justin de Nooijer, Frans Wiering, Remco C. Veltkamp


Abstract: Musical incipits are short extracts of scores, taken from the beginning. The RISM A/II collection contains about half a million of them. This large collection size makes a ground truth very interesting for the development of music retrieval methods, but at the same time makes it very difficult to establish one: human experts cannot be expected to sift through half a million melodies to find the best matches for a given query. For 11 queries, we filtered the collection so that about 50 candidates per query were left, which we then presented to 35 human experts for a final ranking. We present our filtering methods, the experiment design, and the resulting ground truth. To obtain ground truths, we ordered the incipits by the median ranks assigned to them by the human experts. For every incipit, we used the Wilcoxon rank sum test to compare the list of ranks assigned to it with the lists of ranks assigned to its predecessors. As a result, we know which rank differences are statistically significant, which gives us groups of incipits whose correct ranking we know. This ground truth can be used for evaluating music information retrieval systems. A good retrieval system should order the incipits in a way that does not violate the order of the groups we identified, and it should include all high-ranking melodies that we found. It might, however, find additional good matches, since our filtering process is not guaranteed to be perfect.
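The rank-comparison step can be sketched with a minimal, stdlib-only version of the two-sided Wilcoxon rank-sum test using the normal approximation (no tie-variance correction). This is an assumed simplification of the statistics actually used in the paper, with an invented function name; in practice one would call a statistics library.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    x and y are two lists of ranks (e.g. the ranks experts assigned to
    two incipits); returns (z statistic, p-value) under the null
    hypothesis that both lists come from the same distribution."""
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # Assign midranks: tied values share the average of their positions.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    w = sum(rank_of[v] for v in x)            # rank sum of first sample
    mu = n1 * (n1 + n2 + 1) / 2.0             # its mean under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal p-value
    return z, p
```

Adjacent incipits whose rank lists yield a non-significant p-value are placed in the same group, since their relative order in the ground truth cannot be trusted.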


Towards Automatic Formulation of a Physician's Information Needs

Loes Braun, Floris Wiesman, Jaap van den Herik


Abstract The goal of this paper is to contribute to improving the quality of care by providing physicians with patient-related literature. For physicians, it is a problem that they are often unaware of gaps in their knowledge and of the corresponding information needs. Our research aim is to resolve this problem by formulating information needs automatically. Based on these information needs, patient-related literature can be retrieved. In this paper, we investigate how to model a physician's information needs. Thereafter, we design and analyze an approach to instantiating the model with patient data, resulting in information-need templates that are able to represent patient-related information needs. Since the patient data are in Dutch, we developed a translation mechanism for translating Dutch terms into standard English terminology. From our experiments it is clear that a physician's information needs can be modelled adequately and can be instantiated into patient-related information needs. As an aside, we have shown that the automatic translation mechanism is not sufficiently effective in comparison with a manual translation mechanism. The usability of our information-need formulation approach is demonstrated by performing a literature retrieval based on the formulated information needs. Since the number of formulated information needs is rather high, methods have to be developed that restrict the set of automatically formulated information needs to a more specialized set.
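The template-instantiation idea can be sketched as follows: each information-need template has slots that are filled from the patient record, and a template is only instantiated when the record supplies every slot. The templates, slot names, and function name below are hypothetical examples, not the paper's actual model.

```python
def instantiate_templates(templates, patient):
    """Fill information-need templates with fields from a patient
    record (a dict); skip any template whose slots the record cannot
    fill. Returns the list of concrete information needs."""
    needs = []
    for tpl in templates:
        try:
            needs.append(tpl.format(**patient))
        except KeyError:
            continue  # the record lacks a slot this template requires
    return needs
```

For example, a record containing a drug and a condition instantiates a drug-effect template but not a procedure template, which is one simple way the patient data itself restricts which information needs are formulated.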


Using a Reference Corpus as a User Model for Focused Information Retrieval

Gilad Mishne, Maarten de Rijke, Valentin Jijkoun


Abstract We propose a method for ranking short information nuggets extracted from a text corpus, using another, reliable reference corpus as a user model. We argue that the availability and usage of such additional corpora is common in a number of IR tasks, and apply the method to answering a form of definition questions. The proposed ranking method substantially improves the performance of our system. Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—information filtering, search process; H.3.4 [Information Storage and Retrieval]: Systems and Software—question-answering (fact retrieval) systems; I.2.1 [Artificial Intelligence]: Applications and Expert Systems; I.2.7 [Artificial Intelligence]: Natural Language Processing
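One crude way to use a reference corpus as a user model is to prefer nuggets whose terms are well attested in that corpus. The sketch below scores each nugget by the average reference-corpus frequency of its terms; it is an assumed illustration of the general idea, not the paper's ranking method, and the function name is invented here.

```python
from collections import Counter

def nugget_scores(nuggets, reference_corpus):
    """Rank candidate answer nuggets by how well their terms are
    attested in a reference corpus: each nugget is scored by the
    average relative frequency of its terms in that corpus, and the
    nuggets are returned best-first."""
    ref = Counter(t.lower()
                  for doc in reference_corpus for t in doc.split())
    total = sum(ref.values())
    def score(nugget):
        terms = nugget.lower().split()
        if not terms or not total:
            return 0.0
        return sum(ref[t] / total for t in terms) / len(terms)
    return sorted(nuggets, key=score, reverse=True)
```

Nuggets made of terms the reference corpus never mentions sink to the bottom, which is the filtering effect a reliable reference corpus provides.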


Copyright © 2016 Journal of Digital Information Management (JDIM)