JOURNAL OF DIGITAL INFORMATION MANAGEMENT

(ISSN 0972-7272) The peer-reviewed journal


Volume 3 Issue 4 March 2005

Abstracts

Automatically Building a Stop Word List for an Information Retrieval System

Rachel Tsz-Wai Lo, Ben He, Iadh Ounis
Department of Computing Science, University of Glasgow, 17 Lilybank Gardens, Glasgow, UK
Email: lotr|ben|ounis@dcs.gla.ac.uk


Abstract

Words that occur frequently in a document but are meaningless in terms of Information Retrieval (IR) are called stopwords. It is widely held that stopwords do not contribute to the context or content of documents and that an IR system should remove them during indexing as well as before querying. However, using a single fixed stopword list across different document collections can be detrimental to retrieval effectiveness. This paper presents different methods for automatically deriving a stopword list for a given collection and evaluates the results on four standard TREC collections. In particular, a new approach, called term-based random sampling, is introduced based on the Kullback-Leibler divergence measure. This approach determines how informative a term is and hence enables us to derive a stopword list automatically. The new approach is then compared to various classical approaches based on Zipf's law, which serve as our baselines. Results show that the stopword lists derived by the methods inspired by Zipf's law are reliable but very expensive to compute. In contrast, the computational effort required to derive stopword lists with the new approach was minimal compared to the baseline approaches, while achieving comparable performance. Finally, we show that a more effective stopword list can be obtained by merging the classical stopword list with the stopword lists generated by either the baselines or the newly proposed approach.
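To make the term-based sampling idea concrete, the sketch below scores each term of a randomly sampled document set by a simplified Kullback-Leibler weight against the whole collection; terms whose within-sample distribution mirrors the collection-wide distribution score near zero and are therefore stopword candidates. The function name and the simplified weighting are illustrative assumptions, not the authors' exact formulation.

```python
import math
from collections import Counter

def kl_term_weights(sample_tokens, collection_tokens):
    """Weight each term in a sampled document set by its contribution
    to the Kullback-Leibler divergence from the full collection.
    Assumes every sampled term also occurs in the collection."""
    sample_counts = Counter(sample_tokens)
    coll_counts = Counter(collection_tokens)
    n_sample, n_coll = len(sample_tokens), len(collection_tokens)
    weights = {}
    for term, freq in sample_counts.items():
        p_x = freq / n_sample             # P(t | sampled documents)
        p_c = coll_counts[term] / n_coll  # P(t | whole collection)
        weights[term] = p_x * math.log2(p_x / p_c)
    return weights

# "the" is used at the same relative rate in the sample as in the
# collection, so its weight is ~0 and it is a stopword candidate.
sample = "the cat sat on the mat the end".split()
collection = ("the cat sat on the mat the end "
              "the dog ran in the park the fox").split()
weights = kl_term_weights(sample, collection)
```

Ranking the collection's terms by this weight and cutting below a threshold yields a collection-specific stopword list without any manual curation.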


Blueprint of a Cross-Lingual Web Retrieval Collection

Borkur Sigurbjornsson, Jaap Kamps, Maarten de Rijke
Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Email: {borkur,kamps,mdr}@science.uva.nl

Abstract

The world wide web is a natural setting for cross-lingual information retrieval; web content is essentially multilingual, and web searchers are often polyglots. Even though English has emerged as the lingua franca of the web, planning for a business trip or holiday usually involves digesting pages in a foreign language. The same holds for searching for information about European culture, sports, economy, or politics. This paper discusses the blueprint of the WebCLEF track, a new evaluation activity addressing cross-lingual web retrieval within the Cross-Language Evaluation Forum in 2005.
 


Extracting Temporal Information from Open Domain Text: A Comparative Exploration

David Ahn, Sisay Fissaha Adafre, Maarten de Rijke
Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Email: {ahn,sfissaha,mdr}@science.uva.nl

Abstract

The utility of data-driven techniques in the end-to-end problem of temporal information extraction is unclear. Recognition of temporal expressions yields readily to machine learning, but normalization seems to call for a rule-based approach. We explore two aspects of the (potential) utility of data-driven methods in the temporal information extraction task. First, we look at whether improving recognition beyond the rule base used by a normalizer has an effect on normalization performance, comparing normalizer performance when fed by several recognition systems. We also perform an error analysis of our normalizer's performance to uncover aspects of the normalization task that might be amenable to data-driven techniques.


Hierarchical topic detection in large digital news archives: Exploring a sample-based approach


Dolf Trieschnigg
University of Twente, Enschede, The Netherlands
Email: trieschn@cs.utwente.nl

Wessel Kraaij
TNO, P.O. Box 155, 2600 AD Delft, The Netherlands
Email: kraaij@tpd.tno.nl


Abstract

Hierarchical topic detection (HTD) is a new task in the TDT 2004 evaluation program, which aims to organize a collection of unstructured news data in a directed acyclic graph (DAG) structure reflecting the topics discussed in the collection, ranging from rather coarse category-like nodes to fine singular events. The HTD task poses interesting challenges since its evaluation metric is composed of a travel cost component, reflecting the time to find the node of interest starting from the top node, and a quality cost component, determined by the quality of the selected node. We present a scalable architecture for HTD and compare several alternative choices for agglomerative clustering and DAG optimization in order to minimize the HTD cost metric. The alternatives are evaluated on the TDT3 and TDT5 test collections.
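The two-component metric can be illustrated with a toy calculation: the cost of an answer node is a travel term, proportional to the number of options a user must scan on the path down from the root, plus a quality term derived from the node's detection cost. The coefficients below are placeholders, not the official TDT 2004 parameters.

```python
def htd_cost(path_branching, detection_cost,
             c_travel=0.001, c_quality=1.0):
    """Toy HTD cost for reaching one node of the topic DAG:
    the travel cost grows with the child links examined at each
    level on the way down; the quality cost is the selected node's
    detection cost. Coefficients are illustrative only."""
    travel = c_travel * sum(path_branching)
    quality = c_quality * detection_cost
    return travel + quality

# A deep path through bushy nodes can outweigh a slightly better
# node quality -- exactly the trade-off the DAG optimization targets.
cost = htd_cost(path_branching=[10, 5, 3], detection_cost=0.2)
```

Because both terms contribute to one scalar, flattening the hierarchy lowers travel cost but tends to merge topics and raise quality cost, which is why the clustering and DAG-optimization choices have to be tuned jointly.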


Multidocument Question Answering Text Summarization Using Topic Signatures


Maria Biryukov
Katholieke Universiteit Leuven,  Belgium
Email: barmaliska@yahoo.com

Roxana Angheluta
Katholieke Universiteit Leuven,  Belgium
Email: anghelutar@yahoo.com

Marie-Francine Moens
Katholieke Universiteit Leuven,  Belgium
Email: marie_france.moens@law.kuleuven.ac.be


Abstract

In this paper we describe the process of multidocument question-answering summarization based on topic signatures. We show that summaries produced using topic signatures have coverage comparable to that of the best systems competing at the Document Understanding Conference.

 


A Ground Truth For Half A Million Musical Incipits



Rainer Typke, Marc den Hoed, Justin de Nooijer, Frans Wiering, Remco C. Veltkamp
Utrecht University, ICS
Padualaan 14, 3584 CH Utrecht, The Netherlands
Email: {rainer.typke|mhoed|jnooijer|frans.wiering|remco.veltkamp}@cs.uu.nl



Abstract

Musical incipits are short extracts of scores, taken from the beginning. The RISM A/II collection contains about half a million of them. This large collection size makes a ground truth very interesting for the development of music retrieval methods, but at the same time makes it very difficult to establish one. Human experts cannot be expected to sift through half a million melodies to find the best matches for a given query. For 11 queries, we filtered the collection so that about 50 candidates per query were left, which we then presented to 35 human experts for a final ranking. We present our filtering methods, the experiment design, and the resulting ground truth.

To obtain ground truths, we ordered the incipits by the median ranks assigned to them by the human experts. For every incipit, we used the Wilcoxon rank sum test to compare the list of ranks assigned to it with the lists of ranks assigned to its predecessors. As a result, we know which rank differences are statistically significant, which gives us groups of incipits whose correct ranking we know. This ground truth can be used for evaluating music information retrieval systems. A good retrieval system should order the incipits in a way that the order of the groups we identified is not violated, and it should include all high-ranking melodies that we found. It might, however, find additional good matches since our filtering process is not guaranteed to be perfect.
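The grouping step described above can be sketched as follows: walk down the incipits in median-rank order and start a new group whenever the Wilcoxon rank-sum test finds a significant difference between the rank lists. The test here uses a plain normal approximation with midranks for ties, and only adjacent incipits are compared; both are our simplifications of the procedure, not the exact experimental setup.

```python
import math

def ranksum_p(x, y):
    """Two-sided p-value of the Wilcoxon rank-sum test using the
    normal approximation, with midranks assigned to tied values."""
    data = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    n = len(data)
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign midranks to ties
        j = i
        while j < n and data[j][0] == data[i][0]:
            j += 1
        mid = (i + j + 1) / 2          # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = mid
        i = j
    w = sum(r for r, (_, lab) in zip(ranks, data) if lab == "x")
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

def group_by_significance(rank_lists, alpha=0.05):
    """Merge adjacent incipits (already sorted by median expert rank)
    into one group while their rank lists do not differ significantly."""
    groups = [[0]]
    for i in range(1, len(rank_lists)):
        prev = groups[-1][-1]
        if ranksum_p(rank_lists[prev], rank_lists[i]) < alpha:
            groups.append([i])         # significant gap: new group
        else:
            groups[-1].append(i)
    return groups
```

Incipits within one group cannot be distinguished at the chosen significance level, so a retrieval system is only penalized for violating the order between groups, not the order within a group.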

 


Towards Automatic Formulation of a Physician's Information Needs


Loes Braun
Institute for Knowledge and Agent Technology, P.O. Box 616, 6200 MD Maastricht, The Netherlands
Email: L.Braun@cs.unimaas.nl

Floris Wiesman
Academic Medical Center, P.O. Box 22700, 1100 DE Amsterdam, The Netherlands
Email: F.J.Wiesman@amc.uva.nl

Jaap van den Herik
Institute for Knowledge and Agent Technology, P.O. Box 616, 6200 MD Maastricht, The Netherlands
Email: Herik@cs.unimaas.nl


Abstract

The goal of this paper is to contribute to improving the quality of care by providing physicians with patient-related literature. A problem for physicians is that they are often not aware of gaps in their knowledge and of the corresponding information needs. Our research aim is to resolve this problem by formulating information needs automatically. Based on these information needs, patient-related literature can be retrieved. In this paper, we investigate how to model a physician's information needs. Thereafter, we design and analyze an approach to instantiate the model with patient data, resulting in information-need templates that can represent patient-related information needs. Since the patient data are in Dutch, we developed a mechanism for translating Dutch terms into standard English terminology. Our experiments show that a physician's information needs can be modelled adequately and substantiated into patient-related information needs. As an aside, we found that the automatic translation mechanism is not sufficiently effective in comparison with manual translation. The usability of our information-need formulation approach is demonstrated by performing a literature retrieval based on the formulated information needs. Since the number of formulated information needs is rather high, methods have to be developed that restrict the set of automatically formulated information needs to a more specialized set.


Using a Reference Corpus as a User Model for Focused Information Retrieval



Gilad Mishne, Maarten de Rijke, Valentin Jijkoun
Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Email: {gilad,mdr,jijkoun}@science.uva.nl


Abstract

We propose a method for ranking short information nuggets extracted from a text corpus, using another, reliable reference corpus as a user model. We argue that the availability and use of such additional corpora is common in a number of IR tasks, and apply the method to answering a form of definition questions. The proposed ranking method substantially improves the performance of our system.
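One simple way to realize such a user model, sketched below under our own assumptions (it is not necessarily the authors' exact ranking formula), is to score each candidate nugget by the average log-probability of its tokens under a smoothed unigram language model built from the reference corpus: nuggets that resemble the trusted reference rank higher.

```python
import math
from collections import Counter

def nugget_score(nugget_tokens, reference_tokens, mu=100.0):
    """Average token log-probability of a nugget under a smoothed
    unigram model of the reference corpus; higher scores mean the
    nugget 'sounds like' the trusted reference text."""
    ref = Counter(reference_tokens)
    total = len(reference_tokens)
    vocab = len(ref) + 1               # reserve mass for unseen terms
    score = 0.0
    for t in nugget_tokens:
        p = (ref[t] + mu / vocab) / (total + mu)
        score += math.log(p)
    return score / len(nugget_tokens)

# A nugget built from reference vocabulary outranks an off-topic one.
reference = "paris is the capital and largest city of france".split()
on_topic = nugget_score("capital of france".split(), reference)
off_topic = nugget_score("zzz qqq xxx".split(), reference)
```

Normalizing by nugget length keeps short and long nuggets comparable, so the ranking reflects how reference-like a nugget is rather than how many tokens it contains.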



