IR Experiments with Lemur

Nancy McCracken
October 21, 2004
Adapted from a presentation in IST 657

Lemur assistance: Shuyuan Mary Ho
IR model slides: Liz Liddy and Anne Diekema
Example experiment: Sijo Cherian

Outline
- Overview of Lemur project goals and capabilities
- Standard IR experiments made possible by TREC
- Steps for IR using Lemur
  - Document preparation and indexing
  - Query preparation
  - Retrieval using several models
  - Other applications
- Evaluation for TREC experiments
- Example experiment

The Lemur Toolkit
- Designed to facilitate research in language modeling and information retrieval. The toolkit supports:
  - indexing of large-scale text databases,
  - the construction of simple language models for documents, queries, or subcollections, and
  - the implementation of retrieval systems based on language models as well as a variety of other retrieval models.
- "A Lemur is a nocturnal, monkey-like African animal that is largely confined to the island of Madagascar. 'Lemur' was chosen for the name of the UMass-CMU project in part because of its resemblance to LM/IR. (The fact that the language modeling community has until recently been an island to the IR community is also suggestive.)" (Lemur documentation)

Lemur Facts
- Written in C++ for Unix platforms, but also runs on Windows.
- Includes both:
  - an API to program your own applications or modules
  - user applications to perform IR experiments without programming
- Note that Lucene has an API but lacks user applications.
- Maintained by UMass and CMU.
- Public forum for discussion, which also invites code submissions.

Lemur Features
- From the Documentation Overview at the Lemur home page: http://www-2.cs.cmu.edu/~lemur/
- Indexing:
  - English, Chinese and Arabic text
  - word stemming (Porter and Krovetz stemmers)
  - omitting stopwords
  - recognizing acronyms
  - token-level properties, like part of speech and named entities
  - passage indexing
  - incremental indexing

More Lemur Features
- Retrieval:
  - ad hoc retrieval
    - TFIDF (vector model)
    - Okapi (probabilistic model)
    - relevance feedback
  - structured query language
    - InQuery (Boolean queries weighted from other models)
  - language modeling
    - KL-divergence
    - query model updating for relevance feedback
    - two-stage smoothing
    - smoothing with Dirichlet prior or Markov chain

More Lemur Features
- Distributed IR (using multiple indexes):
  - query-based sampling
  - database ranking (CORI)
  - results merging (CORI, single-regression and multi-regression merge)
- Summarization and clustering
- Simple text processing
- CGI script and stand-alone GUI (written in Java Swing) for retrieval
  - Provides a user interface to submit single queries with a prepared index
  - Under development

Text REtrieval Conference (TREC)
- NIST provides the infrastructure for IR experiments:
  - large data collections, protected by intellectual property rights, available for research purposes
  - human evaluators for yearly experiments
- Results of experiments presented at yearly workshops since 1992
  - 93 groups from 22 countries participated in 2003
  - Standard IR experiments were in the Ad Hoc track; current tracks include Cross Language, Filtering, Question Answering, Robust Retrieval, Terabyte, and Video
- http://trec.nist.gov
  - Co-sponsored by the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense

TREC Retrieval Evaluation Measures
- Evaluation based on relevance:
  - TREC's definition: "If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant."
  - TREC relevance judgments are binary: relevant or not (1 or 0)
  - Human relevance judgments
- Evaluation measures based on precision and recall (stated in symbols below). For each query:
  - Recall: number of relevant retrieved docs / total relevant docs
  - Average precision at seen relevant docs (includes ranking)
  - (Standard precision: number of relevant retrieved docs / total retrieved docs)
  - Average precision versus standard recall levels
- Total result: compute the average precision and recall over all queries
  - Micro-average: total precision/recall divided by the number of queries
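
In symbols, the two per-query measures named above are the standard ratios:

    Recall    = (number of relevant docs retrieved) / (total relevant docs in the collection)
    Precision = (number of relevant docs retrieved) / (total docs retrieved)

Average precision at seen relevant docs averages the precision values computed at each rank where a relevant document appears, so systems that rank relevant documents higher score better.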

TREC Evaluation Process
- The problem is to identify, by human judgment, the relevant documents for each query. Typical numbers per query (worked through below):
  - Total docs: ~500,000
  - Relevant: ~100
  - Retrieved: ~1,000
  - Relevant retrieved: ~50
- Solution is pooling: take all retrieved documents from all groups in the evaluation to form a "pool". Human evaluators judge those documents.
  - The solution has proved to be effective.
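
Plugging in the typical numbers above: a single run's recall is roughly 50/100 = 0.5 and its precision roughly 50/1000 = 0.05. Judging all ~500,000 documents per query by hand is clearly infeasible, which is what motivates pooling.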

TREC Data
- Document collection
- For each Ad Hoc track evaluation:
  - Queries, describing an information need (format sketched below). Each includes:
    - Title
    - Description
    - Narrative
  - Relevance judgments, covering the entire pool:
    - 0 for non-relevant
    - 1 for relevant
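
For orientation, an Ad Hoc topic is distributed as SGML fields along these lines (the topic number and wording here are placeholders, not text from a real topic file):

    <top>
    <num> Number: 301
    <title> sample topic title
    <desc> Description:
    One sentence describing the information need.
    <narr> Narrative:
    A short paragraph spelling out what makes a document relevant.
    </top>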

Goal
- Use Lemur for IR experiments using TREC data

Recall "Steps in the IR Process": Documents
1. Normalize document stream to predefined format.
2. Break document stream into desired retrievable units.
3. Isolate & meta-tag sub-document pieces.
4. Identify potential indexable elements in documents.
5. Delete stop words.
6. Stem terms.
7. Extract index entries.
8. Compute weights.
9. Create/update inverted file.

Lemur accepts TREC-formatted documents and provides capabilities for (most of) the other steps in its indexing applications.

Document Preparation and Initial Steps
- Lemur takes TREC document files and breaks each one into individual "documents" based on the <DOC> formatting tags in the files (sketched below).
  - These are newswire documents.
- Lemur provides a standard tokenizer (called a parser) with options for stemming and stopwords:
  - Stemming: Porter or Krovetz, both standard strong stemmers
  - Stopwords: the user provides a stopwords file. We will start with the stopwords file from the SMART retrieval project at Cornell.
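
For reference, each record in a TREC document file has roughly this SGML shape (the document number and body text here are placeholders):

    <DOC>
    <DOCNO> XX010101-0001 </DOCNO>
    <TEXT>
    The text of one newswire story appears here.
    </TEXT>
    </DOC>

Lemur's parser recognizes the <DOC> boundaries and indexes each record as a separate retrievable document.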

Indexing Application Programs
- The main differences between the indexers provided by Lemur are:
  - whether the index keeps word position information
  - performance characteristics

    Index Name        File Ext.  File Limit  Stores Positions  Loads Fast  Disk Space Usage  Incremental
    InvIndex          .inv       no          no                no          less              no
    InvFPIndex        .ifp       no          yes               no          more              yes
    KeyfileIncIndex   .key       no          yes               yes         even more         yes

- BasicIndex is deprecated.

Setting Up Indexing
- Command-line invocation of the indexing application, specifying a parameter file:

    BuildKeyfileIncIndex build_param

- Parameters include (assembled into a sample file below):
  - the name of the index: index = pindex;
  - a file with a list of document files to process: dataFiles = filelist;
  - stemming and stop word options: stemmer = porter; stopwords = stopwords.dat;
- As a result of this command, approximately 10 files are produced, their names starting with the index name.
  - The inverted index is not human-readable.
  - Use the query processor to view the tokenization, stopping and stemming of a document.
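
Collecting the options above, a minimal build_param file would look something like this (pindex, filelist, and stopwords.dat are the example names from this slide; a Lemur parameter file holds one name = value; pair per line):

    index = pindex;             /* name of the index to build */
    dataFiles = filelist;       /* file listing the document files to process */
    stemmer = porter;           /* or krovetz */
    stopwords = stopwords.dat;  /* SMART stopwords file */

Invoking BuildKeyfileIncIndex build_param then produces the roughly 10 index files described above.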

Recall "Steps in the IR Process": Queries
1. Recognize query terms vs. special operators.
2. Tokenize query terms.
3. Delete stop words.
4. Stem words.
5. Create query representation.
6. Expand query terms.
7. Compute weights.
--> Matcher

- Lemur has Parse applications that create the query representation.
- Matching is provided by a number of Retrieval applications.

Query Representation: Keywords
- The Lemur Parse application provides a keyword representation of the query.
  - Queries go in a file in the same format as documents: choose which parts of the TREC queries to include and add DOC tags (example below).
  - Tokenization, stemming and stop word options are provided by Lemur.
- Command line:

    ParseToFile query_param queries.sgml

- Parameter options:

    outputFile = query.lemur.in;
    stemmer = porter;
    stopwords = stopwords.dat;
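
As a concrete illustration, keeping only the Title field of each topic and adding DOC tags would give entries in queries.sgml like this (hypothetical topic number and title):

    <DOC>
    <DOCNO> 301 </DOCNO>
    sample topic title
    </DOC>

ParseToFile then tokenizes, stems, and stops these the same way the documents were processed, writing the result to query.lemur.in.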

Retrieval Applications for Queries Represented as Keywords
- General-purpose retrieval for the three models: vector, probabilistic and language modeling
- Command:

    RetEval retrieval_param

- Option to choose the retrieval model:

    retModel = 0;  /* 0 = TFIDF; 1 = Okapi; 2 = KL-divergence */

- Other parameters:

    index = pindex.key;           /* database index */
    textQuery = query.lemur.in;   /* query text stream */
    resultFile = retrieval_file;  /* result file */
    resultCount = 1000;           /* how many docs to return as the result */
    resultFormat = 1;             /* 0 = simple format; 1 = TREC format */

- Options for pseudo-relevance feedback (a full sample parameter file follows below):
  - feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no feedback)
  - feedbackTermCount: the number of terms to add to a query when doing feedback
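
Putting the slide's parameters together, a complete retrieval_param file for a TFIDF run with pseudo-relevance feedback turned off would look roughly like this (all names are the examples used above):

    retModel = 0;                 /* 0 = TFIDF; 1 = Okapi; 2 = KL-divergence */
    index = pindex.key;           /* database index */
    textQuery = query.lemur.in;   /* parsed queries from ParseToFile */
    resultFile = retrieval_file;  /* ranked results */
    resultCount = 1000;           /* docs to return per query */
    resultFormat = 1;             /* TREC format */
    feedbackDocCount = 0;         /* 0 disables pseudo-feedback */
    feedbackTermCount = 0;        /* expansion terms (not used here) */

Running RetEval retrieval_param then writes a ranked list for every query in query.lemur.in to retrieval_file.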
