Simple Text Processing

更新时间:2023-06-09 05:26:01 阅读量: 实用文档 文档下载

说明:文章内容仅供预览,部分内容可能不全。下载后的文档,内容与下面显示的完全一致。下载之前请确认下面内容是否您想要的,是否完整无缺。

data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response

1

Document Management, Department of Linguistics, Uppsala University, VT992

Information Retrieval

Document Life Cycle

Document Management, Department of Linguistics, Uppsala University, VT993

Document Management, Department of Linguistics, Uppsala University, VT994

Data Retrieval versus Information Retrieval

Precision and Recall

matching inference model query language query speci cation items wanted error response

data retrieval exact deduction deterministic arti cial complete matching sensitive

information retrieval partial, best match induction probabilistic natural incomplete relevant insensitive

precision= j Dj\ D j Djr

\ recall= j D D Dj j jr r

deduction: p(a); 8x: p(x) ! q(x) ) q(a) induction: p(a); q(a) ) 8x: p(x) ! q(x)

D: documents returned by the system D: relevant documentsr

data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response

Document Management, Department of Linguistics, Uppsala University, VT995

Document Management, Department of Linguistics, Uppsala University, VT996

Simple Text Processing

Searching StringsSearch Pattern

searching strings string replacement sorting tokenization text segmentation (sentences, phrases, MWU's) frequency counts

simple strings expressionsString Searching Problems

string matching string distance longest common subsequence approximate string matching longest repeated substring

Document Management, Department of Linguistics, Uppsala University, VT997

Document Management, Department of Linguistics, Uppsala University, VT998

String Matching

String Matching - Algorithms (1)

Given pattern string x, with j x j= m, and text string y, with j y j= n, where m; n> 0 and m n, if x occurs as a substring of y then determine the position within y of the rst occurrence of x, i.e. return the least value of i such that y(i; i+ m? 1)= x(1; m).

at successive positions. worst case: O(mn) average case: O(m+ n) Knuth-Morris-Pratt: calculate a next-table for the pattern x which will be used for shifting the pattern when mismatching worst case: O(m+ n) Boyer-Moore: scan the string string from left to right, but match symbols from right to left (disregard portions which cannot possibly match the pattern) worst case: O(m+ n) average case: O(n=m) (large alphabet, small pattern)

Brute Force: compare pattern x with substrings of y

data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response

Document Management, Department of Linguistics, Uppsala University, VT999

Document Management, Department of Linguistics, Uppsala University, VT9910

String Matching - Algorithms (2)

String Distance

Sunday,Hume: mismatch optimizationQuick Search Maximal Shift Optimal Mismatch Tuned Boyer-Moore Least Cost Harrison: test for the occurrence of the pattern but don't nd its location, needs preprocessed static texts, uses hashing functions Karp-Rabin: comparison of signatures in a 'Brute-Forcemanner', uses hashing functions

Given strings x and y, with j x j= m, j y j= n, where m; n> 0, and a string distance metric, d, nd d(x; y). (Determine a minimal sequence of editing operations necessary to transform x to y

.

Document Management, Department of Linguistics, Uppsala University, VT9911

Document Management, Department of Linguistics, Uppsala University, VT9912

Longest Common Subsequence

Approximate String Matching

A subsequence of a string is obtained by eliminating zero or more symbols, not necessarily contiguous, from the string. A common subsequence of two strings is a string which occurs as a subsequence of both strings. A longest common subsequence of two strings is a common subsequence of the strings having maximal length.

Given a pattern string x, with j x j= m, text string y, with j y j= n, where m; n> 0 and m n, an integer k 0 and a distance function d, nd all the substrings, s, of y such that d(x; s) k.

data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response

Document Management, Department of Linguistics, Uppsala University, VT9913

Document Management, Department of Linguistics, Uppsala University, VT9914

Dynamic Programming - String Distancef (a; b)= 0 if a matches b, otherwise f (a; b)= 1 d1 1= f (x1; y1) 8i> 1: d 1= d?1 1+ f (x; y1 ) 8j> 1: d1= d1?1+ f (x1; y ) 8i; j> 1: d= min(d?1+1; d?1+1; d?1?1+ f (x; y )); i; i; i;j;j j i;j i;j i;j i;j i j;

Dynamic Programming - LCSf (a; b)= 1 if a matches b, otherwise f (a; b)= 0 d1 1= f (x1; y1) 8i> 1: d 1= d?1 1+ f (x; y1 ) 8j> 1: d1= d1?1+ f (x1; y ) 8i; j> 1: d= max(d?1; d?1; d?1?1+f (x; y ))i; i; i;j;j j i;j i;j i;j i;j i j

b a l a n s a r m e n s

b 0 1 2 3 4 5 6 7 8 9 10 11

a 1 0 1 2 3 4 5 6 7 8 9 10

l 2 1 0 1 2 3 4 5 6 7 8 9

a 3 2 1 0 1 2 3 4 5 6 7 8

n 4 3 2 1 0 1 2 3 4 5 6 7

c 5 4 3 2 1 1 2 3 4 5 6 7

e 6 5 4 3 2 2 2 3 4 4 5 6

7 6 5 4 3 3 3 3 4 5 5 6

a 8 7 6 5 4 4 3 4 4 5 6 6

r 9 8 7 6 5 5 4 3 4 5 6 7

m 10 9 8 7 6 6 5 4 3 4 5 6

b a l a n s a r m e n s

b 1 1 1 1 1 1 1 1 1 1 1 1

a 1 2 2 2 2 2 2 2 2 2 2 2

l 1 2 3 3 3 3 3 3 3 3 3 3

a 1 2 3 4 4 4 4 4 4 4 4 4

n 1 2 3 4 5 5 5 5 5 5 5 5

c 1 2 3 4 5 5 5 5 5 5 5 5

e 1 2 3 4 5 5 5 5 5 6 6 6

1 2 3 4 5 5 5 5 5 6 6 6

a 1 2 3 4 5 5 6 6 6 6 6 6

r 1 2 3 4 5 5 6 7 7 7 7 7

m 1 2 3 4 5 5 6 7 8 8 8 8

Document Management, Department of Linguistics, Uppsala University, VT9915

Document Management, Department of Linguistics, Uppsala University, VT9916

Heaviest Common Subsequence

Approximate String Matching

Landau-Vishkin k-mismatch: based on Knuth-Morris-

Using weight functions: 1. f (xi; yj )= w(i; j; xi) if xi matches yj, otherwise f (xi; yj )= 0 2. f (xi; yj )= w(i; j; xi; yj )

Landau-Vishkin k-distance: based on dynamic programming

Pratt string matching

data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response

Document Management, Department of Linguistics, Uppsala University, VT9917

Document Management, Department of Linguistics, Uppsala University, VT9918

SortingThree of many sorting algorithms: selection sort: Look for the biggest element and remove it. Place removed elements in an ordered list until no element is in the original list. bubble sort: Swap neighbors if they are in wrong order until the list is sorted. quick sort: Divide the list in two parts. Ru

n through the two parts and put all elements which are bigger than the chosen value in one part and all elements which are smaller in the other part. Apply the same algorithm to each of the two parts until the list is sorted (best average complexity of n2log(n)).

ReferencesKD94] Bharat Kinariwala and Tep Dobry. Chapter 10, Sorting and Searching. In Programming in C, 1994. Available on http://spectra.eng.hawaii.edu/Courses/EE150/Book/chap10/chap10.html Rij79] C. J. van RIJSBERGEN. Information Retrieval, Second Edition, Department of Computer Science, University of Glasgow, 1979. Available on http://www.dcs.gla.ac.uk/Keith/Preface.html St92] Graham A. Stephen String Search, Technical Report, School of Electronic Engineering Science, University College of North Wales, Gwynedd 1992. Available on http://ei.cs.vt.edu/ cs5604/f95/cs5604cnSS/TR92.html

本文来源:https://www.bwwdw.com/article/9521.html

Top