FIT2015: The 14th Forum on Information Technology. Dates: September 15 (Tue) to 17 (Thu), 2015. Venue: Ehime University, Johoku Campus
Abstract
D-002
Detecting Reused Contents in Text Documents
Pei Wang, Chuan Xiao, Yoshiharu Ishikawa (Nagoya University)
Text document collections typically contain reused information: events or facts may be restated, with modifications, by various sources. Identifying text reuse helps to find the original sources of facts and to track information flow, which makes it an important task in text analysis. In this paper, we study the problem of content reuse detection in text documents. Existing methods are usually sensitive to modifications such as paraphrases. We propose a new method that tolerates a considerable number of differences in reused contents, and we devise a prefix-filtering-based algorithm for efficient reuse detection. Experimental evaluation on real datasets demonstrates that our method outperforms alternative solutions in terms of both effectiveness and efficiency.
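The abstract does not describe the similarity measure or the details of the algorithm, so the following is only a minimal sketch of the general prefix-filtering idea, not the authors' method. It assumes, purely for illustration, that documents are compared as sets of lowercase word tokens under Jaccard similarity; the function names `jaccard` and `prefix_filter_join` and the sample sentences are hypothetical. Tokens are sorted rarest first, and two documents are compared in full only if their short prefixes share a token.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def prefix_filter_join(docs, threshold):
    """Return sorted (i, j, sim) with i < j and Jaccard(docs[i], docs[j]) >= threshold.

    Assumption (not from the abstract): documents are sets of lowercase,
    whitespace-separated tokens, and 0 < threshold <= 1.
    """
    token_sets = [set(d.lower().split()) for d in docs]

    # Global token order: rarest tokens first, ties broken alphabetically.
    df = Counter(tok for s in token_sets for tok in s)
    ordered = [sorted(s, key=lambda t: (df[t], t)) for s in token_sets]

    index = defaultdict(list)  # token -> ids of documents with it in their prefix
    results = []
    for i, toks in enumerate(ordered):
        if not toks:
            continue
        # Any document with Jaccard >= threshold must share a token
        # within the first |x| - ceil(threshold * |x|) + 1 tokens.
        prefix_len = len(toks) - math.ceil(threshold * len(toks)) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(i)
        # Verification step: compute the exact similarity for candidates only.
        for j in candidates:
            sim = jaccard(token_sets[j], token_sets[i])
            if sim >= threshold:
                results.append((j, i, sim))
    return sorted(results)


if __name__ == "__main__":
    sample = [
        "the mayor opened the new bridge on monday",
        "on monday the mayor opened a new bridge",
        "stock prices fell sharply after the report",
    ]
    for i, j, sim in prefix_filter_join(sample, 0.6):
        print(i, j, round(sim, 3))
```

The filter never discards a qualifying pair; it only reduces how many pairs reach the exact comparison. Plain Jaccard over word sets is itself sensitive to paraphrasing, which is the weakness the abstract attributes to existing methods, so a paraphrase-tolerant measure would replace `jaccard` in a real system.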