Proceedings of the 78th National Convention of IPSJ (Information Processing Society of Japan)

6J-07

Frequent Multi-byte Characters String Mining Using Wavelet Tree-based Compress Suffix Array

○パーヌチーップチョーットニティ，高須淳宏（総研大）

Frequent string mining is widely used in text processing to extract text features. Most research on frequent string mining focuses on single-byte characters and runs into problems when applied to multi-byte characters in languages such as Japanese and Chinese. The main problem is that a large multi-byte character string uses far more memory than a single-byte character string of the same length. To solve this problem, we applied a wavelet tree-based compressed suffix array instead of a plain suffix array to reduce memory usage. We then proposed a novel technique that uses the rank operation to improve run time by 45% compared with using the compressed suffix array alone. The proposed method uses 75% less memory than a suffix array but is 40% slower.
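For readers unfamiliar with the rank operation mentioned above, the following is a minimal illustrative sketch of a wavelet tree over a multi-byte (Unicode) string, where rank(c, i) counts the occurrences of character c in the first i characters. It is not the authors' implementation: the class name `WaveletTree`, the pointer-based layout, and the plain Python prefix-count lists are assumptions made for readability. A memory-efficient version would presumably store succinct bit vectors instead.

```python
class WaveletTree:
    """Illustrative pointer-based wavelet tree (hypothetical, not from the paper)."""

    def __init__(self, text, alphabet=None):
        # Sorted alphabet handled by this node; works for any Unicode characters.
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.left = self.right = None
        self.prefix = None  # stays None for leaf nodes
        if len(self.alphabet) <= 1:
            return  # leaf: at most one distinct symbol remains
        mid = len(self.alphabet) // 2
        right_syms = set(self.alphabet[mid:])
        # prefix[i] = how many of the first i symbols are routed to the right child.
        # Assumption: a plain list stands in for a succinct rank-enabled bit vector.
        self.prefix = [0]
        for ch in text:
            self.prefix.append(self.prefix[-1] + (ch in right_syms))
        self.left = WaveletTree(
            [c for c in text if c not in right_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in text if c in right_syms], self.alphabet[mid:])

    def rank(self, ch, i):
        """Return the number of occurrences of ch in text[:i]."""
        node = self
        while node.prefix is not None:
            mid = len(node.alphabet) // 2
            ones = node.prefix[i]
            if ch >= node.alphabet[mid]:
                i, node = ones, node.right       # position mapped into right child
            else:
                i, node = i - ones, node.left    # position mapped into left child
        # At a leaf, i equals the count of the leaf's single symbol.
        return i if node.alphabet and node.alphabet[0] == ch else 0


text = "東京都と京都府の京都"
tree = WaveletTree(text)
print(tree.rank("京", len(text)))  # occurrences of 京 in the whole string
print(tree.rank("都", 3))          # occurrences of 都 in the first 3 characters
```

Each rank query visits one node per tree level, so its cost grows with the logarithm of the alphabet size rather than with the text length, which is one reason the structure suits large alphabets such as Japanese or Chinese text.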
