6J-07
Frequent Multi-byte Characters String Mining Using Wavelet Tree-based Compress Suffix Array
○パーヌチーップ チョーットニティ,高須淳宏(総研大)
Frequent string mining is widely used in text processing to extract text features. Most research on frequent string mining focuses on single-byte characters and runs into problems when applied to multi-byte characters in languages such as Japanese and Chinese. The main problem is that a large multi-byte character string uses much more memory than a single-byte character string of the same length. To solve this problem, we applied a wavelet tree-based compressed suffix array instead of a normal suffix array to reduce memory usage. We then proposed a novel technique that uses the rank operation to improve run time by 45% compared with using the compressed suffix array alone. With the proposed method, memory usage was 75% lower than with a suffix array, but run time was 40% slower.
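The abstract does not say how the rank operation is used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a pointer-based wavelet tree built over the Burrows-Wheeler transform of the text, and uses rank to count occurrences of a substring by standard backward search. All names (`WaveletTree`, `bwt`, `build_c_table`, `count_occurrences`) are hypothetical, the suffix sorting is deliberately naive, and the prefix counters are plain Python lists rather than compressed bit vectors, so the sketch shows the logic only and not the memory savings.

```python
from collections import Counter


class WaveletTree:
    """Pointer-based wavelet tree over a character sequence (illustrative, uncompressed)."""

    def __init__(self, text, alphabet=None):
        if alphabet is None:
            alphabet = sorted(set(text))
        self.alphabet = alphabet
        self.prefix = None  # stays None at leaves
        if len(alphabet) <= 1:
            return
        mid = len(alphabet) // 2
        self.left_set = set(alphabet[:mid])
        # prefix[i] = how many of the first i symbols are routed to the right child
        self.prefix = [0]
        left_text, right_text = [], []
        for ch in text:
            bit = 0 if ch in self.left_set else 1
            self.prefix.append(self.prefix[-1] + bit)
            (right_text if bit else left_text).append(ch)
        self.left = WaveletTree(left_text, alphabet[:mid])
        self.right = WaveletTree(right_text, alphabet[mid:])

    def rank(self, ch, i):
        """Return the number of occurrences of ch among the first i symbols."""
        node = self
        while node.prefix is not None:
            ones = node.prefix[i]
            if ch in node.left_set:
                i -= ones          # position within the left child
                node = node.left
            else:
                i = ones           # position within the right child
                node = node.right
        return i if node.alphabet and node.alphabet[0] == ch else 0


def bwt(text):
    """Burrows-Wheeler transform via naive suffix sorting (fine for tiny inputs only)."""
    s = text + "\0"  # assumption: "\0" does not occur in text and sorts first
    order = sorted(range(len(s)), key=lambda k: s[k:])
    return "".join(s[k - 1] for k in order)


def build_c_table(last_column):
    """Map each character to the number of characters strictly smaller than it."""
    counts = Counter(last_column)
    table, total = {}, 0
    for ch in sorted(counts):
        table[ch] = total
        total += counts[ch]
    return table


def count_occurrences(pattern, tree, c_table, n):
    """Count (possibly overlapping) occurrences of pattern using backward search."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + tree.rank(ch, lo)
        hi = c_table[ch] + tree.rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

To use the sketch, compute `last = bwt(text)`, build `WaveletTree(last)` and `build_c_table(last)`, and call `count_occurrences(pattern, tree, c_table, len(last))`. A substring is frequent when the returned count reaches a chosen support threshold. Because the tree branches on whole characters rather than bytes, Japanese or Chinese text is handled the same way as ASCII text.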

情報処理学会 (Information Processing Society of Japan)