1C-7
Detection of Paragraph Boundaries in Complex Page Layouts for Electronic Documents
○Yimin Chu (The University of Tokyo), Atsuhiro Takasu, Jun Adachi (National Institute of Informatics)
The precision of paragraph segmentation is critical for subsequent information retrieval tasks in the reverse engineering of paginated electronic documents such as PDF files. Current layout analysis solutions designed for simple layouts are not flexible enough to adapt to the variety of complex layouts. Here we propose a method that determines paragraph boundaries using machine learning techniques. We decide the paragraph boundaries based on features of the other, less ambiguous parts of the paragraph. A tree structure is also designed so that the text content can be grouped flexibly.
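As a rough illustration of how a tree can hold flexibly grouped text, the following minimal Python sketch builds a page → paragraph → line tree from per-line boundary decisions. It is not the authors' design: the `Node` class, the `group_lines` function, the node labels, and the idea that a learned classifier supplies one "starts a paragraph" flag per line are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: label is e.g. 'page', 'paragraph', or 'line'."""
    label: str
    text: str = ""  # only 'line' nodes carry text in this sketch
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child and return it so callers can keep building under it."""
        self.children.append(child)
        return child

    def lines(self) -> List[str]:
        """Collect the text of all 'line' nodes below this node, in order."""
        if self.label == "line":
            return [self.text]
        collected: List[str] = []
        for child in self.children:
            collected.extend(child.lines())
        return collected


def group_lines(lines: List[str], starts: List[bool]) -> Node:
    """Group text lines into paragraph nodes under a single page node.

    starts[i] is assumed to be a boundary decision (for instance from a
    trained classifier) saying whether line i begins a new paragraph.
    """
    if len(lines) != len(starts):
        raise ValueError("lines and starts must have the same length")
    page = Node("page")
    current = None
    for text, is_start in zip(lines, starts):
        # Open a new paragraph at the first line or at a predicted boundary.
        if current is None or is_start:
            current = page.add(Node("paragraph"))
        current.add(Node("line", text))
    return page


if __name__ == "__main__":
    page = group_lines(
        ["The precision of paragraph", "segmentation is critical.", "Here we propose"],
        [True, False, True],
    )
    for paragraph in page.children:
        print(" ".join(paragraph.lines()))
```

Because grouping lives in the tree rather than in a flat list, intermediate levels (for example a hypothetical "column" node between page and paragraph) could be inserted without changing how leaf text is collected.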