Abstract
CD-003
Query by Dataset Based on Instance Similarities Generated by Sentence Embeddings
姜 逸越・坂巻慶行・野間 唯(富士通研)
Discovering and preparing relevant datasets from a heterogeneous database often takes three to four times longer than the data analysis itself. Large data volumes, a lack of domain knowledge about the datasets, and restrictions on access to them make this manual workload difficult to reduce. We propose an automated method that uses sentence embeddings to extract features from instances of dataset attributes and their values. The similarities between these embeddings are then used to derive dataset relevance scores, from which relevant dataset pairs or clusters can be found automatically. We demonstrate the method on heterogeneous open data and provide a quantitative evaluation.
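As a rough illustration of the idea, and not the authors' implementation, the sketch below turns each instance into an "attribute: value" sentence, embeds it, and aggregates instance-level cosine similarities into a dataset relevance score. Everything here is an assumption made for the example: the `embed` function is a toy stand-in (hashed character trigrams) for a real sentence embedding model, the serialization in `instance_to_sentence` is hypothetical, and the mean-of-best-match aggregation in `relevance` is one plausible choice among many. The sample datasets are invented.

```python
import hashlib
import math

DIM = 256  # size of the toy embedding space (assumption)


def embed(sentence, dim=DIM):
    """Toy stand-in for a sentence embedding model (assumption):
    hashed character trigrams, L2-normalized."""
    text = f"  {sentence.lower()}  "
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        gram = text[i:i + 3]
        idx = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def instance_to_sentence(row):
    """Serialize one instance (attribute names and values) as a sentence.
    The 'attribute: value' format is a hypothetical choice."""
    return ", ".join(f"{k}: {v}" for k, v in row.items())


def cosine(u, v):
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def relevance(ds_a, ds_b):
    """Dataset relevance score (assumed aggregation): for each instance,
    take its best match in the other dataset, average these, and
    symmetrize over both directions."""
    ea = [embed(instance_to_sentence(r)) for r in ds_a]
    eb = [embed(instance_to_sentence(r)) for r in ds_b]

    def directed(xs, ys):
        return sum(max(cosine(x, y) for y in ys) for x in xs) / len(xs)

    return 0.5 * (directed(ea, eb) + directed(eb, ea))


# Invented sample datasets with differing schemas.
weather_a = [{"city": "Tokyo", "temperature_c": 21},
             {"city": "Osaka", "temperature_c": 24}]
weather_b = [{"city": "Tokyo", "temp_celsius": 20},
             {"city": "Kyoto", "temp_celsius": 23}]
books = [{"title": "Dune", "author": "Frank Herbert"},
         {"title": "Emma", "author": "Jane Austen"}]

if __name__ == "__main__":
    print("weather_a vs weather_b:", relevance(weather_a, weather_b))
    print("weather_a vs books:    ", relevance(weather_a, books))
```

With a real sentence embedding model in place of `embed`, the same pairwise scores could feed a threshold or a clustering step to find related dataset pairs or clusters.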