2P-7
Parallel Distance Based Outlier Detection in Very Large Datasets on Hadoop environment
○邱 倩如, 川島英之, 北川博之 (University of Tsukuba)
The volume of publicly available data, such as weather datasets, increases every year, and finding outliers in such data is an important and useful data-mining task. To detect outliers in large datasets, we use a cell-based outlier detection algorithm, which has been shown to be far superior to other algorithms for datasets with fewer than four dimensions. To process these datasets we use MapReduce, a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers, collectively referred to as a cluster. Apache Hadoop is an open-source software project that implements this computational paradigm. As a preliminary approach, we implement the cell-based outlier detection algorithm in a Hadoop environment. By combining the two, we aim to confirm that outlier detection can be applied effectively in a Hadoop environment.
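The abstract does not spell out the algorithm, so the following is only a minimal single-process sketch of cell-based distance-based outlier detection in its usual formulation, restricted to two dimensions. It assumes that a point is an outlier when at most M points (itself included) lie within distance D of it, and that cells have side D / (2√2). The grouping of points by cell stands in for the map and shuffle phases of a MapReduce job; it is not the authors' Hadoop implementation, and the names `map_to_cell`, `cells_in_ring`, and `cell_based_outliers` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def map_to_cell(point, side):
    """Map step (assumed): emit (cell index, point) for a 2-D point."""
    return (math.floor(point[0] / side), math.floor(point[1] / side)), point


def cells_in_ring(cells, cell, lo, hi):
    """Yield non-empty cells whose Chebyshev distance from `cell` is in [lo, hi]."""
    cx, cy = cell
    for dx, dy in product(range(-hi, hi + 1), repeat=2):
        other = (cx + dx, cy + dy)
        if lo <= max(abs(dx), abs(dy)) <= hi and other in cells:
            yield other


def cell_based_outliers(points, D, M):
    """Return points with at most M points (itself included) within distance D."""
    side = D / (2 * math.sqrt(2))  # cell diagonal is D/2

    # Stand-in for map + shuffle: group the points by their cell index.
    cells = defaultdict(list)
    for p in points:
        key, value = map_to_cell(p, side)
        cells[key].append(value)

    outliers = []
    for cell, members in cells.items():
        # Every point in the cell or its 8 adjacent cells is within D.
        near = len(members) + sum(
            len(cells[c]) for c in cells_in_ring(cells, cell, 1, 1)
        )
        if near > M:
            continue  # the whole cell is outlier-free; no distances computed

        # Cells 2 or 3 steps away may or may not hold neighbours within D;
        # cells further away are always more than D apart.
        ring2 = [q for c in cells_in_ring(cells, cell, 2, 3) for q in cells[c]]
        if near + len(ring2) <= M:
            outliers.extend(members)  # the whole cell consists of outliers
            continue

        # Undecided cell: check each point against the second ring only.
        for p in members:
            count = near + sum(1 for q in ring2 if math.dist(p, q) <= D)
            if count <= M:
                outliers.append(p)
    return outliers
```

For example, with five points clustered near the origin and one point at (10, 10), calling `cell_based_outliers(points, D=1.0, M=1)` reports only the distant point. Most cells are decided from counts alone, which is why the per-cell counting maps naturally onto a reduce step; how cells and their neighbouring layers would be partitioned across Hadoop nodes is left open here.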