
Both bibliographic data and PDF files can be accessed free until December 2008. Copyright reserved. 

Volume 4 (2008, Latest release 2008326) 

268280 
Regular Papers
Maintaining Multiple Populations with Different Diversities for Evolutionary Optimization Based on Probability Models
Takayuki Higo, Keiki Takadama
This paper proposes a novel method, Hierarchical Importance Sampling (HIS) that can be used instead of population convergence in evolutionary optimization based on probability models (EOPM)such as estimation of distribution algorithms and cross entropy methods. In HIS, multiple populations are maintained simultaneously such that they have different diversities, and the probability model of one population is built through importance sampling by mixing with the other populations. This mechanism can allow populations to escape from local optima. Experimental comparisons reveal that HIS outperforms general EOPM. 

257267 
Papers
Fast Projective Reconstruction: Toward Ultimate Efficiency
Hanno Ackermann, Kenichi Kanatani
We accelerate the timeconsuming iterations for projective reconstruction, a key component of selfcalibration for computing 3D shapes from feature point tracking over a video sequence. We first summarize the algorithms of the primal and dual methods for projective reconstruction. Then, we replace the eigenvalue computation in each step by the power method. We also accelerate the power method itself. Furthermore, we introduce the SOR method for accelerating the subspace fitting involved in the iterations. Using simulated and real video images, we demonstrate that the computation sometimes becomes several thousand times faster. 

250256 
Original Papers
A DistributedProcessing System for Accelerating Biological Research Using DataStaging
Yoshiyuki Kido, Shigeto Seno, Susumu Date, Yoichi Takenaka, Hideo Matsuda
The number of biological databases has been increasing rapidly as a result of progress in biotechnology. As the amount and heterogeneity of biological data increase, it becomes more difficult to manage the data in a few centralized databases. Moreover, the number of sites storing these databases is getting larger, and the geographic distribution of these databases has become wider. In addition, biological research tends to require a large amount of computational resources, i.e., a large number of computing nodes. As such, the computational demand has been increasing with the rapid progress of biological research. Thus, the development of methods that enable computing nodes to use such widelydistributed database sites effectively is desired. In this paper, we propose a method for providing data from the database sites to computing nodes. Since it is difficult to decide which program runs on a node and which data are requested as their inputs in advance, we have introduced the notion of “datastaging” in the proposed method. Datastaging dynamically searches for the input data from the database sites and transfers the input data to the node where the program runs. We have developed a prototype system with datastaging using grid middleware. The effectiveness of the prototype system is demonstrated by measurement of the execution time of similarity search of severalhundred gene sequences against 527 prokaryotic genome data. 

238249 
Original Papers
A Combination Method of the Tanimoto Coefficient and Proximity Measure of Random Forest for Compound Activity Prediction
Gen Kawamura, Shigeto Seno, Yoichi Takenaka, Hideo Matsuda
Chemical and biological activities of compounds provide valuable information for discovering new drugs. The compound fingerprint that is represented by structural information of the activities is used for candidates for investigating similarity. However, there are several problems with predicting accuracy from the requirement in the compound structural similarity. Although the amount of compound data is growing rapidly, the number of wellannotated compounds, e.g., those in the MDL Drug Data Report (MDDR)database, has not increased quickly. Since the compounds that are known to have some activities of a biological class of the target are rare in the drug discovery process, the accuracy of the prediction should be increased as the activity decreases or the false positive rate should be maintained in databases that have a large number of unannotated compounds and a small number of annotated compounds of the biological activity. In this paper, we propose a new similarity scoring method composed of a combination of the Tanimoto coefficient and the proximity measure of random forest. The score contains two properties that are derived from unsupervised and supervised methods of partial dependence for compounds. Thus, the proposed method is expected to indicate compounds that have accurate activities. By evaluating the performance of the prediction compared with the two scores of the Tanimoto coefficient and the proximity measure, we demonstrate that the prediction result of the proposed scoring method is better than those of the two methods by using the Linear Discriminant Analysis (LDA) method. We estimate the prediction accuracy of compound datasets extracted from MDDR using the proposed method. It is also shown that the proposed method can identify active compounds in datasets including several unannotated compounds. 

228237 
Original Papers
Reaction Structure Profile: A Comparative Analysis of Metabolic Pathways Based on Important Substructures
Yuta Ashida, Tomonobu Ozaki, Takenao Ohkawa
Comparative analysis of organisms with metabolic pathways gives important information about functions within organisms. In this paper, we propose a new method for comparing the metabolic pathways with reaction structures that include important enzymes. In this method, subgraphs from pathways that include `choke point' or `load point' are extracted as important “reaction structures, ” and a “reaction structure profile, ” which represents whether extracted reaction structures are observed in the metabolic pathway of other organisms, is created. Distance regarding function within organisms between species is defined using the “reaction structure profile.”By applying the proposed method to the metabolic networks of 64 representative organisms selected from Archaea, Eubacteria and Eukaryote in the KEGG database, we succeed in reconstructing a phylogenetic tree, and confirm the effectiveness of the method. 

217227 
Original Papers
Prediction of ProteinProtein Interaction Sites Using Only Sequence Information and Using Both Sequence and Structural Information
Masanori Kakuta, Shugo Nakamura, Kentaro Shimizu
Proteinprotein interactions play an important role in a number of biological activities. We developed two methods of predictingproteinprotein interaction site residues. One method uses only sequence information and the other method uses both sequence and structural information. We used support vector machine (SVM) with a position specific scoring matrix (PSSM) as sequence information and accessible surface area(ASA) of polar and nonpolar atoms as structural information. SVM is used in two stages. In the first stage, an interaction residue is predicted by taking PSSMs of sequentially neighboring residues or taking PSSMs and ASAs of spatially neighboring residues as features. The second stage acts as a filter to refine the prediction results. The recall and precision of the predictor using both sequence and structural information are 73.6% and 50.5%, respectively. We found that using PSSM instead of frequency of amino acid appearance was the main factor of improvement of our methods. 

207216 
Original Papers
A Clustering Method for Analysis of Sequence Similarity Networks of Proteins Using Maximal Components of Graphs
Morihiro Hayashida, Tatsuya Akutsu, Hiroshi Nagamochi
This paper proposes a novel clustering method based on graph theory for analysis of biological networks. In this method, each biological network is treated as an undirected graph and edges are weighted based on similarities of nodes. Then, maximal components, which are defined based on edge connectivity, are computed and the nodes are partitioned into clusters by selecting disjoint maximal components. The proposed method was applied to clustering of protein sequences and was compared with conventional clustering methods. The obtained clusters were evaluated using Pvalues for GO(GeneOntology) terms. The average Pvalues for the proposed method were better than those for other methods. 

193206 
Original Papers
A Generalized Hidden Markov Model Approach to Transmembrane Region Prediction with Poisson Distribution as State Duration Probabilities
Takashi Kaburagi, Takashi Matsumoto
We present a novel algorithm to predict transmembrane regions from a primary amino acid sequence. Previous studies have shown that the Hidden Markov Model (HMM) is one of the powerful tools known to predict transmembrane regions; however, one of the conceptual drawbacks of the standard HMM is the fact that the state duration, i.e., the duration for which the hidden dynamics remains in a particular state follows the geometric distribution. Real data, however, does not always indicate such a geometric distribution. The proposed algorithm utilizes a Generalized Hidden Markov Model (GHMM), an extension of the HMM, to cope with this problem. In the GHMM, the state duration probability can be any discrete distribution, including a geometric distribution. The proposed algorithm employs a state duration probability based on a Poisson distribution. We consider the twodimensional vector trajectory consisting of hydropathy index and charge associated with amino acids, instead of the 20 letter symbol sequences. Also a Monte Carlo method (Forward/Backward Sampling method) is adopted for the transmembrane region prediction step. Prediction accuracies using publicly available data sets show that the proposed algorithm yields reasonably good results when compared against some existing algorithms. 

182192 
Regular Papers
A Type System for Dynamic Delimited Continuations
Takuo Yonezawa, Yukiyoshi Kameyama
We study the control operators “control” and “prompt” which manage part of continuations, that is, delimited continuations. They are similar to the wellknown control operators“shift” and “reset”, but differ in that the former is dynamic, while the latter is static. In this paper, we introduce a static type system for “control”and “prompt” which does not use recursive types. We design our type system based on the dynamic CPS transformation recently proposed by Biernacki, Danvy and Millikin. We also introduce letpolymorphism into our type system, and show that our type system satisfies several important properties such as strong type soundness. 

167181 
Regular Papers
Parallel Skeletons for Sparse Matrices in SkeTo Skeleton Library
Yuki Karasawa, Hideya Iwasaki
Skeletal parallel programming makes both parallel programs development and parallelization easier. The idea is to abstract generic and recurring patterns within parallel programs as skeletons and provide them as a library whose parallel implementations are transparent to the programmer. SkeTo is a parallel skeleton library that enables programmers to write parallel programs in C++ in a sequential style. However, SkeTo's matrix skeletons assume that a matrix is dense, so they are incapable of efficiently dealing with a sparse matrix, which has many zeros, because of duplicated computations and commutations of identical values. This problem is solved by reformalizing the matrix data type to cope with sparse matrices and by implementing a new C++ class of SkeTo with efficient sparse matrix skeletons based on this new formalization. Experimental results show that the new skeletons for sparse matrices perform well compared to existing skeletons for dense matrices. 

151166 
Numerical Computation
Level3 BLAS and LU Factorization on a Matrix Processor
Ahmed S. Zekri, Stanislav G. Sedukhin
As increasing clock frequency approaches its physical limits, a good approach to enhance performance is to increase parallelism by integrating more cores as coprocessors to generalpurpose processors in order to handle the different workloads in scientific, engineering, and signal processing applications. In this paper, we propose a manycore matrix processor model consisting of a scalar unit augmented with b×b simple cores tightly connected in a 2D torus matrix unit to accelerate matrixbased kernels. Data load/store is overlapped with computing using a decoupled data access unit that moves b×b blocks of data between memory and the two scalar and matrix processing units. The operation of the matrix unit is mainly processing finegrained b×b matrix multiplyadd (MMA) operations. We formulate the data alignment operations including matrix transposition and skewing as MMA operations in order to overlap them with data load/store. Two fundamental linear algebra algorithms are designed and analytically evaluated on the proposed matrix processor: the Level3 BLAS kernel, GEMM, and the LU factorization with partial pivoting, the main step in solving linear systems of equations. For the GEMM kernel, the maximum speed of computing measured in FLOPs/cycle is approached for different matrix sizes, n, and block sizes, b. The speed of the LU factorization for relatively large values of n ranges from around 5090% of the maximum speed depending on the model parameters. Overall, the analytical results show the merits of using the matrix unit for accelerating the matrixbased applications. 

