J. Keinduangjun, P. Piamsa-nga, and Y. Poovorawan (Thailand)
Signature pattern discovery, feature selection, statistical scoring measures, and DNA sequences
Acquiring knowledge from biological sequences is a challenging and genuinely hard problem. In this paper, we propose statistical models for discovering "signature patterns": short but highly significant pieces of information that identify the type of a sequence, extracted from DNA sequences about which no prior knowledge is available. The models gather all possible n-grams of the DNA sequences and then use six statistics-based scoring functions to measure how strongly each n-gram is related to each type of sequence. Finally, the signatures are extracted by selecting the most significant patterns according to the consensus of the measured scores. Our experiments showed that signature patterns generated from sequences that are too short perform poorly, whereas sequences longer than half of the complete sequence perform well. The six scoring measures are comparably useful, with the exception of Information Gain, whose use of absent patterns leads to unbalanced class and pattern score distributions. Experiments on different datasets also showed that the precision of identifying influenza virus sequences exceeds 90% when the pattern length is between 6 and 10; longer patterns do not improve precision.
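The pipeline described in the abstract (enumerate n-grams, score each one against a sequence type, keep the top-scoring patterns) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names (`ngrams`, `information_gain`, `top_patterns`) are hypothetical, the two-class setup (target type versus all other sequences) and the per-sequence presence counting are assumptions, and only Information Gain is shown because it is the only one of the six measures the abstract names. The consensus step over several measures is not reproduced here.

```python
from collections import Counter
from math import log2


def ngrams(seq, n):
    """Return the set of distinct substrings of length n found in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def entropy(pos, neg):
    """Binary entropy (in bits) of a group with pos and neg members."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # skip empty classes, since 0 * log(0) is taken as 0
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(present_pos, present_neg, total_pos, total_neg):
    """Information gain of splitting sequences by pattern presence/absence."""
    total = total_pos + total_neg
    absent_pos = total_pos - present_pos
    absent_neg = total_neg - present_neg
    present = present_pos + present_neg
    absent = absent_pos + absent_neg
    return (entropy(total_pos, total_neg)
            - present / total * entropy(present_pos, present_neg)
            - absent / total * entropy(absent_pos, absent_neg))


def top_patterns(target, others, n, k):
    """Score every n-gram and return the k highest-scoring ones.

    Assumption: a pattern counts once per sequence that contains it.
    Ties are broken alphabetically so the result is deterministic.
    """
    pos = Counter(g for s in target for g in ngrams(s, n))
    neg = Counter(g for s in others for g in ngrams(s, n))
    scores = {
        g: information_gain(pos[g], neg[g], len(target), len(others))
        for g in set(pos) | set(neg)
    }
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


# Toy sequences, invented purely for illustration (not real viral data).
target_seqs = ["ACGTAC", "ACGTTT"]
other_seqs = ["TTTTTT", "GGGTTT"]
signatures = top_patterns(target_seqs, other_seqs, n=3, k=2)
```

In this sketch Information Gain is symmetric: a pattern that is absent from every target sequence but present in all the others would score as highly as one found only in the target type. A practical variant would probably also check which class a high-scoring pattern actually occurs in.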