H.A. Ali, A.I. El-Desouky, and A.I. Saleh
Web page classification, naïve Bayes, overfitting, continuous learning, probability distribution
Web page classification is one of the most challenging research areas. The web holds a huge volume of unstructured, distributed documents spanning many domains, so building a single general-purpose basis for the classification task is extremely difficult. In addition, the web is full of noise that harms classifier performance, especially when it appears in the training data. It is therefore generally more valuable to build domain-oriented classifiers (vertical classifiers) that classify pages related to a specific domain. This paper analyzes a new way of applying Bayes' theorem to build a Domain-Oriented Naïve Bayes (DONB) classifier. Its main contribution is a novel classification strategy that adds a continuous learning ability to Bayes' theorem, yielding a Continuous Learning Naïve Bayes (CLNB) classifier. Because overfitting strongly affects most web page classification techniques, continuous learning is proposed as a solution: it allows the classifier to keep adapting itself to achieve better performance. Both classifiers are tested, and the experimental results show that CLNB delivers a significant performance improvement over DONB, with its accuracy reaching 94.1% after testing 1,000 pages. Moreover, because the classifier keeps learning, further accuracy gains are expected in future tests.
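The abstract does not give the update rule behind CLNB, so the following is only a minimal sketch of how a naïve Bayes classifier can keep learning after deployment. It assumes a multinomial word-count model with Laplace smoothing, and it assumes each page is fed back into the counts once its label is known. The class name `IncrementalNaiveBayes`, the `alpha` parameter, and the method names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Illustrative multinomial naive Bayes that can be updated one page at a time.

    This is an assumed formulation, not the paper's CLNB algorithm.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant (assumed)
        self.class_docs = Counter()               # number of pages seen per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        self.class_tokens = Counter()             # total tokens seen per class
        self.vocab = set()                        # all distinct tokens seen so far

    def update(self, tokens, label):
        """Fold one labelled page into the counts (the continuous learning step)."""
        self.class_docs[label] += 1
        for token in tokens:
            self.word_counts[label][token] += 1
            self.class_tokens[label] += 1
            self.vocab.add(token)

    def predict(self, tokens):
        """Return the class with the highest log posterior for the given tokens."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # log prior P(class)
            score = math.log(n_docs / total_docs)
            # smoothed log likelihoods P(token | class)
            denom = self.class_tokens[label] + self.alpha * vocab_size
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Initial training on a tiny, made-up two-class corpus.
clf = IncrementalNaiveBayes()
clf.update(["python", "code", "compiler"], "programming")
clf.update(["football", "goal", "league"], "sports")

# Later, a newly labelled page is folded in without retraining from scratch.
clf.update(["striker", "penalty", "goal"], "sports")
print(clf.predict(["striker", "penalty"]))
```

Because the model stores only counts, each update costs time proportional to the length of the page, which is what makes this kind of continuous adaptation cheap. Whether CLNB updates on every page or only on selected ones is not stated in the abstract.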