Classification is the process of sorting items into the class or category they belong to. It is a problem that all of us face many times a day. Deciding whether an activity is dangerous, good, moral, ethical, or criminal is a deep-rooted and complex problem, which may or may not have a definite solution. Still, each of us, in a boundedly rational world, tries to classify actions, based on prior knowledge and experience, into one or more of the classes that we have defined over time. Let us take a look at some real-world examples of classification, as seen in business activities.
Case 1: Doctors look at various symptoms and measure various parameters of a patient to ascertain what is wrong with the patient’s health. They draw on their past experience with other patients to make an informed guess.
Case 2: Emails need to be classified as spam or not spam, based on various parameters, such as the source IP address, domain name, sender name, content of the email, subject of the email, etc. Users also feed information to the spam identifier by marking emails as spam.
Case 3: IT-enabled organizations face a constant threat of data theft by hackers. A common way to identify such attacks is to search for patterns in the incoming traffic and classify the traffic as genuine or a threat.
Case 4: Most organizations that do business in the B2C (business-to-consumer) segment keep receiving feedback about their products or services from customers in the form of text, ratings, or answers to multiple-choice questions. Surveys, too, provide such information. Questions such as “What is the general public sentiment about the product or service?” or “Given a product and its properties, will it sell well?” also need classification.
As we can imagine, classification is a very widely used technique for applying labels to incoming information, thus assigning it to some known, predefined class. Information may fall into one or more such classes, depending on the overlap between them. In all the cases above, and in most other cases where classification is used, the incoming data is usually large. Going through such large data sets manually to classify them can be very time-consuming. Therefore, many classification algorithms have been developed in artificial intelligence to aid this intuitive process. Decision trees, boosting, Naive Bayes, and random forests are a few commonly used ones. In this blog, we discuss the Naive Bayes classification algorithm.
Naive Bayes is one of the simplest and most widely used statistical classification techniques, and it works well on text as well as numeric data. It is a supervised machine learning algorithm, which means that it requires some already classified data, from which it learns. It then applies what it has learnt to new, previously unseen information and gives a classification for it.
- Naive Bayes classification assumes that all the features of the data are independent of each other, given the class. For text and other categorical features, training therefore reduces to counting how often each feature occurs in each class, which makes it a very compute-efficient algorithm.
- It works with both numeric and text data. Text data requires some pre-processing, such as the removal of stop words, before the algorithm can consume it.
- Learning time is much shorter than for some other classification algorithms.
- It does not understand ranges. For example, if the data contains a column of age brackets, such as 18-25, 25-50, and 50+, the algorithm cannot use these ranges properly; it needs exact values for classification.
- It can classify only on the basis of the cases that it has seen. Therefore, if the data used in the learning phase is not a representative sample of the complete data, it may classify new data wrongly.
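The independence assumption in the first point above can be written out as a formula. The symbols here are our own notation, not taken from the original post: c is a class, and t_1, ..., t_n are the features (for text, the words) of the item being classified.

```latex
P(c \mid t_1, \dots, t_n) \;\propto\; P(c) \prod_{i=1}^{n} P(t_i \mid c)
\qquad\Longrightarrow\qquad
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} \log P(t_i \mid c) \Big]
```

Each factor P(t_i | c) can be estimated by counting how often t_i occurs in class c, which is why training is cheap. Taking logarithms turns the product into a sum and avoids numerical underflow when many small probabilities are multiplied.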
Classification Using Naive Bayes With Python
Data
In this blog, we used the customer review data for electronic goods from amazon.com. We downloaded this data set from the SNAP website. Then we extracted
Table 1: Sample of the extracted data set

| Features | Label |
|---|---|
| (good, look, bad, phone) | bad |
| (worst, phone, world) | bad |
| (unreliable, phone, poor, customer, service) | bad |
| (basic, phone) | bad |
| (bad, cell, phone, batteries) | bad |
| (ok, phone, lots, problems) | average |
| (good, phone, great, pda, functions) | average |
| (phone, worth, buying, would, buy) | average |
| (beware, flaw, phone, design, might, want, reconsider) | average |
| (nice, phone, afford, features) | average |
| (chocolate, cheap, phone, functionally, suffers) | average |
| (great, phone, price) | good |
| (great, phone, cheap, wservice) | good |
| (great, entry, level, phone) | good |
| (sprint, phone, service) | good |
| (free, good, phone, dont, fooled) | good |
We used the stopwords list provided in the NLTK corpus to identify and remove stop words. We also applied labels to the extracted reviews, based on the ratings available in the data: 4 and 5 as good, 3 as average, and 1 and 2 as bad. A sample of this extracted data set is shown in Table 1.
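The pre-processing described above could be sketched roughly as follows. This is a minimal illustration, not the code used for the post: the function names `rating_to_label` and `extract_features` are hypothetical, the small `STOP_WORDS` set is a stand-in for the NLTK stopwords list, the tokenization rule is an assumption, and the example sentences are made up.

```python
import re

# Stand-in stop-word set (assumption). The post uses the NLTK stopwords
# corpus, which is much larger and needs a separate download.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "but",
              "i", "to", "of", "in", "for", "be"}


def rating_to_label(rating):
    """Map a star rating to a label: 4-5 good, 3 average, 1-2 bad."""
    if rating >= 4:
        return "good"
    if rating == 3:
        return "average"
    return "bad"


def extract_features(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into words, and drop stop words."""
    # Dropping apostrophes first turns "don't" into "dont" (assumed rule).
    cleaned = text.lower().replace("'", "").replace("\u2019", "")
    tokens = re.findall(r"[a-z]+", cleaned)
    return tuple(t for t in tokens if t not in stop_words)


# Made-up review text, used only to show the shape of the output.
features = extract_features("This is the worst phone in the world")
label = rating_to_label(1)
print(features, label)
```

Each review then becomes a (features, label) pair of the kind shown in Table 1.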
Implementation
The classification algorithm works in two steps: first the training phase, and then the classification phase.
Training Phase
In the training phase, the algorithm takes two parameters as input. The first is the collection of feature sets, and the second is the classification label for each feature set. A feature is a part of the data that contributes to the label or class attached to the data. In the training phase, the classification algorithm estimates the probability of each unique feature given a class. It also estimates the prior probability of each class, that is, the probability that a given set of features belongs to that class before any of its features are considered. Algorithm 1 gives the algorithm for training. Its implementation in Python is shown in figure 1.
Classification Phase
In the classification phase, the algorithm takes the features of a new item and outputs the label or class with the highest score. Algorithm 2 gives the algorithm for classification. Its implementation can be seen in figure 2.
Algorithm 1: Naive Bayes Training
Data: C, D, where C is a set of classes and D is a set of documents

TrainNaiveBayes(C, D)
begin
    V ← ExtractVocabulary(D)
    N ← CountDocs(D)
    for each c ∈ C do
        N_c ← CountDocsInClass(D, c)
        prior[c] ← N_c ÷ N
        text_c ← ConcatenateTextOfAllDocumentsInClass(D, c)
        for each t ∈ V do
            T_ct ← CountTokensOfTerm(text_c, t)
        for each t ∈ V do
            condprob[t][c] ← (T_ct + 1) ÷ Σ_t′ (T_ct′ + 1)
    return V, prior, condprob
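The training procedure can be sketched in Python along the following lines. This is a minimal sketch, not the implementation from the post: the function name `train_naive_bayes` and the variable names are our own, documents are assumed to be tuples of already cleaned tokens, and the sample rows are a small subset of Table 1.

```python
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Estimate class priors and add-one smoothed term probabilities."""
    vocabulary = {term for doc in documents for term in doc}
    n_docs = len(documents)
    docs_per_class = Counter(labels)

    # term_counts[c][t] = number of times term t occurs in class c
    term_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        term_counts[label].update(doc)

    prior = {}
    condprob = defaultdict(dict)
    for c, n_c in docs_per_class.items():
        prior[c] = n_c / n_docs
        # Add-one (Laplace) smoothing: every vocabulary term gets +1,
        # so terms unseen in a class never get probability zero.
        denominator = sum(term_counts[c][t] + 1 for t in vocabulary)
        for t in vocabulary:
            condprob[t][c] = (term_counts[c][t] + 1) / denominator
    return vocabulary, prior, condprob


# A few rows taken from Table 1.
sample_docs = [
    ("worst", "phone", "world"),
    ("basic", "phone"),
    ("ok", "phone", "lots", "problems"),
    ("great", "phone", "price"),
    ("great", "entry", "level", "phone"),
]
sample_labels = ["bad", "bad", "average", "good", "good"]

vocabulary, prior, condprob = train_naive_bayes(sample_docs, sample_labels)
print(prior["good"], condprob["great"]["good"])
```

Because of the add-one smoothing, the conditional probabilities of all vocabulary terms within a class sum to 1, which is a quick sanity check for any implementation.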
Algorithm 2: Naive Bayes Classification
Data: C, V, prior, condprob, d, where C is a set of classes, d is the new input document to be classified, and V, prior, condprob are the outputs of the training algorithm

ApplyNaiveBayes(C, V, prior, condprob, d)
begin
    W ← ExtractTermsFromDoc(V, d)
    N_d ← CountTokensOfTermsInDoc(W, d)
    for each c ∈ C do
        score[c] ← log(prior[c])
        for each t ∈ W do
            score[c] += N_d[t] × log(condprob[t][c])
    return argmax_{c ∈ C} score[c]
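A possible Python rendering of the classification step is sketched below. It is not the implementation from the post. The function name `apply_naive_bayes` is hypothetical, and the tiny model is hand-made with illustrative probabilities rather than values learnt from the review data. The sketch weights each term's log-probability by the number of times the term occurs in the document, which is the usual multinomial reading of the score update.

```python
import math
from collections import Counter


def apply_naive_bayes(classes, vocabulary, prior, condprob, doc_tokens):
    """Return the highest-scoring class and the log-score of every class."""
    # Terms outside the training vocabulary are ignored.
    counts = Counter(t for t in doc_tokens if t in vocabulary)
    scores = {}
    for c in classes:
        score = math.log(prior[c])
        for term, n in counts.items():
            score += n * math.log(condprob[term][c])
        scores[c] = score
    best = max(scores, key=scores.get)
    return best, scores


# Hand-made toy model (illustrative numbers only, not learnt from data).
classes = ["bad", "good"]
vocabulary = {"great", "worst", "phone"}
prior = {"bad": 0.5, "good": 0.5}
condprob = {
    "great": {"bad": 0.2, "good": 0.5},
    "worst": {"bad": 0.5, "good": 0.2},
    "phone": {"bad": 0.3, "good": 0.3},
}

label, scores = apply_naive_bayes(
    classes, vocabulary, prior, condprob, ["great", "phone", "price"]
)
print(label)
```

Here "price" is not in the toy vocabulary and is skipped, "phone" is equally likely in both classes, and "great" tips the decision towards the good class.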
Figure 1: Training Phase