III. PROPOSED METHODOLOGY
In this paper we present a method to build an SMS spam filter integrated with a categorization system, combining Information Retrieval techniques with a Data Mining algorithm.
This paper uses N-grams [3] and PMI (pointwise mutual information) [4] as the co-occurrence algorithms instead of the Apriori algorithm. N-grams form contiguous combinations of N words, such as unigrams and bigrams. PMI [4] gives a co-occurrence value between words. Unlike the Apriori algorithm, PMI does not depend on parameters such as support [5] and confidence [5], so these do not affect the output.
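As a rough illustration of the two co-occurrence steps, the sketch below forms N-grams from a token list and scores adjacent word pairs with PMI. The function names (ngrams, pmi_table) are hypothetical, and estimating the probabilities from unigram and bigram relative frequencies is an assumption; the text does not state how they are estimated.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word combinations of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi_table(messages):
    """Compute PMI for every adjacent word pair in a list of token lists.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    Assumption: probabilities are relative frequencies of unigrams
    and bigrams over the whole collection of messages.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in messages:
        unigram_counts.update(tokens)
        bigram_counts.update(ngrams(tokens, 2))

    total_unigrams = sum(unigram_counts.values())
    total_bigrams = sum(bigram_counts.values())

    table = {}
    for (x, y), count in bigram_counts.items():
        p_xy = count / total_bigrams
        p_x = unigram_counts[x] / total_unigrams
        p_y = unigram_counts[y] / total_unigrams
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table
```

Note that no support or confidence threshold appears anywhere in the computation: the score of a pair depends only on the observed counts.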
The system also provides a systematic approach for sorting the messages classified as ham into different categories.
The system is divided into two phases, a training phase and a test phase, as shown in Fig 1 below.

Fig 1: Block Diagram
The training phase consists of:
Pre-processing, formation of N-grams, and creation of the word occurrence table.
The test phase consists of:
Pre-processing, finding the co-occurrence (using N-grams and PMI), and classifying the test SMS using the Naïve Bayes algorithm by referring to the word occurrence table created in the training phase. The SMSs classified as ham are then categorized into one of six categories (festival, shopping, sports, entertainment, greeting, others).
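The interplay between the two phases can be sketched as follows: a word occurrence table is built from labelled messages, and a test SMS is then scored against it with Naïve Bayes. This is a minimal sketch, not the authors' implementation. The function names are hypothetical, and the add-one (Laplace) smoothing and the use of log probabilities are assumptions not stated in the text. The tokens may be single words or N-grams.

```python
import math
from collections import Counter, defaultdict


def build_occurrence_table(labelled_messages):
    """Training phase: count how often each token occurs in each class.

    labelled_messages: iterable of (tokens, label), label "spam" or "ham".
    """
    table = defaultdict(Counter)   # label -> token -> count
    class_counts = Counter()       # label -> number of messages
    for tokens, label in labelled_messages:
        table[label].update(tokens)
        class_counts[label] += 1
    return table, class_counts


def classify(tokens, table, class_counts):
    """Test phase: return the label with the highest log-posterior."""
    vocabulary = {tok for counts in table.values() for tok in counts}
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_messages in class_counts.items():
        total_tokens = sum(table[label].values())
        score = math.log(n_messages / total_messages)  # class prior
        for tok in tokens:
            # Add-one smoothing (an assumption) avoids zero probabilities
            # for tokens never seen with this label.
            likelihood = (table[label][tok] + 1) / (total_tokens + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A message labelled ham by classify would then be passed on to the categorization step.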
Each of the modules is described in detail as follows:
Pre-processing
The pre-processing module consists of the following steps:

Tokenization
The text is broken down into words, terms, symbols, or other meaningful elements called tokens, separated by delimiters such as ? . , ; # £ / \ ” ‘ ` ~ - + = ( ) % * _ { } | ^ : & and the digits 0-9.
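A minimal sketch of this step, assuming the text is split with a regular expression. The delimiter string is only an illustrative subset of the delimiters listed above, and treating whitespace as a delimiter is an assumption. The names DELIMITERS and tokenize are hypothetical.

```python
import re

# Illustrative subset of the delimiters listed above (an assumption);
# whitespace is added so that words are separated as well.
DELIMITERS = "?.,;#£/\\”‘`~-+=()%*_{}|^:&0123456789"
_SPLIT_PATTERN = re.compile("[" + re.escape(DELIMITERS) + r"\s]+")


def tokenize(text):
    """Split an SMS on runs of delimiters and drop empty strings."""
    return [tok for tok in _SPLIT_PATTERN.split(text) if tok]
```

For example, tokenize("Win £1000 now, call 0800-123; free entry.") yields the tokens Win, now, call, free, and entry, because digits and punctuation act purely as separators.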

Replacement
Shortcut words (tokens) are replaced with their original, full forms.
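One simple way to realize this step is a dictionary lookup over the tokens, sketched below. The SHORTCUTS table and its entries are hypothetical, since the text does not list the shortcut words it handles, and the case-insensitive lookup is likewise an assumption.

```python
# Hypothetical lookup table; the actual list of shortcut words is not
# given in the text.
SHORTCUTS = {
    "u": "you",
    "r": "are",
    "pls": "please",
    "thx": "thanks",
    "tmrw": "tomorrow",
}


def replace_shortcuts(tokens, shortcuts=SHORTCUTS):
    """Replace each shortcut token with its full form; keep other tokens.

    The lookup ignores case (an assumption), so "U" and "u" both map
    to "you". Tokens without an entry are returned unchanged.
    """
    return [shortcuts.get(tok.lower(), tok) for tok in tokens]
```

With this table, the tokens pls, call, u, tmrw become please, call, you, tomorrow.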
