Comparing Naive Bayes and Linear Support Vector Machine (Classification Algorithm) in Apache Spark Using Indonesian Text Reviews
1st Antonius Angga Kurniawan, Computer Science and Information Technology Faculty, Gunadarma University, Depok, Indonesia, [email protected]
2nd Metty Mustikasari, Computer Science and Information Technology Faculty, Gunadarma University, [email protected]
Abstract: Big Data is large in volume and varied in form, so it cannot be processed with traditional tools. New methods and tools are therefore needed to extract value from the data. Apache Spark is a distributed, memory-based computing framework that is naturally suited to machine learning and large-scale data processing, and several studies describe it as a lightning-fast unified analytics engine. This study examines how quickly Spark processes large data using the classification algorithms of its Machine Learning library (MLlib). The algorithms compared are Naive Bayes and Support Vector Machine (SVM), and the comparison shows which of the two performs better. The task used in this study is sentiment prediction from reviews of an application product. Sentiment analysis of product reviews is challenging because reviews vary widely in nature, diversity, and volume; if the data are managed properly, however, they can support decision making. The user reviews come from a redeveloped chat-based product, BlackBerry Messenger (BBM). The results show that Apache Spark processes large data very quickly. The classification algorithms are evaluated by precision, recall, F-measure, and ROC curve. On these measures, SVM performs better than Naive Bayes, whereas Naive Bayes has a shorter processing time than SVM.
Keywords: Big Data, Apache Spark, Classification, Sentiment Analysis, Naïve Bayes, SVM.
Technology is one of the factors that support human progress, and it is advancing rapidly. In many parts of society, technology has become a tool for information, making information easier and faster to obtain. Some people use information as a source of data to be processed into something useful. Data sources keep growing over time: every day, data are generated from social media posts, product reviews, digital images and video, purchase transaction records, and more.
A large data source is called Big Data. The data may be structured or unstructured, which complicates Big Data analysis. Big Data is commonly characterized by three properties: Volume, Velocity, and Variety (the 3Vs). These three characteristics challenge any system that implements a Machine Learning framework. A strong Machine Learning framework, strategy, and environment is therefore needed to analyze large data properly [1].
Many technologies can be used to perform large data processing. Typically, data processing is done by distributing data storage and computing across multiple computers. One of the technologies to handle large data processing is Hadoop.
Hadoop is a platform for processing Big Data. It is an open source software project that allows distributed processing of large data sets across commodity servers. Hadoop is designed for analytic workloads and scales from a single server to thousands of machines, with a very high degree of fault tolerance [2].
Within the Hadoop ecosystem, there is a well-known architecture, the MapReduce framework. The framework lets an operation be specified for a huge data set, divides the problem and the data, and runs the operation in parallel. However, MapReduce has some important flaws. Running a job incurs high overhead, and intermediate data and computation results are written to disk. This makes Hadoop relatively ill-suited to iterative or low-latency use cases [3].
Along with advances in technology, frameworks for data computing are also developing. One of the most developed is Apache Spark, which was created to solve some of the problems of earlier frameworks such as Hadoop.
Apache Spark is a distributed memory-based computing framework. It is optimized for low-latency tasks and stores intermediate data and results in memory. Therefore, Spark is suitable for machine learning and iterative applications.
Spark is a general distributed computing framework that builds on the ideas of Hadoop MapReduce. It retains the advantages of MapReduce, but unlike MapReduce, the intermediate and output results of Spark jobs can be kept in memory, which is called Memory Computing. Memory Computing improves the efficiency of data computing. Therefore, Spark is better suited for iterative applications, such as Data Mining and Machine Learning [3].
Machine Learning Library (MLlib) is the Apache Spark component that provides common machine learning algorithms and utilities. This research focuses on two MLlib classification algorithms for prediction: Naive Bayes (NB) and Linear Support Vector Machine (LSVM). Naive Bayes and SVM are among the best techniques for classifying data and can be regarded as baseline learning methods [4].
Naïve Bayes is a linear classifier based on Bayes' theorem. It creates simple, well-performing models and assumes that the features in the dataset are mutually independent, hence the term naive [5]. SVM is a learning algorithm that performs classification by finding the hyperplane that maximizes the margin between two classes. The points nearest to the hyperplane are the support vectors, which determine the maximum margin [6].
Based on previous research, the Naive Bayes and SVM algorithms remain prominent candidates for further research. Both are suitable for classification tasks such as Text Mining, Opinion Mining, and Sentiment Analysis. In terms of the framework used to run the classification algorithms, Apache Spark has shown better results than some other frameworks. Previous studies are summarized in Section 2 under Related Work and Classification Used Spark And Non Spark.
Sentiment analysis was used in this study. It can be described as an application of text analysis, computational linguistics, and natural language processing. Sentiment analysis involves several processes, namely extracting, preprocessing, understanding, classifying, and presenting the sentiments expressed by users. It generally involves classifying the polarity of a text as positive, negative, or neutral, and can also involve extraction of subjectivity, prediction of intensity, and classification of emotions. Sentiment analysis is carried out at the term, sentence, paragraph, and document levels and has been extended to other aspects [26].
The data used in this research comprise 122,138 rows. The dataset was taken from the Google Play Store by collecting Indonesian-language reviews of BlackBerry Messenger (BBM), and was captured in .csv format. These reviews allow us to analyze the sentiment toward the BBM application.
The purpose of this research is to compare the Naive Bayes and SVM classification algorithms under the Apache Spark framework. Evaluation is based on Precision, Recall, F-Measure, and the ROC Curve. The comparative results can also serve as a reference for further studies when choosing a classification algorithm. In addition, this research examines how well and how quickly Apache Spark processes large-scale data.
The remainder of the paper is structured as follows: Section 2 presents the related work and the background that supports this research. Section 3 presents the methodology and the flow of this research. Section 4 presents the results and discussion. Finally, Section 5 presents the conclusions.
In the 2017 study “Evaluation of classification algorithms for banking customer behavior under Apache Spark data processing system”, Etaiwi Wael et al. compared the Naive Bayes and SVM algorithms using Apache Spark [1]. The results showed that the Naive Bayes predictive approach was more efficient than SVM. The data were customers' personal information and behavior data from Santander Bank, obtained from kaggle.com. To evaluate the classification algorithms, Etaiwi Wael et al. used precision, recall, and F-measure.
Pang et al. (2002) compared several classification algorithms on movie reviews, offering insight into sentiment analysis and opinion mining. They evaluated the performance of Naive Bayes, Maximum Entropy, and Support Vector Machines in the specific domain of movie reviews and obtained an accuracy slightly above 80% [7].
The same techniques were also used by Kharde and Sonawane (2016) to perform sentiment analysis on Twitter data. Again, the SVM algorithm proved to have the best performance [8].
Catal and Nangir (2017) proposed a new sentiment classification technique based on a Vote ensemble classifier. Three individual classifiers, namely Bagging, Naive Bayes, and Support Vector Machines (SVM), were used for the Turkish sentiment classification problem. The Naive Bayes and SVM algorithms performed well, with an average accuracy above 81% [9].
Bo Yan et al. (2017), in the paper “Microblog Sentiment Classification using Parallel SVM in Apache Spark”, classified sentiments on a microblog using a parallel SVM [10]. They also tried to increase the execution speed of SVM, which is usually constrained when the data are large, by using the RBF kernel function on Spark. They further improved accuracy by attaching comments to microblogs, evolving the feature space, and tuning parameters. With these methods, the parallel SVM on Spark outperformed LIBSVM.
Srivastava D.K. and Bhambhu L., in the Journal of Theoretical and Applied Information Technology, studied data classification using support vector machines [11]. They used four datasets of different sizes: Diabetes, Heart, Satellite, and Shuttle data. Their research compared SVM with the RBF kernel function against the rule-based classifier of RSES. In total execution time for prediction, SVM took longer than RSES. In terms of accuracy, SVM was better than RSES, and they concluded that the more data classified, the higher the prediction accuracy.
Huang Y. and Li Lei (2011), in their paper “Naive Bayes Classification Algorithm Based on Small Sample Set”, studied the Naive Bayes classification algorithm based on the Poisson distribution model [12]. They showed that the classification accuracy remained high even with small sample data.
Next, Baltas Alexandros et al. (2017), in “An Apache Spark Implementation for Sentiment Analysis on Twitter Data”, combined machine learning methodologies with natural language processing techniques, the Apache Spark Machine Learning library (MLlib), and classification algorithms (binary and ternary classification). Microblog posts were classified as positive or negative using machine learning methods. The results showed that Naive Bayes was the best algorithm [13].
In 2016, there were several studies on Big Data machine learning and sentiment analysis using Apache Spark. Fu Jian et al., in “Spark-A Big Processing Platform for Machine Learning”, analyzed Spark's primary framework and core technologies and ran a machine learning instance on it [14]. Compared with Hadoop, Spark had better computing ability.
Salloum Salman et al. (2016), in “Big data analytics on Apache Spark”, explained how Apache Spark is built [15]. Their technical review focused on the key components, abstractions, and features of Apache Spark, in order to show what Apache Spark offers for designing and implementing Big Data algorithms and Machine Learning pipelines.
This research uses a dataset of product reviews from the Google Play Store, and several studies are related to this. In the dissertation “Mobile App Analytics & Sentiment Analysis of Customer Reviews”, Calikus Ece (2015) proposed a prototype system that displays a dashboard for iOS and Android app analytics on a single platform. The back end used mobile data mining techniques and applied a classification-based sentiment analysis model. Based on the background research, supervised machine learning techniques were chosen for sentiment classification. The evaluation of the proposed solution yielded four major contributions. The first and main contribution was to show that a supervised machine learning approach to sentiment analysis is applicable to mobile app reviews, with 88.3% accuracy. Moreover, the best-performing algorithm for sentiment classification was Multinomial Naïve Bayes [16].
Classification Algorithm Naïve Bayes and SVM
The Naive Bayes classifier is based on Bayes' theorem and is particularly suited to high-dimensional inputs. Despite its simplicity, the Naive Bayes classifier can often achieve performance comparable to more sophisticated classification methods, such as decision trees and selected neural network classifiers. Naive Bayes classifiers have also exhibited high accuracy and speed when applied to large datasets. However, the assumption of independence between attributes can reduce accuracy, since attributes are usually related [22].
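To make the independence assumption concrete, the following is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is an illustration only, not the MLlib implementation used in this study; the class name, the toy reviews, and the smoothing value are assumptions, and the labels follow the convention of this paper (0 for positive, 1 for negative).

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Toy multinomial Naive Bayes over lists of tokens (illustrative only)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # Laplace smoothing constant (assumed value)

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {term for doc in docs for term in doc}
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        return self

    def log_posterior(self, doc, label):
        # log P(class) + sum of log P(term | class); each term is treated
        # as independent of the others given the class.
        n_docs = sum(self.class_counts.values())
        log_prob = math.log(self.class_counts[label] / n_docs)
        total_terms = sum(self.term_counts[label].values())
        denom = total_terms + self.smoothing * len(self.vocab)
        for term in doc:
            if term in self.vocab:  # terms never seen in training are ignored
                count = self.term_counts[label][term]
                log_prob += math.log((count + self.smoothing) / denom)
        return log_prob

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_posterior(doc, c))


# Hypothetical, already preprocessed reviews: 0 = positive, 1 = negative.
train_docs = [["bagus", "cepat"], ["mantap", "bagus"],
              ["lambat", "error"], ["jelek", "lambat"]]
train_labels = [0, 0, 1, 1]
model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict(["bagus", "mantap"]), model.predict(["lambat"]))
```

Working in log space avoids numerical underflow when a review contains many terms, which matters for the high-dimensional inputs mentioned above.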
SVM is a theoretically sound approach for controlling model complexity. It picks important instances to construct the separating surface between data instances. When the data is not linearly separable, it can either penalize violations with loss terms, or leverage kernel tricks to construct non-linear separating surfaces. SVMs can also perform multiclass classifications in various ways, either by an ensemble of binary classifiers or by extending margin concepts. The optimization techniques of SVMs are mature, and SVMs have been used widely in many application domains. However, SVM is difficult to use in large-scale problems, where large scale refers to the number of samples being processed [22].
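The idea of penalizing margin violations with a loss term can be sketched with a linear SVM trained by plain sub-gradient descent on the regularized hinge loss. This is a simplified, single-machine illustration under assumed hyperparameters (`epochs`, `lr`, `reg`) and toy data; Spark's own linear SVM uses a different, distributed optimizer. Labels here are -1 and +1, as is conventional for the hinge loss.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, reg=0.01):
    """Minimise reg/2 * ||w||^2 + mean hinge loss by sub-gradient descent.

    labels must be -1 or +1. Hyperparameter defaults are illustrative.
    """
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # only margin violations contribute to the loss
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])
```

Each epoch visits every sample, which hints at why training cost grows with the number of samples.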
Classification Used Spark And Non Spark
Garcia-Gil, D., et al. (2017) conducted a study entitled “A comparison on scalability for large batch data processing on Apache Spark and Apache Flink” [17]. They compared the performance of Apache Spark and Flink using three Machine Learning algorithms, SVM, Linear Regression, and DITFS, on the same dataset, and measured the learning time of each. The results showed that all three algorithms had faster learning times on Apache Spark than on Flink.
In an MSc Research Project in Data Analytics in 2016, Gilheany conducted a study entitled “Processing time of TFIDF and Naive Bayes on Spark 2.0, Hadoop 2.6 and Hadoop 2.7: Which Tool Is More Efficient?” [18]. Gilheany compared processing times in Apache Spark and Hadoop using the Term Frequency-Inverse Document Frequency (TF-IDF) process and the Naive Bayes classification algorithm, over three iterations. For the TF-IDF process, Spark had a faster and more stable processing time than Hadoop. For the Naive Bayes algorithm, Spark was slower than Hadoop, although its processing time improved with each iteration.
Assefi Mehdi et al. (2017), in “Big Data Machine Learning using Apache Spark MLlib”, performed research using six datasets [19]. They used several classification methods, such as SVM, Decision Tree, Naive Bayes, and Random Forest, and also compared Apache Spark MLlib with Weka. The results showed that Apache Spark MLlib was a very strong tool for Big Data analytics.
Big Data is a term that describes large volumes of data, both structured and unstructured. Big Data has been used in many businesses. What matters is not only the amount of data but also what the organization does with it. Big Data can be analyzed for insights that lead to better decision making and business strategies.
Big Data has three characteristics commonly called 3Vs, namely Volume, Velocity and Variety as shown in Figure 1.
Fig. 1. The Three Vs of Big Data [20].
Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming, and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python, and R [15].
Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing [21].
Sentiment analysis is the computational study of opinions, sentiments, and emotions expressed in text [23]. It is the process of determining whether a piece of writing is positive, negative, or neutral. Also known as opinion mining, it derives the opinion or attitude of a speaker. A common use case for this technology is to discover how people feel about a particular topic.
To assess the capabilities of Apache Spark, this study was carried out in several stages, shown in Figure 2. The first stage collects a dataset of reviews of an app product in the Google Play Store. The second stage performs data preprocessing. The third stage performs feature extraction. The fourth stage builds the machine learning models. The fifth stage evaluates the machine learning classification algorithms used.
Fig. 2. Methodology Steps
Figure 3 shows the data retrieval steps used in this study. The dataset comes from user reviews of an app product in the Google Play Store, retrieved via appfollow.io. The selected product is BBM, a well-known chat messenger in Indonesia.
Fig. 3. Steps for Data Collection
A total of 122138 BBM reviews were obtained. These reviews are processed with Apache Spark.
Data preprocessing is a very important step because it determines the efficiency of the steps further down the line. It involves syntactic correction of the reviews. The steps aim to make the data more machine readable in order to reduce ambiguity in feature extraction.
Fig. 4. Data Preprocessing Steps
Based on Figure 4, below are the steps taken for preprocessing data on a review text:
Text reviews are not consistent in their use of capital letters. Case folding therefore converts the entire text of the reviews into a standard form (usually lowercase). For example, users who type “PESAN”, “PeSaN”, or “pesan” all receive the same retrieval result as “pesan”. Case folding changes all letters in the reviews to lowercase, and only the letters ‘a’ to ‘z’ are accepted.
At this stage the review text is checked and cleaned. This process removes punctuation and corrects words whose spelling would distort the intended wording.
The tokenizing stage cuts the input string into the individual words that compose it. An example of this stage can be seen in Figure 5.
Fig. 5. Example of Tokenizing
In outline, tokenizing breaks a sequence of characters in a text into word units and decides which characters are treated as word separators.
In the filtering stage, a stop list (removing less important words) or a word list (keeping important words) is used. Stop words are non-descriptive words that can be removed in the bag-of-words approach. Examples of stop words are “yang”, “dan”, “di”, “dari”, and so on.
The stop word list can be taken from Fadillah Z. Tala's “A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia” [25].
Fig. 6. Example of Stop word Removal
Figure 6 shows the example of stop word removal using the tokenize result. Words like “dari”, “yang”, “di”, and “ke” are some examples of high frequency words and can be found almost in every document (referred to as stop word). Stop word removal can reduce index size and processing time. In addition, it can also reduce the noise level.
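The case folding, cleaning, tokenizing, and stop word removal steps described so far can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the Spark code used in the study; the function names, the sample review, and the five-word stop list are assumptions (the study uses Tala's full stop word list).

```python
import re

# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"yang", "dan", "di", "dari", "ke"}


def case_fold(text):
    """Lowercase the text and replace anything other than a-z or whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split the cleaned text on runs of whitespace."""
    return text.split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in stop_words]


def preprocess(review):
    return remove_stop_words(tokenize(case_fold(review)))


print(preprocess("PeSaN dari BBM yang masuk, lambat!!!"))
```

Replacing punctuation with a space rather than deleting it keeps words such as "masuk,lambat" from being glued together when a reviewer omits the space.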
Indexing is done because a document can’t be recognized directly by an Information Retrieval System (IRS). Therefore, the document must first be mapped into a representation using the text inside it.
Stemming is needed both to minimize the number of distinct index terms in a document and to group words that share the same root but take different forms because of different affixes.
For example, the words “bersama”, “kebersamaan”, and “menyamai” are stemmed to the word “sama”. However, as with stop word removal, stemming performance varies and often depends on the language domain. An example of this stage can be seen in Figure 7.
Fig. 7. Example of Stemming
The stemming process for Indonesian text differs from that for English text. For English text, the required process is simply removing suffixes, whereas for Indonesian text all affixes, both suffixes and prefixes, are removed.
Tagging Positive or Negative
At this stage, each sentence is checked for words that appear in lists of positive and negative words. A collection of positive and negative Indonesian words can be obtained from Devid Haryalesmana's GitHub repository at https://github.com/masdevid/ID-OpinionWords.
The algorithm counts the number of positive words and the number of negative words in a sentence, then divides each count by the number of words in the sentence. If the positive value is greater than the negative value, the sentence is positive; if the negative value is greater, the sentence is negative. This study encodes positive as 0 and negative as 1, as shown in Table I.
TABLE I. LABEL FOR POSITIVE OR NEGATIVE
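The labeling rule can be sketched as follows. The two mini-lexicons are hypothetical stand-ins for the ID-OpinionWords lists, and returning `None` when the two scores tie is an assumption, since the text does not say how ties are handled.

```python
# Hypothetical mini-lexicons; the study uses the ID-OpinionWords lists.
POSITIVE_WORDS = {"bagus", "cepat", "mantap"}
NEGATIVE_WORDS = {"lambat", "jelek", "error"}


def tag_review(tokens):
    """Return 0 for positive, 1 for negative, or None when undecided."""
    if not tokens:
        return None
    # Share of positive and negative words among all words of the sentence.
    positive = sum(token in POSITIVE_WORDS for token in tokens) / len(tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens) / len(tokens)
    if positive > negative:
        return 0
    if negative > positive:
        return 1
    return None  # assumption: ties are left unlabeled


print(tag_review(["bbm", "bagus", "cepat"]))
print(tag_review(["bbm", "lambat", "error", "bagus"]))
```

Because both counts are divided by the same sentence length, the comparison gives the same label as comparing the raw counts; the division only matters if the scores themselves are kept.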
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Feature extraction uses the “ML” library of Apache Spark, for which the recommended API is the DataFrame-based API. This feature is useful where we need to find trending topics or to create word clouds.
This stage converts the sentences in the reviews into a collection of vectors and produces data with the columns “label” and “features”.
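As a rough illustration of the weighting, the sketch below computes TF-IDF in plain Python with raw term counts as the term frequency and the smoothed inverse document frequency log((N + 1) / (df + 1)), which is the form Spark MLlib documents for its IDF. The function names and the three toy documents are assumptions, and the output is a dictionary rather than the sparse vectors Spark produces.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed IDF per term: log((N + 1) / (df + 1))."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((n_docs + 1) / (df + 1))
            for term, df in doc_freq.items()}


def tfidf(doc, idf):
    """Raw term counts weighted by IDF; terms unseen during fitting are skipped."""
    counts = Counter(doc)
    return {term: count * idf[term]
            for term, count in counts.items() if term in idf}


# Hypothetical tokenized reviews.
docs = [["bbm", "bagus"], ["bbm", "lambat", "lambat"], ["bbm", "error"]]
idf = fit_idf(docs)
vec = tfidf(docs[1], idf)
print(vec)
```

A term such as "bbm" that occurs in every document receives a weight of zero, while a rarer, repeated term such as "lambat" is weighted up.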
This paper presents a comparative study of two classification algorithms, namely Naive Bayes and the Support Vector Machine (SVM), from the Machine Learning Library (MLlib) under the Apache Spark data processing system. Figure 8 shows the steps used to compare the machine learning classification algorithms.
Fig. 8. Steps for Comparative Classification Algorithm
NB is a linear classifier based on Bayes' theorem. It creates simple, well-performing models and assumes that the features in the dataset are mutually independent, hence the term naïve [5]. SVM is a learning algorithm that performs classification by finding the hyperplane that maximizes the margin between two classes; the points nearest to the hyperplane are the support vectors, which determine the maximum margin [6].
There are several measures that can be used to evaluate the performance of classification algorithms for a prediction, such as entropy, purity, confusion matrix, accuracy, precision, recall, F measure, and computation time.
According to Saito et al. [24], precision and recall are very informative evaluation metrics for binary classifiers. The performance metrics used for evaluation in this paper are therefore precision and weighted precision, recall and weighted recall, F-measure, confusion matrix, and area under ROC. Precision, also known as positive predictive value, is the number of correctly predicted positive items divided by the number of all items predicted as positive. Recall, also referred to as the true positive rate or sensitivity, is the number of correctly predicted positive items divided by the number of all actual positive items. The F-measure is calculated from both precision and recall.
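These definitions can be written down directly from the four cells of a confusion matrix. The sketch below is a minimal illustration with made-up counts; the function name is hypothetical, and the F-measure is taken to be the harmonic mean of precision and recall (F1), which is the usual choice but is an assumption here.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}


# Made-up counts for illustration only.
metrics = binary_metrics(tp=6, fp=2, fn=4, tn=8)
print(metrics)
```

With these counts, precision is 6/8 and recall is 6/10, so the classifier is more reliable when it predicts the positive class than it is complete in finding positives.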
RESULTS AND DISCUSSION
This section describes the data processing steps and then the tests using the Naive Bayes and Support Vector Machine (SVM) algorithms, from which the evaluation metrics for both algorithms on Apache Spark are obtained.
Time Process Load Data to Dataframe
With Spark, data can be processed and stored in a dataframe. A dataframe is like a variable that holds rows and columns, and processing it is similar to using SQL (Structured Query Language) commands.
Fig. 9. Load Data to Dataframe Spark
Figure 9 shows how the data are stored in a Spark DataFrame. Loading the 122138 rows into the DataFrame takes only 789 milliseconds.
Fig. 10. Results of Tokenizing Process
Figure 10 shows some of the results obtained using the RegexTokenizer function in Apache Spark, together with the number of tokens in each sentence.
Stopwords Removal Results
Fig. 11. Results of Stopwords Removal Process
Figure 11 shows the results obtained using the StopWordsRemover function in Apache Spark, which deletes stop words. This process uses an Indonesian stop word corpus based on references from previous research.
The next process is stemming, which uses the StemmerFactory function of PySastrawi. PySastrawi is a library for stemming Indonesian sentences.
Fig. 12. Examples of Results in the Stemming Process
Figure 12 shows one of the results obtained from the Indonesian stemming process using PySastrawi.
The next process separates sentences with positive meaning from those with negative meaning. This stage uses a positive Indonesian corpus and a negative Indonesian corpus. Examples of the results can be seen in Figure 13.
Fig. 13. Tagging the Reviews Positive or Negative
Feature Extraction Results
For feature extraction, we used the Apache Spark functions CountVectorizer and Inverse Document Frequency (IDF), together with the VectorAssembler function and a Pipeline.
Fig. 14. Results of Feature Extraction
Figure 14 shows the result of the feature extraction process using functions available in Apache Spark. Each review is converted into a vector, because Spark runs the Machine Learning process on vectors rather than on raw words.
Apply the Algorithm
In this process we tested two Machine Learning classification algorithms, namely Naive Bayes and SVM. To find out how fast Apache Spark processes data with each algorithm, the experiment was executed 8 times with increasing amounts of data: 5000, 10000, 20000, 40000, 60000, 80000, 100000, and finally 122138 rows.
Results Based On Time Processing
Table II shows the results obtained based on processing time on the Naive Bayes algorithm and the Linear Support Vector Machine (SVM) algorithm.
TABLE II. TIME PROCESS IN APACHE SPARK USING NB AND SVM
Data NB SVM
5000 3.4s 13.9s
10000 2.29s 22.1s
20000 2.9s 28.9s
40000 3.91s 52.4s
60000 5.52s 1min 24s
80000 6.51s 2min 5s
100000 7.32s 2min 33s
122138 9.18s 2min 41s
Table II shows the time required by each algorithm and indicates that Naive Bayes classifies the data faster than SVM. Processing time grows with the amount of data, but for Naive Bayes the increase is small (from about 3 s to about 9 s), whereas for SVM it rises from 13.9 s to 2 min 41 s.
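The average processing times can be recomputed from the entries of Table II with a short script such as the one below. The parsing helper and variable names are illustrative. Averaging the table entries as printed gives about 5.13 s for Naive Bayes and about 80.0 s for SVM; the SVM figure is slightly above the 79.5 s quoted in the summary of results, presumably because the minute-level entries in the table are rounded.

```python
import re

# Processing times copied from Table II, in row order.
TIMES = {
    "NB": ["3.4s", "2.29s", "2.9s", "3.91s", "5.52s", "6.51s", "7.32s", "9.18s"],
    "SVM": ["13.9s", "22.1s", "28.9s", "52.4s",
            "1min 24s", "2min 5s", "2min 33s", "2min 41s"],
}


def to_seconds(text):
    """Convert strings such as '2min 5s' or '3.4s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)min\s*)?([\d.]+)s", text)
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


averages = {name: sum(to_seconds(value) for value in values) / len(values)
            for name, values in TIMES.items()}
slowdown = to_seconds(TIMES["SVM"][-1]) / to_seconds(TIMES["NB"][-1])
print(averages, slowdown)
```

On the full data set, the ratio in `slowdown` shows that SVM takes more than seventeen times as long as Naive Bayes.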
Results Based On Evaluation Metrics in Naive Bayes
Table III shows the evaluation metrics for the Naive Bayes algorithm. The experiment was executed 8 times with different amounts of data: 5000, 10000, 20000, 40000, 60000, 80000, 100000, and 122138 rows. The values reported are accuracy, precision, recall, and F-measure.
TABLE III. EVALUATION METRICS USING NAIVE BAYES (NB)
Data Precision Recall F-Measure Accuracy
5000 0.925 0.783 0.783 78.36%
10000 0.933 0.811 0.851 80.97%
20000 0.940 0.835 0.871 83.51%
40000 0.945 0.846 0.881 84.66%
60000 0.949 0.859 0.889 85.96%
80000 0.950 0.862 0.892 86.23%
100000 0.950 0.865 0.893 86.52%
122138 0.953 0.871 0.898 87.04%
AVG 94.31% 84.15% 86.97% 84.15%
The Naive Bayes algorithm achieves fairly good accuracy, precision, recall, and F-measure. The test also indicates that the more data processed, the higher the value of each evaluation metric.
Results Based On Evaluation Metrics in SVM
Table IV shows the evaluation metrics for the SVM algorithm. The experiment was executed 8 times with different amounts of data: 5000, 10000, 20000, 40000, 60000, 80000, 100000, and 122138 rows. The values reported are accuracy, precision, recall, and F-measure.
TABLE IV. EVALUATION METRICS USING SUPPORT VECTOR MACHINE (SVM)
Data Precision Recall F-Measure Accuracy
5000 0.963 0.966 0.964 96.61%
10000 0.962 0.964 0.963 96.48%
20000 0.970 0.971 0.970 97.17%
40000 0.972 0.973 0.972 97.31%
60000 0.975 0.975 0.975 97.59%
80000 0.975 0.975 0.975 97.55%
100000 0.977 0.977 0.977 97.79%
122138 0.978 0.978 0.979 97.87%
AVG 97.15% 97.23% 97.18% 97.29%
The SVM algorithm achieves very good accuracy, precision, recall, and F-measure, with averages above 97%. The test also indicates that, overall, the more data processed, the higher the value of each evaluation metric.
Results of Confusion Matrix Using Naive Bayes
Fig. 15. Confusion Matrix Using NB with All Data
Figure 15 shows the confusion matrix for Naive Bayes on the eighth data set, that is, all 122138 rows. The True Positive count (29876) is higher than the False Positive (185), False Negative (4557), and True Negative (1964) counts.
Results of Confusion Matrix Using SVM
Fig. 16. Confusion Matrix Using SVM with All Data
Figure 16 shows the confusion matrix for SVM using the full eighth dataset of 122138 reviews. The results show that the True Positive count (34018) is higher than the False Positive (420), False Negative (358), and True Negative (1758) counts.
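The counts in Figures 15 and 16 can be related to the last rows of Tables III and IV with a short calculation. The following is a minimal sketch, not the authors' code. It assumes that the counts come from a held-out test split (they sum to roughly 36,500 rather than 122138) and that the tabulated precision, recall, and f-measure are support-weighted averages over the two classes; the text states neither, and the function name `weighted_metrics` is hypothetical.

```python
def weighted_metrics(tp, fp, fn, tn):
    """Accuracy plus support-weighted precision, recall and F-measure.

    Assumption: the paper's tables report class-weighted averages;
    this is inferred, not stated in the text.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    # (precision, recall, support) per class. For the negative class the
    # roles of TP/TN and FP/FN are swapped.
    per_class = [
        (tp / (tp + fp), tp / (tp + fn), tp + fn),  # positive reviews
        (tn / (tn + fn), tn / (tn + fp), tn + fp),  # negative reviews
    ]

    precision = sum(p * n for p, _, n in per_class) / total
    recall = sum(r * n for _, r, n in per_class) / total
    f_measure = sum(2 * p * r / (p + r) * n for p, r, n in per_class) / total
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Counts as reported for Fig. 15 (Naive Bayes) and Fig. 16 (SVM).
nb = weighted_metrics(tp=29876, fp=185, fn=4557, tn=1964)
svm = weighted_metrics(tp=34018, fp=420, fn=358, tn=1758)

for name, m in (("NB", nb), ("SVM", svm)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Under these assumptions the accuracies come out at about 87.04% for Naive Bayes and 97.87% for SVM, in line with the 122138 rows of Tables III and IV. The weighted precision comes out at about 0.953 and 0.978 respectively.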
Results of Area Under ROC Curve Using NB
Fig. 17. Area under ROC Curve Using NB
Figure 17 shows the Area under ROC Curve for Naive Bayes using the entire data. The value obtained, 0.648, is a fairly good result because it is greater than 0.5, the value expected from random guessing.
Results of Area Under ROC Curve Using SVM
Fig. 18. Area under ROC Curve Using SVM
Figure 18 shows the Area under ROC Curve for SVM using the entire data. The value obtained, 0.909, is a very good result because it is close to 1.
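To make the 0.5 and 1 reference points concrete: the Area under ROC Curve equals the fraction of (positive, negative) pairs in which the classifier scores the positive example higher, with ties counted as half. The sketch below illustrates that definition only. The function name `auc_from_scores` and the toy scores are hypothetical and are not taken from the paper's data or from Spark.

```python
def auc_from_scores(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores for illustration; NOT the paper's data.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0]
toy_auc = auc_from_scores(toy_scores, toy_labels)

# A classifier that gives every review the same score cannot rank at all.
constant_auc = auc_from_scores([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# A classifier that scores every positive above every negative.
perfect_auc = auc_from_scores([0.9, 0.7, 0.2, 0.1], [1, 1, 0, 0])
```

Here `constant_auc` is 0.5 and `perfect_auc` is 1.0, which are the two reference points against which the reported 0.648 and 0.909 are read.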
Comparison Results of Naive Bayes and SVM
TABLE V. COMPARISON RESULTS OF
Algorithm Precision Recall F-Measure Accuracy
NB 94.31% 84.15% 86.97% 84.15%
SVM 97.15% 97.23% 97.18% 97.29%
Table V shows the mean value of each evaluation metric for Naive Bayes and SVM. Averaged over the four metrics, SVM is better at classifying data in Spark, at 97.21%, compared with 87.39% for Naive Bayes.
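The two overall figures appear to be the plain mean of the four values in each row of Table V. A minimal sketch of that arithmetic, assuming an unweighted mean (the variable and function names are hypothetical):

```python
# Row values copied from Table V (percentages).
nb_row = {"precision": 94.31, "recall": 84.15, "f_measure": 86.97, "accuracy": 84.15}
svm_row = {"precision": 97.15, "recall": 97.23, "f_measure": 97.18, "accuracy": 97.29}


def mean_of_metrics(row):
    """Unweighted mean of the four evaluation metrics in one table row."""
    return sum(row.values()) / len(row)


nb_overall = mean_of_metrics(nb_row)
svm_overall = mean_of_metrics(svm_row)
gap = svm_overall - nb_overall

print(f"NB  overall: {nb_overall:.3f}%")
print(f"SVM overall: {svm_overall:.3f}%")
print(f"Gap: {gap:.3f} percentage points")
```

The means are 87.395 for Naive Bayes and 97.2125 for SVM, which agree with the reported 87.39% and 97.21% up to rounding, a gap of roughly 9.8 percentage points.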
TABLE VI. COMPARISON RESULTS OF AREA UNDER ROC CURVE
Algorithm Area Under ROC Curve
NB 0.648
SVM 0.909
Table VI compares the Area under ROC Curve of Naive Bayes and SVM. SVM has the better result: its value of 0.909 is close to 1, whereas Naive Bayes reaches only 0.648.
Conclusion
Based on the results obtained, Spark processes the dataset of 122138 reviews quickly. First, loading the 122138 rows takes 789 milliseconds. Second, feature extraction takes an average of 7.68 seconds. Third, the Naive Bayes algorithm takes an average of 5.12 seconds. Fourth, the SVM algorithm takes an average of 79.5 seconds, roughly 15 times as long as Naive Bayes.
Both classification algorithms tested in Spark, Naive Bayes (NB) and Linear Support Vector Machine (SVM), give fairly good results. Comparing NB with SVM, the average accuracy is 84.15% versus 97.29%, precision 94.31% versus 97.15%, recall 84.15% versus 97.23%, and f-measure 86.97% versus 97.18%. The Area under ROC Curve is 0.648 versus 0.909.
The results also show that the current BBM app has more positive reviews than negative reviews.
Overall, the SVM algorithm gives better classification results than the Naive Bayes algorithm.
For further research, we expect to test with more data and more classification algorithms, so that the results are more helpful in determining which algorithm performs better. We also hope that the reviews collected will have better sentence structure and wording, which should make the results more accurate and valid.
Wael E., Mariam B., and Ghazi N., “Evaluation of classification algorithms for banking customer’s behaviour under Apache Spark data processing system,” The 4th International Symposium on Emerging Information, Communication and Networks. Procedia Computer Science, vol. 113, pp. 559–564, September 2017.
Bhosale H.S. and Prof. Gadekar D.P., “A review paper on big data and hadoop,”International Journal of Scientific and Research Publications, vol.4, issue 10, October 2014.
Jian F., Junwei S., and Kaiyuan W., “Spark- a big data processing platform for machine learning,” International Conference on Industrial Informatics – Computing Technology, Intelligent Technology, Industrial Information Integration, 2016.
Davidov, D., Tsur, O., Rappoport, A., “Enhanced sentiment learning using twitter hashtags and smileys,” In Proceedings of the 23rd international conference on computational linguistics: posters, 241–249, 2010.
Raschka and Sebastian. “Naive bayes and text classification introduction and theory.” arXiv preprint arXiv: 1410.5329 (2014).
Hearst, Marti A., Susan T. Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. “Support vector machines.” IEEE Intelligent Systems and their Applications 13, no. 4 (1998): 18-28.
Pang, B., Lee, L., Vaithyanathan, S., “Thumbs up? Sentiment classification using machine learning techniques,” In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, 79–86, 2002.
Kharde, V., Sonawane, P., “Sentiment analysis of twitter data: A survey of techniques,” arXiv preprint arXiv: 1601.06971, 2016.
Catal, C., Nangir, M., “A sentiment classification model based on multiple classifiers,” Applied Soft Computing, 50, 135–141, 2017.
Yan Bo, Yang Zijiang, Ren Yitian, Tan Xing, and Liu E., “Microblog sentiment classification using parallel svm in apache spark,” IEEE 6th International Congress on Big Data, 2017.
Srivastava D.K., and Bhambhu L., “Data classification using support vector machine,” Journal of Theoretical and Applied Information Technology, 2005-2009 JATIT.
Huang Y. and Li Lei, “Naive bayes classification algorithm based on small sample set,” Proceedings IEEE Cloud Computing Intelligence Systems, 2011.
B. Alexandros, K. Andreas, and T. Athanasios K., “An apache spark implementation for sentiment analysis on twitter data,” ALGOCLOUD 2016, LNCS 10230, pp. 15–25, 2017.
F. Jian, S. Junwei, and W. Kaiyuan, “Spark – a big data processing platform for machine learning,” International Conference on Industrial Informatics – Computing Technology, Intelligent Technology, Industrial Information Integration, 2016.
Salloum S., Dautov R., Chen X., Peng P.X., and Huang J.Z., “Big data analytics on Apache Spark,” International Journal Data Science Analytics. Springer, 2016.
C. Ece, “Mobile App Analytics ; Sentiment Analysis of Customer Reviews”, University of Greenwich, 2015.
Garcia-Gil, D., Ramirez-Gallego S., Garcia S., and Hererra F., “A comparison on scalability for batch big data processing on apache spark and apache flink,” Big Data Analytic, 2017.
Gilheany E., “Processing time of TFIDF and Naive Bayes on Spark 2.0, Hadoop 2.6 and Hadoop 2.7: Which Tool Is More Efficient?” MSc Research Project Data Analytics, 2016.
A. Mehdi, B. Ehsun, L. Guangchi, and T. P. Ahmad, “Big data machine learning using apache spark mllib,” IEEE Big Data, November 2017.
Su Xiaomeng. “Introduction to big data”. NTNU, 2017.
Jonnalagadda V.S., Srikanth P., Thumati K., and Nallamala Sri H., “A review study of Apache Spark on big data processing,” International Journal of Computer Science Trends and Technology (IJCST), vol. 4, issue. 3, May – Juni 2016.
Aggarwal Charu C. “Data classification algorithm and applications”. Chapman & Hall/CRC Press, 2015.
Zulfa I. and Winarko E., “Sentiment analysis Indonesian language tweet with Deep Belief Network,” Indonesian Journal of Computing and Cybernetics Systems (IJCCS), vol.11, No.2, pp. 187~198, July 2017.
Saito, Takaya, and Marc Rehmsmeier. “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.” PloS one 10.3 (2015): e011843.
Tala Fadillah Z., “A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia,” Universiteit van Amsterdam, December 2003.
Ghag Kranti V. and Shah K., “Conceptual sentiment analysis model,” International Journal of Electrical and Computer Engineering (IJECE), Vol. 8, No. 4, pp. 2358~2366, August 2018.