Rohan Kumar, Rajnish Kumar, and Ayush Singh
Abstract: Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn from data. A machine learning system can be trained to distinguish between spam and non-spam (ham) emails. We analyze current machine learning methods to identify the best techniques for content-based spam filtering. To address this problem we applied several algorithms, some of which give very high accuracy but low precision.
Machine learning is the core of our project; it has improved efficiency by reducing both processing time and the number of false positives. A convolutional neural network has enabled us to differentiate between types of images in the spam and ham categories. Random forest has also helped in separating the various types of mail.
Spam email is a form of commercial advertising intended to reach a large target audience. The mails are sent on a large scale, and even if only a very small fraction of recipients are influenced by them, this brings a large economic benefit to the company. Spammers collect email addresses from public sources as well as through various other promotional tactics. They mostly hide or falsify their identity using various methods to circumvent legislation and blocking. At present, more than 95% of mails sent throughout the world are spam.
To counter spam, various techniques such as Random Forest, SVM, and convolutional neural networks have been used.
SVM can be used for classification as well as regression, although we have used it only for classification here. An SVM relies on functions that take a low-dimensional input space and transform it into a higher-dimensional space, that is, they convert a problem that is not separable into a separable one. These functions are called kernels.
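The kernel idea can be illustrated with a minimal Python sketch. The one-dimensional toy data, the feature map phi(x) = (x, x^2), and the helper names below are illustrative assumptions, not the data or the kernel used in this project.

```python
# Toy illustration of a kernel feature map (hypothetical data, not the project's).

def phi(x):
    """Explicit feature map from 1-D input to 2-D space: x -> (x, x^2)."""
    return (x, x * x)

def kernel(x, y):
    """Kernel equal to the inner product phi(x) . phi(y),
    computed directly from the original inputs."""
    return x * y + (x * y) ** 2

def separable_by_threshold(values, labels):
    """True if a single threshold on `values` splits the two classes perfectly."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

# Class 0 sits near zero, class 1 far from zero: no threshold on x separates them.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

in_input_space = separable_by_threshold(points, labels)
# In the mapped space, a threshold on the second coordinate (x^2) is a linear separator.
in_feature_space = separable_by_threshold([phi(x)[1] for x in points], labels)

print("separable in input space:", in_input_space)
print("separable in feature space:", in_feature_space)
```

In this toy case the classes cannot be split by a threshold on x but can be split by a threshold on x^2, which is the effect the kernel is meant to achieve.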
We have used Python for several algorithms in our project. SVM, Random Forest, and image processing are implemented in Python. SVM and Random Forest deal with the classification between spam and ham, whereas image processing deals with the layering and classification of the images.
The naive Bayes algorithm, which classifies spam and ham and reports the accuracy, is written in Java. In addition, this algorithm classifies the data set into further categories, such as personal, business, and advertisement.
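As a rough illustration of how a naive Bayes spam filter works, the following sketch implements a multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is written in Python only to stay consistent with the other examples in this paper; the class name, the whitespace tokenization, and the four training messages are assumptions made for illustration and do not reproduce the project's Java implementation or its data set.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Minimal multinomial naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def log_score(self, text, label):
        # log P(label) + sum over words of log P(word | label)
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for word in text.lower().split():
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))

# Hypothetical training messages.
nb = NaiveBayesSpamFilter()
nb.train("win free money now", "spam")
nb.train("free prize claim now", "spam")
nb.train("meeting schedule for monday", "ham")
nb.train("project report attached", "ham")

print(nb.classify("free money"))
print(nb.classify("monday meeting report"))
```

The same scoring scheme extends to more than two labels (for example personal, business, and advertisement) by adding one word counter per category.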
The second half of the project deals with image processing. For image spam, we used a deep neural network and passed the large number of images that we take as input through its layers. Through this layering we address the main aim of our project, that is, the classification of spam and ham and the accuracy of that classification. This is an approach towards better classification of emails. The purpose of this paper is not only to show that a neural network is indeed feasible, but also to demonstrate the accuracy of such a method. We provide quantitative results supporting our claims. Our project concentrates on classification between spam and ham with the help of different algorithms: SVM, Random Forest, naive Bayes, and deep neural networks. This helps users classify their emails. The mails were further divided into categories such as private, business, advertisement, and others. An image data set of more than 4000 pictures has been used to separate the images and to demonstrate the efficiency and working of the proposed algorithm.
The rest of the paper is divided into four sections. Section II reviews the research papers we studied before proposing this model and discusses earlier, less efficient approaches to spam filtering. Section III describes the methods we propose and their implementation. Section IV presents the experimental results we obtained when applying the algorithms to the real task of distinguishing spam from ham. Finally, Section V concludes with our overall experience with these algorithms, the problems faced, and future prospects.
Email is an efficient way to exchange information. Considering the growth of the Internet and the wide use of email, the rate of increase of spam is of great concern. Despite tools to prevent spam, it has been increasing daily. A lack of automated systems to prevent spam will result in a spam-saturated World Wide Web, destruction of Internet products, and severe loss of bandwidth. A major problem with introducing spam filtering is that a valid email may be labeled as spam or a spam email may be missed. There are existing techniques to identify emails received in the form of spam, as follows:
1) Black/white list: A white list is a list of addresses from which users tend to receive emails. An advantage of a white list is that it allows users or administrators to put the email addresses of trusted senders into the list to make sure that valid emails received from those addresses are not labeled spam. A black list is a list of addresses from which users do not wish to receive email. An email from a blacklisted address is labeled junk and transferred to a spam folder.
2) Bayesian classification: Bayesian classification is the basis of many anti-spam methods; the probability of a future event is estimated from its occurrence in the past. A Bayesian filter is an automatic classifier. Recently, only text-based algorithms that have shown better efficiency have been used for filtering.
3) Rule-based software package: Rule-based solutions have two substantial disadvantages. First, these systems require users to generate a series of rules, and users need broad knowledge of spam to formulate suitable rules. Second, these rules require reformulation by experts because the features of spam change over time.
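The black/white list technique above can be sketched in a few lines of Python. The function name, the three-way result, and the example addresses are hypothetical and only illustrate the lookup logic; they are not taken from any existing filter.

```python
def classify_by_lists(sender, whitelist, blacklist):
    """Return 'ham' for whitelisted senders, 'spam' for blacklisted ones,
    and 'unknown' when the address is on neither list."""
    address = sender.strip().lower()
    if address in whitelist:
        return "ham"
    if address in blacklist:
        return "spam"
    return "unknown"

# Hypothetical lists maintained by a user or administrator.
whitelist = {"colleague@example.com", "family@example.org"}
blacklist = {"offers@spam.example"}

for sender in ["Colleague@example.com", "offers@spam.example", "new@example.net"]:
    print(sender, "->", classify_by_lists(sender, whitelist, blacklist))
```

Messages that come back as 'unknown' are the ones a content-based classifier still has to handle, which is where the learning-based methods come in.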
We have presented a feature set that accurately identifies image spam across multiple data sets and classification models. The prediction from our system exceeds 90% accuracy, achieves 94% accuracy on our personal spam and ham messages, and can be used to enhance existing content filters for more robust spam classification in the presence of image spam. Additionally, evaluations on data reflecting a real world distribution over spam images yielded upwards of 97% accuracy. Additionally, we presented two methods to improve classification speed. First, we modified a popular feature selection algorithm to be speed sensitive. Our new algorithm reduces the feature set so that overall performance is maintained and computation time per image is greatly reduced. We then introduced JIT feature extraction, which selects features at test time for classification. This method does not affect system performance but greatly reduces the average processing time per image. Overall, our approach makes real time classification of image spam a reality, both in terms of effectiveness and practicality.
In our project our main focus has been on increasing the accuracy of spam and ham classification. To this end we have used various algorithms such as naive Bayes, Support Vector Machine (SVM), and Random Forest. Using these algorithms we were able to calculate the accuracy for spam and ham separately.
Our project presents its results clearly, as we have focused on its important outcomes. When implemented in the real world, it will be a major boost for office workers who deal with email on a daily basis. For each component we have tried to display the corresponding result. For example, for image layering the result is a graph that shows the accuracy on the validation data and on the training/testing data. We also display the confusion matrix, which shows whether the predicted labels match the labels of the testing set. The more closely they match, the more accurate the classification is.
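To make the role of the confusion matrix concrete, the following Python sketch counts the four cells for a binary spam/ham problem and derives accuracy and precision from them. The label lists are invented for illustration and are not the project's results; the function names are likewise assumptions.

```python
def confusion_matrix(actual, predicted, positive="spam"):
    """Return (tp, fp, tn, fn) treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # spam correctly flagged
        elif p == positive:
            fp += 1          # ham wrongly flagged as spam
        elif a == positive:
            fn += 1          # spam that was missed
        else:
            tn += 1          # ham correctly passed
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical labels: 2 spam and 8 ham messages, one ham flagged as spam.
actual = ["spam", "spam"] + ["ham"] * 8
predicted = ["spam", "spam", "spam"] + ["ham"] * 7

tp, fp, tn, fn = confusion_matrix(actual, predicted)
print("accuracy:", accuracy(tp, fp, tn, fn))
print("precision:", precision(tp, fp))
```

In this toy example a single false positive leaves accuracy at 0.9 while precision drops to 2/3, which shows how a classifier can combine high accuracy with lower precision.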
When data are not labeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find a natural clustering of the data into groups and then maps new data to these groups. The clustering algorithm that extends support vector machines to this setting is called support vector clustering, and it is often used in industrial applications either when data are not labeled or when only some data are labeled, as a preprocessing step for a classification pass.
The second half of our project deals with image processing. Using a deep neural network, we have passed the large number of images that we take as input through its layers, with the aim of classifying spam and ham and measuring the accuracy. Our project concentrates on classification between spam and ham with the help of different algorithms such as SVM, Random Forest, naive Bayes, and deep neural networks. This helps users classify their emails. The mails were divided into categories such as private, business, advertisement, and many more.
This table shows the accuracy for spam and ham using the different algorithms. To get the best result we implemented several algorithms. Naive Bayes shows only 66.7% total accuracy. We then applied the support vector machine algorithm, which gave a better result, with a total accuracy of 97%. The best approach we found in this project is random forest, which gave a very high accuracy of 98% but low precision.
Algorithm                     Spam (%)   Ham (%)   Total (%)
1. Multinomial naive Bayes    68.6       64.9      66.7
3. Random Forest              97         98        98
In this project we first applied the support vector machine algorithm. We collected the data set from different companies and organisations. The data set includes both spam and ham emails. We classify these emails using the support vector machine. The accuracy, at 97%, is much better than that of the previously applied algorithm.
After applying SVM we applied random forest, an ensemble method for classification, which performed better than SVM. In our project we used the data set to classify spam and ham with random forest. The accuracy is higher than that of the previously applied algorithms: it is 98%, although the precision is low.
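The ensemble idea behind random forest can be sketched as follows. This is a deliberately simplified version that uses depth-1 trees (decision stumps) instead of full decision trees, while keeping the two ingredients of the method: bootstrap sampling of the training rows and a random subset of features per tree, followed by a majority vote. The binary word-presence features and the eight training rows are hypothetical and do not come from the project's data set.

```python
import random
from collections import Counter

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(X, y, feature_ids):
    """Pick the feature (among feature_ids) whose one-level split makes fewest errors."""
    overall = majority(y, "ham")
    best = None
    for f in feature_ids:
        on = majority([lab for row, lab in zip(X, y) if row[f] == 1], overall)
        off = majority([lab for row, lab in zip(X, y) if row[f] == 0], overall)
        errors = sum((on if row[f] == 1 else off) != lab for row, lab in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, f, on, off)
    return best[1], best[2], best[3]

def fit_forest(X, y, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(len(X)) for _ in X]            # bootstrap sample
        feats = rng.sample(range(len(X[0])), n_features)     # random feature subset
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [on if row[f] == 1 else off for f, on, off in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical features: contains "free", "win", "meeting", "report".
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0],
     [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]]
y = ["spam"] * 4 + ["ham"] * 4

forest = fit_forest(X, y)
print(predict(forest, [1, 1, 0, 0]))
print(predict(forest, [0, 0, 1, 1]))
```

Because each stump sees a different sample and a different feature subset, individual stumps can be wrong while the vote over the whole forest remains stable.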
The above figure shows how the email filtering is done.
This graph shows the accuracy on the training and testing data. We divided the data set, taking 75% of the data for training and 25% for testing. The result shows the accuracy for spam and ham respectively.
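The 75%/25% division described above can be sketched in Python as follows. The function name, the fixed seed, and the placeholder samples are assumptions for illustration; the project's own splitting code is not shown in this paper.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of `samples` and return (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data: 100 (message id, label) pairs.
samples = [(i, "spam" if i % 2 else "ham") for i in range(100)]
train, test = train_test_split(samples)
print(len(train), "training samples,", len(test), "testing samples")
```

Shuffling before splitting avoids a test set that contains only one class when the data set is stored in label order.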
We found that random forest outperformed the other classifiers, reaching 98% total accuracy, compared with 97% for SVM and 66.7% for naive Bayes. Naive Bayes was simple and easy to implement but gave the lowest accuracy. Compared with black/white lists or rule-based techniques, SVM combined with natural language processing turned out to be a more efficient approach to filtering spam from ham. With the use of a convolutional neural network, images have also been classified, which adds a new dimension to mail filtering. Until now, only text-based filtering approaches were used, but this technique also allows image content to be used to differentiate between spam and ham.
However, we believe that the performance of these algorithms could still improve. We encourage those who wish to research this project further to investigate the effects of a weighted majority vote, enhanced feature selection, and different distance measures. We found that the accuracy was very high while the precision was low. This suggests that our algorithms are too broad in labeling an email as spam. After analysis, we believe that a machine learning approach to spam filtering is a viable and effective method to supplement current spam detection techniques.
Use various sampling and regression methods to further increase accuracy.
Make greater use of natural language processing for easier distinction between categories.
As seen above, accuracy still needs to improve considerably. For that we need a good data set, and we can refine the techniques we use to classify the data.