Abstract

Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn from data. A machine learning system can be trained to distinguish between spam and non-spam (ham) emails. We aim to analyze current methods in machine learning to identify the best techniques to use in content-based spam filtering. To address this problem we used several algorithms, some of which give very high accuracy but low precision. Machine learning is the fulcrum of our project; it has enabled us to increase efficiency, both by reducing processing time and by reducing false positives.

A convolutional neural network has enabled us to differentiate between different types of images in the spam and ham categories. Random forest has also helped in separating the various types of mail.

Keywords: CNN, PERCLOS, TensorFlow

Introduction

Spam email is a type of advertising carried out on a commercial scale to reach a wider target audience.

The mails are sent on a large scale, and even if only a very small fraction of recipients are influenced by them, this brings a large economic benefit to the company.

Spammers collect email addresses from public sources as well as through various other promotion tactics. They mostly hide or falsify their identity using various methods to circumvent legislation and blocking. At present, more than 95% of the mail sent throughout the world is spam.

To counter spam, various techniques such as Random Forest, SVM and convolutional neural networks have been used.

SVM can be used for classification as well as regression, although we use it only for classification here. Kernels are functions that take a low-dimensional input space and transform it into a higher-dimensional space, i.e. they convert a non-separable problem into a separable one.
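The idea of mapping to a higher-dimensional space can be illustrated with a small sketch. It is not the project's SVM code: the toy XOR-style data, the explicit `feature_map`, and the hand-picked plane weights are all assumptions chosen for illustration. The four points cannot be split by a straight line in two dimensions, but after adding the product of the coordinates as a third feature a single plane separates them. The `polynomial_kernel` function shows how a kernel returns an inner product in such a space without constructing it.

```python
# XOR-style toy data: no straight line in 2-D separates the two classes.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]  # +1 when the two coordinates have different signs


def feature_map(x):
    """Hypothetical explicit map from 2-D to 3-D: (x1, x2) -> (x1, x2, x1*x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def linear_classifier(z, weights=(0.0, 0.0, -1.0), bias=0.0):
    """A plane in the 3-D space; the weights are hand-picked for this toy data."""
    score = sum(w * zi for w, zi in zip(weights, z)) + bias
    return 1 if score > 0 else -1


def polynomial_kernel(a, b):
    """Degree-2 polynomial kernel: an inner product in a higher-dimensional
    space, computed directly from the low-dimensional inputs."""
    return (1 + sum(ai * bi for ai, bi in zip(a, b))) ** 2


# After the mapping, a linear rule classifies every point correctly.
predictions = [linear_classifier(feature_map(p)) for p in points]
print(predictions == labels)  # True
```

A trained SVM would learn the separating plane from data instead of using fixed weights; the sketch only shows why the transformed problem becomes separable.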

We used Python for several of the algorithms in our project. SVM, Random Forest and image processing are implemented in Python. SVM and Random Forest handle the classification between spam and ham, whereas image processing deals with the layering and classification of the images.

The Bayes algorithm, which classifies spam and ham and calculates the accuracy, is written in Java. In addition, this algorithm classifies the data set into further categories, which may include personal, business and advertisement.

The second half of the project deals with image processing. For image spam, we used a deep neural network to layer the large number of images that we consider as input. Through this process of layering we address the main agenda of our project, i.e. the classification of spam and ham together with its accuracy. This is an approach towards better classification of emails.
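As a rough illustration of what a single layer in such a network computes, the sketch below applies one convolution filter followed by a ReLU activation to a tiny grayscale image. It is a minimal pure-Python example, not the project's model: the 4x4 image and the vertical-edge kernel are invented, and a real network learns many such kernels from the training images.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, the core operation of a
    convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Element-wise ReLU: negative responses are set to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


# Toy 4x4 grayscale image: dark left half, bright right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

features = relu(conv2d(image, edge_kernel))
print(features)  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

The output is largest in the middle column, where the edge lies. Stacking several such layers lets a network build up from edges to larger patterns.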

The purpose of this paper is not only to show that a neural network is indeed feasible, but also to demonstrate the accuracy of such a method. We provide quantitative results supporting our claims. Our project mostly concentrates on classification between spam and ham with the help of different algorithms such as SVM, Random Forest, naive Bayes and deep neural networks. This helps users classify their emails. The mails were classified into categories such as private, business, advertisement and others. Image data consisting of more than 4000 pictures has been used to separate spam images from ham images and to demonstrate the efficiency and working of the proposed algorithm.

The rest of the paper is divided into four sections. Section II consists of a literature review of the research papers we studied before proposing this model; earlier, less efficient approaches to spam filtering are also discussed. Section III deals with the methods we propose and their implementation. Section IV deals with the experimental results we obtained when we ran the algorithms on the real task of differentiating between spam and ham. Finally, Section V concludes with our overall experience with these algorithms, the problems faced, and future prospects.

Literature Review

Email is an efficient way to exchange information. Considering the growth of the Internet and the wide use of email, the rate of increase of spam is of great concern. Despite tools to prevent spam, it has been increasing daily. A lack of automated systems to prevent spam will result in a spam-saturated World Wide Web, destruction of Internet products and severe loss of bandwidth. A major problem with introducing spam filtering is that a valid email may be labeled as spam or a spam email may be missed. The existing techniques for identifying spam emails include the following:

Black/white list

A white list is a list of addresses from which users tend to receive emails. An advantage of a white list is that it allows users or administrators to add the email addresses of trusted people, so that valid emails received from addresses in the white list are not labeled as spam.

A black list is a list of addresses from which users do not want to receive email. An email from such an address is labeled junk and transferred to a spam folder.
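The two lists described above amount to a simple lookup on the sender address. The following is a minimal sketch; the addresses and the `classify_by_list` helper are hypothetical. Checking the white list first, so that a trusted sender is never filtered, is a design assumption made here and not something the text specifies.

```python
def classify_by_list(sender, whitelist, blacklist):
    """Return 'ham', 'spam' or 'unknown' based only on the sender address."""
    sender = sender.strip().lower()
    if sender in whitelist:
        return "ham"      # trusted sender: never sent to the spam folder
    if sender in blacklist:
        return "spam"     # blocked sender: goes to the spam folder
    return "unknown"      # not on either list: needs another filter


# Hypothetical example lists.
whitelist = {"colleague@example.com"}
blacklist = {"promo@example.net"}

print(classify_by_list("Colleague@Example.com", whitelist, blacklist))  # ham
print(classify_by_list("promo@example.net", whitelist, blacklist))      # spam
print(classify_by_list("stranger@example.org", whitelist, blacklist))   # unknown
```

The `unknown` case shows the limitation of the approach: mail from any new address falls through and must be handled by a content-based method.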

Bayesian classification

Bayesian classification is the basis of many anti-spam methods: the probability of a future event is estimated from how often it occurred in the past. A Bayesian filter is an automatic classifier. Recently, only text-based algorithms that have shown better efficiency are used for filtering.
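A minimal sketch of this idea is a multinomial naive Bayes classifier that estimates word probabilities from past emails and uses them to score a new one. It is written in Python for consistency with the other examples and is not the Java implementation used in the project. The four training messages are invented, and Laplace smoothing (`alpha`) is an assumption added so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes over whitespace-separated words."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # Laplace smoothing constant
        self.word_counts = {}         # label -> Counter of word frequencies
        self.doc_counts = Counter()   # label -> number of training emails
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior: how often this class occurred in the past
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocabulary)
            for word in words:
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set.
train_texts = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("free money prize"))        # spam
print(model.predict("monday project meeting"))  # ham
```

The same class works unchanged with more than two labels, for example personal, business and advertisement, because it scores every label seen during training.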

Rule based software package

Rule-based solutions have two substantial disadvantages. First, these systems required users to generate a series of rules; the users required broad knowledge of spam to formulate suitable rules. Second, these rules required reformulation by experts because features of spam change over time.
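Both disadvantages can be seen in a small sketch of a rule-based filter. The rules, scores and threshold below are hypothetical and were written by hand, which is exactly the effort the approach demands from its users.

```python
import re

# Hypothetical hand-written rules: (pattern, score added when it matches).
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 2),
    (re.compile(r"\bwinner\b", re.IGNORECASE), 3),
    (re.compile(r"!{3,}"), 1),
    (re.compile(r"https?://\S+", re.IGNORECASE), 1),
]
THRESHOLD = 3  # assumed cut-off: total score at or above this means spam


def rule_score(text):
    """Sum the scores of all rules whose pattern occurs in the text."""
    return sum(score for pattern, score in RULES if pattern.search(text))


def is_spam(text):
    return rule_score(text) >= THRESHOLD


print(is_spam("You are a WINNER!!! Claim your free gift"))  # True
print(is_spam("Lunch at noon?"))                            # False
print(is_spam("Fr3e gift inside"))                          # False
```

The last message slips through because the spammer changed one character, so the rule for "free" no longer matches and has to be reformulated by hand.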

We have presented a feature set that accurately identifies image spam across multiple data sets and classification models. The prediction from our system exceeds 90% accuracy, achieves 94% accuracy on our personal spam and ham messages, and can be used to enhance existing content filters for more robust spam classification in the presence of image spam. Additionally, evaluations on data reflecting a real-world distribution over spam images yielded upwards of 97% accuracy. We also presented two methods to improve classification speed. First, we modified a popular feature selection algorithm to be speed sensitive. Our new algorithm reduces the feature set so that overall performance is maintained and computation time per image is greatly reduced. We then introduced JIT feature extraction, which selects features at test time for classification. This method does not affect system performance but greatly reduces the average processing time per image. Overall, our approach makes real-time classification of image spam a reality, both in terms of effectiveness and practicality.

When data are not labeled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find a natural clustering of the data into groups and then maps new data to these groups. The clustering algorithm that extends support vector machines to this setting is called support vector clustering. It is often used in industrial applications either when data are not labeled or when only some data are labeled, as a preprocessing step for a classification pass.

In our project our main focus was on increasing the accuracy of spam and ham classification. To address this problem we used various algorithms such as naive Bayes, Support Vector Machine (SVM) and Random Forest. Using these algorithms we were able to calculate the accuracy for spam and ham separately.

Our project presents its results clearly, as we have focused on the important outcomes. When this project is implemented in the real world, it will be a major boost for office workers who deal with email on a daily basis. For each topic we have tried to display the corresponding result. For example, for image layering the result is a graph showing the accuracy on the validation data and on the training/testing data. We also display the confusion matrix, which shows whether the predicted labels match the labels of the test set. If they match, our classification is correct and its accuracy is high.
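The confusion matrix and the figures derived from it can be computed as in the sketch below. The ten labels are invented for illustration and are not results from the project; treating spam as the positive class is likewise an assumption.

```python
def confusion_matrix(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["tp"] += 1   # spam correctly flagged
        elif p == positive:
            counts["fp"] += 1   # ham wrongly flagged as spam
        elif a == positive:
            counts["fn"] += 1   # spam that was missed
        else:
            counts["tn"] += 1   # ham correctly passed
    return counts


def accuracy(cm):
    """Fraction of all emails that were labeled correctly."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


def precision(cm):
    """Fraction of emails flagged as spam that really were spam."""
    flagged = cm["tp"] + cm["fp"]
    return cm["tp"] / flagged if flagged else 0.0


# Invented test set: 2 spam and 8 ham emails.
actual = ["spam"] * 2 + ["ham"] * 8
predicted = ["spam", "spam", "spam", "spam"] + ["ham"] * 6

cm = confusion_matrix(actual, predicted)
print(cm)             # {'tp': 2, 'fp': 2, 'tn': 6, 'fn': 0}
print(accuracy(cm))   # 0.8
print(precision(cm))  # 0.5
```

In this toy case the accuracy looks acceptable while half of the flagged emails are actually ham, which is why the confusion matrix is worth reporting alongside a single accuracy figure.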

Conclusion

We found that the naive Bayes algorithm outperformed the other classifiers. This simple algorithm achieved great performance and was easy to implement. Compared with black/white lists or rule-based techniques, SVM combined with natural language processing turned out to be a more efficient approach to filtering spam from ham. With the use of a convolutional neural network, images have been classified as well, which adds a new dimension to mail filtering. Until now only text-based filtering approaches were used, but this technique also allows image content to be used to differentiate between spam and ham.

However, we believe that the performance of this algorithm could still improve. We encourage those who wish to further research this project to investigate the effects of a weighted majority vote, enhanced feature selection and different distance measures. We discovered that the accuracy of this algorithm was very high while its precision was low. This suggests that our algorithm is very broad in labeling an email as spam. After analysis, we believe that a machine learning approach to spam filtering is a viable and effective method to supplement current spam detection techniques.

In addition, various sampling and regression methods can be used to increase accuracy.

Greater use of natural language processing would make it easier to distinguish between categories.

As seen above, accuracy still needs to improve considerably. For that we need a good data set, and we can improve the techniques we use to classify the data.
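The weighted majority vote suggested above as a direction for further research could look like the following sketch. It was not implemented or evaluated in this project. The classifier names follow the algorithms used in the paper, but the predictions and weights are hypothetical; in practice the weights might be taken from each model's validation accuracy.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Combine per-classifier labels; each vote counts with that classifier's weight."""
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    # The label with the largest total weight wins.
    return max(totals, key=totals.get)


# Hypothetical predictions of three classifiers for one email.
predictions = {"naive_bayes": "ham", "svm": "spam", "random_forest": "spam"}

equal_weights = {"naive_bayes": 1.0, "svm": 1.0, "random_forest": 1.0}
tuned_weights = {"naive_bayes": 0.9, "svm": 0.4, "random_forest": 0.4}

print(weighted_majority_vote(predictions, equal_weights))  # spam
print(weighted_majority_vote(predictions, tuned_weights))  # ham
```

With equal weights the two spam votes win. When the single ham vote comes from a classifier trusted more than the other two combined, the decision flips, which is one way to reduce the number of valid emails labeled as spam.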

Cite this page

EMail Spam Detection. (2019, Dec 01). Retrieved from https://paperap.com/email-spam-detection/
