Bachelor of Science in Information Technology
Department of Information Technology Sri Lanka Institute of Information Technology
(Proposal documentation submitted in partial fulfillment of the requirements for the Degree of Bachelor of Science Special (Honors) in Information Technology)
We declare that this is our own work and that this project proposal does not incorporate, without acknowledgement, any material previously submitted for a degree or diploma in any other university or institute of higher learning, and that to the best of our knowledge and belief it does not contain any material previously published or written by another person except where acknowledgement is made in the text.
The above candidates are carrying out research for the undergraduate Dissertation under my supervision.
Signature of the supervisor: Date: 3/5/2019
In any business, we process many types of business overhead documents, such as invoices and goods return notices, to obtain business-related data. Processing these documents manually can be very complex and time consuming. Then there is the human error factor. The efficiency of each employee can also differ. Taking all these factors into account, automating the document processing mechanism can be very effective and efficient. Sarotis is a product aimed at solving this issue.
Sarotis is a tool capable of processing the data in business documents and formatting it into a common format, such as JSON or spreadsheet files, so that the information is easy to access and manipulate. Basic computer vision, machine learning and image processing techniques are used to implement this solution.
The main objective of every business is to maximize profit, and in the 21st century information is the main driving force behind every business. Businesses process that information to make future business decisions, generate reports, make predictions, produce accounting statements and much more, with business decision making being the primary focus. Sarotis is also capable of analyzing the processed data to give predictions on business transactions, so that management can use them to make decisions. Furthermore, this tool can be used to produce reports on various business areas.
1. INTRODUCTION
1.1. Background
1.2. Literature Review
1.2.1. Sales Analysis and prediction module
1.2.2. Optical character recognition and processing.
1.3. Research Gap
1.4. Research Problem
2. OBJECTIVES
2.1. Main objectives
2.2. Specific Objectives
3. RESEARCH METHODOLOGY
3.1. High-level architecture
3.2. Major Components
3.2.1. Computer-generated document analyzer
3.2.2. Handwritten document analyzer
3.2.3. Template training and accounting statement generator
3.2.4. Business data analysis and predictions module
3.4. Testing
3.5. Marketability
3.6. Gantt chart
4. DESCRIPTION OF PERSONAL AND FACILITIES
5. BUDGET
6. REFERENCES
Figure 1 – Level Of Benefits And Scope
Figure 3 – The Different Areas Of Character Recognition
Figure 6 – The high-level architecture diagram
Figure 7 – Computer Generated Document Analyzer
Figure 8 – Business data analysis and predictions module
Figure 9 – Gantt Chart
Table 1 – Description of personal and facilities
1. INTRODUCTION
1.1. Background
In today's world, the main goal of almost every business is to maximise profit while reducing costs and workload. Efficiency in all key aspects of the business is crucial to achieving this. Today's businesses depend extensively on data, and handling that data efficiently can boost overall efficiency by a significant margin. All business organizations handle business-related documents in various capacities. Medium-scale businesses in particular process such documents in considerable volume, because their operating capacity is comparatively high but they have not invested in methodical automation of the process. Sarotis provides a cost-effective and efficient solution in such cases.
Labour cost, in the form of employee salaries, can be a huge cost to the business even if only a single employee is hired for data entry and document processing. Salaries and the other benefits that have to be granted to employees keep increasing with the ever-increasing cost of living and with government-imposed rules and regulations on employers. Sarotis provides a simple-to-handle solution that does not need special training or high-level technical skills; mid-level management can easily operate it.
The information contained in these documents can be very sensitive to the business, so the fewer layers this data passes through, the better. Businesses these days place a very high value on the privacy and security of their data. Sarotis is a cloud-based solution, and we can restrict access by unauthorized parties while permitting easy access to relevant parties.
Businesses sometimes require insight into how to make future investments. Entrepreneurs with years of experience and technical understanding can make such decisions effectively without much effort, but even they need past sales information and information on other parameters to do so. A business prediction system that analyses past information intensively and makes well-founded predictions is therefore an added advantage.
1.2. Literature Review
1.2.1. Sales Analysis and prediction module
To get a better understanding of the business, we extract and analyze the organization's internal data in order to give predictions and suggestions for improving the condition of the business, and to analyze market behavior and predict how the market will behave in future. We use purchasing data to get an idea of sales conditions, and based on that we make predictions about future purchases. For example, such predictions are important when preparing for upcoming seasons of the year. Furthermore, this information can be used to decide how we channel our resources and make investments.
Machines and humans have distinct strengths and weaknesses in the context of prediction. As prediction machines improve, businesses must adjust their division of labor between humans and machines in response. Prediction machines are better than humans at factoring in complex interactions among different indicators, especially in settings with rich data.
As the number of dimensions for such interactions grows, the ability of humans to form accurate predictions diminishes, especially relative to machines. However, humans are often better than machines when understanding the data generation process confers a prediction advantage, especially in settings with thin data.
Hugh J. Watson, Barbara H. Wixom, “The Current State of Business Intelligence”.
As business users mature to performing analysis and prediction, the level of benefits becomes more global in scope and more difficult to quantify. For example, the most mature uses of BI (Business Intelligence) might facilitate a strategic decision to enter a new market, change a company's orientation from product-centric to customer-centric, or help launch a new product line.
Ajay Agrawal, Joshua Gans, Avi Goldfarb, “Prediction Machines: The Simple Economics of Artificial Intelligence”.
Prediction is the process of filling in missing information. Prediction takes the information we have, often called data, and uses it to generate information we do not have. In addition to generating information about the future, prediction can generate information about the present and the past.
Between the first year of the image recognition competition described by Agrawal et al., in 2010, and the final contest in 2017, prediction got much better. Figure 2 shows the accuracy of the contest winners by year. The vertical axis measures the error rate, so lower is better. In 2010, the best machine predictions made mistakes in 28 percent of cases, whereas humans make mistakes around 5 percent of the time. Predictions are therefore an important way for a business to gain a better understanding of itself and to make decisions about the future behavior of the market.
Data is often costly to acquire, but prediction machines cannot operate without it. They require data to create, operate, and improve.
We must therefore make decisions about the scale and scope of data acquisition. How many different types of data do we need? How many different objects are required for training? How frequently do we need to collect data? More types, more objects, and more frequency mean higher cost but also potentially higher benefit. In thinking through this decision, we must carefully determine what we want to predict. The particular prediction problem will tell us what we need.
According to the above, prediction machines use three types of data:
(1) Training data for training the AI.
(2) Input data for predicting.
(3) Feedback data for improving prediction accuracy.
For better accuracy, we need more data, and high prediction accuracy often enables machines to perform tasks well. Sometimes, however, prediction machines lack data because some events are rare. If a machine cannot observe enough data about an event, it cannot make good predictions about it, and as a result the prediction mechanism performs poorly.
Machines are bad at prediction for rare events. Managers make decisions on mergers, innovation, and partnerships without data on similar past events for their firms. Humans use analogies and models to make decisions in such unusual situations. Machines cannot predict judgment when a situation has not occurred many times in the past. 
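The three roles of data listed above can be illustrated with a deliberately small sketch. It is not the prediction module proposed here: the class name `MeanPredictor`, the choice of a per-product average and the sample figures are all assumptions made only for illustration. Training data fits the model, input data selects which prediction to return, and feedback data (the value actually observed) refines later predictions. The unseen product at the end shows the rare-event problem in miniature.

```python
from collections import defaultdict


class MeanPredictor:
    """Toy predictor: forecasts the average quantity seen so far per product."""

    def __init__(self):
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def train(self, records):
        """(1) Training data: historical (product, quantity) pairs."""
        for product, quantity in records:
            self._update(product, quantity)

    def predict(self, product):
        """(2) Input data: the product we want a forecast for."""
        if self._count[product] == 0:
            return None  # rare or unseen item: no basis for a prediction
        return self._total[product] / self._count[product]

    def feedback(self, product, actual_quantity):
        """(3) Feedback data: the observed outcome refines later forecasts."""
        self._update(product, actual_quantity)

    def _update(self, product, quantity):
        self._total[product] += quantity
        self._count[product] += 1


model = MeanPredictor()
model.train([("rice", 100), ("rice", 120), ("sugar", 40)])
first = model.predict("rice")      # average of 100 and 120
model.feedback("rice", 140)        # the actual purchase turned out higher
second = model.predict("rice")     # average of 100, 120 and 140
unseen = model.predict("saffron")  # never observed, so no prediction
```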
1.2.2. Optical character recognition and processing.
Optical character recognition (OCR) is the process of classifying optical patterns contained in a digital image. Character recognition is achieved through segmentation, feature extraction and classification. OCR systems involve different techniques:
1. Optical scanning.
2. Location segmentation.
6. Feature extraction.
7. Training and recognition.
Arindam Chaudhuri, Krupa Mandaviya, Pratixa Badelia, Soumya K Ghosh (auth.) Optical Character Recognition Systems for Different Languages with Soft Comp
OCR tries to address several issues with the above-mentioned techniques for automatic identification. OCR systems are required when information must be readable by both humans and machines. They have carved out a niche in pattern recognition. Their uniqueness lies in the fact that they do not require control of the process that produces the information. OCR deals with the problem of recognizing optically processed characters.
Optical recognition is performed offline, after the writing or printing has been completed, whereas online recognition is achieved when the computer recognizes the characters as they are drawn. Both hand-printed and printed characters may be recognized, but performance depends directly on the quality of the input documents. The more constrained the input is, the better the performance of the OCR system.
Figure 3 – The different areas of character recognition
The main concept of optical character recognition is first to teach the machine which classes of patterns may occur and what they look like. This is done by showing the machine examples of characters from all the different classes.
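The idea of teaching the machine by example can be sketched as a nearest-template classifier. This is a minimal illustration under stated assumptions, not the recognizer Sarotis will use: the 3x3 bitmaps, the three character classes and the pixel-difference (Hamming) distance are all chosen here only to keep the example small.

```python
# Labelled examples: 3x3 binary bitmaps, one per character class (illustrative).
TRAINING_EXAMPLES = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}


def hamming_distance(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(bitmap):
    """Return the class whose training example is closest to the bitmap."""
    return min(TRAINING_EXAMPLES,
               key=lambda label: hamming_distance(TRAINING_EXAMPLES[label], bitmap))


# A noisy "L": one pixel of the bottom stroke is missing.
noisy_l = ("100",
           "100",
           "110")
predicted = classify(noisy_l)
```

A practical system would learn from many examples per class and from richer features, but the principle is the same: an unknown pattern is assigned to the class it most resembles.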
Imaging defects cause OCR errors, for example when neighboring characters are joined or fused due to heavy print (Figure 4) or when the print is too light (Figure 5).
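One way to see why fused characters cause errors is to look at a simple segmentation step. The sketch below assumes a common approach, splitting a binarized text line wherever a pixel column contains no ink. It is not necessarily the segmentation method Sarotis will use, and the function name `segment_columns` and the tiny bitmaps are made up for illustration.

```python
def segment_columns(line_bitmap):
    """Split a binary text-line image into (start, end) column spans.

    A span is a maximal run of columns containing at least one ink pixel ("1").
    The end index is exclusive.
    """
    width = len(line_bitmap[0])
    has_ink = [any(row[x] == "1" for row in line_bitmap) for x in range(width)]

    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                 # a character begins
        elif not ink and start is not None:
            spans.append((start, x))  # a blank column ends the character
            start = None
    if start is not None:
        spans.append((start, width))  # character touching the right edge
    return spans


# Clean print: an "I" and an "L" separated by one blank column.
clean = ["10100",
         "10100",
         "10111"]

# Heavy print: ink has filled the gap, so the two characters touch.
heavy = ["11100",
         "11100",
         "11111"]

clean_spans = segment_columns(clean)
heavy_spans = segment_columns(heavy)
```

With the clean line the two characters are found separately, while the heavy line yields a single span, so a classifier downstream would be asked to recognize two fused characters as one.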
To improve the accuracy of OCR, we need to improve image processing, adapt to the current document, use multi-character recognition and increase the use of linguistic context.
1.3. Research Gap
As mentioned above, during the literature review we found that similar systems have already been created for invoice handling, but those systems have several drawbacks.
Most invoice handling systems only manage computer-generated invoices. In practice, however, companies receive a lot of handwritten invoices. We therefore also plan to handle handwritten invoices (English only) along with computer-generated ones. To do that, we intend to use a machine learning algorithm.
Invoices are mainly handled for accounting purposes. However, alternative invoice handling systems do not provide a built-in accounting function; they only summarize the data in the invoices. We plan to summarize the data from invoices and enter that data into the proper accounting equation. We then hope to output those accounting equations in a common format (XML or JSON) for use with whatever accounting software the company uses for its accounting purposes.
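As a rough sketch of what such an output could look like, the example below turns extracted invoice data into a balanced double-entry record and serializes it as JSON. Everything specific in it is an assumption made for illustration: the field names, the sample invoice, and the treatment of a purchase invoice as a debit to a "Purchases" account and a credit to "Accounts Payable". The actual mapping to accounts would depend on the company's chart of accounts.

```python
import json
from decimal import Decimal


def invoice_to_journal_entry(invoice):
    """Map extracted invoice fields to a balanced double-entry record."""
    total = sum(Decimal(item["amount"]) for item in invoice["items"])
    entry = {
        "date": invoice["date"],
        "reference": invoice["number"],
        "lines": [
            # Hypothetical account names; a real system would make these configurable.
            {"account": "Purchases", "debit": str(total), "credit": "0.00"},
            {"account": "Accounts Payable", "debit": "0.00", "credit": str(total)},
        ],
    }
    # The accounting equation must hold: total debits equal total credits.
    debits = sum(Decimal(line["debit"]) for line in entry["lines"])
    credits = sum(Decimal(line["credit"]) for line in entry["lines"])
    if debits != credits:
        raise ValueError("journal entry does not balance")
    return entry


# Made-up data standing in for the output of the document analyzer.
extracted = {
    "number": "INV-001",
    "date": "2019-03-05",
    "items": [
        {"description": "Paper", "amount": "1250.00"},
        {"description": "Toner", "amount": "4300.50"},
    ],
}
entry = invoice_to_journal_entry(extracted)
as_json = json.dumps(entry, indent=2)
```

Amounts are kept as decimal strings rather than floating-point numbers so that currency values are not affected by rounding when the file is read by other accounting software.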
Currently, none of the invoice handling systems we reviewed gives predictions about the business. With predictions about purchasing, a business can make valuable decisions about its future and can manage its income and expenditure more easily. We therefore plan to provide some valuable predictions based on purchasing and internal factors. To do this, we intend to use Artificial Intelligence (AI).
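To make the idea of a purchasing prediction concrete, the sketch below fits a straight-line trend to past monthly purchase totals and extrapolates one month ahead. A least-squares trend line is used here only as a simple stand-in baseline; it is not the AI technique the project will adopt, and the function names and figures are hypothetical.

```python
def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_next(values):
    """Extrapolate the fitted trend one period past the last observation."""
    slope, intercept = fit_trend(values)
    return intercept + slope * len(values)


# Made-up monthly purchase totals that grow by 10 each month.
monthly_purchases = [100.0, 110.0, 120.0, 130.0]
next_month = forecast_next(monthly_purchases)
```

A baseline like this ignores seasonality and internal factors, which is exactly where a richer learned model would be expected to add value.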
1.4. Research problem
The main goal of every business is to make profits and to maximize them over time. To do so, a business needs to increase its efficiency in all aspects. Information handling is a vital factor in almost all modern businesses, and they are all integrated with modern technologies to a certain degree. To handle business information efficiently, opting for an automated system is the only practical option. Such a system can be more efficient at storing, retrieving, reporting on and analyzing data, among many other operations.
A business mainly processes documentation related to various overhead types, such as invoices, return notes and many other documents. Processing them manually can be painstaking, time consuming and inefficient. The factor of human error is also present and cannot be eliminated. Sensitive data of the organization can sometimes be vulnerable as well, because the data has to pass through an extra layer of the management hierarchy besides the top-level decision makers.
Almost all of the above-mentioned issues could be eliminated with a tool like Sarotis. The efficiency of data processing is increased because the data is updated in real time into a cloud-based database. This allows us to manipulate the data as we wish in areas such as sales prediction.
The security of the information is ensured because access to it can be controlled as the management or the ownership wishes.
In large-scale organizations, manual techniques could create a performance bottleneck in the whole process. The labour cost of operating such a pool of workers can also be eliminated. Processing such documentation is made simple, because the only requirement for operating this system is a scanned image of the invoice or other relevant document. Businesses sometimes receive handwritten documentation in addition to computer-generated documents, and Sarotis is capable of processing handwritten documents too.
We use templates of frequently used documents to train our overhead detection mechanism extensively, increasing the accuracy and speed of the process. When it comes to analyzing and predicting future behavior, humans do well if the number of variables and the volume of data to be manipulated are comparatively low, but in most business organizations we have to deal with thousands of products and various parameters (e.g. department stores, textile shops). An integrated prediction module can be very advantageous under such circumstances.
2. OBJECTIVES
2.1. Main Objectives
The main objective of this research is to create a smart and efficient invoice scanner and analyzer that automates the invoice handling process. This system will reduce labor cost and increase the efficiency of invoice handling. It is a user-friendly system that can be used by a business of any scale.
2.2. Specific Objectives
Process the computer generated invoices
To process computer-generated or printed invoices, the invoices need to be scanned with a scanner. After scanning, the details in the image are extracted using an OCR mechanism. To achieve maximum efficiency, we strive to extract data with high accuracy.
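Once OCR has produced plain text, the individual invoice details still have to be located in it. The sketch below shows one simple way this could be done with regular expressions. It assumes the OCR step has already run, and the field names, label wording ("Invoice No", "Date", "Total") and sample text are hypothetical; real invoices vary in layout, which is why the proposal also plans template training.

```python
import re

# Hypothetical label patterns for three common invoice fields.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*:?\s*(\d{1,2}/\d{1,2}/\d{4})", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(ocr_text):
    """Return a dict of field name to matched text, or None when a field is absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up text standing in for the output of the OCR mechanism.
sample_text = """ACME TRADERS
Invoice No: INV-001
Date: 05/03/2019
Paper        1,250.00
Toner        4,300.50
Total: 5,550.50
"""
fields = extract_fields(sample_text)
```

Returning None for a missing field, rather than guessing, lets a later step flag the document for manual review.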
Process the hand written invoices
Handwritten document processing does not always produce highly accurate and reliable results. The main goal is to extract data with an accuracy exceeding 70%. The extracted data will be formatted correctly and uploaded so that it is easy to manipulate in real time.
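The 70% target needs a concrete definition of accuracy before it can be tested. One possible definition, assumed here purely for illustration, is character-level accuracy: one minus the edit distance between the recognized text and a manually verified transcription, divided by the length of that transcription. The sample strings are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(recognized, ground_truth):
    """Fraction of the ground-truth characters that were recovered correctly."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(recognized, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


truth = "Total: 5550.50"
recognized = "Tota1: 5550.5O"   # "l" read as "1", final "0" read as "O"
accuracy = character_accuracy(recognized, truth)
meets_target = accuracy > 0.70
```

Other definitions, such as the share of invoice fields extracted without any error, are stricter and may be more meaningful for accounting data, so the chosen measure should be stated alongside any reported figure.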