Requirement Engineering and Analytics for Health Care
Abstract The use of analytics in the health care domain has gained importance in recent years. Many medical centers and hospitals are realizing the need to implement analytics in order to provide better health care. Given medical data, analytics can help in drawing useful predictions. The combination of requirement engineering with analytics is evolving because it takes the business perspective into consideration, which helps in understanding what the stakeholder expects.
This paper addresses the major challenge of dealing with missing data in the field of requirements engineering and analytics for health care and proposes a solution to this problem using a clustering technique.
Index Terms Analytics, requirement engineering, attributes, missing values, clustering
Requirements are stakeholders' needs and desires. Requirements engineering can be described as the process of understanding what the user expects from a product.
These requirements must be relevant to the scenario and as detailed as possible. Hence, requirement engineering is driven by determining the problem.
Analytics, on the other hand, helps us find a solution to the problem. Hence, by merging requirement engineering and analytics, we can find better solutions and tools to solve the problem.
When analytics is used together with requirement engineering, we can understand the exact problem that the stakeholder wants to address, how the data can be used to draw useful results, what the entire process will cost, and the time frame in which the task can be completed.
Data plays a major role in carrying out analytics. In health care analytics, the data takes the form of medical reports, generally known as EMRs (Electronic Medical Records) or EHRs (Electronic Health Records). EMRs and EHRs each have their own importance with respect to the data, and the major differences between them are stated in Table 1.
The data generated by the health care industry is increasing continuously. This data comes from hospitals, wellness centers, research laboratories, imaging centers, etc. Analytics on health data is important for clinical decision support, evidence-based medicine, and drug manufacturing. Clinical decision support (CDS) provides specific information that can be tailored to the needs of patients, clinicians, and others. This helps in making better decisions about health and in advancing health care. Evidence-based medicine (EBM) is an approach that relies on the recorded evidence of previous research to arrive at a well-informed and trusted decision or recommendation about a medicine.
Storing and analyzing this data will help in drawing useful insights and in improving the efficiency of decision making in health care systems. An efficient algorithm will help handle such large volumes of data effectively and provide better recommendations.
Hence, data plays a very important role in producing more reliable results and better predictions.
This paper focuses on the major challenge of missing data in the field of analytics, which persists even when analytics is combined with requirement engineering, and proposes a solution that handles this challenge by using a clustering algorithm.
When the value of a variable is missing, it is generally termed a missing value or missing observation. Missing values affect the conclusions that we try to draw from the data and are very common in many data collection fields. They present several problems. First, the absence of data reduces statistical power. Second, the lost data can cause bias in parameter estimation. Third, the representativeness of the data might be affected.
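To make the loss of usable data concrete, the following minimal sketch counts missing values in a small set of records. The records, the field names, and the helper functions `missing_fraction` and `complete_cases` are hypothetical and are used only for illustration; they are not taken from the dataset studied in this paper.

```python
# Hypothetical patient records; None marks a missing value.
records = [
    {"age": 34, "weight_kg": 70.5, "systolic_bp": 120},
    {"age": None, "weight_kg": 82.0, "systolic_bp": 135},
    {"age": 51, "weight_kg": None, "systolic_bp": 128},
    {"age": 45, "weight_kg": 64.0, "systolic_bp": None},
    {"age": 29, "weight_kg": 58.5, "systolic_bp": 110},
]


def missing_fraction(rows, field):
    """Fraction of rows in which `field` is missing."""
    return sum(1 for row in rows if row[field] is None) / len(rows)


def complete_cases(rows):
    """Rows that have a value for every field."""
    return [row for row in rows if all(v is not None for v in row.values())]


for field in records[0]:
    print(field, missing_fraction(records, field))
print(len(complete_cases(records)), "of", len(records), "rows are complete")
```

In this toy example each attribute is missing in only one of five records, yet only two of the five records are complete, so an analysis restricted to complete records would work with a much smaller sample.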
There are various reasons why data might go missing. In some cases, observations are not recorded, or a field is intentionally left blank because it is not mandatory. Any survey question left unanswered is a missing data point. Missing data might also result from unintentional human error. For example, a doctor might not record the results of a patient's blood samples because the report was lost. The process of storing data in databases might also result in missing data, because values may be ignored whenever a variable mismatch occurs. These missing values might seem small individually, but at scale they have a considerable impact.
Handling incomplete or missing data is the major challenge faced in making predictions. As the number of missing values in the data increases, it becomes very difficult to draw useful conclusions from the results. Even if predictions are drawn, they might not be the most reliable ones.
The problem of missing data has been addressed in a few research papers using different techniques. The most commonly used method is to replace the missing attribute values with zero in order to move forward with the analysis. However, this biases the analysis toward zero, that is, toward the missing values, which would lead to unreliable and unexpected results.
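The bias introduced by zero filling can be seen in a small worked example. The age values below are hypothetical and chosen only to illustrate the effect on a simple statistic such as the mean.

```python
from statistics import mean

# Hypothetical ages; None marks a missing value.
ages = [34, None, 51, None, 45]

# Mean over the values that were actually observed.
observed = [a for a in ages if a is not None]
observed_mean = mean(observed)

# Mean after replacing every missing value with zero.
zero_filled = [a if a is not None else 0 for a in ages]
zero_filled_mean = mean(zero_filled)

print(observed_mean)     # 130 / 3, about 43.3
print(zero_filled_mean)  # 130 / 5 = 26
```

The zero-filled mean is pulled from about 43.3 down to 26, even though no patient in the example is younger than 34.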
Another technique is to discard the column of data that has missing values. This is generally reported to be followed when the class label itself is missing or when many attributes are missing. However, this might lead to unexpected or untrustworthy results, since the performance of the recommendation system might depend on that attribute or column to a large extent. In the context of health care data, the age of the patient plays a major role in predicting the dosage of the drug to be prescribed; if the age column is ignored completely because of missing data, the prediction would not serve its actual purpose.
A few techniques use a global constant to fill in the missing values. Another technique is to use expectation-maximization (EM) to maximize the log-likelihood of the observed values and build a model. However, this approach must recompute values at every iteration of EM, which increases the computational cost and hence makes it less practical.
The solution proposed in this paper is to use the K-means clustering technique to find the missing values. In this clustering algorithm, k is the number of clusters to be formed for a given dataset. The term "means" in K-means refers to the average of the data points that belong to a cluster. Each data point is allocated to the cluster with the nearest centroid.
The general working of the K-means algorithm:
First, a group of randomly selected centroids is used as the initial centroids.
Then each data point is assigned to its nearest centroid.
Then the centroids are updated, and the data points are reassigned to the clusters.
This process continues until the centroids are stabilized or the number of iterations that have been defined is reached.
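The steps above can be sketched in plain Python as follows. This is a minimal illustration and not the implementation used for the experiments in this paper. The function names and the `init_centroids`, `max_iter`, and `seed` parameters are assumptions; `init_centroids` is added only so that the example can be reproduced without depending on the random initialization.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init_centroids=None, max_iter=100, seed=0):
    """Return (centroids, assignments) for a list of numeric points."""
    if init_centroids is None:
        # Step 1: use k randomly selected points as the initial centroids.
        init_centroids = random.Random(seed).sample(points, k)
    centroids = [list(c) for c in init_centroids]
    assignments = []
    for _ in range(max_iter):
        # Step 2: map every data point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 3: update each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop once the centroids have stabilized.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Hypothetical two-dimensional points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 0.5), (0.5, 1.5), (8.0, 8.0), (8.5, 7.5), (7.5, 8.5)]
centroids, assignments = kmeans(
    points, k=2, init_centroids=[(1.0, 1.0), (8.0, 8.0)]
)
print(centroids)    # [[1.0, 1.0], [8.0, 8.0]]
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

With random initialization, K-means can settle on different clusters from run to run, which is why the example fixes the starting centroids.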
In the proposed technique, we first take in the data and perform clustering on the dataset. The clustering technique used here is K-means. A data point with a missing value belongs to one of the clusters formed, and hence the missing value is filled with the average of all the other values of that attribute in the cluster. All the missing values are identified and filled in the same way. This would help in making better recommendations.
Fig 1: Sequence of steps followed in the solution proposed.
The dataset considered in this paper is the User Identification from Walking Activity Data Set.
We perform K-means clustering on the dataset. The result obtained can be observed in Fig 2:
Three clusters are formed, each represented with a different color.
The cluster centers obtained help in determining the missing values.
Fig 3: The cluster centers for the clusters that are formed
The missing values for an attribute in the dataset can be filled by taking the cluster center value corresponding to that attribute and substituting it for the missing value. Hence the data becomes more reliable.
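The filling step can be sketched as follows. The paper does not specify how a row with a missing value is placed in a cluster, so this sketch makes two assumptions: the clusters are learned from the complete rows only, and an incomplete row is assigned to the cluster whose center is nearest on the attributes it does have. The function names, the `init_centroids` parameter, and the sample rows are hypothetical.

```python
import random


def partial_distance(row, centroid):
    """Squared Euclidean distance over the attributes observed in `row`."""
    return sum((x - c) ** 2 for x, c in zip(row, centroid) if x is not None)


def kmeans(points, k, init_centroids=None, max_iter=100, seed=0):
    """Plain K-means on complete rows; returns the list of cluster centers."""
    if init_centroids is None:
        init_centroids = random.Random(seed).sample(points, k)
    centroids = [list(c) for c in init_centroids]
    for _ in range(max_iter):
        # Group the points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: partial_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each center as the per-attribute mean of its cluster.
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def impute_with_cluster_centers(rows, k, init_centroids=None):
    """Fill each None with the matching attribute of the nearest cluster center."""
    # Assumption: the clusters are learned from the complete rows only.
    complete = [row for row in rows if None not in row]
    centroids = kmeans(complete, k, init_centroids)
    filled = []
    for row in rows:
        # Assumption: an incomplete row joins the cluster whose center is
        # nearest when measured on its observed attributes.
        center = min(centroids, key=lambda c: partial_distance(row, c))
        filled.append([c if x is None else x for x, c in zip(row, center)])
    return filled


# Hypothetical rows; the last two each have one missing attribute.
rows = [
    [1.0, 1.0], [1.5, 0.5], [0.5, 1.5],
    [8.0, 8.0], [8.5, 7.5], [7.5, 8.5],
    [1.0, None], [None, 8.0],
]
filled = impute_with_cluster_centers(
    rows, k=2, init_centroids=[[1.0, 1.0], [8.0, 8.0]]
)
print(filled[6])  # [1.0, 1.0]
print(filled[7])  # [8.0, 8.0]
```

Because a cluster center is the per-attribute mean of the points in the cluster, filling with the center value is the same as filling with the cluster average described earlier. Complete rows pass through unchanged.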
Data plays a major role in recommendation systems, and when the data is not captured correctly, the results might be inappropriate. Handling missing values is a major concern for data. This problem can be handled by using a clustering technique and filling in the missing values with the cluster average.
I would like to thank Professor Nan Niu for all the guidance and support that he has given throughout the process of preparing this paper. I would also like to thank the University of Cincinnati for giving me this opportunity.