1. Terms of Reference
This report presents an analysis of American house prices, carried out to identify which factors influence the price. It is submitted as my project for the Essential Data Analysis module on the Business Studies Programme.
2. Executive Summary
The data were analysed using Minitab version 14, a program that makes large data sets faster and easier to analyse. A graph was created in Minitab for each point requested. To make each graph easier to interpret, it is accompanied by a table of the most relevant statistics. This makes the report easier to read and allows two or more variables to be compared efficiently. Correlation and regression analysis were applied to establish the relationship between price and size, and between price and the distance to the nearest large town.
The sample data set contains data on 100 houses in America from 5 different townships, numbered from 1 to 5. Each house is described by its price, its size, its number of bedrooms and bathrooms, whether or not it has a pool and a garage, its distance from the nearest large town, its desirability (on a scale from 1 = very undesirable to 7 = most desirable), the township it belongs to and its age.
The aim of this report is to assess the distribution of house prices in the 5 American townships used as the sample. A conclusion summarises the findings, interpretations and explanations, and is followed by suitable suggestions. The report should give an investor a clearer picture of which factors to take into consideration before buying a house.
3. Introduction
Houses and other properties are fixed assets whose value has shown an increasing trend. House prices in America have tended to rise with inflation and to gain further value over time, and the same applies to most house values worldwide. As the value of houses tends to increase over time, many people have decided to invest in property. However, several factors should be considered before buying a house, in order to judge whether it is worth the asking price and whether it will keep and increase its value.
4. Statistical Analysis: Findings
The findings of the report have been divided into 3 parts:
a. The overall distribution of the house prices in the survey; this takes into account all the house prices within the 5 townships without distinguishing by any other factor, such as size or the number of bedrooms and bathrooms.
b. An examination of the proportion of houses with a pool. This proportion was then investigated in relation to the presence of a garage and across the 5 townships.
c. An investigation of possible factors affecting the price, such as the presence of a pool, the size of the house, its desirability and the distance to the nearest large town.
4.1 – Overall Distribution of the house price
Looking at Graph 1, the overall distribution appears symmetrical. This is confirmed by comparing the mean with the median: as the two figures have approximately the same value, the distribution is roughly symmetrical. The mean is the sum of all values divided by the number of values in the data set, 100.
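The comparison of mean and median used here can be written as a simple rule. The sketch below is illustrative only: the function name, the 2% tolerance and the sample values are assumptions made for this example and are not taken from the data set.

```python
from statistics import mean, median

def describe_shape(values, tolerance=0.02):
    """Classify a distribution by comparing its mean with its median.

    tolerance is an assumed threshold: a relative gap below it is
    treated as "roughly symmetrical". Values are assumed to be positive.
    """
    m, med = mean(values), median(values)
    gap = (m - med) / med
    if abs(gap) <= tolerance:
        return "roughly symmetrical"
    # Mean above median: a few high values pull the mean up (right skew).
    # Mean below median: a few low values pull the mean down (left skew).
    return "skewed to the right" if gap > 0 else "skewed to the left"

# Hypothetical prices in thousands of dollars, NOT taken from the report.
balanced = [150, 170, 190, 210, 230]
low_tail = [90, 180, 190, 200, 210]

print(describe_shape(balanced))
print(describe_shape(low_tail))
```

The tolerance is a judgment call; a stricter or looser threshold may suit other data sets.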
The house prices range from a minimum of $127,70 to a maximum of $284,00, a range of $156,30. 25% of the houses have a price between the minimum of $127,70 and the first quartile, $179,93, and 25% have a price between the third quartile, $221,15, and the maximum of $284,00.
The graph clearly shows a higher concentration of houses with a price between Q1 ($179,93) and Q3 ($221,15). These represent 50% of the overall distribution.
The value of the standard deviation indicates how widely the data are spread around the mean.
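As a quick check of the spread figures, the range and the inter-quartile range can be recomputed from the minimum, quartiles and maximum quoted in this section. This is a minimal sketch; the variable names are chosen for this example only.

```python
# Figures quoted in the report, in thousands of dollars.
# The report writes decimals with a comma (127,70); Python needs a period.
minimum, q1, q3, maximum = 127.70, 179.93, 221.15, 284.00

price_range = round(maximum - minimum, 2)   # spread of all 100 houses
iqr = round(q3 - q1, 2)                     # spread of the middle 50% of houses

print(f"Range: {price_range:.2f}")
print(f"Inter-quartile range: {iqr:.2f}")
```

The recomputed range agrees with the $156,30 quoted above, and the inter-quartile range shows that the middle half of the houses sits within a much narrower band than the full range.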
4.2 – Examination of house with a pool
As shown in Graph 2, 55% of the houses (55 out of the 100 houses in the data set) have a pool. In the Minitab output the percentage equals the count because the sample contains exactly 100 houses. Consequently, 45% of the houses analysed do not have a pool.
Graph 3 shows the proportion of houses with a pool and a garage. The table makes clear that the majority of the houses with a pool also have a garage: 58,18% (32 out of 55 houses with a pool), while 41,82% (23 out of 55) do not have a garage.
For houses without a pool, the proportion without a garage is higher than for houses with a pool: 82,22% (37 out of 45 houses) have neither a pool nor a garage.
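The percentages above are row percentages of a two-way table of pool against garage. The following sketch rebuilds them from the counts quoted in the text. The count of houses with a garage but no pool is not quoted directly and is derived here as 45 - 37; the function and variable names are assumptions made for this example.

```python
# Counts quoted in the report. The "no pool, garage" cell (8) is not quoted
# directly; it is derived here as 45 - 37.
counts = {
    "pool": {"garage": 32, "no garage": 23},
    "no pool": {"garage": 45 - 37, "no garage": 37},
}

def row_percentages(row):
    """Return each cell as a percentage of its row total, rounded to 2 places."""
    total = sum(row.values())
    return {label: round(100 * n / total, 2) for label, n in row.items()}

for group, row in counts.items():
    print(group, sum(row.values()), row_percentages(row))
```

Each percentage uses the number of houses in its own pool group as the denominator (55 or 45), not the full sample of 100.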
It is evident from Graph 4 that the proportion of the houses with a pool is not the same in all the 5 townships.
In township 5, all the houses (100%) have a pool, followed by township 4 with 94,4% of the houses. At the other extreme, only 13,33% of the houses in township 1 (2 out of 15) have a pool, followed by township 2 with 22,22% (6 out of 27 houses). As Table 4 shows, the proportion of houses with a pool rises with the township number: township 1 has the lowest percentage and township 5 the highest. This could be a coincidence.
However, in the overall distribution, township 4 accounts for the largest share of the houses with a pool: 32,73% (18 out of the 55 houses with a pool).
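The two kinds of percentage in this section use different denominators, which is easy to overlook. The sketch below recomputes the quoted figures to make the difference explicit. Only counts stated in the text are used, and the names are assumptions made for this example.

```python
# Counts quoted in the report: (houses with a pool, houses in the township).
# Townships 3, 4 and 5 are left out of this dictionary because their
# totals are not quoted in the text.
pool_by_township = {1: (2, 15), 2: (6, 27)}
total_with_pool = 55          # all houses with a pool, across townships
pools_in_township_4 = 18

def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 places."""
    return round(100 * part / whole, 2)

# Denominator = houses in the township (the 13,33% and 22,22% figures).
within = {t: pct(p, n) for t, (p, n) in pool_by_township.items()}

# Denominator = all houses with a pool (the 32,73% figure).
share_of_all_pools = pct(pools_in_township_4, total_with_pool)

print(within)
print(share_of_all_pools)
```

The first calculation answers "how common is a pool within a township", while the second answers "where are the houses with a pool located".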
4.3 – Investigation of Factors affecting the house price
The box plot clearly shows that prices for houses with a pool are higher overall than for houses without a pool. Comparing the data in Table 5, all the values describing the distribution (mean, median, minimum, 1st quartile, 3rd quartile and maximum) are higher for houses with a pool. This suggests that houses with a pool are generally more expensive than houses without. Moreover, comparing the mean and the median for both groups shows that the distribution for houses without a pool is skewed to the left (negatively skewed). This indicates that a few extremely low values pull down the mean. However, the “*” indicates that there is also an extremely high value of $250,20. Comparing the houses with a pool in the same way, the distribution is roughly symmetrical because the mean and the median are very close.
Another important consideration is given by the quartiles, which the graph represents as the lower and upper limits of the boxes. The 1st quartile of the houses with a pool ($195,90) is higher than the 3rd quartile of houses without a pool ($192,05). This implies that 75% of houses without a pool are priced below the level that 75% of houses with a pool exceed.
The standard deviation measures how spread out the data set is. The houses with a pool have a higher standard deviation, which implies that their prices are more variable, with values further from each other and from the mean, while prices are slightly more concentrated for the houses without a pool. The range and inter-quartile range, considered together with the standard deviation, confirm that prices for houses with a pool are more dispersed than prices for houses without a pool.
The scatter plot in Graph 6 indicates that there is a relationship between the house price and the size of the house. The upward trend indicates a positive linear relationship, as both variables move in the same direction: when the size rises, the price rises as well. It is therefore worth investigating the relationship further.
However, the points are scattered quite broadly, so it is necessary to analyse the value of r to determine how strong the relationship is. The correlation coefficient (0,65) indicates a positive relationship (given by the + sign) that is moderate rather than strong, as the value is lower than 0,8.
The regression equation is Price = -11,1 + 0,0979 * sqrFt
However, the intercept is not statistically significant, as shown by its T value of -0,44; it also has no practical meaning, because a house price cannot be negative. In spite of this, the model is still useful because the gradient (or slope) is statistically significant, with T = 8,46. The slope indicates an increase of $0,0979 for each extra square foot; as prices are expressed in thousands of dollars, this is about 97,90 dollars per square foot.
The value of R-Sq suggests that only 42,2% of the variation in house prices is explained by size. This implies that other factors account for the remaining variation in price.
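To show how the regression equation is read in practice, the sketch below applies the quoted coefficients to one house size and converts the slope into dollars. The size of 2000 square feet is a hypothetical input chosen for illustration, and the names are assumptions made for this example.

```python
# Coefficients quoted in the report; price is in thousands of dollars.
INTERCEPT = -11.1
SLOPE = 0.0979        # thousands of dollars per square foot
R = 0.65              # correlation coefficient quoted in the report

def predicted_price(square_feet):
    """Fitted price (thousands of dollars) from the quoted regression line."""
    return INTERCEPT + SLOPE * square_feet

slope_in_dollars = SLOPE * 1000   # dollars per extra square foot
r_squared = R ** 2                # share of price variation explained by size

# 2000 square feet is a hypothetical size, not a house from the data set.
print(round(predicted_price(2000), 1))
print(round(slope_in_dollars, 1))
print(round(r_squared, 4))
```

Squaring the rounded correlation coefficient gives about 0.42, which agrees with the R-Sq of 42,2% up to the rounding of r. The fitted price is only an estimate, since more than half of the variation in price is not explained by size.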
By eye it is also possible to estimate that houses with a square footage between 1900 sqrFt and 2300 sqrFt are the most frequent.
However, it is important to note that this graph considers houses across all 5 townships, with or without a pool and with different numbers of bedrooms and bathrooms.
The scatter plot shows the relationship between the house price and the distance to the nearest large town.
It clearly illustrates that there is no relationship between the two variables.
This is confirmed by the correlation coefficient of 0,042. Moreover, as the R-Sq value shows, only 0,2% of the variation in house price is explained by the distance.
It is not necessary to continue this investigation any further.
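The decision to stop can be supported with the standard t statistic for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below assumes that all 100 houses were used for both correlations and takes 1.98 as an approximate two-sided 5% critical value for 98 degrees of freedom; both are assumptions of this example rather than figures from the Minitab output.

```python
from math import sqrt

N_HOUSES = 100   # sample size quoted in the report

def correlation_t(r, n=N_HOUSES):
    """t statistic for testing whether a correlation r differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

t_distance = correlation_t(0.042)   # price vs distance
t_size = correlation_t(0.65)        # price vs size, for comparison

# Approximate two-sided 5% critical value for 98 degrees of freedom.
CRITICAL_T = 1.98

print(round(t_distance, 2), abs(t_distance) > CRITICAL_T)
print(round(t_size, 2), abs(t_size) > CRITICAL_T)
```

The statistic for distance falls far below the critical value, while the statistic for size is close to the T = 8,46 reported for the slope; the small gap comes from using the rounded value of r.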
The following is a summary based on the findings:
1. The overall price distribution is roughly symmetrical, and half of the houses (50%) are concentrated between $179,93 (Q1) and $221,15 (Q3). (Graph 1 – Table 1)
2. The proportion of houses with a pool is slightly higher than the proportion without: 55% against 45%. (Graph 2 – Table 2)
3. The majority of houses with a pool also have a garage, but the largest group of houses has neither. (Graph 3 – Table 2)
4. The percentage of houses with a pool increases with the township number, with township 1 having a minority of houses with a pool and township 5 having 100%. In 3 out of the 5 townships, houses with a pool outnumber houses without. (Graph 4 – Table 4)
5. Houses with a pool are more expensive than houses without. 75% of houses without a pool are priced below the first quartile of houses with a pool. (Graph 5 – Table 5)
6. There is a positive relationship between the price and the size of the house, although it is not very strong. For each extra square foot the price rises by $0,0979. There is a higher concentration of houses with a square footage between 1900 sqrFt and 2300 sqrFt. (Graph 6 – Table 6)
7. There is a link between the price and the desirability of a house. However, this relationship is not very strong. (Graph 7 – Table 7)
8. The distance between the house and the nearest large town shows no relationship with the price. (Graph 8 – Table 8)
Based on the above conclusions of the analysis, the following are suggestions for an investor interested in buying a house in one of the 5 townships:
1. The most popular, and thus most in-demand, price for a house is between $179,93 and $221,15. For a luxury house the highest demand would be between $245 and $275. Above this amount the demand is very low, which implies that such houses are very exclusive. The choice depends on the main aim of the investor.
2. There is a slightly higher demand for houses with a pool.
3. If the investor decides to buy a house with a pool, it is advisable that it has a garage as well. Otherwise a house with neither is preferable.
4. If the house is in townships 3 to 5, it is highly recommended that it has a pool, especially in township 5.
5. A pool makes a large difference to the value of the house: about 75% of houses with a pool are priced above the level below which 75% of houses without a pool fall.
6. The bigger the house, the higher its value. However, houses smaller than 1900 sqrFt are not in high demand, and there is a medium demand for larger houses.
7. Houses with a desirability score of 6 have the highest mean and median price and are in good demand.
8. The distance between the house and a big city is not relevant.
All the figures used to refer to the price are expressed as thousands of dollars ($ ,000).
To determine the demand, it has been assumed that a higher frequency indicates a higher demand. For example, in township 5 all the houses have a pool. It implies that every