The aim of this project is to find out which factors affect the selling price of a house. I have been given four districts, each containing four streets. In each of the 16 streets there are a number of houses, and these factors affect each one differently, which results in different house prices. In the data presented I have found several rogue results (results that do not fit in with the rest of the results). If these were kept, the results would be biased, so these rogue items must be removed. The rogue items were £475,000 in house price, 13,000 in square feet and 20 in number of bedrooms; they are too large to fit in with the other data. The ranges (highest minus lowest values) were £161,800 in house price, 2,900 in square feet and 6 in number of bedrooms. 27% of the homes had a large garden, 62.5% had a small one and 10% had none. For garages, 68% had a garage and 32% did not.

In order to get an idea of the nature of the data I have been given, I will divide Price into suitable groups and draw a histogram. Then I will make a cumulative frequency table and draw a cumulative frequency curve. I will then state the median and the inter-quartile range for the price, as these are not distorted by outliers.

Out of the fields given, some affect the price of the houses.
The fields that will affect the house price are whether the house has a garden, whether it has a garage, the number of bedrooms and the area (square ft) of the house. The field that will not affect house price is the house number.

The fields in order of importance are: number of bedrooms, area (sq ft) of the house, whether the house has a garden, whether the house has a garage and finally house number, which has no importance at all.

My main hypothesis is that "Price will be affected by some factors".

Plan

I am going to investigate the correlation between price and number of bedrooms in one street of one district (Arlington, Primrose Street), so that all other variables remain roughly constant.

Then I will take a sample to make the data more manageable. It will contain equal numbers of houses from each of the four districts, so that it is not biased towards any district or street. After this I will draw a histogram and a cumulative frequency curve for both the population and the sample, and then compare my results.

I will then draw box plots of the price for each district to see if my data is valid, as this could affect my observations.

After this I shall draw several scatter graphs and analyse the correlation. Each will be titled and have a line of best fit with its equation shown.

First Hypothesis

I expect that as the number of bedrooms increases, the price will increase accordingly, so the correlation will be positive.

The reason for taking one district and street is that houses in the same street are roughly the same price, so the variables other than number of bedrooms are kept roughly constant.

Testing the Hypothesis

Sampling

To make this unbiased, I first picked a number from 1 to 5 at random. If I had simply started at number 1 every time, the houses at the end of the list could never be included in the sample. I picked from 1 to 5 because I can then count out 10 houses from each starting point. I then took every 5th value.
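The systematic sampling described above can be sketched in Python. This is only an illustration under assumptions: the function name systematic_sample and the list of house positions 1 to 50 are hypothetical stand-ins for one district, not the actual data.

```python
import random


def systematic_sample(items, step=5, start=None):
    """Take every `step`-th item, beginning at a random position from 1 to `step`.

    `start` is 1-based; if it is not given, it is chosen at random,
    like picking a number from 1 to 5 before counting.
    """
    if start is None:
        start = random.randint(1, step)  # randint includes both end points
    # Convert the 1-based start to a 0-based index, then step through the list.
    return items[start - 1::step]


# Hypothetical stand-in for one district: positions 1..50 in the house list.
district = list(range(1, 51))

sample = systematic_sample(district, step=5)
print(len(sample))  # 10 houses, whichever starting number was picked
```

Whichever starting number from 1 to 5 is picked, a list of 50 houses gives exactly 10 houses, and every position in the list has a chance of being chosen.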
Next I had to determine the size of my sample using stratified sampling. I chose 40, as it is easy to work with and allowed me to take 10 houses from each district.

(x ÷ 200) × 40

With this equation I can keep the same proportion in the sample as in the population. In this case x is the number of houses in a district. For example, a district of 50 houses gives (50 ÷ 200) × 40 = 10.

Rather than picking just any 10 houses from each district, which could be biased, I used systematic sampling and chose every 5th data entry.

Using the Sample

I drew a histogram for the sample and compared it with the histogram for the whole population. I then drew a cumulative frequency curve of the sample and compared it with the cumulative frequency curve of the whole population.

Below is a table of median, lower quartile, upper quartile and inter-quartile range.

CF Graph of the   Median    LQ        UQ         Inter-Quartile Range
Population        £68,000   £44,000   £108,000   £64,000
Sample            £76,000   £44,000   £128,000   £84,000

The modal class is the class in which the highest frequency occurs.

From looking at the histograms it is clear that there is a definite modal class in both: the £40,000 to £50,000 class has the highest frequency in each. The lowest frequencies in both are in the £160,000 to £200,000 bar. The inter-quartile range of the population is lower than that of the sample, which shows that the spread is greater in the sample. The greater spread is because the sample has a higher upper quartile (£128,000 against £108,000) while the lower quartiles are the same. The shapes of the sample and population histograms are very similar, suggesting that the sample is a good representation of the population.

I then drew box plots for the price. I found that in Arlington the median was £129,800, so on average the house prices in that district are dearer. In Castlemains there is an even distribution of mid-priced houses. In Tobermory there are more expensive houses than cheaper ones.
Also in Westlake most of the houses are cheap compared to the others. If the median is closer to the LQ the distribution is positively skewed; in my box plots Tobermory and Castlemains are positively skewed. This means the values above the median are more spread out than those below the median. If the median is closer to the UQ it is negatively skewed. Westlake and Arlington are negatively skewed in this case, which means the values below the median are more spread out.

The gradient of the line of best fit shows how much the price rises for each extra bedroom. In Arlington's case:

Price = 18650 × number of bedrooms + 51900

How reliable an answer from this equation is depends on how closely the points follow the line, which the R squared value measures, rather than on the gradient alone.

I calculated the IQR of each district and found that Arlington had an IQR of £21,750, which shows a greater spread in prices. This compares with Westlake, which had an IQR of £10,675: the smaller the IQR, the narrower the spread of prices in the district. Castlemains had an IQR of £15,300 and Tobermory an IQR of £18,475; again, from this we can establish a pattern.

Looking back at my scatter graphs, I can use the R squared value to judge the strength of the correlation. The strongest positive correlation is for Arlington, with an R squared value of 0.9272. The weakest positive correlation is for the Castlemains district.

Using the equation of the line I was able to estimate the missing data. Here are two examples of how I did this.

Arlington (house number 154)
y = 18650x + 51900
122400 = 18650x + 51900
70500 = 18650x (subtracting 51900 from both sides)
x = 3.78
x ≈ 4

Castlemains (house number 76)
y = 17379x + 11107
y = 17379(3) + 11107
y = £63,244

The missing data, once filled in, are:

District      Street       House Number   No. of Bedrooms   House Price
Arlington     Cherrytree   154            4                 £122,400
Arlington     Oakhill      84             5                 £146,650
Castlemains   Deerpark     76             3                 £63,200
Castlemains   Highfield    27             4                 £80,600
Tobermory     Archvale     51             2                 £38,600
Tobermory     Ballyrae     19             3                 £50,500
Tobermory     Parkmore     45             3                 £48,000
Westlake      Maltby       61             3                 £41,500

I rounded these values to the nearest £100 because the R squared value is not 1, so the estimates are not exact.

I compared my missing data with the trends of the houses in my sample. All of my graphs had positive correlation and a high R squared value. However, the estimates cannot be perfect, as the R squared value is not exactly 1 and the number of bedrooms is discrete data. There is also some data that breaks the trend: for example, 49 Millrow, Tobermory has an area of 40900 sq ft and 3 bedrooms and costs £48,000, whereas 17b Ballyrae, Tobermory has 2 bedrooms and an area of 45150 sq ft and costs £50,500. So the number of bedrooms is not the over-ruling factor for all of these houses.

From this I can conclude that my first hypothesis has been confirmed, as the results fit the pattern: "Number of Bedrooms does affect the Price of Houses".

Conclusion

I found that the number of bedrooms in a house does affect the overall price. I drew various graphs, including scatter graphs, box plots and cumulative frequency curves. These all support my hypothesis, so I believe it to be correct that "some factors do affect house price".

I was given a population of 200 houses in 4 districts, with 50 houses in each district. There were a few limitations of the data: parts of it were incomplete (missing), and some more information would have been helpful, e.g. the age of each house and where it is situated (as we do not even know if these houses are in the same town). It was not appropriate to use the whole range of data, as some of it was rogue.

My sampling technique was appropriate as it was easy to work with, but the sample was small; if I were to repeat my sampling I would take 60 values instead of 40.
The population was represented well by the sample.

My overall strategy was effective. My only criticisms are that I could have drawn both histograms on the same graph, and both cumulative frequency curves on the same sheet. They would then have been easier to compare with one another using the % frequency density and cumulative frequency values. I did address the problem I had hoped to. The limitations were that I could not draw box plots for the number of bedrooms as it is discrete data; instead of a box plot I could have drawn a pie chart or bar chart to represent the number of bedrooms. Also, we do not know about the condition of each house or its interior. Modifications could have been made to the houses, such as extensions, central heating, double glazing or a loft conversion.

If I had more time I would investigate the square footage of the house using the same sample and technique I employed earlier. After that I would investigate whether a garage also affects the price of a house.

Any house that breaks the trend could do so because of its condition or modifications made to its interior or exterior. More data would have been helpful to give a clearer overall picture.
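The line-of-best-fit calculations used earlier to estimate the missing data can be sketched in Python. This is only an illustrative sketch: the function names and the ARLINGTON and CASTLEMAINS constants are hypothetical labels of my own, while the gradients and intercepts are the ones quoted for those two districts.

```python
def price_from_bedrooms(bedrooms, gradient, intercept):
    """Estimate a price from the line of best fit: y = gradient * x + intercept."""
    return gradient * bedrooms + intercept


def bedrooms_from_price(price, gradient, intercept):
    """Rearrange the same line to estimate bedrooms: x = (y - intercept) / gradient."""
    return (price - intercept) / gradient


def round_to_nearest_100(value):
    """Round a positive amount to the nearest 100 (halves round up)."""
    return int(value / 100 + 0.5) * 100


# (gradient, intercept) pairs taken from the scatter graph equations.
ARLINGTON = (18650, 51900)
CASTLEMAINS = (17379, 11107)

# Arlington, house number 154: price known, bedrooms missing.
bedrooms = bedrooms_from_price(122400, *ARLINGTON)
print(round(bedrooms, 2), round(bedrooms))  # 3.78 4

# Castlemains, house number 76: 3 bedrooms known, price missing.
price = price_from_bedrooms(3, *CASTLEMAINS)
print(price, round_to_nearest_100(price))  # 63244 63200
```

The number of bedrooms has to be rounded to a whole number because it is discrete data, and the price is rounded to the nearest £100 because the line only gives an estimate.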