I have chosen to do this statistical coursework that uses data from ‘Mayfield High School.’ Although this is a fictitious school, the data is based on a real school. As the data has been collected for me, it is called secondary data.
I believe that this coursework will allow me to illustrate my ability to handle data, use specific techniques and apply higher level statistical maths by being able to use a variety of methods in order to analyse and compare sets of data.
During this project I will be examining the relationships between the attributes of the pupils of Mayfield High School. My aim is to produce a line of enquiry which has two or more statistics regarding the pupils which are related to each other.
This table shows how many boys and girls there are in each year group at Mayfield High.
| Year Group | Number of Boys | Number of Girls | Total |
|---|---|---|---|
| 7 | 150 | 150 | 300 |
| 8 | 145 | 125 | 270 |
| 9 | 120 | 140 | 260 |
| 10 | 100 | 100 | 200 |
| 11 | 84 | 86 | 170 |
The total number of students at the school is 1200.
Data is provided for each pupil in the following categories:
* Name
* Age
* Year Group
* IQ
* Weight
* Height
* Hair colour
* Eye colour
* Shoe size
* Distance from home to school
* Usual method of travel to school
* Number of brothers or sisters
* Key stage 2 & 3 results in English, Mathematics and Science
From the abovementioned, I need to pick several types of data to base my investigation on.
However, I have decided to pick only two (at the maximum 3) pieces of data, as time is a limiting factor in this coursework. When deciding my data categories, there are a few things that I need to bear in mind.
I need to use quantitative data, so I am able to apply all higher level statistical maths to my results. I also need to make sure that the data I choose are closely related, so I can analyse my results thoroughly.
There are several lines of enquiry at this point that I may wish to follow up. These are:
* The relationship between IQ and Key stage 3 English results
* The relationship between height and weight
* The relationship between shoe size and height
Through basic observations of the people in my surroundings, I believe that there may be a strong relationship between a person’s height and weight, not only with people in general, but between separate genders. However, I also feel that age is an affecting factor, and intend to look into that later on in the coursework. I have made this decision based on the fact that each of these pieces of data is interrelated and they are continuous (quantitative).
As previously stated, my line of enquiry will be the relationship between height and weight (with the introduction of age). I have several hypotheses related to this investigation:
* Boys will be taller than girls
* As height increases, so does weight
* Girls are heavier than boys
However, you must also take into consideration that relationships will be different when genders are treated separately.
Collecting data on every pupil in the school would take too much time and energy, so I will take a sample instead. Because time is a limiting factor, sampling will help me a great deal. It is important to choose the sample without bias so that the results represent the whole population. There are many types of sampling, and I now need to find out which type suits my investigation best.
Random Sampling
In a random sample, every member of the population has an equal chance of being selected.
* Advantages: Every member of the population has an equal chance of being selected.
* Disadvantages: Due to its unpredictability, it can sometimes produce a sample that is not representative of the population, and this may be difficult to spot. For our purposes, there is no guarantee of the same number from each year or of equal numbers of both genders.
Systematic Sampling
In a systematic sample, every member of the sample is chosen at regular intervals from the list.
* Advantages: Can eliminate some sources of bias
* Disadvantages: Can introduce bias where the pattern used for the samples coincides with a pattern in the population. For our purposes, it guarantees a representative sample of year groups but not of gender.
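As a rough illustration of systematic sampling, the sketch below picks every k-th pupil from a list after a random starting point. The numbered register and the function name `systematic_sample` are made up for this example and are not part of the Mayfield data.

```python
import random

def systematic_sample(population, sample_size, rng=None):
    """Pick items at a regular interval after a random start (illustrative helper)."""
    rng = rng or random.Random()
    interval = len(population) // sample_size   # gap between chosen items
    if interval < 1:
        raise ValueError("sample_size is larger than the population")
    start = rng.randrange(interval)             # random start within the first interval
    return [population[start + i * interval] for i in range(sample_size)]

# Hypothetical register of 600 pupils, numbered 1 to 600
register = list(range(1, 601))
sample = systematic_sample(register, 54, random.Random(1))
print(len(sample), sample[:5])
```

If the register happened to be ordered in a repeating pattern (for example boy, girl, boy, girl), a regular interval could pick only one gender, which is the kind of bias described above.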
Stratified Sampling
A population may contain separate groups or strata. Each group needs to be fairly represented in the sample. The number from each group is proportional to the group size. The selection is then made at random from each group.
* This form of sampling will work well for our purposes
Quota Sampling
As with stratified samples, the population is broken down into different categories. However, the size of the sample of each category does not reflect the population as a whole. This can be used where an unrepresentative sample is desirable (e.g. you might want to interview more children than adults for a survey on computer games), or where it would be too difficult to undertake a stratified sample.
* Advantages: Simpler to undertake than a stratified sample. Sometimes a deliberately biased sample is desirable
* Disadvantages: Not a genuine random sample, and is likely to yield a biased result. For our purposes it is not very reliable because it depends on the interviewer to choose the sample
Cluster Sampling
Used when populations can be broken down into many different categories, or clusters (e.g. church parishes). Rather than taking a sample from each cluster, a random selection of clusters is chosen to represent the whole. Within each cluster, a random sample is taken.
* Advantages: Less expensive and time consuming than a fully random sample. Can show “regional” variations.
* Disadvantages: Not a genuine random sample. Likely to yield a biased result (especially if only a few clusters are sampled).
After looking at the advantages and disadvantages of each type of sampling, I have chosen to use stratified sampling, as this form of sampling will work well for our purposes. The reasons are stated above.
As I have now decided on my line of enquiry and type of sampling, I now need to decide how big my sample size will be. As different sizes of sample will affect the reliability of my results and conclusions, it is imperative that I make the correct choice when deciding the size of my sample.
The bigger a sample, the more useful the data will be. If you select a lot of people, your results will be closer to the actual results for the whole school. However, if you choose too many people the data becomes too difficult to analyse and takes too long to collate and sort. Between 5% and 10% is usually a fair representation of a population, so I have decided to use a 9% sample, which is 54 people. I think this will be a good representation of the population and is also a reasonable figure to manage.
When collecting my data, I need to check for outliers and anomalies: untypical values in my sample which appear to lie outside the general range (e.g. a weight of 1 kg or 600 kg, or a height of 0.01 m or 10 m). Once I present my results in a graph, it will be easy to see where any outlier lies.
If these outliers were included in my calculations or graphs they would distort the data, weaken the correlation shown in the graphs, and therefore affect my conclusion about whether or not my hypotheses are correct. This is why it is crucial that I disregard any information that is blatantly incorrect.
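A simple way to carry out this check is to compare each value with a plausible range. The sketch below is only an illustration: the limits (1.0 m to 2.2 m and 20 kg to 120 kg) are my own assumptions, the records are made up, and `is_outlier` is a hypothetical helper name.

```python
# Plausible limits for secondary-school pupils; these bounds are assumptions
# chosen for illustration, not values given in the Mayfield data.
HEIGHT_RANGE_M = (1.0, 2.2)
WEIGHT_RANGE_KG = (20, 120)

def is_outlier(height_m, weight_kg):
    """Return True if either measurement falls outside the assumed plausible range."""
    height_ok = HEIGHT_RANGE_M[0] <= height_m <= HEIGHT_RANGE_M[1]
    weight_ok = WEIGHT_RANGE_KG[0] <= weight_kg <= WEIGHT_RANGE_KG[1]
    return not (height_ok and weight_ok)

# Made-up records: (height in m, weight in kg)
records = [(1.48, 44), (0.01, 45), (1.60, 600), (1.75, 63)]
kept = [r for r in records if not is_outlier(*r)]
rejected = [r for r in records if is_outlier(*r)]
print("kept:", kept)
print("rejected:", rejected)
```

A rejected record would then be replaced by selecting another pupil at random from the same year and gender.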
Sampling Method (In Detail)
In order to produce my results, I need to know how my sampling method works.
1. Count boys and girls per year group
2. Work out sample size
3. Find the fraction of pupils in each year
4. Find how many people there are in each year out of 54 (9% sample)
5. Use same method to calculate amount of girls and boys in each year for sample
6. Use random sampling to choose correct number of boys and girls per year group and enter results in tables
7. Identify any anomalous data/outliers and reselect those data items
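Step 6 can be sketched in code as follows. This is a minimal example, assuming each pupil in a year and gender group has an ID number; the ID lists and the helper name `pick_pupils` are hypothetical, and only Year 7 (75 boys and 75 girls, 7 of each to be chosen) is shown.

```python
import random

def pick_pupils(strata, quotas, seed=None):
    """Randomly choose quotas[key] pupils from each stratum without replacement."""
    rng = random.Random(seed)
    return {key: sorted(rng.sample(pupils, quotas[key]))
            for key, pupils in strata.items()}

# Hypothetical ID numbers standing in for the real pupil lists
strata = {
    ("Year 7", "boys"): list(range(1, 76)),    # 75 boys
    ("Year 7", "girls"): list(range(1, 76)),   # 75 girls
}
quotas = {("Year 7", "boys"): 7, ("Year 7", "girls"): 7}

chosen = pick_pupils(strata, quotas, seed=42)
for key, ids in chosen.items():
    print(key, ids)
```

Because the choice inside each group is random, no pupil is favoured, while the quotas keep each year and gender fairly represented.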
Mathematical Techniques
In order to thoroughly analyze and evaluate my data, there are many mathematical techniques, diagrams and graphs I will need to use. Here is a list of them:
Diagrams:
1. Histograms – A histogram is constructed from a frequency table. The intervals are shown on the X-axis and the number of scores in each interval is represented by the height of a rectangle located above the interval.
2. Box Plots – A box plot provides an excellent visual summary of many important aspects of a distribution. The box stretches from the lower quartile to the upper quartile and therefore contains the middle half of the scores in the distribution. The median is shown as a line across the box. Therefore 1/4 of the distribution is between this line and the top of the box and 1/4 of the distribution is between this line and the bottom of the box.
3. Scatter Diagram – A type of diagram used to show the relationship between data items that have two numeric properties. One property is represented along the x-axis and the other along the y-axis. Each item is then represented by a single point.
4. Cumulative Frequency Graphs – A cumulative frequency graph can be used to estimate some useful statistical measures.
5. Line Of Best Fit – Single line drawn through a series of data points as a best representation of the underlying trend. Can be a straight line or a curve.
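One common way to find a straight line of best fit is the least-squares method. This coursework does not say which method will be used, so the sketch below is only an illustration, with made-up points and a hypothetical function name `least_squares_line`.

```python
def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least-squares straight line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x   # the line passes through the mean point
    return gradient, intercept

# Made-up points: heights in m, weights in kg
heights = [1.50, 1.60, 1.70, 1.80]
weights = [42, 50, 55, 64]
m, c = least_squares_line(heights, weights)
print(f"weight = {m:.1f} x height + {c:.1f}")
```

The line always passes through the point (mean height, mean weight), which is also a useful check when drawing a line of best fit by eye.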
Calculations:
1. Mean
2. Mode
3. Median
4. Mean & Modal Class for Grouped Continuous Data – This calculates the mean for grouped continuous data.
5. Interquartile Range – The distance between the upper and lower quartiles. As a measure of variability, it is less sensitive than the standard deviation or range to the possible presence of outliers. It is also used to define the box in a box-and-whisker plot.
6. Standard Deviation – It is the most commonly used measure of spread.
7. Normal distribution – Normal distributions are a family of distributions that have the same general shape. They are symmetric with scores more concentrated in the middle than in the tails. Normal distributions are sometimes described as bell shaped.
8. Spearman’s Rank Correlation Coefficient – The Spearman’s Rank Correlation Coefficient is used to discover the strength of a link between two sets of data.
9. Equation of Line of Best Fit – The equation of the line that shows the underlying trend.
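Several of these calculations can be sketched with Python's built-in `statistics` module. The example below uses the weights of the seven Year 7 boys and the heights and weights of the four Year 11 boys from the sample tables in this coursework. The choice of the population standard deviation and of the module's default quartile convention are assumptions, and the `ranks` and `spearman` helpers are hypothetical names that only work when there are no tied values.

```python
import statistics

# Weights (kg) of the seven Year 7 boys in the sample table of this coursework
weights = [44, 52, 43, 45, 43, 40, 45]

mean = statistics.mean(weights)
median = statistics.median(weights)
modes = statistics.multimode(weights)            # all most-common values
spread = statistics.pstdev(weights)              # population standard deviation
q1, _, q3 = statistics.quantiles(weights, n=4)   # quartiles (one of several conventions)
iqr = q3 - q1

def ranks(values):
    """Rank values from smallest (1) upwards; assumes there are no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation: 1 - 6*sum(d^2) / (n(n^2 - 1)), no ties assumed."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Heights (m) and weights (kg) of the four Year 11 boys in the sample
heights_y11 = [1.91, 1.62, 1.74, 2.00]
weights_y11 = [82, 56, 50, 86]
print(mean, median, modes, spread, iqr)
print(spearman(heights_y11, weights_y11))
```

A coefficient close to +1 suggests a strong positive link between the two rankings, a value close to -1 a strong negative link, and a value near 0 little or no link.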
Collecting the Data
In order to find my results, I will need to sort the data and put it into tables. As I am using stratified sampling, I have had to count up the number of boys and girls in each year and work out my sample size. Once I have done this, I will record my results in two separate tables (one for males, one for females), in year order. From there, I will create separate tables for each year and then one large mixed table. After I have finished sorting out the tables, I will draw various scatter diagrams: first one for males, one for females and one mixed, and then one for each year (for both mixed and separate genders).
Finding the Results
As I have previously stated, I have decided to use a sample size of 9%, which in total is 54 people. I now need to apply that information to the investigation and work out my sample for each year, gender etc.
Data:
| Year | Boys | Girls | Total |
|---|---|---|---|
| 7 | 75 | 75 | 150 |
| 8 | 65 | 70 | 135 |
| 9 | 62 | 68 | 130 |
| 10 | 51 | 49 | 100 |
| 11 | 41 | 44 | 85 |
| Total | | | 600 |

Sample size: 9% of 600 = 54
Now, I have to calculate how many pupils to examine within each year, because each year group varies in its total number of students. I will calculate the proportion of pupils from each of the year groups.
Stratified Sample:
| Year | Fraction of population | Number in sample (out of 54) | No. of Girls in Sample | No. of Boys in Sample |
|---|---|---|---|---|
| 7 | 150/600 = 0.25 | 13.5 | 75/150 x 13.5 = 6.75 (7) | 75/150 x 13.5 = 6.75 (7) |
| 8 | 135/600 = 0.225 | 12.2 | 70/135 x 12.2 = 6.32 (6) | 65/135 x 12.2 = 5.87 (6) |
| 9 | 130/600 = 0.2166666 | 11.7 | 68/130 x 11.7 = 6.12 (6) | 62/130 x 11.7 = 5.58 (6) |
| 10 | 100/600 = 0.1666666 | 9 | 49/100 x 9 = 4.41 (4) | 51/100 x 9 = 4.59 (5) |
| 11 | 85/600 = 0.1416666 | 7.6 | 44/85 x 7.6 = 3.93 (4) | 41/85 x 7.6 = 3.67 (4) |
Due to rounding, my sample size has been adjusted from 54 to 55. Given as a percentage, this would be:
55/600 x 100 = 9.166666667
= 9.2%
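The arithmetic in the stratified sample can be checked with a short script. This is a sketch using the year-group counts from the data table above. Unlike the table, it does not round the per-year figure before splitting by gender, but it reaches the same whole numbers for this data.

```python
# Year-group counts from the data table above: year -> (boys, girls)
counts = {7: (75, 75), 8: (65, 70), 9: (62, 68), 10: (51, 49), 11: (41, 44)}
population = sum(b + g for b, g in counts.values())   # 600
target = 54                                            # 9% of 600

allocation = {}
for year, (boys, girls) in counts.items():
    # Each stratum's share of the sample is proportional to its size
    allocation[year] = (round(boys / population * target),
                        round(girls / population * target))

sample_boys = sum(b for b, _ in allocation.values())
sample_girls = sum(g for _, g in allocation.values())
sample_total = sample_boys + sample_girls
print(allocation)
print(sample_boys, sample_girls, sample_total)
print(f"{sample_total / population * 100:.1f}% of the population")
```

The script gives 28 boys and 27 girls, 55 pupils in total, which matches the adjusted sample size of 9.2%.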
I now need to randomly select, within the specified year and gender, the designated amount for each category. I will do this by using the random function on my calculator. I need to make sure the results are random, so that they will not be biased. Once I have done this, I need to check for any anomalies in my selected pupils’ weight/height.
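Boys

| No. | Year | Height (m) | Weight (kg) |
|---|---|---|---|
| 1 | 7 | 1.48 | 44 |
| 2 | 7 | 1.59 | 52 |
| 3 | 7 | 1.49 | 43 |
| 4 | 7 | 1.52 | 45 |
| 5 | 7 | 1.54 | 43 |
| 6 | 7 | 1.55 | 40 |
| 7 | 7 | 1.59 | 45 |
| 8 | 8 | 1.57 | 48 |
| 9 | 8 | 1.67 | 51 |
| 10 | 8 | 1.71 | 46 |
| 11 | 8 | 1.66 | 43 |
| 12 | 8 | 1.59 | 47 |
| 13 | 8 | 1.42 | 40 |
| 14 | 9 | 1.67 | 54 |
| 15 | 9 | 1.80 | 48 |
| 16 | 9 | 1.75 | 63 |
| 17 | 9 | 1.46 | 45 |
| 18 | 9 | 1.50 | 70 |
| 19 | 9 | 1.82 | 66 |
| 20 | 10 | 1.80 | 49 |
| 21 | 10 | 1.60 | 50 |
| 22 | 10 | 1.62 | 52 |
| 23 | 10 | 1.65 | 50 |
| 24 | 10 | 1.77 | 59 |
| 25 | 11 | 1.91 | 82 |
| 26 | 11 | 1.62 | 56 |
| 27 | 11 | 1.74 | 50 |
| 28 | 11 | 2.00 | 86 |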
Results
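Girls

| No. | Year | Height (m) | Weight (kg) |
|---|---|---|---|
| 1 | 7 | 1.61 | 45 |
| 2 | 7 | 1.61 | 47 |
| 3 | 7 | 1.56 | 43 |
| 4 | 7 | 1.48 | 42 |
| 5 | 7 | 1.50 | 40 |
| 6 | 7 | 1.56 | 53 |
| 7 | 7 | 1.58 | 48 |
| 8 | 8 | 1.72 | 43 |
| 9 | 8 | 1.62 | 53 |
| 10 | 8 | 1.62 | 54 |
| 11 | 8 | 1.60 | 46 |
| 12 | 8 | 1.75 | 45 |
| 13 | 8 | 1.48 | 46 |
| 14 | 9 | 1.57 | 38 |
| 15 | 9 | 1.62 | 54 |
| 16 | 9 | 1.64 | 40 |
| 17 | 9 | 1.60 | 46 |
| 18 | 9 | 1.80 | 60 |
| 19 | 9 | 1.60 | 51 |
| 20 | 10 | 1.52 | 45 |
| 21 | 10 | 1.72 | 56 |
| 22 | 10 | 1.66 | 45 |
| 23 | 10 | 1.73 | 42 |
| 24 | 11 | 1.70 | 50 |
| 25 | 11 | 1.68 | 48 |
| 26 | 11 | 1.52 | 38 |
| 27 | 11 | 1.62 | 48 |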
Organising My Results
Although I have already presented my results in two separate tables, one for each gender, the results are not concise enough. In order to fully analyse my results, I will need to put them into scatter diagrams, histograms etc. Therefore, my results need to be grouped into around 5-8 groups, which are the same for both genders. This is because I will need to compare both genders, which requires the same groups for both sexes. Once I have chosen my groups, I will enter the information into frequency tables and use those for my histograms and scatter diagrams.
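Grouping the data into a frequency table could be sketched as below. The class boundaries (seven classes of width 10 cm starting at 140 cm) are an assumption made for illustration, not the groups finally chosen, and `frequency_table` is a hypothetical helper name. The heights are those of the seven Year 7 girls in the sample table above.

```python
from collections import Counter

def frequency_table(heights_m, start_cm=140, width_cm=10, groups=7):
    """Count heights in equal classes: start <= h < start + width, and so on."""
    counts = Counter()
    for h in heights_m:
        cm = round(h * 100)                    # work in whole centimetres
        index = (cm - start_cm) // width_cm    # which class the height falls in
        if not 0 <= index < groups:
            raise ValueError(f"{h} m is outside the chosen classes")
        counts[index] += 1
    table = []
    for i in range(groups):
        low = start_cm + i * width_cm
        table.append((f"{low} <= h < {low + width_cm}", counts[i]))
    return table

# Heights (m) of the seven Year 7 girls in the sample table above
year7_girls = [1.61, 1.61, 1.56, 1.48, 1.50, 1.56, 1.58]
for label, freq in frequency_table(year7_girls):
    print(label, freq)
```

Using the same `start_cm`, `width_cm` and `groups` for boys and girls keeps the classes identical, so the two frequency tables can be compared directly.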
Statistical coursework that uses data from 'Mayfield High School.'. (2018, Nov 29). Retrieved from https://paperap.com/paper-on-statistical-coursework-that-uses-data-from-mayfield-high-school/