
# Data analysis

## 4. Data analysis

Sophisticated data analysis will help you spot patterns, trends and relationships in your results. Data analysis can be qualitative and/or quantitative, and may include statistical tests. An example of a statistical test is outlined below.

## Lorenz curves

The Lorenz curve is a graph showing how evenly distributed a variable is over space.

The diagonal black line represents a perfectly even distribution. The blue and red lines show uneven distributions. The further these coloured lines are from the black line, the more uneven the distribution.

You can draw Lorenz curves based on ordinal data (see worked example 1 below) or interval data (see worked example 2 below).

### Worked example 1: Lorenz curve for ordinal data

There are 32844 LSOAs in England. Each has been given an IMD score and then ranked from 1 (the most deprived) to 32844 (the least deprived). The LSOAs can be divided into five quintiles. The table shows how many LSOAs are in each of the five quintiles for Barking and Dagenham and for Hillingdon.

| Quintile (of all LSOAs in England) | All LSOAs in Barking and Dagenham | All LSOAs in Hillingdon |
| --- | --- | --- |
| 1st (top 20% deprived) | 66 | 6 |
| 2nd | 35 | 52 |
| 3rd | 8 | 30 |
| 4th | 0 | 31 |
| 5th (least 20% deprived) | 0 | 33 |
| SUM | 109 | 152 |

From the raw data, it looks like there is a greater number of deprived LSOAs in Barking and Dagenham. In contrast, Hillingdon contains a more even distribution. Calculate the percentages for all three columns.

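| Quintile | England % | Barking and Dagenham raw data | Barking and Dagenham % | Hillingdon raw data | Hillingdon % |
| --- | --- | --- | --- | --- | --- |
| 1st | 20 | 66 | 60.6 | 6 | 3.9 |
| 2nd | 20 | 35 | 32.1 | 52 | 34.2 |
| 3rd | 20 | 8 | 7.3 | 30 | 19.7 |
| 4th | 20 | 0 | 0 | 31 | 20.4 |
| 5th | 20 | 0 | 0 | 33 | 21.7 |
| SUM | 100 | 109 | 100 | 152 | 100 |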

Now calculate the cumulative percentages for all three columns.

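| Quintile | England % | England cumulative % | Barking and Dagenham raw data | Barking and Dagenham % | Barking and Dagenham cumulative % | Hillingdon raw data | Hillingdon % | Hillingdon cumulative % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1st | 20 | 20 | 66 | 60.6 | 60.6 | 6 | 3.9 | 3.9 |
| 2nd | 20 | 40 | 35 | 32.1 | 92.7 | 52 | 34.2 | 38.2 |
| 3rd | 20 | 60 | 8 | 7.3 | 100 | 30 | 19.7 | 57.9 |
| 4th | 20 | 80 | 0 | 0 | 100 | 31 | 20.4 | 78.3 |
| 5th | 20 | 100 | 0 | 0 | 100 | 33 | 21.7 | 100 |
| SUM | 100 | 100 | 109 | 100 | 100 | 152 | 100 | 100 |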

Plot a scattergraph with axes as follows:

• x-axis: cumulative percentages for England
• y-axis: cumulative percentages for a single London Borough

The black line shows a perfectly even distribution, which here is the distribution of deprivation ranks across England as a whole. The further a line is from this, the more uneven the distribution. As suspected, Barking and Dagenham has a more uneven distribution of IMD ranks than Hillingdon.
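The percentage and cumulative percentage steps above can also be checked with a short script. This is a minimal sketch rather than a required method: the function name `cumulative_percentages` and the variable names are made up for illustration, and the counts are copied from the table in this worked example.

```python
# Counts of LSOAs per national deprivation quintile (1st = most deprived),
# copied from the worked example above.
counts = {
    "Barking and Dagenham": [66, 35, 8, 0, 0],
    "Hillingdon": [6, 52, 30, 31, 33],
}


def cumulative_percentages(values):
    """Return the running total of values as percentages of their sum."""
    total = sum(values)
    running = 0
    result = []
    for value in values:
        running += value
        result.append(round(100 * running / total, 1))
    return result


# England is split into quintiles, so each one holds 20% of all LSOAs.
england_cumulative = cumulative_percentages([20, 20, 20, 20, 20])
borough_cumulative = {
    name: cumulative_percentages(values) for name, values in counts.items()
}

# Each (x, y) pair is one point to plot: x for England, y for the borough.
for name, cumulative in borough_cumulative.items():
    print(name, list(zip(england_cumulative, cumulative)))
```

When plotting, it helps to add the point (0, 0) at the start so that each curve begins at the origin.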

### Worked example 2: Lorenz curve for interval data

Lorenz curves can also be constructed for interval data, but there are some extra steps.

Bristol City Council have divided up the city into 14 ‘Neighbourhood Areas’. For each Neighbourhood Area, the total population has been counted, along with the number of people with a ‘severe limiting long-term illness’.

This information can be used to help answer the question: do certain areas of Bristol contain a greater concentration of severely ill people than other areas? Or by contrast, are severely ill people evenly distributed throughout Bristol?

| Name of Neighbourhood Area in Bristol | Total population of this Neighbourhood Area | Number of severely ill in this Neighbourhood Area |
| --- | --- | --- |
| Ashley | 47782 | 3514 |
| Avonmouth | 20237 | 2074 |
| Bishopston | 36713 | 1383 |
| Clifton | 41192 | 1537 |
| Dundry View | 28771 | 3411 |
| Filwood | 38778 | 3553 |
| Bedminster | 22664 | 1864 |
| Brislington | 22107 | 1686 |
| Fishponds | 27575 | 3503 |
| Henbury | 24253 | 2691 |
| Hengrove | 28786 | 2963 |
| Henleaze | 31412 | 2127 |
| Horfield | 23912 | 2141 |
| St Georges | 24052 | 2123 |
| TOTAL | 418234 | 34570 |

Calculate the percentages for the ‘total population’ and ‘number of severely ill’ columns. This shows what percentage of Bristol’s population, and what percentage of Bristol’s severely ill people, live in each Neighbourhood Area. For example, Fishponds contains 6.59% of Bristol’s population and 10.13% of Bristol’s severely ill people.

| Name of Neighbourhood Area in Bristol | Total population of this Neighbourhood Area | % of Bristol’s population | Number of severely ill in this Neighbourhood Area | % of Bristol’s severely ill |
| --- | --- | --- | --- | --- |
| Ashley | 47782 | 11.42 | 3514 | 10.16 |
| Avonmouth | 20237 | 4.84 | 2074 | 6.00 |
| Bishopston | 36713 | 8.78 | 1383 | 4.00 |
| Clifton | 41192 | 9.85 | 1537 | 4.45 |
| Dundry View | 28771 | 6.88 | 3411 | 9.87 |
| Filwood | 38778 | 9.27 | 3553 | 10.28 |
| Bedminster | 22664 | 5.42 | 1864 | 5.39 |
| Brislington | 22107 | 5.29 | 1686 | 4.88 |
| Fishponds | 27575 | 6.59 | 3503 | 10.13 |
| Henbury | 24253 | 5.80 | 2691 | 7.78 |
| Hengrove | 28786 | 6.88 | 2963 | 8.57 |
| Henleaze | 31412 | 7.51 | 2127 | 6.15 |
| Horfield | 23912 | 5.72 | 2141 | 6.19 |
| St Georges | 24052 | 5.75 | 2123 | 6.14 |
| TOTAL | 418234 | 100 | 34570 | 100.00 |

Calculate the ratio between the two percentage columns.

$$\text{ratio} = \frac{\%\ \text{severely ill}}{\%\ \text{population}}$$

For example, in Ashley, the ratio is $10.16 \div 11.42 = 0.89$.

| Name of Neighbourhood Area in Bristol | Total population of this Neighbourhood Area | % of Bristol’s population | Number of severely ill in this Neighbourhood Area | % of Bristol’s severely ill | Ratio of % severely ill to % population |
| --- | --- | --- | --- | --- | --- |
| Ashley | 47782 | 11.42 | 3514 | 10.16 | 0.89 |
| Avonmouth | 20237 | 4.84 | 2074 | 6.00 | 1.24 |
| Bishopston | 36713 | 8.78 | 1383 | 4.00 | 0.46 |
| Clifton | 41192 | 9.85 | 1537 | 4.45 | 0.45 |
| Dundry View | 28771 | 6.88 | 3411 | 9.87 | 1.43 |
| Filwood | 38778 | 9.27 | 3553 | 10.28 | 1.11 |
| Bedminster | 22664 | 5.42 | 1864 | 5.39 | 1.00 |
| Brislington | 22107 | 5.29 | 1686 | 4.88 | 0.92 |
| Fishponds | 27575 | 6.59 | 3503 | 10.13 | 1.54 |
| Henbury | 24253 | 5.80 | 2691 | 7.78 | 1.34 |
| Hengrove | 28786 | 6.88 | 2963 | 8.57 | 1.25 |
| Henleaze | 31412 | 7.51 | 2127 | 6.15 | 0.82 |
| Horfield | 23912 | 5.72 | 2141 | 6.19 | 1.08 |
| St Georges | 24052 | 5.75 | 2123 | 6.14 | 1.07 |
| TOTAL | 418234 | 100 | 34570 | 100.00 | |

Rank the ratio column from highest number to lowest number. You can either do this by hand or by using the Sort command in Excel.

| Name of Neighbourhood Area in Bristol | Total population of this Neighbourhood Area | % of Bristol’s population | Number of severely ill in this Neighbourhood Area | % of Bristol’s severely ill | Ratio of % severely ill to % population | Rank |
| --- | --- | --- | --- | --- | --- | --- |
| Ashley | 47782 | 11.42 | 3514 | 10.16 | 0.89 | 11 |
| Avonmouth | 20237 | 4.84 | 2074 | 6.00 | 1.24 | 5 |
| Bishopston | 36713 | 8.78 | 1383 | 4.00 | 0.46 | 13 |
| Clifton | 41192 | 9.85 | 1537 | 4.45 | 0.45 | 14 |
| Dundry View | 28771 | 6.88 | 3411 | 9.87 | 1.43 | 2 |
| Filwood | 38778 | 9.27 | 3553 | 10.28 | 1.11 | 6 |
| Bedminster | 22664 | 5.42 | 1864 | 5.39 | 1.00 | 9 |
| Brislington | 22107 | 5.29 | 1686 | 4.88 | 0.92 | 10 |
| Fishponds | 27575 | 6.59 | 3503 | 10.13 | 1.54 | 1 |
| Henbury | 24253 | 5.80 | 2691 | 7.78 | 1.34 | 3 |
| Hengrove | 28786 | 6.88 | 2963 | 8.57 | 1.25 | 4 |
| Henleaze | 31412 | 7.51 | 2127 | 6.15 | 0.82 | 12 |
| Horfield | 23912 | 5.72 | 2141 | 6.19 | 1.08 | 7 |
| St Georges | 24052 | 5.75 | 2123 | 6.14 | 1.07 | 8 |
| TOTAL | 418234 | 100 | 34570 | 100.00 | | |

Rearrange the rows in the table according to the ranks that you have just made.

| Neighbourhood Area | % total population | % severely ill | Ratio | Rank |
| --- | --- | --- | --- | --- |
| Fishponds | 6.59 | 10.13 | 1.54 | 1 |
| Dundry View | 6.88 | 9.87 | 1.43 | 2 |
| Henbury | 5.80 | 7.78 | 1.34 | 3 |
| Hengrove | 6.88 | 8.57 | 1.25 | 4 |
| Avonmouth | 4.84 | 6.00 | 1.24 | 5 |
| Filwood | 9.27 | 10.28 | 1.11 | 6 |
| Horfield | 5.72 | 6.19 | 1.08 | 7 |
| St Georges | 5.75 | 6.14 | 1.07 | 8 |
| Bedminster | 5.42 | 5.39 | 1.00 | 9 |
| Brislington | 5.29 | 4.88 | 0.92 | 10 |
| Ashley | 11.42 | 10.16 | 0.89 | 11 |
| Henleaze | 7.51 | 6.15 | 0.82 | 12 |
| Bishopston | 8.78 | 4.00 | 0.46 | 13 |
| Clifton | 9.85 | 4.45 | 0.45 | 14 |

Calculate cumulative figures for the two % columns.

| Neighbourhood Area | Total population % | Total population cumulative % | Severely ill % | Severely ill cumulative % |
| --- | --- | --- | --- | --- |
| Fishponds | 6.59 | 6.59 | 10.13 | 10.13 |
| Dundry View | 6.88 | 13.47 | 9.87 | 20.00 |
| Henbury | 5.80 | 19.27 | 7.78 | 27.78 |
| Hengrove | 6.88 | 26.15 | 8.57 | 36.36 |
| Avonmouth | 4.84 | 30.99 | 6.00 | 42.35 |
| Filwood | 9.27 | 40.26 | 10.28 | 52.63 |
| Horfield | 5.72 | 45.98 | 6.19 | 58.83 |
| St Georges | 5.75 | 51.73 | 6.14 | 64.97 |
| Bedminster | 5.42 | 57.15 | 5.39 | 70.36 |
| Brislington | 5.29 | 62.44 | 4.88 | 75.24 |
| Ashley | 11.42 | 73.86 | 10.16 | 85.40 |
| Henleaze | 7.51 | 81.37 | 6.15 | 91.55 |
| Bishopston | 8.78 | 90.15 | 4.00 | 95.55 |
| Clifton | 9.85 | 100.00 | 4.45 | 100.00 |

Finally it is time to draw the Lorenz curve! Plot the cumulative % total population on the x-axis. Plot the cumulative % severely ill on the y-axis.
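The whole sequence of steps for interval data (percentages, ratio, ranking, cumulative totals) can be sketched in a few lines of Python as an alternative to Excel. The variable names here are illustrative, not part of the original method, and the figures are copied from the Bristol table. The script works from unrounded percentages, so the last decimal place may occasionally differ from a calculation done by hand with rounded values.

```python
# (name, total population, number severely ill), from the Bristol table above.
areas = [
    ("Ashley", 47782, 3514), ("Avonmouth", 20237, 2074),
    ("Bishopston", 36713, 1383), ("Clifton", 41192, 1537),
    ("Dundry View", 28771, 3411), ("Filwood", 38778, 3553),
    ("Bedminster", 22664, 1864), ("Brislington", 22107, 1686),
    ("Fishponds", 27575, 3503), ("Henbury", 24253, 2691),
    ("Hengrove", 28786, 2963), ("Henleaze", 31412, 2127),
    ("Horfield", 23912, 2141), ("St Georges", 24052, 2123),
]

total_pop = sum(pop for _, pop, _ in areas)
total_ill = sum(ill for _, _, ill in areas)

# Percentage of Bristol's population and of its severely ill in each area,
# plus the ratio between the two percentages.
rows = []
for name, pop, ill in areas:
    pct_pop = 100 * pop / total_pop
    pct_ill = 100 * ill / total_ill
    rows.append((name, pct_pop, pct_ill, pct_ill / pct_pop))

# Rank from the highest ratio to the lowest.
rows.sort(key=lambda row: row[3], reverse=True)

# Running totals give the points of the Lorenz curve, starting at the origin.
points = [(0.0, 0.0)]
cum_pop = 0.0
cum_ill = 0.0
for name, pct_pop, pct_ill, ratio in rows:
    cum_pop += pct_pop
    cum_ill += pct_ill
    points.append((cum_pop, cum_ill))
    print(f"{name:12s} ratio {ratio:.2f}  x {cum_pop:6.2f}  y {cum_ill:6.2f}")
```

The list `points` holds the (x, y) pairs to plot: cumulative % total population on the x-axis and cumulative % severely ill on the y-axis.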

## Gini coefficient

Lorenz curves are a useful visual technique for presenting your data. But it is sometimes difficult to see how one uneven distribution compares to another. The Gini coefficient is a summary statistic that will provide a precise answer.

$$\text{Gini coefficient} = \frac{\text{area of graph between the diagonal and the curve}}{\text{area of graph above the diagonal}}$$

The result for the Gini coefficient ranges from 0 (completely even distribution) to 1 (completely uneven distribution).

### Worked example of Gini coefficient

There are 32844 LSOAs in England. Each has been given an IMD score and then ranked from 1 (the most deprived) to 32844 (the least deprived). The LSOAs can be divided into five quintiles. The table shows how many LSOAs are in each of the five quintiles for Barking and Dagenham and for Hillingdon.

| Quintile (of all LSOAs in England) | All LSOAs in Barking and Dagenham | All LSOAs in Hillingdon |
| --- | --- | --- |
| 1st (top 20% deprived) | 66 | 6 |
| 2nd | 35 | 52 |
| 3rd | 8 | 30 |
| 4th | 0 | 31 |
| 5th (least 20% deprived) | 0 | 33 |
| SUM | 109 | 152 |

Lorenz curves were plotted for the data.

To calculate the area of the graph above the diagonal, and the area of graph between the diagonal and the curve, you can count the number of squares on graph paper. Include fractions for part-squares.

• There are 625 squares shown
• 312.5 squares are above the black diagonal line
• There are 61 squares between the diagonal and the curve for Hillingdon
• There are 109 squares between the diagonal and the curve for Barking

$$\text{Gini coefficient for Hillingdon} = 61 \div 312.5 = 0.20$$

$$\text{Gini coefficient for Barking} = 109 \div 312.5 = 0.35$$

## Location Quotient

The Location Quotient is another mathematical technique for showing how unevenly distributed a variable is over space.

$$\text{Location Quotient} = \frac{\%\ \text{in one area}}{\%\ \text{of the whole population}}$$

Location Quotient (LQ) varies from 0 to infinity.

If LQ is less than 1, the variable is under-represented in a particular area. If LQ is greater than 1, the variable is over-represented in a particular area.

### Worked example

Bristol City Council have divided up the city into 14 ‘Neighbourhood Areas’. For each Neighbourhood Area, the number of people in different age bands has been counted. The table shows the total number of people aged 16-24 for each area.

| Name of Neighbourhood Area in Bristol | Total population of this Neighbourhood Area | Total number of people aged 16-24 |
| --- | --- | --- |
| Ashley | 47782 | 7519 |
| Avonmouth | 20237 | 2364 |
| Bishopston | 36713 | 8351 |
| Clifton | 41192 | 14003 |
| Dundry View | 28771 | 3621 |
| Filwood | 38778 | 4288 |
| Bedminster | 22664 | 2762 |
| Brislington | 22107 | 2294 |
| Fishponds | 27575 | 5535 |
| Henbury | 24253 | 2631 |
| Hengrove | 28786 | 3137 |
| Henleaze | 31412 | 4160 |
| Horfield | 23912 | 3773 |
| St Georges | 24052 | 2566 |
| TOTAL | 418234 | 67004 |

Calculate the percentages for the ‘total population’ and ‘number aged 16-24’ columns. This shows what percentage of Bristol’s population, and what percentage of Bristol’s 16-24 year olds, live in each Neighbourhood Area.

For example, Avonmouth contains 3.53% of all the 16-24 year olds in Bristol. Be careful not to get confused here. This does not mean that 3.53% of Avonmouth’s population is aged 16-24.

| Name of Neighbourhood Area in Bristol | Total population of this Neighbourhood Area | % of Bristol’s population | Total number of people aged 16-24 in this Neighbourhood Area | % of Bristol’s 16-24 year olds |
| --- | --- | --- | --- | --- |
| Ashley | 47782 | 11.42 | 7519 | 11.22 |
| Avonmouth | 20237 | 4.84 | 2364 | 3.53 |
| Bishopston | 36713 | 8.78 | 8351 | 12.46 |
| Clifton | 41192 | 9.85 | 14003 | 20.90 |
| Dundry View | 28771 | 6.88 | 3621 | 5.40 |
| Filwood | 38778 | 9.27 | 4288 | 6.40 |
| Bedminster | 22664 | 5.42 | 2762 | 4.12 |
| Brislington | 22107 | 5.29 | 2294 | 3.42 |
| Fishponds | 27575 | 6.59 | 5535 | 8.26 |
| Henbury | 24253 | 5.80 | 2631 | 3.93 |
| Hengrove | 28786 | 6.88 | 3137 | 4.68 |
| Henleaze | 31412 | 7.51 | 4160 | 6.21 |
| Horfield | 23912 | 5.72 | 3773 | 5.63 |
| St Georges | 24052 | 5.75 | 2566 | 3.83 |
| TOTAL | 418234 | 100 | 67004 | 100 |
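The warning about Avonmouth given before the table can be checked with a quick calculation. This is only an illustrative sketch with made-up variable names, using the Avonmouth and Bristol figures from the tables above. It shows that the two percentages answer different questions and are not interchangeable.

```python
# Figures for Avonmouth and Bristol, copied from the tables above.
avonmouth_population = 20237
avonmouth_aged_16_24 = 2364
bristol_aged_16_24 = 67004

# Share of Bristol's 16-24 year olds who live in Avonmouth.
# This is the figure used in the % column of the table.
share_of_city_group = 100 * avonmouth_aged_16_24 / bristol_aged_16_24

# Share of Avonmouth's own residents who are aged 16-24.
# This is a different question and gives a different answer.
share_within_area = 100 * avonmouth_aged_16_24 / avonmouth_population

print(f"{share_of_city_group:.2f}% of Bristol's 16-24 year olds live in Avonmouth")
print(f"{share_within_area:.2f}% of Avonmouth's residents are aged 16-24")
```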

The Location Quotient is the ratio between the two percentage columns.

$$\text{Location Quotient} = \frac{\%\ \text{aged 16-24}}{\%\ \text{whole population}}$$

For example, in Avonmouth, the LQ is $3.53 \div 4.84 = 0.73$.

| Name of Neighbourhood Area in Bristol | Total population of this Neighbourhood Area | % of Bristol’s population | Total number of people aged 16-24 in this Neighbourhood Area | % of Bristol’s 16-24 year olds | Location Quotient |
| --- | --- | --- | --- | --- | --- |
| Ashley | 47782 | 11.42 | 7519 | 11.22 | 0.98 |
| Avonmouth | 20237 | 4.84 | 2364 | 3.53 | 0.73 |
| Bishopston | 36713 | 8.78 | 8351 | 12.46 | 1.42 |
| Clifton | 41192 | 9.85 | 14003 | 20.90 | 2.12 |
| Dundry View | 28771 | 6.88 | 3621 | 5.40 | 0.79 |
| Filwood | 38778 | 9.27 | 4288 | 6.40 | 0.69 |
| Bedminster | 22664 | 5.42 | 2762 | 4.12 | 0.76 |
| Brislington | 22107 | 5.29 | 2294 | 3.42 | 0.65 |
| Fishponds | 27575 | 6.59 | 5535 | 8.26 | 1.25 |
| Henbury | 24253 | 5.80 | 2631 | 3.93 | 0.68 |
| Hengrove | 28786 | 6.88 | 3137 | 4.68 | 0.68 |
| Henleaze | 31412 | 7.51 | 4160 | 6.21 | 0.83 |
| Horfield | 23912 | 5.72 | 3773 | 5.63 | 0.98 |
| St Georges | 24052 | 5.75 | 2566 | 3.83 | 0.67 |
| TOTAL | 418234 | 100 | 67004 | 100 | 1 |

The calculated figures show that people aged 16-24 are under-represented in a number of areas, such as Avonmouth, Brislington and St Georges. But people aged 16-24 are over-represented in other areas, such as Clifton, Bishopston and Fishponds. The LQ results show that the greatest concentration of young adults is in Clifton: can you find any other data to help explain this?
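The Location Quotient calculation for all 14 areas can be sketched as follows. The dictionary name `location_quotient` and the other variable names are illustrative assumptions, and the data are copied from the table in this worked example.

```python
# (name, total population, number aged 16-24), from the Bristol table above.
areas = [
    ("Ashley", 47782, 7519), ("Avonmouth", 20237, 2364),
    ("Bishopston", 36713, 8351), ("Clifton", 41192, 14003),
    ("Dundry View", 28771, 3621), ("Filwood", 38778, 4288),
    ("Bedminster", 22664, 2762), ("Brislington", 22107, 2294),
    ("Fishponds", 27575, 5535), ("Henbury", 24253, 2631),
    ("Hengrove", 28786, 3137), ("Henleaze", 31412, 4160),
    ("Horfield", 23912, 3773), ("St Georges", 24052, 2566),
]

total_pop = sum(pop for _, pop, _ in areas)
total_young = sum(young for _, _, young in areas)

# LQ = (% of Bristol's 16-24 year olds in the area)
#      divided by (% of Bristol's whole population in the area).
location_quotient = {}
for name, pop, young in areas:
    pct_young = 100 * young / total_young
    pct_pop = 100 * pop / total_pop
    location_quotient[name] = pct_young / pct_pop

# Areas where the age group is over-represented (LQ greater than 1).
over_represented = sorted(n for n, lq in location_quotient.items() if lq > 1)

for name, lq in sorted(location_quotient.items(), key=lambda item: item[1]):
    print(f"{name:12s} {lq:.2f}")
print("Over-represented:", over_represented)
```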

## Index of Dissimilarity

The Index of Dissimilarity is used to compare the distribution of two variables, such as two socio-economic groups or two ethnic groups in a particular area.

$$\text{Index of dissimilarity} = \frac{1}{2} \sum_i \left| \frac{x_i}{X} - \frac{y_i}{Y} \right|$$

• $x_i$ is the population of group x in small area i
• $X$ is the total population of group x in the whole area
• $y_i$ is the population of group y in small area i
• $Y$ is the total population of group y in the whole area

It helps answer the question: is group x distributed across the small areas of a particular place in the same way as group y? When the shares x_i/X and y_i/Y are expressed as percentages, the index ranges from 0 (complete integration) to 100 (complete segregation).

### Worked example 1 of Index of Dissimilarity

Census 2011 data for wards in Sandwell (West Midlands) can be obtained from Neighbourhood Statistics. An extract is shown below.

| Name of ward in Sandwell | Number of persons identifying their ethnicity as White in the ward (this is $x_i$) | Number of persons identifying their ethnicity as Asian in the ward (this is $y_i$) |
| --- | --- | --- |
| Abbey | 9078 | 1271 |
| Blackheath | 10808 | 870 |
| Bristnall | 9064 | 1814 |
| Charlemont with Grove Vale | 8903 | 1918 |
| Cradley Heath and Old Hill | 11913 | 1009 |
| Friar Park | 11335 | 619 |
| Great Barr with Yew Tree | 8300 | 3105 |
| Great Bridge | 10393 | 1626 |
| Greets Green and Lyng | 6925 | 3244 |
| Hateley Heath | 10295 | 2182 |
| Langley | 10135 | 1448 |
| Newton | 7879 | 2178 |
| Old Warley | 9388 | 1399 |
| Oldbury | 7648 | 4011 |
| Princes End | 11847 | 369 |
| Rowley | 10648 | 609 |
| St Pauls | 4252 | 7822 |
| Smethwick | 7128 | 4522 |
| Soho and Victoria | 3854 | 6881 |
| Tipton Green | 9262 | 2625 |
| Tividale | 10616 | 913 |
| Wednesbury North | 10331 | 1734 |
| Wednesbury South | 9132 | 2232 |
| West Bromwich Central | 6337 | 4857 |
| TOTAL | 215471 (this is $X$) | 59258 (this is $Y$) |

Calculate the percentages for the ‘White’ and ‘Asian’ columns. This shows the percentage of Sandwell’s population of each ethnic group who live in each ward.

For example, there are 215471 people identifying as White in Sandwell. There are 11847 people identifying as White in Princes End.

$$\%\ \text{of Sandwell's White population who live in Princes End} = \frac{11847}{215471} \times 100 = 5.50\%$$

This means that Princes End contains 5.50% of the people identifying as White in Sandwell. Be careful not to get confused here. This does not mean that 5.50% of the population of Princes End is White.

| Ward | White raw data | White % of Sandwell’s White population (this is $x_i/X$) | Asian raw data | Asian % of Sandwell’s Asian population (this is $y_i/Y$) |
| --- | --- | --- | --- | --- |
| Abbey | 9078 | 4.21 | 1271 | 2.14 |
| Blackheath | 10808 | 5.02 | 870 | 1.47 |
| Bristnall | 9064 | 4.21 | 1814 | 3.06 |
| Charlemont | 8903 | 4.13 | 1918 | 3.24 |
| Cradley Heath | 11913 | 5.53 | 1009 | 1.70 |
| Friar Park | 11335 | 5.26 | 619 | 1.04 |
| Great Barr | 8300 | 3.85 | 3105 | 5.24 |
| Great Bridge | 10393 | 4.82 | 1626 | 2.74 |
| Greets Green | 6925 | 3.21 | 3244 | 5.47 |
| Hateley Heath | 10295 | 4.78 | 2182 | 3.68 |
| Langley | 10135 | 4.70 | 1448 | 2.44 |
| Newton | 7879 | 3.66 | 2178 | 3.68 |
| Old Warley | 9388 | 4.36 | 1399 | 2.36 |
| Oldbury | 7648 | 3.55 | 4011 | 6.77 |
| Princes End | 11847 | 5.50 | 369 | 0.62 |
| Rowley | 10648 | 4.94 | 609 | 1.03 |
| St Pauls | 4252 | 1.97 | 7822 | 13.20 |
| Smethwick | 7128 | 3.31 | 4522 | 7.63 |
| Soho and Victoria | 3854 | 1.79 | 6881 | 11.61 |
| Tipton Green | 9262 | 4.30 | 2625 | 4.43 |
| Tividale | 10616 | 4.93 | 913 | 1.54 |
| Wednesbury N | 10331 | 4.79 | 1734 | 2.93 |
| Wednesbury S | 9132 | 4.24 | 2232 | 3.77 |
| West Bromwich C | 6337 | 2.94 | 4857 | 8.20 |
| SUM | 215471 | 100.00 | 59258 | 100.00 |

Calculate $\left| \frac{x_i}{X} - \frac{y_i}{Y} \right|$ for each ward.

This is the difference between the two columns of percentages. Ignore any minus signs, so that every difference is written as a positive number.

| Ward | White raw data | White % | Asian raw data | Asian % | Difference (this is $\lvert x_i/X - y_i/Y \rvert$) |
| --- | --- | --- | --- | --- | --- |
| Abbey | 9078 | 4.21 | 1271 | 2.14 | 2.07 |
| Blackheath | 10808 | 5.02 | 870 | 1.47 | 3.55 |
| Bristnall | 9064 | 4.21 | 1814 | 3.06 | 1.15 |
| Charlemont | 8903 | 4.13 | 1918 | 3.24 | 0.90 |
| Cradley Heath | 11913 | 5.53 | 1009 | 1.70 | 3.83 |
| Friar Park | 11335 | 5.26 | 619 | 1.04 | 4.22 |
| Great Barr | 8300 | 3.85 | 3105 | 5.24 | 1.39 |
| Great Bridge | 10393 | 4.82 | 1626 | 2.74 | 2.08 |
| Greets Green | 6925 | 3.21 | 3244 | 5.47 | 2.26 |
| Hateley Heath | 10295 | 4.78 | 2182 | 3.68 | 1.10 |
| Langley | 10135 | 4.70 | 1448 | 2.44 | 2.26 |
| Newton | 7879 | 3.66 | 2178 | 3.68 | 0.02 |
| Old Warley | 9388 | 4.36 | 1399 | 2.36 | 2.00 |
| Oldbury | 7648 | 3.55 | 4011 | 6.77 | 3.22 |
| Princes End | 11847 | 5.50 | 369 | 0.62 | 4.88 |
| Rowley | 10648 | 4.94 | 609 | 1.03 | 3.91 |
| St Pauls | 4252 | 1.97 | 7822 | 13.20 | 11.23 |
| Smethwick | 7128 | 3.31 | 4522 | 7.63 | 4.32 |
| Soho and Victoria | 3854 | 1.79 | 6881 | 11.61 | 9.82 |
| Tipton Green | 9262 | 4.30 | 2625 | 4.43 | 0.13 |
| Tividale | 10616 | 4.93 | 913 | 1.54 | 3.39 |
| Wednesbury N | 10331 | 4.79 | 1734 | 2.93 | 1.87 |
| Wednesbury S | 9132 | 4.24 | 2232 | 3.77 | 0.47 |
| West Bromwich C | 6337 | 2.94 | 4857 | 8.20 | 5.26 |
| SUM | 215471 | 100.00 | 59258 | 100.00 | 75.29 |

Calculate $\sum_i \left| \frac{x_i}{X} - \frac{y_i}{Y} \right|$

This is the sum of the differences column.

In this example, $\sum_i \left| \frac{x_i}{X} - \frac{y_i}{Y} \right| = 75.29$

Calculate $\text{Index of dissimilarity} = \frac{1}{2} \sum_i \left| \frac{x_i}{X} - \frac{y_i}{Y} \right|$

In this example, $\text{Index of dissimilarity} = \frac{1}{2} \times 75.29 = 37.65$

This means that 37.65% of the Asian population of Sandwell would need to change residence to a different ward in order to have the same relative distribution as the White population of Sandwell.
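The calculation above can be sketched as a short function. The name `index_of_dissimilarity` and the variable names are illustrative assumptions, and the ward counts are copied from the Sandwell extract. The function works from the unrounded percentage shares of the listed counts, so its result may differ slightly in the last decimal place from a hand calculation that adds up rounded differences.

```python
# (ward, White count x_i, Asian count y_i), from the Sandwell extract above.
wards = [
    ("Abbey", 9078, 1271), ("Blackheath", 10808, 870),
    ("Bristnall", 9064, 1814), ("Charlemont with Grove Vale", 8903, 1918),
    ("Cradley Heath and Old Hill", 11913, 1009), ("Friar Park", 11335, 619),
    ("Great Barr with Yew Tree", 8300, 3105), ("Great Bridge", 10393, 1626),
    ("Greets Green and Lyng", 6925, 3244), ("Hateley Heath", 10295, 2182),
    ("Langley", 10135, 1448), ("Newton", 7879, 2178),
    ("Old Warley", 9388, 1399), ("Oldbury", 7648, 4011),
    ("Princes End", 11847, 369), ("Rowley", 10648, 609),
    ("St Pauls", 4252, 7822), ("Smethwick", 7128, 4522),
    ("Soho and Victoria", 3854, 6881), ("Tipton Green", 9262, 2625),
    ("Tividale", 10616, 913), ("Wednesbury North", 10331, 1734),
    ("Wednesbury South", 9132, 2232), ("West Bromwich Central", 6337, 4857),
]


def index_of_dissimilarity(pairs):
    """Half the sum of absolute differences between percentage shares.

    pairs is a list of (x_i, y_i) counts, one pair per small area.
    The result is on a 0 to 100 scale because shares are percentages.
    """
    total_x = sum(x for x, _ in pairs)
    total_y = sum(y for _, y in pairs)
    differences = [
        abs(100 * x / total_x - 100 * y / total_y) for x, y in pairs
    ]
    return 0.5 * sum(differences)


index = index_of_dissimilarity([(x, y) for _, x, y in wards])
print(f"Index of dissimilarity: {index:.2f}")
```

Two quick checks of the scale: `index_of_dissimilarity([(1, 2), (3, 6)])` gives 0 because both groups are spread in the same proportions, and `index_of_dissimilarity([(10, 0), (0, 5)])` gives 100 because the groups share no area.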

### Worked example 2 of Index of Dissimilarity

Census 2011 data for wards in Sandwell (West Midlands) can be obtained from Neighbourhood Statistics. The Index of Dissimilarity has been calculated for ward-level data for the 7 largest ethnic groups of residents (excluding people of mixed ethnicity). A summary of the results is shown in the table.

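| | White British | White Other | Indian | Pakistani | Bangladeshi | Black Caribbean | Black African |
| --- | --- | --- | --- | --- | --- | --- | --- |
| White British | | 31.76 | 37.39 | 54.01 | 57.81 | 33.86 | 33.48 |
| White Other | | | 18.68 | 39.80 | 45.22 | 15.96 | 21.07 |
| Indian | | | | 34.03 | 42.73 | 15.58 | 25.60 |
| Pakistani | | | | | 43.42 | 33.71 | 26.17 |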