SEMINARS

 

Statistics for the Baseline Survey - An Introduction

Faisal Awartani

Descriptive and Inferential Statistics

As human society moved toward scientific methods during the 17th, 18th, and 19th centuries, it passed from an early phase of metaphysical thinking to a way of thinking that uses data to support hypotheses or ideas. Since the world embarked upon such empirical studies and statistical analysis, numerous methods of data collection, processing and analysis have evolved. Two basic types of statistics can be distinguished: descriptive statistics and inferential statistics.

Inferential statistics are about how one uses data from a sample to make an inference about a wider population, while descriptive statistics, which are used more commonly, focus on summarizing information efficiently. Statisticians differentiate between two types of summaries: graphical, i.e., by means of diagrams, and numerical, i.e., by means of numbers.

Types of Data and Data Summaries

The process of summarizing data depends on the different types of data. Once the information needed is collected the obtained data must be considered according to its character, whereby the following three types are differentiated:

Nominal data
Ordinal data
Continuous data

Nominal scale data is data that does not have a numerical element. It is either-or information that only classifies, for example, sex: either male or female. Other examples are religion, marital status, place of residence, etc.

Ordinal data also classifies but gives an order in addition. For example, grades in a course are A, B, C and D, and students can be ranked according to their grades. Ordinal data tells one that someone's mark is higher than that of another person, but not by how much. It allows for ranking or putting in order but does not specify the difference. Examples are sizes of clothes (large, medium, small), or questionnaires that ask respondents to evaluate the management, for example, along a given scale of very good - good - medium - bad - very bad.

Continuous or ratio scale data classifies, puts in order and provides units of distance, so that the difference between one value and another is known. Examples are age, income, height, numerical marks, or time spent in a certain job. For example, if someone's mark in a course is 75 and someone else's is 80, one knows the difference is five points, while in ordinal data one would only know that A is more than B. Continuous data gives more details and measures exact differences. In questionnaires and data collection, however, there is often a tendency to put continuous data into ranges. For example, instead of asking people about their exact age or income, one might ask them to specify the applicable range; for age there may be categories such as 30-34, 35-39, 40-44; for income $500-$999, $1,000-$1,499, and so on (the ranges should not overlap, so that every answer falls into exactly one category). This is done in order to obtain data where people might be reluctant to give exact information. However, once data is categorized in such a manner, information is lost. Wherever possible, continuous scale data should be collected, so that it can later be re-coded, put in categories and so on. Continuous scale data can be transformed into nominal or ordinal data, but once continuous variables have been collected as ordinal data, they cannot be transformed back into continuous data.
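The one-way nature of this transformation can be illustrated with a small sketch. The ages and the helper name age_range are hypothetical and chosen only for illustration; the point is that exact values can always be grouped into ranges, while the ranges alone no longer reveal the exact values.

```python
# Hypothetical ages collected as exact (continuous) values.
ages = [31, 34, 36, 38, 41, 44]

def age_range(age, width=5, start=30):
    """Return the non-overlapping range label that contains `age`."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"

# Continuous -> ordinal is always possible...
ranges = [age_range(a) for a in ages]
print(ranges)  # ['30-34', '30-34', '35-39', '35-39', '40-44', '40-44']

# ...but 31 and 34 are now indistinguishable: the exact ages cannot be recovered.
```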

Data is also differentiated by its quantitative and qualitative character. Quantitative data means numerical types of data, such as ages, income, heights, population density - anything that can be measured on a continuous scale. Qualitative data is any kind of data that pertains to a person, a group, a country, or an institution and is not numerical. Qualitative does not mean good or bad, but simply non-numerical data. For example, the answer to the question of whether someone supports the peace process or not is yes or no. This type of information is qualitative. Asking about a particular institution - whether it is an NGO, government, or private - is also qualitative, i.e., not numerical. Qualitative and quantitative data can often be combined when collecting information. For example, regarding the level of education, the question on the highest academic degree is ordinal and qualitative, but asking about the number of years spent studying yields continuous scale data and is quantitative. In other words, the difference between quantitative data and qualitative data is whether the corresponding questions are answered numerically or non-numerically.

Nominal and ordinal data are usually summarized in tables and in pie, bar and line charts. For example, when a questionnaire asks about place of residence in Palestine, there are three possible answers: a) city, b) camp, or c) village. To summarize this nominal data, the results can be put in a table as follows:

Place of residence    Percent
Camp                  30%
Village               40%
City                  30%

However, most people find charts easier to read than tables. In a pie chart the output can be shown in different colors:

[Figure: pie chart of the population by place of residence]

A bar chart would look like this:

[Figure: bar chart of the population by place of residence]

The pie and bar charts graphically illustrate the size of each category in addition to indicating the percentage.

A line chart would look like this:

[Figure: line chart of the population by place of residence]

The items of the x-axis (place of residence) are nominal data and can be switched easily because there is no sense of order; it would only change the appearance of the line or bars. In the given example a line chart would be somewhat misleading because lines are normally used to show a trend, illustrated by the decline or rise of the line. In cases of nominal data, which only give the total percentage of an item without saying whether it is increasing or decreasing, line graphs are not suitable; rather, pie or bar charts should be used.
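The percentages in the place-of-residence table can be derived from the raw answers with a simple count. The following is a minimal sketch; the ten answers are hypothetical and chosen only so that they reproduce the 30/40/30 split of the example.

```python
from collections import Counter

# Hypothetical raw answers: 3 camp, 4 village, 3 city.
answers = ["camp"] * 3 + ["village"] * 4 + ["city"] * 3

counts = Counter(answers)
percent = {place: 100 * n / len(answers) for place, n in counts.items()}
print(percent)  # {'camp': 30.0, 'village': 40.0, 'city': 30.0}
```

Because the categories are nominal, the order of the keys carries no meaning; listing them in a different order changes only the appearance of the table or chart.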

A graphical summary of continuous scale data needs to be based on a summary in table format, showing the categories as intervals and the percentage for each interval, adding up to 100 percent. Many software programs, such as EXCEL, do the calculation automatically and often offer a graph straight away without showing the table. The data itself can be illustrated in charts just like the above example.

Age        Percent
25 - 29    20
30 - 34    25
35 - 39    15
40 - 44    10
45 - 49    30
Total      100%

One will notice from the table that there is a sense of order, from the smallest number to the largest. The table gives the distribution of the data, i.e., the areas of concentration of the data. The identification of concentrations can be very important when it comes to decision-making. For example, knowing about the age concentration of a population can be important in planning for certain target groups.

Cumulative Percentages

Another way of summarizing continuous scale data is the 'cumulative percentage histogram'. This very important method is often found in reports and statistics books; the underlying table lists the categories with their percentages and the cumulative percentages, i.e., the running total of the percentages up to and including each category. For this kind of graph a line chart should be used - the computer will do this, but one still has to understand what it means in order to be able to interpret the output. The final result is the most important thing.

Age        Percent    Cumulative Percentage
25 - 29    20         20
30 - 34    25         45
35 - 39    15         60
40 - 44    10         70
45 - 49    30         100

If one wants to know the percentage of people aged 30-39, one could add the 25 percent (category 30-34) and 15 percent (category 35-39) and get the result of 40 percent. Another way of doing it is to look at the cumulative percentage up to age 39 (20 percent + 25 percent + 15 percent), which is 60 percent, and then subtract 20 percent (category 25-29); the result is also 40 percent. For the age group 30-44, one takes the cumulative 70 percent and subtracts 20 percent (category 25-29), which directly and quickly gives 50 percent.
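The subtraction rule can be sketched in a few lines using the figures from the age table. The function name share_between is an assumption made for illustration only.

```python
from itertools import accumulate

labels = ["25-29", "30-34", "35-39", "40-44", "45-49"]
percent = [20, 25, 15, 10, 30]
cumulative = list(accumulate(percent))  # [20, 45, 60, 70, 100]

def share_between(first, last):
    """Percent of people from category `first` through category `last`."""
    i, j = labels.index(first), labels.index(last)
    below = cumulative[i - 1] if i > 0 else 0  # cumulative percent before `first`
    return cumulative[j] - below

print(share_between("30-34", "35-39"))  # 40
print(share_between("30-34", "40-44"))  # 50
```

However many categories lie in between, only two lookups and one subtraction are needed.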

If there is a large number of categories, and one wants to know the percentage between two categories, the two cumulative numbers are subtracted to obtain the percentage wanted. This is important, for example, in the Central Bureau of Statistics, when working on the age distribution of the Palestinian population. All people are put down according to their age starting with one year, two years, three and so on, until the highest age. This adds up to three or four pages of percentages. If someone then wants to look at the percentage of those between 18 and 50, he would have to add all the percentages of the ages between 18 and 50, which would be a major task. However, if the cumulative percentages are listed, one only needs to look at the cumulative percentage for 50 years and subtract from it the cumulative percentage for those younger than 18. In other words, the cumulative percentage helps extract information quickly, and makes calculations easier.

Percentile

The percentile is important for someone who wants to know the value below which a particular percentage of the data falls. Suppose one wants the 70th percentile of the data, i.e., the age that 70 percent of the people are at or under. This is found by looking for the point where the cumulative percentage reaches 70 in the table. Supposing this is the case at the age of 44, then 44 constitutes the 70th percentile of the data. P70 or P10 means the age at or below which 70 percent or 10 percent of the people fall, respectively.
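Reading a percentile off a cumulative table can be sketched as follows, using the cumulative percentages from the age table above. This is a simplified lookup that returns the upper end of the first category whose cumulative percentage reaches the requested level; statistical software usually interpolates within the category instead.

```python
upper_ages = [29, 34, 39, 44, 49]    # upper end of each age category
cumulative = [20, 45, 60, 70, 100]   # cumulative percent from the age table

def percentile_age(p):
    """Upper age of the first category whose cumulative percent reaches p."""
    for age, cum in zip(upper_ages, cumulative):
        if cum >= p:
            return age
    raise ValueError("p must be between 0 and 100")

print(percentile_age(70))  # 44
print(percentile_age(50))  # 39
```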

Working out the percentile is also an important process when marking exams or scoring of some sort. For example, there is a pupil whose mark in an exam was 75. In order to judge this score and know if it is a good mark or not regarding the students as a whole, one will look at the percentiles. If P90 was 70 in this exam, this means that 90 percent of the students scored 70 or less than 70. Thus, someone with a mark of 75 was among the top ten percent of the pupils, which means that this score is an excellent score. The process of summarizing data according to percentiles shows the distribution of the data and allows one to make a judgement.

Prices are also a good example. How can one know if a particular trader is expensive or cheap with regard to a particular commodity? Suppose the information about one kilo of apples is that P90 is NIS 6. That is to say, 90 percent of traders sell a kilo of apples for NIS 6 or less. A trader who sells a kilo for NIS 6.5 is then among the top 10 percent, i.e., expensive. Without percentile data such a judgement cannot be made.

Another example that illustrates the importance of percentiles is warranties. For example, companies that produce car batteries usually give a warranty (whereby 18 months is the norm). The producing company has to make a decision with regard to the extent of the warranty, whereby the length of the warranty is connected to a number of factors, such as the competition in the market and the return rate. There should always be a life-time analysis on the batteries, for example, to find out what is the age under which five percent will have died, and the age by which 50 percent will have died, etc. Before giving a warranty, companies often do experiments using cumulative percentages, in order to get data - such as P50 = five years, P10 = two years, and P5 = one and a half years – on which they base their decision-making. If they know, for example, that five percent of the batteries will die before one and a half years, and ten percent will die before two years, and then fix the warranty at two years, they expect that ten percent of the batteries will be returned. If they put one and a half years on the warranty, they will expect five percent to be returned. So the decision will be about what percent of the production should be taken back. It is a simple calculation about profit and loss. With a two-year warranty and a return rate of ten percent the company might fail, for example. A 5-year warranty with an expected return of 50 percent is already ludicrously dangerous. Thus, experimental studies are carried out from which one can ascertain with high confidence what the rate of return should be. A company might decide on a one and a half-year warranty, based on a fixed percentile (return of five percent of the batteries) because of competition. This is the concept of the percentile: looking at data and analyzing it.
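The warranty decision in this example amounts to a lookup in the table of lifetime percentiles. The sketch below uses only the three percentiles quoted above; the function name and the rule of taking the largest known percentile within the warranty period are assumptions for illustration, and a real analysis would use the full lifetime distribution.

```python
# Lifetime percentiles from the example: P5 = 1.5, P10 = 2, P50 = 5 years.
lifetime_percentiles = {5: 1.5, 10: 2.0, 50: 5.0}

def expected_return_percent(warranty_years):
    """Largest known percent of batteries that die within the warranty period."""
    reached = [p for p, years in lifetime_percentiles.items()
               if years <= warranty_years]
    return max(reached, default=0)

for warranty in (1.5, 2.0, 5.0):
    print(warranty, expected_return_percent(warranty))
# 1.5 5
# 2.0 10
# 5.0 50
```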

The percentile is also important with regard to insurance companies. There is an entire field specific to insurance called actuarial statistics, which studies how to make decisions for insurance companies so that they maximize their profit.

Numerical Summaries

When summarizing continuous scale data numerically there are two basic types of numerical summaries:

Measures of central tendency, and

Measures of variation.

The first type shows where the center of the data is, and the second type shows how widely the data is spread around that center. Measures of central tendency look at the average or the mean of the data. For example, the mean consumption of a Palestinian family is about JD 655 per month (as opposed to JD 890 in Jerusalem).

When looking at such socioeconomic data it is important to bear in mind the difference between consumption and expenditure. Consumption calculates everything one consumes; it is not the income. The average monthly Palestinian household expenditure comes out at JD 604.

The average is the sum of the total expenditure of every family in the sample, divided by the total number of families. The result is the center of the data. The median reflects the middle of the data: it is the point below and above which 50 percent of the sample is located. In other words it is P50, the 50th percentile. The median is a better summary for consumption data than the mean. The mean is generally not used when summarizing information about wages, for example, because it is affected by outliers: if there is a small part of the population that spends a lot of money, this affects the mean. Again it depends on the data. If the data is symmetric and its shape is balanced, the mean is good. But if the data is skewed either to the right or the left, or if a small part of the data lies far to the right or left, the mean is misleading because it no longer reflects the middle of the data.

If looking at the consumption data, one would expect the median to be smaller than the mean, because the data for consumption is affected by the few who have a high income and high consumption. The median in the above example will come out at approximately JD 400. When summarizing continuous scale data it is always good to look at both the mean and the median. A big difference might result from a data entry error (which needs to be corrected) or from the fact that there are natural outliers in the data. Outliers have to be examined closely; there are a number of reasons for outliers in collecting information. Measurement error, for example, appears when one takes information from particular people who give incorrect numbers. Data entry errors often occur when transferring information from a questionnaire to the computer and need to be checked and corrected. Measurement errors cannot be dealt with except by deleting the affected points because otherwise they may well ruin the entire analysis.
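The effect of a single extreme value on the mean, and its lack of effect on the median, can be seen in a small sketch. The seven household figures are hypothetical and were chosen only to illustrate the point; they are not survey data.

```python
from statistics import mean, median

# Hypothetical monthly consumption (JD) for seven households; one is very high.
consumption = [300, 350, 380, 400, 420, 450, 2600]

print(mean(consumption))    # 700
print(median(consumption))  # 400
```

Six of the seven households consume less than the mean, so the mean describes none of them well, while the median sits in the middle of the data.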

Another example is statistics that show the average working hours in the week, the average number of working days in the month, the average daily wage, etc. If one wants to make a decision as to where to work and invest his time, he could look at the various types of work and the differences by sector. For Palestinian workers in Israel, for example, the statistics show that the average daily wage is highest in building and construction (NIS 84). A look at the numbers for the sector agriculture, hunting, forestry and fishing shows that the median is less than the mean, i.e., the data is skewed to the right: a small number of relatively high wages pulls the mean up, so more than half of the workers earn less than the mean.

Statistics are also used to monitor certain processes, whether they are related to the whole society or to particular concerns. Understanding the collected data means understanding the underlying problems and the changes and improvements that have occurred. Data has to be looked at very carefully and critically, especially when it is used for decision-making. Findings should always be double-checked, both to discover their causes and to catch data entry errors. One has to isolate the other factors and then come to a conclusion about the shift that occurred.

It is very important to learn how to design and store data and how to present the results so that they are understandable. Everything is done from a certain baseline against which it is then measured again and again to monitor progress. The baseline preparation goes along with certain definitions and design elements that will stay throughout the whole process. Changing these definitions or the design would lead to problems, especially with statistical indicators. It might then happen that someone looks at the problem of unemployment in a particular period and finds 19 percent, someone else finds 40 percent, a third one 30 percent, and so on. In examining an issue, the same definitions and methodologies should be used; otherwise it means talking about different things.

Variations – Standard Deviation and Interquartile Range

Among the important measures of variation that a computer calculates (e.g., with EXCEL or SPSS) when using continuous scale data are the standard deviation and the interquartile range.

The standard deviation shows the amount of variation in the data set, i.e., the spread of the data. For example, there are two data sets - one showing the marks of students in a particular subject, and the other the marks of students in another subject - with the average of both subjects being 75. The average alone does not provide any information on the differences between the two sets, i.e., their variation. The picture that emerges by means of the average is misleading because it is the same even though the two data sets vary differently. Thus, a new measure must be defined, which is the standard deviation. Data set number one, for example, consists of the individual marks 60, 65, 75, 85, 90, the average of which is 75. Data set number two consists of the marks 70, 72, 75, 78, 80, which also average 75. This shows that the average alone does not give an adequate picture. Comparing the two data sets shows that the marks in data set one are more spread out than those in data set two. This spread is what the standard deviation measures. The standard deviation for the first set is greater than the one for the second set: S1 is roughly 12.7 and S2 roughly 4.1 (sample standard deviations), around the same mean (x-bar = 75).
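The spread of the two sets of marks can be checked directly. The following is a minimal sketch using Python's standard library; stdev divides by n - 1 (the sample standard deviation, which is what most statistical packages report by default), while pstdev divides by n (the population standard deviation).

```python
from statistics import mean, pstdev, stdev

set_one = [60, 65, 75, 85, 90]
set_two = [70, 72, 75, 78, 80]

for marks in (set_one, set_two):
    print(mean(marks), round(stdev(marks), 2), round(pstdev(marks), 2))
# 75 12.75 11.4
# 75 4.12 3.69
```

Both sets have the same mean, but by either definition the first set varies about three times as much as the second.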

For many data sets, an interval of two standard deviations on either side of the mean is used. Suppose the average mark is 75 (x-bar = 75) and the standard deviation is 5 (S = 5). If the marks are roughly normally distributed (bell-shaped), about 95 percent of the data falls within two standard deviations on either side of the mean; thus, one knows straight away that about 95 percent of the students got marks between 65 and 85: two standard deviations of five come to ten, and ten is subtracted from and added to 75. Another example: if a boy tells his father that he got 85 in the exam, what does this mean? It means that the boy's mark is at the edge of the top 2.5 percent, because about 95 percent of the pupils got marks in the interval between 65 and 85. Thus, about 2.5 percent of the pupils got marks below 65, and about 2.5 percent got marks above 85. In other words, those who got 85 or more are in roughly the top 2.5 percent.
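The two-standard-deviation rule can be checked numerically under the assumption that the marks follow a normal (bell-shaped) distribution; for data with a different shape the percentages will differ. A minimal sketch:

```python
from statistics import NormalDist

# Assumption: marks are normally distributed with mean 75 and standard deviation 5.
marks = NormalDist(mu=75, sigma=5)

low = marks.mean - 2 * marks.stdev
high = marks.mean + 2 * marks.stdev
inside = marks.cdf(high) - marks.cdf(low)   # share of marks between low and high
above = 1 - marks.cdf(high)                 # share of marks above high

print(low, high, round(inside, 3))  # 65.0 85.0 0.954
print(round(above, 3))              # 0.023
```

The exact figure is about 95.4 percent inside the interval and about 2.3 percent in each tail, which the text rounds to 95 percent and 2.5 percent.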

This subject of standard deviation and normalization is very important; it is used, for example, when talking about standard weights for babies, i.e., to judge whether babies are underweight or overweight. The doctor has an x-bar chart for one-year-old babies, for example, that shows the average and two standard deviations on either side. A weight more than two standard deviations above the average means overweight; more than two standard deviations below means underweight. In Palestine, there is still a problem because only US or European charts are used to determine child weights and according to these, some 80 percent of Palestinian babies are underweight. It would be easy to develop a Palestinian chart, knowing the average weight for a Palestinian one-year-old, two-year-old, etc., together with the standard deviations.

Another example is the statistics for consumer prices prepared by the Central Bureau of Statistics. They provide the averages for the prices, but still not the standard deviation. Looking only at the averages, for example, the average price for a kilo of rice, does not allow one to make a judgement. There are always people who sell above or below the average, but without the two standard deviations one cannot say whether a shopkeeper is cheating or not.

In summary, there are nominal, ordinal and continuous types of data; nominal and ordinal data are summarized by means of percentages and continuous data by means and standard deviations.

The interquartile range is another measure of variation that the computer can calculate. As mentioned above, there are two measures of central tendency: the mean and the median. One is affected by outliers, the other is not. It is the same with the measures of variation. There are two methods: the standard deviation, which is affected by outliers, and the interquartile range, which is not. A simple definition of the interquartile range is: the 75th percentile minus the 25th percentile. It is a robust measure of variation that tells how much the middle half of the data spreads.
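The different sensitivity of the two measures can be illustrated with a short sketch. The figures are hypothetical; the helper iqr is an assumption for illustration and relies on statistics.quantiles, whose default interpolation method may give slightly different quartiles than EXCEL or SPSS.

```python
from statistics import quantiles, stdev

def iqr(data):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1

# Hypothetical values; the second list replaces the largest value with an extreme one.
typical = [300, 350, 380, 400, 420, 450, 480]
with_outlier = typical[:-1] + [2600]

print(round(stdev(typical), 1), iqr(typical))
print(round(stdev(with_outlier), 1), iqr(with_outlier))
```

In this sketch the standard deviation grows more than tenfold when the extreme value is introduced, while the interquartile range stays the same.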

Cross Tabulations

Cross tabulation is used in questionnaires, surveys and monitoring processes that contain many statistical details. In public opinion polls, for example, people are asked a wide range of questions on economic, political, and social issues, and - as in the case of a Nablus-based research center that conducts polls among Palestinians - on specific topics such as the peace process. The answers to these questions are used to evaluate the Legislative Council, the extent to which people support the Oslo Accords, or to assess the performance of the PNA on the whole.

In order to learn about the perception and opinion of the public, the people are not only asked whether they support the peace process (on a nominal yes-or-no scale), but also about their demographic background. Support for the peace process or whether one prefers Israeli or Palestinian products are variables, measured as nominal scale values. In the data analysis one then has to define a so-called dependent variable as well as the independent variables. Subject matter variables are dependent; then they are checked against certain independent or background variables in order to see how they change at different levels. For example, if ‘support for the peace process’ is the dependent variable one might want to find out how the support is connected or changes according to demographic features (background variables). These independent variables could be sex (male or female), region (Gaza, West Bank), age, educational level, place of residence (camp, village, city), etc. In principle, it can be said that the variable ‘educational level’ makes a real difference while age does not usually have such a big impact.

Another good example in this context is the distribution of workers by place of work and educational level. The statistics show that six percent of those who have 13 years of education and more work inside Israel, along with 17 percent of those who have 10-12 years of education, and - the highest percentage of all - those who have 7-12 years of education. This indicates that the educational level is connected to employment in Israel: Palestinians with middle or lower education are more likely to work in Israel than those with higher education. The dependent variable here is the place of work; the independent variable is the number of years of study.

In the survey itself one first has to draw samples from areas in Gaza and the West Bank, then administer the questionnaires and summarize the data. The first level of the summary looks at basic results, for example, to see the proportion of people who support the peace process versus the proportion of people who are against it. Then the percentages of responses by sex, age, educational level, etc. are looked at for each question to see how support of the peace process is affected by these independent variables. To do this, the method of cross tabulation is used. There are different ways of presenting a cross tabulation, the main ones being column percent, row percent, and total percent. The way of presentation is important; the Central Bureau of Statistics mostly uses row percent or column percent.

For example, one poll shows that 60 percent of the Palestinian people in Gaza and the West Bank support the peace process. If this figure is cross-tabulated with the educational level, a pattern appears: the higher the educational level, the lower the support for the peace process. This is a good indication of the fact that everyone thinks within his/her own environment. Every human being lives in a particular circle, sees only what is around him - which is not necessarily representative of the entire population - and does not see the whole picture. The broader picture only becomes clear when looking at statistics.

If one looks at geometric shapes from different angles he gets different pictures: if he looks straight at a cylinder from the side, for example, he will see a rectangle; if he looks at it from above, he will see a circle. It is the same with people looking at situations and circumstances; they see them differently, according to their angles, whereby neither view is right or wrong. The problem in polls and questionnaires is that each person answers through his or her own demographic background, socioeconomic status, educational level, etc., which are all factors that put someone in a specific position, meaning that he will see something in a specific way.

Doing a survey means gathering opinions or collecting ideas. Two people who look at exactly the same thing might see something completely different because each one does his own filtering process. The mind of human beings filters information according to a person’s experience and his interaction with his environment. This filter decides which and how information is absorbed. Everyone thinks that his own opinion is the correct one and reflects the way everyone else thinks, but this is not the case. People have different pictures, perceptions and scenarios, i.e., very particular views. Thus, in order to collect information that in the end will display a comprehensive picture, a survey must be designed scientifically and take into account a representative sample from the society. Representative means that all sectors in society must be covered, i.e., the different age groups, the educated and uneducated, etc.

It is very important to prepare the field researchers who will conduct the surveys properly, and to train them how to pose a question in order to obtain the right answer. There is a difference between what people actually think and what they tend to answer because they often give the answers that they think the interviewer wants to hear.

Presentation of Data

Once the data has been processed and computerized the next task is to look at the data and interpret it in terms of frequencies, cross tabulations, etc.

 

          Support Peace Process    Oppose Peace Process    Total
Male      50                       50                      100
Female    70                       30                      100
Total     120                      80                      200

The above cross table shows 50 males that support the peace process, and 50 that oppose it while there are 70 females for and 30 against. The total number of interviewees was therefore 200, 100 men and 100 women. The table shows how many people support the peace process, how many are against it, and how men and women differ in their opinion on this subject.

The data can also be presented as row percent by dividing each cell by the total sum of the row. For males, this means 50 percent support and 50 percent are against the peace process (total = 100 percent), while among the females, 70 percent support and 30 percent are against it (total = 100 percent). The column percent works according to the same system, dividing each cell by the total of its column. Whether one has the computer calculate the row percent or the column percent depends on what is more useful in a given context. Here, the row percent was used to see how the sex factor affects support or non-support for the peace process. The result is that among females support is higher than among males, which suggests that the sex factor does affect the issue and that male and female opinions differ noticeably.

The total percent can be obtained by dividing each cell by the total, i.e., 200 in the example used here. This means 25 percent of males support the peace process, 25 percent are against it, and 35 percent females support the peace process while 15 percent do not. The total of all the entries is 100 percent. The total percent shows where the percentage of support is, amongst the males or females. The row percent indicates that among males, there is 50 percent support and among females 70 percent. The total percent indicates the chance - if picked randomly - of someone being male and supportive of the peace process is 25 percent. Equally, the chance that this person will be female and against the peace process is 15 percent. The wording is important here. The chance of someone in the male population being supportive of the peace process is 50 percent; the chance of someone from the female population is 70 percent.
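The three ways of expressing the same cell can be sketched with the counts from the cross table above. The function names are assumptions made for illustration; statistical packages offer the same three options under similar names.

```python
counts = {
    "Male":   {"Support": 50, "Oppose": 50},
    "Female": {"Support": 70, "Oppose": 30},
}
grand_total = sum(sum(row.values()) for row in counts.values())  # 200

def row_percent(sex, answer):
    """Cell divided by the total of its row (all respondents of that sex)."""
    return 100 * counts[sex][answer] / sum(counts[sex].values())

def column_percent(sex, answer):
    """Cell divided by the total of its column (all respondents giving that answer)."""
    return 100 * counts[sex][answer] / sum(r[answer] for r in counts.values())

def total_percent(sex, answer):
    """Cell divided by the grand total of all respondents."""
    return 100 * counts[sex][answer] / grand_total

print(row_percent("Female", "Support"))               # 70.0
print(round(column_percent("Female", "Support"), 1))  # 58.3
print(total_percent("Female", "Support"))             # 35.0
```

The same 70 women thus appear as 70 percent of women, about 58 percent of supporters, or 35 percent of all interviewees, which is why the wording of a result must state which base was used.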

In the above example the dependent variable is support or opposition to the peace process, and the independent variable is sex (male/female). The resulting data shows how the dependent variable shifts across the different levels of the independent variable. Reading a table is not enough; the findings should be analyzed to understand the factors that brought about the results. For example, the fact that women support the peace process more than men (which is a real example) might be explained as follows: women suffered a lot during the Intifada; their husbands and sons were in prison or had been killed, which made their life more difficult, and now after the Intifada, they do not want to return to such times. The final step that should follow the analysis is drawing conclusions from the data that can be used for (future) decision-making.

Another example from the health sector will illustrate the above once more. The goal of the survey was to see the proportion of diabetics in Palestine. The dependent variable is thus the diabetic status (diabetic/non-diabetic), and the independent variable, the district (Ramallah, Hebron, Jerusalem).

 

             Diabetics    Non-Diabetics
Ramallah     0.15         0.85
Hebron       0.20         0.80
Jerusalem    0.17         0.83

Since the underlying question is whether the proportion of diabetics changes from district to district, the row percent is used here. The column percent would state how diabetics versus non-diabetics are distributed across the different areas but nothing about the proportion of diabetics in Hebron versus Jerusalem, for example. With row percent each area adds up to 100 percent, which makes the process of comparing easier. As a general rule, if the dependent variable is placed along the top, as is usual in statistical software, one has to ask for row percent.

The conclusion from the above table is that the district has little to do with the prevalence of diabetes, since the proportions are similar in all three districts. The highest prevalence, however, is in the Hebron district, so if there were a $100,000 budget to fight diabetes, Hebron would be a priority area. The question is how much of this amount would go to Hebron.

One way of looking at it is to take the percentage in Hebron, divide it by the total of the percentages from every district and multiply it by 100,000:

Hebron allocation = 0.20 / (0.15 + 0.20 + 0.17) x $100,000 ≈ $38,462

However, this kind of calculation does not take into consideration the population size. Thus, what should be done is to take 0.20 and multiply it by the population of Hebron in order to get the number of diabetics in Hebron; then to take 0.17 and multiply it by the population of Jerusalem to get the number of diabetics in Jerusalem, etc. Then all the numbers for diabetics in the different areas must be added to get the total number of diabetics. The number of diabetics in Hebron must then be divided by this total number of diabetics and multiplied by 100,000. Unless this method is used, an area with a large population will be underrepresented. The other method would be fine if the population size were the same for each district.
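The two allocation rules can be compared in a short sketch. The prevalence figures come from the table above; the population sizes are hypothetical round numbers used only to show the mechanics, and the function name allocate is likewise an assumption.

```python
budget = 100_000
prevalence = {"Ramallah": 0.15, "Hebron": 0.20, "Jerusalem": 0.17}
# Hypothetical population sizes, for illustration only (not from the survey).
population = {"Ramallah": 200_000, "Hebron": 400_000, "Jerusalem": 300_000}

def allocate(weights):
    """Split the budget in proportion to the given weights."""
    total = sum(weights.values())
    return {d: round(budget * w / total) for d, w in weights.items()}

# Rule 1: weight each district by its prevalence only.
by_prevalence = allocate(prevalence)

# Rule 2: weight each district by its estimated number of diabetics.
diabetics = {d: prevalence[d] * population[d] for d in prevalence}
by_cases = allocate(diabetics)

print(by_prevalence["Hebron"])  # 38462
print(by_cases["Hebron"])       # 49689
```

With these assumed populations Hebron's share rises from about 38 percent to about 50 percent of the budget once its larger population is taken into account.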

The same method is often used in allocating project funds. There is a certain cycle that any project has to pass through, starting with a baseline survey, which provides the necessary indicators. After these indicators have been explained, the process of intervention begins. A survey is always done for the present situation, followed by intervention in a particular way, by means of a project or program. After the intervention, there is monitoring and evaluation, then more intervention and so on. There are many statistical matters that one has to be careful about when looking at the data.