GSS Data project instructions

Preliminary instructions for the GSS paper project.

Revision 4/10/2024

Project:

Use the General Social Survey Cumulative File to do a brief analysis. In the paper you will relate the analysis to readings we have done.

We will be using Berkeley’s SDA website (SDA stands for Survey Documentation and Analysis). You may need Excel on your local machine if you want to reformat graphs, and you will need to use a word processor like Word to combine the figures and tables with your own commentary and analysis. The website will do the analysis and generate the tables and figures for you. We will not be doing any fancy statistical analysis or complex manipulations of the data. We simply want to study some basic summary statistics about American attitudes.

Basic web link:

http://sda.berkeley.edu/archive.htm

The latest data link is GSS cumulative datafile 1972-2022:

https://sda.berkeley.edu/sdaweb/analysis/?dataset=gss22rel2

Older versions of the data are also available, but you should use the most recent version if you can.

(Older versions of the GSS, 1972-200X, which used an older version of the SDA software, are also available in case you have problems with the newer version above.)

The SDA webpage has links to general help, and help about data recodes.

Things to know about the GSS, and to keep in mind:

* The GSS was fielded roughly annually from 1972 to 1993 and has been fielded roughly every other year since 1994, through 2022.

* Use the “Search” command to search for keywords to identify variables that you may want to use.

* There are thousands of variables available to study, but not all variables are available for all years. You are going to have to spend time searching for variables that are of interest to you. Be sure to tabulate your variable against year to know which years are available. That is, make a table where “year” is the row variable and your variable of interest is the column variable. Run the table to see what years are available for your variable. You don’t have to include year as one of the variables in your analysis, but you should know (and report in your paper) in what years your question(s) were asked.

* Always check the box for “Question Text” so that the full text of the question appears along with your table and chart.

* If you are dealing with a variable that has lots and lots of levels, you may need to recode the variable to a more manageable set of categories. See http://sda.berkeley.edu/HELPDOCS/helpan.htm#recode

For instance, the feeling thermometers (how do you feel about Catholics...) give respondents a choice of any value from 0 to 100, with 100 corresponding to the most positive feeling. If you tabulate a variable like “CATHTEMP” without recoding, you get a mess, with a separate category for every distinct value. You need to recode it into something like 3 categories (roughly interpreted as don’t like Catholics, don’t mind Catholics, and like Catholics), so CATHTEMP(r: 0-40; 41-60; 61-100). You probably will also want to add labels to the categories so that the tables are more readable, so CATHTEMP(r: 0-40 "do not like Catholics"; 41-60 "lukewarm about Catholics"; 61-100 "like Catholics"). (Note the regular double quotes rather than the MS Word “smart quotes”; SDA doesn’t like smart quotes.)

Note: the same recoding syntax works when you are using variables as Controls. If you want to use AGE as a control (to analyze separately for young and old, for instance), you could write AGE (r: 18-40; 41-89) and you would get two separate tables for young and old. If you just put AGE as a control, you will get a different table for every possible age value, which is probably not what you want.

If you Control for AGE (r: 18-40; 41-89) that will have the same effect as Filtering for AGE(18-40) and then Filtering for AGE(41-89).
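If it helps to see the recode logic spelled out, here is a minimal sketch in Python. It is only an illustration of what the recode syntax above does; SDA does the recoding for you on the website, and you do not need to write any code for this project. The function name `recode` and the example values are made up for this sketch and are not real GSS responses.

```python
# Illustrative sketch only: SDA performs recodes for you on the website.
# The bins mirror CATHTEMP(r: 0-40; 41-60; 61-100) and AGE(r: 18-40; 41-89).

def recode(value, bins):
    """Return the label of the first (low, high, label) bin containing value.

    A value that falls in no bin (for example a missing-data code)
    returns None, so it is dropped rather than treated as a real answer.
    """
    for low, high, label in bins:
        if low <= value <= high:
            return label
    return None

CATHTEMP_BINS = [
    (0, 40, "do not like Catholics"),
    (41, 60, "lukewarm about Catholics"),
    (61, 100, "like Catholics"),
]

AGE_BINS = [(18, 40, "18-40"), (41, 89, "41-89")]

# Made-up thermometer scores, chosen to sit on the bin boundaries.
temps = [0, 40, 41, 60, 61, 100]
print([recode(t, CATHTEMP_BINS) for t in temps])

# Controlling for recoded AGE splits respondents into the same two groups
# that filtering for AGE(18-40) and then for AGE(41-89) would give.
ages = [18, 35, 40, 41, 67, 89, 99]  # 99 stands in for a missing-data code
groups = {}
for a in ages:
    label = recode(a, AGE_BINS)
    if label is not None:
        groups.setdefault(label, []).append(a)
print(groups)
```

Notice that the stand-in missing code 99 lands in neither age group, which is the behavior you want to confirm for your own variables.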

* Missing data codes. Variables like AGE have actual values (18-89), and then there are other codes such as 0, 98, and 99 which mean “missing data.” You want to make sure that you are not treating the missing values as real ages. Ordinarily, it seems that SDA takes care of this for you, but you need to make sure. Look up each variable you are using (start by SEARCHing the variable name, then VIEWing the variable), note what the missing value codes are, and make sure you are not inadvertently treating them as real values.

* Also note: GSS has a variable for the year the respondent was born, called COHORT. COHORT ranges from 1883 to 2004, so think about recodes like COHORT(r: 1883-1940; 1941-1960; 1961-2004). For some analyses, COHORT is more sensible as a control than AGE. Note that COHORT, YEAR, and AGE are all closely related. You have to think carefully about age (life-course) versus period versus cohort effects when you are trying to explain social change. Also realize that with data like the GSS, you often cannot disentangle age, period, and cohort effects. But do try to keep clear in your own language about which one you are talking about, and why.

* SDA doesn’t seem to care whether you capitalize the variable names or not.

* Percentages are crucial because the sample size of the GSS varies from year to year, and not every question is asked of every respondent. So if you want to know whether Americans’ attitudes about Catholics have changed over time, you want to know what percentage of respondents in each year “didn’t like” Catholics, for instance. If year is your row variable, you should probably be percentaging by row. Look at where the percentages add up to 100%, and ask yourself if that is what you want.

* The weights in GSS are designed to correct for things like household size and for the oversampling of blacks that took place in a couple of rounds of the GSS. The weights don’t make a huge difference, so my advice is to ignore the weights (that is, select “No Weight” in the weight option).

* Know your sample sizes! The GSS is a relatively small survey in terms of sample size (on the order of 3,000 subjects per year), and many of the interesting questions were only asked in one year, sometimes only of half the total sample. So if you want to say something about small minority groups like Buddhist accountants, be aware that there may be too few to analyze. Look at your tables of unweighted frequencies to see how many respondents there are in each category. Report the actual numbers if they are small, and use common sense to figure out if the number is too small to allow for strong conclusions. Don’t let yourself be fooled by apparently high or low percentages that are based on samples of 2 or 3 people. When you use filters or controls, the numbers in each table or figure will get smaller.

* You can copy the tables and figures that SDA produces directly into your paper (in Windows, it is right-click, copy, then paste, or use the “snipping tool” that is among the Windows accessory programs, at least in Windows 7). When pasting, please paste as a picture or as a bitmap, because otherwise your graphic may not be viewable when I try to open the file. If in doubt, you can always convert your file to an Acrobat pdf file, which tends to embed all graphics in a readable way. Also, you certainly may retype the output into Excel and that will give you more control over how the figures look.

* One of the limitations of the figures that SDA produces is that if you have 4 years of data for a certain question, let’s say 1988, 1990, 2006, 2008, the SDA figures will put the 4 years equally spaced along the X-axis, thereby treating year as a categorical rather than as a continuous variable. This can give you a somewhat misleading view of change over time. You should be aware of this limitation when you write about the figures, or you can re-type the percentages or tabular output into Excel and produce your own scatter plots that will treat continuous variables like year correctly.

* You can only control for one variable at a time in SDA, but you can use several variables in combination as filters.
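To make the points above about tabulating against year, percentaging by row, and watching your sample sizes more concrete, here is a small Python sketch of the same steps. It is only an illustration: SDA produces these tables for you, the records are made up (they are not GSS results), and the cutoff `MIN_N` is a hypothetical value chosen for this toy example, not a rule.

```python
from collections import Counter, defaultdict

# Made-up (year, answer) records for illustration; these are NOT GSS results.
records = [
    (1988, "like"), (1988, "like"), (1988, "do not like"), (1988, "lukewarm"),
    (1990, "like"), (1990, "do not like"),
    (2006, "like"), (2006, "like"), (2006, "like"), (2006, "lukewarm"),
]

def crosstab(rows):
    """Count answers within each year: {year: Counter({answer: n})}."""
    table = defaultdict(Counter)
    for year, answer in rows:
        table[year][answer] += 1
    return table

def row_percentages(table):
    """Percentage within each row, so every year adds up to 100%."""
    result = {}
    for year, counts in table.items():
        total = sum(counts.values())
        result[year] = {ans: 100.0 * n / total for ans, n in counts.items()}
    return result

MIN_N = 3  # hypothetical cutoff, chosen only for this toy example

table = crosstab(records)
percents = row_percentages(table)

# The years that appear as rows are the years in which the question was asked.
for year in sorted(table):
    n = sum(table[year].values())
    flag = "  <- small n, interpret with caution" if n < MIN_N else ""
    cells = ", ".join(f"{a}: {p:.0f}%" for a, p in sorted(percents[year].items()))
    print(f"{year} (n={n}): {cells}{flag}")
```

The percentages in each row are computed from that year’s own total, which is why years with different sample sizes can be compared, and why a row based on only a handful of respondents deserves a warning.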

Something to keep in mind about the GSS data: The GSS data are generally cross-sectional data, meaning different people are interviewed in each wave. Cross-sectional datasets like the GSS are good at measuring the *prevalence* of phenomena, such as divorce, but not very good at measuring the *incidence* of divorce, or the divorce rate. That is, the GSS can tell us what percentage of US adults had a marital status of divorced at the time of the GSS survey in 1980, but we generally don’t know when those people got divorced, so we cannot infer anything about the annual divorce rate, or even the lifetime divorce rate (because people move back and forth between the divorced and married statuses).

A note about race and ethnicity: The GSS variable for race is RACE. This variable is available for every survey year, but it only identifies 3 groups: White, Black, and Other. If you want to identify Hispanics, there is a separate HISPANIC variable you could use, but this variable is not available for all years, and you would have to recode the variable to combine the different Hispanic national origin groups. As for identifying Asians, the variable ETHNIC is probably your best bet, but it is available for even fewer years, and the total combined number of identifiable Asians across all GSS years is just a few hundred. You could filter for Asians with: ethnic(5,16,31,40). It is important to keep in mind that the GSS is a survey of modest sample size, so for some small minority groups there may not be enough respondents in the GSS to study that group. Take a careful look at the unweighted numbers in the tables that SDA generates. Viewing your variable of interest in SDA will also give you an unweighted count of how many respondents answered the question.

A note about age: All respondents in the GSS are at least 18 years old; there are no minors among the respondents. There are questions about other children living in the respondent’s household, and there are questions about the respondents’ experiences when they were children.

ASSIGNMENTS:

1) What is expected of the GSS proposal:

Proposals should be 1-2 pages of text, in MS Word or Adobe Acrobat format, uploaded to Canvas (upload the draft and the final to the same assignment in Canvas). For the GSS proposal you should identify one variable you will use, along with year. In your proposal you should provide an embedded table or figure showing how your variable changes over time. If your variable of interest was only asked once, cross-tabulate it with another variable like race, education, geographic region, age (you might want to recode age into a few categories) or gender. Axes, variables, and categories should be appropriately labeled. Explain briefly what you want to know about the variable in question. Mention what other variable you are thinking of using as a control. When writing about GSS variables, please include the GSS name (e.g. SOCBAR or DIVORCE) as well as the variable description and full text of the question.

Students sometimes ask: “What GSS variables can be studied?” The answer is that you can focus on any variable as long as you believe you will be able to tie the results somehow back to the reading we have done.

2) What is expected of the GSS paper:

Papers should be 3-6 pages of text, plus at least two tables or figures. Convert your paper to Word or to Adobe Acrobat PDF format to make sure that the graphics are embedded in a way that we can read them, then upload to Canvas. The main thing you want to write about is your chosen GSS variable, and how it varies. This is a short assignment, so you will need to get right to the point. First of all, your paper should respond to and accommodate your TA’s feedback on your proposal. In addition to the table or figure you included in your proposal, you need to include a second analysis, which brings a third variable in as a control. Appropriate controls include things like race (RACE), education (EDUC in years, so 12=HS and 16=BA), geographic region (REGION), religion (RELIG or ATTEND), current marital status (MARITAL), birth cohort (COHORT recoded into categories), political party affiliation (PARTYID), income (REALINC recoded into categories), age (AGE recoded into categories) or gender (SEX), or something else you want to consider. The purpose of the control variable is for you to be able to write something about how the variable you’re interested in varies with respect to the control. A note on income: REALINC is family income in constant 1986 dollars. Assume 200K is the top code so as not to get confused by the missing value codes. Keep in mind that inflation from 1986 to 2018 multiplies dollar values by a factor of about 2.2, so recode this variable accordingly; $50K is an upper middle class income in 1986 dollars. Note this sample recode for income: REALINC(r: 0-25000 "0-25K"; 25001-50000 "25K-50K"; 50001-200000 ">50K").
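To see what the REALINC categories in the sample recode mean in more recent dollars, here is a short Python sketch of the arithmetic. It assumes the approximate 2.2 inflation factor and the 200K top code mentioned above; the function names are made up for this illustration, and SDA itself only needs the recode syntax, not this code.

```python
# Illustrative arithmetic only, using the approximate figures from the text.
INFLATION_1986_TO_2018 = 2.2  # approximate factor quoted above
TOP_CODE = 200_000            # assumed top code, as suggested above

# Bins mirror REALINC(r: 0-25000; 25001-50000; 50001-200000).
REALINC_BINS = [
    (0, 25_000, "0-25K"),
    (25_001, 50_000, "25K-50K"),
    (50_001, TOP_CODE, ">50K"),
]

def to_2018_dollars(realinc_1986):
    """Convert constant 1986 dollars to approximate 2018 dollars."""
    return realinc_1986 * INFLATION_1986_TO_2018

def recode_realinc(value):
    """Return the income category label, or None if the value is out of range.

    Values above the assumed top code match no bin, so missing-value codes
    larger than 200K are not counted as incomes. Note that in this sketch a
    fractional value such as 25000.5 would also match no bin.
    """
    for low, high, label in REALINC_BINS:
        if low <= value <= high:
            return label
    return None

for low, high, label in REALINC_BINS:
    print(f'{label} (1986 dollars): roughly ${to_2018_dollars(low):,.0f} '
          f'to ${to_2018_dollars(high):,.0f} in 2018 dollars')
```

When you describe income categories in your paper, say which dollars you mean (1986 or more recent), since the same category label reads very differently after a factor of about 2.2.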

One key is for you to discuss the results as they really are, not as you want them to be or as you expected them to be given the literature. If your results contradict the literature in some way, that is good; explain the discrepancy.

The last page or so of your paper needs to address the literature we have read in the class. How do your findings agree with or disagree with something we have read? The quality of the paper will depend, in part, on how thoughtfully you use your results to reflect on one or more of the readings.

When writing about GSS variables, please include the GSS name (e.g. SOCBAR or DIVORCE) as well as the variable description and full text of the question.