---------------------------------------------------------------------------------
name: <unnamed>
log: C:\Documents and Settings\Michael Rosenfeld\My Documents\newer web p
> ages\soc_meth_proj3\2010_logs\class_nine.log
log type: text
opened on: 23 Feb 2010, 14:01:05
. clear all
. *(8 variables, 11 observations pasted into data editor)
*It doesn't show up in the log, but to copy data from Excel to Stata, just copy in Excel and paste into Stata's data editor window.
. twoway(scatter x1 y1)
. twoway(scatter y1 x1)
*We wanted (scatter y x): scatter expects the y (vertical axis) variable first, then the x variable.
. twoway(scatter y1 x1) (lfit y1 x1)
* lfit adds the best fit line to the graph
. twoway(scatter y2 x2) (lfit y2 x2)
. *One reason we graph data is so that we can see whether our best fit line or our best fit function really fits the data
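* A possible variation (not part of the recorded session): overlaying a second fit on the same graph can help show when a straight line is the wrong shape for the data. qfit is a standard twoway plot type that fits a quadratic; the legend labels below are illustrative choices, not something from the class.

```stata
* Scatter of y2 on x2 with a linear fit and a quadratic fit overlaid
twoway (scatter y2 x2) (lfit y2 x2) (qfit y2 x2), legend(order(1 "data" 2 "linear fit" 3 "quadratic fit"))
```

* If the quadratic curve tracks the points much more closely than the line, the linear fit is probably not describing the data well.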
. clear
. use "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta",
> clear
. *A free Stata update is available.
. *update all
. *update swap
* To install the update, run the update all command without the asterisk.
. regress incwage lawyers sociologists male if occ1990==178| occ1990==95 | occ199
> 0==125
variable lawyers not found
r(111);
. tabulate occ1990 if occ1990==178|occ1990==95|occ1990==125
                 Occupation, 1990 basis |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
                      Registered nurses |        966       68.37       68.37
                  Sociology instructors |          6        0.42       68.79
                                Lawyers |        441       31.21      100.00
----------------------------------------+-----------------------------------
                                  Total |      1,413      100.00
. tabulate occ1990 if occ1990==178|occ1990==95|occ1990==125, nolab
Occupation, |
 1990 basis |      Freq.     Percent        Cum.
------------+-----------------------------------
         95 |        966       68.37       68.37
        125 |          6        0.42       68.79
        178 |        441       31.21      100.00
------------+-----------------------------------
      Total |      1,413      100.00
* Take a look at my Excel sheet, the worksheet on "more regression fits".
. gen reduced_occ1990=1 if occ1990==95
no room to add more variables because of width
An attempt was made to add a variable that would have increased the memory
required to store an observation beyond what is currently possible. You have
the following alternatives:
1. Store existing variables more efficiently; see help compress.
2. Drop some variables or observations; see help drop. (Think of Stata's
data area as the area of a rectangle; Stata can trade off width and
length.)
3. Increase the amount of memory allocated to the data area using the set
memory command; see help memory.
r(902);
. clear all
. set mem 200m
Current memory allocation

                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.909M
    set memory          200M    max. data space                200.000M
    set matsize         400     max. RHS vars in models          1.254M
                                                            -----------
                                                               203.163M
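* The r(902) message above also lists compress as a way to make room. A minimal sketch of that route, not run in this session, assuming the dataset is already loaded. Note that in Stata 12 and later, memory is managed automatically and set memory is no longer needed.

```stata
* Store each variable in the smallest storage type that still holds its values
compress
* Show the dataset's size and width after compressing
describe, short
```

* compress never changes the data values, only how they are stored, so it is safe to try before raising the memory allocation.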
. use "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta",
> clear
. gen reduced_occ1990=1 if occ1990==95
(132744 missing values generated)
. tabulate occ1990 if occ1990==178|occ1990==95|occ1990==125
                 Occupation, 1990 basis |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
                      Registered nurses |        966       68.37       68.37
                  Sociology instructors |          6        0.42       68.79
                                Lawyers |        441       31.21      100.00
----------------------------------------+-----------------------------------
                                  Total |      1,413      100.00
. tabulate occ1990 if occ1990==178|occ1990==95|occ1990==125, nolab
Occupation, |
 1990 basis |      Freq.     Percent        Cum.
------------+-----------------------------------
         95 |        966       68.37       68.37
        125 |          6        0.42       68.79
        178 |        441       31.21      100.00
------------+-----------------------------------
      Total |      1,413      100.00
. replace reduced_occ1990=2 if occ1990==125
(6 real changes made)
. replace reduced_occ1990=3 if occ1990==178
(441 real changes made)
. label define reduced_occ 1 "nurses" 2 "sociologists" 3 "lawyers"
. label val reduced_occ1990 reduced_occ
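* The gen / replace / label steps above can also be written as a single recode. This is a sketch of an alternative, not what was run in class; reduced_occ_alt is a hypothetical variable name chosen so it does not collide with reduced_occ1990.

```stata
* Build the 3-category variable in one step; every other occupation becomes missing
recode occ1990 (95 = 1 "nurses") (125 = 2 "sociologists") (178 = 3 "lawyers") (else = .), generate(reduced_occ_alt)
* Check the new variable against the original occupation codes
tabulate occ1990 reduced_occ_alt if reduced_occ_alt != .
```

* The value labels are attached as part of the recode, so no separate label define or label val is needed.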
. tabulate occ1990 reduced_occ1990 if reduced_occ1990!=.
     Occupation, 1990 |         reduced_occ1990
                basis |    nurses  sociologi    lawyers |     Total
----------------------+---------------------------------+----------
    Registered nurses |       966          0          0 |       966
Sociology instructors |         0          6          0 |         6
              Lawyers |         0          0        441 |       441
----------------------+---------------------------------+----------
                Total |       966          6        441 |     1,413
. *OK... Now I am ready to repeat the regression from the Excel file.
. desmat: regress incwage reduced_occ1990 male
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable incwage
Number of observations: 1413
F statistic: 83.696
Model degrees of freedom: 3
Residual degrees of freedom: 1409
R-squared: 0.151
Adjusted R-squared: 0.149
Root MSE 42234.929
Prob: 0.000
---------------------------------------------------------------------------------
 nr  Effect                             Coeff          s.e.
---------------------------------------------------------------------------------
reduced_occ1990
  1    sociologists                  -604.948     17320.322
  2    lawyers                      25723.531**    3256.452
male
  3    male                         17003.194**    3422.971
  4  _cons                          36445.550**    1376.531
---------------------------------------------------------------------------------
* p < .05
** p < .01
. *This is just a small income regression. Everyone except the nurses, sociologists, and lawyers is excluded, because reduced_occ1990 is missing for all other respondents.
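* desmat is a user-written add-on. A sketch of the same model using Stata's built-in factor-variable notation, assuming Stata 11 or later; this was not run in the session. It should give the same coefficients, with nurses (the lowest category) as the omitted comparison group.

```stata
* i. expands reduced_occ1990 into dummy variables, omitting category 1 (nurses)
regress incwage i.reduced_occ1990 male
```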
. predict new_model
(option xb assumed; fitted values)
(132297 missing values generated)
* predict creates a new variable for the predicted values of the model, which I named new_model
. table reduced_occ1990 sex, contents (freq mean incwage mean new_model) row col
----------------------------------------------------
reduced_occ1 |                  Sex
         990 |        Male       Female        Total
-------------+--------------------------------------
      nurses |          62          904          966
             | 48602.45161   36777.9281  37536.85197
             |    53448.74     36445.55     37536.85
             |
sociologists |           2            4            6
             |       39200      42662.5  41508.33333
             |     52843.8      35840.6     41508.33
             |
     lawyers |         308          133          441
             | 80236.42208  59704.73684  74044.32653
             |    79172.27     62169.08     74044.33
             |
       Total |         372        1,041        1,413
             | 74743.46774  39729.70893  48947.76858
             |    74743.47     39729.71     48947.77
----------------------------------------------------
. *Note 1: The constant term corresponds to the comparison group for both variables, which is female nurses. Its value (36,445) is the predicted female nurse mean income, not the actual mean (36,778). The two differ because the regression has only one term for the gender gap and therefore assumes the same gender gap in every occupation.
. *Also note that the fitted values reproduce the actual mean of incwage exactly for the whole sample, for each occupational group (totaled across genders), and for each gender (totaled across occupations). Only the means within the occupation-by-gender cells differ.
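* To see where the fitted cell means come from, the coefficients printed in the regression output can be added up by hand. A small sketch using display as a calculator, with the coefficient values typed in from the output above; the results should match the fitted means in the table up to rounding.

```stata
* Female nurses: the constant alone
display 36445.550
* Male nurses: constant + male
display 36445.550 + 17003.194
* Female lawyers: constant + lawyers
display 36445.550 + 25723.531
* Male lawyers: constant + lawyers + male
display 36445.550 + 25723.531 + 17003.194
* Female sociologists: constant + sociologists
display 36445.550 - 604.948
* Male sociologists: constant + sociologists + male
display 36445.550 - 604.948 + 17003.194
```

* In every occupation the male and female predictions differ by the same 17,003, which is the equal-gender-gap assumption at work.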
. rename new_model fitted_values_inwage_1
* Here I am just giving the fitted values for incwage a better and more appropriate name.
. gen residuals=incwage- fitted_values_inwage_1
(132297 missing values generated)
* incwage is the actual wage income for each respondent, and fitted_values_inwage_1 is the predicted income for each respondent. The residual is the actual value minus the predicted value.
. summarize residuals
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   residuals |      1413    -.000951    42190.04  -79172.27   297118.4
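* The residuals average zero because a regression with a constant term forces the mean of the predicted values to equal the mean of the actual values. The tiny nonzero mean shown (-.000951) is rounding error.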
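* Residuals can also be requested directly from predict rather than computed by subtraction. A sketch, not run in this session; resid_check is a hypothetical variable name, and storing it as double is my own choice to reduce rounding error in the mean.

```stata
* Residuals from the most recently estimated model, stored in double precision
predict double resid_check, residuals
* The mean should be zero up to rounding
summarize resid_check
```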
. *we have predicted values and residuals only for the 1413 cases in our 3 occupational groups
* If we want to fit occupational and gender income averages more fully, we need to account for the fact that the gender wage gap is different in each occupational group.
. desmat: regress incwage reduced_occ1990*male
---------------------------------------------------------------------------------
Linear regression
---------------------------------------------------------------------------------
Dependent variable incwage
Number of observations: 1413
F statistic: 50.578
Model degrees of freedom: 5
Residual degrees of freedom: 1407
R-squared: 0.152
Adjusted R-squared: 0.149
Root MSE 42237.424
Prob: 0.000
---------------------------------------------------------------------------------
 nr  Effect                             Coeff          s.e.
---------------------------------------------------------------------------------
reduced_occ1990
  1    sociologists                  5884.572     21165.383
  2    lawyers                      22926.809**    3922.625
male
  3    male                         11824.524*     5545.056
reduced_occ1990.male
  4    sociologists.male           -15287.024     36996.590
  5    lawyers.male                  8707.162      7067.771
  6  _cons                          36777.928**    1404.796
---------------------------------------------------------------------------------
* p < .05
** p < .01
. *The asterisk in reduced_occ1990*male gave us dummy variables not only for occupation and gender but also for the occupation-gender combinations. The model now has 6 terms, one for each of the 6 cells in the occupation-by-gender income table, so it should fit that table exactly.
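* With the interaction terms, each occupation gets its own gender gap. A small sketch using display as a calculator, with coefficient values typed in from the output above, to show how the terms combine.

```stata
* Male-female gap among nurses: the male main effect alone
display 11824.524
* Gap among lawyers: male main effect + lawyers.male interaction
display 11824.524 + 8707.162
* Gap among sociologists: male main effect + sociologists.male interaction
display 11824.524 - 15287.024
* Predicted mean for male lawyers: constant + lawyers + male + lawyers.male
display 36777.928 + 22926.809 + 11824.524 + 8707.162
```

* The constant (36,777.928) is now the actual mean income of female nurses, because the comparison group has a term of its own.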
. predict fitted_income_m2
(option xb assumed; fitted values)
(132297 missing values generated)
. table reduced_occ1990 sex, contents (freq mean incwage mean fitted_income_m2) row col
----------------------------------------------------
reduced_occ1 |                  Sex
         990 |        Male       Female        Total
-------------+--------------------------------------
      nurses |          62          904          966
             | 48602.45161   36777.9281  37536.85197
             |    48602.45     36777.93     37536.86
             |
sociologists |           2            4            6
             |       39200      42662.5  41508.33333
             |       39200      42662.5     41508.33
             |
     lawyers |         308          133          441
             | 80236.42208  59704.73684  74044.32653
             |    80236.42     59704.74     74044.33
             |
       Total |         372        1,041        1,413
             | 74743.46774  39729.70893  48947.76858
             |    74743.47     39729.71     48947.77
----------------------------------------------------
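* The table shows the fitted values from the second model matching the actual cell means. A sketch of how that could be checked without reading the table by eye; this was not run in the session, cell_mean is a hypothetical variable name, and the one-dollar tolerance is an arbitrary allowance for rounding in float storage.

```stata
* Actual mean income within each occupation-by-gender cell
egen cell_mean = mean(incwage), by(reduced_occ1990 sex)
* Stop with an error if any fitted value is more than 1 dollar from its cell mean
assert abs(cell_mean - fitted_income_m2) < 1 if reduced_occ1990 != .
```

* assert prints nothing when the condition holds for every observation, so silence here means the model fits the table.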
. save "C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta", replace
file C:\Documents and Settings\Michael Rosenfeld\Desktop\cps_mar_2000_new.dta saved
. exit, clear