Cuadernos de economía
On-line version ISSN 0717-6821
Cuad. econ. vol.40 no.121 Santiago Dec. 2003
EVALUATING EDUCATION POLICIES WHEN THE RATE OF
RETURN VARIES ACROSS INDIVIDUALS
In this paper I summarize some recent developments in the literature on the econometrics of program evaluation. In particular, I consider the estimation of mean and distributional treatment effects when the impact of the program varies across individuals. I focus the exposition on the study of the returns to education (education is therefore the program being considered). I provide examples from two sets of papers that illustrate the theoretical and practical importance of accounting for heterogeneity in program evaluation. The research summarized in this paper has general applicability and is not limited to the study of education: instead of education as the "program" being considered, we could use the framework to analyze job training, unionism, migration, medical care, and other types of programs.
2. EVALUATING PROGRAMS: MEANS AND DISTRIBUTIONS
Take the standard model of potential outcomes:
$$Y_1 = \mu_1(X) + U_1, \qquad Y_0 = \mu_0(X) + U_0 \qquad (1)$$
where Y1 and Y0 are potential outcomes, say wages, under two different states, which are college (Y1) and high school (Y0) in this paper. X is a vector of observable variables (such as experience and test scores) and U1 and U0 are mean zero unobservables.1 The gain of going from high school to college is the return to college:
$$b = Y_1 - Y_0 = \mu_1(X) - \mu_0(X) + U_1 - U_0 \qquad (2)$$
Returns can vary across individuals because of both X and (U1, U0). The college decision equation (decision equation) is the following:
where Z are variables that influence the decision to enroll in college but not the potential outcomes:
(these are usually called instrumental variables; some examples in the literature on the returns to schooling are distance to college and tuition). In the empirical work reported in the next sections it is also assumed that
(although this assumption is not essential for the analysis; see Heckman and Vytlacil, 2000, 2003). US is unobservable and mean zero. For each person, we only observe potential wages in the state that is actually chosen. For example, if a person is a college graduate, we do not observe the wages he would have earned had he not gone to college. Equation (3) is the reduced form for a general economic model of schooling (with income or utility maximizing agents). Observed wages are:
$$Y = S Y_1 + (1 - S) Y_0 = Y_0 + S b \qquad (6)$$
where b = Y1 - Y0 is the return to college. For each person we observe either Y1 or Y0, but not both, and therefore we cannot compute b.
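The switching structure of observed wages can be illustrated with a minimal Python sketch. The three individuals and their wage figures are hypothetical and chosen only for illustration. In real data only one of the two potential wages is available for each person, so the function `individual_return` below could never be evaluated.

```python
from dataclasses import dataclass


@dataclass
class Person:
    y1: float  # potential wage with college (hypothetical number)
    y0: float  # potential wage with high school only (hypothetical number)
    s: int     # 1 if the person attends college, 0 otherwise


def observed_wage(p: Person) -> float:
    """Observed wage Y = S * Y1 + (1 - S) * Y0."""
    return p.s * p.y1 + (1 - p.s) * p.y0


def individual_return(p: Person) -> float:
    """Return b = Y1 - Y0; computable here only because both wages are made up."""
    return p.y1 - p.y0


people = [Person(30.0, 20.0, 1), Person(18.0, 17.0, 0), Person(25.0, 26.0, 0)]
wages = [observed_wage(p) for p in people]
returns = [individual_return(p) for p in people]
```

The list `wages` is what an econometrician would see; the list `returns` is the object of interest that the data never reveal directly.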
The traditional approach to the estimation of (6) assumes that U1 = U0 and therefore, conditional on X, everybody has the same return to college (see, for example, Griliches, 1977):
$$b = \mu_1(X) - \mu_0(X) \equiv b(X)$$
where μ1(X) and μ0(X) are the components of Y1 and Y0 determined by X in (1). In this case (6) becomes
$$Y = \mu_0(X) + b(X) S + U_0$$
The econometric problem in this equation is that S may be correlated with U0, which is usually called the ability bias problem. The solution is to find an instrument Z for schooling, a variable correlated with S but not with U0. Then we can estimate b(X), which is the same for every individual (conditional on X). The distribution of returns conditional on X is degenerate at b(X) and, once we remove the conditioning on X, it is just the distribution of a function of X. There is no heterogeneity in returns conditional on X, even though there may still be substantial heterogeneity in returns in the population because of X. Suppose that one of the variables in X is a cognitive test score, and that this is an important determinant of both the returns to college (through b(X)) and of the decision to attend college (through the decision equation). Then, in general, f(b(X) | S = 1) ≠ f(b(X) | S = 0), where f is the pdf of b(X).
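The ability bias argument can be checked in a small simulation. The sketch below is an illustration under assumed values, not the procedure of any of the cited papers: the common return of 0.5, the binary instrument, the normal distribution of ability and the enrollment rule are all hypothetical. It compares the OLS estimate with the IV (Wald) estimate when the return is the same for everyone.

```python
import random

rng = random.Random(0)
N = 40_000
TRUE_RETURN = 0.5  # hypothetical common return b, identical for everyone

rows = []
for _ in range(N):
    u0 = rng.gauss(0.0, 1.0)           # unobserved ability, enters wages
    z = rng.randint(0, 1)              # binary instrument, independent of u0
    s = 1 if u0 + z - 0.5 > 0 else 0   # more able people enroll more often
    y = 1.0 + TRUE_RETURN * s + u0     # observed (log) wage
    rows.append((y, s, z))


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


# With a binary regressor, the OLS slope is the difference in mean wages.
ols = mean(y for y, s, z in rows if s == 1) - mean(y for y, s, z in rows if s == 0)

# IV (Wald) estimator: effect of the instrument on y over its effect on s.
dy = mean(y for y, s, z in rows if z == 1) - mean(y for y, s, z in rows if z == 0)
ds = mean(s for y, s, z in rows if z == 1) - mean(s for y, s, z in rows if z == 0)
iv = dy / ds
```

In this design `ols` is pushed far above `TRUE_RETURN` because S is positively correlated with U0, while `iv` stays close to it.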
The assumption that U1 = U0 is convenient but implausible (see Heckman, 2001). If U1 ≠ U0 then there is another component of heterogeneity in returns, U1 - U0, which is unobservable. In this case (6) is
$$Y = \mu_0(X) + [\mu_1(X) - \mu_0(X) + U_1 - U_0] S + U_0$$
Furthermore, if returns vary across individuals even conditional on X, it is not clear that the mean return conditional on X, E(b | X), is a useful parameter to consider. Suppose, as most of the literature does, that we focus on mean treatment parameters, and that we want to compute the benefits of a particular policy, say, a tuition subsidy. How would we evaluate such a policy? Assume the tuition subsidy induces K individuals to enroll in college. The traditional approach to the problem assumes that the return to college, b, is the same for all individuals (at least conditional on X). Then the benefit of the policy is simply B = b * K.2 However, if b varies in the population (even conditional on X), we want to multiply K by the average return to college for those individuals induced to go to college by the tuition subsidy, and therefore we need to know where in the distribution of b the new entrants come from. Suppose we have the following version of the model of equation (3):
$$S = 1 \text{ if } b - Z > 0$$
where b and Z vary in the population. Let Z be the tuition faced by a given individual under no subsidy, and Z' be the tuition he faces with the subsidy in place.3 Then, for individuals induced to attend college by the subsidy, b - Z ≤ 0 (they do not attend college without the subsidy, or S(Z) = 0) and b - Z' > 0 (they attend college with the subsidy, or S(Z') = 1). Then
$$B = K \cdot E(b \mid Z' < b \le Z)$$
Many of the standard parameters of the evaluation literature do not in general equal the policy relevant treatment effect (PRTE), E(b | Z' < b ≤ Z) (see Heckman and Vytlacil, 2001, 2003, and Carneiro, Heckman and Vytlacil, 2001). The average treatment effect (ATE) measures the average return to college for a randomly selected person in the population, E(b). Treatment on the treated (TT) is the average return to college for those who attend college, E(b | S = 1) = E(b | b > Z). Treatment on the untreated (TUT) measures the average return that those who do not attend college would experience had they gone to college, E(b | S = 0) = E(b | b ≤ Z). The average marginal treatment effect (AMTE) is the effect of treatment for individuals at the margin between enrolling in college or not, E(b | b = Z). Individuals who are at the margin will probably be more responsive to different kinds of interventions than inframarginal individuals.
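To see how these parameters can differ, the following Python sketch simulates the stylized rule S = 1 if b - Z > 0. Everything numerical here is an assumption made for illustration: returns and tuition costs are drawn independently from the same normal distribution, the subsidy is 0.02 in the units of b, and the margin in AMTE is approximated by a narrow band around b = Z.

```python
import random

rng = random.Random(0)
N = 50_000
SUBSIDY = 0.02   # hypothetical subsidy, in the same units as b
BAND = 0.005     # hypothetical half-width used to approximate b = Z

# Hypothetical population: return b and tuition cost Z, drawn independently.
people = [(rng.gauss(0.10, 0.05), rng.gauss(0.10, 0.05)) for _ in range(N)]


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


ate = mean(b for b, z in people)                          # E(b)
tt = mean(b for b, z in people if b > z)                  # E(b | b > Z)
tut = mean(b for b, z in people if b <= z)                # E(b | b <= Z)
amte = mean(b for b, z in people if abs(b - z) < BAND)    # approx. E(b | b = Z)
prte = mean(b for b, z in people if z - SUBSIDY < b <= z) # E(b | Z' < b <= Z)
```

In this design TT exceeds ATE, which exceeds TUT, and PRTE lies close to AMTE rather than to any of the other three, so multiplying K by ATE or TT would misstate the benefit of the subsidy.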
Some important evaluation questions require knowledge of features of the distribution of returns beyond the mean parameters reported above (Heckman, Smith and Clements, 1997). For example, we may want to know the proportion of individuals who benefit from a program (and, symmetrically, the proportion who lose from it), or the impact of the program on the distribution of income. Knowledge of distributions of returns allows us to answer a much larger range of questions. We can define parameters analogous to the mean parameters described above: f(b) is the density of b in the population, f(b | S = 1) is the density of b for those who enrolled in college, f(b | S = 0) is the density for those who did not enroll in college, f(b | b = Z) is the density of b for those at the margin, and f(b | Z' < b ≤ Z) is the density of b for those induced to enroll in college by the tuition subsidy. Once we have the latter distribution it is possible to compute, for example, the proportion of individuals affected by the policy who benefit from the policy.
However, identification of distributions of returns is more difficult than identification of the mean parameters discussed above. Under assumptions (4) and (5) it is possible to identify f(Y1) and f(Y0) but not f(Y1, Y0) (see Heckman and Smith, 1993, 1998).4 The reason is that we never observe both Y1 and Y0 for the same individual. Therefore it is not possible to identify the different densities listed in the previous paragraph. However, it is still possible to identify the different mean parameters described above (under some continuity and support conditions for the instrument Z; see Heckman and Vytlacil, 2000, 2003). Identification of mean effects is achieved under weaker conditions than identification of distributions of effects.5 Assuming we have the model presented above and there exists a continuous instrument Z, it is possible to identify E(Y1 | X, US) and E(Y0 | X, US) and therefore E(b | X, US) (see, for example, Heckman and Vytlacil, 2000, 2003). Heckman and Vytlacil (2000, 2003) show how it is possible to construct parameters such as ATE, TT, TUT, AMTE and PRTE as weighted averages of E(b | X, US), a parameter they call the marginal treatment effect (MTE). For example:
$$ATE = E(b) = \int\!\!\int E(b \mid X = x, U_S = u)\, f(x, u)\, dx\, du$$
where f(X, US) is the joint density of X and US. If it were possible to identify f(b | X, US) then by the same reasoning it would be possible to construct f(b) and the remaining densities I presented above. However, all we can recover from the data under the assumptions presented so far is f(Y1 | X, US) and f(Y0 | X, US) (see Heckman and Smith, 1993, 1998), which is not enough for recovering f(b | X, US) since that requires knowledge of the joint density f(Y1, Y0 | X, US). Knowledge of this joint density is not necessary for constructing the marginal treatment effect; the marginal densities provide enough information. In section 4 I will present alternative sets of assumptions under which it is possible to recover these densities. Once we recover the joint density of potential outcomes, we can compute distributions of returns to schooling, f(b), and different distributional parameters, and we can also analyze the effects of different policies on the distribution of income.
There is a large literature I could draw on and review in this summary, but the scope of the paper is much narrower than a complete review of the literature. In the next two sections I summarize two sets of papers where these ideas are applied in the estimation of the returns to schooling. Section 3 focuses on the work of Carneiro, Heckman and Vytlacil (2001) and Carneiro (2002), papers where different mean treatment parameters are estimated. In section 4 I discuss the work of Carneiro, Hansen and Heckman (2001, 2003), which provides estimates of distributions of returns to schooling.
3. ESTIMATING AVERAGE RETURNS TO EDUCATION
Heckman and Vytlacil (2001, 2003) show how the marginal treatment effect can be estimated using the method of local instrumental variables. Carneiro, Heckman and Vytlacil (2001) and Carneiro (2002) apply these ideas to the estimation of returns to college. The intuition of the method is best explained with an example. Suppose the decision model has the following reduced form:
$$S = 1 \text{ if } U_S - Z g > 0 \qquad (7)$$
where Z (the instrumental variable) is tuition in the county of residence (g > 0) and US is 'ability', which is unobserved.6
We start by using only two counties: A and B. In county A, Z = $100, and in county B, Z = $200. The two counties are equal in every aspect except tuition.7 We can estimate b by standard IV using tuition as the instrument and data from only counties A and B:
$$b_{IV}^{A,B} = \frac{E(\ln Y \mid Z = 100) - E(\ln Y \mid Z = 200)}{\Pr(S = 1 \mid Z = 100) - \Pr(S = 1 \mid Z = 200)}$$
This is the Local Average Treatment Effect for the case where the instrument takes values Z = $100 and Z = $200 (Imbens and Angrist, 1994). It is the average return for individuals who go to college if Z = $100 but do not go if Z =$200. Therefore these individuals are at the margin between going to college or not if they face a Z that is between 100 and 200. The fact that they are at the margin at such a low level of tuition means that they have low ability (US).8
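Suppose we use two different counties. County C has Z = $2100 and county D has Z = $2200. Using C and D only:
$$b_{IV}^{C,D} = \frac{E(\ln Y \mid Z = 2100) - E(\ln Y \mid Z = 2200)}{\Pr(S = 1 \mid Z = 2100) - \Pr(S = 1 \mid Z = 2200)}$$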
This is the average return for individuals who do not go to school if Z = $2200 but who decide to enroll if Z = $2100. They are at the margin at a high level of tuition, which means that they have a high level of ability.
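The two-county comparison can be reproduced in a stylized Python sketch. The numbers are assumptions made only for illustration: ability US is standard normal and evaluated on a deterministic grid, the tuition coefficient is g = 0.001, the return is assumed to rise linearly with ability, and every county has the same ability distribution. Under these assumptions the IV (Wald) estimate for each pair of counties equals the average return among individuals who switch schooling status between the two tuition levels.

```python
from statistics import NormalDist

N = 50_000
G = 0.001  # hypothetical coefficient g on tuition in the decision rule

# Deterministic grid of ability values U_S (standard normal by assumption).
std_normal = NormalDist()
ability = [std_normal.inv_cdf((i + 0.5) / N) for i in range(N)]


def individual_return(u: float) -> float:
    """Hypothetical return to college, increasing in ability."""
    return 0.05 + 0.02 * u


def county_means(tuition: float):
    """Mean log wage and college share in a county with the given tuition."""
    total_wage, total_s = 0.0, 0
    for u in ability:
        s = 1 if u - G * tuition > 0 else 0   # decision rule: U_S - Z*g > 0
        total_wage += 2.0 + 0.1 * u + s * individual_return(u)
        total_s += s
    return total_wage / N, total_s / N


def wald(z_low: float, z_high: float) -> float:
    """IV (Wald) estimate from two counties with tuition z_low < z_high."""
    y_low, p_low = county_means(z_low)
    y_high, p_high = county_means(z_high)
    return (y_low - y_high) / (p_low - p_high)


late_ab = wald(100, 200)     # margin at low tuition: lower-ability switchers
late_cd = wald(2100, 2200)   # margin at high tuition: higher-ability switchers
```

Because returns are assumed to increase with ability, `late_cd` exceeds `late_ab`: different pairs of counties identify returns at different points of the US distribution.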
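The general expression for any pair of counties with tuition (z, z') is:
$$b_{IV}(z, z') = \frac{E(\ln Y \mid Z = z) - E(\ln Y \mid Z = z')}{\Pr(S = 1 \mid Z = z) - \Pr(S = 1 \mid Z = z')} \qquad (8)$$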
By varying Z we can trace out how b varies with US. This is the MTE. Notice that to trace out the whole support of US we need large support for the instrument. In practical applications of the method we can have many instruments. We aggregate them into a single index through a regular first stage regression such as (7), where Z can be a vector of instrumental variables. It is convenient to pick the index to be the predicted probability of attending college conditional on Z. Let U'S = -US, P(Z) = Pr(S = 1 | Z) = FU'S(-Zg) and VS = FU'S(U'S), where FU'S(·) is the cdf of U'S.9 These are just monotonic transformations of variables (see Heckman and Vytlacil, 2001, 2003). The choice model becomes:
$$S = 1 \text{ if } P(Z) > V_S \qquad (9)$$
and (8) becomes:
$$\frac{E(\ln Y \mid P = p) - E(\ln Y \mid P = p')}{p - p'}$$
Notice that this is similar to the expression for a derivative (once we take the limit). In the numerator we have the difference of a function of P, E(ln Y | P), evaluated at two different values, p and p'. In the denominator we have the difference between p and p'. This suggests that an estimator of MTE is simply:
$$MTE(p) = \frac{\partial E(\ln Y \mid P = p)}{\partial p}$$
The density f(ν | P' ≤ VS < P), that is, the density of VS = ν among individuals with P' ≤ VS < P, can be easily constructed from equation (9) since we observe f(P, P') in the data and by construction VS is a uniform random variable independent of P (see Heckman and Vytlacil, 2000, 2003; Carneiro, Heckman and Vytlacil, 2001; Carneiro, 2002).10
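The derivative interpretation can be made concrete with a small Python sketch. It is an illustration under assumptions, not the estimator used in the cited papers: the MTE is taken to be a known linear function of VS, wages carry no noise, VS is placed on a uniform grid, and the derivative of E(ln Y | P = p) is approximated by a finite difference between two values of P.

```python
def mte(v: float) -> float:
    """Hypothetical marginal treatment effect, decreasing in v."""
    return 0.2 - 0.2 * v


N_V = 1000
v_grid = [(j + 0.5) / N_V for j in range(N_V)]  # V_S is uniform on (0, 1)


def mean_log_wage(p: float) -> float:
    """E(ln Y | P = p) when S = 1 if p > V_S and ln Y = 2.0 + S * MTE(V_S)."""
    return sum(2.0 + (mte(v) if p > v else 0.0) for v in v_grid) / N_V


def local_iv(p_low: float, p_high: float) -> float:
    """Finite-difference version of the derivative of E(ln Y | P = p)."""
    return (mean_log_wage(p_high) - mean_log_wage(p_low)) / (p_high - p_low)


# Each estimate recovers the MTE at the midpoint of the interval (p, p + 0.1).
estimates = [local_iv(p, p + 0.1) for p in (0.1, 0.4, 0.7)]
```

With real data E(ln Y | P = p) would have to be estimated, for example by a nonparametric regression of log wages on the estimated P, before it is differentiated.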
Adding X to the model is a simple (but important) extension. Carneiro, Heckman and Vytlacil (2001) estimate the MTE using white males from the NLSY79 with a high school degree (S = 0) or above (S = 1).11 Wages are measured in 1992. The relevant variable in X is the AFQT score, which is an important determinant of the return to college. The instrumental variables Z are number of siblings, father's education, distance to college, tuition and local labor market variables. The estimates correspond to returns to one year of college, and are obtained by dividing gross returns by 3.5, the difference in the average years of schooling between individuals who attend college and those who do not. Figure 1 shows the estimate of the MTE. Returns are highest for individuals with the highest AFQT scores and the lowest levels of VS (notice that the lower VS is, the more likely an individual is to attend college).12 The relationship between the MTE and VS is not monotonic. Returns are decreasing in VS for low values of this variable, indicating that individuals who are more likely to attend college have higher returns. However, returns are increasing in VS for high values of VS. Some individuals have high returns and still do not enroll in college. This may happen if they face high psychic costs of schooling or if they are credit constrained. A simple formal test for selection rejects the null hypothesis that the MTE does not vary with VS (i.e., that selection on unobservables is not important). Table 1 shows estimates of ATE, TT, TUT, AMTE and PRTE.13 The simulated policy is a $1000 tuition subsidy. The table also presents OLS and linear IV estimates of the returns to college, where the instrument is P. None of ATE, TT and TUT corresponds to PRTE. AMTE is very similar to PRTE. OLS and linear IV do not estimate any of the above parameters, and in particular, they do not estimate PRTE.
AMTE (the marginal person) is below TT (the average person in college), which means that as we expand education the quality of the marginal entrant declines.
The OLS and IV estimates are evaluated at the average level of AFQT for individuals induced to enroll in college by a $1000 tuition subsidy. Therefore these figures are directly comparable with the policy relevant treatment effect.
Source: Carneiro, Heckman and Vytlacil (2001).
4. ESTIMATING DISTRIBUTION OF RETURNS TO EDUCATION
The estimation of distributions of returns is more complex. Heckman and Smith (1993, 1998) and Heckman, Smith and Clements (1997) show how under different assumptions it is possible to estimate f(Y1, Y0) from knowledge of f(Y1) and f(Y0). Assuming absolutely continuous and strictly increasing marginal distributions, they postulate that quantiles are perfectly ranked so
$$Y_1 = F_1^{-1}(F_0(Y_0))$$
where F1 and F0 are the marginal distribution functions of Y1 and Y0. An alternative assumption is that people are perfectly inversely ranked so the best in one distribution is the worst in the other:
$$Y_1 = F_1^{-1}(1 - F_0(Y_0))$$
More generally, one could associate quantiles across distributions more freely. Heckman, Smith and Clements (1997) use Markov transition kernels which stochastically map quantiles of one distribution into quantiles of another. They define a pair of Markov kernels M(y1, y0) and M̃(y0, y1) such that
$$F_1(y_1) = \int M(y_1, y_0)\, dF_0(y_0), \qquad F_0(y_0) = \int \tilde{M}(y_0, y_1)\, dF_1(y_1)$$
Allowing these operators to be degenerate produces a variety of deterministic transformations, including the two previously presented, as special cases of a general mapping. Different pairs of kernels produce different joint distributions. These stochastic or deterministic transformations supply the missing information needed to construct the joint distributions.
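The two deterministic special cases can be sketched in a few lines of Python. The wage samples below are hypothetical; the point is only that the same pair of marginal distributions yields very different distributions of gains depending on how quantiles are matched.

```python
def perfect_ranking(y0_sample, y1_sample):
    """Pair the q-th quantile of Y0 with the q-th quantile of Y1."""
    return list(zip(sorted(y0_sample), sorted(y1_sample)))


def perfect_inverse_ranking(y0_sample, y1_sample):
    """Pair the lowest Y0 with the highest Y1, and so on."""
    return list(zip(sorted(y0_sample), sorted(y1_sample, reverse=True)))


# Hypothetical samples from the two marginal distributions (equal sizes).
y0 = [12.0, 20.0, 15.0, 30.0]
y1 = [25.0, 18.0, 40.0, 22.0]

gains_ranked = [b - a for a, b in perfect_ranking(y0, y1)]
gains_inverse = [b - a for a, b in perfect_inverse_ranking(y0, y1)]
```

Under perfect ranking every implied gain in this example is positive, while under perfect inverse ranking the person with the highest Y0 loses; the marginal distributions alone cannot tell the two cases apart.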
A perfect ranking (or perfect inverse ranking) assumption is convenient, but it imposes a strong and arbitrary dependence across distributions. The empirical analysis in Carneiro, Hansen and Heckman (2003), summarized in this paper, shows that this assumption is at odds with data on the returns to education. They implement an alternative approach. Aakvik, Heckman and Vytlacil (1999, 2003) build on Heckman (1990) and Heckman and Smith (1998) by postulating a factor structure connecting (U0, U1, US). The work of Carneiro, Hansen and Heckman (2003) builds on the analysis of Aakvik, Heckman and Vytlacil (1999, 2003) so I describe its essential idea. Suppose that the unobservables follow a factor structure:
$$U_0 = \alpha_0 \theta + \varepsilon_0, \qquad U_1 = \alpha_1 \theta + \varepsilon_1, \qquad U_S = \alpha_S \theta + \varepsilon_S$$
where θ is independent of (ε0, ε1, εS) and the ε's are mutually independent. In their setup, θ is a scalar; it can be an unobservable trait like ability or motivation that affects all outcomes. Because the factor loadings α0, α1 and αS may be different, the factor may affect outcomes and choices differently. Recall that one can identify the joint distributions of (U0, US) and (U1, US) under the conditions specified in Heckman and Smith (1998). Thus, one can identify Cov(U0, US) = α0 αS σθ² and Cov(U1, US) = α1 αS σθ², assuming finite variances and assuming E(θ) = 0, E(θ²) = σθ². With some normalizations and conditions specified in Carneiro, Hansen and Heckman (2003) we can nonparametrically identify the distribution of θ and the distributions of (ε0, ε1, εS) (the last up to scale). With α0, α1 and αS, and the distributions of θ and (ε0, ε1, εS), in hand, we can construct the joint distribution f(Y0, Y1 | X).14 Carneiro, Hansen and Heckman (2003) build on this basic idea and extend it to a more general setting. They consider a model with multiple factors, multiple treatments and multiple time periods. Outcome measures may be discrete or continuous. They follow the psychometric literature by adjoining measurement equations to outcome equations to pin down the distribution of θ. In particular, they use cognitive test scores as additional measures in the system. With this framework they can estimate all pairwise treatment effects in a multiple outcome setting. They also consider the benefits for identification of having access to imperfect measurements on the vector θ which are observed for all persons independent of their treatment status. Their model integrates the LISREL framework of Jöreskog (1977) into a model of discrete choice and a model of multiple treatment effects.
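The role of the covariances can be illustrated with a simulated one-factor model. The sketch below is only a check of the covariance algebra under assumed values: the loadings, the unit variances and the normal distributions are hypothetical, and the unobservables are treated as if they were visible, which is never the case in actual data (there the covariances are identified through the arguments cited in the text).

```python
import random

rng = random.Random(1)
N = 50_000
A0, A1, AS = 1.0, 1.5, 0.8   # hypothetical factor loadings


def cov(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


u0, u1, us = [], [], []
for _ in range(N):
    theta = rng.gauss(0.0, 1.0)                  # common factor, variance 1
    u0.append(A0 * theta + rng.gauss(0.0, 1.0))  # U0 = a0 * theta + e0
    u1.append(A1 * theta + rng.gauss(0.0, 1.0))  # U1 = a1 * theta + e1
    us.append(AS * theta + rng.gauss(0.0, 1.0))  # US = aS * theta + eS

cov_0s = cov(u0, us)              # close to A0 * AS * var(theta) = 0.8
cov_1s = cov(u1, us)              # close to A1 * AS * var(theta) = 1.2
loading_ratio = cov_1s / cov_0s   # close to A1 / A0 = 1.5
```

The ratio of the two covariances recovers the relative loadings without ever observing U0 and U1 for the same person, which is the step that links the two potential outcomes through the common factor.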
Carneiro, Hansen and Heckman (2003) estimate the distribution of returns to college using white males in NLSY79. They compute present values of earnings for high school and college graduates, which they use as outcome measures. The return to college for an individual is the gain in the present value of earnings from moving him from high school to college. Figure 2 presents the estimated density of the returns to college for high school and college graduates. College graduates have higher returns to college than high school graduates (their density of returns is to the right of the density of returns for high school graduates). However, notice that a substantial portion of college graduates (7%) have negative returns to college (14% of high school graduates have negative returns to college). Conversely, there is a large overlap in the distributions of returns for high school and college graduates, so that a large number of high school graduates would have benefited substantially from going to college (instead of just completing high school). In the paper there is a discussion of the reasons for so many apparent "mistakes", and two important lessons emerge. The first one is that there is considerable uncertainty in the returns to college at the time the college decision is made (ex-ante uncertainty). Therefore, the apparent "mistakes" just reflect ex-post realizations of this uncertainty.15 In other words, people make mistakes because they do not know the future with certainty and some of them get a wage draw ex-post that is worse than the wage draw they would have had if they had chosen another schooling level. However, the second lesson is that the most important factor driving the college decision is the utility or psychic cost of schooling.16 Enrollment in college has relatively little to do with ex-ante computations of the returns to college and of the uncertainty in the returns to college.
Table 2 shows the probability of being in decile i of the college potential discounted earnings distribution conditional on being in decile j of the high school potential earnings distribution. It shows that neither an independence assumption across counterfactual outcomes, which is the Veil of Ignorance assumption used in applied welfare theory (see, e.g., Sen, 1973) and in aggregate income inequality decompositions (DiNardo, Fortin and Lemieux, 1996), nor a perfect ranking assumption, which is sometimes used to construct counterfactual joint distributions of outcomes (see, e.g., Heckman, Smith and Clements, 1997, or Athey and Imbens, 2002), is satisfied in the data. There is a strong positive dependence between potential outcomes in the two counterfactual states, but it is not perfect dependence. The number of nonzero elements outside the diagonal of Table 2 is substantial. Any analysis of distributional effects of social programs needs to take this dependence into account.
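A decile transition table of this kind can be computed from any set of paired potential outcomes, for example draws simulated from an estimated joint distribution. The Python sketch below is a generic illustration with hypothetical helper names (`decile_index`, `transition_matrix`) and made-up data; it is not the code behind Table 2.

```python
def decile_index(values):
    """Return the decile (0-9) of each value, by rank within the list."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    deciles = [0] * len(values)
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // len(values)
    return deciles


def transition_matrix(y0, y1):
    """Row j, column i: Pr(Y1 in decile i | Y0 in decile j)."""
    d0, d1 = decile_index(y0), decile_index(y1)
    counts = [[0] * 10 for _ in range(10)]
    for j, i in zip(d0, d1):
        counts[j][i] += 1
    return [[c / max(1, sum(row)) for c in row] for row in counts]


# Made-up paired outcomes for two polar cases.
y0 = [float(v) for v in range(100)]
ranked = transition_matrix(y0, [2.0 * v for v in y0])    # perfect ranking
inverse = transition_matrix(y0, [-v for v in y0])        # perfect inverse ranking
```

Perfect ranking puts all the mass on the diagonal and perfect inverse ranking puts it on the anti-diagonal, while independence would give 0.1 in every cell; the text describes Table 2 as lying between the diagonal and the independent case.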
In this paper I summarize two sets of papers that present methods for estimating the effects of different economic activities when these effects vary across individuals. Even though this is far from a complete review of the literature on evaluation, it illustrates the theoretical and empirical importance of accounting for heterogeneity in the evaluation of different policies. The papers summarized here provide simple examples of settings where heterogeneity is important and of how we can account for it in the evaluation of different policies.
Aakvik, A., J. Heckman and E. Vytlacil (1999), "Training Effects on Employment when the Training Effects are Heterogeneous: An Application to Norwegian Vocational Rehabilitation Programs", manuscript, University of Chicago.
Aakvik, A., J. Heckman and E. Vytlacil (2003), "Treatment Effects For Discrete Outcomes when Responses To Treatment Vary Among Observationally Identical Persons: An Application to Norwegian Vocational Rehabilitation Programs", NBER Working Paper N° T0262, forthcoming in Journal of Econometrics.
Athey, S. and G. Imbens (2002), "Identification and Inference in Nonlinear Difference-In-Differences Models", NBER Technical Working Paper T0280.
Bureau of Labor Statistics (2001), NLS Handbook 2001. Washington, D.C.: U.S. Department of Labor.
Carneiro, P. (2002), "Heterogeneity in the Returns to Schooling: Implications for Policy Evaluation", Ph.D. dissertation, University of Chicago.
Carneiro, P., K. Hansen and J. Heckman (2001), "Removing the Veil of Ignorance in Assessing The Distributional Impacts of Social Policies", Swedish Economic Policy Review, 8, 273-301.
Carneiro, P., K. Hansen and J. Heckman (2003), "Estimating Distributions of Treatment Effects with an Application to the Returns to Schooling and Measurement of the Effects of Uncertainty on Schooling Choice", International Economic Review, 44(2), 361-422.
Carneiro, P. and J. Heckman (2002), "The Evidence on Credit Constraints in Post Secondary Schooling", Economic Journal, 112, 705-734.
Carneiro, P., J. Heckman and E. Vytlacil (2001), "Understanding What Instrumental Variables Estimate: Estimating the Marginal and the Average Returns to Education", working paper, University of Chicago.
DiNardo, J., N. M. Fortin and T. Lemieux (1996), "Labor Market Institutions and the Distribution of Wages, 1973-1992: A Semiparametric Approach", Econometrica, 64, 1001-1044.
Griliches, Z. (1977), "Estimating the Returns to Schooling: Some Econometric Problems", Econometrica, 45(1), 1-22.
Heckman, J. (1990), "Varieties of Selection Bias", American Economic Review, 80(2), 313-18.
Heckman, J. (2001), "Micro Data, Heterogeneity and the Evaluation of Public Policy: Nobel Lecture", Journal of Political Economy, 109(4), 673-748.
Heckman, J. and J. Smith (1993), "Assessing the Case for Randomized Evaluation of Social Programs", in Measuring Labour Market Measures: Evaluating The Effects of Active Labour Market Policy Initiatives, K. Jensen and P.K. Madsen, eds., (Copenhagen: Ministry Labour).
Heckman, J. and J. Smith (1998), "Evaluating the Welfare State", in S. Strom, ed., Econometrics and Economic Theory in the 20th Century: The Ragnar Frisch Centennial, Econometric Society Monograph Series, (Cambridge: Cambridge University Press).
Heckman, J., J. Smith and N. Clements (1997), "Making the Most out of Program Evaluations and Social Experiments: Accounting for Heterogeneity in Program Impacts", Review of Economic Studies, 64, 487-535.
Heckman, J. and E. Vytlacil (2000), "Local Instrumental Variables", in C. Hsiao, K. Morimune and J. Powell, eds., Nonlinear Statistical Modeling: Proceedings of the Thirteenth International Symposium in Economic Theory and Econometrics: Essays in Honor of Takeshi Amemiya, (Cambridge: Cambridge University Press).
Heckman, J. and E. Vytlacil (2003), "Structural Equations, Treatment Effects and Econometric Policy Evaluation", forthcoming in Econometrica.
Ichimura, H. and C. Taber (2001), "Direct Estimation of Policy Effects".
Imbens, G. and J. Angrist (1994), "Identification and Estimation of Local Average Treatment Effects", Econometrica, 62(2), 467-475.
Jöreskog, K. (1977), "Structural Equations Models In The Social Sciences: Specification, Estimation and Testing", in Applications of Statistics, edited by P.R. Krishnaiah, (Amsterdam: North Holland).
Sen, Amartya Kumar (1973), On Economic Inequality, Oxford, Clarendon Press.
* The University of Chicago. Email: firstname.lastname@example.org. I learned the material summarized in this paper from close interactions with James Heckman, with whom I worked on many of these topics, and whom I thank for his teachings, encouragement and support. I have also benefited heavily from working with Karsten Hansen and Edward Vytlacil on these same topics. Financial support from Fundaçao Ciencia e Tecnologia and Fundaçao Calouste Gulbenkian is gratefully acknowledged.
However, for simplicity and for easier exposition I will continue with the separable case. I refer the reader to the papers referenced throughout the discussion, in particular Heckman and Vytlacil (2000, 2003), and Carneiro, Heckman and Vytlacil (2001).
2 Or, conditional on X, B(X) = b(X) K(X), where B(X) is the benefit of the policy for individuals with a given level of X, b(X) is the return to college for those individuals, and K(X) is the number of such individuals induced to attend college by the subsidy. Then B = Σ_X B(X). I will drop X when convenient for simplicity of exposition.
7 Tuition can vary across counties if there are barriers to student migration. Tuition variation could also reflect quality variation which would invalidate the use of tuition as an instrumental variable but we abstract from quality considerations in this example and in the rest of the paper. For an example see Carneiro and Heckman (2002).
8 If individuals switch when tuition varies very little within such a low range, then even though they face tuition at a very low level these individuals still decide not to enroll in college unless tuition decreases to an even lower level. Therefore they are likely to have low levels of ability (or high levels of unobserved cost).
S = 1 if P(Z) > VS.
Notice that VS needs to be read in the opposite way to US: the higher the US (the higher the unobserved ability or benefit, or the lower the unobserved cost), the lower the VS. Individuals with low VS are more likely to go to college than individuals with high VS.
12 Recall that VS = FU'S(U'S), where U'S = -US. Therefore, VS is negatively correlated with US (if US is ability, then an individual with a high level of VS has low ability). This transformation is only done for notational simplicity.