The Need for Factor Models

The Number of Estimates Needed for Mean/Variance Analyses

The problem with securities is that there are too many of them. This is also true for pools of securities such as mutual funds. Worldwide, there are hundreds of thousands of securities and tens of thousands of mutual funds.

In the United States alone there are roughly ten thousand mutual funds. To perform a mean/variance analysis of portfolios that could contain any of them would require estimates for the future values of:

10,000 expected returns,
10,000 standard deviations, and
100,000,000 (10,000*10,000) correlation coefficients

To be sure, this overstates the magnitude of the problem. We know that 10,000 of the correlation coefficients will equal 1.0, since each fund will be perfectly correlated with itself. Moreover, for each entry below the main diagonal of the correlation matrix there is a corresponding entry above it (that is, cc(i,j)=cc(j,i)). Thus the number of potentially different correlation coefficients to be estimated will be only (!) (10,000*10,000 - 10,000)/2, or 49,995,000.

More generally, with N different assets, we require:

N expected returns
N standard deviations
(N^2 - N)/2 correlation coefficients

for a grand total of (N^2 + 3*N)/2 different estimates.
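The counts above are easy to check numerically. The following is a minimal sketch in Python; the function name count_estimates is an illustrative choice, not something taken from the text.

```python
def count_estimates(n):
    """Return (expected returns, standard deviations, correlations, total)
    for a mean/variance analysis of n assets."""
    expected_returns = n
    standard_deviations = n
    # Off-diagonal entries of the correlation matrix, counted once per pair.
    correlations = (n * n - n) // 2
    total = expected_returns + standard_deviations + correlations
    return expected_returns, standard_deviations, correlations, total


for n in (10, 100, 1000, 10000):
    er, sd, cc, total = count_estimates(n)
    print(f"N={n:>6,}: {cc:>12,} correlations, {total:>12,} estimates in all")
```

For N = 10,000 the correlation count matches the 49,995,000 figure given earlier, and the total equals (N^2 + 3*N)/2.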

There are two consequences of the fact that problems involving large numbers of assets require a great many estimates. The first concerns the sheer computational requirements for optimization or even the determination of the risk and return of a given portfolio. Fortunately, ever-declining computer costs can ameliorate the pain caused on this front. But a second problem remains -- it is simply too difficult to estimate each of the required values explicitly.

The Use of Historic Data

At first glance it might seem that the estimation problem could also be solved simply by unleashing a sufficient amount of computer power. Why not obtain a set of historic returns for the N assets and compute historic mean returns, standard deviations of return and correlations among the returns? Even for large values of N this could be done in reasonable time and for reasonable cost, although storage of each of the resulting estimates would use up a considerable amount of computer space.

Issues of cost and time aside, such an approach would not provide a good solution. A set of historic data provides only a sample of possible outcomes. The statistics we desire are those that describe the entire underlying "return-generating" process. But the statistics from a sample are likely to differ in potentially significant ways from those that are appropriate for tasks such as risk estimation and portfolio optimization. In statisticians' terms, the numbers obtained from historic data are "subject to error". More simply put, they include noise.

In some cases this may be reasonably benign. For example, if some values are overstated and others understated, a simple average of historic values may provide a quite accurate estimate of the expected value of the true process. This suggests that the use of historic data for estimating the expected returns and risks of pre-specified portfolios might be an acceptable practice. However, the use of optimization to find the best portfolio for a given investor will be fraught with hazard if historic data are used, since optimization programs look for unusual values, and such values are far more likely to include error than those that are not unusual. The same danger lurks when evaluating portfolios chosen in simpler ways, but with knowledge of the behavior of the assets over the historic period.

In either case, the purported risks and returns for the portfolios will be biased toward favorable estimates (higher expected returns and/or lower risks), and the portfolios will almost certainly be inefficient in prospective terms. More precisely, portfolios that appear on the efficient frontier using unadjusted historic data will almost certainly plot below the true efficient frontier that could be constructed if the correct future risks, expected returns and correlations were known. Unhappily, of course, we can never know the location of the true efficient frontier, since we can never know precisely the correct future risks, expected returns and correlations.
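The tendency of selection to favor values that contain error can be illustrated with a small simulation. The sketch below is only an illustration under strong assumptions that are not from the text: every asset has the same true expected return, returns are independent and normally distributed, and the parameter values (100 assets, 60 periods, a 1% mean and a 5% standard deviation per period) are arbitrary. The function name selection_bias is likewise hypothetical.

```python
import random
import statistics


def selection_bias(n_assets=100, n_periods=60, true_mean=0.01,
                   true_sd=0.05, seed=0):
    """Simulate historic returns for assets that all share the same true
    expected return, then compare the best sample mean with the average one."""
    rng = random.Random(seed)
    sample_means = []
    for _ in range(n_assets):
        returns = [rng.gauss(true_mean, true_sd) for _ in range(n_periods)]
        sample_means.append(statistics.fmean(returns))
    best = max(sample_means)                   # what a selector would pick
    average = statistics.fmean(sample_means)   # benign use: a simple average
    return best, average


best, average = selection_bias()
print(f"true expected return:       {0.01:.4f}")
print(f"average of sample means:    {average:.4f}")
print(f"sample mean of 'best' asset: {best:.4f}")
```

Under these assumptions the average of the sample means stays close to the true value, while the asset with the highest historic mean looks better than it really is, purely because its sample happened to contain favorable noise.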

The problem with using historic data when estimates are required for a large number of assets can be seen by comparing the required number of estimates with the data available for the estimation. Assume that returns are available for N assets for T periods (e.g. months). In all, N*T numbers are available in the empirical database. But we need to estimate (N^2 + 3*N)/2 different numbers (expected returns, standard deviations and correlation coefficients). Dividing the former by the latter gives the number of data points available per number to be estimated: N*T / ((N^2 + 3*N)/2) = 2*T/(N+3). The table below shows this ratio for selected values of N and T.

N	T	available/estimated
10	60	9.23
100	60	1.17
1,000	60	0.12
10	120	18.46
100	120	2.33
1,000	120	0.24
10	840	129.23
100	840	16.31
1,000	840	1.68
10,000	840	0.17

For the common case in which monthly returns are used for estimation, the three values of T (60, 120 and 840) correspond to 5, 10 and 70 years of data, the last being approximately the number of years in longer-term databases.

Cases in which fewer numbers are available than are to be estimated are clearly beyond the pale. Yet such combinations can easily arise in practice. They are often encountered in scenario analyses, when judgmental forecasts of asset returns in a limited set of possible future situations are used as the foundation for portfolio construction. But they are also not uncommon in empirical analyses of historic data.

One might assume that the problem of insufficient data can be mitigated sufficiently by simply using more data. Unfortunately this usually requires going farther back in history, and the longer the historic period covered, the less plausible is the maintained hypothesis that the underlying joint probability distribution generating the returns has remained the same throughout.

While these dilemmas cannot totally be resolved, there are ways to mitigate the problem. Needed are procedures that can produce estimates of risks, returns and correlations closer to the desired future values than those obtained by simply using historic statistics. Two ingredients are required.

First, historic data must be "smoothed" to try to focus on underlying relationships that are more likely to be true in the future and to ignore deviations from those relationships that are more likely to be due to random noise or errors. The tools used most often to accomplish this are factor models -- the subject of this chapter.
Second, good financial economic theory must be utilized to adjust estimates of risks, expected returns and correlations until they bear some reasonable relationship with one another. This involves the use of concepts and models associated with equilibrium in efficient markets -- a subject that is treated at length in a later chapter.
