Simulation FAQ



What are these simulations all about?

For some time now, I’ve been collecting polls and simulating election results. I began systematically collecting state head-to-head polls for the 2008 presidential election in October of 2007. Since then, I’ve added analyses of the 2008 senatorial and gubernatorial elections. The FAQ mostly discusses the presidential election, although the methods for other races are the same.

How are you doing the electoral college analyses?

The analyses are Monte Carlo simulations of the Electoral College outcome based on state head-to-head polling data. The results are driven by poll results (or the 2004 election results for states with no polls). Essentially, I simulate a large number of elections (typically 100,000) for all states (plus D.C.) based on the recent polling data and then tally the number of Electoral College votes for each candidate. Details of the methods follow.

What polling data do you use?

I collect state polls in which Democratic candidates are matched-up, head-to-head, with Republican challengers.

Where do your polling data come from?

Several places. I find polls from the web sites of well-known polling firms. However, if there is sufficient information, a secondary source (e.g., a news summary of a poll) can be acceptable. The most common polling firms to release head-to-head polls are SurveyUSA, Rasmussen, Research 2000, and Quinnipiac. But there are many, many more polls and polling companies. Some of them are listed here. Frequently, I am made aware of a poll through a polling aggregation site like Atlas of US elections, TPMElectionCentral, Pollster.com, or Real Clear Politics, but I try to find (and link to) an original report.

How do you select which polls to include?

To be considered acceptable, each poll must come from a reputable pollster and must include the following information:

  1. The name of the polling firm
  2. The dates on which the poll was taken
  3. The state in which the poll was taken
  4. The number of individuals polled [a]
  5. The number or percentage of individuals supporting each candidate

[a] Most reputable polls report the number of individuals sampled. As I write this (Feb 2008), every poll has included this number, but eventually some poll will not publish it. When a 95% margin of error (MOE) is provided instead, I can estimate the number of sampled individuals as (0.98/MOE)^2, with the MOE expressed as a proportion. (This is based on the standard error of a binomial proportion and, as is commonly done by pollsters, assumes the true proportion for each candidate is 0.5.)
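As a rough illustration of the margin-of-error conversion described in the footnote, here is a minimal Python sketch. The function name is hypothetical, and it assumes the MOE is given as a proportion (0.04 for plus or minus 4 points). The constant 0.98 is 1.96 × 0.5, which comes from solving MOE = 1.96 × sqrt(0.5 × 0.5 / n) for n.

```python
def sample_size_from_moe(moe):
    """Approximate the sample size implied by a 95% margin of error.

    moe is a proportion, e.g. 0.04 for +/-4 percentage points.
    Uses n = (0.98 / moe)^2, which assumes a true proportion of 0.5.
    """
    if moe <= 0:
        raise ValueError("margin of error must be positive")
    return round((0.98 / moe) ** 2)


# A poll reporting a +/-4 point margin of error implies about 600 respondents.
print(sample_size_from_moe(0.04))  # 600
```

Rounding to the nearest whole respondent is a choice made for this sketch; the text does not say how fractional values are handled.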

Do you include push polls?

No. A push poll is not a real poll. Rather, it is a marketing tool. In any case, results of push polls are rarely, if ever, published.

Do your simulations include all polls in each state?

No. Whenever possible, I include only “recent” polls. As I write this (Feb 2008), “recent” means polls with a mid-point that falls within the last month. In August, the window was reduced to three weeks, and in September it was reduced again to 14 days. I went to a 10-day window on Oct 21, when there were ten days remaining. The final reduction will be to a one-week window when one week remains until the election.

So, for example, if there are two polls in the last three weeks for Missouri and four older polls, only the two “recent” polls are included in the simulations.

What if there are no polls that have been conducted in the “current poll” window?

In that case, I use the most recent poll on record, even if it was taken some months ago.
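The selection rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the dictionary keys `start` and `end`, and the default 30-day window are hypothetical, and the actual code behind these analyses may differ.

```python
from datetime import date, timedelta


def select_current_polls(polls, today, window_days=30):
    """Pick the polls to use for one state.

    polls: list of dicts with hypothetical keys 'start' and 'end' (date objects).
    Returns every poll whose mid-point falls inside the window ending today.
    If none qualifies, returns the single most recent poll instead.
    Returns an empty list only when the state has no polls at all.
    """
    def midpoint(poll):
        # date + timedelta ignores fractions of a day, so this truncates.
        return poll["start"] + (poll["end"] - poll["start"]) / 2

    cutoff = today - timedelta(days=window_days)
    recent = [p for p in polls if cutoff <= midpoint(p) <= today]
    if recent:
        return recent
    if polls:
        return [max(polls, key=midpoint)]
    return []


old_poll = {"start": date(2008, 1, 2), "end": date(2008, 1, 2)}
new_poll = {"start": date(2008, 2, 10), "end": date(2008, 2, 12)}
today = date(2008, 2, 20)

# Only the old poll exists: it is used even though it is outside the window.
print(select_current_polls([old_poll], today) == [old_poll])
# Once a recent poll exists, the old one is dropped.
print(select_current_polls([old_poll, new_poll], today) == [new_poll])
```

Shrinking the window later in the season is then just a matter of passing a smaller `window_days`.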

What if there are no polls whatsoever taken in the state?

In that case, that state always goes the way it did in the last similar election. For the presidential election, states that went for Bush in 2004 are always credited to the Republican nominee and those that went for Kerry are credited to the Democratic nominee.

Nebraska has no known polls with head-to-head match-ups as I write this, so I assume McCain always beats Obama in Nebraska. This is because Bush won 65.9% of the vote and Kerry 32.7% of the vote in Nebraska in 2004.

Update: As of early March, 2008, there is at least one poll for each state.

Why use the 2004 election results for states lacking any polls?

The 2004 results in non-polled states seem like the best empirical data available. The states with no polling are those that the media, the candidates, and polling firms believe are highly predictable, so there is no reason to pay good money for a poll. They’re probably right. For example, D.C. is, almost certainly, not going to go for the Republican candidate in ’08, so nobody is going to pay for polling D.C. until much closer to the election.

Of course, as the election season goes on, there will be fewer and fewer unpolled states. In 2004, all 50 states plus D.C. were eventually polled, but there was only a single poll in some cases (like D.C.).

How can I see the polls being used?

From the map, click on a state to jump to the results table. From there, click on the number in the “# polls” column, and you will be taken to a list of polls.

How are the simulations done?

For each simulation, I “hold” an election in each state (plus D.C.) using “current” polls. For the presidential analyses, state results are then combined as would happen in the electoral college.

Here is an example. In Feb 2008, there is a single poll for an Obama–McCain match-up in Maryland. The Rasmussen poll, conducted on 2 Jan 2008, surveyed 500 voters and found that 42% supported McCain, 48% supported Obama, and 10% were either undecided or supported someone else. (This poll is “old” because it is more than a month old, but it is the most current poll, so it is the best information available.) Here are the steps:

  1. I compute the number of people who voted for each candidate. 500×0.48 gives 240 votes for Obama and 500×0.42 gives 210 McCain supporters in the original poll. There were 240 + 210 = 450 decided voters.
  2. I compute the normalized percentage supporting each candidate. For Obama it is 240/450 = 53.33%, and for McCain it is 210/450 = 46.67%. “Normalized” means that the proportions for Obama and McCain sum to 1.0.
  3. The estimated probability of a voter voting for Obama in Maryland in Jan is p = 0.533. But since p itself is estimated from a sample, p is more properly described as a distribution of possible Obama preferences. That is, we really have a distribution of ps.
  4. Thus, in each simulation for each poll, we randomly draw a value from the distribution of ps (let’s call it p’). So for the current simulation I might draw the value p’ = 0.527. Technical details: We draw p’ from a beta distribution with parameters (Dvotes + 1) and (Rvotes + 1). So, in this example we draw from beta(rand | 241, 211). This corresponds to a binomial likelihood with a uniform (uninformative) prior distribution on p.

  5. Now, we simulate 450 voters, who each have a p’ (here, a 52.7%) probability of voting for Obama and 1 – p’ probability of voting for McCain. How is this done? The easy way is to draw a uniform random number between 0 and 1. If the number is less than 0.527 then the vote goes to Obama, otherwise it is a vote for McCain. The process is repeated 450 times. Technical details: In practice we use a much faster method that yields identical results. A number of votes for the candidate is drawn from a binomial quantile function with a uniform random number as its argument and parameters N and p’ (here, 450 and 0.527).

When there are multiple current polls this process is repeated for each poll in each state and the number of votes for each candidate tallied.
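The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the code used for the analyses: the function name and the tuple layout for a poll are hypothetical, and step 5 uses the "easy way" (one uniform draw per voter) rather than the faster binomial quantile method.

```python
import random


def simulate_state(polls, rng):
    """Simulate one election in one state from its current polls.

    polls: list of (sample_size, dem_share, rep_share) tuples, where the
    shares are proportions of the full sample (hypothetical layout).
    Returns (dem_votes, rep_votes) summed over all polls.
    """
    dem_total = 0
    rep_total = 0
    for sample_size, dem_share, rep_share in polls:
        # Step 1: number of decided respondents for each candidate.
        dem = round(sample_size * dem_share)
        rep = round(sample_size * rep_share)
        decided = dem + rep
        # Steps 2-4: draw p' from Beta(Dvotes + 1, Rvotes + 1).
        p_prime = rng.betavariate(dem + 1, rep + 1)
        # Step 5: each decided voter picks the Democrat with probability p'.
        dem_sim = sum(1 for _ in range(decided) if rng.random() < p_prime)
        dem_total += dem_sim
        rep_total += decided - dem_sim
    return dem_total, rep_total


# The Maryland example: 500 respondents, 48% Obama, 42% McCain.
rng = random.Random(2008)  # fixed seed so the run is repeatable
maryland = [(500, 0.48, 0.42)]
n_sims = 2000  # far fewer than 100,000, to keep the example quick
dem_wins = sum(
    1 for _ in range(n_sims)
    if (lambda d, r: d > r)(*simulate_state(maryland, rng))
)
print("Share of simulations won by the Democrat:", dem_wins / n_sims)
```

Counting a state as won by whoever has more simulated votes is an assumption of this sketch; exact ties are simply not counted as Democratic wins here.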

How are you incorporating undecided voters in your analysis?

I ignore undecided voters. In the absence of any other information, the method assumes that the undecided fraction would break as the decided sample breaks. Someone wrote a dissertation some years ago doing similar simulations and tried different assumptions about the undecideds; as I recall, none of the manipulations affected the ability to predict the 2004 election. I’ll link to the dissertation when I find it again.

Maine and Nebraska use a different method of assigning electoral college votes. Shouldn’t you treat them differently?

All states but Maine and Nebraska use winner-take-all for electoral votes. If congressional-district polling data becomes widely available for these states, I’ll treat them specially so as to incorporate potential splits in electoral votes.
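A minimal Python sketch of the winner-take-all tally for one simulated election might look like this. The function name and data layout are hypothetical. Maine and Nebraska are treated like every other state, matching the current approach. The text does not say what happens when a state's simulated vote is exactly tied, so the coin flip below is purely an assumption of this sketch.

```python
import random


def tally_electoral_votes(state_results, electoral_votes, rng=random):
    """Combine simulated state results into Electoral College totals.

    state_results: dict mapping state -> (dem_votes, rep_votes).
    electoral_votes: dict mapping state -> number of electoral votes.
    Returns (dem_ev, rep_ev) under winner-take-all in every state.
    """
    dem_ev = 0
    rep_ev = 0
    for state, (dem_votes, rep_votes) in state_results.items():
        if dem_votes > rep_votes:
            dem_wins = True
        elif rep_votes > dem_votes:
            dem_wins = False
        else:
            # Assumption: break an exact tie in a state with a coin flip.
            dem_wins = rng.random() < 0.5
        if dem_wins:
            dem_ev += electoral_votes[state]
        else:
            rep_ev += electoral_votes[state]
    return dem_ev, rep_ev


# Three-state toy example with made-up simulated vote counts.
ev_2008 = {"MD": 10, "NE": 5, "DC": 3}
results = {"MD": (240, 210), "NE": (150, 300), "DC": (400, 50)}
print(tally_electoral_votes(results, ev_2008))  # (13, 5)
```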

Are you doing your analyses to favor a particular candidate or party?

I am most certainly not neutral on politics, but these election analyses are done as objectively as possible.

Are you trying to predict the result of the 2008 election?

No. My analyses make no projections to election day, 2008. Rather, I am showing what the state head-to-head polls indicate would happen if the election was held right now.

I’ll use a sports metaphor. During a basketball game, the current score does not always predict the winner. Rather, it provides information on the past and current performance of each team. We get some indication of the eventual winner, but only as the end of the game approaches or the point difference gets very large. Still, it would be unacceptable to fans if no score was “published” until the winner was fairly certain.

Likewise in an election contest, these analyses serve as a score for each team. I fully expect the score and the point spread to change as the game goes on, but I want to know who is in the lead and by how much at every point of the game.

Aren’t these exercises futile early in the election season when the parties are focused on the primary instead of messaging?

No. Likewise, I don’t think the score should remain hidden from spectators for the first half of a football game.

In fact, the ebbs and flows over time—particularly with respect to events and media coverage—are fascinating. For example, Giuliani’s fall from grace in the polls after LoverGate, and in the absence of any showing in IA and NH was nothing short of stunning. That sort of thing is at least as interesting as any attempts to predict a final outcome.

Why not use national head-to-head polls instead?

National polls have the advantage of being current—that is, people express their support for each candidate all at the same time. The state head-to-head polls suffer because some polls are older, and public opinion may have changed since the older polls were taken. But the national head-to-head polls have some disadvantages. Most importantly, they predict the outcome of a national popular vote. We don’t elect our presidents by popular vote. As we learned in 2000, the national popular vote doesn’t always give the same election outcome as the Electoral College vote.

How are you incorporating the margin of error of each poll in your analysis?

The margin of error is inherently incorporated into the analyses. This is done by simulating elections in each state that include the number of polled individuals, and drawing a new value of p’ (described above) for each poll every simulated election.

What is the distribution of electoral votes?

The “distribution of electoral votes” graph looks like this:

To produce this graph, I save the electoral vote from each of the (typically, 100,000) simulated elections. Then, the relative frequency (on the y-axis) of each possible electoral vote outcome (x-axis) is plotted. The graph can tell you several things:

  1. The highest bar is the most likely outcome for an election (this is the mode).
  2. The vertical dashed line is simply a marker for 269 votes, which reflects a tie in the Electoral College. The blue bars to the right of the center line are wins for the Democrat and the red bars to the left are wins for the Republican.
  3. If you squint a bit you can estimate where the graph would balance on a fulcrum. That is an estimate of the mean or expected electoral vote total.
  4. The point on the x-axis where 50% of the bar mass is above and 50% of the mass is below is the median electoral vote.
  5. The spread of the distribution is an indication of how variable the outcomes are.
  6. The raggedness of the bars reflects the differing numbers of votes per state with an Electoral College system. With 100,000 simulations, we would expect a pretty smooth distribution if a popular vote was being simulated. Not necessarily so with an electoral college system.
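The quantities read off the graph can also be computed directly from the saved electoral vote totals. The following Python sketch is illustrative only; the function name and the returned dictionary keys are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def summarize_outcomes(dem_ev_outcomes):
    """Summarize the Democrat's electoral votes across simulated elections.

    Returns the relative frequency of each outcome (the bar heights),
    plus the mode, mean, and median of the distribution.
    """
    counts = Counter(dem_ev_outcomes)
    n = len(dem_ev_outcomes)
    return {
        "relative_frequency": {ev: c / n for ev, c in sorted(counts.items())},
        "mode": max(counts, key=counts.get),  # the highest bar
        "mean": mean(dem_ev_outcomes),        # the balance point
        "median": median(dem_ev_outcomes),    # half the mass on each side
    }


# Five made-up outcomes standing in for 100,000 simulated elections.
summary = summarize_outcomes([270, 270, 280, 300, 260])
print(summary["mode"], summary["mean"], summary["median"])
```

If two outcomes are equally common, `max` returns the one seen first, which is an arbitrary tie-break for this sketch.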

How are the trend graphs produced?

The trend graphs look like this:

The graph results from simulations done over time. This graph was created by simulating weekly elections over an eight-month period. Basically, this comes from a series of 100,000 simulated elections for every week from 01 Dec 2007 to 01 Aug 2008. For each simulated election:

  1. Polls collected in the month (now, 3 weeks) preceding the focus week are included
  2. If no polls occur in the month preceding the focus week, the most recent poll taken prior to that week is used
  3. Or if no polls are available prior to the focus week, all the electoral votes are assigned according to the outcome of the 2004 election

The graph shows the median electoral vote count (purple line) for Obama. The blue lines enclose the central 75% mass of Obama’s electoral vote counts, and the green lines enclose the central 95% of Obama’s electoral vote counts.
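One way to obtain the 75% and 95% bands for a given week is to take percentiles of that week's simulated electoral vote totals. The text does not say exactly how the bands are computed, so the simple rank-based rule in this Python sketch is an assumption, and the function name is hypothetical.

```python
import math


def central_interval(values, mass):
    """Return (low, high) enclosing at least the central `mass` of values.

    Uses a simple rank-based rule: cut an equal tail of (1 - mass) / 2
    from each end of the sorted values, rounding outward.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    tail = (1.0 - mass) / 2.0
    low_index = math.floor(tail * last)
    high_index = math.ceil((1.0 - tail) * last)
    return ordered[low_index], ordered[high_index]


# Toy data: the integers 0..100 stand in for one week's simulated totals.
week_outcomes = list(range(101))
print(central_interval(week_outcomes, 0.75))  # (12, 88)
print(central_interval(week_outcomes, 0.95))  # (2, 98)
```

Repeating this for every focus week, along with the median, gives the three pairs of lines drawn on the trend graph.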

What do the colors mean?

The colors are found in four places:

  1. On the map (like this)
  2. On the state results summary table (like this)
  3. On the poll list (like this)
  4. On lists of polls for an individual state (usually in poll results posts like this)

For the first two cases (map and results table), the colors are coded according to the probability that the Democrat wins based on the actual results of the simulation analysis:

Color    From       To
         100%       99.999%
         99.999%    90%
         90%        60%
         60%        50%
         Exactly 50%
         50%        40%
         40%        10%
         10%        0.001%
         0%         0.001%

For the poll results table and state poll lists, I do something different, because the simulation results are not saved by poll (and the state poll lists don’t involve simulations at all). Instead I do a t-test of the hypothesis that the Democratic result is greater than the Republican result. Technically, I am computing

t test

where d is the normalized Democratic proportion, r is the normalized Republican proportion and n is the number of individuals who responded for either the Democratic or Republican candidate. I then compare this number (t) to a Student’s t distribution to decide the probability of the Democrat winning given the observed poll results. I use the same cut-offs as in the table above.

What is that distorted map?

Beginning 10 May 2008, a cartogram is included in the presidential poll analyses:

The cartogram scales the area of each state according to its electoral vote total. Thus, Alaska is scaled to the same size as Washington D.C.—both have three votes in the Electoral College. The cumulative area covered by each color on the cartogram is an honest representation of the proportion of electoral votes that would be expected if a general election were held.

For more information on cartograms, check out Mark Newman’s web page or Victor L. Vescovo’s book The Atlas of World Statistics (2006, published by Caladan Press).

Why do you assign ties to the Democrat?

In the event of a 269–269 tie in the Electoral College, the selection of the next President and Vice President is specified by the 12th Amendment of the U.S. Constitution. The House of Representatives would vote (using an unorthodox single-vote-per-state method) for the President and the Senate would select the Vice President. Since it seems highly unlikely that the House will be under Republican control after the November election, I assign ties to the Democratic candidate.
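In code, the tie rule amounts to a "greater than or equal to" comparison at 269. The short Python sketch below uses hypothetical function names and assumes each simulated election is stored as the Democrat's electoral vote total out of 538.

```python
TOTAL_ELECTORAL_VOTES = 538


def presidential_winner(dem_ev):
    """Name the winner of one simulated election.

    A 269-269 tie is assigned to the Democrat, as described above.
    """
    rep_ev = TOTAL_ELECTORAL_VOTES - dem_ev
    return "Democrat" if dem_ev >= rep_ev else "Republican"


def democratic_win_probability(dem_ev_outcomes):
    """Fraction of simulated elections won by the Democrat, ties included."""
    wins = sum(1 for ev in dem_ev_outcomes if presidential_winner(ev) == "Democrat")
    return wins / len(dem_ev_outcomes)


print(presidential_winner(269))  # Democrat
print(presidential_winner(268))  # Republican
print(democratic_win_probability([300, 269, 268, 250]))  # 0.5
```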
