# Births by Day of Year

**Posted:** June 7th, 2012 | **Author:** chmullig | **Filed under:** Nerdery, School | **Tags:** birthday, programming, r, statistics

Andrew Gelman has posted twice about certain days being more or less common for births, and lamented the lack of a good, simple visualization showing all 366 days.

Well, I heard his call! My goal was simply a line graph that showed

Finding a decent public dataset proved surprisingly hard – very few people with large datasets seem willing to release full dates of birth (at least for recent data). I considered voter files, but I think the data quality issues would be severe, and they might introduce unknown bias. There’s some data from the CDC’s National Vital Statistics System, but it either contains only year and month, or isn’t available in an especially easy-to-use format. There’s some older data that seemed like the best bet, and which others have used before.

A bit more searching revealed that Google’s BigQuery coincidentally loads the NVSS data as one of its sample datasets. A quick query in the browser tool and an export to CSV, and I had the data I wanted. The NVSS/Google data seems to include the day of the month only for 1/1/1969 through 12/31/1988. More recent data just includes year and month.

```sql
SELECT MONTH, DAY, SUM(record_weight)
FROM [publicdata:samples.natality]
WHERE DAY >= 1 AND DAY <= 31
GROUP BY MONTH, DAY
ORDER BY MONTH, DAY
```

Some basic manipulation (including multiplying 2/29 by 4 per Gelman’s suggestion) and a bit of time to remember all of R’s fancy graphing features yielded this script and this graph:
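As a rough illustration of the Feb 29 adjustment, here is a minimal Python sketch (not the R script linked above). The column names `month`, `day`, and `births` and the sample numbers are assumptions made up for the example; the actual CSV export from BigQuery may label its columns differently.

```python
import csv
import io

def load_daily_births(csv_text, leap_day_factor=4):
    """Parse month,day,births rows and scale Feb 29 so it is comparable to other days."""
    births = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        month, day = int(row["month"]), int(row["day"])
        count = float(row["births"])
        if (month, day) == (2, 29):
            # Feb 29 occurs only in leap years, roughly one year in four.
            count *= leap_day_factor
        births[(month, day)] = count
    return births

# Made-up numbers, only to show the shape of the data (column names are hypothetical).
sample = "month,day,births\n2,28,1000\n2,29,250\n3,1,1010\n"
daily = load_daily_births(sample)
print(daily[(2, 29)])
```

Passing `leap_day_factor=1` leaves Feb 29 untouched, which is the variant a simulation of a random population would want.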

I’ve labeled outliers more than 2.3 standard deviations from the loess curve, as well as Valentine’s Day and Halloween. (Unfortunately the curve doesn’t “wrap” around New Year’s, which it really should.) By far the largest peaks and valleys are July 4th, Christmas, and the days just before and after New Year’s, while Valentine’s Day and Halloween barely register as blips.
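The outlier rule can be sketched in Python as below. To stay dependency-free, this swaps the loess fit for a plain centered moving average, which is a different and cruder smoother; the one thing it does illustrate is a curve that wraps around the end of the year. The window size and the synthetic series are assumptions chosen for the example.

```python
import statistics

def circular_smooth(values, half_window=15):
    """Centered moving average that wraps around the ends of the list."""
    n = len(values)
    width = 2 * half_window + 1
    return [
        sum(values[(i + k) % n] for k in range(-half_window, half_window + 1)) / width
        for i in range(n)
    ]

def flag_outliers(values, threshold=2.3, half_window=15):
    """Return indices whose residual from the smooth curve exceeds `threshold` SDs."""
    smooth = circular_smooth(values, half_window)
    residuals = [v - s for v, s in zip(values, smooth)]
    sd = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]

# Synthetic series: flat at 100 with one deep dip at index 0 ("January 1").
series = [100.0] * 366
series[0] = 60.0
print(flag_outliers(series))  # → [0]
```

Because the window wraps, the dip on the first day also pulls down the smoothed values for the last days of December, which is the behavior a non-wrapping fit lacks.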

It’s possible there are data collection issues causing some of this – perhaps births that occurred on July 4th were recorded over the following few days? The whole thing is less uniform than I expected.

### Simulating Birthday Problem

I also wanted to simulate the birthday problem using these real values instead of the basic assumption of 1/365th per day. In particular, I DON’T multiply Feb 29th by 4 here, so the weights accurately reflect the distribution in a random population. The data covers 1969 to 1988, simply because that is the maximal range available; I haven’t investigated whether choosing this specific interval, as opposed to others, introduces a day-of-week skew.
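One way to start looking at the day-of-week question is to count how often a given calendar date lands on each weekday within the interval. The Python sketch below does only that; the function name and defaults are illustrative assumptions, and it says nothing about how strongly weekdays affect birth counts.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def weekday_counts(month, day, first_year=1969, last_year=1988):
    """Count how many times month/day falls on each weekday in the year range."""
    counts = Counter()
    for year in range(first_year, last_year + 1):
        try:
            counts[WEEKDAYS[date(year, month, day).weekday()]] += 1
        except ValueError:
            continue  # the date does not exist that year (Feb 29 in a non-leap year)
    return counts

july_4 = weekday_counts(7, 4)
for name in WEEKDAYS:
    print(name, july_4.get(name, 0))
```

With 20 years spread over 7 weekdays, the counts for any single date cannot all be equal, so some residual weekday imbalance per date is unavoidable in this range.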

I ran a basic simulation of 30,000 trials for each group size from 0 to 75. The results come out very close to the synthetic/theoretical ones, as you can see in this graph (red is theoretical, black is real data). Of note, a match seems to be about 0.15 percentage points more likely on average with the real data than with the synthetic for groups of size 10-30 (the actual slope).
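A simulation along these lines can be sketched in Python as follows. This is not the original code: the function names are made up, and the example uses uniform weights because the real per-day counts are not reproduced here. Swapping in the observed births per day as `weights` would give the "real" column.

```python
import random
from itertools import accumulate

def theoretical_match_probability(n, days=365):
    """Exact P(at least one shared birthday) for n people and equally likely days."""
    p_no_match = 1.0
    for k in range(n):
        p_no_match *= (days - k) / days
    return 1.0 - p_no_match

def simulated_match_probability(n, weights, trials=30_000, rng=None):
    """Estimate P(match) by drawing n birthdays per trial, weighted by day frequency."""
    rng = rng or random.Random()
    days = range(len(weights))
    cum_weights = list(accumulate(weights))  # precompute once for speed
    hits = 0
    for _ in range(trials):
        sample = rng.choices(days, cum_weights=cum_weights, k=n)
        if len(set(sample)) < n:  # a repeated day means a shared birthday
            hits += 1
    return hits / trials

uniform = [1.0] * 365  # stand-in weights; real data would use births per day
estimate = simulated_match_probability(23, uniform, rng=random.Random(2012))
print(f"simulated:   {estimate:.4f}")
print(f"theoretical: {theoretical_match_probability(23):.4f}")
```

With 30,000 trials, the standard error of a single estimate near 50% is about 0.3 percentage points, so differences of a few tenths of a point between two simulated columns are within sampling noise.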

I’ve also uploaded a graph of P(Match using Real) – P(Match using Synthetic).

If you’re curious about the raw results, here’s the most exciting part:

| n | real | synthetic | diff |
|---|---|---|---|
| 10 | 11.59% | 11.41% | 0.18% |
| 11 | 14.08% | 14.10% | -0.02% |
| 12 | 16.84% | 16.77% | 0.08% |
| 13 | 19.77% | 19.56% | 0.21% |
| 14 | 22.01% | 22.06% | -0.05% |
| 15 | 25.74% | 25.17% | 0.57% |
| 16 | 28.24% | 27.99% | 0.25% |
| 17 | 31.81% | 31.71% | 0.10% |
| 18 | 34.75% | 33.76% | 0.98% |
| 19 | 37.89% | 37.90% | -0.01% |
| 20 | 40.82% | 40.82% | 0.00% |
| 21 | 44.48% | 44.57% | -0.09% |
| 22 | 47.92% | 47.45% | 0.47% |
| 23 | 50.94% | 50.80% | 0.14% |
| 24 | 53.89% | 53.79% | 0.10% |
| 25 | 57.07% | 56.76% | 0.31% |
| 26 | 59.74% | 59.75% | -0.01% |
| 27 | 62.61% | 63.00% | -0.40% |
| 28 | 65.88% | 65.26% | 0.63% |
| 29 | 68.18% | 67.85% | 0.32% |
| 30 | 70.32% | 70.49% | -0.18% |
| 31 | 73.00% | 72.73% | 0.27% |
| 32 | 75.37% | 75.65% | -0.28% |
| 33 | 77.59% | 77.63% | -0.04% |
| 34 | 79.67% | 78.86% | 0.81% |
| 35 | 81.44% | 81.24% | 0.19% |
| 36 | 83.53% | 82.79% | 0.74% |
| 37 | 84.92% | 84.52% | 0.41% |
| 38 | 86.67% | 86.62% | 0.05% |
| 39 | 87.70% | 88.09% | -0.39% |
| 40 | 89.07% | 88.88% | 0.19% |
| 41 | 90.16% | 90.48% | -0.32% |

### Update

Gelman commented on the graph and had some constructive feedback. I made a few cosmetic changes in response: rescaling so it’s relative to the mean, removing the trend line, and switching to 14 months (tacking December onto the beginning and January onto the end). Updated graph:
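The rescaling and the 14-month layout described in this update can be sketched in Python as below. Two assumptions are baked in: "relative to the mean" is taken to be a ratio where 1.0 is an average day, and the input is a list of 366 daily values in calendar order. The function names and the placeholder data are hypothetical.

```python
def relative_to_mean(values):
    """Rescale so that 1.0 corresponds to the average day."""
    mean = sum(values) / len(values)
    return [v / mean for v in values]

def wrap_to_14_months(values, december_days=31, january_days=31):
    """Tack December onto the beginning and January onto the end."""
    return values[-december_days:] + values + values[:january_days]

# Placeholder data: day index as the value, so positions are easy to check.
daily = [float(i) for i in range(1, 367)]
wrapped = wrap_to_14_months(relative_to_mean(daily))
print(len(wrapped))  # → 428
```

Repeating December and January at both ends keeps the holiday dip around New Year's in the middle of a continuous stretch of the plot instead of splitting it across the two edges.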

### Comments

Nice! Could you please multiply the rate for Feb 29 by 4? Also, could you please tell me the range of dates/years represented by this dataset? Then I can post your graph, link to it, and suggest improvements. Thanks.

That was fast, I was just writing you an email.

I’ve multiplied the 29th by 4.

The coverage is 1/1/1969 to 12/31/1988.

I like your visualization and may end up using your R script.

Check out these posts, which might also interest you:

http://punkrockor.wordpress.com/2012/05/16/the-birthday-problem/

http://punkrockor.wordpress.com/2012/05/18/the-birthday-problem-with-a-mating-season-a-simulation-approach/

I actually did some birthday problem simulations (25k repeats for n of 0:80). So far it seems the theoretical case holds pretty well. It looks like on average the probability of a match for a group of size n<=50 in the real data is .05% higher than the simplified simulation. I’ll try to do a full update on the birthday problem later…

Post updated with birthday problem. Thanks Ryan!

This is very interesting. Great job!

Great job… Very interesting…


Updated with the new graph at the bottom!

Fascinating graph!

1) Since births peak in summer, it’s clear that the practice of starting all brand new residents on July 1 is extremely unfortunate.

2) It would be really interesting to see how this graph has changed in recent years, because tax incentives for giving birth before midnight on 12/31 have greatly increased since 1988, due to the expansion of the Earned Income Credit as part of the mid-1990s welfare reform and then subsequent increases in the child tax credit over the past decade. For some parents, giving birth on 12/31 will increase their tax refund (federal and state combined) by $5,000, and electronic tax administration means that they can get that money within a month of giving birth.

See: Stacy Dickert-Conlin & Amitabh Chandra, Taxes and the Timing of Births, 107 J. Pol. Econ. 161 (1999)

Also, Teny Maghakian and Lisa Schulkind have a working paper with more recent data

What a Difference a Day Makes: A New Look at Child Tax Benefits and the Timing of Births,

https://docs.google.com/viewer?url=http%3A%2F%2Flisaschulkind.weebly.com%2Fuploads%2F8%2F7%2F7%2F1%2F8771915%2Fschulkind_paper2.pdf

Also, of course, the rate of C-sections has grown considerably in recent decades, so it would be really interesting to see the change graphically.

I’d really love to find better data. I think the CA birth database has 1960-2010, but it’s $200/year.

They also have death files for 1970-2010, but that’s ~$150/year.

Nice! Thanks for posting this.

[...] All of the above assumes that every day is equally likely to be a birthday (uniform distribution). But what if it isn’t actually uniform? Take a look at this page, which uses real data to find the probability of each day [...]

Is the data strictly births in the U.S.? I presume it is, since it’s called a National database.

Correct, it’s strictly US data.

Of course there are all sorts of minor ambiguities, many of which changed over time. For example, how do you account for tourists? What about a Canadian who’s here for college?

This graph has a clear high frequency cycle with a period of about one week. If the data were from only one year then this could be explained by day of the week skew. But since it is averaged across 50 years, each day of the month should fall roughly the same number of times on each day of the week.

Do you have any explanation for this cycle?