Births by Day of Year

Posted: June 7th, 2012 | Filed under: Nerdery, School

Andrew Gelman has posted twice about certain days being more or less common for births, and lamented the lack of a good, simple visualization showing all 366 days.

Well, I heard his call! My goal was simply a line graph that showed births for each day of the year.

Finding a decent public dataset proved surprisingly hard: very few people with large datasets seem willing to release full dates of birth (at least for recent data). I considered voter files, but I think the data quality issues would be severe and might introduce unknown bias. There’s some data from the CDC’s National Vital Statistics System (NVSS), but it either contains only year and month or isn’t available in an especially easy-to-use format. There’s also some older data that seemed like the best bet, and which others have used before.

A bit more searching revealed that Google’s BigQuery coincidentally includes the NVSS data as one of its sample datasets. A quick query in the browser tool and an export to CSV, and I had the data I wanted. The NVSS/Google data seems to include the day of the month only for 1/1/1969 through 12/31/1988; more recent data includes just the year and month.

SELECT MONTH, DAY, SUM(record_weight)
FROM [publicdata:samples.natality]
WHERE DAY >= 1 AND DAY <= 31
GROUP BY MONTH, DAY
ORDER BY MONTH, DAY

Some basic manipulation (including multiplying 2/29 by 4 per Gelman’s suggestion) and a bit of time to remember all of R’s fancy graphing features yielded this script and this graph:

See update at bottom!
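The basic manipulation described above can be sketched roughly as follows. This is a minimal Python sketch, not the original R script: the column names (`month`, `day`, `births`), the helper names, and the three sample rows are all made up for illustration and are not real NVSS values. It scales the Feb 29 count by 4 and then expresses each day relative to the mean.

```python
import csv
import io

# Hypothetical stand-in for the CSV exported from BigQuery. The column
# names and the counts are invented for illustration, not real NVSS data.
SAMPLE_CSV = """month,day,births
2,28,190000
2,29,46000
3,1,188000
"""


def load_daily_births(csv_text, leap_day_factor=4):
    """Return {(month, day): births}, scaling Feb 29 by leap_day_factor."""
    births = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["month"]), int(row["day"]))
        count = float(row["births"])
        if key == (2, 29):
            # Feb 29 occurs in only one year out of four, so scale it up
            # to make it comparable with the other days.
            count *= leap_day_factor
        births[key] = count
    return births


def relative_to_mean(births):
    """Divide every daily count by the mean daily count."""
    mean = sum(births.values()) / len(births)
    return {key: value / mean for key, value in births.items()}


daily = load_daily_births(SAMPLE_CSV)
scaled = relative_to_mean(daily)
for (month, day), ratio in sorted(scaled.items()):
    print(f"{month:02d}/{day:02d}  {ratio:.3f}")
```

Passing `leap_day_factor=1` leaves Feb 29 unscaled, which is what you would want when the counts are used as sampling weights for a random population.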

I’ve labeled outliers more than 2.3 standard deviations from the loess curve, as well as Valentine’s Day and Halloween. (Unfortunately the loess fit doesn’t “wrap” around New Year’s, which it really should.) By far the largest peaks and valleys are July 4th, Christmas, and the days just before and after New Year’s, while Valentine’s Day and Halloween barely register as blips.
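One way to get a smoother that does wrap around New Year’s is sketched below. To be clear, this swaps loess for a plain circular moving average, so it is not what the graph uses; the function names, the 14-day half window, and the synthetic series are all assumptions for illustration. Only the 2.3 standard deviation threshold comes from the post.

```python
import math
import statistics


def circular_moving_average(values, half_window):
    """Centered moving average whose window wraps from the end to the start."""
    n = len(values)
    width = 2 * half_window + 1
    return [
        sum(values[(i + offset) % n]
            for offset in range(-half_window, half_window + 1)) / width
        for i in range(n)
    ]


def flag_outliers(values, half_window=14, threshold=2.3):
    """Indices whose residual from the smoother exceeds threshold * SD."""
    smooth = circular_moving_average(values, half_window)
    residuals = [v - s for v, s in zip(values, smooth)]
    sd = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]


# Synthetic series (not real data): a gentle seasonal wave with one dip on
# the first day of the year and another about a week before the end.
series = [1.0 + 0.05 * math.sin(2 * math.pi * d / 366) for d in range(366)]
series[0] -= 0.20
series[358] -= 0.25

print(flag_outliers(series))
```

Because the window indexes modulo the series length, days at the very start and end of the year are smoothed against each other instead of against a truncated window.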

It’s possible there are data collection issues causing some of this. Perhaps births that occurred on July 4th were recorded over the following few days? The whole thing is less uniform than I expected.

Simulating Birthday Problem

I also wanted to simulate the birthday problem using these real values, instead of the basic assumption of 1/365th per day. In particular, I DON’T multiply Feb 29th by 4 here, so the weights accurately reflect the distribution in a random population. The data covers 1969 to 1988, the maximal range available; I haven’t investigated whether choosing this specific interval, as opposed to others, introduces a day-of-week skew.
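The day-of-week question can at least be checked from the calendar alone. Since 20 years is not a multiple of 7, no date can land on every weekday equally often over 1969 to 1988. The sketch below (Python, with a hypothetical helper name) counts how often each calendar date falls on each weekday in that interval; it says nothing about how large the resulting effect on births is.

```python
from collections import Counter
from datetime import date, timedelta


def weekday_counts(first, last):
    """Map (month, day) -> Counter of weekday numbers (Monday=0) in [first, last]."""
    counts = {}
    current = first
    while current <= last:
        key = (current.month, current.day)
        counts.setdefault(key, Counter())[current.weekday()] += 1
        current += timedelta(days=1)
    return counts


counts = weekday_counts(date(1969, 1, 1), date(1988, 12, 31))

names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
july4 = counts[(7, 4)]
print({names[d]: july4[d] for d in range(7)})
```

If weekday strongly affects births (for example, fewer on weekends), the uneven counts printed here are one way an artificial date-level pattern could leak into a 20-year total.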

I ran a basic simulation of 30,000 trials for each group size from 0 to 75. The result is very close to the synthetic/theoretical curve, as you can see in this graph (red is theoretical, black is real data). Of note, for groups of size 10-30 (the steep part of the curve) a match with the real data seems to average about 0.15 percentage points more likely than with the synthetic.
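A simulation along these lines might look like the following Python sketch. It is not the original code: the function names are hypothetical, and the weights are a placeholder (365 equal days plus a quarter-weight Feb 29) standing in for the real per-day birth counts, which would be passed in unscaled. The theoretical curve uses the standard closed form for 365 equally likely days.

```python
import random


def theoretical_match(n, days=365):
    """P(at least two of n people share a birthday), uniform over `days`."""
    if n > days:
        return 1.0
    p_no_match = 1.0
    for i in range(n):
        p_no_match *= (days - i) / days
    return 1.0 - p_no_match


def simulated_match(n, weights, trials=30000, rng=None):
    """Estimate the match probability when days are drawn with `weights`."""
    rng = rng or random.Random()
    days = range(len(weights))
    hits = 0
    for _ in range(trials):
        group = rng.choices(days, weights=weights, k=n)
        if len(set(group)) < n:  # a repeated day means a shared birthday
            hits += 1
    return hits / trials


# Placeholder weights, NOT the NVSS counts: 365 equal days plus Feb 29
# at one quarter of the weight of an ordinary day.
placeholder_weights = [1.0] * 365 + [0.25]

rng = random.Random(0)
for n in (10, 23, 40):
    sim = simulated_match(n, placeholder_weights, trials=20000, rng=rng)
    print(f"n={n:2d}  simulated={sim:.4f}  theoretical={theoretical_match(n):.4f}")
```

With a few tens of thousands of trials the sampling noise is a few tenths of a percentage point, which is the same order as the differences being measured, so small gaps between the two curves should be read with that in mind.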

Birthday Problem - Real vs Synthetic

I’ve also uploaded a graph of the P(Match using Real) – P(Match using Synthetic).

If you’re curious about the raw results, here’s the most exciting part:

n real synthetic diff
10 11.59% 11.41% 0.18%
11 14.08% 14.10% -0.02%
12 16.84% 16.77% 0.08%
13 19.77% 19.56% 0.21%
14 22.01% 22.06% -0.05%
15 25.74% 25.17% 0.57%
16 28.24% 27.99% 0.25%
17 31.81% 31.71% 0.10%
18 34.75% 33.76% 0.98%
19 37.89% 37.90% -0.01%
20 40.82% 40.82% 0.00%
21 44.48% 44.57% -0.09%
22 47.92% 47.45% 0.47%
23 50.94% 50.80% 0.14%
24 53.89% 53.79% 0.10%
25 57.07% 56.76% 0.31%
26 59.74% 59.75% -0.01%
27 62.61% 63.00% -0.40%
28 65.88% 65.26% 0.63%
29 68.18% 67.85% 0.32%
30 70.32% 70.49% -0.18%
31 73.00% 72.73% 0.27%
32 75.37% 75.65% -0.28%
33 77.59% 77.63% -0.04%
34 79.67% 78.86% 0.81%
35 81.44% 81.24% 0.19%
36 83.53% 82.79% 0.74%
37 84.92% 84.52% 0.41%
38 86.67% 86.62% 0.05%
39 87.70% 88.09% -0.39%
40 89.07% 88.88% 0.19%
41 90.16% 90.48% -0.32%

Update

Gelman commented on the graph and had some constructive feedback. I made a few cosmetic changes in response: I rescaled it so it’s relative to the mean, removed the trend line, and switched it to 14 months (tacking December onto the beginning and January onto the end). Updated graph:


16 Comments on “Births by Day of Year”

  1. Andrew Gelman said at 9:10 pm on June 7th, 2012:

    Nice! Could you please multiply the rate for Feb 29 by 4? Also, could you please tell me the range of dates/years represented by this dataset? Then I can post your graph, link to it, and suggest improvements. Thanks.

  2. chmullig said at 9:26 pm on June 7th, 2012:

    That was fast, I was just writing you an email.

    I’ve multiplied the 29th by 4.

    The coverage is 1/1/1969 to 12/31/1988.

  3. Ryan J. O'Neil said at 10:17 am on June 8th, 2012:

    I like your visualization and may end up using your R script.

    Check out these posts, which might also interest you:

    http://punkrockor.wordpress.com/2012/05/16/the-birthday-problem/
    http://punkrockor.wordpress.com/2012/05/18/the-birthday-problem-with-a-mating-season-a-simulation-approach/

  4. chmullig said at 10:22 am on June 8th, 2012:

    I actually did some birthday problem simulations (25k repeats for n of 0:80). So far it seems the theoretical case holds pretty well. It looks like on average the probability of a match for a group of size n<=50 in the real data is .05% higher than the simplified simulation. I’ll try to do a full update on the birthday problem later…

  5. chmullig said at 5:32 pm on June 9th, 2012:

    Post updated with birthday problem. Thanks Ryan!

  6. Laura McLay said at 3:28 pm on June 10th, 2012:

    This is very interesting. Great job!

  7. Fareez Ahamed said at 5:46 am on June 12th, 2012:

    Great job… Very interesting…

  8. Simple graph WIN: the example of birthday frequencies « Statistical Modeling, Causal Inference, and Social Science said at 9:55 am on June 12th, 2012:

    [...] From Chris Mulligan: [...]

  9. chmullig said at 3:55 pm on June 12th, 2012:

    Updated with the new graph at the bottom!

  10. Mary O'Keeffe said at 10:26 pm on June 12th, 2012:

    Fascinating graph!

    1) Since births peak in summer, it’s clear that the practice of starting all brand new residents on July 1 is extremely unfortunate.

    2) It would be really interesting to see how this graph has changed in recent years, because tax incentives for giving birth before midnight on 12/31 have greatly increased since 1988, due to the expansion of the Earned Income Credit as part of the mid-1990s welfare reform and then subsequent increases in the child tax credit over the past decade. For some parents, giving birth on 12/31 will increase their tax refund (federal and state combined) by $5,000, and electronic tax administration means that they can get that money within a month of giving birth.

    See: Stacy Dickert-Conlin & Amitabh Chandra, Taxes and the Timing of Births, 107 J. Pol. Econ. 161 (1999)
    Also, Teny Maghakian and Lisa Schulkind have a working paper with more recent data
    What a Difference a Day Makes: A New Look at Child Tax Benefits and the Timing of Births,
    https://docs.google.com/viewer?url=http%3A%2F%2Flisaschulkind.weebly.com%2Fuploads%2F8%2F7%2F7%2F1%2F8771915%2Fschulkind_paper2.pdf

    Also, of course, the rate of C-sections has grown considerably in recent decades, so it would be really interesting to see the change graphically.

  11. chmullig said at 12:42 am on June 13th, 2012:

    I’d really love to find better data. I think the CA birth database has 1960-2010, but it’s $200/year.

    They also have death files for 1970-2010, but that’s ~$150/year.

  12. Chuck said at 8:08 am on June 13th, 2012:

    Nice! Thanks for posting this.

  13. The probability that people share a birthday | WJ's Sandbox said at 7:13 am on June 14th, 2012:

    [...] All of the above assumes that every day is equally likely to be a birthday (uniform distribution). But what if it really isn’t uniform? Take a look at this page, which uses real data to work out the probability of each day as well. [...]

  14. kcvearner said at 9:00 pm on June 14th, 2012:

    Is the data strictly births in the U.S.? I presume it is since it’s called a National database.

  15. chmullig said at 9:06 pm on June 14th, 2012:

    Correct, it’s strictly US data.

    Of course there are all sorts of minor ambiguities, many of which changed over time. Things like how you account for tourists? What about a Canadian who’s here for college?

  16. Anonymouse said at 5:11 pm on February 19th, 2014:

    This graph has a clear high frequency cycle with a period of about one week. If the data were from only one year then this could be explained by day of the week skew. But since it is averaged across 50 years, each day of the month should fall roughly the same number of times on each day of the week.

    Do you have any explanation for this cycle?

