Births by Day of Year

Posted: June 7th, 2012 | Filed under: Nerdery, School | 17 Comments »

Andrew Gelman has posted twice about certain days being more or less common for births, and lamented the lack of a good, simple visualization showing all 366 days.

Well, I heard his call! My goal was simply a line graph that showed

Finding a decent public dataset proved surprisingly hard – very few people with large datasets seem willing to release full date of birth (at least for recent data). I considered voter files, but I think the data quality issues would be severe, and might present unknown bias. There’s some data from the CDC’s National Vital Statistics System, but it either only contains year and month, or isn’t available in an especially easy to use format. There’s some older data that seemed like the best bet, and which others have used before.

A bit more searching revealed that Google’s BigQuery coincidentally includes the NVSS data as one of its sample datasets. A quick query in the browser tool, an export to CSV, and I had the data I wanted. The NVSS/Google data seems to include the day of the month only for 1/1/1969 through 12/31/1988; more recent data includes just the year and month.

SELECT MONTH, DAY, SUM(record_weight)
FROM [publicdata:samples.natality]
WHERE DAY >= 1 AND DAY <= 31
GROUP BY MONTH, DAY
ORDER BY MONTH, DAY

Some basic manipulation (including multiplying 2/29 by 4 per Gelman’s suggestion) and a bit of time to remember all of R’s fancy graphing features yielded this script and this graph:
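For anyone who wants to replay that manipulation outside R, here’s a minimal Python sketch of the leap-day adjustment. It is not the R script itself: the function names and the (month, day, births) row layout are assumptions modeled on the columns the query returns, and the rescaling helper is just one plausible way to put the days on a common scale.

```python
def adjust_leap_day(rows, factor=4):
    """Scale the Feb 29 count so it is comparable to ordinary days.

    rows: iterable of (month, day, births) tuples (assumed layout).
    Returns a dict keyed by (month, day).
    """
    adjusted = {}
    for month, day, births in rows:
        if (month, day) == (2, 29):
            # Feb 29 occurs roughly once every four years.
            births = births * factor
        adjusted[(month, day)] = births
    return adjusted


def relative_to_mean(counts):
    """Express each day's count as a ratio to the mean daily count."""
    mean = sum(counts.values()) / len(counts)
    return {key: value / mean for key, value in counts.items()}
```

Feeding the CSV export through `adjust_leap_day` and then `relative_to_mean` gives one value per calendar day, with 1.0 meaning an exactly average day.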

See update at bottom!

I’ve labeled outliers more than 2.3 standard deviations from the loess curve (which, unfortunately, doesn’t “wrap” around New Year’s the way it really should…), as well as Valentine’s and Halloween. You can see that by far the largest peaks and valleys are July 4th, Christmas, and just before/after New Year’s, while Valentine’s and Halloween barely register as blips.

It’s possible there are data collection issues causing some of this – perhaps births that occurred on July 4th were recorded over the following few days? The whole thing is much less uniform than I expected.

Simulating Birthday Problem

I also wanted to simulate the birthday problem using these real values instead of the basic assumption of 1/365th per day. In particular, I DON’T multiply Feb 29th by 4 here, so the weights accurately reflect the distribution in a random population. The data covers 1969 to 1988. I haven’t investigated whether selecting this specific interval, as opposed to others, introduces a day-of-week skew; it’s just the maximal range available.

I did a basic simulation of 30,000 trials for each group size from 0 to 75. It works out very close to the synthetic/theoretical, as you can see in this graph (red is theoretical, black is real data). Of note, a match seems to be about 0.15 percentage points more likely on average with the real data than with the synthetic for groups of size 10-30 (the sloped part of the curve).

Birthday Problem - Real vs Synthetic

I’ve also uploaded a graph of the P(Match using Real) – P(Match using Synthetic).

If you’re curious about the raw results, here’s the most exciting part:

n real synthetic diff
10 11.59% 11.41% 0.18%
11 14.08% 14.10% -0.02%
12 16.84% 16.77% 0.08%
13 19.77% 19.56% 0.21%
14 22.01% 22.06% -0.05%
15 25.74% 25.17% 0.57%
16 28.24% 27.99% 0.25%
17 31.81% 31.71% 0.10%
18 34.75% 33.76% 0.98%
19 37.89% 37.90% -0.01%
20 40.82% 40.82% 0.00%
21 44.48% 44.57% -0.09%
22 47.92% 47.45% 0.47%
23 50.94% 50.80% 0.14%
24 53.89% 53.79% 0.10%
25 57.07% 56.76% 0.31%
26 59.74% 59.75% -0.01%
27 62.61% 63.00% -0.40%
28 65.88% 65.26% 0.63%
29 68.18% 67.85% 0.32%
30 70.32% 70.49% -0.18%
31 73.00% 72.73% 0.27%
32 75.37% 75.65% -0.28%
33 77.59% 77.63% -0.04%
34 79.67% 78.86% 0.81%
35 81.44% 81.24% 0.19%
36 83.53% 82.79% 0.74%
37 84.92% 84.52% 0.41%
38 86.67% 86.62% 0.05%
39 87.70% 88.09% -0.39%
40 89.07% 88.88% 0.19%
41 90.16% 90.48% -0.32%
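The comparison above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code that produced the table: `exact_uniform_match` is the textbook closed form for equally likely days, while `simulated_match` is a Monte Carlo estimate that takes any list of per-day weights (for example, the raw daily birth counts). Both function names and the `weights` argument are made up for this example.

```python
import random


def exact_uniform_match(n, days=365):
    """Exact P(at least two of n people share a birthday), uniform days."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def simulated_match(n, weights, trials=30_000, rng=None):
    """Monte Carlo estimate of the match probability for weighted days.

    weights: one non-negative number per calendar day (e.g. birth counts).
    """
    rng = rng or random.Random()
    days = range(len(weights))
    hits = 0
    for _ in range(trials):
        drawn = rng.choices(days, weights=weights, k=n)
        if len(set(drawn)) < n:  # a repeated day means a shared birthday
            hits += 1
    return hits / trials
```

With 30,000 trials the standard error near the middle of the curve is roughly 0.3 percentage points, so individual rows in a table like the one above are expected to bounce around by about that much; `exact_uniform_match(23)` comes out at about 0.507.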

Update

Gelman commented on the graph and had some constructive feedback. I made a few cosmetic changes in response: rescaling so it’s relative to the mean, removing the trend line, and switching to 14 months (tacking December onto the beginning, and January onto the end). Updated graph:


Sort with sleep

Posted: June 16th, 2011 | Filed under: Nerdery | No Comments »

Inspired by an Ars thread that was inspired by a 4chan thread found on reddit, here’s an interesting sort idea for integers.

Basically, you sort a list of integers by spawning a new thread or process for each element; each one sleeps for the value of its element, then prints that element. Here’s the original bash example, but I’d love to see it in other crazy languages.

#!/bin/bash
function f() {
    sleep "$1"
    echo "$1"
}
while [ -n "$1" ]
do
    f "$1" &
    shift
done
wait
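In that spirit, here’s a rough Python take on the same idea, using one thread per element. It’s a sketch, not a port of the script above: the `scale` parameter is an addition for illustration (it shrinks each sleep so you aren’t waiting whole seconds), and the function collects results in a list instead of printing them. Like the bash version, it only makes sense for non-negative numbers, and values that are close together can come out in the wrong order if the scheduler is busy.

```python
import threading
import time


def sleep_sort(values, scale=0.05):
    """Return values in the order their threads wake up.

    Each thread sleeps for value * scale seconds, then records its value.
    """
    result = []
    lock = threading.Lock()

    def worker(value):
        time.sleep(value * scale)
        with lock:  # keep appends from different threads from interleaving
            result.append(value)

    threads = [threading.Thread(target=worker, args=(v,)) for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # equivalent of the bash `wait`
    return result


if __name__ == "__main__":
    print(sleep_sort([3, 1, 4, 2]))
```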

Starcraft II: Lost Viking “macro” for mac

Posted: June 2nd, 2011 | Filed under: Gaming | 1 Comment »

So Starcraft II is a fantastic game, and includes achievements. I’ve turned into a bit of a SC2 achievement whore (Currently: 3350). I long ago completed every achievement in the single player campaign except 4: The Lost Viking. It’s a stupid arcade game within the game! It doesn’t matter at all! Yet I was tantalizingly close to 100%, so finally I decided to tackle it.

First, you can read all sorts of strategies. Basically they come down to this: immediately get 2 side missiles, then get 2 drones. Whenever you lose a drone, replace it the first chance you get. Then everything else should be bombs. (Any drop cycles through a sequence, so wait until it’s the one you want.) Use bombs liberally to prevent death and loss of drones. They make you invincible for a few seconds, in addition to clearing crap away. You’ll have basically all the bombs you need. Press space as fast as you can to shoot faster.

So it’s mostly a question of staying alive, and keeping your drones alive. Except for the bosses/mini bosses there’s not a ton of strategy. After a few attempts yesterday I beat it once, then hit 245k points, then 279k points (getting Silver). That only left Gold (500k points), and I didn’t have the energy. However, I knew there was a macro on the Windows side to hit space for you. In that case all you’d have to do is navigate around, which isn’t too hard. A few attempts to track one down for the Mac failed. Mac OS X is nice, but it definitely lacks some of the rom emulation tools so popular on Windows. I wanted an OS X program to simply press the space bar over and over and over again forever, quickly, while not interfering with the rest of the system. It was useless if I couldn’t

However I figured there must be a better option, perhaps Objective-C? My Obj-C is super rusty, but I stumbled across a StackOverflow hint suggesting how easy it would be to do in plain C. Here’s the C program I wrote that uses Quartz events (the CGEvent functions) to simulate pressing the space bar every .05 seconds (in retrospect that’s probably way faster than it needs to be; it could probably easily be .1 seconds).

#include <stdio.h>
#include <ApplicationServices/ApplicationServices.h>
#include <unistd.h>

int main (int argc, const char * argv[]) {
    /* Virtual key code 49 is the space bar. */
    CGEventRef spaceDown = CGEventCreateKeyboardEvent (NULL, (CGKeyCode)49, true);
    CGEventRef spaceUp = CGEventCreateKeyboardEvent (NULL, (CGKeyCode)49, false);
    int sleepTime = 50000; /* microseconds between presses (0.05 s) */
    printf("Pressing space every %d microseconds\n", sleepTime);
    sleep(2); /* short delay before the key presses begin */

    while (1) {
        /* Post a key-down/key-up pair, then pause. */
        CGEventPost(kCGHIDEventTap, spaceDown);
        CGEventPost(kCGHIDEventTap, spaceUp);
        usleep(sleepTime);
    }

    /* Never reached: the loop runs until the process is killed (e.g. Ctrl-C). */
    CFRelease(spaceDown);
    CFRelease(spaceUp);
    return 0;
}

It can be copied to a file like “cstroker.c” and compiled with this gcc command (you may need to install Xcode if you don’t already have it) from Terminal.app:

gcc -o cstroker cstroker.c -O -Wall -framework ApplicationServices

You then execute it by simply calling ./cstroker

Update: Because some folks asked I’ve uploaded the binary. Might work for you, in which case you can skip the gcc compilation.

Finally nailed Lost Viking Gold!


MITRE Name Matching Challenge

Posted: February 17th, 2011 | Filed under: Nerdery | No Comments »

My illustrious former colleague Ryan is now over at MITRE doing operations research and who knows what. He pointed me toward the MITRE Challenge.

The MITRE Challenge™ is an ongoing, open competition to encourage innovation in technologies of interest to the federal government. The current competition involves multicultural person name matching, a technology whose uses include vetting persons against a watchlist (for screening, credentialing, and other purposes) and merging or deduplication of records in databases. Person name matching can also be used to improve document searches, social network analysis, and other tasks in which the same person might be referred to by multiple versions or spellings of a name.

Basically they give you a small list of target names, and a ginormous list of candidate names, and for each target name you return up to 500 possible matches from the candidate name list. Currently the matching software we built at Polimetrix back in 2005-2007 is doing pretty well. It was designed for full voter records, but I broke out the name component by itself. The result is pretty awesome. Currently we’re ranked #1 at 72.038. Below us are a few teams, including Intaka at 68.801 and Beethoven at 58.501.


Stackoverflow overflow

Posted: February 9th, 2011 | Filed under: Nerdery | 3 Comments »

Recently I’ve gotten a bit obsessed with stackoverflow.com. It’s a programming Q&A site. You can ask questions, you can answer and comment on them. However they have a sick twist – people vote on everything. They vote on your questions, answers, comments. You earn reputation points when your content is voted up, and you lose points when it’s voted down. You also earn badges, like gaming achievements.

They’ve recently started a whole bunch of related sites under the stackexchange brand. Same model and software, but with different subjects. So far there are already more than I care to count with only very spurious differentiation, but a few highlights include gaming, cooking, english, programming (as a profession), power users, sysadmin, linux, ubuntu, and a lot more.

Here’s my badge of honor. Right now I have 674 rep and 10 badges on Stackoverflow, and 261/4 on gaming (plus ~100 on a bunch of the other sites, just for signing up). That’s my profile image, which should update automatically!

Stack Overflow profile for chmullig at Stack Overflow, Q&A for professional and enthusiast programmers

It’s amazing how satisfying and competitive the Q&A system ends up being. I find myself less and less interested in any other medium for asking or answering questions like the kind on Stackoverflow. Anything else is slow and there’s no rep, so what’s the point?


Programming Challenges

Posted: November 11th, 2010 | Filed under: Nerdery | No Comments »

I’m a fan of puzzles, programming and learning, so I’ve always enjoyed The Python Challenges. Recently my coworkers Delia, Chris and I came up with the idea of doing some of those within the company to help ourselves and our coworkers become more familiar with Python and R (and to a lesser extent SQL and other languages).

The end result is the YG Challenge, where we’ll be posting a few problems a week in at least R & Python, then solving them. Week 1 is up, and we have some great ideas for the future. Intended for our coworkers, it’s public because why not! Feel free to take a stab at solving them, especially if you haven’t used either of those languages before.
