Intro to Data Science – Kaggle Leaderboard

Posted: November 7th, 2012 | Filed under: Nerdery, School | 3 Comments

This semester I’m auditing Rachel Schutt’s Intro to Data Science class. I originally registered for it, but at the end of the add/drop period I decided I wasn’t confident in my academic background and wasn’t sure how much work it would require. In retrospect, dropping it was a mistake. However, I have been attending class when I can (about half the time).

The final project, which accounts for most of the grade, is a Kaggle competition. It’s based on an earlier competition, and the goal is to develop a model to grade standardized test essays (approximately middle school level). Because I’m auditing, Rachel asked me not to submit. Still, my cross-validation suggests my model (linear regressions with some neat NLP-derived features) is besting the public leaderboard (Quadratic Weighted Kappa Error Measure of .75), but who knows.
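For anyone unfamiliar with the metric, here is a minimal sketch of quadratic weighted kappa in plain Python. It is an illustration of the standard formula, not the competition’s official scorer and not the model code mentioned above; the function name and its parameters are assumptions made up for this example, and it expects integer ratings.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_rating=None, max_rating=None):
    """Agreement between two equal-length lists of integer ratings.

    Returns 1.0 for perfect agreement, about 0.0 for chance-level
    agreement, and negative values for worse-than-chance agreement.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same number of items")
    if not rater_a:
        raise ValueError("need at least one rated item")

    # Default the rating range to whatever values actually occur.
    if min_rating is None:
        min_rating = min(min(rater_a), min(rater_b))
    if max_rating is None:
        max_rating = max(max(rater_a), max(rater_b))

    n_ratings = max_rating - min_rating + 1
    n_items = len(rater_a)
    if n_ratings == 1:
        return 1.0  # only one possible rating, so the raters always agree

    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    numerator = 0.0
    denominator = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            # Quadratic penalty: larger disagreements cost more.
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            # Count expected in this cell if the raters were independent.
            expected = hist_a[i] * hist_b[j] / n_items
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0:
        return 1.0  # every item received the same single rating
    return 1.0 - numerator / denominator
```

Calling `quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4])` returns 1.0, since identical ratings incur no penalty. Despite the “Error Measure” label, higher values mean better agreement.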

I thought I could easily adjust my MITRE competition leaderboard graph to Kaggle’s CSV, and it turned out to be pretty easy. The biggest issue ended up being that MITRE scored 0 to 100, and this scores 0 to 1. That had some unintended consequences. A combination of launchd, Python, and R should upload the graph every hour or so (when my laptop is running).
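The scale mismatch is the kind of thing that goes away if the plotting range is derived from the data instead of being hard-coded. Below is a rough Python sketch of that idea, not the actual script: the `Score` column name, the sample CSV layout, and both helper functions are assumptions invented for illustration.

```python
import csv
import io


def load_scores(csv_text, score_column="Score"):
    """Read numeric scores from leaderboard CSV text.

    The column name is a hypothetical default, not a documented format.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[score_column]) for row in reader]


def axis_limits(scores, padding=0.05):
    """Return (low, high) plot limits computed from the scores themselves.

    This works unchanged whether scores run 0 to 100 or 0 to 1, because
    the margin is a fraction of the observed spread.
    """
    if not scores:
        raise ValueError("no scores to plot")
    low, high = min(scores), max(scores)
    span = high - low
    if span == 0:
        # All scores identical: fall back to a margin based on magnitude.
        span = abs(high) or 1.0
    margin = span * padding
    return low - margin, high + margin


# Made-up sample data in an assumed layout.
sample = "TeamName,Score\nalpha,0.71\nbeta,0.75\ngamma,0.68\n"
limits = axis_limits(load_scores(sample))
```

The same `axis_limits` call handles a 0 to 100 leaderboard without modification, which is the point of not baking the range into the script.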

I’m frankly surprised Kaggle hasn’t done something like this before. Maybe if I have a bored evening I’ll try to do it in D3, which should look much nicer.


Update, 12/16: I’ve posted a follow-up after the end of the semester.


3 Comments on “Intro to Data Science – Kaggle Leaderboard”

  1. Columbia University Data Science-y Updates « Introduction to Data Science, Columbia University said at 3:29 pm on November 20th, 2012:

    [...] I realized Kaggle provides a leaderboard CSV very similar to what I was generating in that previous competition, so I could easily adjust my original script to generate a visual for the in class Kaggle project. I thought folks might be interested in seeing how the rankings change over time. A few minutes of coding (mostly because I’d hard coded some assumptions about a range of 0-100, now made flexible) resulted in this: [...]

  2. Newsletter: Startups, Santa, Giant Frogs | no free hunch said at 10:00 am on December 15th, 2012:

    [...] this is your chance to change the Kaggle experience in all future competitions.  Inspired by Chris Mulligan's visualizations of the Columbia in-class competition, we're calling on you to bring the leaderboard to life. [...]

  3. Intro Data Science/Kaggle update said at 9:02 am on December 16th, 2012:

    [...] The semester is over, so here’s a little update about the Intro to Data Science class (previous post). [...]
