Rosio Pavoris a blog

/prog/scrape

Our power was out for eight hours this Thursday, so I spent most of the day being bored and drawing diagrams. One of those diagrams turned into this, which got me going on a project I’ve been meaning to start for a while now.
Of course, then I lost interest and it ended up just being this bit. The rest is just GUI twattery and crafting HTTP requests in ways that won’t trigger the auto-banners anyway.

What the thing I wrote (which I guess I’ll call /prog/scrape) does now is open or create an SQLite database, pull subject.txt, compare it to the data in the database, load the thread pages for any thread that’s out of date, subject them to horrible and primitive pattern matching to pull out the individual posts (and relevant metadata), and update the database.
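The compare step is the only mildly interesting part, and it can be sketched roughly as follows. This is not the actual script: it is Python 3 rather than 2.5, the `threads` table and the function names are made up for illustration, and it assumes each subject.txt line is a `<>`-separated record with the thread ID in the fourth field and the last-post time in the seventh.

```python
import sqlite3


def open_db(path="prog.db"):
    """Open the database, creating a (hypothetical) threads table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS threads ("
        "id INTEGER PRIMARY KEY, subject TEXT, last_post INTEGER)"
    )
    return conn


def parse_subject_txt(text):
    """Yield (thread_id, subject, last_post) for each well-formed line.

    Assumed layout: subject<>name<>icon<>id<>replies<>last poster<>last time.
    Lines with missing or non-numeric fields are skipped.
    """
    for line in text.splitlines():
        fields = line.split("<>")
        if len(fields) < 7:
            continue  # missing fields, ignore the line
        try:
            yield int(fields[3]), fields[0], int(fields[6])
        except ValueError:
            continue  # ID or timestamp was not a number


def stale_threads(conn, text):
    """Return IDs of threads that are unknown or newer than the stored copy."""
    stale = []
    for thread_id, _subject, last_post in parse_subject_txt(text):
        row = conn.execute(
            "SELECT last_post FROM threads WHERE id = ?", (thread_id,)
        ).fetchone()
        if row is None or row[0] < last_post:
            stale.append(thread_id)
    return stale
```

Only the threads returned by `stale_threads` would then need to be fetched and re-parsed, which is what keeps an update run short compared to the initial full pull.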

The code is here (and should require no modules not present in a standard Python 2.5 install), but I strongly suggest you don’t try to run it. It will pull every single thing ever posted on /prog/, which takes something like two and a half hours.
Instead, download this (13.0 MB gzipped, 54.5 MB expanded), which is the SQLite database I built today. It’s up to date as of a few minutes ago, but to update it, just untar it into the same folder as the script (preserving the name prog.db), and run the script. It’ll find and use the existing database when determining what needs to be pulled.

Obviously this isn’t very useful by itself, but people interested in random statistics can have some fun with it.
To wit:

sqlite> select count(*) from posts;
168246
sqlite> select count(*) from (select distinct body from posts);
140893
sqlite> select count(*) from posts where author = 'Xarn';
28

Requires SQLite 3. Have fun.
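The same counts can be pulled from Python using the standard sqlite3 module. A minimal sketch, assuming nothing about the database beyond the `posts` table and the `author` and `body` columns used in the queries above; `post_stats` is a made-up name.

```python
import sqlite3


def post_stats(conn, author):
    """Return (total posts, distinct post bodies, posts by the given author)."""
    total = conn.execute("SELECT count(*) FROM posts").fetchone()[0]
    distinct = conn.execute(
        "SELECT count(*) FROM (SELECT DISTINCT body FROM posts)"
    ).fetchone()[0]
    by_author = conn.execute(
        "SELECT count(*) FROM posts WHERE author = ?", (author,)
    ).fetchone()[0]
    return total, distinct, by_author
```

With the downloaded database in the current folder, `post_stats(sqlite3.connect("prog.db"), "Xarn")` would run the three queries in one go.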

(The scraper should work with any Shiichan board by just changing the variables at the top, but honestly, who really uses Shiichan?)

Edit (July 11, 2009): Yes, this thread broke it. Updated version deals with it (by ignoring broken threads altogether).

Edit (August 12, 2009): http://github.com/Cairnarvon/progscrape/tree/master

13 Comments

  1. BAMPU PANTSU MEME FAN said,

    sqlite> select count(*) from posts where body LIKE "%hax my anus%";
    1284

    ( ゚ -゚)

  2. Cairnarvon said,

    More posts got duplicated than I initially thought. Should be fixed now.

    sqlite> select count(*) from posts where body like '%hax my anus%';
    445

    Slightly better.

  3. HAX MY PROXY MEME FAN said,

    Cool story bro. I did an indexer in Clojure but my subject.txt regex was shit (I separated into tokens via <>, but sometimes there were missing fields :/ ).

  4. ENTERPRISE SCALABLE TURNKEY SOLUTION said,

    RULES 1 AND 2 GOD DAMN IT
    YOU FUQIN ANGERED AN EXPERT PROGRAMMER

  5. Cairnarvon said,

    back to /b/, please.

  6. Watakwa said,

    Somewhat off topic, but what is your opinion on Python 3 and its non-backwards compatibility with Python 2?

  7. Cairnarvon said,

    Gweedo van Rossum should be shot.

  8. Anonymous said,

    “and should require no modules not present in a standard Python 2.5 install”
    i had to install the sqlite3 module…
    also, use "#!/usr/bin/env python2.5", so people who use real operating systems don’t have to fix your shit.

  9. Cairnarvon said,

    Not my fault your Python package is broken. Perhaps it is you who should be using a real operating system.
    Debian has a bunch of pysqlite packages, but I don’t have any of them installed; it still worked out of the box.

  10. Anonymous said,

    anyone with a real operating system will have python in /usr/local/bin, not /usr/bin.

  11. Cairnarvon said,

    Tertiary hierarchies are for Jews. /usr/bin is exactly where something like a Python interpreter belongs.
    Enjoy your enterprise.

  12. Anonymous said,

    Any way just to pull a specific set of post numbers? instead of all?

  13. Cairnarvon said,

    Sure, if you know Python.
