/prog/scrape
Our power was out for eight hours this Thursday, so I spent most of the day being bored and drawing diagrams. One of those diagrams turned into this, which got me started on a project I’ve been meaning to get started on for a while now.
Of course, then I lost interest and it ended up just being this bit. The rest is just GUI twattery and crafting HTTP requests in ways that won’t trigger the auto-banners anyway.
What the thing I wrote (and which I guess I’ll call /prog/scrape) does now is open or create an SQLite database, pull subject.txt, compare subject.txt to the data in the database, load the thread pages for any thread that’s out of date, subject them to horrible and primitive pattern matching to pull out the individual posts (and relevant metadata), and update the database.
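As a rough illustration of the "compare subject.txt to the database" step, here is a minimal sketch. It is not the actual /prog/scrape code. The `threads` table, its columns, and the function names are all hypothetical, the seven-field `<>`-separated line layout is an assumption about Shiichan's subject.txt, and it is written for a current Python 3 rather than the Python 2.5 the script targets.

```python
import sqlite3


def open_db(path):
    """Open (or create) the database; the threads table here is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS threads ("
        "thread_id INTEGER PRIMARY KEY, subject TEXT, last_post_time INTEGER)"
    )
    return db


def parse_subject_txt(text):
    """Yield (thread_id, subject, last_post_time) for each well-formed line.

    Assumed line layout (not confirmed by the post):
    subject<>name<>icon<>thread_id<>replies<>last_poster<>last_post_time
    """
    for line in text.splitlines():
        # Split from the right so a subject containing "<>" stays in one piece.
        fields = line.rsplit("<>", 6)
        if len(fields) != 7:
            continue  # malformed line: skip it rather than guess
        try:
            yield int(fields[3]), fields[0], int(fields[6])
        except ValueError:
            continue  # non-numeric thread id or timestamp


def stale_threads(db, subject_txt):
    """Return ids of threads that are new or whose last post time changed."""
    known = dict(db.execute("SELECT thread_id, last_post_time FROM threads"))
    return [
        thread_id
        for thread_id, _subject, last_post_time in parse_subject_txt(subject_txt)
        if known.get(thread_id) != last_post_time
    ]
```

Only the ids returned by `stale_threads` would need their thread pages fetched, which is what keeps an update run much shorter than the initial full scrape.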
The code is here (and should require no modules not present in a standard Python 2.5 install), but I strongly suggest you don’t try to run it. It will pull every single thing ever posted on /prog/, which takes something like two and a half hours.
Instead, download this (13.0 MB gzipped, 54.5 MB expanded), which is the SQLite database I built today. It’s up to date as of a few minutes ago, but to update it, just untar it into the same folder as the script (preserving the name prog.db), and run the script. It’ll find and use the existing database when determining what needs to be pulled.
Obviously this isn’t very useful by itself, but people interested in random statistics can have some fun with it.
To wit:
sqlite> select count(*) from posts;
168246
sqlite> select count(*) from (select distinct body from posts);
140893
sqlite> select count(*) from posts where author = 'Xarn';
28
Requires SQLite 3. Have fun.
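The same kind of counts can be pulled from Python with the standard `sqlite3` module. The sketch below assumes only what the queries above show, namely a `posts` table with `author` and `body` columns; the `post_stats` helper is a made-up name, and the code is written for a current Python 3.

```python
import sqlite3


def post_stats(db, author):
    """Return total, distinct-body, and per-author post counts.

    `db` is an open sqlite3 connection to a database with a posts table
    that has author and body columns (as in the queries above).
    """
    total = db.execute("SELECT count(*) FROM posts").fetchone()[0]
    distinct_bodies = db.execute(
        "SELECT count(*) FROM (SELECT DISTINCT body FROM posts)"
    ).fetchone()[0]
    # Pass the author name as a parameter instead of pasting it into the SQL.
    by_author = db.execute(
        "SELECT count(*) FROM posts WHERE author = ?", (author,)
    ).fetchone()[0]
    return {
        "total": total,
        "distinct_bodies": distinct_bodies,
        "by_author": by_author,
    }
```

With the downloaded database in the current directory, something like `post_stats(sqlite3.connect("prog.db"), "Xarn")` should return the three counts as a dictionary.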
(The scraper should work with any Shiichan board by just changing the variables at the top, but honestly, who really uses Shiichan?)
Edit (July 11, 2009): Yes, this thread broke it. Updated version deals with it (by ignoring broken threads altogether).
Edit (August 12, 2009): http://github.com/Cairnarvon/progscrape/tree/master
Edit (January 1, 2013): According to my referrer logs people are still finding this post and often coming from non-programming places. If all you’re looking for is a Google-like way to search world4ch boards, try this other thing I made.
BAMPU PANTSU MEME FAN said,
November 30th, 2008 at 9:12 pm
sqlite> select count(*) from posts where body LIKE "%hax my anus%";
1284
( ゚ -゚)
Cairnarvon said,
December 1st, 2008 at 6:15 pm
More posts got duplicated than I initially thought. Should be fixed now.
Slightly better.
HAX MY PROXY MEME FAN said,
December 2nd, 2008 at 1:34 am
Cool story bro. I did an indexer in Clojure but my subject.txt regex was shit (I separated into tokens via <>, but sometimes there were missing fields :/ ).
ENTERPRISE SCALABLE TURNKEY SOLUTION said,
December 2nd, 2008 at 8:08 pm
RULES 1 AND 2 GOD DAMN IT
YOU FUQIN ANGERED AN EXPERT PROGRAMMER
Cairnarvon said,
December 2nd, 2008 at 9:18 pm
back to /b/, please.
Watakwa said,
December 5th, 2008 at 1:10 am
Somewhat off topic, but what is your opinion on Python 3 and its non-backwards compatibility with Python 2?
Cairnarvon said,
December 5th, 2008 at 8:42 pm
Gweedo van Rossum should be shot.
Cairnarvon said,
December 11th, 2008 at 2:28 pm
The sqlite3 module is part of a standard Python install. Maybe it is you who should be using a real operating system.
Anonymous said,
December 11th, 2008 at 1:45 pm
“and should require no modules not present in a standard Python 2.5 install”
i had to install the sqlite3 module…
also, use "#!/usr/bin/env python2.5", so people who use real operating systems don’t have to fix your shit.
Anonymous said,
June 24th, 2009 at 5:42 am
anyone with a real operating system will have python in /usr/local/bin, not /usr/bin.
Cairnarvon said,
June 24th, 2009 at 5:44 am
Tertiary hierarchies are for Jews.
/usr/bin is exactly where something like a Python interpreter belongs.
Enjoy your enterprise.
Anonymous said,
December 10th, 2009 at 9:24 pm
Any way just to pull a specific set of post numbers? instead of all?
Cairnarvon said,
December 10th, 2009 at 11:50 pm
Sure, if you know Python.
Anonymous said,
October 29th, 2010 at 4:14 pm
Can I suggest you `$crunchbang =~ s#/usr/bin/python#&2#`
Cairnarvon said,
October 29th, 2010 at 11:10 pm
That would presuppose the existence of /usr/bin/python2. It’d be nice if the hashbang line could enforce Python 2.x on systems where /usr/bin/python is Python 3.x, but there’s no way to do that that doesn’t break everywhere, and there are no sane systems where /usr/bin/python is Python 3.x anyway. Maybe Arch does that now, but I don’t think Arch users should be encouraged.
Anonymous said,
October 31st, 2010 at 4:06 am
Oh I’d just assumed it was standard, I’ll be honest; I didn’t even notice the change (good guess at Arch, what do you see that’s wrong with it? HIBT by the great [b]XARN?[/b]) and suddenly I try a simple [code]print "hi"[/code] and nothing works. This was on the uni servers (Fedora), which also have [m]/usr/bin/python2[/m].
I guess you’re sticking with a python2 version then?
There goes my quick [m]./pr.py[/m] anyway…
FIOC wins this time… but I’ll be back
>>16 said,
October 31st, 2010 at 4:09 am
damn it
Cairnarvon said,
October 31st, 2010 at 4:30 am
My problem with Arch is the same as my problem with Anonix: it was created by people who don’t really know what they’re doing, but think they can do better than everyone else anyway. It’s not as bad as Anonix in that they actually have a working system, but it’s this attitude that explains why, for example, their package manager is the only one around that has no support for signed packages: they don’t understand the purpose, so they consider it to be bloat.
It’s interesting that Fedora has a /usr/bin/python2; presumably RHEL and all of its derivatives do too, then. Debian and its derivatives don’t, though, and that’s what I and most people I know use. Maybe this will change, but for the time being, changing the hashbang would break things for too many people.
What I suggest you do is define an alias in your .bashrc or equivalent. I recommend doing that anyway to make command line options explicit (in case the defaults change).
Another option is to use git’s post-checkout hook to automatically run the 2to3 tool whenever you pull updates. I’ve never used 2to3 myself, and I don’t have a Py3k at hand to try it, so I don’t know if that’s going to work properly. Might be worth a shot.
The only downside to that is that you can’t contribute patches from that repository, of course.
Anonymous said,
October 31st, 2010 at 7:13 pm
That’s a fair point, thanks for the heads up. I guess it’s relatively safe with the official repos, so long as one’s DNS isn’t poisoned, but hey, assumptions are the mother of all fuckups. However I still stand by many of the devs’ choices, for example the simpler init system/daemon handling. I find it difficult to configure my sister’s ubuntu laptop via ssh without using some kind of GUI (although perhaps that’s just lack of experience).
I think I’ll go with git’s post checkout, I’ve used it before for spaces to tabs, and I’ll just have it stick a 2 on the end of /usr/bin/python.
Thanks for your advice
>>19 said,
October 31st, 2010 at 7:14 pm
Oh not again!