In the course of writing a little Python command-line RSS engine, I naturally came to a point where I needed to download the RSS feeds to store them and work with them.
My options looked like this:
urllib2.urlopen, which would download the feeds serially. When you have hundreds of feeds in your OPML file, that takes too long. Besides, what do I have ADSL for?
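For comparison, here is roughly what the serial approach looks like. This is only a sketch: it uses urllib.request, the Python 3 successor to urllib2, and the function name fetch_serially is made up for illustration. Each download has to finish before the next one starts, which is exactly the problem.

```python
import urllib.request


def fetch_serially(urls, timeout=10):
    """Download each URL one after another and return {url: body_bytes}.

    fetch_serially is a hypothetical helper name, not part of any library.
    """
    bodies = {}
    for url in urls:
        # urlopen blocks here until this one feed has been fetched
        with urllib.request.urlopen(url, timeout=timeout) as response:
            bodies[url] = response.read()
    return bodies
```

With hundreds of feeds the total time is the sum of every single download, no matter how much bandwidth is sitting idle.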
So, the documentation for pycurl is all well and fine except for the small fact that the documentation for the CurlMulti object is obtuse, and the example code they provide raises more questions than it answers. Look at this:
    01 import pycurl
    02 c = pycurl.Curl()
    03 c.setopt(pycurl.URL, "http://curl.haxx.se")
    04 m = pycurl.CurlMulti()
    05 m.add_handle(c)
    06 while 1:
    07     ret, num_handles = m.perform()
    08     if ret != pycurl.E_CALL_MULTI_PERFORM: break
    09 while num_handles:
    10     apply(select.select, m.fdset() + (1,))
    11     while 1:
    12         ret, num_handles = m.perform()
    13         if ret != pycurl.E_CALL_MULTI_PERFORM: break
In lines 6-8, the code repeatedly runs perform() on the pycurl.CurlMulti() object until the return value ret is not pycurl.E_CALL_MULTI_PERFORM, and lines 11-13 then do exactly the same thing again. Why does it do this twice? Here's how this works.
You can get through lines 1-5 pretty well with the pycurl documentation, but in short:
Lines 6-8 call perform() on the CurlMulti object, which is basically a "go fetch now" command. Curl objects inside a CurlMulti are non-blocking, which means perform() does whatever work it can right now and returns immediately instead of waiting on the network, freeing up your application to do other things in between. Since we don't really have much else to do in this specific case until we get our bloody feeds, we keep on calling perform() like a whining child until we are told, "Dude, I've got nothing more to do right now. Get off my back". pycurl tells us this by setting ret, the first value returned by perform(), to something other than pycurl.E_CALL_MULTI_PERFORM. So in a nutshell, this whole loop is about nagging the CurlMulti object until it tells us in its own way that it is done for the moment.
Lines 9-13 loop on num_handles, the second value returned by perform(), which is the number of transfers still active. Yes, we're braindead. We used the loop in lines 6-8 to initialize num_handles so we could loop over it in lines 9-13, calling select() on the file descriptors which were blossoming out of the CurlMulti object as a consequence of calling perform() on it ad nauseam.
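Line 10 of the documentation example relies on apply(), which no longer exists in Python 3, so it is worth spelling out what that select() call does. The sketch below uses a socketpair as a stand-in for the descriptors that m.fdset() would hand back (that substitution is my assumption, so it runs without a network). The helper name wait_for_activity is hypothetical.

```python
import select
import socket


def wait_for_activity(fdset, timeout=1.0):
    """Python 3 spelling of apply(select.select, fdset + (timeout,)).

    fdset is a (read, write, exceptional) triple, the same shape
    that CurlMulti.fdset() returns.
    """
    read_fds, write_fds, exc_fds = fdset
    return select.select(read_fds, write_fds, exc_fds, timeout)


# Stand-in descriptors: with pycurl these would come from m.fdset()
a, b = socket.socketpair()
fdset = ([a], [], [])

quiet = wait_for_activity(fdset, timeout=0.1)  # nothing to read: times out
b.send(b"x")                                   # now there is data waiting on a
ready = wait_for_activity(fdset, timeout=1.0)  # returns as soon as a is readable

a.close()
b.close()
```

The point of the call is simply to sleep until something happens on one of the sockets, or until the timeout expires, rather than spinning on perform() at full speed.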
And if you're thinking now that we could rip out lines 6-8 and just count how many actual Curl objects we stuck so carnally into our CurlMulti object, you would be spot on the money. So, this code works too and is a hell of a lot less obtuse:
    import pycurl
    c = pycurl.Curl()
    c.setopt(pycurl.URL, "http://curl.haxx.se")
    m = pycurl.CurlMulti()
    m.add_handle(c)
    # I'm not handwaving with this next line
    # if you're adding several Curl objects you can
    # damn well count them and initialize accordingly
    num_handles = 1
    while num_handles:
        while 1:
            ret, num_handles = m.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM: break
        m.select(1.0)
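To make the "count them and initialize accordingly" part concrete, here is a sketch of the same loop wrapped in a function that takes any number of handles. The names run_transfers and FakeMulti are mine, not pycurl's. FakeMulti is a deliberately dumb stand-in that "finishes" one handle per perform() call, so the loop can be tried without pycurl or a network. With the real thing you would pass a pycurl.CurlMulti(), a list of configured pycurl.Curl() objects, and pycurl.E_CALL_MULTI_PERFORM.

```python
def run_transfers(multi, handles, call_again, timeout=1.0):
    """Add every handle to multi, then drive it until none are active.

    multi      -- a pycurl.CurlMulti(), or anything with the same methods
    handles    -- configured pycurl.Curl() objects
    call_again -- pycurl.E_CALL_MULTI_PERFORM when using real pycurl
    """
    for handle in handles:
        multi.add_handle(handle)
    num_handles = len(handles)  # counted up front, not discovered via perform()
    while num_handles:
        while True:
            ret, num_handles = multi.perform()
            if ret != call_again:
                break
        if num_handles:
            multi.select(timeout)  # sleep until a socket is ready or timeout


class FakeMulti:
    """Hypothetical stand-in for CurlMulti, only for trying the loop offline."""

    CALL_AGAIN, OK = -1, 0

    def __init__(self):
        self.active = []
        self.select_calls = 0

    def add_handle(self, handle):
        self.active.append(handle)

    def perform(self):
        # Pretend one transfer completes on every call
        if self.active:
            self.active.pop()
        return self.OK, len(self.active)

    def select(self, timeout):
        self.select_calls += 1


fake = FakeMulti()
run_transfers(fake, ["feed-a", "feed-b", "feed-c"], FakeMulti.CALL_AGAIN)
print(len(fake.active), fake.select_calls)  # prints: 0 2
```

The sketch skips error handling entirely. A real version would presumably also want to give each Curl object somewhere to write its data and check afterwards which transfers actually succeeded.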
$DEITY, what awful documentation!