seamusc.com

Automagically download from rapidshare

Published on October 7, 2007

This script cracks the rapidshare captcha and downloads the file. Requires:

  • python-mechanize
  • imagemagick
  • ocrad/gocr

Run with ./downloadFromRapidshare.py http://rapidshare.com/somefile

downloadFromRapidshare.py:

#!/usr/bin/env python

from mechanize import *
import re
import urllib
import commands
import os
import sys
import time



url=sys.argv[1]

print "V2:  "+ url

br = Browser()

response = br.open(url)



br.select_form(nr=0)
response = br.submit(nr=1)

html = response.read()


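def htc(m):
    # turn a two-digit hex match into the character it encodes
    return chr(int(m.group(1),16))

def urldecode(url):
    # match a percent sign followed by two hex digits
    rex=re.compile('%([0-9a-fA-F][0-9a-fA-F])',re.M)
    return rex.sub(htc,url)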

htmlDec = urldecode(html)


# if rapidshare asks us to wait, optionally sleep, then exit with
# status 2 so the caller can run the script again
m = re.search('Or wait ([0-9]+) minute', html)
if m:
    waitTime = m.group(1)
    if len(sys.argv) > 2 and sys.argv[2] == 'y':
        print 'Waiting for '+waitTime+' minutes'
        time.sleep(int(waitTime)*60)
    else:
        print waitTime
    sys.exit(2)

waitTime = re.search('var c=[0-9]*',htmlDec).group(0)[6:]


print 'Waiting for ' + waitTime + ' seconds'
for sec in range(int(waitTime),0,-1) :
    sys.stdout.write('    \r'+str(sec)+'  ')
    sys.stdout.flush()
    time.sleep(1) 

imgURL = re.search('http://[^"]*jpg', re.search('Please enter[^:]*:[^:]*:', htmlDec).group(0)).group(0)
print 'Downloading captcha image ('+imgURL+')'
f = urllib.urlopen(imgURL)
fp = open('captcha.jpg','wb')
fp.write(f.read())
fp.close()

print 'Breaking captcha...'


commands.getoutput('convert captcha.jpg captcha.pbm')
captchaText = commands.getoutput('gocr -d 10 -m 256 -m 2 -p ./db2/db2 captcha.pbm')
print 'Captcha is "'+captchaText[0:4]+'"'

# save a copy for later
commands.getoutput('mv captcha.jpg captchas/`md5sum captcha.jpg | cut -d " " -f 1`')


postUrl = re.search('action="http://[^"]*',htmlDec).group(0)[8:]


file = br.open(postUrl+'?accesscode='+captchaText[0:4])
url = postUrl



print "Downloading " + url.split('/')[-1] + "..."



fileSize = file.info().getheader("Content-Length")
fileName = url.split('/')[-1]
# skip the download if a complete copy is already on disk
if fileSize and os.path.exists(fileName) and os.stat(fileName).st_size == int(fileSize):
    print 'File already fully downloaded.'
    sys.exit(0)

fp = open(fileName,'wb')

numK = 256
#numK = 25

downloadedSize = 0
while 1:
    st = time.time()
    chunk = file.read(1024*numK)
    sizeOfChunk = len(chunk)
    downloadedSize += sizeOfChunk

    if not chunk: break

    speed =  int( ( sizeOfChunk /( time.time() - st ) ) / 1024 )

    fp.write(chunk)
    fp.flush()

    percentage = (float(downloadedSize)/float(fileSize))*100
    sys.stdout.write('\r                                                                    ')
    sys.stdout.flush()
    sys.stdout.write('\r'+str(downloadedSize)+'/'+str(fileSize)+'  '+str(int(percentage))+'%  '+str(speed)+' kb/s                                    ')
    sys.stdout.flush()

fp.close()
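
For anyone porting this to Python 3, the two parsing steps in the script (percent-decoding the page and pulling out the countdown) could look roughly like the sketch below. It assumes the page still embeds the countdown as `var c=NN`. The names `decode_page` and `parse_countdown` are made up for illustration, and the standard library's `urllib.parse.unquote` stands in for the hand-rolled `urldecode` (it decodes as UTF-8, so it only matches the original exactly for ASCII input).

```python
import re
from urllib.parse import unquote


def decode_page(html):
    """Percent-decode the page source (stdlib stand-in for urldecode)."""
    return unquote(html)


def parse_countdown(html_dec):
    """Return the countdown in seconds from 'var c=NN', or None if absent."""
    match = re.search(r'var c=([0-9]+)', html_dec)
    return int(match.group(1)) if match else None


if __name__ == '__main__':
    # Made-up sample input, not real rapidshare output.
    sample = 'var%20c%3D42%3B'
    print(parse_countdown(decode_page(sample)))
```

Using a capture group avoids the `[6:]` slice, and returning `None` lets the caller decide what to do when the page layout changes.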

Oct 13th:

Updated code to improve performance

Oct 19th:

rapidshare.com updated their site and now write the form using javascript. They don't check where the parameters are coming from, though, so we can issue a GET request instead of a very messy POST request.
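
The GET request boils down to appending the captcha text as a query parameter. Here is a small Python 3 sketch, assuming the form target accepts `accesscode` in the query string as the script does. `build_download_url` is a hypothetical helper, not part of the script; it uses `urlencode` so that odd OCR output gets escaped instead of breaking the URL.

```python
from urllib.parse import urlencode


def build_download_url(post_url, captcha_text):
    """Append the first four captcha characters as the accesscode parameter."""
    query = urlencode({'accesscode': captcha_text[:4]})
    return post_url + '?' + query
```

The script itself just concatenates the strings, which is fine as long as gocr only returns letters and digits.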


Nov 19th:

rapidshare.com updated their captchas. The new convert args seem to work, but only about 1 in every 2 or 3 attempts.

convert captcha.jpg -monochrome -edge 23 -fuzz 60% -floodfill 1x1 white -negate  captcha.pbm

It has problems with some characters (2, Z, 4, M). It's not a complicated captcha, so if I have some free time soon I will improve the accuracy.

Feb 3rd:

Updated code; it's a lot cleaner now and works ~90% of the time once gocr has been trained.


Comments

Conn
October 12, 2007 at 7:11 a.m.

Any help on how to install ocrad under windows? I huffed and I puffed and I can't find any binaries, and makefiles make no sense to me.

Seamus
October 12, 2007 at 10:27 a.m.

Hey Conn, I haven't tried getting ocrad to run on windows but this post on the ocrad mailing list might be of some use.


http://www.nabble.com/Ocrad-on-Windows-t2091346.html

Conn
October 12, 2007 at 11:13 a.m.

Thanks a ton, awesome script. I myself started fiddling with mechanize to automate some everyday stuff and it's really great.


Keep it up, bookmarked =)

Logan9773
November 21, 2007 at 5:12 p.m.

Nice, another reason to move to Linux. I've bookmarked your page and hope you get time to crack the new Captcha.

seamus
November 22, 2007 at 8:25 p.m.

Thanks,


I have actually started using gocr instead of ocrad, as I can train gocr. Rapidshare's current captcha is composed of random letters and numbers which are always in the same font, so after training, gocr gets the text of the captcha almost 100% of the time.

steven
February 2, 2008 at 10:39 p.m.

Any chance you can add a little thing for a router/modem so it can change the ip address by just resetting it after each download you do?

steven
February 2, 2008 at 11:02 p.m.

Or perhaps a modified version of your script to specify your own wait time instead of the one I guess you would get from rapidshare itself?

steven
February 2, 2008 at 11:14 p.m.

I guess this doesn't work anymore does it? :( sorry for all the comments if it does

Seamus
February 3, 2008 at 1:52 p.m.

Steven, the script still works, you just need to train gocr for the new font type. The wait time specified by rapidshare is enforced on the server side so changing the time on the client side will not work

not to be rude
February 13, 2008 at 3:01 p.m.

Your script is a bit buggy.

Seamus
February 15, 2008 at 2:51 p.m.

Well, tell me where it's "buggy" and I can fix it. I never said that it works perfectly; I know that it doesn't.

Prateek
February 21, 2008 at 2:55 a.m.

Hi. I'm very new to all this. Can you please guide me on how to use this script? I had already downloaded Python earlier and have now downloaded the mechanize add-on. Please guide me through the rest. I can't really work it out. Thank you.

dexter
February 28, 2008 at 8:37 p.m.

heh .. nice script .. bookmarked .. problem is rs today seemed to change a few fonts .. some hollow ones .. some curly ones .. some slim ones .. and a lil' bit earlier they had a fuzzy one .. I'm trying to get all the letters I can and train it .. I'll give a link to the db created if anyone needs it, along with a php script I made ( using curl ) .. if anyone is interested

miso
March 5, 2008 at 2:30 p.m.

Hi, thank you for a nice script. Does it still work for you even after the captcha images were changed and some distortion (small images of cats and dogs) was added? If anyone has trained the GOCR database already, could you please provide it?


Or is there a possibility to show the captcha image to user who would write the letters and GOCR would somehow learn it after some time?


Again, thank you for providing the script.

seamus
March 5, 2008 at 3:21 p.m.

No, at the moment gocr can't recognize the characters, as the images of cats and dogs are placed randomly on the images. If they can be removed then it should work perfectly (anyone know how to do this?)


You can train gocr using gocr -m 130 -p ./db2/db2 captcha.pbm

miso
March 7, 2008 at 12:31 p.m.

Thank you for your reply. I have just made some changes to the script to ask user instead of running gocr:


captchaText = commands.getoutput('kaptain downloadFromRapidshare.kaptn')
print 'Captcha is "'+captchaText[0:4]+'"'


The file downloadFromRapidshare.kaptn:


start -> captcha str ok_btn;
captcha -> @icon("captcha.jpg") ;
str -> @string(4)="" ;


ok_btn -> @exec(ok)="OK" ;
ok -> "echo " str ;

corr
March 8, 2008 at 10:57 a.m.

The cats&dogs patterns can be located exactly by running a 2D cross correlation function using a selected mask for each, and then just erasing them (and part of the chars) from the original image prior to sending it to the ocr.


Although the programming involved is quite simple, unfortunately right now I don't know of any specific application that could be easily scripted to use here "out of the box".

Seamus
March 8, 2008 at 1:51 p.m.

Yea, I was thinking of something along those lines. Regarding what to implement this in, I think python would be appropriate here. I'm working on some image processing applications using python, PIL and Numeric. If I get a chance in the next few days I'll try to implement this myself, but if anyone else has a go at it please let us know how it goes.
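
To make the idea concrete, here is a minimal plain Python 3 sketch of locating a small pattern inside a larger image and blanking it out. It is a simplified stand-in for a true 2D cross correlation: it just counts agreeing pixels at each offset. Images are assumed to be lists of rows of 0/1 values (1 = dark), and `match_score`, `locate` and `erase` are names invented for this illustration; nothing here uses PIL or Numeric.

```python
def match_score(image, pattern, x, y):
    """Count pixels where the pattern agrees with the image window at (x, y)."""
    return sum(
        1
        for j, row in enumerate(pattern)
        for i, value in enumerate(row)
        if image[y + j][x + i] == value
    )


def locate(image, pattern):
    """Try every offset and return (score, x, y) of the best match."""
    height, width = len(image), len(image[0])
    p_height, p_width = len(pattern), len(pattern[0])
    best = (-1, 0, 0)
    for y in range(height - p_height + 1):
        for x in range(width - p_width + 1):
            score = match_score(image, pattern, x, y)
            if score > best[0]:
                best = (score, x, y)
    return best


def erase(image, pattern, x, y):
    """Set to 0 every image pixel covered by a dark pattern pixel."""
    for j, row in enumerate(pattern):
        for i, value in enumerate(row):
            if value:
                image[y + j][x + i] = 0
```

This brute-force search is slow on real images, and a real implementation would probably normalize the score, but it shows the locate-then-erase flow.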

dexter
March 8, 2008 at 11:56 p.m.

I was thinking about doing it in php .. what I'll try to do is ( like you guys said ) try to identify the images and dogs/cats in the image and delete them along with the letter .. ( shouldn't affect too much the recognition with a good training ) .. will start working on it on monday and I'll let you guys know about the results ..

Reban
March 9, 2008 at 2:35 p.m.

Just to drop by a note,
RSConsole is a Rapidshare.com Premium file download accelerator and scheduler for Linux, Unix and BSD.


I've just tried it, and it works like a charm. Probably the best Rapidshare Premium downloader around that's open source.


You can find it here:
http://www.addedworth.com/rsconsole.php
Cheers,
Reban

Dexter
March 12, 2008 at 1:02 p.m.

Well, I don't see what's so open source about it since it's ~65$ or something like that and requires, and I quote: "At least 1 Rapidshare Premium Account". Using a rapidshare account isn't an option, that's easy php&curl. Anyhow, so far I didn't manage to get those kitties and doggies out .. will continue to see what gd can do about this ..

corr
March 13, 2008 at 3:06 p.m.

If you decide to work out the details, note that there are always eight of them, with their horizontal positions fixed. Only the vertical coordinate needs to be located, along with the family of the pet :)
The hollow chars don't help, but the rate should rise with time and training.


Really, I stopped up/downloading to rapidshare, since the free 1GB/day is ridiculous compared to what megaupload allows, with a similar sustained transfer rate of >250KBps and without the need to deal with the stupid rapidshare captchas, which are harder for me to read than for a program. :(

dexter
March 16, 2008 at 11:57 a.m.

Ok, I give up. I suck at graphics and math. if anyone is interested in continuing or testing I can provide about 1011 captchas ( got them from rapidshare using php & curl) and the image of the cat and dog .. lemme know and I'll provide a link ..

corr
March 16, 2008 at 6:38 p.m.

Well, the bash script code shows up garbled; I've been unable to quote it properly, but you get the idea.
It will remove (crudely and slowly, but using only readily available commands like pgmhist as a sort of cheap correlator) the cat in the first block with the appropriate mask and pattern, and display the result with xview. I didn't even try to teach the ocr, and I really don't know if it's worth the hassle for so few MB allowed.


#!/bin/sh

# Slide PATTERN (through the alpha MASK) down the image at a fixed x offset
# and keep the y offset where the count of pure-white pixels rises the most.
corr()
{
    PATTERN=$1
    MASK=$2
    THRESHOLD=$3

    x=-2 # adjust for your pattern size
    y=48
    MAX_RI=0
    MAX_Y=0
    while [ $y -ge -16 ]; do
        # R = number of pixels with value 255 after compositing at (x, y)
        R=$(cat "$file" |
            pnmcomp -xoff=$x -yoff=$y "$PATTERN" -alpha="$MASK" - |
            pgmhist - | grep "^255" | awk '{print $2}')
        RI=$(( ${R:-0} - ${I:-0} ))
        if [ $RI -gt $THRESHOLD -a $RI -gt $MAX_RI ]; then
            MAX_RI=$RI
            MAX_Y=$y
            echo "Max found at $x,$y: $RI"
        fi
        y=$(( y - 1 ))
    done
    if [ $MAX_RI -gt $THRESHOLD ]; then
        cat "$file" | pnmcomp -xoff=$x -yoff=$MAX_Y "$PATTERN" -alpha="$MASK" - > clean.pnm
        xview clean.pnm &
    else
        echo "Not found"
    fi
}

file=$1
xview "$file" &

CAT=mcat100.pgm # 100% white box 32
MCAT=nocat.pgm # Inverted cat pattern

# I = number of pixels with value 255 in the untouched image
I=$(cat "$file" | pgmhist - | grep "^255" | awk '{print $2}')
echo I=$I

corr $CAT $MCAT 30

exit


Regards.

dexter
March 17, 2008 at 12:22 p.m.

I don't get it.
1) xview is an old image viewer, I replaced those lines with eog. Any problem ?
2) nocat.pgm is an inverted cat image, right ? I used gimp -> menu -> layers -> invert to invert the image ..
3) mcat100.pgm is a 32x32 white box, right ?


when I run it, I get this:


[dexter@u15286215 captcha]$ ./x.sh 13_58_58_c110d3ff61b044bb02f074cf51ab2c55_access1614249.pgm
I=6355
pnmcomp: bad magic number - not a ppm, pgm, or pbm file
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
pnmcomp: bad magic number - not a ppm, pgm, or pbm file
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
pnmcomp: bad magic number - not a ppm, pgm, or pbm file
.
.
.


pnmcomp: bad magic number - not a ppm, pgm, or pbm file
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
pnmcomp: bad magic number - not a ppm, pgm, or pbm file
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
pnmcomp: bad magic number - not a ppm, pgm, or pbm file
Not found.


So, what did I do wrong ?

corr
March 17, 2008 at 3:28 p.m.

pnmcomp complains that $PATTERN or $MASK are not valid


pnmcomp: bad magic number - not a ppm, pgm, or pbm file


try to get the type of images generated by gimp and if necessary convert them with ppmtopgm since pnmcomp could be picky.


$ file mcat100.pgm nocat.pgm
mcat100.pgm: Netpbm PGM "rawbits" image data
nocat.pgm: Netpbm PGM "rawbits" image data


And you can debug it first from the command line


$ cat $file | pnmcomp -xoff=-2 -yoff=20 mcat100.pgm -alpha=nocat.pgm - | pgmhist -

dexter
March 17, 2008 at 4:46 p.m.

ok, seems one of them was png ... resolved that part ..


[dexter@u15286215 captcha]$ file nocat.pgm mcat100.pgm
nocat.pgm: Netpbm PGM "rawbits" image data
mcat100.pgm: Netpbm PGM "rawbits" image data
[dexter@u15286215 captcha]$ ./x.sh 13_58_58_c110d3ff61b044bb02f074cf51ab2c55_access1614249.pgm
I=6355
pnmcomp: Alpha map and overlay image are not the same size
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
pnmcomp: Alpha map and overlay image are not the same size
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
pnmcomp: Alpha map and overlay image are not the same size
.
.
.
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
pnmcomp: Alpha map and overlay image are not the same size
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
Not found


[dexter@u15286215 captcha]$ file 13_58_58_c110d3ff61b044bb02f074cf51ab2c55_access1614249.pgm
13_58_58_c110d3ff61b044bb02f074cf51ab2c55_access1614249.pgm: Netpbm PGM "rawbits" image data
[dexter@u15286215 captcha]$ cat 13_58_58_c110d3ff61b044bb02f074cf51ab2c55_access1614249.pgm | pnmcomp -xoff=-2 -yoff=20 mcat100.pgm -alpha=nocat.pgm - | pgmhist -
pnmcomp: Alpha map and overlay image are not the same size
pgmhist: Error reading magic number from Netpbm image stream. Most often, this means your input file is empty.
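
Both pnmcomp complaints above (bad magic number, then mismatched sizes) can be checked before running the shell script. Here is a rough Python 3 sketch that reads the kind and dimensions from the first bytes of a Netpbm file. `read_netpbm_header` and `NETPBM_KINDS` are names invented for this example, and it only looks at the magic number and the first two integers of the header.

```python
import re

# Netpbm magic numbers: plain and raw variants of each format.
NETPBM_KINDS = {
    b'P1': 'pbm', b'P4': 'pbm',
    b'P2': 'pgm', b'P5': 'pgm',
    b'P3': 'ppm', b'P6': 'ppm',
}


def read_netpbm_header(data):
    """Return (kind, width, height) from the leading bytes of a Netpbm file.

    Raises ValueError when the magic number is not P1..P6.
    """
    kind = NETPBM_KINDS.get(data[:2])
    if kind is None:
        raise ValueError('bad magic number: %r' % data[:2])
    # Drop '#' comments, then take the two integers after the magic number.
    header = re.sub(rb'#[^\r\n]*', b'', data[2:])
    width, height = (int(token) for token in header.split()[:2])
    return kind, width, height
```

Reading the first hundred bytes or so of `mcat100.pgm` and `nocat.pgm` with `open(name, 'rb')` and comparing the two results would show whether the overlay and its alpha map really have the same size.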

paka
April 3, 2008 at 5:30 p.m.

Reban, get out of our saviours' way. These people are working on getting rid of the kitties and doggies that make a free downloader's life a pain. People, good luck and keep working.

Sami
April 17, 2008 at 2:14 p.m.

It's more difficult now: there are 6 letters, and you should write only the 4 letters that have a cat on them !!


Die Hard Rapidshare !!

Seamus
April 17, 2008 at 4:45 p.m.

Yea it's a real pain in the ass now! I don't have the time these days at all to look into it.

Dexter
April 19, 2008 at 7:55 p.m.

Dang :| Well, the letters are a lil bit separated, so there is a way to differentiate the letters .. damn ..