| tags: [ scraping ]
Web Scraping 101: E-Reader app
Let’s say you bought a textbook and it comes with a code that lets you read its online version.
Of course, that online version is tied to your account, expires in 6 months, and is not compatible with your tablet browser. So what are you gonna do? Hack together a script that takes screenshots of the pages? That’s not a bad idea, but first let’s see if we can get through the e-reader’s DRM.
After logging into the app, I immediately open the dev tools and this is what I see:
So, individual PDF pages are being fetched from this `getpdfpage` endpoint every time you flip a page in the app.
This is what is sent to the endpoint:

```
globalbookid: "<hash>"
pdfpage: "<hash>.pdf"
iscover: "N"
authkey: "<hash>"
hsid: "<hash>"
```
`globalbookid` is the unique ID of the book I am looking at. `pdfpage` is the ID of the page; there is probably a way to get a list of those with another endpoint. `iscover` and `authkey` are self-explanatory. So what exactly is the `hsid` parameter? From what I can see, it is different for every request.
Looking further, I find the `getpagedetails` endpoint, which does exactly what the name suggests:
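As a rough sketch, pulling the page list out of that response might look like the snippet below. The host name, query parameters, and response shape are my assumptions, modeled on the `getpdfpage` request, not the app’s actual API:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical host -- the real one comes from the dev tools request.
BASE_URL = "https://ereader.example.com/getpagedetails"

def extract_pdfpages(details: dict) -> list:
    """Pull the pdfpage IDs out of a (hypothetical) getpagedetails response."""
    return [page["pdfpage"] for page in details["pages"]]

def get_page_list(globalbookid: str, authkey: str) -> list:
    """Ask getpagedetails for the book's page metadata."""
    query = urllib.parse.urlencode({
        "globalbookid": globalbookid,
        "authkey": authkey,
    })
    with urllib.request.urlopen(f"{BASE_URL}?{query}") as resp:
        return extract_pdfpages(json.load(resp))
```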
Okay, so we have our `authkey`, the list of `pdfpage`s, and we have found the code that shows how the `getpdfpage` endpoint is called.
Interesting… So the query URL is built by concatenating the different parameters together, as you would expect, but then a part of the URL - everything but the mysterious `hsid` parameter - is put through a hash function, and the result becomes the value of `hsid`.
Without even looking at the `s.c` function, it is becoming more and more obvious that the value of `hsid` is an MD5 hash of the whole query URL, with `l.b.MD5_SECRET_KEY` as the salt.
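In other words, the `hsid` can be reproduced in a couple of lines. A minimal sketch, assuming the salt is simply appended to the query string before hashing (the key below is a placeholder, not the real `MD5_SECRET_KEY`):

```python
import hashlib

# Placeholder salt -- the real value sits in the app's source as l.b.MD5_SECRET_KEY.
MD5_SECRET_KEY = "not-the-real-key"

def compute_hsid(query: str) -> str:
    """hsid = MD5 of the query URL (minus hsid itself), salted with the secret key."""
    return hashlib.md5((query + MD5_SECRET_KEY).encode("utf-8")).hexdigest()
```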
The secret key was hidden only a few keystrokes away in the source. Now that we have all the puzzle pieces, let’s hack together a simple Python script to automate the download process:
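The script below is a sketch of that process; the host name, parameter order, and secret key are placeholders, since the real values come from the app’s source and the `getpagedetails` response:

```python
import hashlib
import urllib.parse
import urllib.request

# All placeholders -- the real values come from the app's source
# and the getpagedetails response.
BASE_URL = "https://ereader.example.com/getpdfpage"
MD5_SECRET_KEY = "not-the-real-key"
GLOBALBOOKID = "<hash>"
AUTHKEY = "<hash>"

def build_query(pdfpage: str, iscover: str = "N") -> str:
    """Everything that goes into the URL except hsid, in request order."""
    return urllib.parse.urlencode({
        "globalbookid": GLOBALBOOKID,
        "pdfpage": pdfpage,
        "iscover": iscover,
        "authkey": AUTHKEY,
    })

def page_url(pdfpage: str) -> str:
    """Append the salted MD5 of the query string as the hsid parameter."""
    query = build_query(pdfpage)
    hsid = hashlib.md5((query + MD5_SECRET_KEY).encode("utf-8")).hexdigest()
    return f"{BASE_URL}?{query}&hsid={hsid}"

def download_pages(pdfpages):
    """Save each page as 0000.pdf, 0001.pdf, ... so `ls -v` sorts them in order."""
    for i, pdfpage in enumerate(pdfpages):
        with urllib.request.urlopen(page_url(pdfpage)) as resp:
            with open(f"{i:04d}.pdf", "wb") as f:
                f.write(resp.read())

if __name__ == "__main__":
    # The page ID list would come from the getpagedetails endpoint.
    download_pages(["<hash>.pdf"])
```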
To stitch the pages together, I used:

```
pdfunite $(ls -v) output.pdf
```
Now, even if you wanted to, you couldn’t buy a digital version of that book at this quality.