I was recently out with a friend of mine who mentioned that he was having a tough time scraping some data off a website. After a few drinks we arrived at a barter, if I could scrape the data he’d buy me some single malt scotch which seemed like a great deal for me. I assumed I’d make a couple of HTTP requests, parse some HTML, grab the data and dump it into a CSV. In the worst case I imagined having to write some custom code to login to a web app and maybe sticky some cookies. And then I got started.
As it turned out this site was running one of the most sophisticated anti-scraping/anti-robot packages I’ve ever encountered. In a regular browser session everything looked normal but after a half dozen or so programmatic HTTP requests I started running into their anti-robot software. After poking around a bit it, the blocks they were deploying were a mix of:
- Whitelisted User Agents – Following a few requests from PHP cURL the site started blocking requests from my IP that didn’t include a “regular” user agent.
- Soft IP rate limits – After a couple of dozen requests from my IP I started receiving “Solve this captcha” pages in order to view the target content.
With a full browser environment, we now need to tackle the IP restrictions that cause captchas to appear. At face value, like most people, I assumed solving captchas with OCR magic would be easier than getting new IPs after a couple of requests but it turns out that’s not true. There weren’t any usable “captcha solvers” on npm so I decided to pursue the IP angle. The idea would be to grab a new IP address after a few requests to avoid having to solve a captcha which would require human intervention. Following some research, I found out that it’s possible to use Tor as a SOCKS proxy from a third party application. So concretely, we can launch a Tor circuit and then push our Electron HTTP requests through Tor to get a different IP address that your normal Internet connection.
Ok, enough talk, show me some code!
I setup a test “target page” at http://code.setfive.com/scraper_demo/ which randomly shows “content you want” and a “please solve this captcha”. The github repository at https://github.com/adatta02/electron-scraper-skeleton has all the goodies, a runnable Electron application. The money file is injected.js which looks like:
To run that locally, you’ll need to do the usual “npm install” and then also run a Tor instance if you want to get a new IP address on every request. The way it’s implemented, it’ll detect the “content you want” and also alert you when there’s a captcha by playing a “ding!” sound. To launch, first start Tor and let it connect. Then you should be able to run:
Once it loads, you’ll see the test page in what looks like a Chrome window with a devtools instance. As it refreshes, you’ll notice that the IP address is displays for you keeps updating. One “gotcha” is that by default Tor will only get a new IP address each time it opens a conduit, so you’ll notice that I run “killall” after each request which closes the Tor conduit and forces it to reopen.
And that’s about it. Using Tor with the skeleton you should be able to build a scraper that presents a new IP frequently, scrapes data, and conveniently notifies you if human input is required.
As always questions and comments are welcomed!
A feature request we get fairly frequently is the ability to convert an HTML document to a PDF. Maybe it’s a report of some sort or a group of charts but the goal is the same – faithfully replicate a HTML document as a PDF. If you try Google, you’ll get a bunch of options from the open source wkhtmltopdf to the commercial (and pricey) Prince PDF. We’ve tried those two as well as a couple of others and never been thrilled with the results. Simple documents with limited CSS styles work fine but as the documents get more complicated the solutions fail, often miserably. One conversion method that has consistently generated accurate results has been using Chrome’s “Print to PDF” functionality. One of the reasons for this is that Chrome uses its rendering engine, Blink, to create the PDF files.
So then the question is how can we run Chrome in a way to facilitate programmatically creating PDFs? Enter, Electron. Electron is a framework for building cross platform GUI applications and it provides this by basically being a programmable minimal Chrome browser running nodejs. With Electron, you’ll have access to Chrome’s rendering engine as well as the ability to use nodejs packages. Since Electron can leverage nodejs modules, we’ll use Gearman to facilitate communicating between our Electron app and clients that need HTML converted to PDFs.
The code as well as a PHP example are below:
As you can see it’s pretty straightforward. And you can start the Electron app by running “./node_modules/electron/dist/electron .” after running “npm install”.
One caveat is you’ll still need a X windows display available for Electron to connect to and use. Luckily, you can use Xvfb, which is a virtual framebuffer, on a server since you obviously wont have a physical display. If you’re on Ubuntu you can run the following to grab all dependencies and setup the display:
sudo apt-get install chromium-browser libgconf-2-4 xvfb Xvfb :19 -screen 0 1024x768x16 & export DISPLAY=:19
After that, you can launch your Electron app normally and it’ll use a virtual display.
Anyway, as always let me know if you have any questions or feedback!
We recently started a new project and decided to use TypeScript along with Angular 1.5. Angular 1.5 introduces a new abstraction called a “component” which closely resembles Angular 2’s component based approach. Surprisingly, there isn’t a lot of simple TypeScript sample code available for Angular 1.5 so I decided to throw something together in case anyone else is looking. The code is available at https://github.com/Setfive/ng_typescript_starter and a live demo of it is running at http://code.setfive.com/ng_typescript_starter.
So what are some standouts with TypeScript and Angular 1.5?
- The 1 way bindings components introduce are easier to reason about but having to explicitly add functions for “outputs” does add some verbosity
- Related to that, there’s a fair amount of boilerplate to create a single component since you have to define 2 classes
- Dropping $scope in favor of automatically binding the controller object to $ctrl in templates is great – especially with TypeScript classes
- Related to that, without $scope for events it’s unclear when it’s appropriate to use $rootScope for an event bus
- You can write typesafe code for almost all of your controller business logic
- It’s really unfortunate the TypeScript compiler can’t typecheck your Angular templates
- Using the $inject annotation with component classes looks “right” versus the “array like” syntax
- You need to be somewhat cognizant of matching your @types annotations with the correct version of the library you’re using
- Using components with ui-router makes it fairly difficult to communicate between sibling views
As always, questions and comments are welcome!
Over the past few day we’ve been evaluating using Angular 1.x vs. Angular 2 for a new project on which in the past we would have used Angular 1.x without much debate. However, with release of Angular 2 around the corner we decided to evaluate what starting a project with Angular 2 would involve. As we started digging in it became clear that using Angular 2 without programming in TypeScript would be technically possible but painful to put it lightly. Because of the tight timeline of the project we decided that was too large of a technical risk so we decided to move forward with 1.x. But I decided to spend some time looking at TypeScript anyway, for science. I didn’t have anything substantial to write but needed to hammer out a quick HTML scraper so I decided to whip it up in TypeScript.
Discovering functionality in modules is easier. In order to properly interface with nodejs modules you’ll need to grab type definitions from somewhere like DefinitelyTyped. The definition files are similar to “.h” files from C++, code stubs that just provide function type signatures to TypeScript. An awesome benefit of this is that it’s much easier to “discover” the functionality of nodejs modules by looking at how the functions transform data between types. It also makes it much easier to figure out the parameters of a callback without having to dig into docs or code.
TypeScript is definitely interesting and it’s tight coupling to Angular 2 only bolsters how useful it’ll be in the future. Next up, I’d be interested in building something more substantial with both a client and server component and hopefully share some of the same code on both.
As always, questions and comments are more than welcome!
Note: This is a bad idea™. Don’t do it unless you know why you’re doing it.
What you need to do is create a custom directive to render you form, use the directive’s “link” function to grab the form element, and then use jQuery’s serialize() function to generate a string that you could shoot off in a POST request.
Here’s a sample implementation:
Again, you should really only do this if you have a valid reason since you’re really fighting the framework by manipulating data this way.
Posted In: AngularJS