Ramblings on code, startups, and everything in between
I was recently out with a friend of mine who mentioned that he was having a tough time scraping some data off a website. After a few drinks we arrived at a barter: if I could scrape the data, he’d buy me some single malt scotch, which seemed like a great deal for me. I assumed I’d make a couple of HTTP requests, parse some HTML, grab the data and dump it into a CSV. In the worst case I imagined having to write some custom code to log in to a web app and maybe sticky some cookies. And then I got started.
As it turned out, this site was running one of the most sophisticated anti-scraping/anti-robot packages I’ve ever encountered. In a regular browser session everything looked normal, but after a half dozen or so programmatic HTTP requests I started running into their anti-robot software. After poking around a bit, I found that the blocks they were deploying were a mix of:
With a full browser environment, we now need to tackle the IP restrictions that cause captchas to appear. At face value, like most people, I assumed solving captchas with OCR magic would be easier than getting new IPs after a couple of requests, but it turns out that’s not true. There weren’t any usable “captcha solvers” on npm, so I decided to pursue the IP angle. The idea would be to grab a new IP address after a few requests to avoid having to solve a captcha, which would require human intervention. Following some research, I found out that it’s possible to use Tor as a SOCKS proxy from a third-party application. So concretely, we can launch a Tor circuit and then push our Electron HTTP requests through Tor to get a different IP address than your normal Internet connection.
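To make the proxy idea concrete, here is a minimal sketch of pointing an Electron app at a local Tor SOCKS proxy. The helper names `torProxyUrl` and `routeThroughTor` are made up for illustration, and the sketch assumes Tor is listening on its default SOCKS port, 9050, on the same machine. It is not necessarily how the skeleton repository wires things up.

```javascript
// Assumption: Tor runs locally on its default SOCKS port (9050).
// The Tor Browser bundle uses 9150 instead.
const TOR_SOCKS_HOST = '127.0.0.1';
const TOR_SOCKS_PORT = 9050;

// Hypothetical helper: build the proxy URL in the form Chromium expects.
function torProxyUrl(host = TOR_SOCKS_HOST, port = TOR_SOCKS_PORT) {
  return `socks5://${host}:${port}`;
}

// Hypothetical helper: tell Electron's embedded Chromium to use the proxy.
// `app` is Electron's `app` object; call this before the "ready" event fires.
function routeThroughTor(app, proxyUrl = torProxyUrl()) {
  app.commandLine.appendSwitch('proxy-server', proxyUrl);
  return proxyUrl;
}

// In an Electron main process this would be wired up roughly as:
//   const { app } = require('electron');
//   routeThroughTor(app);
```

Passing `app` in as an argument keeps the helper easy to exercise outside of Electron, since a stub object with a `commandLine.appendSwitch` function is enough.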
Ok, enough talk, show me some code!
I set up a test “target page” at http://code.setfive.com/scraper_demo/ which randomly shows “content you want” and a “please solve this captcha”. The GitHub repository at https://github.com/adatta02/electron-scraper-skeleton has all the goodies: a runnable Electron application. The money file is injected.js which looks like:
To run that locally, you’ll need to do the usual “npm install” and then also run a Tor instance if you want to get a new IP address on every request. The way it’s implemented, it’ll detect the “content you want” and also alert you when there’s a captcha by playing a “ding!” sound. To launch, first start Tor and let it connect. Then you should be able to run:
Once it loads, you’ll see the test page in what looks like a Chrome window with a devtools instance. As it refreshes, you’ll notice that the IP address it displays for you keeps updating. One “gotcha” is that by default Tor will only get a new IP address each time it opens a circuit, so you’ll notice that I run “killall” after each request, which closes the Tor circuit and forces it to reopen.
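The rotation step can be sketched as a small counter that triggers a "rotate" action every N finished requests. Everything here is an illustration: `createIpRotator` and `killTor` are made-up names, the process is assumed to be called `tor`, and something else (a shell loop or a supervisor) is assumed to start Tor again once it has been killed.

```javascript
// Hypothetical helper: returns a function to call after every finished
// request; it invokes `rotate` once per `everyN` calls.
function createIpRotator(rotate, everyN = 1) {
  let sinceLastRotation = 0;
  return function onRequestFinished() {
    sinceLastRotation += 1;
    if (sinceLastRotation < everyN) return false; // keep the current IP
    sinceLastRotation = 0;
    rotate();
    return true; // a rotation was triggered
  };
}

// One possible `rotate`: stop the running Tor process with killall.
// Assumption: the process is named "tor" and is restarted externally.
function killTor() {
  const { execFile } = require('child_process');
  execFile('killall', ['tor'], (err) => {
    if (err) console.error('killall tor failed:', err.message);
  });
}

// Rotating after every single request, as described above:
//   const onRequestFinished = createIpRotator(killTor, 1);
```

Raising `everyN` trades fewer Tor restarts for a higher chance of running into a captcha before the next rotation.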
And that’s about it. Using Tor with the skeleton you should be able to build a scraper that presents a new IP frequently, scrapes data, and conveniently notifies you if human input is required.
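As a rough illustration of the "detect content or ask a human" step, the check can be as simple as looking for marker text on the page. The function name `classifyPage` is made up, and the marker strings are assumed from the description of the demo page, so the real page text may differ.

```javascript
// Assumed marker phrases, taken from the description of the demo page.
const CONTENT_MARKER = 'content you want';
const CAPTCHA_MARKER = 'please solve this captcha';

// Hypothetical helper: decide what kind of page we are looking at.
function classifyPage(pageText) {
  const text = String(pageText).toLowerCase();
  if (text.includes(CAPTCHA_MARKER)) return 'captcha';
  if (text.includes(CONTENT_MARKER)) return 'content';
  return 'unknown';
}

// In a script injected into the page this might be used as:
//   const kind = classifyPage(document.body.innerText);
//   if (kind === 'captcha') { /* play the "ding!" and wait for a human */ }
//   if (kind === 'content') { /* scrape the data */ }
```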
As always questions and comments are welcomed!
It has become a bi-weekly ritual. The professor spent too much time on the course material again and is left mumbling through a complex project description during the 11th hour of class. All the while, you’re off somewhere else. As you sling your backpack over your shoulder, you catch the only words you’ll need to hear: “You can download the syllabus along with the source code from the CS department’s website,” they say. Great! You hustle back to your study location of choice, open your laptop, and extract the project files. After the obligatory knuckle crack, you look down at the method stubs spelled out for you. “All I have to do is fill in these functions?” you think to yourself. And as you’re getting familiar with the project structure, a couple flicks of the scroll wheel reveal hundreds, sometimes thousands of lines of unexplained boilerplate code.
You eventually finish up the assignment and push it to the CS department’s server for grading. Without fail, someone raises their hand during the next class asking the instructor if they could explain what some of that boilerplate code was for, at which point the student is usually told to refer to the language documentation to figure it out for themselves. And for the most part, this makes perfect sense. After all, you’re there to learn about some of the more complex topics in computer science, not to write setter and getter methods all day. That’s what your data structures class was for.
But I would like to share with you the first few months of my experience as a Jr. Software Engineer and compare it to my time as an undergraduate student. You might be not-so-surprised to hear I have spent more time writing code similar to the boilerplate stuff mentioned above than I have perfecting the space and time complexity of my pioneering solution to The Traveling Salesman problem.
As an undergraduate student, I was an ace at avoiding merge conflicts in repositories where I was the only contributor. I could even run a build script with the best of ‘em. Nobody ever really told me how to use version control systems to manage a collaborative project with tens of thousands of lines of code strewn across a mess of files and directories. And if, for some reason, those same build scripts broke or a merge conflict popped up on a group project? Well, I was pretty much at the mercy of Stack Overflow.
At Setfive, when I was tasked with setting up a relational database schema for my first real project, I wasn’t really sure where to begin. There was no syllabus to refer to and no professor to schedule office hours with. While I was aware of relational database software such as MySQL, I had never really written a query, so I certainly didn’t know the difference between an inner and outer join. And while coordinating all those AJAX calls and setting up the Symfony bundle configs was a little confusing at first, I think I’m starting to learn how to apply my undergraduate education to these real-world projects.
So far, I have found that industry-level programming helps hone a much more practical skill set than academic programming. Don’t get me wrong, I learned a ton in college, and I know the concepts taught are not only important to a fundamental understanding of the field of computer science, but also have profound and meaningful applications elsewhere, such as in operating systems, machine learning, and so on. But when I look back on the things I have learned in such a short period of time over these past few months, it gets me excited for the road ahead. I owe an enormous thanks to Setfive for bringing me on as an entry-level software developer and advising me with patience.
You might remember Txty Jukebox, our free-to-use collaborative music web app that we built on top of the YouTube Data API. We were happy to find that our original version was well received and even got some press from the folks over at makeuseof.com. Well, we finally got a chance to spend some time (big thanks to our new hire Josh, who led the charge) making improvements based on the feedback we received, and we re-branded it under jointdj.com!
The main idea behind our music-inspired web application is to create an easy way for groups of people to collaboratively share and listen to song (and video) requests. Any user with a smartphone or computer can enter the event code provided by the event’s host on jointdj.com and start submitting songs to the event’s playlist. The “event” doesn’t always have to be a traditional party either. For example, we’ve been using Joint DJ ourselves in our office as a Pandora or Spotify replacement.
A feature request we get fairly frequently is the ability to convert an HTML document to a PDF. Maybe it’s a report of some sort or a group of charts, but the goal is the same – faithfully replicate an HTML document as a PDF. If you try Google, you’ll get a bunch of options, from the open source wkhtmltopdf to the commercial (and pricey) Prince PDF. We’ve tried those two as well as a couple of others and have never been thrilled with the results. Simple documents with limited CSS styles work fine, but as the documents get more complicated the solutions fail, often miserably. One conversion method that has consistently generated accurate results has been using Chrome’s “Print to PDF” functionality. One of the reasons for this is that Chrome uses its rendering engine, Blink, to create the PDF files.
So then the question is: how can we run Chrome in a way that facilitates programmatically creating PDFs? Enter Electron. Electron is a framework for building cross-platform GUI applications, and it provides this by basically being a programmable, minimal Chrome browser running nodejs. With Electron, you’ll have access to Chrome’s rendering engine as well as the ability to use nodejs packages. Since Electron can leverage nodejs modules, we’ll use Gearman to facilitate communication between our Electron app and clients that need HTML converted to PDFs.
The code as well as a PHP example are below:
As you can see it’s pretty straightforward. And you can start the Electron app by running “./node_modules/electron/dist/electron .” after running “npm install”.
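Leaving the Gearman plumbing aside, the conversion step itself might look roughly like the sketch below. The helper names `htmlToDataUrl` and `htmlToPdf` are made up for illustration, and the sketch assumes a recent Electron release in which `loadURL` and `webContents.printToPDF` return promises (older releases used callbacks instead).

```javascript
// Hypothetical helper: wrap an HTML string in a data: URL a window can load.
function htmlToDataUrl(html) {
  return 'data:text/html;charset=utf-8,' + encodeURIComponent(html);
}

// Hypothetical helper: render `html` in `win` and resolve with the PDF bytes.
// `win` is an Electron BrowserWindow, typically created with `show: false`.
// Assumption: a promise-based Electron API (recent releases).
async function htmlToPdf(win, html, pdfOptions = { printBackground: true }) {
  await win.loadURL(htmlToDataUrl(html)); // wait for the page to finish loading
  return win.webContents.printToPDF(pdfOptions); // resolves with a Buffer
}

// In an Electron main process, once the app is ready:
//   const { BrowserWindow } = require('electron');
//   const win = new BrowserWindow({ show: false });
//   const pdf = await htmlToPdf(win, '<h1>Report</h1>');
//   require('fs').writeFileSync('report.pdf', pdf);
```

A worker would then call something like `htmlToPdf` once per incoming job and hand the resulting bytes back to the client.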
One caveat is that you’ll still need an X Window display available for Electron to connect to and use. Luckily, you can use Xvfb, which is a virtual framebuffer, on a server since you obviously won’t have a physical display. If you’re on Ubuntu you can run the following to grab all dependencies and set up the display:
sudo apt-get install chromium-browser libgconf-2-4 xvfb
Xvfb :19 -screen 0 1024x768x16 &
export DISPLAY=:19
After that, you can launch your Electron app normally and it’ll use a virtual display.
Anyway, as always let me know if you have any questions or feedback!
On one of the projects I am working on, I had the following problem: I needed to create an aggregate temporary table in the database from a few different queries while still using Doctrine2. I needed to aggregate the results in the database rather than in memory, as the result set could be very large, causing the PHP process to run out of memory. The reason I still wanted to use Doctrine to get the base queries was that the application passes around a QueryBuilder object to add restrictions to the query, which may be defined outside of the current function. Every query in the application goes through this process for security purposes.
After looking around a bit, it was clear that Doctrine did not support (and shouldn’t support) what I was trying to do. My next step was to figure out how to get an executable query from Doctrine2 without ever running it. Doctrine2 has a built-in SQL logger interface which basically lets you listen for executed queries and see what the actual SQL and parameters were. The problem was that I didn’t want to actually execute the query I had built in Doctrine; I just wanted the SQL that would be executed via PDO. After digging through the code a bit further, I found the routines that Doctrine uses to actually build the query and parameters for PDO to execute; however, the methods were all private and internal. I came up with the following class to take a Doctrine Query and return a SQL statement, parameters, and parameter types that can be used to execute it via PDO.
In the ExampleUsage.php file above I take a query builder, get the runnable query, and then insert it into my temporary table. In my circumstance I had about 3-4 of these types of statements.
If you look at the QueryUtils::getRunnableQueryAndParametersForQuery function, it does a number of things.
Overall, this was the best solution I could find at the time for what I was trying to do. If I had been OK with running the query first, capturing the actual SQL via a SQL logger would have been the proper and best route to go; however, I did not want to run the query.
Hope this helps if you find yourself in a similar situation!