Ramblings on code, startups, and everything in between
When Amazon Web Services rolled out their version 4 signature we started seeing sporadic errors on a few projects when we created pre-authenticated link to S3 resources with a relative timestamp. Trying to track down the errors wasn’t easy. It seemed that it would occur rarely while executing the same exact code. Our code was simply to get a pre-authenticated URL that would expire in 7 days, the max duration V4 signatures are allowed to be valid. The error we’d get was “The expiration date of a signature version 4 presigned URL must be less than one week”. Weird, we kept passing in “7 days” as the expiration time. After the error occurred a couple of times over a few weeks I decided to look into it.
The code throwing the error was located right in the SignatureV4 class. The error is thrown when the end timestamp minus the start timestamp for the signature was greater than a week. Looking through the way the timestamps were generated it went something like this:
So a rough example with straight PHP could of the above steps for a ‘7 days’ expiration would be as follows:
Straight forward enough, right? the problem lies when a second “rolls” between generating the `$start` and the end timestamp check. For example, if you generate the `$start` at `2017-08-20 12:01:01.999999`. Let’s say this gets assigned the timestamp of `2017-08-20 12:01:01`. Then the check for the 7 weeks occurs at `2017-08-27 12:01:02.0000` it’ll throw an exception as duration between the start and end it’s actually now for 86,401 seconds total. It turns outs triggering this error is easier than you’d think. Run this script locally:
That will throw an exception within a few seconds of running most likely.
After I figured out the error, the next step was to submit a issue to make sure I’m not misunderstanding how the library should be used. The simplest fix for me was to generate the end expiration timestamp before generating the start timestamp. After I made the PR, Kevin S. from AWS pointed out that while this fixed the problem, the duration still wasn’t guaranteed to always be the same for the same relative time period. For example, if you created 1000 presigned URLs all with ‘+7 days’ as the valid period, some may be 86400 in duration others may be 86399. This isn’t a huge problem, but Kevin made a great point that we could solve the problem by locking the relative timestamp for the end based on the start timestamp. After adding that to the PR it was accepted. As of release 3.32.4 the fix is now included in the SDK.
I was recently out with a friend of mine who mentioned that he was having a tough time scraping some data off a website. After a few drinks we arrived at a barter, if I could scrape the data he’d buy me some single malt scotch which seemed like a great deal for me. I assumed I’d make a couple of HTTP requests, parse some HTML, grab the data and dump it into a CSV. In the worst case I imagined having to write some custom code to login to a web app and maybe sticky some cookies. And then I got started.
As it turned out this site was running one of the most sophisticated anti-scraping/anti-robot packages I’ve ever encountered. In a regular browser session everything looked normal but after a half dozen or so programmatic HTTP requests I started running into their anti-robot software. After poking around a bit it, the blocks they were deploying were a mix of:
With a full browser environment, we now need to tackle the IP restrictions that cause captchas to appear. At face value, like most people, I assumed solving captchas with OCR magic would be easier than getting new IPs after a couple of requests but it turns out that’s not true. There weren’t any usable “captcha solvers” on npm so I decided to pursue the IP angle. The idea would be to grab a new IP address after a few requests to avoid having to solve a captcha which would require human intervention. Following some research, I found out that it’s possible to use Tor as a SOCKS proxy from a third party application. So concretely, we can launch a Tor circuit and then push our Electron HTTP requests through Tor to get a different IP address that your normal Internet connection.
Ok, enough talk, show me some code!
I setup a test “target page” at http://code.setfive.com/scraper_demo/ which randomly shows “content you want” and a “please solve this captcha”. The github repository at https://github.com/adatta02/electron-scraper-skeleton has all the goodies, a runnable Electron application. The money file is injected.js which looks like:
To run that locally, you’ll need to do the usual “npm install” and then also run a Tor instance if you want to get a new IP address on every request. The way it’s implemented, it’ll detect the “content you want” and also alert you when there’s a captcha by playing a “ding!” sound. To launch, first start Tor and let it connect. Then you should be able to run:
Once it loads, you’ll see the test page in what looks like a Chrome window with a devtools instance. As it refreshes, you’ll notice that the IP address is displays for you keeps updating. One “gotcha” is that by default Tor will only get a new IP address each time it opens a conduit, so you’ll notice that I run “killall” after each request which closes the Tor conduit and forces it to reopen.
And that’s about it. Using Tor with the skeleton you should be able to build a scraper that presents a new IP frequently, scrapes data, and conveniently notifies you if human input is required.
As always questions and comments are welcomed!
It has become a bi-weekly ritual. The professor spent too much time on the course material again and is left mumbling through a complex project description during the 11th hour of class. All the while, you’re off somewhere else. As you sling your backpack over your shoulder, you catch the only words you’ll need to hear: “You can download the syllabus along with the source code from the CS department’s website,” they say. Great! You hustle back to study location of choice, open your laptop, and extract the project files. After the obligatory knuckle crack, you look down at the method stubs spelled out for you. “All I have to do is fill-in these functions?” you think to yourself. And as you’re getting familiar with the project structure, a couple flicks of the scroll wheel reveal hundreds, sometimes thousands of lines of unexplained boilerplate code.
You eventually finish up the assignment and push it to the CS department’s server for grading. Without fail, someone raises their hand during the next class asking the instructor if they could explain what some of that boilerplate code was for, at which point the student is usually told to refer to the language documentation to figure it out for themselves. And for the most part, this makes perfect sense. After all, you’re there to learn about some of the more complex topics in computer science, not to write setter and getter methods all day. That’s what your data structures class was for.
But I would like to share with you the first few months of my experience as a Jr. Software Engineer and compare it to my time as an undergraduate student. You might be not-so-surprised to hear I have spent more time writing code similar to the boilerplate stuff mentioned above than I have perfecting the space and time complexity of my pioneering solution to The Traveling Salesman problem.
As an undergraduate student, I was an ace at avoiding merge conflicts in repositories where I was the only contributor. I could even run a build script with the best of ‘em. Nobody ever really told me how to use version control systems to manage a collaborative project with tens of thousands of lines of code strewn across a mess of files and directories. And if, for some reason, those same build scripts broke or a merge conflict popped up on a group project? Well, I was pretty much at the mercy of Stack Overflow.
At Setfive, when I was tasked with setting up a relational database schema for my first real project, I wasn’t really sure where to begin. There was no syllabus to refer to and no professor to schedule office hours with. While I was aware of relational database software such as MySQL and NodeJS, I had never really written a query, so I certainly didn’t know the difference between an inner and outer join. And while coordinating all those AJAX calls and setting up the Symfony bundle configs was a little confusing at first, I think I’m starting to learn how to apply my undergraduate education to these real-world projects.
So far, I have found that industry-level programming helps hone a much more practical skill set than academic programming. Don’t get me wrong, I learned a ton in college, and I know the concepts taught are not only important to a fundamental understanding of the field of computer science, but also have profound and meaningful applications elsewhere, such as in operating systems, machine learning, and so on. But when I look back on the things I have learned in such a short period of time over these past few months, it gets me excited for the road ahead. I owe an enormous thanks to Setfive for bringing me on as an entry-level software developer and advising me with patience.
Posted In: General
You might remember Txty Jukebox, our free to use collaborative music web app that we built on top of the YouTube Data API. We were happy to find that our original version was well received and even got some press from the folks over at makeuseof.com. Well, we’ve finally got a chance to spend some time ( big thanks to our new hire Josh who led the charge ) to make improvements based on the feedback we received and re-branded it under jointdj.com!
The main idea behind our music inspired web application is to create an easy way for groups of people to collaboratively share and listen to song (and video) requests. Any user with a smart phone or computer can enter the event code provided by the event’s host on jointdj.com and start submitting songs to the event’s playlist. The “event” doesn’t always have to be a traditional party either, for example, we’ve been using Joint DJ ourselves in our office as a Pandora or Spotify replacement.
A feature request we get fairly frequently is the ability to convert an HTML document to a PDF. Maybe it’s a report of some sort or a group of charts but the goal is the same – faithfully replicate a HTML document as a PDF. If you try Google, you’ll get a bunch of options from the open source wkhtmltopdf to the commercial (and pricey) Prince PDF. We’ve tried those two as well as a couple of others and never been thrilled with the results. Simple documents with limited CSS styles work fine but as the documents get more complicated the solutions fail, often miserably. One conversion method that has consistently generated accurate results has been using Chrome’s “Print to PDF” functionality. One of the reasons for this is that Chrome uses its rendering engine, Blink, to create the PDF files.
So then the question is how can we run Chrome in a way to facilitate programmatically creating PDFs? Enter, Electron. Electron is a framework for building cross platform GUI applications and it provides this by basically being a programmable minimal Chrome browser running nodejs. With Electron, you’ll have access to Chrome’s rendering engine as well as the ability to use nodejs packages. Since Electron can leverage nodejs modules, we’ll use Gearman to facilitate communicating between our Electron app and clients that need HTML converted to PDFs.
The code as well as a PHP example are below:
As you can see it’s pretty straightforward. And you can start the Electron app by running “./node_modules/electron/dist/electron .” after running “npm install”.
One caveat is you’ll still need a X windows display available for Electron to connect to and use. Luckily, you can use Xvfb, which is a virtual framebuffer, on a server since you obviously wont have a physical display. If you’re on Ubuntu you can run the following to grab all dependencies and setup the display:
sudo apt-get install chromium-browser libgconf-2-4 xvfb Xvfb :19 -screen 0 1024x768x16 & export DISPLAY=:19
After that, you can launch your Electron app normally and it’ll use a virtual display.
Anyway, as always let me know if you have any questions or feedback!