Blog

Ramblings on code, startups, and everything in between

Despite how important they are, MySQL indexes are a bit of a dark art. Sure everyone knows indexes are important but details on how they’re implemented and when they’ll be used are hard to come by. Beyond regular indexes, MySQL’s composite indexes are especially opaque in regards to how and when they’ll be used. As the name suggests composite indexes are an index constructed across two columns versus a regular index on a single column. So when might a composite index come in handy? Let’s take a look!

We’ll look at a table “client_order” that captures some fictional orders from our fictional clients:

And we’ll fill it up with 5 million fictional orders with dates spanning the last 10 years. You can grab the data from https://setfive-misc.s3.amazonaws.com/client_order.sql.gz if you want to follow along locally.

To get started, let’s figure out the total amount spent for a couple of clients:
https://gist.github.com/adatta02/f675b2c7b0659ab960d791b44ee02861

~1.5 seconds to calculate the sums and according to the EXPLAIN MySQL had to use a temporary table and a filesort. Will an index help here? Lets add one and find out.

~0.2 seconds and looking at the EXPLAIN we’ve cut down the number of rows MySQL has to look at to 424, much better. OK great, but now what if we’re only interested in looking at data from Christmas Eve in 2016?

(Note: Details on why we’re querying with full timestamps below)

As you can see, MySQL is still using the client_id index but we’re left still scanning 281,308 rows even though only 335 are actually relevant to us. So how do we fix this? Enter, the composite index! Let’s add one on (client_id, created_at) and see if it helps our query:

It helps but we’re clearly still looking a lot more rows than we need. So what gives? It turns out the order of the composite index is actually critically important since that dictates how MySQL assembles the b-tree for the index. Let’s flip the order of our index and try again:


And there you go! MySQL only has to look at 1360 rows as expected.

So what’s up with having to query with the full timestamps vs. just using DATE(created_at)? It turns out MySQL can’t use datetime indexes when you apply functions to the column you’re querying on. And beyond that, even certain ranges cause MySQL to not select indexes that would work fine:

Which then leads to the unintuitive conclusion that if you actually needed to implement any sort of aggregation by day you’d be better off adding a “date” column calculated from the “created_at” and indexing on that:

Anyway, as always comments and feedback welcome!

Posted In: Big Data

Tags:

When Amazon Web Services rolled out their version 4 signature we started seeing sporadic errors on a few projects when we created pre-authenticated link to S3 resources with a relative timestamp. Trying to track down the errors wasn’t easy. It seemed that it would occur rarely while executing the same exact code. Our code was simply to get a pre-authenticated URL that would expire in 7 days, the max duration V4 signatures are allowed to be valid. The error we’d get was “The expiration date of a signature version 4 presigned URL must be less than one week”. Weird, we kept passing in “7 days” as the expiration time. After the error occurred a couple of times over a few weeks I decided to look into it.

The code throwing the error was located right in the SignatureV4 class. The error is thrown when the end timestamp minus the start timestamp for the signature was greater than a week. Looking through the way the timestamps were generated it went something like this:

  1. Generate the start timestamp as current time for the signature assuming one is not passed.
  2. Do a few other quick things not related to this problem.
  3. Do a check to insure that the end minus start timestamp is less than a week in seconds.

So a rough example with straight PHP could of the above steps for a ‘7 days’ expiration would be as follows:

Straight forward enough, right? the problem lies when a second “rolls” between generating the `$start` and the end timestamp check. For example, if you generate the `$start` at `2017-08-20 12:01:01.999999`. Let’s say this gets assigned the timestamp of `2017-08-20 12:01:01`. Then the check for the 7 weeks occurs at `2017-08-27 12:01:02.0000` it’ll throw an exception as duration between the start and end it’s actually now for 86,401 seconds total. It turns outs triggering this error is easier than you’d think. Run this script locally:

That will throw an exception within a few seconds of running most likely.

After I figured out the error, the next step was to submit a issue to make sure I’m not misunderstanding how the library should be used. The simplest fix for me was to generate the end expiration timestamp before generating the start timestamp. After I made the PR, Kevin S. from AWS pointed out that while this fixed the problem, the duration still wasn’t guaranteed to always be the same for the same relative time period. For example, if you created 1000 presigned URLs all with ‘+7 days’ as the valid period, some may be 86400 in duration others may be 86399. This isn’t a huge problem, but Kevin made a great point that we could solve the problem by locking the relative timestamp for the end based on the start timestamp. After adding that to the PR it was accepted. As of release 3.32.4 the fix is now included in the SDK.

Posted In: Amazon AWS, PHP

Tags: , , , , , ,

I was recently out with a friend of mine who mentioned that he was having a tough time scraping some data off a website. After a few drinks we arrived at a barter, if I could scrape the data he’d buy me some single malt scotch which seemed like a great deal for me. I assumed I’d make a couple of HTTP requests, parse some HTML, grab the data and dump it into a CSV. In the worst case I imagined having to write some custom code to login to a web app and maybe sticky some cookies. And then I got started.

As it turned out this site was running one of the most sophisticated anti-scraping/anti-robot packages I’ve ever encountered. In a regular browser session everything looked normal but after a half dozen or so programmatic HTTP requests I started running into their anti-robot software. After poking around a bit it, the blocks they were deploying were a mix of:

  • Whitelisted User Agents – Following a few requests from PHP cURL the site started blocking requests from my IP that didn’t include a “regular” user agent.
  • Requiring cookies and Javascript – I thought this was actually really clever. After a couple of requests the site started quietly loading an intermediate page that required your browser to run Javascript to set a cookie and then complete a POST request to a URL that included a nonce in order to view a page. To a regular user, this was fairly transparent since it happened so quickly but it obviously trips up a client HTTP client.
  • Soft IP rate limits – After a couple of dozen requests from my IP I started receiving “Solve this captcha” pages in order to view the target content.

Taken all together, it’s a pretty sophisticated setup for what’s effectively a niche social networking site. With the “requires Javascript” requirement I decided to explore using Electron for this project. And turns out, it’s a perfect fit. For a quick primer, Electron is an open source project from GitHub that enables developers to build cross platform desktop applications by merging nodejs and Chrome. Developers end up writing Javascript that can leverage the nodejs ecosystem while also using Chrome’s browser internals to render windows and widgets. Electron helps in this use case because it provides a full Chrome browser that’s scriptable and has access to node’s system level modules. For completeness, you could implement all of this in a Chrome extension but in my experience extensions have more complicated non-privileged to privileged communication and lack access to node so you can’t just fire off a “fs.writeFileSync” to persist your results.

With a full browser environment, we now need to tackle the IP restrictions that cause captchas to appear. At face value, like most people, I assumed solving captchas with OCR magic would be easier than getting new IPs after a couple of requests but it turns out that’s not true. There weren’t any usable “captcha solvers” on npm so I decided to pursue the IP angle. The idea would be to grab a new IP address after a few requests to avoid having to solve a captcha which would require human intervention. Following some research, I found out that it’s possible to use Tor as a SOCKS proxy from a third party application. So concretely, we can launch a Tor circuit and then push our Electron HTTP requests through Tor to get a different IP address that your normal Internet connection.

Ok, enough talk, show me some code!

I setup a test “target page” at http://code.setfive.com/scraper_demo/ which randomly shows “content you want” and a “please solve this captcha”. The github repository at https://github.com/adatta02/electron-scraper-skeleton has all the goodies, a runnable Electron application. The money file is injected.js which looks like:

To run that locally, you’ll need to do the usual “npm install” and then also run a Tor instance if you want to get a new IP address on every request. The way it’s implemented, it’ll detect the “content you want” and also alert you when there’s a captcha by playing a “ding!” sound. To launch, first start Tor and let it connect. Then you should be able to run:

Once it loads, you’ll see the test page in what looks like a Chrome window with a devtools instance. As it refreshes, you’ll notice that the IP address is displays for you keeps updating. One “gotcha” is that by default Tor will only get a new IP address each time it opens a conduit, so you’ll notice that I run “killall” after each request which closes the Tor conduit and forces it to reopen.

And that’s about it. Using Tor with the skeleton you should be able to build a scraper that presents a new IP frequently, scrapes data, and conveniently notifies you if human input is required.

As always questions and comments are welcomed!

Posted In: Javascript

Tags: , ,

It has become a bi-weekly ritual. The professor spent too much time on the course material again and is left mumbling through a complex project description during the 11th hour of class. All the while, you’re off somewhere else. As you sling your backpack over your shoulder, you catch the only words you’ll need to hear: “You can download the syllabus along with the source code from the CS department’s website,” they say. Great! You hustle back to study location of choice, open your laptop, and extract the project files. After the obligatory knuckle crack, you look down at the method stubs spelled out for you. “All I have to do is fill-in these functions?” you think to yourself. And as you’re getting familiar with the project structure, a couple flicks of the scroll wheel reveal hundreds, sometimes thousands of lines of unexplained boilerplate code.

You eventually finish up the assignment and push it to the CS department’s server for grading. Without fail, someone raises their hand during the next class asking the instructor if they could explain what some of that boilerplate code was for, at which point the student is usually told to refer to the language documentation to figure it out for themselves. And for the most part, this makes perfect sense. After all, you’re there to learn about some of the more complex topics in computer science, not to write setter and getter methods all day. That’s what your data structures class was for.

But I would like to share with you the first few months of my experience as a Jr. Software Engineer and compare it to my time as an undergraduate student. You might be not-so-surprised to hear I have spent more time writing code similar to the boilerplate stuff mentioned above than I have perfecting the space and time complexity of my pioneering solution to The Traveling Salesman problem.

As an undergraduate student, I was an ace at avoiding merge conflicts in repositories where I was the only contributor. I could even run a build script with the best of ‘em. Nobody ever really told me how to use version control systems to manage a collaborative project with tens of thousands of lines of code strewn across a mess of files and directories. And if, for some reason, those same build scripts broke or a merge conflict popped up on a group project? Well, I was pretty much at the mercy of Stack Overflow.
At Setfive, when I was tasked with setting up a relational database schema for my first real project, I wasn’t really sure where to begin. There was no syllabus to refer to and no professor to schedule office hours with. While I was aware of relational database software such as MySQL and NodeJS, I had never really written a query, so I certainly didn’t know the difference between an inner and outer join. And while coordinating all those AJAX calls and setting up the Symfony bundle configs was a little confusing at first, I think I’m starting to learn how to apply my undergraduate education to these real-world projects.

So far, I have found that industry-level programming helps hone a much more practical skill set than academic programming. Don’t get me wrong, I learned a ton in college, and I know the concepts taught are not only important to a fundamental understanding of the field of computer science, but also have profound and meaningful applications elsewhere, such as in operating systems, machine learning, and so on. But when I look back on the things I have learned in such a short period of time over these past few months, it gets me excited for the road ahead. I owe an enormous thanks to Setfive for bringing me on as an entry-level software developer and advising me with patience.

Posted In: General

You might remember Txty Jukebox, our free to use collaborative music web app that we built on top of the YouTube Data API. We were happy to find that our original version was well received and even got some press from the folks over at makeuseof.com. Well, we’ve finally got a chance to spend some time ( big thanks to our new hire Josh who led the charge ) to make improvements based on the feedback we received and re-branded it under jointdj.com!

The main idea behind our music inspired web application is to create an easy way for groups of people to collaboratively share and listen to song (and video) requests. Any user with a smart phone or computer can enter the event code provided by the event’s host on jointdj.com and start submitting songs to the event’s playlist. The “event” doesn’t always have to be a traditional party either, for example, we’ve been using Joint DJ ourselves in our office as a Pandora or Spotify replacement.

To see how it works I suggest skimming the jointdj.com landing page which does a good job of quickly outlining how to use. Instead of regurgitating that information here I’ll highlight a few new features/improvements to get excited about:
  • One big lesson learned from our first go around with Txty Jukebox was that while it’s great when everyone at your event is engaged and the song queue is filled up you can run into awkward silences if the playlist runs of songs when people get distracted, say, doing work or playing an intense game of flip cup. In the past you had to wait until someone queued another song so it became a bit of a chore for the event host. To solve this issue and ensure there will never be a silent moment, we’ve created a new feature that lets the event host to pick a genre of music when they create an event from which a song will be randomly selected and played if a playlist ever runs out. For example, I could create an event with “Top 40 / Pop” as the auto fill genre. If at any point during my event the playlist is empty, all the sudden the latest Chainsmokerz song will magically be queued up!
  • Another issue we saw in the first version was that sometimes users didn’t get the exact song played that they were searching for. That was because we automatically selected the first result from Youtube regardless of whether it’s the desired result. For Joint DJ, we’ve added the ability for users to use an intuitive browser based UI to easily search for a song and then review the list of music video results from YouTube along with the thumbnail. Once the user finds exactly what song they want to play they can simply select it to add it to the event’s playlist.

  • Lastly, we improved the design of the live player view where events users can watch and listen to the music videos associated with the requests. You’ll see “flash” messages when songs are added that show the artist, title and which “DJ” submitted it. Additionally we show the next 4-5 upcoming songs in the queue along with their thumbnails on the left side of the player window. Overall, the new look is more colorful and crisp and should be more impressive to the events users keeping them engaged, having fun, and contributing songs to the event. Below is a screenshot of what the live player view looks like:

Posted In: AngularJS, Demo, General, Javascript, Launch