Tech: The 3 mistakes that doomed the Facebook Platform

Yesterday afternoon, PandoDaily’s Hamish McKenzie published a post titled “Move fast, break things: The sad story of Platform, Facebook’s gigantic missed opportunity.” The post outlined the lofty expectations and ultimate failures of the Facebook Platform. Central to Hamish’s piece was the thesis that a series of missteps by Facebook alienated developers and eventually pushed the platform into obscurity.

With the benefit of hindsight, I’d argue there were actually only three major mistakes that ended up dooming the Facebook Platform.

Lack of payments

Hamish mentions this, but I think the lack of payments across the platform was the source of many of its problems. With no seamless way to charge users for either “installs” themselves or “in-app purchases”, developers were forced to play the eyeball game and, as a consequence, were left clinging to the “viral loop”. Facebook Credits ended up being a non-starter, and as the Zynga spat demonstrated, the 30% haircut was intractable. In a world where Facebook had launched “card on file” style micropayments with the Platform, maybe we’d be exchanging “Facebook Credits” at Christmas.

No sponsored feed placements

Without on-platform payments, developers were essentially left chasing Facebook’s “viral loop” to drive new users, eyeballs, and hopefully eventually revenues. Developers eventually started gaming the system, generating what users perceived as spam, and ultimately forcing Facebook to change notifications. I’d argue that had developers originally had some way to pay for sponsored feed placements, they would have been less likely to chase virality. Along with the functionality to sponsor feed posts, Facebook undoubtedly would have ended up building rate limits and other spam-fighting measures to protect the “sponsored post” product, which ultimately would have helped the platform.

Everything tied to Connect

Even today, one of the most popular components of the Facebook Platform is Connect, the single sign-on piece. The problem was, and to some extent still is, that everything was tied to Connect. Even if you were just logging into a site with Connect, that site still had access to your entire Facebook account. Facebook eventually fixed this, but not before it opened the floodgates to sites posting unwanted updates, breaching user trust, and hurting the credibility of the entire platform.

The PandoDaily piece has a deeper exploration of what drove the decline of the Facebook Platform, but I think the lack of payments, the absence of sponsored feed posts, and the tie-in with Connect put the platform in a difficult position from day one.

PHP: Does “big-o” complexity really matter?

Last week, a client of ours asked us to look at some code that was running particularly slowly. The code powered an autocompleter that searched a list of high schools in the US and returned the matching schools along with an identifying code. We took a look at the code, and it turned out the original developers had implemented a naive solution that was choking now that the list had grown to ~45k elements; I imagine they had only tested with a dozen or so. While implementing a slicker solution, we decided to benchmark a couple of different approaches to see how much the differences in “big-o” complexity really mattered.

The Problem

What we were looking at was the following:

  • There is a CSV file that looks something like:

ID,SCHOOL NAME,STATE
2,NMSC DEPT OF ED & SVCS,IL
3,MY SCHOOL IS NOT LISTED DOMEST,NY
4,MY SCHOOL IS NOT LISTED-INTRNT,NY
8,DISTRICT COUNCIL 37 AFSCME,NY
20,AMERICAN SAMOA CMTY COLLEGE,AS
81,LANDMARK COLLEGE,VT

With data for about 45k schools.

  • On the frontend, there was a vanilla jQuery UI autocompleter that passed a state as well as “school name part” to the backend to retrieve autocomplete results.
  • The endpoint basically takes the state and school part, parses the available data, and returns the results as a JSON array.
  • So as an example, the function accepts something like {state: "MA", query: "New"} and returns:
[
  {name: "New School", code: 1234},
  {name: "Newton South", code: 1234},
  {name: "Newtown High", code: 1234}
]

The Solutions

In the name of science, we put together a couple of solutions, benchmarked them by running each one 1000 times, and calculated the min/max/average times; those values are graphed below. Each of the solutions is briefly described below along with how it’s referenced in the graph.
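
Before getting into the solutions, here's roughly how we timed each approach. This is a sketch of the harness, not our exact benchmark code; the function and argument names are placeholders for the implementations described below.

<?php
// Run a search function $iterations times and record the min / max / average
// elapsed times. $fn is one of the implementations described below.
function benchmark(callable $fn, array $args, $iterations = 1000)
{
    $times = array();
    for ($i = 0; $i < $iterations; $i++) {
        $start = microtime(true);
        call_user_func_array($fn, $args);
        $times[] = microtime(true) - $start;
    }

    return array(
        'min' => min($times),
        'max' => max($times),
        'average' => array_sum($times) / count($times),
    );
}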

The initial solution that our client had been running read the entire CSV into a PHP array, then searched the PHP array for schools that matched the query. (readMemoryScan)
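
A minimal sketch of that approach, assuming the CSV layout above (the file name is a placeholder):

<?php
// readMemoryScan: slurp the whole CSV into a PHP array, then filter it.
function readMemoryScan($state, $query, $file = 'schools.csv')
{
    $rows = array_map('str_getcsv', file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));

    $results = array();
    foreach ($rows as $row) {
        list($code, $name, $rowState) = $row;
        // The header row never matches a real state, so it falls through harmlessly.
        if ($rowState === $state && stripos($name, $query) === 0) {
            $results[] = array('name' => $name, 'code' => $code);
        }
    }

    return $results;
}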

A slightly better approach is doing the search “in-place” without actually reading the entire file into memory. (unsortedTableScan)
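
Sketched out under the same assumptions:

<?php
// unsortedTableScan: stream the file row by row with fgetcsv so the full
// ~45k rows are never held in memory at once, but still check every row.
function unsortedTableScan($state, $query, $file = 'schools.csv')
{
    $results = array();
    $handle = fopen($file, 'r');
    while (($row = fgetcsv($handle)) !== false) {
        list($code, $name, $rowState) = $row;
        if ($rowState === $state && stripos($name, $query) === 0) {
            $results[] = array('name' => $name, 'code' => $code);
        }
    }
    fclose($handle);

    return $results;
}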

But can we take advantage of how the data is structured? It turns out we can. Since we’re looking for schools in a specific state whose names start with a search string, we can sort the file by STATE and then SCHOOL NAME, which lets us abort the search early. (sortedTableScan)
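
A sketch of the early abort, assuming the header row has been stripped and the file pre-sorted by STATE then SCHOOL NAME:

<?php
// sortedTableScan: same matching logic, but once we've read past the target
// state (or past the last possible matching name within it), nothing later
// in the file can match, so we stop reading.
function sortedTableScan($state, $query, $file = 'schools_sorted.csv')
{
    $results = array();
    $query = strtoupper($query); // school names in the data are uppercase
    $handle = fopen($file, 'r');
    while (($row = fgetcsv($handle)) !== false) {
        list($code, $name, $rowState) = $row;

        if ($rowState > $state
            || ($rowState === $state && strncmp($name, $query, strlen($query)) > 0)) {
            break;
        }

        if ($rowState === $state && strpos($name, $query) === 0) {
            $results[] = array('name' => $name, 'code' => $code);
        }
    }
    fclose($handle);

    return $results;
}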

Since we’re always searching by STATE and SCHOOL NAME, can we exploit this to cut down on the number of elements that need to be searched even further?

Turns out we can, by transforming the CSV file into a PHP array indexed by state and then writing that out as a serialized PHP object. Another detail we can exploit is that the autocompleter has a minimum search length of 3 characters, so we can actually build sub-arrays inside each state’s list keyed on the first 3 letters of the school name (serializednFileScan); a sketch of the lookup follows the example structure below.

So the data structure we’d end up creating looks something like:

{
...
  "MA": {
  ...
   "AME": [...list of schools in MA starting with AME...],
   "NEW": [...list of schools in MA starting with NEW...],
  ...
  },
  "NJ": {
  ...
   "AME": [...list of schools in NJ starting with AME...],
   "NEW": [...list of schools in NJ starting with NEW...],
  ...
  },
  "CA": {
  ...
   "AME": [...list of schools in CA starting with AME...],
   "NEW": [...list of schools in CAA starting with NEW...],
  ...
  },
...
}
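
And a sketch of the corresponding lookup, assuming the structure above with each school stored as an array of name and code (file name is a placeholder):

<?php
// serializednFileScan: load the pre-built array keyed by state and by the
// first 3 letters of the school name, then only scan the matching bucket.
function serializednFileScan($state, $query, $file = 'schools.ser')
{
    $index = unserialize(file_get_contents($file));
    $bucket = strtoupper(substr($query, 0, 3));

    if (!isset($index[$state][$bucket])) {
        return array();
    }

    $results = array();
    foreach ($index[$state][$bucket] as $school) {
        if (stripos($school['name'], $query) === 0) {
            $results[] = $school;
        }
    }

    return $results;
}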

The results

Running each function 1000 times, recording the elapsed time for each run, and calculating the min / max / average times, we ended up with these numbers:

test_name            min (sec.)   max (sec.)   average (sec.)
readMemoryScan       .662         .690         .673
unsortedTableScan    .532         .547         .536
sortedTableScan      .260         .276         .264
serializednFileScan  .149         .171         .154

And then graphing the averages gets you a graphic that looks like:

[Chart: average execution time of the four implementations]
The most interesting metric is how the different autocompleters actually “feel” when you use them. We set up a demo at http://symf.setfive.com/autocomplete_test/. Turns out, a few hundred milliseconds makes a huge difference.

The conclusion

Looking at our numbers, even with relatively small data sets (<100k elements), the complexity of your algorithms matters. Even though the absolute differences are small, the responsiveness of the autocompleter varies dramatically across the implementations. Anyway, long story short? Pay attention in algorithms class.

Brainstorming: Can Google Trends Data Help Predict the Stock Market? Maybe…maybe not

According to a study by researchers at the University of Warwick Business School, using publicly available data on finance-related search terms from Google Trends may help predict what the stock market will do.

In the study titled “Quantifying Trading Behavior in Financial Markets Using Google Trends” researchers tracked about 100 financially related terms such as “debt,” “crisis” and “derivatives” from 2004 to 2011.

In order to test their theory, the researchers created an investing “game” to see if the volume of searches for financial terms in the week leading up to any given closing day could predict the Dow Jones Industrial Average (DJIA). If searches for financial terms went down, they opted to buy and hold stocks. Conversely, if search term volumes went up, the researchers would sell, assuming that the stocks were going to fall in value.
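
Roughly, in code, my reading of the game looks something like the sketch below. To be clear, this is a toy backtest, not the researchers' actual methodology; the lookback window and the data series are made up.

<?php
// Each week, compare the current search volume to the average of the
// previous $lookback weeks; if interest rose, go short for the week,
// otherwise buy and hold. $volumes and $prices are aligned weekly
// series, oldest first.
function simulateTrendsGame(array $volumes, array $prices, $lookback = 3)
{
    $value = 1.0; // normalized starting portfolio value

    for ($t = $lookback; $t < count($prices) - 1; $t++) {
        $window = array_slice($volumes, $t - $lookback, $lookback);
        $average = array_sum($window) / $lookback;
        $weeklyReturn = ($prices[$t + 1] - $prices[$t]) / $prices[$t];

        if ($volumes[$t] > $average) {
            $value *= 1 - $weeklyReturn; // short: profit when the index falls
        } else {
            $value *= 1 + $weeklyReturn; // long: profit when the index rises
        }
    }

    return $value;
}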

The game follows the logic that if people get anxious about the stock market, they will likely seek out information on financial issues before trying to get rid of their stock. It almost goes without saying that the first place people go when seeking information on just about anything is Google. So the idea was that finance-related searches on Google should spike before a stock market decline and vice versa. Thanks to Google Trends, the public now has access to aggregated information on the volume of queries for different search terms and how these volumes change over time.

The researchers further found that “debt” was the most reliable term for predicting market fluctuations.  By buying when “debt” search volumes dropped and selling when the volumes rose, the researchers were able to increase their hypothetical portfolio by 326%.  The chart below from the research report illustrates these results:

[Chart from the report: rpt-strategies]

The second chart, also from the report, shows the relative search volume change for the term “debt” vs. the time series of closing prices for the DJIA on the first day of trading each week. If you look closely, immediately after every bright red spike in volume change, the index values drop, and vice versa.

[Chart from the report: rpt-index]

If you are like me, the results from the study immediately bring out your skeptical side, so I figured I’d play around with Google Trends myself. I compared the Google Trends volume reports for “debt” vs. the DJIA for the months of January and April 2013. As you can see below, there is an inverse correlation:

[Chart: Google Trends volume for “debt” vs. the DJIA]

I’m still not completely sold, however, since it’s always easy to look back at data and find obvious correlations; it’s another thing entirely to predict future movements of the market. Obviously, there are many factors other than information obtained through Google searches that influence a market participant’s decisions, and an unlimited number of factors that affect stock prices.

There will never be a “silver bullet” for predicting the market, but the idea behind this study is certainly intriguing and could potentially become another tool investors use to make informed decisions about the market. For example, this study was done at the “macro” level, focusing on vague terms such as “debt” and analyzing an index that represents the broad market. What if the same concept could be applied to market segments, or even down to individual stocks whose prices are particularly affected by investor sentiment, such as Apple? Is it possible that analyzing a certain set of Apple-related search terms could provide results similar to the original study’s at the individual company level?

Maybe or maybe not… but who wants to give me some money to play with to find out? :)

Symfony2: Using kernel events like preExecute to log requests

A couple of days ago, one of our developers mentioned wanting to log all the requests that hit a specific Symfony2 controller. Back in Symfony 1.2, you could easily accomplish this with a “preExecute” function in the specific controller you wanted to log. We’d actually set something similar up before, and the code ended up looking something like this (sketched from memory, so the class and model names are illustrative):
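
<?php
// symfony 1.2-style sketch: "RequestLog" is an illustrative model name.
class apiActions extends sfActions
{
    // Runs before every action in this module, so every request gets logged.
    public function preExecute()
    {
        $log = new RequestLog();
        $log->setUri($this->getRequest()->getUri());
        $log->setCreatedAt(date('Y-m-d H:i:s'));
        $log->save();
    }

    public function executeIndex(sfWebRequest $request)
    {
        // ... the action itself ...
    }
}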


Symfony2 doesn’t have a “preExecute” hook in the same fashion as 1.2, but using the event system you can accomplish the same thing. What you’ll basically end up doing is configuring an event listener for the “kernel.controller” event, injecting the EntityManager (or the kernel), and then logging the request.

The pertinent service configuration in YAML looks something like this (the service id, bundle, and class names are placeholders):
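
# services.yml
services:
    acme.request_logger:
        class: Acme\DemoBundle\EventListener\RequestLoggerListener
        arguments: ["@doctrine.orm.entity_manager"]
        tags:
            - { name: kernel.event_listener, event: kernel.controller, method: onKernelController }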

And then the corresponding listener class looks something like (again, a sketch; the namespace and entity names are illustrative):
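
<?php
namespace Acme\DemoBundle\EventListener;

use Doctrine\ORM\EntityManager;
use Symfony\Component\HttpKernel\Event\FilterControllerEvent;

class RequestLoggerListener
{
    private $em;

    public function __construct(EntityManager $em)
    {
        $this->em = $em;
    }

    // Fires on kernel.controller, i.e. right before the controller runs.
    public function onKernelController(FilterControllerEvent $event)
    {
        $request = $event->getRequest();

        $log = new \Acme\DemoBundle\Entity\RequestLog();
        $log->setUri($request->getUri());
        $log->setCreatedAt(new \DateTime());

        $this->em->persist($log);
        $this->em->flush();
    }
}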

And that’s about it.