Bitcoin: Tips for building a native Bitcoin app

I’m pretty bullish on Bitcoin, so a few months ago I set out to build a “pure” Bitcoin-related application. Specifically, I was looking to build an application that leveraged the Bitcoin network directly, without using any third party APIs or services. The goal behind avoiding third party services was to explore how difficult it is to use the Bitcoin network directly, and also to embrace Bitcoin’s decentralized nature by not relying on another company to move coins.

Conceptually, the way the Bitcoin network works is relatively straightforward. You move coins by creating transactions, which are just messages written and cryptographically signed in a specific format, and you keep your balance up to date by listening for transactions that include your addresses. Of course, the devil is in the details and there’s a dauntingly large number of them. For example, Ken Shirriff explains how to craft a transaction by hand in Bitcoins the hard way: Using the raw Bitcoin protocol, and it’s no easy read. Given that just crafting transactions involved so much code, I started researching existing open source libraries that facilitate working with Bitcoin.

After doing some research, it looked like the most popular approach to interfacing with the network directly was to run the bitcoind daemon and then make RPC calls to its exposed functions. Objectively, using RPC calls to bitcoind qualifies as a “pure” solution, but I still didn’t love it. After a bit more searching, I came across bitcoinj, which is a pure Java library for working with Bitcoin.

Unlike bitcoind, bitcoinj is a library, so it’s designed to be embedded in other codebases, and it supports simplified payment verification (SPV), which allows it to operate without downloading the entire blockchain (~25GB as of today). On top of this, it’s written in Java, so it’s easy to use from Scala, something I’d been looking to experiment with.

So what are some key takeaways from using bitcoinj?

  • Test on the TestNet: Although debugging new code with live money would be thrilling, the TestNet allows you to develop against the Bitcoin network without using “live” coins. With bitcoinj, if you instantiate TestNet3Params you’ll connect to the TestNet.
  • TestNet Faucets: Testing is pretty tough without some coins, but luckily you can use faucets to grab some. http://faucet.xeno-genesis.com/ worked well for me. Just a heads up, TestNet transactions take a while to get broadcast and included in the blockchain, so be patient.
  • Extend WalletAppKit: WalletAppKit is a utility class which helps bootstrap everything you need for an SPV wallet and will save you some headaches. I started off trying to hand roll everything and quickly switched over to extending WalletAppKit.
  • Chain API: Since the Chain API can be used on the TestNet it’s useful for testing your own applications. Specifically, Bitcoinj treats sending and receiving coins to “yourself” weirdly so it’s more effective to use Chain to test transactions.
  • Coin.parseCoin: Sorting out how many satoshis you want to send is surprisingly difficult since sendCoins accepts a Coin object. Luckily, Coin.parseCoin will accept a BTC value and return the corresponding Coin object.
  • Blockchain downloads and wallet start up slow down development: Even in SPV mode, it still takes a non-trivial amount of time to “catch up” if your client hasn’t been running for a while. On top of that, starting up the wallet takes a few minutes since it has to discover and connect to peers. What this translates to is that resuming development after being away for a while is slow, and the “edit, compile, test” loop isn’t as tight as usual.
  • It’s real money: This should go without saying, but once you’re running on the live Bitcoin network you’re playing with real money, so be careful. If your keys get out, anyone will be able to move your coins out of your addresses. Be careful what you check into git, and if you’re building something public definitely take a second to lock down your server.

Anyway, this was my first time building something Bitcoin related and it was a positive experience. The project is still private but I’ll definitely share it once it’s released. As always, questions or comments are welcome!

Friday Links: Comets, .NET news, and FF Dev

In case you missed some of it, we’ve got a rundown of the crazy stuff from last week! The Europeans landed on a comet, Microsoft is open sourcing .NET, and there’s a new variety of Firefox just for developers. Oh, and we found an awesome list of UI kits!

PHP: Using Gearman for a MapReduce inspired workflow

Over the last few weeks we’ve been using Gearman to help us do some real time stream processing. In production, what we’ve basically been doing is reading messages off an Amazon Kinesis stream, creating Gearman jobs for anything that’s computationally expensive, and then gathering up the processed data for a batched insert into Amazon Redshift, which also runs as a Gearman job. Conceptually, this workflow is reasonably similar to how MapReduce works, where a series of input jobs is transformed by “mappers” and the results are then collected in a “reduce” step.
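
To make that shape concrete, here’s a rough sketch of the two worker roles using the PHP Gearman extension. This isn’t our production code; the function names, payload format, and batch size are just placeholders.

```php
<?php
// Rough sketch of the "map" and "reduce" roles with the PHP Gearman extension.
// Not production code; function names, payloads, and the batch size are placeholders.

// "Map" worker: does the computationally expensive, per-message work and hands the
// result to the reduce step as another background job. Runs in its own process.
$mapWorker = new GearmanWorker();
$mapWorker->addServer('127.0.0.1', 4730);
$mapWorker->addFunction('mapEvent', function (GearmanJob $job) {
    $event     = json_decode($job->workload(), true);
    $processed = array('id' => $event['id'], 'score' => strlen($event['payload']));

    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    $client->doBackground('reduceEvents', json_encode($processed));
});
// while ($mapWorker->work());

// "Reduce" worker: a single process that buffers results and flushes them in batches,
// which is where a batched Redshift insert would happen.
$buffer       = array();
$reduceWorker = new GearmanWorker();
$reduceWorker->addServer('127.0.0.1', 4730);
$reduceWorker->addFunction('reduceEvents', function (GearmanJob $job) use (&$buffer) {
    $buffer[] = json_decode($job->workload(), true);
    if (count($buffer) >= 500) {
        // a batched INSERT into Redshift would go here
        $buffer = array();
    }
});
// while ($reduceWorker->work());
```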

From a practical point of view, using Gearman like this offers some interesting benefits:

  • Adding additional “map” capacity is relatively straightforward since you can just add additional machines that connect to the Gearman server.
  • Developing and testing the “map” and “reduce” functionality is easy since nothing is shared and you can run the code directly, independently of Gearman.
  • In our experience so far, the Gearman server can handle a high volume of jobs – we’ve pushed ~300 jobs/sec without a problem.
  • Since Gearman clients exist for dozens of languages, you could write different pieces of the system in whatever language fits best.

Overview

OK, so how does all of this actually work? For the purposes of a demonstration, let’s assume you’ve been tasked with scraping the META keywords and descriptions from a few hundred thousand sites and counting up word frequencies across all the sites. Assuming you were doing this in straight PHP, you’d end up with code that looks something like this.
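
The snippet from the original post isn’t reproduced here, but a minimal sketch of that sequential approach, assuming a plain file_get_contents and DOMDocument loop over a list of bare hostnames, might look like this:

```php
<?php
// Sketch of the naive, sequential approach: fetch each URL one at a time, pull out
// the META keywords/description, and tally word frequencies in a single array.
// The input/output file names mirror the ones used later in the post.

$urls   = file('100sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$counts = array();

foreach ($urls as $url) {
    $html = @file_get_contents('http://' . trim($url));
    if ($html === false) {
        continue;
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name !== 'keywords' && $name !== 'description') {
            continue;
        }

        foreach (preg_split('/[\s,]+/', strtolower($meta->getAttribute('content'))) as $word) {
            if ($word === '') {
                continue;
            }
            $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
        }
    }
}

arsort($counts);
file_put_contents('nogearman_keyword_results.json', json_encode($counts));
```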

The problem is that since you’re making the requests sequentially, scraping a significant number of URLs is going to take an impractically long time. What we really want to do is fetch the URLs in parallel, extract the META keywords, and then combine all that data into a single data structure.

To keep the amount of code down, I used the Symfony2 Console component, Guzzle, and Monolog to provide infrastructure around the project. Walking through the files of interest:

  • GearmanCommand.php: Command to execute either the “node” or the “master” Gearman workers.
  • StartScrapeCommand.php: Command to create the Gearman jobs that start the scrapers.
  • Master.php: The code to gather up all the extracted keywords and maintain a running count.
  • Node.php: Worker code to extract the META keywords from a given URL (both worker roles are sketched below).
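
To give a feel for how the pieces fit together, here’s a simplified sketch of the two worker roles, not the actual Node.php and Master.php code. The “processUrl” and “countKeywords” function names match the queues discussed later in the post; the extraction logic is heavily trimmed down.

```php
<?php
// Simplified sketch of the two worker roles (not the real Node.php / Master.php).

// Node: registers "processUrl", fetches a URL, extracts META keywords, and hands the
// words off to the master as a background "countKeywords" job.
$node = new GearmanWorker();
$node->addServer('127.0.0.1', 4730);
$node->addFunction('processUrl', function (GearmanJob $job) {
    $url  = trim($job->workload());
    $html = @file_get_contents('http://' . $url);

    $words = array();
    if ($html !== false) {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            if (strtolower($meta->getAttribute('name')) === 'keywords') {
                $words = preg_split('/[\s,]+/', strtolower($meta->getAttribute('content')));
            }
        }
    }

    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);
    $client->doBackground('countKeywords', json_encode($words));
});

// Master: registers "countKeywords" and keeps a running tally of word frequencies.
$totals = array();
$master = new GearmanWorker();
$master->addServer('127.0.0.1', 4730);
$master->addFunction('countKeywords', function (GearmanJob $job) use (&$totals) {
    foreach (json_decode($job->workload(), true) as $word) {
        if ($word === '') { continue; }
        $totals[$word] = isset($totals[$word]) ? $totals[$word] + 1 : 1;
    }
    // Rewrites the results file on every update, purely for the sake of a short sketch.
    file_put_contents('keyword_results.json', json_encode($totals));
});

// Each worker runs its own blocking loop in its own process, e.g.:
// while ($node->work());   or   while ($master->work());
```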

Setup

Taking this for a spin is straightforward enough: fire up an Ubuntu EC2 instance, install Gearman along with its PHP extension, and pull down the project with its Composer dependencies.

OK, now that everything is set up, let’s run the normal PHP implementation.

Looks like about 10-12 seconds to process 100 URLs. Not terrible, but assuming linear growth that means processing 100,000 URLs would take around three hours, which is a bit painful. You can verify it worked by looking at the “bin/nogearman_keyword_results.json” file.

Now, let’s look at the Gearman version, which is just as straightforward to run: start the Gearman workers and then kick off the scrape jobs with the console commands described below.

You’ll eventually get output from the “master” when it finishes, along with the total elapsed time. It’ll probably come in somewhere around 15 seconds again, because we’re still just using a single process to fetch the URLs.

Party in parallel

But now here’s where things get interesting: we can start adding multiple “worker” processes to do some of the computation in parallel. In my experience, the easiest way to handle this is with Supervisor, since it makes starting and stopping groups of processes easy and also handles collecting their output. Copy the provided config file into place, restart Supervisor, and then verify the workers are running.

And now, you’ll want to run “php application.php setfive:gearman master” in one terminal and in another run “php application.php setfive:start-scraper 100sites.txt” to kick off the jobs.

Boom! Much faster. We’re still only doing 100 URLs so the effect of processing in parallel isn’t that dramatic. Again, you can check out the results by looking at “bin/keyword_results.json”.

The effects of using multiple workers will be more apparent when you’ve got a larger number of URLs to scrape. Inside the “bin” directory there’s a file named “quantcast_site_lists.tar.gz” which has site lists of different sizes up to the full 1 million from Quantcast.

I ran some tests on the lists using different numbers of workers and the results are below.

              0 Workers     10 Workers    25 Workers
100 URLs      12 sec.       12 sec.       5 sec.
1000 URLs     170 sec.      34 sec.       33 sec.
5000 URLs     1174 sec.     195 sec.      183 sec.
10000 URLs    2743 sec.     445 sec.      424 sec.

One thing to note: if you check the job queues on the Gearman server (with gearadmin --status, for example) and notice that “processUrl” has zero jobs but there’s a lot waiting for “countKeywords”, you’re actually saturating the “reducer” and adding additional worker nodes in Supervisor isn’t going to increase your speed. Testing on an m3.small, I was seeing this happen with 25 workers.

Another powerful feature of Gearman is that it makes running jobs on remote hosts really easy. To add a “remote” worker to the job server, you’d just need to start a second machine, update the IP address in Base.php, and use the same Supervisor config to start a group of workers. They’d automatically register with your Gearman server and start processing jobs.
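
For example, a worker running on a second box only needs the job server’s address; the IP below is a placeholder for whatever you’ve configured in Base.php.

```php
<?php
// A worker on a "remote" box only needs to point at the central Gearman server.
// The host here is a placeholder for the IP configured in Base.php.
$worker = new GearmanWorker();
$worker->addServer('10.0.0.12', 4730);   // IP of the machine running gearmand
$worker->addFunction('processUrl', function (GearmanJob $job) {
    // same Node.php logic as on the primary machine
});
while ($worker->work());
```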

Anyway, as always questions and comments appreciated and all the code is on GitHub.

AWS: Using Kinesis with PHP for real time stream processing

Over the last few weeks we’ve been working with one of our clients to build out a real time data processing application. At a high level, the system ingests page view data, processes it in real time, and then loads it into a database backend. In terms of scale, the system would need to start off processing roughly 30,000 events per minute at peak, with the capability to scale out to 100,000 events per minute fairly easily. In addition, we wanted the data to become available to query “reasonably quickly” so that we could iterate quickly on how we were processing data.

To kick things off, we began by surveying the available tools to ingest, process, and ultimately query the data. On the data warehouse side, we’d already had some positive experiences with Amazon Redshift, so it was a natural choice to keep using it moving forward. In terms of ingestion and processing, we decided to move forward with Kinesis and Gearman. The fully managed nature of Kinesis made it the most appealing choice, and Gearman’s strong PHP support would let us develop workers in a language everyone was comfortable with.

Our final implementation is fairly straightforward. An Elastic Load Balancer handles all incoming HTTP requests, which are routed to any number of front end machines. These servers don’t do any computation; they just fire off messages into a Kinesis stream. On the backend, we have a consumer per Kinesis stream shard that creates Gearman jobs for pre-processing as well as for the Redshift data ingestion. Although it’s conceptually simple, there are a couple of “gotchas” that we ran into implementing this system:
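
As a rough sketch, the front end piece boils down to a single putRecord call with the AWS SDK for PHP; the stream name, region, and payload fields below are placeholders, not the actual configuration.

```php
<?php
// Sketch of the front end hand-off: each page view is fired into a Kinesis stream
// with a single putRecord call. Stream name, region, and payload fields are
// placeholders; credentials are assumed to come from the environment or an IAM role.
require 'vendor/autoload.php';

use Aws\Kinesis\KinesisClient;

$kinesis = KinesisClient::factory(array('region' => 'us-east-1'));

$event = array(
    'url'       => isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '/',
    'userAgent' => isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '',
    'ts'        => time(),
);

$kinesis->putRecord(array(
    'StreamName'   => 'pageviews',
    'Data'         => json_encode($event),
    // a random-ish partition key spreads records across the stream's shards
    'PartitionKey' => md5($event['url'] . mt_rand()),
));
```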

Slow HTTP requests are a killer: The Kinesis API works entirely over HTTP, so any time you want to “put” something into the stream it requires an HTTP request. The problem with this is that if you’re making these requests in real time in a high traffic environment, you run the risk of locking up your php-fpm workers if the network latency to Kinesis starts to increase. We saw this happen first hand: everything would be fine and then all of a sudden the latency across the ELB would skyrocket when the latency to Kinesis increased. To avoid this, you need to make the Kinesis request in the background.
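
The post doesn’t spell out how that background hand-off works, but since Gearman is already in the stack, one hypothetical approach is to queue the payload as a cheap local background job and let a separate worker process make the actual call to Kinesis:

```php
<?php
// Hypothetical sketch: instead of calling Kinesis inline, the web request hands the
// payload to a local Gearman background job and a separate worker process makes the
// slow HTTP call to Kinesis. Function, stream, and region names are placeholders.
require 'vendor/autoload.php';

use Aws\Kinesis\KinesisClient;

// --- in the web request: a cheap hand-off to the local gearmand ---
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('putToKinesis', json_encode(array(
    'url' => isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '/',
    'ts'  => time(),
)));

// --- in a long-running worker process: the actual request to Kinesis ---
$kinesis = KinesisClient::factory(array('region' => 'us-east-1'));

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('putToKinesis', function (GearmanJob $job) use ($kinesis) {
    $kinesis->putRecord(array(
        'StreamName'   => 'pageviews',
        'Data'         => $job->workload(),
        'PartitionKey' => md5(mt_rand() . $job->workload()),
    ));
});
// while ($worker->work());
```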

SSL certificate verification is REALLY slow: Kinesis is only available over HTTPS, so by default the PHP SDK (I assume others as well) will perform SSL certificate verification every time you use a new client. If you’re making Kinesis requests inside your php-fpm workers, that means you’ll be verifying SSL certificates on every request, which turns out to be really slow. You can disable this in the official SDK using the “curl.options” parameter and passing in “CURLOPT_SSL_VERIFYHOST” and “CURLOPT_SSL_VERIFYPEER”.
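
With the version 2 PHP SDK, that configuration looks roughly like the following; double-check the option names against the SDK release you’re actually running.

```php
<?php
// Sketch: passing cURL options through the (v2) AWS SDK for PHP client factory to
// skip per-request SSL verification, as described above. This trades away
// man-in-the-middle protection for speed, so weigh that carefully.
require 'vendor/autoload.php';

use Aws\Kinesis\KinesisClient;

$kinesis = KinesisClient::factory(array(
    'region'       => 'us-east-1',
    'curl.options' => array(
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_SSL_VERIFYHOST => false,
    ),
));
```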

There’s no “batch” add operation: Interestingly, Apache Kafka, which Kinesis is conceptually similar to, supports batch operations, but unfortunately Kinesis doesn’t. You have to make an HTTP request for every message you’re adding to the stream. What this means is that even if you’re queuing the messages in the background, you’ll still need to loop through them all, firing off an HTTP request for each one.

Your consumer needs to be fast: In your consumer, you’ll basically end up with code that looks like this: https://gist.github.com/adatta02/842531b3fe93097ee030. Because Kinesis shard iterators are only valid for 5 minutes, you’ll need to be cognizant of how long the inner for loop takes to run. Each “getRecords” call can return a max of 10,000 records, so you’ll need to be able to process 10k records in less than 5 minutes. Our solution was to offload all the actual processing to Gearman jobs.
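
The linked gist has the real version, but the rough shape of a per-shard consumer, with placeholder stream and shard names, is something like this:

```php
<?php
// Rough shape of a per-shard consumer (the linked gist has the real version).
// It reads records with a shard iterator, offloads processing to Gearman, and keeps
// following NextShardIterator. Stream name, shard ID, and job name are placeholders.
require 'vendor/autoload.php';

use Aws\Kinesis\KinesisClient;

$kinesis = KinesisClient::factory(array('region' => 'us-east-1'));
$gearman = new GearmanClient();
$gearman->addServer('127.0.0.1', 4730);

$iterator = $kinesis->getShardIterator(array(
    'StreamName'        => 'pageviews',
    'ShardId'           => 'shardId-000000000000',
    'ShardIteratorType' => 'TRIM_HORIZON',
))->get('ShardIterator');

while (true) {
    $result = $kinesis->getRecords(array(
        'ShardIterator' => $iterator,
        'Limit'         => 10000,
    ));

    foreach ($result->get('Records') as $record) {
        // Keep this loop fast: all real processing happens in Gearman workers.
        $gearman->doBackground('processEvent', $record['Data']);
    }

    // Shard iterators expire after 5 minutes, so always move on to the next one.
    $iterator = $result->get('NextShardIterator');
    sleep(1); // stay under the per-shard getRecords rate limit
}
```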

Anyway, we’re still fairly new to using Kinesis, so I’m sure we’ll learn more as the system runs in production. A few things have already been positive: it makes testing new code locally easy since you can just “tap” into the stream, scaling up looks like it just means adding additional shards, and since it’s managed we’ve got one less thing to worry about.

As always, questions and comments welcome!