Apache Flume: Setting up Flume for an S3 sink - {5} Setfive

We’ve been evaluating Apache Flume over the last few weeks as part of a client project we’re working on. At a high level, our goal was to get plain text data generated by one of our applications running in a non-AWS datacenter back into Amazon S3 so that we could load it into Redshift. Reading through the Is Flume a good fit? section of their docs it perfectly describes this use case:

If you need to ingest textual log data into Hadoop/HDFS then Flume is the right fit for your problem, full stop

OK great, but what about writing into S3? It turns out you can use the HDFS sink to write into S3 if you use a “path” configuration formatted like ‘s3n://<AWS.ACCESS.KEY>:<AWS.SECRET.KEY>@<bucket.name>’.

But wait! Unfortunately Flume doesn’t ship with “batteries included” for writing to HDFS and S3 so you’ll need to grab a couple more dependencies before you can get this working. Frustratingly, you need to grab version compatible JARs of the Amazon S3 client, HDFS, and Hadoop with S3 compatibility. After flailing around downloading packages, hitting an error, downloading more JARs, and finally getting Flume working I realized there had to be a better way to replicate the process.

Enter Maven! Since we’re just grabbing down JARs it’s actually possible to use a pom.xml to describe what dependencies we need, let Maven grab the JARs, and then copy the JARs into a local folder. Here’s a working pom.xml file against Flume 1.6:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <groupId>com.setfive</groupId>
    <artifactId>flume-deps</artifactId>
    <version>1.0</version>

    <dependencyManagement>
      <dependencies>
        <dependency>
          <groupId>com.amazonaws</groupId>
          <artifactId>aws-java-sdk-bom</artifactId>
          <version>1.10.72</version>
          <type>pom</type>
          <scope>import</scope>
        </dependency>
      </dependencies>
    </dependencyManagement>

    <dependencies>
        <dependency>
            <groupId>com.amazonaws</groupId>
             <artifactId>aws-java-sdk-s3</artifactId>
            <version>1.10.72</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.5.2</version>
        </dependency> 
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-annotations</artifactId>
            <version>2.5.2</version>
        </dependency> 
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>2.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.5.2</version>
        </dependency> 
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.5.2</version>
        </dependency>                                              
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-dependency-plugin</artifactId>
                    <executions>
                        <execution>
                            <phase>process-sources</phase>

                            <goals>
                                <goal>copy-dependencies</goal>
                            </goals>

                            <configuration>
                                <outputDirectory>lib</outputDirectory>
                            </configuration>
                        </execution>
                    </executions>
            </plugin>
        </plugins>
    </build>
</project>

To use it, just run “mvn process-sources” and you’ll end up with all the JARs conveniently in a “lib/” folder in the current directory. Copy those JARs into the “lib/” folder of your Flume download and you should be off to the races. Note: These are very possibly more JARs than you need to get Flume running but as Maven dependencies this is the simplest I could come up with.

Flushing out the steps to getting a working S3 sink you should be able to do the following:

ubuntu@ip-10-0-0-148:~$ wget http://apache.claz.org/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
ubuntu@ip-10-0-0-148:~$ tar -zxvf apache-flume-1.6.0-bin.tar.gz
ubuntu@ip-10-0-0-148:~/apache-flume-1.6.0-bin$ wget https://gist.githubusercontent.com/adatta02/b8be1a256b5c0cabba73f4b6b57d635b/raw/a1f6775345917c3586926ab817ebdc664583aac1/pom.xml
ubuntu@ip-10-0-0-148:~/apache-flume-1.6.0-bin$ mvn process-sources
ubuntu@ip-10-0-0-148:~/apache-flume-1.6.0-bin$ curl https://gist.githubusercontent.com/adatta02/9e3c8ace050e7e8e6519bb3b527df4db/raw/00f2d7481410c395173c67d61c81b0d0bc8bed7e/gistfile1.txt > agent1.conf
ubuntu@ip-10-0-0-148:~/apache-flume-1.6.0-bin$ ./bin/flume-ng agent --conf conf/ --conf-file agent1.conf --name a1 -Dflume.root.logger=DEBUG,console

Before you run the last command to launch Flume you’ll need to edit “agent1.conf” to enter your Amazon token, secret key, and S3 bucket location. You’ll need to create the S3 bucket before trying to write to it with Flume. And then finally, to test that everything is working you can use netcat with the following:

ubuntu@ip-10-0-0-148:~$ echo "hello world" | nc localhost 44444
OK

Back on the terminal with Flume you should see debug data about receiving the message an a notification about an S3 upload. So what’s next? Not much, you’ll need to pick an appropriate source and then tune your HDFS and channel parameters for the amount of throughput you need.

As always, questions and comments welcome!

Questions or comments about this? Email us at contact@setfive.com.

Read more

The $4,000 Polling Loop

AWS: Looking back on a year with AWS Elastic Container Service

MySQL: Composite indexes in 5 minutes