30 Oct 2012

Packetpig on Amazon Elastic MapReduce

Packetpig can now be used with Amazon's Elastic MapReduce (EMR) for Big Data Security Analytics.

We've added some sugar around the EMR API to help you start running packet captures through our Packetpig User Defined Functions (UDFs) as easily as possible.

Let's start with a very basic example, pumping a set of captures through the supplied pig/examples/binning.pig.

'binning.pig' uses the PacketLoader UDF to extract IP and TCP/UDP attributes from each packet in each capture. If you look in the script, you'll see the format returned in the LOAD statement.
We want to extract all of these and store them in a CSV file for later analysis.
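
In rough outline, the script does something like this (a minimal sketch: the field list is abbreviated, and $pcap and $output stand in for the read and output locations; see pig/examples/binning.pig for the real LOAD schema):

packets = LOAD '$pcap'
    USING com.packetloop.packetpig.loaders.pcap.packet.PacketLoader()
    AS (ts:long, ip_src:chararray, ip_dst:chararray,
        tcp_sport:int, tcp_dport:int);  -- abbreviated; the real script declares every field

-- Dump the extracted attributes as comma-separated values for later analysis.
STORE packets INTO '$output' USING PigStorage(',');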

First, let's set up our credentials. Set these environment variables in your terminal.

export AWS_ACCESS_KEY_ID=[your key]
export AWS_SECRET_ACCESS_KEY=[your key]
export EMR_KEYPAIR=[name of key you create in ec2 console]
export EMR_KEYPAIR_PATH=[path to saved key you just created]
export EC2_REGION=us-west-1  # optional; defaults to us-east-1

Now, run the job:

$ lib/run_emr -o s3://your-bucket/output/  \
-l s3://your-bucket/logs/ \
-f s3://packetpig/pig/examples/binning.pig \
-r s3://your-bucket/captures/ \
-w
...
Created job flow j-33QXAKHCEOXUO

Type lib/run_emr --help for more information. For now, we specify the output dir with -o, the log dir with -l, the Pig script with -f, and the read dir with -r.
-w specifies that we'd like to watch the job's progress.

After a while, you'll see the bootstrap process begin, some packages will be installed, and then Hadoop will start.

At this stage, an EC2 node has been spawned to run the Hadoop master and it's also where the mappers and reducers will run in this example.

It's boring to watch logs; it'd be nicer if we could see more.
$ lib/run_emr -e
j-33QXAKHCEOXUO RUNNING david's pig jobflow
        Setup Pig      COMPLETED                   22s
        binning.pig    RUNNING                   3485s

$ lib/run_emr -x j-33QXAKHCEOXUO
Connect to http://localhost:9100/jobtracker.jsp - hit ctrl-c to stop the ssh forwarder

Do as it says and hit localhost:9100 in your browser. You can then look at the Hadoop job tracker, which is useful for gauging how well you've tweaked your node type and node count.

In my case, I'm looking at 22.64% of the mappers completed after 1h 14m. That's a bit slow!
The default is to run 1 m1.large instance == 2 cores.

$ lib/run_emr -o s3://your-packetpig-output/ \
-l s3://your-packetpig-logs/ \
-f s3://packetpig/pig/examples/binning.pig \
-r s3://your-bucket/captures/ \
-w -k 20 -t m1.xlarge
Created job flow j-38QAABHC3RXO7

Now we're looking at 20 m1.xlarge nodes == 80 cores.

If you change your mind about the job, you can easily terminate it like so:

$ lib/run_emr -d j-38QAABHC3RXO7

All the included Packetpig scripts in pig/examples are mirrored in s3://packetpig/pig/examples.
If you want to run your own, just change the -f argument to point to wherever your script lives (see the sketch below).
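
For instance, here's a minimal sketch of a custom script that counts packets per TCP destination port, reusing the abbreviated PacketLoader schema from the sketch above ($pcap and $output again stand in for the read and output locations):

-- port_counts.pig (sketch): how many packets hit each destination port?
packets = LOAD '$pcap'
    USING com.packetloop.packetpig.loaders.pcap.packet.PacketLoader()
    AS (ts:long, ip_src:chararray, ip_dst:chararray,
        tcp_sport:int, tcp_dport:int);
by_port = GROUP packets BY tcp_dport;  -- bucket packets by destination port
counts = FOREACH by_port GENERATE group AS port, COUNT(packets) AS n;
STORE counts INTO '$output' USING PigStorage(',');

Upload it somewhere like s3://your-bucket/pig/port_counts.pig and pass that path to -f.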

Here's a video showing how you can use Packetpig and EMR to find Zero Days in past traffic.
