Friday, July 17, 2015

Stream Twitter data into HDFS using Flume

Install Flume:
# wget http://www.gtlib.gatech.edu/pub/apache/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
# tar -xzvf apache-flume-1.6.0-bin.tar.gz
# cd apache-flume-1.6.0-bin
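
A quick sanity check that the unpacked distribution runs (it should report Flume 1.6.0):

# ./bin/flume-ng version
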
# cd conf
# cp flume-env.sh.template flume-env.sh
# vi flume-env.sh   << add the lines below >>

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

# Note that the Flume conf directory is always included in the classpath.
FLUME_CLASSPATH="/home/hduser/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"
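
Note: the FLUME_CLASSPATH entry above assumes flume-sources-1.0-SNAPSHOT.jar is already under Flume's lib directory. That jar provides the com.cloudera.flume.source.TwitterSource class used later in flume.conf and is built from Cloudera's cdh-twitter-example project. A rough sketch of getting it in place (git and Maven assumed to be installed; paths are this setup's):

# git clone https://github.com/cloudera/cdh-twitter-example.git
# cd cdh-twitter-example/flume-sources
# mvn package
# cp target/flume-sources-1.0-SNAPSHOT.jar /home/hduser/apache-flume-1.6.0-bin/lib/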

# cp flume-conf.properties.template flume.conf
*************
Twitter application setup

Create an application at https://apps.twitter.com/ and generate an access token; the consumer key/secret and access token/secret from the app page go into flume.conf below.

https://apps.twitter.com/app/3389049/show




# vi flume.conf   << add the configuration below >>
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = ZR0iLmZXu1QM1ZvX0K3VlPglE
TwitterAgent.sources.Twitter.consumerSecret = CNKjEE9j4iT4Hev6P6joq7iWSIAPx0hRaRKJwGeew9gg1SRoms
TwitterAgent.sources.Twitter.accessToken = 3280478912-ieuY8LQEA3fbgbKkb92aDNTKrmxiNn43ZtsexjF
TwitterAgent.sources.Twitter.accessTokenSecret =  n5Rti4gQy4DxyGp7EFr83hx0CFwWBm4hSlkJ5vOkWfOyC
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
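
The hdfs.path above assumes the NameNode RPC address is localhost:54310; adjust it to whatever fs.defaultFS (fs.default.name on older Hadoop versions) is set to on your cluster. One way to check, assuming the Hadoop client is on the PATH:

# hdfs getconf -confKey fs.defaultFS     (should print hdfs://localhost:54310 for this setup)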


# cd bin/

# ./flume-ng agent -n TwitterAgent -c ../conf -f /home/hduser/apache-flume-1.6.0-bin/conf/flume.conf
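
To watch the agent on the console while it runs, the standard Flume log override can be appended to the same command (a convenience, not required):

# ./flume-ng agent -n TwitterAgent -c ../conf -f /home/hduser/apache-flume-1.6.0-bin/conf/flume.conf -Dflume.root.logger=INFO,console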


Browse the NameNode web UI to inspect the HDFS file system:
http://localhost:50070/dfshealth.jsp
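
The collected tweets can also be listed from the command line; the path matches hdfs.path in flume.conf (bucketed by year/month/day/hour):

# hdfs dfs -ls -R /user/flume/tweets/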




Error
15/07/18 20:45:51 WARN hdfs.HDFSEventSink: HDFS IO error
java.io.IOException: Callable timed out after 15000 ms on file: hdfs://localhost:54310/user/flume/tweets/2015/07/18/20//FlumeData.1437277535793.tmp
    at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:693)
    at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:235)
    at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:514)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)

Fixed: increased the HDFS sink timeout parameter by 15000 ms.
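
The timeout in the stack trace is governed by the HDFS sink's hdfs.callTimeout property. A minimal sketch of the corresponding flume.conf change, assuming "increased by 15000 ms" means going from the 15000 ms seen in the log to 30000 ms:

TwitterAgent.sinks.HDFS.hdfs.callTimeout = 30000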

