Thursday, September 17, 2015

ZooKeeper

ZooKeeper is a distributed, open source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, groups, and naming.


Design goal: ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace, which is organised similarly to a standard file system.

The ZooKeeper implementation puts a premium on high performance, high availability, and strictly ordered access.

ZooKeeper is replicated: like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.



The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.


ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions such as synchronization primitives.

Nodes and ephemeral nodes
Each node in the ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory; these nodes are called znodes.

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active. When the session ends, the ephemeral znodes are deleted.

Conditional updates and watches:
ZooKeeper supports the concept of watches. A client can set a watch on a znode. The watch is triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client receives a local notification.

Guarantees: ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

- Sequential consistency: updates from a client are applied in the order in which they were sent
- Atomicity: updates either succeed or fail; there are no partial results
- Single system image: a client sees the same view of the service regardless of the server it connects to
- Reliability: once an update has been applied, it persists from that time forward until a client overwrites it
- Timeliness: the client's view of the system is guaranteed to be up to date within a certain time bound

Implementation
The ZooKeeper service can run in two modes: standalone and replicated.
- Standalone mode is for testing purposes
- Replicated mode is for production systems

ZooKeeper runs in replicated mode on a cluster of machines called an ensemble. ZooKeeper achieves high availability through replication and can provide service as long as a majority of the machines in the ensemble are up.
For example, in a five-node ensemble any two machines can fail and the service still works, because a majority of three remains.
Note that a six-node ensemble can also tolerate only two machines failing: if three machines fail, the remaining three do not constitute a majority of the six. For this reason, it is usual to have an odd number of machines in an ensemble.
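To sanity-check the majority arithmetic, here is a minimal sketch in R (the language used elsewhere on this blog; the function names are my own):

quorum    <- function(n) floor(n/2) + 1   # smallest strict majority of n servers
tolerance <- function(n) n - quorum(n)    # how many machines can fail before service stops
tolerance(5)   # 2
tolerance(6)   # 2 -- the sixth machine buys no extra fault tolerance, hence odd ensembles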


ZooKeeper is very simple: all it has to do is ensure that every modification to the tree of znodes is replicated to a majority of the ensemble.

ZooKeeper uses a protocol called Zab that runs in two phases, which may be repeated indefinitely:

Phase 1: Leader election
The machines in an ensemble go through a process of electing a distinguished member, called the leader. The other machines are called followers. This phase is finished once a majority of followers have synchronized their state with the leader.

Phase 2: Atomic broadcast
All write requests are forwarded to the leader, which broadcasts the update to the followers. When a majority have persisted the change, the leader commits the update and the client gets a response saying the update succeeded. The protocol for achieving consensus is designed to be atomic, so a change either succeeds or fails.

If the leader fails, the remaining machines hold another leader election and continue as before with the new leader. If the old leader later recovers, it then starts as a follower.

All machines in the ensemble write updates to disk before updating their in-memory copies of the znode tree. Read requests may be serviced by any machine, and because they involve only a lookup in memory, they are very fast.

ZooKeeper client and ZooKeeper server




Tuesday, July 28, 2015

Uni-Variate Analysis

Week 1 - Univariate Analysis Assignment
Bhupendra Mishra
Monday, July 27, 2015
a. Generate a random sample of 500 observations from the ecommerce data using R. Save it as a data frame.
ecommerce <- read.delim("C:/Users/Bhupendra Mishra/Desktop/donotbackup/BridgeSchoolMgmt/Bridge School Mgmt/Module2/ecommerce.txt")
ecomm_samp<-ecommerce[sample(1:nrow(ecommerce),500),]
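Note that sample() draws a different sample on each run; seeding the random number generator first makes the draw reproducible (a minimal sketch; the seed value is arbitrary):

set.seed(123)   # any fixed seed gives a reproducible sample
ecomm_samp <- ecommerce[sample(1:nrow(ecommerce), 500), ]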
b. Generate univariate profiles of the data, using the summary() function
attach(ecomm_samp)
summary(ecomm_samp)
##   churn_status session_length_seconds session_count     event_count     
##  Churned:227   Min.   :     0         Min.   :  1.00   Min.   :    1.00 
##  Stayed :273   1st Qu.:  1497         1st Qu.:  4.00   1st Qu.:   40.25 
##                Median :  7549         Median : 16.50   Median :  200.00 
##                Mean   : 30605         Mean   : 62.92   Mean   :  623.54 
##                3rd Qu.: 32575         3rd Qu.: 80.50   3rd Qu.:  746.25 
##                Max.   :616183         Max.   :695.00   Max.   :10931.00 
##  closed_session_event_count open_session_event_count
##  Min.   :   0.0             Min.   :   0.0         
##  1st Qu.:   6.0             1st Qu.:   6.0         
##  Median :  28.0             Median :  26.5         
##  Mean   : 106.2             Mean   : 105.8         
##  3rd Qu.: 127.5             3rd Qu.: 128.5         
##  Max.   :1717.0             Max.   :1714.0         
##  quest_completed_event_count store_purchase_event_count  active_days  
##  Min.   :   0.00             Min.   :  0.000            Min.   : 1.00 
##  1st Qu.:   4.00             1st Qu.:  0.000            1st Qu.: 2.75 
##  Median :  21.00             Median :  0.000            Median : 7.00 
##  Mean   : 127.51             Mean   :  5.126            Mean   :15.13 
##  3rd Qu.:  79.75             3rd Qu.:  3.000            3rd Qu.:24.00 
##  Max.   :4419.00             Max.   :243.000            Max.   :55.00
Inferences:
- churn_status is the only categorical variable; the rest are numerical variables
- The churn status of customers is split almost 50:50
- Based on the summary above, the data are not normally distributed
- The mean is greater than the median for every numeric variable, hence the data are positively skewed
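The skew claim can be checked numerically with base R alone; a minimal sketch (a positive mean-minus-median gap suggests right skew):

num_vars <- sapply(ecomm_samp, is.numeric)   # drop the churn_status factor
sapply(ecomm_samp[num_vars], function(x) mean(x) - median(x))   # > 0 suggests positive skew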
c. Generate pairwise correlation plots of numeric variables and bar charts of categorical or factor variables.
pairs(~session_length_seconds+session_count+closed_session_event_count+open_session_event_count+event_count+quest_completed_event_count+store_purchase_event_count+active_days, main="Pairwise correlation plot")
Inference: a positive correlation exists among all the variables.

barplot(table(churn_status))
vars<-c("session_length_seconds","session_count","closed_session_event_count", "open_session_event_count", "event_count","quest_completed_event_count","store_purchase_event_count","active_days" )
cor(ecomm_samp[vars])
##                             session_length_seconds session_count
## session_length_seconds                   1.0000000     0.8503394
## session_count                            0.8503394     1.0000000
## closed_session_event_count               0.9430646     0.9473013
## open_session_event_count                 0.9418005     0.9474239
## event_count                              0.9596580     0.8924715
## quest_completed_event_count              0.8380808     0.6582946
## store_purchase_event_count               0.5607137     0.4329665
## active_days                              0.6299118     0.8381357
##                             closed_session_event_count
## session_length_seconds                       0.9430646
## session_count                                0.9473013
## closed_session_event_count                   1.0000000
## open_session_event_count                     0.9998514
## event_count                                  0.9574356
## quest_completed_event_count                  0.7595074
## store_purchase_event_count                   0.5387210
## active_days                                  0.7450125
##                             open_session_event_count event_count
## session_length_seconds                     0.9418005   0.9596580
## session_count                              0.9474239   0.8924715
## closed_session_event_count                 0.9998514   0.9574356
## open_session_event_count                   1.0000000   0.9578605
## event_count                                0.9578605   1.0000000
## quest_completed_event_count                0.7605956   0.8848637
## store_purchase_event_count                 0.5372247   0.5439314
## active_days                                0.7434909   0.6958241
##                             quest_completed_event_count
## session_length_seconds                        0.8380808
## session_count                                 0.6582946
## closed_session_event_count                    0.7595074
## open_session_event_count                      0.7605956
## event_count                                   0.8848637
## quest_completed_event_count                   1.0000000
## store_purchase_event_count                    0.4610465
## active_days                                   0.4731775
##                             store_purchase_event_count active_days
## session_length_seconds                       0.5607137   0.6299118
## session_count                                0.4329665   0.8381357
## closed_session_event_count                   0.5387210   0.7450125
## open_session_event_count                     0.5372247   0.7434909
## event_count                                  0.5439314   0.6958241
## quest_completed_event_count                  0.4610465   0.4731775
## store_purchase_event_count                   1.0000000   0.3898231
## active_days                                  0.3898231   1.0000000





require(car)
scatterplotMatrix(ecomm_samp[vars])
Inferences:

- The green line shows the fitted regression line, whereas the red line traces the best possible (smoothed) relation with an interval around it

Friday, July 17, 2015

Stream Twitter data into HDFS using Flume

Install Flume:
# wget http://www.gtlib.gatech.edu/pub/apache/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
 tar -xzvf apache-flume-1.6.0-bin.tar.gz
cd apache-flume-1.6.0-bin
cd conf
cp flume-env.sh.template flume-env.sh
vi flume-env.sh   # add the lines below

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

# Note that the Flume conf directory is always included in the classpath.
FLUME_CLASSPATH="/home/hduser/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"

#cp flume-conf.properties.template flume.conf
*************
Twitter application setup

https://apps.twitter.com/

https://apps.twitter.com/app/3389049/show




#vi flume.conf
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = ZR0iLmZXu1QM1ZvX0K3VlPglE
TwitterAgent.sources.Twitter.consumerSecret = CNKjEE9j4iT4Hev6P6joq7iWSIAPx0hRaRKJwGeew9gg1SRoms
TwitterAgent.sources.Twitter.accessToken = 3280478912-ieuY8LQEA3fbgbKkb92aDNTKrmxiNn43ZtsexjF
TwitterAgent.sources.Twitter.accessTokenSecret =  n5Rti4gQy4DxyGp7EFr83hx0CFwWBm4hSlkJ5vOkWfOyC
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100


#cd bin/

# ./flume-ng agent -n TwitterAgent -c conf -f /home/hduser/apache-flume-1.6.0-bin/conf/flume.conf 


Browse the NameNode and the HDFS file system
http://localhost:50070/dfshealth.jsp




Error
15/07/18 20:45:51 WARN hdfs.HDFSEventSink: HDFS IO error
java.io.IOException: Callable timed out after 15000 ms on file: hdfs://localhost:54310/user/flume/tweets/2015/07/18/20//FlumeData.1437277535793.tmp
at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:693)
at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:235)
at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:514)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)

Fixed: increased the HDFS sink timeout parameter (the calls were timing out after 15000 ms).
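For reference, a hedged example of the change in flume.conf, assuming the standard HDFS sink hdfs.callTimeout property (the 60000 ms value is illustrative; tune it to the cluster):

TwitterAgent.sinks.HDFS.hdfs.callTimeout = 60000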


Thursday, July 9, 2015

Selling analytics (HR Analytics)

Predictive Analytics in HR

Business objective: to clearly demonstrate the interaction of business objectives and workforce strategies, in order to determine a full picture of likely outcomes.

Based on research, global organizations with workforce analytics and workforce planning outperform all other organizations by 30%.

Outlining the three-step process for predictive analytics in the HR function:
1. Hindsight: gather data through reporting
2. Insight: make sense of the data through analysis and monitoring
3. Foresight: develop predictive models

KPIs/Metrics
What is generally measured:
- Employee Engagement
- Performance Ratings
- Retention/Turnover
- % of employees with a development plan
- Readiness for jobs
- Internal hire percentage
- Diversity of workforce
- Level of expertise/competence

HR Analytics:
What could be measured:
- Recruitment
- Retention
- Performance and Career Management
- Training
- Compensation and Benefits
- Workforce
- Organizational effectiveness

Benefits
1. Turnover modeling: predicting future turnover in business units or specific functions and geographies by looking at factors such as commute time, time since the last role change, and performance over time
2. Targeted retention: identifying employees at high risk of churning in the future and focusing retention efforts on the few critical people
3. Risk management: profiling candidates at high risk of leaving prematurely or of performing below standard
4. Talent forecasting: predicting which new hires, based on their profiles, are likely to be high fliers, and then moving them onto fast-track programs

Overall Return on Investment
- Retention of key performers and their associated customers and revenue
- Reduced compensation overpayment
- Improved HR staff productivity, resulting in the need for fewer HR staff
- Reduced risk of litigation due to non-compliance




Thursday, July 2, 2015

Tree map and Geospatial map

The term mashup started in the music world but was rapidly adopted on the web, where it means an application that combines data from different sources into a whole new application.

Geographic data is one of the most common types of data available, and dealing with geographic data without a map is like going into the mountains without, well, a map.
Geospatial analysis helps in identifying "hot spots" where diseases or problems are occurring.
The article below describes how Domino's is leveraging geospatial analytics to understand its customers and store performance:
http://www.zdnet.com/article/dominos-pizza-gets-customer-specific-using-geospatial-analytics/

Treemapping is a method for displaying hierarchical data using nested rectangles.
A treemap displays hierarchical (tree-structured) data as a set of nested rectangles: each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches.
A common use case these days is analyzing disk space; there is a utility called TreeSize that is used to monitor disk space on servers.
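As a quick illustration, here is a minimal sketch in R, assuming the treemap package is installed (the folder names and sizes are made up):

library(treemap)
# toy disk-usage data: one rectangle per folder, area proportional to size
dirs <- data.frame(folder  = c("logs", "db", "backups", "tmp"),
                   size_gb = c(120, 480, 960, 40))
treemap(dirs, index = "folder", vSize = "size_gb", title = "Disk usage by folder")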

Here is a very interesting piece of software, developed by MIT students, which is based on the treemap:

http://pantheon.media.mit.edu/treemap/country_exports/IN/all/-4000/2010/H15/pantheon

For further in-depth reading on treemaps, from Northwestern University:
http://www.cs.uic.edu/~wilkinson/Publications/c&rtrees.pdf


Reference:
https://en.wikipedia.org/wiki/Treemapping
http://searchbusinessanalytics.techtarget.com/news/1507131/Data-mashups-meet-business-intelligence-Bashups-explained


Predicting Wine Quality Analytically


Abstract: The wine industry shows growth in the overall consumption of wine. The price of wine depends on two critical factors:
- Wine appreciation by wine tasters
- Certification and quality assessment through physicochemical tests
We have two datasets: red wine and white wine.
We performed exploratory data analysis using standard functions in R. During the analysis we identified the outliers in different variables using box plots, and we used the cor() function to understand the correlation between quality and the rest of the variables.
Finally, we devised a model using the linear regression technique to predict the quality of wine.
The red wine data is used to illustrate the process; the same steps can be applied to the white wine data.

Project Goals:
- Explore the data in the dataset and list all the standard summary statistics
- Investigate the distribution of the variables graphically to determine the outliers
- Devise a method to handle outliers
- Investigate the correlation between quality and the remaining properties
- Suggest methods for the final “Quality” determination



Looking into the dataset (Red Wine)
> red.wine.data <-read.delim(file.choose(), header=T)
> dim(red.wine.data)
[1] 1599   12
> names(red.wine.data)
 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"     
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"            
 [9] "pH"                   "sulphates"            "alcohol"              "quality"
> str(red.wine.data)
'data.frame':       1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
> attributes(red.wine.data)
$names
 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"     
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"            
 [9] "pH"                   "sulphates"            "alcohol"              "quality"            
$class
[1] "data.frame"
> summary(red.wine.data)
 fixed.acidity   volatile.acidity  citric.acid    residual.sugar     chlorides     
 Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900   Min.   :0.01200 
 1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900   1st Qu.:0.07000 
 Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200   Median :0.07900 
 Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539   Mean   :0.08747 
 3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600   3rd Qu.:0.09000 
 Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500   Max.   :0.61100 
 free.sulfur.dioxide total.sulfur.dioxide    density             pH          sulphates    
 Min.   : 1.00       Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300 
 1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500 
 Median :14.00       Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200 
 Mean   :15.87       Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581 
 3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300 
 Max.   :72.00       Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000 
    alcohol         quality    
 Min.   : 8.40   Min.   :3.000 
 1st Qu.: 9.50   1st Qu.:5.000 
 Median :10.20   Median :6.000 
 Mean   :10.42   Mean   :5.636
 3rd Qu.:11.10   3rd Qu.:6.000 
 Max.   :14.90   Max.   :8.000

Identifying outliers

> fa<-red.wine.data$fixed.acidity
> boxplot(fa)
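The same check can be run over every column at once; a minimal base-R sketch (the panel layout is my choice):

> par(mfrow = c(3, 4))   # one panel per variable, 12 in total
> for (v in names(red.wine.data)) boxplot(red.wine.data[[v]], main = v)
> par(mfrow = c(1, 1))   # reset the layout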





Handling outliers

Removed the outlier values and stored the data in a new data frame called red.wine:

> red.wine <- subset(red.wine.data, fixed.acidity < 12 & volatile.acidity < 1 & citric.acid < 0.8 & residual.sugar < 5 & chlorides < 0.2 & free.sulfur.dioxide < 50 & total.sulfur.dioxide < 150 & density < 1 & pH < 3.5 & sulphates < 1.5 & alcohol < 14 & quality < 9)

> summary(red.wine)

fixed.acidity    volatile.acidity  citric.acid     residual.sugar    chlorides     
 Min.   : 5.000   Min.   :0.1200   Min.   :0.0000   Min.   :0.900   Min.   :0.01200 
 1st Qu.: 7.200   1st Qu.:0.3800   1st Qu.:0.1100   1st Qu.:1.900   1st Qu.:0.07000 
 Median : 8.000   Median :0.5100   Median :0.2600   Median :2.100   Median :0.07900 
 Mean   : 8.276   Mean   :0.5093   Mean   :0.2657   Mean   :2.243   Mean   :0.08112 
 3rd Qu.: 9.100   3rd Qu.:0.6200   3rd Qu.:0.4000   3rd Qu.:2.500   3rd Qu.:0.08900 
 Max.   :11.900   Max.   :0.9800   Max.   :0.7300   Max.   :4.800   Max.   :0.19400 
 free.sulfur.dioxide total.sulfur.dioxide    density             pH          sulphates    
 Min.   : 1.00       Min.   :  6.00       Min.   :0.9901   Min.   :2.870   Min.   :0.3300 
 1st Qu.: 7.00       1st Qu.: 21.50       1st Qu.:0.9956   1st Qu.:3.220   1st Qu.:0.5400 
 Median :13.00       Median : 37.00       Median :0.9966   Median :3.300   Median :0.6100 
 Mean   :15.39       Mean   : 44.58       Mean   :0.9965   Mean   :3.293   Mean   :0.6407 
 3rd Qu.:21.00       3rd Qu.: 59.00       3rd Qu.:0.9975   3rd Qu.:3.380   3rd Qu.:0.7100 
 Max.   :48.00       Max.   :149.00       Max.   :0.9998   Max.   :3.490   Max.   :1.3600 
    alcohol         quality    
 Min.   : 8.50   Min.   :3.000 
 1st Qu.: 9.50   1st Qu.:5.000 
 Median :10.10   Median :6.000 
 Mean   :10.37   Mean   :5.655 
 3rd Qu.:11.00   3rd Qu.:6.000 
 Max.   :13.60   Max.   :8.000  
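The cutoffs above were read off the box plots by hand. As an alternative (not what was done here), the conventional 1.5 x IQR whisker rule can be applied to every column at once; a minimal sketch:

> iqr_keep <- function(x) {   # TRUE for values inside the 1.5*IQR whiskers
+     q <- quantile(x, c(0.25, 0.75))
+     x >= q[1] - 1.5 * diff(q) & x <= q[2] + 1.5 * diff(q)
+ }
> red.wine.iqr <- red.wine.data[Reduce(`&`, lapply(red.wine.data, iqr_keep)), ]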





Correlation: Principal component analysis

By doing principal component analysis and plotting the result, we can easily identify the principal components and the correlations among the variables.
# number of observations (rows)
> temp_red.wine <- nrow(red.wine)
# PCA on the scaled variables
> pcx <- prcomp(red.wine, scale. = TRUE)
# plotting using biplot; xlabs replaces the row labels with dots
> biplot(pcx, xlabs = rep('.', temp_red.wine))

What is interesting about the plot is that, judging by the first two principal components, quality is strongly correlated with alcohol content and sulphates.
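This reading of the biplot can be cross-checked directly against the correlation of each variable with quality:

> sort(cor(red.wine)[, "quality"], decreasing = TRUE)   # alcohol and sulphates should rank near the top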


Predicting wine quality using a linear regression line

> al <- red.wine.data$alcohol   # note: the fit below uses the full dataset
> ql <- red.wine.data$quality
> plot(ql ~ al)
> mean(ql)
[1] 5.636023
> abline(h = mean(ql))
> model1 <- lm(ql ~ al)
> model1

Call:
lm(formula = ql ~ al)

Coefficients:
(Intercept)           al 
     1.8750       0.3608 

> abline(model1, col = "red")
> plot(model1)