Thursday, September 17, 2015

ZooKeeper

ZooKeeper is a distributed, open source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, groups, and naming.


Design goal: ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace, which is organised similarly to a standard file system.

The ZooKeeper implementation puts a premium on high performance, high availability, and strictly ordered access.

ZooKeeper is replicated: like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.



The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.


ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions such as synchronization primitives.

Nodes and ephemeral nodes
Each node in the ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory; these nodes are called znodes.

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active. When the session ends, the ephemeral znodes are deleted.

Conditional updates and watches:
ZooKeeper supports the concept of watches. A client can set a watch on a znode. The watch is triggered and removed when the znode changes. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client receives a local notification.

Guarantees: ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:

- Sequential consistency: updates from a client are applied in the order in which they were sent
- Atomicity: updates either succeed or fail; there are no partial results
- Single system image: a client sees the same view of the service regardless of the server it connects to
- Reliability: once an update has been applied, it persists from that time forward until a client overwrites it
- Timeliness: the client's view of the system is guaranteed to be up to date within a certain time bound

Implementation
The ZooKeeper service can run in two modes: standalone and replicated.
- Standalone mode is for testing purposes
- Replicated mode is for production systems

ZooKeeper runs in replicated mode on a cluster of machines called an ensemble. ZooKeeper achieves high availability through replication and can provide service as long as a majority of the machines in the ensemble are up.
For example, in a five-node ensemble any two machines can fail and the service still works, because a majority of three remains.
Note that a six-node ensemble can also tolerate only two machines failing: if three machines fail, the remaining three do not constitute a majority of the six. For this reason, it is usual to have an odd number of machines in an ensemble.
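To sanity-check the majority arithmetic, here is a minimal sketch in R (the language used elsewhere on this blog; the function names are my own):

quorum    <- function(n) floor(n/2) + 1   # smallest strict majority of n servers
tolerance <- function(n) n - quorum(n)    # how many machines can fail before service stops
tolerance(5)   # 2
tolerance(6)   # 2 -- the sixth machine buys no extra fault tolerance, hence odd ensembles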


ZooKeeper is very simple: all it has to do is ensure that every modification to the tree of znodes is replicated to a majority of the ensemble.

ZooKeeper uses a protocol called Zab that runs in two phases, which may be repeated indefinitely:

Phase 1: Leader election
The machines in an ensemble go through a process of electing a distinguished member, called the leader. The other machines are called followers. This phase is finished once a majority of followers have synchronized their state with the leader.

Phase 2: Atomic broadcast
All write requests are forwarded to the leader, which broadcasts the update to the followers. When a majority have persisted the change, the leader commits the update and the client gets a response saying the update succeeded. The protocol for achieving consensus is designed to be atomic, so a change either succeeds or fails.

If the leader fails, the remaining machines hold another leader election and continue as before with the new leader. If the old leader later recovers, it then starts as a follower.

All machines in the ensemble write updates to disk before updating their in-memory copies of the znode tree. Read requests may be serviced by any machine, and because they involve only a lookup in memory, they are very fast.

ZooKeeper client and ZooKeeper server




Tuesday, July 28, 2015

Uni-Variate Analysis

Week 1 - Univariate Analysis Assignment
Bhupendra Mishra
Monday, July 27, 2015
a. Generate a random sample of 500 observations from the ecommerce data using R. Save it as a data frame.
ecommerce <- read.delim("C:/Users/Bhupendra Mishra/Desktop/donotbackup/BridgeSchoolMgmt/Bridge School Mgmt/Module2/ecommerce.txt")
ecomm_samp<-ecommerce[sample(1:nrow(ecommerce),500),]
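Note that sample() draws a different sample on each run; seeding the random number generator first makes the draw reproducible (a minimal sketch; the seed value is arbitrary):

set.seed(123)   # any fixed seed gives a reproducible sample
ecomm_samp <- ecommerce[sample(1:nrow(ecommerce), 500), ]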
b. Generate univariate profiles of the data, using the summary() function
attach(ecomm_samp)
summary(ecomm_samp)
##   churn_status session_length_seconds session_count     event_count     
##  Churned:227   Min.   :     0         Min.   :  1.00   Min.   :    1.00 
##  Stayed :273   1st Qu.:  1497         1st Qu.:  4.00   1st Qu.:   40.25 
##                Median :  7549         Median : 16.50   Median :  200.00 
##                Mean   : 30605         Mean   : 62.92   Mean   :  623.54 
##                3rd Qu.: 32575         3rd Qu.: 80.50   3rd Qu.:  746.25 
##                Max.   :616183         Max.   :695.00   Max.   :10931.00 
##  closed_session_event_count open_session_event_count
##  Min.   :   0.0             Min.   :   0.0         
##  1st Qu.:   6.0             1st Qu.:   6.0         
##  Median :  28.0             Median :  26.5         
##  Mean   : 106.2             Mean   : 105.8         
##  3rd Qu.: 127.5             3rd Qu.: 128.5         
##  Max.   :1717.0             Max.   :1714.0         
##  quest_completed_event_count store_purchase_event_count  active_days  
##  Min.   :   0.00             Min.   :  0.000            Min.   : 1.00 
##  1st Qu.:   4.00             1st Qu.:  0.000            1st Qu.: 2.75 
##  Median :  21.00             Median :  0.000            Median : 7.00 
##  Mean   : 127.51             Mean   :  5.126            Mean   :15.13 
##  3rd Qu.:  79.75             3rd Qu.:  3.000            3rd Qu.:24.00 
##  Max.   :4419.00             Max.   :243.000            Max.   :55.00
Inferences:
- churn_status is the only categorical variable; the rest are numerical variables
- The churn status of customers is split almost 50:50
- Based on the summary above, the data are not normally distributed
- The mean is greater than the median for every numeric variable, hence the data are positively skewed
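The skew claim can be checked numerically with base R alone; a minimal sketch (a positive mean-minus-median gap suggests right skew):

num_vars <- sapply(ecomm_samp, is.numeric)   # drop the churn_status factor
sapply(ecomm_samp[num_vars], function(x) mean(x) - median(x))   # > 0 suggests positive skew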
c. Generate pairwise correlation plots of numeric variables and bar charts of categorical or factor variables.
pairs(~session_length_seconds+session_count+closed_session_event_count+open_session_event_count+event_count+quest_completed_event_count+store_purchase_event_count+active_days, main="Pairwise correlation plot")
Inference: a positive correlation exists among all the variables.

barplot(table(churn_status))
vars<-c("session_length_seconds","session_count","closed_session_event_count", "open_session_event_count", "event_count","quest_completed_event_count","store_purchase_event_count","active_days" )
cor(ecomm_samp[vars])
##                             session_length_seconds session_count
## session_length_seconds                   1.0000000     0.8503394
## session_count                            0.8503394     1.0000000
## closed_session_event_count               0.9430646     0.9473013
## open_session_event_count                 0.9418005     0.9474239
## event_count                              0.9596580     0.8924715
## quest_completed_event_count              0.8380808     0.6582946
## store_purchase_event_count               0.5607137     0.4329665
## active_days                              0.6299118     0.8381357
##                             closed_session_event_count
## session_length_seconds                       0.9430646
## session_count                                0.9473013
## closed_session_event_count                   1.0000000
## open_session_event_count                     0.9998514
## event_count                                  0.9574356
## quest_completed_event_count                  0.7595074
## store_purchase_event_count                   0.5387210
## active_days                                  0.7450125
##                             open_session_event_count event_count
## session_length_seconds                     0.9418005   0.9596580
## session_count                              0.9474239   0.8924715
## closed_session_event_count                 0.9998514   0.9574356
## open_session_event_count                   1.0000000   0.9578605
## event_count                                0.9578605   1.0000000
## quest_completed_event_count                0.7605956   0.8848637
## store_purchase_event_count                 0.5372247   0.5439314
## active_days                                0.7434909   0.6958241
##                             quest_completed_event_count
## session_length_seconds                        0.8380808
## session_count                                 0.6582946
## closed_session_event_count                    0.7595074
## open_session_event_count                      0.7605956
## event_count                                   0.8848637
## quest_completed_event_count                   1.0000000
## store_purchase_event_count                    0.4610465
## active_days                                   0.4731775
##                             store_purchase_event_count active_days
## session_length_seconds                       0.5607137   0.6299118
## session_count                                0.4329665   0.8381357
## closed_session_event_count                   0.5387210   0.7450125
## open_session_event_count                     0.5372247   0.7434909
## event_count                                  0.5439314   0.6958241
## quest_completed_event_count                  0.4610465   0.4731775
## store_purchase_event_count                   1.0000000   0.3898231
## active_days                                  0.3898231   1.0000000





require(car)
scatterplotMatrix(ecomm_samp[vars])
Inferences:

- The green line shows the fitted regression line, whereas the red line traces the best possible (smoothed) relation with an interval around it

Friday, July 17, 2015

Stream Twitter data into HDFS using Flume

Install Flume:
# wget http://www.gtlib.gatech.edu/pub/apache/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
 tar -xzvf apache-flume-1.6.0-bin.tar.gz
cd apache-flume-1.6.0-bin
cd conf
cp flume-env.sh.template flume-env.sh
vi flume-env.sh   # add the lines below

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"

# Note that the Flume conf directory is always included in the classpath.
FLUME_CLASSPATH="/home/hduser/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"

#cp flume-conf.properties.template flume.conf
*************
Twitter application setup

https://apps.twitter.com/

https://apps.twitter.com/app/3389049/show




#vi flume.conf
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = ZR0iLmZXu1QM1ZvX0K3VlPglE
TwitterAgent.sources.Twitter.consumerSecret = CNKjEE9j4iT4Hev6P6joq7iWSIAPx0hRaRKJwGeew9gg1SRoms
TwitterAgent.sources.Twitter.accessToken = 3280478912-ieuY8LQEA3fbgbKkb92aDNTKrmxiNn43ZtsexjF
TwitterAgent.sources.Twitter.accessTokenSecret =  n5Rti4gQy4DxyGp7EFr83hx0CFwWBm4hSlkJ5vOkWfOyC
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100


#cd bin/

# ./flume-ng agent -n TwitterAgent -c conf -f /home/hduser/apache-flume-1.6.0-bin/conf/flume.conf 


Browse the NameNode and the HDFS file system
http://localhost:50070/dfshealth.jsp




Error
15/07/18 20:45:51 WARN hdfs.HDFSEventSink: HDFS IO error
java.io.IOException: Callable timed out after 15000 ms on file: hdfs://localhost:54310/user/flume/tweets/2015/07/18/20//FlumeData.1437277535793.tmp
at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:693)
at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:235)
at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:514)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)

Fixed: increased the HDFS sink timeout parameter (the calls were timing out after 15000 ms).
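For reference, a hedged example of the change in flume.conf, assuming the standard HDFS sink hdfs.callTimeout property (the 60000 ms value is illustrative; tune it to the cluster):

TwitterAgent.sinks.HDFS.hdfs.callTimeout = 60000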


Thursday, July 9, 2015

Selling analytics (HR Analytics)

Predictive Analytics in HR

Business objective: to clearly demonstrate the interaction of business objectives and workforce strategies, in order to determine a full picture of likely outcomes.

Based on research, global organizations with workforce analytics and workforce planning outperform all other organizations by 30%.

Outlining the three-step process for predictive analytics in the HR function:
1. Hindsight: gather data through reporting
2. Insight: make sense of the data through analysis and monitoring
3. Foresight: develop predictive models

KPIs/Metrics
What is generally measured:
- Employee Engagement
- Performance Ratings
- Retention/Turnover
- % of employees with a development plan
- Readiness for jobs
- Internal hire percentage
- Diversity of workforce
- Level of expertise/competence

HR Analytics:
What could be measured:
- Recruitment
- Retention
- Performance and Career Management
- Training
- Compensation and Benefits
- Workforce
- Organizational effectiveness

Benefits
1. Turnover modeling: predicting future turnover in business units or specific functions and geographies by looking at factors such as commute time, time since the last role change, and performance over time
2. Targeted retention: identifying employees at high risk of churning in the future and focusing retention efforts on the few critical people
3. Risk management: profiling candidates at high risk of leaving prematurely or of performing below standard
4. Talent forecasting: predicting which new hires, based on their profiles, are likely to be high fliers, and then moving them onto fast-track programs

Overall Return on Investment
- Retention of key performers and their associated customers and revenue
- Reduced compensation overpayment
- Improved HR staff productivity, resulting in the need for fewer HR staff
- Reduced risk of litigation due to non-compliance




Thursday, July 2, 2015

Tree map and Geospatial map

The term mashup started in the music world but was rapidly adopted on the web, where it means an application that combines data from different sources into a whole new application.

Geographic data is one of the most common types of data available, and dealing with geographic data without a map is like going into the mountains without, well, a map.
Geospatial analysis helps in identifying "hot spots" where diseases or problems are occurring.
The article below describes how Domino's is leveraging geospatial analytics to understand its customers and store performance:
http://www.zdnet.com/article/dominos-pizza-gets-customer-specific-using-geospatial-analytics/

Treemapping is a method for displaying hierarchical data using nested rectangles.
A treemap displays hierarchical (tree-structured) data as a set of nested rectangles: each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches.
A common use case these days is analyzing disk space; there is a utility called TreeSize that is used to monitor disk space on servers.
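As a quick illustration, here is a minimal sketch in R, assuming the treemap package is installed (the folder names and sizes are made up):

library(treemap)
# toy disk-usage data: one rectangle per folder, area proportional to size
dirs <- data.frame(folder  = c("logs", "db", "backups", "tmp"),
                   size_gb = c(120, 480, 960, 40))
treemap(dirs, index = "folder", vSize = "size_gb", title = "Disk usage by folder")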

Here is a very interesting piece of software, developed by MIT students, which is based on the treemap:

http://pantheon.media.mit.edu/treemap/country_exports/IN/all/-4000/2010/H15/pantheon

For further in-depth reading on treemaps, from Northwestern University:
http://www.cs.uic.edu/~wilkinson/Publications/c&rtrees.pdf


Reference:
https://en.wikipedia.org/wiki/Treemapping
http://searchbusinessanalytics.techtarget.com/news/1507131/Data-mashups-meet-business-intelligence-Bashups-explained


Predicting Wine Quality Analytically


Abstract: The wine industry shows growth in the overall consumption of wine. The price of wine depends on two critical factors:
- Wine appreciation by wine tasters
- Certification and quality assessment through physicochemical tests
We have two datasets: red wine and white wine.
We performed exploratory data analysis using standard functions in R. During the analysis we identified the outliers in different variables using box plots, and we used the cor() function to understand the correlation between quality and the rest of the variables.
Finally, we devised a model using the linear regression technique to predict the quality of wine.
The red wine data is used to illustrate the process; the same steps can be applied to the white wine data.

Project Goals:
- Explore the data in the dataset and list all the standard summary statistics
- Investigate the distribution of the variables graphically to determine the outliers
- Devise a method to handle outliers
- Investigate the correlation between quality and the remaining properties
- Suggest methods for the final “Quality” determination



Looking into the dataset (Red Wine)
> red.wine.data <-read.delim(file.choose(), header=T)
> dim(red.wine.data)
[1] 1599   12
> names(red.wine.data)
 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"     
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"            
 [9] "pH"                   "sulphates"            "alcohol"              "quality"
> str(red.wine.data)
'data.frame':       1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
> attributes(red.wine.data)
$names
 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"          "residual.sugar"     
 [5] "chlorides"            "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"            
 [9] "pH"                   "sulphates"            "alcohol"              "quality"            
$class
[1] "data.frame"
> summary(red.wine.data)
 fixed.acidity   volatile.acidity  citric.acid    residual.sugar     chlorides     
 Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900   Min.   :0.01200 
 1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900   1st Qu.:0.07000 
 Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200   Median :0.07900 
 Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539   Mean   :0.08747 
 3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600   3rd Qu.:0.09000 
 Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500   Max.   :0.61100 
 free.sulfur.dioxide total.sulfur.dioxide    density             pH          sulphates    
 Min.   : 1.00       Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300 
 1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500 
 Median :14.00       Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200 
 Mean   :15.87       Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581 
 3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300 
 Max.   :72.00       Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000 
    alcohol         quality    
 Min.   : 8.40   Min.   :3.000 
 1st Qu.: 9.50   1st Qu.:5.000 
 Median :10.20   Median :6.000 
 Mean   :10.42   Mean   :5.636
 3rd Qu.:11.10   3rd Qu.:6.000 
 Max.   :14.90   Max.   :8.000

Identifying outliers

> fa<-red.wine.data$fixed.acidity
> boxplot(fa)
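The same check can be run over every column at once; a minimal base-R sketch (the panel layout is my choice):

> par(mfrow = c(3, 4))   # one panel per variable, 12 in total
> for (v in names(red.wine.data)) boxplot(red.wine.data[[v]], main = v)
> par(mfrow = c(1, 1))   # reset the layout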





Handling outliers

Removed the outlier values and stored the data in a new data frame called red.wine:

> red.wine <- subset(red.wine.data, fixed.acidity < 12 & volatile.acidity < 1 & citric.acid < 0.8 & residual.sugar < 5 & chlorides < 0.2 & free.sulfur.dioxide < 50 & total.sulfur.dioxide < 150 & density < 1 & pH < 3.5 & sulphates < 1.5 & alcohol < 14 & quality < 9)

> summary(red.wine)

fixed.acidity    volatile.acidity  citric.acid     residual.sugar    chlorides     
 Min.   : 5.000   Min.   :0.1200   Min.   :0.0000   Min.   :0.900   Min.   :0.01200 
 1st Qu.: 7.200   1st Qu.:0.3800   1st Qu.:0.1100   1st Qu.:1.900   1st Qu.:0.07000 
 Median : 8.000   Median :0.5100   Median :0.2600   Median :2.100   Median :0.07900 
 Mean   : 8.276   Mean   :0.5093   Mean   :0.2657   Mean   :2.243   Mean   :0.08112 
 3rd Qu.: 9.100   3rd Qu.:0.6200   3rd Qu.:0.4000   3rd Qu.:2.500   3rd Qu.:0.08900 
 Max.   :11.900   Max.   :0.9800   Max.   :0.7300   Max.   :4.800   Max.   :0.19400 
 free.sulfur.dioxide total.sulfur.dioxide    density             pH          sulphates    
 Min.   : 1.00       Min.   :  6.00       Min.   :0.9901   Min.   :2.870   Min.   :0.3300 
 1st Qu.: 7.00       1st Qu.: 21.50       1st Qu.:0.9956   1st Qu.:3.220   1st Qu.:0.5400 
 Median :13.00       Median : 37.00       Median :0.9966   Median :3.300   Median :0.6100 
 Mean   :15.39       Mean   : 44.58       Mean   :0.9965   Mean   :3.293   Mean   :0.6407 
 3rd Qu.:21.00       3rd Qu.: 59.00       3rd Qu.:0.9975   3rd Qu.:3.380   3rd Qu.:0.7100 
 Max.   :48.00       Max.   :149.00       Max.   :0.9998   Max.   :3.490   Max.   :1.3600 
    alcohol         quality    
 Min.   : 8.50   Min.   :3.000 
 1st Qu.: 9.50   1st Qu.:5.000 
 Median :10.10   Median :6.000 
 Mean   :10.37   Mean   :5.655 
 3rd Qu.:11.00   3rd Qu.:6.000 
 Max.   :13.60   Max.   :8.000  
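The cutoffs above were read off the box plots by hand. As an alternative (not what was done here), the conventional 1.5 x IQR whisker rule can be applied to every column at once; a minimal sketch:

> iqr_keep <- function(x) {   # TRUE for values inside the 1.5*IQR whiskers
+     q <- quantile(x, c(0.25, 0.75))
+     x >= q[1] - 1.5 * diff(q) & x <= q[2] + 1.5 * diff(q)
+ }
> red.wine.iqr <- red.wine.data[Reduce(`&`, lapply(red.wine.data, iqr_keep)), ]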





Correlation: Principal component analysis

By doing principal component analysis and plotting the result, we can easily identify the principal components and the correlations among the variables.
# number of observations (rows)
> temp_red.wine <- nrow(red.wine)
# PCA on the scaled variables
> pcx <- prcomp(red.wine, scale. = TRUE)
# plotting using biplot; xlabs replaces the row labels with dots
> biplot(pcx, xlabs = rep('.', temp_red.wine))

What is interesting about the plot is that, judging by the first two principal components, quality is strongly correlated with alcohol content and sulphates.
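This reading of the biplot can be cross-checked directly against the correlation of each variable with quality:

> sort(cor(red.wine)[, "quality"], decreasing = TRUE)   # alcohol and sulphates should rank near the top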


Predicting wine quality using a linear regression line

> al <- red.wine.data$alcohol   # note: the fit below uses the full dataset
> ql <- red.wine.data$quality
> plot(ql ~ al)
> mean(ql)
[1] 5.636023
> abline(h = mean(ql))
> model1 <- lm(ql ~ al)
> model1

Call:
lm(formula = ql ~ al)

Coefficients:
(Intercept)           al 
     1.8750       0.3608 

> abline(model1, col = "red")
> plot(model1)