Sunday, June 21, 2015

Descriptive Vs. Predictive Vs. Prescriptive

Descriptive Analytics (EDA)
The purpose of descriptive analytics is simply to summarize and tell you what happened. For example: number of posts, mentions, fans, followers, page views, kudos, +1s, check-ins, pins, etc. There are literally thousands of these metrics; it's pointless to list them, but they are all just simple event counters. Other descriptive analytics may be the results of simple arithmetic operations, such as share of voice, average response time, % index, average number of replies per post, etc.

Predictive Analytics

The purpose of predictive analytics is NOT to tell you what will happen in the future. It cannot do that. In fact, no analytics can do that. Predictive analytics can only forecast what might happen in the future, because all predictive analytics are probabilistic in nature.

The essence of predictive analytics, in general, is that we use existing data to build a model. Then we use the model to predict data that doesn’t yet exist. So predictive analytics is all about using data you have to predict data that you don’t have.
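As a minimal sketch of this idea, the snippet below fits a least-squares trend line to data we already have (hypothetical weekly mention counts) and uses it to forecast a data point we don't have yet. The data and variable names are illustrative, not from any real campaign.

```python
# Fit y = a*x + b by least squares over existing data points,
# then extrapolate one step past the data we have.

def fit_trend(ys):
    """Return slope a and intercept b of the least-squares line."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

mentions = [120, 135, 150, 160, 178]   # hypothetical weekly counts
a, b = fit_trend(mentions)
forecast = a * len(mentions) + b       # forecast for week 6
```

Note that the forecast is probabilistic in spirit: the line describes a tendency in past data, not a guarantee about the future.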

Prescriptive Analytics
Prescriptive analytics not only predicts a possible future, it predicts multiple futures based on the actions the decision maker might take. A prescriptive model is therefore, by definition, also predictive, and as such it must be validated too.
A prescriptive model can be viewed as a combination of multiple predictive models running in parallel, one for each possible input action.
The goal of most prescriptive analytics is to guide the decision maker so that the decisions he makes will ultimately lead to the target outcome.

In prescriptive analytics, we also build a predictive model of the data.
The predictive model must have two added components in order to be prescriptive:
1. Actionable: The data consumers must be able to take action based on the predicted outcome of the model.

2. Feedback system: The model must have a feedback system that tracks the adjusted outcomes
resulting from the actions taken. This means the predictive model must be smart enough to learn the complex relationship between the users' actions and the adjusted outcomes through the feedback data.

1. Descriptive Analytics: Compute descriptive statistics to summarize the data. The majority of social analytics fall into this category.
2. Predictive Analytics: Build a statistical model that uses existing data to predict data that we don't have. Examples of predictive analytics include trend lines, influence scoring, sentiment analysis, etc.
3. Prescriptive Analytics: Build a prescriptive model that uses not only the existing data, but also the action and feedback data, to guide the decision maker to a desired outcome. Because prescriptive models must be actionable and have a feedback data stream, social analytics are rarely prescriptive.

Identifying Outliers

Identifying outliers is often described as separating the signal from the noise.
The business analytics process has three steps:
- Framing
- Analysis
- Reporting
Each step itself consists of multiple sub-steps. Analysis includes:
- Data collection and understanding
- Data preparation
- Analysis and model building
Under data preparation we have filtering, which is where we identify outliers. Outliers are data points that differ significantly from the rest of the sample. Leaving such data points in can distort the results,
hence finding outliers is a very important activity.

Let me introduce how to find an outlier.
Take the following sample data points (already ordered from lowest to highest):
10, 12, 13, 15, 16, 20, 35
Median: the (7+1)/2 = 4th value, which is 15
Lower quartile Q1: 12
Upper quartile Q3: 20

Interquartile range: IQR = Q3 - Q1 = 20 - 12 = 8
Lower fence: Q1 - 1.5(IQR) = 12 - 1.5(8) = 0; anything below it is an outlier
Upper fence: Q3 + 1.5(IQR) = 20 + 1.5(8) = 32; anything above it is an outlier
There is no value lower than 0 in the above sample data, but there is a value greater than 32, namely 35.
Accordingly, our sample data has one data point, 35, that can be considered an outlier.
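The fence calculation above can be sketched in a few lines of Python. The quartile picks here use simple index positions, which match this 7-point sample; production code would use a library quantile function.

```python
# Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

def iqr_outliers(data):
    s = sorted(data)
    n = len(s)
    q1 = s[n // 4]          # simple positional quartiles; fine for 7 points
    q3 = s[(3 * n) // 4]
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [x for x in s if x < lower or x > upper]

sample = [10, 12, 13, 15, 16, 20, 35]
print(iqr_outliers(sample))   # [35]
```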
Understand the importance of retaining outliers
Some outliers should be omitted from the data set because they can lead to bad results. However, we should be careful, because an outlier can also be a critical element in deriving the results.
Use a qualitative assessment when determining whether to "throw out" outliers.
There are two graphical techniques for identifying outliers, the scatter plot and the box plot, and an analytical procedure, Grubbs' test, for detecting them.

Filtering Outliers

The general strategy is to omit outliers from the sample, but not every outlier is a candidate for omission.
As stated in my previous DQ 6.1, understanding the importance of retaining or discarding an outlier is very important.
We should use a qualitative assessment when determining whether to throw out an outlier.

Scientific experiments generally contain sensitive data, and hence an outlier can reveal a new trend or insight in the experiment.

Let me share a situation which clarifies that an outlier is not always a candidate for omission.

Consider a clinical trial conducted on a hen on a farm. Here is the hen's weekly egg production over nine weeks:

10, 11, 14, 15, 12, 13, 30, 9, 16

In the above data, the 30 eggs in the 7th week looks like an outlier. But assuming it is not due to a measurement error, we should not omit it: it may indicate a significant success in the experiment, namely that the drug used in the 7th week gave better results.

Data Cleansing

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from the data in order to improve its quality.

Data quality discussions typically involve the following services:
- Parsing
- Standardization
- Validation
- Verification
- Matching

There are tools on the market that implement the above processes, for example IBM InfoSphere Information Server for Data Quality and Microsoft SQL Server 2016.
SQL Server 2016 includes Data Quality Services (DQS), a computer-assisted process that analyzes how well data conforms to the knowledge in a knowledge base. DQS categorizes data under the following five tabs:
- Suggested
- New
- Invalid
- Corrected
- Correct

Consider the case of big data, where the integrity aspect of data is a very hot topic, now known as the veracity of data.
In some cases the veracity of data can be maintained at the origin of the data itself by enforcing field integrity.
For example:
Email field: the email should be valid. It should be validated for special characters such as @.
Zip code field: the zip code should have a certain number of digits. If the system is integrated with master data where each country's codes are stored, we can enforce the integrity of a zip code by verifying it against the master data.
Mobile number field: the mobile number can be validated by introducing an OTP mechanism. This ensures that the contact information provided is correct.
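A minimal sketch of enforcing field integrity at the point of entry might look like the following. The validation rules are hypothetical simplifications; a real system would check zip codes against the country master data rather than a fixed digit count, and email validation in practice is more involved than one regular expression.

```python
import re

def valid_email(s):
    # Require exactly one '@' with a non-empty local part and a
    # domain part containing a dot (a deliberately simple rule).
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", s) is not None

def valid_zip(s, digits=5):
    # Fixed digit count; in practice the count would come from
    # the per-country master data.
    return s.isdigit() and len(s) == digits

print(valid_email("jane@example.com"))  # True
print(valid_email("not-an-email"))      # False
print(valid_zip("90210"))               # True
```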

Another approach to data cleansing involves relationship integrity, known as association rules. This concept was first introduced in market basket analysis.
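To make the association-rule idea concrete, here is a small sketch computing the support and confidence of a rule over a few hypothetical transactions (the items and numbers are invented for illustration):

```python
# Support of an itemset = fraction of transactions containing it.
# Confidence of A -> B = support(A and B) / support(A).

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # 0.5
print(confidence({"bread"}, {"butter"}))  # 0.5 / 0.75 = 2/3
```

A cleansing process could flag records that violate a high-confidence rule as candidates for review.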

The above field integrity and relationship integrity mostly apply to structured data collection,
but what about unstructured data?
Consider the case of social-media-based customer profiling. Big data centers are adopting the concept of Master Data Management (MDM), and linkage between the data warehouse and the MDM project is a good starting point for providing cleaned data. For example, if the master data project centers on a person, then extracting life events from Twitter or Facebook, such as a change in relationship status or a birth announcement, enriches that master information.