Sunday, June 21, 2015

Identifying Outliers

Identifying outliers is essentially the task of separating the signal from the noise.
The business analytics process consists of three steps:
- Framing
- Analysis
- Reporting
Each step itself passes through multiple sub-steps. The analysis step includes:
- Data collection and understanding
- Data preparation
- Analysis and Model building
Under data preparation we have filtering, where we need to identify outliers. Outliers are data points that differ significantly from the rest of the sample. Such data points can distort the results.
Hence finding outliers is a very important activity.

Let me introduce how to find an outlier.
Take the following sample data points (already sorted from lowest to highest):
10, 12, 13, 15, 16, 20, 35
Median: the (7+1)/2 = 4th value, which is 15
Lower quartile Q1: 12
Upper quartile Q3: 20

Interquartile range: IQR = Q3 - Q1 = 20 - 12 = 8
Lower fence: Q1 - 1.5(IQR) = 12 - 1.5(8) = 0, so any value below 0 is an outlier
Upper fence: Q3 + 1.5(IQR) = 20 + 1.5(8) = 32, so any value above 32 is an outlier
No value in the sample is lower than 0, but 35 is greater than 32.
Accordingly, our sample data contains one data point, 35, which can be considered an outlier.
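To make the calculation concrete, here is a minimal Python sketch of the same fence computation (the function name iqr_outliers is my own; it uses the median-of-halves quartile convention so the numbers match the hand calculation above):

```python
import numpy as np

def iqr_outliers(values):
    """Flag points outside the Q1 - 1.5*IQR and Q3 + 1.5*IQR fences.
    Q1/Q3 are computed as medians of the lower/upper halves, matching
    the hand calculation above."""
    data = np.sort(np.asarray(values, dtype=float))
    n = len(data)
    half = n // 2
    q1 = float(np.median(data[:half]))            # median of lower half
    q3 = float(np.median(data[half + (n % 2):]))  # median of upper half
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [float(x) for x in data if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

print(iqr_outliers([10, 12, 13, 15, 16, 20, 35]))
# (0.0, 32.0, [35.0]) -- 35 is the only outlier, as computed by hand above
```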
Understand the importance of retaining outliers
While some outliers should be omitted from the data set because they can lead to bad results, we should be careful: an outlier can be a critical element in deriving the results.
Use a qualitative assessment when deciding whether to "throw out" outliers.
There are two graphical techniques for identifying outliers, the scatter plot and the box plot, and an analytical procedure, Grubbs' test, for detecting them.
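Of these, Grubbs' test is easy to sketch in code. Here is a minimal version (the helper name grubbs_test is mine; note that the test assumes the sample is approximately normally distributed):

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.
    Assumes the sample is approximately normally distributed."""
    data = np.asarray(values, dtype=float)
    n = len(data)
    mean, sd = data.mean(), data.std(ddof=1)
    deviations = np.abs(data - mean)
    g = deviations.max() / sd                     # Grubbs' statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)   # t critical value
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    suspect = float(data[deviations.argmax()])
    return suspect, bool(g > g_crit)

print(grubbs_test([10, 12, 13, 15, 16, 20, 35]))
# (35.0, True) -- 35 is flagged as an outlier at the 5% significance level
```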

Filtering Outliers

The general strategy is to omit outliers from the sample. But an outlier is not a candidate for omission in every situation.
As stated in my previous DQ 6.1, understanding the importance of retaining or discarding an outlier is very important.
We should use a qualitative assessment in determining whether to throw out an outlier.

Scientific experiments generally involve sensitive data, and hence an outlier can reveal a new trend or insight in the experiment.

Let me share a situation that clarifies why an outlier is not always a candidate for omission.

Consider a clinical trial conducted on a hen at a farm. Here is the weekly egg production of that hen over nine weeks:

10, 11, 14, 15, 12, 13, 30, 9, 16

In the above data, the 30 eggs in the 7th week look like an outlier. But assuming it is not due to a measurement error, we should not omit it: it can signal a significant success in the experiment, namely that the drug used in the 7th week gave better results.


Data Cleansing

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality.

Data quality discussions typically involve the following services:
- Parsing
- Standardization
- Validation
- Verification
- Matching

There are tools on the market that support these processes, for example IBM InfoSphere Information Server for Data Quality and Microsoft SQL Server 2016.
SQL Server 2016 includes Data Quality Services (DQS), a computer-assisted process that analyzes how data conforms to the knowledge in a knowledge base. DQS categorizes data under the following five tabs:
- Suggested
- New
- Invalid
- Corrected
- Correct

Consider the case of big data, where the integrity aspect of data is a hot topic and is now known as the veracity of the data.
In some cases, veracity can be maintained at the origin of the data itself by enforcing field integrity.
For example (a small validation sketch follows this list):
Email field: the email should be valid, with validation of special characters such as @.
Zip code field: the zip code should have a certain number of digits; if the system is integrated with master data storing each country's codes, we can enforce integrity by verifying the zip code against that master data.
Mobile number field: the mobile number can be validated by introducing an OTP mechanism, which ensures that the contact information provided is correct.
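Here is a minimal sketch of such field-integrity checks in Python (the regex is a deliberately simple shape check, and the zip-length master data is an illustrative stand-in, not a real reference table):

```python
import re

# Illustrative master data: expected zip-code digit counts per country.
ZIP_LENGTH_BY_COUNTRY = {"US": 5, "IN": 6, "DE": 5}

def is_valid_email(email):
    """Basic shape check: something@domain.tld with no whitespace."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) is not None

def is_valid_zip(zip_code, country):
    """Check digit count against the master data for the country."""
    expected = ZIP_LENGTH_BY_COUNTRY.get(country)
    return expected is not None and zip_code.isdigit() and len(zip_code) == expected

print(is_valid_email("alice@example.com"))  # True
print(is_valid_zip("56000", "IN"))          # False: Indian PIN codes have 6 digits
```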

Another approach to data cleansing involves relationship integrity, known as association rules. This concept was first introduced in market basket analysis; a small scoring sketch follows.
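For intuition, here is a tiny sketch of how an association rule is scored with support and confidence, as in market basket analysis (the transaction list is made up):

```python
# Made-up market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: bread -> milk
print(support({"bread", "milk"}))       # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}))  # 0.666... (2 of the 3 bread baskets)
```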

The field integrity and relationship integrity above mostly apply to structured data collection,
but what about unstructured data?
Consider the case of social-media-based customer profiling. Big data centers are embracing the concept of Master Data Management (MDM), and linkage between the data warehouse and the MDM project is a good starting point for providing cleaned data. For example, if the master data project centers around a person, then extracting life events from Twitter or Facebook, such as a change in relationship status or a birth announcement, enriches that master information.









