Thursday, December 24, 2015

How to Learn Python

A great source of information and a quick start:

http://www.pythonlearn.com/

Hive: PARTITIONED BY, DISTRIBUTE BY, SORT BY AND ORDER BY WITH BUCKETING

1. What is an external table in Hive? Why should external tables be used over managed (internal) tables?

External table can be created using the "EXTERNAL" keyword that lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This helps if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system. An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
External tables are also called unmanaged tables. Dropping an unmanaged table does not affect the underlying dataset, whereas dropping a default/internal/managed table deletes its data along with the table.
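
A minimal sketch for illustration (the table name, columns, and HDFS path here are hypothetical, not from the post):

CREATE EXTERNAL TABLE weblogs (
        ip   STRING,
        ts   STRING,
        url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';
-- Dropping this table removes only the metadata; the files under LOCATION stay in HDFS.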

2. Hive: Difference between PARTITIONED BY, CLUSTERED BY and SORTED BY with BUCKETS.

The reason we do partitioning should be clear; it is a generic database concept. With the help of partitioning you can manage a large dataset by slicing it, which results in higher query performance.

PARTITIONED BY: A partitioned table is created using the PARTITIONED BY clause. A table can have one or more partition columns. Tables or partitions can further be bucketed using CLUSTERED BY columns, and data within each bucket can be ordered via SORTED BY columns.
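
For example, a minimal sketch of a partitioned table (the table and column names are hypothetical); a query that filters on a partition column scans only the matching partitions:

CREATE TABLE sales (
        order_id  BIGINT,
        amount    DOUBLE
)
PARTITIONED BY (country STRING, sale_date STRING);

-- Partition pruning: only the country='US' partitions are read.
SELECT SUM(amount) FROM sales WHERE country = 'US';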

ORDER BY: This guarantees the global ordering of the data, using a single reducer. In strict mode (i.e., hive.mapred.mode=strict), the ORDER BY clause has to be followed by a "limit" clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose a total order on all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.

SORT BY: Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column types: if the column is of numeric type, the sort order is numeric; if the column is of string type, the sort order is lexicographical.

Hive supports SORT BY, which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there is more than one reducer, "sort by" may give partially ordered final results.
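
A quick sketch of the difference (the user_logins table and its columns are hypothetical):

-- Single reducer, globally ordered output; in strict mode a LIMIT is required.
SELECT userid, login_time FROM user_logins ORDER BY login_time LIMIT 100;

-- Each reducer sorts only its own share of the rows; the output is ordered per reducer, not globally.
SELECT userid, login_time FROM user_logins SORT BY login_time;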

CLUSTER BY: Cluster By is a shortcut for both Distribute By and Sort By.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
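
A hedged sketch of the equivalence (temp_user is the staging table used later in this post; the firstname and state columns are assumed here):

-- Rows with the same state go to the same reducer and are sorted by state within it.
SELECT firstname, state FROM temp_user DISTRIBUTE BY state SORT BY state;

-- Shorthand for the same thing.
SELECT firstname, state FROM temp_user CLUSTER BY state;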

Buckets: Bucketing is a further level of slicing of the data, and it allows much more efficient sampling than non-bucketed tables (a TABLESAMPLE sketch follows the insert example in question 3 below). How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc.
The number of reducers should be equal to the number of buckets. The command set hive.enforce.bucketing = true; allows the correct number of reducers and the cluster-by column to be selected automatically based on the table definition.
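
To see which bucket a given key would land in, you can evaluate the same expression by hand; an illustrative query (hash() and pmod() are built-in Hive functions, while the users table and user_id column are hypothetical):

-- For 10 buckets: bucket index = hash(user_id) mod 10 (0-based).
SELECT user_id, pmod(hash(user_id), 10) AS bucket_no
FROM users
LIMIT 10;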

Creation of bucketed table:
We can create a bucketed table with the help of the CLUSTERED BY clause and an optional SORTED BY clause in the CREATE TABLE statement:


create table bucketed_usertab(
        firstname VARCHAR(64),
        lastname  VARCHAR(64),
        address   STRING,
        city      VARCHAR(64),
        state     VARCHAR(64)
)
        COMMENT 'A bucketed sorted user table'
        PARTITIONED BY (country VARCHAR(64))
        CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
        STORED AS SEQUENCEFILE;


3. Example of INSERT OVERWRITE with PARTITIONED BY and CLUSTERED BY?

INSERT INTO: appends data.
INSERT OVERWRITE: overwrites/replaces the data.
Inserting data into a bucketed table:
To insert data into the bucketed table, we need to set the property hive.enforce.bucketing = true.
Also, we cannot directly load bucketed tables with the LOAD DATA command the way we can with partitioned tables. To load data into a bucketed table we have to use an INSERT ... SELECT statement such as the following, rather than the LOAD DATA command:

set hive.enforce.bucketing = true;
-- country is a dynamic partition column here, so dynamic partitioning must be allowed:
set hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE bucketed_usertab PARTITION (country)
        SELECT  firstname,
                lastname,
                address,
                city,
                state,
                country
        FROM temp_user;
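
Once the data is loaded, bucketing pays off for sampling; a hedged sketch that reads roughly one of the 32 buckets instead of scanning the whole table:

-- Sample bucket 1 out of the 32 buckets of bucketed_usertab.
SELECT firstname, city
FROM bucketed_usertab TABLESAMPLE(BUCKET 1 OUT OF 32 ON state);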


4. Which tool is most suited to migrate data from SQL Server to HDFS? How do we proceed, and how do we take care of the constraints/joins within the tables?
Using Sqoop we can migrate data from SQL Server to HDFS.

For example:
sqoop import --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' --split-by a.id --target-dir /user/foo/joinresults

Constraints
By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
Syntax: in update mode (--update-key), Sqoop generates UPDATE statements instead of INSERTs, which avoids violating a unique/primary-key constraint on rows that already exist:
sqoop-export --table foo --update-key id --export-dir /path/to/data --connect ...



5. What is the difference between MongoDB and Cassandra?
Please read: http://db-engines.com/en/system/Cassandra%3BHBase%3BMongoDB


6. How do you access, from the shell, files/directories that were created in Hue -> Files?
Hue's file browser sits on top of HDFS, so it depends on the HDFS configuration. Check the configuration details you have for the NameNode.

you can explore using hdfs commands: hdfs dfs -ls -R /

Tuesday, December 1, 2015

Hadoop Cluster Setup: Best Practice

Operating System Utilities
ssh
HBase uses the Secure Shell (ssh) command and utilities extensively to communicate between cluster nodes. Each server in the cluster must be running ssh so that the Hadoop and HBase daemons can be managed. You must be able to connect to all nodes via SSH, including the local node, from the Master as well as any backup Master, using a shared key rather than a password. You can see the basic methodology for such a set-up in Linux or Unix systems at "Procedure: Configure Passwordless SSH Access". If your cluster nodes use OS X, see the section, SSH: Setting up Remote Desktop and Enabling Self-Login on the Hadoop wiki.
DNS
HBase uses the local hostname to self-report its IP address. Both forward and reverse DNS resolving must work in versions of HBase previous to 0.92.0. The hadoop-dns-checker tool can be used to verify DNS is working correctly on the cluster. The project README file provides detailed instructions on usage.
Loopback IP
Prior to hbase-0.96.0, HBase only used the IP address 127.0.0.1 to refer to localhost, and this could not be configured. See Loopback IP for more details.
NTP
The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism, on your cluster, and that all nodes look to the same service for time synchronization. See the Basic NTP Configuration at The Linux Documentation Project (TLDP) to set up NTP.
Limits on Number of Files and Processes (ulimit)
Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to 1024 (or 256 on older versions of OS X). You can check this limit on your servers by running the command ulimit -n when logged in as the user which runs HBase. See the Troubleshooting section for some of the problems you may experience if the limit is too low. You may also notice errors such as the following:
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
It is recommended to raise the ulimit to at least 10,000, but more likely 10,240, because the value is usually expressed in multiples of 1024. Each ColumnFamily has at least one StoreFile, and possibly more than six StoreFiles if the region is under load. The number of open files required depends upon the number of ColumnFamilies and the number of regions. The following is a rough formula for calculating the potential number of open files on a RegionServer.
Calculate the Potential Number of Open Files
(StoreFiles per ColumnFamily) x (regions per RegionServer)
For example, assuming that a schema had 3 ColumnFamilies per region with an average of 3 StoreFiles per ColumnFamily, and there are 100 regions per RegionServer, the JVM will open 3 * 3 * 100 = 900 file descriptors, not counting open JAR files, configuration files, and others. Opening a file does not take many resources, and the risk of allowing a user to open too many files is minimal.
Another related setting is the number of processes a user is allowed to run at once. In Linux and Unix, the number of processes is set using the ulimit -u command. This should not be confused with the nproc command, which controls the number of CPUs available to a given user. Under load, a ulimit -u that is too low can cause OutOfMemoryError exceptions. See Jack Levin’s major HDFS issues thread on the hbase-users mailing list, from 2011.
Configuring the maximum number of file descriptors and processes for the user who is running the HBase process is an operating system configuration, rather than an HBase configuration. It is also important to be sure that the settings are changed for the user that actually runs HBase. To see which user started HBase, and that user’s ulimit configuration, look at the first line of the HBase log for that instance. A useful read on setting configuration for your Hadoop cluster is Aaron Kimball’s "Configuration Parameters: What can you just ignore?"
Example 6. ulimit Settings on Ubuntu
To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf, which is a space-delimited file with four columns. Refer to the man page for limits.conf for details about the format of this file. In the following example, the first line sets both soft and hard limits for the number of open files (nofile) to 32768 for the operating system user with the username hadoop. The second line sets the number of processes to 32000 for the same user.
hadoop  -       nofile  32768
hadoop  -       nproc   32000
The settings are only applied if the Pluggable Authentication Module (PAM) environment is directed to use them. To configure PAM to use these limits, be sure that the /etc/pam.d/common-session file contains the following line:
session required  pam_limits.so
Linux Shell
All of the shell scripts that come with HBase rely on the GNU Bash shell.
Windows
Prior to HBase 0.96, testing for running HBase on Microsoft Windows was limited. Running HBase on Windows nodes is not recommended for production systems.




Reference: http://hbase.apache.org/book.html#_configuration_files

Monday, November 23, 2015

Lambda & Zeta Architecture for the Modern Data Warehouse

Lambda Architecture

Developed by Nathan Marz, it provides a clear architecture in which real-time/streaming processing and batch processing work together to support a modern data warehouse.

The Lambda architecture has three layers:

- Batch layer
- Serving layer
- Speed layer

https://www.mapr.com/solutions/zeta-enterprise-architecture

Sunday, November 15, 2015

Cluster analysis and K-means clustering

Cluster Analysis

When to use cluster analysis:
1. To figure out what to predict
2. To detect patterns of interest in the data, when we do not know enough about what patterns to expect
3. As an exploratory tool to understand the data



What is cluster analysis?

PCA = reduce the number of columns

Cluster analysis = reduce the number of rows

Cluster Analysis:
1. Group objects (typically rows, records, or observations) in a data set based on similarity of the properties or attributes (columns) of the objects (rows)
2. There are at least a few hundred (if not more) approaches for performing cluster analysis


Popular approaches to cluster analysis:
1. Hierarchical clustering (agglomerative), such as single linkage, average linkage, and complete linkage clustering

2. Partitioning clustering methods such as K-means and K-medians for numeric data and K-modes for clustering categorical data
3. Overlapping clustering methods
4. Latent Class methods

K-means clustering
While performing partitioning (K-means) and overlapping clustering (overlapping k-centroids):
1. Normalize, normalize, normalize
2. Severe local optima: run the procedure from at least 50 random starts when clustering into K groups; otherwise, there is a high risk of sub-optimal clusters or groups
3. How many clusters to choose:
 a. Use a scree plot and interpretability
 b. If you don't like a solution, don't use it
 c. Unless a population is perfectly multi-modal, i.e. has exactly K modes, one can extract as many clusters as one wants
4. Decide how you would handle missing data (missing values in some variables):
 a. Replace missing values in a column by the mean of that column's non-missing values
 b. Ignore a row entirely if any column has missing values
 c. Impute the missing values using predictive modeling techniques




Case study
Assume you are a business analyst in a large finance company operating in the secondary (financial derivatives) market. You have access to their monthly operations database, with 500,000 trades and 200 variables, and are asked to summarize the key patterns and relationships in the data. How would you proceed?

The case study above is centered on cluster analysis, so I will focus on explaining cluster analysis and how K-means clustering can be useful here.

The stated problem falls under unsupervised classification, and we can use the K-means clustering technique to solve it.

First, let me explain clustering and when it is useful:
- To figure out what to predict
- To detect patterns of interest in the data when we do not know enough about what patterns to expect
- As an exploratory tool to understand the data

Based on the available dataset (500,000 trades with 200 variables), we can use cluster analysis to identify key patterns and relationships in the data. Here is my approach:
- First check whether the distribution of the data is normal or not
- Use box plots to try to detect outliers
- Check whether there are outliers or missing values in the data
- Treat the outliers/missing values by imputation and/or a capping/flooring method
- The overall idea is to normalize the data as much as possible
- Plot the data using plot() or a scree plot in R and try to identify the possible groups
- Use K-means clustering to create the clusters
- For each created cluster/group, find the mean values of the principal components
- Draw conclusions about each group for business purposes