Thursday, December 24, 2015

How to Learn Python

Great source of information and quick start


1. What is External table in hive. Why external tables should be used over temporary?

External table can be created using the "EXTERNAL" keyword that lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This helps if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system. An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
This is also called unmanaged tables. Deleting and dropping unmanaged table does not impact on actual datasets Where as default/internal/managed table, if data gets deleted if table got deleted

2. Hive: Difference between PARTITIONED BY, CLUSTERED BY and SORTED BY with BUCKETS.

Concept is clear about why we don partitioning. Its generic concept in database concept. with the help of Partitioning you can manage large dataset by slicing. Resulting high performance of query

Partitionedby: Partitioned table can be created by using PARTITIONED BY clause. A table can have one or more partition column.  Further tables or partition cab be bucketed using CLUSTERED BY columns and data can be stored within bucket via SORT BY columns.

ORDER BY: This gurantees the global ordering of the data using a single reducer. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish

Sortedby: Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

ClusterBy: Cluster By is a short-cut for both Distribute By and Sort By.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

Bucket: Bucketing is further level of slicing of data. And its allow much more efficient sampling than non-bucketed tables. How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc.
No of reducer should be equal to no. of buckets. The command set hive.enforce.bucketing = true; allows the correct number of reducers and the cluster by column to be automatically selected based on the table

Creation of bucketed table:
We can create bucketed table with help of CLUSTERED BY clause and optional SORTED BY clause in CREATE TABLE statement

create table bucketed_usertab(
       firstname VARCHAR(64),
        lastname  VARCHAR(64),
        address   STRING,
        city  VARCHAR(64),
        COMMENT 'A bucketed sorted user table'
        PARTITIONED BY (country VARCHAR(64))
        CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS

3. Example to insert overwrite with PARTITIONED and CLUSTER BY?

Insert into: Append data
Insert overwrite: Overwrite/replace the data
Inserting data into bucketed table
To insert data into the bucked table, we need to set property hive.enfore.bucketing =true
also we can not directly load bucketed tables with LOAD DATA COMMANDS like partitioned tabled. For loading data in bucketed table we have to use following statement rather LOAD DATA command

set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucketed_usertab PARTITION (country)
        SELECT  firstname ,
        lastname  ,
        address   ,
          city      ,
        FROM temp_user;

4. Which is most suited to migrate data from SQL Server to HDFS. How to proceed about it. How do we take care of the constraints/joins within the tables?
Using Sqoop we can migrate data from SQL server to HDFS.

For example
sqoop import --query 'SELECT a.*, b.* FROM a JOIN b on ( == WHERE $CONDITIONS' --split-by --target-dir /user/foo/joinresults

By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
sqoop-export --table foo --update-key id --export-dir /path/to/data --connect ...

5. What is difference between mongodb and cassandra?
Please read at :

6. How to access files/directory in shell, that are created in Hue -> Files
Hue files and browser depend on HDFS configuration. Please check what is configuration details you have for name node

you can explore using hdfs commands: hdfs dfs -ls -R /