Wednesday, June 10, 2015

Structured vs. Unstructured Data and the Data Warehouse


DB4.1: Data types and large-scale analytics




“Unstructured data refers to information that either does not have a pre-defined data model and/or is not organized in a predefined manner.”

In addition to social media, there are many other common forms of unstructured data:
  • Word Docs, PDFs, and Other Text Files - Books, letters, other written documents, audio and video transcripts
  • Audio Files - Customer service recordings, voicemails, 911 phone calls
  • Presentations - PowerPoints, SlideShares
  • Videos - Police dash cam, personal video, YouTube uploads
  • Images - Pictures, illustrations, memes
  • Messaging - Instant messages, text messages
In all these instances, the data can provide compelling insights. With the right tools, unstructured data can add a depth to data analysis that could not be achieved otherwise.
Structured Data
In contrast to unstructured data, structured data is data that can be easily organized. Despite its simplicity, most experts in today's data industry estimate that structured data accounts for only 20% of the data available. It is clean, analytical, and usually stored in databases. Common forms include:
  • Sensory Data - GPS data, manufacturing sensors, medical devices
  • Point-of-Sale Data - Credit card information, location of sale, product information
  • Call Detail Records - Time of call, caller and recipient information
  • Web Server Logs - Page requests, other server activity
  • Input Data - Any data entered into a computer: age, zip code, gender, etc.
The scale and variety of data have permanently overwhelmed our ability to cost-effectively extract value using traditional platforms.
The choice of storage system and computing platform depends on the type, source, and size of the data.
Today, parallel computing platforms are the best suited to handle the speed and volume of data being produced. Three prominent computing platform options are available:

- Clusters or grids
- Massively Parallel Processing (MPP)
- High-Performance Computing (HPC)

If the data is varied (structured and unstructured), comes from different sources, and arrives in large volumes, Hadoop is the best-suited technology these days.
Hadoop is often used as a synonym for HDFS (the Hadoop Distributed File System), which ships with a default parallel programming framework called MapReduce.
MapReduce is a fault-tolerant parallel programming framework designed to harness distributed processing capabilities. It automatically divides a job into smaller workloads that are distributed across the cluster: mappers process input splits in parallel, and reducers aggregate the intermediate results.
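To make the map/reduce split concrete, here is a minimal sketch along the lines of the canonical word-count job against the Hadoop Java MapReduce API; the input and output HDFS paths are hypothetical command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split, emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups pairs by word; each reducer sums the counts.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation saves network traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles the distribution itself: it schedules mappers near the data, shuffles the intermediate (word, count) pairs to reducers, and reruns any failed task on another node.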
I would like to share a few advantages that can help you gauge the power of Hadoop and decide on the right storage platform.

Cost-effective
- It is open-source software governed by the open-source community under the Apache License. That means no licensing costs, and we are free to modify and customize it to our requirements.
- It runs on commodity hardware, which again keeps costs down.
Scalable
- Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel.
Flexible

Hadoop enables businesses to easily access new data sources and tap into different types of data (both structured and unstructured) to generate value from that data.
Fast

Hadoop’s unique storage method is based on a distributed file system that basically ‘maps’ data wherever it is located on a cluster. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. If you’re dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.
Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node, that data is also replicated to other nodes in the cluster, which means that in the event of failure, there is another copy available for use.
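As a small illustration, the replication factor behind this fault tolerance is an ordinary per-file property that can be inspected and changed through the HDFS Java API. This is a minimal sketch: the file path is hypothetical, and a factor of 3 is simply the common default.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and dfs.replication from core-site.xml / hdfs-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/events.log"); // hypothetical HDFS file
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication factor: " + current);

    // Ask the NameNode to keep three copies of each block of this file.
    fs.setReplication(file, (short) 3);
  }
}
```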

HP Vertica:
The HP Vertica Analytics Platform was consciously designed with speed, scalability, simplicity, and openness at its core, and architected to handle analytical workloads via a distributed, compressed, columnar architecture. HP Vertica provides blazing-fast speed (queries run 50-1,000x faster), petabyte scale (store 10-30x more data per server), and openness and simplicity (use any BI/ETL tools, Hadoop, etc.), all at 30% of the cost of traditional data warehouse solutions.
The Technology that Makes HP Vertica So Powerful
The HP Vertica Analytics Platform is a standards-based relational database. It supports standard SQL and JDBC/ODBC, which allows users to preserve years of investment and training because all popular SQL programming tools and languages work seamlessly. All popular BI and visualization tools, such as Tableau and MicroStrategy, are tightly integrated, as are all popular ETL tools like Informatica, Pentaho, and more. HP Vertica is optimized for large-scale analytics: it is uniquely designed around a memory-and-disk-balanced, distributed, compressed columnar paradigm, which makes it dramatically faster than older techniques for modern data analytics workloads. Additionally, HP Vertica supports a series of built-in analytics libraries, such as time series and analytics packs for geospatial and sentiment analysis, plus additional functions from vendors like SAS. It also supports analytics written in the R programming language for predictive modeling.
HP Vertica is a Massively Parallel Processing (MPP) platform that distributes its workload over multiple commodity servers using a shared-nothing architecture.

This enables organizations to manage and analyze massive volumes of structured and semi-structured data quickly and reliably, with no limits or business compromises.
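Because Vertica speaks standard SQL over JDBC, querying it from Java looks like querying any relational database. This is a minimal sketch: the host, database, credentials, and sales table are hypothetical, while the jdbc:vertica:// URL scheme and default port 5433 come from the Vertica JDBC driver.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerticaQuery {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details; requires the Vertica JDBC driver on the classpath.
    String url = "jdbc:vertica://vertica-host:5433/analyticsdb";
    try (Connection conn = DriverManager.getConnection(url, "dbadmin", "secret");
         Statement stmt = conn.createStatement();
         // An ordinary analytic query; the columnar engine reads only the
         // sale_date and amount columns rather than whole rows.
         ResultSet rs = stmt.executeQuery(
             "SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date")) {
      while (rs.next()) {
        System.out.println(rs.getDate(1) + " -> " + rs.getBigDecimal(2));
      }
    }
  }
}
```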

What About Hadoop?
The HP Vertica Analytics Platform and Hadoop are highly complementary systems for big data analytics. The HP Vertica Analytics Platform is ideal for interactive, blazing-fast analytics and Hadoop is well suited for batch-oriented data processing and low-cost data storage.
When used together, with tight integration and support for all Hadoop distributions, the HP Vertica Analytics Platform and Hadoop offer the most open SQL on Hadoop. As a result, organizations can use the most powerful set of data analytics capabilities and do far more than either platform could do on its own, extracting significantly higher levels of value from massive amounts of structured, unstructured, and semi-structured data.

IBM Netezza's approach to data warehousing
IBM's approach to data warehousing is very different from what is offered by vendors such as Oracle and Teradata. The IBM Netezza data warehouse appliances are purpose-built for crunching massive volumes of data quickly and efficiently. This enables organizations to realize business value quickly, and analytically explore areas previously unimagined. Competitive offerings just aren't able to do that.
Those traditional vendors are now offering their own versions of systems that are similar to IBM Netezza data warehouse appliances. For the most part, however, these are simply a repackaging of current technology and come with the same limitations: poor performance, complexity, high administrative costs and lack of scale.

In addition, every IBM Netezza data warehouse appliance is delivered with IBM Netezza Analytics, an embedded software platform for advanced analytics. It provides the technology infrastructure to support enterprise deployments of parallel, in-database analytics. Support for a variety of popular tools and languages as well as a built-in library of parallelized analytic functions make it simple to move analytic modeling and scoring inside the data warehouse appliance. IBM Netezza Analytics is fully integrated into the IBM Netezza data warehouse asymmetric massively parallel processing (AMPP) architecture enabling data exploration, model-building, model-diagnostics and scoring with unprecedented speed.

Oracle and the traditional RDBMS
Oracle is a row-oriented relational database used mainly for storing highly structured data. Applying analytics and processing to large amounts of data is very time consuming with this design, because a row store must read entire rows even when a query touches only a few columns. To overcome this, we started building OLAP cubes, but pre-aggregated cubes do not enable real-time analytics. The sketch below illustrates the row-versus-column trade-off.
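This is a toy, illustrative comparison, not any vendor's implementation: summing one field over row-shaped records drags every field of every record through memory, while a columnar layout scans one contiguous array (which also compresses far better).

```java
import java.util.List;

public class RowVsColumn {
  // Row-oriented: each record stores all of its fields together.
  record Sale(int storeId, String product, double amount, long timestamp) {}

  // Row store: to sum amounts we still pull whole Sale objects through memory.
  static double sumRowStore(List<Sale> rows) {
    double total = 0;
    for (Sale s : rows) {
      total += s.amount(); // every other field of s was loaded along the way
    }
    return total;
  }

  // Column store: each field lives in its own array; the sum reads only
  // the contiguous amounts array.
  static double sumColumnStore(double[] amounts) {
    double total = 0;
    for (double a : amounts) {
      total += a;
    }
    return total;
  }

  public static void main(String[] args) {
    List<Sale> rows = List.of(
        new Sale(1, "widget", 9.99, 1_000L),
        new Sale(2, "gadget", 24.50, 2_000L));
    double[] amounts = {9.99, 24.50}; // the same data in column layout
    System.out.println(sumRowStore(rows) + " == " + sumColumnStore(amounts));
  }
}
```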

Over time, Oracle, the owner of the Oracle RDBMS, recognized the power of big data and came out with the Oracle Big Data Appliance.
Oracle Big Data Appliance is an engineered system that combines optimized hardware with a comprehensive big data software stack to deliver a complete, easy-to-deploy solution for acquiring and organizing big data.
Oracle Big Data Appliance comes in a full-rack configuration with 18 Sun servers for a total storage capacity of 648 TB. Every server in the rack has 2 CPUs, each with 8 cores, for a total of 288 cores per full rack. Each server has 64 GB of memory, for a total of 1,152 GB of memory per full rack.
The Oracle Big Data Appliance software includes:
- Full distribution of Cloudera’s Distribution including Apache Hadoop (CDH4)
- Oracle Big Data Appliance Plug-In for Enterprise Manager
- Cloudera Manager to administer all aspects of Cloudera CDH
- Oracle distribution of the statistical package R
- Oracle NoSQL Database Community Edition
- Oracle Enterprise Linux operating system and Oracle Java VM

References:
- http://www.oracle.com/us/products/database/big-data-for-enterprise-519135.pdf
- http://www.netezza1000.com/
- http://www8.hp.com/h20195/V2/GetPDF.aspx/4AA5-8917ENW.pdf