Big Data on AWS


You can go through this article under the following sections:
1: Overview of Big Data in AWS
EMR: helps to deploy big data platforms like Hadoop and Spark
Athena: run interactive queries on data in S3 (see the sketch after this list)
Elasticsearch: helps to deploy Elasticsearch-based clusters
Kinesis: helps in analyzing streaming data

S3: object storage
DynamoDB for Titan: helps to run the Titan graph database at massive scale on AWS
DynamoDB: NoSQL database
HBase on Amazon EMR: petabyte-scale NoSQL database
Amazon Aurora: Amazon's MySQL-compatible relational database
Amazon Redshift: petabyte-scale data warehouse
Amazon QuickSight: a fast cloud BI service with an in-memory calculation engine
Amazon Lex: advanced deep learning functionality (speech recognition and natural language understanding)
Amazon Polly: helps to build speech-enabled products (text to speech)
Amazon Rekognition: helps to analyze images and videos (objects, faces, inappropriate content)
Amazon Machine Learning: guides us through the process of building ML models
AWS Lambda: just write your code, deploy it, and run it without managing servers
EC2 instances: like virtual computers
AWS Direct Connect: provides a dedicated connection from your organization to an Amazon Virtual Private Cloud
AWS Snowball: migrates large amounts of data, which can then be kept in S3 or other services
AWS Storage Gateway: a gateway from our on-premises machines to the AWS cloud
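For instance, here is a minimal, hedged sketch of running an interactive Athena query on S3 data with boto3; the database, table, and output bucket names are hypothetical placeholders.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start an interactive query against data already catalogued for Athena.
# "salesdb", "orders", and the results bucket are placeholder names.
query = athena.start_query_execution(
    QueryString="SELECT order_id, amount FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "salesdb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])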
2: Big Data Storage and Databases on AWS
EMR
Previously we had to create and manage Hadoop instances by ourselves. Now all of that is managed by AWS.
Let's use an EMR cluster and a Hive table for demonstration purposes, as sketched below.
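As a rough sketch (assuming the default EMR IAM roles already exist, and with placeholder names for the log bucket and key pair), a cluster with Hive installed can be launched through boto3 like this:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small EMR cluster with Hive (and Spark) installed.
response = emr.run_job_flow(
    Name="demo-hive-cluster",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    LogUri="s3://my-emr-logs-bucket/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "my-key-pair",
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])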
Big Data Analytics Framework
Data Warehousing on AWS Redshift
OLTP vs OLAP: when it comes to OLTP, imagine a set of properly normalized tables. In OLAP, when we need to query something we don't need massive joins like in OLTP; that means fewer tables and all the data in one place. OLAP systems get their data from a variety of OLTP sources through an ETL process, and data warehouses are built for these OLAP workloads.
Features:
Fully managed and petabyte scale.
Column-based data store.
How it works:

Redshift is a collection of computing resources called nodes. A group of nodes is known as a cluster, and each cluster is managed by the Redshift engine.
Practical:
Let's build a cluster and tables. Building a cluster is fairly easy; choose 4 nodes for the following exercise (a boto3 sketch is shown below). We are going to create tables in a star schema, and you have to create a database in which to execute these queries for testing.
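A hedged boto3 sketch of creating that 4-node cluster; the identifier, master credentials, and database name are placeholders:

import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

# Create a 4-node cluster; replace the identifier, credentials, and DB name.
redshift.create_cluster(
    ClusterIdentifier="ssb-demo-cluster",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterType="multi-node",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe1234",
    DBName="ssbdb",
    PubliclyAccessible=True,
)

# Wait until the cluster becomes available before connecting a SQL client.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="ssb-demo-cluster")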
CREATE TABLE part (
p_partkey INTEGER NOT NULL, p_name VARCHAR(22) NOT NULL, p_mfgr VARCHAR(6) NOT NULL, p_category VARCHAR(7) NOT NULL, p_brand1 VARCHAR(9) NOT NULL, p_color VARCHAR(11) NOT NULL, p_type VARCHAR(25) NOT NULL, p_size INTEGER NOT NULL, p_container VARCHAR(10) NOT NULL
);
CREATE TABLE supplier (
s_suppkey INTEGER NOT NULL, s_name VARCHAR(25) NOT NULL, s_address VARCHAR(25) NOT NULL, s_city VARCHAR(10) NOT NULL, s_nation VARCHAR(15) NOT NULL, s_region VARCHAR(12) NOT NULL, s_phone VARCHAR(15) NOT NULL
);
CREATE TABLE customer (
c_custkey INTEGER NOT NULL,c_name VARCHAR(25) NOT NULL, c_address VARCHAR(25) NOT NULL, c_city VARCHAR(10) NOT NULL, c_nation VARCHAR(15) NOT NULL, c_region VARCHAR(12) NOT NULL, c_phone VARCHAR(15) NOT NULL, c_mktsegment VARCHAR(10) NOT NULL
);
CREATE TABLE dwdate (
d_datekey INTEGER NOT NULL, d_date VARCHAR(19) NOT NULL, d_dayofweek VARCHAR(10) NOT NULL, d_month VARCHAR(10) NOT NULL, d_year INTEGER NOT NULL, d_yearmonthnum INTEGER NOT NULL, d_yearmonth VARCHAR(8) NOT NULL, d_daynuminweek INTEGER NOT NULL, d_daynuminmonth INTEGER NOT NULL, d_daynuminyear INTEGER NOT NULL, d_monthnuminyear INTEGER NOT NULL, d_weeknuminyear INTEGER NOT NULL, d_sellingseason VARCHAR(13) NOT NULL, d_lastdayinweekfl VARCHAR(1) NOT NULL, d_lastdayinmonthfl VARCHAR(1) NOT NULL, d_holidayfl VARCHAR(1) NOT NULL, d_weekdayfl VARCHAR(1) NOT NULL
);
CREATE TABLE lineorder (
lo_orderkey INTEGER NOT NULL, lo_linenumber INTEGER NOT NULL, lo_custkey INTEGER NOT NULL, lo_partkey INTEGER NOT NULL, lo_suppkey INTEGER NOT NULL, lo_orderdate INTEGER NOT NULL, lo_orderpriority VARCHAR(15) NOT NULL, lo_shippriority VARCHAR(1) NOT NULL, lo_quantity INTEGER NOT NULL, lo_extendedprice INTEGER NOT NULL, lo_ordertotalprice INTEGER NOT NULL, lo_discount INTEGER NOT NULL, lo_revenue INTEGER NOT NULL, lo_supplycost INTEGER NOT NULL, lo_tax INTEGER NOT NULL,lo_commitdate INTEGER NOT NULL, lo_shipmode VARCHAR(10) NOT NULL
);
Now you have to load data into this schema:

Get your AWS credentials for this. The sample data lives in Amazon S3 buckets that give read access to all authenticated AWS users, so any valid AWS credentials that permit access to Amazon S3 will work. Execute the commands below after replacing the credential placeholders.
copy customer from 's3://awssampledbuswest2/ssbgz/customer'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';
copy dwdate from 's3://awssampledbuswest2/ssbgz/dwdate'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';
copy lineorder from 's3://awssampledbuswest2/ssbgz/lineorder'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';
copy part from 's3://awssampledbuswest2/ssbgz/part'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';
copy supplier from 's3://awssampledbuswest2/ssbgz/supplier'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';
The load operation will take about 10 to 15 minutes for all five tables. The results should look similar to the following:
Load into table ‘customer’ completed, 3000000 record(s) loaded successfully.
0 row(s) affected. copy executed successfully
Execution time: 10.28s
(Statement 1 of 5 finished) … …
Script execution finished Total script execution time: 9m 51s
Verify that each table loaded correctly by executing a count query:
select count(*) from LINEORDER;
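To see the OLAP-style star schema in action from code, here is a hedged sketch using the boto3 Redshift Data API; the cluster identifier, database, and user are placeholders matching the cluster created earlier.

import time
import boto3

rsd = boto3.client("redshift-data", region_name="us-west-2")

# Total revenue per year: one fact table joined to one dimension table,
# instead of the many joins an OLTP schema would need.
sql = """
    SELECT d.d_year, SUM(lo.lo_revenue) AS total_revenue
    FROM lineorder lo
    JOIN dwdate d ON lo.lo_orderdate = d.d_datekey
    GROUP BY d.d_year
    ORDER BY d.d_year;
"""

stmt = rsd.execute_statement(
    ClusterIdentifier="ssb-demo-cluster",
    Database="ssbdb",
    DbUser="awsuser",
    Sql=sql,
)

# Poll until the statement finishes, then print the result rows.
while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

for record in rsd.get_statement_result(Id=stmt["Id"])["Records"]:
    print([list(field.values())[0] for field in record])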
Real Time Big Data Analysis
Here, data is analyzed while it is in motion, even before it reaches its target.
Amazon Kinesis Architecture

In this case a shard is a sequence of data records within the stream, and the partition key is used to group data records into shards.
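A minimal producer sketch with boto3 (the stream name and payload are assumptions); the partition key decides which shard each record lands in:

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Send one clickstream-style record; records sharing a partition key
# are routed to the same shard within the stream.
event = {"user_id": "u-123", "page": "/checkout", "ts": 1700000000}
response = kinesis.put_record(
    StreamName="clickstream-demo",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
print("Shard:", response["ShardId"], "Sequence:", response["SequenceNumber"])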
Kinesis Firehose

Transforming the data and loading it into a desired destination such as S3, Splunk, Elasticsearch, or Redshift happens here.
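A hedged sketch of pushing a record into a Firehose delivery stream that already has S3 (or Elasticsearch/Redshift/Splunk) configured as its destination; the stream name is a placeholder:

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Firehose buffers records and delivers them to the configured destination;
# a trailing newline keeps the records line-delimited when they land in S3.
record = {"sensor": "temp-01", "value": 22.4}
firehose.put_record(
    DeliveryStreamName="sensor-to-s3-demo",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)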
Kinesis Analytics
This is used to generate time-series analytics; for example, we can fire custom triggers for real-time analysis.

AI/ML
Lex:
With natural language understanding (NLU), Lex tries to work out the intent behind the speech.
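For example, a minimal sketch using the (V1) Lex runtime, where the bot name, alias, and utterance are placeholders; the response exposes the detected intent and slot values:

import boto3

lex = boto3.client("lex-runtime", region_name="us-east-1")

# Send a text utterance to a deployed bot and read back the intent it inferred.
response = lex.post_text(
    botName="OrderFlowers",
    botAlias="prod",
    userId="demo-user-1",
    inputText="I would like to order two roses",
)
print("Intent:", response.get("intentName"))
print("Slots:", response.get("slots"))
print("Bot reply:", response.get("message"))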
Polly:
Polly supports real-time streaming: when you send the text, you get the audio right after.
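A small sketch (the voice and output format are just example choices):

import boto3

polly = boto3.client("polly", region_name="us-east-1")

# Convert text to speech and save the returned audio stream as an MP3 file.
response = polly.synthesize_speech(
    Text="Your order has been shipped.",
    OutputFormat="mp3",
    VoiceId="Joanna",
)
with open("speech.mp3", "wb") as f:
    f.write(response["AudioStream"].read())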
Rekognition:

Use cases:
1. Finding missing persons (face comparison and search).
2. Image moderation (detect what is not appropriate), as in the sketch below.
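A hedged sketch of the image moderation use case with boto3; the S3 bucket and object key are placeholders:

import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Flag potentially inappropriate content in an image stored in S3.
response = rekognition.detect_moderation_labels(
    Image={"S3Object": {"Bucket": "my-uploads-bucket", "Name": "photos/upload-001.jpg"}},
    MinConfidence=75,
)
for label in response["ModerationLabels"]:
    print(label["Name"], round(label["Confidence"], 1))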
SageMaker:
SageMaker can automatically tune our model by trying multiple combinations of algorithm hyperparameters. It also supports a wide range of built-in algorithms for real-world projects, too many to summarize here.
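A rough sketch of that automatic tuning using the SageMaker Python SDK, assuming a built-in XGBoost training job; the IAM role, bucket paths, objective metric, and tuning ranges are all placeholder assumptions:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# A built-in XGBoost estimator; output path and instance type are example values.
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/output/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# Let SageMaker try combinations of hyperparameters and keep the best model.
tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-ml-bucket/train/", "validation": "s3://my-ml-bucket/validation/"})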
Business Intelligence on AWS and Big Data Computation on AWS will come in the next part.