Big Data on AWS

This article is organized into several sections.

1: Overview of Big Data in AWS

EMR: Helps deploy big data platforms such as Hadoop and Spark

Athena: Runs interactive queries on data in S3

Elasticsearch Service: Helps deploy Elasticsearch-based clusters

Kinesis: Helps analyze streaming data

S3: Object storage

DynamoDB for Titan: Helps run the Titan graph database at massive scale on AWS

DynamoDB: NoSQL database

HBase on Amazon EMR: Petabyte-scale NoSQL database

Amazon Aurora: Amazon's MySQL-compatible relational database

Amazon Redshift: Petabyte-scale data warehouse

Amazon QuickSight: Fast cloud BI service with an in-memory calculation engine

Amazon Lex: Advanced deep learning functionality (speech recognition)

Amazon Polly: Helps build speech-enabled products

Amazon Rekognition: Helps analyze images and videos, for example to detect inappropriate content

Amazon Machine Learning: Guides us through the process of building machine learning models

AWS Lambda: Just write your code, deploy it, and run it

EC2 instances: Like virtual computers

Direct Connect: Provides a dedicated connection from an organization to Amazon Virtual Private Cloud

AWS Snowball: Migrates large amounts of data, which can then be kept in S3 or other services

Storage Gateway: A gateway from on-premises machines to the AWS cloud

2: Big Data Storage and Databases on AWS


Previously we had to create Hadoop instances ourselves. Now all of this is managed by AWS.

Let’s use an EMR cluster and a Hive table for demonstration purposes.

Big Data Analytics Framework

Data Warehousing on Amazon Redshift

OLTP vs OLAP: In OLTP, imagine a set of properly normalized tables. In OLAP, a query does not need the massive joins that OLTP would require, because there are fewer tables and the data is kept in one place. OLAP systems get their data from a variety of OLTP sources through an ETL process. Data warehouses are built for this kind of OLAP workload.
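To make the difference concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for both kinds of system. The table and column names (customers, orders, sales_flat) and the sample rows are made up for illustration; a real warehouse would run the ETL step with dedicated tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP side: normalized tables (hypothetical names and data).
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Alice", "Colombo"), (2, "Bob", "Kandy")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 100), (11, 1, 50), (12, 2, 70)])

# ETL step: join once and store the result as one wide table (OLAP side).
cur.execute("""
    CREATE TABLE sales_flat AS
    SELECT o.id AS order_id, c.name AS customer, c.city AS city, o.amount AS amount
    FROM orders o JOIN customers c ON c.id = o.customer_id
""")

# Analytical query: no join is needed any more.
revenue_by_city = dict(cur.execute(
    "SELECT city, SUM(amount) FROM sales_flat GROUP BY city"))
print(revenue_by_city)
```

The join cost is paid once during ETL, so every later analytical query reads a single table.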


Redshift is fully managed and petabyte scale.

It is a column-based data store.
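A rough way to see why a column-based layout suits analytics is to compare how an aggregate reads data in each layout. The sketch below uses plain Python lists and a made-up three-row table; it only illustrates the idea and says nothing about how Redshift stores blocks internally.

```python
# Three rows of a hypothetical table (order_id, city, amount).
rows = [
    (1, "Colombo", 100),
    (2, "Kandy", 70),
    (3, "Colombo", 50),
]

# Row-oriented layout: each record is stored together.
row_store = list(rows)

# Column-oriented layout: each column is stored together.
column_store = {
    "order_id": [r[0] for r in rows],
    "city": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def total_amount_row_store(store):
    # Has to walk every full record, even though only one field is used.
    return sum(record[2] for record in store)


def total_amount_column_store(store):
    # Reads only the one column the query needs.
    return sum(store["amount"])


print(total_amount_row_store(row_store), total_amount_column_store(column_store))
```

Both functions return the same total; the column layout simply touches less data to get there.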

How it works:

Redshift runs on a collection of computing resources called nodes. A group of nodes is known as a cluster, and it is handled by the Redshift engine.


Let’s build a cluster and tables. Building a cluster is fairly easy. It is better to choose 4 nodes for the following exercise. We are going to create the tables in a star schema. You have to create a database to run these queries for testing.


CREATE TABLE part (
  p_partkey INTEGER NOT NULL, p_name VARCHAR(22) NOT NULL, p_mfgr VARCHAR(6) NOT NULL,
  p_category VARCHAR(7) NOT NULL, p_brand1 VARCHAR(9) NOT NULL, p_color VARCHAR(11) NOT NULL,
  p_type VARCHAR(25) NOT NULL, p_size INTEGER NOT NULL, p_container VARCHAR(10) NOT NULL
);

CREATE TABLE supplier (
  s_suppkey INTEGER NOT NULL, s_name VARCHAR(25) NOT NULL, s_address VARCHAR(25) NOT NULL,
  s_city VARCHAR(10) NOT NULL, s_nation VARCHAR(15) NOT NULL, s_region VARCHAR(12) NOT NULL,
  s_phone VARCHAR(15) NOT NULL
);

CREATE TABLE customer (
  c_custkey INTEGER NOT NULL, c_name VARCHAR(25) NOT NULL, c_address VARCHAR(25) NOT NULL,
  c_city VARCHAR(10) NOT NULL, c_nation VARCHAR(15) NOT NULL, c_region VARCHAR(12) NOT NULL,
  c_phone VARCHAR(15) NOT NULL, c_mktsegment VARCHAR(10) NOT NULL
);

CREATE TABLE dwdate (
  d_datekey INTEGER NOT NULL, d_date VARCHAR(19) NOT NULL, d_dayofweek VARCHAR(10) NOT NULL,
  d_month VARCHAR(10) NOT NULL, d_year INTEGER NOT NULL, d_yearmonthnum INTEGER NOT NULL,
  d_yearmonth VARCHAR(8) NOT NULL, d_daynuminweek INTEGER NOT NULL, d_daynuminmonth INTEGER NOT NULL,
  d_daynuminyear INTEGER NOT NULL, d_monthnuminyear INTEGER NOT NULL, d_weeknuminyear INTEGER NOT NULL,
  d_sellingseason VARCHAR(13) NOT NULL, d_lastdayinweekfl VARCHAR(1) NOT NULL, d_lastdayinmonthfl VARCHAR(1) NOT NULL,
  d_holidayfl VARCHAR(1) NOT NULL, d_weekdayfl VARCHAR(1) NOT NULL
);

CREATE TABLE lineorder (
  lo_orderkey INTEGER NOT NULL, lo_linenumber INTEGER NOT NULL, lo_custkey INTEGER NOT NULL,
  lo_partkey INTEGER NOT NULL, lo_suppkey INTEGER NOT NULL, lo_orderdate INTEGER NOT NULL,
  lo_orderpriority VARCHAR(15) NOT NULL, lo_shippriority VARCHAR(1) NOT NULL, lo_quantity INTEGER NOT NULL,
  lo_extendedprice INTEGER NOT NULL, lo_ordertotalprice INTEGER NOT NULL, lo_discount INTEGER NOT NULL,
  lo_revenue INTEGER NOT NULL, lo_supplycost INTEGER NOT NULL, lo_tax INTEGER NOT NULL,
  lo_commitdate INTEGER NOT NULL, lo_shipmode VARCHAR(10) NOT NULL
);
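To show what a star schema buys us at query time, here is a small sketch that runs locally with Python's sqlite3 module instead of Redshift. It keeps only a few columns of lineorder (the fact table) and dwdate (a dimension table), and the sample rows are made up; the column names come from the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Cut-down versions of two tables from the schema above.
cur.execute("CREATE TABLE dwdate (d_datekey INTEGER NOT NULL, d_year INTEGER NOT NULL)")
cur.execute("CREATE TABLE lineorder (lo_orderkey INTEGER NOT NULL, "
            "lo_orderdate INTEGER NOT NULL, lo_revenue INTEGER NOT NULL)")

# Made-up sample rows, for illustration only.
cur.executemany("INSERT INTO dwdate VALUES (?, ?)",
                [(19970101, 1997), (19980101, 1998)])
cur.executemany("INSERT INTO lineorder VALUES (?, ?, ?)",
                [(1, 19970101, 500), (2, 19970101, 250), (3, 19980101, 400)])

# Typical star-schema query: the fact table joined to one dimension.
query = """
    SELECT d.d_year, SUM(lo.lo_revenue)
    FROM lineorder lo
    JOIN dwdate d ON lo.lo_orderdate = d.d_datekey
    GROUP BY d.d_year
    ORDER BY d.d_year
"""
revenue_by_year = cur.execute(query).fetchall()
print(revenue_by_year)
```

Each dimension joins directly to the fact table, so queries stay one join deep per dimension.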


Now load data into this schema.

Get your AWS credentials for this. The sample data is in Amazon S3 buckets that give read access to all authenticated AWS users, so any valid AWS credentials that permit access to Amazon S3 will work. Run the commands below, replacing the placeholders with your credentials.

copy customer from 's3://awssampledbuswest2/ssbgz/customer'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

copy dwdate from 's3://awssampledbuswest2/ssbgz/dwdate'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

copy lineorder from 's3://awssampledbuswest2/ssbgz/lineorder'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

copy part from 's3://awssampledbuswest2/ssbgz/part'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

copy supplier from 's3://awssampledbuswest2/ssbgz/supplier'
credentials 'aws_access_key_id=<Your-Access-KeyID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';
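Since the five commands differ only in the table name, they can be generated instead of typed by hand. The helper below is a small sketch: build_copy_statement is a hypothetical name, and it only builds the text of the commands above without connecting to Redshift. String formatting is acceptable here only because the table names are fixed constants, not user input.

```python
TABLES = ["customer", "dwdate", "lineorder", "part", "supplier"]
BUCKET_PREFIX = "s3://awssampledbuswest2/ssbgz"


def build_copy_statement(table, access_key_id, secret_access_key):
    """Return one COPY statement in the same shape as the commands above."""
    credentials = (
        f"aws_access_key_id={access_key_id};"
        f"aws_secret_access_key={secret_access_key}"
    )
    return (
        f"copy {table} from '{BUCKET_PREFIX}/{table}'\n"
        f"credentials '{credentials}'\n"
        "gzip compupdate off region 'us-west-2';"
    )


# The placeholders are kept on purpose; substitute real credentials at run time.
statements = [
    build_copy_statement(t, "<Your-Access-KeyID>", "<Your-Secret-Access-Key>")
    for t in TABLES
]
print("\n\n".join(statements))
```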

The load operation takes about 10 to 15 minutes for all five tables. The results should look similar to the following.

Load into table ‘customer’ completed, 3000000 record(s) loaded successfully.

0 row(s) affected. copy executed successfully

Execution time: 10.28s

(Statement 1 of 5 finished) … …

Script execution finished Total script execution time: 9m 51s
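If you save this output, a few lines of Python can pull out the per-table record counts for a quick check. This is a sketch that assumes your SQL client prints lines in the same shape as the sample above; other clients may word the message differently, and parse_load_counts is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   Load into table 'customer' completed, 3000000 record(s) loaded successfully.
# Straight or curly quotes around the table name are both accepted.
LOAD_LINE = re.compile(
    r"Load into table ['‘’](?P<table>\w+)['‘’] completed, "
    r"(?P<count>\d+) record\(s\) loaded successfully"
)


def parse_load_counts(output):
    """Return {table_name: loaded_record_count} for every matching line."""
    return {m.group("table"): int(m.group("count"))
            for m in LOAD_LINE.finditer(output)}


sample_output = """Load into table 'customer' completed, 3000000 record(s) loaded successfully.
0 row(s) affected. copy executed successfully
Execution time: 10.28s
"""
print(parse_load_counts(sample_output))
```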

Verify that each table loaded correctly by running a count query against it, for example:

select count(*) from LINEORDER;

Real Time Big Data Analysis

Here the data is analyzed while it is still in motion, even before it reaches its target.

Amazon Kinesis Architecture

A stream is made up of shards, and each shard holds a sequence of data records. The partition key of a record is used to group data by shard within a stream.
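The mapping from partition key to shard can be sketched in a few lines. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. In the sketch below the shard names and the evenly split ranges are assumptions for illustration; a real stream reports its own shard IDs and ranges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces 128-bit values


def make_shards(shard_count):
    """Split the hash key space into equal, contiguous ranges, one per shard."""
    size = HASH_SPACE // shard_count
    shards = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        shards.append((f"shard-{i}", start, end))  # hypothetical shard names
    return shards


def shard_for(partition_key, shards):
    """Hash the partition key with MD5 and find the shard whose range holds it."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for name, start, end in shards:
        if start <= hash_key <= end:
            return name
    raise ValueError("hash key outside every shard range")


shards = make_shards(4)
for key in ["sensor-1", "sensor-2", "sensor-1"]:
    print(key, "->", shard_for(key, shards))
```

Records that share a partition key always land on the same shard, which is what keeps related records together.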

Kinesis Firehose

Firehose modifies the data and loads it into the desired destination, such as S3, Splunk, Elasticsearch, or Redshift.
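One way Firehose can modify data in flight is by invoking an AWS Lambda function on each batch of records. The sketch below follows the record shape used for that kind of transformation (recordId, base64-encoded data, and a result status), while the transformation itself, upper-casing a "city" field, is a made-up example. The handler is called locally with a hand-built event, so no AWS account is needed to try it.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each incoming record and hand it back to Firehose."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical transformation: upper-case the "city" field.
        payload["city"] = payload["city"].upper()
        transformed = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must echo the incoming ID
            "result": "Ok",
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}


# Local check with a hand-built event in the same shape.
sample_event = {
    "records": [{
        "recordId": "1",
        "data": base64.b64encode(b'{"city": "colombo"}').decode("utf-8"),
    }]
}
result = lambda_handler(sample_event, None)
print(base64.b64decode(result["records"][0]["data"]))
```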

Kinesis Analytics

This is used to generate time-series analytics. For example, we can set up custom triggers for real-time analysis.
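As an illustration of the idea, the sketch below groups events into fixed, non-overlapping (tumbling) time windows and fires a custom trigger when a window's total crosses a threshold. It is plain Python over a made-up list of events, not Kinesis Analytics code; the function names and the threshold are assumptions.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    Returns {window_start: (event_count, value_sum)}.
    """
    windows = defaultdict(lambda: [0, 0])
    for timestamp, value in events:
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][0] += 1
        windows[window_start][1] += value
    return {start: tuple(stats) for start, stats in sorted(windows.items())}


def triggered_windows(window_stats, threshold):
    """Custom trigger: return the windows whose value sum exceeds the threshold."""
    return [start for start, (_, total) in window_stats.items() if total > threshold]


# Made-up events: (seconds since start, value).
events = [(1, 10), (4, 20), (12, 5), (15, 50), (19, 60), (21, 1)]
stats = tumbling_window_counts(events, window_seconds=10)
print(stats)
print(triggered_windows(stats, threshold=100))
```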



With NLU (natural language understanding), Amazon Lex tries to work out the intent behind the speech.

Another use case example:


Amazon Polly uses real-time streaming: when you send text, you get the audio back right away.


Use cases:

1. Find missing persons.

2. Image moderation (define what is not appropriate).


SageMaker automatically tunes our model by trying multiple combinations of algorithm parameters. It also supports a large number of algorithms used in real-world projects, too many to summarize here.
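The idea of trying parameter combinations can be sketched with a plain grid search. This is only an illustration of the concept: exhaustive grid search is a simple stand-in here, SageMaker's tuning service uses its own search strategies, and validation_error is a toy function standing in for training a model and measuring it.

```python
from itertools import product


def validation_error(learning_rate, depth):
    """Toy stand-in for training a model and measuring its validation error.

    The minimum is placed at learning_rate=0.1, depth=5 on purpose.
    """
    return (learning_rate - 0.1) ** 2 + (depth - 5) ** 2


# Hypothetical parameters and candidate values.
search_space = {
    "learning_rate": [0.01, 0.1, 0.5],
    "depth": [3, 5, 7],
}


def grid_search(space, objective):
    """Try every combination of parameter values and keep the best one."""
    names = list(space)
    best_params, best_score = None, float("inf")
    for values in product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(search_space, validation_error)
print(best_params, best_score)
```

With 3 values for each of 2 parameters the loop evaluates 9 combinations. The count grows multiplicatively with every parameter added, which is why managed tuning is useful.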

Business Intelligence on AWS and Big Data Computation on AWS are covered in the next part.


