Selecting the right database in Amazon Web Services (AWS)

Deciding which database to use for an application or workload can be very tricky. Since I joined the AWS Community Builders program, I have spent at least one hour every day exploring AWS services by use case. Amazon Web Services (AWS) provides several database options, and it is easy to be confused about which one to choose. This article documents what I learned and the resources I used to understand the various databases in AWS and how to decide when to use them. I hope it will be of value to you. I would like your feedback on what I should add, remove, or improve as I continue exploring AWS and other cloud services.

Many criteria can help us select the right database in AWS. To make it easier, we summarize them into the following four criteria:

  1. Type of Data
  2. Size of Data
  3. Structure or Shape of Data
  4. Activities that will be done on the Data
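
To make the third criterion concrete, here is a minimal Python sketch that maps the shape of the data to the database families covered in the rest of this article. The `DATABASE_BY_SHAPE` table and the `suggest_database` function are hypothetical names made up for illustration. A real decision would also weigh the type and size of the data and the activities performed on it.

```python
# Hypothetical lookup table: shape of the data -> AWS database family
# discussed later in this article. It is a simplification, not an official guide.
DATABASE_BY_SHAPE = {
    "relational": "Amazon RDS / Amazon Aurora",
    "warehouse": "Amazon Redshift",
    "key-value": "Amazon DynamoDB",
    "document": "Amazon DocumentDB",
    "graph": "Amazon Neptune",
    "in-memory": "Amazon ElastiCache",
    "time-series": "Amazon Timestream",
    "ledger": "Amazon QLDB",
}


def suggest_database(shape: str) -> str:
    """Return the database family for a data shape, ignoring case and spaces."""
    key = shape.strip().lower()
    if key not in DATABASE_BY_SHAPE:
        raise ValueError(f"unknown data shape: {shape!r}")
    return DATABASE_BY_SHAPE[key]


print(suggest_database("Graph"))
```

The lookup only covers one of the four criteria, so it is best read as a starting point for the sections that follow.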

Now that we have an idea of the criteria for selecting a database, let us dive into each of the databases.

Relational Databases

Amazon Relational Database Service (RDS)

Amazon RDS is not a database itself; instead, it is a managed service used to set up, operate, and scale relational databases in the cloud. It enables us to provision Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. It acts like an administrator for these databases: it automates failover, backups, restores, disaster recovery, access management, encryption, security, monitoring, and performance optimization. It has two major backup solutions: automated backups and manual snapshots. It supports up to 5 read replicas per instance, although the exact limit depends on the engine. Replication options include Multi-AZ deployments, read replicas, and cross-Region read replicas. Resources are not replicated across AWS Regions by default; we have to set that up explicitly.

When to use Amazon Relational Database Service (RDS)

If we need to use any of these six database engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, or SQL Server.

Pricing

Amazon Relational Database Service (RDS) pricing depends on whether we use On-Demand or Reserved Instances.

Official Resources

Amazon Relational Database Service (RDS)

Amazon RDS Documentation

Amazon RDS Blog

Other Resources

AWS Certified Solutions Architect - Associate 2020 (6:23:16 - 7:00:11)

Amazon Relational Database Service (Amazon RDS) by Vlad Vlasceanu

Amazon Relational Database Service (RDS) (DAT302) by Brian Welcker

Amazon RDS

Amazon Aurora

Amazon Aurora is compatible with MySQL and PostgreSQL. According to AWS, it delivers up to five times the throughput of standard MySQL and up to three times the throughput of standard PostgreSQL, at about one-tenth the cost of commercial databases. A cluster volume can grow to a maximum size of 128 TB. An Aurora cluster can have a maximum of 15 Aurora Replicas, which a scaling policy can add or remove. Aurora backup and failover are automatic. Amazon Aurora supports cross-region replication. Aurora MySQL and Aurora PostgreSQL DB clusters are created using the Amazon Relational Database Service console. Aurora Serverless gives Amazon Aurora the ability to automatically start up, shut down, and scale capacity up or down (auto-scaling). Aurora Serverless is best used for applications that are infrequently used, for new applications, and for applications with varying and unpredictable workloads.

Parallel Query for Amazon Aurora

Pricing

Amazon Aurora pricing depends on whether we select the MySQL-compatible edition or the PostgreSQL-compatible edition. Aurora Serverless is charged based on Aurora Capacity Units (ACUs).

Official Resources

Amazon Aurora

Amazon Aurora Documentation

Amazon Aurora Blog

Other Resources

Amazon Aurora Introduction

AWS re:Invent 2019: [REPEAT] Amazon Aurora storage demystified: How it all works (DAT309-R) by Murali Brahmadesam and Tobias Ternström

Amazon Aurora Global Database Deep Dive by Aditya Samant

AWS Certified Solutions Architect - Associate 2020 (7:02:14 - 7:06:56)

Amazon Aurora ascendant: How we designed a cloud-native relational database

Data Warehousing

Amazon Redshift

Amazon Redshift is a columnar data store used for data warehousing. It is used to analyze large volumes of data and get insight from them quickly by executing complex queries. The data is usually historical data at rest. A Redshift cluster is made up of nodes and can run in single-node or multi-node mode. There are two types of nodes in Amazon Redshift: the leader node and compute nodes. The leader node exposes the SQL endpoint, stores metadata, and coordinates parallel SQL processing. Compute nodes store the data and execute the queries. By default, an Amazon Redshift cluster stores its data in a single Availability Zone. Amazon Redshift Spectrum is used to query data in Amazon Simple Storage Service (Amazon S3) directly. Amazon Redshift federated queries enable us to query and analyze live data across databases, data warehouses, and data lakes.
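
To see why columnar storage suits analytical queries, here is a minimal sketch in plain Python. It is not how Amazon Redshift is implemented; the sample `rows` and the `to_columns` helper are made up for illustration. The point is only that an aggregate over one column can read a single contiguous list instead of touching every field of every row.

```python
# Illustrative data only: three orders stored row by row.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: the query walks every row to pick out one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: the query reads only the "amount" column.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both sums give the same answer; the difference is how much data has to be scanned to get it, which matters a lot once a table has many columns and billions of rows.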

Amazon Redshift Architecture

When to use Amazon Redshift

for Online Analytical Processing (OLAP).

if we need to run queries across multiple data sources. For instance, we can copy data from different sources like Amazon EMR and Amazon S3 into Amazon Redshift.

for generating reports for business intelligence.

Pricing

Amazon Redshift pricing - the basic price for Amazon Redshift starts from $0.25 per hour. Several other features can influence the price, such as Amazon Redshift Spectrum pricing, Concurrency Scaling pricing, Redshift managed storage pricing, and Redshift ML pricing.

Official Resources

Amazon Redshift

Amazon Redshift Documentation

Amazon Redshift Blog

Other Resources

AWS re:Invent 2017: Best Practices for Data Warehousing with Amazon Redshift & Redsh (ABD304) by Tony Gibbs

Getting Started with Amazon Redshift - AWS Online Tech Talks by Greg Khairallah and Harshida Patel

Amazon Redshift Tutorial | Amazon Redshift Architecture | AWS Tutorial For Beginners | Simplilearn

Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Training | Edureka

AWS Certified Solutions Architect - Associate 2020 (7:07:58)

What is Amazon Redshift? By Kevin Goldberg

NoSQL - Key/Value

DynamoDB

Amazon DynamoDB is a NoSQL database that supports both key/value and document data structures. DynamoDB’s major components are tables, items, attributes, keys, and values. A table is a collection of items, and an item is a collection of attributes. Items are similar to rows, while attributes are similar to columns in a traditional database. A key is the name used to identify an attribute, and a value is the data itself. The major API components in DynamoDB are the control plane, the data plane, DynamoDB Streams, and transactions.

On-Demand and Provisioned are the read/write capacity modes in DynamoDB. In Provisioned mode, we specify capacity in Read Capacity Units (RCUs) and Write Capacity Units (WCUs). Amazon DynamoDB creates partitions based on size, Read Capacity Units, and Write Capacity Units. As a rule of thumb, a single partition holds about 10 GB of data and serves up to 3,000 RCUs and 1,000 WCUs.

DynamoDB encrypts data at rest, that is, inactive data that is not moving from one device to another or from one network to another. Its point-in-time recovery feature lets us restore our data to any point in time within the recovery window (up to the last 35 days). Amazon DynamoDB Accelerator (DAX) is a managed write-through cache for DynamoDB that reduces response time from milliseconds to microseconds. Amazon DynamoDB uses SSD storage and stores its data across three Availability Zones.
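
The capacity numbers above can be turned into a small calculator. This is a minimal sketch, assuming the documented unit sizes (one RCU covers one strongly consistent read per second of an item up to 4 KB, an eventually consistent read costs half, and one WCU covers one write per second of an item up to 1 KB). The partition estimate simply applies the 10 GB, 3,000 RCU, and 1,000 WCU thresholds mentioned above as a rule of thumb; it is not a guarantee of how DynamoDB lays out data internally. The function names are made up for illustration.

```python
import math


def read_capacity_units(item_size_kb, reads_per_second, strongly_consistent=True):
    """RCUs needed: each read costs one unit per 4 KB, rounded up."""
    units_per_read = math.ceil(item_size_kb / 4)
    if strongly_consistent:
        return reads_per_second * units_per_read
    # Eventually consistent reads cost half as much.
    return math.ceil(reads_per_second * units_per_read / 2)


def write_capacity_units(item_size_kb, writes_per_second):
    """WCUs needed: each write costs one unit per 1 KB, rounded up."""
    return writes_per_second * math.ceil(item_size_kb)


def estimated_partitions(size_gb, rcu, wcu):
    """Rule-of-thumb partition count from the thresholds in the text."""
    by_size = math.ceil(size_gb / 10)
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    return max(by_size, by_throughput, 1)


print(read_capacity_units(6, 100))         # 6 KB items, 100 reads per second
print(write_capacity_units(1.5, 50))       # 1.5 KB items, 50 writes per second
print(estimated_partitions(25, 200, 100))  # 25 GB table with modest throughput
```

In the last call the table size, not the throughput, drives the partition count, which is a common situation for large but quiet tables.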

When to use Amazon DynamoDB

for Online Transaction Processing (OLTP).

to store real-time data from an IoT device.

to store activities and events on a web application such as clicks.

to store items in a web application, like user profiles and user events, as used by advertising, gaming, retail, finance, and media companies.

for data that requires a high request rate (millions of requests per second).

for situations that require strongly consistent reads, which DynamoDB offers as an option (reads are eventually consistent by default).

Pricing

Amazon DynamoDB pricing depends on whether we use on-demand capacity mode or provisioned capacity mode.

Hands-On

Creating Tables and Loading Data

Sample Code

Create a ToDo Web App Storing your data in Amazon DynamoDB

Official Resources

Amazon DynamoDB

Amazon DynamoDB Documentation

Amazon DynamoDB Blog

Other Resources

AWS DynamoDB Tutorial | AWS Services | AWS Tutorial For Beginners | AWS Training Video | Simplilearn

AWS Certified Developer - Associate 2020 (5:05:21)

AWS re:Invent 2018: Amazon DynamoDB Under the Hood: How We Built a Hyper-Scale Database by Jaso Sorenson

AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401) by Rick Houlihan

Building a Mars Rover Application with DynamoDB

Document Database

Amazon DocumentDB

Amazon DocumentDB is a document database with MongoDB compatibility. It can easily store, query, and index JSON data. It supports up to 15 read replicas and scales vertically with very low impact. It has a flexible schema and ad hoc query capability. It is easy to index and can be used for operational and analytics workloads.
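
To illustrate what a flexible schema and ad hoc queries mean in practice, here is a minimal sketch in plain Python. It imitates a MongoDB-style filter on nested fields but does not use the DocumentDB or MongoDB drivers; the `profiles` data and the `get_path` and `find` helpers are made up for illustration.

```python
# Illustrative documents: each profile carries a different set of fields.
profiles = [
    {"_id": 1, "name": "Ada", "interests": ["chess"], "address": {"city": "Lagos"}},
    {"_id": 2, "name": "Linus", "plan": "pro"},
    {"_id": 3, "name": "Grace", "address": {"city": "Lagos"}, "plan": "free"},
]


def get_path(document, dotted_path):
    """Follow a dotted path such as 'address.city'; return None if missing."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(documents, criteria):
    """Return documents whose fields equal every value in criteria."""
    return [
        document
        for document in documents
        if all(get_path(document, path) == value for path, value in criteria.items())
    ]


in_lagos = find(profiles, {"address.city": "Lagos"})
print([profile["name"] for profile in in_lagos])
```

No table definition had to change to add `interests` or `plan` to some profiles, and the query was not planned in advance, which is the appeal of a document model for profiles and catalogs.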

Pricing

Amazon DocumentDB pricing is based on On-demand instances, database I/O, database storage and backup storage.

When to use Amazon DocumentDB

it is Amazon's MongoDB-compatible database, used when we need to run MongoDB workloads at scale.

best for profile management, content management, and catalog management.

Official Resources

Amazon DocumentDB

Amazon DocumentDB Documentation

Amazon DocumentDB Blog

Other Resources

AWS re:Invent 2019: Amazon DocumentDB deep dive (DAT326) by Joseph Idziorek and Antra Grover

Building with Amazon DocumentDB (with MongoDB compatibility) - AWS Virtual Workshop by Meet Bhagdev

DynamoDB vs AWS DocumentDB vs MongoDB

MongoDB vs. DocumentDB: Which Is Right for You?

Difference between AWS DynamoDB vs AWS DocumentDB vs MongoDB?

Graph Database

Amazon Neptune

Amazon Neptune is a graph database that works with highly connected datasets. It looks for relationships or similarities in data, for instance, similarities between the movies a user watches on Netflix. Its components are nodes (data entities), edges (relationships), and properties. Amazon Neptune supports the property graph model, queried with open-source Apache TinkerPop Gremlin, and the Resource Description Framework (RDF), queried with SPARQL. Amazon Neptune replicates data 6 times across 3 Availability Zones. Amazon Neptune Streams can be used to capture changes in a graph.
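
The movie example can be sketched as a tiny graph traversal. This is plain Python rather than a query against Amazon Neptune, and the `watched` data and `recommend` function are made up for illustration. Users and movies play the role of nodes, and each "watched" entry is an edge between them.

```python
from collections import Counter

# Illustrative edges: user node -> set of movie nodes the user watched.
watched = {
    "ana": {"Inception", "Interstellar"},
    "ben": {"Inception", "Interstellar", "Tenet"},
    "cy": {"Inception", "Up"},
}


def recommend(user, edges):
    """Suggest unseen movies, ranked by how much other viewers overlap with user."""
    seen = edges[user]
    scores = Counter()
    for other, movies in edges.items():
        if other == user:
            continue
        overlap = len(seen & movies)
        if overlap == 0:
            continue  # no shared movies, so no relationship to follow
        for movie in movies - seen:
            scores[movie] += overlap
    # Highest score first; ties are broken alphabetically for stable output.
    return [movie for movie, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


print(recommend("ana", watched))
```

A graph database runs this kind of "follow the edges" question directly on stored relationships, instead of reconstructing them with joins at query time.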

Knowledge Graph

Pricing

Amazon Neptune pricing is influenced by On-demand instances, database I/O, database storage, backup storage and data transfer.

When to use Amazon Neptune

Amazon Neptune is best used when we have relationships in the data.

for recommendation engines, fraud detection, and drug discovery.

for knowledge base applications such as Wikidata.

for identity graphs to show a unified view of customers and prospects based on their interactions with a product or a website.

for social networking applications to store user profiles and interactions.

Recommendation Relationships

Official Resources

Amazon Neptune

Amazon Neptune Documentation

Amazon Neptune Blog

Other Resources

AWS re:Invent 2019: Deep dive on Amazon Neptune (DAT361) by Bradley Bebee, Karthik Bharathy, and Ora Lassila

Nike: A Social Graph at Scale with Amazon Neptune

Homesite: Event-Driven Data Analytics Platform Using Amazon Neptune

AWS on Air 2020: AWS What’s Next ft. Amazon Neptune ML

In-Memory

Amazon ElastiCache

Amazon ElastiCache is used to manage in-memory caching. Caching is storing data in a temporary storage area. This data is stored in RAM, which is volatile: the data can be accessed fast but can also be lost easily. ElastiCache stores frequently accessed data to improve performance, which reduces application latency and improves throughput. It caches data from a database, which makes it different from Amazon CloudFront (a Content Delivery Network): Amazon ElastiCache keeps frequently needed data in memory, while Amazon CloudFront stores static files required by a web app, for example HTML, audio, video, and other media files. By default, an Amazon ElastiCache cluster can be accessed only by resources in the same VPC.
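
A common way to put a cache in front of a database is the cache-aside pattern: look in the cache first and fall back to the database only on a miss. Here is a minimal sketch that uses a Python dictionary in place of an ElastiCache cluster. The `CacheAside` class, the `slow_lookup` function, and the 30-second expiry are assumptions made for illustration, not part of any AWS API.

```python
import time


class CacheAside:
    """Keep recently loaded values in memory for ttl_seconds."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: the database is not touched
        value = self._load(key)  # cache miss or expired entry
        self._entries[key] = (value, now + self._ttl)
        return value


db_calls = []


def slow_lookup(key):
    """Stand-in for a database query; records each call it receives."""
    db_calls.append(key)
    return f"profile-for-{key}"


cache = CacheAside(slow_lookup, ttl_seconds=30)
first = cache.get("user-1")
second = cache.get("user-1")
print(first, second, db_calls)
```

The second `get` is served from memory, so the stand-in database is called only once. Because the cached copy lives in volatile memory and expires, the database remains the source of truth.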

ElastiCache Redis vs Memcached

Amazon ElastiCache has two engines

  1. Amazon ElastiCache for Redis
  2. Amazon ElastiCache for Memcached.

Pricing

Amazon ElastiCache pricing is based on the node types, backup storage, and data transfer.

When to use Amazon ElastiCache

it is best used when you need microsecond latency, key-based queries, and specialized data structures.

for situations like leaderboards and real-time caching.

if the data is needed on every page load or every request.

Official Resources

Amazon ElastiCache

Amazon ElastiCache for Redis Documentation

Amazon ElastiCache for Memcached Documentation

Amazon ElastiCache Blog

Other Resources

AWS re:Invent 2019: Supercharge your real-time apps with Amazon ElastiCache (DAT208) by Pratibha Suryadevara

AWS re:Invent 2018: ElastiCache Deep Dive: Design Patterns for In-Memory Data Stores (DAT302-R1) by Michael Labib

AWS Certified Solutions Architect - Associate 2020 (8:38:40)

Time series

Amazon Timestream

Amazon Timestream is a serverless time-series database for IoT and operational applications. Time-series data is recorded over a period of time, such as stock prices and the temperature readings of a device. According to AWS, Amazon Timestream can store and analyze trillions of events per day up to 1,000 times faster and at as little as 1/10th the cost of relational databases. One major advantage of Amazon Timestream is its ability to keep recent data (hot data) in a memory store while moving historical data to a low-cost magnetic store. Queries can be run on both historical data and recent data. In addition, Amazon Timestream has built-in time-series analytics functions such as smoothing, approximation, and interpolation, which help in detecting patterns in data. The major concepts in Amazon Timestream are record, dimension, measure, timestamp, table, and database. Records are treated as append-only: individual records cannot be deleted, and data is instead removed when it ages out of the configured retention period.
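
The ideas of records, storage tiers, and interpolation can be sketched in a few lines of Python. This is a simplified model, not the Timestream API: the `Record` class, the `split_by_tier` and `interpolate` functions, and the 60-second retention value are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    dimensions: tuple     # metadata, e.g. (("device_id", "sensor-1"),)
    measure_name: str     # what is being measured
    measure_value: float  # the measured value
    timestamp: int        # seconds since some reference time


def split_by_tier(records, now, recent_retention_seconds):
    """Separate recent (hot) records from historical ones by age."""
    recent, historical = [], []
    for record in records:
        age = now - record.timestamp
        (recent if age <= recent_retention_seconds else historical).append(record)
    return recent, historical


def interpolate(earlier, later, timestamp):
    """Linearly estimate the measure value at a time between two records."""
    span = later.timestamp - earlier.timestamp
    fraction = (timestamp - earlier.timestamp) / span
    return earlier.measure_value + fraction * (later.measure_value - earlier.measure_value)


device = (("device_id", "sensor-1"),)
readings = [
    Record(device, "temperature", 20.0, 0),
    Record(device, "temperature", 30.0, 100),
]

recent, historical = split_by_tier(readings, now=130, recent_retention_seconds=60)
print(len(recent), len(historical))
print(interpolate(readings[0], readings[1], 50))
```

The reading taken at time 0 is older than the retention window and lands in the historical tier, while the one at time 100 stays in the recent tier. Both remain available to a query such as the interpolation at time 50.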

Pricing

Amazon Timestream pricing is based on writes, memory store, magnetic store, queries, and data transfer.

When to use Amazon Timestream

for time-series data from IoT devices.

for collecting and analysing operational metrics.

for analytical applications.

Sample code

Getting started with Amazon Timestream with Python

Official Resources

Amazon Timestream

Amazon Timestream Documentation

Amazon Timestream Blog

Other Resources

Getting Started with Amazon Timestream by Tony Gibbs

Deep Dive on Amazon Timestream by Tony Gibbs

Ledger Database

Amazon Quantum Ledger Database (QLDB)

Amazon QLDB is a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority. It is used to track data changes in applications and can be used for storing audit logs. It uses a SQL-like query language called PartiQL. It is also serverless.
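
To give a feel for what "cryptographically verifiable" means, here is a minimal sketch of an append-only journal in which every entry's hash depends on the entry before it. It is only an illustration built on Python's `hashlib`, not QLDB's actual journal format, and the `Journal` class and `entry_hash` function are hypothetical names.

```python
import hashlib
import json


def entry_hash(previous_hash, payload):
    """Hash the payload together with the hash of the previous entry."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(previous_hash.encode("utf-8") + body).hexdigest()


class Journal:
    GENESIS = "0" * 64  # placeholder hash used before the first entry

    def __init__(self):
        self._entries = []  # list of (payload, hash) in insertion order

    def append(self, payload):
        previous = self._entries[-1][1] if self._entries else self.GENESIS
        digest = entry_hash(previous, payload)
        self._entries.append((payload, digest))
        return digest

    def verify(self):
        """Recompute every hash; any edited entry breaks the chain."""
        previous = self.GENESIS
        for payload, digest in self._entries:
            if entry_hash(previous, payload) != digest:
                return False
            previous = digest
        return True


journal = Journal()
journal.append({"account": "A-1", "debit": 10})
journal.append({"account": "A-1", "credit": 4})
print(journal.verify())
```

Changing any earlier entry after the fact changes its hash, so the stored chain no longer matches and `verify` returns `False`. That is the property that makes a ledger useful for audit logs and financial records.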

Pricing

Amazon Quantum Ledger Database (QLDB) pricing is based on data transfer, data storage, and I/O.

When to use Amazon QLDB

best suited for economic and financial records.

for application data.

used in finance for tracking credit and debit transactions.

for HR systems to track employee payroll, bonuses, benefits, and other details.

for manufacturing to track product manufacturing history.

Official Resources

Amazon Quantum Ledger Database (QLDB)

Amazon Quantum Ledger Database Documentation

Amazon Quantum Ledger Database Blog

Other Resources

AWS re:Invent 2019: Amazon QLDB: An engineer’s deep dive on why this is a game changer (DAT380) by Andrew Certain

Building System of Record Applications with Amazon QLDB - AWS Online Tech Talks by Philip Simko and Michael Labib

Other resources on selecting the right database

AWS re:Invent 2017: [REPEAT] Which Database to Use When? (DAT310-R) by Tony Petrossian and Ian Meyers

Selecting the Right Database for Your Application by Joseph Idziorek

Choosing The Right Database by Randall Hunt

Whoa, so many databases and terminologies. I am sure you need a break. I hope you were able to understand the different databases in AWS, when to use them, and resources that will give you a deep dive.