When you deal with monitoring metrics, logs, or real-time events from multiple clients or IoT devices, you need a robust way to receive the load, process it, and store it somewhere for immediate or later analysis. A simple database (MySQL, Postgres) will not be able to keep up with such heavy traffic.
AWS Kinesis is made for these cases. It makes it easy to collect, process, and analyse real-time streaming data so you can get timely insights and react quickly to new information.
Kinesis comes in 3 flavors:
Data Streams: collects real-time data and stays robust under heavy load (terabytes per hour). You need to manually provision shards to handle the volume; the data can then be delivered to Analytics, Firehose, EMR, EC2, or Lambda. This service is similar to Kafka or Google Pub/Sub.
Firehose: the simple way to ingest and deliver data. It is easy to configure (it automatically scales to meet demand); you usually use Firehose to capture data and send it to S3.
Our demo will use Firehose because of its simplicity of deployment and configuration. In my case, I don't need heavy transformations on the incoming events: I only need to collect the data and send it to S3 for later SQL analysis.
Analytics: processes data streams in real time with SQL or Java, without having to learn new programming languages or processing frameworks. This service is similar to Spark. I will not use Kinesis Analytics in my demo, but simple SQL query services instead (Athena & QuickSight).
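To make the Firehose option concrete, here is a minimal Python sketch of what sending an event to a delivery stream looks like. The `encode_event` helper and the stream name `events-stream` are my own assumptions for illustration; the actual delivery call (shown in comments) requires boto3 and AWS credentials, which is why it is not executed here.

```python
import json

def encode_event(event: dict) -> bytes:
    """Firehose delivers raw bytes; newline-delimited JSON keeps
    records separable once they are concatenated in S3 objects."""
    return (json.dumps(event) + "\n").encode("utf-8")

record = encode_event({"coin": "DOGE", "action": "shill", "ts": 1})

# With boto3 installed and credentials configured, the delivery call is:
#   import boto3
#   firehose = boto3.client("firehose")
#   firehose.put_record(
#       DeliveryStreamName="events-stream",  # hypothetical stream name
#       Record={"Data": record},
#   )
```

Because Firehose scales automatically, this is all the client-side code needs: no shard management, unlike Data Streams.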
Once the data is present in S3, AWS has several managed services to query it (Athena, QuickSight).
But when looking at data in the S3 bucket, there is no schema available to run SQL against it. To remedy this, we will use AWS Glue, a fully managed extract, transform, and load (ETL) service that makes it easy for developers to prepare and load their data for analytics. Glue runs a crawler every 5 minutes, which analyses the S3 data and infers a schema describing the format of the data. This schema is stored in a catalog that is available to other services (like Athena or QuickSight) so they can run SQL queries.
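Once the crawler has populated the catalog, Athena can query the table directly. A minimal Python sketch, where the database name `events_db`, the table name `events`, and the results bucket are my own placeholder names; the actual Athena call (in comments) needs boto3 and credentials:

```python
def athena_query(database: str, table: str, limit: int = 10) -> str:
    """Build a simple SQL query against a table that the Glue
    crawler registered in the data catalog."""
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'

sql = athena_query("events_db", "events")  # hypothetical catalog names

# With boto3 installed, Athena runs the query asynchronously:
#   import boto3
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=sql,
#       ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
#   )
```

Athena reads the Parquet files straight from S3 using the catalog schema, so no data needs to be loaded into a database first.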
From there, we are able to build queries and dashboards to analyse the data set in near real time.
To show you a simple data pipeline with Kinesis, I prepared a demo using Terraform to deploy only AWS managed services.
We will generate events via a ReactJS app, running locally. Kinesis Firehose will collect these (JSON) events, convert them to Parquet, and store them in S3. We will be able to query this data with Athena and QuickSight, with the help of Glue's data catalog.
shill-your-coin, the app that generates dummy events (running locally)
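A dummy event like the ones shill-your-coin emits could look like the following sketch. The field names and coin list are my own assumptions, not the app's actual payload:

```python
import random
import time

COINS = ["BTC", "ETH", "DOGE"]

def make_event() -> dict:
    """Dummy 'shill' event, mimicking what the local app could emit
    before it is pushed into the Firehose delivery stream."""
    return {
        "coin": random.choice(COINS),
        "event": "shill",
        "timestamp": int(time.time()),
    }
```

Each generated event would then be serialized to JSON and pushed to Firehose, which batches them and writes Parquet files to S3.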
Check out the GitHub repo to deploy the infra.
Thank you for reading :-) See you in the next post!