Kinesis Firehose & Analytics: a simple serverless data pipeline

May 22, 2020 in #aws #kinesis #firehose #bigdata #glue

Realtime data pipeline with AWS managed services, ETL and analytics, deployed with Terraform

Your takeaways from this post

  • Understanding what Kinesis Firehose is
  • How to manage and analyse a huge stream of realtime data
  • How to quickly deploy a data pipeline in AWS, with Terraform

Collect the data with Kinesis Firehose

When you deal with monitoring metrics/logs, or with realtime events from multiple clients or IoT devices, you need a robust way to receive the load, process it, and store it somewhere for immediate or later analysis. A simple database (MySQL, Postgres) cannot keep up with such heavy traffic.

AWS Kinesis is made for these cases. It makes it easy to collect, process, and analyse realtime streaming data so you can get timely insights and react quickly to new information.

Kinesis comes in 3 flavors:

  • Data streams: collects realtime data and is robust enough for heavy loads (terabytes per hour), but you need to manually provision shards to handle the volume. Data can then be delivered to Analytics, Firehose, EMR, EC2 or Lambda. This service is similar to Kafka or Google Pub/Sub

  • Firehose: the simple version of data ingestion & delivery, easy to configure (it automatically scales to meet demand). You typically use Firehose to capture data and send it to S3. Our demo will use Firehose because of its simplicity of deployment and configuration. In my case, I don't need heavy transformation on the incoming events; I just need to collect the data and send it to S3 for later SQL analysis.

  • Analytics: process data streams in realtime with SQL or Java without having to learn new programming languages or processing frameworks. This service is similar to Spark. I will not use Kinesis Analytics in my demo, but simpler SQL query services instead (Athena & QuickSight).
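As a sketch, a Firehose delivery stream writing to S3 can be declared in Terraform roughly like this (the resource names, IAM role, and bucket are assumptions for illustration, not the repo's actual code):

```hcl
# Hypothetical names; the IAM role and S3 bucket are assumed to be defined elsewhere.
resource "aws_kinesis_firehose_delivery_stream" "events" {
  name        = "demo-events"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose.arn # role allowing Firehose to write to the bucket
    bucket_arn = aws_s3_bucket.events.arn

    # Firehose buffers incoming records and flushes to S3 when
    # either limit is reached first
    buffer_size     = 64  # MB
    buffer_interval = 300 # seconds
  }
}
```

There is no shard provisioning here: unlike Data streams, Firehose scales the ingestion capacity on its own.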

Data analytics

Once the data is in S3, AWS offers several managed services to query it:

  • Athena: to run SQL queries on the data set, similar to any SQL database engine
  • QuickSight: to build dashboards by querying the data on S3 with SQL.

But when looking at the data in the S3 bucket, there are no indexes that would make it queryable with SQL. To remedy this, we will use AWS Glue, a fully managed extract, transform, and load (ETL) service that makes it easy for developers to prepare and load their data for analytics. Glue runs a crawler every 5 minutes, which analyses the S3 data and outputs an index representing the format of the data. This index is stored in a catalog that is available to other services (like Athena or QuickSight) for running SQL queries.
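In Terraform, the crawler and its catalog database could be sketched as follows (hypothetical names; the IAM role and bucket are assumed to exist):

```hcl
# Hypothetical names; the Glue IAM role and S3 bucket are assumed to exist.
resource "aws_glue_catalog_database" "events" {
  name = "demo_events"
}

resource "aws_glue_crawler" "events" {
  name          = "demo-events-crawler"
  database_name = aws_glue_catalog_database.events.name
  role          = aws_iam_role.glue.arn
  schedule      = "cron(0/5 * * * ? *)" # run every 5 minutes

  # The crawler scans this S3 path and infers the table schema
  s3_target {
    path = "s3://${aws_s3_bucket.events.bucket}"
  }
}
```

The inferred tables land in the catalog database, which is exactly what Athena and QuickSight read from.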


From there, we can build queries and dashboards to analyse the data set in realtime.

Deploy Kinesis & Analytics

To show you a simple data pipeline with Kinesis, I prepared a demo using Terraform to deploy only AWS managed services.

We will generate events via a ReactJS app running locally. Kinesis Firehose will collect these (JSON) events, convert them to Parquet, and store them in S3. We will then be able to query this data with Athena and QuickSight, with the help of Glue's schema discovery.
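The JSON-to-Parquet step is handled by Firehose's built-in record format conversion, which reads the target schema from a Glue table. A sketch of the relevant block, which nests inside the delivery stream's `extended_s3_configuration` (the Glue database/table and role names are assumptions):

```hcl
# Nested inside extended_s3_configuration of the Firehose delivery stream.
data_format_conversion_configuration {
  input_format_configuration {
    deserializer {
      open_x_json_ser_de {} # parse incoming JSON records
    }
  }
  output_format_configuration {
    serializer {
      parquet_ser_de {} # write columnar Parquet files to S3
    }
  }
  schema_configuration {
    # Glue table describing the record schema (hypothetical names)
    database_name = aws_glue_catalog_database.events.name
    table_name    = aws_glue_catalog_table.events.name
    role_arn      = aws_iam_role.firehose.arn
  }
}
```

Storing the events as Parquet rather than raw JSON makes the later Athena queries both faster and cheaper, since Athena bills per byte scanned.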

Infra


  • Cloud: AWS
  • Front: ReactJS app shill-your-coin that generates dummy events (running locally)
  • Kinesis Firehose: to ingest realtime data (similar to Kafka) and store it in S3
  • Cognito: our identity provider, it will authorise the user to send data to Firehose
  • S3: to easily store a huge amount of data
  • Glue: it will analyse your data in S3, make sense of it, and output metadata representing your data index so it can be queried later (a kind of mapper, or catalog)
  • Athena: with the index created by Glue, we can do SQL queries on our S3 data, in order to find patterns and create reports
  • QuickSight: same as Athena, it will use Glue and S3 data in order to create dashboards representing the data
  • Code source: Github
  • Deployment: Terraform describes all components to be deployed. One command line will setup the infra
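Once Glue has catalogued the data, even the Athena side can live in Terraform as a named query. A sketch, where the table and column names are made up for illustration:

```hcl
resource "aws_athena_named_query" "events_per_type" {
  name     = "events-per-type"
  database = aws_glue_catalog_database.events.name

  # The table and columns below are hypothetical placeholders.
  query = <<-SQL
    SELECT event_type, COUNT(*) AS total
    FROM events
    GROUP BY event_type
    ORDER BY total DESC;
  SQL
}
```

Keeping queries in Terraform means the whole pipeline, from ingestion to reporting, is reproducible with the same one-command deploy.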

Get the code

Check out the Github repo to deploy the infra.

Conclusion

  • Managing a huge flow of realtime data is really easy with Kinesis Firehose: you don't need to manage any servers, and it scales up and down with the load. It is a serious competitor to Kafka, which requires significant skill to configure and maintain.
  • Glue eases all efforts by discovering the data schema. Data can change with time and Glue will make adjustments to the schema.
  • Athena and QuickSight are again totally managed services, so the cost and headaches are minimal.
  • S3 is always the perfect fully managed place to send terabytes of data.

Thank you for reading :-) See you in the next post!
