Big data for the little guy

Building a serverless data warehouse on AWS with Firehose, Glue and Athena

Motivation

I work at a company with fewer than 10 people on the engineering team. Everyone wears multiple hats and is responsible for their piece of the product end to end. There are no BAs, QAs, DBAs or DevOps engineers. There is no fence to throw anything over.

This presented a challenge when we determined the aggregate data we were getting from Google Analytics/Firebase was not sufficient and that we needed a new system to track user engagement. It needed to be able to collect large volumes of data from all of our products and expose this for both ad-hoc querying and BI reporting – a data warehouse.

Since we use AWS, the obvious approach would have been to use Redshift. However, it’s expensive, and realistically the data we needed to store was going to be on the order of gigabytes per year – certainly not the terabytes or petabytes Redshift boasts of being able to handle. This was late 2018, so Athena and Glue had been released, but not Lake Formation. I was vaguely aware that these services existed, but hadn’t really worked out what they might be useful for. To be fair, the AWS FAQs don’t exactly clarify things:

Q: What can I do with Amazon Athena?

Amazon Athena helps you analyze data stored in Amazon S3. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC…

https://aws.amazon.com/athena/faqs/?nc=sn&loc=6

Not the most motivating bit of sales copy.

Two things helped it all fall into place for me:

  1. Reading the book Designing Data-Intensive Applications. This gives a contemporary overview of the database ecosystem and its underlying technologies. It helped me understand the Parquet file format (a compressed, columnar alternative to CSVs/JSON – see the small example after this list) and that Athena is actually just AWS hosting the open source Presto query engine for you. Additionally, AWS Glue is just a way of defining Apache Hive tables.
  2. Reading the engineering blog posts of a few big tech companies (Netflix, AirBnB, Uber). It turns out even these big players are not using commercial proprietary data warehouse systems, but are instead composing their own from similar sets of open source technologies for streaming, storage and querying.
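
To make the Parquet point concrete, here is a tiny sketch (assuming the pyarrow library is available; the field names are made up) that writes a handful of JSON-style records out as a Parquet file:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A few events that would otherwise be stored as one JSON object per line.
events = [
    {"event_name": "page_view", "user_id": "u-1", "duration_ms": 120},
    {"event_name": "click", "user_id": "u-2", "duration_ms": 45},
]

# Parquet stores each column contiguously and compresses it, which is why the
# files end up far smaller than the equivalent CSV/JSON and cheap to scan.
table = pa.Table.from_pylist(events)
pq.write_table(table, "events.parquet", compression="snappy")
```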

Implementation

It turns out the answers were all there in the AWS docs if you dig hard enough, but it took a while to work out how to combine the various services. We got there eventually and now have a cheap, low-maintenance and simple ‘data warehouse for the little guy’. The moving parts of our solution are:

  1. A simple custom API that accepts requests from our web applications and mobile apps, adds some additional data (timestamp, unique UUID) and sends this JSON on to AWS Firehose as a base64-encoded string (see the producer sketch after this list).
  2. An AWS Kinesis Firehose delivery stream which is configured to:
    1. Match the JSON against a specific Glue schema
    2. Convert the record to Apache Parquet format
    3. Buffer 15 minutes’ worth of events and then write them all to a specific S3 bucket using a year/month/day/hour folder structure.
  3. AWS Glue configured with a database, table and columns to match the format of the events being sent. Additionally, a Glue crawler is configured to run several times a day to discover the new folders being added to the S3 bucket and to update the schema with this partition metadata.
  4. AWS Athena to query the database as defined in Glue.
  5. External SQL query tools and Microsoft Power BI also query Athena using the ODBC driver.
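
For illustration, here is a minimal sketch of the producer side (item 1) using boto3; the delivery stream name, event fields and region are hypothetical, and our real API does a bit more than this:

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # hypothetical region


def track_event(event_name: str, payload: dict) -> None:
    """Enrich a raw event and push it onto the Firehose delivery stream."""
    record = {
        "id": str(uuid.uuid4()),  # unique id added by the API
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_name": event_name,
        **payload,
    }
    # Firehose expects bytes; the SDK base64-encodes the payload on the wire.
    firehose.put_record(
        DeliveryStreamName="analytics-events",  # hypothetical stream name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )


track_event("page_view", {"user_id": "u-123", "path": "/pricing"})
```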

At the time of writing, we have 6 GB of Parquet files stored in S3 using this system. That’s ~130 million events. The monthly costs to run this are wonderfully low:

  • S3 storage – $0.80
  • Athena – $0.30 (varies with how much data your queries scan; see the query sketch after this list)
  • Glue – $0.50
  • Kinesis Firehose – $0.50
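
Athena is billed by the amount of data each query scans, so the trick to keeping that line item small is to filter on the partition columns and select only the columns you need. A rough sketch of running such a query with boto3 follows; the database, table, column and output bucket names are hypothetical:

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # hypothetical region

# The WHERE clause filters on the partition columns, so only one day's worth of
# Parquet files is scanned.
query = """
    SELECT event_name, COUNT(*) AS events
    FROM analytics.events
    WHERE year = '2019' AND month = '06' AND day = '01'
    GROUP BY event_name
"""

execution_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```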

Gotchas

  • Modifying the schema in Glue (e.g. adding a new column) is not automatically picked up by Firehose. You need to update the stream configuration to use the new schema version.
  • The Glue crawlers have a number of configuration options. If you want to stay in full control of the schema and not have deleted columns reappear, the settings you want are:
    • Schema updates in the data store: Ignore the change and don’t update the table in the data catalog.
    • Object deletion in the data store: Ignore the change and don’t update the table in the data catalog.
  • I read that Parquet files perform best when they are ~1 GB in size. Ours are way smaller than that, but to get as close as possible we have the buffer settings in our Firehose stream at their maximum. One consequence of this is that when testing new events you have to remember to wait for the buffer to flush and to run the Glue crawler before trying to query the data.
  • A big one – Athena is a read-only service. If you want to modify or delete data, you need to work with the underlying files in S3. Our data is append-only so this hasn’t been much of an issue, but it’s definitely something to consider.
  • There is a one-to-one relationship between a Firehose delivery stream and an AWS Glue table.
  • At some point you or an upstream events producer will send through bad data.
    • Firehose emits a FailedConversion.Records CloudWatch metric. It’s useful to put some monitoring around this so you can react quickly (a sketch of such an alarm follows this list).
    • If an event does not match the configured schema, Firehose will put it in an S3 folder named ‘format-conversion-failed’. The raw data will be there in base64-encoded form.
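
As a rough example of that monitoring, the snippet below (stream name and SNS topic ARN are placeholders) creates a CloudWatch alarm that fires whenever any records fail format conversion in a five-minute window:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # hypothetical region

cloudwatch.put_metric_alarm(
    AlarmName="firehose-format-conversion-failures",
    Namespace="AWS/Firehose",
    MetricName="FailedConversion.Records",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "analytics-events"}],  # hypothetical stream
    Statistic="Sum",
    Period=300,  # 5-minute window
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-alerts"],  # hypothetical SNS topic
)
```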

Future plans

Overall, this system is working really well for us. A few things we’re considering for the future:

  • Investigating the new options in Lake Formation and Glue for ingesting changes directly from RDS databases. This would open up options for transitioning to more of a ‘data lake’ with data coming from multiple places.
  • Experimenting with reading and analyzing data with Databricks/Spark as an alternative to Athena – might be good for more involved queries.
  • Looking at using more complicated data structures such as struct and array columns.
  • Adding a downstream process to query the system and aggregate certain events into a format better suited for quick reporting.

The guide to Computer Engineering I wish I had: Part 1

Last month’s exams brought an end to four years of Computer Engineering (CE) studies at the University of Canterbury (UC). As I moved through the years, it became clear that the prospectus and other glossy brochures provided by the marketing department are rather misleading. They give the impression that students would be designing robots, computers and biomedical devices on a daily basis. Not true! I’d therefore like to delve more deeply into my journey through the degree, with the intention of providing some of the guidance to future students that I would have appreciated.

It should be noted that the Christchurch earthquakes of 2010 and 2011 struck in the first and second years of my studies, and in many ways shaped my university experience.

The Bachelor of Engineering with Honors degree at UC is split into a number of departments and specialties. All programs are four years of full-time study, along with a requirement of 50 days each of practical and professional work completed over the summer breaks, a first-aid course and two workshop courses. The first year is known as the ‘intermediate’ year, in which students must complete a number of prescribed courses, with some options depending on their desired specialty. If all the intermediate year courses are passed with sufficiently high grades (the actual marks required depend on the popularity and size of each program), students are admitted to the Engineering faculty proper and complete three ‘professional’ years.

My main interest going into university was programming, which would seem better aligned with a Computer Science degree. I ended up choosing CE on the advice of a number of family members, combined with the perceived higher prestige of an Engineering degree. UC has subsequently introduced a Software Engineering program, which I would have chosen had it been available at the time.

Below I go into detail about my experience of the various required courses; however if you want the short version, here it is:

Engineering sucks in Intermediate year but gets progressively more enjoyable as you advance.

I shall qualify that statement shortly, but it boils down to the fact that in intermediate year you are forced to do a lot of courses that have little to do with your desired specialty. I believe this is especially pronounced for a ‘fringe’ Engineering specialty such as CE. Much of the physics and mathematics taught at this level is far more relevant to Civil and Mechanical students than it is to CE students.

Intermediate Year Courses

Computer Science

Two first-year computer science courses are offered. COSC121 is the introductory programming course, using the Python language. COSC122 is a basic algorithms and data structures course. I found both to be well taught and they were my most enjoyable courses of the year. COSC121 is interesting because the distribution of marks in the class always produces a ‘double hump’. This is a well-documented occurrence in university ‘programming 101’ courses:

 It is as if there are two populations: those who can program, and those who cannot, each with its own independent bell curve.

COSC122 introduces sorting and searching algorithms, and various data structures. It is one of the most important courses you will take at university, if not the most important, as it is the one that will help you get a job! Every technical job interview I’ve had has asked about concepts from this course – with hashing showing up disproportionately often!

Maths

As with computer science, there are two maths papers, one each semester. I would classify both as ‘necessary evils’. They are tough papers with a lot of content pushed through. It is not uncommon for people to fail the first course, EMTH118, repeat it in the second semester and then take the second paper, EMTH119, in summer school.

The good thing about the maths courses is that they change little from year to year and thus a lot of resources are available and the lecturers are familiar with the material. I found the maths department courses to be the best organized of all of the intermediate year courses. The examinations are also fair and consistent, so good results can be obtained through hard work.

A third required maths course, EMTH171, is taught in the second semester. This introduces mathematical modelling using MATLAB. CE-oriented students typically excel in this course because they are already familiar with programming. There are a couple of things to be aware of, however. Firstly, the course is not offered in summer school, so a failure necessitates a second intermediate year. Secondly, the course contains two major assignments known as ‘case studies’ which are done in pairs. These contribute a significant percentage of the course grade, so it is important that you choose a good partner (choosing good group members is actually a consistent theme to having an enjoyable Engineering degree).

Physics

These are two pretty stock-standard courses, PHYS101 and PHYS102. The amount of content does make them fairly tough, however. The weekly three-hour labs and associated lab reports will be one of the worst parts of your year. Manually calculating uncertainties is one of the most boring things you can be made to do!

Engineering

ENGR101 is a first-semester course which is widely regarded as a waste of time. You do learn how to write an engineering report, though, which is important. Overall, you can get away with doing less work in this one.

In the next post I will talk about the professional years of the degree.