Big data for the little guy

Building a serverless data warehouse on AWS with Firehose, Glue and Athena

Motivation

I work at a company with fewer than 10 people on the engineering team. Everyone wears multiple hats and is responsible for their piece of the product end to end. There are no BAs, QAs, DBAs or DevOps engineers. There is no fence to throw anything over.

This presented a challenge when we determined that the aggregate data we were getting from Google Analytics/Firebase was insufficient and that we needed a new system to track user engagement. It needed to collect large volumes of data from all of our products and expose it for both ad-hoc querying and BI reporting – a data warehouse.

Since we use AWS, the obvious approach would have been Redshift. However, it’s expensive, and realistically the data we needed to store was going to be on the order of gigabytes per year – certainly not the terabytes or petabytes Redshift boasts of handling. This was late 2018, so Athena and Glue had been released, but not Lake Formation. I was vaguely aware that these services existed, but hadn’t really worked out what they might be useful for. To be fair, the AWS FAQs don’t exactly clarify things:

Q: What can I do with Amazon Athena?

Amazon Athena helps you analyze data stored in Amazon S3. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC…

https://aws.amazon.com/athena/faqs/?nc=sn&loc=6

Not the most motivating bit of sales copy.

Two things helped it all fall into place for me:

  1. Reading the book Designing Data-Intensive Applications. This gives a contemporary overview of the database ecosystem and underlying technologies. It helped me understand the Parquet file format (a compressed columnar alternative to CSVs/JSON) and that Athena is actually just AWS hosting the open source Presto query engine for you. Additionally, AWS Glue is just a way of defining Apache Hive tables.
  2. Reading the engineering blog posts of a few big tech companies (Netflix, Airbnb, Uber). It turns out even these big players are not using commercial proprietary data warehouse systems, but are instead composing their own from similar sets of open source technologies for streaming, storage and querying.

Implementation

It turns out the answers were all there in the AWS docs if you dig hard enough, but it took a while to work out how to combine the various services. We got there eventually and now have a cheap, low-maintenance and simple ‘data warehouse for the little guy’. The moving parts of our solution are:

  1. A simple custom API that accepts requests from our web applications and mobile apps, adds some additional data (timestamp, unique UUID) and sends the resulting JSON on to AWS Firehose as a base64 encoded string (see the sketch after this list).
  2. An AWS Kinesis Firehose delivery stream which is configured to:
    1. Match the JSON with a specific glue schema
    2. Convert the record to Apache Parquet format
    3. Buffer 15 minutes’ worth of events, then write them all to a specific S3 bucket in a year/month/day/hour folder structure.
  3. AWS Glue configured with a database, table and columns to match the format of events being sent. Additionally, a Glue crawler is configured to run several times a day to discover the new folders being added to the S3 bucket and to update the schema with this partition metadata.
  4. AWS Athena to query the database as defined in Glue.
  5. External SQL query tools and Microsoft Power BI also query Athena using the ODBC driver.
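
To make step 1 concrete, here’s a minimal sketch of the ingest side. It assumes Python with boto3 and a hypothetical delivery stream named events-stream; our real API does more validation than this:

```python
import json
import time
import uuid

import boto3

firehose = boto3.client("firehose")

def track_event(event: dict) -> None:
    """Enrich a raw client event and send it to the Firehose delivery stream."""
    event["id"] = str(uuid.uuid4())        # unique id per event
    event["timestamp"] = int(time.time())  # server-side timestamp

    # boto3 base64-encodes the Data blob at the wire level, so we just hand
    # it the JSON bytes. A trailing newline keeps records separated if
    # format conversion is ever turned off.
    firehose.put_record(
        DeliveryStreamName="events-stream",  # hypothetical stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

track_event({"type": "page_view", "product": "web", "path": "/pricing"})
```

For higher volumes, put_record_batch accepts up to 500 records per call, which cuts the API overhead considerably.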

At the time of writing, we have 6 GB of Parquet files stored in S3 using this system – roughly 130 million events. The monthly costs to run this are wonderfully low:

  • S3 storage – $0.80
  • Athena – $0.30 (varies with how much data your queries scan – see the sketch below)
  • Glue – $0.50
  • Kinesis Firehose – $0.50
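
Athena’s pricing is based on the amount of data each query scans, so the way to keep that line item low is to lean on the partition columns the crawler discovers and on Parquet’s columnar layout. Here’s a hedged sketch using boto3 – the analytics database, events table and results bucket are made-up names, and your partition column names will depend on how the crawler registered the year/month/day/hour folders:

```python
import boto3

athena = boto3.client("athena")

# Filtering on partition columns means Athena only reads the Parquet files
# under the matching S3 prefixes instead of scanning the whole table.
query = """
SELECT type, count(*) AS events
FROM events
WHERE year = '2019' AND month = '06'
GROUP BY type
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

Selecting only the columns you need helps too, since Parquet lets Athena skip the columns a query never touches.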

Gotchas

  • Modifying the schema in Glue (eg. adding a new column) does not automatically reflect in Firehose. You need to modify the stream config to use the new schema version.
  • The Glue crawlers have a number of configuration options. It turns out the settings you want if you wish to be in full control of the schema and not have deleted columns re-appear are:
    • Schema updates in the data store: Ignore the change and don’t update the table in the data catalog.
    • Object deletion in the data store: Ignore the change and don’t update the table in the data catalog.
  • I’ve read that Parquet files perform best when they are around 1 GB in size. Ours are much smaller than that, but to get as close as possible we have the buffer settings in our Firehose stream at their maximums. One consequence of this is that when testing new events you have to remember to wait for the buffer to flush and to run the Glue crawler before trying to query the data.
  • Finally, and this is a biggie – Athena is a read only service. If you want to modify or delete data, you need to work with the underlying files in S3. Our data is append only so this hasn’t been much of an issue, but definitely something to consider.
  • There is a one-to-one relationship between a Firehose delivery stream and an AWS Glue table.
  • At some point you or an upstream events producer will send through bad data.
    • Firehose emits a FailedConversion.Records CloudWatch metric. It’s useful to put some monitoring around this so you can react quickly (see the alarm sketch after this list).
    • If an event does not match the configured schema, Firehose will put it in an S3 folder named ‘format-conversion-failed’. The raw data will be there in base64 encoded form.
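
As a starting point for the monitoring mentioned above, here’s a sketch of a CloudWatch alarm on that metric. It assumes boto3, a delivery stream named events-stream and a pre-existing SNS topic for alerts (both names are hypothetical):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert as soon as any record fails Parquet conversion in a 5 minute window.
cloudwatch.put_metric_alarm(
    AlarmName="firehose-format-conversion-failures",
    Namespace="AWS/Firehose",
    MetricName="FailedConversion.Records",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "events-stream"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # no data simply means no failures
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-alerts"],
)
```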

Future plans

Overall, this system is working really well for us. A few things we’re considering for the future:

  • Investigating the new options in Lake Formation and Glue for ingesting changes directly from RDS databases. This would open up options for transitioning to more of a ‘data lake’ with data coming from multiple places.
  • Experimenting with reading and analyzing data with Databricks/Spark as an alternative to Athena – might be good for more involved queries.
  • Looking at using more complicated data structures such as struct and array columns.
  • Adding a downstream process to query the system and aggregate certain events into a format better suited for quick reporting.

Grand Canyon/Zion Trip

I’ve just returned from a short, but very enjoyable trip down to the Grand Canyon and Zion National Parks.

A few thoughts for anyone organising something similar…

Camping

In the Grand Canyon I stayed at the huge NPS Mather Campground, which was nice enough. However it sells out quickly, and for those who don’t want to plan so far ahead or would rather not spend the $$$, I’ve since discovered that there is dispersed camping available in Kaibab National Forest, just outside the park. Details here.

Food

On the drive from the Grand Canyon to Zion, Kanab is a good place to stop for food. I would particularly recommend the Kanab Creek Bakery:
http://www.kanabcreekbakery.com/.

Yellowstone Logistics

Here’s a quick summary of how I organised my six day trip to Yellowstone.

  1. Flew Vancouver -> Salt Lake City. Picked up rental car, bought gas + groceries and drove up to near the West entrance of Yellowstone. Slept in the back of the rental off a forest service road.
  2. Entered via West Yellowstone and drove anti-clockwise around the loop, sorting out my backcountry permits in Grant Village. I then continued around the loop to the Cascade Lake 4K5 Trailhead just north of Canyon Village and hiked into my campsite at Wolf Lake.
  3. Hiked back out, checked out the canyon and then drove to Mammoth Hot Springs via Tower Junction. After walking around the hot springs I drove down to the Biscuit Basin parking lot, the trailhead for my hike to that night’s campsite at Firehole Falls.
  4. Hiked out and visited a bunch more geysers and hot springs in the Old Faithful area. Continued driving to the DeLacy Creek parking lot, where I started walking to my next campsite, ‘Bluff Top’, on the shores of Shoshone Lake.
  5. Hiked out and drove out the South Yellowstone gate and through Grand Teton National Park. Had an early dinner in Jackson before continuing south to a free campsite I found on Wikicamps, next to the Salt River – Whitetail Lane Recreation Area, just outside Afton.
  6. Drove down to Park City and did a few hours of mountain biking before driving over the hill to Salt Lake City and flying home.

4 states visited – Utah, Idaho, Montana (briefly), Wyoming

Identifying scam websites

I’m writing this post after growing increasingly frustrated with the ‘tips’ that news sites put out each time a prominent scam does the rounds. They tend to give a bunch of general, waffly information that misses the point entirely. I’m here to tell you how you can avoid 99% of scams with this ‘one simple trick’:

Look At The URL

That’s the text in the bar at the top of your browser (aka the web address).

The usual tip that gets dished out is to check for ‘https’ and the little padlock, and it’s true you should do that. If a site is asking for your credit card details and doesn’t have those two things then you should definitely get out of there fast. Unfortunately, any semi-sophisticated scam site will have no trouble putting those two things in place. Their absence is a strong indicator of a scam, but their presence is no guarantee that it isn’t one.

I’m going to use Air New Zealand as an example here, for no other reason than I’ve seen a few scams attempting to fool people into thinking they can get free flights. This same information is relevant for any website.

What you really need to be looking at is the domain and top-level domain of the URL. What is this techno-gibberish I speak of? Here’s a quick lesson:

URL Structure

A URL consists of a few parts:

  1. Protocol – either http or https. The ‘s’ stands for secure and tells you whether communication between your browser and the web server is encrypted.
  2. The sub-domain. This could simply be www or it could be mail in the case of mail.google.com. It could also be blah.something.else. Usually there is one sub-domain, but there can also be none or multiple. The good news is that we don’t care about it when checking for a scam site.
  3. The domain is the unique word(s) that the site developer purchases. It’s a bit like buying a bit of land – as long as you continue paying your rates, you can keep it.
  4. The top-level domain is the top of the hierarchy and is used to segment the internet into smaller chunks. .com is probably the most well known top-level domain. Country specific top-level domains are also common, eg. .co.nz. There’s also .gov, .net and a whole swath of more esoteric ones.
  5. The path is everything following the first slash after the top-level domain. This is just how the website developer has organised their site and is also unimportant when identifying a scam site.

A website developer purchases a combination of domain & top-level domain which must be unique across the entire internet.

What is important is the domain and top-level domain and their proximity to each other in the URL. Look closely at the periods. The hyphens don’t count!

Often a scam site will register a longer domain that includes the same word(s) as those in the legitimate site. eg. https://www.airnewzealand-freeflights.co.nz. Or they’ll put the word(s) from the legitimate site into a sub-domain eg. https://airnewzealand.freeflights.co.nz. With your new knowledge of URLs you should see that these are obvious scams. For instance, in the second example the scammer has legitimately purchased the domain freeflights.co.nz and has then added a sub-domain to try and make it look like the real deal.
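
Extra for the technically inclined: this check is mechanical enough to automate. Here’s a rough sketch in Python using the third-party tldextract library, which knows about multi-part endings like .co.nz (the URLs are the examples from above):

```python
# pip install tldextract
import tldextract

urls = [
    "https://www.airnewzealand-freeflights.co.nz",
    "https://airnewzealand.freeflights.co.nz",
    "https://www.airnewzealand.co.nz",
]

for url in urls:
    ext = tldextract.extract(url)
    # registered_domain is the part someone had to purchase: domain + TLD
    print(url, "->", ext.registered_domain)

# Prints:
# https://www.airnewzealand-freeflights.co.nz -> airnewzealand-freeflights.co.nz
# https://airnewzealand.freeflights.co.nz -> freeflights.co.nz
# https://www.airnewzealand.co.nz -> airnewzealand.co.nz
```

Only the last one is Air New Zealand’s real registered domain; the other two belong to whoever registered them.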

Here are some other examples. The key thing is to know what Air New Zealand’s normal/legit domain is. The best way to double check is to do a Google search and look at the first (non-ad) result:

Once you know that, see if you can identify whether these are safe sites or not. Note that I haven’t made them clickable like normal links (except for the last one).

  • https://www.airnewzealand.co.nz/ – safe. The domain is airnewzealand, which is what we expect.
  • https://www.airnewzealnd.co.nz/ – not safe. Registering a domain with a slightly different spelling is a common trick among scammers.
  • https://www.air.newzealand.co.nz/ – not safe. The domain here is newzealand and they have added air as a subdomain to make it appear at a quick glance that it is the same site.
  • https://www.air.newzealand-online.co.nz/ – not safe, for similar reasons. The domain here is newzealand-online.
  • https://flightbookings.airnewzealand.co.nz/vbook/actions/ext-search?searchLegs[0].originPoint=YVR&searchLegs[0].destinationPoint=&promoCode=&adults=0&children=0&tripType=return&searchType=flexible&internalRevenueSource=cms-book-book-now-button%7Cbook&_ga=2.147022391.1465521394.1536551535-1663654223.1536551535 – safe. Don’t be put off by the fact that it looks long and scary. All you need to care about is that airnewzealand is separated by a period in front and is the last part of the URL before the top-level domain .co.nz.
  • https://flightbookings-airnewzealand.co.nz/vbook/actions/ext-search?searchLegs[0].originPoint=YVR&searchLegs[0].destinationPoint=&promoCode=&adults=0&children=0&tripType=return&searchType=flexible&internalRevenueSource=cms-book-book-now-button%7Cbook&_ga=2.147022391.1465521394.1536551535-1663654223.1536551535 – not safe. Hopefully you noticed the hyphen before airnewzealand. The registered domain here is flightbookings-airnewzealand, which is a completely different address.
  • https://www.airnewzealand.co.nz – not safe. Be careful: the text you read on the page (or in a text, Facebook or WhatsApp message) may be pointing to a different site. Which URL did you end up on when you clicked this link? In general it’s OK if you follow a link and end up on a dodgy site, as long as you then check the URL and exit. Bad stuff only happens once you interact with the site in some way, eg. download a file or enter personal information.

Hopefully this helps someone out there!

Extra for experts: the first section of this article is an awesome introduction to how DNS works: https://hacks.mozilla.org/2018/05/a-cartoon-intro-to-dns-over-https/.

Update (22/01/2019): Google just launched a great quiz which can test your knowledge of this stuff. Check it out: https://phishingquiz.withgoogle.com/.