Big data for the little guy

Building a serverless data warehouse on AWS with Firehose, Glue and Athena

Motivation

I work at a company with fewer than 10 people on the engineering team. Everyone wears multiple hats and is responsible for their piece of the product end to end. There are no BAs, QAs, DBAs or DevOps engineers. There is no fence to throw anything over.

This presented a challenge when we determined the aggregate data we were getting from Google Analytics/Firebase was not sufficient and that we needed a new system to track user engagement. It needed to be able to collect large volumes of data from all of our products and expose this for both ad-hoc querying and BI reporting – a data warehouse.

Since we use AWS, the obvious approach would have been to use Redshift. However, it's expensive, and realistically the data we needed to store was going to be on the order of gigabytes per year. Certainly not the terabytes or petabytes Redshift boasts of being able to handle. This was late 2018, so Athena and Glue had been released, but Lake Formation had not. I was vaguely aware that these services existed, but hadn't really worked out what they might be useful for. To be fair, the AWS FAQs don't exactly clarify things:

Q: What can I do with Amazon Athena?

Amazon Athena helps you analyze data stored in Amazon S3. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC…

https://aws.amazon.com/athena/faqs/?nc=sn&loc=6

Not the most motivating bit of sales copy.

Two things helped it all fall into place for me:

  1. Reading the book Designing Data-Intensive Applications. This gives a contemporary overview of the database ecosystem and its underlying technologies. It helped me understand the Parquet file format (a compressed, columnar alternative to CSV/JSON) and that Athena is really just AWS hosting the open source Presto query engine for you. Similarly, the AWS Glue Data Catalog is essentially a managed, Hive-compatible way of defining tables over your data.
  2. Reading the engineering blog posts of a few big tech companies (Netflix, AirBnB, Uber). It turns out even these big players are not using commercial proprietary data warehouse systems, but are instead composing their own from similar sets of open source technologies for streaming, storage and querying.

Implementation

It turns out the answers were all there in the AWS docs if you dig hard enough, but it took a while to work out how to combine the various services. We got there eventually and now have a cheap, low maintenance and simple ‘data warehouse for the little guy’. The moving parts of our solution are:

  1. A simple custom API that accepts requests from our web applications and mobile apps, adds some additional data (a timestamp and a UUID) and sends the resulting JSON on to AWS Firehose as a base64-encoded record (see the sketch after this list).
  2. An AWS Kinesis Firehose delivery stream which is configured to:
    1. Match the JSON against a specific Glue schema
    2. Convert the records to Apache Parquet format
    3. Buffer 15 minutes' worth of events and then write them all to a specific S3 bucket in a year/month/day/hour folder structure.
  3. AWS Glue configured with a database, table and columns to match the format of events being sent. Additionally, a Glue crawler is configured to run several times a day to discover the new folders being added to the S3 bucket and to update the schema with this partition metadata.
  4. AWS Athena to query the database as defined in Glue.
  5. External SQL query tools and Microsoft Power BI also query Athena using the ODBC driver.
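
To make the first step more concrete, here is a minimal sketch of what the ingestion side might look like using boto3. The stream name, field names and enrichment logic are illustrative assumptions rather than our actual code, and boto3 takes care of encoding the record payload for you.

import json
import uuid
from datetime import datetime, timezone

import boto3

firehose = boto3.client('firehose')

def send_event(event_type, payload):
	# Enrich the raw event before it goes onto the stream (illustrative field names).
	record = dict(payload)
	record['id'] = str(uuid.uuid4())
	record['timestamp'] = datetime.now(timezone.utc).isoformat()
	record['event_type'] = event_type

	# Firehose buffers these records, converts them to Parquet against the Glue
	# schema and writes them to S3 in a year/month/day/hour folder structure.
	firehose.put_record(
		DeliveryStreamName='user-engagement-events',  # assumed stream name
		Record={'Data': (json.dumps(record) + '\n').encode('utf-8')},
	)

The trailing newline isn't needed for the Parquet conversion, but it keeps the records readable if you ever deliver the raw JSON to S3 without conversion.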

At the time of writing, we have 6 GB of Parquet files stored in S3 using this system. That's ~130 million events. The monthly costs to run this are wonderfully low:

  • S3 storage – $0.80
  • Athena – $0.30 (varies based on your query volume)
  • Glue – $0.50
  • Kinesis Firehose – $0.50

Gotchas

  • Modifying the schema in Glue (e.g. adding a new column) is not automatically reflected in Firehose. You need to update the stream configuration to use the new schema version.
  • The Glue crawlers have a number of configuration options. It turns out the settings you want if you wish to be in full control of the schema and not have deleted columns re-appear are:
    • Schema updates in the data store: Ignore the change and don’t update the table in the data catalog.
    • Object deletion in the data store: Ignore the change and don’t update the table in the data catalog.
  • I read that Parquet files perform best when they are ~1 GB in size. Ours are way smaller than that, but to get as close as possible we have the buffer settings on our Firehose stream at their maximums. One consequence of this is that when testing new events you have to remember to wait for the buffer to flush and to run the Glue crawler before trying to query the data (see the query sketch after this list).
  • This one is a biggie – Athena is a read-only service. If you want to modify or delete data, you need to work with the underlying files in S3. Our data is append-only, so this hasn't been much of an issue, but it's definitely something to consider.
  • There is a one-to-one relationship between a Firehose stream and an AWS Glue table.
  • At some point you or an upstream events producer will send through bad data.
    • Firehose emits a FailedConversion.Records CloudWatch metric. It’s useful to put some monitoring around this so you can react quickly.
    • If an event does not match the configured schema, Firehose will put it in an S3 folder named ‘format-conversion-failed’. The raw data will be there in base64-encoded form.
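
Because the crawler registers the year/month/day/hour folders as partition columns, filtering on them keeps the amount of data Athena scans (and therefore the cost) down. Here is a rough sketch of running such a query from Python with boto3; the database, table, column and bucket names are assumptions for illustration rather than our real schema.

import time

import boto3

athena = boto3.client('athena')

# Filtering on the partition columns registered by the Glue crawler means
# Athena only reads the relevant Parquet files from S3.
query = """
	SELECT event_type, count(*) AS events
	FROM events
	WHERE year = '2019' AND month = '06'
	GROUP BY event_type
"""

execution = athena.start_query_execution(
	QueryString=query,
	QueryExecutionContext={'Database': 'engagement'},  # assumed Glue database name
	ResultConfiguration={'OutputLocation': 's3://my-athena-results-bucket/'},  # assumed bucket
)
query_id = execution['QueryExecutionId']

# Athena runs queries asynchronously, so poll until this one finishes.
while True:
	state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
	if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
		break
	time.sleep(1)

if state == 'SUCCEEDED':
	rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']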

Future plans

Overall, this system is working really well for us. A few things we’re considering for the future:

  • Investigating the new options in Lake Formation and Glue for ingesting changes directly from RDS databases. This would open up options for transitioning to more of a ‘data lake’ with data coming from multiple places.
  • Experimenting with reading and analyzing data with Databricks/Spark as an alternative to Athena – might be good for more involved queries.
  • Looking at using more complicated data structures such as struct and array columns.
  • Adding a downstream process to query the system and aggregate certain events into a format better suited for quick reporting.

Porting a humongous Perl script to Python

The Problem

The first project I was tasked with at my new job involved porting a large (>18k lines long!) Perl script to Python. I knew from experience that trying to do this in one ‘big bang’ step was sure to result in the new version having a bunch of bugs that had been squashed out of the Perl script over years of development. Instead I took a more cautious approach, which is described in this post.

The Perl script in question is run as a console app. It takes a document id along with various optional arguments. The script locates/generates a number of URLs, writes them to a database and exits. It is invoked by another Perl script which reads the results from the database on completion, all within the lifecycle of a FastCGI API request.

Now, many years on from this script being created it seems an obvious fit for a ‘microservice’. Thus the goal is to both port the code to Python (as part of a company-wide push to consolidate languages) and to change it from a console app to a Flask API.

Interfacing Code

Going back to the cautious approach I mentioned above: fortunately the structure of the existing Perl script lent itself to being gradually ported over piece by piece. I looked for a way to interface it with the new Flask API, and the Python subprocess module looked like it would work nicely.

In terms of data transfer, I made a minor modification to the Perl script to write its output to stdout as JSON, rather than to the existing database (which I did not want the Python API to be coupled to). Writing data to stdout sounds fragile but I rationalised that this is exactly what Linux utilities piped to each other have been doing for years. It just means you have to be careful not to have any stray print statements floating around.

The interfacing Python code looks something like this:

import json
import logging
from subprocess import Popen, PIPE

logger = logging.getLogger(__name__)

def get_links_from_perl_script(start_process, process_info):
	input_list = [start_process, process_info]
	input_json = json.dumps(input_list)

	# Run the Perl script as a subprocess, piping the JSON input to its stdin.
	p = Popen(perl_script_path, stdin=PIPE, stdout=PIPE, stderr=PIPE, cwd=perl_script_dir)
	output, err = p.communicate(input_json.encode())
	rc = p.returncode

	if rc == 0:
		logger.info('Perl script finished successfully for start process: %s' % start_process)
		if err:
			logger.warning('But the following errors were reported: %s' % err)
		return json.loads(output)
	else:
		logger.error('Perl script exited with non-zero error code: %d for start process: %s. Error: %s' % (rc, start_process, err))

And on the Perl side:

use strict;
use JSON qw(encode_json decode_json);
my $str = do { local $/; <STDIN> };
my $decoded_json = decode_json($str);
# do stuff....
print encode_json(\@some_results);
exit(0);

One extra quirk is that I work on a Windows machine. Whilst options exist to install Perl on Windows, it definitely doesn’t seem to be a first class citizen. However we now have the WSL (Windows Subsystem for Linux)! My Ubuntu WSL already has Perl installed, so I wondered if I could get my Python Flask API to spin up a Perl subprocess on the WSL and pipe data to and from it. It turns out this is fairly easy. In the Python code above, the perl_script_path variable is declared as follows:

perl_script_path = 'wsl perl /mnt/c/Users/nware/Dev/this_project/humongous_script.pl'.split()

Note: a trick for young players is that this won’t work if you have a 32-bit version of Python installed. wsl.exe lives in the 64-bit System32 folder, which Windows redirects to SysWOW64 for 32-bit processes, so a 32-bit Python won’t be able to find it. Ideally just install 64-bit Python, but you can work around it with this magical incantation, which goes through the SysNative alias:

perl_script_path = [os.path.join(os.environ['SystemRoot'], 'SysNative', 'wsl.exe'), 'perl', '/mnt/c/Users/nware/Dev/this_project/humongous_script.pl']

Perl package management

A quick note on Perl package management. I was frustrated at the seemingly manual process of installing Perl packages with cpan. Coming from a Python/C#/JavaScript background, which all have good(ish) package management solutions, this seemed archaic. I went looking for something similar and found exactly what I was after: Carton.

Containerization

This all worked nicely for dev/test, but I wanted the Flask API in a Docker container for production. The tricky thing here is that containers are meant (for good reason) to run a single workload. Thus there are official Python base images and official Perl base images, but obviously none that have both.

I ended up creating an intermediary image, which is essentially the contents of the official Perl Dockerfile but with the base changed from buildpack-deps:stretch to python:3.6-stretch: https://hub.docker.com/r/nwareing/perl-python/.

Note: The long term goal of this project is to gradually port all of the Perl code to Python. When this is done, the interfacing code can be removed and we will just use the official Python Docker image as per normal.

I could then create my actual application Dockerfile as follows:

FROM nwareing/perl-python:latest

RUN cpanm Carton && mkdir -p /perl_src    
WORKDIR /perl_src

COPY ./perl_src/cpanfile /perl_src/cpanfile
RUN carton install

WORKDIR /app

RUN set -ex && pip install pipenv uwsgi

COPY ./Pipfile* /app/

RUN pipenv install --system --deploy

COPY ./perl_src/humongous_script.pl /perl_src/humongous_script.pl

COPY ./src /app

# Start uwsgi web server
ENTRYPOINT [ "uwsgi", "--ini", "uwsgi.ini" ]

Summary

In summary, the work of art / monstrosity I’ve created looks something like this:

Hosting Upsource with Docker – DNS Dilemmas

Currently at work we are using an open source source-code management tool called Kallithea. Unfortunately it doesn’t seem to be under active development any longer, and is in general a bit unstable and lacking the features we need in a growing development team. For me the biggest pain point was not having a nice web interface to browse and review code. We’re currently evaluating other options (BitBucket, GitHub, VSO/TFS etc.) and trying to decide whether to self-host or not. This process is taking a bit of time, so I went looking for something to tide us over until we came up with a more permanent solution. This led me to Upsource, one of JetBrains’ latest incarnations.

Upsource is a web-based tool for browsing and reviewing code. The handy thing with Upsource is that it tacks onto your existing source code hosting tool, rather than being an all-in-one like the systems we are looking at moving to. This allowed me to quietly install it without ruffling any feathers and let members of the team decide whether or not they wanted to use it. Luckily I had a spare Linux box running Ubuntu, on which I was quickly able to get it installed and hooked up to LDAP.

The interesting part came a month or so later when the next version of Upsource was released (February 2017). As well as a bunch of handy new features (full-text search FTW), they also announced that new versions were being published as Docker images. This sounded like a good idea, and one which would make future updates easier, so I followed the instructions to migrate my Upsource instance to being hosted under Docker. Unfortunately I found that after starting up the new version of Upsource inside a Docker container, it could no longer resolve internal URLs: neither those pointing to the source code repositories nor those pointing to the LDAP server.

A bit of Googling revealed that this was a known issue with Docker on recent versions of Ubuntu: https://github.com/docker/docker/issues/23910. It sounds like it’s resolved in the latest version of Docker, but I couldn’t work out whether that had been released yet.

Luckily someone had already written up a handy blogpost showing how to get around the issue: https://robinwinslow.uk/2016/06/23/fix-docker-networking-dns/#the-permanent-system-wide-fix.

I went with the ‘quick fix’ approach described there:

  1. I ran this command to find the IP address of the DNS server running inside my company’s network. This spat out two contiguous IPs for me, so I just chose the first one.
    $ nmcli dev show | grep 'IP4.DNS'
    IP4.DNS[1]:                             10.0.0.2
    IP4.DNS[2]:                             10.0.0.3
  2. Added a --dns 10.0.0.2 argument to the docker run command I used to start the Upsource container.

Problem solved!

TradeMe to Trello Chrome Extension

It can be tricky keeping track of a bunch of listings when you’re looking to join a new flat. Have I already contacted that person? When am I viewing this place? Have I heard back from them yet? The built-in functionality on TradeMe (the auction site just about everyone in New Zealand uses to list and look for flats) is just not up to the task.

Trello is a web application which allows you to create a custom set of lists and to move ‘cards’ back and forth between them. Many developers and others working in the tech industry are likely to be familiar with it already.

I’ve created a Chrome extension to link TradeMe and Trello together, making moving flats that little bit easier. Using this extension, it’s as simple as clicking the icon when you are on a TradeMe listing to have a card automatically created in Trello.

Trello board with TradeMe listings

May your flat hunting be forever more organised!

Get the extension: https://chrome.google.com/webstore/detail/trademe-to-trello/eapogjcjbcgaoocipcfcnedibnfdmlng?hl=en&gl=NZ.

The code is open source, and up on GitHub: https://github.com/nick-nz/trademe-to-trello.