Serverless is an architecture style that we're strongly drawn to at Beam - it allows us to move quickly and use the best tool for the job, allowing AWS to do the undifferentiated heavy lifting. We apply this same architecture preference to data analysis; so we built a serverless data lake.
Serverless data lake
We decouple our data storage from compute resource, which is a key architectural pattern when building out a data lake and is vital for cost-effective scalability. We use Amazon S3 to store vast amounts of data generated from connected vehicles; this is our long term data repository for analytics. Data received into the system is buffered and periodically written to S3 in batch to reduce the number of writes and thus keep costs in check.
The compute part of the analysis solution is provided by Amazon Athena. This is based on an open source distributed query engine called Presto SQL, which was developed by Facebook to power some of their privacy-infringing query workloads. It now has a massive community around it, an annual conference and many connectors to data repositories including S3 (via Apache Hive).
There are a number of great guides to general Amazon Athena tuning, so we won't repeat the advice here. See these articles from the teams at BryteFlow and Upsolver, as well as Amazon's own guidance:
- BryteFlow - https://bryteflow.com/how-to-get-your-amazon-athena-queries-to-run-5x-faster
- Upsolver - https://www.upsolver.com/blog/aws-athena-performance-best-practices-performance-tuning-tips
- Amazon Big Data Blog - https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena
We are going to focus on S3 partitioning strategies for connected car data in the remainder of this article.
To prevent wasteful (read: slow) scans of the data set when querying, partitioning the data when writing to S3 is vital; this allows the Athena query engine to only process the subset of the data required to answer the query.
As data is ingested from the fleet, buffered then written to S3, any given time slice is a mixture of data from the entire fleet. As a result, there is little option other than to write into a date specific partition, for example "year=2020/month=09/day=01/". One observation here is that data received just before midnight will actually get written into a partition dated with the following day, so date-based queries that need to be highly accurate cannot just use the partition date. Thinking more deeply about this, in any IoT system where you expect the device to be disconnected from the cloud for a period of time, as is the case for vehicles parking in underground car parks, the ingress timestamp can be wildly different to the timestamp of data capture. So although date based partitioning is better than no partitioning, it is in fact not particularly useful for anything other than approximate queries.
This leads us onto mechanisms to repartition the data some time after it lands in S3. The simplest approach here is to use an ETL tool, such as AWS Glue, which features a fully managed Apache Spark infrastructure. If you don't know Apache Spark then just know that is an open-source, distributed, general-purpose cluster-computing framework for big data - exactly what we need. The AWS Glue value-add makes this serverless, fully managed and takes the pain out of provisioning all the infrastructure.
At Beam, we have taken the approach of performing a nightly process to repartition the data into vehicle and date based partitions. This means that, whilst we can still serve queries on data that hasn't yet been repartitioned, performing historical queries against single vehicles is significantly faster. We do in fact repartition into multiple structures to serve single vehicle queries and fleet wide queries - the small cost of storing duplicate data and answering queries quickly works out cheaper than running slow running queries across a single, unoptimised data set.
Hopefully this discussion has been interest. As ever, please do get in touch if you want to discuss further.