Amazon Athena - Resources, Tips, and Tricks
Amazon Athena is an interactive query service built on the Presto (or PrestoDB) SQL query engine, which allows you to analyze large volumes of data in S3 efficiently. With minimal setup required, you can begin querying your data via standard SQL in minutes. As a serverless technology service, Athena users are not responsible for maintaining or covering the cost of infrastructure. Amazon charges for Athena based on use - charging by the TB of data scanned by the queries you run. Read more on Athena pricing and calculate your estimated price per TB here!
To get the most out of Athena at the lowest cost, users should be familiar with tools that limit how much data is scanned by any given query - understanding partitions, limits, compression, and columnar formatting is critical for an efficient Athena experience.
Let’s walk through the following example of the ‘athena_rocks’ table, which is pulling JSON files from an S3 data bucket called ‘athena_tips_and_tricks’ and parsing them into an Athena table.
Partitions and Limits
Partitions allow us to split up our data into more manageable chunks. With partitions we can be more specific about what data we are interested in scanning, thereby reducing the amount of data scanned.You’ll notice the “PARTITIONED BY” statement in the table above. Here I am telling Athena how my data will be divided up so that when I query the data I can specify which folder in S3 I want to scan:
Assuming we have already told Athena to create a partitions for 4/21/2022:
ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (folder_name='{partition}', year='{year}', month='{month}', day='{day}') LOCATION 's3://athena_tips_and_tricks/wls/consulting/{tableFolder}/{partition}/{year}/{month}/{day}/
The SELECT statement will pull data from s3://athena_tips_and_tricks/wls/consulting/partitions/2022/4/21 because we used a WHERE clause to specify that we want to query data partitioned by a specific date (4/21/2022).
Let’s say we have a table with no partitions that is pointed at an S3 location with TBs of data collected everyday since November 30, 2016 (Athena’s launch date). If we ran a SELECT * statement on the table even with the WHERE clause, we would scan close to 2,000 TBs of data! That’s a lot, especially if we are only interested in a particular day, month, or year. See how powerful partitions can be?
Even 1 TB of data is a lot to scan at once – especially if we just want to take a peak at what our query is spitting out. To get even more restrictive about the amount of data we tell Athena to scan, we can add limits:
In addition to being smart about how we partition and limit data, you can prevent a big data accident by setting data usage control limits.
More on partitions, limits, and Presto documentation.
Compression and Columnar Formatting
Another important way to limit the amount of data we are regularly scanning is via compression and file formatting.
Athena supports a number of open-sourced file formats including Apache Parquet, CSV, and JSON. Using a SerDe (Serializer/Deserializer), in our case Hive JSON SerDe, allows Athena to interact with data in different formats. More specifically, we are using LazySimpleSerDe, which allows us to interact with JSON data as events.
There is a tradeoff between efficiency and cost when it comes to compressing files. When we compress files, we are repackaging our data so Athena reads less during any given query, but unpacking and repacking the data we query can take time. However, if you are working with a particularly large dataset, compression can help reduce the amount of data scanned significantly.
Columnar formatting also helps reduce the amount of data scanned by grouping long strings of data by field, allowing Athena to pinpoint what we are looking into and ignore the rest of our dataset.
More on Columnar Storage Formats and Compression.