Enabling CloudFront Access Logs and Querying with Athena Using Projected Partitions

COPPER Nguyen

CloudFront Access Logs are a goldmine for understanding CDN traffic patterns, user behavior, and performance metrics. In this post, I’ll walk you through how to:

  • Enable CloudFront Access Logs in Parquet format
  • Store logs in Amazon S3
  • Create an Athena table with projected partitions for efficient querying

All of this is done using Terraform, with reusable modules and configurations.

Step 1: Provision S3 Buckets

We need two buckets:

  • One for storing CloudFront logs
  • One for Athena query results

resource "aws_s3_bucket" "cloudfront_logs" {
  bucket        = "${local.prefix}-cloudfront-access-logs"
  force_destroy = true
}

resource "aws_s3_bucket" "athena_query_results" {
  bucket        = "${local.prefix}-athena-query-results"
  force_destroy = true
}
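These and later snippets reference local.prefix and data.aws_caller_identity.current, which aren't defined elsewhere in this post. A minimal sketch (the prefix value is just an example that matches the bucket names used in Step 5):

locals {
  prefix = "dev-example" # example naming prefix; adjust per project/environment
}

# Used later in the bucket policy to pin log delivery to this account.
data "aws_caller_identity" "current" {}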

Step 2: Enable CloudFront Access Logs (Parquet Format)

Using CloudWatch Logs vended log delivery, we configure CloudFront standard logging (v2) to deliver logs to S3 in Parquet format. First, create the delivery destination that points at the logs bucket:

resource "aws_cloudwatch_log_delivery_destination" "s3" {
  region        = "us-east-1"
  name          = "${local.prefix}-cdn-access-log"
  output_format = "parquet"
  delivery_destination_configuration {
    destination_resource_arn = aws_s3_bucket.cloudfront_logs.arn
  }
}

Then, we define a delivery source for the CloudFront distribution. (If you run separate public and private distributions, repeat the source and delivery resources for each one.)

Define the delivery source for the CloudFront distribution (this assumes the distribution is managed elsewhere in your configuration as aws_cloudfront_distribution.cdn_public):

resource "aws_cloudwatch_log_delivery_source" "cdn_public" {
  region       = "us-east-1"
  name         = "${local.prefix}-cdn-public"
  log_type     = "ACCESS_LOGS"
  resource_arn = aws_cloudfront_distribution.cdn_public.arn
}

Link the source and destination:

resource "aws_cloudwatch_log_delivery" "cdn_public" {
  region                    = "us-east-1"
  delivery_source_name      = aws_cloudwatch_log_delivery_source.cdn_public.name
  delivery_destination_arn  = aws_cloudwatch_log_delivery_destination.s3.arn
  record_fields             = local.cloudfront_log_record_fields
  s3_delivery_configuration = local.cloudfront_log_s3_delivery_configuration
}
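The delivery resource above references two locals that aren't shown elsewhere in this post. A minimal sketch of what they might contain follows; the field names are a subset of CloudFront's access log fields, and the suffix path must produce object keys that match the storage.location.template in Step 5. Treat the {DistributionId}/{yyyy}/{MM}/{dd}/{HH} variables as assumptions to verify against the CloudFront log delivery documentation and your provider version:

locals {
  # Subset of CloudFront access log fields to deliver; extend to match the
  # columns declared in the Athena table (Step 5).
  cloudfront_log_record_fields = [
    "date", "time", "x-edge-location", "sc-bytes", "c-ip",
    "cs-method", "cs-uri-stem", "sc-status", "time-taken",
  ]

  # Key layout under the bucket; this must line up with the partition
  # projection template in Step 5 (AWSLogs/<dist-id>/CloudFront/<y>/<m>/<d>/<h>/logs).
  cloudfront_log_s3_delivery_configuration = [{
    suffix_path                 = "AWSLogs/{DistributionId}/CloudFront/{yyyy}/{MM}/{dd}/{HH}/logs"
    enable_hive_compatible_path = false
  }]
}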


Step 3: Grant Permissions for Log Delivery

Attach an S3 bucket policy to allow CloudWatch to write logs:

resource "aws_s3_bucket_policy" "this" {
  bucket = aws_s3_bucket.cloudfront_logs.id
  policy = jsonencode({
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Sid" : "AWSLogDeliveryWrite",
        "Effect" : "Allow",
        "Principal" : {
          "Service" : "delivery.logs.amazonaws.com"
        },
        "Action" : "s3:PutObject",
        "Resource" : "${aws_s3_bucket.cloudfront_logs.arn}/*",
        "Condition" : {
          "StringEquals" : {
            "aws:SourceAccount" : "${data.aws_caller_identity.current.account_id}"
          },
          "ArnLike" : {
            "aws:SourceArn" : aws_cloudwatch_log_delivery_source.cdn_public.arn
          }
        }
      }
    ]
  })
}
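One caveat: the ArnLike condition above only allows the public delivery source. If you also deliver logs from a private distribution, aws:SourceArn accepts a list, so the same statement can cover both. A sketch of just the Condition block, assuming a second (hypothetical) source named cdn_private:

"Condition" : {
  "StringEquals" : {
    "aws:SourceAccount" : data.aws_caller_identity.current.account_id
  },
  "ArnLike" : {
    "aws:SourceArn" : [
      aws_cloudwatch_log_delivery_source.cdn_public.arn,
      aws_cloudwatch_log_delivery_source.cdn_private.arn, # hypothetical second source
    ]
  }
}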

Step 4: Create Athena Database

resource "aws_athena_database" "cloudfront_logs" {
  name   = "${replace(local.prefix, "-", "_")}_cloudfront_logs_db"
  bucket = aws_s3_bucket.athena_query_results.bucket
}

Step 5: Define Athena Table with Projected Partitions

Once logs are delivered in Parquet format, define an Athena table that uses partition projection. Instead of registering partitions with MSCK REPAIR TABLE or a Glue crawler, Athena computes partition locations at query time from the projection properties and the storage.location.template, so new hourly prefixes are queryable immediately.

Example SQL DDL:

CREATE EXTERNAL TABLE cloudfront_logs (
  `date` string,
  time string,
  x_edge_location string,
  sc_bytes bigint,
  c_ip string,
  cs_method string,
  cs_host string,
  cs_uri_stem string,
  sc_status int,
  cs_referer string,
  cs_user_agent string,
  cs_uri_query string,
  cs_cookie string,
  x_edge_result_type string,
  x_edge_request_id string,
  x_host_header string,
  cs_protocol string,
  cs_bytes bigint,
  time_taken double,
  x_forwarded_for string,
  ssl_protocol string,
  ssl_cipher string,
  x_edge_response_result_type string,
  cs_protocol_version string,
  fle_status string,
  fle_encrypted_fields string,
  c_port int,
  time_to_first_byte double,
  x_edge_detailed_result_type string,
  sc_content_type string,
  sc_content_len bigint,
  sc_range_start bigint,
  sc_range_end bigint,
  c_country string,
  cache_behavior_path_pattern string
)
PARTITIONED BY (
  distributionid string,
  year string,
  month string,
  day string,
  hour string
)
STORED AS PARQUET
LOCATION 's3://dev-example-cloudfront-access-logs/AWSLogs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.distributionid.type' = 'enum',
  'projection.distributionid.values' = 'YOUR_DISTRIBUTION_ID',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2023,2025',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'storage.location.template' = 's3://dev-example-cloudfront-access-logs/AWSLogs/${distributionid}/CloudFront/${year}/${month}/${day}/${hour}/logs'
);
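The DDL itself isn't managed by the Terraform resources above. One way to keep it alongside the rest of the configuration is to register it as an Athena named query, which you then execute once from the console or CLI. A minimal sketch, assuming the statement above is saved at files/cloudfront_logs_ddl.sql (a hypothetical path):

resource "aws_athena_named_query" "cloudfront_logs_ddl" {
  name     = "${local.prefix}-cloudfront-logs-ddl"
  database = aws_athena_database.cloudfront_logs.name
  query    = file("${path.module}/files/cloudfront_logs_ddl.sql") # the CREATE EXTERNAL TABLE above
}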

Step 6: Example Athena Query – Count Requests per Episode

Here’s a practical query to count how many times each episode page was requested in October 2025:

SELECT
    REGEXP_EXTRACT(cs_uri_stem, '^/episodes/([^/]+)', 1) AS episode_id,
    COUNT(*) AS request_count
FROM "dev_bump_cloudfront_logs_db"."cloudfront_logs_parquet_pp"
WHERE 
  year = 2025
  AND month = 10
  AND cs_uri_stem LIKE '/episodes/%'
GROUP BY REGEXP_EXTRACT(cs_uri_stem, '^/episodes/([^/]+)', 1)
ORDER BY request_count DESC;

What This Does:

  • Extracts episode_id from the URI path
  • Filters logs for October 2025
  • Groups and counts requests per episode
  • Orders results by popularity
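
Because year, month, day, and hour are projected partitions, tightening the WHERE clause prunes S3 prefixes directly, so Athena scans less data and the query costs less. For example, the same report narrowed to a single day:

SELECT
    REGEXP_EXTRACT(cs_uri_stem, '^/episodes/([^/]+)', 1) AS episode_id,
    COUNT(*) AS request_count
FROM "dev_example_cloudfront_logs_db"."cloudfront_logs"
WHERE
  year = '2025'
  AND month = '10'
  AND day = '15'
  AND cs_uri_stem LIKE '/episodes/%'
GROUP BY 1
ORDER BY request_count DESC;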

✅ Summary

With this setup, you can:

  • Automatically collect CloudFront logs in Parquet format
  • Use Athena with projected partitions for fast, cost-effective queries
  • Gain insights into user behavior with SQL