
Stream Huge Zip Files On S3 Using Lambda And Boto3

I have a bunch of CSV files compressed as one zip on S3. I only need to process one CSV file inside the zip using an AWS Lambda function:

import boto3
from zipfile import ZipFile

BUCK

Solution 1:

Depending on your exact needs, you can use smart-open to handle the reading of the zip file. If the CSV data fits in your Lambda's RAM, it's fairly straightforward to read it directly:

import csv
import zipfile
from io import TextIOWrapper, BytesIO

from smart_open import smart_open


def lambda_handler(event, context):
    # Simple test: calculate the sum of the first column of a CSV file in a zip file
    total_sum, row_count = 0, 0
    # Use smart_open to handle the byte-range requests for us
    with smart_open("s3://example-bucket/many_csvs.zip", "rb") as f:
        # Wrap that in a zip file handler
        zip = zipfile.ZipFile(f)
        # Open a specific CSV file in the zip file
        zf = zip.open("data_101.csv")
        # Read all of the data into memory, and prepare a text IO wrapper to read it row by row
        text = TextIOWrapper(BytesIO(zf.read()))
        # And finally, use Python's csv library to parse the CSV format
        cr = csv.reader(text)
        # Skip the header row
        next(cr)
        # Loop through each row and add the first column
        for row in cr:
            total_sum += int(row[0])
            row_count += 1
    # And output the results
    print(f"Sum {row_count} rows for col 0: {total_sum}")

I tested this with a 1 GB zip file containing hundreds of CSV files. The CSV file I picked was around 12 MB uncompressed (about 100,000 rows), so it fit nicely into RAM in the Lambda environment, even when limited to 128 MB.

If your CSV file can't be loaded into memory all at once like this, you'll need to take care to read it in sections, buffering the reads so you don't waste time reading it line by line and forcing smart-open to fetch lots of small chunks at a time. A sketch of that streaming approach follows.
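As a rough sketch of that approach (assuming the same hypothetical bucket, key, member name, and a UTF-8 encoding, none of which come from the original question): instead of pulling the whole member into memory with zf.read(), hand the open zip member straight to TextIOWrapper so rows are decoded as they are decompressed, and rely on smart_open's own internal read buffering to keep the S3 range requests from becoming too small.

import csv
import io
import zipfile

from smart_open import smart_open


def lambda_handler(event, context):
    total_sum, row_count = 0, 0
    # smart_open still issues and buffers the S3 byte-range requests for us
    with smart_open("s3://example-bucket/many_csvs.zip", "rb") as f:
        with zipfile.ZipFile(f) as archive:
            # Open the member as a stream; nothing is read into memory up front
            with archive.open("data_101.csv") as member:
                # Decode rows as they are decompressed instead of calling member.read()
                text = io.TextIOWrapper(member, encoding="utf-8")
                cr = csv.reader(text)
                # Skip the header row
                next(cr)
                for row in cr:
                    total_sum += int(row[0])
                    row_count += 1
    print(f"Sum {row_count} rows for col 0: {total_sum}")

The trade-off is memory for speed: reading row by row keeps the footprint tiny but tends to make the read chattier, so the all-in-memory version above is usually simpler and faster whenever the member does fit in RAM.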
