
Luigi Pipeline Beginning In S3

My initial files are in AWS S3. Could someone point me to how I need to set this up in a Luigi Task? I reviewed the documentation and found luigi.S3, but it is not clear to me what to do.

Solution 1:

The key here is to define an External Task that has no inputs and whose outputs are the files you already have living in S3. The Luigi docs mention this in Requiring another Task:

Note that requires() can not return a Target object. If you have a simple Target object that is created externally you can wrap it in a Task class

So, basically you end up with something like this:

import luigi

from luigi.s3 import S3Target

from somewhere import do_something_with


class MyS3File(luigi.ExternalTask):

    def output(self):
        return S3Target('s3://my-bucket/path/to/file')


class ProcessS3File(luigi.Task):

    def requires(self):
        return MyS3File()

    def output(self):
        return S3Target('s3://my-bucket/path/to/output-file')

    def run(self):
        # self.input() is the S3Target produced by MyS3File;
        # open('r') returns a file stream that reads the file from your AWS S3 bucket
        with self.input().open('r') as f:
            result = do_something_with(f)

        # it'd be better to serialize this result before writing it to a file,
        # but this is a pretty simple example
        with self.output().open('w') as out_file:
            out_file.write(result)
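
For completeness, here is one way you could run this pipeline once the two tasks are defined. This is a minimal sketch, not part of the original answer: the module name mypipeline and the use of the local scheduler are assumptions.

import luigi
from mypipeline import ProcessS3File  # hypothetical module containing the tasks above

if __name__ == '__main__':
    # run with the local scheduler for a quick test;
    # drop local_scheduler=True to use a central luigid scheduler instead
    luigi.build([ProcessS3File()], local_scheduler=True)

You can get the same result from the command line with: luigi --module mypipeline ProcessS3File --local-scheduler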

UPDATE:

Luigi uses boto to read files from and write them to AWS S3, so in order to make this code work, you'll need to provide your credentials in your boto config file ~/.boto (the boto documentation lists other possible config file locations):

[Credentials]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>
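
Alternatively, you can pass credentials to the S3 client in code instead of relying on the boto config file. This is a sketch under the assumption that your Luigi version's luigi.s3 module exposes S3Client with aws_access_key_id/aws_secret_access_key arguments and that S3Target accepts a client argument; check your installed version before using it:

from luigi.s3 import S3Client, S3Target

# build a client with explicit credentials instead of reading them from ~/.boto
client = S3Client(aws_access_key_id='<your_access_key_here>',
                  aws_secret_access_key='<your_secret_key_here>')

# hand the client to the target explicitly
target = S3Target('s3://my-bucket/path/to/file', client=client)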
