Luigi Pipeline Beginning In S3
My initial files are in AWS S3. Could someone point me to how I need to set this up in a Luigi Task? I reviewed the documentation and found luigi.S3, but it is not clear to me what to do.
Solution 1:
The key here is to define an External Task that has no inputs and whose outputs are the files you already have living in S3. The Luigi docs mention this in Requiring another Task:
Note that requires() can not return a Target object. If you have a simple Target object that is created externally you can wrap it in a Task class
So, basically you end up with something like this:
import luigi
from luigi.s3 import S3Target  # moved to luigi.contrib.s3 in newer Luigi versions
from somewhere import do_something_with

class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/path/to/file')

class ProcessS3File(luigi.Task):
    def requires(self):
        return MyS3File()

    def output(self):
        return S3Target('s3://my-bucket/path/to/output-file')

    def run(self):
        # this will return a file stream that reads the file from your AWS S3 bucket
        with self.input().open('r') as f:
            result = do_something_with(f)

        # and then you write the result to the output target
        # (it'd be better to serialize this result before writing it to a file,
        # but this is a pretty simple example)
        with self.output().open('w') as out_file:
            out_file.write(result)
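As a usage note beyond the original answer, here is a minimal sketch of how you might trigger this pipeline, assuming the two task classes above live in a module on your PYTHONPATH (the module name my_pipeline is an assumption):

# run_pipeline.py -- hypothetical driver script
import luigi
from my_pipeline import ProcessS3File  # assumed module name

if __name__ == '__main__':
    # Runs ProcessS3File (and its MyS3File dependency) with the local scheduler,
    # i.e. without a central luigid server.
    luigi.build([ProcessS3File()], local_scheduler=True)

You could equally invoke it from the command line with something like luigi --module my_pipeline ProcessS3File --local-scheduler.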
UPDATE:
Luigi uses boto to read files from and/or write them to AWS S3, so in order to make this code work, you'll need to provide your credentials in your boto config file ~/.boto (look for other possible config file locations here):
[Credentials]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>
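As an alternative (this is an addition beyond the original answer, and depends on your Luigi version), Luigi's S3 client can also pick the credentials up from an [s3] section of your Luigi config file (client.cfg / luigi.cfg), or boto can read them from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables:

# luigi.cfg (or client.cfg) -- hedged sketch, verify against your Luigi version
[s3]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>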