
Load Images Into A Dask Dataframe

I have a Dask dataframe that contains image paths in a column (called img_paths). In the next step I want to load the images from those paths into another col

Solution 1:

The solution

import pandas as pd
import dask
import dask.dataframe as dd
import numpy as np
from skimage.io import imread

imgs = ['https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/so/so-logo.png?v=9c558ec15d8a'] * 4

# create a pandas dataframe using image paths
df = pd.DataFrame({"img_paths": imgs})

# convert it into dask dataframe
ddf = dd.from_pandas(df, npartitions=2)

# wrap imread with dask.delayed
delayed_imread = dask.delayed(imread, pure=True)

# tell Dask the output type up front; each cell holds an image array, so the dtype is object
ddf['img_paths'].apply(imread, meta=('img_loaded', object)).compute()

# OR pass the dask.delayed version, for which the output type is inferred as `object`
ddf['img_paths'].apply(delayed_imread).compute()

The explanation

If you apply the print function without calling compute, you can see why ddf.images.apply(imread).compute() raises FileNotFoundError:

ddf['img_paths'].apply(print)

Output:

> foo
> foo

When you add an apply call to the graph without a meta argument, Dask runs the function on a placeholder string, foo, to infer the type of the output. That is why imread was trying to open a file named foo.

To get a better understanding I encourage you to try:

ddf.apply(print, axis=1)

And try to predict what gets printed.

Delayed cells after .compute()

The reason is that apply expects a function reference, which it then calls. By creating a lambda that calls the delayed function, you are effectively double-delaying your function.
