
How Do You Import A Numerically Encoded Column In Pandas?

I'm importing a dataset which encodes a number of variables numerically, e.g.: SEX 1 - Male 2 - Female My best guess at how to convert these (so they appear in my dataframe as Mal

Solution 1:

You can use categories for this:

import pandas as pd

df = pd.DataFrame({"Sex": [1, 2, 1, 1, 2, 1, 2]})

Change the dtype:

df["Sex"] = df["Sex"].astype("category")
print(df["Sex"])
Out[33]: 
0    1
1    2
2    1
3    1
4    2
5    1
6    2
Name: Sex, dtype: category
Categories (2, int64): [1, 2]

Rename categories:

df["Sex"] = df["Sex"].cat.rename_categories(["Male", "Female"])
print(df)
Out[36]: 
      Sex
0    Male
1  Female
2    Male
3    Male
4  Female
5    Male
6  Female
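Passing a list to rename_categories relies on the labels being in the same order as the existing categories. As a minimal sketch of a more explicit variant, a dictionary can map each old code to its label; the name labels is just an illustrative choice, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Sex": [1, 2, 1, 1, 2, 1, 2]})
df["Sex"] = df["Sex"].astype("category")

# Hypothetical lookup table: old numeric code -> human-readable label.
labels = {1: "Male", 2: "Female"}

# A dict-like argument renames only the categories that appear as keys.
df["Sex"] = df["Sex"].cat.rename_categories(labels)
print(df["Sex"].cat.categories)
```

With a dictionary, the mapping no longer depends on category order, which helps when a codebook lists many values.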

I tried it on a ~75k-row dataset (the 30 most-reviewed beers from the beer reviews dataset):

# Build a dictionary that assigns each beer name a number from 0 to 29.
names = df.beer_name.unique()
rep_dict = dict(zip(names, range(len(names))))

replace is quite slow:

%timeit df["beer_name"].replace(rep_dict)
10 loops, best of 3: 139 ms per loop

map is faster, as expected (because it only looks for exact matches):

%timeit df["beer_name"].map(rep_dict)
100 loops, best of 3: 2.78 ms per loop
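Speed is not the only difference between the two: they also treat values missing from the dictionary differently. The following is a small sketch on made-up data (the series s and the dictionary mapping are illustrative only), showing that map yields NaN for unmapped values while replace leaves them as they are.

```python
import pandas as pd

# Toy data: "c" is deliberately absent from the mapping.
s = pd.Series(["a", "b", "c"])
mapping = {"a": 0, "b": 1}

mapped = s.map(mapping)        # unmapped values become NaN
replaced = s.replace(mapping)  # unmapped values are kept unchanged

print(mapped)
print(replaced)
```

So map is a good fit when the dictionary covers every value in the column, as rep_dict does here.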

Converting a column to the category dtype takes about as long as map:

%timeit df["beer_name"].astype("category")
100 loops, best of 3: 2.57 ms per loop

However, once the column is categorical, renaming the categories is far faster:

df["beer_name"] = df["beer_name"].astype("category")
%timeit df["beer_name"].cat.rename_categories(range(30))
10000 loops, best of 3: 149 µs per loop
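A likely reason for the difference is that a categorical column stores one small integer code per row plus a short list of categories, and renaming only rewrites that list. The sketch below, on invented beer styles rather than the real dataset, checks that the per-row codes are untouched by a rename.

```python
import pandas as pd

# Toy categorical column; categories are inferred in sorted order.
s = pd.Series(["ale", "stout", "ale", "lager"], dtype="category")
codes_before = s.cat.codes.tolist()

# Rename the three categories; no per-row lookup is needed.
renamed = s.cat.rename_categories({"ale": 0, "lager": 1, "stout": 2})
codes_after = renamed.cat.codes.tolist()

print(codes_before == codes_after)
print(list(renamed.cat.categories))
```

The cost of a rename therefore scales with the number of categories (30 here) rather than with the number of rows.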

So a second map would take as long as the first, but once the column is a category, rename_categories is much faster. In pandas versions before 0.19.0, the category dtype could not be assigned while reading the file; you had to change the types afterwards.

As of pandas 0.19.0, you can pass dtype='category' to read_csv (or pass a dictionary specifying which columns should be parsed as categories).
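As a rough sketch of that approach, the snippet below reads a tiny in-memory CSV (the csv_text contents and the labels dictionary are made up for illustration) with one column parsed as a category, then renames the categories. Categories read this way are typically parsed as strings such as "1" rather than integers, so the sketch normalizes each one with str() instead of assuming a type.

```python
import io

import pandas as pd

# Stand-in for a real file on disk.
csv_text = "Sex\n1\n2\n1\n"

# Parse only the "Sex" column as a category while reading.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Sex": "category"})

# Hypothetical codebook, keyed by the string form of each code.
labels = {"1": "Male", "2": "Female"}

# rename_categories also accepts a callable applied to each category.
df["Sex"] = df["Sex"].cat.rename_categories(lambda c: labels[str(c)])
print(df)
```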
