Stratified Samples From Pandas

December 21, 2023

I have a pandas DataFrame which looks approximately as follows:

cli_id | X1 | X2 | X3 | ... | Xn | Y |
----------------------------------------
123    | 1  | A  | XX | ... | 4  |

Solution 1:

If the number of samples is the same for every group, or if the proportion is constant for every group, you could try something like

df.groupby('Y').apply(lambda x: x.sample(n=200))

df.groupby('Y').apply(lambda x: x.sample(frac=.1))
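As a minimal sketch of the same idea on made-up data (the toy frame, its sizes and the seed are illustrative assumptions, not the asker's data): pandas 1.1 and later also expose sample directly on the groupby object, which avoids the apply call and keeps the Y column in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three values of Y with 1000 rows each.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "X1": rng.normal(size=3000),
    "Y": np.repeat([0.0, 0.5, 1.0], 1000),
})

# Same number of rows from every Y group.
by_n = toy.groupby("Y").sample(n=200, random_state=0)

# Same fraction of rows from every Y group.
by_frac = toy.groupby("Y").sample(frac=0.1, random_state=0)

print(by_n["Y"].value_counts().sort_index())
print(by_frac["Y"].value_counts().sort_index())
```

random_state is only there to make the draw repeatable; drop it for a fresh sample on every run.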

To perform stratified sampling with respect to more than one variable, just group with respect to more variables. It may be necessary to construct new binned variables to this end.
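For instance, a continuous column can be binned with pd.cut and used as a second grouping key. The sketch below assumes a toy frame with a uniform X1 and a two-valued Y; the column name X1_bin and the choice of 5 bins are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "X1": rng.uniform(0, 10, size=4000),
    "Y": rng.choice([0.0, 1.0], size=4000),
})

# Bin the continuous column so it can act as a stratum label (0..4).
toy["X1_bin"] = pd.cut(toy["X1"], bins=5, labels=False)

# Draw 10% from every (Y, X1_bin) cell.
sample = toy.groupby(["Y", "X1_bin"]).sample(frac=0.1, random_state=1)

print(sample.groupby(["Y", "X1_bin"]).size())
```

Each cell contributes in proportion to its size, so the joint distribution of Y and the binned X1 in the sample roughly mirrors the full frame.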

However, if a group is too small relative to the proportion, for example a group of size 1 with a proportion of .25, no row is returned for that group. This is because the number of rows to draw (frac times the group size) is rounded to a whole number, and round(0.25) = 0.
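A small sketch of this effect, together with one possible workaround: force at least one row per group. The helper name sample_at_least_one and the toy data are assumptions for illustration; whether over-representing tiny groups is acceptable depends on your use case.

```python
import pandas as pd

# Hypothetical toy data: group "a" has 8 rows, group "b" has a single row.
toy = pd.DataFrame({"Y": ["a"] * 8 + ["b"], "X1": range(9)})

# Plain fractional sampling: group "b" rounds down to zero rows.
plain = toy.groupby("Y").sample(frac=0.25, random_state=0)

def sample_at_least_one(group, frac, seed=0):
    """Sample a fraction of the group, but never fewer than one row."""
    n = max(1, round(frac * len(group)))
    return group.sample(n=n, random_state=seed)

guarded = pd.concat(
    [sample_at_least_one(g, 0.25) for _, g in toy.groupby("Y")]
)

print(plain["Y"].value_counts())
print(guarded["Y"].value_counts())
```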

Solution 2:

I'm not totally sure whether you mean this:

strats = []
for k in range(11):
    # round so that e.g. 3*0.1 compares equal to 0.3
    y_val = round(k * 0.1, 1)
    dummy_df = your_df[your_df['Y'] == y_val]
    strats.append(dummy_df.sample(200))

That makes a dummy dataframe consisting of only the rows with the Y value you want, and then takes a sample of 200 from it.

OK, so you need the different chunks to have the same structure. I guess that's a bit harder. Here's how I would do it:

First of all, I would get a histogram of what X1 looks like:

hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))

We now have a histogram with nbins bins (nbins + 1 edges running from min_x to max_x).

Now the strategy is to draw a certain number of rows depending on what their value of X1 is. We will draw more from the bins with more observations and fewer from the bins with fewer, so that the structure of X1 is preserved.

In particular, the relative contribution of every bin should be:

rel = [float(i) / sum(hist) for i in hist]

This will be something like [0.1, 0.2, 0.1, 0.3, 0.3], with the entries summing to 1.
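The histogram and the relative contributions can be checked on synthetic data. In this sketch x1 is a made-up stand-in for your_df['X1'], and nbins = 5 is an arbitrary choice; passing nbins + 1 edges to np.histogram yields nbins bins.

```python
import numpy as np

# Hypothetical stand-in for your_df['X1'].
rng = np.random.default_rng(2)
x1 = rng.normal(size=1000)

nbins = 5
edges = np.linspace(x1.min(), x1.max(), nbins + 1)  # nbins + 1 edges
hist, edges = np.histogram(x1, bins=edges)          # nbins counts

# Relative contribution of every bin; vectorised form of the list comprehension.
rel = hist / hist.sum()

print(hist)
print(rel)
```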

If we want 200 samples, we need to draw:

draws_in_bin = [int(i*200) for i in rel]
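Note that int truncates, so the per-bin draws can add up to fewer than 200. One possible remedy, sketched here under the assumption that an exact total matters, is a largest-remainder allocation; allocate_draws is a hypothetical helper name, not something from the answer above.

```python
import math

def allocate_draws(rel, total):
    """Split `total` draws across bins in proportion to `rel`,
    handing leftover draws to the bins with the largest remainders."""
    raw = [r * total for r in rel]
    draws = [math.floor(x) for x in raw]
    shortfall = total - sum(draws)
    # Bin indices ordered by how much was lost to truncation, largest first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - draws[i], reverse=True)
    for i in order[:shortfall]:
        draws[i] += 1
    return draws

rel = [1 / 3, 1 / 3, 1 / 3]
truncated = [int(r * 200) for r in rel]
allocated = allocate_draws(rel, 200)

print(sum(truncated), sum(allocated))
```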

Now we know how many observations to draw from every bin:

strats = []
for k in range(11):
    # get a dataframe for every value of Y
    y_val = round(k * 0.1, 1)
    dummy_df = your_df[your_df['Y'] == y_val]

    bin_strat = []
    for left_edge, right_edge, n_draws in zip(edges[:-1], edges[1:], draws_in_bin):

        bin_df = dummy_df[(dummy_df['X1'] > left_edge)
                          & (dummy_df['X1'] < right_edge)]

        # this takes the right number of draws out
        # of the X1 bin where we currently are
        bin_strat.append(bin_df.sample(n_draws))

    # Note that every element of bin_strat is a dataframe
    # with a number of entries that corresponds to the
    # structure of draws_in_bin

    # concatenate the dataframes for every bin and append to the list
    strats.append(pd.concat(bin_strat))
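The loop above can also be written with pd.cut and a two-key groupby. The following is a sketch under stated assumptions rather than a drop-in replacement: the helper name sample_preserving_x1, the temporary _bin column and the toy data are all made up, rows lying exactly on a bin edge are assigned to a bin instead of being skipped, and the number of draws is capped at the cell size so that sample does not raise when a cell is too small.

```python
import numpy as np
import pandas as pd

def sample_preserving_x1(df, n_total, nbins=5, seed=0):
    """Draw about n_total rows per Y value, following the overall X1 histogram."""
    edges = np.linspace(df["X1"].min(), df["X1"].max(), nbins + 1)
    # Integer bin label per row; include_lowest keeps the minimum value.
    work = df.assign(
        _bin=pd.cut(df["X1"], bins=edges, include_lowest=True, labels=False)
    )
    # Relative contribution of every X1 bin over the whole frame.
    rel = work["_bin"].value_counts(normalize=True)

    pieces = []
    for (_, b), cell in work.groupby(["Y", "_bin"]):
        # Never ask for more rows than the cell contains.
        n_draws = min(int(rel.loc[b] * n_total), len(cell))
        pieces.append(cell.sample(n=n_draws, random_state=seed))
    return pd.concat(pieces).drop(columns="_bin")

# Hypothetical toy data.
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "X1": rng.normal(size=5000),
    "Y": rng.choice([0.0, 0.5, 1.0], size=5000),
})
out = sample_preserving_x1(toy, n_total=200)
print(out["Y"].value_counts().sort_index())
```

Because of the int truncation and the cap, each Y value may end up with slightly fewer than n_total rows.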
