DataFrame group by a specific column, aggregate ratio of some other column?
I have a DataFrame with columns Year and Min Delay. Sample rows are as follows:
Year  Min Delay
2014 0
2014 2
2014 0
2014 4
2015 4
2015 4
2015 2
2015 2
I want to group this dataframe by year and find the delay ratio per year (i.e. number of non-zero entries that year divided by total number of entries for that year). So if we consider the data frame above, what I am trying to get is:
2014 0.5
2015 1
(There are 2 delays out of 4 entries in 2014 and 4 delays out of 4 entries in 2015. A delay is defined as Min Delay > 0.)
This is what I tried:
def find_ratio(df):
    ratio = 1 - (len(df[df == 0]) / len(df))
    return ratio
print(df.groupby(["Year"])["Min Delay"].transform(find_ratio).unique())
which prints: [0.5 1]
How can I get a data frame instead of an array?
1 Answer
First, I think unique is not a good idea here, because it makes it impossible to match each value returned by the function back to its year.
Also, transform is a good idea if you need a new column in the original DataFrame, not an aggregated DataFrame.
I think you need GroupBy.apply. The function can also be simplified by taking the mean of a boolean mask:
def find_ratio(df):
    ratio = (df != 0).mean()
    return ratio
print(df.groupby(["Year"])["Min Delay"].apply(find_ratio).reset_index(name='ratio'))
   Year  ratio
0  2014    0.5
1  2015    1.0
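As a self-contained sketch of the same idea, the snippet below rebuilds the sample data from the question and computes the ratio without a custom function, by grouping a boolean mask directly. This is an alternative to the GroupBy.apply call above, not the method used in the answer; the names delayed and ratio_df are illustrative, and the mask uses > 0 to follow the question's definition of a delay.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    "Min Delay": [0, 2, 0, 4, 4, 4, 2, 2],
})

# True where the row counts as a delay (Min Delay > 0, per the question).
delayed = df["Min Delay"].gt(0)

# The mean of a boolean Series is the fraction of True values,
# so grouping the mask by year gives the delay ratio per year.
ratio_df = delayed.groupby(df["Year"]).mean().reset_index(name="ratio")
print(ratio_df)
```

Because the mask is grouped by the Year column itself, the result keeps Year as a label, which is what makes the final reset_index produce a two-column DataFrame.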
Solution with lambda function:
print(df.groupby(["Year"])["Min Delay"]
        .apply(lambda x: (x != 0).mean())
        .reset_index(name='ratio'))
   Year  ratio
0  2014    0.5
1  2015    1.0
Solution with GroupBy.transform, which returns a new column:
df['ratio'] = df.groupby(["Year"])["Min Delay"].transform(find_ratio)
print (df)
   Year  Min Delay  ratio
0  2014          0    0.5
1  2014          2    0.5
2  2014          0    0.5
3  2014          4    0.5
4  2015          4    1.0
5  2015          4    1.0
6  2015          2    1.0
7  2015          2    1.0
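To make the difference between the two approaches concrete, here is a small sketch comparing the shapes of the results, assuming the same sample data as in the question. The names per_row and per_year are only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015],
    "Min Delay": [0, 2, 0, 4, 4, 4, 2, 2],
})

grouped = df.groupby("Year")["Min Delay"]

# transform broadcasts each group's result back to every row of that group,
# so the output has the same length and index as the original DataFrame.
per_row = grouped.transform(lambda s: (s != 0).mean())

# apply with a function returning a scalar yields one value per group,
# indexed by the group key (Year).
per_year = grouped.apply(lambda s: (s != 0).mean())

print(len(per_row), len(per_year))
```

This is why transform suits adding a column, while apply followed by reset_index suits building a per-year summary.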
Downvoter, if there's something wrong with my answer, please let me know, so I can correct it. Thanks.
– jezrael
Jul 1 at 16:14
I think the problem is with the .unique() call - it returns a NumPy array (ndarray) (source)
– Bill Armstrong
Jul 1 at 16:21