Count most frequent group with NaN values

Basically, I would like to count the occurrences of the most frequent item, grouped by two variables. I use this code:
dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x)[1])
This code works, but fails on columns that contain NaN values, since NaN is a float while the other values are str, so this error is raised:
'<' not supported between instances of 'float' and 'str'
I would like to omit the NaN values and compute the mode for the rest, so str(x) is not a solution. scipy.stats.mode(x, nan_policy='omit') does not work either; it raises:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
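For what it's worth, the comparison failure can be reproduced without scipy at all, since NaN is a float and Python 3 refuses to order a float against a str:

```python
import numpy as np

# Sorting a mixed str/float sequence triggers the same comparison error
try:
    sorted(['hello', np.nan])
except TypeError as e:
    print(e)  # '<' not supported between instances of 'float' and 'str'
```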
Could you please advise me how to deal with this?
Thanks
3 Answers
I think you need dropna to remove the NaN values:
dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x.dropna())[1])
If you need to return NaN for groups that are entirely NaN:
dfgrouped = (data[COLUMNS.copy()]
.groupby(['Var1','Var2'])
.agg(lambda x: None if x.isnull().all() else stats.mode(x.dropna())[1]))
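A quick check of this pattern on a made-up frame (numeric values are used here, since recent SciPy releases restrict stats.mode to numeric input, and .item() normalizes the count to a plain scalar across SciPy versions):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: group (1, 2) has a repeated value, group (5, 6) is all NaN
df = pd.DataFrame({'Var1': [1, 1, 1, 5, 5],
                   'Var2': [2, 2, 2, 6, 6],
                   'Value': [7.0, 7.0, np.nan, np.nan, np.nan]})

res = (df.groupby(['Var1', 'Var2'])['Value']
         .agg(lambda x: None if x.isnull().all()
              else np.asarray(stats.mode(x.dropna())[1]).item()))
print(res)
# (1, 2) -> 2 (7.0 appears twice), (5, 6) -> NaN
```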
@hta - Please check edited answer.
– jezrael
Jul 2 at 9:33
You can dropna as an initial step before performing a groupby. If you attempt to dropna within the aggregation, then a group with all NaN values may yield an error with stats.mode.
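A short sketch of that failure mode on hypothetical all-NaN data (the SciPy versions this answer targets raise IndexError here; on newer releases the exact exception type may differ):

```python
import numpy as np
import pandas as pd
from scipy import stats

all_nan = pd.Series([np.nan, np.nan])
remaining = all_nan.dropna()          # empty Series once the NaNs are gone

try:
    count = stats.mode(remaining)[1][0]
    print('count:', count)
except Exception as exc:              # IndexError on the SciPy of this answer
    print(type(exc).__name__)
```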
Here's a minimal example:
import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame([[1, 2, np.nan], [1, 2, 'hello'], [1, 2, np.nan],
                   [5, 6, 'next'], [5, 6, np.nan], [5, 6, 'next'],
                   [7, 8, np.nan], [7, 8, np.nan], [7, 8, np.nan]],
                  columns=['Var1', 'Var2', 'Value'])

res = df.dropna(subset=['Value']) \
        .groupby(['Var1', 'Var2']) \
        .agg(lambda x: stats.mode(x)[1][0])

print(res)
Value
Var1 Var2
1 2 1
5 6 2
If you need to keep groups with all NaN values, then you can catch IndexError:
def mode_calc(x):
    try:
        return stats.mode(x.dropna())[1][0]
    except IndexError:
        return np.nan

res = df.groupby(['Var1', 'Var2']).agg(mode_calc)
print(res)
Value
Var1 Var2
1 2 1.0
5 6 2.0
7 8 NaN
Unfortunately, I do not want to remove groups with only NaN values. If this happens, the mode is supposed to be null. Any suggestions? Thanks a lot!
– hta
Jul 2 at 9:28
@hta, Sure, see the alternative I've added.
– jpp
Jul 2 at 9:36
NaN is of type float, and np.nan == np.nan is False. If you need to group the NaN values together, you could try something like this:
# First replace nan values with something like 'Unavailable'
data.fillna('Unavailable', inplace=True)
# Then re-run your code
dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x)[1])
This will group all of the 'Unavailable' values together as a single category. Hope that helps.
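To sketch the effect on hypothetical data (pandas' own value_counts is used for the count here, which also sidesteps SciPy's type restrictions on string columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Var1': [1, 1, 1],
                   'Var2': [2, 2, 2],
                   'Value': ['hello', np.nan, np.nan]})

# NaNs become an ordinary category and compete with the real values
filled = df.fillna('Unavailable')

res = (filled.groupby(['Var1', 'Var2'])['Value']
             .agg(lambda x: x.value_counts().iloc[0]))
print(res)
# group (1, 2) -> 2: 'Unavailable' appears twice vs. 'hello' once
```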
I do not want to remove groups with only NaN values. If this happens, the mode is supposed to be null. So I tried your first suggestion (stats.mode(x.dropna())[1]), but it works only without [1]. Do you have any suggestions?
– hta
Jul 2 at 9:26