Count most frequent group with Nan values


Basically, I would like to count the number of occurrences of the most frequent item, grouped by 2 variables. I use this code:


dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x)[1])



This code works, but not on columns that have NaN values, since NaN values are floats while the others are strings. So this error is shown:


'<' not supported between instances of 'float' and 'str'
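The comparison failure can be reproduced in plain Python (a sketch with invented values, not the questioner's data): sorting a mix of float NaN and strings hits the same `<` operator that fails inside the aggregation:

```python
import numpy as np

# Sorting a mixed list of float NaN and strings triggers the same
# comparison error, e.g. "'<' not supported between instances of 'str' and 'float'"
try:
    sorted([np.nan, 'hello', 'next'])
except TypeError as exc:
    print(exc)
```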



I would like to omit NaN values and compute the mode for the rest, so str(x) is not a solution. And scipy.stats.mode(x, nan_policy='omit') does not work either, failing with this error:


TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''



Could you please give me advice on how to deal with this?
Thanks




3 Answers



I think you need dropna to remove NaNs:




dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x.dropna())[1])



If you need to set NaN for groups that contain only NaN values:




dfgrouped = (data[COLUMNS.copy()]
                 .groupby(['Var1','Var2'])
                 .agg(lambda x: None if x.isnull().all() else stats.mode(x.dropna())[1]))
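As an aside (not part of the original answer), a pure-pandas sketch using Series.value_counts sidesteps scipy and the mixed-dtype comparison entirely; the DataFrame here is made up for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Var1': [1, 1, 1, 5, 5, 5],
    'Var2': [2, 2, 2, 6, 6, 6],
    'Value': [np.nan, 'hello', np.nan, 'next', np.nan, 'next'],
})

def mode_count(x):
    # value_counts() ignores NaN by default; all-NaN groups yield None
    counts = x.value_counts()
    return None if counts.empty else counts.iloc[0]

# Count of the most frequent non-NaN item per (Var1, Var2) group
dfgrouped = data.groupby(['Var1', 'Var2']).agg(mode_count)
print(dfgrouped)
```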





I do not want to remove groups with only NaN values; if this happens, the mode is supposed to be null. So I tried your first suggestion (stats.mode(x.dropna())[1]), but it works only without [1]. Do you have any suggestion?
– hta
Jul 2 at 9:26






@hta - Please check edited answer.
– jezrael
Jul 2 at 9:33



You can dropna as an initial step before performing a groupby. If you attempt to dropna within the aggregation, a group whose values are all NaN may raise an error in stats.mode.





Here's a minimal example:


import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame([[1, 2, np.nan], [1, 2, 'hello'], [1, 2, np.nan],
                   [5, 6, 'next'], [5, 6, np.nan], [5, 6, 'next'],
                   [7, 8, np.nan], [7, 8, np.nan], [7, 8, np.nan]],
                  columns=['Var1', 'Var2', 'Value'])

res = (df.dropna(subset=['Value'])
         .groupby(['Var1', 'Var2'])
         .agg(lambda x: stats.mode(x)[1][0]))

print(res)

           Value
Var1 Var2
1    2         1
5    6         2



If you need to keep groups with all NaN values, then you can catch IndexError:




def mode_calc(x):
    try:
        return stats.mode(x.dropna())[1][0]
    except IndexError:
        return np.nan

res = df.groupby(['Var1', 'Var2']).agg(mode_calc)

print(res)

           Value
Var1 Var2
1    2       1.0
5    6       2.0
7    8       NaN





Unfortunately I do not want to remove groups with only NaN values; if this happens, the mode is supposed to be null. Any suggestion? Thanks a lot!
– hta
Jul 2 at 9:28





@hta, Sure, see the alternative I've added.
– jpp
Jul 2 at 9:36



NaN is of type float, and np.nan == np.nan evaluates to False. If you need to group them together, you could try something like this:


# First replace nan values with something like 'Unavailable'
data.fillna('Unavailable', inplace=True)
# Then re-run your code
dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x)[1])



This will group all the unavailable values together as one group. Hope that helps.
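A minimal sketch (with invented data) illustrates both the NaN self-inequality and the effect of the fillna replacement; value_counts stands in for stats.mode here to keep the example dependency-free:

```python
import numpy as np
import pandas as pd

# NaN is a float and never compares equal to itself (IEEE 754),
# which is why NaN values cannot be matched up as-is.
print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False

df = pd.DataFrame({'Var1': [7, 7, 7], 'Var2': [8, 8, 8],
                   'Value': [np.nan, np.nan, 'x']})
df = df.fillna('Unavailable')

# After fillna, the former NaNs share one string value and form a single mode.
counts = df.groupby(['Var1', 'Var2'])['Value'].agg(lambda x: x.value_counts().iloc[0])
print(counts)  # the mode count for group (7, 8) is 2
```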





