Count most frequent group with NaN values

Basically, I would like to count the occurrences of the most frequent item, grouped by two variables. I use this code:
dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x)[1])
This code works, but fails on columns that contain NaN values, since NaN is a float while the other values are str, so this error is raised:
'<' not supported between instances of 'float' and 'str'
I would like to omit the NaN values and compute the mode for the rest, so str(x) is not a solution. scipy.stats.mode(x, nan_policy='omit') does not work either; it raises:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
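For what it's worth, the comparison failure can be reproduced without scipy at all, since NaN is a float and Python 3 refuses to order a float against a str:

```python
import numpy as np

# Sorting a mixed str/float sequence triggers the same comparison error
try:
    sorted(['hello', np.nan])
except TypeError as e:
    print(e)  # '<' not supported between instances of 'float' and 'str'
```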
Could you please advise me how to deal with this?
Thanks
3 Answers
I think you need dropna to remove the NaN values:
dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x.dropna())[1])
If you need to return NaN for groups that are entirely NaN:
dfgrouped = (data[COLUMNS.copy()]
.groupby(['Var1','Var2'])
.agg(lambda x: None if x.isnull().all() else stats.mode(x.dropna())[1]))
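A quick check of this pattern on a made-up frame (numeric values are used here, since recent SciPy releases restrict stats.mode to numeric input, and .item() normalizes the count to a plain scalar across SciPy versions):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: group (1, 2) has a repeated value, group (5, 6) is all NaN
df = pd.DataFrame({'Var1': [1, 1, 1, 5, 5],
                   'Var2': [2, 2, 2, 6, 6],
                   'Value': [7.0, 7.0, np.nan, np.nan, np.nan]})

res = (df.groupby(['Var1', 'Var2'])['Value']
         .agg(lambda x: None if x.isnull().all()
              else np.asarray(stats.mode(x.dropna())[1]).item()))
print(res)
# (1, 2) -> 2 (7.0 appears twice), (5, 6) -> NaN
```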
@hta - Please check edited answer.
– jezrael
Jul 2 at 9:33
You can dropna as an initial step before performing a groupby. If you attempt to dropna within the aggregation, then a group with all NaN values may yield an error with stats.mode.
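A short sketch of that failure mode on hypothetical all-NaN data (the SciPy versions this answer targets raise IndexError here; on newer releases the exact exception type may differ):

```python
import numpy as np
import pandas as pd
from scipy import stats

all_nan = pd.Series([np.nan, np.nan])
remaining = all_nan.dropna()          # empty Series once the NaNs are gone

try:
    count = stats.mode(remaining)[1][0]
    print('count:', count)
except Exception as exc:              # IndexError on the SciPy of this answer
    print(type(exc).__name__)
```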
Here's a minimal example:
import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame([[1, 2, np.nan], [1, 2, 'hello'], [1, 2, np.nan],
                   [5, 6, 'next'], [5, 6, np.nan], [5, 6, 'next'],
                   [7, 8, np.nan], [7, 8, np.nan], [7, 8, np.nan]],
                  columns=['Var1', 'Var2', 'Value'])

res = df.dropna(subset=['Value']) \
        .groupby(['Var1', 'Var2']) \
        .agg(lambda x: stats.mode(x)[1][0])

print(res)
Value
Var1 Var2
1 2 1
5 6 2
If you need to keep groups with all NaN values, then you can catch IndexError:
def mode_calc(x):
    try:
        return stats.mode(x.dropna())[1][0]
    except IndexError:
        return np.nan

res = df.groupby(['Var1', 'Var2']).agg(mode_calc)
print(res)
Value
Var1 Var2
1 2 1.0
5 6 2.0
7 8 NaN
Unfortunately, I do not want to remove groups with only NaN values. If this happens, the mode is supposed to be null. Any suggestions? Thanks a lot!
– hta
Jul 2 at 9:28
@hta, Sure, see the alternative I've added.
– jpp
Jul 2 at 9:36
NaN is of type float, and np.nan == np.nan is False. If you need to group the NaN values together, you could try something like this:
# First replace nan values with something like 'Unavailable'
data.fillna('Unavailable', inplace=True)
# Then re-run your code
dfgrouped = data[COLUMNS.copy()].groupby(['Var1','Var2']).agg(lambda x: stats.mode(x)[1])
This will group all of the 'Unavailable' values together as a single category. Hope that helps.
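To sketch the effect on hypothetical data (pandas' own value_counts is used for the count here, which also sidesteps SciPy's type restrictions on string columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Var1': [1, 1, 1],
                   'Var2': [2, 2, 2],
                   'Value': ['hello', np.nan, np.nan]})

# NaNs become an ordinary category and compete with the real values
filled = df.fillna('Unavailable')

res = (filled.groupby(['Var1', 'Var2'])['Value']
             .agg(lambda x: x.value_counts().iloc[0]))
print(res)
# group (1, 2) -> 2: 'Unavailable' appears twice vs. 'hello' once
```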
I do not want to remove groups with only NaN values. If this happens, the mode is supposed to be null. So I tried your first suggestion (stats.mode(x.dropna())[1]), but it works only without [1]. Do you have any suggestions?
– hta
Jul 2 at 9:26