How to collapse multiple columns into one in pandas
How to collapse multiple columns into one in pandas
I have a pandas dataframe filled with users and categories, but multiple columns for those categories.
| user | category | val1 | val2 | val3 |
| ------ | ------------------| -----| ---- | ---- |
| user 1 | c1 | 3 | NA | None |
| user 1 | c2 | NA | 4 | None |
| user 1 | c3 | NA | NA | 7 |
| user 2 | c1 | 5 | NA | None |
| user 2 | c2 | NA | 7 | None |
| user 2 | c3 | NA | NA | 2 |
I want to get it so the values are compressed into a single column.
| user | category | value|
| ------ | ------------------| -----|
| user 1 | c1 | 3 |
| user 1 | c2 | 4 |
| user 1 | c3 | 7 |
| user 2 | c1 | 5 |
| user 2 | c2 | 7 |
| user 2 | c3 | 2 |
Ultimately, to get a matrix like the following:
np.array([[3, 4, 7], [5, 7, 2]])
2
Pretty sure that was a typo and the 2 in the
val3
column should be dropped down.– piRSquared
Jun 29 at 14:41
val3
Edited it. yes it was a typo
– hedebyhedge
Jul 2 at 10:29
3 Answers
3
You can use pd.DataFrame.bfill
to backfill values over selected columns. However, I'm not sure how you derive 2
for the final value, since no values are non-null in the final row.
pd.DataFrame.bfill
2
val_cols = ['val1', 'val2', 'val3']
df['value'] = pd.to_numeric(df[val_cols].bfill(axis=1).iloc[:, 0], errors='coerce')
print(df)
user0 category val1 val2 val3 value
0 user 1 c1 3.0 NaN None 3.0
1 user 1 c2 NaN 4.0 None 4.0
2 user 1 c3 NaN NaN 7 7.0
3 user 2 c1 5.0 NaN None 5.0
4 user 2 c2 NaN 7.0 2 7.0
5 user 2 c3 NaN NaN None NaN
bfill
is a nice way to do it.– piRSquared
Jun 29 at 14:39
bfill
['user', 'category']
d = df.set_index(['user', 'category'])
pd.Series(d.lookup(d.index, d.isna().idxmin(1)), d.index).reset_index(name='value')
user category value
0 user 1 c1 3
1 user 1 c2 4
2 user 1 c3 7
3 user 2 c1 5
4 user 2 c2 7
5 user 2 c3 2
You can skip the resetting of the index and unstack to get your final result
d = df.set_index(['user', 'category'])
pd.Series(d.lookup(d.index, d.isna().idxmin(1)), d.index).unstack()
category c1 c2 c3
user
user 1 3 4 7
user 2 5 7 2
You can simply fillna(0)
(df2 = df.fillna(0)
) and use |
operator.
fillna(0)
df2 = df.fillna(0)
|
Convert to int
first
int
df2.loc[:, ['val1','val2','val3']] = df2[['val1','val2','val3']].astype(int)
Then
df2['val4'] = df2.val1.values | df2.val2.values | df2.val3.values
Interesting approach (-:
– piRSquared
Jun 29 at 14:39
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
How do you get
2
for the final row, since all values are null there?– jpp
Jun 29 at 14:40