pandas merge: merge two dataframes on same column but keep different columns
pandas merge: merge two dataframes on same column but keep different columns
I have two pandas dataFrames that share one common column name. I would like to merge on the common column name but keep all the different columns from the second dataFrame where there's a match on the common column name. Here's an example of the two dataFrames:
DF1:
A B C D E F G H
value value value value value value value value
value value value value value value value value
value value value value value value value value
etc.
DF2:
A I J K L
value value value value value
value value value value value
value value value value value
I would like the expected result to be:
Result:
A B C D E F G H I J K L
value value value value value value value value value value value value
value value value value value value value value value value value value
value value value value value value value value value value value value
That is, where there's a match on column 'A' I want the rows of I, J, K, L preserved and not equal to 'NaN', similarly for the columns in DF1.
I've tried all of the pd.merge options, but they don't seem to be doing what I'm asking above. For example,
pd.merge(DF1, DF2, on='A', how='left')
matches on 'A' and keeps all keys to the left, but not from the second dataFrame. I know this is what it's supposed to be doing, but it is not what I want. Similarly, how='right' does the same but preserves all the keys on the right, and fills all the keys on the left with 'NaNs'. 'Outer' doesn't seem to be doing what I want either as it just sticks everything together and fills all the non-similar columns with NaNs.
Also, note that DF1 and DF2 are not in the same index order.
Can anyone provide me with a path forward here? Thanks!
1 Answer
1
If you have
z=io.StringIO(""" A B C D E F G H
value2 value2 value2 value2 value2 value2 value2 value2
value3 value3 value3 value3 value3 value3 value3 value3
value value value value value value value value""")
df = pd.read_table(z, delim_whitespace=True)
z2=io.StringIO(""" A I J K L
value value value value value
value2 value2 value2 value2 value2
value3 value3 value3 value3 value3""")
df2=pd.read_table(z2,delim_whitespace=True)
Then
pd.merge(df,df2, on="A",right_index=True, left_index=True)
A B C D E F G H I J K L
0 value value value value value value value value value value value value
1 value value value value value value value value value value value value
2 value value value value value value value value value value value value
To match the indexes as you described,
pd.merge(df.set_index("A"),df2.set_index("A"), right_index=True, left_index=True).reset_index()
A B C D E F G H I J K L
0 value2 value2 value2 value2 value2 value2 value2 value2 value2 value2 value2 value2
1 value3 value3 value3 value3 value3 value3 value3 value3 value3 value3 value3 value3
2 value value value value value value value value value value value value
@user1332577 I have edited the question. Hope this works :)
– RafaelC
Jul 2 at 4:08
this works to merge everything together. I was struggling for a bit because pd.merge was taking a long time to compute. It turned out I needed to change the index dtype prior to running otherwise it would just hang. 'A' dtype was an object, but it needed to be str, so I fixed it with df1.index.astype(str) and df2.index.astype(str) after I set the index to 'A'. I did this outside of the pd.merge call then just operated on the changed dataFrame.
– user1332577
Jul 2 at 14:30
However, I am getting an interesting result. Although everything is in the right place, since I do not have unique 'A' values, some rows share an 'A' value (their ID) because they are related, so say there are 3 rows with the same 'A' value, I end up getting 9 rows back, so my merged dataFrame is almost triple the length it should be. Ideas? I was wondering if I could set it to only use the left_index or something -- maybe that would work?
– user1332577
Jul 2 at 14:30
Nevermind, I figured it out. I was flattening one of my dataFrames and I didn't need to. That was the cause of the additional rows. Thanks for all your help!
– user1332577
Jul 2 at 14:53
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
This is close to what I want, but I failed to mentioned that they aren't in the same ordering index-wise so this does preserve I, J, K, L and matches the ids, but it seems to be matching I,J,K,L based on the index, not the match on 'A'. Does that make sense? I want when the values of 'A' match between the dataFrames that it still preserves the row order, so B, C, D, E, F, G, H seem to match correctly with the 'A' column but I, J, K, L seem to match with the index. I'll make sure to edit my question.
– user1332577
Jul 2 at 3:58