How to split column data and create new DataFrame with multiple columns

I'd like to split the data in the following DataFrame
df = pd.DataFrame(data={'per': np.repeat([10,20,30], 32), 'r': 12*list(range(8)), 'cnt': np.random.randint(300, 400, 96)}); df
cnt per r
0 355 10 0
1 359 10 1
2 347 10 2
3 390 10 3
4 304 10 4
5 306 10 5
.. ... ... ..
87 357 30 7
88 371 30 0
89 396 30 1
90 357 30 2
91 353 30 3
92 306 30 4
93 301 30 5
94 329 30 6
95 312 30 7
[96 rows x 3 columns]
such that for each r value a new column cnt_r{r} exists in a DataFrame, while also keeping the corresponding per column.
The following piece of code almost does what I want, except that it loses the per column:
pd.DataFrame({'cnt_r{}'.format(i): df[df.r==i].reset_index()['cnt'] for i in range(8)})
cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 355 359 347 390 304 306 366 310
1 394 331 384 312 380 350 318 396
2 340 336 360 389 352 370 353 319
...
9 341 300 386 334 386 314 358 326
10 357 386 311 382 356 339 375 357
11 371 396 357 353 306 301 329 312
I need a way to build the following DataFrame:
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 355 359 347 390 304 306 366 310
1 10 394 331 384 312 380 350 318 396
2 10 340 336 360 389 352 370 353 319
...
7 20 384 385 376 323 345 339 339 347
9 30 341 300 386 334 386 314 358 326
10 30 357 386 311 382 356 339 375 357
11 30 371 396 357 353 306 301 329 312
Note that, by construction, my dataset has the same number of values for every combination of per and r. Obviously, my dataset is much larger than the example one (about 800 million records).
Many thanks for your time.
1 Answer
If possible, use reshape to turn cnt into a 2d array and then insert the new column per:
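import numpy as np
import pandas as pd

np.random.seed(1256)
df = pd.DataFrame(data={'per': np.repeat([10,20,30], 32),
                        'r': 12*list(range(8)),
                        'cnt': np.random.randint(300, 400, 96)})
# each row of the reshaped array holds the 8 cnt values of one block (r = 0..7)
df1 = pd.DataFrame(df['cnt'].values.reshape(-1, 8)).add_prefix('cnt_r')
# 3 per values, 4 rows each after the reshape
df1.insert(0, 'per', np.repeat([10,20,30], 4))
print(df1)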
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 365 358 305 311 393 343 340 313
1 10 393 319 358 351 322 387 316 359
2 10 360 301 337 333 322 337 393 396
3 10 320 344 325 310 338 381 314 339
4 20 323 305 342 340 343 319 332 371
5 20 398 308 350 320 340 319 305 369
6 20 344 340 345 332 373 334 304 331
7 20 323 349 301 334 344 374 300 336
8 30 357 375 396 354 309 391 304 334
9 30 311 395 372 359 370 342 351 330
10 30 378 302 306 341 308 392 387 332
11 30 350 373 316 376 338 351 398 304
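The np.repeat call above hard-codes the per values of the example. As a rough sketch for real data, assuming the rows are already ordered so that each consecutive block of 8 rows holds r = 0..7 for a single per value, the per column can be read from the data itself after checking that layout. The helper name wide_from_blocks and its n_r parameter are made up for this illustration and are not part of pandas:

```python
import numpy as np
import pandas as pd


def wide_from_blocks(df, n_r=8):
    """Reshape 'cnt' into n_r columns, taking 'per' from the data itself.

    Assumes rows come in consecutive blocks of n_r with r = 0..n_r-1
    and a single 'per' value per block; raises ValueError otherwise.
    """
    if len(df) % n_r != 0:
        raise ValueError("row count is not a multiple of n_r")
    r = df['r'].to_numpy().reshape(-1, n_r)
    per = df['per'].to_numpy().reshape(-1, n_r)
    # every block must list r in the order 0, 1, ..., n_r-1
    if not (r == np.arange(n_r)).all():
        raise ValueError("r does not cycle 0..n_r-1 within each block")
    # every block must belong to exactly one per value
    if not (per == per[:, [0]]).all():
        raise ValueError("per is not constant within each block")
    out = pd.DataFrame(df['cnt'].to_numpy().reshape(-1, n_r)).add_prefix('cnt_r')
    out.insert(0, 'per', per[:, 0])  # first per value of each block
    return out


df = pd.DataFrame({'per': np.repeat([10, 20, 30], 32),
                   'r': 12 * list(range(8)),
                   'cnt': np.random.randint(300, 400, 96)})
df1 = wide_from_blocks(df)
print(df1.head())
```

If either check fails, the rows are not laid out the way reshape needs, and the groupby-based approach is probably the safer choice.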
Or use cumcount to create new row groups and reshape with set_index and unstack:
# number the rows within each r group, then pivot r into columns
df = (df.set_index([df.groupby('r').cumcount(), 'per','r'])['cnt']
        .unstack()
        .add_prefix('cnt_r')
        .reset_index(level=1)
        .rename_axis(None, axis=1))
print(df)
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 365 358 305 311 393 343 340 313
1 10 393 319 358 351 322 387 316 359
2 10 360 301 337 333 322 337 393 396
3 10 320 344 325 310 338 381 314 339
4 20 323 305 342 340 343 319 332 371
5 20 398 308 350 320 340 319 305 369
6 20 344 340 345 332 373 334 304 331
7 20 323 349 301 334 344 374 300 336
8 30 357 375 396 354 309 391 304 334
9 30 311 395 372 359 370 342 351 330
10 30 378 302 306 341 308 392 387 332
11 30 350 373 316 376 338 351 398 304
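To make the role of cumcount easier to see, here is the same chain on a tiny made-up frame (the name small and its values are invented for this illustration). cumcount numbers the rows within each r group, so the n-th occurrence of every r value ends up in the same output row:

```python
import pandas as pd

small = pd.DataFrame({'per': [10, 10, 10, 10, 20, 20],
                      'r':   [0, 1, 0, 1, 0, 1],
                      'cnt': [1, 2, 3, 4, 5, 6]})

# cumcount numbers the rows within each r group: 0 for the first
# occurrence of a given r, 1 for the second, and so on.
row_id = small.groupby('r').cumcount()
print(row_id.tolist())  # [0, 0, 1, 1, 2, 2]

# row_id becomes the output row, r becomes the output column
wide = (small.set_index([row_id, 'per', 'r'])['cnt']
             .unstack()
             .add_prefix('cnt_r')
             .reset_index(level=1)
             .rename_axis(None, axis=1))
print(wide)
```

Here wide has three rows with per equal to 10, 10, 20 and the columns cnt_r0 and cnt_r1.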
@Claudio - Hmm, it depends on the real data. Is it not possible to use np.repeat? – jezrael
Jul 2 at 14:18
you are too fast... :D
– Claudio
Jul 2 at 14:18
@Claudio - OK, I can check first solution
– jezrael
Jul 2 at 14:20
Hmm, for me your code works, and so does a similar one:
df1.insert(0, 'per', df['per'].values.reshape(-1,8)[:, [0]])
– jezrael
Jul 2 at 14:22
Thank you so much for the enlightening answers! The problem I faced (on my dataset) with the first answer is with the insert. I had to create the per column with the following code:
df1.insert(0, 'per', df['per'].values.reshape(-1,8).transpose()[0])
Otherwise the second solution works like a charm. – Claudio
Jul 2 at 14:16