[Python 데이터 분석] groupby 집계함수 예제(Python groupby)
| 파이썬 집계함수 groupby
groupby는 SQL 문의 group 처럼 어느 특정 컬럼을 묶어 그에 대한 집계연산을 지원하는 함수입니다. 데이터프레임에서 인덱스 혹은 컬럼에 대하여 그 컬럼을 기준으로 그룹을 묶고 난 후 sum 이나 avg 같은 집계 함수를 통해 원하는 데이터를 추출 할 수 있습니다.
다음은 groupby에 대한 예제를 정리한 것입니다.
import pandas as pd
import numpy as np
df = pd.read_csv('census.csv')
# SUMLEV이 50인 데이터를 데이터프레임에서 추출
df = df[df['SUMLEV']==50]
print(df)
'''
SUMLEV REGION ... RNETMIG2014 RNETMIG2015
1 50 3 ... 2.592270 -2.187333
2 50 3 ... 20.317142 18.293499
3 50 3 ... -3.978583 -10.543299
4 50 3 ... 0.754533 1.107861
5 50 3 ... -1.577232 -0.884411
6 50 3 ... 17.243790 -13.193961
7 50 3 ... 1.184103 -6.430868
8 50 3 ... -3.912834 -2.806406
9 50 3 ... -1.290228 2.346901
10 50 3 ... 0.230419 -2.931307
11 50 3 ... 0.957636 -1.752709
12 50 3 ... -1.274984 -5.291205
13 50 3 ... -8.486280 -5.411736
14 50 3 ... 5.776708 3.986270
15 50 3 ... 3.654485 -3.123961
16 50 3 ... -0.767822 5.350738
17 50 3 ... 2.533249 0.588052
18 50 3 ... -14.645416 2.684140
19 50 3 ... -23.049483 -5.387581
20 50 3 ... 3.750759 -0.264121
21 50 3 ... 4.734577 5.087600
22 50 3 ... 5.593387 8.417777
23 50 3 ... -9.503292 -1.998668
24 50 3 ... -9.368689 -14.711389
25 50 3 ... 1.198187 0.956790
26 50 3 ... -0.951175 2.757093
27 50 3 ... -1.136950 1.243830
28 50 3 ... -1.234901 -1.588308
29 50 3 ... 0.237396 -2.857058
30 50 3 ... 1.553115 0.442422
... ... ... ... ... ...
3162 50 2 ... 3.638104 -4.995197
3163 50 2 ... 2.104796 0.059931
3164 50 2 ... 2.077633 0.593567
3165 50 2 ... -0.134227 -0.173022
3166 50 2 ... -4.535615 -4.024395
3167 50 2 ... -1.545153 -3.685304
3168 50 2 ... -5.040889 -3.414223
3170 50 4 ... 1.429233 -5.166460
3171 50 4 ... -7.608378 5.513554
3172 50 4 ... -8.509402 10.978525
3173 50 4 ... -2.338590 -22.600668
3174 50 4 ... -19.358233 1.126443
3175 50 4 ... 7.071547 19.309219
3176 50 4 ... -12.013555 -13.352750
3177 50 4 ... -5.761986 -10.635133
3178 50 4 ... -6.224712 -5.663940
3179 50 4 ... -10.715742 0.933652
3180 50 4 ... -0.207819 1.673640
3181 50 4 ... 6.234414 1.662823
3182 50 4 ... -0.110593 0.793743
3183 50 4 ... -12.603387 7.492114
3184 50 4 ... -5.585731 0.856839
3185 50 4 ... 5.598720 4.207414
3186 50 4 ... 4.781489 -2.198937
3187 50 4 ... -14.409938 -20.664059
3188 50 4 ... -14.075283 -14.070195
3189 50 4 ... 16.308671 1.520747
3190 50 4 ... -14.740608 -12.606351
3191 50 4 ... -18.020168 1.441961
3192 50 4 ... 1.533635 6.935294
[3142 rows x 100 columns]
'''
# STNAME 컬럼을 기준으로 집계, group과 frame으로 나눌 수 있음
for group, frame in df.groupby('STNAME'):
avg = np.average(frame['CENSUS2010POP'])
print('Countries in state ' + group + ' have an average population of ' + str(avg))
'''
Countries in state Alabama have an average population of 71339.34328358209
Countries in state Alaska have an average population of 24490.724137931036
Countries in state Arizona have an average population of 426134.4666666667
Countries in state Arkansas have an average population of 38878.90666666667
Countries in state California have an average population of 642309.5862068966
Countries in state Colorado have an average population of 78581.1875
Countries in state Connecticut have an average population of 446762.125
Countries in state Delaware have an average population of 299311.3333333333
Countries in state District of Columbia have an average population of 601723.0
Countries in state Florida have an average population of 280616.5671641791
Countries in state Georgia have an average population of 60928.63522012578
Countries in state Hawaii have an average population of 272060.2
Countries in state Idaho have an average population of 35626.86363636364
Countries in state Illinois have an average population of 125790.50980392157
Countries in state Indiana have an average population of 70476.10869565218
Countries in state Iowa have an average population of 30771.262626262625
Countries in state Kansas have an average population of 27172.55238095238
Countries in state Kentucky have an average population of 36161.39166666667
Countries in state Louisiana have an average population of 70833.9375
Countries in state Maine have an average population of 83022.5625
Countries in state Maryland have an average population of 240564.66666666666
Countries in state Massachusetts have an average population of 467687.78571428574
Countries in state Michigan have an average population of 119080.0
Countries in state Minnesota have an average population of 60964.65517241379
Countries in state Mississippi have an average population of 36186.54878048781
Countries in state Missouri have an average population of 52077.62608695652
Countries in state Montana have an average population of 17668.125
Countries in state Nebraska have an average population of 19638.075268817203
Countries in state Nevada have an average population of 158855.9411764706
Countries in state New Hampshire have an average population of 131647.0
Countries in state New Jersey have an average population of 418661.61904761905
Countries in state New Mexico have an average population of 62399.36363636364
Countries in state New York have an average population of 312550.03225806454
Countries in state North Carolina have an average population of 95354.83
Countries in state North Dakota have an average population of 12690.396226415094
Countries in state Ohio have an average population of 131096.63636363635
Countries in state Oklahoma have an average population of 48718.844155844155
Countries in state Oregon have an average population of 106418.72222222222
Countries in state Pennsylvania have an average population of 189587.74626865672
Countries in state Rhode Island have an average population of 210513.4
Countries in state South Carolina have an average population of 100551.39130434782
Countries in state South Dakota have an average population of 12336.060606060606
Countries in state Tennessee have an average population of 66801.1052631579
Countries in state Texas have an average population of 98998.27165354331
Countries in state Utah have an average population of 95306.37931034483
Countries in state Vermont have an average population of 44695.78571428572
Countries in state Virginia have an average population of 60111.29323308271
Countries in state Washington have an average population of 172424.10256410256
Countries in state West Virginia have an average population of 33690.8
Countries in state Wisconsin have an average population of 78985.91666666667
Countries in state Wyoming have an average population of 24505.478260869564
'''
# 인덱스를 STNAME으로 한 후, fun 함수를 groupby에 인자로 넣어 데이터 추출
# 이때, fun에 들어가는 item 인수는 인덱스를 뜻함.
df = df.set_index('STNAME')
def fun(item):
if item[0] < 'M':
return 0
if item[0] < 'Q':
return 1
return 2
for group, frame in df.groupby(fun):
print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing')
'''
There are 1177 records in group 0 for processing
There are 1134 records in group 1 for processing
There are 831 records in group 2 for processing
'''
df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]
print(df)
'''
SUMLEV REGION ... RNETMIG2014 RNETMIG2015
1 50 3 ... 2.592270 -2.187333
2 50 3 ... 20.317142 18.293499
3 50 3 ... -3.978583 -10.543299
4 50 3 ... 0.754533 1.107861
5 50 3 ... -1.577232 -0.884411
6 50 3 ... 17.243790 -13.193961
7 50 3 ... 1.184103 -6.430868
8 50 3 ... -3.912834 -2.806406
9 50 3 ... -1.290228 2.346901
10 50 3 ... 0.230419 -2.931307
11 50 3 ... 0.957636 -1.752709
12 50 3 ... -1.274984 -5.291205
13 50 3 ... -8.486280 -5.411736
14 50 3 ... 5.776708 3.986270
15 50 3 ... 3.654485 -3.123961
16 50 3 ... -0.767822 5.350738
17 50 3 ... 2.533249 0.588052
18 50 3 ... -14.645416 2.684140
19 50 3 ... -23.049483 -5.387581
20 50 3 ... 3.750759 -0.264121
21 50 3 ... 4.734577 5.087600
22 50 3 ... 5.593387 8.417777
23 50 3 ... -9.503292 -1.998668
24 50 3 ... -9.368689 -14.711389
25 50 3 ... 1.198187 0.956790
26 50 3 ... -0.951175 2.757093
27 50 3 ... -1.136950 1.243830
28 50 3 ... -1.234901 -1.588308
29 50 3 ... 0.237396 -2.857058
30 50 3 ... 1.553115 0.442422
... ... ... ... ... ...
3162 50 2 ... 3.638104 -4.995197
3163 50 2 ... 2.104796 0.059931
3164 50 2 ... 2.077633 0.593567
3165 50 2 ... -0.134227 -0.173022
3166 50 2 ... -4.535615 -4.024395
3167 50 2 ... -1.545153 -3.685304
3168 50 2 ... -5.040889 -3.414223
3170 50 4 ... 1.429233 -5.166460
3171 50 4 ... -7.608378 5.513554
3172 50 4 ... -8.509402 10.978525
3173 50 4 ... -2.338590 -22.600668
3174 50 4 ... -19.358233 1.126443
3175 50 4 ... 7.071547 19.309219
3176 50 4 ... -12.013555 -13.352750
3177 50 4 ... -5.761986 -10.635133
3178 50 4 ... -6.224712 -5.663940
3179 50 4 ... -10.715742 0.933652
3180 50 4 ... -0.207819 1.673640
3181 50 4 ... 6.234414 1.662823
3182 50 4 ... -0.110593 0.793743
3183 50 4 ... -12.603387 7.492114
3184 50 4 ... -5.585731 0.856839
3185 50 4 ... 5.598720 4.207414
3186 50 4 ... 4.781489 -2.198937
3187 50 4 ... -14.409938 -20.664059
3188 50 4 ... -14.075283 -14.070195
3189 50 4 ... 16.308671 1.520747
3190 50 4 ... -14.740608 -12.606351
3191 50 4 ... -18.020168 1.441961
3192 50 4 ... 1.533635 6.935294
[3142 rows x 100 columns]
'''
# STNAME을 통해 집계를 한 후, CENSUS2010POP 컬럼을 기준으로 avg 집계
df_avg = df.groupby('STNAME').agg({'CENSUS2010POP':np.average})
print(df_avg)
'''
CENSUS2010POP
STNAME
Alabama 71339.343284
Alaska 24490.724138
Arizona 426134.466667
Arkansas 38878.906667
California 642309.586207
Colorado 78581.187500
Connecticut 446762.125000
Delaware 299311.333333
District of Columbia 601723.000000
Florida 280616.567164
Georgia 60928.635220
Hawaii 272060.200000
Idaho 35626.863636
Illinois 125790.509804
Indiana 70476.108696
Iowa 30771.262626
Kansas 27172.552381
Kentucky 36161.391667
Louisiana 70833.937500
Maine 83022.562500
Maryland 240564.666667
Massachusetts 467687.785714
Michigan 119080.000000
Minnesota 60964.655172
Mississippi 36186.548780
Missouri 52077.626087
Montana 17668.125000
Nebraska 19638.075269
Nevada 158855.941176
New Hampshire 131647.000000
New Jersey 418661.619048
New Mexico 62399.363636
New York 312550.032258
North Carolina 95354.830000
North Dakota 12690.396226
Ohio 131096.636364
Oklahoma 48718.844156
Oregon 106418.722222
Pennsylvania 189587.746269
Rhode Island 210513.400000
South Carolina 100551.391304
South Dakota 12336.060606
Tennessee 66801.105263
Texas 98998.271654
Utah 95306.379310
Vermont 44695.785714
Virginia 60111.293233
Washington 172424.102564
West Virginia 33690.800000
Wisconsin 78985.916667
Wyoming 24505.478261
'''
# STNAME을 인덱스로 함. level=0은 멀티인덱스일 시 맨 첫번째 인덱스 기준으로
# 집계하라는 뜻. agg 함수에 전체 평균과 합을 동시에 쓸 수 있음
census_avg_sum = (df.set_index('STNAME').groupby(level=0)['CENSUS2010POP']
.agg({'avg':np.average, 'sum':np.sum}))
print(census_avg_sum)
'''
avg sum
STNAME
Alabama 71339.343284 4779736
Alaska 24490.724138 710231
Arizona 426134.466667 6392017
Arkansas 38878.906667 2915918
California 642309.586207 37253956
Colorado 78581.187500 5029196
Connecticut 446762.125000 3574097
Delaware 299311.333333 897934
District of Columbia 601723.000000 601723
Florida 280616.567164 18801310
Georgia 60928.635220 9687653
Hawaii 272060.200000 1360301
Idaho 35626.863636 1567582
Illinois 125790.509804 12830632
Indiana 70476.108696 6483802
Iowa 30771.262626 3046355
Kansas 27172.552381 2853118
Kentucky 36161.391667 4339367
Louisiana 70833.937500 4533372
Maine 83022.562500 1328361
Maryland 240564.666667 5773552
Massachusetts 467687.785714 6547629
Michigan 119080.000000 9883640
Minnesota 60964.655172 5303925
Mississippi 36186.548780 2967297
Missouri 52077.626087 5988927
Montana 17668.125000 989415
Nebraska 19638.075269 1826341
Nevada 158855.941176 2700551
New Hampshire 131647.000000 1316470
New Jersey 418661.619048 8791894
New Mexico 62399.363636 2059179
New York 312550.032258 19378102
North Carolina 95354.830000 9535483
North Dakota 12690.396226 672591
Ohio 131096.636364 11536504
Oklahoma 48718.844156 3751351
Oregon 106418.722222 3831074
Pennsylvania 189587.746269 12702379
Rhode Island 210513.400000 1052567
South Carolina 100551.391304 4625364
South Dakota 12336.060606 814180
Tennessee 66801.105263 6346105
Texas 98998.271654 25145561
Utah 95306.379310 2763885
Vermont 44695.785714 625741
Virginia 60111.293233 7994802
Washington 172424.102564 6724540
West Virginia 33690.800000 1852994
Wisconsin 78985.916667 5686986
Wyoming 24505.478261 563626
'''
pop_avg_sum = df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010', 'POPESTIMATE2011'] \
.agg({'avg':np.average, 'sum':np.sum})
print(pop_avg_sum)
'''
avg ... sum
POPESTIMATE2010 ... POPESTIMATE2011
STNAME ...
Alabama 71420.313433 ... 4801108
Alaska 24621.413793 ... 722720
Arizona 427213.866667 ... 6468732
Arkansas 38965.253333 ... 2938538
California 643691.017241 ... 37700034
Colorado 78878.968750 ... 5119480
Connecticut 447464.625000 ... 3589759
Delaware 299930.333333 ... 907916
District of Columbia 605126.000000 ... 620472
Florida 281341.641791 ... 19105533
Georgia 61090.905660 ... 9812280
Hawaii 272796.000000 ... 1378227
Idaho 35704.227273 ... 1584134
Illinois 125894.598039 ... 12861882
Indiana 70549.891304 ... 6516845
Iowa 30815.090909 ... 3065389
Kansas 27226.895238 ... 2869917
Kentucky 36232.808333 ... 4367882
Louisiana 71014.859375 ... 4575381
Maine 82980.937500 ... 1328257
Maryland 241183.708333 ... 5844171
Massachusetts 468931.142857 ... 6611797
Michigan 119004.445783 ... 9876589
Minnesota 61044.862069 ... 5348119
Mississippi 36223.365854 ... 2977999
Missouri 52139.582609 ... 6010587
Montana 17690.053571 ... 997746
Nebraska 19677.688172 ... 1842383
Nevada 159025.882353 ... 2718819
New Hampshire 131670.800000 ... 1318344
New Jersey 419232.428571 ... 8842934
New Mexico 62567.909091 ... 2078226
New York 312950.322581 ... 19523202
North Carolina 95589.790000 ... 9651025
North Dakota 12726.981132 ... 685326
Ohio 131145.068182 ... 11545442
Oklahoma 48825.922078 ... 3786626
Oregon 106610.333333 ... 3868509
Pennsylvania 189731.552239 ... 12745202
Rhode Island 210643.800000 ... 1051856
South Carolina 100780.304348 ... 4672733
South Dakota 12368.166667 ... 824289
Tennessee 66911.421053 ... 6398408
Texas 99387.255906 ... 25654464
Utah 95704.344828 ... 2816440
Vermont 44713.142857 ... 626687
Virginia 60344.263158 ... 8110783
Washington 172898.974359 ... 6823229
West Virginia 33713.181818 ... 1854948
Wisconsin 79030.611111 ... 5709720
Wyoming 24544.173913 ... 567768
'''
참고자료 : https://www.coursera.org/learn/python-data-analysis
출처: https://engkimbs.tistory.com/720?category=763908 [새로비]