[Python 데이터 분석] groupby 집계함수 예제(Python groupby)

2021. 3. 27. 03:14 Python/파이썬 데이터 분석

| 파이썬 집계함수 groupby

 

groupby SQL 문의 group 처럼 어느 특정 컬럼을 묶어 그에 대한 집계연산을 지원하는 함수입니다. 데이터프레임에서 인덱스 혹은 컬럼에 대하여 그 컬럼을 기준으로 그룹을 묶고 난 후 sum 이나 avg 같은 집계 함수를 통해 원하는 데이터를 추출 할 수 있습니다.

 

다음은 groupby에 대한 예제를 정리한 것입니다.

 

census.csv
2.06MB

 

import pandas as pd
import numpy as np

df = pd.read_csv('census.csv')
# SUMLEV이 50인 데이터를 데이터프레임에서 추출
df = df[df['SUMLEV']==50]
print(df)
'''
      SUMLEV  REGION     ...       RNETMIG2014  RNETMIG2015
1         50       3     ...          2.592270    -2.187333
2         50       3     ...         20.317142    18.293499
3         50       3     ...         -3.978583   -10.543299
4         50       3     ...          0.754533     1.107861
5         50       3     ...         -1.577232    -0.884411
6         50       3     ...         17.243790   -13.193961
7         50       3     ...          1.184103    -6.430868
8         50       3     ...         -3.912834    -2.806406
9         50       3     ...         -1.290228     2.346901
10        50       3     ...          0.230419    -2.931307
11        50       3     ...          0.957636    -1.752709
12        50       3     ...         -1.274984    -5.291205
13        50       3     ...         -8.486280    -5.411736
14        50       3     ...          5.776708     3.986270
15        50       3     ...          3.654485    -3.123961
16        50       3     ...         -0.767822     5.350738
17        50       3     ...          2.533249     0.588052
18        50       3     ...        -14.645416     2.684140
19        50       3     ...        -23.049483    -5.387581
20        50       3     ...          3.750759    -0.264121
21        50       3     ...          4.734577     5.087600
22        50       3     ...          5.593387     8.417777
23        50       3     ...         -9.503292    -1.998668
24        50       3     ...         -9.368689   -14.711389
25        50       3     ...          1.198187     0.956790
26        50       3     ...         -0.951175     2.757093
27        50       3     ...         -1.136950     1.243830
28        50       3     ...         -1.234901    -1.588308
29        50       3     ...          0.237396    -2.857058
30        50       3     ...          1.553115     0.442422
...      ...     ...     ...               ...          ...
3162      50       2     ...          3.638104    -4.995197
3163      50       2     ...          2.104796     0.059931
3164      50       2     ...          2.077633     0.593567
3165      50       2     ...         -0.134227    -0.173022
3166      50       2     ...         -4.535615    -4.024395
3167      50       2     ...         -1.545153    -3.685304
3168      50       2     ...         -5.040889    -3.414223
3170      50       4     ...          1.429233    -5.166460
3171      50       4     ...         -7.608378     5.513554
3172      50       4     ...         -8.509402    10.978525
3173      50       4     ...         -2.338590   -22.600668
3174      50       4     ...        -19.358233     1.126443
3175      50       4     ...          7.071547    19.309219
3176      50       4     ...        -12.013555   -13.352750
3177      50       4     ...         -5.761986   -10.635133
3178      50       4     ...         -6.224712    -5.663940
3179      50       4     ...        -10.715742     0.933652
3180      50       4     ...         -0.207819     1.673640
3181      50       4     ...          6.234414     1.662823
3182      50       4     ...         -0.110593     0.793743
3183      50       4     ...        -12.603387     7.492114
3184      50       4     ...         -5.585731     0.856839
3185      50       4     ...          5.598720     4.207414
3186      50       4     ...          4.781489    -2.198937
3187      50       4     ...        -14.409938   -20.664059
3188      50       4     ...        -14.075283   -14.070195
3189      50       4     ...         16.308671     1.520747
3190      50       4     ...        -14.740608   -12.606351
3191      50       4     ...        -18.020168     1.441961
3192      50       4     ...          1.533635     6.935294

[3142 rows x 100 columns]
'''
# STNAME 컬럼을 기준으로 집계, group과 frame으로 나눌 수 있음
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Countries in state ' + group + ' have an average population of ' + str(avg))
'''
Countries in state Alabama have an average population of 71339.34328358209
Countries in state Alaska have an average population of 24490.724137931036
Countries in state Arizona have an average population of 426134.4666666667
Countries in state Arkansas have an average population of 38878.90666666667
Countries in state California have an average population of 642309.5862068966
Countries in state Colorado have an average population of 78581.1875
Countries in state Connecticut have an average population of 446762.125
Countries in state Delaware have an average population of 299311.3333333333
Countries in state District of Columbia have an average population of 601723.0
Countries in state Florida have an average population of 280616.5671641791
Countries in state Georgia have an average population of 60928.63522012578
Countries in state Hawaii have an average population of 272060.2
Countries in state Idaho have an average population of 35626.86363636364
Countries in state Illinois have an average population of 125790.50980392157
Countries in state Indiana have an average population of 70476.10869565218
Countries in state Iowa have an average population of 30771.262626262625
Countries in state Kansas have an average population of 27172.55238095238
Countries in state Kentucky have an average population of 36161.39166666667
Countries in state Louisiana have an average population of 70833.9375
Countries in state Maine have an average population of 83022.5625
Countries in state Maryland have an average population of 240564.66666666666
Countries in state Massachusetts have an average population of 467687.78571428574
Countries in state Michigan have an average population of 119080.0
Countries in state Minnesota have an average population of 60964.65517241379
Countries in state Mississippi have an average population of 36186.54878048781
Countries in state Missouri have an average population of 52077.62608695652
Countries in state Montana have an average population of 17668.125
Countries in state Nebraska have an average population of 19638.075268817203
Countries in state Nevada have an average population of 158855.9411764706
Countries in state New Hampshire have an average population of 131647.0
Countries in state New Jersey have an average population of 418661.61904761905
Countries in state New Mexico have an average population of 62399.36363636364
Countries in state New York have an average population of 312550.03225806454
Countries in state North Carolina have an average population of 95354.83
Countries in state North Dakota have an average population of 12690.396226415094
Countries in state Ohio have an average population of 131096.63636363635
Countries in state Oklahoma have an average population of 48718.844155844155
Countries in state Oregon have an average population of 106418.72222222222
Countries in state Pennsylvania have an average population of 189587.74626865672
Countries in state Rhode Island have an average population of 210513.4
Countries in state South Carolina have an average population of 100551.39130434782
Countries in state South Dakota have an average population of 12336.060606060606
Countries in state Tennessee have an average population of 66801.1052631579
Countries in state Texas have an average population of 98998.27165354331
Countries in state Utah have an average population of 95306.37931034483
Countries in state Vermont have an average population of 44695.78571428572
Countries in state Virginia have an average population of 60111.29323308271
Countries in state Washington have an average population of 172424.10256410256
Countries in state West Virginia have an average population of 33690.8
Countries in state Wisconsin have an average population of 78985.91666666667
Countries in state Wyoming have an average population of 24505.478260869564
'''

# 인덱스를 STNAME으로 한 후, fun 함수를 groupby에 인자로 넣어 데이터 추출
# 이때, fun에 들어가는 item 인수는 인덱스를 뜻함.
df = df.set_index('STNAME')
def fun(item):
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

for group, frame in df.groupby(fun):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing')
'''
There are 1177 records in group 0 for processing
There are 1134 records in group 1 for processing
There are 831 records in group 2 for processing
'''
df = pd.read_csv('census.csv')
df = df[df['SUMLEV']==50]
print(df)
'''
      SUMLEV  REGION     ...       RNETMIG2014  RNETMIG2015
1         50       3     ...          2.592270    -2.187333
2         50       3     ...         20.317142    18.293499
3         50       3     ...         -3.978583   -10.543299
4         50       3     ...          0.754533     1.107861
5         50       3     ...         -1.577232    -0.884411
6         50       3     ...         17.243790   -13.193961
7         50       3     ...          1.184103    -6.430868
8         50       3     ...         -3.912834    -2.806406
9         50       3     ...         -1.290228     2.346901
10        50       3     ...          0.230419    -2.931307
11        50       3     ...          0.957636    -1.752709
12        50       3     ...         -1.274984    -5.291205
13        50       3     ...         -8.486280    -5.411736
14        50       3     ...          5.776708     3.986270
15        50       3     ...          3.654485    -3.123961
16        50       3     ...         -0.767822     5.350738
17        50       3     ...          2.533249     0.588052
18        50       3     ...        -14.645416     2.684140
19        50       3     ...        -23.049483    -5.387581
20        50       3     ...          3.750759    -0.264121
21        50       3     ...          4.734577     5.087600
22        50       3     ...          5.593387     8.417777
23        50       3     ...         -9.503292    -1.998668
24        50       3     ...         -9.368689   -14.711389
25        50       3     ...          1.198187     0.956790
26        50       3     ...         -0.951175     2.757093
27        50       3     ...         -1.136950     1.243830
28        50       3     ...         -1.234901    -1.588308
29        50       3     ...          0.237396    -2.857058
30        50       3     ...          1.553115     0.442422
...      ...     ...     ...               ...          ...
3162      50       2     ...          3.638104    -4.995197
3163      50       2     ...          2.104796     0.059931
3164      50       2     ...          2.077633     0.593567
3165      50       2     ...         -0.134227    -0.173022
3166      50       2     ...         -4.535615    -4.024395
3167      50       2     ...         -1.545153    -3.685304
3168      50       2     ...         -5.040889    -3.414223
3170      50       4     ...          1.429233    -5.166460
3171      50       4     ...         -7.608378     5.513554
3172      50       4     ...         -8.509402    10.978525
3173      50       4     ...         -2.338590   -22.600668
3174      50       4     ...        -19.358233     1.126443
3175      50       4     ...          7.071547    19.309219
3176      50       4     ...        -12.013555   -13.352750
3177      50       4     ...         -5.761986   -10.635133
3178      50       4     ...         -6.224712    -5.663940
3179      50       4     ...        -10.715742     0.933652
3180      50       4     ...         -0.207819     1.673640
3181      50       4     ...          6.234414     1.662823
3182      50       4     ...         -0.110593     0.793743
3183      50       4     ...        -12.603387     7.492114
3184      50       4     ...         -5.585731     0.856839
3185      50       4     ...          5.598720     4.207414
3186      50       4     ...          4.781489    -2.198937
3187      50       4     ...        -14.409938   -20.664059
3188      50       4     ...        -14.075283   -14.070195
3189      50       4     ...         16.308671     1.520747
3190      50       4     ...        -14.740608   -12.606351
3191      50       4     ...        -18.020168     1.441961
3192      50       4     ...          1.533635     6.935294

[3142 rows x 100 columns]
'''
# STNAME을 통해 집계를 한 후, CENSUS2010POP 컬럼을 기준으로 avg 집계
df_avg = df.groupby('STNAME').agg({'CENSUS2010POP':np.average})
print(df_avg)
'''
                      CENSUS2010POP
STNAME                             
Alabama                71339.343284
Alaska                 24490.724138
Arizona               426134.466667
Arkansas               38878.906667
California            642309.586207
Colorado               78581.187500
Connecticut           446762.125000
Delaware              299311.333333
District of Columbia  601723.000000
Florida               280616.567164
Georgia                60928.635220
Hawaii                272060.200000
Idaho                  35626.863636
Illinois              125790.509804
Indiana                70476.108696
Iowa                   30771.262626
Kansas                 27172.552381
Kentucky               36161.391667
Louisiana              70833.937500
Maine                  83022.562500
Maryland              240564.666667
Massachusetts         467687.785714
Michigan              119080.000000
Minnesota              60964.655172
Mississippi            36186.548780
Missouri               52077.626087
Montana                17668.125000
Nebraska               19638.075269
Nevada                158855.941176
New Hampshire         131647.000000
New Jersey            418661.619048
New Mexico             62399.363636
New York              312550.032258
North Carolina         95354.830000
North Dakota           12690.396226
Ohio                  131096.636364
Oklahoma               48718.844156
Oregon                106418.722222
Pennsylvania          189587.746269
Rhode Island          210513.400000
South Carolina        100551.391304
South Dakota           12336.060606
Tennessee              66801.105263
Texas                  98998.271654
Utah                   95306.379310
Vermont                44695.785714
Virginia               60111.293233
Washington            172424.102564
West Virginia          33690.800000
Wisconsin              78985.916667
Wyoming                24505.478261
'''
# STNAME을 인덱스로 함. level=0은 멀티인덱스일 시 맨 첫번째 인덱스 기준으로
# 집계하라는 뜻. agg 함수에 전체 평균과 합을 동시에 쓸 수 있음
census_avg_sum = (df.set_index('STNAME').groupby(level=0)['CENSUS2010POP']
                    .agg({'avg':np.average, 'sum':np.sum}))
print(census_avg_sum)
'''
                                avg       sum
STNAME                                       
Alabama                71339.343284   4779736
Alaska                 24490.724138    710231
Arizona               426134.466667   6392017
Arkansas               38878.906667   2915918
California            642309.586207  37253956
Colorado               78581.187500   5029196
Connecticut           446762.125000   3574097
Delaware              299311.333333    897934
District of Columbia  601723.000000    601723
Florida               280616.567164  18801310
Georgia                60928.635220   9687653
Hawaii                272060.200000   1360301
Idaho                  35626.863636   1567582
Illinois              125790.509804  12830632
Indiana                70476.108696   6483802
Iowa                   30771.262626   3046355
Kansas                 27172.552381   2853118
Kentucky               36161.391667   4339367
Louisiana              70833.937500   4533372
Maine                  83022.562500   1328361
Maryland              240564.666667   5773552
Massachusetts         467687.785714   6547629
Michigan              119080.000000   9883640
Minnesota              60964.655172   5303925
Mississippi            36186.548780   2967297
Missouri               52077.626087   5988927
Montana                17668.125000    989415
Nebraska               19638.075269   1826341
Nevada                158855.941176   2700551
New Hampshire         131647.000000   1316470
New Jersey            418661.619048   8791894
New Mexico             62399.363636   2059179
New York              312550.032258  19378102
North Carolina         95354.830000   9535483
North Dakota           12690.396226    672591
Ohio                  131096.636364  11536504
Oklahoma               48718.844156   3751351
Oregon                106418.722222   3831074
Pennsylvania          189587.746269  12702379
Rhode Island          210513.400000   1052567
South Carolina        100551.391304   4625364
South Dakota           12336.060606    814180
Tennessee              66801.105263   6346105
Texas                  98998.271654  25145561
Utah                   95306.379310   2763885
Vermont                44695.785714    625741
Virginia               60111.293233   7994802
Washington            172424.102564   6724540
West Virginia          33690.800000   1852994
Wisconsin              78985.916667   5686986
Wyoming                24505.478261    563626
'''
pop_avg_sum = df.set_index('STNAME').groupby(level=0)['POPESTIMATE2010', 'POPESTIMATE2011'] \
                .agg({'avg':np.average, 'sum':np.sum})
print(pop_avg_sum)
'''
                                 avg       ...                   sum
                     POPESTIMATE2010       ...       POPESTIMATE2011
STNAME                                     ...                      
Alabama                 71420.313433       ...               4801108
Alaska                  24621.413793       ...                722720
Arizona                427213.866667       ...               6468732
Arkansas                38965.253333       ...               2938538
California             643691.017241       ...              37700034
Colorado                78878.968750       ...               5119480
Connecticut            447464.625000       ...               3589759
Delaware               299930.333333       ...                907916
District of Columbia   605126.000000       ...                620472
Florida                281341.641791       ...              19105533
Georgia                 61090.905660       ...               9812280
Hawaii                 272796.000000       ...               1378227
Idaho                   35704.227273       ...               1584134
Illinois               125894.598039       ...              12861882
Indiana                 70549.891304       ...               6516845
Iowa                    30815.090909       ...               3065389
Kansas                  27226.895238       ...               2869917
Kentucky                36232.808333       ...               4367882
Louisiana               71014.859375       ...               4575381
Maine                   82980.937500       ...               1328257
Maryland               241183.708333       ...               5844171
Massachusetts          468931.142857       ...               6611797
Michigan               119004.445783       ...               9876589
Minnesota               61044.862069       ...               5348119
Mississippi             36223.365854       ...               2977999
Missouri                52139.582609       ...               6010587
Montana                 17690.053571       ...                997746
Nebraska                19677.688172       ...               1842383
Nevada                 159025.882353       ...               2718819
New Hampshire          131670.800000       ...               1318344
New Jersey             419232.428571       ...               8842934
New Mexico              62567.909091       ...               2078226
New York               312950.322581       ...              19523202
North Carolina          95589.790000       ...               9651025
North Dakota            12726.981132       ...                685326
Ohio                   131145.068182       ...              11545442
Oklahoma                48825.922078       ...               3786626
Oregon                 106610.333333       ...               3868509
Pennsylvania           189731.552239       ...              12745202
Rhode Island           210643.800000       ...               1051856
South Carolina         100780.304348       ...               4672733
South Dakota            12368.166667       ...                824289
Tennessee               66911.421053       ...               6398408
Texas                   99387.255906       ...              25654464
Utah                    95704.344828       ...               2816440
Vermont                 44713.142857       ...                626687
Virginia                60344.263158       ...               8110783
Washington             172898.974359       ...               6823229
West Virginia           33713.181818       ...               1854948
Wisconsin               79030.611111       ...               5709720
Wyoming                 24544.173913       ...                567768
'''

 

참고자료 : https://www.coursera.org/learn/python-data-analysis

출처: https://engkimbs.tistory.com/720?category=763908 [새로비]