[Python, 데이터분석] 데이터 프레임 합치기, 조인 예제(DataFrame Join, Merge)

2021. 3. 27. 03:09 Python/파이썬 데이터 분석

| 데이터 프레임 조인하기(DataFrame Join)

 

판다스(Pandas)에서는 데이터 프레임간에 SQL문의 테이블 간 조인 연산처럼 데이터 프레임을 합칠 수 있는 기능을 지원합니다.

 

아래 그림에 나와있는 모든 조인 연산을 지원합니다.

 

<출처 : https://www.dofactory.com/sql/join>

 

 

| 데이터 프레임 조인 예제(DataFrame Join Example)

import numpy as np
import pandas as pd

df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
                   {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                   {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
                  index=['Store 1', 'Store 1', 'Store 2'])
df['Date'] = ['December 1', 'January 1', 'mid-May']
df['Delivered'] = True
df['Feedback'] = ['Positive', None, 'Negative']
print(df)
'''
         Cost Item Purchased   Name        Date  Delivered  Feedback
Store 1  22.5         Sponge  Chris  December 1       True  Positive
Store 1   2.5   Kitty Litter  Kevyn   January 1       True      None
Store 2   5.0          Spoon  Filip     mid-May       True  Negative
'''
adf = df.reset_index() # 인덱스 재설정 원래 인덱스를 index column으로 빼고 0~N으로 인덱스 대체
adf['Date'] = pd.Series({0:'December 1', 2:'mid-May'})
print(adf)
'''
     index  Cost Item Purchased    ...           Date Delivered  Feedback
0  Store 1  22.5         Sponge    ...     December 1      True  Positive
1  Store 1   2.5   Kitty Litter    ...            NaN      True      None
2  Store 2   5.0          Spoon    ...        mid-May      True  Negative
'''

staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name') # Name 컬럼을 인덱스로 뺌
print(staff_df)
'''
                 Role
Name                 
Kelly  Director of HR
Sally  Course liasion
James          Grader
'''
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(student_df)
'''
            School
Name              
James     Business
Mike           Law
Sally  Engineering
'''

# staff_df 와 student_df를 인덱스를 기준으로 outer_join
outer_join = pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
print(outer_join)
'''
                 Role       School
Name                              
James          Grader     Business
Kelly  Director of HR          NaN
Mike              NaN          Law
Sally  Course liasion  Engineering
'''
# staff_df 와 student_df를 인덱스를 기준으로 inner_join
inner_join = pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)
print(inner_join)
'''
                 Role       School
Name                              
Sally  Course liasion  Engineering
James          Grader     Business
'''
# staff_df 와 student_df를 인덱스를 기준으로 left_join
left_join = pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
print(left_join)
'''
                 Role       School
Name                              
Kelly  Director of HR          NaN
Sally  Course liasion  Engineering
James          Grader     Busines
'''
# staff_df 와 student_df를 인덱스를 기준으로 right_join
right_join = pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
print(right_join)
'''
                 Role       School
Name                              
James          Grader     Business
Mike              NaN          Law
Sally  Course liasion  Engineering
'''

staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
# Name 컬럼을 기준으로 left 조인
left_join2 = pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
print(left_join2)
'''
    Name            Role       School
0  Kelly  Director of HR          NaN
1  Sally  Course liasion  Engineering
2  James          Grader     Business
'''
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liasion'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
# 복수의 컬럼을 기준으로 join
multiple_join = pd.merge(staff_df, student_df, how='inner', left_on=['First Name','Last Name'], \
                         right_on=['First Name','Last Name'])
print(multiple_join)
'''
  First Name Last Name            Role       School
0      Sally    Brooks  Course liasion  Engineering
'''


참고자료 : https://www.coursera.org/learn/python-data-analysis

출처: https://engkimbs.tistory.com/689?category=763908 [새로비]