Parsing an HTML page with BeautifulSoup in Python

BeautifulSoup is a library commonly used for scraping data.
The code below uses it to parse an HTML page.

Brief explanation of the code

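Getting the value (text) of a specific tag
- After fetching the HTML page, soup.p returns the first <p> tag
- Take the text of that <p> tag and split it into words
- Read the id attribute of that <p> tag
Getting a list of all paragraphs
Getting the paragraphs (<p> tags) whose class is important
Getting every <span> contained inside a <div>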
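
The steps listed above can be tried offline on a small document. This is a minimal sketch, assuming a made-up SAMPLE_HTML string and the standard-library 'html.parser' backend in place of 'html5lib'; neither appears in the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used so the example runs without a network call.
SAMPLE_HTML = """
<html><body>
  <p id="intro">Hello parsing world</p>
  <p class="important">Read this first</p>
  <p class="important note">And this too</p>
  <div><span>inside a div</span></div>
  <span>outside any div</span>
</body></html>
"""

# 'html.parser' ships with Python; the post itself uses 'html5lib'.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Text of the first <p>, split into words.
first_words = soup.p.text.split()

# id attribute of the first <p>; .get() returns None when it is missing.
first_id = soup.p.get("id")

# All paragraphs, then only those carrying the 'important' class.
paragraph_count = len(soup.find_all("p"))
important_texts = [p.text for p in soup("p", "important")]

# Every <span> that sits inside some <div>.
spans_in_divs = [span.text for div in soup("div") for span in div("span")]

print(first_words, first_id, paragraph_count)
print(important_texts, spans_in_divs)
```

A paragraph with several classes, such as "important note", still matches the 'important' filter, and the top-level <span> is skipped because it has no enclosing <div>.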

Code

from bs4 import BeautifulSoup
import requests
# The 'html5lib' parser is a separate package and must be installed alongside bs4.
html = requests.get("http://www.example.com").text
soup = BeautifulSoup(html, 'html5lib')

first_paragraph = soup.find('p') # same as soup.p
first_paragraph_text = soup.p.text
first_paragraph_words = soup.p.text.split()

first_paragraph_id = soup.p['id'] # raises KeyError if the tag has no id
first_paragraph_id2 = soup.p.get('id') # returns None if the tag has no id

all_paragraphs = soup.find_all('p') # same as soup('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]

important_paragraphs = soup('p', {'class': 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p')
                         if 'important' in p.get('class',[])]

# Find every <span> element contained inside a <div> element.
spans_inside_divs = [span
                     for div in soup('div') # for each <div> on the page
                     for span in div('span')] # collect every <span> inside it
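
One thing to watch with the nested comprehension above: when one <div> sits inside another, a <span> in the inner <div> is collected once for each enclosing <div>. The sketch below illustrates this and compares it with BeautifulSoup's select() method and the CSS descendant selector 'div span', which is an alternative and not what the original post uses. NESTED_HTML is a made-up string, and 'html.parser' is assumed instead of 'html5lib'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one <div> nested inside another.
NESTED_HTML = "<div><div><span>a</span></div></div><span>b</span>"

soup = BeautifulSoup(NESTED_HTML, "html.parser")

# Nested comprehension: the inner <span> is found once per enclosing <div>.
spans_nested = [span for div in soup("div") for span in div("span")]

# CSS descendant selector: each matching <span> is returned once.
spans_selected = soup.select("div span")

print(len(spans_nested), len(spans_selected))
```

If duplicates matter for the scrape, the selector form may be the simpler choice; for flat pages without nested <div> elements both approaches give the same list.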

Source: ourcstory.tistory.com/366?category=630693

License: Attribution-NonCommercial-NoDerivatives
