Crawling - The BeautifulSoup Library

Crawling, in short, is the task of scraping the data you want from a website. The Python ecosystem offers a variety of libraries for it.

We will first use the requests library to fetch the entire HTML document at a URL, and then parse it with a library called BeautifulSoup.

import requests  
from bs4 import BeautifulSoup  
BASE_URL = 'http://search.naver.com/search.naver?where=post&sm=tab_jum&ie=utf8&query=%EA%B0%9C%EB%B0%9C%EC%9E%90'  
response = requests.get(BASE_URL)

First, we use requests to fetch all the HTML data at BASE_URL and store the resulting Response object in a variable named response. For reference, BASE_URL above is the URL produced by searching for '개발자' (developer) in the blog tab of NAVER.

response.status_code

If status_code prints 200, the page was fetched successfully.
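The status check above can be folded into a small helper. The following is only a sketch: the name fetch_html, the optional session parameter, and the 10-second timeout are assumptions made for illustration and are not part of the original walkthrough.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the response body as bytes, raising on an HTTP error status.

    `session` is a hypothetical hook: any object with a requests-style
    get() method can be passed in, which makes the helper easy to test
    without touching the network.
    """
    http = requests if session is None else session
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so a silent non-200 result cannot slip through.
    response.raise_for_status()
    return response.content
```

Calling fetch_html(BASE_URL) would then return the same bytes as response.content, or raise instead of handing back an error page.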

response.content

If you print content, the HTML document comes out as raw bytes that are quite hard to read.

Now that requests has fetched the HTML document successfully, let's parse it with BeautifulSoup and pull out only the meaningful data.

First, pass the HTML document we just fetched and the parser to use to the BeautifulSoup constructor to create an instance. (The documentation covers the available parsers in detail.)

dom = BeautifulSoup(response.content, "html.parser")

As the following sentence from the BeautifulSoup documentation puts it,

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.

BeautifulSoup turns the many HTML tags inside an HTML document into Python objects that are easy to work with.

Because every HTML tag has become an object, tags can be accessed as attributes like this:

dom.html  
dom.html.head
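To make the attribute-style navigation concrete without depending on a live page, here is a minimal sketch that parses a small inline document. The HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document standing in for a fetched page.
html = """
<html>
  <head><title>Sample page</title></head>
  <body><p class="intro">Hello</p></body>
</html>
"""

dom = BeautifulSoup(html, "html.parser")

head = dom.html.head         # Tag object for <head>
title = dom.html.head.title  # Tag object for <title>

print(head.name)            # the tag's name
print(title.string)         # the text inside <title>
print(dom.body.p["class"])  # class is multi-valued, so this is a list
```

Each step of the dotted path returns the first matching descendant tag, so dom.html.head.title walks down the tree one level at a time.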

Beyond this, the BeautifulSoup class has many built-in methods that make parsing convenient, so extracting the data you want is easy.

The select method takes a CSS selector. Here it returns a list whose elements are the li tags with the class 'sh_blog_top'.

post_elements = dom.select("li.sh_blog_top")

To get a single element, use the select_one method, which returns the first match.

post_element = post_elements[0]  
title_element = post_element.select_one("a.sh_blog_title")
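The difference between select and select_one can be checked on a small inline snippet. This is a sketch only: the markup below is invented and merely borrows the class names used above, and the example.com URLs are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup that reuses the class names from the search results page.
html = """
<ul>
  <li class="sh_blog_top"><a class="sh_blog_title" href="http://example.com/1" title="First post">First post</a></li>
  <li class="sh_blog_top"><a class="sh_blog_title" href="http://example.com/2" title="Second post">Second post</a></li>
  <li class="other">Not a post</li>
</ul>
"""

dom = BeautifulSoup(html, "html.parser")

# select() returns every match as a list-like result.
post_elements = dom.select("li.sh_blog_top")

# select_one() returns only the first match, or None when nothing matches.
first_title = post_elements[0].select_one("a.sh_blog_title")
missing = post_elements[0].select_one("a.no_such_class")

print(len(post_elements))
print(first_title["href"])
print(missing)
```

Because select_one can return None, it is worth checking the result before using it when the page layout might change.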

Printing title_element shows an a tag like the following.

<a class="sh_blog_title _sp_each_url _sp_each_title" href="http://blog.naver.com/todoskr?Redirect=Log&logNo=220706762037" onclick="goCR(this,'rd','u='+urlencode('http://blog.naver.com/todoskr?Redirect=Log&logNo=220706762037')+'&a=blg*i.tit&r=1&i=90000003_00000000000000336325ED35');" target="_blank" title="웹 접근성, 개발자도 안다! - Prologue">웹 접근성, <strong class="hl">개발자</strong>도 안다! - Prologue</a>

To extract the title and href attributes from it, use the get method.

title_element.get("title")  
title_element.get("href")
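Putting the steps together, the same extraction can be applied to every post rather than just the first. The sketch below is one possible arrangement, not the only one: extract_posts is a hypothetical helper name, and the sample markup and example.com URL are invented for illustration.

```python
from bs4 import BeautifulSoup


def extract_posts(html):
    """Return a list of dicts with the title, href and visible text of each post."""
    dom = BeautifulSoup(html, "html.parser")
    posts = []
    for li in dom.select("li.sh_blog_top"):
        a = li.select_one("a.sh_blog_title")
        if a is None:
            continue  # skip list items that have no title link
        posts.append({
            "title": a.get("title"),  # get() returns None if the attribute is absent
            "href": a.get("href"),
            "text": a.get_text(),     # visible text, with nested tags flattened
        })
    return posts


# Made-up sample input modelled on the a tag shown above.
sample_html = """
<li class="sh_blog_top">
  <a class="sh_blog_title" href="http://example.com/post" title="Web accessibility for developers">Web accessibility for <strong class="hl">developers</strong></a>
</li>
"""

posts = extract_posts(sample_html)
print(posts)
```

Unlike indexing with square brackets, get does not raise when an attribute is missing, which suits scraped markup where attributes may come and go.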