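[Python Basics] How to Automate Information Gathering with Web Scraping

What Is Web Scraping?

Web scraping is a technique for accessing web pages programmatically and extracting the information you need automatically. Because it automates information gathering that would otherwise be done by manual copy and paste, it is a very useful skill for engineers.

Python has two libraries, requests and BeautifulSoup, and together they let you implement most scraping tasks simply.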

Setting Up the Environment

First, install the libraries.

pip install requests beautifulsoup4

The Basic Scraping Flow

Scraping breaks down into three main steps.

1. Fetch the page's HTML with requests

2. Parse the HTML with BeautifulSoup

3. Extract the elements you need

import requests
from bs4 import BeautifulSoup

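url = "https://example.com"
response = requests.get(url, timeout=10)  # avoid hanging forever on a slow server
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500
response.encoding = "utf-8"  # assumes the page is UTF-8 encoded

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)  # text of the <title> tag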
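
Steps 2 and 3 can also be tried without any network access by parsing an HTML string directly. The following is only a minimal sketch: the HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome</h1>
    <p class="lead">First paragraph</p>
  </body>
</html>
"""

# Step 2: parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract the elements you need
page_title = soup.title.text                    # text inside <title>
first_heading = soup.find("h1").text            # text of the first <h1>
lead_text = soup.find("p", class_="lead").text  # first <p> with class "lead"

print(page_title)
print(first_heading)
print(lead_text)
```

Experimenting on a fixed string like this makes it easier to get selectors right before sending requests to a real site.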

How to Select Elements

Select by tag name

# Get all h1 tags
headings = soup.find_all("h1")
for h in headings:
    print(h.text.strip())

Select with a CSS selector

# Get elements whose class is "article-title"
titles = soup.select(".article-title")
for title in titles:
    print(title.text.strip())

Filter by attribute

# Get all a tags that have an href attribute
links = soup.find_all("a", href=True)
for link in links:
    print(link["href"])
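
find_all and select always return a list. When only one element is expected, select_one (or find) returns the first match, or None if nothing matches. The sketch below combines these on a hypothetical HTML snippet and also shows .get() for reading an attribute that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<ul id="menu">
  <li><a class="nav" href="/home">Home</a></li>
  <li><a class="nav" href="/about">About</a></li>
  <li><a class="nav">No link</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when nothing matches
first_link = soup.select_one("#menu a.nav")
missing = soup.select_one("#menu a.footer")

# Attribute selector: only <a> tags that carry an href attribute
hrefs = [a["href"] for a in soup.select("a[href]")]

# .get() returns None instead of raising KeyError for a missing attribute
all_hrefs = [a.get("href") for a in soup.find_all("a")]

print(first_link.text, missing, hrefs, all_hrefs)
```

Checking for None before reading .text avoids an AttributeError when the page does not contain the element you expected.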

Practical Example: Collecting Headlines from a News Site

As a realistic use case, let's write a script that collects article titles and URLs from a news site.

import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_news(url: str) -> list[dict]:
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    articles = []

    for item in soup.select("article"):
        title_tag = item.find("h2") or item.find("h3")
        link_tag = item.find("a", href=True)

        if title_tag and link_tag:
            articles.append({
                "title": title_tag.text.strip(),
                "url": link_tag["href"]
            })

    return articles


def save_to_csv(data: list[dict], filename: str) -> None:
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(data)


if __name__ == "__main__":
    target_url = "https://example-news.com"
    news = scrape_news(target_url)
    save_to_csv(news, "news.csv")
    print(f"Collected {len(news)} articles")
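
The href values collected this way are often relative paths such as /news/123 rather than full URLs. One way to turn them into absolute URLs is urljoin from the standard library. This is a sketch, not part of the original script: the helper name to_absolute and the sample URLs are hypothetical.

```python
from urllib.parse import urljoin


def to_absolute(base_url: str, href: str) -> str:
    """Resolve href against the URL of the page it was found on."""
    return urljoin(base_url, href)


# Hypothetical page URL used only for this illustration
base = "https://example-news.com/world/"

root_relative = to_absolute(base, "/news/123")
page_relative = to_absolute(base, "article.html")
already_absolute = to_absolute(base, "https://other.example/x")

print(root_relative)     # https://example-news.com/news/123
print(page_relative)     # https://example-news.com/world/article.html
print(already_absolute)  # https://other.example/x
```

Inside scrape_news, this would mean storing to_absolute(url, link_tag["href"]) instead of the raw link_tag["href"].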

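Precautions When Scraping

Check robots.txt

Before scraping, always check the site's robots.txt (for example, https://example.com/robots.txt). If any paths are disallowed for crawling, you must respect those rules.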
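
The check can also be done in code with urllib.robotparser from the standard library. The sketch below parses hypothetical robots.txt rules from a list of lines so that it runs without network access. The rules and the bot name MyBot are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real script would instead call
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(user_agent, url) reports whether the rules allow the request
allowed_public = parser.can_fetch("MyBot", "https://example.com/news/")
allowed_private = parser.can_fetch("MyBot", "https://example.com/private/data")

print(allowed_public)   # True
print(allowed_private)  # False
```

Calling can_fetch before each request lets the script skip disallowed paths automatically.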

Space out your requests

To avoid overloading the server, use time.sleep() to leave an interval between requests when crawling multiple pages.

import time

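urls = ["https://example.com/page/1", "https://example.com/page/2"]
all_articles = []
for url in urls:
    # scrape_news is the function defined in the practical example above
    all_articles.extend(scrape_news(url))
    time.sleep(2)  # wait 2 seconds between requests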

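Add error handling

Add try-except so that network and HTTP errors do not crash the script. Changes in page structure do not raise request errors. They show up as missing elements, which is why the practical example above checks title_tag and link_tag before using them.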

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
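
Temporary failures often succeed on a second attempt, so one possible extension is to retry a few times with a growing wait. The helper below is a hypothetical sketch and not part of requests. The name with_retry and its parameters are assumptions, and a stand-in function is used so the example runs without network access.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(func: Callable[[], T], retries: int = 3, delay: float = 1.0,
               exceptions: tuple = (Exception,)) -> T:
    """Call func, retrying up to `retries` times on the given exceptions."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions as e:
            if attempt == retries:
                raise  # give up after the last attempt
            print(f"Attempt {attempt} failed: {e}; retrying")
            time.sleep(delay * attempt)  # wait a little longer each time
    raise ValueError("retries must be at least 1")


# Stand-in for a network call: fails twice, then succeeds
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = with_retry(flaky, retries=3, delay=0, exceptions=(ConnectionError,))
print(result)
```

With requests, the call might look like with_retry(lambda: requests.get(url, timeout=10), exceptions=(requests.exceptions.RequestException,)).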

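When the Page Is Generated Dynamically with JavaScript

requests only fetches the HTML returned by the server, so it cannot retrieve content that JavaScript generates dynamically in the browser. In that case, use Selenium or Playwright.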

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    content = page.content()
    browser.close()

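However, scraping dynamic pages is heavier to run, so it is worth checking first whether an official API is available.

Summary

The fastest way to get started with web scraping in Python is with the following combinations.

Static pages: `requests` + `BeautifulSoup`

Dynamic pages: `Playwright` or `Selenium`

As long as you check robots.txt and space out your requests, you can automate information gathering safely. For recurring runs, combine the script with cron or Task Scheduler to build a more practical automation system.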