08. Web document (WebBaseLoader)

WebBaseLoader

WebBaseLoader is a loader that loads web-based documents.

It parses web pages using the bs4 library.

  • Use bs4.SoupStrainer to specify the elements to parse.

  • Use the bs_kwargs parameter to pass additional arguments, such as a bs4.SoupStrainer, to the parser.

Reference


import bs4
from langchain_community.document_loaders import WebBaseLoader

# Load news article content.
loader = WebBaseLoader(
    web_paths=("https://n.news.naver.com/article/437/0000378416",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["newsct_article _article_body", "media_end_head_title"]},
        )
    ),
    header_template={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    },
)

docs = loader.load()
print(f"Number of documents: {len(docs)}")
docs
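
To see what the SoupStrainer filter in the loader above keeps and discards without making a network request, the same filter can be applied directly to a small HTML string. This is a minimal sketch: the HTML snippet and its text are invented for illustration, and only the two class names come from the example above.

```python
import bs4

# Made-up HTML that mimics the structure targeted by the loader example.
html = """
<html><body>
  <div class="media_end_head_title">Sample headline</div>
  <div class="sidebar">Unrelated links</div>
  <div class="newsct_article _article_body">Sample article body.</div>
</body></html>
"""

# Same filter as in the loader example: keep only <div> tags with these classes.
strainer = bs4.SoupStrainer(
    "div",
    attrs={"class": ["newsct_article _article_body", "media_end_head_title"]},
)

# parse_only tells BeautifulSoup to build the tree from matching elements only.
soup = bs4.BeautifulSoup(html, "html.parser", parse_only=strainer)
text = soup.get_text(separator="\n", strip=True)
print(text)
```

The sidebar text should not appear in the output, because its div does not match the filter and is never added to the parse tree.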


Using a proxy

Sometimes you may need to use a proxy to bypass IP blocking.

To use a proxy, pass a proxies dictionary to the loader (and, through it, to the underlying requests library).
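
A minimal sketch of what that can look like is shown below. The proxy host, port, and credentials are placeholders, and build_proxies and make_loader are hypothetical helpers written for this example, not part of LangChain. The dictionary uses the scheme-to-URL format that requests expects, and it is handed to WebBaseLoader through its proxies parameter.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Build a requests-style proxies dict (hypothetical helper)."""
    auth = ""
    if username and password:
        # Percent-encode credentials so characters such as '@' or ':'
        # do not break the proxy URL.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"http://{auth}{host}:{port}"
    # Assumption: one HTTP proxy endpoint handles both http and https traffic.
    return {"http": proxy_url, "https": proxy_url}


def make_loader(url, proxies):
    """Create a WebBaseLoader that sends its requests through the proxy."""
    # Imported lazily so build_proxies can be used without LangChain installed.
    from langchain_community.document_loaders import WebBaseLoader

    return WebBaseLoader(web_paths=(url,), proxies=proxies)


# Placeholder values; replace them with your own proxy details.
proxies = build_proxies("proxy.example.com", 8080, "user", "p@ss")
print(proxies)

# Uncomment to load a page through the proxy (requires a reachable proxy):
# loader = make_loader("https://n.news.naver.com/article/437/0000378416", proxies)
# docs = loader.load()
```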
