08. Web document (WebBaseLoader)

WebBaseLoader

WebBaseLoader is a loader that loads web-based documents.

It parses web pages using the bs4 library.

Use bs4.SoupStrainer to specify which elements to parse.
Use the bs_kwargs parameter to pass additional arguments to bs4, such as a bs4.SoupStrainer via parse_only.

Reference

API document


import bs4
from langchain_community.document_loaders import WebBaseLoader

# Load news article content.
loader = WebBaseLoader(
    web_paths=("https://n.news.naver.com/article/437/0000378416",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["newsct_article _article_body", "media_end_head_title"]},
        )
    ),
    header_template={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    },
)

docs = loader.load()
print(f"Number of documents: {len(docs)}")
docs


Number of documents: 1


[Document (metadata={'source':'https://n.news.naver.com/article/437/0000378416'}, page_content="\n shoots '100 million won' to the maternity employee... The company's catastrophic low-birth policy\n\n\n[anchor] If you are a family planning to give birth to a child this year, this is the news to sell. The government's monthly salary as a low-birth measure, and a zero-year-old child raised to 1 million won. Adding to this first-time ticket, even the child's allowance, you get 15.2 million won for a year up to the child stone. The municipality was also competing for support. Incheon City is a new born baby, and will give you 100 million won until you are 18 years old. Gwangju City also said he would give 74 million won until he was 17 years old. There was a man who appeared in the election and said he would give cash if he had a child. In the past, only the vote was followed by criticism of Norin's'Emperor Commitment'. But now the fertility rate can't be worse than this, so it's even a situation where we seriously policy this cash aid. Besides, companies are also jumping. This time, it turned out to be a company that would give 100 million won to a given employee.Idealization reporter covered.[Reporter] A group company today has a catastrophic low-birth policy. One billion won for children born after 2021, a total of 70 billion won, and we decided to continue this policy in the future.If you have a lifetime and twin children in that period, you will receive a total of 200 million won.[Oh Hyun-seok/Vyoung Group employees: You're a world that's hard to raise children. I think it will be a great help in education or living.]If it were born by the third, it also said that it would provide national housing.  
[Chairman of the dual-center/non-profit group: I think it will come out to have three children within three years, and thus it will be an opportunity to provide housing.][Quiet/Buyoung Group employees: Wipe wanted to have a third time, but it was negative because of the economic burden. (Now) I think I can think positively.]At today's event, there was also a proposal for the government to be tax-free, taking into account the tax burden of the employees receiving the company.These maternity measures are an increasingly spreading atmosphere.Some places have longer parental leave than during statutory periods, or where male employees are obliged to take parental leave.I run my in-house children's house until 10pm, and if I give birth, I will promote it unconditionally.One company attracted attention by supporting medical expenses to employees who gave birth to four twins last year.As the expectation of a company to go out on behalf of the government to change the social climate, there is also a voice that small and medium-sized support is needed.[Video design]\n\t\t\n")]


# Bypass SSL certificate verification
loader.requests_kwargs = {"verify": False}

# load data
docs = loader.load()


You can also load multiple web pages at once. To do this, pass a list of URLs to the loader; it returns a list of documents in the same order as the URLs.


loader = WebBaseLoader(
    web_paths=[
        "https://n.news.naver.com/article/437/0000378416",
        "https://n.news.naver.com/mnews/hotissue/article/092/0002340014?type=series&cid=2000063",
    ],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["newsct_article _article_body", "media_end_head_title"]},
        )
    ),
    header_template={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    },
)

# load data
docs = loader.load()

# Check the number of documents
print(len(docs))
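
Since the documents come back in URL order and each one records its URL under metadata["source"] (as the earlier output shows), they can be paired with their URLs by position. The following is a minimal sketch of that idea; FakeDocument and pair_with_urls are hypothetical names used only for illustration, and FakeDocument merely stands in for a loaded document so the snippet runs without network access.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    # Hypothetical stand-in exposing the two attributes used in this tutorial.
    page_content: str
    metadata: dict = field(default_factory=dict)


def pair_with_urls(urls, docs):
    """Map each URL to its document by position, checking the source metadata."""
    if len(urls) != len(docs):
        raise ValueError("expected exactly one document per URL")
    for url, doc in zip(urls, docs):
        # Each document is expected to carry the URL it was loaded from.
        if doc.metadata.get("source") != url:
            raise ValueError(f"document order mismatch for {url}")
    return dict(zip(urls, docs))


urls = [
    "https://n.news.naver.com/article/437/0000378416",
    "https://n.news.naver.com/mnews/hotissue/article/092/0002340014?type=series&cid=2000063",
]
# Placeholder documents; in practice these would come from loader.load().
fake_docs = [FakeDocument(f"content {i}", {"source": u}) for i, u in enumerate(urls)]

by_url = pair_with_urls(urls, fake_docs)
print(by_url[urls[1]].page_content)
```

With real results, passing the same URL list and the list returned by loader.load() should work the same way, assuming the source metadata matches the requested URL exactly.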


Print the beginning of each document retrieved from the web.


print(docs[0].page_content[:500])
print("===" * 10)
print(docs[1].page_content[:500])


'10 million won' for maternity staff... The company's disparate low birth policy 


[Anchor] If you're a family planning to have a child this year, it's good news. The government's monthly salary as a low-birth measure, and a zero-year-old child raised to 1 million won. Adding to this first-time ticket, even the Children's Party, I get 15.2 million won for a year up to the child stone. The municipality was also competing for support. Incheon City is a new born baby, and will give you 100 million won until you are 18 years old. Gwangju City also said he would give 74 million won until he was 17 years old. There was a man who appeared in the election and said he would give cash if he had a child. In the past, only the vote was followed by criticism of Norin's'Emperor Commitment'. But now the fertility rate can't be worse than this, so it's even a situation where we seriously policy this cash aid. Besides, companies are also jumping. This time, it turned out to be a company that would give 100 million won to a given employee.Idealization reporter covered.[Reporter] A group company today has a disparate low-birth policy. 
============================== 

A high-speed growing startup needs a red team 


[Hyunggyeongseong of Eunseong] Chosim, a start-up founder who lost his essence and had a recent lunch together. There was something to ask for advice. It wasn't that there was an urgent issue right now. I've had many startups, and the items I'm doing right now feel great. But I think I should be more careful. The desire to seek advice was that I wanted to know what to watch out for when growth was expected. I met a decent start-up founder, but it was a rare case: there are two things that I thought were meaningful when interviewing a relay for a startup founder for nearly two years. First, it was that he wrote the vocabulary of teams rather than the word company. It is more difficult to say that it is a team than because of the origin or meaning of the expression. It was more difficult to think that it was an expression that focused more on the oriented will than the stakeholder system and pointed to a group running towards one place with one mind.Most of the start-up motives for start-up representatives are ‘Social


Scraping multiple URLs simultaneously can accelerate the scraping process.

Concurrent requests are rate limited to a reasonable level; the default is 2 requests per second. If you are not worried about server load, or if you control the server being scraped, you can raise the limit by changing the requests_per_second parameter. This can speed up scraping, but the server may block you, so be careful.
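
To make the idea of a per-second request limit concrete, here is a minimal asyncio sketch that staggers request start times. It is not WebBaseLoader's internal implementation, which is not shown here; fetch_all and fake_fetch are hypothetical names, and fake_fetch replaces a real HTTP call so the example runs offline.

```python
import asyncio
import time


async def fetch_all(urls, fetch, requests_per_second=2):
    """Fetch all URLs concurrently, starting at most `requests_per_second` per second."""
    interval = 1.0 / requests_per_second

    async def fetch_one(index, url):
        # Stagger start times: request i begins i * interval seconds in.
        await asyncio.sleep(index * interval)
        return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(i, u) for i, u in enumerate(urls)))


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
start = time.perf_counter()
pages = asyncio.run(fetch_all(urls, fake_fetch, requests_per_second=20))
elapsed = time.perf_counter() - start
print(len(pages), f"{elapsed:.2f}s")
```

A higher requests_per_second shortens the gaps between requests and so the total time, at the cost of more load on the server. Inside a Jupyter notebook, asyncio.run needs nest_asyncio applied first, or the coroutine can be awaited directly.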


# Needed only in Jupyter Notebook (allows nested asyncio event loops)
import nest_asyncio

nest_asyncio.apply()


# Set number of requests per second
loader.requests_per_second = 1

# Asynchronous load
docs = loader.aload()


 Fetching pages: 100%|#########| 2/2 [00:00<00:00, 7.68it/s]


# Output the results
docs


[Document (metadata={'source':'https://n.news.naver.com/article/437/0000378416'}, page_content="\n shoots '100 million won' to the maternity employee... The company's catastrophic low-birth policy\n\n\n[anchor] If you are a family planning to give birth to a child this year, this is the news to sell. The government's monthly salary as a low-birth measure, and a zero-year-old child raised to 1 million won. Adding to this first-time ticket, even the Children's Party, I get 15.2 million won for a year up to the child stone. The municipality was also competing for support. Incheon City is a new born baby, and will give you 100 million won until you are 18 years old. Gwangju City also said he would give 74 million won until he was 17 years old. There was a man who appeared in the election and said he would give cash if he had a child. In the past, only the vote was followed by criticism of Norin's'Emperor Commitment'. But now the fertility rate can't be worse than this, so it's even a situation where we seriously policy this cash aid. Besides, companies are also jumping. This time, it turned out to be a company that would give 100 million won to a given employee.Idealization reporter covered.[Reporter] A group company today has a catastrophic low-birth policy. One billion won for children born after 2021, a total of 70 billion won, and we decided to continue this policy in the future.If you have a lifetime and twin children in that period, you will receive a total of 200 million won.[Oh Hyun-seok/Vyoung Group employees: You're a world that's hard to raise children. I think it will be a great help in education or living.]If it were born by the third, it also said that it would provide national housing.  
[Chairman of the dual-center/non-profit group: I think it will come out to have three children within three years, and thus it will be an opportunity to provide housing.][Quiet/Buyoung Group employees: Wipe wanted to have a third time, but it was negative because of the economic burden. (Now) I think I can think positively.]At today's event, there was also a proposal for the government to be tax-free, taking into account the tax burden of the employees receiving the company.These maternity measures are an increasingly spreading atmosphere.Some places have longer parental leave than during statutory periods, or where male employees are obliged to take parental leave.I run my in-house children's house until 10pm, and if I give birth, I will promote it unconditionally.One company attracted attention by supporting medical expenses to employees who gave birth to four twins last year.As the expectation of a company to go out on behalf of the government to change the social climate, there is also a voice that small and medium-sized support is needed.[Video design contour]\n\t\t\n"), Document (metadata={'source':'https://n.news.naver.com/mnews/hotissue/article/092/0002340014?type=series&cid=2000063'}, page_content='\n High-growing startup needs a red team \n\n\n[Ecularity's 图 ] Beginner, start-up founder when losing essence With recent lunch. There was something to ask for advice. It wasn't that there was an urgent issue right now. I've had many startups, and the items I'm doing right now feel great. But I think I should be more careful.  The desire to seek advice was that I wanted to know what to watch out for when growth was expected. I met a decent start-up founder, but it was a rare case: there are two things that I thought were meaningful when interviewing a relay for a startup founder for nearly two years. First, it was that he wrote the vocabulary of teams rather than the word company. 
It is more difficult to say that it is a team than because of the origin or meaning of the expression. It was more difficult to think that it was an expression that focused more on the oriented will than the stakeholder system and pointed to a group running towards one place with one mind.Most of the start-up motives for start-up representatives are ‘to solve social problems in a corporate way ’. There are many problems to solve in human society, and solutions vary. By the way, some can be the most efficient to solve in a corporate way. There are also many issues that can solve problems better when combined with the motives of profit. So ‘solve social problems in a corporate way ’ was a bad idea that the start-up motive was also a team. \n\n\n\nstart-up founder (source =pixabay), but in some respects ‘Clearing Orientation ’ It may not be an intrinsic entity ‘. Orientation is enough for notion and relief, but the reality would be implemented little by little only with unrested and endless action, for some companies only the former is disturbing and the latter seems loose. As the size of the company grows little by little, the tendency of the former and latter gaps to take place is also felt. This trend is thought to be the essence of corporate risk.The start-up ‘spreads social problems in a corporate way ’ starts with the notion. It is the start-up to feel the consciousness of a problem and to take action by teaming up when the solution comes to mind. That's the action. It is ‘Challenge ’.  When solving the problems that consumers feel through the action, the start-up is shining and there is a reason for the company to exist. The sad thing is that it is difficult to solve the first problem, but after solving the problem, you face a bigger problem.‘Challenge ’ is so mostly ‘foreseen frustration ’. In a short time, many entrepreneurs are frustrated, and when time gets longer, infrequently frustrated teeth are rare. 
The problem is that it increases exponentially and the founder's competency develops arithmetic. The problem is like a running tiger, and the founder is like a precarious climb on it. As time goes by, the problem goes beyond his control. What he can do is less and less.Recent noise is not cut off by many companies that have grown significantly, starting with startups. Founders are bound, in bankruptcy, or face strong backlash from partners and consumers. Analysis of companies in crisis is also pouring. Most of the causes are found in ‘Noz’. These analyzes include octopus expansion, an unbeaten investment for mergers, a bulging turnover, and a way to pull future profits for the right-wing indicators.The crowd is not really special. ‘Cleared orientation ’ and ‘Intrinsic entity ’ refers to a large open state. When orientation and reality take place, the initial corporate culture of the team shakes, and the entrepreneurship of solving social problems for consumers disappears. Only growth that does not cover the back and forth of the back and forth is the only value. It's a pursuit of blind growth. It must fall sometime in words. It's a horse tiger, etc., and I don't know that before Sadal.Companies are likened to running bikes. Running means growth. It means that if you do not grow, you will fall. Sustainable growth is so necessary, but it's that hard.  But if growth itself turns to its only value, companies are not an innovative group that solves social problems, but Dori himself can turn into a chunk of problems and put a strain on society. It's instant that tribute turns into blame. It is also true that I fall into the fall.Companies that are more in trouble than solving problems are arranged so that the gap between the oriented motto and real management is irreversible. It seems that the multiplier effect also applies to the growing gap. For the first time, the smaller it was, the bigger it couldn't afford. 
So a high-speed growing startup needs a red team to check the gap. If you don't check the gap, you lose your essence and you lose your essence, there is no persistence.\n\n')]

Proxy use

Sometimes you may need to use a proxy to bypass IP blocking.

To use a proxy, pass a proxies dictionary to the loader (and, underneath it, to requests).


# Initialize the web-based loader with proxy settings.
# Replace {username} and {password} with your own proxy credentials.
loader = WebBaseLoader(
    "https://www.google.com/search?q=parrots",
    proxies={
        "http": "http://{username}:{password}@proxy.service.com:6666/",
        "https": "https://{username}:{password}@proxy.service.com:6666/",
    },
)

# load documents
docs = loader.load()
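
Credentials that contain characters such as "@" or ":" would break a proxy URL if inserted as-is. The sketch below shows one way to build the proxies dictionary with percent-encoded credentials; build_proxies is a hypothetical helper, not part of LangChain or requests, and the host, port, and credentials are placeholder values.

```python
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Return a requests-style proxies dict with URL-encoded credentials."""
    # Percent-encode so characters like "@" or ":" cannot be misread as URL delimiters.
    credentials = f"{quote(username, safe='')}:{quote(password, safe='')}"
    # One entry per scheme, mirroring the layout used in the loader example.
    return {
        scheme: f"{scheme}://{credentials}@{host}:{port}/"
        for scheme in ("http", "https")
    }


proxies = build_proxies("user", "p@ss:word", "proxy.service.com", 6666)
print(proxies["http"])
# prints: http://user:p%40ss%3Aword@proxy.service.com:6666/
```

The resulting dictionary can then be passed as the proxies argument of WebBaseLoader.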


Last updated 5 months ago