TTS / web_scraper.py
import trafilatura


def get_website_text_content(url: str) -> str:
    """
    Fetch a URL and return the main text content of the page.

    The text is extracted with trafilatura, which makes the content easier
    to process than the raw HTML. The result is still raw text; it is best
    summarized by an LLM before being shown to the user.
    Returns an empty string if the download or the extraction fails.

    Some common websites to crawl information from:
    MLB scores: https://www.mlb.com/scores/YYYY-MM-DD
    """
    # Download the page; fetch_url returns None if the request fails
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return ""
    # Extract the main text; extract returns None if no content is found
    text = trafilatura.extract(downloaded)
    return text or ""
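
A minimal sketch, not part of the uploaded file, of two things the docstring hints at: filling in the `YYYY-MM-DD` part of the MLB scores URL from a date, and running the fetch-then-extract flow with the two steps passed in as callables so it can be exercised without network access. The names `mlb_scores_url` and `extract_text_or_empty` are hypothetical helpers introduced only for illustration.

```python
from datetime import date
from typing import Callable, Optional


def mlb_scores_url(day: date) -> str:
    """Build the MLB scores URL for a given day (hypothetical helper)."""
    # date.isoformat() yields YYYY-MM-DD, matching the pattern in the docstring
    return f"https://www.mlb.com/scores/{day.isoformat()}"


def extract_text_or_empty(
    url: str,
    fetch: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
) -> str:
    """Run fetch then extract, returning "" if either step yields None.

    Hypothetical helper: fetch and extract are injected so that stubs can
    stand in for the real download and extraction steps in tests.
    """
    downloaded = fetch(url)
    if downloaded is None:
        return ""
    text = extract(downloaded)
    return text or ""
```

With trafilatura installed, the real steps could be passed in as `extract_text_or_empty(mlb_scores_url(some_day), trafilatura.fetch_url, trafilatura.extract)`; in a test, simple lambdas can replace both.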