How to Build an AI Scraper Powered by Hugging Face and Bright Data
Crawl4AI, ScrapeGraphAI, and Firecrawl are just a few examples of the emerging trend of AI-powered web scraping. These successful open-source projects share thousands of GitHub stars and active, growing communities. The best part? You can build your own AI scraper by following a similar approach.
In this article, you’ll learn how to create an AI scraping agent powered by Hugging Face and Bright Data!
What Is an AI Scraper, and How Does It Work?
An AI scraper is a web scraping tool that uses artificial intelligence to extract data from web pages. Unlike traditional scrapers that depend on rigid data parsing rules, AI scrapers leverage machine learning and natural language processing, so you don’t need to constantly update your scraping script when a site changes its layout.
If you describe the data you need clearly enough in a human-like prompt, the AI will apply that instruction to retrieve the information from the page—no matter what the page layout is. At a high level, this is how an AI scraper works:
- Use a traditional method (HTTP client or browser automation) to fetch the page’s HTML.
- Apply the data extraction prompt to interpret and pull out the required data.
- Return the extracted data in the desired format.
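To make that flow concrete, here is a minimal, simplified sketch (not the MCP-based approach built later in this tutorial). It assumes the `requests` and `huggingface_hub` packages are installed and a Hugging Face token is available in the `HF_TOKEN` environment variable; the `naive_ai_scrape` helper is purely illustrative:

```python
import os

import requests
from huggingface_hub import InferenceClient


def naive_ai_scrape(url: str, instructions: str) -> str:
    # 1. Fetch the page with a traditional HTTP client
    html = requests.get(url, timeout=30).text

    # 2. Ask an LLM to apply the data extraction prompt to the page content
    client = InferenceClient(token=os.environ["HF_TOKEN"])
    response = client.chat_completion(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[{
            "role": "user",
            # Truncate the raw HTML to keep the prompt size manageable
            "content": f"{instructions}\n\nPage content:\n{html[:20000]}",
        }],
        max_tokens=1024,
    )

    # 3. Return the extracted data in the format requested by the prompt
    return response.choices[0].message.content
```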
Create an AI-Powered Scraper with the Hugging Face Agent and Bright Data: Step-by-Step Guide
In this section, you’ll learn how to extend the MCP-powered Hugging Face AI agent we built in the previous article, “Building a Hugging Face AI Agent with Bright Data’s Web MCP Integration.” (Note: you should take a look at that tutorial first.)
Specifically, you’ll extend the original agent so that it has access only to the `scrape_as_markdown` Bright Data tool via MCP and can:
- Retrieve the Markdown content of your target webpage, regardless of which site it comes from.
- Parse it programmatically into a Pydantic model using AI-based scraping.
- Export the parsed result to a JSON file.
In simpler terms, you’ll see how to build a programmatic AI scraper that works against any website and is powered by Hugging Face models. Let’s get started!
Step #1: Set Up the Project
Follow the instructions outlined in the previous article, which explains how to set up a Hugging Face AI agent that connects to the Bright Data Web MCP server. In that guide, you’ll find detailed steps to create your Bright Data and Hugging Face accounts, as well as how to write the Python code to define the AI agent.
From this point onward, we’ll assume you already have the Python project set up, with the agent stored in an `agent.py` file.
Step #2: Limit the MCP Tools to scrape_as_markdown
The Bright Data Web MCP provides over 60 tools for web data extraction, interaction, search, and more. For building a general-purpose AI scraper, you only need the `scrape_as_markdown` tool:
| Tool | Description |
| --- | --- |
| `scrape_as_markdown` | Extract content from any single webpage URL with advanced scraping options and receive the output in Markdown format. This tool can access pages even if they have bot protections or CAPTCHA challenges. |
Remember: The tool is available in the free tier, so you won’t be charged as long as you stay within the generous usage limits.
To limit the agent to this tool, simply filter the available MCP tools by name after loading them with `await agent.load_tools()`:
```python
agent.available_tools = [
    tool for tool in agent.available_tools
    if tool.get("function", {}).get("name") == "scrape_as_markdown"
]
```
Great! The Hugging Face agent will have access only to the `scrape_as_markdown` tool from the Bright Data Web MCP server.
You can verify that by asking the agent:
What tools do you have access to?
The response should be similar to:
I have access to the following tools:
1. **`task_complete`** - This tool is used to indicate that the task given by the user is complete.
2. **`ask_question`** - This tool allows me to ask the user for more information needed to solve or clarify their problem.
3. **`scrape_as_markdown`** - This tool can scrape a single webpage URL with advanced options for content extraction and return the results in Markdown language. It can also unlock webpages that use bot detection or CAPTCHA.
Would you like more details on any of these tools? Or do you have a specific task in mind that I can help with?
`task_complete` and `ask_question` are part of the Hugging Face AI agent implementation, so you can ignore them. Note that the only extra tool configured is indeed `scrape_as_markdown`. Perfect!
Next, encapsulate the agent creation logic into a function:
```python
async def initialize_scraping_agent():
    # Bright Data Web MCP configuration
    bright_data_mcp_server = {
        "type": "stdio",
        "command": "npx",
        "args": ["-y", "@brightdata/mcp"],
        "env": {
            "API_TOKEN": "<YOUR_BRIGHT_DATA_API_KEY>"
        }
    }

    # Initialize the agent
    agent = Agent(
        servers=[bright_data_mcp_server],
        provider="nebius",
        model="Qwen/Qwen2.5-72B-Instruct",
        api_key="<YOUR_HUGGING_FACE_API_KEY>",
    )

    # Load the MCP tools
    await agent.load_tools()

    # Restrict the available tools to only "scrape_as_markdown"
    agent.available_tools = [
        tool for tool in agent.available_tools
        if tool.get("function", {}).get("name") == "scrape_as_markdown"
    ]

    return agent
```
This way, the resulting code will be easier to read and manage.
Remember: Retrieve your Bright Data API key by following the official guide!
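As a side note, instead of hardcoding secrets, you may prefer to read them from environment variables. Here is a minimal sketch of the same configuration, assuming you export `BRIGHT_DATA_API_TOKEN` and `HF_TOKEN` before running the script (the variable names are just examples):

```python
import os

from huggingface_hub import Agent

# Same Bright Data Web MCP configuration as above, with the secret pulled from the environment
bright_data_mcp_server = {
    "type": "stdio",
    "command": "npx",
    "args": ["-y", "@brightdata/mcp"],
    "env": {
        # Read the Bright Data API key from the environment instead of hardcoding it
        "API_TOKEN": os.environ["BRIGHT_DATA_API_TOKEN"]
    }
}

agent = Agent(
    servers=[bright_data_mcp_server],
    provider="nebius",
    model="Qwen/Qwen2.5-72B-Instruct",
    # Read the Hugging Face API key from the environment as well
    api_key=os.environ["HF_TOKEN"],
)
```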
Step #3: Define the Scraping Task
Now you need to instruct your AI agent to perform the web scraping task. To make this general-purpose, you can create a function that:
- Accepts the target URL of the webpage to scrape.
- Accepts a Pydantic model representing the expected structured output.
- Uses the URL and model to create a contextual scraping prompt.
- Executes the scraping prompt.
- Retrieves the populated Pydantic model as a JSON string, parses it, and returns it.
All clear, but you may be wondering, “*Why a Pydantic model?*”
The reason is that using a Pydantic model is easier and more maintainable than manually specifying in the prompt which fields to extract from the page and the structure of the desired output. You define a Pydantic model representing the expected output, automatically convert it to a JSON schema in the prompt, and instruct the AI to produce results adhering to that schema.
Note that this is nothing new, but rather a best practice followed by most AI scrapers, such as ScrapeGraphAI and Crawl4AI.
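To see why this works, consider a tiny example. Pydantic can turn any model into a JSON schema via `model_json_schema()`, and that schema can be embedded directly in the prompt (the `Quote` model below is just for illustration):

```python
from typing import Optional

from pydantic import BaseModel


class Quote(BaseModel):
    text: str
    author: Optional[str] = None


# Produces a dict describing the expected structure: which fields exist,
# their types, and which ones are required
print(Quote.model_json_schema())
# {'properties': {'text': {...}, 'author': {...}}, 'required': ['text'], 'title': 'Quote', 'type': 'object'}
```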
Implement the above 5-step logic with this function:
````python
async def execute_scraping_task(agent, url, model):
    # Define the task for the agent
    scraping_prompt = f"""
Scrape the following product page as Markdown:
"{url}"
From the scraped content, extract the product information and populate it into the JSON schema below:
```json
{model.model_json_schema()}
```
Return only the JSON object as a plain string (no explanations or extra output).
"""

    # List to collect each partial streamed output
    output = []

    # Run the task with the agent and stream the response
    async for chunk in agent.run(scraping_prompt):
        if hasattr(chunk, "role") and chunk.role == "tool":
            print(f"[TOOL] {chunk.name}: {chunk.content}\n\n", flush=True)
        else:
            delta_content = chunk.choices[0].delta.content
            if delta_content:
                print(delta_content, end="", flush=True)
                output.append(delta_content)

    # Assemble all streamed chunks into a single string
    string_output = "".join(output)

    # Remove the ```json prefix and ``` suffix from the output
    json_string_output = string_output.removeprefix("```json").removesuffix("```").strip()

    # Parse the output to an instance of the provided Pydantic model and return it
    return model.model_validate_json(json_string_output)
````
The `output` list is used to keep track of the streamed chunks of content produced by the AI. You can then aggregate all the chunks into a single string.
Once you have the full string, it has to be converted into an instance of the Pydantic model provided as input. The problem is that, when configured to return JSON content, LLMs tend to produce output like this:
```json
{
"images": [
"https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg",
"https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_alt1.jpg",
"https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_back.jpg"
],
"name": "Abominable Hoodie",
"price": 69.0,
"currency": "$",
"sku": "MH09",
"category": {
"name": "Hoodies & Sweatshirts",
"url": "https://www.scrapingcourse.com/ecommerce/product-category/clothing/men/tops/hoodies-sweatshirts/"
},
"short_description": "This is a variable product called a Abominable Hoodie",
"long_description": "It took CoolTech™ weather apparel know-how and lots of wind-resistant fabric to get the Abominable Hoodie just right. It’s aggressively warm when it needs to be, while maintaining your comfort in milder climes.\n\n• Blue heather hoodie.\n• Relaxed fit.\n• Moisture-wicking.\n• Machine wash/dry.",
"additional_info": {
"size": ["XS", "S", "M", "L", "XL"],
"color": ["Blue", "Green", "Red"]
}
}
```
In other words, the configured AI model returns a JSON block wrapped in Markdown. Again, this is a common pattern among LLMs. To handle that output, you can first remove the JSON code block prefix and suffix, then parse the string using the Pydantic model’s `model_validate_json()` method.
The result will be an instance of the given Pydantic model, fully populated with the scraped data. Well done!
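If you want to be a bit more defensive against models that do not follow the output instructions exactly (for example, adding extra commentary around the JSON block), you could extract the JSON with a regular expression instead of `removeprefix()`/`removesuffix()`. Below is a minimal sketch, where `extract_json_block` is just an illustrative helper name:

```python
import re


def extract_json_block(raw_output: str) -> str:
    # Prefer the content of a fenced ```json ... ``` block, if one is present
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw_output, re.DOTALL)
    if match:
        return match.group(1)
    # Otherwise, fall back to the raw string stripped of surrounding whitespace
    return raw_output.strip()


# Inside execute_scraping_task(), you would then do:
# json_string_output = extract_json_block(string_output)
# return model.model_validate_json(json_string_output)
```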
Step #4: Export the Scraped Data
The final step is to define a function that exports the scraped Pydantic instance to a JSON file:
```python
def export_to_json(scraped_object, file_name):
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(scraped_object.model_dump_json(indent=2))
```
Wonderful! All the building blocks for the Bright Data-powered Hugging Face AI scraper are now ready. It’s time to put everything together.
Step #5: Start the Scraping Process
Now, suppose you want to scrape data from the “Abominable Hoodie” e-commerce product page on ScrapingCourse.com: https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/
As the name suggests, this site is specifically designed for web scraping, making it perfect for a first test.
Inspect all the data on the page, and you'll see that you can represent it with a Pydantic `Product` model like this:
```python
from typing import Optional, List

from pydantic import BaseModel, HttpUrl


class Category(BaseModel):
    name: Optional[str] = None
    url: Optional[HttpUrl] = None


class AdditionalInfo(BaseModel):
    size: Optional[List[str]] = None
    color: Optional[List[str]] = None


class Product(BaseModel):
    images: Optional[List[HttpUrl]] = None
    name: str
    price: Optional[float] = None
    currency: Optional[str] = None
    sku: Optional[str] = None
    category: Optional[Category] = None
    short_description: Optional[str] = None
    long_description: Optional[str] = None
    additional_info: Optional[AdditionalInfo] = None
```
Note that setting almost all fields as optional (except those you are sure exist) is a best practice. This prevents the AI scraper from generating made-up data for fields that may not be present on the page.
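As a quick illustration of why this matters, with `name` required and everything else optional, validation still succeeds when a field is missing from the model’s output, while a missing `name` raises an error (the inputs below are hypothetical):

```python
from pydantic import ValidationError

# Passes: "price", "sku", and the other optional fields simply default to None
product = Product.model_validate_json('{"name": "Abominable Hoodie"}')

# Fails: the required "name" field is missing
try:
    Product.model_validate_json('{"price": 69.0}')
except ValidationError as error:
    print(error)
```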
Pro Tip: If a field’s name isn’t enough to fully describe the product, add a description using Pydantic’s `Field`. For example:
```python
# Requires: from pydantic import Field
short_description: Optional[str] = Field(
    None, description="A brief description of the product displayed below the product name."
)
```
Now, start the scraping process by calling the functions you’ve defined earlier:
```python
agent = await initialize_scraping_agent()

scraped_object = await execute_scraping_task(
    agent,
    "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
    Product
)

export_to_json(scraped_object, "product.json")
```
The resulting AI scraping prompt will look like this:
Scrape the following product page as Markdown:
"https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"
From the scraped content, extract the product information and populate it into the JSON schema below:
```json
{'$defs': {'AdditionalInfo': {'properties': {'size': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'default': None, 'title': 'Size'}, 'color': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'default': None, 'title': 'Color'}}, 'title': 'AdditionalInfo', 'type': 'object'}, 'Category': {'properties': {'name': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Name'}, 'url': {'anyOf': [{'format': 'uri', 'maxLength': 2083, 'minLength': 1, 'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Url'}}, 'title': 'Category', 'type': 'object'}}, 'properties': {'images': {'anyOf': [{'items': {'format': 'uri', 'maxLength': 2083, 'minLength': 1, 'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'default': None, 'title': 'Images'}, 'name': {'title': 'Name', 'type': 'string'}, 'price': {'anyOf': [{'type': 'number'}, {'type': 'null'}], 'default': None, 'title': 'Price'}, 'currency': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Currency'}, 'sku': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Sku'}, 'category': {'anyOf': [{'$ref': '#/$defs/Category'}, {'type': 'null'}], 'default': None}, 'short_description': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Short Description'}, 'long_description': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Long Description'}, 'additional_info': {'anyOf': [{'$ref': '#/$defs/AdditionalInfo'}, {'type': 'null'}], 'default': None}}, 'required': ['name'], 'title': 'Product', 'type': 'object'}
```
Return only the JSON object as a plain string (no explanations or extra output).
See how clear the prompt is, thanks to the Pydantic model trick. Fantastic!
Step #6: Complete Code and Execution
Below is the final AI scraper code that you should now have in `agent.py`:
````python
import asyncio

from huggingface_hub import Agent
from typing import Optional, List
from pydantic import BaseModel, HttpUrl


class Category(BaseModel):
    name: Optional[str] = None
    url: Optional[HttpUrl] = None


class AdditionalInfo(BaseModel):
    size: Optional[List[str]] = None
    color: Optional[List[str]] = None


class Product(BaseModel):
    images: Optional[List[HttpUrl]] = None
    name: str
    price: Optional[float] = None
    currency: Optional[str] = None
    sku: Optional[str] = None
    category: Optional[Category] = None
    short_description: Optional[str] = None
    long_description: Optional[str] = None
    additional_info: Optional[AdditionalInfo] = None


async def initialize_scraping_agent():
    # Bright Data Web MCP configuration
    bright_data_mcp_server = {
        "type": "stdio",
        "command": "npx",
        "args": ["-y", "@brightdata/mcp"],
        "env": {
            "API_TOKEN": "<YOUR_BRIGHT_DATA_API_KEY>"
        }
    }

    # Initialize the agent
    agent = Agent(
        servers=[bright_data_mcp_server],
        provider="nebius",
        model="Qwen/Qwen2.5-72B-Instruct",
        api_key="<YOUR_HUGGING_FACE_API_KEY>",
    )

    # Load the MCP tools
    await agent.load_tools()

    # Restrict the available tools to only "scrape_as_markdown"
    agent.available_tools = [
        tool for tool in agent.available_tools
        if tool.get("function", {}).get("name") == "scrape_as_markdown"
    ]

    return agent


async def execute_scraping_task(agent, url, model):
    # Define the task for the agent
    scraping_prompt = f"""
Scrape the following product page as Markdown:
"{url}"
From the scraped content, extract the product information and populate it into the JSON schema below:
```json
{model.model_json_schema()}
```
Return only the JSON object as a plain string (no explanations or extra output).
"""

    # List to collect each partial streamed output
    output = []

    # Run the task with the agent and stream the response
    async for chunk in agent.run(scraping_prompt):
        if hasattr(chunk, "role") and chunk.role == "tool":
            print(f"[TOOL] {chunk.name}: {chunk.content}\n\n", flush=True)
        else:
            delta_content = chunk.choices[0].delta.content
            if delta_content:
                print(delta_content, end="", flush=True)
                output.append(delta_content)

    # Assemble all streamed chunks into a single string
    string_output = "".join(output)

    # Remove the ```json prefix and ``` suffix from the output
    json_string_output = string_output.removeprefix("```json").removesuffix("```").strip()

    # Parse the output to an instance of the provided Pydantic model and return it
    return model.model_validate_json(json_string_output)


def export_to_json(scraped_object, file_name):
    # Save the Pydantic object to a JSON file
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(scraped_object.model_dump_json(indent=2))


async def main():
    # Initialize the agent
    agent = await initialize_scraping_agent()

    # Execute the scraping task and get a Product instance
    scraped_object = await execute_scraping_task(
        agent,
        "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
        Product
    )

    # Export the scraped data to a JSON file
    export_to_json(scraped_object, "product.json")


if __name__ == "__main__":
    asyncio.run(main())
````
In just around 100 lines of Python code, you’ve built a general-purpose AI scraper that can transform any web page into a structured JSON output. This wouldn’t have been possible without Hugging Face and Bright Data. Impressive!
Test the AI scraper on the configured scraping scenario with:
```bash
python agent.py
```
The result should be a `product.json` file containing the scraped product data, similar to the JSON output shown earlier.
As you can see, the file contains the same product data visible on the target page, but in a structured format that adheres to the `Product` Pydantic model. That data was retrieved on the fly by your script using the `scrape_as_markdown` tool from Bright Data and then parsed by an AI model available through Hugging Face. Mission complete!
Test the AI Scraper Against a Real-World Site
The AI scraping example above was built around ScrapingCourse.com, a site that welcomes web scraping bots. But what if you want to target a website that’s notoriously difficult to scrape, like G2?
Suppose your target page is the G2.com reviews page for Bright Data: https://www.g2.com/products/bright-data/reviews
For this case, a suitable Pydantic model could be:
```python
class Link(BaseModel):
    text: Optional[str] = None
    url: Optional[HttpUrl] = None


class G2Product(BaseModel):
    name: Optional[str] = None
    review_score: Optional[float] = None
    num_reviews: Optional[int] = None
    seller: Optional[Link] = None
    # ...
```
Execute the AI scraper with the new inputs like this:
```python
async def main():
    # Initialize the agent
    agent = await initialize_scraping_agent()

    # Execute the scraping task and get a G2Product instance
    scraped_object = await execute_scraping_task(
        agent,
        "https://www.g2.com/products/bright-data/reviews",
        G2Product
    )

    # Export the scraped data to a JSON file
    export_to_json(scraped_object, "g2_brightdata.json")
```
The resulting `g2_brightdata.json` file will contain the structured product data extracted from the G2 page, following the `G2Product` model.
Et voilà! Your custom AI scraper now works against any site, even those protected by anti-scraping measures.
Conclusion
In this step-by-step blog post, you learned how to build an AI scraping agent capable of retrieving pages from virtually any website using the Bright Data Web MCP, combined with a Hugging Face AI model for data parsing. You also saw that this Hugging Face AI scraper can handle even the most complex sites.
Now it’s your turn: let us know your thoughts on this implementation, share your feedback, and feel free to ask any questions you might have.