How to Build an AI Scraper Powered by Hugging Face and Bright Data

Community Article · Published September 15, 2025

Crawl4AI, ScrapeGraphAI, and Firecrawl are just a few examples of the emerging trend of AI-powered web scraping. What these successful open-source projects have in common is thousands of GitHub stars and an active, growing community. The best part? You can build your own AI scraper by following a similar approach.

In this article, you’ll learn how to create an AI scraping agent powered by Hugging Face and Bright Data!

What Is an AI Scraper, and How Does It Work?

An AI scraper is a web scraping tool that uses artificial intelligence to extract data from web pages. Unlike traditional scrapers that depend on rigid data parsing rules, AI scrapers leverage machine learning and natural language processing, so you don’t need to constantly update your scraping script when a site changes its layout.

If you describe the data you need clearly enough in a human-like prompt, the AI will apply that instruction to retrieve the information from the page—no matter what the page layout is. At a high level, this is how an AI scraper works:

  1. Use a traditional method (HTTP client or browser automation) to fetch the page’s HTML.
  2. Apply the data extraction prompt to interpret and pull out the required data.
  3. Return the extracted data in the desired format.
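
To make that flow concrete, here is a minimal sketch of the idea (not the agent built in this tutorial): it fetches a page with requests and hands the raw HTML plus an extraction instruction to a hypothetical call_llm() helper, which stands in for whatever model client you prefer:

import requests

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for your LLM client of choice
    # (e.g., a Hugging Face inference client); it should return the model's text reply
    raise NotImplementedError

def ai_scrape(url: str, instruction: str) -> str:
    # 1. Fetch the page's HTML with a traditional HTTP client
    html = requests.get(url, timeout=30).text
    # 2. Apply the data extraction prompt to the page content
    prompt = f"{instruction}\n\nPage content:\n{html}"
    # 3. Return whatever the model extracted (e.g., a JSON string)
    return call_llm(prompt)

The rest of this article replaces the naive HTTP request with Bright Data’s scrape_as_markdown tool and the placeholder LLM call with a Hugging Face model.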

Create an AI-Powered Scraper with the Hugging Face Agent and Bright Data: Step-by-Step Guide

In this section, you’ll learn how to extend the MCP-powered Hugging Face AI agent built in the previous article, “Building a Hugging Face AI Agent with Bright Data’s Web MCP Integration.” (If you haven’t read that tutorial yet, take a look at it first.)

Specifically, you’ll extend the original agent so that it has access only to the scrape_as_markdown Bright Data tool via MCP and can:

  1. Retrieve the Markdown content of your target webpage, regardless of which site it comes from.
  2. Parse it programmatically into a Pydantic model using AI-based scraping.
  3. Export the parsed result to a JSON file.

In simpler terms, you’ll see how to build a programmatic AI scraper that works against any website and is powered by Hugging Face models. Let’s get started!

Step #1: Set Up the Project

Follow the instructions outlined in the previous article, which explains how to set up a Hugging Face AI agent that connects to the Bright Data Web MCP server. In that guide, you’ll find detailed steps to create your Bright Data and Hugging Face accounts, as well as how to write the Python code to define the AI agent.

From this point onward, we’ll assume you already have the Python project set up, with the agent stored in an agent.py file.

Step #2: Limit the MCP Tools to scrape_as_markdown

The Bright Data Web MCP provides over 60 tools for web data extraction, interaction, search, and more. For building a general-purpose AI scraper, you only need the scrape_as_markdown tool:

| Tool | Description |
| --- | --- |
| scrape_as_markdown | Extracts the content of any single webpage URL with advanced scraping options and returns the output in Markdown format. This tool can access pages even when they are protected by bot detection or CAPTCHA challenges. |

Remember: The tool is available in Bright Data’s free tier, so you won’t be charged as long as you stay within the free-tier usage limits.

To limit the agent to this tool, simply filter the available MCP tools by name after loading them with await agent.load_tools():

agent.available_tools = [
    tool for tool in agent.available_tools
    if tool.get("function", {}).get("name") == "scrape_as_markdown"
]

Great! The Hugging Face agent will have access only to the scrape_as_markdown tool from the Bright Data Web MCP server.

You can verify that by asking the agent:

What tools do you have access to?

The response should be similar to:

I have access to the following tools:

1. **`task_complete`** - This tool is used to indicate that the task given by the user is complete.
2. **`ask_question`** - This tool allows me to ask the user for more information needed to solve or clarify their problem.
3. **`scrape_as_markdown`** - This tool can scrape a single webpage URL with advanced options for content extraction and return the results in Markdown language. It can also unlock webpages that use bot detection or CAPTCHA.

Would you like more details on any of these tools? Or do you have a specific task in mind that I can help with?

task_complete and ask_question are part of the Hugging Face AI agent implementation, so you can ignore them. Note that the only configured extra tool is actually scrape_as_markdown. Perfect!

Next, encapsulate the agent creation logic into a function:

async def initialize_scraping_agent():
    # Bright Data Web MCP configuration
    bright_data_mcp_server = {
        "type": "stdio",
        "command": "npx",
        "args": ["-y", "@brightdata/mcp"],
        "env": {
          "API_TOKEN": "<YOUR_BRIGHT_DATA_API_KEY>"
        }
    }

    # Initialize the agent
    agent = Agent(
        servers=[bright_data_mcp_server],
        provider="nebius",
        model="Qwen/Qwen2.5-72B-Instruct",
        api_key="<YOUR_HUGGING_FACE_API_KEY>",
    )
    # Load the MCP tools
    await agent.load_tools()
    # Restrict the available tools to only "scrape_as_markdown"
    agent.available_tools = [
        tool for tool in agent.available_tools
        if tool.get("function", {}).get("name") == "scrape_as_markdown"
    ]

    return agent

This way, the resulting code will be easier to read and manage.

Remember: Retrieve your Bright Data API key by following the official guide!

Step #3: Define the Scraping Task

Now you need to instruct your AI agent to perform the web scraping task. To make this general-purpose, you can create a function that:

  1. Accepts the target URL of the webpage to scrape.
  2. Accepts a Pydantic model representing the expected structured output.
  3. Uses the URL and model to create a contextual scraping prompt.
  4. Executes the scraping prompt.
  5. Retrieves the populated Pydantic model as a JSON string, parses it, and returns it.

All clear, but you may be wondering: *Why a Pydantic model?*

The reason is that using a Pydantic model is easier and more maintainable than manually specifying in the prompt which fields to extract from the page and the structure of the desired output. You define a Pydantic model representing the expected output, automatically convert it to a JSON schema in the prompt, and instruct the AI to produce results adhering to that schema.

Note that this is nothing new, but rather a best practice followed by most AI scrapers, such as ScrapeGraphAI and Crawl4AI.
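
To see what the schema conversion looks like in practice, here is a tiny, self-contained example (using a simplified model, not the Product model defined later in this article) that prints the JSON schema Pydantic generates for you:

import json
from typing import Optional
from pydantic import BaseModel

class MiniProduct(BaseModel):
    name: str
    price: Optional[float] = None

# model_json_schema() returns a dict describing the model,
# which you can embed directly in your scraping prompt
print(json.dumps(MiniProduct.model_json_schema(), indent=2))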

Implement the above 5-step logic with this function:

async def execute_scraping_task(agent, url, model):
    # Define the task for the agent
    scraping_prompt = f"""
        Scrape the following product page as Markdown:
        "{url}"

        From the scraped content, extract the product information and populate it into the JSON schema below:
        ```json
        {model.model_json_schema()}
        ```
        Return only the JSON object as a plain string (no explanations or extra output).
    """

    # List to collect each partial streamed output
    output = []

    # Run the task with the agent and stream the response
    async for chunk in agent.run(scraping_prompt):
        if hasattr(chunk, "role") and chunk.role == "tool":
            print(f"[TOOL] {chunk.name}: {chunk.content}\n\n", flush=True)
        else:
            delta_content = chunk.choices[0].delta.content
            if delta_content:
                print(delta_content, end="", flush=True)
                output.append(delta_content)

    # Assemble all streamed chunks into a single string
    string_output = "".join(output)

    # Remove the ```json prefix and ``` suffix from the output
    json_string_output = string_output.removeprefix("```json").removesuffix("```").strip()

    # Parse the output to an instance of the provided Pydantic model and return it
    return model.model_validate_json(json_string_output)

The output list collects the partial chunks of content streamed by the AI. Once the stream is complete, you join all the chunks into a single string.

Once you have the full string, it has to be converted into an instance of the Pydantic model provided as input. The problem is that, when configured to return JSON content, LLMs tend to produce output like this:

    ```json
    {
      "images": [
        "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_main.jpg",
        "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_alt1.jpg",
        "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh09-blue_back.jpg"
      ],
      "name": "Abominable Hoodie",
      "price": 69.0,
      "currency": "$",
      "sku": "MH09",
      "category": {
        "name": "Hoodies & Sweatshirts",
        "url": "https://www.scrapingcourse.com/ecommerce/product-category/clothing/men/tops/hoodies-sweatshirts/"
      },
      "short_description": "This is a variable product called a Abominable Hoodie",
      "long_description": "It took CoolTech™ weather apparel know-how and lots of wind-resistant fabric to get the Abominable Hoodie just right. It’s aggressively warm when it needs to be, while maintaining your comfort in milder climes.\n\n• Blue heather hoodie.\n• Relaxed fit.\n• Moisture-wicking.\n• Machine wash/dry.",
      "additional_info": {
        "size": ["XS", "S", "M", "L", "XL"],
        "color": ["Blue", "Green", "Red"]
      }
    }
    ```

In other words, the configured AI model returns the JSON wrapped in a Markdown code block. Again, this is a common pattern among LLMs. To handle that output, first strip the code block prefix and suffix, then parse the remaining string with the Pydantic model’s model_validate_json() method.

The result will be an instance of the given Pydantic model, fully populated with the scraped data. Well done!
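
As a side note, the removeprefix()/removesuffix() approach assumes the model always emits the exact JSON fence. If you want something slightly more forgiving (for example, when the model omits the language tag), a small regex helper like the hypothetical extract_json_block() below is one possible alternative, not part of the original code:

import re

def extract_json_block(text: str) -> str:
    # Keep only the content inside a fenced code block, if present;
    # otherwise return the trimmed text as-is
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    return match.group(1) if match else text.strip()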

Step #4: Export the Scraped Data

The final step is to define a function that exports the scraped Pydantic instance to a JSON file:

def export_to_json(scraped_object, file_name):
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(scraped_object.model_dump_json(indent=2))

Wonderful! All the building blocks for the Bright Data-powered Hugging Face AI scraper are now ready. It’s time to put everything together.

Step #5: Start the Scraping Process

Now, suppose you want to scrape data from the following e-commerce product page on ScrapingCourse.com:

The target webpage

As the name suggests, this site is specifically designed for web scraping, making it perfect for a first test.

Inspect all the data on the page, and you'll see that you can represent it with a Pydantic Product model like this:

from typing import Optional, List
from pydantic import BaseModel, HttpUrl

class Category(BaseModel):
    name: Optional[str] = None
    url: Optional[HttpUrl] = None

class AdditionalInfo(BaseModel):
    size: Optional[List[str]] = None
    color: Optional[List[str]] = None

class Product(BaseModel):
    images: Optional[List[HttpUrl]] = None
    name: str
    price: Optional[float] = None
    currency: Optional[str] = None
    sku: Optional[str] = None
    category: Optional[Category] = None
    short_description: Optional[str] = None
    long_description: Optional[str] = None
    additional_info: Optional[AdditionalInfo] = None

Note that setting almost all fields as optional (except those you are sure exist on the page) is a best practice. This prevents the AI scraper from making up values for fields that aren’t actually present on the page.
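
As a quick illustration of why those defaults matter (this snippet reuses the Product model defined above): a JSON object that omits the optional fields still validates, while omitting the required name field raises a ValidationError:

from pydantic import ValidationError

# Optional fields can simply be absent from the model's JSON output
print(Product.model_validate_json('{"name": "Abominable Hoodie"}'))

# ...but omitting the only required field fails validation
try:
    Product.model_validate_json('{"price": 69.0}')
except ValidationError as e:
    print(e)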

Pro Tip: If a field’s name isn’t descriptive enough on its own, add a description using Pydantic’s Field (imported from pydantic). For example:

short_description: Optional[str] = Field(
    None, description="A brief description of the product displayed below the product name."
)

Now, start the scraping process by calling the functions you’ve defined earlier:

agent = await initialize_scraping_agent()
scraped_object = await execute_scraping_task(
    agent,
    "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
    Product
)
export_to_json(scraped_object, "product.json")

The resulting AI scraping prompt will look like this:

Scrape the following product page as Markdown:
"https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/"

From the scraped content, extract the product information and populate it into the JSON schema below:
    ```json
    {'$defs': {'AdditionalInfo': {'properties': {'size': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'default': None, 'title': 'Size'}, 'color': {'anyOf': [{'items': {'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'default': None, 'title': 'Color'}}, 'title': 'AdditionalInfo', 'type': 'object'}, 'Category': {'properties': {'name': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Name'}, 'url': {'anyOf': [{'format': 'uri', 'maxLength': 2083, 'minLength': 1, 'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Url'}}, 'title': 'Category', 'type': 'object'}}, 'properties': {'images': {'anyOf': [{'items': {'format': 'uri', 'maxLength': 2083, 'minLength': 1, 'type': 'string'}, 'type': 'array'}, {'type': 'null'}], 'default': None, 'title': 'Images'}, 'name': {'title': 'Name', 'type': 'string'}, 'price': {'anyOf': [{'type': 'number'}, {'type': 'null'}], 'default': None, 'title': 'Price'}, 'currency': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Currency'}, 'sku': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Sku'}, 'category': {'anyOf': [{'$ref': '#/$defs/Category'}, {'type': 'null'}], 'default': None}, 'short_description': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Short Description'}, 'long_description': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None, 'title': 'Long Description'}, 'additional_info': {'anyOf': [{'$ref': '#/$defs/AdditionalInfo'}, {'type': 'null'}], 'default': None}}, 'required': ['name'], 'title': 'Product', 'type': 'object'}
    ```
Return only the JSON object as a plain string (no explanations or extra output).

See how clear this is thanks to the Pydantic model trick. Fantastic!

Step #6: Complete Code and Execution

Below is the final AI scraper code that you should now have in agent.py:

import asyncio
from huggingface_hub import Agent
from typing import Optional, List
from pydantic import BaseModel, HttpUrl

class Category(BaseModel):
    name: Optional[str] = None
    url: Optional[HttpUrl] = None

class AdditionalInfo(BaseModel):
    size: Optional[List[str]] = None
    color: Optional[List[str]] = None

class Product(BaseModel):
    images: Optional[List[HttpUrl]] = None
    name: str
    price: Optional[float] = None
    currency: Optional[str] = None
    sku: Optional[str] = None
    category: Optional[Category] = None
    short_description: Optional[str] = None
    long_description: Optional[str] = None
    additional_info: Optional[AdditionalInfo] = None

async def initialize_scraping_agent():
    # Bright Data Web MCP configuration
    bright_data_mcp_server = {
        "type": "stdio",
        "command": "npx",
        "args": ["-y", "@brightdata/mcp"],
        "env": {
          "API_TOKEN": "<YOUR_BRIGHT_DATA_API_KEY>"
        }
    }

    # Initialize the agent
    agent = Agent(
        servers=[bright_data_mcp_server],
        provider="nebius",
        model="Qwen/Qwen2.5-72B-Instruct",
        api_key="<YOUR_HUGGING_FACE_API_KEY>",
    )
    # Load the MCP tools
    await agent.load_tools()
    # Restrict the available tools to only "scrape_as_markdown"
    agent.available_tools = [
        tool for tool in agent.available_tools
        if tool.get("function", {}).get("name") == "scrape_as_markdown"
    ]

    return agent

async def execute_scraping_task(agent, url, model):
    # Define the task for the agent
    scraping_prompt = f"""
        Scrape the following product page as Markdown:
        "{url}"

        From the scraped content, extract the product information and populate it into the JSON schema below:
        ```json
        {model.model_json_schema()}
        ```
        Return only the JSON object as a plain string (no explanations or extra output).
    """

    # List to collect each partial streamed output
    output = []

    # Run the task with the agent and stream the response
    async for chunk in agent.run(scraping_prompt):
        if hasattr(chunk, "role") and chunk.role == "tool":
            print(f"[TOOL] {chunk.name}: {chunk.content}\n\n", flush=True)
        else:
            delta_content = chunk.choices[0].delta.content
            if delta_content:
                print(delta_content, end="", flush=True)
                output.append(delta_content)

    # Assemble all streamed chunks into a single string
    string_output = "".join(output)

    # Remove the ```json prefix and ``` suffix from the output
    json_string_output = string_output.removeprefix("```json").removesuffix("```").strip()

    # Parse the output to an instance of the provided Pydantic model and return it
    return model.model_validate_json(json_string_output)

def export_to_json(scraped_object, file_name):
    # Save the Pydantic object to a JSON file
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(scraped_object.model_dump_json(indent=2))

async def main():
    # Initialize the agent
    agent = await initialize_scraping_agent()
    # Execute the scraping task and get a Product instance
    scraped_object = await execute_scraping_task(
        agent,
        "https://www.scrapingcourse.com/ecommerce/product/abominable-hoodie/",
        Product
    )
    # Export the scraped data to a JSON file
    export_to_json(scraped_object, "product.json")

if __name__ == "__main__":
    asyncio.run(main())

In just around 100 lines of Python code, you’ve built a general-purpose AI scraper that can transform any web page into a structured JSON output. This wouldn’t have been possible without Hugging Face and Bright Data. Impressive!

Test the AI scraper on the configured scraping scenario with:

python agent.py

The result should be a product.json file like this:

The output product.json file with the scraped data

As you can see, this contains the same product data visible on the target page, but in a structured format that adheres to the Product Pydantic model. That data was retrieved on the fly by your script using the scrape_as_markdown tool from Bright Data and then parsed by an AI model available through Hugging Face. Mission complete!
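
If you plan to reuse the exported data later, a quick sanity check (a small sketch, assuming the Product model and the product.json file produced above) is to load the file back into a validated model instance:

# Load the exported JSON back into a validated Product instance
with open("product.json", encoding="utf-8") as f:
    product = Product.model_validate_json(f.read())

print(product.name, product.price, product.sku)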

Test the AI Scraper Against a Real-World Site

The AI scraping example above was built around ScrapingCourse.com, a site that welcomes web scraping bots. But what if you want to target a website that’s notoriously difficult to scrape, like G2?

Suppose your target page is the G2.com page for Bright Data:

The target page from G2

For this case, a suitable Pydantic model could be:

class Link(BaseModel):
    text: Optional[str] = None
    url: Optional[HttpUrl] = None

class G2Product(BaseModel):
    name: Optional[str] = None
    review_score: Optional[float] = None
    num_reviews: Optional[int] = None
    seller: Optional[Link] = None
    # ...

Execute the AI scraper with the new inputs like this:

async def main():
    # Initialize the agent
    agent = await initialize_scraping_agent()
    # Execute the scraping task and get a G2Product instance
    scraped_object = await execute_scraping_task(
        agent,
        "https://www.g2.com/products/bright-data/reviews",
        G2Product
    )
    # Export the scraped data to a JSON file
    export_to_json(scraped_object, "g2_brightdata.json")

The resulting g2_brightdata.json file will look like this:

The g2_brightdata.json file produced by the AI scraper

Et voilà! Your custom AI scraper now works against any site, even those protected by anti-scraping measures.

Conclusion

In this step-by-step blog post, you learned how to build an AI scraping agent that retrieves pages from virtually any website via the Bright Data Web MCP server and parses them with a Hugging Face AI model. You also saw that this Hugging Face AI scraper can handle sites protected by anti-bot measures, such as G2.

Now it’s your turn: let us know your thoughts on this implementation, share your feedback, and feel free to ask any questions you might have.
