Spaces:
Running
Running
| # /// script | |
| # requires-python = ">=3.10" | |
| # dependencies = [ | |
| # "marimo", | |
| # "duckdb==1.3.2", | |
| # "pyarrow==19.0.1", | |
| # "plotly.express", | |
| # "sqlglot==27.0.0", | |
| # ] | |
| # /// | |
| import marimo | |
| __generated_with = "0.14.10" | |
| app = marimo.App(width="medium") | |
| def _(mo): | |
| mo.md( | |
| r""" | |
| # Loading Parquet files with DuckDB | |
| *By [Thomas Liang](https://github.com/thliang01)* | |
| # | |
| """ | |
| ) | |
| return | |
| def _(mo): | |
| mo.md( | |
| r""" | |
| [Apache Parquet](https://parquet.apache.org/) is a popular columnar storage format, optimized for analytics. Its columnar nature allows query engines like DuckDB to read only the necessary columns, leading to significant performance gains, especially for wide tables. | |
| DuckDB has excellent, built-in support for reading Parquet files, making it incredibly easy to query and analyze Parquet data directly without a separate loading step. | |
| In this notebook, we'll explore how to load and analyze Airbnb's stock price data from a remote Parquet file: | |
| <ul> | |
| <li>Querying a remote Parquet file directly.</li> | |
| <li>Using the `read_parquet` function for more control.</li> | |
| <li>Creating a persistent table from a Parquet file.</li> | |
| <li>Performing basic data analysis and visualization.</li> | |
| </ul> | |
| """ | |
| ) | |
| return | |
| def _(): | |
| AIRBNB_URL = 'https://huggingface.co/datasets/BatteRaquette58/airbnb-stock-price/resolve/main/data/airbnb-stock.parquet' | |
| return (AIRBNB_URL,) | |
| def _(mo): | |
| mo.md(r"""## Using `FROM` to query Parquet files""") | |
| return | |
| def _(mo): | |
| mo.md( | |
| r""" | |
| The simplest way to query a Parquet file is to use it directly in a `FROM` clause, just like you would with a table. DuckDB will automatically detect that it's a Parquet file and read it accordingly. | |
| Let's query a dataset of Airbnb's stock price from Hugging Face. | |
| """ | |
| ) | |
| return | |
| def _(AIRBNB_URL, mo, null): | |
| mo.sql( | |
| f""" | |
| SELECT * | |
| FROM '{AIRBNB_URL}' | |
| LIMIT 5; | |
| """ | |
| ) | |
| return | |
| def _(mo): | |
| mo.md(r"""## Using `read_parquet`""") | |
| return | |
| def _(mo): | |
| mo.md( | |
| r""" | |
| For more control, you can use the `read_parquet` table function. This is useful when you need to specify options, for example, when dealing with multiple files or specific data types. | |
| Some useful options for `read_parquet` include: | |
| - `binary_as_string=True`: Reads `BINARY` columns as `VARCHAR`. | |
| - `filename=True`: Adds a `filename` column with the path of the file for each row. | |
| - `hive_partitioning=True`: Enables reading of Hive-partitioned datasets. | |
| Here, we'll use `read_parquet` to select only a few relevant columns. This is much more efficient than `SELECT *` because DuckDB only needs to read the data for the columns we specify. | |
| """ | |
| ) | |
| return | |
| def _(AIRBNB_URL, mo): | |
| mo.sql( | |
| f""" | |
| SELECT Date, Open, "close_last", High, Low | |
| FROM read_parquet('{AIRBNB_URL}') | |
| LIMIT 5; | |
| """ | |
| ) | |
| return | |
| def _(mo): | |
| mo.md( | |
| r""" | |
| You can also read multiple Parquet files at once using a glob pattern. For example, to read all Parquet files in a directory `data/`: | |
| ```sql | |
| SELECT * FROM read_parquet('data/*.parquet'); | |
| ``` | |
| """ | |
| ) | |
| return | |
| def _(mo): | |
| mo.md(r"""## Creating a table from a Parquet file""") | |
| return | |
| def _(mo): | |
| mo.md( | |
| r""" | |
| While querying Parquet files directly is powerful, sometimes it's useful to load the data into a persistent table within your DuckDB database. This can simplify subsequent queries and is a good practice if you'll be accessing the data frequently. | |
| """ | |
| ) | |
| return | |
| def _(AIRBNB_URL, mo): | |
| stock_table = mo.sql( | |
| f""" | |
| CREATE OR REPLACE TABLE airbnb_stock AS | |
| SELECT * FROM read_parquet('{AIRBNB_URL}'); | |
| """ | |
| ) | |
| return airbnb_stock, stock_table | |
| def _(mo, stock_table): | |
| mo.md( | |
| f""" | |
| {stock_table} | |
| Now that the `airbnb_stock` table is created, we can query it like any other SQL table. | |
| """ | |
| ) | |
| return | |
| def _(airbnb_stock, mo): | |
| mo.sql( | |
| f""" | |
| SELECT * FROM airbnb_stock LIMIT 5; | |
| """ | |
| ) | |
| return | |
| def _(mo): | |
| mo.md(r"""## Analysis and Visualization""") | |
| return | |
| def _(mo): | |
| mo.md(r"""Let's perform a simple analysis: plotting the closing stock price over time.""") | |
| return | |
| def _(airbnb_stock, mo): | |
| stock_data = mo.sql( | |
| f""" | |
| SELECT | |
| CAST(to_timestamp(Date) AS DATE) AS "Date", | |
| "close_last" | |
| FROM airbnb_stock | |
| ORDER BY "Date"; | |
| """ | |
| ) | |
| return (stock_data,) | |
| def _(mo): | |
| mo.md(r"""Now we can easily visualize this result using marimo's integration with plotting libraries like Plotly.""") | |
| return | |
| def _(px, stock_data): | |
| px.line( | |
| stock_data, | |
| x="Date", | |
| y="close_last", | |
| title="Airbnb (ABNB) Stock Price Over Time", | |
| labels={"Date": "Date", "close_last": "Closing Price (USD)"}, | |
| ) | |
| return | |
| def _(mo): | |
| mo.md(r"""## Conclusion""") | |
| return | |
| def _(mo): | |
| mo.md( | |
| r""" | |
| In this notebook, we've seen how easy it is to work with Parquet files in DuckDB. We learned how to: | |
| <ul> | |
| <li>Query Parquet files directly from a URL using a simple `FROM` clause.</li> | |
| <li>Use the `read_parquet` function for more fine-grained control and efficiency.</li> | |
| <li>Load data from a Parquet file into a DuckDB table.</li> | |
| <li>Seamlessly analyze and visualize the data using SQL and Python.</li> | |
| </ul> | |
| DuckDB's native Parquet support makes it a powerful tool for interactive data analysis on large datasets without complex ETL pipelines. | |
| """ | |
| ) | |
| return | |
| def _(): | |
| import marimo as mo | |
| import plotly.express as px | |
| return mo, px | |
| def _(): | |
| import pyarrow | |
| return | |
| if __name__ == "__main__": | |
| app.run() | |