Henry Harbeck committed on
Commit b62cbdd · 1 Parent(s): c080865

update marimo version, move import to bottom

  1. polars/13_window_functions.py +107 -105
polars/13_window_functions.py CHANGED
@@ -11,28 +11,22 @@
11
 
12
  import marimo
13
 
14
- __generated_with = "0.12.9"
15
  app = marimo.App(width="medium", app_title="Window Functions")
16
 
17
 
18
- @app.cell
19
- def _():
20
- import marimo as mo
21
- return (mo,)
22
-
23
-
24
  @app.cell(hide_code=True)
25
  def _(mo):
26
  mo.md(
27
  r"""
28
- # Window Functions
29
- _By [Henry Harbeck](https://github.com/henryharbeck)._
30
 
31
- In this notebook, you'll learn how to perform different types of window functions in Polars.
32
- You'll work with partitions, ordering and Polars' available "mapping strategies".
33
 
34
- We'll use a dataset with a few days of paid and organic digital revenue data.
35
- """
36
  )
37
  return
38
 
@@ -54,23 +48,23 @@ def _():
54
  )
55
 
56
  df
57
- return date, dates, df, pl
58
 
59
 
60
  @app.cell(hide_code=True)
61
  def _(mo):
62
  mo.md(
63
  r"""
64
- ## What is a window function?
65
 
66
- A window function performs a calculation across a set of rows that are related to the current row.
67
- They allow you to perform aggregations and other calculations within a group without collapsing
68
- the number of rows (opposed to a group by aggregation, which does collapse the number of rows). Typically the result of a
69
- window function is assigned back to rows within the group, but Polars also offers additional alternatives.
70
 
71
- Window functions can be used by specifying the [`over`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
72
- method on an expression.
73
- """
74
  )
75
  return
76
 
@@ -79,10 +73,10 @@ def _(mo):
79
  def _(mo):
80
  mo.md(
81
  r"""
82
- ## Partitions
83
- Partitions are the "group by" columns. We will have one "window" of data per unique value in the partition column(s), to
84
- which the function will be applied.
85
- """
86
  )
87
  return
88
 
@@ -91,10 +85,10 @@ def _(mo):
91
  def _(mo):
92
  mo.md(
93
  r"""
94
- ### Partitioning by a single column
95
 
96
- Let's get the total revenue per date...
97
- """
98
  )
99
  return
100
 
@@ -109,7 +103,9 @@ def _(df, pl):
109
 
110
  @app.cell(hide_code=True)
111
  def _(mo):
112
- mo.md(r"""And then see what percentage of the daily total was Paid and what percentage was Organic.""")
 
 
113
  return
114
 
115
 
@@ -123,9 +119,9 @@ def _(daily_revenue, df, pl):
123
  def _(mo):
124
  mo.md(
125
  r"""
126
- Let's now calculate the maximum revenue, cumulative revenue, rank the revenue and calculate the day-on-day change,
127
- all partitioned (split) by channel.
128
- """
129
  )
130
  return
131
 
@@ -145,10 +141,10 @@ def _(df, pl):
145
  def _(mo):
146
  mo.md(
147
  r"""
148
- Note that aggregation functions such as `sum` and `max` have their value applied back to each row in the partition
149
- (group). Non-aggregate functions such as `cum_sum`, `rank` and `diff` can produce different values per row, but
150
- still only consider rows within their partition.
151
- """
152
  )
153
  return
154
 
@@ -157,13 +153,13 @@ def _(mo):
157
  def _(mo):
158
  mo.md(
159
  r"""
160
- ### Partitioning by multiple columns
161
 
162
- We can also partition by multiple columns.
163
 
164
- Let's add a column to see whether it is a weekday (business day), then get the maximum revenue by that and
165
- the channel.
166
- """
167
  )
168
  return
169
 
@@ -184,12 +180,12 @@ def _(df, pl):
184
  def _(mo):
185
  mo.md(
186
  r"""
187
- ### Partitioning by expressions
188
 
189
- Polars also lets you partition by expressions without needing to create them as columns first.
190
 
191
- So, we could re-write the previous window function as...
192
- """
193
  )
194
  return
195
 
@@ -208,10 +204,10 @@ def _(df, pl):
208
  def _(mo):
209
  mo.md(
210
  r"""
211
- Window functions fit into Polars' composable [expressions API](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions),
212
- so can be combined with all [aggregation methods](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html)
213
- and methods that consider more than 1 row (e.g., `cum_sum`, `rank` and `diff` as we just saw).
214
- """
215
  )
216
  return
217
 
@@ -220,14 +216,14 @@ def _(mo):
220
  def _(mo):
221
  mo.md(
222
  r"""
223
- ## Ordering
224
 
225
- The `order_by` parameter controls how to order the data within the window. The function is applied to the data in this
226
- order.
227
 
228
- Up until this point, we have been letting Polars do the window function calculations based on the order of the rows in the
229
- DataFrame. There can be times where we would like order of the calculation and the order of the output itself to differ.
230
- """
231
  )
232
  return
233
 
@@ -236,11 +232,11 @@ def _(mo):
236
  def _(mo):
237
  mo.md(
238
  """
239
- ### Ordering in a window function
240
 
241
- Let's say we want the DataFrame ordered by day of week, but we still want cumulative revenue and the first revenue observation, both
242
- ordered by date and partitioned by channel...
243
- """
244
  )
245
  return
246
 
@@ -269,13 +265,13 @@ def _(df, pl):
269
  def _(mo):
270
  mo.md(
271
  r"""
272
- ### Note about window function ordering compared to SQL
273
 
274
- It is worth noting that traditionally in SQL, many more functions require an `ORDER BY` within `OVER` than in
275
- equivalent functions in Polars.
276
 
277
- For example, an SQL `RANK()` expression like...
278
- """
279
  )
280
  return
281
 
@@ -301,9 +297,9 @@ def _(df, mo):
301
  def _(mo):
302
  mo.md(
303
  r"""
304
- ...does not require an `order_by` in Polars as the column and the function are already bound (including with the
305
- `descending=True` argument).
306
- """
307
  )
308
  return
309
 
@@ -323,10 +319,10 @@ def _(df, pl):
323
  def _(mo):
324
  mo.md(
325
  r"""
326
- ### Descending order
327
 
328
- We can also order in descending order by passing `descending=True`...
329
- """
330
  )
331
  return
332
 
@@ -356,13 +352,13 @@ def _(df_sorted, pl):
356
  def _(mo):
357
  mo.md(
358
  """
359
- ## Mapping Strategies
360
 
361
- Mapping Strategies control how Polars maps the result of the window function back to the original DataFrame
362
 
363
- Generally (by default) the result of a window function is assigned back to rows within the group. Through Polars' mapping
364
- strategies, we will explore other possibilities.
365
- """
366
  )
367
  return
368
 
@@ -371,11 +367,11 @@ def _(mo):
371
  def _(mo):
372
  mo.md(
373
  """
374
- ### Group to rows
375
 
376
- "group_to_rows" is the default mapping strategy and assigns the result of the window function back to the rows in the
377
- window.
378
- """
379
  )
380
  return
381
 
@@ -392,10 +388,10 @@ def _(df, pl):
392
  def _(mo):
393
  mo.md(
394
  """
395
- ### Join
396
 
397
- The "join" mapping strategy aggregates the resulting values in a list and repeats the list for all rows in the group.
398
- """
399
  )
400
  return
401
 
@@ -412,14 +408,14 @@ def _(df, pl):
412
  def _(mo):
413
  mo.md(
414
  r"""
415
- ### Explode
416
 
417
- The "explode" mapping strategy is similar to "group_to_rows", but is typically faster and does not preserve the order of
418
- rows. Due to this, it requires sorting columns (including those not in the window function) for the result to make sense.
419
- It should also only be used in a `select` context and not `with_columns`.
420
 
421
- The result of "explode" is similar to a `group_by` followed by an `agg` followed by an `explode`.
422
- """
423
  )
424
  return
425
 
@@ -451,11 +447,11 @@ def _(mo):
451
  def _(mo):
452
  mo.md(
453
  r"""
454
- ### Reusing a window
455
 
456
- In SQL there is a `WINDOW` keyword, which easily allows the re-use of the same window specification across expressions
457
- without needing to repeat it. In Polars, this can be achieved by using `dict` unpacking to pass arguments to `over`.
458
- """
459
  )
460
  return
461
 
@@ -474,21 +470,21 @@ def _(df_sorted, pl):
474
  daily_revenue_rank=pl.col("revenue").rank().over(**window),
475
  cumulative_daily_revenue=pl.col("revenue").cum_sum().over(**window),
476
  )
477
- return (window,)
478
 
479
 
480
  @app.cell(hide_code=True)
481
  def _(mo):
482
  mo.md(
483
  r"""
484
- ### Rolling Windows
485
 
486
- Much like in SQL, Polars also gives you the ability to do rolling window computations. In Polars, the rolling calculation
487
- is also aware of temporal data, making it easy to express if the data is not contiguous (i.e., observations are missing).
488
 
489
- Let's look at an example of that now by filtering out one day of our data and then calculating both a 3-day and 3-row
490
- max revenue split by channel...
491
- """
492
  )
493
  return
494
 
@@ -524,15 +520,21 @@ def _(mo):
524
  def _(mo):
525
  mo.md(
526
  r"""
527
- ## Additional References
528
 
529
- - [Polars User guide - Window functions](https://docs.pola.rs/user-guide/expressions/window-functions/)
530
- - [Polars over method API reference](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
531
- - [PostgreSQL window function documentation](https://www.postgresql.org/docs/current/tutorial-window.html)
532
- """
533
  )
534
  return
535
 
536
 
 
 
 
 
 
 
537
  if __name__ == "__main__":
538
  app.run()
 
11
 
12
  import marimo
13
 
14
+ __generated_with = "0.13.11"
15
  app = marimo.App(width="medium", app_title="Window Functions")
16
 
17
 
 
 
 
 
 
 
18
  @app.cell(hide_code=True)
19
  def _(mo):
20
  mo.md(
21
  r"""
22
+ # Window Functions
23
+ _By [Henry Harbeck](https://github.com/henryharbeck)._
24
 
25
+ In this notebook, you'll learn how to perform different types of window functions in Polars.
26
+ You'll work with partitions, ordering and Polars' available "mapping strategies".
27
 
28
+ We'll use a dataset with a few days of paid and organic digital revenue data.
29
+ """
30
  )
31
  return
32
 
 
48
  )
49
 
50
  df
51
+ return date, df, pl
52
 
53
 
54
  @app.cell(hide_code=True)
55
  def _(mo):
56
  mo.md(
57
  r"""
58
+ ## What is a window function?
59
 
60
+ A window function performs a calculation across a set of rows that are related to the current row.
61
+ They allow you to perform aggregations and other calculations within a group without collapsing
62
+ the number of rows (opposed to a group by aggregation, which does collapse the number of rows). Typically the result of a
63
+ window function is assigned back to rows within the group, but Polars also offers additional alternatives.
64
 
65
+ Window functions can be used by specifying the [`over`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
66
+ method on an expression.
67
+ """
68
  )
69
  return
70
 
 
73
  def _(mo):
74
  mo.md(
75
  r"""
76
+ ## Partitions
77
+ Partitions are the "group by" columns. We will have one "window" of data per unique value in the partition column(s), to
78
+ which the function will be applied.
79
+ """
80
  )
81
  return
82
 
 
85
  def _(mo):
86
  mo.md(
87
  r"""
88
+ ### Partitioning by a single column
89
 
90
+ Let's get the total revenue per date...
91
+ """
92
  )
93
  return
94
 
 
103
 
104
  @app.cell(hide_code=True)
105
  def _(mo):
106
+ mo.md(
107
+ r"""And then see what percentage of the daily total was Paid and what percentage was Organic."""
108
+ )
109
  return
110
 
111
 
 
119
  def _(mo):
120
  mo.md(
121
  r"""
122
+ Let's now calculate the maximum revenue, cumulative revenue, rank the revenue and calculate the day-on-day change,
123
+ all partitioned (split) by channel.
124
+ """
125
  )
126
  return
127
 
 
141
  def _(mo):
142
  mo.md(
143
  r"""
144
+ Note that aggregation functions such as `sum` and `max` have their value applied back to each row in the partition
145
+ (group). Non-aggregate functions such as `cum_sum`, `rank` and `diff` can produce different values per row, but
146
+ still only consider rows within their partition.
147
+ """
148
  )
149
  return
150
 
 
153
  def _(mo):
154
  mo.md(
155
  r"""
156
+ ### Partitioning by multiple columns
157
 
158
+ We can also partition by multiple columns.
159
 
160
+ Let's add a column to see whether it is a weekday (business day), then get the maximum revenue by that and
161
+ the channel.
162
+ """
163
  )
164
  return
165
 
 
180
  def _(mo):
181
  mo.md(
182
  r"""
183
+ ### Partitioning by expressions
184
 
185
+ Polars also lets you partition by expressions without needing to create them as columns first.
186
 
187
+ So, we could re-write the previous window function as...
188
+ """
189
  )
190
  return
191
 
 
204
  def _(mo):
205
  mo.md(
206
  r"""
207
+ Window functions fit into Polars' composable [expressions API](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions),
208
+ so can be combined with all [aggregation methods](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html)
209
+ and methods that consider more than 1 row (e.g., `cum_sum`, `rank` and `diff` as we just saw).
210
+ """
211
  )
212
  return
213
 
 
216
  def _(mo):
217
  mo.md(
218
  r"""
219
+ ## Ordering
220
 
221
+ The `order_by` parameter controls how to order the data within the window. The function is applied to the data in this
222
+ order.
223
 
224
+ Up until this point, we have been letting Polars do the window function calculations based on the order of the rows in the
225
+ DataFrame. There can be times where we would like order of the calculation and the order of the output itself to differ.
226
+ """
227
  )
228
  return
229
 
 
232
  def _(mo):
233
  mo.md(
234
  """
235
+ ### Ordering in a window function
236
 
237
+ Let's say we want the DataFrame ordered by day of week, but we still want cumulative revenue and the first revenue observation, both
238
+ ordered by date and partitioned by channel...
239
+ """
240
  )
241
  return
242
 
 
265
  def _(mo):
266
  mo.md(
267
  r"""
268
+ ### Note about window function ordering compared to SQL
269
 
270
+ It is worth noting that traditionally in SQL, many more functions require an `ORDER BY` within `OVER` than in
271
+ equivalent functions in Polars.
272
 
273
+ For example, an SQL `RANK()` expression like...
274
+ """
275
  )
276
  return
277
 
 
297
  def _(mo):
298
  mo.md(
299
  r"""
300
+ ...does not require an `order_by` in Polars as the column and the function are already bound (including with the
301
+ `descending=True` argument).
302
+ """
303
  )
304
  return
305
 
 
319
  def _(mo):
320
  mo.md(
321
  r"""
322
+ ### Descending order
323
 
324
+ We can also order in descending order by passing `descending=True`...
325
+ """
326
  )
327
  return
328
 
 
352
  def _(mo):
353
  mo.md(
354
  """
355
+ ## Mapping Strategies
356
 
357
+ Mapping Strategies control how Polars maps the result of the window function back to the original DataFrame
358
 
359
+ Generally (by default) the result of a window function is assigned back to rows within the group. Through Polars' mapping
360
+ strategies, we will explore other possibilities.
361
+ """
362
  )
363
  return
364
 
 
367
  def _(mo):
368
  mo.md(
369
  """
370
+ ### Group to rows
371
 
372
+ "group_to_rows" is the default mapping strategy and assigns the result of the window function back to the rows in the
373
+ window.
374
+ """
375
  )
376
  return
377
 
 
388
  def _(mo):
389
  mo.md(
390
  """
391
+ ### Join
392
 
393
+ The "join" mapping strategy aggregates the resulting values in a list and repeats the list for all rows in the group.
394
+ """
395
  )
396
  return
397
 
 
408
  def _(mo):
409
  mo.md(
410
  r"""
411
+ ### Explode
412
 
413
+ The "explode" mapping strategy is similar to "group_to_rows", but is typically faster and does not preserve the order of
414
+ rows. Due to this, it requires sorting columns (including those not in the window function) for the result to make sense.
415
+ It should also only be used in a `select` context and not `with_columns`.
416
 
417
+ The result of "explode" is similar to a `group_by` followed by an `agg` followed by an `explode`.
418
+ """
419
  )
420
  return
421
 
 
447
  def _(mo):
448
  mo.md(
449
  r"""
450
+ ### Reusing a window
451
 
452
+ In SQL there is a `WINDOW` keyword, which easily allows the re-use of the same window specification across expressions
453
+ without needing to repeat it. In Polars, this can be achieved by using `dict` unpacking to pass arguments to `over`.
454
+ """
455
  )
456
  return
457
 
 
470
  daily_revenue_rank=pl.col("revenue").rank().over(**window),
471
  cumulative_daily_revenue=pl.col("revenue").cum_sum().over(**window),
472
  )
473
+ return
474
 
475
 
476
  @app.cell(hide_code=True)
477
  def _(mo):
478
  mo.md(
479
  r"""
480
+ ### Rolling Windows
481
 
482
+ Much like in SQL, Polars also gives you the ability to do rolling window computations. In Polars, the rolling calculation
483
+ is also aware of temporal data, making it easy to express if the data is not contiguous (i.e., observations are missing).
484
 
485
+ Let's look at an example of that now by filtering out one day of our data and then calculating both a 3-day and 3-row
486
+ max revenue split by channel...
487
+ """
488
  )
489
  return
490
 
 
520
  def _(mo):
521
  mo.md(
522
  r"""
523
+ ## Additional References
524
 
525
+ - [Polars User guide - Window functions](https://docs.pola.rs/user-guide/expressions/window-functions/)
526
+ - [Polars over method API reference](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
527
+ - [PostgreSQL window function documentation](https://www.postgresql.org/docs/current/tutorial-window.html)
528
+ """
529
  )
530
  return
531
 
532
 
533
+ @app.cell(hide_code=True)
534
+ def _():
535
+ import marimo as mo
536
+ return (mo,)
537
+
538
+
539
  if __name__ == "__main__":
540
  app.run()
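
The diff above only touches the notebook's markdown cells, so the code cells they describe are not shown. As a rough illustration of the semantics that text describes, here is a plain-Python sketch (not Polars itself, and not the notebook's code) of three behaviours: an aggregate such as `sum` being assigned back to every row of its partition, a per-row function such as `cum_sum` staying within its partition, and the "join" mapping strategy repeating a list on every row of the group. The sample rows and the helper names `window_sum`, `window_cum_sum` and `window_join` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; the notebook's real dataset is not part of this diff.
rows = [
    {"date": "2025-01-01", "channel": "Paid", "revenue": 10},
    {"date": "2025-01-01", "channel": "Organic", "revenue": 20},
    {"date": "2025-01-02", "channel": "Paid", "revenue": 30},
    {"date": "2025-01-02", "channel": "Organic", "revenue": 40},
]


def window_sum(rows, partition_key, value_key):
    """Sum per partition, then assign the total back to every row (no collapse)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[partition_key]] += row[value_key]
    return [totals[row[partition_key]] for row in rows]


def window_cum_sum(rows, partition_key, value_key):
    """Running total in row order, restarted independently for each partition."""
    running = defaultdict(int)
    result = []
    for row in rows:
        running[row[partition_key]] += row[value_key]
        result.append(running[row[partition_key]])
    return result


def window_join(rows, partition_key, value_key):
    """Collect each partition's values in a list and repeat that list per row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row[value_key])
    return [groups[row[partition_key]] for row in rows]


channel_total = window_sum(rows, "channel", "revenue")
channel_running = window_cum_sum(rows, "channel", "revenue")
channel_values = window_join(rows, "channel", "revenue")

for row, total, running, values in zip(
    rows, channel_total, channel_running, channel_values
):
    print(row["date"], row["channel"], row["revenue"], total, running, values)
```

Each helper returns one value per input row, which is the point of a window function: the output keeps the original row count, unlike a group-by aggregation. In Polars the same ideas are expressed by calling `over` on an expression, as the notebook text explains.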