Handling large data with R and Python

Plan for today

  • Introduction:
    • What is Polars and how does it work?
    • Q&A number 1
  • Case study: IPUMS census data (samples)
    • Install and load polars, load data, perform operations, export data
    • Using expressions
    • Writing custom functions
    • Q&A number 2
  • Going further
    • Using streaming, plugins, and future extensions
    • Alternative tools

What this session will not cover

Statistical modelling with big data.


I want to show how to go from fine-grained data (full-count census, mobile phone data at the minute level) to some smaller, aggregated datasets for regressions.


Note that some statistical packages can still perform very well on millions of observations, like fixest in R.

Introduction to Polars

Introduction

polars is a recent DataFrame library that is available for several languages:

  • Python
  • R
  • Rust
  • and more


Built to be very fast and memory-efficient thanks to several mechanisms.


Very consistent and readable syntax.
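For instance, here is a small toy illustration in Python of what that syntax looks like (made-up data, not from the session):

import polars as pl

df = pl.DataFrame({
    "country": ["France", "Japan", "Japan"],
    "gdp": [2.9, 4.2, 4.3],
    "pop": [68.0, 125.0, 124.0],
})

# keep one country and compute a new column with expressions
df.filter(pl.col("country") == "Japan").with_columns(
    gdp_per_cap=pl.col("gdp") / pl.col("pop")
)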

Eager vs Lazy

Eager vs Lazy

Eager evaluation: all operations are run line by line, in order, and applied directly to the data. This is what we’re used to.

### Eager

# Get the data...
my_data = (
    eager_data
    # ... and then sort by iso...
    .sort(pl.col("iso"))
    # ... and then keep only Japan...
    .filter(pl.col("country") == "Japan")
    # ... and then compute GDP per cap.
    .with_columns(gdp_per_cap=pl.col("gdp") / pl.col("pop"))
)

# => get the output

Eager vs Lazy

Lazy evaluation: operations are only run when we call a specific function at the end of the chain, usually called collect().

### Lazy

# Get the data...
my_data = (
    lazy_data
    # ... and then sort by iso...
    .sort(pl.col("iso"))
    # ... and then keep only Japan...
    .filter(pl.col("country") == "Japan")
    # ... and then compute GDP per cap.
    .with_columns(gdp_per_cap=pl.col("gdp") / pl.col("pop"))
)

# => you don't get results yet!

my_data.collect()  # this is how to get results

Eager vs Lazy


When dealing with large data, it is better to use lazy evaluation, because it makes it possible to:


  1. Optimize the code
  2. Catch errors before computations
  3. Use streaming mode

1. Optimize the code


The code below takes some data, sorts it by a variable, and then filters it based on a condition:


(
    data
    .sort(pl.col("country"))
    .filter(pl.col("year").is_in([1950, 1960, 1970, 1980, 1990]))
)


Do you see what could be improved here?

1. Optimize the code


The problem lies in the order of operations: sorting data is much more expensive than filtering it, so we should filter first and only sort the (much smaller) remaining data.


Let’s test with a dataset of 50M observations and 10 columns:

# Sort first, then filter
(
    data
    .sort(pl.col("country"))
    .filter(pl.col("year").is_in([1950, 1960, 1970, 1980, 1990]))
)
#>    user  system elapsed 
#>    3.72    1.41    5.12 

# Filter first, then sort
(
    data
    .filter(pl.col("year").is_in([1950, 1960, 1970, 1980, 1990]))
    .sort(pl.col("country"))
)
#>    user  system elapsed 
#>    0.89    0.23    1.13 
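A minimal sketch of how such a comparison could be reproduced in Python on synthetic data (much smaller than the 50M-row dataset above, so timings will differ):

import time

import numpy as np
import polars as pl

n = 5_000_000
data = pl.DataFrame({
    "country": np.random.choice(["France", "Japan", "Vietnam"], n),
    "year": np.random.choice(np.arange(1900, 2021), n),
})

# sort first, then filter
start = time.perf_counter()
data.sort("country").filter(pl.col("year").is_in([1950, 1960, 1970, 1980, 1990]))
print("sort then filter:", time.perf_counter() - start)

# filter first, then sort
start = time.perf_counter()
data.filter(pl.col("year").is_in([1950, 1960, 1970, 1980, 1990])).sort("country")
print("filter then sort:", time.perf_counter() - start)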

1. Optimize the query plan



There is probably a lot of suboptimal code in our scripts.


But it’s already hard enough to write scripts that work and are reproducible; we don’t want to spend even more time trying to optimize them.


  Let polars do this automatically by using lazy data.

1. Optimize the query plan


When we call collect(), polars doesn’t directly execute the code. Before that, it applies a series of optimizations to the query plan to make sure we don’t run inefficient operations.


Examples of optimizations (see the entire list):

  • do not load rows that are filtered out in the query;
  • do not load variables (columns) that are never used;
  • cache and reuse computations;
  • and many more things.

1. Optimize the query plan


Workflow:

  1. Scan the data to get it in lazy mode.
# Python
import polars as pl

raw_data = pl.scan_parquet("path/to/file.parquet")

# Or
pl.scan_csv("path/to/file.csv")
pl.scan_ndjson("path/to/file.json")
...

# R (polars)
library(polars)

raw_data = pl$scan_parquet("path/to/file.parquet")

# Or
pl$scan_csv("path/to/file.csv")
pl$scan_ndjson("path/to/file.json")
...

# R (tidypolars)
library(tidypolars)

raw_data = scan_parquet_polars("path/to/file.parquet")

# Or
scan_csv_polars("path/to/file.csv")
scan_ndjson_polars("path/to/file.json")
...

This only returns the schema of the data: the column names and their types (character, integers, dates, …).
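A small sketch (hypothetical file path) of how to inspect that schema without reading the data itself:

import polars as pl

raw_data = pl.scan_parquet("path/to/file.parquet")
raw_data.collect_schema()  # column names and data types only, no data is loaded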

1. Optimize the query plan


Workflow:

  2. Write the code that you want to run on the data: filter, sort, create new variables, etc.
# Python
my_data = (
    raw_data
    .sort("iso")
    .filter(
        pl.col("gdp") > 123,
        pl.col("country").is_in(["United Kingdom", "Japan", "Vietnam"])
    )
    .with_columns(gdp_per_cap=pl.col("gdp") / pl.col("population"))
)

# R (polars)
my_data = raw_data$
   sort("iso")$
   filter(
      pl$col("gdp") > 123,
      pl$col("country")$is_in(c("United Kingdom", "Japan", "Vietnam"))
   )$
   with_columns(gdp_per_cap = pl$col("gdp") / pl$col("population"))

# R (tidypolars)
my_data = raw_data |>
   arrange(iso) |>
   filter(
      gdp > 123,
      country %in% c("United Kingdom", "Japan", "Vietnam")
   ) |>
   mutate(gdp_per_cap = gdp / population)

1. Optimize the query plan


Workflow:


  3. Call collect() at the end of the code to execute it.
my_data.collect()       # Python
my_data$collect()       # R (polars)
my_data |> compute()    # R (tidypolars)

1. Optimize the query plan

Tip

You can see the “query plan” (i.e. all the steps in the code) as it was originally written with my_data.explain(optimized=False).


The query plan that is actually run by Polars can be seen with my_data.explain().
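For example (hypothetical path and columns), comparing the two plans:

import polars as pl

my_data = (
    pl.scan_parquet("path/to/file.parquet")
    .filter(pl.col("year") == 1950)
    .select("year", "gdp")
)

print(my_data.explain(optimized=False))  # the plan as written
print(my_data.explain())                 # the plan after optimization (e.g. pushdown into the scan)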

2. Catch errors before computations

2. Catch errors before computations

Calling collect() doesn’t start computations right away.


First, polars scans the code to ensure there are no schema errors, i.e. it checks that we don’t do “forbidden” operations.


For instance, doing pl.col("gdp") > "France" would be an error: we can’t compare a number to a character.


In this case, that would return:

polars.exceptions.ComputeError: cannot compare string with numeric data
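A minimal sketch of this behaviour (the exact exception type and message can vary across Polars versions):

import polars as pl

lf = pl.LazyFrame({"gdp": [1.0, 2.0], "country": ["France", "Japan"]})

query = lf.filter(pl.col("gdp") > "France")  # building the query is fine...
# query.collect()  # ...but collecting raises an error: numbers can't be compared with strings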

3. Use streaming mode

See in the last part.

Q&A number 1

Case study: IPUMS census data (samples)

A note on file formats

We’re used to a few file formats: CSV, Excel, .dta. Polars can read most of them (.dta is not possible for now).


When possible, use the Parquet format (.parquet).


Pros:

  • strong file compression (much smaller files)
  • stores statistics on the columns (e.g. useful for filters)

Cons:

  • might be harder to share with others (e.g. Stata doesn’t have an easy way to read/write Parquet)
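A small sketch (hypothetical file names) of converting a CSV file to Parquet and scanning it lazily afterwards:

import polars as pl

df = pl.read_csv("data.csv")
df.write_parquet("data.parquet")        # compressed, with column statistics stored

lazy = pl.scan_parquet("data.parquet")  # lazy scan for later queries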

Setting up Python

Keeping a clean Python setup is notoriously hard:

XKCD 1987, “Python Environment”

Setting up Python

Python relies a lot on virtual environments: each project should have its own environment that contains the Python version and the libraries used in the project.


This is to avoid conflicts between different requirements:

  • project A requires polars <= 1.20.0
  • project B requires polars >= 1.22.0

If we had only a single system-wide installation of polars, we would constantly have to reinstall one version or the other.


Having one virtual environment per project allows better project reproducibility.

Setting up Python

For the entire setup, I recommend using uv:

  • deals with installing both Python and the libraries
  • user-friendly error messages
  • very fast

Basically 5 commands to remember (once uv is installed):

  • uv init my-project (or uv init) creates the basic files required
  • uv venv creates a virtual environment
  • uv add <package> adds a library to the environment
  • uv sync restores a project to the exact setup recorded in uv.lock
  • uv run file.py runs a script in the project’s virtual environment

Setting up Python

The files pyproject.toml, uv.lock, and .python-version are the only ones needed to restore a project.


You do not need to share the .venv folder with colleagues: they can just run uv sync and a .venv with the exact same packages will be created.

Setting up Python

For more exploratory analysis (i.e. before writing scripts meant to be run on the whole data), we can use Jupyter notebooks.


  • create a new file called for instance “demo.ipynb” and save it in the project folder
  • run uv add ipykernel in the terminal
  • in the top right corner of the editor, you should see a “Select kernel” button
  • if you don’t have the “Jupyter” extension installed, the “Select kernel” button will suggest installing it (do it)
  • click on it and select the first option (the path should point to the project’s .venv)

Setting up Python

Automatically formatting the code when we save the file is a nice feature (not essential for today however).


If you want this:

  • in the left sidebar, go to the “Extensions” tab (the four squares icon)
  • search for “Ruff” and install it
  • open the “Settings” (File > Preferences > Settings)
  • search for “Format on save” and tick the box
  • search for “Default formatter” and select “Ruff”

and you should be good to go!

Data to use

  • IPUMS:
    • US Census data for 1850-1960 (per decade)
    • 1% (or 5% in some cases) sample


  • About 23M observations, 116 variables


  • See the SwissTransfer link I sent you

Set up in R

polars and tidypolars are not on CRAN, so install.packages("polars") is not enough.


Sys.setenv(NOT_CRAN = "true")
install.packages(
  c("polars", "tidypolars"), 
  repos = "https://community.r-multiverse.org"
)

Objectives

  • Python (a short sketch combining these steps follows the list):
    • set up the Python project and environment
    • install and load polars
    • scan the data and read a subset
    • perform a simple filter
    • perform an aggregation
    • explore expression namespaces
    • chain expressions
    • export data
  • Same with R (both polars and tidypolars)
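A rough Python sketch of the pipeline these objectives build towards. The column names (YEAR, SEX, AGE) are assumed IPUMS-style variables and the file name is hypothetical; adapt them to the actual extract:

import polars as pl

census = pl.scan_parquet("ipums_census.parquet")   # hypothetical file name

result = (
    census
    .filter(pl.col("YEAR") >= 1900)                # simple filter
    .group_by("YEAR", "SEX")                       # aggregation
    .agg(
        n=pl.len(),
        mean_age=pl.col("AGE").mean(),
    )
    .sort("YEAR", "SEX")
    .collect()
)

result.write_parquet("census_aggregated.parquet")  # export the (much smaller) result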

Q&A number 2

Going further

Custom functions

Maybe you need to use a function that doesn’t exist as-is in Polars.


You have mainly two options:

  1. write the function using Polars syntax
  2. use a map_ function (e.g. .map_elements() or .map_batches()), but this is slower.


Let’s write a custom function for standardizing numeric variables.
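A minimal sketch of option 1, with the standardization written entirely with Polars expressions so it stays fast and works on both DataFrames and LazyFrames:

import polars as pl

def standardize(column: str) -> pl.Expr:
    # (x - mean) / standard deviation, returned as a Polars expression
    x = pl.col(column)
    return ((x - x.mean()) / x.std()).alias(f"{column}_std")

df = pl.DataFrame({"gdp": [1.0, 2.0, 3.0, 4.0]})
df.with_columns(standardize("gdp"))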

Custom functions


Warning

map_elements() and map_batches() work fine in Python, but sometimes they might get a bit buggy in R.

In the new version of Polars in R, it will be recommended to convert the data to R, apply the function needed, and convert it back to Polars.

Larger-than-memory data

Sometimes, data is just too big for our computer, even after all optimizations.


In this case, collect()ing the data will crash the Python or R session.


What are possible strategies for this?

Larger-than-memory data

  1. Use streaming mode

Streaming is a way to run the code on the data in batches, to avoid using all the memory at once.


Polars takes care of splitting the data and runs the code piece by piece.

Larger-than-memory data

Using this is extremely simple: instead of calling collect(), we call collect(engine = "streaming").
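A sketch in Python (hypothetical path; the engine argument is available in recent Polars versions):

import polars as pl

lf = pl.scan_parquet("path/to/big_file.parquet")

result = (
    lf.filter(pl.col("year") >= 1950)
    .group_by("year")
    .agg(pl.len())
    .collect(engine="streaming")  # processed in batches instead of all at once
)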


Warning

Some operations are not yet available in streaming mode, but their number should decrease over time.


Warning

The streaming engine isn’t very reliable in R for now. It is better to use it in Python.

Larger-than-memory data

  2. Think about whether you actually need the data in the session


Maybe you just want to write the data to another file.

Instead of doing collect() + write_*(), use sink_* functions (e.g. sink_parquet()).


This will run the query and write the results progressively to the output path, without collecting in the session.
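A small sketch (hypothetical paths): the query streams its result straight to disk, and the full output is never held in memory:

import polars as pl

(
    pl.scan_parquet("big_input.parquet")
    .filter(pl.col("year") >= 1950)
    .sink_parquet("filtered_output.parquet")
)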

Plugins

Python only (for now)


Polars accepts extensions in the form of new expression namespaces.


Just like we have .str.split() for instance, we could have .dist.jarowinkler().


List of Polars plugins (not made by the Polars developers): https://github.com/ddotta/awesome-polars#polars-plugins

GIS and Polars


While there is some demand for a geopolars package that would enable GIS operations on Polars DataFrames or LazyFrames, it doesn’t exist for now.


Groundwork for this should start in the coming months, so you might expect some movement on geopolars in 2026.

Going further

To get more details:

Alternative tools


There isn’t one tool to rule them all.


Polars is great in many cases, but your experience might vary depending on your type of data or the operations you perform.


There are other tools to process large data in R.

Alternative tools

  • DuckDB:

    • also lazy evaluation
    • has geospatial extensions
    • accompanying package duckplyr (same goal as tidypolars but uses DuckDB in the background)
    • supported by Posit (the company behind RStudio and many packages)
  • arrow: also has lazy evaluation but fewer optimizations than Polars and DuckDB.

  • Spark: I have never used it, but it is available in R and Python.

Conclusion

Conclusion

Polars allows one to handle large data even on a laptop.


Use eager evaluation to explore a sample, build the data processing scripts, etc., but do it like this:

my_data = pl.scan_parquet("my_parquet_file.parquet").head(100).collect()

# then use eager evaluation


Use lazy evaluation to perform the data processing on the entire data.

Conclusion

Which language to use (for Polars)?

  • Python:
    • Polars core developers focus on the Python library.
    • This is where bug fixes and new features appear first.
    • Very reactive for bug reports.
  • R:
    • Volunteer developers.
    • We implement new features and bug fixes slightly later.
    • The package is being rewritten right now and will hopefully be released in the coming weeks.
    • tidypolars is available.

Conclusion



Other tools exist that use the same mechanisms; they are worth exploring!


In R, several packages use the tidyverse syntax but rely on these more powerful tools under the hood: tidypolars, duckplyr, arrow, etc.