1. single node

get_dummies / SELECT CASE WHEN: polars (5x) < pandas (5x) < duckdb
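A minimal sketch of the two styles being compared: `pd.get_dummies` versus hand-written `CASE WHEN` columns in SQL. The toy frame and the names `t`, `id`, and `color` are made up for illustration. The standard-library `sqlite3` module is used only as a stand-in so the sketch runs without extra installs; the generated query is plain SQL and is assumed to run unchanged in DuckDB. It shows the shape of the code only, not the timings.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({"id": [1, 2, 3, 4], "color": ["red", "green", "red", "blue"]})

# pandas: one call, one indicator column per category.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)

# SQL: one CASE WHEN expression per category, generated from the distinct values.
categories = sorted(df["color"].unique())
case_exprs = ",\n  ".join(
    f"CASE WHEN color = '{c}' THEN 1 ELSE 0 END AS color_{c}" for c in categories
)
query = f"SELECT id,\n  {case_exprs}\nFROM t ORDER BY id"

# sqlite3 is only a stand-in engine here (assumption: DuckDB accepts the same query).
con = sqlite3.connect(":memory:")
df.to_sql("t", con, index=False)
sql_result = pd.read_sql_query(query, con)
con.close()

print(query)
print(sql_result)
```

The SQL side needs the category list up front, which is the part `get_dummies` automates.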

dates: polars (10x) < pandas < ? (syntax too complicated and not automated)

adjoin with sklearn, one-hot encoding: pd sparse < pl sparse < pd < pl

feature-engine only supports pandas, not polars

2. distributed
  1. ray
  2. dask
  3. pyspark

Spark, Dask, and Ray: Choosing the Right Framework (domino.ai)

conceptual difference between polars and spark

compatibility layer on top of dataframe libraries:

tabulated comparison

library  | speed | distributed | api    | sql interface
pandas   | slow  | no          | pandas | pandasql/duckdb
polars   | fast  | no          | polars | Y
duckdb   | fast  | no          | duckdb | Y
dask     |       | yes         | pandas | dask-sql
spark    |       | yes         | spark  | spark-sql
narwhals | NA    | NA          | polars | N

narwhals vs ibis for compatibility

  • Ibis provides a Pythonic frontend to various SQL (as well as Polars LazyFrame) engines
  • Ibis supports SQL engines (and can translate to SQL)
  • Narwhals is extremely lightweight and comes with zero required dependencies; Ibis requires pandas and PyArrow for all backends
  • Ibis has no way to get back to the input type exactly (see: comparison to ibis)