Python Pandas and Pandera


  • Description: Pandas, Series and DataFrame, indexing with .loc/.iloc, missing data, groupby, joins/merges, time series, I/O, and Pandera schemas for runtime DataFrame validation
  • My Notion Note ID: K2A-D2-4
  • Created: 2023-11-12
  • Updated: 2026-05-11
  • License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

1. What Pandas Gives You

  • Tabular-data workhorse for Python data analysis
  • DataFrame ≈ in-memory SQL table, columns of typed arrays + a labelled index
  • Built on NumPy (most numeric columns are ndarray)
  • Rich API for filter / transform / group / join

When to use:

  • Ad-hoc analysis
  • CSV/Parquet ETL
  • Prototyping feature pipelines

When NOT to:

  • Streaming bigger-than-memory data → Polars, DuckDB, PySpark
  • Microsecond-latency online serving
import pandas as pd
import numpy as np

2. Series and DataFrame

  • Series, 1-D labelled array
  • DataFrame, dict-of-Series sharing the same row index
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="x")
df = pd.DataFrame({
    "name": ["Yu", "Lee", "Ada"],
    "age":  [30, 25, 40],
    "team": ["A",  "B",  "A"],
})

df.shape          # (3, 3)
df.dtypes         # name: object, age: int64, team: object
df.columns        # Index(['name', 'age', 'team'], dtype='object')
df.index          # RangeIndex(start=0, stop=3, step=1)
df.head(2)
df.describe()     # numeric summary
df.info()         # types + memory

3. Selection: [], .loc, .iloc

Form                              Meaning
df["col"]                         one column (Series)
df[["a", "b"]]                    several columns (DataFrame)
df[bool_mask]                     row filter
df.loc[row_label, col_label]      label-based
df.iloc[row_pos, col_pos]         position-based
df.at[row, col] / df.iat[i, j]    fast scalar access
df[df["age"] >= 30]
df.loc[df["team"] == "A", ["name", "age"]]
df.iloc[0]                       # first row
df.iloc[:2, 1:]                  # first two rows, cols from idx 1
df.loc[:, "name":"age"]          # label slice, INCLUSIVE on both ends

# Multiple conditions: bitwise & | ~ and PARENTHESIZE
df[(df["age"] >= 30) & (df["team"] == "A")]
  • Combine boolean masks with &, |, ~, never and / or / not: Python's keywords don't broadcast over arrays

4. Missing Data

  • NaN (numeric) or pd.NA (newer nullable dtypes)
  • Propagate through arithmetic
df["age"].isna()                 # mask
df.dropna()                      # drop rows with any NaN
df.dropna(subset=["age"])
df.fillna({"age": 0, "team": "?"})
df["age"].fillna(df["age"].mean())

# Forward/backward fill (time series)
ts.ffill()
ts.bfill()
  • Never test with == NaN; NaN compares unequal even to itself (NaN != NaN). Use .isna() / .notna()
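
A quick contrast of the two markers (toy Series, for illustration):

pd.Series([1, 2, None])                     # float64: ints upcast, missing = NaN
pd.Series([1, 2, None], dtype="Int64")      # nullable integer: missing = <NA>
np.nan == np.nan                            # False, which is why == never works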

5. Transform: assign, apply, Vectorized Ops

df = df.assign(
    age_in_months = lambda d: d["age"] * 12,
    senior        = lambda d: d["age"] >= 30,
)

# Element-wise op (vectorized, fastest)
df["age"] = df["age"] + 1

# Row-wise with a Python function (slow; avoid if you can vectorize)
df["label"] = df.apply(lambda r: f"{r['name']} ({r['team']})", axis=1)

# Series.map for elementwise function over a Series
df["team_name"] = df["team"].map({"A": "Alpha", "B": "Bravo"})
  • Vectorized ops run in C/NumPy, 10–1000× faster than .apply(axis=1)
  • Reach for .apply only when no vectorized form exists
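
For instance, the row-wise label above has a vectorized equivalent via plain Series string concatenation:

df["label"] = df["name"] + " (" + df["team"] + ")"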

6. groupby

g = df.groupby("team")
g["age"].mean()                  # mean age per team
g.agg({"age": ["mean", "max"], "name": "count"})

# Multi-key
df.groupby(["team", df["age"] >= 30])["name"].count()

# Transform = same-shape result, broadcast back
df["age_z"] = df.groupby("team")["age"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Filter groups
df.groupby("team").filter(lambda g: len(g) >= 2)
  • Group keys become the result's index by default
  • Pass as_index=False (or .reset_index()) to keep them as columns
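
Both spellings side by side:

df.groupby("team")["age"].mean()                    # result indexed by team
df.groupby("team", as_index=False)["age"].mean()    # team stays a column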

7. Joins and Concatenation

# Vertical stack (rows)
pd.concat([df1, df2], ignore_index=True)

# Horizontal stack (cols), aligns on index
pd.concat([df1, df2], axis=1)

# SQL-style join
pd.merge(orders, users, on="user_id", how="left")
pd.merge(orders, users, left_on="uid", right_on="id", how="inner",
         suffixes=("_o", "_u"))
  • how: "left", "right", "inner", "outer", "cross"
  • indicator=True adds a _merge column showing source side
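
A sketch with the same (assumed) orders/users frames:

merged = pd.merge(orders, users, on="user_id", how="outer", indicator=True)
merged["_merge"].value_counts()             # left_only / right_only / both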

8. Reshape: pivot, melt, stack, unstack

# Long → wide
df.pivot(index="date", columns="metric", values="value")
df.pivot_table(index="team", values="age", aggfunc="mean")

# Wide → long
wide.melt(id_vars=["id"], var_name="metric", value_name="value")

# MultiIndex pivot
df.set_index(["team", "date"]).unstack("team")
df_multiindex.stack()                    # one level → into the index
  • pivot errors on duplicate (index, columns) pairs
  • pivot_table aggregates them (default mean)
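
A toy frame showing the difference:

dup = pd.DataFrame({"date": ["d1", "d1"], "metric": ["m", "m"], "value": [1, 3]})
dup.pivot(index="date", columns="metric", values="value")        # ValueError: duplicate entries
dup.pivot_table(index="date", columns="metric", values="value")  # aggregates: mean -> 2.0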

9. Time Series

df["t"] = pd.to_datetime(df["t"])         # parse strings
df = df.set_index("t").sort_index()

df["2026-01"]                             # all rows in Jan 2026 (string slice)
df.loc["2026-01-01":"2026-01-07"]

df.resample("D").mean()                   # daily average
df.resample("1h").agg({"x": "mean", "y": "sum"})
df.rolling(window="7D").mean()            # 7-day rolling mean
df.shift(1)                               # lag by 1
df.tz_localize("UTC").tz_convert("America/Los_Angeles")
  • Time zones bite, always localize naive timestamps before converting
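
A self-contained sketch (synthetic hourly data; names are assumptions):

idx = pd.date_range("2026-01-01", periods=48, freq="h", tz="UTC")
ts = pd.DataFrame({"x": np.arange(48.0)}, index=idx)
ts.resample("D").mean()                     # two daily means
ts["x"].rolling("24h").mean()               # trailing 24-hour mean
ts.index.tz_convert("America/Los_Angeles")  # already tz-aware, so convert directly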

10. I/O

df = pd.read_csv("data.csv", parse_dates=["date"], dtype={"id": "string"})
df = pd.read_parquet("data.parquet")       # fastest columnar format
df = pd.read_json("data.json", lines=True) # ndjson
df = pd.read_sql("SELECT * FROM users", con=engine)

df.to_csv("out.csv", index=False)
df.to_parquet("out.parquet", compression="snappy")
df.to_json("out.json", orient="records", lines=True)
  • For anything larger than a few hundred MB, prefer Parquet over CSV: files are typically several times smaller and far faster to read
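
If a giant CSV must be read anyway, chunked reading keeps memory bounded (file and column names here are hypothetical):

total = 0
for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()          # aggregate per chunk, combine after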

11. Performance and Gotchas

  • Chained assignment (df[df.x > 0]["y"] = 1) is ambiguous and may not modify df. Use .loc[mask, "y"] = 1 instead. With pandas 3.0, copy-on-write is the default and chained inplace updates raise ChainedAssignmentError; SettingWithCopyWarning has been removed.
  • SettingWithCopyWarning (pre-3.0) came from chained indexing; same .loc-on-one-line fix.
  • object dtype for strings is slow and memory-heavy. Use "string" (nullable) or pyarrow-backed (dtype="string[pyarrow]") for big columns.
  • apply(axis=1) runs row-by-row in Python; try vectorized arithmetic, .map, or np.where first (see the sketch after this list).
  • Categoricals for low-cardinality strings: df["team"] = df["team"].astype("category"), saves memory, speeds up groupby.
  • inplace=True is discouraged and slated for removal from most methods; always re-assign.
  • Polars is often 5–50× faster on many workloads and has a saner API. Worth learning when Pandas becomes the bottleneck.
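
A sketch of the .loc and np.where fixes from the list above:

df.loc[df["age"] >= 30, "senior"] = True                  # not df[df.age >= 30]["senior"] = True
df["bracket"] = np.where(df["age"] >= 30, "30+", "<30")   # not apply(axis=1)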

12. Pandera: Schemas and Validation

  • pip install pandera
  • Runtime schema validation for DataFrames
  • Catches "column missing", "wrong dtype", "negative price", "duplicate ID" at the boundary, with pinpointed offending rows
import pandera.pandas as pa
from pandera.pandas import Column, Check

schema = pa.DataFrameSchema(
    {
        "id":     Column(int,   Check.gt(0), unique=True),
        "name":   Column(str,   Check.str_length(1, 100)),
        "age":    Column(int,   Check.in_range(0, 120), nullable=True),
        "email":  Column(str,   Check.str_matches(r"^[^@]+@[^@]+$")),
        "team":   Column(str,   Check.isin(["A", "B", "C"])),
        "joined": Column(pa.DateTime),
    },
    strict=True,                # fail on unexpected columns
    coerce=True,                # cast columns to declared dtype where possible
)

clean = schema.validate(raw_df, lazy=True)   # lazy: collect ALL errors before raising
  • lazy=True, aggregates every failure into one SchemaErrors instead of stopping at the first
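
Catching the aggregated report; failure_cases is a DataFrame with one row per failed check:

from pandera.errors import SchemaErrors

try:
    schema.validate(raw_df, lazy=True)
except SchemaErrors as err:
    print(err.failure_cases)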

Class-style schemas (preferred: closer to dataclasses, play well with IDEs):

from pandera.typing import Series, DataFrame
import pandera.pandas as pa

class UserSchema(pa.DataFrameModel):
    id:    Series[int] = pa.Field(gt=0, unique=True)
    name:  Series[str] = pa.Field(str_length={"min_value": 1, "max_value": 100})
    age:   Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 120}, nullable=True)
    email: Series[str] = pa.Field(str_matches=r"^[^@]+@[^@]+$")
    team:  Series[str] = pa.Field(isin=["A", "B", "C"])

    class Config:
        strict = True
        coerce = True

# Use as a decorator to validate inputs/outputs of a function
@pa.check_types
def clean_users(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df.assign(name=df["name"].str.strip())
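
Calling it with a frame that breaks the schema raises at the boundary (toy frame):

bad = pd.DataFrame({"id": [-1], "name": [""], "age": [30],
                    "email": ["not-an-email"], "team": ["Z"]})
clean_users(bad)                            # raises pandera.errors.SchemaError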

Custom row-wise checks:

class OrderSchema(pa.DataFrameModel):
    qty:   Series[int]
    price: Series[float]
    total: Series[float]

    @pa.dataframe_check
    def total_matches(cls, df: pd.DataFrame) -> Series[bool]:
        # exact float equality; consider np.isclose if totals carry rounding error
        return df["total"] == df["qty"] * df["price"]
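
Model classes validate like schemas (toy frame):

orders = pd.DataFrame({"qty": [2], "price": [9.5], "total": [19.0]})
OrderSchema.validate(orders)                # passes: total == qty * price row-wise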

Where to put it in a pipeline:

  • Validate at I/O boundaries, file → DataFrame, DataFrame → DB, function I/O
  • For hot-path / huge columns: Config.strict = "filter" drops unknown columns (sketch below); or sample rows for spot checks
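
The same option on a DataFrameSchema (sketch, reusing raw_df from above):

relaxed = pa.DataFrameSchema({"id": Column(int, Check.gt(0))}, strict="filter")
relaxed.validate(raw_df)                    # unknown columns are dropped, not rejected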
