Python Pandas and Pandera


  • Description: Pandas, Series and DataFrame, indexing with .loc/.iloc, missing data, groupby, joins/merges, time series, I/O, and Pandera schemas for runtime DataFrame validation
  • My Notion Note ID: K2A-D2-4
  • Created: 2023-11-12
  • Updated: 2026-05-11
  • License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

1. What Pandas Gives You

  • Tabular-data workhorse for Python data analysis
  • DataFrame ≈ in-memory SQL table, columns of typed arrays + a labelled index
  • Built on NumPy (most numeric columns are ndarray)
  • Rich API for filter / transform / group / join

When to use:

  • Ad-hoc analysis
  • CSV/Parquet ETL
  • Prototyping feature pipelines

When NOT to:

  • Streaming bigger-than-memory data → Polars, DuckDB, PySpark
  • Microsecond-latency online serving
import pandas as pd
import numpy as np

2. Series and DataFrame

  • Series, 1-D labelled array
  • DataFrame, dict-of-Series sharing the same row index
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="x")
df = pd.DataFrame({
    "name": ["Yu", "Lee", "Ada"],
    "age":  [30, 25, 40],
    "team": ["A",  "B",  "A"],
})

df.shape          # (3, 3)
df.dtypes         # name: object, age: int64, team: object
df.columns        # Index(['name', 'age', 'team'], dtype='object')
df.index          # RangeIndex(start=0, stop=3, step=1)
df.head(2)
df.describe()     # numeric summary
df.info()         # types + memory

3. Selection: [], .loc, .iloc

Form                              Meaning
df["col"]                         one column (Series)
df[["a", "b"]]                    several columns (DataFrame)
df[bool_mask]                     row filter
df.loc[row_label, col_label]      label-based
df.iloc[row_pos, col_pos]         position-based
df.at[row, col] / df.iat[i, j]    fast scalar access
df[df["age"] >= 30]
df.loc[df["team"] == "A", ["name", "age"]]
df.iloc[0]                       # first row
df.iloc[:2, 1:]                  # first two rows, cols from idx 1
df.loc[:, "name":"age"]          # label slice, INCLUSIVE on both ends

# Multiple conditions: bitwise & | ~ and PARENTHESIZE
df[(df["age"] >= 30) & (df["team"] == "A")]
  • Combine boolean masks with &, |, ~, never and / or / not: Python's keywords don't broadcast over arrays

4. Missing Data

  • NaN (numeric) or pd.NA (newer nullable dtypes)
  • Propagate through arithmetic
df["age"].isna()                 # mask
df.dropna()                      # drop rows with any NaN
df.dropna(subset=["age"])
df.fillna({"age": 0, "team": "?"})
df["age"].fillna(df["age"].mean())

# Forward/backward fill (time series)
ts.ffill()
ts.bfill()
  • Never test with == NaN; NaN compares unequal even to itself (NaN != NaN). Use .isna() / .notna()
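
A quick contrast of the two markers (toy Series, for illustration):

pd.Series([1, 2, None])                     # float64: ints upcast, missing = NaN
pd.Series([1, 2, None], dtype="Int64")      # nullable integer: missing = <NA>
np.nan == np.nan                            # False, which is why == never works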

5. Transform: assign, apply, Vectorized Ops

df = df.assign(
    age_in_months = lambda d: d["age"] * 12,
    senior        = lambda d: d["age"] >= 30,
)

# Element-wise op (vectorized, fastest)
df["age"] = df["age"] + 1

# Row-wise with a Python function (slow; avoid if you can vectorize)
df["label"] = df.apply(lambda r: f"{r['name']} ({r['team']})", axis=1)

# Series.map for elementwise function over a Series
df["team_name"] = df["team"].map({"A": "Alpha", "B": "Bravo"})
  • Vectorized ops run in C/NumPy, 10–1000× faster than .apply(axis=1)
  • Reach for .apply only when no vectorized form exists
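
For instance, the row-wise label above has a vectorized equivalent via plain Series string concatenation:

df["label"] = df["name"] + " (" + df["team"] + ")"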

6. groupby

g = df.groupby("team")
g["age"].mean()                  # mean age per team
g.agg({"age": ["mean", "max"], "name": "count"})

# Multi-key
df.groupby(["team", df["age"] >= 30])["name"].count()

# Transform = same-shape result, broadcast back
df["age_z"] = df.groupby("team")["age"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Filter groups
df.groupby("team").filter(lambda g: len(g) >= 2)
  • Group keys become the result's index by default
  • Pass as_index=False (or .reset_index()) to keep them as columns
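
Both spellings side by side:

df.groupby("team")["age"].mean()                    # result indexed by team
df.groupby("team", as_index=False)["age"].mean()    # team stays a column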

7. Joins and Concatenation

# Vertical stack (rows)
pd.concat([df1, df2], ignore_index=True)

# Horizontal stack (cols), aligns on index
pd.concat([df1, df2], axis=1)

# SQL-style join
pd.merge(orders, users, on="user_id", how="left")
pd.merge(orders, users, left_on="uid", right_on="id", how="inner",
         suffixes=("_o", "_u"))
  • how: "left", "right", "inner", "outer", "cross"
  • indicator=True adds a _merge column showing source side
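
A sketch with the same (assumed) orders/users frames:

merged = pd.merge(orders, users, on="user_id", how="outer", indicator=True)
merged["_merge"].value_counts()             # left_only / right_only / both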

8. Reshape: pivot, melt, stack, unstack

# Long → wide
df.pivot(index="date", columns="metric", values="value")
df.pivot_table(index="team", values="age", aggfunc="mean")

# Wide → long
wide.melt(id_vars=["id"], var_name="metric", value_name="value")

# MultiIndex pivot
df.set_index(["team", "date"]).unstack("team")
df_multiindex.stack()                    # one level → into the index
  • pivot errors on duplicate (index, columns) pairs
  • pivot_table aggregates them (default mean)
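
A toy frame showing the difference:

dup = pd.DataFrame({"date": ["d1", "d1"], "metric": ["m", "m"], "value": [1, 3]})
dup.pivot(index="date", columns="metric", values="value")        # ValueError: duplicate entries
dup.pivot_table(index="date", columns="metric", values="value")  # aggregates: mean -> 2.0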

9. Time Series

df["t"] = pd.to_datetime(df["t"])         # parse strings
df = df.set_index("t").sort_index()

df["2026-01"]                             # all rows in Jan 2026 (string slice)
df.loc["2026-01-01":"2026-01-07"]

df.resample("D").mean()                   # daily average
df.resample("1h").agg({"x": "mean", "y": "sum"})
df.rolling(window="7D").mean()            # 7-day rolling mean
df.shift(1)                               # lag by 1
df.tz_localize("UTC").tz_convert("America/Los_Angeles")
  • Time zones bite, always localize naive timestamps before converting
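
A self-contained sketch (synthetic hourly data; names are assumptions):

idx = pd.date_range("2026-01-01", periods=48, freq="h", tz="UTC")
ts = pd.DataFrame({"x": np.arange(48.0)}, index=idx)
ts.resample("D").mean()                     # two daily means
ts["x"].rolling("24h").mean()               # trailing 24-hour mean
ts.index.tz_convert("America/Los_Angeles")  # already tz-aware, so convert directly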

10. I/O

df = pd.read_csv("data.csv", parse_dates=["date"], dtype={"id": "string"})
df = pd.read_parquet("data.parquet")       # fastest columnar format
df = pd.read_json("data.json", lines=True) # ndjson
df = pd.read_sql("SELECT * FROM users", con=engine)

df.to_csv("out.csv", index=False)
df.to_parquet("out.parquet", compression="snappy")
df.to_json("out.json", orient="records", lines=True)
  • For anything larger than a few hundred MB, prefer Parquet over CSV: files are typically several times smaller and far faster to read
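
If a giant CSV must be read anyway, chunked reading keeps memory bounded (file and column names here are hypothetical):

total = 0
for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()          # aggregate per chunk, combine after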

11. Performance and Gotchas

  • Chained assignment (df[df.x > 0]["y"] = 1) is ambiguous and may not modify df. Use .loc[mask, "y"] = 1 instead. With pandas 3.0, copy-on-write is the default and chained inplace updates raise ChainedAssignmentError; SettingWithCopyWarning has been removed.
  • SettingWithCopyWarning (pre-3.0) came from chained indexing; same .loc-on-one-line fix.
  • object dtype for strings is slow and memory-heavy. Use "string" (nullable) or pyarrow-backed (dtype="string[pyarrow]") for big columns.
  • apply(axis=1) runs row-by-row in Python; try vectorized arithmetic, .map, or np.where first (see the sketch after this list).
  • Categoricals for low-cardinality strings: df["team"] = df["team"].astype("category"), saves memory, speeds up groupby.
  • inplace=True is discouraged and slated for removal from most methods; always re-assign.
  • Polars is often 5–50× faster on many workloads and has a saner API. Worth learning when Pandas becomes the bottleneck.
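
A sketch of the .loc and np.where fixes from the list above:

df.loc[df["age"] >= 30, "senior"] = True                  # not df[df.age >= 30]["senior"] = True
df["bracket"] = np.where(df["age"] >= 30, "30+", "<30")   # not apply(axis=1)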

12. Pandera: Schemas and Validation

  • pip install pandera
  • Runtime schema validation for DataFrames
  • Catches "column missing", "wrong dtype", "negative price", "duplicate ID" at the boundary, with pinpointed offending rows
import pandera.pandas as pa
from pandera.pandas import Column, Check

schema = pa.DataFrameSchema(
    {
        "id":     Column(int,   Check.gt(0), unique=True),
        "name":   Column(str,   Check.str_length(1, 100)),
        "age":    Column(int,   Check.in_range(0, 120), nullable=True),
        "email":  Column(str,   Check.str_matches(r"^[^@]+@[^@]+$")),
        "team":   Column(str,   Check.isin(["A", "B", "C"])),
        "joined": Column(pa.DateTime),
    },
    strict=True,                # fail on unexpected columns
    coerce=True,                # cast columns to declared dtype where possible
)

clean = schema.validate(raw_df, lazy=True)   # lazy: collect ALL errors before raising
  • lazy=True, aggregates every failure into one SchemaErrors instead of stopping at the first
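
Catching the aggregated report; failure_cases is a DataFrame with one row per failed check:

from pandera.errors import SchemaErrors

try:
    schema.validate(raw_df, lazy=True)
except SchemaErrors as err:
    print(err.failure_cases)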

Class-style schemas (preferred: closer to dataclasses, play well with IDEs):

from pandera.typing import Series, DataFrame
import pandera.pandas as pa

class UserSchema(pa.DataFrameModel):
    id:    Series[int] = pa.Field(gt=0, unique=True)
    name:  Series[str] = pa.Field(str_length={"min_value": 1, "max_value": 100})
    age:   Series[int] = pa.Field(in_range={"min_value": 0, "max_value": 120}, nullable=True)
    email: Series[str] = pa.Field(str_matches=r"^[^@]+@[^@]+$")
    team:  Series[str] = pa.Field(isin=["A", "B", "C"])

    class Config:
        strict = True
        coerce = True

# Use as a decorator to validate inputs/outputs of a function
@pa.check_types
def clean_users(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df.assign(name=df["name"].str.strip())
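
Calling it with a frame that breaks the schema raises at the boundary (toy frame):

bad = pd.DataFrame({"id": [-1], "name": [""], "age": [30],
                    "email": ["not-an-email"], "team": ["Z"]})
clean_users(bad)                            # raises pandera.errors.SchemaError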

Custom row-wise checks:

class OrderSchema(pa.DataFrameModel):
    qty:   Series[int]
    price: Series[float]
    total: Series[float]

    @pa.dataframe_check
    def total_matches(cls, df: pd.DataFrame) -> Series[bool]:
        # exact float equality; consider np.isclose if totals carry rounding error
        return df["total"] == df["qty"] * df["price"]
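
Model classes validate like schemas (toy frame):

orders = pd.DataFrame({"qty": [2], "price": [9.5], "total": [19.0]})
OrderSchema.validate(orders)                # passes: total == qty * price row-wise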

Where to put it in a pipeline:

  • Validate at I/O boundaries, file → DataFrame, DataFrame → DB, function I/O
  • For hot-path / huge columns: Config.strict = "filter" drops unknown columns (sketch below); or sample rows for spot checks
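
The same option on a DataFrameSchema (sketch, reusing raw_df from above):

relaxed = pa.DataFrameSchema({"id": Column(int, Check.gt(0))}, strict="filter")
relaxed.validate(raw_df)                    # unknown columns are dropped, not rejected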
