Pandas quick reference

Sudheer Keshav Bhat · October 7, 2021

Data Engineering

Load

Load dataframe from a file.

df = pd.read_parquet("sample.parquet")

Replace cell values

df.replace({"find1": "replace1", "find2": "replace2"})

Create/update column by applying a function on existing column

Slow. Prefer vectorized form

df["score_in_percent"] = df["score"].apply(lambda cell: cell * 100)

Rename a column

df.rename({"height": "height_in_inches"}, axis=1)

Update a cell

df.loc[index, "height_in_cms"] = df.loc[index, "height_in_inches"] * 2.54
#OR
df.iloc[row_idx, col_idx] = 100

Create new index ignoring existing

df.reset_index()

Reset an existing index/level

df.reset_index(name="counts")

Convert timestamp into pandas timestamp

pd.Timestamp(ts, unit="ms")

Rename axis

df.rename_axis("sample")

Drop duplicate rows with same name

df.drop_duplicates(["name"])

Add a row

new_df = pd.DataFrame(columns=["ID", "Name"])
row = {"ID": "2365782", "Name": "Sudheer"}
new_df.append(row, ignore_index=True)

Convert datatype

df = df.astype({"ID": "int64"})

Concat two dataframes by row

pd.concat([existing_employees_df, new_employees_df], axis=0, ignore_index=True)

Filter by column names

df.filter(items=["col1", "col4"])
# OR
df[["col1", "col4"]]

Find unique values of a column

df["height"].unique()
# Unique values with counts
df["height"].value_counts()

Filter based on condition Similar to SQL where clause.

df_tall = df[df["height_in_inches"] >= 72.]
# OR
df_tall = df.loc[df["height_in_inches"] >= 72.]

Multiple where clauses

df_big_tall = df.loc[
    (df["height_in_inches"] >= 72.) & (df["weight_in_kgs"] > 90)
]

Similar to SQL in clause

df_it_hr = df[df["department"].isin(["IT", "HR"])]

Iterate through rows

Avoid when possible

for index, row in df.iterrows():
    pass

Access cell at given row index & column name

df.at[0, "height"]

Dataframe size = rows * cols

df.size

Get a column’s index

df.columns.get_loc("height")

Group by column(s)

df_group = df.groupby("department")

Group size, similar to SQL count(*)

df_group.size()

Sort series

df_height_sorted = df["heigh"].sort()

Sort dataframe

df.sort_values(by=["created_date"], ascending=True)

df.to_parquet("path/to/save/file.parquet")

Get dates between a range of dates

pd.date_range("2022-01-28", "2022-02-04")

Get hours between range of dates/timestamps

pd.date_range("2022-01-28T00:00:00", "2022-01-29T08:00:00", freq="H")