Sudheer Keshav Bhat · October 7, 2021
Load dataframe from a file.
df = pd.read_parquet("sample.parquet")
Replace cell values
df.replace({"find1": "replace1", "find2": "replace2"})
Create/update column by applying a function on existing column
Slow. Prefer vectorized form
df["score_in_percent"] = df["score"].apply(lambda cell: cell * 100)
Rename a column
df.rename({"height": "height_in_inches"}, axis=1)
Update a cell
df.loc[index, "height_in_cms"] = df.loc[index, "height_in_inches"] * 2.54
#OR
df.iloc[row_idx, col_idx] = 100
Create new index ignoring existing
df.reset_index()
Reset an existing index/level
df.reset_index(name="counts")
Convert timestamp into pandas timestamp
pd.Timestamp(ts, unit="ms")
Rename axis
df.rename_axis("sample")
Drop duplicate rows with same name
df.drop_duplicates(["name"])
Add a row
new_df = pd.DataFrame(columns=["ID", "Name"])
row = {"ID": "2365782", "Name": "Sudheer"}
new_df.append(row, ignore_index=True)
Convert datatype
df = df.astype({"ID": "int64"})
Concat two dataframes by row
pd.concat([existing_employees_df, new_employees_df], axis=0, ignore_index=True)
Filter by column names
df.filter(items=["col1", "col4"])
# OR
df[["col1", "col4"]]
Find unique values of a column
df["height"].unique()
# Unique values with counts
df["height"].value_counts()
Filter based on condition
Similar to SQL where
clause.
df_tall = df[df["height_in_inches"] >= 72.]
# OR
df_tall = df.loc[df["height_in_inches"] >= 72.]
Multiple where clauses
df_big_tall = df.loc[
(df["height_in_inches"] >= 72.) & (df["weight_in_kgs"] > 90)
]
Similar to SQL in
clause
df_it_hr = df[df["department"].isin(["IT", "HR"])]
Iterate through rows
Avoid when possible
for index, row in df.iterrows():
pass
Access cell at given row index & column name
df.at[0, "height"]
Dataframe size = rows * cols
df.size
Get a column’s index
df.columns.get_loc("height")
Group by column(s)
df_group = df.groupby("department")
Group size, similar to SQL count(*)
df_group.size()
Sort series
df_height_sorted = df["heigh"].sort()
Sort dataframe
df.sort_values(by=["created_date"], ascending=True)
df.to_parquet("path/to/save/file.parquet")
Get dates between a range of dates
pd.date_range("2022-01-28", "2022-02-04")
Get hours between range of dates/timestamps
pd.date_range("2022-01-28T00:00:00", "2022-01-29T08:00:00", freq="H")