top of page
Search
Writer's picturesrnamatej

Pandas library: Tips and tricks

Updated: Feb 16, 2023

Pandas is open source Python library mainly used for data processing. The development began in 2008 by AQR Capital Management and was officially published as an open source a year later.

This library is praised for its fast and efficient DataFrame which allows manipulating data easily. It also provides tools for merging the datasets, finding and replacing missing values, slicing the data, reading, and writing various data formats (CSV, XLSX, SQL, JSON, XML, and others).


Here are tips and tricks I wish I knew right from the beginning I started using the Pandas library:


This function allows using the expressions as a boolean form. Basically, it queries the column of a particular DataFrame with this expression.

The expression can consist of multiple conditions based on the need. This function allows data to be easily filtered and also the syntax is less complex.


The query filtering data by price and brand name
The query filtering data by the price and brand name

In this query, the data is being filtered by the price higher than 17 000 and also by the brand which equals Volvo. The syntax is less complex rather than using the pandas loc function.

NOTE: if the type of the column is an object instead of int or float the number values in the filter must be in quotation marks otherwise, the number values do not need quotation marks.

The query function also able users to use variables to filter through the dataset. This also saves time in writing the syntax.


data.query("price > 17000 & make == 'volvo'")

The query filtering data by the price and brand name stored in variable
The query filtering data by the price and brand name stored in variable

To connect the variable in the query function the symbol "@" is used. Otherwise, the variable will not work.


brands = ["volvo"]
data.query("price > 17000 & make == @brands")


This functionality easily shows the largest and smallest items in the dataset. Instead of sorting the data based on filter value in ascending/descending order and then selecting the specified n number of items this function does this within itself.


This function shows the smallest and largest defined number of data.
This function shows the smallest and largest defined number of data.

The nlargest function is looking for the largest price numbers sorted in descending order and it is showing the first 5 items in the dataset.

On the other hand, the nsmallest function does this in the opposite order, so the data is sorted in ascending order and it shows the 5 lowest items in the dataset.

NOTE: the data type of the column must be int or float, otherwise the function cannot be used.


data.nlargest(5, "price")

data.nsmallest(5,"price")


This function assigns the data into bins based on predefined categories and it's especially helpful to sort data into category ranges. To define the category range the function uses bins and to specify the name of the category the function uses labels.

This cut function assigns category names based on predefined category keys
This cut function assigns category names based on predefined category keys

For example, the category range from 0 up to 8 is defined as a category with a label small. the category range from 8 up to12 is defined as a category with a label medium and so on. The bin parameter always needs to be one fewer than the label parameter.


bins = [0, 8, 12, float("inf")]
labels = ["small", "medium", "high"]
data["category"] = pd.cut(data["compression-ratio"], bins=bins, labels=labels)
data


This function loc is mostly used for filtering and finding the rows based on specific keys as single labels, a list of labels, or slicing with the labels and also setting the values. However, I use this function for replacing the data with different data based on specific keys or also setting the values for the new column.

Creating a new column category_num with specific data based on the specific key value in the row category
Creating a new column category_num with specific data based on the specific key value in the row category

For example, the Loc function locates the medium in the category column and set the value in the new category category_num value as 0.

If the data set contains more categories it might be more useful to use the map function.


data.loc[data["category"] == "medium", "category_num"] = 0
data.loc[data["category"] == "small", "category_num"] = 1
data.loc[data["category"] == "high", "category_num"] = 2


The map function sets the values based on the mapping set in the mapping dictionary. The map function set the values on the specified series. The specified series should have the same keys as the keys in the dictionary.

Creating a new column category_num with specific data based on the specific key value stored in a dictionary.
Creating a new column category_num with specific data based on the specific key value stored in a dictionary.

The dictionary in the labels_map contains the same keys as the keys in the column category. Values in the dictionary set the values in the new column category_num. The medium category is set to category_num as one.


labels_map = {"small": 0, "medium": 1, "high": 2}
data["category_num"] = data["category"].map(labels_map)
data

These functions are functions I wish I knew earlier. They might be super helpful with data processing and manipulating the datasets before doing further analysis or machine learning.



Have you learned anything new?

  • Yes, I have

  • No, I have not


 

Matěj Srna




Comentarios

Obtuvo 0 de 5 estrellas.
Aún no hay calificaciones

Agrega una calificación
bottom of page