Introduction to Pandas (Python)
A very brief explanation of Pandas
This post belongs to a series of blog posts about the Pandas library. Pandas is an easy open-source software library for data manipulation and analysis. It is built on top of the Python programming language. Pandas provides tools for reading and writing data between in-memory data structures and different formats such as CSV, JSON and Excel. It also provides tools for data cleaning, aggregation, and transformation. Pandas is a very powerful tool for working with Python and can be used in many different occasions.
In the scope those posts we will focus mainly on a business setting and our examples will match that logic. In this regard, we will utilize Pandas as an Excel alternative for fast data manipulation and automation of tasks. For example we may want to create or modify a sales report or raw data exported by our ERP. Those kind of automations are particularly useful for Sales Managers, Product Managers, Financial Analysts, Accountants and workers in similar disciplines.
The logic behind this series of posts is to help the average business workers (not programmers) make their everyday work a lot easier with the power of Python & Pandas. It also intends to be a quick “refresh” guide for those with some basic knowledge (or experience) that need to revise or remember again some bits of code.
Pandas Installation
In order to install Pandas (assuming we already have Python installed in our computer) we need to open the command prompt (or terminal in case of Linux) and type:
pip3 install pandas
Our first Excel import
Note: Everything after the character “#” is a comment and it is ignored by Python. Comments help those reading the code to make sense of the logic and better understand parts of the code. It is an essential part of programming. Below is the Python Code we need to write:
# Since Pandas is an external library we need to import it first.
import pandas as pd
# Read the first sheet of the Excel file, skipping the first 4 rows and last footer
df = pd.read_excel("file.xlsx", sheet_name=0, skiprows=4, skipfooter=1)
# Print the contents of the dataframe
print(df)
This code above will read the excel file named: “file.xlsx” using the read_excel function. It will then store the data in a Pandas DataFrame with the name: “df“.
The additional parameters we declare are:
- “sheet_name=0”: This will read the first sheet of our excel (in case we have many sheets; otherwise we can skip that part). We write “0” because in Python the first number is always zero.
- “skiprows=4”: In this example we skip the first 4 rows. Sometimes this is the header of a business report such as Report title, date of the report, etc.
- “skipfooter=1”: This will skip the last row. Sometimes this is a “Grand Totals” row or some kind of report summary.
Finally, we use “Print” to have Python print the contents of our DataFrame. This way, we can be sure that our import was successful.
I will stop at this point in order to avoid lengthy & difficult to read posts. In the next examples that will follow, we will see how to manipulate our data in various ways.