Pandas is an open-source data manipulation library for the Python programming language. It provides simple data structures for efficiently storing and manipulating data, including large datasets, and for reading and writing data in various formats. Pandas is widely used in data science, machine learning, and other fields where data analysis and manipulation are critical. Two data structures are central to Pandas:
- Series
- DataFrame
Note: Work in artificial intelligence often involves very large datasets, frequently stored as .csv files or in other formats. Pandas is commonly used to bring data from these different formats into the desired shape.
Installing Pandas
Pandas can be installed using the pip command, just like NumPy earlier. Since we have already learned how to use the pip command, it will not be discussed in detail here. Type the following command to install Pandas:
pip install pandas
If the installation succeeds, pip prints a message that begins with "Successfully installed" and includes the pandas version.
Hope everyone was able to install Pandas properly.
Pandas Series
A pandas Series is a one-dimensional array that can store any type of data. Earlier we saw that arrays have to hold values of the same data type, but a Series has no such restriction. Those of us who have worked with tabular data, for example in Microsoft Excel, will find pandas a little easier to understand. Each column of a table can be thought of as a Series. Let’s look at an example:
import pandas as pd
data = [1, 2, 'three']
pd_data = pd.Series(data)
print(pd_data[1])  # Access by index
print(pd_data)
Result:
2
0        1
1        2
2    three
dtype: object
In the second line we create a list whose values have different data types, and in the third line we convert it to a pandas Series. The numbers on the left side of the result are the index. We can access the values of a Series by index, and these indices can be given labels if desired.
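To see what mixing data types means in practice, the short sketch below compares a Series of integers with the mixed Series from the example. The variable names numbers and mixed are only illustrative.

```python
import pandas as pd

# All values are integers, so pandas picks an integer dtype.
numbers = pd.Series([1, 2, 3])

# Mixing integers and a string falls back to the generic "object" dtype.
mixed = pd.Series([1, 2, 'three'])

print(numbers.dtype)
print(mixed.dtype)
```

The object dtype is what allows different types to sit side by side, usually at the cost of slower numeric operations.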
Label
To create labels, pass a list of labels, one per value, to the index argument.
import pandas as pd
data = [1, 2, 'three']
pd_data = pd.Series(data, index=['First','Second','Third'])
print(pd_data)
Result:
First         1
Second        2
Third     three
dtype: object
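Once labels are set, values can be looked up by label as well as by position. The sketch below shows three equivalent ways to read the second value. The variable names by_label, by_loc and by_position are only illustrative.

```python
import pandas as pd

pd_data = pd.Series([1, 2, 'three'], index=['First', 'Second', 'Third'])

by_label = pd_data['Second']     # look up by label
by_loc = pd_data.loc['Second']   # explicit label-based lookup
by_position = pd_data.iloc[1]    # explicit position-based lookup

print(by_label, by_loc, by_position)
```

Using loc and iloc makes it explicit whether a label or a position is meant, which avoids ambiguity when the labels are themselves integers.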
DataFrame
A pandas DataFrame is a two-dimensional data structure, i.e. it has rows and columns. If we think of a table of data, the entire table is a DataFrame. Let’s look at an example:
import pandas as pd
data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]
}
pd_data = pd.DataFrame(data)
print(pd_data)
Result:
  Course ID   GPA
0     CSE10  4.00
1     CSE20  3.68
2     CSE30  3.55
3     CSE40  3.98
Since a DataFrame is two-dimensional, we built it from a dictionary named data in which each key is a column name and each list holds the values of that column.
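The relationship between a DataFrame and a Series can be checked directly: selecting a single column returns a Series. A minimal sketch using the same data, where the variable name gpa is only illustrative:

```python
import pandas as pd

data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]}
pd_data = pd.DataFrame(data)

gpa = pd_data["GPA"]         # selecting one column returns a Series

print(type(gpa).__name__)    # Series
print(pd_data.shape)         # (4, 2): four rows, two columns
print(gpa.mean())            # average of the GPA column
```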
Accessing a specific row
Earlier we learned that a pandas Series corresponds to a column. To work with a row instead, we can use the loc indexer, which takes a row label in square brackets. Let us see this using the example above:
import pandas as pd
data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]
}
pd_data = pd.DataFrame(data)
print(pd_data.loc[1])
Result:
Course ID    CSE20
GPA           3.68
Name: 1, dtype: object
That means we have accessed the second row, whose index label is 1 because the default index starts at 0.
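The loc indexer also accepts a list of labels or a row and column pair, and iloc selects by position instead of label. The following sketch shows these variations on the same data. The variable names rows, last and cell are only illustrative.

```python
import pandas as pd

data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]}
pd_data = pd.DataFrame(data)

rows = pd_data.loc[[0, 2]]     # list of labels: a DataFrame with two rows
last = pd_data.iloc[-1]        # position-based: the last row as a Series
cell = pd_data.loc[1, "GPA"]   # row label and column name: a single value

print(rows)
print(last)
print(cell)
```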
Working with CSV files
I’m assuming you don’t have a .csv file on your computer, so we’ll create one. I am using the DataFrame above for the sake of understanding.
import pandas as pd
data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]
}
pd_data = pd.DataFrame(data)
pd_data.to_csv('DataFrame.csv', index=False)
Run this program and you will see that a file called DataFrame.csv is created in the directory the program was run from. If you open this file, you will see that it contains the column names followed by the values of pd_data. Here the pandas DataFrame is written to a CSV file using the to_csv method. index is set to False because we do not want the index values saved to the file. You can also try omitting it to see the difference.
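The effect of the index argument can be inspected without opening the file. When to_csv is called without a file path it returns the CSV text as a string, so the sketch below compares the header row of both variants. The variable names with_index and without_index are only illustrative.

```python
import pandas as pd

data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]}
pd_data = pd.DataFrame(data)

# With no file path, to_csv returns the CSV text instead of writing a file.
with_index = pd_data.to_csv()
without_index = pd_data.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty index column
print(without_index.splitlines()[0])  # header holds only the column names
```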
Since we now have a .csv file in our directory, we can easily read it back. For this I will use the read_csv() function.
import pandas as pd
pd_data = pd.read_csv('DataFrame.csv')
print(pd_data)
Result:
  Course ID   GPA
0     CSE10  4.00
1     CSE20  3.68
2     CSE30  3.55
3     CSE40  3.98
Note: The JSON format is also commonly used in many applications. The read_json function is used to read JSON data.
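As a rough illustration of read_json, the sketch below parses a small JSON string instead of a file. The JSON text is made up for this example, and io.StringIO is used so that the string can be passed as a file-like object.

```python
import io
import pandas as pd

# Hypothetical JSON data: a list of records, one per course.
json_text = '[{"Course ID": "CSE10", "GPA": 4.0}, {"Course ID": "CSE20", "GPA": 3.68}]'

# orient="records" states that the JSON is a list of row objects.
pd_data = pd.read_json(io.StringIO(json_text), orient="records")
print(pd_data)
```

For a real file, a path such as a hypothetical 'DataFrame.json' would be passed in place of the StringIO object.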
Loop
Values in a pandas DataFrame can be accessed in a loop, and a conditional statement inside the loop can then be used to change only the values that meet a condition.
import pandas as pd
pd_data = pd.read_csv('DataFrame.csv')
for x in pd_data.index:
    # Round a GPA above 3.9 up to 4
    if pd_data.loc[x, "GPA"] > 3.9:
        pd_data.loc[x, "GPA"] = 4
print(pd_data)
Result:
  Course ID   GPA
0     CSE10  4.00
1     CSE20  3.68
2     CSE30  3.55
3     CSE40  4.00
In the third line, the loop runs over the index of the DataFrame, and the condition in the fourth line says that if the GPA is greater than 3.9, it is rounded up to 4 as a grace. As a result, the GPA in the last row of the output changes from 3.98 to 4.00.
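The same update can usually be written without an explicit loop by passing a boolean condition to loc. This is an alternative to the loop, not a replacement for it, and the sketch rebuilds the data in memory so it runs without the .csv file.

```python
import pandas as pd

data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]}
pd_data = pd.DataFrame(data)

# Select every row where GPA is above 3.9 and set its GPA to 4 in one step.
pd_data.loc[pd_data["GPA"] > 3.9, "GPA"] = 4
print(pd_data)
```

On large tables this form tends to be much faster than looping over the index, because the comparison and assignment are applied to the whole column at once.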
Data Manipulation in Python
Let’s take a look at some of the methods available for data manipulation in Pandas. Pre-processing is usually required when working on large tables. You can create a large Excel file to practice these methods. A small file is used here so that the results are easy to check by eye.
drop()
The drop() method removes rows or columns from the DataFrame by label and returns a new DataFrame. The axis argument indicates what to remove: 0 for rows and 1 for columns.
import pandas as pd
pd_data = pd.read_csv('DataFrame.csv')
new_data = pd_data.drop('GPA', axis=1)
print(new_data)
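The example above removes a column. The sketch below shows the row case with axis=0, along with the columns keyword as an alternative spelling of axis=1. The variable names without_first and without_gpa are only illustrative.

```python
import pandas as pd

data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]}
pd_data = pd.DataFrame(data)

without_first = pd_data.drop(0, axis=0)     # remove the row labelled 0
without_gpa = pd_data.drop(columns="GPA")   # same effect as drop('GPA', axis=1)

print(without_first)
print(without_gpa)
print(pd_data.shape)   # the original DataFrame is left unchanged
```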
dropna()
The dropna() method returns a new DataFrame without the rows that contain a missing value in any column.
import pandas as pd
pd_data = pd.read_csv('DataFrame.csv')
new_data = pd_data.dropna()
print(new_data)
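The file created earlier has no missing values, so dropna() returns it unchanged. To see the method take effect, the sketch below uses a small made-up DataFrame with one missing GPA. The variable name cleaned is only illustrative.

```python
import pandas as pd

# Hypothetical data: the GPA for CSE20 is missing.
data = {"Course ID": ['CSE10', 'CSE20', 'CSE30'],
        "GPA": [4.0, None, 3.55]}
pd_data = pd.DataFrame(data)

cleaned = pd_data.dropna()   # the row with the missing GPA is removed
print(cleaned)
```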
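duplicated()
The duplicated() method returns a Series of True and False values that marks each row that repeats an earlier row.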
import pandas as pd
pd_data = pd.read_csv('DataFrame.csv')
new_data = pd_data.duplicated()
print(new_data)
Note: Duplicate rows can be removed using the drop_duplicates() method.
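Since the file created earlier contains no repeated rows, the sketch below uses made-up data with one duplicate to show duplicated() and drop_duplicates() together. The variable names flags and unique_rows are only illustrative.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one.
data = {"Course ID": ['CSE10', 'CSE20', 'CSE10'],
        "GPA": [4.0, 3.68, 4.0]}
pd_data = pd.DataFrame(data)

flags = pd_data.duplicated()             # True only for the repeated row
unique_rows = pd_data.drop_duplicates()  # keeps the first occurrence

print(flags)
print(unique_rows)
```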
fillna()
The fillna() method returns a new DataFrame in which every missing value is replaced with the given value.
import pandas as pd
pd_data = pd.read_csv('DataFrame.csv')
new_data = pd_data.fillna(0)
print(new_data)
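As with dropna(), the effect is only visible when values are actually missing. The sketch below fills a made-up gap, first with 0 and then with the column mean, which is one common choice among several. The variable names filled_zero and filled_mean are only illustrative.

```python
import pandas as pd

# Hypothetical data: the GPA for CSE20 is missing.
data = {"Course ID": ['CSE10', 'CSE20', 'CSE30'],
        "GPA": [4.0, None, 3.55]}
pd_data = pd.DataFrame(data)

filled_zero = pd_data.fillna(0)   # replace every missing value with 0

# Replace the missing GPA with the mean of the known GPAs instead.
filled_mean = pd_data.fillna({"GPA": pd_data["GPA"].mean()})

print(filled_zero)
print(filled_mean)
```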
to_string()
This method renders the entire DataFrame as a single string laid out as a text table.
import pandas as pd
pd_data = pd.read_csv('DataFrame.csv')
new_data = pd_data.to_string()
print(new_data)
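A quick way to confirm what to_string() returns is to check the type of the result. The sketch below also passes index=False, which leaves the index column out of the text. The variable names text and no_index are only illustrative.

```python
import pandas as pd

data = {"Course ID": ['CSE10', 'CSE20', 'CSE30', 'CSE40'],
        "GPA": [4, 3.68, 3.55, 3.98]}
pd_data = pd.DataFrame(data)

text = pd_data.to_string()                # the whole table as one string
no_index = pd_data.to_string(index=False)  # same table without the index

print(type(text).__name__)   # str
print(no_index)
```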