Reflections of a Data Scientist: (Python) Data Frame Maintenance

The topic of today's post is: Data Frame Maintenance within the Python platform, specifically pertaining to data imported through the utilization of the “pandas” package.

All of the exercises featured within this article require the following file to be successfully demonstrated:

PythonImportTestIII.csv

This file can be found within the exercise code and example data set repository:

GitHub Repository

Files are sorted based on article date.

Also, for any of these examples to work, you must be sure to include:

import pandas

import numpy

within the first initial few lines of your Python program code.

Be sure to re-import the data set after performing an example which modifies the underlying data structure.

Checking Data Integrity

After your data has been successfully imported into Python, you should check the integrity of the data structure to ensure that all of the original data was imported correctly. Listed below, are some of the commands that can be utilized to ensure that data integrity was maintained.

If the utilization of this command is infeasible due to the size of the data frame, you could instead utilize the head or tail commands.

The head command template is:

<DataFrameName>.head(<number of rows to display>)

Executing this command will display the first n number of rows contained within the data frame.

# Example: #

# Print the first 10 rows of the data set #

PythonImportTestIII.head(10)

The tail command template is:

<DataFrameName>.tail(<number of rows to display>)

Executing this demand will display the last n number of rows contained within the data frame.

# Example: #

# Print the last 5 rows of the data set #

PythonImportTestIII.tail(5)

Adding a List as a Column

For this example, we'll pretend that you wanted to add a new column in the form of a list, to an existing data frame.

# Add Column #

# Create List #

VarG = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

# Modify List into Panda Series #

VarG = pandas.Series(VarG)

# Add Column as Panda Series #

PythonImportTestIII['VarG'] = VarG.values

# Print Results #

print(PythonImportTestIII)

To demonstrate the scenario in which the list possesses a length which is less that observational size of the data frame:

# Add Column of Un-equal Length #

# Create List #

VarH = [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]

# Modify List into Panda Series #

newvar = pandas.Series(VarH)

# Rename Column within Data Frame #

newvar.name = 'VarH'

# Add Column to Data Frame #

PythonImportTestIII = pandas.concat([PythonImportTestIII, newvar], axis=1)

# Print Results #

print(PythonImportTestIII)

Adding an Observation to a Data Frame

In the case of adding an additional row (observation) to an already existent data frame, the following code can be utilized.

We must first code the entries that we wish to add in the same manner in which the initial data frame is encoded.

newobservation = pandas.DataFrame({'VarA': ['30'],

'VarB': ['1000'],

'VarC': ['833'],

'VarD': ['400']},

index = [20])

We will print the row observation to illustrate its structure.

print(newobservation)

To provide this addition to the existing data frame, the following code can be utilized:

PythonImportTestIII = pandas.concat([PythonImportTestIII, newobservation])

Again, we will print to illustrate the structure of the amended data frame:

print(PythonImportTestIII)

Changing a Column Name

Let's say, for example, that you are working with the data frame named: "PythonImportTestIII". For whatever reason, the first column of this particular data frame needs to be re-named. The code to accomplish this task is below:

DataFrame.rename(columns={'originalcolumnname':'newcolumnname'}, inplace = True)

So, if you wanted to change the name of the first column of “PythonImportTestIII" to, “DataBlog", the code would resemble:

PythonImportTestIII.rename(columns={'VarA':'DataBlog'}, inplace=True)

print(PythonImportTestIII)

Changing Column Variable Type

Now, let's say that you wanted to change the data type that is contained within a column of an existing data frame. Again, we will use "PythonImportTestIII" for our example.

This code will change a column variable to a "string" type:

PythonImportTestIII['VarA'] = PythonImportTestIII['VarA'].astype('str')

This code will change a column variable to an "integer" type:

PythonImportTestIII['VarA'] = PythonImportTestIII['VarA'].astype('int')

This code will change a column variable to a "float" type:

PythonImportTestIII['VarA'] = PythonImportTestIII['VarA'].astype('float')

To check the current variable types of variables contained within a data frame, utilize the following code:

PythonImportTestIII.dtypes

Re-Ordering Columns within a Data Frame

For example, if you were working on a data frame (“PythonImportTestIII”), with the column names of ("VarA", "VarB", "VarC", "VarD", "VarE", "VarF"), and you wanted to re-order the columns so that they were displayed such as ("VarF", "VarE", "VarA", "VarB", "VarC", "VarD") you could run the code:

PythonImportTestIII = PythonImportTestIII[['VarF', 'VarE', 'VarA', 'VarB', 'VarC', 'VarD']]

Removing Columns from a Data Frame

Assuming that we were still utilizing the same data frame as previously ("PythonImportTestIII"), and we desired to remove certain columns from it, the following code code be utilized:

# Remove Column Variables: "VarE" and "VarF" from PythonImportTestIII #

PythonImportTestIII = PythonImportTestIII.drop(columns=['VarE', 'VarF'])

print(PythonImportTestIII)

Removing Select Rows from a Data Frame

If we desired to instead, remove certain rows within a data frame, we could achieve such subsequent to determining which rows required removal.

# Remove Rows: "0" and "1" from PythonImportTestIII #

PythonImportTestIII = PythonImportTestIII.drop([0, 1])

Create a New Column Variable from Established Column Variable(s)

To create a new column variable from an already existent variable, the code resembles:

# Create a copy of variable: 'VarE', as a new variable: 'VarG' #

PythonImportTestIII['VarG'] = PythonImportTestIII['VarE']

To create a new column variable as a product of existing variables, the code resembles:

# Create a new variable: 'VarH', as the product value of 'VarA' and 'VarB' #

PythonImportTestIII['VarH'] = PythonImportTestIII['VarA'] * PythonImportTestIII['VarB']

This can similarly be achieved with the code:

PythonImportTestIII['VarH'] = PythonImportTestIII[PythonImportTestIII.columns[0]] + PythonImportTestIII[PythonImportTestIII.columns[1]]

Drop a Data Frame or Python Variable

There may arise an instance in which you desire to remove a previously designated variable, for example:

# Create List Variable: 'a' #

a = [0,1,2,3,4,5]

# Print 'a' to Console #

print(a)

# Delete Variable: 'a' #

del a

# Print 'a' to Console #

print(a)

Console Output:

NameError: name 'a' is not defined

In this case, you will notice the error which is displayed in lieu of the variable description. The variable 'a' is now free to be re-assigned as necessary.

Create a Data Frame without Importing Data

If there is ever the case that you wish to create a data frame from scratch, without importing a previously created data structure, the following code can be utilized:

# Create a new Data Frame #

sampledataframe = pandas.DataFrame({

'Column1': [0, 1, 2, 3],

'Column2': [4, 5, 6, 7]

})

# Print to console #

print(sampledataframe)

# This will also achieve a similar result #

# Create Data Variables #

a = [0, 1, 2, 3]

b = [4, 5, 6, 7]

# Create a new Data Frame #

sampledataframe0 = pandas.DataFrame({

'Column1': a,

'Column2': b

})

# Print to console #

print(sampledataframe0)

Stacking Data Frames

Perhaps you want to stack two data frames, one on top of the other. This can be achieved with the example code:

# Stack the Data Frame: "PythonImportTestIII" on top of itself #

PythonImportTestIIIConcat = pandas.concat([PythonImportTestIII, PythonImportTestIII])

# Print to console #

print(PythonImportTestIIIConcat)

If there were instances where variables from one data frame were not present in the other, a 'NaN' would indicate this discrepancy.

Using Conditionals to Create New Data Frame Variables

In the previous article: Pip and SQL ,we discussed how to download and appropriately utilize the wonderful 'pandasql' package. However, no working demonstration was provided within the entry.

There are many ways to conditionally utilize Python's pandas to create new variables and filter through variables based on conditions. However, I have found that the best way to achieve multi-variable query results is through the utilization of SQL emulation with the 'pandasql' package.

In this first scenario, we will be creating a new variable "VarG" and assigning it a value based on the following conditions:

If "VarD" is less than or equal to 450, then "VarG" will be assigned the value: "<= 450"

If "VarD" is greater than 450 and less than 500, then "VarG" will be assigned the value: "451-499"

If "VarD" is greater than or equal to 500, then "VarG" will be assigned the value: ">= 500"

To achieve this, we will be utilizing the following code:

# Requires the "pandasql" package to have been previously downloaded #

from pandasql import *

pysqldf = lambda q: sqldf(q, globals())

q = """

SELECT *,

CASE

WHEN (VarD <= 450) THEN '<= 450'

WHEN (VarD > 450 AND VarD < 500) THEN '451-499'

WHEN (VarD >= 500) THEN '>= 500'

ELSE 'UNKNOWN' END AS VarG

from PythonImportTestIII;

"""

df = pysqldf(q)

print(df)

PythonImportTestIII = df

# Print Data Frame to Console #

print(PythonImportTestIII)

In this next scenario, we will delete row entries which meet the following conditions:

If "VarD" is less than or equal to 450, then "VarG" will be assigned the value: "X"

If "VarD" is greater than or equal to 500, then "VarG" will be assigned the value: "X"

If "VarD" does not satisfy either of the prior conditions, then "VarG" will be assigned the value:" "

# Requires the "pandasql" package to have been previously downloaded #

pysqldf = lambda q: sqldf(q, globals())

q = """

SELECT *,

CASE

WHEN (VarD <= 450) THEN 'X'

WHEN (VarD >= 500) THEN 'X'

ELSE " " END AS VarG

from PythonImportTestIII;

"""

df = pysqldf(q)

PythonImportTestIII = df

# Print Data Frame to Console #

print(PythonImportTestIII)

# Filter Out Row Observations in which variable: 'VarG' equals 'X' #

PythonImportTestIIIFilter = PythonImportTestIII[PythonImportTestIII.VarG != 'X']

# Print Data Frame to Console #

print(PythonImportTestIIIFilter)

# Remove variable: 'VarG' in its entirety #

PythonImportTestIII = PythonImportTestIII.drop(columns=['VarG'])

# Print Data Frame to Console #

print(PythonImportTestIII)

Extracting Rows and Columns from a Data Frame

Finally, we arrive at the simplest demonstrable task, extracting data entries and variables from an imported data frame. If multiple conditions must be met as it pertains to specifying variable ranges, I would recommend utilizing the above examples to prepare the data frame prior to the extraction process.

Extracting Column Data

# Extract Columns by Variable Name #

ExtractedCols = PythonImportTestIII[['VarA', 'VarB']]

# Extract Columns by Variable Position #

ExtractedCols0 = PythonImportTestIII.iloc[:, 0:2]

# Print to Console #

print(ExtractedCols)

print(ExtractedCols0)

Extracting Row Data

# Extract Rows by Obervation Position #

ExtractedRows0 = PythonImportTestIII.iloc[0:5, :]

# Print to Console #

print(ExtractedRows0)

Reset Index

In our prior example demonstrating pandasql, we produced a new data set resembling:

VarA VarB VarC VarD VarE VarF

1 93 2015 804 465 Volvo None

14 4 1334 802 484 Subaru None

17 7 1161 803 489 Lexus One

As you may notice, the left-most column, the 'index' column, is now mis-labeled.

To correct this, we must rest the index values. This is accomplished through the utilization of the following code:

# Correct Index Values #

PythonImportTestIII = PythonImportTestIII.reset_index(drop = True)

# Print to Console #

print(PythonImportTestIII)

Console Output:

VarA VarB VarC VarD VarE VarF

0 93 2015 804 465 Volvo None

1 4 1334 802 484 Subaru None

2 7 1161 803 489 Lexus One

Sorting Data Frames

Now that you have all of your data clean and extracted, you may want to sort it. Below are the functions to accomplish this task, and the options available within each.

# Sort Data #

# Sort by variable: 'VarA' #

PythonImportTestIII = PythonImportTestIII.sort_values(by = ['VarA'])

print(PythonImportTestIII)

# Sort by variable: 'VarA' and 'VarB' #

PythonImportTestIII = PythonImportTestIII.sort_values(by = ['VarA', 'VarB'])

print(PythonImportTestIII)

# Sort by variable: 'VarA' (descending order) #

PythonImportTestIII = PythonImportTestIII.sort_values(by='VarA', ascending=False)

print(PythonImportTestIII)

# Sort by variable: 'VarA' (put NAs first) #

PythonImportTestIII = PythonImportTestIII.sort_values(by = 'VarA', na_position = 'first')

print(PythonImportTestIII)

Reflections of a Data Scientist

Saturday, September 1, 2018

(Python) Data Frame Maintenance

No comments:

Post a Comment