In this article, we will discuss how to properly import data frames into Python through the utilization of various Python modules. Throughout this post, there are numerous links to online resources maintained by the module creators. These resources should be utilized as necessary, as they provide important options which may be useful in addressing additional aspects related to package functions.
Importing (.csv) Data as a Multidimensional Numpy Array
There may be instances in which you wish to have a data frame transformed and stored as a multi-dimensional array. The reason for perusing such an option would typically be necessitated due to a desire to produce a machine learning model. Models of this type require the aforementioned variable format.
Limitations exist pertaining to this import function. Primarily amongst such, is the functions inability to import data which contains non-numerical elements, and the inability of the function to import columns which contain non-existent entries.
Therefore, for the following function featured within the example to correctly perform its purpose, the data called therein, must not contain missing values, and must not contain non-numerical elements.
# Enable Numpy #
import numpy
# Specify the appropriate file path #
# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #
filepath = "C:\\Users\\Username\\Desktop\\PythonImportTest.csv"
# Create a variable to store the data #
# The "delimiter" option specifies the delimiter contained within the data file #
# The "skiprows" option indicates that the first row containing variable names will be omitted #
numpyex = numpy.loadtxt(filepath, delimiter=',', skiprows=1)
# Print the result of the data import process to the console #
print(numpyex)
Console Output:
[[ 83. 2036. 803. 544. 243. 28. 843. 46.]
[ 93. 2015. 804. 465. 296. 15. 815. 32.]
[ 49. 1967. 804. 430. 189. 47. 817. 46.]
[ 100. 1957. 802. 511. 256. 42. 561. 37.]
[ 22. 1925. 803. 529. 172. 96. 345. 32.]
[ 31. 1895. 810. 435. 194. 40. 861. 46.]
[ 94. 1889. 802. 503. 228. 7. 883. 46.]
[ 4. 1722. 802. 535. 260. 80. 300. 50.]
[ 25. 1715. 808. 437. 200. 77. 776. 37.]
[ 46. 1704. 809. 445. 310. 52. 410. 53.]
[ 15. 1646. 802. 502. 223. 79. 296. 31.]
[ 74. 1611. 800. 420. 200. 99. 808. 43.]
[ 79. 1429. 805. 504. 185. 67. 806. 50.]
[ 13. 1401. 801. 415. 283. 23. 235. 53.]
[ 4. 1334. 802. 484. 277. 79. 946. 37.]
[ 47. 1290. 807. 428. 171. 15. 481. 42.]
[ 49. 1274. 805. 406. 306. 12. 296. 34.]
[ 7. 1161. 803. 489. 298. 93. 381. 28.]
[ 93. 1132. 805. 415. 195. 31. 221. 40.]
[ 60. 1131. 804. 413. 185. 5. 308. 33.]]
Let us consider another example which demonstrates additional options.
# Enable Numpy #
import numpy
# Specify the appropriate file path #
# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #
filepath = "C:\\Users\\Username\\Desktop\\PythonImportTest.csv"
# Create a variable to store the data #
# The "delimiter" option specifies the delimiter contained within the data file #
# The "dtype" option indicates that the data file will consist of string variables only #
# The "skiprows" option indicates that the first row containing variable names will be omitted #
# The "usecols" option specifies which rows will be included within the input #
numpyex = numpy.loadtxt(filepath, delimiter=',', dtype=str, skiprows=1, usecols=[0,1,2,3])
# Print the result of the data import process to the console #
print(numpyex)
Console Output:
[[ 83. 2036. 803. 544.]
[ 93. 2015. 804. 465.]
[ 49. 1967. 804. 430.]
[ 100. 1957. 802. 511.]
[ 22. 1925. 803. 529.]
[ 31. 1895. 810. 435.]
[ 94. 1889. 802. 503.]
[ 4. 1722. 802. 535.]
[ 25. 1715. 808. 437.]
[ 46. 1704. 809. 445.]
[ 15. 1646. 802. 502.]
[ 74. 1611. 800. 420.]
[ 79. 1429. 805. 504.]
[ 13. 1401. 801. 415.]
[ 4. 1334. 802. 484.]
[ 47. 1290. 807. 428.]
[ 49. 1274. 805. 406.]
[ 7. 1161. 803. 489.]
[ 93. 1132. 805. 415.]
[ 60. 1131. 804. 413.]]
For more information pertaining to this function and its internal options:
Now, if you absolutely must import array data which contains both string data and numerical data, a different array function exists within the “numpy” package. This function also allows for columns which contain missing elements.
# Enable Numpy #
import numpy
# Specify the appropriate file path #
# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #
filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestII.csv"
# Create a variable to store the data #
# The "delimiter" option specifies the delimiter contained within the data file #
# The "skip_header" option indicates that the first row containing variable names will be omitted #
# The "dtype" option indicates that the data type of each element will be automatically decided #
# The "encoding" option specifies which encoding methodology should be employed when decoding the input file #
numpyex2 = numpy.genfromtxt(filepath, delimiter=',', skip_header=1, dtype=None, encoding=None)
# Print the result of the data import process to the console #
print(numpyex)
Console Output:
[( 83, 2036, 803, 544, 'BMW') ( 93, 2015, 804, 465, 'Volvo')
( 49, 1967, 804, 430, 'Jeep') (100, 1957, 802, 511, 'Subaru')
( 22, 1925, 803, 529, 'Mitsubishi') ( 31, 1895, 810, 435, '')
( 94, 1889, 802, 503, '') ( 4, 1722, 802, 535, '')
( 25, 1715, 808, 437, '') ( 46, 1704, 809, 445, 'Ford')
( 15, 1646, 802, 502, 'Chevy') ( 74, 1611, 800, 420, 'BMW')
( 79, 1429, 805, 504, 'Volvo') ( 13, 1401, 801, 415, 'Jeep')
( 4, 1334, 802, 484, 'Subaru') ( 47, 1290, 807, 428, 'Mitsubishi')
( 49, 1274, 805, 406, 'Toyota') ( 7, 1161, 803, 489, 'Lexus')
( 93, 1132, 805, 415, 'Nissan') ( 60, 1131, 804, 413, 'Honda')]
For more information pertaining to this function and its internal options:
Importing (.csv) Data as a Panda Data Frame
Typically, due to the common data variable type, and the traditional aspects of data integrity and presentation, you will most likely prefer to import data into Python through the utilization of functions inherit within the
"pandas" package.
# Enable Pandas
import pandas
# Specify the appropriate file path #
# Utilize "\\" instead of "\" to proactively prevent errors related to escape characters #
filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestII.csv"
# Create a variable to store the data #
pandadataframe = pandas.read_csv(filepath)
# Print the result of the data import process to the console #
print(pandadataframe)
Console Output:
VarA VarB VarC VarD VarE
0 83 2036 803 544 BMW
1 93 2015 804 465 Volvo
2 49 1967 804 430 Jeep
3 100 1957 802 511 Subaru
4 22 1925 803 529 Mitsubishi
5 31 1895 810 435 NaN
6 94 1889 802 503 NaN
7 4 1722 802 535 NaN
8 25 1715 808 437 NaN
9 46 1704 809 445 Ford
10 15 1646 802 502 Chevy
11 74 1611 800 420 BMW
12 79 1429 805 504 Volvo
13 13 1401 801 415 Jeep
14 4 1334 802 484 Subaru
15 47 1290 807 428 Mitsubishi
16 49 1274 805 406 Toyota
17 7 1161 803 489 Lexus
18 93 1132 805 415 Nissan
19 60 1131 804 413 Honda
If only the first seven rows of data were required, the following code could be utilized to accomplish this task:
# Import only the first seven rows of data from the example data frame #
pandadataframe = pandas.read_csv(filepath, nrows=7)
# Print the result of the data import process to the console #
print(pandadataframe)
Console Output:
VarA VarB VarC VarD VarE
0 83 2036 803 544 BMW
1 93 2015 804 465 Volvo
2 49 1967 804 430 Jeep
3 100 1957 802 511 Subaru
4 22 1925 803 529 Mitsubishi
5 31 1895 810 435 NaN
6 94 1889 802 503 NaN
For more information pertaining to this function and its internal options:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Importing (.xlsx) Data as a Panda Data Frame
There may be instances in which, you wish to import Microsoft Excel data into the Python coding platform. The following code will enable you to achieve such.
# Enable Pandas
import pandas
# Specify the appropriate file path #
filepath = "C:\\Users\\Username\\Desktop\\PythonImportTestIII.xlsx"
# Create a variable to store the data #
pandadataframe = pandas.ExcelFile(filepath)
# Print workbook spreadsheet names #
print(pandadataframe.sheet_names)
# Assign "Sheet1" to a variable #
# (This variable is specifying which sheet of the workbook you will be importing) #
pandadataframesheet1 = pandadataframe.parse('Sheet1')
# Print the result of the data import process to the console #
print(pandadataframesheet1)
Console Output:
['Sheet1']
VarA VarB VarC VarD VarE VarF
0 83 2036 803 544 BMW One
1 93 2015 804 465 Volvo NaN
2 49 1967 804 430 Jeep NaN
3 100 1957 802 511 Subaru One
4 22 1925 803 529 Mitsubishi One
5 31 1895 810 435 Toyota NaN
6 94 1889 802 503 Lexus NaN
7 4 1722 802 535 Nissan NaN
8 25 1715 808 437 Honda NaN
9 46 1704 809 445 Ford One
10 15 1646 802 502 Chevy NaN
11 74 1611 800 420 BMW NaN
12 79 1429 805 504 Volvo One
13 13 1401 801 415 Jeep NaN
14 4 1334 802 484 Subaru NaN
15 47 1290 807 428 Mitsubishi NaN
16 49 1274 805 406 Toyota One
17 7 1161 803 489 Lexus One
18 93 1132 805 415 Nissan NaN
19 60 1131 804 413 Honda NaN
To remove the “NaN” entries from the “VarF” column, you can utilize the following line of code:
# Replace "NaN" values with "N/A" #
pandadataframesheet1.fillna('N/A', inplace=True)
# Print sheet #
print(pandadataframesheet1)
Console Output:
['Sheet1']
VarA VarB VarC VarD VarE VarF
0 83 2036 803 544 BMW One
1 93 2015 804 465 Volvo N/A
2 49 1967 804 430 Jeep N/A
3 100 1957 802 511 Subaru One
4 22 1925 803 529 Mitsubishi One
5 31 1895 810 435 Toyota N/A
6 94 1889 802 503 Lexus N/A
7 4 1722 802 535 Nissan N/A
8 25 1715 808 437 Honda N/A
9 46 1704 809 445 Ford One
10 15 1646 802 502 Chevy N/A
11 74 1611 800 420 BMW N/A
12 79 1429 805 504 Volvo One
13 13 1401 801 415 Jeep N/A
14 4 1334 802 484 Subaru N/A
15 47 1290 807 428 Mitsubishi N/A
16 49 1274 805 406 Toyota One
17 7 1161 803 489 Lexus One
18 93 1132 805 415 Nissan N/A
19 60 1131 804 413 Honda N/A
Additionally, if you desired to create a new pandas data frame which contained only "
VarA" and "
VarB", the code would resemble:
pandasdataframevaravarb = pandadataframesheet1 [['VarA', 'VarB']]
print(pandasdataframevaravarb)
Console Output:
VarA VarB
0 83 2036
1 93 2015
2 49 1967
3 100 1957
4 22 1925
5 31 1895
6 94 1889
7 4 1722
8 25 1715
9 46 1704
10 15 1646
11 74 1611
12 79 1429
13 13 1401
14 4 1334
15 47 1290
16 49 1274
17 7 1161
18 93 1132
19 60 1131
For more information pertaining to this function and its internal options:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
Importing (.sas7bdat) Data as a Panda Data Frame
The following example presents a scenario in which a file within the SAS format is imported as a pandas data frame. I cannot provide exercise data as it pertains to this exercise as my SAS License has since expired.
# Enable Pandas #
import pandas
# Specify the appropriate file path #
filepath = "C:\\Users\\Username\\Desktop\\SASFile.sas7bdat"
# Create a variable to store the data #
pandadataframesas = pandas.read_sas(filepath)
# Print the result of the data import process to the console #
print(pandadataframesas)
With SAS imports, the possibility is always present that the data contained within certain variable columns may appear with a mysterious (b’) occurring prior to each entry.
For example:
b’ 11111
b’ 22222
b’ 33333
To rectify this issue, which is caused by encoding formats, utilize the subsequent code:
dataframename['variablename'] = dataframename['variablename'].str.decode('utf-8')
In a more realistic scenario, the code might resemble something similar to the following:
DataFrameA ['id'] = DataFrameA ['id'] .str.decode('utf-8')
For more information pertaining to this function and its internal options:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html#pandas.read_sas
Exporting Python Data
After you have sorted, edited, and analyzed your data, you’ll most likely want to export the finalized version of the data frame to an outside format. The code below will assist you with this task.
# Exporting Data to (.csv) Format #
# Option: 'sep' specifies the designation which will be utilized to separate the data file contents #
pandadataframe.to_csv("C:\\Users\\Username\\Desktop\\pandadataframe.csv", sep=',', encoding='utf-8')
# Exporting Data to (.xlsx) Format #
# Option: 'sheet_name' designates the name of the first sheet of the Excel workbook #
pandadataframesheet1.to_excel("C:\\Users\\Username\\Desktop\\pandadataframesheet1.xlsx", sheet_name='Sheet1')
For more information pertaining to this function and its internal options: