How to Read Csv With Pandas Datareader
CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files after analysis.
Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.
- Load CSV files to Python Pandas
- 1. File Extensions and File Types
- 2. Data Representation in CSV files
- Other Delimiters / Separators – TSV files
- Delimiters in Text Fields – Quotechar
- 3. Python – Paths, Folders, Files
- Finding your Python Path
- File Loading: Absolute and Relative Paths
- 4. Pandas CSV File Loading Errors
- Advanced Read CSV Files
- Specifying Data Types
- Skipping and Picking Rows and Columns From File
- Custom Missing Value Symbols
- CSV Format Advantages and Disadvantages
- Additional Reading
Load CSV files to Python Pandas
The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the "read_csv" function in Pandas:
    # Load the Pandas libraries with alias 'pd'
    import pandas as pd

    # Read data from file 'filename.csv'
    # (in the same directory that your python process is based)
    # Control delimiters, rows, column names with read_csv (see later)
    data = pd.read_csv("filename.csv")

    # Preview the first 5 lines of the loaded data
    data.head()
While this code seems simple, an understanding of a few fundamental concepts is required to fully grasp and debug the operation of the data loading process if you run into issues:
- Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
- Understanding how data is represented inside CSV files – if you open a CSV file, what does the data actually look like?
- Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
- CSV data formats and errors – common errors with the function.
Each of these topics is discussed below, and we finish this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.
1. File Extensions and File Types
The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.
- Data is stored on your computer in individual "files", or containers, each with a different name.
- Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
- Computers determine how to read files using the "file extension", that is, the code that follows the dot (".") in the filename.
- So, a filename is typically in the form "<random name>.<file extension>". Examples:
- project1.DOCX – a Microsoft Word file called Project1.
- shanes_file.TXT – a simple text file called shanes_file
- IMG_5673.JPG – An image file called IMG_5673.
- Other well known file types and extensions include: XLSX: Excel, PDF: Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music etc. See a complete list of extensions here.
- A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.
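As a small illustration of the name/extension split described above, here is a sketch using Python's standard `os.path.splitext`. The filenames are the examples from the list; the case-insensitive comparison is an assumption about how you might want to treat ".CSV" versus ".csv".

```python
import os

# Split each example filename into its name and its extension.
for filename in ["project1.DOCX", "shanes_file.TXT", "data.csv"]:
    stem, extension = os.path.splitext(filename)
    # Compare in lower case so that ".CSV" and ".csv" are treated alike.
    is_csv = extension.lower() == ".csv"
    print(f"{filename}: name={stem!r}, extension={extension!r}, csv={is_csv}")
```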
File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will do on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.
To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.
- In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options or File Explorer Options, as it is now called > View tab. In this tab, under Advanced Settings, you will see the option Hide extensions for known file types. Uncheck this option and click on Apply and OK.
- In Mac OS: Open Finder > In menu, click Finder > Preferences, Click Advanced, Select the checkbox for "Show all filename extensions".
2. Data Representation in CSV files
A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.
CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / press enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.
An example table data set and the corresponding CSV-format data is shown in the diagram below.
Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
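To make the representation concrete, here is a minimal sketch that feeds a few lines of CSV text straight into `read_csv`. The column names and values are made up for illustration, and `io.StringIO` is used only so the example does not need a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first row holds the column names,
# each following line is one row, and commas separate the columns.
csv_text = """name,age,city
Alice,34,Dublin
Bob,27,Cork
"""

# io.StringIO lets read_csv treat the string like an open text file.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

Saving the same text to a file with a .csv extension and passing the filename to `read_csv` should produce the same DataFrame.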
Other Delimiters / Separators – TSV files
The comma separation scheme is by far the most popular method of storing tabular data in text files.
The choice of the ',' comma character to delimit columns, however, is arbitrary, and can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.
When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
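A short sketch of the `sep` parameter, assuming two made-up in-memory snippets that hold the same table, one tab-separated and one semicolon-separated:

```python
import io
import pandas as pd

# The same hypothetical table, written with two different delimiters.
tsv_text = "name\tage\nAlice\t34\nBob\t27\n"
ssv_text = "name;age\nAlice;34\nBob;27\n"

# Only the sep argument changes between the two calls.
tab_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
semi_df = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both calls should yield identical DataFrames.
print(tab_df.equals(semi_df))
```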
Delimiters in Text Fields – Quotechar
One complexity in creating CSV files is if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to create these fields.
The quote character can be specified in Pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation marks ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.
In the example shown, a semicolon-delimited file, with quotation marks as a quotechar, is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split up into more columns.
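The behaviour can be sketched with a small in-memory example. The data here is invented; only the "NickName" column name is borrowed from the description above.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data. The NickName field itself
# contains a semicolon, so it is wrapped in quotation marks.
text = 'Name;NickName;Age\nRobert;"Bob; the Builder";41\n'

# quotechar='"' is the default; it is spelled out here for clarity.
df = pd.read_csv(io.StringIO(text), sep=";", quotechar='"')

# The quoted semicolon stays inside the field instead of starting a new column.
print(df.loc[0, "NickName"])
```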
3. Python – Paths, Folders, Files
When you specify a filename to Pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.
Finding your Python Path
Your Python path can be displayed using the built-in os module. The os module provides operating-system-dependent functionality to Python programs and scripts.
To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.
    # Find out your current working directory
    import os
    print(os.getcwd())
    # Out: /Users/shane/Documents/blog

    # Display all of the files found in your current working directory
    print(os.listdir(os.getcwd()))
    # Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']
In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python file open() function or the Pandas read_csv function.
Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
File Loading: Absolute and Relative Paths
When specifying file names to the read_csv function, you can supply either absolute or relative file paths.
- A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
- An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux)
It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.
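The difference between the two kinds of path can be checked with the standard `os.path` helpers. This is a minimal sketch: the `data/test_file.csv` location is the example path from the text, and the file does not need to exist for these calls to work.

```python
import os

# A relative path: resolved against the current working directory.
relative_path = os.path.join("data", "test_file.csv")

# The matching absolute path, built from the current working directory.
absolute_path = os.path.abspath(relative_path)

# '..' steps up to the parent of the working directory.
parent_path = os.path.join("..", "test_file.csv")

print(os.path.isabs(relative_path), relative_path)
print(os.path.isabs(absolute_path), absolute_path)
print(os.path.abspath(parent_path))
```

Printing the absolute form of a relative path is a quick way to see exactly which file `read_csv` would try to open.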
4. Pandas CSV File Loading Errors
The most common errors you'll get while loading data from CSV files into Pandas will be:
- FileNotFoundError: File b'filename.csv' does not exist
  A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
- UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
  A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'.
- pandas.parser.CParserError: Error tokenizing data.
  Parse Errors can be caused in unusual circumstances to do with your data format – try adding the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method. (Recent Pandas versions report this error as pandas.errors.ParserError.)
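One way to handle these three errors in code is sketched below. The `load_csv` helper is hypothetical (it is not part of Pandas), and the fallback to the 'latin-1' encoding is an assumption: it never fails to decode, but it may show the wrong characters if the file actually uses some other encoding.

```python
import os
import pandas as pd


def load_csv(path, encoding="utf-8"):
    """Hypothetical helper: returns a DataFrame, or None if the file is missing."""
    try:
        return pd.read_csv(path, encoding=encoding)
    except FileNotFoundError:
        # Show where Python was looking, which is usually the real problem.
        print(f"Not found: {path!r} (working directory: {os.getcwd()})")
        return None
    except UnicodeDecodeError:
        # Assumption: retry with latin-1 when the file is not valid UTF-8.
        return pd.read_csv(path, encoding="latin-1")
    except pd.errors.ParserError:
        # Retry with the slower but more tolerant Python parsing engine.
        return pd.read_csv(path, encoding=encoding, engine="python")


result = load_csv("file_that_does_not_exist.csv")
print(result)
```

If you know the real encoding of the file, passing it directly through the `encoding` parameter is more reliable than any fallback.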
Advanced Read CSV Files
There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:
Specifying Data Types
As mentioned earlier, CSV files do not contain any type information for data. Data types are inferred through examination of the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.
Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters.
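A minimal sketch of `dtype` and `parse_dates` working together. The column names and values are invented for illustration; the `employee_id` column shows why forcing `str` can matter, since leading zeros would be lost if the column were inferred as a number.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical data: IDs with leading zeros, ages, and ISO-formatted dates.
text = "employee_id,age,joined\n007,34,2019-03-01\n042,27,2020-11-15\n"

df = pd.read_csv(
    io.StringIO(text),
    dtype={"employee_id": str, "age": np.int32},  # explicit column types
    parse_dates=["joined"],                       # parse this column as dates
)

print(df.dtypes)
print(df)
```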
Skipping and Picking Rows and Columns From File
The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). Similarly, the usecols parameter can be used to specify which columns in the data to load.
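The three parameters can be tried on a small made-up table, as in this sketch. Note that the indices given to `skiprows` count lines of the file from zero, so line 0 is the header row.

```python
import io
import pandas as pd

# Hypothetical four-row table used for all three calls.
text = "id,name,score\n1,a,10\n2,b,20\n3,c,30\n4,d,40\n"

# Read only the first two data rows.
sample = pd.read_csv(io.StringIO(text), nrows=2)

# Skip file lines 1 and 3 (0-based; line 0 is the header), i.e. ids 1 and 3.
skipped = pd.read_csv(io.StringIO(text), skiprows=[1, 3])

# Load only the 'id' and 'score' columns.
subset = pd.read_csv(io.StringIO(text), usecols=["id", "score"])

print(sample, skipped, subset, sep="\n\n")
```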
Custom Missing Value Symbols
When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN are: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
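A short sketch of custom missing-value tokens, using invented data. The '.' and '??' tokens are added on top of the defaults, so the standard 'NA' is still recognised as missing.

```python
import io
import pandas as pd

# Hypothetical export where missing salaries appear as '??', '.', or 'NA'.
text = "name,salary\nAlice,50000\nBob,??\nCara,.\nDan,NA\n"

df = pd.read_csv(io.StringIO(text), na_values=[".", "??"])

# Count how many salary values were recognised as missing.
missing_count = int(df["salary"].isna().sum())
print(df)
print("missing salaries:", missing_count)
```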
    # Advanced CSV loading example
    import pandas as pd

    data = pd.read_csv(
        "data/files/complex_data_example.tsv",      # relative python path to subdirectory
        sep='\t',                                   # Tab-separated value file.
        quotechar="'",                              # single quote allowed as quote character
        dtype={"salary": int},                      # Parse the salary column as an integer
        usecols=['name', 'birth_date', 'salary'],   # Only load the three columns specified.
        parse_dates=['birth_date'],                 # Interpret the birth_date column as a date
        skiprows=10,                                # Skip the first 10 rows of the file
        na_values=['.', '??']                       # Take any '.' or '??' values as NA
    )
CSV Format Advantages and Disadvantages
As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will encounter as you load, store, and exchange data in CSV format:
On the plus side:
- CSV format is universal and the data can be loaded by almost any software.
- CSV files are simple to understand and debug with a basic text editor
- CSV files are quick to create and load into memory before analysis.
However, the CSV format has some negative sides:
- There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
- There's no formatting or layout information storable – things like fonts, borders, column width settings from Microsoft Excel will be lost.
- File encodings can become a problem if there are non-ASCII compatible characters in text fields.
- CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using zip compression.
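The point about compression can be checked with a quick sketch. It writes a made-up DataFrame to CSV twice, once plain and once compressed, and compares the file sizes. Note that this uses gzip rather than zip, chosen here only because it keeps the example short; the table contents and file names are assumptions.

```python
import os
import tempfile
import pandas as pd

# Hypothetical numeric table, large enough for compression to be visible.
df = pd.DataFrame({
    "id": range(10_000),
    "value": [x * 0.5 for x in range(10_000)],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "data.csv")
    gzip_path = os.path.join(tmp_dir, "data.csv.gz")

    # Write the DataFrame back to CSV, with and without compression.
    df.to_csv(plain_path, index=False)
    df.to_csv(gzip_path, index=False, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

    # read_csv infers gzip compression from the .gz extension.
    restored = pd.read_csv(gzip_path)

print(f"plain: {plain_size} bytes, gzip: {gzip_size} bytes")
```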
As an aside, in an effort to counter some of these disadvantages, two prominent data science developers in both the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.
Additional Reading
- Official Pandas documentation for the read_csv function.
- Python 3 Notes on file paths, working directories, and using the OS module.
- Datacamp Tutorial on loading CSV files, including some additional OS commands.
- PythonHow Loading CSV tutorial.
Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/