Read Data From Continuously Updated CSV File

March 10, 2022

Read CSV (RapidMiner Studio Core)

Synopsis

This Operator reads an ExampleSet from the specified CSV file.

Description

CSV is an abbreviation for Comma-Separated Values. CSV files store data (both numerical and text) in plain-text form. All values corresponding to an Example are stored as one line in the CSV file. Values for different Attributes are separated by a separator character, and every row in the file uses the same separator. The term 'CSV' suggests that the Attribute values are separated by commas, but other separators can also be used.

The easiest way to import a CSV file is to use the Import Configuration Wizard from the Parameters panel. All parameters can also be set directly in the Parameters panel. For more details about the Operator, see the description of the parameters.

Please make sure that the CSV file is read correctly as an ExampleSet before building a Process that uses it.

Differentiation

There are many Read <source> Operators in the Data Access group and Files/Read sub-group. For example, there are Read Excel, Read URL, Read SPSS, Read XML and other Operators, which can read an ExampleSet from different file formats.

Input

file (File)
A CSV file can optionally be passed in as a file object. This can be created with Operators that have file output ports, such as the Read File Operator.

Output

output (Data Table)
This port delivers the ExampleSet created from the CSV file provided at the input port, imported through the Import Configuration Wizard or loaded from the path given to the csv file parameter.

Parameters

Import_Configuration_Wizard
This user-friendly wizard guides you through configuring this Operator to import the CSV file.
Range:
csv_file
The path of the CSV file is specified here. It can also be selected using the 'Choose a file' button.
Range:
column_separators
Column separators for CSV files can be specified here. They can also be provided as a regular expression. A good understanding of regular expressions can be developed by studying the description of the Select Attributes Operator and its tutorial Processes.
Range:
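To make the idea of a regular-expression separator concrete, here is a minimal Python sketch. It is only an analogy and not RapidMiner's implementation: the helper name split_columns is hypothetical, and the sketch ignores quoting.

```python
import re


def split_columns(line, separator_pattern):
    """Split one line into column values using a regular-expression separator.

    Hypothetical helper for illustration; quoting is deliberately ignored.
    """
    return re.split(separator_pattern, line)


# Accept either a comma or a semicolon as the separator.
print(split_columns("a,b;c,d", r"[,;]"))  # ['a', 'b', 'c', 'd']

# Treat any run of whitespace (spaces or tabs) as a single separator.
print(split_columns("a \t b", r"\s+"))  # ['a', 'b']
```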
trim_lines
This parameter indicates if lines should be trimmed (removal of empty spaces at the beginning and the end) before the column split is performed. This option might be problematic if TABs ('\t') are used as separators.
Range:
use_quotes
This parameter indicates if quotes should be regarded. Quotes can be used to store special characters like column separators. For example, if (,) is set as the column separator and (") is set as the quotes character, then a row (a,b,c,d) will be translated as 4 values for 4 columns. On the other hand, ("a,b,c,d") will be translated as a single column value a,b,c,d. If this parameter is set to false, the quotes character parameter and the escape character parameter cannot be defined.
Range:
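The quoting rule described for use_quotes can be tried out with Python's standard csv module. This is a sketch of the same idea in a different tool, not RapidMiner's own parser, and the function parse_line is a hypothetical helper.

```python
import csv


def parse_line(line, use_quotes=True):
    """Parse one CSV line with ',' as separator and '"' as quotes character."""
    # QUOTE_NONE makes the reader treat the quotes character as ordinary text.
    quoting = csv.QUOTE_MINIMAL if use_quotes else csv.QUOTE_NONE
    reader = csv.reader([line], delimiter=",", quotechar='"', quoting=quoting)
    return next(reader)


print(parse_line('a,b,c,d'))                      # ['a', 'b', 'c', 'd']
print(parse_line('"a,b,c,d"'))                    # ['a,b,c,d']
print(parse_line('"a,b,c,d"', use_quotes=False))  # ['"a', 'b', 'c', 'd"']
```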
quotes_character
This parameter defines the quotes character and is only available if use quotes is set to true.
Range:
escape_character
This parameter specifies the character used to escape the quotes and is only available if use quotes is set to true. For example, if (") is used as the quotes character and ('\') is used as the escape character, then ("yes") will be translated as (yes) and (\"yes\") will be translated as ("yes").
Range:
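The escape rule can be reproduced with Python's csv module, which has a comparable escapechar option. Again, this is an analogy under the assumption that both tools treat an escaped quote as a literal character; parse_escaped is a hypothetical helper name.

```python
import csv


def parse_escaped(line):
    """Parse one line using '"' as quotes character and '\\' as escape character."""
    reader = csv.reader(
        [line],
        delimiter=",",
        quotechar='"',
        escapechar="\\",
        doublequote=False,  # quotes are escaped with '\', not by doubling them
    )
    return next(reader)


print(parse_escaped('"yes"'))      # ['yes']    quotes are stripped
print(parse_escaped(r'\"yes\"'))   # ['"yes"']  escaped quotes are kept as text
```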
skip_comments
This parameter is used to ignore comments in the CSV file (if any). If this option is set to true, a comment character should be defined using the comment characters parameter.
Range:
comment_characters
This parameter is available if skip comments is set to true. Lines beginning with these characters are ignored. If this character is present in the middle of a line, anything that comes after it in that line is ignored. The comment character itself is also ignored.
Range:
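A small Python sketch of the comment handling described above. It assumes that every character in comment_chars counts as a comment character; both strip_comment and data_lines are hypothetical helpers, not part of RapidMiner.

```python
def strip_comment(line, comment_chars="#"):
    """Return the part of the line that precedes the first comment character."""
    cut = len(line)
    for ch in comment_chars:
        pos = line.find(ch)
        if pos != -1:
            cut = min(cut, pos)
    return line[:cut]


def data_lines(lines, comment_chars="#"):
    """Yield lines with comments removed, skipping lines that become empty."""
    for line in lines:
        content = strip_comment(line, comment_chars)
        if content.strip():
            yield content


sample = ["# header comment", "att1,att2 # row 1", "1,2"]
print(list(data_lines(sample)))  # ['att1,att2 ', '1,2']
```

Note that the sketch does not look inside quotes, so a comment character in a quoted value would also cut the line.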
parse_numbers
This parameter specifies whether numbers are parsed or not.
Range:
decimal_character
This character is used as the decimal character.
Range:
grouped_digits
This parameter decides whether grouped digits should be parsed or not. If this parameter is set to true, the grouping character parameter has to be specified.
Range:
grouping_character
This character is used as the grouping character. If this character is found between numbers, the numbers are combined and this character is ignored. For example, if "22-14" is present in the CSV file and "-" is set as the grouping character, then "2214" will be stored.
Range:
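The effect of the grouping and decimal characters can be sketched in a few lines of Python. The function parse_grouped_number is a hypothetical helper that only mirrors the behaviour described above.

```python
def parse_grouped_number(text, grouping_char=",", decimal_char="."):
    """Drop the grouping character, normalise the decimal character, parse as float."""
    cleaned = text.replace(grouping_char, "")
    if decimal_char != ".":
        cleaned = cleaned.replace(decimal_char, ".")
    return float(cleaned)


# The example from the text: '-' as grouping character.
print(parse_grouped_number("22-14", grouping_char="-"))  # 2214.0

# A European-style number: '.' groups digits, ',' is the decimal character.
print(parse_grouped_number("1.234,5", grouping_char=".", decimal_char=","))  # 1234.5
```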
infinity_string
This parameter can be set to parse a specific infinity representation (e.g. "Infinity"). If it is not set, the locale-specific infinity representation will be used.
Range: string
date_format
This parameter specifies the date and time format. Many predefined options exist, but users can also specify a new format. If text in a CSV file column matches this date format, that column is automatically converted to date type.

Some corrections are automatically made on invalid date values. For example, a value '32-March' will automatically be converted to '1-April'.

Columns containing values which cannot be interpreted as numbers will be interpreted as nominal, as long as they do not match the date and time pattern of the date format parameter. If they match, this column of the CSV file will be automatically parsed as date and the respective Attribute will be of type date.
Range:
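The automatic correction of invalid dates amounts to rolling the surplus days over into the next month. The following Python sketch shows that arithmetic; lenient_date is a hypothetical helper, and the year 2022 in the first call is an assumption, since the '32-March' example does not name one.

```python
from datetime import date, timedelta


def lenient_date(year, month, day):
    """Roll an out-of-range day forward into the following month(s)."""
    # Start at the first of the month and add the remaining days.
    return date(year, month, 1) + timedelta(days=day - 1)


print(lenient_date(2022, 3, 32))  # 2022-04-01, i.e. '32-March' becomes '1-April'
print(lenient_date(1876, 1, 32))  # 1876-02-01
```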
first_row_as_names
If this parameter is set to true, it is assumed that the first line of the CSV file has the names of the Attributes. If so, the Attributes are automatically named and the first line of the CSV file is not treated as a data line.
Range:
annotations
If the first row as names parameter is not set to true, annotations can be added using the 'Edit List' button of this parameter, which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. If row 0 is assigned a Name annotation, it is equivalent to setting the first row as names parameter to true. If you want to ignore a row, you can annotate it as Comment. Remember that the row number in this menu does not count commented lines.
Range:
time_zone
Users can select any time zone from the list of provided time zones.
Range:
locale
Users can select any locale from the list of provided locales.
Range:
encoding
Users can select any encoding from the list of provided encodings.
Range:
read_all_values_as_polynominal
This option allows you to disable the type handling for this operator. Every column will be read as a polynominal attribute.
Range:
data_set_meta_data_information
This parameter allows you to adjust or override the meta data of the CSV file. Column index, name, type and role can be specified here.

The Read CSV Operator automatically tries to determine an appropriate data type of the Attributes by reading the first few lines and checking the occurring values. Integer values are assigned the integer data type, real values the real data type. Values which cannot be interpreted as numbers are assigned the nominal data type, as long as they do not match the format of the date format parameter.

With the data set meta data information parameter, this automatic assignment can be adjusted or overwritten.
Range:
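The type guessing described above can be approximated as follows. This is a simplified sketch, not the Operator's actual algorithm: guess_type is a hypothetical function, it inspects whatever values it is given, and the date format string uses Python's strptime syntax rather than RapidMiner's.

```python
from datetime import datetime


def guess_type(values, date_format="%Y-%m-%d"):
    """Guess 'integer', 'real', 'date' or 'nominal' from sample column values."""

    def all_convert(convert):
        # True only if every sample value can be converted without error.
        try:
            for value in values:
                convert(value)
            return True
        except ValueError:
            return False

    if all_convert(int):
        return "integer"
    if all_convert(float):
        return "real"
    if all_convert(lambda v: datetime.strptime(v, date_format)):
        return "date"
    return "nominal"


print(guess_type(["1", "2", "3"]))               # integer
print(guess_type(["1.5", "2"]))                  # real
print(guess_type(["2021-01-05", "2022-03-10"]))  # date
print(guess_type(["yes", "no"]))                 # nominal
```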
read_not_matching_values_as_missings
If this parameter is set to true, values that do not match the expected value type are considered missing values and are replaced by '?'. For example, if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) in the CSV file is also read as a missing value.
Range:
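A minimal Python sketch of this rule for an integer column, using None where RapidMiner would show '?'. The function parse_integer_column is a hypothetical helper for illustration.

```python
def parse_integer_column(raw_values):
    """Parse strings as integers; non-matching values and '?' become None."""
    parsed = []
    for raw in raw_values:
        text = raw.strip()
        if text == "?":
            # A literal question mark is read as a missing value.
            parsed.append(None)
            continue
        try:
            parsed.append(int(text))
        except ValueError:
            # The value does not match the expected type: treat it as missing.
            parsed.append(None)
    return parsed


print(parse_integer_column(["12", "abc", "?", "7"]))  # [12, None, None, 7]
```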
data_management
This parameter determines how the data is represented internally. Users can select any option from the provided list.
Range:

Tutorial Processes

Read a CSV file

(Optional) Save the following text in a text file:

att1,att2,att3,att4 # row 1
80.6, yes , 1996.Jan.21 ,22-14 # row 2
12.43,"yes",1997.MAR.30,23-22 # row 3
13.5,\"no\",1998.AUG.22,23-14 # row 4
23.3,yes,1876.Jan.32,42-65# row 5
21.6,yes,2001.JUL.12,xyz # row 6
12.56,",_?",2002.SEP.18,15-90# row 7

This is a sample CSV file.

(Optional) You can load this with the given tutorial Process by providing its path in the csv file parameter or by using the 'Choose a file' button.

Run the Process and compare the results in the Results view with the CSV file. The Process performs the following actions:

'#' is defined as a comment character, so 'row {number}' is ignored in all rows. As the first row as names parameter is set to true, att1, att2, att3 and att4 are set as Attribute names. The Attribute att1 is set as real, att2 as polynominal, att3 as date and att4 as real. For Attribute att4, the '-' character is ignored in all rows because the grouped digits parameter is set to true and '-' is specified as the grouping character. In row 2, the white spaces at the start and end of values are ignored because the trim lines parameter is set to true. In row 3, the quotes are regarded because use quotes is set to true, so the content within the quotes is taken as the value for Attribute att2. In row 4, (\"no\") is taken as (no) in quotes, because the escape character is set to '\'. In row 5, the date value is automatically corrected from 'Jan.32' to 'Feb.1'. In row 6, an invalid real value for the Attribute att4 is replaced by '?' because the read not matching values as missings parameter is set to true. In row 7, quotes are used to retrieve special characters as values, including the column separator (,) and a question mark.

barrerahatand.blogspot.com

Source: https://docs.rapidminer.com/latest/studio/operators/data_access/files/read/read_csv.html

Barrera Hatand