Extract Text processor
This processor extracts part of a value (text, a number, or a combination of both) into a new column. It is similar to the Replace via Regexp processor, except that instead of replacing the content of the original column, it creates a new column containing the extracted text. To use it, enclose the part to extract in parentheses in the regular expression; the text matched by that part is written to the new column.
Setting the processor
To set the parameters of the Extract Text processor, refer to the table below.
|Parameter|Description|Mandatory|
|Field|Field that contains the value(s) to extract|yes|
|Regular expression|Regular expression that determines which part of the values will be extracted. See http://en.wikipedia.org/wiki/Regular_expression for more details on how to write regular expressions. It is also possible to test regular expressions with an online debugger tool like Regex101.|yes|
Using the same example as for the Replace via Regexp processor (from a French zip code like 44100, keep only the area code 44), the Extract Text processor creates a new column containing the area code, whereas the Replace via Regexp processor would overwrite the original content.
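The zip code case can be sketched outside the platform with Python's re module, which accepts the same named-group notation. This is only an illustration: the field names zip_code and area_code, the pattern, and the helper function are assumptions made for this sketch, and the platform's own regular expression engine may differ in details.

```python
import re

# Assumed pattern: capture the first two digits of a five-digit zip code
# into a (hypothetical) new field called area_code.
PATTERN = re.compile(r"(?P<area_code>[0-9]{2})[0-9]{3}")

def extract_area_code(record):
    """Return a copy of the record with an area_code key added when the zip code matches."""
    new_record = dict(record)  # the original column is left untouched
    match = PATTERN.fullmatch(record.get("zip_code", ""))
    if match:
        new_record.update(match.groupdict())
    return new_record

print(extract_area_code({"zip_code": "44100"}))
# {'zip_code': '44100', 'area_code': '44'}
```

The original zip_code value is preserved and the extracted part lands in a separate key, which mirrors the difference from a replace-in-place approach.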
From a more technical point of view, this processor extracts an arbitrary pattern, expressed as a regular expression, out of a string using sub-matching.
The syntax of the sub-matching expression is the following:
(?P<NAME>REGEXP)
NAME is the name of the new field that will receive the result of the extraction. This field name can only contain letters, numbers, and underscores (special characters like accented letters or commas are not allowed).
REGEXP is the sub-match expression.
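Before saving an expression, it can help to check which new fields it would create and whether their names respect the naming rule above. The following is a minimal sketch using Python's re module; the helper name new_field_names is hypothetical, and the check assumes that "letters" means unaccented ASCII letters.

```python
import re

# Letters, digits and underscores only (assumed to mean ASCII letters).
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")

def new_field_names(expression):
    """List the named groups of an expression in order, rejecting disallowed names."""
    compiled = re.compile(expression)
    # groupindex maps each group name to its position in the expression
    names = sorted(compiled.groupindex, key=compiled.groupindex.get)
    for name in names:
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"invalid field name: {name!r}")
    return names

print(new_field_names(r"(?P<street_number>[0-9]+) (?P<street_name>.*)"))
# ['street_number', 'street_name']
```

Python itself accepts accented letters in group names, so the explicit check is what enforces the stricter rule described here.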
For example, let's assume that you want to extract a street name out of an address. That is, for the address
600 Pennsylvania Ave NW, Washington, DC 20500, États-Unis
you might want to extract the value Pennsylvania Ave NW into a field named street_name.
To do so, write the following expression:
[0-9]+ (?P<street_name>.*), .*, .*, .*
To also extract the street number into a field named street_number, extend the previous expression:
(?P<street_number>[0-9]+) (?P<street_name>.*), .*, .*, .*
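Both address expressions can be tried locally with Python's re module, which supports the same named-group notation. This is a sketch for experimenting, not a description of how the processor is implemented: the extract_fields helper is hypothetical, and it assumes the expression must match the whole value.

```python
import re

ADDRESS = "600 Pennsylvania Ave NW, Washington, DC 20500, États-Unis"

STREET_ONLY = r"[0-9]+ (?P<street_name>.*), .*, .*, .*"
NUMBER_AND_STREET = r"(?P<street_number>[0-9]+) (?P<street_name>.*), .*, .*, .*"

def extract_fields(expression, value):
    """Return a dict of new field names to extracted text, or an empty dict if no match."""
    match = re.fullmatch(expression, value)
    return match.groupdict() if match else {}

print(extract_fields(STREET_ONLY, ADDRESS))
# {'street_name': 'Pennsylvania Ave NW'}
print(extract_fields(NUMBER_AND_STREET, ADDRESS))
# {'street_number': '600', 'street_name': 'Pennsylvania Ave NW'}
```

Note that these expressions expect exactly three comma-separated parts after the street, so an address with a different number of commas would produce no extraction.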