XML files

You can use an XML (.xml) file as a source.

This platform creates records from an arbitrary XML structure by converting all elements at a specific depth (optionally filtered by tag) to a set of records. For each element converted to a record, attributes, enclosed tags, and content are converted to fields. Complex data inside fields is converted to a JSON representation containing both attributes and content.

Creation

For more information about adding a file source, see Retrieving a file.

Configuration

XML parameters
Name Description Usage
Parent tags

Number of parents to get attributes from

If the enclosing tags contain relevant attributes, use this option to add them to the records.

Enter the number of parent tags to get attributes from (for example, 2).
Name of the tags to be extracted If irrelevant tags are at the same depth as the extracted elements, use this option to only filter relevant tags. Enter the tag to extract (for example, item) in the Name of the tags to be extracted box. If you leave the box empty, all tags at the specified depth will be converted into records.
Tag depth Depth of the tags that must be converted to records Enter the depth of the repeated tag in the Tag depth box (for example, 3).
Extract filename
Name Description Usage
Extract filename Creates a new column with the name of the source file. By default, this option is toggled off. Toggle on this option to extract the file name in an additional column.

Technical specifications

Field creation

The policy for creating fields from an item is defined as shown in the following examples.

1
2
3
4
5
6
7
8
<rows>
  <data attribute="attribute value" other_attribute="other attribute value">
    <indicator>GDP per capita</indicator>
    <country>Andean Region</country>
    <decimal>0</decimal>
  </data>
  <data attribute="2nd data tag">Text only</data>
</rows>
attribute other_attribute indicator country decimal content
attribute value other attribute value GDP per capita Andean Region 0  
2nd data tag         Text only

JSON representation

Complex data inside fields is converted to JSON as shown in the following example.

1
2
3
4
5
6
7
8
9
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}

Examples

Example 1

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
<?xml version="1.0" encoding="utf-8"?>
<wb:rows xmlns:wb="http://www.worldbank.org">
  <wb:data>
    <wb:indicator id="6.0.GDPpc">GDP per capita (2005 USD)</wb:indicator>
    <wb:country id="L5">Andean Region</wb:country>
    <wb:date>2012</wb:date>
    <wb:value>10561.668936515</wb:value>
    <wb:decimal>0</wb:decimal>
  </wb:data>
  <wb:data>
    <wb:indicator id="6.0.GDPpc">GDP per capita (2005 USD)</wb:indicator>
    <wb:country id="L5">Andean Region</wb:country>
    <wb:date>2011</wb:date>
    <wb:value>10215.3319157514</wb:value>
    <wb:decimal>0</wb:decimal>
  </wb:data>
  <wb:data>
    <wb:indicator id="6.0.GDPpc">GDP per capita (2005 USD)</wb:indicator>
    <wb:country id="L5">Andean Region</wb:country>
    <wb:date>2010</wb:date>
    <wb:value>9711.85739310366</wb:value>
    <wb:decimal>0</wb:decimal>
  </wb:data>
</wb:rows>

In this example:

  • Tag depth is set to 2 because wb:data is at the second level of the XML tree (wb:rows/wb:data).

  • You do not need to filter tags out because all elements at this depth are records.

The resulting dataset looks like this:

wb:indicator wb:country wb:date wb:value wb:decimal
{"#text": "GDP per capita (2005 USD)", "@id": "6.0.GDPpc"} {"#text": "Andean Region", "@id": "L5"} 2005 8154.72913271721 0
{"#text": "GDP per capita (2005 USD)", "@id": "6.0.GDPpc"} {"#text": "Bolivia", "@id": "BO"} 2009 5152.46337890625 0
{"#text": "GDP per capita (2005 USD)", "@id": "6.0.GDPpc"} {"#text": "Bolivia", "@id": "BO"} 2006 4715.9892578125 0

Example 2

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
<?xml version="1.0" encoding="UTF-8"?>
<shoppingList>
    <id>abcdef123</id>
    <description>A shopping list</description>
    <basket>
        <itemCount>3</itemCount>
        <totalQuantity>19</totalQuantity>
        <item>
            <name>tomato</name>
            <quantity>10</quantity>
        </item>
        <item>
            <name>potato</name>
            <quantity>5</quantity>
        </item>
        <item>
            <name>banana</name>
            <quantity>4</quantity>
        </item>
    </basket>
</shoppingList>

In this example, the XML tree is complex. As a result, the automatic parameters detection cannot guess the proper depth. You must configure the source manually:

  • Tag depth must be set to 3 because the item node is at the third level of the XML tree (shoppingList/basket/item).

  • Name of the tags to be extracted must be set to item because itemCount and totalQuantity are also at the third level but not relevant.

The resulting dataset looks like this:

name quantity
potato 5
banana 4
tomato 10