
Streaming XML SAX parsing with Elixir


The tools

There are two big players in Elixir’s XML parsing ecosystem:

  • SweetXml, the “traditional” parser: you feed it a string or stream of XML content and it produces a structure of elements, where each element is SweetXml’s representation of an XML element (a minimal sketch follows this list).
  • Saxy, which is based on SAX parsing. You provide it with a handler module and some state. The module knows what to do when SAX events occur (such as :start_document, :start_element, and so on) while reading a string or stream of XML content, and it updates the provided state depending on those events.
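
For a quick feel of the SweetXml style, here is a minimal sketch; the tiny inline document and the atom keys are made up for illustration, and the Saxy handler style is developed in full later in this post:

import SweetXml

xml = "<rss><item><title>I'm a title</title><id>xx1</id></item></rss>"

xml
|> xpath(~x"//item"l, title: ~x"./title/text()"s, id: ~x"./id/text()"s)
# => [%{title: "I'm a title", id: "xx1"}]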

What I want to achieve

I want to read a huge XML file in which certain elements are repeated many times, and produce some kind of “iterator” over them. Something like this:

Given the XML file:

<rss>
  <item>
    <title>I'm a title</title>
    <id>xx1</id>
  </item>
  <item>
    <title>second title</title>
    <id>xx2</id>
    <extra>extra content</extra>
    <whatever>
      <some>thing</some>
      <some>thing2</some>
    </whatever>
  </item>
  <item>
    <extra>thing</extra>
  </item>
</rss>

I’d like to produce some iterator that, when iterated, produces this:

[
  {"title": "I'm a title", "id": "xx1"},
  {
    "title": "second title",
    "id": "xx2",
    "extra": "extra content",
    "whatever": {
      "some": ["thing", "thing2"]
    }
  },
  {"extra": "thing"}
]

What I don’t need

  • I don’t need to hold a structure representing the entire XML file in memory.
  • I don’t even need to hold the entire list of items in memory, because I can use some sort of “iterator”, meaning that the only thing I have to keep in memory is one item at a time.

What the tools offer me “out of the box”

Saxy is incredibly fast and performant, but it’s built around the idea that, as you read the XML file, you “fill” some state (with whatever you want, in whatever amount you want, but you fill it nevertheless).

In this scenario, I could “fill” the state with the list of items. That, of course, takes a lot less memory than holding the entire XML structure in memory. But it still ties the size of the in-memory list to the size of the XML file, which I don’t like, because it means that with a big enough file I can consume more memory than I’m allowed to.

SweetXml provides a function called stream_tags and, when you read what it does, it seems to hit the spot, because it promises just what I need: parse an XML document and, as it finds certain tags, stream the SweetXml representation of them, without building any in-memory structure representing the whole XML. So this should be all I need:

iex(1)> list_iterator = File.stream!("some_feed.xml")
...(1)> |> SweetXml.stream_tags!(:item, discard: [:item])

and that would be it. list_iterator is not the entire list but just an iterator over it, which means I don’t have to hold the entire list in memory. So, theoretically, "some_feed.xml" can be as big as I want, with no memory penalty. But …

It doesn’t work like that. I don’t know exactly why, but there is some accumulation going on, which means there is some memory hoarding going on, which means that for big enough XML files I will exceed my memory allowance.

What I finally did: “tricking” Saxy

The idea is that, even if Saxy “accumulates” something in some state, I “clean” that state as often as I need to during parsing, so it never ends up consuming a lot of memory.

Saxy can parse an XML file in several “partial” attempts (while accumulating into the same state across them), like this:

alias Saxy.Partial

{:ok, partial} = Partial.new(MyEventHandler, initial_state)
{:cont, partial} = Partial.parse(partial, "<foo>")
{:cont, partial} = Partial.parse(partial, "<bar></bar>")
{:cont, partial} = Partial.parse(partial, "</foo>")
{:ok, state} = Partial.terminate(partial)

Again, the trick is to make sure the state is “emptied” from time to time.


The gory details

The handler: what to do when SAX events occur

First, we have to build a module that knows how to update the state when SAX events occur during parsing:

defmodule XmlEventHandler do
  @moduledoc """
  Module to parse xml with lists of items based on events
  """

  @behaviour Saxy.Handler

  @type element :: %{
          name: binary,
          attributes: [{binary, binary}],
          childs: [element()],
          text: binary
        }

  @impl true
  def handle_event(:start_document, _prolog, _state),
    do: {:ok, %{items: [], current_element: nil, stack: nil}}

  def handle_event(:start_element, {"item", _attributes}, %{stack: nil} = state), do:
        {:ok,
         %{
           state
           | current_element: %{name: "item", text: "", attributes: [], childs: []},
             stack: ["item"]
         }}

  def handle_event(
        :start_element,
        {name, attributes},
        %{current_element: element, stack: stack} = state
      )
      when is_list(stack),
      do:
        {:ok,
         %{
           state
           | current_element: add_child(element, {name, attributes}, stack),
             stack: [name | stack]
         }}

  def handle_event(:end_element, "item", %{items: items, current_element: element} = state),
    do: {:ok, %{state | items: [element | items], stack: nil}}

  def handle_event(:end_element, name, %{stack: [name | stack]} = state),
    do: {:ok, %{state | stack: stack}}

  def handle_event(:characters, chars, %{stack: stack, current_element: element} = state)
      when is_list(stack) do
    case String.trim(chars) do
      "" ->
        {:ok, state}

      trimmed_chars ->
        {:ok, %{state | current_element: add_text(element, trimmed_chars, stack)}}
    end
  end

  def handle_event(:cdata, cdata, state), do: handle_event(:characters, cdata, state)
  def handle_event(_, _, state), do: {:ok, state}

  @spec add_child(element(), {binary, [{binary, binary}]}, [binary]) :: element()
  defp add_child(%{name: name, childs: childs} = element, {child_name, attributes}, [name]),
    do: Map.put(element, :childs, [build_element(child_name, attributes) | childs])

  defp add_child(%{childs: [last_child | childs]} = element, {name, attributes}, stack) do
    {_, new_stack} = List.pop_at(stack, -1)
    updated_child = add_child(last_child, {name, attributes}, new_stack)
    Map.put(element, :childs, [updated_child | childs])
  end

  @spec add_text(element(), binary, [binary]) :: element()
  defp add_text(%{name: name, text: ""} = element, added_text, [name]),
    do: Map.put(element, :text, added_text)

  defp add_text(%{name: name, text: text} = element, added_text, [name]),
    do: Map.put(element, :text, "#{text} #{added_text}")

  defp add_text(%{childs: [last_child | childs]} = element, added_text, stack) do
    {_, new_stack} = List.pop_at(stack, -1)
    updated_child = add_text(last_child, added_text, new_stack)
    Map.put(element, :childs, [updated_child | childs])
  end

  @spec build_element(binary, [{binary, binary}]) :: element()
  defp build_element(name, attributes),
    do: %{name: name, attributes: attributes, childs: [], text: ""}
end

As you can see:

  • The initial state provided to the parser is %{current_element: nil, stack: nil, items: []}. That is, the initial state contains an empty list of items.
  • When the handler finishes parsing an <item> element, a new item is added to the state.
  • So, at any point during the parsing, the state (under the items key) contains the list of items parsed so far. If we did nothing else, then after parsing the entire XML file we could fetch the complete list of items from the final state (see the sketch below). That would be fine for not-too-big XML files but, if the XML file is big enough, the accumulated list of items gets big too, and that means eating up a lot of memory!
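
For instance, this is roughly what that “do nothing else” approach would look like with the handler above (the file name is illustrative, and Saxy.parse_stream/4 is used here only to show the accumulation; it is not the solution we end up with):

# Parse the whole file, letting the handler keep every item in its state.
# The initial-state argument is ignored because handle_event(:start_document, ...)
# builds a fresh state.
{:ok, %{items: items}} =
  "some_feed.xml"
  |> File.stream!()
  |> Saxy.parse_stream(XmlEventHandler, nil)

# `items` now holds every parsed <item> (most recently parsed first), which is
# exactly the file-size-dependent memory usage we want to avoid.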

Cleaning the state while parsing the XML file, so it never gets too big

The output we want is a Stream, so that the only memory consumed is the memory needed to hold one item. In order to do that, we’ve got to “streamize” the XML parsing, which we can do thanks to the Saxy.Partial module. While we do that, we also take care of yielding already-processed items and removing them from the state. This way, we can be sure the state is never going to grow too much.
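
The code below leans on Stream.chunk_while/4, so it may help to first recall its contract with a toy example unrelated to XML: the chunk function returns {:cont, chunk, acc} to emit a chunk or {:cont, acc} to keep accumulating, and the after function flushes whatever is left at the end.

1..10
|> Stream.chunk_while(
  [],
  fn i, acc ->
    # emit a chunk every time three elements have been gathered
    if rem(i, 3) == 0, do: {:cont, Enum.reverse([i | acc]), []}, else: {:cont, [i | acc]}
  end,
  fn
    [] -> {:cont, []}
    acc -> {:cont, Enum.reverse(acc), []}
  end
)
|> Enum.to_list()
# => [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]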

defmodule XmlStreamer do
  # Turns a stream of raw XML chunks into a stream whose elements are lists of
  # already-parsed items, draining the handler state as the parsing advances.
  def to_elements_stream(xml_stream) do
    {:ok, partial} = Saxy.Partial.new(XmlEventHandler)

    Stream.chunk_while(xml_stream, partial, &maybe_items/2, &finalize/1)
  end

  def maybe_items(chunk, partial) do
    # Feed the next chunk to the partial parser, then drain whatever items have
    # accumulated so far; they become the next element of the output stream.
    with {:cont, partial} <- Saxy.Partial.parse(partial, chunk),
         {:ok, partial, items} <- fetch_items(partial) do
      if items == [], do: {:cont, partial}, else: {:cont, items, partial}
    else
      {:error, exception} when is_exception(exception) ->
        raise Saxy.ParseError.message(exception)

      {:error, reason} ->
        raise reason

      _ ->
        raise "Xml parsing error"
    end
  end

  def finalize(partial) do
    case Saxy.Partial.terminate(partial) do
      {:ok, %{items: []}} ->
        {:cont, %{}}

      {:ok, %{items: items}} ->
        {:cont, items, %{}}

      {:error, exception} when is_exception(exception) ->
        raise Saxy.ParseError.message(exception)
    end
  end

  # Reaches into the Saxy.Partial struct to grab the items accumulated in the
  # handler user state and resets that list to [], so the state never grows
  # with the size of the file.
  defp fetch_items(%{state: %{user_state: %{items: items} = user_state} = state} = partial) do
    partial = %{partial | state: %{state | user_state: %{user_state | items: []}}}
    {:ok, partial, items}
  end

  defp fetch_items(_), do: {:error, "Something wrong when processing sax events"}
end

As you can see, as it parses the XML it produces a stream in which every element is a list of a few items, and when those items are retrieved (via the fetch_items function) they are also removed from the state. Therefore the state never holds too many items, and therefore it never eats too much memory.
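
Putting it all together, consuming the stream could look like this (the file name is illustrative; Stream.flat_map/2 turns the lists of items back into single items, so we really do handle one item at a time):

"some_feed.xml"
|> File.stream!()
|> XmlStreamer.to_elements_stream()
# each stream element is a small list of items; flatten so we iterate item by item
|> Stream.flat_map(& &1)
|> Enum.each(&IO.inspect/1)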


Wow, it’s been quite hard to explain this (and I have to say, I’m not very sure I did it successfully). Anyway, the code is real, and I really hope it can help somebody.