Gray Soft / Tags / Parsing

23 MAY 2015
Rich Methods

Some APIs provide collections of dirt simple methods that just do one little thing.

This approach is less common in Ruby though, especially in the core and standard library of the language itself. Ruby often gives us rich methods with lots of switches we can toggle and half-hidden behaviors.

Let's look at some examples of what I am talking about.

Get a Line at a Time

I suspect most Rubyists have used gets() to read lines of input from some kind of IO. Here's the basic usage:
```
>> require "stringio"
=> true
>> f = StringIO.new(<<END_STR)
<xml>
  <tags>Content</tags>
</xml>
END_STR
=> #<StringIO:0x007fd5a264fa08>
>> f.gets
=> "<xml>\n"
>> f.gets
=> "  <tags>Content</tags>\n"
```
I didn't want to mess with external files for these trivial examples, so I just loaded StringIO from the standard library. It allows us to wrap a simple String (defined in this example using the heredoc syntax) in the IO interface. In other words, I'm calling gets() here for a String just as I could with a File or $stdin.
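
As a rough sketch of the kind of switches gets() has, here are a few of its documented argument forms applied to the same input. The variable names are only for illustration, and this is not necessarily the set of features the full article covers.

```ruby
require "stringio"

xml = "<xml>\n  <tags>Content</tags>\n</xml>\n"
f   = StringIO.new(xml)

# A String argument replaces the default line separator.
first_tag = f.gets(">")

# An Integer argument caps how many bytes are read from the current line.
f.rewind
first_three = f.gets(3)

# A nil separator means "no separator": read everything that is left.
f.rewind
everything = f.gets(nil)
```

The same calls work on a File or $stdin, since StringIO mirrors the IO interface.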
In: Ruby Voodoo | Tags: Hidden Features & Parsing

19 SEP 2014

"You can't parse [X]HTML with regex."

The only explanation I'll give for the following code is to provide this link to my favorite Stack Overflow answer.

```
#!/usr/bin/env ruby -w

require "open-uri"

URL    = "http://stackoverflow.com/questions/1732348/" +
         "regex-match-open-tags-except-xhtml-self-contained-tags"
PARSER = %r{
  (?<doctype_declaration>
    <!DOCTYPE\b (?<doctype> [^>]* ) >
  ){0}
  (?<comment>
    <!-- .*? -->
  ){0}

  (?<script_tag>
    < \s* (?<tag_name> script ) \s* (?<attributes> [^>]* > )
      (?<script> .*? )
    < \s* / \s* script \s* >
  ){0}
  (?<self_closed_tag>
    < \s* (?<tag_name> \w+ ) \s* (?<attributes> [^>]* / \s* > )
  ){0}
  (?<unclosed_tag>
    < \s*
    (?<tag_name> link | meta | br | input | hr | img ) \b
    \s*
    (?<attributes> [^>]* > )
  ){0}
  (?<open_tag>
    < \s* (?<tag_name> \w+ ) \s* (?<attributes> [^>]* > )
  ){0}
  (?<close_tag>
    < \s* / \s* (?<tag_name> \w+ ) \s* >
  ){0}

  (?<attribute>
    (?<attribute_name> [-\w]+ )
    (?: \s* = \s* (?<attribute_value> "[^"]*" | '[^']*' | [^>\s]+ ) )? \s*
  ){0}
  (?<attribute_list>
    \g<attribute>
    (?= [^>]* > \z )  # attributes keep a trailing > to disambiguate from text
  ){0}

  (?<text>
    (?! [^<]* /?\s*> \z )  # a guard to prevent this from parsing attributes
    [^<]+
  ){0}

  \G
  (?:
    \g<doctype_declaration>
    |
    \g<comment>
    |
    \g<script_tag>
    |
    \g<self_closed_tag>
    |
    \g<unclosed_tag>
    |
    \g<open_tag>
    |
    \g<attribute_list>
    |
    \g<close_tag>
    |
    \g<text>
  )
  \s*
}mix

def parse(html)
  stack = [{attributes: [ ], contents: [ ], name: :root}]
  loop do
    html.sub!(PARSER, "") or break
    if $~[:doctype_declaration]
      add_to_tree(stack.last, "DOCTYPE", $~[:doctype].strip)
    elsif $~[:script_tag]
      add_to_stack(stack, $~[:tag_name], $~[:attributes], $~[:script])
    elsif $~[:self_closed_tag] || $~[:unclosed_tag] || $~[:open_tag]
      add_to_stack(stack, $~[:tag_name], $~[:attributes], "", $~[:open_tag])
    elsif $~[:close_tag]
      stack.pop
    elsif $~[:text]
      stack.last[:contents] << $~[:text]
    end
  end
  stack.pop
end

def add_to_tree(branch, name, value)
  if branch.include?(name)
    branch[name]  = [branch[name]] unless branch[name].is_a?(Array)
    branch[name] << value
  else
    branch[name] = value
  end
end

def add_to_stack(stack, tag_name, attributes_html, contents, open = false)
  tag = { attributes: parse_attributes(attributes_html),
          contents:   [contents].reject(&:empty?),
          name:       tag_name }
  add_to_tree(stack.last, tag_name, tag)
  stack.last[:contents] << tag
  stack                 << tag if open
end

def parse_attributes(attributes_html)
  attributes = { }
  loop do
    attributes_html.sub!(PARSER, "") or break
    add_to_tree(
      attributes,
      $~[:attribute_name],
      ($~[:attribute_value] || $~[:attribute_name]).sub(/\A(["'])(.*)\1\z/, '\2')
    )
  end
  attributes
end

def convert_to_bbcode(node)
  if node.is_a?(Hash)
    name = node[:name].sub(/\Astrike\z/, "s")
    "[#{name}]#{node[:contents].map { |c| send(__method__, c) }.join}[/#{name}]"
  else
    node
  end
end

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
html = URI.open(URL, &:read).strip
ast  = parse(html)
puts ast["html"]["body"]["div"]
  .find { |div| div[:attributes]["class"] == "container"      }["div"]
  .find { |div| div[:attributes]["id"]    == "content"        }["div"]["div"]
  .find { |div| div[:attributes]["id"]    == "mainbar"        }["div"]
  .find { |div| div[:attributes]["id"]    == "answers"        }["div"]
  .find { |div| div[:attributes]["id"]    == "answer-1732454" }["table"]["tr"]
  .first["td"]
  .find { |div| div[:attributes]["class"] == "answercell"     }["div"]["p"]
  .first[:contents]
  .map(&method(:convert_to_bbcode))  # to reach a wider audience
  .join
```
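
For readers who have not met the `(?<name> ... ){0}` and `\g<name>` pair that PARSER leans on, here is a much smaller sketch of the same idiom. The DATE pattern and its group names are made up for this illustration and are not part of the script above.

```ruby
# A group quantified with {0} matches nothing where it is written;
# it only gives a name to a subpattern that \g<name> can call later.
DATE = %r{
  (?<year>  \d{4} ){0}
  (?<month> \d{2} ){0}
  (?<day>   \d{2} ){0}

  \A \g<year> - \g<month> - \g<day> \z
}x

match = DATE.match("2014-09-19")
year  = match[:year]   # captures made through \g<...> are still available by name
month = match[:month]
day   = match[:day]
```

This is why parse() can ask `$~[:open_tag]` or `$~[:tag_name]` after each `sub!`: whichever named subpattern was called during the match leaves its capture behind.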

11 NOV 2011

Doing it Wrong

Continuing with my Breaking All of the Rules series, I want to peek into several little areas where I've been caught doing the wrong thing. I'm a rule breaker and I'm determined to take someone down with me!

My Forbidden Parser

In one application, I work with an API that hands me very simple data like this:

```
<emails>
  <email>user1@example.com</email>
  <email>user2@example.com</email>
  <email>user3@example.com</email>
  …
</emails>
```

Now I need to make a dirty confession: I parsed this with a Regular Expression.

I know, I know. We should never parse HTML or XML with a Regular Expression. If you don't believe me, just take a moment to actually read that response. Yikes!

Oh and you shouldn't validate emails with a Regular Expression. Oops. We're talking about at least two violations here.

But it gets worse.

You may be thinking I rolled a little parser based on Regular Expressions. That might look like this:

```
#!/usr/bin/env ruby -w

require "strscan"

class EmailParser
  def initialize(data)
    @scanner = StringScanner.new(data)
  end

  def parse(&block)
    parse_emails(&block)
  end

  private

  def parse_emails(&block)
    @scanner.scan(%r{\s*<emails>\s*}) or fail "Failed to match list start"
    loop do
      parse_email(&block) or break
    end
    @scanner.scan(%r{\s*</emails>}) or fail "Failed to match list end"
  end

  def parse_email(&block)
    if @scanner.scan(%r{<email>\s*})
      if email = @scanner.scan_until(%r{</email>\s*})
        block[email.strip[0..-9].strip]
        return true
      else
        fail "Failed to match email end"
      end
    end
    false
  end
end

EmailParser.new(ARGF.read).parse do |email|
  puts email
end
```
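
The `[0..-9]` slice in parse_email() is easy to misread, so here is a small sketch of that one step on its own. It replays the same StringScanner calls against a single hard-coded element, which is an input chosen just for this illustration.

```ruby
require "strscan"

scanner = StringScanner.new("<email>user1@example.com</email>\n")

# scan() only matches at the current position and returns what it consumed.
scanner.scan(%r{<email>\s*})

# scan_until() returns everything up to and including the closing tag.
raw = scanner.scan_until(%r{</email>\s*})

# "</email>" is 8 characters long, so [0..-9] drops exactly that suffix.
email = raw.strip[0..-9].strip
```

Both strip() calls matter: the first removes whitespace after the closing tag so the slice lines up, and the second removes any padding around the address itself.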

18 NOV 2007

Ghost Wheel Example

There has been a fair bit of buzz around the Treetop parser in the Ruby community lately. Part of that is fueled by the nice screencast that shows off how to use the parser generator.

It doesn't get talked about as much, but I wrote a parser generator too, called Ghost Wheel. Probably the main reason Ghost Wheel doesn't receive much attention yet is that I have been slow in getting the documentation written. Given that, I thought I would show how the code built in the Treetop screencast translates to Ghost Wheel:

```
#!/usr/bin/env ruby -wKU

require "rubygems"
require "ghost_wheel"

# define a parser using Ghost Wheel's Ruby DSL
RubyParser    = GhostWheel.build_parser do
  rule( :additive,
        alt( seq( :multiplicative,
                  :space,
                  :additive_op,
                  :space,
                  :additive ) { |add| add[0].send(add[2], add[-1]) },
             :multiplicative ) )
  rule(:additive_op, alt("+", "-"))

  rule( :multiplicative,
        alt( seq( :primary,
                  :space,
                  :multiplicative_op,
                  :space,
                  :multiplicative ) { |mul| mul[0].send(mul[2], mul[-1]) },
             :primary ) )
  rule(:multiplicative_op, alt("*", "/"))

  rule(:primary, alt(:parenthized_additive, :number))
  rule( :parenthized_additive,
        seq("(", :space, :additive, :space, ")") { |par| par[2] } )
  rule(:number, /[1-9][0-9]*|0/) { |n| Integer(n) }

  rule(:space, /\s*/)
  parser(:exp, seq(:additive, eof) { |e| e[0] })
end

# define a parser using Ghost Wheel's grammar syntax
GrammarParser = GhostWheel.build_parser %q{
  additive             =  multiplicative space additive_op space additive
                          { ast[0].send(ast[2], ast[-1]) }
                       |  multiplicative
  additive_op          =  "+" | "-"

  multiplicative       =  primary space multiplicative_op space multiplicative
                          { ast[0].send(ast[2], ast[-1])}
                       |  primary
  multiplicative_op    =  "*" | "/"

  primary              = parenthized_additive | number
  parenthized_additive =  "(" space additive space ")" { ast[2] }
  number               =  /[1-9][0-9]*|0/ { Integer(ast) }

  space                =  /\s*/
  exp                  := additive EOF { ast[0] }
}

if __FILE__ == $PROGRAM_NAME
  require "test/unit"

  class TestArithmetic < Test::Unit::TestCase
    def test_parsing_numbers
      assert_parses         "0"
      assert_parses         "1"
      assert_parses         "123"
      assert_does_not_parse "01"
    end

    def test_parsing_multiplicative
      assert_parses "1*2"
      assert_parses "1 * 2"
      assert_parses "1/2"
      assert_parses "1 / 2"
    end

    def test_parsing_additive
      assert_parses "1+2"
      assert_parses "1 + 2"
      assert_parses "1-2"
      assert_parses "1 - 2"

      assert_parses "1*2 + 3 * 4"
    end

    def test_parsing_parenthized_expressions
      assert_parses "1 * (2 + 3) * 4"
    end

    def test_parse_results
      assert_correct_result "0"
      assert_correct_result "1"
      assert_correct_result "123"

      assert_correct_result "1*2"
      assert_correct_result "1 * 2"
      assert_correct_result "1/2"
      assert_correct_result "1 / 2"

      assert_correct_result "1+2"
      assert_correct_result "1 + 2"
      assert_correct_result "1-2"
      assert_correct_result "1 - 2"

      assert_correct_result "1*2 + 3 * 4"
      assert_correct_result "1 * (2 + 3) * 4"
    end

    private

    PARSERS = [RubyParser, GrammarParser]

    def assert_parses(input)
      PARSERS.each do |parser|
        assert_nothing_raised(GhostWheel::FailedParseError) do
          parser.parse(input)
        end
      end
    end

    def assert_does_not_parse(input)
      PARSERS.each do |parser|
        assert_raises(GhostWheel::FailedParseError) { parser.parse(input) }
      end
    end

    def assert_correct_result(input)
      PARSERS.each { |parser| assert_equal(eval(input), parser.parse(input)) }
    end
  end
end
```
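
To make the shape of that grammar easier to follow without installing anything, here is a hand-rolled sketch of the same rules using only StringScanner from the standard library. TinyArithmetic is a hypothetical class written for this illustration; it does not use Ghost Wheel and says nothing about how Ghost Wheel works internally.

```ruby
require "strscan"

# One method per grammar rule; each returns the computed Integer.
class TinyArithmetic
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # exp := additive EOF
  def parse
    result = additive
    fail ArgumentError, "Unexpected input at #{@scanner.pos}" unless @scanner.eos?
    result
  end

  private

  # additive = multiplicative space additive_op space additive | multiplicative
  def additive
    binary(:multiplicative, /[-+]/, :additive)
  end

  # multiplicative = primary space multiplicative_op space multiplicative | primary
  def multiplicative
    binary(:primary, %r{[*/]}, :multiplicative)
  end

  # Shared shape of both rules: left operand, then optionally "op right".
  def binary(left_rule, operator, right_rule)
    left  = send(left_rule)
    start = @scanner.pos
    space
    if (op = @scanner.scan(operator))
      space
      left.send(op, send(right_rule))
    else
      @scanner.pos = start  # no operator: give back any whitespace we skipped
      left
    end
  end

  # primary = parenthized_additive | number
  def primary
    if @scanner.scan(/\(/)
      space
      value = additive
      space
      @scanner.scan(/\)/) or fail ArgumentError, "Expected )"
      value
    elsif (digits = @scanner.scan(/[1-9][0-9]*|0/))
      Integer(digits)
    else
      fail ArgumentError, "Expected a number or ("
    end
  end

  def space
    @scanner.scan(/\s*/)
  end
end

result = TinyArithmetic.new("1 * (2 + 3) * 4").parse
```

Because both rules recurse on the right, this sketch groups `1 - 2 - 3` as `1 - (2 - 3)`, which differs from Ruby's own left-to-right evaluation. The tests above only chain one operator per level, so they would not exercise that difference.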
