Parsing

Posts tagged with "Parsing."
  • 23

    MAY
    2015

    Rich Methods

    Some APIs provide collections of dirt simple methods that just do one little thing.

    This approach is less common in Ruby though, especially in the core and standard library of the language itself. Ruby often gives us rich methods with lots of switches we can toggle and half-hidden behaviors.

    Let's look at some examples of what I am talking about.

    Get a Line at a Time

    I suspect most Rubyists have used gets() to read lines of input from some kind of IO. Here's the basic usage:

    >> require "stringio"
    => true
    >> f = StringIO.new(<<END_STR)
    <xml>
      <tags>Content</tags>
    </xml>
    END_STR
    => #<StringIO:0x007fd5a264fa08>
    >> f.gets
    => "<xml>\n"
    >> f.gets
    => "  <tags>Content</tags>\n"
    

    I didn't want to mess with external files for these trivial examples, so I just loaded StringIO from the standard library. It allows us to wrap a simple String (defined in this example using the heredoc syntax) in the IO interface. In other words, I'm calling gets() here for a String just as I could with a File or $stdin.
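
    The rest of the post is not shown in this excerpt, so the following is only a sketch of the kind of switches gets() accepts, using the same StringIO setup. The separator, limit, and nil arguments are standard for IO#gets and StringIO#gets; the variable names are just illustrative.

    ```ruby
    require "stringio"

    xml = "<xml>\n  <tags>Content</tags>\n</xml>\n"
    f   = StringIO.new(xml)

    first_line = f.gets       # default: read through the next newline
    up_to_tag  = f.gets(">")  # custom separator: read through the next ">"
    limited    = f.gets(4)    # Integer limit: read at most four bytes
    rest       = f.gets(nil)  # nil separator: read everything that is left
    at_eof     = f.gets       # nil once the input is exhausted
    ```

    One method, four different reading behaviors, all selected by what you pass in.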

    Read more…

  • 19

    SEP
    2014

    "You can't parse [X]HTML with regex."

    The only explanation I'll give for the following code is to provide this link to my favorite Stack Overflow answer.

    #!/usr/bin/env ruby -w
    
    require "open-uri"
    
    URL    = "http://stackoverflow.com/questions/1732348/" +
             "regex-match-open-tags-except-xhtml-self-contained-tags"
    PARSER = %r{
      (?<doctype_declaration>
        <!DOCTYPE\b (?<doctype> [^>]* ) >
      ){0}
      (?<comment>
        <!-- .* -->
      ){0}
    
      (?<script_tag>
        < \s* (?<tag_name> script ) \s* (?<attributes> [^>]* > )
          (?<script> .*? )
        < \s* / \s* script \s* >
      ){0}
      (?<self_closed_tag>
        < \s* (?<tag_name> \w+ ) \s* (?<attributes> [^>]* / \s* > )
      ){0}
      (?<unclosed_tag>
        < \s*
        (?<tag_name> link | meta | br | input | hr | img ) \b
        \s*
        (?<attributes> [^>]* > )
      ){0}
      (?<open_tag>
        < \s* (?<tag_name> \w+ ) \s* (?<attributes> [^>]* > )
      ){0}
      (?<close_tag>
        < \s* / \s* (?<tag_name> \w+ ) \s* >
      ){0}
    
      (?<attribute>
        (?<attribute_name> [-\w]+ )
        (?: \s* = \s* (?<attribute_value> "[^"]*" | '[^']*' | [^>\s]+ ) )? \s*
      ){0}
      (?<attribute_list>
        \g<attribute>
        (?= [^>]* > \z )  # attributes keep a trailing > to disambiguate from text
      ){0}
    
      (?<text>
        (?! [^<]* /?\s*> \z )  # a guard to prevent this from parsing attributes
        [^<]+
      ){0}
    
      \G
      (?:
        \g<doctype_declaration>
        |
        \g<comment>
        |
        \g<script_tag>
        |
        \g<self_closed_tag>
        |
        \g<unclosed_tag>
        |
        \g<open_tag>
        |
        \g<attribute_list>
        |
        \g<close_tag>
        |
        \g<text>
      )
      \s*
    }mix
    
    def parse(html)
      stack = [{attributes: [ ], contents: [ ], name: :root}]
      loop do
        html.sub!(PARSER, "") or break
        if $~[:doctype_declaration]
          add_to_tree(stack.last, "DOCTYPE", $~[:doctype].strip)
        elsif $~[:script_tag]
          add_to_stack(stack, $~[:tag_name], $~[:attributes], $~[:script])
        elsif $~[:self_closed_tag] || $~[:unclosed_tag] || $~[:open_tag]
          add_to_stack(stack, $~[:tag_name], $~[:attributes], "", $~[:open_tag])
        elsif $~[:close_tag]
          stack.pop
        elsif $~[:text]
          stack.last[:contents] << $~[:text]
        end
      end
      stack.pop
    end
    
    def add_to_tree(branch, name, value)
      if branch.include?(name)
        branch[name]  = [branch[name]] unless branch[name].is_a?(Array)
        branch[name] << value
      else
        branch[name] = value
      end
    end
    
    def add_to_stack(stack, tag_name, attributes_html, contents, open = false)
      tag = { attributes: parse_attributes(attributes_html),
              contents:   [contents].reject(&:empty?),
              name:       tag_name }
      add_to_tree(stack.last, tag_name, tag)
      stack.last[:contents] << tag
      stack                 << tag if open
    end
    
    def parse_attributes(attributes_html)
      attributes = { }
      loop do
        attributes_html.sub!(PARSER, "") or break
        add_to_tree(
          attributes,
          $~[:attribute_name],
          ($~[:attribute_value] || $~[:attribute_name]).sub(/\A(["'])(.*)\1\z/, '\2')
        )
      end
      attributes
    end
    
    def convert_to_bbcode(node)
      if node.is_a?(Hash)
        name = node[:name].sub(/\Astrike\z/, "s")
        "[#{name}]#{node[:contents].map { |c| send(__method__, c) }.join}[/#{name}]"
      else
        node
      end
    end
    
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer handles URLs
    html = URI.open(URL, &:read).strip
    ast  = parse(html)
    puts ast["html"]["body"]["div"]
      .find { |div| div[:attributes]["class"] == "container"      }["div"]
      .find { |div| div[:attributes]["id"]    == "content"        }["div"]["div"]
      .find { |div| div[:attributes]["id"]    == "mainbar"        }["div"]
      .find { |div| div[:attributes]["id"]    == "answers"        }["div"]
      .find { |div| div[:attributes]["id"]    == "answer-1732454" }["table"]["tr"]
      .first["td"]
      .find { |div| div[:attributes]["class"] == "answercell"     }["div"]["p"]
      .first[:contents]
      .map(&method(:convert_to_bbcode))  # to reach a wider audience
      .join
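
    For readers who have not met the two regex tricks the script leans on, here is a much smaller sketch of them in isolation. A group written as (?<name> ... ){0} defines a named subexpression without matching anything, \g<name> calls it later, and \G pins each match to the start of what is left after sub!() eats the previous token. The TOKEN pattern and tokenize() helper are hypothetical names invented for this illustration.

    ```ruby
    TOKEN = %r{
      (?<number> \d+    ){0}  # defined here, matched zero times
      (?<word>   [a-z]+ ){0}

      \G (?: \g<number> | \g<word> ) \s*
    }x

    # Repeatedly strip one token off the front of a copy of the input.
    def tokenize(input)
      rest   = input.dup
      tokens = [ ]
      while rest.sub!(TOKEN, "")
        if $~[:number]
          tokens << [:number, Integer($~[:number])]
        else
          tokens << [:word, $~[:word]]
        end
      end
      tokens
    end

    tokens = tokenize("abc 42 de")
    ```

    The loop stops as soon as nothing matches at the front of the remaining input, which is the same way parse() above ends its loop.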
    

    Read more…

  • 11

    NOV
    2011

    Doing it Wrong

    Continuing with my Breaking All of the Rules series, I want to peek into several little areas where I've been caught doing the wrong thing. I'm a rule breaker and I'm determined to take someone down with me!

    My Forbidden Parser

    In one application, I work with an API that hands me very simple data like this:

    <emails>
      <email>user1@example.com</email>
      <email>user2@example.com</email>
      <email>user3@example.com</email>
    </emails>
    

    Now I need to make a dirty confession: I parsed this with a Regular Expression.

    I know, I know. We should never parse HTML or XML with a Regular Expression. If you don't believe me, just take a moment to actually read that response. Yikes!

    Oh and you shouldn't validate emails with a Regular Expression. Oops. We're talking about at least two violations here.

    But it gets worse.

    You may be thinking I rolled a little parser based on Regular Expressions. That might look like this:

    #!/usr/bin/env ruby -w
    
    require "strscan"
    
    class EmailParser
      def initialize(data)
        @scanner = StringScanner.new(data)
      end
    
      def parse(&block)
        parse_emails(&block)
      end
    
      private
    
      def parse_emails(&block)
        @scanner.scan(%r{\s*<emails>\s*}) or fail "Failed to match list start"
        loop do
          parse_email(&block) or break
        end
        @scanner.scan(%r{\s*</emails>}) or fail "Failed to match list end"
      end
    
      def parse_email(&block)
        if @scanner.scan(%r{<email>\s*})
          if email = @scanner.scan_until(%r{</email>\s*})
            block[email.strip[0..-9].strip]
            return true
          else
            fail "Failed to match email end"
          end
        end
        false
      end
    end
    
    EmailParser.new(ARGF.read).parse do |email|
      puts email
    end
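
    The parser above rests on two StringScanner calls, so here is a small sketch of how they behave on a single element. scan() only matches at the current position and returns nil otherwise, while scan_until() reads forward through the first match. The sample string and variable names are made up for the illustration.

    ```ruby
    require "strscan"

    scanner = StringScanner.new("<email>user1@example.com</email>\n")

    opened   = scanner.scan(%r{<email>})         # matches at the current position
    reopened = scanner.scan(%r{<email>})         # nil: no match here, position unchanged
    chunk    = scanner.scan_until(%r{</email>})  # everything through the closing tag
    email    = chunk[0..-9]                      # drop the eight characters of "</email>"
    leftover = scanner.rest                      # the unread remainder
    ```

    The [0..-9] slice is the same trick parse_email() uses to trim the closing tag off the scanned text.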
    

    Read more…

  • 18

    NOV
    2007

    Ghost Wheel Example

    There has been a fair bit of buzz around the Treetop parser in the Ruby community lately. Part of that is fueled by the nice screencast that shows off how to use the parser generator.

    It doesn't get talked about as much, but I wrote a parser generator too, called Ghost Wheel. Probably the main reason Ghost Wheel doesn't receive much attention yet is that I have been slow in getting the documentation written. Given that, I thought I would show how the code built in the Treetop screencast translates to Ghost Wheel:

    #!/usr/bin/env ruby -wKU
    
    require "rubygems"
    require "ghost_wheel"
    
    # define a parser using Ghost Wheel's Ruby DSL
    RubyParser    = GhostWheel.build_parser do
      rule( :additive,
            alt( seq( :multiplicative,
                      :space,
                      :additive_op,
                      :space,
                      :additive ) { |add| add[0].send(add[2], add[-1])},
                 :multiplicative ) )
      rule(:additive_op, alt("+", "-"))
    
      rule( :multiplicative,
            alt( seq( :primary,
                      :space,
                      :multiplicative_op,
                      :space,
                      :multiplicative ) { |mul| mul[0].send(mul[2], mul[-1])},
                 :primary ) )
      rule(:multiplicative_op, alt("*", "/"))
    
      rule(:primary, alt(:parenthized_additive, :number))
      rule( :parenthized_additive,
            seq("(", :space, :additive, :space, ")") { |par| par[2] } )
      rule(:number, /[1-9][0-9]*|0/) { |n| Integer(n) }
    
      rule(:space, /\s*/)
      parser(:exp, seq(:additive, eof) { |e| e[0] })
    end
    
    # define a parser using Ghost Wheel's grammar syntax
    GrammarParser = GhostWheel.build_parser %q{
      additive             =  multiplicative space additive_op space additive
                              { ast[0].send(ast[2], ast[-1]) }
                           |  multiplicative
      additive_op          =  "+" | "-"
    
      multiplicative       =  primary space multiplicative_op space multiplicative
                              { ast[0].send(ast[2], ast[-1])}
                           |  primary
      multiplicative_op    =  "*" | "/"
    
      primary              = parenthized_additive | number
      parenthized_additive =  "(" space additive space ")" { ast[2] }
      number               =  /[1-9][0-9]*|0/ { Integer(ast) }
    
      space                =  /\s*/
      exp                  := additive EOF { ast[0] }
    }
    
    if __FILE__ == $PROGRAM_NAME
      require "test/unit"
    
      class TestArithmetic < Test::Unit::TestCase
        def test_parsing_numbers
          assert_parses         "0"
          assert_parses         "1"
          assert_parses         "123"
          assert_does_not_parse "01"
        end
    
        def test_parsing_multiplicative
          assert_parses "1*2"
          assert_parses "1 * 2"
          assert_parses "1/2"
          assert_parses "1 / 2"
        end
    
        def test_parsing_additive
          assert_parses "1+2"
          assert_parses "1 + 2"
          assert_parses "1-2"
          assert_parses "1 - 2"
    
          assert_parses "1*2 + 3 * 4"
        end
    
        def test_parsing_parenthized_expressions
          assert_parses "1 * (2 + 3) * 4"
        end
    
        def test_parse_results
          assert_correct_result "0"
          assert_correct_result "1"
          assert_correct_result "123"
    
          assert_correct_result "1*2"
          assert_correct_result "1 * 2"
          assert_correct_result "1/2"
          assert_correct_result "1 / 2"
    
          assert_correct_result "1+2"
          assert_correct_result "1 + 2"
          assert_correct_result "1-2"
          assert_correct_result "1 - 2"
    
          assert_correct_result "1*2 + 3 * 4"
          assert_correct_result "1 * (2 + 3) * 4"
        end
    
        private
    
        PARSERS = [RubyParser, GrammarParser]
    
        def assert_parses(input)
          PARSERS.each do |parser|
            assert_nothing_raised(GhostWheel::FailedParseError) do
              parser.parse(input)
            end
          end
        end
    
        def assert_does_not_parse(input)
          PARSERS.each do |parser|
            assert_raises(GhostWheel::FailedParseError) { parser.parse(input) }
          end
        end
    
        def assert_correct_result(input)
          PARSERS.each { |parser| assert_equal(eval(input), parser.parse(input)) }
        end
      end
    end
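
    If the grammar notation is unfamiliar, it may help to see the same rules written out by hand. This is a plain Ruby recursive descent sketch built on StringScanner, with one method per rule. It is not how Ghost Wheel works internally, and the TinyArithmetic class is a hypothetical name used only here.

    ```ruby
    require "strscan"

    # Hand-written stand-in for the grammar above; not part of Ghost Wheel.
    class TinyArithmetic
      def self.parse(input)
        new(input).parse
      end

      def initialize(input)
        @scanner = StringScanner.new(input)
      end

      # exp := additive EOF
      def parse
        result = additive
        fail ArgumentError, "Unexpected input at #{@scanner.pos}" unless @scanner.eos?
        result
      end

      private

      # additive = multiplicative space additive_op space additive | multiplicative
      def additive
        binary(:multiplicative, /[-+]/, :additive)
      end

      # multiplicative = primary space multiplicative_op space multiplicative | primary
      def multiplicative
        binary(:primary, %r{[*/]}, :multiplicative)
      end

      # Shared shape of the two rules: left operand, optional operator and right side.
      def binary(left_rule, operator, right_rule)
        left  = send(left_rule)
        start = @scanner.pos
        space
        if (op = @scanner.scan(operator))
          space
          left.send(op, send(right_rule))
        else
          @scanner.pos = start  # no operator: un-read the space, keep the left side
          left
        end
      end

      # primary = "(" space additive space ")" | number
      def primary
        if @scanner.scan(/\(/)
          space
          result = additive
          space
          @scanner.scan(/\)/) or fail ArgumentError, "Expected ) at #{@scanner.pos}"
          result
        elsif (number = @scanner.scan(/[1-9][0-9]*|0/))
          Integer(number)
        else
          fail ArgumentError, "Expected a number at #{@scanner.pos}"
        end
      end

      def space
        @scanner.scan(/\s*/)
      end
    end

    sum     = TinyArithmetic.parse("1*2 + 3 * 4")
    grouped = TinyArithmetic.parse("1 * (2 + 3) * 4")
    ```

    Because both rules recurse on the right, this sketch groups a chain like 1 - 2 - 3 as 1 - (2 - 3). The tests above never chain two subtractions or divisions, so they would not notice.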
    

    Read more…