-
23 MAY 2015
Rich Methods
Some APIs provide collections of dirt simple methods that just do one little thing.
This approach is less common in Ruby though, especially in the core and standard library of the language itself. Ruby often gives us rich methods with lots of switches we can toggle and half-hidden behaviors.
Let's look at some examples of what I am talking about.
Get a Line at a Time
I suspect most Rubyists have used gets() to read lines of input from some kind of IO. Here's the basic usage:
>> require "stringio"
=> true
>> f = StringIO.new(<<END_STR)
<xml>
  <tags>Content</tags>
</xml>
END_STR
=> #<StringIO:0x007fd5a264fa08>
>> f.gets
=> "<xml>\n"
>> f.gets
=> "  <tags>Content</tags>\n"
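To give a rough feel for the "switches" mentioned above, here is a small sketch of a few optional arguments that gets() accepts. The sample data and variable names are made up for illustration; the arguments themselves (a byte limit, a nil separator, an empty-string separator for paragraph mode, and a custom separator) are the documented ones for IO#gets, which StringIO mirrors.

```ruby
require "stringio"

data = "one\ntwo\n\nthree,four\n"  # hypothetical sample input

# An Integer argument caps how many bytes a single call may return.
limited = StringIO.new(data).gets(3)       # => "one"

# A nil separator means "no separator": read everything that is left.
everything = StringIO.new(data).gets(nil)  # => the whole string

# An empty String separator switches to paragraph mode,
# where a blank line ends the "line".
io        = StringIO.new(data)
paragraph = io.gets("")                    # => "one\ntwo\n\n"
following = io.gets                        # => "three,four\n"

# Any other String is used as the line terminator.
field = StringIO.new("three,four\n").gets(",")  # => "three,"

p limited, everything, paragraph, following, field
```

The same calls should work on a File or $stdin, since they share the IO interface.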
I didn't want to mess with external files for these trivial examples, so I just loaded StringIO from the standard library. It allows us to wrap a simple String (defined in this example using the heredoc syntax) in the IO interface. In other words, I'm calling gets() here for a String just as I could with a File or $stdin.
-
19 SEP 2014
"You can't parse [X]HTML with regex."
The only explanation I'll give for the following code is to provide this link to my favorite Stack Overflow answer.
#!/usr/bin/env ruby -w

require "open-uri"

URL = "http://stackoverflow.com/questions/1732348/" +
      "regex-match-open-tags-except-xhtml-self-contained-tags"

PARSER = %r{
  (?<doctype_declaration> <!DOCTYPE\b (?<doctype> [^>]* ) > ){0}
  (?<comment> <!-- .* --> ){0}

  (?<script_tag>
    < \s* (?<tag_name> script ) \s* (?<attributes> [^>]* > )
      (?<script> .*? )
    < \s* / \s* script \s* >
  ){0}
  (?<self_closed_tag>
    < \s* (?<tag_name> \w+ ) \s* (?<attributes> [^>]* / \s* > )
  ){0}
  (?<unclosed_tag>
    < \s*
    (?<tag_name> link | meta | br | input | hr | img ) \b
    \s*
    (?<attributes> [^>]* > )
  ){0}
  (?<open_tag>
    < \s* (?<tag_name> \w+ ) \s* (?<attributes> [^>]* > )
  ){0}
  (?<close_tag> < \s* / \s* (?<tag_name> \w+ ) \s* > ){0}

  (?<attribute>
    (?<attribute_name> [-\w]+ )
    (?: \s* = \s* (?<attribute_value> "[^"]*" | '[^']*' | [^>\s]+ ) )? \s*
  ){0}
  (?<attribute_list>
    \g<attribute>
    (?= [^>]* > \z )  # attributes keep a trailing > to disambiguate from text
  ){0}

  (?<text>
    (?! [^<]* /?\s*> \z )  # a guard to prevent this from parsing attributes
    [^<]+
  ){0}

  \G
  (?:
    \g<doctype_declaration>
  |
    \g<comment>
  |
    \g<script_tag>
  |
    \g<self_closed_tag>
  |
    \g<unclosed_tag>
  |
    \g<open_tag>
  |
    \g<attribute_list>
  |
    \g<close_tag>
  |
    \g<text>
  )
  \s*
}mix

def parse(html)
  stack = [{attributes: [ ], contents: [ ], name: :root}]
  loop do
    html.sub!(PARSER, "") or break
    if $~[:doctype_declaration]
      add_to_tree(stack.last, "DOCTYPE", $~[:doctype].strip)
    elsif $~[:script_tag]
      add_to_stack(stack, $~[:tag_name], $~[:attributes], $~[:script])
    elsif $~[:self_closed_tag] || $~[:unclosed_tag] || $~[:open_tag]
      add_to_stack(stack, $~[:tag_name], $~[:attributes], "", $~[:open_tag])
    elsif $~[:close_tag]
      stack.pop
    elsif $~[:text]
      stack.last[:contents] << $~[:text]
    end
  end
  stack.pop
end

def add_to_tree(branch, name, value)
  if branch.include?(name)
    branch[name] = [branch[name]] unless branch[name].is_a?(Array)
    branch[name] << value
  else
    branch[name] = value
  end
end

def add_to_stack(stack, tag_name, attributes_html, contents, open = false)
  tag = {
    attributes: parse_attributes(attributes_html),
    contents: [contents].reject(&:empty?),
    name: tag_name
  }
  add_to_tree(stack.last, tag_name, tag)
  stack.last[:contents] << tag
  stack << tag if open
end

def parse_attributes(attributes_html)
  attributes = { }
  loop do
    attributes_html.sub!(PARSER, "") or break
    add_to_tree(
      attributes,
      $~[:attribute_name],
      ($~[:attribute_value] || $~[:attribute_name]).sub(/\A(["'])(.*)\1\z/, '\2')
    )
  end
  attributes
end

def convert_to_bbcode(node)
  if node.is_a?(Hash)
    name = node[:name].sub(/\Astrike\z/, "s")
    "[#{name}]#{node[:contents].map { |c| send(__method__, c) }.join}[/#{name}]"
  else
    node
  end
end

html = open(URL, &:read).strip
ast = parse(html)
puts ast["html"]["body"]["div"]
  .find { |div| div[:attributes]["class"] == "container" }["div"]
  .find { |div| div[:attributes]["id"] == "content" }["div"]["div"]
  .find { |div| div[:attributes]["id"] == "mainbar" }["div"]
  .find { |div| div[:attributes]["id"] == "answers" }["div"]
  .find { |div| div[:attributes]["id"] == "answer-1732454" }["table"]["tr"]
  .first["td"]
  .find { |div| div[:attributes]["class"] == "answercell" }["div"]["p"]
  .first[:contents]
  .map(&method(:convert_to_bbcode))  # to reach a wider audience
  .join
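For readers who haven't met the idiom the PARSER constant leans on, here is a much smaller sketch of it. A named group followed by {0} defines a subpattern without matching anything at that spot, and \g<name> later calls it. The DATE constant and its group names are invented for this illustration and are not part of the code above.

```ruby
# A hypothetical, minimal use of "define with {0}, call with \g<name>".
DATE = %r{
  (?<year>  \d{4} ){0}   # definitions: matched zero times here...
  (?<month> \d{2} ){0}
  (?<day>   \d{2} ){0}

  \A \g<year> - \g<month> - \g<day> \z   # ...and called by name here
}x

match = DATE.match("2014-09-19")
puts match[:year]   # the named captures are filled in by the calls
puts match[:month]
puts match[:day]

no_match = DATE.match("2014-9-19")  # month needs two digits, so this is nil
```

The payoff is the same as in the larger regex: the grammar reads top-down as a list of named rules, with the actual match expression at the bottom.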
-
11 NOV 2011
Doing it Wrong
Continuing with my Breaking All of the Rules series, I want to peek into several little areas where I've been caught doing the wrong thing. I'm a rule breaker and I'm determined to take someone down with me!
My Forbidden Parser
In one application, I work with an API that hands me very simple data like this:
<emails>
  <email>user1@example.com</email>
  <email>user2@example.com</email>
  <email>user3@example.com</email>
  …
</emails>
Now I need to make a dirty confession: I parsed this with a Regular Expression.
I know, I know. We should never parse HTML or XML with a Regular Expression. If you don't believe me, just take a moment to actually read that response. Yikes!
Oh and you shouldn't validate emails with a Regular Expression. Oops. We're talking about at least two violations here.
But it gets worse.
You may be thinking I rolled a little parser based on Regular Expressions. That might look like this:
#!/usr/bin/env ruby -w

require "strscan"

class EmailParser
  def initialize(data)
    @scanner = StringScanner.new(data)
  end

  def parse(&block)
    parse_emails(&block)
  end

  private

  def parse_emails(&block)
    @scanner.scan(%r{\s*<emails>\s*}) or fail "Failed to match list start"
    loop do
      parse_email(&block) or break
    end
    @scanner.scan(%r{\s*</emails>}) or fail "Failed to match list end"
  end

  def parse_email(&block)
    if @scanner.scan(%r{<email>\s*})
      if email = @scanner.scan_until(%r{</email>\s*})
        block[email.strip[0..-9].strip]
        return true
      else
        fail "Failed to match email end"
      end
    end
    false
  end
end

EmailParser.new(ARGF.read).parse do |email|
  puts email
end
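The least obvious line in that parser is probably the [0..-9] slice, so here is a small sketch of the two StringScanner calls involved. The input string and variable names are made up for illustration. The point is that scan_until() returns everything up to and including the match, so the eight characters of the closing tag have to be trimmed off afterward.

```ruby
require "strscan"

# Hypothetical two-item input, just to exercise the scanner.
scanner = StringScanner.new(
  "<email>user1@example.com</email>\n<email>user2@example.com</email>"
)

# scan() only matches at the current position and advances past the match.
opening = scanner.scan(%r{<email>\s*})         # => "<email>"

# scan_until() searches forward and returns the text up to and
# including the match, so the closing tag comes along for the ride.
chunk = scanner.scan_until(%r{</email>\s*})    # => "user1@example.com</email>\n"

# "</email>" is 8 characters long, so [0..-9] drops exactly that suffix.
address = chunk.strip[0..-9].strip             # => "user1@example.com"

# A failed scan() returns nil and leaves the position untouched.
before  = scanner.pos
missing = scanner.scan(%r{</emails>})          # => nil

puts address
```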
-
18 NOV 2007
Ghost Wheel Example
There has been a fair bit of buzz around the Treetop parser in the Ruby community lately. Part of that is fueled by the nice screencast that shows off how to use the parser generator.
It doesn't get talked about as much, but I wrote a parser generator too, called Ghost Wheel. Probably the main reason Ghost Wheel doesn't receive much attention yet is that I have been slow in getting the documentation written. Given that, I thought I would show how the code built in the Treetop screencast translates to Ghost Wheel:
#!/usr/bin/env ruby -wKU

require "rubygems"
require "ghost_wheel"

# define a parser using Ghost Wheel's Ruby DSL
RubyParser = GhostWheel.build_parser do
  rule( :additive,
        alt( seq( :multiplicative,
                  :space,
                  :additive_op,
                  :space,
                  :additive ) { |add| add[0].send(add[2], add[-1])},
             :multiplicative ) )
  rule(:additive_op, alt("+", "-"))

  rule( :multiplicative,
        alt( seq( :primary,
                  :space,
                  :multiplicative_op,
                  :space,
                  :multiplicative ) { |mul| mul[0].send(mul[2], mul[-1])},
             :primary ) )
  rule(:multiplicative_op, alt("*", "/"))

  rule(:primary, alt(:parenthized_additive, :number))
  rule( :parenthized_additive,
        seq("(", :space, :additive, :space, ")") { |par| par[2] } )
  rule(:number, /[1-9][0-9]*|0/) { |n| Integer(n) }

  rule(:space, /\s*/)

  parser(:exp, seq(:additive, eof) { |e| e[0] })
end

# define a parser using Ghost Wheel's grammar syntax
GrammarParser = GhostWheel.build_parser %q{
  additive = multiplicative space additive_op space additive
             { ast[0].send(ast[2], ast[-1]) }
           | multiplicative
  additive_op = "+" | "-"

  multiplicative = primary space multiplicative_op space multiplicative
                   { ast[0].send(ast[2], ast[-1])}
                 | primary
  multiplicative_op = "*" | "/"

  primary = parenthized_additive | number
  parenthized_additive = "(" space additive space ")" { ast[2] }
  number = /[1-9][0-9]*|0/ { Integer(ast) }

  space = /\s*/

  exp := additive EOF { ast[0] }
}

if __FILE__ == $PROGRAM_NAME
  require "test/unit"

  class TestArithmetic < Test::Unit::TestCase
    def test_paring_numbers
      assert_parses "0"
      assert_parses "1"
      assert_parses "123"
      assert_does_not_parse "01"
    end

    def test_parsing_multiplicative
      assert_parses "1*2"
      assert_parses "1 * 2"
      assert_parses "1/2"
      assert_parses "1 / 2"
    end

    def test_parsing_additive
      assert_parses "1+2"
      assert_parses "1 + 2"
      assert_parses "1-2"
      assert_parses "1 - 2"

      assert_parses "1*2 + 3 * 4"
    end

    def test_parsing_parenthized_expressions
      assert_parses "1 * (2 + 3) * 4"
    end

    def test_parse_results
      assert_correct_result "0"
      assert_correct_result "1"
      assert_correct_result "123"

      assert_correct_result "1*2"
      assert_correct_result "1 * 2"
      assert_correct_result "1/2"
      assert_correct_result "1 / 2"

      assert_correct_result "1+2"
      assert_correct_result "1 + 2"
      assert_correct_result "1-2"
      assert_correct_result "1 - 2"

      assert_correct_result "1*2 + 3 * 4"

      assert_correct_result "1 * (2 + 3) * 4"
    end

    private

    PARSERS = [RubyParser, GrammarParser]

    def assert_parses(input)
      PARSERS.each do |parser|
        assert_nothing_raised(GhostWheel::FailedParseError) do
          parser.parse(input)
        end
      end
    end

    def assert_does_not_parse(input)
      PARSERS.each do |parser|
        assert_raises(GhostWheel::FailedParseError) { parser.parse(input) }
      end
    end

    def assert_correct_result(input)
      PARSERS.each { |parser| assert_equal(eval(input), parser.parse(input)) }
    end
  end
end
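If it helps to see the shape of that grammar without any parser generator involved, here is a hand-rolled sketch in plain Ruby that follows the same rules one method at a time. This is not Ghost Wheel's API or how it works internally. The TinyArithmetic class, its method names, and its use of ArgumentError are all invented for this illustration.

```ruby
require "strscan"

# A hypothetical recursive descent evaluator mirroring the grammar rules above.
class TinyArithmetic
  def self.parse(input)
    new(input).parse
  end

  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # exp := additive EOF
  def parse
    result = additive
    fail ArgumentError, "Unexpected input at #{@scanner.pos}" unless @scanner.eos?
    result
  end

  private

  # additive = multiplicative space additive_op space additive | multiplicative
  def additive
    left = multiplicative
    (op = operator(/[-+]/)) ? left.send(op, additive) : left
  end

  # multiplicative = primary space multiplicative_op space multiplicative | primary
  def multiplicative
    left = primary
    (op = operator(%r{[*/]})) ? left.send(op, multiplicative) : left
  end

  # primary = parenthized_additive | number
  def primary
    if @scanner.scan(/\(\s*/)
      value = additive
      @scanner.scan(/\s*\)/) or fail ArgumentError, "Expected )"
      value
    elsif (digits = @scanner.scan(/[1-9][0-9]*|0/))
      Integer(digits)
    else
      fail ArgumentError, "Expected a number at #{@scanner.pos}"
    end
  end

  # space, operator, space; rewinds if no operator follows the space
  def operator(pattern)
    start = @scanner.pos
    @scanner.scan(/\s*/)
    if (op = @scanner.scan(pattern))
      @scanner.scan(/\s*/)
      op
    else
      @scanner.pos = start
      nil
    end
  end
end

puts TinyArithmetic.parse("1*2 + 3 * 4")      # 14
puts TinyArithmetic.parse("1 * (2 + 3) * 4")  # 20
```

One thing this sketch makes visible: because each rule recurses on its right side, chained operators nest to the right, so "1 - 2 - 3" evaluates as 1 - (2 - 3) here. None of the test inputs above chain subtraction or division, so they would not notice the difference from eval().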