5 DEC 2006

mail_to_news.rb

This post is part of a series.

[Note: You need to know what the Gateway is before reading this article.]

There are two halves to the Ruby Gateway. One half runs as a qmail filter for an email address on the Ruby Talk mailing list. Every message sent to that address is piped through this filter with a shell script like:

ruby /path/to/gateway/bin/mail_to_news.rb /path/to/mail_to_news.log

The email is piped to the filter via the standard input and the code is expected to handle the message by posting it to comp.lang.ruby or choosing to ignore it. If the filter exits normally, qmail considers the matter handled. A non-zero exit code will cause the filter to be called with that same message again later.
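The contract with qmail described above can be sketched as a minimal filter. This is an illustration, not Gateway code: `handle_message` is a hypothetical name, and the sketch assumes only what the paragraph states, that a normal exit means the message was handled and a non-zero exit means it should be retried later.

```ruby
require "stringio"

# Hypothetical helper: processes one raw email and returns the exit status
# the filter should finish with.
def handle_message(input)
  message = input.read
  return 0 if message.strip.empty?  # nothing to post, treat as handled
  yield message                     # post it (or choose to ignore it)
  0                                 # normal exit: qmail considers it handled
rescue StandardError
  1                                 # non-zero exit: qmail tries again later
end

# A real filter would end with: exit handle_message($stdin) { |m| ... }
status = handle_message(StringIO.new("Subject: hi\n\nHello\n")) { |m| m.size }
```

A StringIO stands in for standard input here so the sketch can be run without qmail.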

The Code

Let's dive right into the source of this half of the Gateway:

GATEWAY_DIR = File.join(File.dirname(__FILE__), "..").freeze

$LOAD_PATH << File.join(GATEWAY_DIR, "config") << File.join(GATEWAY_DIR, "lib")

# ...

The code above just sets things up so this script can require some other files in the project normally. Here are those requires:

# ...

require "servers_config"
require "nntp"

require "net/smtp"
require "logger"
require "timeout"

# ...

The last three requires are standard Ruby libraries. The first two are not.

The servers_config.rb file sets up a ServersConfig module with the information needed to connect to our email and Usenet hosts. I will not show this file, but the references should be obvious when you see them.
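For readers who want to run the script, a stand-in for that file might look like the following. Only the constant names are taken from the code in this article. Every value is a placeholder, and the real file presumably holds email settings as well, which are not guessed at here.

```ruby
# Hypothetical stand-in for config/servers_config.rb. The constant names are
# the ones this script references; all values are placeholders.
module ServersConfig
  NEWSGROUP   = "comp.lang.ruby"    # group the Gateway posts to
  NEWS_SERVER = "news.example.com"  # placeholder Usenet host
  NEWS_USER   = "gateway-user"      # placeholder login
  NEWS_PASS   = "secret"            # placeholder password
end
```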

The nntp.rb file is essentially a vendored copy of the net-nntp library. I've made some minor changes to its debugging output, but the library otherwise functions the same.

We're now ready to initiate logging. Here's the code that starts that process:

# ...

# prepare log
log = Logger.new(ARGV.shift || $stdout)
log.datetime_format = "%Y-%m-%d %H:%M "

# ...

That builds a Logger that writes to the file named by the first command-line argument, or to standard output when none is given, and simplifies the default date and time format.
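To see what the custom format does to a log line, here is a small sketch that logs into a StringIO instead of a file. The StringIO and the sample message are assumptions for illustration. The exact layout around the timestamp depends on the Logger version, so no expected output is shown.

```ruby
require "logger"
require "stringio"

out = StringIO.new                       # stands in for the log file or $stdout
log = Logger.new(out)
log.datetime_format = "%Y-%m-%d %H:%M "  # same format string as the Gateway

log.info "Sending message"

line = out.string  # one log line stamped with date, hour, and minute only
puts line
```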

Now it's time to start setting variables in preparation for the coming email parse:

# ...

# only allow certain headers through
VALID_HEADERS    = %w[ From Subject References In-Reply-To Message-Id Content-Type
                       Content-Transfer-Encoding Date X-ML-Name X-Mail-Count
                       X-X-Sender ]
valid_headers_re = /^(?:#{VALID_HEADERS.join("|")}):/i

# ...

The Gateway passes through only a subset of the email headers for the Usenet post. The above is the list of those headers and the Regexp that will locate them.
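A quick way to see the filter in action is to run the Regexp over a few sample header lines. The sample lines are invented for illustration. The list and the expression are copied from the code above.

```ruby
VALID_HEADERS    = %w[ From Subject References In-Reply-To Message-Id Content-Type
                       Content-Transfer-Encoding Date X-ML-Name X-Mail-Count
                       X-X-Sender ]
valid_headers_re = /^(?:#{VALID_HEADERS.join("|")}):/i

sample = [
  "Subject: Ruby Quiz\n",
  "message-id: <1234@example.com>\n",  # matching ignores case
  "Received: from mail.example.com\n",
  "X-Mailer: Example\n"
]

# Only the lines naming a whitelisted header survive.
kept = sample.grep(valid_headers_re)
```

Here `kept` holds the Subject and message-id lines. The Received and X-Mailer lines are not on the list, so they are dropped.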

Now, there are two types of messages we do not wish to forward: spam and messages the other half of the Gateway sent to Ruby Talk (forwarding those back would create an infinite loop). The following code prepares flags for these conditions:

# ...

# message flags
spam     = false
mirrored = false

# ...

The following code allocates variables to hold key header information parsed from the message:

# ...

# header data
msg_id   = "unknown"
subject  = "unknown"
from     = "unknown"
reply_to = nil  # nil (not "unknown") so the nil? checks below can fire
ref      = nil
head     = <<END_RECEIVED  # build received header, including loop flag
Newsgroups: #{ServersConfig::NEWSGROUP}
X-received-from: This message has been automatically forwarded from the
   ruby-talk mailing list by a gateway at #{ServersConfig::NEWSGROUP}. If it is
   SPAM, it did not originate at #{ServersConfig::NEWSGROUP}. Please report the
   original sender, and not us. Thanks!
   Please see http://hypermetrics.com/rubyhacker/clrFAQ.html#tag24 too.
X-rubymirror: yes
END_RECEIVED

# ...

Note that the headers start with an X-received-from header explaining our service, followed by the X-rubymirror flag. The other half of the Gateway uses that flag to recognize posts created by this half.

Now we need to parse the email headers:

# ...

# process message headers
valid_header = false
$stdin.each do |line|
  case line
  when /^X-Spam-Status: Yes/      # flag message as spam
    spam = true
  when /^X-rubymirror: yes/       # flag messages from news_to_mail
    mirrored = true
  when /^\s*$/                    # end of headers
    break
  when /^\s/                      # continuation line
    head << line if valid_header  # only allow after valid headers
  when valid_headers_re           # valid header
    valid_header = true

    # parse header data
    case line
    when /Message-Id:\s+(.*)/i
      msg_id = $1.sub(/\.+>$/, ">")
    when /In-Reply-To:\s*(.*)/i
      reply_to = $1
    when /References:\s*(.*)/i
      ref = $1
    when /^Subject:\s*(.*)/i
      subject = $1
    when /^From:\s*(.*)/i
      from = $1
    end

    head << line
  else                            # invalid header, discard
    valid_header = false
  end
end

# ...

There's nothing too tricky in the above code. We match headers with simple expressions, pulling the information we need into variables. We also set flags as appropriate and add to the headers we have started for the newsgroup post. This code stops reading at the blank line signaling the end of the email headers.
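The same loop can be tried in isolation by wrapping it in a method and feeding it a sample email. This is a sketch under assumptions: `scan_headers` is a hypothetical name, the whitelist is shortened, and the per-header data extraction is left out so the continuation-line handling is easier to follow.

```ruby
require "stringio"

VALID = /^(?:From|Subject|References|In-Reply-To|Message-Id):/i  # shortened list

# Hypothetical wrapper around the loop above. Returns the kept header text
# and the flags, leaving the IO positioned at the start of the body.
def scan_headers(io)
  kept   = +""                          # unfrozen String to append to
  flags  = { spam: false, mirrored: false }
  inside = false                        # are we inside a header we keep?
  io.each do |line|
    case line
    when /^X-Spam-Status: Yes/ then flags[:spam] = true
    when /^X-rubymirror: yes/  then flags[:mirrored] = true
    when /^\s*$/               then break   # blank line ends the headers
    when /^\s/                              # continuation line
      kept << line if inside
    when VALID
      inside = true
      kept << line
    else
      inside = false                        # unwanted header, discard
    end
  end
  [kept, flags]
end

email = StringIO.new([
  "Received: from mail.example.com",
  "  by mx.example.com",       # continuation of a discarded header
  "Subject: A long",
  "  subject line",            # continuation of a kept header
  "From: someone@example.com",
  "X-rubymirror: yes",
  "",
  "Body text."
].join("\n") + "\n")

kept, flags = scan_headers(email)
body = email.read  # the loop stopped at the blank line, so this is the body
```

The `inside` flag plays the role of `valid_header` in the Gateway code: a continuation line is kept only when the header it belongs to was kept.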

The code above didn't address flagged messages immediately, because we wanted to be able to log the key details about them. We now have those details, so it's time to address the flags:

# ...

# skip any flagged messages
if mirrored
  log.info "Skipping message ##{msg_id}, sent by news_to_mail"
  exit
elsif spam
  log.info "Ignoring Spam ##{msg_id}:  #{subject} -- #{from}"
  exit
end

# ...

As you can see, flagged messages are noted in the log and we exit cleanly without further processing.

The Gateway does some final header doctoring in an attempt to set a reasonable References header and also includes the Ruby Talk message id for reader reference:

# ...

# doctor headers for Ruby Talk
if ref.nil?
  if reply_to.nil?
    if subject =~ /^Re:/
      head << "In-Reply-To: <this_is_a_dummy_message-id@rubygate>\n"
      head << "References: <this_is_a_dummy_message-id@rubygate>\n"
    end
  else
    head << "References: #{reply_to}\n"
  end
end
head << "X-ruby-talk: #{msg_id}\n"

# ...
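The decision tree above is easier to check when written as a small method. This is a sketch, not Gateway code: `threading_headers` is a hypothetical name, and it assumes that nil means the header was absent from the email.

```ruby
# Hypothetical helper mirroring the branches above. It returns the header
# lines to append; nil arguments stand for headers the email did not carry.
def threading_headers(subject, reply_to, ref)
  return "" unless ref.nil?  # the email already had a References header
  if reply_to
    "References: #{reply_to}\n"  # reuse In-Reply-To as the reference
  elsif subject =~ /^Re:/
    dummy = "<this_is_a_dummy_message-id@rubygate>"
    "In-Reply-To: #{dummy}\nReferences: #{dummy}\n"
  else
    ""                           # looks like a new thread, add nothing
  end
end

threading_headers("Re: Ruby Quiz", nil, nil)      # dummy pair of headers
threading_headers("Ruby Quiz", "<a@b>", nil)      # References from In-Reply-To
threading_headers("Re: Ruby Quiz", "<a@b>", "<c@d>")  # nothing to add
```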

We are finally ready to construct a complete Usenet post:

# ...

# construct final message
body = $stdin.read
msg  = head + "\n" + body
msg.gsub!(/\r?\n/, "\r\n")

log.info "Sending message ##{msg_id}:  #{subject} -- #{from}..."
log.info "Message looks like: #{msg.inspect}"

# ...

The above code just joins the headers and existing message body, cleans up the newlines and logs our progress.
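The newline cleanup is worth a closer look. The optional `\r` in the pattern means a line that already ends in CRLF is matched whole and replaced with a single CRLF, so existing carriage returns are not doubled. The sample head and body below are invented for illustration.

```ruby
head = "Subject: Hello\nFrom: someone@example.com\n"
body = "Line one\r\nLine two\nLine three\n"  # mixed line endings

msg = head + "\n" + body
msg.gsub!(/\r?\n/, "\r\n")  # every line now ends in exactly one CRLF

bare_newlines  = msg.scan(/(?<!\r)\n/)  # LF not preceded by CR
doubled_returns = msg.scan("\r\r\n")    # CR accidentally added twice
```

Both `bare_newlines` and `doubled_returns` come back empty for this sample.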

Actually sending the message is a two-step process. First we connect to our Usenet host:

# ...

# connect to NNTP host
begin
  nntp = nil
  Timeout.timeout(30) do
    nntp = Net::NNTP.new( ServersConfig::NEWS_SERVER,
                          Net::NNTP::NNTP_PORT,
                          ServersConfig::NEWS_USER,
                          ServersConfig::NEWS_PASS )
  end
rescue Timeout::Error
  log.error "The NNTP connection timed out."
  exit -1
rescue
  log.fatal "Unable to establish connection to NNTP host:  #{$!.message}"
  exit -1
end

# ...

Above you can see several references to the ServersConfig module I spoke of earlier. These constants contain exactly what their names indicate.

Note that we exit with an error code if anything goes wrong here, assuming the problem is temporary and allowing qmail to try again later.
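The timeout-then-rescue shape used here can be tried without a news server. In this sketch `attempt` is a hypothetical helper that reports the outcome instead of exiting, and the blocks stand in for the NNTP connection. Timeout::Error is rescued first because in current Ruby it is itself a StandardError.

```ruby
require "timeout"

# Hypothetical helper: runs the block under a time limit and reports the
# outcome as [:ok, value], [:timeout, nil], or [:error, message].
def attempt(seconds)
  [:ok, Timeout.timeout(seconds) { yield }]
rescue Timeout::Error
  [:timeout, nil]                 # took too long, worth retrying later
rescue StandardError => error
  [:error, error.message]         # any other failure
end

connected = attempt(1)   { :connected }          # finishes in time
too_slow  = attempt(0.1) { sleep 1 }             # exceeds the limit
broken    = attempt(1)   { raise "host unknown" }  # some other failure
```

The Gateway treats both failure cases the same way, by exiting with a non-zero status so the message is retried, but logs them at different levels.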

The final step is to send the message:

# ...

# attempt to send newsgroup post
unless $DEBUG
  begin
    result = nil
    Timeout.timeout(30) { result = nntp.post(msg) }
  rescue Timeout::Error
    log.error "The NNTP post timed out."
    exit -1
  rescue
    log.fatal "Unable to post to NNTP host:  #{$!.message}"
    exit -1
  end
  log.info "...  Sent.  nntp.post() result = #{result}"
end

The above code makes the post and logs what we have accomplished. Again, we exit with an error code if something goes wrong, to signal a retry. The $DEBUG check (Ruby sets $DEBUG when run with the -d switch) lets me test Gateway operation with everything except the actual post when needed.

Possible Improvements

Usenet and email are two different worlds with opposing rules. Our Usenet host, like many, does not allow the posting of multipart/alternative messages (used to send HTML email). Some have expressed a desire for the Gateway to convert these messages into a Usenet-safe format. This could possibly be done by using the text/plain variant of the content when provided, and stripping the HTML when it is not. This change is of low importance to me, since I don't believe posters should be sending HTML email to Ruby Talk.
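To give a feel for the "strip the HTML" fallback, here is a deliberately naive sketch. It is not part of the Gateway, `strip_html` and `ENTITIES` are hypothetical names, and real HTML email would need a proper parser rather than regular expressions.

```ruby
# Naive, hypothetical fallback for HTML-only mail: drops tags and decodes a
# handful of common entities. Real HTML needs a proper parser.
ENTITIES = { "&lt;" => "<", "&gt;" => ">", "&quot;" => '"', "&amp;" => "&" }

def strip_html(html)
  text = html.gsub(/<br\s*\/?>|<\/p>/i, "\n")  # keep some line structure
  text = text.gsub(/<[^>]+>/, "")               # remove the remaining tags
  # Decode in a single pass so "&amp;lt;" becomes "&lt;", not "<".
  text.gsub(/&(?:lt|gt|quot|amp);/) { |entity| ENTITIES[entity] }
end

plain = strip_html("<p>Use <b>puts</b> &amp; <i>p</i>.</p>")
```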

In a similar vein, some Usenet hosts reject certain types of multipart/mixed messages (used to send email attachments), generally those with Base64-encoded parts, to keep binary content out. Our host allows such posts, but they may not circulate well on Usenet for these reasons. Again, we might be able to inline the content of these files, but this could get pretty tricky for some attachments. For example, imagine a post with a zip archive of files. This problem interests me more than HTML email.

The first step to either of these changes is probably to switch to a real email parsing library. I imagine the original code didn't use one because the choices weren't convenient when the Gateway was designed. I just cleaned up the code in my rewrite and don't have enough experience with such libraries to select the proper replacement. Odds are this could simplify a fair portion of the Gateway code though, if we find the right one. We are looking for a library that:

Makes it easy to read email headers.
Allows us to set new headers, for things like the no-mirror flag.
Allows us to remove unwanted headers. Alternately I guess we could build a new message object and copy over the headers we wish to keep.
Supports easy manipulation of multipart/alternative and multipart/mixed content.

If you have experience with such a library, please leave a comment below showing how this could be used to simplify the code above.

In: The Gateway | Tags: Community | 2 Comments

Comments (2)

Matt M. December 5th, 2006
Thanks for writing this up.

I've been messing around with my own email list to blog gateway. For that I've found TMail to work pretty well for parsing email messages. I believe it supports all your criteria.

I have a brain-dead technique for trying to extract text from multipart emails:
```
def get_text(msg)
  if msg.multipart?
    msg.parts.each do |part|
      if part.content_type == 'text/plain'
        return part.body
      elsif part.multipart?
        return get_text(part)
      end
      if part.multipart?
        get_text(part)
      end
    end
  end

  return ""
end

def get_message_body
  if @msg.multipart?
    return get_text(@msg)
  else
    return @msg.body
  end    
end
```
This doesn't always work well. Emails that have been cut and pasted from Word into Outlook are particularly painful, but it works most of the time.

Have you considered trying to implement References or In-Reply-To header support for threading messages? I've worked on that but only to one level deep since in my gateway a new email is a post and a response becomes a comment to that post. The problem I've found in email is that some people respond to messages but start entirely new threads, and some people compose an entirely new email but are actually responding. Some even put an Re: subject line to make it look like they hit reply instead of compose.

I've found that using In-Reply-To, References and Thread-Index (Microsoft clients send this) with a dash of Subject line comparisons I get fairly accurate representations of the threads regardless of whether they hit reply or compose. I haven't tried implementing JWZ's threading algorithm.
James Edward Gray II December 5th, 2006
  
  Thanks for the tips. I will definitely play with TMail and see if I can use that to make this process easier.
  
  Just FYI, your code checked part.multipart? twice, once in the elsif condition and again in the following if condition.
  
  I did show the code that handled References in my write-up. It's very basic, so this might be another area for improvement.