5 DEC 2006
mail_to_news.rb
[Note: You need to know what the Gateway is before reading this article.]
There are two halves to the Ruby Gateway. One half runs as a qmail filter for an email address on the Ruby Talk mailing list. Every message sent to that address is piped through this filter with a shell script like:
ruby /path/to/gateway/bin/mail_to_news.rb /path/to/mail_to_news.log
The email is piped to the filter via the standard input and the code is expected to handle the message by posting it to comp.lang.ruby or choosing to ignore it. If the filter exits normally, qmail considers the matter handled. A non-zero exit code will cause the filter to be called with that same message again later.
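As a small illustration of that exit-status convention, here is a sketch that runs two one-line Ruby programs as child processes and reports the status a parent process would see. The helper name exit_status_of is made up for this example and nothing here is Gateway code; it only shows that a bare exit reports success while exit -1 reports a non-zero status.

```ruby
require "rbconfig"

# Run a one-line Ruby program in a child process and return the exit status
# the parent sees. (Illustrative helper, not part of the Gateway.)
def exit_status_of(code)
  system(RbConfig.ruby, "-e", code)
  $?.exitstatus
end

handled = exit_status_of("exit")     # a normal exit, as after posting or skipping
failed  = exit_status_of("exit -1")  # an error exit, as after a failed connection

puts "clean exit status: #{handled}"
puts "error exit status: #{failed}"
```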
The Code
Let's dive right into the source of this half of the Gateway:
GATEWAY_DIR = File.join(File.dirname(__FILE__), "..").freeze
$LOAD_PATH << File.join(GATEWAY_DIR, "config") << File.join(GATEWAY_DIR, "lib")
# ...
The code above just sets things up so this script can require some other files in the project normally. Here are those requires:
# ...
require "servers_config"
require "nntp"
require "net/smtp"
require "logger"
require "timeout"
# ...
The last three requires are standard Ruby libraries. The first two are not.
The servers_config.rb file sets up a ServersConfig module with information needed to connect to our email and Usenet hosts. I will not show this file, but the references should be obvious when you see them.
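For readers who want to run the snippets that follow, a hypothetical stand-in for that file might look like the sketch below. The constant names are the ones this script references; every value is a placeholder I made up, and the real file also holds email settings that are not guessed at here.

```ruby
# Hypothetical stand-in for config/servers_config.rb. Only the constants that
# mail_to_news.rb references are included, and all values are placeholders.
module ServersConfig
  NEWSGROUP   = "comp.lang.ruby"    # the group named in this article
  NEWS_SERVER = "news.example.com"  # placeholder Usenet host
  NEWS_USER   = "gateway"           # placeholder login
  NEWS_PASS   = "secret"            # placeholder password
end

puts "Posting to #{ServersConfig::NEWSGROUP} via #{ServersConfig::NEWS_SERVER}"
```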
The nntp.rb file is pretty much a vendored copy of the net-nntp library. I've made some minor changes for debugging output purposes, but the library functions the same.
We're now ready to initiate logging. Here's the code that starts that process:
# ...
# prepare log
log = Logger.new(ARGV.shift || $stdout)
log.datetime_format = "%Y-%m-%d %H:%M "
# ...
That just builds a Logger object and cleans up the default date and time formatting.
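To see what that format string does, here is a minimal sketch that logs to an in-memory buffer instead of a file. The StringIO target and the sample message are assumptions made for the example; the Gateway itself logs to the path given on the command line, or to $stdout.

```ruby
require "logger"
require "stringio"

# Log to an in-memory buffer so the formatted line can be inspected.
buffer = StringIO.new
log = Logger.new(buffer)
log.datetime_format = "%Y-%m-%d %H:%M "

# A made-up message in the style the Gateway logs later on.
log.info "Sending message #<example@id>: Hello -- someone@example.com..."
puts buffer.string
```

The timestamp in each line is reduced to a date plus hours and minutes, which keeps the log compact.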
Now it's time to start setting variables in preparation for the coming email parse:
# ...
# only allow certain headers through
VALID_HEADERS = %w[ From Subject References In-Reply-To Message-Id Content-Type
                    Content-Transfer-Encoding Date X-ML-Name X-Mail-Count
                    X-X-Sender ]
valid_headers_re = /^(?:#{VALID_HEADERS.join("|")}):/i
# ...
The Gateway passes through only a subset of the email headers for the Usenet post. The above is the list of those headers and the Regexp that will locate them.
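A quick way to check the whitelist is to run a few header lines through the Regexp. In this sketch the list and expression are copied from above, while the sample header lines are made up for illustration.

```ruby
VALID_HEADERS = %w[ From Subject References In-Reply-To Message-Id Content-Type
                    Content-Transfer-Encoding Date X-ML-Name X-Mail-Count
                    X-X-Sender ]
valid_headers_re = /^(?:#{VALID_HEADERS.join("|")}):/i

# Made-up header lines: two on the whitelist, two not.
samples = [
  "From: someone@example.com\n",
  "subject: case does not matter\n",
  "Received: from mail.example.com\n",
  "X-Mailer: Example Mailer\n"
]

# Array#grep keeps the lines the expression matches.
kept = samples.grep(valid_headers_re)
puts kept
```

Only the From and subject lines survive. The match is case-insensitive, and the trailing colon in the expression keeps X-Mailer from being mistaken for X-Mail-Count.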
Now, there are two types of messages we do not wish to forward: spam, and messages sent to Ruby Talk by the other half of the Gateway (forwarding those would cause an infinite loop of sending). The following code prepares flags for these conditions:
# ...
# message flags
spam = false
mirrored = false
# ...
The following code allocates variables to hold key header information parsed from the message:
# ...
# header data
msg_id = "unknown"
subject = "unknown"
from = "unknown"
reply_to = nil  # stays nil unless an In-Reply-To header is found
ref = nil       # stays nil unless a References header is found
head = <<END_RECEIVED # build received header, including loop flag
Newsgroups: #{ServersConfig::NEWSGROUP}
X-received-from: This message has been automatically forwarded from the
  ruby-talk mailing list by a gateway at #{ServersConfig::NEWSGROUP}. If it is
  SPAM, it did not originate at #{ServersConfig::NEWSGROUP}. Please report the
  original sender, and not us. Thanks!
  Please see http://hypermetrics.com/rubyhacker/clrFAQ.html#tag24 too.
X-rubymirror: yes
END_RECEIVED
# ...
Note that we start the headers with an X-received-from header explaining our service, and add the X-rubymirror flag that the other half of the Gateway will use to detect that this half sent the new post we are creating.
Now we need to parse the email headers:
# ...
# process message headers
valid_header = false
$stdin.each do |line|
  case line
  when /^X-Spam-Status: Yes/  # flag message as spam
    spam = true
  when /^X-rubymirror: yes/   # flag messages from news_to_mail
    mirrored = true
  when /^\s*$/                # end of headers
    break
  when /^\s/                  # continuation line
    head << line if valid_header  # only allow after valid headers
  when valid_headers_re       # valid header
    valid_header = true
    # parse header data
    case line
    when /Message-Id:\s+(.*)/i
      msg_id = $1.sub(/\.+>$/, ">")
    when /In-Reply-To:\s*(.*)/i
      reply_to = $1
    when /References:\s*(.*)/i
      ref = $1
    when /^Subject:\s*(.*)/i
      subject = $1
    when /^From:\s*(.*)/i
      from = $1
    end
    head << line
  else                        # invalid header, discard
    valid_header = false
  end
end
# ...
There's nothing too tricky in the above code. We match headers with simple expressions, pulling the information we need into variables. We also set flags as appropriate and add to the headers we have started for the newsgroup post. This code stops reading at the blank line signaling the end of the email headers.
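The one part worth a closer look is how continuation lines follow the fate of the header they belong to. Here is a condensed sketch of just that logic, run against a made-up message. The method name filter_headers and the shortened whitelist are inventions for this example, and the flag and data extraction steps are left out.

```ruby
require "stringio"

# Keep whitelisted headers plus their continuation lines, drop everything
# else, and stop reading at the blank line that ends the headers.
def filter_headers(io, allowed_re)
  kept         = +""
  valid_header = false
  io.each do |line|
    case line
    when /^\s*$/     # blank line: end of headers
      break
    when /^\s/       # continuation of the previous header
      kept << line if valid_header
    when allowed_re  # whitelisted header
      valid_header = true
      kept << line
    else             # any other header
      valid_header = false
    end
  end
  kept
end

# A made-up message with folded headers.
email = StringIO.new(<<~MAIL)
  Received: from mail.example.com
    by another.example.com
  Subject: A folded
    subject line
  X-Mailer: Example
  From: someone@example.com

  The body starts here.
MAIL

headers = filter_headers(email, /^(?:From|Subject):/i)
body    = email.read  # the loop stopped at the blank line, so the body is unread

puts headers
puts "Body: #{body.inspect}"
```

The folded Subject line is kept whole, the folded Received line is dropped whole, and the body is still waiting on the input stream when the loop ends.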
The code above didn't address flagged messages immediately, because we wanted to be able to log the key details about them. We now have those details, so it's time to address the flags:
# ...
# skip any flagged messages
if mirrored
  log.info "Skipping message ##{msg_id}, sent by news_to_mail"
  exit
elsif spam
  log.info "Ignoring Spam ##{msg_id}: #{subject} -- #{from}"
  exit
end
# ...
As you can see, flagged messages are noted in the log and we exit cleanly without further processing.
The Gateway does some final header doctoring in an attempt to set a reasonable References header and also includes the Ruby Talk message id for reader reference:
# ...
# doctor headers for Ruby Talk
if ref.nil?
  if reply_to.nil?
    if subject =~ /^Re:/
      head << "In-Reply-To: <this_is_a_dummy_message-id@rubygate>\n"
      head << "References: <this_is_a_dummy_message-id@rubygate>\n"
    end
  else
    head << "References: #{reply_to}\n"
  end
end
head << "X-ruby-talk: #{msg_id}\n"
# ...
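The nesting above is easier to follow when written as a method that returns the extra header lines. This is only a sketch: doctor_headers and DUMMY_ID are names made up for the example, and it assumes ref and reply_to are nil whenever the matching header was absent from the email.

```ruby
DUMMY_ID = "<this_is_a_dummy_message-id@rubygate>"

# Return the header lines the doctoring step would append.
# (Hypothetical helper; assumes nil means "header was not present".)
def doctor_headers(ref, reply_to, subject)
  return "" unless ref.nil?                                # References already present
  return "References: #{reply_to}\n" unless reply_to.nil?  # borrow In-Reply-To
  return "" unless subject =~ /^Re:/                       # looks like a new thread
  "In-Reply-To: #{DUMMY_ID}\nReferences: #{DUMMY_ID}\n"    # a reply with no ids at all
end

with_refs   = doctor_headers("<a@example.com>", nil, "Re: hello")
with_reply  = doctor_headers(nil, "<b@example.com>", "Re: hello")
bare_reply  = doctor_headers(nil, nil, "Re: hello")
new_thread  = doctor_headers(nil, nil, "hello")

puts with_refs.inspect, with_reply.inspect, bare_reply.inspect, new_thread.inspect
```

Only a reply that arrives with neither header gets the dummy message id, which at least groups such messages as replies in a newsreader.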
We are finally ready to construct a complete Usenet post:
# ...
# construct final message
body = $stdin.read
msg = head + "\n" + body
msg.gsub!(/\r?\n/, "\r\n")
log.info "Sending message ##{msg_id}: #{subject} -- #{from}..."
log.info "Message looks like: #{msg.inspect}"
# ...
The above code just joins the headers and existing message body, cleans up the newlines and logs our progress.
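The newline cleanup is worth a quick demonstration, since the same substitution handles input that already uses CRLF as well as input that does not. The sample string is made up.

```ruby
# Mixed line endings: some lines end in CRLF, some in a bare LF.
msg = +"Subject: test\r\n\nline one\nline two\r\n"

# The optional \r means existing CRLF pairs are replaced by themselves,
# so no line ever ends up with a doubled carriage return.
msg.gsub!(/\r?\n/, "\r\n")
puts msg.inspect
```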
Actually sending the message is a two-step process. First we connect to our Usenet host:
# ...
# connect to NNTP host
begin
  nntp = nil
  Timeout.timeout(30) do
    nntp = Net::NNTP.new( ServersConfig::NEWS_SERVER,
                          Net::NNTP::NNTP_PORT,
                          ServersConfig::NEWS_USER,
                          ServersConfig::NEWS_PASS )
  end
rescue Timeout::Error
  log.error "The NNTP connection timed out."
  exit -1
rescue
  log.fatal "Unable to establish connection to NNTP host: #{$!.message}"
  exit -1
end
# ...
Above you can see several references to the ServersConfig module I spoke of earlier. These constants contain exactly what their names indicate.
Note that we exit with an error code if anything goes wrong here, assuming the problem is temporary and allowing qmail to try again later.
The final step is to send the message:
# ...
# attempt to send newsgroup post
unless $DEBUG
  begin
    result = nil
    Timeout.timeout(30) { result = nntp.post(msg) }
  rescue Timeout::Error
    log.error "The NNTP post timed out."
    exit -1
  rescue
    log.fatal "Unable to post to NNTP host: #{$!.message}"
    exit -1
  end
  log.info "... Sent. nntp.post() result = #{result}"
end
The above code makes the post and logs what we have accomplished. Again we exit with error codes if something goes wrong, to signal retries. The $DEBUG check allows me to test Gateway operation with everything but the actual post send when needed.
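The connect and post steps share the same shape: run a block under a time limit, log a timeout as an error, and log anything else as fatal. A sketch of that pattern as a reusable helper might look like this. The attempt method is hypothetical and returns nil on failure so the example can run to completion, where the Gateway itself calls exit -1.

```ruby
require "logger"
require "timeout"

# Run the block under a time limit, logging failures instead of raising.
# (Hypothetical helper; the Gateway exits with -1 at these points instead.)
def attempt(log, what, seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  log.error "The #{what} timed out."
  nil
rescue StandardError => e
  log.fatal "Unable to complete #{what}: #{e.message}"
  nil
end

log = Logger.new($stdout)

ok     = attempt(log, "fast step", 1)   { :connected }
slow   = attempt(log, "slow step", 0.1) { sleep 1 }
broken = attempt(log, "broken step", 1) { raise IOError, "connection refused" }

puts [ok, slow, broken].inspect
```

Timeout::Error has to be rescued before the general clause, because it is itself a StandardError and would otherwise be logged as a generic failure.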
Possible Improvements
Usenet and email are two different worlds with opposing rules. Our Usenet host, like many, does not allow the posting of multipart/alternative messages (used to send HTML email). Some have expressed a desire for the Gateway to convert these messages into a Usenet safe format. This could possibly be done by using the text/plain variant of content, when provided, and stripping the HTML when it is not. This change is of low importance to me, since I don't believe posters should be sending HTML email to Ruby Talk.
In a similar vein, some Usenet hosts reject certain types of multipart/mixed messages (used to send email attachments), generally those that have Base 64 encoded portions to avoid allowing binary content through. Our host allows such posts, but they may not be well circulated on Usenet for these reasons. Again we might be able to inline the content for these files, but this could get pretty tricky for some attachments. For example, imagine a post with a zip archive of files. This problem interests me more than HTML email.
The first step to either of these changes is probably to switch to a real email parsing library. I imagine the original code didn't use one because the choices weren't convenient when the Gateway was designed. I just cleaned up the code in my rewrite and don't have enough experience with such libraries to select the proper replacement. Odds are this could simplify a fair portion of the Gateway code though, if we find the right one. We are looking for a library that:
- Makes it easy to read email headers.
- Allows us to set new headers, for things like the no-mirror flag.
- Allows us to remove unwanted headers. Alternately I guess we could build a new message object and copy over the headers we wish to keep.
- Supports easy manipulation of multipart/alternative and multipart/mixed content.
If you have experience with such a library, please leave a comment below showing how this could be used to simplify the code above.
Comments (2)
-
Matt M. December 5th, 2006
Thanks for writing this up.
I've been messing around with my own email list to blog gateway. For that I've found TMail to work pretty well for parsing email messages. I believe it supports all your criteria.
I have a brain-dead technique for trying to extract text from multipart emails:
def get_text(msg)
  if msg.multipart?
    msg.parts.each do |part|
      if part.content_type == 'text/plain'
        return part.body
      elsif part.multipart?
        return get_text(part)
      end
      if part.multipart?
        get_text(part)
      end
    end
  end
  return ""
end

def get_message_body
  if @msg.multipart?
    return get_text(@msg)
  else
    return @msg.body
  end
end
This doesn't always work great (emails that have been cut and pasted from Word into Outlook are particularly painful), but it works most of the time.
Have you considered trying to implement References or In-Reply-To header support for threading messages? I've worked on that but only to one level deep, since in my gateway a new email is a post and a response becomes a comment to that post. The problem I've found in email is that some people respond to messages but start entirely new threads, and some people compose an entirely new email but are actually responding. Some even put a Re: subject line to make it look like they hit reply instead of compose.
I've found that by using In-Reply-To, References and Thread-Index (Microsoft clients send this) with a dash of Subject line comparisons, I get fairly accurate representations of the threads regardless of whether they hit reply or compose. I haven't tried implementing JWZ's threading algorithm.
-
Thanks for the tips. I will definitely play with TMail and see if I can use that to make this process easier.
Just FYI, your code checked part.multipart? twice, once in the elsif condition and again in the following if condition.
I did show the code that handled References in my write-up. It's very basic. Improving this might be another possible area of improvement.