STANFORD UNIVERSITY

INFORMATION TECHNOLOGY SERVICES

Search Engine integration with SWISH-E

Overview

Stanford's mailing list service was migrated from Majordomo to Mailman in 2006. Mailing list archiving is one of the new features offered with the new mailing list service roll-out. By default, the archiving feature is turned off. List owners can choose to turn on archiving and set archive type (private or public) through their list's management website.

As of November 2007, about 15% of the 20K total mailing lists are archived, amounting to 110GB of data. 99% of the archived lists are private. Although public lists can use Google or other mailing list service sites to search their archives, private lists cannot, because these sites cannot handle private list authentication and authorization. We anticipate that the number of archived lists will grow as the feature becomes better known. The content will be more valuable if it can be searched easily on the web. Search lets users locate information quickly and should be part of the mail and web service user experience.

Mailman is the open source package we use for the mailing list service. It supports public and private archives, but it does not come with a search function. We have to use an external search tool to make the archives, both public and private, searchable, and integrate that tool with Stanford's WebAuth and Mailman's authentication for private lists.

Software package selection

Whichever tool we choose, we will need to modify code to integrate it with Mailman and Stanford WebAuth authentication. This means that, in addition to meeting the basic requirements for a search engine, the tool should be an open source package that allows us to modify the code and customize the search forms to implement authentication and authorization.

I have chosen SWISH-E (Simple Web Indexing System for Humans - Enhanced) for the initial prototype work. It is a complete package with a fast indexing tool and a CGI search engine, and it uses templates to present queries and search output. Features of this package can be found here.

(For more information on the differences between search tools, refer to the comparison article Search Engine Comparison.)

The CGI search script that comes with SWISH-E will need to be modified to handle public and private list searches when authentication and authorization are needed.

The templates will need to be customized so that we can present separate search forms for public and private search, with the list name carried between searches and the form action pointing to a different search CGI script depending on the archive type.

Implementation details

The main design goal is to support authorized search for private mailing lists. For that reason, each list has its own index file. The SWISH-E search engine and forms need to be modified to use the list name value and build queries against that list's index file.

Search related implementation

  • Adding Search Box

A search box needs to be inserted into messages, the archive's table of contents, and so on. The following archive templates, located in the /etc/mailman/en directory, are modified to include the search form. Mailman applies these templates to each message to generate the HTML version of the message in the archive.

    article.html
    archtoc.html
    archtocnombox.html
    archidxhead.html
    

The following two lines are added between </HEAD> and <BODY>:

    <!--#set var="listname" value=%(listname)s -->
    <!--#include virtual="/includes/searchform.html" -->
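
The %(listname)s placeholder in the first line is a Python dictionary-interpolation key that is filled in when the archive page is generated. As a rough illustration only (the list name "example-list" is a made-up value, and this is not Mailman's own code), the substitution behaves like this:

```python
# The two SSI lines as they appear in the archive templates.
TEMPLATE = (
    '<!--#set var="listname" value=%(listname)s -->\n'
    '<!--#include virtual="/includes/searchform.html" -->\n'
)


def render_ssi_lines(listname):
    """Fill in the list name using Python %-style dictionary interpolation."""
    return TEMPLATE % {"listname": listname}


rendered = render_ssi_lines("example-list")  # hypothetical list name
print(rendered)
```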
    

Server Side Include (SSI) needs to be turned on in Mailman's Apache server configuration to process the SSI directives.

"set var" is used to set listname value, to be used in a hidden field in search form so it can be passed to search engine to search the right list index. "include" is used to include the searchform.html - a search form on subject, body, and author. In the search form, "echo var" is used to get the listname value.

  • Add extra HTML META tags

SWISH-E can use META tags as index properties so that this metadata can be searched on and displayed in a user-friendly format. The following META tags are added to the "article.html" template after <META NAME="robots" CONTENT="index,nofollow">:

    <META NAME="author" CONTENT="%(author_html)s">
    <META NAME="datestr" CONTENT="%(datestr_html)s">
    <META NAME="unixdate" CONTENT="%(date)s">
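
One way to confirm that a generated message page carries the new tags is to parse it and report any that are missing. The following is a minimal sketch using only the Python standard library; the helper names (MetaCollector, missing_meta) and the sample page are hypothetical and are not part of Mailman or SWISH-E.

```python
from html.parser import HTMLParser

REQUIRED_META = ("author", "datestr", "unixdate")


class MetaCollector(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "meta":
            fields = dict(attrs)
            if fields.get("name"):
                self.meta[fields["name"].lower()] = fields.get("content") or ""


def missing_meta(page_html):
    """Return the required META names that the page does not define."""
    collector = MetaCollector()
    collector.feed(page_html)
    return [name for name in REQUIRED_META if name not in collector.meta]


sample = (
    '<HTML><HEAD><META NAME="robots" CONTENT="index,nofollow">'
    '<META NAME="author" CONTENT="A. Sender">'
    '<META NAME="datestr" CONTENT="Fri Nov 30 2007"></HEAD></HTML>'
)
print(missing_meta(sample))
```

In this made-up sample the unixdate tag is absent, so the page would be flagged as needing re-generation.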
    
  • Re-generate list archives

Since the current archives do not have this META data, we need to re-generate all list archives. The Mailman arch program uses the above templates to generate new HTML files for each message. The 'listname' variable will get a real value, and the SSI include directives will be inserted. To re-generate the archives for a list, run:

   /usr/lib/mailman/bin/arch --wipe <listname>
   

You can write a wrapper to re-generate archives for all mailing lists. Note that this is a one-time process. New archives will have the META tags and SSI directives automatically.
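
Such a wrapper could look roughly like the sketch below. It assumes that Mailman's list_lists command is installed next to arch in /usr/lib/mailman/bin (only the arch path is given above) and uses its --bare option to print one list name per line. The function names are hypothetical, and the default dry-run mode only prints the commands instead of running them.

```python
import subprocess

ARCH = "/usr/lib/mailman/bin/arch"              # path from the command above
LIST_LISTS = "/usr/lib/mailman/bin/list_lists"  # assumed location


def arch_command(listname):
    """Build the command that wipes and re-generates one list's archive."""
    return [ARCH, "--wipe", listname]


def all_list_names():
    """Ask Mailman for every list name (requires a Mailman installation)."""
    result = subprocess.run(
        [LIST_LISTS, "--bare"], check=True, capture_output=True, text=True
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def regenerate_all(list_names, dry_run=True):
    """Re-generate archives for the given lists; return the commands used."""
    commands = []
    for name in list_names:
        command = arch_command(name)
        commands.append(command)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


# Dry run with made-up list names; no Mailman installation is needed for this.
planned = regenerate_all(["example-list", "another-list"])
```

On a real server, regenerate_all(all_list_names(), dry_run=False) would perform the one-time rebuild.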

  • Create new indexes

New indexes are created with the 'swish-e' command for each list. We use a wrapper script to create full indexes for all mailing list archives.

This script creates index.swish-e and index.swish-e.prop under each mailing list's archive directory. swish-e uses three configuration files to generate the indexes. They are installed under the /etc/swish-e/ directory:

   general.conf
   private.conf 
   public.conf
   

General.conf is included by both private.conf and public.conf and holds the indexing configuration that is the same for private and public archives. Private.conf and public.conf have specific rules to replace the archive file location in index files with their respective URLs.
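
As an illustration of what the indexing wrapper does for each list, the sketch below builds a swish-e command line from the archive type, using swish-e's -c (configuration file), -i (input directory), and -f (index file) options. The archive root /var/lib/mailman/archives/private and the function name are assumptions made for this example, not the actual Stanford script.

```python
import os

SWISH_E = "swish-e"        # assumed to be on PATH
CONF_DIR = "/etc/swish-e"  # configuration directory named above
ARCHIVE_ROOT = "/var/lib/mailman/archives/private"  # assumed archive location


def index_command(listname, archive_type):
    """Build the swish-e command that indexes one list's archive."""
    if archive_type not in ("private", "public"):
        raise ValueError("archive_type must be 'private' or 'public'")
    archive_dir = os.path.join(ARCHIVE_ROOT, listname)
    return [
        SWISH_E,
        "-c", os.path.join(CONF_DIR, archive_type + ".conf"),
        "-i", archive_dir,
        "-f", os.path.join(archive_dir, "index.swish-e"),
    ]


# Hypothetical list name; the command is printed, not executed.
print(" ".join(index_command("example-list", "private")))
```

swish-e writes the companion index.swish-e.prop file next to the index on its own, so only the index path is passed.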

  • Need SSI function

We need to turn on Apache server side include for Mailman to allow HTML and CGI script's output to be processed by SSI. This is what you need:

       AddType text/html .html
       AddOutputFilter INCLUDES .html
   

And

      <Directory /usr/lib/cgi-bin/mailman/>
          Options ExecCGI FollowSymlinks +Includes
          SetOutputFilter INCLUDES
          AcceptPathInfo On
      </Directory>
   

Enable mod_include and restart the Apache server.

Now that all the forms and indexes are in place, they are ready to be processed by a search tool with authentication and authorization capability.

Search authentication and authorization

A private mailing list may have both Stanford and non-Stanford users. Stanford users use WebAuth to authenticate to their list's archive, while non-Stanford users use their Mailman password to authenticate themselves. Ideally, we would put Mailman's private script (the private archive access script) under WebAuth protection and, if a user does not have a Stanford WebAuth cookie, skip the WebAuth login and fall back to Mailman's own authentication. That would simplify the implementation considerably, and the same goes for the swish.cgi search script. Since the current WebAuth implementation has no such 'fall-through' feature, we have to use two separate CGI scripts to handle authentication. The two scripts, private and suprivate, are already implemented to handle archive reading; we need to add more scripts to handle the search function.

The programs related to implementing authenticated archive search are:

  • private and suprivate

suprivate is under WebAuth, and private is not. When a user follows a private archive link, the request is processed by private. The user is not presented with a Stanford WebAuth login page; instead, private checks for the "webauth_at" cookie through the HTTP_COOKIE environment variable. If the cookie exists, it redirects the user to suprivate, which is under WebAuth protection and takes over further cookie and membership validation for Stanford users. If there is no "webauth_at" cookie, Mailman displays its own login page. A login name containing "@stanford.edu" also redirects the user to suprivate; otherwise, a list password is required.

private uses the default searchform.html inserted in the archive for non Stanford members, while suprivate will change the default form to susearchform.html before presenting the archives to the users. See the

After authentication, both private and suprivate check the user's list membership and go to the list's archive page if the user is authorized.
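
The routing decision made by private can be sketched as follows. This is an illustrative outline only, not the actual CGI script: the function names and the returned action strings are hypothetical, and the real script goes on to perform the redirect and the membership check.

```python
from http.cookies import SimpleCookie


def has_webauth_cookie(environ):
    """Return True if the request carries a "webauth_at" cookie."""
    cookie = SimpleCookie()
    cookie.load(environ.get("HTTP_COOKIE", ""))
    return "webauth_at" in cookie


def route_private_request(environ, login_name=None):
    """Decide what the unprotected 'private' script should do next."""
    if has_webauth_cookie(environ):
        return "redirect:suprivate"    # WebAuth takes over from here
    if login_name is None:
        return "show:mailman-login"    # no cookie yet, show Mailman's own page
    if "@stanford.edu" in login_name:
        return "redirect:suprivate"    # Stanford address, send to WebAuth
    return "check:list-password"       # non-Stanford member


# Made-up request environments for illustration.
print(route_private_request({"HTTP_COOKIE": "webauth_at=abc123; other=1"}))
print(route_private_request({}, login_name="member@example.org"))
```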

  • swish and suswish

These two scripts are based on SWISH-E's CGI Perl search script. swish, which is not protected by WebAuth, handles public archive search and private archive search for non-Stanford members. For private archives, it gets Mailman's session cookie, validates the cookie, and performs the normal search if the user is authorized. suswish, which is protected by WebAuth, gets the WebAuth user's name, validates membership, and then runs the query.
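
The authorization step shared by both scripts amounts to "confirm membership, then select that list's index file". The sketch below shows the idea in Python rather than the Perl of the real scripts. The archive root, the list-name pattern, and the in-memory member set are assumptions for illustration; how the user name is obtained (Mailman session cookie or WebAuth) is outside the sketch. Because the list name arrives from a form field, the sketch also rejects names that could escape the archive directory.

```python
import os
import re

ARCHIVE_ROOT = "/var/lib/mailman/archives/private"  # assumed archive location
# Assumed pattern for acceptable list names; rejects slashes and leading dots.
LISTNAME_RE = re.compile(r"^[a-z0-9][a-z0-9._+-]*$")


def index_path(listname):
    """Map a list name from the search form to that list's index file."""
    if not LISTNAME_RE.match(listname):
        raise ValueError("invalid list name: %r" % listname)
    return os.path.join(ARCHIVE_ROOT, listname, "index.swish-e")


def authorize_search(listname, user, members):
    """Return the index path if the user is a member, otherwise None."""
    if user is None or user.lower() not in members:
        return None
    return index_path(listname)


# Hypothetical membership data; a real script would ask Mailman instead.
members = {"alice@stanford.edu", "member@example.org"}
print(authorize_search("example-list", "alice@stanford.edu", members))
print(authorize_search("example-list", "stranger@example.org", members))
```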

  • searchform.html and susearchform.html

The only difference is the form action: one posts to swish, the other to suswish. The suprivate script changes the form file name in an archived HTML file to susearchform.html for Stanford users; otherwise the default searchform.html is used.
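
The form swap performed by suprivate can be pictured as a simple string replacement on the page before it is sent out, as in the sketch below. It assumes that susearchform.html sits in the same /includes/ directory as searchform.html; the function name is hypothetical.

```python
DEFAULT_FORM = '<!--#include virtual="/includes/searchform.html" -->'
# Assumed location: same directory as the default form.
SU_FORM = '<!--#include virtual="/includes/susearchform.html" -->'


def use_su_search_form(page_html):
    """Point the SSI include at the WebAuth-protected search form."""
    return page_html.replace(DEFAULT_FORM, SU_FORM)


# Made-up fragment of an archived page.
page = '<!--#set var="listname" value=example-list -->\n' + DEFAULT_FORM + "\n<BODY>"
print(use_su_search_form(page))
```

Because the rewritten directive no longer contains the default form's path, applying the function twice leaves the page unchanged.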

  • SuTemplateDefault.pm

This Perl package is under /usr/lib/swish-e/perl/SWISH. It is a copy of the TemplateDefault.pm package, which contains routines to format HTML output. It is modified to use the listname value in output links so that message links and page navigation links in the search output go to the correct list archive. Here is the SuTemplateDefault.pm.
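
To illustrate the idea of carrying the list name through page navigation links, here is a small Python sketch (the real module is Perl). The script URL and the parameter names query, listname, and start are hypothetical and chosen only for this example.

```python
from urllib.parse import urlencode


def next_page_link(script_url, query, listname, start):
    """Build a navigation link that keeps the list name with the query."""
    params = {"query": query, "listname": listname, "start": start}
    return script_url + "?" + urlencode(params)


# Hypothetical script URL and search terms.
link = next_page_link("/cgi-bin/swish", "budget report", "example-list", 15)
print(link)
```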

Last modified Friday, 30-Nov-2007 09:06:34 AM
