Apache Nutch 1.0 Released

By Susam Pal on 28 Mar 2009

Today, we received an announcement from Nutch committer Sami Siren that Apache Nutch 1.0 has been released. An extract from the announcement:

Apache Nutch, a subproject of Apache Lucene, is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats.

Apache Nutch 1.0 contains a number of bug fixes and improvements such as Solr Integration, new indexing framework and new scoring framework just to mention a few. Details can be found in the changes file:

http://svn.apache.org/repos/asf/nutch/tags/release-1.0/CHANGES.txt

Apache Nutch is available for download from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

I have been waiting for this release for a long time. I made some contributions to this project, and I wanted them to be available in an official release so that I would not have to maintain a separate set of patches for myself. These were also my first major contributions to an open source project. Let me list my contributions from the CHANGES.txt file and then describe how I got involved in this project.

62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
    server. (Susam Pal via dogacan)

77. NUTCH-44 - Too many search results, limits max results returned from a
    single search. (Emilijan Mirceski and Susam Pal via kubes)

80. NUTCH-612 - URL filtering was disabled in Generator when invoked
    from Crawl (Susam Pal via ab)

81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)

In 2007, while playing with the search engine, I found that there was no way for Nutch to authenticate itself to intranet sites requiring HTTP authentication. I modified the module that deals with the HTTP protocol so that it could authenticate itself with configured credentials when the server challenged it. With this change, Nutch now supports the NTLM, Basic and Digest authentication schemes. More details can be found on the Nutch JIRA at NUTCH-559 and in the Nutch Wiki entry on HTTP authentication schemes.
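
To give a feel for what "authenticating when challenged" means, here is a minimal sketch of the simplest of the three schemes, Basic. This is not the code from the NUTCH-559 patch; the class and method names are hypothetical, and it assumes Java 8 or later for java.util.Base64. It only shows how a client could recognise a Basic challenge (a 401 response with a WWW-Authenticate header) and build the Authorization header to send on the retry.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration only; not part of Nutch.
public class BasicAuthSketch {

    // Returns true if the response looks like a challenge for the Basic scheme:
    // status 401 and a WWW-Authenticate header value that starts with "Basic".
    static boolean isBasicChallenge(int statusCode, String wwwAuthenticate) {
        return statusCode == 401
                && wwwAuthenticate != null
                && wwwAuthenticate.trim().regionMatches(true, 0, "Basic", 0, 5);
    }

    // Builds the Authorization header value for the Basic scheme:
    // the word "Basic", a space, then base64("username:password").
    static String basicAuthorization(String username, String password) {
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Example values; a real crawler would read these from its configuration.
        int status = 401;
        String challenge = "Basic realm=\"intranet\"";

        if (isBasicChallenge(status, challenge)) {
            // The request would be retried with this header attached.
            System.out.println("Authorization: "
                    + basicAuthorization("Aladdin", "open sesame"));
        }
    }
}
```

NTLM and Digest follow the same challenge-then-retry pattern, but the response is computed from values supplied in the challenge instead of being a plain encoding of the credentials, which is why they need considerably more code than this.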

NUTCH-44 and NUTCH-612 were bug fixes. NUTCH-601 now allows Nutch to perform deeper crawls using a live index. In the days of Nutch 0.9, the crawler complained if a directory with the name 'crawl' already existed in the current directory. As a result, before beginning a re-crawl using the bin/nutch crawl command, we had to move the existing crawl directory to another location. After a discussion in the community, we agreed that it was better to avoid shuffling the crawl directories by allowing re-crawls on the same directory.

The Nutch users' mailing list has often received emails from users who wanted to know how to enable support for authentication schemes in Nutch 0.9 by applying the NUTCH-559 patch. Patching Nutch 0.9 was a little cumbersome because the patch was generated against the trunk. With this release, users can simply download Nutch 1.0 and configure the authentication schemes.
