dmoz

Download

The current version (1.5.2) is available from the SourceForge project site.

Requirements

This script uses (and therefore requires):

  • XML::Twig
  • IO::File
  • URI::URL
Install the non-core modules through CPAN (IO::File ships with Perl):
perl -MCPAN -e 'install XML::Twig'
perl -MCPAN -e 'install URI::URL'

Input

Command Line: parse.pl [content.rdf.u8] [files.xml]

  • The file content.rdf.u8 from http://www.dmoz.org/
  • The file files.xml provided with this package

Output

A series of files containing the URLs listed in DMOZ beneath categories matching the regular expressions provided (see below). For each file definition, output is split into two files, filename.domain and filename.url, to allow for better use by content filters. All usernames, passwords, targets and query strings are stripped before being written to file.
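To make the stripping step concrete, here is a minimal sketch in Python (not the script's own Perl code). It assumes "targets" means the `#fragment` part of a URL; the function name `strip_url` and the sample URL are hypothetical. How dmoz.pl decides which entries go to filename.domain versus filename.url is not shown here.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_url(url):
    """Return (host, cleaned_url) with userinfo, query string and fragment removed."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Keep an explicit port, if any, but drop any "user:password@" prefix.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Empty strings for query and fragment remove them from the result.
    cleaned = urlunsplit((parts.scheme, netloc, parts.path, "", ""))
    return host, cleaned

host, cleaned = strip_url("http://user:secret@www.example.com/a/b.html?x=1#top")
print(host)     # www.example.com
print(cleaned)  # http://www.example.com/a/b.html
```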

content.rdf.u8

This file can be downloaded, gzipped, from http://dmoz.org/rdf.html. dmoz.pl expects content.rdf.u8 to be uncompressed and in the format used on 1/27/2004, though later formats may work.

IMPORTANT: See the UTF-8 note at the end of this document

From the DMOZ web site

The Open Directory follows in the footsteps of some of the most important editor/contributor projects of the 20th century. Just as the Oxford English Dictionary became the definitive word on words through the efforts of volunteers, the Open Directory follows in its footsteps to become the definitive catalog of the Web.

The Open Directory was founded in the spirit of the Open Source movement, and is the only major directory that is 100% free. There is not, nor will there ever be, a cost to submit a site to the directory, and/or to use the directory's data. The Open Directory data is made available for free to anyone who agrees to comply with our free use license.

files.xml

An XML file in the following format:

<Files>
        <File name="filename">
                <Description>A description of the file to be created</Description>
                <Include>^Top/Category.*$</Include>
                <Include>...</Include>
                <Exclude>.*Chat.*</Exclude>
                <Exclude>...</Exclude>
                ...
        </File>
        <File>
                ...
        </File>
        ...
</Files>

The suffixes '.url' and '.domain' will be appended to the filename to create two files for each File definition. The Include and Exclude blocks are regular expressions that will be applied to the DMOZ category id to determine whether that category's links should be output to 'filename.domain' and/or 'filename.url'.

The Include and Exclude blocks expect Perl regular expressions. Don't forget the "Top/" prefix before any absolute category definitions.
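As an illustration of how a definition file in this format can be read, the following Python sketch collects the Include and Exclude patterns for each File element. It is not taken from dmoz.pl (which uses XML::Twig); the file name "games", its description and its patterns are made-up sample values, and `load_rules` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the files.xml format described above.
SAMPLE = """<Files>
    <File name="games">
        <Description>Game sites</Description>
        <Include>^Top/Games.*$</Include>
        <Exclude>.*Chat.*</Exclude>
    </File>
</Files>"""

def load_rules(xml_text):
    """Map each File name to its lists of Include and Exclude patterns."""
    rules = {}
    for file_el in ET.fromstring(xml_text).findall("File"):
        rules[file_el.get("name")] = {
            "include": [e.text for e in file_el.findall("Include")],
            "exclude": [e.text for e in file_el.findall("Exclude")],
        }
    return rules

rules = load_rules(SAMPLE)
print(rules["games"]["include"])  # ['^Top/Games.*$']
print(rules["games"]["exclude"])  # ['.*Chat.*']
```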

For a category's URLs to be output to a file:

  1. The category id must match one or more Include blocks
  2. The category id must not match any Exclude blocks
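The two rules above can be sketched as follows. This is a Python illustration, not the script's Perl code: it assumes unanchored matching (like Perl's `=~`), uses Python's `re` module as a stand-in for Perl regular expressions, and the category ids and the name `category_selected` are hypothetical.

```python
import re

def category_selected(category_id, includes, excludes):
    """True if the id matches at least one Include and no Exclude pattern."""
    included = any(re.search(p, category_id) for p in includes)
    excluded = any(re.search(p, category_id) for p in excludes)
    return included and not excluded

includes = [r"^Top/Games.*$"]
excludes = [r".*Chat.*"]

print(category_selected("Top/Games/Board_Games", includes, excludes))  # True
print(category_selected("Top/Games/Chat", includes, excludes))         # False
print(category_selected("Top/Arts/Music", includes, excludes))         # False
```

For simple patterns like these, Python and Perl behave the same; more exotic Perl-only regex features would not carry over to this sketch.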

Where to go for help

This script is provided without warranty, support, sympathy or assistance of any kind, including the warranties of MERCHANTABILITY and FITNESS FOR A PARTICULAR PURPOSE. Enjoy.

UTF-8

Previously, a bug in the DMOZ data dump resulted in malformed UTF-8. This bug made the dmoz.pl script fail in all manner of strange and unpredictable ways. The version 1.5.0 dmoz.pl script does not attempt to parse the content.rdf.u8 file in UTF-8 mode. Instead, it treats the entire file as standard ASCII, and the regular expressions will match UTF-8 byte-by-byte instead of character-by-character.
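The practical effect of byte-by-byte matching is that a single "." no longer covers a non-ASCII character. The Python sketch below illustrates the general behaviour only (it is not dmoz.pl code, and the category id is just an example value):

```python
import re

# "\u00e7" is the letter c with cedilla; it is encoded as two bytes in UTF-8.
category = "Top/World/Fran\u00e7ais"
as_bytes = category.encode("utf-8")

# Character mode: the accented letter is one character, so one "." matches it.
char_match = re.search(r"Fran.ais", category) is not None
# Byte mode: the same letter is two bytes (0xC3 0xA7), so one "." is not enough.
byte_match_one_dot = re.search(rb"Fran.ais", as_bytes) is not None
byte_match_two_dots = re.search(rb"Fran..ais", as_bytes) is not None

print(char_match)           # True
print(byte_match_one_dot)   # False
print(byte_match_two_dots)  # True
```

Patterns that only use ASCII characters and greedy wildcards such as ".*" are unaffected by the difference.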

If you REALLY need UTF-8 support, first fix the DMOZ data dump and then change the line "no utf8" in dmoz.pl to read "use utf8". That's all there is to it.
