Directory organisation

bin - The programs
check - Regression tests
crawler - Crawler library (parser, crawl logic)
doc - Texinfo documentation
hooks - Interfaces to inverted index libraries
man - Manual pages
tools - General purpose libraries

Crawler library organisation

cookies.cc - Handling of HTTP cookies
crawl.cc - Driving logic for crawl
dirsel.cc - Determine whether a URL matches the Allow or Disallow
	    specifications of robots.txt
ftp.cc - FTP protocol driver
html.cc - Bind the html parser with the crawl logic
html_content.lxx - Parse the textual information of an HTML file
html_href.lxx - Parse the links from an HTML file
html_parser.c - Generic templates for html_content.lxx and html_href.lxx
http.cc - HTTP protocol driver
mime.cc - MIME types handling and mapping to/from extensions
robots.cc - Robot exclusion protocol implementation
robots_parser.lxx - Parser for robots.txt files
sqlutil.cc - Utility functions for MySQL
webbase.cc - Internal/External translation of start/url2start/url SQL tables
webbase_url.cc - Additional Internal/External translation of url SQL table
webbase_create.cc - Schema of the SQL database
webtools.cc - TCP/IP library wrapper implementing timeout and I/O callbacks

Crawler library conventions

Most files implement one object, in an object oriented style, and
their sources share a similar structure. Here is the framework:

In the header (cookies.h for instance)

//
// A structure that holds all the information (it would be a class in
// C++). The type name of the structure must be <file>_t, if possible.
//
typedef struct cookies {
  //
  // This is mandatory to remember the options
  //
  hash_t* options;

  cookies_entry_t current;
  ...
} cookies_t;

//
// External function declarations (all prefixed by <file>_ to avoid
// any name clash, all lowercase, words separated with _).
//
struct option* cookies_options(struct option options[]);
//
// Mandatory
//
void cookies_free(cookies_t* params);

In the source file (cookies.cc for instance)

//
// Mandatory
//
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif /* HAVE_CONFIG_H */

//
// System includes
//
#include <stdio.h>
...

//
// webbase includes
//
#include <hash.h>
...

static int verbose = 0;

//
// For the semantics of this structure, see GNU getopt(3)
//
static struct option long_options[] =
{
  /*- Verbose cookies handling */
  {"verbose_cookies", 0, &verbose, 1},
  ...
  //
  // *_OPTIONS must be defined in the cookies.h file to a value
  // not used by any other file.
  //
  {0, 0, 0, COOKIES_OPTIONS}
};

//
// Static functions declarations
//
static cookies_entry_t* cookies_parse(cookies_t* cookies, uri_t* url_object, char* cookie);
...
//
// Mandatory
//
static cookies_t* params_alloc();


//
// Return the options for this file. They are used by getopt_merge
// (from the tools directory).
//
struct option* cookies_options(struct option [])
{
  return long_options;
}

//
// Allocate an object according to the arguments given.
//
cookies_t* cookies_alloc(int argc, char** argv, struct option options[])
{
  cookies_t* params = params_alloc();
  //
  // The body of this function is similar for every object
  // and must be modeled after an existing one.
  //
  return params;
}

void cookies_free(cookies_t* params)
{
  //
  // Free all the resources used by params, including params itself
  //
}

static cookies_t* params_alloc()
{
  //
  // Allocate a cookies_t object and initialize it to empty;
  // cookies_alloc will then fill it according to the options.
  //
}
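
As an illustration of the framework above, here is a minimal sketch of
a complete file that follows the same conventions. It is not taken
from the webbase sources: the "demo" module, the demo_t structure and
the verbose_demo and demo_depth options are all hypothetical. It also
assumes a system that provides getopt_long in <getopt.h>, and it ends
the option table with a plain zero entry instead of a *_OPTIONS value,
because getopt_merge is not used here.

```cpp
// Sketch only: "demo" is a hypothetical module, not part of webbase.
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>

//
// Hypothetical object type, named <file>_t as the convention asks.
//
typedef struct demo {
  int verbose;  // copy of the verbose flag taken at allocation time
  int depth;    // value of the hypothetical --demo_depth option
} demo_t;

static int verbose = 0;

static struct option long_options[] =
{
  // flag != 0: getopt_long stores val (1) in verbose and returns 0
  {"verbose_demo", no_argument, &verbose, 1},
  // flag == 0: getopt_long returns val ('d'), argument is in optarg
  {"demo_depth", required_argument, 0, 'd'},
  // plain terminator (assumption: no getopt_merge in this sketch)
  {0, 0, 0, 0}
};

static demo_t* params_alloc();

//
// Return the options for this file.
//
struct option* demo_options(struct option [])
{
  return long_options;
}

//
// Allocate an object according to the arguments given.
//
demo_t* demo_alloc(int argc, char** argv, struct option options[])
{
  assert(argv != NULL && options != NULL);
  demo_t* params = params_alloc();
  if (params == NULL) return NULL;

  optind = 1;  // restart the scan from the first argument
  opterr = 0;  // stay quiet about options this file does not know
  int c;
  while ((c = getopt_long(argc, argv, "", options, NULL)) != -1) {
    if (c == 'd') params->depth = atoi(optarg);
  }
  params->verbose = verbose;
  return params;
}

//
// Free all the resources used by params, including params itself.
//
void demo_free(demo_t* params)
{
  free(params);
}

//
// Allocate a demo_t object and set it to empty (all fields zero).
//
static demo_t* params_alloc()
{
  return (demo_t*)calloc(1, sizeof(demo_t));
}
```

A caller would typically write demo_alloc(argc, argv, demo_options(0))
and release the object with demo_free once it is no longer needed.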
