Fri Mar 02 11:44:20 2001  Loic Dachary  <loic@senga.org>

	* man/crawler.1: home_re documentation

Fri Mar 02 11:33:44 2001  Loic Dachary  <loic@senga.org>

	* check/consistentc_test: regression tests for home_re table
	  implementation.

Thu Mar 01 17:34:56 2001  Loic Dachary  <loic@senga.org>

	* crawler/server.cc: converted the whole code to use the
	  same object model as cookies and other modules. Add loading
	  of the home_re table and stub code to calculate the 
	  server name from the regular expression.

Thu Mar 01 11:11:01 2001  Loic Dachary  <loic@senga.org>

	* crawler/webbase.cc (webbase_alloc): change ECILA to WEBBASE

Tue Jan 23 11:47:49 2001  Loic Dachary  <loic@senga.org>

	* crawler/http.cc (http_header): only reset modification time
	  if the document is not NOT_MODIFIED. The NOT_MODIFIED response
	  header may not contain the last-modification field and we
	  already have it.

Mon Jan 22 16:45:16 2001  Loic Dachary  <loic@senga.org>

	* bin/crawler.cc (crawlers): forward signals to children 
	  crawlers.

Mon Jan 22 16:02:54 2001  Loic Dachary  <loic@senga.org>

	* crawler/crawlsig.c: Signal handling that prevents interrupting
	  the crawl operation in the middle.

Mon Jan 22 12:11:26 2001  Loic Dachary  <loic@senga.org>

	* hooks/webbase_hook_mifluz.cc (hook_update): do nothing if
	  HTTP code is NOT_MODIFIED. Prevent trashing index content
	  on NOT_MODIFIED condition.

	* crawler/crawl.cc (mirror_2): Consider -noheuristics in -touch
	  to force recrawling a specific URL.

Mon Jan 15 00:13:42 2001  Loic Dachary  <loic@senga.org>

 	* webbase-5.16 release

	* crawler/crawl.cc (mirror_scheme): Add EINTR as an alias
	  to timeout. Occurs when DNS lookup fails and was not 
	  considered a timeout condition.

	* bin/crawler.cc (init): Add -version option

Wed Jan 10 09:01:21 2001  Loic Dachary  <loic@senga.org>

	* man/crawler.1: Add max_allowed_packet information written by
	  Otis Gospodnetic <otis_gospodnetic@yahoo.com>.

Tue Jan 09 18:53:13 2001  root  <loic@senga.org>

	* man/crawler.1: Fix typos and kill references to var/cache.

Tue Jan 09 18:29:30 2001  Loic Dachary  <loic@senga.org>

	* hooks/webbase_hook_mifluz.cc (hook_prepare): free res only
	  after using it, not before. 

Tue Jan 09 17:48:44 2001  Loic Dachary  <loic@senga.org>

	* crawler/webbase.cc (webbase_alloc): read mysql options from 
	  "my" virtual file name instead of $HOME/.my.cnf, including
	  all possible variations.

Sat Dec 30 20:37:46 2000  Loic Dachary  <loic@senga.org>

	* crawler/crawl_private.h (CRAWL_USER_AGENT): use configure.in
	  imported version number instead of hardwired.

Thu Dec 28 15:02:27 2000  root  <loic@senga.org>

 	* webbase-5.15 release

	* crawler/webbase_url.h (WEBBASE_URL_START_STATE_MASK): bad mask,
	  fix it using symbolic names instead of hardwired const.

Wed Dec 27 08:37:08 2000  Loic Dachary  <loic@senga.org>

	* tools/khash.{h,c}: rename hash.h and hash.c to khash.h and
	  khash.c and prefix symbols that conflict with hash_* symbols
	  exported by mysql-3.23.29a-gamma.

	* bin/consistentc.cc (main): allow -servers and -keys at the
	  same time.

Tue Dec 26 13:19:25 2000  Loic Dachary  <loic@senga.org>

	* crawler/http.cc (http_header): reset mtime to 0 before
	  parsing headers so that the old mtime is not kept for ever.
	  Use Date: field if Last-Modified: is not available.

	* check/index_test: add Iupdate and Idelete tests

	* hooks/webbase_hook_mifluz.cc: implement hook_prepare,
	  hook_update and hook_delete.

	* crawler/crawl.cc (mirror_2): call hook_prepare *before*
	  crawl so that hook can store the previous content of the
	  document.

	* hooks/webbase_hook_mifluz.cc (hook_getopt): MIFLUZ_CONFIG 
	  takes precedence over the default webbase_mifluz configuration 
	  file.
	
	* hooks/webbase_hook.h: kill tdelete method, only keep
	  delete_id method.

Sat Dec 23 18:14:55 2000  Loic Dachary  <loic@senga.org>

 	* webbase-5.14 release

	* webbase.spec.in: add
	
Tue Dec 19 18:05:46 2000  Loic Dachary  <loic@senga.org>

	* crawler/webbase_create.cc (webbase_schema_content): 
	  add MAX_ROWS=16000000 to potentially big tables.

Fri Nov 22 15:57:22 2000  Loic Dachary  <loic@senga.org>

	* hooks/webbase_hook_mifluz.cc: fix bad location position

	* check/conf/mifluz.conf: remove wordkey description so that
	  default is taken instead.

	* crawler/webbase_create.cc (webbase_create): Add const char**
	  cast because gcc-2.95 complains.

Fri Nov 10 15:57:22 2000  Loic Dachary  <loic@senga.org>

	* bin/crawler.cc (crawlers): implement -crawlers, -crawlers_nproc
	  and -crawlers_chunk to run many crawlers in parallel.

	* tools/logfile.c (logfile): kill LOGDIR env variable.

Fri Nov 10 11:57:22 2000  Loic Dachary  <loic@senga.org>

	* configure.in: add --with-split to allow split tables
	  for url_content to bypass the 2GB limit.

Thu Nov 09 17:12:26 2000  Loic Dachary  <loic@senga.org>

	* crawler/sqlutil.h: Create sql_select_exists.

	* crawler/*.{cc,h}: HTML content is stored in url_content
	  table instead of files. This becomes the default and
	  the old behaviour may be restored by removing the 
	  -DWEBBASE_CONTENT_BASE in the Makefile.am files.

Wed Nov 08 13:29:24 2000  Loic Dachary  <loic@senga.org>

	* crawler/crawl.cc: include html_content.h even if LANGREC
	  is not defined to satisfy *indexable* functions.

Tue Nov 07 15:23:22 2000  Loic Dachary  <loic@senga.org>

	* hooks/webbase_hook_mifluz.cc (hook_insert): reset location
	  at the beginning of each document and not each time a chunk
	  of text is processed.

	* bin/crawler.cc (main): add -show_indexable options to display
	  indexable text.

	* crawl/crawl.cc: implement hp_show_indexable

	* crawl/sqlutil.h: Add WEBBASE_INTEGER_VALUE_SIZE macro
	
Sat Nov 04 00:16:13 2000  Loic Dachary  <loic@senga.org>

	* hooks/webbase-mifluz.conf: Mirror the builtin configuration
	  of webbase_hook_mifluz.cc and install in sysconfdir.

	* hooks/webbase_hook_mifluz.cc: load DEFAULT_CONFIG_FILE
	  if available (see Makefile.am for definition).

	* hooks/Makefile.am: Default directory for full text index
	  is var/lib/webbase instead of var/webbase and is created
	  if it does not exist.

	* crawler/dirsel.cc (dirsel_match): Fix the prefix comparison
	  that was not working anymore since the modification to 
	  compare with uri_all_path instead of uri_path. 

	* crawler/webbase_create.h (WEBBASE_SCHEMA_*): make room for the
	  servers table.

Fri Nov 03 17:04:32 2000  Loic Dachary  <loic@senga.org>

	* bin/consistentc.cc: allocate a crawl object instead of a simple
	  webbase object.

	* crawler/robots.h (robots_p): remove the server_id hack

	* crawler/server.[ch]: Add server identifier allocation and 
	  maintenance functions based on the new table servers. 

	* hooks/webbase_hook.h: url2server hook function to compute the
	  server part of a URL.

	* bin/consistentc.cc: repair the servers table content

	* configure.in (CHECK_ZLIB): Add detection so that libmysqlclient
	  will find compress in it if needed.

Fri Nov 03 11:57:20 2000  Loic Dachary  <loic@senga.org>

	* check/index_test: set LTDL_LIBRARY_PATH to allow dynamic loading
	  of hook before installation.

Fri Nov 03 11:19:37 2000  Loic Dachary  <loic@senga.org>

	* tools/webbasedl.{h,c}: dynamic loading interface, replaces
	  WebbaseDL.{cc,h}

	* tools/WebbaseGetopt.{cc,h}: kill

	* hooks/webbase_hook.h: replace hooks/WebbaseHook.{cc,h}

	* hooks/webbase_hook_mifluz.cc: replace hooks/WebbaseHookMifluz.{cc,h}

	* crawler/crawl.cc: replace hardwired link to mifluz by dynamic loading
	  of hook library. Replaced the C++ sources by C-like sources to avoid
	  shared lib loading problems due to C++ symbols propagation.

	* crawler/crawl.cc: suppress -no_hook and add -hook <lib> option.

Thu Nov 02 16:07:45 2000  Loic Dachary  <loic@senga.org>

	* hooks/webbase_hook_mifluz.cc: rewrite WebbaseHookMifluz in C
	  to make it work as a dynamically loadable plugin.

	* hooks/webbase_hook.h: no inheritance, only a header file that
	  defines the function list that is the interface to the hook.

Tue Oct 31 12:12:24 2000  Loic Dachary  <loic@senga.org>

	* man/crawler.1: Add -touch documentation

	* crawler/webbase.cc: move locking information to verbose level
	  2. Print query for webbase_url_walk.
	
	* hooks/WebbaseHook.cc (RebuildInit): add -where_url argument
	  to update set hookid = 0

	* crawler/crawl.cc (crawl_rebuild): forward -where_url argument
	  to RebuildInit

Tue Oct 31 11:16:23 2000  Loic Dachary  <loic@senga.org>

	* crawler/webbase_create.cc: increase field sizes. Add list of 
	  unknown extensions, add compression mime types (arj/rar/ace/zip)
	  with [arcz][0-9][0-9] extensions.

Mon Oct 30 18:02:36 2000  Loic Dachary  <loic@senga.org>

	* crawler/webbase_url.h (WEBBASE_URL_START_*): renumber flags
	  because AUTH flag was removed.

Fri Oct 27 12:18:19 2000  Loic Dachary  <loic@senga.org>

 	* webbase-5.13 release

	* crawler/Makefile.am (INCLUDES): fix buggy variable 
	  DEFAULT_FILE_CACHE
	
Thu Oct 26 16:33:55 2000  Loic Dachary  <loic@senga.org>

	* hooks/WebbaseHookMifluz.cc (Parse): use standard mifluz configuration
	  files if no -index_config provided.
	
	* hooks/WebbaseHookMifluz.cc (InsertContent): Add key overflow checks.

	* crawler/crawl.cc: kill -auth option. The authorization information
	  is included in the URL.
	
	* man/crawler.1: Proofread the page.

	* crawler/crawl.cc: Implement the -agent option
	
	* crawler/crawl.cc (mirror_request_http): when noheuristics
	  was set, scheduling was disabled in some places. Scheduling
	  must still occur. Noheuristics is only a way to reset the 
	  crawling time of each URL

	* crawler/webbase.cc (webbase_alloc): remove net_buffer_length
	  option. Can now be set in MySQL configuration file.

	* acinclude.m4 (CHECK_URI): allow --with-mifluz alone

Thu Oct 26 10:18:31 2000  Loic Dachary  <loic@senga.org>

	* man/crawler.1: document crawler -index_conf options

Fri Oct 20 12:58:09 2000  Helios de Creisquer  <creis@zehc.net>

	* man/crawler.1: Correction of examples.

Thu Oct 19 11:26:34 2000  Loic Dachary  <loic@senga.org>

	* crawler/html_{href,content}.lxx: use %pointer for faster parsing.

Wed Oct 18 10:26:13 2000  Loic Dachary  <loic@senga.org>

	* crawler/webbase_create.h (WEBBASE_SCHEMA_LENGTH):
	  add function webbase_table_schema.

	* crawler/crawl.cc (mirror_cleanup): Useless loops occurred when
	  listing records in start2url that are linked with other 
	  starting points. The distinct in "select distinct a.url ..."
	  was missing, so the query listed as many entries as there were
	  links to a given URL. The code was not buggy but performed very
	  poorly when a lot of links existed on a given URL.
	
	* crawler/crawl.cc,bin/crawler.cc,man/crawler.1: implement the
	  -show, -show_fields and -show_where options.

	* man/crawler.1: MD5 hash calculation hint

	* STATISTICS (base): Add

	* acinclude.m4 (ZLIB_HOME): -with-mysql was not working
	  without argument (set to yes).

	* bin/crawler.cc (init): do not require -base when -schema
	  is specified.

Mon Oct 16 17:55:36 2000  Loic Dachary  <loic@senga.org>

	* crawler/crawl.cc (mirror_http): use mknod instead of create to
	  create the file without opening it.

Mon Oct 16 10:11:36 2000  Helios de Creisquer <creis@zehc.net>

	* crawler/dirsel.cc (dirsel_comparable): function returns
	uri_all_path instead of uri_path. This allows disallow and
	allow strings to match in query params.

Sun Oct 15 11:19:59 2000  Loic Dachary  <loic@senga.org>

	* Makefile.am (install-data-local): create the cache in /var/cache 
	  instead of /var/lib/webbase to conform to the FHS-2.1 
	  (http://www.pathname.com/fhs/).

Fri Oct 13 07:31:42 2000  Loic Dachary  <loic@linux1.compile.sourceforge.net>

	* Makefile.am (rpm): Add instructions to enable additional configure
	  flags when building rpms.

	* configure.in (AC_OUTPUT): add $(AM_CONFIGFLAGS) to the distcheck
	  configure invocation in Makefile.

	* webbase.spec: simplify the specifications

Fri Oct 13 03:13:48 2000  Loic Dachary  <loic@linux1.compile.sourceforge.net>

	* check/*_test: remove $srcdir in front of test_functions

	* acinclude.m4 (CHECK_MYSQL): Add --with-mysql-lib and 
	  --with-mysql-include to support installation sites where the
	  architecture dependent files are installed in a separate tree.

Thu Oct 12 10:50:01 2000  Helios de Creisquer  <creis@zehc.net>

	* crawler/crawl.cc (mirror_http): f_lang uses raw_path to
	use a unique temp file instead of /tmp/file.langrec

Wed Oct 11 11:24:08 2000  Helios de Creisquer  <creis@zehc.net>

	* man/crawler.1,doc/webbase.texi: document default values and
	add clarifications.

Tue Oct 10 16:58:35 2000  Helios de Creisquer  <creis@zehc.net>

	* check/test_functions.in, check/conf/Makefile, check/conf/httpd.conf.in,
	acinclude.m4, configure.in: Changed CHECK_USER -> CHECK_ID and add Group
	attribute in httpd.conf

Tue Oct 10 16:01:15 2000  Helios de Creisquer  <creis@zehc.net>

	* check/conf/httpd.conf.in: Comment log_agent log_referer mod_rewrite

Tue Oct 10 15:42:13 2000  Loic Dachary  <loic@senga.org>

	* crawler/webbase.cc: Uncomment .my.cnf handling
	
Wed Sep 13 11:26:15 2000  Loic Dachary  <loic@senga.org>

	* webbase-5.12 release

Thu Aug 10 19:09:15 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* check/test_quantify.cc: creation. crawl the same URL 10000 times

Wed Aug 09 18:25:49 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* check/t/regex.expect: new version of regression test (more
	complex case)

	* check/htdocs/regex.html: created a specific file for regex
	regression test.

Tue Aug 08 13:39:20 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* crawler/crawl.cc (html_content_collect_begin): store text to
	parse in a file (instead of a buffer) and get the whole content
	text.

Thu Aug 03 18:49:42 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* Makefile.am, check/Makefile.am: added missing EXTRA_DIST variables

Thu Aug 03 09:44:39 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* crawler/crawl.cc (mirror_http), crawler/crawl.cc
	(html_content_collect_begin): modified langrec mechanisms: get
	the entire document and parse it. No longer use the webbase_url
	struct.

Wed Aug 02 23:50:53 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* check/t/Iregex.expect: creation. reference output for regression
	test for regexp filtering on HREFs

	* check/index_test: added regression test for regexp filtering on
	HREFs

	* crawler/dirsel.h, crawler/dirsel.cc, crawler/crawl.cc,
	crawler/webbase_url.h: extended Allow/Disallow clauses to accept
	regexps

Wed Aug 02 18:00:29 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* doc/webbase.texi: finished integration of admin guide in
	webbase.texi

	* check/index_test: added Irebuild_where test
	(regression test for -where_url option specified in addition to
	-rebuild)

Wed Aug 02 13:15:09 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* doc/Makefile.am: extended to also generate .txt,
	.html and .ps documentation format

	* doc/*.txt: touched to be able to generate .txt file

	* doc/webbase.txt, doc/webbase.html: creation
	
Wed Aug 02 12:04:52 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* check/t/Ilangrec.expect, check/htdocs/langrec.html,
	check/htdocs/langrec-2.html, check/langrec_test: add a regression
	test for language recognition

Wed Aug 02 11:18:32 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* doc/webbase.texi: started integration of administrator's guide

	* aclocal.m4 (URI_HOME): langrec includes must be searched in
	${LANGREC_HOME}/include/langrec and not ${LANGREC_HOME}/include

	* crawler/crawl.cc: corrected 2 bugs in language recognition
	functionality (unlink of a path that is still used afterwards,
	added a #define to check if path of sorted dictionaries is specified)

Wed Aug 02 00:49:24 2000  Loic Dachary  <loic@senga.org>

	* check/index_test (Irebuild): only compare dict to avoid 
	  non-significant difference on the list of temporary files
	  in the index.

Tue Aug 01 15:54:01 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* check/webbase_test: corrected wrong SQL query

Tue Aug 01 15:29:41 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* crawler/crawl.cc (mirror_http): add a test to call
	textlang_in_string only if the reference string is not empty.

Sun Jul 30 07:15:25 2000  Loic Dachary  <loic@senga.org>

	* doc/webbase.texi (Indexer): remove obsolete section about
	  --with-key.

Fri Jul 28 17:07:02 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* doc/*.fig: readded all xfig figures

Thu Jul 27 20:07:47 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* doc/*.eps: added new eps figures 

Wed Jul 26 18:52:05 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* doc/*.eps, doc/admin_guide.tex: creation. 

	* doc/*.fig: deletion (latex documents use .eps graphics)

Wed Jul 26 10:47:20 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* bin/crawler.cc (init): added -where_url option (in options
	struct and in crawler_params_t structure).

	* crawler/crawl.cc: added #define LANGREC around
	html_content_collect_begin

Tue Jul 25 18:13:17 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* doc/http_header.fig, doc/html.fig: creation

Tue Jul 25 13:32:51 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* crawler/crawl.cc (crawl_rebuild), crawler/crawl.h: add the
	-where_start option to the rebuild functionality

Tue Jul 25 10:06:20 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* doc/crawl.fig, doc/crawl_rebuid.fig, doc/crawl_urls.fig,
	doc/interaction_structs.fig, doc/robots.fig,
	doc/webbase_other_struct.fig, doc/webbase_url_start.fig,
	doc/webtools.fig, dirsel.fig, dirsel_cted.fig, cookies.fig :
	creation

Mon Jul 24 23:25:10 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* crawler/crawl.cc (mirror_http): added language recognition
	mechanism (uses langrec module)

Fri Jul 21 16:19:04 2000  Loic Dachary  <loic@senga.org>

	* check/index_test: fix to add -s index to htdb_dump and rebuild the
	  reference output. Works with mifluz-0.19.
	
Fri Jul 21 15:13:54 2000  Loic Dachary  <loic@senga.org>

	* acconfig.h: Add all symbols for AC_MANDATORY_* macros. autoheader
	  has builtin rules to generate those for AC_CHECK_* macros and there
	  is no easy way to add new definitions for AC_MANDATORY_.

Fri Jul 21 13:42:25 2000  Loic Dachary  <loic@senga.org>

	* acinclude.m4 (CHECK_LANGREC): not wanted is the default

Fri Jul 21 12:54:53 2000  Loic Dachary  <loic@senga.org>

	* THANKS,AUTHORS,INSTALL,NEWS: added gnu compliant files

Thu Jul 20 19:36:26 2000  Loic Dachary  <loic@senga.org>

	* acinclude.m4 (CHECK_MIFLUZ): not wanted is the default

Thu Jul 20 09:22:51 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* crawler/crawl.cc (crawl_urls): bug fix, the 'where'
	  clause of the sql query was wrong.

Wed Jul 19 15:46:59 2000  Loic Dachary  <loic@senga.org>

	* DEBUGGING: update purify instructions

	* acinclude.m4: quote macro names as instructed by 
	  Peter Simons <simons@research.cys.de>

Wed Jul 19 14:26:16 2000  Loic Dachary  <loic@senga.org>

	* acinclude.m4 (AC_MANDATORY_HEADER, AC_MANDATORY_HEADERS, 
	  AC_MANDATORY_LIB): create these macros as helpers for package 
	  inclusion macros such as CHECK_LANGREC.

	* acinclude.m4 (CHECK_MIFLUZ, CHECK_LANGREC, CHECK_ZLIB): 
	  use *MANDATORY* macros. Upgrade the CHECK_ZLIB macro from
	  mifluz/acinclude.m4, add sanity check in CHECK_LANGREC and
	  CHECK_MIFLUZ to avoid redundant inclusion of -I/usr/include.
	  Update documentation of macros.

	* acinclude.m4,configure.in (CHECK_MYSQL, CHECK_URI):
	  move checking of MySQL and uri libraries to acinclude.m4 and
	  use *MANDATORY* macros.

Wed Jul 19 13:51:09 2000  Loic Dachary  <loic@senga.org>

	* configure.in (CPPFLAGS): remove double inclusion of 
	  CHECK_LANGREC

Sun Jul 17 10:35:24 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* tools/getopttools.h (struct option_help): moved option_help
	structure from getopt.h to this file

	* aclocal.m4: creation of the CHECK_LANGREC macro (check if
	langrec module is available)

	* configure.in: added a call to the CHECK_LANGREC macro. reordered
	library calls because of unresolved symbols during linking

Thu Jul 13 14:22:46 2000  Loic Dachary  <loic@senga.org>

	* check/conf/httpd.conf.in: remove mod_env and mod_auth loading
	  because unused and not always available.

	* crawler/crawl.cc: remove bodyparse.h

Tue Jul 11 13:08:34 2000  Benoit Orihuela  <benoit.orihuela@idealx.com>

	* tools/getopt.h (struct option_help): set struct fields non-const
	(modified in getopt_help_merge).

	* tools/getopttools.c (getopt_dump): changed the way to display
	information about available options.

Tue Jul 11 01:27:00 2000  Benoit Orihuela    <borihuela@idealx.com>

        * tools/getopt.h (struct option_help): created option_help
        structure in order to implement -help feature (contains the name
        and description of each option)

        * tools/getopttools.c (getopt_help_merge): creation. does the
        same work as getopt_merge for the option_help structure

        * bin/crawler.cc (init): added option_help structure

        * bin/consistentc.cc, bin/dumpdata.cc, bin/html2text.cc (init):
        added option_help structure. added -help option.

        * crawler/cookies.cc, crawler/crawl.cc, crawler/robots.cc,
        crawler/webbase.cc, crawler/webtools.cc : added option_help
        structure for these modules, added cookies_help_options function
        to retrieve information about help options.

Thu Jun 15 19:34:06 2000  root  <root@chez.com>

	* hooks/WebbaseHookMifluz.h: virtual Parse() to init mifluz

Thu Jun 15 17:42:19 2000  Loic Dachary  <loic@senga.org>

	* crawler/webtools.cc (webtools_open): Add cache for hostname lookup
	  to reduce DNS traffic.

Thu Jun 15 16:56:20 2000  Loic Dachary  <loic@senga.org>

	* crawler/webbase_create.cc: Add .tap, .rar unknown extensions.

	* crawler/html_content.lxx (html_content_parse_print_one): use
	  unsigned char to prevent negative lookup in isspace and isalnum.

Thu Jun 15 11:40:44 2000  Loic Dachary  <loic@senga.org>

	* bin/consistentc.cc (fix_one_key): Fix unbounded usage of 
	  str2md5ascii_simple.

	* crawler/ftp.cc: use SETSOCKOPT_ARG3

	* acinclude.m4: create AC_COMPILE_SETSOCKOPT

	* configure.in: use AC_COMPILE_SETSOCKOPT

	* acconfig.h: Add SETSOCKOPT_ARG3

Tue Jun 13 17:47:40 2000  root  <root@chez.com>

	* check/conf/mifluz.conf: create

Fri Jun 09 19:52:35 2000  Loic Dachary  <loic@senga.org>

	* crawler/webbase_url.cc (webbase_url_file): use rowid instead of
	  MD5 key to generate the file name.

Fri Jun 09 18:55:19 2000  root  <root@chez.com>

	* webbase: create default localstate directory for installation

	* crawler/crawl.cc: use uri_set_root with DEFAULT_FILE_CACHE
	  /var/lib/webbase/file_cache by default

	* crawler/crawl.cc: -file_cache specify alternate DEFAULT_FILE_CACHE

	* hooks/WebbaseHookMifluz.cc: -index_file specify alternate index file
	  DEFAULT_INDEX_DIR/index

	* hooks/Makefile.am: DEFAULT_INDEX_DIR set to /var/lib/webbase
	
Thu Jun 08 18:14:52 2000  root  <root@chez.com>

	* crawler/mime.cc (mime_accept_1): Fix inverted test on
	  unallocated, which slowed down updates and leaked memory.

Wed Jun 07 19:10:09 2000  root  <root@chez.com>

	* man/consistentc.1: Add manual page

	* man/crawler.1: Fix options, more -create info

Wed Jun 07 17:05:15 2000  root  <thy-@ctkproda-17.chez.com>

	* hooks/WebbaseHookMifluz.cc (InsertContent): upgrade to Mifluz-0.17

Thu Apr 20 16:55:08 2000  Loic Dachary  <loic@senga.org>

	* webbase-5.11 release

Fri Apr 14 16:15:03 2000  Loic Dachary  <loic@senga.org>

	* crawler/webbase_url.h,webbase.h,webtools.h: Add comments

Tue Apr 12 19:56:27 2000  Loic Dachary  <loic@senga.org>

	* crawler/{cookies,http,dirsel,mime,robots}.cc: Add comments

	* crawler/html*.cc: Proofread comments 

Tue Apr 11 19:56:27 2000  Loic Dachary  <loic@senga.org>

	* ROADMAP: Create: information on files/directory organisation

	* crawler/crawl.cc: Add comments

Wed Apr 05 13:32:30 2000  Loic Dachary  <loic@senga.org>

	* crawler/dirsel.cc (dirsel_init): raise hash limit to 
	  HASHCOUNT_T_MAX.

Tue Apr 04 17:02:10 2000  Loic Dachary  <loic@senga.org>

	* hooks/WebbaseHookMifluz.cc (InsertContent): use Override instead
	  of Insert (75% faster, counts for ~30% of total CPU used).

Tue Mar 14 00:07:16 2000  Loic Dachary  <loic@ceic.com>

	* webbase-5.10 release

	* hooks/WebbaseHookMifluz.cc (RebuildStart): use O_TRUNC to
	  reset index instead of removing the file. Created problems
	  with open environment and required.

	* acinclude.m4: upgraded APACHE macro to find modules
	  in lib/apache.

	* acinclude.m4 (CHECK_MIFLUZ): only depends on mifluz, not
	  htdb as of mifluz-0.14

	* hooks/WebbaseHookMifluz.cc (WebbaseHookMifluz): upgrade to new
	  WordKey format.

Tue Feb 29 10:54:28 2000  Loic Dachary  <loic@ceic.com>

	* crawler/webtools.cc (webtools_open_1): reset alarm before 
	  returning.

Tue Feb 15 18:34:04 2000  Loic Dachary  <loic@ceic.com>

	* webbase-5.9 release

	* acinclude.m4: use mifluz.h instead of wordlist.h to detect mifluz.

	* hook/WebbaseHookMifluz.{cc,h}: use HAVE_MIFLUZ_H instead of 
	  HAVE_WORDLIST_H. 

Tue Feb 01 10:07:26 2000  Loic Dachary  <loic@ceic.com>

	* bin/crawler.cc (main): Canonicalize arguments before using
	  them.

Sun Jan 30 13:35:11 2000  Loic Dachary  <loic@ceic.com>

	* webbase-5.8 release

	* hooks/WebbaseHookMifluz.cc (WebbaseHookMifluz): do not
	  activate monitoring by default.

Wed Jan 26 13:41:03 2000  Loic Dachary  <loic@ceic.com>

	* crawler/*.cc: fix numerous warnings about unused arguments
	
	* acinclude.m4: activate -Werror if possible in AC_PROTOTYPE

	* acinclude.m4: implement AC_PROTOTYPE, AC_PROTOTYPE_GETSOCKNAME and
	  AC_PROTOTYPE_ACCEPT

	* crawler/ftp.cc: use GETSOCKNAME_ARG3 for getsockname and ACCEPT_ARG3 for
	  accept

Thu Jan 20 10:47:40 2000  Loic Dachary  <loic@ceic.com>

	* configure.in: socklen_t at beginning of file

Thu Jan 20 10:11:52 2000  Loic Dachary  <loic@ceic.com>

	* */Makefile.am: pkginclude_HEADER so that .h are located in
	  a subdirectory, avoiding conflicts.

	* webbase.spec: Added spec file contributed by 
	  Davor Cengija <davor@linuxfan.com>

Mon Jan 17 17:20:27 2000  Marcel Bosc  <bosc@ceic.com>

	* hooks/WebbaseHookMifluz.cc (InsertContent): fixed memory leak

Mon Jan 17 15:10:32 2000  Marcel Bosc  <bosc@ceic.com>

	* crawler/ftp.cc : replaced socklen_t by configure-determined
	SOCKET_LENGTH_T

	* configure.in : added message for socklen_t

	* acinclude.m4 : added macro for testing type in function 
	prototype

	* configure.in: added prototype checking for getting 
	type of socklen_t

Mon Jan 17 12:08:44 2000  Loic Dachary  <loic@ceic.com>

	* webbase-5.7 release
	
	* man/crawler.1: Add hook options documentation and -- terminator
	  warning.

Fri Jan 14 13:45:55 2000  Loic Dachary  <loic@ceic.com>

	* hooks/WebbaseHookMifluz.cc (WebbaseHookMifluz): config a pointer, allocated
	  from WordContext::Initialize from ~/.mifluz

Thu Jan 13 12:28:02 2000  Loic Dachary  <loic@ceic.com>

	* hooks/WebbaseHookMifluz.{cc,h}: wordRef now a pointer because
	  wordkey has to be defined by config before we can allocate one.

Tue Jan 11 13:10:52 2000  Loic Dachary  <loic@ceic.com>

	* acinclude.m4 (CHECK_MIFLUZ): use htdb instead of db

Wed Jan 05 11:14:45 2000  Marcel Bosc  <bosc@ceic.com>

	* hooks/*: Adapted to changes in mifluz key mechanism
	(as of mifluz-0.11)

Tue Jan 04 14:01:31 2000  Loic Dachary  <loic@ceic.com>

	* hooks/*: restructuring. WebbaseHook.cc is a base class
	  for indexers. WebbaseHookMifluz.cc is a derived class that
	  implements the interface to mifluz-0.10. 

	* bin/*.cc,hooks/*.cc: use "-" in getopt_long to prevent reordering
	  of options. It does not solve everything. The end of options is
	  not found anymore.
	
	* crawler/html_content.lxx: fix redundant verbose + incorrect
	  fill_href prototype

	* crawler/*.cc, bin/*.cc: unknown options do not trigger errors.
	  This is annoying but dramatically simplifies the implementation
	  of options handling. No cascade or option collection is necessary,
	  each module handles options it recognizes.

	* crawler/*.c : convert to C++, to start migration to C++

	* tools/WebbaseGetopt.cc: base class for options specific to a class.

	* tools/WebbaseDl.cc: encapsulate dynamic loading using libtool libltdl
	  for webbase purposes.

Wed Dec 15 11:37:14 1999  Loic Dachary  <loic@ceic.com>

	* webbase-5.6 release

	* crawler/crawl.c: fix major bug: file deleted from WLROOT
	  if Not Modified. 

	* test/*: use htdump instead of db_dump
	
	* hooks/hooks_mifluz: hardwired compression + cache +
	  page_size. Removed unnecessary extern C.
	
	* check/test_functions.in,check/config: uses .my.cnf
	  for permissions. Not needed to patch config to run
	  tests anymore.
	
	* {tools,webbase}/*.h: add extern "C" everywhere

	* tools/md5.h: add #ifndef _md5_h

	* tools/salloc.h: remove include malloc.h

	* {bin,check}/*.c -> *.cc: main progs are C++

Wed Dec 15 11:18:38 1999  Loic Dachary  <loic@ceic.com>

	* configure.in: change LANG to C++

Thu Dec 09 17:27:16 1999  Loic Dachary  <loic@ceic.com>

	* acinclude.m4: upgraded CHECK_ZLIB

	* acinclude.m4 (AC_PROG_APACHE): documentation

Tue Dec 07 12:08:35 1999  Loic Dachary  <loic@ceic.com>

	* webbase-5.5 release

Tue Nov 30 12:08:35 1999  Loic Dachary  <loic@ceic.com>

	* crawler/html_content.l: test null in parse_print, fix
	  array bound write

Mon Nov 29 19:19:49 1999  Loic Dachary  <loic@ceic.com>

	* webbase-5.4 release

	* check/*: find in $srcdir
	
	* crawler/Makefile.am: added html_parser.h

	* Makefile.am: added .version

	* check/index_test (samples): indexed is non accented

	* tools/isomap.c (unaccent): added string_length argument

Fri Nov 26 10:59:21 1999  Loic Dachary  <loic@ceic.com>

	* webbase-5.3 release

	* check/* : include apache detection, autodetect modules

Thu Nov 25 19:35:00 1999  Loic Dachary  <loic@ceic.com>

	* crawler/html_*: complete rewrite of the html parser

	* hooks/*: isolate hooks in separate library

	* check/*: more tests for html parser

Wed Nov 10 17:18:20 1999  Quiedeville Rodolphe  <rodo@banquise.ceic.com>

	* man/crawler.1:  -create option : Exclusive, no other option accepted.

Tue Nov 02 16:21:45 1999  Loic Dachary  <loic@ceic.com>

	* bin/furi2md5.c: convert FURI to FURI_MD5 (see uri(3))

Fri Oct 29 15:39:17 1999  Loic Dachary  <loic@ceic.com>

	* crawler/robots.c (robots_load_1): netloc now is a unique key, added rowid
	  to get a unique identifier per server. Handle the race conditions when
	  two processes try to insert the same robots entry.
	
Fri Oct 29 11:13:46 1999  Loic Dachary  <loic@ceic.com>

	* crawler/webbase_url.c (webbase_url_start_ok): only canonical
	  and absolute URLs are valid starting points.

Thu Oct 28 16:39:42 1999  Loic Dachary  <loic@ceic.com>

	* crawler/crawl.c (mirror_schedule): if delay <= 0, default to 1 week.

Thu Oct 28 15:27:21 1999  Loic Dachary  <loic@ceic.com>

	* bin/consistentc.c (fix_keys): implement -keys_url, -keys_md5, -keys_normalize

Thu Oct 28 09:23:41 1999  Loic Dachary  <loic@ceic.com>

	* crawler/webbase.c (webbase_unlock): uses md5 key instead of long
	  ascii names.

Wed Oct 27 16:31:54 1999  Loic Dachary  <loic@ceic.com>

	* crawler/webbase.c (webbase_insert_url): fix big problem with
	  realloc(&p, &s, s + value) changed to realloc(&p, &s, value).

Tue Oct 26 18:48:00 1999  Loic Dachary  <loic@ceic.com>

	* crawler/webtools.c: implement -webtools_limit to limit the maximum size
	  of a document.

Tue Oct 26 16:01:22 1999  Loic Dachary  <loic@ceic.com>

	* bin/consistentc.c: consistentc -key canonicalizes urls

Fri Oct 22 13:58:03 1999  Loic Dachary  <loic@ceic.com>

	* crawler/webbase.c: use mysql_real_connect instead of deprecated
	  mysql_connect.

	* crawler/webbase.c: read defaults from ~/.my.cnf if options are missing

	* crawler/webbase.c: do not try to connect twice

	* crawler/webbase_create.c: add bz2 and wdz extensions to unknown mime
	  type

	* bin/crawler.c (init): added -schema to print default database schema

Thu Oct 21 18:57:34 1999  Loic Dachary  <loic@ceic.com>

	* port to freebsd-3.3
	
	* crawler/crawl.c,webtools.c: conditionally use ETIME, prefer ETIMEDOUT

Thu Oct 21 18:19:14 1999  Loic Dachary  <loic@ceic.com>

	* check/index_test: created

Tue Oct 19 19:00:14 1999  Loic Dachary  <loic@ceic.com>

	* crawler/hook_mifluz.cc: initial version

	* configure.in : --with-mifluz implementation

Mon Oct 18 17:55:46 1999  Loic Dachary  <loic@ceic.com>

	* test/webbase_test: feed url_md5 + call consistentc -key
	  when manually inserting urls in start.

Fri Oct 15 10:31:34 1999  Loic Dachary  <loic@ceic.com>

	* crawler/webbase_url.c: add webbase_url_free and call
	  on context.webbase_url objects.

	* crawler/webbase.c: add webbase_start_free and call
	  on start objects.

Thu Oct 14 17:24:52 1999  Loic Dachary  <loic@ceic.com>

	* DEBUGGING: create

	* tool/getopt*: Upgraded

	* fix various warnings reported by purify.

	* crawler/webbase*.c: fix memory leak : do not reset
	  w_*_length to 0 in *_reset.

	* added .cvsignore everywhere

Thu Oct 14 11:04:41 1999  Loic Dachary  <loic@ceic.com>

	* bin/consistentc: added -keys that rebuilds all the url_md5 keys in start and
	  url tables.

Wed Oct 13 09:36:23 1999  Loic Dachary  <loic@ceic.com>

	* crawler: add url_md5 field in start and url tables. Modify all
	  sources to fill and use this field instead of url.

	* tools/md5str.[ch]: create

	* configure.in: cleanup, add link to mifluz

1999-07-30  Bertrand Demiddelaer <bert@ceic.com>

	* crawler/webtools.c (webtools_open_1): timeout for connect() added

Mon Jul 19 14:33:25 1999    <loic@ceic.com>

	* webbase-5.2 release

1999-07-17  Loic Dachary  <loic@ceic.com>

	* crawler/webbase.c (webbase_alloc): break if connection successful

	* crawler/dirsel.c (hnode_free): strdup key to prevent unexpected
	  deallocation

	* check: test suite

1999-07-15  Loic Dachary  <loic@ceic.com>

	* tools/dirname.[ch]: rename to urldirname to prevent conflict

1999-07-13  Loic DACHARY  <loic@home.ceic.com>

	* webbase-5.1 release

1999-07-09  Loic Dachary  <loic@ceic.com>

	* Initial import
