Install and Configure Sphinx

These instructions apply to a SUSE Linux installation of CATS where the CATS package has been installed in the directory /srv/www/htdocs/cats. If your installation differs, you will need to make appropriate adjustments to the paths referenced in this documentation.

Basic Requirements

  1. Installed and operational CATS system
  2. Functional cron
  3. LSB-compatible init.d for run control

What it Does

The Sphinx package consists of two primary parts: the indexer, which creates the search indexes and is run periodically to rebuild them, and the searchd daemon, which handles queries from the sphinxapi.php library.

The CATS integration design calls for a primary index, cats, that is rebuilt once per day via cron.daily. This daily rebuild picks up all candidates and resumes in the database and completely reindexes the resume text, key skills, and each candidate's first and last names. It also resets the high-water mark stored in the sph_counter table.

A second index, catsdelta, handles additions to the database during the business day. A cron script rebuilds only this delta index, picking up the records added since the high-water mark was set by the previous run of the primary index. It is envisioned that this script would run every 20 or 30 minutes during the business day to keep the delta index up to date with recent additions to the database.
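
The split between the two indexes can be pictured with a small shell sketch. It is only an illustration of the high-water mark idea, not part of the installation: the function name split_by_mark and the sample ids are hypothetical, and the real comparison is done in SQL against the sph_counter table.

```shell
#!/bin/sh
# Sketch only: shows which index would pick up a given attachment id,
# assuming ids at or below the high-water mark belong to the main index
# and ids above it belong to the delta index.

# split_by_mark MARK ID...
# Prints "cats <id>" when id <= MARK, otherwise "catsdelta <id>".
split_by_mark() {
    mark=$1
    shift
    for id in "$@"; do
        if [ "$id" -le "$mark" ]; then
            echo "cats $id"
        else
            echo "catsdelta $id"
        fi
    done
}

# Example (hypothetical ids, mark recorded by the daily rebuild = 4668):
#   split_by_mark 4668 4667 4668 4669 4700
```

With a mark of 4668, ids 4667 and 4668 fall into cats, while 4669 and 4700 fall into catsdelta until the next daily rebuild raises the mark.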

Installation Instructions

Download Sphinx

  • Download the Sphinx source tarball from the sphinxsearch.com site, then configure, make, and make install it according to the installation documentation there. On the SUSE box, this installed the following files:
    /usr/local/bin/indexer
    /usr/local/bin/searchd
    /usr/local/man/man8/searchd.8.gz
    /usr/local/etc/sphinx.conf.dist
    /usr/share/doc/packages/sphinx-0.9.7-rc2
    /usr/share/doc/packages/sphinx-0.9.7-rc2/COPYING
    /usr/share/doc/packages/sphinx-0.9.7-rc2/doc
    /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/mk.cmd
    /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.css
    /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.html
    /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.txt
    /usr/share/doc/packages/sphinx-0.9.7-rc2/doc/sphinx.xml
    /usr/share/doc/packages/sphinx-0.9.7-rc2/INSTALL

Copy sphinxapi.php

  • Copy the api/sphinxapi.php file to /srv/www/htdocs/cats/lib/sphinxapi.php

Create Indexer Cron Script

  • Create the following cron script in /etc/cron.daily/indexer to run the indexer on a daily basis.
    #!/bin/sh
    /usr/local/bin/indexer --all --rotate --config /srv/www/htdocs/cats/modules/search/sphinx.conf
  • chown root:root /etc/cron.daily/indexer
  • chmod 700 /etc/cron.daily/indexer

Create Searchd init.d Script

  • Create an /etc/init.d/searchd script for the searchd daemon. The example below works well for a SUSE installation; you may need to alter it for non-SUSE distributions.
    #! /bin/sh
    # Copyright (c) 1995-2004 SUSE Linux AG, Nuernberg, Germany.
    # All rights reserved.
    #
    # Author: Kurt Garloff
    # Please send feedback to http://www.suse.de/feedback/
    #
    # /etc/init.d/searchd
    #   and its symbolic link
    # /(usr/)sbin/rcsearchd
    #
    #    This program is free software; you can redistribute it and/or modify 
    #    it under the terms of the GNU General Public License as published by 
    #    the Free Software Foundation; either version 2 of the License, or 
    #    (at your option) any later version. 
    # 
    #    This program is distributed in the hope that it will be useful, 
    #    but WITHOUT ANY WARRANTY; without even the implied warranty of 
    #    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the 
    #    GNU General Public License for more details. 
    # 
    #    You should have received a copy of the GNU General Public License 
    #    along with this program; if not, write to the Free Software 
    #    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
    #
    #
    ### BEGIN INIT INFO
    # Provides:          searchd for sphinx
    # Required-Start:    $syslog $remote_fs mysql
    # Should-Start: $time ypbind sendmail
    # Required-Stop:     $syslog $remote_fs
    # Should-Stop: $time ypbind sendmail
    # Default-Start:     3 5
    # Default-Stop:      0 1 2 6
    # Short-Description: searchd daemon for sphinx search
    # Description:       Starts the Sphinx searchd daemon
    ### END INIT INFO
    # 
    
    # Check for missing binaries (stale symlinks should not happen)
    # Note: Special treatment of stop for LSB conformance
    LOGFILE=/var/log/searchd.log
    SEARCHD=/usr/local/bin/searchd
    
    test -x $SEARCHD || { echo "$SEARCHD not installed"; 
    	if [ "$1" = "stop" ]; then exit 0;
    	else exit 5; fi; }
    
    
    # Source LSB init functions
    . /etc/rc.status
    
    # Reset status of this service
    rc_reset
    
    
    case "$1" in
        start)
    	echo -n "Starting $SEARCHD "
    	## Start daemon with startproc(8). If this fails
    	## the return value is set appropriately by startproc.
    	startproc -l $LOGFILE $SEARCHD --config /srv/www/htdocs/cats/modules/search/sphinx.conf
    
    	# Remember status and be verbose
    	rc_status -v
    	;;
        stop)
    	echo -n "Shutting down $SEARCHD "
    	## Stop daemon with killproc(8) and if this fails
    	## killproc sets the return value according to LSB.
    
    	killproc -TERM $SEARCHD
    
    	# Remember status and be verbose
    	rc_status -v
    	;;
        try-restart|condrestart)
    	## Do a restart only if the service was active before.
    	## Note: try-restart is now part of LSB (as of 1.9).
    	## RH has a similar command named condrestart.
    	if test "$1" = "condrestart"; then
    		echo "${attn} Use try-restart ${done}(LSB)${attn} rather than condrestart ${warn}(RH)${norm}"
    	fi
    	$0 status
    	if test $? = 0; then
    		$0 restart
    	else
    		rc_reset	# Not running is not a failure.
    	fi
    	# Remember status and be quiet
    	rc_status
    	;;
        restart)
    	## Stop the service and regardless of whether it was
    	## running or not, start it again.
    	$0 stop
    	$0 start
    
    	# Remember status and be quiet
    	rc_status
    	;;
        force-reload)
    	## Signal the daemon to reload its config. Most daemons
    	## do this on signal 1 (SIGHUP).
    	## If it does not support it, restart.
    
    	echo -n "Reload service $SEARCHD "
    	## if it supports it:
    	killproc -HUP $SEARCHD
    	rc_status -v
    
    	## Otherwise:
    	#$0 try-restart
    	#rc_status
    	;;
        reload)
    	## Like force-reload, but if daemon does not support
    	## signaling, do nothing (!)
    
    	# If it supports signaling:
    	echo -n "Reload service $SEARCHD "
    	killproc -HUP $SEARCHD
    	rc_status -v
    	
    	## Otherwise if it does not support reload:
    	#rc_failed 3
    	#rc_status -v
    	;;
        status)
    	echo -n "Checking for service $SEARCHD "
    	## Check status with checkproc(8), if process is running
    	## checkproc will return with exit status 0.
    
    	# Return value is slightly different for the status command:
    	# 0 - service up and running
    	# 1 - service dead, but /var/run/  pid  file exists
    	# 2 - service dead, but /var/lock/ lock file exists
    	# 3 - service not running (unused)
    	# 4 - service status unknown :-(
    	# 5--199 reserved (5--99 LSB, 100--149 distro, 150--199 appl.)
    	
    	# NOTE: checkproc returns LSB compliant status values.
    	checkproc $SEARCHD
    	# NOTE: rc_status knows that we called this init script with
    	# "status" option and adapts its messages accordingly.
    	rc_status -v
    	;;
        *)
    	echo "Usage: $0 {start|stop|status|try-restart|restart|force-reload|reload}"
    	exit 1
    	;;
    esac
    rc_exit
  • Save the script as /etc/init.d/searchd, make it executable (chmod 755 /etc/init.d/searchd), then install it as a service using insserv searchd.
  • Create a run control softlink for the init.d script:
    ln -s /etc/init.d/searchd /usr/sbin/rcsearchd
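
The status branch of the init script returns the LSB exit codes listed in its comments. As a convenience, a small helper along the following lines could translate those codes into text. This is a sketch: the function name describe_status is hypothetical, and the wording of each message is taken from the comments in the script above.

```shell
#!/bin/sh
# describe_status CODE
# Maps the exit code of "rcsearchd status" to the meaning listed in the
# init script comments. Codes outside 0-4 are reported as reserved.
describe_status() {
    case "$1" in
        0) echo "service up and running" ;;
        1) echo "service dead, but pid file exists" ;;
        2) echo "service dead, but lock file exists" ;;
        3) echo "service not running" ;;
        4) echo "service status unknown" ;;
        *) echo "reserved or unexpected code: $1" ;;
    esac
}

# Possible use once the service is installed:
#   rcsearchd status >/dev/null 2>&1; describe_status $?
```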

Create sphinx.conf

  • Create the sphinx.conf configuration file and save it to /srv/www/htdocs/cats/modules/search. You will need to create the 'search' directory first, since it doesn't exist yet. Replace <catsuser> and <catspass> with your CATS database user name and password on the configuration lines where indicated. Also create the index directory named in the index path settings (/srv/www/htdocs/cats/modules/search/index) if it doesn't exist.
    #
    # sphinx configuration file for CATS
    #
    
    #############################################################################
    ## data source definition
    #############################################################################
    
    source catsdb
    {
    	type				= mysql
    	strip_html			= 0
    	index_html_attrs	=
    
    	# some straightforward parameters for 'mysql' source type
    	sql_host			= localhost
    	sql_user			= <catsuser>
    	sql_pass			= <catspass>
    	sql_db			= cats
    	sql_port			= 3306	# optional, default is 3306
    
    	sql_query_pre		= REPLACE INTO sph_counter SELECT 1, MAX(attachment_id) from attachment
    	sql_query			= \
    		SELECT attachment_id, data_item_id, UNIX_TIMESTAMP(attachment.date_created) AS date_added, title, text, \
    		last_name, first_name, notes, key_skills \
    		FROM attachment left join candidate on data_item_id = candidate_id \
    		where resume = 1 and attachment.site_id = 1 and data_item_type = 100 \
    		and attachment_id <= (SELECT max_doc_id from sph_counter where counter_id = 1)
    
    	sql_group_column	= data_item_id
    	sql_date_column		= date_added
    	sql_query_post		=
    	sql_query_info		= SELECT * FROM attachment WHERE attachment_id=$id
    
    }
    
    source delta : catsdb
    {
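        # empty on purpose: overrides the inherited sql_query_pre so that
        # delta runs do not reset the high-water mark in sph_counter
        sql_query_pre=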
        sql_query           = \
            SELECT attachment_id, data_item_id, UNIX_TIMESTAMP(attachment.date_created) AS date_added, title, text, \
            last_name, first_name, notes, key_skills \
            FROM attachment left join candidate on data_item_id = candidate_id \
            where resume = 1 and attachment.site_id = 1 and data_item_type = 100 \
            and attachment_id > (SELECT max_doc_id from sph_counter where counter_id = 1)
        
    }
    
    #############################################################################
    ## index definition
    #############################################################################
    
    index cats
    {
    	source			= catsdb
    
    	# this is path and index file name without extension
    	#
    	# indexer will append different extensions to this path to
    	# generate names for both permanent and temporary index files
    	#
    	# .tmp* files are temporary and can be safely removed
    	# if indexer fails to remove them automatically
    	#
    	# .sp* files are fulltext index data files. specifically,
    	# .spa contains attribute values attached to each document id
    	# .spd contains doclists and hitlists
    	# .sph contains index header (schema and other settings)
    	# .spi contains wordlists
    	#
    	# MUST be defined
    	path			= /srv/www/htdocs/cats/modules/search/index/cats
    
    	docinfo			= extern
    	morphology			= none
    	stopwords			=
    	min_word_len		= 1
    	charset_type		= sbcs
    }
    
    index catsdelta : cats
    {
        source          = delta
        path            = /srv/www/htdocs/cats/modules/search/index/cats_delta
    
    }
    
    #############################################################################
    ## indexer settings
    #############################################################################
    
    indexer
    {
    	mem_limit			= 32M
    }
    
    #############################################################################
    ## searchd settings
    #############################################################################
    
    searchd
    {
    
          address = 127.0.0.1
    	port				= 3312
    	log				= /var/log/searchd.log
    	query_log			= /var/log/query.log
    	read_timeout		= 5
    	max_children		= 30
    	pid_file			= /var/run/searchd.pid
    	# default is 1000 (just like with Google)
    	max_matches			= 1000
    }
    
    # --eof--
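
If you keep the configuration above as a template with the <catsuser> and <catspass> placeholders intact, the substitution can be scripted. The following is a minimal sketch under stated assumptions: the function name fill_credentials and the template file name in the usage comment are hypothetical, and the credentials are assumed to contain no '|', '&', or backslash characters, which sed would treat specially.

```shell
#!/bin/sh
# fill_credentials TEMPLATE USER PASS
# Writes TEMPLATE to stdout with the <catsuser> and <catspass>
# placeholders replaced by USER and PASS.
# Assumption: USER and PASS contain no '|', '&', or backslash characters.
fill_credentials() {
    sed -e "s|<catsuser>|$2|" -e "s|<catspass>|$3|" "$1"
}

# Possible use (sphinx.conf.template is a hypothetical file name):
#   mkdir -p /srv/www/htdocs/cats/modules/search/index
#   fill_credentials sphinx.conf.template myuser mypass \
#       > /srv/www/htdocs/cats/modules/search/sphinx.conf
```

Because the resulting file contains the database password, you may want to restrict its permissions so that only the accounts running indexer and searchd can read it.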

Add sph_counter to CATS Database

  • Create the sph_counter table in the CATS database.
    # in MySQL
    use cats
    CREATE TABLE sph_counter
    (
        counter_id INTEGER PRIMARY KEY NOT NULL,
        max_doc_id INTEGER NOT NULL
    );

Try Creating Your Index

  • Index your CATS database by running the indexer from the command line:
    helphand:~ # /usr/local/bin/indexer --all --config /srv/www/htdocs/cats/modules/search/sphinx.conf
    Sphinx 0.9.7-RC2
    Copyright (c) 2001-2006, Andrew Aksyonoff
    
    using config file '/srv/www/htdocs/cats/modules/search/sphinx.conf'...
    indexing index 'cats'...
    collected 4668 docs, 20.7 MB
    sorted 2.1 Mhits, 100.0% done
    total 4668 docs, 20663522 bytes
    total 3.324 sec, 6216481.50 bytes/sec, 1404.34 docs/sec
    helphand:~ #      
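
To check a run from a script rather than by eye, the document count can be pulled out of the summary line. The helper below is a sketch that assumes the indexer output keeps the "total N docs, M bytes" format shown above; the function name indexed_docs is hypothetical.

```shell
#!/bin/sh
# indexed_docs
# Reads indexer output on stdin and prints the document count from each
# "total N docs, M bytes" summary line (one count per index built).
# The "total ... sec" timing line does not match and is ignored.
indexed_docs() {
    sed -n 's/^total \([0-9][0-9]*\) docs,.*/\1/p'
}

# Possible use:
#   /usr/local/bin/indexer --all \
#       --config /srv/www/htdocs/cats/modules/search/sphinx.conf | indexed_docs
```

Fed the sample output above, the helper prints 4668. A count of 0 on the main index would suggest that the SQL query or the database credentials need another look.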

Start the Searchd Daemon

  • If the indexer ran without errors, your install is in good shape, so start the searchd daemon:
    helphand:~ # rcsearchd start

Test Search from Command Line

  • Now test the search from the command line by searching for a resume keyword.
    helphand:~ # search --config /srv/www/htdocs/cats/modules/search/sphinx.conf controller
         [lots of returned stuff snipped for brevity]
    
    
    
            date_created=2006-10-11 13:38:19
            date_modified=2006-10-11 13:38:19
    
    words:
    1. 'controller': 402 documents, 776 hits
    helphand:~ #

Setup Cron for Regular Updates

  • If the search returned the expected results, you are almost finished with the Sphinx install. The remaining step is a crontab entry that rebuilds the delta index periodically throughout the business day, so that candidate resumes added during the day are indexed. Create the following file as /etc/cron.d/cats:
    # use /bin/sh to run commands, no matter what /etc/passwd says
    SHELL=/bin/sh
    # mail any output to `root', no matter whose crontab this is
    MAILTO=root
    PATH=/usr/local/bin
    #
    
    # Business Days, Business Hours
    
       20,50 7-17 * * 1-5        root  /usr/local/bin/indexer --rotate --config /srv/www/htdocs/cats/modules/search/sphinx.conf catsdelta >/dev/null
  • chmod 600 /etc/cron.d/cats
  • You're done!
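
If you adjust the schedule, it can help to sanity-check when the delta job fires. The sketch below encodes the schedule used in /etc/cron.d/cats above (minutes 20 and 50, hours 7 through 17, Monday through Friday). The function name delta_runs_at is hypothetical, and the arguments are assumed to be plain integers without leading zeros.

```shell
#!/bin/sh
# delta_runs_at WEEKDAY HOUR MINUTE
# WEEKDAY is 1-7 with Monday = 1 (as printed by `date +%u`),
# HOUR is 0-23, MINUTE is 0-59.
# Returns 0 if the delta schedule fires at that time, 1 otherwise.
delta_runs_at() {
    [ "$1" -ge 1 ] && [ "$1" -le 5 ] || return 1     # Monday to Friday
    [ "$2" -ge 7 ] && [ "$2" -le 17 ] || return 1    # hours 7 through 17
    [ "$3" -eq 20 ] || [ "$3" -eq 50 ]               # minute 20 or 50
}

# Possible use:
#   delta_runs_at 1 7 20 && echo "runs"      # Monday 07:20
#   delta_runs_at 6 9 20 || echo "skipped"   # Saturday 09:20
```

Note that the last run of the day is at 17:50, so resumes added after that time are picked up by the nightly rebuild of the main index.
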
sphinx/install.txt · Last modified: 2007/01/26 23:18 (external edit)