cats:parseresumes

cats:parseresumes [2007/02/07 21:02]
helphand
cats:parseresumes [2007/02/07 22:09] (current)
helphand
 +^  :!: This Documentation Applies to CATS Version 0.6.1 :!:  ^
 +|  The CATS Team has since released new versions. The material documented here will likely not work on those versions without modification.  |
 +
 +===== Automatically Parse and Add Resumes to CATS =====
 +Any new CATS installation with an existing inventory of candidate resumes faces the problem of how to get all those resumes into CATS. The thought of manually entering thousands, or tens of thousands, of candidates and their resumes can be overwhelming enough to make you drop the idea of installing CATS at all.
 +
 +If you are technically inclined, or have someone on staff who is, and your existing inventory of resumes is organized in a manner similar to ours, you may be able to adapt the solution we used to load over 30,000 candidates and resumes into our CATS installation. The determining factors are:
 +
 +  - Are the resumes in doc, pdf, txt, htm or rtf format?
 +  - Are they saved as a single file per candidate?
 +  - Is the candidate's name present in the filename?
 +  - Do you have perl installed or do you have the technical expertise available to get it installed?
 +
 +
 +If you can answer yes to all of the above, then chances are you can adapt our script to work with your setup. Be warned, though: technical expertise is required, as well as knowledge of your own setup. If you are a novice or a regular end user, this journey is not for you :-(
 +
 +==== Our Setup ====
 +Before installing CATS, our offices kept their resumes on a shared drive. Each office had a folder named after the office, and in that folder was a Resumes folder where staff dropped incoming resumes.<code>
 +losangeles
 +|-- Resumes
 +    |-- controller
 +    |   |-- Smith, John.doc
 +    |   |-- Doe, Mary.doc
 +    |-- CFOs
 +    |   |-- Wilder, Billy.doc
 +    |   |-- Cravits, Henri.doc
 +    |-- Sr Accountants
 +    |   |-- Nguyen, Tron.doc
 +    |   |-- Collins, Francis.pdf
 +    |-- Bookkeepers
 +        |-- Jones, Tom.txt
 +        |-- Jones, Mary Lou.htm
 +</code>
 +
 +Within the Resumes folder, they had sub-folders categorizing each candidate's primary skill area. They would drop the candidate's resume into the sub-folder that best matched that candidate's skill. If necessary, they would rename the file using the convention "Last Name, First Name - optional key skill data.ext". Because this pre-existing system had some structure to it, it could easily be used to grab the resumes and build CATS candidate records to load into the database.
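
As a rough illustration of the naming convention just described, the following sketch splits a file name (with its extension already removed) into last name, first name, and optional key skill text. The helper name ''split_resume_name'' is hypothetical and is not part of cats_parser.pl, which handles many more variations.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of cats_parser.pl): split
# "Last, First - skills" into its three parts. The extension is assumed
# to have been removed already. Returns an empty list on no match.
sub split_resume_name {
    my ($base) = @_;
    if ($base =~ /^\s*([^,]+?)\s*,\s*([^-]+?)\s*(?:-\s*(.*?))?\s*$/) {
        return ($1, $2, defined $3 ? $3 : '');
    }
    return;
}

for my $name ('Smith, John', 'Jones, Mary Lou', 'Nguyen, Tron - CPA audit') {
    my ($last, $first, $skills) = split_resume_name($name);
    print "last=$last first=$first skills=$skills\n";
}
```

Note that this simple pattern treats the first dash after the comma as the start of the skill data, so a hyphenated first name would be split in the wrong place.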
 +
 +Our CATS system resides on a SUSE Linux box, so the first step was to use CIFS to create mount points on the Linux box for the appropriate shares on the Windows file server. The Linux box already had a machine account in Active Directory, so mounting the shares was simply a matter of putting the correct entries in /etc/samba/smbfstab and performing the initial mount manually from the command line. Once that was done, the resumes were visible on the Linux box from mount points that looked like this: <code>
 +/mnt
 + |--resumes
 +    |--losangeles
 +    |  |-- Resumes
 +    |      |-- controller
 +    |      |   |-- Smith, John.doc
 +    |      |   |-- Doe, Mary.doc
 +    |      |-- CFOs
 +    |      |   |-- Wilder, Billy.doc
 +    |      |   |-- Cravits, Henri.doc
 +    |      |-- Sr Accountants
 +    |      |   |-- Nguyen, Tron.doc
 +    |      |   |-- Collins, Francis.pdf
 +    |      |-- Bookkeepers
 +    |          |-- Jones, Tom.txt
 +    |          |-- Jones, Mary Lou.htm
 +    |--sandiego
 +    |  |-- Resumes
 +    |      |-- controller
 +    |      |   |-- Williams, Beth.doc
 +    |      |-- CFOs
 +    |      |   |-- Smith, George.doc
 +    |      |   |-- Welsh, John.doc
 +    |      |-- Sr Accountants
 +    |      |   |-- Orwell, George.doc
 +    |      |-- Bookkeepers
 +    |          |-- Carter, Barbara.txt
 +    |--sanfrancisco
 +       |-- Resumes
 +           |-- controller
 +           |   |-- Nelson, Jill.doc
 +           |   |-- Marks, Savannah.doc
 +           |-- CFOs
 +           |   |-- Ford, Samuel.doc
 +           |-- Sr Accountants
 +           |   |-- Dillon, Matt.doc
 +           |   |-- Nielson, Alice.pdf
 +           |-- Bookkeepers
 +               |-- Gates, Will.txt
 +               |-- Skywalker, Luke.htm
 +
 +</code>
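
The mount layout above encodes two pieces of data in every path: the office (the folder directly under /mnt/resumes) and the skill area (the folders below Resumes). A minimal sketch of pulling both out of a path, assuming exactly this layout, might look like the following. The helper name ''branch_and_skills'' is made up for illustration.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: given a file path under /mnt/resumes, return the
# office folder followed by the sub-folders found below "Resumes".
# Returns nothing if the path does not follow the expected layout.
sub branch_and_skills {
    my ($path) = @_;
    return unless $path =~ m{^/mnt/resumes/([^/]+)/Resumes/(.+)$}i;
    my ($branch, $rest) = (lc $1, $2);
    my @parts = File::Spec->splitdir($rest);
    pop @parts;                          # drop the file name itself
    return ($branch, map { lc($_) } @parts);
}

my ($branch, @skills) =
    branch_and_skills('/mnt/resumes/losangeles/Resumes/Sr Accountants/Nguyen, Tron.doc');
print "branch=$branch skills=@skills\n";
```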
 +
 +
 +
 +==== cats_parser.pl script ====
 +
 +With the structure defined above, we can run the following Perl script, cats_parser.pl, against the directory tree. It grabs each resume, converts it to text, parses the filename to derive the candidate's name, parses the text of the resume to find an email address and phone number, and uses that data to populate a new candidate record, with the resume as an attachment, in CATS.<code>
 +#perl cats_parser.pl /mnt/resumes
 +</code>
 +
 +Note that this script requires Perl and several modules: DBI, DBD::mysql, POSIX, File::Find, File::Basename, and File::Spec. It reads the database user name and password from the SQLUSER and SQLPW environment variables. **Do not run this script** against your data without first modifying it to match your setup and testing it in a safe environment, so that you avoid damaging your production data. Using this script requires technical expertise and at least a basic understanding of Perl. **Use it at your own risk**; I will not be available to assist should you destroy your installation or damage your inventory of resumes.
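
Before going further, it may help to confirm that the modules are actually present. The following is a small sketch, not part of the original script, that tries to load each module named above and reports the ones that fail.

```perl
use strict;
use warnings;

# Try to load each module the script depends on and collect any that fail.
my @needed  = qw(DBI DBD::mysql POSIX File::Find File::Basename File::Spec);
my @missing = grep { !eval "require $_; 1" } @needed;

if (@missing) {
    print "Missing modules: @missing\n";
} else {
    print "All required modules are installed.\n";
}
```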
 +
 +In addition to the actual script, you must have the required converters installed and working on your system. The converters include:
 +
 +  * antiword - available from [[http://www.winfield.demon.nl/]]
 +  * pdftotext - part of the xpdf package [[http://www.foolabs.com/xpdf/home.html]]
 +  * html2text - available from [[http://www.mbayer.de/html2text/]]
 +  * rtf-converter - available from [[http://directory.fsf.org/rtf-converter.html]]
 +
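
A quick way to see which of these converters are reachable is to search the directories in PATH. This sketch assumes the tools are installed under the names listed above. The script itself calls some of them by absolute path, so adjust either the script or your installation accordingly.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of an executable found on PATH, or undef.
sub find_on_path {
    my ($prog) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $prog);
        return $candidate if -f $candidate && -x _;
    }
    return undef;
}

for my $tool (qw(antiword pdftotext html2text rtf-converter)) {
    my $where = find_on_path($tool);
    printf "%-14s %s\n", $tool, defined $where ? $where : 'NOT FOUND';
}
```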
 +<code>
 +#!/usr/bin/perl -w
 +use strict;
 +
 +# cats_parser.pl parses resumes and loads them into CATS system.
 +# Copyright (C) 2006  Scott W. Leighton  <helphand@pacbell.net>
 +
 +# This program is free software; you can redistribute it and/or
 +# modify it under the terms of the GNU General Public License
 +# as published by the Free Software Foundation; either version 2
 +# of the License, or (at your option) any later version.
 +
 +# This program is distributed in the hope that it will be useful,
 +# but WITHOUT ANY WARRANTY; without even the implied warranty of
 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +# GNU General Public License for more details.
 +
 +# You should have received a copy of the GNU General Public License
 +# along with this program; if not, write to the Free Software
 +# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
 +
 +
 +use File::Find;
 +use File::Basename;
 +use File::Spec;
 +use DBI;
 +use POSIX qw(strftime);
 +
 +my $DEBUG = 0;
 +
 +my $dbh=DBI->connect("DBI:mysql:database=cats;host=sqlhost",
 +                   $ENV{'SQLUSER'},$ENV{'SQLPW'},
 +                   {'RaiseError'=>1});
 +
 +
 +
 +# Subroutine to validate file names: return true if file is ok to process
 +# or false to skip the file.
 +
 +sub check_path {
 +    my $path = shift;
 +    return 1 if $path =~ /\.(rtf|txt|pdf|htm|html|doc)$/i;  # return true if ends in one of desired extensions
 +    return 0;  # return false to skip
 +}
 +
 +sub check_dir {
 +    my $dir = shift;
 +    return 0 if basename($dir) =~ /^\./;  # return false if the directory name starts with a dot
 +    return 1 if $dir =~ m!^/mnt/resumes$!i;
 +    return 1 if $dir =~ m!^/mnt/resumes/([^/]+)$!i;
 +    return 0 unless $dir =~ m!^/mnt/resumes/([^/]+)/Resumes.*$!i;
 +    return 1;  # return true to process this directory
 +}
 +
 +
 +
 +find(
 +    {
 +        wanted => \&wanted,
 +        no_chdir => 1, 
 +        follow => 1,
 +    },
 +    @ARGV,
 +);
 +
 +sub wanted {
 +    my $path = $File::Find::name;
 +
 +    if ( -d ) { 
 +        if ( !check_dir( $path ) ) {
 +            $File::Find::prune = 1;
 +        }
 +        return;
 +    }
 +
 +    if ( !-r _ ) {
 +        warn "$File::Find::name is not readable\n";
 +        return;
 +    }
 +
 +
 +    my $mtime = (stat _ )[9];
 +
 +    if ( !check_path( $path ) ) {
 +        print "skipping $path\n" if $DEBUG;
 +        return;
 +    }
 +
 +    print "processing $path\n" if $DEBUG;
 +
 +    # Otherwise, fetch document 
 +    process_file( $path, $mtime );
 +
 +}
 +
 +sub process_file {
 +    my ( $path, $mtime ) = @_;
 +
 +    
 +    my $contenttype;
 +    my $content;
 +
 +    my ($filename,$filepath,$suffix) = fileparse($path,qr/\.[^.]*$/);
 +
 +    $suffix=lc($suffix);
 +    
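 +    # Quote the path for the shell: wrap it in single quotes and escape any
 +    # embedded single quote (SQL quoting via $dbh->quote is not shell-safe).
 +    (my $qpath = $path) =~ s/'/'\\''/g;
 +    $qpath = "'$qpath'";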
 +
 +    # Convert to plain text
 +   
 +    if ($suffix eq '.doc') {
 +      $contenttype='application/msword';
 +      $content= `/usr/local/bin/antiword -m 8859-1.txt $qpath`;
 +   } elsif ($suffix eq '.pdf') {
 +      $contenttype='application/pdf';
 +      # pdftotext writes plain text to stdout when the output file is "-"
 +      $content= `/usr/bin/pdftotext $qpath -`;
 +   } elsif ($suffix eq '.rtf') {
 +      $contenttype='application/rtf';
 +      $content= `/usr/local/bin/rtf-converter $qpath`;
 +   } elsif ($suffix eq '.txt' )  {
 +      $contenttype='text/plain';
 +      $content= `cat $qpath`;
 +   } elsif ($suffix eq '.htm' or $suffix eq '.html') {
 +      $contenttype='text/html';
 +      $content= `cat $qpath`;
 +   } else {
 +      $contenttype='application/octet-stream';
 +      $content= `cat $qpath`;
 +   }
 +
 +   # look thru the plain text version for an email address and phone number
 +
 +   my $workarea = $content;
 +   $workarea =~ s/[\n]+/ /gs;
 +   $workarea =~ s/[\x00-\x1F]+//gs;
 +   $workarea =~ s/[\x80-\xFF]+//gs;
 +   
 +   my $email;
 +   if ($workarea =~ m/\b([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/gis) {
 +      $email = $1;
 +   }
 +
 +   $workarea =~ s/[\|]+/ /gs;
 +   my $phone;
 +   if ( $workarea =~ m/(
 +                   \(?     # optional parentheses
 +                     \d{3} # area code required
 +                   \)?     # optional parentheses
 +                   \s?     # optional space
 +                   [-\s.]? # separator is either a dash, a space, or a period.
 +                   \s?     # optional extra space
 +                     \d{3} # 3-digit prefix
 +                   \s?     # optional extra space
 +                   [-\s.]  # another separator
 +                   \s?     # optional extra space
 +                     \d{4} # 4-digit line number
 +                   )/isx ) {
 +         $phone = $1;
 +   } else {
 +       if (
 +             $workarea =~ m/(
 +                (?:
 +                 (?:1-?)?
 +                 (?:\(?[0-9]{3}\)?-)?
 +                )?
 +                [0-9]{3}-?[0-9]{4}
 +               )/x ) {
 +           $phone = $1;
 +       }
 + 
 +   }
 +
 +   my $bytes = output_document( $path, \$content, $mtime, $contenttype, $email, $phone);
 +
 +
 +}
 +
 +
 +sub output_document {
 +    my ( $path, $content_ref, $mtime, $parser_type, $email, $phone ) = @_;
 +
 +    my $moddate = strftime "%Y-%m-%d %H:%M:%S", (localtime $mtime)[0..5];
 +    
 +    my ($filename,$filepath,$suffix) = fileparse($path,qr/\.[^.]*$/);
 +    my $originalfname = "$filename$suffix";
 +    my %ks;
 +    my $branch;
 +    
 +    # grab the branch name out of the path
 +
 +    # (only trust $1 when the match actually succeeded)
 +    if ($path =~ m!^/mnt/resumes/([^/]+)/Resumes.*$!i) {
 +       $branch = lc($1);
 +    }
 +
 +    # save the relevant portions of the path as key skills
 +    # by stripping out any sub-folder names below the Resumes
 +    # folder and using them as a key skill item
 +    my @dirs = File::Spec->splitdir($path);
 +    my $discard = pop(@dirs);   # dump the filename portion
 +    while (my $dir = pop(@dirs)) {
 +       last if $dir =~ /Resumes/;  #stop when we hit this directory
 +       $ks{lc($dir)}++ if $dir;
 +    }
 +    
 +    # now parse that file name
 +    my ($fname,$lname,$other) = parse_filename($filename);
 +    
 +    # anything following a dash in the file name is considered 
 +    # key skill data, so split it out and clean it up
 +    if ($other) {
 +      my @parts = split(/,\s+|\,|\s+/,$other);
 +      foreach my $p (@parts) {
 +         $p=~ s/resume//gi;   #strip out the word resume
 +         if ($p and length($p) > 1) {  # skip if only 1 character long
 +            $ks{lc($p)}++ if $p;
 +         }
 +      }
 +    }
 +    my @keys = keys %ks;
 +    %ks = ();
 +    # clean up the key skills, remove special chars and the word resume
 +    foreach my $p (@keys) {
 +       $p=~ s/resume//gi;   # strip out the word resume
 +       $p=~ s/[-&,]/ /g;    # strip special chars
 +       $p=~ s/^\s+//g;
 +       $p=~ s/\s+$//g;
 +       $ks{lc($p)}++ if $p;
 +    }
 +    
 +    # build the final key skills string, put the branch we found 
 +    # to the front so it always displays on candidate screens   
 +    my $keyskills = join (", ",keys %ks);
 +    if ($branch) {
 +       $keyskills = "$branch, " . $keyskills;
 +    }
 +
 +    # create the CATS candidate record    
 +
 +    my $sql = sprintf ("INSERT into cats.candidate (
 +                 first_name,
 +                 last_name,
 +                 source,
 +                 phone_home,
 +                 key_skills,
 +                 entered_by,
 +                 owner,
 +                 site_id,
 +                 date_created,
 +                 date_modified,
 +                 email1
 +                ) values (
 +                 %s,
 +                 %s,
 +                 'Unsolicited Resume',
 +                 %s,
 +                 %s,
 +                 1,
 +                 1,
 +                 1,
 +                 %s,
 +                 %s,
 +                 %s
 +                );",
 +                 $dbh->quote($fname),
 +                 $dbh->quote($lname),
 +                 $dbh->quote($phone),
 +                 $dbh->quote($keyskills),
 +                 $dbh->quote($moddate),
 +                 $dbh->quote($moddate),
 +                 $dbh->quote($email)
 +                );
 +    
 +    $dbh->do($sql) or die $dbh->errstr;
 +               
 +    my $candidateid = $dbh->{'mysql_insertid'};
 +
 +    # Create the CATS attachment record 
 +
 +    $sql = sprintf ("INSERT into cats.attachment (
 +                 data_item_type,
 +                 data_item_id,
 +                 title,
 +                 original_filename,
 +                 stored_filename,
 +                 content_type,
 +                 resume,
 +                 text,
 +                 site_id,
 +                 date_created,
 +                 date_modified
 +                 ) values (
 +                 100,
 +                 %s,
 +                 %s,
 +                 %s,
 +                 %s,
 +                 %s,
 +                 1,
 +                 %s,
 +                 1,
 +                 %s,
 +                 %s
 +                 );", 
 +                 $candidateid,
 +                 $dbh->quote($filename),
 +                 $dbh->quote($originalfname),
 +                 $dbh->quote($originalfname),
 +                 $dbh->quote($parser_type),
 +                 $dbh->quote($$content_ref),
 +                 $dbh->quote($moddate),
 +                 $dbh->quote($moddate)
 +                 );
 +                 
 +    $dbh->do($sql) or die $dbh->errstr;
 +                 
 +    my $attachid = $dbh->{'mysql_insertid'};
 +
 +    # build the path name for the attachment file, then
 +    # create the directory, move the original resume to it,
 +    # and fixup ownership of the resume file
 +    
 +    my $newpath = "/srv/www/htdocs/cats/attachments/$attachid";
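 +    mkdir($newpath) or die "cannot create $newpath: $!";
 +    # list-form system() bypasses the shell, so names with quotes or spaces are safe
 +    system('mv', $path, $newpath) == 0 or warn "mv failed for $path\n";
 +    system('chown', '-R', 'wwwrun:www', $newpath) == 0 or warn "chown failed for $newpath\n";
 +    chmod 0755, $newpath;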
 +
 +}
 +
 + 
 +# Routine to parse the resume doc's filename
 +
 +sub parse_filename {
 +    my ($filename) = @_;
 +    my ($fname,$lname,$other);
 +    
 +    # look for std pattern convention
 +    if ($filename =~ /^([^,]+)(\,\s?)([^,]+)(\s?-\s?)(.*)$/ ) {
 +       $lname=$1;
 +       $fname=$3;
 +       $other=$5;
 +       print "std $filename -> $lname, $fname - $other\n" if $DEBUG;
 +    } elsif ( $filename =~ /^([^\s,]+)\s+([^\s,-]+)(\s?-\s?)(.*)$/ ) {
 +       $fname=$1;
 +       $lname=$2;
 +       $other=$4;
 +       print "#2 $filename -> $lname, $fname - $other\n" if $DEBUG;
 +    } elsif ( $filename =~ /^([^\s,]+)\s+([^\s,]+)\s?,\s?(.*)$/ ) {
 +       $lname=$1;
 +       $fname=$2;
 +       $other=$3;
 +       print "#3 $filename -> $lname, $fname - $other\n" if $DEBUG;
 +    } elsif ( $filename =~ /^([^\s,]+)\s?,\s?([^\s]+)$/ ) {
 +       $lname=$1;
 +       my @fps = split (/[-,\s]/,$2);
 +       $fname=shift(@fps);
 +       $other = join(", ",@fps);
 +       print "#4 $filename -> $lname, $fname - $other\n" if $DEBUG;
 +    } elsif ( $filename =~ /^([^\s,]+)\s?,\s?(.*)$/ ) {
 +       $lname =$1;
 +       my @fps = split (/[-,\s]/,$2);
 +       $fname = shift(@fps);
 +       $other = join(", ",@fps);
 +       print "#5 $filename -> $lname, $fname - $other\n" if $DEBUG;
 +    } else {   
 +       $filename =~ s/[-+_]/ /g;     # change all dashes, plus signs and underscores to spaces
 +    
 +       my @fps = split (/,|-|\s+/,$filename);
 +       my $c = scalar(@fps);
 +       if ($c>0) {
 +          if ($c == 2) {
 +             $fname = shift(@fps);
 +             $lname = shift(@fps);
 +          } elsif ($c > 1) {
 +             ($fname,$lname)= split(/\s+/,$fps[0]);
 +             for (my $i=1;$i < $c;$i++) {
 +                $other .= $other?", $fps[$i]":$fps[$i];
 +             }
 +          } else {
 +             ($fname,$lname,$other) = split (/\s+/,$filename);
 +          }
 +       }
 +       print "#6 with c at $c $filename -> $lname, $fname - $other\n" if $DEBUG;
 +    }
 +
 +    # give up with some defaults if we couldn't figure out the
 +    # format of the name
 +    $fname = "APPLICANT" unless $fname;
 +    $lname = "RESUME" unless $lname;
 +    
 +    return ($fname,$lname,$other);
 +}
 +
 +</code>
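
To try the contact extraction on its own before touching a database, the idea can be reduced to a few lines. The sketch below uses a simplified variant of the patterns in process_file (it is not a verbatim copy), and the sample resume text is invented.

```perl
use strict;
use warnings;

# Return the first email address and the first US-style phone number
# found in a block of text. Either value is undef when nothing matches.
sub extract_contact {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # collapse newlines and runs of spaces

    my ($email) = $text =~ /\b([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i;
    my ($phone) = $text =~ /(\(?\d{3}\)?[-\s.]?\d{3}[-\s.]\d{4})/;
    return ($email, $phone);
}

my $sample = "John Smith\n123 Main St\n(310) 555-0142\njsmith\@example.com\n";
my ($email, $phone) = extract_contact($sample);
print "email=$email phone=$phone\n";
```

Running a handful of your own resumes through something like this is a cheap way to see how often the patterns miss before committing to a full load.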
 +
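
The script hands file names to the converters through backticks, which means a shell interprets the command line. One alternative, shown here only as a sketch, is the list form of a piped open, which passes each argument to the program untouched. The helper name ''run_converter'' is hypothetical, and ''cat'' stands in for a real converter such as antiword.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Run a command and capture its standard output without involving the
# shell. Each list element reaches the program as one argument.
sub run_converter {
    my (@cmd) = @_;
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    local $/;                 # slurp mode: read everything at once
    my $out = <$fh>;
    close($fh);
    return defined $out ? $out : '';
}

# Demonstrate with a file name that would need careful shell quoting.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/O'Brien, Pat.txt";
open(my $w, '>', $file) or die "cannot write $file: $!";
print {$w} "Pat O'Brien\n";
close($w);

print run_converter('cat', $file);
```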
 +
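
The key skills string is assembled from folder names and from whatever follows the dash in the file name. The following sketch shows that clean-up in isolation. Unlike the script, it sorts the skills so the result is repeatable, and the helper name ''build_key_skills'' is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize raw skill fragments into a de-duplicated,
# sorted, comma-separated string with the branch name first.
sub build_key_skills {
    my ($branch, @raw) = @_;
    my %seen;
    for my $item (@raw) {
        my $s = lc $item;
        $s =~ s/resume//g;        # the word "resume" carries no information
        $s =~ s/[-&,]/ /g;        # special characters become spaces
        $s =~ s/\s+/ /g;          # collapse runs of whitespace
        $s =~ s/^\s+|\s+$//g;     # trim both ends
        $seen{$s} = 1 if length($s) > 1;
    }
    return join(', ', grep { defined($_) && length($_) } $branch, sort keys %seen);
}

print build_key_skills('losangeles', 'Sr Accountants', 'CPA', 'Resume', 'cpa'), "\n";
```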
 +
 +
 +
  
cats/parseresumes.txt · Last modified: 2007/02/07 22:09 by helphand