Note: This documentation applies to CATS version 0.6.1.
The CATS Team has since released new versions; the material documented here likely will not work on them without modification.

Automatically Parse and Add Resumes to CATS

Any new CATS installation with an existing inventory of candidate resumes faces the problem of getting all those resumes into CATS. The thought of manually entering thousands, or tens of thousands, of candidates with their associated resumes can be overwhelming to the point of abandoning the idea of installing CATS at all.

Well, if you are technically inclined, or have someone on staff who is, and your existing inventory of resumes is organized in a manner similar to ours, you may be able to adapt the solution we used to load over 30,000 candidates and resumes into our CATS installation. The determining factors are:

  1. Are the resumes in doc, pdf, txt, htm or rtf format?
  2. Are they saved as a single file per candidate?
  3. Is the candidate's name present in the filename?
  4. Do you have perl installed or do you have the technical expertise available to get it installed?

If you can answer yes to the above, then chances are you can adapt our script to work with your setup. Be warned though, technical expertise is required as well as knowledge about your own setup. So, if you are a novice or a regular end-user, this journey is not for you :-(

Our Setup

Before installing CATS, our offices kept their resumes on a shared drive. Each office had a folder named after the office, and in that folder was a Resumes folder where staff dropped incoming resumes.

losangeles
|-- Resumes
    |-- controller
    |   |-- Smith, John.doc
    |   |-- Doe, Mary.doc
    |-- CFOs
    |   |-- Wilder, Billy.doc
    |   |-- Cravits, Henri.doc
    |-- Sr Accountants
    |   |-- Nguyen, Tron.doc
    |   |-- Collins, Francis.pdf
    |-- Bookkeepers
        |-- Jones, Tom.txt
        |-- Jones, Mary Lou.htm

Within the Resumes folder, they had sub-folders categorizing the candidate's primary skill area. They would drop the candidate's resume into the sub-folder that best matched that candidate's skill. If necessary, they would rename the file using the convention Last Name, First Name - optional key skill data.ext. Because this pre-existing system had some structure to it, it could easily be used to grab the resumes and build CATS candidate records to load the database.
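
As a rough illustration of the naming convention, the following sketch splits a file name of the form Last Name, First Name - optional key skill data.ext into its parts. The sub name split_resume_name is hypothetical, and the pattern is far simpler than the parse_filename routine in the full script; it only handles the standard convention and treats any dash after the first name as the start of the key skill data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

# Hypothetical helper: split "Last, First - skills.ext" into its parts.
# Returns an empty list when the name does not follow the convention.
sub split_resume_name {
    my ($file) = @_;
    # strip the directory and the extension
    my ($base) = fileparse($file, qr/\.[^.]*$/);
    if ($base =~ /^\s*([^,]+?)\s*,\s*([^,-]+?)\s*(?:-\s*(.*?))?\s*$/) {
        my ($last, $first, $skills) = ($1, $2, $3);
        return ($first, $last, defined $skills ? $skills : '');
    }
    return;
}

for my $name ('Smith, John.doc', 'Jones, Mary Lou - payroll quickbooks.htm') {
    my ($first, $last, $skills) = split_resume_name($name);
    print "$name => first=[$first] last=[$last] skills=[$skills]\n";
}
```

Names that do not contain a comma return an empty list here, which is where the fallback patterns in the full script come in.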

Our CATS system resides on a SUSE Linux box, so the first step was to use CIFS to create mount points on the Linux box for the appropriate shares on the Windows file server. The Linux box already had a machine account in Active Directory, so mounting the shares was simply a matter of putting the correct entries in /etc/samba/smbfstab and performing the initial mount manually from the command line. Once that was done, the resumes were visible on the Linux box from mount points that looked like this:

/mnt
 |--resumes
    |--losangeles
    |  |-- Resumes
    |      |-- controller
    |      |   |-- Smith, John.doc
    |      |   |-- Doe, Mary.doc
    |      |-- CFOs
    |      |   |-- Wilder, Billy.doc
    |      |   |-- Cravits, Henri.doc
    |      |-- Sr Accountants
    |      |   |-- Nguyen, Tron.doc
    |      |   |-- Collins, Francis.pdf
    |      |-- Bookkeepers
    |          |-- Jones, Tom.txt
    |          |-- Jones, Mary Lou.htm
    |--sandiego
    |  |-- Resumes
    |      |-- controller
    |      |   |-- Williams, Beth.doc
    |      |-- CFOs
    |      |   |-- Smith, George.doc
    |      |   |-- Welsh, John.doc
    |      |-- Sr Accountants
    |      |   |-- Orwell, George.doc
    |      |-- Bookkeepers
    |          |-- Carter, Barbara.txt
    |--sanfrancisco     
       |-- Resumes
           |-- controller
           |   |-- Nelson, Jill.doc
           |   |-- Marks, Savannah.doc
           |-- CFOs
           |   |-- Ford, Samuel.doc
           |-- Sr Accountants
           |   |-- Dillon, Matt.doc
           |   |-- Nielson, Alice.pdf
           |-- Bookkeepers
               |-- Gates, Will.txt
               |-- Skywalker, Luke.htm
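
Before importing anything, it can help to confirm that the mounted tree looks the way you expect. The following is a minimal sketch, assuming a layout like the one above: it walks a directory with File::Find and counts files per extension without changing anything. The sub name count_by_extension is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: count regular files under $root by lower-cased extension.
# Files without an extension are counted under '(none)'.
sub count_by_extension {
    my ($root) = @_;
    my %count;
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                return unless -f $File::Find::name;
                my ($ext) = $File::Find::name =~ /\.([^.\/]+)$/;
                $count{ defined $ext ? lc $ext : '(none)' }++;
            },
        },
        $root,
    );
    return \%count;
}

# Read-only: nothing is moved, converted or written to the database.
my $root = shift(@ARGV) || '/mnt/resumes';
if (-d $root) {
    my $count = count_by_extension($root);
    printf "%-8s %d\n", $_, $count->{$_} for sort keys %$count;
} else {
    warn "$root is not a directory\n";
}
```

Extensions other than doc, pdf, txt, htm, html and rtf in the output point to files the import script would skip.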

cats_parser.pl script

With the above defined structure, we can run the following perl script, cats_parser.pl, against the directory structure. It will grab each resume, convert it to text, parse the filename to derive the candidate's name, parse the text resume to find an email address and phone number, and use that data to populate a new candidate record with attachment in CATS.

# perl cats_parser.pl /mnt/resumes
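
To make the email and phone step more concrete, here is a simplified sketch of that kind of pattern matching on text that has already been converted. The patterns are reduced versions for illustration only (US-style ten-digit phone numbers), and the sub name find_contact is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first email address and the first
# US-style phone number found in a block of plain text (undef if none).
sub find_contact {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # collapse line breaks and runs of whitespace

    my ($email) = $text =~ /\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,})\b/i;
    my ($phone) = $text =~ /(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]\d{4})/;
    return ($email, $phone);
}

my $sample = "John Smith\n123 Main St\n(555) 123-4567\njsmith\@example.com\n";
my ($email, $phone) = find_contact($sample);
print "email: ", (defined $email ? $email : 'not found'), "\n";
print "phone: ", (defined $phone ? $phone : 'not found'), "\n";
```

Only the first match of each kind is kept, so a resume that lists a reference's phone number before the candidate's own will yield the wrong number; spot-check the imported records.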

Note that this perl script requires perl itself plus several modules: DBI, DBD::mysql, POSIX, File::Find, File::Basename, and File::Spec. Do not run this script against your data without first modifying it to match your setup and testing it in a safe environment, or you risk damaging your production data. Be aware that the script moves each original resume file into the CATS attachments directory rather than copying it. Using this script requires technical expertise and at least a basic understanding of perl. Use it at your own risk; I will not be available to assist should you destroy your installation or damage your inventory of resumes.
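
A quick way to confirm that the modules are available is to try loading each one. The following is a minimal sketch; the sub name missing_modules is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the names of modules that fail to load.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # string eval so the module name is resolved at run time
        push @missing, $module unless eval "require $module; 1";
    }
    return @missing;
}

my @needed  = qw(DBI DBD::mysql POSIX File::Find File::Basename File::Spec);
my @missing = missing_modules(@needed);
if (@missing) {
    print "missing: @missing\n";
} else {
    print "all required modules are installed\n";
}
```

POSIX, File::Find, File::Basename and File::Spec ship with perl, so in practice DBI and DBD::mysql are the ones most likely to need installing.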

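In addition to the script itself, you must have the required converters installed and working on your system. The converters called by the script are:

  1. antiword (/usr/local/bin/antiword) for doc files
  2. pdftohtml (/usr/bin/pdftohtml) for pdf files
  3. rtf-converter (/usr/local/bin/rtf-converter) for rtf files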
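
One way to check that the converters are in place is to test each path for an executable file. The paths in this sketch are the ones the script itself calls; adjust them if your system installs the tools elsewhere. The sub name missing_programs is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the paths that are not executable files.
sub missing_programs {
    return grep { !( -f $_ && -x _ ) } @_;
}

# Converter paths as used in cats_parser.pl; yours may differ.
my @converters = (
    '/usr/local/bin/antiword',         # doc
    '/usr/bin/pdftohtml',              # pdf
    '/usr/local/bin/rtf-converter',    # rtf
);

my @missing = missing_programs(@converters);
if (@missing) {
    print "not found: @missing\n";
} else {
    print "all converters found\n";
}
```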

#!/usr/bin/perl -w
use strict;
# 
# cats_parser.pl parses resumes and loads them into CATS system.
# Copyright (C) 2006  Scott W. Leighton  <helphand@pacbell.net>
# 
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
# 

use File::Find;
use File::Basename;
use File::Spec;
use DBI;
use POSIX qw(strftime);

my $DEBUG = 0;

my $dbh=DBI->connect("DBI:mysql:database=cats;host=sqlhost",
                   $ENV{'SQLUSER'},$ENV{'SQLPW'},
                   {'RaiseError'=>1});



# Subroutine to validate file names: return true if file is ok to process
# or false to skip the file.

sub check_path {
    my $path = shift;
    return 1 if $path =~ /\.(rtf|txt|pdf|htm|html|doc)$/i;  # return true if ends in one of desired extensions
    return 0;  # return false to skip
}

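sub check_dir {
    my $dir = shift;
    return 0 if basename($dir) =~ /^\./;  # return false if the directory name starts with a dot
    return 1 if $dir =~ m!^/mnt/resumes$!i;                   # the top-level mount point
    return 1 if $dir =~ m!^/mnt/resumes/[^/]+$!i;             # an office folder
    return 0 unless $dir =~ m!^/mnt/resumes/[^/]+/Resumes!i;  # only descend into Resumes trees
    return 1;  # return true to process this directory
}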



find(
    {
        wanted => \&wanted,
        no_chdir => 1, 
        follow => 1,
    },
    @ARGV,
);

sub wanted {
    my $path = $File::Find::name;

    if ( -d ) { 
        if ( !check_dir( $path ) ) {
            $File::Find::prune = 1;
        }
        return;
    }

    if ( !-r _ ) {
        warn "$File::Find::name is not readable\n";
        return;
    }


    my $mtime = (stat _ )[9];

    if ( !check_path( $path ) ) {
        print "skipping $path\n" if $DEBUG;
        return;
    }

    print "processing $path\n" if $DEBUG;

    # Otherwise, fetch document 
    process_file( $path, $mtime );

}

sub process_file {
    my ( $path, $mtime ) = @_;

    
    my $contenttype;
    my $content;

    my ($filename,$filepath,$suffix) = fileparse($path,qr/\.[^.]*$/);

    $suffix=lc($suffix);
    
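    # quote the path for the shell: wrap it in single quotes and escape
    # any embedded single quotes (SQL quoting is not safe for this)
    (my $qpath = $path) =~ s/'/'\\''/g;
    $qpath = "'$qpath'";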

    # Convert to plain text
   
    if ($suffix eq '.doc') {
      $contenttype='application/msword';
      $content= `/usr/local/bin/antiword -m 8859-1.txt $qpath`;
   } elsif ($suffix eq '.pdf') {
      $contenttype='application/pdf';
      $content= `/usr/bin/pdftohtml -stdout $qpath`;
   } elsif ($suffix eq '.rtf') {
      $contenttype='application/rtf';
      $content= `/usr/local/bin/rtf-converter $qpath`;
   } elsif ($suffix eq '.txt' )  {
      $contenttype='text/plain';
      $content= `cat $qpath`;
   } elsif ($suffix eq '.htm' or $suffix eq '.html')  {
      $contenttype='text/html';
      $content= `cat $qpath`;
   } else {
      $contenttype='application/octet-stream';
      $content= `cat $qpath`;
   }

   # look thru the plain text version for an email address and phone number

   my $workarea = $content;
   $workarea =~ s/[\n]+/ /gs;
   $workarea =~ s/[\x00-\x1F]+//gs;
   $workarea =~ s/[\x80-\xFF]+//gs;
   
   my $email;
   if ($workarea =~ m/\b([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/gis) {
      $email = $1;
   }

   $workarea =~ s/[\|]+/ /gs;
   my $phone;
   if ( $workarea =~ m/(
                   \(?     # optional parentheses
                     \d{3} # area code required
                   \)?     # optional parentheses
                   \s?     # optional space
                   [-\s.]? # separator is either a dash, a space, or a period.
                   \s?     # optional extra space
                     \d{3} # 3-digit prefix
                   \s?     # optional extra space
                   [-\s.]  # another separator
                   \s?     # optional extra space
                     \d{4} # 4-digit line number
                   )/isx ) {
         $phone = $1;
   } else {
       if (
             $workarea =~ m/(
                (?:
                 (?:1-?)?
                 (?:\(?[0-9]{3}\)?-)?
                )?
                [0-9]{3}-?[0-9]{4}
               )/x ) {
           $phone = $1;
       }
 
   }

   my $bytes = output_document( $path, \$content, $mtime, $contenttype, $email, $phone);


}


sub output_document {
    my ( $path, $content_ref, $mtime, $parser_type, $email, $phone ) = @_;

    my $moddate = strftime "%Y-%m-%d %H:%M:%S", (localtime $mtime)[0..5];
    
    my ($filename,$filepath,$suffix) = fileparse($path,qr/\.[^.]*$/);
    my $originalfname = "$filename$suffix";
    my %ks;
    my $branch;
    
    # grab the branch name out of the path

    # (only use $1 when the match succeeds, otherwise it may be stale)
    if ($path =~ m!^/mnt/resumes/([^/]+)/Resumes!i) {
       $branch = lc($1);
    }

    # save the relevant portions of the path as key skills
    # by stripping out any sub-folder names below the Resumes
    # folder and using them as a key skill item
    my @dirs = File::Spec->splitdir($path);
    my $discard = pop(@dirs);   # dump the filename portion
    while (my $dir = pop(@dirs)) {
       last if $dir =~ /Resumes/;  #stop when we hit this directory
       $ks{lc($dir)}++ if $dir;
    }
    
    # now parse that file name
    my ($fname,$lname,$other) = parse_filename($filename);
    
    # anything following a dash in the file name is considered 
    # key skill data, so split it out and clean it up
    if ($other) {
      my @parts = split(/,\s+|\,|\s+/,$other);
      foreach my $p (@parts) {
         $p=~ s/resume//gi;   #strip out the word resume
         if ($p and length($p) > 1) {  # skip if only 1 character long
            $ks{lc($p)}++ if $p;
         }
      }
    }
    my @keys = keys %ks;
    %ks = ();
    # clean up the key skills, remove special chars and the word resume
    foreach my $p (@keys) {
       $p=~ s/resume//gi;   # strip out the word resume
       $p=~ s/[-&,]/ /g;    # strip special chars
       $p=~ s/^\s+//g;
       $p=~ s/\s+$//g;
       $ks{lc($p)}++ if $p;
    }
    
    # build the final key skills string, put the branch we found 
    # to the front so it always displays on candidate screens   
    my $keyskills = join (", ",keys %ks);
    if ($branch) {
       $keyskills = "$branch, " . $keyskills;
    }

    # create the CATS candidate record    

    my $sql = sprintf ("INSERT into cats.candidate (
                 first_name,
                 last_name,
                 source,
                 phone_home,
                 key_skills,
                 entered_by,
                 owner,
                 site_id,
                 date_created,
                 date_modified,
                 email1
                ) values (
                 %s,
                 %s,
                 'Unsolicited Resume',
                 %s,
                 %s,
                 1,
                 1,
                 1,
                 %s,
                 %s,
                 %s
                );",
                 $dbh->quote($fname),
                 $dbh->quote($lname),
                 $dbh->quote($phone),
                 $dbh->quote($keyskills),
                 $dbh->quote($moddate),
                 $dbh->quote($moddate),
                 $dbh->quote($email)
                );
    
    $dbh->do($sql) or die $dbh->errstr;
               
    my $candidateid = $dbh->{'mysql_insertid'};

    # Create the CATS attachment record 

    $sql = sprintf ("INSERT into cats.attachment (
                 data_item_type,
                 data_item_id,
                 title,
                 original_filename,
                 stored_filename,
                 content_type,
                 resume,
                 text,
                 site_id,
                 date_created,
                 date_modified
                 ) values (
                 100,
                 %s,
                 %s,
                 %s,
                 %s,
                 %s,
                 1,
                 %s,
                 1,
                 %s,
                 %s
                 );", 
                 $candidateid,
                 $dbh->quote($filename),
                 $dbh->quote($originalfname),
                 $dbh->quote($originalfname),
                 $dbh->quote($parser_type),
                 $dbh->quote($$content_ref),
                 $dbh->quote($moddate),
                 $dbh->quote($moddate)
                 );
                 
    $dbh->do($sql) or die $dbh->errstr;
                 
    my $attachid = $dbh->{'mysql_insertid'};

    # build the path name for the attachment file, then
    # create the directory, move the original resume to it,
    # and fixup ownership of the resume file
    
    my $newpath = "/srv/www/htdocs/cats/attachments/$attachid";
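    mkdir($newpath) or die "cannot create $newpath: $!";
    # list-form system() keeps odd characters in file names away from the shell
    system('mv', $path, $newpath) == 0 or warn "mv failed for $path\n";
    system('chown', '-R', 'wwwrun:www', $newpath) == 0 or warn "chown failed for $newpath\n";
    chmod 0755, $newpath;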

}

 
# Routine to parse the resume doc's filename

sub parse_filename {
    my ($filename) = @_;
    my ($fname,$lname,$other);
    
    # look for std pattern convention
    if ($filename =~ /^([^,]+)(\,\s?)([^,]+)(\s?-\s?)(.*)$/ ) {
       $lname=$1;
       $fname=$3;
       $other=$5;
       print "std $filename -> $lname, $fname - $other\n" if $DEBUG;
    } elsif ( $filename =~ /^([^\s,]+)\s+([^\s,-]+)(\s?-\s?)(.*)$/ ) {
       $fname=$1;
       $lname=$2;
       $other=$4;
       print "#2 $filename -> $lname, $fname - $other\n" if $DEBUG;
    } elsif ( $filename =~ /^([^\s,]+)\s+([^\s,]+)\s?,\s?(.*)$/ ) {
       $lname=$1;
       $fname=$2;
       $other=$3;
       print "#3 $filename -> $lname, $fname - $other\n" if $DEBUG;
    } elsif ( $filename =~ /^([^\s,]+)\s?,\s?([^\s]+)$/ ) {
       $lname=$1;
       my @fps = split (/[-,\s]/,$2);
       $fname=shift(@fps);
       $other = join(", ",@fps);
       print "#4 $filename -> $lname, $fname - $other\n" if $DEBUG;
    } elsif ( $filename =~ /^([^\s,]+)\s?,\s?(.*)$/ ) {
       $lname =$1;
       my @fps = split (/[-,\s]/,$2);
       $fname = shift(@fps);
       $other = join(", ",@fps);
       print "#5 $filename -> $lname, $fname - $other\n" if $DEBUG;
    } else {   
       $filename =~ s/[-+_]/ /g;     # change dashes, pluses and underscores to spaces
       $filename =~ s/^\s+//;        # avoid an empty leading field from split
    
       my @fps = split (/[,\s]+/,$filename);
       my $c = scalar(@fps);
       if ($c >= 2) {
          # first two words are taken as first and last name,
          # anything left over is treated as key skill data
          $fname = shift(@fps);
          $lname = shift(@fps);
          $other = join(", ",@fps) if @fps;
       } elsif ($c == 1) {
          $fname = $fps[0];
       }
       print "#6 with c at $c $filename -> $lname, $fname - $other\n" if $DEBUG;
    }

    # give up with some defaults if we couldn't figure out the
    # format of the name
    $fname = "APPLICANT" unless $fname;
    $lname = "RESUME" unless $lname;
    
    return ($fname,$lname,$other);
}
cats/parseresumes.txt · Last modified: 2007/02/07 22:09 by helphand