cats:parseresumes [2007/02/07 21:02] helphand |
cats:parseresumes [2007/02/07 22:09] (current) helphand |
||
---|---|---|---|
+ | ^ :!: This Documentation Applies to CATS Version 0.6.1 :!: ^ | ||
+ | | The CATS Team has since released new versions; the material documented here will likely not work on them without modification. | ||
+ | |||
+ | ===== Automatically Parse and Add Resumes to CATS ===== | ||
+ | Any new CATS installation with an existing inventory of candidate resumes faces the problem of getting all those resumes into CATS. The thought of manually entering thousands, or tens of thousands, of candidates and their associated resumes can be overwhelming to the point of dropping the idea of installing CATS at all. | ||
+ | |||
+ | If you are technically inclined, or have someone on staff who is, and your existing inventory of resumes is organized in a manner similar to ours, you may be able to adapt the solution we used to load over 30,000 candidates and resumes into our CATS installation. The determining factors are: | ||
+ | |||
+ | - Are the resumes in doc, pdf, txt, htm or rtf format? | ||
+ | - Are they saved as a single file per candidate? | ||
+ | - Is the candidate' | ||
+ | - Do you have perl installed or do you have the technical expertise available to get it installed? | ||
+ | |||
+ | |||
+ | If you can answer yes to all of the above, chances are you can adapt our script to work with your setup. Be warned, though: technical expertise is required, as is knowledge of your own setup. If you are a novice or a regular end-user, this journey is not for you :-( | ||
+ | |||
+ | ==== Our Setup ==== | ||
+ | Before installing CATS, our offices kept their resumes on a shared drive. Each office had a folder named after the office, and in that folder was a Resumes folder where staff dropped incoming resumes.< | ||
+ | losangeles | ||
+ | |-- Resumes | ||
+ | |-- controller | ||
+ | | |-- Smith, John.doc | ||
+ | | |-- Doe, Mary.doc | ||
+ | |-- CFOs | ||
+ | | |-- Wilder, Billy.doc | ||
+ | | |-- Cravits, Henri.doc | ||
+ | |-- Sr Accountants | ||
+ | | |-- Nguyen, Tron.doc | ||
+ | | |-- Collins, Francis.pdf | ||
+ | |-- Bookkeepers | ||
+ | |-- Jones, Tom.txt | ||
+ | |-- Jones, Mary Lou.htm | ||
+ | </ | ||
+ | |||
+ | Within the Resumes folder, they had sub-folders categorizing each candidate's primary skill area. They would drop the candidate's resume into the sub-folder that best matched that candidate's skill. If necessary, they would rename the file using the convention Last Name, First Name - optional key skill data.ext. Because this pre-existing system had some structure to it, it could easily be used to grab the resumes and build CATS candidate records to load the database. | ||
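The naming convention described above can be sketched in a few lines of Perl. This is a minimal illustration only: the helper name `split_resume_name` and its splitting rules are assumptions made for this example, not the patterns used by the script further down the page.

```perl
use strict;
use warnings;

# Hypothetical helper: split "Last, First - key skills.ext" into its parts.
# The rules below are illustrative assumptions, not cats_parser.pl's regexes.
sub split_resume_name {
    my ($file) = @_;

    # Peel off the extension (the text after the last dot), if any.
    my $ext = '';
    if ( $file =~ s/\.([^.]+)$// ) {
        $ext = lc $1;
    }

    # Everything after the first " - " is optional key skill data.
    my ( $name, $skills ) = split /\s+-\s+/, $file, 2;
    $name   = '' unless defined $name;
    $skills = '' unless defined $skills;

    # "Last, First" is split on the first comma.
    my ( $last, $first ) = split /\s*,\s*/, $name, 2;
    $last  = '' unless defined $last;
    $first = '' unless defined $first;

    return ( $last, $first, $skills, $ext );
}

my @parts = split_resume_name('Jones, Mary Lou - payroll, quickbooks.htm');
print join( ' | ', @parts ), "\n";
```

Because the dash must be surrounded by whitespace to count as a separator, a hyphenated surname such as "Smith-Jones" stays in one piece.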
+ | |||
+ | Our CATS system resides on a SUSE Linux box, so the first step was to use CIFS to create mount points on the Linux box for the appropriate shares on the Windows file server. The Linux box already had a machine account in ActiveDirectory, | ||
+ | /mnt | ||
+ | | ||
+ | |--losangeles | ||
+ | | |-- Resumes | ||
+ | | |-- controller | ||
+ | | | |-- Smith, John.doc | ||
+ | | | |-- Doe, Mary.doc | ||
+ | | |-- CFOs | ||
+ | | | |-- Wilder, Billy.doc | ||
+ | | | |-- Cravits, Henri.doc | ||
+ | | |-- Sr Accountants | ||
+ | | | |-- Nguyen, Tron.doc | ||
+ | | | |-- Collins, Francis.pdf | ||
+ | | |-- Bookkeepers | ||
+ | | |-- Jones, Tom.txt | ||
+ | | |-- Jones, Mary Lou.htm | ||
+ | |--sandiego | ||
+ | | |-- Resumes | ||
+ | | |-- controller | ||
+ | | | |-- Williams, Beth.doc | ||
+ | | |-- CFOs | ||
+ | | | |-- Smith, George.doc | ||
+ | | | |-- Welsh, John.doc | ||
+ | | |-- Sr Accountants | ||
+ | | | |-- Orwell, George.doc | ||
+ | | |-- Bookkeepers | ||
+ | | |-- Carter, Barbara.txt | ||
+ | |--sanfrancisco | ||
+ | |-- Resumes | ||
+ | |-- controller | ||
+ | | ||
+ | | ||
+ | |-- CFOs | ||
+ | | ||
+ | |-- Sr Accountants | ||
+ | | ||
+ | | ||
+ | |-- Bookkeepers | ||
+ | |-- Gates, Will.txt | ||
+ | |-- Skywalker, Luke.htm | ||
+ | |||
+ | </ | ||
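With the tree laid out this way, the office name and the skill folder can be read straight off a file's path. The sketch below assumes the layout shown above (office folder, then `Resumes`, then skill sub-folders) and Unix-style paths; `branch_and_skills` is a hypothetical helper written for this example, not a routine from the script.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: given a path such as
#   /mnt/losangeles/Resumes/CFOs/Wilder, Billy.doc
# return the branch name followed by the skill folders below "Resumes".
sub branch_and_skills {
    my ($path) = @_;
    my ( undef, $dirpart, undef ) = File::Spec->splitpath($path);
    my @dirs = grep { length } File::Spec->splitdir($dirpart);

    my ( @skills, $found );
    # Walk up from the file's folder until the Resumes folder is reached.
    while ( defined( my $dir = pop @dirs ) ) {
        if ( lc $dir eq 'resumes' ) { $found = 1; last }
        unshift @skills, lc $dir;
    }
    return ('') unless $found;    # not under a Resumes folder

    # The folder just above Resumes is the office (branch) name.
    my $branch = @dirs ? lc $dirs[-1] : '';
    return ( $branch, @skills );
}

my ( $branch, @skills ) =
  branch_and_skills('/mnt/losangeles/Resumes/CFOs/Wilder, Billy.doc');
print "branch=$branch skills=@skills\n";
```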
+ | |||
+ | |||
+ | |||
+ | ==== cats_parser.pl script ==== | ||
+ | |||
+ | With the above defined structure, we can run the following perl script, cats_parser.pl, | ||
+ | #perl cats_parser.pl / | ||
+ | </ | ||
+ | |||
+ | Note that this Perl script requires perl to be installed, as well as several modules: DBI, DBD::mysql, POSIX, File::Find, File:: | ||
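Before running anything against real data, it may help to confirm that the modules load. The following is a small sketch, not part of the original script; `missing_modules` is a hypothetical helper, and the list passed to it contains only module names that are fully legible on this page.

```perl
use strict;
use warnings;

# Return the names of the modules that fail to load.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # Turn Foo::Bar into Foo/Bar.pm so require can take it as a string.
        ( my $file = "$module.pm" ) =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

my @missing = missing_modules(qw(DBI DBD::mysql POSIX File::Find File::Spec));
print @missing ? "Missing: @missing\n" : "All listed modules load\n";
```

Anything reported as missing can usually be installed from CPAN or from your distribution's packages.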
+ | |||
+ | In addition to the actual script, you must have the required converters installed and working on your system. The converters include: | ||
+ | |||
+ | * antiword - available from [[http:// | ||
+ | * pdftotext - part of the xpdf package [[http:// | ||
+ | * html2text - available from [[http:// | ||
+ | * rtf-converter - available from [[http:// | ||
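A quick way to see whether the converters are reachable is to look for them on the `PATH`. This sketch is an addition for illustration: `find_on_path` is a hypothetical helper, and the executable names are assumed to match the tool names listed above. The RTF converter is left out because its executable name is not given here.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of an executable found on PATH, or undef.
sub find_on_path {
    my ($program) = @_;
    for my $dir ( File::Spec->path ) {
        my $candidate = File::Spec->catfile( $dir, $program );
        return $candidate if -f $candidate && -x _;
    }
    return;
}

for my $tool (qw(antiword pdftotext html2text)) {
    my $where = find_on_path($tool);
    printf "%-10s %s\n", $tool, defined $where ? $where : 'NOT FOUND';
}
```

If a converter lives outside the `PATH`, the script's hard-coded converter locations need to point at wherever it is installed.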
+ | |||
+ | < | ||
+ | # | ||
+ | use strict; | ||
+ | # | ||
+ | # cats_parser.pl parses resumes and loads them into CATS system. | ||
+ | # Copyright (C) 2006 Scott W. Leighton | ||
+ | # | ||
+ | # This program is free software; you can redistribute it and/or | ||
+ | # modify it under the terms of the GNU General Public License | ||
+ | # as published by the Free Software Foundation; either version 2 | ||
+ | # of the License, or (at your option) any later version. | ||
+ | # | ||
+ | # This program is distributed in the hope that it will be useful, | ||
+ | # but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
+ | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
+ | # GNU General Public License for more details. | ||
+ | # | ||
+ | # You should have received a copy of the GNU General Public License | ||
+ | # along with this program; if not, write to the Free Software | ||
+ | # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. | ||
+ | # | ||
+ | |||
+ | use File::Find; | ||
+ | use File:: | ||
+ | use File::Spec; | ||
+ | use DBI; | ||
+ | use POSIX qw(strftime); | ||
+ | |||
+ | my $DEBUG = 0; | ||
+ | |||
+ | my $dbh=DBI-> | ||
+ | | ||
+ | | ||
+ | |||
+ | |||
+ | |||
+ | # Subroutine to validate file names: return true if file is ok to process | ||
+ | # or false to skip the file. | ||
+ | |||
+ | sub check_path { | ||
+ | my $path = shift; | ||
+ | return 1 if $path =~ / | ||
+ | return 0; # return false to skip | ||
+ | } | ||
+ | |||
+ | sub check_dir { | ||
+ | my $dir = shift; | ||
+ | return 0 if m[^\.]; | ||
+ | return 1 if m!^/ | ||
+ | return 1 if m!^/ | ||
+ | return 0 unless m!^/ | ||
+ | return 1; # return true to process this directory | ||
+ | } | ||
+ | |||
+ | |||
+ | |||
+ | find( | ||
+ | { | ||
+ | wanted => \& | ||
+ | no_chdir => 1, | ||
+ | follow => 1, | ||
+ | }, | ||
+ | @ARGV, | ||
+ | ); | ||
+ | |||
+ | sub wanted { | ||
+ | my $path = $File:: | ||
+ | |||
+ | if ( -d ) { | ||
+ | if ( !check_dir( $path ) ) { | ||
+ | $File:: | ||
+ | } | ||
+ | return; | ||
+ | } | ||
+ | |||
+ | if ( !-r _ ) { | ||
+ | warn " | ||
+ | return; | ||
+ | } | ||
+ | |||
+ | |||
+ | my $mtime = (stat _ )[9]; | ||
+ | |||
+ | if ( !check_path( $path ) ) { | ||
+ | print " | ||
+ | return; | ||
+ | } | ||
+ | |||
+ | print " | ||
+ | |||
+ | # Otherwise, fetch document | ||
+ | process_file( $path, $mtime ); | ||
+ | |||
+ | } | ||
+ | |||
+ | sub process_file { | ||
+ | my ( $path, $mtime ) = @_; | ||
+ | |||
+ | | ||
+ | my $contenttype; | ||
+ | my $content; | ||
+ | |||
+ | my ($filename, | ||
+ | |||
+ | $suffix=lc($suffix); | ||
+ | | ||
+ | my $qpath = $dbh-> | ||
+ | |||
+ | # Convert to plain text | ||
+ | |||
+ | if ($suffix eq ' | ||
+ | $contenttype=' | ||
+ | $content= `/ | ||
+ | } elsif ($suffix eq ' | ||
+ | $contenttype=' | ||
+ | $content= `/ | ||
+ | } elsif ($suffix eq ' | ||
+ | $contenttype=' | ||
+ | $content= `/ | ||
+ | } elsif ($suffix eq ' | ||
+ | $contenttype=' | ||
+ | $content= `cat $qpath`; | ||
+ | } elsif ($suffix eq ' | ||
+ | $contenttype=' | ||
+ | $content= `cat $qpath`; | ||
+ | } else { | ||
+ | $contenttype=' | ||
+ | $content= `cat $qpath`; | ||
+ | } | ||
+ | |||
+ | # look through the plain text version for an email address and phone number | ||
+ | |||
+ | my $workarea = $content; | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | my $email; | ||
+ | if ($workarea =~ m/ | ||
+ | $email = $1; | ||
+ | } | ||
+ | |||
+ | | ||
+ | my $phone; | ||
+ | if ( $workarea =~ m/( | ||
+ | | ||
+ | \d{3} # area code required | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | \d{3} # 3-digit prefix | ||
+ | | ||
+ | | ||
+ | | ||
+ | \d{4} # 4-digit line number | ||
+ | )/isx ) { | ||
+ | | ||
+ | } else { | ||
+ | if ( | ||
+ | m/( | ||
+ | (?: | ||
+ | | ||
+ | | ||
+ | )? | ||
+ | [0-9]{3}-? | ||
+ | )/x ) { | ||
+ | | ||
+ | } | ||
+ | |||
+ | } | ||
+ | |||
+ | my $bytes = output_document( $path, \$content, $mtime, $contenttype, | ||
+ | |||
+ | |||
+ | } | ||
+ | |||
+ | |||
+ | sub output_document { | ||
+ | my ( $path, $content_ref, | ||
+ | |||
+ | my $moddate = strftime " | ||
+ | | ||
+ | my ($filename, | ||
+ | my $originalfname = " | ||
+ | my %ks; | ||
+ | my $branch; | ||
+ | | ||
+ | # grab the branch name out of the path | ||
+ | |||
+ | $path =~ m!^/ | ||
+ | $branch=lc($1) if $1; | ||
+ | |||
+ | # save the relevant portions of the path as key skills | ||
+ | # by stripping out any sub-folder names below the Resumes | ||
+ | # folder and using them as a key skill item | ||
+ | my @dirs = File:: | ||
+ | my $discard = pop(@dirs); | ||
+ | while (my $dir = pop(@dirs)) { | ||
+ | last if $dir =~ / | ||
+ | | ||
+ | } | ||
+ | | ||
+ | # now parse that file name | ||
+ | my ($fname, | ||
+ | | ||
+ | # anything following a dash in the file name is considered | ||
+ | # key skill data, so split it out and clean it up | ||
+ | if ($other) { | ||
+ | my @parts = split(/, | ||
+ | foreach my $p (@parts) { | ||
+ | $p=~ s/ | ||
+ | if ($p and length($p) > 1) { # skip if only 1 character long | ||
+ | $ks{lc($p)}++; | ||
+ | } | ||
+ | } | ||
+ | } | ||
+ | my @keys = keys %ks; | ||
+ | %ks = (); | ||
+ | # clean up the key skills, remove special chars and the word resume | ||
+ | foreach my $p (@keys) { | ||
+ | $p=~ s/ | ||
+ | $p=~ s/ | ||
+ | $p=~ s/^\s+//g; | ||
+ | $p=~ s/\s+$//g; | ||
+ | | ||
+ | } | ||
+ | | ||
+ | # build the final key skills string, put the branch we found | ||
+ | # to the front so it always displays on candidate screens | ||
+ | my $keyskills = join (", ",keys %ks); | ||
+ | if ($branch) { | ||
+ | | ||
+ | } | ||
+ | |||
+ | # create the CATS candidate record | ||
+ | |||
+ | my $sql = sprintf (" | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | ) values ( | ||
+ | %s, | ||
+ | %s, | ||
+ | ' | ||
+ | %s, | ||
+ | %s, | ||
+ | 1, | ||
+ | 1, | ||
+ | 1, | ||
+ | %s, | ||
+ | %s, | ||
+ | %s | ||
+ | );", | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | ); | ||
+ | | ||
+ | $dbh-> | ||
+ | |||
+ | my $candidateid = $dbh-> | ||
+ | |||
+ | # Create the CATS attachment record | ||
+ | |||
+ | $sql = sprintf (" | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | text, | ||
+ | | ||
+ | | ||
+ | | ||
+ | ) values ( | ||
+ | 100, | ||
+ | %s, | ||
+ | %s, | ||
+ | %s, | ||
+ | %s, | ||
+ | %s, | ||
+ | 1, | ||
+ | %s, | ||
+ | 1, | ||
+ | %s, | ||
+ | %s | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | ); | ||
+ | |||
+ | $dbh-> | ||
+ | |||
+ | my $attachid = $dbh-> | ||
+ | |||
+ | # build the path name for the attachment file, then | ||
+ | # create the directory, move the original resume to it, | ||
+ | # and fixup ownership of the resume file | ||
+ | | ||
+ | my $newpath = "/ | ||
+ | mkdir ($newpath) or warn "Could not create $newpath: $!\n"; | ||
+ | my $qpath = $dbh-> | ||
+ | my $qnewpath = $dbh-> | ||
+ | `mv $qpath $qnewpath`; | ||
+ | `chown -R wwwrun:www $qnewpath`; | ||
+ | chmod 0755, $qnewpath; | ||
+ | |||
+ | } | ||
+ | |||
+ | |||
+ | # Routine to parse the resume doc's filename | ||
+ | |||
+ | sub parse_filename { | ||
+ | my ($filename) = @_; | ||
+ | my ($fname, | ||
+ | | ||
+ | # look for std pattern convention | ||
+ | if ($filename =~ / | ||
+ | | ||
+ | | ||
+ | | ||
+ | print "std $filename -> $lname, $fname - $other\n" | ||
+ | } elsif ( $filename =~ / | ||
+ | | ||
+ | | ||
+ | | ||
+ | print "#2 $filename -> $lname, $fname - $other\n" | ||
+ | } elsif ( $filename =~ / | ||
+ | | ||
+ | | ||
+ | | ||
+ | print "#3 $filename -> $lname, $fname - $other\n" | ||
+ | } elsif ( $filename =~ / | ||
+ | | ||
+ | my @fps = split (/ | ||
+ | | ||
+ | | ||
+ | print "#4 $filename -> $lname, $fname - $other\n" | ||
+ | } elsif ( $filename =~ / | ||
+ | | ||
+ | my @fps = split (/ | ||
+ | | ||
+ | | ||
+ | print "#5 $filename -> $lname, $fname - $other\n" | ||
+ | } else { | ||
+ | | ||
+ | | ||
+ | my @fps = split (/, | ||
+ | my $c = scalar(@fps); | ||
+ | if ($c>0) { | ||
+ | if ($c == 2) { | ||
+ | | ||
+ | | ||
+ | } elsif ($c > 1) { | ||
+ | | ||
+ | for (my $i=1;$i < $c;$i++) { | ||
+ | $other .= $other?", | ||
+ | } | ||
+ | } else { | ||
+ | | ||
+ | } | ||
+ | } | ||
+ | print "#6 with c at $c $filename -> $lname, $fname - $other\n" | ||
+ | } | ||
+ | |||
+ | # give up with some defaults if we couldn' | ||
+ | # format of the name | ||
+ | $fname = " | ||
+ | $lname = " | ||
+ | | ||
+ | return ($fname, | ||
+ | } | ||
+ | |||
+ | </ | ||
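One adaptation worth considering: as far as the listing shows, the converters are run with backticks and a path prepared through `$dbh`, which passes file names through a shell. Names such as `Jones, Mary Lou.htm` contain spaces and commas, and others may contain quotes. A possible alternative, sketched here under the assumption of a Unix system and a reasonably modern Perl, is the list form of a piped `open`, which bypasses the shell entirely. `capture_output` is a hypothetical helper, not part of the original script.

```perl
use strict;
use warnings;

# Run a command without a shell and return everything it prints.
# Arguments are passed as a list, so spaces, commas and quotes in a
# file name need no escaping.
sub capture_output {
    my ( $command, @args ) = @_;
    open( my $pipe, '-|', $command, @args )
      or die "Cannot run $command: $!\n";
    local $/;    # slurp mode: read all output at once
    my $output = <$pipe>;
    close $pipe;
    return defined $output ? $output : '';
}

# Hypothetical usage; the converter name and arguments are assumptions:
# my $content = capture_output( 'antiword', $path );
```

The same list-style call works for `mv` and `chown` through `system( 'mv', $path, $newpath )`, again without any quoting step.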
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||