any perl performance experts awake?
Bill Campbell
linux-sxs at celestial.com
Wed Apr 10 17:15:01 PDT 2013
On Wed, Apr 10, 2013, Lonni J Friedman wrote:
>I've got a perl script that is used to parse data from one format into
>another format. It works fairly well 99% of the time, however when
>the data that its parsing is large, the performance of the script gets
>awful. Unfortunately, my perl skills are marginal at best, so I'm
>lost on how to debug this problem.
>
>For example, for 99% of the cases, there are less than 1k rows of data
>to parse, and it completes in less than 10 seconds. However, for the
>remaining 1%, there are over 150k rows, and the script takes hours
>(3+) to finish. I'm hoping that this is due to something inefficient
>in my perl, that can be fixed easily, but I'm not sure what that might
>be.
>
>The slow part of the script is this subroutine:
>######
>sub sqlInsert {
>    my ($fh, $app, $status, $entry, $table_testlist_csv_path, %hash_values) = @_;
>    my $now = strftime("%Y-%m-%d %H:%M:%S", localtime);
>    my $entryVals = join(',', map { "\"$$entry{$_}\"" } qw(suiteid regressionCL
>        cl os arch build_type branch gpu subtest osversion));
>    my $testid = $hash_values{$app};
>
>    # we need to add an escape character in front of all double quotes in a
>    # testname, or the dquotes will be stripped out when the SQL COPY occurs
>    $app =~ s/"/~"/g;
>    print $fh <<END;
>"$now","$app","$status","$testid",$entryVals
>END
>}
Somebody has already pointed out the cost of calling strftime/localtime
on every iteration. This reminds me of my first programming
experience in FORTRAN almost 50 years ago where the "I'm an
Engineer, not a Programmer and Proud of It" person who wrote the
program computed the square root of PI/2.0 every time in a
subroutine that was called over 20,000 times per run. I
calculated it once, put it in COMMON, and cut the run time from
30 minutes to 5 minutes.
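The same trick applies here: if every row in a batch gets the same
timestamp, compute it once before the loop instead of once per row.
A minimal sketch (the row data and field names are invented for
illustration):

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical rows standing in for the parsed data.
my @rows = map { { app => "test$_", status => 'PASS' } } 1 .. 3;

# Compute the timestamp once, before the loop, not once per row.
my $now = strftime("%Y-%m-%d %H:%M:%S", localtime);

for my $row (@rows) {
    # Reuse the cached $now for every row in this batch.
    print qq{"$now","$row->{app}","$row->{status}"\n};
}
```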
There are a few things you might do to improve this.
+ Use one of the database interfaces (DBI) available in Perl to connect
  to the database. It's been quite a while since I did this, as I'm
  primarily doing Python these days, so I don't remember the details,
  but the DBI libraries typically have facilities to properly quote
  values as necessary.
+ I think that most SQL databases have a now() function that will get
the current time, and that would probably be much more efficient than
doing it externally. I have a link to the PostgreSQL page on this
here.
http://www.postgresql.org/docs/8.2/static/functions-datetime.html
+ If the SQL back end has stored procedures, it might be most efficient
to have one handle the time automatically on insert.
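The first suggestion might look roughly like this. This is only a
sketch, using an in-memory SQLite database via DBD::SQLite as a
stand-in for the real back end (the table and column names are
invented); on PostgreSQL you would connect with DBD::Pg instead.
Placeholders quote values for you, which would also make the manual
s/"/~"/g escaping in sqlInsert unnecessary:

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database as a stand-in for the real back end
# (assumes DBD::SQLite is installed).
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{CREATE TABLE results (ts TEXT, app TEXT, status TEXT)});

# Prepare once; the ? placeholders quote values correctly, so no
# manual escaping of double quotes is needed.
my $sth = $dbh->prepare(
    q{INSERT INTO results (ts, app, status) VALUES (?, ?, ?)});

# Execute many times with different values; the statement is
# parsed by the database only once.
$sth->execute('2013-04-10 17:15:01', q{app with "quotes"}, 'PASS');

my ($app) = $dbh->selectrow_array(q{SELECT app FROM results});
print "$app\n";    # the embedded double quotes survive intact
```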
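For the second suggestion, letting the database supply the timestamp
removes the strftime/localtime call from Perl entirely. A hedged
sketch, again with an in-memory SQLite database standing in for the
real server (SQLite spells the function datetime('now'); on
PostgreSQL it would be now()):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });

$dbh->do(q{CREATE TABLE results (ts TEXT, app TEXT)});

# The database fills in the time itself: datetime('now') here,
# now() on PostgreSQL. Perl never formats a timestamp.
my $sth = $dbh->prepare(
    q{INSERT INTO results (ts, app) VALUES (datetime('now'), ?)});
$sth->execute('sometest');

my ($ts) = $dbh->selectrow_array(q{SELECT ts FROM results});
print "$ts\n";
```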
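And for the third suggestion, short of a full stored procedure, a
column default gets the same effect: the server stamps every row on
insert and the INSERT statement never mentions the timestamp at all.
A sketch with the same in-memory SQLite stand-in (column names
invented):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1 });

# The DEFAULT clause fills in the time server-side on every insert.
$dbh->do(q{
    CREATE TABLE results (
        ts  TEXT DEFAULT CURRENT_TIMESTAMP,
        app TEXT
    )
});

# Note: no timestamp in the INSERT at all.
$dbh->do(q{INSERT INTO results (app) VALUES (?)}, undef, 'sometest');

my ($ts, $app) = $dbh->selectrow_array(q{SELECT ts, app FROM results});
print "$ts $app\n";
```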
Bill
--
INTERNET: bill at celestial.com Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/ PO Box 820; 6641 E. Mercer Way
Voice: (206) 236-1676 Mercer Island, WA 98040-0820
Fax: (206) 232-9186 Skype: jwccsllc (206) 855-5792
Property must be secured, or liberty cannot exist. -- John Adams