any perl performance experts awake?

Lonni J Friedman netllama at gmail.com
Thu Apr 11 08:10:48 PDT 2013


On Wed, Apr 10, 2013 at 5:15 PM, Bill Campbell <linux-sxs at celestial.com> wrote:
> On Wed, Apr 10, 2013, Lonni J Friedman wrote:
>>I've got a perl script that is used to parse data from one format into
>>another format.  It works fairly well 99% of the time; however, when
>>the data it's parsing is large, the performance of the script gets
>>awful.  Unfortunately, my Perl skills are marginal at best, so I'm
>>lost on how to debug this problem.
>>
>>For example, for 99% of the cases, there are less than 1k rows of data
>>to parse, and it completes in less than 10 seconds.  However, for the
>>remaining 1%, there are over 150k rows, and the script takes hours
>>(3+) to finish.  I'm hoping that this is due to something inefficient
>>in my perl, that can be fixed easily, but I'm not sure what that might
>>be.
>>
>>The slow part of the script is this subroutine:
>>######
>>sub sqlInsert {
>>    my ($fh, $app, $status, $entry, $table_testlist_csv_path,%hash_values) = @_;
>>    my $now=strftime("%Y-%m-%d %H:%M:%S", localtime) ;
>>    my $entryVals = join(',', map { "\"$$entry{$_}\""} qw(suiteid
>>regressionCL cl os arch build_type branch gpu subtest osversion));
>>    my $testid = $hash_values{$app} ;
>>
>>    # we need to add an escape character in front of all double quotes
>>    # in a testname, or the dquotes will be stripped out when the SQL
>>    # COPY occurs
>>    $app =~ s/"/~"/g ;
>>    print $fh <<END;
>>"$now","$app","$status","$testid",$entryVals
>>END
>>}
>
> Somebody has already pointed out the use of the strftime/localtime
> for every iteration.  This reminds me of my first programming
> experience in FORTRAN almost 50 years ago where the "I'm an
> Engineer, not a Programmer and Proud of It" person who wrote the
> program computed the square root of PI/2.0 every time in a
> subroutine that was called over 20,000 times per run.  I
> calculated it once, put it in COMMON, and cut the run time from
> 30 minutes to 5 minutes.
>
> There are a few things you might do to improve this.
>
>    + Use one of the database interfaces (DBI) available in Perl to connect
>      to the database.  It's been quite a while since I did this as I'm
>      primarily doing Python these days so I don't remember the details.
>      The DBI libraries typically have facilities to properly quote as
>      necessary.

Yeah, I'm aware of that, but unfortunately this script has to run in a
large number of environments (some of which are not Linux), and
getting those modules installed is a huge PITA.  Also, the bottleneck
in the script isn't the database queries, it's writing out a file
locally, so making this change wouldn't help regardless.
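
That said, Bill's hoist-the-invariant point is cheap to try even
without DBI.  Something like this (just a sketch -- the loop, file
path, and row names are invented for illustration, not from the real
script) computes the timestamp once per run instead of once per row:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Compute the timestamp once, before the per-row loop, instead of
# calling strftime() inside sqlInsert() for every row.
my $now = strftime("%Y-%m-%d %H:%M:%S", localtime);

open my $fh, '>', '/tmp/rows.csv' or die "open: $!";
for my $row (1 .. 3) {                  # stand-in for the real row loop
    # pass the precomputed $now down instead of recomputing it
    print $fh qq{"$now","app$row","PASS"\n};
}
close $fh;
```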

>
>    + I think that most SQL databases have a now() function that will get
>      the current time, and that would probably be much more efficient than
>      doing it externally.  I have a link to the PostgreSQL page on this
>      here.
>                 http://www.postgresql.org/docs/8.2/static/functions-datetime.html

Yup, and we actually have now() as the default for the column in
question.  Unfortunately (or perhaps fortunately), I'm not the
original author of this script, I've inherited the mess, and need to
maintain it on top of 3829483 other responsibilities.  At some point I
should determine what the added overhead is to letting the database
figure out now() for the millions of rows we insert each day, rather
than pre-calculating it on the clients.
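
If I ever find the time, Perl's core Benchmark module should answer
the client-side half of that question quickly.  A sketch (the 150,000
iteration count is my guess at the bad case, not a measured figure):

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Benchmark qw(timethese);

# Precompute one timestamp to reuse in the "cached" case.
my $cached = strftime("%Y-%m-%d %H:%M:%S", localtime);

# Compare recomputing the timestamp per "row" against reusing one
# precomputed value over roughly a worst-case number of rows.
timethese(150_000, {
    per_row => sub { my $t = strftime("%Y-%m-%d %H:%M:%S", localtime) },
    cached  => sub { my $t = $cached },
});
```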

>
>    + If the SQL back end has stored procedures, it might be most efficient
>      to have one handle the time automatically on insert.

That's what setting a now() default does for a column (at least in PostgreSQL).

Now that I've determined that the client-side timestamp calculation
isn't the bottleneck, what else can I look at next?
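
One suspect I do plan to chase: sqlInsert() takes %hash_values in its
argument list, so Perl flattens and copies every key/value pair of
that hash on each call.  With 150k calls against a big hash, that copy
alone could add up.  Passing a reference copies only one scalar per
call.  A sketch (the names here are invented for illustration, not
from the real script):

```perl
use strict;
use warnings;

# Instead of:  sqlInsert($fh, $app, ..., %hash_values);  # copies every pair
# pass a reference, which costs one scalar per call:
sub sqlInsert_ref {
    my ($app, $hash_ref) = @_;
    return $hash_ref->{$app};           # look up the testid via the ref
}

my %hash_values = (app1 => 101, app2 => 202);
my $testid = sqlInsert_ref('app2', \%hash_values);
print "$testid\n";                      # prints 202
```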

thanks!


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama at gmail.com
LlamaLand                       https://netllama.linux-sxs.org
