any perl performance experts awake?
Lonni J Friedman
netllama at gmail.com
Thu Apr 11 09:13:09 PDT 2013
On Thu, Apr 11, 2013 at 8:43 AM, Andrew Gould <andrewlylegould at gmail.com> wrote:
> On Thu, Apr 11, 2013 at 10:10 AM, Lonni J Friedman <netllama at gmail.com>
> wrote:
>>
>> On Wed, Apr 10, 2013 at 5:15 PM, Bill Campbell <linux-sxs at celestial.com>
>> wrote:
>> > On Wed, Apr 10, 2013, Lonni J Friedman wrote:
>> >>I've got a perl script that is used to parse data from one format into
>> >>another format. It works fairly well 99% of the time; however, when
>> >>the data it's parsing is large, the performance of the script gets
>> >>awful. Unfortunately, my perl skills are marginal at best, so I'm
>> >>lost on how to debug this problem.
>> >>
>> >>For example, for 99% of the cases, there are less than 1k rows of data
>> >>to parse, and it completes in less than 10 seconds. However, for the
>> >>remaining 1%, there are over 150k rows, and the script takes hours
>> >>(3+) to finish. I'm hoping that this is due to something inefficient
>> >>in my perl, that can be fixed easily, but I'm not sure what that might
>> >>be.
>> >>
>> >>The slow part of the script is this subroutine:
>> >>######
>> >>sub sqlInsert {
>> >> my ($fh, $app, $status, $entry,
>> >>     $table_testlist_csv_path, %hash_values) = @_;
>> >> my $now = strftime("%Y-%m-%d %H:%M:%S", localtime);
>> >> my $entryVals = join(',', map { "\"$$entry{$_}\"" } qw(suiteid
>> >>     regressionCL cl os arch build_type branch gpu subtest osversion));
>> >> my $testid = $hash_values{$app};
>> >>
>> >> # we need to add an escape character in front of all double quotes in
>> >> # a testname, or the dquotes will be stripped out when the SQL COPY
>> >> # occurs
>> >> $app =~ s/"/~"/g;
>> >> print $fh <<END;
>> >>"$now","$app","$status","$testid",$entryVals
>> >>END
>> >>}
>> >
>> > Somebody has already pointed out the use of the strftime/localtime
>> > for every iteration. This reminds me of my first programming
>> > experience in FORTRAN almost 50 years ago where the "I'm an
>> > Engineer, not a Programmer and Proud of It" person who wrote the
>> > program computed the square root of PI/2.0 every time in a
>> > subroutine that was called over 20,000 times per run. I
>> > calculated it once, put it in COMMON, and cut the run time from
>> > 30 minutes to 5 minutes.
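The same hoisting applies directly to the script above: the timestamp is identical for every row of a single run, so strftime() can be called once before the loop instead of inside sqlInsert(). A minimal sketch of the idea — the row data, the loop, and the in-memory output handle are stand-ins, since the calling code wasn't posted:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical row data standing in for the parsed input.
my @rows = ({ app => 'test1', status => 'PASS' },
            { app => 'test2', status => 'FAIL' });

# Computed ONCE, before the loop, rather than once per row.
my $now = strftime("%Y-%m-%d %H:%M:%S", localtime);

my $out = '';
open(my $fh, '>', \$out) or die $!;   # in-memory handle stands in for the real file
for my $row (@rows) {
    # reuse the precomputed timestamp for every row
    print $fh qq{"$now","$row->{app}","$row->{status}"\n};
}
close $fh;
print $out;
```

For 150k rows this saves 150k strftime()/localtime() calls, though as the thread goes on to establish, it turns out not to be the dominant cost here.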
>> >
>> > There are a few things you might do to improve this.
>> >
>> > + Use one of the database interfaces (DBI) available in Perl to
>> >   connect to the database.  It's been quite a while since I did this
>> >   as I'm primarily doing Python these days, so I don't remember the
>> >   details.  The DBI libraries typically have facilities to properly
>> >   quote as necessary.
>>
>> Yeah, I'm aware of that, but unfortunately this script has to run on a
>> large number of environments (some of which are not Linux), and
>> getting those modules installed is a huge PITA.  Also, the bottleneck
>> in the script isn't the database queries, it's writing out a file
>> locally, so making this change wouldn't help regardless.
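If the local file write really is the bottleneck, one cheap experiment is batching: build the whole COPY payload in memory and write it with a single print, instead of one print per row. A hedged sketch — the row contents and temp-file handling here are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical rows; the real script would accumulate its COPY lines here.
my @rows = map { qq{"row$_","PASS"\n} } 1 .. 5;

# Join everything into one buffer, then write it in a single print call,
# rather than issuing one print per row (150k prints in the slow case).
my $buffer = join('', @rows);

my ($fh, $path) = tempfile();
print $fh $buffer;
close $fh;

# Read it back just to confirm the round trip.
open(my $in, '<', $path) or die $!;
my @back = <$in>;
close $in;
unlink $path;
print scalar(@back), "\n";
```

Perl's filehandles are buffered by default, so the gain may be modest, but it isolates whether per-row I/O is the cost at all.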
>>
>> >
>> > + I think that most SQL databases have a now() function that will get
>> >   the current time, and that would probably be much more efficient
>> >   than doing it externally.  I have a link to the PostgreSQL page on
>> >   this here.
>> >
>> > http://www.postgresql.org/docs/8.2/static/functions-datetime.html
>>
>> Yup, and we actually have now() as the default for the column in
>> question. Unfortunately (or perhaps fortunately), I'm not the
>> original author of this script; I've inherited the mess and need to
>> maintain it on top of 3829483 other responsibilities. At some point I
>> should determine what the added overhead is to letting the database
>> figure out now() for the millions of rows we insert each day, rather
>> than pre-calculating it on the clients.
>>
>> >
>> > + If the SQL back end has stored procedures, it might be most
>> >   efficient to have one handle the time automatically on insert.
>>
>> That's what setting a now() default does for a column (at least in
>> PostgreSQL).
>>
>> Now that I've determined that the client side timestamp calculation
>> isn't the bottleneck, what else can I look at next?
>>
>
> Does the script read in an entire data source file and parse each line? Or
> does it read one line at a time and parse/write it prior to reading the next
> line? If the entire source file is being read into memory, could it be
> causing a bottleneck?
The script reads in an entire data source file, parsing line by line,
putting the data into a hash (%hash_values). Once that is completed,
the hash is passed to sqlInsert(). So everything is already read into
memory at the point in time when performance tanks. I'd expect that
this would be the fast path, since it never needs to read from disk.
All of my systems have 2+GB RAM, and the data in question is always
less than 30MB, so I can't imagine that this would be a swap issue, if
that's what you mean? Unless querying a key/value pair in a hash is
not a good performance path in perl?
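Hash lookups themselves are O(1) in Perl, so querying a key/value pair is not the problem. One thing worth checking, though (an assumption, since the calling code wasn't posted): sqlInsert() receives %hash_values in its argument list, so Perl flattens the entire hash into @_ and copies every key/value pair on every call. With 150k rows, that per-call copy makes the whole run quadratic. Passing a reference avoids the copy; a sketch with a hypothetical helper:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: pass a hash REFERENCE instead of flattening the hash into @_.
# With  my (..., %hash_values) = @_;  Perl copies the whole hash on each
# call; with a reference only one scalar is passed.  The helper name and
# data below are made up; they mirror the $testid lookup in the posted sub.
sub lookup_by_ref {
    my ($app, $hash_ref) = @_;    # one scalar passed, no hash copy
    return $hash_ref->{$app};     # lookup through the ref is still O(1)
}

my %hash_values = (foo => 101, bar => 202);
my $testid = lookup_by_ref('foo', \%hash_values);
print "$testid\n";
```

In the real script this would mean calling sqlInsert(..., \%hash_values) and changing the last parameter to a scalar $hash_ref, leaving every lookup as $hash_ref->{$app}.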
More information about the Linux-users mailing list