any perl performance experts awake?

Andrew Gould andrewlylegould at gmail.com
Thu Apr 11 08:43:56 PDT 2013


On Thu, Apr 11, 2013 at 10:10 AM, Lonni J Friedman <netllama at gmail.com> wrote:

> On Wed, Apr 10, 2013 at 5:15 PM, Bill Campbell <linux-sxs at celestial.com>
> wrote:
> > On Wed, Apr 10, 2013, Lonni J Friedman wrote:
> >>I've got a perl script that is used to parse data from one format into
> >>another format.  It works fairly well 99% of the time, however when
> >>the data it's parsing is large, the performance of the script gets
> >>awful.  Unfortunately, my perl skills are marginal at best, so I'm
> >>lost on how to debug this problem.
> >>
> >>For example, for 99% of the cases, there are less than 1k rows of data
> >>to parse, and it completes in less than 10 seconds.  However, for the
> >>remaining 1%, there are over 150k rows, and the script takes hours
> >>(3+) to finish.  I'm hoping that this is due to something inefficient
> >>in my perl, that can be fixed easily, but I'm not sure what that might
> >>be.
> >>
> >>The slow part of the script is this subroutine:
> >>######
> >>sub sqlInsert {
> >>    my ($fh, $app, $status, $entry,
> >>        $table_testlist_csv_path, %hash_values) = @_;
> >>    my $now = strftime("%Y-%m-%d %H:%M:%S", localtime);
> >>    my $entryVals = join(',', map { "\"$$entry{$_}\"" } qw(suiteid
> >>        regressionCL cl os arch build_type branch gpu subtest osversion));
> >>    my $testid = $hash_values{$app};
> >>    # We need to add an escape character in front of all double quotes
> >>    # in a testname, or the dquotes will be stripped out when the SQL
> >>    # COPY occurs.
> >>    $app =~ s/"/~"/g;
> >>    print $fh <<END;
> >>"$now","$app","$status","$testid",$entryVals
> >>END
> >>}
> >
> > Somebody has already pointed out the use of the strftime/localtime
> > for every iteration.  This reminds me of my first programming
> > experience in FORTRAN almost 50 years ago where the "I'm an
> > Engineer, not a Programmer and Proud of It" person who wrote the
> > program computed the square root of PI/2.0 every time in a
> > subroutine that was called over 20,000 times per run.  I
> > calculated it once, put it in COMMON, and cut the run time from
> > 30 minutes to 5 minutes.
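
[The same hoisting applies to the strftime/localtime call above: if the
timestamp only needs one-second resolution, it can be cached and refreshed
only when the epoch second changes, instead of being re-formatted for every
row.  A minimal sketch using only core modules; the caching granularity and
the sub name `cached_now` are my own, not from the original script:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Cache the formatted timestamp; re-run strftime() only when the epoch
# second actually changes, not once per row.
{
    my ($cached_epoch, $cached_now) = (-1, '');

    sub cached_now {
        my $epoch = time;
        if ($epoch != $cached_epoch) {
            $cached_epoch = $epoch;
            $cached_now   = strftime("%Y-%m-%d %H:%M:%S", localtime($epoch));
        }
        return $cached_now;
    }
}

# Every call within the same second returns the cached string at the
# cost of one integer comparison.
my $ts = cached_now();
print "timestamp: $ts\n";
```

For 150k rows written within a few seconds, that replaces 150k
strftime/localtime calls with a handful.]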
> >
> > There are a few things you might do to improve this.
> >
> >    + Use one of the database interfaces (DBI) available in Perl to
> >      connect to the database.  It's been quite a while since I did this
> >      as I'm primarily doing Python these days, so I don't remember the
> >      details.  The DBI libraries typically have facilities to properly
> >      quote as necessary.
>
> Yeah, I'm aware of that, but unfortunately this script has to run on a
> large number of environments (some of which are not Linux), and
> getting those modules installed is a huge PITA.  Also, the bottleneck
> in the script isn't the database queries, it's writing out a file
> locally, so making this change wouldn't help regardless.
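
[One thing worth checking in the subroutine quoted above: the line
`my (..., %hash_values) = @_;` flattens and copies the entire lookup hash
on every call.  If that hash is large and sqlInsert runs 150k times, the
per-call copying alone can dominate the run time.  Passing a reference
copies nothing; a sketch with a deliberately simplified body (the real sub
writes more columns), where the sample data is invented for illustration:

```perl
use strict;
use warnings;

# Pass the lookup table by reference: the sub receives a single scalar
# (the reference) instead of a flattened copy of every key and value.
sub sqlInsert {
    my ($fh, $app, $status, $entry, $csv_path, $hash_ref) = @_;
    my $testid = $hash_ref->{$app};    # same lookup as before, no copy
    print $fh qq{"$app","$status","$testid"\n};
}

my %hash_values = (myapp => 42);

open my $fh, '>', \my $buffer or die $!;   # in-memory handle for the demo
sqlInsert($fh, 'myapp', 'PASS', {}, 'unused', \%hash_values);
close $fh;
print $buffer;                             # "myapp","PASS","42"
```

The only change at each call site is `\%hash_values` instead of
`%hash_values`, and `$hash_ref->{$app}` inside the sub.]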
>
> >
> >    + I think that most SQL databases have a now() function that will get
> >      the current time, and that would probably be much more efficient
> >      than doing it externally.  I have a link to the PostgreSQL page on
> >      this here:
> >      http://www.postgresql.org/docs/8.2/static/functions-datetime.html
>
> Yup, and we actually have now() as the default for the column in
> question.  Unfortunately (or perhaps fortunately), I'm not the
> original author of this script, I've inherited the mess, and need to
> maintain it on top of 3829483 other responsibilities.  At some point I
> should determine what the added overhead is to letting the database
> figure out now() for the millions of rows we insert each day, rather
> than pre-calculating it on the clients.
>
> >
> >    + If the SQL back end has stored procedures, it might be most
> >      efficient to have one handle the time automatically on insert.
>
> That's what setting a now() default does for a column (at least in
> PostgreSQL).
>
> Now that I've determined that the client-side timestamp calculation
> isn't the bottleneck, what else can I look at next?
>
> thanks!
>
>
> --
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> L. Friedman                                    netllama at gmail.com
> LlamaLand                       https://netllama.linux-sxs.org
>
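
[One way to answer that question without installing anything: wrap the
suspect sections with Time::HiRes, which has shipped with core Perl since
5.8, and see where the wall-clock time actually goes.  (Devel::NYTProf
gives a full line-level profile, if installing one module is an option.)
A minimal timing sketch; the stand-in loop marks where the real work, e.g.
the loop calling sqlInsert(), would go:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);   # core since Perl 5.8

my $t0 = [gettimeofday];

# ... the suspect section goes here; a stand-in loop for the sketch ...
my $sum = 0;
$sum += $_ for 1 .. 100_000;

my $elapsed = tv_interval($t0);    # float seconds since $t0
printf "section took %.4f s (sum=%d)\n", $elapsed, $sum;
```

Bracketing the file-write loop, the parse loop, and the timestamp code
separately will show which one actually eats the 3+ hours.]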

Does the script read in the entire data source file and then parse each
line?  Or does it read one line at a time and parse/write it prior to
reading the next line?  If the entire source file is being read into
memory, could that be causing a bottleneck?

Andrew

