<div dir="ltr">On Thu, Apr 11, 2013 at 10:10 AM, Lonni J Friedman <span dir="ltr"><<a href="mailto:netllama@gmail.com" target="_blank">netllama@gmail.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Wed, Apr 10, 2013 at 5:15 PM, Bill Campbell <<a href="mailto:linux-sxs@celestial.com">linux-sxs@celestial.com</a>> wrote:<br>
> On Wed, Apr 10, 2013, Lonni J Friedman wrote:<br>
>>I've got a perl script that is used to parse data from one format into<br>
>>another format. It works fairly well 99% of the time; however, when<br>
>>the data that it's parsing is large, the performance of the script gets<br>
>>awful. Unfortunately, my perl skills are marginal at best, so I'm<br>
>>lost on how to debug this problem.<br>
>><br>
>>For example, for 99% of the cases, there are less than 1k rows of data<br>
>>to parse, and it completes in less than 10 seconds. However, for the<br>
>>remaining 1%, there are over 150k rows, and the script takes hours<br>
>>(3+) to finish. I'm hoping that this is due to something inefficient<br>
>>in my perl, that can be fixed easily, but I'm not sure what that might<br>
>>be.<br>
>><br>
>>The slow part of the script is this subroutine:<br>
>>######<br>
>>sub sqlInsert {<br>
>> my ($fh, $app, $status, $entry, $table_testlist_csv_path,%hash_values) = @_;<br>
>> my $now=strftime("%Y-%m-%d %H:%M:%S", localtime) ;<br>
>> my $entryVals = join(',', map { "\"$$entry{$_}\""} qw(suiteid<br>
>>regressionCL cl os arch build_type branch gpu subtest osversion));<br>
>> my $testid = $hash_values{$app} ;<br>
>><br>
>> # we need to add an escape character in front of all double quotes<br>
>>in a testname, or the dquotes will be stripped out when the SQL COPY<br>
>>occurs<br>
>> $app =~ s/"/~"/g ;<br>
>> print $fh <<END;<br>
>>"$now","$app","$status","$testid",$entryVals<br>
>>END<br>
>>}<br>
><br>
> Somebody has already pointed out the use of the strftime/localtime<br>
> for every iteration. This reminds me of my first programming<br>
> experience in FORTRAN almost 50 years ago where the "I'm an<br>
> Engineer, not a Programmer and Proud of It" person who wrote the<br>
> program computed the square root of PI/2.0 every time in a<br>
> subroutine that was called over 20,000 times per run. I<br>
> calculated it once, put it in COMMON, and cut the run time from<br>
> 30 minutes to 5 minutes.<br>
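The same "compute it once" fix applies directly to the strftime/localtime call. A minimal sketch (the row data here is hypothetical, standing in for the real parsed entries):

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Compute the timestamp once, before the loop, rather than calling
# strftime/localtime inside sqlInsert for every one of the 150k rows.
my $now = strftime("%Y-%m-%d %H:%M:%S", localtime);

# Hypothetical rows standing in for the real parsed data.
my @rows = ('testA', 'testB', 'testC');

for my $app (@rows) {
    # $now is reused; only the per-row fields vary.
    print qq{"$now","$app"\n};
}
```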
><br>
> There are a few things you might do to improve this.<br>
><br>
> + Use one of the database interfaces (DBI) available in Perl to connect<br>
> to the database. It's been quite a while since I did this as I'm<br>
> primarily doing Python these days so I don't remember the details.<br>
> The DBI libraries typically have facilities to properly quote as<br>
> necessary.<br>
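For reference, a minimal sketch of what that looks like with DBI placeholders — the driver quotes and escapes each bound value itself, so no manual double-quote munging is needed. The DSN, credentials, table, and column names below are all hypothetical:

```perl
use strict;
use warnings;
use DBI;

# Hypothetical connection details (requires DBI plus DBD::Pg).
my $dbh = DBI->connect("dbi:Pg:dbname=testdb", "user", "secret",
                       { RaiseError => 1, AutoCommit => 0 });

# Prepare once, execute per row; bound values are quoted/escaped
# by the driver, so no manual s/"/.../ munging is needed.
my $sth = $dbh->prepare(
    "INSERT INTO results (ts, app, status) VALUES (?, ?, ?)"
);

# Hypothetical parsed rows.
my @rows = (
    { ts => "2013-04-11 10:10:00", app => "suiteA", status => "pass" },
);

$sth->execute($_->{ts}, $_->{app}, $_->{status}) for @rows;
$dbh->commit;
```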
<br>
</div></div>Yeah, I'm aware of that, but unfortunately this script has to run in a<br>
large number of environments (some of which are not Linux), and<br>
getting those modules installed is a huge PITA. Also, the bottleneck<br>
in the script isn't the database queries, it's writing out a file<br>
locally, so making this change wouldn't help regardless.<br>
<div class="im"><br>
><br>
> + I think that most SQL databases have a now() function that will get<br>
> the current time, and that would probably be much more efficient than<br>
> doing it externally. I have a link to the PostgreSQL page on this<br>
> here.<br>
> <a href="http://www.postgresql.org/docs/8.2/static/functions-datetime.html" target="_blank">http://www.postgresql.org/docs/8.2/static/functions-datetime.html</a><br>
<br>
</div>Yup, and we actually have now() as the default for the column in<br>
question. Unfortunately (or perhaps fortunately), I'm not the<br>
original author of this script, I've inherited the mess, and need to<br>
maintain it on top of 3829483 other responsibilities. At some point I<br>
should determine what the added overhead is to letting the database<br>
figure out now() for the millions of rows we insert each day, rather<br>
than pre-calculating it on the clients.<br>
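The client-side half of that comparison is quick to measure with the core Benchmark module (this only times the strftime call; the server-side cost of now() would need its own measurement on the database):

```perl
use strict;
use warnings;
use Benchmark qw(timethis);
use POSIX qw(strftime);

# Rough per-call cost of computing the timestamp client-side,
# over 100k iterations.
timethis(100_000, sub {
    my $now = strftime("%Y-%m-%d %H:%M:%S", localtime);
});
```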
<div class="im"><br>
><br>
> + If the SQL back end has stored procedures, it might be most efficient<br>
> to have one handle the time automatically on insert.<br>
<br>
</div>That's what setting a now() default does for a column (at least in PostgreSQL).<br>
<br>
Now that I've determined that the client side timestamp calculation<br>
isn't the bottleneck, what else can I look at next?<br>
<br>
thanks!<br>
<br>
<br>
--<br>
<div class="im HOEnZb">~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<br>
L. Friedman <a href="mailto:netllama@gmail.com">netllama@gmail.com</a><br>
LlamaLand <a href="https://netllama.linux-sxs.org" target="_blank">https://netllama.linux-sxs.org</a><br>
</div><div class="HOEnZb"><div class="h5"></div></div></blockquote><div><br></div><div style>Does the script read in an entire data source file and parse each line? Or does it read one line at a time and parse/write it prior to reading the next line? If the entire source file is being read into memory, could it be causing a bottleneck?</div>
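A sketch of the two read patterns in question, for comparison (the input file here is generated inline so the example is self-contained; the real script's file handling isn't shown in the thread):

```perl
use strict;
use warnings;

# Hypothetical sample input so the sketch is self-contained.
my $path = "sample_data.txt";
open(my $out, '>', $path) or die "open $path: $!";
print $out "row$_\n" for 1 .. 5;
close $out;

# Pattern 1: slurp the whole file into memory, then parse.
# With 150k+ rows this can balloon memory use.
open(my $in, '<', $path) or die "open $path: $!";
my @all_lines = <$in>;
close $in;

# Pattern 2: stream one line at a time; memory stays flat
# regardless of file size.
my $count = 0;
open($in, '<', $path) or die "open $path: $!";
while (my $line = <$in>) {
    chomp $line;
    $count++;   # parse/write $line here before reading the next one
}
close $in;

printf "slurped %d lines, streamed %d lines\n",
    scalar(@all_lines), $count;
unlink $path;
```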
<div style><br></div><div style>Andrew</div><div> </div></div></div></div>