any perl performance experts awake?

Lonni J Friedman netllama at gmail.com
Thu Apr 11 13:37:34 PDT 2013


I figured this out.  The problem was indeed the hash table lookups, or
more accurately, the fact that I was passing the entire hash table into
the subroutine thousands of times, which meant it was being copied
thousands of times.  I switched to using a hashref, and that cut the
processing time from hours to a few minutes.
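A minimal sketch of the difference (sub and variable names are hypothetical stand-ins for the real script): passing %hash flattens every key/value pair into @_, so the whole table is copied on each call, while passing \%hash copies only a single scalar reference.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Slow: the entire hash is flattened into @_ and rebuilt on every call.
sub sql_insert_copy {
    my (%data) = @_;
    return scalar keys %data;
}

# Fast: only one scalar (the reference) crosses the call boundary.
sub sql_insert_ref {
    my ($data) = @_;              # $data is a hashref
    return scalar keys %{$data};  # dereference to read it
}

my %hash_values = map { "key$_" => $_ } 1 .. 1000;

sql_insert_copy(%hash_values);    # copies all 1000 pairs per call
sql_insert_ref(\%hash_values);    # copies one reference per call
```

Inside the subroutine, individual lookups become $data->{key} instead of $data{key}; the lookup speed itself is unchanged, only the cost of argument passing drops.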

On Thu, Apr 11, 2013 at 12:14 PM, Lonni J Friedman <netllama at gmail.com> wrote:
> On Thu, Apr 11, 2013 at 11:19 AM, Andrew Gould
> <andrewlylegould at gmail.com> wrote:
>> On Thu, Apr 11, 2013 at 11:36 AM, Lonni J Friedman <netllama at gmail.com>
>> wrote:
>>>
>>> On Thu, Apr 11, 2013 at 9:32 AM, Andrew Gould <andrewlylegould at gmail.com>
>>> wrote:
>>> >
>>> > On Thu, Apr 11, 2013 at 11:13 AM, Lonni J Friedman <netllama at gmail.com>
>>> > wrote:
>>> >>
>>> >>
>>> >> >
>>> >> > Does the script read in an entire data source file and parse each
>>> >> > line?  Or does it read one line at a time and parse/write it prior
>>> >> > to reading the next line?  If the entire source file is being read
>>> >> > into memory, could it be causing a bottleneck?
>>> >>
>>> >> The script reads in an entire data source file, parsing line by line
>>> >> and putting the data into a hash (%hash_values).  Once that is
>>> >> completed, the hash is passed to sqlInsert().  So everything is
>>> >> already read into memory at the point when performance tanks.  I'd
>>> >> expect this to be the fast path, since it never needs to read from
>>> >> disk.  All of my systems have 2+GB RAM, and the data in question is
>>> >> always less than 30MB, so I can't imagine that this would be a swap
>>> >> issue, if that's what you mean?  Unless querying a key/value pair in
>>> >> a hash is not a good performance path in Perl?
>>> >
>>> >
>>> >
>>> > The script is holding the input file (>150k rows?) and the hash in
>>> > memory while it's reformatting the data and performing sqlInsert().
>>> > I was wondering whether the combination of processing and RAM
>>> > utilization could be causing the slowdown.
>>>
>>> Yes, that's how it's behaving.  Is there a better way to do this in Perl?
>>
>>
>> I can't help with Perl specifics, but when I process large files in
>> Python, I don't read the entire input file at once.  I read, process,
>> and write one line at a time:
>>
>> 1. Assign input and output files to file handles using open().
>> 2. Read one line from the input file, process it and write the results to
>> the output file.  Repeat as necessary.
>> 3. Close the input and output files.
>>
>> It takes less than an hour to process a large file (input file = 4.5 million
>> rows, 2GB; output file size approximately 890MB) on a system with 2.9GHz
>> processor and 4GB RAM running 32bit WinXP.  (They don't let me use Linux at
>> work.)
>>
>
> The thing is, the data being passed into this slow function
> (subroutine) is initially being read from a file, and that portion is
> fast.  This seems to suggest that Perl is somehow faster at reading and
> processing data from disk than in memory, which seems ridiculous to
> me.  Surely this would be considered a huge bug that would have been
> fixed years ago?
>
> Also, I still suspect that the issue is something related to the hash,
> as querying the key from the value in the hash seems to be where the
> performance goes downhill. However, everything that I've read suggests
> that hashes are the way to go for getting better performance
> (especially when compared with an array).
>
> I did find this thread, which seems to suggest that I might benefit
> from using a hash reference inside the subroutine, rather than passing
> the entire hash:
> http://stackoverflow.com/questions/5692349/benefits-of-using-hash-references
>
> Unless someone else has any ideas, I guess I'll give that a try next.
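For reference, the line-at-a-time pattern Andrew describes maps to Perl roughly as follows; the file names and the "key,value" record format here are hypothetical stand-ins, and the sketch generates its own sample input so it runs standalone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a small sample input file (hypothetical "key,value" records)
# so the sketch is self-contained.
open my $mk, '>', 'input.dat' or die "Cannot create sample: $!";
print {$mk} "a,1\nb,2\nc,3\n";
close $mk;

open my $in,  '<', 'input.dat'  or die "Cannot open input: $!";
open my $out, '>', 'output.dat' or die "Cannot open output: $!";

# Read, transform, and write one line at a time; only the current
# line is held in memory, no matter how large the input grows.
while ( my $line = <$in> ) {
    chomp $line;
    my ( $key, $value ) = split /,/, $line, 2;
    print {$out} "$key\t$value\n";
}

close $in;
close $out;
```

This keeps memory flat, though in this thread the copying of the hash into the subroutine, not total memory use, turned out to be the bottleneck.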



-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman                                    netllama at gmail.com
LlamaLand                       https://netllama.linux-sxs.org


More information about the Linux-users mailing list