any perl performance experts awake?

Lonni J Friedman netllama at gmail.com
Thu Apr 11 12:14:17 PDT 2013


On Thu, Apr 11, 2013 at 11:19 AM, Andrew Gould
<andrewlylegould at gmail.com> wrote:
> On Thu, Apr 11, 2013 at 11:36 AM, Lonni J Friedman <netllama at gmail.com>
> wrote:
>>
>> On Thu, Apr 11, 2013 at 9:32 AM, Andrew Gould <andrewlylegould at gmail.com>
>> wrote:
>> >
>> > On Thu, Apr 11, 2013 at 11:13 AM, Lonni J Friedman <netllama at gmail.com>
>> > wrote:
>> >>
>> >>
>> >> >
>> >> > Does the script read in an entire data source file and parse each
>> >> > line?  Or does it read one line at a time and parse/write it prior
>> >> > to reading the next line?  If the entire source file is being read
>> >> > into memory, could it be causing a bottleneck?
>> >>
>> >> The script reads in an entire data source file, parsing line by line,
>> >> putting the data into a hash (%hash_values).  Once that is completed,
>> >> the hash is passed to sqlInsert().  So everything is already read into
>> >> memory at the point in time when performance tanks.  I'd expect this
>> >> to be the fast path, since it never needs to read from disk.  All of
>> >> my systems have 2+GB RAM, and the data in question is always less
>> >> than 30MB, so I can't imagine that this would be a swap issue, if
>> >> that's what you mean.  Unless querying a key/value pair in a hash is
>> >> not a good performance path in Perl?
>> >
>> >
>> >
>> > The script is holding the input file (>150k rows?) and the hash in
>> > memory while it's reformatting the data and performing sqlInsert().
>> > I was wondering whether the combination of processing and RAM
>> > utilization could be causing the slowdown.
>>
>> Yes, that's how it's behaving.  Is there a better way to do this in Perl?
>
>
> I can't help with Perl specifics, but when I process large files in Python,
> I don't read the entire input file at once.  I read, process and write one
> line at a time:
>
> 1. Assign input and output files to file handles using open().
> 2. Read one line from the input file, process it and write the results to
> the output file.  Repeat as necessary.
> 3. Close the input and output files.
>
> It takes less than an hour to process a large file (input file = 4.5 million
> rows, 2GB; output file size approximately 890MB) on a system with 2.9GHz
> processor and 4GB RAM running 32bit WinXP.  (They don't let me use Linux at
> work.)
>
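The three steps above translate directly to Perl.  A minimal sketch, with
placeholder file names and a caller-supplied transform standing in for the
real per-line parsing:

```perl
use strict;
use warnings;

# Process a file line by line instead of slurping it all into memory.
# $transform is a code ref applied to each chomped line; the real
# parsing logic would go there.
sub process_file {
    my ($in_file, $out_file, $transform) = @_;

    # 1. Assign input and output files to file handles using open().
    open my $in,  '<', $in_file  or die "Can't read $in_file: $!";
    open my $out, '>', $out_file or die "Can't write $out_file: $!";

    # 2. Read one line, process it, write the result; repeat.
    while (my $line = <$in>) {
        chomp $line;
        print {$out} $transform->($line), "\n";
    }

    # 3. Close the input and output files.
    close $in;
    close $out;
}
```

This keeps memory use flat no matter how large the input file is, since only
one line is held at a time.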

The thing is, the data being passed into this slow function
(subroutine) is initially being read from a file, and that portion is
fast.  This seems to suggest that Perl is somehow faster at reading and
processing data from disk than in memory, which seems ridiculous to
me.  Surely that would be considered a huge bug that would have been
fixed years ago?

Also, I still suspect that the issue is something related to the hash,
as looking up a key by its value in the hash seems to be where the
performance goes downhill.  However, everything that I've read suggests
that hashes are the way to go for better performance (especially
compared with an array).
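If the slow step really is finding a key from its value, that is a linear
scan over every pair on each lookup; hashes are only O(1) in the
key-to-value direction.  Inverting the hash once gives constant-time
lookups afterwards.  A sketch with made-up data (this assumes the values
are unique, since `reverse` silently keeps only one key per duplicate
value):

```perl
use strict;
use warnings;

my %hash_values = (alpha => 1, beta => 2, gamma => 3);  # stand-in data

# Slow: scan every pair to find the key for a given value.  This is
# O(n) per lookup, which adds up fast inside a per-row loop.
sub key_for_slow {
    my ($value) = @_;
    for my $k (keys %hash_values) {
        return $k if $hash_values{$k} eq $value;
    }
    return undef;
}

# Fast: build the reverse map once, then every lookup is O(1).
# In list context the hash flattens to (k1, v1, k2, v2, ...), so
# reversing that list swaps keys and values.
my %value_to_key = reverse %hash_values;
```

With the inverted hash, `$value_to_key{$v}` replaces the scanning loop
entirely.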

I did find this thread, which seems to suggest that I might benefit
from using a hash reference inside the subroutine, rather than passing
the entire hash:
http://stackoverflow.com/questions/5692349/benefits-of-using-hash-references
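Passing `\%hash_values` instead of `%hash_values` means only one scalar
(the reference) crosses the call, rather than flattening every key/value
pair into the argument list and copying them back into a new hash inside
the subroutine.  A minimal sketch of the two calling styles (the
subroutine names and summing body are just illustrations, not the real
sqlInsert()):

```perl
use strict;
use warnings;

# Copying style: the hash is flattened into @_ and rebuilt inside
# the sub, duplicating every pair on each call.
sub sum_by_copy {
    my %h = @_;
    my $total = 0;
    $total += $_ for values %h;
    return $total;
}

# Reference style: a single scalar is passed; the sub works on the
# original hash through the reference, with no per-pair copying.
sub sum_by_ref {
    my ($h) = @_;
    my $total = 0;
    $total += $_ for values %$h;
    return $total;
}

my %hash_values = (a => 1, b => 2, c => 3);
sum_by_copy(%hash_values);    # copies all pairs into the sub
sum_by_ref(\%hash_values);    # passes one reference
```

For a 30MB hash the copy happens on every call, so the reference style
is the usual recommendation when the subroutine is called repeatedly.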

Unless someone else has any ideas, I guess I'll give that a try next.


More information about the Linux-users mailing list