any perl performance experts awake?

Andrew Gould andrewlylegould at gmail.com
Thu Apr 11 11:19:20 PDT 2013


On Thu, Apr 11, 2013 at 11:36 AM, Lonni J Friedman <netllama at gmail.com>wrote:

> On Thu, Apr 11, 2013 at 9:32 AM, Andrew Gould <andrewlylegould at gmail.com>
> wrote:
> >
> > On Thu, Apr 11, 2013 at 11:13 AM, Lonni J Friedman <netllama at gmail.com>
> > wrote:
> >>
> >>
> >> >
> >> > Does the script read in an entire data source file and parse each
> >> > line?  Or does it read one line at a time and parse/write it prior
> >> > to reading the next line?  If the entire source file is being read
> >> > into memory, could it be causing a bottleneck?
> >>
> >> The script reads in an entire data source file, parsing it line by
> >> line and putting the data into a hash (%hash_values).  Once that is
> >> completed, the hash is passed to sqlInsert().  So everything is
> >> already read into memory at the point in time when performance tanks.
> >> I'd expect that this would be the fast path, since it never needs to
> >> read from disk.  All of my systems have 2+GB RAM, and the data in
> >> question is always less than 30MB, so I can't imagine that this would
> >> be a swap issue, if that's what you mean?  Unless querying a
> >> key/value pair in a hash is not a good performance path in perl?
> >
> >
> >
> > The script is holding the input file (>150k rows?) and the hash in
> > memory while it's reformatting the data and performing sqlInsert().
> > I was wondering whether the combination of processing and RAM
> > utilization could be causing the slowdown.
>
> Yes, that's how it's behaving.  Is there a better way to do this in perl?
>

I can't help with perl specifics, but when I process large files in python,
I don't read the entire input file at once. I read, process and write one
line at a time:

1. Assign the input and output files to file handles using open().
2. Read one line from the input file, process it and write the results to
the output file.  Repeat as necessary.
3. Close the input and output files.
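A minimal Python sketch of the steps above (the file names and the
per-line transform are placeholders, not anything from the original
script):

```python
# Create a small sample input file so the demo is self-contained.
with open("input.txt", "w") as f:
    f.write("alpha\nbeta\n")

def process_line(line):
    # Placeholder transform; substitute the real parsing/reformatting.
    return line.upper()

# Read, process, and write one line at a time -- the whole input file
# is never held in memory, so RAM use stays flat regardless of size.
with open("input.txt") as infile, open("output.txt", "w") as outfile:
    for line in infile:
        outfile.write(process_line(line))
# Both handles are closed automatically when the "with" block exits.
```

Iterating over the file handle itself is what keeps memory constant;
reading via read() or readlines() would pull everything in at once.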

It takes less than an hour to process a large file (input file = 4.5
million rows, 2GB; output file size approximately 890MB) on a system with
a 2.9GHz processor and 4GB RAM running 32-bit WinXP.  (They don't let me
use Linux at work.)


More information about the Linux-users mailing list