FreeBSD and filePro weirdness

Tue Dec 5 19:26:31 PST 2006

----- Original Message ----- 
From: "Walter Vaughan" <wvaughan at steelerubber.com>
To: "filePro" <filepro-list at lists.celestial.com>
Sent: Tuesday, December 05, 2006 10:26 AM
Subject: FreeBSD and filePro weirdness

> We've noticed this happening a lot on our FreeBSD 6.1 filePro server. It 
> seems like it started with the security upgrades to sshd recently. Zombie 
> filePro processes that spin up to 100% CPU usage. Broken sessions cause 
> this, and we've seen them for years on SCO Unix, and we just kill them 
> off... they just seem to be running the CPU up to 100% on FreeBSD vs 
> idling at 0% on SCO. I've seen it in the past few days with *clerk and 
> *report as well.
>
> This is filePro 5.0.14, so I have no idea if 5.6.X has same behavior.
> # file rcabe
> rcabe: setuid ELF 32-bit LSB executable, Intel 80386, version 1 (FreeBSD), 
> for FreeBSD 4.8, statically linked, stripped
>
> As you can see from the "top" snippet it happens even to an aborted cabe 
> session....
>
> last pid: 28407;  load averages:  1.00,  1.00,  1.00    up 1+00:04:26 
> 10:03:43
> 60 processes:  2 running, 58 sleeping
> CPU states: 19.0% user,  0.0% nice, 31.2% system,  0.0% interrupt, 49.8% 
> idle
> Mem: 81M Active, 640M Inact, 184M Wired, 36K Cache, 111M Buf, 92M Free
> Swap: 4096M Total, 4096M Free
>
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU 
> COMMAND
> 18668 filepro       1 118    0  1292K   896K CPU1   1  18.9H 98.97% rcabe
>
> Since this is a multiple CPU system, it's not evident it happens unless 
> two or more exist. I had three one day, and the tip off to me was that it 
> started to feel sticky.
>
> Mostly I am posting this in case someone else has this problem in the 
> future with filePro, and is looking for confirmation of past behavior.

We've seen it too. I finally wrote this a few months ago.
http://www.aljex.com/bkw/filepro/#headless

cd /usr/local/bin
wget http://www.aljex.com/bkw/filepro/headless
chmod 755 headless

I run it with "-k" from cron every minute as part of another script that 
already runs every minute.

What it does is look for processes where the command name exactly matches 
anything from a list of fp binary names, whose parent process id is "1".

Once in a great while there is a process that a human can tell should be 
killed but whose ppid isn't 1, but unless you run any fp binaries from init, 
there is never a time where a filepro process legitimately has a ppid of 1.
That includes cron and cgi jobs.

This doesn't look at controlling tty because there are plenty of legitimate 
cases for a fp process to have no tty.
It is often interesting though so it displays tty in the display mode.

This doesn't attempt to clean up any other processes that might be related 
to the actual fp binary.
Usually those are already gone. Usually that's actually the problem.

This just uses kill with no options, which is the same as -15 or -TERM which 
filePro normally handles gracefully.
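
For anyone who wants to see the idea without downloading the script, here 
is a minimal sketch of the selection logic described above. It is not the 
actual headless script: the function names are made up for illustration, 
and FP_NAMES is an assumed, partial list of fp binary names that you would 
adjust for your own install.

```shell
#!/bin/sh
# Hypothetical sketch only, not the real "headless" script.
# FP_NAMES is an assumed, partial list of fp binary names.
FP_NAMES="dclerk rclerk dreport rreport dcabe rcabe"

# Read "pid ppid tty command" lines on stdin and print "pid tty command"
# for lines whose ppid is 1 and whose command name exactly matches one
# of the names in FP_NAMES.
filter_headless() {
    awk -v names="$FP_NAMES" '
        BEGIN { n = split(names, a, " "); for (i = 1; i <= n; i++) fp[a[i]] = 1 }
        $2 == 1 && ($4 in fp) { print $1, $3, $4 }
    '
}

# Display mode: list candidates from the live process table.
# tty is shown only for information; it is not used to select anything.
list_headless() {
    ps -A -o pid= -o ppid= -o tty= -o comm= | filter_headless
}

# Kill mode: plain kill with no options, which sends TERM.
kill_headless() {
    list_headless | while read -r pid tty cmd; do
        kill "$pid"
    done
}
```

Calling list_headless shows the candidates, and kill_headless sends each 
one a plain kill. Only the fp process itself is signalled; nothing else is 
cleaned up.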

Adding more anecdotal data to the pile...

We just recently started having a similar problem on a new linux box, where 
hitting break while in clerk/report/etc... will sometimes cause filepro's 
parent process to exit instead of filepro. In fact filepro won't see the 
keystroke at all.

Normally, hitting break in fp doesn't actually break; fp traps it and 
prompts you to hit it again with the Ctrl-C? or Del? prompt on the bottom 
right.
In this case, fp is not putting up that message, and immediately you get a 
shell prompt on the screen at the same time as rclerk.

I've repeated it and looked at the process list in another window and it's 
clear.
You start with a process group like this:

pid        ppid      command
3558    133        sshd
3559    3558      -bash
3866    3559      /bin/bash /some/script
3868    3866      rclerk somefile -s1 -u

You hit break, and then your screen has both rclerk and a shell active at 
once.
Most keystrokes go to the shell but some do make it to rclerk.
If you hit Ctrl-L the screen clears (no fp screen data), the shell prompt 
appears somewhere to the left (sometimes high, sometimes low), and the 
cursor moves back to wherever it was waiting for input in the filepro screen, 
which might be the entsel prompt or some screen field or input prompt.
If filepro was busy like processing a report or a getnext loop, new screen 
updates from fp do draw on the screen.
It's hard to trigger fp to do anything though since almost no keystrokes 
make it to fp any more, they all go to the shell.
But not all. And the process list changed to this:

pid        ppid      command
3558    133        sshd
3559    3558      -bash
3868    1            rclerk somefile -s1 -u

The parent script process (3866) went away, and the rclerk process's parent 
process id (formerly 3866) has changed to "1".

If you kill 3868 you are left with a normal interactive shell on that tty.
Or sometimes you need to blindly type "^Jstty sane^J" to get the shell back 
to normal.

That's only happening on one box, which is to the best of my ability set up 
exactly the same as several others in terms of OS, config, fp binaries etc.
And it's only happening in one fp environment on that box.
The problem environment shares the same filepro binaries, same termcap & 
terminfo, (same fp/termcap), uses the same start script to set up the env 
and start fp, even uses a copy of the good environment's fp config file.
The problem happens for all combinations of TERM & terminal or emulator for 
which I have fully configured and well tested setups that I use routinely on 
lots of boxes, including:
TERM=linux, using linux console, putty, anzio
TERM=scoansi, using facetwin, anzio, putty
TERM=rxvt, using rxvt on linux, freebsd, sco
All of which work fine on the same box when working on other sets of fp data, 
so the fp/termcap is good and matches the terminals and the system termcap & 
terminfo & relevant stty settings.

The only difference between the problem environment and the working 
environments on that box is PFDIR, which in this case results in a 
different filepro directory, so all different data, processing, menus, 
pfglob.

Which suggests processing.
But this body of code was rsynced from a sco box where it was working fine 
for years and continues to work fine.
So if it was something in processing it should be causing the problem there 
too.
The rsync binaries on both ends are way too well proven to be suspect.

I still have to try copying it to some other linux box and see if it fails 
there too.
I have played with putting trap commands in the parent script and altering 
the parent script's shell, but only a little so far and not very 
systematically. I thought I spotted a difference between config files with 
pfbreak=old but eliminated that.

You can emulate the effect by just doing kill -9 on the parent of any fp 
process.
That way you can test the effects and use of headless at will:
Run "headless" and normally see no output (no headless fp's normally).
Start up a sacrificial rclerk, kill -9 its parent, and then see that the 
process shows up in headless.
Then run headless -k and see the one fp process go away.
The tty the process was on may appear locked up but really just needs 
"^Jstty sane^J".
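
If you would rather try the reparenting effect without involving filePro at 
all, the following is a rough sketch that uses "sleep 60" as a stand-in for 
rclerk and a throwaway sh as the parent script. The variable names and the 
"wrapper" label are just for illustration. On the boxes discussed here the 
orphan ends up with a ppid of 1; on some newer Linux setups an orphan can 
be adopted by a different reaper process instead, so the sketch just prints 
whatever the new ppid is.

```shell
#!/bin/sh
# Sketch: orphan a stand-in process and watch its parent pid change.
# "sleep 60" plays the part of rclerk; nothing here touches filePro.

pidfile=$(mktemp)

# The wrapper shell plays the part of the parent script: it starts the
# stand-in in the background, records its pid, then waits on it.
sh -c 'sleep 60 & echo "$!" > "$1"; wait' wrapper "$pidfile" &
parent=$!

sleep 1                          # give the wrapper time to write the pid
child=$(cat "$pidfile")
rm -f "$pidfile"

old_ppid=$(ps -o ppid= -p "$child" | tr -d ' ')
kill -9 "$parent"                # the parent dies without any cleanup
sleep 1
new_ppid=$(ps -o ppid= -p "$child" | tr -d ' ')

echo "child $child: ppid was $old_ppid, now $new_ppid"

kill "$child"                    # plain kill (TERM), as headless -k would do
```

Since this is not run on a tty that the stand-in is reading from, it does 
not reproduce the garbled screen, only the change of parent.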

Brian K. White  --  brian at aljex.com  --  http://www.aljex.com/bkw/
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro  BBx    Linux  SCO  FreeBSD    #callahans  Satriani  Filk!