newlib-cygwin/winsup/cygwin/DevNotes

2012-05-08  cgf-000004

The change for cgf-000003 introduced a new problem:
http://cygwin.com/ml/cygwin/2012-05/msg00154.html
http://cygwin.com/ml/cygwin/2012-05/msg00157.html

Since a handle associated with the parent is no longer being duplicated
into a non-cygwin "execed child", Windows is free to reuse the pid of
the parent when the parent exits.  However, since we *did* duplicate a
handle pointing to the pid's shared memory area into the "execed child",
the shared memory for the pid was still active.

Since the shared memory was still available, if a new process reuses the
previous pid, Cygwin would detect that the shared memory was not created
and had a "PID_REAPED" flag.  That was considered an error, and, so, it
would set procinfo to NULL and pinfo::thisproc would die since this
situation is not supposed to occur.

I fixed this in two ways:

1) If a shared memory region has a PID_REAPED flag then zero it and
reuse it.  This should be safe since you are not really supposed to be
querying the shared memory region for anything after PID_REAPED has been
set.

2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
child for exec.

It seems like 2) is a common theme and an audit of all of the handles
that are being passed to non-cygwin children is in order for 1.7.16.

The other minor modification that was made in this change was to add the
pid of the failing process to fork error output.  This helps slightly
when looking at strace output, even though in this case it was easy to
find what was failing by looking for '^---' when running the "stv"
strace dumper.  That found the offending exception quickly.

2012-05-07  cgf-000003

<1.7.15>
Don't make Cygwin wait for all children of a non-cygwin child program.
Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
       http://cygwin.com/ml/cygwin/2012-05/msg00075.html
</1.7.15>

This problem is due to a recent change which added some robustness and
speed to Cygwin's exec/spawn handling by not trying to force inheritance
every time a process is started.  See ChangeLog entries starting on
2012-03-20, and multiple on 2012-03-21.

Making the handle inheritable meant that, as usual, there were problems
with non-Cygwin processes.  When Cygwin "execs" a non-Cygwin process N,
all of its N + 1, N + 2, ...  children will also inherit the handle.
That means that Cygwin will wait until all subprocesses have exited
before it returns.

I was willing to make this a restriction of starting non-Cygwin
processes but the problem with allowing that is that it can cause the
creation of a "limbo" pid when N exits and N + 1 and friends are still
around.  In this scenario, Cygwin dutifully notices that process N has
died and sets the exit code to indicate that but N's parent will wait on
rd_proc_pipe and will only return when every N + ...  windows process
has exited.

The removal of cygheap::pid_handle was not related to the initial
problem that I set out to fix.  The change came from the realization
that we were duping the current process handle into the child twice and
only needed to do it once.  The current process handle is used by exec
to keep the Windows pid "alive" so that it will not be reused.  So, now
we just close parent in child_info_spawn::handle_spawn iff we're not
execing.

In debugging this it bothered me that 'ps' identified a nonactive pid as
active.  Part of the reason for this was the 'parent' handle in
child_info was opened in non-Cygwin processes, keeping the pid alive.
That has been kluged around (more changes after 1.7.15) but that didn't
fix the problem.  On further investigation, this seems to be caused by
the fact that the shared memory region pid handles were still being
passed to non-cygwin children, keeping the pid alive in a limbo-like
fashion.  This was easily fixed by having pinfo::init() consider a
memory region with PID_REAPED as not available.  A more robust fix
should be considered for 1.7.15+ where these handles are not passed
to non-cygwin processes.

This fixed the problem where a pid showed up in the list after a user
does something like: "bash$ cmd /c start notepad" but, for some reason,
it does not fix the problem where "bash$ setsid cmd /c start notepad".
That bears investigation after 1.7.15 is released but it is not a
regression and so is not a blocker for the release.

2012-05-03  cgf-000002

<1.7.15>
Fix problem where too much input was attempted to be read from a
pty slave.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
</1.7.15>

My change on 2012/04/05 reintroduced the problem first described by:
http://cygwin.com/ml/cygwin/2011-10/threads.html#00445

The problem then was, IIRC, due to the fact that bytes sent to the pty
pipe were not written as records.  Changing pipe to PIPE_TYPE_MESSAGE in
pipe.cc fixed the problem since writing lines to one side of the pipe
caused exactly that the number of characters to be read on the other
even if there were more characters in the pipe.

To debug this, I first replaced fhandler_tty.cc with the 1.258,
2012/04/05 version.  The test case started working when I did that.

So, then, I replaced individual functions, one at a time, in
fhandler_tty.cc with their previous versions.  I'd expected this to be a
problem with fhandler_pty_master::process_slave_output since that had
seen the most changes but was surprised to see that the culprit was
fhandler_pty_slave::read().

The reason was that I really needed the bytes_available() function to
return the number of bytes which would be read in the next operation
rather than the number of bytes available in the pipe.  That's because
there may be a number of lines available to be read but the number of
bytes which will be read by ReadFile should reflect the mode of the pty
and, if there is a line to read, only the number of bytes in the line
should be seen as available for the next read.

Having bytes_available() return the number of bytes which would be read
seemed to fix the problem but it could subtly change the behavior of
other callers of this function.  However, I actually think this is
probably a good thing since they probably should have been seeing the
line behavior.

2012-05-02  cgf-000001

<1.7.15>
Fix problem setting parent pid to 1 when process with children execs
itself.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
</1.7.15>

Investigating this problem with strace showed that ssh-agent was
checking the parent pid and getting a 1 when it shouldn't have.  Other
stuff looked ok so I chose to consider this a smoking gun.

Going back to the version that the OP said did not have the problem, I
worked forward until I found where the problem first occurred -
somewhere around 2012-03-19.  And, indeed, the getppid call returned the
correct value in the working version.  That means that this stopped
working when I redid the way the process pipe was inherited around
this time period.

It isn't clear why (and I suspect I may have to debug this further at
some point) this hasn't always been a problem but I made the obvious fix.
We shouldn't have been setting ppid = 1 when we're about to pass off to
an execed process.

As I was writing this, I realized that it was necessary to add some
additional checks.  Just checking for "have_execed" isn't enough.  If
we've execed a non-cygwin process then it won't know how to deal with
any inherited children.  So, always set ppid = 1 if we've execed a
non-cygwin process.