Re: webserver stalls [was Re: bug in (linux) slattach]

All the mail mirrored from lore.kernel.org

* Re: webserver stalls [was Re: bug in (linux) slattach]
       [not found] <1034621946.3293.92.camel@cool>
@ 2002-10-17 10:04 ` jb1
  0 siblings, 0 replies; 16+ messages in thread
From: jb1 @ 2002-10-17 10:04 UTC
  To: Harry Kalogirou; +Cc: Linux-8086

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2394 bytes --]

On 15 Oct 2002, Harry Kalogirou wrote:

...
> Up to this point, everything looks OK. The Red Hat machine sent
> 7 bytes (1 -> 8) and ELKS acknowledged them all. Then ELKS sent the
> first 100 bytes (1 -> 101) of the page. Red Hat ACKed them too.
> Then ELKS sent 1 byte (101 -> 102) and Red Hat ACKed it too.
> The problem starts here....
> 
> > 04:28:17.961776  sl0 < 192.168.1.100.www > 192.168.2.5.1025: P 102:125(23) ack 8 win 512
...
> ELKS continues to send 23 bytes more...
> 
> > 04:28:18.411777  sl0 < 192.168.1.100.www > 192.168.2.5.1025: P 125:225(100) ack 8 win 512
...
> ELKS continues to send 100 bytes more
> 
> > 04:28:18.411835  sl0 > 192.168.2.5.1025 > 192.168.1.100.www: . 8:8(0) ack 102 win 32512 (DF)
...
> And here we have an ACK from Red Hat for only the first 101 bytes again,
> not for the 123 extra bytes ELKS sent.
> 
> I can't see what causes the problem. As far as I can tell, ELKS
> does its job fine.
> 
> Is there any chance that tcpdump also dumps packets with a wrong checksum?

Yes, according to information I found on the web, tcpdump doesn't reject 
packets with incorrect checksums.

Attached is telnetdumps.zip, which contains sample tcpdumps illustrating
how a 99-byte webpage on ELKS succeeds, but two different 100-byte
variants stall on my Red Hat 7.0 Linux. Both 100-byte webpages, as well as
the original 266-byte sample webpage seem to fail in the same place;
specifically, Linux fails to ACK the 23-byte block from ELKS but ELKS
continues sending data, while Linux keeps ACK'ing the packet prior to the
one with 23 bytes of data. One of the 100-byte variants demonstrates that
the problem is not caused by the original webpage's lack of a terminating
newline. The file "telnetdumps.jb1" describes the contents in more detail.

I'd like to analyze the packets in more detail, but can't determine the
structure. Below is what I've inferred so far; any corrections and
additions (especially the checksum location, if any) would be greatly
appreciated:

Byte	Function
------	------------------------------------------------------------------
0
1
2-3	packet size
4
5
6
7
8
9
10
11
12-15	source IP address
16-19	destination IP address
20-21	source port
22-23	destination port
24-27	sequence number
28-31	acknowledgement number
32
33
34-35	sender's "win" value
36
37
38
39
40 ...	data

I changed the "Subject:" to get this in the appropriate thread.

[-- Attachment #2: Type: APPLICATION/zip, Size: 11333 bytes --]


* Re: webserver stalls [was Re: bug in (linux) slattach]
@ 2002-10-19 10:07 jb1
  2002-10-19 17:55 ` Gabucino
  0 siblings, 1 reply; 16+ messages in thread
From: jb1 @ 2002-10-19 10:07 UTC
  To: Harry Kalogirou; +Cc: linux-8086

Thanks to Blaz, who put me on the right track, I now know the format of
the tcpdumps I attached to my previous message in this thread. The first
twenty bytes of each packet are an IP header, the next twenty are a TCP
header, and any subsequent bytes are data.
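
For anyone else decoding the dumps by hand, the layout just described can be sketched in C. This is only an illustration, assuming a 20-byte IP header with no options followed by a 20-byte TCP header with no options, as in these dumps. The names hdr_summary, be16, be32 and parse_headers are made up for this sketch and do not come from ELKS or tcpdump.

```c
#include <stdint.h>

/* Hypothetical summary of the fields of interest. Offsets are counted
 * from the start of the packet and assume no IP or TCP options. */
struct hdr_summary {
    uint16_t total_length;  /* bytes 2-3   */
    uint16_t ip_id;         /* bytes 4-5   */
    uint8_t  protocol;      /* byte  9     (6 = TCP, 1 = ICMP) */
    uint16_t ip_checksum;   /* bytes 10-11 */
    uint32_t src_addr;      /* bytes 12-15 */
    uint32_t dst_addr;      /* bytes 16-19 */
    uint16_t src_port;      /* bytes 20-21 */
    uint16_t dst_port;      /* bytes 22-23 */
    uint32_t seq;           /* bytes 24-27 */
    uint32_t ack;           /* bytes 28-31 */
    uint16_t window;        /* bytes 34-35 */
    uint16_t tcp_checksum;  /* bytes 36-37 */
};

/* Multi-byte header fields are big-endian ("network order") on the wire. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* pkt must point at no fewer than 40 bytes of packet data. */
static struct hdr_summary parse_headers(const uint8_t *pkt)
{
    struct hdr_summary h;

    h.total_length = be16(pkt + 2);
    h.ip_id        = be16(pkt + 4);
    h.protocol     = pkt[9];
    h.ip_checksum  = be16(pkt + 10);
    h.src_addr     = be32(pkt + 12);
    h.dst_addr     = be32(pkt + 16);
    h.src_port     = be16(pkt + 20);
    h.dst_port     = be16(pkt + 22);
    h.seq          = be32(pkt + 24);
    h.ack          = be32(pkt + 28);
    h.window       = be16(pkt + 34);
    h.tcp_checksum = be16(pkt + 36);
    return h;
}
```

Reading the fields byte by byte, rather than casting the buffer to a struct, keeps the sketch independent of the host's byte order and structure padding.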

The sixth 16-bit word of each packet is the IP Header Checksum. For each
case in which the connection to the ELKS webserver stalled, the IP Header
Checksum was *WRONG* in the packet the Red Hat 7.0 Linux box failed to
ACK. Furthermore, it was always the *same* incorrect value (0xf6ff, where
0xf5ff would be correct) in the *same* packet (the one with the
Content-Length). The IP Header Checksum was correct (0xf600) in the
corresponding packet of the 99-byte file that didn't stall. It looks to 
me like the Linux box is doing what it should, ignoring a packet with a 
bad IP checksum. I think the ELKS box is either failing to time out due to 
the missing ACK, or is erroneously "re-transmitting" what would be the 
next packet instead of the bad one.

There's something else strange about all the packets sent by ELKS: the IP
Identification field (the third 16-bit word of each packet) is always
zero, whereas the Linux box seems to increment that field each time it
sends a packet. Maybe ELKS is sending the wrong packet because it has the
same IP Identification.

Here's some information you might find useful:

1. The description of the IP checksum algorithm in RFC-791 (and probably that
of the TCP Checksum in RFC-793) is, to put it kindly, misleading. It
should probably read something like, "The checksum is computed by doing a
16-bit add-with-carry (one's-complement sum) of all the 16-bit words in the
IP header, taking the checksum field itself as zero, then complementing
each bit of the sum".

2. I only glanced at it, but I think RFC-1624 discusses some boundary 
conditions under which commonly-used checksum algorithms fail.

3. When computing the ELKS IP Header Checksum, the only field that seems
to change from packet to packet (since the Identification is always zero
and the Header Checksum field is taken to be zero), is the Total Length
(the second 16-bit word). I was able to quickly test the accuracy of
checksums by precomputing the sum-with-carry of everything else, then
simply adding-with-carry the length, then inverting the bits. For all the
tcpdumps I sent (including the one for the original 266-byte webpage),
simply add-with-carry the second 16-bit word to 0x09c1, then complement
that sum and compare it to the checksum in the tcpdump. Note that this
applies only to IP headers sent by the ELKS box.
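
The shortcut in item 3 can be written out as a few lines of C. This is a sketch, not code from ELKS: the constant 0x09c1 is the precomputed sum quoted above and holds only for the ELKS-originated headers in these particular dumps, and the function names are invented for the example.

```c
#include <stdint.h>

/* 16-bit add-with-carry: add two words and wrap any carry out of
 * bit 15 back into bit 0 (one's-complement addition). */
static uint16_t add_with_carry(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;
    return (uint16_t)((sum & 0xffffu) + (sum >> 16));
}

/* Expected IP Header Checksum for an ELKS header from these dumps,
 * given only its Total Length field (the second 16-bit word). */
static uint16_t elks_header_checksum(uint16_t total_length)
{
    /* 0x09c1: sum-with-carry of every other header word,
     * with the checksum field taken as zero. */
    return (uint16_t)~add_with_carry(0x09c1, total_length);
}
```

With this, a Total Length of 0x003f (the 63-byte packet) gives 0xf5ff, and 0x003e (the 62-byte packet from the 99-byte file) gives 0xf600.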


* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-19 10:07 jb1
@ 2002-10-19 17:55 ` Gabucino
  0 siblings, 0 replies; 16+ messages in thread
From: Gabucino @ 2002-10-19 17:55 UTC
  To: linux-8086

[-- Attachment #1: Type: text/plain, Size: 352 bytes --]

> There's something else strange about all the packets sent by ELKS
I've also reported a problem where ELKS replied with somewhat random IP 
addresses in ICMP Echo Reply packets. Maybe there's some connection?

-- 
Gabucino

/ Worshipping Niedermayer 4ever /
"Do you always look at it encoded?" (C) D. R. F.
"minden lehet. ez nem." (c) A'rpi

[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]


* Re: webserver stalls [was Re: bug in (linux) slattach]
       [not found] <1035036158.454.17.camel@cool>
@ 2002-10-20  9:34 ` jb1
  2002-10-20 17:06   ` Harry Kalogirou
  0 siblings, 1 reply; 16+ messages in thread
From: jb1 @ 2002-10-20  9:34 UTC
  To: Harry Kalogirou; +Cc: Linux-8086

On 19 Oct 2002, Harry Kalogirou wrote:

> 
> Since I had used ELKS for a long time on my network and I hardly had
> checksum errors, had other bigger problems 8), I think that this has

From what I've seen in the mailing list archives, I think other people have
had similar problems. They probably just gave up when no one answered
their vague, sometimes irrelevant, questions.

> something to do with the serial line altering bytes that ELKS transmits.
> Can you send me the output of "stty -a -F /dev/ttySX" after you set up
> the connection? Maybe the line is not correctly set up on the Linux side
> (XON/XOFF and so on).

After issuing "/bin/stty -F /dev/ttyS0 4800" on the Red Hat 7.0 Linux box,
"stty -a -F /dev/ttyS0" displays:
speed 4800 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W;
lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
-ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon -ixoff
-iuclc -ixany -imaxbel
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
echoctl echoke

> If the above proves the setup of the serial line to be OK, then ELKS has
> a problem. Maybe then the problem is in the assembly-optimized checksum
> routines I wrote. Disabling them by undefining USE_ASM in ip.c will show
> that.

Both the C and ASM routines in ip_calc_chksum() in elksnet/ktcp/ip.c from
elksnet-0.1.1.tar.gz look like they should have worked correctly for the
packet with the bad IP Header checksum. The C routine has a lurking bug;
it doesn't account for a possible carry in
	return ~((sum & 0xffff) + ((sum >> 16) & 0xffff));
but even if USE_ASM were undefined it wouldn't have affected that packet.
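
To make the lurking carry concrete, here is a small C sketch. It is not the ELKS source: fold_once simply wraps the expression quoted above in a function (assuming a 32-bit accumulator), and fold_fully is a hypothetical variant that keeps folding until nothing is left above bit 15.

```c
#include <stdint.h>

/* Single fold, as in the quoted expression: adding the two halves can
 * itself carry out of bit 15, and that carry is then thrown away. */
static uint16_t fold_once(uint32_t sum)
{
    return (uint16_t)~((sum & 0xffff) + ((sum >> 16) & 0xffff));
}

/* Repeat the fold until the upper 16 bits are empty, so a carry
 * produced by the fold itself is added back in as well. */
static uint16_t fold_fully(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The two only differ when the halves add up to more than 0xffff, for example for an accumulated sum of 0x1ffff. For 0x209fe (the network-order sum of the 63-byte packet's header words with the checksum field zeroed) both return 0xf5ff, which is consistent with that packet not being affected.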

Even if my serial port handshaking is incorrectly set up, the difference
between the packet that fails and the one that succeeds is trivial. The
former is 63 bytes long with the data "Content-Length: 100^M^J^M^J"; the
latter is 62 bytes long with the data "Content-Length: 99^M^J^M^J" and
(after the ACK by the Linux box) is followed by a *successful* 139-byte
packet containing the entire 99-byte webpage file. Also, the erroneous
checksum is exactly the same and in exactly the same packet even for the
266-byte original file tcpdump'ed several days earlier
("Content-Length: 266^M^J^M^J"). The Linux box's /proc/cpuinfo says its
AMD-K6 is running at
360.800 MHz; I wouldn't be surprised if it could run 4800 baud with no 
handshaking at all, and I didn't notice any XON or XOFF characters mixed 
in with the data.

I wonder if there's any significance in the fact that the problem occurs
precisely at the boundary between data obviously generated by the server,
itself, and the contents of the webpage file.


* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-20  9:34 ` webserver stalls [was Re: bug in (linux) slattach] jb1
@ 2002-10-20 17:06   ` Harry Kalogirou
  2002-10-21  9:44     ` jb1
  0 siblings, 1 reply; 16+ messages in thread
From: Harry Kalogirou @ 2002-10-20 17:06 UTC
  To: jb1; +Cc: Linux-8086

> On 19 Oct 2002, Harry Kalogirou wrote:
> 
> > 
> > Since I had used ELKS for a long time on my network and I hardly had
> > checksum errors, had other bigger problems 8), I think that this has
> 
> From what I've seen in the mailing list archives, I think other people have
> had similar problems. They probably just gave up when no one answered
> their vague, sometimes irrelevant, questions.
> 
> > something to do with the serial line altering bytes that ELKS transmits.
> > Can you send me the output of "stty -a -F /dev/ttySX" after you set up
> > the connection? Maybe the line is not correctly set up on the Linux side
> > (XON/XOFF and so on).
> 
> After issuing "/bin/stty -F /dev/ttyS0 4800" on the Red Hat 7.0 Linux box,
> "stty -a -F /dev/ttyS0" displays:
> speed 4800 baud; rows 0; columns 0; line = 0;
> intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
> eol2 = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W;
> lnext = ^V; flush = ^O; min = 1; time = 0;
> -parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
> -ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon 
> -ixoff
> -iuclc -ixany -imaxbel
> opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 
> vt0 ff0
> isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop 
> -echoprt
> echoctl echoke
>

As I suspected, a misconfigured line. Configure yours like this:

intr = <undef>; quit = <undef>; erase = <undef>; kill = <undef>; eof = <undef>;
eol = <undef>; eol2 = <undef>; start = <undef>; stop = <undef>; susp = <undef>;
rprnt = <undef>; werase = <undef>; lnext = <undef>; flush = <undef>;
min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff
-iuclc -ixany -imaxbel
-opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop
-echoprt -echoctl -echoke
 

Basically, the above configuration is what the -L parameter of
slattach sets up, except for -crtscts. What I do, just to be sure, is:

# slattach -p [c]slip -L -e /dev/ttyS0
# stty -F /dev/ttyS0 -crtscts
# slattach -p [c]slip -s 4800 -m /dev/ttyS0 &
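
For comparison, roughly the same line settings can be expressed with the POSIX termios interface. This is only a sketch under assumptions: make_line_raw is a hypothetical helper, not part of slattach or ELKS, it covers the flags that matter for an 8-bit transparent line rather than every item in the stty listing, and it leaves the control characters alone.

```c
#include <termios.h>

/* Hypothetical helper: edit *t so that it roughly matches the stty
 * listing above (raw 8-bit line, no software or hardware flow control). */
static int make_line_raw(struct termios *t, speed_t speed)
{
    /* Input: no CR/NL translation, no XON/XOFF, no parity checking;
     * ignore breaks and characters with parity errors. */
    t->c_iflag &= ~(tcflag_t)(BRKINT | PARMRK | INPCK | ISTRIP | INLCR |
                              IGNCR | ICRNL | IXON | IXOFF);
    t->c_iflag |= IGNBRK | IGNPAR;

    /* Output: no post-processing (-opost). */
    t->c_oflag &= ~(tcflag_t)OPOST;

    /* Local: no signals, no line editing, no echo. */
    t->c_lflag &= ~(tcflag_t)(ISIG | ICANON | IEXTEN | ECHO | ECHOE |
                              ECHOK | ECHONL);

    /* Control: 8 data bits, no parity, one stop bit, modem lines ignored. */
    t->c_cflag &= ~(tcflag_t)(CSIZE | PARENB | PARODD | CSTOPB);
    t->c_cflag |= CS8 | CREAD | CLOCAL | HUPCL;
#ifdef CRTSCTS
    /* Not in POSIX, hence the guard: turn off RTS/CTS handshaking. */
    t->c_cflag &= ~(tcflag_t)CRTSCTS;
#endif

    /* Return from read() as soon as one byte is available. */
    t->c_cc[VMIN]  = 1;
    t->c_cc[VTIME] = 0;

    if (cfsetispeed(t, speed) != 0)
        return -1;
    return cfsetospeed(t, speed);
}
```

A caller would normally fetch the current settings with tcgetattr(), pass them through the helper with B4800, and apply the result with tcsetattr(fd, TCSANOW, ...).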


Harry





* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-20 17:06   ` Harry Kalogirou
@ 2002-10-21  9:44     ` jb1
  2002-10-21  9:55       ` Harry Kalogirou
  0 siblings, 1 reply; 16+ messages in thread
From: jb1 @ 2002-10-21  9:44 UTC
  To: Harry Kalogirou; +Cc: Linux-8086

On 20 Oct 2002, Harry Kalogirou wrote:

> > On 19 Oct 2002, Harry Kalogirou wrote:
...
> As I suspected, a misconfigured line. Configure yours like this:
> 
> intr = <undef>; quit = <undef>; erase = <undef>; kill = <undef>; eof =
> <undef>;
> eol = <undef>; eol2 = <undef>; start = <undef>; stop = <undef>; susp =
> <undef>;
> rprnt = <undef>; werase = <undef>; lnext = <undef>; flush = <undef>;
> min = 1; time = 0;
> -parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts
> ignbrk -brkint ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon
> -ixoff
> -iuclc -ixany -imaxbel
> -opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0
> bs0 vt0
> ff0
> -isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase -tostop
> -echoprt -echoctl -echoke
>  
> 
> Basically, the above configuration is what the -L parameter of
> slattach sets up, except for -crtscts. What I do, just to be sure, is:
> 
> # slattach -p [c]slip -L -e /dev/ttyS0
> # stty -F /dev/ttyS0 -crtscts
> # slattach -p [c]slip -s 4800 -m /dev/ttyS0 &

I did as you suggested (with one exception), confirmed that the settings
were exactly like yours, and found *no* difference; the 99-byte webpage
file works, the 100-byte webpage files don't. The exception was:
	stty 4800 -F /dev/ttyS0 -crtscts
because my slattach program doesn't seem to change the baud rate. Also,
I'm still getting frequent, seemingly random errors when I ping the ELKS box.



* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-21  9:44     ` jb1
@ 2002-10-21  9:55       ` Harry Kalogirou
  2002-10-22 10:16         ` jb1
  0 siblings, 1 reply; 16+ messages in thread
From: Harry Kalogirou @ 2002-10-21  9:55 UTC
  To: jb1; +Cc: Linux-8086

[-- Attachment #1: Type: text/plain, Size: 1788 bytes --]

> I did as you suggested (with one exception), confirmed that the settings
> were exactly like yours, and found *no* difference; the 99-byte webpage
> file works, the 100-byte webpage files don't. The exception was:
> 	stty 4800 -F /dev/ttyS0 -crtscts
> because my slattach program doesn't seem to change the baud rate. Also,
> I'm still getting frequent, seemingly random errors when I ping the ELKS box.

Mmm.. weird.. I've probably tired you out with all this, but can you try to
see if the failures are really random? A good aid for this is the -p
parameter of ping.

I'm just convinced that there is a problem after the packets leave ELKS.
Harry



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 232 bytes --]


* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-21  9:55       ` Harry Kalogirou
@ 2002-10-22 10:16         ` jb1
  2002-10-22 13:56           ` Harry Kalogirou
  2002-10-22 13:57           ` [SOLVED] " Harry Kalogirou
  0 siblings, 2 replies; 16+ messages in thread
From: jb1 @ 2002-10-22 10:16 UTC
  To: Harry Kalogirou; +Cc: Linux-8086

On 21 Oct 2002, Harry Kalogirou wrote:

> Mmm.. weird.. I've probably tired you out with all this, but can you try to
> see if the failures are really random? A good aid for this is the -p
> parameter of ping.

100 pings (200 packets) each of patterns 00, 55, aa, and ff had zero to 
five errors, too few to account for the 100 percent failure rate of 
certain webpage files. 55 had the most errors and was the only one with an 
error in the pattern data. Most of the other errors were something about 
the time-of-day going back; 00 had one extremely long response time 
(1074131 ms).

I think I can now prove that there's at least one IP Header sum-with-carry
that results in a reproducible checksum error. I discovered that if the
ELKS IP address were 192.168.1.135, all my test files could be read; large
files required a few tries, but I was even able to read one 4369 (0x1111)
bytes long! The unique property of the packets that never got ACK'ed is 
that their checksum-field contains 0xF6FF instead of the correct value 
0xF5FF (the complement of 0x0A00).

Each of the webpage files that stall produces a defective packet with this 
IP Header (the first twenty bytes of the packet):
	4500 003f 0000 0000 4006 f6ff c0a8 0164 c0a8 0205
The corresponding packet in the 99-byte file is one byte shorter (003e 
instead of 003f), consequently having a different IP Header Checksum 
(f600 instead of the erroneous f6ff):
	4500 003e 0000 0000 4006 f600 c0a8 0164 c0a8 0205

Ping uses Protocol 01 instead of Protocol 06, so by changing the ELKS IP 
address from 192.168.1.100 to 192.168.1.105 I was able to produce the 
identical erroneous IP Header Checksum with the command:
	ping -s 35 192.168.1.105
resulting in the IP header:
	4500 003f 0000 0000 4001 f6ff c0a8 0169 c0a8 0205

To demonstrate that the problem is not the total packet size I added 1 to 
the packetsize and subtracted 1 from the ELKS IP address:
	ping -s 36 192.168.1.104
resulting in the IP Header:
	4500 0040 0000 0000 4001 f6ff c0a8 0168 c0a8 0205

Just for symmetry, I produced the same checksum as that for the 99-byte
webpage file, but the same length as the 100- and 266-byte webpage files
with:
	ping -s 35 192.168.1.104
resulting in the IP Header:
	4500 003f 0000 0000 4001 f600 c0a8 0168 c0a8 0205

In all cases, the pings with the defective checksum had 100% loss, while 
those with the good checksum succeeded. I didn't try manipulating the 
source IP address (c0a8 0205 = 192.168.2.5). If you can manipulate the 
packetsize and ELKS IP address so that the sum-with-carry of this header 
sans checksum-field is 0x09C1 you should be able to reproduce my results; 
otherwise it's probably a quirk in Red Hat 7.0 Linux (or you're using a 
different version of some critical ELKS file).
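
Anyone wanting to check headers like the ones above can do so with a few lines of C. This is a sketch of the usual receiver-side test (a valid header, checksum field included, sums with end-around carry to 0xFFFF). The function name is invented here, and the words are taken in network order exactly as they are printed in the dumps.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the one's-complement sum of all header words, checksum
 * field included, is 0xFFFF (i.e. the header checksum is valid), else 0.
 * words[] holds the header as 16-bit values in network order. */
static int ip_header_checksum_ok(const uint16_t *words, size_t nwords)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < nwords; i++)
        sum += words[i];

    /* Fold the carries back in until only 16 bits remain. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return sum == 0xffff;
}
```

Fed the ten words of the headers listed above, this rejects the two ending in "f6ff c0a8 0164 c0a8 0205" and "f6ff c0a8 0168 c0a8 0205" and accepts the two carrying f600, matching the 100% loss seen for the former.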

SOURCE PACKAGES:
        elks-0.1.1.tar.gz, elkscmd_20020501.tar.gz, elksnet-0.1.1.tar.gz,
        Dev86src-0.16.0.tar.gz
CVS PATCHES:
        (none)
COMPILED UNDER:
        Red Hat 7.0 Linux, kernel 2.2.16-22

Note: I think bad packets consume memory. After several unsuccessful 
transfers I started seeing "Cannot fork" on the ELKS box when I issued 
commands ... eventually I'd have to reboot it. It might be a good idea to 
purge them after a minute or two.

Does anything other than the system time depend upon the CMOS clock? It 
obviously hasn't been read on any of the four machines on which I tried 
ELKS (yes, they all *have* standard, working CMOS clocks).

By the way, I received two copies of this message in addition to the copy 
sent from the mailing list.


* Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 10:16         ` jb1
@ 2002-10-22 13:56           ` Harry Kalogirou
  2002-10-22 13:57           ` [SOLVED] " Harry Kalogirou
  1 sibling, 0 replies; 16+ messages in thread
From: Harry Kalogirou @ 2002-10-22 13:56 UTC
  To: jb1; +Cc: Linux-8086

[-- Attachment #1: Type: text/plain, Size: 3597 bytes --]

> On 21 Oct 2002, Harry Kalogirou wrote:
> 
> > Mmm.. weird.. I've probably tired you out with all this, but can you try to
> > see if the failures are really random? A good aid for this is the -p
> > parameter of ping.
> 
> 100 pings (200 packets) each of patterns 00, 55, aa, and ff had zero to 
> five errors, too few to account for the 100 percent failure rate of 
> certain webpage files. 55 had the most errors and was the only one with an 
> error in the pattern data. Most of the other errors were something about 
> the time-of-day going back; 00 had one extremely long response time 
> (1074131 ms).
> 
> 
> I think I can now prove that there's at least one IP Header sum-with-carry
> that results in a reproducible checksum error. I discovered that if the
> ELKS IP address were 192.168.1.135, all my test files could be read; large
> files required a few tries, but I was even able to read one 4369 (0x1111)
> bytes long! The unique property of the packets that never got ACK'ed is 
> that their checksum-field contains 0xF6FF instead of the correct value 
> 0xF5FF (the complement of 0x0A00).
> 
> Each of the webpage files that stall produces a defective packet with this 
> IP Header (the first twenty bytes of the packet):
> 	4500 003f 0000 0000 4006 f6ff c0a8 0164 c0a8 0205
> The corresponding packet in the 99-byte file is one byte shorter (003e 
> instead of 003f), consequently having a different IP Header Checksum 
> (f600 instead of the erroneous f6ff):
> 	4500 003e 0000 0000 4006 f600 c0a8 0164 c0a8 0205
> 
> Ping uses Protocol 01 instead of Protocol 06, so by changing the ELKS IP 
> address from 192.168.1.100 to 192.168.1.105 I was able to produce the 
> identical erroneous IP Header Checksum with the command:
> 	ping -s 35 192.168.1.105
> resulting in the IP header:
> 	4500 003f 0000 0000 4001 f6ff c0a8 0169 c0a8 0205
> 
> To demonstrate that the problem is not the total packet size I added 1 to 
> the packetsize and subtracted 1 from the ELKS IP address:
> 	ping -s 36 192.168.1.104
> resulting in the IP Header:
> 	4500 0040 0000 0000 4001 f6ff c0a8 0168 c0a8 0205
> 
> Just for symmetry, I produced the same checksum as that for the 99-byte
> webpage file, but the same length as the 100- and 266-byte webpage files
> with:
> 	ping -s 35 192.168.1.104
> resulting in the IP Header:
> 	4500 003f 0000 0000 4001 f600 c0a8 0168 c0a8 0205
> 
> In all cases, the pings with the defective checksum had 100% loss, while 
> those with the good checksum succeeded. I didn't try manipulating the 
> source IP address (c0a8 0205 = 192.168.2.5). If you can manipulate the 
> packetsize and ELKS IP address so that the sum-with-carry of this header 
> sans checksum-field is 0x09C1 you should be able to reproduce my results; 
> otherwise it's probably a quirk in Red Hat 7.0 Linux (or you're using a 
> different version of some critical ELKS file).
>

I'll check and get back to you...
 
> Note: I think bad packets consume memory. After several unsuccessful 
> transfers I started seeing "Cannot fork" on the ELKS box when I issued 
> commands ... eventually I'd have to reboot it. It might be a good idea to 
> purge them after a minute or two.

Those are the web server processes that are waiting for the data to be
transmitted; when they exit, the memory will be freed.

> Does anything other than the system time depend upon the CMOS clock? It 
> obviously hasn't been read on any of the four machines on which I tried 
> ELKS (yes, they all *have* standard, working CMOS clocks).
> 

I don't think so.

Harry



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 232 bytes --]


* [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 10:16         ` jb1
  2002-10-22 13:56           ` Harry Kalogirou
@ 2002-10-22 13:57           ` Harry Kalogirou
  2002-10-22 16:02             ` Harry Kalogirou
  1 sibling, 1 reply; 16+ messages in thread
From: Harry Kalogirou @ 2002-10-22 13:57 UTC
  To: jb1; +Cc: Linux-8086

OK, the quest is over.

After all, it was a problem with the checksum functions written in assembly!
Did you try with USE_ASM undefined? Anyway, it works now and I committed
it to CVS.

Thank you very much for all your effort! Nice work.

Harry






* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 13:57           ` [SOLVED] " Harry Kalogirou
@ 2002-10-22 16:02             ` Harry Kalogirou
  2002-10-23  9:37               ` jb1
  2002-10-29 10:25               ` jb1
  0 siblings, 2 replies; 16+ messages in thread
From: Harry Kalogirou @ 2002-10-22 16:02 UTC
  To: Harry Kalogirou; +Cc: jb1, Linux-8086

[-- Attachment #1: Type: text/plain, Size: 386 bytes --]


> OK, the quest is over.
> 
> After all, it was a problem with the checksum functions written in assembly!
> Did you try with USE_ASM undefined? Anyway, it works now and I committed
> it to CVS.
> 
> Thank you very much for all your effort! Nice work.
> 
> Harry


Actually, the quest is over now... previously I had managed to commit only
half the patch to CVS...

Harry



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 232 bytes --]


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 16:02             ` Harry Kalogirou
@ 2002-10-23  9:37               ` jb1
  2002-10-23 11:42                 ` Harry Kalogirou
  2002-10-29 10:25               ` jb1
  1 sibling, 1 reply; 16+ messages in thread
From: jb1 @ 2002-10-23  9:37 UTC
  To: Harry Kalogirou; +Cc: Harry Kalogirou, Linux-8086

On 22 Oct 2002, Harry Kalogirou wrote:

> Actually, the quest is over now... previously I had managed to commit only
> half the patch to CVS...

Maybe not. I found ip.c Version 1.9 by browsing the CVS repository and, as
far as I can tell, the only change was that you moved the first "dec cx";  
this will have *no* effect. The algorithm can still fail if the carry flag
happens to be set going into the routine, or if there is a carry generated
the last time "adc [di]" is executed. I suggest something like this for
_ip_calc_chksum:

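        push    bp
        mov     bp,sp
        push    di

        mov     cx, 6[bp]       ; CX = length in bytes
        sar     cx, 1           ; CX = length in 16-bit words
        dec     cx              ; the first word is loaded, not added
        xor     ax,ax           ; clear carry flag (as well as AX)
        mov     di, 4[bp]       ; DI -> start of header
        mov     ax, [di]        ; AX = first word (MOV leaves flags alone)
        inc     di              ; INC does not touch the carry flag
        inc     di
loop1:
        adc     ax, [di]        ; add next word plus the previous carry
        inc     di
        inc     di

        loop    loop1           ; a byte shorter and a clock faster
                                ;  than DEC CX/JNZ LOOP1

        adc     ax,0            ; add (just) the final carry
        not     ax

        pop     di
        pop     bp

        ret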

Of course, this algorithm is valid only if the length (6[bp]) is an even
number of bytes. While this is always true for IP headers, for TCP packet 
checksums there would have to be a final test of the length's low bit and 
appropriate handling of an additional odd byte.

I ran the original routine on my "defective" packet IP Header using 
MSDOS' "debug"  (with the carry initially clear and the data 
byte-swapped in memory) and got the correct checksum. Were there any other 
updated files I should have downloaded from the CVS repository?
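
The effect of a dropped final carry can also be modelled in C. This is a model of the arithmetic only, not the ELKS routine. It assumes the checksum field is zero while summing and that the header is read as little-endian 16-bit words, as an 8086 would load them; the function name and the add_final_carry switch are invented for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Emulates a chain of 16-bit ADC instructions over the header bytes.
 * add_final_carry selects whether the trailing "adc ax,0" is executed.
 * Returns the value left in AX after the final NOT. */
static uint16_t adc_chain_checksum(const uint8_t *hdr, size_t len,
                                   int add_final_carry)
{
    uint32_t ax = 0;
    uint32_t carry = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        /* Little-endian load: low byte first, as on the 8086. */
        uint32_t word = (uint32_t)hdr[i] | ((uint32_t)hdr[i + 1] << 8);
        uint32_t t = ax + word + carry;     /* adc ax, [di] */

        carry = t >> 16;
        ax = t & 0xffff;
    }
    if (add_final_carry)
        ax = (ax + carry) & 0xffff;         /* adc ax, 0 */

    return (uint16_t)~ax;                   /* not ax */
}
```

Under these assumptions, the 63-byte packet's header (45 00 00 3f ... c0 a8 02 05) leaves 0xfff6 in AX when the last carry is dropped and 0xfff5 when it is added. Stored low byte first, those appear on the wire as f6 ff and f5 ff. The 62-byte packet's header produces no final carry, so both variants agree on 0x00f6 (f6 00 on the wire).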


* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-23  9:37               ` jb1
@ 2002-10-23 11:42                 ` Harry Kalogirou
  2002-10-24  8:55                   ` jb1
  0 siblings, 1 reply; 16+ messages in thread
From: Harry Kalogirou @ 2002-10-23 11:42 UTC
  To: jb1; +Cc: Linux-8086

> Maybe not. I found ip.c Version 1.9 by browsing the CVS repository and, as
> far as I can tell, the only change was that you moved the first "dec cx";  
> this will have *no* effect. The algorithm can still fail if the carry flag
> happens to be set going into the routine, or if there is a carry generated
> the last time "adc [di]" is executed. I suggest something like this for
> _ip_calc_chksum:
> 
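>         push    bp
>         mov     bp,sp
>         push    di
> 
>         mov     cx, 6[bp]       ; CX = length in bytes (must be even)
>         sar     cx, 1           ; CX = length in 16-bit words
>         dec     cx              ; first word is loaded, not added
>         xor     ax,ax           ; clear carry flag (as well as AX)
>         mov     di, 4[bp]       ; DI -> start of data
>         mov     ax, [di]        ; AX = first word (MOV leaves CF alone)
>         inc     di              ; INC does not touch CF either
>         inc     di
> loop1:
>         adc     ax, [di]        ; add next word plus carry from last add
>         inc     di
>         inc     di
> 
>         loop    loop1           ; a byte shorter and a clock faster
>                                 ;  than DEC CX/JNZ LOOP1
> 
>         adc     ax,0            ; add (just) the final carry
>         not     ax              ; one's complement of the sum
> 
>         pop     di
>         pop     bp
> 
>         ret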
>

You couldn't be more right 8). I thought I could get away without
opening the 8086 instruction manual, and I made bad assumptions
about when the carry flag is cleared.

The CVS now contains all your bugfixes (clearing the carry before entering
the loop, adding the last carry) and the use of "loop"; I also unrolled the
loop once. A code review would be greatly appreciated.
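
To make the two bugfixes concrete, here is a small C model of an ADC-based
summing loop. It is only a sketch for experimenting, not the code in CVS; the
names adc16 and adc_loop_checksum, and the carry_in and add_final parameters,
are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Emulates 8086 "ADC AX, src": AX += src + CF, and CF becomes the carry out. */
static uint16_t adc16(uint16_t ax, uint16_t src, unsigned *cf)
{
    uint32_t wide = (uint32_t)ax + src + *cf;

    *cf = (unsigned)(wide >> 16);
    return (uint16_t)wide;
}

/* Sums n 16-bit words the way an ADC loop would.
 * carry_in  models the state of the carry flag on entry to the loop;
 * add_final selects whether the trailing "ADC AX,0" is executed. */
uint16_t adc_loop_checksum(const uint16_t *w, size_t n,
                           unsigned carry_in, int add_final)
{
    unsigned cf = carry_in;
    uint16_t ax = 0;
    size_t i;

    for (i = 0; i < n; i++)
        ax = adc16(ax, w[i], &cf);

    if (add_final)
        ax = adc16(ax, 0, &cf);     /* pick up the carry from the last add */

    return (uint16_t)~ax;
}
```

With the two words 0xFFFF and 0x0002 the last addition carries, so the model
gives 0xFFFD with the final ADC and 0xFFFE without it. A stray carry on entry
(carry_in = 1) shifts the result by one in the same way.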
 
> Of course, this algorithm is valid only if the length (6[bp]) is an even
> number of bytes. While this is always true for IP headers, for TCP packet 
> checksums there would have to be a final test of the length's low bit and 
> appropriate handling of an additional odd byte.

TCP uses another routine.

Harry

* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-23 11:42                 ` Harry Kalogirou
@ 2002-10-24  8:55                   ` jb1
  0 siblings, 0 replies; 16+ messages in thread
From: jb1 @ 2002-10-24  8:55 UTC (permalink / raw
  To: Harry Kalogirou; +Cc: Linux-8086

On 23 Oct 2002, Harry Kalogirou wrote:

> The CVS now contains all your bugfixes (clearing the carry before entering
> the loop, adding the last carry) and the use of "loop"; I also unrolled the
> loop once. A code review would be greatly appreciated.

The file ip.c Version 1.10 from the CVS repository looks good. I haven't 
tried it yet, but a "toy" version of _ip_calc_chksum runs correctly in 
DEBUG under MSDOS.

There's a trivial change I'd suggest: "SAR CX,1" to "SHR CX,1". SAR
retains the high bit's value (for signed arithmetic), whereas SHR shifts a
zero into the high bit. Since the IP Internet Header Length from which the
length is derived can be no more than 15 (32-bit words, i.e. 60 bytes), and
both instructions are two bytes and take two clocks, this is just defensive
programming against a spurious call with a length greater than 32767. Also,
the copyright date is still last year's.
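
The difference between the two shifts can be modelled in C as below. This is
only an illustration; the helper names shr1 and sar1 are made up, and the
arithmetic shift is emulated on an unsigned value so the result does not
depend on how a given compiler shifts negative numbers.

```c
#include <stdint.h>

/* Logical shift right by one, like 8086 "SHR CX,1": bit 15 becomes 0. */
uint16_t shr1(uint16_t cx)
{
    return (uint16_t)(cx >> 1);
}

/* Arithmetic shift right by one, like 8086 "SAR CX,1": bit 15 is kept. */
uint16_t sar1(uint16_t cx)
{
    return (uint16_t)((cx >> 1) | (cx & 0x8000u));
}
```

For any length below 32768 the two agree (a 20-byte header gives 10 words
either way). For a length of 40000, shr1 gives 20000 while sar1 gives 52768.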

* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-22 16:02             ` Harry Kalogirou
  2002-10-23  9:37               ` jb1
@ 2002-10-29 10:25               ` jb1
  2002-10-29 12:37                 ` Harry Kalogirou
  1 sibling, 1 reply; 16+ messages in thread
From: jb1 @ 2002-10-29 10:25 UTC (permalink / raw
  To: Harry Kalogirou; +Cc: Linux-8086

On 22 Oct 2002, Harry Kalogirou wrote:

> > Ok the quest is over.

Not yet. I think _tcp_chksumraw in tcp_output.c needs the same fixes as 
those you applied to _tcp_chksum. Without them I still got partial files 
with telnet/get.

There's *still* something wrong, but it shows up most frequently when I 
urlget from one ELKS box to another (yes, they have different IP 
addresses). Rarely, all goes as it should; more often, the entire file 
comes in a reasonable time, but I never get the command prompt; often, 
nothing comes in and I never get the command prompt. Once, nothing seemed 
to happen for about 10 minutes, but when I checked the machines about 10 
minutes later, the file had come in but there was no command prompt. I had 
enabled a second getty on that machine, so I was able to log in and run 
netstat on both machines while urlget was hung. Here are the results 
(about an hour later):

On the client ("urlget") machine (192.168.1.100) --
1 ESTABLISHED 4000ms 1025       0.0.0.0  2
2 ESTABLISHED 2400ms 1024 192.168.1.144 80
3      LISTEN 4000ms   80       0.0.0.0  0

On the server ("sender") machine (192.168.1.144) --
1 ESTABLISHED 4000ms 1024       0.0.0.0  2
2      LISTEN 4000ms   80       0.0.0.0  0

Obviously, the server has discarded the connection, but the client machine
thinks it's still connected.

I'm also not sure the client port number (the one that's 1024 or greater)
is handled properly. Each time I connect from a Linux box the port number
is incremented, but once I observed that a first, successful, connection
from ELKS was from port 1024, and the next, hanging, attempt was *also*
from port 1024. Connections from Linux usually, but not always, work;
connections from ELKS rarely work.

Diagnosing this stuff is very time-consuming because "kill" doesn't seem 
to do anything, so I must reboot both machines. Since ELKS "telnet" 
doesn't do anything but connect (and logs me out when it terminates!) I 
can't compare telnet from Linux and ELKS. I can only compare "urlget" from 
both systems, and since there's no tcpdump for ELKS I can't even determine
if a failure is actually due to urlget.

* Re: [SOLVED] Re: webserver stalls [was Re: bug in (linux) slattach]
  2002-10-29 10:25               ` jb1
@ 2002-10-29 12:37                 ` Harry Kalogirou
  0 siblings, 0 replies; 16+ messages in thread
From: Harry Kalogirou @ 2002-10-29 12:37 UTC (permalink / raw
  To: jb1; +Cc: Linux-8086

> On 22 Oct 2002, Harry Kalogirou wrote:
> 
> > > Ok the quest is over.
> 
> Not yet. I think _tcp_chksumraw in tcp_output.c needs the same fixes as 
> those you applied to _tcp_chksum. Without them I still got partial files 
> with telnet/get.

It is fixed.
 
> There's *still* something wrong, but it shows up most frequently when I 
> urlget from one ELKS box to another (yes, they have different IP 
> addresses). Rarely, all goes as it should; more often, the entire file 
> comes in a reasonable time, but I never get the command prompt; often, 
> nothing comes in and I never get the command prompt. Once, nothing seemed 
> to happen for about 10 minutes, but when I checked the machines about 10 
> minutes later, the file had come in but there was no command prompt. I had 
> enabled a second getty on that machine, so I was able to log in and run 
> netstat on both machines while urlget was hung. Here are the results 
> (about an hour later):
> 
> On the client ("urlget") machine (192.168.1.100) --
> 1 ESTABLISHED 4000ms 1025       0.0.0.0  2
> 2 ESTABLISHED 2400ms 1024 192.168.1.144 80
> 3      LISTEN 4000ms   80       0.0.0.0  0
> 
> On the server ("sender") machine (192.168.1.144) --
> 1 ESTABLISHED 4000ms 1024       0.0.0.0  2
> 2      LISTEN 4000ms   80       0.0.0.0  0
> 
> Obviously, the server has discarded the connection, but the client machine
> thinks it's still connected.
> 
> I'm also not sure the client port number (the one that's 1024 or greater)
> is handled properly. Each time I connect from a Linux box the port number
> is incremented, but once I observed that a first, successful, connection
> from ELKS was from port 1024, and the next, hanging, attempt was *also*
> from port 1024. Connections from Linux usually, but not always, work;
> connections from ELKS rarely work.

ELKS reuses the last used port if it is not still in use. I don't think
that this is a problem. 
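
As a rough illustration of that policy (not the actual ELKS code), a port
picker that hands out the last used port again when it is free could look
like this. The struct conn_entry table, the function names and the
FIRST_EPHEMERAL_PORT value are all assumptions made for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define FIRST_EPHEMERAL_PORT 1024u  /* assumed, from the ports seen above */

struct conn_entry {                 /* hypothetical connection-table entry */
    int      in_use;
    uint16_t local_port;
};

/* Returns 1 if some active connection already owns this local port. */
static int port_in_use(const struct conn_entry *tab, size_t n, uint16_t port)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (tab[i].in_use && tab[i].local_port == port)
            return 1;
    return 0;
}

/* Returns last_port if it is free, else the next free port above it
 * (wrapping to FIRST_EPHEMERAL_PORT); returns 0 if every port is taken. */
uint16_t pick_local_port(const struct conn_entry *tab, size_t n,
                         uint16_t last_port)
{
    uint32_t tries;
    uint16_t port = last_port < FIRST_EPHEMERAL_PORT
                        ? FIRST_EPHEMERAL_PORT : last_port;

    for (tries = 0; tries <= 65535u - FIRST_EPHEMERAL_PORT; tries++) {
        if (!port_in_use(tab, n, port))
            return port;
        port = (port == 65535u) ? FIRST_EPHEMERAL_PORT
                                : (uint16_t)(port + 1);
    }
    return 0;
}
```

Under this sketch, a second connection gets port 1024 again only if the table
no longer marks the first one as in use.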

> Diagnosing this stuff is very time-consuming because "kill" doesn't seem 
> to do anything, so I must reboot both machines. Since ELKS "telnet" 

The kernel in the CVS will probably handle this more gracefully and
actually kill the process.

> doesn't do anything but connect (and logs me out when it terminates!) I 
> can't compare telnet from Linux and ELKS. I can only compare "urlget" from 
> both systems, and since there's no tcpdump for ELKS I can't even determine
> if a failure is actually due to urlget.

You mean that you do "telnet bla.bla 80" and after you connect you can't
do "get /"?

Harry

end of thread, other threads:[~2002-10-29 12:37 UTC | newest]

Thread overview: 16+ messages
     [not found] <1035036158.454.17.camel@cool>
2002-10-20  9:34 ` webserver stalls [was Re: bug in (linux) slattach] jb1
2002-10-20 17:06   ` Harry Kalogirou
2002-10-21  9:44     ` jb1
2002-10-21  9:55       ` Harry Kalogirou
2002-10-22 10:16         ` jb1
2002-10-22 13:56           ` Harry Kalogirou
2002-10-22 13:57           ` [SOLVED] " Harry Kalogirou
2002-10-22 16:02             ` Harry Kalogirou
2002-10-23  9:37               ` jb1
2002-10-23 11:42                 ` Harry Kalogirou
2002-10-24  8:55                   ` jb1
2002-10-29 10:25               ` jb1
2002-10-29 12:37                 ` Harry Kalogirou
2002-10-19 10:07 jb1
2002-10-19 17:55 ` Gabucino
     [not found] <1034621946.3293.92.camel@cool>
2002-10-17 10:04 ` jb1
