Home > Archive > Unix Programming > February 2008 > Re: Mysterious network IO behavior - Progress
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Re: Mysterious network IO behavior - Progress
|
|
| David Mathog 2008-02-07, 7:16 pm |
| David Mathog wrote:
> There's something bizarre going on here but I cannot see what it might
> be. All the data ends up where it is supposed to, it just takes varying
> amounts of time to get there.
1. There was a bug in the program. It didn't change the amount of data
written to the transmit buffer, but it did change where in the ring
buffer that data came from, and it did increase the number of calls to
write(). This will teach me not to use "dd if=/dev/zero" for the input
test source, always better to use a real data file so that the transfer
can be compared with the original.
Anyway, there is a 5Mb (typically) ring buffer, it looks something like
this:
[ (data written from here)data(data read into here) ]
and the () region wanders slowly around the buffer, with any reads that
get ahead of writes for even one cycle causing it to migrate to the
right. After a while the data containing region could wander to the top
and wrap, at which point it would look like this:
[...data(data read into here) (data written from here)data...]
The write routine would first try to send everything in the upper data
region. If write returned a value less than the size of that region it
should have stopped, but would instead drop down to the lower data
region and attempt to write the remainder from there. The amount of
data written returned was the sum of the two write() calls. This is
nonblocking IO, so the second write() mostly just came back with a 0.
However, every once and a while the second write would succeed, and
that would corrupt the output file by putting a small patch of data in
the wrong place, which is what eventually revealed this issue.
So to review, the bug caused a largish number of wasted write() attempts
and a smallish number of write() attempts that succeeded. The net
result was about 0.5% corruption in the output file - assuming the input
isn't all zero, in which case this error is invisible. The total amount
of data going through the write()'s was the same, with or without the
bug. However, even though the bug was not affecting the amount of data
written for some reason or other it really slowed things down. I do not
think it was the extra calls per se, we are only talking an extra 10000
or so total calls to write() over around 6 seconds. Perhaps it is a
memory cache issue, where the accesses on the upper end of the buffer
and the lower end caused cache thrashing. Perhaps the interrupts
associated with those events interfered with the NIC associated
interrupts? Fixing this bug let the code run at "full" speed as a relay
using newish hardware (Athlon64 single CPU workstations, 1Mb cache.) It
also tended to run more quickly on the older hardware (Athlon MP single
core compute nodes, 256 Kb cache), but these still had frequent "slow" runs.
2. TCP tuning of the slower nodes. This is based on the material here:
http://dsd.lbl.gov/TCP-tuning/linux.html
sysctl values were modified and a series of 5 consecutive runs
performed, the average speeds are shown.
A) No sysctl modification. Average relay throughput 6.9Mb/s.
(A few 11.x Mb/s and a lot of slower ones.)
B) net.core.rmem_default and wmem_default, increased from 107520
to 16777216. Average rose to 10.0Mb/s, a clear win, this
change was retained for all subsequent tests.
C) net.ipv4.tcp_rmem and wmem increased from the defaults to
"4096 87380 16777216" and "4096 65536 16777216", respectively.
9.1 Mb/sec. Various other settings were tried but the default
of "4096 87380 1048576" "4096 16384 1048576" was as good or better
than any of them, so the defaults were retained.
D) net.ipv4.tcp_congestion_control changed from cubic to reno.
8.9 Mb.sec. Left this setting at cubic.
E) net.ipv4.tcp_no_metrics_save set to 1. Average rose to 10.3Mb/sec.
There were fewer really slow runs and mostly it was 11.0 or 11.1.
Left it like this, went to work on other things, came back two
hours later and repeated the test (the exact same binaries) and
the average was even higher. Serendipitously or for whatever reason
the most common speed is now 11.4Mb/s, which happens about 80% of
the time, and the other 20% are slower. Usually only a little bit,
like 9 or 10Mb/s but sometimes still as low as 6.5Mb/sec. With
"top -d 1" running a slow one is obvious since the CPU usage
peaks at 10-12%, instead of the 20-25% for a "fast" run, and still,
there is nothing else showing up in top that would account for the
change.
So, in summary, most of the "slow" runs have been eliminated. They seem
to be completely gone on the newer machines, even without touching the
sysctl parameters. On the older machines sysctl adjustment helped a
lot, but "slow" runs still occur about every fifth run, for no obvious
reason.
Regards,
David Mathog
| |
| David Mathog 2008-02-08, 7:17 pm |
| David Mathog wrote:
One other point which i'll add for future reference. The tuning changes
described in the previous post also did not fix the NPtcp stalls
described before. It still does this when run in bidirectional (the
default) mode
50: 2048 bytes 118 times --> 37.08 Mbps in 421.34 usec
51: 2051 bytes 118 times --> 37.14 Mbps in 421.27 usec
52: 3069 bytes 118 times --> 2.27 Mbps in 10316.80 usec
53: 3072 bytes 6 times --> 46.27 Mbps in 506.50 usec
54: 3075 bytes 131 times --> 2.28 Mbps in 10282.60 usec
55: 4093 bytes 3 times --> 52.58 Mbps in 593.85 usec
56: 4096 bytes 84 times --> 52.90 Mbps in 590.74 usec
and as before, running it in a single direction mode has no such
performance holes (data not shown).
If "top -d 1" is running at the same time as a bidirectional test
the results are very similar to what I see with my application.
Normally NPtcp sits at 10-15% CPU. When it hits the stall points at
tests 52 and 54 it uses so LITTLE processor time that it disappears from
top, reappearing again at the normal CPU level after it gets past the
stalls.
Regards,
David Mathog
|
|
|
|
|