BSD sockets: recv with MSG_WAITALL should return EWOULDBLOCK?
| softwaredoug@gmail.com 2006-12-14, 7:06 pm |
| The correct behavior in the situation below is a little ambiguous to me
and seems open to some interpretation. I've googled and searched the
newsgroup to little avail.
Scenario:
Suppose I have a nonblocking socket, s.
I call recv with MSG_WAITALL, and socket s has fewer bytes available
than the recv call requests.
If s were blocking, I would expect the recv to block. Therefore, since
s is nonblocking, I would expect to see an EWOULDBLOCK return from recv
with such a scenario. When the number of bytes exceeded the bytes
specified, then the recv would copy into my buffer.
Is this a correct interpretation?
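The scenario can be probed directly. Below is a minimal sketch, assuming Python's socket module (a thin wrapper over the BSD calls) and a local socket pair standing in for a TCP connection; `probe_waitall` is a hypothetical helper name. What the probe reports when some, but not enough, data is queued depends on the implementation, which is exactly the ambiguity in question.

```python
import socket

def probe_waitall(sock, nbytes):
    """Try one recv(nbytes, MSG_WAITALL) on a nonblocking socket.

    Returns ("data", bytes) if the kernel handed something back, or
    ("would_block", b"") if it reported EWOULDBLOCK/EAGAIN.
    """
    sock.setblocking(False)
    try:
        return ("data", sock.recv(nbytes, socket.MSG_WAITALL))
    except BlockingIOError:
        return ("would_block", b"")

if __name__ == "__main__":
    a, b = socket.socketpair()   # connected local stream pair
    a.sendall(b"abc")            # only 3 of the 10 requested bytes
    print(probe_waitall(b, 10))  # result varies by platform
    a.close()
    b.close()
```

With nothing queued at all, the probe should report "would_block" everywhere; the interesting case is the partial one.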
| |
| Alex Fraser 2006-12-14, 7:06 pm |
| <softwaredoug@gmail.com> wrote in message
news:1166129772.840975.114500@73g2000cwn.googlegroups.com...
[snip]
> Suppose I have a nonblocking socket, s.
> Calling recv with MSG_WAITALL and socket s has fewer bytes than
> specified via the recv call.
>
> If s were blocking, I would expect the recv to block. Therefore, since
> s is nonblocking, I would expect to see an EWOULDBLOCK return from recv
> with such a scenario. When the number of bytes exceeded the bytes
> specified, then the recv would copy into my buffer.
I assume you mean "was not less than" rather than "exceeded". Although your
expectation seems logical, I think there is a problem if you consider the
bigger picture: in the typical case, non-blocking sockets are used in
conjunction with select() or poll(). The behaviour you describe would cause
an essentially unavoidable "busy loop" whenever some, but not enough, bytes
were available.
It seems to me that MSG_WAITALL does not make sense on non-blocking sockets.
(I am not convinced it is much use for blocking sockets either.)
Alex
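The busy-loop concern rests on one observable fact: select() reports a socket readable as soon as any data is queued, not when the amount you want is queued. A minimal sketch, assuming Python's select and socket modules and a local socket pair; `readable_but_short` is a hypothetical name.

```python
import select
import socket

def readable_but_short(sock, wanted):
    """Return True if select() reports sock readable although fewer than
    `wanted` bytes are queued (checked with a nonblocking MSG_PEEK)."""
    ready, _, _ = select.select([sock], [], [], 0)
    if not ready:
        return False
    queued = sock.recv(wanted, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    return len(queued) < wanted
```

If recv() answered EWOULDBLOCK in that state, the next select() would return immediately again, and the loop would spin until the remaining bytes arrived.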
| |
| Rainer Weikusat 2006-12-15, 4:11 am |
| softwaredoug@gmail.com writes:
> Scenario:
> Suppose I have a nonblocking socket, s.
> Calling recv with MSG_WAITALL and socket s has fewer bytes than
> specified via the recv call.
MSG_WAITALL is supposed to block until the requested amount of data
is available unless 'something of interest' (eg signal caught or
connection closed by remote) happens first. I/O requests for
descriptors with O_NONBLOCK set never block. => MSG_WAITALL should
have no effect then.
> If s were blocking, I would expect the recv to block. Therefore, since
> s is nonblocking, I would expect to see an EWOULDBLOCK return from recv
> with such a scenario.
EWOULDBLOCK is the BSD error. For UNIX(*), this should be EAGAIN.
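POSIX allows EAGAIN and EWOULDBLOCK to share one value or to differ, so portable code usually accepts both. A small sketch in Python (the helper name is made up for illustration):

```python
import errno

def is_would_block(err):
    """True for either spelling of the 'try again later' errno."""
    return err in (errno.EAGAIN, errno.EWOULDBLOCK)
```

In C the equivalent is a check against both constants after a failed recv().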
| |
| Rainer Weikusat 2006-12-15, 4:11 am |
| "Alex Fraser" <me@privacy.net> writes:
[...]
> It seems to me that MSG_WAITALL does not make sense on non-blocking
> sockets. (I am not convinced it is much use for blocking sockets
> either.)
It is useful for bulk downloads. Assuming your application processes
data fast enough, you can expect recv to return for each
'link-layer PDU' (eg ethernet frame) received unless the kernel
handles TCP PSH somehow (at least Linux doesn't). But if you receive
a steady stream of a lot of 'link layer PDUs', that is very wasteful and
blocking inside the kernel until that application buffer has been
filled completely makes more sense. For instance, if you are
downloading a file with a known size (within the size limits for the
procedure described), you could just create the file, ftruncate it to
the desired length, mmap it and then do a single recv-call (assumption:
no interruptions) with the expected file size. The kernel would then
copy the complete file basically directly into the $whatever-cache
that is used by your OS for this type of file access.
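The recipe can be sketched as follows. This is only an illustration under stated assumptions: Python's mmap and socket modules in place of the C calls, a local socket pair in place of a real TCP connection, and made-up names (`download_into_file`, `demo`). It also assumes a blocking socket and no interruptions, as above.

```python
import mmap
import os
import socket
import tempfile
import threading

def download_into_file(sock, path, size):
    """Receive `size` bytes from a blocking socket straight into a
    memory-mapped file with one recv call. Returns the byte count,
    which can be short if the peer closes early or a signal arrives."""
    with open(path, "w+b") as f:
        os.ftruncate(f.fileno(), size)          # give the file its final length
        with mmap.mmap(f.fileno(), size) as mapping:
            received = sock.recv_into(mapping, size, socket.MSG_WAITALL)
            mapping.flush()
    return received

def demo(size=64 * 1024):
    """Feed `size` bytes through a local socket pair and check what
    lands in the file."""
    payload = os.urandom(size)
    sender, receiver = socket.socketpair()

    def feed():
        for i in range(0, size, 4096):          # data arrives in small pieces
            sender.sendall(payload[i:i + 4096])
        sender.close()

    t = threading.Thread(target=feed)
    t.start()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "download.bin")
        received = download_into_file(receiver, path, size)
        with open(path, "rb") as f:
            intact = f.read() == payload
    t.join()
    receiver.close()
    return received, intact
```

A caller that cannot rule out interruptions would compare the returned count with `size` and issue another recv for the remainder.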
| |
| David Schwartz 2006-12-15, 8:07 am |
|
Rainer Weikusat wrote:
> It is useful for bulk downloads. Assuming your application processes
> data fast enough, you can expect recv to return for each
> 'link-layer PDU' (eg ethernet frame) received unless the kernel
> handles TCP PSH somehow (at least Linux doesn't). But if you receive
> steady stream of a lot of 'link layer PDUs', that is very wasteful and
> blocking inside the kernel until that application buffer has been
> filled completely makes more sense. For instance, if you are
> downloading a file with a known size (within the size limits for the
> procedure described), you could just create the file, ftruncate it to
> the desired length, mmap it and then do a single recv-call (assumption:
> no interruptions) with the expected file size. The kernel would then
> copy the complete file basically directly into the $whatever-cache
> that is used by your OS for these type of file access.
You can only return for each ethernet frame if the system is unloaded.
How much sense does it make to try to reduce CPU usage only in the case
where the system is unloaded?
DS
| |
| Rainer Weikusat 2006-12-15, 8:07 am |
| "David Schwartz" <davids@webmaster.com> writes:
> Rainer Weikusat wrote:
>
> You can only return for each ethernet frame if the system is
> unloaded. How much sense does it make to try to reduce CPU usage
> only in the case where the system is unloaded?
Is there a point supposed to be in this statement, and if so, which?
| |
| Maxim Yegorushkin 2006-12-15, 7:05 pm |
|
Rainer Weikusat wrote:
> "Alex Fraser" <me@privacy.net> writes:
>
> [...]
>
>
> It is useful for bulk downloads. Assuming your application processes
> data fast enough, you can expect recv to return for each
> 'link-layer PDU' (eg ethernet frame) received unless the kernel
> handles TCP PSH somehow (at least Linux doesn't). But if you receive
> steady stream of a lot of 'link layer PDUs', that is very wasteful and
> blocking inside the kernel until that application buffer has been
> filled completely makes more sense. For instance, if you are
> downloading a file with a known size (within the size limits for the
> procedure described), you could just create the file, ftruncate it to
> the desired length, mmap it and then do a single recv-call (assumption:
> no interruptions) with the expected file size. The kernel would then
> copy the complete file basically directly into the $whatever-cache
> that is used by your OS for these type of file access.
In the other post you mentioned that MSG_WAITALL should have no effect
on a nonblocking socket, which implies that the above recipe is only
applicable to blocking sockets, isn't it?
For a nonblocking socket to do bulk download one could just set
SO_RCVLOWAT to the size of the file being downloaded, so that select()
reports readability only when at least that many bytes are available
(or there is an error). Or, maybe even better, use aio_read() to make
the kernel read asynchronously into the file mapping and report when
it has completed the job (or encountered an error). Is that not a more
scalable approach?
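Setting the low-water mark is a single setsockopt call. A hedged sketch in Python (the helper name is hypothetical); it only sets the option and reads it back, because whether the option can be set, and whether select() honours it, varies between implementations.

```python
import socket

def set_rcvlowat(sock, nbytes):
    """Ask for a receive low-water mark; return the value the kernel
    reports afterwards, or None if the option cannot be set here."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT, nbytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT)
    except OSError:
        return None
```

A value far above the socket receive buffer size cannot be satisfied by queued data alone, so the file size is only usable here for small files.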
| |
| David Schwartz 2006-12-15, 7:05 pm |
|
Rainer Weikusat wrote:
> Is there a point supposed to be in this statement, and if so, which?
Yes, it's dumb to optimize for the unloaded case with code that may
make things worse in the case where you are under load.
Specifying MSG_WAITALL might improve efficiency in the case where
things otherwise would "run too fast" and you have to just wait again.
However, in return it makes the kernel do extra work when you are
actually under load.
Seems like a bad trade-off to me.
DS
| |
| David Schwartz 2006-12-16, 7:06 pm |
|
Maxim Yegorushkin wrote:
> For a nonblocking socket to do bulk download one could just set
> SO_RCVLOWAT to the size of the file being downloaded, so that select()
> reports readability only when at least that many bytes are available
> (or there is an error).
I would strongly advise against doing such things. Keeping data in
kernel memory longer than needed to improve performance only in the
case where there is minimal load just doesn't seem to make much sense.
(If there is high load, your problem will be keeping up, not getting
too little in each pass.)
This has two big disadvantages under high load:
1) More kernel memory will be consumed. This could be in short supply
when load is high. You need to get data out of the kernel as quickly as
possible.
2) Your process will become ready-to-run later. If there's a long wait
between becoming ready-to-run and running, you may actually let the TCP
buffer become full and slow down the transfer.
> Or, may be even better, use aio_read() to make
> the kernel read asynchronously into the file mapping and report when
> it's completed the job (or encountered an error). Is it not more
> scalable approach?
That depends upon how aio_read is implemented on the platform, but
likely that's the most efficient way.
DS
| |
| Rainer Weikusat 2006-12-17, 8:06 am |
| "David Schwartz" <davids@webmaster.com> writes:
> Rainer Weikusat wrote:
>
>
> Yes, it's dumb to optimize for the unloaded case with code that may
> make things worse in the case where you are under load.
What precisely is 'an unloaded case' and 'a loaded case'?
> Specifying MSG_WAITALL might improve efficiency in the case where
> things otherwise would "run too fast" and you have to just wait
> again.
It is not questionable that it does improve efficiency, because it means
that fewer (and eventually, a lot fewer) system calls need to happen
and the process that initiated the bulk download does not need to be
woken up and scheduled just so that it can go back to sleep again all
the time. ('The process' I am actually referring to is a wget-like tool
for an embedded Linux network appliance, and the savings in both CPU
time and (for a LAN environment) wallclock time are significant: more
than 50% on my main target system.)
Just for a quick comparison (10.2M download/ 100 MBit ethernet, two
networks separated by a nokia IP 530), average CPU time:
                     system   user
with MSG_WAITALL:    0.271s   0.0018s
without MSG_WAITALL: 0.383s   0.012s
NB: These numbers are from my laptop that has a lot more horsepower
than our devices.
> However, in return it makes the kernel do extra work when you are
> actually under load.
At worst, it makes the check if the process should be woken up on
arrival of data marginally more complicated (instead of just
unconditionally waking it up all the time, it should only be woken up
when the buffer supplied by it is full). But, assuming the example
above, it avoids 3317 calls to recv, ie process wakeup, followed by
return to userspace, followed by return to kernel space, followed by
process going to sleep again.
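The difference in the number of recv calls is easy to observe. A rough sketch, assuming Python's socket module, a local socket pair instead of a real network path, and hypothetical helper names; the count for the plain loop depends on timing, so no particular number should be expected.

```python
import socket
import threading

def recv_plain(sock, size):
    """Loop over plain recv(); returns (data, number_of_recv_calls)."""
    data, calls = bytearray(), 0
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        calls += 1
        if not chunk:              # peer closed early
            break
        data += chunk
    return bytes(data), calls

def recv_waitall(sock, size):
    """Same loop, but each recv() asks the kernel to fill the buffer."""
    data, calls = bytearray(), 0
    while len(data) < size:
        chunk = sock.recv(size - len(data), socket.MSG_WAITALL)
        calls += 1
        if not chunk:
            break
        data += chunk
    return bytes(data), calls

def run(receiver, payload, chunk=1024):
    """Feed `payload` in small pieces from a thread and receive it."""
    a, b = socket.socketpair()

    def feed():
        for i in range(0, len(payload), chunk):
            a.sendall(payload[i:i + chunk])
        a.close()

    t = threading.Thread(target=feed)
    t.start()
    result = receiver(b, len(payload))
    t.join()
    b.close()
    return result
```

Calling `run(recv_plain, payload)` and `run(recv_waitall, payload)` with the same payload and comparing the two call counts shows the effect.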
| |
| Maxim Yegorushkin 2006-12-17, 7:04 pm |
|
David Schwartz wrote:
> Maxim Yegorushkin wrote:
>
>
> I would strongly advise against doing such things. Keeping data in
> kernel memory longer than needed to improve performance only in the
> case where there is minimal load just doesn't seem to make much sense.
> (If there is high load, your problem will be keeping up, not getting
> too little in each pass.)
>
> This has two big disadvantages under high load:
>
> 1) More kernel memory will be consumed. This could be in short supply
> when load is high. You need to get data out of the kernel as quickly as
> possible.
>
> 2) Your process will become ready-to-run later. If there's a long wait
> between becoming ready-to-run and running, you may actually let the TCP
> buffer become full and slow down the transfer.
Those are very good points, I confess I had not given it enough
thought. Thank you.
| |
| Rainer Weikusat 2006-12-17, 7:04 pm |
| "Maxim Yegorushkin" <maxim.yegorushkin@gmail.com> writes:
> Rainer Weikusat wrote:
>
> In the other post you mentioned that MSG_WAITALL should have no effect
> on a nonblocking socket, which implies that the above recipe is only
> applicable to blocking sockets, isn't it?
Well ... a flag that specifically asks for blocking and a flag that
specifically requests never to block can hardly both have an effect at
the same time, can they?
> For a nonblocking socket to do bulk download one could just set
> SO_RCVLOWAT to the size of the file being downloaded, so that select()
> reports readability only when at least that many bytes are available
> (or there is an error).
As David has already pointed out, that would mean abusing the socket
receive buffer to keep a copy of the file, which would be a decidedly
bad idea, especially for large files. And it may not even be possible.
SUS states (setsockopt): "The default value for SO_RCVLOWAT is 1. [...]
Note that not all implementations allow this option to be set."
Specifically, Linux behaves this way.
> Or, may be even better, use aio_read() to make the kernel read
> asynchronously into the file mapping and report when it's completed
> the job (or encountered an error).
Depending on the aio-implementation in question, 'interesting things'
could happen then (for instance, a thread could be spawned for each
aio request, which does the actual I/O operations in some unspecified
way or the call could even just become a synchronous read for
sockets).
It is generally a good idea to attempt to solve simple problems with
simple means and to only get fancy if there is an actual need to do
so. You can always make the code more complicated if it doesn't yet
work as it should. Making it simpler instead because the complication
isn't actually needed rarely happens (judging from my experience).
| |
| Barry Margolin 2006-12-17, 7:04 pm |
| In article <87k60qr10g.fsf@farside.sncag.com>,
Rainer Weikusat <rainer.weikusat@sncag.com> wrote:
> Well ... a flag that specifically ask for blocking and a flag that
> specifically request to never block can hardly both have an effect at
> the same time, can they?
The non-blocking flag affects the return value and errno. If you ask
for non-blocking and specify MSG_WAITALL, it should report
errno=EWOULDBLOCK whenever it can't fill the entire buffer.
Basically, this shifts the burden of buffer management from the
application to the kernel.
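The proposed semantics can be approximated in user space with MSG_PEEK. This is only a sketch of the idea, not something any kernel is known to do for MSG_WAITALL: it assumes Python's socket module, uses a hypothetical helper name, and cannot work for requests larger than the socket receive buffer.

```python
import socket

def recv_all_or_nothing(sock, nbytes):
    """Return exactly nbytes, or None if fewer are queued right now.

    Peeks first, and only consumes the data when the whole request
    can be satisfied."""
    flags = socket.MSG_PEEK | socket.MSG_DONTWAIT
    try:
        queued = sock.recv(nbytes, flags)
    except BlockingIOError:
        return None                      # nothing queued at all
    if len(queued) < nbytes:
        return None                      # some data, but not enough
    return sock.recv(nbytes, socket.MSG_DONTWAIT)
```

Note that a closed connection also yields None here, so a real caller would need a separate end-of-file check.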
--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
| |
| Rainer Weikusat 2006-12-17, 7:04 pm |
| Barry Margolin <barmar@alum.mit.edu> writes:
> In article <87k60qr10g.fsf@farside.sncag.com>,
> Rainer Weikusat <rainer.weikusat@sncag.com> wrote:
>
>
> The non-blocking flag effects the return value and errno.
It affects operations done on the descriptor that would normally block
so that they don't block anymore but return an error code.
> If you ask for non-blocking and specify MSG_WAITALL, it should report
> errno=EWOULDBLOCK whenever it can't fill the entire buffer.
That may be what you believe it should do, contrary to the meaning
suggested by name and documentation, but it is not specified by
UNIX(*) and traditionally not implemented that way (considering nothing
waits in this case, it would be a hefty misnomer, anyway).
Quoting from UNP (2nd ed):
This flag was introduced with 4.3BSD Reno. It tells the
kernel not to return from a read operation until the
requested number of bytes have been read.
(p. 356)
and
Since TCP is a byte stream, we will be awakened when "some"
data arrives: it could be a single byte of data, or it could
be a full TCP segment of data. If we want to wait until
some fixed amount of data is available, we can [...]
specify the MSG_WAITALL flag.
[...]
With a nonblocking socket, if the input operation cannot be
satisfied (at least one byte of data for TCP [...]), return
is made immediately with an error of EWOULDBLOCK.
(p. 397)
| |
| David Schwartz 2006-12-17, 7:04 pm |
|
Rainer Weikusat wrote:
> "David Schwartz" <davids@webmaster.com> writes:
>
> What precisely is 'an unloaded case' and 'a loaded case'?
The "unloaded case" is the case where the system is not loaded and
system calls return essentially as quickly as possible. The "loaded
case" is the case where the system is under load and there's a
significant delay between when a process becomes ready-to-run and when
it runs.
> It is not questionable that it does improve efficiency, because it means
> that fewer (and eventually, a lot less) system calls need to happen
> and the process that initiated the bulk download does not need to be
> woken up and scheduled just so that it can go back to sleep again all
> the time ('the process' I am actually referring to is a wget-like tool
> for an embedded Linux network appliance and the savings in both CPU
> time and (for a LAN environment) wallclock time are significant (> 50%
> on my main target system).
It only means fewer system calls if the system calls are returning "too
fast". In that case, there's plenty of CPU to waste, so there's no
point in optimizing for that case. However, the extra delay in becoming
ready-to-run in the loaded case may really hurt you.
> Just for a quick comparison (10.2M download/ 100 MBit ethernet, two
> networks separated by a nokia IP 530):
>
> avg CPU time/ system: 0.271
> avg CPU time/ user: 0.0018
>
> without waitall:
>
> 0.383s and 0.012
>
> NB: These numbers are from my laptop that has a lot more horsepower
> than our devices.
As I said, it helps in the case where you have way too much CPU. But
nobody should care about optimizing that case.
> At worst, it makes the check if the process should be woken up on
> arrival of data marginally more complicated (instead of just
> unconditionally waking it up all the time, it should only be woken up
> when the buffer supplied by it is full). But, assuming the example
> above, it avoids 3317 calls to recv, ie process wakeup, followed by
> return to userspace, followed by return to kernel space, followed by
> process going to sleep again.
Right, now test it on a loaded system.
DS
| |
| Alex Fraser 2006-12-17, 7:04 pm |
| "Barry Margolin" <barmar@alum.mit.edu> wrote in message
news:barmar-6E1FF4.12292117122006@comcast.dca.giganews.com...
[snip]
> The non-blocking flag effects the return value and errno. If you ask
> for non-blocking and specify MSG_WAITALL, it should report
> errno=EWOULDBLOCK whenever it can't fill the entire buffer.
As far as I can see, this behaviour is mostly useless, as I hinted in my
earlier post in this thread. Suppose you are blocked in select()/poll() and
a non-blocking socket becomes readable. The call returns and you call recv()
with MSG_WAITALL. If there is insufficient data and (in accordance with the
above) you get EWOULDBLOCK/EAGAIN, when you loop back to select()/poll() the
call will not block as the socket is still readable. You have a busy loop.
Alex
| |
| Rainer Weikusat 2006-12-18, 8:06 am |
| "David Schwartz" <davids@webmaster.com> writes:
>
> The "unloaded case" is the case where the system is not loaded and
> system calls return essentially as quickly as possible. The "loaded
> case" is the case where the system is under load and there's a
> significant delay between when a process becomes ready-to-run and when
> it runs.
That is not precise, only verbose.
>
>
> It only means fewer system calls if the system calls are returning "too
> fast".
Under extremely improbable[*] and essentially random circumstances, a
similar effect could be achieved if the socket receive buffer is
significantly larger than the application buffer and always fills up
at least to the amount of data that would fit into the application
buffer before the application can get enough running time to return
into the kernel. But it would be stupid to rely on that, because that
would, by means of TCP flow control, lead to a reduction in the
sending rate of the other TCP (window is shrinking) and for the
situation described, the ideal application buffer is the entire file
and that is typically much larger than the socket receive buffer.
[*] If you do intermediary packet processing, you cannot
'load' the system to the point where, for instance, routing
latency becomes larger than negligible, because otherwise,
people will start to complain about the 'slow network'.
> In that case, there's plenty of CPU to waste, so there's no
> point in optimizing for that case.
See comment above about processing latency. Further, a simple strategy
to avoid system overload is to avoid 'wasting plenty of CPU time'
before the system is overloaded. One would expect this to be so
trivial as to be self-evident[*]. And one of the 'points' here is a saving in
wallclock time of up to 15s for a frequent, interactive operation
(that helps nobody but me, but my time is a lot more expensive than
that of my computer).
[*] A nice real world analogy: As long as I do not have less money
than I need, there is no point in saving it instead of
just throwing it out of the window.
> However, the extra delay in becoming ready-to-run in the loaded case
> may really hurt you.
MSG_WAITALL is a hint to the kernel that I do not want to become
ready to run before a certain amount of data has been copied into the
buffer I supplied to the recv call, because I don't have anything to
do otherwise.
In this context, your statement is at best bizarre.
[...]
>
> Right, now test it on a loaded system.
Test what? I have now checked this with three implementations (Linux
2.4, Linux 2.6 and FreeBSD) and they all work roughly the same:
1. Copy data into the application buffer until the receive
buffer has been drained or the request fulfilled.
2. Return to caller/userspace if the request has been fulfilled
or MSG_WAITALL was not given.
3. Block until more data is available, then goto 1.
That is essentially the only sane way to implement this functionality
and I see no place where this somewhat mysterious effect you appear to
suspect in here could hide.
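The three steps can be modelled in user space to make the control flow concrete. This is a sketch only, assuming a blocking socket and Python's socket module; `recv_like_kernel` is a made-up name, and the real kernel code obviously does not call recv on itself.

```python
import socket

def recv_like_kernel(sock, size, waitall):
    """User-space model of the three steps, on a blocking socket."""
    buf = bytearray()
    while True:
        # 1. Copy whatever is queued, up to the remaining request.
        #    (A blocking recv also covers step 3: it sleeps until
        #    at least one byte is available or the peer closes.)
        chunk = sock.recv(size - len(buf))
        buf += chunk
        # 2. Return if the request is fulfilled, the waitall behaviour
        #    was not asked for, or the peer closed the connection.
        if len(buf) >= size or not waitall or not chunk:
            return bytes(buf)
        # 3. Otherwise go around again and wait for more data.
```

The only difference between the two modes is whether the loop is left after the first copy.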
| |
| Rainer Weikusat 2006-12-18, 7:07 pm |
| Rainer Weikusat <rainer.weikusat@sncag.com> writes:
[...]
>
> MSG_WAITALL is a hint to the kernel that I do not want to become
> ready to run before a certain amount of data has been copied into the
> buffer I supplied to the recv call, because I don't have anything to
> do otherwise.
>
> In this context, your statement is at best bizarre.
As I have now described this incorrectly for a second time, a
correction may be necessary: Technically, the process blocked in the
kernel does become ready to run, but the next time it is scheduled, it
continues to copy data from the socket receive buffer to its buffer
without returning to userspace until the requested amount of data has
been copied.
| |
| David Schwartz 2006-12-18, 7:07 pm |
|
Rainer Weikusat wrote:
> [*] A nice real world analogy: As long as I do not have less money
> than I need, there is no point in saving it instead of
> just throwing it out of the window.
You can't save CPU time for a rainy day.
> MSG_WAITALL is a hint to the kernel that I do not want to become
> ready to run before a certain amount of data has been copied into the
> buffer I supplied to the recv call, because I don't have anything to
> do otherwise.
Consider these two cases:
1) You call 'select'. When 'select' returns a hit, you are scheduled.
Once you are scheduled, you call 'read' which does not block. Most
likely, when the last byte is received, you only have to receive one
small chunk because you already received the rest.
2) You call 'read' with a MSG_WAITALL. You are not even scheduled until
all the data you want is received. When you are scheduled, you have to
receive all the data.
Look at all the advantages of the first case:
1) If the system is busy and all the data is received before you run,
in case 1, you became ready-to-run earlier, so you are likely to run
earlier.
2) If the system is not busy, in case 1 you will only have to receive
the last chunk of data, rather than all of it.
3) In case 1, under best conditions, there is minimal latency between
when you receive the last bit of data and when you get the first bit
out of the kernel's buffer, in fact you can even do the latter first.
In the second case, all the data sits in the kernel's buffer before you
are even ready-to-run, much less reading any of it.
4) In case 1, you empty TCP data out of kernel memory as fast as
possible. In case 2, there's a greater risk that you will engage TCP
flow control because you keep the whole chunk in kernel memory until
you receive the last byte.
Against all these advantages of the first case, the second case has
exactly one advantage -- if there is plenty of CPU time to spare, the
second case might save a little.
> In this context, your statement is at best bizarre.
Using MSG_WAITALL is bizarre.
DS
| |
| Rainer Weikusat 2006-12-19, 4:09 am |
| "David Schwartz" <davids@webmaster.com> writes:
> Rainer Weikusat wrote:
>
>
> You can't save CPU time for a rainy day.
The CPU can only accomplish a limited amount of work within a given
timespan and the OS only offers me a limited amount of time slots
applications can be scheduled on within a certain timespan (and
scheduler implementations exist where the work the scheduler needs to
do is proportional to the number of runnable tasks).
The 'ideal' state for a system is therefore one where all processes
that exist at a given time block and if they awake, do what they need
to do with as little effort as possible before they block again and
free 'the system' so that it is again available to something else.
>
>
> Consider these two cases:
>
> 1) You call 'select'. When 'select' returns a hit, you are scheduled.
> Once you are scheduled, you call 'read' which does not block. Most
> likely, when the last byte is received, you only have to receive one
> small chunk because you already received the rest.
Doesn't apply for the situation I was talking about. Calling poll
(don't use select for small descriptor sets) instead of a blocking
read would double the amount of syscalls necessary to save the
incoming data to a file.
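To make the syscall count concrete, here is a minimal poll-driven reader, sketched in Python under the assumption that select.poll is available on the platform; the function name and the counters are illustrative only. Every chunk costs one poll() plus one recv().

```python
import select
import socket

def poll_then_recv(sock, size):
    """Read up to `size` bytes via poll() + nonblocking recv().

    Returns (data, poll_calls, recv_calls)."""
    sock.setblocking(False)
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    data, polls, recvs = bytearray(), 0, 0
    while len(data) < size:
        poller.poll()                  # syscall 1: wait for readability
        polls += 1
        recvs += 1
        try:
            chunk = sock.recv(size - len(data))   # syscall 2: copy data
        except BlockingIOError:
            continue                   # spurious wakeup; poll again
        if not chunk:
            break                      # peer closed
        data += chunk
    return bytes(data), polls, recvs
```

A blocking recv loop performs only the second of the two calls per chunk.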
> 2) You call 'read' with a MSG_WAITALL. You are not even scheduled until
> all the data you want is received. When you are scheduled, you have to
> receive all the data.
It took me a while to understand that you believe MSG_WAITALL is
essentially identical to SO_RCVLOWAT, probably because you are just
theorizing and don't really know how this is actually implemented.
That would be a possible, though somewhat twisted, interpretation
of the documentation of both, but it is wrong: the process is
scheduled to run and copies data to its buffer, but it does not return
into userspace before the requested amount of data has been copied.
> Look at all the advantages of the first case:
>
> 1) If the system is busy and all the data is received before you run,
> in case 1, you became ready-to-run earlier, so you are likely to run
> earlier.
This can never happen for the situation I was talking about, because
the system cannot receive 'all the data' due to socket buffer size
limits. Further, after having become runnable (meaning exiting from
the kernel after poll), the application needs to do a little amount of
processing and then enter the kernel again to start copying data. If
it is already blocked in a read, it can immediately start to copy data,
thereby getting it out of the socket receive buffer faster, which
fills asynchronously (interrupt driven).
> 2) If the system is not busy, in case 1 you will only have to receive
> the last chunk of data, rather than all of it.
That would be a disadvantage.
> 3) In case 1, under best conditions, there is minimal latency between
> when you receive the last bit of data and when you get the first bit
> out of the kernel's buffer, in fact you can even do the latter first.
> In the second case, all the data sits in the kernel's buffer before you
> are even ready-to-run, much less reading any of it.
As I have written above (and yesterday and as easily verified by
checking the source of an actual implementation): It doesn't work this
way.
> 4) In case 1, you empty TCP data out of kernel memory as fast as
> possible. In case 2, there's a greater risk that you will engage TCP
> flow control because you keep the whole chunk in kernel memory until
> you receive the last byte.
See above.
> Against all these advantages of the first case, the second case has
> exactly one advantage -- if there is plenty of CPU time to spare, the
> second case might save a little.
Even if this was so (and it isn't), that would free the CPU to do
something different (like processing the next video frame coming from
some device) earlier, which is in itself advantageous at least for
a) desktops
b) systems that act as network service providers (multiple
services)
and the second is, coincidentally, again exactly what I was talking
about.
[...]
> Using MSG_WAITALL is bizarre.
Using SO_RCVLOWAT is 'bizarre' for the reasons you have given, and
that is likely the reason that Linux doesn't even implement it.
| |
| David Schwartz 2006-12-19, 10:05 pm |
|
Rainer Weikusat wrote:
> It took me a while to understand that you believe that MSG_WAITALL is
> essentially identical to SO_RCVLOWAT because you are probably just
> theorizing, don't really know how this is actually implemented and
> that this would be a possible, though somewhat twisted, interpretation
> of the documentation of both, but this is wrong: The process is
> scheduled to run and copies data to its buffer, but it does not return
> into userspace before the requested amount of data has been copied.
I can't see why you would think this is any better. It seems to me it
would be worse because it means extra transitions to user space and
probably a reduction in the process' dynamic priority because it is
running rather than blocked.
DS
| |
| Rainer Weikusat 2006-12-20, 8:08 am |
| "David Schwartz" <davids@webmaster.com> writes:
> Rainer Weikusat wrote:
>
> I can't see why you would think this is any better.
> It seems to me it would be worse because it means extra transitions
> to user space
NB: All of the following applies to Linux in this form, because that's
the kernel I happen to do programming for (among other things) but is
similar for UNIX(*). Additionally, it omits various things for
simplification.
This doesn't work this way, either. A process doing a read is
executing some kernel code. Assuming no data is initially available,
this kernel code first puts the process on a waitqueue and invokes the
scheduler to select another process to run because the current process
cannot run at the moment. At some later time, code that runs as part
of the TCP receive procedure does a wake_up call for this
waitqueue after it has added some data to the socket receive
buffer. This causes the blocked process to become runnable again and
the scheduler will again consider it for being scheduled. After this
has happened, the process continues execution of the read-code in the
kernel after the point where it had been put to sleep. This code now
copies data from the socket receive buffer to the buffer the process
used in its read (actually, recv) call. Let's assume the amount of data
available was less than the size of this buffer. Ordinarily, the
process would now leave the kernel, meaning the read-call would return,
and it would continue to execute application code. If MSG_WAITALL
was specified as flag to recv, the process would again be put onto the
waitqueue until more data became available instead of leaving the
kernel. But when the process cannot proceed without a certain amount of
data, it has to call recv again to again enter the kernel, get blocked
there and wait for more input to arrive. And this return to userspace
followed by an 'immediate' return to the exact same kernel code the
process was already executing can be avoided that way.
> and probably a reduction in the process' dynamic priority because it is
> running rather than blocked.
This is exactly the same as without MSG_WAITALL. The process is
running if it has work to do (copy data out of the socket receive
buffer) and otherwise, it sleeps. It just doesn't return to executing
application code but sleeps again if not enough data was available to
fulfill the complete request.
Better (or at least more comprehensive) explanations of this can be
found among 'the usual suspects' (eg "Design and Implementation
of the $prefix_of_today BSD Operating System") or, for instance, here:
http://www.xml.com/ldd/chapter/book/ch05.html#t2
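The difference described above can be sketched in C. This is only an illustration under stated assumptions: the socket is blocking and of stream type, and recv_exact_loop and recv_exact_waitall are hypothetical helper names that do not appear in the thread. The first helper returns to userspace after every partial read and calls recv again; the second asks the kernel to keep the process on the wait queue until the buffer is full.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: read exactly len bytes from a blocking socket by
 * calling recv() repeatedly. Every short read costs one return to
 * userspace plus one more system call; *calls (if non-NULL) counts them. */
ssize_t recv_exact_loop(int fd, void *buf, size_t len, unsigned *calls)
{
    size_t got = 0;

    assert(buf != NULL);
    while (got < len) {
        ssize_t n = recv(fd, (char *)buf + got, len - got, 0);
        if (calls)
            ++*calls;
        if (n > 0)
            got += (size_t)n;
        else if (n == 0)
            break;                  /* peer closed the connection */
        else if (errno != EINTR)
            return -1;              /* real error */
    }
    return (ssize_t)got;
}

/* Hypothetical helper: the same request with MSG_WAITALL. The kernel
 * blocks again after a partial copy instead of returning. The result can
 * still be short on EOF, on an error, or if a signal arrives after some
 * data was copied, so callers must check the return value. */
ssize_t recv_exact_waitall(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = recv(fd, buf, len, MSG_WAITALL);
    } while (n < 0 && errno == EINTR);  /* interrupted before any data */
    return n;
}
```

Both helpers deliver the same bytes; what differs is how often the process crosses the kernel/userspace boundary while waiting for them.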
| |
| Alex Fraser 2006-12-20, 7:05 pm |
| "Rainer Weikusat" <rainer.weikusat@sncag.com> wrote in message
news:87slfhser2.fsf@farside.sncag.com...
> "Alex Fraser" <me@privacy.net> writes:
> [...]
>
> It is useful for bulk downloads.
....but not much else.
For a start, either you must be expecting the sender to send at least as
many bytes as the size argument to recv() or to close the connection after
sending. So MSG_WAITALL has strictly limited applicability.
Even if the above is satisfied, MSG_WAITALL can harm throughput if the size
passed to recv() is "too large" and you are performing some kind of stream
processing (for instance, decompressing it). I guess this is just an extra
restriction on applicability.
> Assuming your application processes data fast enough, you can expect recv
> to return for each 'link-layer PDU' (eg ethernet frame) received unless
> the kernel handles TCP PSH somehow (at least Linux doesn't).
The above is only true if the delay between when the thread blocked in
recv() becomes ready to run (the relevant receive buffer becomes non-empty)
and actually running (a CPU becomes available) is always small. On a loaded
system, ie one where the average delay is larger, additional data could
sometimes be received and queued to the socket buffer during the delay. I
would expect the average bytes per recv() to increase in this case. If so,
that means MSG_WAITALL provides less benefit on a loaded system.
Alex
| |
| Maxim Yegorushkin 2006-12-21, 4:06 am |
|
Alex Fraser wrote:
> "Rainer Weikusat" <rainer.weikusat@sncag.com> wrote in message
> news:87slfhser2.fsf@farside.sncag.com...
>
> ...but not much else.
>
> For a start, either you must be expecting the sender to send at least as
> many bytes as the size argument to recv() or to close the connection after
> sending. So MSG_WAITALL has strictly limited applicability.
I agree wholeheartedly.
> Even if the above is satisfied, MSG_WAITALL can harm throughput if the size
> passed to recv() is "too large" and you are performing some kind of stream
> processing (for instance, decompressing it). I guess this is just an extra
> restriction on applicability.
>
>
> The above is only true if the delay between when the thread blocked in
> recv() becomes ready to run (the relevant receive buffer becomes non-empty)
> and actually running (a CPU becomes available) is always small. On a loaded
> system, ie one where the average delay is larger, additional data could
> sometimes be received and queued to the socket buffer during the delay. I
> would expect the average bytes per recv() to increase in this case. If so,
> that means MSG_WAITALL provides less benefit on a loaded system.
The other thing is that using MSG_WAITALL makes you not scalable: the
thread is blocked in read/recv while it could handle several
nonblocking sockets using select/poll/epoll/kqueue/whatever. In other
words, it may only be applicable to some class of client
applications, not to servers handling multiple connections/clients.
To scale it, one would have to create a thread per connection, which is
not quite elegant (at least in the Unix world). Saving a few system calls
does not make much sense, unless your profiler shows that this is the
bottleneck.
p.s. I don't understand why the thread digressed to MSG_WAITALL and a
blocking socket, since the original question was about non-blocking
socket and MSG_WAITALL.
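For the non-blocking case from the original question, the usual alternative to MSG_WAITALL is to keep the "how much do I still need" state in the application and top it up whenever select/poll reports the socket readable. A minimal sketch, assuming a stream socket; struct rx_state, rx_pump and set_nonblocking are hypothetical names introduced only for this illustration:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical per-connection state: a buffer to be filled completely. */
struct rx_state {
    char  *buf;     /* destination buffer             */
    size_t want;    /* bytes needed before processing */
    size_t have;    /* bytes received so far          */
};

/* Put a descriptor into non-blocking mode; returns 0 on success. */
int set_nonblocking(int fd)
{
    int fl = fcntl(fd, F_GETFL, 0);

    if (fl < 0)
        return -1;
    return fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Call when the socket is reported readable. Returns 1 once the buffer
 * is complete, 0 if more data is needed (go back to poll/select), and
 * -1 on error or if the peer closed before sending everything. */
int rx_pump(int fd, struct rx_state *st)
{
    assert(st != NULL && st->have <= st->want);
    while (st->have < st->want) {
        ssize_t n = recv(fd, st->buf + st->have, st->want - st->have, 0);
        if (n > 0) {
            st->have += (size_t)n;
            continue;
        }
        if (n == 0)
            return -1;              /* early EOF */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;               /* nothing more right now */
        return -1;
    }
    return 1;
}
```

One thread can hold one rx_state per connection and call rx_pump for whichever descriptors the multiplexing call reports, so no thread ever sleeps inside recv on behalf of a single peer.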
| |
| David Schwartz 2006-12-21, 4:06 am |
|
Maxim Yegorushkin wrote:
> p.s. I don't understand why the thread digressed to MSG_WAITALL and a
> blocking socket, since the original question was about non-blocking
> socket and MSG_WAITALL.
MSG_WAITALL is never useful with a non-blocking socket. So the
discussion turned to under what circumstances MSG_WAITALL might be
useful.
DS
| |
| Rainer Weikusat 2006-12-21, 4:06 am |
| "Alex Fraser" <me@privacy.net> writes:
> "Rainer Weikusat" <rainer.weikusat@sncag.com> wrote:
>
> ...but not much else.
>
> For a start, either you must be expecting the sender to send at least as
> many bytes as the size argument to recv() or to close the connection after
> sending. So MSG_WAITALL has strictly limited applicability.
This sounds a lot like 'it is only useful if you specified the flag
because it would be useful and it wasn't, for instance, generated by a
call to random(3)'. Which seems pretty obvious to me.
> Even if the above is satisfied, MSG_WAITALL can harm throughput if the size
> passed to recv() is "too large" and you are performing some kind of stream
> processing (for instance, decompressing it).
And this sounds like 'if you are using a buffer size that doesn't
match some unspecified restrictions some postprocessing code may have
and if MSG_WAITALL causes this postprocessing code to really have to
deal with full buffers, instead of partially filled buffers, and if it
could work better on partially filled buffers, which, per chance,
match its unspecified limits, that could have a negative effect'.
A shorter version of this subjunctive tapeworm would be 'Code that
processes data in blocks of a certain size does not work (or does not work
well) with blocks of a different size'. Which is again pretty obvious,
and not the least bit related to where the input came from and how it
arrived.
>
> The above is only true if the delay between when the thread blocked in
> recv() becomes ready to run (the relevant receive buffer becomes non-empty)
> and actually running (a CPU becomes available) is always small.
Technically, this is not correct. It doesn't matter if the delay is
small or large, the process must just be able to process incoming
frames in real time.
| |
| Rainer Weikusat 2006-12-21, 8:05 am |
| "Maxim Yegorushkin" <maxim.yegorushkin@gmail.com> writes:
[...]
>
> The other thing is that using MSG_WAITALL makes you not scalable: the
> thread is blocked in read/recv while it could handle several
> nonblocking sockets using select/poll/epoll/kqueue/whatever.
Duh. Blocking I/O in itself is only useful if there are no other
activities that could be performed during the time the process (or
thread) is blocked.
> To scale it, one would have to create a thread per connection, which is
> not quite elegant (at least in unix world).
Amusingly, there are still people advocating this model, who claim
that the fact that it doesn't work well is due to lacking
implementations. But this is just another example of 'people with a
reality distortion' ...
> Saving a few system calls does not make much sense, unless your
> profiler shows that this is the bottleneck.
The 'general academic assumption' is that doing X (with X being an
arbitrary piece of work) should be avoided because it is not worth the
effort. This happens to be an a priori assumption and one that usually
isn't changed just because measurable effects contradict it (like
here). Recommendation is to ignore it insofar as real-world problems
(that manifest themselves outside of publications) matter.
:->
| |
| David Schwartz 2006-12-21, 8:05 am |
|
Rainer Weikusat wrote:
> Technically, this is not correct. It doesn't matter if the delay is
> small or large, the process must just be able to process incoming
> frames in real time.
The only thing MSG_WAITALL can do for you is cause your process to
become ready-to-run later than it would have otherwise. This is a loss
in every case except the cases where you can do absolutely nothing if
you had become ready-to-run earlier. That's an extraordinarily rare
case.
DS
| |
| Rainer Weikusat 2006-12-21, 8:05 am |
| "David Schwartz" <davids@webmaster.com> writes:
> Rainer Weikusat wrote:
>
> The only thing MSG_WAITALL can do for you is cause your process to
> become ready-to-run later than it would have otherwise.
The (blocked) process becomes 'ready to run' as soon as it is woken
up. And MSG_WAITALL still does not interfere with this.
| |
| David Schwartz 2006-12-22, 7:05 pm |
|
Rainer Weikusat wrote:
> "David Schwartz" <davids@webmaster.com> writes:
> The (blocked) process becomes 'ready to run' as soon as it is woken
> up. And MSG_WAITALL still does not interfere with this.
Then it does nothing.
DS
| |
| Alex Fraser 2006-12-23, 8:04 am |
| "Rainer Weikusat" <rainer.weikusat@sncag.com> wrote in message
news:87ac1ha7uz.fsf@farside.sncag.com...
> "Alex Fraser" <me@privacy.net> writes:
>
> This sounds a lot like 'it is only useful if you specified the flag
> because it would be useful and it wasn't, for instance, generated by a
> call to random(3)'. Which seems pretty obvious to me.
I said I wasn't convinced MSG_WAITALL was much use. Part of that is the fact
that you can't possibly use it in many cases. Yes, that is obvious; I was
just being explicit.
>
> And this sounds like 'if you are using a buffer size that doesn't
> match some unspecified restrictions some postprocessing code may have
> [...]
I think you completely missed the point here. Suppose you have a simple loop
like this:
    while ((n = recv(s, buf, len, MSG_WAITALL)) > 0)
        consume(buf, n);
Where consume() implements an algorithm which runs in O(n) time. In
practice, of course, there will be some per-call overhead. Assume consume()
can process data faster than it can possibly need to (eg it runs at 20MB/s
but the input is coming over a 100Mbit/s link).
Clearly, the greater the value of len, the better the efficiency (ie
clocks/byte) will be because there are fewer kernel/userspace transitions
due to fewer recv() calls, and of course, similar applies to the per-call
overhead from consume().
But beyond some value of len, you will eventually reach a (system-dependent)
point where throughput starts to reduce, because the kernel buffers will
fill up while consume() is running. That is, len can be "too large". But how
large is too large?
>
> Technically, this is not correct. It doesn't matter if the delay is
> small or large, the process must just be able to process incoming
> frames in real time.
You snipped the important bit, perhaps because once again you missed the
point. This is roughly what I think could (depending on various factors)
happen if the delay is large:
1. The application calls recv() supplying a large buffer but without
specifying MSG_WAITALL. No data is available so the thread sleeps.
2. A frame arrives causing data to be queued to the socket buffer. The
thread becomes ready-to-run.
3. Another frame arrives and more data is queued to the socket buffer. (And
perhaps another, etc.)
4. The thread runs; all available data is copied to the user buffer, then
recv() returns.
This is getting closer to what would happen if you used MSG_WAITALL, ie the
relative gain in efficiency is reduced on a loaded system.
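The four-step scenario can be reproduced in miniature. The sketch below is an illustration only: it assumes an already connected pair of stream sockets (for instance from socketpair(), standing in for a TCP connection), and queue_then_recv_once is a hypothetical name. Several chunks are queued before the reader gets to run, then a single recv() without MSG_WAITALL is issued. How many of the queued bytes that one call returns is up to the implementation, so the code asserts nothing about it and leaves the interpretation to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: send `frames` copies of `chunk` on tx so that the
 * data is already queued, then perform exactly one recv() on rx without
 * MSG_WAITALL. Returns what that single call delivered, or -1 on error. */
ssize_t queue_then_recv_once(int tx, int rx, int frames,
                             const void *chunk, size_t chunk_len,
                             void *buf, size_t buf_len)
{
    assert(frames > 0 && chunk != NULL && buf != NULL);

    /* Steps 2 and 3: data arrives while the reader is not running. */
    for (int i = 0; i < frames; i++)
        if (send(tx, chunk, chunk_len, 0) != (ssize_t)chunk_len)
            return -1;

    /* Step 4: the reader finally runs and copies what is available. */
    return recv(rx, buf, buf_len, 0);
}
```

If the return value is larger than chunk_len, the single call picked up more than one queued chunk, which is the effect the scenario describes for a loaded system.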
I haven't made any new points in this post, just - hopefully - clarified the
points I was trying to make before. Unless you actually address them I think
we'll have to call it a day and agree to disagree :).
Alex
| |
| Rainer Weikusat 2006-12-25, 7:06 pm |
| "David Schwartz" <davids@webmaster.com> writes:
> Rainer Weikusat wrote:
>
>
>
>
> Then it does nothing.
I have described the procedure in some detail a couple of times and
additionally, at least three different major implementations (Linux,
BSD, Solaris) have sources available on the internet which could be
perused to determine (or verify) this information. I am going to give
a very short summary for another time:
If a process is blocked in the kernel and waits for the arrival of
data from a TCP connection and data arrives, the process is woken up
(becomes ready to run) and then executes kernel code that copies
data from the socket receive buffer to the buffer the process supplied
as argument to the recv system call. If there is no more data to copy
before the buffer has been completely filled, there are two possible
continuation scenarios:
a) the process returns from the kernel and continues to
execute application code
b) the process blocks again and waits for more data
The MSG_WAITALL flag can be used to request b) instead of a) if it is
known that a certain amount of data must be received before application
processing can continue. Like when downloading a file from an
HTTP-server if the server reply header set contained a
Content-Length:-header. a) would, for instance, be sensible for an FTP
command connection, where the application must first determine what to
do next by examining the data already received so far.
| |
| Rainer Weikusat 2006-12-25, 7:06 pm |
| "Alex Fraser" <me@privacy.net> writes:
> "Rainer Weikusat" <rainer.weikusat@sncag.com> wrote:
[...]
>
> I think you completely missed the point here. Suppose you have a simple loop
> like this:
>
>     while ((n = recv(s, buf, len, MSG_WAITALL)) > 0)
>         consume(buf, n);
>
> Where consume() implements an algorithm which runs in O(n) time. In
> practice, of course, there will be some per-call overhead. Assume consume()
> can process data faster than it can possibly need to (eg it runs at 20MB/s
> but the input is coming over a 100Mbit/s link).
>
> Clearly, the greater the value of len, the better the efficiency (ie
> clocks/byte) will be because there are fewer kernel/userspace transitions
> due to fewer recv() calls, and of course, similar applies to the per-call
> overhead from consume().
>
> But beyond some value of len, you will eventually reach a (system-dependent)
> point where throughput starts to reduce, because the kernel buffers will
> fill up while consume() is running. That is, len can be "too large". But how
> large is too large?
To me, this sounds exactly like the problem I was talking about,
namely, that the application needs to use a sensible buffer size. For
this example, 'a sensible buffer size' cannot be determined a priori,
because it depends (among other things) on accidental properties of
the execution environment like 'how much CPU time is available to the
process for consuming' and 'how does the input data rate
vary'. Basically, any buffer size can be 'too large' in this respect
if there is a sudden spike in incoming data or "processing system
load" caused by events external to the application. To solve this, one
would need to decouple processing from reception by buffering some
amount of data inside the application and by ensuring that the
receiving code can preempt the processing code if need be.
The question if it would be beneficial to try to receive data in blocks
of some minimum size larger than the payload of a link-layer PDU is (to
my understanding) completely orthogonal to that.
>
> You snipped the important bit, perhaps because once again you missed the
> point.
I purposely ignored it because this was again something that I'd consider
to be fairly obvious.
> This is roughly what I think could (depending on various factors)
> happen if the delay is large:
>
> 1. The application calls recv() supplying a large buffer but without
> specifying MSG_WAITALL. No data is available so the thread sleeps.
> 2. A frame arrives causing data to be queued to the socket buffer. The
> thread becomes ready-to-run.
> 3. Another frame arrives and more data is queued to the socket buffer. (And
> perhaps another, etc.)
> 4. The thread runs; all available data is copied to the user buffer, then
> recv() returns.
>
> This is getting closer to what would happen if you used MSG_WAITALL, ie the
> relative gain in efficiency is reduced on a loaded system.
At worst, there could be 'no gain'. But only if the 'large buffer that
should be filled completely' is not larger than the socket receive
buffer. For 'file downloads from an HTTP-Server', the 'application
buffer' can easily be much larger than the socket receive buffer.
Apart from that, I agree that MSG_WAITALL is rarely useful and, in
fact, I had considered it to be of no use at all until I encountered
this particular situation. And this was the reason I wrote about it in
the first place.
| |
| David Schwartz 2006-12-26, 7:09 pm |
|
Alex Fraser wrote:
> I think you completely missed the point here. Suppose you have a simple loop
> like this:
>
>     while ((n = recv(s, buf, len, MSG_WAITALL)) > 0)
>         consume(buf, n);
>
> Where consume() implements an algorithm which runs in O(n) time. In
> practice, of course, there will be some per-call overhead. Assume consume()
> can process data faster than it can possibly need to (eg it runs at 20MB/s
> but the input is coming over a 100Mbit/s link).
>
> Clearly, the greater the value of len, the better the efficiency (ie
> clocks/byte) will be because there are fewer kernel/userspace transitions
> due to fewer recv() calls, and of course, similar applies to the per-call
> overhead from consume().
No, I'm sorry, that's not correct. You look at the one advantage and
ignore the disadvantages.
Suppose 'len' is 512KB and the query is 512KB. In that case, the
latency between when the last byte is received and the first byte is
processed will be the time it takes to process 512KB. However, if 'len'
was 16KB, then the latency would be only the time it takes to process
16KB. In other words, all things being equal, lower 'len' values will
generally result in better performance than higher ones.
> But beyond some value of len, you will eventually reach a (system-dependent)
> point where throughput starts to reduce, because the kernel buffers will
> fill up while consume() is running. That is, len can be "too large". But how
> large is too large?
You've forced yourself into an impossible situation. If 'len' is too
small, you will waste kernel-user transitions. If 'len' is too large,
you will have high latency. The fix is ridiculously simple, get rid of
MSG_WAITALL and you can make 'len' as large as you want with no
penalty. Duh.
> 1. The application calls recv() supplying a large buffer but without
> specifying MSG_WAITALL. No data is available so the thread sleeps.
> 2. A frame arrives causing data to be queued to the socket buffer. The
> thread becomes ready-to-run.
> 3. Another frame arrives and more data is queued to the socket buffer. (And
> perhaps another, etc.)
> 4. The thread runs; all available data is copied to the user buffer, then
> recv() returns.
>
> This is getting closer to what would happen if you used MSG_WAITALL, ie the
> relative gain in efficiency is reduced on a loaded system.
>
> I haven't made any new points in this post, just - hopefully - clarified the
> points I was trying to make before. Unless you actually address them I think
> we'll have to call it a day and agree to disagree :).
There is no advantage. All MSG_WAITALL does is make you wait longer
before you process the first byte. I don't see any conceivable way that
can make things faster. It either makes no difference (if you would
have gotten it done on time anyway) or it increases latency (if the
other side has to wait longer for you to process the last byte because
you waited longer to process the first one).
DS
| |
| Rainer Weikusat 2006-12-27, 4:10 am |
| "David Schwartz" <davids@webmaster.com> writes:
> Alex Fraser wrote:
>
> No, I'm sorry, that's not correct. You look at the one advantage and
> ignore the disadvantages.
>
> Suppose 'len' is 512KB and the query is 512KB. In that case, the
> latency between when the last byte is received and the first byte is
> processed will be the time it takes to process 512KB.
This is obviously nonsense, because processing does not start until
the last byte has been copied into the supplied buffer (not received).
If I assume that you meant 'the time it takes to copy 512K to the
supplied buffer', it is still pointless, because nobody was talking
about latency so far.
[...]
>
> You've forced yourself into an impossible situation. If 'len' is too
> small, you will waste kernel-user transitions. If 'len' is too large,
> you will have high latency. The fix is ridiculously simple, get rid of
> MSG_WAITALL and you can make 'len' as large as you want with no
> penalty. Duh.
There will be a significant penalty if len is large because the buffer
will mostly be empty.
>
> There is no advantage. All MSG_WAITALL does is make you wait longer
> before you process the first byte. I don't see any conceivable way that
> can make things faster.
That is because you conveniently ignore the 'minor little detail'
that the total processing time is the time spent copying the data from
the socket receive buffer to the application buffer plus the time
spent processing the data in this buffer plus the time spent
transitioning from userspace to kernelspace plus the time spent
calling consume and returning from the call plus loop processing time.
The copying time and the processing time are proportional to the size
of the data to process. All others are proportional to the number of
recv-calls needed to receive the data. And the number of recv-calls is
the size of the data divided by the amount of data received during
each call.
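The breakdown above can be written down as a small cost model. This is a sketch under assumptions, not a measurement: the function names are hypothetical, and the two cost parameters (a cost per byte for copying plus processing, and a cost per recv call covering the kernel transition, the consume call and the loop) are placeholders to be filled in from profiling.

```c
#include <assert.h>

/* Number of recv calls needed to receive `total` bytes when each call
 * delivers `per_call` bytes: ceil(total / per_call). */
unsigned long recv_calls(unsigned long total, unsigned long per_call)
{
    assert(per_call > 0);
    return (total + per_call - 1) / per_call;
}

/* Total cost = part proportional to the data size
 *            + part proportional to the number of recv calls. */
double total_cost(unsigned long total, unsigned long per_call,
                  double per_byte_cost, double per_call_cost)
{
    return (double)total * per_byte_cost
         + (double)recv_calls(total, per_call) * per_call_cost;
}
```

As an example, with 1 MiB of data and an assumed 1460 bytes delivered per call (a common TCP payload size on Ethernet, used here purely as an illustration), recv_calls gives 719 calls; with 64 KiB per call it gives 16. The per-byte term is identical in both cases, so only the per-call term shrinks.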
> It either makes no difference (if you would have gotten it done on
> time anyway)
You are assuming that a computer is a thing which is dedicated
to exactly one task at a single time, ie that you actually run on a
batch processing system. UNIX(*) isn't a batch processing system.