Author Scalable tcp server
Henrik Goldman

2007-07-07, 8:05 am

Hello,

I'm in the process of improving an existing TCP server in order to make it
faster.
The server has been ported to Windows, Linux, Solaris, Mac OS X and several
other Unix variants.

The server has until now been running with up to 50 clients at once, each of
which performs an action once in a while.
However, due to new requirements the server should be capable of handling
several thousand clients.
Overall the design is pretty flexible, so there should be no bottlenecks to
achieving this.
However, after performing a bit of testing I found some issues which need to
be improved.

In my test I wrote a client application which simulates up to 500 clients by
spawning 500 threads. Each thread would then connect to the server and
perform a request.
When the server is empty the request takes about 250 ms. However, with 500
idle clients connected each request takes about 890 ms.
This means that each request gets slowed down by a factor of roughly 3.5.

On the server side I have 2 linked lists using std::list.
One is called unprocessed connections and the other is called in-process
connections.
Then I have a thread pool consisting of one thread per 15 connections, plus 3.
This means that for 500 clients I would have 500 / 15 + 3 = 36 threads.
I found by testing that this value gives the fastest results.

What each thread does is:
1. Remove a connection from the unprocessed list and insert it into the
in-process list.
2. Perform select() on each connection object and select for readability and
writability.
3. If something needs to be read or written the action is performed.

I should mention that all connections use non-blocking sockets already.

From my tests I know that out of 501 connections in total, 500 are idle
and do not need to be processed.
However select still returns that the sockets are readable. I did a small
optimization here by ensuring that recv() would only get called
when there was no pending data for sending back to the client. This gave a
big speed improvement.

So the question is how I can more effectively find out which sockets are
pending for reading?
I found as a test that if I remove select for writability (by always pushing
data to be sent to the client even though it would return EWOULDBLOCK) it's
not going to improve the speed. So perhaps I should change my strategy for
finding readable and writable connections in another manner?

Thanks.

-- Henrik


moi

2007-07-07, 10:11 pm

On Sat, 07 Jul 2007 14:14:59 +0200, Henrik Goldman wrote:

> Hello,
>
> I'm in the process of improving an existing tcp server in order to improve
> the speed.
> The server is ported to Windows, linux, solaris, macosx and several other
> unix.
>
> The server has until now been running with up to 50 clients at once which
> performs an action once in a while.
> However due to new requirements the server should be capable of handling
> several thousand clients.
> Overall the design is pretty flexible so there should be no bottlenecks for
> achiving this.
> However after performing a bit of testing I found some issues which needs to
> be improved.
>
> In my test I wrote a client application which simulates up to 500 clients by
> spawning 500 threads. Each thread would then connect to the server and
> perform a request.
> When the server is empty the request takes about 250 ms. However with 500
> idle clients connected each request takes about 890 ms.
> This means that each request gets slowed down by several times.
>


Which is to be expected. More work==slower response.
Where does the server spend its time?
Can you profile it ?

> On the server side I have 2 linked lists using std::list.


Why ?

> One is called unprocessed connections and the other is called in-process
> connections.


Why ?

> Then I have a thread pool consisting of a thread per 15 connection + 3.
> This means that for 500 clients I would have 500 / 15 + 3 = 36 threads.
> I found by testing that this value gives the fastest results.
>
> What each thread does is:
> 1. Remove a connection from the unprocessed list and insert it into the
> in-process list.


Why ?
IMHO, for dispatching you need only *one* list/queue.
(this could even be implemented as a bitmap, eg an fd_set)

A thread can take a task from (the head of) the list and
execute it. If the task is finished, you are done, otherwise, the task can
be re-added to the work-list.

> 2. Perform select() on each connection object and select for readability
> and writability.


You call select() for every connection? That would take 2 system calls for
every read... Why not have one centralized select() that adds work to the
worklist?
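A minimal sketch of that centralized approach, under some assumptions: it uses POSIX poll() rather than select() (poll() has no FD_SETSIZE ceiling), the helper name collect_ready is hypothetical, and locking of the work list is left out. One call waits on every watched descriptor and queues only those with something to read, so idle connections never reach a worker thread.

```cpp
#include <poll.h>
#include <unistd.h>   // pipe(), write(), close() for callers
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical helper: wait once on all watched descriptors and append the
// readable ones to the work list. Returns the number of descriptors added,
// 0 on timeout, or -1 on error (errno is set by poll()).
// Assumption: the caller serializes access to `work` (e.g. with a mutex).
int collect_ready(const std::vector<int>& watched, std::deque<int>& work,
                  int timeout_ms)
{
    std::vector<pollfd> pfds;
    pfds.reserve(watched.size());
    for (int fd : watched) {
        assert(fd >= 0);
        pollfd p;
        p.fd = fd;
        p.events = POLLIN;   // only interested in readability here
        p.revents = 0;
        pfds.push_back(p);
    }

    int n = poll(pfds.data(), static_cast<nfds_t>(pfds.size()), timeout_ms);
    if (n <= 0) return n;

    int added = 0;
    for (const pollfd& p : pfds) {
        // POLLHUP/POLLERR are queued too, so a worker can notice the
        // disconnect or error when it calls recv().
        if (p.revents & (POLLIN | POLLHUP | POLLERR)) {
            work.push_back(p.fd);
            ++added;
        }
    }
    return added;
}
```

Worker threads would then pop descriptors from the work list and perform the actual recv(); a descriptor that is being worked on should be left out of `watched` until the worker is done with it.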

> 3. If something needs to be read or written the action is performed.

If nothing needs to be done, the task should not *be* on the worklist in
the first place....

> I should mention that all connections use non-blocking sockets already.
>


Which is good. (But it also frees you from calling select() before every
read(), since the read would "return" EWOULDBLOCK anyway.)
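As a rough illustration of reading without a preceding select(): the sketch below does a single recv() on a non-blocking socket and classifies the outcome. The names ReadResult, make_nonblocking and try_read are made up for this example, and the 4096-byte scratch buffer is an arbitrary choice.

```cpp
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>   // close() for callers
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

enum class ReadResult { GotData, NoData, Closed, Error };

// Put a descriptor into non-blocking mode; returns false on failure.
bool make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// One recv() attempt, no select() beforehand. Received bytes are appended
// to inbuf. Assumes fd is already non-blocking.
ReadResult try_read(int fd, std::string& inbuf)
{
    assert(fd >= 0);
    char tmp[4096];
    ssize_t n = recv(fd, tmp, sizeof tmp, 0);
    if (n > 0) {
        inbuf.append(tmp, static_cast<std::size_t>(n));
        return ReadResult::GotData;
    }
    if (n == 0) return ReadResult::Closed;   // orderly shutdown by the peer
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return ReadResult::NoData;           // nothing to read right now
    return ReadResult::Error;
}
```

Note that this only saves the extra system call per read; something still has to decide when to call try_read(), otherwise the server ends up busy-polling idle connections.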

> From my tests I know that out of 501 connections in total the 500 are
> idle and does not need to be processed. However select still returns
> that the sockets are readable. I did a small optimization here by
> ensuring that recv() would only get called when there was no pending
> data for sending back to the client. This gave a big speed improvement.
>
> So the question is how I can more effectively find out which socket are
> pending for reading?
> I found as a test that if I remove select for writability (by always
> pushing data to be sent to the client even though it would return
> EWOULDBLOCK) it's not going to improve the speed. So perhaps I should
> change my strategy for finding readable and writable connections in
> another manner?
>


For writeability, opinions differ. Normally, you don't need to select()
for it. If the network bandwidth is saturated, (and the response time
creeps up) people will stop using your server anyway.

Also, if you do throttle the writing process, you would still have to
buffer the data in your application's memory, which will cost roughly the
same amount of memory. (the best way would probably be to stop accepting
new work until the output has been drained )


HTH,
AvK
Henrik Goldman

2007-07-07, 10:11 pm


>
> Which is to be expected. More work==slower response.
> Where does the server spend it's time ?
> Can you profile it ?


So far I have not been able to profile it that much but perhaps I should
recompile it with gprof to see how it behaves.
However it's pretty straightforward what goes on. Since I know that most
clients are not doing anything the only thing that goes on is stuff related
to networking.

>
> Why ?


>
> Why ?


This is a business demand in order to do remote statistics and get a
complete picture of who is connected to the server.
It would not be needed otherwise. However it's needed to take a snapshot of
the current server status.

> IMHO, for dispatching you need only *one* list/queue.
> (this could even be implemented as a bitmap, eg an fd_set)


Very true. fd_set bitmap will likely not be sufficient since each client has
an amount of status information associated.
However everything is wrapped into a connection object on the server side
which also includes the socket.

> A thread can take a task from(the head of) the list and
> execute it. If the task is finished, you are done, otherwise, the task can
> be re-added to the work-list.


That's exactly what happens. There are just more threads doing that.

>
> You call select() for every connection ? That would take 2 systemcalls for
> every read... Why not let one centralized select() that adds work to the
> worklist ?


Because most OSes have a limit on how many fds can be polled. I know that
it's common for OSes to have 64 as a limit.
My idea to work around this is to poll up to 64 at a time and then add them
into the queue again.
So instead of just processing 1 connection it processes up to 64 at once.

>
> If nothing needs to be done, the task should not *be* on the worklist in
> the first place....


Well, that's the issue. You need to find out which connection needs to be
processed before you can add it.

>
> Which is good. (but it also frees you from calling select() before every
> read(), since the read would "return" EWOULDBLOCK anyway.


Good point. I think I'll try to do something about that.

>
> For writeability, opinions differ. Normally, you don't need to select()
> for it. If the network bandwidth is saturated, (and the response time
> creeps up) people will stop using your server anyway.


Heh, this is not an option. People are forced to use the service since it's
used within large corporations where they have no choice.
The only alternative is to let people install more than one server to
get more speed.

> Also, if you do throttle the writing process, you would still have to
> buffer the data in your application's memory, which will cost roughly the
> same amount of memory. (the best way would probably be to stop accepting
> new work until the output has been drained )
>


Yes, everything is buffered until written.
As you pointed out, I stop accepting new reads from the same client until all
data is written. This speeds up the process a lot since reading and writing
won't be happening in the same run.

Thanks for your input.

-- Henrik


moi

2007-07-07, 10:11 pm

On Sat, 07 Jul 2007 17:07:25 +0200, Henrik Goldman wrote:


>
> So far I have not been able to profile it that much but perhaps I should
> recompile it with gprof to see how it behaves.
> However it's pretty straightforward what goes on. Since I know that most
> clients are not doing anything the only thing that goes on is stuff related
> to networking.


Profile.
(My hypothesis is that you waste too much time maintaining the linked
lists, or you are convoying on the semaphores that guard them. Or you
poll too much.)


>
>
> This is a business demand in order to do remote statistics and get a
> complete picture of who is connected to the server. It would not be
> needed otherwise. However it's needed to take a snapshot of the current
> server status.


Business demand does not dictate an implementation.
You could easily get your 'state report' by scanning an array and
counting the various states that the connections happen to be in.

>
>
> Very true. fd_set bitmap will likely not be sufficient since each client
> has an amount of status information associated. However everything is
> wrapped into a connection object on the server side which also includes
> the socket.
>


"Normally" (e.g. without threads/objects), one would just use an array,
indexed by file descriptor (since a new fd is guaranteed to be the lowest
available, this can be a fixed-size array). Each entry would contain all
the {state, data, buffers} for this connection.
In your case, you could use an array of pointers to the "objects".
>
> Thats exactly what happens. There are just more threads doing that.


If you protect the list-operations by a semaphore, ("latch") there is
probably nothing wrong with this.

>
> Because most OS's has a limit of how many fd's can be polled. I know
> that it's common that OS's has 64 as a limit. My idea to work around
> this is to poll up to 64 at a time and then add them into the queue
> again.
> So instead of just processing 1 connection it processes up to 64 at
> once.


I don't know about other OSes. For UNIX, there will always be a way to
select() on all your available fds. (The same goes for poll().) You may
have to do some tweaking, but it is possible.
If you insist on chopping up your fdset (which can only be done when
using poll(), BTW), there is one problem: you cannot afford to block inside
select/poll, so you are in fact busy-polling ( --> calling select/poll N
times, just to discover 1 readable fd).

>
>
> Well thats the issue. You need to find out which connection needs to be
> processed before you can add it.
>



IMHO, "processed" is ambiguous, here.

WRT processing, there are only two states:

NEED_INPUT: this the 'idle' state for a connection.
this connection's fd has to be included in the fd_set for
input.

HAVE_WORK: in this state, enough input has been collected to perform
some useful work. We don't need more input, since we still have work...
( -->> this fd does not have to be included in the read-fd_set)

Note that the transition between NEED_INPUT and HAVE_WORK can be subtle:
e.g. if your protocol is 'line based', you cannot process before a CR/LF is
seen. You might add a pointer or flag (or extra states) to handle this.
[ Removing/adding a node to a linked list (+latching) is probably a bit
too expensive, just to check for sufficient input. ]
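To make the two states concrete, here is a sketch for a hypothetical line-based protocol, where a connection switches to HAVE_WORK once a full CR/LF-terminated line is buffered. LineConn and its members are invented for the example; the real server's connection object will look different.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class ConnState { NEED_INPUT, HAVE_WORK };

// Hypothetical connection for a line-based protocol.
struct LineConn {
    std::string inbuf;
    ConnState state = ConnState::NEED_INPUT;

    // Append newly received bytes and re-evaluate the state.
    void on_bytes(const std::string& data) {
        inbuf += data;
        update();
    }

    // Only connections in NEED_INPUT belong in the read fd_set.
    bool wants_read() const { return state == ConnState::NEED_INPUT; }

    // Remove and return one complete line without its CR/LF.
    // Precondition: state == HAVE_WORK.
    std::string take_line() {
        assert(state == ConnState::HAVE_WORK);
        std::size_t pos = inbuf.find("\r\n");
        if (pos == std::string::npos) return std::string();
        std::string line = inbuf.substr(0, pos);
        inbuf.erase(0, pos + 2);
        update();
        return line;
    }

private:
    // The state is derived from the buffer: a complete line means work.
    void update() {
        state = (inbuf.find("\r\n") != std::string::npos)
                    ? ConnState::HAVE_WORK
                    : ConnState::NEED_INPUT;
    }
};
```

Checking for a complete line is a scan of the buffer here, with no list manipulation or latching involved. A production version would probably remember how far it has already scanned instead of searching from the start each time.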

HTH,
AvK


allthecoolkidshaveone@gmail.com

2007-07-08, 4:17 am

On Jul 7, 5:14 am, "Henrik Goldman" <henrik_gold...@mail.tele.dk>
wrote:
> Hello,
>
> I'm in the process of improving an existing tcp server in order to improve
> the speed.
> The server is ported to Windows, linux, solaris, macosx and several other
> unix.
>


One option:

Get rid of most if not all the threads. Rewrite it with a callback-
based model using libevent
(http://monkey.org/~provos/libevent/), which will use the highest
performance event polling mechanism a particular OS supports (kqueue
on BSDs (Including OS X), epoll on linux 2.6, /dev/poll on Solaris,
poll, and as a last resort, select).

David Schwartz

2007-07-08, 10:08 pm

On Jul 7, 7:10 am, moi <r...@localhost.localdomain> wrote:

> For writeability, opinions differ. Normally, you don't need to select()
> for it. If the network bandwidth is saturated, (and the response time
> creeps up) people will stop using your server anyway.


WHAT?! That's as wrong as anything can be.

Suppose 500 people connect to your server and they each are
downloading a 1GB file over a 56Kbs connection. If you don't 'select'
for writability, your server will be completely crippled by this
regardless of how much network bandwidth the server has.
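A sketch of what selecting for writability buys you, assuming non-blocking POSIX sockets; OutConn, flush_some and wants_write are hypothetical names. Data is queued per connection, sent as far as the kernel accepts it, and the connection asks for write notification only while bytes remain queued, so a slow client never ties up a thread.

```cpp
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical connection with an application-level output queue.
struct OutConn {
    int fd;
    std::string outbuf;   // bytes the kernel has not accepted yet

    explicit OutConn(int f) : fd(f) {
        assert(fd >= 0);
        // Make sure send() never blocks the calling thread.
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags != -1) fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }
};

// Send as much queued data as the socket accepts right now.
// Returns false only on a hard error. Note: on some systems a closed peer
// raises SIGPIPE unless the signal is ignored.
bool flush_some(OutConn& c)
{
    while (!c.outbuf.empty()) {
        ssize_t n = send(c.fd, c.outbuf.data(), c.outbuf.size(), 0);
        if (n > 0) {
            c.outbuf.erase(0, static_cast<std::size_t>(n));
            continue;
        }
        if (n < 0 && errno == EINTR) continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return true;   // socket buffer full: wait for writability
        return false;
    }
    return true;
}

// True while the fd should be included in the write set.
bool wants_write(const OutConn& c) { return !c.outbuf.empty(); }
```

The event loop would call flush_some() whenever the descriptor is reported writable, and leave the descriptor out of the write set once wants_write() returns false, since an idle socket is almost always writable.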

DS

David Schwartz

2007-07-08, 10:08 pm

On Jul 7, 11:01 pm, allthekidshave...@gmail.com wrote:

> One option:
>
> Get rid of most if not all the threads. Rewrite it with a callback-
> based model using libevent
> (http://monkey.org/~provos/libevent/), which will use the highest
> performance event polling mechanism a particular OS supports (kqueue
> on BSDs (Including OS X), epoll on linux 2.6, /dev/poll on Solaris,
> poll, and as a last resort, select).


That's a good idea. But you still need threads to block on disk I/O
and to handle extraordinary conditions like errors where the code may
need to fault in.

DS

moi

2007-07-09, 7:06 pm

On Sun, 08 Jul 2007 17:16:56 -0700, David Schwartz wrote:

> On Jul 7, 7:10 am, moi <r...@localhost.localdomain> wrote:
>
>
>
> WHAT?! That's as wrong as anything can be.
>


I stand corrected. Opinions do not differ.

:-)

AvK

Rick Jones

2007-07-09, 7:06 pm

Henrik Goldman <henrik_goldman@mail.tele.dk> wrote:
> In my test I wrote a client application which simulates up to 500
> clients by spawning 500 threads. Each thread would then connect to
> the server and perform a request. When the server is empty the
> request takes about 250 ms. However with 500 idle clients connected
> each request takes about 890 ms. This means that each request gets
> slowed down by several times.


How do you know some of that isn't in the client? If I were looking
to measure the scalability of a server I'd want to have several
clients, not just one client process with 500 threads. Or, I'd want
to check the client first by testing it against several servers at
once.

The suggestion to profile things is spot-on.

rick jones
--
web2.0 n, the dot.com reunion tour...
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
hg@x-formation.com

2007-07-13, 8:06 am

Thanks to everyone who has answered this so far.

As per the suggestions I did some profiling, and with a few small patches I
managed to boost the speed quite a bit. These patches removed some
useless waits here and there and added a little bit of caching.

However now with those patches done I hit a new problem. Once in a
while I get errors on the client side and things just stop working.
This problem is *only* happening when I perform stress tests with
multiple socket connections.

I have identified it to be this piece of code on the client:


bool Csocket::SafeRecv()
{
    int nIndex = 0;
    int nLeft;
    int ret;
    HEADER H;

    // Peek for data
    do
    {
        if ((ret = m_Socket.Recv(&H, sizeof(H), MSG_PEEK)) == SOCKET_ERROR)
            return false;

        // See if client is disconnected
        if (ret == 0) return false;

        Sleep(5);
    } while (ret < (int) sizeof(H));

    // The header has been received and tells you how much data is left
    // to receive.
    nLeft = ENDIAN(H.lLength);

    ResizeMemory(nLeft);

    // Get the rest
    while (nLeft > 0)
    {
        ret = m_Socket.Recv(&m_pBuffer[nIndex], nLeft, 0);
        // Either the client disconnected or a socket error occurred.
        if (ret == SOCKET_ERROR) return false;
        if (ret == 0) return false;

        nLeft -= ret;
        nIndex += ret;
    }

    return true;
}

I should make it clear that SOCKET_ERROR is defined as -1 and that
m_Socket.Recv() is just a wrapper around recv().

In the above code I read the header before reading rest of the data.
This is required in order to read how much more data is missing and
perform some cryptography services which I left out.

I have identified that something weird is going on with MSG_PEEK since
"if (ret == 0) return false;" sometimes gets invoked. This means that
the server should have dropped the client connection but this is not
the case. I know that the server didn't do that just for the fun of
it.

Can anyone propose a better/safer way of achieving the same as
above?
What I want is to read the network header as 1 recv() before reading
rest of the data. This means that I'd like to wait on the client side
until I know that there is enough data to be read.

Thanks.

-- Henrik

shakahshakah@gmail.com

2007-07-13, 10:05 pm

Ignoring the issue of trying to send structs over a socket & the
Sleep() call, shouldn't your loop be something along the lines of the
following to handle a possible incomplete read of HEADER on your call
to Recv()?

// Peek for data
int nReadSoFar = 0;
while (1) {
    if ((ret = m_Socket.Recv(((char *) &H) + nReadSoFar,
                             sizeof(H) - nReadSoFar,
                             MSG_PEEK)) == SOCKET_ERROR)
        return false;

    // See if client is disconnected
    if (ret == 0) return false;

    nReadSoFar += ret;
    if (nReadSoFar >= (int) sizeof(H)) {
        break;
    }
    Sleep(5);
}




Henrik Goldman

2007-07-13, 10:05 pm

Hello,

> Ignoring the issue of trying to send structs over a socket & the
> Sleep() call,


Actually I'm not doing that. I am just packing a network header into a
1-byte aligned structure. It's just an easier way to extract data at
specific positions.
This is only used for the header anyway, since the rest of the data is in a
"binary xml" kind of format with identifiers and values.

> shouldn't your loop be something along the lines of the
> following to handle a possible incomplete read of HEADER on your call
> to Recv()?
>
> // Peek for data
> int nReadSoFar = 0 ;
> while(1) {
> if ((ret = m_Socket.Recv(
> ((char *) &H) + nReadSoFar
> ,sizeof(H)-nReadSoFar
> ,MSG_PEEK)) == SOCKET_ERROR)
> return false;
>
> // See if client is disconnected
> if (ret == 0) return false;
>
> nReadSoFar += ret ;
> if(nReadSoFar >= (int) sizeof(NH)) {
> break ;
> }
> Sleep(5);
> }
>


Isn't the whole idea that MSG_PEEK should not remove the data? So this means
that when I'm peeking I will get the same data over again.
Maybe I should remove the peek and then go for the normal recv()?
I don't know how this will help me though.
For one reason or another I get booted from the server when the socket
"pressure" is too large.
This is completely odd. I see the same behavior on a number of OSes
though.
Is this something which is expected? At least I would have expected a socket
error instead of recv with 0 as return.

-- Henrik


shakahshakah@gmail.com

2007-07-13, 10:05 pm

On Jul 13, 5:20 pm, "Henrik Goldman" <henrik_gold...@mail.tele.dk>
wrote:
> Hello,
>
>
> Actually I'm not doing that. I am just packing a network header into a
> 1-byte aligned structure. It's just an easier way then to extract data at
> specific positions.
> This is only used for the header anyway since rest of the data is in a
> "binary xml" kind of format which identifiers and values.
>
> Isn't the whole idea that MSG_PEEK should not remove the data? So this means
> that when I'm peeking I will get the same data over again.
> Maybe I should remove the peek and then go for the normal recv()?
> I don't know how this will help me though.
> For one reason or another I get booted from the server when the socket
> "pressure" is too large.
> This is completely odd. I see the same behavior on a number of OSes
> though.
> Is this something which is expected? At least I would have expected a socket
> error instead of recv with 0 as return.
>
> -- Henrik


I missed the MSG_PEEK, so you're probably right about not having to
worry about an "incomplete" read.

At a glance it reminded me of a late night debugging session of a
situation where, under heavy load, my read() calls weren't even
returning the initial 4 bytes which were the "bytes to follow" count.

moi

2007-07-14, 8:04 am

On Fri, 13 Jul 2007 05:55:09 -0700, hg wrote:

> Thanks to everyone who has answered this so far.
>
> As per suggestions I did some profiling and with a few small patches I
> managed to boost the speed quite a bit. These patches were some
> useless wait's here and there and some improvements with a little bit
> of caching.
>
> However now with those patches done I hit a new problem. Once in a
> while I get errors on the client side and things just stop working.
> This problem is *only* happening when I perform stress tests with
> multiple socket connections.
>
> I have identified it to be this piece of code on the client:
>
>
> bool Csocket::SafeRecv()
> {
> int nIndex = 0;
> int nLeft;
> int ret;
> HEADER H;
>
> // Peek for data
> do
> {
> if ((ret = m_Socket.Recv(&H, sizeof(H), MSG_PEEK)) ==
> SOCKET_ERROR)
> return false;
>
> // See if client is disconnected
> if (ret == 0) return false;
>
> Sleep(5);


I don't like the sleep. If your receive queue happens to have some data
in it, but < sizeof(H), the thread will "poll-block" in sleep.

> } while (ret < (int) sizeof(NH));
>
> // The header has been received and tells you how much data is left
> to receive.
> nLeft = ENDIAN(H.lLength);
>
> ResizeMemory(nLeft);
>
> // Get the rest
> while (nLeft > 0)
> {
> ret = m_Socket.Recv(&m_pBuffer[nIndex], nLeft, 0);
> // Either the client disconnected or a socket error occured.
> if (ret == SOCKET_ERROR) return false;
> if (ret == 0) return false;
>
> nLeft -= ret;
> nIndex += ret;
> }
>
> return true;
> }


Here you call recv/read repeatedly. In most cases (less data on the receive
queue than you expect), the last recv will fail with EAGAIN.
Per packet, you perform too many (at least 2) recv() system calls.
Since you keep trying until done, this loop will spend its time
in wasted recv() system calls.



> I should make it clear that SOCKET_ERROR is defined as -1 and that
> m_Socket.Recv() is just a wrapper around recv().
>
> In the above code I read the header before reading rest of the data.
> This is required in order to read how much more data is missing and
> perform some cryptography services which I left out.


Data is data. You first have to receive it before you can do something
(such as decrypt) with it.

> I have identified that something weird is going on with MSG_PEEK since
> "if (ret == 0) return false;" sometimes gets invoked. This means that
> the server should have dropped the client connection but this is not the
> case. I know that the server didn't do that just for the fun of it.


>
> Can anyone propose a better/ more safe way of achiving the same as
> above?


Yes:
1) If the fd selects as readable: perform *one* read/recv.
2) Buffer all the data you received.
3) After each read, determine if you
a) have enough data for constructing the "message-header" from it -->>
calculate the remaining amount;
b) have finished reading the entire "message": carve it out, and shift the
remainder (if any) down to the beginning of the buffer;
c) IFF you have a completed message: pass it to the application code;
otherwise just return and trust select() to call you when there *is*
more data.
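The steps above can be sketched roughly as follows. The wire format is an assumption made for the example (a 4-byte big-endian length followed by that many payload bytes); the real HEADER layout is not shown in this thread, and MessageBuffer is a made-up name. The caller does one recv() per readiness event and hands whatever arrived to append().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical reassembly buffer for length-prefixed messages.
class MessageBuffer {
public:
    // Step 2: buffer whatever a single recv() returned.
    void append(const char* data, std::size_t len) {
        assert(data != nullptr || len == 0);
        if (len > 0) buf_.append(data, len);
    }

    // Step 3: carve out every complete message currently buffered.
    // An incomplete header or body simply stays in the buffer.
    std::vector<std::string> take_messages() {
        std::vector<std::string> out;
        for (;;) {
            if (buf_.size() < kHeaderSize) break;          // 3a: no header yet
            std::uint32_t len = decode_length(buf_.data());
            if (buf_.size() - kHeaderSize < len) break;    // body incomplete
            out.push_back(buf_.substr(kHeaderSize, len));  // 3b: carve out
            buf_.erase(0, kHeaderSize + len);              // shift remainder
        }
        return out;                                        // 3c: hand to app
    }

    std::size_t pending() const { return buf_.size(); }

private:
    static constexpr std::size_t kHeaderSize = 4;

    // Assumed encoding: 4-byte big-endian payload length.
    static std::uint32_t decode_length(const char* p) {
        const unsigned char* u = reinterpret_cast<const unsigned char*>(p);
        return (static_cast<std::uint32_t>(u[0]) << 24) |
               (static_cast<std::uint32_t>(u[1]) << 16) |
               (static_cast<std::uint32_t>(u[2]) << 8) |
                static_cast<std::uint32_t>(u[3]);
    }

    std::string buf_;
};
```

No MSG_PEEK and no Sleep() are needed this way: a partial header just waits in the buffer until the next readable event. A real implementation should also reject absurdly large lengths before buffering that much data.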


> What I want is to read the network header as 1 recv() before reading
> rest of the data. This means that I'd like to wait on the client side
> until I know that there is enough data to be read.


IMHO: Don't.

> Thanks.


You're welcome,
AvK

moi

2007-07-14, 7:06 pm

On Fri, 13 Jul 2007 05:55:09 -0700, hg wrote:


> I have identified that something weird is going on with MSG_PEEK since
> "if (ret == 0) return false;" sometimes gets invoked. This means that
> the server should have dropped the client connection but this is not
> the case. I know that the server didn't do that just for the fun of
> it.



Forgot this one. Insufficient data to answer this. How do you know the
server did not close() / shutdown the connection? Does the server log
these connect/close events? It could be that the server fails to handle a
connection teardown correctly, and keeps the stale fd "open"; that would
not show up in the server's own logging... Tracing the TCP/IP traffic will
probably get you somewhere. (It could even be that some (NAT)
router/firewall in between runs out of resources, and closes some
connections.)

HTH,
AvK

