Discussion:
[hunchentoot-devel] hunchentoot file upload performance
Mac Chan
2006-11-15 02:10:52 UTC
Hi all,

I've been using tbnl behind mod_lisp2 for quite a while.

After Edi merged tbnl with hunchentoot I wanted to get rid of apache
and run hunchentoot in standalone mode for easier deployment (not to
mention the coolness factor of running everything in lisp).

However, I've found that the file upload performance of hunchentoot is
4-10 times slower than tbnl (standalone or behind mod_lisp). I tested
this by uploading a 30 MB file on the test upload page.

I'm guessing that Chunga might be the reason, since this is the only
new component introduced.

[ Chunga is currently not optimized towards performance - it is rather
intended to be easy to use and (if possible) to behave correctly.]

On a side note, while the file is being uploaded the CPU goes up to
90-100% (this happens with both tbnl and hunchentoot).

This is not the case when I upload the same file to a php script w/
apache.

Again, my guess is that rfc2388 repeatedly calls read-char instead of
grabbing a buffer with read-sequence and decoding it as a chunk.
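To illustrate what I mean, here is a minimal sketch (hypothetical
helper names, not the actual rfc2388 code):

  ;; per-character reading: one call per character
  (defun count-by-char (stream)
    (loop for char = (read-char stream nil nil)
          while char
          count char))

  ;; buffered reading: one call per 8 KB block
  (defun count-by-buffer (stream)
    (let ((buffer (make-array 8192 :element-type 'character)))
      (loop for end = (read-sequence buffer stream)
            until (zerop end)
            sum end)))

For a 30 MB upload that is the difference between roughly 30 million
calls and a few thousand.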

These two issues make migrating to hunchentoot particularly painful,
because if one of the users uploads a huge file the whole site will
become very unresponsive (in tbnl the CPU spike goes away a lot
faster, but improving the I/O routine is still a big win).


I don't want to blindly guess, but I don't know how to use the
LispWorks profiler to profile a multi-threaded server app.

The LispWorks profiler requires you to run (profile <form>), and it
returns the profiling data after the <form> has run.
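Concretely, the only thing I can think of is to run the server in the
same image and wrap a scripted upload in PROFILE, something like this
(a hypothetical sketch; it assumes an HTTP client such as Drakma,
made-up URL and file names, and that the profiler is set up so it also
samples the listener threads):

  ;; sketch only: POST a large file to the upload page and profile
  ;; the whole round trip
  (profile
   (drakma:http-request "http://localhost:3000/upload.html"
                        :method :post
                        :form-data t
                        :parameters '(("file" . #p"/tmp/big-file.bin"))))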

Is there a way to profile hunchentoot without writing an individual
test case that simulates uploading a 30 MB file?

What about people using SBCL? How do you go about profiling apps like
hunchentoot?

Thanks.
Edi Weitz
2006-11-15 06:53:20 UTC
Post by Mac Chan
I've been using tbnl behind mod_lisp2 for quite a while.
After Edi merged tbnl with hunchentoot I wanted to get rid of apache
and run hunchentoot in standalone mode for easier deployment (not to
mention the coolness factor of running everything in lisp).
However, I've found that the file upload performance of hunchentoot
is 4-10 times slower than tbnl (standalone or behind mod_lisp). I
tested this by uploading a 30 MB file on the test upload page.
Yes, I've seen the same.
Post by Mac Chan
I'm guessing that Chunga might be the reason, since this is the only
new component introduced.
That would be my guess as well - Chunga and/or FLEXI-STREAMS.
Not least because both Chunga and FLEXI-STREAMS make heavy use of
Gray streams.
Post by Mac Chan
Again, my guess is that rfc2388 repeatedly calls read-char instead of
grabbing a buffer with read-sequence and decoding it as a chunk.
That is certainly part of the problem.

The only way out is probably to write our own version of the RFC2388
library - which is one of my long-term plans.
Post by Mac Chan
These two issues make migrating to hunchentoot particularly painful,
because if one of the users uploads a huge file the whole site will
become very unresponsive (in tbnl the CPU spike goes away a lot
faster, but improving the I/O routine is still a big win).
I don't want to blindly guess, but I don't know how to use the
LispWorks profiler to profile a multi-threaded server app.
The LispWorks profiler requires you to run (profile <form>), and it
returns the profiling data after the <form> has run.
Is there a way to profile hunchentoot without writing an individual
test case that simulates uploading a 30 MB file?
I'd like to know the answer too... :)

I think you have better chances of getting an answer to this question
on the LW mailing list.

Cheers,
Edi.
Edi Weitz
2006-11-16 00:30:05 UTC
Post by Mac Chan
I'm guessing that Chunga might be the reason, since this is the only
new component introduced.
Actually, the more I think about it the more I'm sure that
FLEXI-STREAMS is the culprit. I also have an idea how to make it
faster, but I'm not sure if I'll find the time to do it in the next
days.
Post by Mac Chan
Again, my guess is that rfc2388 repeatedly calls read-char instead of
grabbing a buffer with read-sequence and decoding it as a chunk.
Because of the way the streams are layered now, you probably wouldn't
win much (if anything at all) if you used READ-SEQUENCE instead.
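The problem is that with the current layering a READ-SEQUENCE on the
outer stream still bottoms out in per-character calls. Purely as an
illustration (this is not the actual FLEXI-STREAMS code), a
READ-SEQUENCE built on top of READ-CHAR like this gains nothing:

  ;; illustrative only: still one call per element
  (defun naive-read-sequence (sequence stream
                              &key (start 0) (end (length sequence)))
    (do ((index start (1+ index)))
        ((>= index end) index)
      (let ((char (read-char stream nil nil)))
        (unless char (return index))   ; EOF: return what we filled
        (setf (elt sequence index) char))))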

More later,
Edi.
Mac Chan
2006-11-16 03:04:10 UTC
Post by Edi Weitz
Actually, the more I think about it the more I'm sure that
FLEXI-STREAMS is the culprit. I also have an idea how to make it
faster, but I'm not sure if I'll find the time to do it in the next
days.
No hurry, Edi. As long as we know there's a solution I'm at ease :-)
Post by Edi Weitz
Post by Mac Chan
Again, my guess is that rfc2388 repeatedly calls read-char instead of
grabbing a buffer with read-sequence and decoding it as a chunk.
Because of the way the streams are layered now, you probably wouldn't
win much (if anything at all) if you used READ-SEQUENCE instead.
So I just did some profiling earlier using the tip from

http://thread.gmane.org/gmane.lisp.lispworks.general/5563/focus=5573

and found that most of the time is spent in tons of calls to read-char.

I then tried something like reading the whole request body into a
string and using the (parse-mime string) method instead of
(parse-mime stream) (this is not a fix, because if you upload a
100 MB file, it will allocate a 100 MB buffer).
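Roughly like this (a sketch from memory - the Content-Length and
boundary handling is elided and the names are not exactly what's in
my code):

  ;; read the complete request body into one big string with a single
  ;; READ-SEQUENCE, then parse the string instead of the stream
  (let* ((length (parse-integer (header-in :content-length)))
         (body (make-string length)))
    (read-sequence body content-stream)
    (parse-mime body boundary))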

The result is worse... I will investigate later. Also, I plan to do a
similar test with AllegroServe and see if it uses 100% CPU performing
the same task.
Edi Weitz
2006-11-16 07:32:24 UTC
Post by Mac Chan
I then tried something like reading the whole request body into a
string and using the (parse-mime string) method instead of
(parse-mime stream) (this is not a fix, because if you upload a
100 MB file, it will allocate a 100 MB buffer).
You could divide this into smaller buffers, but then it gets ugly. See
below.
Post by Mac Chan
The result is worse...
Yes, as I conjectured in my previous email. The reason is that
FLEXI-STREAMS currently has to read in octet-size steps anyway.
Post by Mac Chan
I will investigate later. Also, I plan to do a similar test with
AllegroServe and see if it uses 100% CPU performing the same task.
I'm pretty sure it doesn't, mainly for two reasons:

1. For chunking and external format switching it uses AllegroCL's
built-in "simple streams" - you're unlikely to beat that with
portable Gray stream solutions like Chunga and FLEXI-STREAMS.

2. Lots of AllegroServe's source looks (to me) like C code with
   parentheses around it - pre-allocated buffers and all that stuff.
   That's probably good for performance, but it's not the road I want
   to go down.

Cheers,
Edi.
Edi Weitz
2006-12-27 12:50:04 UTC
Post by Edi Weitz
Actually, the more I think about it the more I'm sure that
FLEXI-STREAMS is the culprit. I also have an idea how to make it
faster, but I'm not sure if I'll find the time to do it in the next
days.
You might want to try out the new release (0.9.0) of FLEXI-STREAMS to
see if this makes things more acceptable for you.

SBCL (and probably CMUCL) users please note this, though:

http://thread.gmane.org/gmane.lisp.steel-bank.general/1400/

Cheers,
Edi.
