[fixed] Error parsing HTTP headers (2 Viewers)

morpheus_xx · February 26, 2014

To be clear: in all commits after the Unit Test was added, the header parsing should work 100% correct already. Please try to confirm this first, before making changes.

mnmr · February 26, 2014

Okay, so did a few quick experiments and was able to bring the runtime for 10k runs down to around 360ms (the original code used about 430ms on my machine), so it's only slightly faster.

These are the modified bits:

Code:

      internal void ParseHeaderAndBody( Stream stream, out string firstLine )
      {
          const byte CR = 13;
          const byte LF = 10;

          byte[] data = (stream as MemoryStream).ToArray();
          int splitSize = 4;
          var splitIndex = FindIndexSinglePattern( data, new[] { CR, LF, CR, LF } );
          if( splitIndex == -1 )
          {
              splitSize = 2;
              splitIndex = FindIndexSinglePattern( data, new[] { LF, LF } );
          }

          string header = Encoding.UTF8.GetString( data, 0, splitIndex );
          string[] lines = header.Split( splitSize == 2 ? new [] { "\n" } : new [] { "\r\n" }, StringSplitOptions.None );
          if( lines.Length < 1 )
              throw new InvalidDataException( "Invalid empty HTTP header" );

          firstLine = lines[ 0 ].Trim();
          ParseRemainingHeaderLines( lines );
          // Check for correct body length
          int contentLength;
          int bodySize = data.Length - splitIndex - splitSize;
          if( int.TryParse( this[ "CONTENT-LENGTH" ], out contentLength ) && contentLength != -1 && contentLength != bodySize )
              throw new InvalidDataException( "Invalid HTTP content length: header {0}, actual body: {1}", contentLength, bodySize );
      }

     // finds the starting index of the given byte pattern
     public int FindIndexSinglePattern( byte[] data, byte[] pattern )
     {
         var matchCount = 0;
         for( var i = 0; i < data.Length; i++ )
         {
             matchCount = data[ i ] == pattern[ matchCount ] ? matchCount + 1 : 0;
             if( matchCount == pattern.Length )
                 return i - matchCount + 1;
         }
         return -1;
     }

As you'll note, I put some of the parsing logic in a separate method. I like being able to see the entire method body on screen, and when it doesn't fit I take it as a sign that it needs refactoring

Code:

     private void ParseRemainingHeaderLines( string[] lines )
     {
         for( int i = 1; i < lines.Length; i++ )
         {
             string line = lines[ i ].Trim();
             int index = line.IndexOf( ':' );
             if( index == -1 )
                 throw new InvalidDataException( "Invalid HTTP header line '{0}'", line );
             try
             {
                 string key = line.Substring( 0, index ).Trim();
                 string value = line.Substring( index + 1 ).Trim();
                 SetHeader( key, value );
             }
             catch( ArgumentException e )
             {
                 throw new InvalidDataException( "Invalid HTTP header line '{0}'", e, line );
             }
         }
     }

Just for completeness, here's the code I used to measure performance:

Code:

    [TestMethod]
    public void ParseHttpRequestHeadersPerformanceTest()
    {
       var watch = Stopwatch.StartNew();
       for( var i = 0; i < 10000; i++ )
       {
           ParseHttpRequestHeaders();
       }
        watch.Stop();
       Console.WriteLine("Elapsed time: {0}ms", watch.ElapsedMilliseconds);
    }

Removing the Skip/Take methods gained the bulk of the performance win. The MultiPattern method was also doing too many array lookups, so I used the simple single-pattern method I actually wrote first.

Another thing to note is that the Parse method takes a Stream, but nowhere in the code is anything but a MemoryStream passed to it (as you can see, I'm brutally using that to grab all the bytes passed in). Another thing is that in all cases, the MemoryStream is created from a fixed byte array. I was able to shave off another 20ms or so by simply passing the byte[] around instead (real-life performance impact is likely to be slightly better than that, since there is no test-setup-overhead included: 20ms out of 400ms is not-so-much, but 20ms out of 200ms is decent).

Feel free to use or ignore the code above as you deem fit

If you do use it, be sure not to have the cast to MemoryStream in the code.

morpheus_xx · February 26, 2014

mnmr said:
Another thing to note is that the Parse method takes a Stream, but nowhere in the code is anything but a MemoryStream passed to it (as you can see, I'm brutally using that to grab all the bytes passed in).

It's not ok to expect a MemoryStream, while the argument is a generic stream. The callers are SSDP/GENA controller classes, they pass a stream which might be of any type.
As generic solution I would suggest this extension method:

C#:

  internal static class StreamExtensions
  {
    public static byte[] ToArray(this Stream stream)
    {
      const int BUFFER_SIZE = 4096;
      MemoryStream memoryStream = stream as MemoryStream;
      if (memoryStream != null)
        return memoryStream.ToArray();

      using (memoryStream = new MemoryStream())
      {
        byte[] bodyBuffer = new byte[BUFFER_SIZE];
        int bytesRead;
        do
        {
          bytesRead = stream.Read(bodyBuffer, 0, BUFFER_SIZE);
          memoryStream.Write(bodyBuffer, 0, bytesRead);
        } while (bytesRead > 0);
        return memoryStream.ToArray();
      }
    }
  }

I also extended the unit tests to include body-less calls (like GET, CONNECT, NOTIFY...) (see https://github.com/MediaPortal/MediaPortal-2/commit/89ba22c7a1d27ea07c85b49fa5b0a05ec183e556). In this case your current code does fail (split index is -1). But generally I like your approach

. I think the required changes are small to get this fully working. Can you check this?

Thank you again, we already improved the testing and parsing a lot

mnmr · February 26, 2014

morpheus_xx said:
mnmr said:

Another thing to note is that the Parse method takes a Stream, but nowhere in the code is anything but a MemoryStream passed to it (as you can see, I'm brutally using that to grab all the bytes passed in).

Click to expand...

It's not ok to expect a MemoryStream, while the argument is a generic stream. The callers are SSDP/GENA controller classes, they pass a stream which might be of any type.

That is only true in theory - in the current code base they always pass a MemoryStream. I asked my good friend, Dr. ReSharper

, to investigate, and it appears that all calls to the SimpleHttpMessage.Parse method happens from the two places you mention (SSDP and GENA classes). Both of these implementations are private (and thus also non-virtual), so you could in theory change the method signatures to avoid the overhead -- assuming that no-one outside of MP client/server are using the SimpleHttp* classes.

Here's what both of them do:

Code:

        EndPoint remoteEP = new IPEndPoint( state.Endpoint.AddressFamily == AddressFamily.InterNetwork ? IPAddress.Any : IPAddress.IPv6Any, 0);
        Stream stream = new MemoryStream(state.Buffer, 0, socket.EndReceiveFrom(ar, ref remoteEP));
        ...
        SimpleHTTPRequest request;
        SimpleHTTPRequest.Parse(stream, out request);

As you can see, it reads all bytes (the blocking EndReceiveFrom) into a buffer. Then it creates a MemoryStream from that buffer, passes it to the Parse method, which hands it over to the code we've been modifying. The end result is that all this MemoryStream packaging is really just overhead that you technically don't need.

If it was me, I'd change the implementation in SimpleHttpMessage to consume a byte array and modify the two controllers to match that. Then I'd include your extension method and introduce a public overload that allows passing in a stream (the current method signature). That gives you perfect compatibility while maximizing performance for the current use case.

morpheus_xx said:
I also extended the unit tests to include body-less calls (like GET, CONNECT, NOTIFY...) (see https://github.com/MediaPortal/MediaPortal-2/commit/89ba22c7a1d27ea07c85b49fa5b0a05ec183e556). In this case your current code does fail (split index is -1). But generally I like your approach . I think the required changes are small to get this fully working. Can you check this?

I'll have a look - will post my results later this evening

mnmr · February 27, 2014

morpheus_xx said:
In this case your current code does fail (split index is -1). But generally I like your approach . I think the required changes are small to get this fully working. Can you check this?

The error is in the unit test code:

Code:

 string fullRequest = string.Join(delimiter, requestHeaders.ToArray());

This inserts the delimiter between items, not after each item. Thus, to have the header end with two delimiters you need an extra "" item in the headers list.

Alternatively you could use this to build the fullRequest:

Code:

string fullRequest = string.Concat( requestHeaders.Select( h => h + delimiter ) );

Error handling could be better though. I suggest the following change to the method in SimpleHttpMessage:

Code:

         var splitIndex = FindIndexSinglePattern( data, new[] { CR, LF, CR, LF } );
         if( splitIndex == -1 )
         {
             splitSize = 2;
             splitIndex = FindIndexSinglePattern( data, new[] { LF, LF } );
             if( splitIndex == -1 )
                 throw new InvalidDataException( "No end of HTTP header marker found" );
         }

This way it will not try to proceed without a proper split index.

Edit: added alternate solution to unit test code.

morpheus_xx · February 27, 2014

Ok, next version pushed to https://github.com/MediaPortal/MediaPortal-2/commit/1a0d0cd50a540ee153bc1bf721ffdc04f3b15c84.

I fixed the header construction for "body-less" messages by adding a "" to list. So in principle only your proposed overload for byte[] is left.

mnmr · February 27, 2014

That looks great. I think the extensions class should be moved to a separate file and made public, since it's kind of a generic utility method rather than something internal to the HttpMessage. The byte[] optimization we can return to if performance is an issue

[fixed] Error parsing HTTP headers (2 Viewers)

morpheus_xx

Retired Team Member

mnmr

Retired Team Member

morpheus_xx

Retired Team Member

mnmr

Retired Team Member

mnmr

Retired Team Member

morpheus_xx

Retired Team Member

mnmr

Retired Team Member

Users who are viewing this thread