Required Server Performance

Apr 26, 2014 at 7:06 PM
I was looking at the WSDL for the SOAP service, and while it looks relatively simple to just write a basic server for my guild, I was curious what kind of performance would be necessary to handle all the traffic the main service is currently getting. Do you have any ballparks for peak bandwidth (10 Mbps? 100 Mbps?) and request rate (1,000 req/s? 10k req/s? 100k req/s?) that you'd be willing to share? If I write my server in Ruby, my primary language, I'd probably only be able to handle a thousand requests per second at most, so I was curious whether it would be worth attempting to implement it in something faster, like Node or Go.
Coordinator
Apr 28, 2014 at 12:43 AM
The language you write the server in has little to nothing to do with the performance and scalability of the service. It is a simple service that writes to and reads from a database; in this context the implementation language is essentially transparent. Performance is limited only by the server hardware and network bandwidth.

The current hosting environment is a semi-dedicated server with 1 GB of RAM and unlimited bandwidth. The RAM seems to be the primary bottleneck at this point, but I am already paying $30 a month and the next tier is $50.

As far as peak bandwidth and other statistics are concerned, my hosting provider doesn't offer much. I can tell you that during peak times there are about 1,500 people updating every 5 seconds.
May 7, 2014 at 3:28 PM
What is involved in running the server component for Parsec? My raid group seems to get a lot of disconnects, and it would be nice to reduce or remove them if possible. I'm not going to commit to anything, but I run a small ISP, so I have a 45 Mbit DS3 connection and a server with 32 GB of RAM that I use for virtualization; I might pimp it out as long as it didn't otherwise impact me. Let me know what's required or where I need to go to read up on it. BTW, thanks again, Drew, for all your work on this project; it's really great.
Coordinator
May 15, 2014 at 5:35 AM
Thanks Valcaron. I am trying to find out more information from my ISP, but they don't provide very good statistics. The bandwidth per month is about 150 GB up and 150 GB down. It looks like around 300 requests per second.
May 15, 2014 at 3:58 PM
I can't imagine it uses a whole ton, otherwise I would not consider hosting it. You mentioned that the RAM in the machine you are renting might be on the shy side, and I'm sure I could get you something in the 2-4 GB range. What OS are you running on that server?
We could set something up as your time permits and, if it looks like it will work, switch over: see how it handles the load and how much traffic it generates. If the traffic load seems reasonable, let it ride; if not, you could just keep hosting where you are.
Coordinator
May 16, 2014 at 2:09 PM
The service is an ASP.NET web service running on Windows Server 2008, hosted in IIS 7, with a SQL Server 2012 database.
May 18, 2014 at 2:11 AM
Okay, so I finally found the time this weekend to wrap up an alternative implementation of the backend. Unfortunately, I was not able to meet my goal of writing it to communicate over SOAP, partly because I wasn't able to find any working SOAP server libraries in any of the languages I wanted to work in, and partly because I couldn't get gSOAP to work after banging at it all day. The upside is that I could use JSON and plain HTTP POST requests, which are a lot faster to encode and smaller on the wire than XML. I'm currently hosting it at Linode on a $20/mo server (http://parsec.chromedshark.com/), and I have benchmarked it at approximately 1,600 SyncRaidStats calls per second with a raid size of 16 members and gzip turned on. If necessary, I could cache some additional things and probably get upwards of 2,000 calls per second, especially if I turn off compression (which probably isn't necessary with 3 TB of bandwidth).

I tried to match exactly the API you had established for the SOAP endpoint, and the documentation for the API can be found at https://github.com/warhammerkid/parsec-go. It's currently a rather simple implementation, as it didn't appear that much processing of the stats was needed on the backend. However, there may be some necessary processing that's not taking place. For example, I didn't implement any of the disconnect logic, since 160,000 active users only came out to 100 MB of memory; there's just a basic GC process for clearing out stats that are over an hour stale. I was also not sure whether the backend was supposed to perform some analysis of incoming stats to reset things when new combat started, so I'm not doing anything with that either. If you'd be willing to share the source code with me, I'd be happy to update the code to more closely match the existing implementation.

To hit the performance I wanted, I did two things: picked a language that's blazingly fast (Go) and stored all the stats in memory. Stats don't need any layer of persistence, so using a database to store them would significantly impact performance. To ensure that raid groups and passwords persist across restarts, I'm storing those in a SQLite database.
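For clarity, here is roughly the shape of that split as a minimal sketch. The type and field names are hypothetical stand-ins, not the actual code (which lives in the parsec-go repo); the point is only that the hot path never touches the database:

```go
package main

import (
	"database/sql"
	"sync"

	_ "github.com/mattn/go-sqlite3" // SQLite driver (assumed; any driver works)
)

// PlayerStats is a hypothetical stand-in for the real stats structure.
type PlayerStats struct {
	Name   string
	Damage int64
	Heals  int64
}

// Store keeps the hot per-raid stats purely in memory and persists only
// raid groups and passwords to SQLite, so a restart loses transient stats
// but never group registrations.
type Store struct {
	mu    sync.RWMutex
	stats map[string]map[string]PlayerStats // raid name -> player name -> stats
	db    *sql.DB
}

func NewStore(path string) (*Store, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS raid_groups (
		name TEXT PRIMARY KEY, password TEXT, admin_password TEXT)`); err != nil {
		return nil, err
	}
	return &Store{stats: make(map[string]map[string]PlayerStats), db: db}, nil
}

// CreateGroup is the rare, persistent operation: it hits SQLite because
// groups must survive restarts.
func (s *Store) CreateGroup(name, password, adminPassword string) error {
	_, err := s.db.Exec(`INSERT INTO raid_groups VALUES (?, ?, ?)`,
		name, password, adminPassword)
	return err
}

// UpdateStats is the hot path: it touches only memory, so there is no
// query parsing, no disk I/O, and no round trip to another process.
func (s *Store) UpdateStats(raid string, p PlayerStats) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stats[raid] == nil {
		s.stats[raid] = make(map[string]PlayerStats)
	}
	s.stats[raid][p.Name] = p
}

func main() {
	s, err := NewStore("parsec.db")
	if err != nil {
		panic(err)
	}
	_ = s.CreateGroup("MyGuild", "secret", "adminsecret") // hypothetical credentials
	s.UpdateStats("MyGuild", PlayerStats{Name: "Player1", Damage: 12345})
}
```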

If you're interested in targeting a different backend, I'd be happy to help you out as much as possible. Although I'm not a .NET developer, I might be able to convince a friend of mine to help me convert you from SOAP to pure HTTP calls if you'd like.
Coordinator
May 18, 2014 at 3:17 AM
1) The original raid service was written purely in memory but blew up at around 100 concurrent users. The memory usage wasn't bad over time but seemed to balloon at times, which caused server restarts. Adding the SQL Server sounds like it would reduce performance, but when I did, in the same hosting environment, I was able to hit 1,000 concurrent. The database is small, optimized, and fast.

2) SOAP was chosen because of its easy integration into a .NET client app. HTTP REST services are great but would have required me to write an entire layer. Performance-wise it just isn't worth it. Gzipped SOAP is fine.

3) My hosting environment is shared. This is where the real problems lie. The software can scale and support the load easily, but most of the time the disconnects do not coincide with high traffic on my service.

4) There is some processing being done to end fights and to ensure the raid members' fights are stuck together properly that you probably haven't accounted for.
May 18, 2014 at 5:31 AM
1) A full relational database is unquestionably slower than doing things in memory, because so much of what goes on in a database is unnecessary here: parsing and optimizing queries, network activity (probably in-memory pipes if on the same machine), regularly saving to disk, building and maintaining indexes, turning database data structures into C# data structures, etc. Just for kicks, I booted up my benchmarking server again and set it up to swamp the API with 320,000 different users' worth of stats at a concurrency level of 500, and I didn't see any slowdowns at all. In fact, it benchmarked quite a bit faster than when I was only making 100 concurrent requests - up to 3,200 req/s from 1,600.
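For reference, "swamping the API" means nothing more elaborate than a worker-pool load generator along these lines; the endpoint path and JSON payload below are hypothetical, as the actual benchmark code isn't in this thread:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const concurrency = 500    // simultaneous in-flight requests
	const totalUsers = 320000  // one request per simulated user
	url := "http://parsec.chromedshark.com/api/stats" // hypothetical endpoint

	jobs := make(chan int)
	var done int64
	var wg sync.WaitGroup

	start := time.Now()
	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				// Hypothetical per-user JSON payload.
				body := fmt.Sprintf(`{"user":"user%d","damage":12345}`, id)
				resp, err := http.Post(url, "application/json", bytes.NewBufferString(body))
				if err != nil {
					continue
				}
				resp.Body.Close()
				atomic.AddInt64(&done, 1)
			}
		}()
	}
	for id := 0; id < totalUsers; id++ {
		jobs <- id
	}
	close(jobs)
	wg.Wait()

	elapsed := time.Since(start).Seconds()
	fmt.Printf("%d requests in %.1fs (%.0f req/s)\n", done, elapsed, float64(done)/elapsed)
}
```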

2) While gzipped SOAP certainly works, it significantly complicates your code. With HttpWebRequest.AutomaticDecompression you could trivially enable gzip decompression on data received from the server and simply not gzip uploads. The bulk of the data transfer comes from the stats response, as it's typically going to be 8 times the request size (or 16 for larger groups), so you could significantly reduce the code needed to perform API calls while still obtaining most of the gains of gzipping everything.
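The .NET property makes that asymmetry a one-liner on the client. For comparison, here is the same scheme spelled out by hand in Go (the URL is hypothetical): the request goes up uncompressed, and only the response side ever touches gzip. Note that in Go, setting the Accept-Encoding header yourself disables the transport's transparent decompression, so the gzip stream must be unwrapped manually:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
)

// fetchStats asks for a gzipped response while sending its own (small)
// request uncompressed: the same asymmetry AutomaticDecompression gives
// a .NET client for free.
func fetchStats(url string) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept-Encoding", "gzip")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var r io.Reader = resp.Body
	if resp.Header.Get("Content-Encoding") == "gzip" {
		gz, err := gzip.NewReader(resp.Body)
		if err != nil {
			return nil, err
		}
		defer gz.Close()
		r = gz
	}
	return io.ReadAll(r)
}

func main() {
	body, err := fetchStats("http://parsec.chromedshark.com/api/stats") // hypothetical URL
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("got %d bytes\n", len(body))
}
```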

3) All the more reason to use the backend I've put together. Linode is a VPS hosting company, and the server I'm using has two guaranteed cores clocked at 2.8 GHz (or faster?). Furthermore, I've hammered it with well over 5 times the traffic your SOAP backend can handle and it hasn't had any problems, all for $20/mo.

4) Yeah, if that's the case then I'm definitely missing that. Would you be willing to share the algorithm/code for it? I'd be happy to port it to Go and integrate it.

If you truly have no interest in the work I've done, I'm perfectly okay with that. I've only put 20 hours or so into it, and I've learned quite a bit from the experience. Should you decide to use my work, I would suggest that you continue to use SOAP for any additional services you want to experiment with, like log uploading, and just use the faster backend for raid stat synchronization. That way you can experiment without needing to involve me or learn Go, while still gaining the benefits of a faster backend for the performance-critical portions of Parsec.
Coordinator
May 18, 2014 at 6:39 PM
Edited May 18, 2014 at 6:40 PM
1) You paint a pretty horrible picture of database servers, but there are positives:
  • At the flip of a switch I can archive fight data with no performance side effects.
  • The data is persisted across IIS restarts, which in a shared environment is crucial.
  • The database server already exists in this hosting environment, so any overhead the server adds is not relevant.
  • In all likelihood the database is cached in memory.
  • Queries like those used for raid stats are cached and do not suffer from parsing or optimization overhead.
  • Writing an in-memory thread-safe collection is non-trivial and is not overhead-free.
I would like to know more about your "benchmarking server," and also how you are handling thread safety in your collection with so many asynchronous requests updating and reading from it.

2) The entire SOAP response is gzipped by IIS - there is no code involved. This is a built-in feature of IIS now.

3) One of the many benefits of Linux is that it is cheap. However, I live and work in a Microsoft world. One of the benefits of working on Parsec is that it dovetails with my professional work; finding ways to accomplish things like Parsec in a Microsoft world helps me professionally. I have absolutely zero desire to change platforms. The downside is that cheap hosting is hard to find and is oftentimes "cheap hosting," if you know what I mean.

4) It is not complicated; it is a matter of timing out data. Here is the simple version (sketched in code after the list):
  • Clear users whose last connect date is stale by 2 minutes.
  • On a SyncData call (not GetData), if the group's last combat update date is stale by 20 seconds, clear the group data.
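A minimal Go sketch of those two rules, for anyone porting them; the structures and field names are guesses, not the actual C# implementation, and treating every sync as a combat update is an assumption:

```go
package main

import "time"

// Hypothetical structures for sketching the two timeout rules above.
type Member struct {
	LastConnect time.Time
	// ... per-player stats would live here ...
}

type Group struct {
	Members          map[string]*Member
	LastCombatUpdate time.Time
}

// Rule 1: on a periodic sweep, drop users whose last connect is more
// than 2 minutes stale.
func (g *Group) ClearStaleMembers(now time.Time) {
	for name, m := range g.Members {
		if now.Sub(m.LastConnect) > 2*time.Minute {
			delete(g.Members, name) // deleting during range is safe in Go
		}
	}
}

// Rule 2: on a SyncData call (not GetData), if the group's last combat
// update is more than 20 seconds stale, clear the group data before
// applying the new stats. This is what ends a fight server-side.
func (g *Group) OnSyncData(now time.Time) {
	if now.Sub(g.LastCombatUpdate) > 20*time.Second {
		g.Members = make(map[string]*Member)
	}
	g.LastCombatUpdate = now // assumes every sync counts as a combat update
}

func main() {
	g := &Group{Members: map[string]*Member{
		"Player1": {LastConnect: time.Now().Add(-3 * time.Minute)},
	}}
	g.ClearStaleMembers(time.Now()) // Player1 is dropped as stale
}
```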
I really do not have any desire to change platforms, and not because of any love for Microsoft, but because it is my professional business. Just as you have learned from writing your service, I too use this as a learning tool that I can directly apply to my real work. But that doesn't mean I am not open to new ideas in architecture. Most of the decisions about the raid service were made out of necessity, but some were made out of a lack of experience. It is, and always will be, a work in progress.
May 19, 2014 at 2:27 PM
Edited May 19, 2014 at 2:27 PM
First, I should state that the 3,200 req/s figure was incorrect: there was a bug in my benchmarking code that was producing invalid requests, so none of the processing took place. The 1,600 req/s figure should be accurate, although in that case I only had one raid group with 16 members instead of 8,000 different members spread across 500 guilds.

When I wrote it, I wasn't terribly worried about being perfectly memory-safe, and consequently there were a couple of race conditions where the stats from a given request for a raid/user could be lost. When I attempted to remedy this, the code got very messy very quickly, so I asked myself whether redesigning the API could reduce the amount of locking that needed to be done. If users connect and receive a session token, then we don't need to go out to SQLite more than once per request, and it also means we're only ever mutating non-threadsafe data structures once in the session (except in the cleanup background thread). Updating a user's stats is perfectly threadsafe, and while it's possible for the JSON serialization to occur in the middle of an update, the worst effect of getting just a slice of an updated user's stats is a temporary inaccuracy. So I went ahead and rewrote the server against this API. The data-structure copying in calculateRaidStats is costing me a couple hundred req/s, but I haven't had the time to fully optimize that method yet, and the copying may actually turn out to be necessary to implement the data timing-out you mention. Moving this type of processing to the client would also increase the performance of this method.
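Concretely, the shape of that design is something like the following sketch; the names are hypothetical and the real code is in the parsec-go repo, but it shows why the hot path needs almost no locking once a session exists:

```go
package main

import (
	"sync"

	"github.com/google/uuid" // any v4 UUID generator works; this one is assumed
)

// PlayerStats is a hypothetical stand-in for the real stats structure.
type PlayerStats struct {
	Name   string
	Damage int64
}

// Session is created once per connect. After that, only this session's
// stats are ever written, so the hot path needs nothing beyond a read
// lock on the session map.
type Session struct {
	RaidName string
	Stats    PlayerStats
}

type Server struct {
	mu       sync.RWMutex
	sessions map[string]*Session
}

func NewServer() *Server {
	return &Server{sessions: make(map[string]*Session)}
}

// Connect authenticates once (the only point where SQLite would be hit)
// and hands back a token. The session map is mutated only here and in
// the cleanup background thread, matching the design described above.
func (s *Server) Connect(raid string) string {
	token := uuid.New().String()
	s.mu.Lock()
	s.sessions[token] = &Session{RaidName: raid}
	s.mu.Unlock()
	return token
}

// Sync overwrites this session's stats in place. A concurrent JSON
// serialization may observe a partially updated snapshot, but per the
// post above, the only cost is a temporary inaccuracy.
func (s *Server) Sync(token string, stats PlayerStats) {
	s.mu.RLock()
	sess := s.sessions[token]
	s.mu.RUnlock()
	if sess != nil {
		sess.Stats = stats
	}
}

func main() {
	srv := NewServer()
	t := srv.Connect("MyGuild")
	srv.Sync(t, PlayerStats{Name: "Player1", Damage: 12345})
}
```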

In the process of making these changes, I found that the gzip library I was using was significantly slower than it's supposed to be, so I updated my first backend to use the new library and benchmarked both again. The results came out to 3,855.55 req/s for the non-threadsafe first version and 3,851.34 req/s for the new version. The code for both benchmarks can be found at https://gist.github.com/warhammerkid/196a8c20509d3871f9a3, and they were both run on an Amazon c3.2xlarge instance in the US-East datacenter (near the Linode datacenter I'm using). The new backend code can be found at warhammerkid/parsec-go.

Overall I'm pleased with how the new API came out, although the data structures I'm using did increase the complexity of my cleanup code. That the new version is slower than the old isn't entirely surprising, given the added thread safety and some other tweaks that reduce memory load at the cost of increased CPU. Also, as a side benefit of being able to tell users apart by session token, I don't have to depend on any of the data they send up to differentiate them, which makes it impossible for users to change other raid members' stats. Additionally, the new API design doesn't send as much over the wire, going from 196.06 MB to 137.42 MB for the same number of requests. If you're interested in gaining some performance and don't mind a more REST-like API, I would highly suggest trying this design out to see what your performance is.

API Contract:
GET /api/v2/raid_group?name=RAID_GROUP_NAME&password=RAID_PASSWORD
 - Test connection
 - Returns 200 on success or 401 on failure

POST /api/v2/raid_group?name=RAID_GROUP_NAME&password=RAID_PASSWORD&adminPassword=RAID_ADMIN_PASSWORD
 - Create a raid group
 - Returns 200 on success or 400 on failure with an error message in the body of the response

DELETE /api/v2/raid_group?name=RAID_GROUP_NAME&adminPassword=RAID_ADMIN_PASSWORD
 - Delete a raid group
 - Returns 200 on success
 - Returns 400 if a group with the given name and admin password is not found

POST /api/v2/connect?name=RAID_GROUP_NAME&password=RAID_PASSWORD
 - Connect to a raid group
 - Returns a connection token (v4 UUID) on success
 - Returns 401 for login failure

GET /api/v2/stats?t=CONNECTION_TOKEN
 - Return value is all raid stats in a JSON array

POST /api/v2/stats?t=CONNECTION_TOKEN
 - Update the user's stats and get the raid stats
 - Post body is JSON formatted stats structure
 - Response is identical to stats GET
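To make the flow concrete, here is roughly what a client session against this contract looks like in Go. The group name, password, and stats payload are made up, and I'm assuming the connection token comes back as the raw response body:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

const base = "http://parsec.chromedshark.com" // the test host from above

func main() {
	// 1. Connect once to trade the group credentials for a session token.
	q := url.Values{"name": {"MyGuild"}, "password": {"secret"}} // hypothetical credentials
	resp, err := http.Post(base+"/api/v2/connect?"+q.Encode(), "application/json", nil)
	if err != nil || resp.StatusCode != 200 {
		panic("connect failed")
	}
	tokenBytes, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	token := string(tokenBytes) // assumed: token is the raw response body

	// 2. Sync this player's stats; the response body is the whole raid's
	// stats, so one round trip both uploads and downloads.
	stats := []byte(`{"name":"Player1","damage":12345}`) // hypothetical structure
	resp, err = http.Post(base+"/api/v2/stats?t="+url.QueryEscape(token),
		"application/json", bytes.NewReader(stats))
	if err != nil {
		panic(err)
	}
	raid, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	fmt.Println(string(raid))
}
```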
May 19, 2014 at 3:08 PM
I forgot to mention this, but I was curious what kind of performance I could get with gzip disabled entirely. Under the same circumstances as the benchmarks I mentioned, I was able to hit a processing rate of 6,092.61 req/s, although the total data transferred increased to a whopping 1,538.93 MB. The data rate was 15.63 MB/s (125.04 Mbps).
Coordinator
May 20, 2014 at 8:07 PM
I want to point out that this is strictly an apples-to-oranges comparison. The service as it exists is very scalable - in fact, it will work across a server farm if need be, where an in-memory solution would not. Anyway, I have yet to hit any limit imposed by the software or the platform.

The problems I have encountered are 100% due to hosting providers. Today, for example, the raid service is throttled at about 200 concurrent users (~70 requests per second): my hosting provider changed some configuration on my server and accidentally put back in place a 200 MB memory limit on the service. Now I have to wait for them to escalate me up the chain until someone who knows something gets my ticket and fixes it.

The saddest part is that this is my 3rd provider and the best one thus far. However, what I think I need now is a VPS with Windows Server, MS SQL, and 2 GB+ of RAM. Too bad I already paid for a year of the current service.