Executive Summary

In December 2015, the Ed-Fi Alliance concluded an effort to test the full spectrum of the ODS / API capabilities under load.1 The testing covered both transactional operations to create, read, update, and delete entities and bulk operations supporting the import of large files. This technical article reports on the results of the transactional performance testing. Performance testing results for bulk loading are covered in another article, ODS / API Bulk Load Performance Testing.

In transactional tests, the API web server CPU and memory usage increased with activity, but the SQL Server hosting the ODS rarely experienced a spike greater than 10% CPU utilization. For this reason, testing focused on the performance characteristics of the API web server(s).

The performance tests were run against a single-web-server configuration and a load-balanced, multiple-web-server configuration. Each type of configuration was tested with increasingly powerful virtual machines. The configurations were designed to be characteristic of production environments with a vertical-scaling strategy (i.e., achieving scale by investing in a few, very powerful servers) and a horizontal-scaling strategy (i.e., achieving scale by balancing load across multiple, relatively inexpensive servers).

The performance tests applied increasing pressure (i.e., an increased number of requests per second) in stages to determine the point of stability, stress, and failure for each configuration at each virtual machine size. The virtual servers used were Amazon Web Services (AWS) machines. The high-level testing results are summarized in the table below.

| Scaling Strategy | Virtual Web Server Size | Stable Requests/sec. | Burst Requests/sec. | Failure Requests/sec. |
| --- | --- | --- | --- | --- |
| Horizontal | 2 x Medium | 525-550 | 575-600 | 625-650 |
| Horizontal | 4 x Medium | 875-900 | 1050-1075 | 1275-1300 |
| Vertical | Medium | 175-200 | 225-250 | 275-300 |
| Vertical | Large | 375-400 | 475-500 | 650-675 |
| Vertical | Extra Large | 575-600 | 775-800 | 850-875 |

Detailed results and server specifications can be found later in this document.

Notes:

  • The ODS / API system as a whole proved to be stable under sustained transactional load.
  • Stability was defined as a consistent average response time of less than 1 sec. / request. The minimum response time for any operation on the configured system was measured at .013 seconds.
  • The load-balanced, horizontal scaling configuration outperformed the vertical scaling strategy using a comparable number of processors and memory.

The load simulated by these tests approximates a fairly high degree of activity at a mid-sized organization, using easily accessible and relatively inexpensive virtual machines. As a point of comparison, an SEA-sponsored production system with over 250K students experiences around 40 transactions/second during business hours on a “normal” day. The intent in using this testing approach was to provide a baseline for organizations to use in planning. The solution can easily be scaled to handle larger organizations or increased performance needs.
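As a planning aid, the summary table can be turned into a simple lookup. The sketch below is illustrative only: it uses the lower bound of each stable range from the table, and the `headroom` safety factor and the function name are assumptions rather than part of the testing effort.

```python
# Lower bound of each stable range from the summary table (requests/sec).
STABLE_FLOOR = {
    ("Horizontal", "2 x Medium"): 525,
    ("Horizontal", "4 x Medium"): 875,
    ("Vertical", "Medium"): 175,
    ("Vertical", "Large"): 375,
    ("Vertical", "Extra Large"): 575,
}


def configurations_for(peak_rps, headroom=2.0):
    """List configurations whose stable floor covers peak_rps * headroom.

    `headroom` is a hypothetical safety factor, not a value from the tests.
    Results are ordered from the lowest stable floor to the highest.
    """
    required = peak_rps * headroom
    fits = [(floor, cfg) for cfg, floor in STABLE_FLOOR.items() if floor >= required]
    return [cfg for floor, cfg in sorted(fits)]


if __name__ == "__main__":
    # Roughly 40 transactions/sec, as in the production comparison above.
    print(configurations_for(40))
```

With the 40 transactions/second figure cited above and a 2x safety factor, every tested configuration qualifies, and the single Medium web server is listed first.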

Project Detail

This section provides detail about the objectives, scope, and methodology of the performance testing effort, as well as the architecture tested.

Project Objectives

The transactional load testing objectives were:

  • Validate that the ODS / API is stable under sustained transactional load.
  • Determine practical limits of various server sizes.
  • Compare the performance of vertical and horizontal scaling strategies.
  • Report the results to assist implementers in planning for production deployments.

Scope

The transactional load testing exercised all types of API operations under varying load levels.

  • API Coverage. The testing exercised every type of domain aggregate exposed by the ODS / API except StudentGradebookEntry, covering roughly 99% of the API resource surface. The tests did not include "helper" API endpoints such as Types, Descriptors, the bulk load endpoints (discussed in a separate technical article), and the Unique ID endpoints.
  • Request Types. Transactional requests exist in four different flavors: Create, Read, Update, and Delete (CRUD) operations for each domain aggregate exposed by the ODS / API. Each operation result is categorized into either “success,” meaning the operation completed without error, or “failure,” with an error message indicating the type of error.
  • Request Load. The Load Testing application allowed for the transactional tempo to be increased by increasing the number of threads. The transactional tests also have configuration options to set the mixture ratio for how many of each operation to perform, which was important when trying to simulate different scenarios such as initial setup, enrollment, day-to-day, and end of year.
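The idea of a configurable operation mix can be sketched as a weighted random choice. This is not the Load Testing application's code: the scenario names come from the text above, but the ratios, `SCENARIO_MIX`, and `pick_operations` are hypothetical.

```python
import random

# Hypothetical mixes; the ratios actually used in the tests are not given here.
SCENARIO_MIX = {
    "initial_setup": {"create": 0.80, "read": 0.10, "update": 0.10, "delete": 0.00},
    "day_to_day": {"create": 0.10, "read": 0.60, "update": 0.25, "delete": 0.05},
}


def pick_operations(mix, count, seed=None):
    """Draw `count` CRUD operations according to the weights in `mix`."""
    rng = random.Random(seed)  # a seed makes a run repeatable
    operations = list(mix)
    weights = [mix[op] for op in operations]
    return rng.choices(operations, weights=weights, k=count)


if __name__ == "__main__":
    print(pick_operations(SCENARIO_MIX["day_to_day"], 10, seed=1))
```

Each worker thread could draw its next operation this way, so raising the thread count raises the tempo without changing the mix.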

Testing Methodology

The goal of this phase of transactional load testing was to determine approximately how many requests per second various server configurations can handle. For comparison purposes, each configuration was analyzed to determine three levels of performance.

The first level is stable throughput: a request rate that a server can handle with reasonable response times (<1 second) and sustain indefinitely. The second level is burst throughput: a rate that a server can handle, but with a noticeable impact on response times (>1 second), and that eventually leads to Service Unavailable errors if the burst continues for too long. The final level is the point of failure: the request rate that leads almost immediately to very slow response times and a noticeable number of server failures (Service Unavailable or Gateway Timeout).
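The three levels can be approximated with a small classification rule. This is a rough sketch only: the 1-second boundary comes from the definitions above, while the `outage_threshold` used to stand in for "a noticeable number of server failures" is an assumption, and the ranges published in this article were judged by the test team rather than computed with this function.

```python
def classify_run(avg_response_ms, total_requests, outage_count,
                 outage_threshold=0.05):
    """Label a test run as 'stable', 'burst', or 'failure'.

    outage_count is the number of Service Unavailable plus Gateway Timeout
    responses. outage_threshold is a hypothetical cut-off for "noticeable".
    """
    outage_fraction = outage_count / total_requests if total_requests else 0.0
    if outage_fraction >= outage_threshold:
        return "failure"
    if avg_response_ms >= 1000:
        return "burst"
    return "stable"


if __name__ == "__main__":
    print(classify_run(avg_response_ms=120, total_requests=30000, outage_count=0))
```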

The load testing was performed using a custom application available to Ed-Fi Licensees. Details on downloading, building, and running load tests using the application can be found in the Ed-Fi technical article ODS / API Load Testing Utility Cookbook (coming soon).

Testing Architecture

The Ed-Fi ODS / API can be deployed in a variety of architectural configurations, from a single server (as in a development or test machine) to various load-balanced, multi-machine configurations.

Performance tests were run against configurations representative of typical, cloud-based production environments. Both horizontally scaled and vertically scaled solutions were tested, each with a variety of server instance types. Since hardware characteristics can vary results greatly, testing was performed using Amazon Web Services (AWS) to provide a more-or-less standard point of reference.

Vertical configuration testing aimed to understand the performance profile as the web server specifications were increased, while horizontal configuration testing provided insight into performance when multiple web servers are used.

Server Configurations Used for Testing

Horizontally Scaled Server Configuration

Testing was performed against horizontally scaled components distributed on AWS in the following configurations:

| Web Servers | Database Server | Load Balancer |
| --- | --- | --- |
| 2 x Medium | Medium | AWS Elastic Load Balancing |
| 4 x Medium | Large | AWS Elastic Load Balancing |

Vertically Scaled Server Configuration

Testing was performed against vertically scaled components distributed on AWS in the following configurations:

| Web Server | Database Server | Load Balancer |
| --- | --- | --- |
| Medium | Medium | No Load Balancing |
| Large | Large | No Load Balancing |
| Extra Large | Extra Large | No Load Balancing |

Software & Platform Information

  • Microsoft Internet Information Services (IIS)
  • SQL Server 2012 Enterprise
  • Ed-Fi ODS / API v2.0 Public Release

Software Components

  • ODS Web API. Encompasses the RESTful endpoints that allow CRUD operations against the ODS database, plus the API endpoints related to the Bulk Load Services.
  • ODS Database. The SQL Server installation hosting the ODS and its supporting databases.

Test Results

This section provides detail about the server configurations and associated test results.

Horizontally Scaled Configuration Results

Horizontal testing generally showed stability across the board, up to the limits of what each infrastructure level can handle. In contrast to the issues described in the Vertically Scaled Configuration Results section below for the vertical Extra Large test, horizontal tests showed that the IIS queues were not overloaded. This is because each server has its own IIS queue, so a very large number of requests is required to fill the queues.

The horizontal configuration also drastically outperformed a vertical configuration with a similar number of CPU cores, as a result of the inherent benefits of the load balancer handling requests. The individual web servers remained stable because, if one server was tied up or blocked by a bad request, the other server(s) continued to process requests. The load balancer also helped once the configuration was under load: because a dedicated server was checking the health of the underlying web servers, clients received fast responses when the service was unavailable, and the system behaved reasonably gracefully even when overloaded.

Based on these findings and test results, we conclude that a horizontally scaled implementation is generally more performant than a vertical configuration and is the recommended approach for large-scale implementations.
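The comparison at equal core counts can be made concrete by dividing the lower bound of each stable range by the total web server vCPUs (4 per Medium, 8 per Large, 16 per Extra Large, per the Test Server Specifications). The sketch below is only a back-of-the-envelope calculation; the names `CONFIGS` and `stable_rps_per_vcpu` are illustrative.

```python
# (web server count, vCPUs per server, lower bound of stable requests/sec)
CONFIGS = {
    "Horizontal 2 x Medium": (2, 4, 525),
    "Horizontal 4 x Medium": (4, 4, 875),
    "Vertical Medium": (1, 4, 175),
    "Vertical Large": (1, 8, 375),
    "Vertical Extra Large": (1, 16, 575),
}


def stable_rps_per_vcpu(name):
    """Stable requests/sec divided by the total web server vCPUs."""
    servers, vcpus_each, stable_floor = CONFIGS[name]
    return stable_floor / (servers * vcpus_each)


if __name__ == "__main__":
    for name in CONFIGS:
        print(f"{name}: {stable_rps_per_vcpu(name):.1f} requests/sec per vCPU")
```

At 8 vCPUs, the 2 x Medium configuration sustains about 66 stable requests/sec per vCPU versus about 47 for the single Large server; at 16 vCPUs, the figures are about 55 (4 x Medium) versus about 36 (Extra Large).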

Vertically Scaled Configuration Results

Under normal, steady load, single-web-server vertical configurations were stable. However, under stress, vertical configurations failed when overloaded, oftentimes blocking up the server for up to a minute after the requests stopped being sent.

Without a load balancer, the API web server is responsible for sending the Service Unavailable response. Often the server would be so busy with requests that it would take upwards of 30 seconds to inform the client that the service was unavailable. This causes the very dramatic jumps in response time near the upper reaches of requests per second.

Finally, there were noticeable queue issues with the Extra Large test scenario. The powerful hardware in this setup caused the default configuration of the IIS queue to fill up at times, even when the server itself wasn't overloaded. This is represented in the data by the existence of Service Unavailable in small numbers even when the response time is still low and the server isn't highly utilized. This could be mitigated by adjusting the queue size on IIS when running on a stronger server.

The figures below show response times and requests per second at each request level. Graphs are shown for CPU usage on Medium, Large and Extra Large web server configurations.

Overall Server Health

In general, a healthy server should show low (sub-second) response times, and a response / second rate very similar to the request / second rate. These numbers were used to determine the approximate stable request / second range for a given configuration. As shown in the "Unhealthy Server" chart below, the response time and the number of responses per second vary greatly once the server becomes overloaded, leading to inconsistent results for client applications. The response times and responses will try to catch up because IIS and the load balancer are designed to try to recover in these scenarios, but spikes will continue to occur because the server simply can't handle the number of requests being sent to it.
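The health criteria described above can be expressed as a per-interval check. This is a minimal sketch, not the tooling used in the tests: the sub-second limit comes from the text, while the `min_ratio` tolerance for "very similar" rates and both function names are assumptions.

```python
def is_healthy(requests_per_sec, responses_per_sec, avg_response_ms,
               min_ratio=0.95, max_response_ms=1000.0):
    """True when responses keep pace with requests and stay sub-second.

    min_ratio is a hypothetical tolerance for "very similar" rates.
    """
    if requests_per_sec <= 0:
        return True  # nothing was sent, so nothing can be falling behind
    keeps_pace = responses_per_sec / requests_per_sec >= min_ratio
    return keeps_pace and avg_response_ms < max_response_ms


def unhealthy_intervals(samples):
    """Return indexes of sampling intervals that fail the health check.

    Each sample is (requests_per_sec, responses_per_sec, avg_response_ms).
    """
    return [i for i, sample in enumerate(samples) if not is_healthy(*sample)]


if __name__ == "__main__":
    run = [(200, 199, 80), (200, 120, 2500), (200, 260, 900)]
    print(unhealthy_intervals(run))
```

In the sample run, only the middle interval is flagged: responses fall well behind requests and the average response time exceeds one second.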

Error Condition Profile

Complex, multi-tier systems under load sometimes exhibit errors that are difficult to reproduce and diagnose. The following graph shows a request/response profile for an event the team encountered during testing, where internal server errors caused dramatic spikes in response times. The response times are reasonable until the scenario occurs (around 60 seconds into the test run), at which point the server stops sending back responses for a period of time. The server eventually recovers and works the queue to catch up, bursting a large number of responses. Eventually the server levels off, until the issue happens again.

In production, these types of errors need to be worked individually, and can be caused by a number of factors in the configuration or the code. (In fact, the errors the team encountered in the test runs are being investigated by Ed-Fi technologists and tracked in the Ed-Fi Issue Tracker as ODS-631 to see if a code fix is indicated.)

Recommendations

This section summarizes the recommendations based on the latest round of load testing.

  • Large-scale implementations should prefer horizontal, load-balanced scaling strategies over vertical scaling.
  • Set logging levels appropriately for production. The log4net configuration should be set to error only in production instances, except when troubleshooting. Turn off the SystemDiagnosticsTracing in production systems.

 

 

Test Result Detail

This section contains detail about the testing methods and result data from the testing.

Test Server Specifications

Web Servers

Amazon EC2 C4.x instance types were used to support the web application servers.

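| Size | Series | Model | vCPU | Mem (GiB) | SSD Storage (GB) | Dedicated EBS Throughput (Mbps) |
| --- | --- | --- | --- | --- | --- | --- |
| Small | c4 | large | 2 | 3.75 | EBS-only | 500 |
| Medium | c4 | xlarge | 4 | 7.5 | EBS-only | 750 |
| Large | c4 | 2xlarge | 8 | 15 | EBS-only | 1,000 |
| Extra Large | c4 | 4xlarge | 16 | 30 | EBS-only | 2,000 |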


Amazon C4 instances are the latest generation of Compute-optimized instances, featuring the highest performing processors and the lowest price/compute performance in EC2.

Features:

  • High frequency Intel Xeon E5-2666 v3 (Haswell) processors optimized specifically for EC2
  • EBS-optimized by default and at no additional cost
  • Ability to control processor C-state and P-state configuration on the c4.8xlarge instance type
  • Support for Enhanced Networking and Clustering

Database Server

Amazon EC2 R3.x instance types were used to support the database server.

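| Size | Series | Model | vCPU | Mem (GiB) | SSD Storage (GB) |
| --- | --- | --- | --- | --- | --- |
| Small | r3 | large | 2 | 15.25 | 1 x 32 |
| Medium | r3 | xlarge | 4 | 30.5 | 1 x 80 |
| Large | r3 | 2xlarge | 8 | 61 | 1 x 160 |
| Extra Large | r3 | 4xlarge | 16 | 122 | 1 x 320 |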

Additional drives were allocated to support the SQL data, log, and tempdb files. This was required to maximize IOPS disk performance across the volumes. R3 instances are optimized for memory-intensive applications and have the lowest cost per GiB of RAM among Amazon EC2 instance types.

Features:

  • High Frequency Intel Xeon E5-2670 v2 (Ivy Bridge) Processors
  • Lowest price point per GiB of RAM
  • SSD Storage
  • Support for Enhanced Networking

 

Test Result Data

Horizontal Test Executions - Data Table

Web | DB | Request / sec | Request Limit | Web CPU | Web Mem | DB CPU | DB Mem | Avg Request | Total Requests | Service Unavailable | Gateway Timeout | Outage % | Internal Server Error | Conflict | Precondition Failed | Not Found | Forbidden
Medium x2Medium49.75503.79%13.86%2.13%4.87%111492600004491620
Medium x2Medium99.661006.07%13.73%3.48%5.56%929899000084716130
Medium x2Medium198.682009.97%16.64%6.99%6.94%18596050009175836270
Medium x2Medium286.4530020.90%22.86%8.56%9.04%55485936000482994404420
Medium x2Medium384.6740043.49%32.36%14.21%12.58%14401154020007141579613030
Medium x2Medium496.7350065.56%41.72%17.84%15.09%2086149018000230532514087006
Medium x2Medium531.8355058.50%23.27%20.99%16.60%28215955000011748222799615
Medium x2Medium563.4457561.11%33.63%20.84%16.91%2589169031000288595215736592
Medium x2Medium564.3960056.50%16.41%17.96%18.20%55841693160002945979299315415
Medium x2Medium616.9265057.28%18.42%18.50%18.87%484118507620616492113.79811754129282817420
Medium x4Large194.272006.54%15.42%3.09%4.11%715828200018171875210
Medium x4Large380.3840014.86%15.77%4.93%5.70%353114113000654027445690
Medium x4Large588.7460023.52%16.51%8.66%8.20%46517662200023547601120840
Medium x4Large782.880038.67%18.33%11.54%11.40%2444234841000138749042599840
Medium x4Large873.5790050.83%18.30%13.95%12.98%716262071000110788316311452
Medium x4Large960.65100055.19%19.18%16.66%13.53%27982881940003959364462117169
Medium x4Large1056.82110068.32%19.87%18.65%14.95%362431704600043392984654247713
Medium x4Large1186.76120070.06%20.95%16.88%15.96%6709356027453829982.116784010407606932531
Medium x4Large1275.21130056.55%30.20%15.11%16.96%747338256243690722213.308277497397066328023

Vertical Test Executions - Data Table

Web | DB | Request / sec | Request Limit | Web CPU | Web Mem | DB CPU | DB Mem | Avg Request | Total Requests | Service Unavailable | Service Unavailable % | Internal Server Errors | Conflict | Precondition Failed | Not Found | Forbidden
MediumMedium49.74 / sec50 / sec7.30%13.00%1.61%10.64%16 ms149220042714130
MediumMedium99.57 / sec100 / sec13.80%16.20%2.95%10.66%81 ms29872001101836130
MediumMedium149.33 / sec150 / sec23.00%27.50%3.30%10.71%106 ms44799006136767200
MediumMedium199.42 / sec200 / sec39.18%13.25%5.71%10.84%1847 ms598270071898333860
MediumMedium229.51 / sec250 / sec52.83%42.62%6.40%12.45%4321 ms6885200164302111022796
MediumMedium280.25 / sec300 / sec55.66%50.18%6.35%14.23%9778 ms840741747420.78414021279636080
MediumMedium332.44 / sec350 / sec68.29%18.44%8.51%15.51%6931 ms997321399414.0316145276014799700
MediumMedium353.43 / sec400 / sec73.17%36.11%8.46%15.95%8736 ms1060282289221.5905143304319399400
LargeLarge49.45 / sec50 / sec2.76%11.53%0.99%3.73%26 ms14834002421810
LargeLarge99.63 / sec100 / sec8.25%12.23%1.56%4.16%*613 ms298880024957186140
LargeLarge197.36 / sec200 / sec14.63%16.36%2.99%4.93%*793 ms5920900732239494291
LargeLarge298.17 / sec300 / sec19.56%23.22%4.00%6.13%75 ms894510010275485390
LargeLarge396.78 / sec400 / sec36.27%32.71%5.90%7.69%383 ms119033002239444191670
LargeLarge482.55 / sec500 / sec51.85%54.65%6.29%11.54%3362 ms1447641526510.544797481310265210
LargeLarge539.55 / sec600 / sec51.35%61.25%3.34%12.39%13336 ms1618646815442.105795338922566290
LargeLarge663.40 / sec700 / sec62.69%19.19%8.07%12.77%5399 ms1990195080225.52621925587267314490
Extra LargeExtra Large49.48 / sec50 / sec2.17%7.83%0.50%2.98%94 ms148430055141860
Extra LargeExtra Large99.64 / sec100 / sec4.85%8.16%0.64%3.21%13 ms29891003113522120
Extra LargeExtra Large201.68 / sec200 / sec6.26%11.20%1.65%3.77%1088 ms6050400392162848240
Extra LargeExtra Large386.49 / sec400 / sec13.17%16.17%3.83%4.52%316 ms11594800723815379440
Extra LargeExtra Large496.51 / sec500 / sec21.17%32.63%3.31%6.82%481 ms14895221131.4186265048454850
Extra LargeExtra Large595.17 / sec600 / sec29.97%43.38%4.96%8.39%1352 ms17855077074.3164115659514224290
Extra LargeExtra Large683.10 / sec700 / sec31.23%27.78%5.16%10.29%1999 ms2049302151510.498774727117531613
Extra LargeExtra Large776.74 / sec800 / sec33.58%62.14%5.18%9.59%1562 ms233022132095.668693847016063620
Extra LargeExtra Large860.80 / sec900 / sec48.44%70.29%5.02%9.76%6583 ms2582417335228.4045117719818567682
Extra LargeExtra Large775.68 / sec1000 / sec40.02%77.73%3.01%9.92%9473 ms2327059312540.0185109444126365594


Column Definitions

  • Web. Hardware used for the ODS / API server(s). Values range from Medium to Extra Large. See Hardware chart for exact specifications.
  • DB. Hardware used for the ODS Database server. Values range from Medium to Extra Large. See Hardware chart for exact specifications.
  • Request / sec. Average number of requests submitted by the client(s) per second.
  • Request Limit. The maximum number of requests per second the client(s) were configured to submit.
  • Web CPU. Average CPU utilization percentage for all ODS / API server(s) used in the test.
  • Web Mem. Average Memory usage for all ODS / API server(s) used in the test.
  • DB CPU. Average CPU utilization percentage for the ODS Database server used in the test.
  • DB Mem. Average Memory usage for the ODS Database server used in the test.
  • Avg Request. Average response time for a client request, in milliseconds. Calculated from the time a client submits the request, to when the client hears back from the server.
  • Total Requests. Total number of requests sent by all client(s) used in the test.
  • Service Unavailable. Total number of requests that resulted in a Service Unavailable (503) error.
  • Gateway Timeout. Total number of requests that resulted in a Gateway Timeout (504) error.
  • Outage %. Percentage of the total requests that resulted in a Service Unavailable or Gateway Timeout error.
  • Internal Server Error. Total number of requests that resulted in an Internal Server Error (500) error.
  • Conflict. Total number of requests that resulted in a Conflict (409) error.
  • Precondition Failed. Total number of requests that resulted in a Precondition Failed (412) error.
  • Not Found. Total number of requests that resulted in a Not Found (404) error.
  • Forbidden. Total number of requests that resulted in a Forbidden (403) error.
  • Total Errors. Total number of requests that resulted in any kind of error.
  • Error %. Percentage of the total requests that resulted in any kind of error.
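The two percentage columns reduce to simple ratios. The following is a minimal sketch, assuming both percentages are taken over Total Requests; the function names and the sample counts are illustrative and are not taken from the load testing application or the result tables.

```python
def outage_percent(service_unavailable, gateway_timeout, total_requests):
    """Share of all requests answered with a 503 or 504, as a percentage."""
    return 100.0 * (service_unavailable + gateway_timeout) / total_requests


def error_percent(error_counts, total_requests):
    """Share of all requests that ended in any kind of error, as a percentage.

    error_counts maps an error name (e.g. "Conflict") to its request count.
    """
    return 100.0 * sum(error_counts.values()) / total_requests


if __name__ == "__main__":
    # Hypothetical counts for a 10,000-request run.
    print(outage_percent(150, 50, 10000))
    print(error_percent({"Conflict": 30, "Not Found": 20}, 10000))
```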

Error Descriptions

  • Service Unavailable. The queue in front of IIS on at least one of the web servers was too busy to accept any more requests, and the client's request was rejected. This is a side effect of heavy load, and is the best indicator of when a server is overloaded.
  • Gateway Timeout. This error only happens when a load balancer is being used, such as on the horizontal tests. This represents a scenario similar to Service Unavailable, but the overloaded server was detected ahead of time by the load balancer. This is analogous to a Service Unavailable error.
  • Internal Server Error. The server had an unhandled exception. In production, instances of these errors need to be investigated individually.
  • Conflict/Precondition Failed. Another client/thread tried to put or post an object at close to the same time. The last one in gets either a Conflict or Precondition Failed error depending on when it gets caught in the server pipeline. This is intentional behavior from the API to prevent accidental data loss when a large number of users are affecting the same records. This is an error made by the load testing client, and simulates a kind of error made by real-world client applications.
  • Not Found. A GET request tried to retrieve an object that had been deleted by another client/thread at close to the same time. The GET request will receive an error response code of Not Found. This is an error made by the load testing client, and simulates a kind of error made by real-world client applications.
  • Forbidden. A request tried to execute after another client thread had deleted a relationship which provided security information for accessing the requested entity. The request will receive an error response code of Forbidden. This is an error made by the load testing client, and simulates a kind of error made by real-world client applications.


1 The tests were performed on the then-current v2.0 of the ODS / API. Field testing indicates that v2.1.1, the latest ODS / API, has the same performance profile and characteristics.