Executive Summary

In December 2015, the Ed-Fi Alliance concluded an effort to test the full spectrum of the ODS / API capabilities under load. The testing covered both transactional operations to create, read, update, and delete entities and bulk operations supporting the import of large files. This technical article reports on the results of the transactional performance testing. Performance testing results for bulk loading are covered in another article, ODS / API Bulk Load Performance Testing.

In transactional tests, the API web server CPU and memory usage increased with activity, but the SQL Server hosting the ODS rarely experienced a spike greater than 10% CPU utilization. For this reason, testing focused on the performance characteristics of the API web server(s).

The performance tests were run against a single-web-server configuration and a load-balanced, multiple-web-server configuration. Each type of configuration was tested with increasingly powerful virtual machines. The configurations were designed to be characteristic of production environments with a vertical-scaling strategy (i.e., achieving scale by investing in a few, very powerful servers) and a horizontal-scaling strategy (i.e., achieving scale by balancing load across multiple, relatively inexpensive servers).

The performance tests applied increasing pressure (i.e., an increased number of requests per second) in stages to determine the point of stability, stress, and failure for each configuration at each virtual machine size. The virtual servers used were Amazon Web Services (AWS) machines. The high-level testing results are summarized in the table below.

Scaling Strategy	Virtual Web Server Size	Stable Requests/sec.	Burst Requests/sec.	Failure Requests/sec.
Horizontal	2 x Medium	525-550	575-600	625-650
Horizontal	4 x Medium	875-900	1050-1075	1275-1300
Vertical	Medium	175-200	225-250	275-300
Vertical	Large	375-400	475-500	650-675
Vertical	Extra Large	575-600	775-800	850-875

Detailed results and server specifications can be found later in this document.

Notes:

The ODS / API system as a whole proved to be stable under sustained transactional load.
Stability was defined as a consistent average response time of less than 1 second per request. The minimum response time for any operation on the configured system was measured at 0.013 seconds.
The load-balanced, horizontal scaling configuration outperformed the vertical scaling strategy when using a comparable number of processors and amount of memory.

The load simulated by these tests approximates a fairly high degree of activity at a mid-sized organization, using easily accessible and relatively inexpensive virtual machines. As a point of comparison, an SEA-sponsored production system with over 250K students experiences around 40 transactions/second during business hours on a “normal” day. The intent in using this testing approach was to provide a baseline for organizations to use in planning. The solution can easily be scaled to handle larger organizations or increased performance needs.

Project Detail

This section provides detail about the objectives, scope, and methodology of the performance testing effort, as well as the architecture tested.

Project Objectives

The transactional load testing objectives were:

Validate that the ODS / API is stable under sustained transactional load.
Determine practical limits of various server sizes.
Compare the performance of vertical and horizontal scaling strategies.
Report the results to assist implementers in planning for production deployments.

Scope

The transactional load testing exercised all types of API operations under varying load levels.

API Coverage. The testing exercised every type of domain aggregate exposed by the ODS / API except StudentGradebookEntry, covering roughly 99% of the API resource surface. The tests did not include "helper" API endpoints such as Types, Descriptors, the bulk load endpoints (discussed in a separate technical article), and the Unique ID endpoints.
Request Types. Transactional requests exist in four different flavors: Create, Read, Update, and Delete (CRUD) operations for each domain aggregate exposed by the ODS / API. Each operation result is categorized into either “success,” meaning the operation completed without error, or “failure,” with an error message indicating the type of error.
Request Load. The Load Testing application allowed for the transactional tempo to be increased by increasing the number of threads. The transactional tests also have configuration options to set the mixture ratio for how many of each operation to perform, which was important when trying to simulate different scenarios such as initial setup, enrollment, day-to-day, and end of year.
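
The idea of a configurable operation mix can be illustrated with a short sketch. This is not the Load Testing application's code or configuration format: the scenario names, the weights, and the plan_operations helper are all hypothetical, chosen only to show how a mixture ratio could drive which CRUD operation is issued next.

```python
import random
from collections import Counter

# Hypothetical mixture ratios (relative weights), one per scenario.
# These numbers are illustrative assumptions, not values used in the tests.
MIXES = {
    "initial_setup": {"create": 80, "read": 10, "update": 10, "delete": 0},
    "day_to_day": {"create": 10, "read": 60, "update": 25, "delete": 5},
}


def plan_operations(mix, count, seed=None):
    """Return a list of `count` operation names drawn according to `mix`."""
    rng = random.Random(seed)  # a seed makes a test run repeatable
    operations = list(mix)
    weights = [mix[op] for op in operations]
    return rng.choices(operations, weights=weights, k=count)


if __name__ == "__main__":
    plan = plan_operations(MIXES["day_to_day"], 1000, seed=42)
    print(Counter(plan))
```

Raising the number of worker threads that consume such a plan would raise the transactional tempo, while changing the weights would shift the simulated scenario.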

Testing Methodology

The goal of this phase of transactional load testing was to determine approximately how many requests per second various server configurations can handle. For comparison purposes, each configuration was analyzed to determine three levels of performance.

The first level is stable throughput, a level that a server can handle with a reasonable response time (<1 second) and continue to handle indefinitely. The second level is burst throughput, a level that a server can handle but that has a noticeable impact on response times (>1 second) and eventually leads to Service Unavailable errors if the burst continues for too long. The final level is the point of failure, the number of requests per second that leads to very slow response times and a noticeable number of server failures (Service Unavailable or Gateway Timeout) almost immediately.
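
As a rough illustration, the three levels could be expressed as a simple classifier. This is a sketch, not the procedure the testing team used: the RunResult type, the classify function, and the 5% outage cutoff standing in for a "noticeable number" of failures are assumptions introduced here. Only the 1-second response-time boundary comes from the definitions above.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    avg_response_ms: float  # average response time for the run, in milliseconds
    outage_percent: float   # % of requests answered with 503 or 504


def classify(run, outage_cutoff=5.0):
    """Label a test run as 'stable', 'burst', or 'failure'.

    outage_cutoff is a hypothetical threshold; the report does not state one.
    """
    if run.outage_percent >= outage_cutoff:
        return "failure"
    if run.avg_response_ms >= 1000:
        return "burst"
    return "stable"


if __name__ == "__main__":
    # Values taken from the horizontal (Medium x2) rows of the data table.
    print(classify(RunResult(282, 0.0)))       # request limit 550
    print(classify(RunResult(5584, 0.0)))      # request limit 600
    print(classify(RunResult(4841, 13.7981)))  # request limit 650
```

Individual runs in the result data are noisier than this rule suggests, so the ranges reported in the summary table also reflect judgment across neighboring runs.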

The load testing was performed using a custom application available to Ed-Fi Licensees. Details on downloading, building, and running load tests using the application can be found in the technical article Ed-Fi Load Testing Utility Cookbook (coming soon).

Testing Architecture

The Ed-Fi ODS / API can be deployed in a variety of architectural configurations, from a single server (as in a development or test machine) to various load-balanced, multi-machine configurations.

Performance tests were run against configurations representative of typical, cloud-based production environments. Both a horizontally scaled and a vertically scaled solution were tested, each with a variety of server instance types. Since hardware characteristics can vary results greatly, testing was performed using Amazon Web Services (AWS) to provide a more-or-less standard point of reference.

Vertical configuration testing aimed to understand the performance profile as the web server specifications were increased, while horizontal configuration testing provided insight into performance when multiple web servers are used.

Server Configurations Used for Testing

Horizontally Scaled Server Configuration

Testing was performed against horizontally scaled components distributed on AWS in the following configurations:

Web Servers	Database Server	Load Balancer
2 x Medium	Medium	AWS Elastic Load Balancing
4 x Medium	Large	AWS Elastic Load Balancing

Vertically Scaled Server Configuration

Testing was performed against vertically scaled components distributed on AWS in the following configurations:

Web Server	Database Server	Load Balancer
Medium	Medium	No Load Balancing
Large	Large	No Load Balancing
Extra Large	Extra Large	No Load Balancing

Software & Platform Information

Microsoft Internet Information Server
SQL Server 2012 Enterprise
Ed-Fi ODS / API v2.0 Public Release

Software Components

ODS Web API. Encompasses the RESTful endpoints that allow CRUD operations against the ODS database, plus the API endpoints related to the Bulk Load Services.
ODS Database. The SQL Server installation hosting the ODS and its supporting databases.

Test Results

This section provides detail about the server configurations and associated test results.

Horizontally Scaled Configuration Results

Horizontal testing generally showed stability across the board, to the point of hitting the limits of what each infrastructure level can handle. In contrast to issues described in the Vertical Scaling section below for the vertical Extra Large test, horizontal tests showed that the IIS queues were not overloaded. This is because each server has its own IIS queue, so a very large number of requests is required to fill the queues.

The horizontal configuration also drastically outperformed a similar number of CPU cores in the vertical configuration, as a result of the inherent benefits of the load balancer handling requests. The individual web servers remained stable because, if one server was tied up or blocked by a bad request, the other server(s) continued to process requests. The load balancer also helped once the configuration was under load: a dedicated server checking the health of the underlying web servers provided fast responses once the service was unavailable, and reasonably graceful behavior even when overloaded.

Based on these findings and test results, we conclude that a horizontally scaled implementation is generally more performant than a vertical configuration and is the recommended approach for large-scale implementations.
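
The comparison can be made concrete by normalizing stable throughput by the total number of web server vCPUs. The sketch below uses the upper bound of each stable range from the summary table and the vCPU counts from the web server specifications (Medium = 4, Large = 8, Extra Large = 16). The CONFIGS list and the stable_per_vcpu helper are illustrative names, and the calculation ignores differences in database server size between configurations.

```python
# (strategy, web servers, total web vCPUs, upper bound of stable requests/sec)
CONFIGS = [
    ("Horizontal", "2 x Medium", 2 * 4, 550),
    ("Horizontal", "4 x Medium", 4 * 4, 900),
    ("Vertical", "Medium", 4, 200),
    ("Vertical", "Large", 8, 400),
    ("Vertical", "Extra Large", 16, 600),
]


def stable_per_vcpu(vcpus, stable_rps):
    """Stable requests per second handled per web server vCPU."""
    return stable_rps / vcpus


if __name__ == "__main__":
    for strategy, size, vcpus, stable_rps in CONFIGS:
        rate = stable_per_vcpu(vcpus, stable_rps)
        print(f"{strategy:<10} {size:<11} {vcpus:>2} vCPU  {rate:6.2f} req/sec per vCPU")
```

At 8 vCPUs, 2 x Medium sustains about 69 requests/sec per vCPU against 50 for a single Large server. At 16 vCPUs, 4 x Medium sustains about 56 against roughly 38 for a single Extra Large server.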

Vertically Scaled Configuration Results

Under normal, steady load, single-web-server vertical configurations were stable. However, under stress, vertical configurations failed when overloaded, oftentimes blocking up the server for up to a minute after the requests stopped being sent.

Without a load balancer, the API web server is responsible for sending the Service Unavailable response. Often the server would be so busy with requests that it would take upwards of 30 seconds to inform the client that the service was unavailable. This causes the very dramatic jumps in response time near the upper reaches of requests per second.

Finally, there were noticeable queue issues with the Extra Large test scenario. The powerful hardware in this setup caused the default configuration of the IIS queue to fill up at times, even when the server itself wasn't overloaded. This is represented in the data by the existence of Service Unavailable in small numbers even when the response time is still low and the server isn't highly utilized. This could be mitigated by adjusting the queue size on IIS when running on a stronger server.

The figures below show response times and requests per second at each request level. Graphs are shown for CPU usage on Medium, Large and Extra Large web server configurations.

Overall Server Health

In general, a healthy server should show low (sub-second) response times, and a response / second rate very similar to the request / second rate. These numbers were used to determine the approximate stable request / second range for a given configuration. As shown in the "Unhealthy Server" chart below, the response time and the number of responses per second vary greatly once the server gets overloaded, leading to inconsistent results for client applications. The response times and responses will try to catch up because IIS and the load balancer are designed to try to recover in these scenarios, but spikes will continue to occur because the server simply can't handle the number of requests being sent to it.

Error Condition Profile

Complex, multi-tier systems under load sometimes exhibit errors that aren't reproducible and are difficult to diagnose. The following graph shows a request/response profile for an event the team encountered during testing, where internal server errors caused dramatic spikes in response times. The response times are reasonable until the scenario occurs (around 60 seconds into the test run), at which point the server stops sending back responses for a period of time. The server eventually recovers and works the queue to catch up, bursting a large number of responses. Eventually the server levels off, until the issue happens again.

In production, these types of errors need to be worked individually, and can be caused by a number of factors in the configuration or the code. (In fact, the errors the team encountered in the test runs are being investigated by Ed-Fi technologists and tracked in the Ed-Fi Issue Tracker as ODS-631 to see if a code fix is indicated.)

Recommendations

This section summarizes the recommendations based on the latest round of load testing.

Large-scale implementations should prefer horizontal, load-balanced scaling strategies over vertical scaling.
Set logging levels appropriately for production. The log4net configuration should be set to error only in production instances, except when troubleshooting. Turn off the SystemDiagnosticsTracing in production systems.

Test Result Detail

This section contains detail about the testing methods and result data from the testing.

Test Server Specifications

Web Servers

Amazon EC2 C4.x instance types were used to support the web application servers.

Size	Series	Model	vCPU	Mem (GiB)	SSD Storage (GB)	Dedicated EBS Throughput (Mbps)
Small	c4	large	2	3.75	EBS-only	500
Medium	c4	xlarge	4	7.5	EBS-only	750
Large	c4	2xlarge	8	15	EBS-only	1,000
Extra Large	c4	4xlarge	16	30	EBS-only	2,000

C4 instances are the latest generation of Compute-optimized instances, featuring the highest performing processors and the lowest price/compute performance in EC2.

Features:

High frequency Intel Xeon E5-2666 v3 (Haswell) processors optimized specifically for EC2
EBS-optimized by default and at no additional cost
Ability to control processor C-state and P-state configuration on the c4.8xlarge instance type
Support for Enhanced Networking and Clustering

Database Server

Amazon EC2 R3.x instance types were used to support the database server.

Size	Series	Model	vCPU	Mem (GiB)	SSD Storage (GB)
Small	r3	large	2	15.25	1 x 32
Medium	r3	xlarge	4	30.5	1 x 80
Large	r3	2xlarge	8	61	1 x 160
Extra Large	r3	4xlarge	16	122	1 x 320

Additional drives were allocated to support the SQL data, log, and tempdb files. This was required to maximize IOPS disk performance across the volumes.

R3 instances are optimized for memory-intensive applications and have the lowest cost per GiB of RAM among Amazon EC2 instance types.

Features:

High Frequency Intel Xeon E5-2670 v2 (Ivy Bridge) Processors
Lowest price point per GiB of RAM
SSD Storage
Support for Enhanced Networking

Test Result Data

Horizontal Test Executions - Data Table

Web	DB	Request / sec	Request Limit	Web CPU	Web Mem	DB CPU	DB Mem	Avg Request	Total Requests	Service Unavailable	Gateway Timeout	Outage %	Internal Server Error	Conflict	Precondition Failed	Not Found	Forbidden
Medium x2	Medium	49.75	50	3.79%	13.86%	2.13%	4.87%	11	14926	0	0	0	0	449	16	2	0
Medium x2	Medium	99.66	100	6.07%	13.73%	3.48%	5.56%	9	29899	0	0	0	0	847	16	13	0
Medium x2	Medium	198.68	200	9.97%	16.64%	6.99%	6.94%	18	59605	0	0	0	9	1758	36	27	0
Medium x2	Medium	286.45	300	20.90%	22.86%	8.56%	9.04%	554	85936	0	0	0	48	2994	404	42	0
Medium x2	Medium	384.67	400	43.49%	32.36%	14.21%	12.58%	1440	115402	0	0	0	71	4157	961	303	0
Medium x2	Medium	496.73	500	65.56%	41.72%	17.84%	15.09%	2086	149018	0	0	0	230	5325	1408	700	6
Medium x2	Medium	531.83	550	58.50%	23.27%	20.99%	16.60%	282	159550	0	0	0	117	4822	279	96	15
Medium x2	Medium	563.44	575	61.11%	33.63%	20.84%	16.91%	2589	169031	0	0	0	288	5952	1573	659	2
Medium x2	Medium	564.39	600	56.50%	16.41%	17.96%	18.20%	5584	169316	0	0	0	294	5979	2993	1541	5
Medium x2	Medium	616.92	650	57.28%	18.42%	18.50%	18.87%	4841	185076	20616	4921	13.7981	175	4129	2828	1742	0
Medium x4	Large	194.27	200	6.54%	15.42%	3.09%	4.11%	71	58282	0	0	0	18	1718	75	21	0
Medium x4	Large	380.38	400	14.86%	15.77%	4.93%	5.70%	353	114113	0	0	0	65	4027	445	69	0
Medium x4	Large	588.74	600	23.52%	16.51%	8.66%	8.20%	465	176622	0	0	0	235	4760	1120	84	0
Medium x4	Large	782.8	800	38.67%	18.33%	11.54%	11.40%	2444	234841	0	0	0	138	7490	4259	984	0
Medium x4	Large	873.57	900	50.83%	18.30%	13.95%	12.98%	716	262071	0	0	0	110	7883	1631	145	2
Medium x4	Large	960.65	1000	55.19%	19.18%	16.66%	13.53%	2798	288194	0	0	0	395	9364	4621	1716	9
Medium x4	Large	1056.82	1100	68.32%	19.87%	18.65%	14.95%	3624	317046	0	0	0	433	9298	4654	2477	13
Medium x4	Large	1186.76	1200	70.06%	20.95%	16.88%	15.96%	6709	356027	4538	2998	2.1167	840	10407	6069	3253	1
Medium x4	Large	1275.21	1300	56.55%	30.20%	15.11%	16.96%	7473	382562	43690	7222	13.3082	774	9739	7066	3280	23

Vertical Test Executions - Data Table

Web	DB	Request / sec	Request Limit	Web CPU	Web Mem	DB CPU	DB Mem	Avg Request	Total Requests	Service Unavailable	Service Unavailable %	Internal Server Errors	Conflict	Precondition Failed	Not Found	Forbidden
Medium	Medium	49.74 / sec	50 / sec	7.30%	13.00%	1.61%	10.64%	16 ms	14922	0	0	4	271	4	13	0
Medium	Medium	99.57 / sec	100 / sec	13.80%	16.20%	2.95%	10.66%	81 ms	29872	0	0	1	1018	36	13	0
Medium	Medium	149.33 / sec	150 / sec	23.00%	27.50%	3.30%	10.71%	106 ms	44799	0	0	6	1367	67	20	0
Medium	Medium	199.42 / sec	200 / sec	39.18%	13.25%	5.71%	10.84%	1847 ms	59827	0	0	7	1898	333	86	0
Medium	Medium	229.51 / sec	250 / sec	52.83%	42.62%	6.40%	12.45%	4321 ms	68852	0	0	164	3021	1102	279	6
Medium	Medium	280.25 / sec	300 / sec	55.66%	50.18%	6.35%	14.23%	9778 ms	84074	17474	20.7841	40	2127	963	608	0
Medium	Medium	332.44 / sec	350 / sec	68.29%	18.44%	8.51%	15.51%	6931 ms	99732	13994	14.0316	145	2760	1479	970	0
Medium	Medium	353.43 / sec	400 / sec	73.17%	36.11%	8.46%	15.95%	8736 ms	106028	22892	21.5905	143	3043	1939	940	0
Large	Large	49.45 / sec	50 / sec	2.76%	11.53%	0.99%	3.73%	26 ms	14834	0	0	2	421	8	1	0
Large	Large	99.63 / sec	100 / sec	8.25%	12.23%	1.56%	4.16%	*613 ms	29888	0	0	24	957	186	14	0
Large	Large	197.36 / sec	200 / sec	14.63%	16.36%	2.99%	4.93%	*793 ms	59209	0	0	73	2239	494	29	1
Large	Large	298.17 / sec	300 / sec	19.56%	23.22%	4.00%	6.13%	75 ms	89451	0	0	10	2754	85	39	0
Large	Large	396.78 / sec	400 / sec	36.27%	32.71%	5.90%	7.69%	383 ms	119033	0	0	22	3944	419	167	0
Large	Large	482.55 / sec	500 / sec	51.85%	54.65%	6.29%	11.54%	3362 ms	144764	15265	10.5447	97	4813	1026	521	0
Large	Large	539.55 / sec	600 / sec	51.35%	61.25%	3.34%	12.39%	13336 ms	161864	68154	42.1057	95	3389	2256	629	0
Large	Large	663.40 / sec	700 / sec	62.69%	19.19%	8.07%	12.77%	5399 ms	199019	50802	25.5262	192	5587	2673	1449	0
Extra Large	Extra Large	49.48 / sec	50 / sec	2.17%	7.83%	0.50%	2.98%	94 ms	14843	0	0	5	514	18	6	0
Extra Large	Extra Large	99.64 / sec	100 / sec	4.85%	8.16%	0.64%	3.21%	13 ms	29891	0	0	3	1135	22	12	0
Extra Large	Extra Large	201.68 / sec	200 / sec	6.26%	11.20%	1.65%	3.77%	1088 ms	60504	0	0	39	2162	848	24	0
Extra Large	Extra Large	386.49 / sec	400 / sec	13.17%	16.17%	3.83%	4.52%	316 ms	115948	0	0	72	3815	379	44	0
Extra Large	Extra Large	496.51 / sec	500 / sec	21.17%	32.63%	3.31%	6.82%	481 ms	148952	2113	1.4186	26	5048	454	85	0
Extra Large	Extra Large	595.17 / sec	600 / sec	29.97%	43.38%	4.96%	8.39%	1352 ms	178550	7707	4.3164	115	6595	1422	429	0
Extra Large	Extra Large	683.10 / sec	700 / sec	31.23%	27.78%	5.16%	10.29%	1999 ms	204930	21515	10.4987	74	7271	1753	161	3
Extra Large	Extra Large	776.74 / sec	800 / sec	33.58%	62.14%	5.18%	9.59%	1562 ms	233022	13209	5.6686	93	8470	1606	362	0
Extra Large	Extra Large	860.80 / sec	900 / sec	48.44%	70.29%	5.02%	9.76%	6583 ms	258241	73352	28.4045	117	7198	1856	768	2
Extra Large	Extra Large	775.68 / sec	1000 / sec	40.02%	77.73%	3.01%	9.92%	9473 ms	232705	93125	40.0185	109	4441	2636	559	4

Column Definitions

Web. Hardware used for the ODS / API server(s). Values range from Medium to Extra Large. See Hardware chart for exact specifications.
DB. Hardware used for the ODS Database server. Values range from Medium to Extra Large. See Hardware chart for exact specifications.
Request / sec. Average number of requests submitted by the client(s) per second.
Request Limit. The maximum number of requests per second the client(s) were configured to submit.
Web CPU. Average CPU utilization percentage for all ODS / API server(s) used in the test.
Web Mem. Average Memory usage for all ODS / API server(s) used in the test.
DB CPU. Average CPU utilization percentage for the ODS Database server used in the test.
DB Mem. Average Memory usage for the ODS Database server used in the test.
Avg Request. Average response time for a client request, in milliseconds. Calculated from the time a client submits the request, to when the client hears back from the server.
Total Requests. Total number of requests sent by all client(s) used in the test.
Service Unavailable. Total number of requests that resulted in a Service Unavailable (503) error.
Gateway Timeout. Total number of requests that resulted in a Gateway Timeout (504) error.
Outage %. Percentage of the total requests that resulted in a Service Unavailable or Gateway Timeout error.
Internal Server Error. Total number of requests that resulted in an Internal Server Error (500) error.
Conflict. Total number of requests that resulted in a Conflict (409) error.
Precondition Failed. Total number of requests that resulted in a Precondition Failed (412) error.
Not Found. Total number of requests that resulted in a Not Found (404) error.
Forbidden. Total number of requests that resulted in a Forbidden (403) error.
Total Errors. Total number of requests that resulted in any kind of error.
Error %. Percentage of the total requests that resulted in any kind of error.
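
As a worked check on the outage columns, the published Outage % and Service Unavailable % values are consistent with dividing the outage responses (503 plus 504) by Total Requests. The helper below is a sketch with a hypothetical name, shown only to make that arithmetic explicit.

```python
def outage_percent(total_requests, service_unavailable, gateway_timeout=0):
    """Percentage of all requests answered with a 503 or 504 response."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * (service_unavailable + gateway_timeout) / total_requests


if __name__ == "__main__":
    # Horizontal, Medium x2, request limit 650 (table value: 13.7981)
    print(round(outage_percent(185076, 20616, 4921), 4))
    # Vertical, Medium, request limit 300 (table value: 20.7841)
    print(round(outage_percent(84074, 17474), 4))
```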

Error Descriptions

Service Unavailable. The queue in front of IIS on at least one of the web servers was too busy to accept any more requests, and the client's request was rejected. This is a side effect of heavy load, and is the best indicator of when a server is overloaded.
Gateway Timeout. This error only happens when a load balancer is being used, such as on the horizontal tests. This represents a scenario similar to Service Unavailable, but the overloaded server was detected ahead of time by the load balancer. This is analogous to a Service Unavailable error.
Internal Server Error. The server had an unhandled exception. In production, instances of these errors need to be investigated individually.
Conflict/Precondition Failed. Another client/thread tried to put or post an object at close to the same time. The last one in gets either a Conflict or Precondition Failed error depending on when it gets caught in the server pipeline. This is intentional behavior from the API to prevent accidental data loss when a large number of users are affecting the same records. This is an error made by the load testing client, and simulates a kind of error made by real-world client applications.
Not Found. A GET request tried to retrieve an object that had been deleted by another client/thread at close to the same time. The GET request will receive an error response code of Not Found. This is an error made by the load testing client, and simulates a kind of error made by real-world client applications.
Forbidden. A request tried to execute after another client thread had deleted a relationship which provided security information for accessing the requested entity. The request will receive an error response code of Forbidden. This is an error made by the load testing client, and simulates a kind of error made by real-world client applications.
