Ed-Fi Data Validation Architecture

Prepared for the Ed-Fi Alliance by: Jonathan Hickam, Learning Tapestry

Introduction

Problem Domain

Ed-Fi technology aims to provide a centralized data repository and APIs for K-12 education information, which in turn enables cost-effective, standards-based data sharing across diverse systems at the state and district education agency level. However, the solution is only as useful as the quality of the data in the Ed-Fi Operational Data Store (ODS). In education data ecosystems, the initial data inputs are often manual, error-prone processes, which leads to varying levels of data quality. Additionally, complex implementation-specific business rules for validating the data can only be run once all of the data has been collected.

One of the primary challenges facing the Ed-Fi community is improving the quality of the data in the ODS. This architecture document focuses on enabling scalable and reusable solutions for 'level 2 validations': data validations that happen asynchronously after the data has been submitted, rather than in real time in the API as the data is submitted.

Document History

This document has been prepared by Learning Tapestry on behalf of the Ed-Fi Alliance.

This work builds heavily upon the work on Ed-Fi validations presented at the 2018 Ed-Fi Technical Congress by Vinaya Maya, Software Development Lead – Ed-Fi, and Britto Augustine, Chief Technology Officer – Arizona Dept. of Education, as seen in this presentation.

An initial version of this document was distributed to the Technical Advisory Group for comments on April 2, 2019.  The document, collected comments, and associated issues were discussed at the Data Validation session at the 2019 Ed-Fi Technical Congress (draft version here, slides here and notes here).  

Following the 2019 Technical Congress, this work was split into three different documents:

  • This 'Ed-Fi Data Validation Architecture' document, which lays out an overview of the proposed architecture and terminology for the general level 2 validation solution. This document forms the foundation for the additional two documents.
  • The Ed-Fi Level 2 Data Validation System Requirements document, which lays out the requirements for a level 2 validation solution, as gathered from multiple sources in the Ed-Fi Community. This area of functionality is represented by the orange 'Level 2 Validation Infrastructure' box in the architectural diagram in the following section.
  • The Ed-Fi Validation API Design document, which suggests a data structure for an API for Student Information Systems and/or other systems in the Ed-Fi ecosystem to retrieve the results from level 2 validations and feed them back to the end user. This area of functionality is represented by the yellow 'Validation API Infrastructure' box in the architectural diagram in the following section.

In addition, a related effort that the Alliance is sponsoring is to research and identify a list of open source rules engines that meet the requirements of the Level 2 Data Validation System Requirements, with the goal of being able to recommend rule engine options to Ed-Fi ODS/API implementations, particularly at the LEA level.

Data Validation Architecture

Architecture Overview

Architectural Component Roles

The proposed validation architecture utilizes these components as follows:

Rules Engine

  • Reads the rules repository and executes those rules against the ODS at the appropriate times
  • Sends validation results to the validation results repository
  • Is NOT responsible for de-duplication, suppression, acknowledgement, or access authorization

Rules Repository

  • Stores validation rules
  • Stores information about how rules are grouped into sets
  • Stores schedule information about when rules and/or sets are to run
  • Provides archive and backup functionality of rules

Validation Results Repository

  • Stores validation results that have been raised by the validation engine

Validation Results API

  • Provides API endpoint for consumption of validation results from the repository
  • Uses the authorization framework from Ed-Fi to limit access to specific resources based on key / secret (This framework is still to-be-determined)

Student Information System or other Validation API Consumers

  • Pulls validation results from API
  • Handles routing of validation results based on education organization in the validation result
  • Handles de-duplication of validation results based on the unique signature in the validation result
  • Handles suppression / acknowledgement tracking of known issues

Vocabulary and Conventions

The following terms and conventions are used in this requirement document:

Term | Definition
Validation Rule | A validation rule is a logical expression that specifies the conditions that will raise a validation result.
Validation Result | A validation result is a specific example of an instance that was triggered by a validation rule. The term 'validation result' as a generic term is used for any type of result that can be generated with the terms 'major validation error', 'minor validation error', and 'validation warning' used to denote specific severity levels for the validation results as described in the functional requirements section 1.2.
Validation Engine | The validation engine is the 'actor' that is using validation rules to find validation results.
Validation Rule Set | The validation rule set is a set of validation rules that are initiated in the same scheduled event.
Entity | An entity is a tangible core data element. Examples of entities include Student, EducationOrganization, School, and Section. Every entity will have a table associated with it in an Ed-Fi ODS database (but not every table is associated with an entity).
Entity Attribute (also referred to as “attribute”) | An attribute is a value that describes an entity. In the database, attributes are generally associated with columns on the entity table or child entities.
Child Entity | This is an entity that relates back to a single instance of an entity and is usually implemented in the database with a foreign key constraint.
Parent Entity | This is an entity that can have multiple child entities and is usually implemented in the database with a foreign key constraint.

Levels of Validation Analysis

Prior to this endeavor, the Ed-Fi community discussed validation errors as either being “level 1”, indicating that it could be caught in the API, or “level 2”, indicating that it would be caught after the data was submitted by an asynchronous data validation process. Upon further analysis, it is necessary to differentiate between the API validations that require a database and those that do not, resulting in a division of level 1 into two levels: level 0 and level 1.  It is also useful  to differentiate between the post-API validations that could happen within the context of the ODS using conditional logic and those that would require either procedural logic or calls to outside systems. Therefore, level 2 validations were divided into two levels: level 2 and level 3. The proposed validation rule complexity levels are as follows:

Validation Level | Definition
Level 0 | Level 0 validation occurs on the structure of the data against the JSON schema that happens in the API without the context of the database. These validations are performed during data submission via the API.
Level 1 | Level 1 validations are database validations, including foreign key violations and other constraints, NOT NULL column validations, and data type validations. These validations are caught during data submission via the API, as long as the API is being used in its original context with an Ed-Fi ODS.
Level 2 | Level 2 validations are validations that happen after the submissions against the data in the Ed-Fi ODS but can still be evaluated using data in the ODS database. This is the area of focus of this requirements specification document.
Level 3 | Level 3 validations are complex validations that will require complex procedural logic, “human-in-the-loop” processing, and/or access to data sources outside of the ODS. This could be considered a catch-all for the rules that do not fit well into the level 2 lexicon. This document will address the requirements for delegating analysis of level 3 rules to an external processor.
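
The difference between level 0 and level 1 can be illustrated with a minimal sketch. It assumes a hand-rolled field check as a stand-in for JSON schema validation and an in-memory SQLite database as a stand-in for the ODS. The table layout, field names, and helper functions are hypothetical simplifications, not the actual Ed-Fi API or ODS schema.

```python
import sqlite3

# Hypothetical structural contract for one submitted resource.
REQUIRED_FIELDS = {"studentUniqueId": str, "schoolId": int, "entryDate": str}


def level0_errors(payload):
    """Level 0: structural checks only, with no database access."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    return errors


def level1_errors(conn, payload):
    """Level 1: constraint violations reported by the database on insert."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO StudentSchoolAssociation "
                "(StudentUniqueId, SchoolId, EntryDate) VALUES (?, ?, ?)",
                (payload["studentUniqueId"], payload["schoolId"], payload["entryDate"]),
            )
    except sqlite3.IntegrityError as exc:
        return [str(exc)]
    return []


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE School (SchoolId INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE StudentSchoolAssociation ("
    "StudentUniqueId TEXT NOT NULL, "
    "SchoolId INTEGER NOT NULL REFERENCES School (SchoolId), "
    "EntryDate TEXT NOT NULL)"
)
conn.execute("INSERT INTO School (SchoolId) VALUES (867530193)")
conn.commit()

malformed = {"studentUniqueId": "216028", "schoolId": "867530193"}
unknown_school = {"studentUniqueId": "216028", "schoolId": 111, "entryDate": "2014-08-20"}
valid = {"studentUniqueId": "216028", "schoolId": 867530193, "entryDate": "2014-08-20"}

print(level0_errors(malformed))              # fails level 0: bad type, missing field
print(level1_errors(conn, unknown_school))   # passes level 0, fails level 1 (foreign key)
print(level1_errors(conn, valid))            # passes both levels
```

Both kinds of check can reject a record at submission time. A level 2 rule, by contrast, can only be evaluated after related records have been loaded.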

Special considerations for level 2 and higher validation rules.

  1. Analysis of these rules can only happen once the data input is complete.  Because of this, the results of level 2 and 3 validations must be communicated asynchronously and cannot be determined during API or database submissions.
  2. Level 2 and 3 validation rules will vary widely between implementations. Not all education agencies will have the same validation rules.
  3. The majority of data errors that will be caught by level 2 and 3 validation rules will require fixes at the district or school level, against the source data (or system of truth), and often will require “Human in the Loop” type fixes - such as asking staff or guardians for confirmations, or re-examining paper documents.

Anatomy of a Validation Rule

In general a validation rule should have the following parts:

Validation Rule Term | Definition
Validation Rule Code | The rule code will uniquely identify the rule within the implementation. The code will be human-readable and may or may not have embedded meaning for the organization. For example, one implementation may use the word 'student' along with an integer for all of the student rules, i.e. student1, student2, student3. Another may use more of a decimal system to imply an organizational hierarchy for the rules, so that all student rules start with "1.", then under that attendance rules would be "1.1.x" and special education rules would be "1.2.x" etc.
Validation Result Level | This part of the validation rule denotes the importance of the result. Possible values are “validation warning”, “minor validation error”, and “major validation error”.
Validation Target Resource | This is the entity being validated by the rule and will be one of the Ed-Fi resources.
Conditional | This is a logical statement that, when it evaluates to true, will cause a validation result. This is where the complexity of the validation logic is seated. The conditional can use data from within the validation target element, data from a child element of the target, or data from a parent element of the target. Conditionals also contain SQL-like comparison and inspection operators including: EQUAL, NOT NULL, INCLUDES, NOT EXISTS, EXISTS, GREATER/LESS THAN, BETWEEN, and IN. Conditionals can contain standard logical operators AND, OR, and NOT to combine multiple conditionals into one conditional.
Validation Category Descriptor | The validation category descriptor is an optional descriptor value that would classify the validation rule and the subsequent validation results. Example values might be something like 'attendance' or 'enrollment.'
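
One possible in-memory shape for these parts is sketched below. It is an illustration only: the class, field, and helper names are assumptions, conditionals are modeled as plain Python predicates over a flattened record, and only the NOT NULL, AND, and NOT operators are shown. The sample condition mirrors the ExitWithdrawDate example used elsewhere in this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

# A conditional takes one (hypothetical, flattened) record and returns True
# when a validation result should be raised.
Conditional = Callable[[dict], bool]


class ResultLevel(Enum):
    WARNING = "validation warning"
    MINOR_ERROR = "minor validation error"
    MAJOR_ERROR = "major validation error"


def not_null(field: str) -> Conditional:
    """NOT NULL inspection operator."""
    return lambda record: record.get(field) is not None


def all_of(*conditionals: Conditional) -> Conditional:
    """Logical AND over several conditionals."""
    return lambda record: all(c(record) for c in conditionals)


def negate(conditional: Conditional) -> Conditional:
    """Logical NOT of a conditional."""
    return lambda record: not conditional(record)


@dataclass(frozen=True)
class ValidationRule:
    code: str
    result_level: ResultLevel
    target_resource: str
    conditional: Conditional
    category: Optional[str] = None  # the category descriptor is optional


rule = ValidationRule(
    code="student1",
    result_level=ResultLevel.MAJOR_ERROR,
    target_resource="Student",
    conditional=all_of(
        not_null("exitWithdrawDate"),
        negate(not_null("exitWithdrawDescriptor")),
    ),
    category="enrollment",
)

record = {"exitWithdrawDate": "2014-11-11", "exitWithdrawDescriptor": None}
print(rule.conditional(record))  # prints True
```

Building conditionals from small composable pieces keeps each operator testable on its own. A real rules engine would more likely express them in a rule language or SQL.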

Validation Rule Examples

Validation rule #1: Raise a major validation error for every Section where the method of instruction "Face-to-face instruction" does not exist in at least one child element in StaffSectionAssociation.

  • Validation Rule Code = 1
  • Validation Result Level = 'major validation error'
  • Validation Target Resource = 'Section'
  • Conditional = 'where method of instruction "Face-to-face instruction"' AND "not exists child record in StaffSectionAssociation"

Validation rule #4: Raise a major error of category "enrollment" for every Student where the StudentSchoolAssociation has an ExitWithdrawDate specified but not an ExitWithdrawDescriptor specified.

  • Validation Rule Code = 4
  • Validation Result Level = 'major validation error'
  • Validation Target Resource = 'Student'
  • Conditional = "StudentSchoolAssociation has an ExitWithdrawDate specified but not an ExitWithdrawDescriptor specified"
  • Validation Category Descriptor = "Enrollment"
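
As a rough illustration of how a rules engine might evaluate rule #4 against the database, the sketch below expresses the conditional as SQL and runs it against an in-memory SQLite stand-in. The single-table layout, column names, and the 'Transferred' descriptor value are simplifying assumptions and do not reflect the actual ODS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified stand-in for the ODS table.
    CREATE TABLE StudentSchoolAssociation (
        StudentUniqueId TEXT NOT NULL,
        SchoolId INTEGER NOT NULL,
        ExitWithdrawDate TEXT,
        ExitWithdrawDescriptor TEXT
    );
    INSERT INTO StudentSchoolAssociation VALUES
        ('216028', 867530193, '2014-11-11', NULL),
        ('216029', 867530193, '2014-11-11', 'Transferred'),
        ('216030', 867530193, NULL, NULL);
    """
)

# Rule #4 conditional: an exit date is specified but the descriptor is not.
RULE_4_SQL = """
    SELECT StudentUniqueId, SchoolId, ExitWithdrawDate
    FROM StudentSchoolAssociation
    WHERE ExitWithdrawDate IS NOT NULL
      AND ExitWithdrawDescriptor IS NULL
"""

violations = conn.execute(RULE_4_SQL).fetchall()
for student_id, school_id, exit_date in violations:
    print(f"rule 4: student {student_id} at school {school_id} "
          f"has exit date {exit_date} but no exit descriptor")
```

Only the first row satisfies the conditional. The second row has both values and the third has neither, so neither raises a validation result.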

Anatomy of a Validation Result

Validation results will have the following data elements:

Validation Result Term | Definition
Validation Rule Code | This is the id that links to the validation rule described above.
Validation Resource | This identifies the Ed-Fi API resource type that was validated.
Validation Resource Id | This is the id value associated with the specific instance of the Ed-Fi resource that caused the validation result. This should be the natural key in the ODS that is also used by the existing Ed-Fi API.
Validation Result Details | Details specifying the values that were used in the evaluation of the conditional.
Timestamp | The specific date/time the validation result was raised.
EducationOrganization | This information allows the system that is consuming validation results to route the result to the correct end-user. For example, all of the student registration errors for a given school could be made available to the SIS users at that school.

Validation Result Example

For this example, consider validation rule #4 above finding an error for studentID 216028 at schoolId 867530193.

  • Validation Rule Code = 4
  • Validation Resource = "Student"
  • Validation Resource Id = "216028"
  • Validation Result Details = "ExitWithdrawDate = 2014-11-11 and ExitWithdrawDescriptor is NULL"
  • EducationOrganization = "867530193"
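
The example result above could be carried between components as a small record. The following sketch is one hypothetical representation. The attribute names and the JSON layout are assumptions rather than the structure defined in the Ed-Fi Validation API Design document, and the timestamp is simply the time at which the sketch runs.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ValidationResult:
    rule_code: str
    resource: str
    resource_id: str
    details: str
    timestamp: str               # ISO 8601 date/time the result was raised
    education_organization: str


result = ValidationResult(
    rule_code="4",
    resource="Student",
    resource_id="216028",
    details="ExitWithdrawDate = 2014-11-11 and ExitWithdrawDescriptor is NULL",
    timestamp=datetime.now(timezone.utc).isoformat(),
    education_organization="867530193",
)

# Serialize the result, for example to hand it to a results repository.
payload = json.dumps(asdict(result), indent=2)
print(payload)
```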

Validation Result Signature

One of the key properties of validation results is that the combination of the validation rule code and validation resource id will be unique across multiple runs of the validation engine.  This signature may be important for consuming systems when differentiating between a new validation situation and one that has previously been identified but may have had some underlying data changes.
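
A consuming system could use the signature roughly as follows. This is a minimal sketch assuming results arrive as dictionaries with hypothetical key names. It combines the three consumer responsibilities listed in the component roles: routing by education organization, de-duplication by signature, and suppression of acknowledged issues. The Section identifier and the second school id are made up for illustration.

```python
from collections import defaultdict

# Hypothetical result records pulled from the validation results API.
results = [
    {"ruleCode": "4", "resourceId": "216028", "educationOrganization": "867530193"},
    # Same situation reported again by a later run of the validation engine.
    {"ruleCode": "4", "resourceId": "216028", "educationOrganization": "867530193"},
    {"ruleCode": "1", "resourceId": "ALG-1-01", "educationOrganization": "867530193"},
    {"ruleCode": "4", "resourceId": "300417", "educationOrganization": "255901001"},
]

# Signatures that an end user has already acknowledged as known issues.
acknowledged = {("1", "ALG-1-01")}


def signature(result):
    """Validation rule code plus validation resource id."""
    return (result["ruleCode"], result["resourceId"])


def build_work_queues(results, acknowledged):
    """Group results by education organization, skipping repeats and known issues."""
    seen = set()
    queues = defaultdict(list)
    for result in results:
        sig = signature(result)
        if sig in seen or sig in acknowledged:
            continue
        seen.add(sig)
        queues[result["educationOrganization"]].append(result)
    return dict(queues)


queues = build_work_queues(results, acknowledged)
for org, items in queues.items():
    print(org, [signature(item) for item in items])
```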

Reference documents

Outstanding Issues / TBDs

The following issues were identified but conclusive solutions have not yet been determined.

How will the validation results API leverage the Ed-Fi authorization model?

At the 2019 Technical Congress session, there was a strong consensus that the validation API should leverage the same authorization model that is in the existing Ed-Fi API. This is based on the assumption that the API consumer is most likely a Student Information System (SIS) that is submitting data. The SIS should have the ability to use the same key / secret to authenticate with the validation API. Upon authentication, the validation results that the SIS can see should be limited to those that are associated with the same resources that the SIS can see via the API. The outstanding issue is how this can be implemented. Options considered were:

  • Replicate the tables from the Ed-Fi schema and re-create the authorization functionality in the validation API.  This could mean rewriting the authorization framework and the duplication could therefore be problematic.
  • Create a lightweight security module, so that every validation result has an education organization id that maps to the education organization associated with the key / secret of the API client.  This is problematic if there are multiple API clients that must see the same data and it represents a vast simplification of the API authorization module that may or may not be sufficient
  • Determine a way to delegate security calls back to the existing Ed-Fi API. This might be difficult with bulk reads that tend to be common with validation results.
  • Build out the validation functionality as an extension of the existing Ed-Fi API.  A potential solution is to use the existing MetaEd extension functionality.

Is there a need to de-duplicate so that the SIS can just deal with new/changed/removed alerts?

Two important points came up when sharing the API specification with the Ed-Fi community. The first, from a state education agency with one of the more mature data validation infrastructures in place, is a recommendation that the ID on a validation result keep the same value when a given validation situation for a specific resource is identified again in subsequent validation runs. This functionality is partly supported by the alert signature, which is the combination of the ValidationRuleId and ResourceId. The two differ, however: the signature would be the same if an issue went away and came back, whereas their ID (perhaps call it an incident id?) would not.

The other feedback was from a SIS vendor that would like to be able to submit a checkpoint value to the API, and then have the API only give the validation results that were new, removed (resolved), or changed since the last checkpoint that was submitted.

These requirements point to some sort of infrastructure that would look at the validation results from run to run and discern what the deltas are. It could be cost effective to build this functionality into the validation API, or as some sort of agent that sits between the validation API and the consumers, as opposed to having every SIS vendor rebuild this same functionality in their respective systems. 
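
To make the run-to-run comparison concrete, here is a minimal sketch of such a delta component. Everything in it is hypothetical: the IncidentTracker class, the integer incident ids, and the simplification that each call to process_run stands in for one checkpoint. It is meant only to show how an incident id differs from a signature, not to propose a design.

```python
import itertools


class IncidentTracker:
    """Compares each validation run with the previous one (hypothetical helper)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._open = {}  # signature -> (incident_id, details)

    def process_run(self, run_results):
        """Return the deltas for one run.

        run_results maps each signature (rule code, resource id) raised in
        this run to its result details.
        """
        delta = {"new": [], "changed": [], "resolved": []}
        for sig, details in run_results.items():
            if sig in self._open:
                incident_id, previous_details = self._open[sig]
                if details != previous_details:
                    delta["changed"].append(incident_id)
            else:
                # First sighting, or a reappearance after being resolved:
                # a fresh incident id is issued either way.
                incident_id = next(self._next_id)
                delta["new"].append(incident_id)
            self._open[sig] = (incident_id, details)
        for sig in list(self._open):
            if sig not in run_results:
                incident_id, _ = self._open.pop(sig)
                delta["resolved"].append(incident_id)
        return delta


tracker = IncidentTracker()
sig = ("4", "216028")
run1 = tracker.process_run({sig: "ExitWithdrawDate = 2014-11-11 and ExitWithdrawDescriptor is NULL"})
run2 = tracker.process_run({sig: "ExitWithdrawDate = 2014-11-12 and ExitWithdrawDescriptor is NULL"})
run3 = tracker.process_run({})  # the issue was fixed at the source
run4 = tracker.process_run({sig: "ExitWithdrawDate = 2015-01-05 and ExitWithdrawDescriptor is NULL"})
for delta in (run1, run2, run3, run4):
    print(delta)
```

Across the four runs the signature never changes, but the incident id does: the issue is reported as new (id 1), changed (id 1), resolved (id 1), and then new again under id 2 when it comes back.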

Is there a need to differentiate between validated and unvalidated data in the ODS?

In financial applications, there is often a concept of an 'audited' versus 'unaudited' set of books. The idea is that there is binary differentiation between before the data has been audited and after. Potentially, that same sort of need could arise in the ODS. Would applications ever need to have separate interfaces for what is considered validated and what is considered unvalidated? Or is this something that is only handled by business process? Is it realistic to ever have a 'validated' data set?

What about initiating other events from Ed-Fi data?

At a high level, the validation infrastructure could be described as performing automated data analysis on the ODS (running validation rules) and depending on the results initiating business processes (creating validation results).  This same high-level framework could have uses outside of data validation, for example, initiating an automated rostering process when a new student is enrolled. Are there any architectural changes that could be made now to make this solution more suitable for these types of future applications?

Appendix A - Comparison with Prior Work

The prior working model for validation results from the 2018 Ed-Fi Technical Congress is as follows:

A slightly different structure and vocabulary are proposed in this document for additional clarity. Generally, the word 'result' is used in place of the word 'error' because some validation results are simply additional information or warnings. With regard to errContext / contextKey / contextValue, the phrase "validation resource" is used instead of "context" to more clearly denote the actual resource that is the subject of the validation rule, rather than contextual information used in the validation rule.

The most impactful change proposed is to stop nesting individual validation results as child entities within a larger validation result, so that each validation result is reported individually. Learning Tapestry believes that this will make de-duplication much easier for consuming applications, because each validation result will be represented atomically with its own signature instead of being embedded in a larger set of validation results.