The following guide provides an overview of suggested best practices when labelling and validating documents within Affinda's validation interface.
Why are validations important?
The first and most obvious reason is that the validations you make determine the data exported into your downstream system. Accurate validations are vital to keeping that data correct and to reducing the frequency of downstream issues and exceptions.
In addition, consistent validations are extremely important in ensuring that your tailored AI model continues to learn and improve. With consistent validations, the model has clearer examples from which to learn which element of a document is the correct value to predict. The model will also be more confident in its predictions, which leads to a higher rate of auto-validation and ultimately less human intervention.
Best practice guidelines
Annotate all fields that are present within a document
When validating documents, it is important to capture every field whose data is contained within the document. The model learns from annotations that say a field is present in the document, but also from the absence of an annotation, which tells it the field is not present.
For example, if a user does not capture "Invoice Number" when validating a document, the model will learn that this data point does not exist in that document. If it is actually present, this will cause errors in subsequent documents, as the model will believe that data in that format and location is not an Invoice Number.
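If you export validated data for review, the completeness rule above can be spot-checked with a small script. The following is a minimal sketch only: the field names, the `EXPECTED_FIELDS` list, and the dictionary shape are hypothetical and do not reflect Affinda's export format or API.

```python
# Hypothetical field list; replace with the fields your own model is set up to extract.
EXPECTED_FIELDS = ["invoice_number", "invoice_date", "supplier_name", "total_amount"]


def missing_fields(annotations, expected_fields=EXPECTED_FIELDS):
    """Return the expected fields that have no non-empty value in `annotations`."""
    missing = []
    for field in expected_fields:
        value = annotations.get(field)
        # Treat absent keys, None, and blank strings as "not annotated".
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Assumed shape: one dict of field name -> validated value per document.
example = {
    "invoice_date": "2024-03-01",
    "supplier_name": "Acme Pty Ltd",
    "total_amount": "120.00",
}
print(missing_fields(example))
```

A field flagged here is not necessarily an error, since it may be genuinely absent from the document. The list is only a prompt to reopen the document and confirm.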
Duplicate examples of data
If there are multiple examples of the same data point, annotate the example that is most prominent and in the most common location for that data point.
Data in logos
Only annotate data contained in logos (e.g. Supplier Name) if the data is not found elsewhere in the document.
Annotate consistently across documents of the same format
Commonly when setting up tailored models, a high proportion of the documents will be in a small number of formats. This is most obvious within invoices, where often the top 10-20 suppliers will account for over 50% of invoice volume. To get the greatest accuracy, ensure that the annotations are consistent within these documents of the same format.
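One way to check consistency within a format is to group validated documents by format (for invoices, typically the supplier) and list any fields that were annotated in some documents of that format but not in others. This is a rough sketch under assumed inputs: the `format` and `fields` keys are hypothetical and are not part of Affinda's API.

```python
from collections import defaultdict


def inconsistent_fields(documents):
    """Map each format to the fields annotated in some, but not all, of its documents."""
    by_format = defaultdict(list)
    for doc in documents:
        # Collect the names of fields that carry a non-empty value.
        annotated = {
            name for name, value in doc["fields"].items() if value not in (None, "")
        }
        by_format[doc["format"]].append(annotated)

    report = {}
    for fmt, field_sets in by_format.items():
        seen_anywhere = set().union(*field_sets)
        seen_everywhere = set.intersection(*field_sets)
        partial = sorted(seen_anywhere - seen_everywhere)
        if partial:
            report[fmt] = partial
    return report


# Assumed shape: each document has a format label and a dict of validated fields.
docs = [
    {"format": "Acme", "fields": {"invoice_number": "INV-1", "po_number": "PO-9"}},
    {"format": "Acme", "fields": {"invoice_number": "INV-2", "po_number": None}},
    {"format": "Globex", "fields": {"invoice_number": "77"}},
]
print(inconsistent_fields(docs))
```

Fields reported for a format are candidates for review rather than confirmed mistakes, because a field such as a purchase order number can legitimately appear on only some invoices from the same supplier.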
If the text is not picked up in the document, click 'Apply OCR' to extract the text fully
From time to time, a document may be submitted with a text layer that does not fully match the data in the document itself. Whilst this is uncommon, it means that Affinda will have relied on the existing text layer rather than applying OCR technology, and thus will not be able to extract the data accurately.
In the minority of cases where this occurs, users are able to select 'Apply OCR', which will apply OCR to the document and re-parse the data.