Reconciliation of Financial Documents

Though financial transactions are increasingly online and digital, much business is still conducted via analog means. Companies may prefer hand-filled paper documents for their security, simplicity, or familiarity. However, accounting for and reconciling such transactions across invoices, receipts, bills, and contracts in multiple formats can be extremely challenging for high-volume businesses.

To address this issue, we propose a machine learning-based approach that automatically and efficiently extracts and labels information of interest from disparate document formats. Once the important data has been extracted, accounting can easily be performed to compare, for instance, the amount billed vs. the amount paid for a particular customer. Our proposed approach consists of three steps:

  1. Create digital scans of paper documents.
  2. Use machine learning to detect and label document regions where relevant information is located.
  3. Perform optical character recognition (OCR) on identified regions to extract data of interest.

In the following sections we demonstrate a deep learning-based approach for solving step 2 of this proposed workflow.
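
To make the end goal concrete, here is a minimal sketch of the billed vs. paid comparison that becomes possible once steps 1 to 3 have produced structured records. The record layout, the field names (customer, doc_type, amount), and the sample values are assumptions made for illustration; they are not output from our pipeline.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical records as they might look after OCR (step 3).
# Field names and values are illustrative assumptions.
extracted = [
    {"customer": "ACME", "doc_type": "invoice", "amount": "1200.00"},
    {"customer": "ACME", "doc_type": "receipt", "amount": "1000.00"},
    {"customer": "Globex", "doc_type": "invoice", "amount": "450.50"},
    {"customer": "Globex", "doc_type": "receipt", "amount": "450.50"},
]


def reconcile(records):
    """Return {customer: billed - paid} for customers whose totals differ."""
    billed = defaultdict(Decimal)  # Decimal() is zero, so missing keys start at 0
    paid = defaultdict(Decimal)
    for rec in records:
        amount = Decimal(rec["amount"])  # Decimal avoids float rounding on money
        if rec["doc_type"] == "invoice":
            billed[rec["customer"]] += amount
        elif rec["doc_type"] == "receipt":
            paid[rec["customer"]] += amount
    customers = set(billed) | set(paid)
    return {c: billed[c] - paid[c] for c in customers if billed[c] != paid[c]}


outstanding = reconcile(extracted)
print(outstanding)
```

With the sample records above, only ACME is reported, with an outstanding balance of 200.00.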

Generating a Synthetic Dataset

Accurate prediction with deep learning typically requires a large amount of training data. To address the difficulty of acquiring a large corpus of variably formatted documents, we decided to generate our training set programmatically. For our proof-of-concept demonstration we focused on identifying three types of fillable regions: signature, initial, and date boxes.

The ReportLab library makes it easy to generate richly formatted PDFs from text. To generate PDFs we broke documents into individual components and attributes such as paragraphs, signature boxes, headings, subheadings, fonts, sizes, emphasis, and spacing. By mixing and matching these components across a variety of different values we were able to efficiently generate a diverse set of labeled PDFs for training. This process is visualized in Figure 1.


Figure 1: Workflow for generating random documents using ReportLab.

PDF generation is similar to designing a webpage with HTML tags and populating it with text: the text between the tags is customized. We add random text with random visual features and place the components at random positions. This simplifies the creation of our training set and lets us create different types of documents with different morphological features, which gives the model more varied training data and should improve prediction accuracy.
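
The mix-and-match idea can be sketched without any PDF library. The snippet below only samples a random page layout and records the bounding boxes of the fillable regions as labels; rendering each component with ReportLab is deliberately left out. The page size, margin, box dimensions, font sizes, and the function name random_document are all assumptions chosen for illustration, not the values used for our dataset.

```python
import random

PAGE_W, PAGE_H = 612, 792  # US Letter in points (assumed page size)
MARGIN = 54                # assumed margin, 0.75 inch
# Assumed (width, height) in points for each fillable region type.
FILLABLE = {"signature": (200, 40), "initial": (60, 30), "date": (120, 30)}
KINDS = ["heading", "paragraph", "signature", "initial", "date"]


def random_document(rng, n_components=8):
    """Sample a top-to-bottom layout; return (components, labels)."""
    components, labels = [], []
    y = PAGE_H - MARGIN  # cursor starts below the top margin (origin is bottom-left)
    for _ in range(n_components):
        kind = rng.choice(KINDS)
        if kind in FILLABLE:
            w, h = FILLABLE[kind]
            x = rng.randint(MARGIN, PAGE_W - MARGIN - w)  # random horizontal placement
        else:
            if kind == "paragraph":
                font_size, lines = rng.choice([9, 10, 11, 12]), rng.randint(2, 6)
            else:
                font_size, lines = rng.choice([14, 16, 18]), 1
            w, h = PAGE_W - 2 * MARGIN, int(lines * font_size * 1.2)
            x = MARGIN
        if y - h < MARGIN:
            break  # page is full
        box = (x, y - h, x + w, y)  # (x0, y0, x1, y1)
        components.append({"kind": kind, "box": box})
        if kind in FILLABLE:
            labels.append({"category": kind, "box": box})  # ground truth for training
        y -= h + rng.randint(8, 24)  # random vertical spacing
    return components, labels


rng = random.Random(0)  # fixed seed so the sample is reproducible
components, labels = random_document(rng)
```

Because the generator places every component itself, the label for each signature, initial, or date box comes for free, with no manual annotation.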

Identifying Info Boxes

Manually identifying and classifying the locations in an analog document where relevant information is stored can be a tedious, error-prone process. Digital scans of records with a wide variety of formats and scan qualities may render straightforward parsing of such documents untenable. To circumvent this limitation we developed a method that treats each document like an image and applies computer vision to identify areas of interest.

Our computer vision approach uses a class of model known as a Convolutional Neural Network (CNN), which leverages a deep network structure with multiple convolutional layers to provide accurate predictions. CNNs take advantage of the inherent importance of spatial locality within images to reduce and restrict the network connections between the input data and convolutional layer nodes. Instead of feeding each input to all nodes, each node is connected to only a small subset of spatially contiguous inputs. In addition, the weights are shared across all nodes of a layer. This setup is equivalent to forcing the network to learn the kernel of a discrete convolution, and it means the learned features respond in the same way wherever they occur in the image. The architecture of a 2D convolutional layer is visualized in Figure 2.


Figure 2: Visualization of a 2D convolutional layer. Image input data (X) is fed to the nodes of a convolutional layer (A), producing output activations (Y). Each node in the convolutional layer is connected to a small subset of spatially adjacent input nodes. Image courtesy of (Olah, 2014).
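
The two properties described above, local connectivity and weight sharing, can be seen in a few lines of plain Python. This is a minimal sketch rather than how a deep learning framework implements the operation; like most CNN libraries it computes a cross-correlation (the kernel is not flipped), and it assumes stride 1 and no padding. The function name conv2d and the toy image are illustrative.

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Each output node sees only a kh x kw patch of the input,
            # and every node reuses the same kernel weights.
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
    return out


edge_kernel = [[-1, 1]]  # responds to a dark-to-bright step, left to right

image = [[0, 0, 1, 1, 1]] * 3    # vertical edge between columns 1 and 2
shifted = [[0, 0, 0, 1, 1]] * 3  # same edge moved one column to the right

response = conv2d(image, edge_kernel)
response_shifted = conv2d(shifted, edge_kernel)
```

Here every row of response is [0, 1, 0, 0] and every row of response_shifted is [0, 0, 1, 0]: the same kernel detects the edge in both images, and the activation simply moves with it.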

The specific CNN architecture we use for object detection is the Single Shot MultiBox Detector (SSD) (Liu, 2015), visualized in Figure 3. SSD can predict multiple bounding boxes for different categories of object within the same image during a single inference pass, and has demonstrated a state-of-the-art combination of accuracy and computational efficiency. The model starts with a predefined set of bounding boxes with different locations, aspect ratios, and sizes, and for each box predicts offsets and object class probabilities. This is accomplished using a cascade of convolutional layers of decreasing resolution built on top of a truncated VGG-16 base model (Simonyan, 2015). Each convolutional layer in the cascade predicts boxes with sizes corresponding roughly to the layer resolution. Only boxes with high prediction confidence are retained. Using this model we have the potential to efficiently identify multiple information boxes within a single document image.


Figure 3: Architecture of SSD model. Image courtesy of (Liu, 2015).
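
The box prediction logic can be illustrated with a simplified sketch: build default boxes for one feature map, apply predicted offsets, and keep only confident non-background predictions. It assumes the center-size offset encoding described in the SSD paper, and it omits the variance scaling and non-maximum suppression that practical implementations add. The function names, the 2x2 feature map, the scale, the threshold, and the hand-written "predictions" are all illustrative assumptions, not outputs of our model.

```python
import math

CLASS_NAMES = ["background", "signature", "initial", "date"]


def default_boxes(fmap_size, scale, aspect_ratios):
    """Default boxes (cx, cy, w, h) in relative [0, 1] coordinates."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes


def decode(default, offsets):
    """Apply predicted offsets to a default box."""
    d_cx, d_cy, d_w, d_h = default
    t_cx, t_cy, t_w, t_h = offsets
    return (d_cx + t_cx * d_w, d_cy + t_cy * d_h,
            d_w * math.exp(t_w), d_h * math.exp(t_h))


def detect(defaults, offsets, class_probs, threshold=0.5):
    """Keep the top class per box if it is confident and not background."""
    detections = []
    for default, offset, probs in zip(defaults, offsets, class_probs):
        best = max(range(len(probs)), key=probs.__getitem__)
        if CLASS_NAMES[best] != "background" and probs[best] >= threshold:
            detections.append((CLASS_NAMES[best], probs[best], decode(default, offset)))
    return detections


defaults = default_boxes(fmap_size=2, scale=0.2, aspect_ratios=[1.0, 4.0])

# Hand-written stand-ins for network outputs: everything is background
# except box 3, which is a confident, slightly shifted "signature".
offsets = [(0.0, 0.0, 0.0, 0.0) for _ in defaults]
class_probs = [[0.9, 0.05, 0.03, 0.02] for _ in defaults]
offsets[3] = (0.25, 0.0, 0.0, 0.0)
class_probs[3] = [0.1, 0.8, 0.05, 0.05]

detections = detect(defaults, offsets, class_probs)
```

In the real model the offsets and class probabilities come from the convolutional prediction layers, with one set of default boxes per feature map in the cascade.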

The final ingredient in our modeling strategy is a technique known as transfer learning (Donahue, 2014). In transfer learning a model is first trained on a high-quality dataset unrelated to the particular learning objective at hand. The top few layers of this trained model are then removed and the remaining layers become the “base model”. Next, untrained layers specific to the final modeling problem are appended to the base model. Finally, the combined model is “fine-tuned” by training on the dataset of interest. The combined model can be fine-tuned by updating all model weights, or by freezing the weights of the base model and updating only the application-specific layers. Transfer learning takes advantage of the sequential representation learning that characterizes deep neural network training, and has been particularly successful in the image domain. Essentially, it allows for augmenting the dataset at hand with any other unrelated high-quality dataset. We use an SSD model pretrained on the popular ImageNet dataset. To fine-tune we freeze the first 7 convolutional layers, replace the output layer with a layer appropriate for our object categories, and update the weights of the remaining layers using our synthetic dataset.
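
The freezing scheme can be expressed as a framework-independent sketch. The Layer class, the layer names, the ten-layer toy model, and the class count of 4 (three box types plus a background class) are hypothetical stand-ins; in practice the same idea is applied by toggling the trainable flag on the layers of a real deep learning framework.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kind: str                # "conv" or "output" in this toy model
    trainable: bool = True
    pretrained: bool = True


def prepare_for_fine_tuning(layers, n_frozen_conv, n_classes):
    """Freeze the first n_frozen_conv conv layers and swap in a new output layer."""
    prepared, conv_seen = [], 0
    for layer in layers:
        if layer.kind == "output":
            continue  # drop the head trained for the original task
        conv_seen += 1
        # Layers up to n_frozen_conv keep their pretrained weights fixed.
        prepared.append(Layer(layer.name, layer.kind, trainable=conv_seen > n_frozen_conv))
    # Untrained, application-specific output layer.
    prepared.append(Layer(f"output_{n_classes}_classes", "output", pretrained=False))
    return prepared


# Hypothetical pretrained model: ten conv layers plus the original output layer.
pretrained_model = [Layer(f"conv{i}", "conv") for i in range(1, 11)]
pretrained_model.append(Layer("original_output", "output"))

model = prepare_for_fine_tuning(pretrained_model, n_frozen_conv=7, n_classes=4)
for layer in model:
    print(f"{layer.name:20s} trainable={layer.trainable} pretrained={layer.pretrained}")
```

Only the layers marked trainable would receive weight updates during fine-tuning on the synthetic dataset.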


The SSD model is fine-tuned on the synthetic training data until the validation loss plateaus. Figure 4 visualizes the training and validation loss curves.


Figure 4: Train/validation loss as a function of training epoch.
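
One common way to decide that the validation loss has gone flat is a patience-based stopping rule, sketched below. This is an assumption about how such a criterion could be automated, not a description of the exact rule we applied; the function name, the patience and min_delta values, and the loss values are made up for illustration and are not the numbers behind Figure 4.

```python
def stop_epoch(val_losses, patience=3, min_delta=1e-3):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far by at least min_delta for `patience` epochs in a row.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, stale = loss, 0  # meaningful improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Illustrative validation losses, one per epoch.
losses = [2.0, 1.2, 0.8, 0.6, 0.5995, 0.5999, 0.5997, 0.5996]
print(stop_epoch(losses))
```

For this made-up curve the rule fires at epoch index 6, three epochs after the last meaningful improvement.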

Examples of documents annotated with predicted object boxes are displayed in Figure 5. Each box is accompanied by the predicted object category and prediction confidence.


Figure 5: Synthetic documents annotated with predicted bounding boxes for objects of interest.


Olah, C. Understanding Convolutions. (2014, July).

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., and Reed, S. SSD: Single shot multibox detector. arXiv:1512.02325 (2015).

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In: ICLR (2015).

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In: ICML (2014).

Arshak Navruzyan