F# ADR Demo Project: https://github.com/taylorwood/ADRDemo
One of the greatest values we can provide to the customer is quick and complete visibility into the contents of their loan file, which would otherwise be an opaque stack of papers that a human would need to search exhaustively to find a particular document. A typical loan file comprises hundreds of discrete documents: legal documents, structured forms, appraisals, credit reports, tax documents, disclosures, financing documentation, etc. Some documents are fairly standardized and predictable, such as the URLA/1003 and the 1008, and some differ significantly by state or originator. Some documents are only one page long and some run to dozens or hundreds of pages. The loan files we process are typically 500-1,500 pages long, and we usually receive each one as a single PDF that may as well be a monolithic stack of papers: no table of contents, bookmarks, or any index information at all.
There is plenty written already about classification and recognition algorithms, but I will give a brief overview of how we employ a variety of these techniques in concert to provide incredibly accurate indexing for the types of documents found in loan files.
Nearly every loan file PDF we receive consists only of scanned images. The images can be of varying quality, from pristine digital scans to documents that have been through so many fax machines they’re indistinguishable from a Rorschach test. In order to gather textual information from these PDFs, we must first invent the universe (thanks, Carl Sagan), but really, we must first rasterize the PDF, process the resulting images (binarization, rotation, and skew correction), and then perform Optical Character Recognition (OCR) on the images. We have developed a distributed, highly scalable processing pipeline using RabbitMQ (see our previous blog post) to perform these tasks in our server farm. The output of our OCR processing tells us what text each image contains and its precise location down to the letter. Now that we have textual data, we can think about the techniques one might use to classify it.
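To make the flow concrete, here is a hypothetical F# sketch of the per-document stages just described. The types, function names, and placeholder bodies are purely illustrative; in our real pipeline each stage runs on separate workers fed by RabbitMQ.

```fsharp
// Hypothetical types and stage signatures for the pre-classification pipeline.
type PageImage = { PageNumber: int; Pixels: byte[] }
type OcrPage   = { PageNumber: int; Text: string }

// Render each page of the PDF to a raster image.
let rasterize (pdfBytes: byte[]) : PageImage list =
    failwith "placeholder: render PDF pages to images"

// Clean up a page image: binarize, rotate, and deskew.
let preprocess (page: PageImage) : PageImage =
    failwith "placeholder: binarization, rotation, and skew correction"

// Run OCR on a cleaned-up image to extract its text.
let recognize (page: PageImage) : OcrPage =
    failwith "placeholder: OCR engine call"

// The overall shape of the per-document flow.
let extractText (pdfBytes: byte[]) : OcrPage list =
    pdfBytes
    |> rasterize
    |> List.map (preprocess >> recognize)
```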
In order to perform statistical analysis on text, we must tokenize it. This is basically the process of converting a word into something you can reason about mathematically: a number, obviously. A three-word document, “Taylor is hungry,” might be converted into three tokens: Taylor = 1, is = 2, hungry = 3. After tokenization, you’re left with what is essentially a “bag of words.” Any analysis of this bag of words is naive in the sense that word order no longer matters, i.e. “Hungry is Taylor” would be considered identical to “Taylor is hungry” based solely on their token content. However, it’s possible to mitigate this naiveté using n-grams, which treat short runs of adjacent words as single tokens. We use a variety of values for n here, usually 1-3. Using bigrams, for example, “Taylor is hungry now” would tokenize to: “Taylor is”, “is hungry”, “hungry now.”
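As a rough illustration (not our production tokenizer), here is how simple tokenization and n-gram generation might look in F#:

```fsharp
// A minimal tokenizer: lowercase the text and split on whitespace and basic punctuation.
let tokenize (text: string) =
    text.ToLowerInvariant()
        .Split([| ' '; '\t'; '\n'; '\r'; ','; '.'; ';' |],
               System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.toList

// Slide a window of size n over the tokens to build n-grams.
let ngrams n tokens =
    tokens
    |> List.windowed n
    |> List.map (String.concat " ")

// Example:
// tokenize "Taylor is hungry now" |> ngrams 2
// => ["taylor is"; "is hungry"; "hungry now"]
```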
One of the text classification techniques used by our Automated Document Recognition (ADR) is a variation of one of the oldest, simplest, most popular, and most effective techniques: TF-IDF weighting, which stands for Term Frequency – Inverse Document Frequency. I will summarize it as a method of counting the number of times a particular word appears in a text (a particular document), and discounting that count based on how many of the texts in the overall collection (referred to as the corpus) contain the same word. In our corpus, words like “loan” and “address” are so ubiquitous they’re almost meaningless, which is why the IDF part of this technique plays an important role. Using TF-IDF we are able to create a type of signature (a very long vector of TF-IDF weightings) for each text. Once we’ve determined the signature of a particular text, we can use a similarity measure such as cosine similarity to compare it to the signatures of all other known documents. Another advantage of statistical text analysis is that it can be quite forgiving when it comes to classifying “noisy” documents like the aforementioned fax-machine-ravaged images that an OCR engine may not read very well. TF-IDF weights can also serve as inputs to dimensionality-reduction techniques like Singular Value Decomposition and to classifiers like logistic regression. We can also use decision tree algorithms to accurately identify multi-page or partial documents.
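To make the idea concrete, here is a simplified sketch of TF-IDF weighting and cosine similarity over tokenized documents, building on the tokenize function above. Our production implementation handles vocabulary, sparsity, and scale quite differently.

```fsharp
// Term frequency: how many times each token appears in one document.
let termFrequency (tokens: string list) =
    tokens |> List.countBy id |> List.map (fun (t, c) -> t, float c) |> Map.ofList

// Inverse document frequency: discount terms that appear in many documents of the corpus.
let inverseDocumentFrequency (corpus: string list list) =
    let docCount = float corpus.Length
    corpus
    |> List.collect List.distinct
    |> List.countBy id
    |> List.map (fun (term, docsWithTerm) -> term, log (docCount / float docsWithTerm))
    |> Map.ofList

// A document's "signature": its TF-IDF weight for every term it contains.
let tfidf (idf: Map<string, float>) (tokens: string list) =
    termFrequency tokens
    |> Map.map (fun term tf -> tf * (idf |> Map.tryFind term |> Option.defaultValue 0.0))

// Cosine similarity between two sparse signatures (maps of term -> weight).
let cosineSimilarity (a: Map<string, float>) (b: Map<string, float>) =
    let dot =
        a |> Seq.sumBy (fun kv ->
            kv.Value * (b |> Map.tryFind kv.Key |> Option.defaultValue 0.0))
    let norm (v: Map<string, float>) =
        v |> Seq.sumBy (fun kv -> kv.Value * kv.Value) |> sqrt
    match norm a * norm b with
    | 0.0 -> 0.0
    | denominator -> dot / denominator
```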
How do we determine which samples to use and which types of documents to index? This is where supervision plays a part in supervised machine learning. The machine learns to recognize the types of documents we care about because we supervise the learning process using known samples of the documents we want it to recognize. For instance, let’s say we have 10 training samples covering a variety of Trust Deeds. We can train the machine on these samples and it will internalize the patterns and “learn” what a Trust Deed is made of textually. The real power of this system is that it can recognize documents it’s never seen before, after being trained on just a few known samples of the respective document type. Managing the training data for your classifier can be a science and art unto itself, and is often more important than the type of classification algorithm you choose.
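As an illustrative and deliberately simplistic sketch of this training and classification step, reusing the tfidf and cosineSimilarity functions above, nearest-neighbor matching against labeled signatures stands in here for the actual classifiers we use:

```fsharp
// A labeled training sample: the document type and its TF-IDF signature.
type TrainingSample = { DocType: string; Signature: Map<string, float> }

// "Training" here just means computing a signature for each labeled sample.
let train idf (labeledDocs: (string * string list) list) =
    labeledDocs
    |> List.map (fun (docType, tokens) -> { DocType = docType; Signature = tfidf idf tokens })

// Classify an unknown document by finding the most similar known sample.
let classify (samples: TrainingSample list) (unknownTokens: string list) idf =
    let signature = tfidf idf unknownTokens
    samples
    |> List.map (fun s -> s.DocType, cosineSimilarity s.Signature signature)
    |> List.maxBy snd   // returns (document type, similarity score) of the best match
```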
Another key feature of our ADR system is the implementation of a positive feedback loop, or validated learning. Since it’s impossible to teach our machine everything it will ever need to know from the start (new document types will be added, forms will change, etc.), we must be able to adjust the system’s knowledge on the fly. Using targeted manual review of current ADR results, we are continually expanding the scope and improving the confidence of the ADR system with each new loan file we process.
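A hand-wavy sketch of how that feedback loop might look, building on the classify function above; the confidence threshold and review step are purely illustrative and not how our production system is wired:

```fsharp
// Route low-confidence results to manual review, and fold the reviewed result back into training data.
let confidenceThreshold = 0.6   // illustrative value, not our real threshold

let classifyWithFeedback samples idf tokens (reviewManually: unit -> string) =
    let docType, score = classify samples tokens idf
    if score >= confidenceThreshold then
        samples, docType
    else
        // A human reviewer supplies the correct document type,
        // and the validated example becomes new training data.
        let reviewedType = reviewManually ()
        let newSample = { DocType = reviewedType; Signature = tfidf idf tokens }
        newSample :: samples, reviewedType
```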
In practice, our ADR system has dramatically increased both the scope of documents we index and the efficiency with which we index them. There was a time when loan file indexing was a completely manual process for us, and probably still is for a lot of other companies. Fortunately, using data mining techniques we were able to leverage much of this manual indexing as training data for ADR. It would normally take a human worker well over an hour to index a complete loan file, and the consistency of the indexing might vary wildly depending on the worker even if they were only indexing a few critical document types. Our ADR system now indexes hundreds of types of documents at hundreds of pages per second and requires no human intervention!
–Taylor Wood, Senior Developer & Lead Architect at MortgageTrade™