The PDF was created to make the dream of the paperless office a reality. Developed by Adobe in the early 1990s, the format allows text and graphics to be sent electronically in the same document. These days a PDF can be viewed on any device, password-protected and printed locally. Another useful feature is the ability to add forms that can be completed and returned digitally.
PDFs are now easier than ever to produce. For example, a PDF can be created from within Microsoft Office 2007 products or OpenOffice. Various powerful, free third-party PDF applications that integrate with older Microsoft Office products are also available. Web pages can be converted to PDF using extensions and plugins for popular browsers such as Chrome and Firefox.
Another useful feature is the ability to create a tagged PDF. The tags give meaning to the content and allow for the extraction of data. However, it is not easy to add tags. You need additional software, and the tags have to be added manually, which is a repetitive task. Furthermore, if the PDF is produced by an internal system, it may not be possible to add the tags when the PDF is generated.
The advantages of the format mean it is regularly used to send information from business to business. However, as the content in a PDF has no meaning without tags, it usually needs to be copied and pasted from the document into the receiving business's internal system. The task is laborious and prone to error, wasting time and effort.
The solution is to send the data with meaning, usually in XML or JSON format. The file can then be dropped into an internal system that has suitable code to recognise and accurately extract all the information from the file. The problem with XML and JSON is that they are aimed at programmers and developers, not the average computer user. The following figures show how a PDF and an XML file may differ for ordering a pair of shoes:
Brand: Fly London
<?xml version="1.0" encoding="UTF-8"?>
<shoe_order>
  <brand>Fly London</brand>
  <design>Shard</design>
  <size>39</size>
  <colour>Black</colour>
  ...
</shoe_order>
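To illustrate what "suitable code" on the receiving side might look like, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The function name parse_shoe_order is hypothetical, and the sample contains only the four fields shown above; the fields elided in the original order are left out. A real internal system would add validation and error handling.

```python
import xml.etree.ElementTree as ET

# Sample order containing only the fields shown in the article;
# the elided fields ("...") are deliberately not invented here.
ORDER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<shoe_order>
  <brand>Fly London</brand>
  <design>Shard</design>
  <size>39</size>
  <colour>Black</colour>
</shoe_order>"""


def parse_shoe_order(xml_text: str) -> dict:
    """Return a dict mapping each child tag of the root to its text.

    Hypothetical helper for illustration, not part of any real system.
    """
    # Parse from bytes so the encoding declaration in the XML is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {child.tag: (child.text or "").strip() for child in root}


order = parse_shoe_order(ORDER_XML)
print(order["brand"], order["size"])
```

Because every value arrives already labelled with its tag, the receiving system does not have to guess which piece of text is the brand and which is the size.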
It is unlikely that the average user would feel comfortable creating an XML file; most would rather continue to use PDFs, which have the added benefit of password protection should the document contain sensitive data. Consequently, there needs to be a way to extract data from PDFs other than direct copying and pasting.
Fortunately, there are a number of systems designed to automate extracting information from PDFs. The best ones combine different technologies to recognise the data. These technologies include OCR tools and word pattern recognition. Word pattern recognition is important if, for example, you want to recognise a chunk of text that sits between two headings and varies in length from one document to the next.
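As a rough illustration of word pattern recognition, the sketch below uses a Python regular expression to capture whatever text sits between two known headings, however long it is. It assumes the text has already been extracted from the PDF (for example by an OCR tool); the function name text_between, the headings and the sample text are all made up for this example and do not describe how any particular product works.

```python
import re
from typing import Optional


def text_between(document: str, start_heading: str, end_heading: str) -> Optional[str]:
    """Return the text found between two headings, or None if absent.

    Hypothetical helper for illustration only.
    """
    # re.escape treats the headings as literal text; the lazy group (.*?)
    # grows only as far as the first occurrence of the end heading.
    pattern = re.escape(start_heading) + r"\s*(.*?)\s*" + re.escape(end_heading)
    match = re.search(pattern, document, flags=re.DOTALL)
    return match.group(1) if match else None


# Made-up sample of text already extracted from a PDF.
SAMPLE = """Delivery address:
12 Example Street
Exampletown
Order notes:
Leave with a neighbour."""

address = text_between(SAMPLE, "Delivery address:", "Order notes:")
print(address)
```

The same pattern works whether the address runs to one line or five, which is the point of matching on the surrounding headings rather than on a fixed position in the page.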
Systems designed to extract information from PDFs automatically can be used by SwiftCase to automate your business workflows. As soon as a PDF is sent in, SwiftCase can pick up the relevant information, extract the data and insert it into the system. That means no more manual entry and an instant response to incoming work.
SwiftCase can automate a wide range of data-import processes, helping you focus on providing an excellent service to clients.
If you’re interested in a free, no-obligation demonstration, get in touch today.