Kaemingk: PDF Extraction
Kaemingk is an international organization that sells home and seasonal decorations. Founded in 2008, it offers a wide product range that includes seasonal products such as Christmas decorations, as well as candles, glassware and other home decorations.
Challenge
Part of Kaemingk's product range must by law carry a certificate proving that the product has been tested against the applicable norms. These products are tested by local testing agencies, which deliver the test results as PDF files. Each PDF states which products were tested, when they were tested, which certificate belongs to the test, which norms were tested and which directives, if any, apply to the test. Every testing agency uses its own template to describe the results of a test.
Extracting this information from the PDF files by hand is possible, but it is time-consuming and error-prone. That is why Kaemingk asked Squadra Machine Learning Company to carry out a Proof of Concept (POC) in which algorithms were used to extract data from around 20 PDF files.
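To make the extraction task concrete, the following is a minimal sketch of how fields like these could be pulled from the text of a test report with pattern matching. It is not the method Squadra Machine Learning Company used. It assumes the PDF text has already been read into a string, and the agency names, label wording, patterns and sample text are all hypothetical.

```python
import re

# Hypothetical patterns; real reports will use different wording per agency.
NORM_RE = re.compile(r"\b(?:EN|IEC|ISO)\s?\d{2,5}(?:-\d+)*(?::\d{4})?")
DIRECTIVE_RE = re.compile(r"\b\d{4}/\d{1,4}/(?:EU|EC|EEC)\b")
YEAR_RE = re.compile(r"Date of test\s*:?\s*(\d{4})")

# One certificate-number pattern per (hypothetical) agency template.
CERT_PATTERNS = {
    "agency_a": re.compile(r"Certificate No\.?\s*:?\s*([A-Z0-9-]+)"),
    "agency_b": re.compile(r"Report Number\s*:?\s*([A-Z0-9-]+)"),
}

# Invented sample text standing in for the text layer of one PDF.
SAMPLE = """Certificate No: TC-2019-0042
Date of test: 2019-03-14
Tested according to EN 60598-1:2015 and EN 71-3.
Directive 2014/35/EU applies."""


def extract_fields(text, agency):
    """Return the certificate, test year, norms and directives found in text."""
    cert = CERT_PATTERNS[agency].search(text)
    year = YEAR_RE.search(text)
    return {
        "certificate": cert.group(1) if cert else None,
        "year": year.group(1) if year else None,
        "norms": NORM_RE.findall(text),
        "directives": DIRECTIVE_RE.findall(text),
    }


fields = extract_fields(SAMPLE, "agency_a")
```

Keeping the agency-specific patterns in a single lookup table means a new template can be supported by adding one entry, while the norm and directive patterns are shared because those identifiers follow the same notation regardless of the agency.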
Solution
The POC showed that it is possible to scrape data from these PDF files automatically. Kaemingk can extract the information it needs from the PDFs and export it to an Excel file. This reduces the number of man-hours spent on the task and improves efficiency in the future: when testing agencies provide new PDF files, these can be scraped automatically as well.
For future use, Squadra Machine Learning Company can offer this solution as user-friendly software. Employees receive their own username and password to log in to the portal. They can then upload files that help the algorithms become better at reading the PDF files. After that, the actual PDFs can be uploaded, which produces an Excel output file. This file contains the name of the scraped PDF file, the year in which the test was performed, the document number, the detected norms, the certificate(s) that belong(s) to these norms and the directives that are mentioned in the file.
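The output file described above has one row per scraped PDF. As an illustration only, the sketch below writes such rows as CSV, which Excel can open; producing a native Excel workbook would need a third-party library. The column names, the `render_report` helper and the example record are assumptions, not the actual layout of the Squadra output.

```python
import csv
import io

# Hypothetical column names mirroring the fields listed in the text.
COLUMNS = [
    "file_name", "year", "document_number",
    "norms", "certificates", "directives",
]


def to_row(record):
    """Flatten one record; list values are joined into a single cell."""
    row = {}
    for column in COLUMNS:
        value = record.get(column, "")
        if isinstance(value, (list, tuple)):
            value = "; ".join(value)
        row[column] = value
    return row


def render_report(records):
    """Return the report as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for record in records:
        writer.writerow(to_row(record))
    return buffer.getvalue()


# Invented example record for one scraped PDF.
example = {
    "file_name": "report_a.pdf",
    "year": "2019",
    "document_number": "DOC-0042",
    "norms": ["EN 60598-1:2015", "EN 71-3"],
    "certificates": ["TC-2019-0042"],
    "directives": ["2014/35/EU"],
}
report_text = render_report([example])
```

Joining multiple norms or certificates into one cell keeps a single row per PDF, which matches the per-file output described; a layout with one row per norm would be an equally reasonable choice.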
Result
By working with Squadra MLC, Kaemingk has automated the process of scraping PDF files. These PDFs contain important information about the products and about the tests conducted to meet the legally required norms, so it was crucial that this information be extracted. Doing so manually would be extremely time-consuming and error-prone. With the help of Squadra MLC, Kaemingk can now scrape many PDF files in an instant, and its efficiency has increased significantly.