Datasheet Scrubber

Manually extracting circuit information from datasheets is strenuous and time-consuming for large datasets (millions of datasheets). We therefore implemented a tool that detects the datasheet title (circuit category) and then extracts the relevant performance specs.

For circuit category detection, we use two approaches:

  1. Machine learning: we apply the well-known bag-of-words approach, a common technique in ML-based text classification.
  2. Keyword search: we count the occurrences of the keywords for each “circuit category” and pick the category with the highest count. We then cross-check whether the specs expected for that “circuit category” are found in the datasheet.
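The two approaches above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the tool's actual implementation: the naive Bayes classifier, the toy training sentences, and the `KEYWORDS` lists are all assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Bag of words: map each word to its occurrence count, ignoring order."""
    return Counter(tokenize(text))


class NaiveBayes:
    """Tiny multinomial naive Bayes over bag-of-words counts (illustrative)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        bag = bag_of_words(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word, n in bag.items():
                # Laplace (add-one) smoothing so unseen words never give log(0).
                p = (counts[word] + 1) / (total_words + len(self.vocab))
                score += n * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data; a real run would use labeled datasheet text.
TRAIN_TEXTS = [
    "12-bit ADC converter with 1 MSPS sampling rate",
    "low jitter PLL with integrated VCO and loop filter",
    "low dropout LDO regulator with 150 mV dropout voltage",
]
TRAIN_LABELS = ["ADC", "PLL", "LDO"]

# Hypothetical keyword lists per circuit category (assumption).
KEYWORDS = {
    "ADC": ["adc", "converter", "sampling", "resolution"],
    "PLL": ["pll", "loop", "jitter", "vco"],
    "LDO": ["ldo", "dropout", "regulator"],
}


def keyword_category(text, keywords=KEYWORDS):
    """Sum keyword occurrences per category and return the top category.

    On a tie, the category listed first in `keywords` wins.
    """
    bag = bag_of_words(text)
    scores = {cat: sum(bag[w] for w in words) for cat, words in keywords.items()}
    return max(scores, key=scores.get), scores
```

A classifier would be trained once with `NaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)` and then queried with `predict(text)`, while `keyword_category(text)` needs no training and returns both the winning category and the per-category counts.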

If the two approaches disagree, we arbitrate between them to decide the title. Once the document's circuit class is determined, we extract the specs from the datasheet. Figure 1 shows an example of text extraction from an Analog Devices datasheet.
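The arbitration rule itself is not spelled out here, so the following is only one plausible sketch: when the two labels differ, prefer the candidate whose expected specs appear more often in the text, and fall back to the ML label on a tie. The `EXPECTED_SPECS` table, the function names, and the tie-break are all assumptions made for illustration.

```python
# Hypothetical spec names expected per category (assumption for illustration).
EXPECTED_SPECS = {
    "ADC": ["resolution", "sampling rate", "snr"],
    "LDO": ["dropout voltage", "quiescent current", "load regulation"],
    "PLL": ["jitter", "lock time", "phase noise"],
}


def spec_hits(text, category, expected_specs=EXPECTED_SPECS):
    """Count how many of the category's expected spec names occur in the text."""
    lowered = text.lower()
    return sum(1 for spec in expected_specs.get(category, []) if spec in lowered)


def arbitrate(text, ml_label, keyword_label, expected_specs=EXPECTED_SPECS):
    """Pick a final category from the ML and keyword-search predictions."""
    if ml_label == keyword_label:
        return ml_label  # both approaches agree, nothing to arbitrate
    ml_hits = spec_hits(text, ml_label, expected_specs)
    kw_hits = spec_hits(text, keyword_label, expected_specs)
    # Assumed tie-break: keep the ML label unless the keyword label
    # is supported by strictly more expected specs.
    return keyword_label if kw_hits > ml_hits else ml_label
```

For example, a text mentioning "dropout voltage" and "quiescent current" supports an LDO label more strongly than a PLL label that matches only "jitter", so `arbitrate` would return the LDO candidate in that case.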

Analog Device Features

Figure 1: Example of table text extraction
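As a rough idea of what extracting specs from such text can look like, the sketch below pulls (name, value, unit) triples out of lines of plain text with a regular expression. The pattern, the function name `extract_specs`, and the sample line formats are assumptions; real datasheet tables are considerably messier and would need more robust handling.

```python
import re

# One spec per line: a name, an optional ':' or '=', a number, an optional unit.
# This line format is an assumption made for illustration.
SPEC_LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z][A-Za-z /-]*?)\s*[:=]?\s*"
    r"(?P<value>[-+]?\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[A-Za-zµ%]+)?\s*$"
)


def extract_specs(text):
    """Return a list of (name, value, unit) tuples found in the text.

    Lines that do not look like 'name value unit' are skipped.
    """
    specs = []
    for line in text.splitlines():
        match = SPEC_LINE.match(line)
        if match:
            specs.append(
                (match["name"], float(match["value"]), match["unit"] or "")
            )
    return specs
```

Heading lines such as "FEATURES" contain no number and are therefore ignored, while a line like "Supply voltage 1.8 V" yields the name, the numeric value, and the unit as separate fields.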

We tested our approach for circuit categorization and spec extraction on more than 3000 different PDF datasheets and academic papers, with a success rate above 95%. Table 1 shows the confusion matrix of our classification.

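| Category   | ADC | CDC | DCDC | PLL | Temp Sense | SRAM | LDO |
|------------|-----|-----|------|-----|------------|------|-----|
| ADC        | 97% | 2%  | 0%   | 0%  | 0%         | 1%   | 0%  |
| CDC        | 3%  | 95% | 0%   | 0%  | 2%         | 0%   | 0%  |
| DCDC       | 0%  | 0%  | 98%  | 0%  | 0%         | 0%   | 2%  |
| PLL        | 1%  | 0%  | 0%   | 99% | 0%         | 0%   | 0%  |
| Temp Sense | 0%  | 2%  | 0%   | 0%  | 97%        | 1%   | 0%  |
| SRAM       | 0%  | 0%  | 0%   | 0%  | 0%         | 100% | 0%  |
| LDO        | 0%  | 0%  | 4%   | 0%  | 0%         | 0%   | 96% |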

Table 1: Confusion Matrix of the circuit classification
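Each row of Table 1 sums to 100%, so the diagonal entries can be read as per-category success rates. The sketch below averages them, assuming every category carries equal weight (the per-category document counts are not given here, so the true overall rate may differ). Under that assumption the mean comes to roughly 97.4%, consistent with the reported rate above 95%.

```python
CATEGORIES = ["ADC", "CDC", "DCDC", "PLL", "Temp Sense", "SRAM", "LDO"]

# Values from Table 1, in percent; rows and columns follow CATEGORIES.
CONFUSION = [
    [97, 2, 0, 0, 0, 1, 0],
    [3, 95, 0, 0, 2, 0, 0],
    [0, 0, 98, 0, 0, 0, 2],
    [1, 0, 0, 99, 0, 0, 0],
    [0, 2, 0, 0, 97, 1, 0],
    [0, 0, 0, 0, 0, 100, 0],
    [0, 0, 4, 0, 0, 0, 96],
]


def per_category_rate(matrix):
    """Diagonal entry divided by the row total, one value per category."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def mean_rate(matrix):
    """Unweighted mean of the per-category rates (assumes equal class sizes)."""
    rates = per_category_rate(matrix)
    return sum(rates) / len(rates)


if __name__ == "__main__":
    for name, rate in zip(CATEGORIES, per_category_rate(CONFUSION)):
        print(f"{name}: {rate:.0%}")
    print(f"Unweighted mean: {mean_rate(CONFUSION):.1%}")
```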

The public code for the Datasheet Scrubber is available at: https://github.com/umich-cadre/FASoC-Datasheet-Scrubber

An editable work directory for datasheet scrubbing, including intermediate folders, PDF datasheets, etc., can be found here: https://www.dropbox.com/sh/o3nxl4qrm2rrvf4/AADF9iPv17bN9U6mJQaa8nrya?dl=0

IDEA & POSH Integration Exercises – January 2019 Demos

Video 1 shows circuit category recognition and text/table data extraction from a sample Analog Devices datasheet.

Video 1: Datasheet Scrubber