Is there a way to loop through records?

Hello,

I am currently trying to extract data from a PDF with this kind of format. I have read through the documentation, but I can’t find a way to loop through the PDF and create a new record for each entry. I can’t manually create a field for each word, as there are 250 or so pages. This is the layout, more or less:
[Image: screenshot of the PDF layout]

What I am trying to achieve is to get this data in an excel sheet with a layout like this:

Is there something I have missed in the documentation or is it just not possible to do what I’m trying?

Thanks very much for any help!

Hi! We don’t have a purpose-built method to accomplish exactly what you’re describing, but there’s probably a way to do it using our existing methods.

It looks to me like your ‘word’ headings (aback, abalienate, etc) are in a larger font than their accompanying text. So you could perhaps use the Sections method to segment each record with something like this:

{
  "fields": [],
  /* each section is a word + definitions/data */
  "sections": [
    {
      "id": "dictionary_entry",
      "range": {
        "anchor": {
          /* each section starts with a large font */
          "match": {
            "type": "regex",
            "pattern": ".+",
            /* in the Sensible app, click on lines to see their height */
            "minimumHeight": 0.18
          }
        }
      },
      "fields": [
        /* the word being defined (1st line in section) */
        {
          "id": "defined_word",
          "anchor": {
            "match": {
              "type": "first"
            }
          },
          "method": {
            "id": "passthrough"
          }
        },
        /* grab everything to the right
           of the defined word (1st line in section) */
        {
          "id": "ccm",
          "anchor": {
            "match": {
              "type": "first"
            }
          },
          "method": {
            "id": "row"
          }
        },
        /* grab definitions: each is a paragraph starting with a number */
        {
          "id": "definitions",
          "match": "all",
          "anchor": {
            "match": {
              "type": "regex",
              "pattern": "^[0-9]"
            }
          },
          "method": {
            "id": "paragraph"
          }
        },
        /* fallback field if there's only one un-numbered definition */
        {
          "id": "definitions",
          "anchor": {
            "match": {
              "type": "regex",
              "pattern": "^.+"
            }
          },
          "method": {
            "id": "paragraph"
          }
        },
        /* for troubleshooting/to illustrate section range, output all text in this section */
        {
          "id": "_everything_in_this_section",
          "method": {
            "id": "documentRange",
            "includeAnchor": true
          },
          "anchor": {
            "match": {
              "type": "first"
            }
          }
        }
      ]
    }
  ]
}
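
In case it helps to picture what the section anchor is doing, here is a rough Python sketch of the idea. It is not Sensible's implementation, just an illustration under assumptions: the `(text, height)` line representation, the `split_into_sections` name, and the sample heights are all made up. Every line at least as tall as the threshold starts a new record, and the shorter lines that follow belong to that record.

```python
from typing import List, Tuple

# Hypothetical representation of a PDF line: (text, line height).
Line = Tuple[str, float]


def split_into_sections(lines: List[Line], minimum_height: float = 0.18) -> List[List[str]]:
    """Group lines into records; each tall line (a heading) opens a new record."""
    sections: List[List[str]] = []
    for text, height in lines:
        if height >= minimum_height:
            # A large-font line is treated as the start of a new record.
            sections.append([text])
        elif sections:
            # Smaller lines are attached to the most recent record.
            sections[-1].append(text)
        # Lines that appear before the first heading are ignored.
    return sections


# Made-up sample: heights are illustrative, not measured from a real PDF.
sample = [
    ("aback", 0.20),
    ("1 example definition", 0.12),
    ("abalienated", 0.20),
    ("1 (obsolete) caused mental aberration", 0.12),
    ("2 in civil law transferred land title", 0.12),
]
sections = split_into_sections(sample)
```

With the sample above you get two records, one per heading, which is the same segmentation the `dictionary_entry` section is meant to produce.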

Let me know if that works for you after you’ve reconfigured it for your specific situation (font size, etc)!

It should give you output like the following for each record:

{
  "dictionary_entry": [
    {
      "defined_word": {
        "type": "string",
        "value": "abalienated"
      },
      "ccm": null,
      "definitions": [
        {
          "type": "string",
          "value": "1 (obsolete) caused mental aberration"
        },
        {
          "type": "string",
          "value": "2 in civil law transferred land title"
        }
      ],
      "_everything_in_this_section": {
        "type": "string",
        "value": "abalienated 1 (obsolete) caused mental aberration 2 in civil law transferred land title"
      }
    },
    {
      "defined_word": {
        "type": "string",
        "value": "another word"
      },
      "ccm": "blah",
      "definitions": [
        {
          "type": "string",
          "value": "1 def 1"
        },
        {
          "type": "string",
          "value": "2 def 2"
        }
      ],
      "_everything_in_this_section": {
        "type": "string",
        "value": "blah blah blah"
      }
    }
  ]
}

To output to Excel, there are a couple of things you can do. You can take advantage of Sensible’s native Excel output (for more info, see Quickstart PDF to Excel and SenseML to spreadsheet reference). To get your columns just right, you may need to use a computed field or advanced computed field method.
Or you can use a Zapier integration.
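
If you would rather post-process the JSON yourself, something like the following Python sketch could flatten each record into one spreadsheet row. It assumes the output has the shape shown earlier in this reply. The function names, the column order, and the `" | "` separator between definitions are my own choices for illustration, so adjust them to the layout you want.

```python
import csv
import io
from typing import Any, Dict, List


def field_value(field: Any) -> str:
    """Return the text of a field, whether it is null, a {type, value} object, or a bare value."""
    if field is None:
        return ""
    if isinstance(field, dict):
        return str(field.get("value", ""))
    return str(field)


def entries_to_rows(parsed: Dict[str, Any]) -> List[List[str]]:
    """Build one row per dictionary entry: word, text to its right, joined definitions."""
    rows = []
    for entry in parsed.get("dictionary_entry", []):
        definitions = entry.get("definitions") or []
        if isinstance(definitions, dict):
            # The fallback field yields a single object rather than a list.
            definitions = [definitions]
        rows.append([
            field_value(entry.get("defined_word")),
            field_value(entry.get("ccm")),
            " | ".join(field_value(d) for d in definitions),
        ])
    return rows


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize the rows as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["word", "ccm", "definitions"])
    writer.writerows(rows)
    return buffer.getvalue()


parsed = {
    "dictionary_entry": [
        {
            "defined_word": {"type": "string", "value": "abalienated"},
            "ccm": None,
            "definitions": [
                {"type": "string", "value": "1 (obsolete) caused mental aberration"},
                {"type": "string", "value": "2 in civil law transferred land title"},
            ],
        }
    ]
}
rows = entries_to_rows(parsed)
csv_text = rows_to_csv(rows)
```

Write `csv_text` to a `.csv` file and open it in Excel; each record ends up on its own row.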