How do I stop paragraphs from being mixed together in research papers?

I am trying to extract data from studies, but I have run into a problem. The studies use a two-column layout: the first half of each page's text is in a left column and the second half is in a right column. You are supposed to read every paragraph on the left from top to bottom, then every paragraph on the right from top to bottom. A strip of blank white space runs down the middle of the page and separates the two columns. However, when I try to extract a specific paragraph with Sensible, the app reads text from top to bottom straight across that white strip. As a result, the text of paragraphs that sit side by side (one in the left column, one in the right) gets combined and becomes unreadable.

So how do I stop this from happening? Is there a way to keep paragraphs on the left from being mixed with paragraphs on the right? Alternatively, is there a quick way to modify PDFs so that they drop the two-column formatting, use a single-column layout like a Google Doc, and can still be uploaded to Sensible?

I will include the code that I used to extract data in case that is helpful:

{
  "fields": [
    {
      "id": "rent_topic_paragraphs",
      "anchor": {
        "match": {
          "type": "first"
        }
      },
      "method": {
        "id": "topic",
        "numParagraphs": 1,
        "terms": [
          "pay",
          "lessee",
          "rent",
          "dollars"
        ]
      }
    }
  ]
}

Unfortunately, while the Paragraph method supports a two-column format, the numParagraphs parameter on the Topic method doesn’t support two columns. We’re currently triaging this issue on our product feedback board. In the meantime, you could test whether the Summarizer method can still make sense of what the Topic method returns, even with the extraneous column text included.

If that doesn’t work, some alternate approaches could be as follows. These approaches assume that the paragraphs you’re targeting reliably contain certain terms, so that you can use a regex anchor or an any anchor to match them – in other words, instead of the NLP flexibility of the Topic method, you use a more complex anchor:

  • for an example of a regex anchor, see Passthrough
  • for more about using the any anchor, see Any match
  • for more about complex anchor concepts (especially start and stop parameters vs match arrays), see anchor nuances

Then with your complex anchor, you could use one of the following methods:

  • if you only want 1 paragraph, use the Paragraph method
  • if you can identify text that reliably starts and ends the paragraphs you want to capture, use the Passthrough method
  • if you’re certain the column widths don’t vary much between papers, use the Document Range method + X Range Filter parameter to capture 1 column (see this example)
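To illustrate why restricting extraction to one column's horizontal range helps, here is a minimal Python sketch of the general idea. It is not Sensible's implementation or API: the `Line` type, the `naive_order` and `reading_order` functions, the coordinates, and the sample text are all hypothetical, and it assumes you already have each line's position on the page and a gutter position that stays roughly constant between papers.

```python
from typing import NamedTuple


class Line(NamedTuple):
    """A line of page text with its position (hypothetical structure)."""
    text: str
    x: float  # left edge of the line, in inches from the page's left edge
    y: float  # top edge of the line, in inches from the page's top edge


def naive_order(lines: list[Line]) -> list[str]:
    """Read strictly top to bottom, which interleaves the two columns."""
    return [ln.text for ln in sorted(lines, key=lambda ln: (ln.y, ln.x))]


def reading_order(lines: list[Line], gutter_x: float) -> list[str]:
    """Read the whole left column top to bottom, then the right column.

    gutter_x is the assumed x position of the blank strip between columns.
    """
    left = sorted((ln for ln in lines if ln.x < gutter_x), key=lambda ln: ln.y)
    right = sorted((ln for ln in lines if ln.x >= gutter_x), key=lambda ln: ln.y)
    return [ln.text for ln in left + right]


# Made-up two-column page: two lines per column, at the same heights.
page = [
    Line("The lessee shall pay rent", 0.5, 1.0),
    Line("Insurance is required", 4.5, 1.0),
    Line("of 500 dollars monthly.", 0.5, 1.2),
    Line("for the whole term.", 4.5, 1.2),
]

mixed = naive_order(page)                     # left and right lines alternate
separated = reading_order(page, gutter_x=4.25)  # left column, then right column
```

With this sample page, `mixed` alternates between the left and right columns, while `separated` keeps each column's lines together. The sketch only works when the gutter position is known in advance, which is the same assumption behind filtering by a fixed X range.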