Document Properties. The Title, Author, Subject, and Keywords can be set to provide the user with additional information about the document.
Document Properties. Archaeological reports contain a large amount of descriptive detail. This includes lengthy descriptions, many illustrations, and tabular data about the discovered finds and their context. These publications often follow a distinctive chapter/section division that has a semantic meaning (by period, by material category, by type), which would ideally be incorporated into the Text Mining. Xxxxxxx (2015) specifically mentions that the scope of natural language statements is often not implicit, but inferred from a hierarchy of chapters and sections. Xxxxxxx uses units within sites as an example, but what we see more often in Dutch reports is, for example, the snippet “we have found an axe” in the section “Neolithic”, indicating a Neolithic axe. The section heading might be paragraphs – or even several pages – before the snippet, so there is no direct relation within the vicinity of the text. Apart from the complexity of the text itself, this ‘semantic inheritance’ makes extracting information or finding relations difficult. Moreover, these documents differ greatly in internal structure from commercial unit to unit. Since no commercial publisher is interested in these large-volume books anymore, most archaeological organisations publish these reports in their own internal series. While there are regulations for the content of the reports, the order, structure and format are not prescribed (Stichting Infrastructuur Kwaliteitsborging Bodembeheer, 2016), and as such we see a large variety. This is not a Dutch-only problem: it is also noted by Xxxxxxx & Xxxx (2020) for reports from the United Kingdom. The inconsistencies make extracting the heading structure a challenging task.
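The ‘semantic inheritance’ described above can be illustrated with a minimal sketch. It assumes the heading structure has already been recovered, which is exactly the hard part in practice. The `Block` type, the `inherit_headings` function and the second sample sentence are hypothetical and serve only as illustration; they are not taken from any of the cited works.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Block:
    """One unit of report text (hypothetical structure for this sketch).

    level is None for body text, 1 for a chapter heading, 2 for a section, etc.
    """
    text: str
    level: Optional[int] = None


def inherit_headings(blocks: List[Block]) -> List[Tuple[List[str], str]]:
    """Pair every body block with the chain of headings it sits under."""
    stack: List[Tuple[int, str]] = []  # open headings, outermost first
    result = []
    for block in blocks:
        if block.level is None:
            # Body text inherits all currently open headings.
            result.append(([text for _, text in stack], block.text))
            continue
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= block.level:
            stack.pop()
        stack.append((block.level, block.text))
    return result


# Assumed sample input: headings are already labelled with their level.
report = [
    Block("Neolithic", level=1),
    Block("Finds", level=2),
    Block("we have found an axe"),
    Block("Bronze Age", level=1),
    Block("a second, made-up snippet"),
]

for context, text in inherit_headings(report):
    print(" > ".join(context), "|", text)
```

With the heading chain attached, the snippet “we have found an axe” carries the context “Neolithic”, even if the heading appeared several pages earlier.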
A compounding factor is the format the documents are stored in: Portable Document Format (PDF) files are notoriously difficult to extract structured text from, as it is a format geared towards correctly displaying text and any structure that the text might have is lost. When extracting text from PDFs, we can get information about font style and size for example, but nowhere are certain snippets marked as being a heading. We have experimented with a rule-based approach to automatically label chapter and section headings, but due to the noise from PDFs and the different styles between documents, we found it incredibly difficult to do this with a decent level of accuracy. A machine learning approach for this task might be better suited, but as there is no training data, thi...
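To make the idea of a rule-based heading labeller concrete, the following is a minimal sketch of one possible heuristic. It is not the approach actually used by the authors. It assumes that lines and their font size and weight have already been extracted from the PDF by some library; the `Line` type, the `label_headings` function, the thresholds and the sample lines are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """A line of extracted text with font information (hypothetical structure)."""
    text: str
    font_size: float
    bold: bool = False


def label_headings(lines: List[Line], min_ratio: float = 1.15,
                   max_words: int = 12) -> List[bool]:
    """Flag lines that look like headings relative to the dominant body font size.

    min_ratio and max_words are assumed thresholds, not tuned values.
    """
    if not lines:
        return []
    # Estimate the body font size as the size covering the most characters.
    sizes = Counter()
    for line in lines:
        sizes[round(line.font_size, 1)] += len(line.text)
    body_size = sizes.most_common(1)[0][0]

    labels = []
    for line in lines:
        short = 0 < len(line.text.split()) <= max_words
        larger = line.font_size >= body_size * min_ratio
        emphasised = line.bold and line.font_size >= body_size
        labels.append(short and (larger or emphasised))
    return labels


# Assumed sample input, as it might come out of a PDF extraction step.
lines = [
    Line("Neolithic", 14.0, bold=True),
    Line("During the excavation we have found an axe in the northern trench.", 10.0),
    Line("The find context is described in detail in the tables that follow.", 10.0),
    Line("Figure 4", 10.0, bold=True),
]

for line, is_heading in zip(lines, label_headings(lines)):
    print(is_heading, line.text)
```

In this sample the bold caption “Figure 4” is also flagged as a heading. False positives of this kind, combined with styles that differ from one document to the next, are the sort of noise that makes such rules hard to get accurate.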
Document Properties. Primary Author: Xxxxxxxx, Xxxx X Co-Author(s):
Document Properties. PrimoPDF Program Options PrimoPDF’s program options enable you to control how the program operates (see Figure 4).