Architecture and Design of the Platform, the NLP functionalities relevant to this deliverable share two mandatory parameters, input and language. When applicable, we document additional, tool-specific parameters. Another prerequisite for integrating a tool in the PANACEA platform is that it can process input and generate output in the common encoding format documented in D3.1, Section 6.1.3. To meet this prerequisite, PANACEA partners have investigated two approaches. UPF, DCU and CNR have built specific web services that perform I/O conversions from and to their tools. ILSP has adapted its NLP tools by integrating importers and exporters from and to the common encoding format. Finally, for each service, we provide links to entries in the PANACEA registry, where updated documentation and access information will be provided during and after the project’s timeline, thus supporting the sustainability of the PANACEA platform.
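As an illustration of the parameter convention described above, the following Python sketch assembles the parameter set for one call. Only the names input and language come from the text. The function name build_parameters and the tool-specific parameter shown in the usage note are hypothetical, and the sketch makes no assumption about how the parameters are transported to a service.

```python
def build_parameters(input_ref, language, **tool_specific):
    """Collect the two mandatory parameters plus optional tool-specific ones.

    input_ref and language are the mandatory parameters named in the text;
    anything passed in tool_specific is a hypothetical extra parameter.
    """
    for name, value in (("input", input_ref), ("language", language)):
        if not value:
            raise ValueError(f"missing mandatory parameter: {name}")
    params = {"input": input_ref, "language": language}
    params.update(tool_specific)
    return params
```

For example, build_parameters("doc.xml", "el") yields only the two mandatory entries, while build_parameters("doc.xml", "el", granularity="sentence") adds a made-up tool-specific option.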
Architecture and Design of the Platform. Sec. 6.1.2 as Travelling Object 1 (TO1). The output of the FMC is a list of links pointing to TO1 documents. See xxxx://xxx.xxxx.xx/nlp/examples/2547.xml for an example in French for the Environment domain.

The FBC, available as a web service, is the first module in the PANACEA pipeline for building parallel, domain-specific LRs from the web. It aims to harvest multilingual web sites, to download web documents that are relevant to a predefined domain and written in the targeted languages, and to identify pairs of parallel documents in the collection of stored documents. To this end, the FBC integrates all processing modules for monolingual data acquisition (i.e. normalization, language identification, cleaning, and text classification) and, in addition, a component for detecting pairs of parallel web pages. The final output of the FBC is a list of links to XML files following the cesAlign Corpus Encoding Standard for linking (parts of) documents. The example at xxxx://xxx.xxxx.xx/panacea/xces-xslt/202_225.xml links a pair of documents in English and Greek.
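The order of the FBC stages described above can be sketched in Python as follows. This is only an outline under stated assumptions: documents are represented as plain dictionaries, every stage is a caller-supplied function that returns the document or None to discard it, and the names run_monolingual_stages and crawl_bilingual are hypothetical. It does not reproduce the actual FBC implementation or its pair-detection method.

```python
from typing import Callable, Iterable, List, Optional

Doc = dict
Stage = Callable[[Doc], Optional[Doc]]


def run_monolingual_stages(docs: Iterable[Doc], stages: List[Stage]) -> List[Doc]:
    """Apply the stages in order; a stage returning None drops the document."""
    kept = []
    for doc in docs:
        current: Optional[Doc] = doc
        for stage in stages:
            current = stage(current)
            if current is None:
                break  # e.g. wrong language or out-of-domain text
        if current is not None:
            kept.append(current)
    return kept


def crawl_bilingual(docs: Iterable[Doc], stages: List[Stage],
                    detect_pairs: Callable[[List[Doc]], list]) -> list:
    """Run the monolingual stages, then pair detection on the surviving documents."""
    return detect_pairs(run_monolingual_stages(docs, stages))
```

In this outline the stages list would hold, in order, functions for normalization, language identification, cleaning and text classification, and detect_pairs stands in for the component that detects parallel pages.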
The workflows can be used in the Taverna Workflow Management System xxxx://xxx.xxxxxxx.xxx.xx/. See Appendix 7.5 for some example workflows for Greek and German.
Architecture and Design of the Platform. In more detail, according to a) the [PANACEA Annex I] sections concerning WP5 (Parallel corpus and derivatives) and WP6 (Lexical Acquisition) and b) the user requirements documented in D8.1 User Requirements, the minimum set of NLP tools required for the project's needs can be grouped as follows:
Architecture and Design of the Platform section 6.1.2. The XML files will also contain the extracted text converted to UTF-8. Paragraph indicators from the HTML pages will be carried over and will guide paragraph segmentation of the text in the XML files.
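A minimal Python sketch of this idea, assuming that a fixed set of block-level HTML tags serves as the paragraph indicators and that the source encoding of the page is already known. The class name ParagraphExtractor, the function html_to_paragraphs and the chosen tag set are illustrative assumptions, not the project's actual implementation. The sketch does not remove boilerplate, scripts or style sheets.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect text and start a new paragraph at every block-level tag."""

    # Assumed set of paragraph indicators; the real tool may use others.
    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []

    def flush(self):
        # Collapse whitespace and store the paragraph if it is non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def html_to_paragraphs(raw_bytes, source_encoding):
    """Decode the page and return its paragraphs as a list of strings."""
    parser = ParagraphExtractor()
    parser.feed(raw_bytes.decode(source_encoding))
    parser.close()
    parser.flush()  # keep any text that follows the last block tag
    return parser.paragraphs
```

The returned strings are ordinary Python text, so writing them out with encoding="utf-8" gives the UTF-8 conversion mentioned above, and each list item can become one paragraph element in the XML file.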
Architecture and Design of the Platform. As required for D7.2 First evaluation report (T14). We expect the final version of each corpus to consist of 1M tokens.
Architecture and Design of the Platform. For illustration purposes, we assume as input a web page in Spanish referring to the EU aid for Haiti after the 2010 earthquake. In the example XML file below, we include the output of WP4 tools for corpus acquisition, normalization and text processing. Brief comments on the XML structure follow and refer to this example.

<?xml version="1.0"?>
<cesDoc id="news_20100514_haiti_es" version="0.4" xmlns="xxxx://xxx.xxxx.xxx/schema/2003">
  <cesHeader version="0.4">
    <fileDesc>
      <titleStmt>
        <title>La UE enviará más ayuda a Haití</title>
        <respStmt>
          <resp>
            <type>Crawling</type>
            <name>Panacea partner</name>
          </resp>
        </respStmt>
        <respStmt>
          <resp>
            <type>Boilerplate removal, text extraction, paragraph detection, etc. </type>
            <name>Panacea partner</name>
          </resp>
        </respStmt>
        <respStmt>
          <resp>
            <type>Sentence splitting, tokenization, POS tagging, lemmatization, parsing </type>
            <name>Panacea partner</name>
          </resp>
        </respStmt>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <monogr>
            <author>EU web author if available</author>
            <imprint>
              <publisher>EU</publisher>
              <pubDate>2010-02-20</pubDate>
              <eAddress type="web"> xxxx://xx.xxxxxx.xx/news/external_relations/100218_es.htm </eAddress>
            </imprint>
          </monogr>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language iso639="es"/>
      </langUsage>
      <textClass>
        <keywords>
          <keyTerm>Comisión</keyTerm>
          <keyTerm>Haití</keyTerm>
          <keyTerm>terremoto</keyTerm>
          <keyTerm>. . .</keyTerm>
        </keywords>
        <domain>International News</domain><!-- or automotive, environment, legal -->
        <subdomain>Optional information on subdomain</subdomain>
        <subject>Optional information on the subject</subject>
      </textClass>
      <annotations>
        <annotation xxx.loc="news_20100514_haiti_es.html" type="htmlsource"/>
      </annotations>
    </profileDesc>
  </cesHeader>
  <text>
    <body>
      <p id="p1">
        <s id="s1">
          <graph root="s1_500">
            <terminals>
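To show how the header of such a file could be consumed, here is a small Python sketch based on the standard xml.etree.ElementTree module. It relies only on the element and attribute names visible in the example (title, language with iso639, keyTerm, domain). The function name read_header and the shortened SAMPLE document are assumptions made for illustration. The namespace is read from the root element rather than hard-coded.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the example file (header only).
SAMPLE = (
    '<?xml version="1.0"?>'
    '<cesDoc id="news_20100514_haiti_es" version="0.4" '
    'xmlns="xxxx://xxx.xxxx.xxx/schema/2003">'
    '<cesHeader version="0.4">'
    '<fileDesc><titleStmt><title>La UE enviará más ayuda a Haití</title>'
    '</titleStmt></fileDesc>'
    '<profileDesc><langUsage><language iso639="es"/></langUsage>'
    '<textClass><keywords><keyTerm>Comisión</keyTerm><keyTerm>Haití</keyTerm>'
    '<keyTerm>terremoto</keyTerm></keywords>'
    '<domain>International News</domain></textClass></profileDesc>'
    '</cesHeader></cesDoc>'
)


def read_header(xml_text):
    """Return the document id, title, language, keywords and domain."""
    root = ET.fromstring(xml_text)
    # ElementTree stores tags as "{namespace}name"; reuse the root's prefix.
    prefix = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    title = root.find(f".//{prefix}title")
    language = root.find(f".//{prefix}language")
    domain = root.find(f".//{prefix}domain")
    return {
        "id": root.get("id"),
        "title": title.text if title is not None else None,
        "language": language.get("iso639") if language is not None else None,
        "keywords": [k.text for k in root.iter(f"{prefix}keyTerm")],
        "domain": domain.text if domain is not None else None,
    }


header = read_header(SAMPLE)
```

Reading the namespace from the root tag keeps the sketch usable even if the schema URI of the real files differs from the one shown in the example.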
Architecture and Design of the Platform. These tools will be used in the production of the second version of the monolingual corpora.
• T20: Internal deliverable. Version 2 of the monolingual corpora of English, Spanish, Italian, French and Greek annotated for POS and lemma. Result of the 2nd development cycle after the first evaluation cycle.
Architecture and Design of the Platform. These tools will be used in the production of the third version of the monolingual corpora and will be part of the final version of the WP4 prototype.
• T28: Internal deliverable. Version 3 of the monolingual corpora of English, Spanish, Italian, French and Greek with syntactic annotations. Result of the 3rd development cycle after the first evaluation cycle.
• T29: Internal deliverable. Version 2 of the bilingual corpora of EN-FR, EN-EL in the environment and labor legislation domains, with syntactic annotations. As required for the third evaluation cycle detailed in D7.1 Criteria for evaluation of resources, technology and integration.
Architecture and Design of the Platform. T21: D4.4. 2nd version of the prototype and documentation. The revised prototype will integrate dedicated web services for normalization (including boilerplate removal and duplicate document detection), as required for D7.3 Second evaluation report (T22).