Architecture and Design of the Platform the NLP functionalities relevant to this deliverable share two mandatory parameters, input and language. When applicable, we document additional, tool-specific parameters. Footnote 14: The workflows can be used in the Taverna Workflow Management System xxxx://xxx.xxxxxxx.xxx.xx/. See Appendix 7.5 for some example workflows for Greek and German.
Architecture and Design of the Platform. In more detail, according to a) the [PANACEA Annex I] sections concerning WP5 (Parallel corpus and derivatives) and WP6 (Lexical Acquisition) and b) the user requirements documented in D8.1 User Requirements, the minimum set of NLP tools required for the project needs can be grouped as follows
Architecture and Design of the Platform section 6.1.2. The XML files will also contain the extracted text converted to UTF-8. Paragraph indicators from the HTML pages will be transferred and will guide paragraph segmentation of the text in the XML files.
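As an illustration of how paragraph indicators might guide segmentation, the following is a minimal sketch that assumes the HTML `<p>` element is the only paragraph indicator and that paragraphs are written as `<p id="pN">` elements inside a `<body>` element. The class and function names (`ParagraphCollector`, `html_to_xml_body`) are hypothetical and are not part of the PANACEA tools; boilerplate removal is not attempted here.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ParagraphCollector(HTMLParser):
    """Collect the text found inside each <p> element of an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Collapse runs of whitespace left over from the HTML layout.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None


def html_to_xml_body(html: str) -> bytes:
    """Return a UTF-8 encoded <body> element with one <p> per HTML paragraph."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    body = ET.Element("body")
    for index, text in enumerate(collector.paragraphs, start=1):
        paragraph = ET.SubElement(body, "p", id=f"p{index}")
        paragraph.text = text
    return ET.tostring(body, encoding="utf-8")


if __name__ == "__main__":
    page = "<html><body><div>menu</div><p>Haití recibe ayuda.</p></body></html>"
    print(html_to_xml_body(page).decode("utf-8"))
```

Text outside `<p>` elements (the `menu` block in the usage example) is discarded in this sketch. A real normalization step would use a dedicated boilerplate detector rather than this simple rule.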
Architecture and Design of the Platform. As required for D7.2 First evaluation report (T14). We expect the final version of each corpus to consist of 1M tokens.
Architecture and Design of the Platform. For illustration purposes, we assume as input a web page in Spanish referring to the EU aid for Haiti after the 2010 earthquake. In the example XML file below, we include the output of WP4 tools for corpus acquisition, normalization and text processing. Brief comments on the XML structure follow and refer to this example.

<?xml version="1.0"?>
<cesDoc id="news_20100514_haiti_es" version="0.4" xmlns="xxxx://xxx.xxxx.xxx/schema/2003">
<cesHeader version="0.4">
<type>Crawling</type>
<name>Panacea partner</name>
</resp>
</respStmt>
<respStmt>
<resp>
<type>Boilerplate removal, text extraction, paragraph detection, etc. </type>
<name>Panacea partner</name>
</resp>
</respStmt>
<respStmt>
<resp>
<type>Sentence splitting, tokenization, POS tagging, lemmatization, parsing </type>
<name>Panacea partner</name>
</resp>
</respStmt>
</titleStmt>
<sourceDesc>
<biblStruct>
<monogr>
<author>EU web author if available</author>
<imprint>
<publisher>EU</publisher>
<pubDate>2010-02-20</pubDate>
<eAddress type="web"> xxxx://xx.xxxxxx.xx/news/external_relations/100218_es.htm </eAddress>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
<language iso639="es"/>
</langUsage>
<textClass>
<keywords>
<keyTerm>Comisión</keyTerm>
<keyTerm>Haití</keyTerm>
<keyTerm>terremoto</keyTerm>
<keyTerm>. . .</keyTerm>
</keywords>
<domain>International News</domain><!-- or automotive, environment, legal -->
<subdomain>Optional information on subdomain</subdomain>
<subject>Optional information on the subject</subject>
</textClass>
<annotations>
<annotation xxx.loc="news_20100514_haiti_es.html" type="htmlsource"/>
</annotations>
</profileDesc>
</cesHeader>
<text>
<body>
<p id="p1" >
<s id="s1">
<graph root="s1_500">
<terminals>

Footnote 24: xxxx://xxx.xxxx.xxx
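To show how a consumer might read the header metadata of such a file, the following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `SAMPLE` string is a reduced, well-formed stand-in written for this illustration: it reuses element names, attribute names and the namespace string from the example, but the `<profileDesc>` and `<langUsage>` opening tags and the overall nesting are assumptions, since the excerpt above is incomplete. The function name `read_header` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace string copied as it appears in the example (assumed to be the
# default namespace of every element in the file).
NS = {"ces": "xxxx://xxx.xxxx.xxx/schema/2003"}

# Reduced, well-formed stand-in for the example file (nesting is assumed).
SAMPLE = """<?xml version="1.0"?>
<cesDoc id="news_20100514_haiti_es" version="0.4" xmlns="xxxx://xxx.xxxx.xxx/schema/2003">
  <cesHeader version="0.4">
    <profileDesc>
      <langUsage><language iso639="es"/></langUsage>
      <textClass>
        <keywords>
          <keyTerm>Comisión</keyTerm>
          <keyTerm>Haití</keyTerm>
          <keyTerm>terremoto</keyTerm>
        </keywords>
        <domain>International News</domain>
      </textClass>
    </profileDesc>
  </cesHeader>
</cesDoc>"""


def read_header(xml_text: str) -> dict:
    """Extract document id, language, key terms and domain from the header."""
    root = ET.fromstring(xml_text)
    language = root.find(".//ces:language", NS)
    return {
        "id": root.get("id"),
        "language": language.get("iso639") if language is not None else None,
        "keywords": [term.text for term in root.iterfind(".//ces:keyTerm", NS)],
        "domain": root.findtext(".//ces:domain", namespaces=NS),
    }


if __name__ == "__main__":
    print(read_header(SAMPLE))
```

Searching with `.//` keeps the sketch independent of the exact depth at which each element sits, which is useful here because the full header structure is not visible in the excerpt.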
Architecture and Design of the Platform the NLP functionalities relevant to this deliverable share two mandatory parameters, input and language. When applicable, we document additional, tool-specific parameters. Another prerequisite for integrating a tool in the PANACEA platform is that it can process input and generate output in the common encoding format documented in D3.1, Section 6.1.3. To achieve this goal, PANACEA partners have investigated two approaches. UPF, DCU and CNR have built specific web services to perform I/O conversions from and to their tools. ILSP has adapted its NLP tools by integrating importers and exporters from and to the common encoding format. Finally, for each service, we provide links to entries in the PANACEA registry, where (updated) documentation and access information will be provided during and after the project’s timeline, thus ensuring the sustainability of the PANACEA platform.
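The parameter convention can be sketched as follows. This is an illustrative helper only, not the actual PANACEA service interface: the class name `ServiceRequest`, the `extra` field for tool-specific parameters, and the parameter name `model` in the usage example are all hypothetical. Only the two mandatory parameter names, input and language, come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict

MANDATORY = ("input", "language")


@dataclass
class ServiceRequest:
    """Parameters for one call to an NLP web service (hypothetical helper)."""

    input: str      # mandatory: the data to process, or a reference to it
    language: str   # mandatory: language of the input, e.g. "es"
    extra: Dict[str, str] = field(default_factory=dict)  # tool-specific

    def to_params(self) -> Dict[str, str]:
        """Validate the request and return a flat parameter dictionary."""
        for name in MANDATORY:
            if not getattr(self, name):
                raise ValueError(f"missing mandatory parameter: {name}")
        clash = self.extra.keys() & set(MANDATORY)
        if clash:
            raise ValueError(f"tool-specific parameters shadow: {sorted(clash)}")
        params = {"input": self.input, "language": self.language}
        params.update(self.extra)
        return params


if __name__ == "__main__":
    request = ServiceRequest(
        input="news_20100514_haiti_es.xml",
        language="es",
        extra={"model": "news"},  # hypothetical tool-specific parameter
    )
    print(request.to_params())
```

Keeping the mandatory parameters as explicit fields and the tool-specific ones in a separate mapping makes the shared part of the interface visible in the type, while leaving each tool free to define its own additions.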
Architecture and Design of the Platform. These tools will be used in the production of the second version of the monolingual corpora. • T20: Internal deliverable. Version 2 of the monolingual corpora of English, Spanish, Italian, French and Greek annotated for POS and lemma. Result of the 2nd development cycle after the first evaluation cycle.
Architecture and Design of the Platform. These tools will be used in the production of the third version of the monolingual corpora and will be part of the final version of the WP4 prototype. • T28: Internal deliverable. Version 3 of the monolingual corpora of English, Spanish, Italian, French and Greek with syntactic annotations. Result of the 3rd development cycle after the first evaluation cycle. • T29: Internal deliverable. Version 2 of the bilingual corpora of EN-FR, EN-EL in the environment and labor legislation domain, annotated with syntactic annotations. As required for the third evaluation cycle detailed in D7.1 Criteria for evaluation of resources, technology and integration.
Architecture and Design of the Platform. 2nd version of the prototype and documentation. The revised prototype will integrate dedicated web services for normalization (including boilerplate removal and duplicate document detection), as required for D7.3 Second evaluation report (T22). • T22: Internal deliverable. Partners adapt NLP tools focusing on parsing and/or chunking for DE, EN, EL, ES, IT, FR. The I/O of all tools will be conformant with the
Architecture and Design of the Platform. This document is part of the technical documentation generated in the PANACEA Project, Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition (Grant Agreement no. 248064). This document is licensed under a Creative Commons Attribution 3.0 Spain License. To view a copy of this license, visit xxxx://xxxxxxxxxxxxxxx.xxx/licenses/by/3.0/es/. Please send feedback and questions on this document to: xxxxxxx@xxx.xxx TRL Group (Tecnologies dels Recursos Lingüístics), Institut Universitari de Lingüística Aplicada, Universitat Xxxxxx Xxxxx (XXXX-UPF)
1 Introduction
2 Terminology
2.1 Definitions
2.2 Acronyms
3 Goals
4 Current state of the art. Analysis of existing tendencies, approaches and tools
4.1 Frameworks
4.1.1 UIMA
4.1.1.1 U-compare
4.1.2 GATE
4.1.3 Concluding remarks. Comparative analysis and recommendations.
4.2 Web Services
4.2.1 WSDL
4.2.2 SOAP
4.2.3 REST
4.2.4 AXIS
4.2.5 Common interfaces and tool integration
4.2.6 Metadata / Ontology / Closed vocabularies
4.2.7 Concluding remarks. Comparative analysis and recommendations.
4.3 Workflow systems
4.3.1 Xxxxxx
4.3.2 Xxxxxx
4.3.3 Taverna (myGrid)
4.3.4 LoonyBin
4.3.5 Concluding remarks. Comparative analysis and recommendations.
4.4 Grid infrastructure
4.4.1 Globus
4.4.2 EGEE
4.4.3 MyGrid
4.4.4 TextGrid
4.4.5 NorduGrid
4.4.6 Concluding remarks. Comparative analysis and recommendations.
4.5 The Registry
4.5.1 UDDI
4.5.2 Feta
4.5.3 BioCatalogue
4.5.4 Concluding remarks. Comparative analysis and recommendations.
4.6 Wrappers
4.6.1 Soaplab