XXXX Preprocess and XXXX Tokenizer. The XXXX Preprocess and XXXX Tokenizer services provide preprocessing functionality for Spanish. XXXX Preprocess segments text into minor structural units (titles, paragraphs, sentences, etc.); detects entities not usually found in dictionaries (numbers, abbreviations, URLs, emails, proper nouns, etc.); and ensures that sequences of two or more words (in dates, phrases, proper nouns, etc.) are kept together as a single block. The XXXX Tokenizer service delivers the same results vertically tokenized, one word per line. Both services accept input and produce output encoded in UTF-8 or ISO-8859-1/-15.

Both services employ the XXXX Processing Tool (IPT), developed by Xxxxxxxx and Xxxxxxx (2010). IPT is rule-based, and its rules rely on a series of resources to improve results: a grammatical phrase list, a foreign expression list, a follow-up abbreviation list, a word-form lexical database (also used by the XXXX POS-tagger described in the following subsection), and a stop-list that increases lexical-lookup efficiency.

IPT has been evaluated against a hand-tagged corpus used as a gold standard, divided into two domain-specific collections (Press and Genomics). Xxxxxxxx et al. (2010) report sentence-splitting accuracies of 99.39% and 91.55% for the two collections; the respective results for named entity recognition (NER) are 95.43% and 99.76%.

Web form: xxxx://xxxxxxxx.xxx.xxx/soaplab2-axis/#chunking_segmentation.iula_preprocess_row, xxxx://xxxxxxxx.xxx.xxx/soaplab2-axis/#tokenization.iula_tokenizer_row
WSDL: xxxx://xxxxxxxx.xxx.xxx/soaplab2-axis/services/chunking_segmentation.iula_preprocess?wsdl, xxxx://xxxxxxxx.xxx.xxx/soaplab2-axis/services/tokenization.iula_tokenizer?wsdl
PANACEA Catalogue Entry: xxxx://xxxxxxxx.xxxx.org/services/124, xxxx://xxxxxxxx.xxxx.org/services/119

Table 5 WS Details for XXXX Preprocess and XXXX Tokenizer
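
To make the idea of vertical tokenization with multiword blocks more concrete, the following is a minimal Python sketch. It is not IPT and does not reproduce its rules or resources: the phrase list `MULTIWORD_UNITS`, the regular expression `TOKEN_RE`, the function names, and the use of an underscore to join the words of a block are all assumptions made for illustration only.

```python
import re

# Hypothetical stand-in for a phrase list; IPT's actual resources are not shown here.
MULTIWORD_UNITS = {("sin", "embargo"), ("a", "pesar", "de")}
MAX_UNIT_LEN = max(len(unit) for unit in MULTIWORD_UNITS)

# Illustrative token pattern: URLs, emails, numbers, words, then single punctuation marks.
TOKEN_RE = re.compile(
    r"https?://[^\s,;]+"
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
    r"|\d+(?:[.,]\d+)*"
    r"|\w+"
    r"|[^\w\s]"
)


def tokenize(text):
    """Split text into tokens and merge known multiword units into one block."""
    tokens = TOKEN_RE.findall(text)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate unit first, down to two-word units.
        for n in range(MAX_UNIT_LEN, 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in MULTIWORD_UNITS:
                # Assumption: blocks are joined with "_" to keep them on one line.
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def to_vertical(text):
    """Return the tokens one per line, in the style of a vertical tokenizer."""
    return "\n".join(tokenize(text))
```

Calling `to_vertical("Sin embargo, cuesta 3,5 euros.")` places the phrase "Sin embargo" on a single line as one block and keeps the decimal number "3,5" intact, with every other word and punctuation mark on its own line.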