The Corpus Sample Clauses

The Corpus. For the remaining packages of the lexicon, automatic contextual disambiguation is attempted using a parallel corpus. The goal is to find conceptual contexts in the corpus that allow translation alternatives to be disambiguated.
The Corpus. A corpus of manually-written summaries of texts has been assembled from materials provided to participants in the Document Understanding Conferences, which have been held annually since 2001. It is available at the DUC Web site to readers who are qualified to access the DUC document sets on application to NIST. Most summaries in the corpus are abstracts, written by human readers of the source document to best express its content without restriction in any manner save length (words or characters). One method of performing automatic summarization is to construct the desired amount of output by concatenating representative sentences from the source document, which reduces the task to one of determining most adequately what ‘representative’ means. Such summaries are called extracts. In 2002, recognizing that many participants summarize by extraction, NIST produced versions of documents divided into individual sentences and asked its author volunteers to compose their summaries similarly. Because we use a sentence-extraction technique in our summarization system, this data is of particular interest to us. It is not included in the corpus being treated here and will be discussed in a separate paper. The XXX xxxxxx contains 11,867 files organized in a three-level hierarchy of directories totaling 62MB. The top level identifies the source year and exists simply to avoid the name collision which occurs when different years use same-named subdirectories.

Table 1: Number of Documents and Summaries by Size and by Year with Document : Summary Ratios

         DOCUMENTS                           SUMMARIES                           D : S
Year       10    50   100   200     ∑          10    50   100   200     ∑
2001        -    28   316    56   400           -    84   949   165  1198      1 : 3
2002       59    59   626    59   803         116   116  1228   116  1576      1 : 2
2003      624     -    90     -   714        2496     -   360     -  2856      1 : 4
2004      740     -   124     -   864        2960     -   496     -  3455      1 : 4
∑        1423    87  1156   115  2781        5572   200  3033   281  9086      1 : 3

The middle 291 directories identify the document clusters; DUC reuses collections of newswire stories, each reporting on a common topic or theme, that were assembled for the TREC and TDT research initiatives. Directories on the lowest level contain tagged and untagged versions of 2,781 individual source documents, and between one and five summaries of each, 9,086 in total. In most cases the document involved is just that: a single story originally published in a newspaper. However, 552 directories, approximately 20% of the corpus, represent multi-document summaries: ones which the author has based on all the files in a cluster of related documents. For these summaries...
The Corpus. For our corpus study we extracted data from the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus).4 The CGN is based on roughly 1000 hours of contemporary Dutch from the Netherlands and Flanders. The speech is composed of different genres, ranging from face-to-face and telephone conversations to interviews, debates, radio shows and read-aloud books. The speech files, amounting to roughly 10M words, have been orthographically transcribed, lemmatized, and tagged for part-of-speech information. Moreover, about 10% of the corpus has been syntactically annotated (van der Wouden et al. 2002). From this syntactically annotated part of the corpus we have extracted all prepositional phrases. This amounted to 57,287 PP instances containing 139 unique adpositions and 12,947 unique heads in the adpositional complements. From this set we extracted all heads of the adpositional complements with a frequency higher than 10 occurrences. These 766 unique words were subsequently annotated by the two authors for their animacy using the coding scheme of Xxxxxx et al. (2004) which provides a 9-way classification. Where possible, disagreement was resolved by discussion. Of these 766 words, 154 were left out due to unresolved disagreement between the two annotators and 53 because they contained context-dependent elements,

4 xxxx://xxxxx.xxx.xxx.xx/cgn/ehome.htm
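The frequency cut-off described in the excerpt above (keeping only complement heads that occur more than 10 times) can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `frequent_heads`, the `(adposition, head)` tuple layout, and the toy Dutch words are all assumptions made for the example, and the actual CGN annotation format is not reproduced here.

```python
from collections import Counter

def frequent_heads(pp_instances, min_count=10):
    """Return {head: count} for heads occurring MORE than min_count times.

    pp_instances is a hypothetical list of (adposition, head) pairs,
    one pair per prepositional-phrase instance.
    """
    counts = Counter(head for _adposition, head in pp_instances)
    return {head: n for head, n in counts.items() if n > min_count}

# Toy data (invented for illustration): three heads with 12, 10 and 11 hits.
pps = [("in", "huis")] * 12 + [("op", "tafel")] * 10 + [("met", "man")] * 11
selected = frequent_heads(pps)
print(selected)
```

With the strict "higher than 10" reading used here, a head that occurs exactly 10 times is dropped; a study using "at least 10" would change `>` to `>=`.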
The Corpus. The initial data used to examine the issues mentioned above are taken from previous accounts of conjunct agreement in both English and Serbian. Thus, the data from English are provided by Lorimor (2007), among others, and the initial data from Serbian are found in Xxxxxxx (1983), Xxxxxxxxxx (1979), and Xxxxxxxx (2009). After the examination of these works and the identification of basic problems, a survey was conducted in order to look into the basic patterns of agreement employed by speakers of Serbian in their active production. The survey was completed by 60 participants, all native speakers of Serbian. The speakers were asked to do a production task, supplying the missing agreement information on the verb based on the conjoined subjects, whose features were varied. The results of this survey provide the material on which the theoretical model of conjunct agreement developed in the thesis is based. The thesis is organized as follows. Section 2 gives a detailed introduction to the process of agreement and the role of features in that process, as well as the nature of features themselves. Section 3 focuses on agreement with conjoined subjects. It provides a brief overview of agreement patterns with conjoined subjects in English and Serbian. The purpose of Section 4 is to explain the mechanism of agreement and the structure of the coordinate phrase, so as to help the reader understand the syntactic mechanisms of conjunct agreement provided in the following sections. Section 5 presents previous syntactic accounts of conjunct agreement. The accounts presented here provide a basis for the analysis of the data gained in the research. Section 6 identifies the basic problems tackled by the research. Subsequently, it presents the results of the research together with their analysis. Section 7 contains concluding remarks.
The Corpus. In this chapter, the generation of the Covid-themed Tweets Dataset is discussed in detail. The source of the data (Section 3.1), the mechanism and word choice for tweet scraping (Section 3.1), the rationale for choosing data produced in the twelve-day span (Section 3.2), the preliminary filtering process (Section 3.3), and string removal (Section 3.4) are elaborated to demonstrate our dataset’s integrity. To ensure the quality of the data, we additionally apply quality assurance procedures (Section 3.5), in the hope of convincing readers that the Covid-themed Tweets Dataset can serve as a valid and rich event detection research resource for the NLP community.
The Corpus. In this chapter, the generation of FriendsQA (Section 3.1) is discussed in detail. The web interface used for crowdsourcing (Section 3.2), the different rounds of experiments (Section 3.6) and the two phases in each round (Sections 3.3 and 3.5) are elaborated and explained to demonstrate our dataset’s integrity and diversity. To ensure the quality of the data, we additionally apply quality assurance procedures (Section 3.4), question and answer pruning (Section 3.7), inter-annotator agreement (Section 3.8) and an extensive analysis of question-answer types (Section 3.9), in the hope of convincing readers that FriendsQA can serve as a valid and rich QA research resource for the NLP community.
The Corpus. Corpus analysis was employed to examine agreement patterns in Somali sentences. In this study the relevant sentences must contain a specific focus particle that appears in two different forms, waxa and waxaa. To ensure consistency, the corpora were searched for both variants. However, to simplify the text, all examples in this thesis were converted to the longer variant waxaa. While the chosen examples are typically short and often start with the focus particle, it is important to note that this does not necessarily mean that the sentences always begin and end in that structure. This study aimed to examine agreement, particularly gender agreement, in Somali clauses containing the focus particle waxaa, and the impact of the particle on that agreement. To narrow the scope, five verbs were selected. In sentences with a verb and the focus particle waxaa, a subject noun can be either feminine or masculine, in both singular and plural forms. The corpus contained thousands of sentences with the focus particle and the five chosen verbs paired with various nouns. The pictures below show an example of two sentences with different verb forms but an identical subject noun. Picture 1: waxaa jir(t)a cabsi (source: HaBiT)
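The search for both spellings of the focus particle, followed by conversion to the longer variant, can be sketched with a regular expression. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the three test strings are built from the Picture 1 caption rather than taken from the corpus, and the actual query interface used for the HaBiT corpora is not shown in the excerpt.

```python
import re

# Whole-word match for "waxa" or "waxaa", ignoring case.
# The trailing \b keeps longer words such as "waxaan" from matching.
FOCUS_PARTICLE = re.compile(r"\bwaxaa?\b", re.IGNORECASE)

def contains_focus_particle(sentence: str) -> bool:
    """True if the sentence contains either spelling of the particle."""
    return FOCUS_PARTICLE.search(sentence) is not None

def normalize_to_long_variant(sentence: str) -> str:
    """Rewrite every occurrence as the long variant, preserving capitalisation."""
    # The first four characters are always "waxa" (in some casing); append "a".
    return FOCUS_PARTICLE.sub(lambda m: m.group(0)[:4] + "a", sentence)

# Illustrative strings only (assumption), derived from the Picture 1 caption.
sentences = ["waxa jirta cabsi", "Waxaa jira cabsi", "cabsi"]
hits = [s for s in sentences if contains_focus_particle(s)]
normalized = [normalize_to_long_variant(s) for s in hits]
print(normalized)
```

Matching both variants in a single pattern avoids running two separate queries and merging the results by hand.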

