Embedding Training. Wikipedia2Vec provides pre-trained embeddings. These embeddings, however, are not available for all entities in Wikipedia; e.g., 25% of the assessed entities in the DBpedia-Entity V2 collection have no pre-trained embedding. The reasons for these missing embeddings are two-fold: (i) "rare" entities were excluded from the training data, and (ii) entity identifiers evolve over time, resulting in mismatches with the identifiers used in the DBpedia-Entity collection.

For training new graph embeddings, we used the Wikipedia 2019-07 dump, which was the newest version at the time of training. We address the entity mismatch problem by identifying the entities that have been renamed in the new Wikipedia dump. Some of these entities were obtained using the redirect API of Wikipedia.² Others were found by matching the Wikipedia page IDs of the two Wikipedia dumps. The page IDs of Wikipedia 2019-07 were available on the Wikipedia website. For the dump on which DBpedia-Entity is based, however, these IDs are no longer available; we obtained them from the Nordlys package [11].

To avoid excluding rare entities and to generate embeddings for a wide range of entities, we changed several Wikipedia2Vec settings. The two settings that had the largest effect on entity coverage are: (i) the minimum number of times an entity appears as a link in Wikipedia, and (ii) whether to include or exclude disambiguation pages. Table 1 shows the effect of these settings on the number of missing entities; specifically, the number of entities that are assessed in the DBpedia-Entity collection but have no embedding. We categorize these missing entities into two groups:

– No-page: Entities without any pages. These entities were neither found by the Wikipedia redirect API nor matched by their page IDs.
– No-emb: Entities that could be found by their identifiers, but were not included in the Wikipedia2Vec embeddings.

Table 1. Missing entities with different settings

Settings                                       No-emb  No-page  Total
min-entity-count = 5, disambiguation = False   9640    608      10248
min-entity-count = 1, disambiguation = False   0000    000      0000
min-entity-count = 1, disambiguation = True    0000    000      0000
min-entity-count = 0, disambiguation = False   000     000      0000
min-entity-count = 0, disambiguation = True    000     000      0000

The first line in Table 1 corresponds to the default setting of Wikipedia2Vec, which covers only 75% of the assessed entities in the DBpedia-Entity collection. When considering all entities in the knowledge graph, this setting discards an even larger number of entities, which is not an ideal setup for entity ranking. By choosing the right settings (the last line of Table 1), we increased the coverage of entities to 97.6%. We trained two versions of embeddings: with and without the link graph; i.e., using Eq. (4) with and without the L_e component.

² xxxxx://xxxxxxxxx.xxxxxxxxxxx.xx/en/latest/
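
The renaming steps and the No-emb / No-page split described above can be illustrated with a small sketch. It is not the authors' code: the function names (resolve_title, categorize_missing, coverage) and the dictionary-based inputs are hypothetical, and it assumes that redirects and page IDs have already been loaded into plain Python mappings rather than fetched from the Wikipedia API or the Nordlys package.

```python
from typing import Dict, Optional, Set, Tuple


def resolve_title(
    old_title: str,
    current_titles: Set[str],
    redirects: Dict[str, str],
    old_page_ids: Dict[str, int],
    new_titles_by_id: Dict[int, str],
) -> Optional[str]:
    """Map an entity title from the older dump to its title in the newer dump.

    All arguments are hypothetical, pre-loaded lookups (assumption).
    Returns None when no page can be found for the entity.
    """
    if old_title in current_titles:   # title did not change
        return old_title
    if old_title in redirects:        # renamed, recovered through a redirect
        return redirects[old_title]
    page_id = old_page_ids.get(old_title)
    if page_id is not None:           # renamed, recovered through its page ID
        return new_titles_by_id.get(page_id)
    return None


def categorize_missing(
    resolved: Dict[str, Optional[str]],
    embedded: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Split assessed entities without an embedding into (No-emb, No-page)."""
    no_emb: Set[str] = set()
    no_page: Set[str] = set()
    for old_title, new_title in resolved.items():
        if new_title is None:
            no_page.add(old_title)    # entity could not be located at all
        elif new_title not in embedded:
            no_emb.add(old_title)     # entity located, but has no embedding
    return no_emb, no_page


def coverage(n_assessed: int, n_missing: int) -> float:
    """Fraction of assessed entities that have an embedding."""
    return 1.0 - n_missing / n_assessed
```

In this sketch the Total column of Table 1 corresponds to len(no_emb) + len(no_page), and the reported coverage is one minus that total divided by the number of assessed entities.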