Wikidata is a large, structured, open knowledge graph that supports projects such as Wikipedia. We can query it with SPARQL (see the official Wikidata tutorial). SPARQL is a query language designed for the RDF (Resource Description Framework) data model. RDF organizes data as triples (subject, predicate, object). SPARQL queries against Wikidata can be run online through WDQS (Wikidata Query Service).
Introduction to Wikidata
Entities and properties
The most basic concepts in Wikidata are the entity and the property. Because natural-language names are ambiguous, every entity and property has a unique identifier made of a letter and a number: entity IDs are prefixed with Q and property IDs with P, for example Q148 (China) and P31 (instance of). Entities and properties form the nodes and edges of the Wikidata knowledge graph, respectively. Note that a property is not necessarily an edge connecting two entities, and this is very common. For example, P31 (instance of) indicates the class an entity belongs to and points to the corresponding class entity, whereas P1082 (population) gives the population of an entity (such as a country) and maps directly to a number. In addition, a property does not necessarily have a single value: because population changes over time, P1082 (population) usually has multiple values, each corresponding to a different point in time.
Data structure
Because online queries are limited by network speed, you can instead download a data dump from Wikidata Dumps. Downloading entities/.bz2 gives you the entire Wikidata knowledge graph as a single JSON file (about 100 GB compressed, more than 1 TB decompressed). The JSON file contains a list; each element of the list is a dictionary with the following fields:
- type: The entity type, either item or property. Most entries are items; a small number are properties.
- id: Unique identifier.
- labels: Multilingual labels, i.e. names.
- descriptions: Multilingual description.
- aliases: Multilingual aliases.
- claims: Statements, containing properties and their values. This is the most important field for the knowledge graph.
- sitelinks: Links to the corresponding pages in other wiki projects, such as Chinese Wikipedia or French Wikipedia.
- pageid: Page ID in sites such as Wikipedia.
- ns: The namespace the page is in. Items are usually in namespace 0, while user pages are in namespace 2.
- title: Title.
- lastrevid: The ID of the last revision.
- modified: Time of the last modification.
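A file of that size cannot reasonably be loaded in one go. The sketch below reads it one entity at a time. It assumes that the dump stores each entity on its own line, followed by a comma, between a line containing only `[` and a line containing only `]`. The function name `iter_entities` and the path passed to it are placeholders chosen for illustration.

```python
import bz2
import json


def iter_entities(path):
    """Yield one entity dict per line of a bz2-compressed JSON dump.

    Assumes the one-entity-per-line layout described above; the outer
    list brackets and the trailing commas are skipped or stripped.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)
```

Because this is a generator, memory use stays roughly constant however large the dump is. A loop such as `for entity in iter_entities(path): ...` can then filter on `entity["type"]` or `entity["id"]`.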
The claims field is a dictionary keyed by property ID P[...]. Each key maps to a list holding the statements for that property. Each statement is a dictionary structured as follows:
- mainsnak: The main part, containing the property and its value.
  - snaktype: The type of the mainsnak. Common values: value (a concrete value exists), novalue (there is no value), somevalue (the value is unknown or uncertain).
  - property: The property ID (for example P31).
  - datavalue: The value of the property, which can be of different data types such as entity, time, or quantity.
    - value: The actual value (for example Q5, meaning human).
    - type: The type of the value. Common types:
      - wikibase-entityid: an entity (item or property).
      - time: a point in time.
      - quantity: a numeric quantity.
      - string: a string (for example a text description).
  - datatype: The data type of the property, usually wikibase-item (pointing to another entity) or quantity (a number).
- type: The kind of record. The usual value is statement, indicating a regular statement.
- qualifiers: Qualifiers, which add further detail to the property value, for example a point in time or a place. Each qualifier contains:
  - property: The property ID of the qualifier.
  - datavalue: The value of the qualifier, usually the concrete content of that property.
- qualifiers-order: The order of the qualifier properties.
- rank: The rank of the statement. Common ranks:
  - normal: Normal rank.
  - preferred: Preferred rank (used when several statements exist).
  - deprecated: Deprecated statement.
- references: The sources for this statement, usually cited literature or other sources.
  - snaks: The content of the reference, structured like the statement's mainsnak, with a property ID and its value.
  - snaks-order: The order of the referenced properties.
- id: The unique ID of the statement (used to tell statements apart).
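The sketch below shows one way to walk this structure. It assumes that qualifiers are stored as a dictionary keyed by property ID, each key mapping to a list of snaks, and that entity values carry an `id` field. The helper names and the sample entity are invented for illustration: the population figures are made up, and the time value is trimmed to a single field.

```python
def best_statements(entity, pid):
    """Return the preferred statements for pid, else the normal-rank ones."""
    statements = entity.get("claims", {}).get(pid, [])
    preferred = [s for s in statements if s.get("rank") == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.get("rank") == "normal"]


def snak_value(snak):
    """Reduce a snak to a simple Python value; None for novalue/somevalue."""
    if snak.get("snaktype") != "value":
        return None
    datavalue = snak["datavalue"]
    value, kind = datavalue["value"], datavalue["type"]
    if kind == "wikibase-entityid":
        return value["id"]       # e.g. "Q5"
    if kind == "quantity":
        return value["amount"]   # kept as a string, e.g. "+2000"
    if kind == "time":
        return value["time"]
    return value                 # strings and anything not handled above


def qualifier_values(statement, pid):
    """Return the simplified values of one qualifier property."""
    snaks = statement.get("qualifiers", {}).get(pid, [])
    return [snak_value(s) for s in snaks]


def _snak(pid, kind, value):
    # Helper that builds a minimal snak for the sample data below.
    return {"snaktype": "value", "property": pid,
            "datavalue": {"value": value, "type": kind}}


# Made-up sample entity: two population statements, the newer one preferred.
sample = {
    "id": "Q148",
    "claims": {
        "P31": [{"mainsnak": _snak("P31", "wikibase-entityid", {"id": "Q3624078"}),
                 "type": "statement", "rank": "normal"}],
        "P1082": [
            {"mainsnak": _snak("P1082", "quantity", {"amount": "+1000", "unit": "1"}),
             "qualifiers": {"P585": [_snak("P585", "time", {"time": "+2010-01-01T00:00:00Z"})]},
             "type": "statement", "rank": "normal"},
            {"mainsnak": _snak("P1082", "quantity", {"amount": "+2000", "unit": "1"}),
             "qualifiers": {"P585": [_snak("P585", "time", {"time": "+2020-01-01T00:00:00Z"})]},
             "type": "statement", "rank": "preferred"},
        ],
    },
}
```

With this sample, `best_statements(sample, "P1082")` keeps only the preferred population statement, and `qualifier_values(..., "P585")` returns the point in time attached to it.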
SPARQL-Wikidata query
The following simple examples show how to query the Wikidata knowledge graph with SPARQL. The syntax resembles SQL, using SELECT and WHERE; the key is to build the query out of subject-predicate-object triples.
Basic usage: subject-predicate-object triples
Query all of Bach's children:
SELECT ?child
WHERE
{
# ?child father Bach
?child wdt:P22 wd:Q1339.
}
Here ?child is the variable being queried and can be named freely; wd is short for Wikidata and refers to entities; wdt stands for Wikidata Truthy and refers to the values of an entity's properties. The statement reads: ?child has the property father (P22) whose value is the entity Bach (Q1339). The query above returns a list of entity IDs. To add a label column to the results, add a special service statement provided by Wikidata:
SELECT ?child ?childLabel
WHERE
{
# ?child father Bach
?child wdt:P22 wd:Q1339.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Here wikibase:label is the fixed service name for entity labels. The label variable ?childLabel follows a fixed naming rule: the name of the entity variable, ?child, is the prefix, and the capitalized name of the field to fetch, Label, is the suffix.
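The same query can also be sent to WDQS from a script. The sketch below uses only the Python standard library and assumes the public endpoint https://query.wikidata.org/sparql with JSON results requested through a format=json parameter. The User-Agent string and the function names are placeholders chosen for illustration.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?child ?childLabel
WHERE
{
  ?child wdt:P22 wd:Q1339.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_request(query):
    """Build a GET request carrying the SPARQL query as a URL parameter."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        # Placeholder User-Agent; replace it with one describing your script.
        headers={"User-Agent": "wikidata-notes-example/0.1"},
    )


def run(query):
    """Send the query and return the decoded JSON response (needs network)."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return json.load(response)


def rows(result, variables):
    """Flatten SPARQL JSON results into tuples; unbound variables become None."""
    return [
        tuple(binding[v]["value"] if v in binding else None for v in variables)
        for binding in result["results"]["bindings"]
    ]
```

Calling `rows(run(QUERY), ["child", "childLabel"])` should yield one tuple per child, holding the entity URI and its English label.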
Children of Bach and Maria Barbara Bach who were both composers and pianists:
SELECT ?child
WHERE
{
?child wdt:P22 wd:Q1339;
wdt:P25 wd:Q57487; # P25: mother
wdt:P106 wd:Q36834, wd:Q486748. # P106: occupation
}
This example shows how the semicolon (;), comma (,), and period (.) are used and how they differ. A semicolon lets you omit the repeated subject ?child, a comma lets you omit the repeated predicate wdt:P106, and a period ends the statement.
Bach's grandchildren:
SELECT ?grandChild
WHERE
{
wd:Q1339 wdt:P40 ?child. # P40: child
?child wdt:P40 ?grandChild.
}
This is a multi-hop query: ?child serves as an intermediate hop that constrains the relationship between ?grandChild and Bach (Q1339). It can be simplified further to:
SELECT ?grandChild
WHERE
{
wd:Q1339 wdt:P40 [ wdt:P40 ?grandChild ].
}
The square brackets [] above express "some entity that has ?grandChild as a child", which removes the need for the intermediate variable ?child. It can be read as a sentence with a relative clause: Bach has a child who has a child ?grandChild.
Extended symbols: / * + |
The symbol / denotes a property path and chains several properties into a multi-hop query. The symbols * and + work as in regular expressions: * matches zero or more repetitions of a property, and + matches one or more. The symbol | means "or". The following examples show them in use.
All works of art:
SELECT ?work ?workLabel
WHERE
{
?work wdt:P31/wdt:P279* wd:Q838948. # instance of any subclass of work of art
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
All descendants of Bach:
SELECT ?descendant ?descendantLabel
WHERE
{
wd:Q1339 wdt:P40+ ?descendant.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
All descendants of Bach, this time found by following father (P22) or mother (P25) links:
SELECT ?descendant ?descendantLabel
WHERE
{
?descendant (wdt:P22|wdt:P25)+ wd:Q1339.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
In addition, the question mark ? is also a path operator: it matches zero or one occurrence of a property.
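To make the difference between +, * and ? concrete, the sketch below emulates the three operators on a small in-memory graph. The graph, its node names, and the function names are invented for illustration. This is plain Python that mimics the path semantics, not how WDQS evaluates queries.

```python
from collections import deque

# Hypothetical "child" edges: A has children B and C, B has D, D has E.
CHILD = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}


def one_or_more(graph, start):
    """Emulate `start prop+ ?x`: nodes reachable in one or more hops."""
    seen = set()
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def zero_or_more(graph, start):
    """Emulate `start prop* ?x`: the start node plus everything from prop+."""
    return {start} | one_or_more(graph, start)


def zero_or_one(graph, start):
    """Emulate `start prop? ?x`: the start node plus its direct neighbours."""
    return {start} | set(graph.get(start, []))
```

On this graph, + from A gives every descendant, * additionally includes A itself, and ? stops after at most one hop. This is why wdt:P31/wdt:P279* also matches direct instances of the target class.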
Query modifiers
Use modifiers to constrain the query results more precisely.
Sorting and limiting results
Use ORDER BY to sort the query results and LIMIT to cap the number of results returned.
Return the ten most populous sovereign states, in descending order of population:
SELECT ?country ?countryLabel ?population
WHERE
{
?country wdt:P31/wdt:P279* wd:Q3624078; # P31: instance of; P279: subclass of; Q3624078: sovereign state
wdt:P1082 ?population. # P1082: population
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 10
DESC sorts in descending order and ASC in ascending order; ascending is the default.
Optional criteria
If you want to return some fields without making them part of the matching conditions, use OPTIONAL.
All books by Arthur Conan Doyle, with the title and the other fields optional:
SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
?book wdt:P50 wd:Q35610. # P50: author; Q35610: Arthur Conan Doyle
OPTIONAL { ?book wdt:P1476 ?title. }
OPTIONAL { ?book wdt:P110 ?illustrator. }
OPTIONAL { ?book wdt:P123 ?publisher. }
OPTIONAL { ?book wdt:P577 ?published. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
Whether the title, illustrator, and so on match does not affect the results of ?book wdt:P50 wd:Q35610. Each value is returned if it matches and left blank otherwise. Note the difference from the following query:
SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
?book wdt:P50 wd:Q35610.
OPTIONAL {
?book wdt:P1476 ?title;
wdt:P110 ?illustrator;
wdt:P123 ?publisher;
wdt:P577 ?published.
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Here, if any one field inside the OPTIONAL block fails to match, all of the optional fields are left blank.
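The difference between the two queries can be mimicked on toy data. The sketch below is plain Python, not SPARQL, and simplifies by assuming each field has at most one value per book. The book records and function names are invented for illustration.

```python
# Hypothetical records: book2 has a title but no publisher.
BOOKS = {
    "book1": {"title": "Title 1", "publisher": "Publisher 1"},
    "book2": {"title": "Title 2"},
}
FIELDS = ("title", "publisher")


def separate_optionals(books):
    """One OPTIONAL per field: each field is filled in independently."""
    return {
        book: tuple(props.get(field) for field in FIELDS)
        for book, props in books.items()
    }


def single_optional(books):
    """One OPTIONAL around all fields: they match together or not at all."""
    result = {}
    for book, props in books.items():
        if all(field in props for field in FIELDS):
            result[book] = tuple(props[field] for field in FIELDS)
        else:
            result[book] = (None,) * len(FIELDS)
    return result
```

For book2, `separate_optionals` keeps the title and leaves only the publisher blank, while `single_optional` blanks both fields because the group as a whole failed to match. Both versions still return every book.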
Expressions: FILTER and BIND
Expressions are used for conditional tests and other operations. I will add notes here once I have used them.
Grouping
Group the results by a given field with GROUP BY. I will add notes here once I have used it.