Introduction to the IEDB Query API
Welcome to the new IEDB Query API (IQ-API)! You can now query the IEDB programmatically through a set of endpoints that cover most of the queries available from the IEDB home page, and work with the data directly in your preferred environment. This help article provides additional context on the IQ-API. Please note that this is a beta version, and we will continue to improve its features and usability. The help material will likewise be refined as we gain additional insight from our users. To provide feedback, contact us by email at help@iedb.org.
What is the IQ-API?
The IQ-API is built on PostgREST, which provides transparent access to the PostgreSQL tables on the backend. Each table can be queried through its own endpoint, and the endpoints are described in this interactive Swagger documentation.
What endpoints are available?
Core endpoints
While there are many endpoints provided, we expect the majority of users will want to search against one or more of the following tables, which correspond to the tabbed search results on the IEDB:
- epitope_search
- antigen_search
- tcell_search (assays)
- bcell_search (assays)
- mhc_search (assays)
- tcr_search (receptors)
- bcr_search (receptors)
- reference_search
Supporting endpoints
Additional supporting endpoints are available that map identifiers between the various tables. These endpoints have names of the form ‘TABLEX_to_TABLEY’. For instance, the ‘bcell_to_reference’ table links records in ‘bcell_search’ to the related records in ‘reference_search’. Each of these tables has exactly two columns and maps the unique identifiers (fields with a suffix of ‘_id’) between the two tables.
An additional endpoint, ‘curie_map’, is available that links CURIE prefixes (e.g., PMID) to their full IRIs (e.g., https://www.ncbi.nlm.nih.gov/pubmed/?term=). Further details on CURIEs and IRIs can be found below.
Finally, the ‘api_metrics’ endpoint simply provides record counts and build dates for each of the core endpoints.
How can I query the IQ-API?
As IQ-API is based upon PostgREST, queries must be performed using its rich and expressive query syntax, described here in detail. Detailed API walkthroughs are available in our IQ-API use case repository as Jupyter and RMarkdown notebooks, so please have a look there for more.
The most basic example of querying for the first 10 epitopes is provided here, using the ‘curl’ command.
curl "https://query-api.iedb.org/epitope_search?limit=10" | jq
[
{
"structure_id": 7355,
"structure_iri": "IEDB_EPITOPE:7355",
"structure_descriptions": [
"CYDLSCNQTVCQ"
],
"structure_starting_positions": [
136
],
...
Only the first part of the response is shown above, as the full response includes many fields. Note the pipe to ‘jq’: it is optional and is used here only to format the JSON nicely for display.
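For readers working in Python rather than on the command line, the same request can be sketched as follows. This is a minimal illustration using only the standard library; the helper names ‘build_query_url’ and ‘fetch_json’ are hypothetical and are not part of the IQ-API. The network call is wrapped in a function so that nothing is requested until you call it yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://query-api.iedb.org"


def build_query_url(endpoint, **params):
    """Return the URL for an endpoint plus its query parameters (hypothetical helper)."""
    # safe="," keeps comma-separated values such as select=a,b readable.
    query = urlencode(params, safe=",")
    return f"{BASE_URL}/{endpoint}?{query}" if query else f"{BASE_URL}/{endpoint}"


def fetch_json(url):
    """Send a GET request and decode the JSON response (hypothetical helper)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


# Same URL as the curl example above.
url = build_query_url("epitope_search", limit=10)
print(url)

# To actually run the query, call:
#   epitopes = fetch_json(url)
```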
Note that, by default, results are returned in JSON format. If CSV format is preferred, an additional ‘accept’ header needs to be provided in the GET request. The same query above would become:
curl "https://query-api.iedb.org/epitope_search?limit=10" -H "accept: text/csv"
Output is not shown as there are too many fields to display. However, we can limit the output to a subset of fields with the ‘select’ parameter. For example, if we only want the ‘structure_id’ and ‘linear_sequence’ of the first 10 epitopes and we want them returned in CSV format, the query becomes:
curl "https://query-api.iedb.org/epitope_search?limit=10&select=structure_id,linear_sequence" -H "accept: text/csv"
structure_id,linear_sequence
7355,CYDLSCNQTVCQ
7356,CYEDEATSVIPP
7357,CYEIKCKEPVECSGEPVLVK
7358,CYENDNPGL
7359,CYESLSEEY
7360,CYFDCSKSPPGA
7361,CYFEPQIRIL
7362,CYFILIFNI
7363,CYFILIFNII
7364,CYFMVFLQT
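The comma-separated response shown above can be loaded into Python with the standard ‘csv’ module. The sketch below is only an illustration: it parses the first three rows of that output from a string rather than calling the API, and the function name ‘parse_epitope_csv’ is hypothetical.

```python
import csv
import io

# First rows of the comma-separated response shown above.
csv_text = """structure_id,linear_sequence
7355,CYDLSCNQTVCQ
7356,CYEDEATSVIPP
7357,CYEIKCKEPVECSGEPVLVK
"""


def parse_epitope_csv(text):
    """Parse CSV text into a list of dicts, converting structure_id to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["structure_id"] = int(row["structure_id"])
        rows.append(row)
    return rows


epitopes = parse_epitope_csv(csv_text)
for epitope in epitopes:
    # Print each identifier with the length of its peptide sequence.
    print(epitope["structure_id"], len(epitope["linear_sequence"]))
```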
More examples will be added to this help article in the future.
What are IDs, IRIs, and CURIEs?
Several types of identifiers are used throughout the database to track unique records. First, there are internal integer record identifiers denoted with the suffix ‘_id’, e.g., ‘reference_id’. These are generally in the first field of each table. As they are internal to the IEDB, they cannot be linked directly to other resources. Many of the tables in the database also have fields that end in ‘_iri’, e.g., ‘reference_iri’. These are identifiers that resolve uniquely and unambiguously to records both within and outside of the IEDB. The Internationalized Resource Identifier (IRI) specification includes Uniform Resource Locators (URLs), which we use as globally unique identifiers, e.g., https://www.iedb.org/reference/1002786. A shortened version of an IRI, called a CURIE, can be constructed by replacing a portion of the IRI with a common prefix. The above IRI can be represented in CURIE format as ‘IEDB_REFERENCE:1002786’. By querying against the ‘curie_map’ endpoint, it is possible to find prefixes for converting between the two representations, e.g.: https://query-api.iedb.org/curie_map?limit=3
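The conversion between CURIEs and IRIs can be sketched in a few lines of Python. The prefix map below is an assumption built from the two examples in this article (the PMID prefix, and the IEDB_REFERENCE prefix inferred from the reference IRI above); in practice the prefixes should be taken from the ‘curie_map’ endpoint, whose exact column names are not assumed here. The function names are hypothetical.

```python
# Hypothetical prefix map based on the examples in this article.
# Real prefixes should be loaded from the 'curie_map' endpoint.
PREFIX_TO_IRI = {
    "PMID": "https://www.ncbi.nlm.nih.gov/pubmed/?term=",
    "IEDB_REFERENCE": "https://www.iedb.org/reference/",
}


def expand_curie(curie, prefix_map=PREFIX_TO_IRI):
    """Turn a CURIE such as 'IEDB_REFERENCE:1002786' into its full IRI."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or prefix not in prefix_map:
        raise ValueError(f"Unknown or malformed CURIE: {curie!r}")
    return prefix_map[prefix] + local_id


def contract_iri(iri, prefix_map=PREFIX_TO_IRI):
    """Turn a full IRI into a CURIE, using the longest matching IRI base."""
    candidates = sorted(prefix_map.items(), key=lambda item: len(item[1]), reverse=True)
    for prefix, base in candidates:
        if iri.startswith(base):
            return f"{prefix}:{iri[len(base):]}"
    raise ValueError(f"No known prefix matches IRI: {iri!r}")


print(expand_curie("IEDB_REFERENCE:1002786"))
print(contract_iri("https://www.iedb.org/reference/1002786"))
```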
Troubleshooting and FAQs
Common issues and idiosyncrasies
IRIs for Antigens
Users may notice that for the ‘antigen_id’ field in the 'antigen_search' table, an IRI is used, rather than an integer ID. This is because we currently do not have numeric IDs for antigens. We are working to improve how we handle IRIs and CURIEs in the IEDB, and this may change in the future.
Curated vs Parent Terms
Some field names contain the word "curated", such as curated_source_antigens, while similar fields contain the word "parent", such as parent_source_antigen_names. "Curated" refers to the precise source protein isoform that matches exactly what an author referred to as the source of a peptide epitope in a specific publication. The curated_source_antigen will be a 100% BLAST match to the epitope sequence. "Parent" refers to the reference proteome protein that represents all of the protein isoforms to which epitopes with the same sequence have been assigned across different publications. The parent term is used to group all isoforms and will not necessarily be a 100% BLAST match to the epitope sequence. Similarly, source_organism_name (shown nested under curated_source_antigens) reflects the precise source organism strain that matches exactly what an author referred to as the source of an epitope in a specific publication, whereas parent_source_antigen_source_org_name is the species-level organism name that groups all strains that may have been associated with that epitope sequence across all publications. This help desk article goes into further detail: https://help.iedb.org/hc/en-us/articles/114094147251.
Results Page Limit & Default Page Size
By default, the IQ-API has a maximum page size of 10,000 records. In practice, this means that queries that result in more than 10,000 results will be divided into pages and only the first 10,000 records will be returned by the initial query.
NOTE: If a query requires paging, it is critical to also provide the 'order' parameter to determine how the rows are sorted. If it is not provided, the order of the returned rows is not guaranteed, and pages may be inconsistent between queries.
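As a rough sketch of what consistent paging looks like, the Python snippet below builds one URL per page, each with the same ‘order’ column and an increasing ‘offset’. It assumes the total number of matching records is already known (a made-up value of 25,000 is used here) and that ‘parent_source_antigen_id’ is a suitable sort column for ‘antigen_search’; the function name ‘page_urls’ is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://query-api.iedb.org"
PAGE_SIZE = 10_000  # maximum page size described above


def page_urls(endpoint, order, total, page_size=PAGE_SIZE):
    """Return one URL per page, all sorted by the same 'order' column."""
    urls = []
    for offset in range(0, total, page_size):
        query = urlencode({"order": order, "limit": page_size, "offset": offset})
        urls.append(f"{BASE_URL}/{endpoint}?{query}")
    return urls


# Made-up total of 25,000 records, which needs three pages.
for url in page_urls("antigen_search", "parent_source_antigen_id", total=25_000):
    print(url)
```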
The range of records returned is always reported in the ‘content-range’ response header; the total number of matching records is included only when it is explicitly requested. For example:
curl -I "https://query-api.iedb.org/antigen_search"
HTTP/2 200
server: nginx/1.19.2
date: Fri, 11 Jun 2021 18:19:28 GMT
content-type: application/json; charset=utf-8
vary: Accept-Encoding
content-range: 0-9999/*
content-location: /antigen_search
strict-transport-security: max-age=15724800; includeSubDomains
Above we can see that the server returned the first 10,000 records (indexed as 0-9999). The trailing ‘/*’ indicates that the total number of matching records has not been calculated, so there may be more records than were returned. To get an exact count of matching records, you must provide the header 'Prefer: count=exact' in your GET request.
curl -I "https://query-api.iedb.org/antigen_search" -H 'Prefer: count=exact'
HTTP/2 206
server: nginx/1.19.2
date: Fri, 11 Jun 2021 18:23:42 GMT
content-type: application/json; charset=utf-8
content-range: 0-9999/73254
content-location: /antigen_search
strict-transport-security: max-age=15724800; includeSubDomains
Now we can see that there are 73,254 matching records, which corresponds to 8 pages of results at 10,000 records per page. Note that adding this header can be detrimental to query performance. To retrieve records beyond the first page, add the ‘offset’ parameter to the query. For example, an offset of 73,000 returns the final 254 records:
curl -I "https://query-api.iedb.org/antigen_search?offset=73000&order=parent_source_antigen_id" -H 'Prefer: count=exact'
HTTP/2 206
server: nginx/1.19.2
date: Fri, 11 Jun 2021 18:28:47 GMT
content-type: application/json; charset=utf-8
content-range: 73000-73253/73254
content-location: /antigen_search?offset=73000&order=parent_source_antigen_id
strict-transport-security: max-age=15724800; includeSubDomains
If an offset is defined that is higher than the number of matching records, an empty result will be returned.
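The ‘content-range’ values shown above can be parsed programmatically to work out how many pages a query needs. The following Python sketch handles the two forms that appear in this article (‘first-last/total’ and ‘first-last/*’); other forms, such as the one returned for a range with no rows, are not handled. The function names are hypothetical.

```python
import math
import re


def parse_content_range(value):
    """Split 'first-last/total' into ints; total is None when reported as '*'."""
    match = re.fullmatch(r"(\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError(f"Unrecognised content-range: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


def page_count(total, page_size=10_000):
    """Number of pages needed to retrieve 'total' records."""
    return math.ceil(total / page_size)


first, last, total = parse_content_range("0-9999/73254")
print(first, last, total, page_count(total))
```

With the header values from the examples above, the total of 73,254 records gives 8 pages, matching the figure quoted in the text.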
Error messages
Large Query Error - “Cannot enlarge string buffer containing...out of memory”
This message indicates that the query is returning too much data. Many of the tables in the database contain fields that are information-dense, which can cause buffering issues on the PostgreSQL backend. If a user receives this message, the recommended workflow is to:
- Try adding a 'limit' parameter to your query to fetch the first X (e.g., 100) records
- Simultaneously, add the request header 'Prefer: count=exact' so you are aware of the total number of records matching the query
- Update the query to filter rows (on field values) or drop unnecessary columns (using ‘select’), and/or continue to page through the results by adding an 'offset' parameter
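The workflow above amounts to fetching a large result in small chunks. A minimal Python sketch of that loop is shown below. To keep it runnable without network access, the function that fetches a single chunk is passed in as an argument and replaced here with a stand-in that slices an in-memory list; in real use, that function would issue a GET request with the ‘limit’, ‘offset’, ‘order’ and ‘select’ parameters. All names in the snippet are hypothetical.

```python
def iter_records(fetch_page, chunk_size=100):
    """Yield records chunk by chunk until the server has no more to return."""
    offset = 0
    while True:
        chunk = fetch_page(limit=chunk_size, offset=offset)
        if not chunk:
            # An offset past the last record returns an empty result.
            return
        yield from chunk
        if len(chunk) < chunk_size:
            # A short chunk means this was the final one.
            return
        offset += chunk_size


# Stand-in data and fetch function, used only for this illustration.
FAKE_TABLE = [{"structure_id": 7355 + i} for i in range(250)]


def fake_fetch_page(limit, offset):
    """Mimic a request with 'limit' and 'offset' by slicing a local list."""
    return FAKE_TABLE[offset:offset + limit]


records = list(iter_records(fake_fetch_page, chunk_size=100))
print(len(records))
```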
FAQs
Can I save API links from the IEDB?
Not yet - this feature is currently in development. You will soon be able to retrieve the relevant API links on the ‘Results’ page using the ‘Export’ function.
Is there a mailing list I can join to receive updates on the IQ-API?
Yes, please email help@iedb.org to be added to the current mailing list.