Datasets and Resources
Topics for iKAT Year 1
File | Description |
---|---|
2023_train_topics.json | Train topics in JSON format. |
2023_test_topics.json | Test topics in JSON format. |
Additional Data
File | Description |
---|---|
2023_train_topics_psg_text.jsonl | Text of provenance passages in the train topics. |
2023_test_topics_psg_text.jsonl | Text of provenance passages in the test topics. |
2023_passages_hashes.tsv.bz2 | TSV file containing MD5 hashes of passage texts. Each line has the format: doc_id passage_number passage_MD5. Total download size is 2.2 GB. |
2023_top_1000_query_results.zip | Zip file containing the queries from the training and test topics, saved in queries_train.txt and queries_test.txt respectively. The results from the iKAT searcher (BM25 using manually resolved queries) are saved in the query_results_train and query_results_test folders. Each result file holds up to 1000 results and corresponds to one query, identified by its line number in the queries file, starting from zero. For instance, the results for the first query in queries_train.txt can be found in query_results_train/query_results_000.txt. In each result file, every line shows the ClueWeb22 ID followed by the URL. |
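The mapping from a query's line number to its result file can be sketched as follows. This is only an illustration: the three-digit zero padding is inferred from the query_results_000.txt example, the whitespace separator between the ClueWeb22 ID and the URL is an assumption, and the function names are hypothetical.

```python
def result_file_for(query_index: int, split: str = "train") -> str:
    """Return the relative path of the result file for a zero-based query line number.

    Assumption: file names are zero-padded to three digits, as in query_results_000.txt.
    """
    return f"query_results_{split}/query_results_{query_index:03d}.txt"


def parse_result_line(line: str) -> tuple[str, str]:
    """Split one result line into (ClueWeb22 ID, URL).

    Assumption: the two fields are separated by whitespace.
    """
    clueweb_id, url = line.strip().split(maxsplit=1)
    return clueweb_id, url
```

For example, result_file_for(0) points at the results for the first line of queries_train.txt, and result_file_for(0, "test") at the first line of queries_test.txt.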
Baseline Runs
Below, we provide two baseline runs.
- Method. BM25+RM3 (Pyserini default) was used as the initial retrieval method (denoted by ret_bm25_rm3 in the file name) to retrieve 100 passages per query (denoted by k_100).
  - The query in each turn was re-written using:
    - The context, and
    - The top-3 relevant PTKB statements (denoted by num_ptkb_3 in the file name).
- Query re-writing. In all cases, the re-written query was constructed by appending the relevant PTKB statements to the (manually or automatically) resolved query.
- Response generation. A response was generated using the top-3 passages retrieved with the re-written query (denoted by num_psg_3 in the file name). We use the T5 model mrm8488/t5-base-finetuned-summarize-news available on HuggingFace for this purpose.
- For automatic runs.
  - The relevant PTKB statements were determined automatically by re-ranking the statements using SentenceTransformers, specifically, the model cross-encoder/ms-marco-MiniLM-L-6-v2 available on HuggingFace.
  - The query was re-written automatically using the castorini/t5-base-canard model available on HuggingFace.
- For manual runs.
  - The relevant PTKB statements provided in the ptkb_provenance field were used.
  - The manually re-written query provided in the resolved_utterance field was used.
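The query re-writing step described above (appending the relevant PTKB statements to the resolved query) can be sketched as below. This is a minimal illustration, not the baseline code itself: the function name is hypothetical, the statements are assumed to be already ranked by relevance, and joining with a single space is an assumption.

```python
def rewrite_query(resolved_query: str, ptkb_statements: list[str], num_ptkb: int = 3) -> str:
    """Append the top-ranked PTKB statements to the resolved query.

    Assumptions: ptkb_statements is ordered from most to least relevant,
    and the pieces are joined with a single space.
    """
    selected = [statement.strip() for statement in ptkb_statements[:num_ptkb]]
    return " ".join([resolved_query.strip(), *selected])


# Hypothetical example data, for illustration only.
query = rewrite_query(
    "What are good vegan protein sources?",
    ["I am vegan.", "I run marathons.", "I live in Glasgow.", "I like jazz."],
)
```

With num_ptkb left at 3, only the first three statements are appended, which mirrors the num_ptkb_3 setting in the run file names.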
File | Run Type |
---|---|
ret_bm25_rm3--type_automatic--num_ptkb_3--k_100--num_psg_3.official.run.json | Automatic |
ret_bm25_rm3--type_manual--num_ptkb_3--k_100--num_psg_3.official.run.json | Manual |
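The run file names encode the settings listed above. A small sketch of how they could be decoded is shown here; the set of keys and the .official.run.json suffix are taken from the two file names in the table, while the function name and the decoding approach are assumptions made for illustration.

```python
# Keys observed in the two baseline file names above.
KNOWN_KEYS = ("ret", "type", "num_ptkb", "k", "num_psg")
SUFFIX = ".official.run.json"


def parse_run_name(file_name: str) -> dict[str, str]:
    """Decode 'key_value' settings separated by '--' from a baseline run file name."""
    stem = file_name.removesuffix(SUFFIX)  # requires Python 3.9+
    settings: dict[str, str] = {}
    for part in stem.split("--"):
        # Try longer keys first so that 'num_ptkb' and 'num_psg' are matched whole.
        for key in sorted(KNOWN_KEYS, key=len, reverse=True):
            if part.startswith(key + "_"):
                settings[key] = part[len(key) + 1:]
                break
    return settings


settings = parse_run_name(
    "ret_bm25_rm3--type_automatic--num_ptkb_3--k_100--num_psg_3.official.run.json"
)
```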
Document Collection: TREC iKAT 2023 ClueWeb22-B
The collection distribution is being handled directly by CMU and not the iKAT organizers. Please follow these steps to get your data license ASAP:
- Sign the license form available on the ClueWeb22 project web page.
- Send the form to CMU for approval (jlm4@andrew.cmu.edu).
Please give enough time to the CMU licensing office to accept your request. A download link will be sent to you by the ClueWeb22 team at CMU.
Note.
- CMU requires a signature from the organization (i.e., the university or company), not an individual who wants to use the data. This can slow down the process at your end too. So, it’s useful to start the process ASAP.
- If you already have an accepted license for ClueWeb22, you don’t need a new form. Please let us know if that’s the case.
Additional Resources
We provide the following additional resources for the teams:
TREC iKAT 2023 ClueWeb22-B Passage Collection
We provide a segmented version of the TREC iKAT 2023 ClueWeb22-B document collection available from CMU in two formats: JSONL and TrecWeb.
If you have segmented the document collection yourself, you can check whether your segments match ours using the tsv file of passage hashes provided.
- Passage collection in JSONL format.
  - These files contain passages in the form {"id": "[passage id]", "contents": "[passage text]", "url": "[ClueWeb22 document URL]"}
  - Passage IDs are structured as: doc_id:passage_number
  - Total download size is approximately 31 GB.
  - Total number of passages is 116,838,987.
- Passage collection in TrecWeb format.
  - Format as shown on website.
  - Total download size is approximately 31 GB.
- Passage hashes. tsv file containing MD5 hashes of passage texts.
  - The .tsv file has this format: doc_id passage_number passage_MD5
  - Total download size is 2.2 GB.
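Checking your own segments against the hash file could look roughly like the sketch below. It assumes that the hash is the hexadecimal MD5 digest of the UTF-8 encoded passage text with no extra normalization, and that the tsv columns are tab-separated; neither detail is stated above, so verify both against a few known passages first. The function names are hypothetical.

```python
import hashlib
import json


def split_passage_id(passage_id: str) -> tuple[str, str]:
    """Split 'doc_id:passage_number' into its two parts."""
    doc_id, passage_number = passage_id.rsplit(":", 1)
    return doc_id, passage_number


def passage_md5(text: str) -> str:
    """Hex MD5 digest of the passage text (assumption: UTF-8, no normalization)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def matches_hash_line(jsonl_line: str, tsv_line: str) -> bool:
    """Compare one JSONL passage with one 'doc_id<TAB>passage_number<TAB>passage_MD5' line."""
    passage = json.loads(jsonl_line)
    doc_id, passage_number = split_passage_id(passage["id"])
    expected_doc, expected_number, expected_md5 = tsv_line.rstrip("\n").split("\t")
    same_id = (doc_id, passage_number) == (expected_doc, expected_number)
    return same_id and passage_md5(passage["contents"]) == expected_md5
```

Because the hash file is large, a streaming comparison (reading both files line by line) is likely preferable to loading everything into memory.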
Sparse Lucene Passage Index
We also provide a sparse Lucene index generated from the JSONL passage files above using Pyserini. The files form a single .tar.bz2 archive split into sections for simpler downloading due to the overall size. To extract the archive, once downloaded, you must combine each of the sections in name order back into a single file:
cat ikat_2023_passage_index.tar.bz2.part* > ikat_2023_passage_index.tar.bz2
Total download size is approximately 150 GB.
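On systems without cat, the same recombination can be done in Python. The sketch below assumes the sections sit in one directory and share the .part prefix used in the command above; the function name is hypothetical.

```python
import shutil
from pathlib import Path


def combine_parts(directory: str, archive_name: str = "ikat_2023_passage_index.tar.bz2") -> Path:
    """Concatenate '<archive_name>.part*' files in name order into a single archive."""
    folder = Path(directory)
    parts = sorted(folder.glob(archive_name + ".part*"))  # name order, like the shell glob
    if not parts:
        raise FileNotFoundError(f"no sections of {archive_name} found in {folder}")
    target = folder / archive_name
    with target.open("wb") as combined:
        for part in parts:
            with part.open("rb") as section:
                # Stream each section so the full archive is never held in memory.
                shutil.copyfileobj(section, combined)
    return target
```

The combined file can then be extracted with any tool that reads .tar.bz2 archives.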
How do I access these resources?
Each team should use a URL of https://ikattrecweb.grill.science/<team_name> to access the files. The page will ask for a user ID and password. Enter the login details which you obtained from the iKAT organizers. You should see a page which lists each type of data and has links to the individual files listed above, along with their checksum files.
NOTE:
- Currently, these teams can access this URL: MLIA, TREMA_UNH and Nota. Please send us a message privately via Slack or by email and we will share the login details with you.
- Other teams: You have shared IPs in the 10.x.x.x range, which is for private networks, so we need another IP from you. Please share another suitable IP with us so that we can configure the above download link to work for you.
iKAT Searcher
iKAT Searcher is a simple tool developed to help with creating the topics for iKAT. The tool allows topic developers to visually assess the behaviour of a retrieval system, ultimately making it easier to develop challenging, but interesting, topics for the Track. You can interact with the system here. See the GitHub repository.
Run Validation
We provide code for run validation in our GitHub repository. Please see the associated README file for detailed instructions on how to run the code. It is crucial to validate your submission files before submitting them. Run files that fail the validation phase will be discarded. We advise you to familiarize yourself with the validation script as soon as possible and let us know if you have any questions or encounter any problems working with it.
Note. You need the MD5 hash file of the passages in the collection to run the validation code. You can download this file from above.
Below is a summary of the checks that the script performs on a run file.
- Can the file be parsed as valid JSON?
- Can the JSON be parsed into the protocol buffer format defined in protocol_buffers/run.proto?
- Does the run file have at least one turn?
- Is the run_name field non-empty?
- Is the run_type field non-empty and set to automatic or manual?
- Does the number of turns in the run match the number in the test topics file?
- Does the number of turns in each topic in the run match the corresponding number in the test topics file?
- For each turn in the run:
  - Is the turn ID valid and does it match an entry in the topics file?
  - Is any turn ID higher than expected for the selected topic (e.g. turns 1-5 in the topic, but a turn has ID 6 in the run file)?
  - (optional, enabled by default) Do all the passage_provenance passage IDs appear in the collection?
  - For each response in the turn:
    - Does it have a rank > 0?
    - Do the ranks increase with successive responses?
    - Does the response have a non-empty text field?
    - For each passage_provenance entry:
      - Does it have a score less than the previous entry?
      - Does it have a passage ID containing a single colon and beginning with 'clueweb22-'?
    - Are there fewer than 1000 passage_provenance entries listed for the response?
    - Is there at least one passage_provenance with its used field set to True in the response?
    - Does the response have at least one passage_provenance entry?
    - Does the response have at least one ptkb_provenance entry?
    - For each ptkb_provenance entry:
      - Does it have a non-empty ID?
      - Does the ID appear in the ptkb field of the topic data?
      - Does the text given match that in the topic data?
      - Does it have a score less than the previous entry?
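A few of the per-response checks above can be approximated locally before running the official script. The sketch below is not the validation code from the repository: it works on plain dictionaries, and the key names rank, score and id are assumptions (only run_name, run_type, text, used, passage_provenance and ptkb_provenance are named above). Consult protocol_buffers/run.proto for the authoritative field names.

```python
def is_valid_passage_id(passage_id: str) -> bool:
    """Passage ID begins with 'clueweb22-' and contains exactly one colon."""
    return passage_id.startswith("clueweb22-") and passage_id.count(":") == 1


def scores_strictly_decrease(entries: list[dict]) -> bool:
    """Every entry has a score lower than the entry before it."""
    scores = [entry["score"] for entry in entries]
    return all(later < earlier for earlier, later in zip(scores, scores[1:]))


def check_response(response: dict) -> list[str]:
    """Return a list of problems found in one response (empty if none)."""
    problems = []
    if response.get("rank", 0) <= 0:
        problems.append("rank must be greater than 0")
    if not response.get("text"):
        problems.append("text field is empty")
    provenance = response.get("passage_provenance", [])
    if not provenance:
        problems.append("no passage_provenance entries")
    if len(provenance) >= 1000:
        problems.append("1000 or more passage_provenance entries")
    if provenance and not any(entry.get("used") for entry in provenance):
        problems.append("no passage_provenance entry is marked as used")
    if not all(is_valid_passage_id(entry.get("id", "")) for entry in provenance):
        problems.append("malformed passage ID")
    if not scores_strictly_decrease(provenance):
        problems.append("passage_provenance scores do not decrease")
    return problems
```

A pre-check like this only catches structural slips early; the official script remains the reference, since it also checks the run against the topics file and the passage hashes.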
Additional Data from TREC CAsT
We provide the data from previous years' TREC CAsT below. The iKAT topics are similar, with the addition of the Personal Text Knowledge Base (PTKB). For more information on TREC CAsT, see the website and read the overview papers [2019] [2020] [2021] [2022].
Note. TREC CAsT did not include a PTKB but you can be creative and modify the data according to your needs. Also, TREC CAsT used different collections (Wikipedia, KILT, MS MARCO, etc.) at different stages. iKAT is using a subset of the recently released ClueWeb22-B.
CAsT Year 4 (2022)
File | Description |
---|---|
2022_automatic_evaluation_topics_tree_v1.0.json | Contains each conversation tree (topic) with an automatic rewrite generated for each user utterance. |
2022_evaluation_topics_turn_ids.json | Contains all ids that responses/ranked passages need to be returned for. |
2022_evaluation_topics_tree_v1.0.json | Contains each conversation tree (topic) with the resolved query for each user utterance. |
2022_evaluation_topics_flattened_duplicated_v1.0.json | Contains all possible conversation paths across all the conversation trees. |
CAsT Year 3 (2021)
File | Description |
---|---|
2021_automatic_evaluation_topics_v1.0.json | 25 primary evaluation topics in JSON format. Variant: Automatic |
2021_manual_evaluation_topics_v1.0.json | 25 primary evaluation topics in JSON format. Variant: Manual |
2021qrels.txt | Qrels file for passage ranking task. |
CAsT Year 2 (2020)
File | Description |
---|---|
2020_automatic_evaluation_topics_v1.0.json | 25 primary evaluation topics in JSON format. Variant: Automatic |
2020_manual_evaluation_topics_v1.0.json | 25 primary evaluation topics in JSON format. Variant: Manual |
2020qrels.txt | Qrels file for passage ranking task. |
CAsT Year 1 (2019)
File | Description |
---|---|
train_topics_v1.0.json | 30 example training topics in JSON format. |
evaluation_topics_v1.0.json | 50 evaluation topics in JSON format. |
2019qrels.txt | Official evaluation qrels file for passage ranking task. |
train_qrels.txt | Limited (incomplete) training judgements for 5 topics (approximately 50 turns). The judgements are graded on a three-point scale (2 very relevant, 1 relevant, and 0 not relevant). |
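The qrels files can be loaded with a few lines of Python. The sketch below assumes the standard four-column TREC qrels layout (query ID, iteration, document ID, relevance grade, separated by whitespace); check a few lines of the downloaded files to confirm. The function name and the sample IDs are hypothetical.

```python
from collections import defaultdict


def parse_qrels(lines) -> dict[str, dict[str, int]]:
    """Build {query_id: {doc_id: relevance}} from whitespace-separated qrels lines.

    Assumption: each line is 'query_id iteration doc_id relevance'.
    """
    qrels: dict[str, dict[str, int]] = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        query_id, _iteration, doc_id, relevance = line.split()
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical sample lines, for illustration only.
sample = ["31_1 0 DOC_A 2", "31_1 0 DOC_B 0", "", "31_2 0 DOC_C 1"]
judgements = parse_qrels(sample)
```

To load a real file, pass an open file handle, for example parse_qrels(open("train_qrels.txt", encoding="utf-8")).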