The Archive Query Log (AQL) is a comprehensive query log collected at the Internet Archive over the last 25 years. This version includes 357 million queries, 306 million search result pages, and 2.6 billion search results across 550 search providers. TIRA lets you run arbitrary software on this dataset in a privacy-preserving way: the results of executed software are blinded until both the output and the software have been reviewed to ensure that no sensitive data is leaked. In this tutorial, you'll learn how to develop a Docker image that can be executed in TIRA to run your experiment or evaluation.
- Install Docker, git, Python 3
- Install the TIRA client:
pip install tira
Note: On Windows, we recommend running Docker from the standard Windows command line (PowerShell or CMD) or the Windows Subsystem for Linux, but not from Git Bash.
In this tutorial, we use a simple baseline experiment (see the Python script aql-experiment-baseline.py) that counts how many SERPs each search provider has in the AQL.
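To make the idea concrete, here is a minimal sketch of such a counting experiment. It is not the original aql-experiment-baseline.py. The on-disk format is an assumption: the sketch supposes JSON Lines files with one SERP per line and a hypothetical `search_provider_name` field, so adjust both to the actual layout of the dataset. The helper names `count_serps_per_provider` and `write_results` are likewise made up for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_serps_per_provider(input_dir: Path) -> dict:
    """Count SERP records per search provider."""
    # ASSUMPTION: SERPs are stored as *.jsonl files, one SERP per line, and
    # each record names its provider in a "search_provider_name" field.
    counts = Counter()
    for path in sorted(input_dir.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as lines:
            for line in lines:
                if line.strip():
                    record = json.loads(line)
                    counts[record.get("search_provider_name", "unknown")] += 1
    # most_common() orders providers by descending count.
    return dict(counts.most_common())


def write_results(counts: dict, output_dir: Path) -> Path:
    """Write the counts to results.json in the output directory."""
    output_dir.mkdir(parents=True, exist_ok=True)
    results_file = output_dir / "results.json"
    results_file.write_text(json.dumps(counts), encoding="utf-8")
    return results_file


# Small self-check on made-up data (not real AQL records).
with tempfile.TemporaryDirectory() as tmp:
    input_dir, output_dir = Path(tmp, "input"), Path(tmp, "output")
    input_dir.mkdir()
    providers = ["google", "bing", "google"]
    (input_dir / "serps.jsonl").write_text(
        "\n".join(json.dumps({"search_provider_name": name}) for name in providers),
        encoding="utf-8",
    )
    counts = count_serps_per_provider(input_dir)
    results_file = write_results(counts, output_dir)
    written = json.loads(results_file.read_text(encoding="utf-8"))

print(counts)  # {'google': 2, 'bing': 1}
```

Keeping the counting and the writing in separate functions makes it easy to test the logic on a few hand-made records before building the Docker image.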
Now, create a Docker image (see the Dockerfile) that wraps the script.
docker build -t archive-query-log-experiment-baseline .
Use the TIRA client to execute the experiment locally. This way, you can test how TIRA would execute it.
tira-run \
--command '/aql-experiment-baseline.py --input $inputDataset --output $outputDir' \
--image archive-query-log-experiment-baseline \
--input-directory ./validation-data \
--output-directory ./tira-output
Let's break down this tira-run command:
- The --command option specifies the command to be executed by TIRA. The example command executes the aql-experiment-baseline.py script and specifies the TIRA input and output directories. The arguments $inputDataset and $outputDir are special arguments that are injected by TIRA. In our example:
/aql-experiment-baseline.py --input $inputDataset --output $outputDir
- The --image option specifies which Docker image is used to execute the command (--command).
- The --input-directory option specifies the directory where the input data is found, in our example: ./validation-data.
- The --output-directory option specifies the directory where the output data should be written.
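From the script's point of view, the injected values arrive as ordinary command-line arguments. The following sketch shows one way a script could receive them with Python's standard argparse module. It is an illustration, not the original script's code, and the paths passed at the end are hypothetical stand-ins for whatever TIRA injects at run time.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the two options used in the example command."""
    parser = argparse.ArgumentParser(description="AQL experiment entry point (sketch).")
    # Receives the value substituted for $inputDataset.
    parser.add_argument("--input", type=Path, required=True,
                        help="directory containing the input data")
    # Receives the value substituted for $outputDir.
    parser.add_argument("--output", type=Path, required=True,
                        help="directory where results are written")
    return parser.parse_args(argv)


# Hypothetical paths, standing in for the values injected at run time.
args = parse_args(["--input", "/tmp/example-input", "--output", "/tmp/example-output"])
print(args.input, args.output)
```

In a real script, calling `parse_args()` without an argument list makes argparse read the options from the actual command line.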
Inspect the output of the experiment:
cat tira-output/results.json
This should yield a result like this:
{"google": 18, "youtube": 10, "stackoverflow": 10, "twitter": 7, "weibo": 1, "facebook": 1, "sogou": 1, "baidu": 1, "bing": 1}
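If you prefer to check the output programmatically, a short script can load the file and rank the providers. The sketch below assumes only what the sample above shows, namely that results.json maps provider names to SERP counts. The helper names `load_results` and `summarize` are made up for this example.

```python
import json
from pathlib import Path


def load_results(path) -> dict:
    """Load the provider-to-count mapping written by the experiment."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(results: dict):
    """Return the total SERP count and the providers ranked by descending count."""
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    return sum(results.values()), ranked


# Sample output copied from above; for a real run, use
# load_results("tira-output/results.json") instead.
sample = json.loads(
    '{"google": 18, "youtube": 10, "stackoverflow": 10, "twitter": 7, '
    '"weibo": 1, "facebook": 1, "sogou": 1, "baidu": 1, "bing": 1}'
)
total, ranked = summarize(sample)
print(total)      # 50
print(ranked[0])  # ('google', 18)
```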
- Log in or register at the TIRA website.
- Open the AQL task overview.
- Click Submit, Docker Submission, then Upload Images.
- Tag your Docker image with the prefix assigned by TIRA (replace <PREFIX> with the assigned prefix from the TIRA website):
docker tag archive-query-log-experiment-baseline <PREFIX>/archive-query-log-experiment-baseline:0.0.1
- Follow the instructions on the TIRA website to log in to the dedicated Docker registry.
- Push your Docker image to the registry (replace <PREFIX> with the assigned prefix from the TIRA website):
docker push <PREFIX>/archive-query-log-experiment-baseline:0.0.1
- The Docker image should be listed on the TIRA website after a short while. If not, click Refresh Images.
- Open the AQL task overview.
- Click Submit, Docker Submission, then Add Container.
- Specify the command from above as the command to be executed by TIRA:
/aql-experiment-baseline.py --input $inputDataset --output $outputDir
- Select the previously uploaded Docker image from the dropdown menu.
- Click Add Container.
- A new tab will appear with a "random" name. Click on the tab to open the container.
- Select the "Validation" dataset from the dropdown menu.
- Click Run Container to start the experiment.
Congrats! You've successfully executed your experiment on the AQL with TIRA. Our team will now review the outputs of your experiment and unblind them if they don't leak private data. You can check the status of your experiment on the TIRA website.