Testing
Let’s take a look at how 🌍 Transformers models are tested and how you can write new tests and improve the existing ones.
There are 2 test suites in the repository:
tests — tests for the general API
examples — tests primarily for various applications that aren’t part of the API
How transformers are tested
Once a PR is submitted, it gets tested with 9 CircleCI jobs. Every new commit to that PR gets retested. These jobs are defined in this config file, so that if needed you can reproduce the same environment on your machine.
These CI jobs don’t run @slow tests.
There are 3 jobs run by GitHub Actions:
torch hub integration: checks whether torch hub integration works.
self-hosted (push): runs fast tests on GPU only on commits on main. It only runs if a commit on main has updated the code in one of the following folders: src, tests, .github (to prevent running on added model cards, notebooks, etc.)
self-hosted runner: runs normal and slow tests on GPU in tests and examples:
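A minimal sketch of what that scheduled job runs, assuming the RUN_SLOW switch described later in this document:
RUN_SLOW=1 pytest tests/
RUN_SLOW=1 pytest examples/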
The results can be observed here.
Running tests
Choosing which tests to run
This document goes into many details of how tests can be run. If, after reading everything, you need even more details, you will find them here.
Here are some of the most useful ways of running tests.
Run all:
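For example, from the root of the repository checkout:
pytest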
or:
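Using the Makefile target referenced later in this document:
make test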
Note that the latter is defined as:
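Judging by the flag descriptions that follow, that target amounts to something equivalent to:
python -m pytest -n auto --dist=loadfile -s -v ./tests/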
which tells pytest to:
run as many test processes as there are CPU cores (which could be too many if you don’t have a ton of RAM!)
ensure that all tests from the same file will be run by the same test process
do not capture output
run in verbose mode
Getting the list of all tests
All tests of the test suite:
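One way to list them with stock pytest flags:
pytest --collect-only -q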
All tests of a given test file:
Run a specific test module
To run an individual test module:
Run specific tests
Since unittest is used inside most of the tests, to run specific subtests you need to know the name of the unittest class containing those tests. For example, it could be:
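Putting together the pieces described just below, the invocation looks like:
pytest tests/test_optimization.py::OptimizationTest::test_adam_w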
Here:
tests/test_optimization.py - the file with tests
OptimizationTest - the name of the class
test_adam_w - the name of the specific test function
If the file contains multiple classes, you can choose to run only tests of a given class. For example:
will run all the tests inside that class.
As mentioned earlier you can see what tests are contained inside the OptimizationTest class by running:
You can run tests by keyword expressions.
To run only tests whose name contains adam:
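For example, scoping it to the test file used above:
pytest -k adam tests/test_optimization.py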
Logical and and or can be used to indicate whether all keywords should match or just one of them; not can be used to negate.
To run all tests except those whose name contains adam:
And you can combine the two patterns in one.
For example, to run both test_adafactor and test_adam_w you can use:
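A sketch of that invocation, reusing the optimization test file from above:
pytest -k "test_adafactor or test_adam_w" tests/test_optimization.py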
Note that we use or here, since we want either of the keywords to match to include both.
If you want to include only tests that include both patterns, and is to be used:
Run accelerate tests
Sometimes you need to run accelerate tests on your models. For that you can just add -m accelerate_tests to your command. If, let’s say, you want to run these tests on OPT, run:
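A sketch of such a command; the exact test location for OPT is an assumption (model tests usually live under tests/models/<model_name>):
RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt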
Run documentation tests
In order to test whether the documentation examples are correct, you should check that the doctests are passing. As an example, let’s use WhisperModel.forward’s docstring:
Just run the following line to automatically test every docstring example in the desired file:
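A sketch using pytest’s doctest support; the module path is an assumption based on where WhisperModel lives:
pytest --doctest-modules src/transformers/models/whisper/modeling_whisper.py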
If the file has a markdown extension, you should add the --doctest-glob="*.md" argument.
Run only modified tests
You can run the tests related to the unstaged files or the current branch (according to Git) by using pytest-picked. This is a great way of quickly testing your changes didn’t break anything, since it won’t run the tests related to files you didn’t touch.
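A typical sequence with that plugin:
pip install pytest-picked
pytest --picked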
All tests will be run from files and folders which are modified, but not yet committed.
Automatically rerun failed tests on source modification
pytest-xdist provides a very useful feature of detecting all failed tests and then waiting for you to modify files, continuously re-running those failing tests until they pass while you fix them, so that you don’t need to restart pytest after you make the fix. This is repeated until all tests pass, after which a full run is performed again.
To enter the mode: pytest -f or pytest --looponfail
File changes are detected by looking at looponfailroots root directories and all of their contents (recursively). If the default for this value does not work for you, you can change it in your project by setting a configuration option in setup.cfg:
or pytest.ini/tox.ini files:
This would lead to only looking for file changes in the respective directories, specified relative to the ini-file’s directory.
pytest-watch is an alternative implementation of this functionality.
Skip a test module
If you want to run all test modules, except a few, you can exclude them by giving an explicit list of tests to run. For example, to run all except test_modeling_*.py tests:
Clearing state
On CI builds, and when isolation is important (at the expense of speed), the cache should be cleared:
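For example, with pytest’s built-in cache option:
pytest --cache-clear tests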
Running tests in parallel
As mentioned earlier make test runs tests in parallel via the pytest-xdist plugin (-n X argument, e.g. -n 2 to run 2 parallel jobs).
pytest-xdist’s --dist= option allows one to control how the tests are grouped. --dist=loadfile puts the tests located in one file onto the same process.
Since the order of executed tests is different and unpredictable, if running the test suite with pytest-xdist produces failures (meaning we have some undetected coupled tests), use pytest-replay to replay the tests in the same order, which should help to then reduce that failing sequence to a minimum.
Test order and repetition
It’s good to repeat the tests several times, in sequence, randomly, or in sets, to detect any potential inter-dependency and state-related bugs (tear down). And straightforward multiple repetition is just good for detecting some problems that get uncovered by the randomness of DL.
Repeat tests
Once the repetition plugin is installed, run every test multiple times (50 by default):
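A sketch, assuming the plugin in question is pytest-flakefinder, which behaves as described (the test path is a placeholder):
pip install pytest-flakefinder
pytest --flake-finder --flake-runs=5 tests/test_failing_test.py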
This plugin doesn’t work with the -n flag from pytest-xdist.
There is another plugin pytest-repeat, but it doesn’t work with unittest.
Run tests in a random order
Important: the presence of pytest-random-order will automatically randomize tests, no configuration change or command line option is required.
As explained earlier this allows detection of coupled tests - where one test’s state affects the state of another. When pytest-random-order is installed it will print the random seed it used for that session, e.g.:
So that if the given particular sequence fails, you can reproduce it by adding that exact seed, e.g.:
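A sketch; the seed value is a placeholder for the one printed in your failing session:
pytest --random-order-seed=42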
It will only reproduce the exact order if you use the exact same list of tests (or no list at all). Once you start manually narrowing down the list, you can no longer rely on the seed; instead you have to list the tests manually in the exact order they failed and tell pytest not to randomize them, using --random-order-bucket=none, e.g.:
To disable the shuffling for all tests:
By default --random-order-bucket=module is implied, which will shuffle the files at the module level. It can also shuffle at the class, package, global and none levels. For the complete details please see its documentation.
Another randomization alternative is pytest-randomly. This module has a very similar functionality/interface, but it doesn’t have the bucket modes available in pytest-random-order. It has the same problem of imposing itself once installed.
Look and feel variations
pytest-sugar
pytest-sugar is a plugin that improves the look-n-feel, adds a progressbar, and shows tests that fail and the assert instantly. It gets activated automatically upon installation.
To run tests without it, run:
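pytest can disable a given plugin for a single run with -p no:<name>; for pytest-sugar that is:
pytest -p no:sugar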
or uninstall it.
Report each sub-test name and its progress
For a single or a group of tests via pytest (after pip install pytest-pspec):
Instantly shows failed tests
pytest-instafail shows failures and errors instantly instead of waiting until the end of test session.
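A typical sequence:
pip install pytest-instafail
pytest --instafail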
To GPU or not to GPU
On a GPU-enabled setup, to test in CPU-only mode add CUDA_VISIBLE_DEVICES="":
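For example (the test selection is up to you):
CUDA_VISIBLE_DEVICES="" pytest tests/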
or if you have multiple GPUs, you can specify which one is to be used by pytest. For example, to use only the second GPU if you have GPUs 0 and 1, you can run:
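A sketch (the test selection is up to you):
CUDA_VISIBLE_DEVICES="1" pytest tests/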
This is handy when you want to run different tasks on different GPUs.
Some tests must be run on CPU-only, others on either CPU or GPU or TPU, yet others on multiple-GPUs. The following skip decorators are used to set the requirements of tests CPU/GPU/TPU-wise:
require_torch - this test will run only under torch
require_torch_gpu - as require_torch plus requires at least 1 GPU
require_torch_multi_gpu - as require_torch plus requires at least 2 GPUs
require_torch_non_multi_gpu - as require_torch plus requires 0 or 1 GPUs
require_torch_up_to_2_gpus - as require_torch plus requires 0 or 1 or 2 GPUs
require_torch_tpu - as require_torch plus requires at least 1 TPU
Let’s depict the GPU requirements in the following table:
| n gpus | decorator                    |
|--------|------------------------------|
| >= 0   | @require_torch               |
| >= 1   | @require_torch_gpu           |
| >= 2   | @require_torch_multi_gpu     |
| < 2    | @require_torch_non_multi_gpu |
| < 3    | @require_torch_up_to_2_gpus  |
For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:
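A minimal sketch (the test body is a placeholder):
from transformers.testing_utils import require_torch_multi_gpu

@require_torch_multi_gpu
def test_example_with_multi_gpu():
    ...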
If a test requires tensorflow use the require_tf decorator. For example:
These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is how to set it up:
Some decorators like @parameterized rewrite test names, therefore @require_* skip decorators have to be listed last for them to work correctly. Here is an example of the correct usage:
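A sketch of that ordering (the parameters and body are placeholders):
import unittest

from parameterized import parameterized
from transformers.testing_utils import require_torch_multi_gpu


class MyTest(unittest.TestCase):
    @parameterized.expand([("case_a",), ("case_b",)])
    @require_torch_multi_gpu  # the skip decorator is listed last, right above the test
    def test_integration_foo(self, name):
        ...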
This order problem doesn’t exist with @pytest.mark.parametrize, you can put it first or last and it will still work. But it only works with non-unittests.
Inside tests:
How many GPUs are available:
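One way to query this from inside a test, using a helper that ships in transformers.testing_utils:
from transformers.testing_utils import get_gpu_count

n_gpu = get_gpu_count()  # number of GPUs visible to the active backend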
Testing with a specific PyTorch backend or device
To run the test suite on a specific torch device add TRANSFORMERS_TEST_DEVICE="$device" where $device is the target backend. For example, to test on CPU only:
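A sketch (the test selection is up to you):
TRANSFORMERS_TEST_DEVICE="cpu" pytest tests/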
This variable is useful for testing custom or less common PyTorch backends such as mps. It can also be used to achieve the same effect as CUDA_VISIBLE_DEVICES by targeting specific GPUs or testing in CPU-only mode.
Certain devices will require an additional import after importing torch for the first time. This can be specified using the environment variable TRANSFORMERS_TEST_BACKEND:
Distributed training
pytest can’t deal with distributed training directly. If this is attempted - the sub-processes don’t do the right thing and end up thinking they are pytest and start running the test suite in loops. It works, however, if one spawns a normal process that then spawns off multiple workers and manages the IO pipes.
Here are some tests that use it:
To jump right into the execution point, search for the execute_subprocess_async call in those tests.
You will need at least 2 GPUs to see these tests in action:
Output capture
During test execution any output sent to stdout and stderr is captured. If a test or a setup method fails, its corresponding captured output will usually be shown along with the failure traceback.
To disable output capturing and to get the stdout and stderr normally, use -s or --capture=no:
To send test results to JUnit format output:
Color control
To have no color (e.g., yellow on white background is not readable):
Sending test report to online pastebin service
Creating a URL for each test failure:
This will submit test run information to a remote Paste service and provide a URL for each failure. You may select tests as usual or add for example -x if you only want to send one particular failure.
Creating a URL for a whole test session log:
Writing tests
🌍 transformers tests are based on unittest, but run by pytest, so most of the time features from both systems can be used.
You can read here which features are supported, but the important thing to remember is that most pytest fixtures don’t work. Neither does parametrization, but we use the module parameterized that works in a similar way.
Parametrization
Often, there is a need to run the same test multiple times, but with different arguments. It could be done from within the test, but then there is no way of running that test for just one set of arguments.
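A sketch of such a test (the class name and the third parameter set are illustrative; the negative and integer sets are referenced below):
import math
import unittest

from parameterized import parameterized


class TestMathUnitTest(unittest.TestCase):
    @parameterized.expand(
        [
            ("negative", -1.5, -2.0),
            ("integer", 1, 1.0),
            ("large fraction", 1.6, 1),
        ]
    )
    def test_floor(self, name, input, expected):
        # each parameter set becomes its own sub-test
        self.assertEqual(math.floor(input), expected)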
Now, by default this test will be run 3 times, each time with the last 3 arguments of test_floor being assigned the corresponding arguments in the parameter list.
And you could run just the negative and integer sets of params with:
or all but negative sub-tests, with:
Besides using the -k filter that was just mentioned, you can find out the exact name of each sub-test and run any or all of them using their exact names.
and it will list the generated sub-test names.
So now you can run just 2 specific sub-tests:
The module parameterized which is already in the developer dependencies of transformers works for both: unittests and pytest tests.
If, however, the test is not a unittest, you may use pytest.mark.parametrize (or you may see it being used in some existing tests, mostly under examples).
Here is the same example, this time using pytest’s parametrize marker:
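A sketch mirroring the example above (values are illustrative):
import math

import pytest


@pytest.mark.parametrize(
    "name, input, expected",
    [
        ("negative", -1.5, -2.0),
        ("integer", 1, 1.0),
        ("large fraction", 1.6, 1),
    ],
)
def test_floor(name, input, expected):
    assert math.floor(input) == expected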
Same as with parameterized, with pytest.mark.parametrize you can have a fine control over which sub-tests are run, if the -k filter doesn’t do the job. Except, this parametrization function creates a slightly different set of names for the sub-tests. Here is what they look like:
and it will list the parametrized test names.
So now you can run just a specific sub-test by passing its exact name, as in the previous example.
Files and directories
In tests often we need to know where things are relative to the current test file, and it’s not trivial since the test could be invoked from more than one directory or could reside in sub-directories with different depths. A helper class transformers.test_utils.TestCasePlus solves this problem by sorting out all the basic paths and provides easy accessors to them:
pathlib objects (all fully resolved):
test_file_path - the current test file path, i.e. __file__
test_file_dir - the directory containing the current test file
tests_dir - the directory of the tests test suite
examples_dir - the directory of the examples test suite
repo_root_dir - the directory of the repository
src_dir - the directory of src (i.e. where the transformers sub-dir resides)
stringified paths - same as above but these return paths as strings, rather than pathlib objects:
test_file_path_str
test_file_dir_str
tests_dir_str
examples_dir_str
repo_root_dir_str
src_dir_str
To start using those all you need is to make sure that the test resides in a subclass of transformers.test_utils.TestCasePlus. For example:
If you don’t need to manipulate paths via pathlib or you just need a path as a string, you can always invoke str() on the pathlib object or use the accessors ending with _str. For example:
Temporary files and directories
Using unique temporary files and directories is essential for parallel test running, so that the tests won’t overwrite each other’s data. Also we want to get the temporary files and directories removed at the end of each test that created them. Therefore, using packages like tempfile, which address these needs, is essential.
However, when debugging tests, you need to be able to see what goes into the temporary file or directory and you want to know its exact path and not have it randomized on every test re-run.
A helper class transformers.test_utils.TestCasePlus is best used for such purposes. It’s a sub-class of unittest.TestCase, so we can easily inherit from it in the test modules.
Here is an example of its usage:
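A minimal sketch (the class and test names are illustrative); get_auto_remove_tmp_dir() is the TestCasePlus helper that provides the behavior described below:
from transformers.testing_utils import TestCasePlus


class ExamplesTests(TestCasePlus):
    def test_whatever(self):
        tmp_dir = self.get_auto_remove_tmp_dir()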
This code creates a unique temporary directory, and sets tmp_dir to its location.
Create a unique temporary dir:
tmp_dir will contain the path to the created temporary dir. It will be automatically removed at the end of the test.
Create a temporary dir of my choice, ensure it’s empty before the test starts and don’t empty it after the test.
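A sketch, assuming the same helper accepts an explicit path (note the ./ requirement explained below):
tmp_dir = self.get_auto_remove_tmp_dir("./xxx")  # "./xxx" is a placeholder path inside the repo checkout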
This is useful for debugging when you want to monitor a specific directory and want to make sure the previous tests didn’t leave any data in there.
You can override the default behavior by directly overriding the before and after args, leading to one of the following behaviors:
before=True: the temporary dir will always be cleared at the beginning of the test.
before=False: if the temporary dir already existed, any existing files will remain there.
after=True: the temporary dir will always be deleted at the end of the test.
after=False: the temporary dir will always be left intact at the end of the test.
In order to run the equivalent of rm -r safely, only subdirs of the project repository checkout are allowed if an explicit tmp_dir is used, so that by mistake no /tmp or similar important part of the filesystem will get nuked. I.e. please always pass paths that start with ./.
Each test can register multiple temporary directories and they all will get auto-removed, unless requested otherwise.
Temporary sys.path override
If you need to temporarily override sys.path to import from another test, for example, you can use the ExtendSysPath context manager. Example:
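A sketch (the imported module and name are hypothetical):
import os

from transformers.testing_utils import ExtendSysPath


bindir = os.path.abspath(os.path.dirname(__file__))
with ExtendSysPath(f"{bindir}/.."):
    from some_sibling_test_module import some_helper  # noqa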
Skipping tests
This is useful when a bug is found and a new test is written, yet the bug is not fixed yet. In order to be able to commit it to the main repository we need to make sure it’s skipped during make test.
Methods:
A skip means that you expect your test to pass only if some conditions are met, otherwise pytest should skip running the test altogether. Common examples are skipping windows-only tests on non-windows platforms, or skipping tests that depend on an external resource which is not available at the moment (for example a database).
A xfail means that you expect a test to fail for some reason. A common example is a test for a feature not yet implemented, or a bug not yet fixed. When a test passes despite being expected to fail (marked with pytest.mark.xfail), it’s an xpass and will be reported in the test summary.
One of the important differences between the two is that skip doesn’t run the test, and xfail does. So if the code that’s buggy causes some bad state that will affect other tests, do not use xfail.
Implementation
Here is how to skip a whole test unconditionally:
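For example, with the standard unittest decorator (the test name is a placeholder):
@unittest.skip(reason="this bug needs to be fixed")
def test_feature_x(self):
    ...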
or via pytest:
or the xfail way:
Here is how to skip a test based on some internal check inside the test:
or the whole module:
or the xfail way:
Here is how to skip all tests in a module if some import is missing:
Skip a test based on a condition:
or:
or skip the whole module:
More details, examples and ways are here.
Slow tests
The library of tests is ever-growing, and some of the tests take minutes to run, so we can’t afford waiting for an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be marked as in the example below:
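A minimal sketch (the test body is a placeholder):
from transformers.testing_utils import slow

@slow
def test_integration_foo():
    ...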
Once a test is marked as @slow, to run such tests set the RUN_SLOW=1 env var, e.g.:
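A typical invocation:
RUN_SLOW=1 pytest tests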
Some decorators like @parameterized rewrite test names, therefore @slow and the rest of the skip decorators @require_* have to be listed last for them to work correctly. Here is an example of the correct usage:
As explained at the beginning of this document, slow tests get to run on a scheduled basis, rather than in PR CI checks. So it’s possible that some problems will be missed during a PR submission and get merged. Such problems will get caught during the next scheduled CI job. But it also means that it’s important to run the slow tests on your machine before submitting the PR.
Here is a rough decision making mechanism for choosing which tests should be marked as slow:
If the test is focused on one of the library’s internal components (e.g., modeling files, tokenization files, pipelines), then we should run that test in the non-slow test suite. If it’s focused on another aspect of the library, such as the documentation or the examples, then we should run these tests in the slow test suite. And then, to refine this approach we should have exceptions:
All tests that need to download a heavy set of weights or a dataset that is larger than ~50MB (e.g., model or tokenizer integration tests, pipeline integration tests) should be set to slow. If you’re adding a new model, you should create and upload to the hub a tiny version of it (with random weights) for integration tests. This is discussed in the following paragraphs.
All tests that need to do a training not specifically optimized to be fast should be set to slow.
We can introduce exceptions if some of these should-be-non-slow tests are excruciatingly slow, and set them to @slow. Auto-modeling tests, which save and load large files to disk, are a good example of tests that are marked as @slow.
If a test completes under 1 second on CI (including downloads if any) then it should be a normal test regardless.
Collectively, all the non-slow tests need to cover entirely the different internals, while remaining fast. For example, a significant coverage can be achieved by testing with specially created tiny models with random weights. Such models have the very minimal number of layers (e.g., 2), vocab size (e.g., 1000), etc. Then the @slow tests can use large slow models to do qualitative testing. To see the use of these simply look for tiny models with:
Here is an example of a script that created the tiny model stas/tiny-wmt19-en-de. You can easily adjust it to your specific model’s architecture.
It’s easy to measure the run-time incorrectly if for example there is an overhead of downloading a huge model, but if you test it locally the downloaded files would be cached and thus the download time not measured. Hence check the execution speed report in CI logs instead (the output of pytest --durations=0 tests).
That report is also useful to find slow outliers that aren’t marked as such, or which need to be re-written to be fast. If you notice that the test suite starts getting slow on CI, the top listing of this report will show the slowest tests.
Testing the stdout/stderr output
In order to test functions that write to stdout and/or stderr, the test can access those streams using pytest’s capsys system. Here is how this is accomplished:
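A sketch of such a test (the helper functions are illustrative):
import sys


def print_to_stdout(s):
    print(s)


def print_to_stderr(s):
    sys.stderr.write(s)


def test_result_and_stdout(capsys):
    msg = "Hello"
    print_to_stdout(msg)
    print_to_stderr(msg)

    out, err = capsys.readouterr()  # consume the captured output streams
    # optionally replay the consumed streams
    sys.stdout.write(out)
    sys.stderr.write(err)

    # test what was printed
    assert msg in out
    assert msg in err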
And, of course, most of the time, stderr will come as a part of an exception, so try/except has to be used in such a case:
Another approach to capturing stdout is via contextlib.redirect_stdout:
An important potential issue with capturing stdout is that it may contain \r characters that in normal print reset everything that has been printed so far. There is no problem with pytest, but with pytest -s these characters get included in the buffer, so to be able to have the test run with and without -s, you have to make an extra cleanup to the captured output, using re.sub(r'~.*\r', '', buf, 0, re.M).
But, then we have a helper context manager wrapper to automatically take care of it all, regardless of whether it has some \r chars in it or not, so it’s a simple:
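A sketch using the CaptureStdout helper from transformers.testing_utils (the printed text is illustrative):
from transformers.testing_utils import CaptureStdout

with CaptureStdout() as cs:
    print("Hello World")
assert "Hello World" in cs.out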
Here is a full test example:
If you’d like to capture stderr use the CaptureStderr class instead:
If you need to capture both streams at once, use the parent CaptureStd class:
Also, to aid debugging test issues, by default these context managers automatically replay the captured streams on exit from the context.
Capturing logger stream
If you need to validate the output of a logger, you can use CaptureLogger:
Testing with environment variables
If you want to test the impact of environment variables for a specific test you can use a helper decorator transformers.testing_utils.mockenv:
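A sketch (the variable, value and test are illustrative):
import os
import unittest

from transformers.testing_utils import mockenv


class EnvTest(unittest.TestCase):
    @mockenv(TRANSFORMERS_VERBOSITY="error")
    def test_env_override(self):
        # the variable is set only for the duration of this test
        assert os.getenv("TRANSFORMERS_VERBOSITY") == "error"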
At times an external program needs to be called, which requires setting PYTHONPATH in os.environ to include multiple local paths. A helper class transformers.test_utils.TestCasePlus comes to help:
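A sketch; get_env() is the TestCasePlus accessor that builds the value described below (the class, test and external call are illustrative):
from transformers.testing_utils import TestCasePlus


class EnvExampleTest(TestCasePlus):
    def test_external_prog(self):
        env = self.get_env()
        # now call the external program, passing `env` to it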
Depending on whether the test file was under the tests test suite or examples it’ll correctly set up env[PYTHONPATH] to include one of these two directories, and also the src directory to ensure the testing is done against the current repo, and finally with whatever env[PYTHONPATH] was already set to before the test was called if anything.
This helper method creates a copy of the os.environ object, so the original remains intact.
Getting reproducible results
In some situations you may want to remove randomness for your tests. To get identical reproducible results, you will need to fix the seed:
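For example, a typical way to pin the common sources of randomness, assuming torch and numpy are in use:
import random

import numpy as np
import torch

seed = 42

# python RNG
random.seed(seed)

# numpy RNG
np.random.seed(seed)

# pytorch RNGs
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)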
Debugging tests
To start a debugger at the point of the warning, do this:
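One way to do it with stock pytest options is to turn the warning into an error and drop into the debugger on failure; the test path and warning category here are placeholders:
pytest tests/utils/test_logging.py -W error::UserWarning --pdb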
Working with github actions workflows
To trigger a self-push workflow CI job, you must:
Create a new branch on transformers origin (not a fork!).
The branch name has to start with either ci_ or ci- (main triggers it too, but we can’t do PRs on main). It also gets triggered only for specific paths - you can find the up-to-date definition, in case it changed since this document has been written, here under push:
Create a PR from this branch.
Then you can see the job appear here. It may not run right away if there is a backlog.
Testing Experimental CI Features
Testing CI features can be potentially problematic as it can interfere with the normal CI functioning. Therefore if a new CI feature is to be added, it should be done as follows.
Create a new dedicated job that tests what needs to be tested
The new job must always succeed so that it gives us a green ✓ (details below).
Let it run for some days to see that a variety of different PR types get to run on it (user fork branches, non-forked branches, branches originating from github.com UI direct file edit, various forced pushes, etc. - there are so many) while monitoring the experimental job’s logs (not the overall job status, as it’s purposefully always green)
When it’s clear that everything is solid, then merge the new changes into existing jobs.
That way experiments on CI functionality itself won’t interfere with the normal workflow.
Now how can we make the job always succeed while the new CI feature is being developed?
Some CIs, like TravisCI, support ignore-step-failure and will report the overall job as successful, but CircleCI and GitHub Actions as of this writing don’t support that.
So the following workaround can be used:
set +euo pipefail at the beginning of the run command to suppress most potential failures in the bash script.
the last command must be a success: echo "done" or just true will do
Here is an example:
For simple commands you could also do:
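For instance, in bash (cmd_that_may_fail is a placeholder):
cmd_that_may_fail || true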
Of course, once satisfied with the results, integrate the experimental step or job with the rest of the normal jobs, while removing set +euo pipefail or any other things you may have added to ensure that the experimental job doesn’t interfere with the normal CI functioning.
This whole process would have been much easier if we only could set something like allow-failure for the experimental step, and let it fail without impacting the overall status of PRs. But as mentioned earlier CircleCI and GitHub Actions don’t support it at the moment.
You can vote for this feature and see where it is at these CI-specific threads: