Pickle Scanning
Pickle is a widely used serialization format in ML. Most notably, it is the default format for PyTorch model weights.
Loading a pickle file can trigger dangerous arbitrary code execution attacks. We suggest loading models from users and organizations you trust, relying on signed commits, and/or loading models from TF or Jax formats with the from_tf=True auto-conversion mechanism. We also alleviate this issue by displaying/“vetting” the list of imports in any pickled file, directly on the Hub. Finally, we are experimenting with a new, simple serialization format for weights called safetensors.
From the Python documentation:
The pickle module implements binary protocols for serializing and de-serializing a Python object structure.
What this means is that pickle is a serialization protocol, something you use to efficiently share data between parties.
We call a pickle the binary file that was generated while pickling.
At its core, the pickle is basically a stack of instructions or opcodes. As you probably have guessed, it’s not human readable. The opcodes are generated when pickling and read sequentially at unpickling. Based on the opcode, a given action is executed.
Here’s a small example:
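A minimal sketch of such an example, assuming Python's standard pickle and pickletools modules. The variable name, the message and the use of a temporary directory are illustrative choices. It pickles a string to payload.pkl, then disassembles the file without loading it.

```python
import os
import pickle
import pickletools
import tempfile

var = "data I want to share with a friend"

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "payload.pkl")

    # Serialize the string into a pickle file.
    with open(path, "wb") as f:
        pickle.dump(var, f)

    # Disassemble the pickle and print its opcodes.
    # pickletools.dis only reads the byte stream; it does not unpickle it.
    with open(path, "rb") as f:
        pickletools.dis(f)

    # Loading the file gives the original string back.
    with open(path, "rb") as f:
        restored = pickle.load(f)
```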
When you run this, it will create a pickle file and print the following instructions in your terminal:
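The exact offsets in the disassembly depend on the Python version and the pickle protocol. As a hedged illustration, the sketch below pins protocol 4 (an assumption, not something the text states) and lists only the opcode names for a short string, using pickletools.genops from the standard library.

```python
import pickle
import pickletools

data = pickle.dumps("data I want to share with a friend", protocol=4)

# genops yields (opcode, argument, position) tuples without executing anything.
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
print(names)
# ['PROTO', 'FRAME', 'SHORT_BINUNICODE', 'MEMOIZE', 'STOP']
```

For plain data like this, the stream only declares the protocol, pushes the string, memoizes it and stops.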
Pickle is not simply a serialization protocol: it allows more flexibility by giving users the ability to run Python code at de-serialization time. Doesn’t sound good, does it?
As we’ve stated above, de-serializing pickle means that code can be executed. But this comes with certain limitations: you can only reference functions and classes from the top level module; you cannot embed them in the pickle file itself.
Back to the drawing board:
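As a hedged sketch of what such a script might look like (the Data class and its important_stuff attribute are illustrative names), here is a small custom object being pickled to payload.pkl. It assumes the code runs as a normal script or module, so that pickle can find the class by name.

```python
import os
import pickle
import tempfile

class Data:
    def __init__(self, important_stuff: str):
        self.important_stuff = important_stuff

d = Data("42")

# A temporary directory keeps the sketch free of leftover files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "payload.pkl")
    with open(path, "wb") as f:
        pickle.dump(d, f)
    size = os.path.getsize(path)

print(f"payload.pkl is {size} bytes")
```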
When we run this script we get the payload.pkl again. When we check the file’s contents:
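One way to check the contents without unpickling is to look at the raw bytes, sketched here on an in-memory pickle of an illustrative Data class (the class and its describe method are assumptions). The bytes contain the names needed to find the class again plus the instance state, but none of the class's code.

```python
import pickle

class Data:
    def __init__(self, important_stuff: str):
        self.important_stuff = important_stuff

    def describe(self) -> str:
        return "the answer is " + self.important_stuff

raw = pickle.dumps(Data("42"))
print(raw)

# The class is referenced by module and name only...
has_module = Data.__module__.encode() in raw
has_class = b"Data" in raw
# ...the instance state is stored as plain data...
has_state = b"important_stuff" in raw and b"42" in raw
# ...and the method body is nowhere in the pickle.
has_code = b"the answer is" in raw

print(has_module, has_class, has_state, has_code)
```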
We can see that there isn’t much in there, a few opcodes and the associated data. You might be thinking, so what’s the problem with pickle?
Let’s try something else:
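The page's notes mention using a helper library to add exec instructions to a pickle. As a swapped-in sketch that needs only the standard library, a similar effect can be shown with the __reduce__ hook, which lets an object tell pickle which callable to invoke at load time. The Malicious class name and the messages are illustrative.

```python
import os
import pickle
import tempfile

class Malicious:
    def __reduce__(self):
        # Instead of object state, store "call eval on this string".
        # The expression prints a message, then evaluates to harmless-looking data.
        code = "print(\"you've been pwned!\") or 'my friend needs to know this'"
        return (eval, (code,))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "payload.pkl")

    # Create the malicious pickle.
    with open(path, "wb") as f:
        pickle.dump(Malicious(), f)

    # Innocently unpickle it, expecting only a friend's data.
    with open(path, "rb") as f:
        data = pickle.load(f)

print(data)
```

The attacker's message is printed during pickle.load, before the victim ever sees the returned data.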
When you run this, it creates a payload.pkl and prints the following:
If we check the contents of the pickle file, we get:
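To inspect such a file safely, the opcodes can be listed without loading it. The sketch below uses a hand-written pickle in the old text-based protocol 0 (chosen here only because it is short and readable) that asks the unpickler to call eval on a string, and disassembles it with the standard pickletools module.

```python
import pickletools

# Hand-written protocol 0 pickle, roughly: eval("1 + 1")
#   c ... GLOBAL  builtins eval
#   (     MARK
#   V ... UNICODE "1 + 1"
#   t     TUPLE   ("1 + 1",)
#   R     REDUCE  call eval with that tuple
#   .     STOP
raw = b"cbuiltins\neval\n(V1 + 1\ntR."

# Print the full disassembly; nothing is imported or executed here.
pickletools.dis(raw)

names = [opcode.name for opcode, arg, pos in pickletools.genops(raw)]
print(names)
# ['GLOBAL', 'MARK', 'UNICODE', 'TUPLE', 'REDUCE', 'STOP']
```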
Basically, this is what’s happening when you unpickle:
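In plain Python, the unpickler's handling of a GLOBAL (or STACK_GLOBAL) instruction followed by REDUCE amounts to roughly the following. This is a simplified sketch, not the actual unpickler implementation, and the harmless expression "1 + 1" stands in for an attacker's code.

```python
import importlib

# GLOBAL / STACK_GLOBAL: import a module and look up a name in it.
module = importlib.import_module("builtins")
func = getattr(module, "eval")

# The pickle also carries the arguments as ordinary data.
args = ("1 + 1",)

# REDUCE: call the imported function with those arguments.
result = func(*args)
print(result)  # 2
```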
The instructions that pose a threat are STACK_GLOBAL, GLOBAL and REDUCE.
REDUCE tells the unpickler to execute a function with the provided arguments, and the *GLOBAL instructions tell the unpickler to import stuff.
To sum up, pickle is dangerous because:
when importing a python module, arbitrary code can be executed
you can import builtin functions like eval or exec, which can be used to execute arbitrary code
when instantiating an object, the constructor may be called
This is why most documentation that covers pickle states: do not unpickle data from untrusted sources.
Don’t use pickle
Sound advice Luc, but pickle is used profusely and isn’t going anywhere soon: finding a new format everyone is happy with and initiating the change will take some time.
So what can we do for now?
If you know and trust user A and the commit that includes the file on the Hub is signed by user A’s GPG key, it’s pretty safe to assume that you can trust the file.
TensorFlow and Flax checkpoints are not affected, and can be loaded within PyTorch architectures using the from_tf and from_flax kwargs for the from_pretrained method to circumvent this issue.
E.g.:
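A hedged sketch using the transformers library: the helper name load_from_flax is hypothetical, and the call assumes that transformers and flax are installed and that the repository ships Flax weights.

```python
def load_from_flax(model_id: str):
    """Load a PyTorch model from Flax weights instead of a pickled checkpoint."""
    # Imported lazily: transformers is a third-party dependency.
    from transformers import AutoModel

    # from_flax=True converts the Flax checkpoint on the fly;
    # from_tf=True does the same for TensorFlow checkpoints.
    return AutoModel.from_pretrained(model_id, from_flax=True)
```

Calling load_from_flax("bert-base-cased") would then download and convert the weights; that model ID is only an example.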
safetensors is a simple serialization format that we are currently working on and experimenting with! Please help or contribute if you can 🔥.
What we have now
We have created a security scanner that scans every file pushed to the Hub and runs security checks. At the time of writing, it runs two types of scans:
ClamAV scans
Pickle Import scans
We have implemented a Pickle Import scan, which extracts the list of imports referenced in a pickle file. Every time you upload a pytorch_model.bin or any other pickled file, this scan is run.
On the Hub, the list of imports is displayed next to each file containing imports. If any import looks suspicious, it is highlighted.
Note that this is what lets you know whether unpickling a file will REDUCE on a potentially dangerous function that was imported by *GLOBAL.
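A minimal sketch of how such an import scan could work, assuming the standard pickletools module. This is not the Hub's actual scanner, and the function name list_imports is hypothetical. GLOBAL carries the module and name directly, while STACK_GLOBAL takes them from the stack, which the sketch approximates by remembering the last two string operands. That heuristic can miss names fetched back from the memo.

```python
import pickle
import pickletools

# Opcodes that push a text string; STACK_GLOBAL reads its operands from these.
STRING_OPCODES = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}

def list_imports(data: bytes) -> set:
    """Return the (module, name) pairs referenced by GLOBAL / STACK_GLOBAL."""
    imports = set()
    strings = []  # string operands seen so far, most recent last
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in STRING_OPCODES:
            strings.append(arg)
        elif opcode.name == "GLOBAL":
            # pickletools reports the argument as "module name".
            module, name = arg.split(" ", 1)
            imports.add((module, name))
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            imports.add((strings[-2], strings[-1]))
    return imports

suspicious = b"cbuiltins\neval\n(V1 + 1\ntR."  # hand-written protocol 0 pickle
harmless = pickle.dumps("just a string", protocol=4)

print(list_imports(suspicious))  # {('builtins', 'eval')}
print(list_imports(harmless))    # set()
```

Because genops only walks the byte stream, nothing in the scanned file is imported or executed.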
Disclaimer: this is not 100% foolproof. It is your responsibility as a user to check if something is safe or not. We are not actively auditing Python packages for safety; the safe/unsafe imports lists we have are maintained in a best-effort manner. If you think something is not safe, please contact us by sending an email to website at huggingface.co, and we will flag it as such.
Potential solutions
Thankfully, there is always a trace of the eval import, so reading the opcodes directly should allow us to catch malicious usage.
The current solution I propose is creating a file resembling a .gitignore but for imports.
This file would be a whitelist of imports: a pytorch_model.bin file would be flagged as dangerous if it contains imports not included in the whitelist.
One could imagine having a regex-ish format where you could allow all numpy submodules for instance via a simple line like: numpy.*.
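A rough sketch of that whitelist idea, assuming shell-style patterns matched with the standard fnmatch module. The pattern list and the function name is_allowed are hypothetical, not an existing Hub feature.

```python
from fnmatch import fnmatchcase

# Hypothetical whitelist, one pattern per line as in a .gitignore file.
ALLOWED_IMPORTS = """
collections.OrderedDict
numpy.*
torch.*
""".split()

def is_allowed(module: str, name: str, patterns=ALLOWED_IMPORTS) -> bool:
    """Return True if the fully qualified import matches any allowed pattern."""
    qualified = f"{module}.{name}"
    return any(fnmatchcase(qualified, pattern) for pattern in patterns)

print(is_allowed("numpy.core.multiarray", "_reconstruct"))  # True
print(is_allowed("collections", "OrderedDict"))             # True
print(is_allowed("builtins", "eval"))                       # False
```

A file would then be flagged as soon as one of its imports fails this check.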
Don’t worry too much about the instructions for now, just know that the module is very useful for analyzing pickles. It allows you to read the instructions in the file without executing any code.
Here we’re using the library for simplicity. It allows us to add pickle instructions to execute code contained in a string via the exec
function. This is how you circumvent the fact that you cannot define functions or classes in your pickles: you run exec on python code saved as a string.
On the Hub, you have the ability to sign your commits with a GPG key. This does not guarantee that your file is safe, but it does guarantee the origin of the file.
There’s an open discussion in progress at PyTorch on having a – please chime in there!
For ClamAV scans, files are run through the open-source antivirus ClamAV. While this covers a good number of dangerous files, it doesn’t cover pickle exploits.
We get this data thanks to which allows us to read the file without executing potentially dangerous code.
One could think of creating a custom in the likes of . But as we can see in this , this won’t work.