Methodology

But why

Let’s start with some hypothetical questions.

“But why generate test files dynamically?”, you may ask.

And the answer would be: “For a number of reasons.”

Because you do need files, and managing test files is a pain nobody wants. You create test files for one use case, then you need to support another, which requires modifying the original files. You either duplicate them or change them in place. After a number of iterations, your collection of test files grows so big that you can no longer easily tell how one file differs from another. Or your tests fail, and after spending time investigating you find out that a slight modification of one of the files broke your pipeline. You fix the error and decide to document your collection (a good thing anyway). But then your collection grows even more. The burden of managing the test files, their documentation and the test code becomes unbearable.

Now imagine doing it not for one, but for a number of projects. You want to be smart, so you build a shared collection of files, document it properly and think you’ve done a good job. But then you realise that you need to deviate from it or add new files to support new use cases. You want to be safe and decide to put it under version control. Your collection grows, you start to accept PRs from other devs and go down the rabbit hole of owning yet another critical repository. Your documentation grows and so does the repository size (mostly binary content). Storing such a huge number of files becomes a burden. It slows everyone down.

Not to mention that you might not be allowed to store some of the files you’re using for testing centrally, because you would first need to obfuscate or anonymize them to address the legal concerns of privacy regulations.

When test files are generated dynamically

When test files are generated dynamically, you are relieved of most of the concerns mentioned above. There are drawbacks too, such as test execution time: generating test files on the fly requires computational resources, so your CI execution time will grow.

Best practices

In some very specific use cases, mimicking the original files might be too difficult, and you might still want to include some of the very specific, hard-to-recreate files in the project repository, but on a much smaller scale. Use faker-file for simple use cases and only use custom files when things get too complicated otherwise. This is the so-called hybrid approach.

Identify what kind of files you need

faker-file supports a large variety of file types, but file content can generally be broken down into two categories:

Text based: Useful when testing OCR or text processing pipelines. At the moment, most of the faker-file providers generate text-based content.
Non-text based: Typically images and non-human-readable formats such as BIN. Useful when you need to test the validity of an uploaded file, but don’t care much about what’s inside.

Image providers:

File type	Graphic	Text	Generator
BMP	GraphicBmpFileProvider	BmpFileProvider	Pillow, WeasyPrint
GIF	GraphicGifFileProvider	GifFileProvider	Pillow, WeasyPrint
ICO	GraphicIcoFileProvider	IcoFileProvider	Pillow, imgkit, WeasyPrint
JPEG	GraphicJpegFileProvider	JpegFileProvider	Pillow, imgkit, WeasyPrint
PDF	GraphicPdfFileProvider	PdfFileProvider	Pillow, imgkit, WeasyPrint
PNG	GraphicPngFileProvider	PngFileProvider	Pillow, imgkit, WeasyPrint
SVG	(not supported)	SvgFileProvider	imgkit
TIFF	GraphicTiffFileProvider	TiffFileProvider	Pillow, imgkit*, WeasyPrint
WEBP	GraphicWebpFileProvider	WebpFileProvider	Pillow, imgkit*, WeasyPrint

Note

Items marked with * may require xvfb to function properly.

At the moment, two of the three text-to-image generators require additional system dependencies (such as wkhtmltopdf for imgkit and poppler for WeasyPrint, both of which are available for most popular operating systems, including Windows, macOS and Linux).

A few formats, such as BMP, GIF and TIFF, which are not supported by imgkit and the underlying wkhtmltopdf, rely on WeasyPrint, pdf2image and poppler through the WeasyPrintImageGenerator.

The lightest alternative to imgkit and WeasyPrint generators is the Pillow generator (PilImageGenerator), which is basic, but does not require additional system dependencies to be installed (most of the system dependencies for Pillow are likely already installed on your system: libjpeg, zlib, libtiff, libfreetype6 and libwebp).

Graphic image providers, on the other hand, rely on Pillow and the underlying system dependencies mentioned above.

Take a good look at the prerequisites to identify required dependencies.

TL;DR

For text-to-image file generation you could use Pillow based generators, which are basic, but do not require additional system dependencies. For advanced text-to-image file generation you could use either imgkit or WeasyPrint based generators, which require wkhtmltopdf and poppler respectively.

For graphic file generation, the only option is to use graphic file providers, which depend on Pillow (and underlying system dependencies) only.

Installation

When using faker-file for automated tests in a large project with a lot of dependencies, the recommended way to install it is to carefully pick the required dependencies and then use a requirements management package, like pip-tools, to compile them into a hashed set of packages that work well together.

For instance, if you only need DOCX and PDF support, your requirements.in file could look as follows:

faker
faker-file
python-docx
reportlab

If you only plan to use faker-file as a CLI application, just install all common dependencies as follows:

pipx install "faker-file[common]"

Creating files

A couple of use cases where faker-file can help you out:

Create a simple DOCX file

Let’s imagine we need to generate a DOCX file with text at most 50 characters long (kept short just for observability).

from faker import Faker
from faker_file.providers.docx_file import DocxFileProvider

FAKER = Faker()
FAKER.add_provider(DocxFileProvider)

docx_file = FAKER.docx_file(max_nb_chars=50)
print(docx_file)  # Sample value: 'tmp/tmpgdctmfbp.docx'
print(docx_file.data["content"])  # Sample value: 'Learn where receive social.'
print(docx_file.data["filename"])  # Sample value: '/tmp/tmp/tmpgdctmfbp.docx'

See the full example here

Create a more structured DOCX file

Imagine you need a sample letter. It contains placeholders for a date, city, country, name, text, address and phone number:

TEMPLATE = """
{{date}} {{city}}, {{country}}

Hello {{name}},

{{text}}

Address: {{address}}

Best regards,

{{name}}
{{address}}
{{phone_number}}
"""

docx_file = FAKER.docx_file(content=TEMPLATE)

print(docx_file)  # Sample value: 'tmp/tmpgdctmfbp.docx'
print(docx_file.data["content"])
# Sample value below:
#  2009-05-14 Pettyberg, Puerto Rico
#  Hello Lauren Williams,
#
#  Everyone bill I information. Put particularly note language support
#  green. Game free family probably case day vote.
#  Commercial especially game heart.
#
#  Address: 19017 Jennifer Drives
#  Jamesbury, MI 39121
#
#  Best regards,
#
#  Robin Jones
#  4650 Paul Extensions
#  Port Johnside, VI 78151
#  001-704-255-3093

See the full example here

Create an even more structured DOCX file

Imagine you need to generate a highly custom document with various types of data, such as images, tables, manual page breaks, paragraphs, etc.

from faker_file.base import DynamicTemplate
from faker_file.contrib.docx_file import (
    add_page_break,
    add_paragraph,
    add_picture,
    add_table,
)

# Create a DOCX file with paragraph, picture, table and manual page breaks
# in between the mentioned elements. The ``DynamicTemplate`` simply
# accepts a list of callables (such as ``add_paragraph``,
# ``add_page_break``) and dictionary to be later on fed to the callables
# as keyword arguments for customising the default values.
docx_file = FAKER.docx_file(
    content=DynamicTemplate(
        [
            (add_paragraph, {}),  # Add paragraph
            (add_page_break, {}),  # Add page break
            (add_picture, {}),  # Add picture
            (add_page_break, {}),  # Add page break
            (add_table, {}),  # Add table
            (add_page_break, {}),  # Add page break
        ]
    )
)

See the full example here

Note

All callables accept arguments. For instance, you could pass content=TEMPLATE to the add_paragraph function and, instead of just random text, get a more structured paragraph (TEMPLATE being the one from the previous example).

For when you think faker-file isn’t enough

As previously mentioned, test documents are sometimes too complex to replicate, and you may want to store just a few very specific documents in the project repository.

faker-file comes with a couple of providers that might still help you in that case.

Both FileFromPathProvider and RandomFileFromDirProvider were created to support the hybrid approach.

FileFromPathProvider

Create a file by copying it from the given path.

Create an exact copy of a file under a different name.
The prefix of the destination file will be zzz.
path is the absolute path to the file to copy.

from faker_file.providers.file_from_path import FileFromPathProvider

FAKER.add_provider(FileFromPathProvider)

# We assume that directory "/tmp/tmp/" exists and contains a file named
# "file.docx".
docx_file_copy = FAKER.file_from_path(
    path="/tmp/tmp/file.docx",
    prefix="zzz",
)

See the full example here

Now you don’t have to copy-paste your file from one place to another. It will be done for you in a convenient way.

RandomFileFromDirProvider

Create a file by copying it randomly from the given directory.

Create an exact copy of the randomly picked file under a different name.
The prefix of the destination file will be zzz.
source_dir_path is the absolute path to the directory to pick files from.

from faker_file.providers.random_file_from_dir import RandomFileFromDirProvider

FAKER.add_provider(RandomFileFromDirProvider)

# We assume that directory "/tmp/tmp/" exists and contains files with
# the ".docx" extension.
docx_file_copy = FAKER.random_file_from_dir(
    source_dir_path="/tmp/tmp/",
    prefix="zzz",
)

See the full example here

Now you don’t have to copy-paste your file from one place to another. It will be done for you in a convenient way.

Clean up files

FileSystemStorage is the default storage. By default, files are stored inside a tmp directory within the system’s temporary directory, which is commonly cleaned up after a system restart. However, there’s also a mechanism for cleaning up files after the tests run. At any time, to clean up all files created up to that moment, call the clean_up method of the FileRegistry class instance, as shown below:

# Import the registry instance directly
from faker_file.registry import FILE_REGISTRY

# Trigger the clean-up
FILE_REGISTRY.clean_up()

See the full example here

Typically, you would call the clean_up method in the tearDown method of your test case.

To remove a single file, use the remove method of the FileRegistry instance.

# We assume that there's an initialized `txt_file` instance to remove.
FILE_REGISTRY.remove(txt_file)  # Where txt_file is an instance of ``StringValue``

See the full example here

If you only have a string representation of the StringValue, first search for its corresponding StringValue instance using the search method.

# We assume that there's an initialized `filename` (str) to remove.
txt_file = FILE_REGISTRY.search(filename)
if txt_file:
    FILE_REGISTRY.remove(txt_file)

See the full example here