Methodology
But why
Let’s start with some hypothetical questions.
“But why generate test files dynamically?”, you may ask.
The answer is: for a number of reasons.
Because you do need files, and managing test files is a pain nobody wants to have. You create test files for one use case, then you need to support another, which requires either new files or modifications to the original ones. You either duplicate or make changes. At some point, after a number of iterations, your collection of test files grows so big that you can’t easily tell how some of the files differ from one another. Or your tests fail, you spend some time investigating and find out that a slight modification to one of the files made your pipeline fail. You fix the error and decide to document your collection (a good thing anyway). But then your collection grows even more. The burden of managing the test files, their documentation and the test code becomes unbearable.
Now imagine doing it not for one, but for a number of projects. You want to be smart, so you make a collection of files, document it properly and think you’ve done a good job. But then you start to realise that you need to deviate from it or add new files to the collection to support new use cases. You want to be safe and decide to put it under version control. Your collection grows, you start to accept PRs from other devs and go down the rabbit hole of owning yet another critical repository. Your documentation grows and so does the repository size (mostly binary content). Storing such a huge number of files becomes a burden. It slows everyone down.
Not to mention that you might not be allowed to store some of the files you’re using for testing centrally, because you would first need to obfuscate or anonymize them to address the legal concerns of privacy regulations.
When test files are generated dynamically
When test files are generated dynamically, you are relieved of most of the concerns mentioned above. There are a couple of drawbacks too, such as test execution time: generating test files on the fly requires computational resources, so your CI execution time will grow.
Best practices
In some very specific use cases, mimicking the original files might be too difficult, and you might still want to include some of the very specific, hard-to-recreate files in the project repository, but on a much smaller scale. Use faker-file for simple use cases and fall back to custom files only when things get too complicated. This is the so-called hybrid approach.
Identify what kind of files you need
faker-file supports a large variety of file types, but the content of files can generally be broken down into 2 categories:
Text based: Useful when testing OCR or text processing pipelines. At the moment, most of the faker-file providers generate text-based content.
Non-text based: Typically images and non-human-readable formats such as BIN. Useful when you need to test the validity of an uploaded file, but don’t care much about what’s inside.
Image providers:
| File type | Graphic | Text | Generator |
|---|---|---|---|
| BMP | GraphicBmpFileProvider | BmpFileProvider | Pillow, WeasyPrint |
| GIF | GraphicGifFileProvider | GifFileProvider | Pillow, WeasyPrint |
| ICO | GraphicIcoFileProvider | IcoFileProvider | Pillow, imgkit, WeasyPrint |
| JPEG | GraphicJpegFileProvider | JpegFileProvider | Pillow, imgkit, WeasyPrint |
| PDF | GraphicPdfFileProvider | PdfFileProvider | Pillow, imgkit, WeasyPrint |
| PNG | GraphicPngFileProvider | PngFileProvider | Pillow, imgkit, WeasyPrint |
| SVG | (not supported) | SvgFileProvider | imgkit |
| TIFF | GraphicTiffFileProvider | TiffFileProvider | Pillow, imgkit*, WeasyPrint |
| WEBP | GraphicWebpFileProvider | WebpFileProvider | Pillow, imgkit*, WeasyPrint |
Note
Items marked with * may require xvfb to function properly.
At the moment, 2 of the 3 text-to-image generators require additional system dependencies: wkhtmltopdf for imgkit and poppler for WeasyPrint. Both are available for the most popular operating systems, including Windows, macOS and Linux.
A few formats, such as BMP, GIF and TIFF, which are not supported by imgkit and the underlying wkhtmltopdf, rely on WeasyPrint, pdf2image and poppler through the WeasyPrintImageGenerator.
The lightest alternative to the imgkit and WeasyPrint generators is the Pillow generator (PilImageGenerator), which is basic, but does not require additional system dependencies to be installed (most of the system dependencies for Pillow are likely already installed on your system: libjpeg, zlib, libtiff, libfreetype6 and libwebp).
Graphic image providers on the other hand rely on Pillow and underlying system dependencies mentioned above.
Take a good look at the prerequisites to identify required dependencies.
TL;DR
For text-to-image file generation you could use Pillow based generators, which are basic, but do not require additional system dependencies. For advanced text-to-image file generation you could use either imgkit or WeasyPrint based generators, which require wkhtmltopdf and poppler respectively.
For graphic file generation, the only option is to use graphic file providers, which depend on Pillow (and underlying system dependencies) only.
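To decide which text-to-image generator a given machine can support, you could probe for the system binaries first. The following is a minimal sketch using only the standard library. The function name, the returned labels and the binary names it checks (wkhtmltopdf for imgkit, pdftoppm as a marker for a poppler installation) are assumptions made for illustration and are not part of faker-file.

```python
import shutil
from typing import Callable, List, Optional


def available_text_to_image_backends(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return labels of the backends whose system binaries were found."""
    # Pillow needs no extra system binaries, so it is always available here.
    backends = ["pillow"]
    # Assumption: a "wkhtmltopdf" binary on PATH means imgkit can be used.
    if which("wkhtmltopdf"):
        backends.append("imgkit")
    # Assumption: "pdftoppm" on PATH indicates that poppler is installed.
    if which("pdftoppm"):
        backends.append("weasyprint")
    return backends


# The result depends on what is installed on the current machine.
print(available_text_to_image_backends())
```

The `which` parameter exists only so the lookup can be replaced in tests. The check does not verify that the Python packages themselves are installed.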
Installation
When using faker-file for automated tests in a large project with a lot of dependencies, the recommended way to install it is to carefully pick the required dependencies and then use a requirements management package, such as pip-tools, to compile them into a hashed set of packages that work well together.
For instance, if you only need DOCX and PDF support, your requirements.in file could look as follows:
faker
faker-file
python-docx
reportlab
If you only plan to use faker-file as a CLI application, just install all common dependencies as follows:
pipx install "faker-file[common]"
Creating files
A couple of use cases where faker-file can help you out:
Create a simple DOCX file
Let’s imagine we need to generate a DOCX file with text up to 50 characters long (kept short so the output is easy to inspect).
from faker import Faker
from faker_file.providers.docx_file import DocxFileProvider
FAKER = Faker()
FAKER.add_provider(DocxFileProvider)
docx_file = FAKER.docx_file(max_nb_chars=50)
print(docx_file) # Sample value: 'tmp/tmpgdctmfbp.docx'
print(docx_file.data["content"]) # Sample value: 'Learn where receive social.'
print(docx_file.data["filename"]) # Sample value: '/tmp/tmp/tmpgdctmfbp.docx'
Create a more structured DOCX file
Imagine you need a sample letter with the following structure:
TEMPLATE = """
{{date}} {{city}}, {{country}}
Hello {{name}},
{{text}}
Address: {{address}}
Best regards,
{{name}}
{{address}}
{{phone_number}}
"""
docx_file = FAKER.docx_file(content=TEMPLATE)
print(docx_file) # Sample value: 'tmp/tmpgdctmfbp.docx'
print(docx_file.data["content"])
# Sample value below:
# 2009-05-14 Pettyberg, Puerto Rico
# Hello Lauren Williams,
#
# Everyone bill I information. Put particularly note language support
# green. Game free family probably case day vote.
# Commercial especially game heart.
#
# Address: 19017 Jennifer Drives
# Jamesbury, MI 39121
#
# Best regards,
#
# Robin Jones
# 4650 Paul Extensions
# Port Johnside, VI 78151
# 001-704-255-3093
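Note that in the sample output above the two {{name}} placeholders produced different names: each occurrence is filled in independently. The sketch below illustrates that idea with a simplified, standard-library stand-in. The render_template function and the fixed-value providers are hypothetical and only mimic the behaviour; this is not how Faker or faker-file implement template rendering.

```python
import re
from typing import Callable, Dict

# Matches placeholders such as {{name}} or {{ name }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, providers: Dict[str, Callable[[], str]]) -> str:
    """Replace each placeholder with a fresh value from the matching provider."""
    # The provider is called once per occurrence, from left to right.
    return PLACEHOLDER.sub(lambda match: providers[match.group(1)](), template)


# Hypothetical provider that hands out fixed names, one per call.
names = iter(["Lauren Williams", "Robin Jones"])
providers = {"name": lambda: next(names)}

letter = render_template("Hello {{name}},\nBest regards,\n{{name}}", providers)
print(letter)
# Hello Lauren Williams,
# Best regards,
# Robin Jones
```

If the same value must appear in several places, generate it once and put it into the template as plain text instead of repeating the placeholder.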
Create an even more structured DOCX file
Imagine you need to generate a highly customised document with various types of data, such as images, tables, manual page breaks, paragraphs, etc.
from faker_file.base import DynamicTemplate
from faker_file.contrib.docx_file import (
    add_page_break,
    add_paragraph,
    add_picture,
    add_table,
)

# Create a DOCX file with paragraph, picture, table and manual page breaks
# in between the mentioned elements. The ``DynamicTemplate`` simply
# accepts a list of callables (such as ``add_paragraph``,
# ``add_page_break``) and dictionary to be later on fed to the callables
# as keyword arguments for customising the default values.
docx_file = FAKER.docx_file(
    content=DynamicTemplate(
        [
            (add_paragraph, {}),  # Add paragraph
            (add_page_break, {}),  # Add page break
            (add_picture, {}),  # Add picture
            (add_page_break, {}),  # Add page break
            (add_table, {}),  # Add table
            (add_page_break, {}),  # Add page break
        ]
    )
)
Note
All callables do accept arguments. You could provide the content=TEMPLATE argument to the add_paragraph function and, instead of just random text, you would get a more structured paragraph (from one of the previous examples).
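The pattern behind a list of (callable, keyword arguments) pairs can be sketched with the standard library alone. In the sketch below, append_paragraph, append_page_break and apply_steps are hypothetical stand-ins that operate on a plain list. The real faker-file callables work on a DOCX document and have their own signatures, so treat this purely as an illustration of the idea.

```python
from typing import Any, Callable, Dict, List, Tuple

# One step is a callable plus the keyword arguments that customise it.
Step = Tuple[Callable[..., None], Dict[str, Any]]


def append_paragraph(document: List[str], content: str = "random text") -> None:
    """Hypothetical stand-in: add a paragraph to the document."""
    document.append(f"paragraph: {content}")


def append_page_break(document: List[str]) -> None:
    """Hypothetical stand-in: add a manual page break to the document."""
    document.append("page break")


def apply_steps(document: List[str], steps: List[Step]) -> None:
    """Call every step in order, passing its keyword arguments along."""
    for func, kwargs in steps:
        func(document, **kwargs)


document: List[str] = []
apply_steps(
    document,
    [
        (append_paragraph, {}),  # Default content
        (append_page_break, {}),
        (append_paragraph, {"content": "Hello {{name}}"}),  # Customised content
    ],
)
print(document)
# ['paragraph: random text', 'page break', 'paragraph: Hello {{name}}']
```

An empty dictionary keeps the defaults of a step, while a populated one overrides them. That is the role the dictionaries play in the DynamicTemplate example above.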
For when you think faker-file isn’t enough
As previously mentioned, test documents are sometimes too complex to replicate, and you may want to store just a few very specific documents in the project repository.
faker-file comes with a couple of providers that can still help you in that case.
Both FileFromPathProvider and RandomFileFromDirProvider are created to support the hybrid approach.
FileFromPathProvider
Create a file by copying it from the given path.
Create an exact copy of a file under a different name.
The prefix of the destination file will be zzz.
The path argument is the absolute path to the file to copy.
from faker_file.providers.file_from_path import FileFromPathProvider
FAKER.add_provider(FileFromPathProvider)
# We assume that directory "/tmp/tmp/" exists and contains a file named
# "file.docx".
docx_file_copy = FAKER.file_from_path(
    path="/tmp/tmp/file.docx",
    prefix="zzz",
)
Now you don’t have to copy-paste your file from one place to another. It will be done for you in a convenient way.
RandomFileFromDirProvider
Create a file by copying it randomly from the given directory.
Create an exact copy of the randomly picked file under a different name.
The prefix of the destination file will be zzz.
The source_dir_path argument is the absolute path to the directory to pick files from.
from faker_file.providers.random_file_from_dir import RandomFileFromDirProvider
FAKER.add_provider(RandomFileFromDirProvider)
# We assume that directory "/tmp/tmp/" exists and contains files with ".docx"
# extension.
docx_file_copy = FAKER.random_file_from_dir(
    source_dir_path="/tmp/tmp/",
    prefix="zzz",
)
Now you don’t have to copy-paste your file from one place to another. It will be done for you in a convenient way.
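To make clear what "an exact copy of a randomly picked file under a different name" amounts to, here is a rough standard-library sketch of the same idea. The copy_random_file function is hypothetical and is not faker-file's implementation: it does not use faker-file storages and does not register the copy for later clean-up.

```python
import os
import random
import shutil
import tempfile
from pathlib import Path


def copy_random_file(source_dir: str, prefix: str = "zzz") -> Path:
    """Copy a randomly picked file from source_dir to a new temporary file."""
    candidates = sorted(p for p in Path(source_dir).iterdir() if p.is_file())
    if not candidates:
        raise FileNotFoundError(f"No files found in {source_dir}")
    source = random.choice(candidates)
    # mkstemp creates a uniquely named empty file and returns an open descriptor.
    fd, destination = tempfile.mkstemp(prefix=prefix, suffix=source.suffix)
    os.close(fd)
    # Overwrite the empty file with the content of the picked file.
    shutil.copyfile(source, destination)
    return Path(destination)


# Demonstration with a throwaway source directory.
with tempfile.TemporaryDirectory() as source_dir:
    Path(source_dir, "file.docx").write_bytes(b"sample")
    file_copy = copy_random_file(source_dir)
    print(file_copy.name)  # Starts with "zzz" and ends with ".docx"
    file_copy.unlink()  # Remove the copy again
```

With the provider you get the same convenience, plus the copy is handled like any other file created by faker-file.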
Clean up files
FileSystemStorage is the default storage, and by default files are stored inside a tmp directory within the system’s temporary directory, which is commonly cleaned up after a system restart. However, there is also a mechanism for cleaning up files after the tests run. At any time, to clean up all files created up to that moment, call the clean_up method of the FileRegistry class instance, as shown below:
# Import instance at once
from faker_file.registry import FILE_REGISTRY
# Trigger the clean-up
FILE_REGISTRY.clean_up()
Typically you would call the clean_up method in tearDown.
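The tearDown pattern can be sketched as follows. To keep the example runnable with the standard library alone, it uses a hypothetical stand-in registry (TempFileRegistry) that only remembers the files it created. With faker-file you would call FILE_REGISTRY.clean_up() at the same place in tearDown.

```python
import os
import tempfile
import unittest


class TempFileRegistry:
    """Hypothetical stand-in registry: remembers the files it created."""

    def __init__(self) -> None:
        self._paths = set()

    def create(self, content: str, suffix: str = ".txt") -> str:
        """Create a temporary file with the given content and track it."""
        fd, path = tempfile.mkstemp(suffix=suffix)
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        self._paths.add(path)
        return path

    def clean_up(self) -> None:
        """Remove every tracked file that still exists."""
        while self._paths:
            path = self._paths.pop()
            if os.path.exists(path):
                os.remove(path)


REGISTRY = TempFileRegistry()


class TxtFileTestCase(unittest.TestCase):
    def tearDown(self) -> None:
        # Runs after every test method, whether it passed or failed.
        REGISTRY.clean_up()

    def test_file_is_readable(self) -> None:
        path = REGISTRY.create("Learn where receive social.")
        with open(path) as handle:
            self.assertEqual(handle.read(), "Learn where receive social.")
```

Because tearDown runs after each test method, files never accumulate between tests, even when an assertion fails.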
To remove a single file, use the remove method of the FileRegistry instance.
# We assume that there's an initialized `txt_file` instance to remove.
FILE_REGISTRY.remove(txt_file)  # Where txt_file is an instance of ``StringValue``
If you only have a string representation of the StringValue, try to search for its corresponding StringValue instance first, using the search method.
# We assume that there's an initialized `filename` (str) to remove.
txt_file = FILE_REGISTRY.search(filename)
if txt_file:
    FILE_REGISTRY.remove(txt_file)