Testing files like a pro

Note

Talk from the PyGrunn conference in 2023.

Create files with fake data. In many formats. With no effort.

Introduction

Thank you for choosing this talk and for being here.

There might be many reasons why you’re here. Perhaps you haven’t done any testing in Python that required files, so you’re curious. Or maybe you have done it many times, but never really liked what you did because it was too verbose or intrusive.

Every time I had to deal with testing files, I had to invent things, reinvent things, or recall things from the past. And each time, before diving into the rabbit hole of writing many lines of code or producing yet another collection of files stored somewhere, I checked for available solutions that could simplify things for me: something easier, less intrusive, and less work. I wanted it to be pleasant to work with.

As of today, I have found a solution that works well for me, and that’s what I want to share with you.

Why/motivation

But why, you may ask?

Because test files are often not available when you need them. At least, not at the right time for testing, because your customer or partner doesn’t have them. And even if they do have the right files, there are dozens of reasons for never-ending delays, most of which are related to privacy regulations, such as NDAs to be signed, anonymization, and so on.

And yet, there are deadlines. You have to come up with something, every time. For every project you work on. For every file format you are expected to support.

Or maybe you do have a few test files, and you decide to test your pipeline with the 100 you have (if you’re lucky to have that many) and it all works. Then you go live and discover that your system doesn’t perform well enough to handle thousands of them.

But what are files, really? Are they not just pieces of text and images, sometimes tables, audio and video, spreadsheets, presentations - all mostly originating from text? We can generate text!

Nowadays, we have concepts such as Synthetic Data and libraries like Faker to support them.

Intermezzo

And if you have never heard of Faker or the term Synthetic Data, I’ll give you a quick recap.

Synthetic data, or fake data, is computer-generated data that is similar to real-world data. Its primary purpose is to increase the privacy and integrity of systems.

Like everything else in life, it has pros, cons and alternatives.

The pros

  • Data privacy: Because it’s fake, there’s no risk of exposing sensitive user data and no need to comply with data privacy regulations.

  • Scalability: You can generate as much data as you need.

  • Control: You have full control over the data, so you can test specific rare edge cases.

The cons

  • Realism: Because it’s fake, it does not always accurately represent real data or contain the same patterns and anomalies. That could lead to less accurate testing.

  • Generation complexity: Creating realistic data can be complex and time-consuming, depending on the domain and the complexity of the data structures.

  • Maintenance: Keeping the data generation logic up-to-date with evolving application requirements does take time.

The alternatives

  • Production data anonymization: When you take a copy (or subset) of the real production data and anonymize it to remove or obfuscate sensitive information.

  • Manual test data creation: When you manually create test data, usually done for smaller scale or more specific testing.

  • Data augmentation: When you modify existing data to create new data.

All of the alternatives have their pros and cons too, but I’m not going to cover any of that in this presentation.

Faker is a Python package for generating synthetic text data. It knows many patterns and locales. It can generate names, texts, addresses, zip codes, ISBN numbers and a lot more.

I started to use Faker around 2016. It was such a relief! You could just do things like this:

from faker import Faker
FAKER = Faker()
FAKER.first_name()
FAKER.last_name()
FAKER.address()
FAKER.zip_code()
FAKER.text()
FAKER.isbn13()
FAKER.email()
FAKER.company_email()
FAKER.company()
FAKER.date_between(start_date="-30y", end_date="+30y")
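
If you need reproducible output across test runs, Faker can be seeded. A minimal sketch using Faker’s built-in seeding:

from faker import Faker

# Seed the shared random source so the generated values are deterministic
Faker.seed(1234)
FAKER = Faker()

print(FAKER.first_name())  # same value on every run with the same seed
print(FAKER.email())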

Before Faker there was Lorem Ipsum (or Lipsum), which was OK (or better than nothing), but didn’t make much sense.

Then Faker (and Faker-like libraries for creating fake data) emerged to save us.

Then test cases became more complex. Primary data sources were often files. We needed to test data/ETL pipelines. Faker still helped a lot, but replicating your previous best approach for files, or reinventing the wheel for each new project, was inconvenient.

That’s why faker-file was created. I wrote it mainly for myself, but you may find it useful too.

How does faker-file help to solve that problem?

In essence, faker-file is just a set of providers for the famous Faker library.

  • You can use it with Faker and factory_boy (for ORM integration).

  • It works with Django.

  • It supports remote storages (AWS S3, Google Cloud Storage, Azure Cloud Storage).

  • You are in control of the generated content. By default, for most basic cases, content is generated using Faker’s text method, but you can easily tweak that using the content argument.

You can use it to run a comprehensive integration test of your pipeline in your favorite cloud.

Some of the most commonly used file formats are supported:

  • BIN

  • CSV

  • DOCX

  • EML

  • EPUB

  • ICO

  • JPEG

  • MP3

  • ODP

  • ODS

  • ODT

  • PDF

  • PNG

  • RTF

  • PPTX

  • SVG

  • TXT

  • WEBP

  • XLSX

  • XML

  • ZIP

Installation

pip install faker-file[common]

Using it is as simple as follows.

Generate a DOCX file with fake content

  • Generate 1 DOCX file with fake content (generated by Faker).

# Import the Faker class from faker package
from faker import Faker

# Import the file provider we want to use
from faker_file.providers.docx_file import DocxFileProvider

FAKER = Faker()  # Initialise Faker instance

FAKER.add_provider(DocxFileProvider)  # Register the DOCX file provider

file = FAKER.docx_file()  # Generate a DOCX file

# Note that `file` in this case is an instance of either `StringValue`
# or `BytesValue`, which inherit from `str` and `bytes` respectively,
# but carry additional metadata. The metadata is stored in the `data`
# property (a `Dict`). One attribute common to all file providers is
# `filename`, which holds the absolute path to the generated file.
print(file.data["filename"])

# Another common attribute (although not available for all providers)
# is `content`, which holds the text the file was generated from.
print(file.data["content"])
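
If you want to verify what actually ended up in the document, you can read it back. A minimal sketch, assuming python-docx (which the DOCX provider uses under the hood) is installed in your environment:

from docx import Document

# Open the generated file and print its paragraphs; the text should match
# what is stored in `file.data["content"]`.
document = Document(file.data["filename"])
for paragraph in document.paragraphs:
    print(paragraph.text)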

Provide content manually

  • Generate 1 DOCX file with developer-defined content.

# The text we want to have in our generated DOCX file
TEXT = """
"The Queen of Hearts, she made some tarts,
    All on a summer day:
The Knave of Hearts, he stole those tarts,
    And took them quite away."
"""

# Generate a DOCX file with the given text
file = FAKER.docx_file(content=TEXT)

  • Similarly, generate 1 PNG file.

from faker_file.providers.png_file import PngFileProvider

FAKER.add_provider(PngFileProvider)

file = FAKER.png_file()

  • Similarly, generate 1 PDF file. Limit the line width to 80 characters.

from faker_file.providers.pdf_file import PdfFileProvider

FAKER.add_provider(PdfFileProvider)

file = FAKER.pdf_file(wrap_chars_after=80)

Provide templated content

You can generate documents from pre-defined templates.

TEMPLATE = """
{{date}} {{city}}, {{country}}

Hello {{name}},

{{text}}

Address: {{address}}

Best regards,

{{name}}
{{address}}
{{phone_number}}
"""

file = FAKER.pdf_file(content=TEMPLATE, wrap_chars_after=80)
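
The placeholders in the template correspond to Faker provider methods (date, city, name, and so on) and are filled in at generation time. To check the rendered result, a quick sketch using the metadata described earlier (assuming the provider exposes the content attribute, as most text-based providers do):

# `content` holds the template text with all placeholders filled in,
# `filename` points to the generated PDF on disk.
print(file.data["content"])
print(file.data["filename"])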

Archive types

ZIP archive containing 5 TXT files

As you might have noticed, some archive types are also supported. By default, the created archive will contain 5 files in TXT format.

from faker_file.providers.zip_file import ZipFileProvider

FAKER.add_provider(ZipFileProvider)

file = FAKER.zip_file()
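
To convince yourself that the archive really contains 5 TXT files, you can inspect it with the standard library. A minimal sketch, assuming the generated file lives on the local filesystem:

import zipfile

# List the entries of the generated archive; expect 5 *.txt files
with zipfile.ZipFile(file.data["filename"]) as archive:
    print(archive.namelist())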

ZIP archive containing 3 DOCX files with text generated from a template

from faker_file.providers.helpers.inner import create_inner_docx_file

file = FAKER.zip_file(
    prefix="zzz",
    options={
        "count": 3,
        "create_inner_file_func": create_inner_docx_file,
        "create_inner_file_args": {
            "prefix": "xxx_",
            "content": TEMPLATE,
        },
        "directory": "yyy",
    }
)

Nested ZIP archive

And of course, nested archives are supported too. Let’s create a ZIP file which contains 5 ZIP files, each of which contains 5 more ZIP files, each of which contains 2 DOCX files.

  • 5 ZIP files in the ZIP archive.

  • Content is generated dynamically.

  • Prefix the filenames inside the archive with nested_level_1_.

  • Prefix the filename of the archive itself with nested_level_0_.

  • Each of the ZIP files inside the archive in turn contains 5 other ZIP files, prefixed with nested_level_2_, which in turn contain 2 DOCX files.

from faker_file.providers.helpers.inner import create_inner_zip_file

file = FAKER.zip_file(
    prefix="nested_level_0_",
    options={
        "create_inner_file_func": create_inner_zip_file,
        "create_inner_file_args": {
            "prefix": "nested_level_1_",
            "options": {
                "create_inner_file_func": create_inner_zip_file,
                "create_inner_file_args": {
                    "prefix": "nested_level_2_",
                    "options": {
                        "count": 2,
                        "create_inner_file_func": create_inner_docx_file,
                        "create_inner_file_args": {
                            "content": TEXT + "\n\n{{date}}",
                        }
                    }
                },
            }
        },
    }
)

It works similarly for EML files (using EmlFileProvider).

from faker_file.providers.eml_file import EmlFileProvider
from faker_file.providers.helpers.inner import create_inner_docx_file

FAKER.add_provider(EmlFileProvider)

file = FAKER.eml_file(
    prefix="zzz",
    content=TEMPLATE,
    options={
        "count": 3,
        "create_inner_file_func": create_inner_docx_file,
        "create_inner_file_args": {
            "prefix": "xxx_",
            "content": TEXT + "\n\n{{date}}",
        },
    }
)
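
The generated EML is a regular RFC 822 message with the inner files attached, so you can inspect it with the standard library email package. A minimal sketch:

from email import policy
from email.parser import BytesParser

# Parse the generated message and list its attachments; per the options
# above, expect 3 DOCX attachments prefixed with "xxx_".
with open(file.data["filename"], "rb") as eml:
    message = BytesParser(policy=policy.default).parse(eml)

for attachment in message.iter_attachments():
    print(attachment.get_filename())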

Create a ZIP file with a variety of different file types within

  • 50 files in the ZIP archive (limited to DOCX, EPUB and TXT types).

  • Content is generated dynamically.

  • Prefix the filename of the archive itself with zzz_archive_.

  • Inside the ZIP, put all files in directory zzz.

from faker import Faker
from faker_file.providers.helpers.inner import (
    create_inner_docx_file,
    create_inner_epub_file,
    create_inner_txt_file,
    fuzzy_choice_create_inner_file,
)
from faker_file.providers.zip_file import ZipFileProvider
from faker_file.storages.filesystem import FileSystemStorage

FAKER = Faker()
STORAGE = FileSystemStorage()

kwargs = {"storage": STORAGE, "generator": FAKER}
file = ZipFileProvider(FAKER).zip_file(
    prefix="zzz_archive_",
    options={
        "count": 50,
        "create_inner_file_func": fuzzy_choice_create_inner_file,
        "create_inner_file_args": {
            "func_choices": [
                (create_inner_docx_file, kwargs),
                (create_inner_epub_file, kwargs),
                (create_inner_txt_file, kwargs),
            ],
        },
        "directory": "zzz",
    }
)

Another way to create a ZIP file with a variety of different file types within

  • 3 files in the ZIP archive (1 DOCX and 2 XML files).

  • Content is generated dynamically.

  • Filename of the archive itself is alice-looking-through-the-glass.zip.

  • Files inside the archive have fixed names (passed with the basename argument).

from faker import Faker
from faker_file.providers.helpers.inner import (
    create_inner_docx_file,
    create_inner_xml_file,
    list_create_inner_file,
)
from faker_file.providers.zip_file import ZipFileProvider
from faker_file.storages.filesystem import FileSystemStorage

FAKER = Faker()
STORAGE = FileSystemStorage()

kwargs = {"storage": STORAGE, "generator": FAKER}
file = ZipFileProvider(FAKER).zip_file(
    basename="alice-looking-through-the-glass",
    options={
        "create_inner_file_func": list_create_inner_file,
        "create_inner_file_args": {
            "func_list": [
                (create_inner_docx_file, {"basename": "doc"}),
                (create_inner_xml_file, {"basename": "doc_metadata"}),
                (create_inner_xml_file, {"basename": "doc_isbn"}),
            ],
        },
    }
)

Using the raw=True feature in tests

If you pass the raw=True argument to any provider or inner function, instead of creating a file you will get bytes back (or, to be precise, a bytes-like BytesValue object, which is basically bytes enriched with metadata). You can then use the bytes content of the file to build a test payload, as shown in the example test below:

import os
from io import BytesIO

from django.test import TestCase
from django.urls import reverse
from faker import Faker
from faker_file.providers.docx_file import DocxFileProvider
from rest_framework.status import HTTP_201_CREATED
from upload.models import Upload

FAKER = Faker()
FAKER.add_provider(DocxFileProvider)

class UploadTestCase(TestCase):
    """Upload test case."""

    def test_create_docx_upload(self) -> None:
        """Test create an Upload."""
        url = reverse("api:upload-list")

        raw = FAKER.docx_file(raw=True)
        test_file = BytesIO(raw)
        test_file.name = os.path.basename(raw.data["filename"])

        payload = {
            "name": FAKER.word(),
            "description": FAKER.paragraph(),
            "file": test_file,
        }

        response = self.client.post(url, payload)

        # Test if request is handled properly (HTTP 201)
        self.assertEqual(response.status_code, HTTP_201_CREATED)

        test_upload = Upload.objects.get(id=response.data["id"])

        # Test if the name is properly recorded
        self.assertEqual(str(test_upload.name), payload["name"])

        # Test if file name recorded properly
        self.assertEqual(str(test_upload.file.name), test_file.name)

Create an HTML file from a predefined template

If you want to generate a file in a format that is not (yet) supported, you can try to use GenericFileProvider. In the following example, an HTML file is generated from a template.

from faker import Faker
from faker_file.providers.generic_file import GenericFileProvider

file = GenericFileProvider(Faker()).generic_file(
    content="<html><body><p>{{text}}</p></body></html>",
    extension="html",
)
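
If you prefer the add_provider registration style used throughout this talk, the same example can be written as follows (a sketch, assuming the FAKER instance from the earlier examples):

from faker_file.providers.generic_file import GenericFileProvider

FAKER.add_provider(GenericFileProvider)

file = FAKER.generic_file(
    content="<html><body><p>{{text}}</p></body></html>",
    extension="html",
)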

Storages

Example usage with Django (using local file system storage)

from django.conf import settings
from faker_file.providers.txt_file import TxtFileProvider
from faker_file.storages.filesystem import FileSystemStorage

STORAGE = FileSystemStorage(
    root_path=settings.MEDIA_ROOT,
    rel_path="tmp",
)

FAKER.add_provider(TxtFileProvider)

file = FAKER.txt_file(content=TEXT, storage=STORAGE)

Example usage with AWS S3 storage

from faker_file.storages.aws_s3 import AWSS3Storage

S3_STORAGE = AWSS3Storage(
    bucket_name="test-bucket",
    root_path="tmp",  # Optional
    rel_path="sub-tmp",  # Optional
    # Credentials are optional too. If your AWS credentials are properly
    # set in the ~/.aws/credentials, you don't need to send them
    # explicitly.
    # credentials={
    #     "key_id": "YOUR KEY ID",
    #     "key_secret": "YOUR KEY SECRET"
    # },
)

file = FAKER.txt_file(storage=S3_STORAGE)

Augment existing files

If you think Faker-generated data doesn’t make sense for your case and you want your files to look like the collection of 100 files you already have, you could use the augmentation features.

You will need additional requirements:

pip install faker-file[ml]

Usage example:

from faker_file.providers.augment_file_from_dir import (
    AugmentFileFromDirProvider,
)

FAKER.add_provider(AugmentFileFromDirProvider)

file = FAKER.augment_file_from_dir(
    source_dir_path="/home/me/Documents/faker_file_source/",
    wrap_chars_after=120,
)

The generated file will resemble the text of the original documents, but will not be the same.

CLI

Even if you’re not using automated testing, but still want to quickly generate a file with fake content, you could use the faker-file CLI.

To enable shell auto-completion, generate the completion script and source it:

faker-file generate-completion
source ~/faker_file_completion.sh

Generate an MP3 file:

faker-file mp3_file --prefix=my_file_

Generate 10 DOCX files:

faker-file docx_file --nb_files 10 --prefix=my_file_

Without faker-file

There are alternatives.

You could simply store a collection of test files somewhere. If you do so, make sure you “know” your collection. It should be obvious how to use it. In other words: document it properly, alongside snippets that help make the most of it.

Then there comes a natural question: where to store them? Should they be centrally hosted or kept per repository?

An obvious drawback of the centrally hosted approach is that modifications become critical: a mistake may break your CI/CD pipeline. Also, you need to take care of the setup (for both CI/CD and development).

On the other hand, if you do it on a per-project/repository basis, or even using a blueprint repository, you miss out on direct contributions to the upstream.

By the way, consider storing your test files in Git LFS.

Besides, adding test files to the repository still feels a little bit strange to me. There’s always a case where you need a variation, and therefore you need to make another copy, sometimes a very large one. And then refactoring and cleaning up become almost unmanageable.

Additionally, you could always go for a mixed approach, where you still store some essential files in the repository (and those can be project specific), while making use of synthetic data in the cases where it’s justified.

Recap/conclusion

  • Most likely, a combination of Faker, factory_boy and faker-file will do just fine for your MVP and even way beyond that (you have it all in one: synthetic data + dynamic fixtures + file generation). This approach also saves you from thinking about where to store your test data and, overall, makes your code more manageable and simplifies the development process.

  • If you need to test files in your project, think upfront about the details, such as the number of test files you will need, where to store them, how to store them, etc.

  • If some of your test cases are too specific to replicate with faker-file, consider using a hybrid approach.