Testing files like a pro
========================
.. _Faker: https://faker.readthedocs.io/
.. _Lipsum: https://www.lipsum.com/
.. _factory_boy: https://factoryboy.readthedocs.io/
.. _Django: https://www.djangoproject.com/
.. _faker-file: https://faker-file.readthedocs.io/
.. _PyGrunn: https://pygrunn.org/

.. note::

    Talk from the `PyGrunn`_ conference in 2023.

Create files with fake data. In many formats. With no efforts.

Introduction
------------
Thank you for choosing this talk and for being here.

There might be many reasons why you're here. Perhaps you haven't done any
testing in Python that required files, so you're curious. Or maybe you have
done it many times, but never really liked what you did because it was too
verbose or intrusive.

Every time I had to deal with testing files, I had to invent things,
reinvent things, recall things from the past, and each time, before diving into
a rabbit hole of writing many lines of code or producing yet another
collection of files stored somewhere, I checked for available solutions that
could simplify things for me, make it easier, less intrusive, and less work.
I wanted to make it just fine and enjoyable to work with.

As of today, I have found a solution that works well for me, and that's what I
want to share with you.

Why/motivation
--------------

But why, you may ask?

Because test files are often not available when you need them. At least, not
at the right time for testing, because your customer or partner doesn't have
them. And even if they do have the right files, there are dozens of reasons
for never-ending delays, most of which are related to privacy regulations,
such as NDAs to be signed, anonymization, and so on.

And yet, there are deadlines. You have to come up with something, every time.
For every project you work on. For every file format you are expected
to support.

Or maybe you do have a few test files, and you decide to test your
pipeline with the 100 you have (if you're lucky to have that much) and it
all works. Then you go live and discover that your system doesn't perform
well enough to handle thousands of them.

But what are files really? Are they not just pieces of texts and images,
sometimes tables, audios and videos, spreadsheets, presentations -
all mostly originated from text. We can generate text!

Nowadays, we have concepts such as Synthetic Data and libraries like `Faker`_
to support these concepts.

Intermezzo
----------

And if you have never heard of `Faker`_ or the term Synthetic Data, I'll make a
quick recap for you.

Synthetic data, or fake data, is computer-generated data that is similar to
real-world data. It's primary purpose is to increase the privacy and integrity
of systems.

As everything else in life, it has pros, cons and alternatives.

The pros
~~~~~~~~

- **Data privacy**: Because it's fake - there's no risk of exposing sensitive
  user data and no need to comply with data privacy regulations.
- **Scalability**: You can generate as much data as you need.
- **Controll**: You have full control over the data, so you can test specific
  rare edge cases.

The cons
~~~~~~~~

- **Realism**: Because it's fake it does not always accurately represent real
  data or contain the same patterns and anomalies. That could lead to less
  accurate testing.
- **Generation complexity**: Creating realistic data can be complex and
  time-consuming, depending on the domain and the complexity of the data
  structures.
- **Maintenance**: Keeping the data generation logic up-to-date with evolving
  application requirements does take time.

The alternatives
~~~~~~~~~~~~~~~~

- **Production data anonymization**: When you take a copy (or subset) of the
  real production data and anonymize it to remove or obfuscate sensitive
  information.
- **Manual test data creation**: When you manually create test data, usually
  done for smaller scale or more specific testing.
- **Data augmentation**: When you modify existing data to create new data.

All of the alternatives have their pros and cons too, but I'm not going to
cover any of that in this presentation.

`Faker`_ is a Python package for generating synthetic text data.
It's knows many patterns and locales. It can generate names, texts, addresses,
zip codes, ISBN numbers and a lot more.

I started to use `Faker`_ around 2016. It was such a relief! You could just
do things like this:

.. code-block:: python

    from faker import Faker
    FAKER = Faker()
    FAKER.first_name()
    FAKER.last_name()
    FAKER.address()
    FAKER.zip_code()
    FAKER.text()
    FAKER.isbn13()
    FAKER.email()
    FAKER.company_email()
    FAKER.company()
    FAKER.date_between(start_date="-30y", end_date="+30y")

Before `Faker`_ there was `Lorem Ipsum` (or `Lipsum`_), which was OK (or better
than nothing), but didn't make much sense.

Then `Faker`_ (and `Faker`-like libraries for creating fake data) emerged to
save us.

Then test cases became more complex. Primary data sources were often
files. We needed to test data/ETL pipelines. `Faker`_ still helped a lot, but
it was inconvenient to replicate your previous best approach for files and
reinvent the wheel for each new project.

That's why `faker-file`_ was created. I wrote it mainly for myself, but
you may find it useful too.

How does `faker-file`_ help to solve that problem?
--------------------------------------------------

In essence, `faker-file`_ is just a set of providers for the
famous `Faker`_ library.

- You can use it with `Faker`_ and `factory_boy`_ (for ORM integration).
- It works with `Django`_.
- It supports remote storages (AWS S3, Google Cloud Storage, Azure Cloud
  Storage).
- You are in control of the generated content. By default, for most basic
  cases, content it's generated using `Faker`_'s ``text`` method, but you
  could easily tweak that using the ``content`` argument.

You can use it to run a comprehensive integration test of your pipeline in your
favorite cloud.

Some of the most commonly-used file formats are supported:

- `BIN`
- `CSV`
- `DOCX`
- `EML`
- `EPUB`
- `ICO`
- `JPEG`
- `MP3`
- `ODP`
- `ODS`
- `ODT`
- `PDF`
- `PNG`
- `RTF`
- `PPTX`
- `SVG`
- `TXT`
- `WEBP`
- `XLSX`
- `XML`
- `ZIP`

**Installation**

.. code-block:: sh

    pip install faker-file[common]

Using it is as simple as follows.

Generate a `DOCX` file with fake content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Generate 1 `DOCX` file with fake content (generated by `Faker`_).

.. code-block:: python

    # Import the Faker class from faker package
    from faker import Faker

    # Import the file provider we want to use
    from faker_file.providers.docx_file import DocxFileProvider

    FAKER = Faker()  # Initialise Faker instance

    FAKER.add_provider(DocxFileProvider)  # Register the DOCX file provider

    file = FAKER.docx_file()  # Generate a DOCX file

    # Note, that `file` is this case is an instance of either `StringValue`
    # or `BytesValue` objects, which inherit from `str` and `bytes`
    # respectively, but add meta data. Meta data is stored inside the `data`
    # property (`Dict`). One of the common attributes of which (among all
    # file providers) is the `filename`, which holds an absolute path to the
    # generated file.
    print(file.data["filename"])

    # Another common attribute (although it's not available for all providers)
    # is `content`, which holds the text used to generate the file with.
    print(file.data["content"])

Provide content manually
~~~~~~~~~~~~~~~~~~~~~~~~
- Generate 1 `DOCX` file with developer defined content.

.. code-block:: python

    # The text we want have in our generated DOCX file
    TEXT = """
    "The Queen of Hearts, she made some tarts,
        All on a summer day:
    The Knave of Hearts, he stole those tarts,
        And took them quite away."
    """

    # Generate a DOCX file with the given text
    file = FAKER.docx_file(content=TEXT)

- Similarly, generate 1 `PNG` file.

.. code-block:: python

    from faker_file.providers.png_file import PngFileProvider

    FAKER.add_provider(PngFileProvider)

    file = FAKER.png_file()

- Similarly, generate 1 `PDF` file. Limit the line width to 80 characters.

.. code-block:: python

    from faker_file.providers.pdf_file import PdfFileProvider

    FAKER.add_provider(PdfFileProvider)

    file = FAKER.pdf_file(wrap_chars_after=80)

Provide templated content
~~~~~~~~~~~~~~~~~~~~~~~~~
You can generate documents from pre-defined templates.

.. code-block:: python

    TEMPLATE = """
    {{date}} {{city}}, {{country}}

    Hello {{name}},

    {{text}}

    Address: {{address}}

    Best regards,

    {{name}}
    {{address}}
    {{phone_number}}
    """

    file = FAKER.pdf_file(content=TEMPLATE, wrap_chars_after=80)

Archive types
~~~~~~~~~~~~~
ZIP archive containing 5 TXT files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As you might have noticed, some archive types are also supported.
The created archive will contain 5 files in TXT format (defaults).

.. code-block:: python

    from faker_file.providers.zip_file import ZipFileProvider

    FAKER.add_provider(ZipFileProvider)

    file = FAKER.zip_file()

ZIP archive containing 3 DOCX files with text generated from a template
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python

    from faker_file.providers.helpers.inner import create_inner_docx_file

    file = FAKER.zip_file(
        prefix="zzz",
        options={
            "count": 3,
            "create_inner_file_func": create_inner_docx_file,
            "create_inner_file_args": {
                "prefix": "xxx_",
                "content": TEMPLATE,
            },
            "directory": "yyy",
        }
    )

Nested ZIP archive
^^^^^^^^^^^^^^^^^^
And of course nested archives are supported too. Create a `ZIP` file which
contains 5 `ZIP` files which contain 5 `ZIP` files which contain 2 `DOCX`
files.

- 5 `ZIP` files in the `ZIP` archive.
- Content is generated dynamically.
- Prefix the filenames in archive with ``nested_level_1_``.
- Prefix the filename of the archive itself with ``nested_level_0_``.
- Each of the `ZIP` files inside the `ZIP` file in their turn contains 5 other
  `ZIP` files, prefixed with ``nested_level_2_``, which in their turn contain
  2 `DOCX` files.

.. code-block:: python

    from faker_file.providers.helpers.inner import create_inner_zip_file

    file = FAKER.zip_file(
        prefix="nested_level_0_",
        options={
            "create_inner_file_func": create_inner_zip_file,
            "create_inner_file_args": {
                "prefix": "nested_level_1_",
                "options": {
                    "create_inner_file_func": create_inner_zip_file,
                    "create_inner_file_args": {
                        "prefix": "nested_level_2_",
                        "options": {
                            "count": 2,
                            "create_inner_file_func": create_inner_docx_file,
                            "create_inner_file_args": {
                                "content": TEXT + "\n\n{{date}}",
                            }
                        }
                    },
                }
            },
        }
    )

It works similarly for `EML` files (using ``EmlFileProvider``).

.. code-block:: python

    from faker_file.providers.eml_file import EmlFileProvider
    from faker_file.providers.helpers.inner import create_inner_docx_file

    FAKER.add_provider(EmlFileProvider)

    file = FAKER.eml_file(
        prefix="zzz",
        content=TEMPLATE,
        options={
            "count": 3,
            "create_inner_file_func": create_inner_docx_file,
            "create_inner_file_args": {
                "prefix": "xxx_",
                "content": TEXT + "\n\n{{date}}",
            },
        }
    )

Create a ZIP file with variety of different file types within
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- 50 files in the ZIP archive (limited to DOCX, EPUB and TXT types).
- Content is generated dynamically.
- Prefix the filename of the archive itself with `zzz_archive_`.
- Inside the ZIP, put all files in directory zzz.

.. code-block:: python

  from faker import Faker
  from faker_file.providers.helpers.inner import (
      create_inner_docx_file,
      create_inner_epub_file,
      create_inner_txt_file,
      fuzzy_choice_create_inner_file,
  )
  from faker_file.providers.zip_file import ZipFileProvider
  from faker_file.storages.filesystem import FileSystemStorage

  FAKER = Faker()
  STORAGE = FileSystemStorage()

  kwargs = {"storage": STORAGE, "generator": FAKER}
  file = ZipFileProvider(FAKER).zip_file(
      prefix="zzz_archive_",
      options={
          "count": 50,
          "create_inner_file_func": fuzzy_choice_create_inner_file,
          "create_inner_file_args": {
              "func_choices": [
                  (create_inner_docx_file, kwargs),
                  (create_inner_epub_file, kwargs),
                  (create_inner_txt_file, kwargs),
              ],
          },
          "directory": "zzz",
      }
  )

Another way to create a ZIP file with variety of different file types within
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- 3 files in the ZIP archive (1 DOCX, and 2 XML types).
- Content is generated dynamically.
- Filename of the archive itself is `alice-looking-through-the-glass.zip`.
- Files inside the archive have fixed name (passed with basename argument).

.. code-block:: python

  from faker import Faker
  from faker_file.providers.helpers.inner import (
      create_inner_docx_file,
      create_inner_xml_file,
      list_create_inner_file,
  )
  from faker_file.providers.zip_file import ZipFileProvider
  from faker_file.storages.filesystem import FileSystemStorage

  FAKER = Faker()
  STORAGE = FileSystemStorage()

  kwargs = {"storage": STORAGE, "generator": FAKER}
  file = ZipFileProvider(FAKER).zip_file(
      basename="alice-looking-through-the-glass",
      options={
          "create_inner_file_func": list_create_inner_file,
          "create_inner_file_args": {
              "func_list": [
                  (create_inner_docx_file, {"basename": "doc"}),
                  (create_inner_xml_file, {"basename": "doc_metadata"}),
                  (create_inner_xml_file, {"basename": "doc_isbn"}),
              ],
          },
      }
  )

Using raw=True features in tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you pass ``raw=True`` argument to any provider or inner function, instead
of creating a file, you will get bytes back (or to be totally correct,
bytes-like object ``BytesValue``, which is basically bytes enriched with
meta-data). You could then use the bytes content of the file to build a test
payload as shown in the example test below:

.. code-block:: python

  import os
  from io import BytesIO

  from django.test import TestCase
  from django.urls import reverse
  from faker import Faker
  from faker_file.providers.docx_file import DocxFileProvider
  from rest_framework.status import HTTP_201_CREATED
  from upload.models import Upload

  FAKER = Faker()
  FAKER.add_provider(DocxFileProvider)

  class UploadTestCase(TestCase):
      """Upload test case."""

      def test_create_docx_upload(self) -> None:
          """Test create an Upload."""
          url = reverse("api:upload-list")

          raw = FAKER.docx_file(raw=True)
          test_file = BytesIO(raw)
          test_file.name = os.path.basename(raw.data["filename"])

          payload = {
              "name": FAKER.word(),
              "description": FAKER.paragraph(),
              "file": test_file,
          }

          response = self.client.post(url, payload, format="json")

          # Test if request is handled properly (HTTP 201)
          self.assertEqual(response.status_code, HTTP_201_CREATED)

          test_upload = Upload.objects.get(id=response.data["id"])

          # Test if the name is properly recorded
          self.assertEqual(str(test_upload.name), payload["name"])

          # Test if file name recorded properly
          self.assertEqual(str(test_upload.file.name), test_file.name)

Create a HTML file predefined template
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you want to generate a file in a format that is not (yet) supported, you
can try to use ``GenericFileProvider``. In the following example, an HTML file
is generated from a template.

.. code-block:: python

  from faker import Faker
  from faker_file.providers.generic_file import GenericFileProvider

  file = GenericFileProvider(Faker()).generic_file(
      content="<html><body><p>{{text}}</p></body></html>",
      extension="html",
  )

Storages
~~~~~~~~

Example usage with `Django` (using local file system storage)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from django.conf import settings
    from faker_file.providers.txt_file import TxtFileProvider
    from faker_file.storages.filesystem import FileSystemStorage

    STORAGE = FileSystemStorage(
        root_path=settings.MEDIA_ROOT,
        rel_path="tmp",
    )

    FAKER.add_provider(TxtFileProvider)

    file = FAKER.txt_file(content=TEXT, storage=STORAGE)

Example usage with AWS S3 storage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python

    from faker_file.storages.aws_s3 import AWSS3Storage

    S3_STORAGE = AWSS3Storage(
        bucket_name="test-bucket",
        root_path="tmp",  # Optional
        rel_path="sub-tmp",  # Optional
        # Credentials are optional too. If your AWS credentials are properly
        # set in the ~/.aws/credentials, you don't need to send them
        # explicitly.
        # credentials={
        #     "key_id": "YOUR KEY ID",
        #     "key_secret": "YOUR KEY SECRET"
        # },
    )

    file = FAKER.txt_file(storage=S3_STORAGE)

Augment existing files
~~~~~~~~~~~~~~~~~~~~~~
If you think `Faker`_ generated data doesn't make sense for you and you want
your files to look like a collection of 100 files you already have, you could
use augmentation features.

You will need additional requirements:

.. code-block:: sh

    pip install faker-file[ml]

Usage example:

.. code-block:: python

    from faker_file.providers.augment_file_from_dir import (
        AugmentFileFromDirProvider,
    )

    FAKER.add_provider(AugmentFileFromDirProvider)

    file = FAKER.augment_file_from_dir(
        source_dir_path="/home/me/Documents/faker_file_source/",
        wrap_chars_after=120,
    )

Generated file will resemble text of the original document, but
will not be the same.

CLI
~~~
Even if you're not using automated testing, but still want to quickly
generate a file with fake content, you could use faker-file:

.. code-block:: sh

  faker-file generate-completion
  source ~/faker_file_completion.sh

Generate an MP3 file:

.. code-block:: sh

  faker-file mp3_file --prefix=my_file_

Generate 10 DOCX files:

.. code-block:: sh

  faker-file docx_file --nb_files 10 --prefix=my_file_

Without `faker-file`_
---------------------
There are alternatives.

You could simply store a collection of test files somewhere. If you do so, make
sure you "know" your collection. It should be obvious of how to use it. In
other words - document it properly, alongside snippets to make most of it.

Then there comes a natural question - where to store? Should it be centrally
hosted or per repository?

An obvious drawback of centrally hosted approach is that modifications become
critical. A mistake may cause failure of your CI/CD pipeline. Also, you need to
take care of the setup (for both CI/CD and development).

On the other hand, if you do it per project/repository basis, or even using a
blue-print repository, you miss these direct contributions to the upstream.

BTW, consider storing your test files in GitLFS.

Besides, adding test files to the repository still feels a little bit strange
to me. There's always a case when you need to have a variation and therefore
you need to make another copy, sometimes a very long copy. And oh, refactoring
and cleaning up becomes almost unmanageable.

Additionally, you could always go for a mixed approach, when some of the
essentially needed files you still do store in the repository (and that can
be project specific), while you still make use of the synthetic data for the
cases when it's justified.

Recap/conclusion
----------------
- Most likely, combination of `Faker`_, `factory_boy`_ and `faker-file`_ will
  do just fine for your MVP and even way beyond that (you have all in one:
  synthetic data + dynamic fixtures + generation of files).
  This approach also saves you from thinking about where to store your test
  data, and overall, makes your code more manageable and simplifies the
  development process.
- If you need to test files in your project, think upfront about the details,
  such as amount of test files you will need, where to store them, how to store
  them, etc.
- If some of your test cases are too specific to replicate with `faker-file`_,
  consider using hybrid approach.