This is a sample Python template project (boilerplate), designed for implementing various automation tasks.
It is also designed so that the scripts inside this project can be executed by different orchestration platforms (especially when built and deployed as a container image) - an orchestrator-agnostic design.
This boilerplate/template project implements basic functionality to effectively catch and log errors.
Running this project in a development environment can be done in two ways:
- "Classic way": Python and additional libraries are installed on the operating system where development is done
- Development Containers (Dev Containers): https://containers.dev/
The preferred way is using Development Containers.
Using Dev Containers makes establishing a development environment as fast as possible, without the hassle of installing all required software dependencies on the host operating system. A Dev Container with its configuration (.devcontainer.json) also provides all required additional extensions, libraries and other software inside the container image.
- Docker
- Visual Studio Code
- Dev Containers extension for Visual Studio Code: https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers
- (optional) Git: if source version control is needed
- (optional) Container Tools: for additional help working with container environments inside Visual Studio Code: https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-containers
Remarks for Docker
- For running Dev Containers, only the Docker container runtime is required, not its GUI counterpart Docker Desktop. The Docker container runtime is open source (Docker Desktop is not).
- Other container tools (like Rancher Desktop or Podman) can be used to establish Dev Containers functionality, but they may require additional configuration to work properly.
- For installing the Docker container runtime on Linux, do not use "pre-packed" Docker versions (like the snap version on Ubuntu); always install Docker via the "official" channel (e.g. sudo apt install docker.io on Ubuntu).
Remarks for development using Dev Containers
Visual Studio Code has full support for Dev Containers. Other tools may also support Dev Containers, but may lack specific features.
The build process should be performed when the project is set up on a new development environment (e.g. a computer) or when .devcontainer.json or the Dockerfile is changed.
To (re)build in Visual Studio Code, press the F1 function key, start typing "Dev Containers:" and then choose Dev Containers: Rebuild Container or Dev Containers: Rebuild Container Without Cache.
Developing and debugging this project using Dev Containers is identical to the "classic way", where all required software dependencies are installed on the host OS.
The preferred way of deployment is to pack the whole project (the src folder) with all required dependencies (additional libraries, drivers,...) into a container image (often called a "Docker image").
OCI image: https://opencontainers.org/
When packed in the described way, this is the most deterministic way the runtime environment is created (and reproduced).
Scripts inside the container can be executed manually (using an SSH session in a terminal and executing the desired command) or with a dedicated orchestration platform (which should have support for containerized environments) to establish real automation with advanced features (e.g. monitoring & alerting, retry mechanisms,...). Examples:
- Kestra: https://kestra.io/
- Apache Airflow: https://airflow.apache.org/
To keep this project organized, it's recommended that Dockerfile files are placed inside the docker folder (and its subfolders).
It's also highly recommended that Dockerfile files are not placed directly inside the docker folder, so that the purpose of each Dockerfile (e.g. for development, for deployment,...) is more obvious.
Inside the docker folder there are additional subfolders where Dockerfile files can be placed:
- deploy: Dockerfile files that are used for deployment purposes (regardless of environment - production, test,...). It's recommended that deployment images contain only the minimal amount of additional software packages (libraries,...) that are mandatory for this software solution.
- dev: Dockerfile files that are used for development purposes (e.g. Dev Containers). Subfolders inside this folder are specific to a development-container technology (e.g. Dev Containers, Eclipse Che DevWorkspace), mostly because of the most suitable base container image that integrates all required debugging libraries and tools for the specific technology.
  It's recommended that all Dockerfile files are placed inside subfolders of the dev folder and not directly inside the dev folder, to make it obvious which Dockerfile is associated with which development-container technology.
- Keep the Linux distribution (and its version) used for the container images the same across images, in order to make the development and deployment runtimes as similar as possible (e.g. Debian 13 - Trixie).
  This task requires manual synchronization (in the Dockerfile files) when a newer version and/or a different Linux distribution is used.
- Keep software packages (e.g. libraries, drivers) at the same versions across multiple Dockerfile files, in order to make the development and deployment runtimes as similar as possible.
  This task requires manual synchronization (in the Dockerfile files) when newer or different software packages are installed on the base operating system (container image).
  (Therefore requirements.txt with the additional Python dependencies is defined at the project root and referenced in the Dockerfile files.)
It's highly recommended that base images with all required software packages (e.g. libraries, drivers,...) are created and pushed to a container registry repository, because this approach has at least these three benefits:
- (much) faster builds
- a higher level of independence from public cloud artifact repositories (container images, libraries,...)
- in highly secured environments, disconnection from the public internet is often preferred
The recommended image build is split into the following stages:
- First stage: the official Python base image with all required software packages that this Python project needs (e.g. ODBC drivers,...).
  It's also recommended that installed build tools are removed after a successful build.
- Second stage: takes the stored container image from the first stage and installs all required Python packages for this project.
  Store this image as the second-stage image, where only the source code for this particular project is not yet included in the container image.
- Third stage: takes the image from the previous stage (where the Python libraries are installed) and copies the src folder from this repository.

When developing this project, in most cases only the image from the second stage (with all required software packages and additional Python libraries) is taken, and the src folder is transferred into the currently built container image.
When new or updated Python libraries should be added to this project, only a new second-stage container image has to be created (and the third stage should reference this new image).
When the whole base should be upgraded (e.g. a new major version of the base operating system and/or software packages), then all container images, starting from stage one, should be regenerated.
Naming conventions for Dockerfile files are defined with suffixes.
When generating custom base images (which are then stored inside a container registry), the following naming conventions are recommended:
- Dockerfile.base00: This Dockerfile exists just to have a reference to the official image that can be stored inside the desired container registry (to have full control and be independent of public container registries).
- Dockerfile.base01: This Dockerfile continues from base00 and installs all required software packages that this particular project needs (but without Python libraries).
  It's recommended that potentially installed build tools are removed before the end of the build process inside the container image.
- Dockerfile.base02: This Dockerfile continues from base01 and installs all required additional Python packages.
  It's highly recommended to generate these container images and store them in a container registry. This can speed up the development and deployment process.
- Dockerfile.multistage: This Dockerfile definition contains all stages, from the official Python base image to the stage where the container image with the src folder is created.
  Building from this Dockerfile takes the most time and also requires an active internet connection (to download and install all required software packages and libraries).
  Use this option only when there is no possibility to store base container images.
- Dockerfile: This Dockerfile takes the custom-built container image where all required software packages and Python packages are already installed (the base02 image). Stages that have to be implemented:
  - transferring the src folder into the container image
  - security and other optimizations
In the Dockerfile which produces the final container image with the application inside (with the src folder), create a dedicated system user with a minimal amount of privileges under which the scripts will be executed.
Do not write secrets directly inside code or config files and commit them to source version control (e.g. Git)!
When establishing a new development environment (e.g. during developer onboarding), the developer should create a .env file directly inside the src folder and place key-value pairs for the desired environment variables there.
(Of course, the .env file can also contain key-value pairs that are not secrets, if the development team decides so.)
(Committing the .env file to Git is prevented by the configuration inside the .gitignore file.)
It's also highly recommended that access to the project code (with the .env file) on development environments (e.g. machines) is protected as much as possible and, more importantly, that secrets for different environments (e.g. development, test, production) are different.
Example of .env file content:

```
MY_WEBSERVICE_API_TOKEN="adjjghg78lfdjngfdf887vjkdd7gjjkld9dkljkldfnjkld8"
MY_DB_USER_PWD="MySuperPwd@!1568"
```

There is a secrets_manager helper (placed directly inside the src folder) with methods get_secret_by_code(key: str) and get_secret_by_code_with_source(key: str), which can be called inside other scripts to obtain a secret value based on its secret ID/code.
It's good practice that in the development environment the .env file is loaded only once per script execution. Therefore secrets_manager should be instantiated first, and this instance should then be used throughout the whole script execution (even passing this secrets_manager instance to methods inside the script if there is parallel code).
Example how to instantiate secrets_manager in code before using the methods that retrieve secrets:

```python
sm = secrets_manager.get_secrets_manager()
```

Example code (without source info):

```python
sm = secrets_manager.get_secrets_manager()
api_token = sm.get_secret_by_code("MY_WEBSERVICE_API_TOKEN")
db_pwd = sm.get_secret_by_code("DB_SYS_USER_PWD")
```

Example code (with source info):

```python
sm = secrets_manager.get_secrets_manager()
api_token_with_info = sm.get_secret_by_code_with_source("MY_WEBSERVICE_API_TOKEN")
api_token_value = api_token_with_info.secret_value
```

(Use this method if you want to know and log the source the secret was read from.)
When deploying the app, the described methods in secrets_manager should be further developed to support interaction with the secret management system of choice (e.g. HashiCorp Vault).
It's important that the logic inside the described secrets_manager methods first checks whether the .env file exists and, if it exists, searches in this file first.
(Based on project specifics, the described methods can be implemented to look further for the secret in other secret sources, or to raise an exception if the secret is not found in a specific secret source.)
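To illustrate that lookup order, here is a minimal sketch of how such a method could be structured internally. This is an assumption for illustration only (it uses the python-dotenv package and a plain environment-variable fallback); the real secrets_manager.py may be organized differently:

```python
import os
from pathlib import Path

from dotenv import dotenv_values  # assumption: python-dotenv is available


class SecretsManager:
    """Simplified sketch - the real secrets_manager.py has more responsibilities."""

    def __init__(self, env_file: Path):
        self._env_file = env_file

    def get_secret_by_code(self, key: str) -> str:
        # 1) Development environment: if the .env file exists, search it first.
        if self._env_file.exists():
            value = dotenv_values(self._env_file).get(key)
            if value is not None:
                return value
        # 2) Fall back to another secret source (sketched here as plain environment
        #    variables; in a deployment this could be e.g. HashiCorp Vault), or raise
        #    an exception if the secret is not found - depending on project needs.
        value = os.environ.get(key)
        if value is None:
            raise KeyError(f"Secret '{key}' was not found in any configured secret source.")
        return value
```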
It's important that the IDs (codes) - like MY_WEBSERVICE_API_TOKEN from the example - are identical between the keys inside the .env file and the other secret management systems accessed by the scripts when they are deployed. This is the best way to ensure that scripts are deployed unchanged from the development environment to the test/production/... environment.
(In a corporate/enterprise environment it is highly advisable that such key-value secrets are shared with the developer using a dedicated secret management solution like Passbolt, in order to secure this part of transferring secrets (the developer can access the secrets that are relevant for their position inside the organization).)
All business logic should be placed inside the src folder.
It's highly recommended that all scripts are organized in subfolders and not placed directly inside the src folder (except the pre-defined functionality inside global_parameters.py, global_functions.py, global_logger.py, secrets_manager.py,... - that functionality is referenced in all (or most) other scripts).
- Subfolder names inside the src folder should be treated as application names, and subfolder names inside application folders as feature names.
- It's recommended to follow the PEP 8 Python style guide when naming folders, classes, methods,...: https://peps.python.org/pep-0008/
- There is a special folder named _templates inside the src folder where base templates are stored.
  Those templates are "copy/paste"-ready, with all required instructions (as comments) to start implementing a new feature (script).
- Python templates have some predefined arguments, which can be passed when a specific Python script (based on these templates) is executed (in the terminal or via an orchestrator).
  Those arguments have predefined default values, which can be overridden.
  The developer can of course add additional arguments to a specific script (based on these templates).
The development team is advised to consider using a database in some scenarios to store additional parameters that are required by scripts. Those parameters can also be key-value pairs, with the logic for reading/updating them placed inside global_functions.py or a dedicated helper library on this folder level (directly inside the src folder). This pattern introduces a kind of "database-driven execution" of scripts.
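As a loose illustration of that pattern, parameters could be read from a simple key-value table. Everything below (table name, column names, helper name) is hypothetical and not part of this project:

```python
# Hypothetical helper, e.g. inside global_functions.py or a dedicated helper module.
# Assumes a key-value table such as:
#   CREATE TABLE script_parameters (
#       parameter_key   VARCHAR(200) PRIMARY KEY,
#       parameter_value VARCHAR(2000)
#   );
def get_script_parameter(connection, parameter_key: str, default_value: str = None) -> str:
    # The "?" parameter marker is driver-dependent (some drivers use "%s" instead).
    cursor = connection.cursor()
    try:
        cursor.execute(
            "SELECT parameter_value FROM script_parameters WHERE parameter_key = ?",
            (parameter_key,),
        )
        row = cursor.fetchone()
        return row[0] if row else default_value
    finally:
        cursor.close()
```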
The predefined arguments are the following:
- --orchestrator: Parameter that is used to inform global_logger.py how to format logging messages so they are shown correctly inside the current execution environment.
  Default: UNSPECIFIED
  For the development environment, leave UNSPECIFIED so that logging messages are visible in the console (terminal) and colored (warn = yellow, error = red).
  When defining the script execution inside the orchestrator of choice (e.g. Kestra, Airflow,...), pass and set this parameter correctly.
  (Implement message formatting for additional orchestrators inside global_logger.py to support the orchestrator of choice. Kestra is already supported.)
- --logging_level: The amount of logging messages that should appear.
  Defined logging levels: DEBUG => INFO => WARN => ERROR
  Setting a logging level logs messages only from that level onwards (e.g. setting the "INFO" logging level would log only INFO, WARN and ERROR messages).
  Default: Defined by global_parameters.ENVIRONMENT_LOGGING_DEFAULT_LOGGING_LEVEL_NAME
- --show_script_running_id: Internally generated (inside the script itself) unique ID created when the script is started. Usually this is not needed and is reserved for very special debugging cases.
  Default: False
- --show_logging_instance_id: It's highly recommended that only one instance of the logger (global_logger.py) is created inside a script. This is an internally generated unique ID created when the logger instance is created (global_logger.Logger(...)).
  Consider multiple logger instances only when the script has multithreaded parts (to easily group the logging messages produced by the whole script by thread).
  Default: False
- --show_timestamp_utc: Parameter that is used to inform global_logger.py to show UTC-based timestamps at the start of every log message.
  Usually orchestrators (like Kestra, Airflow,...) append such a timestamp themselves, so there is no need to explicitly write it at the beginning of the message.
  Setting this parameter to True is mainly useful for measuring time when executing code inside the development environment (where logging messages are shown in the console/terminal).
  Default: False
- --show_timestamp_local: The same as --show_timestamp_utc, but showing local time instead of UTC.
  Default: False
  --show_timestamp_utc and --show_timestamp_local are not exclusive and both can be shown at the same time.
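For orientation, the predefined arguments above could be declared roughly like this with argparse. This is a sketch only - the actual declaration lives in the templates inside _templates and may differ (for example in how the boolean switches are parsed):

```python
import argparse

import global_parameters  # project-level constants referenced above

parser = argparse.ArgumentParser()
parser.add_argument("--orchestrator", default="UNSPECIFIED")
parser.add_argument(
    "--logging_level",
    default=global_parameters.ENVIRONMENT_LOGGING_DEFAULT_LOGGING_LEVEL_NAME,
    choices=["DEBUG", "INFO", "WARN", "ERROR"],
)
# Boolean switches, all defaulting to False (parsed here as simple on/off flags).
parser.add_argument("--show_script_running_id", action="store_true")
parser.add_argument("--show_logging_instance_id", action="store_true")
parser.add_argument("--show_timestamp_utc", action="store_true")
parser.add_argument("--show_timestamp_local", action="store_true")

args = parser.parse_args()
```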
- The developer should set up the development environment as described in this documentation.
- The developer gets the project (e.g. clones it from Git).
- The developer should create a .env file inside the src folder and place in it all environment variables that are needed inside the various scripts of the project (also described in this documentation).
After the developer has established the environment, the recommended way to implement a new feature is:
- The developer creates a new .py file inside the src structure (optionally also creating additional subfolders inside the src folder).
- The developer copies the content of the most appropriate base template (located inside the _templates folder) into this new .py file.
- Before further development, the developer should check the following (see also the skeleton sketch after this list):
  - Import additional required libraries (standard Python libraries that are not yet imported or libraries installed via a package manager) needed for the specific functionality inside the comment block:
    ```python
    # ---------- IMPORTANT: CUSTOM (Additional) Libraries for Script - START ----------
    # ---------- IMPORTANT: CUSTOM (Additional) Libraries for Script - END ----------
    ```
  - Adjust the value inside the brackets ([ ]) of ROOT = Path(__file__).resolve().parents[2] if needed (the number depends on the subfolder level where this new .py script lives inside the src folder).
  - Optionally import custom developed libraries/helpers/... - e.g. a "local", feature-related global_functions.py - inside the comment block:
    ```python
    # ----- CUSTOM LIBRARY IMPORT PLACEHOLDER - START -----
    # ----- CUSTOM LIBRARY IMPORT PLACEHOLDER - END -----
    ```
    Look at the examples (written as comments) in this part of the template for how to properly name and import a custom library/helper.
- The developer places the actual feature implementation inside the global try-except block of code located after the comment:
  ```python
  # -------------------- ACTUAL SCRIPT - Logic - START --------------------
  ```
  The developer should write the actual code inside the try block and optionally fill in error handling inside the except block (or in an additional try-except block inside the try part).
  The developer should not write code after the sys.exit() call (so that the final exit code, which should be recognized by the orchestrator, is not overridden).
  Usually this part of the script implements some kind of alerting (e.g. sending alerting emails or messages to instant messengers (Teams, Slack, Discord,...)).
- Other modifications are of course possible (e.g. adding additional arguments to the script,...), but are usually not necessary.
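To make the workflow above easier to picture, here is a minimal, simplified sketch of such a script skeleton. It is not the actual template from _templates - the comment markers are the ones quoted above, while the rest (the omitted argument handling and logging, the exit-code handling) is illustrative only:

```python
import sys
from pathlib import Path

# ---------- IMPORTANT: CUSTOM (Additional) Libraries for Script - START ----------
# additional libraries needed only by this specific feature are imported here
# ---------- IMPORTANT: CUSTOM (Additional) Libraries for Script - END ----------

# Adjust the number inside parents[...] to match how deep this script is nested
# inside the src folder.
ROOT = Path(__file__).resolve().parents[2]

# ----- CUSTOM LIBRARY IMPORT PLACEHOLDER - START -----
# feature-local helpers would be imported here (see the comments in the template)
# ----- CUSTOM LIBRARY IMPORT PLACEHOLDER - END -----

# -------------------- ACTUAL SCRIPT - Logic - START --------------------
exit_code = 0
try:
    # the actual feature implementation goes here
    pass
except Exception:
    # error handling / alerting (e.g. notification e-mails, Teams/Slack/Discord messages)
    exit_code = 1

# Do not write any code after this point, so the final exit code seen by the
# orchestrator is not overridden.
sys.exit(exit_code)
```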
- global_parameters.py: File where global constants for the project should be placed - it's a kind of configuration file.
  DO NOT PLACE SECRET VALUES HERE. This is not a safe location.
  It's recommended that constants are written in UPPER case characters, with underscores instead of spaces.
  Keep the ENVIRONMENT and ENVIRONMENT_LOGGING_DEFAULT_LOGGING_LEVEL_NAME constants.
  (E.g. for ENVIRONMENT set an appropriate value based on the Git branch.)
  All scripts and libraries can read from this file.
- global_logger.py: Implementation of the global logger, which can handle formatting and logging based on the arguments passed when executing scripts for various orchestrators or the development environment.
  Add additional logic to handle logging for the orchestrator of choice (e.g. Kestra, Airflow,...).
- secrets_manager.py: Helper library for reading secrets and for accessing the right secret source when running scripts inside various environments (development, test/production,...).
- notification_email_manager.py: Helper library to handle sending technically-related emails (errors, info,...).
  It's recommended that the development team creates separate additional helpers for each communication channel (e.g. notification_ms_teams_manager.py, notification_discord_manager.py,...).
- global_functions.py: Functionality that is globally relevant (content-wise) to the project.
  Keep in mind that development teams should recognize "locally global functions" (globally scoped within a project feature) in order to keep codebase readability optimal.
- db_connection_strings_helper.py: Helper that implements methods holding connection string templates (with placeholder values for sensitive data) for various database types (e.g. Microsoft SQL Server, IBM DB2,...).
  The helper also contains an internal method called _get_connection_string(), which returns the connection string with real values in place of the placeholders for secret data.
  The method _get_connection_string() expects the following parameters to be passed:
  - template_string: str: Template connection string with placeholder values, specific to the database type (e.g. Microsoft SQL Server, IBM DB2,...).
  - hostname_or_ip_address: str: Hostname or IP address where the database is located.
    Corresponding placeholder in the connection string template: [PLACEHOLDER_HOSTNAME_OR_IP]
  - port: int: Port on which the database is accessible on the defined hostname or IP address.
    Corresponding placeholder in the connection string template: [PLACEHOLDER_PORT]
  - database_name: str: Database name to connect to on the defined hostname/IP address and port.
    Corresponding placeholder in the connection string template: [PLACEHOLDER_DATABASE_NAME]
  - username: str: Username to connect to the defined database.
    Corresponding placeholder in the connection string template: [PLACEHOLDER_USERNAME]
  - password: str: Password to connect to the defined database.
    Corresponding placeholder in the connection string template: [PLACEHOLDER_PASSWORD]
The parameters mentioned above are common to any type of database (e.g. Microsoft SQL Server, IBM DB2,...) and should also be considered sensitive data.
The differences between database types are therefore only in specific properties inside the connection strings.
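A minimal sketch of what _get_connection_string() can look like internally (an assumed implementation for illustration; the real helper may differ):

```python
def _get_connection_string(
    template_string: str,
    hostname_or_ip_address: str,
    port: int,
    database_name: str,
    username: str,
    password: str
) -> str:
    # Replace the common placeholders with the real (sensitive) values.
    return (
        template_string
        .replace("[PLACEHOLDER_HOSTNAME_OR_IP]", hostname_or_ip_address)
        .replace("[PLACEHOLDER_PORT]", str(port))
        .replace("[PLACEHOLDER_DATABASE_NAME]", database_name)
        .replace("[PLACEHOLDER_USERNAME]", username)
        .replace("[PLACEHOLDER_PASSWORD]", password)
    )
```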
When a new database type (e.g. Oracle) should be added, a new method (or several) should be added to db_connection_strings_helper.py, where the connection string template (URI-style or key-value style) should also be defined. When creating a new connection string template to support a new database type (e.g. Oracle), use the predefined placeholder values in order to leverage the method _get_connection_string().

Example of a method that should be added to db_connection_strings_helper.py to support a new database type:

```python
def get_connection_string_mssql_uri(
    hostname_or_ip_address: str,
    port: int,
    database_name: str,
    username: str,
    password: str
) -> str:
    TEMPLATE_STRING = (
        "mssql://"
        "[PLACEHOLDER_USERNAME]"
        ":"
        "[PLACEHOLDER_PASSWORD]"
        "@"
        "[PLACEHOLDER_HOSTNAME_OR_IP]"
        ":"
        "[PLACEHOLDER_PORT]"
        "/"
        "[PLACEHOLDER_DATABASE_NAME]"
        "?encrypt=false&trustservercertificate=true"
    )
    return _get_connection_string(
        template_string=TEMPLATE_STRING,
        hostname_or_ip_address=hostname_or_ip_address,
        database_name=database_name,
        port=port,
        username=username,
        password=password
    )
```

It's highly recommended that the method name is descriptive.
Examples:
- get_connection_string_db2_uri(...)
- get_connection_string_oracle_key_value(...)
IMPORTANT
Real secret values for the placeholders in connection strings should be stored safely.
Therefore it's recommended to use secrets_manager.py to retrieve the secret data for a specific connection string, which can then be passed to the correct method inside db_connection_strings_helper.py.

For each access to a specific database instance (e.g. two SQL Server databases and one IBM DB2 database), it's highly recommended that the developer creates additional methods in db_manager.py inside the section between # ---- Database Management - SPECIFIC - START ----- and # ---- Database Management - SPECIFIC - END -----.

It's recommended that each secret for each database access has a descriptive ID/code in the secret management platform and that those IDs/codes are aligned across all environments (development, test, prod,...).
Examples:
- DB_MSSQL_<DB_NAME>_HOSTNAME_OR_IP_ADDRESS
- DB_DB2_<DB_NAME>_PORT
Inside db_manager.py, it's recommended that for each specific database access a method is created following the convention get_database_connection_data_<database_type>_<database_name>_<uri|key_value>.
Examples:
- get_database_connection_data_mssql_<database_name>_uri
- get_database_connection_data_db2_<database_name>_key_value

Example method implementation (copy/paste-ready for each new database; only the parameter names should be changed to reflect the correct parameter names in the secret store for this specific database):

```python
def get_database_connection_data_mssql_<db_name>(
    secrets_manager_instance: secrets_manager.SecretsManager = None
) -> DatabaseConnectionDataResult:
    return _get_database_connection_data(
        hostname_or_ip_address_parameter_name="DB_MSSQL_<DB_NAME>_HOSTNAME_OR_IP_ADDRESS",
        port_parameter_name="DB_MSSQL_<DB_NAME>_PORT",
        database_name_parameter_name="DB_MSSQL_<DB_NAME>_DATABASE_NAME",
        username_parameter_name="DB_MSSQL_<DB_NAME>_USER_USERNAME",
        password_parameter_name="DB_MSSQL_<DB_NAME>_USER_PASSWORD",
        secrets_manager_instance=secrets_manager_instance
    )
```

After that, one or more "associated" methods can be created to return the whole connection string (as a string).
It's also recommended that the method naming convention follows get_database_connection_string_<db_type>_<db_name>_<uri|key_value>.
Example method names:
- get_database_connection_string_mssql_<db_name>_uri
- get_database_connection_string_db2_<db_name>_key_value

Example method (mostly copy/paste-ready with minimal modifications):

```python
def get_database_connection_string_mssql_<db_name>_uri(
    secrets_manager_instance: secrets_manager.SecretsManager = None
) -> str:
    connection_data = get_database_connection_data_mssql_<db_name>(
        secrets_manager_instance=secrets_manager_instance
    )
    return db_connection_strings_helper.get_connection_string_mssql_uri(
        hostname_or_ip_address=connection_data.hostname_or_ip_address,
        port=connection_data.port,
        database_name=connection_data.database_name,
        username=connection_data.username,
        password=connection_data.password
    )
```

The following example shows how a specific connection string can be retrieved:

```python
sm = secrets_manager.get_secrets_manager()
DB_URI = db_manager.get_database_connection_string_mssql_<db_name>_uri(sm)

# or without the sm parameter if sm is not used elsewhere in the script
# (otherwise it's recommended that sm is created once and passed between methods)
DB_URI = db_manager.get_database_connection_string_mssql_<db_name>_uri()
```

Remarks
- Some of these parameters (especially the database name) can also be hardcoded, or kept outside a secret management platform, inside the db_manager.py methods, because the database name is already part of the method name (so the developer can read the code more clearly).
- requirements.txt (outside the src folder - directly in the project root): Globally defined additional libraries that are required by this project.
  All Dockerfile files and build processes (CI/CD) should reference this file to install all required additional libraries in their proper versions.
- Prefer faster dataframe libraries like polars (https://pola.rs/) over pandas where possible, especially for ELT/ETL pipelines (a small polars sketch is shown after this list).
  Such dataframe libraries leverage the full potential of modern hardware (multicore CPUs, advanced vector instruction sets,...), which brings significant performance gains.
- Write asynchronous and/or parallel code using ThreadPoolExecutor (https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) for I/O-bound tasks, ProcessPoolExecutor (https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor) for CPU-intensive tasks, and/or asyncio (https://docs.python.org/3/library/asyncio.html). See the ThreadPoolExecutor sketch after this list.
- Avoid ODBC drivers when accessing a database, if possible. ODBC introduces an additional layer (the driver manager), which brings additional operational overhead (memory consumption, longer processing times,...). Use a native Python driver for the database of choice if possible, for example (see also the connectorx sketch after this list):
  - mssql-python: https://pypi.org/project/mssql-python/
  - oracledb: https://pypi.org/project/oracledb/
  - connectorx: https://github.com/sfu-db/connector-x
  - Apache ADBC: https://arrow.apache.org/adbc/current/index.html
  In many cases, the performance benefit is 2-4x or even more.
- For basic exports to Excel .xlsx files, use highly optimized libraries (usually written in Rust or C/C++) like rustpy-xlsxwriter (https://pypi.org/project/rustpy-xlsxwriter/), because the performance gains compared to "traditional" Python-based xlsx libraries are huge (5-6x faster exports).
- For ELT/ETL pipelines, prefer transferring data in chunks at the database level.
  Prefer keyset pagination over offset-based pagination when reading large tables, if possible (keyset pagination cannot be used in every case); see the keyset pagination sketch after this list.
  Offset-based pagination can become slower with every page, because the skipped records must also be read by the database engine.
  When doing ELT/ETL tasks, the data transfer pipeline usually reads pages sequentially (without skipping pages), and if there is a stable key (int/bigint is the best option), keyset pagination is the fastest option for the described scenario.
  More about database pagination: https://dev.to/scion01/optimizing-pagination-in-postgresql-offsetlimit-vs-keyset-21dp
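Example for the dataframe recommendation above - a small polars sketch (illustrative only; the column names and data are made up):

```python
import polars as pl

# Small illustrative dataset - in real pipelines this would typically come from
# pl.read_csv(), pl.read_parquet() or pl.read_database().
df = pl.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "amount": [10.5, 200.0, 35.2, 99.9],
        "country": ["SI", "DE", "SI", "AT"],
    }
)

# Lazy execution lets polars optimize the whole query before running it.
result = (
    df.lazy()
    .filter(pl.col("amount") > 30)
    .group_by("country")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()
)
print(result)
```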
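Example for the concurrency recommendation - ThreadPoolExecutor used for an I/O-bound task (fetching several URLs in parallel; the URLs are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

URLS = [
    "https://example.com/resource-1",
    "https://example.com/resource-2",
    "https://example.com/resource-3",
]


def fetch(url: str) -> tuple[str, int]:
    # I/O-bound work: the thread spends most of its time waiting on the network.
    with urlopen(url, timeout=10) as response:
        return url, len(response.read())


with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(fetch, url): url for url in URLS}
    for future in as_completed(futures):
        url, size = future.result()
        print(f"{url}: {size} bytes")
```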
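Example for the database-driver recommendation - reading a table with connectorx straight into a polars dataframe (illustrative only; the connection string and query are placeholders, and it assumes the connectorx package is installed):

```python
import connectorx as cx

# The connection string would normally come from db_manager / secrets_manager
# (see the database helper section above) - hardcoded here only for illustration.
DB_URI = "mssql://user:password@hostname:1433/MyDatabase"

df = cx.read_sql(
    DB_URI,
    "SELECT customer_id, amount, country FROM sales.orders",
    return_type="polars",  # connectorx can also return e.g. pandas or arrow
)
print(df.shape)
```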
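Example for the pagination recommendation - keyset pagination over a large table with a stable integer key, sketched against a generic DB-API connection (the table/column names are placeholders; the parameter marker style and the row-limiting syntax depend on the database/driver):

```python
def read_table_in_chunks(connection, chunk_size: int = 50_000):
    """Yield rows chunk by chunk using keyset pagination on a stable integer key."""
    last_id = 0
    while True:
        cursor = connection.cursor()
        try:
            # Keyset pagination: continue after the last seen key instead of using
            # OFFSET, so the database never re-reads the already skipped rows.
            cursor.execute(
                "SELECT id, payload FROM source_table "
                "WHERE id > ? ORDER BY id "
                f"FETCH FIRST {int(chunk_size)} ROWS ONLY",
                (last_id,),
            )
            rows = cursor.fetchall()
        finally:
            cursor.close()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # key of the last row read in this chunk
```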