@@ -1,4 +1,4 @@
-name: Daily Summarize
+name: Daily Summarization

on:
  workflow_dispatch:
87 changes: 77 additions & 10 deletions README.md
@@ -48,20 +48,37 @@ Currently, the download data is collected from the following distributions:
In the future, we may expand the source distributions to include:
* [GitHub Releases](https://github.com/): Information about the project downloads from GitHub releases.

# Install
Install pymetrics using pip (or uv):
```shell
pip install git+ssh://git@github.com/sdv-dev/pymetrics
```

## Local Usage
Collect metrics from PyPI by running `pymetrics` on your computer. You need to provide the following:

1. BigQuery Credentials. To pull PyPI download data, `pymetrics` executes queries on Google BigQuery,
so you need an authentication JSON file, which must be provided to you by a privileged admin.
Once you have this JSON file, export the contents of the credentials file into a
`BIGQUERY_CREDENTIALS` environment variable.
2. A list of PyPI projects for which to collect the download metrics, defined in a YAML file.
See [config.yaml](./config.yaml) for an example.
3. Optional. Google Drive credentials can be provided, in the format required by `PyDrive`,
via the `PYDRIVE_CREDENTIALS` environment variable.
- See [instructions from PyDrive](https://pythonhosted.org/PyDrive/quickstart.html).
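Once you have the JSON file from your admin, one way to set the variable is the following sketch (the file name and placeholder contents are illustrative, not part of pymetrics):

```shell
# Save the service-account JSON you received (placeholder contents shown here),
# then export it into the environment variable that pymetrics reads.
printf '{"type": "service_account"}' > bigquery-credentials.json
export BIGQUERY_CREDENTIALS="$(cat bigquery-credentials.json)"
```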

You can run pymetrics with the following CLI command:

```shell
pymetrics collect-pypi --max-days 30 --add-metrics --output-folder {OUTPUT_FOLDER}
```

## Workflows

### Daily Collection
-On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see below).

-#### Metrics
-This PyPI download metrics are computed along several dimensions:
+On a daily basis, this workflow collects download data from PyPI and Anaconda. The data is then published in CSV format (`pypi.csv`). In addition, it computes metrics for the PyPI downloads (see [Aggregation Metrics](#aggregation-metrics)).

-- **By Month**: The number of downloads per month.
-- **By Version**: The number of downloads per version of the software, as determined by the software maintainers.
-- **By Python Version**: The number of downloads per minor Python version (eg. 3.8).
-- **And more!**

-### Daily Summarize
+### Daily Summarization

On a daily basis, this workflow summarizes the PyPI download data from `pypi.csv` and calculates downloads for libraries. The summarized data is published to a GitHub repo:
- [Downloads_Summary.xlsx](https://github.com/sdv-dev/sdv-dev.github.io/blob/gatsby-home/assets/Downloads_Summary.xlsx)
@@ -77,5 +94,55 @@ Installing the main SDV library also installs all the other libraries as dependencies

This methodology prevents double-counting downloads while providing an accurate representation of SDV usage.

## PyPI Data
PyMetrics collects download information from PyPI by querying the [public PyPI download statistics dataset on BigQuery](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi&page=dataset). The following data fields are captured for each download event:

**Temporal & Geographic Data:**
* `timestamp`: The timestamp at which the download happened
* `country_code`: The 2-letter country code

**Package Information:**
* `project`: The name of the PyPI project (library) that is being downloaded
* `version`: The downloaded version
* `type`: The type of file that was downloaded (source or wheel)

**Installation Environment:**
* `installer_name`: The installer used for the download, such as `pip`, `bandersnatch`, or `uv`
* `implementation_name`: The name of the Python implementation, such as `cpython`
* `implementation_version`: The Python version
* `ci`: A boolean flag indicating whether the download originated from a CI system (True, False, or null). This is determined by checking for specific environment variables set by CI platforms such as Azure Pipelines (`BUILD_BUILDID`), Jenkins (`BUILD_ID`), or general CI indicators (`CI`, `PIP_IS_CI`)
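A rough sketch of that kind of check in Python (the helper name is ours; the variable list is taken from the description above):

```python
import os

# CI-indicator environment variables named in the description above:
# Azure Pipelines, Jenkins, and generic CI flags.
CI_ENV_VARS = ('BUILD_BUILDID', 'BUILD_ID', 'CI', 'PIP_IS_CI')


def looks_like_ci(environ=None):
    """Return True if any known CI environment variable is set."""
    environ = os.environ if environ is None else environ
    return any(var in environ for var in CI_ENV_VARS)
```

For example, `looks_like_ci({'BUILD_ID': '42'})` returns `True`, while an environment with none of these variables returns `False`.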

**System Information:**
* `distro_name`: Name of the Linux or Mac distribution (empty on Windows)
* `distro_version`: Distribution version (empty on Windows)
* `system_name`: Type of OS, like Linux, Darwin (for Mac), or Windows
* `system_release`: OS version in case of Windows, kernel version in case of Unix
* `cpu`: CPU architecture used
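For intuition, these fields roughly mirror what Python's own `platform` module reports about the local machine (illustrative only; the dataset's values come from the installer's user agent, not from code like this):

```python
import platform

# Local equivalents of the system fields above (illustrative).
info = {
    'system_name': platform.system(),      # 'Linux', 'Darwin', or 'Windows'
    'system_release': platform.release(),  # kernel version on Unix, OS version on Windows
    'cpu': platform.machine(),             # e.g. 'x86_64', 'AMD64', 'arm64'
}
print(info)
```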

## Aggregation Metrics

If the `--add-metrics` option is passed to `pymetrics`, a spreadsheet with aggregation
metrics will be created alongside the raw PyPI downloads CSV file for each individual project.

The aggregation metrics spreadsheets contain the following tabs:

* **By Month:** Number of downloads per month and increase in the number of downloads from month to month.
* **By Version:** Absolute and relative number of downloads per version.
* **By Country Code:** Absolute and relative number of downloads per Country.
* **By Python Version:** Absolute and relative number of downloads per minor Python Version (X.Y, like 3.8).
* **By Full Python Version:** Absolute and relative number of downloads per Python Version, including
the patch number (X.Y.Z, like 3.8.1).
* **By Installer Name:** Absolute and relative number of downloads per Installer (e.g. pip).
* **By Distro Name:** Absolute and relative number of downloads per Distribution Name (e.g. Ubuntu).
* **By Distro Version:** Absolute and relative number of downloads per Distribution Name AND Version (e.g. Ubuntu 20.04).
* **By Distro Kernel:** Absolute and relative number of downloads per Distribution Name, Version AND Kernel (e.g. Ubuntu 18.04 - 5.4.104+).
* **By OS Type:** Absolute and relative number of downloads per OS Type (e.g. Linux).
* **By Cpu:** Absolute and relative number of downloads per CPU architecture (e.g. AMD64).
* **By CI:** Absolute and relative number of downloads by CI status (automated vs. manual installations).
* **By Month and Version:** Absolute number of downloads per month and version.
* **By Month and Python Version:** Absolute number of downloads per month and Python version.
* **By Month and Country Code:** Absolute number of downloads per month and country.
* **By Month and Installer Name:** Absolute number of downloads per month and Installer.
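Each single-dimension tab is essentially a group-by count with a percent column. A minimal pandas sketch of one such aggregation (toy data; the column names follow the fields listed earlier):

```python
import pandas as pd

# Toy downloads table using the field names described above.
downloads = pd.DataFrame({
    'version': ['1.0.0', '1.0.0', '1.1.0', '1.1.0', '1.1.0'],
    'country_code': ['US', 'DE', 'US', 'US', 'IN'],
})


def by_column(downloads, column):
    """Absolute and relative number of downloads per value of ``column``."""
    grouped = downloads.groupby(column, dropna=False).size().reset_index()
    grouped.columns = [column, 'downloads']
    grouped['percent'] = (grouped['downloads'] * 100 / grouped['downloads'].sum()).round(3)
    return grouped.sort_values('downloads', ascending=False)


print(by_column(downloads, 'version'))
```

The same pattern grouped on two columns yields the "By Month and …" tabs.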

## Known Issues
1. The conda package download data for Anaconda does not match the download count shown on the website, because some downloads are missing from the upstream dataset. See https://github.com/anaconda/anaconda-package-data/issues/45
1 change: 0 additions & 1 deletion config.yaml
@@ -1,4 +1,3 @@
-max-days: 7
projects:
- sdv
- ctgan
6 changes: 3 additions & 3 deletions pymetrics/__main__.py
@@ -49,7 +49,7 @@ def _collect_pypi(args):
    config = _load_config(args.config_file)
    projects = args.projects or config['projects']
    output_folder = args.output_folder
-    max_days = args.max_days or config.get('max-days')
+    max_days = args.max_days

    collect_pypi_downloads(
        projects=projects,
@@ -175,7 +175,7 @@ def _get_parser():
        '--max-days',
        type=int,
        required=False,
-        help='Max days of data to pull if start-date is not given.',
+        help='Max days of data to pull if start-date is not given',
    )
    collect_pypi.add_argument(
        '-f',
@@ -241,7 +241,7 @@ def _get_parser():
        type=int,
        required=False,
        default=90,
-        help='Max days of data to pull.',
+        help='Max days of data to pull. Default to last 90 days.',
    )
    return parser

2 changes: 1 addition & 1 deletion pymetrics/drive.py
@@ -97,7 +97,7 @@ def upload(content, filename, folder, convert=False):

    drive_file.content = content
    drive_file.Upload({'convert': convert})
-    LOGGER.info('Uploaded file %s', drive_file.metadata['alternateLink'])
+    LOGGER.info(f'Uploaded file {drive_file.metadata["alternateLink"]}')


def download(folder, filename, xlsx=False):
65 changes: 32 additions & 33 deletions pymetrics/metrics.py
@@ -1,17 +1,18 @@
"""Functions to compute aggregation metrics over raw downloads."""

import logging
-import re

+import numpy as np
import pandas as pd
+from packaging.version import InvalidVersion, Version

from pymetrics.output import create_spreadsheet

LOGGER = logging.getLogger(__name__)


def _groupby(downloads, groupby, index_name=None, percent=True):
-    grouped = downloads.groupby(groupby).size().reset_index()
+    grouped = downloads.groupby(groupby, dropna=False).size().reset_index()
    grouped.columns = [index_name or groupby, 'downloads']
    if percent:
        grouped['percent'] = (grouped.downloads * 100 / grouped.downloads.sum()).round(3)
@@ -78,6 +79,7 @@ def _get_sheet_name(column):
    'distro_kernel',
    'OS_type',
    'cpu',
+    'ci',
]
SORT_BY_DOWNLOADS = [
    'country_code',
@@ -104,34 +106,6 @@
]


-RE_NUMERIC = re.compile(r'^\d+')
-
-
-def _version_element_order_key(version):
-    components = []
-    last_component = None
-    last_numeric = None
-    for component in version.split('.', 2):
-        if RE_NUMERIC.match(component):
-            try:
-                numeric = RE_NUMERIC.match(component).group(0)
-                components.append(int(numeric))
-                last_component = component
-                last_numeric = numeric
-            except AttributeError:
-                # From time to time this errors out in github actions
-                # while it shouldn't enter the `if`.
-                pass
-
-        components.append(last_component[len(last_numeric):])
-
-    return components
-
-
-def _version_order_key(version_column):
-    return version_column.apply(_version_element_order_key)


def _mangle_columns(downloads):
    downloads = downloads.rename(columns=RENAME_COLUMNS)
    for col in [
@@ -153,6 +127,32 @@ def _mangle_columns(downloads):
    return downloads


+def _safe_version_parse(version_str):
+    if pd.isna(version_str):
+        return np.nan
+
+    try:
+        version = Version(str(version_str))
+    except InvalidVersion:
+        cleaned = str(version_str).rstrip('+~')
+        try:
+            version = Version(cleaned)
+        except (InvalidVersion, TypeError):
+            LOGGER.info(f'Unable to parse version: {version_str}')
+            version = np.nan
+
+    return version
+
+
+def _version_order_key(version_column):
+    return version_column.apply(_safe_version_parse)
+
+
+def _sort_by_version(data, column, ascending=False):
+    data = data.sort_values(by=column, key=_version_order_key, ascending=ascending)
+    return data


def compute_metrics(downloads, output_path=None):
    """Compute aggregation metrics over the given downloads.

@@ -171,8 +171,7 @@
        if column in SORT_BY_DOWNLOADS:
            sheet = sheet.sort_values('downloads', ascending=False)
        elif column in SORT_BY_VERSION:
-            sheet = sheet.sort_values(column, ascending=False, key=_version_order_key)
-
+            sheet = _sort_by_version(sheet, column=column, ascending=False)
        sheets[name] = sheet

    for column in HISTORICAL_COLUMNS:
@@ -181,7 +180,7 @@
        sheets[name] = _historical_groupby(downloads, [column])

    if output_path:
-        create_spreadsheet(output_path, sheets)
+        create_spreadsheet(output_path, sheets, na_rep='<NaN>')
        return None

    return sheets
10 changes: 5 additions & 5 deletions pymetrics/output.py
@@ -34,8 +34,8 @@ def get_path(folder, filename):
    return str(pathlib.Path(folder) / filename)


-def _add_sheet(writer, data, sheet_name):
-    data.to_excel(writer, sheet_name=sheet_name, index=False, engine='xlsxwriter')
+def _add_sheet(writer, data, sheet_name, na_rep=''):
+    data.to_excel(writer, sheet_name=sheet_name, index=False, engine='xlsxwriter', na_rep=na_rep)

    for column in data:
        column_length = None
@@ -51,7 +51,7 @@ def _add_sheet(writer, data, sheet_name):
)


-def create_spreadsheet(output_path, sheets):
+def create_spreadsheet(output_path, sheets, na_rep=''):
    """Create a spreadsheet with the indicated name and data.

    If the ``output_path`` variable starts with ``gdrive://`` it is interpreted
@@ -74,11 +74,11 @@

    with pd.ExcelWriter(output, engine='xlsxwriter') as writer:  # pylint: disable=E0110
        for title, data in sheets.items():
-            _add_sheet(writer, data, title)
+            _add_sheet(writer, data, title, na_rep=na_rep)

    if drive.is_drive_path(output_path):
-        LOGGER.info('Creating file %s', output_path)
        folder, filename = drive.split_drive_path(output_path)
+        LOGGER.info(f'Creating file {filename}')
        drive.upload(output, filename, folder, convert=True)
    else:
        if not output_path.endswith('.xlsx'):
8 changes: 5 additions & 3 deletions pymetrics/pypi.py
@@ -25,6 +25,7 @@
    details.system.name as system_name,
    details.system.release as system_release,
    details.cpu as cpu,
+    details.ci as ci,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.project in {projects}
    AND timestamp > '{start_date}'
@@ -44,6 +45,7 @@
    'system_name',
    'system_release',
    'cpu',
+    'ci',
]


@@ -129,9 +131,9 @@ def get_pypi_downloads(
    if previous is not None:
        if isinstance(projects, str):
            projects = (projects,)
-        previous_projects = previous[previous.project.isin(projects)]
-        min_date = previous_projects.timestamp.min().date()
-        max_date = previous_projects.timestamp.max().date()
+        previous_projects = previous[previous['project'].isin(projects)]
+        min_date = previous_projects['timestamp'].min().date()
+        max_date = previous_projects['timestamp'].max().date()
    else:
        previous = pd.DataFrame(columns=OUTPUT_COLUMNS)
        min_date = None
4 changes: 2 additions & 2 deletions pymetrics/summarize.py
@@ -140,14 +140,14 @@ def get_previous_pypi_downloads(output_folder, dry_run=False):
            'system_name': pd.CategoricalDtype(),
            'system_release': pd.CategoricalDtype(),
            'cpu': pd.CategoricalDtype(),
+            'ci': pd.BooleanDtype(),
        },
    }
    if dry_run:
        read_csv_kwargs['nrows'] = 10_000
    data = load_csv(csv_path, read_csv_kwargs=read_csv_kwargs)
    LOGGER.info('Parsing version column to Version class objects')
-    if 'version' in data.columns:
-        data['version'] = data['version'].apply(parse)
+    data['version'] = data['version'].apply(parse)
    return data

