Calculate ecosystem downloads with library interdependencies #14

@gsheni

Context

When calculating downloads for the SDV ecosystem, we take library dependencies into account. For example, SDV downloads are subtracted from RDT downloads (since SDV depends on RDT, every SDV install also pulls in RDT). This avoids double-counting downloads. See the formula below for this calculation.

This download adjustment needs to be extended to other ecosystems that maintain multiple related packages on PyPI. This will give us an accurate picture of the usage of those ecosystems.

Formula for SDV Ecosystem Download Count

  • To calculate the download count for the SDV ecosystem:
    • Get the number of downloads for sdgym and sdv
    • Adjust the number of sdv downloads by subtracting the number of sdgym downloads
    • Get the number of downloads for rdt, copulas, ctgan, deepecho, sdmetrics
    • Adjust the downloads for rdt, copulas, ctgan, deepecho, sdmetrics by subtracting the number of sdv downloads.
      • These libraries are all direct dependencies of SDV.
    • Ensure no download count is negative (e.g., max(0, copulas_adjusted_count))
    • Sum all downloads to get SDV (ecosystem) download count
  • A Colab example.
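The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual implementation: the function name `sdv_ecosystem_downloads` and the input dictionary are hypothetical, and it assumes that the raw (unadjusted) sdv count is the value subtracted from each direct dependency, which the formula does not state explicitly.

```python
SDV_DIRECT_DEPENDENCIES = ("rdt", "copulas", "ctgan", "deepecho", "sdmetrics")


def sdv_ecosystem_downloads(downloads):
    """Return the dependency-adjusted SDV ecosystem download count.

    `downloads` maps a package name to its raw download count.
    """
    sdgym = downloads["sdgym"]
    sdv_raw = downloads["sdv"]

    # sdv downloads triggered by sdgym installs are already counted under sdgym.
    sdv_adjusted = max(0, sdv_raw - sdgym)

    total = sdgym + sdv_adjusted
    for name in SDV_DIRECT_DEPENDENCIES:
        # Assumption: subtract the raw sdv count, since every sdv install
        # (including those triggered by sdgym) also pulls in this library.
        total += max(0, downloads[name] - sdv_raw)
    return total


# Made-up numbers, for illustration only.
example = {
    "sdgym": 10, "sdv": 100, "rdt": 150, "copulas": 90,
    "ctgan": 120, "deepecho": 100, "sdmetrics": 130,
}
ecosystem_total = sdv_ecosystem_downloads(example)
```

With these made-up counts, copulas and deepecho are clamped to zero rather than contributing a negative value to the sum.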

Problem

Download calculations for external library ecosystems currently don't account for their internal dependencies, leading to inflated download numbers. We need a system to:

  1. Identify all libraries within each ecosystem
  2. Map their interdependencies
  3. Apply a dependency-aware calculation similar to the one used for the SDV ecosystem.
  4. Keep this information current as ecosystems evolve (external libraries will add/remove dependencies)
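One way steps 2 and 3 could fit together is to drive the adjustment from a dependency map instead of hard-coding package names. The sketch below is an assumption about how that might look: `adjust_ecosystem_downloads` and the `depends_on` structure are hypothetical, and only direct, in-ecosystem dependencies are considered.

```python
def adjust_ecosystem_downloads(downloads, depends_on):
    """Adjust raw download counts using in-ecosystem dependency edges.

    `downloads` maps package name -> raw download count.
    `depends_on` maps package name -> set of packages it directly depends on.
    Returns (adjusted counts per package, ecosystem total).
    """
    # Invert the map: for each package, which ecosystem packages depend on it?
    dependents = {name: set() for name in downloads}
    for package, dependencies in depends_on.items():
        if package not in downloads:
            continue
        for dependency in dependencies:
            if dependency in dependents:
                dependents[dependency].add(package)

    adjusted = {}
    for name, raw in downloads.items():
        # Downloads of direct dependents are assumed to have pulled in `name`.
        pulled_in = sum(downloads[dependent] for dependent in dependents[name])
        adjusted[name] = max(0, raw - pulled_in)
    return adjusted, sum(adjusted.values())


# The SDV ecosystem expressed as a dependency map (counts are made up).
sdv_depends_on = {
    "sdgym": {"sdv"},
    "sdv": {"rdt", "copulas", "ctgan", "deepecho", "sdmetrics"},
}
sdv_downloads = {
    "sdgym": 10, "sdv": 100, "rdt": 150, "copulas": 90,
    "ctgan": 120, "deepecho": 100, "sdmetrics": 130,
}
adjusted_counts, ecosystem_total = adjust_ecosystem_downloads(sdv_downloads, sdv_depends_on)
```

Because the dependency map is plain data, it could be regenerated on each run (step 4) without touching the calculation itself.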

Description

When generating the Summary of downloads, the download counts should be adjusted for Gretel, ydata, and mostly. If future libraries are added, the dependencies of these libraries should be identified and taken into account.

Ecosystem Definition

A library is considered part of an ecosystem if:

  • It shares the same maintainers with other packages, AND
  • It has internal dependencies within the ecosystem
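The definition above could be checked programmatically along the following lines. This is a sketch under two stated assumptions that the definition leaves open: "shares the same maintainers" is read as having at least one maintainer in common with another package, and "internal dependencies" is read as depending on, or being depended on by, one of those packages. The function name and input shapes are hypothetical.

```python
def ecosystem_members(maintainers, depends_on):
    """Return the packages that satisfy both ecosystem criteria.

    `maintainers` maps package name -> set of maintainer names.
    `depends_on` maps package name -> set of direct dependencies.
    """
    members = set()
    for package, owners in maintainers.items():
        # Criterion 1 (assumed reading): at least one maintainer in common.
        peers = {
            other for other, other_owners in maintainers.items()
            if other != package and owners & other_owners
        }
        if not peers:
            continue

        # Criterion 2 (assumed reading): a dependency edge to or from a peer.
        depends_on_peer = bool(depends_on.get(package, set()) & peers)
        used_by_peer = any(package in depends_on.get(peer, set()) for peer in peers)
        if depends_on_peer or used_by_peer:
            members.add(package)
    return members


# Hypothetical packages and maintainers, for illustration only.
example_maintainers = {"a": {"x"}, "b": {"x"}, "c": {"x"}, "d": {"y"}}
example_depends_on = {"a": {"b", "numpy"}, "c": {"numpy"}, "d": {"a"}}
members = ecosystem_members(example_maintainers, example_depends_on)
```

In this example, `c` shares a maintainer but has no dependency link to its peers, and `d` depends on `a` but shares no maintainer, so neither qualifies.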

Deliverables

  • In the daily workflow, determine the libraries that are in the ecosystem.
  • For each ecosystem, adjust the download numbers when populating the Summary of Downloads
    • Use the above calculation as an example for how to adjust the downloads
    • Ensure ecosystem downloads are clearly marked in the Google Sheet (by including "(ecosystem)" in the name)

Tools to get dependencies

  • pipgrip
    • This does not require the library to be installed (ideal)
  • PyPI API (look at PyPI users via the API, e.g. gretel-ai, ydata)
  • Manually define it in a YAML file?
  • pipdeptree
    • This requires the library to be installed (not ideal)
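As an illustration of the PyPI API option, the sketch below reads the `requires_dist` field from PyPI's JSON endpoint (`https://pypi.org/pypi/<package>/json`) and keeps only the dependencies that belong to a given ecosystem. The helper names are hypothetical, the requirement parsing is a simplified regex rather than a full PEP 508 parser, and skipping dependencies guarded by an `extra ==` marker is an assumption about what should count as a direct dependency.

```python
import json
import re
from urllib.request import urlopen

NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")
EXTRA_RE = re.compile(r"\bextra\s*==")


def normalize(name):
    """Normalize a project name so that e.g. 'Foo_Bar' matches 'foo-bar'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requires_dist(requires_dist):
    """Extract normalized dependency names from a `requires_dist` list."""
    dependencies = set()
    for requirement in requires_dist or []:  # the field can be null
        spec, _, marker = requirement.partition(";")
        if EXTRA_RE.search(marker):
            continue  # assumption: optional extras are not direct dependencies
        match = NAME_RE.match(spec)
        if match:
            dependencies.add(normalize(match.group(1)))
    return dependencies


def fetch_direct_dependencies(package):
    """Fetch the direct dependencies of the latest release from PyPI."""
    with urlopen(f"https://pypi.org/pypi/{package}/json", timeout=30) as response:
        info = json.load(response)["info"]
    return parse_requires_dist(info.get("requires_dist"))


def internal_dependencies(package, ecosystem):
    """Return the direct dependencies of `package` that are in `ecosystem`."""
    return fetch_direct_dependencies(package) & {normalize(n) for n in ecosystem}
```

Like pipgrip, this approach does not need the library to be installed, but it only sees direct dependencies of the latest release; transitive dependencies would require repeating the lookup per package.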
