Background #
Naming is important (as well as hard). Good names are good. Bad names are… bad. And we name a lot of things. In Python (a name), code is bundled and uploaded as a package (another name), usually to the Python Package Index (PyPI, a third name). Inside these packages is code. Most commonly it’s Python code, but they truly can contain anything (like whole books). The relationship between the package name and the module name(s) is the same as the relationship between a cat’s name and its demeanor. Usually “Fluffy” is a fluffy cat, but that’s only a convention. “Cupcake” could be the name of a very, very naughty kitten.
You install `requests` and import `requests`. You install `typing-extensions` and import `typing_extensions`. You install `python-dateutil` and you import `dateutil` (I guess it isn’t `python_dateutil` because that would imply you could import `go_dateutil`, and that’s a can of worms the author(s) didn’t want to open). But, again, this is merely a convention.
So let’s say you wanted to map module names to package names. How well would you do using only conventions? How unique are module names? What are the conventions, and who isn’t following them?
Well, that’s exactly what I needed to find out for my hobby project, so hop in! Come along!
The plan #
Step 1: We need data #
Hugo van Kemenade has this list of the top 8000 most downloaded packages on PyPI, updated monthly. That was easy (Thanks Hugo!).
PyPI has a nice simple API (quite literally) for getting links to downloadables for a package.
`pip` wants to extract `METADATA` out of wheels (which are just zips) without downloading the entire thing, so it has some clever code for doing “range requests” to only fetch a fraction of the bytes. (For tarballs, we’re out of luck.)
Swirl all that in a big pot, and voila! You can quickly scrape PyPI to get each package’s filenames.
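As a rough sketch of the "get links to downloadables" step, here is how one might ask PyPI's Simple API for its JSON form and pick a file to inspect. The function names are made up for illustration, and "take the last entry" is a simplification of this sketch rather than a description of the actual scraper.

```python
import json
import urllib.request

# Media type for the JSON flavour of the Simple API.
SIMPLE_JSON = "application/vnd.pypi.simple.v1+json"


def fetch_project_files(project):
    """Fetch the file listing for one project from PyPI's Simple API."""
    request = urllib.request.Request(
        f"https://pypi.org/simple/{project}/",
        headers={"Accept": SIMPLE_JSON},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["files"]


def pick_download(files):
    """Prefer a wheel (a zip we can peek into); fall back to an sdist."""
    wheels = [f for f in files if f["filename"].endswith(".whl")]
    sdists = [f for f in files if f["filename"].endswith((".tar.gz", ".zip"))]
    candidates = wheels or sdists
    # Assumption for this sketch: the listing order is good enough that the
    # last entry is a reasonable pick. A real scraper would compare versions.
    return candidates[-1] if candidates else None
```

Calling `pick_download(fetch_project_files("requests"))` would return a dict with (among other keys) the `filename` and `url` of one downloadable.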
Step 2: That was too easy, let’s add some complexity #
Since getting data was kinda easy, the universe has evened things out by making analyzing that data (in a useful way) kinda hard. That’s for two reasons:
- Source distributions (sdists, as opposed to binary ones, bdists) go through a build process. That means there is only a loose relationship between the files inside them and the files that would be inside a built distribution (part of that build process could be moving or creating files). There are 658 sdist-only packages on the list.
- Namespace packages. Namespaces might’ve been “one honking great idea” but namespace packages are usually misunderstood, and a honking painful thing to have to remember.
The solution to 1. is easy: just build (most of) them myself (how I did this is worthy of a blog post to come).
The solution to 2. is annoyingly complex. Namespace packages come in two forms:
- Implicit namespace packages. These are the reason you can `mkdir foo`, then `import foo` even though there’s no `__init__.py` in it. Any directory can be imported without an `__init__.py` and is treated as an implicit namespace package. That’s a daily annoyance for me, but in this case it’s actually easier to handle.
- Explicit namespace packages. These have an `__init__.py` with one or two magic incantations that basically say “I’m a namespace”. And they can’t/shouldn’t have much more.
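The implicit case is easy to see for yourself. This small demo (the directory name `foo_demo_ns` is arbitrary) creates an empty directory in a temporary location, puts that location on `sys.path`, and imports the directory as if it were a package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # An empty directory: no __init__.py anywhere in sight.
    (Path(root) / "foo_demo_ns").mkdir()
    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("foo_demo_ns")
        # Packages have a __path__; implicit namespace packages have no file.
        is_package = hasattr(module, "__path__")
        init_file = getattr(module, "__file__", None)
    finally:
        sys.path.remove(root)

print(is_package, init_file)
```

The import succeeds, the module has a `__path__` like any package, and there is no `__file__` behind it.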
Because of 2., if I were to try and find what common “prefixes” a package has by simply looking at filenames, `opencensus`, `opencensus-context`, and `opencensus-ext-azure` would all claim `opencensus`.
So, for any `__init__.py` whose path shows up in more than one package, we need to see if it contains one of the magic incantations.
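A minimal sketch of that check, assuming the incantations in question are the two commonly documented ones: the `pkg_resources` style (`declare_namespace(__name__)`) and the `pkgutil` style (`extend_path(__path__, __name__)`). The helper name is hypothetical, and a plain text search like this would also match a commented-out line.

```python
import re

# Assumption: an explicit namespace __init__.py calls one of these two helpers.
NAMESPACE_MARKERS = re.compile(r"\b(declare_namespace|extend_path)\s*\(")


def looks_like_explicit_namespace(init_source):
    """Return True if the __init__.py source contains a namespace incantation."""
    return bool(NAMESPACE_MARKERS.search(init_source))
```

For example, `looks_like_explicit_namespace("__path__ = __import__('pkgutil').extend_path(__path__, __name__)")` is true, while an ordinary `__init__.py` full of imports is not.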
Step 3: Let’s have fun with data #
(Link to online `datasette`, which all of the following links will use. I can’t guarantee, however, that the schema won’t change.)
So, of the 7,893 packages scraped:
- 7,337 wheels were found on PyPI
- I built another 556
- But 217 packages didn’t have an importable file
  - Most of these are `types-` or `-stubs` packages
  - Some were “meta” packages that just contained requirements on other packages

Which leaves us with 7,676 packages to analyze.
Fun with the `filepaths` table #
This table is a simple “what Python files of valid importable names are in the zip?”
- Sorting by filecount, descending, shows us:
  - `ansible` tops the chart with 13,650 files,
  - followed by `plotly` at 13,443 files,
  - and `oci` with 12,778 files
- There are 802 packages with only a single file
- There are 210 packages which include a top-level `test` or `tests` directory
  - (This can get annoying if your Python setup finds these before finding your tests directory, as your tests won’t be importable)
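For a sense of what “Python files of valid importable names” could mean in practice, here is a sketch that lists the `.py` files in a wheel whose every path component is a legal identifier. The function name is invented, and the exact filter used to build the table may differ.

```python
import io
import keyword
import zipfile
from pathlib import PurePosixPath


def importable_python_files(wheel_bytes):
    """List .py files in a wheel whose path could be written as an import."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as archive:
        names = archive.namelist()
    keep = []
    for name in names:
        path = PurePosixPath(name)
        if path.suffix != ".py":
            continue
        # Every component (directories and the module itself) must be a
        # valid, non-keyword identifier, so "demo-1.0.data/..." is skipped.
        parts = path.with_suffix("").parts
        if all(p.isidentifier() and not keyword.iskeyword(p) for p in parts):
            keep.append(name)
    return keep
```

Anything under a `.dist-info` or `.data` directory drops out naturally, since those directory names contain characters that can’t appear in an identifier.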
Fun with the `namespace_packages` table #
These filepaths are `__init__.py` filepaths found in more than one package.
Out of 8,829 candidate package/filepath combinations:
- There are 3,408 distinct filepaths
- Only 180 of the 8,829 are explicit namespace packages, representing only 91 distinct filepaths
- 63 filepaths are marked as a namespace package in one package, but not in another
  - Virtually all of these are from packages which have undergone some kind of migration, and therefore the colliding packages shouldn’t be installed at the same time anyway.
- 3,320 filepaths appeared in multiple packages and weren’t marked as namespace packages
  - From scrolling the data it appears these are largely from packages which are alternates (or forks) of other packages.
- The deepest namespace packages are `aws_cdk/aws_cloudfront/experimental` and `azureml/train/automl` at 3 directories deep
Fun with the `prefixes` table #
These prefixes are calculated as follows:
- Take the filepaths for a package
- Remove any `__init__.py` files that are namespace packages
- Calculate the lowest common ancestor among the Python files
The intent is to try and find unambiguous prefixes for each package.
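One way to read those steps, as a simplified sketch: group a package’s modules by their top-level name and take the lowest common ancestor within each group. The function and parameter names are hypothetical, and this version is deliberately coarser than whatever produced the table (it never splits a directory tree that lacks `__init__.py` files into many prefixes).

```python
import posixpath
from collections import defaultdict


def candidate_prefixes(filepaths, namespace_inits=frozenset()):
    """Lowest common ancestor of the Python files under each top-level name."""
    groups = defaultdict(list)
    for path in filepaths:
        # Namespace __init__.py files are shared, so they identify nothing.
        if path in namespace_inits or not path.endswith(".py"):
            continue
        module = path[: -len(".py")]
        # A regular __init__.py stands for the directory that contains it.
        if posixpath.basename(module) == "__init__":
            module = posixpath.dirname(module)
        if module:
            groups[module.split("/")[0]].append(module)
    return sorted(
        posixpath.commonpath(modules).replace("/", ".")
        for modules in groups.values()
    )
```

With the namespace `__init__.py` files removed, a distribution that only ships `opencensus/common/runtime_context/` ends up with `opencensus.common.runtime_context` rather than laying claim to all of `opencensus`.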
Of 16,681 total package/prefix combos (with 16,177 distinct prefixes):
- 7,147 packages have one prefix
- On the other end, sorting packages by prefix count reveals `ansible` tops the charts with 5,915 prefixes
  - This is because there are a lot of nested directories with only one Python file in them and no `__init__.py`, making that a “unique” prefix.
- (In fact most of the high-prefixers are due to a lack of `__init__.py`)
- So if we filter out prefixes of multiple depths, `timedelta` now tops the charts with 130 unique prefixes
  - It (and its friends near the top of the chart) all appear to be a snafu: they didn’t intend to include several dozen directories in the wheel.
- 360 prefixes are shared by more than one package
  - However, just like shared filepaths, it appears these are largely from packages which are alternates (or forks) of other packages.
  - Some are legitimate though, like `haystack` being a prefix of both `django-haystack` and `haystack-ai`
Step 4: Funtime is over, let’s find conventions #
By far, the most common convention is (unsurprisingly) normalizing the module name:
- 5,551 prefixes map to their package name after normalization
  - E.g. `requests` -> `requests` or `sqlalchemy_views` -> `sqlalchemy-views` or `ShazamAPI` -> `shazamapi`
  - Of those, 5,303 prefixes solely identify their package
    - E.g. `pytest` wouldn’t count because `_pytest` and `py` are also prefixes.
    - That’s almost 70% of packages!
- There are a few common prefixes/suffixes, too (numbers are packages with 1 prefix):
- Then for multi-level prefixes:
  - 441 more are found by replacing the `.` with `-`
    - E.g. `flufl.lock` -> `flufl-lock`
- Then 52 more with an `apache-` prefix
So sticking solely to normalization, and applying some common prefixes/suffixes, you get…
81% of packages have a single prefix, which when normalized directly correlates to the package name.
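The conventions above boil down to very little code. This sketch uses the standard PyPI name normalization (lowercase, with runs of `-`, `_` and `.` collapsed to a single `-`), which also covers the `.` to `-` case for multi-level prefixes. The `follows_convention` helper and its `package_prefixes` parameter are made up for illustration, and only the `apache-` prefix mentioned above is included.

```python
import re


def normalize(name):
    """Lowercase the name and collapse runs of '-', '_' and '.' into one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def follows_convention(prefix, package, package_prefixes=("apache-",)):
    """Does this import prefix map to the package name by convention alone?"""
    wanted = normalize(package)
    guess = normalize(prefix)
    if guess == wanted:
        return True
    # Try the known package-name prefixes, e.g. "apache-" + "airflow".
    return any(extra + guess == wanted for extra in package_prefixes)
```

So `follows_convention("flufl.lock", "flufl-lock")` holds, while `follows_convention("dateutil", "python-dateutil")` does not unless `"python-"` is added to `package_prefixes`.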
Conclusion, and next steps #
So here we are. Sitting on a pile of data, and a concrete understanding of package name -> module name conventions (which I’m sure most of y’all reading already had on your BINGO card). But we’ve turned hunch into proof, and more importantly we can also compile a little mapping of the top packages’ prefixes that don’t fit the mold.
If you’re already publishing packages to PyPI, or planning to do so, be a peach:
- Stick to a convention for your module names
- Upload wheels
- Avoid implicit or explicit namespace packages if you can help it
- Otherwise, if you have to choose… well, you know the saying ;)
I’ll probably run this collection periodically, and maybe even evolve it some. However, now I can get back to my hobby project (as well as this hobby project’s hobby project: building wheels where missing).