Background #

Naming is important (as well as hard). Good names are good. Bad names are… bad. And we name a lot of things. In Python (a name), code is bundled and uploaded as a package (another name), usually to the Python Package Index (PyPI, a third name). Inside these packages is code: most commonly Python code, but they truly can contain anything (like whole books). The only relationship between the package name and the module name(s) is the same as the relationship between a cat’s name and its demeanor. Usually “Fluffy” is a fluffy cat, but that’s only a convention. “Cupcake” could be the name of a very, very naughty kitten.

You install requests and import requests. You install typing-extensions and import typing_extensions. You install python-dateutil and import dateutil (I guess it isn’t python_dateutil because that would imply you could import go_dateutil, and that’s a can of worms the author(s) didn’t want to open). But, again, this is merely a convention.

So let’s say you wanted to map module names to package names. How well would you do relying only on conventions? How unique are module names? What are the conventions, and who isn’t following them?

Well, that’s exactly what I needed to find out for my hobby project, so hop in! Come along!

The plan #

Step 1: We need data #

Hugo van Kemenade has this list of the top 8000 most downloaded packages on PyPI, updated monthly. That was easy (Thanks Hugo!).

PyPI has a nice simple API (quite literally) for getting links to downloadables for a package.
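As a rough sketch of what that lookup can look like (not necessarily the code used for this scrape), the Simple API can return JSON when asked with the right Accept header, and each entry under "files" carries a filename and a url. The function names fetch_index and download_links are made up for illustration, and putting wheels first is my own assumption about what is convenient here.

```python
import json
import urllib.request

SIMPLE_URL = "https://pypi.org/simple/{project}/"
JSON_ACCEPT = "application/vnd.pypi.simple.v1+json"


def fetch_index(project):
    """Fetch the JSON form of a project's Simple API page (needs network)."""
    request = urllib.request.Request(
        SIMPLE_URL.format(project=project),
        headers={"Accept": JSON_ACCEPT},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def download_links(index):
    """Return (filename, url) pairs from a parsed index page, wheels first."""
    files = [(f["filename"], f["url"]) for f in index.get("files", [])]
    # sorted() is stable, so files keep their listed order within each group.
    return sorted(files, key=lambda item: not item[0].endswith(".whl"))
```

Something like download_links(fetch_index("requests")) would then give the candidate downloads for one package, with any wheels listed ahead of the sdist.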

pip wants to extract METADATA out of wheels (which are just zips) without downloading the entire thing, so it has some clever code for doing “range requests” to only fetch a fraction of the bytes. (For tarballs, we’re out of luck).
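To make the trick concrete, here is a minimal sketch of the idea. It is not pip's actual code. A zip keeps its table of contents at the end of the file, so a seekable file object that fetches bytes on demand lets zipfile list the names without reading the payload. The fetch_range(start, end) callable is a hypothetical stand-in for an HTTP request with a "Range: bytes=start-end" header, and the total size is assumed to be known already (for example from a Content-Length header).

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Seekable, read-only file object that pulls bytes on demand."""

    def __init__(self, size, fetch_range):
        self._size = size
        self._fetch_range = fetch_range  # hypothetical: returns bytes start..end inclusive
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("unsupported whence: %r" % (whence,))
        self._pos = max(0, pos)
        return self._pos

    def readinto(self, buffer):
        # Never ask for bytes past the end of the remote file.
        wanted = min(len(buffer), self._size - self._pos)
        if wanted <= 0:
            return 0
        data = self._fetch_range(self._pos, self._pos + wanted - 1)
        buffer[: len(data)] = data
        self._pos += len(data)
        return len(data)


def list_zip_names(size, fetch_range):
    """List the filenames in a remote zip, fetching only what zipfile asks for."""
    with zipfile.ZipFile(RangeReader(size, fetch_range)) as archive:
        return archive.namelist()
```

A real version would want to cache or batch the small reads, since each one here turns into its own request.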

Swirl all that in a big pot, and voila! You can quickly scrape PyPI to get each package’s filenames.

Step 2: That was too easy, let’s add some complexity #

Since getting data was kinda easy, the universe has evened things out by making analyzing that data (in a useful way) kinda hard. That’s for two reasons:

  1. Source distributions (sdists, as opposed to binary ones, bdists) go through a build process. That means there is only a loose relationship between the files inside them and the files that would be inside a built distribution (part of that build process could be moving or creating files). There are 658 sdist-only packages on the list.
  2. Namespace packages. Namespaces might’ve been “one honking great idea” but namespace packages are usually misunderstood, and a honking painful thing to have to remember.

The solution to 1. is easy: just build (most of) them myself (how I did this is worthy of a blog post to come).

The solution to 2. is annoyingly complex. Namespace packages come in two forms:

  1. Implicit namespace packages. These are the reason you can mkdir foo, then import foo even though there’s no __init__.py in it. Any directory without an __init__.py can be imported and is treated as an implicit namespace package. That’s a daily annoyance for me, but in this case it’s actually easier to handle.
  2. Explicit namespace packages. These have an __init__.py with one or two magic incantations that basically say “I’m a namespace”. And they can’t/shouldn’t contain much more.
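If the implicit flavor sounds too silly to be true, here is a small self-contained demonstration. It creates an empty directory in a temporary location, imports it, and reports what Python thinks it imported. The helper name import_bare_directory is made up for this example.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_bare_directory(name):
    """Create an empty directory called `name` and import it as a module."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / name).mkdir()
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(name)
            # Namespace packages have no file of their own, only a search path.
            return getattr(module, "__file__", None), list(module.__path__)
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(name, None)
```

The import succeeds, the module has no file behind it, and its __path__ holds the one directory that was just created.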

Because of 2., if I were to try to find what common “prefixes” a package has by simply looking at filenames, opencensus, opencensus-context and opencensus-ext-azure would all claim opencensus.

So, for any __init__.py whose path shows up in more than one package, we need to see if it contains one of the magic incantations.
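For reference, the two incantations usually seen in the wild are the pkg_resources style, __import__('pkg_resources').declare_namespace(__name__), and the pkgutil style, __path__ = __import__('pkgutil').extend_path(__path__, __name__). A crude check, assuming a text search is good enough and ignoring commented-out lines, might look like the sketch below. The name looks_like_namespace_init is hypothetical, and the check used for the actual dataset may well be stricter.

```python
import re

# Matches a call to either well-known namespace declaration helper.
NAMESPACE_MARKERS = re.compile(r"\bdeclare_namespace\s*\(|\bextend_path\s*\(")


def looks_like_namespace_init(source):
    """Guess whether the text of an __init__.py declares a namespace package."""
    code_lines = [
        line
        for line in source.splitlines()
        if not line.lstrip().startswith("#")  # skip whole-line comments
    ]
    return bool(NAMESPACE_MARKERS.search("\n".join(code_lines)))
```

This is a heuristic: it would be fooled by one of those calls sitting inside a string, which is tolerable for a scrape but not for anything load-bearing.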

Step 3: Let’s have fun with data #

(Link to online datasette which all of the following links will use. I can’t guarantee, however, the schema won’t change.)

So, of the 7,893 packages scraped:

Which leaves us with 7,676 packages to analyze.

Fun with the filepaths table #

This table is a simple “what Python files of valid importable names are in the zip?”
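One plausible reading of “valid importable names”, sketched here as an assumption rather than the exact filter behind the table, is that the file ends in .py and every path component is a legal, non-keyword Python identifier. The function name is_importable_path is made up for illustration.

```python
import keyword
from pathlib import PurePosixPath


def is_importable_path(filename):
    """True if a zip member like 'pkg/sub/mod.py' could be reached by import."""
    path = PurePosixPath(filename)
    if path.suffix != ".py":
        return False
    # Every directory, plus the file name without ".py", must be a usable name.
    parts = path.parts[:-1] + (path.stem,)
    return all(
        part.isidentifier() and not keyword.iskeyword(part) for part in parts
    )
```

This conveniently drops the .dist-info and .data directories too, since their names contain dashes and dots.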

Fun with the namespace_packages table #

These filepaths are filepaths found in >1 package.

Out of 8,829 candidate package/filepath combinations:

Fun with the prefixes table #

These prefixes are calculated by:

  • Take the filepaths for a package
  • Remove any files that are namespace packages
  • Calculate the lowest common ancestor among the Python files

The intent is to try and find unambiguous prefixes for each package.
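The three steps above can be sketched roughly as follows. Two things here are my own assumptions and not necessarily what the real table does: files are grouped by their top-level name, so a package shipping two unrelated top-level modules gets two prefixes, and an __init__.py stands for its directory. All function names are illustrative.

```python
from collections import defaultdict


def module_parts(filepath):
    """'pkg/sub/__init__.py' -> ('pkg', 'sub'); 'pkg/mod.py' -> ('pkg', 'mod')."""
    parts = filepath[: -len(".py")].split("/")
    if parts[-1] == "__init__":
        parts = parts[:-1]
    return tuple(parts)


def common_prefix(sequences):
    """Longest leading run of components shared by every sequence."""
    prefix = []
    for column in zip(*sequences):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return tuple(prefix)


def package_prefixes(filepaths, namespace_files=frozenset()):
    """Dotted prefixes for one package, ignoring known namespace files."""
    groups = defaultdict(list)
    for filepath in filepaths:
        if filepath in namespace_files:
            continue
        parts = module_parts(filepath)
        if parts:
            groups[parts[0]].append(parts)
    return sorted(".".join(common_prefix(group)) for group in groups.values())
```

With the two namespace __init__.py files of opencensus-ext-azure removed, the remaining files under opencensus/ext/azure/ collapse to the single prefix opencensus.ext.azure instead of the ambiguous opencensus.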

Of 16,681 total package/prefix combos (with 16,177 distinct prefixes):

Step 4: Funtime is over, let’s find conventions #

By far, the most common convention is (unsurprisingly) normalizing the module name:

So sticking solely to normalization, and applying some common prefixes/suffixes, you get…

81% of packages have a single prefix, which when normalized directly correlates to the package name.
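As a minimal sketch of what “normalized” can mean here, the snippet below applies the usual PyPI name normalization (lowercase, with runs of dashes, underscores and dots collapsed to a single dash) to both the package name and the prefix before comparing them. The only affix it tries is python-, taken from the python-dateutil example earlier; that list is illustrative and is not the full set used for the numbers above.

```python
import re

# Illustrative only: the one affix this post has already mentioned.
ILLUSTRATIVE_AFFIXES = ("python-",)


def normalize(name):
    """Lowercase and collapse runs of '-', '_' and '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def follows_convention(package, prefix):
    """True if the prefix matches the package name after normalization."""
    package, prefix = normalize(package), normalize(prefix)
    if package == prefix:
        return True
    return any(
        package.startswith(affix) and package[len(affix):] == prefix
        for affix in ILLUSTRATIVE_AFFIXES
    )
```

Under this check typing-extensions matches typing_extensions and python-dateutil matches dateutil, while a pair like PyYAML and yaml lands in the “doesn’t fit the mold” pile.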

Conclusion, and next steps #

So here we are. Sitting on a pile of data, and a concrete understanding of package name -> module name conventions (which I’m sure most of y’all reading already had on your BINGO card). But we’ve turned hunch into proof, and more importantly we can also compile a little mapping of the top packages’ prefixes that don’t fit the mold.

If you’re already publishing packages to PyPI, or planning to do so, be a peach:

  • Stick to a convention for your module names
  • Upload wheels
  • Avoid implicit or explicit namespace packages if you can help it
    • Otherwise, if you have to choose… well, you know the saying ;)

I’ll probably run this collection periodically, and maybe even evolve it some. However, now I can get back to my hobby project (as well as this hobby project’s hobby project: building wheels where missing).