If you want to discover
how would you do it?
That's the problem we set out to solve when we created a large simulation of the StackAid platform using real-world projects. Let's have a look at the problems and the ways around them.
To start, most of the package.json files we’re looking for will be on GitHub, so let’s start with GitHub Search. We could search for package.json files and iterate over the results. Let’s try it:
Wow, 460M results. That number seems high, but that doesn't matter for our purposes.
Unfortunately, GitHub Search also returns package-lock.json files and package.json files found in node_modules directories. Also, as shown in the screenshots, the search results aren't stable, and sometimes we don't get a full page of results. Worse, GitHub Advanced Search doesn't expose the query operators that would allow us to narrow our search results. For more details take a look at GitHub's search code
GitHub is not the only company that indexes GitHub repositories. Sourcegraph offers a powerful search engine for source code, along with a free service for searching public repositories on GitHub.
Let’s give that a shot. Sourcegraph supports many search operators. In our query, we’re looking for package.json files but exclude any found in
Here's the Sourcegraph query:
The query does a number of things to help narrow down the results:
- Excludes package.json files found in dot/hidden directories.
- Excludes files found in directories such as
- Excludes archived and forked repositories
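The path-based exclusions in the list above can also be expressed as a local filter, which is handy for double-checking downloaded results. This is only a rough sketch in Python, not the Sourcegraph query itself: the function name `is_candidate` and the `EXCLUDED_DIRS` set are hypothetical, and node_modules is the only excluded directory we assume here. Archived and forked repositories are a repository-level property and are not covered.

```python
from pathlib import PurePosixPath

# Assumption: node_modules is the only directory name excluded in this sketch;
# extend the set to match whatever directories your own query excludes.
EXCLUDED_DIRS = {"node_modules"}


def is_candidate(path: str) -> bool:
    """Return True for a package.json outside hidden and excluded directories."""
    parts = PurePosixPath(path).parts
    # Only exact package.json files qualify (this rejects package-lock.json).
    if not parts or parts[-1] != "package.json":
        return False
    for directory in parts[:-1]:
        # Skip dot/hidden directories and explicitly excluded ones.
        if directory.startswith(".") or directory in EXCLUDED_DIRS:
            return False
    return True


if __name__ == "__main__":
    for sample in ("package.json", "node_modules/a/package.json", ".cache/package.json"):
        print(sample, is_candidate(sample))
```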
Here are the results of the query on Sourcegraph.com:
Several things to note about the results. The search results are stable and sorted by popularity of the repositories. When specifying count:all, Sourcegraph does an exhaustive search and returns the total match count, 1.3M results, from their search index. The exhaustive search on Sourcegraph takes approximately 25 seconds. Executing and downloading the results from Sourcegraph using their CLI takes only 36 seconds for me.
Before moving on, there's one other thing to note about the Sourcegraph results. On the right side of the results page is a module that allows grouping results. The default is a grouping by repository. For us, it was obvious that there are a number of repositories with many package.json files, which is something to consider when sampling from the result set.
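To make the sampling concern concrete, here is a small Python sketch that counts package.json files per repository and caps how many each repository contributes. It assumes the results have already been loaded as `(repository, path)` pairs. The function names and the cap are illustrative, not part of our pipeline.

```python
from collections import Counter


def files_per_repo(results):
    """Count how many package.json files each repository contributes."""
    return Counter(repo for repo, _path in results)


def cap_per_repo(results, limit):
    """Keep at most `limit` files per repository, preserving result order."""
    kept = []
    seen = Counter()
    for repo, path in results:
        if seen[repo] < limit:
            kept.append((repo, path))
            seen[repo] += 1
    return kept


if __name__ == "__main__":
    results = [
        ("a/monorepo", "package.json"),
        ("a/monorepo", "packages/one/package.json"),
        ("a/monorepo", "packages/two/package.json"),
        ("b/app", "package.json"),
    ]
    print(files_per_repo(results).most_common(1))
    print(cap_per_repo(results, limit=2))
```

Capping is only one option. Weighting by repository or sampling one file per repository would also reduce the influence of large monorepos.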
Finding non-NPM repositories
Every result from Sourcegraph includes its containing repository. Thankfully, we also have an exported list of GitHub repositories for all NPM packages. Our task is to whittle the Sourcegraph results down to the repositories that are not in the exported list of NPM package repositories.
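Conceptually, this filter is a set difference on repository names. The Python sketch below shows the idea under some assumptions: results are `(repository, path)` pairs, and repository names may appear in slightly different forms (a `github.com/` prefix, a URL scheme, a `.git` suffix, mixed case), so they are normalized before comparing. The helper names are hypothetical.

```python
def normalize(repo: str) -> str:
    """Reduce a repository reference to a lowercase 'owner/name' form."""
    repo = repo.strip().lower()
    for scheme in ("https://", "http://"):
        if repo.startswith(scheme):
            repo = repo[len(scheme):]
    if repo.startswith("github.com/"):
        repo = repo[len("github.com/"):]
    repo = repo.rstrip("/")
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return repo


def filter_non_npm(results, npm_repos):
    """Drop results whose repository appears in the NPM repository list."""
    npm = {normalize(repo) for repo in npm_repos}
    return [(repo, path) for repo, path in results if normalize(repo) not in npm]


if __name__ == "__main__":
    results = [
        ("github.com/Foo/Bar", "package.json"),
        ("github.com/baz/app", "package.json"),
    ]
    npm_repos = ["https://github.com/foo/bar.git"]
    print(filter_non_npm(results, npm_repos))
```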
Putting it all together
- Docker Compose
- Search: Query and save Sourcegraph results
- Filter: Remove NPM package repositories from Sourcegraph results
- Fetch: Benthos consumes search results, fetches package.json files from GitHub, and publishes them to another NSQ topic.
- Persist: Save package.json files to disk.
- Transform: Create a SQLite DB from filtered search results and fetched package.json files.
The collection pipeline is a series of shell commands executed in order. Our preferred tool is Taskfile instead of bash scripts or Makefiles. The whole pipeline can be executed via Docker Compose to save you the trouble of installing a bunch of utilities.
While there are many frameworks for crawling, the majority of the work bookended the collection step, and so using anything more than Benthos and NSQ felt like overkill. Your mileage may vary for your use case.
Our project for querying Sourcegraph and collecting the package.json files from GitHub is open source and available at
We rate limit ourselves to be kind to GitHub, and so the full collection takes about 12 hours. To save you the hassle, we're making a snapshot available using Datasette on Fly.io. You can download the entire database or query the collection.
On the packages database overview page, you will see three tables.
- npm_package_repositories: A snapshot of all GitHub repositories for NPM packages
- non_npm_packages: All the collected package.json files and where they came from.
- dependencies: Dependencies and dev dependencies extracted from non_npm_packages to make querying across packages easier.
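To illustrate how a dependencies table like this can be derived, here is a minimal Python sketch that flattens the standard `dependencies` and `devDependencies` fields of a package.json into rows. The row layout, the `"prod"`/`"dev"` labels, and the function name are assumptions made for the example. The actual schema in the database may differ.

```python
import json


def dependency_rows(package_id, package_json_text):
    """Flatten one package.json into (package_id, name, constraint, kind) rows."""
    try:
        manifest = json.loads(package_json_text)
    except json.JSONDecodeError:
        # Real-world files are sometimes malformed; skip them in this sketch.
        return []
    if not isinstance(manifest, dict):
        return []
    rows = []
    # "prod" and "dev" are hypothetical labels chosen for this example.
    for field, kind in (("dependencies", "prod"), ("devDependencies", "dev")):
        deps = manifest.get(field)
        if not isinstance(deps, dict):
            continue
        for name, constraint in deps.items():
            rows.append((package_id, name, str(constraint), kind))
    return rows


if __name__ == "__main__":
    text = '{"dependencies": {"react": "^18.2.0"}, "devDependencies": {"jest": "^29.0.0"}}'
    for row in dependency_rows(1, text):
        print(row)
```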
At the bottom of the overview page is a link to download the entire 3.3GB database.
Here's a preview of the
So what can you do with this data?
- Count of projects with and without tests
- Percentage of projects using Grunt
- How often is NextJS used with Axios
- Most popular dependency and semver constraint combinations
- Most popular co-dependencies for React
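As a rough illustration of how a question like "percentage of projects using Grunt" could be answered, here is a Python sketch using the standard sqlite3 module. The table names match the list of tables above, but every column name (`id`, `repository`, `package_id`, `name`, `version`) is an assumption made for this example, and the data is a tiny in-memory stand-in. Check the real schema before adapting the query.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE non_npm_packages (id INTEGER PRIMARY KEY, repository TEXT);
        CREATE TABLE dependencies (package_id INTEGER, name TEXT, version TEXT);
        INSERT INTO non_npm_packages VALUES
            (1, 'a/one'), (2, 'b/two'), (3, 'c/three'), (4, 'd/four');
        INSERT INTO dependencies VALUES
            (1, 'grunt', '^1.0.0'), (1, 'react', '^18.0.0'),
            (2, 'react', '^17.0.0'), (3, 'grunt', '~0.4.5');
        """
    )
    return conn


def percent_using(conn, dependency):
    """Percentage of packages that list `dependency` at least once."""
    total = conn.execute("SELECT COUNT(*) FROM non_npm_packages").fetchone()[0]
    if total == 0:
        return 0.0
    using = conn.execute(
        "SELECT COUNT(DISTINCT package_id) FROM dependencies WHERE name = ?",
        (dependency,),
    ).fetchone()[0]
    return 100.0 * using / total


if __name__ == "__main__":
    conn = demo_connection()
    print(percent_using(conn, "grunt"))
```

The same query can be pasted into Datasette's SQL editor once the column names are adjusted to the real tables.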
We hope the data is useful to you. If you dig into the data, let us know what you find! Big thanks to Sourcegraph and all the other utilities that we used to create our collection pipeline. As always, we fund all projects used here with StackAid.