Using Sourcegraph to discover non-NPM JS projects

If you want to discover package.json files for JavaScript projects that are not NPM libraries, how would you do it?

That's the problem we set out to solve when we created a large simulation of the StackAid platform using real-world projects. Let's have a look at the problems and the ways around them.

Most of the package.json files we’re looking for will be on GitHub, so let’s start with GitHub Search. We could search for package.json files and iterate over the results. Let’s try it:

Wow, 460M results. That number seems high, but it doesn't matter for our purposes. Unfortunately, GitHub Search also returns package-lock.json files and package.json files found in node_modules directories. As shown in the screenshots, the search results aren't stable either, and sometimes we don't get a full page of results. On top of that, GitHub Advanced Search doesn't expose the query operators that would let us narrow the results. For more details, take a look at GitHub's code search documentation.

What now?

GitHub is not the only company that indexes GitHub repositories. Sourcegraph offers a powerful search engine for source code, along with a free service for searching public repositories on GitHub.

Let’s give that a shot. Sourcegraph supports many search operators. In our query, we’re looking for package.json files while excluding any found in node_modules directories.

Here's the Sourcegraph query:

The query does a number of things to help narrow down the results:

  • Excludes package.json files found in dot/hidden directories.
  • Excludes files found in directories such as node_modules, tests, and examples.
  • Excludes archived and forked repositories
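The path rules in that list can also be restated as a small local predicate, which is handy for double-checking downloaded results. This is a minimal Python sketch and not the Sourcegraph query itself: the `is_candidate` name is hypothetical, and the exclusion list is an assumption that contains only the three directories named above (the real query may exclude more).

```python
from pathlib import PurePosixPath

# Assumed exclusion list: only the three directories named in the post.
EXCLUDED_DIRS = {"node_modules", "tests", "examples"}


def is_candidate(path: str) -> bool:
    """Return True for a package.json outside hidden and excluded directories."""
    parts = PurePosixPath(path).parts
    # The file itself must be named exactly package.json
    # (this rejects package-lock.json, for example).
    if not parts or parts[-1] != "package.json":
        return False
    # Reject the path if any parent directory is hidden or excluded.
    parents = parts[:-1]
    return not any(p.startswith(".") or p in EXCLUDED_DIRS for p in parents)
```

Archived and forked repositories can't be detected from a path alone, so that part of the filter has to stay in the search query.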

Here are the results of the query on Sourcegraph.com:

Several things to note about the results. The search results are stable and sorted by popularity of the repositories. When specifying count:all, Sourcegraph does an exhaustive search and returns the total match count, 1.3M results, from their search index. The exhaustive search on Sourcegraph takes approximately 25 seconds. Executing and downloading the results from Sourcegraph using their CLI takes only 36 seconds for me.

Before moving on, there's one other thing to note about the Sourcegraph results. On the right side of the results page is a module that allows grouping results. The default is a grouping by repository. For us, it made obvious that a number of repositories contain many package.json files, which is something to consider when sampling from the result set.
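One way to keep such repositories from dominating a sample is to cap how many files each repository contributes. Here's a minimal Python sketch, assuming the results are available as (repository, path) pairs; the `sample_per_repo` name and its parameters are hypothetical and not part of our pipeline.

```python
import random
from collections import defaultdict


def sample_per_repo(results, max_per_repo=1, seed=0):
    """Keep at most max_per_repo randomly chosen paths for each repository.

    results is an iterable of (repo, path) pairs.
    """
    by_repo = defaultdict(list)
    for repo, path in results:
        by_repo[repo].append(path)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = []
    for repo in sorted(by_repo):
        paths = sorted(by_repo[repo])
        k = min(max_per_repo, len(paths))
        for path in rng.sample(paths, k):
            sampled.append((repo, path))
    return sampled
```

With `max_per_repo=1`, a monorepo with hundreds of package.json files counts the same as a single-package project.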

Finding non-NPM repositories

Every result from Sourcegraph includes its containing repository. Thankfully, we also have an exported list of GitHub repositories for all NPM packages. Our task is to whittle the Sourcegraph results down to the repositories that are not in that list of NPM package repositories.
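At its core this is a set difference, with one wrinkle: the same repository can be written in several ways, so both sides need to be normalized before comparing. The Python sketch below assumes search results name repositories like `github.com/owner/name` and that the NPM export contains repository URLs such as `git+https://github.com/owner/name.git`. The function names are hypothetical, and the real pipeline may handle this differently.

```python
import re

# Matches github.com/owner/name and github.com:owner/name (SSH style).
_GITHUB_RE = re.compile(r"github\.com[/:]([^/\s]+)/([^/\s#]+)", re.IGNORECASE)


def normalize_repo(value: str):
    """Reduce a GitHub repository reference to lowercase 'owner/name', or None."""
    match = _GITHUB_RE.search(value)
    if not match:
        return None
    owner, name = match.group(1), match.group(2)
    if name.endswith(".git"):
        name = name[:-4]
    return f"{owner}/{name}".lower()


def non_npm_repos(search_repos, npm_repos):
    """Return search result repositories that are absent from the NPM list."""
    npm = {normalize_repo(r) for r in npm_repos}
    npm.discard(None)  # ignore entries that are not GitHub repositories

    kept, seen = [], set()
    for repo in search_repos:
        key = normalize_repo(repo)
        if key is None or key in npm or key in seen:
            continue
        seen.add(key)
        kept.append(key)  # preserves the order of the search results
    return kept
```

Lowercasing is safe here because GitHub treats owner and repository names as case-insensitive.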

Putting it all together

Our tools:

The steps:

  1. Search: Query and save Sourcegraph results.
  2. Filter: Remove NPM package repositories from the Sourcegraph results.
  3. Fetch: Benthos consumes search results, fetches package.json files from GitHub, and publishes them to another NSQ topic.
  4. Persist: Save package.json files to disk.
  5. Transform: Create a SQLite DB from filtered search results and fetched package.json files.

The collection pipeline is a series of shell commands executed in order. Our preferred tool is Taskfile instead of bash scripts or Makefiles. The whole pipeline can be executed via Docker Compose to save you the trouble of installing a bunch of utilities.

While there are many frameworks for crawling, the majority of the work bookended the collection step, and so using anything more than Benthos and NSQ felt like overkill. Your mileage may vary for your use case.

Our project for querying Sourcegraph and collecting the package.json files from GitHub is open source and available at github.com/stackaid/non-npm-projects.

The results

We rate limit ourselves to be kind to GitHub, and so the full collection takes about 12 hours. To save you the hassle, we're making a snapshot available using Datasette on Fly.io. You can download the entire database or query the collection.
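To illustrate the idea of self-imposed rate limiting, here's a generic Python sketch that spaces calls evenly. It is not the configuration our pipeline uses (the fetching there is done by Benthos); the `RateLimiter` class and the rate shown in the usage note are hypothetical.

```python
import time


class RateLimiter:
    """Allow at most per_second calls per second by spacing them evenly."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / per_second
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling `limiter.wait()` before each request with, say, `RateLimiter(per_second=5)` keeps a fetch loop at or below five requests per second.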

On the packages database overview page, you will see 3 tables.

  • npm_package_repositories: A snapshot of all GitHub repositories for NPM packages
  • non_npm_packages: All the collected package.json files and where they came from.
  • dependencies: Dependencies and dev dependencies extracted from non_npm_packages to make querying across packages easier.
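To show how a dependencies table like this can be derived, here's a minimal Python sketch using the standard `sqlite3` and `json` modules. The column names and the `build_db` helper are assumptions made for illustration; the published database may use a different schema.

```python
import json
import sqlite3


def build_db(packages):
    """Build an in-memory database from (repo, path, package_json_text) tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: column names are illustrative only.
    conn.execute(
        "CREATE TABLE non_npm_packages "
        "(id INTEGER PRIMARY KEY, repo TEXT, path TEXT, content TEXT)"
    )
    conn.execute(
        "CREATE TABLE dependencies "
        "(package_id INTEGER, name TEXT, version TEXT, dev INTEGER)"
    )
    for repo, path, text in packages:
        cur = conn.execute(
            "INSERT INTO non_npm_packages (repo, path, content) VALUES (?, ?, ?)",
            (repo, path, text),
        )
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # keep the raw file, skip dependency extraction
        if not isinstance(data, dict):
            continue
        # dev is 0 for dependencies and 1 for devDependencies.
        for field, dev in (("dependencies", 0), ("devDependencies", 1)):
            deps = data.get(field)
            if not isinstance(deps, dict):
                continue
            for name, version in deps.items():
                conn.execute(
                    "INSERT INTO dependencies VALUES (?, ?, ?, ?)",
                    (cur.lastrowid, name, str(version), dev),
                )
    conn.commit()
    return conn
```

With the dependencies flattened into rows, a question like "which packages are used most often?" becomes a single `GROUP BY name` query instead of a scan over every JSON document.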

At the bottom of the overview page is a link to download the entire 3.3GB database.

Here's a preview of the non_npm_packages table:

So what can you do with this data?

We hope the data is useful to you. If you dig into the data, let us know what you find! Big thanks to Sourcegraph and all the other utilities that we used to create our collection pipeline. As always, we fund all projects used here with StackAid.
