Package and Artifact management

What is the problem?

The main problem to be discussed here revolves around managing artifacts. In order to really discuss the problem, we need a shared understanding of what an artifact is.

What is an artifact?

For the purposes of this discussion, an artifact is any file, or collection of files, that serves a particular purpose. This is an overly vague description, so illustrative examples can help. An artifact can be:

  • A compiled program or script
  • An installer for a software application (including device firmware images)
  • A disk image (like an ISO)
  • A log file
  • A report (or any other document)
  • A crash log
  • A software package or library
  • A container image (you can think of this as a specialized example of a package)

Effective artifact storage requires more than simply saving the contents of the file(s) that comprise the artifact. It also requires details that provide the context within which the artifact has meaning. These details are collectively referred to as metadata, formally defined as “a set of data that describes or provides information about other data.”

What kinds of metadata are important?

For instance, an installer is useless unless you know several things about it such as:

  • That it is an installer, and which kind of installer it is
  • What software it installs
  • What version of that software it installs
  • File hashes or signatures so that its validity and/or integrity can be verified by the downloader
  • Which operating system is supported by this installer

Additional information about the artifact can also be helpful to the creator of that artifact when problems inevitably arise. Some examples include:

  • What machine was the installer created on?
  • Which compiler and/or external library versions were used during compilation?

Other details may also be useful to the artifact’s creator, such as:

  • How long did it take to create the installer?
  • When was the installer created?
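
To make the idea of metadata concrete, here is a minimal Python sketch of a record that could be stored next to an installer. The field names and the `describe_artifact` and `verify_artifact` helpers are my own illustrative assumptions, not part of any particular tool.

```python
import hashlib
import platform
from datetime import datetime, timezone
from pathlib import Path


def describe_artifact(path, kind, product, version, target_os):
    """Build a metadata record for the file at `path` (hypothetical schema)."""
    data = Path(path).read_bytes()
    return {
        "file_name": Path(path).name,
        "kind": kind,                  # e.g. "installer"
        "product": product,            # what software it installs
        "version": version,            # which version of that software
        "target_os": target_os,        # which operating system it supports
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),  # lets a downloader check integrity
        "built_on": platform.node(),   # machine that created the record
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def verify_artifact(path, metadata):
    """Return True if the file still matches the recorded SHA-256 hash."""
    actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return actual == metadata["sha256"]
```

A downloader that receives both the file and its record can call `verify_artifact` before trusting the file. A hash alone only detects corruption or tampering of the file relative to the record, so a real system would also need to protect the record itself, for example with a signature.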

What is a package?

A package is an artifact that conforms to a target specification, and is often paired with an application—called a package manager—that processes packages conforming to those specifications.

In addition to defining a specification that an artifact must meet to ensure it can be understood, package managers are also frequently paired with web services that host repositories of indexed packages, giving clients an easy way to obtain them. There is a plethora of package managers and associated repositories in use today. Some are summarized in the table below.

| Package manager | Web repository | Language or OS |
|---|---|---|
| nuget | https://www.nuget.org | C#, .NET, .NET Core |
| pip | https://pypi.org | Python |
| conda | https://anaconda.org/anaconda/repo | Python (mostly) |
| dpkg/apt | Various Ubuntu/Debian sources | Debian Linux and variants (including Ubuntu) |
| yum/dnf | Various RHEL, CentOS, Fedora sources | Red Hat, CentOS, Fedora, and variants |
| gem | https://rubygems.org/ | Ruby |
| npm | https://www.npmjs.com/ | JavaScript, Node.js |
| docker | https://hub.docker.com/ | Docker container images |
| swift | https://swift.org/package-manager/ | Swift/Objective-C |
| go | Many sources, including GitHub | Go |

Example package managers listed alongside the online repositories they support and the language or operating system for which they provide packages

All successful package repository systems must implement a set of features to be useful in production-grade software development:

  • Package integrity and validity assurances—a repository that cannot be trusted to deliver code it claims to deliver cannot be used for production software development
  • Capacity to handle web traffic associated with downloading (and to a much smaller extent, uploading) packages
  • Authentication and authorization for uploading packages. Unauthenticated package uploads cannot be trusted, so any package repository that doesn’t authenticate and authorize package uploads cannot be trusted.
  • Package indexing and browsing available metadata

Why create and use packages?

At their core, packages allow you to bundle together general, reusable software components like libraries. They are a highly effective way to apply the DRY principle (don’t repeat yourself).

Packages also allow you to modularize software development, including testing. A fully unit-tested package can be included in a broader software project with the confidence that it is reliable.

For some projects (or even companies), the packages are the product. Perhaps it is a physics engine included in realistic games, or the ever-popular numpy project that underpins the entirety of the Scientific Python software stack. For other companies, internal packages allow efficient software re-use throughout the company.

Wait, so what is the problem again?

The problem we are trying to solve is this: how does a company effectively store the artifacts that it generates and/or uses?

An ideal solution would provide a customizable authentication and authorization model that protects a flexible web application capable of associating arbitrary metadata with arbitrary artifacts.

Even better, such a solution would also serve as a flexible package repository. The requirements of a package repository are numerous enough to justify a section of their own.

Package Repositories

Packages are highly structured artifacts conforming to an often rigid set of specifications. This rigidity allows the development of a flexible and robust manager that can validate, process, and install (or uninstall) packages. This is an interesting paradox that pops up in many places—it is precisely the rigidity of a package specification that permits the flexibility and robustness of package managers.

Given the power of packages with respect to leveraging the DRY principle, many packages themselves utilize functionality available from other packages, and so on. Such a relationship is called a dependency, as one package depends on another. Since software is constantly evolving, a dependency usually carries more detail than simply the name of another package: it may specify a particular version or range of versions (or even variants) of the package it requires.

Packages, then, require the ability to specify these dependencies. Naturally, package managers then need to be able to install the required packages as well as all of their dependencies. Solving these dependencies is itself a challenging problem: two needed packages may depend on the same package but with different version requirements, and at times some packages are simply incompatible. While dependency resolution is well beyond the scope of this post, I highlight it here simply to illustrate the level of sophistication that goes into package managers.
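
To illustrate the version-conflict case, here is a deliberately tiny Python sketch. It assumes purely numeric dotted versions and half-open ranges, and the packages A, B, and C are hypothetical. Real package managers use far richer version schemes and resolution algorithms than this.

```python
def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, minimum, maximum):
    """True if minimum <= version < maximum (a half-open range)."""
    return parse_version(minimum) <= parse_version(version) < parse_version(maximum)


def compatible_versions(available, requirements):
    """Return the available versions that satisfy every (minimum, maximum) range."""
    return [
        version
        for version in available
        if all(satisfies(version, low, high) for low, high in requirements)
    ]


# Versions of a shared library that a repository offers (made-up values).
available = ["1.0.0", "1.4.2", "2.0.0", "2.1.0"]

needs_a = ("1.0.0", "2.0.0")  # hypothetical package A needs >=1.0.0,<2.0.0
needs_b = ("1.2.0", "3.0.0")  # hypothetical package B needs >=1.2.0,<3.0.0
needs_c = ("2.0.0", "3.0.0")  # hypothetical package C needs >=2.0.0,<3.0.0

print(compatible_versions(available, [needs_a, needs_b]))  # ['1.4.2']
print(compatible_versions(available, [needs_a, needs_c]))  # []
```

A and B can share version 1.4.2, while A and C have no version in common, which is the "simply incompatible" situation described above.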

There are many reasons a company or project may want to host package repositories on infrastructure they control. These are highlighted in the sections below.

Mirror public repositories

There are three major advantages to mirroring public package repositories:

  • Improved build times
  • DevSecOps
  • Improved repository availability

I’ll expand on each in its own section below.

Improved Build Times

While most software build times are not driven by the speed and latency of fetching packages (especially since many package managers cache downloaded packages to avoid fetching them repeatedly), there are some instances where mirroring a package repository within the same network as your build infrastructure can substantially reduce the time spent downloading packages. This is especially true for ephemeral build infrastructure, that is, build agents that are created on demand and torn down when unneeded.

DevSecOps

There are two major ways that mirroring public repositories aids in building robust security. First, access to external packages (that is, packages developed by developers outside the company) can be controlled and restricted according to company policies. This lets companies build whitelists of trusted third-party packages and corresponding versions.

This can be enforced by placing software build infrastructure behind a firewall configured with a whitelist of permitted internet locations accessible from the build agents. This allows a company to build security practices that do not depend on trusting the security practices of public, third-party repositories like Microsoft’s NuGet and the PSF’s PyPI.

The second way that public repository mirroring can improve DevSecOps is by analyzing mirrored packages and scanning for vulnerabilities or reported CVEs. Such repositories can be configured in a way that forces an organization to acknowledge and/or address known vulnerabilities before shipping software.
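
Both ideas can be sketched in a few lines of Python. Everything here is hypothetical: the package names, the versions, the advisory identifier, and the `review_package` helper are placeholders, and a real mirror would load its policy and vulnerability data from its own configuration and scanners.

```python
# Hypothetical whitelist: package name -> versions the company has approved.
APPROVED_VERSIONS = {
    "example-lib": {"1.0.0", "1.1.0"},
    "other-lib": {"2.3.1"},
}

# Hypothetical advisories keyed by (name, version); the ID is a placeholder.
KNOWN_ADVISORIES = {
    ("example-lib", "1.0.0"): ["EXAMPLE-ADVISORY-0001"],
}


def review_package(name, version):
    """Return (allowed, reasons) for a requested package version."""
    reasons = []
    if version not in APPROVED_VERSIONS.get(name, set()):
        reasons.append("not on the approved list")
    for advisory in KNOWN_ADVISORIES.get((name, version), []):
        reasons.append(f"known vulnerability: {advisory}")
    return (not reasons, reasons)


print(review_package("example-lib", "1.1.0"))
# (True, [])
print(review_package("example-lib", "1.0.0"))
# (False, ['known vulnerability: EXAMPLE-ADVISORY-0001'])
print(review_package("unknown-lib", "0.1.0"))
# (False, ['not on the approved list'])
```

Returning the reasons alongside the decision is what lets a build pipeline force someone to acknowledge or address a finding rather than silently failing.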

Improved repository availability

When software repositories experience service disruptions or degradation, it can impact a company’s work substantially. A several-hour loss in NuGet availability in the middle of a workday could eliminate an entire team’s productivity for an entire workday.

While it is reasonable to be skeptical of a privately-deployed package repository matching the reliability of dedicated infrastructure whose entire purpose is to serve these packages, private repositories will also be subject to substantially less demand and so may be more resilient. Additionally, developers can work around a private repository outage by pulling from the public repositories directly, so the mirror adds redundancy rather than a new single point of failure.
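
The fallback idea can be sketched as follows. This is a minimal illustration under my own assumptions: the two fetch functions are stand-ins that simulate an outage rather than real network calls, and `fetch_with_fallback` is a hypothetical helper, not something a package manager provides under that name.

```python
def fetch_with_fallback(package, sources):
    """Try each (name, fetch) source in order; return (source_name, payload)."""
    errors = []
    for name, fetch in sources:
        try:
            return name, fetch(package)
        except OSError as exc:  # ConnectionError and timeouts are OSError subclasses
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"could not fetch {package}: " + "; ".join(errors))


def private_mirror(package):
    """Stand-in for the private mirror, simulating an outage."""
    raise ConnectionError("mirror is down")


def public_repository(package):
    """Stand-in for the public repository, returning fake contents."""
    return f"contents of {package}".encode()


source, payload = fetch_with_fallback(
    "example-lib-1.1.0.tar.gz",
    [("private mirror", private_mirror), ("public repository", public_repository)],
)
print(source)  # public repository
```

Many package managers let you configure more than one package source, which achieves a similar effect without custom code. Note that a fallback to public sources only works where the firewall policy described earlier permits it.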

How can you solve Artifact management?

A naive approach to solving artifact (and package) management may be to use a simple cloud storage solution (like AWS S3 or Azure Blob Storage) to store files and maintain private “spaces” in public repositories to store internal packages for various languages.

While this approach may seem appealing in its simplicity and ease of launching, its complexity rapidly escalates as projects discover the importance of functionality not granted by these simplistic approaches.

Even for storage of simple files like custom firmware images, reports, logs, etc., a company will discover as it scales that it needs to manage metadata alongside artifacts and devise ways of browsing and searching for specific artifacts.
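
As a rough picture of what "searching by metadata" means, here is a small Python sketch over an in-memory catalog. The catalog entries, field names, and the `find_artifacts` helper are all hypothetical; a plain file store gives you none of this without building and maintaining such an index yourself.

```python
# Hypothetical catalog: one metadata record per stored artifact.
catalog = [
    {"name": "firmware.bin", "kind": "firmware", "product": "sensor-hub", "version": "3.2.0"},
    {"name": "firmware-old.bin", "kind": "firmware", "product": "sensor-hub", "version": "3.1.0"},
    {"name": "nightly-report.pdf", "kind": "report", "product": "sensor-hub", "version": "3.2.0"},
]


def find_artifacts(catalog, **criteria):
    """Return the records whose metadata matches every given key/value pair."""
    return [
        entry
        for entry in catalog
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


matches = find_artifacts(catalog, kind="firmware", version="3.2.0")
print([entry["name"] for entry in matches])  # ['firmware.bin']
```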

For package management, leveraging private “spaces” (or “organizations”) on public repository infrastructure becomes unwieldy as the number of package repositories that a company needs grows. Furthermore, many public repositories offer very limited authentication and authorization models. As a result, each public repository adds maintenance overhead, as access policies must be duplicated and managed for each public repository in use.

If an employee or programmer leaves a company or project, their access must be revoked both in the company’s infrastructure (like Active Directory) and in every public repository. Such a situation is difficult to maintain and creates ample opportunity for security lapses.

A better solution implements an authentication (and authorization) model that integrates with the company’s core account management system (so-called Single Sign-On, or SSO) and manages all artifacts within a single web application.

Artifact storage options

There are several options for storing artifacts, including Artifactory, Sonatype Nexus, GitHub, and GitLab. By far the most fully-featured options are Sonatype Nexus and Artifactory.

Artifactory

Artifactory supports perhaps the widest range of packages including, but not limited to:

  • NuGet (v2 and v3)
  • Docker
  • PyPI
  • conda
  • APT/Debian
  • Vagrant
  • opkg
  • Git LFS
  • CRAN
  • RubyGems
  • Golang
  • Gradle
  • Maven
  • Helm
  • Ivy
  • npm
  • rpm

It has a very rich REST API which enables many other capabilities that will be expanded upon in a later post. Furthermore, JFrog (the company behind Artifactory) develops and distributes a command-line utility to manage artifacts and packages (uploading, downloading, annotating, etc.).

Artifactory is also available both as an on-premises deployment and as a hosted deployment (where the cloud hosting provider is user-selectable).

Artifactory supports SAML-based single sign-on, making it ideal in corporate environments.

Sonatype Nexus

Perhaps the next most full-featured option, Nexus boasts a wide range of package support including, but not limited to:

  • NuGet (v2 and v3)
  • Docker
  • PyPI
  • conda
  • APT/Debian
  • CRAN
  • RubyGems
  • Golang
  • Gradle
  • Maven
  • Helm
  • Ivy
  • npm
  • rpm

Sonatype’s REST API for Nexus is much more limited than Artifactory’s, so there are fewer management options. Furthermore, Sonatype only offers an on-premises solution, meaning that users are responsible for maintaining the infrastructure and deployment.

Sonatype Nexus supports SAML-based single sign-on on its paid plans as well as LDAP (on both free and paid plans), making it ideal in corporate environments.

GitHub

GitHub offers tight integration with users of its source control management system (which itself supports SAML-SSO login). However, the range of package types supported by GitHub is far more limited than that of the other options.

GitLab

Like GitHub, GitLab offers tight integration with users of its source control management system (which supports SAML-SSO). While the package support is broader than GitHub’s, it still lacks many repository types supported by dedicated solutions like Nexus and Artifactory. The package types it currently supports are:

  • Composer
  • Conan
  • Go
  • Maven
  • npm
  • NuGet
  • PyPI
  • Docker

My personal choice, given the flexibility in hosting options, SSO support, and rich REST API, is JFrog’s Artifactory. The next post will focus on how we set up and manage Artifactory to serve as an artifact and package repository.