What's inside a PEM file
One of the more confusing aspects of dealing with public key cryptography is that there are so many different file formats. Extensions like .p8 are just some of the ones we commonly encounter, and with PEM, DER, BER, PKCS#1, and PKCS#8, there is no shortage of acronyms describing these file formats. So how are they all related?
To make sense of this alphabet soup, let’s unpack how some of these standards and formats relate to one another.
Let’s start with the Abstract Syntax Notation One (ASN.1), which RSA describes as:
ASN.1 is a flexible notation that allows one to define a variety of data types, from simple types such as integers and bit strings to structured types such as sets and sequences, as well as complex types defined in terms of others.
That’s quite abstract, and that’s intentional: ASN.1 lets you define the logical structure of data, without defining its physical representation. That’s similar to how interface definition languages like proto3, DCE-IDL, or CORBA IDL define the structure of an interface and data without defining how data is formatted on the wire.
Here’s an example of what the logical structure of an RSA private key looks like (screenshot from ASN.1 Editor):
First, notice the absence of metadata. We can see that there are sequences of integers and object identifiers, but the structure doesn’t include field names. To make sense of an ASN.1 data structure, we therefore need some clue about what data is being represented. We’ll get to that later.
Second, notice that the structure is sufficiently abstract that we could think of various ways to encode it into a binary file. We could even turn it into XML or JSON if we wanted to! So we also need something below ASN.1 that defines the encoding. Which leads us to…
BER and DER
BER stands for Basic Encoding Rules and describes how ASN.1 values can be represented as bits and bytes. BER isn’t super-strict though and allows the same value to be encoded in different ways. That ambiguity is removed by the Distinguished Encoding Rules (DER), which define a subset of BER:
DER (Distinguished Encoding Rules) is a restricted variant of BER for producing unequivocal transfer syntax for data structures described by ASN.1. Like CER, DER encodings are valid BER encodings. DER is the same thing as BER with all but one sender's options removed.
DER is a subset of BER providing for exactly one way to encode an ASN.1 value. DER is intended for situations when a unique encoding is needed, such as in cryptography, and ensures that a data structure that needs to be digitally signed produces a unique serialized representation. DER can be considered a canonical form of BER.
For example, in BER a boolean value of true can be encoded as any of 255 non-zero byte values, while in DER there is only one way to encode it.
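To make the boolean example concrete, here is a minimal Python sketch of a decoder that accepts any non-zero content byte in BER mode but insists on the single DER form. The function names and the `strict_der` flag are made up for illustration. The one fact it relies on is that DER encodes true as the content byte 0xFF and false as 0x00.

```python
def decode_boolean(encoded: bytes, strict_der: bool = False) -> bool:
    """Decode a BOOLEAN given as tag (0x01), length (0x01), one content byte."""
    if len(encoded) != 3 or encoded[0] != 0x01 or encoded[1] != 0x01:
        raise ValueError("not a primitive BOOLEAN encoding")
    content = encoded[2]
    # BER: any non-zero byte means true. DER: only 0xFF is allowed for true.
    if strict_der and content not in (0x00, 0xFF):
        raise ValueError("DER requires 0x00 (false) or 0xFF (true)")
    return content != 0x00


def encode_boolean_der(value: bool) -> bytes:
    """Produce the single DER encoding of a BOOLEAN."""
    return bytes([0x01, 0x01, 0xFF if value else 0x00])
```

With this sketch, `decode_boolean(b"\x01\x01\x2a")` is accepted as true under BER rules, while the same bytes are rejected when `strict_der=True`.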
With DER and ASN.1, we can serialize structured data into a byte stream and deserialize it back into structured data.
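As a rough illustration of that round trip, the following Python sketch encodes a SEQUENCE of non-negative INTEGERs to DER and parses it back. It is a toy, not a general ASN.1 library: it only knows the tags 0x30 (SEQUENCE) and 0x02 (INTEGER), assumes single-byte tags and well-formed input, and all function names are hypothetical.

```python
def encode_length(n: int) -> bytes:
    """DER length: short form below 128, otherwise long form."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def encode_integer(value: int) -> bytes:
    """DER INTEGER for non-negative values (minimal two's complement)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # The extra bit keeps the sign bit clear, e.g. 128 becomes 00 80.
    content = value.to_bytes((value.bit_length() + 8) // 8, "big")
    return b"\x02" + encode_length(len(content)) + content


def encode_sequence_of_integers(values) -> bytes:
    body = b"".join(encode_integer(v) for v in values)
    return b"\x30" + encode_length(len(body)) + body


def read_tlv(data: bytes, offset: int = 0):
    """Return (tag, content, next_offset) for the TLV starting at offset."""
    tag = data[offset]
    first = data[offset + 1]
    offset += 2
    if first < 0x80:
        length = first
    else:
        num_bytes = first & 0x7F
        length = int.from_bytes(data[offset:offset + num_bytes], "big")
        offset += num_bytes
    return tag, data[offset:offset + length], offset + length


def decode_sequence_of_integers(data: bytes) -> list:
    tag, body, _ = read_tlv(data)
    if tag != 0x30:
        raise ValueError("expected a SEQUENCE")
    values, pos = [], 0
    while pos < len(body):
        tag, content, pos = read_tlv(body, pos)
        if tag != 0x02:
            raise ValueError("expected an INTEGER")
        values.append(int.from_bytes(content, "big", signed=True))
    return values
```

Note that the decoder gets back a bare list of integers. Nothing in the bytes says what each integer means, which is the missing-metadata point made earlier.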
Binary data formats like DER are great for storing data on disk because they save space and are efficient to parse. But what if we need to exchange data over text-based protocols like SMTP? This is where PEM comes into play.
PEM stands for Privacy-Enhanced Mail and it lets you encode binary data in ASCII by defining an envelope data format: To turn a binary data blob into PEM format, you base64-encode the data and wrap it in a BEGIN/END header and footer:
-----BEGIN [LABEL]-----
base64([DATA])
-----END [LABEL]-----
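The envelope is simple enough to sketch in a few lines of Python using only the standard library. The function names `pem_encode` and `pem_decode` are made up for this example, and the parsing is deliberately lenient. The 64-character line width follows the convention in RFC 7468.

```python
import base64
import re


def pem_encode(data: bytes, label: str) -> str:
    """Wrap binary data in a PEM envelope with the given label."""
    b64 = base64.b64encode(data).decode("ascii")
    # Break the base64 text into lines of at most 64 characters.
    lines = [b64[i:i + 64] for i in range(0, len(b64), 64)]
    return "\n".join(
        [f"-----BEGIN {label}-----", *lines, f"-----END {label}-----"]
    ) + "\n"


# The END label must match the BEGIN label (backreference \1).
_PEM_PATTERN = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\s*(.*?)\s*-----END \1-----", re.DOTALL
)


def pem_decode(text: str) -> tuple[str, bytes]:
    """Return (label, data) for the first PEM block found in text."""
    match = _PEM_PATTERN.search(text)
    if match is None:
        raise ValueError("no PEM block found")
    label, body = match.group(1), match.group(2)
    # Drop line breaks and other whitespace before decoding the base64.
    return label, base64.b64decode("".join(body.split()))
```

For instance, `pem_encode(b"hello", "TEST")` yields a three-line block whose middle line is `aGVsbG8=`, and `pem_decode` turns that block back into `("TEST", b"hello")`.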
You could use PEM to encode arbitrary data like photos and Word documents and come up with your own labels. But in practice, PEM is used for a limited set of PKI-related data formats only. And to make that more practical, RFC 7468 defines a number of labels and their semantics:
Sec. Label                  ASN.1 Type              Reference Module
----+----------------------+-----------------------+---------+----------
  5  CERTIFICATE            Certificate             [RFC5280] id-pkix1-e
  6  X509 CRL               CertificateList         [RFC5280] id-pkix1-e
  7  CERTIFICATE REQUEST    CertificationRequest    [RFC2986] id-pkcs10
  8  PKCS7                  ContentInfo             [RFC2315] id-pkcs7*
  9  CMS                    ContentInfo             [RFC5652] id-cms2004
 10  PRIVATE KEY            PrivateKeyInfo ::=      [RFC5208] id-pkcs8
                            OneAsymmetricKey        [RFC5958] id-aKPV1
 11  ENCRYPTED PRIVATE KEY  EncryptedPrivateKeyInfo [RFC5958] id-aKPV1
 12  ATTRIBUTE CERTIFICATE  AttributeCertificate    [RFC5755] id-acv2
 13  PUBLIC KEY             SubjectPublicKeyInfo    [RFC5280] id-pkix1-e
It is this set of predefined labels that makes PEM so useful: Previously, we saw
that to make sense of an ASN.1 data structure, we need some clue about what data is being
represented. The PEM label gives us that clue: For example, when we see the header
-----BEGIN PUBLIC KEY-----, the last row of the table above tells us that what follows is:
- A public key, represented as a SubjectPublicKeyInfo ASN.1 data structure,
- encoded in DER,
- encoded in base64.
Whereas, if we see a -----BEGIN CERTIFICATE----- header, we can expect:
- A certificate, represented as a Certificate ASN.1 data structure,
- encoded in DER,
- encoded in base64.
And so on.
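The label-to-type lookup described above can be sketched as a small Python table. The dictionary simply transcribes the RFC 7468 rows from the table above. The name `describe_pem` and the choice to return `None` for unknown labels are assumptions made for this example.

```python
import re

# Label -> ASN.1 type, transcribed from the RFC 7468 table above.
RFC7468_LABELS = {
    "CERTIFICATE": "Certificate",
    "X509 CRL": "CertificateList",
    "CERTIFICATE REQUEST": "CertificationRequest",
    "PKCS7": "ContentInfo",
    "CMS": "ContentInfo",
    "PRIVATE KEY": "PrivateKeyInfo / OneAsymmetricKey",
    "ENCRYPTED PRIVATE KEY": "EncryptedPrivateKeyInfo",
    "ATTRIBUTE CERTIFICATE": "AttributeCertificate",
    "PUBLIC KEY": "SubjectPublicKeyInfo",
}


def describe_pem(text: str):
    """Return (label, asn1_type) for the first BEGIN line in text.

    asn1_type is None when the label is not one of the RFC 7468 labels.
    """
    match = re.search(r"^-----BEGIN ([^-]+)-----\s*$", text, re.MULTILINE)
    if match is None:
        raise ValueError("no PEM header found")
    label = match.group(1)
    return label, RFC7468_LABELS.get(label)
```

Labels that the table does not list come back with `None`, which signals that we need some other clue about what the payload contains.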
In the next post, we’ll take a closer look at how PEM files are used to encode and store public keys.