Comma Separated Values: Which CSV encoding should I use?

There are so many file formats that it's hard to keep track of them. Whether you are writing a paper on a Chromebook or saving a photo on your Android phone, each file has several potential formats. One format that comes up often when working with spreadsheets is CSV, which allows you to store data as text and move it between applications. There are multiple types of CSV files, so what's the difference between them?

What is a CSV file?

A CSV file is a plain text file. It doesn't contain formulas, formatting, or program-specific data. CSV stands for comma-separated values: each line of the file is a row of data, and the fields within a row are separated by a delimiter, also called a field separator.

A CSV file is a great way to store and transfer large amounts of data since the file type is compatible with many programs. CSV files differ in a few characteristics that determine how they are formatted: the delimiter, the character encoding, and the line ending.

Delimiters

A delimiter, or field separator, is a character or sequence of characters that separates fields in a text file. There are many possible delimiters, but commas, tabs, spaces, and semicolons are the most common. Any field can be quoted (put between quotation marks), but some fields must be quoted. The most common cases are fields that contain a quotation mark, a line break, or the delimiter character itself. Programs often name the delimiter in the file type they offer when saving. For example, a CSV file format with comma delimiters will be called CSV (Comma delimited).
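
As a rough illustration of the quoting rules above, here is a minimal sketch using Python's standard csv module. The sample rows and the to_csv helper are made up for this example. It writes the same data with a comma and with a semicolon as the delimiter, so you can see which fields end up in quotation marks.

```python
import csv
import io

# Hypothetical sample data: one field contains quotes, another contains a comma.
rows = [
    ["name", "quote", "city"],
    ["Ada", 'She said "hi"', "London, UK"],
]

def to_csv(rows, delimiter=","):
    """Write rows to a string, quoting only the fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buffer.getvalue()

comma_text = to_csv(rows, ",")
semicolon_text = to_csv(rows, ";")

# With a comma delimiter, "London, UK" must be quoted.
# With a semicolon delimiter, it no longer needs quotes.
print(comma_text)
print(semicolon_text)

# Reading the comma version back returns the original fields.
parsed = list(csv.reader(io.StringIO(comma_text)))
```

Note that the quotation marks inside a quoted field are doubled, which is how the csv module escapes them by default.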

UTF-8 vs. UTF-16 vs. UTF-32

To understand these CSV differentiators, we first have to discuss how computers store data, which is done through a binary system. Binary means data is stored in sequences of 1s and 0s, where a single 1 or 0 is called a bit. The next largest unit of data is a byte, which is made up of eight bits. For example, "1" is a bit, and "01001101" is a byte.
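
To make the byte from the example concrete, the short sketch below converts "01001101" to a number and then to the character it stands for in ASCII. The variable names are only for illustration.

```python
byte_text = "01001101"          # the eight bits from the example above

value = int(byte_text, 2)       # read the bits as a base-2 number
as_bytes = bytes([value])       # a single byte holding that number
char = as_bytes.decode("ascii") # the character this byte represents in ASCII

# Going the other way: from a character back to its eight bits.
back = format(ord(char), "08b")

print(value, char, back)
```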

Bytes are put together to form every digital item you’ve ever interacted with, from a picture on your smartphone to the Excel program on your computer. You’ll often see file sizes in kilobytes (a thousand bytes), megabytes (a million bytes), and gigabytes (a billion bytes).

To use binary practically, regular language characters and symbols must be translated to binary. One way to do this is ASCII (American Standard Code for Information Interchange), which assigns each character a unique number that is stored as a binary sequence.

This system works but has limitations because ASCII uses seven bits per character, which allows only 128 unique combinations of 1s and 0s. When it was created, this was fine since it only needed to house upper-case letters, lower-case letters, digits, punctuation, and a set of control characters. The system quickly ran out of room as technology evolved and the need to add characters from other languages arose.

The solution is another system called Unicode, which assigns a unique code, called a code point, to every character across all languages, as well as to emojis. A code point is written as U+ followed by a hexadecimal number. For example, A is represented in Unicode by U+0041. A code point is not binary, so we need a way to convert from a code point into binary. This is where UTF comes in.
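
The U+ notation can be reproduced with Python's built-in ord function, which returns a character's code point as a number. The code_point helper below is a hypothetical name used only for this sketch.

```python
def code_point(ch):
    """Format a single character's Unicode code point in U+ notation."""
    return f"U+{ord(ch):04X}"

print(code_point("A"))           # the example from the text
print(code_point("é"))           # an accented Latin letter
print(code_point("\U0001F600"))  # the grinning face emoji

# chr goes the other way: from a code point number back to the character.
letter = chr(0x0041)
```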

UTF stands for Unicode Transformation Format and translates any code point into a binary sequence and vice versa. The number at the end of the name is the size, in bits, of the smallest unit the encoding uses to store a character. For example, UTF-8 works in units of 8 bits (one byte), so it can store a character in one, two, three, or four bytes. UTF-16 works in units of 16 bits, so it can only store characters in two or four bytes. UTF-32 always stores a character in four bytes.

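UTF-8 stores ASCII characters in one byte and all other characters in more bytes. This means unaccented English letters, digits, and basic punctuation take a single byte. Accented Latin letters, Greek, and Cyrillic take two bytes, most Chinese, Japanese, and Korean characters take three, and emojis and rarer characters take four. Only UTF-8 is backward compatible with ASCII: a file that contains only ASCII characters is byte-for-byte identical when saved as UTF-8. All UTF encoding systems can represent every Unicode code point.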
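
The byte counts described above can be checked with Python's str.encode method. This is a small sketch with arbitrarily chosen sample characters. The "-le" variants of the codec names are used so that Python doesn't prepend a byte order mark, which would inflate the counts.

```python
def byte_lengths(ch):
    """Return how many bytes one character takes in UTF-8, UTF-16, and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

# Sample characters: ASCII letter, accented letter, euro sign, emoji.
samples = ["A", "é", "€", "\U0001F600"]
table = {ch: byte_lengths(ch) for ch in samples}

for ch, (utf8, utf16, utf32) in table.items():
    print(f"U+{ord(ch):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32} bytes")

# ASCII compatibility: plain ASCII text produces the same bytes in UTF-8.
same_bytes = "CSV".encode("ascii") == "CSV".encode("utf-8")
```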

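UTF-8 is usually the best encoding system for files that use mostly English characters because each of those characters takes only one byte, which saves space. If a file is made up mostly of characters that need three bytes in UTF-8, such as Chinese, Japanese, or Korean text, UTF-16 will likely yield a smaller file because it stores those characters in two bytes. UTF-32 produces the largest files of the three since every character takes four bytes.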
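
To see the size trade-off on CSV-like data, the sketch below encodes two tiny, made-up tables and compares the resulting sizes. Real files will differ, and since the delimiters and line breaks are ASCII characters, they narrow UTF-16's advantage on non-English text.

```python
# Hypothetical sample data: one mostly-English table, one mostly-Japanese table.
english = "name,city\nAda,London\n"
japanese = "名前,都市\n太郎,東京\n"

def encoded_sizes(text):
    """Return the size in bytes of text in UTF-8, UTF-16, and UTF-32 (no byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

english_sizes = encoded_sizes(english)
japanese_sizes = encoded_sizes(japanese)

print("English: ", english_sizes)
print("Japanese:", japanese_sizes)
```

In this sample, UTF-8 is the smallest encoding for the English table and UTF-16 is the smallest for the Japanese one.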

Macintosh and MS-DOS

Some CSV formats are designed for the Macintosh or MS-DOS operating systems. These files are formatted slightly differently since these operating systems differ from Windows. For Macintosh CSV files, the main differentiator is the line ending, the character or characters that mark the end of a row. Macintosh CSV uses a carriage return (CR). MS-DOS and other CSV formats use a carriage return followed by a line feed (CR/LF). CR is a single character, and CR/LF is two characters.
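
The difference in line endings can be reproduced with the lineterminator option of Python's csv.writer. This is only a sketch: the write_rows helper and the sample rows are hypothetical, and a spreadsheet program's Macintosh or MS-DOS export may differ from this in other ways.

```python
import csv
import io

rows = [["id", "name"], ["1", "Ada"]]  # made-up sample rows

def write_rows(rows, lineterminator):
    """Write rows to a string, ending each row with the given line terminator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator=lineterminator)
    writer.writerows(rows)
    return buffer.getvalue()

mac_text = write_rows(rows, "\r")    # CR only
dos_text = write_rows(rows, "\r\n")  # CR followed by LF

# repr makes the otherwise invisible line-ending characters visible.
print(repr(mac_text))
print(repr(dos_text))

# Both versions parse back to the same rows. Passing newline="" keeps Python
# from translating the line endings before the csv module sees them.
mac_rows = list(csv.reader(io.StringIO(mac_text, newline="")))
dos_rows = list(csv.reader(io.StringIO(dos_text, newline="")))
```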

Excel at sheets

CSV seems like a simple file format, but it has a lot of nuances. The delimiter, the character encoding, and the line ending differentiate CSV files and make them suited to different applications and operating systems. To work well with CSV files, try these Google Sheets tips and tricks.
