What is the difference between a string and a byte string?
I am working with a library which returns a byte string and I need to convert this to a string.
Although I’m not sure what the difference is – if any.
Assuming Python 3 (in Python 2, this difference is a little less well-defined) – a string is a sequence of characters, i.e. Unicode code points; these are an abstract concept, and can’t be directly stored on disk. A byte string is a sequence of, unsurprisingly, bytes – things that can be stored on disk. The mapping between them is an encoding – there are quite a lot of these (and infinitely many are possible) – and you need to know which one applies in the particular case in order to do the conversion, since a different encoding may map the same bytes to a different string:
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-16')
'蓏콯캁澽苏'
>>> b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'.decode('utf-8')
'τoρνoς'
Once you know which one to use, you can use the .decode() method of the byte string to get the right character string from it as above. For completeness, the .encode() method of a character string goes the opposite way:
>>> 'τoρνoς'.encode('utf-8')
b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'
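For the situation in the question (a library hands back a byte string), a minimal sketch might look like the following. It assumes the bytes are UTF-8; the name `raw` and the sample value are illustrative and do not come from any particular library.

```python
# Hypothetical bytes returned by a library; assumed here to be UTF-8 encoded.
raw = b'\xcf\x84o\xcf\x81\xce\xbdo\xcf\x82'

# Decode with the encoding you know (or have been told) was used.
text = raw.decode('utf-8')
print(text)

# A wrong guess can raise UnicodeDecodeError instead of returning text.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('not valid ASCII:', exc.reason)

# errors='replace' substitutes U+FFFD for each undecodable byte instead of raising.
lossy = raw.decode('ascii', errors='replace')
print(lossy)
```

Whether to fail loudly or decode lossily depends on the application; the strict default is usually the safer starting point.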
The only thing that a computer can store is bytes.
To store anything in a computer, you must first encode it, i.e. convert it to bytes. For example:
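- If you want to store music, you must first encode it using an audio format such as MP3 or WAV.
- If you want to store a picture, you must first encode it using an image format such as PNG or JPEG.
- If you want to store text, you must first encode it using a text encoding such as ASCII or UTF-8.
MP3, WAV, PNG, JPEG, ASCII and UTF-8 are examples of encodings. An encoding is a format to represent audio, images, text, etc. in bytes.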
In Python, a byte string is just that: a sequence of bytes. It isn’t human-readable. Under the hood, everything must be converted to a byte string before it can be stored in a computer.
On the other hand, a character string, often just called a “string”, is a sequence of characters. It is human-readable. A character string can’t be directly stored in a computer; it has to be encoded first (converted into a byte string). There are multiple encodings through which a character string can be converted into a byte string, such as ASCII and UTF-8.
'I am a string'.encode('ASCII')
The above Python code will encode the string 'I am a string' using the encoding ASCII. The result of the above code will be a byte string. If you print it, Python will represent it as b'I am a string'. Remember, however, that byte strings aren’t human-readable; it’s just that Python decodes them from ASCII when you print them. In Python, a byte string is represented by a b, followed by the byte string’s ASCII representation.
A byte string can be decoded back into a character string, if you know the encoding that was used to encode it.
b'I am a string'.decode('ASCII')
The above code will return the original string 'I am a string'.
Encoding and decoding are inverse operations. Everything must be encoded before it can be written to disk, and it must be decoded before it can be read by a human.
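A short sketch of that round trip, using three encodings from the standard library. The loop and variable names are illustrative only.

```python
original = 'I am a string'

# Encoding and then decoding with the same encoding returns the original text,
# even though the bytes in between differ from one encoding to another.
for encoding in ('ascii', 'utf-8', 'utf-16'):
    encoded = original.encode(encoding)
    assert encoded.decode(encoding) == original
    print(encoding, len(encoded), encoded)

# Not every encoding can represent every character: ASCII has no byte for 'š'.
try:
    'š'.encode('ascii')
except UnicodeEncodeError:
    print("'š' cannot be encoded as ASCII")
```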
Let’s take a simple one-character string 'š' and encode it into a sequence of bytes:
>>> 'š'.encode('utf-8')
b'\xc5\xa1'
For the purpose of this example let’s display the sequence of bytes in its binary form:
>>> bin(int(b'\xc5\xa1'.hex(), 16))
'0b1100010110100001'
Now, it is generally not possible to decode the information back without knowing how it was encoded. Only if you know that the UTF-8 text encoding was used can you follow the algorithm for decoding UTF-8 and recover the original string:
11000101 10100001
   ^^^^^   ^^^^^^
   00101   100001
You can display the binary number 101100001 back as a string:
>>> chr(int('101100001', 2))
'š'
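The same bit extraction can be sketched with bitwise operators. This handles only the two-byte UTF-8 form (110xxxxx 10yyyyyy) used by 'š'; it is an illustration of the step above, not a general decoder, and in practice you would simply call .decode('utf-8').

```python
data = b'\xc5\xa1'
first, second = data  # iterating over bytes yields integers in range(256)

# A two-byte UTF-8 sequence has the form 110xxxxx 10yyyyyy.
assert first >> 5 == 0b110
assert second >> 6 == 0b10

# Keep the 5 payload bits of the first byte and the 6 payload bits of the second.
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)

print(bin(code_point))  # 0b101100001
print(chr(code_point))  # š
```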
Note: I will focus my answer on Python 3, since the end of life of Python 2 is very close.
In Python 3
bytes consists of sequences of 8-bit unsigned values, while str consists of sequences of Unicode code points that represent textual characters from human languages.
>>> # bytes
>>> b = b'h\x65llo'
>>> type(b)
<class 'bytes'>
>>> list(b)
[104, 101, 108, 108, 111]
>>> print(b)
b'hello'
>>>
>>> # str
>>> s = "nai\u0308ve"
>>> type(s)
<class 'str'>
>>> list(s)
['n', 'a', 'i', '̈', 'v', 'e']
>>> print(s)
naïve
Even though bytes and str seem to work the same way, their instances are not compatible with each other, i.e. bytes and str instances can’t be used together with operators like > and +. In addition, keep in mind that comparing bytes and str instances for equality, i.e. using ==, will always evaluate to False even when they contain exactly the same characters.
>>> # concatenation
>>> b'hi' + b'bye' # this is possible
b'hibye'
>>> 'hi' + 'bye' # this is also possible
'hibye'
>>> b'hi' + 'bye' # this will fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat str to bytes
>>> 'hi' + b'bye' # this will also fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "bytes") to str
>>>
>>> # comparison
>>> b'red' > b'blue' # this is possible
True
>>> 'red' > 'blue' # this is also possible
True
>>> b'red' > 'blue' # you can't compare bytes with str
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'bytes' and 'str'
>>> 'red' > b'blue' # you can't compare str with bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '>' not supported between instances of 'str' and 'bytes'
>>> b'blue' == 'red' # equality between str and bytes always evaluates to False
False
>>> b'blue' == 'blue' # equality between str and bytes always evaluates to False
False
Another issue when dealing with bytes and str arises when working with files returned by the open built-in function. On one hand, if you want to read or write binary data to/from a file, always open the file using a binary mode like 'rb' or 'wb'. On the other hand, if you want to read or write Unicode data to/from a file, be aware of the default encoding of your computer, and if necessary pass the encoding parameter to avoid surprises.
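A minimal sketch of the two modes, assuming UTF-8 is the encoding you want on disk. The temporary file and the sample text are there only to keep the example self-contained.

```python
import os
import tempfile

text = 'naïve τoρνoς'

# A throwaway file path, used only for this illustration.
fd, path = tempfile.mkstemp()
os.close(fd)

try:
    # Text mode: pass encoding explicitly instead of relying on the platform default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Binary mode returns bytes, exactly as they are stored on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode with the matching encoding returns str.
    with open(path, 'r', encoding='utf-8') as f:
        decoded = f.read()
finally:
    os.remove(path)

print(type(raw), type(decoded))
```

Reading the same file in text mode with a different encoding could raise an error or return different characters, which is why naming the encoding at both ends is worth the extra argument.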
In Python 2
str consists of sequences of 8-bit values, while unicode consists of sequences of Unicode characters. One thing to keep in mind is that str and unicode can be used together with operators if str only consists of 7-bit ASCII characters.
It might be useful to use helper functions to convert between str and unicode in Python 2, and between bytes and str in Python 3.
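For Python 3, such helpers could look roughly like this. The names `to_str` and `to_bytes` and the UTF-8 default are assumptions made for this sketch, not part of the standard library.

```python
def to_str(value, encoding='utf-8'):
    """Return value as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding it first if it is str (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


print(repr(to_str(b'hello')))   # 'hello'
print(repr(to_str('hello')))    # 'hello'
print(repr(to_bytes('hello')))  # b'hello'
print(repr(to_bytes(b'hello'))) # b'hello'
```

Calling these at the boundaries of a program (where data arrives from or leaves to files, sockets, or libraries) lets the rest of the code work with a single type.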