Bug report
Bug description:
When a source file declares an invalid or incompatible encoding, direct execution currently reports a generic SyntaxError: encoding problem: <encoding>. Importing the same file, or executing it via -m, reports the underlying cause, such as unknown encoding: ... or the concrete codec decode error. This makes direct execution less informative and inconsistent with the import/runpy path.
Unknown encoding: current vs expected
❯ cat t.py
# coding: dict-unpacking-at-home
print('Hi')
❯ ./python.exe t.py
-SyntaxError: encoding problem: dict-unpacking-at-home
+ File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
+SyntaxError: unknown encoding: dict-unpacking-at-home
❯ cat t2.py
import t
❯ ./python.exe t2.py
Traceback (most recent call last):
File "/Users/bartosz.slawecki/Python/cpython/t2.py", line 1, in <module>
import t
File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: unknown encoding: dict-unpacking-at-home
❯ ./python.exe -m t
Traceback (most recent call last):
File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 192, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 163, in _get_module_details
code = loader.get_code(mod_name)
File "<frozen importlib._bootstrap_external>", line 877, in get_code
File "<frozen importlib._bootstrap_external>", line 806, in source_to_code
File "<frozen importlib._bootstrap>", line 543, in _call_with_frames_removed
File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: unknown encoding: dict-unpacking-at-home
Encoding error: current vs expected
❯ cat t.py
# coding: shift-jis
lubię Pythona i jego team
❯ ./python.exe t.py
-SyntaxError: encoding problem: shift-jis
+ File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
+ # coding: shift-jis
+SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 8: illegal multibyte sequence
❯ cat t2.py
import t
❯ ./python.exe t2.py
Traceback (most recent call last):
File "/Users/bartosz.slawecki/Python/cpython/t2.py", line 1, in <module>
import t
File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 26: illegal multibyte sequence
❯ ./python.exe -m t
Traceback (most recent call last):
File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 192, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/Users/bartosz.slawecki/Python/cpython/Lib/runpy.py", line 163, in _get_module_details
code = loader.get_code(mod_name)
File "<frozen importlib._bootstrap_external>", line 877, in get_code
File "<frozen importlib._bootstrap_external>", line 806, in source_to_code
File "<frozen importlib._bootstrap>", line 543, in _call_with_frames_removed
File "/Users/bartosz.slawecki/Python/cpython/t.py", line 0
SyntaxError: 'shift_jis' codec can't decode byte 0x99 in position 26: illegal multibyte sequence
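To reproduce the unknown-encoding transcripts without creating files by hand, here is a minimal sketch. It writes the same two-line source to a temporary directory, then runs it once directly and once through an import. The module name `bad_coding_demo` and the helper names are made up for illustration. The wording of the direct-execution message depends on whether the interpreter includes the fix, so the script only prints what it sees.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical module name, chosen only to avoid clashing with real modules.
MODULE = "bad_coding_demo"
SOURCE = b"# coding: dict-unpacking-at-home\nprint('Hi')\n"


def last_stderr_line(args):
    """Run the current interpreter with args; return (returncode, last stderr line)."""
    proc = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
        errors="replace",
    )
    lines = proc.stderr.strip().splitlines()
    return proc.returncode, (lines[-1] if lines else "")


def reproduce():
    """Return the results of direct execution and of importing the same file."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / f"{MODULE}.py"
        script.write_bytes(SOURCE)
        # Direct execution: goes through the file tokenizer.
        direct = last_stderr_line([str(script)])
        # Import: importlib reads the bytes and compiles them in memory.
        imported = last_stderr_line(
            ["-c",
             f"import sys; sys.path.insert(0, sys.argv[1]); import {MODULE}",
             tmp]
        )
    return direct, imported


if __name__ == "__main__":
    direct, imported = reproduce()
    print("direct  :", direct)
    print("imported:", imported)
```

Swapping `SOURCE` for the shift-jis example exercises the decode-error case in the same way.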
Investigation
Direct execution uses the file tokenizer, which reads and decodes the file incrementally. Importing uses importlib to read the source into memory and then compiles the full string, so it goes through the string tokenizer.
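The in-memory path can be exercised without importlib by passing raw bytes to the built-in `compile()`. As far as I understand, this is roughly what the import machinery does with the file contents. The helper name `compile_error` is hypothetical, and the broad `except` clause is a precaution in case an interpreter surfaces the codec error under a different exception type.

```python
def compile_error(source: bytes, filename: str = "<demo>"):
    """Compile raw source bytes in memory; return the raised exception or None."""
    try:
        compile(source, filename, "exec")
    except (SyntaxError, LookupError, UnicodeError) as exc:
        return exc
    return None


# Same sources as in the transcripts above, held in memory instead of on disk.
unknown = compile_error(b"# coding: dict-unpacking-at-home\nprint('Hi')\n")
undecodable = compile_error(
    "# coding: shift-jis\nlubię Pythona i jego team\n".encode("utf-8")
)
valid = compile_error(b"# coding: utf-8\nprint('Hi')\n")

if __name__ == "__main__":
    for label, exc in (("unknown", unknown),
                       ("undecodable", undecodable),
                       ("valid", valid)):
        print(f"{label}: {type(exc).__name__}: {exc}")
```

On the interpreters I would expect this to be run on, the first two results carry the concrete cause (codec lookup failure and decode failure, respectively), which matches what the import and `-m` transcripts show.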
The two tokenizers handle the declared source encoding differently when set_readline() is invoked in this line of _PyTokenizer_check_coding_spec (Parser/tokenizer/helpers.c, lines 420 to 426 at commit 6679ac0):
    if (strcmp(cs, "utf-8") != 0 && !set_readline(tok, cs)) {
        _PyTokenizer_error_ret(tok);
        PyErr_Format(PyExc_SyntaxError, "encoding problem: %s", cs);
        PyMem_Free(cs);
        return 0;
    }
    tok->encoding = cs;
- If set_readline is fp_setreadl (file tokenizer), it creates a file reader with builtins.open and uses that new reader to re-read the encoding declaration line and the remaining lines of the file.
  Only fp_setreadl can let SyntaxError: encoding problem from line 422 bubble up. It does not do so when the erroneous bytes have not yet been decoded by the new reader, either because (1) the file has no trailing newline, which defers the translation in the incremental newline decoder, or (2) the erroneous bytes are in a later chunk that has not yet been read from disk (around 8 kB after the encoding spec line).
- If set_readline is buf_setreadl (string tokenizer), it just sets tok->enc and always succeeds. The encoding error only happens later in _PyTokenizer_translate_into_utf8, which bails out before the parser is created; it goes through _PyPegen_raise_tokenizer_init_error in _PyPegen_run_parser_from_string to ensure that it is a SyntaxError. This explains why the vague exception never occurred when the code was imported or run via runpy.
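The file-reader behaviour described above can be approximated from Python with plain `open()`. This is only an analogue of what fp_setreadl does, not the C code path itself, and `first_failure` is a hypothetical helper. It reads a file line by line with the declared encoding and reports how many lines were read before the first failure, which illustrates both the immediate codec lookup failure and the deferred decode failure when the bad bytes sit in a later chunk.

```python
import os
import tempfile


def first_failure(source: bytes, encoding: str):
    """Read source line by line via open(); return (lines_read_ok, exception_or_None)."""
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "wb") as raw:
            raw.write(source)
        lines_ok = 0
        try:
            with open(path, "r", encoding=encoding) as reader:
                for _ in reader:
                    lines_ok += 1
        except (LookupError, UnicodeError) as exc:
            return lines_ok, exc
        return lines_ok, None
    finally:
        os.unlink(path)


BAD_LINE = "lubię Pythona i jego team\n".encode("utf-8")

# Unknown codec: open() itself fails before any line is read.
unknown = first_failure(b"# coding: dict-unpacking-at-home\nprint('Hi')\n",
                        "dict-unpacking-at-home")

# Bad bytes right after the declaration: they land in the first decoded chunk.
near = first_failure(b"# coding: shift-jis\n" + BAD_LINE, "shift-jis")

# Bad bytes well past the first chunk: early lines decode fine, failure comes later.
padding = b"x = 1\n" * 4000  # about 24 kB of valid ASCII
far = first_failure(b"# coding: shift-jis\n" + padding + BAD_LINE, "shift-jis")

if __name__ == "__main__":
    for label, (count, exc) in (("unknown", unknown), ("near", near), ("far", far)):
        print(f"{label}: {count} lines ok, then {type(exc).__name__}: {exc}")
```

The `far` case is the interesting one: the reader succeeds at setup time and only fails once the chunk containing the bad bytes is decoded. The trailing-newline case (1) is not covered by this sketch.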
In the file tokenizer path, _PyTokenizer_check_coding_spec() overwrites the concrete exception raised by set_readline() with the generic SyntaxError: encoding problem: <encoding>. The user is no longer told whether there was a codec lookup error or a file decoding error.
The linked PR fixes this by dropping the encoding problem exception entirely and replacing it with the _PyPegen_raise_tokenizer_init_error(tok->filename) call since all set_readline failure paths provide an active exception. This way we have a consistent experience in both tokenizers.
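To make the difference concrete, here is a rough Python model of the two behaviours. It is not a translation of the C code. Every function name is hypothetical, `open()` stands in for set_readline(), and the normalisation step only approximates what a helper like _PyPegen_raise_tokenizer_init_error is described as doing here (turning the active exception into a SyntaxError while keeping its message).

```python
import os
import tempfile


def open_with_declared_encoding(path, cs):
    """Stand-in for set_readline(): may raise LookupError or UnicodeDecodeError."""
    with open(path, "r", encoding=cs) as reader:
        reader.readline()


def check_coding_spec_current(path, cs):
    """Model of the current behaviour: every failure collapses into one message."""
    try:
        open_with_declared_encoding(path, cs)
    except Exception:
        raise SyntaxError(f"encoding problem: {cs}") from None


def check_coding_spec_proposed(path, cs):
    """Model of the proposed behaviour: keep the cause, normalise the type."""
    try:
        open_with_declared_encoding(path, cs)
    except SyntaxError:
        raise
    except (LookupError, ValueError) as exc:  # UnicodeDecodeError is a ValueError
        err = SyntaxError(str(exc))
        err.filename = path
        err.lineno = 0
        raise err from None


def demo():
    """Return the SyntaxError messages produced by the two models."""
    messages = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "t.py")
        with open(path, "wb") as f:
            f.write(b"# coding: dict-unpacking-at-home\nprint('Hi')\n")
        for check in (check_coding_spec_current, check_coding_spec_proposed):
            try:
                check(path, "dict-unpacking-at-home")
            except SyntaxError as exc:
                messages.append(exc.msg)
    return messages


if __name__ == "__main__":
    for message in demo():
        print(message)
```

In this model the first message only names the encoding, while the second preserves the text of the underlying exception, which is the user-visible improvement the issue asks for.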
EDIT: I moved the tokenizer-init error wrapper from pegen to tokenizer helpers. This used to live in pegen because only parser entry points needed it. The file tokenizer now needs the same normalization at the point where set_readline() fails, so the helper was moved to tokenizer helpers rather than making tokenizer code depend on pegen internals. I'm happy to adjust the layering if there is a better home for this helper.
cc @lysnikolaou @pablogsal @serhiy-storchaka
CPython versions tested on:
3.16, main
Operating systems tested on:
macOS
Linked PRs