Discussion:
Python 2.x unicode coercion
Brian Cole
2014-06-06 20:28:50 UTC
Hi,

As we've been migrating to a code base that has to straddle both Python 2 and Python 3, the most difficult pain point has been dealing with the unicode transition. Most Python web frameworks already deal exclusively in unicode objects in Python 2.x instead of bytes, leading to a lot of code written like this:

result = Foo(str(unquote(data_from_network)))

Here 'Foo' is the C++ function wrapped by SWIG, taking either a 'const char *' or a 'std::string'. Using 'str' in Python makes the code compatible between Python 2 and 3, but it really isn't what we want to be proposing to our users when writing code. It works well in Python 3, passing UTF-8 strings into the underlying C++ function. However, in Python 2, it will throw an exception whenever an actual non-ASCII character occurs.
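
To make the failure mode concrete, here is a minimal sketch of the difference between the implicit ASCII encoding that Python 2's str() applies to a unicode object and an explicit UTF-8 encoding. The helper name to_utf8_bytes and the sample string are assumptions made for illustration only; they are not part of SWIG or of the patch discussed in this thread.

```python
def to_utf8_bytes(value):
    """Return UTF-8 encoded bytes for text input; pass bytes through unchanged.

    Hypothetical helper for illustration; not a SWIG function.
    """
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


text = u"caf\u00e9"  # a string containing one non-ASCII character

# Python 2's str(u"...") implicitly encodes with the ASCII codec.
# Encoding explicitly with "ascii" reproduces that failure on any version.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding as UTF-8 succeeds and yields bytes suitable for a
# 'const char *' or 'std::string' parameter.
encoded = to_utf8_bytes(text)
```

Calling a helper like this at every call site is roughly the boilerplate the proposal aims to remove by doing the equivalent coercion inside the generated wrapper.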

Is there a fundamental reason SWIG doesn't automatically coerce Python 2.x unicode objects into UTF-8?

I hacked in a little experimental patch to SWIG that does just that here: https://github.com/coleb/swig/commit/148eaba5d45a7b95290aa114fa4fe43ba4a30774

All the SWIG tests pass in both 2 and 3 (including the one I added), as do all our own test cases. Is this something SWIG would consider for inclusion? I've never contributed code to SWIG before (despite being a user for over 10 years), so I'm looking for any guidance on what else would need changing. Also, should I require the user to specify an extra command line flag or macro define? The side effects could be wider than expected for this type of change.

Thanks,
Brian
William S Fulton
2014-06-10 18:55:16 UTC
Seems to be a reasonable approach.

If you use Travis for testing this, you'll get tests for Python 2.4 and
upwards... either enable Travis on your own repo, or if you put together
a pull request to the main swig/swig repo, the Travis tests will run.
Please add a runtime test for this and modify unicode_strings.i for
strings containing unicode as input (currently it just contains strings
as return values).

I don't know why the unicode coercion is only for py3 and later. Is
there something specific in py3 to do with unicode string handling? When
was the unicode API (PyUnicode_Check and PyUnicode_AsUTF8String)
introduced? We may as well enable it from those versions onwards, in
which case you could use the existing code for unicode handling in
SWIG_AsCharPtrAndSize instead.

William
William S Fulton
2015-12-19 04:15:26 UTC
I've added this into master, but it requires a macro to be defined
when compiling the generated code in order to use it:
SWIG_PYTHON_2_UNICODE.
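
As a rough sketch of what defining the macro at compile time could look like, the snippet below assembles a compiler command line that passes -DSWIG_PYTHON_2_UNICODE when building the generated wrapper. The function name compile_command, the compiler name, the file name example_wrap.cxx and the include path are all assumptions for illustration; substitute whatever your build actually uses.

```python
def compile_command(wrapper_source, python_include_dir, enable_py2_unicode=True):
    """Build an argument list for compiling a SWIG-generated wrapper.

    Hypothetical helper: the compiler and flags are common defaults,
    not something prescribed by SWIG.
    """
    cmd = ["c++", "-fPIC", "-c", wrapper_source, "-I" + python_include_dir]
    if enable_py2_unicode:
        # The macro must be defined when compiling the generated code.
        cmd.append("-DSWIG_PYTHON_2_UNICODE")
    return cmd


# Example file name and include path are placeholders.
cmd = compile_command("example_wrap.cxx", "/usr/include/python2.7")
print(" ".join(cmd))
```

Because the behaviour is opt-in, builds that do not define the macro are unaffected.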

William
David Beazley
2015-12-21 12:40:45 UTC