Brian Cole
2014-06-06 20:28:50 UTC
Hi,
As we've been migrating to a code base that has to straddle both Python 2 and Python 3, the most difficult pain point has been dealing with the unicode transition. Most Python web frameworks already deal exclusively in unicode objects in Python 2.x instead of bytes, leading to a lot of code written like this:
result = Foo(str(unquote(data_from_network)))
Here 'Foo' is the C++ function being wrapped by SWIG that takes either a 'const char *' or a 'std::string'. Using 'str' in Python makes the code compatible between Python 2 and 3, but it really isn't what we want to be proposing to our users when writing code. It works well in Python 3, passing UTF-8 strings into the underlying C++ function. However, in Python 2, it will throw an exception whenever a non-ASCII character occurs.
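To make the failure concrete, here is a small sketch that reproduces it in a version-independent way. It assumes the interpreter's default encoding is ASCII (the Python 2 default), so the explicit .encode("ascii") stands in for what Python 2's str() does implicitly on a unicode object. The sample string and the helper names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample input; any non-ASCII character triggers the problem.
text = u"caf\u00e9"


def ascii_coerce(value):
    """Mimic Python 2's str(unicode_obj), assumed to be an implicit ASCII encode."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError as exc:
        # This is the exception users hit in Python 2.
        return "failed: " + type(exc).__name__


def utf8_coerce(value):
    """What we would like to reach the C++ side instead: UTF-8 bytes."""
    return value.encode("utf-8")


print(ascii_coerce(u"cafe"))  # pure ASCII input encodes without trouble
print(ascii_coerce(text))     # non-ASCII input fails under the ASCII codec
print(utf8_coerce(text))      # the same input encodes cleanly as UTF-8
```

The point is only that the ASCII step is what breaks; the data itself is perfectly representable as UTF-8.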
Is there a fundamental reason SWIG doesn't automatically coerce Python 2.x unicode objects into UTF-8?
I hacked in a little experimental patch to SWIG that does just that here: https://github.com/coleb/swig/commit/148eaba5d45a7b95290aa114fa4fe43ba4a30774
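For comparison, this is roughly what the same coercion looks like when done by hand at the Python level. It is not the patch itself, just a sketch of the behaviour I'd like users to get for free. The names to_native_str and Foo are hypothetical, and Foo here is a pure-Python stand-in for a SWIG-wrapped function that only accepts the interpreter's native str type.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native_str(value):
    """Return `value` as the native str type of the running interpreter.

    Python 2: unicode objects are encoded to UTF-8 byte strings.
    Python 3: bytes objects are decoded from UTF-8 to str.
    Anything else is passed through unchanged.
    """
    if PY2:
        if isinstance(value, unicode):  # noqa: F821 (name exists only on Python 2)
            return value.encode("utf-8")
        return value
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


def Foo(s):
    """Stand-in for the wrapped C++ function; rejects non-native strings."""
    if not isinstance(s, str):
        raise TypeError("expected a native str")
    return len(s)


# Works on both interpreters without the caller sprinkling str() everywhere.
result = Foo(to_native_str(u"caf\u00e9"))
```

Having every caller go through a helper like this is exactly the boilerplate that doing the coercion inside the generated wrapper would remove.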
All the SWIG tests pass in both Python 2 and 3 (including the one I added), as do all our own test cases. Is this something SWIG would consider for inclusion? I've never contributed code to SWIG before (despite being a user for over 10 years), so I'm looking for any guidance on what else would need changing. Also, should I require the user to specify an extra command line flag or macro define? The side effects could be wider than expected for this type of change.
Thanks,
Brian