
Date  : Nov 7, 2008
Topic : Unicode!
People: James Boston, David Humphrey, Jason Orendorff, Ted Mielczarek, Benjamin Smedberg

09:37 <@humph> jboston: you there?
09:37 < jboston> humph: ack
09:38 <@humph> jboston: meet jorendorff
09:38 < jboston> jorendorff: allo
09:38 < jorendorff> jboston: hi!
09:38 <@humph> jboston: I think your project, and some stuff he wants to do, is doing, are tied together
09:38 <@humph> jboston: he works on the js engine and other scary things
09:39 < jboston> ah!
09:39 < jorendorff> jboston: what should I read?  point me at your blog and stuff
09:39 < jboston>
09:39 < jboston>
09:40 <@humph> jboston: where is that initial conversation with bsmedberg from irc
09:40 <@humph> that might help here
09:40 < jboston>
09:41  * jorendorff reads and reads
09:45 < jorendorff> so, that IRC chatlog refers to "issues with character sets"
09:45 <@humph> yeah, the unicode piece is right in front of him now
09:45 < jboston> Yes. A tricky issue.
09:45 < jorendorff> jboston: Yeah, do you have ideas already?
09:46 < jboston> Well, the problem I have is that I am using the Netscape Portable Runtime, but the APIs I use don't support unicode.
09:46 < jorendorff> I want to look at how this is handled in Python 3.0... maybe the stdin/stdout deal in bytes, not text
09:46 < jorendorff> jboston: This is a horrible problem to have :-P
09:46 < jorendorff> jboston: It feels too much like NSPR is just in the wy
09:46 < jorendorff> *way
09:47 <@humph> there's a whole bunch of never-implemented or implemented-badly bits in nspr that sort of draw your attention, though
09:47 < jboston> Yes and no. It looks like over the years people have hacked things into nsIProcess to avoid using the NSPR. But there's a lot of useful stuff in there.
09:47 < ted> heh
09:48 < jorendorff> Ultimately I want JS to have a byte-array type.
09:48 < jorendorff> The language has strings, which are immutable arrays of 16-bit "characters" (actually UTF-16 or UCS2 code units)
09:48 <@humph> so bypassing the unicode problem by doing byte-by-byte is interesting
09:48 < ted> JS could use some way to handle binary data
09:48 < jorendorff> yes, it's a fairly common request; Flash has such a thing
09:49 <@humph> jorendorff: do you have a bug on this?
09:49 <@humph> the byte array?
09:49 < jorendorff> jboston:  the awkward thing here is that it feels like a prerequisite to what you're doing, and certainly I don't want to block what you're doing
09:49 < ted> if you can give the user a stream of bytes, we have streams that will let you get out unicode data
09:49 < ted>
09:50 <@humph> nice
09:50 < jorendorff> let me search, i'm not aware of a bug
09:50 < jboston> I need to do more research on how NSPR APIs for piping handle unicode. The problem I had run into with character sets had to do with passing arguments to processes.
09:50 < jorendorff> for byte arrays
09:50 < ted> like nsIFileInputStream just gives you bytes
09:50 < ted> yeah, isn't that one of the main problems you were looking into?
09:50 < ted> since on windows, filenames are actually UTF-16
09:50 <@humph> right
09:50 < jorendorff> filenames and command lines both
09:51 < ted> yeah
09:51 < jorendorff> on unix, the executable filename is 8-bit, and the argv strings are 8-bit, and there is no command line
09:51 <@humph> do either of you have any tips for him on solving this?
09:51 < ted> so i guess ideally, your interface would just take nsStrings
09:51 < ted> jorendorff: well, not true
09:51 < jorendorff> ted: ?
09:51 < ted> linux and OSX use UTF-8 natively, most of the time, now
09:52 < ted> (although you can change the encoding you use)
09:52 < jorendorff> ted: yep, most of the time.
09:52 < ted> you can find the platform charset in those cases though, it shouldn't be a big deal
09:52 < ted> and we have plenty of APIs for converting charsets
09:53 < ted> jboston: are you going to just make a new API?
09:53 < ted> something like nsIProcess2 ?
09:53  * ctyler wonders why MS chose UTF-16. Not big enough to encode the 
09:53 < jorendorff> The problem is drawing the boundary... in particular, you have to use whatever NSPR exposes
09:53 < ted> they committed too early
09:53 < jboston> ted: I think that will happen.
09:53 < ted> and then unicode said oops
09:54 <@humph> jorendorff: or change nspr
09:54 <@ctyler> ah
09:54 < jorendorff> ctyler: that decision predates Unicode being >16bits
09:54 < ted> of course, using UCS-4 natively is sort of insane
09:54 <@ctyler> Unicode was always >16 bits, it's just the BMP that was <16
09:54 < ted> i'm pretty sure glibc does that
09:55 < ted> "let's use 4x the memory of ascii just in case we have to support insane non-BMP characters!"
09:55 < mhoye> you're not seriously defending ASCII, are you?
09:55 <@humph> ted: that's how you sell new machines with more ram, note.
09:55 < jorendorff> ctyler: i... that is inconsistent with my vague understanding of the history
09:55  * humph just read knuth ranting about 64-bit pointers being a sin for the same reason :)
09:56 < ted> mhoye: no, i support UTF-8
09:56 < ted> all the compatibility without paying the insane memory cost
09:56 < mhoye> ?
09:56 < ted> of UCS-2
09:56 < ted> er
09:56 < ted> UCS-4
09:56 < mhoye> Man, memory is free. 
09:56 < ted> sez you
09:57 < mhoye> At least as far as text data is concerned.
09:57 < mhoye> Hellz, yeah.
09:57 <@ctyler> imho, the only sane options are UTF-8 (decent size for most data streams) and UCS-4/UTF-32 (no escape tokens to parse)
09:57  * ted shudders to think of what mozilla's memory footprint would look like if we used UCS-4 natively
09:57 < ted> mhoye: databases?
09:57 < ted> ctyler: i agree
09:57 < ted> i think UCS-4 has its place, if you know you're going to be dealing with lots of non-ascii data
09:59 < jorendorff> jboston: but we digress
09:59 < ted> yeah
09:59 < jboston> I'm wondering what I can change in the NSPR? I don't want to break things.
10:00 < jorendorff> jboston: back to first principles - we definitely want to support launching a process by providing a bunch of strings
10:00 < jboston> Well, that's possible if you don't use Japanese.
10:00 < ted> NSPR is just code :)
10:01 <@humph> it's just macros, actually :)
10:01 < jorendorff> jboston: Suppose one of the strings contains Japanese
10:01 < jorendorff> like,  Popen(['hg', 'commit', '-u', username])
10:01 < jorendorff> jboston: You have some working code -- what are you doing right now?
10:02 < jboston> 
10:02 < jboston> That basically just fixes nsIProcess so that you can start and stop a process. Nothing else.
10:03 <@humph> jorendorff: he's been trying to decide how to approach this, from js-api level or up from nspr.  the path is not clear atm
10:03 < jorendorff> So, is the NSPR process management stuff just totally undocumented?
10:04 < jboston> No. There
10:04 < jboston> There's some stuff at devmo.
10:04 < jboston>
10:04 < jboston> But it's the usual terse description.
10:04 < jorendorff> Right, I see that, but
10:04 < jorendorff> is empty
10:05 < jorendorff> OK, does NSPR deal with character encodings anywhere else?
10:05  * jorendorff doesn't see it if so
10:05 <@humph> I thought there was something with filenames
10:06 < jboston> I think so. 
10:07 < ted>
10:07 < ted> not all of the NSPR docs have made it to MDC yet
10:07 < jorendorff> well that sucks!
10:08 < ted> yep
10:08 < ted> should get sheppy to fix that
10:08  * humph tries to use ted's voice
10:08 < jorendorff> we should double the size of our doc team... to 2
10:08 <@humph> "it's a wiki! fix it!"
10:08 < ted> hah
10:08 < jboston> I'm fishing around in the code looking for unicode stuff. Here's something:
10:08 < ted> humph: yeah, but migrating lots of docs over is a better task for someone dedicated
10:09 <@humph> jboston: yes that's what I remember
10:09 <@humph> ted: for sure
10:09 <@humph> actually, that could be a good project for our doc writing team
10:09 < jorendorff> ok, I'm searching for where this stuff is implemented...
10:09 < jorendorff> humph: gosh yes
10:09  * humph sends a mail
10:12  * jorendorff sees #define _MD_OPEN_FILE    _PR_MD_OPEN_FILE
10:12 < jorendorff> and vice versa!
10:12 <@humph> it's macro mania
10:12 < jboston> Experimental:
10:13 < ted> jboston: you could email wtc and ask him about these things
10:13 < jboston> There are a lot of defines. You have to go through 3 or 4 levels to reach the thing being defined.
10:13 < ted> if you're interested in modifying NSPR
10:13 < jorendorff> good idea...
10:13 < jboston> Who is wtc?
10:13 < jorendorff> jboston: yeah, I just needed to poke around a little and find stuff
10:13 < jorendorff> I see the implementation now
10:14 < jorendorff> maze of twisty little passages -- it happens, not necessarily for bad reasons
10:14 <@humph> yeah, code grows hair
10:15 < ted> jboston: Wan-Teh Chang, the NSPR owner
10:15 < ted>
10:15 < jboston> ted: thanks.
10:15 < ted> he doesn't irc much, but he's responsive to email
10:15 < jorendorff> The problem with making any kind of change to NSPR is that there are more platforms than any human can understand and test
10:16 < ted> sure
10:16 < jorendorff> and my impression is that they really really don't want regressions, but I'm sure I'd have a friendlier impression if I'd actually spoken to any of them
10:16 < ted> nspr is used in other projects, afaik
10:16 < jboston> It is.
10:17 < ted> i've worked with wtc to get fixes to the NSPR build system that i needed
10:17 < ted> he's pretty helpful
10:18 < jorendorff> so, I see an implementation of _PR_MD_OPEN_FILE_UTF16 in w95io.c
10:18 < jorendorff> but not in ntio.c
10:18 < jorendorff> which worries me a touch
10:19 < jboston> Perhaps utf16 is default for nt?
10:19 < ted> well, it is
10:19 < ted> but does NSPR know that? :)
10:19 < jorendorff> jboston: at the OS level, but not in NSPR
10:22 < jboston> jorendorff: Yes. The nt function only takes a char*. Hrm...
10:22 < jorendorff> jboston: So, regarding stdin/stdout... totally agree with ted that you should just produce byte streams,
10:22 < jorendorff> and use JS, and existing classes, to
10:22 <@humph> can I suggest that we get all of this into a suitable bug?
10:23 < jorendorff> provide text streams as desired
10:23 < jboston> So I should ask wtc if I can create PR_MD_OPEN_FILE_UTF16 for nt. That sort of thing.
10:23 < jorendorff> command lines and filenames are a separate thing
10:23 < jboston>
10:23 < jorendorff> i'll write all this in the bug in a sec
10:23 < firebot> jboston: Bug 459572 nor, --, ---,, UNCO, PR_CreateProcess in NSPR needs unicode support
10:24 < jorendorff> the thing about this is, mostly we're interested in JS users, who just want to pass JS strings
10:24 < jorendorff> which we should treat as UTF-16.
10:24 < jorendorff> It would be nice if NSPR supported that.  Then you wouldn't have to worry about it.
10:25 < jorendorff> ted: what's our usual XPCOM class for filenames in Moz?
10:25  * jorendorff can never remember
10:25 < jorendorff> nsIProcess knows it...
10:25 < jorendorff> s/class/interface/
10:26 <@humph> nsIFile?
10:26 < jboston> nsIFile
10:27 < ted> jorendorff: XPCOM does pretty seamless translation from JS strings to nsString
10:27 < ted> which is in turn pretty easy to get to whatever encoding you want
10:27 < jorendorff> all the implicitness makes my head hurt, but yeah
10:28 < jboston> Getting wide characters from js into xpcom is easy. But then how to pass them to NSPR?
10:28 < jorendorff> the thing is:
10:29 < jorendorff> (the preceding comment explains that warning somewhat)
10:29 < ted> well yeah, mac classic had that problem
10:29 < ted> not sure it's relevant for mozilla
10:30 < ted> nsLocalFileMac might still hold a FSref or something
10:30 < jorendorff> yeesh, does NSPR ever end-of-life anything?
10:30 < ted> ostensibly you can have paths on windows that aren't really representable by pathnames as well, like the Control Panel
10:30  * jorendorff looks surprised
10:31 < jorendorff> I thought everything on Windows could be represented by a path somehow, but it's all kind of mysterious
10:31 < jboston> I think osx is using the unix implementation in the nspr.
10:31 < jboston> There's some stuff with paths that has to be handled:
10:32 < ted>
10:32 < jorendorff> oh, that's not what I meant
10:32 < jorendorff> I meant that there are filenameoids on Windows for stuff like devices and registry keys
10:33 < jorendorff> names that you can use to attach permissions and stuff
10:33 < jboston> Am I going the correct route using the nspr at all? I think it makes design sense.
10:35 < jorendorff> judgement call
10:35 < jorendorff> you have some working code, which tends to make me believe you're on the right track :)
10:35 < ted> jorendorff: ah, yeah
10:35 < ted> the shell deals with PIDLs though
10:36 < jorendorff> jboston: ok, so I suspect that you'll have OS-specific code eventually anyway
10:36 < jorendorff> unless NSPR wants to add some features.
10:36 < jboston> They must want unicode? Everybody wants unicode.
10:36 < jorendorff> The reason I think this is because I think you want something like Python's shell=True option
10:37 < jorendorff> right, I figure NSPR probably wouldn't mind adding UTF-16 APIs, it's worth asking
10:37 < jorendorff> But
10:38 < jorendorff> it's unobvious what those APIs should do on POSIX, though for any given UNIX it's pretty straightforward, if clunky
10:38 < jorendorff> what I would do
10:38 < jorendorff> in terms of implementing the UTF-16 API on a UNIX
10:38 < jorendorff> is, first convert it to wchar_t if wchar_t is not already UTF-16 on that platform; then use wcstombs
10:39 < jorendorff> and pass the resulting char string to the relevant UNIX api.
10:39 < jorendorff> shell=True is very much a separate issue; NSPR probably doesn't want it.
10:39  * jorendorff doesn't know
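
The POSIX-side conversion jorendorff outlines at 10:38 (UTF-16 in, locale-encoded char string out) can be sketched in Python, the language of the Popen example earlier in the log. This is only an analogue: the C route he names goes through wchar_t and wcstombs, whereas this sketch decodes the UTF-16 code units and re-encodes them in the locale's preferred charset. The function name utf16_units_to_platform_bytes and its encoding parameter are made up for illustration; they are not NSPR or Mozilla APIs.

```python
import locale
import struct


def utf16_units_to_platform_bytes(units, encoding=None):
    """Convert a sequence of UTF-16 code units (ints 0..0xFFFF) to bytes
    in the platform charset. Hypothetical helper, for illustration only."""
    if encoding is None:
        # Assumption: the locale's preferred encoding is what the
        # 8-bit UNIX APIs (exec, argv) expect on this machine.
        encoding = locale.getpreferredencoding(False)
    # Pack the 16-bit units as little-endian bytes, then decode as UTF-16.
    # Surrogate pairs are combined here; a lone surrogate raises
    # UnicodeDecodeError.
    raw = struct.pack("<%dH" % len(units), *units)
    text = raw.decode("utf-16-le")
    # Raises UnicodeEncodeError if the text cannot be represented
    # in the target charset.
    return text.encode(encoding)


# Two BMP characters (U+65E5, U+672C) encoded for a UTF-8 platform.
argv_bytes = utf16_units_to_platform_bytes([0x65E5, 0x672C], "utf-8")
```

The failure case is the interesting design question: when the platform charset cannot represent the string, the sketch raises instead of substituting characters, on the assumption that a silently altered filename or argument is worse than an error.
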
10:42 < jboston> I think that I will try to implement ipc without unicode before moving on to unicode.
10:55 < mhoye>
11:00 < jorendorff> jboston is right, byte streams first
11:05 < ted> well, as stated, there are two issues here
11:05 < ted> the encoding of the file names/command line args
11:06 < ted> and the encoding of the stdio
11:06 < ted> the first is kind of hard
11:06 < ted> the second we already have plenty of ways to work with in the tree
11:09 < bsmedberg> yeah, the filename/commandline args are more important to me
11:10 < bsmedberg> the nsIScriptable{Input,Output}Stream interfaces mostly take care of the stream stuff
11:10 < bsmedberg> although bytearray would be nice
11:10 < jorendorff> yeah, bytearray :|
11:10 < jorendorff> do we have a bug on that?
11:11 < jorendorff> "Bug xxx - can i has bytearray"
11:11  * humph would love if it was a 3 digit bug num
11:11 < bsmedberg> does ES3.1 have a spec for ByteArray?
11:11 < ted> have you ever seen my pure JS+XPCOM EXIF parser?
11:12  * jorendorff pastes bsmedberg's question in #jslang
11:15 < jorendorff> mailing list suggests no...
11:18 < jboston> jorendorff: I just read your comment on my blog. Very informative, thanks. I'll have to investigate that further.
11:19 < jorendorff> yeah, i can't honestly tell if that code really does what they claimed it did
11:19 < jorendorff> but it seemed like my experience was worth sharing anyway :-P
11:20 < jboston> I will look through the nspr code to see how/if they handle that problem.
12:06 < bsmedberg> jboston: I think that NSPR will hold back progress significantly
12:06 < jboston> Do you recommend bypassing nspr?
12:07 < bsmedberg> yes, probably
12:07 < bsmedberg> it took nearly a year to get PR_LoadLibraryWithFlags to accept wide-character paths
12:08 < bsmedberg> and the WithFlags API already existed, we were just adding a new flag
12:08 <@humph> holy crap
12:10 < jboston> Oh dear. Well, for the filename + arguments problem I can do it another way. But if the i/o stuff is a stream of bytes that should be ok?
12:11 < jboston> I'll try to do it a way where process creation can be swapped out from one to the other as the situation evolves.
12:26 < jorendorff> i/o = stream of bytes is not just ok but a hard requirement, anything else is nuts
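
The split ted and jorendorff argue for, raw byte streams from the process with text decoding layered on top, might look like this in Python. A rough sketch only: run_and_capture_bytes is a hypothetical helper, the child command is a stand-in that writes two UTF-8 encoded characters, and the choice of UTF-8 for decoding is an assumption made by the caller, not something the pipe knows.

```python
import subprocess
import sys


def run_and_capture_bytes(argv):
    """Launch a process from a list of strings and return
    (exit code, raw stdout bytes). No decoding happens here."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return proc.returncode, out


# Stand-in child: a Python one-liner that writes the UTF-8 bytes
# for U+65E5 U+672C to its stdout.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'\\xe6\\x97\\xa5\\xe6\\x9c\\xac')",
]

code, raw = run_and_capture_bytes(child)

# Text is a separate layer: the caller picks the encoding.
text = raw.decode("utf-8")
```

Keeping the process layer byte-only means binary output passes through untouched, and the encoding decision stays with the code that actually knows what the child emits.
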

Jason Orendorff's comment on my blog dealing with children inheriting handles from parents: