build: encode non-ASCII Latin1 characters as one byte in JS2C by joyeecheung · Pull Request #51605 · nodejs/node
added c++ labels Jan 30, 2024
joyeecheung changed the title from "tools: encode non-ASCII Latin1 characters as one byte in JS2C" to "build: encode non-ASCII Latin1 characters as one byte in JS2C"
Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte string (interpreted as a uint8_t array during loading).
2. If a file contains any character with a code point above 127, encode it as a two-byte string (interpreted as a uint16_t array during loading).

This was done because V8 only supports Latin-1 and UTF-16 as the underlying representations for strings. To store the JS code as external strings, which saves encoding cost and memory overhead, we need to follow the representations supported by V8.

Notice that this left a gap: characters in the non-ASCII part of the Latin-1 range (128-255) were encoded as two-byte, which was an undocumented TODO for a long time. That was fine previously because the files that contained code points beyond the 0-127 range also contained code points >255. Now we have undici, which contains only code points in the range 0-255 (apart from one replaceable code point >255). So this patch adds handling for the 128-255 range to reduce the size overhead caused by encoding such files as two-byte. This could reduce the size of the binary by ~500KB, and it helps future files that contain these kinds of code points.

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1-only string. That replacement could be removed if undici updates itself to replace this character in the comment.
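To make the three cases concrete, here is a minimal Python sketch of the classification described above. It is not the actual JS2C code: the names `Encoding`, `classify`, and `embedded_size` are hypothetical, and the size estimate simply assumes one byte per character for one-byte strings and two bytes per UTF-16 code unit otherwise.

```python
from enum import Enum


class Encoding(Enum):
    """Hypothetical labels for the three cases discussed in the description."""
    ASCII = "ascii"        # every code point is <= 127
    LATIN1 = "latin1"      # every code point is <= 255, at least one is > 127
    TWO_BYTE = "two-byte"  # at least one code point is > 255


def classify(source: str) -> Encoding:
    """Pick the narrowest category that can hold every character."""
    highest = max(map(ord, source), default=0)
    if highest <= 0x7F:
        return Encoding.ASCII
    if highest <= 0xFF:
        return Encoding.LATIN1
    return Encoding.TWO_BYTE


def embedded_size(source: str, latin1_as_one_byte: bool) -> int:
    """Estimate the bytes needed to embed the source.

    latin1_as_one_byte=False models the old behavior (anything non-ASCII
    becomes two-byte); True models the behavior this patch describes.
    """
    kind = classify(source)
    one_byte = kind is Encoding.ASCII or (
        kind is Encoding.LATIN1 and latin1_as_one_byte
    )
    if one_byte:
        return len(source.encode("latin-1"))
    return len(source.encode("utf-16-le"))


if __name__ == "__main__":
    sample = "// caf\u00e9\n"  # contains U+00E9, which is inside 128-255
    print(classify(sample))
    print(embedded_size(sample, latin1_as_one_byte=False))
    print(embedded_size(sample, latin1_as_one_byte=True))
```

Under these assumptions, a file whose only non-ASCII characters fall in 128-255 takes half the space once it is stored as one-byte. A single character such as `’` (U+2019) is enough to push the whole file back into the two-byte case, which is why the drive-by replacement matters.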
targos pushed a commit that referenced this pull request
Feb 19, 2024
PR-URL: #51605
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>
zcbenz added a commit that referenced this pull request
Feb 23, 2024
marco-ippolito pushed a commit that referenced this pull request
Feb 26, 2024
marco-ippolito pushed a commit that referenced this pull request
Feb 26, 2024
marco-ippolito pushed a commit that referenced this pull request
Feb 27, 2024
rdw-msft pushed a commit to rdw-msft/node that referenced this pull request
Mar 20, 2024
richardlau pushed a commit that referenced this pull request
Mar 25, 2024
richardlau pushed a commit that referenced this pull request
Mar 25, 2024
richardlau pushed a commit that referenced this pull request
Mar 25, 2024
richardlau pushed a commit that referenced this pull request
Mar 25, 2024
rdw-msft pushed a commit to rdw-msft/node that referenced this pull request
Mar 26, 2024
This is a follow-up to nodejs#51605.

PR-URL: nodejs#51818
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
jkleinsc pushed a commit to electron/electron that referenced this pull request
Apr 17, 2024

* chore: bump node in DEPS to v20.12.0
* chore: update build_add_gn_build_files.patch
* chore: update patches
* chore: bump node in DEPS to v20.12.1
* chore: update patches
* build: encode non-ASCII Latin1 characters as one byte in JS2C nodejs/node#51605
* crypto: use EVP_MD_fetch and cache EVP_MD for hashes nodejs/node#51034
* chore: update filenames.json
* chore: bump node in DEPS to v20.12.2
* chore: update patches
* src: support configurable snapshot nodejs/node#50453
* test: remove test-domain-error-types flaky designation nodejs/node#51717
* src: avoid draining platform tasks at FreeEnvironment nodejs/node#51290
* chore: fix accidentally deleted v8 dep
* lib: define FormData and fetch etc. in the built-in snapshot nodejs/node#51598
* chore: rebase on main
* chore: remove stray log

---------

Co-authored-by: electron-roller[bot] <84116207+electron-roller[bot]@users.noreply.github.com>
Co-authored-by: Cheng <zcbenz@gmail.com>
Co-authored-by: Shelley Vohr <shelley.vohr@gmail.com>
Co-authored-by: PatchUp <73610968+patchup[bot]@users.noreply.github.com>
codebytere added a commit to electron/electron that referenced this pull request
Jun 1, 2024

* chore: bump node in DEPS to v20.13.1
* chore: bump node in DEPS to v20.14.0
* chore: update build_add_gn_build_files.patch
* chore: update patches
* chore: update patches
* build: encode non-ASCII Latin1 characters as one byte in JS2C nodejs/node#51605
* crypto: use EVP_MD_fetch and cache EVP_MD for hashes nodejs/node#51034
* chore: update filenames.json
* chore: update patches
* src: support configurable snapshot nodejs/node#50453
* test: remove test-domain-error-types flaky designation nodejs/node#51717
* src: avoid draining platform tasks at FreeEnvironment nodejs/node#51290
* chore: fix accidentally deleted v8 dep
* lib: define FormData and fetch etc. in the built-in snapshot nodejs/node#51598
* chore: remove stray log
* crypto: enable NODE_EXTRA_CA_CERTS with BoringSSL nodejs/node#52217
* test: skip test for dynamically linked OpenSSL nodejs/node#52542
* lib, url: add a `windows` option to path parsing nodejs/node#52509
* src: use dedicated routine to compile function for builtin CJS loader nodejs/node#52016
* test: mark test as flaky nodejs/node#52671
* build,tools: add test-ubsan ci nodejs/node#46297
* src: preload function for Environment nodejs/node#51539
* deps: update c-ares to 1.28.1 nodejs/node#52285
* chore: fixup
* events: extract addAbortListener for safe internal use nodejs/node#52081
* module: print location of unsettled top-level await in entry points nodejs/node#51999
* fs: add stacktrace to fs/promises nodejs/node#49849
* chore: fixup indices

---------

Co-authored-by: electron-roller[bot] <84116207+electron-roller[bot]@users.noreply.github.com>
Co-authored-by: Cheng <zcbenz@gmail.com>
Co-authored-by: Shelley Vohr <shelley.vohr@gmail.com>
Co-authored-by: PatchUp <73610968+patchup[bot]@users.noreply.github.com>