16527 – Wishlist: Make str() faster for long strings

For information on using this system, please visit this page.

Bug 16527 - Wishlist: Make str() faster for long strings

Summary: Wishlist: Make str() faster for long strings

Status:	CLOSED FIXED

Alias:	None

Product:	R
Classification:	Unclassified
Component:	Wishlist (show other bugs)
Version:	R-devel (trunk)
Hardware:	All All

Importance:	P5 enhancement
Assignee:	R-core

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-09-01 14:15 UTC by Mikko Korpela
Modified:	2016-06-02 10:08 UTC (History)
CC List:	1 user (show)

See Also:

Attachments
Proposed patch (2.68 KB, patch) 2015-09-01 14:15 UTC, Mikko Korpela	Details \| Diff
Test code (3.31 KB, text/x-r-source) 2015-09-01 14:21 UTC, Mikko Korpela	Details
Precomputed test results (i5-3470 CPU) (11.24 KB, application/x-r-data) 2015-09-01 14:24 UTC, Mikko Korpela	Details
Boxplot by test code (i5-3470 CPU) (27.81 KB, image/png) 2015-09-01 14:35 UTC, Mikko Korpela	Details
Violin plot by test code (i5-3470) (42.77 KB, image/png) 2015-09-01 14:45 UTC, Mikko Korpela	Details
Updated test code (3.33 KB, text/plain) 2015-09-01 15:39 UTC, Mikko Korpela	Details
Fixed patch (2.70 KB, patch) 2015-09-01 15:58 UTC, Mikko Korpela	Details \| Diff
Boxplot with explicit vec.len=4 (i5-3470) (27.50 KB, image/png) 2015-09-02 12:35 UTC, Mikko Korpela	Details
Precomputed test results, vec.len=4 (i5-3470) (11.12 KB, application/x-r-data) 2015-09-02 12:49 UTC, Mikko Korpela	Details
Zip of str.default.2.R, test_str.R & str_bench.rds (19.61 KB, application/octet-stream) 2016-05-31 14:10 UTC, Martin Maechler	Details
Trivial changes to str.default.2.R (1.10 KB, patch) 2016-06-01 13:17 UTC, Mikko Korpela	Details \| Diff
Show Obsolete (2) Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Mikko Korpela 2015-09-01 14:15:04 UTC

Created attachment 1901 [details]
Proposed patch

The attached patch for "R Under development (unstable) (2015-08-31 r69232)" improves the speed of 'str(x)' (str.default) when 'x' is a character vector with long strings. The improvement is due to calling encodeString() on substrings instead of whole strings. There is a small performance penalty for short strings. Some of it could be avoided if encodeString() reliably handled non-printable characters on all platforms and locales, but that may be a difficult goal to achieve. One of the problem characters on the test platform is 'intToUtf8(8203)', the zero-width space.

The patch is not an improvement for some (possibly rare) cases where 'encodeString(x)' passes non-printable characters through unchanged, and 'nchar(encodeString(x), type = "width")' considers some of the characters to have zero or negative width, which causes the result to appear narrower than required. As a fallback, the patched code will then discard the encodeString() result of the substring and encode the whole string instead, like the existing implementation.

The patch also has a small change to the code that determines the number of elements to be shown: the number of characters (type = "width") in a double quoted encoded string is counted as 2 (the quotes) or more. Normally this makes no difference, but the existing implementation may choose to display a larger number of elements when nchar() gets confused by non-printable characters and returns a number smaller than 2.

A patched R including the new str.R (referred to as str2.R in the patch file) passes all tests in 'make check'.

> sessionInfo()
R Under development (unstable) (2015-08-31 r69232)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] microbenchmark_1.4-2 multcomp_1.4-1       TH.data_1.0-6       
[4] survival_2.38-3      mvtnorm_1.0-3       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0      lattice_0.20-33  codetools_0.2-14 zoo_1.7-12      
 [5] digest_0.6.8     MASS_7.3-44      grid_3.3.0       plyr_1.8.3      
 [9] gtable_0.1.2     magrittr_1.5     scales_0.3.0     ggplot2_1.0.1   
[13] stringi_0.5-5    reshape2_1.4.1   sandwich_2.3-3   proto_0.3-10    
[17] splines_3.3.0    tools_3.3.0      stringr_1.0.0    munsell_0.4.2   
[21] colorspace_1.2-6

Comment 1 Mikko Korpela 2015-09-01 14:21:02 UTC

Created attachment 1902 [details]
Test code

Benchmark the original str.default() vs the patched version and plot the results. The code can load precomputed results from an .RData file.

Comment 2 Mikko Korpela 2015-09-01 14:24:03 UTC

Created attachment 1903 [details]
Precomputed test results (i5-3470 CPU)

This file is compatible with the previously attached test code.

Comment 3 Mikko Korpela 2015-09-01 14:35:48 UTC

Created attachment 1904 [details]
Boxplot by test code (i5-3470 CPU)

The results are presented as comparable pairs, reading from left to right: original (input 1), patched (input 1), original (input 2), patched (input 2), etc. The clarity suffers because of the wide scale in the results. 

The left half shows results for inputs with 1 string. The inputs used in the right half have 26 (identical) strings. The number of characters is increasing from left to right, in each half: 1, 26, 2600, 260000, 26000000.

As an extreme example, the median (100 repeats) runtime in the 26 * 26000000 case decreases from 278 seconds to less than 1 second.

Comment 4 Mikko Korpela 2015-09-01 14:45:22 UTC

Created attachment 1905 [details]
Violin plot by test code (i5-3470)

This picture presents a subset of the results, concentrating on the cases where the patched str.default() is slower than the original. The increase percentages are median vs median. As the timings are in the sub-millisecond scale, the performance loss is quite insignificant.

Comment 5 Mikko Korpela 2015-09-01 15:39:16 UTC

Created attachment 1906 [details]
Updated test code

Previous version was missing 'library(vioplot)'.

Comment 6 Mikko Korpela 2015-09-01 15:58:12 UTC

Created attachment 1907 [details]
Fixed patch

The original patch had a mistake in the branch that handles the case where the "vec.len" argument has been used.

Comment 7 Mikko Korpela 2015-09-02 12:35:00 UTC

Created attachment 1908 [details]
Boxplot with explicit vec.len=4 (i5-3470)

str.default() detects if the "vec.len" argument has been specified. This plot was done by modifying the previously attached test code so that 'vec.len = 4' is used in each function call compared by microbenchmark(). As a highlight, the median runtime of the 26 * 26000000 case dropped from 109 to 0.13 seconds, original vs. patched.

The options affecting str.default() were at their default values when running these and previous tests:

> getOption("str")
$strict.width
[1] "no"

$digits.d
[1] 3

$vec.len
[1] 4

> getOption("width")
[1] 80

Comment 8 Mikko Korpela 2015-09-02 12:49:11 UTC

Created attachment 1909 [details]
Precomputed test results, vec.len=4 (i5-3470)

The results corresponding to the previous boxplot, with an explicit 'vec.len = 4' used in the calls to str.default().

Moving from the original str.default() to the patched version, we see the following relative changes in the median runtime when looking at the input sizes selected for the earlier violin plot: +9%, +9%, +10%, -25%. When the patched version is slower, it is not too much slower.

Comment 9 Martin Maechler 2015-09-08 16:56:52 UTC

Thank you for your careful proposal including checks etc.

We are first looking into the other bogous behavior, you've mentioned about
'nchar(intToUtf8(8203), "width")'  to be negative.

That may (or may not) simplify the patch needed for the speedup of  str.default().

Thank you indeed,
Martin

Comment 10 Martin Maechler 2016-05-31 14:06:04 UTC

(In reply to Martin Maechler from comment #9)
> Thank you for your careful proposal including checks etc.
> 
> We are first looking into the other bogous behavior, you've mentioned about
> 'nchar(intToUtf8(8203), "width")'  to be negative.
> 
> That may (or may not) simplify the patch needed for the speedup of 
> str.default().
> 
> Thank you indeed,
> Martin

Time has passed and the nchar() problem has indeed been solved a while ago.

I still keep a version of your patch ("pretty edited", some extra comments, also in line with the small str() changes since...)  in my sources, and I would like to close this by really implementing such a speedup for such large strings.

In the mean time, str() and R too has improved a bit, but current str.default()
is still very slow for large strings and much faster after your patch.

I'll attach my current version of test_str.R  and of str.default.2.R 
so you (and other readers) can work with it.
Currently I'd hope for a much more concise patch with basically the same effect,
as it is a bit problematic to add much R code just for dealing with this rare extreme situation.

Comment 11 Martin Maechler 2016-05-31 14:10:15 UTC

Created attachment 2099 [details]
Zip of str.default.2.R,  test_str.R & str_bench.rds

Updated 'test_str.R' (leave away the longest lasting case)
updated  str.default.2.R ( = str.default *my* version)
precomputed microbenchmarks 'str_bench.rds'

Comment 12 Mikko Korpela 2016-06-01 13:17:26 UTC

Created attachment 2100 [details]
Trivial changes to str.default.2.R

This attachment is a patch to make str.default.2.R a tiny bit more concise (removing an unnecessary assignment, etc.), but I am not able to come up with more fundamental changes at this point.

It's nice that the negative width issue has been solved. However, it is probably impossible to avoid checking the display width, i.e. nchar(type = "w"), of the encoded string, and then possibly resorting to encoding the whole string. A small scale (but extreme) example:

> multi_diacritics <- paste0("a", intToUtf8(768:783), "b")
> nchar(multi_diacritics)
[1] 18
> nchar(multi_diacritics, "width")
[1] 2
> nchar(substr(multi_diacritics, 1, 17), "width")
[1] 1
> encodeString(multi_diacritics) == multi_diacritics
[1] TRUE

Here we stack multiple diacritics on the letter "a", followed by the letter "b". Any proper non-empty prefix of 'multi_diacritics', up to substr(multi_diacritics, 1, 17), has a display width of 1. The same is true for the corresponding encoded version. In this case, the whole string is needed for a result with a display width of 2. The way encodeString() works here (it does nothing) seems correct because diacritics are printable characters.

Comment 13 Martin Maechler 2016-06-02 10:08:58 UTC

I have now committed (to R-devel) a version of str.default()  which *is* fast,
and seems to need less "slack" (could probably be decreased even more?), hopefully working fine also in the case of diacriticals or other mult-byte characters, by using 'strtrim(C, n)' instead of  'substr(C, 1, n)'.

------------------------------------------------------------------------
r70698 | maechler | 2016-06-02 12:01:11 +0200 (Thu, 02 Jun 2016) | 1 line
Changed paths:
   M /trunk/doc/NEWS.Rd
   M /trunk/src/library/utils/R/str.R
   M /trunk/tests/reg-tests-3.R

str() on large strings is much faster now and string truncation is more accurate
------------------------------------------------------------------------

My current plan is to also port it to 'R patched', possibly in time for R 3.3.1,
or then a bit later.

Thank you Mikko for the suggestions, code and test file proposals!

I'm closing the bug now, as the *speed* problem is solved.
If there are still problems with multibyte chars, I'd want you to open a new bug report.