UnicodeString-Optimierung für StrToInt

Website · Beitrag von **Socke** » Fr 27. Jan 2017, 23:30

Guten Tag zusammen,

ich möchte eine für UnicodeString optimierte Version von StrToInt mit euch teilen.

Hintergrund
Aktuell arbeite ich mit einer Bibltiothek die ausschließlich mit UTF-16 kodierten Zeichenketten arbeitet und dort viele Zahlen, Datums- und Zeitangaben als UTF-16 kodierte Zeichenketten verwendet werden.
Verwendet man die Funtion StrToInt aus der Unit SysUtils, müssen die Zeichenketten zuerst konvertiert werden. Zwar spielt die Ziel-Code-Page in der Regel keine Rolle (da die betroffenen Zeichen oft den gleichen Zahlwert haben), doch benötigt diese Konvertierung Rechenleistung.
Da alle Zeichen, die für die Darstellung von Zahlen benötigt werden, in der Basic Multilingual Plane liegen, ändert sich nichts an den Algorithmen. Es müssen nur andere

Features und Einschränkungen

Funktioniert wie StrToInt - verwendet aber UnicodeString
Bei Fehlern (ungültige Zahl, außerhalb des Gültigkeitsbereichs) wird EConvertError ausgelöst
Binärzahlen
Dezimalzahlen
Oktalzahlen
Hexadezimalzahlen
Grundlegende Richtigkeit mit FPCUnit getestet (falls Fehler enthalten sind, bitte den Eingabewert mitteilen)
Es gibt nur eine LongInt-Version; für andere muss der Ergebnis-Typ und die Eingrenzung in ParseOctal geändert werden.
Es gibt nur die StrToInt-Version (kein TryStrToInt, StrToIntDefoder die System-Funktion val ohne Exceptions)

Laufzeitverhalten
Getestet wurde mit langen und kurzen Binär-, Oktal-, Dezimal- und Hexadezimalzahlen. Die Funktion wurde für jeden Wert sehr oft* aufgerufen und mit GetTickCount64 die Zeit über alle Iterationen gemessen. Daher gebe ich keine absoluten Zahlen an, sondern nur das Verhältnis zum Aufruf von StrToInt aus der SysUtils Unit (Benötigte Zeit Unicode-Optimiert / Benötigte Zeit SysUtils.StrToInt).

Code: Alles auswählen

Format| Zeit | Wert
------+------+----------------------------------
DEC   | 45 % | 2147483647
DEC   | 38 % | 1234
DEC   | 33 % | 1
BIN   | 39 % | %11111111111111111111111111111111
OCT   | 25 % | &37777777777
HEX   | 20 % | $12345678
HEX   |  8 % | $FF

* = Getestet hatte ich sowohl mit $FFFF * 1000 Aufrufen als auch mit $FFFF * 5000 Aufrufen; wobei das Verhältnis zwischen den Funktionen unabhängig vom Faktor 1000 oder 5000 ist.

Lizenz
Es wird die selbe Lizenz wie bei der LCL oder RTL verwendet: die LGPL mit Linking Exception.

Quelltext

Code: Alles auswählen

{ UnicodeString (UTF-16) optimized StrToInt
 
  Copyright (C) 2017 Simon Ameis <simon.ameis@web.de>
 
  This library is free software; you can redistribute it and/or modify it
  under the terms of the GNU Library General Public License as published by
  the Free Software Foundation; either version 2 of the License, or (at your
  option) any later version with the following modification:
 
  As a special exception, the copyright holders of this library give you
  permission to link this library with independent modules to produce an
  executable, regardless of the license terms of these independent modules,and
  to copy and distribute the resulting executable under terms of your choice,
  provided that you also meet, for each linked independent module, the terms
  and conditions of the license of that module. An independent module is a
  module which is not derived from or based on this library. If you modify
  this library, you may extend this exception to your version of the library,
  but you are not obligated to do so. If you do not wish to do so, delete this
  exception statement from your version.
 
  This program is distributed in the hope that it will be useful, but WITHOUT
  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
  FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License
  for more details.
 
  You should have received a copy of the GNU Library General Public License
  along with this library; if not, write to the Free Software Foundation,
  Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 
  Version:
  2017-01-28 v1 Initial Publishing
  2017-01-28 v2 Added missing POP compiler instruction
}     
// enables inlining for embedded procedures
// this save some time as calling functions is very expensive
{$DEFINE IntToStr_SUB_INLINE}
uses SysUtils, RtlConsts;
function StrToInt(const aStr: UnicodeString): LongInt;
var
  StartIdx: SizeInt;
  v: LongInt;
procedure ParseBinary;{$IFDEF IntToStr_SUB_INLINE} inline; {$ENDIF}
var
  i: SizeInt;
begin
  if Length(aStr) > BitSizeOf(Result) + 1 then
    raise EConvertError.CreateFmt('The binary string "%s" is too long for LongInt.', [aStr]);
  if Length(aStr) < 2 then // must be at least % + hex char
    raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
 
  for i := StartIdx to Length(aStr) do
  begin
    Result := Result shl 1;
    case aStr[i] of
      UnicodeChar('0'): ; // 0 already shifted in; but use this for validity check
      UnicodeChar('1'): Result := Result or $1;
    else
      raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
    end;
  end;
end;
procedure ParseOctal;{$IFDEF IntToStr_SUB_INLINE} inline; {$ENDIF}
var
  i: SizeInt;
begin
  if Length(aStr) < 2 then // must be at least & + hex char
    raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
  if Length(aStr) > 12 then // & char + 11 octal chars = maximum value
    raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
  // &37777777777 = maximum value
  // &40000000000 requires 33 bits
  if (Length(aStr) = 12) and (aStr[2] > UnicodeChar('3')) then
    raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
 
  for i := StartIdx to Length(aStr) do
  begin
    if (aStr[i] >= UnicodeChar('0')) and (aStr[i] <= UnicodeChar('7')) then
      v := Word(aStr[i]) - Word(UnicodeChar('0'))
    else
      raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
    Result := (Result shl 3) or (Byte(v) and &7);
  end;
end;
// Using two different procedure for positive and negative decimal numbers
// is about 25 percent faster as there are less (jump!) instructions within
// the loop
procedure ParseDecimalPositive;{$IFDEF IntToStr_SUB_INLINE} inline; {$ENDIF}
var
  i: SizeInt;
begin
  try
    for i := StartIdx to Length(aStr) do
    begin
      {$PUSH}
      {$OverflowChecks ON}
      Result := Result * 10;
      {$POP}
      if (aStr[i] >= UnicodeChar('0')) and (aStr[i] <= UnicodeChar('9')) then
      begin
        v := Word(aStr[i]) - $0030;
        {$PUSH}
        {$OverflowChecks ON}
        Result := Result + v;
        {$POP}
      end
      else
        raise EConvertError.CreateFmt('Invalid number character at index %d.', [i]);
    end;
  except
    // only EConvertError may be raised, thus catch EIntOverflow
    on EIntOverflow do
      raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
  end;
end;
procedure ParseDecimalNegative;{$IFDEF IntToStr_SUB_INLINE} inline; {$ENDIF}
var
  i: SizeInt;
begin
  try
    for i := StartIdx to Length(aStr) do
    begin
      {$PUSH}
      {$OverflowChecks ON}
      Result := Result * 10;
      {$POP}
      if (aStr[i] >= UnicodeChar('0')) and (aStr[i] <= UnicodeChar('9')) then
      begin
        v := Word(aStr[i]) - $0030;
        {$PUSH}
        {$OverflowChecks ON}
        Result := Result - v;
        {$POP}
      end
      else
        raise EConvertError.CreateFmt('Invalid number character at index %d.', [i]);
    end;
  except
    on EIntOverflow do
      raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
  end;
end;
 
procedure ParseHexadecimal; {$IFDEF IntToStr_SUB_INLINE} inline; {$ENDIF}
var
  i: SizeInt;
begin
  if Length(aStr) > SizeOf(Result) * 2 + 1 then
    raise EConvertError.CreateFmt('The hexadecimal string "%s" is too long for LongInt.', [aStr]);
  if Length(aStr) < 2 then // must be at least $ + hex char
    raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
 
  for i := StartIdx to Length(aStr) do
  begin
    if (aStr[i] >= UnicodeChar('0')) and (aStr[i] <= UnicodeChar('9')) then
      v := Word(aStr[i]) - Word(UnicodeChar('0'))
    else
    if (aStr[i] >= UnicodeChar('A')) and (aStr[i] <= UnicodeChar('F')) then
      v := Word(aStr[i]) - Word(UnicodeChar('A')) + 10
    else
    if (aStr[i] >= UnicodeChar('a')) and (aStr[i] <= UnicodeChar('f')) then
      v := Word(aStr[i]) - Word(UnicodeChar('a')) + 10
    else
      raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
 
    Result := (Result shl 4) or (Byte(v) and $F);
  end;
end;
 
begin
  if Length(aStr) = 0 then
    raise EConvertError.CreateFmt(SInvalidInteger, ['']);
  Result := 0;
 
  StartIdx := 2;
  case aStr[1] of
    UnicodeChar('%'): begin ParseBinary; end;
    UnicodeChar('&'): begin ParseOctal; end;
    UnicodeChar('-'): begin ParseDecimalNegative end;
    UnicodeChar('+'): begin ParseDecimalPositive; end;
    UnicodeChar('$'): begin ParseHexadecimal; end;
  else
    StartIdx := 1;
    ParseDecimalPositive;
  end;
end;

mse · Beitrag von **mse** » Sa 28. Jan 2017, 10:31

Die entsprechende Funktion aus MSEgui arbeitet mit Pointern und ist noch etwas schneller:

Code: Alles auswählen

 
function trystrtoint(const text: msestring; out value: integer): boolean;
const
 max = maxint div 10;
var
 po1: pmsechar;
 neg: boolean;
begin
 result:= false;
 neg:= false;
 value:= 0;
 if text <> '' then begin
  po1:= pointer(text);
  while (po1^ = ' ') or (po1^ = c_tab) do begin
   inc(po1);
  end;
  neg:= po1^ = msechar('-');
  if not neg then begin
   if po1^ = '+' then begin
    inc(po1);
   end;
  end
  else begin
   inc(po1);
  end;
  if po1^ = #0 then begin
   exit;
  end;
  while po1^ <> #0 do begin
   if (po1^ < '0') or (po1^ > '9')  then begin
    exit;
   end;
   if card32(value) > max then begin
    exit;
   end;
   value:= value * 10 + (word(po1^) - word('0'));
   inc(po1);
  end;
 end;
 if neg then begin
  if (value < 0) and (value <> minint) then begin
   exit;
  end;  
  value:= -value;
 end
 else begin
  if value < 0 then begin
   exit;
  end;
 end;
 result:= true;
end;
 
{$define IntToStr_SUB_INLINE}
 
function StrToIntlaz(const aStr: UnicodeString): LongInt;
var
  StartIdx: SizeInt;
  v: LongInt;
// Using two different procedure for positive and negative decimal numbers
// is about 25 percent faster as there are less (jump!) instructions within
// the loop
procedure ParseDecimalPositive;{$IFDEF IntToStr_SUB_INLINE} inline; {$ENDIF}
var
  i: SizeInt;
begin
  try
    for i := StartIdx to Length(aStr) do
    begin
      {$PUSH}
      {$OverflowChecks ON}
      Result := Result * 10;
      {$POP}
      if (aStr[i] >= UnicodeChar('0')) and (aStr[i] <= UnicodeChar('9')) then
      begin
        v := Word(aStr[i]) - $0030;
        {$PUSH}
        {$OverflowChecks ON}
        Result := Result + v;
        {$POP}
      end
      else
        raise EConvertError.CreateFmt('Invalid number character at index %d.', [i]);
    end;
  except
    // only EConvertError may be raised, thus catch EIntOverflow
    on EIntOverflow do
      raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
  end;
end;
procedure ParseDecimalNegative;{$IFDEF IntToStr_SUB_INLINE} inline; {$ENDIF}
var
  i: SizeInt;
begin
  try
    for i := StartIdx to Length(aStr) do
    begin
      {$PUSH}
      {$OverflowChecks ON}
      Result := Result * 10;
      {$POP}
      if (aStr[i] >= UnicodeChar('0')) and (aStr[i] <= UnicodeChar('9')) then
      begin
        v := Word(aStr[i]) - $0030;
        {$PUSH}
        {$OverflowChecks ON}
        Result := Result - v
      end
      else
        raise EConvertError.CreateFmt('Invalid number character at index %d.', [i]);
    end;
  except
    on EIntOverflow do
      raise EConvertError.CreateFmt(SInvalidInteger, [aStr]);
  end;
end;
 
begin
  if Length(aStr) = 0 then
    raise EConvertError.CreateFmt(SInvalidInteger, ['']);
  Result := 0;
 
  StartIdx := 2;
  case aStr[1] of
//    UnicodeChar('%'): begin ParseBinary; end;
//    UnicodeChar('&'): begin ParseOctal; end;
    UnicodeChar('-'): begin ParseDecimalNegative end;
    UnicodeChar('+'): begin ParseDecimalPositive; end;
//    UnicodeChar('$'): begin ParseHexadecimal; end;
  else
    StartIdx := 1;
    ParseDecimalPositive;
  end;
end;
 
procedure tmainfo.exeev(const sender: TObject);
const
 count = 10000000;
var
 i1,i2: int32;
 t1: tdatetime;
begin
 
 t1:= nowutc();
 for i1:= 0 to count-1 do begin
  i2:= strtointlaz('123');
  i2:= strtointlaz('+123');
  i2:= strtointlaz('-123');
  i2:= strtointlaz('123456789');
  i2:= strtointlaz('+123456789');
  i2:= strtointlaz('-123456789');
 end;
 t1:= nowutc()-t1;
 writeln('count: ',count,' laz: ',formatfloatmse(t1*24*60*60,'0.000s'));
 
 t1:= nowutc();
 for i1:= 0 to count-1 do begin
  trystrtoint('123',i2);
  trystrtoint('+123',i2);
  trystrtoint('-123',i2);
  trystrtoint('123456789',i2);
  trystrtoint('+123456789',i2);
  trystrtoint('-123456789',i2);
 end;
 t1:= nowutc()-t1;
 writeln('count: ',count,' mse: ',formatfloatmse(t1*24*60*60,'0.000s'));
end;

Resultat mit -O3 kompiliert:

Code: Alles auswählen

 
count: 10000000 laz: 2.546s
count: 10000000 mse: 0.731s
count: 10000000 laz: 2.506s
count: 10000000 mse: 0.723s
count: 10000000 laz: 2.505s
count: 10000000 mse: 0.746s
count: 10000000 laz: 2.514s
count: 10000000 mse: 0.742s

Die exception frames im Lazarus code sind vermutlich teuer, siehe laz.png. Die MSEgui Version kommt ohne aus, siehe mse.png.

Website · Beitrag von **Socke** » Mo 30. Jan 2017, 22:11

mse hat geschrieben:Die exception frames im Lazarus code sind vermutlich teuer, siehe laz.png. Die MSEgui Version kommt ohne aus, siehe mse.png.

Das stimmt. Mir war nur kein Weg eingefallen, die Overflow-Bedingungen ohne {$OverflowChecks ON} abzufangen.
Das Ersetzen der Overflow-Checks durch explizit ausgelöste Exceptions gibt gibt ca. 2 weiter Prozentpunkte Geschwindigkeitsvorteil.

mse · Beitrag von **mse** » Di 31. Jan 2017, 07:52

Bist du sicher, dass dann kein implizites try/finally angelegt wird? Der Geschwindigkeitsvorteil von MSEgui ist immerhin >3 und nicht nur einige Prozentpunkte.
Vielleicht ist auch der Verzicht auf Zeichenaddressierung mittels Integer-Index ausschlaggebend.

Website · Beitrag von **Socke** » Di 31. Jan 2017, 08:37

mse hat geschrieben:Bist du sicher, dass dann kein implizites try/finally angelegt wird? Der Geschwindigkeitsvorteil von MSEgui ist immerhin >3 und nicht nur einige Prozentpunkte.
Vielleicht ist auch der Verzicht auf Zeichenaddressierung mittels Integer-Index ausschlaggebend.

Ja, ich bin mir sicher, dass weiterhin Exception-Frames angelegt werden.
Ich hatte nur die beiden Bereiche, die mit {$OverflowChecks ON} umgeben sind, durch präventive Prüfungen ersetzt.
In einem weiteren Schritt will ich die Implementierung Exception-frei machen (d.h. wie die SystemProzedur val). Dann sollte sich die Geschwindigkeit deiner Implementierung weiter annähern.

Deutsches Lazarusforum

UnicodeString-Optimierung für StrToInt

UnicodeString-Optimierung für StrToInt

Re: UnicodeString-Optimierung für StrToInt

Re: UnicodeString-Optimierung für StrToInt

Re: UnicodeString-Optimierung für StrToInt

Re: UnicodeString-Optimierung für StrToInt