Important MO file optimisation for en_* locales, and partly others
During GUADEC, Tomas Frydrych gave a talk on exmap-console, a cut-down version of exmap that can work well on mobile devices.
During the presentation, Tomas showed how to use the tool to find the culprits in memory (ab)use on the GNOME desktop. One issue that came up was that MO files were taking up space in memory even though the desktop was showing English. Why would the MO translation files loaded in memory be so big?
gtk20.mo : VM 61440 B, M 61440 B, S 61440 B
atk10.mo : VM 8192 B, M 8192 B, S 8192 B
libgnome-2.0.mo : VM 28672 B, M 24576 B, S 24576 B
glib20.mo : VM 20480 B, M 16384 B, S 16384 B
gtk20-properties.mo : VM 128 KB, M 116 KB, S 116 KB
launchpad-integration.mo : VM 4096 B, M 4096 B, S 4096 B
A translation file looks like
msgid "File"
msgstr ""
When translated to Greek it is
msgid "File"
msgstr "Αρχείο"
In the English UK translation it would be
msgid "File"
msgstr "File"
This is actually not necessary, because if you leave those messages untranslated, the system will use the original messages that are embedded in the executable file.
However, for the English UK, English Canadian and similar teams, it makes sense to copy the same message into the translated field, because it indicates that the message was examined by the translator. Any new messages then appear as untranslated and the same process continues.
Now, the problem is that the gettext tools are not smart enough when they compile such translation files; they needlessly replicate those messages, which then take up space in the generated MO file.
Apart from the English variants, this issue is also present in other languages when the message looks like
msgid "GConf"
msgstr "GConf"
Here, it does not make much sense to translate the message into the locale language. However, the generated MO file now contains more than 10 bytes (5+5), plus some space for the index.
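To put a number on "plus some space for the index", here is a minimal sketch in Python that writes a tiny MO file by hand and compares its size with and without the copied "GConf" entry. The build_mo helper is a hypothetical name of mine, not part of gettext; it writes no hash table, so the real msgfmt output would be somewhat larger. Python's own gettext module is used to check that the lookup still returns "GConf" when the entry is absent.

```python
import gettext
import io
import struct


def build_mo(messages):
    """Build a minimal GNU MO file (no hash table) from a msgid -> msgstr dict."""
    # Originals are stored sorted; the header entry "" sorts first.
    items = sorted((k.encode("utf-8"), v.encode("utf-8")) for k, v in messages.items())
    n = len(items)
    ids = b"".join(k + b"\0" for k, _ in items)
    strs = b"".join(v + b"\0" for _, v in items)
    id_off = 28 + 16 * n              # 28-byte header, then two tables of 8 bytes per message
    str_off = id_off + len(ids)
    orig_table = b""
    trans_table = b""
    for k, v in items:
        orig_table += struct.pack("<II", len(k), id_off)
        trans_table += struct.pack("<II", len(v), str_off)
        id_off += len(k) + 1          # +1 for the NUL terminator
        str_off += len(v) + 1
    # magic, revision, count, offset of originals table, offset of translations table,
    # hash table size and offset (both 0: no hash table in this sketch)
    header = struct.pack("<7I", 0x950412DE, 0, n, 28, 28 + 8 * n, 0, 0)
    return header + orig_table + trans_table + ids + strs


HEADER = "Content-Type: text/plain; charset=UTF-8\n"
full = {"": HEADER, "File": "Αρχείο", "GConf": "GConf"}
# Drop every entry whose translation is a plain copy of the original.
stripped = {k: v for k, v in full.items() if k != v}

mo_full = build_mo(full)
mo_stripped = build_mo(stripped)
saved = len(mo_full) - len(mo_stripped)   # two 8-byte table entries + "GConf\0" twice

catalog = gettext.GNUTranslations(io.BytesIO(mo_stripped))
print(saved, catalog.gettext("GConf"), catalog.gettext("File"))
```

In this simplified layout the single copied message costs 28 bytes rather than 10, because each stored string also needs a length/offset pair and a terminating NUL.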
Therefore, what's the solution for this issue?
One solution is to add to msgattrib the option to preprocess a PO file and remove those unneeded copies. Here is a patch,
--- src.ORIGINAL/msgattrib.c 2007-07-18 17:17:08.000000000 +0100
+++ src/msgattrib.c 2007-07-23 01:20:35.000000000 +0100
@@ -61,7 +61,8 @@
REMOVE_FUZZY = 1 << 2,
REMOVE_NONFUZZY = 1 << 3,
REMOVE_OBSOLETE = 1 << 4,
- REMOVE_NONOBSOLETE = 1 << 5
+ REMOVE_NONOBSOLETE = 1 << 5,
+ REMOVE_COPIED = 1 << 6
};
static int to_remove;
@@ -90,6 +91,7 @@
{ "help", no_argument, NULL, 'h' },
{ "ignore-file", required_argument, NULL, CHAR_MAX + 15 },
{ "indent", no_argument, NULL, 'i' },
+ { "no-copied", no_argument, NULL, CHAR_MAX + 19 },
{ "no-escape", no_argument, NULL, 'e' },
{ "no-fuzzy", no_argument, NULL, CHAR_MAX + 3 },
{ "no-location", no_argument, &line_comment, 0 },
@@ -314,6 +316,10 @@
to_change |= REMOVE_PREV;
break;
+ case CHAR_MAX + 19: /* --no-copied */
+ to_remove |= REMOVE_COPIED;
+ break;
+
default:
usage (EXIT_FAILURE);
/* NOTREACHED */
@@ -436,6 +442,8 @@
--no-obsolete remove obsolete #~ messages\n"));
printf (_("\
--only-obsolete keep obsolete #~ messages\n"));
+ printf (_("\
+ --no-copied remove copied messages\n"));
printf ("\n");
printf (_("\
Attribute manipulation:\n"));
@@ -536,6 +544,21 @@
: to_remove & REMOVE_NONOBSOLETE))
return false;
+ if (to_remove & REMOVE_COPIED)
+ {
+ if (!strcmp(mp->msgid, mp->msgstr) && strlen(mp->msgstr)+1 >= mp->msgstr_len)
+ {
+ return false;
+ }
+ else if ( strlen(mp->msgstr)+1 < mp->msgstr_len )
+ {
+ if ( !strcmp(mp->msgstr + strlen(mp->msgstr)+1, mp->msgid_plural) )
+ {
+ return false;
+ }
+ }
+ }
+
return true;
}
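For readers who do not want to read the C, the test added by the patch can be restated as a short Python sketch. This is my reading of the patch, not code from gettext: is_copied and the dictionary layout are hypothetical, and the NUL-separated msgstr buffer of the C code is modelled as a list of translated forms.

```python
def is_copied(msgid, msgid_plural, msgstr_forms):
    """Return True if the --no-copied check in the patch would drop the message.

    msgstr_forms holds one item for a plain message and several items for a
    plural message (the C code stores them NUL-separated in one buffer).
    """
    if len(msgstr_forms) == 1:
        # Plain message: dropped when the translation equals the original.
        return msgstr_forms[0] == msgid
    # Plural message: as in the patch, only the second form is compared,
    # and it is compared against msgid_plural.
    return msgstr_forms[1] == msgid_plural


messages = [
    {"msgid": "File", "msgid_plural": None, "msgstr": ["File"]},        # copied
    {"msgid": "File", "msgid_plural": None, "msgstr": ["Αρχείο"]},      # translated
    {"msgid": "Open", "msgid_plural": None, "msgstr": [""]},            # untranslated
    {"msgid": "%d file", "msgid_plural": "%d files",
     "msgstr": ["%d file", "%d files"]},                                # copied plural
]

kept = [m for m in messages
        if not is_copied(m["msgid"], m["msgid_plural"], m["msgstr"])]
print([m["msgstr"] for m in kept])
```

Untranslated messages are kept, so translators still see them as work to do; only the exact copies are removed.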
However, if we only change msgattrib, we would need to adapt the build system for all packages.
Apparently, it would make sense to change the default behaviour of msgfmt, the program that compiles PO files into MO files.
An e-mail was sent to the email address for the development team of gettext regarding the issue. The development team does not appear to have a Bugzilla to record these issues. If you know of an alternative contact point, please notify me.
Update #1 (23Jul07): As an indication of the file size savings, the en_GB locale on the Ubuntu installation CD occupies about 424KB, where in practice it should have been 48KB.
A full installation of Ubuntu with some basic KDE packages (only for the basic libraries, i.e. KBabel - (ls k* | wc -l = 499)) occupies about 26MB of space just for the translation files. After optimising the MO files, the translation files occupy only 7MB. This is quite important because when someone installs, for example, the en_CA locale, all en_?? locales are added.
The reason why the reduction is greater has to do with the message types that KDE uses. For example,
msgid ""
"_: Unknown State\n"
"Unknown"
msgstr "Unknown"
I cannot see a portable way to code the gettext-tools so that they understand that the above message can be easily omitted. Of the 7MB that remain after the above reduction, KDE applications (k*) occupy 3.6MB. The non-KDE applications include GNOME, XFCE and the traditional GNU tools. The biggest culprits in KDE are kstars (386KB) and kgeography (345KB).
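A non-portable heuristic for the KDE case could look like the Python sketch below. It assumes that a msgid starting with "_: " carries a context on its first line, as in the example above, and that the application itself strips that context when no translation is found; neither assumption is checked here, and both function names are hypothetical.

```python
def strip_kde_context(msgid):
    """Drop a leading "_: context" line, if the msgid has one."""
    if msgid.startswith("_: ") and "\n" in msgid:
        return msgid.split("\n", 1)[1]
    return msgid


def is_copied_ignoring_context(msgid, msgstr):
    """True when msgstr merely repeats the msgid once the context line is removed."""
    return strip_kde_context(msgid) == msgstr


print(is_copied_ignoring_context("_: Unknown State\nUnknown", "Unknown"))
print(is_copied_ignoring_context("_: Unknown State\nUnknown", "Άγνωστη"))
```

Because the "_: " prefix is a KDE convention rather than something gettext knows about, a rule like this would have to live in KDE-specific tooling, which is exactly the portability problem mentioned above.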
Update #2 (23Jul07): (Thanks Deniz for the comment below on gweather!) The po-locations translations (gnome-applets/gweather) of all languages are combined together to generate a big XML file that can be found at usr/share/gnome-applets/gweather/Locations.xml (~15MB).
This file is not kept in memory while the gweather applet is running.
However, the file is parsed when the user opens the properties dialog to change the location.
I would say that the main problem here is the file size (15.8MB) that can be easily reduced when stripping copied messages. This file is included in any Linux distribution, whatever the locale.
The po-locations directory currently occupies 107MB, and when copied messages are eliminated it occupies 78MB (a difference of roughly 30MB). The generated XML file is in any case smaller (15.8MB without optimisation) because it does not repeat the msgid lines for each language.
I regenerated the Locations.xml file with the optimised PO files and the resulting file is 7.6MB. This is a good reduction in file space and also in packaging size.
Update #3 (25Jul07): Posted a patch for gettext-tools/msgattrib.c. Sent an e-mail to the kde-i18n-doc mailing list and got a good response, including a valid argument concerning the proposed changes. Specifically, there is a problem case when one gives custom values to the LANGUAGE variable, such as "es:fr", which means: show me messages in Spanish, and if something is untranslated, show it in French. If a message has msgid==msgstr for Spanish but not for French, then it would be shown in French if we went along with the proposed optimisation.
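The LANGUAGE objection is easy to reproduce with a toy model. The Python sketch below is not gettext itself: catalogs are plain dictionaries, lookup is a hypothetical helper, and the message "Error" (Spanish "Error", French "Erreur") is an example of my own choosing.

```python
def lookup(msgid, languages, catalogs):
    """Return the first non-empty translation in language priority order."""
    for lang in languages:
        translation = catalogs.get(lang, {}).get(msgid)
        if translation:
            return translation
    return msgid  # nothing found: fall back to the original message


catalogs = {
    "es": {"Error": "Error"},    # msgid == msgstr in Spanish
    "fr": {"Error": "Erreur"},
}
languages = "es:fr".split(":")

before = lookup("Error", languages, catalogs)

# Apply the proposed optimisation: drop entries where msgid == msgstr.
stripped = {lang: {k: v for k, v in cat.items() if k != v}
            for lang, cat in catalogs.items()}
after = lookup("Error", languages, stripped)

print(before, after)
```

With the copy kept, the Spanish user sees "Error"; with it stripped, the lookup falls through to French and shows "Erreur", which is the regression described on the mailing list.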
Say No to OOXML
I copy here the terms of the petition to say no to the standardisation of MSOOXML at ISO.
I ask the national members of ISO to vote "NO" in the ballot of ISO DIS 29500 (Office OpenXML or OOXML format) for the following reasons:
- There is already a standard ISO26300 named Open Document Format (ODF): a dual standard adds costs, uncertainty and confusion to industry, government and citizens;
- There is no provable implementation of the OOXML specification: Microsoft Office 2007 produces a special version of OOXML, not a file format which complies with the OOXML specification;
- There is missing information from the specification document, for example how to do a autoSpaceLikeWord95 or useWord97LineBreakRules;
- More than 10% of the examples mentioned in the proposed standard do not validate as XML;
- There is no guarantee that anybody can write a software that fully or partially implements the OOXML specification without being liable to patent damages or patent license fees by Microsoft;
- This standard proposal conflicts with other ISO standards, such as ISO 8601 (Representation of dates and times), ISO 639 (Codes for the Representation of Names and Languages) or ISO/IEC 10118-3 (cryptographic hash);
- There is a bug in the spreadsheet file format which forbids to enter any date before the year 1900: such bugs affects the OOXML specification as well as software versions such as Microsoft Excel 2000, XP, 2003 or 2007.
- This standard proposal has not been created by bringing together the experience and expertise of all interested parties (such as the producers, sellers, buyers, users and regulators), but by Microsoft alone.
This project is an initiative by the Foundation for a Free Information Infrastructure (FFII), the non-profit that helped achieve the rejection of the EU software patent directive in July 2005.
Update #1: Currently (26Jun07 - noon) there are 8805 signatures.
Update #2: Currently (26Jun07 - evening) there are 9481 signatures.
Update #3:
IT IS URGENT THAT YOU CONTACT THE STANDARDISATION BODY IN YOUR COUNTRY AND EXPLAIN TO THEM WHY OOXML IS BROKEN; SENDING A NICE LETTER TO THE STANDARDISATION BODY IN YOUR COUNTRY IS MORE IMPORTANT THAN SIGNING THE PETITION
International Call for Artists’ film and video
AT HOME IN EUROPE
Generous European Culture2000 funding enables ISIS Arts (UK) and its
international project partners BEK (Norway), InterSpace (Bulgaria) and
RIXC (Latvia) to curate a NEW SCREENING PROGRAMME around the theme of
European Identity for the Big M, ISIS Arts' inflatable touring space.
Daily, more and more European people decide to live in other European
countries. With a shifting concept of nationality it becomes
increasingly important to consider what it means to be European. Is
there such a thing as European Identity and how does it relate to
national identity?
For this programme we invite submissions of films or video works on this
theme from artists of any nationality.
Selected works will become part of the new screening programme which
will tour to the four partnering countries between May 2007 and
September 2007.
Work will be selected through open submission. In order to be considered
individual works must:
- Have a running time of 5 minutes or less
- Be single channel and non interactive
- Address the project theme
Selected artists will receive an exhibition fee of € 300 (The Big M is
not a commercial venture and admission is free). Copyright remains
solely with the artist.
The Big M is a highly stylised inflatable structure that functions as a
temporary and mobile venue for the presentation of video and digital
media. Unique in both design and function, the Big M provides an
alternative to the conventional gallery setting and exhibits work by
emerging and established artists to diverse audiences.
See: http://www.isisarts.org.uk/index2.html
To submit pieces for consideration please send work on DVD, CD Rom (720x
576 dpi QuickTime movie) or mini DV, titled and with a synopsis of 50
words maximum, a CV and a stamped addressed envelope (if you want your
materials returned) to:
BEK
C Sundtsg 55
9. etage
5004 Bergen
Norway
Deadline for receipt of submissions is the 3rd of February 2007
Further inquiries to isis at isisarts dot org dot uk
Further project information can be found on
http://www.athomeineurope.eu/
Federico on GNOME optimisation

Federico's presentation on GNOME optimisation.
He covered how to optimise GNOME so that the end-user experience follows a "flow", with no bottlenecks or annoying delays during a desktop session.
Are you a programmer?
Google is organising a programming contest this month, the Google Code Jam Europe. The contest is held over the Internet through an interesting platform. Sign up now and you can try out the platform with sample problems.
Only four of the 48 best computer programmers in the world are Americans, at least according to a computer-programming competition run by TopCoder. Poland had 11 of the final 48, and Russia had 8. Wall Street Journal columnist Lee Gomes asks whether this is more evidence of a sad decline in American education and competitiveness: 'Surprisingly, the Eastern Europeans don't seem to think so. Poland's Krzysztof Duleba, 22, explained that in countries like his own, there are so few economic opportunities for students that competitions like these are their one chance to participate in the global economy. Some of the Eastern Europeans even seemed slightly embarrassed by their over-representation, saying it isn't evidence of any superior schooling or talent so much as an indicator of how much they have to prove.'
Source: The Wall Street Journal
Are there, I wonder, economic opportunities in Greece?
Greek in e-mail, part one
You have probably received letters or announcements from some websites where the encoding for Greek is not correct, either in the body of the message or in the header (From: "Ανακοίνωση"
Specifically, the encoding is not declared, so what the recipient sees depends on the recipient's default settings.
Let's see how you can send mail containing Greek from a PHP application. The same can be done from other languages, such as Perl and Python.
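<?php
// PEAR Mail includes kept from the original; mb_send_mail() below is provided by mbstring.
include('Mail.php');
include('Mail/mime.php');

// Encode the display names so that the Greek text survives in the headers.
// The addresses are obfuscated placeholders ("sender at gmail dot com"); use real ones.
$from = "From: \"" . mb_encode_mimeheader('Όνομα Αποστολέα') . "\" < αποστολέας στο gmail τελεία com>";
$to = mb_encode_mimeheader('Όνομα Παραλήπτη') . " < παραλήπτης στο gmail τελεία com>";
$subject = 'Θέμα γράμματος';            // "Subject of the letter"
$body = 'Περιεχόμενο του γράμματος.';   // "Content of the letter."

mb_send_mail($to, $subject, $body, $from);
?>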
The message that is generated will look like
From: Όνομα Αποστολέα < αποστολέας στο gmail τελεία com>
To: Όνομα Παραλήπτη < παραλήπτης στο gmail τελεία com>
Subject: Θέμα γράμματος

Περιεχόμενο του γράμματος.
This requires the php-mbstring package, which all good Linux distributions provide. Otherwise, you can still get the same result, but you will have to do the above by hand.
You also need to configure /etc/php.ini with the following:
[mbstring]
; language for internal character representation.
; Neutral means Unicode
mbstring.language = Neutral
; internal/script encoding.
; Some encoding cannot work as internal encoding.
; (e.g. SJIS, BIG5, ISO-2022-*)
mbstring.internal_encoding = UTF-8
; http input encoding.
mbstring.http_input = UTF-8
; http output encoding. mb_output_handler must be
; registered as output buffer to function
mbstring.http_output = UTF-8
; enable automatic encoding translation according to
; mbstring.internal_encoding setting. Input chars are
; converted to internal encoding by setting this to On.
; Note: Do _not_ use automatic encoding translation for
; portable libs/applications.
mbstring.encoding_translation = On
; substitute_character used when character cannot be converted
; one from another
; "long" means that if something goes wrong during conversion, the U+xxxx code of the character is printed.
mbstring.substitute_character = long
If you are a user of the phplist application, update this page.
Note: Everything above is in utf-8 (Unicode) encoding.
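Since the same can be done from Python, here is a minimal sketch using only the standard library's email package. The function name build_greek_message and the example.com addresses are placeholders of mine; the sketch only builds the message and does not send it.

```python
from email.header import Header
from email.mime.text import MIMEText
from email.utils import formataddr


def build_greek_message(from_name, from_addr, to_name, to_addr, subject, body):
    """Build a mail message whose headers and body declare UTF-8 explicitly."""
    # The body is labelled as text/plain with charset utf-8.
    msg = MIMEText(body, "plain", "utf-8")
    # formataddr() MIME-encodes a non-ASCII display name (UTF-8 by default).
    msg["From"] = formataddr((from_name, from_addr))
    msg["To"] = formataddr((to_name, to_addr))
    # Header() MIME-encodes the subject when the message is serialised.
    msg["Subject"] = Header(subject, "utf-8")
    return msg


message = build_greek_message(
    "Όνομα Αποστολέα", "sender@example.com",      # placeholder address
    "Όνομα Παραλήπτη", "recipient@example.com",   # placeholder address
    "Θέμα γράμματος",
    "Περιεχόμενο του γράμματος.",
)
print(message.as_string())
```

The printed headers contain only ASCII, with the Greek text wrapped in =?utf-8?...?= encoded words, so the recipient's mail client no longer has to guess the encoding. The message object can then be handed to smtplib for delivery.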
