User:Glrx

From Wikimedia Commons, the free media repository

Notes about images

Mona Lisa JPEG image is 90 MB. The served JPEG is 21 kB.

WMF does some interesting processing when it displays images. One might think that a JPEG image of the Mona Lisa is just transmitted to a browser, but that is not the case. The file File:Mona Lisa, by Leonardo da Vinci, from C2RMF retouched.jpg is 90 MB, and that much data takes time to transmit (about 0.7 seconds at 1 Gbit/s or 7 seconds at 100 Mbit/s, before any protocol overhead) and could put a big dent in a modest cellular telephone data plan.

Instead, WMF does something different. The 90 MB, 7,479 × 11,146 pixel JPEG is downsampled to the display size. The result is that only a small image is transferred over the network. The transfer is faster, and the impact on data bandwidth and data plans is much smaller. For example, the downsampled JPEG is just 21,143 bytes, a tiny fraction of 90 MB.
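As a rough check of those numbers, the arithmetic can be sketched in Python. The sketch assumes decimal megabytes (1 MB = 10^6 bytes) and an ideal link with no protocol overhead; the function name is only illustrative.

```python
def transfer_seconds(size_bytes: int, bits_per_second: float) -> float:
    """Return the ideal time to move size_bytes over a link, ignoring overhead."""
    return size_bytes * 8 / bits_per_second

ORIGINAL = 90_000_000   # about 90 MB (assumed decimal megabytes)
THUMBNAIL = 21_143      # bytes in the downsampled JPEG

for label, rate in [("1 Gbit/s", 1e9), ("100 Mbit/s", 1e8)]:
    print(f"{label}: original {transfer_seconds(ORIGINAL, rate):.2f} s, "
          f"thumbnail {transfer_seconds(THUMBNAIL, rate) * 1000:.3f} ms")

# The original is several thousand times larger than the thumbnail.
print(f"size ratio: about {ORIGINAL / THUMBNAIL:.0f} to 1")
```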

Information for downsampling

The wiki markup will specify a particular image to display. The markup will specify a desired size (such as a width of 160 pixels).

Below is speculation. JPEG and PNG files may be reparsed at each inclusion. I need to check the sources.

I expect a database to hold critical properties. With quick access to some properties, it will not be necessary to parse the image file or the image description page. Properties such as the base URL, file type, width, and height should be quick to access. From that information, one could quickly determine the HTML needed to include the image. The img element's width, height, and URL can be determined without reading the image file.
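A minimal sketch of that idea: given only the stored width and height, the img attributes for a requested width follow from the aspect ratio. The URL below is a placeholder and the rounding rule is an assumption; I have not checked what MediaWiki actually does.

```python
from html import escape

def img_element(url: str, orig_width: int, orig_height: int, width: int) -> str:
    """Build an img element from stored properties without opening the file."""
    # Preserve the aspect ratio; the rounding rule here is an assumption.
    height = round(orig_height * width / orig_width)
    return f'<img src="{escape(url, quote=True)}" width="{width}" height="{height}">'

# Hypothetical thumbnail URL; the real URL layout is not described in these notes.
print(img_element("https://example.org/thumb/160px-Mona_Lisa.jpg", 7479, 11146, 160))
```

For the Mona Lisa file at a width of 160 pixels, the computed height is 238 pixels.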

That is not true for SVG files. The image size can be determined, but SVG URLs may need to specify the desired language. My understanding is that the SVG file must be parsed to learn the IETF language tags that the file supports (that step can be skipped if |lang= is set or for the en.Wiki). Reparsing the file makes using an SVG file expensive.

Note: Looks like SVG files need not be reparsed for languages. The MW database has an edited list of languages in its metadata. A MediaWiki API imageinfo query provides metadata and languages (MW type not IETF?):
Languages are filtered: zh_CN does not appear. Non-existent langtags also appear: zh.
https://www.mediawiki.org/wiki/Manual:File_metadata_handling
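A sketch of how such an imageinfo query URL can be assembled. The parameters (action=query, prop=imageinfo, iiprop) are standard MediaWiki API parameters; the file title is the Gibraltar map used later in these notes, and which part of the response carries the language list is not confirmed here.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def imageinfo_url(title: str) -> str:
    """Build an imageinfo query URL that asks for stored file metadata."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "size|mime|metadata",   # no image data is transferred
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)

print(imageinfo_url("File:Gibraltar map-en.svg"))
```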

Caching

The WMF servers confront a computational burden because they must downsample the image, but local computation is less expensive than data bandwidth.

In addition, WMF servers cache the images that they downsample. If I ask for Mona Lisa at a particular width, then a WMF server will generate that size. That work is stored in a cache, so if I or somebody else asks for the same size image at another time, the cached version is supplied rather than re-downsampling the image. The cache saves computation and time.
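The idea can be sketched as a cache keyed by file name and width. This is only an illustration of the principle, not the WMF implementation; the class and the stand-in scaler are hypothetical.

```python
class ThumbnailCache:
    """Cache renditions keyed by (file name, width)."""

    def __init__(self, scaler):
        self._scaler = scaler          # function that does the expensive work
        self._store = {}
        self.misses = 0

    def get(self, name: str, width: int) -> bytes:
        key = (name, width)
        if key not in self._store:
            self.misses += 1           # only a miss triggers downsampling
            self._store[key] = self._scaler(name, width)
        return self._store[key]

def fake_scaler(name: str, width: int) -> bytes:
    # Stand-in for a real image scaler; returns placeholder bytes.
    return f"{name}@{width}px".encode()

cache = ThumbnailCache(fake_scaler)
cache.get("Mona_Lisa.jpg", 160)
cache.get("Mona_Lisa.jpg", 160)   # served from the cache
cache.get("Mona_Lisa.jpg", 320)   # a new width is a new rendition
print(cache.misses)
```

Three requests cause only two downsampling runs because the repeated width is served from the cache.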

The moment when the caching is done is also significant. Although I can ask for images at particular sizes, the usual scenario is that the image (such as Mona Lisa) is used on a wiki article page. When a wiki page is updated, MediaWiki rebuilds the wiki page (creating a cached version of the wiki page) and also caches any new images that were added to the page. That can require a lot of computation, but the result is that the wiki page and all of its images are now in the server caches. Preloading the cache reduces the latency that a user would experience when viewing the page. The user need only wait for the data to be transferred and does not wait for the downsampling because it has already been done.

Moreover, WMF is telling my browser to cache local copies on my computer. If I view a wiki page with the Mona Lisa image on it, the wiki page and the Mona Lisa image are copied to my computer. I can leave that wiki page, but the local copies remain on my computer. If I reload the page later, my browser can display the page without re-downloading the page and image from the WMF server.

That local caching interaction can be involved. The mechanism is part of the Hyper Text Transfer Protocol (HTTP). When a server transfers web pages or images using HTTP, it will specify some caching information. That information tells my browser if it may cache the data and how long the cached data is accurate.[1]
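A simplified sketch of the freshness decision a browser makes from the Cache-Control header. Real HTTP caching has more rules (Expires, heuristic freshness, revalidation with ETag); this only handles max-age, no-cache, and no-store, and the function name is my own.

```python
import re

def is_fresh(cache_control: str, age_seconds: int) -> bool:
    """Return True if a cached response may be reused without revalidation."""
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return False
    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            return age_seconds < int(match.group(1))
    return False  # no explicit lifetime: treat as stale in this sketch

print(is_fresh("public, max-age=3600", 120))   # still within its lifetime
print(is_fresh("max-age=3600", 7200))          # too old
print(is_fresh("no-store", 0))                 # must not be cached at all
```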

Caches can cause trouble

Say wiki page ABC uses an image XYZ.

If page ABC is rebuilt every time it is accessed, then the page will always be up to date. If the page is cached, then the cache may have a stale version.

If somebody edits page ABC, then it is clear that page ABC should be purged from the cache.

If somebody edits image XYZ, then the cache should be cleared of XYZ. But the appearance of page ABC may also change even though none of the wikitext for page ABC has changed. How does page ABC get updated?

If the aspect ratio of XYZ does not change, then nothing much needs to happen. When page ABC is accessed, it comes out of the cache. The cached page has a reference to XYZ, but that image has been invalidated, so the new version of XYZ is fetched.

If the aspect ratio of XYZ changed, then the layout of ABC may have been altered. ABC needs to be rebuilt. MW maintains a database of where each image is used, so MW can invalidate all of the pages that use XYZ. There is a cascade: the invalidated pages may be transcluded, so more pages may need to be invalidated and rebuilt.
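The cascade can be sketched as a breadth-first walk over a usage table. The table layout and page names are hypothetical; MW's actual link tables are not described here.

```python
from collections import deque

def pages_to_invalidate(changed: str, used_by: dict[str, set[str]]) -> set[str]:
    """Return every page that directly or indirectly includes `changed`."""
    stale: set[str] = set()
    queue = deque([changed])
    while queue:
        item = queue.popleft()
        for page in used_by.get(item, ()):
            if page not in stale:      # guard against cycles and repeats
                stale.add(page)
                queue.append(page)
    return stale

# Hypothetical usage table: each key is included by the pages in its value set.
used_by = {
    "File:XYZ": {"ABC"},
    "ABC": {"Portal:Main"},          # ABC is transcluded into another page
}
print(sorted(pages_to_invalidate("File:XYZ", used_by)))
```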

More on server caching

The server cache can be a separate set of servers positioned between the users and the actual servers. WMF uses Varnish.

Domain names

Domain names such as commons.wikimedia.org or upload.wikimedia.org must be resolved to an IP address. That resolution need not be to a single IP address. Check namespace resolution and redirect messages as ways to shuffle the load.

A domain name resolves to one IP address. Many domain names may resolve to the same IP address.[2] But I think a name may have many A records. I'm looking for information about random selection.[3]

Alt text

Proposal about alt= text being added to HTML.

Page regeneration

Consider a typical Wikipedia page. It will use templates and images.

If one of the templates is edited, then the Wikipedia page probably needs to be rebuilt. The template may affect the page content or layout. MediaWiki keeps track of which pages use a template, so when a template is edited, MediaWiki knows which pages need to be regenerated. There can also be a cascade because some templates use other templates.

That means that editing a template that is used on thousands of Wikipedia pages would trigger the regeneration of thousands of Wikipedia pages. Editing commonly used templates should not be done lightly. Commonly used templates may be protected. For example, {{Cite book}} on the English Wikipedia affects almost 1.5 million pages.

Editing an image does not require rebuilding the pages that use the image. The page still references the same image name, but now the image scalers will supply the new image rather than the old one. The cached HTML of the Wikipedia page is still good.

Well, not quite. When MediaWiki builds a page, it specifies the width and height attributes of the img element. That allows the browser to lay out the page before it has downloaded all the images. That avoids continual layout adjustments as image sizes are learned. So rebuilding pages may be required. MediaWiki could just do it all the time, or it could update pages only if a significant change occurred. If the image aspect ratio changed, then img elements would need to be updated. If a multilingual SVG file added another language, then pages may need updating.
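One way to picture the "significant change" test is to compare the img attributes that the old and new versions would produce. This is a sketch of the idea only; the function names and rounding rule are assumptions, not MediaWiki behavior.

```python
def img_size(orig_width: int, orig_height: int, width: int) -> tuple[int, int]:
    """Width and height attributes for a rendition of the given width."""
    return width, round(orig_height * width / orig_width)

def needs_rebuild(old: tuple[int, int], new: tuple[int, int], widths: list[int]) -> bool:
    """True if any width used on a page would get different img attributes."""
    return any(img_size(*old, w) != img_size(*new, w) for w in widths)

# Same aspect ratio at a higher resolution: the cached HTML stays valid.
print(needs_rebuild((1000, 1500), (2000, 3000), [160, 320]))
# A cropped upload changes the aspect ratio: the img attributes change.
print(needs_rebuild((1000, 1500), (1000, 1200), [160, 320]))
```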

SVG images

SVG Map of Gibraltar is 290 kB. The served PNG is 48 kB.

WMF processes SVG images in a manner similar to JPEG images. Instead of serving the actual SVG file on a wiki page, WMF builds a PNG file of the requested width and serves the PNG. There are a couple of advantages to serving a PNG.

First, the PNG file can be much smaller than the SVG file. For example, the SVG map of Gibraltar is 290 kB. The request above produced a 48 kB PNG file:

accept-ranges: bytes
access-control-allow-origin: *
access-control-expose-headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache
age: 74796
content-disposition: inline;filename*=UTF-8''Gibraltar_map-en.svg.png
content-length: 47987
content-type: image/png
date: Tue, 21 Dec 2021 02:25:44 GMT
etag: 8391f68640a7f0cedd3971fef7b8b3d3
last-modified: Mon, 01 Feb 2021 12:23:52 GMT
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
permissions-policy: interest-cohort=()
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
server: ATS/8.0.8
server-timing: cache;desc="hit-front", host;desc="cp1078"
strict-transport-security: max-age=106384710; includeSubDomains; preload
timing-allow-origin: *
x-cache: cp1078 hit, cp1078 hit/2
x-cache-status: hit-front
x-timestamp: 1612182231.00644

However, the SVG file is transferred with GZip compression; the transfer size is only 89 kB. The compression factor is 290/89 = 3.26. The transfer size is larger than the PNG, but it is less than twice the size of the PNG (89/48 = 1.85).
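The effect of GZip on SVG text is easy to measure with Python's gzip module. The SVG below is a made-up, repetitive stand-in (not the Gibraltar map), so its ratio will be far higher than a real map's; the last line just re-derives the two ratios quoted above.

```python
import gzip

def gzip_ratio(data: bytes) -> float:
    """Return original size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))

# A made-up, repetitive SVG stands in for a real map file.
rows = "".join(f'<circle cx="{x}" cy="10" r="2"/>' for x in range(0, 2000, 5))
svg = f'<svg xmlns="http://www.w3.org/2000/svg">{rows}</svg>'.encode()
print(f"{len(svg)} bytes, ratio {gzip_ratio(svg):.2f}")

# The figures quoted above: 290 kB raw, 89 kB transferred, 48 kB PNG.
print(f"{290 / 89:.2f} {89 / 48:.2f}")
```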

Second, when WMF started supporting the SVG file format, the browser support for SVG was nonexistent or uneven. Serving PNG files had strong support. Serving PNG renditions of SVG files also gives a uniform presentation. SVG images can vary depending upon the availability of particular fonts and the depth of SVG support.

Directly serving SVG

SVG client side rendering (Phabricator T5593)

The img element allows animations but should block scripts. The object element allows scripts.

Scripts can be malicious. WMF blocks uploading SVG files that contain scripts.

There is also a concern with animated files triggering seizures. That has been cited as a reason to not serve SVG directly. Detecting animated SVG is also made difficult because there are both SMIL and CSS animations. Automatically detecting update rate may be difficult. Even with a fast update rate, an animation may not trigger a seizure. See Commons:Deletion requests/File:Color Flash.gif.

What happens to mouse clicks? Wrap an a element around a bitmap file. Wrap it around an SVG file.

SVG files can be malicious. An SVG file could be a computational nightmare that taxes the computer. PNG files will render in finite time. The SVG renderer on WMF servers puts a time limit on the rendering. If it does not complete within a few seconds, then the process is terminated. There are some SVG files on Commons that can hit that time limit.
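A time limit of that kind can be sketched with subprocess and a timeout. This is not the WMF mechanism; a Python one-liner stands in for the renderer, and the limits are arbitrary.

```python
import subprocess
import sys

def finishes_within(command: list[str], seconds: float) -> bool:
    """Run a command and report whether it finished before the time limit."""
    try:
        subprocess.run(command, timeout=seconds, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return True
    except subprocess.TimeoutExpired:
        return False   # subprocess.run kills the child before raising

# A Python one-liner stands in for a renderer working on a hostile file.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
fast = [sys.executable, "-c", "pass"]
print(finishes_within(fast, 10), finishes_within(slow, 0.5))
```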

There are some language translation differences when an SVG file is directly served; see below.

SVG is XML or not

Dislike XML.

SVG has namespaces, but HTML does not. HTML lossage creeps in.

If XML is so good, why is CSS not XML?

XML details

Some notes for later.

The XML Spec 1.0 (Fifth edition). https://www.w3.org/TR/xml/

The XML prolog is optional.

  • XML version
  • encoding (ASCII / ISO issue) EBCDIC and UTF 16.
  • standalone
    From the XML specification § 2.9:
    The standalone document declaration must have the value "no" if any external markup declarations contain declarations of:
    • attributes with default values, if elements to which these attributes apply appear in the document without specifications of values for these attributes, or
    • entities (other than amp, lt, gt, apos, quot), if references to those entities appear in the document, or
    • attributes with tokenized types, where the attribute appears in the document with a value such that normalization will produce a different value from that which would be produced in the absence of the declaration, or
    • element types with element content, if white space occurs directly within any instance of those types.

The default attribute values raise issues with #REQUIRED, #IMPLIED, #FIXED and default. https://www.w3.org/TR/xml/#dt-default

The SVG 1.1 style element has the type attribute, and that attribute is #REQUIRED. See https://www.w3.org/Graphics/SVG/1.1/styling.html#StyleElement That means Phab:T68672 ("SVG style element ignored if no type attribute is specified") may have been invalid, and that Commons:Commons SVG Checker should require type="text/css".

The SVG 1.1 DTD has

<!ATTLIST %SVG.style.qname;
    xml:space ( preserve ) #FIXED 'preserve'
    %SVG.id.attrib;
    %SVG.base.attrib;
    %SVG.lang.attrib;
    %SVG.Core.extra.attrib;
    type %ContentType.datatype; #REQUIRED
    media %MediaDesc.datatype; #IMPLIED
    title %Text.datatype; #IMPLIED
>

For SVG 2.0, the type attribute has an initial value of text/css. https://svgwg.org/svg2-draft/styling.html#StyleElement
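A checker for the missing type attribute could look something like this sketch, which counts style elements in the SVG namespace that lack the attribute. It says nothing about how Commons SVG Checker is actually implemented.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def styles_missing_type(svg_text: str) -> int:
    """Count style elements that do not declare a type attribute."""
    root = ET.fromstring(svg_text)
    return sum(1 for el in root.iter(SVG_NS + "style") if "type" not in el.attrib)

sample = """<svg xmlns="http://www.w3.org/2000/svg">
  <style>rect { fill: red }</style>
  <style type="text/css">circle { fill: blue }</style>
</svg>"""
print(styles_missing_type(sample))
```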

There is also the style element content being a CDATA section. The SVG 1.1 conservative view was

  <defs>
    <style type="text/css"><![CDATA[
      rect {
        fill: red;
        stroke: blue;
        stroke-width: 3
      }
    ]]></style>
  </defs>

The CDATA section was needed to avoid entity interpretation and <. I remember having trouble at some point, but I think that was resolved by using CSS character literals rather than XML character literals. It may also be that the modern style element is a CDATA section rather than PCDATA. Find the references.

That SVG snippet also shows the style element within a defs element. That used to be common practice, but it may have never been needed. The advantage of the defs element was its content would never be rendered. There is more to say about defs; many elements (such as linearGradient) do not need to be within a defs element.

SVG treading where it should not

SVG is about representing vector images, but it often steps into areas that do not affect the appearance of the image or where it has no authority.

Reinventing namespaces.

Effectively merging xml:lang and lang attributes. If both exist, they must be equal. Why not just keep one? It also complicates the CSS :lang() pseudo-class. The only bad thing about xml:lang is its impact on RDF metadata, but that can be fixed with xml:lang="". (Check XML specification.)

Deprecating xml:space for CSS. XSLT knows how to handle xml:space but it does not know how to handle CSS. Compare the options. Also raise the issue about text-anchor and directions.

Why merge xlink:href and href? The XLink specification exists and was incorporated. Look at other parts of XLink. Yes, there are better ways to handle titles. List the XLink attributes. A more pointed complaint is that having adopted xlink:href in SVG 1.0, switching to href in SVG 2.0 is not downward compatible. Why make a breaking change?

The translate attribute is inherited from ITS and its its:translate attribute. The attribute tells language translators to not translate this text. It has no impact on the display of SVG. SVG is also missing features such as its:term. That tells language translators that the phrase is a technical term that should have a consistent translation. SVG should have just pointed to ITS and suggested its use. Yes, ITS should not be using multiple namespaces.

The data-* attributes reinvent what should be the data namespace. Inherited from HTML. Also inherited is the crazy case-insensitive mapping that is not needed in case-sensitive XML. Show some examples. These attributes have no impact on the display of SVG. They are there to simplify some handling in the DOM, but that should be a separable extension.
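Some examples of the name mapping, as a sketch of the HTML dataset rule as I understand it: strip the data- prefix, then turn each hyphen that is followed by a lowercase ASCII letter into that letter uppercased. The attribute names are invented for illustration.

```python
import re

def dataset_name(attribute: str) -> str:
    """Map a data-* attribute name to its DOM dataset property name."""
    if not attribute.startswith("data-"):
        raise ValueError("not a data-* attribute")
    rest = attribute[len("data-"):]
    # A hyphen followed by a lowercase ASCII letter becomes that letter uppercased.
    return re.sub(r"-([a-z])", lambda m: m.group(1).upper(), rest)

for name in ["data-layer", "data-map-scale", "data-iso-3166-code"]:
    print(name, "->", dataset_name(name))
```

Note that a hyphen before a digit is left alone, so data-iso-3166-code maps to iso-3166Code. A namespace prefix would need none of this.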

The aria-* attributes reinvent what should be the aria namespace. The SVG Working Group does not control the meaning of these screen reading attributes, so it should have just pointed to the ARIA specification. That specification should have used namespace syntax.

SVG DOM

Significant advantage.

Looking for type hierarchy, but not seeing what I want.

Descriptive elements desc and title.

Metadata element metadata.

Container elements such as g. SVG 2.0 says, "An element which can have graphics elements and other container elements as child elements. Specifically: ‘a’, ‘clipPath’, ‘defs’, ‘g’, ‘marker’, ‘mask’, ‘pattern’, ‘svg’, ‘switch’ and ‘symbol’."

Graphics elements such as line and text. Inherits from SVGGraphicsElement, so it has some methods, but not a type? Interface SVGGraphicsElement.

Style information in the DOM

I do not believe the SVG DOM makes all the style information available. A couple years ago I went looking for aural stylesheet information, and it was not there. Consequently, I do not believe that style properties such as -inkscape-font-specification are broken out. Does that mean that they disappear completely when the DOM is written out?

It may be that none of such style properties make any difference, so removing them could be seen as beneficial.

Attributes that could be removed

Is there a list of Inkscape attributes that are always safe to remove? For example, Inkscape could always regenerate the node type list. If Inkscape can regenerate the information, then why keep it?

  • sodipodi:nodetypes attribute can be regenerated
  • sodipodi:role="line"
  • -inkscape-font-specification CSS can be removed
  • line-height CSS (has multiple defaults)
  • font-style if default
  • font-weight if default
  • font-stretch if default
  • font-variant if default (short cut?)
  • font-variant-ligatures if default
  • font-variant-caps if default
  • font-variant-numeric if default
  • font-variant-east-asian if default
  • letter-spacing property if default
  • word-spacing property if default

Some Inkscape and sodipodi attributes should be preserved. Some g elements are identified as layers. Information about drawing grids does not take up much space, so removing that information does not have much benefit.

  • g element (perhaps toplevel) with inkscape:groupmode="layer" and inkscape:label="name" and id="identifier"
    the sodipodi:namedview element will have inkscape:current-layer="identifier"

The significant benefit is removing verbose style information.

Additional information to remove would be needless graphics state. For example, if stroke="none", then we probably do not care about stroke-width, stroke-dasharray, stroke-dashoffset, line joins, and end caps. Some font information may be a little different. If text has been converted to curves, keeping that information around would help in reconstructing the text.
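A sketch of removing the needless stroke state. It only looks at presentation attributes (not style="" or stylesheets) and only touches elements without children, because a child could inherit stroke-width from a parent while setting its own stroke. The attribute list is my own selection.

```python
import xml.etree.ElementTree as ET

STROKE_DETAILS = ("stroke-width", "stroke-dasharray", "stroke-dashoffset",
                  "stroke-linejoin", "stroke-linecap", "stroke-miterlimit")

def drop_unused_stroke(root: ET.Element) -> int:
    """Remove stroke detail attributes from leaf elements with stroke="none"."""
    removed = 0
    for el in root.iter():
        # Skip containers: their children may inherit the stroke details.
        if el.get("stroke") == "none" and len(el) == 0:
            for name in STROKE_DETAILS:
                if el.attrib.pop(name, None) is not None:
                    removed += 1
    return removed

root = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path d="M0 0h10" stroke="none" stroke-width="3" stroke-linecap="round"/>'
    '<path d="M0 5h10" stroke="black" stroke-width="3"/>'
    '</svg>')
print(drop_unused_stroke(root))
```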

SVG recommendations

Small size is a significant goal.

SVG is not arbitrarily scalable. Scalable is more about eliminating jaggies.

Fixed width lines. (CSS can adjust.)

SVG is not a good file format for bitmap images such as bar codes and QR codes. Those objects are not arbitrarily scalable; they must fit on a pixel grid. One could use barcode fonts within an SVG file; fonts will align to an underlying pixel grid.

Not for photographs (but can be used to label photographs).

Limited colors (can use color gradients). Color blocking suggestions rather than enormous detail.

Filters can produce complex objects such as chalk textures and clouds.

File size

SVG files can be small, but they can also be surprisingly large.

Files that are unnecessarily large

Some files are inordinately large. Common causes are extraneous clipping paths, extraneous gradients, and symbols that are copied rather than instanced.

See also SVG House of Estridsen.

Polar patterns

Consider some images from Category:NAVAID pictograms:[4]

The first image is a central dot and 12 line segments, and it has a simple representation. The second image is problematic. It is a central dot, a central solid ring, and 10 dotted radial rings. It has a lot of dots, but why does it need so many bytes? Each dot is not a circle element, but rather a path that looks like a circle. The third image is nearly as complex (only 7 dotted rings), but it is a more efficient representation. Instead of round dots, it uses stroke-dasharray for the dots. Notice that the dash array has some issues along the north axis.

We can get dots easily. Use a circle element (stroked but no fill), set the stroke-dasharray="0 xxx", and set stroke-linecap="round". The value xxx is chosen to be an integer fraction of the circumference. A close look at the NDB-DME symbol shows dashes instead of dots.

There is a problem with librsvg: the stroke-dasharray attribute must use commas rather than spaces.
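The dotted ring described above can be generated with a short sketch. The dash pitch is the circumference divided by the number of dots, and the dash array is written with a comma because of the librsvg issue. The function name, colors, and sizes are arbitrary choices.

```python
import math

def dotted_ring(cx: float, cy: float, r: float, dots: int, dot_diameter: float) -> str:
    """Return a circle element whose stroke draws `dots` round dots."""
    pitch = 2 * math.pi * r / dots      # circumference divided evenly
    return (f'<circle cx="{cx:g}" cy="{cy:g}" r="{r:g}" fill="none" stroke="black" '
            f'stroke-width="{dot_diameter:g}" stroke-linecap="round" '
            f'stroke-dasharray="0,{pitch:.4f}"/>')

print(dotted_ring(100, 100, 50, 36, 3))
```

Each zero-length dash is drawn as a round cap, so the dot diameter equals the stroke width.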

Compression using pattern elements

Here is a simple file that takes 307 kB:

It has linear and radial gradients, and many groups are scaled. It should use a pattern or feTile.

✓ Done Reduced file size to 9 kB. The interesting challenge here is the pattern fill. The shield outline is simple (and its path can be used for clipping). The shield can have a solid green fill that is overlaid with left and right red stripes. The white cross can be a path with a white fill (the path needs to be stroked within the shield, so overlaying two rectangles does not work). The pattern is difficult because it does not have a simple rectangular pitch. After filling the white cross, it is drawn two more times: first with a rectangular pattern of half the figures, and second with a similar pattern offset by half the pitch. The whole shield is then covered by a radial gradient. Finally, the outline of the shield is stroked at 3 pixels.

There are many similar designs at w:List of Breton flags.

This file should have a trivial size:

A file that has already been compressed using group elements.

Filters

File:Award-star-gold-3d.svg could use several SVG filters. Simple filter would be the shadow. Lighting would be more complicated.

Difficult tasks

Illustrating metal badges such as File:Police Badge,P.R.China.svg at 274 kB.

Extraction of groups

The magazine files.

Chemistry diagrams such as File:Calvin-cycle4.svg.

SVG optimizers

I am in favor of optimized SVG files. Some editors include a lot of pointless information. For example, an unstroked path may have attributes specifying the stroke color and the stroke dash properties. A rect may have a font-family attribute.

I'm also OK with bona fide structural groups sharing style information. If a group has several common elements, then it makes sense that they be styled similarly.

There are command-line SVG optimizers, such as

In general, I'm skeptical of using such optimizers. The notion of optimization is often based on deleting as many bytes as possible. Consequently, path data attributes become difficult to read, and metadata may get tossed completely. While specifying coordinates to the micro is often pointless, truncating coordinates to a fixed number of digits is a pretty big hammer. I do not know for sure, but I believe some tools may group neighboring elements to share formatting information; that is a poor way of imposing styles.

See also

Optimizers, if they do their job, do not change the appearance of the image.

Title and description

It is reasonable to include title and desc elements.

SVG 2.0 will allow language versions (using the lang or xml:lang attribute — not the systemLanguage attribute).[5] The acceptance of the language versions is not clear, and it has an at-risk warning in the SVG 2.0 specification.

The title and desc elements are not display elements. For that reason, they cannot be selected within a switch element. In that context, the elements would be giving a title and a description to the parent switch element.

There should be support for the Dublin Core dc:title and dc:description elements with xml:lang attributes.

Metadata

This section needs reorganization and clarity about types. It is about machine readable metadata.

Machine readable often turns to RDF. RDF is sophisticated, so the use is often a limited subset.

There are vocabularies such as Dublin Core. There are also schemas that describe how the vocabulary is used.

Then there is how the metadata is actually used.

Metadata and copyright are intertwined. Metadata should include information about the origin of an image, and several copyright licenses require that some information be provided.

The Creative Commons licenses require some specific information. For example, there should be a link to the CC license. Derivative works need to say what was changed. In many cases, these requirements may not be met.

I believe all SVG files should include metadata. It is not hard to add, and it can be useful. Including license data in the image metadata may fulfill licensing requirements or at least provide a colorable defense. Failure to follow all licensing requirements may lead to trouble.[6][7][8]

Moral rights. Even if I do not need to credit an author, there may be a moral obligation to give them credit. Sometimes that moral right can become a legal right. Some contributors allow free use of some or all of their work. LadyofHats is a notable example. That means I can use the work for any purpose, and I do not need to give anyone credit. That does not seem reasonable or even right. I could take Herman Melville's Moby Dick and publish it under my own name. It seems far better to say it is Melville's work.

Providing metadata also makes it easier for someone else to check the licensing rights. Commons encourages everybody (not just its Wikipedia projects) to use the available art. Say Alice uses some CC0 SVG images from Commons on her website. The images are CC0, so Alice does not mention any licensing details. Bob sees Alice's website and likes the images, but how can Bob determine the licensing of the images?

The license check would be simple if the SVG file included the licensing information. Just given an image, it may be hard to find out who made it. If the image has metadata, then that information may be easy to find. The information may not be accurate (somebody may be license washing), but it is a starting point and could serve as a good defense.
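Bob's check could be as small as this sketch, which pulls dc:title and cc:license out of an SVG file's RDF metadata. The sample document is invented, and real files vary in how they structure the metadata, so a miss here does not mean the file is unlicensed.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "cc": "http://creativecommons.org/ns#",
}

def license_info(svg_text: str) -> dict:
    """Return the title and license URI found in the SVG metadata, if any."""
    root = ET.fromstring(svg_text)
    title = root.find(".//dc:title", NS)
    lic = root.find(".//cc:license", NS)
    return {
        "title": title.text if title is not None else None,
        "license": lic.get(f"{{{NS['rdf']}}}resource") if lic is not None else None,
    }

# Invented sample file with a CC0 license link.
sample = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:cc="http://creativecommons.org/ns#">
  <metadata><rdf:RDF><cc:Work rdf:about="">
    <dc:title>Yellow banana</dc:title>
    <cc:license rdf:resource="https://creativecommons.org/publicdomain/zero/1.0/"/>
  </cc:Work></rdf:RDF></metadata>
</svg>"""
print(license_info(sample))
```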

https://www.dublincore.org/specifications/dublin-core/dcq-rdf-xml/

Other metadata

Metadata is not about just copyright information. Metadata can include other relevant information.

For maps, the metadata may include information about the map projection. With that information, one could take the (x, y) location of a point on the map and convert it to the corresponding latitude and longitude.
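For the simplest case, an equirectangular (plate carrée) map, the conversion is linear. The sketch below assumes the metadata supplies the image size and the bounding longitudes and latitudes; other projections need their own inverse formulas.

```python
def xy_to_lonlat(x: float, y: float, width: float, height: float,
                 west: float, east: float, north: float, south: float) -> tuple[float, float]:
    """Convert image coordinates to (longitude, latitude) on an equirectangular map."""
    lon = west + (x / width) * (east - west)
    lat = north - (y / height) * (north - south)   # SVG y grows downward
    return lon, lat

# Hypothetical whole-world map, 1000 by 500 units: the center is (0, 0).
print(xy_to_lonlat(500, 250, 1000, 500, -180.0, 180.0, 90.0, -90.0))
```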

For chemicals, the metadata may include structured descriptions of the chemical.

I'm leery of too much metadata. SVG should be more of an output format rather than a container for detailed information. Providing a small amount of information is reasonable, but including lots of information may be inappropriate. The intended use of SVG is to display an image.

The mess that is xml:lang

The issues with lang and xml:lang. Watch out for accidental captures.

Creative Commons license requirements

Creative Commons licenses are used extensively on Commons and WMF servers.

State the common legal requirements of CC licenses.

  1. CC- requires a link to the CC license. That means it is easy to find the license terms. Check if a CC0 license also has this requirement.
  2. CC-BY must provide reasonable attribution. May distribute and alter. May impose more restrictive license.
  3. CC-SA (implies a derivative work) must not use a more restrictive license and must describe the changes.
  4. CC-ND allows use but not modification.
  5. CC-NC does not allow commercial use. (What is the constraint on commercial? May a nonprofit use the work in its fundraising? May the Girl Scouts use it to sell their cookies? In the US, agency settles some of these questions.)

State the failings.

The file description pages are often inadequate. Sometimes there are gross errors such as an improper license. Derivative works often omit the attribution information in the license. The description of a derivative work often fails to describe the changes made to the original work.

Most file uses on WMF servers satisfy the requirements because MW links the file to its description page:

[[File:Yellow banana.svg|A picture of a yellow banana.]]

Presumably, the file description page has a link to the CC license and meets the attribution and modification requirements.

However, the file use may alter that link (MW:Help:Images).

[[File:Yellow banana.svg|link=https://www.nowhere.com/bitbucket|A picture of a yellow banana.]]

or

[[File:Yellow banana.svg|link=|A picture of a yellow banana.]]

If the override link does not provide the needed licensing information, then the license is violated. There can be disastrous ramifications. MW should not allow such links for CC-licensed material.

Dublin Core and Creative Commons

Reasonable SVG metadata should use both Dublin Core and Creative Commons vocabularies. The metadata can be expressed using RDF.

Dublin Core

A general reference:

It suggests some vocabularies. Looking for "Terman, Frederick" gives the MARC value

Dublin Core provides a vocabulary for references. There are two Dublin Core namespaces:

Sometimes, the dcterms namespace uses the dc prefix. The goal is to use the dcterms vocabulary rather than the 15-element dc namespace. It is possible to translate dc to dcterms (e.g., using XSLT), but that translation may confuse existing software.

Dublin Core elements/1.1/ is a short (15 term), general, vocabulary for works:

  • dc:title (there is also an SVG title element)
  • dc:date
  • dc:creator
  • dc:contributor (I would use for translators)
  • dc:source
  • dc:format (less important) for SVG, use image/svg+xml
  • dc:type (less important) often rdf:resource="http://purl.org/dc/dcmitype/StillImage"
  • dc:publisher (If empty, I would have this point to Wikimedia Commons)
  • dc:subject DC states, "Typically, the subject will be represented using keywords, key phrases, or classification codes. Recommended best practice is to use a controlled vocabulary." I do not see a widely adopted practice here. Most people would probably use a text string of comma-separated keyword phrases. That would match the HTML meta tag: e.g., <meta name="keywords" content="HTML, CSS, Javascript" >. However, the obvious RDF approach would use an rdf:Bag that holds each keyword phrase: <dc:subject><rdf:Bag><rdf:li>HTML</rdf:li><rdf:li>CSS</rdf:li><rdf:li>Javascript</rdf:li></rdf:Bag></dc:subject>. The dcterms: mirror is not a list of keywords.
  • dc:coverage Time or location. Not widely used? E.g., Port Royal earthquake.
  • dc:description
  • dc:identifier
  • dc:language
  • dc:relation
  • dc:rights The clearer practice here would be to use cc:license
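The rdf:Bag form of dc:subject mentioned in the list above can be generated with a short sketch. It uses the standard RDF and Dublin Core elements/1.1/ namespace URIs; whether consuming software understands a Bag here is a separate question.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)   # use the conventional prefixes on output
ET.register_namespace("dc", DC)

def subject_bag(keywords: list[str]) -> ET.Element:
    """Build a dc:subject element holding one rdf:li per keyword phrase."""
    subject = ET.Element(f"{{{DC}}}subject")
    bag = ET.SubElement(subject, f"{{{RDF}}}Bag")
    for keyword in keywords:
        ET.SubElement(bag, f"{{{RDF}}}li").text = keyword
    return subject

print(ET.tostring(subject_bag(["HTML", "CSS", "Javascript"]), encoding="unicode"))
```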

The Dublin Core vocabulary uses general rather than specific terms. For example, the dc:creator predicate covers several possibilities such as author, composer, lyricist, illustrator, and photographer. There are vocabularies that make finer distinctions,[9] but those distinctions may not be necessary for many works, and most applications probably do not support the terms.

Usage examples:

Interesting metadata in

Specifies data types.

Here is the metadata section:

<metadata>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
	<rdf:Description rdf:about="">
		<dc:title xml:lang="ja">DCMIのロゴ</dc:title>
		<dc:creator rdf:resource="https://meta.wikimedia.org/?curid=10484470"/>
		<dc:subject rdf:datatype="http://ndl.go.jp/dcndl/terms/NIISubject">情報学</dc:subject>
		<dc:description xml:lang="ja" rdf:parseType="Resource">
			<dc:format rdf:datatype="http://purl.org/dc/terms/IMT">text/x-wiki</dc:format>
			<rdf:value><![CDATA[
[https://www.dublincore.org/ '''ダブリンコアメタデータイニシアチブ''']({{en|1=Dublin Core Metadata Initiative; DCMI|inline=inline}})のロゴ画像。
{{quote|lang=ja|text=
中央の円はイニシアチブの中核を,それを取り囲む内側の円達は[[w:ja:Dublin_Core#基本記述要素一覧|DCMIメタデータ要素集合]]([https://webdesk.jsa.or.jp/books/W11M0090/index/?bunsyo_id=JIS+X+0836:2005 JIS X 0836:2005])で利用できる15の基本記述要素を,外側の円達は要素集合の解釈及び拡張を,それぞれ表す。
|cite=ダブリンコアメタデータイニシアチブ
|source=[https://www.dublincore.org/about/#web-site-policies-software-logo-banner About DCMI/Web site, policies, software, logo, banner]の[[User:cmplstofB]]による試訳
}}
]]><!--  --></rdf:value>
		</dc:description>
		<dc:contributor rdf:resource="https://www.dublincore.org/"/>
		<dc:date rdf:datatype="http://www.w3.org/2001/XMLSchemadate">2019-09-11</dc:date>
		<dc:type rdf:datatype="http://purl.org/dc/terms/DCMIType">StillImage</dc:type>
		<dc:format rdf:datatype="http://purl.org/dc/terms/IMT">image/svg+xml</dc:format>
		<dc:source rdf:resource="https://www.dublincore.org/images/DCMI_logo_cropped.jpg"/>
		<dc:language rdf:datatype="http://purl.org/dc/terms/ISO639-2">jpn</dc:language>
		<dc:rights rdf:resource="http://www.wtfpl.net/about/"/>
	</rdf:Description>
</rdf:RDF>
</metadata>

Many of the Dublin Core fields are text. They will use rdf:datatype attributes. Some of those use the dcterms vocabulary, but sometimes they use another set of types. Consider dc:date: it uses http://www.w3.org/2001/XMLSchemadate (which looks like http://www.w3.org/2001/XMLSchema#date with the # missing). Some other fields use rdf:resource rather than text. Is there a rewrite rule for rdf:resource? A URI string with a URI datatype?
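
To see which fields are plain text, typed text, or resources, the metadata can be walked mechanically. Below is a minimal sketch, assuming Python's xml.etree.ElementTree. The sample is a trimmed copy of the metadata above (with the # restored in the date datatype), and the function name classify is mine, not part of any tool.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Trimmed sample; the date datatype URI is written with its '#'.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:creator rdf:resource="https://meta.wikimedia.org/?curid=10484470"/>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2019-09-11</dc:date>
    <dc:type rdf:datatype="http://purl.org/dc/terms/DCMIType">StillImage</dc:type>
    <dc:title xml:lang="ja">DCMIのロゴ</dc:title>
  </rdf:Description>
</rdf:RDF>"""


def classify(root):
    """Return (local name, kind, detail) for each Dublin Core element."""
    rows = []
    for elem in root.iter():
        if not elem.tag.startswith("{" + DC + "}"):
            continue
        name = elem.tag.split("}", 1)[1]
        resource = elem.get("{%s}resource" % RDF)
        datatype = elem.get("{%s}datatype" % RDF)
        if resource is not None:
            rows.append((name, "resource", resource))
        elif datatype is not None:
            rows.append((name, "typed literal", datatype))
        else:
            rows.append((name, "plain literal", (elem.text or "").strip()))
    return rows


rows = classify(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

The sketch only reports what it finds; it does not answer whether a resource should be rewritten as a URI-typed string.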

Dublin Core schemas:

Looking at a schema for elements

Looks like an arbitrary sequence of the 15 elements. Looks like the element content is text only (xml:lang attributes are allowed).

  <xs:complexType name="elementType">
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute ref="xml:lang" use="optional"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

Significantly, this declaration does not show using rdf:resource attribute.

I expected the dcterms schema to be more restrictive.

However, the schema states

Encoding schemes are defined as complexTypes which are restrictions of the dc:SimpleLiteral complexType. These complexTypes restrict values to an appropriates syntax or format using data typing, regular expressions, or enumerated lists. In order to specify one of these encodings an xsi:type attribute must be used in the instance document. Also, note that one shortcoming of this approach is that any type can be applied to any of the elements or refinements. There is no convenient way to restrict types to specific elements using this approach.

Here is a dcterms element that substitutes for the dc element, followed by what looks like a type for the W3C Date-Time Format.

<xs:element name="date" substitutionGroup="dc:date"/>

<xs:complexType name="W3CDTF">
  <xs:simpleContent>
    <xs:restriction base="dc:SimpleLiteral">
      <xs:simpleType>
        <xs:union memberTypes="xs:gYear xs:gYearMonth xs:date xs:dateTime"/>
      </xs:simpleType>
      <xs:attribute ref="xml:lang" use="prohibited"/>
    </xs:restriction>
  </xs:simpleContent>
</xs:complexType>

The details are both troubling and confusing. Dublin Core looks like simple text (simpleType). What impact does that have? For multiple authors, one either uses several creator elements or puts the list in simple text. The dcterms set does not provide access to rdf:Seq.

Does common usage of Dublin Core violate the schema?

The schemas, without more, do not do a sensible validation of, for example, date syntax.
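
A stricter date check is easy to write outside the schema. Here is a rough sketch in Python, assuming the profile in the W3C Date and Time Formats note (year, year-month, full date, or date with time and a time zone designator). The name is_w3cdtf is hypothetical, and the time fields are checked only for shape, not for range.

```python
import re
from datetime import date

# Shape of a W3CDTF value: YYYY, YYYY-MM, YYYY-MM-DD, or
# YYYY-MM-DDThh:mm[:ss[.s]]TZD where TZD is Z or +hh:mm / -hh:mm.
W3CDTF = re.compile(
    r"^(?P<year>\d{4})"
    r"(?:-(?P<month>\d{2})"
    r"(?:-(?P<day>\d{2})"
    r"(?:T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(?:Z|[+-]\d{2}:\d{2}))?"
    r")?)?$"
)


def is_w3cdtf(value):
    """Return True if value has W3CDTF shape and names a real calendar date."""
    match = W3CDTF.match(value)
    if not match:
        return False
    month = int(match.group("month") or 1)
    day = int(match.group("day") or 1)
    try:
        # datetime.date rejects month 13, February 30, and year 0000.
        date(int(match.group("year")), month, day)
    except ValueError:
        return False
    return True


for candidate in ("2019-09-11", "2019-09", "2019-02-30", "11 September 2019"):
    print(candidate, is_w3cdtf(candidate))
```

Note that this is narrower than the schema's union of xs:gYear, xs:gYearMonth, xs:date, and xs:dateTime; for example, the XML Schema types allow forms that the W3C note does not.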

A reification from dcterms to elements is clear.

Creative Commons[edit]

Creative Commons adds some terms for specifying the license and attribution:

  • cc:license (said to be the same as xhtml:license; Commons does not allow uploading SVG that uses the xhtml namespace)
  • cc:attributionURL (may be needed for CC-BY; I would have this point to the File: page on Commons)
  • cc:attributionName (may be needed for CC-BY)

For Commons files, making the cc:attributionURL point to the file page on Commons may satisfy the attribution requirements of CC-BY licenses.

Resource Description Framework (RDF) statements have a subject, a predicate, and an object.

Although vocabularies are specified, how those vocabularies should be used is not nailed down. If there are two creators, how should that be specified? Should there be an RDF dc:creator statement for each creator? Should there be one dc:creator statement whose object is a set of the creators? The situation for licenses is more obvious. If the user gets to choose among several licenses, then there should be one cc:license, and the object should be an rdf:Alt that identifies the alternative licenses. However, most software probably expects exactly one license rather than a list of alternatives. The simple approach is to offer only one license.

The lack of consistency implies problems. If a graphics program does not understand the input RDF, then it may get corrupted on output. The appropriate goal is to have metadata that most graphics editors understand. That way, the metadata is preserved during import and export.

Consistency and accuracy are also missing in many Commons licenses.

Say Alice creates a CC-BY-SA image and uploads it to Commons. Bob then reuses Alice's image. Bob is required to use a CC-BY-SA license, and Bob's image must carry attribution to Alice. Many mistakes happen on Commons. Bob's image may not mention Alice's licensed image. Bob may claim his work is CC0 (license washing). Bob may use CC-BY-SA, but he may not point out that Alice must be acknowledged, too. Given license information on Commons may be missing or incomplete, it is no surprise that license metadata may be haphazard, too.

A CC-BY-SA license permits modification (i.e., derivative works). The licenses require the modifier to describe the changes, but Creative Commons does not have a vocabulary term for describing the modifications.

Creative Commons and closure[edit]

Creative Commons does a good job for the original work. The license is declared, and there are constructs for attribution. If a work is used without modification, then the metadata has the information for proper attribution.

The metadata is insufficient when the original work is modified. The license requires that the changes be identified, but there are no XML elements for describing the changes.

List the licenses and the issues.

  • 0
  • -BY
  • -SA
  • -ND
  • -NC

Another issue is how graphics editors can merge metadata.

Adobe Systems XMP[edit]

Adobe XMP uses the elements namespace:

Sigh.

Adobe Systems includes metadata, and it has settled on a specific syntax with its eXtensible Metadata Platform (XMP). Adobe solves the multiple creator problem by always using a set of creators (even if there is only one creator). Adobe also restricts the use of complex RDF syntax.

In XMP, the dc:creator should be an ordered list of ProperName.[10] A ProperName is a simple text value.

<rdf:Description rdf:about="">
  <dc:creator>
    <rdf:Seq>
      <rdf:li>John Smith</rdf:li>
      <rdf:li>Richard Roe</rdf:li>
    </rdf:Seq>
  </dc:creator>
</rdf:Description>

Should discuss the equivalent form.

<cc:Work rdf:about="">
  <dc:creator>Alice</dc:creator>
</cc:Work>
<rdf:Description rdf:about="">
  <rdf:type rdf:resource="http://creativecommons.org/ns#Work" />
  <dc:creator>Alice</dc:creator>
</rdf:Description>
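
The two forms above should yield the same RDF statements: a typed node element is shorthand for an rdf:Description with an rdf:type property. The sketch below checks that for these two snippets only. It assumes Python's xml.etree.ElementTree, handles just literal-valued properties and rdf:type, and the helper names uri and triples are mine.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_TYPE = "{%s}type" % RDF
RDF_DESCRIPTION = "{%s}Description" % RDF

# Wrapper that supplies the namespace declarations for each snippet.
WRAP = ('<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns:cc="http://creativecommons.org/ns#" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/">%s</rdf:RDF>')


def uri(tag):
    """Turn ElementTree's '{namespace}local' tag into 'namespacelocal'."""
    return tag[1:].replace("}", "", 1)


def triples(node):
    """Return the (subject, predicate, object) statements for one node element."""
    subject = node.get("{%s}about" % RDF, "")
    result = set()
    if node.tag != RDF_DESCRIPTION:
        # A typed node element implies an rdf:type statement.
        result.add((subject, RDF + "type", uri(node.tag)))
    for prop in node:
        if prop.tag == RDF_TYPE:
            result.add((subject, RDF + "type", prop.get("{%s}resource" % RDF)))
        else:
            result.add((subject, uri(prop.tag), (prop.text or "").strip()))
    return result


typed = ET.fromstring(
    WRAP % '<cc:Work rdf:about=""><dc:creator>Alice</dc:creator></cc:Work>')[0]
plain = ET.fromstring(
    WRAP % ('<rdf:Description rdf:about="">'
            '<rdf:type rdf:resource="http://creativecommons.org/ns#Work"/>'
            '<dc:creator>Alice</dc:creator></rdf:Description>'))[0]
same = triples(typed) == triples(plain)
print(same)
```

A real comparison would use an RDF parser; this only illustrates why a consumer should accept either spelling.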

Inkscape metadata[edit]

Which namespace does Inkscape use? elements or dcterms? If it uses elements, then it should upgrade. Or at least accept one or the other. I'm looking at a file I believe to be Inkscape, and it has xmlns:dc="http://purl.org/dc/elements/1.1/".

Inkscape has a metadata form to fill in, but Inkscape uses an agent description. (Pull a copy of Inkscape metadata).

<dc:creator>
  <cc:Agent>
    <dc:title>Andy Fitzsimon</dc:title>
  </cc:Agent>
</dc:creator>

Please note that cc:Agent is not part of the http://creativecommons.org/ns# namespace.

Dublin Core has a dc:Agent, so it is possible that Inkscape meant dc:Agent rather than cc:Agent.

I'm not a happy camper...

The page does not include the attributionName or attributionURL elements. It has a set of licenses. It also points to some SIL licenses.

There is a significant but unresolved issue here.[11] An original goal is to identify the license and the creator. Not a lot of information is needed to acknowledge those rights; a simple text reference to a name might be good enough. However, more details can be given about the rights holder, so should the representation give more details? At what point would there be too much information? More information could be added, but very few systems will be able to process that information. The simple approach is to keep the information simple enough to satisfy license requirements and avoid adding extraneous details.

A URL is a better method of identifying a person than some text. Many people have the name John Smith, but the URL https://www.imdb.com/name/nm0808774/ identifies a particular John Smith. Unfortunately, many applications probably expect a text string and cannot handle a URL. If an application expects this input

<dc:creator>John Smith</dc:creator>

then how will it handle this input (i.e., a URL that identifies a particular John Smith)?

<dc:creator rdf:resource="https://www.imdb.com/name/nm0808774/" />
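
A tolerant reader could accept every creator form mentioned in these notes: plain text, an rdf:resource URL, an XMP-style rdf:Seq, and the Inkscape agent with a dc:title. This is a sketch of such a reader, assuming Python's xml.etree.ElementTree. The function name creators and the ('text' | 'uri', value) result shape are my own choices, and I have not checked how Inkscape or Illustrator actually behave.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:dc="http://purl.org/dc/elements/1.1/"
                  xmlns:cc="http://creativecommons.org/ns#">
  <rdf:Description rdf:about="">
    <dc:creator>John Smith</dc:creator>
    <dc:creator rdf:resource="https://www.imdb.com/name/nm0808774/"/>
    <dc:creator><rdf:Seq><rdf:li>John Smith</rdf:li><rdf:li>Richard Roe</rdf:li></rdf:Seq></dc:creator>
    <dc:creator><cc:Agent><dc:title>Andy Fitzsimon</dc:title></cc:Agent></dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def creators(root):
    """Return a list of ('text' | 'uri', value) pairs for every dc:creator."""
    found = []
    for creator in root.iter("{%s}creator" % DC):
        resource = creator.get("{%s}resource" % RDF)
        if resource:
            found.append(("uri", resource))
            continue
        items = creator.findall(".//{%s}li" % RDF)      # rdf:Seq, rdf:Bag, rdf:Alt
        titles = creator.findall(".//{%s}title" % DC)   # agent with a dc:title
        if items:
            found.extend(("text", (li.text or "").strip()) for li in items)
        elif titles:
            found.extend(("text", (t.text or "").strip()) for t in titles)
        elif creator.text and creator.text.strip():
            found.append(("text", creator.text.strip()))
    return found


result = creators(ET.fromstring(SAMPLE))
for kind, value in result:
    print(kind, value)
```

Keeping the kind alongside the value lets an application that only understands text fall back to showing the URL rather than dropping the creator.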

Try this out in Inkscape.... Try this out in Adobe Illustrator....

Well-known licenses[edit]

Creative Commons wants a cc:License element that summarizes the license, but I do not like that practice for well-known licenses. What happens if the summary is inaccurate? Say the license URL is CC-BY-SA 4.0 but the license summary prohibits commercial use? Does the summary take precedence over the URL?

In theory, it should be easy to obtain the RDF description of a well-known license. For example, the license HTML at

has a link in the HTML

  • <link rel="alternate" type="application/rdf+xml" href="rdf" />
    

which refers to the license RDF at

Consequently, an RDF description of a well-known license is available.

It is possible to check whether the license summary is consistent with the published URL.
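
Here is a sketch of such a consistency check in Python. The KNOWN table is an assumption written from memory of the CC BY-SA 4.0 RDF; a real checker should load the terms from the published RDF instead. URL normalization (http versus https, trailing slash) is omitted, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC = "http://creativecommons.org/ns#"

# ASSUMPTION: expected terms for one well-known license, typed in by hand.
KNOWN = {
    "https://creativecommons.org/licenses/by-sa/4.0/": {
        "permits": {"Reproduction", "Distribution", "DerivativeWorks"},
        "requires": {"Notice", "Attribution", "ShareAlike"},
        "prohibits": set(),
    },
}

# A summary that claims CC-BY-SA 4.0 but also prohibits commercial use.
SAMPLE = """<cc:License xmlns:cc="http://creativecommons.org/ns#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    rdf:about="https://creativecommons.org/licenses/by-sa/4.0/">
  <cc:permits rdf:resource="http://creativecommons.org/ns#Reproduction"/>
  <cc:permits rdf:resource="http://creativecommons.org/ns#Distribution"/>
  <cc:permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks"/>
  <cc:requires rdf:resource="http://creativecommons.org/ns#Notice"/>
  <cc:requires rdf:resource="http://creativecommons.org/ns#Attribution"/>
  <cc:requires rdf:resource="http://creativecommons.org/ns#ShareAlike"/>
  <cc:prohibits rdf:resource="http://creativecommons.org/ns#CommercialUse"/>
</cc:License>"""


def summary_terms(license_elem):
    """Collect the permits/requires/prohibits terms stated in a cc:License."""
    terms = {"permits": set(), "requires": set(), "prohibits": set()}
    for kind in terms:
        for stmt in license_elem.findall("{%s}%s" % (CC, kind)):
            resource = stmt.get("{%s}resource" % RDF, "")
            terms[kind].add(resource.rsplit("#", 1)[-1])
    return terms


def inconsistencies(license_elem):
    """Return the terms that differ from the known license, or None if unknown."""
    expected = KNOWN.get(license_elem.get("{%s}about" % RDF))
    if expected is None:
        return None
    actual = summary_terms(license_elem)
    return {kind: actual[kind] ^ expected[kind]
            for kind in expected if actual[kind] != expected[kind]}


problems = inconsistencies(ET.fromstring(SAMPLE))
print(problems)
```

The check flags the mismatch; it does not settle whether the summary or the URL takes precedence.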

Creators also misuse CC-BY licenses on Commons by stating additional license terms. For example, the creator may state that the attribution must appear next to where the image is used. Creative Commons CC-BY licenses require attribution, but the license lets the licensee use any reasonable method of attribution. Here's the text about attribution from CC-BY-SA 4.0:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

If a creator states the license is CC-BY-SA 4.0, then the creator should not be able to state additional requirements. Additional requirements contradict the terms of CC-BY-SA 4.0.

Complicated licenses[edit]

If there is only one author, then license information is simple. When a work builds on others, then the license is complicated.

Consider File:Angriffe Antibiotika.svg. It is an improved copy of FEERERO's CC-SA 3.0 File:Angriffe Antibiotika.png. In addition, it borrows from

Consequently, the metadata should contain a lot of information.

A trivial solution has the metadata element just point to the Commons File: page.

Metadata checker[edit]

A few years ago, I did some tests on RDF XML validation.

A sophisticated metadata checker could...

  • look for appropriate namespaces
  • check value consistency (ISO dates, finite set ranges)
  • calculate a list of ranges
  • validate XML schemas (valid RDF, valid CC, ...)
  • learn frequency of metadata (lots of image/svg+xml but little cc:creator)
  • possibility of rewriting metadata
  • possibility of adding metadata
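
The first item, looking for appropriate namespaces, could be sketched as follows. It uses the start-ns event of ElementTree's iterparse, which reports each namespace declaration. The EXPECTED table and the function names are mine; the list of namespaces worth checking is an assumption.

```python
import io
import xml.etree.ElementTree as ET

# Namespaces a checker might look for (labels are mine).
EXPECTED = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "RDF",
    "http://purl.org/dc/elements/1.1/": "Dublin Core elements",
    "http://purl.org/dc/terms/": "Dublin Core terms",
    "http://creativecommons.org/ns#": "Creative Commons",
}

SAMPLE = b"""<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata><rdf:RDF><rdf:Description rdf:about="">
    <dc:title>Test</dc:title>
  </rdf:Description></rdf:RDF></metadata>
</svg>"""


def declared_namespaces(svg_bytes):
    """Map each declared prefix to its namespace URI."""
    declared = {}
    events = ET.iterparse(io.BytesIO(svg_bytes), events=("start-ns",))
    for _event, (prefix, uri) in events:
        declared[prefix] = uri
    return declared


def report(svg_bytes):
    """Say which of the expected metadata namespaces the file declares."""
    uris = set(declared_namespaces(svg_bytes).values())
    return {label: (uri in uris) for uri, label in EXPECTED.items()}


summary = report(SAMPLE)
for label, present in summary.items():
    print(label, present)
```

Run over many files, the same report would also give the frequency figures mentioned in the list.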

My notes on RDF. GRDDL.

Related topics[edit]

A more general topic, SVG validation, is covered at Help:SVG. There were more involved discussions about SVG validation. Validation is often too strict (complaining about extensions such as Inkscape or new SVG features).

Commons:Overwriting existing files.

Removing metadata[edit]

Removing metadata from an SVG file (or any other file) may be inappropriate. Removing metadata may trigger legal issues but leaving it in has minimal cost.

Consider a signed painting. Should someone come along and paint over the signature?

Removing metadata is similar to removing a watermark. See Legal issues with the removal of watermarks and Removal of watermarks from Commons images. WMF legal staff opines that removing a watermark could violate the DMCA and even violate the terms of some Creative Commons licenses.

Compare to removing a watermark that was not part of the original image. Sometimes a person who hosts an image may add a watermark.

Sometimes metadata is inadvertently removed. When librsvg produces a PNG, I doubt that it copies metadata from the SVG to the PNG.

Perhaps a scan of files should look for comments that imply an optimizer was used. Optimizers often strip metadata. See Massive edits. For example, scanning history of File:BSicon hKRZWa red.svg would reveal the upload comment "Slimmed down with svgomg".

Other uses of metadata[edit]

Images on Commons should have free licenses, but many uploads violate the creator's license. Generally, Commons relies on its users to upload only free material.

Some of that checking can be done automatically. Consider an image that is published on some website, and the website states a non-free license for that image. Alice likes the image, so she uploads it to Commons claiming it is her own work. Commons does not know.

Now consider that the image has metadata that says the creator is Bob and the license is CC-BY-NC. Commons could read the metadata, realize it does not know that Alice is Bob, and recognize the CC-BY-NC license is not compatible with Commons. Thus Commons could refuse the upload automatically.
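
A rough sketch of that upload check, in Python. It assumes the license is given as a Creative Commons URL and that Commons rejects the NC and ND elements; the function names and the problem strings are hypothetical, and nothing here reflects what the Commons upload code actually does.

```python
from urllib.parse import urlparse


def cc_license_elements(license_url):
    """Return the element codes in a CC license URL, e.g. {'by', 'nc'}."""
    parsed = urlparse(license_url)
    if parsed.netloc.lower() not in ("creativecommons.org", "www.creativecommons.org"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "licenses":
        return set(parts[1].lower().split("-"))
    if len(parts) >= 2 and parts[0] == "publicdomain":
        return {parts[1].lower()}  # 'zero' or 'mark'
    return None


def upload_problems(uploader, metadata_creator, license_url):
    """List reasons to question an upload, based only on embedded metadata."""
    problems = []
    elements = cc_license_elements(license_url)
    if elements is None:
        problems.append("license not recognized")
    else:
        if "nc" in elements:
            problems.append("non-commercial license")
        if "nd" in elements:
            problems.append("no-derivatives license")
    if metadata_creator and metadata_creator != uploader:
        problems.append("creator in metadata differs from uploader")
    return problems


found = upload_problems("Alice", "Bob", "https://creativecommons.org/licenses/by-nc/4.0/")
print(found)
```

A mismatch between creator and uploader is only a hint, since Alice may legitimately upload Bob's freely licensed work; the license test is the stronger signal.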

At upload, Commons could also notice that a work is CC-BY-SA with required attribution. Commons could fill in the attribution details.

Graphics applications might also warn users about editing files that carry CC-BY-ND licenses.

Removing watermarks[edit]

An unexpected issue. Instead of watermarks or timestamps, the information should be put in the file's metadata.

Detection of this information? Small fonts, strange fill colors, and outside of viewport.

Detect symbol candidates[edit]

Artists often copy-paste an image component rather than creating and using a symbol.

Also works for text-to-curve images.
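
One crude way to find candidates is to count path elements whose outlines are identical apart from where they start. The sketch below assumes paths written with a relative moveto (the m x,y form that Inkscape emits) and ignores transforms. The names shape_key and symbol_candidates are mine.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

SVG = "http://www.w3.org/2000/svg"

# Leading relative moveto, e.g. "m 381.84,1384.71".
START = re.compile(r"^m\s*-?[\d.]+(?:e-?\d+)?[\s,]+-?[\d.]+(?:e-?\d+)?")

SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <path d="m 10,10 5,0 0,5 z"/>
  <path d="m 40,20 5,0 0,5 z"/>
  <path d="m 0,0 9,9"/>
</svg>"""


def shape_key(d):
    """Normalize whitespace and move the starting point to the origin."""
    d = " ".join(d.split())
    return START.sub("m 0,0", d, count=1)


def symbol_candidates(root, minimum=2):
    """Return outlines that occur at least `minimum` times, with their counts."""
    counts = Counter(
        shape_key(path.get("d"))
        for path in root.iter("{%s}path" % SVG)
        if path.get("d")
    )
    return {key: n for key, n in counts.items() if n >= minimum}


candidates = symbol_candidates(ET.fromstring(SAMPLE))
print(candidates)
```

For text converted to curves, repeated letters would show up the same way, provided the converter wrote each glyph with identical relative coordinates.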

Use styles[edit]

There is a difference between content and style. The content is the information, and the style is how it is displayed. Content that is in a particular class may be displayed the same way by using CSS to select and style SVG elements in that class.

In general, it is better to use CSS to achieve a consistent display rather than individually formatting graphics elements.

In particular, elements should not be grouped merely to impose a consistent style.

Consider a map. We may want the rivers and the names of rivers displayed in blue. A river class can set the color for both the rivers and the font fill. Cities with a population under 100,000 may use a small dot and a small font, and cities over that size may use a larger dot and a larger font. CSS can set the font size and even the radius of a circle. Capital cities may use a star instead of a dot.

Using CSS and the class attribute can make the display both consistent and easy to change. Fill colors and font families are set in just the CSS rather than on each SVG element. Changing the CSS will apply the change to all elements in the class.

Graphics editors should have a way to manage styles, but they may not round-trip them.

Media queries[edit]

SVG allows images to adapt. Printing a color image in black and white may not produce a satisfactory result. A red fill may look similar to a dark gray fill. SVG can use CSS media queries and adjust the presentation.

Consider the image on the right. If the media supports color, then the picture can have a blue background with white text and lines. If the media is black and white, then the background can be white, and the text and lines can be black. Mechanical drawings can be more complex. For color media, solid color fills may distinguish different components; for black and white media, crosshatches may distinguish the components.

An SVG style element has a media attribute. CSS syntax allows @media queries.

SVG 1.1 / CSS 2 media query support is very limited. SVG 2.0 is much richer. There is some support, but it may not work well. In tests tried around 2019, one browser could distinguish color and monochrome requests, but it would not follow changes in the printer properties.

Consider these images:

The color is nice, but when the image is printed on a black and white printer, the colors will be levels of gray.

Media queries can change the presentation on a black and white printer.

Width, height, and viewBox[edit]

These svg element attributes can cause problems.

Specifying width and height gives a concrete size for the image and also implies clipping. If such an image is put in a small container, then that container can be scrolled. For example, File:2022 Russian invasion of Ukraine.svg is a detailed file where zooming and panning make sense. To see the fine detail in the image, it must be scrolled.

Specifying viewBox sets a view port on the SVG and does not imply clipping. Parts of the image that are outside of the view port are still visible. Putting such an image in a small container just sizes the image to fit; it does not imply scrolling.

Errors and validation[edit]

Validation is often overly strict: issues cited as errors are often reasonable extensions. For example, HTML 5 allows data-* attributes, and those attributes have found their way into the SVG 2.0 drafts. An SVG 1.1 validator will list them as errors, but they are harmless.

I'm on the fence with some other errors. Some SVG tools emit invalid XML identifiers (e.g., an identifier beginning with a digit is invalid; for example, File:Diagram_of_IGNORE.svg). Most XML implementations will handle such identifiers, so they are not a big deal. However, it may also be reasonable to fix these legitimate errors. What if some future XML spec required implementations to throw an exception when encountering such identifiers? Similarly, duplicate identifiers (such as those emitted by SVG Translate) are errors that may be reasonable to fix. (Do the duplicate identifiers confuse SVG Translate?)
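
Both kinds of identifier error are easy to list. This is a minimal sketch, assuming Python's xml.etree.ElementTree (a non-validating parser, so it reads such files without complaint). The test for a bad identifier is a deliberate simplification of the XML Name rules: it flags only a leading digit, hyphen, or period, and any whitespace.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified check, not the full XML Name production.
BAD_ID = re.compile(r"^[0-9.-]|\s")

SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <g id="1abc"/>
  <g id="a"/>
  <g id="a"/>
  <g id="ok_1"/>
</svg>"""


def id_problems(root):
    """Return (invalid ids, duplicated ids), each as a sorted list."""
    ids = [elem.get("id") for elem in root.iter() if elem.get("id") is not None]
    invalid = sorted({i for i in ids if i == "" or BAD_ID.search(i)})
    duplicate = sorted(i for i, n in Counter(ids).items() if n > 1)
    return invalid, duplicate


invalid, duplicate = id_problems(ET.fromstring(SAMPLE))
print(invalid, duplicate)
```

Fixing the identifiers is harder than finding them, because every url(#...) and href reference to a renamed id must be updated as well.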

SVG text[edit]

In general, the text within an SVG file should be in SVG text elements. Avoid converting text to paths/curves. Such path text expands the size of the file, and it is often unnecessary. Artistic text (such as used in logos) may need to be converted to curves.

In general, if an SVG file contains text, then users should be able to copy and paste that text from the SVG file. A simple test is to load the SVG file into a browser and then try to select all the text (control-A in Windows). If no text is selected, then the diagram's text has been converted to curves.

The text that is selected should be readable, grouped appropriately, and spaced correctly. Independent phrases should be in their own text element; they should not be combined with other phrases. Independent phrases that need two lines should not use two text elements but rather code the lines in tspan elements. That keeps the phrase together.

In addition, the selected text should not be missing spaces or have extra spaces. If the text is displayed on two lines, then it should have a space between those two lines. For example, the better result is "Holy Roman Empire" rather than "Holy RomanEmpire". Unfortunately, SVG does not handle spaces well. Spaces at the beginning or end of a line may not align as expected (the SVG hanging space problem of text-anchor).

<text><tspan>Holy Roman</tspan><tspan x="0" dy="20">Empire</tspan></text>
<text><tspan>Holy Roman </tspan><tspan x="0" dy="20">Empire</tspan></text>
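
The difference between the two lines can be checked without a browser by concatenating the character data of the text element. That is only an approximation of what a browser puts on the clipboard, and the function name copied_text is mine.

```python
import xml.etree.ElementTree as ET


def copied_text(text_elem):
    """Join all character data inside a text element, in document order."""
    return "".join(text_elem.itertext())


without_space = ET.fromstring(
    '<text><tspan>Holy Roman</tspan><tspan x="0" dy="20">Empire</tspan></text>')
with_space = ET.fromstring(
    '<text><tspan>Holy Roman </tspan><tspan x="0" dy="20">Empire</tspan></text>')

print(copied_text(without_space))  # Holy RomanEmpire
print(copied_text(with_space))     # Holy Roman Empire
```

A scan for missing spaces could flag any tspan boundary where a letter is immediately followed by a letter.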

Sometimes, text is spaced for emphasis. For example, a map of the United States may have text that looks like U n i t e d  S t a t e s. That text should copy and paste without the additional spaces. Instead of inserting actual spaces to achieve the effect, the graphic artist should set the letter-spacing of the string. Furthermore, do not space text by individually placing the characters. That makes the text difficult to translate, and it may render poorly when fonts are substituted. Use the mechanisms that SVG provides.

Similarly, a string that displays as all capital letters should use text-transform: uppercase. For example, UNITED STATES uses a text transform and will copy-paste as "United States". There are other text transforms, but they are less useful.

Sadly, text-transform is not very smart. The result of text-transform: capitalize applied to "united states of america" is "United States Of America". It also does not follow some language conventions such as capitalizing a Dutch "IJ": "ijland" becomes "Ijland" rather than "IJland"; it works with the ij ligature character (U+0133): "ĳland" becomes "Ĳland".

The perils of hidden text. It can confuse editors. Any hidden elements can cause confusion.

Fonts[edit]

Point to section about fonts and what scaling them means.

A good example of the benefits of nonlinear scaling of fonts is a bar code font. The font symbols are scaled to integer pixel widths. The symbols use Manhattan geometry, so the edges are sharp; no anti-aliasing is needed. The strict symbol geometries are maintained.

Recommend the CSS generic fonts serif and sans-serif. If possible, do not use exotic fonts.

WMF also has problems because of librsvg. There are times that we want a text string to be an exact length. SVG supports that with the textLength attribute, but librsvg does not support it.

In practice, there is not much support for particular font properties.

font-family
depends on system. Use CSS fallback.
font-size
specific size has excellent support; relative sizes may not be supported
font-style
normal, italic. Good support. Oblique....
font-weight
normal, bold. Other options have little support.
font-stretch
normal. Other options have little support.
font-variant
all purpose OTF support depends on font.
font-size-adjust
mumble.
baseline....

It may be appropriate to fix some SVG files. For example, File:Planets2013.svg uses font-family="Arial-BoldMT" and does not have font-weight="bold". Most font matchers will fail. There was a Phabricator request to have MediaWiki do that automatically, but it would be better as a robot task.

The WMF font list:

Character encoding[edit]

Unicode and Symbol encodings. The two lines should be similar if a Symbol encoding font is available.

Unicode is common now, so most SVG files will use Unicode or a Unicode-compatible subset. In practice, that means UTF-8, but UTF-16 is also a possibility. UTF-16 wants a byte-order mark (BOM), and some UTF-8 files will also include a BOM. Software should handle those cases.
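
Handling the BOM cases could look like the sketch below, using the constants in Python's codecs module. It ignores the XML encoding declaration and UTF-32, so it covers only the cases named above; the function name decode_svg is mine.

```python
import codecs

# BOM prefixes and a codec that consumes each one.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_svg(data):
    """Decode SVG bytes, honoring a UTF-8 or UTF-16 byte-order mark if present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    return data.decode("utf-8")  # no BOM: assume UTF-8


print(decode_svg(codecs.BOM_UTF8 + b"<svg/>"))
print(decode_svg("<svg/>".encode("utf-16")))
```

The utf-16 codec reads the BOM to pick the byte order and then drops it, and utf-8-sig drops a UTF-8 BOM, so the decoded string never starts with U+FEFF.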

Even though a file may claim to be Unicode, that does not mean the file uses Unicode. There are many special fonts that put exotic glyphs in non-Unicode character positions. The Adobe Symbol font, for example, uses its own character encoding.[12] Zapf Dingbats[13] and Adobe Sonata[14] also use their own encodings.

Symbol
ABCDEabcde → ABCDEabcde
Zapf Dingbats, Wingdings
ABCDEabcde → ABCDEabcde
Sonata
ABCDEabcde → ABCDEabcde

Even common fonts may have non-Unicode character assignments.[15] For example, many Adobe fonts use the Adobe Standard Encoding[16] which puts a dagger at 0xD1 (Ñ instead of U+2020: †) and the "fi" ligature at 0xAE (® instead of U+FB01: fi).

Files that claim to be Unicode but use non-Unicode fonts should be recoded with Unicode fonts. Font substitution may not work when fonts use non-Unicode character encodings.

See Phab:T272133 Make all Postscript core 35 fonts available to SVG by installing some packages.

Adobe Font list. https://adobe-type-tools.github.io/font-tech-notes/pdfs/5090.FontNameList.pdf

Courier, Helvetica, Times, Symbol, Avant Garde Gothic, Bookman, New Century Schoolbook, Palatino, Zapf Chancery, Zapf Dingbats.

Files that use less common character encodings (such as Shift-JIS) do not need to be recoded if they use Unicode fonts. XML files that use such encodings can convert the text to Unicode.

Detecting non-Unicode files would be involved. The first step is converting to Unicode. The encoding declaration in the XML declaration should be authoritative, and it offers a clear route to convert an XML file to Unicode. The XML DOM should automatically convert a known charset to a DOMString, which is essentially Unicode. (XML DOM now hides the character encoding.)

The second step is searching for fonts within an SVG file. If a font is Unicode, then the text content is OK. If the font uses a non-Unicode charset, then the text content should be searched for non-Unicode characters. If no non-Unicode assignments are used, then the text content is OK. If non-Unicode assignments are used, then select a Unicode font replacement, and edit the text content to change the non-Unicode characters to equivalent Unicode characters.

Those steps require a significant database.

  1. font family
  2. character encoding (points to a (possibly standard) table)
  3. replacement font
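
A tiny version of that database and the second step might look like this in Python. The table holds a single font (Symbol) and only the ten letters used in the sample line above, which map to Greek letters in the Symbol encoding. The replacement font choice and the names FONT_TABLE and recode are assumptions for illustration.

```python
# Symbol encoding for the letters A-E and a-e (Latin code points -> Greek).
SYMBOL_TO_UNICODE = str.maketrans({
    "A": "\u0391", "B": "\u0392", "C": "\u03A7", "D": "\u0394", "E": "\u0395",
    "a": "\u03B1", "b": "\u03B2", "c": "\u03C7", "d": "\u03B4", "e": "\u03B5",
})

# font family -> (character table, replacement font); hypothetical entries.
FONT_TABLE = {
    "symbol": (SYMBOL_TO_UNICODE, "serif"),
}


def recode(font_family, text):
    """Return (font family, text), recoded to Unicode if the font is in the table."""
    entry = FONT_TABLE.get(font_family.strip().strip("'\"").lower())
    if entry is None:
        return font_family, text  # assume the font is already Unicode
    table, replacement = entry
    return replacement, text.translate(table)


print(recode("Symbol", "ABCDEabcde"))
print(recode("DejaVu Sans", "ABCDEabcde"))
```

Characters missing from a table pass through unchanged, so an incomplete table silently leaves some text wrong; a real tool should report unmapped characters.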

Character sets

see w:Mojibake

Character metrics[edit]

Differences in font metrics are expected. The character widths will vary. The cap-height and x-height will vary.

But there are technical placement issues that spell trouble.

  • Find the Phabricator item about SVG music and fonts.

The first music font was Adobe Sonata. It uses its own 8-bit character encodings. The common staff height is 1 em. The notehead centers were set on the baseline. So we can get the simple

  • Sonata
    =qq=

Unicode defined a music block, but it did not specify the glyph sizes or positions. Google's Noto Music font has different note positioning but similar stem heights:

  • Noto Music
    𝄚𝅘𝅥𝅘𝅥𝄚

The SMuFL Bravura font also uses the Unicode block character assignments, but it uses the Sonata note positions.

  • Bravura
    𝄚𝅘𝅥𝅘𝅥𝄚

SMuFL fonts should have a JSON file that specifies the metrics. Reading that file should make all SMuFL fonts compatible, but many tools do not read the JSON files.

Other Unicode music fonts are less structured. A notehead may not fit between staff lines. The staff height may not be 1 em.

  • default music font
    𝄚𝅘𝅥𝅘𝅥𝄚

Path text[edit]

Talk about what path text looks like in some contexts.

There are files that should have path text removed.

SVG files that have converted text to path are often marked with {{Path text SVG}}. Another (earlier?) convention was to explicitly categorize files that should use text to Category:Convert to TXT. The category has JPEG, PNG, and SVG files in it. Ah! It's from {{ShouldBeText}}. That template wants the figures to be converted to wikitext rather than using an illustration. It may be better to mark some files with {{Path text SVG}} or {{Convert to SVG}}.

Here is a file in the category:

Not only has the file converted the text to paths, but each letter is also a symbol, and the text is typeset by placing those symbols. This file is also interesting in that it describes a technical standard, and the text strings in the file are candidates for translate="no".

Another file:

Files with even more confusion (text as symbols and curves drawn as line segments):

ARIA label[edit]

Some Inkscape files have aria-label attributes.

Here is some curve text from the SVG:

      <g
         style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:12px;line-height:0%;font-family:Arial;-inkscape-font-specification:Arial;text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;display:inline;fill:#646464;fill-opacity:1;stroke:none;stroke-width:0.999998"
         id="text3787"
         aria-label="Uusimaa">
        <path
           id="path6751"
           style="font-size:19.715px;line-height:1.25;stroke-width:0.999998"
           d="m 381.84099,1384.7132 ... z" />
        <path
           id="path6753"
           style="font-size:19.715px;line-height:1.25;stroke-width:0.999998"
           d="m 393.30611,1398.8256 ... z" />
        <path
           id="path6755"
           style="font-size:19.715px;line-height:1.25;stroke-width:0.999998"
           d="m 396.88715,1395.774 ... z" />
        <path
           id="path6757"
           style="font-size:19.715px;line-height:1.25;stroke-width:0.999998"
           d="m 407.44738,1386.7058 ... z" />
        <path
           id="path6759"
           style="font-size:19.715px;line-height:1.25;stroke-width:0.999998"
           d="m 411.82742,1398.8256 ... z" />
        <path
           id="path6761"
           style="font-size:19.715px;line-height:1.25;stroke-width:0.999998"
           d="m 434.94057,1397.5645 ... z" />
        <path
           id="path6763"
           style="font-size:19.715px;line-height:1.25;stroke-width:0.999998"
           d="m 445.91473,1397.5645 ... z" />
      </g>

Observations:

  • The DOM can reconstruct some details.
  • The aria-label gives most of the text (it may be multiline).
  • Various font properties can be used to determine font, size, style, fill, and anchor.
  • The SVGDOM can determine the bounding box of the g element.
  • The SVGDOM can determine the bounding box of the synthesized text element.
  • The SVG has some contradictions such as font size being both 12 and 19.715 pixels and line height of 0 or 1.25.
  • The SVG has irrelevant stroke information.
  • The CORS issue....
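
Part of the reconstruction can be sketched without a rendering engine: read the aria-label, parse the style attribute, and build a text element that carries the font properties. The bounding-box step needs an SVG DOM and is left out, so the new element has no position. The function names are mine, and the sample group is a shortened stand-in for the one above.

```python
import xml.etree.ElementTree as ET

SAMPLE = ('<g style="font-size:12px;font-family:Arial;text-anchor:middle;'
          'fill:#646464;stroke:none" id="text3787" aria-label="Uusimaa">'
          '<path id="path6751" style="font-size:19.715px" d="m 0,0 z"/></g>')


def style_dict(style):
    """Parse a CSS declaration list such as 'a:b;c:d' into a dict."""
    props = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep:
            props[name.strip()] = value.strip()
    return props


def text_from_group(group):
    """Build an unpositioned text element from a g that has an aria-label."""
    label = group.get("aria-label")
    if label is None:
        return None
    props = style_dict(group.get("style", ""))
    text = ET.Element("text")
    # Only the group's own style is used; the conflicting path font-size is ignored.
    for name in ("font-family", "font-size", "fill", "text-anchor"):
        if name in props:
            text.set(name, props[name])
    text.text = label
    return text


rebuilt = text_from_group(ET.fromstring(SAMPLE))
print(ET.tostring(rebuilt, encoding="unicode"))
```

Given the contradictory font sizes, the size taken from the group may be wrong; fitting the synthesized text to the bounding box of the original paths would be the better test.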

SVG with hidden text[edit]

Some graphics artists have hidden text and visible path text.

Generally, hidden text may indicate path text. Consider the cases

  • hidden text (display="none" or visibility="hidden")
  • text outside of the viewport
    often used for notes or example graphics
    may be part of a clipped or zoomed=in image

The difference between display and visibility
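
The sketch below shows the difference as I understand it: display="none" on an ancestor removes the whole subtree, and a descendant cannot turn it back on, while visibility is inherited but a descendant may override it. It looks only at the display and visibility presentation attributes (not style attributes or CSS rules), and the function name hidden_text is mine.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"

SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <g display="none"><text display="inline">GHANA</text></g>
  <g visibility="hidden">
    <text>BENIN</text>
    <text visibility="visible">TOGO</text>
  </g>
  <text>Kara</text>
</svg>"""


def hidden_text(elem, display_off=False, visibility="visible", found=None):
    """Collect the content of text elements that would not be rendered."""
    if found is None:
        found = []
    # display="none" on any ancestor hides the element for good.
    display_off = display_off or elem.get("display") == "none"
    # visibility inherits, but the element's own value wins.
    own = elem.get("visibility")
    if own is not None and own != "inherit":
        visibility = own
    if elem.tag == "{%s}text" % SVG and (display_off or visibility != "visible"):
        found.append("".join(elem.itertext()).strip())
    for child in elem:
        hidden_text(child, display_off, visibility, found)
    return found


hidden = hidden_text(ET.fromstring(SAMPLE))
print(hidden)
```

In the sample, GHANA stays hidden despite its display="inline", which matches the pattern in the map files discussed in this section.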

Hidden text may be a standard practice in Category:SVG labeled maps of administrative divisions (location map scheme), which has more than 2,500 files. Styles should be used instead of concrete formatting, but Inkscape makes that difficult.

An example file is File:Togo,_administrative_divisions_-_de_-_colored.svg.

<g id="TT" display="none">
	<g id="TT_Countries" display="inline">
		<text transform="matrix(1 0 0 1 8.2725 236.5)" fill="#646464" font-family="'DejaVuSans-Bold'" font-size="18">GHANA</text>
		<text transform="matrix(1 0 0 1 41.6494 16.6387)" fill="#646464" font-family="'DejaVuSans-Bold'" font-size="18">BURKINA FASO</text>
		<text transform="matrix(1 0 0 1 292.7539 119.7319)" fill="#646464" font-family="'DejaVuSans-Bold'" font-size="18">BENIN</text>
	</g>
	
		<text id="TT_Sea" transform="matrix(0.75 0 0 1 239.7402 608.8057)" display="inline" fill="#0978AB" font-family="'DejaVuSerif-BoldItalic'" font-size="16">ATLANTISCHER OZEAN</text>
	<g id="TT_Regions" display="inline">
		<text transform="matrix(0.8 0 0 1 97.8931 73.6895)"><tspan x="0" y="0" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="22">Savanes</tspan><tspan x="93.168" y="0" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="22" letter-spacing="33.75">	</tspan></text>
		<text transform="matrix(0.8 0 0 1 173.4487 197.23)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="22">Kara</text>
		<text transform="matrix(0.8 0 0 1 172.1001 299.9858)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="22">Centrale</text>
		<text transform="matrix(0.8 0 0 1 181.4487 442.9453)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="22">Plateaux</text>
		<text transform="matrix(0.8 0 0 1 188.5601 537.9512)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="22">Maritime</text>
	</g>
	<g id="Nmbrs_Regions" display="inline">
		<text transform="matrix(0.8 0 0 1 135.1401 110.792)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="50">1</text>
		<text transform="matrix(0.8 0 0 1 173.4487 197.23)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="50">2</text>
		<text transform="matrix(0.8 0 0 1 188.5601 311.9995)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="50">3</text>
		<text transform="matrix(0.8 0 0 1 203.7798 450.9453)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="50">4</text>
		<text transform="matrix(0.8 0 0 1 223.1953 549.5059)" fill="#646464" font-family="'DejaVuSansCondensed-Bold'" font-size="50">5</text>
	</g>
</g>
<g id="PIX">
	<g id="TT_Countries_1_">
      ...
    </g>
    ...
</g>

It has particular id attributes. Does TT stand for TrueType?

Most of the text should be anchored to the middle rather than the start.

The map files are using the Adobe method of burying font characteristics in the font name. The font specification DejaVuSansCondensed-Bold should be font-family="DejaVu Sans, sans-serif" font-weight="bold" font-stretch="condensed". In addition, the transform x-scale should be 1 rather than 0.8; automatic detection may be difficult.
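The font-name rewrite described above can be sketched in Python. This is a minimal sketch, not an existing tool: the function name split_postscript_name and the FAMILIES lookup table are assumptions made for illustration, and the table only covers the base names that appear in the Togo map.

```python
# Assumed lookup table: PostScript-style base names seen in the Togo map,
# mapped to a CSS family list. Extend it for other fonts.
FAMILIES = {
    "DejaVuSans": "DejaVu Sans, sans-serif",
    "DejaVuSerif": "DejaVu Serif, serif",
}


def split_postscript_name(ps_name):
    """Split an Adobe-style font name into separate CSS font properties."""
    name = ps_name.strip("'\" ")            # drop the inner quotes used in the SVG
    base, _, style = name.partition("-")    # "DejaVuSansCondensed", "Bold"
    props = {}
    if base.endswith("Condensed"):
        base = base[:-len("Condensed")]
        props["font-stretch"] = "condensed"
    # Unknown base names are passed through unchanged.
    props["font-family"] = FAMILIES.get(base, base)
    if "Bold" in style:
        props["font-weight"] = "bold"
    if "Italic" in style:
        props["font-style"] = "italic"
    elif "Oblique" in style:
        props["font-style"] = "oblique"
    return props


for ps_name in ("'DejaVuSans-Bold'", "'DejaVuSerif-BoldItalic'", "'DejaVuSansCondensed-Bold'"):
    print(ps_name, split_postscript_name(ps_name))
```

The sketch does not touch the 0.8 x-scale in the transform; deciding whether a scale was meant to fake a condensed face would still need a human.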

Should SVG with such font names be edited?

Notice the SVG uses groups to impose consistent style information. Using structural hierarchy to impose style is an odd practice. The text is not grouped with the other graphics but rather with siblings. Separating one region would involve exploding several groups to remove irrelevant regions.

Classifier to style? These files may be simple. Would like to do some normalization (such as pulling out a default fill for the text).

Another hidden-text example is File:Cochlea-crosssection.svg. It also has the unsupported flowRoot element, so the hidden text may not be such a bad thing.

tref element

SVG user agents did not implement the tref element, and the SVG 2.0 specification drops the element. As specified, the element does not appear useful. Duplicating rather than referencing text is simple enough.

Consider a map. Most place names (such as towns and cities) will be used once, so a reference would not be useful. A generic feature such as an airport may appear several times, but it can use a symbol to repeat the text.

Rivers, on the other hand, are long, so they may be labeled on a map a few times. For example, the Amazon may appear on a map several times. One could use a symbol to do the repeat, but rivers are often labeled using a path that follows the river. A tref might be convenient for labeling the river at several places along its twisting length.

<defs>
  <text id="amazontext">Amazon River</text>
  <path id="amazonpath1" d="..."/>
  <path id="amazonpath2" d="..."/>
  <path id="amazonpath3" d="..."/>
</defs>
<text><textPath xlink:href="#amazonpath1"><tref xlink:href="#amazontext"/></textPath></text>
<text><textPath xlink:href="#amazonpath2"><tref xlink:href="#amazontext"/></textPath></text>
<text><textPath xlink:href="#amazonpath3"><tref xlink:href="#amazontext"/></textPath></text>

However, the approach is not suitable for switch translations.

A textPath element may have tref children.

The target of tref is not well specified. It sounds like it could point to any element with text content, but that does not mean that pointing to a switch element would provide just the rendered text.

A construct using conventional translation tools and its:term would work better.
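Since user agents do not implement tref, duplicating the referenced text is the practical route. A minimal sketch with Python's standard xml.etree.ElementTree follows. The function name expand_trefs is an assumption, and the path data in the sample is a made-up placeholder because the original d attributes are elided.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
XLINK = "http://www.w3.org/1999/xlink"


def expand_trefs(root):
    """Replace each tref element with a copy of the text it references."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    for parent in list(root.iter()):
        previous = None  # last sibling kept before the current child
        for child in list(parent):
            if child.tag != f"{{{SVG}}}tref":
                previous = child
                continue
            target = by_id.get(child.get(f"{{{XLINK}}}href", "").lstrip("#"))
            copied = "".join(target.itertext()) if target is not None else ""
            copied += child.tail or ""
            # ElementTree stores text either on the parent or as a sibling's tail.
            if previous is None:
                parent.text = (parent.text or "") + copied
            else:
                previous.tail = (previous.tail or "") + copied
            parent.remove(child)
    return root


# Hypothetical sample; the path data is a placeholder.
doc = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<defs><text id="amazontext">Amazon River</text>'
    '<path id="amazonpath1" d="M0,0 L100,0"/></defs>'
    '<text><textPath xlink:href="#amazonpath1">'
    '<tref xlink:href="#amazontext"/></textPath></text>'
    '</svg>'
)
expand_trefs(doc)
print(ET.tostring(doc, encoding="unicode"))
```

The sketch copies plain text only. If the target were a switch element, it would copy every language alternative, which is the same ambiguity noted above.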

Glyphs

A (non) candidate for glyphs. 115 kB.

SVG 1.0 and 1.1 have SVG elements that allow a user to embed a font.

For those cases where converting text to curves makes sense, using glyphs offers potential benefits.

Discussion at Commons:Graphic_Lab/Illustration_workshop/Archive/2021#Vietnamese-style_seal_of_the_Government-General_of_French_Indo-China (and several sections immediately following)

Also https://commons.wikimedia.org/w/index.php?title=Commons:Graphic_Lab/Illustration_workshop&oldid=753110204#The_seals_of_ministries_of_the_Nguy%E1%BB%85n_Dynasty

General information about w:en:Seal script.

The seal is 115 kB for 15 characters. That is about 7700 bytes per character, which is rather large. Using the path element, one should be able to describe a line segment in less than 100 bytes. Examining the image with magnification shows that the character strokes have a lot of noise.

Modern script (not seal script) using writing-mode: vertical-rl:

大法國欽命
總統東法全
權大臣管理

Some SVG files embed commercial fonts as glyphs. For example, an Adobe Illustrator file might embed a portion of the Arial font in an SVG file. That practice should be discouraged.

For Unicode seal script, there is a list of Unicode documents at https://www.unicode.org/L2/topical/seal/

sample files from that list

Finding the corresponding characters in that document or a similar document might be helpful.

For example, the second document above maps 林 to Serial No.字序 04418

SVG 2.0 drops glyph

SVG was developed when web fonts were not available, so SVG included a rudimentary embedded font mechanism.[17] With web fonts, such a facility is not as important, so the mechanism has been deprecated. As of 2021, support may still be found in the Safari and Android browsers.

Glyphs would not work with some scripts

The Unicode specification will not add any new composed characters. That simplifies the number of characters needed. For example, Siddham script has thousands of glyphs, but most of those glyphs are composed characters. In Unicode, Siddham has a small number of fundamental characters. Composed characters are still drawn, but they no longer have exposed codepoints.

WMF prohibits web fonts

SVG 2.0 may have dropped glyph support because web fonts are now available. In the past, web pages depended upon the fonts that a user already had on his local machine. If the local machine did not have the font, then it would substitute some other font. Those substitutions could lead to bizarre results.

It gets even more troublesome when the desired font is for uncommon Unicode scripts. Unicode supports many scripts, but most users will not have those scripts. Unicode has assignments for Egyptian hieroglyphics and ancient Sanskrit Siddham.

CSS now has a mechanism to load web fonts.

Google offers a lot of fonts, and it also has CSS files to use those fonts as web fonts. (Reference)

The downside is that web fonts allow some tracking. The web font files have a long caching time (was it a year?). A browser would download the font and use it without continually querying the Google servers. The CSS files have relatively short cache times, so the browser would be contacting Google servers often. (Reference)

Alberta road signs

Road signs can be thorny. They may contain artistic text, and they may contain ordinary text.

Even with artistic text, the file sizes are often not large because the signs are simple (they do not contain much text).

Old Alberta road signs could sensibly use a stylized font.

The modern road signs are too stylized.

See Category:Alberta Highway shields

User:Highway Route Marker Bot

Fonts are not that important to signs. See File:AB69ewSigns-TwoFontsYMM (28172571140).jpg which shows two road signs using the old Alberta logo but the highway numbers are in different fonts.

Font height may remain fixed, but the font weight (e.g., bold) or font stretch (e.g., condensed) may vary.

2, 2A, 93, 93A

w:en:Symbols_of_Alberta

File:Alberta wordmark 2009.svg

File:AB-provincial highway.svg

Polish road signs

About wide variation in Category:Diagrams of Voivodeship road signs of Poland. Height is 270, but widths are all over the map.

The font is Drogowskaz. See en:Polish road signs typeface.

Arial is a reasonable facsimile: 0123456789.

<?xml version="1.0" encoding="UTF-8" ?>
<svg
   xmlns:dc="http://purl.org/dc/elements/1.1/"
   xmlns:cc="http://creativecommons.org/ns#"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns="http://www.w3.org/2000/svg"
   version="1.1"
   viewBox="0 0 540 270"
   font-family="Drogowskaz, Arial, sans-serif"
   font-size="230"
   font-weight="bold">
  
  <metadata id="metadata15">
    <rdf:RDF>
      <cc:Work rdf:about="">
        <dc:format>image/svg+xml</dc:format>
        <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
      </cc:Work>
    </rdf:RDF>
  </metadata>
  
  <rect x="6" y="6" width="528" height="258" stroke="black" stroke-width="12" rx="27" fill="#fafd4a" />
  
  <text x="50%" y="220" text-anchor="middle">100</text>

</svg>

Text anchors

SVG files should use reasonable anchors. The usual choices are left aligned, center aligned, or right aligned. If I want text aligned on the right edge, I should not insert some left-aligned text and then move the position of the whole string so the right edge ends where I want it to end.

Alignment is important because font metrics vary. Text that seems to align correctly with one font may look ragged with another font.

Choosing the correct text anchor is a simple defense against varying font metrics.
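A small worked example shows why the anchor matters. The advance widths below are invented numbers for one label in two fonts, and the helper name extent is an assumption; the point is only the arithmetic for left-to-right text.

```python
def extent(x, advance, anchor):
    """Return the (left, right) edges of LTR text placed at x."""
    if anchor == "start":
        return (x, x + advance)
    if anchor == "middle":
        return (x - advance / 2, x + advance / 2)
    if anchor == "end":
        return (x - advance, x)
    raise ValueError(f"unknown text-anchor: {anchor}")


# Hypothetical advance widths of the same label in two fonts.
widths = {"intended font": 80.0, "substituted font": 92.0}
right_edge = 200.0

for font, advance in widths.items():
    # Left-anchored text that was nudged so the intended font ends at 200.
    nudged = extent(right_edge - widths["intended font"], advance, "start")
    # Text that simply uses text-anchor="end" at x=200.
    anchored = extent(right_edge, advance, "end")
    print(font, "nudged right edge:", nudged[1], "anchored right edge:", anchored[1])
```

With the nudged layout, the substituted font overshoots the intended edge by the difference in advance widths; with the end anchor, the right edge stays put and only the free left edge moves.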

The leader lines should be placed with precision. The lines should be at the start, the middle, or the end of the text. Consider that a lengthy English label may translate to a single character. If the leader starts at the third character, it may be misaligned for Chinese. It's not a great example, but compare pancreas (Q9618) and 胰脏 (Q9618). Also, a leader line should be careful about character clearance. Starting just below the baseline will not work well if the translation has a descending character there. Similarly, a leader that starts just above some English text may intersect the Chinese translation because Chinese uses taller characters.

Perhaps show translation boxes?

Text anchors, direction, and BIDI

There are subtle problems with SVG text anchors. The SVG semantics do not play well with text direction. If the anchor sets a starting point, then left-to-right text builds to the right, but right-to-left text builds to the left. That can give screwy results.

The issue is a bit more complex. There is an interaction between the specified text direction and the Unicode BIDI algorithm. They will give reasonable results in simple cases.

Consider start-aligned text: text-anchor="start".

For English, we expect direction="ltr", so we would expect

        |English

For Arabic, we expect direction="rtl", so we would expect

  cibarA|

But if the Arabic is laid out with direction="ltr", the layout sort of works due to BIDI:

        |cibarA

Do nothing, and the layout sort of works. It fails when strings contain both LTR and RTL characters. Consider "English 17 kg", where we want to keep the "kg" units. The result with direction="ltr" is

        |English 17 kg
        |17 cibarA kg

BIDI starts out in LTR. It sees the Arabic, so it starts an RTL block for "Arabic". The space is neutral, so it is added RTL. The numbers are weak LTR, so they are added to the RTL block as a subblock. The "kg" is strong LTR, so it terminates the blocks and returns to the top level.
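The character classes that drive this behavior can be inspected with Python's standard unicodedata module. This is only a lookup of the Unicode bidirectional class of each character, not an implementation of the BIDI algorithm; the helper name bidi_classes is an assumption.

```python
import unicodedata


def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


label = "عربي 17 kg"
for ch, cls in bidi_classes(label):
    # AL = strong RTL (Arabic letter), WS = neutral whitespace,
    # EN = weak (European number), L = strong LTR
    print(repr(ch), cls)
```

The Arabic letters report AL, the spaces WS, the digits EN, and "kg" reports L, which matches the strong, neutral, and weak roles described above.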

Here are some span elements in HTML:

  • English 17 kg
  • عربي 17 kg

To do it correctly, we need to set the direction for the entire phrase and swap the text-anchor. Here is the text with the direction set for the entire span:

  • English 17 kg
  • عربي 17 kg

Ideally, the text element should set the text direction that is appropriate for the script. English text should set the text direction as left-to-right. Arabic text should set the text direction as right-to-left. Unicode BIDI will then lay out the strings correctly, but now the layouts will head in opposite directions. For expected results, one must change both the text anchor and the text direction. That's a headache.
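The paired change of direction and text-anchor can be sketched as a small helper. It assumes the SVG rule that start and end are relative to the text direction, so keeping the same physical edge for RTL text means swapping them. The function name text_attributes and the RTL_LANGS set are assumptions; the set is a short illustrative list, not a complete one.

```python
# Assumption: a small illustrative set of right-to-left primary language subtags.
RTL_LANGS = {"ar", "fa", "he", "ur"}


def text_attributes(lang, edge):
    """Return direction and text-anchor that pin text to a physical edge.

    edge is "left", "middle", or "right": the side of the text that should
    sit at the x coordinate, whatever the script direction.
    """
    if edge not in ("left", "middle", "right"):
        raise ValueError(f"unknown edge: {edge}")
    primary = lang.split("-")[0].lower()
    direction = "rtl" if primary in RTL_LANGS else "ltr"
    if edge == "middle":
        anchor = "middle"
    elif (edge == "left") == (direction == "ltr"):
        anchor = "start"   # the pinned edge is where the text begins
    else:
        anchor = "end"     # the pinned edge is where the text finishes
    return {"direction": direction, "text-anchor": anchor}


print(text_attributes("en", "left"))
print(text_attributes("ar", "left"))
```

A translation tool could apply this per systemLanguage clause so that a label pinned to a leader line stays pinned after translation.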

In theory, CSS can fix the problem, but SVG agents may have weak CSS implementations.

Phab:T271663 Offer to invert text-anchor for RTL languages

SVG warts

SVG is not HTML

SVG 1.1 used the xlink: namespace. HTML does not have namespaces, so HTML uses just href rather than xlink:href. For some reason (perhaps embedding SVG within HTML), SVG 2.0 has decided to use href.

The problem with xml:lang and lang.

SVG should be about making marks on a screen or a piece of paper. It should not be about myriad other topics. If the semantics are not about marks on paper, then the semantics do not belong in the SVG specification.

For example, there is a notion that some text might be translated to another language, while other text should not be. People who were interested in XML markup developed the Internationalization Tag Set for making such notations. Consequently, one could add rules and attributes to an XML file that translation utilities could use. The attribute its:translate="no" means do not translate the content, and its:translate="yes" means translate the content. The specification also included rules using XPATH patterns to identify what should or should not be translated. Everything in the its namespace is distinct from other namespaces (and the default namespace).
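Because the its: attributes live in their own namespace, a translation utility can find them with any namespace-aware XML parser. The sketch below uses Python's xml.etree.ElementTree and handles only the local its:translate attribute with inheritance to child elements; it ignores the global XPath rules. The function name translatable_texts and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"
SVG = "http://www.w3.org/2000/svg"


def translatable_texts(element, translate=True):
    """Yield the content of SVG text elements that should be translated."""
    flag = element.get(f"{{{ITS}}}translate")
    if flag is not None:
        translate = (flag == "yes")   # a local setting overrides the inherited one
    if element.tag == f"{{{SVG}}}text" and translate:
        yield "".join(element.itertext())
    for child in element:
        yield from translatable_texts(child, translate)


# Hypothetical sample: one ordinary label and one wordmark that must stay as is.
SAMPLE = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:its="http://www.w3.org/2005/11/its">'
    '<text>Amazon River</text>'
    '<g its:translate="no"><text>WIKIDATA</text></g>'
    '</svg>'
)

print(list(translatable_texts(ET.fromstring(SAMPLE))))
```

A tspan marked its:translate="no" inside a translatable text element is not split out by this sketch; a real tool would need to handle that case.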

HTML does not have namespaces, so the use of ITS is a bit awkward. So HTML added the translate attribute. It is not as powerful as the ITS specification, but it is simple. SVG comes along and copies the HTML translate attribute. There is no reason. SVG can support the its: namespace; it is not crippled like HTML.

HTML does not have namespaces. Instead of following XML and adopting namespaces (as XHTML did), HTML invented a poor man's namespace. If attributes start with data- or aria-, then they are in a quasi-namespace. SVG is XML, so it need not stoop to such measures. SVG should have used data: and aria: namespaces.

HTML ignores capitalization. Consequently, <Head> is the same as <HEAD>. It is the same for attributes. However, the data- attributes wanted to have database keys that were case sensitive. So HTML uses a hyphen algorithm. Everything after the data- prefix is in lowercase unless it is immediately preceded by a hyphen. The attribute DATA-NAME="Smith" sets database["name"] = "Smith". If we wanted the database key to be all capitals, we must say data--N-A-M-E="Smith". That's due to the case-insensitive nature of HTML. XML and SVG can be much simpler. They could say either data:name="Smith" or data:NAME="Smith". No need for a pseudo namespace, and no need for a hyphen capitalization rule.
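The hyphen rule is easy to state in code. This Python sketch mimics the attribute-name-to-key conversion described above; the function name dataset_key is an assumption, and it models only the name mapping, not a browser's dataset object.

```python
import re


def dataset_key(attribute_name):
    """Convert an HTML data-* attribute name to its case-sensitive key."""
    name = attribute_name.lower()          # HTML attribute names ignore case
    if not name.startswith("data-"):
        raise ValueError("not a data-* attribute")
    rest = name[len("data-"):]
    # A hyphen followed by a letter is dropped and the letter is capitalized.
    return re.sub(r"-([a-z])", lambda m: m.group(1).upper(), rest)


for attr in ("DATA-NAME", "data--N-A-M-E", "data-first-name"):
    print(attr, "->", dataset_key(attr))
```

The second case shows the cost of the rule: an all-capitals key needs one hyphen per letter.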

Zoom and pan

Can SVG images be inserted such that the image can be zoomed and panned without affecting the including page?

Does the use of viewBox frustrate that goal?

MediaWiki problems

librsvg color issue

Where was this?

The librsvg renderer produces PNG bitmaps, but those bitmaps may not set the colorspace to sRGB.

MediaWiki whitelisted namespaces

MediaWiki has a select set of namespaces for SVG.

The intention of the whitelist is to avoid script injection. For example, the SVG might include an XHTML namespace subtree, and that subtree might allow scripting that is not detected by the ordinary SVG filter.

It looks like the test requires elements to be in whitelisted namespaces but does not require attribute values to be in whitelisted namespaces. I should check that distinction. Might try

<svg version="1.1"
     xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://namespace... http://schemalocation/schema.xsd">

See also Help:SVG.

		static $validNamespaces = [
			'',
			'adobe:ns:meta/',
			'http://creativecommons.org/ns#',
			'http://inkscape.sourceforge.net/dtd/sodipodi-0.dtd',
			'http://ns.adobe.com/adobeillustrator/10.0/',
			'http://ns.adobe.com/adobesvgviewerextensions/3.0/',
			'http://ns.adobe.com/extensibility/1.0/',
			'http://ns.adobe.com/flows/1.0/',
			'http://ns.adobe.com/illustrator/1.0/',
			'http://ns.adobe.com/imagereplacement/1.0/',
			'http://ns.adobe.com/pdf/1.3/',
			'http://ns.adobe.com/photoshop/1.0/',
			'http://ns.adobe.com/saveforweb/1.0/',
			'http://ns.adobe.com/variables/1.0/',
			'http://ns.adobe.com/xap/1.0/',
			'http://ns.adobe.com/xap/1.0/g/',
			'http://ns.adobe.com/xap/1.0/g/img/',
			'http://ns.adobe.com/xap/1.0/mm/',
			'http://ns.adobe.com/xap/1.0/rights/',
			'http://ns.adobe.com/xap/1.0/stype/dimensions#',
			'http://ns.adobe.com/xap/1.0/stype/font#',
			'http://ns.adobe.com/xap/1.0/stype/manifestitem#',
			'http://ns.adobe.com/xap/1.0/stype/resourceevent#',
			'http://ns.adobe.com/xap/1.0/stype/resourceref#',
			'http://ns.adobe.com/xap/1.0/t/pg/',
			'http://purl.org/dc/elements/1.1/',
			'http://purl.org/dc/elements/1.1',
			'http://schemas.microsoft.com/visio/2003/svgextensions/',
			'http://sodipodi.sourceforge.net/dtd/sodipodi-0.dtd',
			'http://taptrix.com/inkpad/svg_extensions',
			'http://web.resource.org/cc/',
			'http://www.freesoftware.fsf.org/bkchem/cdml',
			'http://www.inkscape.org/namespaces/inkscape',
			'http://www.opengis.net/gml',
			'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
			'http://www.w3.org/2000/svg',
			'http://www.w3.org/tr/rec-rdf-syntax/',
			'http://www.w3.org/2000/01/rdf-schema#',
		];
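The element-versus-attribute question raised above can be explored offline before attempting an upload. The following Python sketch is not MediaWiki's check; it is an assumed approximation that reports element namespaces and attribute namespaces separately. VALID_NAMESPACES is only a subset of the list quoted above, the function names are assumptions, and the example.org URIs in the sample are placeholders.

```python
import xml.etree.ElementTree as ET

# Subset of the quoted $validNamespaces list ('' is the no-namespace case).
VALID_NAMESPACES = {
    "",
    "http://creativecommons.org/ns#",
    "http://purl.org/dc/elements/1.1/",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/svg",
}


def namespace_of(qualified_name):
    """Extract the namespace URI from ElementTree's {uri}local notation."""
    if qualified_name.startswith("{"):
        return qualified_name[1:].split("}", 1)[0]
    return ""


def unlisted_namespaces(svg_source):
    """Return (element namespaces, attribute namespaces) missing from the list."""
    root = ET.fromstring(svg_source)
    elements, attributes = set(), set()
    for el in root.iter():
        elements.add(namespace_of(el.tag))
        for attr in el.attrib:
            attributes.add(namespace_of(attr))
    return elements - VALID_NAMESPACES, attributes - VALID_NAMESPACES


# Hypothetical sample: a MathML element and an xsi: attribute.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://example.org/ns http://example.org/schema.xsd">'
    '<foreignObject><math xmlns="http://www.w3.org/1998/Math/MathML"/></foreignObject>'
    '</svg>'
)
bad_elements, bad_attributes = unlisted_namespaces(sample)
print("elements:", bad_elements)
print("attributes:", bad_attributes)
```

ElementTree does not expose namespaces that are declared but never used, so a file that only declares a prefix would pass this sketch even if the real filter rejects it.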

Namespaces to add:

One of the whitelisted namespaces is suspect:

Might check if there are any SVG files that use this namespace.

Some absent namespaces are significant. When Dublin Core came out in 2000, it provided a succinct set of terms in the dc/elements/1.1/ namespace. The next year, it came out with an expanded dc/terms/ namespace and vocabulary. In 2008, it encouraged dropping the first namespace in favor of the dc/terms/ namespace. WMF accepts the former but not the latter namespace.

MathML namespace

The MathML namespace is also not whitelisted. MathML allows sophisticated mathematical typesetting, but WMF blocks its upload. Users cannot upload this file:

<?xml version="1.0" encoding="utf-8"?>
<svg viewBox="0 0 300 200"
     version="1.1"
     xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>SVG MathML test</title>
  <desc>Test if MathML is available in SVG. Will not upload to Commons due to MathML namespace.</desc>

  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/terms/"
             xmlns:cc="http://creativecommons.org/ns#" >
      <cc:Work rdf:about="">
        <dc:publisher>Wikimedia Commons</dc:publisher>
        <cc:license rdf:resource="https://creativecommons.org/publicdomain/zero/1.0/"/>
        <cc:attributionName rdf:resource="http://commons.wikimedia.org/wiki/User:Glrx" />
        <cc:attributionURL rdf:resource="http://commons.wikimedia.org/wiki/File:SVG_MathML_test.svg" />
      </cc:Work>
    </rdf:RDF>
  </metadata>

  <text x="150" y="40" text-anchor="middle">SVG MathML test</text>

  <switch transform="translate(50,100)">
    <foreignObject width="200" height="50"
                   requiredExtensions="http://www.w3.org/1998/Math/MathML">
        <math xmlns="http://www.w3.org/1998/Math/MathML">
          <msqrt>
            <msup><mi>x</mi><mn>2</mn></msup>
            <mo>+</mo>
            <msup><mi>y</mi><mn>2</mn></msup>
          </msqrt>
        </math>
    </foreignObject>
    <text>\sqrt{x^2 + y^2}</text>
  </switch>

  <text x="10" y="175" font-size="8">should display a formula in either MathML or TeX</text>
</svg>

General issues

librsvg does not turn off its clip region.

Converting bitmaps to SVG

Photographs are usually poor choices for vectorization. (25 kB)

Many files on Commons are bitmaps, but some would be more useful as SVG files. Bitmap files are great with large, orthogonal features, but they can struggle with thin features and curves. Zooming in on a feature will show more anti-aliasing fuzz or jagged edges. More details require more bits. Bitmap files can be difficult to edit. Changing lines or text involves not only adding the new content, but also erasing the old. Erasures can be difficult because the background must be reconstructed. It is difficult to copy text that is in a bitmap: the text is just a picture that must be converted to characters. It takes a lot of work to translate a bitmap to another language. Bitmap files that are good candidates for vectorization can be marked with {{Convert to SVG}}.

Notes

Expensive parser tests:

Conversions

Unfortunately, converting a bitmap file to a vector file may not be an easy task. It also may not be desired.

Technical bitmaps such as a QR code should remain bitmaps. (Do not convert QR code PNG files to JPEG bitmaps.)

Converting a photograph or other continuous-tone image to SVG is usually inappropriate. See w:Image tracing. Good candidates for conversion need to have significant structure. Some continuous-tone images have structured color gradients, so they can be vectorized.

Images with a lot of random details may be inappropriate. It does not take much information to describe a long straight line, but it does take a lot of information to describe 10,000 individual objects. There are times that randomness can be described by a pseudorandom process. (For example, MPEG replaces fricative sounds with a noise generator.)

Here is a progression of changes to a subject image. The details and appearance of an image can be improved and still be an efficient representation of the object. The last image has the detail of the gun powder grains without individually drawing each grain.

Many technical images can be good candidates for conversion to SVG. See, for example, Category:Cross sections of valves.

Issues

This section is confused. It should start with straightforward conversions such as diagrams that are easy to redraw.

Next, it can address stepped conversions. A stepped conversion is where a bitmap is still present in the SVG, but parts of the bitmap are replaced with SVG elements. Eventually, the SVG elements may eliminate the bitmap. A "stepped conversion" may include SVG files that will always contain a bitmap image. For example, the bitmap may be a photograph, but SVG may use text elements to label the photograph.

From there, it can address the random process methods. The section should not lead with the most difficult conversions. It can also serve as a counterpoint to not converting the Mona Lisa to SVG.

Straightforward conversions

Here is an example of a PNG file that has been converted to SVG.

The files are not widely used, but the SVG format makes it easier to fix some minor issues with the original file. For example, a variable such as VTorpedo can be edited to use the more common italic-variable convention of VTorpedo. The arrows for the torpedo and target velocities look like velocity vectors, but they do not make sense as velocity vectors. The diagram suggests that by the time the torpedo reaches the target's track, the target has already gone by that point.

The SVG also brings up an issue with SVG marker elements. In the past, I have created a new marker for each fill. There is a technical issue about inheritance of attributes such as stroke and fill. A use instance will not inherit from its environment because it is not part of the DOM tree. In some cases, an instance will use attributes that are set on the use element because they are part of the (inaccessible) tree.

Less simple conversions

File:Littoral Zones.jpg needs a lot of cleaning up.

File:Map of battle St. Mihiel.JPG needs cleaning up. Dashed lines are not good candidates for automatic vectorization. Adding color.

Map locations with given scale: St. Mihiel Saint-Mihiel (Q194932), Frenes Frênes (Q538768), Hattonchatel Hattonchâtel (Q30127896), Vigneulles, Thiacourt Thiaucourt-Regniéville (Q497719), Pont-a-Mousson Pont-à-Mousson (Q461413), Plain of the Woevre Woëvre (Q1476825), River Meuse Meuse (Q41986).

Conversions with gradients

An image may look complex, but it may just need the appropriate construct.

Conversions should be good

Suggested replacements should only be used if they are superior. Replacements may not be superior.

The original JPEG is simple and clean. The vector replacement has problems. The JPEG uses a single font. The SVG uses several font sizes and uses colors. A yellow font can be lost in a white background, so the yellow font is stroked. The paper towels are flat in the JPEG, but they are wavy in SVG. One purpose of the paper towels is to evenly distribute the weight; wavy towels (especially when the waves line up) do not convey that purpose. The solution is divided in the JPEG but connected in the SVG. What is the distinction between Southern blot and northern blot?

Despite the image having simple vector shapes, the majority of the image is a bitmap.

The SVG vector was derived from File:Capillary blot setup.svg.

The file descriptions are slightly different: the first is about a Southern blot while the second is about a northern blot. The first is for DNA and the second is for RNA, but both procedures use agarose gel electrophoresis.[18]

Electroblotting makes more sense as a blot, but the electro-transfer is vertical. That has issues with applying voltage in the given images.

w:File:Electroblot.gif is public domain, but not yet transferred to Commons. It shows the vertical electrodes. w:Northern blot states, "Strictly speaking, the term 'northern blot' refers specifically to the capillary transfer of RNA from the electrophoresis gel to the blotting membrane."

investigate ...

Multiline text causes trouble

For translations, try to keep the text on one line. Text that is broken into many lines is troublesome.

  • Cathode
  • Aston Dark Space
  • Cathode Glow
  • Cathode Dark Space
  • Negative Glow
  • Faraday Space
  • Positive Column
  • Anode Glow
  • Anode Dark Space
  • Anode

The diagram has unconventional leader lines. The diagram has negative shading: the dark spaces are white; some glows are dark.

Stepped conversions

Here are files that can be converted in steps. The first step would use an underlying bitmap file with overlaid SVG text elements. Later, the bitmap image could be converted to SVG.

A stepped conversion with difficult image

Here's a file that has conversion problems and can be converted in steps.

For the first step, the PNG can be edited to remove the text and leader lines. That PNG can be inserted into an SVG file, and the text and leader lines can be redrawn using SVG primitives. Removing the text is usually simple, but removing the leader lines can be tricky. In some cases, the leader lines can be retained. In either case, the leader lines pose a problem with text alignment. The current layout requires the text to fit the space between the margin and the start of the leader line. That strategy works for PNG files, but it has problems with SVG because font metrics may change slightly. A substituted font with slightly different metrics may not fit between the margins and the leader lines. One fix would be to add a background filter to the text; it would overwrite the leader line with white (see filter below). Alternatively (and probably better) would be to right align the lefthand text and left align the righthand text. Another text fitting problem is the title: it runs from the left margin to the right margin. A slightly wider font would go outside both margins.

<filter id="textflood" filterUnits="objectBoundingBox" primitiveUnits="objectBoundingBox">
    <feFlood flood-color="white" flood-opacity="1.0" x="0" y="0" width="1.0" height="1.0" result="back"/>
    <feMerge>
        <feMergeNode in="back" />
        <feMergeNode in="SourceGraphic" />
    </feMerge>
</filter>

For the second step, the body image could be redone as SVG. Completely converting the image to SVG is hard because the image has gradient fills; a raster-to-vector conversion application will probably not have a good result. Rendering the intestines looks difficult, too. There are many twists and turns, so shading is difficult. Perhaps a good place for filter primitives.


A simpler target is the following image.

Significant detail

Trees have significant detail

Here, the JPEG image is much higher quality than the SVG. Both files are about the same size.

Even synthetic images can have significant detail

The JPEG image has more character than the SVG. The files have similar size.

Stamps

Originally, I thought some files were bitmaps, but now it looks like something much stranger happened. The original artist made an SVG with Inkscape and used an appropriate filter, but somehow the SVG file bloated out of control. Why?

See Category:Powered by Wikidata.

Rectangular stamp

Font family is "Sans", but the SVG text was converted to curves. There are many instances of filters, and those instances include "Rubber Stamp" and "Chalk and Sponge". The defs section is huge, and it has several huge clipping paths. However, only one clipping path is used. The wiki barcode does not use a clipping path, so it is drawn without special effects.

The SVG files have neutered flowRoot elements.

This SVG file uses the "Gill Sans" font.

Round stamp

The stamp runs into librsvg's inability to do textPath. It is a lot of bytes for a simple image; individual debris has a lot of information. There is debris even in the unstamped areas. Some of the debris is black. Most of the debris is polygonal. The WIKIDATA rectangle is filled, so not all of the apparent background is transparent; clipping would be appropriate.

Redid as an SVG with a random process filter.

Vectorization

Find mirrored or rotated components of an image

How good can automatic vectorization be?

Recovering text

Files on Commons can be OCR'd (which produces JSON with a text key holding the lines of OCR'd text):

English
Arm Bones.png
Polish
Tulejki zaciskowe.svg
Chinese
(zh)Illu epithelium.jpg

SVG extensions do not work with the above pattern. Need to use PNG rather than SVG. That can be achieved by supplying a thumbnail_size argument:

Polish
Tulejki zaciskowe.svg
{"engine":"google","langs":["pl"],"psm":3,"crop":[],"image_hosts":["upload.wikimedia.org","upload.wikimedia.beta.wmflabs.org"],"text":"Typ \u015bci\u0105gaj\u0105cy\nTyp naciskaj\u0105cy\nTyp obustronny"}

So the Polish text is

  • Typ ściągający → pull-back type; pull-to-close
  • Typ naciskający → push type; push-to-close
  • Typ obustronny → dead-length type
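Decoding the response is straightforward with Python's standard json module. The sketch below only parses the JSON text quoted above; it makes no network request, and the function name ocr_lines is an assumption.

```python
import json

# The response quoted above, kept as a raw string so the \u escapes reach json.loads.
response = r'{"engine":"google","langs":["pl"],"psm":3,"crop":[],"image_hosts":["upload.wikimedia.org","upload.wikimedia.beta.wmflabs.org"],"text":"Typ \u015bci\u0105gaj\u0105cy\nTyp naciskaj\u0105cy\nTyp obustronny"}'


def ocr_lines(response_text):
    """Return the OCR'd text as a list of lines."""
    data = json.loads(response_text)
    return data["text"].splitlines()


for line in ocr_lines(response):
    print(line)
```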

Can I just go to https://ocr.wmcloud.org/api.php directly?

Would some JavaScript run into CORS? Will origin=* work? The simple return has

  • access-control-allow-origin: *

so it should work without problems.

See https://ocr.wmcloud.org/ for direct interface and API documentation.

There are tools for identifying fonts.

What?

Tracing...

Translations: internationalization and localization

Commons supports wikis in many different languages. Ideally, an image would be available in any language, but the reality is many images on Commons are just available in English. Images in a bitmap format such as PNG have painted in the text, so the text is not easy to change.

SVG can support translations.

Sadly, SVG has made some unusual choices. The class attribute is a space-separated list of tokens, but the systemLanguage attribute is a comma-separated list of tokens. The commas add confusion (some implementations used space-separated IETF language tags) and complicate pattern matching. Compare CSS [systemLanguage~="en"] (which wants a space-separated list) and [systemLanguage|="en"] (which does not want a list).
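Because CSS attribute selectors do not fit a comma-separated list, matching is easier to do in code. This Python sketch follows the SVG 1.1 rule as I understand it: a user language matches if it equals a listed tag or is a prefix of a listed tag followed by a hyphen. The function names are assumptions, and case-insensitive comparison is my choice.

```python
def parse_system_language(value):
    """Split a systemLanguage value on commas, tolerating stray whitespace."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def system_language_matches(value, user_languages):
    """Return True if any user language selects this systemLanguage value."""
    tags = [tag.lower() for tag in parse_system_language(value)]
    for preferred in (lang.lower() for lang in user_languages):
        for tag in tags:
            # Exact match, or the preference is a prefix ending at a hyphen.
            if tag == preferred or tag.startswith(preferred + "-"):
                return True
    return False


print(system_language_matches("en-US, fr", ["en"]))   # prefix of a listed tag
print(system_language_matches("en", ["en-US"]))       # listed tag is shorter
print(system_language_matches("en fr", ["fr"]))       # space-separated by mistake
```

The last call shows the confusion mentioned above: a space-separated value is one unrecognizable tag under the comma rule, so neither language is selected.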

Translations are welcome, but they have costs

There are important diagrams that have many translations.

The translations do not fit

(Should follow multiline text.)

Methods for variable-width dimension lines.

Some languages do not need much space, but other languages do. As a guess, I would say Spanish needs 30% more space than English. Chinese is very dense, but the characters are more detailed and should be larger. Diagrams meant for translation need to leave a lot of space.

I've seen some users add translations to similar languages. Instead of adding Catalan to an English SVG, add the language to the Spanish SVG.

Dimension lines cause problems. They set a particular width for the text. Putting the text above the dimension lines is a simple approach. Another way around the problem is to paint a text background that overlays the dimension line. In many cases the background merges with the background of the diagram, but sometimes it will cover nearby details in the diagram. Painting a background is difficult when the background has a color gradient; it requires a good match. Use a mask. Use a Gaussian blur filter on the background?

Methods ran into Phab:T316962. Sadly, FillPaint does not work for gradient fills on Commons or Chromium.

  • File:Oceanic divisions.svg
    fix dimension line issue.
    better align some labels. Half done
    also fix fonts and marker arrows. ✓ Done
    reposition some text for better translation. ✓ Done

Keeping diagrams in sync[edit]

We want to make images available in several languages. That has often been done by loading an original image into an editor and changing the labels. During the translation, the artist may improve the image. The improved graphics do not make it back into the original. If there were several other translations of the image, those retain the old graphics.

Here are two diagrams of the same chemical process that have significant changes. In the translation, many vessels are larger and easier to read. The piping is also different. Which one is more accurate?

The German version uses Unicode subscript characters for text such as N₂ H₂. The Spanish version uses separately placed subscripts for text such as N2.

Here's an image where the English version was improved with new numbers (and changed graphics) but the other languages were not.

MediaWiki language default is confused[edit]

MediaWiki believes English is the true default language. It relies on librsvg defaulting to en.

MediaWiki problems[edit]

I'm seeing some multilingual files that MediaWiki does not offer to show in various languages ("Render this image in ..."). I've come across a couple in the last month, and they are not the 256 kB case. Possibly newer page builder? Also may be one explicit langtag and a default.

Maybe this is a clue. Go to

and it will display the language drop down box. Now click on the "(default language)" option and GO. The language dropdown box disappears.

That is the same as going to

Alternatively, go to the Klingon version, which does not have a render this image in (language) selector:

So somebody creates a diagram in English on Commons. Somebody then applies SVG Translate to add a language such as French. SVG Translate does not do the triple clause thing; instead it just adds "fr" clauses and keeps the default. The Commons file page will not display the potential languages. SVG Translate should ask for the default language (or use the lang= or xml:lang= attributes if available).

MediaWiki language fallbacks[edit]

MediaWiki sometimes tries to fall back to similar languages if the desired language is not available. See mw:Manual:Language#Fallback languages. Bulgarian does not have Russian as a fallback. Ukrainian has Russian as a fallback.

Consider File:Galvanic cell with no cation flow.svg. Currently, it has a Russian (ru) translation but not a Bulgarian (bg) translation; both languages use Cyrillic script. The image is used in the Bulgarian wiki for the bg:Анод article. The page displays the SVG default English rather than Russian. Bulgarian does not have Russian as a fallback.

File:Galvanic cell with no cation flow.svg also does not have a Ukrainian translation. It is used in the Ukrainian wiki: uk:Напівреакція, but that image also displays as English.
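The Ukrainian observation suggests the fallback chain is not applied when choosing an SVG language. As a sketch of what applying it could look like (not what MediaWiki does today), the function below walks a fallback table. The table is hypothetical; only the uk and bg entries restate what is said above.

```python
# Hypothetical fallback table; only these two entries come from the notes above.
FALLBACKS = {
    "uk": ["ru"],  # Ukrainian falls back to Russian
    "bg": [],      # Bulgarian has no Russian fallback
}


def pick_language(wiki_lang, file_langs, fallbacks=FALLBACKS):
    """Return the wiki language or its first fallback found in the file.

    Returns None when nothing matches, meaning "show the SVG default".
    """
    for candidate in [wiki_lang, *fallbacks.get(wiki_lang, [])]:
        if candidate in file_langs:
            return candidate
    return None


file_langs = {"en", "ru"}  # languages in the galvanic cell file, per the text
print(pick_language("uk", file_langs))  # ru
print(pick_language("bg", file_langs))  # None
```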

Is this an argument that WMF should serve multilingual SVGs? It would provide understandable SVGs to users in ways that are not possible with a static map. A Japanese user may not understand English but may understand German.

MediaWiki language identifiers may not be the same as IETF langtags[edit]

MediaWiki language identifiers (which are all lowercase) are usually the same as IETF langtags (which are mixed case), but there are some differences. There is also an effort to conform some language tags.

sr-ec vs sr-Cyrl, sr-el vs sr-Latn, als vs gsw.

In the als Wikipedia, SVG inclusions will now use the gsw IETF langtag if it is available.
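A minimal sketch of such a translation step, using only the three pairs listed above. The table and function names are hypothetical; I have not checked how MediaWiki stores or applies its own mapping.

```python
# Only the pairs mentioned above; a real table would be longer.
MW_TO_IETF = {
    "sr-ec": "sr-Cyrl",
    "sr-el": "sr-Latn",
    "als": "gsw",
}


def mw_to_ietf(code: str) -> str:
    """Map a MediaWiki language identifier to an IETF langtag.

    Identifiers without an entry are returned unchanged.
    """
    return MW_TO_IETF.get(code.lower(), code)


print(mw_to_ietf("als"))    # gsw
print(mw_to_ietf("sr-ec"))  # sr-Cyrl
print(mw_to_ietf("de"))     # de
```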

MediaWiki language matching[edit]

Phab:T311965 MediaWiki mishandles hyphenated language tags in SVG files.

MediaWiki language default[edit]

MediaWiki defaults the language to en. It should remove that English bias. If a German editor creates an image that is used on several wikis, then the default display should be the German text.

So default the display to und.

That is a breaking change. We should scan existing SVG to discover those that have switch elements without default clauses.
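Such a scan could be sketched as follows. It assumes a default clause is a direct child of switch that has no systemLanguage attribute, and it ignores the other conditional attributes. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def switches_without_default(svg_text: str) -> list:
    """Return the id (or None) of every switch that lacks a default clause."""
    root = ET.fromstring(svg_text)
    missing = []
    for switch in root.iter(SVG_NS + "switch"):
        # A child without systemLanguage acts as the default clause.
        has_default = any("systemLanguage" not in child.attrib for child in switch)
        if not has_default:
            missing.append(switch.get("id"))
    return missing


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<switch id="a"><text systemLanguage="de">Hallo</text></switch>'
    '<switch id="b"><text systemLanguage="de">Hallo</text><text>Hello</text></switch>'
    "</svg>"
)
print(switches_without_default(sample))  # ['a']
```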

MediaWiki language advertisements[edit]

Say a file is used on langtag.Wiki but does not have a langtag translation. In that case, the SVG file might have a link to SVG Translate. Alternatively, MW might notice that a file is switch translated but the file does not have the desired language. Then it could insert not only the img tag but also a link to translate the file. It could use the wiki symbol for translate: File:Translate (CoreUI Icons v1.0.0).svg — except it has a CC-BY 4.0 requirement so putting it in the SVG file would be cumbersome.

The information section of a file page often lists other versions of a file. A file may, for example, have PNG and SVG versions. There may also be different language versions.

A template is often used to keep the other-versions information up-to-date on the different file pages. For example, {{Other versions/War in Ukraine (2022)}} is transcluded on many file pages. The template usually consists of an image gallery that lists each file with a comment.

For translations, the comment often identifies the language. That raises the question of how to identify the language.

  1. Using English words such as French, Russian, or Japanese. This approach only works for English readers and does little for other languages.
  2. Using langtag such as fr, ru, or ja. This approach is too cryptic for most users; they would not know what the strings mean.
  3. Using MW {{#language: fr}} to obtain the language in its representation. Anyone visiting the file page would see français, русский, 日本語. That makes it easy for native users to recognize their language, but most users would have trouble recognizing all the languages. A user seeing lietuvių may not know that means Lithuanian.
  4. Using {{language|fr}} to obtain the language in the MW page's language. The English version of the file page would show French, Russian, Japanese. The German version of the file page would show Französisch, Russisch, Japanisch. The translations depend upon the file page's uselang URL parameter.

Methods 3 and 4 are the better approaches. I previously used method 3, but now I think method 4 is better.

The advent of multilingual SVG files raised an issue of how they should be represented in a gallery. Should there be one file that lists all the languages that it supports, or should the file be repeated for each language?

I prefer the former. The gallery is usually so small that the translations do not show up. Painting essentially the same file 15 times on the file page seems wasteful. It is very repetitive for files such as File:Map of USA with state names.svg, a multilingual map with 150 languages.

Clicking on the image or a link. Should it select the render this image in language?

The language gallery templates. More to say. {{Svg lang}}, {{Lang gallery}}, and {{Lgallery}}.

{{Lgallery}} supports Category:Switch-controlled SVG which exploits systemLanguage that browsers will not understand.

Switch translated files already have a general request to add translations. How about attracting translators for a specific language? Say XYZ.svg is an image used on the abc.Wiki but does not have an abc translation. How can we seek translators for the image?

  • Embed a translation request in the SVG image. That would pollute the image.
  • Add the image to a (possibly hidden) category of SVG images needing translation to language abc.

For the latter, make a {{Translation request}} template. Add the template to the File: page with the langtag abc. The template can link to https://svgtranslate.toolforge.org/BASEPAGE. The template can also add the File: page to the Category:Translation requests abc.

Does SVG Translate have URL parameters to specify the source and target languages?
github repo

A wiki could encourage its users to go to the appropriate category page. Or there could be a translate a random page.

Incnis Mrsi wrote {{Translate SVG}} to mark an SVG file for translation to a particular language. It is not specific to switch translations. It adds files to Category:SVG images to be translated to XXX.

includeonly Category: SVG images to be translated to language(langcode) /includeonly

{{Language}} has a second argument of |2=en to canonize the language in English.

Category:X does not display immediate text, so simply add the category to the template.

How MediaWiki handles images[edit]

When MediaWiki builds a page, it makes HTML img elements that will display the image.

For a JPEG image:

For an SVG image

The image URL pattern is [URL prefix]/H/HH/[filename]/[size]px-[filename][suffix]

When my browser displays the page, it processes the img elements. The browser will use the src attribute to make an HTTP request for the image. First, the browser will look in the cache to see if it has a local copy. If that local copy exists and is current, then it will use the local copy. If the local copy exists but is stale, then it will make a network request asking the remote server whether the local copy is still current. If the local copy is still current, then the browser will use the local copy. If there is no local copy or that copy is no longer current, then the image will be transferred over the network.

When the server gets an HTTP image request, it will look in its cache to see if it has that image ready to go. If it does, then it can answer the request from its cache. Otherwise, the server will pick apart the URL, process the request, store the result in its cache, and transfer the result over the network. (Processing the request might say the image is still current, or processing may involve scaling the image.)

Real life is a bit more complicated. WMF's wikis are high-traffic websites, so just one server cannot do all the work. A connection to a wiki will go to one of many servers. That server may ask other WMF computers to do the work. The /H/HH/ pattern in the image URL is from an MD5 hash code. It offers an easy method of load leveling work among up to 256 computers.
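Here is a sketch of how the /H/HH/ portion can be derived. As I understand MediaWiki's hashed upload directories, the MD5 is taken over the file name with spaces replaced by underscores; treat that, the default URL prefix, and the .png suffix rule for SVG as assumptions to check against the source. The function name is hypothetical.

```python
import hashlib
from urllib.parse import quote


def thumb_url(filename, width,
              prefix="https://upload.wikimedia.org/wikipedia/commons/thumb"):
    """Build [URL prefix]/H/HH/[filename]/[size]px-[filename][suffix]."""
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Assumption: SVG thumbnails are served as PNG, so ".png" is appended.
    suffix = ".png" if name.lower().endswith(".svg") else ""
    encoded = quote(name)
    return f"{prefix}/{digest[0]}/{digest[:2]}/{encoded}/{width}px-{encoded}{suffix}"


print(thumb_url("Oceanic divisions.svg", 220))
```

Two hexadecimal digits give 256 possible HH directories, which is where the limit of 256 computers comes from.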

MediaWiki source code[edit]

SvgReader has many uglies including the exponential language list. SvgHandler trims the list.

SvgHandler::normaliseParamsInternal() is where the lang param must be in the lang list.

MediaWiki file page.

MediaWiki compiling a page.

MediaWiki serving an image. MediaHandler/ImageHandler/SvgHandler

  • SvgHandler::rasterize()

Thumbor serving a page.

Thumbor 7 changed to Python 3, a breaking change:[19]

Release 7.0.0 introduces a major breaking change due to the migration to python 3 and the modernization of our codebase. Please read the release notes for details on how to upgrade.

Gilles retires from WMF... https://mobile.twitter.com/monsieurperf/status/1409444342352187400

Language variants[edit]

Asking Klingon (tlh) — URL has .../langtlh-220px-
Asking English (en) — URL does not have .../langen-220px-

It's been a long time, and I need to check these claims. I want to point to the code.

Building a wiki page that contains an SVG file is a little more involved. There are circumstances where MediaWiki will include a language specifier in the image URL:

  • ...[filename]/lang[language]-[size]px-[filename].png

The language specifier is included if the wiki text has an explicit |lang= parameter. That is the user making an explicit request, and that request is honored even if the SVG file does not have that language.

On the English wikipedia, if there is no |lang= parameter, then no language specifier is emitted. This practice is due to WMF servers defaulting the SVG language to generic English. That practice makes it difficult to ask for the SVG file's default language. To get the default clause, one must ask for a langtag that does not exist in the file (e.g., tlh Klingon). (Make a table showing the issue.)

On other wikipedias, if there is no |lang= parameter, there is an attempt to use that wikipedia's default language.

MediaWiki checks the SVG file to see if it has any language dependencies. The check is simple, and the check can be fooled. Currently, it reads the first 256 kB of the file looking for systemLanguage attributes. As it finds those attributes, it builds a list of languages the SVG file supports.
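A rough sketch of that check, using a regular expression as a stand-in. MediaWiki reads the file with an XML reader rather than a regex, so this only illustrates the 256 kB limit and the list building; the names are hypothetical.

```python
import re

SCAN_LIMIT = 256 * 1024  # bytes, per the description above

# Stand-in for a real XML parse: find systemLanguage="..." attributes.
ATTR_RE = re.compile(rb"""systemLanguage\s*=\s*(["'])(.*?)\1""", re.DOTALL)


def scan_languages(svg_bytes: bytes, limit: int = SCAN_LIMIT) -> list:
    """Collect langtags from systemLanguage attributes in the first `limit` bytes."""
    langs = []
    for match in ATTR_RE.finditer(svg_bytes[:limit]):
        for tag in match.group(2).decode("utf-8", "replace").split(","):
            tag = tag.strip()
            if tag and tag not in langs:
                langs.append(tag)
    return langs


data = (b'<svg><switch><text systemLanguage="fr">a</text>'
        b'<text systemLanguage="de, fr">b</text></switch></svg>')
print(scan_languages(data))  # ['fr', 'de']
```

A translation that first appears after the limit is never seen, which is one way the check can be fooled.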

If the wikipedia's default language is in that list, then MediaWiki emits a URL that requests that language.

There is logic behind these choices. Most SVG files are not multilingual, and even if they are multilingual, they often do not support many languages. The goal is to avoid building language-specific URLs that do not affect the image. If an SVG file does not support Russian, then it does not make sense to scale and cache a Russian version of the SVG that looks exactly the same as the English version.
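The rules above can be sketched as one small function. I still need to check these claims against the code, so this only restates the notes: an explicit |lang= wins, English never gets a specifier, and otherwise the wiki's language is used only if the file lists it. The function name is hypothetical.

```python
def lang_prefix(wiki_lang, file_langs, explicit_lang=None):
    """Return the "lang[language]-" URL fragment, or "" when none is emitted."""
    if explicit_lang:
        lang = explicit_lang  # honored even if the file lacks that language
    elif wiki_lang in file_langs:
        lang = wiki_lang
    else:
        lang = None
    # Assumption from the observations above: en is the server default,
    # so no specifier is emitted for it.
    if lang is None or lang == "en":
        return ""
    return f"lang{lang}-"


file_langs = ["en", "ru"]
print(lang_prefix("en", file_langs, explicit_lang="tlh") + "220px-")  # langtlh-220px-
print(lang_prefix("ru", file_langs) + "220px-")                       # langru-220px-
print(lang_prefix("bg", file_langs) + "220px-")                       # 220px-
```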

Languages and fonts[edit]

Much like HTML, it is recommended that an SVG file declare its language. It can do that with xml:lang or lang attributes. Setting xml:lang on the top-level svg element will tell generic XML tools the language, but it can also add undesired language restrictions to RDF metadata. That can be worked around by adding xml:lang="" to the metadata element. Consequently, using lang is simpler.

Few SVG files declare their language. In many cases, such a declaration is extraneous because basic SVG does not depend upon the language. SVG does not have any constructs that format text according to the language. For example, language does not determine whether a number should display as "123.45" or "123,45" or whether a date should display as "3 May 1999" or "May 3, 1999". It would be good if SVG had that ability, but it is not present yet. At best, the language attribute is a hint to programs that read an SVG file, but it does not affect the basic display of SVG text.

Compare HTML datetime attribute that uniformly describes time information, but it does not format such a string.
Compare JavaScript formatting functions and the Intl package.

CSS can be sensitive to the language attribute.

Unicode does not specify which character variants should be used. (Well, it does sometimes. Normal "r" and rounded "ꝛ". Normal "d" and insular "ꝺ".[20]) For example, the Latin small letter a has two major variants: double-story a or single-story a. Chinese ideographs have similar variations, and some languages use specific variants. Chinese, Japanese, and Korean may draw the same ideograph differently. Unicode does not distinguish the character, so the font selection must make the change.

CSS can select an appropriate font for a language.

:lang(zh) {font-family: ...; }
:lang(ja) {font-family: ...; }
:lang(ko) {font-family: ...; }
:lang(bn) {font-family: Noto Sans Bengali, ...; }

On WMF servers, the problem is the :lang selector is not supported by librsvg. Also, we would want to distinguish between zh-Hans and zh-Hant. Unfortunately, old versions of librsvg only distinguish up to the first hyphen.

There is not a good solution for the systemLanguage attribute. CSS can do case-insensitive, partial, matches to an IETF langtag:

[systemLanguage|="zh" i] {font-family: ...; }
[systemLanguage|="ja" i] {font-family: ...; }
[systemLanguage|="ko" i] {font-family: ...; }
[systemLanguage|="bn" i] {font-family: Noto Sans Bengali, ...; }

But CSS is not designed to parse comma-separated lists (SVG should have made the systemLanguage attribute a space-separated list just like the class attribute). Even then, CSS does not have prefix matching (|=) on space-separated token lists (~= matching). One can use several selectors to cover the cases, but it is cumbersome.
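To show how cumbersome, this sketch generates the selectors needed to match one langtag (or any tag beginning with it) anywhere in a comma-separated systemLanguage value. It assumes the lists contain no whitespace around the commas, and it says nothing about whether librsvg supports these selectors. The function names are hypothetical.

```python
def system_language_selectors(tag: str) -> list:
    """Selectors that together match `tag` or `tag-...` as any list item."""
    return [
        f'[systemLanguage|="{tag}" i]',   # whole value is tag, or starts with "tag-"
        f'[systemLanguage^="{tag}," i]',  # first item of a list
        f'[systemLanguage$=",{tag}" i]',  # last item of a list
        f'[systemLanguage*=",{tag}," i]', # middle item of a list
        f'[systemLanguage*=",{tag}-" i]', # later item with subtags, e.g. ",zh-Hant"
    ]


def font_rule(tag: str, font_family: str) -> str:
    """Build one CSS rule from the selector list."""
    selectors = ",\n".join(system_language_selectors(tag))
    return f"{selectors} {{ font-family: {font_family}; }}"


print(font_rule("bn", "Noto Sans Bengali, sans-serif"))
```

That is five selectors for each language, compared with one :lang() selector.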

Languages and layout[edit]

Vertical layout tests for English and Chinese. The green text would work for a plot's y-axis, but the librsvg used by WMF does not handle the CSS.

Consider an x-y plot. The x-axis label will be horizontal and handled normally. The y-axis label is often written rotated by 90° with a text anchor of start or middle. That works for Western European languages, but Chinese should not rotate the characters but rather write them top to bottom.

The normal method of producing the y-axis label for a Western European language would be to rotate the text by -90°. The rotation point would be logically on the font baseline. For Chinese, the normal method would not be a rotation but to set the writing-mode to top-to-bottom. The logical baseline is no longer the bottom of the text but rather the center of the text. If the Western text used a start anchor point, then the Chinese text would use an end anchor point.

CSS can do the transform or set the writing mode, but there are subtle issues. A CSS transform property in a stylesheet will trump any transform attribute on the element (CSS priority). Conversely, a stylesheet rule would not trump a transform in a style attribute. Also, such transforms are applied before a text element's x and y attributes. Coordinates on tspan elements may be problematic.

The green text in the "Vertical Layout tests" to the right uses CSS to adjust a possible y-axis label. It could use a better Western European language default. The CSS is

:lang(zh-Hans) { font-family: NSimSun, sans-serif; }
:lang(zh-Hant) { font-family: PMingLiU-ExtB, MingLiU_HKSCS-ExtB, Microsoft JhengHei, sans-serif; }
.vert { fill: green; }
.vert:lang(en) { transform: rotate(-90deg); transform-origin: 0px 0px; }
.vert:lang(zh) { writing-mode: tb; text-anchor: end; transform: translate(-0.5em, 0em); }

For English, the text is rotated. For Chinese, the writing-mode is changed; the text is offset to the left to compensate for the different baselines.

Currently, the WMF rasterizer does not handle the example.

Internationalization (i18n) and localization (l10n)[edit]

Many SVG files are in just one language. Such files should set the xml:lang or lang attribute in the svg element to identify that language. (SVG Translate should look for this attribute to set the default language; otherwise it should ask for the existing language.)

SVG files that use the switch element and the systemLanguage attribute are internationalized. One SVG file supports many languages. Such SVG files are also known as multilingual. There are not separate SVG files for each language. It is not clear whether such SVG files should have an xml:lang or lang attribute in their svg element.

There are systems that support many languages but produce output files that are monolingual. The output of these systems is localized (specialized to a specific locale).

MediaWiki uses internationalized/multilingual SVG files to produce localized PNG files. The PNG files that librsvg produces are not multilingual.

That leads to semantic differences. When MediaWiki displays a multilingual SVG file, it displays the language desired by the wiki, but when I display an SVG file in my browser, my browser displays it in my preferred language.

Graphics editor roundtripping[edit]

A significant problem with i18n files is subsequent graphic editing. Importing an SVG file may not handle systemLanguage or may cause other damage.

A significant limitation of switch translated SVG is many graphics editing applications do not handle multilingual files. Most graphics applications have their own file format. They may import and export the SVG file format, but information may be lost during those conversions. We might expect grouping to survive, but we do not expect all SVG id attributes to survive. If the native file format does not handle translations, then we expect translations to be thrown away. If the import code does not handle switch, then all translations may be discarded.

The issue may be described as the ability to round trip information. Consider an SVG file with some particular information. If that SVG file is imported into an editing application and then exported, is the information still there?

What do we expect to round trip?

  • actual text
  • text anchor point
  • text anchoring method (left, middle, end)

What do we not expect to round trip?

  • id
  • class
  • data-*
  • style

Unfortunately, expectations are not always met. For example, Adobe Illustrator loses the text anchor.

Generally, it is difficult to mark specific text.

Many tools use $nnnn text.

How about alternatives?

If some text is unique, then use it for the key. The problem is if the same text is added while editing, then the uniqueness is lost.

Use the text as a key, but prefix with a special character (e.g., "$") to mark it as a key.

Industrial option[edit]

Goals

  • Want to keep translations
  • Want to use any graphics editor

Sanitize SVG[edit]

Currently Commons just checks for safe SVG. For example, Commons rejects an upload if there are any on-* attributes. An alternative is to just strip unsafe SVG.

There is a Phabricator task for that. Phab:T334953 Introduce an SVG Sanitizer.

Generate monolingual SVG[edit]

XSLT localizer. Transform multilingual SVG to monolingual. It can also strip unneeded namespaces such as inkscape: (that will not remove properties in style elements or attributes).
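I have not written the XSLT yet. As a stand-in, here is the same idea sketched in Python with ElementTree: each switch keeps the first clause that matches the requested language (or has no systemLanguage) and becomes a plain g, so attributes on the switch such as transform survive. The matching rule follows my reading of SVG 1.1 (exact match, or the requested tag is a prefix followed by "-"). The function names are hypothetical, and namespace stripping is not shown.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize SVG elements without a prefix


def _matches(attr: str, lang: str) -> bool:
    """True if `lang` matches any item of a comma-separated systemLanguage."""
    lang = lang.lower()
    tags = [t.strip().lower() for t in attr.split(",")]
    return any(t == lang or t.startswith(lang + "-") for t in tags)


def localize(svg_text: str, lang: str) -> str:
    """Return a monolingual copy of a switch-translated SVG."""
    root = ET.fromstring(svg_text)
    for switch in list(root.iter(f"{{{SVG_NS}}}switch")):
        # The first clause whose condition holds is the one a renderer shows.
        chosen = next(
            (c for c in switch
             if c.get("systemLanguage") is None
             or _matches(c.get("systemLanguage"), lang)),
            None,
        )
        for clause in list(switch):
            if clause is not chosen:
                switch.remove(clause)
        if chosen is not None:
            chosen.attrib.pop("systemLanguage", None)
        switch.tag = f"{{{SVG_NS}}}g"  # keep the switch's own attributes
    return ET.tostring(root, encoding="unicode")


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<switch transform="translate(1,2)">'
    '<text systemLanguage="fr">1er avril</text>'
    '<text>1 April</text>'
    '</switch></svg>'
)
print(localize(sample, "fr"))
```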

Special option: SVG to skeleton.

XSLT information

Extract translations[edit]

Multilingual SVG to XLIFF.

Reintegrate translation[edit]

Two possibilities:

Skeleton + translations → monolingual SVG (preferred)

Skeleton + translations → multilingual SVG.

MediaWiki language handling[edit]

Explain the lang= URL parameter on Commons. Does that demand the /lang in the URL? There are multiple levels here, too. If I'm on a wiki and click an SVG image, it takes me to a File: page on that wiki that displays the wiki's language version. From there, I can click on the Commons link. That takes me to Commons and will display the default language version.

  • Phab:T134408 Thumbnail-like rendering of localized SVGs for client-side rendering, 4 May 2016. Early recognition of localizing SVG.
  • Phab:T134455 Add experimental option for direct SVG output via srcset, 4 May 2016. Needs a localizer.
  • Phab:T134407 Provide a way to reference fonts for client-side SVG rendering, 4 May 2016. CSS would win here.
  • Phab:T134482 Beta feature for opt-in client side SVG rendering, 5 May 2016. This seems problematic. Each wiki page would need either some JavaScript to select the SVG or PNG, or there would be an HTTP vary on the user's option that would double the cache requirements.

List of languages[edit]

See Phab:T259018.

MediaWiki API will report the available languages:

See the metadata: [ {"name": "translations", "value": [] } ]. It is clearly from the switch information. It will have entries such as { "name": "en", "value": 2 }. IIRC, the 1 and 2 values are whether it is a substring match or an exact match. Find the code to be sure.
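A small sketch of pulling the language list out of that metadata once the API response has been decoded from JSON. The sample list is made up for illustration and only follows the shapes quoted above; the meaning of the 1 and 2 values is still unverified, so they are passed through untouched. The function name is hypothetical.

```python
def translation_langs(metadata: list) -> dict:
    """Return {langtag: value} from the "translations" metadata entry."""
    for item in metadata:
        if item.get("name") == "translations":
            return {entry["name"]: entry["value"] for entry in item.get("value", [])}
    return {}


# Hypothetical sample shaped like the entries described above.
sample_metadata = [
    {"name": "width", "value": 512},
    {"name": "translations", "value": [
        {"name": "en", "value": 2},
        {"name": "fr", "value": 2},
    ]},
]
print(translation_langs(sample_metadata))  # {'en': 2, 'fr': 2}
```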

I'm presuming this metadata is stored in the database rather than triggering a reparsing of the file. Check that out.

Templates[edit]

Translation units[edit]

Does not work for numbered items; see File:Steam locomotive scheme new.png and File:Steam locomotive.svg; steam locomotive (Q171043) (which has-parts)

  1. Firebox firebox (Q549635)
  2. Ashpan ash-pan (Q722734)
  3. Water (inside the boiler) water (Q283)
  4. Smokebox Smokebox (Q954573)
  5. Cab (several options, none great. e.g., truck cabin (Q224773))
  6. Tender tender (Q749311)
  7. Steam Dome steam dome (Q1158778)
  8. Safety Valve safety valve (Q730458)
  9. Regulator Valve pressure regulator (Q1260990)
  10. Superheater Header in smokebox
  11. Piston piston (Q45227)
  12. Blastpipe
  13. Valve Gear
  14. Regulator Rod
  15. Drive Frame
  16. Rear Pony Truck
  17. Front Pony Truck
  18. Bearing bearing (Q190100) and Axlebox
  19. Leaf Spring leaf spring (Q773544)
  20. Brake shoe brake shoe (Q124366)
  21. Air brake pump railway air brake (Q1196198) and pump (Q134574)
  22. (Front) Centre Coupler railway coupling (Q1501648)
  23. Whistle steam whistle (Q1765082) subclass of whistle (Q204917)
  24. Sand box

RDF hack[edit]

string (en) Wikidata item (en) Wikidata item (en)
Atomic number (Z) 'atomic number' (Q23809)
wdprop: atomic number atomic number
units:
'atomic number' (Q23809)
First ionization energy [eV] 'ionization energy' (Q483769)
wdprop: ionization energy ionization energy
units:
(Latn and Cyrl)
'ionization energy' (Q483769)

Wikidata (inconsistent templates, broken templates, and no Wikibase module)

Wikidata item, en, ?uselang=

Problem file[edit]

2022 Russian invasion of Ukraine

File:2022 Russian invasion of Ukraine.svg is an important map on Commons, but it is a mess. The map is needed in many languages, and how those translations are handled is a difficult issue. There are many localized versions of the map, but they may not get continuing updates to the original file. The conflict is active, so updates are desired.

The file can be improved in several ways, but some improvements may make the file difficult to edit. There are tradeoffs, and this file shows some of the problems.

Planar translations[edit]

The original map now has some multilingual additions, but they are essentially planar translations that SVG Translate and other translation tools cannot handle.

✓ Done Unwinding planar translations is tough. One needs to match the text elements by their position, but the positions may have moved slightly.

Now that the file uses the translation units that SVG Translate wants, several languages have been added. Some additional translations have been so close in time that I feared they would overwrite each other, but it does not appear that happened. SVG Translate may have a significant update model that allows concurrent translations.

Inkscape[edit]

The author of this SVG file uses Inkscape, so SVG Translate and hand edits to the file should not prevent the author from making changes. If the author has trouble, then it is important to list those troubles. It is possible the author will have trouble with class attributes.

A significant problem is users do not know how to add new date labels and text. Copying some text and then editing often produces confused translations or untranslatable text. If an entire switch is copied, then the English text is changed but the default text and all the other languages stay the same. The translations are confused. If just the text element is copied, then it also carries the systemLanguage attribute. That attribute prevents SVG Translate from translating the text. The best approach is to just insert new text; do not copy it from elsewhere in the image.

There are also strange edits that appear.

A switch element may contain several unrelated (and ultimately undisplayable) text elements. This may come about from copying the text elements. The copy somehow ends up within the switch. It should not display on the screen, so it would confuse the user.

Geometry elements are being inserted into switch elements that should contain only text elements. What determines where Inkscape will insert a new element? It should be treating a switch atomically. I have deleted several spurious geometry elements already, and now there are more:

<switch fill="#ffffff" transform="translate(1827.3,587.38)" id="switch4938">
  <rect style="display:inline;opacity:0.948718;fill:#dc0000;fill-opacity:1;stroke:#000000;stroke-width:0.245063;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:1.92453;stroke-opacity:1"
        id="rect352325-8-9-2-1-4-3-9-29-8-3"
        width="25.413" height="5.2650332" 
        x="-12.691059" y="-3.4384575" ry="1.2425818" 
        transform="rotate(0.33424498)"/>
  <text id="trsvg995" systemLanguage="fr"><tspan id="trsvg798">1er avril</tspan></text>
  <text id="trsvg996-tr" systemLanguage="tr"><tspan id="trsvg799-tr">1 Nisan</tspan></text>
  <text id="trsvg996-it" systemLanguage="it"><tspan id="trsvg799-it"> aprile</tspan></text>
  <text id="trsvg996-ru" systemLanguage="ru"><tspan id="trsvg799-ru">1 апреля</tspan></text>
  <text id="trsvg996-pt" systemLanguage="pt"><tspan id="trsvg799-pt">1 de Abril</tspan></text>
  <text id="trsvg996-el" systemLanguage="el"><tspan id="trsvg799-el">1 Απριλίου</tspan></text>
  <text id="trsvg996-ca" systemLanguage="ca"><tspan id="trsvg799-ca">1 d'abril</tspan></text>
  <text id="trsvg996-vi" systemLanguage="vi"><tspan id="trsvg799-vi">1 tháng 4</tspan></text>
  <text id="trsvg996"><tspan id="trsvg799">1 April</tspan></text>
</switch>

✓ Done The rect element will prevent any display of text. Also notice that the systemLanguage="en" clause was removed; it was probably replaced with the rect element. There is also the sneaky rotate by less than 1 degree transform. Inkscape is also inserting copious style information.
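Damage like this could be found automatically. The sketch below flags switch elements that contain non-text children, or an unconditional child that is not the last clause (which hides everything after it). It assumes the translation-unit pattern of text clauses followed by one default; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def lint_switches(svg_text: str) -> list:
    """Return (switch id, problem) pairs for suspicious switch elements."""
    problems = []
    root = ET.fromstring(svg_text)
    for switch in root.iter(SVG_NS + "switch"):
        sid = switch.get("id")
        children = list(switch)
        for index, child in enumerate(children):
            local = child.tag.rsplit("}", 1)[-1]  # drop the namespace part
            if child.tag != SVG_NS + "text":
                problems.append((sid, f"non-text child <{local}>"))
            is_last = index == len(children) - 1
            if "systemLanguage" not in child.attrib and not is_last:
                problems.append((sid, "unconditional child hides later clauses"))
    return problems


# Abbreviated version of the damaged switch shown above.
damaged = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<switch id="switch4938">'
    '<rect width="25.413" height="5.2650332"/>'
    '<text systemLanguage="fr">1er avril</text>'
    '<text>1 April</text>'
    '</switch></svg>'
)
for problem in lint_switches(damaged):
    print(problem)
```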

✓ Done Also, instead of editing a symbol definition, the use was exploded and the result edited in place.

Colors[edit]

Many people want the map colors changed. One concern was using web safe/colorblind-friendly colors. Consistent (and easily changed) colors can be done with styles.

Place names[edit]

The map is already large. There are hundreds of community names on the map. That presents the same translation bloat problem that a 100-language version of a US map presents. The map should use a skeleton file that is localized with a database of translations. WMF does not have that capability for SVG files. SVG also does not have an easy line-breaking method.

Need to work with what we have today. To keep the file size down, the switch elements are given or inherit styling from class="place". That allows the fill color, font family, and text-anchor to be specified in one place rather than repeated on each element. The font size is also given or inherited. The font size is a function of the city's population. The text position is also specified on the switch element so it need not be repeated for each translation.

Finding place names[edit]

Using WikiData to translate place names is complicated by difficult-to-resolve Ukrainian place names. For example, "Pershotravneve" maps to more than 30 WikiData items.[22] To automate the search, the name should be attached to a map point; that practice is not common on SVG maps. The projection parameters can be found by following sources back to File:Ukraine adm location map.svg; the base map claims to be an equirectangular projection that includes administrative regions. The SVG size is 1,546 × 1,038. Then invert that point with the map projection to get a latitude and longitude of the community. Then do the WikiData query that coincides with that position.

Equirectangular projection, vertical stretching 150 %
Border coordinates: north 52.7°, south 44.1°, west 21.5°, east 40.7°
Info
This map is part of a series of location maps with unified standards: SVG as file format, standardised colours and name scheme. The boundaries on these maps always show the de facto situation and do not imply any endorsement or acceptance. In case of changes of the shown area the file is updated. The old version will be uploaded as a new file and thus is still available.

The file is 2,199 × 1,478 px. Radekhiv is at (350.01 px, 413.3 px) → 50.3° N, 24.56° E. Google Maps says 50.28° N, 24.60° E.[23] The WikiData item is Radekhiv (Q904046); location is coordinate location (P625): 50°16′58″N 24°38′15″E

The vertical stretching comment of 150% is the same as shrinking the horizontal by 2/3. Setting cos φ = 2/3 gives the standard parallels as φ = ±48.1897°.
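The arithmetic can be checked with a short sketch. It assumes pixel (0, 0) is the top-left corner at the north-west border coordinates and that both axes are linear in degrees, which is what an equirectangular projection gives. The constant and function names are hypothetical.

```python
import math

# Border coordinates of the base map and the file size in pixels.
WEST, EAST = 21.5, 40.7
NORTH, SOUTH = 52.7, 44.1
WIDTH, HEIGHT = 2199.0, 1478.0


def pixel_to_lonlat(x: float, y: float) -> tuple:
    """Invert the equirectangular projection for one map point."""
    lon = WEST + x / WIDTH * (EAST - WEST)
    lat = NORTH - y / HEIGHT * (NORTH - SOUTH)
    return lon, lat


def standard_parallel(vertical_stretch: float = 1.5) -> float:
    """Latitude whose cosine equals the horizontal shrink 1 / vertical_stretch."""
    return math.degrees(math.acos(1.0 / vertical_stretch))


lon, lat = pixel_to_lonlat(350.01, 413.3)  # Radekhiv
print(f"{lat:.2f} N, {lon:.2f} E")
print(f"standard parallel: {standard_parallel():.4f}")
```

The result lands within a few hundredths of a degree of the Google Maps and WikiData positions, which should be close enough to pick the right WikiData item by position.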

Locations use circles; it might be better to use symbols.

 <g fill="#ff4" stroke="#777" stroke-width=".71">
  <circle cx="950.74" cy="379.56" r="2.49"/>
  <circle cx="246.42" cy="424.61" r="2.49"/>
  <circle cx="350.01" cy="413.3" r="2.49"/>
  <circle cx="340.11" cy="175.71" r="2.49"/>
  <circle cx="1252.6" cy="439.46" r="2.49"/>
  <circle cx="1283.4" cy="500.98" r="2.49"/>
  <circle cx="288.49" cy="210.71" r="2.49"/>
  <circle cx="297.69" cy="259.51" r="2.49"/>
  <circle cx="307.23" cy="319.26" r="2.49"/>
  <circle cx="372.29" cy="378.3" r="2.49"/>
  <circle cx="463.15" cy="243.6" r="2.49"/>
  <circle cx="527.85" cy="150.26" r="2.49"/>
  <circle cx="1596.7" cy="465.5" r="2.49"/>
  <circle cx="1671.8" cy="477.12" r="2.49"/>
 </g>
 <g fill="#ff4" stroke-width=".71">
  <g stroke="#777">
   <circle cx="1598.5" cy="574.58" r="2.49"/>
   <circle cx="1648.1" cy="611.47" r="2.49"/>
   <circle cx="1687.9" cy="570.18" r="2.49"/>
   <circle cx="1782.4" cy="651.77" r="2.49"/>
   <circle cx="1722.5" cy="661.25" r="2.49"/>
   <circle cx="1722.1" cy="533.15" r="2.49"/>
   <circle cx="1700.9" cy="655.44" r="3.2"/>
   <circle cx="1540.9" cy="377.32" r="2.49"/>
  </g>
  <circle cx="1577.5" cy="330.57" r="2.49" stroke="#787877"/>
  <g stroke="#777">
   <circle cx="1374.1" cy="333.64" r="2.49"/>
   <circle cx="1330.1" cy="312.07" r="2.49"/>
   <circle cx="1464.5" cy="261.7" r="2.49"/>
  </g>
 </g>

Sensible grouping may be done by finding a location circle and then finding nearby text. Alternatively, locate all circles near some text. The grouping also allows translation issues to be detected. For example, the anchor point of some text may need to be moved if a translation is significantly longer or shorter than the original.

<g>
  <circle class="city" r="3" />
  <text class="city" x="10" y="0">City Name</text>
</g>
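
The pairing step could be sketched as a nearest-neighbor search. This is only a sketch: the function name pairCirclesWithLabels and the label positions are hypothetical, and it compares circle centers with raw text anchor points, so text with a middle or end anchor may need a larger distance limit.

```javascript
// Pair each location circle with the nearest text anchor.
// circles: [{cx, cy}], labels: [{x, y, text}]; maxDist is in SVG user units.
function pairCirclesWithLabels(circles, labels, maxDist) {
  return circles.map((c) => {
    let best = null;
    let bestDist = Infinity;
    for (const t of labels) {
      const d = Math.hypot(t.x - c.cx, t.y - c.cy);
      if (d < bestDist) {
        bestDist = d;
        best = t;
      }
    }
    // Leave the circle unpaired when nothing is close enough.
    return { circle: c, label: bestDist <= maxDist ? best : null };
  });
}

// The circle is from the file; the label positions are made up for illustration.
const circles = [{ cx: 350.01, cy: 413.3 }];
const labels = [
  { x: 356, y: 415, text: "Radekhiv" },
  { x: 470, y: 245, text: "Elsewhere" },
];
const pairs = pairCirclesWithLabels(circles, labels, 20);
console.log(pairs[0].label.text);
```

Circles that come back with a null label are candidates for manual review.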

The map may use a more sensible grouping of communities within districts.

Would like to detect content that is a date.

The text should use class attributes and CSS for the formatting. Map (font, font size, color) → class.

Several g elements are used to default the font size (or other formatting characteristics) of their contained text elements. Unwinding those groups is a difficult problem. Perhaps detect a group that has presentation attributes and only text children.

<g font-family="Calibri" font-size="3.27" font-weight="bold" stroke-width=".61">
  <text x="923.1" y="241.1">1 April</text>
  <text x="1180.34" y="253.12">1 April</text>
  <text x="1133.34" y="158.79">2 April</text>
  <text x="1372.74" y="238.32">2 April</text>
  <text x="1446.72" y="159.12">4 April</text>
  <text x="936.76" y="345.97">30 March</text>
  <text x="983.05" y="390.31">31 March</text>
  <text x="1047.81" y="215.22">31 March</text>
  <text x="1180.34" y="204.35">1 April</text>
</g>

✓ Done There are several screwy transform attributes. For example, transform="scale(1.000,1)". Some other transforms have rotations of a fraction of a degree. The effective rotations are small enough that they can be ignored (except for the additional translation they introduce). Some matrices have a similar small rotation. There is even the bizarre:

<circle transform="scale(1 -1)" cx="1291.2" cy="-261.91" r="2.49"/>
<circle cx="1128.4" cy="255.56" r="2.49" fill-rule="evenodd" stroke-linecap="round" stroke-linejoin="round"/>

✓ Done There are text elements that are not stroked but have stroking attributes.

✓ Done There are switch elements that have a single default clause. Other editors have used SVG Translate on the file, so the problem has disappeared.

✓ Done Bombing locations should be symbols, but they may not even be grouped. These use relative coordinates, so the path matching may be easier than expected. The DOM dropped path primitives.

 <path d="m791.33 600.27 5.164 3.958 1.051-3.741 1.484 3.803 4.113-3.556-2.226 4.607 3.525-.124-2.288 2.876 3.061 1.979-3.649.062 3.216 3.865-4.607-2.134-.495 4.143-2.257-3.34-3.154 4.298.309-5.38-4.298.866 2.69-3.525-2.196-3.587 3.061.588z" fill="red"/>
 <path d="m792.98 602.35 3.877 2.972.789-2.809 1.114 2.856 3.088-2.67-1.672 3.459 2.647-.093-1.718 2.159 2.299 1.486-2.74.047 2.415 2.902-3.459-1.602-.372 3.111-1.695-2.507-2.368 3.227.232-4.04-3.227.65 2.02-2.647-1.648-2.693 2.299.441z" fill="#ff8000"/>
 <path d="m794.64 604.5 2.49 1.908.507-1.804.716 1.834 1.983-1.715-1.074 2.221 1.7-.06-1.103 1.387 1.476.954-1.759.03 1.551 1.864-2.222-1.029-.239 1.998-1.088-1.61-1.521 2.072.149-2.594-2.073.418 1.297-1.7-1.059-1.729 1.476.283z" fill="#ff0"/>
 <path d="M798.43 608.4a.777.777 0 1 1-1.553 0 .777.777 0 1 1 1.553 0z"/>

The last path uses two arcs to make a circle of radius 0.777 and diameter 1.553. It would give the center of the bombing location and the presumptive symbol origin.
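
Finding that center can be sketched as follows. The sketch assumes the exact shape seen above (an absolute moveto followed by one relative arc command holding two half-circle arcs and a closepath); the names pathNumbers and twoArcCircle are hypothetical.

```javascript
// Extract the numbers from SVG path data, including compact forms
// such as ".777.777" and "1-1.553".
function pathNumbers(d) {
  const re = /[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?/g;
  return (d.match(re) || []).map(Number);
}

// Center and radius of a circle written as
// "M x y a rx ry rot large sweep dx dy rx ry rot large sweep dx dy z".
// Only that one shape is recognized; anything else returns null.
function twoArcCircle(d) {
  if (!/^M[^A-Za-z]*a[^A-Za-z]*z$/.test(d.trim())) return null;
  const n = pathNumbers(d);
  if (n.length !== 16) return null;
  const [x0, y0, rx] = n;
  const dx = n[7];
  const dy = n[8];
  // The first arc ends diametrically opposite the start point,
  // so the center is halfway along its displacement.
  return { cx: x0 + dx / 2, cy: y0 + dy / 2, r: rx };
}

const marker = "M798.43 608.4a.777.777 0 1 1-1.553 0 .777.777 0 1 1 1.553 0z";
const center = twoArcCircle(marker);
console.log(center);
```

The returned center is the candidate origin for a bombing symbol; the three star paths that precede the circle would then be matched relative to it.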

It may be better to localize the file with XSLT, fix some issues, and then restart the translations.

Source data[edit]

See w:Template:Russo-Ukrainian War detailed map and w:Module:Russo-Ukrainian War detailed map. These maps are made with technology on the English Wikipedia. The air bases and nuclear installations have names. Cities are listed with latitude and longitude by Oblast. Labels are wikilinks such as Zelenodolsk, and following that link produces the WikiData item Zelenodolsk (Q640713). The diagram has lost a lot of information.

The module has some data in an apparent Lua object. I do not know if the data is available as JSON.

How do I find which items have links to a Wikipedia article?

What is the best query approach?

Wikidata API queries[edit]

From a wiki article, find the Q-item? See https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bwbentityusage

returns

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "33276544": {
                "pageid": 33276544,
                "ns": 0,
                "title": "Zelenodolsk, Ukraine",
                "wbentityusage": {
                    "Q10172305": {
                        "aspects": [
                            "S"
                        ]
                    },
                    "Q640713": {
                        "aspects": [
                            "C",
                            "D.en",
                            "O",
                            "S",
                            "T"
                        ]
                    }
                }
            }
        }
    }
}

Can a SPARQL query find which item has a link to the article?

Position-based SPARQL query[edit]

Find a settlement using its latitude and longitude.

For example, Zelenodolsk is at Point(33.652359815 47.563096347). Find the settlements near that point:

#title: places in Ukraine near a coordinate
# SELECT ?place ?placeLabel ?location WHERE {
#  wd:Q640713 wdt:P625 ?coord.                         # coordinates of the location
#  ?place wdt:P17 wd:Q212;                               # country: Ukraine
#         wdt:P625 ?location.
#  FILTER(geof:distance(?location, ?coord) < 10). # less than 10 km away
SELECT DISTINCT ?place ?placeLabel ?oblastLabel ?location ?distance WHERE {
  Bind("Point(33.652359815 47.563096347)"^^geo:wktLiteral as ?coord).
  ?place wdt:P31/wdt:P279* wd:Q12051488 . # populated place in Ukraine
  # ?place wdt:P131* ?oblast.
  # ?oblast wdt:P31 wd:Q3348196. # located in Ukrainian oblast
  
  # Search by Nearest
  SERVICE wikibase:around { 
    ?place wdt:P625 ?location . 
    bd:serviceParam wikibase:center ?coord .
    bd:serviceParam wikibase:radius "10" . 
    bd:serviceParam wikibase:distance ?distance .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} 
Order by ?distance


Inverted SPARQL query[edit]

Alternatively, invert the problem. Get a list of human settlements in Ukraine and use that list to match the names. This query takes less than 4 seconds. It does not find Kyiv. Kyiv is not located in an Oblast — probably like Washington D.C. is not located within a state. So the Oblast could be optional. Furthermore, not all settlements have a population. If I do not acquire population, then the query takes 30 seconds.

#title: populated places in Ukraine
# -> 30,000 results w/o population, 6000 w population, 1700 w pop >= 1000
SELECT DISTINCT ?place ?placeLabel ?oblastLabel ?location ?population ?native WHERE {
  ?place wdt:P31/wdt:P279* wd:Q12051488 . # populated place in Ukraine
  ?place wdt:P625 ?location . # coordinates of the location
  optional {
    ?place wdt:P1082 ?population . # population
    # filter (?population >= 200000) .
  }
  optional {
    ?place wdt:P131* ?oblast . # located in an administrative region
    ?oblast wdt:P31 wd:Q3348196. # that is a Ukrainian oblast
  } # located in an admin region
  optional {?place wdt:P1705 ?native .}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} 
Order by ?placeLabel


There are more issues for oblasts. Several settlements are repeated because their oblast changed over time. Consequently, start times and end times for administrative regions are important. Is there an easy way to screen for outdated oblasts?

The name matching does not work well. The map has about 600 place names, but only 249 matches are found. Many settlements do not have a native label (they may have a Ukrainian label). In addition, the English spelling used on the map does not always match the WikiData label. Approximate string matching may help.
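
Approximate matching could start with a plain edit distance. The sketch below uses the Levenshtein distance, which is one possible choice and not necessarily the best one for transliterated names; the function names and the candidate list are hypothetical.

```javascript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Return the candidate label closest to name, or null if none is within maxDistance.
function closestLabel(name, candidates, maxDistance) {
  let best = null;
  let bestDist = Infinity;
  for (const c of candidates) {
    const d = editDistance(name.toLowerCase(), c.toLowerCase());
    if (d < bestDist) {
      bestDist = d;
      best = c;
    }
  }
  return bestDist <= maxDistance ? best : null;
}

console.log(closestLabel("Kiev", ["Kharkiv", "Kyiv"], 2));
```

A distance limit that scales with the name length would probably behave better than a fixed limit, because short names collide easily.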

Fixes[edit]

I made some fixes to the file, and there were surprises. Several graphic elements had been merged or shuffled, and it takes a lot of work to find even simple cases. It is tedious work by hand. Another problem with working on a frequently updated file: new revisions. A recent revision caught me half-way through doing some housekeeping edits. Now I need to figure out how to merge them. That is further complicated by Inkscape's verbose output: one attribute per line (with the addition of an id attribute to every element). It is tougher to edit the file by hand. Time to run it through an XML pretty printer. I cannot really complain. Inkscape maintained the file structure and even the XML comments of my most recent upload. More importantly, the recent edit added content.

Another realization is a further SVG Translate issue. Most of the file is a planar translation: it has a high-level switch with the separate planes as g elements. SVG Translate leaves the complicated groups with systemLanguage attributes alone, but it apparently processes the default clause. That processing includes adding switch translations to every text element.

I got caught again. This time, the file was changed with SVG Translate (the legend is now a good target for SVG Translate, but not the rest of the file) while I was working on some changes.

Planning other fixes....

Locations[edit]

✓ Done The circles used to display cities are inconsistent. There are several radii to represent the size of the city, but the Ukrainian cities have a gray stroke while the Russian-held cities do not have a stroked border:

<circle cx="1597.2" cy="395.17" r="2.49" fill="#ff4" stroke="#777" stroke-width=".71"/>
<circle cx="1680.7" cy="408.34" r="2.49" fill="red"/>

✓ Done The stroke width is almost exclusively 0.71, but there are some cases with 1.09, 0.5, and 0.41. Some CSS would be neater and allow quickly adding a border:

circle.uk {fill: #ff4; stroke: #777; stroke-width: .71px; }
circle.ru {fill: red; stroke: none; stroke-width: .71px; }
<circle cx="1597.2" cy="395.17" r="2.49" class="uk"/>
<circle cx="1680.7" cy="408.34" r="2.49" class="ru"/>

✓ Done The Russian fill is usually red, but sometimes it is #fa2c29.

  • #fa2c29
  • #ff0000

Just using red seems reasonable.

✓ Done Date label fills use yellow and a darker red:

  • #ff0
  • #dc0000

Much of the placename text is a blue #04a.

The placename text usually is the same as the placenames used in the English version. Just use the English placenames and then add back the few changes (e.g., French uses Kiev).

The biggest problem with placenames is the dot size and the font size. Those sizes reflect the population, but consistent handling of those items is tough. In addition, some placename text may need different text anchors. Putting a size value in the class would work to set the font size, but it may not work for SVG 1.1 circle elements. The r radius is a geometry property that can be set with CSS in SVG 2.0, but it is just an attribute in SVG 1.1.[24]

The issue of dot size.

Towns and Villages
Population   Dot size   Possible r   Label size   Possible font-size   Contested city size
Capital      35         8.71         140          17.79
1M +         28                      130
500K +       24                      120
200K +       20                      110
100K +       16                      100
50K +        14                      90
20K +        12                      80
10K +        10                      70
5K +         8                       60
< 5K         6                       0 or 50
	--Towns & Villages
			-- Dotsize vs. Population
	--Arranged by Oblasts, then cities, alphabetical order

Locations

Styling with class[edit]

I would like to use the class attribute and CSS to set styling. I did that in the map legend, but some web searches suggest that it is difficult to use class/CSS formatting in Inkscape. I need to find out more to avoid making the file difficult for others to edit.

Some comments suggested that class must be set in the XML editor (which might be daunting for many editors and have substantial peril). In addition, changing the class may not cause Inkscape's visual display to be updated. How does Inkscape handle styling? There were also comments about using Inkscape styling extensions, but extensions are not a good route.

Date text[edit]

✓ Done The date text does vary among versions, but the translations are direct. A wholesale use of the systemLanguage="en" group followed by editing the dates should work.

Dates are done in Calibri bold. The date text depends on the background. Russian dates are white, Ukrainian dates are black. Unfortunately, librsvg does not handle class conjunctions:[25]

text.date { font-family: Calibri; font-weight: bold; font-size: 3.27px; text-anchor: middle; }
text.date.ru { fill: #FFF; }
text.date.uk { fill: #000; }

Dates and background rects[edit]

✓ Done A date label was made with a rect element for the background and a text element for the date. I changed the rect elements to use the #labelru and #labeluk symbols. I also paired the symbols with their corresponding text, so the SVG now looks like:

<use xlink:href="#labelru" x="961.3" y="378.84" />
<switch fill="#fff" transform="translate(965.88, 382.3)">
  <text systemLanguage="en"><tspan>25 February</tspan></text>
  <text systemLanguage="fr"><tspan>25 février</tspan></text>
  <text systemLanguage="tr"><tspan>25 Şubat</tspan></text>
  <text><tspan>25 February</tspan></text>
</switch>

✓ Done The text does not use text-anchor="middle", so the shorter "25 Şubat" will skew to the left.

✓ Done The text x-coordinate should be shifted to the midpoint of the use element. That would be the use element's x plus half the width of the label symbol.

Ideally, the origin of the symbol and the midpoint of the text would coincide. The #labelru and #labeluk symbols can be shifted to use the same origin as the text.

Filters would be a better way to handle the rect. Rather than having a separate use, the switch or the text could use a filter instead. The filter could even adjust to the length of the date. The support for filter may be troublesome. The best method may use feImage that points to SVG for a rounded rectangle image, but I doubt there is reasonable support for that construct. Using rectangles would have good support, but it would have sharp corners.

Ugh. It is clear that bounding boxes for text elements are not computed correctly.

Dates and groups[edit]

Placing both elements in a group would allow positioning both. Such a grouping may be confusing to others.

Several labels have the same date and consequently redundant translations. A simplification would be to put each date into a symbol where it would be translated once. It could even be used in both symbols:

<use xlink:href="#labelru" ... />
<use xlink:href="#april_15" fill="#fff"/>

<use xlink:href="#labeluk" ... />
<use xlink:href="#april_15" fill="#000"/>

That change may also be confusing to others.

Dates and automatic translations[edit]

The Intl package can format international dates.[26]

var date = new Date(2022, 2, 15);
new Intl.DateTimeFormat("de", {day: "numeric", month: "long"}).format(date);

There are some issues: "en" → "March 15", "en-GB" → "15 March". For German, a period is added after the day. Hand translation for French gives "1er avril".[27] For Italian, "1º aprile".[28] The flourishes are only for the first day of the month.

It would be better if the dates were generated automatically rather than manually translated.

I use the Date object to parse the default date (Date.parse(el.textContent + " 2022")). Then I use the Intl package to compare the dates in the switch element clauses.

From a data standpoint, a more sensible default clause would use an ISO date format.

<switch>
  <text systemLanguage="en">15 March</text>
  <text>2022-03-15</text>
</switch>

Unfortunately, that would confuse SVG Translate. Going from English to another language would present "15 March", but going from default would present "2022-03-15".

Copies and strange transformations[edit]

I'm seeing strange changes to the SVG. Notice the y="31.370001". That suggests the number 31.37 was bumped by a single-float epsilon. Furthermore, transform="rotate(9.267) translate(1485.7, Y)" was rewritten as transform="rotate(9.267,-957.72641,9148.3009)". It is rotating the origin to a desired location!
Gross check (needs work):
1460.7007169041, 273.62571979975
1460.7006854713, 273.62571624018
Is the rewrite done by Inkscape or by SVG Translate?

<use xlink:href="#labelru" transform="rotate(9.267)" x="1473" y="31.370001" id="use4827" width="100%" height="100%"/>
<switch fill="#ffffff" transform="rotate(9.267,-957.72641,9148.3009)" id="switch4839">
  <text systemLanguage="en" id="trsvg973"><tspan id="trsvg776">25 February</tspan></text>
  <text systemLanguage="fr" id="trsvg974"><tspan id="trsvg777">25 février</tspan></text>
  <text id="trsvg975-tr" systemLanguage="tr"><tspan id="trsvg778-tr">25 Şubat</tspan></text>
  <text id="trsvg975-it" systemLanguage="it"><tspan id="trsvg778-it">25 febbraio</tspan></text>
  <text id="trsvg975-ru" systemLanguage="ru"><tspan id="trsvg778-ru">25 февраля</tspan></text>
  <text id="trsvg975-pt" systemLanguage="pt"><tspan id="trsvg778-pt">25 de Fevereiro</tspan></text>
  <text id="trsvg975-el" systemLanguage="el"><tspan id="trsvg778-el">25 Φεβρουαρίου</tspan></text>
  <text id="trsvg975"><tspan id="trsvg778">25 February</tspan></text>
</switch>
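
The rewritten transform can be checked numerically. rotate(a, cx, cy) maps a point p to R(p − c) + c, while rotate(a) translate(t) maps p to R(p + t); equating the two gives t = Rᵀc − c. The sketch below applies that identity; the function names are hypothetical.

```javascript
// Convert rotate(angle, cx, cy) into the equivalent rotate(angle) translate(tx, ty).
// From R(p - c) + c = R(p + t) it follows that t = R^T c - c.
function rotateCenterToTranslate(angleDeg, cx, cy) {
  const a = (angleDeg * Math.PI) / 180;
  const cos = Math.cos(a);
  const sin = Math.sin(a);
  return {
    tx: cos * cx + sin * cy - cx,
    ty: -sin * cx + cos * cy - cy,
  };
}

// Where the element's local origin (0, 0) lands under rotate(angle, cx, cy).
function originUnderRotate(angleDeg, cx, cy) {
  const a = (angleDeg * Math.PI) / 180;
  const cos = Math.cos(a);
  const sin = Math.sin(a);
  return {
    x: cx - (cos * cx - sin * cy),
    y: cy - (sin * cx + cos * cy),
  };
}

const t = rotateCenterToTranslate(9.267, -957.72641, 9148.3009);
const o = originUnderRotate(9.267, -957.72641, 9148.3009);
console.log(t.tx.toFixed(2), t.ty.toFixed(2));
console.log(o.x.toFixed(4), o.y.toFixed(4));
```

The recovered tx agrees with the 1485.7 of the original translate, and the origin lands at the position listed in the gross check, so the rewrite looks like an equivalent transform rather than a change in geometry.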

Somebody is going nuts duplicating rotated use elements.

<use xlink:href="#labelru" x="1937.40" y="544.58002"/>
<use xlink:href="#labelru" x="1937.40" y="544.58002" transform="rotate(0.334,-4875.8035,-16138.871)"/>
<use xlink:href="#labelru" x="1937.40" y="544.58002" transform="rotate(0.668,-5215.2903,-325.84574)"/>
<switch fill="#ffffff" transform="translate(1924.7,548.29)">
  <text systemLanguage="en"><tspan>6 March</tspan></text>
  <text systemLanguage="fr"><tspan>6 mars</tspan></text>
  <text systemLanguage="tr"><tspan>6 Mart</tspan></text>
  <text systemLanguage="it"><tspan>6 marzo</tspan></text>
  <text systemLanguage="ru"><tspan>6 марта</tspan></text>
  <text systemLanguage="pt"><tspan>6 de Março</tspan></text>
  <text systemLanguage="el"><tspan>6 Μαρτίου</tspan></text>
  <text systemLanguage="ca"><tspan>6 de març</tspan></text>
  <text><tspan>6 March</tspan></text>
</switch>

See SVGAnimatedTransformList. The API has a wonderful .consolidate() method. The API is incomplete. There is not a method to copy a transform or a list of transforms. Instead of using the API to concatenate two transform lists, it was easier to concatenate text strings:

el2.setAttribute("transform", el1.getAttribute("transform") + " " + el2.getAttribute("transform"));

Symbols[edit]

There are more symbols to extract: air bases, harbors, and power plants.

✓ Done Contested city

✓ Done The air base icon

✓ Done The harbor icon

✓ Done The power plant icons have changed from their original form. The Ukrainian version is a solid fill rather than a gradient. The Russian version still has a gradient, but it is not prominent. If I use a solid fill, then they can be a single symbol and the fill can be determined with class="uk" or class="ru".

Hydroelectric plant (not used?)

SVG Translate bogus langtags[edit]

✓ Done

Change bogus systemLanguage="zh_HANT" to systemLanguage="zh-HANT". Quick and dirty would select all the systemLanguage attributes and change underscores to hyphens. Killing bad langtags is good practice, but it will give horrible user interactions in SVG Translate. Users may continually try to translate a phrase that already has a translation. That would mean keeping both the bad langtag (to satisfy SVG Translate) and the good langtag (to satisfy SVG). Then updates to the bad langtag would have to be copied to the good langtag. What a mess.
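
The quick-and-dirty fix amounts to a string replacement on each attribute value. A minimal sketch, with the hypothetical name fixLangtags, that also handles the comma-separated lists that systemLanguage allows:

```javascript
// systemLanguage holds a comma-separated list of language tags.
// Replace underscores with hyphens in each tag; nothing else is validated.
function fixLangtags(systemLanguage) {
  return systemLanguage
    .split(",")
    .map((tag) => tag.trim().replace(/_/g, "-"))
    .join(",");
}

console.log(fixLangtags("zh_HANT"));
```

In a browser, the function would be applied to every element selected with querySelectorAll("[systemLanguage]").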

✓ Done SVG Translate seems to be duplicating clauses on subsequent invocations. Link to Phab issue.

Text element within switches with coordinates[edit]

I fixed a few switch element bodies that have translated text. Ideally, the switch element's transform property sets the starting text position. The text element and its first tspan element should not have x, y, or transform attributes.

<switch id="switch4565-3-6-9" transform="translate(1354.865,893.27667)" class="place" font-size="5.34px"
      style="font-family:'Liberation Sans', Arial, sans-serif;text-anchor:middle;fill:#0044aa">
    <text systemLanguage="en" id="trsvg2351-1-1-88" x="11.933496" y="1.381068"><tspan id="tspan17037-0-9-2">Vysokopillia</tspan></text>
    <text id="text4206-2-uk-1-5" inkscape:label="text4206-2" systemLanguage="uk" x="12" y="2"><tspan id="tspan16021-7-uk-5-0">Високопілля</tspan></text>
    <text id="text4206-2-7-94" inkscape:label="text4206-2" x="12" y="0"><tspan id="tspan16021-7-6-0">Vysokopillia</tspan></text>
</switch>

The problem may be common enough that I should try to detect it.

  • switch/text[x] | switch/text[y] | switch/text[transform]
  • switch/text[x or y or transform]

Elements with redundant style information[edit]

Replace style attribute with equivalent class value.

Perhaps hoist information to the parent switch element.

Overridden attributes[edit]

The fill attribute below is overridden by the style attribute. It should be removed (or replaced with the style value).

<path
  d="m 1843.9558,570.10404 c 2.6735,-9.49081 14.9422,-26.82254 30.9812,-32.57107 l -1.8944,-1.36528 8.0776,-0.85102 -3.664,6.36843 -1.0336,-2.24336 c -8.1598,4.31111 -14.7336,27.48597 -14.5545,35.09065 -1.1877,0.10078 -15.9766,-2.62975 -17.9123,-4.42835 z"
  fill="url(#bb)"
  id="path707-1"
  style="fill:url(#linearGradient17463);fill-opacity:1;stroke-width:1.3109"
  sodipodi:nodetypes="cccccccc" />

Transform removal[edit]

Some code that simplifies use elements with a transform attribute should be generalized. The width="100%" and height="100%" attributes may be removed.

<use
  xlink:href="#bomb"
  x="1569.3"
  y="857.94"
  id="use819-9"
  width="100%"
  height="100%"
  transform="translate(150.788,415.75052)" />

Simple translations are consolidated, but handling scales is problematic.
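
The simple case can be sketched as follows. For a use element, x and y act as a final translate(x, y) appended to the transform, so a pure translate can be added to them directly; a scale does not commute with that translation, which is why it is left alone here. The function name foldTranslate and the plain-object representation of attributes are assumptions for illustration.

```javascript
// Fold a pure translate(tx[, ty]) transform into the x and y of a use element.
// attrs is a plain object of attribute strings; other transforms are left alone.
function foldTranslate(attrs) {
  const m = /^\s*translate\(\s*([-+.\deE]+)(?:[\s,]+([-+.\deE]+))?\s*\)\s*$/.exec(
    attrs.transform || ""
  );
  if (!m) return { ...attrs };
  const tx = Number(m[1]);
  const ty = m[2] === undefined ? 0 : Number(m[2]);
  if (Number.isNaN(tx) || Number.isNaN(ty)) return { ...attrs };
  const { transform, ...rest } = attrs; // drop the transform attribute
  const out = {
    ...rest,
    x: String(Number(attrs.x || 0) + tx),
    y: String(Number(attrs.y || 0) + ty),
  };
  // width="100%" and height="100%" are the defaults and carry no information.
  if (out.width === "100%" && out.height === "100%") {
    delete out.width;
    delete out.height;
  }
  return out;
}

const folded = foldTranslate({
  "xlink:href": "#bomb",
  x: "1569.3",
  y: "857.94",
  id: "use819-9",
  width: "100%",
  height: "100%",
  transform: "translate(150.788,415.75052)",
});
console.log(folded);
```

The sums may pick up floating-point noise in the last digits, so rounding to a few decimals before writing the attributes back is advisable.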

Null translations[edit]

If the textContent of all the switch clauses is the same...

            <switch id="switch329" transform="translate(147.16, 1453.72)">
              <text id="text274-zh-hant" systemLanguage="zh-hant"><tspan id="trsvg43-zh-hant">40</tspan></text>
              <text id="text274-zh-tw" systemLanguage="zh-tw"><tspan id="trsvg43-zh-tw">40</tspan></text>
              <text id="text274-zh-cn" systemLanguage="zh-cn"><tspan id="trsvg43-zh-cn">40</tspan></text>
              <text systemLanguage="en" id="trsvg1744"><tspan id="trsvg1120">40</tspan></text>
              <text id="text274-fr" systemLanguage="fr"><tspan id="trsvg43-fr">40</tspan></text>
              <text id="text274-es" systemLanguage="es"><tspan id="trsvg43-es">40</tspan></text>
              <text id="text274-el" systemLanguage="el"><tspan id="trsvg43-el">40</tspan></text>
              <text id="text274-uk" systemLanguage="uk"><tspan id="trsvg43-uk">40</tspan></text>
              <text id="text274-ka" systemLanguage="ka"><tspan id="trsvg43-ka">40</tspan></text>
              <text id="text274-lt" systemLanguage="lt"><tspan id="trsvg43-lt">40</tspan></text>
              <text id="text274-ca" systemLanguage="ca"><tspan id="trsvg43-ca">40</tspan></text>
              <text id="text274-ko" systemLanguage="ko"><tspan id="trsvg43-ko">40</tspan></text>
              <text id="text274-mn" systemLanguage="mn"><tspan id="trsvg43-mn">40</tspan></text>
              <text id="text274-nl" systemLanguage="nl"><tspan id="trsvg43-nl">40</tspan></text>
              <text id="text274-zh" systemLanguage="zh"><tspan id="trsvg43-zh">40</tspan></text>
              <text id="text274"><tspan id="trsvg43">40</tspan></text>
            </switch>

The matching algorithm should not just delete clauses that match the default. Language preferences can produce unexpected results.
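
A conservative test is to treat a switch as a null translation only when every clause, including the default, has the same text. The sketch below assumes the clauses have already been read into plain objects; the name isNullTranslation is hypothetical.

```javascript
// clauses: [{lang, text}], where lang is null for the default clause.
// True only when every clause carries exactly the same text.
function isNullTranslation(clauses) {
  if (clauses.length === 0) return false;
  const first = clauses[0].text.trim();
  return clauses.every((c) => c.text.trim() === first);
}

const forty = [
  { lang: "zh-hant", text: "40" },
  { lang: "en", text: "40" },
  { lang: "fr", text: "40" },
  { lang: null, text: "40" },
];
console.log(isNullTranslation(forty));
```

When the test succeeds, the whole switch can be replaced by its default text element. When only some clauses match the default, nothing is removed, which avoids the language-preference surprises mentioned above.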

Mongolian dates[edit]

Apparently, the Intl package gives an unexpected result. When checking the date 5 May, I get the mismatch message.

  • Bad date (mn): 5 сарын 28 != тавдугаар сарын 28

The translated result is "fifth month 28" rather than "5 month 28".

  • new Intl.DateTimeFormat('mn', { month: 'long', day: 'numeric' }).format(date) → "тавдугаар сарын 28"
  • new Intl.DateTimeFormat('mn', { month: 'short', day: 'numeric' }).format(date) → "5-р сарын 28"
  • new Intl.DateTimeFormat('mn', { month: 'narrow', day: 'numeric' }).format(date) → "Vын 28"
  • new Intl.DateTimeFormat('mn', { month: 'numeric', day: 'numeric' }).format(date) → "V/28"

A better understanding of casual group formatting[edit]

The file uses casual groups to impose a common style on items that do not make their own sensible group. For example, a small set of Ukrainian villages may be grouped to impose a Ukrainian fill color on the group. The group of villages does not have a good reason to exist as a group. If the Russians gained control of one of the villages, then the group would need to be pierced to change its rendering.

Similarly, several text elements may be grouped to impose a common font selection, size, or color. That is presentation style rather than semantics, and it should be done with CSS.

Many of the formatting groups have been removed from cities, places, and dates. That puts the city circles at toplevel. The places are one-level down inside a group of places. The dates are one-level down inside a group of dates.

There are elements that should be grouped, and they should be grouped to the point of making a symbol. For example, contested cities are represented as a checkerboard. The checkerboard is four grouped path elements followed by an outside-the-group rectangle. All of those elements are semantically related, and they should be a symbol in the defs section. These groups have been converted to use a g or symbol inside the defs element. That step uncovered some issues. The origin for an SVG 1.1 symbol is always the upper-left corner. Furthermore, Inkscape has trouble cloning something that is inside the defs element.

Some groupings would make sense. The circles used for cities should be grouped with the names of the city. The symbols used for nuclear power plants should be grouped with the name of the power plant. The arrows showing the troop movements should be grouped with their dates. The date label boxes should be grouped with the date text.

SVG does not have the notion of associated labels. Consider a flag note that contains a date. The file does that by drawing a rectangle and then overlaying that rectangle with text. That takes two elements. Logically, the elements should be grouped so they move together. To change the text, one must penetrate the grouping.

Given the restrictions on the symbol origin, a possibility for dates would be

<g transform="translate(...)">
  <use xlink:href="#labelru" x="0" y="0" />
  <switch class="date" fill="#fff">
    <text systemLanguage="en">March 15</text>
    <text>March 15</text>
  </switch>
</g>

So all the group's children are at the origin. A symbol would need a negative offset. Limitations of the WMF renderer prevent using CSS to set the text color.

An alternative is to use a filter on the text. The filter would automatically size for the text, so it could be better than a fixed label size. Inkscape may start making copies....

Does Inkscape's notion of layers allow easy editing? That is, does it avoid the cumbersome ungrouping and regrouping of an ordinary g element?

What does it take to make an Inkscape layer? Presumably toplevel g elements with two inkscape: attributes and an id of the form layern (just like Inkscape identifies all its elements).[29]

<g
  inkscape:label="Layer 1"
  inkscape:groupmode="layer"
  id="layer1" />

To use layers, must all graphics elements be in toplevel layers? If not, what happens to graphics elements that are outside of the layers?

Furthermore, removing the two attributes or the id may not be wise.

Layers and objects can be locked with sodipodi:insensitive="true"; it is the presence of the attribute and not its value that matters. See https://wiki.inkscape.org/wiki/Inkscape-specific_XML_attributes and example at https://gist.github.com/hedefalk/5b428772f7deefc906a194f297371e9e . The latter file suggests the group id does not need a layer name.

See also https://wiki.inkscape.org/wiki/index.php/Inkscape_SVG_vs._plain_SVG .

Inkscape has the notion of symbols and clones, but I'm not sure that it expects to clone objects in the defs section. What sort of access does Inkscape give to the defs section?

Layers would do a better job of enforcing painting order. While the order was consistent, it is now confused. Some arrows are drawn before place names are rendered, and some arrows are drawn after.

Languages: LTR and RTL[edit]

The Hebrew and Arabic versions of this file swap the graphics on the map legend. There is also a cosmology illustration that does something similar. What is a good way to handle that problem? Two separate map legends?

Metadata for the file[edit]

Consider adding some metadata to the file. There was a recent comment that Inkscape prevented a user from editing someone else's SVG. Could that be a result of a CC-BY-ND license? Or describing the license with a similar requirement? How could SVG Translate figure that out? How would it be handled in a production environment?

A CC license should have a URL link to the license.

A CC-BY license should have the attribution names or an attribution URL.

In a derivative work, it is not enough to just give attribution to the creator of the derivative work. The license attribution requirements do not disappear for a derivative work. Many Commons files state incomplete attributions.

There is a question about crediting the source maps. It may be that a link to their pages on Commons is enough. Should the metadata include the license information for the images that the derivative work uses?

<metadata>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/terms/"
           xmlns:cc="http://creativecommons.org/ns#" >
    <cc:Work rdf:about="">
      <dc:creator rdf:resource="https://commons.wikimedia.org/wiki/User_talk:Viewsridge"/>
      <dc:source>
        <rdf:Bag>
          <rdf:li rdf:resource="https://commons.wikimedia.org/wiki/File:Russo-Ukraine_Conflict_(2014-2021).svg"/>
          <rdf:li rdf:resource="https://commons.wikimedia.org/wiki/File:Ukraine_adm_location_map.svg"/>
        </rdf:Bag>
      </dc:source>
      <dc:publisher rdf:resource="http://commons.wikimedia.org"/>
      <cc:license rdf:resource="https://creativecommons.org/licenses/by-sa/4.0/deed.en"/>
      <cc:attributionName>
        <rdf:Seq>
          <rdf:li rdf:resource="https://commons.wikimedia.org/wiki/User_talk:Viewsridge"/>
          <rdf:li rdf:resource="https://commons.wikimedia.org/wiki/User:Rr016"/>
          <rdf:li rdf:resource="https://commons.wikimedia.org/wiki/User:NordNordWest"/>
        </rdf:Seq>
      </cc:attributionName>
      <cc:attributionUrl rdf:resource="https://commons.wikimedia.org/wiki/File:2022_Russian_invasion_of_Ukraine.svg" />
    </cc:Work>
  </rdf:RDF>
</metadata>

SVG Translate[edit]

SVG Translate is an application that helps users translate SVG files with text elements into other languages.

History[edit]

Discuss history, Summer of Code, ... XXX, Jarry1250,[30] 2017 Wishlist,[31] WMF Community Tech.[32]

What it does[edit]

Diagrams need to be simple for it to work well. Single line text is best. Leave plenty of room because some languages use more text than others.

See File:BirdBeaksA.svg and {{Other versions/BirdBeaksA}}. Several language versions that just vary the text labels.

Syntactically, the text to translate must be a text element with 0 or more tspan elements. The tspan elements may not have any children. The tool expects only text; it does not expect switch elements containing group (g) elements or other graphics elements.

What it does not do[edit]

It does not handle complex text. It expects text to be lines of simple, unadorned text. Text that emphasizes some words with bold or italic styles is not handled. Similarly, it does not handle changing text colors or fonts. It does not handle subscripts or superscripts.

It does not handle adjusting the position of the text or the anchors. That is more of a graphics task than a translation task. SVG Translate is a text rather than a graphics application. In commercial translation settings, translators are not responsible for text positioning.

Vertical text. There are different conventions. In the USA, a book title on the book's spine is rotated +90°. In Europe, the title on the spine is rotated -90°. In China, text is written vertically without rotating the characters; generally, English readers find such text difficult. Bizarre Chinese ambulance.

Numbers, quantities, currency, and dates. SVG should have better support. JavaScript has the Intl object, but Commons prohibits scripts. Different cultures use different punctuation and formatting for numbers: "1,000.00" (US) versus "1.000,00" (German). Quantities are numbers with units; the units may be spelled out (meters) or abbreviated with a symbol (m). Currency is even more problematic: should $10 stay as $10 or should it be converted to some other national currency (Marks, Pounds, or Euros)? Dates are represented in several ways; for example, English "May 12, 1944" becomes German "12. Mai 1944". HTML has <time datetime="1944-05-12">May 12, 1944</time>, but it does not imply processing; it is just a machine-readable time for the text content.

Hyphenation and word breaking. The semantics of &shy; is confused: hy­phen­a­tion.

Exotic CSS can fix some issues, but there must be support.

SVG Translate usage[edit]

JoKalliauer question about SVG Translate usage: talk page

Translate me![edit]

Consider a mechanism to solicit translations of popular or significant images.

Say an image were served as SVG. The image could have

<a href="http://svgtranslate.wmftools.org/File:image.svg">
  <text>{Translate icon}</text>
</a>

The translate icon could disappear if the file has the desired translations.

<a href="http://svgtranslate.wmftools.org/File:image.svg">
  <switch>
    <text systemLanguage="en"></text>
    <text>{Translate icon}</text>
  </switch>
</a>

Maybe there is a game to play with the systemLanguage attribute.

<a href="http://svgtranslate.wmftools.org/File:image.svg">
  <text><tspan systemLanguage="es">{translation requested}</tspan><tspan systemLanguage="ru">{translation requested}</tspan></text>
</a>

Good translation targets[edit]

There are English SVG files that are already included on many Wikipedias. Running SVG Translate on these files would make the files more accessible.

SVG Translate simple text requirement[edit]

SVG Translate expects lines of simple text. Font shifts, subscripts, superscripts (i.e., nested tspan elements) are not allowed. Many images have such text.

  • Refractive index nd (λ = 587.6 nm)
  • Abbe number V

Wikidata for

I cannot create this diagram with a Wikidata query that finds instances of optical glass (such as those in the Schott catalog).

SVG Translate Issues[edit]

clean code.

The resulting file has lots of identifiers.

SVG Translate does not obey translate="no".

The default clause problem[edit]

The tool does not work well with language preferences. Adding a German translation to

<text>Hello, world</text>

produces something similar to

<switch>
  <text systemLanguage="de">Hallo Welt.</text>
  <text>Hello, world.</text>
</switch>

That works OK with WMF tools, but can be bizarre when displayed in a browser. For example, one can set a browser to prefer English but accept German. In that circumstance, the English language text should be displayed, but the SVG above will display German. The SVG agent does not know the default is English. The translation should be

<switch>
  <text systemLanguage="en">Hello, world.</text>
  <text systemLanguage="de">Hallo Welt.</text>
  <text>Hello, world.</text>
</switch>

The fix is to copy the default clause, add systemLanguage="en", and then insert the copy into the switch element with the DOM method insertBefore(newNode, referenceNode). The DOM method cloneNode(deep) copies the id attribute, so the identifiers on the copy must be changed or removed.
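A minimal sketch of that fix, written in Python with xml.dom.minidom rather than the PHP DOM that SVG Translate uses. The function names are hypothetical, and the sketch assumes single-valued systemLanguage attributes and a switch whose clauses are text elements.

```python
from xml.dom import minidom

SVG_NS = "http://www.w3.org/2000/svg"

def strip_ids(node):
    """Remove id attributes from a cloned subtree so identifiers stay unique."""
    if node.nodeType != node.ELEMENT_NODE:
        return
    if node.hasAttribute("id"):
        node.removeAttribute("id")
    for child in node.childNodes:
        strip_ids(child)

def add_explicit_default_language(switch, lang="en"):
    """Copy the default text clause of a switch and label the copy with lang."""
    clauses = [n for n in switch.childNodes
               if n.nodeType == n.ELEMENT_NODE and n.localName == "text"]
    # Nothing to do if a clause for lang already exists.
    if any(c.getAttribute("systemLanguage").lower() == lang for c in clauses):
        return None
    # The default clause is the first text element without systemLanguage.
    default = next((c for c in clauses if not c.hasAttribute("systemLanguage")), None)
    if default is None:
        return None
    copy = default.cloneNode(True)   # deep copy; this also copies id attributes
    strip_ids(copy)
    copy.setAttribute("systemLanguage", lang)
    switch.insertBefore(copy, clauses[0])
    return copy
```

Placing the copy first is a choice made for the sketch; any position ahead of the default clause would let an agent that prefers English pick the English text.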

Autocomplete[edit]

SVG Translate input boxes get filled in with previous inputs even if the file changed. Turn off autocomplete?

SVG Translate should handle textPath[edit]

The Gibraltar map has a textPath element.

Treat textPath as equivalent to tspan.

IIRC, mentioned in some Phabricator issue.

Possibly add another test to reject textPath similar to this test:

        if (0 !== $this->document->getElementsByTagName('tref')->length) {
            // Tref tags not (yet) supported
            $this->logFileProblem('File {file} has <tref> tags');
            return false;
        }

SVG Translate descends too deeply[edit]

SVG Translate does not recognize planar translations. If it sees a switch element with systemLanguage clauses, then it should not process that subtree.

<switch>
  <g systemLanguage="tlh">
    <text>Klingon text cannot be translated to any other language</text>
  </g>
  <g>
    <text>Default text can be translated, but it will be ugly</text>
  </g>
</switch>

SVG Translate does not walk the tree, so it does not notice the problem. Instead, it fetches elements no matter their location in the tree with operations such as

        $texts = $this->document->getElementsByTagName('text');

If the elements have an ancestor with a systemLanguage attribute, then it should not process the element; the element's ability to be translated has been foreclosed already.
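The ancestor test could be sketched as follows. This is Python with xml.dom.minidom, not SVG Translate's PHP, and the function names are hypothetical. The sketch also treats text nested below a switch clause (such as text inside a g element) as foreclosed, which is my reading of the "do not process that subtree" rule above.

```python
from xml.dom import minidom

SVG_NS = "http://www.w3.org/2000/svg"

def is_foreclosed(text):
    """Return True if an ancestor already decides the language of this text."""
    node = text.parentNode
    direct_parent = True
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        # An ancestor with systemLanguage has already chosen the language.
        if node.hasAttribute("systemLanguage"):
            return True
        # A switch further up than the direct parent means the text sits
        # inside a clause such as <g>, so the subtree is left alone.
        if (not direct_parent and node.namespaceURI == SVG_NS
                and node.localName == "switch"):
            return True
        direct_parent = False
        node = node.parentNode
    return False

def translatable_texts(document):
    """Collect the SVG text elements that are still open to translation."""
    texts = document.getElementsByTagNameNS(SVG_NS, "text")
    return [t for t in texts if not is_foreclosed(t)]
```

A text element that is a direct child of a switch, or that has no switch ancestor at all, still passes the filter.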

Where did I write about this before?

Does SVG Translate multiply text? No.[edit]

To simplify matters with language matching, SVG Translate explodes a systemLanguage attribute with a language list. A single element with three langtags is morphed into three elements with a single langtag.

SVG allows elements with systemLanguage attributes anywhere; they need not be inside a switch.

Check that the combination of the two operations is safe or at least harmless. Maybe it is always harmless? If text is wrapped in a switch before being copied, then it should be harmless?

The fear is isolated text with systemLanguage="en, en-US, en-GB" gets duplicated three times.

Not a problem. The explosion is at line 370, and it is done while processing the children of a switch.

SVG Translate switch processing[edit]

The second line is a switch with annotation elements title, desc, and metadata.

SVG permits more than just text elements within a switch. In particular, title, desc, metadata, and comments should not confuse processing. Less likely but still permitted are animation elements.

Does the sorting of hyphenated langtags come into play? The title element should be first for many SVG agents.

Run SVG Translate:

Looks like SVG Translate works on line 2.

What is going on? The method makeTranslationReady() (line 136)

  • does a bunch of stuff not relevant here
  • loops through all the switch tags at line 329.
  • looks at the childNodes of the switch (it could just look at elements)
  • returns false if the childnode is a non-empty #text node.
  • returns false if the childnode is not an element. So a PI or comment should return false.
  • returns false if the childnode is not a text element.
  • examines systemLanguage
  • explodes if systemLanguage is multivalued.

I understood that makeTranslationReady() returning false nixes the translation of the whole file.

Maybe not. Look at __construct(). The last line is the call to makeTranslationReady(), but it does not check the result. Looks like somebody does not check isTranslationReady().

I should split the test file into three files. KISS.

Check switch processing[edit]

IETF lowercase at 293, but only to text elements.

At line 324, all the switch elements are grabbed. (No namespace check.)

Then all child nodes are examined at line 329.

(DOMElement https://www.php.net/manual/en/class.domelement.php does not have $children, so code must use $childNodes.)

Non-empty text nodes cause a problem at line 337; return false.

Non-element nodes cause a problem at line 344; return false. Does this mean a comment node returns false?

If the child is not a text element, it fails at line 349. That should kill title, desc, metadata, and animation elements, but I thought they were getting through....

Language split is at 356. It only happens inside a switch. That should answer the isolated systemLanguage element question.

An explosion will increase the number of text elements in the live list, so some will be missed.

Identifiers take up a lot of text[edit]

Here is a clause from a switch translation. Compare it with and without identifiers:

<text id="trsvg1077-9-1-8-zh-tw" systemLanguage="zh-tw"><tspan id="trsvg880-0-6-4-zh-tw">5/10</tspan></text>
<text systemLanguage="zh-tw"><tspan>5/10</tspan></text>

DOM: tags and namespace[edit]

file that uses arbitrary prefix for the SVG namespace.

The DOM distinguishes

  • nodeName
  • localName
  • namespaceURI

A nodeName may or may not have a prefix.

The tag name has some interesting capitalization issues for HTML, but that is not relevant here. In the case of just HTML, there is only one namespace, so the name ambiguity is not as important.

The specification says that getElementsByTagName() searches for a qualified name. https://developer.mozilla.org/en-US/docs/Web/API/Element/tagName says that the tag name for an element is the nodeName. So what does a DOM implementation return? In an SVG file, if I ask for "text", do I get all the nodes in the default namespace? That is, those names do not need prefixes. Do I get the same nodes if I ask for "svg:text"? Do some implementations simply provide all the elements with the localName and ignore the namespace?

In particular, I think qualified names can vary depending on the context. It is possible to set the default namespace for a subtree.

The method getElementsByTagNameNS() is far clearer. It looks for elements with the localName and in the namespaceURI. An implementation is not going to mess that up. It also does not depend on which prefixes are active.
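A small experiment shows the difference between the two lookups. It uses Python's xml.dom.minidom as a stand-in DOM; other implementations (including the PHP DOM) may answer the unqualified lookup differently, which is exactly the ambiguity. The sample document and its foo prefix are made up for the test.

```python
from xml.dom import minidom

SVG_NS = "http://www.w3.org/2000/svg"

# An SVG document that binds an arbitrary prefix to the SVG namespace and
# also contains a "text" element in a different namespace.
SAMPLE = """<foo:svg xmlns:foo="http://www.w3.org/2000/svg"
                     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <foo:text>SVG text</foo:text>
  <rdf:text>not SVG text</rdf:text>
</foo:svg>"""

doc = minidom.parseString(SAMPLE)

# Lookup by tag name: the result depends on how the implementation
# interprets "tag name" (qualified name versus local name).
by_tag_name = doc.getElementsByTagName("text")

# Lookup by namespace URI and local name: independent of the prefix in use.
by_namespace = doc.getElementsByTagNameNS(SVG_NS, "text")

for element in by_namespace:
    print(element.tagName, element.namespaceURI)
```

The namespace-aware lookup selects only the SVG text element regardless of which prefix the file uses, so it is the one to rely on.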

I expected this to fail. It may work even if it fails.

I added an rdf:text element. SVG Translate wrapped the .textContent with a tspan element in the SVG namespace, but it did not offer the text to be translated (the elements never got id attributes).

Looking at the code, the rdf:text element was found at line 201:

        // Strip empty tspans, texts, fill $idsInUse
        $idsInUse = [ 0 ];
        $translatableNodes = [];
        $tspans = $this->document->getElementsByTagName('tspan');
        $texts = $this->document->getElementsByTagName('text');

So the PHP DOM implementation looks for the localName.

The source for PHP cites the spec, which only claims the search is by "tag name" but does not define that term. The namespace version is explicit about using localName. So using the non-namespace version for an XML document is not a good idea.

It also becomes clear that the identifier assignment believes that trsvg[0-9]+ identifiers will only appear on text and tspan elements. Also, the code does not need to keep the entire set of attributes around.

Is the svg prefix required?[edit]

Phab:T316741 (Allow svg namespace prefixes other than 'svg') was opened.

I see a lot of tests that are something like ('svg:element' === $el->nodeName). Does that mean the code depends on files using the svg: prefix? Does using xmlns:foo="http://www.w3.org/2000/svg" and <foo:text> confuse SVG Translate?

Is the problem with the PHP DOM or with the code?

Should tests such as

                if ('tspan' === $node->nodeName || 'svg:tspan' === $node->nodeName) {
                    continue;
                }

be written

                if ('tspan' === $node->localName && 'http://www.w3.org/2000/svg' === $node->namespaceURI) {
                    continue;
                }

If a file does not use the SVG namespace, then SVG Translate may refuse to translate it.

Programming idioms from when the PHP DOM did not do namespaces?

Reorder text elements[edit]

Sorting the langtags so the hyphenated langtags come first.

Is it needed?

Does it compromise SVG 1.1 semantics? If I ask for en-GB, then it should not match the en clause.

If I promote en-RAREDIALECT to first position in the clauses, then asking for generic en gives me the rare dialect.

The only good way around this is allowReorder.
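The concern can be made concrete with a sketch of switch evaluation. It follows my reading of the SVG 1.1 systemLanguage rule: a clause matches when a user preference equals one of its langtags or equals a prefix of one that is followed by a hyphen. The function names are hypothetical, and comparing in lowercase is an assumption of the sketch.

```python
def clause_matches(system_language, user_langs):
    """Apply the SVG 1.1 style test to one systemLanguage value."""
    tags = [t.strip().lower() for t in system_language.split(",") if t.strip()]
    for pref in (u.lower() for u in user_langs):
        for tag in tags:
            # Exact match, or the preference is a prefix ending at a hyphen.
            if tag == pref or tag.startswith(pref + "-"):
                return True
    return False

def select_clause(clauses, user_langs):
    """Return the index of the first clause that a switch would render.

    clauses lists systemLanguage values in document order; None stands
    for a default clause without the attribute.
    """
    for index, system_language in enumerate(clauses):
        if system_language is None or clause_matches(system_language, user_langs):
            return index
    return None

# Promoting a dialect ahead of the generic clause changes what "en" selects.
print(select_clause(["en-raredialect", "en", None], ["en"]))
# A request for en-GB does not match a plain "en" clause under this rule.
print(select_clause(["en", None], ["en-GB"]))
```

Under this rule the first matching clause wins, so the order of the clauses is part of the file's meaning and reordering them can change what a user sees.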

With the underscore, it would work around the old librsvg langtag matching bug.

/**
 * Reorder text elements within the document so that all sublocales (i.e. systemLanguage values
 * containing a hyphen or underscore, e.g. de_CH) are moved to the beginning of the switch
 * element, and all fallback elements are moved to the end.
 */
protected function reorderTexts(): void

The underscore issue[edit]

The third line is a switch with en, zh_Hans, zh-Hant, and default clauses.

SVG Translate uses parochial systemLanguage identifiers instead of IETF langtags. This bug arises as a workaround to langtag processing bugs in librsvg and langtag passing methods in MediaWiki. First, the C-language version of librsvg that WMF uses only matches langtags to the first hyphen. It treats zh-Hans and zh-Hant as equivalent. Second, MediaWiki passes langtags in the LANG environment variable; that variable expects a locale string rather than an IETF langtag. Third, librsvg uses the LANG environment variable as a langtag. SVG Translate exploits those problems by using zh_HANT rather than the correct zh-Hant IETF langtag. It makes the display work on (broken) WMF servers, but the SVG files do not work on other SVG user agents.

The underscore issue is even more twisted. Multiple SVG Translate invocations are adding multiple identical clauses with non-unique identifiers:

<switch id="switch2174" transform="translate(1853.7,532.54)" class="place" font-size="5.34px">
  <text id="text3001-zh-tw" systemLanguage="zh_TW"><tspan id="trsvg142-zh-tw">基夫沙里夫卡</tspan></text>
  <text id="text3001-zh-hant" systemLanguage="zh_HANT"><tspan id="trsvg142-zh-hant">基夫沙里夫卡</tspan></text>
  <text id="text3001-zh-tw" systemLanguage="zh_TW"><tspan id="trsvg142-zh-tw">基夫沙里夫卡</tspan></text>
  <text id="text3001-zh-hant" systemLanguage="zh_HANT"><tspan id="trsvg142-zh-hant">基夫沙里夫卡</tspan></text>
  <text id="text6217" systemLanguage="zh_TW"><tspan id="tspan6215">基夫沙里夫卡</tspan></text>
  <text id="text6221" systemLanguage="zh_HANT"><tspan id="tspan6219">基夫沙里夫卡</tspan></text>
  <text id="text6225" systemLanguage="zh_TW"><tspan id="tspan6223">基夫沙里夫卡</tspan></text>
  <text id="text6229" systemLanguage="zh_HANT"><tspan id="tspan6227">基夫沙里夫卡</tspan></text>
  <text id="text6233" systemLanguage="zh_TW"><tspan id="tspan6231">基夫沙里夫卡</tspan></text>
  <text id="text6237" systemLanguage="zh_HANT"><tspan id="tspan6235">基夫沙里夫卡</tspan></text>
  <text id="text6241" systemLanguage="zh_TW"><tspan id="tspan6239">基夫沙里夫卡</tspan></text>
  <text id="text6245" systemLanguage="zh_HANT"><tspan id="tspan6243">基夫沙里夫卡</tspan></text>
  <text id="text6249" systemLanguage="zh_HANT"><tspan id="tspan6247">基夫沙里夫卡</tspan></text>
     
  <text systemLanguage="en" id="trsvg1842"><tspan id="trsvg1218">Kivsharivka</tspan></text>
  <text id="text3001-it" systemLanguage="it"><tspan id="trsvg142-it">Kovšarovka</tspan></text>
  <text id="text3001-fr" systemLanguage="fr"><tspan id="trsvg142-fr">Kivcharivka</tspan></text>
  <text id="text3001-el" systemLanguage="el"><tspan id="trsvg142-el">Κιβσαρίφσκα</tspan></text>
  <text id="text3001-ru" systemLanguage="ru"><tspan id="trsvg142-ru">Ковшаровка</tspan></text>
  <text id="text3001-uk" systemLanguage="uk"><tspan id="trsvg142-uk">Ківшарівка</tspan></text>
  <text id="text3001-ka" systemLanguage="ka"><tspan id="trsvg142-ka">კივშარივკა</tspan></text>
  <text id="text3001-lt" systemLanguage="lt"><tspan id="trsvg142-lt">Kivšarivka</tspan></text>
  <text id="text3001-ca" systemLanguage="ca"><tspan id="trsvg142-ca">Kivxàrivka</tspan></text>
  <text id="text3001"><tspan id="trsvg142">Kivsharivka</tspan></text>
</switch>

The relevant SVG Translate code:

OK, it looks like a problem with inadvertently distinguishing equivalent langtags. The code mistakenly distinguishes zh_hant, zh_Hant, and zh_HANT.

I believe $language will be lowercase because

                $langCode = str_replace('_', '-', strtolower($lang));

So the code reasonably canonicalizes all langtags to lower case and converts non-standard underscore langtags to hyphen langtags.

Consequently, the code below will work for all-lowercase langtags (such as the usual en or de) that are present in the SVG file, but it will never match langtags with an uppercase character (such as zh-Hant) because they have a capital letter. Furthermore, it will never match the converted, non-standard, langtags (such as zh_HANT) even with a case-insensitive match because the underscore was changed to a hyphen. Also notice that if two or more text elements match, then nothing will be changed, and no error will be logged.

                // Put text tag into document
                $path = 'fallback' === $language ?
                    "svg:text[not(@systemLanguage)]|text[not(@systemLanguage)]" :
                    "svg:text[@systemLanguage='$language']|text[@systemLanguage='$language']";
                $existing = $this->xpath->query($path, $switch);
                if (1 == $existing->length) {
                    // Only one matching text node, replace if different
                    if ($this->nodeToArray($newTextTag) === $this->nodeToArray($existing->item(0))) {
                        continue;
                    }
                    $switch->replaceChild($newTextTag, $existing->item(0));
                } elseif (0 == $existing->length) {
                    // No matching text node for this language, so we'll create one
                    $switch->appendChild($newTextTag);
                }

OK, tried a file with systemLanguage="FR", added a German translation, and the French clause was duplicated. SVG Translate produced:

    <switch transform="translate(20, 60)">
      <title>This title should display.</title>
      <desc>Test that title, desc, and metadata process correctly.</desc>
      <metadata/>
      <text systemLanguage="tlh" id="trsvg13"><tspan id="trsvg2">Klingon</tspan></text>
      <text systemLanguage="FR" id="trsvg14"><tspan id="trsvg3">French</tspan></text>
      <text systemLanguage="en" id="trsvg15"><tspan id="trsvg4">English</tspan></text>
      <text id="trsvg14" systemLanguage="fr"><tspan id="trsvg3">French</tspan></text>
      <text id="trsvg16-de" systemLanguage="de"><tspan id="trsvg5-de">German 2</tspan></text>
      <text id="trsvg16"><tspan id="trsvg5">Default</tspan></text>
    </switch>

The basic issue above is the case-sensitive comparison in text[@systemLanguage='$language']. XPath 1.0 does not have a case-insensitive comparison. A possible fix is to modify the XPath filter to fix both the lowercase and the underscore issues:

text[translate(@systemLanguage, "ABCDEFGHIJKLMNOPQRSTUVWXYZ_", "abcdefghijklmnopqrstuvwxyz-")='$language']
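The same normalization can be sketched outside XPath. The Python below (xml.dom.minidom, hypothetical function names) lowercases each clause's langtag and turns underscores into hyphens before comparing, which is what the translate() call does inside the filter.

```python
from xml.dom import minidom

def normalize_langtag(tag):
    """Lowercase a langtag and turn underscores into hyphens."""
    return tag.strip().lower().replace("_", "-")

def matching_clauses(switch, language):
    """Return text children of switch whose normalized langtag equals language."""
    wanted = normalize_langtag(language)
    return [
        child for child in switch.childNodes
        if child.nodeType == child.ELEMENT_NODE
        and child.localName == "text"
        and child.hasAttribute("systemLanguage")
        and normalize_langtag(child.getAttribute("systemLanguage")) == wanted
    ]
```

With this comparison, a clause written as systemLanguage="FR" or systemLanguage="zh_HANT" is found when the tool looks for fr or zh-hant, so the existing clause would be replaced instead of duplicated.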

There are still some questions about the code. How does systemLanguage="zh_HANT" enter? Are translation selections edited in the UI? And how does the file's systemLanguage="zh-Hant" turn into systemLanguage="zh_HANT"?

Looks like this routine will change zh-hant into zh_HANT:

    /**
     * @param string $langCode
     * @return string
     */
    private static function langCodeToOs(string $langCode): string
    {
        if (false === strpos($langCode, '-')) {
            // No territory specified, so no change to make (fr => fr)
            return $langCode;
        }
        [ $prefix, $suffix ] = explode('-', $langCode, 2);
        return $prefix.'_'.strtoupper($suffix);
    }
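A line-for-line Python port of the routine makes the effect easy to try. The port is a sketch for experimenting, not the tool's code, and the function name is my own.

```python
def lang_code_to_os(lang_code):
    """Python port of langCodeToOs(): 'zh-hant' becomes 'zh_HANT'."""
    if "-" not in lang_code:
        # No territory specified, so no change to make (fr => fr)
        return lang_code
    # Split at the first hyphen only, like explode('-', $langCode, 2).
    prefix, suffix = lang_code.split("-", 1)
    return prefix + "_" + suffix.upper()

for code in ("fr", "zh-hant", "zh-tw"):
    print(code, "=>", lang_code_to_os(code))
```

The value that is written back differs from the normalized langtag in both case and separator, so a later exact string comparison against the lowercase hyphenated form cannot find the clause that was just written.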

The number of tspan elements must match[edit]

Here's a Phabricator comment (Phab:T216283#5106699) discussing a mismatched number of tspan elements.

Somewhere there was a more specific problem where SVG Translate drops the tspan. Searching for it, but have not found it.

Look in the code to find this requirement.

IIRC, some code deletes empty text and tspan.

In addition, I think there was a Phabricator item where an empty tspan was removed, and that caused subsequent translation problems.

The trouble may be at line 254. All the translatable nodes (text and tspan elements) are scanned. If the element has no child nodes (which would be true for an empty tspan), then that element would be removed. Consider a two-line translation where some lines may be empty. This code will remove the empty line. When a translation on that switch is attempted, the number of tspan elements will not match.

            if (!$translatableNode->hasChildNodes()) {
                // Empty tag, will just confuse translators if we leave it in
                $translatableNode->parentNode->removeChild($translatableNode);
            }
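The suspected effect can be reproduced with a small experiment. The Python below imitates the removal with xml.dom.minidom; the sample file and function names are made up, and whether SVG Translate hits exactly this path still needs checking against the code.

```python
from xml.dom import minidom

SVG_NS = "http://www.w3.org/2000/svg"

# A switch where the default clause needs one line and the German clause two.
SAMPLE = ('<svg xmlns="http://www.w3.org/2000/svg"><switch>'
          '<text systemLanguage="de"><tspan>Zeile 1</tspan><tspan>Zeile 2</tspan></text>'
          '<text><tspan>Only line</tspan><tspan></tspan></text>'
          '</switch></svg>')

def remove_empty_tspans(document):
    """Imitate the cleanup: drop tspan elements that have no child nodes."""
    for tspan in list(document.getElementsByTagNameNS(SVG_NS, "tspan")):
        if not tspan.hasChildNodes():
            tspan.parentNode.removeChild(tspan)

def tspan_counts(document):
    """Number of tspan children per text element, in document order."""
    return [len(text.getElementsByTagNameNS(SVG_NS, "tspan"))
            for text in document.getElementsByTagNameNS(SVG_NS, "text")]

doc = minidom.parseString(SAMPLE)
before = tspan_counts(doc)
remove_empty_tspans(doc)
after = tspan_counts(doc)
print(before, after)
```

After the cleanup the two clauses no longer have the same number of tspan elements, which is the mismatch the later translation step complains about.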

The $ issue[edit]

FIXED https://github.com/wikimedia/svgtranslate/pull/620

SVG Translate is supposed to refuse to translate text that has a dollar sign followed by a number (e.g., "$17").

I suspect the code has a bug because it does not escape the $ in the preg_match on line 288.

// Text strings like $1, $2 will cause problems later because
// self::replaceIndicesRecursive() will try to replace them
// with (non-existent) child nodes.
if (preg_match('/$[0-9]/', $text->textContent)) {
    $this->logFileProblem('File {file} has text with $-numbers');
    return false;
}
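The suspicion is easy to test. In the sketch below Python's re module stands in for PHP's preg_match; in both engines an unescaped $ is an end-of-subject anchor, so it cannot be followed by a digit. The helper name is hypothetical.

```python
import re

# Unescaped: "$" anchors at the end of the text, so a digit can never follow.
UNESCAPED = re.compile(r"$[0-9]")

# Escaped: a literal dollar sign followed by a digit.
ESCAPED = re.compile(r"\$[0-9]")

def has_dollar_number(text):
    """Return True for text such as '$17' that would confuse index replacement."""
    return ESCAPED.search(text) is not None

for sample in ("Price: $17", "Price: 17$", "No money"):
    print(repr(sample), UNESCAPED.search(sample) is not None, has_dollar_number(sample))
```

The unescaped pattern never reports a problem, so text with $-numbers would slip through the check; escaping the dollar sign restores the intended test.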

The issue was mentioned at Phab:T271000#8134211. Bad language code: zh_Hans should be zh-Hans.

The style element must have trailing characters[edit]

Phab:T271595 SVG translate tool replaces all fields with "$1" (style element needs at least one trailing character)

What's left over after a failed media rule parse?

  • @media printer { text { fill: black; } }

The regex match does not balance braces, so the second right brace is collected as part of the ruleset. The trailing character is a right brace.

Hoisting style attributes[edit]

At https://github.com/wikimedia/svgtranslate/blob/master/src/Model/Svg/SvgFile.php I hope this does not do what I think it does:

            // Non-translatable style elements on texts get lost, so bump up to switch
            if ($text->hasAttribute('style')) {
                $style = $text->getAttribute('style');
                $text->parentNode->setAttribute('style', $style);
            }

Splitting langtags duplicates identifiers[edit]

Splitting systemLanguage langtags duplicates id attributes:

                foreach ($realLangs as $realLang) {
                    // Although the SVG spec supports multi-language text tags (e.g. "en,fr,de")
                    // these are a really poor idea since (a) they are confusing to read and (b) the
                    // desired translations could diverge at any point. So get rid.
                    $singleLanguageNode = $sibling->cloneNode(true);
                    $singleLanguageNode->setAttribute('systemLanguage', $realLang);

                    // @todo: Should also go into tspans and change their ids, too.
                    // $prefix = implode( '-', explode( '-', $singleLanguageNode->getAttribute( 'id' ), -1 ) );
                    // $singleLanguageNode->setAttribute( 'id', "$prefix-$realLang" );

                    // Add in new element
                    $switch->appendChild($singleLanguageNode);
                }
                $switch->removeChild($sibling);
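One way to finish the @todo is sketched below in Python with xml.dom.minidom. The function names are hypothetical, and the id scheme (original id plus a lowercase langtag suffix) is an assumption modeled on the trsvg…-zh-tw identifiers seen elsewhere in these notes.

```python
from xml.dom import minidom

def suffix_ids(node, lang):
    """Append -lang to every id in the subtree so the clones stay unique."""
    if node.nodeType != node.ELEMENT_NODE:
        return
    if node.hasAttribute("id"):
        node.setAttribute("id", node.getAttribute("id") + "-" + lang)
    for child in node.childNodes:
        suffix_ids(child, lang)

def split_languages(switch, clause):
    """Replace a multi-language clause with one clone per langtag."""
    langs = [t.strip() for t in clause.getAttribute("systemLanguage").split(",")
             if t.strip()]
    if len(langs) < 2:
        return  # single-valued or missing attribute: nothing to split
    for lang in langs:
        copy = clause.cloneNode(True)      # deep copy, ids included
        copy.setAttribute("systemLanguage", lang)
        suffix_ids(copy, lang.lower())     # also reaches the tspan children
        switch.insertBefore(copy, clause)  # keep the clause's position
    switch.removeChild(clause)
```

Unlike the excerpt, which appends the clones at the end of the switch, the sketch inserts them where the original clause was so that the split clauses do not land after a default clause.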

PHP static analyzer[edit]

SVG Translate already has some phpdoc (JavaDoc-style) comments, but the declarations have some issues. Phab:T316310 seeks to add the PHP static analyzer / linter Phan. A proposed patch adds Phan to the build process.

See the https://github.com/phan/phan/wiki/Tutorial-for-Analyzing-a-Large-Sloppy-Code-Base tutorial. A baseline run on SVG Translate produced

// PhanUndeclaredMethod : 10+ occurrences
// PhanTypeMismatchArgumentNullableInternal : 7 occurrences
// PhanCommentDuplicateParam : 3 occurrences
// PhanDeprecatedFunction : 3 occurrences
// PhanTypeMismatchArgumentInternalProbablyReal : 2 occurrences
// PhanTypeMismatchDeclaredParamNullable : 2 occurrences
// PhanTypeMismatchPropertyProbablyReal : 2 occurrences
// PhanAccessMethodInternal : 1 occurrence
// PhanTypeMismatchArgumentNullable : 1 occurrence
// PhanTypeMismatchArgumentProbablyReal : 1 occurrence
// PhanTypePossiblyInvalidDimOffset : 1 occurrence
// PhanTypeSuspiciousNonTraversableForeach : 1 occurrence

That does not look outrageous. Some look like expected trivial fixes. Maybe a higher analysis level will complain about code not looking at the returned value. I have not seen the plan of starting with weak checking and increasing the level later. The usual software engineering advice is to set the warning level at maximum and just deal with the flood. The ratchet-up-later plan sounds interesting, but somebody needs to remember to do the ratcheting. (Ah, Eric just wanting to ignore 200+ warnings that had to be wrong.)

Phan has not been around a long time (in development in 2017). What other PHP linters are there?

Mere mortals cannot view the log[edit]

See https://phabricator.wikimedia.org/T271000#8201384

So raise a creature feature request for a logging window.

I think I was bitten by silent errors. The code detected a problem, logged an error, stopped further processing in that method, but forged ahead with subsequent tasks that GIGO'd.

Phabricator tickets[edit]

I need to update some SVG Phabricator tickets and create some others.

T271000

Mentioned In
T319259: Check that document element is <svg> and in the right namespace
rGSVTd5fc9388e033: Allow tspan to be in svg namespace (#622)
T248252: SVG Translate: Skip unsupported text pattern and continue with the supported ones
rGSVTb8c23d973c36: Create (and catch) exceptions for all existing error states
T316741: Allow svg namespace prefixes other than 'svg'
rGSVT91689a31bb6b: Fix regex to find $1, $2, etc.
rGSVT05d2724e6765: Normalize lang codes when reading SVGs and writing translations
T261192: Rendering multilingual (systemLanguage) SVG files fails locally after upgrading librsvg from 2.40.21 to 2.44.10
T40010: RFC: Re-evaluate librsvg as SVG renderer on Wikimedia wikis
T275263: Translation dropdown not available on File: page after translating a specific SVG file on Commons via svgtranslate tool
T271595: SVG translate tool replaces all fields with "$1" (style element needs at least one trailing character)

Mentioned Here
T271595: SVG translate tool replaces all fields with "$1" (style element needs at least one trailing character)
T221382: [BUG] Some CSS selectors break translation input
T319259: Check that document element is <svg> and in the right namespace
T316310: Add Phan to SVG Translate CI
T316741: Allow svg namespace prefixes other than 'svg'
T221453: Add "newer" open fonts
T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive
T261192: Rendering multilingual (systemLanguage) SVG files fails locally after upgrading librsvg from 2.40.21 to 2.44.10
T40010: RFC: Re-evaluate librsvg as SVG renderer on Wikimedia wikis
T154237: SVG image wikisyntax can't use "lang=zh-hant"
T265549: Update librsvg to > 2.44.10

Hoisting attributes[edit]

An SVG image with attribute hoisting opportunities

SVG Translate produces verbose output. One reason is that it copies all the attributes on the original text element to the switch element clauses. Consequently, we see something like

<switch>
  <text x="100" y="200" font-family="Arial" font-size="10" systemLanguage="en"><tspan>Hello, world.</tspan></text>
  <text x="100" y="200" font-family="Arial" font-size="10" systemLanguage="de"><tspan>Hallo Welt.</tspan></text>
  <text x="100" y="200" font-family="Arial" font-size="10"><tspan>Hello, world.</tspan></text>
</switch>

A more concise version would be

<switch transform="translate(100, 200)" font-family="Arial" font-size="10">
  <text systemLanguage="en"><tspan>Hello, world.</tspan></text>
  <text systemLanguage="de"><tspan>Hallo Welt.</tspan></text>
  <text><tspan>Hello, world.</tspan></text>
</switch>

There can be a list of attributes to promote, for example font-family, font-size, font-weight, and font-style. If a switch element has only text elements and each of those elements has the same value for attribute z, then delete z from each child and move it to the switch element. The replacement may override a value on the switch element.
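The promotion rule could be sketched like this. It is Python with xml.dom.minidom, the function name and the HOISTABLE list are hypothetical, and the sketch ignores the class, style, and transform complications.

```python
from xml.dom import minidom

# Attributes that are candidates for promotion to the switch element.
HOISTABLE = ("font-family", "font-size", "font-weight", "font-style")

def hoist_common_attributes(switch, names=HOISTABLE):
    """Move attributes shared by every text clause up to the switch."""
    children = [n for n in switch.childNodes if n.nodeType == n.ELEMENT_NODE]
    # Only handle switches whose element children are all text elements.
    if not children or any(c.localName != "text" for c in children):
        return
    for name in names:
        values = {c.getAttribute(name) if c.hasAttribute(name) else None
                  for c in children}
        # Hoist only when every clause carries the attribute with one value.
        if len(values) != 1 or None in values:
            continue
        switch.setAttribute(name, values.pop())  # may override the switch's value
        for c in children:
            c.removeAttribute(name)
```

Because presentation attributes inherit, the text clauses should render as before as long as no style attribute or CSS rule sets the same property.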

Class and Style confuse the issue[edit]

The class and style attributes complicate the replacement. The simple method would insist those attributes are not present on any element. There may still be a CSS selector that matches the switch or text elements. The priority of CSS rules is higher than attributes, so the only real trouble is the class attribute. Leave the class attribute on any element.

Note: SVG 1.1 does not require CSS or style attribute support.[33]

There are many ways to specify presentation properties, and those specifications may conflict. The conflicts are resolved by assigning priorities to the property specifications:[34]

  • Attributes (lowest priority)
  • CSS selectors have calculated priorities based on specificity and order
  • inline styles (may be overridden with CSS !important declarations) (highest priority)

The style priority makes hoisting difficult. It also means that inline styles are not (usually) overridden with CSS patterns.

Mozilla https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity says

Your global CSS file that sets visual aspects of your site globally may be overwritten by inline styles defined directly on individual elements. Both inline styles and !important are considered bad practice, but sometimes you need the latter to override the former.

Inkscape heavily uses inline styles. In a way, that makes interpretation easier: inline styles have the highest precedence, so it is the least confusing way to apply style information. It is unlikely to be overridden by other information. At the same time, it also means that converting inline style information to attributes may have unexpected results if a CSS selector applies contradicting information.

Prefer class[edit]

It makes sense to prefer class (or other selectors) over explicit style attributes. For example, say a class selector sets the font characteristics to certain values and the style attribute sets the same values. A class selector may apply to more elements, so it should be favored.

A quick and dirty way to play this game is to remove the style attribute and then call Window.getComputedStyle(). Any property in the style attribute whose value is already present in the computed style may be removed. Coding can be a bit tricky.

Being more direct is also difficult. One can access the stylesheets and the style attribute, but the CSSRule (Media rules make a multiverse), CSSStyleRule (does not parse selectors), and CSSStyleDeclaration do not have more mechanism. CSSStyleRule provides the selectorText, but the interface does not provide a priority list of which rules apply to an element. Parsing and interpreting selectors is a difficult task.

There is querySelector, so the inverted test may be done. Should check how well that method works. The method does not return the priority.

Pseudo selectors may be difficult to get right. For example, :active suggests the need to examine all possibilities.

Animation may also cause trouble.

Transformation rewrites

The transform rewrite is more complicated. A potential method chooses a suitable translation, appends it to the switch element's transform, and then adjusts the coordinates of all the children. For the text element, if the coordinate adjusts to zero, then remove the attribute. Do not remove the zero attributes of tspan elements because they start new text chunks. A transform attribute on the text or tspan elements would cause a lot of trouble. So hoist the transform attributes first and give up if they do not hoist.
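A minimal sketch of that method using Python's xml.etree.ElementTree follows. The function names are hypothetical, the caller is assumed to have already chosen the translation (dx, dy), each x or y attribute is assumed to hold a single number rather than a list, and only text and tspan descendants are adjusted.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def format_number(value):
    """Format a coordinate with up to 4 decimals and no trailing zeros."""
    text = f"{value:.4f}".rstrip("0").rstrip(".")
    return "0" if text in ("-0", "") else text


def hoist_translation(switch, dx, dy):
    """Append translate(dx,dy) to the switch and shift text coordinates back.

    Returns False (and changes nothing) if any text or tspan has its own
    transform attribute.
    """
    texts = [e for e in switch.iter()
             if e is not switch and local_name(e) in ("text", "tspan")]
    if any("transform" in e.attrib for e in texts):
        return False  # give up: transforms must be hoisted first

    for element in texts:
        is_text = local_name(element) == "text"
        for name, delta in (("x", dx), ("y", dy)):
            if name not in element.attrib and not is_text:
                continue  # a tspan without x/y continues the current chunk
            value = float(element.get(name, "0")) - delta
            if is_text and value == 0:
                element.attrib.pop(name, None)  # zero is the default for text
            else:
                element.set(name, format_number(value))  # tspan keeps "0"

    old = switch.get("transform", "")
    new = f"{old} translate({format_number(dx)},{format_number(dy)})"
    switch.set("transform", new.strip())
    return True
```

Because the translation is appended on the right of the existing transform, it applies directly to the child coordinates, which is why each coordinate is reduced by the same amount.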

Is this step a good idea? If the file is localized, all this information would be moved back into the text element. In addition, the styling may be accomplished with a class attribute and CSS, so it is not a heavy penalty for each clause. Moving styling information into a text element may be the better goal. The issue of hoisting position information may still be reasonable.

Multiline trick

Merge with tspan count issue.

It may be possible to do a multiline with vertical alignment trick. Instead of a one-line translation or a two-line translation, make it a three-line translation. For one-line translations, only one tspan is used; the other two are left empty. For a two-line translation, the one-line tspan is left empty.

<switch>
  <text>
    <tspan x="0" y="0">One-line translation</tspan>
    <tspan x="0" y="-20">Two-line translation, line 1</tspan>
    <tspan x="0" y="20">Two-line translation, line 2</tspan>
  </text>
</switch>

When will librsvg support the CSS lh unit?

WMF warts

SVG and Inkscape

Inkscape is an editor that will preserve a lot of SVG because Inkscape uses SVG as its internal representation. Unlike other graphics editors, Inkscape does not have a different native format.

Inkscape produces bizarre numbers. Single precision formatted as a double. Metric conversions.

Inkscape uses concrete bounding boxes. Test the following scenario: an SVG image has 4 circles, and each circle points to the same linearGradient element. The linear gradient uses the default gradientUnits="objectBoundingBox". I believe Inkscape will clone 4 linearGradient elements, and those elements will use gradientUnits="userSpaceOnUse". Test whether moving an object changes the coordinates or creates a new linear gradient.
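To see why concrete bounding boxes would force one gradient per circle, the arithmetic for converting an objectBoundingBox gradient vector to userSpaceOnUse can be sketched. The helper names are hypothetical, the default gradient vector (x1=0, y1=0, x2=1, y2=0) is assumed, and no gradientTransform is considered. This illustrates the conversion only; it is not Inkscape's code.

```python
def bbox_of_circle(cx, cy, r):
    """Return (x, y, width, height) of a circle's bounding box."""
    return (cx - r, cy - r, 2 * r, 2 * r)


def to_user_space(bbox, x1=0.0, y1=0.0, x2=1.0, y2=0.0):
    """Map a gradient vector given as bounding-box fractions to user units."""
    bx, by, width, height = bbox
    return {
        "x1": bx + x1 * width,
        "y1": by + y1 * height,
        "x2": bx + x2 * width,
        "y2": by + y2 * height,
    }


# Two circles sharing one objectBoundingBox gradient need different
# userSpaceOnUse coordinates, hence separate gradient elements.
left = to_user_space(bbox_of_circle(50, 50, 10))
right = to_user_space(bbox_of_circle(100, 50, 10))
```

The same arithmetic suggests what to look for in the test: after moving a circle, either the coordinates of its gradient change or a new gradient appears.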

Inkscape units

I worked on this topic somewhere else....

Notice the strange units in this header.

<svg
   width="1281.4634mm"
   height="246.60315mm"
   viewBox="0 0 5125.8537 986.41263"
   version="1.1"
   id="svg370"
   xml:space="preserve"
   inkscape:version="1.2 (dc2aedaf03, 2022-05-15)"
   sodipodi:docname="Chronologie_constitutions_françaises.svg"
   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
   xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
   xmlns:xlink="http://www.w3.org/1999/xlink"
   xmlns="http://www.w3.org/2000/svg"
   xmlns:svg="http://www.w3.org/2000/svg">

The viewBox x-zoom: 0.24999999512276 → 1/4 to a single-float epsilon
The viewBox y-zoom: 0.24999999239669 → 1/4 to a single-float epsilon
Convert the 5125.8537 pixels to inches: 53.394309375
Convert the 5125.8537 pixels to mm: 1356.215458125
Convert the width of 1281.4634 mm to inches: 50.45131496063
Convert the width of 1281.4634 mm to CSS pixels: 4843.3262362205
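The figures above can be reproduced with a few lines of arithmetic. This sketch assumes the CSS definitions of 96 px per inch and 25.4 mm per inch; the variable and function names are illustrative.

```python
MM_PER_INCH = 25.4
CSS_PX_PER_INCH = 96.0

# Values copied from the svg header above.
width_mm, height_mm = 1281.4634, 246.60315
viewbox_width, viewbox_height = 5125.8537, 986.41263

# Millimetres per viewBox unit; both come out very close to 1/4.
x_zoom = width_mm / viewbox_width
y_zoom = height_mm / viewbox_height


def px_to_inches(px):
    return px / CSS_PX_PER_INCH


def px_to_mm(px):
    return px / CSS_PX_PER_INCH * MM_PER_INCH


def mm_to_inches(mm):
    return mm / MM_PER_INCH


def mm_to_px(mm):
    return mm / MM_PER_INCH * CSS_PX_PER_INCH


# Relative deviation of each zoom from exactly 1/4, for comparison with
# the single-precision machine epsilon 2**-23.
x_error = abs(x_zoom / 0.25 - 1.0)
y_error = abs(y_zoom / 0.25 - 1.0)
SINGLE_EPSILON = 2.0 ** -23
```

Both errors fall below the single-precision epsilon, which fits the note that the zoom is 1/4 to within a single-float rounding. The declared width in mm, however, does not equal the viewBox width interpreted as CSS pixels.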

Accuracy

Some random issues about technical accuracy.

Refractometer

No clear schematic of how it works. Handheld and Abbe.

See w:Refractometer.

Refractometer

Refractometer

See Category:Refractometers

Motors

Compare the diagram of a 3-phase induction motor in File:Vierpolig-3stränge.svg with File:Asynchronous Motor.svg. Look at the flux lines.

Motor winding

3-phase 4-pole motor winding with 24 slots
Terminal | Phase | Slot | Slot | Connect to
1        | 1     | 1    | 6    | 12
6        | 2     | 3    | 8    | 2
8        | 3     | 5    | 10   | 4
4        | 1     | 7    | 12   | 6
9        | 2     | 9    | 14   | 20
5        | 3     | 11   | 16   | 22
7        | 1     | 13   | 18   | 24
12       | 2     | 15   | 20   | 14
2        | 3     | 17   | 22   | 16
10       | 1     | 19   | 24   | 18
3        | 2     | 21   | 2    | 8
11       | 3     | 23   | 4    | 10
  • also concentric windings...

Wiring diagram on Commons?

NEMA and IEC

General

I have trouble with this diagram of a shaded pole motor. The vertical flux paths are too thick. Compare to an actual design. Also, the flux path through the winding should be a similar width.

Microphones and flux path

A ribbon microphone with no flux return path.

Vacuum pumps and details

Rotary vane pump.svg

Stator shape, reflood, exhaust valve, oil seals, oil pump, foam, vanes through the axle.

Clearances. Bearings and something like an Oldham coupling.

Where was the liquid version?

Electrochemical cell

Galvanic cell labeled.svg

I do not think the salt bridge works that way. KNO3.

See https://kids.britannica.com/students/assembly/view/106626

Mechanical seals

The sealing only happens around the compression nut.

Also, KF seals prevent overcompression.

Biology

The comment ("This picture is obsolete. the pluripotent stemcell of the blood is giving origin to a lymphoid and a myeloid cell line.") at

and

Many have worked on similar diagrams, so sort out the effort.

  • File:Illu blood cell lineage.jpg 2006-05-17 on Commons. 77 kB. NIH (when?).
  • File:Hematopoiesis (human) diagram.png 2006-08-11 1.18 MB. A. Rad. Has dense text block. There is also an extensive description about the cell images. It has an incompatible Commons license: "GFDL-self. This image is released under the GFDL-self license and is considered freely distributable. This image or any reproductions/customizations thereof (or any reproductions/customizations of its reproductions/customizations, and so forth) may NOT be sold without my explicit consent." The separate licensing section has just {{self|GFDL}}, so the licensing terms are inconsistent.

The license issue is troubling. It affects all derivatives. It also has further issues because many files have been extracted from File:Hematopoiesis (human) diagram en.svg. See, for example, File:Monoblast.svg.

Comments on individual versions.

Look at Wikidata items. Examine instance and subclass relations. (develops from (P3094), follows (P155), followed by (P156))

Looking for files extracted from File:Hematopoiesis (human) diagram en.svg. Keeping track of derivatives is nice....

Clean up

Image maps

A post at Commons:Graphics village pump:

The template is

It uses <imagemap>.

The loom file is File:Simple_treadle_floorloom,_line_drawing.png.

The imagemap lines look like:

poly 1360 808 1904 553 1921 1152 1399 1437 [[Heddle|Heddles and heddle frames or harness]]

The post has various questions.

Multilingual from Wikidata items:

  • {{Label|Q173056}}loom
  • {{Label|Q2748498}}batten
  • {{Label|Q39515}}heddle
  • many loom terms are absent...

Several people have proposed Wikipedia links from a Wikidata item. From heddle (Q39515), hyperlink to the appropriate language Wikipedia entry if it exists.

Multilingual static (not tool tip) labels (as done at File:2022 Russian invasion of Ukraine.svg) are possible.

SVG has tool tips, and tool tips would be better than an image map because users do not supply the hit geometry. The method fails for MediaWiki because SVG is not served directly. Browsers have good support for tool tips, but they do not support SVG multilingual tool tips such as:

<g>
  <title lang="en">English</title>
  <title lang="de">Deutsch</title>
  <title lang="fr">Français</title>
  ...
</g>

The SVG 2.0 specification has apparently dropped multilingual titles. See https://svgwg.org/svg2-draft/struct.html#TitleElement

Animated SVG can be done. Make some conventional SVG visible upon a mouse over. The table of chemical elements technology.

Copyright

MediaWiki upgrades

See Phab:T265549#8145732 Phab:T216815

The new version of librsvg now takes the langtag through the --accept-language command line argument rather than the $LANG environment variable. Well, it still uses the $LANG and other environment locale variables, but those environment variables must now be Unix locale strings.

In general, an SVG langtag should never be passed through the $LANG environment variable. The latter takes an opaque Unix locale string rather than an IETF language tag. It happened to work for early versions of librsvg because those versions did a getenv("LANG") and used that result as an IETF langtag. (As a separate issue, the early versions also did not process hyphenated langtags correctly.) The Rust versions of librsvg now use a library routine to digest the locale string.

Command line arguments

Outdated man page:

  1. -f --format: png
  2. -w --width: integer
  3. -h --height: integer
  4. -o --output filename
  5. -d --dpi-x (default 90!)
  6. -p --dpi-y (default 90!)
  7. -a --keep-aspect-ratio
  8. -u undocumented; see unlimited below

Source

Resvg command line arguments

MediaWiki external converter

For conventional MediaWiki, there is a configuration variable:

$wgSVGConverters = [
	'ImageMagick' => '$path/convert -background "#ffffff00" -thumbnail $widthx$height\\! $input PNG:$output',
	'sodipodi' => '$path/sodipodi -z -w $width -f $input -e $output',
	'inkscape' => '$path/inkscape -z -w $width -f $input -e $output',
	'batik' => 'java -Djava.awt.headless=true -jar $path/batik-rasterizer.jar -w $width -d $output $input',
	'rsvg' => '$path/rsvg-convert -w $width -h $height -o $output $input',
	'imgserv' => '$path/imgserv-wrapper -i svg -o png -w$width $input $output',
	'ImagickExt' => [ 'SvgHandler::rasterizeImagickExt', ],
];

The configuration should set the $wgSVGConverter variable. Presumably, that is set to rsvg.

I thought there were type declarations in MediaWiki PHP code. See, for example,

See PHP info:

It looks like the SVG Translate source has JavaDoc style annotations (but there was a strict assignment). See

See https://www.php.net/manual/en/function.str-replace.php and its "Replacement order gotcha" comment. Replacement is done left to right. The filename should always be rightmost.
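The gotcha can be reproduced outside PHP. The sketch below emulates the array form of str_replace in Python with a hypothetical helper `sequential_replace`: each search/replacement pair is applied, in order, to the result of the previous pair. The command template and the awkward file name are made up for illustration.

```python
def sequential_replace(searches, replacements, subject):
    """Emulate PHP str_replace() with array arguments.

    Each pair is applied in order to the output of the previous pair, so a
    later search can match text that an earlier replacement inserted.
    """
    for search, replacement in zip(searches, replacements):
        subject = subject.replace(search, replacement)
    return subject


template = "rsvg-convert -w $width -o $output $input"
tricky_name = "/tmp/$width.svg"  # hypothetical file name containing a placeholder

# File name substituted first: the later $width pass rewrites the file name.
bad = sequential_replace(["$input", "$width"], [tricky_name, "200"], template)

# File name substituted last: nothing runs after it, so it survives intact.
good = sequential_replace(["$width", "$input"], ["200", tricky_name], template)
```

Here `bad` ends with /tmp/200.svg while `good` keeps /tmp/$width.svg, which is the reason for putting the filename substitutions rightmost.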

     public function rasterize( $srcPath, $dstPath, $width, $height, $lang = false ) {
         $mainConfig = MediaWikiServices::getInstance()->getMainConfig();
         $svgConverters = $mainConfig->get( MainConfigNames::SVGConverters );
         $svgConverter = $mainConfig->get( MainConfigNames::SVGConverter );
         $svgConverterPath = $mainConfig->get( MainConfigNames::SVGConverterPath );
         $err = false;
         $retval = '';
         if ( isset( $svgConverters[$svgConverter] ) ) {
             if ( is_array( $svgConverters[$svgConverter] ) ) {
                 // This is a PHP callable
                 $func = $svgConverters[$svgConverter][0];
                 if ( !is_callable( $func ) ) {
                     throw new MWException( "$func is not callable" );
                 }
                 $err = $func( $srcPath,
                     $dstPath,
                     $width,
                     $height,
                     $lang,
                     ...array_slice( $svgConverters[$svgConverter], 1 )
                 );
                 $retval = (bool)$err;
             } else {
                 // External command
                 $cmd = str_replace(
                     [ '$path/', '$width', '$height', '$input', '$output' ],
                     [ $svgConverterPath ? Shell::escape( "{$svgConverterPath}/" ) : "",
                         intval( $width ),
                         intval( $height ),
                         Shell::escape( $srcPath ),
                         Shell::escape( $dstPath ) ],
                     $svgConverters[$svgConverter]
                 );
  
                 $env = [];
                 if ( $lang !== false ) {
                     $env['LANG'] = $lang;
                 }
  
                 wfDebug( __METHOD__ . ": $cmd" );
                 $err = wfShellExecWithStderr( $cmd, $retval, $env );
             }
         }
         // @phan-suppress-next-line PhanTypeMismatchArgumentNullable False positive
         $removed = $this->removeBadFile( $dstPath, $retval );
         if ( $retval != 0 || $removed ) {
             // @phan-suppress-next-next-line PhanPossiblyUndeclaredVariable cmd is set when used
             // @phan-suppress-next-line PhanTypeMismatchArgumentNullable cmd is set when used
             $this->logErrorForExternalProcess( $retval, $err, $cmd );
             return new MediaTransformError( 'thumbnail_error', $width, $height, $err );
         }
  
         return true;
     }

The change is straightforward.

Change the rsvg configuration (or make a rsvglang).

'rsvg' => '$path/rsvg-convert -w $width -h $height -l $lang -o $output $input',

Should this line also have a -u to match Thumbor?

Also add a resvg line

'resvg' => '$path/resvg -w $width -h $height --languages $lang $input $output',

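Before line 350 of the source file (which appears to correspond to the // External command comment in the listing above), set $lang to an actual langtag: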

if ( $lang === false ) $lang = "en";

It might be cleaner to default $lang = 'en' in the argument list.... Declare the argument to be a string type.

Add a "$lang" substitution in lines 352–357 of the source file (apparently the search and replacement arrays passed to str_replace). The substitution should be before the $input position in the array.

Delete lines 362–364 of the source file (apparently the if block that sets $env['LANG']); do not pass a langtag in the environment. This will break old versions of librsvg.

Thumbor

For librsvg used by Thumbor

  • rsvg-convert source.svg -u -f png -w width -h height --accept-language lang
    def create_image(self, buffer):
        self.prepare_source(buffer)

        command = [
            self.context.config.RSVG_CONVERT_PATH,
            self.source,
            '-u',
            '-f',
            'png'
        ]

        if self.context.request.width > 0:
            command += ['-w', '%d' % self.context.request.width]

        if self.context.request.height > 0:  # pragma: no cover
            command += ['-h', '%d' % self.context.request.height]

        env = None
        if hasattr(self.context.request, 'lang'):
            env = {'LANG': self.context.request.lang.upper()}

        png = self.command(command, env)

        return super(Engine, self).create_image(png)

Change the above code:

Add before line 57 of the source file (apparently the env = None statement):

        if hasattr(self.context.request, 'lang'):
            command += ['-l', self.context.request.lang]

The .upper() is not needed; langtags are not case sensitive. Alternatively, an unspecified lang may be set to "en".

Delete lines 58–59 of the source file (apparently the hasattr test and the env assignment); do not pass the langtag in the environment.
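Put together, the proposed command construction might look like the standalone sketch below. The function `build_rsvg_command` and its parameters are hypothetical stand-ins for the Thumbor engine's attributes, and the long option --accept-language is used in place of -l. The function only builds the argument list; it does not run rsvg-convert.

```python
def build_rsvg_command(rsvg_path, source, width=0, height=0, lang=None):
    """Build an rsvg-convert argument list that passes the langtag as an option.

    The langtag is passed unchanged (no .upper()) and never goes into the
    environment.
    """
    command = [rsvg_path, source, "-u", "-f", "png"]

    if width > 0:
        command += ["-w", "%d" % width]

    if height > 0:
        command += ["-h", "%d" % height]

    if lang:
        command += ["--accept-language", lang]

    return command
```

For example, `build_rsvg_command("rsvg-convert", "source.svg", width=506, lang="az-Latn")` yields a list ending in `--accept-language az-Latn`. Passing the arguments as a list also avoids shell quoting problems.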

Hyphenated langtags

Hyphenated langtags show NO TEXT — not even the default text.

Apparent thumbnailer issue with librsvg 2.44.10 and Thumbor/7.3.2.

Thumbnail for az-latn has no text
https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/IPv6_header-en.svg/langaz-latn-506px-IPv6_header-en.svg.png?20230520164902
Thumbnail for az has text
https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/IPv6_header-en.svg/langaz-506px-IPv6_header-en.svg.png?20230520164902

Non-English default langtags

File using non-English default language displays default language rather than English.

Unit tests

Investigate https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/853402/

https://github.com/wikimedia/operations-software-thumbor-plugins/blob/master/tests/integration/test_svg.py

In particular,

    def run_and_check_ssim_and_size(
        self,
        url,
        mediawiki_reference_thumbnail,
        perfect_reference_thumbnail,
        expected_width,
        expected_height,
        expected_ssim,
        size_tolerance,
    ):
        """Request URL and check ssim and size.

        Arguments:
        url -- thumbnail URL
        mediawiki_reference_thumbnail -- reference thumbnail file
        expected_width -- expected thumbnail width
        expected_height -- expected thumbnail height
        expected_ssim -- minimum SSIM score
        size_tolerance -- maximum file size ratio between reference and result
        perfect_reference_thumbnail -- perfect lossless version of the target thumbnail, for visual comparison
        """
        try:

Structural Similarity Index Measure (SSIM).

So make 200px × 200px images named 200px-Test_Patch_000 to 200px-Test_Patch_FFF.

Then check SSIM between those images. That will allow determining a reasonable accept/reject range.
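A sketch of the bookkeeping for those patches follows. The helper names are hypothetical. The SSIM function covers only the special case of two uniform (single-value) images on one channel, where all variances and the covariance are zero, so the standard SSIM formula reduces to its luminance term. A real test would run a full SSIM implementation on the rendered PNG files.

```python
def patch_names():
    """Names for all 3-digit hex colors, 000 through FFF (4096 patches)."""
    return [f"200px-Test_Patch_{index:03X}" for index in range(0x1000)]


def expand_hex(short):
    """Expand a 3-digit hex color such as '0F0' to an (R, G, B) tuple."""
    return tuple(int(digit * 2, 16) for digit in short)


def uniform_ssim(mean_x, mean_y, dynamic_range=255):
    """SSIM between two uniform single-channel patches.

    With zero variance and covariance, the contrast/structure factor of
    SSIM is C2/C2 = 1, leaving (2*mx*my + C1) / (mx**2 + my**2 + C1),
    where C1 = (0.01 * dynamic_range)**2.
    """
    c1 = (0.01 * dynamic_range) ** 2
    return (2 * mean_x * mean_y + c1) / (mean_x ** 2 + mean_y ** 2 + c1)
```

Comparing `uniform_ssim` for neighboring channel values (for example 0xEE against 0xFF) with the score for clearly different values gives a first idea of where an accept/reject threshold might sit.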

To test that English is generated by default, test this SVG against the appropriate patches:

<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" version="1.1" width="200" height="200">
  <switch>
    <rect width="100%" height="100%" fill="#0F0" systemLanguage="en">
      <title>en</title>
    </rect>
    <rect width="100%" height="100%" fill="#00F" systemLanguage="fr">
      <title>fr</title>
    </rect>
    <rect width="100%" height="100%" fill="#0FF" systemLanguage="zh">
      <title>zh-Hans</title>
    </rect>
    <rect width="100%" height="100%" fill="#F0F" systemLanguage="zh-Hans">
      <title>zh-Hans</title>
    </rect>
    <rect width="100%" height="100%" fill="#FF0" systemLanguage="zh-Hant">
      <title>zh-Hant</title>
    </rect>
    <rect width="100%" height="100%" fill="#F00">
      <title>default</title>
    </rect>
  </switch>
</svg>

OK, File:SVG Test System Language.svg produces color patches.

During an upgrade, Thumbor started producing incorrect thumbnails. See Phab:T335361. That problem ($LANG versus $LC_ALL) was fixed. No unit tests were added, and it may be that WMF does not subject Thumbor to unit tests.

Generally, there should be unit tests for librsvg to make sure it does its job. In addition, there should be unit tests for the SVG thumbnailing code (the standard method and Thumbor). The latter tests are needed to make sure that the language is properly communicated to librsvg. That mechanism will fail again soon.

The reason is that Thumbor 7.3.2 is using librsvg v2.44. IIRC, that version still uses Unix environment variables to communicate the system language. (There may also be problems with hyphenated language tags: Unix may not understand the locale string sr-Latn or zh-Hans.) In later versions of librsvg, the system language should be passed through the --accept-language command line argument. Without valid unit tests, Thumbor may again quietly fail.

Where was the Phabricator issue that addressed environment variables?

Cannot set lang for filepath

Workaround: generate the filepath for 800px and then replace "/800px" with "/langXX-800px".
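The workaround can be sketched as a small string rewrite. The function name `add_lang_to_thumb_url` is hypothetical, and lowercasing the langtag is an assumption based on the thumbnail URLs shown under Hyphenated langtags.

```python
def add_lang_to_thumb_url(url, lang, width=800):
    """Insert a 'lang<tag>-' prefix before the '<width>px-' part of a thumbnail URL."""
    marker = f"/{width}px-"
    # Split at the LAST occurrence, in case the marker text also appears earlier.
    head, found, tail = url.rpartition(marker)
    if not found:
        raise ValueError(f"no {marker!r} segment in {url!r}")
    return f"{head}/lang{lang.lower()}-{width}px-{tail}"


plain = ("https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/"
         "IPv6_header-en.svg/506px-IPv6_header-en.svg.png")
with_lang = add_lang_to_thumb_url(plain, "az", width=506)
```

Here `with_lang` has the same form as the az thumbnail URL listed earlier, apart from the query string.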

Try OCR

HTML 5

Wikitext supports some HTML markup.

  • abbr
  • <acronym title="Central Intelligence Agency">acronym CIA</acronym> (deprecated)
  • <address>address</address>
  • <aside>aside</aside> (aside)
  • cite (usage advice varies a lot for this element)
  • code
  • data
  • <details><summary>details</summary> The expanded description.</details>
  • kbd (user input)
  • mark
  • q
  • rp (rp should disappear; part of ruby)
  • ruby  (text)(kan)(ji)
  • samp (sample output)
  • var
  • <style></style> (appropriately neutered)

References

  1. Caching, Mozilla.org
  2. https://www.alibabacloud.com/blog/what-is-domain-resolution-and-how-it-works_597610
  3. https://serverfault.com/questions/347689/how-to-share-domain-name-with-multiple-servers
  4. See https://aeronav.faa.gov/user_guide/20211202/cug-complete.pdf at page 43. In those images, the NDB symbol is a dot, ring, and only 5 dotted rings.
  5. SVG 2.0 Chapter 5 Document Structure § 5.8
  6. Cory Doctorow, A Bug in early Creative Commons licenses has enabled a new breed of superpredator
  7. Village Pump: Cory Doctorow post on "copyleft trolls" mentions Commons
  8. Village pump:cc-by < 4.0 not ok any more
  9. e.g., https://id.loc.gov/vocabulary/relators.html
  10. Adobe, XMP Specification Part 1 at Table 4.
  11. Nevile, Liddy; Lissonnet, Sophie (January 2004) The Case for a Person/Agent Dublin Core Metadata Element Set
  12. https://www.compart.com/en/unicode/charsets/Adobe-Symbol-Encoding
  13. https://www.compart.com/en/unicode/charsets/x-Adobe-Zapf-Dingbats-Encoding
  14. https://fonts2u.com/sonata.font https://adobe-type-tools.github.io/font-tech-notes/pdfs/5045.Sonata.pdf
  15. https://stackoverflow.com/questions/36486716/the-14-standard-pdf-fonts-and-character-encoding
  16. https://www.compart.com/en/unicode/charsets/Adobe-Standard-Encoding
  17. Mozilla (2021) SVG Fonts
  18. https://www.enzolifesciences.com/science-center/technotes/2019/december/what-are-the-differences-between-northern-southern-and-western-blotting?/
  19. https://thumbor.readthedocs.io/en/latest/
  20. “ꝺ” U+A77A Latin Small Letter Insular D Unicode Character
  21. https://linux.die.net/man/1/xsltproc
  22. https://www.wikidata.org/w/index.php?title=Special:Search&limit=100&offset=0&profile=default&search=Pershotravneve&ns0=1&ns120=1
  23. https://www.google.com/maps/place/Radekhiv,+Lviv+Oblast,+Ukraine/@50.2811748,24.6012475,13z
  24. https://developer.mozilla.org/en-US/docs/Web/SVG/Element/circle
  25. See File:SVG CSS Test.svg for a test of .cls2.cls3 selection.
  26. https://tc39.es/ecma402/#sec-intl-datetimeformat-constructor
  27. Write the Date in French, wikihow.com. The first is pronounced "premier".
  28. Italian Ordinal Numbers and Numerical Rank, thoughtco.com. "il primo".
  29. Inkscape Tutorial. Chapter 6. SVG File Format. https://inkscapetutorial.org/svg-file-format.html
  30. https://www.mediawiki.org/wiki/User:Jarry1250/GSoC_2012_roadmap
  31. Meta:Community Wishlist Survey 2017/Multimedia and Commons/SVG-Translate
  32. Meta:Community Tech/SVG translation
  33. https://www.w3.org/TR/SVG/styling.html
  34. https://drafts.csswg.org/selectors/#specificity-rules