[Thunar-workers] [DokuWiki] page changed: implementation:mime-glob-match

Tue Mar 8 20:12:06 CET 2005

A page in your DokuWiki was added or changed. Here are the details:

Date        : 2005/03/08 19:12
Browser     : Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.5) Gecko/20050122 Firefox/1.0
IP-Address  : 217.85.185.87
Hostname    : pD955B957.dip.t-dialin.net
Old Revision: http://thunar.xfce.org/wiki/implementation:mime-glob-match?rev=1110286555
New Revision: http://thunar.xfce.org/wiki/implementation:mime-glob-match
Edit Summary: 
User        : benny

@@ -30,51 +30,58 @@
  </code>
  
  Below is a table of results:
  
- ^ Name             ^ Average filename length ^
- | Benedikt Meurer  | 12.491                  |
- | Jasper Huijsmans | 15.1931                 |
- | Jens Luedicke    | 10.9706                 |
+ ^ Name                            ^ Average filename length ^
+ | Benedikt Meurer                 | 12.491                  |
+ | Brian Tarricone                 | 18.8394                 |
+ | Jasper Huijsmans                | 15.1931                 |
+ | Jens Luedicke                   | 10.9706                 |
+ | Yves-Alexis Perez (Home folder) | 16.0649                 |
+ | Yves-Alexis Perez (MP3 folder)  | 34.6277                 |
  
- ==== Steps ====
- The required steps for the Glob Match detection are as follows
  
-   - Check __Literal__ patterns (case sensitive)
-   - Check __Simple__ patterns (case sensitive)
-   - Check __Complex__ patterns (case sensitive)
-   - Check __Literal__ patterns (case insensitive)
-   - Check __Simple__ patterns (case insensitive)
-   - Check __Complex__ patterns (case insensitive)
  
- where a __Literal__ pattern is a pattern that does not include any
- wildcards (such as ''*?[''), __Simple__ patterns are patterns that
- start with a ''*'' and do not include any other wildcards (e.g.
- ''*.txt'' or ''*.ps''), and __Complex__ patterns are all other
- patterns. It is obvious, that __Simple__ patterns are the most common
- patterns in the ''glob'' database, and should therefore be optimized
- the most.
+ ==== Types of Patterns ====
+ The Shared MIME Database specification mentions three different types of
+ patterns ([[implementation:mime-glob-match#Literal_Patterns|Literal
+ Patterns]], [[implementation:mime-glob-match#Simple_Patterns|Simple
+ Patterns]] and [[implementation:mime-glob-match#Complex_Patterns|Complex
+ Patterns]]) into which the allowed patterns can be divided. The Shared
+ MIME Database specification uses [[wp>glob]]-style
+ [[wp>Pattern_Matching|Pattern Matching]]; supported wildcards are:
+ 
+   * ''*'' matches zero or more occurences of arbitrary characters
+   * ''?'' matches exactly one occurence of an arbitrary character
+   * ''[..]'' defines a character class
  
  In general, the pattern data (as loaded from the ''glob'' database)
  should be stored //cache-friendly// way (read the
  [[infrastructure:related-links#alphasort|article about AlphaSort]] for
- an explanation). This is especially important for the __Simple__
- patterns. So instead of using singly-linked lists, like in the
+ an explanation). This is especially important for the Simple
+ Patterns. So instead of using singly-linked lists, like in the
  filer sample implementation below, Thunar should store all patterns
  in arrays (probably one array per pattern type), and make sure that
  the array items are properly aligned in memory.
  
  
  === Literal Patterns ===
- Since the Shared MIME Database contains only a few literal patterns
- (10 literal patterns in ''shared-mime-info-0.15'') and it is unlikely
- that vendors will add that many additional literal patterns, we should
- go with a simple string comparison here. The important fact is that we
- need to sort the list of literal patterns by their length to make sure
- we check the longest patterns first.
+ A Literal Pattern is a pattern that does not include any of the possible
+ wildcards, but only printable characters in the UTF-8 encoding, for
+ example ''README'' is such a pattern, or ''.DirIcon''.
  
- Assuming the patterns are sorted by string length, then the following
- implementation should provide reasonable performance (pseudo-code):
+ The Shared MIME Database contains only a few literal patterns (10 literal
+ patterns in ''shared-mime-info-0.15'') and it is unlikely that vendor will
+ add many literal patterns. Therefore we should go with a simple string
+ comparison here. The important fact is that we need to sort the list of
+ literal patterns by their length to make sure we check the longest pattern
+ first.
+ 
+ {{implementation:mime:20050306-literal_patterns_list.png}}
+ 
+ The literal patterns are arranged in a list, like shown on the picture
+ above. With such a sorted list, a possible implementation of the pattern
+ matching function will look like this:
  
  <code c>
  for (pattern in literal patterns)
    {
@@ -90,21 +97,24 @@
        goto match;
    }
  </code>
  
- The //literal patterns// should be stored in an array. At best,
- everything related to the //literal patterns// would be stored
- in one memory chunk: The array of ''(pattern pointer, type
- pointer)'' tuples first, then the pattern strings (pointed to
- by the ''pattern pointer''s). That way, a certain amount of
- locality is granted, heap fragmentation is reduced and prefetching
- can be used on modern CPUs if necessary.
+ Since the average filename is longer than 10 characters, but
+ the longest literal pattern in the Shared MIME Database (as of 
+ version ''0.15'') is only 8 characters long, the implementation
+ will stop with ''no_match'' after the first comparison in many
+ cases, and we won't waste CPU cycles while comparing patterns
+ that cannot match.
  
- It is easy to see that the above implementation will provide
- good results, as the average filename length is 12-13 and the longest
- literal pattern in the Shared MIME Database (as of version ''0.15'')
- is 8.
+ The //list view// above is just to illustrate the concept. The
+ implementation should use an array to handle the patterns. Since
+ the pattern strings are of variable length, we cannot simple use
+ a C array here; but instead we'll use a large memory chunk (which
+ will also include the Simple Patterns) to reduce heap fragementation
+ and increase locality, and store the pattern data byte-by-byte, maybe
+ different depending on the platform (data alignment!).
  
+ :!: More research required.
  
  
  === Simple Patterns ===
  This is the most common case of a pattern (**400** simple patterns in the
@@ -146,8 +156,12 @@
  
  (plus the magic rules > 80 checks first). Performing these actions
  on 3000 files will take time; it's therefore important to optimize
  the Glob Pattern Matching as much as possible.
+ 
+ Possible n-ary tree based design:
+ 
+ {{implementation:mime:20050308-simple_patterns_tree_lookup.png}}
  
  :!: More research required!
  
  
@@ -165,8 +179,20 @@
  ''IMAGE.GIF'', which will normally match the simple pattern
  ''*.gif'' in step 4), all the complex patterns will need to
  be tested first, so it may actually be a good idea to waste
  some more thoughts about this.
+ 
+ 
+ 
+ ==== Steps ====
+ The required steps for the Glob Match detection are as follows
+ 
+   - Check __Literal__ patterns (case sensitive)
+   - Check __Simple__ patterns (case sensitive)
+   - Check __Complex__ patterns (case sensitive)
+   - Check __Literal__ patterns (case insensitive)
+   - Check __Simple__ patterns (case insensitive)
+   - Check __Complex__ patterns (case insensitive)
  
  
  
  ==== Recommended reading ====



-- 
This mail was generated by DokuWiki at
http://thunar.xfce.org/wiki/