[Thunar-workers] [DokuWiki] page changed: implementation:mime-glob-match

thunar-workers at xfce.org thunar-workers at xfce.org
Sat Feb 19 19:53:49 CET 2005


A page in your DokuWiki was added or changed. Here are the details:

Date        : 2005/02/19 18:53
Browser     : Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.5) Gecko/20050122 Firefox/1.0
IP-Address  : 217.85.191.78
Hostname    : pD955BF4E.dip.t-dialin.net
Old Revision: http://thunar.xfce.org/wiki/implementation:mime-glob-match?rev=1107905661
New Revision: http://thunar.xfce.org/wiki/implementation:mime-glob-match
Edit Summary: 
User        : benny

@@ -1,5 +1,117 @@
  ====== Implementation - MIME Glob Match ======
+ The **MIME Glob Match** implementation should follow the suggestions from the
+ [[requirements:standards#shared_mime-info_database|Shared MIME Database
+ specification]] as closely as possible (unlike other projects)
+ to make sure the user will be presented with usable results (although the
+ user won't know where these results come from, since the MIME System will
+ be hidden from the UI).
+ 
+ 
+ ===== Ideas for the new Implementation =====
+ The implementation should be very fast, since the MIME type detection is
+ done on every single file, that is touched by the file manager (and can
+ thereby be considered a //common operation//).
+ 
+ 
+ ==== Steps ====
+ The required steps for the Glob Match detection are as follows
+ 
+   - Check __Literal__ patterns (case sensitive)
+   - Check __Simple__ patterns (case sensitive)
+   - Check __Complex__ patterns (case sensitive)
+   - Check __Literal__ patterns (case insensitive)
+   - Check __Simple__ patterns (case insensitive)
+   - Check __Complex__ patterns (case insensitive)
+ 
+ where a __Literal__ pattern is a pattern that does not include any
+ wildcards (such as ''*?[''), __Simple__ patterns are patterns that
+ start with a ''*'' and do not include any other wildcards (e.g.
+ ''*.txt'' or ''*.ps''), and __Complex__ patterns are all other
+ patterns. It is obvious, that __Simple__ patterns are the most common
+ patterns in the ''glob'' database, and should therefore be optimized
+ the most.
+ 
+ In general, the pattern data (as loaded from the ''glob'' database)
+ should be stored //cache-friendly// way (read the
+ [[infrastructure:related-links#alphasort|article about AlphaSort]] for
+ an explanation). This is especially important for the __Simple__
+ patterns. So instead of using singly-linked lists, like in the
+ filer sample implementation below, Thunar should store all patterns
+ in arrays (probably one array per pattern type), and make sure that
+ the array items are properly aligned in memory.
+ 
+ 
+ === Literal patterns ===
+ The __Literal__ patterns, like ''Makefile'' or
+ ''README'' can easily be checked using string comparisons. To further
+ speed up the //__Literal__ patterns//-steps, the patterns should be
+ sorted by length, where the longest patterns should appear first, and
+ in addition to the string value, the lookup database should also store
+ the length of the string.
+ 
+ When comparing the filename to a literal pattern entry, the code
+ should first check if the lengths match, then if both are of the
+ same length, it should perform a string comparison. If the code
+ detects that the filename length is less than the length of the
+ current literal pattern, then it should stop the lookup and
+ continue with the next step. This optimization is based on the
+ proposition that the literal patterns are sorted by their string
+ length (longest first), and so no other literal can match the
+ given filename, because all other literals are shorter than the
+ filename.
+ 
+ The following pseudo code demonstrate this step:
+ 
+ <code c>
+ for (all literal patterns)
+   {
+     if (length (literal) < length (filename))
+       break;
+     else if (length (literal) == length (filename))
+       {
+         if (literal == filename)
+           return mime type of literal;
+       }
+   }
+ </code>
+ 
+ 
+ 
+ === Simple Patterns ===
+ This is the most common case of a pattern, and it should therefore
+ be optimized the most. When loading the glob database, the __Simple__
+ patterns (or atleast a few of them) contained within should be compiled
+ into some form that can be easily used for pattern matching. For example,
+ all simple patterns that look like ''*ext'' (where ''ext'' has 4 or
+ less bytes**!!**) can be compiled into a 32bit integer (using two
+ ints, one to store the case-sensitive version and one to store
+ the case-insensitive (lower-cased) version), just like the Filer
+ example below does. This would turn a byte-wise string comparison
+ into a simple integer comparison. We could even use some
+ machine-specific optimizations here.
+ 
+ :!: More research required!
+ 
+ 
+ 
+ === Complex Patterns ===
+ This is the least common pattern. It needs to be checked using
+ ''fnmatch()'' (the glob-style pattern matching in GLib does not
+ support ''[...]'' character ranges). Due to the nature of the
+ patterns there's not much that can be done here to optimize the
+ pattern matching, and since its the least common case, it may
+ not be worth to spent much effort on this.
+ 
+ :!: For filenames that would match only in steps 3-6 (e.g.
+ ''IMAGE.GIF'', which will normally match the simple pattern
+ ''*.gif'' in step 4), all the complex patterns will need to
+ be tested first, so it may actually be a good idea to waste
+ some more thoughts about this.
+ 
+ 
+ 
+ ===== Old Implementation (from Filer) =====
  
  A possible implementation of the MIME-type detection //Glob match step//.
  This implementation is not well tested, and it's basicly just a testcase.
  It includes the **integer comparison**-optimization for the common case
@@ -228,8 +340,9 @@
                      gboolean     case_sensitive)
  {
    for (; pattern != NULL; pattern = pattern->next)
      if (g_pattern_match (pattern->pspec, name_length, name, name_reversed))
+ 
        return pattern->type;
  
    return NULL;
  }



-- 
This mail was generated by DokuWiki at
http://thunar.xfce.org/wiki/



More information about the Thunar-workers mailing list