[Thunar-workers] [DokuWiki] page changed: implementation:mime-glob-match

thunar-workers at xfce.org thunar-workers at xfce.org
Mon Mar 7 00:02:10 CET 2005


A page in your DokuWiki was added or changed. Here are the details:

Date        : 2005/03/06 23:02
Browser     : Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.5) Gecko/20050122 Firefox/1.0
IP-Address  : 217.229.249.18
Hostname    : pD9E5F912.dip.t-dialin.net
Old Revision: http://thunar.xfce.org/wiki/implementation:mime-glob-match?rev=1108855210
New Revision: http://thunar.xfce.org/wiki/implementation:mime-glob-match
Edit Summary: 
User        : benny

@@ -10,8 +10,12 @@
  ===== Ideas for the new Implementation =====
  The implementation should be very fast, since the MIME type detection is
  done on every single file, that is touched by the file manager (and can
  thereby be considered a //common operation//).
+ 
+ An [[implementation:mime-glob-match-analysis|analysis]] of the ''globs'' file
+ in the Shared MIME Database is was done to better understand the requirements
+ for the MIME globbing implementation.
  
  
  ==== Steps ====
  The required steps for the Glob Match detection are as follows
@@ -39,8 +43,77 @@
  filer sample implementation below, Thunar should store all patterns
  in arrays (probably one array per pattern type), and make sure that
  the array items are properly aligned in memory.
  
+ 
+ === Literal Patterns ===
+ Since the Shared MIME Database contains only a few literal patterns
+ (10 literal patterns in ''shared-mime-info-0.15'') and it is unlikely
+ that vendors will add that many additional literal patterns, we should
+ go with a simple string comparison here. The important fact is that we
+ need to sort the list of literal patterns by their length to make sure
+ we check the longest patterns first.
+ 
+ 
+ 
+ === Simple Patterns ===
+ This is the most common case of a pattern (**400** simple patterns in the
+ ''shared-mime-info-0.15'' package), and it should therefore
+ be optimized the most. When loading the glob database, the __Simple__
+ patterns (or atleast a few of them) contained within should be compiled
+ into some form that can be easily used for pattern matching. For example,
+ all simple patterns that look like ''*ext'' (where ''ext'' has 4 or
+ less bytes**!!**) can be compiled into a 32bit integer (using two
+ ints, one to store the case-sensitive version and one to store
+ the case-insensitive (lower-cased) version), just like the Filer
+ example below does. This would turn a byte-wise string comparison
+ into a simple integer comparison. We could even use some
+ machine-specific optimizations here.
+ 
+ Since the number of simple patterns is that huge, it may be worth to
+ investigate in reducing the number of comparisons instead of trying
+ to speed up the comparisons. There are several ways to achieve this
+ goal:
+ 
+   * An n-ary search tree
+   * Some kind of hash lookup
+ 
+ While reducing the number of comparisons, it's important to keep an
+ eye on data locality; else, if we stall the CPU 50 times with waiting
+ for main memory, we'll surely experience a huge performance problems
+ with large directories in the end.
+ 
+ :!: More research required!
+ 
+ 
+ 
+ === Complex Patterns ===
+ This is the least common pattern (only 5 complex patterns in the
+ ''shared-mime-info-0.15'' package). It needs to be checked using
+ ''fnmatch()'' (the glob-style pattern matching in GLib does not
+ support ''[...]'' character ranges). Due to the nature of the
+ patterns there's not much that can be done here to optimize the
+ pattern matching, and since its the least common case, it may
+ not be worth to spent much effort on this.
+ 
+ :!: For filenames that would match only in steps 3-6 (e.g.
+ ''IMAGE.GIF'', which will normally match the simple pattern
+ ''*.gif'' in step 4), all the complex patterns will need to
+ be tested first, so it may actually be a good idea to waste
+ some more thoughts about this.
+ 
+ 
+ 
+ ==== Recommended reading ====
+   * [[implementation:mime-glob-match-analysis|Analysis of the Glob-Patterns in the Shared MIME Database]]
+   * [[infrastructure:related-links#alphasort|Article about AlphaSort]]
+ 
+ 
+ 
+ ===== Obsolete Ideas =====
+ The ideas presented below are mostly obsoleted by the additional
+ [[implementation:mime-glob-match-analysis|analysis]] performed on
+ the Shared MIME Database.
  
  === Literal patterns ===
  The __Literal__ patterns, like ''Makefile'' or
  ''README'' can easily be checked using string comparisons. To further
@@ -132,41 +205,8 @@
  mentioned already - the literal pattern is not the common case.
  
  Open issues:
    * What about **case-insensitive** comparisons!? (We could use a 2nd list with the same basic structure as described above here, where all items are lower-cased properly - again, remember UTF-8 - we'd two lists, but only one function!)
- 
- 
- 
- === Simple Patterns ===
- This is the most common case of a pattern, and it should therefore
- be optimized the most. When loading the glob database, the __Simple__
- patterns (or atleast a few of them) contained within should be compiled
- into some form that can be easily used for pattern matching. For example,
- all simple patterns that look like ''*ext'' (where ''ext'' has 4 or
- less bytes**!!**) can be compiled into a 32bit integer (using two
- ints, one to store the case-sensitive version and one to store
- the case-insensitive (lower-cased) version), just like the Filer
- example below does. This would turn a byte-wise string comparison
- into a simple integer comparison. We could even use some
- machine-specific optimizations here.
- 
- :!: More research required!
- 
- 
- 
- === Complex Patterns ===
- This is the least common pattern. It needs to be checked using
- ''fnmatch()'' (the glob-style pattern matching in GLib does not
- support ''[...]'' character ranges). Due to the nature of the
- patterns there's not much that can be done here to optimize the
- pattern matching, and since its the least common case, it may
- not be worth to spent much effort on this.
- 
- :!: For filenames that would match only in steps 3-6 (e.g.
- ''IMAGE.GIF'', which will normally match the simple pattern
- ''*.gif'' in step 4), all the complex patterns will need to
- be tested first, so it may actually be a good idea to waste
- some more thoughts about this.
  
  
  
  ===== Old Implementation (from Filer) =====



-- 
This mail was generated by DokuWiki at
http://thunar.xfce.org/wiki/



More information about the Thunar-workers mailing list