rfc6203.IMAP4_Fuzzy_SEARCH_extension.txt 13 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395
  1. Internet Engineering Task Force (IETF) T. Sirainen
  2. Request for Comments: 6203 March 2011
  3. Category: Standards Track
  4. ISSN: 2070-1721
  5. IMAP4 Extension for Fuzzy Search
  6. Abstract
  7. This document describes an IMAP protocol extension enabling a server
  8. to perform searches with inexact matching and assigning relevancy
  9. scores for matched messages.
  10. Status of This Memo
  11. This is an Internet Standards Track document.
  12. This document is a product of the Internet Engineering Task Force
  13. (IETF). It represents the consensus of the IETF community. It has
  14. received public review and has been approved for publication by the
  15. Internet Engineering Steering Group (IESG). Further information on
  16. Internet Standards is available in Section 2 of RFC 5741.
  17. Information about the current status of this document, any errata,
  18. and how to provide feedback on it may be obtained at
  19. http://www.rfc-editor.org/info/rfc6203.
  20. Copyright Notice
  21. Copyright (c) 2011 IETF Trust and the persons identified as the
  22. document authors. All rights reserved.
  23. This document is subject to BCP 78 and the IETF Trust's Legal
  24. Provisions Relating to IETF Documents
  25. (http://trustee.ietf.org/license-info) in effect on the date of
  26. publication of this document. Please review these documents
  27. carefully, as they describe your rights and restrictions with respect
  28. to this document. Code Components extracted from this document must
  29. include Simplified BSD License text as described in Section 4.e of
  30. the Trust Legal Provisions and are provided without warranty as
  31. described in the Simplified BSD License.
  32. Sirainen Standards Track [Page 1]
  33. RFC 6203 IMAP4 FUZZY Search March 2011
  34. 1. Introduction
  35. When humans perform searches in IMAP clients, they typically want to
  36. see the most relevant search results first. IMAP servers are able to
  37. do this in the most efficient way when they're free to internally
  38. decide how searches should match messages. This document describes a
  39. new SEARCH=FUZZY extension that provides such functionality.
  40. 2. Conventions Used in This Document
  41. In examples, "C:" indicates lines sent by a client that is connected
  42. to a server. "S:" indicates lines sent by the server to the client.
  43. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
  44. "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
  45. document are to be interpreted as described in RFC 2119 [KEYWORDS].
  46. 3. The FUZZY Search Key
  47. The FUZZY search key takes another search key as its argument. The
  48. server is allowed to perform all matching in an implementation-
  49. defined manner for this search key, including ignoring the active
  50. comparator as defined by [RFC5255]. Typically, this would be used to
  51. search for strings. For example:
  52. C: A1 SEARCH FUZZY (SUBJECT "IMAP break")
  53. S: * SEARCH 1 5 10
  54. S: A1 OK Search completed.
  55. Besides matching messages with a subject of "IMAP break", the above
  56. search may also match messages with subjects "broken IMAP", "IMAP is
  57. broken", or anything else the server decides that might be a good
  58. match.
  59. This example does a fuzzy SUBJECT search, but a non-fuzzy FROM
  60. search:
  61. C: A2 SEARCH FUZZY SUBJECT work FROM user@example.com
  62. S: * SEARCH 1 4
  63. S: A2 OK Search completed.
  64. How the server handles multiple separate FUZZY search keys is
  65. implementation-defined.
  66. Fuzzy search algorithms might change, or the results of the
  67. algorithms might be different from search to search, so that fuzzy
  68. searches with the same parameters might give different results for
  69. 1) the same user at different times, 2) different users (searches
  70. Sirainen Standards Track [Page 2]
  71. RFC 6203 IMAP4 FUZZY Search March 2011
  72. executed simultaneously), or 3) different users (searches executed at
  73. different times). For example, a fuzzy search might adapt to a
  74. user's search habits in an attempt to give more relevant results (in
  75. a "learning" manner). Such differences can also occur because of
  76. operational decisions, such as load balancing. Clients asking for
  77. "fuzzy" really are requesting search results in a not-necessarily-
  78. deterministic way and need to give the user appropriate warning about
  79. that.
  80. 4. Relevancy Scores for Search Results
  81. Servers SHOULD assign a search relevancy score for each matched
  82. message when the FUZZY search key is given. Relevancy scores are
  83. given in the range 1-100, where 100 is the highest relevancy. The
  84. relevancy scores SHOULD use the full 1-100 range, so that clients can
  85. show them to users in a meaningful way, e.g., as a percentage value.
  86. As the name already indicates, relevancy scores specify how relevant
  87. to the search the matched message is. It's not necessarily the same
  88. as how precisely the message matched. For example, a message whose
  89. subject fuzzily matches the search string might get a higher
  90. relevancy score than a message whose body had the exact string in the
  91. middle of a sentence. When multiple search keys are matched fuzzily,
  92. how the relevancy score is calculated is server-dependent.
  93. If the server also advertises the ESEARCH capability as defined by
  94. [ESEARCH], the relevancy scores can be retrieved using the new
  95. RELEVANCY return option for SEARCH:
  96. C: B1 SEARCH RETURN (RELEVANCY ALL) FUZZY TEXT "Helo"
  97. S: * ESEARCH (TAG "B1") ALL 1,5,10 RELEVANCY (4 99 42)
  98. S: B1 OK Search completed.
  99. In the example above, the server would treat "hello", "help", and
  100. other similar strings as fuzzily matching the misspelled "Helo".
  101. The RELEVANCY return option MUST NOT be used unless a FUZZY search
  102. key is also given. Note that SEARCH results aren't sorted by
  103. relevancy; SORT is needed for that.
  104. 5. Fuzzy Matching with Non-String Search Keys
  105. Fuzzy matching is not limited to just string matching. All search
  106. keys SHOULD be matched fuzzily, although exactly what that means for
  107. different search keys is left for server implementations to decide --
  108. including deciding that fuzzy matching is meaningless for a
  109. particular key, and falling back to exact matching. Some suggestions
  110. are given below.
  111. Sirainen Standards Track [Page 3]
  112. RFC 6203 IMAP4 FUZZY Search March 2011
  113. Dates:
  114. A typical example could be when a user wants to find a message
  115. "from Dave about a week ago". A client could perform this search
  116. using SEARCH FUZZY (FROM "Dave" SINCE 21-Jan-2009 BEFORE
  117. 24-Jan-2009). The server could return messages outside the
  118. specified date range, but the further away the message is, the
  119. lower the relevancy score.
  120. Sizes:
  121. These should be handled similarly to dates. If a user wants to
  122. search for "about 1 MB attachments", the client could do this by
  123. sending SEARCH FUZZY (LARGER 900000 SMALLER 1100000). Again, the
  124. further away the message size is from the specified range, the
  125. lower the relevancy score.
  126. Flags:
  127. If other search criteria match, the server could return messages
  128. that don't have the specified flags set, but with lower relevancy
  129. scores. SEARCH SUBJECT "xyz" FUZZY ANSWERED, for example, might
  130. be useful if the user thinks the message he is looking for has the
  131. ANSWERED flag set, but he isn't sure.
  132. Unique Identifiers (UIDs), sequences, modification sequences: These
  133. are examples of keys for which exact matching probably makes sense.
  134. Alternatively, a server might choose, for instance, to expand a UID
  135. range by 5% on each side.
  136. 6. Extensions to SORT and SEARCH
  137. If the server also advertises the SORT capability as defined by
  138. [SORT], the results can be sorted by the new RELEVANCY sort criteria:
  139. C: C1 SORT (RELEVANCY) UTF-8 FUZZY SUBJECT "Helo"
  140. S: * SORT 5 10 1
  141. S: C1 OK Sort completed.
  142. The message with the highest score is returned first. As with the
  143. RELEVANCY return option, RELEVANCY sort criteria MUST NOT be used
  144. unless a FUZZY search key is also given.
  145. If the server also advertises the ESORT capability as defined by
  146. [CONTEXT], the relevancy scores can be retrieved using the new
  147. RELEVANCY return option for SORT:
  148. C: C2 SORT RETURN (RELEVANCY ALL) (RELEVANCY) UTF-8 FUZZY TEXT
  149. "Helo"
  150. S: * ESEARCH (TAG "C2") ALL 5,10,1 RELEVANCY (99 42 4)
  151. S: C2 OK Sort completed.
  152. Sirainen Standards Track [Page 4]
  153. RFC 6203 IMAP4 FUZZY Search March 2011
  154. Furthermore, if the server advertises the CONTEXT=SORT (or
  155. CONTEXT=SEARCH) capability, then the client can limit the number of
  156. returned messages to a SORT (or a SEARCH) by using the PARTIAL return
  157. option. For example, this returns the 10 most relevant messages:
  158. C: C3 SORT RETURN (PARTIAL 1:10) (RELEVANCY) UTF-8 FUZZY TEXT
  159. "World"
  160. S: * ESEARCH (TAG "C3") PARTIAL (1:10 42,9,34,13,15,4,2,7,23,82)
  161. S: C3 OK Sort completed.
  162. 7. Formal Syntax
  163. The following syntax specification uses the augmented Backus-Naur
  164. Form (BNF) as described in [ABNF]. It includes definitions from
  165. [RFC3501], [IMAP-ABNF], and [SORT].
  166. capability =/ "SEARCH=FUZZY"
  167. score = 1*3DIGIT
  168. ;; (1 <= n <= 100)
  169. score-list = "(" [score *(SP score)] ")"
  170. search-key =/ "FUZZY" SP search-key
  171. search-return-data =/ "RELEVANCY" SP score-list
  172. ;; Conforms to <search-return-data>, from [IMAP-ABNF]
  173. search-return-opt =/ "RELEVANCY"
  174. ;; Conforms to <search-return-opt>, from [IMAP-ABNF]
  175. sort-key =/ "RELEVANCY"
  176. 8. Security Considerations
  177. Implementation of this extension might enable denial-of-service
  178. attacks against server resources. Servers MAY limit the resources
  179. that a single search (or a single user) may use. Additionally,
  180. implementors should be aware of the following: Fuzzy search engines
  181. are often complex with non-obvious disk space, memory, and/or CPU
  182. usage patterns. Server implementors should at least test the fuzzy-
  183. search behavior with large messages that contain very long words
  184. and/or unique random strings. Also, very long search keys might
  185. cause excessive memory or CPU usage.
  186. Invalid input may also be problematic. For example, if the search
  187. engine takes a UTF-8 stream as input, it might fail more or less
  188. badly when illegal UTF-8 sequences are fed to it from a message whose
  189. Sirainen Standards Track [Page 5]
  190. RFC 6203 IMAP4 FUZZY Search March 2011
  191. character set was claimed to be UTF-8. This could be avoided by
  192. validating all the input and, for example, replacing illegal UTF-8
  193. sequences with the Unicode replacement character (U+FFFD).
  194. Search relevancy rankings might be susceptible to "poisoning" by
  195. smart attackers using certain keywords or hidden markup (e.g., HTML)
  196. in their messages to boost the rankings. This can't be fully
  197. prevented by servers, so clients should prepare for it by at least
  198. allowing users to see all the search results, rather than hiding
  199. results below a certain score.
  200. 9. IANA Considerations
  201. IMAP4 capabilities are registered by publishing a standards track or
  202. IESG-approved experimental RFC. The "Internet Message Access
  203. Protocol (IMAP) 4 Capabilities Registry" is available from
  204. http://www.iana.org/.
  205. This document defines the SEARCH=FUZZY IMAP capability. IANA has
  206. added it to the registry.
  207. 10. Acknowledgements
  208. Alexey Melnikov, Zoltan Ordogh, Barry Leiba, Cyrus Daboo, and Dave
  209. Cridland have helped with this document.
  210. 11. Normative References
  211. [ABNF] Crocker, D., Ed. and P. Overell, "Augmented BNF for
  212. Syntax Specifications: ABNF", STD 68, RFC 5234,
  213. January 2008.
  214. [CONTEXT] Cridland, D. and C. King, "Contexts for IMAP4",
  215. RFC 5267, July 2008.
  216. [ESEARCH] Melnikov, A. and D. Cridland, "IMAP4 Extension to SEARCH
  217. Command for Controlling What Kind of Information Is
  218. Returned", RFC 4731, November 2006.
  219. [IMAP-ABNF] Melnikov, A. and C. Daboo, "Collected Extensions to
  220. IMAP4 ABNF", RFC 4466, April 2006.
  221. [KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
  222. Requirement Levels", BCP 14, RFC 2119, March 1997.
  223. [RFC3501] Crispin, M., "INTERNET MESSAGE ACCESS PROTOCOL - VERSION
  224. 4rev1", RFC 3501, March 2003.
  225. Sirainen Standards Track [Page 6]
  226. RFC 6203 IMAP4 FUZZY Search March 2011
  227. [RFC5255] Newman, C., Gulbrandsen, A., and A. Melnikov, "Internet
  228. Message Access Protocol Internationalization", RFC 5255,
  229. June 2008.
  230. [SORT] Crispin, M. and K. Murchison, "Internet Message Access
  231. Protocol - SORT and THREAD Extensions", RFC 5256,
  232. June 2008.
  233. Author's Address
  234. Timo Sirainen
  235. EMail: tss@iki.fi
  236. Sirainen Standards Track [Page 7]