-
Notifications
You must be signed in to change notification settings - Fork 21
Not indexed value support (MissingValue, EmptyValue) #74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@icemac unfortunately, CI for python3.8-dev is broken. |
|
That's clearly an issue with the build environment, not with your code... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KeywordIndex line 76: insertNotIndexed should not have the argument newKeywords or should have a different name (such as insertSpecialIndexEntry).
KeywordIndex line 75: you may need to remove not indexed entries if oldkeywords is None.
KeywordIndex lines 81, 88: you want oldkeywords to be an OOSet but you store it as a list.
KeywordIndex line 84: I expect an insertNotIndex(...) somewhere in this else block (as there was one in the then block). In addition, it might be necessary to remove a potential MissingValue from the special index.
KeywordIndex line 120: not sure that an exception from the call should be silently swallowed. I suggest to at least log an entry.
KeywordIndex lines 149, 73: inconsistent check for missing _unindex entry (_marker versus None).
KeywordIndex: I suggest to rename ...NotIndexed to ...SpeciallyIndexed (or something similar) - as the document is indexed, just not in the "normal" way.
KeywordIndex, line 154: maybe, you do not need this line (in case, that _unindex is set to [] for an empty value (as this is the case for the old KeywordIndex).
The names MissingValue and EmptyValue are in "CamelCase" which indicates a class. Maybe, we should avoid "CamelCasing" for them to be more conformant with PEP8.
interfaces line 288: NotIndexedValue should not be used by the application; maybe, we want to indicate this by prefixing the name with _.
KeywordIndex - "pure not": I have not seen a "pure not" special handling for KeywordIndex. However, if we include MissingValue in a "pure not" for "UnIndex", we must as well include EmptyValue for KeywordIndex -- this could be done in UnIndex (to avoid code duplication in KeywordIndex.
UnIndex, line 565: the "pure not" should likely be implemented via _unindex rather than an enumeration of all keys (which may cause a huge multiunion to be executed). In addition: KeywordIndex might need that documents indexed under EmptyValue are included (by default) in a "pure not" result.
|
@d-maurer thanks for the helpful comments. However, the old implementation of KeywordIndex does not keep empty values like '()' in _unindex. If we want to use _unindex for 'pure not' queries, then |
|
Andreas Gabriel wrote at 2019-5-8 13:57 -0700:
@d-maurer thanks for the helpful comments. However, the old implementation of KeywordIndex does not keep empty values like '()' in _unindex.
You are right.
If we want to use _unindex for 'pure not' queries, then `_unindex` should also collect the special values.
You can do it -- but this is not a must:
the index knows what is in its `_unindex` and can compensate
as necessary. E.g. if it knows that `_unindex` does not contain
empty values, it could implement the "pure not" based on
"union(IISet(self._unindex), self._empty_...)".
|
|
@andbag As I wrote in #64, I believe that the current behaviour is not what was really intended: it would be much more natural if the first rather than the last attribute with a value succeeds. Maybe, we use the opportunity to document what it should mean the an index indexes several attributes, and maybe, we change the order in the process. Whether or not we do something about the documentation for the "several indexed attributes" case (or even change the order), we must ensure that an attribute with a value has precedence over one without a value. We can distinguish both cases by checking the return value of I am unsure how to handle the case "empty value" (in contrast to "missing value"): should an attribute with a non empty value have precedence over one with an empty value? This question is relevant only for |
|
Unfortunately, there are no tests yet that check the current behavior for multiple indexed attributes. I will submit a new PR for these tests so that we don't lose track if we change the current behavior. |
|
@d-maurer My observations show that the last attribute always wins, regardless of whether the value of last indexed attribute is set or not. The same applies to the existence of the last indexed attribute. Following test is based on code of master branch: If the last attribute has a value, it is stored in the index. In this regard, the option "indexed attributes" has no effect :(. That's why I don't think anyone's using the feature. |
|
Andreas Gabriel wrote at 2019-5-14 08:04 -0700:
@d-maurer My observations show that the last attribute always wins, regardless of whether the value of last indexed attribute is set or not.
...
Apparently, it is indeed worse than I had thought.
Obviously a bug. It makes no sense to let the name indicate
that an index can index several attributes while the index actually
indexes just the last one specified.
...
In this regard, the option "indexed attributes" has no effect :(. That's why I don't think anyone's using the feature.
You might be right.
I see two options:
* we raise an exception when more than a single attribute is indexed
* we document the feature "indexed attributes" and ensure that
the implementation follows the documentation -- at least
for "our own" indexes.
|
I prefer option one, because I don't want to implement features that nobody apparently requires. This feature can be much better implemented using a method that is executed by calling the single "indexed attribute". Which error fits best? TypeError or NotImplementedError? |
|
Andreas Gabriel wrote at 2019-5-15 08:18 -0700:
...
Which error fits best? TypeError or NotImplementedError?
I would opt for `NotImplementedError` (with a good description
what is not implemented).
|
|
Andreas Gabriel wrote at 2019-5-27 08:43 -0700:
@d-maurer
> * `None` value is interpreted as "missing" (maybe, this should indeed be the case;
> it should then be documented, e.g. in the `IIndexingMissingValue` interface)
I have no particular preference. In the old implementation, `None` was not indexed either. In contrast, an empty string was indexed. If FieldIndex implements `IIndexingEmptyValue`, empty string is currently interpreted as special value `empty`.
I recommend to treat the empty string as a "normal" (not a special)
value.
I recommend to treat `None` as a missing value (and document this).
Currently `_get_object_datum` returns `missing`, if the calculation results in `NoneType`, `AttributeError` or `TypeError` for ***@***.***(IIndexingMissingValue)`. Otherwise, `empty` is returned when the truth value test returns `False` for ***@***.***(IIndexingEmptyValue)`.
That definitely is not good enough: `0` (= zero) is a perfectly
"normal" value for an index with integer values.
And, in my view, `''` is a "normal" value for an index with `str` values.
There is no need to introduce a special value, if you can search
for the value in the "normal" way.
We have introduced the special values, because they allow us to
search for things not searchable in the "normal" way.
|
|
@d-maurer I've corrected the code and generalized it a bit. Before I improve the code, it would be nice if you could have a look at my changes. Especially the mapping of the special value mapping can now be configured and the purpose is documented in Products.ZCatalog/src/Products/PluginIndexes/interfaces.py Lines 306 to 308 in 6d5fa3e
The implementation looks like this
|
I find the idea good but have a few suggestions:
|
| try: | ||
| self.insertForwardIndexEntry(kw, documentId) | ||
| keys.append(kw) | ||
| except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt that this exception handling is right: it does not index the object if one key cannot be indexed - and the problem is only reported via a log entry.
In my opinion, other alternatives would be better:
- log then ignore keys not indexable
- do not catch the exception (and let the whole operation fail)
- handle this
TypeErrorin the same way as if it had occurred during the object value determination (e.g. map tomissing).
In any case, the logic is at the wrong place. One would need similar logic for "update existing index info" and it should not be duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original version there was a bug that could lead to inconsistencies in the index. Also, the problem was not logged. In order not to have to abolish the old behavior completely, I would prefer the first variant. Consequently, _unindex is only allowed to store indexable keywords.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: Since the type OOSet is forced for keywords in the meantime, a TypeError can also be raised under python3 e.g. in the method map_value. For consistency reasons, TypeError is now always handled in the same way when determining the attribute value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@d-maurer I'm beginning to wonder if it wouldn't be more sensible to escalate TypeError when a value in the keyword list is incompatible with the already indexed values. Otherwise the new values would have to be pre-validated before being indexed.
|
|
||
| newKeywords = OOSet(newKeywords) | ||
|
|
||
| self._unindex[documentId] = newKeywords |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _unindex update could be done together with the "update existing index info" case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay
| else: | ||
| try: | ||
| # unique | ||
| newKeywords = set(newKeywords) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At another place, the keywords are collected in an OOSet. Using different set types increases the constraints placed on the usable types for keywords: OOSet requires orderability (as the BTrees package as a whole); set requires hashability. I recommend to use OOSet uniformly (and avoid the tuple recasting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay
@d-maurer I just can't imagine how I can implement such a method generically. In the end, the |
|
Andreas Gabriel wrote at 2019-5-31 02:43 -0700:
> * `KeywordIndex` maps `()` to `empty`. However, a `KeywordIndex` related
> object value could be any sequence, not just `tuple`. It might be better to
> replace the `dict` by methods (e.g. `map_value` ("map _value_ to a special value,
> if necessary") and `map_exception_to_special_value`).
@d-maurer I just can't imagine how I can implement such a method generically. In the end, the `_get_object_datum` method in combination with `map_exception_to_special_value` already serves the purpose, doesn't it?
@andbag Likely, you cannot have a fully generic implementation. Just a
sensible default which a derived class can override if the default
does not fit.
All is a question of class architecture.
In your architecture, we have an `_index_object` and a `_get_object_datum`
(which you name `_get_object_keywords` for `KeywordIndex`).
`_index_object` essentially gets an attribute name. Its task is to determine
the object's value from this attribute and then updates the index accordingly.
It uses `_get_object_datum` for the first subtask.
The task of `_get_object_datum` is to produce a value the index can
handle, either a "normal" or a "special" value.
Currently, you define a local function `_getSpecialValueFor`
inside `_get_object_{datum,keywords}` (you definitely want to put this
on the index level) which "generically" recognizes cases which require
the mapping to a special value (using the dict `special_values`).
I suggest a slightly different architecture. It has methods
`_index_object`, `_get_object_datum` (which likely should get official
rather than private) and `map_value` and
a tuple attribute `exceptions_treated_as_missing`.
In this architecture, `_get_object_datum` could look like:
```
exceptions_treated_as_missing = AttributeError, TypeError,
def _get_object_datum(self, obj, attr):
try:
datum = getattr(obj, attr)
if safe_callable(datum):
datum = datum()
return self.map_value(datum)
except self.exceptions_treated_as_missing:
return missing
```
The default implementation for `map_value` could be:
```
def map_value(self, value):
return missing if value is None else value
```
`KeywordIndex` could override `map_value` as
```
def map_value(self, value):
value = super(KeywordIndex,self).map_value(value)
if value is not missing:
... handle atomic value ...
# at this place, *value* is expected to be a sequence
value = empty if not value else OOSet(value)
return value
```
With Python 3, a new limitation arises: a `BTree` can no longer
have keys of different types (exception `None`).
`TypeError`'s are raised when the condition is violated.
The implementation draft above would result in a `missing`
when an object has keywords of different type.
Maybe, we do not want this behaviour.
|
|
What a pity that Python 3.8 segfaults when starting the test. I cleaned the caches and tried to restart the Python 3.8 job. |
|
Cool, cleaning the TravisCI cache seems to do the trick. |
| from Products.PluginIndexes.unindex import _marker | ||
| from Products.ZCatalog.query import IndexQuery | ||
|
|
||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having used similar code for Python 2/3 compatibility, I have been directed to use six instead. Consistently using six for Python 2/3 compatibility will facilitate code cleanup once Python 2 support is dropped.
| return tuple(pkl) | ||
| return OOSet(pkl) | ||
|
|
||
| def _get_component_datum(self, obj, attr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This almost looks like get_object_datum. Are you sure you need this special definition?
| else: | ||
| try: | ||
| self.index_objectKeywords(documentId, newKeywords) | ||
| except self.exceptions_treated_as_missing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here does not yet seem correct: assume newKeywords is a special value - but not one we want to support. It then goes into index_objectKeywords (which will fail because it is a special value).
| index=self.id)) | ||
| if self.providesSpecialIndex(missing): | ||
| newKeywords = missing | ||
| self.insertSpecialIndexEntry(missing, documentId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here seems not yet correct: assume keywords "a", 1, "b". The index_objectKeywords will have failed after it has indexed "a" and add the document to the missing index as well. First while keywords "a" and "b" are similar, they are not treated similar; second it may surprise to have an object both in a "normal" index as well as the "missing" index.
Despite the appearance, the logic could be right: you may already have ensured at a different place that newkeywords contains only keywords of the same type. In this case, if index_objjectKeywords fails at all, it will fail with the first keyword. I suggest to add a corresponding comment in this case.
| doc_id=documentId, | ||
| index=self.id)) | ||
|
|
||
| if self.providesSpecialIndex(missing): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already seen this quite complex logic before. I would centralize it (maybe in a locally defined function) to have it in a single place.
| return value | ||
|
|
||
| def index_objectKeywords(self, documentId, keywords): | ||
| """ carefully index keywords of object with integer id 'documentId' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are no longer "carefull" here. Likely, there is no longer any need because you have ensured elsewhere that keywords is homogenous and the indexing will fail with the first element if it fails at all.
| newSet = newKeywords = OOSet(newKeywords) | ||
|
|
||
| try: | ||
| fdiff = difference(oldSet, newSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail if the keywords change type - and will let your index in a strange state. Assume "oldSet" to be OOSet(['a', 'b']), "newSet" to be OOSet([1, 2]). Then under Python3, the differencewill result in aTypeErrorwhich lets your document remain indexed underoldSetand gets newly indexed undermissing`.
Under Python 3, all indexed values must have a common type. Changing the keyword type will therefore not work (apart from constructed cases). Therefore, you should let an exception from the difference calls propagate (maybe log and provide a more specific error message) and not turn it into missing.
|
|
||
| def unindex_objectKeywords(self, documentId, keywords): | ||
| """ carefully unindex the object with integer id 'documentId'""" | ||
| """ carefully unindex keywords of object with integer id 'documentId' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are not "carefull" here (drop the word).
| special value query term.""" | ||
|
|
||
| def map_value(value): | ||
| """ Map value, which is typically not generically indexable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "is typically not" is wrong.
I recommend:
def map_value(value):
"""Map (original) value to the value that should get indexed.
The (original) value obtained from the object might not be indexable in the normal way.
`map_value` gives you the chance to map it to a different, usually a special value in this case.
"""
|
Andreas Gabriel wrote at 2019-6-26 00:34 -0700:
andbag commented on this pull request.
> for kw in newKeywords:
- self.insertForwardIndexEntry(kw, documentId)
- if newKeywords:
- self._unindex[documentId] = list(newKeywords)
- except TypeError:
- return 0
+ try:
+ self.insertForwardIndexEntry(kw, documentId)
+ keys.append(kw)
+ except TypeError:
@d-maurer I'm beginning to wonder if it wouldn't be more sensible to escalate TypeError when a value in the keyword list is incompatible with the already indexed values. Otherwise the new values would have to be pre-validated before being indexed.
I am with you in this regard.
|
As discussed in issue #35 UnIndex can now support queries on MissingValue and EmptyValue. KeywordIndex implements currently the the new feature. I hope for active feedback.