Yes, it works for me!
Can I ask one more question here? I've found that the JPMML evaluator produces equal tf-idf values for the original and lowercased strings, even if the transformation dictionary contains both variants. Is there a way to overcome that?
Check your PMML document to see the actual configuration (the TextIndex element should always be present; the TextIndexNormalization element is more specific).
The default value of both case-sensitivity attributes is false, which means that the capitalization of tokens is ignored. If you change it to true, then only tokens that have the correct capitalization will be taken into consideration.
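To illustrate the difference, here is a minimal pure-Python sketch. It is not the JPMML-Evaluator implementation; the function name `term_frequency` and the `is_case_sensitive` flag are hypothetical, chosen only to mirror the meaning of the case-sensitivity setting described above, and the regex tokenizer is an assumption.

```python
import re

def term_frequency(text, term, is_case_sensitive=False):
    """Count whole-token occurrences of `term` in `text` (illustrative only)."""
    # Assumed tokenizer: split on runs of word characters.
    tokens = re.findall(r"\w+", text)
    if not is_case_sensitive:
        # Capitalization is ignored: fold both the tokens and the term.
        tokens = [t.lower() for t in tokens]
        term = term.lower()
    return sum(1 for t in tokens if t == term)

text = "Good movie, good plot, GOOD cast"

# Case-insensitive (the default): "Good" and "good" get the same count.
print(term_frequency(text, "Good"))  # 3
print(term_frequency(text, "good"))  # 3

# Case-sensitive: only tokens with matching capitalization are counted.
print(term_frequency(text, "Good", is_case_sensitive=True))  # 1
print(term_frequency(text, "good", is_case_sensitive=True))  # 1
```

With the case-insensitive default, both variants collapse to the same count, which matches the "equal values" symptom reported above.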
Is the (J)PMML behaviour different from Scikit-Learn behaviour? If so, then it might be worthwhile to open a new issue to implement an appropriate fix. However, be sure to accompany this issue with a reproducible Python code example (e.g. based on the "Sentiment" dataset, which is part of the JPMML-SkLearn integration test suite under the src/test/resources/ directory). I don't have the time to triangulate the potential problem myself.
Is the (J)PMML behaviour different from Scikit-Learn behaviour?
After the fix, it gives the same results.
The goal is that (J)PMML and Scikit-Learn predictions should match by default. So, it might be necessary to revisit the converter for the CountVectorizer (or TfidfVectorizer) transformation, and make sure that all case-sensitivity attributes are initialized properly.
Can I create a pull request for it?
I don't generally accept PRs for IPR (copyrights etc.) reasons.
However, you're welcome to summarize your observations and code changes (e.g. as a patch file), and I will carry them over to the JPMML-SkLearn repository as my original work. I will credit you in the commit message.