[Java] Weird parsing error [univocity-parsers]

sksamuel 2017-10-9 265

Sometimes, say once every 5 attempts, I get an error where only part of the row is parsed.
It manifests itself by returning only the first character of the row as the "complete row".
I've tried input streams, byte arrays, etc., and it's always the same thing.
Version 2.5.6

I know this is a rubbish bug report but I don't know what else to say, other than the error is always that the row is parsed as a single character.

Latest comments (7)
jbax 2017-10-9
1

Thanks for reporting! It sounds like a duplicate of #186, as something like that was the root cause of that issue.

Can you share the code you are using to test this with byte arrays?

sksamuel 2017-10-9
2

Yeah it does seem similar to that. The code I use is part of a larger code base.

The csv code is here:
https://github.com/51zero/eel-sdk/blob/master/eel-core/src/main/scala/io/eels/component/csv/CsvPublisher.scala#L25-L52

Which then shows up in the csv tests here:
https://github.com/51zero/eel-sdk/blob/master/eel-core/src/test/scala/io/eels/component/csv/CsvSourceTest.scala

And also here:
https://github.com/51zero/eel-sdk/blob/master/eel-core/src/test/scala/io/eels/datastream/DataStreamTest.scala#L14-L16

I've played about locally with extra logging to print out each record before my codebase fails the tests because the rows are malformed.

I think it might be concurrency related, because when I use the debugger I cannot reproduce the error, which suggests that the slowness of the debugger is giving the parsers extra time to do something.

A logging example...

In this one, the values array contains only a single quote mark, but should contain a row.

java.lang.IllegalArgumentException: requirement failed: Row should have a value for each field (11 fields=first_name,last_name,company_name,address,city,county,postal,phone1,phone2,email,web, 1 values=")

This happened in 4 out of the 47 tests.

sksamuel 2017-10-9
3

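I've written this:

import java.util.concurrent.{Executors, TimeUnit}

import org.apache.commons.io.IOUtils
import org.scalatest.{FunSuite, Matchers}

// CsvSource and the Logging trait come from the eel-sdk code base.
class CsvTest extends FunSuite with Matchers with Logging {

  test("concurrent loading") {
    // 20 worker threads parse the same file 1000 times in total.
    val executor = Executors.newFixedThreadPool(20)
    for (k <- 1 to 1000) {
      executor.execute(new Runnable {
        override def run() = {
          try {
            val data = IOUtils.toByteArray(getClass.getResourceAsStream("/uk-500.csv"))
            val source = CsvSource(data)
            // Each run is expected to collect 500 rows; a failure is logged, not rethrown.
            require(source.toDataStream.collect.size == 500)
          } catch {
            case e: Exception =>
              logger.error("oops", e)
          }
        }
      })
    }
    // Stop accepting new tasks and wait for the submitted ones to finish.
    executor.shutdown()
    executor.awaitTermination(1, TimeUnit.DAYS)
  }
}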

And I get lots of these:

com.univocity.parsers.common.TextParsingException: java.io.IOException - Stream closed
Parser Configuration: CsvParserSettings:
	Auto configuration enabled=true
	Autodetect column delimiter=false
	Autodetect quotes=false
	Column reordering enabled=true
	Empty value=null
	Escape unquoted values=false
	Header extraction enabled=false
	Headers=null
	Ignore leading whitespaces=true
	Ignore trailing whitespaces=true
	Input buffer size=1048576
	Input reading on separate thread=true
	Keep escape sequences=false
	Keep quotes=false
	Length of content displayed on error=-1
	Line separator detection enabled=true
	Maximum number of characters per column=4096
	Maximum number of columns=512
	Normalize escaped line separators=true
	Null value=null
	Number of records to read=all
	Processor=none
	Restricting data in exceptions=false
	RowProcessor error handler=null
	Selected fields=none
	Skip bits as whitespace=true
	Skip empty lines=true
	Unescaped quote handling=nullFormat configuration:
	CsvFormat:
		Comment character=#
		Field delimiter=,
		Line separator (normalized)=\n
		Line separator sequence=\n
		Quote character="
		Quote escape character="
		Quote escape escape character=null
Internal state when error was thrown: line=1, column=0, record=1, charIndex=107, headers=[first_name, last_name, company_name, address, city, county, postal, phone1, phone2, email, web]
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:364) ~[univocity-parsers-2.5.6.jar:2.5.6]
	at com.univocity.parsers.common.AbstractParser.stopParsing(AbstractParser.java:427) ~[univocity-parsers-2.5.6.jar:2.5.6]
	at io.eels.component.csv.CsvSource$$anonfun$schema$1.apply(CsvSource.scala:87) ~[classes/:?]
	at io.eels.component.csv.CsvSource$$anonfun$schema$1.apply(CsvSource.scala:63) ~[classes/:?]
	at scala.Option.getOrElse(Option.scala:121) ~[scala-library-2.11.11.jar:?]
	at io.eels.component.csv.CsvSource.schema(CsvSource.scala:63) ~[classes/:?]
	at io.eels.component.csv.CsvSource.parts(CsvSource.scala:93) ~[classes/:?]
	at io.eels.datastream.DataStreamSource.subscribe(DataStreamSource.scala:17) ~[classes/:?]
	at io.eels.datastream.DataStream$class.collect(DataStream.scala:860) ~[classes/:?]
	at io.eels.datastream.DataStreamSource.collect(DataStreamSource.scala:11) ~[classes/:?]
	at io.eels.component.csv.CsvTest$$anonfun$1$$anonfun$apply$mcZ$sp$1$$anon$1.run(CsvTest.scala:19) [test-classes/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
Caused by: java.io.IOException: Stream closed
	at sun.nio.cs.StreamDecoder.ensureOpen(StreamDecoder.java:46) ~[?:1.8.0_45]
	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:148) ~[?:1.8.0_45]
	at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_45]
	at com.univocity.parsers.common.input.concurrent.CharBucket.fill(CharBucket.java:70) ~[univocity-parsers-2.5.6.jar:2.5.6]
	at com.univocity.parsers.common.input.concurrent.ConcurrentCharLoader.run(ConcurrentCharLoader.java:84) ~[univocity-parsers-2.5.6.jar:2.5.6]
	... 1 more

jbax 2017-10-9
4

Thanks, I cloned your repo to get your uk-500.csv file and I am able to consistently get errors when parsing it in this test case.

Variations of this issue have been happening since version 2.5.0, which introduced support for handling BOM markers. I am investigating and will hopefully get this nailed soon, but essentially it only happens when setReadInputOnSeparateThread = true AND you don't provide an encoding for your InputStream/File. This triggers the BomInput, which reads the first few bytes in order to detect the file encoding if possible.
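
To make the BOM detection step concrete, here is a minimal sketch of the general idea in plain Java: peek at the first few bytes of a stream, map a recognised byte order mark to a charset, and push back anything that was not part of a BOM. This is not univocity's BomInput. The names BomSniffer, Sniffed and sniff are made up for illustration, and only the UTF-8 and UTF-16 marks are handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of univocity-parsers.
public class BomSniffer {

    /** Outcome of sniffing: the charset implied by the BOM (null if none) and the remaining stream. */
    public static final class Sniffed {
        public final Charset charset;
        public final InputStream stream;

        Sniffed(Charset charset, InputStream stream) {
            this.charset = charset;
            this.stream = stream;
        }
    }

    public static Sniffed sniff(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to 3 bytes; a single read() call may return fewer than requested.
        int n = 0;
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        Charset charset = null;
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        }

        // Push back every byte that was read but is not part of the BOM.
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new Sniffed(charset, pin);
    }
}
```

When no mark is recognised, the returned charset is null and the stream still yields every original byte, so the caller has to fall back to a default encoding.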

In the meantime, the workaround to have things working consistently is to either:

  • provide the encoding explicitly
  • set setReadInputOnSeparateThread = false
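
The first workaround can be sketched without the library at all: if the bytes are decoded with a known charset before parsing, there is nothing left to detect. The snippet below is a hedged illustration in plain Java. ExplicitEncoding and its methods are hypothetical names, UTF-8 is an assumption about the file, and the univocity calls mentioned in the trailing comment are written from memory of its API, so check them against the version you use.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of univocity-parsers.
public class ExplicitEncoding {

    /** Decodes the raw bytes as UTF-8 (an assumption about the file) instead of guessing. */
    public static Reader utf8Reader(byte[] csvBytes) {
        return new InputStreamReader(new ByteArrayInputStream(csvBytes), StandardCharsets.UTF_8);
    }

    /** Stand-in for a parser: counts the non-empty lines delivered by the reader. */
    public static int countNonEmptyLines(Reader reader) throws IOException {
        int count = 0;
        try (BufferedReader lines = new BufferedReader(reader)) {
            String line;
            while ((line = lines.readLine()) != null) {
                if (!line.isEmpty()) {
                    count++;
                }
            }
        }
        return count;
    }

    // With univocity-parsers on the classpath, the two workarounds would look roughly like:
    //   settings.setReadInputOnSeparateThread(false);   // on a CsvParserSettings instance
    //   parser.parseAll(inputStream, "UTF-8");          // encoding passed explicitly
}
```

Either workaround on its own should be enough according to the comment above; fixing the encoding keeps the background reader thread, while disabling the thread keeps automatic encoding detection.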
sksamuel 2017-10-9
5

Fast work! I'll use the workaround now, thanks.

jbax 2017-10-9
6

Fixed and released a 2.5.7-SNAPSHOT build that hopefully puts an end to this. Thank you for reporting the issue and for using our parsers!

jbax 2017-10-9
7

Just letting you know that version 2.5.7 has been released with this fix. Cheers!
