Saturday, May 29, 2021

Finding CSV files that start with a BOM using ripgrep

Finding CSV files that start with a BOM using ripgrep

For sqlite-utils issue 250 I needed to locate some test CSV files that start with a UTF-8 BOM.

Here's how I did that using ripgrep:

$ rg --multiline --encoding none '^(?-u:\xEF\xBB\xBF)' --glob '*.csv' .

The --multiline option means the search spans multiple lines - I only want to match entire files that begin with my search term, so this means that ^ will match the start of the file, not the start of individual lines.

--encoding none runs the search against the raw bytes of the file, disabling ripgrep's default BOM detection.

--glob '*.csv' causes ripgrep to search only CSV files.

The regular expression itself looks like this:

^(?-u:\xEF\xBB\xBF)

This is rust regex syntax.

(?-u: means "turn OFF the u flag for the duration of this block" - the u flag, which is on by default, causes the Rust regex engine to interpret input as unicode. So within the rest of that (...) block we can use escaped byte sequences.

Finally, \xEF\xBB\xBF is the byte sequence for the UTF-8 BOM itself.

Created 2021-05-28T22:23:45-07:00 · Edit



from Hacker News https://ift.tt/3wHJQiB

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.