From: Eric Wong <bofh@bogomips.org>
To: Rene Maurer <rm@cumparsita.ch>
Cc: dtas-all@nongnu.org
Subject: [PATCH] player: support guessing encodings for comments
Date: Mon, 29 Jan 2018 00:58:27 +0000
Message-ID: <20180129005827.GA25130@starla>
In-Reply-To: <20180111194317.GA6392@starla>
Eric Wong wrote:
> Ugh, this is taking a while. I have a mix of UTF-8 and
> ISO-8859-1 and probably some totally bogus filenames available to me :x
Maybe the following patch is alright; there are a few other things
I want to work on around mlib before I release.
---8<---
Subject: [PATCH] player: support guessing encodings for comments
This can be helpful for end users and is close to what other
players do.  We first try Encoding.default_external (typically
UTF-8); if the bytes are not valid in that encoding, we try
again using `charlock_holmes' if it is installed.

Note: path names remain binary, because that's how proper
filesystems operate.
---
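The first step described above (label the bytes with a candidate
encoding, keep the label only if the bytes are valid) can be sketched
in isolation.  This is a minimal illustration, not the dtas code:
`label_if_valid' is a hypothetical helper name, and the sample byte
strings are made up for the example.

```ruby
# Hypothetical helper (not part of dtas): tag +str+ with +enc+ only if its
# bytes are valid in that encoding; otherwise restore the original tag.
# No bytes are changed, only the encoding label on the String.
def label_if_valid(str, enc)
  old = str.encoding
  str.force_encoding(enc)
  str.force_encoding(old) unless str.valid_encoding?
  str
end

utf8_bytes   = "Rodr\xC3\xADguez".b # UTF-8 bytes, tagged as binary
latin1_bytes = "Rodr\xEDguez".b     # ISO-8859-1 bytes, tagged as binary

label_if_valid(utf8_bytes, Encoding::UTF_8)   # valid: now tagged UTF-8
label_if_valid(latin1_bytes, Encoding::UTF_8) # invalid: stays binary
```

The second string is the case where the patch goes on to ask
`charlock_holmes' for a guess, since a lone 0xED byte is not valid
UTF-8.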
 lib/dtas.rb            |  2 ++
 lib/dtas/encoding.rb   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/dtas/source/sox.rb |  4 +++-
 test/test_encoding.rb  | 20 +++++++++++++++++
 4 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100644 lib/dtas/encoding.rb
 create mode 100644 test/test_encoding.rb
diff --git a/lib/dtas.rb b/lib/dtas.rb
index ac416d7..3c2cdb4 100644
--- a/lib/dtas.rb
+++ b/lib/dtas.rb
@@ -42,3 +42,5 @@ def self.dedupe_str(str)
 require_relative 'dtas/compat_onenine'
 require_relative 'dtas/spawn_fix'
+require_relative 'dtas/encoding'
+DTAS.extend(DTAS::Encoding)
diff --git a/lib/dtas/encoding.rb b/lib/dtas/encoding.rb
new file mode 100644
index 0000000..71c877f
--- /dev/null
+++ b/lib/dtas/encoding.rb
@@ -0,0 +1,58 @@
+# Copyright (C) 2018 all contributors <dtas-all@nongnu.org>
+# License: GPL-3.0+ <https://www.gnu.org/licenses/gpl-3.0.txt>
+# frozen_string_literal: true
+
+# This module is extended into DTAS (see DTAS.extend in lib/dtas.rb)
+module DTAS::Encoding # :nodoc:
+  def self.extended(mod)
+    mod.instance_eval { @charlock_holmes = nil }
+  end
+
+private
+
+  def try_enc_harder(str, enc, old) # :nodoc:
+    if @charlock_holmes.nil? # first call: try to load the detector once
+      begin
+        require 'charlock_holmes'
+        @charlock_holmes = CharlockHolmes::EncodingDetector.new
+      rescue LoadError
+        warn "`charlock_holmes` gem not available for encoding detection"
+        @charlock_holmes = false
+      end
+    end
+    if @charlock_holmes # detector loaded, including on the first call
+      res = @charlock_holmes.detect(str)
+      if det = res[:ruby_encoding]
+        str.force_encoding(det)
+        warn "charlock_holmes detected #{str.inspect} as #{det}..."
+        str.valid_encoding? or enc_fallback(str, det, old)
+      else
+        enc_fallback(str, enc, old)
+      end
+    else # gem unavailable: restore the original encoding
+      enc_fallback(str, enc, old)
+    end
+    str
+  end
+
+  def enc_fallback(str, enc, old) # :nodoc:
+    str.force_encoding(old)
+    warn "could not detect encoding for #{str.inspect} (not #{enc})"
+  end
+
+public
+
+  def try_enc(str, enc, harder = true) # :nodoc:
+    old = str.encoding
+    return str if old == enc
+    str.force_encoding(enc)
+    unless str.valid_encoding?
+      if harder
+        try_enc_harder(str, enc, old)
+      else
+        enc_fallback(str, enc, old)
+      end
+    end
+    str
+  end
+end
diff --git a/lib/dtas/source/sox.rb b/lib/dtas/source/sox.rb
index f702b41..03487fe 100644
--- a/lib/dtas/source/sox.rb
+++ b/lib/dtas/source/sox.rb
@@ -50,17 +50,19 @@ def mcache_lookup(infile)
       out =~ /^Sample Rate\s*:\s*(\d+)/n and dst['rate'] = $1.to_i
       out =~ /^Precision\s*:\s*(\d+)-bit/n and dst['bits'] = $1.to_i
+      enc = Encoding.default_external
       if out =~ /\nComments\s*:[ \t]*\n?(.*)\z/mn
         comments = dst['comments'] = {}
         key = nil
         $1.split(/\n/n).each do |line|
           if line.sub!(/^([^=]+)=/ni, '')
-            key = DTAS.dedupe_str($1.upcase)
+            key = DTAS.dedupe_str(DTAS.try_enc($1.upcase, enc))
           end
           (comments[key] ||= ''.b) << "#{line}\n" unless line.empty?
         end
         comments.each do |k,v|
           v.chomp!
+          DTAS.try_enc(v, enc)
           comments[k] = DTAS.dedupe_str(v)
         end
       end
diff --git a/test/test_encoding.rb b/test/test_encoding.rb
new file mode 100644
index 0000000..d9af968
--- /dev/null
+++ b/test/test_encoding.rb
@@ -0,0 +1,20 @@
+# Copyright (C) 2018 all contributors <dtas-all@nongnu.org>
+# License: GPL-3.0+ <https://www.gnu.org/licenses/gpl-3.0.txt>
+# frozen_string_literal: true
+require './test/helper'
+require 'dtas'
+require 'yaml'
+
+class TestEncoding < Testcase
+  def test_encoding
+    data = <<EOD # <20180111114546.77906b35@cumparsita.ch>
+---
+comments:
+  ARTIST: !binary |-
+    RW5yaXF1ZSBSb2Ryw61ndWV6
+EOD
+    hash = YAML.load(data)
+    artist = DTAS.try_enc(hash['comments']['ARTIST'], Encoding::UTF_8)
+    assert_equal 'Enrique Rodríguez', artist
+  end
+end
--
EW
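
For readers without a dtas checkout, the situation the test covers can
be reproduced with the standard library alone.  This is a sketch, not
part of the patch: it loads the same YAML document, shows that the
`!binary' scalar comes back tagged as binary, and then labels it as
UTF-8 directly with String#force_encoding instead of calling
DTAS.try_enc.

```ruby
require 'yaml'

# Same document as in test/test_encoding.rb; the value is base64-encoded.
data = <<~EOD
  ---
  comments:
    ARTIST: !binary |-
      RW5yaXF1ZSBSb2Ryw61ndWV6
EOD

artist = YAML.load(data)['comments']['ARTIST']
before = artist.encoding # binary (ASCII-8BIT) right after decoding

# Label the bytes as UTF-8 and check that they really are valid UTF-8.
artist.force_encoding(Encoding::UTF_8)
valid = artist.valid_encoding?
```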
Thread overview: 6+ messages
2018-01-11 10:45 dtas-0.15.0 "!binary" in yaml file Rene Maurer
2018-01-11 17:38 ` Eric Wong
2018-01-11 19:43 ` Eric Wong
2018-01-12 9:06 ` Rene Maurer
2018-01-29 0:46 ` Eric Wong
2018-01-29 0:58 ` Eric Wong [this message]