everything related to duct tape audio suite (dtas)
* dtas-0.15.0 "!binary" in yaml file
@ 2018-01-11 10:45 Rene Maurer
  2018-01-11 17:38 ` Eric Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Rene Maurer @ 2018-01-11 10:45 UTC
  To: dtas-all

Hello

After updating to dtas-0.15.0, I see the following data in the output
of dtas-ctl:

,----
| comments:
|     TRACKNUMBER: '7'
|     TIN: '07898659721703'
|     DATE: '1940-06-04'
|     DISCNUMBER: '1'
|     LABELID: '89061'
|     ORGANIZATION: BATC Diegon - para bailar
|     GENRE: Tango
|     TITLE: En la buena y en la mala *
|     ARTIST: !binary |-
|       RW5yaXF1ZSBSb2Ryw61ndWV6
|     PERFORMER: Armando Moreno
|     GROUPING: !binary |-
|       Um9kcsOtZ3VleiwgTW9yZW5v
|     VERSION: T-1940
`----

ARTIST and GROUPING contain a non-ASCII character (the 'í' in
'Rodríguez').

My yaml parser was not able to handle this by default. This is now
fixed in my code.

Is it possible to have (for example) UTF-8 data in the output instead
of "!binary"? What is the purpose of the "!binary" constructor?

Sorry, this may be a beginner's question, as I do not know YAML at all.

Besides, I have the impression that the 0.15.0 release notes are not
available on https://80x24.org/dtas/NEWS. 

Thanks,
René
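
On the purpose of "!binary": a minimal sketch, assuming Ruby's bundled
YAML library (Psych) produced the output above.  Psych emits "!binary"
plus Base64 for any String that is labelled binary (ASCII-8BIT) and
contains non-ASCII bytes, so the payload round-trips byte for byte.
The `artist` value and the hash layout below are illustrative, not
taken from dtas itself.

```ruby
require 'yaml'

# Tag bytes as a player might read them from a file: valid UTF-8 bytes,
# but labelled binary because no encoding was ever declared for them.
artist = "Enrique Rodr\xC3\xADguez".b

doc = YAML.dump('comments' => { 'ARTIST' => artist, 'GENRE' => 'Tango' })
puts doc # ARTIST is written as "!binary" plus Base64; GENRE stays plain

# Loading undoes the Base64 step and yields a binary-labelled String.
loaded = YAML.load(doc)['comments']['ARTIST']

# A consumer that assumes the tags are UTF-8 can relabel the bytes
# (no bytes change) and check that the assumption actually holds.
text = loaded.dup.force_encoding(Encoding::UTF_8)
puts text if text.valid_encoding?
```

A parser in another language can do the same thing: Base64-decode the
scalar, then interpret the bytes in whatever encoding it expects.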



* Re: dtas-0.15.0 "!binary" in yaml file
  2018-01-11 10:45 dtas-0.15.0 "!binary" in yaml file Rene Maurer
@ 2018-01-11 17:38 ` Eric Wong
  2018-01-11 19:43   ` Eric Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2018-01-11 17:38 UTC
  To: Rene Maurer; +Cc: dtas-all

Rene Maurer <rm@cumparsita.ch> wrote:
> Is it possible to have (for example) UTF-8 data in the output instead
> of "!binary"? What is the purpose of the "!binary" constructor?

Yes.  We could add checks and convert to the user's preferred
encoding.  dtas loads everything as binary by default because
tags are too varied to be assumed to be UTF-8 (or ISO-8859-1).
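
To illustrate why the encoding cannot simply be assumed, here is a
small sketch using only core String methods.  The two byte strings
are made-up examples of the same name as it might be stored by two
different taggers; they are not taken from dtas or from the files
discussed here.

```ruby
# The same name, stored by two hypothetical taggers.
latin1_bytes = "Rodr\xEDguez".b     # ISO-8859-1: 'í' is the single byte 0xED
utf8_bytes   = "Rodr\xC3\xADguez".b # UTF-8: 'í' is the two bytes 0xC3 0xAD

# Assuming UTF-8 breaks on the ISO-8859-1 file ...
as_utf8 = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding? # false: a lone 0xED is not valid UTF-8

# ... while assuming ISO-8859-1 never fails, since every byte is a
# valid character there, but it garbles UTF-8 input instead.
mojibake = utf8_bytes.dup.force_encoding(Encoding::ISO_8859_1)
                     .encode(Encoding::UTF_8)
puts mojibake.length # one character longer than the real name
```

Keeping the bytes labelled binary avoids both failure modes, at the
cost of "!binary" showing up in the YAML output.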

> Sorry, this may be a beginner's question, as I do not know YAML at all.
> 
> Besides, I have the impression that the 0.15.0 release notes are not
> available on https://80x24.org/dtas/NEWS. 

Oops, will fix.  Thanks for the heads up.


I'm also not sure why your post was moderated, mailman seems to
want to encourage people to subscribe before posting; will try
to get it fixed.



* Re: dtas-0.15.0 "!binary" in yaml file
  2018-01-11 17:38 ` Eric Wong
@ 2018-01-11 19:43   ` Eric Wong
  2018-01-12  9:06     ` Rene Maurer
  2018-01-29  0:58     ` [PATCH] player: support guessing encodings for comments Eric Wong
  0 siblings, 2 replies; 6+ messages in thread
From: Eric Wong @ 2018-01-11 19:43 UTC
  To: Rene Maurer; +Cc: dtas-all

Eric Wong <e@80x24.org> wrote:
> Rene Maurer <rm@cumparsita.ch> wrote:
> > Is it possible to have (for example) UTF-8 data in the output instead
> > of "!binary"? What is the purpose of the "!binary" constructor?
> 
> Yes.  We could add checks and convert to the user's preferred
> encoding.  dtas loads everything as binary by default because
> tags are too varied to be assumed to be UTF-8 (or ISO-8859-1).

Ugh, this is taking a while.  I have a mix of UTF-8 and
ISO-8859-1 and probably some totally bogus filenames available to me :x

> I'm also not sure why your post was moderated, mailman seems to
> want to encourage people to subscribe before posting; will try
> to get it fixed.

Trying with a different address...



* Re: dtas-0.15.0 "!binary" in yaml file
  2018-01-11 19:43   ` Eric Wong
@ 2018-01-12  9:06     ` Rene Maurer
  2018-01-29  0:46       ` Eric Wong
  2018-01-29  0:58     ` [PATCH] player: support guessing encodings for comments Eric Wong
  1 sibling, 1 reply; 6+ messages in thread
From: Rene Maurer @ 2018-01-12  9:06 UTC
  To: dtas-all

Eric Wong <normalperson@yhbt.net> wrote:

>> Yes.  We could add checks and convert to the user's preferred
>> encoding.  dtas loads everything as binary by default because
>> tags are too varied to be assumed to be UTF-8 (or ISO-8859-1).

I understand.
Thank you for pointing this out.

> Trying with a different address...

Sorry, yes.
I have switched back to my other email address.

Best, René



* Re: dtas-0.15.0 "!binary" in yaml file
  2018-01-12  9:06     ` Rene Maurer
@ 2018-01-29  0:46       ` Eric Wong
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2018-01-29  0:46 UTC
  To: Rene Maurer; +Cc: dtas-all

Rene Maurer <rmnet@mailc.net> wrote:
> I have switched back to my other email address.

No need.  There's no good reason to restrict posting based on
address or subscription status.  Something is wacky with nongnu.org...



* [PATCH] player: support guessing encodings for comments
  2018-01-11 19:43   ` Eric Wong
  2018-01-12  9:06     ` Rene Maurer
@ 2018-01-29  0:58     ` Eric Wong
  1 sibling, 0 replies; 6+ messages in thread
From: Eric Wong @ 2018-01-29  0:58 UTC
  To: Rene Maurer; +Cc: dtas-all

Eric Wong wrote:
> Ugh, this is taking a while.  I have a mix of UTF-8 and
> ISO-8859-1 and probably some totally bogus filenames available to me :x

Maybe the following patch is alright; there are a few other things I
want to work on around mlib before I release.

---8<---
Subject: [PATCH] player: support guessing encodings for comments

This can be helpful for end users and is close to what other
players use.  We can fall back to Encoding.default_external by
default (typically UTF-8) and then again using `charlock_holmes'
if installed.

Note: path names remain binary, because that's how proper
filesystems operate.
---
 lib/dtas.rb            |  2 ++
 lib/dtas/encoding.rb   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/dtas/source/sox.rb |  4 +++-
 test/test_encoding.rb  | 20 +++++++++++++++++
 4 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100644 lib/dtas/encoding.rb
 create mode 100644 test/test_encoding.rb

diff --git a/lib/dtas.rb b/lib/dtas.rb
index ac416d7..3c2cdb4 100644
--- a/lib/dtas.rb
+++ b/lib/dtas.rb
@@ -42,3 +42,5 @@ def self.dedupe_str(str)
 
 require_relative 'dtas/compat_onenine'
 require_relative 'dtas/spawn_fix'
+require_relative 'dtas/encoding'
+DTAS.extend(DTAS::Encoding)
diff --git a/lib/dtas/encoding.rb b/lib/dtas/encoding.rb
new file mode 100644
index 0000000..71c877f
--- /dev/null
+++ b/lib/dtas/encoding.rb
@@ -0,0 +1,58 @@
+# Copyright (C) 2018 all contributors <dtas-all@nongnu.org>
+# License: GPL-3.0+ <https://www.gnu.org/licenses/gpl-3.0.txt>
+# frozen_string_literal: true
+
+# This module gets extended into DTAS (see lib/dtas.rb)
+module DTAS::Encoding # :nodoc:
+  def self.extended(mod)
+    mod.instance_eval { @charlock_holmes = nil}
+  end
+
+private
+
+  def try_enc_harder(str, enc, old) # :nodoc:
+    case @charlock_holmes
+    when nil
+      begin
+        require 'charlock_holmes'
+        @charlock_holmes = CharlockHolmes::EncodingDetector.new
+      rescue LoadError
+        warn "`charlock_holmes` gem not available for encoding detection"
+        @charlock_holmes = false
+      end
+    when false
+      enc_fallback(str, enc, old)
+    else
+      res = @charlock_holmes.detect(str)
+      if det = res[:ruby_encoding]
+        str.force_encoding(det)
+        warn "charlock_holmes detected #{str.inspect} as #{det}..."
+        str.valid_encoding? or enc_fallback(str, det, old)
+      else
+        enc_fallback(str, enc, old)
+      end
+    end
+    str
+  end
+
+  def enc_fallback(str, enc, old) # :nodoc:
+    str.force_encoding(old)
+    warn "could not detect encoding for #{str.inspect} (not #{enc})"
+  end
+
+public
+
+  def try_enc(str, enc, harder = true) # :nodoc:
+    old = str.encoding
+    return str if old == enc
+    str.force_encoding(enc)
+    unless str.valid_encoding?
+      if harder
+        try_enc_harder(str, enc, old)
+      else
+        enc_fallback(str, enc, old)
+      end
+    end
+    str
+  end
+end
diff --git a/lib/dtas/source/sox.rb b/lib/dtas/source/sox.rb
index f702b41..03487fe 100644
--- a/lib/dtas/source/sox.rb
+++ b/lib/dtas/source/sox.rb
@@ -50,17 +50,19 @@ def mcache_lookup(infile)
       out =~ /^Sample Rate\s*:\s*(\d+)/n and dst['rate'] = $1.to_i
       out =~ /^Precision\s*:\s*(\d+)-bit/n and dst['bits'] = $1.to_i
 
+      enc = Encoding.default_external
       if out =~ /\nComments\s*:[ \t]*\n?(.*)\z/mn
         comments = dst['comments'] = {}
         key = nil
         $1.split(/\n/n).each do |line|
           if line.sub!(/^([^=]+)=/ni, '')
-            key = DTAS.dedupe_str($1.upcase)
+            key = DTAS.dedupe_str(DTAS.try_enc($1.upcase, enc))
           end
           (comments[key] ||= ''.b) << "#{line}\n" unless line.empty?
         end
         comments.each do |k,v|
           v.chomp!
+          DTAS.try_enc(v, enc)
           comments[k] = DTAS.dedupe_str(v)
         end
       end
diff --git a/test/test_encoding.rb b/test/test_encoding.rb
new file mode 100644
index 0000000..d9af968
--- /dev/null
+++ b/test/test_encoding.rb
@@ -0,0 +1,20 @@
+# Copyright (C) 2018 all contributors <dtas-all@nongnu.org>
+# License: GPL-3.0+ <https://www.gnu.org/licenses/gpl-3.0.txt>
+# frozen_string_literal: true
+require './test/helper'
+require 'dtas'
+require 'yaml'
+
+class TestEncoding < Testcase
+  def test_encoding
+    data = <<EOD # <20180111114546.77906b35@cumparsita.ch>
+---
+comments:
+  ARTIST: !binary |-
+    RW5yaXF1ZSBSb2Ryw61ndWV6
+EOD
+    hash = YAML.load(data)
+    artist = DTAS.try_enc(hash['comments']['ARTIST'], Encoding::UTF_8)
+    assert_equal 'Enrique Rodríguez', artist
+  end
+end
-- 
EW
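
The "relabel, validate, fall back" idea behind the patch can be tried
without dtas or charlock_holmes.  The sketch below is a simplified
stand-in, not the patch's DTAS.try_enc: the helper name
`try_encodings` and the fixed candidate list are hypothetical, and the
fixed list replaces the detection step that charlock_holmes would do.

```ruby
# Hypothetical helper: relabel +str+ in place with the first candidate
# encoding under which its bytes are valid; restore the original label
# if none fits.  Only the encoding label changes, never the bytes.
def try_encodings(str, candidates)
  old = str.encoding
  candidates.each do |enc|
    str.force_encoding(enc)
    return str if str.valid_encoding?
  end
  str.force_encoding(old)
end

candidates = [Encoding::UTF_8, Encoding::ISO_8859_1]

a = try_encodings("Enrique Rodr\xC3\xADguez".b, candidates)
b = try_encodings("Enrique Rodr\xEDguez".b, candidates)
c = try_encodings("\xFF\xFE".b, [Encoding::UTF_8])

puts a.encoding == Encoding::UTF_8      # true: the bytes were valid UTF-8
puts b.encoding == Encoding::ISO_8859_1 # true: UTF-8 failed, Latin-1 fit
puts b.encode(Encoding::UTF_8) == a     # true: same text once transcoded
puts c.encoding == Encoding::BINARY     # true: nothing fit, label restored
```

Because every byte string is valid ISO-8859-1, putting it last in the
list makes it a catch-all; a real detector can do better by looking at
byte statistics rather than validity alone.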




Code repositories for project(s) associated with this public inbox

	http://80x24.org/dtas.git/
