Date: Mon, 29 Jan 2018 00:58:27 +0000
From: Eric Wong
To: Rene Maurer
Subject: [PATCH] player: support guessing encodings for comments
Message-ID: <20180129005827.GA25130@starla>
References: <20180111114546.77906b35@cumparsita.ch> <20180111173808.GA7531@dcvr> <20180111194317.GA6392@starla>
In-Reply-To: <20180111194317.GA6392@starla>
Cc: dtas-all@nongnu.org

Eric Wong wrote:
> Ugh, this is taking a while.  I have a mix of UTF-8 and
> ISO-8859-1 and probably some totally bogus filenames available to me :x

Maybe the following patch is alright; there are a few other things
I want to work on around mlib before I release.

---8<---
Subject: [PATCH] player: support guessing encodings for comments

This can be helpful for end users and is close to what
other players use.

We can fall back to Encoding.default_external by default
(typically UTF-8) and then again using `charlock_holmes'
if installed.

Note: path names remain binary, because that's how proper
filesystems operate.
---
 lib/dtas.rb            |  2 ++
 lib/dtas/encoding.rb   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++
 lib/dtas/source/sox.rb |  4 +++-
 test/test_encoding.rb  | 20 +++++++++++++++++
 4 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100644 lib/dtas/encoding.rb
 create mode 100644 test/test_encoding.rb

diff --git a/lib/dtas.rb b/lib/dtas.rb
index ac416d7..3c2cdb4 100644
--- a/lib/dtas.rb
+++ b/lib/dtas.rb
@@ -42,3 +42,5 @@ def self.dedupe_str(str)
 
 require_relative 'dtas/compat_onenine'
 require_relative 'dtas/spawn_fix'
+require_relative 'dtas/encoding'
+DTAS.extend(DTAS::Encoding)
diff --git a/lib/dtas/encoding.rb b/lib/dtas/encoding.rb
new file mode 100644
index 0000000..71c877f
--- /dev/null
+++ b/lib/dtas/encoding.rb
@@ -0,0 +1,58 @@
+# Copyright (C) 2018 all contributors
+# License: GPL-3.0+
+# frozen_string_literal: true
+
+# This module gets included in DTAS
+module DTAS::Encoding # :nodoc:
+  def self.extended(mod)
+    mod.instance_eval { @charlock_holmes = nil}
+  end
+
+private
+
+  def try_enc_harder(str, enc, old) # :nodoc:
+    case @charlock_holmes
+    when nil
+      begin
+        require 'charlock_holmes'
+        @charlock_holmes = CharlockHolmes::EncodingDetector.new
+      rescue LoadError
+        warn "`charlock_holmes` gem not available for encoding detection"
+        @charlock_holmes = false
+      end
+    when false
+      enc_fallback(str, enc, old)
+    else
+      res = @charlock_holmes.detect(str)
+      if det = res[:ruby_encoding]
+        str.force_encoding(det)
+        warn "charlock_holmes detected #{str.inspect} as #{det}..."
+        str.valid_encoding? or enc_fallback(str, det, old)
+      else
+        enc_fallback(str, enc, old)
+      end
+    end
+    str
+  end
+
+  def enc_fallback(str, enc, old) # :nodoc:
+    str.force_encoding(old)
+    warn "could not detect encoding for #{str.inspect} (not #{enc})"
+  end
+
+public
+
+  def try_enc(str, enc, harder = true) # :nodoc:
+    old = str.encoding
+    return str if old == enc
+    str.force_encoding(enc)
+    unless str.valid_encoding?
+      if harder
+        try_enc_harder(str, enc, old)
+      else
+        enc_fallback(str, enc, old)
+      end
+    end
+    str
+  end
+end
diff --git a/lib/dtas/source/sox.rb b/lib/dtas/source/sox.rb
index f702b41..03487fe 100644
--- a/lib/dtas/source/sox.rb
+++ b/lib/dtas/source/sox.rb
@@ -50,17 +50,19 @@ def mcache_lookup(infile)
       out =~ /^Sample Rate\s*:\s*(\d+)/n and dst['rate'] = $1.to_i
       out =~ /^Precision\s*:\s*(\d+)-bit/n and dst['bits'] = $1.to_i
 
+      enc = Encoding.default_external
       if out =~ /\nComments\s*:[ \t]*\n?(.*)\z/mn
         comments = dst['comments'] = {}
         key = nil
         $1.split(/\n/n).each do |line|
           if line.sub!(/^([^=]+)=/ni, '')
-            key = DTAS.dedupe_str($1.upcase)
+            key = DTAS.dedupe_str(DTAS.try_enc($1.upcase, enc))
           end
           (comments[key] ||= ''.b) << "#{line}\n" unless line.empty?
         end
         comments.each do |k,v|
           v.chomp!
+          DTAS.try_enc(v, enc)
           comments[k] = DTAS.dedupe_str(v)
         end
       end
diff --git a/test/test_encoding.rb b/test/test_encoding.rb
new file mode 100644
index 0000000..d9af968
--- /dev/null
+++ b/test/test_encoding.rb
@@ -0,0 +1,20 @@
+# Copyright (C) 2018 all contributors
+# License: GPL-3.0+
+# frozen_string_literal: true
+require './test/helper'
+require 'dtas'
+require 'yaml'
+
+class TestEncoding < Testcase
+  def test_encoding
+    data = <<EOD
+---
+comments:
+  ARTIST: !binary |-
+    RW5yaXF1ZSBSb2Ryw61ndWV6
+EOD
+    hash = YAML.load(data)
+    artist = DTAS.try_enc(hash['comments']['ARTIST'], Encoding::UTF_8)
+    assert_equal 'Enrique Rodríguez', artist
+  end
+end
-- 
EW
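
For readers following along: the force_encoding / valid_encoding? check that
try_enc relies on can be tried in isolation, without charlock_holmes. The
following is only a sketch under stated assumptions. `guess_encoding' and its
candidate list are hypothetical names that are not part of dtas, and walking a
fixed list of encodings is a simpler stand-in for real detection, not what the
patch does.

```ruby
# Hypothetical helper, NOT part of dtas: retag +str+ with the first
# candidate encoding under which its bytes are valid.  The bytes are
# never modified; only the encoding tag changes.
def guess_encoding(str, candidates = [Encoding::UTF_8, Encoding::ISO_8859_1])
  old = str.encoding
  candidates.each do |enc|
    str.force_encoding(enc)
    return str if str.valid_encoding?
  end
  str.force_encoding(old) # nothing matched, restore the original tag
end

utf8   = "Enrique Rodr\xC3\xADguez".b # UTF-8 bytes, tagged as binary
latin1 = "Enrique Rodr\xEDguez".b     # ISO-8859-1 bytes, tagged as binary

puts guess_encoding(utf8).encoding.name   # prints "UTF-8"
puts guess_encoding(latin1).encoding.name # prints "ISO-8859-1"
```

Every byte sequence is valid ISO-8859-1, so a candidate like that only makes
sense at the end of the list, where it acts as a catch-all. That is one reason
a real detector is still worth having for mixed collections.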