Our contributions can be summarized as: 1) We propose a hierarchical approach to solve the unexplored online handwritten Chinese text line generation task. 2) We introduce a simple but effective layout generator that can generate character positions based on the text contents and writing style through in-context-like learning. 3) We construct a 1D U-Net network for font generation and design a multi-scale contrastive-learning-based style encoder to improve the ability of calligraphy style imitation.
Online Handwritten Data. In general, online handwriting is a kind of time-series data, which is composed of a series of trajectory points, each a point $[h, v, s] \in \mathbb{R}^3$.

Appendix A

A.1 Denoising Diffusion Probabilistic Models

In this paper, we adopt the Denoising Diffusion Probabilistic Model (DDPM), a generative model that operates by iteratively applying a denoising process to noise-corrupted data. This process, known as the reverse denoising process, aims to gradually refine the noisy input towards generating realistic samples. DDPM learns to model the conditional distribution of clean data given noisy inputs, which is derived from the forward diffusion process. Denoting the forward process as $q$ and the reverse process as $p$, the forward process starts from the original data $X_0$ and incrementally adds Gaussian noise to the data:

$$q(X_t \mid X_{t-1}) = \mathcal{N}\!\big(X_t;\ \sqrt{\alpha_t}\, X_{t-1},\ (1-\alpha_t)\mathbf{I}\big),$$
where $\{\alpha_t\}_{t=0}^{T}$ are noise-schedule hyperparameters and $T$ is the total number of timesteps. Due to the Markovian nature of the forward transition kernel $q(X_t \mid X_{t-1})$, we can directly sample $X_t \sim q(X_t \mid X_0)$ without reliance on any other $t$:

$$q(X_t \mid X_0) = \mathcal{N}\!\big(X_t;\ \sqrt{\bar{\alpha}_t}\, X_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.$$
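The closed-form marginal $q(X_t \mid X_0)$ can be sketched as a short sampling routine; the schedule values and trajectory shape below are hypothetical, chosen only for illustration:

```python
import torch

def sample_xt(x0: torch.Tensor, t: int, alphas: torch.Tensor) -> torch.Tensor:
    """Sample X_t ~ q(X_t | X_0) in closed form.

    `alphas` holds the noise-schedule hyperparameters {alpha_t};
    alpha_bar_t is their cumulative product up to step t.
    """
    alpha_bar_t = torch.prod(alphas[: t + 1])   # \bar{alpha}_t = prod_{s<=t} alpha_s
    noise = torch.randn_like(x0)                # epsilon ~ N(0, I)
    return torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * noise

# toy usage: a trajectory of 16 points with 3 channels [h, v, s]
x0 = torch.randn(16, 3)
alphas = torch.linspace(0.9999, 0.98, steps=1000)  # hypothetical schedule
xt = sample_xt(x0, t=500, alphas=alphas)
```

Because the marginal depends only on $\bar{\alpha}_t$, any timestep can be sampled directly without simulating the full Markov chain.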
Theoretically, it is equivalent to optimizing the evidence lower bound of $\log p(X_0)$:

$$\log p(X_0) \ge \mathbb{E}_{q}\!\left[\log p_\theta(X_0 \mid X_1) - \sum_{t=2}^{T} D_{\mathrm{KL}}\!\big(q(X_{t-1} \mid X_t, X_0)\,\big\|\,p_\theta(X_{t-1} \mid X_t, t)\big) - D_{\mathrm{KL}}\!\big(q(X_T \mid X_0)\,\big\|\,p(X_T)\big)\right],$$
which defines the reverse transition kernel $p_\theta(X_{t-1} \mid X_t, t)$.
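In practice this bound is commonly optimized via the simplified noise-prediction objective of Ho et al.; the sketch below assumes that objective (an assumption, since the text does not state the exact parameterization) and uses a small MLP as a stand-in for the paper's 1D U-Net denoiser:

```python
import torch
import torch.nn as nn

# Stand-in denoiser: the real model is a 1D U-Net; the schedule is hypothetical.
denoiser = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 3))
alphas = torch.linspace(0.9999, 0.98, steps=1000)
alpha_bars = torch.cumprod(alphas, dim=0)

def training_loss(x0: torch.Tensor) -> torch.Tensor:
    """One step of the simplified epsilon-prediction objective."""
    t = torch.randint(0, len(alphas), (1,))        # random timestep
    eps = torch.randn_like(x0)                     # target noise
    a_bar = alpha_bars[t]
    xt = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps
    # condition on t via a crude scalar feature (real models use embeddings)
    t_feat = t.float().expand(x0.shape[0], 1) / len(alphas)
    eps_pred = denoiser(torch.cat([xt, t_feat], dim=-1))
    return nn.functional.mse_loss(eps_pred, eps)

loss = training_loss(torch.randn(16, 3))
```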
For the layout generator, we adopt a 2-layer LSTM with a hidden size of 128. We also attempted to replace it with a Transformer and found that the results were nearly identical; we therefore chose the simpler and faster LSTM. For the denoiser, recall that our model consists of a 1D U-Net network as the denoiser, a character embedding dictionary, and a multi-scale calligraphy style encoder. We set the dimension of the character embeddings to 150 and the dimension of the time embedding for the diffusion model to 32.
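A minimal sketch of these components in PyTorch; the vocabulary size and the LSTM input dimension are assumptions (neither is stated in the text), and the layer names are illustrative rather than the authors' code:

```python
import torch.nn as nn

NUM_CHARS = 7000  # assumed vocabulary size (not given in the text)

# 2-layer LSTM layout generator with hidden size 128; input_size is assumed
# to match the character embedding dimension.
layout_generator = nn.LSTM(input_size=150, hidden_size=128,
                           num_layers=2, batch_first=True)

char_embedding = nn.Embedding(NUM_CHARS, 150)  # character embedding dim = 150
time_embed_dim = 32                            # diffusion time-embedding dim
```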
In principle, the layout planner module and the character synthesizer can be trained jointly: the size that the layout planner predicts for each character is used as input to the single-character synthesizer, which is then expected to generate characters at specific sizes. However, we find that leaving character sizes unnormalized hinders the model from learning structural information about the characters, leading to unstable generation results. Our approach is therefore to decouple the training of the layout planner module and the character synthesizer. When training the character synthesizer, we first normalize all single characters to a fixed height, so the synthesizer only needs to learn to generate characters at a standard size. We take full advantage of the fact that online handwriting data has no background noise, which allows us to directly scale the generated standard-sized characters and fill them into their corresponding bounding boxes.
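The decoupled pipeline can be sketched as two small routines: one normalizes a character trajectory to a fixed height for synthesizer training, and one rescales a generated standard-sized character into its predicted bounding box. The function names and the (x, y, w, h) box convention are assumptions for illustration:

```python
import numpy as np

def normalize_height(points: np.ndarray, target_h: float = 1.0) -> np.ndarray:
    """Scale a single character's trajectory (N x 3: [h, v, s]) to a fixed height."""
    xy = points[:, :2]
    height = xy[:, 1].max() - xy[:, 1].min()
    scale = target_h / max(height, 1e-8)
    out = points.copy()
    out[:, :2] = (xy - xy.min(axis=0)) * scale   # anchor at origin, rescale
    return out

def place_in_box(points: np.ndarray, box: tuple) -> np.ndarray:
    """Scale a standard-sized character into its bounding box (x, y, w, h)."""
    x, y, w, h = box
    out = points.copy()
    xy_min = out[:, :2].min(axis=0)
    span = np.maximum(out[:, :2].max(axis=0) - xy_min, 1e-8)
    out[:, :2] = (out[:, :2] - xy_min) / span * np.array([w, h]) + np.array([x, y])
    return out
```

Because the trajectories carry no background noise, this rescaling is lossless apart from the affine transform itself.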
We implement our model in PyTorch and run experiments on NVIDIA TITAN RTX 24 GB GPUs; both training and testing are completed on a single GPU. For training the layout planner, we use the Adam optimizer with an initial learning rate of 0.01 and a batch size of 32. For training the diffusion character synthesizer, the initial learning rate is 0.001, the gradient clipping threshold is 1.0, and the learning rate decays by a factor of 0.9998 after each batch. We train the whole model for 400K iterations with a batch size of 64, which takes about 4 days.
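A hedged sketch of this training setup in PyTorch, using `ExponentialLR` for the per-batch decay of 0.9998 and `clip_grad_norm_` for clipping; the linear layer is a placeholder for the actual diffusion character synthesizer, and the loss is a dummy stand-in:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 3)  # placeholder for the diffusion character synthesizer

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9998)

x = torch.randn(64, 3)                 # batch size 64
loss = nn.functional.mse_loss(model(x), torch.zeros(64, 3))  # dummy loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clipping = 1.0
optimizer.step()
scheduler.step()                       # called once per batch
```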
Accurate Rate and Correctness Rate: Unlike single characters, when recognizing text lines the number of characters they contain is unknown in advance. As a result, there may be discrepancies between the total number of characters and the number of correctly recognized characters in the content parsed by the recognizer. Therefore, in the absence of alignment, it is not appropriate to simply use the ratio of correctly recognized characters to the total number of characters as a measure of content quality. Instead, evaluation metrics based on edit distance are commonly used:

$$\mathrm{CR} = \frac{N_t - D_e - S_e}{N_t}, \qquad \mathrm{AR} = \frac{N_t - D_e - S_e - I_e}{N_t},$$
where $N_t$ represents the total number of characters in the real handwritten text line, while $S_e$, $D_e$, and $I_e$ respectively denote substitution errors, deletion errors, and insertion errors.
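A minimal sketch of how these metrics can be computed from a Levenshtein alignment; this is a generic implementation for illustration, not the authors' evaluation code:

```python
def edit_ops(ref: str, hyp: str):
    """Count (substitutions, deletions, insertions) between reference and
    hypothesis via a Levenshtein dynamic program."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = (cost, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[(0, 0, 0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, 0, i, 0)            # all deletions
    for j in range(1, n + 1):
        dp[0][j] = (j, 0, 0, j)            # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c_sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                (dp[i-1][j-1][0] + c_sub, dp[i-1][j-1][1] + c_sub,
                 dp[i-1][j-1][2], dp[i-1][j-1][3]),                  # match/substitute
                (dp[i-1][j][0] + 1, dp[i-1][j][1],
                 dp[i-1][j][2] + 1, dp[i-1][j][3]),                  # deletion
                (dp[i][j-1][0] + 1, dp[i][j-1][1],
                 dp[i][j-1][2], dp[i][j-1][3] + 1),                  # insertion
            )
    return dp[m][n][1:]

def ar_cr(ref: str, hyp: str):
    """AR = (Nt - De - Se - Ie) / Nt, CR = (Nt - De - Se) / Nt."""
    s_e, d_e, i_e = edit_ops(ref, hyp)
    n_t = len(ref)
    return (n_t - d_e - s_e - i_e) / n_t, (n_t - d_e - s_e) / n_t
```

Note that AR penalizes insertion errors while CR does not, so AR ≤ CR and AR can be negative when the hypothesis contains many spurious characters.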