Jihun Park

I'm an M.S.-Ph.D. integrated course student at the Department of Artificial Intelligence at DGIST, South Korea, working under the supervision of Prof. Sunghoon Im at the DGIST Computer Vision Lab.

I earned my B.S. degree in Mechanical Engineering from Zhejiang University in 2022, and most recently worked as a generative model research intern at Baidu Global Business Unit in Shenzhen, China. My research interests lie in computer vision and deep learning, with a focus on Image/Video Generation (Autoregressive, Diffusion), Style Transfer, and Vision-Language Tasks.

News

2026

May 2026: One paper accepted to ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM).
Mar 2026: One paper accepted to IEEE Signal Processing Letters (SPL).
Feb 2026: Our AI-Created Artworks are being exhibited at Kansong Art Museum Daegu for 3.5 months (2026.2.14 – 2026.5.31). [Link]
Feb 2026: Two papers accepted to CVPR 2026 and one paper accepted to CVPR 2026 Findings.
Feb 2026: Our team received the Encouragement Prize from the 32nd Samsung Human-Tech Paper Award.

2025

Dec 2025: Started a Generative Model Research Intern position at Baidu Inc. Global Business Unit.
Nov 2025: AI-driven artworks by our research team exhibited at DGIST Gallery (Nov. 2025 – Feb. 2026). [Link]
Nov 2025: One paper accepted to AAAI 2026 as an oral paper.
Oct 2025: Received the Best Oral Presentation Award from DGIST EECS/AI.
Jul 2025: One paper accepted to ICCV 2025 Workshop.
Jul 2025: Attended ICVSS 2025 in Sicily, Italy.
Feb 2025: One paper accepted to CVPR 2025 as a highlight paper (Top 3.7%).

2024

Oct 2024: One paper accepted to ACCV 2024.
Feb 2024: Our team received the Encouragement Prize from the 30th Samsung Human-Tech Paper Award.

Publications

* indicates equal contribution. † indicates the corresponding author.

arXiv 2025

A Training-Free Style-aligned Image Generation with Scale-wise Autoregressive Model

Jihun Park*, Jongmin Gim*, Kyoungmin Lee*, Minseok Oh, Minwoo Choi, Jaeyeul Kim, Woochool Park, Yan Zhang, Zhenpeng Zhan, Sunghoon Im†

arXiv, 2025

Paper

We present a training-free style-aligned image generation method that leverages a scale-wise autoregressive model. While large-scale text-to-image (T2I) models, particularly diffusion-based methods, have demonstrated impressive generation quality, they often suffer from style misalignment across generated image sets and slow inference speeds, limiting their practical usability. To address these issues, we propose three key components: initial feature replacement to ensure consistent background appearance, pivotal feature interpolation to align object placement, and dynamic style injection, which reinforces style consistency using a schedule function. Unlike previous methods requiring fine-tuning or additional training, our approach maintains fast inference while preserving individual content details. Extensive experiments show that our method achieves generation quality comparable to competing approaches, significantly improves style alignment, and delivers inference speeds over six times faster than the fastest model.

TOMM 2026

Mitigating Noisy Correspondence in Video-Text Retrieval via Noise-mined Adaptive Self-Labeling

Jeonghoon Kim*, Hyeon Kang*, Jihun Park, Jinhwoi Kim, Jaeyeul Kim, Sunghoon Im†

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026

Paper

In video-text retrieval, addressing the noisy correspondence problem is crucial for achieving accurate retrieval performance. Recent methods address this challenge by suppressing the impact of either noisy data or predictions distorted by noise. However, existing approaches often overlook two critical aspects: distinguishing hard noise from semantically ambiguous cases, and preserving latent associations within unmatched negative pairs. To address these limitations, we propose Noise-mined Adaptive Self-Labeling (NASL), which effectively manages noisy data during training. NASL consists of two loss functions: 1) Noise-mined Matching Loss (NML), which identifies and penalizes noisy data based on a two-stage suppression strategy, and 2) Adaptive Self-labeling Loss (ASL), which employs optimal transport to explore extra relationships among false negatives in noisy conditions and provides soft supervision to prevent excessive penalization of semantically plausible pairs. Extensive experiments demonstrate that NASL improves the separation between clean and noisy data while effectively mitigating noise, leading to significant performance improvements on the MSR-VTT, DiDeMo, MSVD, LSMDC, and ActivityNet datasets.
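As a rough illustration of how optimal transport can turn a similarity matrix into soft supervision, the sketch below runs Sinkhorn iterations on a toy video-text similarity matrix and row-normalizes the resulting plan into soft labels. This is not the NASL implementation: the entropic regularizer `eps`, the uniform marginals, and the function names `sinkhorn_plan` and `soft_labels` are all assumptions made for the example.

```python
import math

def sinkhorn_plan(sim, eps=0.1, n_iters=50):
    """Entropic optimal transport with uniform marginals (illustrative only)."""
    n, m = len(sim), len(sim[0])
    # Gibbs kernel: a higher similarity means a cheaper transport cost.
    kernel = [[math.exp(s / eps) for s in row] for row in sim]
    u, v = [1.0] * n, [1.0] * m
    row_mass, col_mass = 1.0 / n, 1.0 / m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def soft_labels(plan):
    """Row-normalize the transport plan so each video gets a distribution over texts."""
    return [[p / sum(row) for p in row] for row in plan]

# Toy similarities: rows are videos, columns are captions (hypothetical values).
sim = [[0.9, 0.1, 0.2],
       [0.2, 0.8, 0.7],
       [0.1, 0.3, 0.9]]
plan = sinkhorn_plan(sim)
labels = soft_labels(plan)
```

In this toy setup, the second video keeps some label mass on the third caption instead of treating it as a hard negative, which is the kind of soft supervision the abstract describes.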

SPL 2026

CascadeOcc: Rethinking 3D Occupancy World Models with Cascaded VQ Representations

Kyumin Hwang*, Wonhyeok Choi*, Jaeyeul Kim, Jihun Park, Daehee Park†, Sunghoon Im†

IEEE Signal Processing Letters (SPL, IF: 3.9), 2026

Paper

This letter proposes CascadeOcc, a novel occupancy world model that prioritizes intrinsic structural hierarchy over extrinsic auxiliary modalities for autonomous driving. Occupancy world models—forecasting the future driving environment and planning the driving trajectory—effectively bridge perception and planning, but current approaches often heavily rely on external modalities or large language models, failing to fully exploit the inherent structural potential of occupancy representations themselves. To enhance representational capacity for complex 3D scenes, we integrate a cascaded Vector Quantized (VQ) mechanism into an autoregressive framework. Following a coarse-to-fine principle, CascadeOcc progressively refines fine-grained details from global structures through a multi-scale architecture. Additionally, we incorporate a TimeMixer to capture multi-scale temporal dependencies, establishing a dual-hierarchy mechanism in both space and time. Experimental results on 4D occupancy forecasting and motion planning benchmarks demonstrate that CascadeOcc achieves superior performance among vision-centric approaches, validating that optimizing inherent representations is a powerful alternative to relying on external foundation models.

CVPR 2026

TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization

Wonhyeok Choi, Kyumin Hwang, Jihun Park, Kyoungmin Lee, Seunghun Lee, Jaeyeul Kim, Minwoo Choi, Sunghoon Im†

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Paper

Multi-task learning (MTL) involves the simultaneous optimization of multiple task-specific losses, often leading to gradient conflicts and scale imbalances that result in negative transfer. While existing multi-task optimization methods attempt to mitigate these challenges, they either lack the stochasticity needed to escape poor local minima or fail to explicitly address conflicts at the gradient level. In this work, we propose TaskForce, a novel multi-task optimization framework incorporating cooperative multi-agent reinforcement learning (MARL), where agents learn to find an effective joint optimization strategy based on their respective task gradients and losses. To keep the optimization process compact yet informative, agents observe a summary of the training dynamics that consists of the gradient Gram matrix, which captures both gradient magnitudes and pairwise alignments, and the task loss values. Each agent then predicts the balancing parameters that determine the weight of its contribution to the final gradient update. Crucially, we design a hybrid reward function that incorporates both gradient-based signals and loss improvement dynamics, enabling agents to effectively resolve gradient conflicts and avoid poor convergence by considering both direct gradient information and the resulting impact on loss reduction. TaskForce achieves consistent improvements over state-of-the-art MTL baselines on NYU-v2, Cityscapes, and QM9, demonstrating the promise of cooperative MARL in complex multi-task scenarios.
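To make the Gram-matrix observation concrete, here is a minimal sketch that computes the Gram matrix of a few flattened task gradients and derives pairwise cosine alignments from it. The gradient values and the helper names `gram_matrix` and `cosine_alignments` are hypothetical; this only illustrates why the Gram matrix carries both magnitude and alignment information, not how TaskForce builds its observations.

```python
import math

def gram_matrix(grads):
    """G[i][j] = <g_i, g_j>; the diagonal holds squared gradient norms."""
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]

def cosine_alignments(gram):
    """Recover pairwise cosine similarities from the Gram matrix alone."""
    norms = [math.sqrt(gram[i][i]) for i in range(len(gram))]
    return [[gram[i][j] / (norms[i] * norms[j]) for j in range(len(gram))]
            for i in range(len(gram))]

# Three hypothetical task gradients over a 3-parameter shared backbone.
grads = [[1.0, 0.0, 2.0],
         [-1.0, 1.0, 0.0],
         [0.5, 0.5, 1.0]]
G = gram_matrix(grads)
C = cosine_alignments(G)

# A negative off-diagonal entry signals a gradient conflict between two tasks.
conflicts = [(i, j) for i in range(3) for j in range(i + 1, 3) if G[i][j] < 0]
```

For T tasks the Gram matrix has only T × T entries regardless of model size, which is one plausible reading of why the abstract calls the observation compact.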

CVPR 2026

A Training-Free Style-Personalization via SVD-Based Feature Decomposition

Kyoungmin Lee*, Jihun Park*, Jongmin Gim*, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im†

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Paper

We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Without any additional training, extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.

CVPR-F 2026

FREESTYLE: An Anchor-Free Mechanism for Training-Free Style-Aligned Image Generation

Minseok Oh*, Jihun Park*, Jongmin Gim, Minwoo Choi, Kyoungmin Lee, Ferdinando Fioretto†, Sunghoon Im†

IEEE/CVF Conference on Computer Vision and Pattern Recognition Findings (CVPR Findings), 2026

Paper

Text-to-Image (T2I) generation models have become central to creative workflows, where producing style-consistent image sets is crucial for applications such as visual identity design, illustration, and asset creation. While recent training-free methods have enabled efficient style-aligned synthesis, they commonly rely on anchor-dependent propagation, where style features from a single reference image are shared across the batch. This dependency often leads to inconsistent results: when the anchor fails to capture the intended reference style, its unintended artifacts are propagated to all samples, and excessive feature sharing further causes content leakage and semantic distortion. To overcome these limitations, we propose FREESTYLE, a training-free and anchor-free framework for robust style alignment within a scale-wise autoregressive generation paradigm. Our method integrates three core components—Majority Voting (MV), which aggregates dominant style cues across the batch to form a representative style feature; Majority Style Injection (MSI), which adaptively injects these aggregated features to enforce global style coherence; and Set-Based Projection (SBP), which refines local object regions by projecting them onto a shared style manifold for context-aware adaptation. Extensive experiments demonstrate that, without any retraining or parameter updates, FREESTYLE achieves state-of-the-art performance in both style consistency and content preservation, while maintaining real-time inference efficiency.

ICEIC 2026

Bridging Geometric and Semantic Foundation Models for Generalized Monocular Depth Estimation

Sanggyun Ma*, Wonjoon Choi*, Jihun Park*, Jaeyeul Kim, Seunghun Lee, Jiwan Seo, Sunghoon Im†

The 25th International Conference on Electronics, Information, and Communication (ICEIC), 2026

Paper

We present Bridging Geometric and Semantic (BriGeS), an effective method that fuses geometric and semantic information within foundation models to enhance Monocular Depth Estimation (MDE). Central to BriGeS is the Bridging Gate, which integrates the complementary strengths of depth and segmentation foundation models. This integration is further refined by our Attention Temperature Scaling technique. It finely adjusts the focus of the attention mechanisms to prevent over-concentration on specific features, thus ensuring balanced performance across diverse inputs. BriGeS capitalizes on pretrained foundation models and adopts a strategy that focuses on training only the Bridging Gate. This method significantly reduces resource demands and training time while maintaining the model's ability to generalize effectively. Extensive experiments across multiple challenging datasets demonstrate that BriGeS outperforms state-of-the-art methods in MDE for complex scenes, effectively handling intricate structures and overlapping objects.
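The effect of temperature on attention can be sketched with a temperature-scaled softmax. The example assumes the temperature simply divides the attention logits before normalization; the function name `softmax_with_temperature` and the score values are made up for illustration, and the exact formulation used in BriGeS may differ.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over attention logits divided by a temperature (assumed form)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits for one query over four keys.
scores = [4.0, 1.0, 0.5, 0.0]
sharp = softmax_with_temperature(scores, temperature=1.0)
flat = softmax_with_temperature(scores, temperature=4.0)
```

With a temperature above 1, the largest attention weight shrinks and the remaining keys receive more mass, which matches the abstract's goal of preventing over-concentration on specific features.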

AAAI 2026

Infinite-Story: A Training-Free Consistent Text-to-Image Generation

Jihun Park*, Kyoungmin Lee*, Jongmin Gim*, Hyeonseo Jo, Minseok Oh, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Minwoo Choi, Sunghoon Im†

The 40th Annual AAAI Conference on Artificial Intelligence (AAAI), 2026

Oral Paper · Encouragement Prize, 32nd HumanTech Paper Award

Paper Project Code

We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation tailored for multi-prompt storytelling scenarios. Built upon a scale-wise autoregressive model, our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. To overcome these issues, we introduce three complementary techniques: Identity Prompt Replacement, which mitigates context bias in text encoders to align identity attributes across prompts; and a unified attention guidance mechanism comprising Adaptive Style Injection and Synchronized Guidance Adaptation, which jointly enforce global style and identity appearance consistency while preserving prompt fidelity. Unlike prior diffusion-based approaches that require fine-tuning or suffer from slow inference, Infinite-Story operates entirely at test time, delivering high identity and style consistency across diverse prompts. Extensive experiments demonstrate that our method achieves state-of-the-art generation performance while offering over 6X faster inference (1.72 seconds per image) than the fastest existing consistent T2I models, highlighting its effectiveness and practicality for real-world visual storytelling.

ICCVw 2025

Semantic-Enhanced Monocular Depth Estimation via Fusion and Distillation of Foundation Models

Sanggyun Ma*, Wonjoon Choi*, Jihun Park, Jaeyeul Kim, Sunghoon Im†

International Conference on Computer Vision Workshop (ICCVw), 2025 / ICEIC 2026

Paper

CVPR 2025

Style-Editor: Text-driven Object-centric Style Editing

Jihun Park*, Jongmin Gim*, Kyoungmin Lee*, Seunghun Lee, Sunghoon Im†

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

Highlight Paper (Top 3.7%) · Encouragement Prize, 30th HumanTech Paper Award

Paper Project Code

We present a text-driven object-centric style editing model named Style-Editor, a novel method that guides style editing at an object-centric level using textual inputs. The core of Style-Editor is our Patch-wise Co-Directional (PCD) loss, meticulously designed for precise object-centric editing that is closely aligned with the input text. This loss combines a patch directional loss for text-guided style direction and a patch distribution consistency loss for even CLIP embedding distribution across object regions. It ensures seamless and harmonious style editing across object regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules for identifying object locations via text, eliminating the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss to maintain the original style and structural essence of the image's background. This loss is applied to dynamically identified background areas. Extensive experiments underline the effectiveness of our approach in creating visually coherent and textually aligned style editing.

ACCV 2024

Content-Adaptive Style Transfer: A Training-Free Approach with VQ Autoencoders

Jongmin Gim*, Jihun Park*, Kyoungmin Lee*, Sunghoon Im†

17th Asian Conference on Computer Vision (ACCV), 2024

Paper

We introduce Content-Adaptive Style Transfer (CAST), a novel training-free approach for arbitrary style transfer that enhances visual fidelity using a pretrained vector-quantized autoencoder. Our method systematically applies coherent stylization to corresponding content regions. It starts by capturing the global structure of images through vector quantization, then refines local details using our style-injected decoder. CAST consists of three main components: a content-consistent style injection module, which tailors stylization to unique image regions; an adaptive style refinement module, which fine-tunes stylization intensity; and a content refinement module, which ensures content integrity through interpolation and feature distribution maintenance. Experimental results indicate that CAST outperforms existing generative-based and traditional style transfer models in both quantitative and qualitative measures.

Education & Experience

Generative Model Research Intern (Advisor: Yan Zhang)

Baidu Global Business Unit, Shenzhen, China

Dec. 2025 – Mar. 2026

M.S.-Ph.D. Integrated Course (Advisor: Sunghoon Im)

Department of Artificial Intelligence, DGIST, South Korea

Mar. 2023 – Present

Bachelor's Degree

Department of Mechanical Engineering, Zhejiang University, China

Sep. 2018 – Jul. 2022

Chungnam Samsung Academy, South Korea

Mar. 2015 – Feb. 2018

Academic Activities

Honors & Awards

Encouragement Prize, 32nd HumanTech Paper Award, Samsung Electronics Co., Ltd, Feb. 2026
Best Oral Presentation Award, DGIST EECS/AI Student Conference, Oct. 2025
Encouragement Prize, 30th HumanTech Paper Award, Samsung Electronics Co., Ltd, Feb. 2024

Reviewer

The Association for the Advancement of Artificial Intelligence (AAAI), 2026

Teaching

Invited Speaker, DGIST Generative AI Integrated Seminar, Oct. 2024
Teaching Assistant (TA), Advanced Deep Learning, Mar. – Jun. 2024

Other Experiences

Exhibition of our team's AI-Created Artworks at Kansong Art Museum Daegu, Feb. 2026 – May 2026 [Link]
Exhibition of our team's research on AI-driven art at DGIST, Nov. 2025 – Feb. 2026 [Link]
Attended International Computer Vision Summer School (ICVSS 2025) in Sicily, Italy, Jul. 2025
Selected to represent DGIST at the official institutional booth during the 2025 Korea Science Festival, Apr. 2025