Channel: Serverphorums.com

perf,ftrace: fuzzer triggers warning in trace_events_filter code

So I've modified my fuzzer to try to exercise the
PERF_EVENT_IOC_SET_FILTER ioctl() and it is starting to turn up some
warnings.

For example, this one:

[28509.873731] ------------[ cut here ]------------
[28509.879188] WARNING: CPU: 1 PID: 9572 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x4f2/0x9b0()
[28509.890174] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek intel_rapl iosf_mbi snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel coretemp snd_hda_controller kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hda_core aesni_intel snd_hwdep tpm_tis ppdev i915 iTCO_wdt evdev iTCO_vendor_support snd_pcm aes_x86_64 snd_timer lrw snd tpm gf128mul soundcore glue_helper ablk_helper cryptd psmouse drm_kms_helper lpc_ich serio_raw pcspkr parport_pc mei_me mfd_core parport mei drm battery i2c_i801 video i2c_algo_bit wmi processor button sg sr_mod sd_mod cdrom ehci_pci ehci_hcd xhci_pci ahci xhci_hcd libahci libata e1000e crc32c_intel ptp fan scsi_mod usbcore pps_core usb_common thermal thermal_sys
[28509.967457] CPU: 1 PID: 9572 Comm: perf_fuzzer Tainted: G W 4.1.0-rc7+ #155
[28509.976717] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
[28509.985188] ffffffff81a1abb0 ffff8800ce757cb8 ffffffff816d7229 0000000000000000
[28509.993795] 0000000000000000 ffff8800ce757cf8 ffffffff81072eba 0000000000000160
[28510.002406] ffff8800cda26208 ffff8800364e4a90 0000000000000000 ffff8800cda26200
[28510.010990] Call Trace:
[28510.014189] [<ffffffff816d7229>] dump_stack+0x45/0x57
[28510.020242] [<ffffffff81072eba>] warn_slowpath_common+0x8a/0xc0
[28510.027171] [<ffffffff81072faa>] warn_slowpath_null+0x1a/0x20
[28510.033947] [<ffffffff8114b3c2>] replace_preds+0x4f2/0x9b0
[28510.040401] [<ffffffff8114c213>] ? ftrace_profile_set_filter+0x23/0x100
[28510.048083] [<ffffffff8114b902>] create_filter+0x82/0xb0
[28510.054381] [<ffffffff8114c244>] ftrace_profile_set_filter+0x54/0x100
[28510.061831] [<ffffffff8119088b>] ? strndup_user+0x4b/0xc0
[28510.068170] [<ffffffff811661c0>] perf_ioctl+0x170/0x4d0
[28510.074356] [<ffffffff81202270>] do_vfs_ioctl+0x2e0/0x4e0
[28510.080681] [<ffffffff81168305>] ? __perf_sw_event+0x65/0xa0
[28510.087299] [<ffffffff8106312d>] ? __do_page_fault+0x2ad/0x460
[28510.094105] [<ffffffff812024f1>] SyS_ioctl+0x81/0xa0
[28510.099983] [<ffffffff816df172>] system_call_fastpath+0x16/0x7a
[28510.106857] ---[ end trace 2ea55cf8a8b076c3 ]---

This corresponds to the following check in replace_preds():

	/* Make sure the stack is empty */
	pred = __pop_pred_stack(&stack);
	if (WARN_ON(pred)) {
		err = -EINVAL;
		filter->root = NULL;
		goto fail;
	}
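
The check above can be read as the usual "exactly one item left" rule for evaluating a postfix expression with a stack. The following is a minimal userspace sketch of that rule, not the kernel's implementation: the struct pred_stack type, the helper names and the one-character token encoding ('p' for a leaf predicate, '&' and '|' for operators) are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_PREDS 32

/* Hypothetical stand-in for the kernel's predicate stack. */
struct pred_stack {
	int items[MAX_PREDS];
	int depth;
};

static int push_pred(struct pred_stack *s, int pred)
{
	if (s->depth == MAX_PREDS)
		return -ENOSPC;
	s->items[s->depth++] = pred;
	return 0;
}

/* Returns 0 when the stack is empty, standing in for a NULL pop. */
static int pop_pred(struct pred_stack *s)
{
	return s->depth ? s->items[--s->depth] : 0;
}

/*
 * Walk a postfix string: 'p' pushes a leaf predicate, '&' and '|' replace
 * the two most recent predicates with one combined predicate.
 * Returns 0 if exactly one predicate (the root) remains, -EINVAL otherwise.
 */
int check_postfix(const char *postfix)
{
	struct pred_stack stack = { .depth = 0 };
	int next_id = 1;	/* predicate ids start at 1 so that 0 means "empty" */
	const char *c;

	assert(postfix != NULL);

	for (c = postfix; *c; c++) {
		if (*c == 'p') {
			if (push_pred(&stack, next_id++))
				return -EINVAL;
		} else if (*c == '&' || *c == '|') {
			int right = pop_pred(&stack);
			int left = pop_pred(&stack);

			if (!left || !right)
				return -EINVAL;	/* operator without two operands */
			if (push_pred(&stack, next_id++))
				return -EINVAL;
		} else {
			return -EINVAL;		/* unknown token */
		}
	}

	/* One item must be left: the root of the predicate tree. */
	if (!pop_pred(&stack))
		return -EINVAL;

	/* Anything still on the stack is the case the WARN_ON() reports. */
	if (pop_pred(&stack))
		return -EINVAL;

	return 0;
}
```

In this model, check_postfix("pp&") succeeds, while check_postfix("pp") fails because two predicates are left over with nothing combining them. The sketch only rejects such input; it says nothing about which filter string the fuzzer actually produced.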

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [PATCH 05/12] mm: Introduce arch_pgd_init_late()

On 06/11, Ingo Molnar wrote:
>
> @@ -1592,6 +1592,22 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> syscall_tracepoint_update(p);
> write_unlock_irq(&tasklist_lock);
>
> + /*
> + * If we have a new PGD then initialize it:
> + *
> + * This method is called after a task has been made visible
> + * on the task list already.
> + *
> + * Architectures that manage per task kernel pagetables
> + * might use this callback to initialize them after they
> + * are already visible to new updates.
> + *
> + * NOTE: any user-space parts of the PGD are already initialized
> + * and must not be clobbered.
> + */
> + if (p->mm != current->mm)
> + arch_pgd_init_late(p->mm, p->mm->pgd);
> +

Cosmetic, but imo

	if (!(clone_flags & CLONE_VM))
		arch_pgd_init_late(...);

will look better and more consistent.

Oleg.
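
To see why the two conditions are interchangeable in the common case, here is a toy userspace model. It assumes that a child created with CLONE_VM shares the parent's mm and that a child created without it gets a fresh one, and it only covers tasks that have a non-NULL mm. The toy_* names are hypothetical, and TOY_CLONE_VM restates the flag value so the sketch builds outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed to match CLONE_VM; restated so the sketch is self-contained. */
#define TOY_CLONE_VM 0x00000100UL

struct toy_mm { int id; };
struct toy_task { struct toy_mm *mm; };

/* Toy model of mm setup at fork: share with CLONE_VM, else use a fresh mm. */
void toy_copy_mm(unsigned long clone_flags, const struct toy_task *parent,
		 struct toy_task *child, struct toy_mm *fresh)
{
	child->mm = (clone_flags & TOY_CLONE_VM) ? parent->mm : fresh;
}

/* The condition used in the patch: compare the mm pointers. */
bool needs_init_by_pointer(const struct toy_task *parent,
			   const struct toy_task *child)
{
	return child->mm != parent->mm;
}

/* The condition suggested in the review: test the clone flag. */
bool needs_init_by_flag(unsigned long clone_flags)
{
	return !(clone_flags & TOY_CLONE_VM);
}
```

Under these assumptions both predicates give the same answer for a thread-style clone and for a full fork, so the choice between them is about readability, as the review says.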


Re: [PATCH] netdevice: add netdev_pub helper function

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Fri, 12 Jun 2015 15:30:29 +0200

> Being able to utilize this makes much code a lot simpler and cleaner.
> It's a nice convenience function.
>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Please do not ever submit patches adding new interfaces without
also submitting changes showing actual uses of the new interface.

Otherwise it's impossible to see how really useful it actually
is.

I'm not applying this until you do so, thanks.

Re: [PATCH] Doc: networking: Fix URL for wiki.wireshark.org in udplite.txt

Re: linux-next: build failure after merge of the tip tree

From: "Chickles, Derek" <Derek.Chickles@caviumnetworks.com>
Date: Fri, 12 Jun 2015 15:59:54 +0000

>> -----Original Message-----
>> From: Michael Ellerman [mailto:mpe@ellerman.id.au]
>> Sent: Friday, June 12, 2015 3:51 AM
>> To: Thomas Gleixner; Ingo Molnar; H. Peter Anvin; Peter Zijlstra; David
>> S.Miller
>> Cc: linux-next@vger.kernel.org; linux-kernel@vger.kernel.org;
>> sfr@canb.auug.org.au; Chickles, Derek; Burla, Satananda; Manlunas, Felix;
>> Richter, Robert; Makarov, Aleksey; Vatsavayi, Raghu
>> Subject: linux-next: build failure after merge of the tip tree
>>
>> Hi all,
>>
>> After merging the tip tree, today's linux-next build (x86_allmodconfig)
>> failed like this:
...
>> And so on.
>>
>> Caused by the interaction of commit d6472302f242 "x86/mm: Decouple
>> <linux/vmalloc.h> from <asm/io.h>" from the tip tree with commit
>> f21fb3ed364b
>> "Add support of Cavium Liquidio ethernet adapters" from the net-next tree.
>>
>> I applied the following fix for today:
...
> Thanks. Much appreciated.

This doesn't work. Neither of these emails is a formal, proper submission
of a fix for this build failure.

One of you has to do the work to formally submit the patch to netdev
with a full signoff and commit log message so that it gets fixed in my
tree.

Thanks.

RE: linux-next: build failure after merge of the tip tree

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> ...
> > Thanks. Much appreciated.
>
> This doesn't work. Neither of these emails is a formal, proper submission
> of a fix for this build failure.
>
> One of you has to do the work to formally submit the patch to netdev
> with a full signoff and commit log message so that it gets fixed in my
> tree.
>
> Thanks.

Yes, we're working on this. Hopefully, we'll have this submitted later today, with the build fix and sparse warning fixes.

Re: [PATCH v8 3/5] pwm: kona: Fix incorrect config, disable, and polarity procedures

On 15-05-30 09:41 AM, Tim Kryger wrote:
> On Tue, May 26, 2015 at 1:08 PM, Jonathan Richardson
> <jonathar@broadcom.com> wrote:
>> The config procedure didn't follow the spec which periodically resulted
>> in failing to enable the output signal. This happened one in ten or
>> twenty attempts. Following the spec and adding a 400ns delay in the
>> appropriate locations resolves this problem.
>>
>> The disable and polarity procedures now also follow the spec. The old
>> procedures would result in no change in signal when called.
>
> I think you may want to adjust your commit title and message to more
> clearly describe what this change is doing. Perhaps something like:
>
> pwm: kona: Modify settings application sequence
>
> Update the driver so that settings are applied in accordance with the
> most recent version of the hardware spec. The revised sequence clears
> the trigger bit, waits 400ns, writes settings, sets the trigger bit,
> and waits another 400ns. This corrects an issue where occasionally a
> requested change was not properly reflected in the PWM output.
>
> Otherwise, this patch looks reasonable so
>
> Reviewed-by: Tim Kryger <tim.kryger@gmail.com>

Fine with me. Same for the two comments below. I will re-spin once I can
see Thierry's modification to use pwmchip_add_with_polarity() instead of
pwmchip_add_inversed().

Thanks.

>
>>
>> Reviewed-by: Arun Ramamurthy <arunrama@broadcom.com>
>> Reviewed-by: Scott Branden <sbranden@broadcom.com>
>> Tested-by: Scott Branden <sbranden@broadcom.com>
>> Signed-off-by: Jonathan Richardson <jonathar@broadcom.com>
>> ---
>> drivers/pwm/pwm-bcm-kona.c | 47 +++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 38 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/pwm/pwm-bcm-kona.c b/drivers/pwm/pwm-bcm-kona.c
>> index 32b3ec6..c87621f 100644
>> --- a/drivers/pwm/pwm-bcm-kona.c
>> +++ b/drivers/pwm/pwm-bcm-kona.c
>> @@ -76,19 +76,36 @@ static inline struct kona_pwmc *to_kona_pwmc(struct pwm_chip *_chip)
>> return container_of(_chip, struct kona_pwmc, chip);
>> }
>>
>> -static void kona_pwmc_apply_settings(struct kona_pwmc *kp, unsigned int chan)
>> +/*
>> + * Clear trigger bit but set smooth bit to maintain old output.
>> + */
>> +static void kona_pwmc_prepare_for_settings(struct kona_pwmc *kp,
>> + unsigned int chan)
>> {
>> unsigned int value = readl(kp->base + PWM_CONTROL_OFFSET);
>>
>> - /* Clear trigger bit but set smooth bit to maintain old output */
>> value |= 1 << PWM_CONTROL_SMOOTH_SHIFT(chan);
>> value &= ~(1 << PWM_CONTROL_TRIGGER_SHIFT(chan));
>> writel(value, kp->base + PWM_CONTROL_OFFSET);
>>
>> + /*
>> + * There must be a min 400ns delay between clearing enable and setting
>> + * it. Failing to do this may result in no PWM signal.
>> + */
>> + ndelay(400);
>> +}
>
> Since it doesn't function as an enable, please call it the trigger bit.
>
>> +
>> +static void kona_pwmc_apply_settings(struct kona_pwmc *kp, unsigned int chan)
>> +{
>> + unsigned int value = readl(kp->base + PWM_CONTROL_OFFSET);
>> +
>> /* Set trigger bit and clear smooth bit to apply new settings */
>> value &= ~(1 << PWM_CONTROL_SMOOTH_SHIFT(chan));
>> value |= 1 << PWM_CONTROL_TRIGGER_SHIFT(chan);
>> writel(value, kp->base + PWM_CONTROL_OFFSET);
>> +
>> + /* PWMOUT_ENABLE must be held high for at least 400 ns. */
>> + ndelay(400);
>> }
>
> Same here.
>
>>
>> static int kona_pwmc_config(struct pwm_chip *chip, struct pwm_device *pwm,
>> @@ -133,8 +150,14 @@ static int kona_pwmc_config(struct pwm_chip *chip, struct pwm_device *pwm,
>> return -EINVAL;
>> }
>>
>> - /* If the PWM channel is enabled, write the settings to the HW */
>> + /*
>> + * Don't apply settings if disabled. The period and duty cycle are
>> + * always calculated above to ensure the new values are
>> + * validated immediately instead of on enable.
>> + */
>> if (test_bit(PWMF_ENABLED, &pwm->flags)) {
>> + kona_pwmc_prepare_for_settings(kp, chan);
>> +
>> value = readl(kp->base + PRESCALE_OFFSET);
>> value &= ~PRESCALE_MASK(chan);
>> value |= prescale << PRESCALE_SHIFT(chan);
>> @@ -164,6 +187,8 @@ static int kona_pwmc_set_polarity(struct pwm_chip *chip, struct pwm_device *pwm,
>> return ret;
>> }
>>
>> + kona_pwmc_prepare_for_settings(kp, chan);
>> +
>> value = readl(kp->base + PWM_CONTROL_OFFSET);
>>
>> if (polarity == PWM_POLARITY_NORMAL)
>> @@ -175,9 +200,6 @@ static int kona_pwmc_set_polarity(struct pwm_chip *chip, struct pwm_device *pwm,
>>
>> kona_pwmc_apply_settings(kp, chan);
>>
>> - /* Wait for waveform to settle before gating off the clock */
>> - ndelay(400);
>> -
>> clk_disable_unprepare(kp->clk);
>>
>> return 0;
>> @@ -207,13 +229,20 @@ static void kona_pwmc_disable(struct pwm_chip *chip, struct pwm_device *pwm)
>> {
>> struct kona_pwmc *kp = to_kona_pwmc(chip);
>> unsigned int chan = pwm->hwpwm;
>> + unsigned int value;
>> +
>> + kona_pwmc_prepare_for_settings(kp, chan);
>>
>> /* Simulate a disable by configuring for zero duty */
>> writel(0, kp->base + DUTY_CYCLE_HIGH_OFFSET(chan));
>> - kona_pwmc_apply_settings(kp, chan);
>> + writel(0, kp->base + PERIOD_COUNT_OFFSET(chan));
>>
>> - /* Wait for waveform to settle before gating off the clock */
>> - ndelay(400);
>> + /* Set prescale to 0 for this channel */
>> + value = readl(kp->base + PRESCALE_OFFSET);
>> + value &= ~PRESCALE_MASK(chan);
>> + writel(value, kp->base + PRESCALE_OFFSET);
>> +
>> + kona_pwmc_apply_settings(kp, chan);
>>
>> clk_disable_unprepare(kp->clk);
>> }
>> --
>> 1.7.9.5
>>
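
The sequence discussed in the review (clear the trigger bit, wait 400ns, write the settings, set the trigger bit, wait another 400ns) can be modelled in userspace as below. This is a sketch under stated assumptions, not driver code: the register is a plain variable, the bit positions in SMOOTH_SHIFT() and TRIGGER_SHIFT() are assumed for illustration, and fake_ndelay() only counts the requested delay instead of waiting.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout, for illustration only. */
#define SMOOTH_SHIFT(chan)	(24 + (chan))
#define TRIGGER_SHIFT(chan)	(chan)
#define NUM_CHANNELS		6

/* Stand-ins for the memory-mapped registers. */
static uint32_t fake_control_reg;
static uint32_t fake_duty_reg[NUM_CHANNELS];

/* Records the total delay requested, instead of really waiting. */
static unsigned long total_delay_ns;

static void fake_ndelay(unsigned long ns)
{
	total_delay_ns += ns;
}

/* Step 1: clear the trigger bit, set the smooth bit, then wait 400ns. */
void prepare_for_settings(unsigned int chan)
{
	assert(chan < NUM_CHANNELS);
	fake_control_reg |= 1u << SMOOTH_SHIFT(chan);
	fake_control_reg &= ~(1u << TRIGGER_SHIFT(chan));
	fake_ndelay(400);
}

/* Step 3: set the trigger bit, clear the smooth bit, then wait 400ns. */
void apply_settings(unsigned int chan)
{
	assert(chan < NUM_CHANNELS);
	fake_control_reg &= ~(1u << SMOOTH_SHIFT(chan));
	fake_control_reg |= 1u << TRIGGER_SHIFT(chan);
	fake_ndelay(400);
}

/* Full update: prepare, write the new setting (step 2), apply. */
void update_duty(unsigned int chan, uint32_t duty)
{
	prepare_for_settings(chan);
	fake_duty_reg[chan] = duty;
	apply_settings(chan);
}

uint32_t read_control(void)
{
	return fake_control_reg;
}

unsigned long delay_so_far_ns(void)
{
	return total_delay_ns;
}
```

Splitting the update into a prepare step and an apply step mirrors the patch, where every settings write sits between kona_pwmc_prepare_for_settings() and kona_pwmc_apply_settings().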


Re: [PATCH v8 4/5] pwm: kona: Add debug info to config function

On 15-05-30 09:42 AM, Tim Kryger wrote:
> On Tue, May 26, 2015 at 1:08 PM, Jonathan Richardson
> <jonathar@broadcom.com> wrote:
>> Adds debugging info to config function where duty cycle and period
>> are calculated and verified.
>>
>> Signed-off-by: Jonathan Richardson <jonathar@broadcom.com>
>> ---
>> drivers/pwm/pwm-bcm-kona.c | 25 +++++++++++++++++++++++--
>> 1 file changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pwm/pwm-bcm-kona.c b/drivers/pwm/pwm-bcm-kona.c
>> index c87621f..0ddf19b 100644
>> --- a/drivers/pwm/pwm-bcm-kona.c
>> +++ b/drivers/pwm/pwm-bcm-kona.c
>> @@ -138,18 +138,39 @@ static int kona_pwmc_config(struct pwm_chip *chip, struct pwm_device *pwm,
>> dc = div64_u64(val, div);
>>
>> /* If duty_ns or period_ns are not achievable then return */
>> - if (pc < PERIOD_COUNT_MIN || dc < DUTY_CYCLE_HIGH_MIN)
>
> The original code was based on the SPEAr PWM driver which has a non-zero
> PWMDCR_MIN_DUTY such that the second condition here can evaluate to true.
>
> This isn't the case for the Kona PWM where DUTY_CYCLE_HIGH_MIN is zero.
>
>> + if (pc < PERIOD_COUNT_MIN) {
>> + dev_warn(chip->dev,
>> + "%s: pwm[%d]: period=%d is not achievable, pc=%lu, prescale=%lu\n",
>> + __func__, chan, period_ns, pc, prescale);
>> return -EINVAL;
>> + }
>
> Why not just print the minimum allowable period with the provided clock?
>
> I don't think pc and prescale will be particularly helpful to users.
>
> Also, do we really need to print __func__ here?
>
>> +
>> + if (dc < DUTY_CYCLE_HIGH_MIN) {
>> + if (0 != duty_ns) {
>> + dev_warn(chip->dev,
>> + "%s: pwm[%d]: duty cycle=%d is not achievable, dc=%lu, prescale=%lu\n",
>> + __func__, chan, duty_ns, dc, prescale);
>> + }
>> + return -EINVAL;
>> + }
>
> The above block is unreachable code.
>
>>
>> /* If pc and dc are in bounds, the calculation is done */
>> if (pc <= PERIOD_COUNT_MAX && dc <= DUTY_CYCLE_HIGH_MAX)
>> break;
>>
>> /* Otherwise, increase prescale and recalculate pc and dc */
>> - if (++prescale > PRESCALE_MAX)
>> + if (++prescale > PRESCALE_MAX) {
>> + dev_warn(chip->dev,
>> + "%s: pwm[%d]: Prescale (=%lu) within max (=%d) for period=%d and duty cycle=%d is not achievable\n",
>> + __func__, chan, prescale, PRESCALE_MAX,
>> + period_ns, duty_ns);
>> return -EINVAL;
>> + }
>> }
>
> The user got here because they specified a period larger than the maximum
> supported so why not tell them largest value that can be supported instead
> of confusing them with prescale and PRESCALE_MAX?
>
>>
>> + dev_dbg(chip->dev, "pwm[%d]: period=%lu, duty_high=%lu, prescale=%lu\n",
>> + chan, pc, dc, prescale);
>> +
>
> This could be more clear. It prints pc but calls it period.
>
>> /*
>> * Don't apply settings if disabled. The period and duty cycle are
>> * always calculated above to ensure the new values are
>> --
>> 1.7.9.5
>>

We can defer this patch until I can look into it further. It's not a
priority. I'm more concerned with the core changes and the kona pwm fix.

Thanks,
Jon
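
The review suggestion to report the minimum and maximum achievable period, rather than pc and prescale, could look roughly like the sketch below. It is a userspace model of the prescale search shown in the quoted hunk. The limit values in the macros are assumptions for illustration, and min_period_ns(), max_period_ns() and calc_counts() are hypothetical helper names, not functions from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed hardware limits, for illustration only. */
#define PERIOD_COUNT_MIN	0x2ULL
#define PERIOD_COUNT_MAX	0xffffffULL
#define DUTY_CYCLE_HIGH_MAX	0xffffffULL
#define PRESCALE_MAX		0x7ULL
#define NSEC_PER_SEC		1000000000ULL

/* Smallest period (ns) that still yields PERIOD_COUNT_MIN at prescale 0. */
uint64_t min_period_ns(uint64_t rate_hz)
{
	assert(rate_hz > 0);
	/* Round up so that the returned period is itself achievable. */
	return (PERIOD_COUNT_MIN * NSEC_PER_SEC + rate_hz - 1) / rate_hz;
}

/* Largest period (ns) that fits PERIOD_COUNT_MAX at the largest prescale. */
uint64_t max_period_ns(uint64_t rate_hz)
{
	assert(rate_hz > 0);
	return PERIOD_COUNT_MAX * (PRESCALE_MAX + 1) * NSEC_PER_SEC / rate_hz;
}

/*
 * Search for the smallest prescale whose period and duty counts fit.
 * Note: rate_hz * period_ns is assumed not to overflow 64 bits here.
 */
int calc_counts(uint64_t rate_hz, uint64_t period_ns, uint64_t duty_ns,
		uint64_t *pc, uint64_t *dc, uint64_t *prescale)
{
	uint64_t p;

	for (p = 0; p <= PRESCALE_MAX; p++) {
		uint64_t div = NSEC_PER_SEC * (1 + p);

		*pc = rate_hz * period_ns / div;
		*dc = rate_hz * duty_ns / div;

		if (*pc < PERIOD_COUNT_MIN)
			return -EINVAL;	/* below min_period_ns(rate_hz) */

		if (*pc <= PERIOD_COUNT_MAX && *dc <= DUTY_CYCLE_HIGH_MAX) {
			*prescale = p;
			return 0;
		}
	}

	return -EINVAL;			/* above max_period_ns(rate_hz) */
}
```

With the two bounds available, a warning can state the supported range in nanoseconds, which is the unit the caller passed in. For a 1 MHz clock under these assumed limits, the range would be 2000 ns to 134217720000 ns.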



Re: next-20150610 - repeated hangs at e1000e_phc_gettime+0x2e/0x60

On Thu, 11 Jun 2015 22:57:48 -0400, Valdis Kletnieks said:

> 0) next-20150603 works, so the problem landed in linux-next in the last week.
>
> 1) All 3 times happened while I was at home, using wireless, so
> the interface didn't have link and was ifconfig'ed down.

All 3 crashes happened at almost exactly 4 hours of uptime, but here
in my office I'm now at 6 hours on the same kernel with the interface
plugged in and passing traffic.

I have a fighting chance of mostly finishing a bisect over the weekend;
I'll let you know where that leads.

Re: [PATCH tip/perf/core] tools lib traceevent: Fix python/perf.so compiling error

On Fri, Jun 12, 2015 at 03:17:11AM +0000, Wang Nan wrote:
> 'make build-test' finds that make_python_perf_so fails because
> libtraceevent-dynamic-list is missing:
>
> '.../python2' util/setup.py \
> --quiet build_ext; \
> mkdir -p python && \
> cp python_ext_build/lib/perf.so python/
> /path/to/ld: cannot open linker script file /path/to/kernel/tools/lib/traceevent/libtraceevent-dynamic-list: No such file or directory
> collect2: error: ld returned 1 exit status
> error: command 'x86_64-linux-gcc' failed with exit status 1
> cp: cannot stat 'python_ext_build/lib/perf.so': No such file or directory
> make[3]: *** [python/perf.so] Error 1
> make[2]: *** [python/perf.so] Error 2
> test: test -f ./python/perf.so
> make[1]: *** [make_python_perf_so] Error 1
> make: *** [build-test] Error 2
> make: Leaving directory `/path/to/kernel/tools/perf'
>
> This is caused by commit e3d09ec8126fe2c9a3ade661e2126e215ca27a80
> ("tools lib traceevent: Export dynamic symbols used by traceevent
> plugins"), which adds the list file to LDFLAGS but does not add it
> to the dependency list of python/perf.so.
>
> This patch fixes this problem.
>
> Signed-off-by: Wang Nan <wangnan0@huawei.com>

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

[PATCH net-next 0/3] bpf: share helpers between tracing and networking

Introduce new helpers to access 'struct task_struct'->pid, tgid, uid, gid, comm
fields in tracing and networking.

Share bpf_trace_printk() and bpf_get_smp_processor_id() helpers between
tracing and networking.

Alexei Starovoitov (3):
bpf: introduce current->pid, tgid, uid, gid, comm accessors
bpf: allow networking programs to use bpf_trace_printk() for
debugging
bpf: let kprobe programs use bpf_get_smp_processor_id() helper

include/linux/bpf.h | 4 +++
include/uapi/linux/bpf.h | 19 +++++++++++++
kernel/bpf/core.c | 7 +++++
kernel/bpf/helpers.c | 58 ++++++++++++++++++++++++++++++++++++++
kernel/trace/bpf_trace.c | 28 ++++++++++++------
net/core/filter.c | 8 ++++++
samples/bpf/bpf_helpers.h | 6 ++++
samples/bpf/tracex2_kern.c | 24 ++++++++++++----
samples/bpf/tracex2_user.c | 67 ++++++++++++++++++++++++++++++++++++++------
9 files changed, 199 insertions(+), 22 deletions(-)

--
1.7.9.5


[PATCH net-next 2/3] bpf: allow networking programs to use bpf_trace_printk() for debugging

bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note that it is a DEBUG ONLY helper. If a program uses it,
the kernel prints a warning banner to make sure users don't use
it in production.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
include/linux/bpf.h | 1 +
kernel/bpf/core.c | 4 ++++
kernel/trace/bpf_trace.c | 20 ++++++++++++--------
net/core/filter.c | 2 ++
4 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1b9a3f5b27f6..4383476a0d48 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -150,6 +150,7 @@ struct bpf_array {
u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
void bpf_prog_array_map_clear(struct bpf_map *map);
bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
+const struct bpf_func_proto *bpf_get_trace_printk_proto(void);

#ifdef CONFIG_BPF_SYSCALL
void bpf_register_prog_type(struct bpf_prog_type_list *tl);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1fc45cc83076..c5bedc82bc1c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -733,6 +733,10 @@ const struct bpf_func_proto bpf_ktime_get_ns_proto __weak;
const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak;
const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
const struct bpf_func_proto bpf_get_current_comm_proto __weak;
+const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
+{
+ return NULL;
+}

/* Always built-in helper functions. */
const struct bpf_func_proto bpf_tail_call_proto = {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3a17638cdf46..4f9b5d41869b 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -147,6 +147,17 @@ static const struct bpf_func_proto bpf_trace_printk_proto = {
.arg2_type = ARG_CONST_STACK_SIZE,
};

+const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
+{
+ /*
+ * this program might be calling bpf_trace_printk,
+ * so allocate per-cpu printk buffers
+ */
+ trace_printk_init_buffers();
+
+ return &bpf_trace_printk_proto;
+}
+
static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
@@ -168,15 +179,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_get_current_uid_gid_proto;
case BPF_FUNC_get_current_comm:
return &bpf_get_current_comm_proto;
-
case BPF_FUNC_trace_printk:
- /*
- * this program might be calling bpf_trace_printk,
- * so allocate per-cpu printk buffers
- */
- trace_printk_init_buffers();
-
- return &bpf_trace_printk_proto;
+ return bpf_get_trace_printk_proto();
default:
return NULL;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index 20aa51ccbf9d..65ff107d3d29 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1442,6 +1442,8 @@ sk_filter_func_proto(enum bpf_func_id func_id)
return &bpf_tail_call_proto;
case BPF_FUNC_ktime_get_ns:
return &bpf_ktime_get_ns_proto;
+ case BPF_FUNC_trace_printk:
+ return bpf_get_trace_printk_proto();
default:
return NULL;
}
--
1.7.9.5
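
The __weak definition in kernel/bpf/core.c gives callers a fallback that returns NULL when no other object file provides bpf_get_trace_printk_proto(). The same pattern can be sketched in userspace with the GCC/Clang weak attribute. Every name below (struct func_proto, enum func_id, lookup_proto() and so on) is a hypothetical stand-in, and the sketch only exercises the fallback path because it is a single translation unit.

```c
#include <assert.h>
#include <stddef.h>

struct func_proto {
	const char *name;
};

enum func_id {
	FUNC_KTIME_GET_NS,
	FUNC_TRACE_PRINTK,
	FUNC_MAX,
};

static const struct func_proto ktime_get_ns_proto = {
	.name = "ktime_get_ns",
};

/*
 * Weak default: the linker uses this only when no other object file
 * supplies a strong definition of the same symbol.
 */
__attribute__((weak)) const struct func_proto *get_trace_printk_proto(void)
{
	return NULL;
}

/* Lookup in the style of a get_func_proto() callback. */
const struct func_proto *lookup_proto(enum func_id id)
{
	switch (id) {
	case FUNC_KTIME_GET_NS:
		return &ktime_get_ns_proto;
	case FUNC_TRACE_PRINTK:
		/* NULL here means "helper not available in this build". */
		return get_trace_printk_proto();
	default:
		return NULL;
	}
}
```

Because an unknown helper id already yields NULL, a caller that handles the default case correctly needs no extra code for the weak fallback. Linking a second object file with a non-weak get_trace_printk_proto() would replace the default without touching lookup_proto().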


[PATCH net-next 3/3] bpf: let kprobe programs use bpf_get_smp_processor_id() helper

It's useful for building per-cpu histograms.

Suggested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
kernel/trace/bpf_trace.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4f9b5d41869b..88a041adee90 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -181,6 +181,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_get_current_comm_proto;
case BPF_FUNC_trace_printk:
return bpf_get_trace_printk_proto();
+ case BPF_FUNC_get_smp_processor_id:
+ return &bpf_get_smp_processor_id_proto;
default:
return NULL;
}
--
1.7.9.5
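
As a rough illustration of the per-cpu histogram use case named in the commit message, the userspace sketch below keeps one log2 histogram per CPU and indexes it with a CPU number. It is not eBPF code: in a real program the index would come from bpf_get_smp_processor_id() and the storage would be a BPF map. MAX_CPUS, MAX_SLOTS and the function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS	8
#define MAX_SLOTS	64

/* One row of log2 buckets per CPU. */
static uint64_t hist[MAX_CPUS][MAX_SLOTS];

/* floor(log2(v)) for v > 0; returns 0 for v == 0. */
unsigned int log2_slot(uint64_t v)
{
	unsigned int slot = 0;

	while (v >>= 1)
		slot++;
	return slot;
}

/* Account one sample in the histogram that belongs to the given CPU. */
void record_sample(unsigned int cpu, uint64_t value)
{
	assert(cpu < MAX_CPUS);
	hist[cpu][log2_slot(value)]++;
}

uint64_t slot_count(unsigned int cpu, unsigned int slot)
{
	assert(cpu < MAX_CPUS && slot < MAX_SLOTS);
	return hist[cpu][slot];
}
```

Keying the rows by CPU means samples taken on different CPUs never update the same counter, which is the usual reason to split a histogram this way.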


Re: perf,ftrace: fuzzer triggers warning in trace_events_filter code

On Fri, 12 Jun 2015 17:18:22 -0400 (EDT)
Vince Weaver <vincent.weaver@maine.edu> wrote:

>
> So I've modified my fuzzer to try to exercise the
> PERF_EVENT_IOC_SET_FILTER ioctl() and it is starting to turn up some
> warnings.

Is there any way to know what filter string you used to generate
this?

-- Steve

>
> For example, this one:
>
> [28509.873731] ------------[ cut here ]------------
> [28509.879188] WARNING: CPU: 1 PID: 9572 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x4f2/0x9b0()
> [28509.890174] Modules linked in: fuse x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek intel_rapl iosf_mbi snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel coretemp snd_hda_controller kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hda_core aesni_intel snd_hwdep tpm_tis ppdev i915 iTCO_wdt evdev iTCO_vendor_support snd_pcm aes_x86_64 snd_timer lrw snd tpm gf128mul soundcore glue_helper ablk_helper cryptd psmouse drm_kms_helper lpc_ich serio_raw pcspkr parport_pc mei_me mfd_core parport mei drm battery i2c_i801 video i2c_algo_bit wmi processor button sg sr_mod sd_mod cdrom ehci_pci ehci_hcd xhci_pci ahci xhci_hcd libahci libata e1000e crc32c_intel ptp fan scsi_mod usbcore pps_core usb_common thermal thermal_sys
> [28509.967457] CPU: 1 PID: 9572 Comm: perf_fuzzer Tainted: G W 4.1.0-rc7+ #155
> [28509.976717] Hardware name: LENOVO 10AM000AUS/SHARKBAY, BIOS FBKT72AUS 01/26/2014
> [28509.985188] ffffffff81a1abb0 ffff8800ce757cb8 ffffffff816d7229 0000000000000000
> [28509.993795] 0000000000000000 ffff8800ce757cf8 ffffffff81072eba 0000000000000160
> [28510.002406] ffff8800cda26208 ffff8800364e4a90 0000000000000000 ffff8800cda26200
> [28510.010990] Call Trace:
> [28510.014189] [<ffffffff816d7229>] dump_stack+0x45/0x57
> [28510.020242] [<ffffffff81072eba>] warn_slowpath_common+0x8a/0xc0
> [28510.027171] [<ffffffff81072faa>] warn_slowpath_null+0x1a/0x20
> [28510.033947] [<ffffffff8114b3c2>] replace_preds+0x4f2/0x9b0
> [28510.040401] [<ffffffff8114c213>] ? ftrace_profile_set_filter+0x23/0x100
> [28510.048083] [<ffffffff8114b902>] create_filter+0x82/0xb0
> [28510.054381] [<ffffffff8114c244>] ftrace_profile_set_filter+0x54/0x100
> [28510.061831] [<ffffffff8119088b>] ? strndup_user+0x4b/0xc0
> [28510.068170] [<ffffffff811661c0>] perf_ioctl+0x170/0x4d0
> [28510.074356] [<ffffffff81202270>] do_vfs_ioctl+0x2e0/0x4e0
> [28510.080681] [<ffffffff81168305>] ? __perf_sw_event+0x65/0xa0
> [28510.087299] [<ffffffff8106312d>] ? __do_page_fault+0x2ad/0x460
> [28510.094105] [<ffffffff812024f1>] SyS_ioctl+0x81/0xa0
> [28510.099983] [<ffffffff816df172>] system_call_fastpath+0x16/0x7a
> [28510.106857] ---[ end trace 2ea55cf8a8b076c3 ]---
>
> This corresponds to
> 	/* Make sure the stack is empty */
> 	pred = __pop_pred_stack(&stack);
> 	if (WARN_ON(pred)) {
> 		err = -EINVAL;
> 		filter->root = NULL;
> 		goto fail;
> 	}
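
One way a fuzzer could make the filter string recoverable is to log it immediately before issuing the ioctl. The sketch below is a hypothetical wrapper, not code from perf_fuzzer: set_filter_logged() and log_escaped() are invented names. Only ioctl() and PERF_EVENT_IOC_SET_FILTER come from the real interface. Non-printable bytes are hex-escaped because fuzzer input need not be text.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/* Write s to the log, hex-escaping anything that is not plain printable. */
static void log_escaped(FILE *log, const char *s)
{
	for (; *s; s++) {
		unsigned char c = (unsigned char)*s;

		if (isprint(c) && c != '\\' && c != '"')
			fputc(c, log);
		else
			fprintf(log, "\\x%02x", c);
	}
}

/*
 * Log the filter string, flush it, then issue the ioctl.
 * Returns whatever ioctl() returns; errno is left as ioctl() set it.
 */
int set_filter_logged(int fd, const char *filter, FILE *log)
{
	assert(filter != NULL && log != NULL);

	fprintf(log, "PERF_EVENT_IOC_SET_FILTER fd=%d filter=\"", fd);
	log_escaped(log, filter);
	fputs("\"\n", log);

	/* Flush before the call so the line is not lost in a stdio buffer. */
	fflush(log);

	return ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter);
}
```

fflush() only hands the data to the kernel. If the machine can hang outright, the log would additionally need fsync() or a remote or serial destination, which this sketch does not attempt.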


[PATCH net-next 1/3] bpf: introduce current->pid, tgid, uid, gid, comm accessors

eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:

u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid

u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid

bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf

They can be used from the programs attached to TC as well to classify packets
based on current task fields.

Update the tracex2 example to print a histogram of write syscalls for each
process instead of one aggregated over all processes.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
These helpers will be mainly used by bpf+tracing, but the patch is targeting
net-next tree to minimize merge conflicts and they're useful in TC too.

The feature was requested by Wang Nan <wangnan0@huawei.com> and
Brendan Gregg <brendan.d.gregg@gmail.com>
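
For readers of the helper descriptions above, this is a small userspace sketch of how the packed return value of bpf_get_current_pid_tgid() can be taken apart, and of what a size-limited copy of comm implies. bpf_get_current_uid_gid() uses the same layout, with the gid in the upper half and the uid in the lower half. The function names are hypothetical, and TASK_COMM_LEN is restated here on the assumption that it is 16, as in the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TASK_COMM_LEN 16	/* assumed; restated for a self-contained build */

/* Pack as described in the patch: tgid in the upper 32 bits, pid below. */
uint64_t pack_pid_tgid(uint32_t tgid, uint32_t pid)
{
	return (uint64_t)tgid << 32 | pid;
}

uint32_t tgid_of(uint64_t pid_tgid)
{
	return (uint32_t)(pid_tgid >> 32);
}

uint32_t pid_of(uint64_t pid_tgid)
{
	return (uint32_t)pid_tgid;	/* truncation keeps the lower 32 bits */
}

/*
 * Copy at most size bytes of comm into buf, like the memcpy() in the patch.
 * No terminator is added, so buf is a C string only if size covers the
 * terminator that is already inside comm.
 */
size_t copy_comm(char *buf, size_t size, const char comm[TASK_COMM_LEN])
{
	size_t n = size < TASK_COMM_LEN ? size : TASK_COMM_LEN;

	assert(buf != NULL && comm != NULL);
	memcpy(buf, comm, n);
	return n;
}
```

A program that wants to filter on a single thread would compare pid_of(), while one that wants the whole process would compare tgid_of().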

include/linux/bpf.h | 3 ++
include/uapi/linux/bpf.h | 19 +++++++++++++
kernel/bpf/core.c | 3 ++
kernel/bpf/helpers.c | 58 ++++++++++++++++++++++++++++++++++++++
kernel/trace/bpf_trace.c | 6 ++++
net/core/filter.c | 6 ++++
samples/bpf/bpf_helpers.h | 6 ++++
samples/bpf/tracex2_kern.c | 24 ++++++++++++----
samples/bpf/tracex2_user.c | 67 ++++++++++++++++++++++++++++++++++++++------
9 files changed, 178 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2235aee8096a..1b9a3f5b27f6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -188,5 +188,8 @@ extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
extern const struct bpf_func_proto bpf_tail_call_proto;
extern const struct bpf_func_proto bpf_ktime_get_ns_proto;
+extern const struct bpf_func_proto bpf_get_current_pid_tgid_proto;
+extern const struct bpf_func_proto bpf_get_current_uid_gid_proto;
+extern const struct bpf_func_proto bpf_get_current_comm_proto;

#endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 602f05b7a275..29ef6f99e43d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -230,6 +230,25 @@ enum bpf_func_id {
* Return: 0 on success
*/
BPF_FUNC_clone_redirect,
+
+ /**
+ * u64 bpf_get_current_pid_tgid(void)
+ * Return: current->tgid << 32 | current->pid
+ */
+ BPF_FUNC_get_current_pid_tgid,
+
+ /**
+ * u64 bpf_get_current_uid_gid(void)
+ * Return: current_gid << 32 | current_uid
+ */
+ BPF_FUNC_get_current_uid_gid,
+
+ /**
+ * bpf_get_current_comm(char *buf, int size_of_buf)
+ * stores current->comm into buf
+ * Return: 0 on success
+ */
+ BPF_FUNC_get_current_comm,
__BPF_FUNC_MAX_ID,
};

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1e00aa3316dc..1fc45cc83076 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -730,6 +730,9 @@ const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
const struct bpf_func_proto bpf_ktime_get_ns_proto __weak;
+const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak;
+const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
+const struct bpf_func_proto bpf_get_current_comm_proto __weak;

/* Always built-in helper functions. */
const struct bpf_func_proto bpf_tail_call_proto = {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 7ad5d8842d5b..d1dce346c56f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -14,6 +14,8 @@
#include <linux/random.h>
#include <linux/smp.h>
#include <linux/ktime.h>
+#include <linux/sched.h>
+#include <linux/uidgid.h>

/* If kernel subsystem is allowing eBPF programs to call this function,
* inside its own verifier_ops->get_func_proto() callback it should return
@@ -124,3 +126,59 @@ const struct bpf_func_proto bpf_ktime_get_ns_proto = {
.gpl_only = true,
.ret_type = RET_INTEGER,
};
+
+static u64 bpf_get_current_pid_tgid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct task_struct *task = current;
+
+ if (!task)
+ return -EINVAL;
+
+ return (u64) task->tgid << 32 | task->pid;
+}
+
+const struct bpf_func_proto bpf_get_current_pid_tgid_proto = {
+ .func = bpf_get_current_pid_tgid,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+};
+
+static u64 bpf_get_current_uid_gid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+ struct task_struct *task = current;
+ kuid_t uid;
+ kgid_t gid;
+
+ if (!task)
+ return -EINVAL;
+
+ current_uid_gid(&uid, &gid);
+ return (u64) from_kgid(current_user_ns(), gid) << 32 |
+ from_kuid(current_user_ns(), uid);
+}
+
+const struct bpf_func_proto bpf_get_current_uid_gid_proto = {
+ .func = bpf_get_current_uid_gid,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+};
+
+static u64 bpf_get_current_comm(u64 r1, u64 size, u64 r3, u64 r4, u64 r5)
+{
+ struct task_struct *task = current;
+ char *buf = (char *) (long) r1;
+
+ if (!task)
+ return -EINVAL;
+
+ memcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));
+ return 0;
+}
+
+const struct bpf_func_proto bpf_get_current_comm_proto = {
+ .func = bpf_get_current_comm,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_STACK,
+ .arg2_type = ARG_CONST_STACK_SIZE,
+};
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 50c4015a8ad3..3a17638cdf46 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -162,6 +162,12 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_ktime_get_ns_proto;
case BPF_FUNC_tail_call:
return &bpf_tail_call_proto;
+ case BPF_FUNC_get_current_pid_tgid:
+ return &bpf_get_current_pid_tgid_proto;
+ case BPF_FUNC_get_current_uid_gid:
+ return &bpf_get_current_uid_gid_proto;
+ case BPF_FUNC_get_current_comm:
+ return &bpf_get_current_comm_proto;

case BPF_FUNC_trace_printk:
/*
diff --git a/net/core/filter.c b/net/core/filter.c
index d271c06bf01f..20aa51ccbf9d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1459,6 +1459,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_l4_csum_replace_proto;
case BPF_FUNC_clone_redirect:
return &bpf_clone_redirect_proto;
+ case BPF_FUNC_get_current_pid_tgid:
+ return &bpf_get_current_pid_tgid_proto;
+ case BPF_FUNC_get_current_uid_gid:
+ return &bpf_get_current_uid_gid_proto;
+ case BPF_FUNC_get_current_comm:
+ return &bpf_get_current_comm_proto;
default:
return sk_filter_func_proto(func_id);
}
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f531a0b3282d..bdf1c1607b80 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -25,6 +25,12 @@ static void (*bpf_tail_call)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_tail_call;
static unsigned long long (*bpf_get_smp_processor_id)(void) =
(void *) BPF_FUNC_get_smp_processor_id;
+static unsigned long long (*bpf_get_current_pid_tgid)(void) =
+ (void *) BPF_FUNC_get_current_pid_tgid;
+static unsigned long long (*bpf_get_current_uid_gid)(void) =
+ (void *) BPF_FUNC_get_current_uid_gid;
+static int (*bpf_get_current_comm)(void *buf, int buf_size) =
+ (void *) BPF_FUNC_get_current_comm;

/* llvm builtin functions that eBPF C program may use to
* emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/tracex2_kern.c b/samples/bpf/tracex2_kern.c
index 19ec1cfc45db..dc50f4f2943f 100644
--- a/samples/bpf/tracex2_kern.c
+++ b/samples/bpf/tracex2_kern.c
@@ -62,11 +62,18 @@ static unsigned int log2l(unsigned long v)
return log2(v);
}

+struct hist_key {
+ char comm[16];
+ u64 pid_tgid;
+ u64 uid_gid;
+ u32 index;
+};
+
struct bpf_map_def SEC("maps") my_hist_map = {
- .type = BPF_MAP_TYPE_ARRAY,
- .key_size = sizeof(u32),
+ .type = BPF_MAP_TYPE_HASH,
+ .key_size = sizeof(struct hist_key),
.value_size = sizeof(long),
- .max_entries = 64,
+ .max_entries = 1024,
};

SEC("kprobe/sys_write")
@@ -75,11 +82,18 @@ int bpf_prog3(struct pt_regs *ctx)
long write_size = ctx->dx; /* arg3 */
long init_val = 1;
long *value;
- u32 index = log2l(write_size);
+ struct hist_key key = {};
+
+ key.index = log2l(write_size);
+ key.pid_tgid = bpf_get_current_pid_tgid();
+ key.uid_gid = bpf_get_current_uid_gid();
+ bpf_get_current_comm(&key.comm, sizeof(key.comm));

- value = bpf_map_lookup_elem(&my_hist_map, &index);
+ value = bpf_map_lookup_elem(&my_hist_map, &key);
if (value)
__sync_fetch_and_add(value, 1);
+ else
+ bpf_map_update_elem(&my_hist_map, &key, &init_val, BPF_ANY);
return 0;
}
char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex2_user.c b/samples/bpf/tracex2_user.c
index 91b8d0896fbb..cd0241c1447a 100644
--- a/samples/bpf/tracex2_user.c
+++ b/samples/bpf/tracex2_user.c
@@ -3,6 +3,7 @@
#include <stdlib.h>
#include <signal.h>
#include <linux/bpf.h>
+#include <string.h>
#include "libbpf.h"
#include "bpf_load.h"

@@ -20,23 +21,42 @@ static void stars(char *str, long val, long max, int width)
 str[i] = '\0';
}

-static void print_hist(int fd)
+struct task {
+ char comm[16];
+ __u64 pid_tgid;
+ __u64 uid_gid;
+};
+
+struct hist_key {
+ struct task t;
+ __u32 index;
+};
+
+#define SIZE sizeof(struct task)
+
+static void print_hist_for_pid(int fd, void *task)
{
- int key;
+ struct hist_key key = {}, next_key;
+ char starstr[MAX_STARS];
long value;
long data[MAX_INDEX] = {};
- char starstr[MAX_STARS];
- int i;
int max_ind = -1;
long max_value = 0;
+ int i, ind;

- for (key = 0; key < MAX_INDEX; key++) {
- bpf_lookup_elem(fd, &key, &value);
- data[key] = value;
- if (value && key > max_ind)
- max_ind = key;
+ while (bpf_get_next_key(fd, &key, &next_key) == 0) {
+ if (memcmp(&next_key, task, SIZE)) {
+ key = next_key;
+ continue;
+ }
+ bpf_lookup_elem(fd, &next_key, &value);
+ ind = next_key.index;
+ data[ind] = value;
+ if (value && ind > max_ind)
+ max_ind = ind;
if (value > max_value)
max_value = value;
+ key = next_key;
}

printf(" syscall write() stats\n");
@@ -48,6 +68,35 @@ static void print_hist(int fd)
MAX_STARS, starstr);
}
}
+
+static void print_hist(int fd)
+{
+ struct hist_key key = {}, next_key;
+ static struct task tasks[1024];
+ int task_cnt = 0;
+ int i;
+
+ while (bpf_get_next_key(fd, &key, &next_key) == 0) {
+ int found = 0;
+
+ for (i = 0; i < task_cnt; i++)
+ if (memcmp(&tasks[i], &next_key, SIZE) == 0)
+ found = 1;
+ if (!found)
+ memcpy(&tasks[task_cnt++], &next_key, SIZE);
+ key = next_key;
+ }
+
+ for (i = 0; i < task_cnt; i++) {
+ printf("\npid %d cmd %s uid %d\n",
+ (__u32) tasks[i].pid_tgid,
+ tasks[i].comm,
+ (__u32) tasks[i].uid_gid);
+ print_hist_for_pid(fd, &tasks[i]);
+ }
+
+}
+
static void int_exit(int sig)
{
print_hist(map_fd[1]);
--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
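
For readers following the patch above: both new integer helpers pack two 32-bit values into one u64, with tgid (or gid) in the upper half and pid (or uid) in the lower half. A minimal userspace sketch of that layout follows. The function names pack_hi_lo() and unpack_hi_lo() are hypothetical and appear nowhere in the patch; only the bit layout is taken from it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mirrors "(u64) task->tgid << 32 | task->pid". */
static uint64_t pack_hi_lo(uint32_t hi, uint32_t lo)
{
	return (uint64_t)hi << 32 | lo;
}

/* Hypothetical helper: split a packed value back into its two halves.
 * For pid_tgid, hi is the tgid and lo is the pid;
 * for uid_gid, hi is the gid and lo is the uid. */
static void unpack_hi_lo(uint64_t packed, uint32_t *hi, uint32_t *lo)
{
	assert(hi && lo);
	*hi = (uint32_t)(packed >> 32);
	*lo = (uint32_t)packed;
}
```

This is also why tracex2_user.c can print the pid and uid with a plain (__u32) cast: the cast keeps only the lower 32 bits.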

[PATCHSET block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support

Hello,

This patchset contains the following assorted updates for the cgroup
writeback support.

0001-writeback-do-foreign-inode-detection-iff-cgroup-writ.patch
0002-vfs-writeback-replace-FS_CGROUP_WRITEBACK-with-MS_CG.patch
0003-writeback-blkio-add-documentation-for-cgroup-writeba.patch

0001 fixes a bug where clearing the FS_CGROUP_WRITEBACK flag didn't fully
disable cgroup writeback support if the filesystem code uses
wbc_init_bio() and wbc_account_io().

0002 replaces FS_CGROUP_WRITEBACK with MS_CGROUPWB so that cgroup
writeback support can be enabled / disabled per superblock rather than
filesystem type.

0003 updates blkio documentation with information on cgroup writeback
support.

This patchset is on top of block/for-4.2/writeback and available in
the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-updates

diffstat follows. Thanks.

Documentation/cgroups/blkio-controller.txt | 83 +++++++++++++++++++++++++++--
fs/ext2/super.c | 4 -
fs/fs-writeback.c | 16 ++++-
fs/namespace.c | 2
include/linux/backing-dev.h | 2
include/linux/fs.h | 1
include/uapi/linux/fs.h | 1
7 files changed, 96 insertions(+), 13 deletions(-)

--
tejun

[PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support

Update Documentation/cgroups/blkio-controller.txt to reflect the
recently added cgroup writeback support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
---
Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--
1 file changed, 78 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index cd556b9..68b6a6a 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.

-What works
-==========
-- Currently only sync IO queues are support. All the buffered writes are
- still system wide and not per group. Hence we will not see service
- differentiation between buffered writes between groups.
+Writeback
+=========
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism. Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+On traditional cgroup hierarchies, relationships between different
+controllers cannot be established making it impossible for writeback
+to operate accounting for cgroup resource restrictions and all
+writeback IOs are attributed to the root cgroup.
+
+If both the blkio and memory controllers are used on the v2 hierarchy
+and the filesystem supports cgroup writeback, writeback operations
+correctly follow the resource restrictions imposed by both memory and
+blkio controllers.
+
+Writeback examines both system-wide and per-cgroup dirty memory status
+and enforces the more restrictive of the two. Also, writeback control
+parameters which are absolute values - vm.dirty_bytes and
+vm.dirty_background_bytes - are distributed across cgroups according
+to their current writeback bandwidth.
+
+There's a peculiarity stemming from the discrepancy in ownership
+granularity between memory controller and writeback. While memory
+controller tracks ownership per page, writeback operates on inode
+basis. cgroup writeback bridges the gap by tracking ownership by
+inode but migrating ownership if too many foreign pages, pages which
+don't match the current inode ownership, have been encountered while
+writing back the inode.
+
+This is a conscious design choice as writeback operations are
+inherently tied to inodes making strictly following page ownership
+complicated and inefficient. The only use case which suffers from
+this compromise is multiple cgroups concurrently dirtying disjoint
+regions of the same inode, which is an unlikely use case and decided
+to be unsupported. Note that as memory controller assigns page
+ownership on the first use and doesn't update it until the page is
+released, even if cgroup writeback strictly follows page ownership,
+multiple cgroups dirtying overlapping areas wouldn't work as expected.
+In general, write-sharing an inode across multiple cgroups is not well
+supported.
+
+Filesystem support for cgroup writeback
+---------------------------------------
+
+A filesystem can make writeback IOs cgroup-aware by updating
+address_space_operations->writepage() to annotate bio's using the
+following two functions.
+
+* wbc_init_bio(@wbc, @bio)
+
+ Should be called for each bio carrying writeback data and associates
+ the bio with the inode's owner cgroup. Can be called anytime
+ between bio allocation and submission.
+
+* wbc_account_io(@wbc, @page, @bytes)
+
+ Should be called for each data segment being written out. While
+ this function doesn't care exactly when it's called during the
+ writeback session, it's the easiest and most natural to call it as
+ data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting MS_CGROUPWB in ->s_flags. This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup. Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion. There is no one easy solution
+for the problem. Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
--
2.4.2
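
The ownership rule described in the documentation above (track ownership per inode, migrate it when too many foreign pages are seen) can be illustrated with a toy userspace model. This is a deliberately simplified sketch, not the kernel's algorithm: the real wbc_detach_inode() also keeps per-inode history and time-weighted averages, which are omitted here. All toy_* names and MAX_CGROUPS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CGROUPS 8	/* hypothetical bound for the toy model */

struct toy_inode {
	int owner;			/* cgroup id that currently owns the inode */
};

struct toy_wbc {
	struct toy_inode *inode;
	size_t bytes[MAX_CGROUPS];	/* bytes written back, per page-owner cgroup */
};

/* Record that "bytes" of data owned by cgroup "page_owner" were written. */
static void toy_account_io(struct toy_wbc *wbc, int page_owner, size_t bytes)
{
	assert(page_owner >= 0 && page_owner < MAX_CGROUPS);
	wbc->bytes[page_owner] += bytes;
}

/* End of a writeback round: give the inode to the cgroup that owned the
 * most written bytes. Ties with the current owner leave ownership unchanged. */
static void toy_detach_inode(struct toy_wbc *wbc)
{
	int best = wbc->inode->owner;

	for (int id = 0; id < MAX_CGROUPS; id++)
		if (wbc->bytes[id] > wbc->bytes[best])
			best = id;
	wbc->inode->owner = best;
}
```

With this simple majority rule, an inode owned by cgroup 1 that sees 4096 bytes from cgroup 1 and 8192 bytes from cgroup 2 in one round ends up owned by cgroup 2.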


[PATCH 1/3] writeback: do foreign inode detection iff cgroup writeback is enabled

Currently, even when a filesystem doesn't set the FS_CGROUP_WRITEBACK
flag, if the filesystem uses wbc_init_bio() and wbc_account_io(), the
foreign inode detection and migration logic still ends up activating
cgroup writeback which is unexpected. This patch ensures that the
foreign inode detection logic stays disabled when inode_cgwb_enabled()
is false by not associating writeback_control's with bdi_writeback's.

This also avoids unnecessary operations in wbc_init_bio(),
wbc_account_io() and wbc_detach_inode() for filesystems which don't
support cgroup writeback.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
fs/fs-writeback.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f60de54..f0520bc 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -513,6 +513,11 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
struct inode *inode)
{
+ if (!inode_cgwb_enabled(inode)) {
+ spin_unlock(&inode->i_lock);
+ return;
+ }
+
wbc->wb = inode_to_wb(inode);
wbc->inode = inode;

@@ -575,11 +580,16 @@ void wbc_detach_inode(struct writeback_control *wbc)
{
struct bdi_writeback *wb = wbc->wb;
struct inode *inode = wbc->inode;
- u16 history = inode->i_wb_frn_history;
- unsigned long avg_time = inode->i_wb_frn_avg_time;
- unsigned long max_bytes, max_time;
+ unsigned long avg_time, max_bytes, max_time;
+ u16 history;
int max_id;

+ if (!wb)
+ return;
+
+ history = inode->i_wb_frn_history;
+ avg_time = inode->i_wb_frn_avg_time;
+
/* pick the winner of this round */
if (wbc->wb_bytes >= wbc->wb_lcand_bytes &&
wbc->wb_bytes >= wbc->wb_tcand_bytes) {
--
2.4.2


[PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB

FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
cgroup writeback; however, different super_blocks of the same
file_system_type may or may not support cgroup writeback depending on
filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
kernel-internal super_block->s_flags MS_CGROUPWB. The concatenated
and abbreviated name is for consistency with other MS_* flags.

ext2_fill_super() is updated to assert MS_CGROUPWB.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
---
fs/ext2/super.c | 4 ++--
fs/namespace.c | 2 +-
include/linux/backing-dev.h | 2 +-
include/linux/fs.h | 1 -
include/uapi/linux/fs.h | 1 +
5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 549219d..472ed34 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -879,7 +879,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
if (!parse_options((char *) data, sb))
goto failed_mount;

- sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
+ sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | MS_CGROUPWB |
((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);

@@ -1543,7 +1543,7 @@ static struct file_system_type ext2_fs_type = {
.name = "ext2",
.mount = ext2_mount,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
+ .fs_flags = FS_REQUIRES_DEV,
};
MODULE_ALIAS_FS("ext2");

diff --git a/fs/namespace.c b/fs/namespace.c
index 1f4f9da..507b90b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2669,7 +2669,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,

flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
- MS_STRICTATIME);
+ MS_STRICTATIME | MS_CGROUPWB);

if (flags & MS_REMOUNT)
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index dfce808..1489131 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -260,7 +260,7 @@ static inline bool inode_cgwb_enabled(struct inode *inode)

return bdi_cap_account_dirty(bdi) &&
(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
- (inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
+ (inode->i_sb->s_flags & MS_CGROUPWB);
}

/**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5e1dcf..66e35dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1912,7 +1912,6 @@ struct file_system_type {
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_USERNS_DEV_MOUNT 16 /* A userns mount does not imply MNT_NODEV */
-#define FS_CGROUP_WRITEBACK 32 /* Supports cgroup-aware writeback */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9b964a5..60316e7 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -93,6 +93,7 @@ struct inodes_stat_t {
#define MS_LAZYTIME (1<<25) /* Update the on-disk [acm]times lazily */

/* These sb flags are internal to the kernel */
+#define MS_CGROUPWB (1<<27) /* cgroup-aware writeback enabled */
#define MS_NOSEC (1<<28)
#define MS_BORN (1<<29)
#define MS_ACTIVE (1<<30)
--
2.4.2
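
To show why a per-superblock flag is more flexible than a per-filesystem-type flag, here is a small userspace sketch. It assumes a hypothetical filesystem that turns cgroup writeback off when mounted in journaled data mode (the incompatible feature the documentation patch gives as an example). The toy_* names are made up for illustration. Only the MS_CGROUPWB value comes from the patch, and the bdi capability checks that the real inode_cgwb_enabled() performs are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MS_CGROUPWB (1UL << 27)	/* value taken from the patch */

struct toy_super_block {
	unsigned long s_flags;
};

/* Hypothetical fill_super step: enable cgroup writeback for this
 * superblock unless the mount uses journaled data mode. */
static void toy_fill_super(struct toy_super_block *sb, bool journal_data)
{
	assert(sb);
	if (journal_data)
		sb->s_flags &= ~MS_CGROUPWB;
	else
		sb->s_flags |= MS_CGROUPWB;
}

/* Models only the superblock part of the inode_cgwb_enabled() test. */
static bool toy_sb_cgwb_enabled(const struct toy_super_block *sb)
{
	return (sb->s_flags & MS_CGROUPWB) != 0;
}
```

Two mounts of the same filesystem type can now give different answers, which a flag stored in file_system_type->fs_flags could not express.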


Re: [PATCH 1/1] PCI: X-Gene: Disable Configuration Request Retry Status for X-Gene v1 PCIe

Hi Duc,

On Thu, Jun 11, 2015 at 01:08:14PM -0700, Duc Dang wrote:
> X-Gene v1 PCIe controller has a bug in Configuration Request Retry
> Status (CRS) logic:
> When CPU tries to read Vendor ID and Device ID of not-existed
> remote device, the controller returns 0xFFFF0001 instead of
> 0xFFFFFFFF; this will add significant delay in boot time as
> pci_bus_read_dev_vendor_id will wait for 60 seconds before
> giving up.

OK, help me understand how this works. I think this is related to the
problem I reported where if the slot is empty, "lspci" doesn't show
anything, not even the Root Port leading to the slot.

I think this happens because when we try to read the Root Port's config
space,

- the slot below the Root Port is empty
- the Root Port's link is down
- xgene_pcie_map_bus() returns NULL because !port->link_up
- pci_generic_config_read32() returns PCIBIOS_DEVICE_NOT_FOUND

so it looks like the Root Port itself doesn't exist.

I proposed to change xgene_pcie_map_bus() so it didn't check whether the
link was up. That change makes reads of the Root Port's config space work.

After we learn the Root Port exists, the PCI core enumerates devices below
the Root Port, e.g., on bus 01. X-Gene advertises that it supports CRS, so
we enable it. When we try to read the Vendor ID of 01:00.0, there's no
response from the device (because the slot is empty), and the Root Complex
should complete the read by fabricating data of all ones, i.e., 0xFFFFFFFF.
But apparently X-Gene supplies 0xFFFF0001 instead, which means "there's a
device here, but it's not ready yet," so the PCI core retries the read for
60 seconds before timing out.
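
The three outcomes of that Vendor ID read can be sketched in plain userspace C. This is only a model of the behaviour described above, not the PCI core's code: the function names and the attempt counter (standing in for the 60-second timeout) are hypothetical. The check on the low 16 bits assumes that Vendor ID 0x0001 is the value reserved to signal CRS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum vid_status { VID_ABSENT, VID_CRS_RETRY, VID_PRESENT };

/* Interpret a 32-bit read of the Vendor ID / Device ID register. */
static enum vid_status classify_vendor_read(uint32_t val)
{
	if (val == 0xFFFFFFFFu)
		return VID_ABSENT;	/* all ones: nothing answered */
	if ((val & 0xFFFFu) == 0x0001u)
		return VID_CRS_RETRY;	/* "device here, but not ready yet" */
	return VID_PRESENT;
}

typedef uint32_t (*cfg_read_fn)(void);

/* Fake config reads for an empty slot: expected vs. X-Gene v1 behaviour. */
static uint32_t read_empty_slot_ok(void)    { return 0xFFFFFFFFu; }
static uint32_t read_empty_slot_xgene(void) { return 0xFFFF0001u; }

/* Returns true if a device answered; *attempts reports the reads issued. */
static bool probe_slot(cfg_read_fn read_cfg, int max_attempts, int *attempts)
{
	assert(read_cfg && attempts && max_attempts > 0);
	for (*attempts = 1; ; (*attempts)++) {
		enum vid_status st = classify_vendor_read(read_cfg());

		if (st != VID_CRS_RETRY)
			return st == VID_PRESENT;
		if (*attempts == max_attempts)
			return false;	/* gave up, like the 60 s timeout */
	}
}
```

With read_empty_slot_ok the probe finishes after one read. With read_empty_slot_xgene it burns through every allowed attempt before concluding the slot is empty, which is the boot delay being discussed.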

This patch is basically a quirk that keeps X-Gene from advertising CRS
support, so the PCI core won't enable CRS. In the example above, I guess
that means the Root Complex will supply 0xFFFFFFFF and the core will see
that the slot is empty.

But this patch leaves the "!port->link_up" test in xgene_pcie_map_bus().
Doesn't that mean the core will still not discover the Root Port when the
slot is empty?

It seems to me that you would want both the xgene_pcie_map_bus() change and
this patch. The first would fix the problem that we don't enumerate Root
Ports leading to empty slots, and the second would fix the problem that we
enable CRS and time out when enumerating below those Root Ports.

One more question below:

> So for X-Gene v1 PCIe controllers, disable CRS capability
> advertisement by clearing CRS Software Visibility bit before
> returning the Root Capability value to the callers. This is done
> by implementing X-Gene PCIe specific xgene_pcie_config_read32 for
> CFG read accesses to replace the generic default pci_generic_config_read32
> function.
>
> Signed-off-by: Duc Dang <dhdang@apm.com>
> ---
> drivers/pci/host/pci-xgene.c | 48 +++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 47 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/host/pci-xgene.c b/drivers/pci/host/pci-xgene.c
> index ee082c0..741a253 100644
> --- a/drivers/pci/host/pci-xgene.c
> +++ b/drivers/pci/host/pci-xgene.c
> @@ -59,6 +59,12 @@
> #define SZ_1T (SZ_1G*1024ULL)
> #define PIPE_PHY_RATE_RD(src) ((0xc000 & (u32)(src)) >> 0xe)
>
> +#define ROOT_CAP_AND_CTRL 0x5C
> +
> +/* PCIe IP version */
> +#define XGENE_PCIE_IP_VER_UNKN 0
> +#define XGENE_PCIE_IP_VER_1 1
> +
> struct xgene_pcie_port {
> struct device_node *node;
> struct device *dev;
> @@ -67,6 +73,7 @@ struct xgene_pcie_port {
> void __iomem *cfg_base;
> unsigned long cfg_addr;
> bool link_up;
> + u32 version;
> };
>
> static inline u32 pcie_bar_low_val(u32 addr, u32 flags)
> @@ -140,9 +147,44 @@ static void __iomem *xgene_pcie_map_bus(struct pci_bus *bus, unsigned int devfn,
> return xgene_pcie_get_cfg_base(bus) + offset;
> }
>
> +int xgene_pcie_config_read32(struct pci_bus *bus, unsigned int devfn,
> + int where, int size, u32 *val)
> +{
> + void __iomem *addr;
> + struct xgene_pcie_port *port = bus->sysdata;
> +
> + addr = bus->ops->map_bus(bus, devfn, where & ~0x3);
> + if (!addr) {
> + *val = ~0;
> + return PCIBIOS_DEVICE_NOT_FOUND;
> + }
> +
> + *val = readl(addr);

Can't you just call pci_generic_config_read32() directly instead of
duplicating its code here?

> + /*
> + * X-Gene v1 PCIe controller has a bug in Configuration Request
> + * Retry Status (CRS) logic:
> + * When CPU tries to read Vendor ID and Device ID of not-existed
> + * remote device, the controller returns 0xFFFF0001 instead of
> + * 0xFFFFFFFF; this will add significant delay in boot time as
> + * pci_bus_read_dev_vendor_id will wait for 60 seconds before
> + * giving up.
> + * So for X-Gene v1 PCIe controllers, disable CRS capability
> + * advertisement by clearing CRS Software Visibility bit before
> + * returning the Root Capability value to the callers.
> + */
> + if (pci_is_root_bus(bus) && (port->version == XGENE_PCIE_IP_VER_1) &&
> + ((where & ~0x3) == ROOT_CAP_AND_CTRL))
> + *val &= ~(PCI_EXP_RTCAP_CRSVIS << 16);
> +
> + if (size <= 2)
> + *val = (*val >> (8 * (where & 3))) & ((1 << (size * 8)) - 1);
> +
> + return PCIBIOS_SUCCESSFUL;
> +}
> +
> static struct pci_ops xgene_pcie_ops = {
> .map_bus = xgene_pcie_map_bus,
> - .read = pci_generic_config_read32,
> + .read = xgene_pcie_config_read32,
> .write = pci_generic_config_write32,
> };
>
> @@ -483,6 +525,10 @@ static int xgene_pcie_probe_bridge(struct platform_device *pdev)
> port->node = of_node_get(pdev->dev.of_node);
> port->dev = &pdev->dev;
>
> + port->version = XGENE_PCIE_IP_VER_UNKN;
> + if (of_device_is_compatible(port->node, "apm,xgene-pcie"))
> + port->version = XGENE_PCIE_IP_VER_1;
> +
> ret = xgene_pcie_map_reg(port, pdev);
> if (ret)
> return ret;
> --
> 1.9.1
>
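
For reference, the two bit manipulations in the quoted xgene_pcie_config_read32() can be tried out in userspace. The sketch below splits them into two hypothetical helpers, hide_crs_visibility() and extract_config_bytes(). ROOT_CAP_AND_CTRL comes from the quoted patch. PCI_EXP_RTCAP_CRSVIS is assumed to be 0x0001, and the 16-bit shift assumes Root Capabilities occupies the upper half of the dword at that offset, as the patch itself implies.

```c
#include <assert.h>
#include <stdint.h>

#define ROOT_CAP_AND_CTRL    0x5C	/* from the quoted patch */
#define PCI_EXP_RTCAP_CRSVIS 0x0001	/* assumed value of the CRS Software Visibility bit */

/* Clear the CRS Software Visibility bit when the access hits the
 * dword holding Root Control (low half) and Root Capabilities (high half). */
static uint32_t hide_crs_visibility(uint32_t dword, int where)
{
	if ((where & ~0x3) == ROOT_CAP_AND_CTRL)
		dword &= ~((uint32_t)PCI_EXP_RTCAP_CRSVIS << 16);
	return dword;
}

/* Pick the 1- or 2-byte field at "where" out of an aligned 32-bit read. */
static uint32_t extract_config_bytes(uint32_t dword, int where, int size)
{
	assert(size == 1 || size == 2 || size == 4);
	if (size <= 2)
		dword = (dword >> (8 * (where & 3))) & ((1u << (size * 8)) - 1);
	return dword;
}
```

A 2-byte read at offset 0x5E (the assumed Root Capabilities location) of a raw dword 0x00010008 yields 0x0001 without the quirk and 0x0000 once hide_crs_visibility() has been applied, so the PCI core would not see CRS support advertised.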